[Python Regular Expressions] Matching XML Tags in a String
  Suppose we have a requirement like this. Given data such as the following:
0-0-0 0:0:0 #### the 68th annual golden globe awards ####  the king s speech earns 7 nominations  ####  <LOCATION>LOS ANGELES</LOCATION><ORGANIZATION>Dec Xinhua Kings Speech</ORGANIZATIO
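  The task is to process the data line by line: replace the spaces inside each <>...</> tag pair with underscores (_) and change the form of the data. For example, <X>A B C</X> must become A_B_C/X.
  Regular expression quantifiers are greedy by default, meaning they match as far to the right as possible, which makes this awkward. Simply adding ? to make a quantifier non-greedy is not enough either, because it does not guarantee that the closing tag found is the one that belongs to the opening tag. The fix is to give the previously matched tag name a name (a named group) and refer back to it in the closing tag.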
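To make the difference concrete, here is a minimal sketch comparing a greedy pattern, a non-greedy pattern, and a named group with a backreference. The sample strings and the variable names (`line`, `mismatched`, `greedy`, `lazy`, `named`) are made up for illustration and are not part of the original data.

```python
import re

line = '<LOCATION>LOS ANGELES</LOCATION><PERSON>Tom Hooper</PERSON>'

# Greedy: .+ runs to the last closing tag, swallowing both elements.
greedy = re.findall(r'<\S+?>.+</\S+?>', line)

# Non-greedy: .+? stops at the first closing tag, but nothing forces that
# closing tag to carry the same name as the opening one.
mismatched = '<A>x</B> y</A>'
lazy = re.search(r'<\S+?>.+?</\S+?>', mismatched).group(0)

# Named group + backreference: the closing tag must repeat the opening name.
named = re.search(r'<(?P<label>\S+)>.+?</(?P=label)>', mismatched).group(0)

print(greedy)  # one match covering the whole line
print(lazy)    # <A>x</B>
print(named)   # <A>x</B> y</A>
```

Only the last pattern ties the closing tag to the opening tag, which is what the (?P<label>...) and (?P=label) pair is for.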
In Python, the final regular expression looks like this:
LABEL_PATTERN = re.compile(r'(<(?P<label>\S+)>.+?</(?P=label)>)')
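  Next, we find the matching substrings in the original string and record them, then precompute the form each one should be replaced with. One more regular expression is enough for that:
LABEL_CONTENT_PATTERN = re.compile(r'<(?P<label>\S+)>(.*?)</(?P=label)>')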
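Because both patterns contain more than one group, `findall` returns a tuple per match rather than a plain string. The sketch below shows what each pattern yields on a short, made-up sample line (`sample`, `whole`, and `parts` are illustrative names only).

```python
import re

LABEL_PATTERN = re.compile(r'(<(?P<label>\S+)>.+?</(?P=label)>)')
LABEL_CONTENT_PATTERN = re.compile(r'<(?P<label>\S+)>(.*?)</(?P=label)>')

sample = 'in <LOCATION>LOS ANGELES</LOCATION> by <PERSON>Tom Hooper</PERSON>'

# Groups: (whole tagged substring, label)
whole = LABEL_PATTERN.findall(sample)

# Groups: (label, content between the tags)
parts = LABEL_CONTENT_PATTERN.findall(sample)

print(whole)
print(parts)
```

The first pattern supplies the text to search for, and the second supplies the pieces needed to build the replacement.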
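  Map over the whole collection of strings, match each string with both patterns, and zip the two sets of results together. This yields one (original, replacement) tuple per tag, roughly like this: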
('<LOCATION>LOS ANGELES</LOCATION>', 'LOS_ANGELES/LOCATION')
('<ORGANIZATION>Dec Xinhua Kings Speech</ORGANIZATION>', 'Dec_Xinhua_Kings_Speech/ORGANIZATION')
('<ORGANIZATION>Social Network Black Swan Fighter Inception Kings Speech</ORGANIZATION>', 'Social_Network_Black_Swan_Fighter_Inception_Kings_Speech/ORGANIZATION')
('<PERSON>Firth</PERSON>', 'Firth/PERSON')
('<PERSON>Helena Bonham</PERSON>', 'Helena_Bonham/PERSON')
('<PERSON>Geoffrey Rush</PERSON>', 'Geoffrey_Rush/PERSON')
('<PERSON>Tom Hooper</PERSON>', 'Tom_Hooper/PERSON')
('<PERSON>David Seidler</PERSON>', 'David_Seidler/PERSON')
('<ORGANIZATION>Alexandre Desplat Social Network Fighter</ORGANIZATION>', 'Alexandre_Desplat_Social_Network_Fighter/ORGANIZATION')
('<ORGANIZATION>Alice Wonderland Burlesque Kids Right Red Tourist</ORGANIZATION>', 'Alice_Wonderland_Burlesque_Kids_Right_Red_Tourist/ORGANIZATION')
('<ORGANIZATION>Firth Kings Speech James Franco Hours Ryan Gosling Blue Valentine Mark Wahlberg Fighter Jesse Eisenberg Social Network</ORGANIZATION>', 'Firth_Kings_Speech_James_Franco_Hour
('<PERSON>Halle Berry Frankie Alice Nicole Kidman</PERSON>', 'Halle_Berry_Frankie_Alice_Nicole_Kidman/PERSON')
('<PERSON>Jennifer Lawrence</PERSON>', 'Jennifer_Lawrence/PERSON')
('<ORGANIZATION>Winters Bone Natalie Portman Black Swan Michelle Williams Blue Valentine TV</ORGANIZATION>', 'Winters_Bone_Natalie_Portman_Black_Swan_Michelle_Williams_Blue_Valentine_TV/ORG
('<PERSON>Grandin</PERSON>', 'Grandin/PERSON')
('<LOCATION>BEIJING</LOCATION>', 'BEIJING/LOCATION')
('<ORGANIZATION>Xinhua Sanlu Group</ORGANIZATION>', 'Xinhua_Sanlu_Group/ORGANIZATION')
('<LOCATION>Gansu</LOCATION>', 'Gansu/LOCATION')
('<ORGANIZATION>Sanlu</ORGANIZATION>', 'Sanlu/ORGANIZATION')
  The processing code is as follows:
def read_file(path):
    # Return the file's lines, or an empty list if the path does not exist.
    if not os.path.exists(path):
        print("path : '" + path + "' not found.")
        return []
    # The with block closes the file, so no explicit close() is needed.
    with open(path, 'r') as fp:
        content = fp.read()
    return content.split('\n')

def get_label(each):
    # LABEL_PATTERN.findall yields (whole_match, label) tuples;
    # LABEL_CONTENT_PATTERN.findall yields (label, content) tuples.
    pair = zip(LABEL_PATTERN.findall(each),
               map(lambda x: x[1].replace(' ', '_') + '/' + x[0],
                   LABEL_CONTENT_PATTERN.findall(each)))
    # Keep (original substring, replacement) for each tag.
    return [(x[0][0], x[1]) for x in pair]

src = read_file(FILE_PATH)
pattern = [get_label(line) for line in src]
  After that, a simple pass applies the replacements:
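for i in range(len(src)):
    for pat in pattern[i]:
        # Replace the matched text literally; passing it to re.sub as a
        # pattern would misread characters such as ( or + as regex syntax.
        src[i] = src[i].replace(pat[0], pat[1])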
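One caveat when substituting matched text back into a line: if that text is handed to `re.sub` as the pattern, any regex metacharacters in it are interpreted rather than matched literally. The sample data above contains only plain words, so the issue does not show up there. The following sketch uses a made-up line (not from the original data) to show the difference.

```python
import re

# Hypothetical line whose tag content contains regex metacharacters.
line = 'see <ORGANIZATION>A+ Corp (US)</ORGANIZATION> today'
original = '<ORGANIZATION>A+ Corp (US)</ORGANIZATION>'
replacement = 'A+_Corp_(US)/ORGANIZATION'

# Used unescaped, "+" and "(...)" are parsed as regex syntax, so the
# pattern no longer matches the literal text and nothing is replaced.
unescaped = re.sub(original, replacement, line)

# re.escape turns the matched text into a literal pattern.
escaped = re.sub(re.escape(original), replacement, line)

# Plain str.replace avoids regex interpretation altogether.
literal = line.replace(original, replacement)

print(unescaped)
print(escaped)
print(literal)
```

Either `re.escape` or `str.replace` keeps the substitution literal. `str.replace` is the simpler choice when no pattern matching is needed at this step.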
  Full code:
# -*- coding: utf-8 -*-
import re
import os

# FILE_PATH = '/home/kirai/workspace/sina_news_process/disworded_sina_news_'
FILE_PATH = '/home/kirai/workspace/sina_news_'
LABEL_PATTERN = re.compile(r'(<(?P<label>\S+)>.+?</(?P=label)>)')
LABEL_CONTENT_PATTERN = re.compile(r'<(?P<label>\S+)>(.*?)</(?P=label)>')

def read_file(path):
    # Return the file's lines, or an empty list if the path does not exist.
    if not os.path.exists(path):
        print("path : '" + path + "' not found.")
        return []
    # The with block closes the file, so no explicit close() is needed.
    with open(path, 'r') as fp:
        content = fp.read()
    return content.split('\n')

def get_label(each):
    # LABEL_PATTERN.findall yields (whole_match, label) tuples;
    # LABEL_CONTENT_PATTERN.findall yields (label, content) tuples.
    pair = zip(LABEL_PATTERN.findall(each),
               map(lambda x: x[1].replace(' ', '_') + '/' + x[0],
                   LABEL_CONTENT_PATTERN.findall(each)))
    # Keep (original substring, replacement) for each tag.
    return [(x[0][0], x[1]) for x in pair]

src = read_file(FILE_PATH)
pattern = [get_label(line) for line in src]

for i in range(len(src)):
    for pat in pattern[i]:
        # Replace the matched text literally rather than as a regex pattern.
        src[i] = src[i].replace(pat[0], pat[1])
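As an alternative to collecting (original, replacement) pairs and substituting afterwards, the same transformation can be sketched in a single pass by giving `re.sub` a replacement function. This is not the approach used above, only a possible variant. `TAG_PATTERN`, `rewrite_tag`, and `convert_line` are names introduced here for illustration.

```python
import re

# Same idea as LABEL_CONTENT_PATTERN, with the content group named as well.
TAG_PATTERN = re.compile(r'<(?P<label>\S+)>(?P<content>.*?)</(?P=label)>')

def rewrite_tag(match):
    # Build 'A_B_C/X' from a match on '<X>A B C</X>'.
    return match.group('content').replace(' ', '_') + '/' + match.group('label')

def convert_line(line):
    # Replace every tagged span in one pass over the line.
    return TAG_PATTERN.sub(rewrite_tag, line)

result = convert_line('<LOCATION>LOS ANGELES</LOCATION> hosts <PERSON>Firth</PERSON>')
print(result)  # LOS_ANGELES/LOCATION hosts Firth/PERSON
```

Because the replacement is computed from the match object, the matched text never has to be reused as a pattern, so no escaping is needed.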
