[Python Regular Expressions] Matching XML Tags in Strings
Suppose we have the following requirement. Given data such as:
0-0-0 0:0:0 #### the 68th annual golden globe awards #### the king s speech earns 7 nominations #### <LOCATION>LOS ANGELES</LOCATION><ORGANIZATION>Dec Xinhua Kings Speech</ORGANIZATIO
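The task is to process the data line by line: replace the spaces inside each <X>...</X> tag pair with underscores (_) and convert the form, e.g. <X>A B C</X> becomes A_B_C/X.
Regex quantifiers are greedy by default, i.e. they match as far along the string as possible, which makes this awkward. Simply adding ? to make a quantifier non-greedy is not enough either: it shortens the match but does not guarantee that the closing tag belongs to the opening tag. So we give the already-matched tag name a group name and refer back to it later in the pattern.
In Python, the final regex looks like this: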
LABEL_PATTERN = re.compile(r'(<(?P<label>\S+)>.+?</(?P=label)>)')
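The difference between greedy, non-greedy, and named-backreference matching can be sketched on a couple of short strings. The sample strings and variable names below are made up for illustration and are not from the original data.

```python
import re

# Hypothetical sample strings, for illustration only.
sample = "<LOCATION>LOS ANGELES</LOCATION> and <PERSON>Tom Hooper</PERSON>"
mismatched = "<A>x y</B> z</A>"

# Greedy: .+ runs on to the last closing tag of the line,
# so both tags are swallowed into a single match.
greedy = re.findall(r"<\S+>.+</\S+>", sample)

# Non-greedy: .+? stops at the first closing tag it can find.
lazy = re.findall(r"<\S+>.+?</\S+>", sample)

# But non-greedy alone accepts ANY closing tag, even a mismatched one.
lazy_mismatch = re.findall(r"<\S+>.+?</\S+>", mismatched)

# Named group + backreference: the closing tag must repeat the opening name.
paired = [m.group(0)
          for m in re.finditer(r"<(?P<label>\S+)>.+?</(?P=label)>", mismatched)]

print(len(greedy), len(lazy))  # 1 2
print(lazy_mismatch)           # ['<A>x y</B>']
print(paired)                  # ['<A>x y</B> z</A>']
```

Here (?P<label>...) names the captured tag name, and (?P=label) requires the same text to appear again in the closing tag.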
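Next, we need to find the corresponding substrings in the original string and keep track of them. After that, we precompute the form each one should be replaced with, which takes just one more regex: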
LABEL_CONTENT_PATTERN = re.compile(r'<(?P<label>\S+)>(.*?)</(?P=label)>')
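To see why two patterns are used, it helps to look at what findall returns for each of them. The following is a small sketch; the sample line is a shortened, made-up string built from tags that appear in the data.

```python
import re

LABEL_PATTERN = re.compile(r'(<(?P<label>\S+)>.+?</(?P=label)>)')
LABEL_CONTENT_PATTERN = re.compile(r'<(?P<label>\S+)>(.*?)</(?P=label)>')

# Hypothetical, shortened input line for illustration.
line = "#### <LOCATION>LOS ANGELES</LOCATION> <PERSON>Firth</PERSON>"

# With several groups, findall returns one tuple per match.
# LABEL_PATTERN yields (whole tagged substring, label).
whole = LABEL_PATTERN.findall(line)

# LABEL_CONTENT_PATTERN yields (label, inner text).
parts = LABEL_CONTENT_PATTERN.findall(line)

print(whole)
# [('<LOCATION>LOS ANGELES</LOCATION>', 'LOCATION'), ('<PERSON>Firth</PERSON>', 'PERSON')]
print(parts)
# [('LOCATION', 'LOS ANGELES'), ('PERSON', 'Firth')]
```

The first element of each LABEL_PATTERN tuple is the text to be replaced, while the LABEL_CONTENT_PATTERN tuple supplies the pieces needed to build the replacement.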
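Map over the whole collection of strings and match each string, then zip the two sets of match results together. This gives tuples that pair each original tagged substring with its replacement, roughly like this: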
('<LOCATION>LOS ANGELES</LOCATION>', 'LOS_ANGELES/LOCATION')
('<ORGANIZATION>Dec Xinhua Kings Speech</ORGANIZATION>', 'Dec_Xinhua_Kings_Speech/ORGANIZATION')
('<ORGANIZATION>Social Network Black Swan Fighter Inception Kings Speech</ORGANIZATION>', 'Social_Network_Black_Swan_Fighter_Inception_Kings_Speech/ORGANIZATION')
('<PERSON>Firth</PERSON>', 'Firth/PERSON')
('<PERSON>Helena Bonham</PERSON>', 'Helena_Bonham/PERSON')
('<PERSON>Geoffrey Rush</PERSON>', 'Geoffrey_Rush/PERSON')
('<PERSON>Tom Hooper</PERSON>', 'Tom_Hooper/PERSON')
('<PERSON>David Seidler</PERSON>', 'David_Seidler/PERSON')
('<ORGANIZATION>Alexandre Desplat Social Network Fighter</ORGANIZATION>', 'Alexandre_Desplat_Social_Network_Fighter/ORGANIZATION')
('<ORGANIZATION>Alice Wonderland Burlesque Kids Right Red Tourist</ORGANIZATION>', 'Alice_Wonderland_Burlesque_Kids_Right_Red_Tourist/ORGANIZATION')
('<ORGANIZATION>Firth Kings Speech James Franco Hours Ryan Gosling Blue Valentine Mark Wahlberg Fighter Jesse Eisenberg Social Network</ORGANIZATION>', 'Firth_Kings_Speech_James_Franco_Hour
('<PERSON>Halle Berry Frankie Alice Nicole Kidman</PERSON>', 'Halle_Berry_Frankie_Alice_Nicole_Kidman/PERSON')
('<PERSON>Jennifer Lawrence</PERSON>', 'Jennifer_Lawrence/PERSON')
('<ORGANIZATION>Winters Bone Natalie Portman Black Swan Michelle Williams Blue Valentine TV</ORGANIZATION>', 'Winters_Bone_Natalie_Portman_Black_Swan_Michelle_Williams_Blue_Valentine_TV/ORG
('<PERSON>Grandin</PERSON>', 'Grandin/PERSON')
('<LOCATION>BEIJING</LOCATION>', 'BEIJING/LOCATION')
('<ORGANIZATION>Xinhua Sanlu Group</ORGANIZATION>', 'Xinhua_Sanlu_Group/ORGANIZATION')
('<LOCATION>Gansu</LOCATION>', 'Gansu/LOCATION')
('<ORGANIZATION>Sanlu</ORGANIZATION>', 'Sanlu/ORGANIZATION')
The processing code is as follows:
import os

def read_file(path):
    # Return the file's lines, or an empty list if the path does not exist.
    if not os.path.exists(path):
        print("path : '" + path + "' not found.")
        return []
    # The with-statement closes the file, so no explicit close() is needed.
    with open(path, 'r') as fp:
        content = fp.read()
    return content.split('\n')

def get_label(each):
    # Pair each tagged substring with its replacement,
    # e.g. ('<X>A B C</X>', 'A_B_C/X').
    pair = zip(LABEL_PATTERN.findall(each),
               map(lambda x: x[1].replace(' ', '_') + '/' + x[0],
                   LABEL_CONTENT_PATTERN.findall(each)))
    return [(x[0][0], x[1]) for x in pair]

src = read_file(FILE_PATH)
pattern = [get_label(each) for each in src]
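After that, a simple final pass does the job:
for i in range(len(src)):
    for pat in pattern[i]:
        # pat[0] is literal text, so plain str.replace avoids regex metacharacter issues.
        src[i] = src[i].replace(pat[0], pat[1])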
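As an alternative to collecting (original, replacement) pairs first, the same transformation can be sketched as a single pass with re.sub and a replacement function. This is a different technique from the two-pattern approach above, shown only as a minimal sketch; the names TAG_PATTERN, to_slash_form, and convert_line are hypothetical and introduced for illustration.

```python
import re

# One pattern with named groups for both the label and the inner text.
TAG_PATTERN = re.compile(r'<(?P<label>\S+)>(?P<content>.*?)</(?P=label)>')

def to_slash_form(match):
    # Called once per match; returns the text that replaces the whole tag.
    return match.group('content').replace(' ', '_') + '/' + match.group('label')

def convert_line(line):
    # re.sub accepts a function as the replacement argument.
    return TAG_PATTERN.sub(to_slash_form, line)

print(convert_line("in <LOCATION>LOS ANGELES</LOCATION> with <PERSON>Tom Hooper</PERSON>"))
# in LOS_ANGELES/LOCATION with Tom_Hooper/PERSON
```

Because the replacement is computed from the match object itself, there is no need to search the line a second time for each tagged substring.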
Full code:
# -*- coding: utf-8 -*-
import re
import os

# FILE_PATH = '/home/kirai/workspace/sina_news_process/disworded_sina_news_'
FILE_PATH = '/home/kirai/workspace/sina_news_'
LABEL_PATTERN = re.compile(r'(<(?P<label>\S+)>.+?</(?P=label)>)')
LABEL_CONTENT_PATTERN = re.compile(r'<(?P<label>\S+)>(.*?)</(?P=label)>')

def read_file(path):
    # Return the file's lines, or an empty list if the path does not exist.
    if not os.path.exists(path):
        print("path : '" + path + "' not found.")
        return []
    # The with-statement closes the file, so no explicit close() is needed.
    with open(path, 'r') as fp:
        content = fp.read()
    return content.split('\n')

def get_label(each):
    # Pair each tagged substring with its replacement,
    # e.g. ('<X>A B C</X>', 'A_B_C/X').
    pair = zip(LABEL_PATTERN.findall(each),
               map(lambda x: x[1].replace(' ', '_') + '/' + x[0],
                   LABEL_CONTENT_PATTERN.findall(each)))
    return [(x[0][0], x[1]) for x in pair]

src = read_file(FILE_PATH)
pattern = [get_label(each) for each in src]

for i in range(len(src)):
    for pat in pattern[i]:
        # pat[0] is literal text, so plain str.replace avoids regex metacharacter issues.
        src[i] = src[i].replace(pat[0], pat[1])