Converting HTML to plain text in Python
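I went through a few blog posts and saw that some authors had written their own function to convert HTML to text. My project was on a tight schedule, though, so I didn't feel like working one out myself. Here I suggest trying the clean_html() function from the nltk module. Usage: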
import nltk
html="""
<!DOCTYPE html>
<html>
<head>
<title>This is a title</title>
</head>
<body>
<h1>This is a simple HTML page</h1>
<p>Hello World!</p>
</body>
</html>
"""
print(nltk.clean_html(html))
Think that's the end of it? No. You will be greeted by a bright red error message:
"To remove HTML markup, use BeautifulSoup's get_text() function"
NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function
To find the cause, open the util.py file mentioned in the traceback. It contains this:
def clean_html(html):
    raise NotImplementedError("To remove HTML markup, use BeautifulSoup's get_text() function")
In older NLTK releases, by contrast, clean_html() had a real implementation that looked like this:
def clean_html(html):  # NLTK's old clean_html(): converts HTML to plain text
    # First we remove inline JavaScript/CSS:
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    # Then we remove html comments. This has to be done before removing regular
    # tags since comments can contain '>' characters.
    cleaned = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", cleaned)
    # Next we can remove the remaining tags:
    cleaned = re.sub(r"(?s)<.*?>", " ", cleaned)
    # Finally, we deal with whitespace
    cleaned = re.sub(r"&nbsp;", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    return cleaned.strip()
In short, there is no need to import nltk at all. Just copy the implementation of clean_html() into your own project and use it directly, like this:
import re

def clean_html(html):  # NLTK's old clean_html(): converts HTML to plain text
    # First we remove inline JavaScript/CSS:
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    # Then we remove html comments. This has to be done before removing regular
    # tags since comments can contain '>' characters.
    cleaned = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", cleaned)
    # Next we can remove the remaining tags:
    cleaned = re.sub(r"(?s)<.*?>", " ", cleaned)
    # Finally, we deal with whitespace
    cleaned = re.sub(r"&nbsp;", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    return cleaned.strip()

html="""
<!DOCTYPE html>
<html>
<head>
<title>This is a title</title>
</head>
<body>
<h1>This is a simple HTML page</h1>
<p>Hello World!</p>
</body>
</html>
"""
print(clean_html(html))
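To see what each step of the regex-based function does, here is a minimal sketch run against a made-up sample that contains a style block, a script and a comment. The sample string and the normalize_whitespace() helper are my own additions for illustration, not part of NLTK. The helper is there because clean_html() only collapses pairs of spaces and leaves newlines in place, so the raw result can still look ragged.

```python
import re


def clean_html(html):
    # Same steps as the function above: drop script/style blocks,
    # then comments, then any remaining tags, then tidy up spaces.
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    cleaned = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", cleaned)
    cleaned = re.sub(r"(?s)<.*?>", " ", cleaned)
    cleaned = re.sub(r"&nbsp;", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    return cleaned.strip()


def normalize_whitespace(text):
    # Hypothetical helper (not from NLTK): collapse every run of
    # whitespace, including newlines, into a single space.
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample for illustration only.
sample = """
<html>
<head>
<title>Demo</title>
<style>p { color: red; }</style>
</head>
<body>
<!-- a comment with a > inside -->
<p>Hello&nbsp;World!</p>
<script>alert("hi");</script>
</body>
</html>
"""

raw = clean_html(sample)
print(repr(raw))                  # tags are gone, but newlines remain
print(normalize_whitespace(raw))  # Demo Hello World!
```

Note that only &nbsp; is handled. Other entities such as &amp; would pass through unchanged, so html.unescape() from the standard library may be worth applying afterwards if your input contains them.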
I wrote all of this mainly to retrace my own thought process. It is just a record of what I learned, and I hope it helps other learners too.
Another option is a small custom parser built on html.parser from the standard library:
from html.parser import HTMLParser
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        # Collapse runs of whitespace inside each text node
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        # Paragraphs and line breaks become newlines in the output
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()

def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except Exception:
        # On a parse failure, log the traceback and return the input unchanged
        print_exc(file=stderr)
        return text

print(dehtml(html))  # just call dehtml() directly
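However, I don't recommend this custom function, because it does not strip style code embedded in the HTML, which is a minor flaw. If that doesn't matter much for your use case, you can still use it and the impact is small.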
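If the leftover style code does bother you, one possible fix is to have the parser ignore everything between the opening and closing script/style tags. The sketch below is my own variation, not the original author's code; the names TextExtractor and html_to_text are made up for illustration. It relies only on html.parser from the standard library.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical variant of the parser above that skips script/style content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip_depth = 0  # > 0 while inside a <script> or <style> element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in ("p", "br"):
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth > 0:
            return  # ignore CSS and JavaScript source
        text = re.sub(r"[ \t\r\n]+", " ", data.strip())
        if text:
            self._parts.append(text + " ")

    def text(self):
        return "".join(self._parts).strip()


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up sample for illustration only.
sample = (
    "<html><head><style>p { color: red; }</style></head>"
    "<body><p>Hello World!</p><script>alert(1)</script></body></html>"
)
print(html_to_text(sample))  # Hello World!
```

The depth counter is a simple way to remember that the parser is currently inside an element whose text should be dropped. The same idea can be extended to other tags if needed.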
