Beautiful Soup Documentation
from BeautifulSoup import BeautifulSoup # For processing HTML
from BeautifulSoup import BeautifulStoneSoup # For processing XML
import BeautifulSoup # To get everything
The code below demonstrates the basic features of Beautiful Soup. You can copy and paste it into a Python file and run it yourself.
from BeautifulSoup import BeautifulSoup
import re
doc = ['<html><head><title>Page title</title></head>',
'<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
'<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
'</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
# <html>
# <head>
# <title>
# Page title
# </title>
# </head>
# <body>
# <p id="firstpara" align="center">
# This is paragraph
# <b>
# one
# </b>
# .
# </p>
# <p id="secondpara" align="blah">
# This is paragraph
# <b>
# two
# </b>
# .
# </p>
# </body>
# </html>
Here are some ways to navigate the soup:
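soup.contents[0].name
# u'html'
soup.contents[0].contents[0].name
# u'head'
head = soup.contents[0].contents[0]
head.parent.name
# u'html'
head.next
# <title>Page title</title>
head.nextSibling.name
# u'body'
head.nextSibling.contents[0]
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
head.nextSibling.contents[0].nextSibling
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>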
Here are some ways to search the soup for specific tags, or for tags with specific attributes:
titleTag = soup.html.head.title
titleTag
# <title>Page title</title>
titleTag.string
# u'Page title'
len(soup('p'))
# 2
soup.findAll('p', align="center")
# [<p id="firstpara" align="center">This is paragraph <b>one</b>. </p>]
soup.find('p', align="center")
# <p id="firstpara" align="center">This is paragraph <b>one</b>. </p>
soup('p', align="center")[0]['id']
# u'firstpara'
soup.find('p', align=re.compile('^b.*'))['id']
# u'secondpara'
soup.find('p').b.string
# u'one'
soup('p')[1].b.string
# u'two'
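As a rough illustration of what a compiled regular expression does when it is passed as an attribute filter, the sketch below applies the same pattern to plain dictionaries standing in for the two <p> tags. It uses only the Python 3 standard library. The helper name find_first and the dictionary representation are assumptions made for this example and are not part of Beautiful Soup.

```python
import re

# Stand-ins for the two <p> tags in the example document (hypothetical
# representation: Beautiful Soup itself uses Tag objects, not dicts).
paragraphs = [
    {"id": "firstpara", "align": "center"},
    {"id": "secondpara", "align": "blah"},
]


def find_first(tags, attr, pattern):
    """Return the first tag whose `attr` value matches `pattern`, else None."""
    for tag in tags:
        value = tag.get(attr)
        if value is not None and pattern.search(value):
            return tag
    return None


match = find_first(paragraphs, "align", re.compile("^b.*"))
print(match["id"])  # secondpara
```

Only "blah" starts with the letter b, so the second paragraph is the one returned, which mirrors the u'secondpara' result shown above.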
Modifying the soup is also easy:
titleTag['id'] = 'theTitle'
titleTag.contents[0].replaceWith("New title")
soup.html.head
# <head><title id="theTitle">New title</title></head>
soup.p.extract()
soup.prettify()
# <html>
# <head>
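In other words, that document is not valid HTML, but it is not too bad either. Here is a much worse document. Among other problems, its <FORM> tag opens outside the <TABLE> tag and closes inside it. (HTML like this is not unusual even on the pages of some large companies.)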
from BeautifulSoup import BeautifulSoup
html = """
<html>
<form>
<table>
<td><input name="input1">Row 1 cell 1
<tr><td>Row 2 cell 1
</form>
<td>Row 2 cell 2<br>This</br> sure is a long cell
</body>
</html>"""
Beautiful Soup can handle this document too:
print BeautifulSoup(html).prettify()
# <html>
# <form>
# <table>
# <td>
# <input name="input1" />
# Row 1 cell 1
# </td>
# <tr>
# <td>
# Row 2 cell 1
# </td>
# </tr>
# </table>
# </form>
# <td>
# Row 2 cell 2
# <br />
# This
# sure is a long cell
# </td>
# </html>
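The last cell of the table ends up outside the <TABLE> tag: Beautiful Soup decided to close the <TABLE> tag when it closed the <FORM> tag. The author of the document probably intended the <FORM> tag to extend to the end of the table, but Beautiful Soup has no way of knowing that. Even in a case this bad, Beautiful Soup parses the invalid document and gives you access to all of its data.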
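To make the nesting problem in this document concrete, here is a minimal sketch that uses Python 3's standard html.parser module (not Beautiful Soup) to report closing tags that do not match the most recently opened tag. The class name NestingChecker and the small VOID_TAGS set are assumptions made for this example. It only detects the mismatch and does not attempt any of the repairs that Beautiful Soup performs.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag; a small assumed subset for this example.
VOID_TAGS = {"input", "br"}


class NestingChecker(HTMLParser):
    """Record closing tags that do not match the most recently opened tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of tags that are currently open
        self.problems = []   # (closing tag, tags open at that moment)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            # Mismatch: note it, but leave the stack untouched.
            self.problems.append((tag, list(self.open_tags)))


html = """
<html>
<form>
<table>
<td><input name="input1">Row 1 cell 1
<tr><td>Row 2 cell 1
</form>
<td>Row 2 cell 2<br>This</br> sure is a long cell
</body>
</html>"""

checker = NestingChecker()
checker.feed(html)
checker.close()

first_tag, still_open = checker.problems[0]
print(first_tag, still_open)
```

The first reported problem is the </form> tag, which arrives while table, tr and td are still open. That is the point at which a lenient parser has to guess what the author meant.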
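Parsing XML
The BeautifulSoup class works like a web browser: it is full of heuristics that try to guess what the author of an HTML document intended. XML has no fixed tag set, so those heuristics do not apply, and BeautifulSoup does not handle XML very well.
Use the BeautifulStoneSoup class to parse XML documents. It is a general-purpose class that knows nothing about any particular XML dialect and applies only very simple rules about tag nesting. Here is an example: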
from BeautifulSoup import BeautifulStoneSoup
xml = "<doc><tag1>Contents 1<tag2>Contents 2<tag1>Contents 3"
soup = BeautifulStoneSoup(xml)
print soup.prettify()
# <doc>
# <tag1>
# Contents 1
# <tag2>
# Contents 2
# </tag2>
# </tag1>
# <tag1>
# Contents 3
# </tag1>
# </doc>
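A major shortcoming of BeautifulStoneSoup is that it does not know about self-closing tags. HTML has a fixed set of self-closing tags, but in XML the set depends on the corresponding DTD. You can declare tags as self-closing by passing their names in the selfClosingTags argument of the BeautifulStoneSoup constructor: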
from BeautifulSoup import BeautifulStoneSoup
xml = "<tag>Text 1<selfclosing>Text 2"
print BeautifulStoneSoup(xml).prettify()
# <tag>
# Text 1
# <selfclosing>
# Text 2
# </selfclosing>
# </tag>
print BeautifulStoneSoup(xml, selfClosingTags=['selfclosing']).prettify()
# <tag>
# Text 1
# <selfclosing />
# Text 2
# </tag>
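The effect of declaring a self-closing tag can be imitated with a toy outline builder. This is only a sketch in standard-library Python 3 under simplifying assumptions: it understands opening tags without attributes and plain text, ignores closing tags, and is not how BeautifulStoneSoup is implemented. The function name outline and its self_closing parameter are hypothetical.

```python
import re


def outline(markup, self_closing=()):
    """Return indented lines showing how text nests under opening tags."""
    lines = []
    depth = 0
    # Each match is either an opening tag name or a run of text.
    for tag, text in re.findall(r"<(\w+)>|([^<]+)", markup):
        if tag:
            if tag in self_closing:
                # A self-closing tag holds nothing, so depth is unchanged.
                lines.append(" " * depth + "<%s />" % tag)
            else:
                lines.append(" " * depth + "<%s>" % tag)
                depth += 1
        else:
            lines.append(" " * depth + text.strip())
    return lines


xml = "<tag>Text 1<selfclosing>Text 2"
print("\n".join(outline(xml)))
print("\n".join(outline(xml, self_closing=["selfclosing"])))
```

In the first call "Text 2" is nested inside <selfclosing>. In the second call it stays at the same level as "Text 1", matching the difference between the two prettify() outputs above.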