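Python Web Scraping with the BeautifulSoup Module: From Installation to Detailed Usage, with Examples
Introduction to the Beautiful Soup scraping module
In short, Beautiful Soup is a Python library whose main job is extracting data from web pages. The official description reads: Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code. Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it automatically, in which case you only have to state the original encoding. Like lxml and html5lib, Beautiful Soup has become an excellent Python parsing library, letting you choose between different parsing strategies or raw speed.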
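Installing Beautiful Soup
Beautiful Soup 3 is no longer developed, and Beautiful Soup 4 is recommended for current projects. The package has been renamed to bs4, so it is imported with import bs4. The version used here is Beautiful Soup 4.3.2 (BS4 for short). This article was written with Python 2.7.7; BS4 also runs on Python 3, whereas BS3 supports Python 2 only, so Python 3 users should stay with BS4. It can be installed with pip or easy_install; either of the following two commands works: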
easy_install beautifulsoup4
pip install beautifulsoup4
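To install the latest version, you can also download the source package and install it by hand, which is just as convenient. After downloading, unpack it and run the following command to complete the installation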
sudo python setup.py install
Next, install lxml
easy_install lxml
pip install lxml
Another available parser is html5lib, implemented in pure Python. It parses documents the same way a browser does and can be installed with either of the following commands:
easy_install html5lib
pip install html5lib
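Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. If no third-party parser is installed, the standard-library parser is used; lxml is more capable and faster, so installing it is recommended.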
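Parsers: usage, advantages, and disadvantages
Python standard library
    Usage: BeautifulSoup(markup, "html.parser")
    Advantages: built into Python; moderate speed; tolerant of malformed documents
    Disadvantages: poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser
    Usage: BeautifulSoup(markup, "lxml")
    Advantages: fast; tolerant of malformed documents
    Disadvantages: requires a C library
lxml XML parser
    Usage: BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")
    Advantages: fast; the only parser that supports XML
    Disadvantages: requires a C library
html5lib
    Usage: BeautifulSoup(markup, "html5lib")
    Advantages: best tolerance of malformed documents; parses documents the way a browser does; produces HTML5
    Disadvantages: slow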
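Since the parser is just a string argument, a script can prefer lxml and fall back to the built-in parser when lxml is missing. The following is a minimal sketch; the helper name make_soup and its preferred parameter are assumptions made for illustration and are not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Build a soup with the first parser in `preferred` that is installed.

    make_soup is a hypothetical helper, not a Beautiful Soup API.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            continue  # this parser library is not installed; try the next one
    raise RuntimeError("no usable parser found")


# Deliberately unclosed tags: every parser listed above repairs them.
soup = make_soup("<p>Hello<b>world")
print(soup.b.string)
```

Different parsers may build slightly different trees from broken markup, so naming the parser explicitly keeps results reproducible across machines.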
Creating a Beautiful Soup object
First, import the bs4 library
from bs4 import BeautifulSoup
We create a string that the later examples will use for demonstration
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="example/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="example/lacie" id="link2">Lacie</a> and
<a class="sister" href="example/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
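Create the BeautifulSoup object
soup = BeautifulSoup(html, "html.parser")
We can also create the object from a local HTML file, for example
soup = BeautifulSoup(open('index.html'), "html.parser")
The line above opens the local file index.html and uses it to create the soup object. Next, print the contents of the soup object with formatted output
print(soup.prettify())
Specifying the encoding: when the HTML uses another encoding (neither UTF-8 nor ASCII), such as GB2312, the encoding must be given so that BeautifulSoup can parse the document correctly.
htmlCharset = "GB2312"
# respHtml holds the raw bytes of the downloaded page; BS4 names this argument from_encoding (fromEncoding was the BS3 spelling)
soup = BeautifulSoup(respHtml, "html.parser", from_encoding=htmlCharset)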
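To make the encoding step concrete, here is a small self-contained sketch. The byte string raw_html stands in for a downloaded response body and is an assumption for illustration; from_encoding only has an effect when the markup is passed as bytes.

```python
from bs4 import BeautifulSoup

# Simulated response body: GB2312-encoded bytes with no charset declaration.
raw_html = "<html><head><title>中文标题</title></head><body></body></html>".encode("gb2312")

# Tell Beautiful Soup which encoding to decode the bytes with.
soup = BeautifulSoup(raw_html, "html.parser", from_encoding="gb2312")

# The tree now holds ordinary Unicode text.
print(soup.title.string)
```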
#!/usr/bin/python
# -*- coding: UTF-8 -*-
from bs4 import BeautifulSoup
import re
# String to be parsed
html_doc = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title aq">
<b>
The Dormouse's story
</b>
</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="example/elsie" class="sister" id="link1">Elsie</a>,
<a href="example/lacie" class="sister" id="link2">Lacie</a> and
<a href="example/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
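# Create the BeautifulSoup object from the HTML string
# (from_encoding only matters for byte strings; it is ignored when the markup is already Unicode text)
soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')
# Print the first title tag
print(soup.title)
# Print the tag name of the first title tag
print(soup.title.name)
# Print the text contained in the first title tag
print(soup.title.string)
# Print the tag name of the parent of the first title tag
print(soup.title.parent.name)
# Print the first p tag
print(soup.p)
# Print the class attribute of the first p tag
print(soup.p['class'])
# Print the href attribute of the first a tag
print(soup.a['href'])
'''
The attributes of a tag can be added, deleted, or modified. Once again, they are handled exactly like a dictionary.
'''
# Change the href attribute of the first a tag to www.baidu/
soup.a['href'] = 'www.baidu/'
# Add a name attribute to the first a tag
soup.a['name'] = u'百度'
# Delete the class attribute of the first a tag
del soup.a['class']
# Print all child nodes of the first p tag
print(soup.p.contents)
# Print the first a tag
print(soup.a)
# Print all a tags, shown as a list
print(soup.find_all('a'))
# Print the first tag whose id attribute equals link3
print(soup.find(id="link3"))
# Get all the text content
print(soup.get_text())
# Print all attributes of the first a tag
print(soup.a.attrs)
for link in soup.find_all('a'):
    # Get the href attribute of each link
    print(link.get('href'))
# Loop over the child nodes of soup.p
for child in soup.p.children:
    print(child)
# Regex match: tags whose name contains "b"
for tag in soup.find_all(re.compile("b")):
    print(tag.name)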
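from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html)  # html can be a string or an open file handle
Note that BeautifulSoup automatically detects the encoding of the input and converts it to Unicode. With these two statements, BS turns the document into a parse tree.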
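The four kinds of Beautiful Soup objects
Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: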
1. Tag
2. NavigableString
3. BeautifulSoup
4. Comment
(1)Tag
What is a Tag? Put simply, it is one of the tags in the HTML, for example
<title>The Dormouse's story</title>
<a class="sister" href="//www.jb51/" id="link1">jb51</a>
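The HTML tags above, such as title and a, together with the content they enclose, are Tags. Let's see how conveniently Beautiful Soup retrieves them. In each snippet below, the commented line is the output
print(soup.title)
#<title>The Dormouse's story</title>
print(soup.head)
#<head><title>The Dormouse's story</title></head>
print(soup.a)
#<a class="sister" href="//www.jb51/" id="link1"><!-- Elsie --></a>
print(soup.p)
#<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
Writing soup followed by a tag name retrieves that tag and its content, which is far more convenient than regular expressions. Note, however, that this returns only the first matching tag in the document: soup.title gives the title tag, and soup.p gives the first p tag. To get every matching tag, use find_all, which is covered later. find_all returns a sequence that can be looped over to pick out whatever is needed. We can check the type of these objects
print(type(soup.a))
#<class 'bs4.element.Tag'>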
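The difference between attribute-style access and find_all can be sketched as follows. The two-link document here is an assumption made up for this illustration.

```python
from bs4 import BeautifulSoup

# A tiny illustrative document with two links (not the article's sample).
doc = '<p><a id="link1" href="/a">A</a> <a id="link2" href="/b">B</a></p>'
soup = BeautifulSoup(doc, "html.parser")

first = soup.a                    # only the first <a>, same as soup.find("a")
all_links = soup.find_all("a")    # every <a>, as a list-like ResultSet
hrefs = [a["href"] for a in all_links]

print(first["id"])
print(hrefs)
```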
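A Tag has two important attributes: name and attrs.
name
print(soup.name)
print(soup.head.name)
#[document]
#head
The soup object itself is a special case: its name is [document]. For any other tag, the value is the name of the tag itself.
attrs
print(soup.p.attrs)
#{'class': ['title'], 'name': 'dromouse'}
This prints all the attributes of the p tag, and the result is a dictionary. To get a single attribute, such as the value of class, index the tag by the attribute name
print(soup.p['class'])
#['title']
The get method, given the attribute name, does the same thing. The two are equivalent when the attribute exists; for a missing attribute, get returns None while indexing raises KeyError
print(soup.p.get('class'))
#['title']
Attributes and content can also be modified, for example
soup.p['class'] = "newClass"
print(soup.p)
#<p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
An attribute can also be deleted, for example
del soup.p['class']
print(soup.p)
#<p name="dromouse"><b>The Dormouse's story</b></p>
Modifying and deleting are not the main use case here, so they are not covered in detail; see the official documentation if you need them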
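head = soup.find('head')
#head = soup.head
#head = soup.contents[0].contents[0]
print(head)
# Whitespace between tags also counts as a child node, so the indexes below assume markup with no whitespace between the tags
html = soup.contents[0]  # <html> ... </html>
head = html.contents[0]  # <head> ... </head>
body = html.contents[1]  # <body> ... </body>
Attributes can be read through Tag.attrs, which returns them as a dictionary, or one at a time with Tag['attribute_name']. A multi-valued attribute such as class is returned as a list.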
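The dictionary-style attribute handling described above can be sketched in one place. The single-tag document below is an assumption made for the illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title aq" name="dromouse">text</p>', "html.parser")
p = soup.p

classes = p["class"]     # multi-valued attribute: returned as a list
name = p["name"]         # single-valued attribute: returned as a string
missing = p.get("id")    # get() returns None instead of raising KeyError

p["id"] = "intro"        # add an attribute
del p["name"]            # delete an attribute

print(p.attrs)
```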
(2)NavigableString
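Now that we have the tag itself, how do we get the text inside it? Simply use .string, for example
print(soup.p.string)
#The Dormouse's story
That is all it takes to get the content of a tag; imagine how much more work a regular expression would be. The type of the result is NavigableString, literally "a string that can be navigated", although the English name is the one to remember. Let's check its type
print(type(soup.p.string))
#<class 'bs4.element.NavigableString'>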
(3)BeautifulSoup
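A BeautifulSoup object represents the whole content of a document. Most of the time it can be treated as a Tag object; it is a special Tag. We can look at its type, name, and attributes to get a feel for it
print(type(soup.name))
#<type 'unicode'>  (Python 2; Python 3 prints <class 'str'>)
print(soup.name)
# [document]
print(soup.attrs)
#{} an empty dictionary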
(4)Comment
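A Comment object is a special type of NavigableString. When it is printed, the comment markers are not included, so if it is not handled carefully it can cause unexpected trouble in text processing. Let's find a tag that contains a comment
print(soup.a)
print(soup.a.string)
print(type(soup.a.string))
The output is as follows
<a class="sister" href="//www.jb51/" id="link1"><!-- Elsie --></a>
 Elsie 
<class 'bs4.element.Comment'>
The content of the a tag is actually a comment, but when .string prints it the comment markers have been stripped, which can cause unnecessary trouble. Printing its type shows that it is a Comment, so it is best to check before using it. The check looks like this
import bs4
if type(soup.a.string) == bs4.element.Comment:
    print(soup.a.string)
In the code above, we first check whether the type is Comment and only then carry out other operations, such as printing.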
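The same check can be written with isinstance, which also covers subclasses. The sketch below skips comment-only tags when collecting link text; the helper visible_text and the two-link document are assumptions for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup(
    '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>', "html.parser"
)


def visible_text(tag):
    """Return the tag's text, or None if it is empty or only a comment.

    visible_text is a hypothetical helper, not a Beautiful Soup API.
    """
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


texts = [visible_text(a) for a in soup.find_all("a")]
print(texts)
```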
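Traversing the document tree with Beautiful Soup
(1) Direct children
Tag.Tag_child1: access a child directly by its tag name. Tag.contents: returns all children as a list. Tag.children: an iterator for looping: for child in Tag.children. Key points: the .contents and .children attributes
.contents: a tag's .contents attribute returns its children as a list, and a single item can be fetched with [num]. Use contents to walk down the tree and parent to walk back up
print(soup.head.contents)
#[<title>The Dormouse's story</title>]
The result is a list, so a list index retrieves one of its elements
print(soup.head.contents[0])
#<title>The Dormouse's story</title>
.children: this does not return a list, but iterating over it yields all the children. Printing .children shows that it is a list iterator object. list() converts it to a list, and a for loop can walk through the children.
print(soup.head.children)
#<listiterator object at 0x7f71457f5710>
How do we get at the contents? Simply iterate. The code and its result are as follows
for child in soup.body.children:
    print(child)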
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="example/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="example/lacie" id="link2">Lacie</a> and
<a class="sister" href="example/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
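One practical detail when looping over children is that the whitespace between tags shows up as text nodes. A minimal sketch, using a small document assumed for illustration, that keeps only the child tags:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = "<body>\n<p>one</p>\n<p>two</p>\n</body>"
soup = BeautifulSoup(doc, "html.parser")
body = soup.body

# .contents is a list and includes the newline text nodes between the tags.
print(len(body.contents))

# .children yields the same nodes lazily; keep only the Tag objects.
child_tags = [child for child in body.children if isinstance(child, Tag)]
print([tag.string for tag in child_tags])
```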
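(2) All descendants
Key point: the .descendants attribute. .contents and .children contain only a tag's direct children, whereas .descendants walks recursively through all of a tag's descendants. As with children, it has to be iterated to get at the content. Tag.descendants: a generator for looping: for des in Tag.descendants
for child in soup.descendants:
    print(child)
The output is shown below. Every node is printed: first the outermost html tag, then the head tag is peeled apart piece by piece, and so on.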
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="example/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="example/lacie" id="link2">Lacie</a> and
<a class="sister" href="example/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="example/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="example/lacie" id="link2">Lacie</a> and
<a class="sister" href="example/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="example/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="example/lacie" id="link2">Lacie</a> and
<a class="sister" href="example/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were
<a class="sister" href="example/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
,
<a class="sister" href="example/lacie" id="link2">Lacie</a>
Lacie
and
<a class="sister" href="example/tillie" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.
<p class="story">...</p>
...
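The contrast between direct children and all descendants is easier to see on a small tree. A sketch, using a one-line document assumed for illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<div><p><b>bold</b> text</p></div>", "html.parser")
div = soup.div

children = list(div.children)        # only the direct child: <p>
descendants = list(div.descendants)  # <p>, <b>, 'bold', ' text' in document order

tag_names = [d.name for d in descendants if isinstance(d, Tag)]
strings = [str(d) for d in descendants if not isinstance(d, Tag)]
print(len(children), len(descendants))
print(tag_names, strings)
```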
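(3) Node content
Key point: the .string attribute. Tag.string: usable when the Tag has exactly one string child; otherwise it returns None. Tag.strings: a generator for looping: for s in Tag.strings. If a tag has a single child and that child is a NavigableString, .string returns that child. If a tag has a single child that is itself a tag, .string also works, and the result is the same as the .string of that only child. Put simply: if a tag contains no further tags, .string returns its content; if it contains exactly one tag, .string returns the innermost content; if it contains more than one child, .string returns None. For example
print(soup.head.string)
#The Dormouse's story
print(soup.title.string)
#The Dormouse's story
If a tag contains several children, it is unclear which child's content .string should refer to, so the result of .string is None
print(soup.html.string)
# None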
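The three cases for .string can be checked side by side. This is a sketch on a small document assumed for illustration; get_text() is shown as the usual alternative when .string is None.

```python
from bs4 import BeautifulSoup

doc = "<head><title>T</title></head><p>a<b>b</b></p>"
soup = BeautifulSoup(doc, "html.parser")

only_string = soup.title.string   # one string child: returns it
single_chain = soup.head.string   # one child tag: delegates to that child's .string
ambiguous = soup.p.string         # two children ('a' and <b>): None
all_text = soup.p.get_text()      # concatenates every string under the tag

print(only_string, single_chain, ambiguous, all_text)
```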
(4) Multiple strings
Key points: the .strings and .stripped_strings attributes. .strings retrieves multiple strings, but it has to be iterated, as in the following example
for string in soup.strings:
    print(repr(string))
# u"The Dormouse's story"
# u'\n\n'
# u"The Dormouse's story"
# u'\n\n'
# u'Once upon a time there were three little sisters; and their names were\n'
# u'Elsie'
# u',\n'
# u'Lacie'
# u' and\n'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# u'...'
# u'\n'
.stripped_strings: the strings returned may contain many spaces or blank lines; .stripped_strings removes the surplus whitespace
for string in soup.stripped_strings:
    print(repr(string))
# u"The Dormouse's story"
# u"The Dormouse's story"
# u'Once upon a time there were three little sisters; and their names were'
# u'Elsie'
# u','
# u'Lacie'
# u'and'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'...'
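A common use of .stripped_strings is to rebuild clean text from a messy fragment. A minimal sketch, on a fragment assumed for illustration:

```python
from bs4 import BeautifulSoup

doc = "<p>  Once upon a time \n <a>Elsie</a>,\n <a>Lacie</a> </p>"
soup = BeautifulSoup(doc, "html.parser")

# Each string is stripped; strings that are only whitespace are skipped.
parts = list(soup.stripped_strings)
joined = " ".join(parts)

print(parts)
print(joined)
```

Calling soup.get_text(" ", strip=True) is a shorter way to get the same joined text.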
(5) Parent node
Key point: the .parent attribute returns the parent node. Tag.parent: the parent node. Tag.parents: all nodes from the parent up to the root
body = soup.body
html = body.parent            # html is the parent of body
p = soup.p
print(p.parent.name)
#body
content = soup.head.title.string
print(content.parent.name)
#title
(6) All ancestors
Key point: the .parents attribute. An element's .parents attribute walks up through all of its ancestors, for example
content = soup.head.title.string
for parent in content.parents:
    print(parent.name)
title
head
html
[document]
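Collecting the ancestor names into a list gives a simple "path to the root" for any node. A sketch, on a small document assumed for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>T</title></head></html>", "html.parser")

text_node = soup.title.string
# .parents yields the nearest ancestor first and ends with the BeautifulSoup object.
ancestor_names = [parent.name for parent in text_node.parents]

print(ancestor_names)
```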
(7) Sibling nodes
Key points: the .next_sibling and .previous_sibling attributes
Use next_sibling and previous_sibling (nextSibling and previousSibling in BS3) to get the following and preceding siblings
Tag.next_sibling
Tag.next_siblings
Tag.previous_sibling
Tag.previous_siblings
Siblings are the nodes on the same level as the current node. .next_sibling returns the next sibling and .previous_sibling the previous one; if there is no such node, the result is None.
Note: in real documents, the .next_sibling and .previous_sibling of a tag are usually strings or whitespace, because whitespace and line breaks also count as nodes, so the result may be blank space or a newline
print(soup.p.next_sibling)
#  actually whitespace here
print(soup.p.previous_sibling)
# None when there is no previous sibling; depending on the parser, the newline after <body> may appear here instead
print(soup.p.next_sibling.next_sibling)
#<p class="story">Once upon a time there were three little sisters; and their names were
#<a class="sister" href="example/elsie" id="link1"><!-- Elsie --></a>,
#<a class="sister" href="example/lacie" id="link2">Lacie</a> and
#<a class="sister" href="example/tillie" id="link3">Tillie</a>;
#and they lived at the bottom of a well.</p>
# The next sibling of the next sibling is the first node we can actually see
The .next attribute applies to a single element and steps through the items of a contents list one at a time. For example, contents[1].next is equivalent to contents[2]; this holds only when contents[1] has no children, because in BS4 .next is the old name for .next_element.
head = body.previousSibling    # head and body are on the same level; head is the previous sibling of body
p1 = body.contents[0]          # p1 and p2 are both children of body; contents[0] gives p1
p2 = p1.nextSibling            # p2 is on the same level as p1 and is its next sibling; body.contents[1] returns it as well
Flexible use of contents[] can also locate related nodes. To search for ancestors or siblings, use find_parent(s), find_next_sibling(s), and find_previous_sibling(s) (findParent(s), findNextSibling(s), and findPreviousSibling(s) in BS3)
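Because whitespace counts as a sibling, find_next_sibling is often the more convenient way to reach the next tag. A minimal sketch, using a small document and ids assumed for illustration:

```python
from bs4 import BeautifulSoup

doc = "<body>\n<p id='p1'>one</p>\n<p id='p2'>two</p>\n</body>"
soup = BeautifulSoup(doc, "html.parser")
p1 = soup.find(id="p1")

raw_next = p1.next_sibling                 # the newline text node after </p>
next_tag = p1.find_next_sibling("p")       # skips text nodes, returns the next <p>
prev_tag = p1.find_previous_sibling("p")   # no earlier <p>, so None

print(repr(raw_next), next_tag["id"], prev_tag)
```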
(8) All siblings
Key points: the .next_siblings and .previous_siblings attributes iterate over the siblings of the current node
for sibling in soup.a.next_siblings:
    print(repr(sibling))
# u',\n'
# <a class="sister" href="example/lacie" id="link2">Lacie</a>
# u' and\n'
# <a class="sister" href="example/tillie" id="link3">Tillie</a>
# u'; and they lived at the bottom of a well.'
# None
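(9) Next and previous elements
Key points: the .next_element and .previous_element attributes. Unlike .next_sibling and .previous_sibling, these are not restricted to siblings; they range over all nodes regardless of level. For example, take the head node
<head><title>The Dormouse's story</title></head>
Its next element is title, regardless of the hierarchy
print(soup.head.next_element)
#<title>The Dormouse's story</title>
(10) All next and previous elements
Key points: the .next_elements and .previous_elements attributes. These iterators move forward or backward through the parsed content of the document, as if the document were being parsed again
last_a_tag = soup.find("a", id="link3")  # assumed definition: the last a tag, as in the official documentation example
for element in last_a_tag.next_elements:
    print(repr(element))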
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# <p class="story">...</p>
# u'...'
# u'\n'
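The difference between sibling navigation and element navigation can be sketched on a small document (assumed here for illustration): next_sibling stays on the same level, while next_element follows parse order and descends into children.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = "<head><title>T</title></head><body><p>x</p></body>"
soup = BeautifulSoup(doc, "html.parser")
head = soup.head

sibling_name = head.next_sibling.name   # same level: body
element_name = head.next_element.name   # parse order: title, the first child of head

# Everything parsed after the title text, in document order.
following = [
    e.name if isinstance(e, Tag) else str(e)
    for e in soup.title.string.next_elements
]
print(sibling_name, element_name, following)
```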
