Python Web Scraping: The BeautifulSoup4 Library -- 688IT编程网

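Like lxml, Beautiful Soup is an HTML/XML parser whose main job is to parse HTML/XML documents and extract data from them.

lxml is implemented in C on top of libxml2, whereas Beautiful Soup loads the whole document and builds a DOM-style (Document Object Model) tree of Python objects on top of whichever parser it is given. Its time and memory overhead is therefore considerably higher, and it is slower than using lxml directly.

Beautiful Soup makes parsing HTML fairly simple. Its API is very user-friendly, it supports CSS selectors, and it can use either the HTML parser in the Python standard library or lxml's HTML and XML parsers.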
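As a minimal sketch of the parser choice described above: the second argument to BeautifulSoup names the parser. The helper make_soup below is a hypothetical name introduced only for illustration; it tries lxml first and falls back to the standard library's html.parser when lxml is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if it is available, else with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


# Hypothetical snippet used only for illustration.
soup = make_soup("<p>Hello, <b>soup</b></p>")
print(soup.p.get_text())  # Hello, soup
print(soup.b.name)        # b
```

Different parsers can build slightly different trees from malformed HTML, so it is usually safer to name the parser explicitly than to rely on the default.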

Beautiful Soup 3 is no longer being developed; new projects should use Beautiful Soup 4.

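Installation and documentation:

1. Installation: pip install beautifulsoup4. The examples below also use the lxml parser: pip install lxml.

2. Chinese documentation:

3. PyCharm tutorial: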

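Comparison of common parsing tools:

Tool                | Parsing speed | Difficulty
BeautifulSoup       | Slowest       | Easiest
lxml                | Fast          | Easy
Regular expressions | Fastest       | Hardest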

Basic usage:

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse">The Dormouse's story</p>
<p>Once upon a time there were three little sisters; and their names were
<a href="example/elsie" class="sister" id="link1"></a>,
<a href="example/lacie" class="sister" id="link2">Lacie</a> and
<a href="example/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p>...</p>
"""

# Create a Beautiful Soup object, using lxml as the parser
soup = BeautifulSoup(html, "lxml")

# Print the parsed document with indentation
print(soup.prettify())


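Four commonly used objects:

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four types:

1. Tag
2. NavigableString
3. BeautifulSoup
4. Comment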

1. Tag:

Put simply, a Tag corresponds to a tag in the HTML. Example:

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse">The Dormouse's story</p>
<p>Once upon a time there were three little sisters; and their names were
<a href="example/elsie" class="sister" id="link1"></a>,
<a href="example/lacie" class="sister" id="link2">Lacie</a> and
<a href="example/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p>...</p>
"""

# Create a Beautiful Soup object
soup = BeautifulSoup(html, 'lxml')

print(soup.title)
# <title>The Dormouse's story</title>
print(soup.head)
# <head><title>The Dormouse's story</title></head>
print(soup.a)
# <a class="sister" href="example/elsie" id="link1"></a>
print(soup.p)
# <p class="title" name="dromouse">The Dormouse's story</p>
print(type(soup.p))
# <class 'bs4.element.Tag'>

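Accessing a tag name as an attribute of soup returns that tag, and the returned object is of type bs4.element.Tag. Note, however, that this only finds the first matching tag in the whole document. Finding all matching tags is covered later.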
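The first-match behaviour can be illustrated with a short sketch. The markup here is hypothetical, and html.parser is used so that the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two links.
doc = '<p><a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(doc, "html.parser")

first_link = soup.a                  # same lookup as soup.find("a")
print(first_link["id"])              # link1
print(first_link == soup.find("a"))  # True
print(soup.table)                    # None: the document has no <table>
```

Because a missing tag yields None rather than an exception, it is worth checking the result before reading attributes from it.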

A Tag has two important attributes, name and attrs. Example:

print(soup.name)
# [document]  (the soup object itself is special: its name is [document])
print(soup.head.name)
# head  (for any other tag, the value is the tag's own name)
print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}
# This prints all attributes of the p tag; the result is a dictionary.
print(soup.p['class'])  # equivalent to soup.p.get('class')
# ['title']
soup.p['class'] = "newClass"
print(soup.p)  # attributes and content can be modified in this way
# <p class="newClass" name="dromouse">The Dormouse's story</p>
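The dictionary-style attribute access above can be sketched a little further. The markup and the default value "untitled" are assumptions made for illustration. One practical difference between the two access styles: tag['x'] raises KeyError when the attribute is missing, while tag.get('x') returns None or a default.

```python
from bs4 import BeautifulSoup

# Hypothetical link with a multi-valued class attribute.
doc = '<a href="example/elsie" class="sister external" id="link1">Elsie</a>'
soup = BeautifulSoup(doc, "html.parser")
link = soup.a

classes = list(link["class"])                  # class is multi-valued, so a list
title = link.get("title")                      # missing attribute gives None
title_or_default = link.get("title", "untitled")
print(classes)           # ['sister', 'external']
print(title)             # None
print(title_or_default)  # untitled

link["href"] = "example/elsie-new"  # modify an attribute
del link["id"]                      # delete an attribute
print(sorted(link.attrs))           # ['class', 'href']
```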

2. NavigableString:

Once you have a tag, you can read the text inside it with tag.string. Example:

print(soup.p.string)
# The Dormouse's story
print(type(soup.p.string))
# <class 'bs4.element.NavigableString'>
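One caveat with .string, shown here as a small sketch on hypothetical markup: it only returns a value when the tag has a single child. When a tag mixes text and child tags, .string is None, and get_text() is the usual way to collect all of the text.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

print(soup.b.string)      # world
print(soup.p.string)      # None: <p> has two children, a string and <b>
print(soup.p.get_text())  # Hello world
```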

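3. BeautifulSoup:

The BeautifulSoup object represents the document as a whole. Most of the time you can treat it as a Tag object: it supports most of the methods for navigating and searching the tree.

Because the BeautifulSoup object does not correspond to a real HTML or XML tag, it has no name and no attributes. It is still sometimes useful to look at its .name, so it has been given the special .name value "[document]":

soup.name
# '[document]'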

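4. Comment:

Tag, NavigableString, and BeautifulSoup cover almost everything in an HTML or XML document, but a few special objects remain. The one you are most likely to need to worry about is the comment:

markup = "<b><!--Hey, buddy. Want to buy a used parser--></b>"
soup = BeautifulSoup(markup, "lxml")
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>

A Comment object is just a special type of NavigableString:

comment
# 'Hey, buddy. Want to buy a used parser'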
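Since a Comment is a NavigableString subclass, comments can leak into results when you read .string or iterate over child nodes. A minimal sketch of telling them apart with isinstance, on hypothetical markup:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<p><!-- hidden note -->Visible text</p>"
soup = BeautifulSoup(markup, "html.parser")

# Split the children of <p> into comments and ordinary strings.
comments = [str(node) for node in soup.p.children if isinstance(node, Comment)]
visible = [str(node) for node in soup.p.children if not isinstance(node, Comment)]
print(comments)           # [' hidden note ']
print(visible)            # ['Visible text']
print(soup.p.get_text())  # Visible text
```

get_text() skips comments by default, which is why the last line prints only the visible text.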

Traversing the document tree:

1. contents and children:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse">The Dormouse's story</p>
<p>Once upon a time there were three little sisters; and their names were
<a href="example/elsie" class="sister" id="link1">Elsie</a>,
<a href="example/lacie" class="sister" id="link2">Lacie</a> and
<a href="example/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p>...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')
head_tag = soup.head

# .contents returns a list of all direct child nodes
print(head_tag.contents)

# .children returns an iterator over the direct child nodes
for child in head_tag.children:
    print(child)
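Both .contents and .children cover direct children only. The sketch below, on hypothetical markup, contrasts them with .descendants, which walks the whole subtree; .descendants is not covered in the text above and is included only for comparison.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = "<div><p>one</p><p>two <b>bold</b></p></div>"
soup = BeautifulSoup(doc, "html.parser")
div = soup.div

child_count = len(div.contents)                       # list of direct children
child_names = [child.name for child in div.children]  # iterator, same nodes
descendant_tags = [node.name for node in div.descendants if isinstance(node, Tag)]

print(child_count)      # 2
print(child_names)      # ['p', 'p']
print(descendant_tags)  # ['p', 'p', 'b']
```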

2. strings and stripped_strings:

If a tag contains more than one string, you can loop over them with .strings:

for string in soup.strings:
    print(repr(string))
# Output (the whitespace-only strings depend on the parser and on the line breaks in the document):
# "The Dormouse's story"
# '\n\n'
# "The Dormouse's story"
# '\n\n'
# 'Once upon a time there were three little sisters; and their names were\n'
# 'Elsie'
# ',\n'
# 'Lacie'
# ' and\n'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n\n'
# '...'
# '\n'

The output may contain a lot of whitespace and blank lines. Use .stripped_strings to remove the extra whitespace:

for string in soup.stripped_strings:
    print(repr(string))
# "The Dormouse's story"
# 'Once upon a time there were three little sisters; and their names were'
# 'Elsie'
# ','
# 'Lacie'
# 'and'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '...'
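A common use of .stripped_strings is to flatten a fragment into one clean line of text. A minimal sketch on hypothetical markup:

```python
from bs4 import BeautifulSoup

doc = "<div>\n  <p> First </p>\n  <p>Second</p>\n</div>"
soup = BeautifulSoup(doc, "html.parser")

parts = list(soup.stripped_strings)  # whitespace-only strings are dropped
text = " ".join(parts)
print(parts)  # ['First', 'Second']
print(text)   # First Second

# get_text() with strip=True gives the same result in one call.
print(soup.get_text(" ", strip=True))  # First Second
```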

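Searching the document tree:

1. The find and find_all methods:

The two methods used most often to search the tree are find and find_all. find returns as soon as it reaches the first tag that satisfies the conditions, so it returns a single element. find_all collects every tag that satisfies the conditions and returns them all. The most common way to use either method is to pass a tag name and an attrs argument to pick out the matching tags:

soup.find_all("a", attrs={"id": "link2"})

Alternatively, pass the attribute name directly as a keyword argument:

soup.find_all("a", id='link2')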
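The difference between the two methods can be sketched as follows. The markup is hypothetical. Because class is a reserved word in Python, the keyword form for filtering by CSS class is class_, and limit caps the number of results.

```python
from bs4 import BeautifulSoup

doc = """
<p>
<a href="example/elsie" class="sister" id="link1">Elsie</a>
<a href="example/lacie" class="sister" id="link2">Lacie</a>
<a href="example/tom" class="brother" id="link3">Tom</a>
</p>
"""
soup = BeautifulSoup(doc, "html.parser")

first = soup.find("a")                         # a single Tag, or None
sisters = soup.find_all("a", class_="sister")  # every match, as a list
first_two = soup.find_all("a", limit=2)        # stop after two matches

print(first["id"])                     # link1
print([a["id"] for a in sisters])      # ['link1', 'link2']
print(len(first_two))                  # 2
print(soup.find("table"))              # None
print(len(soup.find_all("table")))     # 0
```

When nothing matches, find returns None and find_all returns an empty list, so the two need different checks.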

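2. The select method:

The methods above make it easy to find elements, but CSS selectors are sometimes more convenient. To use CSS selector syntax, call the select method. Some commonly used selector forms are listed below:

(1) By tag name:

print(soup.select('a'))

(2) By class name:

To search by class name, put a . in front of the class. For example, to find the tags with class=sister:

print(soup.select('.sister'))

(3) By id:

To search by id, put a # in front of the id. Example:

print(soup.select("#link1"))

(4) Combined selectors:

Combined selectors follow the same rules as in a CSS file, where tag names, class names, and ids are combined. For example, to find the element with id link1 inside a p tag, separate the two with a space:

print(soup.select("p #link1"))

To match direct children only, separate the two with >:

print(soup.select("head > title"))

(5) By attribute:

A selector can also include attributes, which are written in square brackets. The attribute belongs to the same node as the tag, so there must be no space between them; otherwise nothing will match. Example:

print(soup.select('a[href="example/elsie"]'))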
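The note about spaces can be checked with a short sketch on hypothetical markup. With a space, the selector means "an element with this href somewhere inside an a tag", which matches nothing here. The prefix form [href^=...] is standard CSS syntax that is not covered in the text above and is shown only as a related example.

```python
from bs4 import BeautifulSoup

doc = (
    '<p><a href="example/elsie" id="link1">Elsie</a> '
    '<a href="other/lacie" id="link2">Lacie</a></p>'
)
soup = BeautifulSoup(doc, "html.parser")

exact = soup.select('a[href="example/elsie"]')        # attribute on the <a> itself
with_space = soup.select('a [href="example/elsie"]')  # descendant of an <a>
prefix = soup.select('a[href^="example/"]')           # href starts with "example/"

print(len(exact))                 # 1
print(len(with_space))            # 0
print([a["id"] for a in prefix])  # ['link1']
```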

(6) Getting the content:

All of the select calls above return a list. You can iterate over it and call get_text() on each element to get its text.
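Iterating over the result of select might look like the sketch below. The markup is hypothetical, and select_one is shown as a convenience for the case where only the first match is wanted.

```python
from bs4 import BeautifulSoup

doc = """
<p>
<a href="example/elsie" class="sister" id="link1">Elsie</a>
<a href="example/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(doc, "html.parser")

names = []
for link in soup.select("a.sister"):
    names.append(link.get_text())
    print(link["href"], link.get_text())
# example/elsie Elsie
# example/lacie Lacie

first = soup.select_one("#link1")  # first match only, or None
print(first.get_text())            # Elsie
```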
