Using BS4 (BeautifulSoup4): find_all()


Things to note:

1. Some attributes cannot be used as keyword arguments in a search, for example the HTML5 data-* attributes:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')

data_soup.find_all(data-foo="value")

# SyntaxError: keyword can't be an expression

You can still search on such attributes by passing a dictionary as the attrs argument of find_all():

data_soup.find_all(attrs={"data-foo": "value"})

# [<div data-foo="value">foo!</div>]

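The value can be a string, a boolean, or a regular expression.

2. To search on the class attribute, use class_="".

find_all(name, attrs, recursive, text, **kwargs)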

find_all() looks through all of the current tag's descendant tags and keeps those that match the filter. Here are a few examples:

soup.find_all("title")

# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")

# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")

# [<a class="sister" href="example/elsie" id="link1">Elsie</a>,

# <a class="sister" href="example/lacie" id="link2">Lacie</a>,

# <a class="sister" href="example/tillie" id="link3">Tillie</a>]

soup.find_all(id="link2")

# [<a class="sister" href="example/lacie" id="link2">Lacie</a>]

import re

soup.find(text=re.compile("sisters"))

# u'Once upon a time there were three little sisters; and their names were\n'

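Some of these calls look familiar and others are new. What do the text and id arguments mean? Why does find_all("p", "title") return the <p> tag whose CSS class is "title"? Let's look at the arguments of find_all() in detail.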

The name argument

The name argument finds all tags whose name matches name; text strings are ignored automatically.

Basic usage:

soup.find_all("title")

# [<title>The Dormouse's story</title>]

To reiterate: the value of name can be any kind of filter: a string, a regular expression, a list, a function, or True.
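The five kinds of filter accepted by name can be compared side by side. The following is a minimal sketch, assuming a small stand-in document (html_doc) and two helper functions (names, has_id_but_no_class) that are made up for illustration and are not part of Beautiful Soup.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in document, not the one used in the article.
html_doc = (
    "<html><head><title>Story</title></head><body>"
    '<p class="title"><b>Bold</b></p>'
    '<a id="link1" href="/elsie">Elsie</a>'
    "</body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")


def names(tags):
    """Illustrative helper: reduce a result list to tag names."""
    return [tag.name for tag in tags]


def has_id_but_no_class(tag):
    """Illustrative function filter: called once per tag, returns a bool."""
    return tag.has_attr("id") and not tag.has_attr("class")


by_string = names(soup.find_all("b"))                  # exact tag name
by_regex = names(soup.find_all(re.compile("^b")))      # names starting with "b"
by_list = names(soup.find_all(["a", "b"]))             # any name in the list
by_function = names(soup.find_all(has_id_but_no_class))
every_tag = names(soup.find_all(True))                 # all tags, no strings

print(by_string)    # ['b']
print(by_regex)     # ['body', 'b']
print(by_list)      # ['b', 'a']
print(by_function)  # ['a']
print(every_tag)
```

Results come back in document order, which is why the list filter returns b before a even though the list names a first.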

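Keyword arguments

Any keyword argument that is not one of the built-in argument names is treated as a filter on the tag attribute of that name. If you pass an argument called id, Beautiful Soup filters against each tag's "id" attribute: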

soup.find_all(id='link2')

# [<a class="sister" href="example/lacie" id="link2">Lacie</a>]

If you pass an href argument, Beautiful Soup filters against each tag's "href" attribute:

soup.find_all(href=re.compile("elsie"))

# [<a class="sister" href="example/elsie" id="link1">Elsie</a>]

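An attribute filter can be a string, a regular expression, a list, or True.

The following example finds every tag in the document tree that has an id attribute, whatever its value: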

soup.find_all(id=True)

# [<a class="sister" href="example/elsie" id="link1">Elsie</a>,

# <a class="sister" href="example/lacie" id="link2">Lacie</a>,

# <a class="sister" href="example/tillie" id="link3">Tillie</a>]

You can filter on several attributes at once by passing more than one keyword argument:

soup.find_all(href=re.compile("elsie"), id='link1')

# [<a class="sister" href="example/elsie" id="link1">Elsie</a>]
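The attribute filters described in this section can be tried together on one document. This is a sketch only: the three-link document and the ids helper are assumptions made for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in document with three links.
html_doc = (
    '<p><a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
    '<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")


def ids(tags):
    """Illustrative helper: reduce a result list to id values."""
    return [tag["id"] for tag in tags]


by_string = ids(soup.find_all(id="link2"))               # exact value
by_regex = ids(soup.find_all(href=re.compile("elsie")))  # regex searched in href
by_list = ids(soup.find_all(id=["link1", "link3"]))      # any value in the list
has_id = ids(soup.find_all(id=True))                     # attribute is present
combined = ids(soup.find_all(href=re.compile("elsie"), id="link1"))

print(by_string)  # ['link2']
print(by_regex)   # ['link1']
print(by_list)    # ['link1', 'link3']
print(has_id)     # ['link1', 'link2', 'link3']
print(combined)   # ['link1']
```

When several keyword arguments are given, a tag has to satisfy all of them to be returned.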

Some attributes cannot be used as keyword arguments in a search, for example the HTML5 data-* attributes:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')

data_soup.find_all(data-foo="value")

# SyntaxError: keyword can't be an expression

You can still search on such attributes by passing a dictionary as the attrs argument of find_all():

data_soup.find_all(attrs={"data-foo": "value"})

# [<div data-foo="value">foo!</div>]
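The values in the attrs dictionary take the same kinds of filter as keyword arguments. The sketch below assumes a made-up document with two data-foo values, and passes "html.parser" explicitly so that the result does not depend on which parsers are installed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document: two divs with a data-* attribute, one without.
data_soup = BeautifulSoup(
    '<div data-foo="value">foo!</div>'
    '<div data-foo="other">bar</div>'
    "<div>plain</div>",
    "html.parser",
)

# "data-foo" is not a valid Python identifier, so it goes in the dictionary.
exact = data_soup.find_all(attrs={"data-foo": "value"})
present = data_soup.find_all(attrs={"data-foo": True})
pattern = data_soup.find_all(attrs={"data-foo": re.compile("^oth")})

print([div.get_text() for div in exact])    # ['foo!']
print([div.get_text() for div in present])  # ['foo!', 'bar']
print([div.get_text() for div in pattern])  # ['bar']
```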

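Searching by CSS class

Searching for tags by CSS class is very useful, but class is a reserved word in Python, so using class as a keyword argument is a syntax error. As of Beautiful Soup 4.1.2, you can search for tags with a given CSS class by using the class_ argument: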

soup.find_all("a", class_="sister")

# [<a class="sister" href="example/elsie" id="link1">Elsie</a>,

# <a class="sister" href="example/lacie" id="link2">Lacie</a>,

# <a class="sister" href="example/tillie" id="link3">Tillie</a>]

Like any keyword argument, class_ accepts different kinds of filter: a string, a regular expression, a function, or True:

soup.find_all(class_=re.compile("itl"))

# [<p class="title"><b>The Dormouse's story</b></p>]

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)

# [<a class="sister" href="example/elsie" id="link1">Elsie</a>,

# <a class="sister" href="example/lacie" id="link2">Lacie</a>,

# <a class="sister" href="example/tillie" id="link3">Tillie</a>]

A tag's class attribute is a multi-valued attribute. When you search by CSS class, you can match against any one of the tag's CSS classes:

css_soup = BeautifulSoup('<p class="body strikeout"></p>')

css_soup.find_all("p", class_="strikeout")

# [<p class="body strikeout"></p>]

css_soup.find_all("p", class_="body")

# [<p class="body strikeout"></p>]

You can also search for the exact string value of the class attribute:

css_soup.find_all("p", class_="body strikeout")

# [<p class="body strikeout"></p>]

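An exact match fails when the CSS classes are given in a different order from the one in the document; class_="strikeout body", for example, finds nothing here. The class can also be matched through the attrs dictionary: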

soup.find_all("a", attrs={"class": "sister"})

# [<a class="sister" href="example/elsie" id="link1">Elsie</a>,

# <a class="sister" href="example/lacie" id="link2">Lacie</a>,

# <a class="sister" href="example/tillie" id="link3">Tillie</a>]
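The difference between matching a single class, matching the exact attribute string, and matching several classes in any order can be checked directly. This is a hedged sketch: the use of select() with a CSS selector is one possible way to get order-independent matching and is not something the text above prescribes.

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

single = css_soup.find_all("p", class_="strikeout")          # one class of several
exact = css_soup.find_all("p", class_="body strikeout")      # exact string value
reordered = css_soup.find_all("p", class_="strikeout body")  # same classes, other order

# A CSS selector does not care about the order of the classes.
any_order = css_soup.select("p.strikeout.body")

print(len(single))     # 1
print(len(exact))      # 1
print(len(reordered))  # 0
print(len(any_order))  # 1
```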

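The text argument

The text argument searches for strings in the document instead of tags. Like name, text accepts a string, a regular expression, a list, or True (since Beautiful Soup 4.4.0 the same argument is also available under the name string). Some examples: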

soup.find_all(text="Elsie")

# [u'Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])

# [u'Elsie', u'Lacie', u'Tillie']

soup.find_all(text=re.compile("Dormouse"))

# [u"The Dormouse's story", u"The Dormouse's story"]

def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)

soup.find_all(text=is_the_only_string_within_a_tag)

# [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']

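Although text is meant for finding strings, you can combine it with arguments that find tags: Beautiful Soup then returns the tags whose .string matches the text value. This code finds the <a> tags whose .string is "Elsie":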

soup.find_all("a", text="Elsie")

# [<a href="example/elsie" class="sister" id="link1">Elsie</a>]
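The two behaviours of this argument, returning strings when used alone and returning tags when combined with a tag name, can be seen in one run. The sketch below is written against Beautiful Soup 4.4.0 or later and therefore spells the argument string, the newer name for text; the document is a made-up stand-in.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in document.
html_doc = (
    "<html><head><title>The Dormouse's story</title></head><body>"
    "<p class=\"title\"><b>The Dormouse's story</b></p>"
    '<p class="story">Three little sisters: '
    '<a class="sister" id="link1">Elsie</a>, '
    '<a class="sister" id="link2">Lacie</a> and '
    '<a class="sister" id="link3">Tillie</a>.</p>'
    "</body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# Used alone, the argument returns strings, in document order.
one = [str(s) for s in soup.find_all(string="Elsie")]
several = [str(s) for s in soup.find_all(string=["Tillie", "Elsie", "Lacie"])]
matching = [str(s) for s in soup.find_all(string=re.compile("Dormouse"))]

# Combined with a tag name, it returns tags whose .string matches.
tags = soup.find_all("a", string="Elsie")

print(one)                          # ['Elsie']
print(several)                      # ['Elsie', 'Lacie', 'Tillie']
print(len(matching))                # 2
print([tag["id"] for tag in tags])  # ['link1']
```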

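The limit argument

find_all() returns every match, and if the document tree is large the search can take a long time. If you don't need all the results, pass a number for limit. It works like the LIMIT keyword in SQL: the search stops and returns as soon as it has found that many results.

There are three matching tags in the document tree, but only two are returned because of the limit: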

soup.find_all("a", limit=2)

# [<a class="sister" href="example/elsie" id="link1">Elsie</a>,

# <a class="sister" href="example/lacie" id="link2">Lacie</a>]
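A short sketch of limit on a made-up three-link document follows. It also compares find_all(limit=1) with find(), which returns the first match itself rather than a one-element list.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in document with three links.
html_doc = (
    '<p><a id="link1">Elsie</a>'
    '<a id="link2">Lacie</a>'
    '<a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

first_two = soup.find_all("a", limit=2)
only_one = soup.find_all("a", limit=1)   # still a list, with one element
first = soup.find("a")                   # the tag itself, or None if no match

print([a["id"] for a in first_two])  # ['link1', 'link2']
print(only_one[0] == first)          # True
```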

The recursive argument

When you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. To consider only the tag's direct children, pass recursive=False.

A simple document:

<html>

<head>

<title>

The Dormouse's story

</title>

</head>

...

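Here is the search with and without recursive=False. The <title> tag sits under <head>, not directly under <html>, so the second search finds nothing: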

soup.html.find_all("title")

# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)

# []
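To make the direct-child rule concrete, the sketch below runs the same searches on an assumed minimal document and also lists what the direct children of <html> actually are.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document, written without whitespace between tags.
html_doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>...</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

descendants = soup.html.find_all("title")                      # searches all levels
children_only = soup.html.find_all("title", recursive=False)   # direct children only
direct_children = [tag.name for tag in soup.html.find_all(recursive=False)]
from_head = soup.head.find_all("title", recursive=False)       # title is a child of head

print(len(descendants))    # 1
print(children_only)       # []
print(direct_children)     # ['head', 'body']
print(len(from_head))      # 1
```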

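Calling a tag is like calling find_all()

find_all() is probably the most commonly used search method in Beautiful Soup, so there is a shortcut for it. If you treat a BeautifulSoup object or a Tag object as though it were a function, the result is the same as calling find_all() on that object. These two lines of code are equivalent: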

soup.find_all("a")

soup("a")

These two lines are also equivalent:

soup.title.find_all(text=True)

soup.title(text=True)
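The equivalence can be checked by comparing the two result lists. This is a small sketch on an assumed document; it uses the string spelling of the text argument, which requires Beautiful Soup 4.4.0 or later.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in document.
html_doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    '<body><a id="link1">Elsie</a><a id="link2">Lacie</a></body></html>'
)
soup = BeautifulSoup(html_doc, "html.parser")

# Calling the object forwards the arguments to find_all().
same_tags = soup("a") == soup.find_all("a")
same_strings = soup.title(string=True) == soup.title.find_all(string=True)

print(same_tags)     # True
print(same_strings)  # True
```

The explicit find_all() spelling is usually easier to read in shared code; the call shortcut is mostly a convenience for interactive use.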
