Beautiful Soup Chinese Documentation - 688IT编程网

from BeautifulSoup import BeautifulSoup # For processing HTML

from BeautifulSoup import BeautifulStoneSoup # For processing XML

import BeautifulSoup # To get everything

The code below demonstrates the basic features of Beautiful Soup. You can copy and paste it into a Python file and run it yourself.

from BeautifulSoup import BeautifulSoup

import re

doc = ['<html><head><title>Page title</title></head>',

'<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',

'<p id="secondpara" align="blah">This is paragraph <b>two</b>.',

'</html>']

soup = BeautifulSoup(''.join(doc))

print soup.prettify()

# <html>

# <head>

# <title>

# Page title

# </title>

# </head>

# <body>

# <p id="firstpara" align="center">

# This is paragraph

# <b>

# one

# </b>

# .

# </p>

# <p id="secondpara" align="blah">

# This is paragraph

# <b>

# two

# </b>

# .

# </p>

# </body>

# </html>

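Here are some ways to navigate the soup:

soup.contents[0].name

# u'html'

soup.contents[0].contents[0].name

# u'head'

head = soup.contents[0].contents[0]

head.parent.name

# u'html'

head.next

# <title>Page title</title>

head.nextSibling.name

# u'body'

head.nextSibling.contents[0]

# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

head.nextSibling.contents[0].nextSibling

# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>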

Here are some ways to search the soup for particular tags, or for tags with particular attributes:

titleTag = soup.html.head.title

titleTag

# <title>Page title</title>

titleTag.string

# u'Page title'

len(soup('p'))

# 2

soup.findAll('p', align="center")

# [<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>]

soup.find('p', align="center")

# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

soup('p', align="center")[0]['id']

# u'firstpara'

soup.find('p', align=re.compile('^b.*'))['id']

# u'secondpara'

soup.find('p').b.string

# u'one'

soup('p')[1].b.string

# u'two'
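
The regular-expression filter above selects the paragraph whose align value starts with "b". The pattern itself can be checked with the standard re module alone. The sketch below is written for Python 3 and does not use Beautiful Soup: `align_by_id` is a hypothetical, hand-written stand-in for the two <p> tags, and it assumes the attribute value is tested with a search-style match.

```python
import re

# Hypothetical stand-in for the two <p> tags in the sample document:
# each id mapped to the value of its align attribute.
align_by_id = {"firstpara": "center", "secondpara": "blah"}

# The same pattern that is passed to soup.find() above.
pattern = re.compile("^b.*")

# Keep the ids whose align value matches the pattern.
matching_ids = [tag_id for tag_id, align in align_by_id.items()
                if pattern.search(align)]

print(matching_ids)  # ['secondpara']
```

Because the pattern is anchored with ^, "blah" matches and "center" does not, which is why the search returns the second paragraph.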

Modifying the soup is also easy:

titleTag['id'] = 'theTitle'

titleTag.contents[0].replaceWith("New title")

soup.html.head

# <head><title id="theTitle">New title</title></head>

soup.p.extract()

soup.prettify()

# <html>

# <head>

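The document above is not valid HTML, but it is not too bad either. Here is a much worse document. Among other problems, its <FORM> tag starts outside the <TABLE> tag and ends inside it. (HTML like this is not uncommon, even on the pages of some large companies.)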

from BeautifulSoup import BeautifulSoup

html = """

<html>

<form>

<table>

<td><input name="input1">Row 1 cell 1

<tr><td>Row 2 cell 1

</form>

<td>Row 2 cell 2<br>This</br> sure is a long cell

</body>

</html>"""

Beautiful Soup can handle this document as well:

print BeautifulSoup(html).prettify()

# <html>

# <form>

# <table>

# <td>

# <input name="input1" />

# Row 1 cell 1

# </td>

# <tr>

# <td>

# Row 2 cell 1

# </td>

# </tr>

# </table>

# </form>

# <td>

# Row 2 cell 2

# <br />

# This

# sure is a long cell

# </td>

# </html>

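The last cell of the table ends up outside the <TABLE> tag: Beautiful Soup decided to close the <TABLE> tag when it closed the <FORM> tag. The author of the document probably intended the <FORM> tag to extend to the end of the table, but Beautiful Soup has no way of knowing that. Even in a case as bad as this one, Beautiful Soup still parses the invalid document and gives you access to all of its data.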
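
The mismatch described above can be made visible with a small stack of open tags. The sketch below is written for Python 3 and uses the standard library's html.parser, not Beautiful Soup. `OpenTagTracker` and `VOID_TAGS` are hypothetical names introduced for illustration, and this is not the rule Beautiful Soup itself applies when deciding what to close.

```python
from html.parser import HTMLParser

html = """
<html>
<form>
<table>
<td><input name="input1">Row 1 cell 1
<tr><td>Row 2 cell 1
</form>
<td>Row 2 cell 2<br>This</br> sure is a long cell
</body>
</html>"""

# Tags that never take a closing tag; they are not pushed on the stack.
VOID_TAGS = {"input", "br"}


class OpenTagTracker(HTMLParser):
    """Record which tags are still open when </form> is seen."""

    def __init__(self):
        super().__init__()
        self.stack = []                 # start tags not yet closed
        self.open_at_form_close = None  # filled in when </form> arrives

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag such as </br> or </body>
        # Position of the most recently opened tag with this name.
        index = len(self.stack) - 1 - self.stack[::-1].index(tag)
        if tag == "form":
            # Everything opened after <form> is still unclosed here.
            self.open_at_form_close = self.stack[index + 1:]
        del self.stack[index:]


tracker = OpenTagTracker()
tracker.feed(html)
tracker.close()
print(tracker.open_at_form_close)  # ['table', 'td', 'tr', 'td']
```

When </form> arrives, the table and its cells are still open, so any parser has to either close them early or ignore the closing tag. Beautiful Soup closes them, which is why the last cell lands outside the table.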

Parsing XML

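Like a web browser, the BeautifulSoup class is full of heuristics that try to guess what the author of an HTML document intended. XML has no fixed set of tags, so those heuristics do not apply, and BeautifulSoup does not handle XML very well.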

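Use the BeautifulStoneSoup class to parse XML documents. It is a general-purpose class with no knowledge of any particular XML dialect and only very simple rules about tag nesting. Here is an example: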

from BeautifulSoup import BeautifulStoneSoup

xml = "<doc><tag1>Contents 1<tag2>Contents 2<tag1>Contents 3"

soup = BeautifulStoneSoup(xml)

print soup.prettify()

# <doc>

# <tag1>

# Contents 1

# <tag2>

# Contents 2

# </tag2>

# </tag1>

# <tag1>

# Contents 3

# </tag1>

# </doc>
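
For contrast, a strict XML parser rejects the same input outright, because none of its tags are closed. The sketch below is written for Python 3 and uses the standard library's xml.etree.ElementTree. `parses_strictly` is a hypothetical helper introduced for illustration and is not part of Beautiful Soup.

```python
import xml.etree.ElementTree as ET

# The same unclosed input that BeautifulStoneSoup accepted above.
xml = "<doc><tag1>Contents 1<tag2>Contents 2<tag1>Contents 3"


def parses_strictly(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(parses_strictly(xml))                                   # False
print(parses_strictly("<doc><tag1>Contents 1</tag1></doc>"))  # True
```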

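A major shortcoming of BeautifulStoneSoup is that it does not know which tags are self-closing. HTML has a fixed set of self-closing tags, but in XML that depends on the corresponding DTD. You can tell BeautifulStoneSoup which tags are self-closing by passing their names as the selfClosingTags argument to the constructor: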

from BeautifulSoup import BeautifulStoneSoup

xml = "<tag>Text 1<selfclosing>Text 2"

print BeautifulStoneSoup(xml).prettify()

# <tag>

# Text 1

# <selfclosing>

# Text 2

# </selfclosing>

# </tag>

print BeautifulStoneSoup(xml, selfClosingTags=['selfclosing']).prettify()

# <tag>

# Text 1

# <selfclosing />

# Text 2

# </tag>
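
The difference between the two results is where "Text 2" ends up in the tree. As a sketch, the two prettified outputs can be written as well-formed XML and inspected with Python 3's standard xml.etree.ElementTree. This only illustrates the two tree shapes and does not use BeautifulStoneSoup; the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The two prettified results above, written as well-formed XML.
nested = ET.fromstring("<tag>Text 1<selfclosing>Text 2</selfclosing></tag>")
self_closed = ET.fromstring("<tag>Text 1<selfclosing />Text 2</tag>")

# Without selfClosingTags: "Text 2" is the content of <selfclosing>.
print(nested[0].text, nested[0].tail)            # Text 2 None

# With selfClosingTags: <selfclosing /> is empty, and "Text 2" follows it
# as part of <tag>.
print(self_closed[0].text, self_closed[0].tail)  # None Text 2
```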
