Scraping the Text of Romance of the Three Kingdoms with Python: Counting the Top 30 Characters by Appearances and Generating a Word Cloud and Charts
Contents
1. Goal
Use Python to scrape Romance of the Three Kingdoms, then generate a word cloud and charts.
2. Before You Code
Project goal: character names in Romance of the Three Kingdoms and their appearance counts -- statistical analysis of the data.
Question: which character appears most often in Romance of the Three Kingdoms? We want data analysis to answer this.
Analysis tools: pandas, Matplotlib
pip install bs4
pip install lxml
pip install pandas
pip install matplotlib
The full script below also uses these, so install them too (openpyxl is what pandas uses to write .xlsx files):
pip install requests
pip install jieba
pip install wordcloud
pip install openpyxl
Essential bs4 knowledge for data parsing: locating tags and extracting the data inside them.
1. Instantiate a BeautifulSoup object and load the page source into it; 'lxml' is the parser, passed as a fixed argument. Examples:
# load a local HTML file into the object:
fp = open('./test.html', 'r', encoding='utf-8')
soup = BeautifulSoup(fp, 'lxml')
print(soup)
# source fetched from the internet (the common case):
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')
2. Locate and extract tags by calling the BeautifulSoup object's properties and methods.
How the main bs4 properties are used:
1. Tag access, e.g. <p>, <a>, <div>, etc.
soup.tagName
For example:
soup.a   # returns the first <a> tag that appears in the HTML
soup.div # returns the first <div> tag that appears in the HTML
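A minimal, self-contained sketch of tag access (the sample markup is made up for illustration):

from bs4 import BeautifulSoup

html = '<div class="song"><a href="/one">one</a><a href="/two">two</a></div>'
soup = BeautifulSoup(html, 'lxml')
print(soup.a)    # <a href="/one">one</a> -- only the first <a> is returned
print(soup.div)  # the first (and here only) <div>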
2. Finding a tag
soup.find('div')  # equivalent to soup.div
Locating by attribute, e.g. <div class='song'>:
soup.find('div', class_='song')  # class_ needs the trailing underscore
The same works with class_ / id / attrs.
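A hedged mini example (markup made up for illustration):

soup = BeautifulSoup('<div class="song"><a id="feng" href="/one">one</a></div>', 'lxml')
print(soup.find('div', class_='song'))         # first <div> whose class is "song"
print(soup.find('a', id='feng'))               # locate by id
print(soup.find('a', attrs={'href': '/one'}))  # locate by arbitrary attributes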
3. All tags matching a condition
soup.find_all('a')  # returns every matching <a> tag; attribute locating works here too
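For example, again on made-up markup:

soup = BeautifulSoup('<ul><li><a class="x">1</a></li><li><a class="x">2</a></li></ul>', 'lxml')
print(soup.find_all('a'))              # every <a> tag, returned as a list
print(soup.find_all('a', class_='x'))  # attribute filtering, same as with find
print(soup.find_all('a', limit=1))     # stop after the first match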
4. select() takes a CSS selector; for a class selector the leading dot marks the class name, e.g. .tang:
soup.select('.tang')
soup.select('some selector (id, class, tag, ...)')
Both return a list.
To reach a tag nested inside another tag, > marks one level of the hierarchy:
soup.select('.tang > ul > li > a')[0]
A space marks any number of levels:
soup.select('.tang > ul a')[0]  # equivalent to the expression above
These are the common hierarchical selectors; a combined sketch follows below.
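A combined sketch of the selector variants (markup made up for illustration):

html = '<div class="tang"><ul><li><a href="/p1">poem 1</a></li><li><a href="/p2">poem 2</a></li></ul></div>'
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.tang'))                   # class selector; returns a list
print(soup.select('.tang > ul > li > a')[0])  # > : one level per step
print(soup.select('.tang a')[1])              # space: any number of levels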
5. Getting the text between a tag's opening and closing tags
string / text / get_text()
The difference:
text / get_text() returns all the text inside a tag, even text that is not a direct child; string returns only the direct text. A concrete example follows below.
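One illustration of the difference (markup made up):

soup = BeautifulSoup('<div>outer <p>inner</p></div>', 'lxml')
print(soup.div.get_text())  # 'outer inner' -- all text, direct or nested
print(soup.div.string)      # None -- the <div> has mixed children, so no direct string
print(soup.p.string)        # 'inner' -- the <p> has a single direct text child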
6. Getting an attribute value from a tag
soup.a['href']  # dictionary-style subscripting, as the example below shows
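For example:

a_tag = BeautifulSoup('<a href="/book" target="_blank">link</a>', 'lxml').a
print(a_tag['href'])       # '/book'
print(a_tag.get('title'))  # None -- .get() avoids a KeyError when the attribute is missing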
3. Workflow
1. Data source: Romance of the Three Kingdoms on shicimingju.com (古诗词网)
2. Coding flow for the crawl:
Specify the URL -- https://www.shicimingju.com/book/sanguoyanyi.html
Send the request -- requests
Get the response data -- the page HTML
Parse the data (with bs4) -- first locate the target tags, then take the text they contain
Persist the result -- save it to a file
3. Word-frequency statistics: the Chinese word-segmentation library jieba; the details are explained in the code comments (a small warm-up sketch follows below)
4. Word cloud: wordcloud; the details are explained in the code comments
5. Bar chart: matplotlib; the details are explained in the code comments
(The original post shows screenshots here: the table-of-contents page, whose chapter titles and linked pages we want; the URL pattern that appears when a chapter is clicked; and the chapter text to extract from each detail page.)
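Before the full script, a minimal warm-up sketch of the jieba counting idea, run on a single made-up sentence:

import jieba

# precise mode: lcut splits the text exactly, returning a list of words
words = jieba.lcut('玄德与云长、翼德桃园结义')
counts = {}
for w in words:
    if len(w) > 1:                  # drop single characters and punctuation
        counts[w] = counts.get(w, 0) + 1
print(counts)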
4. Complete Code
from bs4 import BeautifulSoup
import requests
import jieba  # an excellent third-party Chinese word-segmentation library
import wordcloud
import pandas as pd
from matplotlib import pyplot as plt
# 1. Crawl the table-of-contents page
url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
}
# output file for the chapters (the filename was stripped in the source; sanguo.txt assumed)
fp = open('./sanguo.txt', 'w', encoding='utf-8')
page_text = requests.get(url=url, headers=headers).text
# 2. Parse the data
# instantiate the BeautifulSoup object
soup = BeautifulSoup(page_text, 'lxml')
# grab the <li> tags of the table of contents
li_list = soup.select('.book-mulu > ul > li')
# pull the data we need out of each <li>
for li in li_list:
    # the chapter title is the direct text of the <a> tag
    title = li.a.string
    # join the relative href onto the site root to get the detail-page URL
    detail_url = 'https://www.shicimingju.com' + li.a['href']
    # request the detail page
    detail_page_text = requests.get(url=detail_url, headers=headers).text
    # parse the detail page with a fresh BeautifulSoup object (lxml parser)
    detail_soup = BeautifulSoup(detail_page_text, 'lxml')
    # locate the content <div> by its class attribute
    div_tag = detail_soup.find('div', class_='chapter_content')
    # the chapter text lives in that div; grab it with .text
    content = div_tag.text
    # persist it
    fp.write(title + ':' + content + '\n')
    print(title, 'crawled')
print('Text crawled. Next: jieba segmentation, then write sanguo.xlsx for analysis')
# exclude words that are not people's names but rank high in frequency
excludes = {"将军","却说","荆州","二人","不可","不能","如此","商议","如何","主公","军士","左右","军马","引兵","次日","大喜","天下","东吴","于是","今日","不敢","魏兵","陛下","一人","都督","人马","不知","汉中","只见","众将","后主","蜀兵","上马","大叫","太守","此人","夫人","先主","后人","背后","城中","天子","一面","何不","大军","忽报","先生","百姓","何故","然后","先锋","不如","赶来","原来","令人","江东","下马","喊声","正是","徐州","忽然","因此","成都","不见","未知","大败","大事","之后","一军","引军","起兵","军中","接应","进兵","大惊","可以","以为","大怒","不得","心中","下文","一声","追赶","粮草","曹兵","一齐","分解","回报","分付","只得","出马","三千","大将","许都","随后","报知","前面","之兵","且说","众官","洛阳","领兵","何人","星夜","精兵","城上","之计","不肯","相见","其言","一日","而行","文武","襄阳","准备","若何","出战","亲自","必有","此事","军师","之中","伏兵","祁山","乘势","忽见","大笑","樊城","兄弟","首级","立于","西川","朝廷","三军","大王","传令","当先","五百","一彪","坚守","此时","之间","投降","五千","埋伏","长安","三路","遣使","英雄"}
# open the crawled file with the right encoding
txt = open("sanguo.txt", "r", encoding='utf-8').read()
# precise mode: split the text exactly, with no redundant words; returns a list
words = jieba.lcut(txt)
# build a dict mapping each word to how often it appears
counts = {}
# take the elements of words one by one
for word in words:
    # skip single characters, and merge different aliases of the same person
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    # if the key exists take its current count, otherwise 0, then add one
    counts[rword] = counts.get(rword, 0) + 1
# remove the stop words
for word in excludes:
    del counts[word]
# turn the dict into a list of (word, count) pairs and sort it
items = list(counts.items())
# sort by the second element of each pair; reverse=True means descending,
# so the first element of items is the most frequent word
items.sort(key=lambda x: x[1], reverse=True)
# print the top 30 words and their counts
name = []
times = []
for i in range(30):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
    name.append(word)
    times.append(count)
print(name)
print(times)
# build the index column
id = []
for i in range(1, 31):
    id.append(i)
# a DataFrame is the pandas equivalent of an Excel worksheet
df = pd.DataFrame({
    'id': id,
    'name': name,
    'times': times,
})
# use our own index; otherwise pandas keeps its default 0-based index
# and writes it into the worksheet as an extra column
df = df.set_index('id')
print(df)
df.to_excel('sanguo.xlsx')
print("DONE!")
print('File written. Next: the word cloud')
# word cloud
w = wordcloud.WordCloud(
    font_path="C:\\Windows\\Fonts\\simhei.ttf",  # a font with Chinese glyphs (the original path was truncated; simhei.ttf assumed)
    background_color="white",  # background colour of the cloud
    max_words=1000,            # maximum number of words in the cloud
    max_font_size=100,         # largest font size
    random_state=50            # number of random colour schemes
)
txt = " ".join(name)
w.generate(txt)
w.to_file("ciyun.png")
print("done!")
print("Word cloud saved. Next: the bar chart")
dirpath = 'sanguo.xlsx'
data = pd.read_excel(dirpath, index_col='id', sheet_name='Sheet1')  # use the id column as the index
# print(data.head())  # the data looks fine at this point
print('OK! The data is fine at this point')
# bar chart
# add Chinese font support (the original line was stripped; SimHei assumed)
plt.rcParams['font.sans-serif'] = ['SimHei']
# draw directly with plt.bar(); light sky blue bars
plt.bar(data.name, data.times, color="#87CEFA")
# set the title and the x/y axis labels; fontsize sets the character size
plt.title('三国人物名字前三十名出现的次数', fontsize=16)
plt.xlabel('人名')
plt.ylabel('出现次数')
plt.show()