Learning Python: Web Article Scraping and Word Cloud Generation -- 688IT编程网

The second homework of Map Visualization

Web Article Scraping and Word Cloud Generation

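Approach

The work falls into two parts: the first scrapes an article from a web page, and the second counts word frequencies and generates a word cloud.

Part 1: Scraping the article from a web page

Process: three steps, each defined as its own function

1. getHtml: fetch the content of the web page
2. getContent: extract the article title and body from the page
3. saveFile: save the scraped title and body to a local file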

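Code:

Imported libraries

import requests  # send HTTP requests and receive the data the server returns
import bs4
import os  # work with local paths when writing the output file
from bs4 import BeautifulSoup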

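getHtml: fetch the web page

# Define getHtml to fetch the content of a web page
def getHtml(url):
    # Send a browser-like User-Agent header
    headers = {'user-agent': 'Mozilla/5.0 '}
    # Handle failures with a try...except statement
    try:
        # Fetch the page
        r = requests.get(url, headers=headers)
        # Raise an exception if the server returned an HTTP error status
        r.raise_for_status()
        # Switch the encoding to avoid garbled text (this statement is missing
        # in the source; apparent_encoding is an assumed reconstruction)
        r.encoding = r.apparent_encoding
        # Return the page content
        return r.text
    except requests.RequestException:
        return "网页获取异常"  # "failed to fetch the page"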

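getContent: extract the article from the page

# Define getContent to parse the page and extract the article title and body
def getContent(html):
    # Parse the page
    soup = bs4.BeautifulSoup(html, 'html.parser')
    # Get the title, which is stored in the h1 tag (the right-hand side is cut
    # off in the source; this lookup is reconstructed from the comment)
    title = soup.find('h1').text
    # Get the body: find every p tag inside the div whose class is "content"
    # and store the results as a list in plist
    plist = soup.find('div', attrs={'class': 'content'}).find_all('p')
    # Accumulate the text of each paragraph in the string content
    content = ''
    for i in plist:
        content += i.text + '\n'
    # Combine the title and the body into the article
    article = title + '\n' + content
    # Return the article
    return article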
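The extraction rule used by getContent (one h1 for the title, every p inside the div with class "content" for the body) can be checked offline on a small inline page. The sketch below is not the article's code: it swaps BeautifulSoup for the standard-library html.parser so it runs without extra packages, and the names ArticleParser, extract_article, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collect the <h1> text and the <p> texts inside <div class="content">."""

    def __init__(self):
        super().__init__()
        self.title = ''
        self.paragraphs = []
        self._in_h1 = False
        self._in_p = False
        self._div_depth = 0  # greater than 0 while inside the content div

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            self._in_h1 = True
        elif tag == 'div':
            classes = (dict(attrs).get('class') or '').split()
            # Count nested divs so we know when the content div really ends
            if self._div_depth > 0 or 'content' in classes:
                self._div_depth += 1
        elif tag == 'p' and self._div_depth > 0:
            self._in_p = True
            self.paragraphs.append('')

    def handle_endtag(self, tag):
        if tag == 'h1':
            self._in_h1 = False
        elif tag == 'div' and self._div_depth > 0:
            self._div_depth -= 1
        elif tag == 'p':
            self._in_p = False

    def handle_data(self, data):
        if self._in_h1:
            self.title += data
        elif self._in_p:
            self.paragraphs[-1] += data


def extract_article(html):
    """Return (title, list of paragraph texts) for one page."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), [p.strip() for p in parser.paragraphs]


# Hypothetical sample page: only the paragraphs inside div.content count
sample = ('<html><body><h1>Title</h1><p>outside</p>'
          '<div class="content"><p>first</p><div><p>second</p></div></div>'
          '<p>after</p></body></html>')
title, paragraphs = extract_article(sample)
print(title, paragraphs)
```

Paragraphs before and after the content div are skipped, while nested ones are kept, which matches the recursive behaviour of find_all in the BeautifulSoup version.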

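saveFile: save the scraped content

# Save the scraped article locally; parameters: content to save, directory path, file name
def saveFile(content, path, filename):
    # Create the directory if it does not exist yet
    if not os.path.exists(path):
        os.mkdir(path)
    # Write the file inside that directory; 'w' opens it for writing only
    with open(os.path.join(path, filename), 'w', encoding='utf-8') as f:
        f.write(content)
    # The with block closes the file, so no explicit f.close() is needed
    print('文件保存成功')  # "file saved"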
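One limitation of the check-then-mkdir pattern in saveFile is that os.mkdir fails when a parent directory is also missing. A minimal sketch of an alternative, assuming the standard-library os.makedirs with exist_ok=True is acceptable here; the name save_text and the temporary demo directory are hypothetical, not part of the original assignment.

```python
import os
import tempfile


def save_text(content, directory, filename):
    """Write content to directory/filename and return the full path."""
    # Creates missing parent directories too, and is a no-op if the
    # directory already exists
    os.makedirs(directory, exist_ok=True)
    target = os.path.join(directory, filename)
    with open(target, 'w', encoding='utf-8') as f:
        f.write(content)
    return target


# Demo in a throwaway location so nothing on the Desktop is touched
demo_dir = os.path.join(tempfile.mkdtemp(), 'articles', '2019')
saved = save_text('标题\n正文', demo_dir, '1.txt')
with open(saved, encoding='utf-8') as f:
    roundtrip = f.read()
print(saved)
```

Returning the full path lets the caller pass the same location to the word-frequency step instead of retyping the file name.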

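main: the entry point

def main():
    # The URL appears truncated in the source; requests needs a full
    # address that includes the scheme (http:// or https://)
    url = "news.china/2019-09/23/content_75233135.shtml"
    html = getHtml(url)
    content = getContent(html)
    path = "C:/Users/Lattee/Desktop/"
    filename = "../1.txt"
    saveFile(content, path, filename)

main()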
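The url value in main has no http:// or https:// prefix, and requests rejects such addresses with a MissingSchema error before any network traffic happens. As a small sketch, a helper could add a default scheme before the URL is handed to getHtml. The name ensure_scheme and the default of http are assumptions, and the helper does not restore any part of the address that may be missing from the source.

```python
from urllib.parse import urlparse


def ensure_scheme(url, default='http'):
    """Return url unchanged if it is http(s), otherwise prefix a scheme."""
    if urlparse(url).scheme in ('http', 'https'):
        return url
    return f'{default}://{url.lstrip("/")}'


fixed = ensure_scheme('news.china/2019-09/23/content_75233135.shtml')
print(fixed)
```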

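Part 2: Building the word cloud

1. Read the text file
2. Segment the text with jieba and store the result
3. Generate the word cloud
4. Display it on screen
5. Save it as an image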

Code:

Imported libraries

import jieba  # Chinese word segmentation
import wordcloud  # word cloud rendering
import matplotlib.pyplot as plt  # plotting
import numpy as np  # array handling
from PIL import Image  # image loading
import collections  # word frequency counting
import seaborn as sns

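Read the text, segment it with jieba, and count word frequencies

# Open the text file and read its content into the variable text
path = "1.txt"
f = open(path, 'r', encoding='utf-8')  # 'r' opens the file read-only
text = f.read()
f.close()
# Handle Chinese text display (the statement for this step is missing in the
# source; selecting the SimHei font through rcParams is an assumed stand-in)
plt.rcParams['font.sans-serif'] = ['SimHei']
# Segment the text with jieba's precise-mode lcut function
sWords = jieba.lcut(text)
# Join the segmented words with spaces and store them in txt
#txt=" ".join(sWords)
# Count word frequencies, skipping stopwords
wordlist = []
stopwords = {'，', '。', '的', '、', '\n', '和', '了', '是', '要', '在', '”', '“', '将', '也'}
for word in sWords:
    if word not in stopwords:
        wordlist.append(word)
word_counts = collections.Counter(wordlist)
word_counts_top10 = word_counts.most_common(10)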
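The counting step above can be tried on a hand-written token list, without jieba or an input file. This is a minimal sketch: the function name top_words and the sample tokens are hypothetical, and the token list stands in for what jieba.lcut would return.

```python
from collections import Counter


def top_words(tokens, stopwords, n=10):
    """Return the n most frequent tokens that are not stopwords."""
    # t.strip() also drops whitespace-only tokens such as '\n'
    kept = [t for t in tokens if t.strip() and t not in stopwords]
    return Counter(kept).most_common(n)


tokens = ['中国', '的', '发展', '，', '中国', '和', '世界', '\n', '发展', '中国']
stopwords = {'，', '。', '的', '和'}
top = top_words(tokens, stopwords, n=2)
print(top)  # [('中国', 3), ('发展', 2)]
```

Counter.most_common returns (word, count) pairs sorted by count, which is the shape the bar chart code expects.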

# Plot the top 10 words as a bar chart
plt.xlabel('高频词语', fontproperties="SimHei", fontsize=20)  # "high-frequency words"
plt.ylabel('出现频率/次', fontproperties="SimHei", fontsize=20)  # "number of occurrences"
scale_x = range(10)
x = []
y = []
for i in range(len(word_counts_top10)):
    x.append(word_counts_top10[i][0])
    y.append(word_counts_top10[i][1])
    #print("{}\t{}".format(word_counts_top10[i][0],word_counts_top10[i][1]))
#plt.figure(figsize=(15,15))
plt.title("top10词频统计", fontproperties="SimHei", color='red', fontsize=20)  # "top-10 word frequencies"
plt.bar(x, y, width=0.5, color='c')
# Label each bar with its count; len(y) also covers texts with fewer than 10 distinct words
for i in range(len(y)):
    plt.annotate(str(y[i]), xy=(i, y[i]), xytext=(i, y[i]+1), color='red', ha='center', fontsize=15)
sns.despine()
plt.savefig(fname='count_bar.jpg')
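The data preparation for the bar chart (splitting the (word, count) pairs into x and y, then placing each label one unit above its bar) can be separated from the plotting calls and checked on its own. A small sketch, with the hypothetical name bar_chart_data and made-up sample pairs:

```python
def bar_chart_data(pairs):
    """Split (word, count) pairs into labels, heights, and annotation points."""
    labels = [word for word, _ in pairs]
    heights = [count for _, count in pairs]
    # Each annotation is (x position, y position, text), one unit above the bar
    annotations = [(i, h + 1, str(h)) for i, h in enumerate(heights)]
    return labels, heights, annotations


pairs = [('中国', 3), ('发展', 2), ('世界', 1)]
labels, heights, annotations = bar_chart_data(pairs)
print(labels, heights, annotations)
```

Because the loop is driven by the pairs themselves, the same code works when the text yields fewer than ten distinct words.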

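Word cloud generation

# Generate the word cloud
mask = np.array(Image.open("timg.png"))
# font_path is cut off in the source ("f"); it must point to a font file
# that contains Chinese glyphs
w = wordcloud.WordCloud(font_path="f",
                        width=1000,
                        height=700,
                        background_color="white",
                        mask=mask,
                        scale=2,
                        max_words=100)
# Build the cloud from the counted words (the generate call is missing in the
# source; generate_from_frequencies is an assumed reconstruction)
w.generate_from_frequencies(word_counts)
#colors=ImageColorGenerator(mask)
# Recolor the cloud using the colors of the background image
#image_color =ImageColorGenerator(mask)
#w.recolor(color_func=image_color)
# Set the canvas size
plt.figure(figsize=(20,20))
# Split the canvas into two panels; the first shows the original image
plt.subplot(121)
plt.imshow(mask)
plt.title('原图', fontproperties='SimHei', fontsize=30)  # "original image"
plt.axis('off')  # hide the axes
# The second panel shows the word cloud
plt.subplot(122)
plt.title('词云图', fontproperties='SimHei', fontsize=30)  # "word cloud"
plt.imshow(w)
plt.axis('off')  # hide the axes
plt.show()
# Save the word cloud to a file
w.to_file("1.png")

Results: the run writes count_bar.jpg (the top-10 bar chart) and 1.png (the word cloud).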
