Complete tutorial: scraping a novel website with Python (full code included). This tutorial collects website data with a Scrapy spider.
1. Install the Scrapy framework
Run from the command line:
pip install scrapy
If Scrapy's dependencies conflict with other Python packages you already have installed, installing inside a virtualenv is recommended.
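For example, a minimal setup using the standard library's venv module (virtualenv works the same way; the environment name scrapy_env here is just an illustration):

python -m venv scrapy_env
source scrapy_env/bin/activate    # on Windows: scrapy_env\Scripts\activate
pip install scrapy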
Once installed, create a crawler project in any folder:
scrapy startproject your_spider_name
Project directory layout (the key files are listed below; a sketch of the full tree follows the list):
Spider rules are written in the spiders directory
items.py — the data fields to scrape
pipelines.py — saves the scraped data
settings.py — configuration
middlewares.py — downloader middleware
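For reference, the tree that scrapy startproject generates looks like this (assuming the project was created as bookspider, which matches the BookspiderItem/BookspiderPipeline class names used below):

bookspider/
    scrapy.cfg
    bookspider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py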
Below is the source code for scraping a novel website.
First, define the data to scrape in items.py:
# author: 小白 <qq:810735403>
import scrapy


class BookspiderItem(scrapy.Item):
    # define the fields for your item here like:
    i = scrapy.Field()                  # chapter index (crawl order)
    book_name = scrapy.Field()          # book title
    book_img = scrapy.Field()           # cover image URL
    book_author = scrapy.Field()        # author name
    book_last_chapter = scrapy.Field()  # latest chapter title
    book_last_time = scrapy.Field()     # last update time
    book_list_name = scrapy.Field()     # chapter title
    book_content = scrapy.Field()       # chapter text
Then write the scraping rules in the spiders directory:
# author: 小白 <qq:810735403>
import scrapy
from ..items import BookspiderItem


class Book(scrapy.Spider):
    name = "BookSpider"
    start_urls = [
        # the scheme is required; Scrapy rejects bare domains
        'http://www.xbiquge.la/xiaoshuodaquan/'
    ]

    def parse(self, response):
        # one <li> per book in the first .novellist block
        book_list = response.css('.novellist:first-child>ul>li')
        for book in book_list:
            book_url = book.css('a::attr(href)').extract_first()
            yield scrapy.Request(response.urljoin(book_url), callback=self.parse_book)

    def parse_book(self, response):
        # book-level metadata from the info panel
        book_name = response.css('#info>h1::text').extract_first()
        book_img = response.css('#fmimg>img::attr(src)').extract_first()
        book_author = response.css('#info p:nth-child(2)::text').extract_first()
        book_last_chapter = response.css('#info p:last-child::text').extract_first()
        book_last_time = response.css('#info p:nth-last-child(2)::text').extract_first()
        book_info = {
            'book_name': book_name,
            'book_img': book_img,
            'book_author': book_author,
            'book_last_chapter': book_last_chapter,
            'book_last_time': book_last_time,
        }
        chapter_links = response.css('#list>dl>dd>a::attr(href)').extract()
        i = 0
        for href in chapter_links:
            i += 1
            # record the crawl order so chapters can be saved in sequence;
            # copy the dict so every request carries its own value of i
            yield scrapy.Request(response.urljoin(href),
                                 meta=dict(book_info, i=i),
                                 callback=self.parse_chapter)

    def parse_chapter(self, response):
        a = response.meta
        self.log(a['book_name'])
        content = response.css('#content::text').extract()
        item = BookspiderItem()
        item['i'] = a['i']
        item['book_name'] = a['book_name']
        item['book_img'] = a['book_img']
        item['book_author'] = a['book_author']
        item['book_last_chapter'] = a['book_last_chapter']
        item['book_last_time'] = a['book_last_time']
        item['book_list_name'] = response.css('.bookname h1::text').extract_first()
        item['book_content'] = ''.join(content)
        yield item
Save the data in pipelines.py:
import os


class BookspiderPipeline(object):
    def process_item(self, item, spider):
        cur_path = 'E:/小说/'  # base output directory, one sub-folder per book
        target_path = cur_path + str(item['book_name'])
        if not os.path.exists(target_path):
            os.makedirs(target_path)
        # prefix the chapter title with its crawl index so files sort in order
        book_list_name = str(item['i']) + str(item['book_list_name'])
        filename_path = target_path + '/' + book_list_name + '.txt'
        print('------------')
        print(filename_path)
        with open(filename_path, 'a', encoding='utf-8') as f:
            f.write(item['book_content'])
        return item
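Note that this pipeline only runs if it is registered in settings.py. A minimal sketch, again assuming the project package is named bookspider (adjust the dotted path to your actual project name; 300 is just a conventional priority value):

ITEM_PIPELINES = {
    'bookspider.pipelines.BookspiderPipeline': 300,
}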
Finally, execute:
scrapy crawl BookSpider
and the novel scraper is complete.
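As an aside, if you only need the raw items rather than per-chapter text files, Scrapy's built-in feed export can write them out without a custom pipeline:

scrapy crawl BookSpider -o books.json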
A tip: run
scrapy shell <url-to-crawl>
and then test your response.css('') rules interactively to confirm the selectors are correct.
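For example, a quick session against the list page, testing the same selector used in parse above:

scrapy shell http://www.xbiquge.la/xiaoshuodaquan/
>>> response.css('.novellist:first-child>ul>li a::attr(href)').extract_first()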