Complete tutorial: scraping a novel website with Python (full code included). This tutorial collects website data with a Scrapy spider.
1. Install the Scrapy framework
Run from the command line:
pip install scrapy
If Scrapy's dependencies conflict with other Python packages you already have installed, installing inside a virtualenv is recommended.
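For example, a minimal setup using the standard library's venv module (virtualenv works the same way; the environment name scrapy_env here is just an illustration):

python -m venv scrapy_env
source scrapy_env/bin/activate    # on Windows: scrapy_env\Scripts\activate
pip install scrapy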
Once installed, create a crawler project in any folder:
scrapy startproject your_spider_name
Project directory layout (the key files are listed below; a sketch of the full tree follows the list):
Spider rules are written in the spiders directory
items.py — the data fields to scrape
pipelines.py — saves the scraped data
settings.py — configuration
middlewares.py — downloader middleware
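For reference, the tree that scrapy startproject generates looks like this (assuming the project was created as bookspider, which matches the BookspiderItem/BookspiderPipeline class names used below):

bookspider/
    scrapy.cfg
    bookspider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py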
Below is the source code for scraping a novel website.
First, define the data to scrape in items.py:
# author: 小白 <qq:810735403>
import scrapy


class BookspiderItem(scrapy.Item):
    # define the fields for your item here like:
    i = scrapy.Field()                  # chapter index (crawl order)
    book_name = scrapy.Field()          # book title
    book_img = scrapy.Field()           # cover image URL
    book_author = scrapy.Field()        # author name
    book_last_chapter = scrapy.Field()  # latest chapter title
    book_last_time = scrapy.Field()     # last update time
    book_list_name = scrapy.Field()     # chapter title
    book_content = scrapy.Field()       # chapter text
Then write the scraping rules in the spiders directory:
# author: 小白 <qq:810735403>
import scrapy
from ..items import BookspiderItem


class Book(scrapy.Spider):
    name = "BookSpider"
    start_urls = [
        # the scheme is required; Scrapy rejects bare domains
        'http://www.xbiquge.la/xiaoshuodaquan/'
    ]

    def parse(self, response):
        # one <li> per book in the first .novellist block
        book_list = response.css('.novellist:first-child>ul>li')
        for book in book_list:
            book_url = book.css('a::attr(href)').extract_first()
            yield scrapy.Request(response.urljoin(book_url), callback=self.parse_book)

    def parse_book(self, response):
        # book-level metadata from the info panel
        book_name = response.css('#info>h1::text').extract_first()
        book_img = response.css('#fmimg>img::attr(src)').extract_first()
        book_author = response.css('#info p:nth-child(2)::text').extract_first()
        book_last_chapter = response.css('#info p:last-child::text').extract_first()
        book_last_time = response.css('#info p:nth-last-child(2)::text').extract_first()
        book_info = {
            'book_name': book_name,
            'book_img': book_img,
            'book_author': book_author,
            'book_last_chapter': book_last_chapter,
            'book_last_time': book_last_time,
        }
        chapter_links = response.css('#list>dl>dd>a::attr(href)').extract()
        i = 0
        for href in chapter_links:
            i += 1
            # record the crawl order so chapters can be saved in sequence;
            # copy the dict so every request carries its own value of i
            yield scrapy.Request(response.urljoin(href),
                                 meta=dict(book_info, i=i),
                                 callback=self.parse_chapter)

    def parse_chapter(self, response):
        a = response.meta
        self.log(a['book_name'])
        content = response.css('#content::text').extract()
        item = BookspiderItem()
        item['i'] = a['i']
        item['book_name'] = a['book_name']
        item['book_img'] = a['book_img']
        item['book_author'] = a['book_author']
        item['book_last_chapter'] = a['book_last_chapter']
        item['book_last_time'] = a['book_last_time']
        item['book_list_name'] = response.css('.bookname h1::text').extract_first()
        item['book_content'] = ''.join(content)
        yield item
Save the data in pipelines.py:
import os


class BookspiderPipeline(object):
    def process_item(self, item, spider):
        cur_path = 'E:/小说/'  # base output directory, one sub-folder per book
        target_path = cur_path + str(item['book_name'])
        if not os.path.exists(target_path):
            os.makedirs(target_path)
        # prefix the chapter title with its crawl index so files sort in order
        book_list_name = str(item['i']) + str(item['book_list_name'])
        filename_path = target_path + '/' + book_list_name + '.txt'
        print('------------')
        print(filename_path)
        with open(filename_path, 'a', encoding='utf-8') as f:
            f.write(item['book_content'])
        return item
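Note that this pipeline only runs if it is registered in settings.py. A minimal sketch, again assuming the project package is named bookspider (adjust the dotted path to your actual project name; 300 is just a conventional priority value):

ITEM_PIPELINES = {
    'bookspider.pipelines.BookspiderPipeline': 300,
}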
Finally, execute:
scrapy crawl BookSpider
and the novel scraper is complete.
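As an aside, if you only need the raw items rather than per-chapter text files, Scrapy's built-in feed export can write them out without a custom pipeline:

scrapy crawl BookSpider -o books.json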
A tip: run
scrapy shell <url-to-crawl>
and then test your response.css('') rules interactively to confirm the selectors are correct.
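For example, a quick session against the list page, testing the same selector used in parse above:

scrapy shell http://www.xbiquge.la/xiaoshuodaquan/
>>> response.css('.novellist:first-child>ul>li a::attr(href)').extract_first()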