Research and Implementation of a Distributed Web Crawler Based on the Scrapy Framework
Authors: Hua Yunbin (华云彬), Kuang Fangjun (匡芳君)
Source: Intelligent Computer and Applications (《智能计算机与应用》), 2018, No. 05
Abstract: To address the contest between crawlers and anti-crawler defenses, and the problem of crawling efficiency, in web crawler development, this paper analyzes the working principle and implementation of a distributed crawler based on the Scrapy framework, together with the principles of distributed operation, anti-crawler countermeasures, the duplicate-removal algorithm, the Redis database, and the MongoDB database. On this basis it designs and implements a distributed web crawler based on Scrapy. Finally, comparative tests and analysis of the crawler yield conclusions on how to improve its crawling efficiency and how to avoid a site's anti-crawler strategies.
Introduction
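With the growth of the Internet and the arrival of the big-data era, ordinary search engines can no longer meet people's needs for information retrieval, and web crawlers have emerged in response, such as Baidu's crawler Baiduspider and Google's crawler Googlebot [1]. Many mature crawler frameworks have also appeared, such as Scrapy [2], which is used in this paper. As the technology has evolved, however, crawler development has come to face a number of problems, which are analyzed below.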