
Data Analysis and Web Crawling in Practice (video course): Study Notes (5) (JD.com crawler, JSON data, distributed crawler concepts, Linux basics)

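1. Supplementary material

Handling JSON data

JSON is a data format that looks a lot like a Python dictionary.

Name/value pair: "firstname": "John"

It can be processed with regular expressions or with Python's built-in json module. The notes below focus on the json module.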

import json

data = '{"id":13145,"name":"外观漂亮"}'
jdata = json.loads(data)  # parse the JSON string into a Python dict
jdata.keys()              # list the keys in jdata
jdata['id']               # value stored under "id"
jdata['name']             # value stored under "name"
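As a small extension of the calls above, the sketch below also walks a nested structure and converts the result back to text with json.dumps. The sample record, including its `tags` list, is made up for illustration.

```python
import json

# Hypothetical sample record, not taken from any real API response.
data = '{"id": 13145, "name": "外观漂亮", "tags": [{"id": 1, "label": "screen"}]}'

record = json.loads(data)  # JSON text -> Python dict

# Nested JSON arrays and objects become Python lists and dicts.
labels = [tag["label"] for tag in record["tags"]]

# ensure_ascii=False keeps Chinese characters readable instead of \uXXXX escapes.
text = json.dumps(record, ensure_ascii=False)

print(sorted(record.keys()))
print(labels)
print(text)
```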

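How to build a distributed crawler

Scrapy also supports distributed crawling. Three pieces are needed:

scrapy

scrapy-redis, which ties Scrapy and Redis together

redis, which can be run as a cluster and is also available on Windows

Install scrapy-redis with:

pip install scrapy-redis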
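Once scrapy-redis is installed, a project is usually switched over to it through a few entries in settings.py. The fragment below is a minimal sketch, assuming a Redis server at a placeholder address; it is not taken from the course.

```python
# settings.py fragment (sketch). Assumes the scrapy-redis package is installed
# and a Redis server is reachable at the placeholder address below.

# Use the scrapy-redis scheduler so all workers share one request queue in Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests across workers with a Redis-backed fingerprint set.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and the fingerprint set in Redis when a spider stops,
# so a crawl can be paused and resumed.
SCHEDULER_PERSIST = True

# Placeholder Redis location; replace with the address of your own server.
REDIS_URL = "redis://127.0.0.1:6379"
```

With these settings, every machine running the same spider pulls requests from the shared Redis queue instead of its own in-memory one.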

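Linux basics

The biggest difference between Linux and Windows is that Windows is operated mostly through a graphical interface, while Linux is operated mostly through commands.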

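About working as a crawler engineer

前程无忧 (51job)

Python crawler engineer

Junior: 10,000 to 15,000

Familiar with the urllib library and the Scrapy framework.

Account and IP bans can be worked around with user-agent rotation and an IP proxy pool.

The various ways JS passes data between pages, and front-end data encryption. Both can be analyzed by capturing network traffic.

A crawler project (the JD.com project can be used).

Intermediate

Object-oriented programming (OOP)

Algorithms

Redis as a distributed database

Senior: 70,000 to 90,000

Algorithms

Anti-crawling measures and hidden data (packet capture); data extraction (regular expressions)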
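The user-agent and proxy-pool idea mentioned for the junior level can be sketched with urllib alone. The user-agent strings and proxy addresses below are placeholders, and the helper names are invented for this example.

```python
import random
import urllib.request

# Placeholder pools; real values would come from your own list or a proxy provider.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]
PROXIES = ["http://127.0.0.1:8888", "http://127.0.0.1:8889"]


def build_request(url):
    """Return a Request carrying a randomly chosen User-Agent header."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)


def build_opener():
    """Return an opener that routes HTTP and HTTPS traffic through a random proxy."""
    proxy = random.choice(PROXIES)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


req = build_request("https://www.jd.com/")
opener = build_opener()
# opener.open(req) would send the request; it is left out so the sketch runs offline.
print(req.get_header("User-agent"))
```

Picking a fresh user agent and proxy per request spreads traffic across identities, which is the point of the pool.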

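2. Walkthrough of last lesson's homework

Next, click the mobile phone category to open its listing page and view the page source here as well. Pick any product and check whether its link appears in the source. If it does not, you need to capture network traffic; in this case that is not necessary.

Now open one phone's product page and view its source. Check whether the title, shop name, shop link, price, comment count, and positive-review rate can be found. Most of them can, but the price, comment count, and positive-review rate cannot, so those need packet-capture analysis.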

cd E:\FHLAZ\Python37\Anaconda3\scrapy_document\first\第7次课\dangdang

scrapy genspider -t crawl jd jd.com

scrapy crawl jd

Here is the code as written so far; the crawl still has a few problems.

# -*- coding: utf-8 -*-
import re
import urllib.request

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from dangdang.items import DangdangItem


class JdSpider(CrawlSpider):
    name = 'jd'
    allowed_domains = ['jd.com']
    start_urls = ['http://jd.com/']

    rules = (
        # Follow every link found on the site
        Rule(LinkExtractor(allow=''), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = DangdangItem()
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        thisurl = response.url
        pat = r"item\.jd\.com/(.*?)\.html"
        x = re.search(pat, thisurl)  # keep only links that look like product pages
        if x:
            thisid = re.compile(pat).findall(thisurl)[0]  # product id
            print(thisid)
            title = response.xpath("/html/head/title/text()").extract()  # product title
            shop = response.xpath(
                "//a[@clstag='shangpin|keycount|product|dianpuname1']/text()").extract()  # shop name
            shoplink = response.xpath(
                "//a[@clstag='shangpin|keycount|product|dianpuname1']/@href").extract()  # shop link
            #print(title)
            #print(shop)
            #print(shoplink)
            # Price and comment data are not in the page source; they come from
            # separate requests found by packet capture.
            priceurl = ("https://c0.3.cn/stock?skuId=" + thisid +
                        "&cat=9987,653,655&venderId=1000003443&area=1_72_4137_0&buyNum=1"
                        "&choseSuitSkuIds=&extraParam={%22originid%22:%221%22}&ch=1&fqsp=0"
                        "&pduid=1539669746784797257006&pdpin=&callback=jQuery9745757")
            commenturl = ("https://sclub.jd.com/comment/productPageComments.action"
                          "?callback=fetchJSON_comment98vv2212&productId=" + thisid +
                          "&score=0&sortType=5&page=1&pageSize=10&isShadowSku=0&rid=0&fold=1")
            #print(priceurl)
            #print(commenturl)
            pricedata = urllib.request.urlopen(priceurl).read().decode("utf-8", "ignore")
            commentdata = urllib.request.urlopen(commenturl).read().decode("utf-8", "ignore")
            pricepat = '"p":"(.*?)"'
            commentpat = '"goodRateShow":(.*?),'
            price = re.compile(pricepat).findall(pricedata)
            comment = re.compile(commentpat).findall(commentdata)
            #print(price)
            #print(comment)
            # Print only when every field was found
            if title and shop and shoplink and price and comment:
                print(title[0])
                print(shop[0])
                print(shoplink[0])
                print(price[0])
                print(comment[0])
                print("__________")
        return item
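The spider pulls the price out of the response with a regular expression. Because the request carries a callback parameter, the response is presumably JSONP, so another option is to strip the callback wrapper and decode the rest with the json module covered earlier. The sketch below assumes a response shape invented for illustration; the real field layout may differ, and `parse_jsonp` is a helper name made up for this example.

```python
import json
import re


def parse_jsonp(text):
    """Strip a `callback(...)` wrapper and return the decoded JSON payload."""
    match = re.search(r"^[^(]*\((.*)\)\s*;?\s*$", text.strip(), re.S)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))


# Hypothetical response body; the real field layout may differ.
sample = 'jQuery9745757({"stock": {"jdPrice": {"p": "1999.00", "id": "J_100000"}}})'

payload = parse_jsonp(sample)
price = payload["stock"]["jdPrice"]["p"]
print(price)
```

Decoding the payload gives access to every field by name, whereas the regular expression only returns the first quoted value after "p".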

cd C:\Program Files\MySQL\MySQL Server 8.0\bin

mysql -uroot -p

show databases;

create database jd;

use jd;

create table jdshop(title char(100) primary key,shop char(100),shoplink char(100),price char(20));

select * from jdshop;

select count(*) from jdshop;

# -*- coding: utf-8 -*-
import re
import urllib.request

import pymysql
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from dangdang.items import DangdangItem


class JdSpider(CrawlSpider):
    name = 'jd'
    allowed_domains = ['jd.com']
    start_urls = ['http://jd.com/']

    rules = (
        # Follow every link found on the site
        Rule(LinkExtractor(allow=''), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = DangdangItem()
        conn = pymysql.connect(host="127.0.0.1", user="root", password="root", database="jd")
        try:
            # item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
            # item['name'] = response.xpath('//div[@id="name"]').get()
            # item['description'] = response.xpath('//div[@id="description"]').get()
            thisurl = response.url
            pat = r"item\.jd\.com/(.*?)\.html"
            x = re.search(pat, thisurl)  # keep only links that look like product pages
            print(x)
            if x:
                thisid = re.compile(pat).findall(thisurl)[0]  # product id
                print(thisid)
                title = response.xpath("/html/head/title/text()").extract()  # product title
                shop = response.xpath('//div[@class="name"]/a/text()').extract()  # shop name
                shoplink = response.xpath('//div[@class="name"]/a/@href').extract()  # shop link
                #print(title)
                #print(shop)
                #print(shoplink)
                priceurl = ("https://c0.3.cn/stock?skuId=" + thisid +
                            "&cat=9987,653,655&venderId=1000003443&area=1_72_4137_0&buyNum=1"
                            "&choseSuitSkuIds=&extraParam={%22originid%22:%221%22}&ch=1&fqsp=0"
                            "&pduid=1539669746784797257006&pdpin=&callback=jQuery9745757")
                # The comment URL would not open, so comments are not crawled here.
                #commenturl = "https://sclub.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98vv2212&productId=" + thisid + "&score=0&sortType=5&page=1&pageSize=10&isShadowSku=0&rid=0&fold=1"
                # print(priceurl)
                # print(commenturl)
                pricedata = urllib.request.urlopen(priceurl).read().decode("utf-8", "ignore")
                #commentdata = urllib.request.urlopen(commenturl).read().decode("utf-8", "ignore")
                #print(commentdata)
                pricepat = '"p":"(.*?)"'
                #commentpat = '"goodRateShow":(.*?),'
                price = re.compile(pricepat).findall(pricedata)
                #comment = re.compile(commentpat).findall(commentdata)
                #print(price)
                #print(comment)
                # Store the row only when every field was found
                if title and shop and shoplink and price:
                    print(title[0])
                    print(shop[0])
                    print(shoplink[0])
                    print(price[0])
                    #print(comment[0])
                    print("__________")
                    # Pass the values as parameters so quotes in a title cannot break the statement
                    sql = "insert into jdshop(title,shop,shoplink,price) values(%s,%s,%s,%s)"
                    with conn.cursor() as cursor:
                        cursor.execute(sql, (title[0], shop[0], shoplink[0], price[0]))
                    conn.commit()
        except Exception as e:
            print(e)
        finally:
            conn.close()  # close the connection whether or not the insert succeeded
        return item
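Titles and shop names scraped from a page can contain quotes, so passing values as query parameters is safer than concatenating them into the SQL string. The sketch below shows the idea with the standard library's sqlite3 as a stand-in for MySQL, purely so it runs without a database server; the row contents are made up.

```python
import sqlite3

# In-memory SQLite database, used only so the sketch runs without a MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table jdshop(title char(100) primary key, shop char(100), "
    "shoplink char(100), price char(20))"
)

# Hypothetical scraped row; note the single quote inside the title.
row = ("Phone 6.1' display", "Example Shop", "//shop.example.com/1", "1999.00")

# Placeholders let the driver quote the values; sqlite3 uses ?, pymysql uses %s.
conn.execute(
    "insert into jdshop(title, shop, shoplink, price) values (?, ?, ?, ?)", row
)
conn.commit()

count = conn.execute("select count(*) from jdshop").fetchone()[0]
stored_title = conn.execute("select title from jdshop").fetchone()[0]
print(count, stored_title)
conn.close()
```

With string concatenation, the quote in this title would end the SQL literal early and the insert would fail.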

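Apart from the comments, I crawled everything successfully. I don't know why the comment URL will not open in the browser; that probably needs more study later.

Right now I just want to know where the table I created for the crawled data is stored. I will have to look for it.

In the video's Dangdang crawl, the comment field contained only digits, and the SQL column was created as int. I remember the insert failing at the time because the formats did not match. That is worth thinking about too.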
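One way to avoid the type mismatch described above is to reduce the scraped text to an integer before inserting it into an int column. The sketch below assumes the scraped value is a string such as "1234条评论"; that format is a guess for illustration, not something the notes confirm, and `to_int` is a helper name invented here.

```python
import re


def to_int(text, default=0):
    """Return the first run of digits in `text` as an int, or `default` if none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else default


# Hypothetical scraped strings.
print(to_int("1234条评论"))
print(to_int(" 56 "))
print(to_int("暂无评论"))
```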

The later data-analysis lessons seem to use data from 天山智能, so I will adapt the code and crawl that as well.
