Python Web Scraping: The requests Module
1. What requests is for

requests is a third-party HTTP library for Python. Compared with the standard library's urllib it offers a much simpler API for sending HTTP requests, which is why it is the usual choice for writing crawlers; install it with pip install requests.
2. Basic usage

requests.get() sends a request to the target site and returns a Response object:

import requests

response = requests.get('https://www.baidu.com')
print(response.status_code)      # print the status code
print(response.url)              # print the request URL
print(response.headers)          # print the response header information
print(response.cookies)          # print the cookie information
print(response.text)             # page source, printed as text
print(response.content)          # body as a byte stream (video, images, audio)
print(response.request.headers)  # request header information

3. Request methods

Whichever HTTP method you need, call the matching function: requests.<method>(url).

r = requests.get('http://httpbin.org/get')        # GET method
r = requests.post('http://httpbin.org/post')      # POST method
r = requests.put('http://httpbin.org/put')        # PUT method
r = requests.delete('http://httpbin.org/delete')  # DELETE method
r = requests.head('http://httpbin.org/get')       # HEAD method
r = requests.options('http://httpbin.org/get')    # OPTIONS method

4. GET requests with parameters

If the link you request takes query parameters, there are two ways to pass them.

1. Put the parameters directly in the URL:

# way 1: parameters written into the URL
r = requests.get('https://www.baidu.com/s?wd=python')
print(r.request.headers)

2. Fill the parameters into a dict and pass the dict as the params argument:

# way 2: parameters in a dict, passed via params
data = {'wd': 'python'}
r = requests.get('https://www.baidu.com/s', params=data)
print(r.request.headers)
5. Parsing JSON data

When a response carries JSON, parse it with response.json(). The method does the same job as json.loads(), and it can only parse JSON data: on anything else it raises an error. So the returned data needs a branching check before parsing, as parse_json2 below does:
import requests

# parse JSON data
def parse_json(url, data):
    response = requests.get(url, params=data)
    print(response.text)
    print(response.json())  # raises an error if the body is not JSON

# add a check on the returned content type
def parse_json2(url, data):
    response = requests.get(url, params=data)
    # read the Content-Type from the response headers
    content_type = response.headers.get('Content-Type')
    if content_type.endswith('json'):
        print(response.json())
    elif content_type.startswith('text'):
        print(response.text)

if __name__ == '__main__':
    url1 = 'http://httpbin.org/get'
    data1 = {'name': 'tom', 'age': 20}
    url2 = 'https://www.baidu.com/s'
    data2 = {'wd': 'python'}
    # parse_json(url1, data1)
    parse_json2(url2, data2)

6. Saving a binary file

response.content holds the body as raw bytes, so write it to a file opened in binary mode:
import requests

url = 'https://img.ivsky.com/img/tupian/pre/201708/30/kekeersitao-002.jpg'
response = requests.get(url)
b = response.content  # the image as raw bytes
with open('fengjing.jpg', 'wb') as f:
    f.write(b)
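For a large file it may be better not to hold the whole body in memory at once; requests can stream the download instead. A minimal sketch, where the URL and chunk size are placeholders:

import requests

url = 'https://example.com/big-file.bin'  # placeholder URL
# stream=True delays downloading the body until it is iterated over
with requests.get(url, stream=True) as response:
    with open('big-file.bin', 'wb') as f:
        # write the body piece by piece instead of all at once
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)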
7. Garbled Chinese in responses

Sometimes the text of a response comes back with garbled Chinese, because the body was decoded with the wrong charset; setting response.encoding to the page's actual charset fixes it:

import requests

response = requests.get('https://www.runoob.com/jsp/jsp-form-processing.html')
print(response.text)        # may print garbled Chinese
response.encoding = 'utf8'  # set the response encoding to utf8
print(response.text)
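When the page's real charset is not known in advance, requests can guess one from the raw bytes; a small sketch using response.apparent_encoding (same example URL as above):

import requests

response = requests.get('https://www.runoob.com/jsp/jsp-form-processing.html')
# apparent_encoding is detected from the body bytes rather than the headers
response.encoding = response.apparent_encoding
print(response.text)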
8. Adding request headers

To give the crawler a simple disguise, we usually replace the 'User-Agent' field of the request headers with a real browser's value:

import requests

def with_headers():
    headers = {}
    headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
    response = requests.get('https://www.baidu.com', headers=headers)
    print(response.request.headers.get('User-Agent'))

def without_headers():
    response = requests.get('https://www.baidu.com')
    print(response.request.headers.get('User-Agent'))

if __name__ == '__main__':
    # 'User-Agent' not set: requests sends its own default
    without_headers()
    # 'User-Agent' set with the OS and browser information
    with_headers()

9. Using proxies

The goal is to rotate among several proxies and User-Agent strings; a rotation sketch follows the basic example below. Like GET parameters, proxies are passed as a dict, via the proxies argument:
import requests

proxy = {'http': 'http://36.56.148.190:9999'}  # proxy dictionary
heads = {}
heads['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'

req = requests.get('https://www.baidu.com', headers=heads, proxies=proxy)
html = req.text
print(html)
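To rotate several proxies and User-Agent strings, one simple approach is to draw a random one per request. A sketch only; the pool entries are placeholders to be replaced with working proxies and UA strings:

import random
import requests

# placeholder pools; substitute real, working values
proxy_pool = [
    {'http': 'http://36.56.148.190:9999'},
    {'http': 'http://118.24.52.95:8080'},
]
ua_pool = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
]

# each request draws a random proxy and User-Agent from the pools
headers = {'User-Agent': random.choice(ua_pool)}
proxies = random.choice(proxy_pool)
response = requests.get('https://www.baidu.com', headers=headers, proxies=proxies)
print(response.status_code)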
10. POST requests

POST works much like GET; only the way parameters are passed differs slightly. GET takes params=data, while POST sends the form data as data=data:

import requests

data = {'name': 'tom', 'age': '22'}
response = requests.post('http://httpbin.org/post', data=data)
print(response.text)
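Besides form data, requests.post() can also send a JSON body: the json= parameter serializes a dict and sets the Content-Type header automatically. A short sketch against the same httpbin endpoint:

import requests

payload = {'name': 'tom', 'age': 22}
# json= serializes payload to JSON and sets Content-Type: application/json
response = requests.post('http://httpbin.org/post', json=payload)
print(response.json())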
11. Cookies

The cookies a server sets can be read straight off the response:

import requests

response = requests.get('https://www.baidu.com')
print(response.cookies)
print(type(response.cookies))
for k, v in response.cookies.items():  # print the cookie information
    print(k + ':' + v)
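Cookies can also be attached to an outgoing request without building the Cookie header by hand: the cookies= parameter accepts a plain dict. A minimal sketch against httpbin, which echoes back the cookies it receives:

import requests

# the dict is converted into a Cookie header automatically
response = requests.get('http://httpbin.org/cookies', cookies={'number': '12345'})
print(response.text)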
12. Cookies: a Zhihu example

To crawl pages that require a login, set cookies that carry the login information: log in in a browser, read the Cookie value from the request headers in the developer tools, and pass it along in headers. For example, to fetch a Zhihu profile page, the Cookie and the site's Host have to be sent:
import requests

headers = {
    'Cookie': '',  # paste the Cookie string of a logged-in browser session here
    'Host': 'www.zhihu.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
r = requests.get('https://www.zhihu.com/people/jackde-jie-jie', headers=headers)
with open('zhihu.txt', 'w', encoding='utf8') as fi:  # placeholder file name
    fi.write(r.text)
print(r.text)

13. Keeping a session alive

Differences between cookies and sessions:
1. Cookie data lives in the client's browser; session data lives on the server.
2. Cookies are not very safe: others can read the cookies stored locally and forge them, so where security matters a session should be used.
3. A session is kept on the server for a certain time. As traffic grows it eats into server performance, so to lighten the load on the server, cookies should be used.
4. A single cookie cannot store more than 4 KB of data, and many browsers cap a site at 20 cookies.

Handling cookies and sessions:
1. Benefit of carrying cookies and a session: you can request pages that sit behind a login.
2. Drawback: a set of cookies and a session usually maps to one user, so requesting too fast and too often makes it easy to be flagged as a crawler.

Avoid cookies whenever you can do without them, but to fetch logged-in pages you must send requests that carry them. requests provides the Session class to keep the client-server session alive. Usage: 1. instantiate a Session object; 2. let the session send the GET or POST requests:
import requests

requests.get('http://httpbin.org/cookies/set/number/12345')  # set a cookie
r = requests.get('http://httpbin.org/cookies')               # read the cookies back
print(r.text)  # empty: two plain requests share no state

session = requests.Session()                                 # create a session
session.get('http://httpbin.org/cookies/set/number/12345')   # set a cookie
response = session.get('http://httpbin.org/cookies')         # read the cookies back
print(response.text)  # the cookie is there: the session kept it
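In practice a Session is what you would use for a real login: post the credentials once, then reuse the same session for pages behind the login. This is only a sketch; the URLs and form field names are hypothetical and depend on the target site:

import requests

session = requests.Session()
# hypothetical login endpoint and field names
session.post('https://example.com/login', data={'username': 'tom', 'password': 'secret'})
# the session resends the cookies set by the login response
profile = session.get('https://example.com/profile')
print(profile.status_code)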
14. Certificate verification

Sometimes a URL that works raises an alert in the browser; this is often caused by certificate problems. Setting verify=False skips the certificate check:

import requests
from requests.packages import urllib3

urllib3.disable_warnings()  # silence the insecure-request warning from urllib3
url = 'https://www.12306.cn'
response = requests.get(url, verify=False)  # disable certificate verification
print(response.status_code)
15. Catching exceptions

While a crawler runs, network problems and the like can kill the program; to avoid that, set up an exception-catching mechanism. When you are not sure what error will occur, use try…except to catch exceptions broadly. All requests exceptions live in requests.exceptions, e.g.:

from requests.exceptions import ReadTimeout, HTTPError, RequestException

Skipping over a timeout:

import requests
from requests.exceptions import ReadTimeout
from requests.exceptions import ConnectTimeout

try:
    res = requests.get('http://httpbin.org/get', timeout=0.1)  # placeholder URL
    print(res.status_code)
except (ReadTimeout, ConnectTimeout) as timeout:
    pass  # skip the exception

Catching several exception types separately:

import requests
from requests.exceptions import ReadTimeout, HTTPError, RequestException

try:
    response = requests.get('https://www.baidu.com', timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('timeout')
except HTTPError:
    print('httperror')
except RequestException:
    print('reqerror')