A Super-Detailed Introduction to Python Web Scraping (Requests) -- Study Notes
1. Introduction to Crawlers
A crawler is an automated program that fetches information from the internet, extracting the data that is valuable to us. In principle, any language that supports network communication can be used to write a crawler; the crawler itself does not depend much on the language, but some languages are simply more convenient than others. Today, most crawlers are written in back-end scripting languages, and Python is without doubt the most widely used of them.
2. Basic Crawler Operations
- Installing and using the Requests module
Requests is an HTTP library written in Python under the Apache2 License. It is a high-level wrapper over Python's built-in modules that makes issuing network requests far more pleasant; with Requests you can easily perform virtually any operation a browser can.
Install the Requests module:
pip3 install requests
1. GET requests
# 1. Without parameters
import requests

r = requests.get('https://mp.csdn.net/')
print(r.url)

# 2. With parameters
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('https://mp.csdn.net/', params=payload)
print(r.url)
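Under the hood, the `params` dictionary is URL-encoded and appended to the query string. A minimal offline sketch of that encoding using only the standard library (same example key/value pairs as above):

```python
from urllib.parse import urlencode

# Encode the parameters the same way requests encodes `params`
payload = {'key1': 'value1', 'key2': 'value2'}
query = urlencode(payload)
url = 'https://mp.csdn.net/?' + query
print(url)  # https://mp.csdn.net/?key1=value1&key2=value2
```

This is why `r.url` in the example above shows the parameters appended after `?`.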
2. POST requests
# 1. Basic POST example
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post('http://www.qwerty.com/', data=payload)  # example placeholder URL

# 2. Sending headers and data
import requests
import json

payload = {'some': 'data'}
headers = {'content-type': 'application/json'}
r = requests.post('https://api.github.com/', data=json.dumps(payload), headers=headers)
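The two POST examples differ in how the body is encoded. A small standard-library sketch (same payloads as above) of what each call actually puts in the request body:

```python
import json
from urllib.parse import urlencode

# data=payload  -> form-encoded body
# (requests then sends Content-Type: application/x-www-form-urlencoded)
form_body = urlencode({'key1': 'value1', 'key2': 'value2'})

# data=json.dumps(payload) -> a raw JSON string; the Content-Type
# header must then be set to application/json by hand, as above
json_body = json.dumps({'some': 'data'})

print(form_body)  # key1=value1&key2=value2
print(json_body)  # {"some": "data"}
```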
3. Other methods of the Requests module
requests.get(url, params=None, **kwargs)
requests.post(url, data=None, json=None, **kwargs)
requests.put(url, data=None, **kwargs)
requests.head(url, **kwargs)
requests.delete(url, **kwargs)
requests.patch(url, data=None, **kwargs)
requests.options(url, **kwargs)
# All of the methods above are built on top of this one
The requests module already wraps the common HTTP request methods for you; simply call the corresponding function. The full set of parameters of the underlying method is:
def request(method, url, **kwargs):
"""Constructs and sends a :class:`Request <Request>`.
:param method: method for the new :class:`Request` object.
:param url: URL for the new :class:`Request` object.
:param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
:param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
:param json: (optional) json data to send in the body of the :class:`Request`.
:param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
:param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
:param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': ('filename', fileobj)}``) for multipart encoding upload.
:param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
:param timeout: (optional) How long to wait for the server to send data
before giving up, as a float, or a :ref:`(connect timeout, read
timeout) <timeouts>` tuple.
:type timeout: float or tuple
:param allow_redirects: (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
:type allow_redirects: bool
:param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
:param verify: (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to ``True``.
:param stream: (optional) if ``False``, the response content will be immediately downloaded.
:param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
:return: :class:`Response <Response>` object
:rtype: requests.Response
Usage::
>>> import requests
>>> req = requests.request('GET', 'https://httpbin.org/get')
>>> req
<Response [200]>
"""
# By using the 'with' statement we are sure the session is closed, thus we
# avoid leaving sockets open which can trigger a ResourceWarning in some
# cases, and look like a memory leak in others.
with sessions.Session() as session:
    return session.request(method=method, url=url, **kwargs)
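The comment above explains why request() wraps every call in a with block. A toy sketch of how the context-manager protocol guarantees the session gets closed (FakeSession is a hypothetical stand-in, not the real requests internals):

```python
# FakeSession is hypothetical; it only demonstrates the with-statement pattern.
class FakeSession:
    def __init__(self):
        self.closed = False

    def request(self, method, url, **kwargs):
        # Stand-in for the real network call
        return f'{method} {url}'

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()   # runs even if request() raised
        return False   # do not swallow exceptions

def fake_request(method, url, **kwargs):
    # Mirrors the pattern above: open a session, delegate, always close
    with FakeSession() as session:
        return session.request(method=method, url=url, **kwargs)

print(fake_request('GET', 'https://example.com'))  # GET https://example.com
```

Because `__exit__` always runs, no socket is left open even when the request raises, which is exactly what the comment in the real source is guarding against.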
4. Using request() directly
method:           HTTP method
url:              request URL
params:           parameters passed in the URL query string --- GET
data:             data passed in the request body
json:             JSON data passed in the request body
headers:          request headers
cookies:          cookies
files:            files to upload
auth:             basic authentication (adds an encoded username/password to the headers)
timeout:          timeout for the request and response
allow_redirects:  whether to follow redirects
proxies:          proxies
verify:           whether to verify the SSL certificate (boolean)
cert:             client certificate file
stream:           stream the response (download in chunks, useful for large files)
requests.request(
    method='GET',
    url='https://www.baidu.com',
    params={'k1': 'v1', 'k2': 'v2'},
    data={'use': 'alex', 'pwd': '123', 'x': [11, 2, 3]},  # normally use either data or json, not both
    json={'use': 'alex', 'pwd': '123'},
    headers={
        'Referer': 'https://dig.chouti.com/',
        'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
    }
)
5. The Session object
When testing an API we call several endpoints and issue multiple requests, and across those requests we sometimes need to keep shared state, such as cookie data.
The Session object of the requests library keeps such parameters across requests: all requests issued from the same Session instance share cookies.
# Create a session object
s = requests.Session()

# Issue a GET request through the session to set a cookie
s.get('https://httpbin.org/cookies/set/sessioncookie/123456789')

# Issue another GET request through the session to read the cookies back
r = s.get('https://httpbin.org/cookies')
'''
# Result
>>> print(r.text)
'{"cookies": {"sessioncookie": "123456789"}}'
'''
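The cookie persistence shown above can be sketched offline with a hypothetical stand-in class (TinySession is not the real requests.Session): cookies set by one request are stored on the session and automatically carried by the next.

```python
# TinySession is a hypothetical stand-in for requests.Session; the URL
# handling imitates httpbin's /cookies/set/<name>/<value> endpoint.
class TinySession:
    def __init__(self):
        self.cookies = {}

    def get(self, url):
        if '/cookies/set/' in url:
            # Store the cookie "set by the response" on the session
            name, value = url.rsplit('/', 2)[-2:]
            self.cookies[name] = value
        # Every request issued by the session carries the stored cookies
        return dict(self.cookies)

s = TinySession()
s.get('https://httpbin.org/cookies/set/sessioncookie/123456789')
print(s.get('https://httpbin.org/cookies'))  # {'sessioncookie': '123456789'}
```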
The Session object can also provide default values for the request methods, configured via attributes on the session.
# Create a session object
s = requests.Session()

# Set the session's auth attribute as a default parameter for its requests
s.auth = ('user', 'pass')

# Update the session's headers; they are merged with the headers passed to
# each request method to form the final request headers
s.headers.update({'x-test': 'true'})

# Send a request: no auth is given here, so the session's auth attribute is
# used; the headers given here are merged with the session's headers
r = s.get('https://httpbin.org/headers', headers={'x-test2': 'true'})
# Inspect the headers that were sent
'''
The resulting request headers contain:
{'Authorization': 'Basic dXNlcjpwYXNz', 'x-test': 'true', 'x-test2': 'true'}
'''
Method-level parameters override session-level parameters.
Add an auth argument to the request above:
r = s.get('https://httpbin.org/headers', auth=('user', 'hah'), headers={'x-test2': 'true'})
'''
The request headers now contain:
{'Authorization': 'Basic dXNlcjpoYWg=', 'x-test': 'true', 'x-test2': 'true'}
'''
To omit a session-level attribute from an individual request, simply set that key to None in the method-level parameters; the key is then dropped automatically.
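The merge-and-omit rule described above can be sketched in pure Python (merge_settings is a hypothetical helper mirroring the behavior, not the actual requests internals): session defaults apply first, method-level values override them, and None removes a key.

```python
# Hypothetical helper mirroring the session/method merge rule
def merge_settings(session_settings, request_settings):
    merged = dict(session_settings)          # start from session defaults
    for key, value in request_settings.items():
        if value is None:
            merged.pop(key, None)            # None at the method level omits the key
        else:
            merged[key] = value              # method level overrides the session
    return merged

session_headers = {'x-test': 'true', 'Accept': 'application/json'}
print(merge_settings(session_headers, {'x-test2': 'true'}))
# {'x-test': 'true', 'Accept': 'application/json', 'x-test2': 'true'}
print(merge_settings(session_headers, {'Accept': None}))
# {'x-test': 'true'}
```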
