Web Scraping: HTTP Requests and HTML Parsing (Scraping Zhihu)
1. Sending web requests
1.1 requests
Use the requests library's get() method to send a GET request. You will usually also attach a "user-agent" request header and a login "cookie", among other parameters.
1.1.1 user-agent
Log in to the website and copy the "user-agent" value into a text file.
1.1.2 cookie
Log in to the website and copy the "cookie" value into a text file.
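Either way, the two saved values end up in the headers dict passed to requests. Below is a minimal sketch of loading them back, assuming they were saved to ua.txt and cookie.txt (both file names are placeholders, not from the original post):

def load_headers(ua_path='ua.txt', cookie_path='cookie.txt'):
    # Read the saved user-agent and cookie strings into a requests headers dict.
    with open(ua_path, encoding='utf-8') as f:
        user_agent = f.read().strip()
    with open(cookie_path, encoding='utf-8') as f:
        cookie = f.read().strip()
    return {'user-agent': user_agent, 'cookie': cookie}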
1.1.3 Test code
import requests
from requests.exceptions import RequestException

headers = {
    'cookie': '',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}  # replace with your own cookie

def get_page(url):
    try:
        html = requests.get(url, headers=headers, timeout=5)
        if html.status_code == 200:
            print('Request succeeded')
            return html.text
        else:  # this else branch is optional
            return None
    except RequestException:
        print('Request failed')

if __name__ == '__main__':
    input_url = 'https://www.zhihu.com/hot'
    get_page(input_url)
The result is as follows: [screenshot]
1.2 selenium
Most websites can identify a Selenium scraper from the value of window.navigator.webdriver, so the first task for a Selenium scraper is to keep the site from recognizing the automated browser. As with requests, Selenium sessions also usually need a "user-agent" request header and a login "cookie".
1.2.1 Removing the window.navigator.webdriver value in Selenium
Add the following code to your program (this applies to older versions of Chrome):
import time
from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = Chrome(options=option)
time.sleep(10)
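For newer Chrome builds, the excludeSwitches flag alone is often not enough. A commonly used additional workaround (a sketch, not from the original post; it continues from the driver created above and assumes Selenium's Chrome driver, which exposes execute_cdp_cmd) injects a script through the DevTools Protocol before any page script runs:

# Sketch for newer Chrome: overwrite navigator.webdriver before page scripts run.
driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    'source': "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
})
# Check what the site would see; this should now print None:
print(driver.execute_script('return window.navigator.webdriver'))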
1.2.2 user-agent
Log in to the website, copy the "user-agent" value into a text file, then run the following code to add it as a request header:
from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
option = ChromeOptions()
option.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36')
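To confirm the header took effect, you can create the driver and read navigator.userAgent back (a quick sketch continuing from the option set up above):

driver = Chrome(options=option)
# The page-side user agent should now match the string passed above:
print(driver.execute_script('return navigator.userAgent'))
driver.quit()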
1.2.3 cookie
Selenium requires each cookie to carry "name" and "value" keys with their corresponding values. If the cookie shown on the website is a plain string, copying it directly will not satisfy this requirement, so use Selenium's get_cookies() method to capture the login cookies instead:
from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
import time
import json

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
option.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36')
driver = Chrome(options=option)
time.sleep(10)
driver.get('https://www.zhihu.com/signin?next=%2F')
time.sleep(30)  # log in manually during this pause
driver.get('https://www.zhihu.com/')
cookies = driver.get_cookies()
jsonCookies = json.dumps(cookies)
with open('cookies.txt', 'a') as f:  # choose your own file name and path
    f.write(jsonCookies)
    f.write('\n')
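To reuse the saved cookies in a later session, read the JSON line back and add each cookie dict one by one; as noted above, Selenium only needs the "name" and "value" keys. A sketch, assuming the cookies.txt file written by the code above:

import json
from selenium.webdriver import Chrome

driver = Chrome()
driver.get('https://www.zhihu.com')   # visit the domain before adding its cookies
driver.delete_all_cookies()
with open('cookies.txt') as f:
    for c in json.loads(f.readline()):
        driver.add_cookie({'name': c['name'], 'value': c['value']})
driver.get('https://www.zhihu.com/')  # reload; the session is now logged in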
1.2.4 Test code example
Copy the cookie obtained above into the program below and it is ready to run:
from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
import time

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = Chrome(options=option)
time.sleep(10)
driver.get('https://www.zhihu.com')
time.sleep(10)
driver.delete_all_cookies()  # clear the cookies set so far
time.sleep(2)
cookie = {}  # replace with your own cookie, e.g. {'name': ..., 'value': ...}
driver.add_cookie(cookie)
driver.get('https://www.zhihu.com/')
time.sleep(5)
for i in driver.find_elements_by_css_selector('div[itemprop="zhihu:question"] > a'):
    print(i.text)
A screenshot of the result: [screenshot]
2. HTML parsing (locating elements)
To scrape the target data, the first step is to locate the elements that contain it. Both BeautifulSoup and Selenium make it easy to traverse the HTML for this.
2.1 Locating elements with BeautifulSoup
In the code below, BeautifulSoup first locates the "h2" tags whose class attribute is "HotItem-title", then reads their string values with the .get_text() method:
import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException

headers = {
    'cookie': '',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}  # replace with your own cookie

def get_page(url):
    try:
        html = requests.get(url, headers=headers, timeout=5)
        if html.status_code == 200:
            print('Request succeeded')
            return html.text
        else:  # this else branch is optional
            return None
    except RequestException:
        print('Request failed')

def parse_page(html):
    html = BeautifulSoup(html, "html.parser")
    titles = html.find_all("h2", {'class': 'HotItem-title'})[:10]
    for title in titles:
        print(title.get_text())

if __name__ == '__main__':
    input_url = 'https://www.zhihu.com/hot'
    parse_page(get_page(input_url))
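The same elements can also be picked out with BeautifulSoup's CSS-selector interface; the snippet below is a sketch equivalent to the find_all call in parse_page:

from bs4 import BeautifulSoup

def parse_page_css(html):
    # Same output as parse_page above, using a CSS selector instead of find_all.
    soup = BeautifulSoup(html, "html.parser")
    for title in soup.select('h2.HotItem-title')[:10]:
        print(title.get_text())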
Screenshot of the result: [screenshot]
2.2 Locating elements with selenium
Selenium's element-location syntax is rather different from the requests/BeautifulSoup approach. The code example below (the test code example from 1.2.4) uses a hierarchical locator, 'div[itemprop="zhihu:question"] > a'; I find this kind of scoped selector more dependable.
With Selenium the text value is read from the .text attribute, as opposed to BeautifulSoup's .get_text() method.
from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
import time

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = Chrome(options=option)
time.sleep(10)
driver.get('https://www.zhihu.com')
time.sleep(10)
driver.delete_all_cookies()  # clear the cookies set so far
time.sleep(2)
cookie = {}  # replace with your own cookie, e.g. {'name': ..., 'value': ...}
driver.add_cookie(cookie)
driver.get('https://www.zhihu.com/')
time.sleep(5)
for i in driver.find_elements_by_css_selector('div[itemprop="zhihu:question"] > a'):
    print(i.text)
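One caveat: find_elements_by_css_selector is the old Selenium 3 spelling and was removed in Selenium 4. Under Selenium 4, the loop above would be written with a By locator instead:

from selenium.webdriver.common.by import By

# Selenium 4 equivalent of the loop above:
for i in driver.find_elements(By.CSS_SELECTOR, 'div[itemprop="zhihu:question"] > a'):
    print(i.text)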