A Simple Twitter Crawler -- Keyword Search -- 688IT编程网


1. General parameters

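The Twitter website loads its content asynchronously via AJAX. When searching for a keyword, the JSON that holds the data can be fetched directly with requests, but the URL has to be constructed first.

For example, searching for Trump:

First-page URL:

Second-page URL:

The two URLs share the following parameters: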

self.url = (
    'https://twitter.com/i/api/2/search/adaptive.json?'
    'include_profile_interstitial_type=1'
    '&include_blocking=1'
    '&include_blocked_by=1'
    '&include_followed_by=1'
    '&include_want_retweets=1'
    '&include_mute_edge=1'
    '&include_can_dm=1'
    '&include_can_media_tag=1'
    '&skip_status=1'
    '&cards_platform=Web-12'
    '&include_cards=1'
    '&include_ext_alt_text=true'
    '&include_quote_count=true'
    '&include_reply_count=1'
    '&tweet_mode=extended'
    '&include_entities=true'
    '&include_user_entities=true'
    '&include_ext_media_color=true'
    '&include_ext_media_availability=true'
    '&send_error_codes=true'
    '&simple_quoted_tweet=true'
    '&count=20'
    '&pc=1'
    '&spelling_corrections=1'
    '&ext=mediaStats%2ChighlightedLabel'
)
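The same query string can also be assembled from a dictionary, which makes individual parameters easier to read and change. The following is only a minimal sketch, not the author's code: `BASE_URL`, `COMMON_PARAMS` and `build_search_url` are hypothetical names, the full host in `BASE_URL` is assumed, and the `q` parameter carrying the keyword is an assumption, since the listing above does not show how the keyword is attached. Only a few of the common parameters are repeated here.

```python
from urllib.parse import urlencode

# Assumed full form of the search endpoint discussed in this article.
BASE_URL = 'https://twitter.com/i/api/2/search/adaptive.json'

# A subset of the common parameters listed above.
COMMON_PARAMS = {
    'include_entities': 'true',
    'tweet_mode': 'extended',
    'count': 20,
    'pc': 1,
    'spelling_corrections': 1,
    'ext': 'mediaStats,highlightedLabel',
}


def build_search_url(keyword, extra=None):
    """Return the search URL for a keyword, with optional extra parameters."""
    params = dict(COMMON_PARAMS)
    params['q'] = keyword  # assumption: the keyword travels in a q parameter
    if extra:
        params.update(extra)
    # urlencode percent-encodes values, so the comma in ext becomes %2C
    return BASE_URL + '?' + urlencode(params)


url = build_search_url('Trump')
print(url)
```

Letting `urlencode` do the escaping avoids hand-writing sequences such as `%2C`.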

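As the website shows, search results fall into five categories (see the figure below):

Taking the parameters for Top and Photos as examples:

Top: the Top tab additionally needs the query_source and cd parameters, i.e.: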

self.url = self.url + '&query_source=trend_click'

Testing shows that the cd parameter is optional.

Photos: the Photos tab additionally needs the query_source and result_filter parameters, i.e.:

self.url = self.url + '&result_filter=image' + '&query_source=typed_query'
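The per-tab additions can be kept in a small lookup table instead of separate concatenations. This is a minimal sketch under stated assumptions: `TAB_PARAMS` and `add_tab_params` are hypothetical names, the base URL in the usage line is abbreviated, and only the two tabs tested above are covered, with the values taken from the text.

```python
# Extra query parameters per result tab, as reported in the text above.
TAB_PARAMS = {
    'top': '&query_source=trend_click',
    'photos': '&result_filter=image&query_source=typed_query',
}


def add_tab_params(url, tab):
    """Append the tab-specific parameters to an already built search URL."""
    try:
        return url + TAB_PARAMS[tab.lower()]
    except KeyError:
        raise ValueError('unsupported tab: {!r}'.format(tab)) from None


# Abbreviated base URL, for illustration only.
base = 'https://twitter.com/i/api/2/search/adaptive.json?count=20'
print(add_tab_params(base, 'Photos'))
```

Other tabs would need their own entries once their parameters have been observed in the browser.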

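In the search results, the second-page URL differs from the first-page URL by an added cursor parameter. Inspecting the response of the first page, we can find the cursor value in it. The cursor therefore comes from the result of the previous request, and we can extract it with a regular expression: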

self.cursor_re = re.compile('"(scroll:[^"]*)"')
cursor = self.cursor_re.search(response.text).group(1)
url = self.url + '&cursor={}'.format(quote(cursor))
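The extraction step can be tried in isolation on a piece of text. The sketch below is illustrative only: `next_page_url` is a hypothetical helper, the sample response fragment and the base URL are made up, and returning None when no cursor is found is a design choice not taken from the original.

```python
import re
from urllib.parse import quote

# Same pattern as above: a quoted string that starts with "scroll:".
CURSOR_RE = re.compile(r'"(scroll:[^"]*)"')


def next_page_url(base_url, response_text):
    """Return the URL of the next page, or None if no cursor is present."""
    match = CURSOR_RE.search(response_text)
    if match is None:
        return None
    # quote() escapes characters such as ':' and '=' for use in a URL
    return base_url + '&cursor={}'.format(quote(match.group(1)))


# Made-up response fragment, only for illustration.
sample = '{"cursor": {"value": "scroll:thGAVUV0VFVBaA/abc=="}}'
print(next_page_url('https://example.invalid/search?count=20', sample))
```

Checking the match before calling `group(1)` avoids an AttributeError when a response carries no cursor.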

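At this point we have the complete URL.

2. Request headers

Testing the URL on its own does not return the corresponding JSON data. The request header parameters are as follows: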

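Testing shows that, without logging in, the x-guest-token and authorization parameters are required. Comparing several requests shows that both can be reused: the value of authorization is fixed, while the value of x-guest-token is the same as the gt parameter in the cookie, so we have to obtain the cookie first.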
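The relationship between the cookie and the headers can be written down as a small function. This is a sketch under assumptions: `build_headers` is a hypothetical helper, `cookies` is assumed to be a plain dict mapping cookie names to values, and the token and User-Agent values are placeholders rather than real credentials.

```python
def build_headers(cookies, bearer_token, user_agent='Mozilla/5.0'):
    """Build request headers; x-guest-token is copied from the gt cookie."""
    guest_token = cookies.get('gt')
    if not guest_token:
        raise ValueError('gt cookie missing, cannot set x-guest-token')
    return {
        'User-Agent': user_agent,
        'authorization': 'Bearer ' + bearer_token,
        'x-guest-token': guest_token,
    }


# Placeholder values, for illustration only.
headers = build_headers({'gt': '1234567890'}, 'PLACEHOLDER_TOKEN')
print(headers['x-guest-token'])
```

Failing early when the gt cookie is absent makes a missing guest token easier to diagnose than a rejected request later on.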

To obtain the cookie, I use selenium here:

import logging
from selenium import webdriver

def get_cookie(self):
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--disable-infobars")
    chrome_options.add_argument('--headless')  # run the browser headless
    driver = webdriver.Chrome(options=chrome_options)
    try:
        driver.get('https://twitter.com/explore')
        self.x_guest_token = driver.get_cookie('gt')['value']  # value of the gt cookie
    except Exception as e:
        logging.error('Failed to get the cookie, please check and retry! {}'.format(e))
        raise
    finally:
        driver.quit()  # always close the browser
    self.headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
        'authorization': 'Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA',
        'x-guest-token': self.x_guest_token
    }
    print('*'*100 + '\n')
    print(self.headers)
    print('\n' + '*'*100 + '\n')

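With the headers in place, the first request succeeds and yields a cursor. By updating the cursor parameter after each request, we can fetch all search results for the keyword.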

import json
import time
import requests
from urllib.parse import quote

def start_request(self, cursor=None):
    self.get_cookie()
    while True:
        print('Keyword {}: start fetching page {}!'.format(self.key, self.page))
        if cursor:
            url = self.url + '&cursor={}'.format(quote(cursor))
        else:
            url = self.url
        response = requests.get(url, headers=self.headers, proxies=self.proxies)
        cursor = self.cursor_re.search(response.text).group(1)  # cursor for the next page
        json_resp = json.loads(response.text)['globalObjects']
        if len(json_resp['tweets']) == 0:
            print('Keyword {}: fetching finished!'.format(self.key))
            break
        self.parse_tweet_item(json_resp)  # process the JSON data
        self.page += 1
        time.sleep(1)
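The paging logic can be exercised without any network access by injecting the function that performs the request. The following is a minimal sketch, not the author's implementation: `iter_pages`, `fetch`, `max_pages` and the fake `PAGES` data are all hypothetical, and stopping when no new cursor appears is an added safeguard.

```python
import json
import re

CURSOR_RE = re.compile(r'"(scroll:[^"]*)"')


def iter_pages(fetch, max_pages=100):
    """Yield the 'tweets' mapping of each page until an empty page appears.

    fetch(cursor) must return the response body as text; cursor is None
    for the first page.
    """
    cursor = None
    for _ in range(max_pages):
        text = fetch(cursor)
        tweets = json.loads(text)['globalObjects']['tweets']
        if not tweets:
            break  # an empty page marks the end of the results
        yield tweets
        match = CURSOR_RE.search(text)
        if match is None or match.group(1) == cursor:
            break  # no new cursor, so there is nothing more to request
        cursor = match.group(1)


# Fake two-page backend, for illustration only.
PAGES = {
    None: '{"globalObjects": {"tweets": {"1": {"full_text": "a"}}}, "c": "scroll:p2"}',
    'scroll:p2': '{"globalObjects": {"tweets": {}}, "c": "scroll:p3"}',
}

pages = list(iter_pages(lambda cursor: PAGES[cursor]))
print(len(pages))
```

In real use, `fetch` would wrap the requests call with the headers and cursor handling shown above, and `max_pages` bounds the loop in case the end condition is never met.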
