Preface
This post shows how to scrape the comments on a Weibo post with Python.
Let's get started.
Development tools
Python version: 3.6.4
Modules used:
requests;
re;
pandas;
lxml;
random;
plus a few other modules that ship with Python.
Environment setup
Install Python and add it to your PATH, then install the required third-party modules with pip.
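Approach
This post uses the Weibo trending topic 《霍尊手寫道歉信》 (Huo Zun's handwritten apology letter) as the example to show how to scrape Weibo comments.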
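Scraping the comments
Page URL
https://m.weibo.cn/detail/4669040301182509
Page analysis
Weibo comments are loaded dynamically. Open the browser's developer tools and scroll down the page; the requests that carry the comment data will appear in the network log.
Getting the real URLs
https://m.weibo.cn/comments/hotflow?id=4669040301182509&mid=4669040301182509&max_id_type=0
https://m.weibo.cn/comments/hotflow?id=4669040301182509&mid=4669040301182509&max_id=3698934781006193&max_id_type=0
The difference between the two URLs is clear: the first has no max_id parameter, and max_id only appears from the second request onward. Its value is the max_id returned in the previous response.
One thing to watch is the max_id_type parameter: it can change as well, so it also has to be read from each response.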
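The pagination rule described above can be sketched as a small helper. This is only an illustration based on the two example URLs, not a documented Weibo API: the name build_comment_url and its parameters are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = 'https://m.weibo.cn/comments/hotflow'


def build_comment_url(post_id, max_id=None, max_id_type=0):
    """Build the comment URL for one page (hypothetical helper).

    The first page carries no max_id; later pages pass the max_id and
    max_id_type taken from the previous response.
    """
    params = [('id', post_id), ('mid', post_id)]
    if max_id is not None:
        params.append(('max_id', max_id))
    params.append(('max_id_type', max_id_type))
    return BASE_URL + '?' + urlencode(params)


# First page, then a follow-up page using values from the previous response.
print(build_comment_url('4669040301182509'))
print(build_comment_url('4669040301182509', max_id=3698934781006193, max_id_type=0))
```

Keeping the URL construction in one place makes it harder to forget to pass max_id_type along with max_id.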
Code implementation
import re
import time
import random

import requests
import pandas as pd

DETAIL_URL = 'https://m.weibo.cn/detail/4669040301182509'
COMMENT_URL = 'https://m.weibo.cn/comments/hotflow?id=4669040301182509&mid=4669040301182509'
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
COOKIE_NAMES = ['WEIBOCN_FROM', '_T_WM', 'MLOGIN', 'M_WEIBOCN_PARAMS', 'XSRF-TOKEN']

rows = []  # one dict per comment
try:
    page = 1
    while True:
        # Weibo tends to block the account after a few dozen pages;
        # refreshing the cookies on every page keeps the crawler alive longer.
        response = requests.get(DETAIL_URL, headers={'User-Agent': USER_AGENT})
        cookies = response.cookies.get_dict()  # look cookies up by name, not by position
        cookie_str = '; '.join(f"{name}={cookies.get(name, '')}" for name in COOKIE_NAMES)
        headers = {
            # SUB must be filled in with the value from a logged-in session
            'cookie': 'SUB=; ' + cookie_str,
            'referer': DETAIL_URL,
            'User-Agent': USER_AGENT,
        }
        if page == 1:
            url = f'{COMMENT_URL}&max_id_type=0'
        else:
            url = f'{COMMENT_URL}&max_id={max_id}&max_id_type={max_id_type}'
        data = requests.get(url=url, headers=headers).json()['data']
        # max_id and max_id_type are passed on to the next page's URL
        max_id = data['max_id']
        max_id_type = data['max_id_type']
        for i in data['data']:
            text = re.sub(r'<[^>]*>', '', i['text'])  # comment text with HTML tags stripped
            print(text)
            rows.append({
                'screen_name': i['user']['screen_name'],
                'i_d': i['user']['id'],
                'like_count': i['like_count'],  # number of likes
                'created_at': i['created_at'],  # time posted
                'text': text,
            })
        if max_id == 0:
            # Assumption: a max_id of 0 means there are no more pages
            break
        time.sleep(random.uniform(2, 7))  # random pause between pages
        page += 1
except Exception as e:
    # Any request or parsing error ends the crawl; rows collected so far are still saved
    print(e)
df = pd.DataFrame(rows)
df.to_csv('某博.csv', encoding='utf-8', mode='a+', index=False)
print(df.shape)
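The field extraction in the script can be checked without touching the network by pulling it into a function. The sketch below is an illustration only: parse_comments is a hypothetical helper, and the sample payload is made up to mirror the fields the script reads (max_id, max_id_type, user.screen_name and so on); it is not real Weibo output.

```python
import re


def parse_comments(payload):
    """Return (rows, max_id, max_id_type) from one comment response (hypothetical helper)."""
    data = payload['data']
    rows = []
    for item in data['data']:
        rows.append({
            'screen_name': item['user']['screen_name'],
            'i_d': item['user']['id'],
            'like_count': item['like_count'],
            'created_at': item['created_at'],
            # Strip HTML tags such as emoji <span>/<img> markup from the text
            'text': re.sub(r'<[^>]*>', '', item['text']),
        })
    return rows, data['max_id'], data['max_id_type']


# Made-up payload with the same shape as the fields used by the script.
sample = {
    'data': {
        'max_id': 123,
        'max_id_type': 1,
        'data': [{
            'user': {'screen_name': 'example_user', 'id': 1},
            'like_count': 5,
            'created_at': 'example time',
            'text': 'Nice <span class="url-icon"><img alt="[ok]" src="x.png"></span>post',
        }],
    }
}

rows, max_id, max_id_type = parse_comments(sample)
print(rows[0]['text'])
print(max_id, max_id_type)
```

Returning the cursor values together with the rows keeps the crawl loop down to three steps: request, parse, move to the next page.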
Results