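There is no need to scrape the page source; go straight to the endpoint that serves the data.
Douyu live streaming is a typical Ajax-driven page. For pages like this the simple, blunt approach works best: look for the data endpoint under the XHR filter in the browser's developer tools.
Sending the request: requests. Parsing the response: json()
Online JSON validator: https://www.bejson.com/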
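If you would rather not paste a response into a website, the same check can be done locally. The following is a minimal sketch using only Python's standard `json` module; the helper name `check_json` is made up for this illustration and is not part of the original script.

```python
import json


def check_json(text):
    """Return (True, pretty_text) if text is valid JSON, else (False, error_message)."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as exc:
        # The exception message includes the line and column of the problem.
        return False, str(exc)
    # ensure_ascii=False keeps Chinese room names readable instead of \uXXXX escapes.
    return True, json.dumps(parsed, ensure_ascii=False, indent=2)


ok, result = check_json('{"data": {"rl": []}}')
print(ok)
print(result)
```

Copy the response body from the developer tools into a string (or a file) and pass it to `check_json` to see whether it parses and how it is nested.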
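On the first page, nothing in the request list stands out, so keep looking.
On the second page, an XHR request named 2 shows up. A bold guess: it is probably tied to the page number. Check one more page to be sure.
On the third page there is one again, so these requests are clearly worth a closer look. Take a look at the response.
Sure enough, it is JSON, which makes things easy: build the request headers, fetch the JSON, and then clean up the data.
The code is as follows: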
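To make the "clean up the data" step concrete, here is a small offline sketch. The nesting (`data` → `rl`) and the field names `rid`, `rn`, `nn` are the ones the script in this post reads; the sample values and the helper name `extract_rooms` are invented for illustration, and the real response contains many more fields.

```python
# Hand-written sample that mimics only the part of the response used in this post.
sample = {
    "data": {
        "rl": [
            {"rid": 1, "rn": "example room A", "nn": "streamer A"},
            {"rid": 2, "rn": "example room B", "nn": "streamer B"},
        ]
    }
}


def extract_rooms(payload):
    """Return a list of (room id, room name, streamer) tuples from one page of JSON."""
    rooms = []
    # .get() with defaults avoids a KeyError if the response has an unexpected shape.
    for item in payload.get("data", {}).get("rl", []):
        rooms.append((item.get("rid"), item.get("rn"), item.get("nn")))
    return rooms


for room in extract_rooms(sample):
    print(room)
```

Keeping the extraction in its own function means it can be tried against a saved response without hitting the site each time.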
import requests
base_url = "https://www.douyu.com/gapi/rkc/directory/2_1/{}"
headers = {
    # Do not copy the authority / method / scheme entries shown in DevTools:
    # they are HTTP/2 pseudo-headers, not regular request headers.
    "accept": "application/json, text/plain, */*",
    # "br" is left out because requests only decodes Brotli when a Brotli package is installed.
    "accept-encoding": "gzip, deflate",
    "accept-language": "zh-CN,zh;q=0.9",
    "cookie": "dy_did=99d9bec8e3161267ca6f1b2700091501; acf_did=99d9bec8e3161267ca6f1b2700091501; smidV2=201910091127005e69063c81f439757b5c6853e98eb85600415c32cf59babd0; Hm_lvt_e99aee90ec1b2106afe7ec3b199020a7=1583281439; PHPSESSID=pifc2v49pv7eh3pfqh68vdmrp6; acf_auth=c805VIqQqC4NURXP%2BsXkVVLLs71Z3tGdFmlmwKvDfJddlPpBpHsZCb%2BAinbPuBGFqbJVR3zwn6rtV9neXmKxQjGRrSK212Jf4UlJNS5TrfPY6WwlpuI5I14; dy_auth=9679Wnn3NsJb2QR5Af1AKQpGbSYw6kgSwcujMSyG3AxQ3PSOPIINFiu%2FO7usyWfaQEGgY8xUgDHUVuTM0kSDrg4nj9Bg2Ib1AERZgYFzofeYDUjGrez85lo; wan_auth37wan=2d3ba7e8c7b7%2F2QURm%2FaQBYqJqHh6FwGQ26YRXP0y5n%2FjrR0gvtyc7%2FfBM%2FfhL%2F53HJ6mUBypKwmSw1Rk5ajw0Fx%2BpMyNOEG8bIiilruQGrYqED4kIA; acf_uid=329673281; acf_username=329673281; acf_nickname=%E7%94%A8%E6%88%B761411317; acf_own_room=0; acf_groupid=1; acf_phonestatus=1; acf_avatar=https%3A%2F%2Fapic.douyucdn.cn%2Fupload%2Favatar%2Fdefault%2F03_; acf_ct=0; acf_ltkid=69931249; acf_biz=1; acf_stk=391ed8ca5549845e; acf_ccn=b08c364a0d5c5aae33f1c5361ce1cfb6; Hm_lpvt_e99aee90ec1b2106afe7ec3b199020a7=1583281831",
    "referer": "https://www.douyu.com/g_LOL",
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "same-origin",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36",
    "x-requested-with": "XMLHttpRequest"
}
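page = 6  # number of pages to fetch; this changes over time, so set it to match the site

if __name__ == "__main__":
    for i in range(page):
        url = base_url.format(i + 1)  # page numbers start at 1
        response = requests.get(url, headers=headers, timeout=10)  # the response body is JSON
        datas = response.json()["data"]["rl"]
        for data in datas:  # simply print each room to the console
            room = data["rid"]  # room ID
            name = data["rn"]  # room name
            zhubo = data["nn"]  # streamer
            print(room, name, zhubo)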
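The loop above only prints each room. If you want to keep the results, one possible next step is to write them to CSV. This is a sketch rather than part of the original script: the helper name `write_rooms` and the sample row are hypothetical, and it writes to an in-memory buffer so it can be run without touching the network or the disk.

```python
import csv
import io


def write_rooms(rows, fileobj):
    """Write (rid, rn, nn) tuples to an open text file object as CSV."""
    writer = csv.writer(fileobj)
    writer.writerow(["rid", "rn", "nn"])  # header row uses the JSON field names
    writer.writerows(rows)


# Demo with an in-memory buffer and a made-up row.
buffer = io.StringIO()
write_rooms([(1, "example room", "example streamer")], buffer)
print(buffer.getvalue())
```

To write a real file instead, pass `open("rooms.csv", "w", newline="", encoding="utf-8")` as `fileobj`; `newline=""` is what the `csv` module documentation recommends for files it writes.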
The output looks like this: