Preface

This post shows how to scrape bullet comments (danmaku) and regular comments from iQIYI videos with Python.

Let's get started.

Development tools

Python version: 3.6.4

Modules used:

requests;

re;

pandas;

lxml;

random;

plus a few other modules that ship with Python.

Environment setup

Install Python and add it to your PATH, then install the required modules with pip.

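Approach

This post uses the film Godzilla vs. Kong (《哥斯拉大戰金剛》) as the example and walks through scraping both the bullet comments and the regular comments of an iQIYI video.

Target URL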

https://www.iqiyi.com/v_19rr0m845o.html

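Scraping the bullet comments

As usual, start by capturing traffic in the browser's developer tools. This yields a .br compressed file that can be downloaded directly by clicking it; its content is binary data. One data packet is loaded for every minute of playback.

Capturing gives the URLs below. The two differ only in the incrementing number; the 60 means a new data packet is loaded every 60 seconds of video.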

https://cmts.iqiyi.com/bullet/64/00/1078946400_60_1_b2105043.br
https://cmts.iqiyi.com/bullet/64/00/1078946400_60_2_b2105043.br

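The .br files can be decompressed with the brotli library, but in practice this is hard to get working, mainly because of encoding problems. Decoding the result directly as utf-8 raises the following error: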

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x91 in position 52: invalid start byte

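Adding ignore to the decode call keeps the Chinese text readable, but the HTML markup comes out garbled, so extracting the data is still difficult: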

decode("utf-8", "ignore")
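To see why "ignore" hides the problem rather than solving it, here is a minimal sketch. The byte string is made up for illustration (it is not real iQIYI data); it simply mixes valid UTF-8 text with a stray 0x91 byte like the one named in the error above.

```python
# Hypothetical sample: valid UTF-8 text with a stray 0x91 byte in the middle.
raw = "彈幕".encode("utf-8") + b"\x91" + "<content>".encode("utf-8")

try:
    raw.decode("utf-8")  # strict decoding (the default) rejects the stray byte
    strict_failed = False
except UnicodeDecodeError as err:
    strict_failed = True
    print(err)

# errors="ignore" silently drops every byte it cannot decode.
lenient = raw.decode("utf-8", "ignore")
print(lenient)
```

Because the undecodable bytes are simply discarded, whatever structure they carried is lost, which is consistent with the text surviving while the surrounding markup does not.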

Instead, rewrite the URL into the following form, which returns a .z compressed file:

https://cmts.iqiyi.com/bullet/64/00/1078946400_300_1.z

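This works because it is iQIYI's older bullet-comment endpoint, which has not been removed or changed and still works at the time of writing. In this URL, 1078946400 is the video id. The 300 reflects the old behaviour of loading a new bullet-comment packet every 5 minutes, i.e. 300 seconds; Godzilla vs. Kong runs 112.59 minutes, and dividing by 5 and rounding up gives 23 packets. The 1 is the page number. The 64 is the 7th and 8th digits of the id.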
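The URL rules described above can be sketched as a small helper. The function name build_bullet_urls is hypothetical, and treating the second path segment ("00") as the 9th and 10th digits of the id is an assumption; the text only confirms where the 64 comes from.

```python
import math


def build_bullet_urls(video_id, duration_minutes, interval_seconds=300):
    """Build the legacy .z bullet-comment URLs for one video (hypothetical helper)."""
    # Per the article: the first path segment is the 7th and 8th digits of the id.
    seg1 = video_id[6:8]
    # Assumption: the second segment is the 9th and 10th digits ("00" for this id).
    seg2 = video_id[8:10]
    # One packet per interval, rounded up so the final partial interval is covered.
    pages = math.ceil(duration_minutes * 60 / interval_seconds)
    return [
        f"https://cmts.iqiyi.com/bullet/{seg1}/{seg2}/{video_id}_{interval_seconds}_{page}.z"
        for page in range(1, pages + 1)
    ]


urls = build_bullet_urls("1078946400", 112.59)
print(len(urls))
print(urls[0])
print(urls[-1])
```

For this film the helper produces 23 URLs, numbered 1 through 23.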

Code implementation

import requests
import pandas as pd
from lxml import etree
from zlib import decompress  # the .z files are zlib-compressed

rows = []
# 112.59 minutes at 5 minutes per packet gives 23 packets, so pages run from 1 to 23.
for page in range(1, 24):
    url = f'https://cmts.iqiyi.com/bullet/64/00/1078946400_300_{page}.z'
    compressed = requests.get(url).content  # raw compressed bytes
    xml_bytes = decompress(compressed)  # decompress to the original markup
    with open(f'{page}.html', 'wb') as f:  # keep a local copy of each page
        f.write(xml_bytes)

    # etree.HTML uses the lenient HTML parser, which lowercases tag names,
    # hence the lowercase names in the XPath expressions below.
    html = etree.HTML(xml_bytes)
    bullets = html.xpath('/html/body/danmu/data/entry/list/bulletinfo')
    for bullet in bullets:
        contentid = ''.join(bullet.xpath('./contentid/text()'))
        content = ''.join(bullet.xpath('./content/text()'))
        likeCount = ''.join(bullet.xpath('./likecount/text()'))
        print(contentid, content, likeCount)
        rows.append({'contentid': contentid, 'content': content, 'likeCount': likeCount})

# Build the DataFrame once instead of concatenating inside the loop.
df = pd.DataFrame(rows, columns=['contentid', 'content', 'likeCount'])
df.to_csv('哥斯拉大戰金剛.csv', encoding='utf-8', index=False)
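The decompress-then-parse step can also be tried without touching the network. The sketch below is only an illustration: the sample payload is invented, its tag names mirror the lowercase XPath used in the script above (real responses may differ in casing or contain more fields), and it swaps lxml for the standard library's xml.etree.ElementTree so that it runs with no extra installs.

```python
import zlib
import xml.etree.ElementTree as ET

# Hypothetical sample payload standing in for one decompressed .z file.
sample_xml = (
    "<danmu><data><entry><list>"
    "<bulletinfo><contentid>1</contentid><content>first</content>"
    "<likecount>3</likecount></bulletinfo>"
    "<bulletinfo><contentid>2</contentid><content>second</content>"
    "<likecount>0</likecount></bulletinfo>"
    "</list></entry></data></danmu>"
)
compressed = zlib.compress(sample_xml.encode("utf-8"))


def parse_bullets(compressed_bytes):
    """Decompress one packet and return one dict per bullet comment."""
    xml_bytes = zlib.decompress(compressed_bytes)
    root = ET.fromstring(xml_bytes)
    rows = []
    for info in root.iter("bulletinfo"):
        rows.append({
            "contentid": info.findtext("contentid", default=""),
            "content": info.findtext("content", default=""),
            "likeCount": info.findtext("likecount", default=""),
        })
    return rows


rows = parse_bullets(compressed)
print(rows)
```

Keeping the parsing in a function that takes bytes makes it easy to check against a saved packet before running the full download loop.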


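Scraping the comments

The comments sit below the video on the page and are also loaded dynamically, so again capture traffic in the browser's developer tools. Each time the page is scrolled down, one data packet containing comment data is loaded.

The exact URLs captured: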

https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&authcookie=null&business_type=17&channel_id=1&content_id=1078946400&hot_size=10&last_id=&page=&page_size=10&types=hot,time&callback=jsonp_1629025964363_15405
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&authcookie=null&business_type=17&channel_id=1&content_id=1078946400&hot_size=0&last_id=7963601726142521&page=&page_size=20&types=time&callback=jsonp_1629026041287_28685
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&authcookie=null&business_type=17&channel_id=1&content_id=1078946400&hot_size=0&last_id=4933019153543021&page=&page_size=20&types=time&callback=jsonp_1629026394325_81937

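The first URL loads the featured (hot) comments; from the second URL onward, the full comment list is loaded. Removing the unnecessary parameters gives the following URLs: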

https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id=&page_size=10
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id=7963601726142521&page_size=20
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id=4933019153543021&page_size=20

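The URLs differ in the last_id and page_size parameters. page_size is 10 in the first URL and fixed at 20 from the second URL onward. last_id is empty in the first URL and changes with every request after that. From what I can tell, it is the id attached to the last comment returned by the previous URL (I took it to be a user id, though it may simply be the comment's own id). The response is in JSON format.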
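The parameter rules just described can be captured in a small URL builder. This is a sketch under the assumptions stated in the text (an empty last_id with page_size 10 for the first request, page_size 20 afterwards); the helper name build_comment_url is hypothetical.

```python
from urllib.parse import urlencode

BASE = "https://sns-comment.iqiyi.com/v3/comment/get_comments.action"


def build_comment_url(content_id, last_id=""):
    """Build the trimmed comment URL for one request (hypothetical helper)."""
    params = {
        "agent_type": 118,
        "agent_version": "9.11.5",
        "business_type": 17,
        "content_id": content_id,
        "last_id": last_id,  # empty on the first request
        "page_size": 20 if last_id else 10,  # 10 for the first page, then 20
    }
    return f"{BASE}?{urlencode(params)}"


first = build_comment_url("1078946400")
second = build_comment_url("1078946400", "7963601726142521")
print(first)
print(second)
```

The two printed URLs match the first two trimmed URLs listed above.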

Code implementation

import requests
import pandas as pd
import time
import random


headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
base_url = 'https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400'
rows = []
last_id = ''  # empty for the first request
try:
    while True:
        if not last_id:
            url = f'{base_url}&page_size=10'
        else:
            # last_id is the id of the last comment on the previous page
            url = f'{base_url}&last_id={last_id}&page_size=20'
        print(url)
        res = requests.get(url, headers=headers).json()
        comments = res['data']['comments']
        if not comments:  # an empty page means there is nothing left to fetch
            break
        for comment in comments:
            rows.append({
                'ids': comment['id'],
                'uname': comment['userInfo']['uname'],
                'addTime': comment['addTime'],
                # .get avoids a KeyError when a comment has no text; '不存在' ("missing") is the fallback
                'content': comment.get('content', '不存在'),
            })
        last_id = comments[-1]['id']
        time.sleep(random.uniform(2, 3))  # pause 2 to 3 seconds between requests
except Exception as e:
    print(e)
# Save whatever was collected, even if the loop ended on an error.
df = pd.DataFrame(rows, columns=['ids', 'uname', 'addTime', 'content'])
df.to_csv('哥斯拉大戰金剛_評論.csv', encoding='utf-8', index=False)
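The last_id pagination can be separated from the network code and checked on its own. The sketch below is an illustration, not the script above: fetch_page is a hypothetical callable that takes last_id and returns that page's list of comment dicts, and the fake pages are invented test data.

```python
def collect_comments(fetch_page, max_pages=100):
    """Follow last_id from page to page until a page comes back empty."""
    collected = []
    last_id = ""  # the first request carries an empty last_id
    for _ in range(max_pages):  # hard cap so a misbehaving API cannot loop forever
        comments = fetch_page(last_id)
        if not comments:
            break
        collected.extend(comments)
        last_id = str(comments[-1]["id"])  # id of the last comment on this page
    return collected


# Invented data standing in for three API responses, keyed by the last_id sent.
fake_pages = {
    "": [{"id": 11, "content": "a"}, {"id": 12, "content": "b"}],
    "12": [{"id": 13, "content": "c"}],
    "13": [],
}


def fake_fetch(last_id):
    return fake_pages[last_id]


result = collect_comments(fake_fetch)
print([c["id"] for c in result])
```

Passing the fetcher in as an argument means the same loop can be driven by a real requests call or by canned data, and the explicit empty-page check gives the loop a defined way to finish.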

