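Preface

This post shows how to scrape danmaku (bullet comments) and regular comments from iQiyi with Python. No more preamble.

Let's get started.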

Development tools

Python version: 3.6.4

Modules used:

requests;

re;

pandas;

lxml;

random;

plus a few other modules from the Python standard library.

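Environment setup

Install Python and add it to your PATH, then install the third-party modules (requests, pandas, lxml) with pip. re and random ship with Python and need no installation.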

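Approach

This post uses the film Godzilla vs. Kong (《哥斯拉大战金刚》) as the example and walks through scraping both the danmaku and the comments of an iQiyi video.

Target URL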

https://www.iqiyi.com/v_19rr0m845o.html

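Scraping the danmaku

As usual, you find iQiyi's danmaku by capturing traffic in the browser developer tools. The capture shows a Brotli-compressed .br file that you can download by clicking it; its content is binary data. One such packet is loaded for every minute of playback.

That gives the URLs below. Two consecutive URLs differ only in the incrementing number; the 60 means a new packet is fetched for every 60 seconds of video.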

https://cmts.iqiyi.com/bullet/64/00/1078946400_60_1_b2105043.br\
https://cmts.iqiyi.com/bullet/64/00/1078946400_60_2_b2105043.br

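The .br file can be decompressed with the brotli library, but in practice this is hard to get working, mainly because of encoding problems that are difficult to resolve. Decoding the result directly as UTF-8 raises the following error: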

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x91 in position 52: invalid start byte

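Passing ignore as the error handler keeps the Chinese text readable, but the HTML markup around it comes out garbled, so extracting the data is still difficult: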

decode("utf-8", "ignore")
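To make the difference between the two decode calls concrete, here is a minimal sketch. The byte string is made up for illustration (valid UTF-8 text with one stray 0x91 byte) and is not real iQiyi data. It shows that ignore does not repair anything: it silently drops the bytes it cannot decode, which is why the surrounding markup can end up damaged.

```python
# Made-up bytes: valid UTF-8 Chinese text with one stray 0x91 byte in the middle.
raw = "弹幕".encode("utf-8") + b"\x91" + "内容".encode("utf-8")

try:
    raw.decode("utf-8")          # strict decoding, the default
    strict_ok = True
except UnicodeDecodeError as exc:
    strict_ok = False
    print(exc.reason)            # invalid start byte

# errors="ignore" skips the undecodable byte instead of raising
lenient = raw.decode("utf-8", "ignore")
print(lenient)
```

The Chinese text survives because its bytes are valid UTF-8; whatever the invalid bytes originally meant is lost.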

Rewriting the URL into the following form returns a .z compressed file instead:

https://cmts.iqiyi.com/bullet/64/00/1078946400_300_1.z

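This rewrite works because it is iQiyi's older danmaku endpoint, which has not been removed or changed yet and still works at the time of writing. In this URL, 1078946400 is the video id. The 300 reflects that the old endpoint loaded a new danmaku packet every 5 minutes, that is, every 300 seconds; Godzilla vs. Kong runs 112.59 minutes, which divided by 5 and rounded up gives 23 packets. The 1 is the page number, and 64 is the 7th and 8th digits of the id.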
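The URL rule described above can be sketched as a small helper. danmaku_urls is a hypothetical name, not part of any library. The post only explains where the 64 comes from; treating the following 00 as the last two digits of the id is an assumption that happens to match this one example.

```python
import math

VIDEO_ID = "1078946400"      # video id from the URL in this post
RUNTIME_MINUTES = 112.59     # runtime quoted in this post
SEGMENT_MINUTES = 5          # the old endpoint serves one packet per 300 s


def danmaku_urls(video_id, runtime_minutes):
    """Hypothetical helper: list every .z packet URL for one video."""
    pages = math.ceil(runtime_minutes / SEGMENT_MINUTES)
    first_dir = video_id[6:8]   # 7th and 8th digits, as described in the post
    second_dir = video_id[-2:]  # ASSUMPTION: the last two digits of the id
    base = f"https://cmts.iqiyi.com/bullet/{first_dir}/{second_dir}"
    return [f"{base}/{video_id}_300_{page}.z" for page in range(1, pages + 1)]


urls = danmaku_urls(VIDEO_ID, RUNTIME_MINUTES)
print(len(urls))
print(urls[0])
print(urls[-1])
```

For this film the helper yields 23 URLs, numbered 1 through 23.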

Implementation

    import requests
    import pandas as pd
    from lxml import etree
    from zlib import decompress  # zlib decompression for the .z files

    rows = []
    for page in range(1, 24):  # 23 packets: 112.59 minutes / 5, rounded up
        url = f'https://cmts.iqiyi.com/bullet/64/00/1078946400_300_{page}.z'
        bulletold = requests.get(url).content  # compressed binary payload
        data = decompress(bulletold)  # decompressed bytes
        with open(f'{page}.html', 'w', encoding='utf-8') as f:  # keep a local copy of each packet
            f.write(data.decode('utf-8'))

        # Parse with lxml's HTML parser, which lower-cases the tag names
        html = etree.HTML(data)
        bullets = html.xpath('/html/body/danmu/data/entry/list/bulletinfo')
        for bullet in bullets:
            contentid = ''.join(bullet.xpath('./contentid/text()'))
            content = ''.join(bullet.xpath('./content/text()'))
            likeCount = ''.join(bullet.xpath('./likecount/text()'))
            print(contentid, content, likeCount)
            rows.append({'contentid': contentid, 'content': content, 'likeCount': likeCount})
    # Build the DataFrame once instead of concatenating inside the loop
    df = pd.DataFrame(rows, columns=['contentid', 'content', 'likeCount'])
    df.to_csv('哥斯拉大战金刚.csv', encoding='utf-8', index=False)
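If you want to check the decompress-and-parse step without hitting the network, a sketch along these lines may help. It uses only the standard library (xml.etree.ElementTree instead of lxml), and the sample XML is fabricated to match the element path used in the XPath above; the real payload may use different tag casing or extra fields, so treat the tag names as assumptions.

```python
import zlib
import xml.etree.ElementTree as ET

# Fabricated sample shaped like the XPath used in this post; not real iQiyi data.
SAMPLE_XML = (
    "<danmu><data><entry><list>"
    "<bulletinfo><contentid>1</contentid><content>first</content>"
    "<likecount>3</likecount></bulletinfo>"
    "<bulletinfo><contentid>2</contentid><content>second</content>"
    "<likecount>0</likecount></bulletinfo>"
    "</list></entry></data></danmu>"
)


def parse_packet(compressed):
    """Decompress one zlib packet and return a list of row dicts."""
    root = ET.fromstring(zlib.decompress(compressed))
    rows = []
    for bullet in root.iter("bulletinfo"):
        rows.append({
            "contentid": bullet.findtext("contentid", default=""),
            "content": bullet.findtext("content", default=""),
            "likeCount": bullet.findtext("likecount", default=""),
        })
    return rows


# zlib.compress stands in for the bytes that requests.get(url).content would return
packet = zlib.compress(SAMPLE_XML.encode("utf-8"))
rows = parse_packet(packet)
print(rows)
```

Unlike lxml's HTML parser, ElementTree is case-sensitive, so the tag names passed to iter and findtext must match the payload exactly.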

Result

The script saves the danmaku to 哥斯拉大战金刚.csv with the columns contentid, content and likeCount.

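Scraping the comments

The comments sit at the bottom of the page and are also loaded dynamically, so again you capture traffic in the browser developer tools. Each time you scroll further down the page, a new packet containing comment data is loaded.

The exact URLs captured: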

https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&authcookie=null&business_type=17&channel_id=1&content_id=1078946400&hot_size=10&last_id=&page=&page_size=10&types=hot,time&callback=jsonp_1629025964363_15405\
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&authcookie=null&business_type=17&channel_id=1&content_id=1078946400&hot_size=0&last_id=7963601726142521&page=&page_size=20&types=time&callback=jsonp_1629026041287_28685\
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&authcookie=null&business_type=17&channel_id=1&content_id=1078946400&hot_size=0&last_id=4933019153543021&page=&page_size=20&types=time&callback=jsonp_1629026394325_81937

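The first URL loads the featured comments; from the second URL onward, the full comment list is loaded. After removing the unnecessary parameters, the URLs become: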

https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id=&page_size=10\
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id=7963601726142521&page_size=20\
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id=4933019153543021&page_size=20
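The trimming step can be reproduced with the standard library's urllib.parse. This is a sketch: trim_url is a hypothetical helper, and the list of parameters to keep is simply read off the trimmed URLs above rather than taken from any iQiyi documentation.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# First captured URL from this post (the featured-comments request).
CAPTURED = (
    "https://sns-comment.iqiyi.com/v3/comment/get_comments.action"
    "?agent_type=118&agent_version=9.11.5&authcookie=null&business_type=17"
    "&channel_id=1&content_id=1078946400&hot_size=10&last_id=&page="
    "&page_size=10&types=hot,time&callback=jsonp_1629025964363_15405"
)

# Parameters that survive in the trimmed URLs of this post.
KEEP = ("agent_type", "agent_version", "business_type",
        "content_id", "last_id", "page_size")


def trim_url(url, keep=KEEP):
    """Hypothetical helper: drop every query parameter not listed in keep."""
    parts = urlsplit(url)
    # keep_blank_values=True preserves empty parameters such as last_id=
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key in keep]
    return parts._replace(query=urlencode(kept)).geturl()


trimmed = trim_url(CAPTURED)
print(trimmed)
```

Dropping callback also matters for parsing: without it the endpoint is requested as plain JSON rather than a JSONP wrapper, which is what the .json() call in the implementation relies on.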

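The URLs differ in the last_id and page_size parameters. page_size is 10 in the first URL and fixed at 20 from the second URL onward. last_id is empty in the first URL and changes with every request after that. From what I can tell, last_id is the id of the last comment returned by the previous URL (my guess was that it is a user id, but the code reads it from each comment's id field). The response is in JSON format.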
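The last_id rule amounts to cursor-style pagination, which can be sketched without touching the network. comments_url and crawl are hypothetical helpers, the pages are fabricated, and stopping on an empty comments list is an assumption about how the endpoint signals the end.

```python
from urllib.parse import urlencode

BASE = "https://sns-comment.iqiyi.com/v3/comment/get_comments.action"


def comments_url(last_id=""):
    """Build one page URL; page_size is 10 for the first page and 20 afterwards."""
    params = {
        "agent_type": 118,
        "agent_version": "9.11.5",
        "business_type": 17,
        "content_id": 1078946400,
        "last_id": last_id,
        "page_size": 20 if last_id else 10,
    }
    return f"{BASE}?{urlencode(params)}"


def crawl(fetch_json):
    """Follow last_id from page to page until a page comes back empty."""
    collected, last_id = [], ""
    while True:
        page = fetch_json(comments_url(last_id))
        comments = page["data"]["comments"]
        if not comments:  # ASSUMPTION: an empty page means no more comments
            break
        collected.extend(comments)
        last_id = comments[-1]["id"]  # id of the last comment on this page
    return collected


# Fabricated three-page response keyed by URL, standing in for the real endpoint.
FAKE_PAGES = {
    comments_url(): {"data": {"comments": [{"id": "a1"}, {"id": "a2"}]}},
    comments_url("a2"): {"data": {"comments": [{"id": "b1"}]}},
    comments_url("b1"): {"data": {"comments": []}},
}

result = crawl(FAKE_PAGES.__getitem__)
print([c["id"] for c in result])
```

Against the live site you would pass something like `lambda url: requests.get(url, headers=headers).json()` as fetch_json and add a delay between requests.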

Implementation

    import requests
    import pandas as pd
    import time
    import random


    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    base = 'https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400'
    rows = []
    last_id = None  # id of the last comment on the previous page
    try:
        while True:
            if last_id is None:
                url = f'{base}&page_size=10'  # first page
            else:
                # later pages continue from the last id of the previous page
                url = f'{base}&last_id={last_id}&page_size=20'
            print(url)
            res = requests.get(url, headers=headers).json()
            comments = res['data']['comments']
            if not comments:  # assumed end condition: an empty page means no more comments
                break
            for item in comments:
                rows.append({
                    'ids': item['id'],
                    'uname': item['userInfo']['uname'],
                    'addTime': item['addTime'],
                    # .get avoids a KeyError when a comment has no 'content' key;
                    # the fallback string means "does not exist"
                    'content': item.get('content', '不存在'),
                })
            last_id = comments[-1]['id']
            time.sleep(random.uniform(2, 3))  # pause 2 to 3 seconds between requests
    except Exception as e:
        print(e)  # on any error, report it and still save what was collected
    df = pd.DataFrame(rows, columns=['ids', 'uname', 'addTime', 'content'])
    df.to_csv('哥斯拉大战金刚_评论.csv', encoding='utf-8', index=False)

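Result

The script saves the comments to 哥斯拉大战金刚_评论.csv with the columns ids, uname, addTime and content.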
