Python Web Scraping: Scraping Video Bullet Comments (Danmaku) from the iQIYI Website, Using "iPartment 5" as an Example

Original

Ambitioner_c

2020-07-08 02:30

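This article follows the author's own working process as its narrative thread, so that readers can apply the same approach to similar cases.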

I. Identifying What to Scrape

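1. First, open episode 1 of "iPartment 5". Once the ad has finished, open the developer console (F12) and use Ctrl+Shift+C to select the bullet-comment element in the HTML: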


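2. This part of the page is updated dynamically, and the status attribute of the <div> tag indicates the current playback state of the element. Simply requesting the page and parsing the returned HTML will therefore not yield the bullet comments, so we look instead at the traffic between the page and the server. The natural tool is the console's Network tab: open Network, then press Ctrl+R to reload the page: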

3. Among the many requests listed, try filtering for bullet. Several files match:

4. Look at the Request URL of each one:

Request URL:
https://cmts.iqiyi.com/bullet/40/00/11298454000_300_5.z?rn=0.41964474120625517&business=danmu&is_iqiyi=true&is_video_page=true&tvid=11298454000&albumid=212447801&categoryid=2&qypid=01010021010000000000
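To see which parts of this request vary, it can help to split the URL into its path and query parameters. The sketch below uses only the standard library and the URL shown above; the helper name `describe_request` is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# The Request URL captured in the Network tab (copied from above)
REQUEST_URL = ('https://cmts.iqiyi.com/bullet/40/00/11298454000_300_5.z'
               '?rn=0.41964474120625517&business=danmu&is_iqiyi=true'
               '&is_video_page=true&tvid=11298454000&albumid=212447801'
               '&categoryid=2&qypid=01010021010000000000')


def describe_request(url):
    """Return the URL path and a flat dict of its query parameters."""
    parts = urlsplit(url)
    # parse_qs maps each key to a list of values; keep the first value only
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.path, query


path, query = describe_request(REQUEST_URL)
print(path)
print(query['tvid'], query['albumid'])
```

The path already contains the video ID (`tvid`), which suggests the query string may not be needed to locate the file.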

Breaking the URL down, the essential part of the request is:

# https://cmts.iqiyi.com/bullet/tv_id[-4:-2]/tv_id[-2:]/tv_id_300_x.z
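# https://cmts.iqiyi.com/bullet/<4th- and 3rd-to-last digits of the video ID>/<last two digits of the video ID>/<video ID>_300_<sequence number>.z
# One bullet-comment file is requested from the server for every 5 minutes (300 seconds) of video, so the number of files per episode is the video duration in seconds divided by 300, rounded up. In practice the code can handle this more simply.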
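The URL pattern and the file-count rule can be sketched as two small functions. This is a minimal illustration, not the site's documented API: the function names are made up here, and the 45-minute duration in the usage lines is a hypothetical value.

```python
import math

BASE_URL = 'https://cmts.iqiyi.com/bullet/'
SEGMENT_SECONDS = 300  # one bullet-comment file per 5 minutes of video


def bullet_url(tv_id, index):
    """Build the URL of the index-th bullet-comment file for a video ID."""
    tv_id = str(tv_id)
    return (f'{BASE_URL}{tv_id[-4:-2]}/{tv_id[-2:]}/'
            f'{tv_id}_{SEGMENT_SECONDS}_{index}.z')


def segment_count(duration_seconds):
    """Number of files for a video: duration divided by 300, rounded up."""
    return math.ceil(duration_seconds / SEGMENT_SECONDS)


# Reproduces the Request URL observed in the Network tab (minus the query string)
print(bullet_url('11298454000', 5))
# A hypothetical 45-minute episode (2700 seconds)
print(segment_count(2700))
```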

The file has a .z extension, so it must be decompressed before its actual content can be read.
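As a rough sketch of that decompression step, the standard `zlib` module can be tried on the downloaded bytes. The sample payload below is compressed locally so the snippet runs offline; it is only a stand-in for a real .z file, and the helper name `decompress_bullet` is hypothetical.

```python
import zlib


def decompress_bullet(payload):
    """Return the decompressed text, or None if the bytes are not zlib data."""
    try:
        return zlib.decompress(payload).decode('utf-8')
    except zlib.error:
        return None


# Stand-in for a downloaded .z file: text compressed locally for the demo
sample = zlib.compress('<danmu>hello</danmu>'.encode('utf-8'))

print(decompress_bullet(sample))
print(decompress_bullet(b'not zlib data'))
```

Returning `None` instead of raising makes it easy for a caller to detect a response that is not a valid compressed file.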

II. Writing the Code

1. Get the video IDs (tv_id) of all 36 episodes of "iPartment 5":

tv_id.py

import requests
import json


def get_tv_id(aid):
    # List of tv_id values
    tv_id_list = []

    # Two pages of 30 entries each cover all 36 episodes
    for page in range(1, 3):
        url = 'https://pcw-api.iqiyi.com/albums/album/avlistinfo?aid=' \
              + aid + '&page='\
              + str(page) + '&size=30'

        # Request the episode list for this page
        res = requests.get(url).text

        res_json = json.loads(res)

        # Episode list (the key is spelled 'epsodelist' in the response)
        episode_list = res_json['data']['epsodelist']
        for episode in episode_list:
            tv_id_list.append(episode['tvId'])

    return tv_id_list


if __name__ == '__main__':
    # Album (show) ID
    my_aid = '212447801'
    my_tv_id_list = get_tv_id(my_aid)
    print(my_tv_id_list)
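The extraction step in get_tv_id can be tried offline against a small sample. In this sketch only the key layout (`data` → `epsodelist` → `tvId`) comes from the code above; the response body, the second ID, and the function name `extract_tv_ids` are hypothetical.

```python
import json

# Hypothetical response body; only the key layout mirrors get_tv_id
SAMPLE_RESPONSE = '''
{"data": {"epsodelist": [{"tvId": 11298454000}, {"tvId": 11298454100}]}}
'''


def extract_tv_ids(body):
    """Pull every tvId out of an episode-list response, as strings."""
    payload = json.loads(body)
    # Fall back to empty containers so a missing key yields an empty list
    episodes = (payload.get('data') or {}).get('epsodelist') or []
    return [str(episode['tvId']) for episode in episodes if 'tvId' in episode]


print(extract_tv_ids(SAMPLE_RESPONSE))
```

Converting each ID to a string up front is a convenience, since the URL construction later slices the ID by character position.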

2. Scrape and parse the bullet comments:

bullet.py

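import os
import zlib
import requests

import tv_id as tv_id_module


def get_bullet(tv_id):
    # Try sequence numbers 1 to 16 and stop at the first file that cannot be decompressed
    for page in range(1, 17):
        # https://cmts.iqiyi.com/bullet/tv_id[-4:-2]/tv_id[-2:]/tv_id_300_x.z
        url = 'https://cmts.iqiyi.com/bullet/'\
              + tv_id[-4:-2] + '/'\
              + tv_id[-2:] + '/'\
              + tv_id + '_300_'\
              + str(page) + '.z'
        print(url)

        # Request the compressed bullet-comment file
        res = requests.get(url).content
        try:
            xml = zlib.decompress(res).decode('utf-8')
        except zlib.error:
            # Not valid zlib data, presumably because the sequence number
            # is past the last file of this episode
            return

        # Output path; create the directory first so the write does not
        # fail on a missing folder
        os.makedirs('../data', exist_ok=True)
        path = '../data/' + tv_id + '_300_' + str(page) + '.xml'
        with open(path, 'w', encoding='utf-8') as f:
            f.write(xml)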


if __name__ == '__main__':
    # Album (show) ID
    my_aid = '212447801'
    # List of tv_id values
    my_tv_id_list = tv_id_module.get_tv_id(my_aid)
    for i in my_tv_id_list:
        get_bullet(str(i))
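Once the XML files are saved, the comment text still has to be pulled out of them. The article does not show the file layout, so the sketch below is illustrative only: the sample document, the `content` tag name, and the function name `extract_comments` are all assumptions, and a saved file should be inspected to confirm the real structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real files may use different tag names
SAMPLE_XML = """<danmu>
  <bulletInfo><content>first comment</content></bulletInfo>
  <bulletInfo><content>second comment</content></bulletInfo>
  <bulletInfo><content>   </content></bulletInfo>
</danmu>"""


def extract_comments(xml_text, tag='content'):
    """Collect the non-empty text of every element named `tag`."""
    # Parse from bytes so an XML encoding declaration, if present, is honored
    root = ET.fromstring(xml_text.encode('utf-8'))
    comments = []
    for element in root.iter(tag):
        text = (element.text or '').strip()
        if text:
            comments.append(text)
    return comments


print(extract_comments(SAMPLE_XML))
```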

The code has been uploaded to my GitHub: https://github.com/Ambitioner-c/iqiyi_bullet.git
