Fun with Python: bilibili danmaku scraping + word cloud generation

Original article

2019-08-07 10:39

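First, the end result:

Given the URL of a bilibili video, the script generates a word cloud from that video's danmaku (the comments that scroll across the video).

Video URL: https://www.bilibili.com/video/av47301018?from=search&seid=5397670291950820315

Resulting word cloud: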

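Overall approach:

Fetch the HTML of the video page
Extract the cid with a regular expression and assemble the danmaku URL from it
Fetch the XML data behind that danmaku URL
Use BeautifulSoup to pull out the individual danmaku
Generate a word cloud from the danmaku data (WordCloud)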

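Installing the required libraries (requests is used to fetch the pages, and lxml is needed by BeautifulSoup's "xml" parser):

pip install requests beautifulsoup4 lxml wordcloud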

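1. Analyze the page data and extract the cid

Open the video URL in Chrome.

View the page source (right-click on the page to bring up the menu).

Search for "cid".

The first "cid" match in the source is the cid of this video, which is the cid code we need.

This step is fairly simple: work out the cid code, then assemble the danmaku download URL from it.

The rule for fetching danmaku by cid is: https://comment.bilibili.com/{}.xml , with {} replaced by the cid code.

The code for this step is short: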

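    def get_cid_url(self, org_url):
        """
        Fetch the HTML of the video page, extract the cid code with a
        regular expression, and return the danmaku URL.
        :param org_url: video URL
        :return: danmaku URL
        """
        html_data = self.get_url_data(org_url)
        # The first "cid":<digits>, match in the page source belongs to this video
        match = re.search(r'"cid":(\d+),', html_data)
        if match is None:
            raise ValueError("no cid found in page: {}".format(org_url))
        return "https://comment.bilibili.com/{}.xml".format(match.group(1))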
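
To try the pattern without touching the network, here is a minimal sketch that runs it against a made-up fragment of page source. The sample string and the helper name build_danmaku_url are assumptions for illustration only; a real page is far longer and its layout may differ.

```python
import re

# Hypothetical, heavily shortened stand-in for the source of a video page.
SAMPLE_HTML = '<script>{"aid":47301018,"cid":82840963,"page":1}</script>'


def build_danmaku_url(html):
    """Return the danmaku URL for the first cid found in html."""
    match = re.search(r'"cid":(\d+),', html)
    if match is None:
        raise ValueError("no cid found in the page source")
    return "https://comment.bilibili.com/{}.xml".format(match.group(1))


print(build_danmaku_url(SAMPLE_HTML))
# prints https://comment.bilibili.com/82840963.xml
```

Raising an error when nothing matches makes a changed page layout show up as a clear message rather than an IndexError.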

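2. Extract the danmaku from the cid data

First, take a danmaku URL and look at what it returns when opened:

URL: https://comment.bilibili.com/82840963.xml

This part is fairly simple: the d tags hold the danmaku data we need. The logic:

Take the cid_url as input;
Fetch the XML content behind it;
Use BeautifulSoup to find all d tags;
Extract the text inside each d tag and join everything into one comma-separated string.

The code for this step: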

    def get_all_tags(self, cid_url):
        """
        Fetch the danmaku XML file and use BeautifulSoup to collect
        the text of every danmaku.
        :param cid_url: danmaku URL
        :return: all danmaku texts joined into one comma-separated string
        """
        xml_data = self.get_url_data(cid_url)
        soup = BeautifulSoup(xml_data, features="xml")
        # get_text() returns "" for an empty d tag, where .string would be None
        return ",".join(item.get_text() for item in soup.find_all("d"))
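
The same extraction can be tried offline. The sketch below swaps BeautifulSoup for the standard library's xml.etree.ElementTree so that it runs with no extra packages; it is a stand-in, not the method used in this article. The sample XML is made up and simplified, and join_danmaku is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified sample; real files carry more elements and attributes.
SAMPLE_XML = """<i>
  <chatid>82840963</chatid>
  <d p="placeholder">first comment</d>
  <d p="placeholder"></d>
  <d p="placeholder">second comment</d>
</i>"""


def join_danmaku(xml_text):
    """Collect the text of every d element into one comma-separated string."""
    root = ET.fromstring(xml_text)
    # elem.text is None for an empty element, so fall back to "" and skip blanks.
    texts = [(elem.text or "").strip() for elem in root.iter("d")]
    return ",".join(text for text in texts if text)


print(join_danmaku(SAMPLE_XML))
# prints first comment,second comment
```

Skipping empty elements keeps stray commas out of the string that is later handed to the word cloud.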

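3. Generate the word cloud from the danmaku data

With the danmaku data in hand, the last step is generating the word cloud.

This step uses the WordCloud library. Creating a WordCloud instance looks like this: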

wordcloud = WordCloud(font_path="STXINGKA.TTF",
                      background_color="white",
                      width=800,
                      height=400,
                      max_words=100)

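Parameters:
font_path: path to a font that supports Chinese. Here the font file sits in the project root directory. For a Chinese word cloud this setting is required, otherwise the text comes out garbled;
background_color: background color of the word cloud
width, height: width and height of the word cloud
max_words: maximum number of words to display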

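One more parameter worth mentioning is mask. It lets you use an image as the background shape, so that the generated word cloud is laid out following your image.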
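
As a rough illustration of mask, the sketch below builds a circular mask in code instead of loading a picture. In the wordcloud library, white (255) entries of the mask are left empty and words are drawn on the rest, and when a mask is given its shape is used in place of width and height. The helper names circle_mask_rows and create_masked_cloud are hypothetical, and the font path is simply the one used elsewhere in this article.

```python
def circle_mask_rows(size):
    """Build a size x size grid: 0 inside a centred circle, 255 (white) outside."""
    centre = (size - 1) / 2.0
    radius = size / 2.0
    rows = []
    for y in range(size):
        row = []
        for x in range(size):
            inside = (x - centre) ** 2 + (y - centre) ** 2 <= radius ** 2
            row.append(0 if inside else 255)
        rows.append(row)
    return rows


def create_masked_cloud(tags, font_path="STXINGKA.TTF", size=400):
    """Draw the word cloud inside a circle (hypothetical helper, not in the article)."""
    # Imported here so the mask helper above works without these packages.
    import numpy as np
    from wordcloud import WordCloud

    mask = np.array(circle_mask_rows(size), dtype=np.uint8)
    wordcloud = WordCloud(font_path=font_path,
                          background_color="white",
                          max_words=100,
                          mask=mask)
    return wordcloud.generate(tags).to_image()
```

Calling create_masked_cloud(tags).show() with the danmaku string would display the circular cloud. To use a picture instead, the mask array would come from an image file whose background is pure white.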

Finally, create the WordCloud, pass in the danmaku data obtained in the previous two steps, and you get a fun word cloud.

    def create_could_pic(self, tags):
        wordcloud = WordCloud(font_path="STXINGKA.TTF",
                              background_color="white",
                              width=800,
                              height=400,
                              max_words=100)
        wordcloud.generate(tags)
        image = wordcloud.to_image()

        image.show()

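Summary

The scraping approach here is fairly simple. The key is making flexible use of existing libraries:

re for regular-expression matching
requests for fetching page data
BeautifulSoup for extracting the data
WordCloud for generating the word cloud

Possible extensions:

Extract danmaku from videos on other sites
Generate a word cloud shaped by a background image

Source code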

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

__author__ = 'Weijian Xuan'

import re
from bs4 import BeautifulSoup
import requests
from wordcloud import WordCloud

class Bilibili(object):

    @staticmethod
    def get_url_data(url):
        # Identify as a mobile Chrome browser when requesting the page
        headers = {
            "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Mobile Safari/537.36"
        }
        response = requests.get(url=url, headers=headers, timeout=10)
        response.raise_for_status()  # fail early on HTTP errors
        return response.content.decode()

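    def get_cid_url(self, org_url):
        """
        Fetch the HTML of the video page, extract the cid code with a
        regular expression, and return the danmaku URL.
        :param org_url: video URL
        :return: danmaku URL
        """
        html_data = self.get_url_data(org_url)
        # The first "cid":<digits>, match in the page source belongs to this video
        match = re.search(r'"cid":(\d+),', html_data)
        if match is None:
            raise ValueError("no cid found in page: {}".format(org_url))
        return "https://comment.bilibili.com/{}.xml".format(match.group(1))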

    def get_all_tags(self, cid_url):
        """
        Fetch the danmaku XML file and use BeautifulSoup to collect
        the text of every danmaku.
        :param cid_url: danmaku URL
        :return: all danmaku texts joined into one comma-separated string
        """
        xml_data = self.get_url_data(cid_url)
        soup = BeautifulSoup(xml_data, features="xml")
        # get_text() returns "" for an empty d tag, where .string would be None
        return ",".join(item.get_text() for item in soup.find_all("d"))

    def create_could_pic(self, tags):
        wordcloud = WordCloud(font_path="STXINGKA.TTF",
                              background_color="white",
                              width=800,
                              height=400,
                              max_words=100)
        wordcloud.generate(tags)
        image = wordcloud.to_image()

        image.show()

    def run(self, url):
        tags = self.get_all_tags(self.get_cid_url(url))
        self.create_could_pic(tags)



if __name__ == "__main__":
    bibi = Bilibili()
    bibi.run("https://www.bilibili.com/video/av47301018?from=search&seid=5323472189377090516")
