Automatically fetching Bilibili danmu with Python and generating a word cloud

This is a small example of using Python to automatically fetch Bilibili danmu (bullet comments) and generate a word cloud from them.

1. Approach

  • Fetch the Bilibili page content with requests
  • Parse the page with BeautifulSoup and extract the danmu
  • Save the danmu to a local txt file
  • Read the txt file and generate the word cloud with wordcloud

2. Importing libraries

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import re
import jieba
import wordcloud

3. Getting the danmu cid from a Bilibili BV number

def cid_from_av(av):
    url = 'https://www.bilibili.com/video/BV' + str(av)
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36'}
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    html = response.text

    # Guard against BV numbers that have no video behind them
    try:
        soup = BeautifulSoup(html, 'lxml')
        # Video title
        title = soup.select('meta[name="title"]')[0]['content']
        # Uploader
        author = soup.select('meta[name="author"]')[0]['content']
        # cid that identifies the danmu file
        danmu_id = re.findall(r'cid=(\d+)&', html)[0]
        # print(title, author)
        return {
            'status': 'ok',
            'title': title,
            'author': author,
            'cid': danmu_id}
    except (IndexError, KeyError):
        print('Video not found!')
    return {'status': 'no'}

This part mainly uses requests.get to fetch the page, then BeautifulSoup and a regex to pull out the video metadata and the cid; there are plenty of similar examples online if you search around.
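The cid extraction can be sketched on its own with just the standard library. The HTML fragment below is invented for illustration; the real page embeds the cid inside its inline player state in a similar `cid=<digits>&` form:

```python
import re

# Made-up fragment in the shape of a Bilibili video page; the real page
# carries the cid inside inline JavaScript state as "cid=<digits>&".
sample_html = '"embedPlayer":"//player.bilibili.com/player.html?aid=123&cid=245519119&page=1"'

matches = re.findall(r'cid=(\d+)&', sample_html)
cid = matches[0] if matches else None
print(cid)  # → 245519119
```

`findall` returns every capture, so taking element 0 mirrors the `[0]` indexing in `cid_from_av`, and the `if matches` guard avoids the IndexError that the function's try/except otherwise has to absorb.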

4. Saving the danmu to a txt file

def get_danmu(cid, fileName):
    url = 'http://comment.bilibili.com/' + str(cid) + '.xml'
    req = requests.get(url)
    html = req.content
    html_doc = str(html, 'utf-8')  # decode the bytes as utf-8

    # Parse the danmu XML
    soup = BeautifulSoup(html_doc, "lxml")
    results = soup.find_all('d')
    contents = [x.text for x in results]
    # print(contents)
    saveData(contents, fileName)
    return contents

def saveData(items, fileName):
    with open(fileName, 'a', encoding='utf-8') as f:
        for item in items:
            f.write(item + '\n')

The contents above is a list of danmu strings; it is saved to a txt file to make the word-cloud generation later on easier.
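The file at comment.bilibili.com/&lt;cid&gt;.xml is plain XML with one `<d>` element per comment, so it can also be parsed with the standard library alone. A minimal sketch, using a hand-written fragment in the same shape (the `p` attribute values here are invented):

```python
import xml.etree.ElementTree as ET

# Hand-written fragment mimicking comment.bilibili.com/<cid>.xml:
# each <d> element is one danmu; its p attribute packs timing/style metadata.
sample_xml = '''<i>
  <d p="2.603,1,25,16777215,1584268892,0,abc123,12345678">前排!</d>
  <d p="5.100,1,25,16777215,1584268901,0,def456,12345679">哈哈哈</d>
</i>'''

root = ET.fromstring(sample_xml)
contents = [d.text for d in root.iter('d')]
print(contents)  # → ['前排!', '哈哈哈']
```

This is equivalent to the `soup.find_all('d')` step above, without depending on bs4 and lxml.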

5. Counting danmu word frequencies (optional)

def danmu_cut(fileName):
    word_frequency = dict()
    # Stop words (empty here; add any words you want filtered out)
    stop_word = []

    with open(fileName, encoding='utf-8') as f:
        txt = f.read()
    # Segment the text
    words = jieba.cut(txt)
    # Count word frequencies
    for word in words:
        if word not in stop_word:
            word_frequency[word] = word_frequency.get(word, 0) + 1
    return word_frequency
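The counting loop in danmu_cut boils down to a filtered frequency count, which collections.Counter expresses directly. A sketch with a toy token list standing in for the output of jieba.cut:

```python
from collections import Counter

# Toy tokens standing in for jieba.cut() output; the stop-word filter
# drops uninformative tokens before counting, as in danmu_cut above.
tokens = ['哈哈', '哈哈', '前排', '的', '哈哈', '前排', '的']
stop_words = {'的'}

freq = Counter(t for t in tokens if t not in stop_words)
print(freq.most_common(2))  # → [('哈哈', 3), ('前排', 2)]
```

Using a set for the stop words also makes the `not in` membership test O(1) instead of scanning a list on every token.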

6. Generating the word cloud

def dro_wc(cid, fileName):
    with open(fileName, encoding='utf-8') as f:
        txt = f.read()
    w = wordcloud.WordCloud(width=1000,
                            height=700,
                            background_color='white',
                            font_path='msyh.ttc')  # a CJK-capable font is required for Chinese text
    w.generate(txt)
    w.to_file(str(cid) + '.png')

7. Program entry point

if __name__ == '__main__':
    bv = '1NZ4y1j7nw'
    fileName = bv + '.txt'
    avInfo = cid_from_av(bv)
    if avInfo['status'] == 'ok':
        cid = avInfo['cid']
        danmus = get_danmu(cid, fileName)
        # word_frequency = danmu_cut(fileName)
        dro_wc(bv, fileName)

8. Demo

For example, for the video with bv = '1NZ4y1j7nw', the generated word cloud looks like this:

Another video, 1tt4y127vq, gives this word cloud:
