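This is a small example of using Python to automatically fetch the danmaku (bullet comments) of a Bilibili video and turn them into a word cloud.
1. Approach
- Use requests to fetch the Bilibili page content
- Use BeautifulSoup (BS) to parse the page and locate the danmaku
- Save the danmaku to a local txt file
- Read the txt file and generate a word cloud with wordcloud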
2. Import the libraries
# -*- coding=utf-8 -*-
import requests
from bs4 import BeautifulSoup
import re
import jieba
import wordcloud
3. Look up the danmaku id (cid) from the video's BV id
def cid_from_av(av):
    # Despite the name, the argument is the part of a BV id after the "BV" prefix
    url = 'https://www.bilibili.com/video/BV' + str(av)
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36'}
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    html = response.text
    # Use try in case the id does not point to a video
    try:
        soup = BeautifulSoup(html, 'lxml')
        # Video title
        title = soup.select('meta[name="title"]')[0]['content']
        # Uploader
        author = soup.select('meta[name="author"]')[0]['content']
        # cid that identifies the video's danmaku file
        danmu_id = re.findall(r'cid=(\d+)&', html)[0]
        # print(title, author)
        return {
            'status': 'ok',
            'title': title,
            'author': author,
            'cid': danmu_id}
    except Exception:
        print('The video is gone!')
        return {'status': 'no'}
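This part uses requests.get to fetch the page and BeautifulSoup to read the title and uploader from its meta tags; the cid that identifies the danmaku file is pulled out of the raw HTML with a regular expression. There are plenty of similar examples online if you search for them.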
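The regular-expression step can be tried offline. Here is a minimal sketch, not the code used in this post: extract_cid is a hypothetical helper, and sample_html is a made-up fragment standing in for a real page. It returns None when no cid is present, so the caller can handle a missing video without a broad try/except.

```python
import re


def extract_cid(html):
    """Return the first cid found in the page source, or None if absent."""
    # Same pattern as in cid_from_av: the digits between 'cid=' and '&'
    match = re.search(r'cid=(\d+)&', html)
    return match.group(1) if match else None


# Made-up fragment for illustration only; a real page is much larger
sample_html = '<script>var u = "player?cid=123456&aid=789";</script>'

print(extract_cid(sample_html))
print(extract_cid('<html>no video here</html>'))
```

Note that the cid comes back as a string, which is fine here because it is only concatenated into a URL.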
4. Save the danmaku to a txt file
def get_danmu(cid, fileName):
    url = 'http://comment.bilibili.com/' + str(cid) + '.xml'
    req = requests.get(url)
    html = req.content
    html_doc = str(html, 'utf-8')  # decode the response bytes as UTF-8
    # Parse the XML; each danmaku is the text of a <d> element
    soup = BeautifulSoup(html_doc, "lxml")
    results = soup.find_all('d')
    contents = [x.text for x in results]
    # print(contents)
    saveData(contents, fileName)
    return contents


def saveData(items, fileName):
    # 'a' appends, so re-running adds the same danmaku to the file again
    with open(fileName, 'a', encoding='utf-8') as f:
        for item in items:
            f.write(item + '\n')
Here contents is a list of danmaku strings. It is saved to a txt file so that the word cloud can be generated from it later.
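If you would rather not depend on lxml for this step, the same extraction can be sketched with the standard library's xml.etree.ElementTree. This is an alternative to the BeautifulSoup call above, not what the post uses. parse_danmu and save_lines are hypothetical names, and sample_xml is made up: only the <d> elements matter, and the wrapper element and p values are placeholders.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Made-up sample; the wrapper element and the p values are placeholders
sample_xml = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<i>'
    '<d p="1.0,1,25,16777215">hello</d>'
    '<d p="2.5,1,25,16777215">前方高能</d>'
    '</i>'
)


def parse_danmu(xml_bytes):
    """Return the text of every <d> element in document order."""
    root = ET.fromstring(xml_bytes)
    return [d.text or '' for d in root.iter('d')]


def save_lines(items, path):
    """Write one item per line, overwriting any previous contents."""
    with open(path, 'w', encoding='utf-8') as f:
        for item in items:
            f.write(item + '\n')


# Bytes are passed in, mirroring req.content in get_danmu
contents = parse_danmu(sample_xml.encode('utf-8'))
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
save_lines(contents, path)
print(contents)
```

Opening the file with 'w' instead of 'a' is a deliberate difference: running the script twice then leaves one copy of each danmaku rather than two.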
5. Count danmaku word frequencies (optional)
def danmu_cut(fileName):
    word_frequency = dict()
    # Stop words to skip (empty here; add entries as needed)
    stop_word = []
    with open(fileName, encoding='utf-8') as f:
        txt = f.read()
    # Segment the text into words
    words = jieba.cut(txt)
    # Count word frequencies
    for word in words:
        if word not in stop_word:
            word_frequency[word] = word_frequency.get(word, 0) + 1
    return word_frequency
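The counting loop above can also be written with collections.Counter, which makes it easy to list the most frequent words. The sketch below is only an illustration: count_words is a hypothetical helper, and the token list is hand-written to stand in for the output of jieba.cut(txt). It also drops whitespace-only tokens, since jieba.cut yields the newlines between danmaku as tokens too.

```python
from collections import Counter


def count_words(tokens, stop_words=()):
    """Count tokens, skipping stop words and whitespace-only tokens."""
    stop = set(stop_words)
    return Counter(t for t in tokens if t.strip() and t not in stop)


# Hand-written tokens standing in for the output of jieba.cut(txt)
tokens = ['哈哈', '的', '哈哈', '\n', '好看', '的', '哈哈']
freq = count_words(tokens, stop_words=['的'])
print(freq.most_common(2))
```

With real data the call would be count_words(jieba.cut(txt), stop_words=stop_word).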
6. Generate the word cloud
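def dro_wc(cid, fileName):
    with open(fileName, encoding='utf-8') as f:
        txt = f.read()
    # font_path must point to a font with Chinese glyphs;
    # msyh.ttc is Microsoft YaHei, which ships with Windows
    w = wordcloud.WordCloud(width=1000,
                            height=700,
                            background_color='white',
                            font_path='msyh.ttc')
    w.generate(txt)
    # The image is written to <cid>.png in the current directory
    w.to_file(str(cid) + '.png')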
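One thing to be aware of: as far as I know, WordCloud.generate splits its input with a regular expression, so an unsegmented Chinese sentence tends to be treated as one long "word". A common workaround is to segment the text with jieba first and join the pieces with spaces. The sketch below assumes that approach. segment_for_wordcloud is a hypothetical helper, and toy_cut is a stand-in tokenizer used only so the example runs without jieba installed.

```python
def segment_for_wordcloud(txt, cut=None):
    """Return txt with its words separated by single spaces."""
    if cut is None:
        # Imported lazily so the helper can be tried without jieba installed
        import jieba
        cut = jieba.cut
    # Drop whitespace-only tokens such as the newlines between danmaku
    return ' '.join(tok for tok in cut(txt) if tok.strip())


def toy_cut(text):
    """Toy tokenizer for this demo only: split into 2-character chunks."""
    return [text[i:i + 2] for i in range(0, len(text), 2)]


print(segment_for_wordcloud('前方高能', cut=toy_cut))
```

In dro_wc this would mean calling w.generate(segment_for_wordcloud(txt)) instead of w.generate(txt).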
7. Program entry point
if __name__ == '__main__':
    bv = '1NZ4y1j7nw'
    fileName = bv + '.txt'
    avInfo = cid_from_av(bv)
    # Only continue if the page was parsed successfully
    if avInfo['status'] == 'ok':
        cid = avInfo['cid']
        danmus = get_danmu(cid, fileName)
        # word_frequency = danmu_cut(fileName)
        dro_wc(bv, fileName)  # writes <bv>.png
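The entry point expects the id without its "BV" prefix. If you want to paste a full id or a video URL instead, a small helper can strip it down first. This is only a sketch: normalize_bv is a hypothetical name, and the accepted input shapes (a bare alphanumeric id, or a URL containing /video/BV...) are my assumptions.

```python
import re


def normalize_bv(value):
    """Return the BV id without its 'BV' prefix, or None if none is found."""
    value = value.strip()
    if re.fullmatch(r'[0-9A-Za-z]+', value):
        # Bare id, with or without the prefix
        return value[2:] if value[:2].upper() == 'BV' else value
    # Otherwise look for the id inside a /video/BV... URL
    match = re.search(r'/video/[Bb][Vv]([0-9A-Za-z]+)', value)
    return match.group(1) if match else None


print(normalize_bv('BV1NZ4y1j7nw'))
print(normalize_bv('https://www.bilibili.com/video/BV1NZ4y1j7nw?from=search'))
```

The result can then be assigned to bv in the entry point.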
8. Demo
For example, set bv = '1NZ4y1j7nw'.
The video:
The resulting word cloud:
Here is another video: 1tt4y127vq
Its word cloud: