Scraping and Visualizing the Danmu of 想見你 (Someday or One Day)

A while back I happened to scroll past a Moments post from someone I pay attention to, saying something like "even if you need help chasing a girl, don't ask 李子維 (Li Ziwei)." As a programmer (read: hopeless simp) I was completely lost. I barely watch dramas, so I searched Baidu and found it was about 想見你 (Someday or One Day). I went to watch a bit on Tencent Video; it turned out not to be my kind of show, but I still wanted to do something with it, so let's analyze its danmu.

I really need to kick the procrastination habit: I should have written this post two days ago and dragged it out... and now when I go back to find that Moments post, posts are only visible for three days, so no screenshot of it.

Scraping the Danmu

Since I'm not a VIP, step one was opening an episode and sitting through the 45-second ad...
Then I found the URL of the danmu request:
[Screenshot: the danmu request URL in the browser's network panel]

https://mfm.video.qq.com/danmu?otype=json&callback=jQuery1910014051953985803944_1579443369825&target_id=4576819405%26vid%3Di0033qa01p1&session_key=23873%2C84%2C1579443370&timestamp=75&_=1579443369830

After opening a few more episodes and collecting more danmu links, I noticed that the parts that actually change are target_id and timestamp. Stripping parameters one by one, I eventually arrived at a URL that only needs target_id (which embeds the vid) and timestamp.

# The final simplified URL (the probing that got here, including the
# 'no data' timestamp=2385 check, is in the log further below):
# https://mfm.video.qq.com/danmu?otype=json&timestamp={}&target_id={}%26vid%3D{}&count=400&second_count=5
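Before building the full crawler, a quick sanity check on this URL helps. This is a minimal sketch; the target_id and vid values are the episode-1 ones from the log below:

import requests
import json

# Fetch one ~30-second window of episode 1's danmu via the simplified URL
base = 'https://mfm.video.qq.com/danmu?otype=json&timestamp={}&target_id={}%26vid%3D{}&count=400&second_count=5'
resp = requests.get(base.format(45, '4576819405', 'i0033qa01p1'))
data = json.loads(resp.text, strict=False)  # strict=False: danmu text may embed control characters
comments = data.get('comments') or []
print(len(comments))                        # how many danmu landed in this window
if comments:
    print(comments[0]['content'])           # peek at the first one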

Below is part of the log I kept while poking around:

# # Episode 1
# 'https://mfm.video.qq.com/danmu?otype=json&callback=jQuery1910975216511090891_1579087666001&target_id=4576819405%26vid%3Di0033qa01p1&session_key=19564%2C70%2C1579087668&timestamp=135&_=1579087666006'
# 'https://mfm.video.qq.com/danmu?otype=json&callback=jQuery1910975216511090891_1579087666001&target_id=4576819405%26vid%3Di0033qa01p1&session_key=19564%2C70%2C1579087668&timestamp=165&_=1579087666007'
# 'https://mfm.video.qq.com/danmu?otype=json&callback=jQuery1910975216511090891_1579087666001&target_id=4576819405%26vid%3Di0033qa01p1&session_key=19564%2C70%2C1579087668&timestamp=195&_=1579087666008'
# 'https://mfm.video.qq.com/danmu?otype=json&callback=jQuery1910975216511090891_1579087666001&target_id=4576819405%26vid%3Di0033qa01p1&session_key=19564%2C70%2C1579087668&timestamp=225&_=1579087666009'
# 'https://mfm.video.qq.com/danmu?otype=json&callback=jQuery1910975216511090891_1579087666001&target_id=4576819405%26vid%3Di0033qa01p1&session_key=19564%2C70%2C1579087668&timestamp=255&_=1579087666010'
# 'https://mfm.video.qq.com/danmu?otype=json&callback=jQuery1910975216511090891_1579087666001&target_id=4576819405%26vid%3Di0033qa01p1&session_key=19572%2C70%2C1579088455&timestamp=2115&_=1579087666018'
#
# # Episode 2
# 'https://mfm.video.qq.com/danmu?otype=json&callback=jQuery1910975216511090891_1579087665999&target_id=4576819403%26vid%3Dw0033mb4upm&session_key=17967%2C94%2C1579088496&timestamp=105&_=1579087666034'
# 'https://mfm.video.qq.com/danmu?otype=json&callback=jQuery1910975216511090891_1579087665999&target_id=4576819403%26vid%3Dw0033mb4upm&session_key=17967%2C94%2C1579088496&timestamp=135&_=1579087666035'
# '''callback: jQuery1910975216511090891_1579087665999
# target_id: 4576819403&vid=w0033mb4upm
# session_key: 17967,94,1579088496'''
#
#
# '''Request URL: https://tunnel.video.qq.com/fcgi/danmu_read_count?ddwTargetId=4576819404%26vid%3Du0033tu6jy5&ddwUin=0&dwGetTotal=1&wOnlyTotalCount=0&strSessionKey=&dwGetPubCount=1&raw=1&vappid=29188582&vsecret=37ae5f4003c9a2332e566d8c53bf32b0d4ddfa4ac6717cd1'''
# '''ddwTargetId: 4576819404&vid=u0033tu6jy5
# ddwUin: 0
# dwGetTotal: 1
# wOnlyTotalCount: 0
# strSessionKey:
# dwGetPubCount: 1
# raw: 1
# vappid: 29188582
# vsecret: 37ae5f4003c9a2332e566d8c53bf32b0d4ddfa4ac6717cd1'''
#
#
# '''ddwTargetId: 4576819404
# ddwUin: 0
# dwUpCount: 1
# ddwUpUin: 0
# dwTotalCount: 16924
# stLastComment: {ddwTargetId: 0, ddwUin: 0, dwIsFriend: 0, dwIsOp: 0, dwIsSelf: 0, dwTimePoint: 0, dwUpCount: 0,…}
# ddwTargetId: 0
# ddwUin: 0
# dwIsFriend: 0
# dwIsOp: 0
# dwIsSelf: 0
# dwTimePoint: 0
# dwUpCount: 0
# ddwPostTime: 0
# dwHlwLevel: 0
# dwRichType: 0
# dwDanmuContentType: 0
# dwTimeInterval: 59
# strSessionKey: "59,16924,1579088736"
# dwMaxUpNum: 133
# dwPubCount: 13190'''
#
# 'https://mfm.video.qq.com/danmu?otype=json&timestamp=45&target_id=4576819404%26vid%3Du0033tu6jy5&count=80'
#
# 'https://mfm.video.qq.com/danmu?otype=json&callback=jQuery19109959296720329007_1579088733741&target_id=4576819404%26vid%3Du0033tu6jy5&session_key=0%2C0%2C0&timestamp=15&_=1579088733743'
# 'https://mfm.video.qq.com/danmu?otype=json&callback=jQuery19109959296720329007_1579088733741&target_id=4576819404%26vid%3Du0033tu6jy5&_=1579088733743'
#
# 'https://mfm.video.qq.com/danmu?otype=json&timestamp=2385&target_id=4576819404%26vid%3Du0033tu6jy5&count=80'  # no data
# # last page: 1995 = 133*15
# '2385'
#
#
# 'https://access.video.qq.com/danmu_manage/regist?vappid=97767206&vsecret=c0bdcbae120669fff425d0ef853674614aa659c605a613a4&raw=1'
# 'https://access.video.qq.com/danmu_manage/regist?vappid=97767206&vsecret=c0bdcbae120669fff425d0ef853674614aa659c605a613a4&raw=1'
# '''{wRegistType: 2, vecIdList: ["h00336e2bmu"], wSpeSource: 0, bIsGetUserCfg: 1,…}
# wRegistType: 2
# vecIdList: ["h00336e2bmu"]
# 0: "h00336e2bmu"
# wSpeSource: 0
# bIsGetUserCfg: 1
# mapExtData: {h00336e2bmu: {strCid: "mzc00200umueb9v", strLid: ""}}
# h00336e2bmu: {strCid: "mzc00200umueb9v", strLid: ""}
# strCid: "mzc00200umueb9v"
# strLid: ""'''

Next, to pull every episode's danmu we need each episode's target_id and v_id, so let's find where those are exposed.
[Screenshot: the episode-list request exposing each v_id]
That gives us the v_id; POSTing the v_id then returns the target_id.
[Screenshots: the POST request and its response containing the target_id]
That's the general idea: keep digging through the network requests until you have everything you need.
I had previously seen someone else's code for scraping Tencent Video danmu (I've lost the link) and adapted it to what I needed. Here is the crawler code:

import requests
import json
import pandas as pd
import time
import random


# Parse the listing info: get the suffix ID (v_id), title, view count and
# episode number needed to build each episode's danmu URL.
def parse_base_info(url, headers):
    df = pd.DataFrame()

    html = requests.get(url, headers=headers)
    # strip the jQuery JSONP callback wrapper to leave plain JSON
    bs = json.loads(html.text[html.text.find('{'):-1])

    for i in bs['results']:
        v_id = i['id']
        title = i['fields']['title']
        view_count = i['fields']['view_all_count']
        episode = int(i['fields']['episode'])
        if episode != 0:
            cache = pd.DataFrame({'id': [v_id], 'title': [title], '播放量': [view_count], '第幾集': [episode]})
            df = pd.concat([df, cache])
    return df


# Given a suffix ID (v_id), POST to the regist endpoint and return [v_id, target_id].
def get_episode_danmu(v_id, headers):
    base_url = 'https://access.video.qq.com/danmu_manage/regist?vappid=97767206&vsecret=c0bdcbae120669fff425d0ef853674614aa659c605a613a4&raw=1'
    pay = {"wRegistType": 2, "vecIdList": [v_id],
           "wSpeSource": 0, "bIsGetUserCfg": 1,
           "mapExtData": {v_id: {"strCid": "mzc00200umueb9v", "strLid": ""}}}
    html = requests.post(base_url, data=json.dumps(pay), headers=headers)
    bs = json.loads(html.text)
    # strDanMuKey looks like '...targetid=4576819405&vid=i0033qa01p1...';
    # slice out the value between 'targetid=' and '&vid'
    danmu_key = bs['data']['stMap'][v_id]['strDanMuKey']
    target_id = danmu_key[danmu_key.find('targetid') + 9: danmu_key.find('vid') - 1]
    return [v_id, target_id]


# Parse a single danmu page. Takes the target_id, v_id (suffix ID) and episode
# number (for matching later); returns a DataFrame of that page's danmu.
def parse_danmu(url, target_id, v_id, headers, period):
    html = requests.get(url, headers=headers)
    bs = json.loads(html.text, strict=False)  # strict=False: danmu text may embed control characters
    df = pd.DataFrame()
    try:
        for i in bs['comments']:
            content = i['content']
            name = i['opername']
            upcount = i['upcount']
            user_degree = i['uservip_degree']
            timepoint = i['timepoint']
            comment_id = i['commentid']
            cache = pd.DataFrame({'用戶名': [name], '內容': [content], '會員等級': [user_degree],
                                  '彈幕時間點': [timepoint], '彈幕點贊': [upcount], '彈幕id': [comment_id], '集數': [period]})
            df = pd.concat([df, cache])
    except (KeyError, TypeError):  # some pages come back without a 'comments' field
        pass
    return df


# Build the list of URLs for one episode from target_id and the suffix ID (v_id).
# Paging works by stepping the timestamp value: each request covers a ~30-second
# window, so `page` pages cover page * 30 seconds of the episode (85 ≈ 42.5 min).
def format_url(target_id, v_id, page=85):
    urls = []
    base_url = 'https://mfm.video.qq.com/danmu?otype=json&timestamp={}&target_id={}%26vid%3D{}&count=400&second_count=5'

    for num in range(15, page * 30 + 15, 30):
        url = base_url.format(num, target_id, v_id)
        urls.append(url)
    return urls


def get_all_ids(part1_url, part2_url, headers):
    part_1 = parse_base_info(part1_url, headers)
    part_2 = parse_base_info(part2_url, headers)
    df = pd.concat([part_1, part_2])
    df.sort_values('第幾集', ascending=True, inplace=True)
    count = 1
    # collect the [v_id, target_id] pairs
    info_lst = []
    for i in df['id']:
        info = get_episode_danmu(i, headers)
        info_lst.append(info)
        print('Fetching the target_id of episode %d' % count)
        count += 1
        time.sleep(2 + random.random())
    print('Noticed one episode too many? No worries, duplicates are dropped below.')
    # merge the target_ids onto the main table via the suffix ID
    info_lst = pd.DataFrame(info_lst)
    info_lst.columns = ['v_id', 'target_id']
    combine = pd.merge(df, info_lst, left_on='id', right_on='v_id', how='inner')
    # drop duplicated ids
    combine = combine.loc[~combine.duplicated('id'), :]
    return combine


# Takes the table of v_id/target_id pairs plus how many episodes to crawl.
def crawl_all(combine, num, page, headers):
    c = 1
    final_result = pd.DataFrame()
    for v_id, target_id in zip(combine['v_id'][:num], combine['target_id'][:num]):
        count = 1
        urls = format_url(target_id, v_id, page)
        for url in urls:
            result = parse_danmu(url, target_id, v_id, headers, c)
            final_result = pd.concat([final_result, result])
            time.sleep(2 + random.random())
            print('Episode %d, page %d crawled..' % (c, count))
            count += 1
        print('-------------------------------------')
        c += 1
    return final_result


if __name__ == '__main__':
    part1_url = 'https://union.video.qq.com/fcgi-bin/data?otype=json&tid=682&appid=20001238&appkey=6c03bbe9658448a4&union_platform=1&idlist=x00335pmni4,u0033s6w87l,a0033ocq64d,z0033hacdb0,k30411thjyx,x00330kt2my,o0033xvtciz,r0033i8nwq5,q0033f4fhz2,a3033heyf4t,p0033giby5x,g0033iiwrkz,m0033hmbk3e,a0033m43iq4,o003381p611,c00339y0zzt,w0033ij5l6r,d0033mc7glb,k003314qjhw,x0033adrr32,h0033oqojcq,a00335xx2ud,t0033osrtb7&callback=jQuery191022356914493548485_1579090019078&_=1579090019082'
    part2_url = 'https://union.video.qq.com/fcgi-bin/data?otype=json&tid=682&appid=20001238&appkey=6c03bbe9658448a4&union_platform=1&idlist=t00332is4j6,i0033qa01p1,w0033mb4upm,u0033tu6jy5,v0033x5trub,h00336e2bmu,t00332is4j6,v0033l43n4x,s0033vhz5f6,u003325xf5p,n0033a2n6sl,s00339e7vqp,p0033je4tzi,y0033a6tibn,x00333vph31,v0033d7uaui,g0033a8ii9x,e0033hhaljd,g00331f53yk,m00330w5o8v,o00336lt4vb,l0033sko92l,g00337s3skh,j30495nlv60,m3047vci34u,j3048fxjevm,q0033a3kldj,y0033k978fd,a0033xrwikg,q0033d9y0jt&callback=jQuery191022356914493548485_1579090019080&_=1579090019081'

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    }

    # Get all suffix IDs, then fetch each one's target_id
    combine = get_all_ids(part1_url, part2_url, headers)

    # Choose how many episodes to crawl (num) and how many pages of danmu per
    # episode (page, 1-85); e.g. 30 episodes at 85 pages each: num=30, page=85
    final_result = crawl_all(combine, num=5, page=80, headers=headers)
    final_result.to_excel('./想見你彈幕.xlsx')

Data Analysis

After crawling, the danmu table looks like this:
[Screenshot: the scraped danmu DataFrame]
The analysis is done with pandas as usual, starting with imports and data cleaning.
[Screenshots: the import and data-cleaning cells]
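Those cells only survive as screenshots here, so below is a minimal sketch of what they did. The fillna/drop_duplicates steps are my reconstruction, and the SimHei font setup is what the Chinese-labelled plots further down assume:

import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']   # render the Chinese labels in matplotlib
plt.rcParams['axes.unicode_minus'] = False

df = pd.read_excel('./想見你彈幕.xlsx', index_col=0)
df['用戶名'] = df['用戶名'].fillna('')            # anonymous senders have an empty username
df = df.drop_duplicates('彈幕id')                # overlapping pages can return the same danmu twice
df.info()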
Now let's see who sent the most danmu (only counting danmu that carry a username):

# how many danmu each user sent
dmby1 = df.groupby('用戶名')['彈幕id'].count().sort_values(ascending=False).reset_index()
dmby1.columns = ['用戶名', '累計發送彈幕數']
dmby1.head(20)

[Screenshot: the top-20 senders table]
Then reuse the lollipop-chart template from my earlier post:

# skip the anonymous group (index 0) and visualize the heaviest senders

fig, ax = plt.subplots(figsize=(25,16),dpi= 80)
ax.vlines(x=dmby1.index[1:50], ymin=0, ymax=dmby1.累計發送彈幕數[1:50], color='firebrick', alpha=0.7, linewidth=2)
ax.scatter(x=dmby1.index[1:50], y=dmby1.累計發送彈幕數[1:50], s=75, color='firebrick', alpha=0.7)

ax.set_title('每人發送彈幕數', fontdict={'size':22})
ax.set_ylabel('彈幕數')
ax.set_xticks(dmby1[1:50].index)
ax.set_xticklabels(dmby1.用戶名[1:50].str.upper(), rotation=60, fontdict={'horizontalalignment': 'right', 'size':12})
ax.set_ylim(0, 200)

for row in dmby1[1:50].itertuples():
    ax.text(row.Index, row.累計發送彈幕數+3.5, s=round(row.累計發送彈幕數, 2), horizontalalignment= 'center', verticalalignment='bottom', fontsize=14)
plt.show()

[Screenshot: lollipop chart of danmu sent per user]
I shrank the page to take the screenshot; the original is much sharper.
Next, how many danmu each episode got:

# how many danmu per episode
dm = df.groupby('集數')['內容'].count().sort_values(ascending=False).reset_index()
dm.columns = ['集數', '本集發送彈幕數']
dm['佔比'] = dm['本集發送彈幕數'] / sum(dm['本集發送彈幕數'])
dm.head()

# visualize each episode's share as a pie chart
plt.pie(x =dm['佔比'], labels=dm['集數'],autopct='%1.1f%%')
plt.title('各集彈幕佔比')
plt.show()

[Screenshot: pie chart of danmu share per episode]
Because the crawler was run with count=400, some danmu may have been missed, and episode 3 clearly went wrong during the crawl.

Next, a word cloud:

# word cloud
from wordcloud import WordCloud
import imageio
import jieba


df['內容']=df['內容'].astype(str)
word_list=" ".join(df['內容'])
word_list=" ".join(jieba.cut(word_list))

# configure the word cloud
wc = WordCloud(
    mask = imageio.imread('C:/Users/ysj/Pictures/卑微.jpg'),
    max_words = 500,
    font_path = 'C:/Windows/Fonts/simhei.ttf',
    width=400,
    height=860,
    )
 
# generate the word cloud and save it
myword = wc.generate(word_list)
wc.to_file("./想見你詞雲.png")
# display the word cloud
fig = plt.figure(dpi=80)
plt.imshow(myword)
plt.title('想見你詞雲')
plt.axis('off')
plt.show()

[Screenshot: the 想見你 word cloud]
Is it because danmu sent by VIPs get a "vip" prefix? I can't believe that even shows up in the cloud...
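If the "vip" prefix really is polluting the cloud, a quick fix (my guess; I haven't verified it against the raw data) is to drop those tokens before generating:

# Hypothetical cleanup: filter 'vip'-like tokens out of the segmented text
tokens = [w for w in jieba.cut(" ".join(df['內容'].astype(str))) if w.strip().lower() != 'vip']
myword = wc.generate(" ".join(tokens))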
Finally, a simple sentiment analysis of the danmu.

# danmu sentiment analysis
from snownlp import SnowNLP
 
def sentiment(row):
    content = str(row['內容']).strip()
    s = SnowNLP(content)
    score = float(s.sentiments)
    return score
    
df['score'] = df.apply(sentiment, axis = 1)
 
df1 = df.groupby(['彈幕時間點'],as_index=False)['score'].mean()

df1.head()


fig = plt.figure(figsize=(10, 4.5))
# SnowNLP's sentiments score is the probability the text is positive (0 to 1);
# plot the mean score against the danmu timepoint for the first 60 rows
plt.plot(df1['彈幕時間點'][0:60], df1['score'][0:60], lw=2)
plt.title("彈幕情感趨勢")
plt.xlabel("時間")
plt.ylabel("分數")
plt.ylim(0, 1)
plt.axhline(0.5, color='orange')
plt.show()

[Screenshot: danmu sentiment trend over time]
Most danmu skew positive, and the highest-scoring timepoints tend to fall between the 15- and 35-minute marks, probably where each episode really gets into the plot.
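The per-timepoint means are quite noisy; if you want a cleaner trend over the whole episode, a rolling average helps (my own addition, not part of the original analysis):

# Smooth the sentiment curve with a 30-point rolling mean
df1['smooth'] = df1['score'].rolling(window=30, min_periods=1).mean()
plt.figure(figsize=(10, 4.5))
plt.plot(df1['彈幕時間點'], df1['smooth'], lw=2)
plt.axhline(0.5, color='orange')
plt.title("彈幕情感趨勢(平滑)")
plt.xlabel("時間")
plt.ylabel("分數")
plt.show()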

Final Thoughts

Right before writing this post I watched episode 13 of iPartment 5 (愛情公寓5), and it was genuinely fun. Danmu let us share our thoughts while watching, but please be a civil viewer: mindless hate in the danmu is really off-putting, like all the people trashing iPartment for plagiarism when it has kept us company for ten years. A lot of shows have enormous effort poured into them, and many die young. Constantine, for example: I like that kind of horror/supernatural show, and it's a shame it only got one season. There's also Elementary, now finished, and Supernatural in its final season. Sometimes these shows are like friends keeping us company, so try to be forgiving; being flamed feels awful. Over the next few days I'll look into iQIYI's danmu and see what iPartment 5's danmu look like.
