How to Write a Crawler That Scrapes Bilibili Danmaku (Bullet Comments)

Original

bigbigsman

2018-08-27 02:34


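Crawler workflow

Parse the video's index page to get the cid of each episode
Build the danmaku URL for every cid
Parse the returned XML file and insert the rows into the database
Loop over the episodes to fetch each one's danmaku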

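1. Parse the index page to get each episode's cid and name

Request the static page and use lxml to extract each episode's cid and name.
The crawler source reads these from the page's <option> elements: the element text is the episode name, and the value and cid attributes hold the episode's page path and cid.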
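As a minimal sketch of this step, the snippet below pulls the name, value, and cid out of <option> elements in a small inline HTML fragment. It uses the standard library's html.parser instead of lxml so that it runs without extra dependencies. The fragment is a made-up stand-in for the real page (its paths, cids, and names are invented), and `OptionCollector` and `parse_episodes` are hypothetical names.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the episode selector on the video page.
SAMPLE_HTML = """
<html><body>
<select>
  <option value="/video/av1/index_1.html" cid="1001">Episode 1</option>
  <option value="/video/av1/index_2.html" cid="1002">Episode 2</option>
</select>
</body></html>
"""


class OptionCollector(HTMLParser):
    """Collect a (name, value, cid) tuple from every <option> element."""

    def __init__(self):
        super().__init__()
        self.episodes = []
        self._current = None  # attributes of the <option> being read, if any
        self._text = []       # text chunks seen inside that <option>

    def handle_starttag(self, tag, attrs):
        if tag == 'option':
            self._current = dict(attrs)
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <option> element.
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'option' and self._current is not None:
            name = ''.join(self._text).strip()
            self.episodes.append(
                (name, self._current.get('value'), self._current.get('cid')))
            self._current = None


def parse_episodes(html):
    """Return a list of (name, value, cid) tuples found in the HTML text."""
    parser = OptionCollector()
    parser.feed(html)
    parser.close()
    return parser.episodes


episodes = parse_episodes(SAMPLE_HTML)
for name, value, cid in episodes:
    print(name, value, cid)
```

The tuples have the same (name, value, cid) shape that the crawler's indexget function returns.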

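2. Build the danmaku URL

Building the danmaku URL is simple once you have the cid from step 1:

http://comment.bilibili.com/{cid}.xml   # cid is the value obtained in step 1

Requesting this URL returns the episode's danmaku as an XML file.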
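The URL construction can be sketched in a few lines. The helper name `build_danmaku_url` and the cids used here are illustrative only; the URL pattern is the one given above.

```python
DANMAKU_URL_TEMPLATE = 'http://comment.bilibili.com/{cid}.xml'


def build_danmaku_url(cid):
    """Return the danmaku XML URL for one episode's cid."""
    return DANMAKU_URL_TEMPLATE.format(cid=cid)


# The cids below are made-up examples, not real episodes.
urls = [build_danmaku_url(cid) for cid in ('1001', '1002')]
for url in urls:
    print(url)
```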

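3. Parse the danmaku XML

Each danmaku in the returned XML looks like this:

<d p="533.67199707031,1,25,41194,1498943949,0,7edeebe9,3511616609">刀還是沒有槍快</d>

The p attribute holds eight comma-separated values, for example 0,1,25,16777215,1312863760,0,eff85771,42759017:
Field 1: the time at which the danmaku appears in the video, in seconds.
Field 2: the danmaku mode: 1-3 scrolling, 4 bottom, 5 top, 6 reverse, 7 precisely positioned, 8 advanced.
Field 3: the font size: 12 very small, 16 extra small, 18 small, 25 medium, 36 large, 45 very large, 64 extra large.
Field 4: the font color, as the decimal value of the HTML color (16777215 is 0xFFFFFF, white).
Field 5: a Unix timestamp, in seconds since 1970-01-01 00:00:00 UTC (which is 08:00:00 in Beijing time).
Field 6: the danmaku pool: 0 normal, 1 subtitle, 2 special (the special pool is currently reserved for advanced danmaku).
Field 7: the sender's ID, used by the "block this sender" feature.
Field 8: the danmaku's rowID in the danmaku database, used by the "history danmaku" feature.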
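To make the field layout concrete, here is a minimal sketch that parses the sample <d> element shown above into named fields. It uses the standard library's xml.etree.ElementTree rather than lxml. The `Danmaku` tuple, its field names, the `parse_danmaku` helper, and the <i> wrapper element in the sample are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple
from datetime import datetime, timezone

# Field names are illustrative; they follow the eight values described above.
Danmaku = namedtuple(
    'Danmaku',
    'appear_time mode font_size color sent_at pool sender_id row_id text')

# The <d> element is the sample from the text; the <i> root is an assumed wrapper.
SAMPLE_XML = (
    '<i><d p="533.67199707031,1,25,41194,1498943949,0,7edeebe9,3511616609">'
    '刀還是沒有槍快</d></i>'
)


def parse_danmaku(xml_text):
    """Return a list of Danmaku tuples, one per well-formed <d> element."""
    root = ET.fromstring(xml_text)
    result = []
    for d in root.iter('d'):
        fields = d.get('p', '').split(',')
        if len(fields) < 8:
            continue  # skip entries that lack the eight expected values
        result.append(Danmaku(
            appear_time=float(fields[0]),                # seconds into the video
            mode=int(fields[1]),
            font_size=int(fields[2]),
            color='#{:06X}'.format(int(fields[3])),      # decimal -> HTML hex color
            sent_at=datetime.fromtimestamp(int(fields[4]), tz=timezone.utc),
            pool=int(fields[5]),
            sender_id=fields[6],
            row_id=fields[7],
            text=d.text or '',
        ))
    return result


danmaku = parse_danmaku(SAMPLE_XML)
for item in danmaku:
    print(item)
```

Converting the color to hexadecimal and the timestamp to a timezone-aware datetime is optional; the crawler itself stores all eight values as text.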

4. Loop over the episodes and fetch each one's danmaku
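As a sketch of the loop on its own, the function below walks a list of (name, value, cid) tuples and fetches each episode in turn. The names `crawl_all`, `fetch`, and `delay`, the pause between requests, and the per-episode error handling are illustrative additions, not part of the original crawler; `fetch` stands for whatever function downloads and stores one episode's danmaku.

```python
import time


def crawl_all(episodes, fetch, delay=1.0):
    """Call fetch(episode) for every episode, pausing between requests.

    A failure on one episode is reported and the loop moves on.
    Returns the names of the episodes that were fetched successfully.
    """
    done = []
    for index, episode in enumerate(episodes):
        name = episode[0]
        try:
            fetch(episode)
            done.append(name)
        except Exception as exc:
            print('failed to fetch {}: {}'.format(name, exc))
        # Pause between requests, but not after the last one.
        if index < len(episodes) - 1:
            time.sleep(delay)
    return done


# Demo with a stand-in fetch function and made-up episodes; no network access.
def fake_fetch(episode):
    if episode[2] == '1002':
        raise RuntimeError('simulated network error')


episodes = [
    ('Episode 1', '/p1', '1001'),
    ('Episode 2', '/p2', '1002'),
    ('Episode 3', '/p3', '1003'),
]
fetched = crawl_all(episodes, fake_fetch, delay=0)
print(fetched)
```

Passing the fetch function in as an argument keeps the loop testable without touching the network.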

Crawler source code

# coding=utf-8
import os
import sqlite3
import sys

import requests
from lxml import etree

# Initialize the database; the comment table is created on the first run.
cx = sqlite3.connect('bilbili.db', check_same_thread = False)
cx.execute('''create table if not exists comment(videoname text,
                    chatid text,
                    dtTime text, 
                    danmu_model text, 
                    font text, 
                    rgb text, 
                    stamp text, 
                    danmu_chi text, 
                    userID text, 
                    rowID text,
                    message text)''')

def request_get_comment(getdetail):
    '''Fetch the danmaku of one episode and store them in the database.'''
    name,url,cid=getdetail
    # The page path from the index page is not needed here;
    # the danmaku URL is built from the cid alone.
    url='http://comment.bilibili.com/{}.xml'.format(cid)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)'}
    response = requests.get(url=url, headers=headers)
    tree=etree.HTML(response.content)
    saveme=[]
    # Read the p attribute and the text from the same <d> element so the two
    # stay paired even when a danmaku has no text.
    for d in tree.xpath('//d'):
        fields=d.get('p', '').split(',')
        if len(fields) < 8:
            # Skip malformed entries that lack the eight expected values.
            print('skipping malformed danmaku: {!r}'.format(d.get('p')))
            continue
        saveme.append((name,cid)+tuple(fields[:8])+(d.text or '',))

    # One row per danmaku: episode name, cid, the eight p values, and the text.
    cx.executemany('''INSERT INTO comment VALUES(?,?,?,?,?,?,?,?,?,?,?)''',saveme)
    cx.commit()


def indexget(url):
    '''Parse the index page and return a list of (name, value, cid) tuples.'''
    r=requests.get(url)
    tree=etree.HTML(r.content)
    name=tree.xpath('//option/text()')
    value=tree.xpath('//option/@value')
    cid=tree.xpath('//option/@cid')
    return list(zip(name,value,cid))


if __name__ == "__main__":
    # Usage: python xxx.py [url]
    # e.g.   python xxx.py http://www.bilibili.com/video/av3663007
    # Without an argument, the default URL below is crawled.
    default_url='http://www.bilibili.com/video/av3663007'
    if len(sys.argv)>1:
        first_url = sys.argv[1] or default_url
    else:
        first_url = default_url
    get_comment_url= indexget(first_url)
    for i in get_comment_url:
        print (i)
        request_get_comment(i)

    cx.close()
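
Once the crawler has run, the stored rows can be inspected with ordinary SQL. The sketch below is an illustration only: it recreates the comment table in an in-memory SQLite database, inserts a few made-up rows, and runs two summary queries. To query real data, connect to the crawler's database file instead.

```python
import sqlite3

# In-memory database with the same columns as the crawler's comment table.
conn = sqlite3.connect(':memory:')
conn.execute('''create table comment(videoname text, chatid text, dtTime text,
                danmu_model text, font text, rgb text, stamp text,
                danmu_chi text, userID text, rowID text, message text)''')

# Made-up rows, for illustration only.
rows = [
    ('Episode 1', '1001', '12.5', '1', '25', '16777215', '1498943949', '0', 'aaaa1111', '1', 'hello'),
    ('Episode 1', '1001', '40.0', '5', '25', '16777215', '1498943950', '0', 'bbbb2222', '2', 'hello'),
    ('Episode 2', '1002', '3.0', '1', '25', '41194', '1498943951', '0', 'aaaa1111', '3', 'world'),
]
conn.executemany('INSERT INTO comment VALUES(?,?,?,?,?,?,?,?,?,?,?)', rows)
conn.commit()

# Number of danmaku stored per episode.
per_episode = conn.execute(
    'select videoname, count(*) from comment '
    'group by videoname order by videoname'
).fetchall()

# The most frequently repeated danmaku texts.
top_messages = conn.execute(
    'select message, count(*) as n from comment '
    'group by message order by n desc, message limit 5'
).fetchall()

print(per_episode)
print(top_messages)
conn.close()
```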
