Task: scrape the comments with more than 1000 likes on Ed Sheeran's "Shape of You" on NetEase Cloud Music

Web scraping

The hard part of this task is the scraping itself. It helps to read this post alongside the Zhihu answers on the same problem.

Page analysis

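Open the page and switch between comment pages: the URL does not change, and no page=X parameter appears the way it does on Douban. My guess is that the comments are swapped in by data loaded through JavaScript.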

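Press F12 to open the developer tools, refresh the page, select the Network tab, and tick the XHR filter. Going through the requests shows that the comment data arrives in the R_SO_4_… request.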

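Select this request and look at it more closely.
It is a POST request, and its URL is the same for every page of comments, so the server must decide which page to return from other data in the request.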

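Scroll down to Form Data. The two parameters there, params and encSecKey, are clearly encrypted, and they are most likely the data we are looking for.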

To find the JavaScript that issues the request, click Initiator and then click core_f69…

This jumps to the Sources panel. The code is hard to read as served, so click the {} button in the lower left corner to format it.

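Searching the code shows that the params and encSecKey values come from a variable named bVj7c, which is the return value of window.asrsea called with four arguments:
    JSON.stringify(i8a),
    brx9o(["流淚", "強"]),
    brx9o(Xs4w.md),
    brx9o(["愛心", "女孩", "驚恐", "大笑"])
(The string literals mean "tears", "strong", "heart", "girl", "shocked", and "laugh". The programmer who picked these words for the encryption must have a story to tell~)
Set a breakpoint on line 13092 (left-click the line number to set it).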

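Now click another page of comments on the site; execution pauses at the breakpoint and the values of the arguments become visible.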

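Press Esc to bring up the console and evaluate the four arguments one by one. Comparing the values across pages shows that the last three are constants, while the first one selects the page through its offset field, which starts at 0 and increases by 20 per page.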
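The page-to-offset rule can be written down directly. The sketch below uses a hypothetical helper, `offset_for_page` (my name, not something from the site's code), and assumes 1-based page numbers with the page size of 20 observed in the console.

```python
PAGE_SIZE = 20  # comments per page, as observed in the console


def offset_for_page(page):
    """Return the offset value for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return (page - 1) * PAGE_SIZE


for page in (1, 2, 3):
    print(page, offset_for_page(page))
```

Page 1 maps to offset 0, so the first request matches what the browser sends when the comment list first loads.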

Obtaining the parameters

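Now let's reimplement window.asrsea to produce the params and encSecKey we need.
Download the script and locate window.asrsea.
A brief analysis:
function a generates a random string of length a;
function b AES-encrypts a using b as the key, with the iv set to 0102030405060708;
function c RSA-encrypts a using b and c;
function d is the window.asrsea we are after; it derives params and encSecKey from the four arguments.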
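Judging by the Python port later in this post, the RSA step in function c amounts to plain modular exponentiation of the reversed, hex-encoded key string, with no padding scheme. A minimal sketch with toy numbers follows. The modulus and exponent are tiny made-up values for illustration, not the site's real ones, and `rsa_encrypt_text` is a hypothetical name.

```python
def rsa_encrypt_text(text, exponent, modulus, width):
    """Textbook RSA (no padding) on a short ASCII string.

    The string is reversed, read as a big-endian integer, raised to
    `exponent` modulo `modulus`, and returned as zero-padded lowercase hex.
    """
    reversed_bytes = text[::-1].encode("utf-8")
    message = int.from_bytes(reversed_bytes, "big")
    if message >= modulus:
        raise ValueError("message must be smaller than the modulus")
    cipher = pow(message, exponent, modulus)  # modular exponentiation
    return format(cipher, "x").zfill(width)


# Toy parameters: n = 61 * 53 = 3233, e = 17 (illustrative values only)
print(rsa_encrypt_text("A", 17, 3233, 4))
```

The three-argument form of `pow` reduces modulo n at every step, which matters once the exponent is 65537 and the modulus is 1024 bits long.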

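We can reproduce this in Python with pycrypto (it helps to look up how AES and RSA work and follow along).
If installing the pycrypto module fails, install pycryptodome instead, which provides the same Crypto package: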

pip install -i https://pypi.douban.com/simple/ pycryptodome

Code:

import base64
import codecs
import random

from Crypto.Cipher import AES


class MusicSpider:

    def __init__(self):
        self.headers = {
            'accept': "*/*",
            'origin': "https://music.163.com",
            'Host': "music.163.com",
            'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36",
        }
        # Second argument: the RSA public exponent (hex)
        self.second_param = "010001"
        # Third argument: the RSA modulus (hex)
        self.third_param = "00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7"
        # Fourth argument: the fixed key for the first AES pass
        self.forth_param = "0CoJUm6Qyw8W8jud"

    def get_params(self, page):
        # Pages are 1-based and each page holds 20 comments
        offset = str((page - 1) * 20)
        self.first_param = '{rid:"", offset:"%s", total:"%s", limit:"20", csrf_token:""}' % (offset, 'true')
        # Random 16-character string, used as the key for the second AES pass
        self.random_strs = self.generate_random_strs(16)
        # params is the result of two AES passes
        self.params = self.AES_encrypt(self.first_param, self.forth_param)
        self.params = self.AES_encrypt(self.params.decode('utf-8'), self.random_strs)

    def get_encSecKey(self):
        # encSecKey is the RSA-encrypted random string
        self.encSecKey = self.RSAencrypt(self.random_strs, self.second_param, self.third_param)

    # Generate a random string
    def generate_random_strs(self, length):
        string = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
        return "".join(random.choice(string) for _ in range(length))

    # AES encryption
    def AES_encrypt(self, msg, key):
        data = msg.encode('utf-8')
        # Always pad to a multiple of 16 bytes; each padding byte holds the padding length
        padding = 16 - len(data) % 16
        data += bytes([padding]) * padding
        # Initialization vector (must be 16 bytes)
        iv = '0102030405060708'

        encryptor = AES.new(key.encode('utf-8'), AES.MODE_CBC, iv.encode('utf-8'))
        # The ciphertext is a bytes object
        encrypt_text = encryptor.encrypt(data)
        # Base64-encode it; the result is still bytes
        encrypt_text = base64.b64encode(encrypt_text)
        return encrypt_text

    # RSA encryption
    def RSAencrypt(self, randomstrs, key, f):
        # Reverse the random string
        string = randomstrs[::-1]
        # Convert the reversed string to bytes
        text = bytes(string, 'utf-8')
        # Three-argument pow does modular exponentiation without building the full power first
        seckey = pow(int(codecs.encode(text, encoding='hex'), 16), int(key, 16), int(f, 16))
        # Lowercase hex, zero-padded to 256 characters
        return format(seckey, 'x').zfill(256)
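The padding step in AES_encrypt is what is usually called PKCS#7 padding. A standalone sketch of the padding and its inverse, using only the standard library, makes the rule easier to check. `pad` and `unpad` are illustrative names, and 16 is the AES block size in bytes.

```python
BLOCK_SIZE = 16  # AES block size in bytes


def pad(data: bytes) -> bytes:
    """Append n bytes of value n so the length becomes a multiple of 16."""
    n = BLOCK_SIZE - len(data) % BLOCK_SIZE
    return data + bytes([n]) * n


def unpad(data: bytes) -> bytes:
    """Strip the padding added by pad(), validating it first."""
    if not data or len(data) % BLOCK_SIZE:
        raise ValueError("invalid padded length")
    n = data[-1]
    if not 1 <= n <= BLOCK_SIZE or data[-n:] != bytes([n]) * n:
        raise ValueError("invalid padding")
    return data[:-n]


sample = b'{"offset":"0"}'
padded = pad(sample)
print(len(sample), len(padded))
print(unpad(padded) == sample)
```

Input that is already a multiple of 16 bytes still receives a full block of padding, which is what lets the receiver remove it unambiguously.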

Data analysis

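This part is similar to analyzing the JSON data from Zhihu.
Back in the Network panel, open Preview: the comment text is in content under comments, and the like count is in likedCount under comments.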
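To make the structure concrete, here is a small sketch that filters a response of this shape. The sample dictionary is made up for illustration and contains only the two fields named above, while a real response carries many more. `popular_comments` is a hypothetical helper name.

```python
def popular_comments(data, min_likes):
    """Return the content of every comment with more than min_likes likes."""
    result = []
    for comment in data.get("comments", []):
        if comment.get("likedCount", 0) > min_likes:
            result.append(comment.get("content", ""))
    return result


# Made-up sample with the same shape as shown in the Preview pane
sample = {
    "comments": [
        {"content": "first comment", "likedCount": 1520},
        {"content": "second comment", "likedCount": 37},
        {"content": "third comment", "likedCount": 1000},
    ]
}

print(popular_comments(sample, 1000))
```

The comparison is strict, so a comment with exactly 1000 likes is not kept when the threshold is 1000.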

Send params and encSecKey as the form data of a POST request; the response is JSON.

    def get_json(self, url):
        self.post = {
            'params': self.params,
            'encSecKey': self.encSecKey,
        }
        try:
            self.response = requests.post(url, data=self.post, headers=self.headers, timeout=10)
            if self.response.status_code == 200:
                return self.response.json()
        except requests.RequestException:
            # Network-level failures such as connection errors and timeouts
            pass
        # Non-200 status or a failed request
        return None

Data storage

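From the returned JSON, read content and likedCount, and save the content whenever likedCount exceeds 100 (raise the threshold to 1000 for the task as stated at the top).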

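    def get_comments(self, url):
        # Fetch page 1 once to learn the total number of comments
        self.get_params(1)
        self.get_encSecKey()
        data = self.get_json(url)
        if data is None:
            return
        # Number of pages, rounded up (20 comments per page)
        total = data.get('total', 0)
        page = total // 20 + (1 if total % 20 else 0)
        # The with block closes the file even if a request fails midway
        with open('./comments.txt', 'w', encoding='utf-8') as f:
            for i in range(1, page + 1):
                self.get_params(i)
                self.get_encSecKey()
                data = self.get_json(url)
                if data is None:
                    continue
                for comment in data.get("comments", []):
                    likedcount = comment.get('likedCount', 0)
                    content = comment.get("content")
                    if likedcount > 100:
                        f.write(content + '\n')
                print("Page %d done" % i)
                time.sleep(5)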
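The page count is the total number of comments divided by the page size, rounded up. A small sketch of that calculation in isolation makes the edge cases easy to check. `page_count` is an illustrative name, and the default page size of 20 comes from the limit field.

```python
def page_count(total, page_size=20):
    """Number of pages needed to hold `total` items, rounding up."""
    if total < 0:
        raise ValueError("total must be non-negative")
    return (total + page_size - 1) // page_size


for total in (0, 1, 20, 21, 40):
    print(total, page_count(total))
```

With 1-based page numbers, the pages to request are then `range(1, page_count(total) + 1)`, so the last page is included.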

The collected comments can then be turned into a word cloud.
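A word cloud is drawn from word frequencies, so the first step is counting. The sketch below counts words with the standard library only. It assumes whitespace-separated Latin text, which does not hold for Chinese comments (those would need a word segmenter first), and the sample lines are made up.

```python
import re
from collections import Counter


def word_frequencies(lines):
    """Count lowercase word occurrences across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts


# Made-up sample lines standing in for the contents of comments.txt
sample = ["I love this song", "Love the beat", "this song again"]
freq = word_frequencies(sample)
print(freq.most_common(3))
```

The resulting mapping of word to count is the kind of input a word cloud renderer typically takes.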

Complete code

import base64
import codecs
import random
import time

import requests
from Crypto.Cipher import AES


class MusicSpider:

    def __init__(self):
        self.headers = {
            'accept': "*/*",
            'origin': "https://music.163.com",
            'Host': "music.163.com",
            'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36",
        }
        # Second argument: the RSA public exponent (hex)
        self.second_param = "010001"
        # Third argument: the RSA modulus (hex)
        self.third_param = "00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7"
        # Fourth argument: the fixed key for the first AES pass
        self.forth_param = "0CoJUm6Qyw8W8jud"

    def get_params(self, page):
        # Pages are 1-based and each page holds 20 comments
        offset = str((page - 1) * 20)
        self.first_param = '{rid:"", offset:"%s", total:"%s", limit:"20", csrf_token:""}' % (offset, 'true')
        # Random 16-character string, used as the key for the second AES pass
        self.random_strs = self.generate_random_strs(16)
        # params is the result of two AES passes
        self.params = self.AES_encrypt(self.first_param, self.forth_param)
        self.params = self.AES_encrypt(self.params.decode('utf-8'), self.random_strs)

    def get_encSecKey(self):
        # encSecKey is the RSA-encrypted random string
        self.encSecKey = self.RSAencrypt(self.random_strs, self.second_param, self.third_param)

    # Generate a random string
    def generate_random_strs(self, length):
        string = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
        return "".join(random.choice(string) for _ in range(length))

    # AES encryption
    def AES_encrypt(self, msg, key):
        data = msg.encode('utf-8')
        # Always pad to a multiple of 16 bytes; each padding byte holds the padding length
        padding = 16 - len(data) % 16
        data += bytes([padding]) * padding
        # Initialization vector (must be 16 bytes)
        iv = '0102030405060708'

        encryptor = AES.new(key.encode('utf-8'), AES.MODE_CBC, iv.encode('utf-8'))
        # The ciphertext is a bytes object
        encrypt_text = encryptor.encrypt(data)
        # Base64-encode it; the result is still bytes
        encrypt_text = base64.b64encode(encrypt_text)
        return encrypt_text

    # RSA encryption
    def RSAencrypt(self, randomstrs, key, f):
        # Reverse the random string
        string = randomstrs[::-1]
        # Convert the reversed string to bytes
        text = bytes(string, 'utf-8')
        # Three-argument pow does modular exponentiation without building the full power first
        seckey = pow(int(codecs.encode(text, encoding='hex'), 16), int(key, 16), int(f, 16))
        # Lowercase hex, zero-padded to 256 characters
        return format(seckey, 'x').zfill(256)

    def get_json(self, url):
        self.post = {
            'params': self.params,
            'encSecKey': self.encSecKey,
        }
        try:
            self.response = requests.post(url, data=self.post, headers=self.headers, timeout=10)
            if self.response.status_code == 200:
                return self.response.json()
        except requests.RequestException:
            # Network-level failures such as connection errors and timeouts
            pass
        # Non-200 status or a failed request
        return None

    def get_comments(self, url):
        # Fetch page 1 once to learn the total number of comments
        self.get_params(1)
        self.get_encSecKey()
        data = self.get_json(url)
        if data is None:
            return
        # Number of pages, rounded up (20 comments per page)
        total = data.get('total', 0)
        page = total // 20 + (1 if total % 20 else 0)
        # The with block closes the file even if a request fails midway
        with open('./comments.txt', 'w', encoding='utf-8') as f:
            for i in range(1, page + 1):
                self.get_params(i)
                self.get_encSecKey()
                data = self.get_json(url)
                if data is None:
                    continue
                for comment in data.get("comments", []):
                    likedcount = comment.get('likedCount', 0)
                    content = comment.get("content")
                    if likedcount > 100:
                        f.write(content + '\n')
                print("Page %d done" % i)
                time.sleep(5)


if __name__ == "__main__":
    # For another song, change the song id after R_SO_4_ in the URL
    url = "https://music.163.com/weapi/v1/resource/comments/R_SO_4_451703096?csrf_token="
    musicspider = MusicSpider()
    musicspider.get_comments(url)
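Since only the song id changes between songs, the URL can be built from it. A minimal sketch, assuming the URL pattern stays exactly as in the script above. `comment_url` is a hypothetical helper, and the numeric check on the id is my assumption based on the one id shown here.

```python
URL_TEMPLATE = "https://music.163.com/weapi/v1/resource/comments/R_SO_4_{song_id}?csrf_token="


def comment_url(song_id):
    """Build the comment endpoint URL for a song id (assumed to be numeric)."""
    song_id = str(song_id)
    if not song_id.isdigit():
        raise ValueError("song id must be numeric")
    return URL_TEMPLATE.format(song_id=song_id)


print(comment_url(451703096))
```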
