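Summary

[23] styles [Pop, Rock, Folk, Electronic, Dance, Rap, Light Music, Jazz, Country, R&B/Soul, Classical, Ethnic, Britpop, Metal, Punk, Blues, Reggae, World Music, Latin, New Age, Gufeng (古風), Post-Rock, Bossa Nova]

[29928] playlists, of which [2416] are hot playlists with more than 1,000,000 plays

[787223] songs in total, [174845] songs left to crawl after deduplication, [1374523] hot comments in total

Here are the top 10 hot comments of 2020 (skipping comments posted by the artists themselves). See whether any of them leaves you with a lingering ache: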

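1. Everyone is staring at me
I hate it to death
So
viciously
I sewed mirrors all over my body
(Shortlisted entry, 2017 Global Chinese-Language University Student Short Poem Competition)

From: 一支榴蓮, 《海底》
Commenter: 洞庭老碧螺春
Likes: 466924
Posted: 2020-03-25 16:27:50

2. When you have fought with your mom, touch your belly button. That is where you and she were once connected.

From: 蠟筆小心, 《MOM》
Commenter: 黃金至尊鹹魚_
Likes: 364415
Posted: 2020-03-09 01:11:00

3. Not a young underachiever anymore? Not insecure anymore? Grown capable now, have you?

From: 李榮浩, 《在一起嘛好不好》
Commenter: 該暱稱已被佔我用
Likes: 350477
Posted: 2020-06-03 00:16:22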

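4. The moment he opens his mouth, you can tell he is a veteran anime fan (doge)

From: 周深, 《達拉崩吧 (Live)》
Commenter: 明月春風-三五夜
Likes: 325905
Posted: 2020-03-27 21:34:23

5. It is Mom's first time being a mom
Nobody taught her
Yet she does it well
It is my first time being a daughter too
Many people teach me
Yet I still do not do well enough

From: 蠟筆小心, 《MOM》
Commenter: 腿不長的梁妹妹
Likes: 314899
Posted: 2020-03-09 04:20:35

6. I love the way this song uses silence,
love the suffocating feeling of the sudden pause.
Those few blank seconds
feel like time left for listeners to digest their emotions,
to choke up, to hold their breath, to gasp for air,
to sense the heartbeat of the world,
or to pick their scattered selves up off the floor ( o̴̶̷᷄ ·̫ o̴̶̷̥᷅ )

From: 華晨宇, 《好想愛這個世界啊 (Live)》
Commenter: 昃星
Likes: 312470
Posted: 2020-04-03 21:45:28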

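7. You have taken a taxi, right? After you got out, didn't the driver flip the "vacant" sign back up? You should understand that an emptied seat always gets filled by someone.

From: 井朧, 《丟了你》
Commenter: 雨落東海只剩浪
Likes: 301451
Posted: 2020-04-25 02:18:23

8. Could this love, could it, be wearing you out too?

From: 劉大壯, 《會不會（吉他版）》
Commenter: 明月春風-三五夜
Likes: 284451
Posted: 2020-09-22 00:00:29

9. Remember that news story? A couple were separated during a war and met again decades later. The old woman had never married; the old man had a house full of children and grandchildren. She overestimated his love for her; he underestimated his place in her heart.

From: 井朧, 《丟了你》
Commenter: 別對小玖心動
Likes: 258565
Posted: 2020-04-25 00:06:19

10.

What do I want to see.

I want to see day after day and year after year, the peeking by the window, the shifting glances. I want to see favoritism, fulfilment, restrained affection; I want to see time that finds no end, and the words murmured a thousand times yet never spoken aloud.

I want to see sweet dreams shattered, sunsets trampled, stories left imperfect, regrets everywhere.

We are all alive, living in the summer.

From: 一支榴蓮, 《海底》
Commenter: Lunarfocus_
Likes: 250374
Posted: 2020-02-28 03:22:03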
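
The selection rule behind the list above (comments posted in 2020, ranked by likes, with the artist's own comments skipped) can be sketched roughly as follows. This is only an illustration, not the code used for the article: the dictionary keys are hypothetical, and detecting an artist's own comment by comparing the commenter's nickname with the artist name is an assumption.

```python
import heapq
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def top_comments(comments, year, n=10):
    """Return the n most-liked comments posted in `year`, skipping artist self-comments."""
    def eligible(c):
        posted = datetime.strptime(c["time"], TIME_FORMAT)
        # Assumption: an artist's own comment is detected by a nickname match.
        own = c["commenter"] in c["artists"].split("/")
        return posted.year == year and not own

    return heapq.nlargest(n, filter(eligible, comments), key=lambda c: c["likes"])


sample = [
    {"commenter": "洞庭老碧螺春", "artists": "一支榴蓮", "likes": 466924, "time": "2020-03-25 16:27:50"},
    {"commenter": "黃金至尊鹹魚_", "artists": "蠟筆小心", "likes": 364415, "time": "2020-03-09 01:11:00"},
    # Hypothetical rows, present only to exercise the two filters:
    {"commenter": "蠟筆小心", "artists": "蠟筆小心", "likes": 999999, "time": "2020-03-09 02:00:00"},
    {"commenter": "someone", "artists": "一支榴蓮", "likes": 888888, "time": "2019-12-31 23:59:59"},
]
ranked = top_comments(sample, 2020)
print([c["likes"] for c in ranked])
```

`heapq.nlargest` avoids sorting the whole comment set when only a short ranking is needed.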

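Approach

Get all styles on NetEase Cloud Music
Under each style, crawl the playlists with more than 1,000,000 plays
Crawl all the songs in each playlist (the most troublesome step)
Use a queue to crawl each song's hot comments

Below is a rough walkthrough of the implementation.

I. Get all NetEase Cloud Music styles

All the styles are shown in the figure; XPath is used to extract them.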

    def get_all_style(self):
        """
        Get all styles on NetEase Cloud Music.
        Returns: list of dicts holding each style's link (href) and name (cat)

        """
        style_list = []

        url = 'https://music.163.com/discover/playlist'
        res = self.session.get(url)

        selector = etree.HTML(res.text)
        # The second <dd> on the page is taken as the style category block
        styles = selector.xpath('//dd')[1]
        for style in styles.xpath('a'):
            href = str(style.xpath('@href')[0])
            cat = str(style.xpath('text()')[0])
            style_list.append(dict(href=href, cat=cat))

        logger.info("Scraped [%s] styles [%s]", len(style_list), ','.join(style['cat'] for style in style_list))
        return style_list

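II. Crawl all playlists under each style

The page structure of the playlists under each style is shown in the figure below:

XPath is used again to extract the needed content.

The total page count is extracted first, and then the required playlist information (the playlist IDs) is fetched recursively from every page.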
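
As a minimal sketch of the paging arithmetic, the page URLs can also be generated up front and walked in a plain loop instead of recursively. The URL pattern and the page size of 35 come from the crawler code in this section; the helper names are made up for illustration.

```python
from urllib.parse import quote

PAGE_SIZE = 35  # playlists per page, as used by the crawler's URL


def playlist_page_url(cat, page):
    """Build the URL of a 1-based page of hot playlists for a style name."""
    offset = (page - 1) * PAGE_SIZE
    return (
        "https://music.163.com/discover/playlist/"
        "?order=hot&cat={}&limit={}&offset={}".format(quote(cat), PAGE_SIZE, offset)
    )


def all_page_urls(cat, end_page):
    """List the URLs of pages 1..end_page for one style."""
    return [playlist_page_url(cat, page) for page in range(1, end_page + 1)]


urls = all_page_urls("R&B/Soul", 3)
print(urls[1])
```

`quote` percent-encodes the `&` in a name such as `R&B/Soul`, so it cannot be mistaken for a query-string separator.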

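The full set of playlists is riddled with duplicate playlists and repeated songs, so analyzing every playlist is of little value.

The best strategy is to go to the most-played playlists and catch the most striking hot comments there.

So in the end I kept only the playlists with more than 1,000,000 plays and treated them as the hot playlists for analysis.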

    def get_style_playlist(self, style_, offset=1, play_list_ids=None):
        """
        Crawl all playlists under one NetEase Cloud Music style.
        Args:
            style_: style info (dict with 'href' and 'cat')
            offset: page number, starting at 1
            play_list_ids: dict mapping playlist id to play count

        Returns: dict mapping playlist id to play count

        """
        if play_list_ids is None:
            play_list_ids = {}

        cat = style_['cat']
        logger.info("Fetching page [%s] of playlists for [%s]", offset, cat)
        url = 'https://music.163.com/discover/playlist/?order=hot&cat={}&limit=35&offset={}'.format(
            quote(cat), (offset - 1) * 35)

        res = self.session.get(url)
        html = res.text
        selector = etree.HTML(html)

        # The second-to-last pager link is taken as the last page number
        end_page = selector.xpath('//div[@class="u-page"]/a/text()')[-2]

        bottoms = selector.xpath('//ul[@id="m-pl-container"]/li/div/div[@class="bottom"]')
        for bom in bottoms:
            # Playlist id
            playlist_id = str(bom.xpath('a/@data-res-id')[0])
            # Play count; a trailing "萬" means x10,000 (e.g. "105萬" -> 1050000)
            volume = int(str(bom.xpath('span[@class="nb"]/text()')[0]).replace("萬", '0000'))

            self.play_list_count += 1
            # Keep only playlists with more than 1,000,000 plays
            if volume > 1000000:
                play_list_ids[playlist_id] = volume
        if offset == 1:
            # Only the first call walks the remaining pages
            for page in range(2, int(end_page) + 1):
                self.get_style_playlist(style_, page, play_list_ids)

        return play_list_ids
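
The play-count parsing and the 1,000,000-play filter can be pulled out into small pure functions, which makes them easy to check without hitting the site. A minimal sketch under assumptions: the function names are hypothetical, and accepting a decimal count such as `1.5萬` or the simplified character `万` is an extra precaution, not something the crawler above does.

```python
from decimal import Decimal

HOT_THRESHOLD = 1_000_000  # "more than 1,000,000 plays"


def parse_play_count(text):
    """Convert a play-count label such as '105萬' or '98765' to an int."""
    text = text.strip()
    if text.endswith(("萬", "万")):
        # Decimal avoids float rounding for labels like '1.5萬' (assumed format).
        return int(Decimal(text[:-1]) * 10_000)
    return int(text)


def filter_hot(playlists):
    """Keep playlists whose parsed play count exceeds HOT_THRESHOLD."""
    hot = {}
    for playlist_id, label in playlists.items():
        volume = parse_play_count(label)
        if volume > HOT_THRESHOLD:
            hot[playlist_id] = volume
    return hot


# Hypothetical playlist ids and labels, for illustration only.
hot = filter_hot({"a": "105萬", "b": "98765", "c": "100萬", "d": "1234万"})
print(hot)
```

Note that a playlist with exactly 1,000,000 plays is excluded, matching the strict `>` comparison in the crawler.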

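III. Get all songs in a playlist

Let's happily extract the song info from the web page... no way!

Fine, let's see whether there is an API we can mimic... no way!

The web page shows only 10 songs per playlist, and there is no AJAX request either!

After some brainstorming, I finally found a Linux-client API that can return up to 1,000 songs for one playlist.

OK, that's the one. Here is the code: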

    def get_playlist(self, _id):
        """
        Get all songs in a playlist.
        Args:
            _id: playlist id

        Returns: None; each new song is pushed onto the task queue

        """
        # Random pause between requests to avoid hammering the server
        time.sleep(random.randint(2, 5))

        logger.info("Fetching all songs of playlist [%s]", _id)
        url = "http://music.163.com/api/v3/playlist/detail"
        param = dict(id=_id, total="true", limit=1000, n=1000, offset=0)
        try:

            res = requests.post(url=url, headers=self.headers, data=music163_encrypt.get_form_data(param, 'linux'),
                                timeout=7)
            if res.ok:

                try:
                    text = json.loads(res.text)

                    if 'playlist' in text:
                        playlist = text["playlist"]
                        if 'tracks' in playlist:
                            playlist = playlist["tracks"]

                            for song in playlist:
                                song_request = {'_id': song['id'],
                                                'name': song['name'],
                                                'artists': '/'.join([ar['name'] for ar in song['ar']])}
                                self.song_count += 1
                                # Deduplicate: only queue songs not seen in earlier playlists
                                if self.song_dict.get(song['id']) is None:
                                    self.song_dict[song['id']] = 0
                                    logger.info(','.join('{}'.format(v) for v in list(song_request.values())))
                                    send_task('handle_music_163_song_hot_comment', json.dumps(song_request))
                                else:
                                    logger.info("Song already seen, skipping...")
                        else:
                            logger.error('Unexpected JSON content,%s', text)
                    else:
                        logger.error('Unexpected JSON content,%s', text)
                except Exception as e:
                    logger.error('Failed to read songs of playlist [%s],%s', _id, e)
        except Exception as e:
            logger.error("Request failed,%s,%s", _id, e)
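
The deduplication step is what turns the raw song total into the much smaller number of songs to crawl that the summary reports. A minimal stand-alone sketch of that bookkeeping, with hypothetical names and toy data (the crawler itself keeps `song_dict` and `song_count` on the instance):

```python
def dedupe_songs(playlists):
    """Return (total song count, unique songs in first-seen order)."""
    seen_ids = set()
    unique = []
    total = 0
    for tracks in playlists:
        for song in tracks:
            total += 1
            if song["id"] not in seen_ids:
                seen_ids.add(song["id"])
                unique.append(song)
    return total, unique


# Toy data: two playlists sharing one song.
playlists = [
    [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}],
    [{"id": 2, "name": "B"}, {"id": 3, "name": "C"}],
]
total, unique = dedupe_songs(playlists)
print(total, [song["id"] for song in unique])
```

A set gives constant-time membership checks, which matters once the total reaches hundreds of thousands of songs.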

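IV. Fetch song hot comments through a queue

If you read carefully, you will have noticed that once I had a song's info, I pushed it straight onto a queue.

For convenience, I first wrapped the RabbitMQ calls in a small Python utility module, shown below.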

import pika


def send_task(queue_name, task):
    connection = pika.BlockingConnection(pika.ConnectionParameters(
        'localhost'))
    channel = connection.channel()
    # Declare an exchange
    channel.exchange_declare(exchange='messages', exchange_type='direct')
    # Declare the queue (durable, so it survives a broker restart)
    channel.queue_declare(queue=queue_name, durable=True)
    # In RabbitMQ a message can never be sent directly to the queue, it always needs to go through an exchange.
    channel.queue_bind(exchange='messages',
                       queue=queue_name,
                       routing_key=queue_name)

    channel.basic_publish(exchange='messages',
                          routing_key=queue_name,
                          body=task,
                          properties=pika.BasicProperties(
                              delivery_mode=2,  # make message persistent
                          )
                          )
    connection.close()


def recv_task(queue_name, callback):
    connection = pika.BlockingConnection(pika.ConnectionParameters(
        'localhost'))
    channel = connection.channel()
    # Declare an exchange
    channel.exchange_declare(exchange='messages', exchange_type='direct')
    # Declare the queue
    channel.queue_declare(queue=queue_name, durable=True)

    # In RabbitMQ a message can never be sent directly to the queue, it always needs to go through an exchange.
    channel.queue_bind(exchange='messages',
                       queue=queue_name,
                       routing_key=queue_name)
    # Allow at most 10 unacknowledged messages per consumer
    channel.basic_qos(prefetch_count=10)
    # Callback-first signature of pika releases before 1.0; pika 1.x expects
    # basic_consume(queue=..., on_message_callback=..., auto_ack=False) instead.
    channel.basic_consume(callback,
                          queue=queue_name,
                          no_ack=False)
    # Print before start_consuming(), which blocks until consuming stops
    print(' [*] Waiting for messages. To exit press CTRL+C')
    channel.start_consuming()
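
To show the produce/consume/acknowledge flow without a running broker, here is a rough stand-in that swaps RabbitMQ for the standard library's in-process `queue.Queue`. It is not the article's setup: there is no persistence or cross-process delivery, `task_done()` merely plays the role of the ack, and all names are hypothetical.

```python
import json
import queue
import threading


def worker(tasks, results):
    """Consume JSON task bodies until a None sentinel arrives."""
    while True:
        body = tasks.get()
        try:
            if body is None:
                return
            song = json.loads(body)
            results.append(song["_id"])  # placeholder for the real comment fetch
        finally:
            tasks.task_done()  # plays the role of basic_ack


def run(songs, n_workers=2):
    """Publish every song as a JSON task and wait until all are processed."""
    tasks = queue.Queue()
    results = []
    threads = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for song in songs:
        tasks.put(json.dumps(song))
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    tasks.join()
    for thread in threads:
        thread.join()
    return results


processed = run([{"_id": 1}, {"_id": 2}, {"_id": 3}])
print(sorted(processed))
```

The order in which workers finish is not deterministic, which is why the result is sorted before printing.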

Below is the main code that consumes the queue and fetches each song's hot comments:


    def request_music163_hot_comments(self, song):
        """
        Fetch the hot comments of one NetEase Cloud Music song.
        Args:
            song: song info

        Returns:

        """
        _id = str(song['_id'])

        # Comments for a new song
        try:
            self.create_hot_comments_file()

            # Timestamp cursor: -1 for the first page; for the next page, use the
            # timestamp of the last comment on the previous page
            cursor = -1
            index = 1
            offset = 0
            param1 = """{"csrf_token": "", "cursor": "%s", "offset": "%s", "orderType": "1","pageNo": "%s","pageSize": "20", "rid": "R_SO_4_%s", "threadId": "R_SO_4_%s"}"""
            requests.packages.urllib3.disable_warnings()

            arg1 = param1 % (cursor, offset, index, _id, _id)
            r = self.session.post(self.url, headers=self.session.headers,
                                  data=music163_encrypt.get_form_data(arg1),
                                  verify=False,
                                  timeout=5)
            result = r.json()
            try:
                # Hot comments
                hotComments = result["data"]['hotComments']
                if hotComments is not None:
                    logger.info('Fetching hot comments of song ID [%s]', _id)
                    for comment in hotComments:
                        self.parse_comment_info(song, comment, True)
            except Exception as e:
                logger.error(e)
        except Exception as e:
            logger.error('Failed to fetch NetEase Cloud Music comments. Params [%s]%s', song, e)


    def get_comment_info(self, ch, method, properties, body):

        song = json.loads(body)
        thread = threading.Thread(target=self.request_music163_hot_comments, args=(song,))
        thread.start()
        while thread.is_alive():
            # Loop while the thread is processing, keeping the connection alive
            ch._connection.sleep(1.0)
        logger.info('Back from thread')

        ch.basic_ack(delivery_tag=method.delivery_tag)  # send the ack

    def start(self):
        """
        Consume the queue and process the comments of each song.
        Returns:

        """
        try:
            queue_name = 'handle_music_163_song_hot_comment'
            connection = pika.BlockingConnection(pika.ConnectionParameters(
                'localhost'))
            channel = connection.channel()
            # Declare an exchange
            channel.exchange_declare(exchange='messages', exchange_type='direct')
            # Declare the queue
            channel.queue_declare(queue=queue_name, durable=True)
            # In RabbitMQ a message can never be sent directly to the queue, it always needs to go through an exchange.
            channel.queue_bind(exchange='messages',
                               queue=queue_name,
                               routing_key=queue_name)
            channel.basic_qos(prefetch_count=10)
            # Callback-first signature of pika releases before 1.0
            channel.basic_consume(self.get_comment_info,
                                  queue=queue_name,
                                  no_ack=False)
            channel.start_consuming()
        except Exception as e:
            logger.error(e)
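
The request parameters are assembled above with `%` formatting into a hand-written JSON string. A slightly safer sketch builds the same fields with `json.dumps`. The field names are copied from the code above; the function name is hypothetical, and deriving `offset` from the page number for later pages is an assumption, since the article only requests the first page.

```python
import json


def build_comment_params(song_id, page_no=1, cursor=-1, page_size=20):
    """Build the unencrypted JSON parameter string for a song's comment thread."""
    thread_id = "R_SO_4_{}".format(song_id)
    params = {
        "csrf_token": "",
        "cursor": str(cursor),  # -1 on the first page
        # Assumption: later pages advance by page_size; only page 1 is used above.
        "offset": str((page_no - 1) * page_size),
        "orderType": "1",
        "pageNo": str(page_no),
        "pageSize": str(page_size),
        "rid": thread_id,
        "threadId": thread_id,
    }
    return json.dumps(params)


decoded = json.loads(build_comment_params(123))
print(decoded["rid"], decoded["cursor"], decoded["offset"])
```

The resulting string would still have to go through the encryption helper (`music163_encrypt.get_form_data` in the article) before being posted.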

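V. Visual analysis of the hot comments

Here are a few charts from the hot-comment analysis first, mainly based on the top 10,000 hot comments by likes. The next part will go through the visualization in Python in detail, along with sentiment analysis.

1. User distribution across the country

As of January 1, 2021, in the contest for NetEase Cloud Music's national hot-comment power ranking, Guangdong took the crown with Jiangsu close behind; Sichuan, Beijing, Zhejiang, and Shandong were still locked in a tight fight, and the other regions had long been knocked out.

2. User age distribution

The age distribution curve of the hot-comment ranking proves one thing: young people aged 19 to 30 really are the main force of the "網抑雲" crowd (a pun on 網易雲 that swaps in the character for "depressed").

3. User gender ratio

Male users account for 52.46% of NetEase Cloud Music hot comments, while female users account for only 35.25%.

4. Hot comments by year

The 2017 peak probably comes from NetEase Cloud Music's surge in popularity; the line climbing steeply through 2019-2020 comes from everyone who was quarantined because of the virus and having a hard time. Human joys and sorrows can be shared after all.

5. Hot comments by month

The monthly hot-comment "power level" goes off the charts as the Spring Festival approaches, then gradually weakens as the festival passes, sliding all the way to the bottom; after lying flat in May and taking the mockery, it bounces back again.

6. Distribution over 24 hours

More browsing + more empathy = hot comments

After work + before bed = more browsing + more empathy

Main time range for hot comments: [after work : before bed]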
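
The year, month, and hour distributions all come down to bucketing the comment timestamps. A minimal sketch using only the standard library, with the ten timestamps from the 2020 top-10 list as sample data; the function name is hypothetical, and the real analysis ran over the top 10,000 comments with charting on top.

```python
from collections import Counter
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def bucket_counts(timestamps, part):
    """Count timestamps per 'year', 'month', or 'hour'."""
    if part not in ("year", "month", "hour"):
        raise ValueError("part must be 'year', 'month', or 'hour'")
    return Counter(getattr(datetime.strptime(ts, TIME_FORMAT), part) for ts in timestamps)


# Posting times of the ten comments listed at the top of the article.
times = [
    "2020-03-25 16:27:50", "2020-03-09 01:11:00", "2020-06-03 00:16:22",
    "2020-03-27 21:34:23", "2020-03-09 04:20:35", "2020-04-03 21:45:28",
    "2020-04-25 02:18:23", "2020-09-22 00:00:29", "2020-04-25 00:06:19",
    "2020-02-28 03:22:03",
]
by_hour = bucket_counts(times, "hour")
by_month = bucket_counts(times, "month")
print(sorted(by_hour.items()))
print(sorted(by_month.items()))
```

Even in this tiny sample, most of the comments were posted between 21:00 and 04:00, in line with the "after work, before bed" pattern described above.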

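Conclusion

With that, the job is done. After writing the code and letting it run overnight, I had all the hot-comment data.

This article is for learning purposes only; corrections are welcome if anything falls short.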
