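Summary

[23] styles [Pop, Rock, Folk, Electronic, Dance, Rap, Light Music, Jazz, Country, R&B/Soul, Classical, Ethnic, Britpop, Metal, Punk, Blues, Reggae, World Music, Latin, New Age, Gufeng (ancient Chinese style), Post-rock, Bossa Nova]

[29928] playlists, of which [2416] are popular playlists with more than 1,000,000 plays

[787223] songs in total; after deduplication, [174845] songs remain to be crawled, with [1374523] hot comments in total

Here are the top 10 hot comments of 2020 (skipping comments posted by the artists themselves). See whether any of them stays with you: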

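1. Everyone is staring at me
I hate it to death
So
Viciously
I sewed mirrors all over my body
(Shortlisted entry, 2017 Global Chinese University Students Short Poetry Competition)

From 《海底》 by 一支榴莲
Commenter: 洞庭老碧螺春
Likes: 466924
Posted: 2020-03-25 16:27:50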

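2. When you have had a fight with your mom, touch your belly button. That is where you were once connected to her.

From 《MOM》 by 蜡笔小心
Commenter: 黄金至尊咸鱼_
Likes: 364415
Posted: 2020-03-09 01:11:00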

3. Not young and unaccomplished anymore? Not insecure anymore? Grown some skills now?

From 《在一起嘛好不好》 by 李荣浩
Commenter: 该暱称已被占我用
Likes: 350477
Posted: 2020-06-03 00:16:22

4. The moment he opens his mouth you can tell he is a veteran anime fan (doge)

From 《达拉崩吧 (Live)》 by 周深
Commenter: 明月春风-三五夜
Likes: 325905
Posted: 2020-03-27 21:34:23

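5. Mom is being a mom for the first time
No one taught her
Yet she does it well
I am also being a daughter for the first time
Many people have taught me
Yet I still do not do it well enough

From 《MOM》 by 蜡笔小心
Commenter: 腿不长的梁妹妹
Likes: 314899
Posted: 2020-03-09 04:20:35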

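6. I love the way this song uses silence,
I love the suffocating feeling of the sudden pause,
those few blank seconds
feel like time left for the listener to take in their emotions,
to choke up, to hold their breath, to gasp for air,
to sense the heartbeat of the world,
or to pick their scattered self up off the floor ( o̴̶̷᷄ ·̫ o̴̶̷̥᷅ )

From 《好想爱这个世界啊 (Live)》 by 华晨宇
Commenter: 昃星
Likes: 312470
Posted: 2020-04-03 21:45:28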

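7. You have taken a taxi, right? After you got out, didn't the driver flip the "vacant" sign back up? You should understand that an emptied seat always gets taken by someone else.

From 《丢了你》 by 井胧
Commenter: 雨落东海只剩浪
Likes: 301451
Posted: 2020-04-25 02:18:23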

8. Could this love, could it, be wearing you out too

From 《会不会（吉他版）》 by 刘大壮
Commenter: 明月春风-三五夜
Likes: 284451
Posted: 2020-09-22 00:00:29

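9. Remember that news story? A couple was separated during a war and met again decades later. The old woman had never married; the old man had a house full of children and grandchildren. She had overestimated his love for her, and he had underestimated his place in her heart.

From 《丢了你》 by 井胧
Commenter: 别对小玖心动
Likes: 258565
Posted: 2020-04-25 00:06:19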

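10.

What do I want to see?

I want to see day after day and year after year, stolen glances by the window, eyes that drift and meet. I want to see favoritism, letting go for someone else's sake, a love held back; I want to see time that finds no end, and the words mumbled a thousand times that still cannot be said.

I want to see sweet dreams shattered, the sunset trampled, stories left imperfect, regrets everywhere.

We are all alive, after all, alive in summer.

From 《海底》 by 一支榴莲
Commenter: Lunarfocus_
Likes: 250374
Posted: 2020-02-28 03:22:03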

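Approach

1. Get all NetEase Cloud Music styles
2. Under every style, crawl the playlists with more than 1,000,000 plays
3. Crawl all the songs in each playlist (the most troublesome part)
4. Use a queue to crawl the hot comments of each song

The implementation is outlined below.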

I. Get all NetEase Cloud Music styles

The figure shows all the styles; they are extracted with XPath.

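    from lxml import etree

    # Method of the crawler class; self.session is assumed to be a
    # requests.Session and logger a standard logging.Logger.
    def get_all_style(self):
        """
        Get all NetEase Cloud Music styles.
        Returns: list of dicts holding each style's link (href) and name (cat)

        """
        style_list = []

        url = 'https://music.163.com/discover/playlist'
        res = self.session.get(url)

        selector = etree.HTML(res.text)
        # The second <dd> on the page holds the style links
        styles = selector.xpath('//dd')[1]
        for style in styles.xpath('a'):
            href = str(style.xpath('@href')[0])
            cat = str(style.xpath('text()')[0])
            style_list.append(dict(href=href, cat=cat))

        logger.info("Crawled [%s] styles [%s]", len(style_list), ','.join(style['cat'] for style in style_list))
        return style_list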
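To make the "second dd, then every a inside it" selection concrete without hitting the network, here is a minimal sketch. It is not the author's code: it swaps lxml for the standard library's xml.etree.ElementTree (which supports only a subset of XPath and needs well-formed markup), and SAMPLE_MARKUP is a hypothetical, heavily simplified stand-in for the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is much larger and may differ.
SAMPLE_MARKUP = """
<div>
  <dl><dt>语种</dt><dd><a href="/discover/playlist/?cat=华语">华语</a></dd></dl>
  <dl><dt>风格</dt><dd>
    <a href="/discover/playlist/?cat=摇滚">摇滚</a>
    <a href="/discover/playlist/?cat=爵士">爵士</a>
  </dd></dl>
</div>
"""


def extract_styles(markup):
    """Return [{'href': ..., 'cat': ...}] for the links inside the second <dd>."""
    root = ET.fromstring(markup)
    # Same idea as selector.xpath('//dd')[1]: all <dd> nodes in document order
    style_dd = root.findall('.//dd')[1]
    return [dict(href=a.get('href'), cat=a.text) for a in style_dd.findall('a')]


styles = extract_styles(SAMPLE_MARKUP)
print(styles)
```

Picking the block by position is brittle: if the site adds or reorders a category row, index 1 silently points at the wrong links, so a check on the number of styles returned is a cheap safeguard.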

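II. Crawl all playlists under each NetEase Cloud Music style

The page structure of the playlists under each style is shown in the figure below:

XPath is again used to extract the required content.

After extracting the total page count, recurse through the pages to collect the required playlist information (the playlist ids).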
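The page walk can also be written as a plain loop over page numbers. The sketch below is not the author's code: build_page_urls is a hypothetical helper, and it assumes the URL pattern used by the crawler (35 playlists per page, with offset counted in playlists rather than pages).

```python
from urllib.parse import quote

PAGE_SIZE = 35  # playlists per page, matching limit=35 in the crawler's URL


def build_page_urls(cat, end_page):
    """Return one listing URL per page, for pages 1..end_page of a style."""
    base = ('https://music.163.com/discover/playlist/'
            '?order=hot&cat={}&limit={}&offset={}')
    return [base.format(quote(cat), PAGE_SIZE, (page - 1) * PAGE_SIZE)
            for page in range(1, end_page + 1)]


urls = build_page_urls('摇滚', 3)
for url in urls:
    print(url)
```

Page 1 maps to offset 0, page 2 to offset 35, and so on. A loop like this avoids passing a shared dict through recursive calls, at the cost of needing the last page number before the loop starts.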

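The full set of playlists contains many identical playlists and a lot of duplicated songs, so analyzing every playlist is not very meaningful.

The best strategy is to go to the most-played playlists and catch the wildest hot comments there.

So in the end I kept only the playlists with more than 1,000,000 plays and treated them as the popular playlists for analysis.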

    from urllib.parse import quote

    from lxml import etree

    def get_style_playlist(self, style_, offset=1, play_list_ids=None):
        """
        Crawl all playlists under a NetEase Cloud Music style category.
        Args:
            style_: style info
            offset: page number, starting at 1
            play_list_ids: dict of playlist id -> play count, shared across pages

        Returns: dict of playlist ids

        """
        if play_list_ids is None:
            play_list_ids = {}

        cat = style_['cat']
        logger.info("Crawling page [%s] of [%s] playlists", offset, cat)
        url = 'https://music.163.com/discover/playlist/?order=hot&cat={}&limit=35&offset={}'.format(
            quote(cat), (offset - 1) * 35)

        res = self.session.get(url)
        selector = etree.HTML(res.text)

        bottoms = selector.xpath('//ul[@id="m-pl-container"]/li/div/div[@class="bottom"]')
        for bom in bottoms:
            # Playlist id
            playlist_id = str(bom.xpath('a/@data-res-id')[0])
            # Play count; a value such as "123万" is expanded to 1230000
            volume = int(str(bom.xpath('span[@class="nb"]/text()')[0]).replace("万", '0000'))

            self.play_list_count += 1
            # Keep only playlists with more than 1,000,000 plays
            if volume > 1000000:
                play_list_ids[playlist_id] = volume

        if offset == 1:
            # The second-to-last pager link holds the last page number;
            # a style with a single page may have no pager at all
            pages = selector.xpath('//div[@class="u-page"]/a/text()')
            end_page = int(pages[-2]) if len(pages) >= 2 else 1
            for page in range(2, end_page + 1):
                self.get_style_playlist(style_, page, play_list_ids)

        return play_list_ids
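The crawler turns a play count such as "123万" into a number by replacing "万" with four zeros, which only works when the part before "万" is a whole number. The sketch below is a slightly more defensive variant under stated assumptions: parse_play_count is a hypothetical helper, and whether the page ever shows decimals ("1.5万") or the "亿" unit is an assumption, not something the post confirms.

```python
from decimal import Decimal

# Multipliers for the Chinese magnitude suffixes handled here
UNITS = {'万': 10_000, '亿': 100_000_000}


def parse_play_count(text):
    """Convert a displayed play count such as '123万' or '98765' to an int."""
    text = text.strip()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            # Decimal avoids float rounding surprises for inputs like '2.3万'
            return int(Decimal(text[:-1]) * factor)
    return int(text)


for shown in ('123万', '1.5万', '98765', '100万'):
    count = parse_play_count(shown)
    print(shown, count, count > 1_000_000)
```

Note that the filter is strictly "greater than", so a playlist displayed as exactly "100万" does not pass it.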

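III. Get all songs in a playlist

Let's happily extract the song information from the web page... no way!

Fine, let's see whether there is an API we can imitate... no way!

The web page shows only 10 songs per playlist, and there is no ajax request either!

After some brainstorming, I finally found a Linux-client API that can return up to 1000 songs per playlist.

OK, that's the one. Here is the code: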

    import json
    import random
    import time

    import requests

    # music163_encrypt and send_task are the author's own helpers
    def get_playlist(self, _id):
        """
        Get all songs in a playlist.
        Args:
            _id: playlist id

        Returns: None; every new song is pushed onto the hot comment queue

        """
        # Random pause between requests to avoid hammering the server
        time.sleep(random.randint(2, 5))

        logger.info("Fetching all songs of playlist [%s]", _id)
        url = "http://music.163.com/api/v3/playlist/detail"
        param = dict(id=_id, total="true", limit=1000, n=1000, offset=0)
        try:
            res = requests.post(url=url, headers=self.headers, data=music163_encrypt.get_form_data(param, 'linux'),
                                timeout=7)
        except Exception as e:
            logger.error("Request failed,%s,%s", _id, e)
            return
        if not res.ok:
            logger.error("Request failed,%s,HTTP %s", _id, res.status_code)
            return

        try:
            text = res.json()
            tracks = (text.get('playlist') or {}).get('tracks')
            if tracks is None:
                logger.error('Unexpected JSON content,%s', text)
                return

            for song in tracks:
                song_request = {'_id': song['id'],
                                'name': song['name'],
                                'artists': '/'.join(ar['name'] for ar in song['ar'])}
                self.song_count += 1
                if song['id'] not in self.song_dict:
                    # First time this song is seen: mark it and queue it
                    self.song_dict[song['id']] = 0
                    logger.info(','.join(str(v) for v in song_request.values()))
                    send_task('handle_music_163_song_hot_comment', json.dumps(song_request))
                else:
                    logger.info("Song already seen, skipping...")
        except Exception as e:
            logger.error('Failed to get songs of playlist [%s],%s', _id, e)
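Deduplication is what shrinks 787223 songs to 174845, so the "seen before?" check is worth isolating. The following is a minimal sketch rather than the author's code: extract_new_songs is a hypothetical helper, the two payloads are hand-written stand-ins trimmed to the fields the crawler reads, and the numeric ids are made up.

```python
def extract_new_songs(payload, seen_ids):
    """Return the songs in payload whose id is not yet in seen_ids.

    seen_ids is updated in place, so it can be shared across playlists.
    """
    tracks = (payload.get('playlist') or {}).get('tracks') or []
    new_songs = []
    for song in tracks:
        if song['id'] in seen_ids:
            continue  # already queued from an earlier playlist
        seen_ids.add(song['id'])
        new_songs.append({'_id': song['id'],
                          'name': song['name'],
                          'artists': '/'.join(ar['name'] for ar in song['ar'])})
    return new_songs


# Hypothetical responses; ids 1-3 are placeholders, not real song ids.
PLAYLIST_A = {'playlist': {'tracks': [
    {'id': 1, 'name': '海底', 'ar': [{'name': '一支榴莲'}]},
    {'id': 2, 'name': 'MOM', 'ar': [{'name': '蜡笔小心'}]},
]}}
PLAYLIST_B = {'playlist': {'tracks': [
    {'id': 2, 'name': 'MOM', 'ar': [{'name': '蜡笔小心'}]},
    {'id': 3, 'name': '丢了你', 'ar': [{'name': '井胧'}]},
]}}

seen = set()
first = extract_new_songs(PLAYLIST_A, seen)
second = extract_new_songs(PLAYLIST_B, seen)
print(len(first), len(second), sorted(seen))
```

A set kept in memory is enough for a single overnight run; resuming after a crash would need the seen ids persisted somewhere, which this sketch does not attempt.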

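IV. Fetch song hot comments through a queue

If you read carefully, you will have noticed that as soon as I get a song's information, I push it straight onto a queue.

For convenience, I first wrapped the RabbitMQ calls in a small Python utility module, shown below.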

    import pika


    def send_task(queue_name, task):
        connection = pika.BlockingConnection(pika.ConnectionParameters(
            'localhost'))
        channel = connection.channel()
        # Declare an exchange
        channel.exchange_declare(exchange='messages', exchange_type='direct')
        # Declare a durable queue
        channel.queue_declare(queue=queue_name, durable=True)
        # In RabbitMQ a message can never be sent directly to the queue, it always needs to go through an exchange.
        channel.queue_bind(exchange='messages',
                           queue=queue_name,
                           routing_key=queue_name)

        channel.basic_publish(exchange='messages',
                              routing_key=queue_name,
                              body=task,
                              properties=pika.BasicProperties(
                                  delivery_mode=2,  # make message persistent
                              )
                              )
        connection.close()


    def recv_task(queue_name, callback):
        connection = pika.BlockingConnection(pika.ConnectionParameters(
            'localhost'))
        channel = connection.channel()
        # Declare an exchange
        channel.exchange_declare(exchange='messages', exchange_type='direct')
        # Declare a durable queue
        channel.queue_declare(queue=queue_name, durable=True)

        # In RabbitMQ a message can never be sent directly to the queue, it always needs to go through an exchange.
        channel.queue_bind(exchange='messages',
                           queue=queue_name,
                           routing_key=queue_name)
        # Deliver at most 10 unacknowledged messages at a time (the original
        # also set prefetch_count=1 after basic_consume; only one value is kept)
        channel.basic_qos(prefetch_count=10)
        # pika >= 1.0 signature; pika 0.x used
        # basic_consume(callback, queue=queue_name, no_ack=False)
        channel.basic_consume(queue=queue_name,
                              on_message_callback=callback,
                              auto_ack=False)
        # Print before start_consuming(), which blocks until consuming stops
        print(' [*] Waiting for messages. To exit press CTRL+C')
        channel.start_consuming()
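For readers without a RabbitMQ broker at hand, the producer/consumer shape can be tried in a single process. The sketch below is a stand-in, not the author's setup: it uses the standard library's queue.Queue and threads instead of RabbitMQ, so there is no persistence and no delivery across processes. run_pipeline and handle_song are hypothetical names; put() plays the role of send_task and task_done() the role of the ack.

```python
import json
import queue
import threading


def run_pipeline(songs, handle_song, worker_count=2):
    """Push songs through an in-process queue and collect handler results."""
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            body = tasks.get()
            try:
                if body is None:  # sentinel: no more work for this worker
                    return
                result = handle_song(json.loads(body))
                with lock:
                    results.append(result)
            finally:
                tasks.task_done()  # comparable to acknowledging the message

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for song in songs:
        tasks.put(json.dumps(song))  # comparable to send_task(queue, body)
    for _ in threads:
        tasks.put(None)
    tasks.join()
    for t in threads:
        t.join()
    return results


demo = run_pipeline([{'_id': 1, 'name': 'a'}, {'_id': 2, 'name': 'b'}],
                    lambda song: song['_id'] * 10)
print(sorted(demo))
```

The results arrive in whatever order the workers finish, which is why the demo sorts them before printing.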

Below is the main code that consumes the queue and processes each song's hot comments:


    import json
    import threading

    import pika
    import requests

    def request_music163_hot_comments(self, song):
        """
        Get the hot comments of a NetEase Cloud Music song.
        Args:
            song: song info

        Returns:

        """
        _id = str(song['_id'])

        # Comments of a new song
        try:
            self.create_hot_comments_file()

            # cursor is a timestamp: -1 for the first page; for a later page,
            # use the timestamp of the last comment on the previous page
            cursor = -1
            index = 1
            offset = 0
            param1 = """{"csrf_token": "", "cursor": "%s", "offset": "%s", "orderType": "1","pageNo": "%s","pageSize": "20", "rid": "R_SO_4_%s", "threadId": "R_SO_4_%s"}"""
            requests.packages.urllib3.disable_warnings()

            arg1 = param1 % (cursor, offset, index, _id, _id)
            r = self.session.post(self.url, headers=self.session.headers,
                                  data=music163_encrypt.get_form_data(arg1),
                                  verify=False,
                                  timeout=5)
            result = r.json()
            try:
                # Hot comments
                hotComments = result["data"]['hotComments']
                if hotComments is not None:
                    logger.info('Crawling hot comments of song ID [%s]', _id)
                    for comment in hotComments:
                        self.parse_comment_info(song, comment, True)
            except Exception as e:
                logger.error(e)
        except Exception as e:
            logger.error('Failed to get NetEase Cloud Music comments. Params [%s]%s', song, e)


    def get_comment_info(self, ch, method, properties, body):

        song = json.loads(body)
        # Do the slow HTTP work in a thread so the connection can keep
        # processing events (such as heartbeats) while we wait
        thread = threading.Thread(target=self.request_music163_hot_comments, args=(song,))
        thread.start()
        while thread.is_alive():
            # Loop while the thread is processing
            ch._connection.sleep(1.0)
        logger.info('Back from thread')

        ch.basic_ack(delivery_tag=method.delivery_tag)  # send the ack

    def start(self):
        """
        Consume the queue and process the comments of each song.
        Returns:

        """
        try:
            queue_name = 'handle_music_163_song_hot_comment'
            connection = pika.BlockingConnection(pika.ConnectionParameters(
                'localhost'))
            channel = connection.channel()
            # Declare an exchange
            channel.exchange_declare(exchange='messages', exchange_type='direct')
            # Declare a durable queue
            channel.queue_declare(queue=queue_name, durable=True)
            # In RabbitMQ a message can never be sent directly to the queue, it always needs to go through an exchange.
            channel.queue_bind(exchange='messages',
                               queue=queue_name,
                               routing_key=queue_name)
            channel.basic_qos(prefetch_count=10)
            # pika >= 1.0 signature; pika 0.x used
            # basic_consume(self.get_comment_info, queue=queue_name, no_ack=False)
            channel.basic_consume(queue=queue_name,
                                  on_message_callback=self.get_comment_info,
                                  auto_ack=False)
            channel.start_consuming()
        except Exception as e:
            logger.error(e)
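The request body above is assembled with % string formatting, which is easy to break with a stray quote. A minimal alternative sketch, assuming the same field names and first-page defaults as the code above: build_comment_params is a hypothetical helper, the song id 123 is a placeholder, and the result would still have to go through the author's music163_encrypt.get_form_data before being posted.

```python
import json


def build_comment_params(song_id, cursor=-1, offset=0, page_no=1, page_size=20):
    """Build the JSON request body for one page of a song's comment thread.

    Defaults mirror the first-page values used by the crawler; every value
    is sent as a string, as in the original template.
    """
    thread_id = 'R_SO_4_{}'.format(song_id)
    return json.dumps({
        'csrf_token': '',
        'cursor': str(cursor),
        'offset': str(offset),
        'orderType': '1',
        'pageNo': str(page_no),
        'pageSize': str(page_size),
        'rid': thread_id,
        'threadId': thread_id,
    })


body = build_comment_params(123)  # 123 is a placeholder id
print(body)
```

Letting json.dumps do the quoting keeps the payload valid even if a value ever contains a quote or a backslash.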

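V. Visual analysis of hot comments

Here are a few charts from the hot comment analysis, which mainly covers the 10,000 most-liked hot comments. The next post will walk through the visualization and the sentiment analysis in Python in detail.

1. Distribution of users across the country

As of January 1, 2021, in the nationwide NetEase Cloud Music hot comment power ranking, Guangdong takes the crown with Jiangsu close behind; Sichuan, Beijing, Zhejiang and Shandong are still locked in a tight fight, and the other regions were knocked out long ago.

2. User age distribution

The age distribution curve of the hot comment ranking confirms one fact: young people aged 19 to 30 are indeed the main force of the "网抑云" crowd (a pun nickname for NetEase Cloud Music, roughly "Net Depression Cloud").

3. User gender ratio

Male users account for 52.46% of the hot comments, while female users account for only 35.25%; the remaining 12.29% is not broken down here.

4. Distribution of hot comments by year

The peak in 2017 probably comes from NetEase Cloud Music going viral. The steep climb across 2019-2020 comes from everyone who was quarantined by the virus and struggling through life. Human joys and sorrows can be shared after all.

5. Distribution of hot comments by month

The monthly hot comment power peaks as the Spring Festival approaches, fades as the festival passes and slides all the way to the bottom, then bounces back after lying flat and taking the mockery in May.

6. Distribution over 24 hours

More views + more empathy = hot comments

After work + before bed = more views + more empathy

Main time range of hot comments: [after work : before bed]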
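The year, month and hour breakdowns all come down to counting one field of the comment timestamp. As a minimal sketch (not the author's analysis code, which is promised for the next post), the snippet below counts hours and months for the ten timestamps listed in the summary; distribution is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime

# Posting times of the ten hot comments quoted in the summary
TOP10_TIMES = [
    '2020-03-25 16:27:50', '2020-03-09 01:11:00', '2020-06-03 00:16:22',
    '2020-03-27 21:34:23', '2020-03-09 04:20:35', '2020-04-03 21:45:28',
    '2020-04-25 02:18:23', '2020-09-22 00:00:29', '2020-04-25 00:06:19',
    '2020-02-28 03:22:03',
]


def distribution(timestamps, field):
    """Count timestamps by one datetime attribute, e.g. 'hour' or 'month'."""
    parsed = (datetime.strptime(ts, '%Y-%m-%d %H:%M:%S') for ts in timestamps)
    return Counter(getattr(dt, field) for dt in parsed)


by_hour = distribution(TOP10_TIMES, 'hour')
by_month = distribution(TOP10_TIMES, 'month')
print(sorted(by_hour.items()))
print(sorted(by_month.items()))
```

Ten comments are far too few to draw conclusions from, but even here nine of the ten were posted between 21:00 and 05:00, which is in line with the after-work, before-bed pattern described above.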

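Conclusion

And with that, the job is done. Once the crawler was written, it ran overnight and collected all the hot comment data.

This post is for learning purposes only; corrections are welcome.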

NetEase Cloud Music hot comment crawler (2): Are you among the 1.3 million hot comments?
