QQ Music Crawler: Downloading Chart Songs

Today we'll build a crawler for QQ Music that downloads the songs listed on a chart.

Home page

Chart contents

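    A quick look at the page shows that it is rendered dynamically, so we need to capture the requests that carry the data we want. QQ Music changes its interface from time to time, so the rules differ from one release to the next; the code here is written against the rules in effect at the time of writing. Here is a shortcut: the requests that carry song data almost all contain the keyword fcg, so you can filter on it directly. You can also check each request's Preview tab and judge for yourself.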
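
The keyword filter described above can be sketched in a few lines. This is only an illustration, assuming the captured request URLs have been exported to a list; the helper name `filter_fcg` and the third sample URL are hypothetical.

```python
from urllib.parse import urlsplit

def filter_fcg(urls):
    """Keep only the URLs whose path contains the 'fcg' keyword."""
    return [u for u in urls if "fcg" in urlsplit(u).path]

captured = [
    "https://c.y.qq.com/v8/fcg-bin/fcg_v8_toplist_cp.fcg?topid=4",
    "https://u.y.qq.com/cgi-bin/musicu.fcg?format=json",
    "https://y.qq.com/static/example.js",  # hypothetical non-data request
]

song_requests = filter_fcg(captured)
print(song_requests)
```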


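    This is the request that returns the chart's song information, and it is the entry point for our crawler. It gives us each song's basic information, including its id (songmid) and its name. We will need both later, so keep them in mind while we work through the analysis step by step.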
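
As a rough sketch of what this step yields, the snippet below pulls the songmid and name out of a chart response. The sample JSON is a hypothetical, heavily trimmed stand-in that keeps only the fields the crawler reads (`songlist`, `data`, `songmid`, `songname`); the helper name `extract_songs` is also an assumption.

```python
import json

# Hypothetical, trimmed response; real responses carry many more fields.
sample = '''
{"update_time": "2019-01-08",
 "songlist": [
   {"data": {"songmid": "mid_a", "songname": "Song A"}},
   {"data": {"songmid": "mid_b", "songname": "Song B"}}
 ]}
'''

def extract_songs(chart):
    """Return (songmid, songname) pairs from a parsed chart response."""
    return [(item["data"]["songmid"], item["data"]["songname"])
            for item in chart.get("songlist", [])]

songs = extract_songs(json.loads(sample))
print(songs)
```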

    Next, open the player page and capture the request for the song's media file; just grab the media request directly.



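    Comparing the download links of different songs shows what they share and where they differ. Parameters such as guid and format are fixed across songs; the only parts that change are the string after C400 (on closer inspection, this is the songmid) and the vkey value.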
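
This kind of comparison can also be done programmatically. The sketch below diffs two links with `urllib.parse`; both URLs are hypothetical placeholders shaped like the captured ones (the songmid and vkey values are made up), and `diff_urls` is an illustrative helper name.

```python
from urllib.parse import urlsplit, parse_qs

def diff_urls(url_a, url_b):
    """Report whether the paths differ and which query parameters differ."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    qa, qb = parse_qs(a.query), parse_qs(b.query)
    changed = sorted(k for k in qa.keys() | qb.keys() if qa.get(k) != qb.get(k))
    return a.path != b.path, changed

# Hypothetical links with placeholder values.
url_1 = "http://dl.stream.qqmusic.qq.com/C400midA.m4a?guid=953482270&vkey=AAAA&uin=0"
url_2 = "http://dl.stream.qqmusic.qq.com/C400midB.m4a?guid=953482270&vkey=BBBB&uin=0"

path_differs, changed_params = diff_urls(url_1, url_2)
print(path_differs, changed_params)
```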


    Next we capture the request related to vkey; its name alone makes it clear that this is the one we are after.


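    This is the vkey response. After tidying it up (for example with an online JSON formatter), a quick comparison shows where the vkey is stored. The purl field holds the C400 part of the download link together with the vkey that follows it, which saves us a lot of work.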
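
To make the relationship between purl, vkey and the download link concrete, here is a small sketch. The response is a hypothetical, trimmed stand-in with placeholder values; only the path `req_0 -> data -> midurlinfo -> purl` and the download host come from the code in this post, and `build_download_url` is an illustrative name.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical, trimmed response; values are placeholders.
sample = {
    "req_0": {"data": {"midurlinfo": [
        {"purl": "C400midA.m4a?guid=953482270&vkey=AAAA&uin=0"}
    ]}}
}

def build_download_url(vkey_response, host="http://dl.stream.qqmusic.qq.com/"):
    """Return the full download URL, or None when no purl was returned."""
    purl = vkey_response["req_0"]["data"]["midurlinfo"][0]["purl"]
    return host + purl if purl else None

url = build_download_url(sample)
vkey = parse_qs(urlsplit(url).query)["vkey"][0]
print(url)
print(vkey)
```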

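    Looking at the real request URL shown in the headers of the vkey request, its query parameters match the contents of the data payload, and the only thing that differs between songs is the songmid inside data. So all we need to do is build the URL with the right songmid and fetch it.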

def getVkey(songmid):
    vkey_url = "https://u.y.qq.com/cgi-bin/musicu.fcg?-=getplaysongvkey05137740976859173&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0&data=%7B%22req%22%3A%7B%22module%22%3A%22CDN.SrfCdnDispatchServer%22%2C%22method%22%3A%22GetCdnDispatch%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22calltype%22%3A0%2C%22userip%22%3A%22%22%7D%7D%2C%22req_0%22%3A%7B%22module%22%3A%22vkey.GetVkeyServer%22%2C%22method%22%3A%22CgiGetVkey%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22songmid%22%3A%5B%22{0}%22%5D%2C%22songtype%22%3A%5B0%5D%2C%22uin%22%3A%220%22%2C%22loginflag%22%3A1%2C%22platform%22%3A%2220%22%7D%7D%2C%22comm%22%3A%7B%22uin%22%3A0%2C%22format%22%3A%22json%22%2C%22ct%22%3A24%2C%22cv%22%3A0%7D%7D".format(songmid)
    res = requests.get(url=vkey_url)
    time.sleep(0.5)
    res02 = json.loads(res.text)
    vkey = res02["req_0"]["data"]["midurlinfo"][0]["purl"]
    return vkey
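
The hand-encoded URL in getVkey is hard to read and easy to break. As an alternative sketch, the same request can be assembled from a plain dict with `json.dumps` and `urllib.parse.urlencode`. The parameter names and values below are decoded from the URL above; the function name `build_vkey_url` is an assumption, and whether the server still accepts this request is not something this snippet can confirm.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

def build_vkey_url(songmid, guid="953482270"):
    """Build the vkey request URL for one songmid."""
    data = {
        "req": {"module": "CDN.SrfCdnDispatchServer", "method": "GetCdnDispatch",
                "param": {"guid": guid, "calltype": 0, "userip": ""}},
        "req_0": {"module": "vkey.GetVkeyServer", "method": "CgiGetVkey",
                  "param": {"guid": guid, "songmid": [songmid], "songtype": [0],
                            "uin": "0", "loginflag": 1, "platform": "20"}},
        "comm": {"uin": 0, "format": "json", "ct": 24, "cv": 0},
    }
    params = {
        "-": "getplaysongvkey05137740976859173",
        "g_tk": "5381", "loginUin": "0", "hostUin": "0", "format": "json",
        "inCharset": "utf8", "outCharset": "utf-8", "notice": "0",
        "platform": "yqq.json", "needNewCode": "0",
        # Compact separators keep the payload free of spaces before encoding.
        "data": json.dumps(data, separators=(",", ":")),
    }
    return "https://u.y.qq.com/cgi-bin/musicu.fcg?" + urlencode(params)

example_url = build_vkey_url("abc")
print(example_url)
```

Building the payload this way keeps the songmid a normal function argument instead of a `{0}` placeholder buried in a percent-encoded string.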

    Testing with the songmid and vkey of an arbitrary song confirms that the download works. That completes the whole flow, which boils down to:

  1. Get the song's songmid
  2. Use the songmid to get the vkey
  3. Download the song from the link built with the vkey
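
The three steps can be wired together as in the sketch below. It is not the full implementation: the network calls are passed in as functions so the flow can be tried offline, and the names `download_chart`, `get_purl` and `save` as well as the sample values are hypothetical. Treating an empty purl as "no download link issued" is an assumption.

```python
def download_chart(songs, get_purl, save, host="http://dl.stream.qqmusic.qq.com/"):
    """Run steps 2 and 3 for every (songmid, name) pair produced by step 1.

    get_purl(songmid) and save(url, name) are supplied by the caller.
    Returns the names of the songs that were saved.
    """
    saved = []
    for songmid, name in songs:
        purl = get_purl(songmid)      # step 2: songmid -> purl (contains the vkey)
        if not purl:                  # assumption: empty purl means no link was issued
            continue
        save(host + purl, name)       # step 3: fetch the song from the combined link
        saved.append(name)
    return saved

# Offline stand-ins for the two network calls.
fake_purls = {"midA": "C400midA.m4a?vkey=AAAA", "midB": ""}
log = []
result = download_chart(
    [("midA", "Song A"), ("midB", "Song B")],
    get_purl=fake_purls.get,
    save=lambda url, name: log.append((url, name)),
)
print(result)
```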

Code implementation

#!/usr/bin/env python
# -*- coding:utf-8 -*-
'''
@author: maya
@contact: [email protected]
@software: Pycharm
@file: music.py
@time: 2019/1/8 12:48
@desc:
'''
import json
import requests
import time
import os
import urllib.request

headers = {
        "cookie": 'RK=51FHFw4aE8; pgv_pvi=8430643200; ptcz=83cfc479ce75c5a1416df7d87136166109888f38587d9944738abca7ab77d17c; tvfe_boss_uuid=e4ba183f02ae980f; pgv_pvid=3169027098; pgv_pvid_new=2426636288_14882e87533; mobileUV=1_15f666e2b04_e8a50; pac_uid=1_1278077260; eas_sid=l1C5q306s9W2d845F9u7f1K1U6; ptui_loginuin=40370953; o_cookie=1278077260; luin=o1278077260; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%221669eddcdc5156-0905303c6ff588-7d113749-1049088-1669eddcdc83f8%22%2C%22%24device_id%22%3A%221669eddcdc5156-0905303c6ff588-7d113749-1049088-1669eddcdc83f8%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; lskey=00010000a5727043706a88a2aebf6044daf687035fcc0804760fd13cac0729275356f7aa88d5157b46210ea6; LW_sid=y1s5J425D4j7u9N1Q8Q0j2k383; LW_uid=p1q5u4d584A7f971l820z2k3M9; ts_uid=4705118039; yq_index=0; uin=o1278077260; skey=@mXN9mj3as; p_uin=o1278077260; pt4_token=cVwioR9KifEllUyD2CPEXz692iNhDH8JE-YwH*5TlRY_; p_skey=BE7HSxnTeFIPwrO6sJ*YXyA1xKGxT072f5YAo919LSY_; yqq_stat=0; pgv_si=s3828307968; pgv_info=ssid=s3773836208; ts_last=y.qq.com/n/yqq/toplist/4.html; ts_refer=link.zhihu.com/%3Ftarget%3Dhttps%253A//y.qq.com/n/yqq/toplist/4.html%2523stat%253Dy_new.toplist.menu.4',
        "user-agent": 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3493.3 Safari/537.36'

    }

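def getHtml(start_url):
    """Fetch a URL and return the parsed JSON, or an empty dict on failure."""
    try:
        r = requests.get(start_url, headers=headers)
        r.encoding = r.apparent_encoding
        return json.loads(r.text)
    except (requests.RequestException, ValueError):
        # Network errors and invalid JSON both end up here.
        return {}

def getSongMid(html):
    """Return a list of [songmid, songname] pairs from the chart JSON."""
    songmid = []
    for tid in html.get('songlist', []):
        songmid.append([tid['data']['songmid'], tid['data']['songname']])
    return songmid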

def getSong(html):
    if not html:
        return  # the chart request failed, so there is nothing to download
    # Some update_time values look like 2018-5 while the request needs 2018-05, so zero-pad each part.
    update_key = "-".join(part.zfill(2) for part in html['update_time'].split("-"))
    start_index = 0
    while True:
        start_num = start_index * 30  # the chart is fetched 30 songs at a time
        start_index += 1
        page_url = "https://c.y.qq.com/v8/fcg-bin/fcg_v8_toplist_cp.fcg?tpl=3&page=detail&date={0}&topid=4&type=top&song_begin={1}&song_num=30&g_tk=1154346586&loginUin=1278077260&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0".format(
            update_key, start_num)
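        json_text = getHtml(page_url)
        songinfo = getSongMid(json_text)
        if len(songinfo) == 0:
            break
        for sid in songinfo:
            vkey = getVkey(sid[0])  # get the vkey (purl) for each song
            if not vkey:
                continue  # no purl was returned, so there is nothing to download
            saveMusic(sid[0], vkey, sid[1])  # save the song
            time.sleep(1)  # sleep for 1 second so the server does not filter us out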

def getVkey(songmid):
    vkey_url = "https://u.y.qq.com/cgi-bin/musicu.fcg?-=getplaysongvkey05137740976859173&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0&data=%7B%22req%22%3A%7B%22module%22%3A%22CDN.SrfCdnDispatchServer%22%2C%22method%22%3A%22GetCdnDispatch%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22calltype%22%3A0%2C%22userip%22%3A%22%22%7D%7D%2C%22req_0%22%3A%7B%22module%22%3A%22vkey.GetVkeyServer%22%2C%22method%22%3A%22CgiGetVkey%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22songmid%22%3A%5B%22{0}%22%5D%2C%22songtype%22%3A%5B0%5D%2C%22uin%22%3A%220%22%2C%22loginflag%22%3A1%2C%22platform%22%3A%2220%22%7D%7D%2C%22comm%22%3A%7B%22uin%22%3A0%2C%22format%22%3A%22json%22%2C%22ct%22%3A24%2C%22cv%22%3A0%7D%7D".format(songmid)
    res = requests.get(url=vkey_url)
    time.sleep(0.5)
    res02 = json.loads(res.text)
    vkey = res02["req_0"]["data"]["midurlinfo"][0]["purl"]
    return vkey



def saveMusic(songmid, vkey, name):
    # Copy the shared headers so the Host override does not leak into later chart requests.
    dl_headers = dict(headers, Host='dl.stream.qqmusic.qq.com')
    url = "http://dl.stream.qqmusic.qq.com/" + vkey
    res = requests.get(url, headers=dl_headers, stream=True)
    # Strip characters that are not allowed in file names.
    safe_name = name.replace("?", "").replace("/", "_").replace("\\", "_").replace("\"", "")
    filename = 'song/{0}.m4a'.format(safe_name)

    print("*****    Downloading    *****")
    print(url)
    print("***** Song: {}".format(safe_name))

    with open(filename, 'wb') as f:
        f.write(res.raw.read())
    # Check the reported size with urllib; the header is a string, so convert it before comparing.
    try:
        size = int(urllib.request.urlopen(url).getheader('Content-Length') or 0)
    except (OSError, ValueError):
        size = 0
    if size > 0:
        print("Downloaded song: {}".format(safe_name))
    else:
        print("Download failed")
        os.remove(filename)

if __name__ == '__main__':
    start_url = "https://c.y.qq.com/v8/fcg-bin/fcg_v8_toplist_cp.fcg?tpl=3&page=detail&date=2019-01-08&topid=4&type=top&song_begin=0&song_num=30&g_tk=1154346586&loginUin=1278077260&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0"
    text = getHtml(start_url)
    getSong(text)


Multithreaded version:

import requests
import json
import time
from datetime import datetime
import threading



date_time=datetime.now().date()
def func(num):
    starturl="https://c.y.qq.com/v8/fcg-bin/fcg_v8_toplist_cp.fcg?tpl=3&page=detail&date={0}&topid=4&type=top&song_begin={1}&song_num=30&g_tk=1285181755&loginUin=2521763805&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0".format(date_time,num*30)
    print(starturl)
    headers = {
    "cookie": 'RK=51FHFw4aE8; pgv_pvi=8430643200; ptcz=83cfc479ce75c5a1416df7d87136166109888f38587d9944738abca7ab77d17c; tvfe_boss_uuid=e4ba183f02ae980f; pgv_pvid=3169027098; pgv_pvid_new=2426636288_14882e87533; mobileUV=1_15f666e2b04_e8a50; pac_uid=1_1278077260; eas_sid=l1C5q306s9W2d845F9u7f1K1U6; ptui_loginuin=40370953; o_cookie=1278077260; luin=o1278077260; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%221669eddcdc5156-0905303c6ff588-7d113749-1049088-1669eddcdc83f8%22%2C%22%24device_id%22%3A%221669eddcdc5156-0905303c6ff588-7d113749-1049088-1669eddcdc83f8%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; lskey=00010000a5727043706a88a2aebf6044daf687035fcc0804760fd13cac0729275356f7aa88d5157b46210ea6; LW_sid=y1s5J425D4j7u9N1Q8Q0j2k383; LW_uid=p1q5u4d584A7f971l820z2k3M9; ts_uid=4705118039; yq_index=0; uin=o1278077260; skey=@mXN9mj3as; p_uin=o1278077260; pt4_token=cVwioR9KifEllUyD2CPEXz692iNhDH8JE-YwH*5TlRY_; p_skey=BE7HSxnTeFIPwrO6sJ*YXyA1xKGxT072f5YAo919LSY_; yqq_stat=0; pgv_si=s3828307968; pgv_info=ssid=s3773836208; ts_last=y.qq.com/n/yqq/toplist/4.html; ts_refer=link.zhihu.com/%3Ftarget%3Dhttps%253A//y.qq.com/n/yqq/toplist/4.html%2523stat%253Dy_new.toplist.menu.4',
    "user-agent": 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3493.3 Safari/537.36'
    }
    res=requests.get(url=starturl,headers=headers)
    res=res.text
    res=json.loads(res)
    songname=[]
    songmid=[]
    for i in res["songlist"]:
        songname.append(i["data"]["songname"])
        songmid.append(i["data"]["songmid"])
    mid_name=dict(zip(songmid,songname))

    for j in mid_name:
        vkey_url ="https://u.y.qq.com/cgi-bin/musicu.fcg?-=getplaysongvkey05137740976859173&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0&data=%7B%22req%22%3A%7B%22module%22%3A%22CDN.SrfCdnDispatchServer%22%2C%22method%22%3A%22GetCdnDispatch%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22calltype%22%3A0%2C%22userip%22%3A%22%22%7D%7D%2C%22req_0%22%3A%7B%22module%22%3A%22vkey.GetVkeyServer%22%2C%22method%22%3A%22CgiGetVkey%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22songmid%22%3A%5B%22{0}%22%5D%2C%22songtype%22%3A%5B0%5D%2C%22uin%22%3A%220%22%2C%22loginflag%22%3A1%2C%22platform%22%3A%2220%22%7D%7D%2C%22comm%22%3A%7B%22uin%22%3A0%2C%22format%22%3A%22json%22%2C%22ct%22%3A24%2C%22cv%22%3A0%7D%7D".format(j)
        res02 = requests.get(url=vkey_url)
        time.sleep(0.5)
        res02 = json.loads(res02.text)
        vkey = res02["req_0"]["data"]["midurlinfo"][0]["purl"]
        url = "http://dl.stream.qqmusic.qq.com/" + vkey
        try:
            filename = "music/" + mid_name[j] + ".m4a"
            print(filename)
            res03 = requests.get(url=url, headers=headers)
            with open(filename, "wb") as f:
                f.write(res03.content)
        except (requests.RequestException, OSError):
            # Skip songs that fail to download or save.
            continue

threading_list = []
for page in range(4):
    # Pass the function and its argument separately; target=func(page) would run it immediately.
    threading_list.append(threading.Thread(target=func, args=(page,)))

for th in threading_list:
    th.start()
for th in threading_list:
    th.join()  # wait for every page to finish before the script exits

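  • The song data is checked with urllib so that songs that cannot be downloaded (because of permissions and the like) are discarded
  • The code does not create the output folders; you can modify it to do so, or simply create the folders by hand
  • For more crawler code, see GitHub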
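
For the folder issue noted in the list above, one possible modification is sketched below. It creates the output folder on demand with `os.makedirs` and reuses the same file-name clean-up as saveMusic; the helper name `target_path` is an assumption, not part of the original script.

```python
import os

def target_path(folder, name, ext=".m4a"):
    """Create the folder if needed and return a safe file path for the song."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    safe = name.replace("?", "").replace("/", "_").replace("\\", "_").replace('"', "")
    return os.path.join(folder, safe + ext)
```

Calling `target_path("song", name)` in place of the hard-coded `'song/{0}.m4a'` string would let the script run without creating the folder by hand first.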