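Today we will build a crawler for QQ Music that downloads the songs on its charts.
Home page
Chart contents
A quick look at the page shows that it is rendered dynamically, so we need to capture the network requests that carry the data we want. QQ Music updates its site from time to time, so the rules can differ from one version to the next; the code here is written against the rules in effect at the time of writing. As a shortcut, nearly all requests that carry song data contain the keyword fcg, so you can filter on it directly, or check each request's Preview tab and judge for yourself.
This is the request that returns the chart's song information, and it is the entry point for our crawler. From it we can get each song's basic information, including its ID and name. We will need these later, so keep them in mind while we work through the rest step by step.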
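As a minimal sketch of what this request gives us, the snippet below pulls (songmid, songname) pairs out of a decoded chart response. The `sample` dictionary is a hand-written stand-in that mirrors only the `songlist`, `data`, `songmid` and `songname` keys used later in this article; its values are made up, and a real response carries many more fields.

```python
def extract_songs(chart):
    """Return (songmid, songname) pairs from a decoded chart response."""
    songs = []
    # .get() keeps the function from raising when the response is empty.
    for entry in chart.get("songlist", []):
        data = entry.get("data", {})
        if "songmid" in data and "songname" in data:
            songs.append((data["songmid"], data["songname"]))
    return songs


# Hand-written sample with made-up values, not a captured response.
sample = {
    "songlist": [
        {"data": {"songmid": "mid_aaa", "songname": "Song A"}},
        {"data": {"songmid": "mid_bbb", "songname": "Song B"}},
    ]
}

print(extract_songs(sample))
```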
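Next, open the playback page and capture the song's media file; filtering the captured requests by media is enough to find it.
Comparing the download links of different songs shows what they share and where they differ. Parameters such as guid and format are fixed across songs; the only parts that change are the string after C400 (which, on closer inspection, is the songmid) and the vkey value.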
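To make the comparison concrete, the sketch below splits a download link into its parts with the standard library. The example URL is hypothetical: the songmid and vkey are placeholders, and it assumes the file name has the form `C400<songmid>.m4a` as observed above.

```python
from urllib.parse import urlparse, parse_qs


def split_download_url(url):
    """Split a download link into songmid, vkey and the remaining query parameters."""
    parsed = urlparse(url)
    filename = parsed.path.rsplit("/", 1)[-1]   # e.g. C400<songmid>.m4a
    stem = filename.rsplit(".", 1)[0]           # drop the extension
    songmid = stem[len("C400"):] if stem.startswith("C400") else None
    query = {k: v[0] for k, v in parse_qs(parsed.query).items()}
    vkey = query.pop("vkey", None)              # what is left are the fixed parameters
    return songmid, vkey, query


# Hypothetical link: SONGMID_PLACEHOLDER and VKEY_PLACEHOLDER are not real values.
example = ("http://dl.stream.qqmusic.qq.com/C400SONGMID_PLACEHOLDER.m4a"
           "?guid=953482270&vkey=VKEY_PLACEHOLDER")
songmid, vkey, fixed = split_download_url(example)
print(songmid, vkey, fixed)
```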
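Next we capture the request related to vkey; its name alone makes it easy to spot.
This is the vkey request. After tidying the response (for example by pasting it into an online JSON formatter), a quick comparison shows where the vkey is stored. The purl field already holds the part of the link from C400 onwards together with the vkey, which saves a good deal of work.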
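A small sketch of reading purl out of the decoded response follows. The key path is the one used by the code in this article; the sample response is hand-written with placeholder values, and treating an empty purl as "not downloadable" is an assumption rather than documented behaviour.

```python
def extract_purl(vkey_response):
    """Return the purl of the first song, or None when it is missing or empty."""
    try:
        purl = vkey_response["req_0"]["data"]["midurlinfo"][0]["purl"]
    except (KeyError, IndexError, TypeError):
        return None
    # Assumption: an empty purl means the song cannot be downloaded.
    return purl or None


# Hand-written sample; the purl value is a placeholder, not a real key.
sample = {
    "req_0": {
        "data": {
            "midurlinfo": [
                {"purl": "C400SONGMID_PLACEHOLDER.m4a?guid=953482270&vkey=VKEY_PLACEHOLDER"}
            ]
        }
    }
}

print(extract_purl(sample))
print(extract_purl({"req_0": {"data": {"midurlinfo": [{"purl": ""}]}}}))
```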
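Looking at the real request URL shown in the headers of the vkey request, its query parameters are essentially the ones listed under data; the only thing that differs between songs is the songmid inside data. So all we need to do is substitute the songmid into the URL and fetch it.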
def getVkey(songmid):
    # Substitute the songmid into the captured request URL ({0} is the only part that varies).
    vkey_url = "https://u.y.qq.com/cgi-bin/musicu.fcg?-=getplaysongvkey05137740976859173&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0&data=%7B%22req%22%3A%7B%22module%22%3A%22CDN.SrfCdnDispatchServer%22%2C%22method%22%3A%22GetCdnDispatch%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22calltype%22%3A0%2C%22userip%22%3A%22%22%7D%7D%2C%22req_0%22%3A%7B%22module%22%3A%22vkey.GetVkeyServer%22%2C%22method%22%3A%22CgiGetVkey%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22songmid%22%3A%5B%22{0}%22%5D%2C%22songtype%22%3A%5B0%5D%2C%22uin%22%3A%220%22%2C%22loginflag%22%3A1%2C%22platform%22%3A%2220%22%7D%7D%2C%22comm%22%3A%7B%22uin%22%3A0%2C%22format%22%3A%22json%22%2C%22ct%22%3A24%2C%22cv%22%3A0%7D%7D".format(songmid)
    res = requests.get(url=vkey_url)
    time.sleep(0.5)
    res02 = json.loads(res.text)
    # purl is the tail of the download link, including the vkey.
    vkey = res02["req_0"]["data"]["midurlinfo"][0]["purl"]
    return vkey
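The percent-encoded URL above is hard to read and easy to break when editing. As a sketch of an alternative, the `data` payload can be built as a dictionary and serialized with `json.dumps`; the field names and values below are decoded from that URL. `build_vkey_params` and `VKEY_ENDPOINT` are names introduced here for illustration, and the sketch leaves out the `-=getplaysongvkey...` parameter on the assumption that it is not required, which has not been verified against the server.

```python
import json

VKEY_ENDPOINT = "https://u.y.qq.com/cgi-bin/musicu.fcg"  # hypothetical constant name


def build_vkey_params(songmid, guid="953482270"):
    """Build the query parameters of the vkey request for one songmid."""
    payload = {
        "req": {
            "module": "CDN.SrfCdnDispatchServer",
            "method": "GetCdnDispatch",
            "param": {"guid": guid, "calltype": 0, "userip": ""},
        },
        "req_0": {
            "module": "vkey.GetVkeyServer",
            "method": "CgiGetVkey",
            "param": {
                "guid": guid,
                "songmid": [songmid],   # the only value that changes per song
                "songtype": [0],
                "uin": "0",
                "loginflag": 1,
                "platform": "20",
            },
        },
        "comm": {"uin": 0, "format": "json", "ct": 24, "cv": 0},
    }
    return {
        "g_tk": 5381, "loginUin": 0, "hostUin": 0, "format": "json",
        "inCharset": "utf8", "outCharset": "utf-8", "notice": 0,
        "platform": "yqq.json", "needNewCode": 0,
        # Compact separators keep the serialized JSON free of extra spaces.
        "data": json.dumps(payload, separators=(",", ":")),
    }


params = build_vkey_params("SONGMID_PLACEHOLDER")
print(params["data"])
```

Passing the dictionary as `requests.get(VKEY_ENDPOINT, params=params)` leaves the percent-encoding to requests instead of doing it by hand.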
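Testing with the songmid and vkey of an arbitrary song confirms that the file can be downloaded. That completes the whole flow, which comes down to:
- Get the song's songmid
- Use the songmid to get the vkey
- Fetch the song through the download link built from the vkey
Code implementation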
#!/usr/bin/env python
# -*- coding:utf-8 -*-
'''
@author: maya
@contact: [email protected]
@software: Pycharm
@file: music.py
@time: 2019/1/8 12:48
@desc:
'''
import json
import requests
import time
import os
import urllib.request
headers = {
    "cookie": 'RK=51FHFw4aE8; pgv_pvi=8430643200; ptcz=83cfc479ce75c5a1416df7d87136166109888f38587d9944738abca7ab77d17c; tvfe_boss_uuid=e4ba183f02ae980f; pgv_pvid=3169027098; pgv_pvid_new=2426636288_14882e87533; mobileUV=1_15f666e2b04_e8a50; pac_uid=1_1278077260; eas_sid=l1C5q306s9W2d845F9u7f1K1U6; ptui_loginuin=40370953; o_cookie=1278077260; luin=o1278077260; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%221669eddcdc5156-0905303c6ff588-7d113749-1049088-1669eddcdc83f8%22%2C%22%24device_id%22%3A%221669eddcdc5156-0905303c6ff588-7d113749-1049088-1669eddcdc83f8%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; lskey=00010000a5727043706a88a2aebf6044daf687035fcc0804760fd13cac0729275356f7aa88d5157b46210ea6; LW_sid=y1s5J425D4j7u9N1Q8Q0j2k383; LW_uid=p1q5u4d584A7f971l820z2k3M9; ts_uid=4705118039; yq_index=0; uin=o1278077260; skey=@mXN9mj3as; p_uin=o1278077260; pt4_token=cVwioR9KifEllUyD2CPEXz692iNhDH8JE-YwH*5TlRY_; p_skey=BE7HSxnTeFIPwrO6sJ*YXyA1xKGxT072f5YAo919LSY_; yqq_stat=0; pgv_si=s3828307968; pgv_info=ssid=s3773836208; ts_last=y.qq.com/n/yqq/toplist/4.html; ts_refer=link.zhihu.com/%3Ftarget%3Dhttps%253A//y.qq.com/n/yqq/toplist/4.html%2523stat%253Dy_new.toplist.menu.4',
    "user-agent": 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3493.3 Safari/537.36'
}
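def getHtml(start_url):
    # Fetch a URL and decode its JSON body; return an empty dict on any failure.
    try:
        r = requests.get(start_url, headers=headers)
        r.encoding = r.apparent_encoding
        text = json.loads(r.text)
        return text
    except (requests.RequestException, ValueError):
        return {}
def getSongMid(html):
    # Collect [songmid, songname] pairs from one page of the chart.
    songmid = []
    for tid in html.get('songlist', []):
        songmid.append([tid['data']['songmid'], tid['data']['songname']])
    return songmid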
def getSong(html):
    start_index = 0
    while True:
        start_num = start_index * 30  # each page holds 30 songs
        start_index += 1
        # Some update_time values have unpadded fields (for example 2018-5 where
        # the request needs 2018-05), so zero-pad them. As in the original code,
        # this only applies to values with three "_"-separated fields.
        update_key = html['update_time']
        temp_key = update_key.split("_")
        if len(temp_key) == 3:
            update_key = "_".join([temp_key[0], temp_key[1].zfill(2), temp_key[2].zfill(2)])
        page_url = "https://c.y.qq.com/v8/fcg-bin/fcg_v8_toplist_cp.fcg?tpl=3&page=detail&date={0}&topid=4&type=top&song_begin={1}&song_num=30&g_tk=1154346586&loginUin=1278077260&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0".format(
            update_key, start_num)
        json_text = getHtml(page_url)
        songinfo = getSongMid(json_text)
        if len(songinfo) == 0:
            break  # no more songs: we have walked past the last page
        for sid in songinfo:
            vkey = getVkey(sid[0])  # get the vkey for each song
            saveMusic(sid[0], vkey, sid[1])  # save the song
            time.sleep(1)  # pause for 1 second so the server does not filter us out
def getVkey(songmid):
    # Substitute the songmid into the captured request URL ({0} is the only part that varies).
    vkey_url = "https://u.y.qq.com/cgi-bin/musicu.fcg?-=getplaysongvkey05137740976859173&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0&data=%7B%22req%22%3A%7B%22module%22%3A%22CDN.SrfCdnDispatchServer%22%2C%22method%22%3A%22GetCdnDispatch%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22calltype%22%3A0%2C%22userip%22%3A%22%22%7D%7D%2C%22req_0%22%3A%7B%22module%22%3A%22vkey.GetVkeyServer%22%2C%22method%22%3A%22CgiGetVkey%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22songmid%22%3A%5B%22{0}%22%5D%2C%22songtype%22%3A%5B0%5D%2C%22uin%22%3A%220%22%2C%22loginflag%22%3A1%2C%22platform%22%3A%2220%22%7D%7D%2C%22comm%22%3A%7B%22uin%22%3A0%2C%22format%22%3A%22json%22%2C%22ct%22%3A24%2C%22cv%22%3A0%7D%7D".format(songmid)
    res = requests.get(url=vkey_url)
    time.sleep(0.5)
    res02 = json.loads(res.text)
    # purl is the tail of the download link, including the vkey.
    vkey = res02["req_0"]["data"]["midurlinfo"][0]["purl"]
    return vkey
def saveMusic(songmid, vkey, name):
    # Work on a copy so the Host override does not leak into the shared headers used by getHtml.
    dl_headers = dict(headers, Host='dl.stream.qqmusic.qq.com')
    url = "http://dl.stream.qqmusic.qq.com/" + vkey
    res = requests.get(url, headers=dl_headers, stream=True)
    # Remove or replace characters that cannot appear in a file name.
    safe_name = name.replace("?", "").replace("/", "_").replace("\\", "_").replace("\"", "")
    filename = 'song/{0}.m4a'.format(safe_name)
    print("***** Downloading *****")
    print(url)
    print("***** Song: {}".format(safe_name))
    with open(filename, 'wb') as f:
        f.write(res.raw.read())
    # Content-Length arrives as a string (or None), so convert it before comparing.
    try:
        size = int(urllib.request.urlopen(url).getheader('Content-Length') or 0)
    except (urllib.error.URLError, ValueError):
        size = 0
    if size > 0:
        print("Downloaded song: {}".format(safe_name))
    else:
        print("Download failed")
        os.remove(filename)
if __name__ == '__main__':
    start_url = "https://c.y.qq.com/v8/fcg-bin/fcg_v8_toplist_cp.fcg?tpl=3&page=detail&date=2019-01-08&topid=4&type=top&song_begin=0&song_num=30&g_tk=1154346586&loginUin=1278077260&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0"
    text = getHtml(start_url)
    getSong(text)
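saveMusic cleans only a handful of characters out of the song name before using it as a file name. As a hedged sketch of a more general helper, the function below replaces every character that Windows rejects in file names. `safe_filename` is a name introduced here, and unlike the code above it replaces `?` and `"` with an underscore instead of deleting them.

```python
import re


def safe_filename(name, replacement="_"):
    """Replace characters that are invalid in Windows file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, name).strip()
    # Fall back to a fixed name if nothing usable is left.
    return cleaned or "untitled"


print(safe_filename('A/B?'))
print(safe_filename(''))
```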
Multi-threaded version:
import requests
import json
import time
from datetime import datetime
import threading
date_time = datetime.now().date()
def func(num):
    # Download one 30-song page of the chart; num is the zero-based page index.
    starturl = "https://c.y.qq.com/v8/fcg-bin/fcg_v8_toplist_cp.fcg?tpl=3&page=detail&date={0}&topid=4&type=top&song_begin={1}&song_num=30&g_tk=1285181755&loginUin=2521763805&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0".format(date_time, num * 30)
    print(starturl)
    headers = {
        "cookie": 'RK=51FHFw4aE8; pgv_pvi=8430643200; ptcz=83cfc479ce75c5a1416df7d87136166109888f38587d9944738abca7ab77d17c; tvfe_boss_uuid=e4ba183f02ae980f; pgv_pvid=3169027098; pgv_pvid_new=2426636288_14882e87533; mobileUV=1_15f666e2b04_e8a50; pac_uid=1_1278077260; eas_sid=l1C5q306s9W2d845F9u7f1K1U6; ptui_loginuin=40370953; o_cookie=1278077260; luin=o1278077260; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%221669eddcdc5156-0905303c6ff588-7d113749-1049088-1669eddcdc83f8%22%2C%22%24device_id%22%3A%221669eddcdc5156-0905303c6ff588-7d113749-1049088-1669eddcdc83f8%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; lskey=00010000a5727043706a88a2aebf6044daf687035fcc0804760fd13cac0729275356f7aa88d5157b46210ea6; LW_sid=y1s5J425D4j7u9N1Q8Q0j2k383; LW_uid=p1q5u4d584A7f971l820z2k3M9; ts_uid=4705118039; yq_index=0; uin=o1278077260; skey=@mXN9mj3as; p_uin=o1278077260; pt4_token=cVwioR9KifEllUyD2CPEXz692iNhDH8JE-YwH*5TlRY_; p_skey=BE7HSxnTeFIPwrO6sJ*YXyA1xKGxT072f5YAo919LSY_; yqq_stat=0; pgv_si=s3828307968; pgv_info=ssid=s3773836208; ts_last=y.qq.com/n/yqq/toplist/4.html; ts_refer=link.zhihu.com/%3Ftarget%3Dhttps%253A//y.qq.com/n/yqq/toplist/4.html%2523stat%253Dy_new.toplist.menu.4',
        "user-agent": 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3493.3 Safari/537.36'
    }
    res = requests.get(url=starturl, headers=headers)
    res = json.loads(res.text)
    songname = []
    songmid = []
    for i in res["songlist"]:
        songname.append(i["data"]["songname"])
        songmid.append(i["data"]["songmid"])
    # Map each songmid to its name so the file can be named after the song.
    mid_name = dict(zip(songmid, songname))
    for j in mid_name:
        vkey_url = "https://u.y.qq.com/cgi-bin/musicu.fcg?-=getplaysongvkey05137740976859173&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0&data=%7B%22req%22%3A%7B%22module%22%3A%22CDN.SrfCdnDispatchServer%22%2C%22method%22%3A%22GetCdnDispatch%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22calltype%22%3A0%2C%22userip%22%3A%22%22%7D%7D%2C%22req_0%22%3A%7B%22module%22%3A%22vkey.GetVkeyServer%22%2C%22method%22%3A%22CgiGetVkey%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22songmid%22%3A%5B%22{0}%22%5D%2C%22songtype%22%3A%5B0%5D%2C%22uin%22%3A%220%22%2C%22loginflag%22%3A1%2C%22platform%22%3A%2220%22%7D%7D%2C%22comm%22%3A%7B%22uin%22%3A0%2C%22format%22%3A%22json%22%2C%22ct%22%3A24%2C%22cv%22%3A0%7D%7D".format(j)
        res02 = requests.get(url=vkey_url)
        time.sleep(0.5)
        res02 = json.loads(res02.text)
        vkey = res02["req_0"]["data"]["midurlinfo"][0]["purl"]
        url = "http://dl.stream.qqmusic.qq.com/" + vkey
        try:
            filename = "music/" + mid_name[j] + ".m4a"
            print(filename)
            res03 = requests.get(url=url, headers=headers)
            with open(filename, "wb") as f:
                f.write(res03.content)
        except (OSError, requests.RequestException):
            # Skip songs that cannot be fetched or written and move on to the next one.
            continue
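threading_list = []
for the in range(4):
    # Pass the function and its argument separately; target=func(the) would
    # call func immediately in the main thread instead of in the new thread.
    threadParse = threading.Thread(target=func, args=(the,))
    threading_list.append(threadParse)

for th in threading_list:
    th.start()
for th in threading_list:
    th.join()  # wait for every page to finish before the program exits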
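Another way to run the four pages concurrently is a thread pool from the standard library. The sketch below is self-contained and makes no network requests: `fetch_page` is a hypothetical stand-in for `func` that only returns the song offset each page would request.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_page(page):
    """Stand-in for func(page); returns the song offset it would request."""
    return page * 30


with ThreadPoolExecutor(max_workers=4) as pool:
    # map() runs fetch_page on worker threads and yields results in input order.
    offsets = list(pool.map(fetch_page, range(4)))

print(offsets)  # [0, 30, 60, 90]
```

Leaving the `with` block waits for all workers to finish, so no explicit `join()` calls are needed.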
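- urllib is used here to check the song data and drop songs that cannot be downloaded (because of permissions and similar issues).
- The code does not create the output folder. You can modify it to do so, or simply create the folder yourself.
- For more crawler code, see GitHub.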
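For the folder note above, one possible approach is to create the directory up front with `os.makedirs`. This is a minimal sketch: `ensure_dir` is a name introduced here, and the example works in a temporary directory so that it does not touch your working folder.

```python
import os
import tempfile


def ensure_dir(path):
    """Create the directory if it does not exist yet and return its path."""
    os.makedirs(path, exist_ok=True)  # exist_ok avoids an error on repeated calls
    return path


base = tempfile.mkdtemp()
target = ensure_dir(os.path.join(base, "song"))
ensure_dir(target)  # calling it again is harmless
print(os.path.isdir(target))
```

In the scripts above, calling `ensure_dir("song")` or `ensure_dir("music")` once before the first download would serve the same purpose.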