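For work and study I need a large number of APK samples. Downloading them by hand is painful, so automated harvesting solves it in one go.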
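Step 1: Requirements analysis
1. APKs from legitimate channels.
2. Pick them at random; don't deliberately target one app category.
3. Keep a low profile; crawling too much makes me nervous.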
Step 2: Find an entry point
It turns out the app market lets you find APKs through search, and the response is JSON. Here are two examples:
https://sj.qq.com/myapp/searchAjax.htm?kw=a&pns=MzA=&sid=0
https://sj.qq.com/myapp/searchAjax.htm?kw=a&pns=MjA=&sid=10
kw is the search keyword
pns looks like some kind of page number
sid: not quite sure what this one does
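
One observation that may help with `pns`: the sample values look like Base64. Decoding `MzA=` gives `30` and `MjA=` gives `20`, so `pns` is plausibly an encoded offset rather than an opaque token. How the server actually interprets it is not confirmed here. The sketch below only decodes and encodes such values, and the helper names `decode_pns` and `encode_pns` are made up for illustration.

```python
import base64


def decode_pns(pns: str) -> str:
    """Decode a pns value, restoring '=' padding if it was dropped."""
    padded = pns + "=" * (-len(pns) % 4)
    return base64.b64decode(padded).decode("ascii")


def encode_pns(offset: int) -> str:
    """Encode a numeric offset the way the sample URLs appear to (assumption)."""
    return base64.b64encode(str(offset).encode("ascii")).decode("ascii")


# The two pns values from the sample URLs above
for sample in ["MzA=", "MjA="]:
    print(sample, "->", decode_pns(sample))
```

If the guess holds, `encode_pns(10)` would produce the value for the next offset, but that should be checked against real responses before relying on it.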
Step 2 (continued): Build the queries
Start with an alphabet
key = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s",
"t", "u", "v", "w", "x", "y", "z"]
To get enough volume, loop over every two-letter combination
for k in key:
    for e in key:
        name = k + e
        types = ['', 'MTA', 'MjA', 'MzA', 'NDA']
        for y in types:
            url = 'https://sj.qq.com/myapp/searchAjax.htm?kw={}&pns={}&sid=0'.format(name, y)
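
The nested loops can also be written as a generator, which makes it easy to count the requests before sending any of them. This is a sketch rather than the original code: `build_urls` is a hypothetical helper, while the URL template and the `pns` values are copied from the snippet above.

```python
import itertools
import string

BASE = "https://sj.qq.com/myapp/searchAjax.htm?kw={}&pns={}&sid=0"
PNS_VALUES = ["", "MTA", "MjA", "MzA", "NDA"]


def build_urls(letters=string.ascii_lowercase, pns_values=PNS_VALUES):
    """Yield one search URL per (two-letter keyword, pns) pair."""
    for first, second in itertools.product(letters, repeat=2):
        keyword = first + second
        for pns in pns_values:
            yield BASE.format(keyword, pns)


urls = list(build_urls())
print(len(urls))  # 26 * 26 keywords * 5 pns values = 3380
print(urls[0])
```

Knowing the total up front (3380 requests) helps when deciding how slowly to pace the crawl.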
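To reduce the traffic hit on the app market, we don't download right away (though they probably don't care about this little bit of traffic). Save everything to the database first!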
sqlDecompile = SqlDecompile.Sqltools(self.MYSQLIP, self.MYSQLNAME, self.MYSQLPWD, self.MYSQLTABLE)
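Step 3: Parse the data
Take a look at the JSON the endpoint returns
Everything we want is there: MD5 (used for deduplication), download URL, app name, category, vendor, description, package name, version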
# Runs inside the `for y in types:` loop above. `s` is apparently a
# requests.Session and `b` a countdown of remaining groups, both set up elsewhere.
try:
    text3 = s.get(url, verify=False).json()['obj']['items']
    for text in text3:
        apk_name = text['appDetail']['appName']
        apk_page = text['appDetail']['pkgName']  # package name
        apk_type = text['appDetail']['categoryName']
        # Swap characters that would break the hand-built SQL string below
        apk_description = str(text['appDetail']['description']).replace('"', "'").replace('>', '<')
        apk_url = text['appDetail']['apkUrl']
        apk_fasten = ''
        apk_authorName = text['appDetail']['authorName']
        apk_versionName = text['appDetail']['versionName']
        apk_md5 = text['appDetail']['apkMd5']
        # Truncate over-long descriptions (the original sliced apk_authorName here by mistake)
        if len(apk_description) >= 5000:
            apk_description = apk_description[0:4999]
        sql = 'INSERT INTO apk_down (apk_name, apk_page,apk_type, apk_description,apk_url, ' \
              'apk_fasten,apk_authorName,apk_versionName,apk_md5) VALUES (' \
              '"{}","{}","{}","{}","{}","{}","{}","{}","{}");'.format(apk_name, apk_page,
                                                                      apk_type, apk_description,
                                                                      apk_url, apk_fasten,
                                                                      apk_authorName,
                                                                      apk_versionName,
                                                                      apk_md5)
        try:
            sqlDecompile.put(sql)
            b = b - 1
            print('Remaining: {} groups'.format(b))
            sqlDecompile.close()
        except Exception:
            # The insert fails when this MD5 is already in the table
            print("{}: MD5 already exists".format(apk_md5))
except Exception:
    print("{}: no data".format(url))
    continue
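
Building the INSERT with `str.format` breaks as soon as a field contains a double quote, and it is open to SQL injection. Parameterized queries avoid both problems. The sketch below uses the standard-library `sqlite3` module as a stand-in, since the API of the `SqlDecompile` wrapper is not shown. The column names follow the snippet above, the `UNIQUE` constraint on `apk_md5` is an assumption about how deduplication is enforced, and the sample record is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE apk_down (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           apk_name TEXT, apk_page TEXT, apk_type TEXT,
           apk_description TEXT, apk_url TEXT,
           apk_versionName TEXT, apk_md5 TEXT UNIQUE)"""
)


def save_app(conn, detail):
    """Insert one appDetail dict; return False if its MD5 is already stored."""
    row = (
        detail["appName"], detail["pkgName"], detail["categoryName"],
        detail["description"][:4999], detail["apkUrl"],
        detail["versionName"], detail["apkMd5"],
    )
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO apk_down (apk_name, apk_page, apk_type, "
                "apk_description, apk_url, apk_versionName, apk_md5) "
                "VALUES (?, ?, ?, ?, ?, ?, ?)",
                row,
            )
        return True
    except sqlite3.IntegrityError:
        # Raised by the UNIQUE constraint when the MD5 already exists
        return False


# Invented sample record; the description contains quotes on purpose
sample = {
    "appName": "Demo", "pkgName": "com.example.demo", "categoryName": "Tools",
    "description": 'A "quoted" description', "apkUrl": "https://example.com/demo.apk",
    "versionName": "1.0", "apkMd5": "0123456789abcdef0123456789abcdef",
}
print(save_app(conn, sample))  # first insert succeeds
print(save_app(conn, sample))  # same MD5 again is rejected
```

With placeholders there is no need to rewrite quotes in the description before storing it.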
Alright, let's run it
Step 4: Download the APKs
With everything safely in the database, we can download as we go, a little at a time.
sqlDecompile = SqlDecompile.Sqltools(self.MYSQLIP, self.MYSQLNAME, self.MYSQLPWD, self.MYSQLTABLE)
sql1 = 'SELECT apk_page,apk_versionName,id,apk_url,apk_dir FROM apk_down;'
texts = sqlDecompile.get(sql1)
num = 1
# Skip numbered folders that already hold 100 or more files
while sum([len(x) for _, _, x in os.walk(os.path.dirname(self.APKDIR + '/' + str(num) + '/'))]) >= 100:
    num = num + 1
for text in texts:
    # Only download rows with no recorded file, or whose file has gone missing
    if (text[4] is None) or (not os.path.exists(text[4])):
        # File name: <package>.<version>_<row id>
        fileName = str(text[0]) + '.' + str(text[1]) + '_' + str(text[2])
        downUrl = str(text[3])
        filedir = self.APKDIR + '/' + str(num) + '/'
        apkDir = filedir + fileName + '.apk'
        if not os.path.exists(apkDir):
            if not os.path.exists(filedir):
                os.mkdir(filedir)
            try:
                s = requests.Session()
                r = s.get(downUrl, stream=True, verify=False, timeout=30)
                # Stream to disk in chunks; `with` closes the file even on error
                with open(apkDir, "wb") as f:
                    for chunk in r.iter_content(chunk_size=512):
                        if chunk:
                            f.write(chunk)
                # Record where the file was saved
                sql2 = 'UPDATE apk_down SET apk_dir="{}" WHERE id = {};'.format(apkDir, str(text[2]))
                sqlDecompile.put(sql2)
            except Exception:
                print(downUrl + ': download failed')
        # Move on to the next folder once this one reaches 100 files
        if sum([len(x) for _, _, x in os.walk(os.path.dirname(filedir))]) >= 100:
            num = num + 1
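
Two parts of the download step can be tried without touching the network: picking a folder that still has room, and checking a finished file against the `apkMd5` value stored earlier. The original code does not verify the MD5 after downloading, so that check is an addition here. `pick_bucket` and `md5_of` are hypothetical helper names, and the bucket check counts directory entries with `os.listdir` rather than walking the tree.

```python
import hashlib
import os
import tempfile


def pick_bucket(root, limit=100):
    """Return the first numbered subfolder of root with fewer than limit entries."""
    num = 1
    while True:
        folder = os.path.join(root, str(num))
        os.makedirs(folder, exist_ok=True)
        if len(os.listdir(folder)) < limit:
            return folder
        num += 1


def md5_of(path, chunk_size=8192):
    """Compute a file's MD5 hex digest without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as root:
    folder = pick_bucket(root, limit=2)
    for name in ("a.apk", "b.apk"):
        with open(os.path.join(folder, name), "wb") as f:
            f.write(b"demo")
    # Folder 1 now holds 2 entries, so the next pick moves on to folder 2
    print(os.path.basename(pick_bucket(root, limit=2)))
    print(md5_of(os.path.join(folder, "a.apk")))
```

Comparing `md5_of(apkDir)` with the stored `apk_md5` after each download would catch truncated or corrupted files before `apk_dir` is written back to the table.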
Friendly reminder: crawling carries risks, so use it with care!