Web Crawling: Downloading APKs

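I need a large number of APK samples for work and study. Downloading them by hand is painful, so a crawler collects them automatically.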

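Step 1: Requirements
           1. APKs from a legitimate channel.
           2. A random sample, not deliberately targeted at one category.
           3. Keep the footprint small; crawling too much makes me a little nervous.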

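Step 2: Finding an entry point
            The app store lets you search for APKs, and the search returns JSON. Two example requests:
             https://sj.qq.com/myapp/searchAjax.htm?kw=a&pns=MzA=&sid=0
             https://sj.qq.com/myapp/searchAjax.htm?kw=a&pns=MjA=&sid=10
             kw: the search keyword
             pns: appears to work like a page number
             sid: I have not figured this one out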
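The two sample pns values look like Base64, and decoding them is a quick way to check the "page number" guess. The sketch below uses only the standard library. The helper names (decode_pns, encode_pns, build_search_url) are hypothetical, and reading pns as a Base64-encoded result offset is an assumption drawn from the two sample URLs, not documented behavior of the endpoint.

```python
import base64
from urllib.parse import quote

BASE = "https://sj.qq.com/myapp/searchAjax.htm"


def decode_pns(pns: str) -> str:
    """Decode a pns value, restoring '=' padding if it was dropped."""
    padded = pns + "=" * (-len(pns) % 4)
    return base64.b64decode(padded).decode("ascii")


def encode_pns(offset: int) -> str:
    """Encode a numeric offset the way the sample URLs appear to."""
    return base64.b64encode(str(offset).encode("ascii")).decode("ascii")


def build_search_url(kw: str, offset=None, sid: int = 0) -> str:
    """Build a search URL; offset=None leaves pns empty (first page, assumed)."""
    pns = "" if offset is None else encode_pns(offset)
    return "{}?kw={}&pns={}&sid={}".format(BASE, quote(kw), pns, sid)


if __name__ == "__main__":
    print(decode_pns("MzA="), decode_pns("MjA="))
    print(build_search_url("a", 30, 0))
```

Under this reading, MzA= is 30 and MjA= is 20, so pns would be a result offset in steps of 10 rather than a page index.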

Step 2 (continued): Building the queries
             Start with an alphabet:

import string

# The 26 lowercase letters, "a" through "z"
key = list(string.ascii_lowercase)

              To get enough results, loop over every two-letter combination:

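for k in key:
    for e in key:
        name = k + e  # two-letter search keyword, "aa" through "zz"
        # pns values: '' for the first request, then Base64 of 10, 20, 30, 40
        # (written here without the trailing '=' padding)
        types = ['', 'MTA', 'MjA', 'MzA', 'NDA']
        for y in types:
            url = 'https://sj.qq.com/myapp/searchAjax.htm?kw={}&pns={}&sid=0'.format(name, y)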
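The nested loops can also be written as a generator, which makes it easy to see how many requests the crawl will issue before sending any of them. This is only a sketch: iter_search_urls is a hypothetical name, and the URL template and pns values are copied from the loop above.

```python
from itertools import product
from string import ascii_lowercase

URL_TEMPLATE = 'https://sj.qq.com/myapp/searchAjax.htm?kw={}&pns={}&sid=0'
PAGES = ['', 'MTA', 'MjA', 'MzA', 'NDA']


def iter_search_urls(letters=ascii_lowercase, pages=PAGES):
    """Yield one search URL per (two-letter keyword, pns value) pair."""
    for first, second in product(letters, repeat=2):
        for pns in pages:
            yield URL_TEMPLATE.format(first + second, pns)


if __name__ == "__main__":
    urls = list(iter_search_urls())
    print(len(urls))
```

With 26 x 26 = 676 keywords and 5 pns values each, the crawl issues 3380 search requests in total.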

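              To reduce the traffic hitting the app store (which probably would not even notice it), we do not download anything right away. We save the metadata to a database first.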

sqlDecompile = SqlDecompile.Sqltools(self.MYSQLIP, self.MYSQLNAME, self.MYSQLPWD, self.MYSQLTABLE)
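In the same spirit of keeping the load low, the search requests themselves can be spaced out. A minimal sketch, assuming a random pause of one to three seconds between requests is acceptable; the function name throttled and the delay values are my own choices for illustration, not something the app store specifies.

```python
import random
import time


def throttled(items, min_delay=1.0, max_delay=3.0,
              sleep=time.sleep, rng=random.uniform):
    """Yield items one by one, pausing a random interval between them.

    No pause happens before the first item. `sleep` and `rng` are
    injectable so the pacing can be tested without really waiting.
    """
    first = True
    for item in items:
        if not first:
            sleep(rng(min_delay, max_delay))
        first = False
        yield item


if __name__ == "__main__":
    for url in throttled(["url-1", "url-2", "url-3"], 0.1, 0.2):
        print("would request", url)
```

Wrapping the URL loop in throttled(...) leaves the rest of the crawl unchanged; only the pacing differs.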

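Step 3: Parsing the data
            Take a look at the JSON the endpoint returns.

               Everything we want is there: MD5 (for deduplication), download URL, app name, category, vendor, description, package name, and version.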
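To make the shape of the response concrete, here is a small sketch that flattens one response into rows. The key names (obj, items, appDetail, appName, and so on) are the ones this article's parsing code reads. The sample payload itself is made up for illustration and is not real output from the endpoint, and extract_rows is a hypothetical helper.

```python
import json

# Made-up payload that only mirrors the keys this article reads.
SAMPLE = json.loads("""
{"obj": {"items": [{"appDetail": {
    "appName": "Example App",
    "pkgName": "com.example.app",
    "categoryName": "Tools",
    "description": "An example description.",
    "apkUrl": "https://example.com/example.apk",
    "authorName": "Example Vendor",
    "versionName": "1.0.0",
    "apkMd5": "0123456789abcdef0123456789abcdef"
}}]}}
""")

# Database column name -> key inside appDetail
FIELD_MAP = {
    "apk_name": "appName",
    "apk_page": "pkgName",
    "apk_type": "categoryName",
    "apk_description": "description",
    "apk_url": "apkUrl",
    "apk_authorName": "authorName",
    "apk_versionName": "versionName",
    "apk_md5": "apkMd5",
}


def extract_rows(payload):
    """Return one dict per app; an empty list if the payload has no items."""
    items = (payload.get("obj") or {}).get("items") or []
    rows = []
    for item in items:
        detail = item.get("appDetail") or {}
        rows.append({col: detail.get(key, "") for col, key in FIELD_MAP.items()})
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE):
        print(row["apk_page"], row["apk_versionName"], row["apk_md5"])
```

Returning an empty list for a missing obj or items key lets the caller treat "no results" as a normal case instead of an exception.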

# This fragment runs inside the `for y in types` loop above.
# `s` (a requests session) and `b` (a remaining-count counter) are defined elsewhere.
try:
    items = s.get(url, verify=False).json()['obj']['items']
    for item in items:
        detail = item['appDetail']
        apk_name = detail['appName']
        apk_page = detail['pkgName']
        apk_type = detail['categoryName']
        # Swap double quotes for single quotes so the text fits inside the
        # double-quoted SQL literal below
        apk_description = str(detail['description']).replace('"', "'").replace('>', '<')
        apk_url = detail['apkUrl']
        apk_fasten = ''
        apk_authorName = detail['authorName']
        apk_versionName = detail['versionName']
        apk_md5 = detail['apkMd5']
        if len(apk_description) >= 5000:
            # Cap the description at 4999 characters
            apk_description = apk_description[0:4999]
        sql = ('INSERT INTO apk_down (apk_name, apk_page, apk_type, apk_description, apk_url, '
               'apk_fasten, apk_authorName, apk_versionName, apk_md5) VALUES ('
               '"{}","{}","{}","{}","{}","{}","{}","{}","{}");').format(
                   apk_name, apk_page, apk_type, apk_description, apk_url,
                   apk_fasten, apk_authorName, apk_versionName, apk_md5)
        try:
            sqlDecompile.put(sql)
            b = b - 1
            print('Remaining: {} groups'.format(b))
            sqlDecompile.close()
        except Exception:
            # Assumes apk_md5 has a unique index, so a failed insert means a duplicate
            print("{}: MD5 already exists".format(apk_md5))
except Exception:
    print("{}: no data".format(url))
    continue
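Building the INSERT with string formatting breaks as soon as a field contains a quote, and it treats every database error as a duplicate. Below is a sketch of the same "store first, deduplicate by MD5" idea with parameterized queries. It swaps in the standard-library sqlite3 module instead of the MySQL helper used in this article, keeps only a few of the columns, and the function names open_store and save_apk are hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the database and create the table with a UNIQUE MD5 column."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS apk_down (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               apk_name TEXT,
               apk_page TEXT,
               apk_url TEXT,
               apk_md5 TEXT UNIQUE,
               apk_dir TEXT)"""
    )
    return conn


def save_apk(conn, name, package, url, md5):
    """Insert one row; return False when that MD5 is already stored."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO apk_down (apk_name, apk_page, apk_url, apk_md5) "
                "VALUES (?, ?, ?, ?)",
                (name, package, url, md5),
            )
        return True
    except sqlite3.IntegrityError:
        return False


if __name__ == "__main__":
    conn = open_store()
    print(save_apk(conn, 'Say "hi"', "com.example.app",
                   "https://example.com/a.apk", "0" * 32))
    print(save_apk(conn, "Duplicate", "com.example.app",
                   "https://example.com/a.apk", "0" * 32))
```

Catching only IntegrityError means a real failure, such as a lost connection, still surfaces instead of being reported as "MD5 already exists".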

       Now let's run it.

       
Step 4: Downloading the APKs
         With everything safely in the database, we can download APKs a few at a time, as we need them.

import os
import requests

sqlDecompile = SqlDecompile.Sqltools(self.MYSQLIP, self.MYSQLNAME, self.MYSQLPWD, self.MYSQLTABLE)
sql1 = 'SELECT apk_page,apk_versionName,id,apk_url,apk_dir FROM apk_down;'
texts = sqlDecompile.get(sql1)
# Each numbered subdirectory holds at most 100 files; find the first one with room
num = 1
while sum([len(x) for _, _, x in os.walk(os.path.dirname(self.APKDIR + '/' + str(num) + '/'))]) >= 100:
    num = num + 1
for text in texts:
    # Only fetch rows that have no file on disk yet
    if (text[4] is None) or (not os.path.exists(text[4])):
        fileName = str(text[0]) + '.' + str(text[1]) + '_' + str(text[2])
        downUrl = str(text[3])
        filedir = self.APKDIR + '/' + str(num) + '/'
        apkDir = filedir + fileName + '.apk'
        if not os.path.exists(apkDir):
            os.makedirs(filedir, exist_ok=True)
            try:
                s = requests.Session()
                r = s.get(downUrl, stream=True, verify=False, timeout=30)
                r.raise_for_status()  # do not save an HTTP error page as an APK
                # The with block closes the file even if the download fails midway
                with open(apkDir, "wb") as f:
                    for chunk in r.iter_content(chunk_size=512):
                        if chunk:
                            f.write(chunk)
                # Record where the file was saved
                sql2 = 'UPDATE apk_down SET apk_dir="{}" WHERE id = {};'.format(apkDir, str(text[2]))
                sqlDecompile.put(sql2)
            except Exception:
                print(downUrl + ': download failed')
            # Move on to the next subdirectory once this one is full
            if sum([len(x) for _, _, x in os.walk(os.path.dirname(filedir))]) >= 100:
                num = num + 1
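Since the MD5 of each APK is already in the database, the download can be checked against it, and writing to a temporary file first keeps half-finished downloads from looking like complete ones. A minimal sketch using only the standard library; save_stream is a hypothetical helper, and it assumes the apkMd5 value reported by the store really is the MD5 of the file served at apkUrl, which I have not verified.

```python
import hashlib
import os


def save_stream(chunks, dest, expected_md5=None):
    """Write chunks to dest via a temporary file.

    Returns True when the file was saved, False when expected_md5 was
    given and did not match. The temporary file never survives a failure.
    """
    tmp = dest + ".part"
    digest = hashlib.md5()
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    try:
        with open(tmp, "wb") as f:
            for chunk in chunks:
                if chunk:
                    f.write(chunk)
                    digest.update(chunk)
        if expected_md5 and digest.hexdigest() != expected_md5.lower():
            os.remove(tmp)
            return False
        os.replace(tmp, dest)  # atomic rename onto the final name
        return True
    except Exception:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


if __name__ == "__main__":
    data = [b"abc", b"def"]
    md5 = hashlib.md5(b"abcdef").hexdigest()
    print(save_stream(data, os.path.join("apk_demo", "1", "demo.apk"), md5))
```

With requests, the chunks argument would be r.iter_content(chunk_size=65536) from a response opened with stream=True, and expected_md5 would come from the apk_md5 column, which the SELECT in this step would then need to include.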

Friendly reminder: crawling carries risks, so use it with care!
