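The starting point of the code below is url_initial, the link that the page loads via AJAX when you scroll to the bottom.
(To get it, I opened a page of the public account in WeChat on my phone and used "Send to Chat" to send it to my own account, then opened the resulting link in Chrome on my computer. Alternatively, you can follow "My Evernote" and send the link to an Evernote note, or on iOS there seems to be a send-by-email option that also yields the link. Opening the page directly in a browser or simply copying the link does not work. Once the link is open in the browser, inspect the requests under Network -> XHR in the developer tools to find the link you need.)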
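From observation, this loading endpoint returns 10 entries per request, and the only parameter sent to the server that changes is frommsgid, which is the id of the last entry returned by the previous request. So all entries can be fetched by repeating the request, each time with frommsgid set to the last id just received.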
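The paging idea above can be sketched without touching the network. This is only an illustration: fetch_all and fake_fetch_page are hypothetical names, and the stand-in endpoint assumes that a request returns entries whose id is strictly below frommsgid, newest first, which is an assumption rather than something confirmed about the real interface.

```python
def fetch_all(fetch_page, first_id, max_pages=1000):
    """Collect entries page by page, using the last id of each page as the next cursor."""
    entries = []
    cursor = first_id
    for _ in range(max_pages):
        page = fetch_page(cursor)
        if not page:  # an empty page means nothing older is left
            break
        entries.extend(page)
        cursor = page[-1]["id"]  # this becomes frommsgid for the next request
    return entries


# Hypothetical in-memory stand-in for the endpoint: ids 25..1, newest first,
# returning at most `count` entries with an id strictly below the cursor.
def fake_fetch_page(frommsgid, count=10):
    ids = [i for i in range(25, 0, -1) if i < frommsgid]
    return [{"id": i} for i in ids[:count]]


all_entries = fetch_all(fake_fetch_page, first_id=26)
print(len(all_entries))
```

A plain loop is used here instead of recursion, so a long history cannot run into Python's recursion limit.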
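Several other parameters of this endpoint are tied to the session, so the link only stays valid for about 20 minutes. After that, url_initial has to be obtained again through WeChat on the phone.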
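The endpoint must not be called too frequently: add a time.sleep(t) at the end of each loop iteration. A value of t > 0.5 (seconds) should be safe; otherwise access gets temporarily blocked.
This method alone cannot retrieve read counts and comment counts. The site http://tool.qoofan.com/weixin/query offers an API for querying read and comment counts (for a fee). Of course, you can also try to find a loophole and write your own API.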
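For the request pacing mentioned above, one possible way to enforce a minimum gap between calls is a small generator wrapper. This is a sketch for Python 3 (it relies on time.monotonic); paced is a hypothetical helper name, and the 0.05 second interval in the demo only keeps the example quick, whereas for the real endpoint the text suggests more than 0.5 seconds.

```python
import time


def paced(items, min_interval=0.5):
    """Yield items so that consecutive yields are at least min_interval seconds apart."""
    last = None
    for item in items:
        if last is not None:
            wait = min_interval - (time.monotonic() - last)
            if wait > 0:
                time.sleep(wait)
        last = time.monotonic()
        yield item


start = time.monotonic()
seen = []
for n in paced(range(3), min_interval=0.05):
    seen.append(n)  # a real script would issue one request here
elapsed = time.monotonic() - start
```

Measuring the time already spent on the request means slow responses do not add an unnecessary extra delay on top.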
#coding=utf-8
import codecs
import json
import time

import requests
def ttd(timestamp):
    # Convert a Unix timestamp to a local-time string
    timeStr = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))
    return timeStr
f=codecs.open('/home/allen/projects/sjtu_news/data_wechat.csv','w','utf-8')
url_initial='https://mp.weixin.qq.com/mp/getmasssendmsg?__biz=MjM5MDIyMDQyMA==&uin=MjMzNTA3MjQw&key=8dcebf9e179c9f3a15be50268fa7072ea98eab64f47c243ed0a75d48addee99ee76f6585fb67fd0f15a6d32009bfe38b&f=json&frommsgid=1000000038&count=10&uin=MjMzNTA3MjQw&key=8dcebf9e179c9f3a15be50268fa7072ea98eab64f47c243ed0a75d48addee99ee76f6585fb67fd0f15a6d32009bfe38b&pass_ticket=OHlwEqWexZOnMR8LoFoVhpLM7RXg28HZhMrsoP4Rrc0%25253D&wxtoken=&x5=0'
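# Split the URL around the frommsgid value so that a new id can be spliced in for each request
url_head = url_initial[0:url_initial.index('frommsgid=') + len('frommsgid=')]
url_tail = url_initial[url_initial.index('&count='):]
# Browser-like User-Agent header for the requests
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/600.5.17 (KHTML, like Gecko) Version/8.0.5 Safari/600.5.17"}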
def run(url):
    # Fetch one page of history and append it to the output file.
    # Returns the id of the last entry, or None when the page is empty.
    page = requests.get(url, headers=headers)
    page.encoding = 'utf-8'
    page_json = json.loads(page.text)
    # general_msg_list is itself a JSON string, so it is decoded a second time
    general_msg_list = json.loads(page_json['general_msg_list'])
    lists = general_msg_list['list']
    if not lists:
        return None
    for item in lists:
        info = item['comm_msg_info']
        ext = item.get('app_msg_ext_info')
        if ext:
            main_title = ext['title']
            multi_msg = ext.get('multi_app_msg_item_list', [])
        else:
            # entry without an article block: fall back to the message text
            main_title = info['content']
            multi_msg = []
        # one tab-separated row per entry: time, main title, then any sub-article titles
        f.write(ttd(info['datetime']) + '\t')
        f.write(main_title + '\t')
        for k in multi_msg:
            f.write(k['title'] + '\t')
        f.write('\n')
    return lists[-1]['comm_msg_info']['id']

for i in range(1000):
    rv = run(url_initial)
    if rv is None:
        break
    print(rv)
    time.sleep(2)
    # the last id of this page becomes frommsgid for the next request
    url_initial = url_head + str(rv) + url_tail
f.close()
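As an alternative to slicing the URL into a head and a tail, the frommsgid value could be swapped in place with a regular expression. This is a sketch only: with_frommsgid is a hypothetical helper, and the example URL is shortened, with XXX as a placeholder instead of real session values.

```python
import re


def with_frommsgid(url, msg_id):
    """Return url with the value of its frommsgid query parameter replaced by msg_id."""
    # The lookbehind keeps the parameter name and replaces only the digits after it.
    new_url, n = re.subn(r'(?<=[?&]frommsgid=)\d+', str(msg_id), url, count=1)
    if n == 0:
        raise ValueError('no frommsgid parameter in url')
    return new_url


# Shortened example URL; XXX is a placeholder, not a real parameter value.
example = 'https://mp.weixin.qq.com/mp/getmasssendmsg?__biz=XXX&f=json&frommsgid=1000000038&count=10'
next_url = with_frommsgid(example, 1000000028)
print(next_url)
```

Unlike the head/tail split, this does not depend on count being the parameter that follows frommsgid, and it leaves the rest of the query string byte-for-byte unchanged.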