Python 3 Crawler in Practice: A QQ Zone (Qzone) Auto-Like Program (Part 2)

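    (After publishing the previous post, I watched it for a week and the view count just would not break 50. Heartbroken, and feeling sorry for myself T.T, so of course I found all sorts of excuses to procrastinate. Then, logging into the blog just now, I noticed someone had commented that they were looking forward to the next part. Instantly energized! Time to start typing.)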

——————————————————————————————————————————————————————————————————————

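    So, off we go again.

    Last time we managed to "like" a post, but the "automatic" part was missing. This post explains how to build the automatic liking program.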

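① Exploring: what do the browser and Qzone exchange after you type the URL into the address bar and press Enter, or click refresh?

<1> Same as in the previous post, start by capturing packets. The cast is the same too: my two alt accounts. Log in to Qzone, open the capture tool, refresh the page, and Fiddler fills up with a stream of packets. Only one of them is the star, though: the request sent by our "refresh" action.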

(Browser visiting Qzone.png)

(The captured packets.png)

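<2> As usual, analyze the packet. After the refresh, the browser sends an HTTP GET request to the host user.qzone.qq.com for the relative path /3236556749. (There are plenty of explanations of GET versus POST online, and I am still a bit fuzzy on it myself. My working model: a GET sends a request that has headers but an empty body; the headers carry the Cookie, and the request simply asks for the content at /3236556749. A POST sends non-empty headers and a non-empty body; the body describes the operation, and the server parses the body and carries it out.)

(Headers of the request packet.png)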
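To make my GET/POST mental model concrete, here is a minimal sketch that only builds the two kinds of request and never sends anything. It uses urllib.request.Request purely as a convenient container (the article itself sends requests with http.client), and the cookie string and form fields are made-up placeholders, not values from my capture.

```python
from urllib import parse, request

# Hypothetical values, for illustration only.
cookie = "uin=o0123456; skey=@abcdef"
url = "http://user.qzone.qq.com/3236556749"

# GET: headers only, no body.
get_req = request.Request(url, headers={"Cookie": cookie})

# POST: the same kind of headers plus a form-encoded body.
form = parse.urlencode({"opuin": "3236556749", "active": 0}).encode("ascii")
post_req = request.Request(url, data=form, headers={"Cookie": cookie})

# Show method, Cookie header, and body for each request object.
for req in (get_req, post_req):
    print(req.get_method(), req.get_header("Cookie"), req.data)
```

The only structural difference between the two objects is the body: it is None for the GET and the encoded form for the POST.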

So the goal now is to issue that GET request from Python 3 code.

The code:
<pre name="code" class="python">from http import client
from urllib import parse

headers = {
    'Host': 'user.qzone.qq.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate',
    'Cookie': '',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Content-Type': 'application/x-www-form-urlencoded',
}
headers['Cookie'] = '*************'  # capture your own Cookie
httpClient = client.HTTPConnection('user.qzone.qq.com')  # the host is user.qzone.qq.com
httpClient.request("GET", "/3236556749", parse.urlencode({}), headers)  # urlencode({}) is '', i.e. an empty body

# Print what came back
response = httpClient.getresponse()
print(response.status)  # status: 200 on success; common errors are 403 (forbidden) and 404 (page not found)
print(response.reason)  # reason: OK on success; otherwise, whatever went wrong
html = response.read()
print(html)  # the raw content of the response
httpClient.close()
</pre>
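This works much like the POST in the previous post and uses the same function. The differences: the Host changes, the method is "GET", the url is just a slash followed by your QQ number, the body is empty, and among the headers only Host differs.

And then, what do you see?

b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00\xb4YOo\xdb\xc8\x15?\xab\xc0~\x871\x03H$BQ\x96\xd3\xd8\x8edz\xb1Mb4@\xb6[7^\xa0mj\x08\x149&\x19S\x1c\x9a3\xb2\x1c\xc7\x06\...(a very long run of bytes omitted)...'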

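What is this? A bytes object starting with '\x1f\x8b', full of gibberish. I searched everywhere and stayed baffled. I tried every online converter I could find: UTF-8, Unicode, ASCII, hexadecimal to decimal and then decoding... still baffled.

I no longer remember what led me to the right reading: it is data compressed in the gzip format! The leading bytes \x1f\x8b are the gzip magic number, and the server presumably compressed the response because our headers said 'Accept-Encoding': 'gzip, deflate'. response.read() always returns bytes; the compression is why those bytes look like nonsense.

So, to get the actual HTML, we have to decompress this data.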
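Had I known about the magic number back then, a check like the following would have saved me a lot of converter-hopping. It is only a sketch: maybe_gunzip is a helper name I made up, and the sample payload is compressed locally rather than fetched from Qzone.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def maybe_gunzip(raw: bytes) -> bytes:
    """Decompress raw if it looks like gzip data; otherwise return it unchanged."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


# Local stand-in for a compressed HTTP response body.
sample = gzip.compress("<html>hello</html>".encode("utf-8"))
print(sample[:2])
print(maybe_gunzip(sample).decode("utf-8"))
print(maybe_gunzip(b"plain text"))
```

A stricter version would look at the response's Content-Encoding header instead of sniffing bytes; the byte check is just the quickest way to recognize what you are staring at.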
② Implementation: gunzip the data returned by the GET and obtain Qzone's HTML, which gives us "refresh"
The code:
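import gzip
import sys

# ①
html = gzip.decompress(html).decode("utf-8")  # decompress the bytes we received, then decode them as UTF-8 into a str

# ②
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)  # map every code point from 0x10000 up to the maximum (everything outside the BMP) to U+FFFD
print(html.translate(non_bmp_map))  # print the HTML with that filter applied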
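In theory, once the decompress-and-decode step marked ① is done you can simply print(html). At least that is true now.
But when I first wrote this code it was not that easy, because emoji live in a realm entirely of their own!

If the HTML contains emoji and you skip the filtering step marked ②, print() fails with a UnicodeEncodeError (not a decode error: the str is fine, it is the output side that cannot encode it). Emoji sit outside the Basic Multilingual Plane, and my output environment could not print such characters!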

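Have you noticed, though, that emoji in recently posted Qzone updates no longer look quite authentic?

Here is why:

(HTML around an emoji in Qzone.png)

Yes: Qzone recently reworked its web display and now uses a URL (an image) in place of the original emoji character. Without that URL-based treatment, this is the train wreck you get when emoji appear on a desktop web page (I could not find a screenshot of the old Qzone emoji; it was worse, anyway):

(HTML around an emoji on Baidu Zhidao.png)

You can see that the emoji still sits in the HTML as a glyph, which print() evidently could not output. So when the crawler fetches a page containing emoji and you want to debug with print, the filter in step ② is still necessary.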
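Here is the step ② filter on a tiny string, so you can see exactly what it does without fetching a page. The sample text is made up. The re.sub variant is an alternative of my own that avoids building a dictionary with about a million entries; the article's code uses translate.

```python
import re
import sys

# Same table as in step ②: every non-BMP code point maps to U+FFFD.
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)

text = "nice post \U0001F600!"  # U+1F600 is an emoji outside the BMP
filtered = text.translate(non_bmp_map)

# Alternative sketch: replace the same range with a character class.
via_regex = re.sub("[\U00010000-\U0010FFFF]", "\ufffd", text)

print(ascii(text))
print(ascii(filtered))
print(filtered == via_regex)
```

Using ascii() for the printout keeps the demonstration itself safe on consoles that cannot display U+FFFD.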

Now, after printing the html variable, you should see all the HTML tags and content of the page "user.qzone.qq.com/(your QQ number)"~

At this point we can write the "page refresh" class:
<pre name="code" class="python">from http import client
from urllib import parse
import gzip
import sys
import time

class httpGETer:
    httpClient = None
    content = None
    headers = None
    status = None
    qNum = None
    cookie = None

    def __init__(self, coo, qqNumber):
        self.cookie = coo
        self.qNum = qqNumber
        self.headers = {
            'Host': 'user.qzone.qq.com',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
            'Accept-Encoding': 'gzip, deflate',
            'Cookie': '',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Content-Type': 'application/x-www-form-urlencoded',
        }
        self.headers['Cookie'] = coo
        self.refresh()

    def refresh(self):
        try:
            # I forget why I wrote it in this wasteful way: it reconnects on every refresh
            self.httpClient = client.HTTPConnection('user.qzone.qq.com')
            self.httpClient.request('GET', '/' + self.qNum, parse.urlencode({}), self.headers)
            temp = self.httpClient.getresponse()
            self.status = temp.status
            # gunzip, decode as UTF-8, then replace non-BMP characters (emoji) with U+FFFD
            self.content = gzip.decompress(temp.read()).decode("utf-8").translate(dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd))
        except Exception as e:
            # only report the first failure in a row (content is still set from the last success)
            if self.content is not None:
                print("Error happened in the class httpGETer " + time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
                print(e)
            self.content = None
        finally:
            if self.httpClient:
                self.httpClient.close()

    def getHtml(self):
        if self.content:
            return self.content
        else:
            return "Fail to get HTML data"

# Code that uses this class:
# main process BEGIN
cookie = '(input)'  # enter your own Cookie
qqNumber = '(input)'  # enter your own QQ number
qzoneG = httpGETer(cookie, qqNumber)  # instantiate a page-refresh object
qzoneG.refresh()  # refresh
print(qzoneG.getHtml())  # print the HTML content
# main process END
</pre>
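We can now fetch the page's HTML.

What is still missing?

The question left over from the previous post is still unsolved: getting the unikey and curkey of each post on the page!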
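③ Implementation: use a regular expression to filter the page content and extract unikey and curkey.
It really is only four lines of code (plus an import re at the top of the file):
html = qzoneG.getHtml()
temp = re.search('data-unikey="(http[^"]*)"[^d]*data-curkey="([^"]*)"[^d]*data-clicklog=("like")[^h]*href="javascript:;"', html)
unikey = temp.group(1)  # group() is a method of the match object returned by re.search
curkey = temp.group(2)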
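(The end! Confetti~~~~~~~~~ Just kidding -.-!)
It took me so much trouble to get this regular expression right that there is no way I am skipping the chance to show off the whole story.

So, where on the page do unikey and curkey come from?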

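Step 1: In Fiddler, find one concrete unikey value in the like request packet.

(Getting the unikey and curkey values from the like request in Fiddler.png)

Step 2: In the browser press F12, then CTRL+F, paste the unikey value, and find the place that contains unikey.

(Searching for the unikey location in Firefox.png)

(Searching for the unikey in Edge.png)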

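Yes, it really is this nasty: the two browsers do not show the same markup.

The headers of an HTTP request carry information about the browser you are using (the User-Agent), so different browsers may be served different DOMs.

So take note: if the "like" class and the "page refresh" class are going to share one Cookie, try to make sure the browser information in their headers describes the same browser~ (ideally everything in the headers except Host is identical).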
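One way to keep the two header sets in sync is to build both from a single template. This is just a sketch of that idea: BROWSER_HEADERS and build_headers are names I made up for illustration, and the cookie value is a placeholder.

```python
# Browser-describing headers shared by every request (copied from one browser capture).
BROWSER_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Content-Type': 'application/x-www-form-urlencoded',
}


def build_headers(host, cookie):
    """Return a fresh header dict that differs from the template only in Host and Cookie."""
    headers = dict(BROWSER_HEADERS)  # copy, so callers cannot mutate the template
    headers['Host'] = host
    headers['Cookie'] = cookie
    return headers


get_headers = build_headers('user.qzone.qq.com', 'placeholder-cookie')
post_headers = build_headers('h5.qzone.qq.com', 'placeholder-cookie')

# Which keys differ between the two requests?
differing = {key for key in get_headers if get_headers[key] != post_headers[key]}
print(differing)
```

With this layout, switching the whole program to another browser's capture means editing one dictionary instead of two.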

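Step 3: Write the matching regular expression
Regular expressions... there are many tutorials online; the only thing I will recommend is this online tester: http://tool.oschina.net/regex/

Next, let me explain why my regular expression looks so awkward.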
<pre name="code" class="python">temp = re.search('data-unikey="(http[^"]*)"[^d]*data-curkey="([^"]*)"[^d]*data-clicklog=("like")[^h]*href="javascript:;"', html)
</pre>
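The re module: "re" is short for regex (regular expression).
The search(pattern, string) function scans the string and returns a match object for the first place the pattern matches, then stops; if nothing matches it returns None.

All of that is easy to look up. I will only cover two things specific to Qzone: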

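The first is an annoying attribute: data-clicklog

My original regular expression did not include the href="javascript:;" part. In theory, once I successfully like a post and refresh the page to get the new DOM tree, the value of data-clicklog changes from "like" to "cancellike". And that is indeed what happens.

But! When I later analyzed the page after liking, I found that the regex could still capture the same unikey and curkey. The result: the same packet was sent n times over, and only the first person ever got liked. (A classmate had complained that Qzone kept notifying her again and again that I had liked her post.)

The point of all this: "if unikey and curkey do not change across runs, the refresh class is not at fault, the regular expression is", and "put as many constraints into the regular expression as you reasonably can, otherwise it will not match precisely".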
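A cheap way to check a pattern like this before letting it loose is to run it against a small hand-written fragment. The markup below is invented for the test and is much simpler than real Qzone HTML; it only mimics the attribute order that the pattern expects, with one already-liked post followed by one that is not liked yet.

```python
import re

# The pattern from this article, compiled once.
LIKE_RE = re.compile(
    'data-unikey="(http[^"]*)"[^d]*data-curkey="([^"]*)"[^d]*'
    'data-clicklog=("like")[^h]*href="javascript:;"'
)

# Invented fragment: first post already liked, second post not liked yet.
page = (
    '<a data-unikey="http://user.qzone.qq.com/111/mood/aaa" '
    'data-curkey="http://user.qzone.qq.com/111/mood/aaa" '
    'data-clicklog="cancellike" href="javascript:;">x</a>\n'
    '<a data-unikey="http://user.qzone.qq.com/222/mood/bbb" '
    'data-curkey="http://user.qzone.qq.com/222/mood/bbb" '
    'data-clicklog="like" href="javascript:;">y</a>\n'
)

match = LIKE_RE.search(page)
if match is not None:  # search returns None when nothing matches
    print(match.group(1), match.group(2))

# Once every post is liked, the pattern should find nothing.
all_liked = page.replace('data-clicklog="like"', 'data-clicklog="cancellike"')
print(LIKE_RE.search(all_liked))
```

If the first print ever shows a post that is already liked, the pattern is too loose, which is exactly the symptom my classmate was complaining about.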
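The second is the values unikey and curkey can take

Usually unikey and curkey are both links starting with "http://". However, as we said in the previous post, curkey is generated when the original author publishes the post. But what if the reposter shared a link? A promotion? A song? Then the original author is not a user, so curkey can be something that does not start with "http://". That is why the data-curkey="([^"]*)" part of my regular expression has no http prefix restriction.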
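The effect of leaving the http restriction off curkey can also be checked on invented markup. In this sketch the curkey value "share_music_0042" is a made-up stand-in for a shared song; I do not know what real non-http curkeys look like in general. STRICT_RE is a hypothetical stricter variant, shown only for contrast, and finditer is used here just to list every match, whereas the program in this article takes the first match with search.

```python
import re

# The article's pattern: curkey may be anything up to the closing quote.
LIKE_RE = re.compile(
    'data-unikey="(http[^"]*)"[^d]*data-curkey="([^"]*)"[^d]*'
    'data-clicklog=("like")[^h]*href="javascript:;"'
)
# Hypothetical stricter variant that also demands an http prefix on curkey.
STRICT_RE = re.compile(
    'data-unikey="(http[^"]*)"[^d]*data-curkey="(http[^"]*)"[^d]*'
    'data-clicklog=("like")[^h]*href="javascript:;"'
)

# Invented fragment: a repost whose curkey is not a link.
page = (
    '<a data-unikey="http://user.qzone.qq.com/333/mood/ccc" '
    'data-curkey="share_music_0042" '
    'data-clicklog="like" href="javascript:;">z</a>\n'
)

pairs = [m.group(1, 2) for m in LIKE_RE.finditer(page)]
print(pairs)
print(STRICT_RE.search(page))
```

The looser pattern picks up the repost, while the stricter one would silently skip it and the post would never get liked.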
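④ The end! Confetti~~
I have posted the full code below. If you run into problems, please leave a comment; I will be checking the view count from time to time for a while T.T: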
from http import client
from urllib import parse
import re
import time
import gzip
import sys

class httpGETer:
    httpClient = None
    content = None
    headers = None
    status = None
    qNum = None
    cookie = None

    def __init__(self, coo, qqNumber):
        self.cookie = coo
        self.qNum = qqNumber
        self.headers = {
            'Host': 'user.qzone.qq.com',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
            'Accept-Encoding': 'gzip, deflate',
            'Cookie': '',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Content-Type': 'application/x-www-form-urlencoded',
        }
        self.headers['Cookie'] = coo
        self.refresh()

    def refresh(self):
        try:
            self.httpClient = client.HTTPConnection('user.qzone.qq.com')
            self.httpClient.request('GET', '/' + self.qNum, parse.urlencode({}), self.headers)
            temp = self.httpClient.getresponse()
            self.status = temp.status
            # gunzip, decode as UTF-8, then replace non-BMP characters (emoji) with U+FFFD
            self.content = gzip.decompress(temp.read()).decode("utf-8").translate(dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd))
        except Exception as e:
            # only report the first failure in a row (content is still set from the last success)
            if self.content is not None:
                print("Error happened in the class httpGETer " + time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
                print(e)
            self.content = None
        finally:
            if self.httpClient:
                self.httpClient.close()

    def getHtml(self):
        if self.content:
            return self.content
        else:
            return "Fail to get HTML data"



class httpPOSTer:
    headers = None
    body = None
    qNum = None
    url = None
    cookie = None

    def getGTK(self, ss):
        # compute the g_tk value from the p_skey string
        hash = 5381
        for i in ss:
            hash += (hash << 5) + ord(i)
        return hash & 0x7fffffff

    def __init__(self, coo, qqNumber):
        self.cookie = coo
        self.qNum = qqNumber
        # pull p_skey out of the Cookie; g_tk goes into the request URL
        p_skey = re.search('p_skey=([^;^\']*)', self.cookie).group(1)
        g_tk = self.getGTK(p_skey)
        self.url = "/proxy/domain/w.qzone.qq.com/cgi-bin/likes/internal_dolike_app?g_tk=" + str(g_tk)
        self.headers = {
            'Host': 'h5.qzone.qq.com',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
            'Accept-Encoding': 'gzip, deflate',
            'Cookie': '',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Content-Type': 'application/x-www-form-urlencoded',
        }
        self.headers['Cookie'] = coo
        self.body = {
            'qzreferrer': 'http://user.qzone.qq.com/' + qqNumber,
            'opuin': qqNumber,
            'unikey': '',  # filled in by thumbs_up
            'curkey': '',  # filled in by thumbs_up
            'from': 1,
            'appid': 311,
            'typeid': 0,
            'active': 0,
            'fupdate': 1,
        }

    def thumbs_up(self, unikey, curkey):
        self.body['unikey'] = unikey
        self.body['curkey'] = curkey
        print("url=" + self.url)
        httpClient = client.HTTPConnection('h5.qzone.qq.com')
        httpClient.request("POST", self.url, parse.urlencode(self.body), self.headers)
        response = httpClient.getresponse()
        print("thumbs_up " + str(response.status) + " " + response.reason)
        httpClient.close()

# main process begin
cookie = '(input)'  # enter your Cookie
qqNumber = '(input)'  # enter your QQ number
qzoneG = httpGETer(cookie, qqNumber)  # instantiate a page-refresh object
qzoneP = httpPOSTer(cookie, qqNumber)  # instantiate a like object
flag = 1  # marker that avoids printing the same message repeatedly; not important
while True:
    qzoneG.refresh()
    html = qzoneG.getHtml()
    temp = re.search('data-unikey="(http[^"]*)"[^d]*data-curkey="([^"]*)"[^d]*data-clicklog=("like")[^h]*href="javascript:;"', html)  # write your own regex here
    if temp is not None:
        print(temp.group(0))
        unikey = temp.group(1)
        curkey = temp.group(2)
        qzoneP.thumbs_up(unikey, curkey)
        flag = 1
    if flag == 1:
        print("sleeping... from:" + time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))  # print the time of the most recent like
        flag = 0
    time.sleep(1)  # 1-second delay: a like lands about a second after a friend posts, and the CPU gets a break
# main process end
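(I am well aware that this post's layout is ugly... so I tried to mark all the key sentences in brown... otherwise even I would not want to read it...)
(Next I will try to implement automatic login to obtain the Cookie; swapping the Cookie by hand so often is really tedious.)
(Confetti~)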