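1. Introduction
In the world of web crawling, simulating a login is an essential skill, because many sites only let logged-in users browse their content. This post uses Python to simulate logging in to Zhihu.
2. Getting the parameters POSTed at login
Open the Zhihu URL https://www.zhihu.com/#signin in a browser, enter any phone number (13265604588) and password (1234), press F12, and then click the login button. The submitted form appears in the Network tab.
Four fields are submitted at login: the phone number and password, which we enter ourselves; _xsrf, a hidden random token generated by Zhihu; and captcha_type, the type of captcha.
The request URL is https://www.zhihu.com/login/phone_num, which is used again later.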
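As a rough sketch of what the browser submits, the four fields can be URL-encoded into a form body with the standard library. The helper name build_login_form and the sample values (the token and the captcha_type value "cn") are placeholders for illustration only, not values confirmed by Zhihu.

```python
from urllib.parse import parse_qs, urlencode


def build_login_form(phone_num, password, xsrf, captcha_type):
    """Return the URL-encoded body for the four login fields (hypothetical helper)."""
    fields = {
        "_xsrf": xsrf,
        "phone_num": phone_num,
        "password": password,
        "captcha_type": captcha_type,
    }
    return urlencode(fields)


# Placeholder values: "example-token" and "cn" are assumptions for illustration.
body = build_login_form("13265604588", "1234", "example-token", "cn")
print(body)
# Decoding the body again shows the same four fields the Network tab lists.
print(parse_qs(body))
```

Passing a dict as data= to a requests POST performs the same encoding automatically, so this helper is only useful for inspecting the payload.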
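2.1 Getting _xsrf
Right-click the login page and view the page source. The login form contains this hidden field:
<input type="hidden" name="_xsrf" value="37616639663361332d393965632d346634632d396166362d356538383763653738367">
Its value can be extracted with a regular expression: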
import re

text = '<input type="hidden" name="_xsrf" value="37616639663361332d393965632d346634632d396166362d356538383763653738367">'
# Capture whatever sits between the quotes of the value attribute
match_obj = re.match('.*name="_xsrf" value="(.*?)"', text)
if match_obj:
    print(match_obj.group(1))
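Note the question mark in the capture group (.*?): it turns off greedy matching, so the group stops at the first closing quote.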
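The difference is easy to see on a line that contains a second quoted attribute after the token. The sample HTML below is made up for illustration.

```python
import re

# Made-up sample: a second input follows the _xsrf field on the same line.
html = '<input type="hidden" name="_xsrf" value="abc123"> <input name="other" value="zzz">'

# Greedy: (.*) runs to the LAST double quote on the line.
greedy = re.match(r'.*name="_xsrf" value="(.*)"', html)
# Non-greedy: (.*?) stops at the FIRST double quote after value=".
lazy = re.match(r'.*name="_xsrf" value="(.*?)"', html)

print(greedy.group(1))
print(lazy.group(1))
```

Only the non-greedy version returns the bare token; the greedy one also swallows the markup that follows it.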
The complete get_xsrf function is as follows:
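def get_xsrf():
    # Fetch the home page and pull out the _xsrf token
    response = session.get("https://www.zhihu.com", headers=header)
    # re.search scans the whole page. re.match with a leading '.*' only works
    # when the token is on the first line, because '.' does not match newlines.
    match_obj = re.search('name="_xsrf" value="(.*?)"', response.text)
    if match_obj:
        return match_obj.group(1)
    return ""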
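As an alternative to the regular expression (not the approach used in this post), the token can also be read with the standard library's HTML parser, which does not care how the page is split across lines. XsrfParser and extract_xsrf are hypothetical names, and the sample page is made up.

```python
from html.parser import HTMLParser


class XsrfParser(HTMLParser):
    """Remember the value of the first <input name="_xsrf"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.xsrf = ""

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "input" and attr_map.get("name") == "_xsrf" and not self.xsrf:
            self.xsrf = attr_map.get("value") or ""


def extract_xsrf(html):
    """Return the _xsrf value found in html, or an empty string if there is none."""
    parser = XsrfParser()
    parser.feed(html)
    return parser.xsrf


# Made-up page where the token is not on the first line.
sample = """<html>
<body>
<form>
<input type="hidden" name="_xsrf" value="token-on-line-four">
</form>
</body>
</html>"""
print(extract_xsrf(sample))
```

In get_xsrf, extract_xsrf(response.text) could stand in for the regular expression if the page layout turns out to be awkward to match.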
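2.2 Getting the captcha
import time
from PIL import Image

def get_captcha():
    # Millisecond timestamp sent as the r query parameter
    t = str(int(time.time() * 1000))
    captcha_url = 'https://www.zhihu.com/captcha.gif?r=' + t + "&type=login"
    print(captcha_url)
    response = session.get(captcha_url, headers=header)
    # The with block closes the file, so no explicit close() is needed
    with open('captcha.gif', 'wb') as f:
        f.write(response.content)
    try:
        # Show the image so a person can read it
        im = Image.open('captcha.gif')
        im.show()
        im.close()
    except Exception:
        # If the image cannot be displayed, open captcha.gif by hand
        pass
    captcha = input('Enter the captcha: ')
    return captcha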
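The captcha URL carries the current time in milliseconds as its r parameter, presumably so that each request fetches a fresh image. A minimal sketch of that URL construction, with build_captcha_url as a hypothetical helper that accepts a fixed time so the result can be checked:

```python
import time
from urllib.parse import urlencode


def build_captcha_url(now=None):
    """Build the captcha URL with a millisecond timestamp (hypothetical helper)."""
    if now is None:
        now = time.time()
    query = urlencode({"r": int(now * 1000), "type": "login"})
    return "https://www.zhihu.com/captcha.gif?" + query


# A fixed time makes the output reproducible.
print(build_captcha_url(1500000000.5))
```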
get_captcha relies on a person to solve the captcha: type in the characters shown in the generated GIF.
3. Creating the login function
import requests
import re

# A single session keeps cookies across requests; get_xsrf and get_captcha use it too
session = requests.Session()
agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.91 Safari/537.36"
header = {
    "HOST": "www.zhihu.com",
    "Referer": "https://www.zhihu.com",
    'User-Agent': agent
}
def zhihu_login(account, password):
    # Log in to Zhihu
    if re.match(r"^1\d{10}", account):
        print("Logging in with a phone number")
        # The phone-number login URL captured earlier
        post_url = "https://www.zhihu.com/login/phone_num"
        post_data = {
            "_xsrf": get_xsrf(),
            "phone_num": account,
            "password": password,
            "captcha": get_captcha()
        }
        # The session stores the cookies returned by a successful login
        response_text = session.post(post_url, data=post_data, headers=header)
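One detail of the phone-number check: the pattern ^1\d{10} anchors only the start of the string, so an account with extra trailing digits still passes. A stricter check, sketched here with a hypothetical account_type helper and illustrative labels, uses re.fullmatch:

```python
import re

# 1 followed by exactly ten more digits, and nothing else.
PHONE_PATTERN = re.compile(r"1\d{10}")


def account_type(account):
    """Classify an account string; the labels are illustrative (hypothetical helper)."""
    if PHONE_PATTERN.fullmatch(account):
        return "phone"
    return "unknown"


# The prefix-only pattern accepts a 15-digit string.
print(re.match(r"^1\d{10}", "132656045889999") is not None)
print(account_type("13265604588"))
print(account_type("132656045889999"))
```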
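Combined with the get_xsrf and get_captcha functions from earlier, zhihu_login can log in to Zhihu. How do we verify that it worked?
4. Checking whether the login succeeded
Without a successful login, the private-message page https://www.zhihu.com/inbox cannot be accessed.
Open that URL in a browser without logging in.
The response status code is 302, a temporary redirect: Zhihu sends us to the login page, which means we have no access. So checking the status code of the response tells us whether the login succeeded. The code is as follows: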
def is_login():
    # Use the status code returned by the inbox page to tell whether we are logged in
    inbox_url = "https://www.zhihu.com/inbox"
    # allow_redirects=False stops requests from following the 302 to the login page
    response = session.get(inbox_url, headers=header, allow_redirects=False)
    if response.status_code != 200:
        return False
    else:
        return True
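Note the allow_redirects=False argument passed to session.get, which turns off redirect following. If redirects are left on, a failed access is redirected to the login page, and the status code that comes back is still 200.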
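The effect can be reproduced offline with the standard library alone. The sketch below is a stand-in built on assumptions, not Zhihu's actual behaviour: FakeSiteHandler is a hypothetical local server that answers /inbox with a 302 to /login, http.client plays the role of allow_redirects=False (it never follows redirects), and urllib.request plays the role of the default redirect-following behaviour.

```python
import http.client
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class FakeSiteHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a site that guards /inbox behind a login."""

    def do_GET(self):
        if self.path == "/inbox":
            # Not logged in: bounce the client to the login page.
            self.send_response(302)
            self.send_header("Location", "/login")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"login page"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def status_codes_for_inbox():
    """Return (status without following redirects, status after following them)."""
    server = HTTPServer(("127.0.0.1", 0), FakeSiteHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # http.client never follows redirects, like allow_redirects=False.
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/inbox")
        raw_status = conn.getresponse().status
        conn.close()

        # urllib.request follows redirects by default; proxies are disabled
        # so the request always reaches the local server directly.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        url = "http://127.0.0.1:{0}/inbox".format(port)
        with opener.open(url, timeout=5) as resp:
            followed_status = resp.status
    finally:
        server.shutdown()
        server.server_close()
    return raw_status, followed_status


print(status_codes_for_inbox())
```

The first number is what is_login relies on; the second shows why following the redirect hides the failure.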
5. Complete code
import requests
import re

# A single session keeps cookies across all requests
session = requests.Session()
agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.91 Safari/537.36"
header = {
    "HOST": "www.zhihu.com",
    "Referer": "https://www.zhihu.com",
    'User-Agent': agent
}

def get_xsrf():
    # Fetch the home page and pull out the _xsrf token
    response = session.get("https://www.zhihu.com", headers=header)
    # print(response.text)
    # Example of the tag being matched:
    # text = '<input type="hidden" name="_xsrf" value="37616639663361332d393965632d346634632d396166362d356538383763653738363637">'
    # re.search scans the whole page. re.match with a leading '.*' only works
    # when the token is on the first line, because '.' does not match newlines.
    match_obj = re.search('name="_xsrf" value="(.*?)"', response.text)
    if match_obj:
        return match_obj.group(1)
    return ""

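def get_captcha():
    import time
    # Millisecond timestamp sent as the r query parameter
    t = str(int(time.time() * 1000))
    captcha_url = "https://www.zhihu.com/captcha.gif?r={0}&type=login".format(t)
    response = session.get(captcha_url, headers=header)
    # The with block closes the file, so no explicit close() is needed
    with open("captcha.gif", "wb") as f:
        f.write(response.content)
    from PIL import Image
    try:
        # Show the image so a person can read it
        im = Image.open("captcha.gif")
        im.show()
        im.close()
    except Exception:
        # If the image cannot be displayed, open captcha.gif by hand
        pass
    captcha = input("Enter the captcha\n>")
    return captcha
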
def zhihu_login(account, password):
    # Log in to Zhihu
    if re.match(r"^1\d{10}", account):
        print("Logging in with a phone number")
        post_url = "https://www.zhihu.com/login/phone_num"
        post_data = {
            "_xsrf": get_xsrf(),
            "phone_num": account,
            "password": password,
            "captcha": get_captcha()
        }
        # The session stores the cookies returned by a successful login
        response_text = session.post(post_url, data=post_data, headers=header)

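def is_login():
    # Use the status code returned by the inbox page to tell whether we are logged in
    inbox_url = "https://www.zhihu.com/inbox"
    # allow_redirects=False stops requests from following the 302 to the login page
    response = session.get(inbox_url, headers=header, allow_redirects=False)
    if response.status_code != 200:
        return False
    else:
        return True

# Replace the placeholders with your own phone number and password
zhihu_login("yourPhoneNumber", "password")
print(is_login())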
Output: