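Simulating user login and cookie storage with Scrapy on Python 3: intermediate (Baidu Cloud Club, 百度雲俱樂部)
1. Background
- Review of related basics:
- Simulating user login with requests on Python 3, intermediate (Baidu Cloud Club): https://blog.csdn.net/zwq912318834/article/details/79665863
- Simulating user login and cookie storage with Scrapy on Python 3, basics (Mafengwo): https://blog.csdn.net/zwq912318834/article/details/79614372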
2. Environment
- OS: Windows 7
- python 3.6.1
- scrapy 1.4.0
- requests 2.14.2 (as reported by pip list)
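3. Simulating login to Baidu Cloud Club
- For the analysis of the Baidu Cloud Club login flow, see: https://blog.csdn.net/zwq912318834/article/details/79665863
3.1. How the requests library and the Scrapy framework differ
- The big difference between a Scrapy crawler and a requests crawler is that Scrapy is an asynchronous crawling framework. This shows up in two ways:
- First, asynchrony. Requests issued from within Scrapy have no guaranteed order: yielding a request first does not mean it is sent first.
- Second, it is a framework. Most of the plumbing is already built in. Take cookies: with 'COOKIES_ENABLED': True (the default), Scrapy stores the cookies each response sets and sends them with later requests, so all requests share one cookie session and there is no need to keep the session alive by hand as with a requests Session.
- Given the above, the one thing to watch in the Baidu Cloud Club login is the order of the requests, because the login parameters have to be obtained in sequence. The link between the captcha and the login parameters, which is carried by the cookies, is handled by Scrapy itself.
- In PyCharm, running in debug mode lets you inspect the whole response object and find the corresponding cookie information.
3.2. Code walkthrough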
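The ordering point can be illustrated without Scrapy. The sketch below is a toy stand-in for the engine, not Scrapy's actual scheduler: `fetch`, `run`, and the three callbacks are hypothetical names invented for this illustration. It shows why chaining works: each step yields the next request from inside its own callback, so the next request does not exist until the previous response has been handled.

```python
from collections import deque


def fetch(url):
    """Hypothetical stand-in for a download: returns a fake body for the URL."""
    return f"<body of {url}>"


def parse_login_page(body, meta):
    # Step 1: pretend the form parameters were extracted, then request step 2.
    meta = dict(meta, formhash="7736cc00")
    yield ("captcha-update", parse_update, meta)


def parse_update(body, meta):
    # Step 2 is only scheduled once step 1 has produced its parameters.
    meta = dict(meta, update=88800)
    yield ("captcha-image", parse_captcha, meta)


def parse_captcha(body, meta):
    # Step 3: last link of the chain, nothing further to request.
    return []


def run(start_url, start_callback):
    """Toy scheduler: run each callback and queue whatever it yields."""
    queue = deque([(start_url, start_callback, {})])
    visited = []
    while queue:
        url, callback, meta = queue.popleft()
        visited.append((url, sorted(meta)))
        # The chain never has more than one pending request,
        # so there is nothing for a scheduler to reorder.
        queue.extend(callback(fetch(url), meta))
    return visited


order = run("login-page", parse_login_page)
for url, keys in order:
    print(url, keys)
```

In the spider itself the same idea is expressed with scrapy.Request(callback=..., meta=...): parameters collected so far travel in meta, and each callback yields exactly one follow-up request.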
# File: baiduyunSpider.py
# -*- coding: utf-8 -*-
import scrapy
import datetime
import re
import random
from PIL import Image


# Save the cookies to a text file
def convertResponseCookieFormat(cookieLstInfo, cookieFileName):
    '''
    Set-Cookie = [b'L3em_2132_saltkey=nMjUM797; expires=Mon, 23-Apr-2018 05:48:42 GMT; Max-Age=2592000; path=/; httponly', b'L3em_2132_lastvisit=1521866922; expires=Mon, 23-Apr-2018 05:48:42 GMT; Max-Age=2592000; path=/', b'L3em_2132_sid=g8DFgE; expires=Sun, 25-Mar-2018 05:48:42 GMT; Max-Age=86400; path=/', b'L3em_2132_lastact=1521870522%09member.php%09logging; expires=Sun, 25-Mar-2018 05:48:42 GMT; Max-Age=86400; path=/', b'L3em_2132_sid=g8DFgE; expires=Sun, 25-Mar-2018 05:48:42 GMT; Max-Age=86400; path=/']
    item = b'L3em_2132_saltkey=nMjUM797; expires=Mon, 23-Apr-2018 05:48:42 GMT; Max-Age=2592000; path=/; httponly'
    item = b'L3em_2132_lastvisit=1521866922; expires=Mon, 23-Apr-2018 05:48:42 GMT; Max-Age=2592000; path=/'
    item = b'L3em_2132_sid=g8DFgE; expires=Sun, 25-Mar-2018 05:48:42 GMT; Max-Age=86400; path=/'
    item = b'L3em_2132_lastact=1521870522%09member.php%09logging; expires=Sun, 25-Mar-2018 05:48:42 GMT; Max-Age=86400; path=/'
    item = b'L3em_2132_sid=g8DFgE; expires=Sun, 25-Mar-2018 05:48:42 GMT; Max-Age=86400; path=/'
    :param cookieLstInfo: raw Set-Cookie header values (bytes), as shown above
    :param cookieFileName: text file the cookies are written to
    :return: the cookies as a list of str
    '''
    cookieLst = []
    if len(cookieLstInfo) > 0:
        for cookieItem in cookieLstInfo:
            cookieItemStr = str(cookieItem, encoding="utf8")
            cookieLst.append(cookieItemStr)
        # Write the cookies to a file for later use
        with open(cookieFileName, 'w') as f:
            for cookieValue in cookieLst:
                f.write(str(cookieValue) + '\n')
    return cookieLst


class baiduyunSpider(scrapy.Spider):
    # Per-spider settings
    custom_settings = {
        'LOG_LEVEL': 'DEBUG',       # log level; DEBUG, the lowest level, is the default
        'ROBOTSTXT_OBEY': False,    # whether to obey robots.txt rules (disabled here)
        'DOWNLOAD_DELAY': 2,        # download delay in seconds; the default is 0
        'COOKIES_ENABLED': True,    # enabled by default; needed when crawling data behind a login
        # 'COOKIES_DEBUG': True,    # defaults to False; when enabled, Scrapy logs all cookies sent in requests (Cookie header) and received in responses (Set-Cookie header)
        'DOWNLOAD_TIMEOUT': 20,     # download timeout in seconds; can be set for the whole spider or per request via Request.meta['download_timeout']
    }
    name = 'baiduyun'
    allowed_domains = ['51baiduyun.com']
    host = "http://www.51baiduyun.com/"
    account = "13725168940"     # Baidu Cloud Club account
    password = "aaa00000000"    # password
    userAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
    headerData = {
        "Referer": "http://www.51baiduyun.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    }

    # Entry point of the spider
    def start_requests(self):
        print("start baiduyun clawer")
        # Baidu Cloud Club login page
        baiduyunLoginArgsPage = "http://www.51baiduyun.com/member.php?mod=logging&action=login&infloat=yes&handlekey=login&inajax=1&ajaxtarget=fwin_content_login"
        loginArgsIndexReq = scrapy.Request(
            url=baiduyunLoginArgsPage,
            headers=self.headerData,
            callback=self.parseLoginArgsPage,
            dont_filter=True,   # keep the duplicate filter from dropping a repeated request
        )
        yield loginArgsIndexReq

    # Get the parameters needed for login from the page source and the cookies: formhash, referer, seccodehash
    def parseLoginArgsPage(self, response):
        # Step one: extract formhash and referer from the page source
        print(f"parseLoginArgsPage: statusCode = {response.status}, url = {response.url}")
        '''
        <div class="c cl">
        <input type="hidden" name="formhash" value="7736cc00" />
        <input type="hidden" name="referer" value="http://www.51baiduyun.com/" />
        '''
        # formhashRe = re.search('name="formhash" value="(.*?)"', response.text, re.DOTALL)
        # Raw strings keep \w from being treated as a string escape
        formhashRe = re.search(r'name="formhash" value="(\w+?)"', response.text, re.DOTALL)
        refererRe = re.search(r'name="referer" value="(.*?)"', response.text, re.DOTALL)
        print(f"formhashRe = {formhashRe}, refererRe = {refererRe}")
        if formhashRe:
            formhash = formhashRe.group(1)
        else:
            formhash = ""
        if refererRe:
            referer = refererRe.group(1)
        else:
            referer = ""
        # Cookies carried by the request, i.e. what we sent to the site
        # Cookie = response.request.headers.getlist('Cookie')
        # print(f'CookieReq = {Cookie}')
        # Cookies returned by the server, i.e. what the site sent to us
        Cookie = response.headers.getlist('Set-Cookie')
        print(f"Set-Cookie = {Cookie}")
        cookieFileName = "baiduyunCookies.txt"
        '''
        cookieInfoLst = ['L3em_2132_saltkey=w0QHA0q5; expires=Mon, 23-Apr-2018 05:59:00 GMT; Max-Age=2592000; path=/; httponly', 'L3em_2132_lastvisit=1521867540; expires=Mon, 23-Apr-2018 05:59:00 GMT; Max-Age=2592000; path=/', 'L3em_2132_sid=mALP7a; expires=Sun, 25-Mar-2018 05:59:00 GMT; Max-Age=86400; path=/', 'L3em_2132_lastact=1521871140%09member.php%09logging; expires=Sun, 25-Mar-2018 05:59:00 GMT; Max-Age=86400; path=/', 'L3em_2132_sid=mALP7a; expires=Sun, 25-Mar-2018 05:59:00 GMT; Max-Age=86400; path=/']
        '''
        cookieInfoLst = convertResponseCookieFormat(Cookie, cookieFileName)
        print(f"cookieInfoLst = {cookieInfoLst}")
        # Step two: seccodehash is 'cSA' followed by the value of the *_sid cookie
        sid = ""
        for cookieItem in cookieInfoLst:
            if cookieItem.find("_sid=") != -1:
                sidRe = re.search(r'_sid=(\w+?);', cookieItem)
                if sidRe:
                    sid = sidRe.group(1)
        print(f"sid = {sid}")
        seccodehash = 'cSA' + sid
        postData = {
            "formhash": formhash,
            "referer": referer,
            "seccodehash": seccodehash,
        }
        # Next comes the captcha image. Requesting it takes two steps; the first
        # request fetches the value of the "update" parameter.
        # The postData collected so far is passed along in meta.
        randomFloat = random.uniform(0, 1)
        url = f"http://www.51baiduyun.com/misc.php?mod=seccode&action=update&idhash={seccodehash}&{randomFloat}&modid=undefined"
        yield scrapy.Request(
            url=url,
            headers=self.headerData,
            meta={"postData": postData},
            callback=self.parseUpdateForCaptcha,
            dont_filter=True,   # keep the duplicate filter from dropping a repeated request
        )

    # Get the parameter needed to request the captcha: update
    def parseUpdateForCaptcha(self, response):
        # Prelude to requesting the captcha: get the value of the update parameter first
        print(f"parseUpdateForCaptcha: statusCode = {response.status}, url = {response.url}")
        postData = response.meta.get("postData", {})
        updateRe = re.search(r'update=(\d+?)&', response.text, re.DOTALL)
        print(f"updateRe = {updateRe}")
        if updateRe:
            update = int(updateRe.group(1))
        else:
            update = 0
        print(f"update = {update}")
        # With update in hand, request the captcha image, for example:
        # http://www.51baiduyun.com/misc.php?mod=seccode&update=88800&idhash=cSAY3fpK6
        seccodehash = postData['seccodehash']
        captchaUrl = f"http://www.51baiduyun.com/misc.php?mod=seccode&update={update}&idhash={seccodehash}"
        yield scrapy.Request(
            url=captchaUrl,
            headers=self.headerData,
            meta={"postData": postData},    # keep passing it along
            callback=self.parseCaptcha,
            dont_filter=True,   # keep the duplicate filter from dropping a repeated request
        )

    # Get the captcha
    def parseCaptcha(self, response):
        # Save the captcha image from the response
        print(f"parseCaptcha: statusCode = {response.status}, url = {response.url}")
        postData = response.meta.get("postData", {})
        # print(f"t = {response.text}")     # printing this shows the payload is an image
        with open("captcha51baiduyun.jpg", "wb") as f:
            # Important: write response.body (bytes), not response.text
            f.write(response.body)
        # To keep the logic simple, the captcha is typed in by hand for now.
        # For automatic captcha solving, see: https://blog.csdn.net/zwq912318834/article/details/78616462
        try:
            imObj = Image.open('captcha51baiduyun.jpg')
            imObj.show()
            imObj.close()
        except Exception:
            # Showing the image is only a convenience; carry on if it fails
            pass
        captcha = input("輸入驗證碼\n>").strip()     # prompt: "enter the captcha"
        print(f"input captcha is : {captcha}")
        # Finally, log in with all the collected parameters
        print("開始模擬登錄百度雲俱樂部")     # "starting simulated login to Baidu Cloud Club"
        postUrl = "http://www.51baiduyun.com/member.php?mod=logging&action=login&loginsubmit=yes&handlekey=login&loginhash=Lpd1b&inajax=1"
        '''
        formhash:eb6fc0ed
        referer:http://www.51baiduyun.com/
        loginfield:username
        username:aaaaaa
        password:abc123456
        questionid:0
        answer:
        seccodehash:cSAY3fpK6
        seccodemodid:member::logging
        seccodeverify:ejwe
        '''
        postData["loginfield"] = 'username'
        postData["username"] = self.account
        postData["password"] = self.password
        postData["questionid"] = '0'
        postData["answer"] = ''
        postData["seccodemodid"] = 'member::logging'
        postData["seccodeverify"] = captcha
        yield scrapy.FormRequest(
            url=postUrl,
            method="POST",
            formdata=postData,
            callback=self.parseLoginResPage,
            dont_filter=True,   # keep the duplicate filter from dropping a repeated request
        )

    # Check the login result
    def parseLoginResPage(self, response):
        # Print the login result
        print(f"parseLoginResPage: statusCode = {response.status}, url = {response.url}")
        print(f"text = {response.text}")

    # Normal page parsing
    def parse(self, response):
        print(f"parse: url = {response.url}, meta = {response.meta}")

    # Request error handling: print, write to a file, or write to a database
    def errorHandle(self, failure):
        print(f"request error: {failure.value.response}")

    # Wrap-up work when the spider finishes, e.g. print a message or send an email
    def closed(self, reason):
        # An email could be sent when the crawl ends
        finishTime = datetime.datetime.now()
        subject = f"clawerName had finished, reason = {reason}, finishedTime = {finishTime}"
        print(f"subject = {subject}")
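The cookie handling in parseLoginArgsPage can be tried on its own, outside Scrapy. The following is a minimal sketch using only the standard library; the helper names `parse_set_cookie` and `find_sid` are hypothetical, the sample headers are taken from the docstring above, and the parser is deliberately simplified: it keeps only the leading name=value pair of each header and ignores attributes such as expires and path.

```python
def parse_set_cookie(raw_headers):
    """Map cookie names to values from raw Set-Cookie header values (bytes)."""
    cookies = {}
    for raw in raw_headers:
        text = raw.decode("utf-8")
        # The first "name=value" pair is the cookie; the rest are attributes.
        name, _, value = text.split(";", 1)[0].strip().partition("=")
        cookies[name] = value
    return cookies


def find_sid(cookies):
    """Return the value of the first cookie whose name ends in "_sid", or ""."""
    for name, value in cookies.items():
        if name.endswith("_sid"):
            return value
    return ""


sample_headers = [
    b"L3em_2132_saltkey=nMjUM797; expires=Mon, 23-Apr-2018 05:48:42 GMT; Max-Age=2592000; path=/; httponly",
    b"L3em_2132_sid=g8DFgE; expires=Sun, 25-Mar-2018 05:48:42 GMT; Max-Age=86400; path=/",
    b"L3em_2132_lastact=1521870522%09member.php%09logging; expires=Sun, 25-Mar-2018 05:48:42 GMT; Max-Age=86400; path=/",
]

cookies = parse_set_cookie(sample_headers)
sid = find_sid(cookies)
seccodehash = "cSA" + sid   # same rule the spider uses
print(sid, seccodehash)
```

Working on a name-to-value dict avoids running a regular expression over the whole header line, at the cost of discarding the expiry information that the spider's text file keeps.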
3.3. Results
- Login succeeded
start baiduyun clawer
2018-03-24 14:25:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.51baiduyun.com/member.php?mod=logging&action=login&infloat=yes&handlekey=login&inajax=1&ajaxtarget=fwin_content_login> (referer: http://www.51baiduyun.com/)
parseLoginArgsPage: statusCode = 200, url = http://www.51baiduyun.com/member.php?mod=logging&action=login&infloat=yes&handlekey=login&inajax=1&ajaxtarget=fwin_content_login
formhashRe = <_sre.SRE_Match object; span=(650, 682), match='name="formhash" value="74e7159e"'>, refererRe = <_sre.SRE_Match object; span=(708, 757), match='name="referer" value="http://www.51baiduyun.com/">
Set-Cookie = [b'L3em_2132_saltkey=eFHQH69h; expires=Mon, 23-Apr-2018 06:29:56 GMT; Max-Age=2592000; path=/; httponly', b'L3em_2132_lastvisit=1521869396; expires=Mon, 23-Apr-2018 06:29:56 GMT; Max-Age=2592000; path=/', b'L3em_2132_sid=Q37hU7; expires=Sun, 25-Mar-2018 06:29:56 GMT; Max-Age=86400; path=/', b'L3em_2132_lastact=1521872996%09member.php%09logging; expires=Sun, 25-Mar-2018 06:29:56 GMT; Max-Age=86400; path=/', b'L3em_2132_sid=Q37hU7; expires=Sun, 25-Mar-2018 06:29:56 GMT; Max-Age=86400; path=/']
cookieInfoLst = ['L3em_2132_saltkey=eFHQH69h; expires=Mon, 23-Apr-2018 06:29:56 GMT; Max-Age=2592000; path=/; httponly', 'L3em_2132_lastvisit=1521869396; expires=Mon, 23-Apr-2018 06:29:56 GMT; Max-Age=2592000; path=/', 'L3em_2132_sid=Q37hU7; expires=Sun, 25-Mar-2018 06:29:56 GMT; Max-Age=86400; path=/', 'L3em_2132_lastact=1521872996%09member.php%09logging; expires=Sun, 25-Mar-2018 06:29:56 GMT; Max-Age=86400; path=/', 'L3em_2132_sid=Q37hU7; expires=Sun, 25-Mar-2018 06:29:56 GMT; Max-Age=86400; path=/']
sid = Q37hU7
2018-03-24 14:25:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.51baiduyun.com/misc.php?mod=seccode&action=update&idhash=cSAQ37hU7&0.47346051143907475&modid=undefined> (referer: http://www.51baiduyun.com/)
parseUpdateForCaptcha: statusCode = 200, url = http://www.51baiduyun.com/misc.php?mod=seccode&action=update&idhash=cSAQ37hU7&0.47346051143907475&modid=undefined
updateRe = <_sre.SRE_Match object; span=(1079, 1092), match='update=26184&'>
update = 26184
2018-03-24 14:25:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.51baiduyun.com/misc.php?mod=seccode&update=26184&idhash=cSAQ37hU7> (referer: http://www.51baiduyun.com/)
parseCaptcha: statusCode = 200, url = http://www.51baiduyun.com/misc.php?mod=seccode&update=26184&idhash=cSAQ37hU7
2018-03-24 14:25:23 [PIL.PngImagePlugin] DEBUG: STREAM b'IHDR' 16 13
2018-03-24 14:25:23 [PIL.PngImagePlugin] DEBUG: STREAM b'IDAT' 41 2518
輸入驗證碼
>2018-03-24 14:25:24 [PIL.Image] DEBUG: Error closing: 'NoneType' object has no attribute 'close'
6Y4Q
input captcha is : 6Y4Q
開始模擬登錄百度雲俱樂部
2018-03-24 14:25:37 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.51baiduyun.com/member.php?mod=logging&action=login&loginsubmit=yes&handlekey=login&loginhash=Lpd1b&inajax=1> (referer: http://www.51baiduyun.com/misc.php?mod=seccode&update=26184&idhash=cSAQ37hU7)
parseLoginResPage: statusCode = 200, url = http://www.51baiduyun.com/member.php?mod=logging&action=login&loginsubmit=yes&handlekey=login&loginhash=Lpd1b&inajax=1
text = <?xml version="1.0" encoding="utf-8"?>
<root><![CDATA[<script type="text/javascript" reload="1">if(typeof succeedhandle_login=='function') {succeedhandle_login('http://www.51baiduyun.com/', '歡迎您回來,見習會員 13725168940,現在將轉入登錄前頁面', {'username':'13725168940','usergroup':'見習會員','uid':'1315026','groupid':'23','syn':'0'});}hideWindow('login');showDialog('歡迎您回來,見習會員 13725168940,現在將轉入登錄前頁面', 'right', null, function () { window.location.href ='http://www.51baiduyun.com/'; }, 0, null, null, null, null, null, 3);</script>]]></root>
subject = clawerName had finished, reason = finished, finishedTime = 2018-03-24 14:25:37.954602
- Login failed: wrong captcha
start baiduyun clawer
2018-03-24 14:38:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.51baiduyun.com/member.php?mod=logging&action=login&infloat=yes&handlekey=login&inajax=1&ajaxtarget=fwin_content_login> (referer: http://www.51baiduyun.com/)
parseLoginArgsPage: statusCode = 200, url = http://www.51baiduyun.com/member.php?mod=logging&action=login&infloat=yes&handlekey=login&inajax=1&ajaxtarget=fwin_content_login
formhashRe = <_sre.SRE_Match object; span=(650, 682), match='name="formhash" value="177921b6"'>, refererRe = <_sre.SRE_Match object; span=(708, 757), match='name="referer" value="http://www.51baiduyun.com/">
Set-Cookie = [b'L3em_2132_saltkey=Tb0DPHLs; expires=Mon, 23-Apr-2018 06:43:35 GMT; Max-Age=2592000; path=/; httponly', b'L3em_2132_lastvisit=1521870215; expires=Mon, 23-Apr-2018 06:43:35 GMT; Max-Age=2592000; path=/', b'L3em_2132_sid=pwg1uj; expires=Sun, 25-Mar-2018 06:43:35 GMT; Max-Age=86400; path=/', b'L3em_2132_lastact=1521873815%09member.php%09logging; expires=Sun, 25-Mar-2018 06:43:35 GMT; Max-Age=86400; path=/', b'L3em_2132_sid=pwg1uj; expires=Sun, 25-Mar-2018 06:43:35 GMT; Max-Age=86400; path=/']
cookieInfoLst = ['L3em_2132_saltkey=Tb0DPHLs; expires=Mon, 23-Apr-2018 06:43:35 GMT; Max-Age=2592000; path=/; httponly', 'L3em_2132_lastvisit=1521870215; expires=Mon, 23-Apr-2018 06:43:35 GMT; Max-Age=2592000; path=/', 'L3em_2132_sid=pwg1uj; expires=Sun, 25-Mar-2018 06:43:35 GMT; Max-Age=86400; path=/', 'L3em_2132_lastact=1521873815%09member.php%09logging; expires=Sun, 25-Mar-2018 06:43:35 GMT; Max-Age=86400; path=/', 'L3em_2132_sid=pwg1uj; expires=Sun, 25-Mar-2018 06:43:35 GMT; Max-Age=86400; path=/']
sid = pwg1uj
2018-03-24 14:38:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.51baiduyun.com/misc.php?mod=seccode&action=update&idhash=cSApwg1uj&0.8624172924051665&modid=undefined> (referer: http://www.51baiduyun.com/)
parseUpdateForCaptcha: statusCode = 200, url = http://www.51baiduyun.com/misc.php?mod=seccode&action=update&idhash=cSApwg1uj&0.8624172924051665&modid=undefined
updateRe = <_sre.SRE_Match object; span=(1079, 1092), match='update=22682&'>
update = 22682
2018-03-24 14:39:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.51baiduyun.com/misc.php?mod=seccode&update=22682&idhash=cSApwg1uj> (referer: http://www.51baiduyun.com/)
parseCaptcha: statusCode = 200, url = http://www.51baiduyun.com/misc.php?mod=seccode&update=22682&idhash=cSApwg1uj
2018-03-24 14:39:01 [PIL.PngImagePlugin] DEBUG: STREAM b'IHDR' 16 13
2018-03-24 14:39:01 [PIL.PngImagePlugin] DEBUG: STREAM b'IDAT' 41 2539
2018-03-24 14:39:02 [PIL.Image] DEBUG: Error closing: 'NoneType' object has no attribute 'close'
輸入驗證碼
>C94a
input captcha is : C94a
開始模擬登錄百度雲俱樂部
2018-03-24 14:39:13 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.51baiduyun.com/member.php?mod=logging&action=login&loginsubmit=yes&handlekey=login&loginhash=Lpd1b&inajax=1> (referer: http://www.51baiduyun.com/misc.php?mod=seccode&update=22682&idhash=cSApwg1uj)
parseLoginResPage: statusCode = 200, url = http://www.51baiduyun.com/member.php?mod=logging&action=login&loginsubmit=yes&handlekey=login&loginhash=Lpd1b&inajax=1
text = <?xml version="1.0" encoding="utf-8"?>
<root><![CDATA[抱歉,驗證碼填寫錯誤<script type="text/javascript" reload="1">if(typeof errorhandle_login=='function') {errorhandle_login('抱歉,驗證碼填寫錯誤', {});}</script>]]></root>
subject = clawerName had finished, reason = finished, finishedTime = 2018-03-24 14:39:13.302602
- Login failed: wrong password
start baiduyun clawer
2018-03-24 14:37:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.51baiduyun.com/member.php?mod=logging&action=login&infloat=yes&handlekey=login&inajax=1&ajaxtarget=fwin_content_login> (referer: http://www.51baiduyun.com/)
parseLoginArgsPage: statusCode = 200, url = http://www.51baiduyun.com/member.php?mod=logging&action=login&infloat=yes&handlekey=login&inajax=1&ajaxtarget=fwin_content_login
formhashRe = <_sre.SRE_Match object; span=(650, 682), match='name="formhash" value="ded981e3"'>, refererRe = <_sre.SRE_Match object; span=(708, 757), match='name="referer" value="http://www.51baiduyun.com/">
Set-Cookie = [b'L3em_2132_saltkey=lnzh8Msb; expires=Mon, 23-Apr-2018 06:41:51 GMT; Max-Age=2592000; path=/; httponly', b'L3em_2132_lastvisit=1521870111; expires=Mon, 23-Apr-2018 06:41:51 GMT; Max-Age=2591999; path=/', b'L3em_2132_sid=j91h9s; expires=Sun, 25-Mar-2018 06:41:51 GMT; Max-Age=86399; path=/', b'L3em_2132_lastact=1521873711%09member.php%09logging; expires=Sun, 25-Mar-2018 06:41:51 GMT; Max-Age=86399; path=/', b'L3em_2132_sid=j91h9s; expires=Sun, 25-Mar-2018 06:41:51 GMT; Max-Age=86399; path=/']
cookieInfoLst = ['L3em_2132_saltkey=lnzh8Msb; expires=Mon, 23-Apr-2018 06:41:51 GMT; Max-Age=2592000; path=/; httponly', 'L3em_2132_lastvisit=1521870111; expires=Mon, 23-Apr-2018 06:41:51 GMT; Max-Age=2591999; path=/', 'L3em_2132_sid=j91h9s; expires=Sun, 25-Mar-2018 06:41:51 GMT; Max-Age=86399; path=/', 'L3em_2132_lastact=1521873711%09member.php%09logging; expires=Sun, 25-Mar-2018 06:41:51 GMT; Max-Age=86399; path=/', 'L3em_2132_sid=j91h9s; expires=Sun, 25-Mar-2018 06:41:51 GMT; Max-Age=86399; path=/']
sid = j91h9s
2018-03-24 14:37:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.51baiduyun.com/misc.php?mod=seccode&action=update&idhash=cSAj91h9s&0.5541295116875323&modid=undefined> (referer: http://www.51baiduyun.com/)
parseUpdateForCaptcha: statusCode = 200, url = http://www.51baiduyun.com/misc.php?mod=seccode&action=update&idhash=cSAj91h9s&0.5541295116875323&modid=undefined
updateRe = <_sre.SRE_Match object; span=(1079, 1092), match='update=12248&'>
update = 12248
2018-03-24 14:37:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.51baiduyun.com/misc.php?mod=seccode&update=12248&idhash=cSAj91h9s> (referer: http://www.51baiduyun.com/)
parseCaptcha: statusCode = 200, url = http://www.51baiduyun.com/misc.php?mod=seccode&update=12248&idhash=cSAj91h9s
2018-03-24 14:37:19 [PIL.PngImagePlugin] DEBUG: STREAM b'IHDR' 16 13
2018-03-24 14:37:19 [PIL.PngImagePlugin] DEBUG: STREAM b'IDAT' 41 2498
輸入驗證碼
2018-03-24 14:37:20 [PIL.Image] DEBUG: Error closing: 'NoneType' object has no attribute 'close'
>CY7R
input captcha is : CY7R
開始模擬登錄百度雲俱樂部
2018-03-24 14:37:25 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.51baiduyun.com/member.php?mod=logging&action=login&loginsubmit=yes&handlekey=login&loginhash=Lpd1b&inajax=1> (referer: http://www.51baiduyun.com/misc.php?mod=seccode&update=12248&idhash=cSAj91h9s)
parseLoginResPage: statusCode = 200, url = http://www.51baiduyun.com/member.php?mod=logging&action=login&loginsubmit=yes&handlekey=login&loginhash=Lpd1b&inajax=1
text = <?xml version="1.0" encoding="utf-8"?>
<root><![CDATA[登錄失敗,您還可以嘗試 3 次<script type="text/javascript" reload="1">if(typeof errorhandle_login=='function') {errorhandle_login('登錄失敗,您還可以嘗試 3 次', {'loginperm':'3'});}</script>]]></root>
subject = clawerName had finished, reason = finished, finishedTime = 2018-03-24 14:37:25.798602
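The three runs differ only in the body of the login response: a successful login calls succeedhandle_login, while both failures call errorhandle_login with a message. A minimal sketch of telling them apart is shown below. The function name `classify_login_response` is hypothetical, the patterns are assumptions based solely on the three responses logged above, and the success sample is abbreviated from that log.

```python
import re


def classify_login_response(text):
    """Return (ok, message): ok is True, False, or None if nothing matched."""
    # Success: succeedhandle_login('<redirect url>', '<message>', {...})
    ok = re.search(r"succeedhandle_login\('[^']*',\s*'([^']*)'", text)
    if ok:
        return True, ok.group(1)
    # Failure: errorhandle_login('<message>', {...})
    err = re.search(r"errorhandle_login\('([^']*)'", text)
    if err:
        return False, err.group(1)
    return None, ""


SUCCESS_REPLY = """<root><![CDATA[<script type="text/javascript" reload="1">if(typeof succeedhandle_login=='function') {succeedhandle_login('http://www.51baiduyun.com/', '歡迎您回來,見習會員 13725168940,現在將轉入登錄前頁面', {'username':'13725168940'});}</script>]]></root>"""

CAPTCHA_REPLY = """<root><![CDATA[抱歉,驗證碼填寫錯誤<script type="text/javascript" reload="1">if(typeof errorhandle_login=='function') {errorhandle_login('抱歉,驗證碼填寫錯誤', {});}</script>]]></root>"""

for reply in (SUCCESS_REPLY, CAPTCHA_REPLY):
    print(classify_login_response(reply))
```

Each pattern requires an opening parenthesis right after the handler name, so the `typeof ...=='function'` guard that precedes every call is skipped. In the spider, parseLoginResPage could pass response.text to such a helper instead of only printing it.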
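3.4. Notes
- Overall the process is fairly simple. The one point that needs care is keeping the captcha tied to the other login parameters, which Scrapy achieves by passing the cookies along. By the same token, to use the requests module inside Scrapy, take the cookies out (keeping in mind that the cookies sent in a request and the cookies returned by the server are not the same thing) and pass them to the requests get and post calls.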
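The difference between the two kinds of cookie can be made concrete. The spider saves the server's Set-Cookie lines, attributes included, whereas a request sends back only name=value pairs in a single Cookie header. The sketch below converts one into the other; the function name `cookie_header_from_file` is hypothetical, and the demo file reuses two sample lines in the format the spider writes to baiduyunCookies.txt.

```python
import os
import tempfile


def cookie_header_from_file(path):
    """Build a Cookie request-header value from saved Set-Cookie lines."""
    pairs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Keep only the leading "name=value"; expires, path and so on are
            # instructions to the client and are never sent back to the server.
            name, _, value = line.split(";", 1)[0].strip().partition("=")
            pairs[name] = value     # a repeated name keeps the latest value
    return "; ".join(f"{name}={value}" for name, value in pairs.items())


# Demo with a temporary file standing in for baiduyunCookies.txt.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "baiduyunCookies.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("L3em_2132_saltkey=nMjUM797; expires=Mon, 23-Apr-2018 05:48:42 GMT; Max-Age=2592000; path=/; httponly\n")
        f.write("L3em_2132_sid=g8DFgE; expires=Sun, 25-Mar-2018 05:48:42 GMT; Max-Age=86400; path=/\n")
    header = cookie_header_from_file(demo_path)

print(header)
```

The resulting string could be sent with requests as headers={"Cookie": header}; passing the name-to-value pairs as a dict through the cookies= argument of requests.get or requests.post is an equivalent route. Whether the site accepts cookies replayed this way depends on their expiry.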