Simulating user login and storing cookies with Scrapy on Python 3 — the basics (Mafengwo)
1. Background
2. Environment
- OS: Windows 7
- Python 3.6.1
- Scrapy 1.4.0
3. The standard simulated-login steps
- Step 1: request the login page and collect any parameters the login form requires (for example, the `_xsrf` token on Zhihu's login page).
- Step 2: POST those parameters, together with the account name and password, to the server to log in.
- Step 3: check whether the login succeeded.
- Step 4: if the login failed, diagnose the error and restart the login routine.
- Step 5: if the login succeeded, crawl the site's pages as usual.
import scrapy
import datetime


class mafengwoSpider(scrapy.Spider):
    custom_settings = {
        'LOG_LEVEL': 'DEBUG',
        'ROBOTSTXT_OBEY': False,
        'DOWNLOAD_DELAY': 2,
        'COOKIES_ENABLED': True,
        'COOKIES_DEBUG': True,
        'DOWNLOAD_TIMEOUT': 25,
    }
    name = 'mafengwo'
    allowed_domains = ['mafengwo.cn']
    host = "http://www.mafengwo.cn/"
    username = "13725168940"
    password = "aaa00000000"
    headerData = {
        "Referer": "https://passport.mafengwo.cn/",
        'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    }

    # Step 1: request the login page to pick up any session cookies it sets
    def start_requests(self):
        print("start mafengwo clawer")
        mafengwoLoginPage = "https://passport.mafengwo.cn/"
        loginIndexReq = scrapy.Request(
            url = mafengwoLoginPage,
            headers = self.headerData,
            callback = self.parseLoginPage,
            dont_filter = True,
        )
        yield loginIndexReq

    # Step 2: POST the account name and password to the login endpoint
    def parseLoginPage(self, response):
        print(f"parseLoginPage: url = {response.url}")
        loginPostUrl = "https://passport.mafengwo.cn/login/"
        yield scrapy.FormRequest(
            url = loginPostUrl,
            headers = self.headerData,
            method = "POST",
            formdata = {
                "passport": self.username,
                "password": self.password,
            },
            callback = self.loginResParse,
            dont_filter = True,
        )

    # Step 3: probe a page that requires login; 'dont_redirect' makes a
    # not-logged-in 302 fail here instead of silently bouncing to the login page
    def loginResParse(self, response):
        print(f"loginResParse: url = {response.url}")
        routeUrl = "http://www.mafengwo.cn/plan/route.php"
        yield scrapy.Request(
            url = routeUrl,
            headers = self.headerData,
            meta = {
                'dont_redirect': True,
            },
            callback = self.isLoginStatusParse,
            dont_filter = True,
        )

    # Step 5: we are logged in; crawl normally from here on
    def isLoginStatusParse(self, response):
        print(f"isLoginStatusParse: url = {response.url}")
        yield scrapy.Request(
            url = "https://www.mafengwo.cn/travel-scenic-spot/mafengwo/10045.html",
            headers = self.headerData,
        )

    def parse(self, response):
        print(f"parse: url = {response.url}, meta = {response.meta}")

    def errorHandle(self, failure):
        print(f"request error: {failure.value.response}")

    def closed(self, reason):
        finishTime = datetime.datetime.now()
        subject = f"clawerName had finished, reason = {reason}, finishedTime = {finishTime}"
        print(f"subject = {subject}")
A run with a successful login produces output like this:
E:\Miniconda\python.exe E:/documentCode/scrapyMafengwo/start.py
2018-03-19 17:03:54 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapyMafengwo)
2018-03-19 17:03:54 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'scrapyMafengwo', 'NEWSPIDER_MODULE': 'scrapyMafengwo.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['scrapyMafengwo.spiders']}
2018-03-19 17:03:54 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2018-03-19 17:03:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-03-19 17:03:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-03-19 17:03:54 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-03-19 17:03:54 [scrapy.core.engine] INFO: Spider opened
2018-03-19 17:03:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-19 17:03:54 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
start mafengwo clawer
2018-03-19 17:03:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://passport.mafengwo.cn/> (referer: https://passport.mafengwo.cn/)
parseLoginPage: url = https://passport.mafengwo.cn/
2018-03-19 17:03:57 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.mafengwo.cn> from <POST https://passport.mafengwo.cn/login/>
2018-03-19 17:03:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.mafengwo.cn> (referer: None)
loginResParse: url = http://www.mafengwo.cn
2018-03-19 17:03:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.mafengwo.cn/plan/route.php> (referer: https://passport.mafengwo.cn/)
isLoginStatusParse: url = http://www.mafengwo.cn/plan/route.php
2018-03-19 17:04:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.mafengwo.cn/travel-scenic-spot/mafengwo/10045.html> (referer: https://passport.mafengwo.cn/)
parse: url = https://www.mafengwo.cn/travel-scenic-spot/mafengwo/10045.html, meta = {'depth': 3, 'download_timeout': 25.0, 'download_slot': 'www.mafengwo.cn', 'download_latency': 0.2569999694824219}
subject = clawerName had finished, reason = finished, finishedTime = 2018-03-19 17:04:01.638400
2018-03-19 17:04:01 [scrapy.core.engine] INFO: Closing spider (finished)
2018-03-19 17:04:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3251,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 4,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 38259,
'downloader/response_count': 5,
'downloader/response_status_count/200': 4,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 3, 19, 9, 4, 1, 638400),
'log_count/DEBUG': 6,
'log_count/INFO': 7,
'request_depth_max': 3,
'response_received_count': 4,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2018, 3, 19, 9, 3, 54, 707400)}
2018-03-19 17:04:01 [scrapy.core.engine] INFO: Spider closed (finished)
Process finished with exit code 0
By contrast, here is a run where the login fails:
2018-03-19 17:05:06 [scrapy.core.engine] INFO: Spider opened
2018-03-19 17:05:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-19 17:05:06 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
start mafengwo clawer
2018-03-19 17:05:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://passport.mafengwo.cn/> (referer: https://passport.mafengwo.cn/)
parseLoginPage: url = https://passport.mafengwo.cn/
2018-03-19 17:05:08 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://passport.mafengwo.cn/> from <POST https://passport.mafengwo.cn/login/>
2018-03-19 17:05:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://passport.mafengwo.cn/> (referer: https://passport.mafengwo.cn/)
loginResParse: url = https://passport.mafengwo.cn/
2018-03-19 17:05:10 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://www.mafengwo.cn/plan/route.php> (referer: https://passport.mafengwo.cn/)
2018-03-19 17:05:10 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <302 http://www.mafengwo.cn/plan/route.php>: HTTP status code is not handled or not allowed
2018-03-19 17:05:10 [scrapy.core.engine] INFO: Closing spider (finished)
2018-03-19 17:05:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2234,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 5044,
'downloader/response_count': 4,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/302': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 3, 19, 9, 5, 10, 368900),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/302': 1,
'log_count/DEBUG': 5,
'log_count/INFO': 8,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2018, 3, 19, 9, 5, 6, 871900)}
2018-03-19 17:05:10 [scrapy.core.engine] INFO: Spider closed (finished)
subject = clawerName had finished, reason = finished, finishedTime = 2018-03-19 17:05:10.368900
Process finished with exit code 0
- Comparing the two runs, you can see that at the login-verification step, if the user is not logged in and the page is not allowed to redirect (302) to the login page, the crawler terminates right there and crawls no further:
loginResParse: url = https://passport.mafengwo.cn/
2018-03-19 17:05:10 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://www.mafengwo.cn/plan/route.php> (referer: https://passport.mafengwo.cn/)
2018-03-19 17:05:10 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <302 http://www.mafengwo.cn/plan/route.php>: HTTP status code is not handled or not allowed
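The check above — probing a login-required page with 'dont_redirect': True and treating a 302 toward the passport host as "not logged in" — can be factored into a small helper. This is only a sketch; isLoggedIn is a hypothetical function, not part of the spider above, and for the 302 response to reach your callback at all you would also need 'handle_httpstatus_list': [302] in the request meta:

```python
def isLoggedIn(status, location=None):
    """Decide the login state from the probe response.

    A 200 on the login-required page means we are logged in; a 302 whose
    Location header points at the passport (login) host means we are not.
    Anything else is treated as not logged in, to be safe.
    """
    if status == 200:
        return True
    if status == 302 and location and "passport.mafengwo.cn" in location:
        return False
    return False
```

In loginResParse you could then call something like isLoggedIn(response.status, response.headers.get('Location', b'').decode()) and either continue crawling or restart the login flow.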
4. Notes
'ROBOTSTXT_OBEY': False,
'DOWNLOAD_DELAY': 2,
'COOKIES_ENABLED': True,

# Required; without these headers the server rejects the request
headerData = {
    "Referer": "https://passport.mafengwo.cn/",
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
}
5. Storing and reusing cookies locally
- After verifying that the login succeeded, you can choose to save the cookies. On the next run you can then log in with the saved cookies directly (although this approach is not recommended).
5.1. Saving the cookies to a local file
def convertToCookieFormat(cookieLstInfo, cookieFileName):
    '''
    Convert the raw Cookie request header into a dict and save it to a file,
    one "name:value" pair per line. Example input:
    CookieReq = [b'PHPSESSID=427jcfptrsogeg7onenojvqmp0; mfw_uuid=5ab0adb9-177d-a7d3-a47a-9522417e0652; oad_n=a%3A3%3A%7Bs%3A3%3A%22oid%22%3Bi%3A1029%3Bs%3A2%3A%22dm%22%3Bs%3A20%3A%22passport.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222018-03-20+14%3A44%3A09%22%3B%7D; __today_login=1; mafengwo=d336513fb8fc6edd490db9725739bb85_94281374_5ab0adbac4ba51.24002232_5ab0adbac4ba92.98161419; uol_throttle=94281374; mfw_uid=94281374']
    :param cookieLstInfo: list of raw Cookie header values (bytes)
    :param cookieFileName: path of the file to save the cookies to
    :return: dict mapping cookie names to values
    '''
    cookieDict = {}
    if len(cookieLstInfo) > 0:
        cookieStr = str(cookieLstInfo[0], encoding="utf8")
        print(f"cookieStr = {cookieStr}")
        for cookieItemStr in cookieStr.split(";"):
            # split on the first '=' only: cookie values can contain '=' themselves
            cookieItem = cookieItemStr.strip().split("=", 1)
            print(f"cookieItemStr = {cookieItemStr}, cookieItem = {cookieItem}")
            cookieDict[cookieItem[0].strip()] = cookieItem[1].strip()
        print(f"cookieDict = {cookieDict}")
        with open(cookieFileName, 'w') as f:
            for cookieKey, cookieValue in cookieDict.items():
                f.write(str(cookieKey) + ':' + str(cookieValue) + '\n')
    return cookieDict
def isLoginStatusParse(self, response):
    print(f"isLoginStatusParse: url = {response.url}")
    # cookies sent with the request
    CookieReq = response.request.headers.getlist('Cookie')
    print(f"CookieReq = {CookieReq}")
    cookieFileName = "mafengwoCookies.txt"
    cookieDict = convertToCookieFormat(CookieReq, cookieFileName)
    # cookies set by the response
    Cookie = response.headers.getlist('Set-Cookie')
    print(f"Set-Cookie = {Cookie}")
    # Reaching this point without errors means the pages below can be
    # crawled in the logged-in state
    # ………………………………
    # (crawl other pages here; no further cookie handling is needed)
    # ………………………………
    yield scrapy.Request(
        url = "https://www.mafengwo.cn/travel-scenic-spot/mafengwo/10045.html",
        headers = self.headerData,
        # if no callback is specified, parse() is used by default
    )
PHPSESSID:vperarhkjekdsv5mut4vjk9ri0
mfw_uuid:5ab0bcc6-0279-cbef-673e-15fd2c0b73c5
oad_n:a%3A3%3A%7Bs%3A3%3A%22oid%22%3Bi%3A1029%3Bs%3A2%3A%22dm%22%3Bs%3A20%3A%22passport.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222018-03-20+15%3A48%3A22%22%3B%7D
__today_login:1
mafengwo:926d677d880bf9c3981934bb3d710b8c_94281374_5ab0bcc8e795c0.78689785_5ab0bcc8e79637.22817262
uol_throttle:94281374
mfw_uid:94281374
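Splitting on ';' and '=' by hand works for this site, but the standard library's http.cookies.SimpleCookie handles the parsing for you. A sketch of an alternative parser producing the same name-to-value dict as convertToCookieFormat (the file writing is omitted here); parseCookieHeader is a hypothetical name:

```python
from http.cookies import SimpleCookie

def parseCookieHeader(rawCookie):
    """Parse a raw Cookie header (bytes) into a name -> value dict."""
    cookie = SimpleCookie()
    cookie.load(rawCookie.decode("utf8"))
    return {name: morsel.value for name, morsel in cookie.items()}
```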
5.2. Reading and using the stored cookies
- Alternatively, you can log in with a browser, copy the cookies from it, and use those as the login credential.
def getCookieFromFile(cookieFileName):
    '''
    Read the cookies saved by convertToCookieFormat back into a dict.
    Example file contents:
    PHPSESSID:nkv0d5g29bde1ni5p9bha8cq04
    mfw_uuid:5ab0b3a3-22ac-61f1-ba72-db5a070c7e5d
    oad_n:a%3A3%3A%7Bs%3A3%3A%22oid%22%3Bi%3A1029%3Bs%3A2%3A%22dm%22%3Bs%3A20%3A%22passport.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222018-03-20+15%3A09%3A23%22%3B%7D
    __today_login:1
    mafengwo:7e7cd3cffefcc05d3cbb217172a2d9fa_94281374_5ab0b3a5ac8007.33269268_5ab0b3a5ac8053.87485829
    uol_throttle:94281374
    mfw_uid:94281374
    :param cookieFileName: path of the cookie file
    :return: dict mapping cookie names to values
    '''
    cookieDict = {}
    with open(cookieFileName, "r") as f:
        for line in f:
            print(f"line = {line}")
            if line.strip():
                # split on the first ':' only, in case a value contains ':'
                cookieItem = line.split(":", 1)
                cookieDict[cookieItem[0].strip()] = cookieItem[1].strip()
    return cookieDict
def start_requests(self):
    print("start mafengwo clawer")
    cookieFileName = "mafengwoCookies.txt"
    cookieDict = getCookieFromFile(cookieFileName)
    routeUrl = "http://www.mafengwo.cn/plan/route.php"
    yield scrapy.Request(
        url = routeUrl,
        headers = self.headerData,
        cookies = cookieDict,
        meta = {
        },
        callback = self.isLoginStatusParse,
        dont_filter = True,
    )
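Since the cookies are already held as a dict, JSON is a natural alternative to the colon-separated text format: it round-trips without any parsing code of your own. A minimal sketch (the function names and file name are just examples):

```python
import json

def saveCookieJson(cookieDict, fileName):
    """Persist the cookie dict to a JSON file."""
    with open(fileName, "w") as f:
        json.dump(cookieDict, f)

def loadCookieJson(fileName):
    """Load the cookie dict back; the result can be passed to scrapy.Request(cookies=...)."""
    with open(fileName) as f:
        return json.load(f)
```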
- Two points worth noting:
    - First, when the saved cookies are still valid, this approach is indeed convenient.
    - Second, once the cookies expire, the stale cookies keep circulating through all subsequent requests: the spider can reach neither the route page nor the login page it is redirected (302) to, so it terminates abnormally. This is also why logging in with saved cookies is not recommended. For example:
line =
2018-03-20 15:58:09 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET http://www.mafengwo.cn/plan/route.php>
Cookie:
2018-03-20 15:58:09 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <302 http://www.mafengwo.cn/plan/route.php>
Set-Cookie: PHPSESSID=25kotnplj2fl5ftd0m6gari4b6; path=/; domain=.mafengwo.cn; HttpOnly
Set-Cookie: mfw_uuid=5ab0bfef-bfc3-a0d8-da65-a49fe77e191a; expires=Wed, 20-Mar-2019 08:01:51 GMT; Max-Age=31536000; path=/; domain=.mafengwo.cn
Set-Cookie: oad_n=a%3A3%3A%7Bs%3A3%3A%22oid%22%3Bi%3A1029%3Bs%3A2%3A%22dm%22%3Bs%3A15%3A%22www.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222018-03-20+16%3A01%3A51%22%3B%7D; expires=Tue, 27-Mar-2018 08:01:51 GMT; Max-Age=604800; path=/; domain=.mafengwo.cn
2018-03-20 15:58:09 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://passport.mafengwo.cn?return_url=http%3A%2F%2Fwww.mafengwo.cn%2Fplan%2Froute.php> from <GET http://www.mafengwo.cn/plan/route.php>
2018-03-20 15:58:09 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://passport.mafengwo.cn?return_url=http%3A%2F%2Fwww.mafengwo.cn%2Fplan%2Froute.php>
Cookie:
2018-03-20 15:58:12 [scrapy.core.engine] DEBUG: Crawled (400) <GET https://passport.mafengwo.cn?return_url=http%3A%2F%2Fwww.mafengwo.cn%2Fplan%2Froute.php> (referer: https://passport.mafengwo.cn/)
2018-03-20 15:58:12 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://passport.mafengwo.cn?return_url=http%3A%2F%2Fwww.mafengwo.cn%2Fplan%2Froute.php>: HTTP status code is not handled or not allowed
2018-03-20 15:58:12 [scrapy.core.engine] INFO: Closing spider (finished)
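One way to soften this failure mode is to fall back to the full username/password login whenever the cookie file is missing or older than some cutoff, instead of trusting the stored cookies unconditionally. A sketch with a hypothetical chooseLoginMode helper; the 6-hour cutoff is an assumption, not something the site documents:

```python
import os
import time

# assumed cutoff: treat cookies older than 6 hours as stale
COOKIE_MAX_AGE = 6 * 3600

def chooseLoginMode(cookieFileName, maxAge=COOKIE_MAX_AGE, now=None):
    """Return "cookie" if a fresh cookie file exists, otherwise "password"."""
    if not os.path.exists(cookieFileName):
        return "password"
    # age of the cookie file in seconds, based on its last-modified time
    age = (now if now is not None else time.time()) - os.path.getmtime(cookieFileName)
    return "cookie" if age <= maxAge else "password"
```

In start_requests you could branch on the returned mode: yield the cookie-based request to route.php, or start the login-page/FormRequest flow. Since the server can invalidate cookies earlier than any local cutoff, the route-page probe is still needed to confirm the login state.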