Simulating user login and storing cookies with Scrapy on Python 3 — the basics (Mafengwo)
1. Background
2. Environment
- OS: Windows 7
- Python 3.6.1
- Scrapy 1.4.0
3. The standard simulated-login steps
- Step 1: request the login page and collect any parameters the login form requires (for example, the `_xsrf` token on Zhihu's login page).
- Step 2: POST those parameters, together with the account name and password, to the server to log in.
- Step 3: check whether the login succeeded.
- Step 4: if the login failed, diagnose the error and restart the login routine.
- Step 5: if the login succeeded, crawl the site's pages as usual.
import scrapy
import datetime


class mafengwoSpider(scrapy.Spider):
    custom_settings = {
        'LOG_LEVEL': 'DEBUG',
        'ROBOTSTXT_OBEY': False,
        'DOWNLOAD_DELAY': 2,
        'COOKIES_ENABLED': True,
        'COOKIES_DEBUG': True,
        'DOWNLOAD_TIMEOUT': 25,
    }
    name = 'mafengwo'
    allowed_domains = ['mafengwo.cn']
    host = "http://www.mafengwo.cn/"
    username = "13725168940"
    password = "aaa00000000"
    headerData = {
        "Referer": "https://passport.mafengwo.cn/",
        'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    }

    # Step 1: request the login page to pick up any session cookies it sets
    def start_requests(self):
        print("start mafengwo clawer")
        mafengwoLoginPage = "https://passport.mafengwo.cn/"
        loginIndexReq = scrapy.Request(
            url = mafengwoLoginPage,
            headers = self.headerData,
            callback = self.parseLoginPage,
            dont_filter = True,
        )
        yield loginIndexReq

    # Step 2: POST the account name and password to the login endpoint
    def parseLoginPage(self, response):
        print(f"parseLoginPage: url = {response.url}")
        loginPostUrl = "https://passport.mafengwo.cn/login/"
        yield scrapy.FormRequest(
            url = loginPostUrl,
            headers = self.headerData,
            method = "POST",
            formdata = {
                "passport": self.username,
                "password": self.password,
            },
            callback = self.loginResParse,
            dont_filter = True,
        )

    # Step 3: probe a page that requires login; 'dont_redirect' makes a
    # not-logged-in 302 fail here instead of silently bouncing to the login page
    def loginResParse(self, response):
        print(f"loginResParse: url = {response.url}")
        routeUrl = "http://www.mafengwo.cn/plan/route.php"
        yield scrapy.Request(
            url = routeUrl,
            headers = self.headerData,
            meta = {
                'dont_redirect': True,
            },
            callback = self.isLoginStatusParse,
            dont_filter = True,
        )

    # Step 5: we are logged in; crawl normally from here on
    def isLoginStatusParse(self, response):
        print(f"isLoginStatusParse: url = {response.url}")
        yield scrapy.Request(
            url = "https://www.mafengwo.cn/travel-scenic-spot/mafengwo/10045.html",
            headers = self.headerData,
        )

    def parse(self, response):
        print(f"parse: url = {response.url}, meta = {response.meta}")

    def errorHandle(self, failure):
        print(f"request error: {failure.value.response}")

    def closed(self, reason):
        finishTime = datetime.datetime.now()
        subject = f"clawerName had finished, reason = {reason}, finishedTime = {finishTime}"
        print(f"subject = {subject}")
A run with a successful login produces output like this:
E:\Miniconda\python.exe E:/documentCode/scrapyMafengwo/start.py
2018-03-19 17:03:54 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapyMafengwo)
2018-03-19 17:03:54 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'scrapyMafengwo', 'NEWSPIDER_MODULE': 'scrapyMafengwo.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['scrapyMafengwo.spiders']}
2018-03-19 17:03:54 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2018-03-19 17:03:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-03-19 17:03:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-03-19 17:03:54 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-03-19 17:03:54 [scrapy.core.engine] INFO: Spider opened
2018-03-19 17:03:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-19 17:03:54 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
start mafengwo clawer
2018-03-19 17:03:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://passport.mafengwo.cn/> (referer: https://passport.mafengwo.cn/)
parseLoginPage: url = https://passport.mafengwo.cn/
2018-03-19 17:03:57 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.mafengwo.cn> from <POST https://passport.mafengwo.cn/login/>
2018-03-19 17:03:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.mafengwo.cn> (referer: None)
loginResParse: url = http://www.mafengwo.cn
2018-03-19 17:03:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.mafengwo.cn/plan/route.php> (referer: https://passport.mafengwo.cn/)
isLoginStatusParse: url = http://www.mafengwo.cn/plan/route.php
2018-03-19 17:04:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.mafengwo.cn/travel-scenic-spot/mafengwo/10045.html> (referer: https://passport.mafengwo.cn/)
parse: url = https://www.mafengwo.cn/travel-scenic-spot/mafengwo/10045.html, meta = {'depth': 3, 'download_timeout': 25.0, 'download_slot': 'www.mafengwo.cn', 'download_latency': 0.2569999694824219}
subject = clawerName had finished, reason = finished, finishedTime = 2018-03-19 17:04:01.638400
2018-03-19 17:04:01 [scrapy.core.engine] INFO: Closing spider (finished)
2018-03-19 17:04:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3251,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 4,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 38259,
'downloader/response_count': 5,
'downloader/response_status_count/200': 4,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 3, 19, 9, 4, 1, 638400),
'log_count/DEBUG': 6,
'log_count/INFO': 7,
'request_depth_max': 3,
'response_received_count': 4,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2018, 3, 19, 9, 3, 54, 707400)}
2018-03-19 17:04:01 [scrapy.core.engine] INFO: Spider closed (finished)
Process finished with exit code 0
By contrast, here is a run where the login fails:
2018-03-19 17:05:06 [scrapy.core.engine] INFO: Spider opened
2018-03-19 17:05:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-19 17:05:06 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
start mafengwo clawer
2018-03-19 17:05:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://passport.mafengwo.cn/> (referer: https://passport.mafengwo.cn/)
parseLoginPage: url = https://passport.mafengwo.cn/
2018-03-19 17:05:08 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://passport.mafengwo.cn/> from <POST https://passport.mafengwo.cn/login/>
2018-03-19 17:05:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://passport.mafengwo.cn/> (referer: https://passport.mafengwo.cn/)
loginResParse: url = https://passport.mafengwo.cn/
2018-03-19 17:05:10 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://www.mafengwo.cn/plan/route.php> (referer: https://passport.mafengwo.cn/)
2018-03-19 17:05:10 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <302 http://www.mafengwo.cn/plan/route.php>: HTTP status code is not handled or not allowed
2018-03-19 17:05:10 [scrapy.core.engine] INFO: Closing spider (finished)
2018-03-19 17:05:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2234,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 5044,
'downloader/response_count': 4,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/302': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 3, 19, 9, 5, 10, 368900),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/302': 1,
'log_count/DEBUG': 5,
'log_count/INFO': 8,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2018, 3, 19, 9, 5, 6, 871900)}
2018-03-19 17:05:10 [scrapy.core.engine] INFO: Spider closed (finished)
subject = clawerName had finished, reason = finished, finishedTime = 2018-03-19 17:05:10.368900
Process finished with exit code 0
- Comparing the two runs, you can see that at the login-verification step, if the user is not logged in and the page is not allowed to redirect (302) to the login page, the crawler terminates right there and crawls no further:
loginResParse: url = https://passport.mafengwo.cn/
2018-03-19 17:05:10 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://www.mafengwo.cn/plan/route.php> (referer: https://passport.mafengwo.cn/)
2018-03-19 17:05:10 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <302 http://www.mafengwo.cn/plan/route.php>: HTTP status code is not handled or not allowed
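The check above — probing a login-required page with 'dont_redirect': True and treating a 302 toward the passport host as "not logged in" — can be factored into a small helper. This is only a sketch; isLoggedIn is a hypothetical function, not part of the spider above, and for the 302 response to reach your callback at all you would also need 'handle_httpstatus_list': [302] in the request meta:

```python
def isLoggedIn(status, location=None):
    """Decide the login state from the probe response.

    A 200 on the login-required page means we are logged in; a 302 whose
    Location header points at the passport (login) host means we are not.
    Anything else is treated as not logged in, to be safe.
    """
    if status == 200:
        return True
    if status == 302 and location and "passport.mafengwo.cn" in location:
        return False
    return False
```

In loginResParse you could then call something like isLoggedIn(response.status, response.headers.get('Location', b'').decode()) and either continue crawling or restart the login flow.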
4. Notes
'ROBOTSTXT_OBEY': False,
'DOWNLOAD_DELAY': 2,
'COOKIES_ENABLED': True,

# Required; without these headers the server rejects the request
headerData = {
    "Referer": "https://passport.mafengwo.cn/",
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
}
5. Storing and reusing cookies locally
- After verifying that the login succeeded, you can choose to save the cookies. On the next run you can then log in with the saved cookies directly (although this approach is not recommended).
5.1. Saving the cookies to a local file
def convertToCookieFormat(cookieLstInfo, cookieFileName):
    '''
    Convert the raw Cookie request header into a dict and save it to a file,
    one "name:value" pair per line. Example input:
    CookieReq = [b'PHPSESSID=427jcfptrsogeg7onenojvqmp0; mfw_uuid=5ab0adb9-177d-a7d3-a47a-9522417e0652; oad_n=a%3A3%3A%7Bs%3A3%3A%22oid%22%3Bi%3A1029%3Bs%3A2%3A%22dm%22%3Bs%3A20%3A%22passport.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222018-03-20+14%3A44%3A09%22%3B%7D; __today_login=1; mafengwo=d336513fb8fc6edd490db9725739bb85_94281374_5ab0adbac4ba51.24002232_5ab0adbac4ba92.98161419; uol_throttle=94281374; mfw_uid=94281374']
    :param cookieLstInfo: list of raw Cookie header values (bytes)
    :param cookieFileName: path of the file to save the cookies to
    :return: dict mapping cookie names to values
    '''
    cookieDict = {}
    if len(cookieLstInfo) > 0:
        cookieStr = str(cookieLstInfo[0], encoding="utf8")
        print(f"cookieStr = {cookieStr}")
        for cookieItemStr in cookieStr.split(";"):
            # split on the first '=' only: cookie values can contain '=' themselves
            cookieItem = cookieItemStr.strip().split("=", 1)
            print(f"cookieItemStr = {cookieItemStr}, cookieItem = {cookieItem}")
            cookieDict[cookieItem[0].strip()] = cookieItem[1].strip()
        print(f"cookieDict = {cookieDict}")
        with open(cookieFileName, 'w') as f:
            for cookieKey, cookieValue in cookieDict.items():
                f.write(str(cookieKey) + ':' + str(cookieValue) + '\n')
    return cookieDict
def isLoginStatusParse(self, response):
    print(f"isLoginStatusParse: url = {response.url}")
    # cookies sent with the request
    CookieReq = response.request.headers.getlist('Cookie')
    print(f"CookieReq = {CookieReq}")
    cookieFileName = "mafengwoCookies.txt"
    cookieDict = convertToCookieFormat(CookieReq, cookieFileName)
    # cookies set by the response
    Cookie = response.headers.getlist('Set-Cookie')
    print(f"Set-Cookie = {Cookie}")
    # Reaching this point without errors means the pages below can be
    # crawled in the logged-in state
    # ………………………………
    # (crawl other pages here; no further cookie handling is needed)
    # ………………………………
    yield scrapy.Request(
        url = "https://www.mafengwo.cn/travel-scenic-spot/mafengwo/10045.html",
        headers = self.headerData,
        # if no callback is specified, parse() is used by default
    )
PHPSESSID:vperarhkjekdsv5mut4vjk9ri0
mfw_uuid:5ab0bcc6-0279-cbef-673e-15fd2c0b73c5
oad_n:a%3A3%3A%7Bs%3A3%3A%22oid%22%3Bi%3A1029%3Bs%3A2%3A%22dm%22%3Bs%3A20%3A%22passport.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222018-03-20+15%3A48%3A22%22%3B%7D
__today_login:1
mafengwo:926d677d880bf9c3981934bb3d710b8c_94281374_5ab0bcc8e795c0.78689785_5ab0bcc8e79637.22817262
uol_throttle:94281374
mfw_uid:94281374
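Splitting on ';' and '=' by hand works for this site, but the standard library's http.cookies.SimpleCookie handles the parsing for you. A sketch of an alternative parser producing the same name-to-value dict as convertToCookieFormat (the file writing is omitted here); parseCookieHeader is a hypothetical name:

```python
from http.cookies import SimpleCookie

def parseCookieHeader(rawCookie):
    """Parse a raw Cookie header (bytes) into a name -> value dict."""
    cookie = SimpleCookie()
    cookie.load(rawCookie.decode("utf8"))
    return {name: morsel.value for name, morsel in cookie.items()}
```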
5.2. Reading and using the stored cookies
- Alternatively, you can log in with a browser, copy the cookies from it, and use those as the login credential.
def getCookieFromFile(cookieFileName):
    '''
    Read the cookies saved by convertToCookieFormat back into a dict.
    Example file contents:
    PHPSESSID:nkv0d5g29bde1ni5p9bha8cq04
    mfw_uuid:5ab0b3a3-22ac-61f1-ba72-db5a070c7e5d
    oad_n:a%3A3%3A%7Bs%3A3%3A%22oid%22%3Bi%3A1029%3Bs%3A2%3A%22dm%22%3Bs%3A20%3A%22passport.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222018-03-20+15%3A09%3A23%22%3B%7D
    __today_login:1
    mafengwo:7e7cd3cffefcc05d3cbb217172a2d9fa_94281374_5ab0b3a5ac8007.33269268_5ab0b3a5ac8053.87485829
    uol_throttle:94281374
    mfw_uid:94281374
    :param cookieFileName: path of the cookie file
    :return: dict mapping cookie names to values
    '''
    cookieDict = {}
    with open(cookieFileName, "r") as f:
        for line in f:
            print(f"line = {line}")
            if line.strip():
                # split on the first ':' only, in case a value contains ':'
                cookieItem = line.split(":", 1)
                cookieDict[cookieItem[0].strip()] = cookieItem[1].strip()
    return cookieDict
def start_requests(self):
    print("start mafengwo clawer")
    cookieFileName = "mafengwoCookies.txt"
    cookieDict = getCookieFromFile(cookieFileName)
    routeUrl = "http://www.mafengwo.cn/plan/route.php"
    yield scrapy.Request(
        url = routeUrl,
        headers = self.headerData,
        cookies = cookieDict,
        meta = {
        },
        callback = self.isLoginStatusParse,
        dont_filter = True,
    )
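Since the cookies are already held as a dict, JSON is a natural alternative to the colon-separated text format: it round-trips without any parsing code of your own. A minimal sketch (the function names and file name are just examples):

```python
import json

def saveCookieJson(cookieDict, fileName):
    """Persist the cookie dict to a JSON file."""
    with open(fileName, "w") as f:
        json.dump(cookieDict, f)

def loadCookieJson(fileName):
    """Load the cookie dict back; the result can be passed to scrapy.Request(cookies=...)."""
    with open(fileName) as f:
        return json.load(f)
```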
- Two points worth noting:
    - First, when the saved cookies are still valid, this approach is indeed convenient.
    - Second, once the cookies expire, the stale cookies keep circulating through all subsequent requests: the spider can reach neither the route page nor the login page it is redirected (302) to, so it terminates abnormally. This is also why logging in with saved cookies is not recommended. For example:
line =
2018-03-20 15:58:09 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET http://www.mafengwo.cn/plan/route.php>
Cookie:
2018-03-20 15:58:09 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <302 http://www.mafengwo.cn/plan/route.php>
Set-Cookie: PHPSESSID=25kotnplj2fl5ftd0m6gari4b6; path=/; domain=.mafengwo.cn; HttpOnly
Set-Cookie: mfw_uuid=5ab0bfef-bfc3-a0d8-da65-a49fe77e191a; expires=Wed, 20-Mar-2019 08:01:51 GMT; Max-Age=31536000; path=/; domain=.mafengwo.cn
Set-Cookie: oad_n=a%3A3%3A%7Bs%3A3%3A%22oid%22%3Bi%3A1029%3Bs%3A2%3A%22dm%22%3Bs%3A15%3A%22www.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222018-03-20+16%3A01%3A51%22%3B%7D; expires=Tue, 27-Mar-2018 08:01:51 GMT; Max-Age=604800; path=/; domain=.mafengwo.cn
2018-03-20 15:58:09 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://passport.mafengwo.cn?return_url=http%3A%2F%2Fwww.mafengwo.cn%2Fplan%2Froute.php> from <GET http://www.mafengwo.cn/plan/route.php>
2018-03-20 15:58:09 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://passport.mafengwo.cn?return_url=http%3A%2F%2Fwww.mafengwo.cn%2Fplan%2Froute.php>
Cookie:
2018-03-20 15:58:12 [scrapy.core.engine] DEBUG: Crawled (400) <GET https://passport.mafengwo.cn?return_url=http%3A%2F%2Fwww.mafengwo.cn%2Fplan%2Froute.php> (referer: https://passport.mafengwo.cn/)
2018-03-20 15:58:12 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://passport.mafengwo.cn?return_url=http%3A%2F%2Fwww.mafengwo.cn%2Fplan%2Froute.php>: HTTP status code is not handled or not allowed
2018-03-20 15:58:12 [scrapy.core.engine] INFO: Closing spider (finished)
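One way to soften this failure mode is to fall back to the full username/password login whenever the cookie file is missing or older than some cutoff, instead of trusting the stored cookies unconditionally. A sketch with a hypothetical chooseLoginMode helper; the 6-hour cutoff is an assumption, not something the site documents:

```python
import os
import time

# assumed cutoff: treat cookies older than 6 hours as stale
COOKIE_MAX_AGE = 6 * 3600

def chooseLoginMode(cookieFileName, maxAge=COOKIE_MAX_AGE, now=None):
    """Return "cookie" if a fresh cookie file exists, otherwise "password"."""
    if not os.path.exists(cookieFileName):
        return "password"
    # age of the cookie file in seconds, based on its last-modified time
    age = (now if now is not None else time.time()) - os.path.getmtime(cookieFileName)
    return "cookie" if age <= maxAge else "password"
```

In start_requests you could branch on the returned mode: yield the cookie-based request to route.php, or start the login-page/FormRequest flow. Since the server can invalidate cookies earlier than any local cutoff, the route-page probe is still needed to confirm the login state.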