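Problem description:
While scraping QQ Mail with Scrapy, I wanted to download each message's attachments as well, so I used Scrapy's built-in file download pipeline, FilesPipeline.
When I ran the crawl, some attachments downloaded fine, but others came back with a 302 status, and those downloads failed with this warning: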
[scrapy] WARNING: File (code: 302): Error downloading file from
Solution:
def __init__(self, store_uri, download_func=None, settings=None):
    if not store_uri:
        raise NotConfigured

    if isinstance(settings, dict) or settings is None:
        settings = Settings(settings)

    cls_name = "FilesPipeline"
    self.store = self._get_store(store_uri)
    resolve = functools.partial(self._key_for_pipe,
                                base_class_name=cls_name,
                                settings=settings)
    self.expires = settings.getint(
        resolve('FILES_EXPIRES'), self.EXPIRES
    )
    if not hasattr(self, "FILES_URLS_FIELD"):
        self.FILES_URLS_FIELD = self.DEFAULT_FILES_URLS_FIELD
    if not hasattr(self, "FILES_RESULT_FIELD"):
        self.FILES_RESULT_FIELD = self.DEFAULT_FILES_RESULT_FIELD
    self.files_urls_field = settings.get(
        resolve('FILES_URLS_FIELD'), self.FILES_URLS_FIELD
    )
    self.files_result_field = settings.get(
        resolve('FILES_RESULT_FIELD'), self.FILES_RESULT_FIELD
    )

    super(FilesPipeline, self).__init__(download_func=download_func, settings=settings)
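This is the initializer of FilesPipeline. Note that its last line calls the parent class's __init__ to finish the setup.
FilesPipeline inherits from MediaPipeline, so let's look at the parent's initializer: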
def __init__(self, download_func=None, settings=None):
    self.download_func = download_func

    if isinstance(settings, dict) or settings is None:
        settings = Settings(settings)
    resolve = functools.partial(self._key_for_pipe,
                                base_class_name="MediaPipeline",
                                settings=settings)
    self.allow_redirects = settings.getbool(
        resolve('MEDIA_ALLOW_REDIRECTS'), False
    )
    self._handle_statuses(self.allow_redirects)
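From this we can see that if MEDIA_ALLOW_REDIRECTS is not set in the settings file, it defaults to False. In other words, when a download hits a redirect, the pipeline does not follow it, which is why the attachments served through a 302 failed.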
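To make the fallback concrete, here is a minimal sketch of the lookup using only the standard library. The getbool function below is a simplified stand-in I wrote for illustration, not Scrapy's actual implementation, and the plain dict is an assumed substitute for a Scrapy Settings object.

```python
def getbool(settings, name, default=False):
    """Simplified stand-in for a settings lookup with a boolean default.

    Assumption: settings is a plain dict, not a real Scrapy Settings object.
    """
    value = settings.get(name, default)
    if isinstance(value, str):
        # Accept a few common string spellings of "true".
        return value.strip().lower() in ("1", "true")
    return bool(value)


# Case 1: the project never sets MEDIA_ALLOW_REDIRECTS,
# so the lookup falls back to the default of False.
project_settings = {}
default_value = getbool(project_settings, "MEDIA_ALLOW_REDIRECTS", False)

# Case 2: the project sets it explicitly, so the default is ignored.
project_settings["MEDIA_ALLOW_REDIRECTS"] = True
enabled_value = getbool(project_settings, "MEDIA_ALLOW_REDIRECTS", False)

print(default_value, enabled_value)
```

The point is only that an absent key silently yields False, so redirect handling stays off unless the project opts in.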
So I set MEDIA_ALLOW_REDIRECTS = True in the settings file, and the problem was solved.
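For reference, the relevant part of settings.py might look roughly like the sketch below. The FILES_STORE path and the pipeline priority are hypothetical values chosen for illustration, and the ITEM_PIPELINES entry assumes the stock FilesPipeline is used directly rather than a subclass.

```python
# settings.py (fragment)

# Enable Scrapy's built-in file download pipeline.
# The priority value 1 is an arbitrary choice for this sketch.
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}

# Directory where downloaded attachments are stored (hypothetical path).
FILES_STORE = "attachments"

# Let the media pipelines follow HTTP redirects such as 302.
# Without this line the value defaults to False.
MEDIA_ALLOW_REDIRECTS = True
```

Items then need to carry the attachment URLs in the field the pipeline reads, which is file_urls unless FILES_URLS_FIELD is overridden.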