Advanced Scrapy techniques: FilesPipeline and ImagesPipeline (file downloads)

References:

https://blog.csdn.net/qq_43537354/article/details/88360636
https://doc.scrapy.org/en/1.3/topics/media-pipeline.html

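The FilesPipeline workflow is as follows:

1. In the spider, scrape the URLs of the files to download and put them into the item's file_urls field (note that this name is only a placeholder, like x in mathematics; the actual field name comes from the settings and can be customized).
2. The spider returns the item, which is passed on to the pipeline.
3. When FilesPipeline processes the item, it checks whether there is a file_urls field; if there is, it hands the URLs to the Scrapy scheduler and downloader.
4. When the downloads finish, the results are written into another field of the item, files. It contains each file's local path (relative to the path configured in FILES_STORE), its checksum, and its original URL.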
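
To make the last step more concrete, here is a minimal sketch of an item before and after the pipeline runs. It uses plain dicts instead of real Scrapy items and does not download anything: simulate_files_result is a hypothetical helper, the URL and file body are made-up placeholders, and the MD5 checksum is an assumption about how the checksum is computed.

```python
import hashlib
import os
from urllib.parse import urlparse


def simulate_files_result(url, body):
    """Imitate the shape of one entry in the `files` field (illustration only)."""
    ext = os.path.splitext(urlparse(url).path)[1]  # e.g. ".pdf"
    name = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return {
        "url": url,                                  # where the file came from
        "path": f"full/{name}{ext}",                 # relative to FILES_STORE
        "checksum": hashlib.md5(body).hexdigest(),   # assumed: MD5 of the content
    }


# What the spider yields: only the URLs to download.
item = {"file_urls": ["http://www.example.com/report.pdf"]}

# What the item could look like once the pipeline is done (placeholder body).
item["files"] = [
    simulate_files_result(url, b"fake file content") for url in item["file_urls"]
]
```

Later pipelines or exporters can then read item["files"] to find out where each file ended up on disk.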

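Both pipelines implement these features:

Avoid re-downloading media that was downloaded recently
Specify where the media is stored (a filesystem directory)

The Images Pipeline has a few extra functions for processing images:

Convert the downloaded images to JPEG format and RGB mode, and generate thumbnails;
Check the image width/height to make sure they meet a minimum size constraint (this has to be configured in settings);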

To configure the Images Pipeline in settings:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/path/to/valid/dir'

To configure the Files Pipeline:

ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = '/path/to/valid/dir'

Files and images are named using the SHA-1 hash of their URL, for example:

http://www.example.com/image.jpg

has the SHA-1 hash

3afec3b4765f8f0a07b78f98c07b83f013567a0a

The final file name is: 3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg

Downloaded images are stored at the following path:

<IMAGES_STORE>/full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg
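
The naming rule above can be sketched with the standard library alone. This is only an illustration of the scheme described here, not Scrapy's own code; stored_image_path is a hypothetical helper name.

```python
import hashlib


def stored_image_path(url, images_store):
    """Build <IMAGES_STORE>/full/<sha1-of-url>.jpg for a given image URL."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()  # 40 hex characters
    return f"{images_store}/full/{digest}.jpg"


path = stored_image_path("http://www.example.com/image.jpg", "/path/to/valid/dir")
print(path)
```

Because the name depends only on the URL, requesting the same URL again maps to the same file on disk.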

Example:

import scrapy
class MyItem(scrapy.Item):
    # ... other item fields ...
    image_urls = scrapy.Field()
    images = scrapy.Field()
    file_urls = scrapy.Field()
    files = scrapy.Field()

To customize the names of the field that holds the URLs and the field that holds the results, add the following to the settings file:

1. For the Files Pipeline

    FILES_URLS_FIELD = 'field_name_for_your_files_urls'
    FILES_RESULT_FIELD = 'field_name_for_your_processed_files'

2. For the Images Pipeline

    IMAGES_URLS_FIELD = 'field_name_for_your_images_urls'
    IMAGES_RESULT_FIELD = 'field_name_for_your_processed_images'
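
As a rough illustration of what these settings change, the sketch below looks up the URL field and the result field by their configured names. It uses plain dicts; the field names attachment_urls and attachments are hypothetical, and the lookup logic only imitates the idea rather than reproducing Scrapy's implementation.

```python
# Hypothetical settings and item, using plain dicts for illustration.
settings = {
    "FILES_URLS_FIELD": "attachment_urls",
    "FILES_RESULT_FIELD": "attachments",
}
item = {"title": "example", "attachment_urls": ["http://www.example.com/a.pdf"]}

# Fall back to the default names when the settings are absent.
urls_field = settings.get("FILES_URLS_FIELD", "file_urls")
result_field = settings.get("FILES_RESULT_FIELD", "files")

urls_to_download = item.get(urls_field, [])
item[result_field] = []  # the pipeline would fill this in after downloading
```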

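There are also some additional features:

File expiration:

    # 120 days of delay for files expiration
    # (setting a lifetime for files is very useful in a continuously running production environment)
    FILES_EXPIRES = 120
    # 30 days of delay for images expiration
    IMAGES_EXPIRES = 30
    # If you use a custom file pipeline class (for example one named MyPipeline),
    # prefix the setting with the uppercase class name:
    MYPIPELINE_FILES_EXPIRES = 180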
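
The idea behind the expiration settings can be sketched as a simple age check: a stored file older than the configured number of days is treated as stale and downloaded again. The helper name is_expired and the timestamps are assumptions made for this illustration.

```python
import time

SECONDS_PER_DAY = 60 * 60 * 24


def is_expired(last_modified, expires_days, now=None):
    """Return True if a file last modified at `last_modified` (epoch seconds)
    is older than `expires_days` days."""
    now = time.time() if now is None else now
    age_days = (now - last_modified) / SECONDS_PER_DAY
    return age_days > expires_days


now = 1_000_000_000  # fixed "current time" so the example is reproducible
old_file = now - 121 * SECONDS_PER_DAY
fresh_file = now - 10 * SECONDS_PER_DAY
print(is_expired(old_file, 120, now), is_expired(fresh_file, 120, now))
```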

Thumbnail settings:

IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}

With these settings, the resulting files look like this:

<IMAGES_STORE>/full/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
<IMAGES_STORE>/thumbs/small/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
<IMAGES_STORE>/thumbs/big/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
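
The relationship between the IMAGES_THUMBS keys and the thumbnail paths can be sketched as follows. This only rebuilds the path layout shown above; thumb_paths is a hypothetical helper, and no image is actually resized.

```python
IMAGES_THUMBS = {
    "small": (50, 50),
    "big": (270, 270),
}


def thumb_paths(images_store, image_hash, thumbs):
    """Map each thumbnail name to <IMAGES_STORE>/thumbs/<name>/<hash>.jpg."""
    return {
        name: f"{images_store}/thumbs/{name}/{image_hash}.jpg" for name in thumbs
    }


paths = thumb_paths(
    "<IMAGES_STORE>", "63bbfea82b8880ed33cdb762aa11fab722a90a24", IMAGES_THUMBS
)
for name, path in paths.items():
    print(name, path)
```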

Filtering out images that are too small:

IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
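
A minimal sketch of the size check, assuming an image is kept only when both dimensions reach the configured minimum. passes_size_filter is a hypothetical helper, not part of Scrapy.

```python
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110


def passes_size_filter(width, height,
                       min_width=IMAGES_MIN_WIDTH, min_height=IMAGES_MIN_HEIGHT):
    """Return True if the image is at least min_width x min_height pixels."""
    return width >= min_width and height >= min_height


for size in [(105, 105), (105, 200), (200, 105), (200, 200)]:
    print(size, passes_size_filter(*size))
```

Under this assumption, an image is dropped as soon as one of its dimensions is shorter than the constraint.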

A complete example of a custom image pipeline:

# Custom Images pipeline example:
# https://doc.scrapy.org/en/1.3/topics/media-pipeline.html#custom-images-pipeline-example

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem


class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Schedule one download request per URL in the item.
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # results is a list of (success, info) tuples; keep the paths of successful downloads.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        # The item must declare an image_paths field for this assignment to work.
        item['image_paths'] = image_paths
        return item
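
To actually use a custom pipeline like this, it has to be enabled in the settings in place of the built-in one. The fragment below is a sketch: the module path myproject.pipelines is hypothetical and depends on where the class lives in your project, and the priority 300 is just an example value.

```python
# settings.py (sketch; "myproject.pipelines" is a hypothetical module path)
ITEM_PIPELINES = {
    "myproject.pipelines.MyImagesPipeline": 300,
}
IMAGES_STORE = "/path/to/valid/dir"
```

The item class would also need an image_paths field next to image_urls, since item_completed writes to it.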
