Original article: https://doc.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-exports
Feed exports
New in version 0.10.
One of the most frequently required features when implementing scrapers is being able to store the scraped data properly, and quite often that means generating an "export file" with the scraped data (commonly called an "export feed") to be consumed by other systems.
Scrapy provides this functionality out of the box with the Feed Exports, which let you generate a feed with the scraped items, using multiple serialization formats and storage backends.
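A minimal sketch of enabling the feed exports from a project's settings.py, using the FEED_URI and FEED_FORMAT settings described below (the path and format here are illustrative choices, not defaults):

```python
# settings.py -- minimal feed export configuration (illustrative values)
FEED_URI = 'file:///tmp/export.json'  # where to store the feed (see Storage backends)
FEED_FORMAT = 'json'                  # which serialization format to use
```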
Serialization formats
Feed exports use the Item exporters to serialize the scraped data. These formats are supported out of the box, but you can also extend the supported formats through the FEED_EXPORTERS setting.
JSON
- FEED_FORMAT: json
- Exporter used: JsonItemExporter
- See this warning if you’re using JSON with large feeds.
JSON lines
- FEED_FORMAT: jsonlines
- Exporter used: JsonLinesItemExporter
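Unlike plain JSON, a JSON lines feed holds one standalone JSON object per line, with no enclosing array, which is why it can be written and read incrementally and streams well for large feeds. A stdlib sketch of the format itself (the items are invented for illustration):

```python
import json

# A JSON lines feed: one JSON object per line, no enclosing array.
items = [{"name": "Color TV", "price": "1200"},
         {"name": "DVD player", "price": "200"}]

feed = "\n".join(json.dumps(item) for item in items)
print(feed)

# Each line parses independently, so a reader never needs the whole file.
parsed = [json.loads(line) for line in feed.splitlines()]
```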
CSV
- FEED_FORMAT: csv
- Exporter used: CsvItemExporter
- To specify the columns to export and their order, use FEED_EXPORT_FIELDS. Other feed exporters can also use this option, but it is important for CSV because unlike many other export formats CSV uses a fixed header.
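A minimal sketch of pinning the CSV columns and their order in settings.py (the field names are placeholders taken from the FEED_EXPORT_FIELDS example below):

```python
# settings.py -- fix the CSV header and column order (field names are examples)
FEED_FORMAT = 'csv'
FEED_EXPORT_FIELDS = ['foo', 'bar', 'baz']
```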
XML
- FEED_FORMAT: xml
- Exporter used: XmlItemExporter
Pickle
- FEED_FORMAT: pickle
- Exporter used: PickleItemExporter
Marshal
- FEED_FORMAT: marshal
- Exporter used: MarshalItemExporter
Storages
When using the feed exports you define where to store the feed using a URI (through the FEED_URI setting). The feed exports support multiple storage backend types, which are defined by the URI scheme.
The storage backends supported out of the box are:
- Local filesystem
- FTP
- S3 (requires botocore or boto)
- Standard output
Some storage backends may be unavailable if the required external libraries are not installed. For example, the S3 backend is only available if the botocore or boto library is installed (Scrapy supports boto only on Python 2).
Storage URI parameters
The storage URI can also contain parameters that get replaced when the feed is being created. These parameters are:
- %(time)s - gets replaced by a timestamp when the feed is being created
- %(name)s - gets replaced by the spider name
Any other named parameter gets replaced by the spider attribute of the same name. For example, %(site_id)s would get replaced by the spider.site_id attribute the moment the feed is being created.
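The substitution works like Python's %-style named formatting against a dict of parameters. A sketch of what happens when the feed is created (the spider name and timestamp format are invented for illustration):

```python
import time

uri_template = "ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json"

# Values Scrapy would take from the running spider; these are made up here.
params = {
    "name": "example_spider",                    # the spider's name
    "time": time.strftime("%Y-%m-%dT%H-%M-%S"),  # a feed-creation timestamp
}

uri = uri_template % params
print(uri)
```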
Some examples to illustrate:
- Store in FTP using one directory per spider:
ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json
- Store in S3 using one directory per spider:
s3://mybucket/scraping/feeds/%(name)s/%(time)s.json
Storage backends
Local filesystem
The feeds are stored in the local filesystem.
- URI scheme: file
- Example URI: file:///tmp/export.csv
- Required external libraries: none
Note that for the local filesystem storage (only) you can omit the scheme if you specify an absolute path like /tmp/export.csv. This only works on Unix systems, though.
FTP
The feeds are stored in an FTP server.
- URI scheme: ftp
- Example URI: ftp://user:pass@ftp.example.com/path/to/export.csv
- Required external libraries: none
S3
The feeds are stored on Amazon S3.
The AWS credentials can be passed as user/password in the URI, or they can be passed through the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY settings.
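A sketch of the settings-based option, using Scrapy's standard AWS settings names (the key values and bucket path are placeholders):

```python
# settings.py -- S3 feed storage via settings-based credentials (placeholder values)
AWS_ACCESS_KEY_ID = 'your-access-key'
AWS_SECRET_ACCESS_KEY = 'your-secret-key'

FEED_URI = 's3://mybucket/scraping/feeds/%(name)s/%(time)s.json'
```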
Standard output
The feeds are written to the standard output of the Scrapy process.
- URI scheme: stdout
- Example URI: stdout:
- Required external libraries: none
Settings
These are the settings used for configuring the feed exports:
FEED_URI
Default: None
The URI of the export feed. See Storage backends for supported URI schemes.
This setting is required for enabling the feed exports.
FEED_FORMAT
The serialization format to be used for the feed. See Serialization formats for possible values.
FEED_EXPORT_ENCODING
Default: None
The encoding to be used for the feed.
If unset or set to None (default), UTF-8 is used for everything except JSON output, which uses safe numeric encoding (\uXXXX sequences) for historic reasons.
Use utf-8 if you want UTF-8 for JSON too.
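The JSON behaviour mirrors the escaping choice in Python's own json module, which escapes non-ASCII characters by default. A stdlib sketch of the difference (the item is invented):

```python
import json

item = {"title": "café"}

# Default: safe numeric \uXXXX escaping, like Scrapy's historic JSON output.
escaped = json.dumps(item)
# With literal UTF-8 instead, the effect of FEED_EXPORT_ENCODING = 'utf-8'.
utf8 = json.dumps(item, ensure_ascii=False)

print(escaped)  # {"title": "caf\u00e9"}
print(utf8)     # {"title": "café"}
```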
FEED_EXPORT_FIELDS
Default: None
A list of fields to export, optional. Example: FEED_EXPORT_FIELDS = ["foo", "bar", "baz"].
Use the FEED_EXPORT_FIELDS option to define the fields to export and their order.
When FEED_EXPORT_FIELDS is empty or None (default), Scrapy uses the fields defined in the dicts or Item subclasses yielded by your spider.
If an exporter requires a fixed set of fields (this is the case for the CSV export format) and FEED_EXPORT_FIELDS is empty or None, then Scrapy tries to infer the field names from the exported data - currently it uses the field names from the first item.
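Taking the field names from the first item means fields that only appear in later items are silently dropped. A stdlib sketch of that pitfall, not Scrapy's actual exporter code (the items are invented):

```python
import csv
import io

items = [{"foo": 1, "bar": 2},
         {"foo": 3, "bar": 4, "baz": 5}]   # "baz" appears only in a later item

fields = list(items[0])                    # header inferred from the first item only
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore")
writer.writeheader()
writer.writerows(items)                    # "baz" never makes it into the feed

print(buf.getvalue())
```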
FEED_EXPORT_INDENT
Default: 0
Amount of spaces used to indent the output on each level. If FEED_EXPORT_INDENT is a non-negative integer, then array elements and object members will be pretty-printed with that indent level. An indent level of 0 (the default), or negative, will put each item on a new line. None selects the most compact representation.
Currently implemented only by JsonItemExporter and XmlItemExporter, i.e. when you are exporting to .json or .xml.
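The indent semantics are close to those of json.dumps, which is a rough stdlib analogue of what the JSON exporter produces (the data is invented; this is not the exporter itself):

```python
import json

data = [{"name": "Color TV", "price": "1200"}]

compact = json.dumps(data, indent=None)  # None: most compact, single line
newline = json.dumps(data, indent=0)     # 0: elements on new lines, no indentation
pretty = json.dumps(data, indent=4)      # 4: pretty-printed, 4 spaces per level

print(compact)
print(pretty)
```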
FEED_STORE_EMPTY
Default: False
Whether to export empty feeds (i.e. feeds with no items).
FEED_STORAGES
Default: {}
A dict containing additional feed storage backends supported by your project. The keys are URI schemes and the values are paths to storage classes.
FEED_STORAGES_BASE
Default:
{
    '': 'scrapy.extensions.feedexport.FileFeedStorage',
    'file': 'scrapy.extensions.feedexport.FileFeedStorage',
    'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
    's3': 'scrapy.extensions.feedexport.S3FeedStorage',
    'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
}
A dict containing the built-in feed storage backends supported by Scrapy. You can disable any of these backends by assigning None to their URI scheme in FEED_STORAGES. E.g., to disable the built-in FTP storage backend (without replacement), place this in your settings.py:
FEED_STORAGES = {
    'ftp': None,
}
FEED_EXPORTERS
Default: {}
A dict containing additional exporters supported by your project. The keys are serialization formats and the values are paths to Item exporter classes.
FEED_EXPORTERS_BASE
Default:
{
    'json': 'scrapy.exporters.JsonItemExporter',
    'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
    'jl': 'scrapy.exporters.JsonLinesItemExporter',
    'csv': 'scrapy.exporters.CsvItemExporter',
    'xml': 'scrapy.exporters.XmlItemExporter',
    'marshal': 'scrapy.exporters.MarshalItemExporter',
    'pickle': 'scrapy.exporters.PickleItemExporter',
}
A dict containing the built-in feed exporters supported by Scrapy. You can disable any of these exporters by assigning None to their serialization format in FEED_EXPORTERS. E.g., to disable the built-in CSV exporter (without replacement), place this in your settings.py:
FEED_EXPORTERS = {
    'csv': None,
}