[Crawler] Scrapy Feed Exports

[Original article] https://doc.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-exports

 

Feed exports

New in version 0.10.

One of the most frequently required features when implementing scrapers is being able to store the scraped data properly, and quite often that means generating an "export file" with the scraped data (commonly called an "export feed") to be consumed by other systems.

Scrapy provides this functionality out of the box with the Feed Exports, which allow you to generate a feed with the scraped items, using multiple serialization formats and storage backends.
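For example, enabling the feed exports can be as small as two settings in a project's settings.py (a minimal sketch; the output path and format here are illustrative):

FEED_URI = 'file:///tmp/export.json'
FEED_FORMAT = 'json'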

Serialization formats

The feed exports use the Item exporters to serialize the scraped data. The formats supported out of the box are described in the sections below.

But you can also extend the supported formats through the FEED_EXPORTERS setting.

JSON

  • FEED_FORMAT: json
  • Exporter used: JsonItemExporter

JSON lines

  • FEED_FORMAT: jsonlines
  • Exporter used: JsonLinesItemExporter

CSV

  • FEED_FORMAT: csv
  • Exporter used: CsvItemExporter
  • To specify the columns to export and their order, use FEED_EXPORT_FIELDS. Other feed exporters can also use this option, but it is important for CSV because unlike many other export formats CSV uses a fixed header.

XML

  • FEED_FORMAT: xml
  • Exporter used: XmlItemExporter

Pickle

  • FEED_FORMAT: pickle
  • Exporter used: PickleItemExporter

Marshal

  • FEED_FORMAT: marshal
  • Exporter used: MarshalItemExporter

Storages

When using the feed exports you define where to store the feed using a URI (through the FEED_URI setting). The feed exports support multiple storage backend types, which are defined by the URI scheme.

The storage backends supported out of the box are the local filesystem, FTP, S3 and standard output; each is described in the Storage backends section below.

Some storage backends may be unavailable if the required external libraries are not installed. For example, the S3 backend is only available if the botocore or boto library is installed (Scrapy supports boto only on Python 2).

Storage URI parameters

The storage URI can also contain parameters that get replaced when the feed is being created. These parameters are:

  • %(time)s - gets replaced by a timestamp when the feed is being created
  • %(name)s - gets replaced by the spider name

Any other named parameter gets replaced by the spider attribute of the same name. For example, %(site_id)s would get replaced by the spider.site_id attribute the moment the feed is being created (see the sketch after the examples below).

Here are some examples to illustrate:

  • Store in FTP using one directory per spider:
    • ftp://user:[email protected]/scraping/feeds/%(name)s/%(time)s.json
  • Store in S3 using one directory per spider:
    • s3://mybucket/scraping/feeds/%(name)s/%(time)s.json
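As a sketch of the attribute substitution described above, assume a hypothetical spider that defines a site_id attribute:

import scrapy

class SiteSpider(scrapy.Spider):
    name = 'sitespider'
    site_id = 42  # substituted into %(site_id)s when the feed is created

With FEED_URI = 'file:///tmp/feeds/%(site_id)s-%(time)s.json', a run of this spider would produce a file at a path like /tmp/feeds/42-2018-01-01T00-00-00.json.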

Storage backends

Local filesystem

The feeds are stored in the local filesystem.

  • URI scheme: file
  • Example URI: file:///tmp/export.csv
  • Required external libraries: none

Note that for the local filesystem storage (only) you can omit the scheme if you specify an absolute path like /tmp/export.csv. This only works on Unix systems though.
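For a quick test you can also pass the output URI on the command line with the -o option of scrapy crawl, which sets FEED_URI for that run (the spider name is illustrative):

scrapy crawl myspider -o /tmp/export.csv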

FTP

The feeds are stored in an FTP server.

  • URI scheme: ftp
  • Example URI: ftp://user:[email protected]/path/to/export.csv
  • Required external libraries: none

S3

The feeds are stored on Amazon S3.

  • URI scheme: s3
  • Example URIs:
    • s3://mybucket/path/to/export.csv
    • s3://aws_key:aws_secret@mybucket/path/to/export.csv
  • Required external libraries: botocore (Python 2 and Python 3) or boto (Python 2 only)

The AWS credentials can be passed as user/password in the URI, or they can be passed through the following settings:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
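For example, in settings.py (the bucket name and credentials are placeholders):

AWS_ACCESS_KEY_ID = 'my-access-key'        # placeholder
AWS_SECRET_ACCESS_KEY = 'my-secret-key'    # placeholder
FEED_URI = 's3://mybucket/path/to/export.csv'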

Standard output

The feeds are written to the standard output of the Scrapy process.

  • URI scheme: stdout
  • Example URI: stdout:
  • Required external libraries: none
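This can be useful for piping the feed into another process; for example, a sketch in settings.py (the format choice is illustrative):

FEED_URI = 'stdout:'
FEED_FORMAT = 'jsonlines'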

Settings

These are the settings used for configuring the feed exports:

FEED_URI

Default: None

The URI of the export feed. See Storage backends for supported URI schemes.

This setting is required for enabling the feed exports.

FEED_FORMAT

The serialization format to be used for the feed. See Serialization formats for possible values.

FEED_EXPORT_ENCODING

Default: None

The encoding to be used for the feed.

If unset or set to None (the default), UTF-8 is used for everything except JSON output, which uses safe numeric encoding (\uXXXX sequences) for historic reasons.

Use utf-8 if you want UTF-8 for JSON too.
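For example, in settings.py:

FEED_EXPORT_ENCODING = 'utf-8'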

FEED_EXPORT_FIELDS

Default: None

A list of fields to export, optional. Example: FEED_EXPORT_FIELDS = ["foo", "bar", "baz"].

Use the FEED_EXPORT_FIELDS option to define the fields to export and their order.

When FEED_EXPORT_FIELDS is empty or None (the default), Scrapy uses the fields defined in the dicts or Item subclasses a spider is yielding.

If an exporter requires a fixed set of fields (this is the case for the CSV export format) and FEED_EXPORT_FIELDS is empty or None, then Scrapy tries to infer the field names from the exported data - currently it uses the field names from the first item.
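As a sketch, with the CSV format the settings below always produce the header foo,bar,baz in exactly that order; an item missing one of these fields simply gets an empty cell in that column:

FEED_FORMAT = 'csv'
FEED_EXPORT_FIELDS = ['foo', 'bar', 'baz']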

FEED_EXPORT_INDENT

Default: 0

Amount of spaces used to indent the output on each level. If FEED_EXPORT_INDENT is a non-negative integer, then array elements and object members will be pretty-printed with that indent level. An indent level of 0 (the default), or negative, will put each item on a new line. None selects the most compact representation.

Currently this is implemented only by JsonItemExporter and XmlItemExporter, i.e. when you are exporting to .json or .xml.
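For example, with FEED_FORMAT = 'json' and FEED_EXPORT_INDENT = 4, a feed containing a single item would be written roughly as follows (the item content is illustrative):

[
    {
        "foo": "bar"
    }
]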

FEED_STORE_EMPTY

Default: False

Whether to export empty feeds (i.e. feeds with no items).

FEED_STORAGES

Default: {}

A dict containing additional feed storage backends supported by your project. The keys are URI schemes and the values are paths to storage classes.
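For example, to register a hypothetical SFTP backend under the sftp scheme (the class path is illustrative; Scrapy does not provide this class, you would have to implement it):

FEED_STORAGES = {
    'sftp': 'myproject.storages.SFTPFeedStorage',
}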

FEED_STORAGES_BASE

Default:

{
    '': 'scrapy.extensions.feedexport.FileFeedStorage',
    'file': 'scrapy.extensions.feedexport.FileFeedStorage',
    'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
    's3': 'scrapy.extensions.feedexport.S3FeedStorage',
    'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
}

A dict containing the built-in feed storage backends supported by Scrapy. You can disable any of these backends by assigning None to their URI scheme in FEED_STORAGES. E.g., to disable the built-in FTP storage backend (without replacement), place this in your settings.py:

FEED_STORAGES = {
    'ftp': None,
}

FEED_EXPORTERS

Default: {}

A dict containing additional exporters supported by your project. The keys are serialization formats and the values are paths to Item exporter classes.
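For example, to register a hypothetical YAML exporter under the yaml format (the class path is illustrative; you would have to implement the exporter yourself):

FEED_EXPORTERS = {
    'yaml': 'myproject.exporters.YamlItemExporter',
}

Setting FEED_FORMAT = 'yaml' would then select it.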

FEED_EXPORTERS_BASE

Default:

{
    'json': 'scrapy.exporters.JsonItemExporter',
    'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
    'jl': 'scrapy.exporters.JsonLinesItemExporter',
    'csv': 'scrapy.exporters.CsvItemExporter',
    'xml': 'scrapy.exporters.XmlItemExporter',
    'marshal': 'scrapy.exporters.MarshalItemExporter',
    'pickle': 'scrapy.exporters.PickleItemExporter',
}

A dict containing the built-in feed exporters supported by Scrapy. You can disable any of these exporters by assigning None to their serialization format in FEED_EXPORTERS. E.g., to disable the built-in CSV exporter (without replacement), place this in your settings.py:

FEED_EXPORTERS = {
    'csv': None,
}