Scrapy Serialization Writers: ItemExporter

Original

Vince Li

2019-08-09 09:49

Scrapy serialization writers

Scrapy supports multiple serialization formats and storage backends.

If you simply want to output data or save it to a file, you can use the ready-made classes that Scrapy provides.
Item Exporters

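To use an Item Exporter, you must instantiate it with its arguments (args). Each Item Exporter requires different arguments. After instantiating the exporter, you must:

- Call start_exporting() to mark the start of the export.
- Call export_item() for each item you want to export.
- Finally, call finish_exporting() to mark the end of the export.

For example: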

from scrapy.exporters import JsonItemExporter


class TestPipeline(object):

    def open_spider(self, spider):
        # Open the file in binary mode ('wb'); the exporter writes bytes
        self.fp = open("data.json", 'wb')
        # Instantiate the exporter
        self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')
        # Mark the start of the export
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)  # Write one item
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()  # Mark the end of the export
        self.fp.close()  # Close the file
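
For the pipeline above to receive items, it has to be enabled in the project's settings. The following is a minimal sketch, assuming the class lives in a hypothetical module `myproject/pipelines.py`; the priority value 300 is an arbitrary choice for illustration.

```python
# settings.py (sketch)
# "myproject.pipelines" is a hypothetical module path; adjust it to your project.
# The number sets the order in which pipelines run (lower values run first).
ITEM_PIPELINES = {
    "myproject.pipelines.TestPipeline": 300,
}
```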

Serialization formats

Import with: from scrapy.exporters import XXXX

- JSON (commonly used): `JsonItemExporter`
- JSON lines (commonly used): `JsonLinesItemExporter`
- CSV: `CsvItemExporter`
- XML: `XmlItemExporter`

About JsonItemExporter and JsonLinesItemExporter

The source code is as follows:

class JsonLinesItemExporter(BaseItemExporter):

    # Stores the output file and configures the attributes that BaseItemExporter needs.
    # ScrapyJSONEncoder provides the method that converts dict data to JSON.
    def __init__(self, file, **kwargs):
        self._configure(kwargs, dont_fail=True)
        self.file = file
        kwargs.setdefault('ensure_ascii', not self.encoding)
        self.encoder = ScrapyJSONEncoder(**kwargs)

    # Writes the item, converted to a dict, to the given file; each record is one dict.
    def export_item(self, item):
        itemdict = dict(self._get_serialized_fields(item))
        data = self.encoder.encode(itemdict) + '\n'
        self.file.write(to_bytes(data, self.encoding))


class JsonItemExporter(BaseItemExporter):
    # Exporter for the JSON format.
    # Unlike JsonLinesItemExporter, this class wraps all items in an array when writing,
    # so the whole JSON file is a single list.

    def __init__(self, file, **kwargs):
        self._configure(kwargs, dont_fail=True)
        self.file = file
        # there is a small difference between the behaviour or JsonItemExporter.indent
        # and ScrapyJSONEncoder.indent. ScrapyJSONEncoder.indent=None is needed to prevent
        # the addition of newlines everywhere
        json_indent = self.indent if self.indent is not None and self.indent > 0 else None
        kwargs.setdefault('indent', json_indent)
        kwargs.setdefault('ensure_ascii', not self.encoding)
        self.encoder = ScrapyJSONEncoder(**kwargs)
        self.first_item = True

    def _beautify_newline(self):
        if self.indent is not None:
            self.file.write(b'\n')

    def start_exporting(self):
        self.file.write(b"[")
        self._beautify_newline()

    def finish_exporting(self):
        self._beautify_newline()
        self.file.write(b"]")

    def export_item(self, item):
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(b',')  # Add a comma
            self._beautify_newline()  # Add a '\n' newline (only when indent is set)
        itemdict = dict(self._get_serialized_fields(item))
        data = self.encoder.encode(itemdict)
        self.file.write(to_bytes(data, self.encoding))
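
Both constructors in the source above default `ensure_ascii` to `not self.encoding`, so passing an encoding such as 'utf-8' switches ASCII escaping off. The effect can be sketched with the standard library's `json.JSONEncoder` alone, without Scrapy; the sample record below is made up for illustration.

```python
import json

record = {"title": "中"}  # made-up sample record

# ensure_ascii=True: non-ASCII characters are written as \uXXXX escapes
escaped = json.JSONEncoder(ensure_ascii=True).encode(record)

# ensure_ascii=False: non-ASCII characters are kept as-is,
# so the result needs a real encoding (e.g. UTF-8) when written as bytes
readable = json.JSONEncoder(ensure_ascii=False).encode(record)

print(escaped)
print(readable)
```

This is why the pipeline example passes `ensure_ascii=False` together with `encoding='utf-8'` when the output should stay human-readable.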

To summarize briefly:

(The output file must be opened in binary mode, because the exporters write bytes.)
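
The binary-mode requirement follows from the exporters encoding each record to bytes before calling `file.write()`. The sketch below imitates that step with in-memory buffers from the standard library; `write_record` is a hypothetical stand-in, not a Scrapy function.

```python
import io


def write_record(file, text, encoding="utf-8"):
    """Encode the text and write bytes, imitating what the exporters do."""
    file.write(text.encode(encoding))


# A binary buffer (like a file opened with 'wb') accepts bytes.
binary_file = io.BytesIO()
write_record(binary_file, '{"id": 1}')

# A text buffer (like a file opened with 'w') rejects bytes with a TypeError.
text_file = io.StringIO()
try:
    write_record(text_file, '{"id": 1}')
    text_mode_error = None
except TypeError as exc:
    text_mode_error = exc

print(binary_file.getvalue())
print(type(text_mode_error).__name__)
```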

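First, the names already hint at the difference: Lines means exactly that, lines.

JsonLinesItemExporter writes each item as one dictionary per line, with no commas between them. Each line is valid JSON on its own, but the file as a whole is not a single standard JSON document.

JsonItemExporter writes [ when the export starts and ] when it finishes, and separates items with a comma (followed by a newline when indent is set). The whole file is a single list of dictionaries.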
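
The two file shapes also have to be read back differently. The following sketch uses only the standard library to reproduce both layouts in memory and parse them; it imitates the output shape described above and does not call Scrapy itself. The sample items are made up.

```python
import io
import json

items = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]  # made-up sample items

# JSON Lines shape: one object per line, no commas, no surrounding brackets.
lines_buffer = io.BytesIO()
for item in items:
    lines_buffer.write((json.dumps(item) + "\n").encode("utf-8"))

# JSON array shape: "[" first, a comma before every item except the first, "]" last.
array_buffer = io.BytesIO()
array_buffer.write(b"[")
for index, item in enumerate(items):
    if index:
        array_buffer.write(b",")
    array_buffer.write(json.dumps(item).encode("utf-8"))
array_buffer.write(b"]")

# Reading back: JSON Lines is parsed line by line, the array in one call.
from_lines = [json.loads(line) for line in lines_buffer.getvalue().splitlines() if line]
from_array = json.loads(array_buffer.getvalue())

# Parsing the JSON Lines data as one document fails, as the summary notes.
try:
    json.loads(lines_buffer.getvalue())
    whole_file_error = None
except json.JSONDecodeError as exc:
    whole_file_error = exc
```

One practical consequence: a JSON Lines file can be processed one record at a time, while the array form has to be closed with `]` before it can be parsed as a whole.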
