Scrapy: Chinese Output and Storage

Original

SteveForever

2018-09-02 07:17

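1. Chinese Output

In Python 3.x, Chinese text can be printed and processed directly, because str is already Unicode.

In Python 2.x, encode the Chinese string before output with encode("gbk") or encode("utf-8"), whichever matches the encoding of the console or file.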
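As a rough illustration of what the two encode() calls produce, the sketch below (Python 3 syntax; the sample string is an assumption chosen for the example) encodes the same text both ways and decodes it back.

```python
# The sample text is an illustrative assumption, not taken from the post.
text = "中文"

utf8_bytes = text.encode("utf-8")  # each of these characters takes 3 bytes
gbk_bytes = text.encode("gbk")     # each of these characters takes 2 bytes

print(len(utf8_bytes), len(gbk_bytes))

# Decoding with the matching codec restores the original string.
roundtrip_utf8 = utf8_bytes.decode("utf-8")
roundtrip_gbk = gbk_bytes.decode("gbk")
print(roundtrip_utf8 == text, roundtrip_gbk == text)
```

Decoding with the wrong codec is the usual cause of garbled Chinese output, so the encoding chosen when writing should match the one used when reading.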

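2. Storing Chinese Text

In Scrapy, scraped data is processed in pipelines.py. First open the project settings file, settings.py, and enable the pipeline. By default the setting is commented out: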

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'firstpjt.pipelines.FirstpjtPipeline': 300,
#}

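In the code above, 'firstpjt.pipelines.FirstpjtPipeline' is made up of three parts: the project package name, the pipelines module name, and the pipeline class name. Uncomment the setting so that it reads: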

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'firstpjt.pipelines.FirstpjtPipeline': 300,
}
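To make the three parts of the dotted path concrete, here is a small sketch in plain Python (no Scrapy required; the variable names are illustrative) that splits the string the same way it is read above.

```python
# Split a pipeline path into package, module, and class name.
pipeline_path = "firstpjt.pipelines.FirstpjtPipeline"

# The last dot separates the class name from the module path.
module_path, class_name = pipeline_path.rsplit(".", 1)
package, module = module_path.split(".")

print(package, module, class_name)
```

The number 300 is the pipeline's order: when several pipelines are enabled, items pass through the lower-valued ones first, and the values are customarily kept in the 0 to 1000 range.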

Then write the pipelines.py file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# Import the codecs module so the file can be opened with an explicit encoding
import codecs


class FirstpjtPipeline(object):

    def __init__(self):
        # Create (or open) a plain text file in write mode to store the scraped data
        self.file = codecs.open("E:/SteveWorkspace/firstpjt/mydata/mydata1.txt", "wb", encoding="utf-8")

    def process_item(self, item, spider):
        # Build the line to write: one item per line
        line = str(item) + '\n'
        # Print the line as well, which helps with debugging
        print(line)
        # Write the line to the file
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()
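In Python 3 the built-in open() accepts an encoding argument, so the same idea can be sketched without codecs. The variant below is only an illustration: the class name Utf8TextPipeline and its path argument are hypothetical, and the hooks are called by hand so the example runs without Scrapy.

```python
import os
import tempfile


class Utf8TextPipeline:
    """Hypothetical variant of the pipeline above (names are illustrative)."""

    def __init__(self, path):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Text mode with an explicit encoding keeps Chinese characters intact.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # One item per line, as in the original pipeline.
        self.file.write(str(item) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()


# Drive the hooks manually (no Scrapy involved) and read the file back.
demo_path = os.path.join(tempfile.mkdtemp(), "mydata1.txt")
pipeline = Utf8TextPipeline(demo_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "中文標題"}, spider=None)
pipeline.close_spider(spider=None)

with open(demo_path, encoding="utf-8") as f:
    written = f.read()
print(written)
```

In a real project Scrapy creates the pipeline object itself, so the path would have to come from a default value or a setting rather than a required constructor argument.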

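3. Writing Chinese to a JSON File

JSON data has two basic storage structures: arrays and objects.

Array form: ["蘋果","梨子","葡萄"]

Object form, a set of key-value pairs: {"姓名":"小明","身高":"173"}

Modify pipelines.py along these lines: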
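As a quick check of how these two structures map onto Python types, the following sketch parses both samples with the standard json module (the variable names are illustrative).

```python
import json

# A JSON array becomes a Python list; a JSON object becomes a dict.
fruits = json.loads('["蘋果", "梨子", "葡萄"]')
person = json.loads('{"姓名": "小明", "身高": "173"}')

print(type(fruits).__name__, type(person).__name__)
print(fruits[0], person["姓名"])
```

Note that the separators inside JSON must be ASCII commas and colons; a full-width comma between the pairs would make json.loads() raise an error.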

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import codecs
import json


class ScrapytestPipeline(object):
    def __init__(self):
        # Create (or open) a JSON file in append mode
        self.file = codecs.open("E:/PycharmWorkspace/ScrapyTest/mydata/datayamaxun.json", "ab", encoding="utf-8")
        print("File opened---------------")

    def process_item(self, item, spider):
        print("Start writing---------------")
        for j in range(0,len(item["bookname"])):
            bookname = item["bookname"][j]
            # author=item["author"][j]
            price = item["price"][j]
            book = {"bookname": bookname, "price": price}
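            # book is already a plain dict; dict() is what would convert a Scrapy item into one
            # json.dumps() then serializes the dict to a JSON string
            # By default json.dumps() escapes non-ASCII characters, so set
            # ensure_ascii=False to keep the Chinese text readable
            i = json.dumps(dict(book), ensure_ascii=False)
            # Append "\n" so that each record is written on its own line
            line = i + '\n'
            print("Writing to file---------------")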
            self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()
        print("File closed---------------")
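To see what ensure_ascii=False changes, the sketch below serializes one record both ways. The book title and price are made-up sample values, not data from the original crawl.

```python
import json

# Sample record; the values are assumptions chosen for illustration.
book = {"bookname": "Python 網絡爬蟲", "price": "59.00"}

escaped = json.dumps(book)                        # default: non-ASCII escaped as \uXXXX
readable = json.dumps(book, ensure_ascii=False)   # Chinese characters kept as-is

print(escaped)
print(readable)

# Both strings decode back to the same data.
same_data = json.loads(escaped) == json.loads(readable) == book
print(same_data)
```

Either form is valid JSON; ensure_ascii=False only makes the file readable to a person, which is why the file must then be written with an encoding such as UTF-8 that can represent the characters.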
