Learning Python Scrapy: why the pipeline did not save data to a file

Original post

2020-06-26 05:57

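Today I studied how the Scrapy framework processes data with pipelines. A pipeline can save the data passed on by the spider file to a file, for example a .txt file.
First, set up the pipelines.py file.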

# -*- coding: utf-8 -*-

import codecs
class MypjtPipeline(object):
    def __init__(self):
        self.file = codecs.open("D:/Kangbb/data1.txt", "w", encoding="utf-8")
    def process_item(self, item, spider):
        l = str(item['title'])+'\n'
        self.file.write(l)
        return item
    def close_spider(self, spider):
        # Scrapy passes the spider to this hook, so the parameter is required.
        self.file.close()

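The process_item() method must be defined; it is the method that actually processes the data, and it is called once for each item. The other methods are optional.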
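
As a rough sketch of the optional hooks, the variant below opens the file in open_spider() rather than in __init__(). The class name TitleWriterPipeline, the path parameter, and its default value are assumptions made for this illustration and are not part of the original project.

```python
class TitleWriterPipeline:
    """Write each item's title to a text file, one title per line."""

    def __init__(self, path="data1.txt"):
        # `path` is a hypothetical parameter; the default is only a placeholder.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Optional hook: called once when the spider is opened.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Required hook: called for every item the spider returns or yields.
        self.file.write(str(item["title"]) + "\n")
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Optional hook: called once when the spider is closed.
        self.file.close()
```

Opening the file in open_spider() keeps the open and close steps symmetric, but opening it in __init__() as in the listing above should work as well.
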
Next, edit settings.py. Uncomment the following section and fill it in according to the class you defined:

#ITEM_PIPELINES = {
#    'mypjt.pipelines.MypjtPipeline': 300,
#}

ITEM_PIPELINES = {
    'mypjt.pipelines.MypjtPipeline': 300,
}

　
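Here, mypjt.pipelines.MypjtPipeline follows the pattern project name, then the pipelines module name, then the class name inside the pipelines file, joined by dots. The value 300 is the priority, conventionally in the range 0 to 1000; the smaller the number, the earlier the class runs. Several data-processing classes can be defined at once, with the numbers determining their order.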
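
To illustrate the ordering rule, here is a small plain-Python sketch. It only sorts the dictionary by value to show which class would run first; it is not how Scrapy itself is implemented, and CleanTitlePipeline is a hypothetical second class that does not exist in the original project.

```python
# Hypothetical settings: CleanTitlePipeline is an assumed second pipeline class.
ITEM_PIPELINES = {
    "mypjt.pipelines.MypjtPipeline": 300,
    "mypjt.pipelines.CleanTitlePipeline": 200,
}

# Sort the class paths by their priority number, smallest first.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

With these numbers, the class registered at 200 would process each item before the one registered at 300.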

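However, after doing all of this, the data still was not saved to the file. After checking each part and reading the documentation, I found the cause: the parse() function in the spider file (the file under the spiders folder) had no return statement, so no item ever reached the pipeline.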
　
Initial version:

# -*- coding: utf-8 -*-
import scrapy
from mypjt.items import MypjtItem

class MyfileSpider(scrapy.Spider):
    name = 'myfile'
    allowed_domains = ['www.jd.com']
    start_urls = ['https://channel.jd.com/eleNews.html']

    def parse(self, response):
        item = MypjtItem()
        # .get() extracts the first match as a string instead of a SelectorList
        item['title'] = response.xpath("/html/head/title/text()").get()
        print(item['title'])

Revised version:

# -*- coding: utf-8 -*-
import scrapy
from mypjt.items import MypjtItem

class MyfileSpider(scrapy.Spider):
    name = 'myfile'
    allowed_domains = ['www.jd.com']
    start_urls = ['https://channel.jd.com/eleNews.html']

    def parse(self, response):
        item = MypjtItem()
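        # .get() extracts the first match as a string instead of a SelectorList
        item['title'] = response.xpath("/html/head/title/text()").get()
        print(item['title'])
        # Returning the item is what hands it to the pipeline.
        return item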

This resolved the problem.
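
The effect of the missing return can be seen with a plain-Python analogy. This is only a sketch: collect_items() is a hypothetical stand-in for the framework gathering the output of parse(), and the dictionary fake_response is a placeholder rather than a real Scrapy Response object.

```python
def parse_without_return(response):
    item = {"title": response["title"]}
    print(item["title"])
    # No return statement: the function implicitly returns None.


def parse_with_yield(response):
    # yield works like return here and also allows several items per page.
    yield {"title": response["title"]}


def collect_items(result):
    """Hypothetical stand-in for the framework collecting parse() output."""
    if result is None:
        return []  # nothing was handed back, so nothing reaches a pipeline
    return list(result)


fake_response = {"title": "Example page"}  # placeholder, not a real Response
lost = collect_items(parse_without_return(fake_response))
kept = collect_items(parse_with_yield(fake_response))
```

In this sketch, lost ends up empty while kept holds one item, which mirrors why printing the title inside parse() is not enough to get it written to the file.
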
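To summarize, saving data to a file with Scrapy requires attention to three points:

 1. The pipelines file must be set up correctly.
 2. The settings.py file must enable the pipeline.
 3. The parse() function in the spider file must have a return statement.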
