Learning Python Scrapy: why the pipeline did not save data to a file

Original

2020-06-26 05:57

Today I studied the item pipeline part of the Scrapy framework. A pipeline can save the data passed along by the spider file to a file, for example a .txt file.
First, set up the pipelines.py file.

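# -*- coding: utf-8 -*-

import codecs


class MypjtPipeline(object):
    def __init__(self):
        # Open the output file once, when the pipeline is created.
        self.file = codecs.open("D:/Kangbb/data1.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Write one title per line, then pass the item on.
        line = str(item['title']) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Scrapy passes the spider as an argument when it closes.
        self.file.close()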

process_item() must be defined; it is the method that actually processes the data. The other methods are optional.
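As a rough illustration of the optional hooks, the sketch below opens the file in open_spider() and closes it in close_spider(), with process_item() doing the actual work. The class name TitleFilePipeline, the default file name titles.txt, and the temporary demo path are assumptions made for this example, not part of the original project. The hooks are called by hand with spider=None only to show the order in which Scrapy would call them.

```python
import os
import tempfile


class TitleFilePipeline:
    """Hypothetical pipeline that writes each item's title to a text file."""

    def __init__(self, path="titles.txt"):
        # A default path lets the class be created without arguments.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Optional hook: called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Required hook: called for every item; return it for later pipelines.
        self.file.write(str(item["title"]) + "\n")
        return item

    def close_spider(self, spider):
        # Optional hook: called once when the spider finishes.
        self.file.close()


# Call the hooks by hand, in the order Scrapy would call them.
demo_path = os.path.join(tempfile.gettempdir(), "titles_demo.txt")
pipeline = TitleFilePipeline(demo_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "Example title"}, spider=None)
pipeline.close_spider(spider=None)

with open(demo_path, encoding="utf-8") as f:
    saved = f.read()
print(saved)
```

Opening the file in open_spider() rather than in __init__() keeps the file handle tied to the spider's lifetime, which is one common way to structure such a pipeline.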
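Next, configure settings.py. Uncomment the following block (shown first as generated, then after uncommenting) and fill it in to match the class you defined: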

#ITEM_PIPELINES = {
#    'mypjt.pipelines.MypjtPipeline': 300,
#}

ITEM_PIPELINES = {
    'mypjt.pipelines.MypjtPipeline': 300,
}

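
Here, the path mypjt.pipelines.MypjtPipeline follows the pattern project_name.pipelines_module_name.class_name, where the class is the one defined in pipelines.py. The value 300 is the priority: values conventionally range from 0 to 1000, and pipelines with smaller numbers run first. You can define several item-processing classes at once and use these numbers to set their order.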
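To make the ordering rule concrete, here is a small sketch with two entries. CleanTitlePipeline is a hypothetical second class that does not exist in the original project, and sorting the dictionary by value only mimics the "smaller number first" rule; it is not Scrapy's internal code.

```python
# Hypothetical settings with two pipeline classes in the same project.
ITEM_PIPELINES = {
    "mypjt.pipelines.MypjtPipeline": 300,
    "mypjt.pipelines.CleanTitlePipeline": 100,  # hypothetical second class
}

# Smaller numbers run first, so sorting by value gives the processing order.
execution_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in execution_order:
    print(ITEM_PIPELINES[path], path)
```

With these numbers, an item would pass through CleanTitlePipeline before it reaches MypjtPipeline.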

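However, after finishing all of this, the data was still not saved to the file. After checking several possibilities and consulting the documentation, I found the cause: the parse() function in the spider file (the file under the spiders folder) had no return statement.

Initial version: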

# -*- coding: utf-8 -*-
import scrapy
from mypjt.items import MypjtItem

class MyfileSpider(scrapy.Spider):
    name = 'myfile'
    allowed_domains = ['www.jd.com']
    start_urls = ['https://channel.jd.com/eleNews.html']

    def parse(self, response):
        item = MypjtItem()
        # .get() extracts the title as a string instead of a selector list
        item['title'] = response.xpath("/html/head/title/text()").get()
        print(item['title'])
        # No return statement here, so the item never reaches the pipeline

Revised version:

# -*- coding: utf-8 -*-
import scrapy
from mypjt.items import MypjtItem

class MyfileSpider(scrapy.Spider):
    name = 'myfile'
    allowed_domains = ['www.jd.com']
    start_urls = ['https://channel.jd.com/eleNews.html']

    def parse(self, response):
        item = MypjtItem()
        # .get() extracts the title as a string instead of a selector list
        item['title'] = response.xpath("/html/head/title/text()").get()
        print(item['title'])
        # Returning the item hands it to the enabled pipelines
        return item

That solved the problem.
To summarize, saving data to a file with Scrapy requires attention to the following three points:

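   1. Implement the pipeline class correctly in pipelines.py
   2. Enable the pipeline in settings.py
   3. The parse() function in the spider file must have a return statement (or yield the item)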
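
The third point can be illustrated without Scrapy at all. The sketch below is a toy stand-in for the framework, not its real code: run_callback, parse_without_return, and parse_with_return are hypothetical names, and a plain string plays the role of the response. It only shows that a callback which prints but returns nothing leaves the caller with nothing to pass on.

```python
def parse_without_return(response):
    item = {"title": response}
    print(item["title"])  # printed, but the item is discarded afterwards


def parse_with_return(response):
    item = {"title": response}
    return item


def run_callback(callback, response, collected):
    """Toy stand-in for the framework: forward whatever the callback returns."""
    result = callback(response)
    if result is None:
        return  # nothing to hand to the pipelines
    collected.append(result)


collected = []
run_callback(parse_without_return, "Page title", collected)
run_callback(parse_with_return, "Page title", collected)
print(collected)
```

Only the second callback contributes an item, which matches the behavior seen with the two spider versions.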
