[Scrapy] How to Use Scrapy Pipelines

Original

gz-郭小敏

2020-07-06 13:20

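Before getting into pipelines, here is a quick outline of the steps involved in scraping data:

1. Send the request
2. Receive the data (scraped from the website)
3. Clean the data (process it)
4. Store the data

Pipelines are responsible for steps 3 and 4; "pipeline" is the technical term for this stage. By defining one or more classes, we can process the items that the spider passes in.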
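As a rough illustration of steps 3 and 4, the sketch below runs one item through two pipeline-style classes, one that cleans and one that stores. It does not use Scrapy itself: the class names `CleanPipeline` and `StorePipeline`, the sample item, and the `run_pipelines` helper are all hypothetical, and only the `process_item(item, spider)` method shape follows Scrapy's pipeline interface.

```python
class CleanPipeline:
    """Step 3: clean the data (here, strip surrounding whitespace from string fields)."""

    def process_item(self, item, spider):
        return {k: v.strip() if isinstance(v, str) else v for k, v in item.items()}


class StorePipeline:
    """Step 4: store the data (an in-memory list stands in for a file or database)."""

    def __init__(self):
        self.stored = []

    def process_item(self, item, spider):
        self.stored.append(item)
        return item


def run_pipelines(items, pipelines, spider=None):
    """Hypothetical stand-in for Scrapy's engine: pass each item through every pipeline in order."""
    for item in items:
        for pipeline in pipelines:
            item = pipeline.process_item(item, spider)


store = StorePipeline()
run_pipelines([{'t1': '  Engineer ', 't2': ' ACME '}], [CleanPipeline(), store])
print(store.stored)
```

Each class does one job and hands the item on to the next, which is why the return value of `process_item` matters.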

Project layout:

Spider code, mingyan_spider.py:

import scrapy

def getUrl():
    return 'https://search.51job.com/list/030200,000000,0000,00,9,99,%2520,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare='



class itemSpider(scrapy.Spider):
    name = 'argsSpider'

    def start_requests(self):
        url = getUrl()
        yield scrapy.Request(url, self.parse)  # send the request for the search results page

    def parse(self, response):
        rows = response.css('div#resultList>.el')  # select every row of the result list
        for v in rows:  # extract the four columns of each row
            t1 = v.css('.t1 a::text').extract_first()  # job title
            t2 = v.css('.t2 a::attr(title)').extract_first()  # company
            t3 = v.css('.t3::text').extract_first()  # region
            t4 = v.css('.t4::text').extract_first()  # salary

            # Remove spaces; note that str(None) turns a missing value into the string 'None'
            t1 = str(t1).replace(' ', '')
            t2 = str(t2).replace(' ', '')
            t3 = str(t3).replace(' ', '')
            t4 = str(t4).replace(' ', '')

            item = {
                't1': t1,
                't2': t2,
                't3': t3,
                't4': t4,
            }
            yield item
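One side effect of calling `str()` on every field is that a missing value becomes the literal string 'None'. A small helper can keep missing values as `None` instead. The sketch below is one possible approach; `clean_field` is a hypothetical name that does not appear in the original spider.

```python
def clean_field(value):
    """Remove spaces from a scraped string, leaving missing values as None."""
    if value is None:
        return None
    return value.replace(' ', '')


print(clean_field(' Python Developer '))
print(clean_field(None))
```

If the spider used a helper like this, any pipeline that concatenates labels onto the fields would need to check for `None` first.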

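Configuration file, settings.py:

(A pipeline only takes effect once it is registered here. The lower the number, the higher the priority, so that pipeline is called first. By convention the numbers are kept within 0-1000.)

# SpiderjobPipeline2 runs first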
ITEM_PIPELINES = {
   'spiderJob.pipelines.SpiderjobPipeline': 400,
   'spiderJob.pipelines.SpiderjobPipeline2':100,
}
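To check which pipeline runs first for a given configuration, one can sort the dictionary by its values. This is only a sketch of the ordering rule described above, not how Scrapy is implemented internally; `execution_order` is a hypothetical helper.

```python
ITEM_PIPELINES = {
    'spiderJob.pipelines.SpiderjobPipeline': 400,
    'spiderJob.pipelines.SpiderjobPipeline2': 100,
}


def execution_order(pipelines):
    """Return pipeline paths sorted from the lowest number (runs first) to the highest."""
    return sorted(pipelines, key=pipelines.get)


for path in execution_order(ITEM_PIPELINES):
    print(ITEM_PIPELINES[path], path)
```

With the values used in this post, `SpiderjobPipeline2` (100) is listed before `SpiderjobPipeline` (400).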

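Pipeline file, pipelines.py:

# Storage
class SpiderjobPipeline(object):
    # Optional: parameter initialization and similar setup
    def __init__(self):
        print("//////////////")

    # item (Item object): the scraped item
    # spider (Spider object): the spider that scraped the item
    # Every item pipeline component must implement this method; Scrapy calls it for each item.
    # It must return an item object (or raise DropItem); dropped items are not
    # processed by later pipeline components.
    def process_item(self, item, spider):
        fileName = 'aa.txt'  # output file name
        # The spider turns a missing title into the string 'None', and
        # SpiderjobPipeline2 may already have prefixed a label, so test the suffix
        if str(item['t1']).endswith('None'):
            return item  # skip writing rows that have no job title
        with open(fileName, "a+", encoding='utf-8') as f:
            line = ",".join(str(item[k]) for k in ('t1', 't2', 't3', 't4'))
            f.write(line)
            f.write('\n')  # '\n' is a newline
            f.write('\n-------\n')  # separator between records
        return item  # return the item itself, not the formatted string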


# First add a label to each field
class SpiderjobPipeline2(object):
    def process_item(self, item, spider):
        return {
            't1': '崗位:'+item['t1'],  # "Position:"
            't2': '公司:'+item['t2'],  # "Company:"
            't3': '地區:'+item['t3'],  # "Region:"
            't4': '薪資:'+item['t4'],  # "Salary:"
        }
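An alternative to reopening the file for every item and testing for the string 'None' is to open the file once per crawl and drop incomplete items. The sketch below relies on Scrapy's `open_spider`/`close_spider` pipeline hooks and its `DropItem` exception. The class name `FileWriterPipeline`, the `path` argument, the sample items, and the fallback `DropItem` definition (there only so the example runs without Scrapy installed) are illustrative assumptions. It also assumes missing fields arrive as `None` or empty, not as the string 'None'.

```python
import os
import tempfile

try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Fallback so the sketch runs without Scrapy installed."""


class FileWriterPipeline:
    def __init__(self, path='aa.txt'):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file
        self.file = open(self.path, 'a', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes: close the output file
        self.file.close()

    def process_item(self, item, spider):
        if not item.get('t1'):
            # Dropped items are not passed to later pipelines
            raise DropItem('missing job title')
        self.file.write(','.join(str(item[k]) for k in ('t1', 't2', 't3', 't4')))
        self.file.write('\n-------\n')
        return item


# Usage outside Scrapy, with made-up items and a temporary file
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
pipeline = FileWriterPipeline(path)
pipeline.open_spider(None)
pipeline.process_item({'t1': 'Engineer', 't2': 'ACME', 't3': 'Guangzhou', 't4': '10k'}, None)
try:
    pipeline.process_item({'t1': None, 't2': 'ACME', 't3': 'Guangzhou', 't4': '10k'}, None)
except DropItem as exc:
    print('dropped:', exc)
pipeline.close_spider(None)

with open(path, encoding='utf-8') as fh:
    print(fh.read())
```

Raising `DropItem` makes the intent explicit, whereas returning the item unchanged lets an incomplete record continue to any later pipeline.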

Result:
