Web Crawler Study Notes (14): A First Taste of Scrapy (2020.5.18)

Preface

This installment is the start of learning Scrapy.

For reference:
"scrapy詳解" (a detailed explanation of Scrapy)
"Scrapy爬蟲框架,入門案例" (the Scrapy crawler framework, with an introductory example)

1. How Scrapy works

(Figure: Scrapy architecture diagram)

Scrapy consists mainly of the following components:

  • Engine (Scrapy Engine): handles the data flow of the whole system and triggers events (the core of the framework).
  • Scheduler: accepts requests sent by the engine, pushes them onto a queue, and hands them back when the engine asks again. Think of it as a priority queue of URLs (the addresses of the pages to crawl); it decides which URL to fetch next and removes duplicate URLs.
  • Downloader: downloads page content and returns it to the spiders (the Scrapy downloader is built on twisted, an efficient asynchronous networking framework).
  • Spiders: do the main work, extracting the information you need, i.e. the so-called items (Item), from specific pages. You can also extract links from the pages so that Scrapy goes on to crawl the next page.
  • Item Pipeline: processes the items the spiders extract from pages; its main jobs are persisting items, validating them, and dropping unwanted data. After a page is parsed by a spider, the result is sent to the item pipeline and passes through several processing stages in a fixed order.
  • Downloader Middlewares: sit between the Scrapy engine and the downloader and mainly process the requests and responses passing between them.
  • Spider Middlewares: sit between the Scrapy engine and the spiders and mainly process the spiders' response input and request output.
  • Scheduler Middlewares: sit between the Scrapy engine and the scheduler and process the requests and responses sent from the engine to the scheduler.

Scrapy's run loop goes roughly like this:

  • The engine takes a link (URL) from the scheduler for the next crawl.
  • The engine wraps the URL into a Request and passes it to the downloader.
  • The downloader fetches the resource and wraps it into a Response.
  • The spider parses the Response.
  • If an item (Item) is parsed out, it is handed to the item pipeline for further processing.
  • If a link (URL) is parsed out, it is handed to the scheduler and waits to be crawled (see the small sketch after this list).
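
The last two branches of the flow correspond directly to what a spider yields from its parse callback. A minimal sketch (the spider name and URLs here are made up for illustration; they are not from the original notes):

import scrapy

class FlowDemoSpider(scrapy.Spider):
    name = "flow_demo"                    # hypothetical spider name
    start_urls = ["http://example.com/"]  # placeholder start URL

    def parse(self, response):
        # Yielding an item/dict hands it to the item pipeline.
        yield {"title": response.xpath("//title/text()").extract_first()}
        # Yielding a Request hands the URL back to the scheduler to be crawled later.
        next_url = response.xpath('//a[@rel="next"]/@href').extract_first()
        if next_url is not None:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)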

2. Command line

#1. Create a new project
scrapy startproject [project name]
#2. Generate a spider
scrapy genspider [spider name] [domain]
#3. Run it (crawl)
scrapy crawl [spider name]
scrapy crawl [spider name] -o zufang.json
# -o means output
scrapy crawl [spider name] -o zufang.csv
#4. check: check the spiders for errors (contract checks)
scrapy check
#5. list: list all spider names in the project
scrapy list
#6. view: download a page and open it in the browser
scrapy view https://www.baidu.com
#7. scrapy shell: open an interactive scraping shell
scrapy shell https://www.baidu.com
#8. scrapy runspider: run a spider file directly
scrapy runspider zufang_spider.py
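
For the rental-listing example in the next section, the setup would look roughly like this (the genspider arguments are my assumption; only the project name maitian and the spider name zufang actually appear in the code below):

scrapy startproject maitian
cd maitian
scrapy genspider zufang bj.maitian.cn
# after writing the spider:
scrapy crawl zufang -o zufang.json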

3. A simple example

Crawling rental-listing information from bj.maitian.cn.

items.py

import scrapy
class MaitianItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    area = scrapy.Field()
    district = scrapy.Field()
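
An Item behaves like a dict restricted to its declared fields, so the class above can be filled and read like this (a quick illustration, not from the original post):

from maitian.items import MaitianItem

item = MaitianItem()
item['title'] = 'sample listing'
item['price'] = '5000'
# item['floor'] = 3  # would raise KeyError: 'floor' is not a declared field
print(dict(item))    # {'title': 'sample listing', 'price': '5000'}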

settings.py

BOT_NAME = 'maitian'
SPIDER_MODULES = ['maitian.spiders']
NEWSPIDER_MODULE = 'maitian.spiders'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {'maitian.pipelines.MaitianPipeline': 300,}
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'maitian'
MONGODB_DOCNAME = 'zufang'
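
In ITEM_PIPELINES the value 300 is the pipeline's priority (0 to 1000, lower numbers run first). The MONGODB_* entries are custom settings; any component can read them back through Scrapy's Settings object, for example (illustrative only):

# inside a Spider method:
#     db_name = self.settings.get('MONGODB_DBNAME')   # 'maitian'
# inside a from_crawler() classmethod:
#     host = crawler.settings.get('MONGODB_HOST')     # '127.0.0.1'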

pipelines.py

import pymongo
from scrapy.utils.project import get_project_settings

settings = get_project_settings()

class MaitianPipeline(object):
    def __init__(self):
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        db_name = settings['MONGODB_DBNAME']
        client = pymongo.MongoClient(host=host, port=port)
        db = client[db_name]
        self.post = db[settings['MONGODB_DOCNAME']]
    # Persist each scraped item into MongoDB
    def process_item(self, item, spider):
        zufang = dict(item)
        self.post.insert_one(zufang)
        return item
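
A more idiomatic alternative (my sketch, not part of the original notes) is to read the settings through the from_crawler hook instead of a module-level settings object, and to open and close the MongoDB connection together with the spider:

import pymongo

class MaitianMongoPipeline(object):
    # Hypothetical variant of MaitianPipeline using from_crawler().
    def __init__(self, host, port, db_name, doc_name):
        self.host = host
        self.port = port
        self.db_name = db_name
        self.doc_name = doc_name

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings exposes everything defined in settings.py
        s = crawler.settings
        return cls(s.get('MONGODB_HOST'), s.getint('MONGODB_PORT'),
                   s.get('MONGODB_DBNAME'), s.get('MONGODB_DOCNAME'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(host=self.host, port=self.port)
        self.post = self.client[self.db_name][self.doc_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.post.insert_one(dict(item))
        return item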

middlewares.py

from scrapy import signals
class MaitianSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s
    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None
    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i
    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass
    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r
    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

class MaitianDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s
    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response
    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass
    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
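
Both classes above are the unmodified templates that scrapy startproject generates; they have no effect unless they are registered under SPIDER_MIDDLEWARES / DOWNLOADER_MIDDLEWARES in settings.py. As one concrete use of the process_request hook (an illustrative sketch, not from the original notes), a downloader middleware that rotates the User-Agent header could look like this:

import random

class RandomUserAgentMiddleware(object):
    # Hypothetical middleware: pick a random User-Agent for every outgoing request.
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
        'Mozilla/5.0 (X11; Linux x86_64)',
    ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None  # None means: let the request continue through the chain

# Enable it in settings.py, e.g.:
# DOWNLOADER_MIDDLEWARES = {'maitian.middlewares.RandomUserAgentMiddleware': 400}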

zufang_spider.py

import scrapy
from maitian.items import MaitianItem
class MaitianSpider(scrapy.Spider):
    name = "zufang"
    start_urls = ['http://bj.maitian.cn/zfall/PG1']  # starting URL
    def parse(self, response):  # the parse callback
        for zufang_item in response.xpath('//div[@class="list_title"]'):
            yield {
                'title': zufang_item.xpath('./h1/a/text()').extract_first().strip(),
                'price': zufang_item.xpath('./div[@class="the_price"]/ol/strong/span/text()').extract_first().strip(),
                'area': zufang_item.xpath('./p/span/text()').extract_first().replace('㎡', '').strip(),
                'district': zufang_item.xpath('./p//text()').re(r'昌平|朝陽|東城|大興|豐臺|海淀|石景山|順義|通州|西城')[0],
            }
        next_page_url = response.xpath('//div[@id="paging"]/a[@class="down_page"]/@href').extract_first()  # next page
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))
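
MaitianItem is imported but the spider actually yields plain dicts; since the keys match the fields declared in items.py, the loop body could just as well yield the Item (a small variation on the code above, not taken from the original post):

# Inside MaitianSpider.parse(self, response), replacing the dict yield:
for zufang_item in response.xpath('//div[@class="list_title"]'):
    item = MaitianItem()
    item['title'] = zufang_item.xpath('./h1/a/text()').extract_first().strip()
    item['price'] = zufang_item.xpath('./div[@class="the_price"]/ol/strong/span/text()').extract_first().strip()
    item['area'] = zufang_item.xpath('./p/span/text()').extract_first().replace('㎡', '').strip()
    item['district'] = zufang_item.xpath('./p//text()').re(r'昌平|朝陽|東城|大興|豐臺|海淀|石景山|順義|通州|西城')[0]
    yield item

Either way, scrapy crawl zufang -o zufang.json dumps the results to JSON, and with the MongoDB pipeline enabled each listing also lands in the zufang collection.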

Wrap-up

That was a quick first taste of Scrapy.
I'll keep studying it bit by bit.
