Web Scraping Study Notes (14): A First Taste of Scrapy (2020.5.18)

Preface

This section begins our study of Scrapy.

Useful references:
"Scrapy in Detail"
"Scrapy Crawler Framework: An Introductory Example"

1. How It Works

(Scrapy architecture diagram)

Scrapy consists of the following components:

  • Engine: handles the data flow for the whole system and triggers events (the core of the framework).
  • Scheduler: accepts requests from the engine, pushes them onto a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs (the addresses of pages to crawl): it decides which URL to fetch next and also filters out duplicate URLs.
  • Downloader: fetches page content and returns it to the spiders (the downloader is built on Twisted, an efficient asynchronous networking framework).
  • Spiders: do the main work, extracting the information you need (the so-called Items) from specific pages. You can also extract links and have Scrapy continue on to the next page.
  • Item Pipeline: processes the Items the spiders extract; its main jobs are persisting Items, validating them, and discarding unwanted data. After a spider parses a page, the resulting Items are sent through the pipeline and processed in a defined order.
  • Downloader Middlewares: sit between the engine and the downloader and process the requests and responses passing between them.
  • Spider Middlewares: sit between the engine and the spiders and process the spiders' response input and request output.
  • Scheduler Middlewares: sit between the engine and the scheduler and process the requests and responses passing between them.

Scrapy's run loop works roughly like this:

  • The engine takes a URL from the scheduler for the next crawl.
  • The engine wraps the URL in a Request and passes it to the downloader.
  • The downloader fetches the resource and wraps it in a Response.
  • A spider parses the Response.
  • If it parses out an Item, the Item is handed to the item pipeline for further processing.
  • If it parses out a URL, the URL is handed back to the scheduler to await crawling.
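The scheduler's "queue plus duplicate filter" role described above can be sketched in plain Python. This is an illustrative toy, not Scrapy's actual implementation (the real scheduler uses request fingerprints and pluggable queues):

```python
from collections import deque

class ToyScheduler:
    """Toy model of Scrapy's scheduler: a FIFO queue of URLs
    that silently drops URLs it has already seen."""
    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        if url in self.seen:          # duplicate filter
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_request(self):
        # the engine pulls the next URL to crawl from here
        return self.queue.popleft() if self.queue else None

sched = ToyScheduler()
sched.enqueue("http://bj.maitian.cn/zfall/PG1")
sched.enqueue("http://bj.maitian.cn/zfall/PG2")
sched.enqueue("http://bj.maitian.cn/zfall/PG1")  # dropped as a duplicate
print(sched.next_request())  # http://bj.maitian.cn/zfall/PG1
```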

2. Command Line

#1. Create a new project
scrapy startproject [project_name]
#2. Generate a spider
scrapy genspider [spider_name] [domain]
#3. Run a spider (crawl)
scrapy crawl [spider_name]
scrapy crawl [spider_name] -o zufang.json
# -o = output
scrapy crawl [spider_name] -o zufang.csv
#4. check: look for contract errors
scrapy check
#5. list: print the names of all spiders in the project
scrapy list
#6. view: download a page and open it in the browser
scrapy view https://www.baidu.com
#7. scrapy shell: open an interactive scraping shell
scrapy shell https://www.baidu.com
#8. scrapy runspider: run a spider file directly, without a project
scrapy runspider zufang_spider.py

3. A Simple Example

Scraping rental listings

items.py

import scrapy
class MaitianItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    area = scrapy.Field()
    district = scrapy.Field()
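A scrapy.Item behaves like a dict that only accepts the declared fields; assigning an undeclared key raises KeyError. A rough stdlib analogy of that behavior (a toy sketch, not Scrapy's real implementation):

```python
class ToyItem(dict):
    """Dict that only accepts a fixed set of field names,
    mimicking how scrapy.Item rejects undeclared fields."""
    fields = ("title", "price", "area", "district")

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{key} is not a declared field")
        super().__setitem__(key, value)

item = ToyItem()
item["title"] = "Some listing"
item["price"] = "5000"
try:
    item["color"] = "red"   # undeclared field is rejected
except KeyError as e:
    print("rejected:", e)
```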

settings.py

BOT_NAME = 'maitian'
SPIDER_MODULES = ['maitian.spiders']
NEWSPIDER_MODULE = 'maitian.spiders'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {'maitian.pipelines.MaitianPipeline': 300,}  # the key must be ITEM_PIPELINES, or the pipeline is silently skipped
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'maitian'
MONGODB_DOCNAME = 'zufang'

pipelines.py

import pymongo

class MaitianPipeline(object):
    def __init__(self, host, port, db_name, doc_name):
        client = pymongo.MongoClient(host=host, port=port)
        db = client[db_name]
        self.post = db[doc_name]

    @classmethod
    def from_crawler(cls, crawler):
        # scrapy.conf was removed in modern Scrapy; read the
        # settings through from_crawler instead
        s = crawler.settings
        return cls(s['MONGODB_HOST'], s['MONGODB_PORT'],
                   s['MONGODB_DBNAME'], s['MONGODB_DOCNAME'])

    # persist the data
    def process_item(self, item, spider):
        zufang = dict(item)
        self.post.insert_one(zufang)  # insert() is deprecated in pymongo
        return item
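Because process_item only touches the injected collection, the pipeline logic is easy to unit-test without a running MongoDB. The StubCollection and PipelineUnderTest below are hypothetical test scaffolding, not part of pymongo or this project:

```python
class StubCollection:
    """Hypothetical in-memory stand-in for a pymongo collection."""
    def __init__(self):
        self.docs = []
    def insert_one(self, doc):
        self.docs.append(doc)

class PipelineUnderTest:
    """Same process_item logic as MaitianPipeline, with the
    collection injected so no MongoDB server is needed."""
    def __init__(self, collection):
        self.post = collection

    def process_item(self, item, spider):
        zufang = dict(item)
        self.post.insert_one(zufang)
        return item

coll = StubCollection()
pipe = PipelineUnderTest(coll)
pipe.process_item({"title": "loft", "price": "5000"}, spider=None)
print(coll.docs)  # [{'title': 'loft', 'price': '5000'}]
```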

middlewares.py

from scrapy import signals
class MaitianSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s
    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None
    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i
    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass
    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r
    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

class MaitianDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s
    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response
    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass
    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

zufang_spider.py

import scrapy
from maitian.items import MaitianItem

class MaitianSpider(scrapy.Spider):
    name = "zufang"
    start_urls = ['http://bj.maitian.cn/zfall/PG1']  # start URL

    def parse(self, response):  # parse callback
        for zufang_item in response.xpath('//div[@class="list_title"]'):
            yield MaitianItem(
                title=zufang_item.xpath('./h1/a/text()').extract_first().strip(),
                price=zufang_item.xpath('./div[@class="the_price"]/ol/strong/span/text()').extract_first().strip(),
                area=zufang_item.xpath('./p/span/text()').extract_first().replace('㎡', '').strip(),
                district=zufang_item.xpath('./p//text()').re(r'昌平|朝阳|东城|大兴|丰台|海淀|石景山|顺义|通州|西城')[0],
            )
        # next page
        next_page_url = response.xpath('//div[@id="paging"]/a[@class="down_page"]/@href').extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))
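The district field above relies on a regex alternation over Beijing district names, applied via Selector.re(). The same pattern can be tried out with the standard-library re module (the sample listing text is made up):

```python
import re

# same alternation the spider passes to Selector.re()
DISTRICT_RE = re.compile(r'昌平|朝阳|东城|大兴|丰台|海淀|石景山|顺义|通州|西城')

sample = '30㎡ 朝阳区 整租'   # made-up listing text
match = DISTRICT_RE.search(sample)
print(match.group(0))  # 朝阳
```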

Conclusion

A quick first taste of Scrapy.
More to learn step by step.
