Python Web Crawler (19): CrawlSpider

Original post

止步听风

2020-07-04 17:05

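In the basic Scrapy usage covered earlier, a spider that needs to issue further requests has to parse the page itself and then send each request by hand. A CrawlSpider, by contrast, sends those requests automatically for URLs that match the conditions you configure.

CrawlSpider is a subclass of Spider. It extends Spider's functionality and is more convenient to use.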

CrawlSpider

Creating a CrawlSpider

As with ordinary spiders, you can write the spider by hand after creating the Scrapy project, but Scrapy also provides a terminal command for creating a CrawlSpider:

scrapy genspider -t crawl spidername domainname

Running the command above in a terminal creates a CrawlSpider from Scrapy's crawl template.

LinkExtractors

What sets CrawlSpider apart from Spider is that you do not have to extract the URLs for the next requests by hand; that job is done by LinkExtractors. The LinkExtractor class is defined as:

class LxmlLinkExtractor(FilteringLinkExtractor):

    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=False,
                 unique=True, process_value=None, deny_extensions=None, restrict_css=(),
                 strip=True, restrict_text=None):
        tags, attrs = set(arg_to_iter(tags)), set(arg_to_iter(attrs))
        lx = LxmlParserLinkExtractor(
            tag=lambda x: x in tags,
            attr=lambda x: x in attrs,
            unique=unique,
            process=process_value,
            strip=strip,
            canonicalized=canonicalize
        )

        super(LxmlLinkExtractor, self).__init__(lx, allow=allow, deny=deny,
                                                allow_domains=allow_domains, deny_domains=deny_domains,
                                                restrict_xpaths=restrict_xpaths, restrict_css=restrict_css,
                                                canonicalize=canonicalize, deny_extensions=deny_extensions,
                                                restrict_text=restrict_text)

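Its main parameters are:

allow: allowed URLs. Every URL that matches this regular expression (or list of regular expressions) is extracted
deny: denied URLs. Every URL that matches this regular expression is excluded, even if it also matches allow
allow_domains: allowed domains. Only URLs in the domains listed here are extracted
deny_domains: denied domains. URLs in any of the domains listed here are never extracted
restrict_xpaths: XPath expressions that limit extraction to the matching regions of the page. Used together with allow to filter links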
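To make the way these filters combine more concrete, here is a minimal standard-library sketch. It is a simplified imitation for illustration, not Scrapy's implementation: the names host_in_domains and link_passes and the sample URLs are hypothetical, and it assumes each pattern is applied with re.search and that a domain also covers its subdomains.

```python
import re
from urllib.parse import urlparse


def host_in_domains(url, domains):
    """Return True if the URL's host equals, or is a subdomain of, any listed domain."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d.lower() or host.endswith("." + d.lower()) for d in domains)


def link_passes(url, allow=(), deny=(), allow_domains=(), deny_domains=()):
    """Simplified stand-in for the filters described above (not Scrapy's code)."""
    # An empty allow list means "allow everything".
    if allow and not any(re.search(p, url) for p in allow):
        return False
    # Any deny match rejects the URL, even if allow matched.
    if any(re.search(p, url) for p in deny):
        return False
    if allow_domains and not host_in_domains(url, allow_domains):
        return False
    if deny_domains and host_in_domains(url, deny_domains):
        return False
    return True


# Hypothetical URLs, for illustration only.
urls = [
    "http://example.com/list?page=2",
    "http://example.com/login?page=2",
    "http://ads.example.net/list?page=3",
]
kept = [
    u for u in urls
    if link_passes(u, allow=[r"page=\d+"], deny=[r"/login"], allow_domains=["example.com"])
]
print(kept)
```

Only the first URL survives: the second is rejected by deny, and the third is outside the allowed domain.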

Rule

A LinkExtractor only takes effect once it is passed to a Rule object. The Rule class is:

class Rule:

    def __init__(self, link_extractor=None, callback=None, cb_kwargs=None, follow=None,
                 process_links=None, process_request=None, errback=None):
        self.link_extractor = link_extractor or _default_link_extractor
        self.callback = callback
        self.errback = errback
        self.cb_kwargs = cb_kwargs or {}
        self.process_links = process_links or _identity
        self.process_request = process_request or _identity_process_request
        self.process_request_argcount = None
        self.follow = follow if follow is not None else not callback

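The commonly used parameters are:

link_extractor: a LinkExtractor object that defines which links this rule crawls
callback: the callback to run for each URL that matches the rule, similar to the callback of scrapy.Request() described earlier. CrawlSpider uses parse internally to implement its own logic, so do not override parse or use it as your callback
follow: whether links extracted from the response should themselves be followed
process_links: a function that receives the links produced by link_extractor and can filter out the ones that should not be crawled

Apart from the differences above, CrawlSpider and Spider work in basically the same way.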
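Two details of Rule are easy to miss: the default for follow depends on whether a callback is given (see the last line of __init__ above), and process_links receives the list of extracted links and returns the ones to keep. The sketch below mirrors the follow default and shows a hypothetical process_links-style filter. The names resolve_follow and drop_login_links and the sample URLs are illustrative, and SimpleNamespace stands in for Scrapy's link objects, of which only a url attribute is assumed.

```python
from types import SimpleNamespace


def resolve_follow(follow, callback):
    """Mirror the expression in Rule.__init__: follow defaults to `not callback`."""
    return follow if follow is not None else not callback


def drop_login_links(links):
    """Hypothetical process_links-style hook: keep links whose URL lacks 'login'."""
    return [link for link in links if "login" not in link.url]


print(resolve_follow(None, None))          # no callback: links are followed
print(resolve_follow(None, "parse_item"))  # callback given: not followed by default
print(resolve_follow(True, "parse_item"))  # an explicit value always wins

# Stand-ins for extracted links; the URLs are made up.
links = [
    SimpleNamespace(url="http://example.com/article-1.html"),
    SimpleNamespace(url="http://example.com/login.php"),
]
kept_urls = [link.url for link in drop_login_links(links)]
print(kept_urls)
```

In a real spider such a hook would be passed to the rule as process_links, either as a callable or as the name of a spider method.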

settings.py

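The following settings are still needed:

ROBOTSTXT_OBEY: set to False here (the settings.py generated for a new project sets it to True). True means the crawler obeys the robots exclusion protocol: Scrapy fetches the site's robots.txt first and skips any request that the file disallows
DEFAULT_REQUEST_HEADERS: the default request headers. You can add a User-Agent here so that requests look like they come from a browser rather than a crawler
DOWNLOAD_DELAY: the delay between downloads, to avoid requesting too quickly
ITEM_PIPELINES: enables the pipelines defined in pipelines.py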
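As a rough sketch, the four settings might look like this in settings.py. The concrete values (header strings, the one-second delay, the priority 300) are illustrative assumptions, and the pipeline path assumes the project package is named step, as the imports in the spider suggest.

```python
# settings.py (fragment): values are illustrative assumptions

# Do not consult robots.txt before crawling.
ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
    # Placeholder browser User-Agent; substitute your own.
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/83.0 Safari/537.36",
}

# Wait about one second between consecutive requests to the same site.
DOWNLOAD_DELAY = 1

# Enable the pipeline; the number sets its order when several pipelines are enabled.
ITEM_PIPELINES = {
    "step.pipelines.StepPipeline": 300,
}
```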

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class StepItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    pub_time = scrapy.Field()
    content = scrapy.Field()

spider

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from step.items import StepItem


class WechatSpider(CrawlSpider):
    name = 'wechat'
    allowed_domains = ['www.wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    rules = (
        # No callback: URLs matching this rule need no callback; their pages are only followed for more links
        Rule(LinkExtractor(allow=r'.+page=\d'), follow=True),
        # Has a callback: responses for URLs matching this rule are passed to parse_item
        Rule(LinkExtractor(allow=r'.+article-.+\.html'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # Extract the article fields from the detail page
        title = response.xpath("//h1[@class='ph']/text()").get()
        author = response.xpath("//p[@class='authors']/a/text()").get()
        pub_time = response.xpath("//p[@class='authors']/span/text()").get()
        # Join all text nodes of the article body into a single string
        content = response.xpath("//div[@class='content_middle cl']//text()").getall()
        content = ''.join(content).strip()
        item = StepItem(title=title, author=author, pub_time=pub_time, content=content)
        yield item
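Before running the spider, it can help to check the two allow patterns against a few URLs with the standard re module. This is only a quick sanity check, assuming the patterns are applied as a regex search over the full URL. The first sample is the spider's start URL; the other two URLs and the classify helper are hypothetical.

```python
import re

# The two allow patterns used in the rules above.
LIST_PAGE = r'.+page=\d'
ARTICLE = r'.+article-.+\.html'


def classify(url):
    """Return the labels of the rules whose pattern matches the URL."""
    labels = []
    if re.search(LIST_PAGE, url):
        labels.append('list page')
    if re.search(ARTICLE, url):
        labels.append('article')
    return labels


samples = [
    'http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1',  # start URL
    'http://www.wxapp-union.com/article-1234-1.html',                 # hypothetical
    'http://www.wxapp-union.com/forum.php',                           # hypothetical
]

for url in samples:
    print(url, classify(url))
```

A URL that matches neither pattern, like the third sample, is ignored by both rules.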

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exporters import JsonLinesItemExporter

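class StepPipeline:
    def __init__(self):
        # Open the output file in binary mode; the exporter writes encoded bytes
        self.fp = open('wechat.json', 'wb')
        self.export = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')

    def open_spider(self, spider):
        print('spider begin.')

    def process_item(self, item, spider):
        # Write the item as one JSON line, then pass it on to any later pipeline
        self.export.export_item(item)
        return item

    def close_spider(self, spider):
        # Close the output file when the spider finishes
        self.fp.close()
        print('spider over.')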
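JsonLinesItemExporter writes one JSON object per line, so the resulting file can be read back line by line with the standard json module. The sketch below assumes that layout and uses two made-up records written to a temporary file in place of real crawl output.

```python
import json
import os
import tempfile

# Two made-up records in the one-object-per-line layout; not real crawl output.
sample = (
    '{"title": "First post", "author": "alice", "pub_time": "2020-07-01", "content": "..."}\n'
    '{"title": "Second post", "author": "bob", "pub_time": "2020-07-02", "content": "..."}\n'
)

# Write the sample to a temporary file so the example does not depend on a crawl.
path = os.path.join(tempfile.mkdtemp(), "wechat.json")
with open(path, "w", encoding="utf-8") as fp:
    fp.write(sample)

# Each non-empty line is a complete JSON document.
with open(path, encoding="utf-8") as fp:
    items = [json.loads(line) for line in fp if line.strip()]

print(len(items))
print([item["title"] for item in items])
```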

With CrawlSpider, the main thing to pay attention to is how the spider itself is written; everything else is much the same as before.
