Python Crawlers: The Scrapy Framework

2020-06-29 05:08

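Installing Scrapy on CentOS 7

This assumes pyenv is already installed, so that multiple Python versions can coexist. First install the system packages that Scrapy's dependencies need to build.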

yum install gcc libffi-devel openssl-devel libxml2 libxslt-devel libxml2-devel python-devel -y

Install lxml first, then Scrapy.

pip install lxml
pip install scrapy

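Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs, including data mining, information processing and archiving historical data.

The site scraped in this walkthrough is http://quotes.toscrape.com, the example site used by Scrapy's official tutorial. It exists specifically for practicing scraping.

The site mostly displays famous quotes, which gives plenty of material to scrape.

1. Create a new project.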

scrapy startproject quote
cd quote
scrapy genspider quotes quotes.toscrape.com

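The spiders directory holds the spider file (quotes.py) that genspider created.

scrapy.cfg points to the settings module and holds the deployment configuration.

items.py defines the data structures (items) to be saved.

middlewares.py defines the middlewares used during crawling.

pipelines.py defines the item pipelines.

settings.py holds the project configuration.

The command below starts the crawl. With nothing modified it only prints debug log output. If you replace the pass in quotes.py with print(response.text), it prints the page source.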

scrapy crawl quotes

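Replace the pass in parse to make the spider actually do something.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # response.text is the decoded HTML of the downloaded page.
        print(response.text)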

This prints the source of the scraped page.

You can also run

scrapy shell quotes.toscrape.com

to enter an interactive shell.

In [1]: response
Out[1]: <200 http://quotes.toscrape.com>

In [2]: quotes = response.css('.quote')
In [4]: quotes[0]
Out[4]: <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>

Inspect the text of the first quote:

In [6]: quotes[0].css('.text::text').extract()
Out[6]: ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']

Extract all of its tags:

In [13]: quotes[0].css('.tags .tag::text').extract()
Out[13]: ['change', 'deep-thoughts', 'thinking', 'world']

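Note: ::text selects the text content of the matched element, and ::attr(name) selects the value of an attribute. Both are Scrapy extensions to CSS selectors rather than standard CSS. .extract_first() returns the first result (or None if nothing matched), while .extract() returns all results as a list.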

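2. Start scraping.

First edit the QuoteItem class in items.py. The generated class already contains a commented example. Declare the fields you want to capture.

Here I scrape text, author and tags.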

class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

Then edit quotes.py, using the selectors explored in the shell above to extract the fields.

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

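This yields the text, author and tags of every quote on the page.

With the debug output left aside, the crawl prints the items from the first page.

Next comes pagination.

The page URLs follow a regular pattern, so one option is to request those URLs directly.

For example, the second page is http://quotes.toscrape.com/page/2/

It is also worth looking at how the Next button is implemented: it is a link inside the pager element, and its href points to the next page.

So the URL can be extracted with a selector:

next_page = response.css('.pager .next a::attr(href)').extract_first()

The extracted href is relative, so it has to be joined with the current page URL before it can be requested.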
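Scrapy's response.urljoin builds on the standard library's urljoin, so the joining step can be sketched without Scrapy. This is only an illustration; next_page_url is a hypothetical helper, and a href of None stands for a page without a Next link.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Return the absolute URL of the next page, or None if there is no link."""
    if href is None:
        return None
    return urljoin(current_url, href)


# A root-relative href replaces the path of the current URL.
second = next_page_url('http://quotes.toscrape.com/', '/page/2/')
third = next_page_url('http://quotes.toscrape.com/page/2/', '/page/3/')

# A page without a Next link yields no URL, so the crawl can stop.
no_next = next_page_url('http://quotes.toscrape.com/page/2/', None)

print(second, third, no_next)
```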

The complete code is as follows.

# -*- coding: utf-8 -*-
import scrapy
from quote.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Each quote on the page lives in an element with class "quote".
        for quote in response.css('.quote'):
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item

        # Follow the Next link. The last page has none, so stop there.
        next_page = response.css('.pager .next a::attr(href)').extract_first()
        if next_page is not None:
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)

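How do you save the scraped data? The -o option is enough.

scrapy crawl quotes -o quotes.json  # store the items as JSON

Other output names such as quotes.csv, quotes.jl (JSON Lines) and quotes.xml work too.

You can even pass an FTP URL of the form -o ftp://<user>:<password>@<host>/path/quotes.csv to save the file to a remote FTP server.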
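The .json output is a single JSON array, while .jl writes one JSON object per line, which makes it easy to process item by item. A minimal sketch of reading JSON Lines with the standard library; the two records are made-up samples, and load_json_lines is a hypothetical helper rather than part of Scrapy.

```python
import io
import json


def load_json_lines(stream):
    """Parse a JSON Lines stream: one JSON object per non-empty line."""
    return [json.loads(line) for line in stream if line.strip()]


# Stand-in for open('quotes.jl', encoding='utf-8'), with made-up records.
sample_jl = io.StringIO(
    '{"text": "A sample quote.", "author": "Some Author", "tags": ["one"]}\n'
    '{"text": "Another quote.", "author": "Other Author", "tags": []}\n'
)

items = load_json_lines(sample_jl)
authors = [item['author'] for item in items]
print(authors)
```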

This approach does not work if the data should go into a database.

That is where pipelines come in.

First define the following in settings.py.

MONGO_URI = 'localhost'
MONGO_DB = 'quotes'

ITEM_PIPELINES = {
    'quote.pipelines.QuotePipeline': 300,
    'quote.pipelines.MongoPipeline': 400,
}
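The integers in ITEM_PIPELINES decide the order in which items pass through the pipelines: lower values run first, so here QuotePipeline processes each item before MongoPipeline stores it. The sketch below only illustrates that ordering with a plain sort. It is not how Scrapy loads pipelines internally.

```python
ITEM_PIPELINES = {
    'quote.pipelines.MongoPipeline': 400,
    'quote.pipelines.QuotePipeline': 300,
}

# Sort by the priority value: the lowest number runs first.
run_order = [
    path
    for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
print(run_order)
```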

Then in pipelines.py (the MongoDB pipeline needs the pymongo package):

# -*- coding: utf-8 -*-
import pymongo
from scrapy.exceptions import DropItem
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class QuotePipeline(object):
    """Truncate long quote texts and drop items that have no text."""

    def __init__(self):
        self.limit = 150

    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            # DropItem must be raised, not returned, to discard the item.
            raise DropItem('Missing Text')


class MongoPipeline(object):
    """Store every item in a MongoDB collection named after the item class."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings defined in settings.py.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        # Called once when the spider starts: open the connection.
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.__class__.__name__
        # insert_one replaces Collection.insert, which PyMongo 4 removed.
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: release the connection.
        self.client.close()
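The truncation rule in QuotePipeline can be checked on its own, without Scrapy or MongoDB. A minimal sketch: truncate is a hypothetical standalone helper that mirrors the slicing logic above, and the long text is made up.

```python
def truncate(text, limit=150):
    """Cut text to at most `limit` characters and mark the cut with '...'."""
    if len(text) > limit:
        # rstrip() avoids a dangling space before the ellipsis.
        return text[:limit].rstrip() + '...'
    return text


short = truncate('short quote')

# 40 repetitions of 'word ' give 200 characters, well over the limit.
long_text = 'word ' * 40
shortened = truncate(long_text)

print(short)
print(len(long_text), len(shortened))
```

Texts at or below the limit pass through unchanged. Longer ones end up at most limit + 3 characters long because of the appended dots.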
