Python Web Scraping -- The Scrapy Framework

Original

2020-06-29 05:08

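Installing Scrapy on CentOS 7

This assumes pyenv is already installed, which lets several Python versions coexist. With that in place, install the required system packages.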

yum install gcc libffi-devel openssl-devel libxml2 libxslt-devel libxml2-devel python-devel -y

Install lxml first, then Scrapy.

pip install lxml
pip install scrapy

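Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.

The site crawled in this demo is Scrapy's official example site, built specifically for scraping practice: http://quotes.toscrape.com

The site mainly displays quotes from famous people, which gives plenty of material to scrape.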

1. Create a new project.

scrapy startproject quote
cd quote
scrapy genspider quotes quotes.toscrape.com

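The spiders directory holds the spider file created by genspider (here, quotes.py).

scrapy.cfg specifies the path to the settings module and the deployment configuration.

items.py defines the data structures used to store scraped items.

middlewares.py defines the middlewares used during crawling.

pipelines.py defines the item pipelines.

settings.py holds the project configuration.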

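The command below starts the crawl. With no changes to the generated spider, it only prints the debug log. If you replace the pass in quotes.py with print(response.text), it prints the HTML source of the page.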

scrapy crawl quotes

Replace the pass in parse with the code you want to run:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        print(response.text)

This prints the HTML source of the crawled page.

You can also run

scrapy shell quotes.toscrape.com

to open an interactive shell with the page already fetched:

In [1]: response
Out[1]: <200 http://quotes.toscrape.com>

In [2]: quotes = response.css('.quote')
In [4]: quotes[0]
Out[4]: <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>

Look at the text of the first quote:

In [6]: quotes[0].css('.text::text').extract()
Out[6]: ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']

Extract all of its tags:

In [13]: quotes[0].css('.tags .tag::text').extract()
Out[13]: ['change', 'deep-thoughts', 'thinking', 'world']

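Note: ::text selects the text content inside the matched element, and ::attr(name) selects the value of an attribute. Both are Scrapy-specific extensions, not standard CSS. .extract_first() returns the first result (or None if nothing matched), while .extract() returns all results as a list.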

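2. Start scraping.

First edit the QuoteItem class in items.py. The generated class already contains a commented example; declare the fields you want to collect.

Here I scrape text, author, and tags.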

import scrapy


class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

Next, edit quotes.py and use the selectors tried out in the shell session above to extract the fields.

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

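This yields the text, author, and tags of every quote on the page.

Apart from the debug log, the crawl now outputs the items scraped from the first page.

Next comes pagination.

First, notice that the page URLs follow a regular pattern, so one option is simply to request those URLs directly.

For example, the second page is http://quotes.toscrape.com/page/2/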
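As a rough sketch of the direct-URL approach, the page URLs can be generated up front from the pattern. The helper name page_urls and the page count passed to it are hypothetical, and the sketch assumes page 1 follows the same /page/<n>/ pattern as page 2.

```python
# Sketch: build the page URLs directly from the pattern /page/<n>/.
# BASE_URL comes from the article; page_urls and the page count are
# hypothetical names/values chosen for illustration.
BASE_URL = 'http://quotes.toscrape.com'


def page_urls(last_page):
    """Return the URLs for pages 1..last_page (inclusive)."""
    return ['{}/page/{}/'.format(BASE_URL, n) for n in range(1, last_page + 1)]


# The resulting list could be assigned to a spider's start_urls.
start_urls = page_urls(3)
print(start_urls)
```

The drawback is that the number of pages has to be known in advance, which is one reason to follow the Next link instead.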

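It is also worth looking at how the Next button is implemented: it is a link inside the pager element whose href points to the next page.

That means the URL can be extracted with a selector:

next_page = response.css('.pager .next a::attr(href)').extract_first()

This returns the href of the Next link; the complete code passes it through response.urljoin to build an absolute URL before requesting it.

The complete code is as follows.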

# -*- coding: utf-8 -*-
import scrapy
from quote.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

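        # Follow the Next link; the last page has none, so stop there.
        next_page = response.css('.pager .next a::attr(href)').extract_first()
        if next_page is not None:
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)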
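To see what the URL joining step does, here is a minimal standalone sketch using the standard library's urljoin, which behaves much like response.urljoin for simple cases. The function name next_page_url and the example href values are assumptions for illustration; they are not taken from the site's markup.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Resolve the Next link's href against the current page URL.

    Returns None when there is no Next link (href is None), which is
    what extract_first() gives when the selector matches nothing.
    """
    if href is None:
        return None
    return urljoin(current_url, href)


# Hypothetical href values, resolved against the current page URL.
print(next_page_url('http://quotes.toscrape.com/', '/page/2/'))
# http://quotes.toscrape.com/page/2/
print(next_page_url('http://quotes.toscrape.com/page/2/', '/page/3/'))
# http://quotes.toscrape.com/page/3/
```

Checking for None before building the request gives the crawl an explicit stopping point on the last page.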

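How do you save what was scraped? Use the -o option:

scrapy crawl quotes -o quotes.json

This stores the items in JSON format.

Other formats work the same way: quotes.csv, quotes.jl (JSON Lines), or quotes.xml.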

You can even pass an FTP URL, such as -o ftp://[email protected]/path/quotes.csv, to save the output to a remote FTP server.
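Once the items are exported, they can be read back with the standard library. The sketch below reads the JSON Lines (.jl) format, where each line is one JSON object. The sample records are made-up stand-ins for a real quotes.jl file, and read_json_lines is a hypothetical helper name.

```python
import io
import json


def read_json_lines(stream):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up sample standing in for the contents of a quotes.jl file.
sample = io.StringIO(
    '{"text": "A quote.", "author": "Someone", "tags": ["a", "b"]}\n'
    '{"text": "Another quote.", "author": "Someone Else", "tags": []}\n'
)

items = list(read_json_lines(sample))
authors = [item['author'] for item in items]
print(authors)
```

For a real export, replace the StringIO object with open('quotes.jl', encoding='utf-8').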

If you want to save the data to a database, however, this approach is not enough.

That is where item pipelines come in.

First, define the following in settings.py.

MONGO_URI = 'localhost'
MONGO_DB = 'quotes'

ITEM_PIPELINES = {
    'quote.pipelines.QuotePipeline': 300,
    'quote.pipelines.MongoPipeline': 400,
}
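The numbers in ITEM_PIPELINES set the order in which items pass through the pipelines: lower values run first. As a small illustration (not something Scrapy requires you to write), sorting the dictionary by value shows the order that results from the settings above.

```python
# The same mapping as in settings.py above.
ITEM_PIPELINES = {
    'quote.pipelines.QuotePipeline': 300,
    'quote.pipelines.MongoPipeline': 400,
}

# Lower numbers run first, so sort the class paths by their value.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

Here each item is cleaned up by QuotePipeline before MongoPipeline stores it.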

Then, in pipelines.py:

# -*- coding: utf-8 -*-
import pymongo
from scrapy.exceptions import DropItem
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


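class QuotePipeline(object):
    """Truncate overly long quote text and drop items that have no text."""

    def __init__(self):
        # Maximum number of characters of text to keep
        self.limit = 150

    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                # Cut at the limit, trim trailing whitespace, mark the cut
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            # DropItem must be raised, not returned, to discard the item
            raise DropItem('Missing Text')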

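class MongoPipeline(object):
    """Store each item in a MongoDB collection named after its item class."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings defined in settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        # Called once when the spider starts: open the connection
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # Use the item class name (e.g. QuoteItem) as the collection name
        name = item.__class__.__name__
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: close the connection
        self.client.close()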
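The truncation rule used by QuotePipeline can be tried in isolation, without running a crawl. The sketch below restates it as a plain function; truncate_text is a hypothetical name, and the limit of 150 mirrors the value in the pipeline above.

```python
def truncate_text(text, limit=150):
    """Shorten text to at most `limit` characters plus a trailing '...'.

    Text at or under the limit is returned unchanged. Longer text is
    cut at the limit and stripped of trailing whitespace before the
    ellipsis is appended.
    """
    if len(text) > limit:
        return text[:limit].rstrip() + '...'
    return text


short = truncate_text('A short quote.')
long_ = truncate_text('x' * 200)
print(len(short), len(long_))
```

Keeping the rule in a small function like this also makes it easy to test separately from the pipeline class.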
