Basic Usage of the Scrapy Crawler Framework

1. Installation issues

  • Asked about installation problems on a forum (screenshot omitted).
  • Noticed that the official docs actually recommend running Scrapy inside a virtual environment (screenshot omitted).
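  • A minimal sketch of that setup on Windows (the path and the venv name mirror the prompts used below, but are illustrative):
E:\Graduation-design\spider\scrapy>python -m venv venv
E:\Graduation-design\spider\scrapy>venv\Scripts\activate
(venv) E:\Graduation-design\spider\scrapy>pip install scrapy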

2. Example from the official site

  • Two attributes and one method are required; the defaults below come from the scrapy.Spider base class:
	name = None
	start_urls = []

	def parse(self, response):
	    raise NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__))
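  • Put together, a minimal sketch of a spider that defines only those three members (the name and URL are placeholders):
import scrapy


class MinimalSpider(scrapy.Spider):
    name = 'minimal'                       # unique name used by `scrapy crawl`
    start_urls = ['http://example.com/']   # seed URLs requested at startup

    def parse(self, response):
        # Default callback invoked with each downloaded response
        yield {'url': response.url}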
  • Code:
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    # List comprehension to seed the first ten pages at once:
    # start_urls = [f'http://quotes.toscrape.com/page/{page}/' for page in range(1, 11)]

    # Parse callback: response carries the downloaded page source
    def parse(self, response):
        # Select every quote block with a CSS selector
        for selector in response.css('div.quote'):
            # Extract the quote text
            text = selector.css('span.text::text').get()
            # Extract the author
            author = selector.xpath('span/small/text()').get()
            # Data must be yielded to be saved; for now it has to be
            # a dict (later this becomes an Item)
            items = {
                "quote": text,
                "author": author
            }
            yield items

        # Pagination:
        # 1. Find the URL of the next page
        # 2. Issue a request; the response comes back to parse()
        next_page = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page:
            # Method 1: build the absolute URL by hand
            # url = 'http://quotes.toscrape.com' + next_page
            # yield scrapy.Request(url)
            # Method 2: let response.follow resolve the relative href
            # yield response.follow(next_page)

            yield response.follow(next_page, callback=self.parse)
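  • Note the difference between the two methods: scrapy.Request needs an absolute URL, which is why Method 1 prepends the domain by hand, while response.follow accepts the relative href exactly as extracted and resolves it against the current page's URL.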

  • Run it from the terminal:
(venv) E:\Graduation-design\spider\scrapy>scrapy runspider website_test.py
  • Save the output as JSON:
(venv) E:\Graduation-design\spider\scrapy>scrapy runspider website_test.py -o quotes.json

3. Installing and using Scrapy in a virtual environment

  • Create a project:
(venv) E:\Graduation-design\spider\scrapy>scrapy startproject quotetutorial
  • Create a spider: scrapy genspider [-t template] <name> <domain>
(venv) E:\Graduation-design\spider\scrapy>scrapy genspider quotes quotes.toscrape.com
  • Run the spider:
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes
  • Shell testing from the terminal: scrapy shell <url>
>>> quotes = response.css('.quote')
>>> type(quotes)
<class 'scrapy.selector.unified.SelectorList'>
>>> quotes[0]
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>
>>> quotes[0].css(.text)
  File "<console>", line 1
    quotes[0].css(.text)
                  ^
SyntaxError: invalid syntax
>>> quotes[0].css('.text')
[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]" data='<span class="text" itemprop="text">“T...'>]
>>> quotes[0].css('.text::text')
[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]/text()" data='“The world as we have created it is a...'>]
>>> quotes[0].css('.text::text').get()
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> quotes[0].css('.text::text').getall()
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']
>>> quotes[0].css('.tags tag::text').getall()
[]
>>> quotes[0].css('.tags tag::text').get()
>>> quotes[0].css('.tags .tag::text').getall()
['change', 'deep-thoughts', 'thinking', 'world']
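  • The same fields can be extracted with XPath; a sketch of the equivalent queries in the same shell session:
>>> quotes[0].xpath('.//span[@class="text"]/text()').get()
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> quotes[0].xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
['change', 'deep-thoughts', 'thinking', 'world']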

  • Code:
# -*- coding: utf-8 -*-
import scrapy
from ..items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            text = quote.css('.text::text').get()
            author = quote.css('.author::text').get()
            tags = quote.css('.tags .tag::text').getall()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

        # Follow the next-page link; guard against the last page,
        # where the selector returns None
        next_page = response.css('.pager .next a::attr(href)').get()
        if next_page:
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)
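  • The QuoteItem imported above lives in the project's items.py, which the post does not show; a minimal definition matching the three fields the spider fills in:
# items.py
import scrapy


class QuoteItem(scrapy.Item):
    text = scrapy.Field()    # quote text
    author = scrapy.Field()  # author name
    tags = scrapy.Field()    # list of tag strings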
  • Saving the data:
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes -o quotes.json
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes -o quotes.jl
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes -o quotes.csv
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes -o quotes.xml
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes -o quotes.pickle
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes -o quotes.marshal
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes -o ftp://...
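  • Scrapy picks the exporter from the output file's extension (JSON, JSON lines, CSV, XML, pickle, marshal), and a feed URI such as ftp:// or s3:// sends the file to remote storage instead of the local disk.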
  • Enable the pipelines in settings.py:
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'quotetutorial.pipelines.TextPipeline': 300,  # priority: lower numbers run closer to the spider
   'quotetutorial.pipelines.MongoPipeline': 400,
}
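  • The two classes referenced above live in pipelines.py, which the post does not show. A sketch of what they typically contain; the 50-character limit, the MONGO_URI/MONGO_DB setting names, and the quotes collection name are assumptions:
# pipelines.py
import pymongo
from scrapy.exceptions import DropItem


class TextPipeline:
    """Truncate overly long quote text (runs first: priority 300)."""
    limit = 50  # assumed maximum length

    def process_item(self, item, spider):
        if not item.get('text'):
            raise DropItem('Missing text in %s' % item)
        if len(item['text']) > self.limit:
            item['text'] = item['text'][:self.limit].rstrip() + '...'
        return item


class MongoPipeline:
    """Write each item to MongoDB (runs second: priority 400)."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Pull connection settings from settings.py (assumed keys)
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        self.db['quotes'].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()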

4. Scrapy command-line usage
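
A quick reference of the commands used in this post (run scrapy -h for the full list):
scrapy startproject <name>                      # create a new project
scrapy genspider [-t template] <name> <domain>  # generate a spider skeleton
scrapy crawl <spider>                           # run a spider inside a project
scrapy runspider <file.py>                      # run a standalone spider file
scrapy shell <url>                              # interactive selector testing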
