Basic Usage of the Scrapy Crawler Framework

1. Installation issues

  • Asked about installation problems on a forum (screenshot omitted).
  • Noticed that the official docs actually recommend running Scrapy inside a virtual environment (screenshot omitted).
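  • A minimal sketch of that setup on Windows (the path and the venv name mirror the prompts used below, but are illustrative):
E:\Graduation-design\spider\scrapy>python -m venv venv
E:\Graduation-design\spider\scrapy>venv\Scripts\activate
(venv) E:\Graduation-design\spider\scrapy>pip install scrapy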

2. Example from the official site

  • Two attributes and one method are required; the defaults below come from the scrapy.Spider base class:
	name = None
	start_urls = []

	def parse(self, response):
	    raise NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__))
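  • Put together, a minimal sketch of a spider that defines only those three members (the name and URL are placeholders):
import scrapy


class MinimalSpider(scrapy.Spider):
    name = 'minimal'                       # unique name used by `scrapy crawl`
    start_urls = ['http://example.com/']   # seed URLs requested at startup

    def parse(self, response):
        # Default callback invoked with each downloaded response
        yield {'url': response.url}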
  • Code:
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    # List comprehension to seed the first ten pages at once:
    # start_urls = [f'http://quotes.toscrape.com/page/{page}/' for page in range(1, 11)]

    # Parse callback: response carries the downloaded page source
    def parse(self, response):
        # Select every quote block with a CSS selector
        for selector in response.css('div.quote'):
            # Extract the quote text
            text = selector.css('span.text::text').get()
            # Extract the author
            author = selector.xpath('span/small/text()').get()
            # Data must be yielded to be saved; for now it has to be
            # a dict (later this becomes an Item)
            items = {
                "quote": text,
                "author": author
            }
            yield items

        # Pagination:
        # 1. Find the URL of the next page
        # 2. Issue a request; the response comes back to parse()
        next_page = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page:
            # Method 1: build the absolute URL by hand
            # url = 'http://quotes.toscrape.com' + next_page
            # yield scrapy.Request(url)
            # Method 2: let response.follow resolve the relative href
            # yield response.follow(next_page)

            yield response.follow(next_page, callback=self.parse)
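  • Note the difference between the two methods: scrapy.Request needs an absolute URL, which is why Method 1 prepends the domain by hand, while response.follow accepts the relative href exactly as extracted and resolves it against the current page's URL.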

  • Run it from the terminal:
(venv) E:\Graduation-design\spider\scrapy>scrapy runspider website_test.py
  • Save the output as JSON:
(venv) E:\Graduation-design\spider\scrapy>scrapy runspider website_test.py -o quotes.json

3. Installing and using Scrapy in a virtual environment

  • Create a project:
(venv) E:\Graduation-design\spider\scrapy>scrapy startproject quotetutorial
  • Create a spider: scrapy genspider [-t template] <name> <domain>
(venv) E:\Graduation-design\spider\scrapy>scrapy genspider quotes quotes.toscrape.com
  • Run the spider:
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes
  • Shell testing from the terminal: scrapy shell <url>
>>> quotes = response.css('.quote')
>>> type(quotes)
<class 'scrapy.selector.unified.SelectorList'>
>>> quotes[0]
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>
>>> quotes[0].css(.text)
  File "<console>", line 1
    quotes[0].css(.text)
                  ^
SyntaxError: invalid syntax
>>> quotes[0].css('.text')
[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]" data='<span class="text" itemprop="text">“T...'>]
>>> quotes[0].css('.text::text')
[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]/text()" data='“The world as we have created it is a...'>]
>>> quotes[0].css('.text::text').get()
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> quotes[0].css('.text::text').getall()
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']
>>> quotes[0].css('.tags tag::text').getall()
[]
>>> quotes[0].css('.tags tag::text').get()
>>> quotes[0].css('.tags .tag::text').getall()
['change', 'deep-thoughts', 'thinking', 'world']
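  • The same fields can be extracted with XPath; a sketch of the equivalent queries in the same shell session:
>>> quotes[0].xpath('.//span[@class="text"]/text()').get()
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> quotes[0].xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
['change', 'deep-thoughts', 'thinking', 'world']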

  • Code:
# -*- coding: utf-8 -*-
import scrapy
from ..items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            text = quote.css('.text::text').get()
            author = quote.css('.author::text').get()
            tags = quote.css('.tags .tag::text').getall()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

        # Follow the next-page link; guard against the last page,
        # where the selector returns None
        next_page = response.css('.pager .next a::attr(href)').get()
        if next_page:
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)
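  • The QuoteItem imported above lives in the project's items.py, which the post does not show; a minimal definition matching the three fields the spider fills in:
# items.py
import scrapy


class QuoteItem(scrapy.Item):
    text = scrapy.Field()    # quote text
    author = scrapy.Field()  # author name
    tags = scrapy.Field()    # list of tag strings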
  • Saving the data:
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes -o quotes.json
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes -o quotes.jl
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes -o quotes.csv
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes -o quotes.xml
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes -o quotes.pickle
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes -o quotes.marshal
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes -o ftp://...
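  • Scrapy picks the exporter from the output file's extension (JSON, JSON lines, CSV, XML, pickle, marshal), and a feed URI such as ftp:// or s3:// sends the file to remote storage instead of the local disk.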
  • Enable the pipelines in settings.py:
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'quotetutorial.pipelines.TextPipeline': 300,  # priority: lower numbers run closer to the spider
   'quotetutorial.pipelines.MongoPipeline': 400,
}
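  • The two classes referenced above live in pipelines.py, which the post does not show. A sketch of what they typically contain; the 50-character limit, the MONGO_URI/MONGO_DB setting names, and the quotes collection name are assumptions:
# pipelines.py
import pymongo
from scrapy.exceptions import DropItem


class TextPipeline:
    """Truncate overly long quote text (runs first: priority 300)."""
    limit = 50  # assumed maximum length

    def process_item(self, item, spider):
        if not item.get('text'):
            raise DropItem('Missing text in %s' % item)
        if len(item['text']) > self.limit:
            item['text'] = item['text'][:self.limit].rstrip() + '...'
        return item


class MongoPipeline:
    """Write each item to MongoDB (runs second: priority 400)."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Pull connection settings from settings.py (assumed keys)
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        self.db['quotes'].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()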

4. Scrapy command-line usage
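
A quick reference of the commands used in this post (run scrapy -h for the full list):
scrapy startproject <name>                      # create a new project
scrapy genspider [-t template] <name> <domain>  # generate a spider skeleton
scrapy crawl <spider>                           # run a spider inside a project
scrapy runspider <file.py>                      # run a standalone spider file
scrapy shell <url>                              # interactive selector testing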
