Introduction to Scrapy

General crawler framework workflow

Scrapy framework workflow

Scrapy components

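Scrapy consists of the following main components:

Engine (Scrapy Engine)
Controls the data flow between all other components and triggers events as the crawl proceeds. It is the core of the framework.
Scheduler
Receives requests from the engine, pushes them onto a queue, and hands them back when the engine asks for the next one. Think of it as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL is fetched next and filters out duplicate URLs.
Downloader
Downloads page content and returns it to the spiders. The Scrapy downloader is built on Twisted, an efficient asynchronous networking framework.
Spiders
Spiders do most of the work: they extract the information you need from specific pages, the so-called items (Item). They can also extract links so that Scrapy goes on to crawl the next page.
Item Pipeline
Processes the items that spiders extract from pages. Its main jobs are persisting items, validating them, and removing unwanted information. After a spider parses a page, the resulting items are sent to the item pipeline and pass through its stages in a defined order.
Downloader Middlewares
A layer between the Scrapy engine and the downloader that processes the requests and responses passed between them.
Spider Middlewares
A layer between the Scrapy engine and the spiders that processes the spiders' response input and request output.
Scheduler Middlewares
A layer between the Scrapy engine and the scheduler that processes the requests and responses sent from the engine to the scheduler.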
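
The scheduler's role as "a priority queue that also removes duplicates" can be sketched in a few lines of plain Python. This is only a toy model for intuition, not Scrapy's scheduler: the class name `ToyScheduler`, its methods, and the example URLs are all made up for illustration, and the choice that a higher `priority` number is served first is an assumption of this sketch.

```python
import heapq
from itertools import count


class ToyScheduler:
    """Toy stand-in for a scheduler: a priority queue that drops duplicate URLs."""

    def __init__(self):
        self._heap = []        # entries are (-priority, insertion order, url)
        self._seen = set()     # every URL that has ever been enqueued
        self._order = count()  # tie-breaker so equal priorities stay first-in, first-out

    def enqueue(self, url, priority=0):
        """Queue a URL; return False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq pops the smallest entry, so negate the priority to serve high values first.
        heapq.heappush(self._heap, (-priority, next(self._order), url))
        return True

    def next_url(self):
        """Return the queued URL with the highest priority, or None when empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


scheduler = ToyScheduler()
scheduler.enqueue("https://example.com/a")
scheduler.enqueue("https://example.com/b", priority=10)
duplicate_accepted = scheduler.enqueue("https://example.com/a")  # already seen
order = [scheduler.next_url(), scheduler.next_url(), scheduler.next_url()]
print(duplicate_accepted, order)
```

The duplicate is rejected, the higher-priority URL comes out first, and an empty queue yields `None`.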

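How Scrapy runs

The engine takes a URL from the scheduler for the next fetch.
The engine wraps the URL in a request (Request) and passes it to the downloader.
The downloader fetches the resource and wraps it in a response (Response).
The spider parses the Response.
If the spider extracts an item (Item), the item is handed to the item pipeline for further processing.
If the spider extracts a link (URL), the URL is handed to the scheduler to wait for crawling.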
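
The steps above can be traced with a small, sequential simulation. It is a sketch for intuition only: real Scrapy runs asynchronously on Twisted, and everything here (the `FAKE_SITE` data, `download`, `parse`, `crawl`) is hypothetical and uses plain dicts and strings in place of Request, Response, and Item objects.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (page title, links found on the page).
FAKE_SITE = {
    "/page1": ("First page", ["/page2", "/page3"]),
    "/page2": ("Second page", ["/page3"]),
    "/page3": ("Third page", []),
}


def download(url):
    """Stand-in downloader: wrap the fetched data in a response-like dict."""
    title, links = FAKE_SITE[url]
    return {"url": url, "title": title, "links": links}


def parse(response):
    """Stand-in spider: yield one item (a dict), then the links (strings)."""
    yield {"url": response["url"], "title": response["title"]}
    yield from response["links"]


def crawl(start_url):
    """Stand-in engine: move data between scheduler, downloader, spider, and pipeline."""
    queue = deque([start_url])  # scheduler queue
    seen = {start_url}          # duplicate filter
    stored = []                 # what the "pipeline" has persisted
    while queue:
        url = queue.popleft()             # take the next URL from the scheduler
        response = download(url)          # request goes out, response comes back
        for result in parse(response):    # the spider parses the response
            if isinstance(result, dict):  # an item goes to the pipeline
                stored.append(result)
            elif result not in seen:      # a new URL goes back to the scheduler
                seen.add(result)
                queue.append(result)
    return stored


items = crawl("/page1")
titles = [item["title"] for item in items]
print(titles)
```

Each page is fetched once even though `/page3` is linked twice, because the duplicate filter keeps it out of the queue the second time.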

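Installing Scrapy

Installing on Linux (including macOS)

pip install scrapy

Installing on Windows

1. Download Twisted
	http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
	
2. Install wheel
	pip3 install wheel

3. Install Twisted
	Change to the download directory and run pip3 install Twisted-18.7.0-cp36-cp36m-win_amd64.whl
	(use the wheel file that matches your Python version and architecture)

4. Install pywin32
	pip3 install pywin32

5. Install Scrapy
	pip3 install scrapy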
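
One way to confirm that the installation worked is to ask Python which version of the package it can see. The helper below is a small sketch using only the standard library (`importlib.metadata`, available from Python 3.8); the function name `installed_version` is made up for this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if Scrapy is installed in this environment, otherwise None.
print(installed_version("scrapy"))
```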

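Basic commands

1. scrapy startproject <project name>
	Creates a project directory under the current directory.
	
2. scrapy genspider [-t template] <name> <domain>
	Creates a spider.
		For example:
			scrapy genspider -t basic oldboy oldboy.com
			scrapy genspider -t xmlfeed autohome autohome.com.cn
		Or, more simply:
			scrapy genspider <spider name> <domain to crawl>
	PS:
		List the available templates: scrapy genspider -l
		Show the contents of a template: scrapy genspider -d <template name>
		
3. scrapy list
	Lists the spiders in the project.
	
4. scrapy crawl <spider name>
	Runs a single spider.
Note:
	scrapy crawl <spider name> runs the spider with log output. Append --nolog to suppress the log:
	scrapy crawl <spider name> --nolog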

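Project files

scrapy.cfg: the project's main configuration file. (The settings that actually control the crawler are in settings.py.)
items.py: defines the data storage templates used to structure scraped data, similar to a Django Model.
pipelines.py: data processing behavior, typically persisting the structured data.
settings.py: configuration such as crawl depth, concurrency, and download delay.
spiders: the spider directory, where you create files and write the crawling rules.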

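Project example

Project introduction

To make full use of the large volume of data available online and to give users convenient access to film information, this project uses crawler technology based on the Scrapy framework to build a search engine for movie information. It crawls film information from Douban so that users can get accurate, up-to-date movie information.

Project code

The crawl target is Douban Movies. The data to collect covers the site's "Top250" ranking and comedy and action films: movie title, rating, director, release date, and the one-line review.

Creating the project

scrapy startproject DouBan

Creating the spider

cd DouBan/
scrapy genspider douban 'douban.com'

Automatically generated directories and files

Writing the spider file (douban.py)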

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
from DouBan.items import DoubanItem
import copy


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    # start_urls = ['http://douban.com/']
    start_urls = ['https://movie.douban.com/top250']
    url = 'https://movie.douban.com/top250'

    def parse(self, response):
        # with open('douban.html', 'w') as f:
        #     f.write(response.text)
        movies = response.xpath("//ol[@class='grid_view']/li")
        for movie in movies:
            # Create a fresh item per movie so entries never share state.
            item = DoubanItem()
            item['title'] = movie.xpath(".//span[@class='title']/text()").extract()[0]
            item['rating_num'] = movie.xpath(".//span[@class='rating_num']/text()").extract()[0]
            # <span class="inq">希望讓人自由。</span>
            # A movie may have no one-line review; fall back to an empty string.
            inq = movie.xpath(".//span[@class='inq']/text()").extract()
            item['inq'] = inq[0] if inq else ''

            # 'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p1454261925.jpg'
            item['image_url'] = movie.xpath('.//div[@class="pic"]/a/img/@src').extract()[0]

            item['detail_url'] = movie.xpath('.//div[@class="hd"]//a/@href').extract()[0]
            # Follow the detail page; the partially filled item travels in meta.
            yield Request(item['detail_url'], meta={'item': item}, callback=self.detailParse)

        # Pagination (disabled here, so only the first list page is crawled).
        # The "next page" element looks like this:
        #    <span class="next">
        #    <link rel="next" href="?start=50&amp;filter=">
        #    <a href="?start=50&amp;filter=">后页&gt;</a>
        #    </span>
        # nextLink = response.xpath('.//span[@class="next"]/link/@href').extract()  # returns a list
        # if nextLink:
        #     nextLink = nextLink[0]
        #     print('Next Link: ', nextLink)
        #     yield Request(self.url + nextLink, callback=self.parse)

    def detailParse(self, response):
        item = response.meta['item']
        # The runtime may be absent from a detail page; store None instead of raising IndexError.
        runtime = response.xpath(".//span[@property='v:runtime']/text()").extract()
        item['movieLength'] = runtime[0] if runtime else None
        yield item
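
The extraction logic in the spider can be tried out without touching the network. The sketch below is an approximation under stated assumptions: it uses the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath) instead of Scrapy selectors, and `SAMPLE` is hypothetical, heavily simplified markup modelled on the list page. It also shows how a relative "next page" link such as `?start=25&filter=` resolves against the list URL with `urllib.parse.urljoin`.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, simplified markup; the real page contains far more structure.
SAMPLE = """
<ol class="grid_view">
  <li>
    <span class="title">Movie A</span>
    <span class="rating_num">9.7</span>
    <span class="inq">A short review.</span>
  </li>
  <li>
    <span class="title">Movie B</span>
    <span class="rating_num">9.0</span>
  </li>
</ol>
""".strip()


def text_of(node, path, default=""):
    """Return the text of the first match for `path`, or `default` if nothing matches."""
    found = node.find(path)
    return found.text if found is not None else default


def extract_movies(markup):
    """Collect title, rating, and one-line review for every <li> in the list."""
    root = ET.fromstring(markup)
    movies = []
    for li in root.findall("./li"):
        movies.append({
            "title": text_of(li, ".//span[@class='title']"),
            "rating_num": text_of(li, ".//span[@class='rating_num']"),
            "inq": text_of(li, ".//span[@class='inq']"),  # empty when the review is missing
        })
    return movies


movies = extract_movies(SAMPLE)
next_page = urljoin("https://movie.douban.com/top250", "?start=25&filter=")
print(movies)
print(next_page)
```

The second movie has no review element, so its `inq` falls back to an empty string rather than raising an error.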

Editing the items file

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    inq = scrapy.Field()
    rating_num = scrapy.Field()
    title = scrapy.Field()
    image_url = scrapy.Field()
    detail_url = scrapy.Field()
    image_path = scrapy.Field()
    movieLength = scrapy.Field()

Editing the pipelines file

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json

import pymysql
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class DoubanPipeline(object):
    def process_item(self, item, spider):
        return item


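class AddScoreNum(object):
    """Adds 1 to the scraped rating (a demonstration transformation)."""

    def process_item(self, item, spider):
        if item.get('rating_num'):
            rating_num = float(item['rating_num'])
            item['rating_num'] = str(rating_num + 1)
            return item
        else:
            # DropItem discards this item without stopping the crawl.
            raise DropItem('rating_num was not scraped')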


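class JsonWriterPipeline(object):
    def open_spider(self, spider):
        # Explicit UTF-8 so non-ASCII text does not depend on the platform default encoding.
        self.file = open('douban.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write one JSON object per line so the file can be parsed line by line.
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()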


class MysqlPipeline(object):
    def open_spider(self, spider):
        self.connect = pymysql.connect(
            host='127.0.0.1',
            port=3306,
            database='scrapyProject',
            user='root',
            password='westos',
            charset='utf8',
            use_unicode=True
        )

        self.cursor = self.connect.cursor()
        create_sql = ("create table if not exists douBanTop250("
                      "title varchar(50) unique, rating_num float, inq varchar(100));")
        self.cursor.execute(create_sql)

    def process_item(self, item, spider):
        # Placeholders let the driver escape the values, so a quote in a title
        # or review cannot break the statement or inject SQL.
        insert_sql = ("insert into douBanTop250(title, rating_num, inq) "
                      "values(%s, %s, %s);")
        try:
            self.cursor.execute(insert_sql, (item['title'], item['rating_num'], item['inq']))
        except Exception as e:
            self.connect.rollback()
            spider.logger.error('Insert failed for %r: %s', item['title'], e)
        else:
            self.connect.commit()

        return item

    def close_spider(self, spider):
        self.connect.commit()
        self.cursor.close()
        self.connect.close()


class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['image_url'])

    def item_completed(self, results, item, info):
        """
        :param results:
            [(True,  {'url': 'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p1454261925.jpg',
                'path': 'full/e9cc62a6d6a0165314b832b1f31a74ca2487547a.jpg',
                'checksum': '5d77f59d4d634b795780b2138c1bf572'})]
        :param item:
        :param info:
        :return:
        """
        # for result in results:
        #     print("result: ", result)
        
        # isok = True/False
        image_paths = [x['path'] for isok, x in results if isok]
        # print("image_paths: ", image_paths[0])
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_path'] = image_paths[0]
        return item
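
MysqlPipeline runs one INSERT per item, and scraped titles or reviews can contain quotes. The sketch below illustrates the same step with placeholder parameters so that such values are stored safely. It is an illustration under assumptions, not the project's code: it uses the standard library's `sqlite3` with an in-memory database as a stand-in for MySQL so that it runs without a server, the table name `movies` and the helper `save_movie` are made up, and sqlite3 uses `?` placeholders where pymysql uses `%s`.

```python
import sqlite3


def save_movie(conn, item):
    """Insert one item using placeholders; return True on success, False on a duplicate title."""
    try:
        with conn:  # commits on success, rolls back if the statement fails
            conn.execute(
                "insert into movies(title, rating_num, inq) values (?, ?, ?)",
                (item["title"], item["rating_num"], item["inq"]),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique constraint on title rejected a duplicate.
        return False


conn = sqlite3.connect(":memory:")
conn.execute("create table movies(title text unique, rating_num real, inq text)")

movie = {"title": "It's a Wonderful Life", "rating_num": 9.3, "inq": "A classic."}
first = save_movie(conn, movie)
again = save_movie(conn, movie)  # same title, rejected by the unique constraint
rows = conn.execute("select title, rating_num from movies").fetchall()
print(first, again, rows)
```

The apostrophe in the title is stored intact, and the second insert is rejected by the unique constraint instead of aborting the run.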

Editing the settings file

# -*- coding: utf-8 -*-

# Scrapy settings for DouBan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'DouBan'

SPIDER_MODULES = ['DouBan.spiders']
NEWSPIDER_MODULE = 'DouBan.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'DouBan (+http://www.yourdomain.com)'
from fake_useragent import UserAgent
ua = UserAgent()
USER_AGENT = ua.random

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'DouBan.middlewares.DoubanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'DouBan.middlewares.DoubanDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   # MyImagesPipeline subclasses ImagesPipeline, so the stock image and file
   # pipelines are not listed separately. Lower numbers run first.
   'DouBan.pipelines.MyImagesPipeline': 2,
   'DouBan.pipelines.AddScoreNum': 100,
   'DouBan.pipelines.JsonWriterPipeline': 200,
   'DouBan.pipelines.MysqlPipeline': 300,
}

IMAGES_STORE = './images'
# FILES_STORE = './files'
IMAGES_EXPIRES = 30
# FILES_EXPIRES = 90
IMAGES_THUMBS = {
   'small': (50, 50),
   'big': (270, 270)
}
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
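
The numbers in ITEM_PIPELINES set the order in which an item passes through the pipelines: lower values run first. The sketch below imitates that behavior in plain Python to make the ordering and the effect of dropping an item concrete. It is a simplified model, not how Scrapy loads pipelines: the stages are ordinary functions, `DropItem` is a local stand-in for `scrapy.exceptions.DropItem`, and `PIPELINES`, `SAVED`, and `run_pipelines` are hypothetical names.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


SAVED = []  # plays the role of the JSON file or database


def add_score(item):
    """Add 1 to the rating, or drop the item if the rating is missing."""
    if not item.get("rating_num"):
        raise DropItem("missing rating_num")
    return {**item, "rating_num": str(float(item["rating_num"]) + 1)}


def save(item):
    """Persist the item as it looks when it reaches this stage."""
    SAVED.append(item)
    return item


# Hypothetical registry: stage -> order number (declared out of order on purpose).
PIPELINES = {save: 200, add_score: 100}


def run_pipelines(item, pipelines):
    """Pass the item through every stage, lowest order number first."""
    for stage, _order in sorted(pipelines.items(), key=lambda pair: pair[1]):
        item = stage(item)
    return item


run_pipelines({"title": "Movie A", "rating_num": "9.0"}, PIPELINES)

dropped = False
try:
    run_pipelines({"title": "Movie B", "rating_num": ""}, PIPELINES)
except DropItem:
    dropped = True  # later stages never see a dropped item

print(SAVED, dropped)
```

Because `add_score` has the lower number, the saved copy already carries the adjusted rating, and the item without a rating never reaches `save`.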

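Project results

Parses and crawls the pages.
Scrapes each movie's one-line review, rating, title, and runtime.
Adds 1 to each movie's rating.
Saves the item data to a JSON file.
Downloads and saves images: movie thumbnails and full-size posters.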
