Python Web Scraping with Scrapy

  I. Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. Scrapy has a wide range of uses, including data mining, monitoring, and automated testing.

    What makes Scrapy attractive is that it is a framework, so anyone can adapt it to their own needs. It also provides base classes for several kinds of spiders, such as BaseSpider and sitemap spiders, and newer versions add support for crawling Web 2.0 sites.

    Scrapy is an application framework designed for crawling websites and extracting structured data, and it can be applied in a wide range of areas: it is commonly used in programs for data mining, information processing, and archiving historical data. With Scrapy it is usually quite simple to implement a crawler that fetches the content or images of a given site.

   II. Architecture diagram

    (Scrapy architecture diagram: the Engine sits in the middle, connected to the Scheduler, Downloader, Spiders, and Item Pipeline, with the Downloader and Spider middlewares in between.)

  Scrapy Engine: responsible for the communication, signals, and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.
  Scheduler: accepts the Requests sent over by the Engine, organizes and enqueues them in a certain order, and hands them back to the Engine when the Engine asks for them.
  Downloader: downloads all the Requests sent by the Scrapy Engine and returns the Responses it obtains to the Engine, which passes them on to the Spider for processing.
  Spider: processes all Responses, analyses and extracts data from them to fill the Item fields, and submits any URLs that need to be followed back to the Engine, which feeds them into the Scheduler again.
  Item Pipeline: the place where Items produced by the Spider are post-processed (detailed analysis, filtering, storage, and so on).
  Downloader Middlewares: components you can customize to extend the download functionality.
  Spider Middlewares: components you can customize to extend and intercept the communication between the Engine and the Spider.

  III. Introduction to the framework (getting started):

  Common commands:

  1) Create a new project

scrapy startproject <project_name>

  2) Run a spider

scrapy crawl <spider_name>

  3) Generate a spider file

scrapy genspider [-t template] <name> <domain>
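
  For example, assuming the project and spider names used later in this article, the demo could be set up and run with commands along these lines:

scrapy startproject scrapy_demo
cd scrapy_demo
scrapy genspider douyu www.douyu.com
scrapy crawl douyu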

  Directory structure:

scrapy_demo/
├── scrapy.cfg
└── scrapy_demo/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py

  The key files and what they are for:

  1) scrapy.cfg (mainly specifies the settings module, the project name, and deployment configuration)

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = scrapy_demo.settings

[deploy]
#url = http://localhost:6800/
project = scrapy_demo

   2)settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for scrapy_demo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'scrapy_demo'

SPIDER_MODULES = ['scrapy_demo.spiders']
NEWSPIDER_MODULE = 'scrapy_demo.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapy_demo (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'scrapy_demo.middlewares.ScrapyDemoSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'scrapy_demo.middlewares.ScrapyDemoDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'scrapy_demo.pipelines.ScrapyDemoPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

   Notes: a. BOT_NAME --> the project name

      b. SPIDER_MODULES, NEWSPIDER_MODULE --> the spider package, and the package where new spiders are generated

      c. ROBOTSTXT_OBEY --> whether to obey the site's robots.txt rules

      d. CONCURRENT_REQUESTS --> maximum number of concurrent requests

      e. DOWNLOAD_DELAY --> download delay in seconds

      f. CONCURRENT_REQUESTS_PER_DOMAIN, CONCURRENT_REQUESTS_PER_IP --> concurrent requests per domain / per IP

      g. COOKIES_ENABLED --> whether cookies are enabled

      h. TELNETCONSOLE_ENABLED --> whether the Telnet console is enabled

      i. DEFAULT_REQUEST_HEADERS --> default request headers

      j. SPIDER_MIDDLEWARES --> spider middlewares

      k. DOWNLOADER_MIDDLEWARES --> downloader middlewares

      l. EXTENSIONS --> extensions

      m. ITEM_PIPELINES --> item pipelines
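
  As a rough sketch, enabling a few of these settings for the demo project might look like the excerpt below (the values are illustrative, not Scrapy's defaults; the pipeline path is the one defined later in this article):

# settings.py (excerpt)
USER_AGENT = 'Mozilla/5.0 (compatible; scrapy_demo)'   # identify the crawler; value is illustrative
ROBOTSTXT_OBEY = True                                  # obey robots.txt
CONCURRENT_REQUESTS = 16                               # overall concurrency
DOWNLOAD_DELAY = 1                                     # one-second delay between requests to the same site
ITEM_PIPELINES = {
    'scrapy_demo.pipelines.DouYuPipline': 300,         # the pipeline defined later in this article
}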

  3) items.py (mainly used to define the item models)

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


# class ScrapyDemoItem(scrapy.Item):
#     # define the fields for your item here like:
#     # name = scrapy.Field()
#     pass


class DouYuItem(scrapy.Item):
    # room title
    title = scrapy.Field()
    # popularity (online viewer count)
    hot = scrapy.Field()
    # cover image URL
    img_url = scrapy.Field()
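
  Scrapy items behave much like dictionaries; a minimal usage sketch of the DouYuItem defined above (the values are made up, purely for illustration):

from scrapy_demo.items import DouYuItem

item = DouYuItem()
item["title"] = "some room title"                 # made-up values, just for illustration
item["hot"] = 12345
item["img_url"] = "https://example.com/cover.jpg"
print(dict(item))                                 # an Item converts cleanly to a plain dict
# item["unknown"] = 1                             # would raise KeyError: only declared Fields can be set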

  4) pipelines.py (defines the pipelines used for the subsequent data processing)

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


# class ScrapyDemoPipeline(object):
#     def process_item(self, item, spider):
#         return item
import urllib2  # only needed by the commented-out image-download code below


class DouYuPipline(object):

    def __init__(self):
        # open the output CSV file once when the pipeline is instantiated
        self.csv_file = open("douyu.csv", "w")

    def process_item(self, item, spider):
        # one CSV line per item: title, popularity, image URL
        text = item["title"] + "," + str(item["hot"]) + "," + item["img_url"] + "\n"
        # with open("img/" + item["title"] + "_" + str(item["hot"]) + ".jpg", "wb") as f:
        #     f.write(urllib2.urlopen(item["img_url"]).read())
        self.csv_file.write(text.encode("utf-8"))
        return item

    def close_spider(self, spider):
        # close the file when the spider finishes
        self.csv_file.close()
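
  Since a room title can itself contain commas, a slightly more robust variant (my sketch, not part of the original project; the class name DouYuCsvPipeline is hypothetical, and it targets the same Python 2 environment as the code above) could use the standard csv module instead of manual string concatenation:

import csv


class DouYuCsvPipeline(object):
    """Hypothetical variant of DouYuPipline that quotes fields properly."""

    def open_spider(self, spider):
        # "wb" because the Python 2 csv module expects a binary-mode file
        self.csv_file = open("douyu.csv", "wb")
        self.writer = csv.writer(self.csv_file)

    def process_item(self, item, spider):
        # csv.writer handles quoting, so commas inside the title are safe
        self.writer.writerow([
            item["title"].encode("utf-8"),
            item["hot"],
            item["img_url"],
        ])
        return item

    def close_spider(self, spider):
        self.csv_file.close()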

  5) spiders (the spiders directory; the core logic lives here)

  a. A spider based on scrapy.Spider (the base class)

# !/usr/bin/python
# -*- coding: UTF-8 -*-
import json

import scrapy
import time

from scrapy_demo.items import DouYuItem

class DouYuSpider(scrapy.Spider):
    name = "douyu"
    allowed_domains = ["www.douyu.com", "rpic.douyucdn.cn"]
    url = "https://www.douyu.com/gapi/rkc/directory/0_0/"
    page = 1
    start_urls = [url + str(page)]

    def parse(self, response):
        # the directory API returns JSON; "rl" is the room list
        data = json.loads(response.text)["data"]["rl"]
        for detail in data:
            douyu_item = DouYuItem()
            douyu_item["title"] = detail["rn"]
            douyu_item["hot"] = detail["ol"]
            douyu_item["img_url"] = detail["rs1"]
            # fetch the cover image, and hand the item over to the pipeline
            yield scrapy.Request(detail["rs1"], callback=self.img_data_handle)
            yield douyu_item
        # request the next page and parse it with this same callback
        self.page += 1
        yield scrapy.Request(self.url + str(self.page), callback=self.parse)

    def img_data_handle(self, response):
        # save the raw image bytes under the img/ directory
        with open("img/" + str(time.time()) + ".jpg", "wb") as f:
            f.write(response.body)

   Notes: a Spider must implement the parse method.

      name: the spider's name (required)

      allowed_domains: the domains the spider is allowed to crawl (optional)

      start_urls: the URLs the crawl starts from (required)

  b. A spider based on CrawlSpider (whose parent class is Spider)

# !/usr/bin/python
# -*- coding: UTF-8 -*-

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DouYuSpider(CrawlSpider):
    name = "douyuCrawl"
    # allowed_domains = ["www.douyu.com"]
    url = "https://www.douyu.com/directory/all"
    start_urls = [url]

    # extract every link whose URL matches the pattern "https"
    links = LinkExtractor(allow="https")

    rules = [
        # each extracted page is downloaded and passed to link_handle
        Rule(links, callback="link_handle")
    ]

    def link_handle(self, response):
        print(response.body)


   Note: rules defines the link-matching rules used to extract and follow links from the HTML.
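
  In practice a more selective rule is usually preferable; a hedged sketch follows (the spider name, the /directory/ pattern, and the callback are illustrative only, not part of the demo project):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DirectorySpider(CrawlSpider):
    """Hypothetical example of more selective rules (not part of the demo project)."""
    name = "douyuDirectory"
    start_urls = ["https://www.douyu.com/directory/all"]

    rules = [
        # follow only links whose URL contains /directory/ and keep crawling from those pages
        Rule(LinkExtractor(allow=r"/directory/"), callback="link_handle", follow=True),
    ]

    def link_handle(self, response):
        # just log the visited URL; real extraction logic would go here
        self.logger.info("visited %s", response.url)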

   IV. The sections above covered the main ways these files are developed; here is the overall flow

  1) First, the spider files under the spiders directory fetch the data; when there is item data to return, you can use either yield or return.

  2) The item data then enters the pipeline for further processing.

  3) If you use yield, the parse method works as a generator: it keeps reading data in a loop and only stops once there is nothing left to schedule and the spider exits.

  V. Remember that pipelines and middlewares must all be registered in the settings.py file; if one is not configured there, that pipeline or middleware is simply not used. You can also set a priority value for each: the smaller the number, the higher the priority.

ITEM_PIPELINES = {
   # 'scrapy_demo.pipelines.ScrapyDemoPipeline': 300,
   'scrapy_demo.pipelines.DouYuPipline': 300,
}
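
  Middlewares are registered the same way; for instance, enabling the spider and downloader middlewares that startproject generated (they appear commented out in the settings.py shown earlier) would look like this:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_demo.middlewares.ScrapyDemoDownloaderMiddleware': 543,
}
SPIDER_MIDDLEWARES = {
    'scrapy_demo.middlewares.ScrapyDemoSpiderMiddleware': 543,
}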

  VI. Starting the crawl

  Start it from the command line:

scrapy crawl <spider-name>

  However, this makes debugging awkward. We usually develop with PyCharm, so it is more convenient to start the crawl from a script:

  start.py

# !/usr/bin/python
# -*- coding: UTF-8 -*-

from scrapy import cmdline

cmdline.execute(["scrapy", "crawl", "douyuCrawl"])
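
  Alternatively (my sketch, not from the original project), Scrapy's CrawlerProcess can run the spider in the same process, which also works well under a debugger:

# start_process.py - a hypothetical alternative to start.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # load the project's settings.py
process.crawl("douyuCrawl")                       # same spider name as "scrapy crawl douyuCrawl"
process.start()                                   # blocks until the crawl finishes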

  VII. Summary: this differs from the Selenium-plus-browser-plugin approach used earlier. With Scrapy, AJAX content still has to be handled manually: you work out how the data is loaded, then call the underlying API to fetch and parse it. The Selenium-plus-browser approach instead simulates a browser, so JavaScript and other AJAX requests are loaded for you. In terms of efficiency, Scrapy is more efficient and more powerful; but if you only care about the rendered page, Selenium plus a browser is a good choice.

  VIII. Example source code: https://github.com/lilin409546297/scrapy_demo
