1. Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing.
What makes Scrapy attractive is that it is a framework: anyone can adapt it to their own needs. It also ships with base classes for several kinds of spiders, such as BaseSpider (now scrapy.Spider) and the sitemap spider, and newer versions add support for crawling web 2.0 sites.
Scrapy is an application framework designed for crawling websites and extracting structured data, and it is applied across many domains: data mining, information processing, archiving historical data, and so on. With Scrapy it is usually very easy to implement a spider that grabs the content or images of a given site.
2. Architecture diagram
(The standard Scrapy architecture: the Engine coordinates the Scheduler, the Downloader with its downloader middlewares, the Spiders with their spider middlewares, and the Item Pipelines.)
3. Framework introduction (getting started):
Common commands (a concrete example follows this list):
1) Create a new project
scrapy startproject <project_name>
2) Run a spider
scrapy crawl <spider_name>
3) Generate a spider file
scrapy genspider [-t template] <name> <domain>
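For example, the demo project used throughout this article could be created and run roughly like this (the project and spider names match the ones used later; console output is omitted):

scrapy startproject scrapy_demo
cd scrapy_demo
scrapy genspider douyu www.douyu.com
scrapy crawl douyu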
Directory structure:
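For reference, the layout generated by scrapy startproject scrapy_demo looks roughly like this (file names match the project used below):

scrapy_demo/
    scrapy.cfg            # deploy / configuration file
    scrapy_demo/          # the project's Python module
        __init__.py
        items.py          # item (model) definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider code lives here
            __init__.py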
The purpose of the key files is explained below:
1) scrapy.cfg (mainly specifies the settings module, the project name, and deploy options)
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = scrapy_demo.settings

[deploy]
#url = http://localhost:6800/
project = scrapy_demo
2) settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for scrapy_demo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'scrapy_demo'

SPIDER_MODULES = ['scrapy_demo.spiders']
NEWSPIDER_MODULE = 'scrapy_demo.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapy_demo (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'scrapy_demo.middlewares.ScrapyDemoSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'scrapy_demo.middlewares.ScrapyDemoDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'scrapy_demo.pipelines.ScrapyDemoPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Notes (an example of overriding a few of these follows the list):
a. BOT_NAME --> project name
b. SPIDER_MODULES, NEWSPIDER_MODULE --> spider package, and the package where newly generated spiders go
c. ROBOTSTXT_OBEY --> whether to obey the site's robots.txt rules
d. CONCURRENT_REQUESTS --> maximum number of concurrent requests
e. DOWNLOAD_DELAY --> download delay (in seconds)
f. CONCURRENT_REQUESTS_PER_DOMAIN, CONCURRENT_REQUESTS_PER_IP --> concurrency limits per domain and per IP
g. COOKIES_ENABLED --> whether cookies are enabled
h. TELNETCONSOLE_ENABLED --> whether the telnet console is enabled
i. DEFAULT_REQUEST_HEADERS --> default request headers
j. SPIDER_MIDDLEWARES --> spider middlewares
k. DOWNLOADER_MIDDLEWARES --> downloader middlewares
l. EXTENSIONS --> extensions
m. ITEM_PIPELINES --> item pipelines
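As an illustration, a project that wants to crawl more gently and identify itself might uncomment and adjust a few of these settings in settings.py (the values below are only an example, not part of the original project):

USER_AGENT = 'scrapy_demo (+http://www.yourdomain.com)'
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 8
COOKIES_ENABLED = False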
3) items.py (mainly used to define the data models)
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


# class ScrapyDemoItem(scrapy.Item):
#     # define the fields for your item here like:
#     # name = scrapy.Field()
#     pass


class DouYuItem(scrapy.Item):
    # room title
    title = scrapy.Field()
    # popularity (number of viewers)
    hot = scrapy.Field()
    # cover image url
    img_url = scrapy.Field()
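An Item behaves much like a dict: only the declared fields can be assigned. A quick sketch of how DouYuItem might be filled in (the values are made up for illustration):

item = DouYuItem()
item["title"] = "some room"                       # assign declared fields like dict keys
item["hot"] = 12345
item["img_url"] = "https://example.com/cover.jpg"
data = dict(item)                                 # convert to a plain dict when needed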
4) pipelines.py (defines the item pipelines, used for subsequent processing of the scraped data)
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

# class ScrapyDemoPipeline(object):
#     def process_item(self, item, spider):
#         return item

import urllib2  # Python 2; only needed for the commented-out image download below


class DouYuPipline(object):

    def __init__(self):
        # one CSV file shared by the whole crawl
        self.csv_file = open("douyu.csv", "w")

    def process_item(self, item, spider):
        text = item["title"] + "," + str(item["hot"]) + "," + item["img_url"] + "\n"
        # with open("img/" + item["title"] + "_" + str(item["hot"]) + ".jpg", "wb") as f:
        #     f.write(urllib2.urlopen(item["img_url"]).read())
        self.csv_file.write(text.encode("utf-8"))  # Python 2: encode unicode before writing
        return item

    def close_spider(self, spider):
        self.csv_file.close()
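Note the code above is Python 2 (urllib2, writing encoded bytes to a text file). Under Python 3 and current Scrapy the same idea could be sketched roughly as follows; the class name and use of the csv module are illustrative, not part of the original project:

import csv

class DouYuCsvPipeline(object):

    def open_spider(self, spider):
        # newline="" avoids blank lines in the CSV on Windows
        self.csv_file = open("douyu.csv", "w", encoding="utf-8", newline="")
        self.writer = csv.writer(self.csv_file)

    def process_item(self, item, spider):
        self.writer.writerow([item["title"], item["hot"], item["img_url"]])
        return item

    def close_spider(self, spider):
        self.csv_file.close()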
5) spiders (the spider directory; the core crawling code lives here)
a. A spider built on scrapy.Spider (the base class)
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import json
import scrapy
import time

from scrapy_demo.items import DouYuItem


class DouYuSpider(scrapy.Spider):
    name = "douyu"
    allowed_domains = ["www.douyu.com", "rpic.douyucdn.cn"]
    url = "https://www.douyu.com/gapi/rkc/directory/0_0/"
    page = 1
    start_urls = [url + str(page)]

    def parse(self, response):
        # the directory endpoint returns JSON; each entry in data["rl"] is one live room
        data = json.loads(response.text)["data"]["rl"]
        for detail in data:
            douyu_item = DouYuItem()
            douyu_item["title"] = detail["rn"]
            douyu_item["hot"] = detail["ol"]
            douyu_item["img_url"] = detail["rs1"]
            # fetch the cover image with a separate request
            yield scrapy.Request(detail["rs1"], callback=self.img_data_handle)
            yield douyu_item
        # request the next page and parse it with the same callback
        self.page += 1
        yield scrapy.Request(self.url + str(self.page), callback=self.parse)

    def img_data_handle(self, response):
        with open("img/" + str(time.time()) + ".jpg", "wb") as f:
            f.write(response.body)
Notes: a Spider must implement the parse method; the key attributes are listed below, followed by a minimal example.
name: spider name (required)
allowed_domains: allowed domains (optional)
start_urls: the URLs to start crawling from (required)
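Stripped of the DouYu-specific logic, the smallest spider satisfying these requirements looks roughly like this (the name and URL are placeholders):

import scrapy


class MinimalSpider(scrapy.Spider):
    name = "minimal"                       # required: unique spider name
    allowed_domains = ["example.com"]      # optional: restricts off-site requests
    start_urls = ["https://example.com/"]  # required (or override start_requests)

    def parse(self, response):
        # yield items (plain dicts work too) and/or further Requests from here
        yield {"url": response.url, "title": response.css("title::text").extract_first()}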
b. A spider built on CrawlSpider (whose parent class is Spider)
#!/usr/bin/python
# -*- coding: UTF-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DouYuSpider(CrawlSpider):
    name = "douyuCrawl"
    # allowed_domains = ["www.douyu.com"]
    url = "https://www.douyu.com/directory/all"
    start_urls = [url]

    # extract every link whose URL matches "https" and hand it to link_handle
    links = LinkExtractor(allow="https")
    rules = [
        Rule(links, callback="link_handle")
    ]

    def link_handle(self, response):
        print(response.body)
Notes: rules defines the link-matching rules; each Rule uses a LinkExtractor to match links in the HTML and decides which callback handles the matched pages.
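A Rule can also follow the matched pages recursively. A hedged sketch of a more selective rule (the regular expression and callback name are made up for illustration):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class RoomSpider(CrawlSpider):
    name = "rooms"
    start_urls = ["https://www.douyu.com/directory/all"]

    rules = [
        # only follow links ending in digits (e.g. room pages), and keep crawling from them
        Rule(LinkExtractor(allow=r"/\d+$"), callback="parse_room", follow=True),
    ]

    def parse_room(self, response):
        yield {"url": response.url}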
4. The main ways of developing these files were covered above; here is a brief description of the overall flow:
1) The spider files under the spiders directory fetch the data first; when there is item data to hand back, it can be emitted with either yield or return.
2) The item data then enters the pipeline, where any follow-up processing happens.
3) If you use yield, parse works as a generator: it keeps producing requests and items in a loop, so you must make sure the crawl eventually stops (see the sketch below).
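For the DouYu spider above, one way to give the yield loop an exit is to stop paging once the API returns an empty list; a hypothetical tweak to the parse method shown earlier (the rest of the method is unchanged):

    def parse(self, response):
        data = json.loads(response.text)["data"]["rl"]
        # stop condition: an empty page means there is nothing left to crawl
        if not data:
            return
        # ... yield the items for this page exactly as before ...
        self.page += 1
        yield scrapy.Request(self.url + str(self.page), callback=self.parse)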
5. Remember that pipelines and middlewares must both be registered in settings.py; if one is not configured there, that pipeline or middleware simply will not be used. You can also set a priority: the smaller the number, the higher the priority.
ITEM_PIPELINES = {
    # 'scrapy_demo.pipelines.ScrapyDemoPipeline': 300,
    'scrapy_demo.pipelines.DouYuPipline': 300,
}
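Middlewares are enabled the same way; for example, a downloader middleware could be switched on like this (the class is the placeholder generated by startproject, and the priority value is only illustrative):

DOWNLOADER_MIDDLEWARES = {
    'scrapy_demo.middlewares.ScrapyDemoDownloaderMiddleware': 543,
}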
6. Starting the crawl
Start it from the command line:
scrapy crawl <spider-name>
This makes debugging awkward, though. Since development is usually done in PyCharm, it is handier to launch the crawl from a small script:
start.py
#!/usr/bin/python
# -*- coding: UTF-8 -*-
from scrapy import cmdline

cmdline.execute(["scrapy", "crawl", "douyuCrawl"])
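An alternative, if you prefer not to go through cmdline, is Scrapy's CrawlerProcess API; a rough equivalent of the script above (assuming it is run from the project directory so the project settings can be found):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl("douyuCrawl")   # spider name, as with `scrapy crawl`
process.start()               # blocks until the crawl finishes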
7. Summary: this differs from the Selenium + browser-plugin approach used earlier. With Scrapy, AJAX content has to be handled by hand: you work out where the data is loaded from, then call that interface yourself and parse the response. The Selenium + browser approach instead lets the browser execute the JavaScript and AJAX loading for you by simulating a real browser. In terms of efficiency, Scrapy is more powerful and noticeably faster; but if all you care about is the rendered page, Selenium + a browser is a reasonable choice.