Scrapy - Command-Line Tool

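Scrapy is controlled through the scrapy command-line tool. The tool provides several commands for different purposes, and each command accepts its own set of arguments and options.

Some Scrapy commands must be run from inside a Scrapy project directory, while others can be run from anywhere. Commands that can be run from anywhere may still behave slightly differently when run inside a project.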

Where Scrapy commands can run
Global commands     Project-only commands
startproject        crawl
genspider           check
settings            list
runspider           edit
shell               parse
fetch               bench
view
version

1. scrapy

Start by running the Scrapy tool with no command. It prints some usage help and the list of available commands:

(scrapyEnv) MacBook-Pro:~ $ scrapy
Scrapy 1.4.0 - no active project


Usage:
  scrapy <command> [options] [args]


Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy


  [ more ]      More commands available when run from project directory


Use "scrapy <command> -h" to see more info about a command
(scrapyEnv) MacBook-Pro:~ $

If the command is run inside a Scrapy project directory, the first line shows the active project; outside any project it shows "no active project".

For more information about a command, run: scrapy <command> -h

(scrapyEnv) MacBook-Pro:myproject$ scrapy startproject -h
Usage
=====
  scrapy startproject <project_name> [project_dir]


Create new project


Options
=======
--help, -h              show this help message and exit


Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure
(scrapyEnv) MacBook-Pro:myproject $

2. scrapy startproject

Syntax: scrapy startproject <project_name> [project_dir]
Requires project: no

Creates a new Scrapy project.

(scrapyEnv) MacBook-Pro:Project $ scrapy startproject myproject [project_dir]

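This creates a new Scrapy project under the project_dir directory. If project_dir is not specified, the directory defaults to the project name, here myproject.

Then change into the new project directory; from there the scrapy commands can be used to manage and control the project.

Two topics are worth covering here: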

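2.1 Configuration settings

Scrapy reads its configuration from ini-style scrapy.cfg files, which can appear in three places:

1. /etc/scrapy.cfg or c:\scrapy\scrapy.cfg (system-wide configuration)

2. ~/.config/scrapy.cfg ($XDG_CONFIG_HOME) and ~/.scrapy.cfg ($HOME) (user-wide configuration)

3. scrapy.cfg in the root directory of a Scrapy project (project-wide configuration)

Settings from these files are merged in the priority order 3 > 2 > 1: project-wide values override user-wide values, which in turn override system-wide values.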
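The precedence rule above can be sketched with the standard-library configparser module. This is only an illustration of "later source wins", not Scrapy's own loading code; the file contents and the [deploy] section values are made up for the example.

```python
import configparser

# Hypothetical contents of the three levels, lowest priority first.
SYSTEM_CFG = """
[settings]
default = systemwide.settings

[deploy]
url = http://system.example/
"""
USER_CFG = """
[deploy]
url = http://user.example/
"""
PROJECT_CFG = """
[settings]
default = myproject.settings
"""


def merge_configs(*sources):
    """Read the sources in order; a value read later overrides an earlier one."""
    parser = configparser.ConfigParser()
    for text in sources:
        parser.read_string(text)
    return parser


# Order is system (1), user (2), project (3), so the project level wins.
merged = merge_configs(SYSTEM_CFG, USER_CFG, PROJECT_CFG)
print(merged.get("settings", "default"))
print(merged.get("deploy", "url"))
```

Here the project file overrides the settings module, while the deploy URL comes from the user file because the project file does not define it.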

Scrapy can also be configured through a few environment variables:

SCRAPY_SETTINGS_MODULE
SCRAPY_PROJECT
SCRAPY_PYTHON_SHELL

2.2 Project structure

By default, every Scrapy project has the same basic structure:

.
|____myproject
| |____items.py
| |____middlewares.py
| |____pipelines.py
| |____settings.py
| |____spiders
| | |____spider1.py
| | |____spider2.py
|____scrapy.cfg

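The directory that contains scrapy.cfg is the project root directory. That file holds the name of the Python module that defines the project settings, for example:

[settings]
default = myproject.settings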
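As a rough sketch of how a tool could use this file, the snippet below walks up from a working directory until it finds scrapy.cfg and then reads the settings module name from it. The helper names find_project_root and settings_module are hypothetical, and the demo builds a throwaway project tree in a temporary directory; Scrapy's real lookup may differ in detail.

```python
import configparser
import tempfile
from pathlib import Path


def find_project_root(start):
    """Return the nearest directory at or above `start` that holds scrapy.cfg."""
    start = Path(start).resolve()
    for directory in (start, *start.parents):
        if (directory / "scrapy.cfg").is_file():
            return directory
    return None


def settings_module(project_root):
    """Read the [settings] default entry from the project's scrapy.cfg."""
    parser = configparser.ConfigParser()
    parser.read(project_root / "scrapy.cfg")
    return parser.get("settings", "default")


# Demo on a temporary tree that mimics the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "myproject"
    spiders = root / "myproject" / "spiders"
    spiders.mkdir(parents=True)
    (root / "scrapy.cfg").write_text("[settings]\ndefault = myproject.settings\n")

    found = find_project_root(spiders)   # start deep inside the project
    module = settings_module(found)
    print(found.name, module)
```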

3. scrapy genspider

Syntax: scrapy genspider [-t template] <name> <domain>
Requires project: no

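Creates a new spider in the current directory or, when run inside a project, in the current project's spiders directory. The <name> parameter sets the spider's name, and <domain> is used to generate the spider's allowed_domains and start_urls attributes.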

(scrapyEnv) MacBook-Pro:scrapy $ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
(scrapyEnv) MacBook-Pro:scrapy $ scrapy genspider example example.com
Created spider 'example' using template 'basic'
(scrapyEnv) MacBook-Pro:scrapy $ scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'
(scrapyEnv) MacBook-Pro:scrapy $

This command is a convenient shortcut for creating spiders; you can still write the spider source files by hand.
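To make the role of <name> and <domain> concrete, here is a small sketch that fills a spider template with string.Template. The template text is a simplified stand-in written for this example, not the file Scrapy ships, and the class-naming rule in render_spider is an assumption.

```python
from string import Template

# Simplified stand-in for a "basic" spider template (an assumption, not Scrapy's file).
SPIDER_TEMPLATE = Template('''\
import scrapy


class ${classname}(scrapy.Spider):
    name = "${name}"
    allowed_domains = ["${domain}"]
    start_urls = ["http://${domain}/"]

    def parse(self, response):
        pass
''')


def render_spider(name, domain):
    """Fill the template; the class name is derived from the spider name."""
    parts = name.replace("-", "_").split("_")
    classname = "".join(part.capitalize() for part in parts) + "Spider"
    return SPIDER_TEMPLATE.substitute(classname=classname, name=name, domain=domain)


source = render_spider("example", "example.com")
compile(source, "example.py", "exec")  # syntax check only; nothing is executed
print(source)
```

The generated text shows how a single domain argument ends up in both allowed_domains and start_urls.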

4. scrapy crawl

Syntax: scrapy crawl <spider>
Requires project: yes

Starts crawling using the given spider.

(scrapyEnv) MacBook-Pro:project $ scrapy crawl myspider

5. scrapy check

Syntax: scrapy check [-l] <spider>
Requires project: yes

Runs contract checks.

(scrapyEnv) MacBook-Pro:project $ scrapy check -l
(scrapyEnv) MacBook-Pro:project $ scrapy check


----------------------------------------------------------------------
Ran 0 contracts in 0.000s


OK
(scrapyEnv) MacBook-Pro:project $

6. scrapy list

Syntax: scrapy list
Requires project: yes

Lists all available spiders in the current project.

(scrapyEnv) MacBook-Pro:project $ scrapy list
toscrape-css
toscrape-xpath
(scrapyEnv) MacBook-Pro:project $

7. scrapy edit

Syntax: scrapy edit <spider>
Requires project: yes

Opens the given spider in the editor defined by the EDITOR environment variable or, if that is unset, by the EDITOR setting.

(scrapyEnv) MacBook-Pro:project $ scrapy edit toscrape-css
(scrapyEnv) MacBook-Pro:project $
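The editor lookup can be sketched as follows. The helper editor_command, the fallback value "vi" and the spider file path are assumptions made for illustration; the point is only that the EDITOR environment variable is consulted first and a configured default is used otherwise.

```python
import os
import shlex


def editor_command(spider_file, environ=None, default_editor="vi"):
    """Build the argv that would open `spider_file` in the chosen editor."""
    environ = os.environ if environ is None else environ
    # An unset or empty EDITOR falls back to the configured default.
    editor = environ.get("EDITOR") or default_editor
    return shlex.split(editor) + [spider_file]


with_env = editor_command("spiders/toscrape-css.py", environ={"EDITOR": "code --wait"})
without_env = editor_command("spiders/toscrape-css.py", environ={})
print(with_env)
print(without_env)
```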

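8. scrapy fetch

Syntax: scrapy fetch <url>
Requires project: no

Downloads the given URL using the Scrapy downloader and writes the contents to standard output.

Notably, the page is fetched the way the spider would download it: if the spider defines a USER_AGENT attribute that overrides the user agent, fetch uses it too. When run outside a Scrapy project, no spider-specific behavior is applied and the default Scrapy downloader settings are used.

This command supports 3 options:

--spider=SPIDER: bypass spider autodetection and force use of the given spider
--headers: print the response's HTTP headers instead of the response body
--no-redirect: do not follow HTTP 3xx redirects (the default is to follow them)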

(scrapyEnv) MacBook-Pro:project $ scrapy fetch --nolog http://www.example.com/some/page.html
[ ... html content here ...]
(scrapyEnv) MacBook-Pro:project $ scrapy fetch --nolog --headers http://www.example.com/
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: en
> User-Agent: Scrapy/1.4.0 (+http://scrapy.org)
> Accept-Encoding: gzip,deflate
>
< Cache-Control: max-age=604800
< Content-Type: text/html
< Date: Wed, 25 Oct 2017 13:55:57 GMT
< Etag: "359670651+gzip"
< Expires: Wed, 01 Nov 2017 13:55:57 GMT
< Last-Modified: Fri, 09 Aug 2013 23:54:35 GMT
< Server: ECS (oxr/839F)
< Vary: Accept-Encoding
< X-Cache: HIT
(scrapyEnv) MacBook-Pro:project $
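In the --headers output above, lines starting with ">" appear to be request headers and lines starting with "<" response headers. Assuming that convention, a small helper (split_headers is a hypothetical name) can turn a captured dump into two dictionaries:

```python
def split_headers(dump):
    """Split a `fetch --headers` style dump into (request, response) dicts."""
    request, response = {}, {}
    for line in dump.splitlines():
        marker, _, rest = line.partition(" ")
        # Skip the bare ">" separator and anything that is not a header line.
        if marker not in (">", "<") or ":" not in rest:
            continue
        name, _, value = rest.partition(":")
        target = request if marker == ">" else response
        target[name.strip()] = value.strip()
    return request, response


# A shortened copy of the output shown above.
SAMPLE = """\
> Accept-Language: en
> User-Agent: Scrapy/1.4.0 (+http://scrapy.org)
>
< Content-Type: text/html
< X-Cache: HIT
"""

request_headers, response_headers = split_headers(SAMPLE)
print(request_headers["User-Agent"])
print(response_headers)
```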

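9. scrapy view

Syntax: scrapy view <url>
Requires project: no

Opens the given URL in a browser, as your Scrapy spider would see it. Spiders sometimes see pages differently from regular users, so this command can be used to check what the spider sees and confirm it is what you expect.

Supported options:

--spider=SPIDER: force use of the given spider
--no-redirect: do not follow redirects (the default is to follow them)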

(scrapyEnv) MacBook-Pro:project $ scrapy view http://www.163.com

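10. scrapy shell

Syntax: scrapy shell [url]
Requires project: no

Starts the Scrapy shell for the given URL, or with no URL at all. UNIX-style local file paths are also supported, either relative with a ./ or ../ prefix or absolute.

Supported options:

--spider=SPIDER: force use of the given spider
-c code: evaluate the code in the shell, print the result and exit
--no-redirect: do not follow redirects (the default is to follow them). This only affects the URL passed as an argument on the command line; once inside the Scrapy shell, fetch(url) still follows redirects by default.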

(scrapyEnv) MacBook-Pro:project $ scrapy shell http://www.example.com/some/page.html
[ ... scrapy shell starts ...]
(scrapyEnv) MacBook-Pro:project$ scrapy shell --nolog http://www.example.com/ -c '(response.status, response.url)'
(200, 'http://www.example.com/')
(scrapyEnv) MacBook-Pro:project $ scrapy shell --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(200, 'http://example.com/')
(scrapyEnv) MacBook-Pro:project $ scrapy shell --no-redirect --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(302, 'http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F')
(scrapyEnv) MacBook-Pro:project $

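11. scrapy parse

Syntax: scrapy parse <url> [options]
Requires project: yes

Fetches the given URL and parses it with the spider that handles it, using the method passed with the --callback option, or the default parse method if none is given.

Supported options:

--spider=SPIDER: force use of the given spider
-a NAME=VALUE: set a spider argument (may be repeated)
--callback or -c: the spider method to use as the callback for parsing the response
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider rules to discover the callback method for parsing the response
--noitems: do not show scraped items
--nolinks: do not show extracted links
--nocolour: do not use pygments to colorize the output
--depth or -d: depth level to which requests are followed recursively (default: 1)
--verbose or -v: show debugging information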
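Since the options above combine freely, it can help to see a full command line assembled from them. The sketch below only builds the argument list and prints it; it does not run Scrapy. The helper build_parse_command and the spider name, callback and argument values are hypothetical.

```python
import shlex


def build_parse_command(url, spider=None, callback=None, depth=None,
                        spider_args=None, verbose=False):
    """Assemble an argv list for `scrapy parse` from the options listed above."""
    argv = ["scrapy", "parse"]
    if spider:
        argv.append("--spider=" + spider)
    for name, value in (spider_args or {}).items():
        argv += ["-a", f"{name}={value}"]   # -a may be repeated
    if callback:
        argv += ["-c", callback]
    if depth is not None:
        argv += ["-d", str(depth)]
    if verbose:
        argv.append("-v")
    argv.append(url)
    return argv


argv = build_parse_command(
    "http://www.example.com/",
    spider="myspider",                 # hypothetical spider name
    callback="parse_item",             # hypothetical callback method
    depth=2,
    spider_args={"category": "books"}, # hypothetical spider argument
)
command_line = " ".join(shlex.quote(arg) for arg in argv)
print(command_line)
```

The list form can be handed to subprocess.run inside a project directory, which avoids shell quoting problems with URLs that contain ? or &.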

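12. scrapy settings

Syntax: scrapy settings [options]
Requires project: no

Gets the value of a Scrapy setting.

When run inside a project it shows the project's value for the setting; otherwise it shows the default Scrapy value.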

(scrapyEnv) MacBook-Pro:project $ scrapy settings --get BOT_NAME
quotesbot
(scrapyEnv) MacBook-Pro:project $ scrapy settings --get DOWNLOAD_DELAY
0
(scrapyEnv) MacBook-Pro:project $

13. scrapy runspider

Syntax: scrapy runspider <spider_file.py>
Requires project: no

Runs a spider that is self-contained in a single Python file, without having to create a project.

$ scrapy runspider myspider.py

14. scrapy version

Syntax: scrapy version [-v]
Requires project: no

Prints the Scrapy version. With -v it also prints Python, Twisted and platform information.

15. scrapy bench

Syntax: scrapy bench
Requires project: no

Runs a quick benchmark test.
