Scrapy command line

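Scrapy is controlled through the scrapy command-line tool, referred to here as the "Scrapy tool" to distinguish it from its sub-commands, which are simply called "commands" or "Scrapy commands".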

First, create a Scrapy project, then generate a spider inside it.

[root@lol spider]# scrapy startproject testproject
New Scrapy project 'testproject', using template directory '/root/.pyenv/versions/3.6.1/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /root/PycharmProjects/spider/testproject

You can start your first spider with:
    cd testproject
    scrapy genspider example example.com
[root@lol spider]# cd testproject/
[root@lol testproject]# scrapy genspider baidu www.baidu.com
Created spider 'baidu' using template 'basic' in module:
  testproject.spiders.baidu

The genspider command offers several spider templates. Use -l to list them.

[root@lol testproject]# scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

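To use a template other than the default basic one, such as crawl, pass it with the -t option: scrapy genspider -t crawl <name> <domain>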

Run a spider.

For example, to run the baidu spider created above: scrapy crawl baidu

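Run the spiders' contract checks. Loading the spider modules for this also exposes import and syntax errors. No contracts are defined in this project yet, so 0 contracts are run.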

[root@lol testproject]# scrapy check

----------------------------------------------------------------------
Ran 0 contracts in 0.000s

OK

List the spiders in the project.

[root@lol testproject]# scrapy list
baidu
zhihu

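Fetch a page the way Scrapy's downloader would and print its source. With --headers the request headers (lines starting with >) and response headers (lines starting with <) are printed instead of the body, and --no-redirect stops Scrapy from following HTTP 3xx redirects.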

[root@lol testproject]# scrapy fetch --nolog http://www.iqiyi.com
[HTML page content]

[root@lol testproject]# scrapy fetch --nolog --headers http://www.iqiyi.com
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: en
> User-Agent: Scrapy/1.4.0 (+http://scrapy.org)
> Accept-Encoding: gzip,deflate
>
< Date: Mon, 06 Nov 2017 11:56:07 GMT
< Content-Type: text/html
< Expires: Mon, 06 Nov 2017 11:53:23 GMT
< Cache-Control: max-age=300
< Last-Modified: Mon, 06 Nov 2017 11:46:26 GMT
< Server: Apache 1.3.29
< X-Cache: HIT from 101.227.22.100
< X-Cache: HIT from 115.238.189.1
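
If the --headers output needs to be post-processed, the > and < prefixes make it easy to split. The helper below is a minimal standard-library sketch. The function name split_headers is made up for this example, and the sample text is a shortened copy of the output above.

```python
def split_headers(output):
    """Split `scrapy fetch --headers` output into request and response headers.

    Returns two lists of (name, value) tuples. Lists are used instead of dicts
    because a header such as X-Cache can appear more than once.
    """
    request, response = [], []
    for line in output.splitlines():
        marker, _, rest = line.partition(" ")
        if not rest:
            # Blank line or the bare ">" separator between the two sections
            continue
        name, _, value = rest.partition(": ")
        if marker == ">":
            request.append((name, value))
        elif marker == "<":
            response.append((name, value))
    return request, response


sample = """\
> Accept-Language: en
> User-Agent: Scrapy/1.4.0 (+http://scrapy.org)
>
< Content-Type: text/html
< X-Cache: HIT from 101.227.22.100
< X-Cache: HIT from 115.238.189.1
"""

request_headers, response_headers = split_headers(sample)
```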


[root@lol testproject]# scrapy fetch --nolog --no-redirect http://www.xiaomi.com 
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<h1>301 Moved Permanently</h1>
<p>The requested resource has been assigned a new permanent URI.</p>
<hr/>Powered by MIWS</body>
</html>

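scrapy view

scrapy view <url> downloads the page to a local file and opens it in a browser, so you can inspect it locally as Scrapy sees it.

scrapy shell opens an interactive console in which a fetched page can be inspected and selectors can be debugged.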

# scrapy shell --nolog http://www.qq.com/
In [1]: response
Out[1]: <200 http://www.qq.com/>

In [2]: response.headers
Out[2]: 
{b'Cache-Control': b'max-age=60',
 b'Content-Type': b'text/html; charset=GB2312',
 b'Date': b'Mon, 06 Nov 2017 12:12:38 GMT',
 b'Expires': b'Mon, 06 Nov 2017 12:13:38 GMT',
 b'Server': b'squid/3.5.20',
 b'Vary': b'Accept-Encoding',
 b'X-Cache': b'HIT from shenzhen.qq.com'}

In [3]: response.css('title::text').extract_first()
Out[3]: '騰訊首頁'

Look up a configuration value in the project settings with scrapy settings.

[root@lol quote]# scrapy settings --get=MONGO_DB
quotes
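
The value printed above presumably comes from a MONGO_DB entry in the project's settings.py. The sketch below shows how such a setting is commonly read inside the project through the from_crawler hook. The class name MongoPipeline and the stand-in crawler object are assumptions made for illustration, not code from the author's project.

```python
from types import SimpleNamespace

# settings.py (assumed excerpt): custom settings are plain module-level names
MONGO_DB = "quotes"


class MongoPipeline:
    """Hypothetical item pipeline that needs the database name."""

    def __init__(self, mongo_db):
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook and passes the running crawler;
        # crawler.settings.get() reads the same value the CLI prints.
        return cls(mongo_db=crawler.settings.get("MONGO_DB"))


# Stand-in for a real crawler so the sketch runs outside Scrapy;
# a dict is enough here because only .get() is used.
fake_crawler = SimpleNamespace(settings={"MONGO_DB": MONGO_DB})
pipeline = MongoPipeline.from_crawler(fake_crawler)
```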

Show the Scrapy version and, with -v, the versions of its dependencies.

[root@lol quote]# scrapy version
Scrapy 1.4.0
[root@lol quote]# scrapy version -v
Scrapy    : 1.4.0
lxml      : 4.1.0.0
libxml2   : 2.9.5
cssselect : 1.0.1
parsel    : 1.2.0
w3lib     : 1.18.0
Twisted   : 17.9.0
Python    : 3.6.1 (default, Oct 21 2017, 18:51:01) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
pyOpenSSL : 17.3.0 (OpenSSL 1.1.0g  2 Nov 2017)
Platform  : Linux-3.10.0-514.26.1.el7.x86_64-x86_64-with-centos-7.3.1611-Core