Scrapy 命令
全局命令
startproject 新建工程
settings 配置文件
genspider 新建spider
bench 測試命令
runspider
shell
fetch 使用Scrapy下載器(downloader)下載給定的URL,並將獲取到的內容送到標準輸出。
view 在瀏覽器中打開給定的URL,並以Scrapy spider獲取到的形式展現。
version scrapy的版本信息
項目命令
crawl 執行spider
check 檢查spider
list 列出當前項目中可用的spider
edit
parse 獲取給定的URL並使用相應的spider分析處理
常用操作
新建工程:scrapy startproject XXX
新建spider:scrapy genspider XXX “XXX.XXX.XX”
執行spider:scrapy crawl XXX
檢查spider:scrapy check XXX
列出當前項目中可用的spider:scrapy list
shell命令
Scrapy shell 是一個交互式的shell,Scrapy shell對於開發爬蟲是非常好用的一個測試工具。他可以在未啓動spider的情況下嘗試及調試爬蟲代碼。
shelp() - 打印可用對象及快捷命令的幫助列表
fetch(request_or_url) - 根據給定的請求(request)或URL獲取一個新的response,並更新相關的對象。
view(response) - 在本機的瀏覽器打開給定的response。 其會在response的body中添加一個 tag ,使得外部鏈接(例如圖片及css)能正確顯示。
在spider中啓用shell來查看response
通過 scrapy.shell.inspect_response 函數來實現:
import scrapy
from scrapy.shell import inspect_response
class MySpider(scrapy.Spider):
name = "myspider"
def parse(self, response):
inspect_response(response, self)
使用shell
打開shell界面
scrapy shell
請求百度頁面
fetch(“http://www.baidu.com“)
response.xpath()匹配數據
a = response.xpath(“//div[@id=’lg’]/img/@src”)
匹配到百度圖標鏈接:
- 得到data
b = a.extract()[0]
User Agent
User Agent中文名爲用戶代理,簡稱 UA,它是一個特殊字符串頭,使得服務器能夠識別客戶使用的操作系統及版本、CPU類型、瀏覽器及版本、瀏覽器渲染引擎、瀏覽器語言、瀏覽器插件等。
操作系統標識
瀏覽器User-Agent大全
PC端
safari 5.1 – MAC
User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50
safari 5.1 – Windows
User-Agent:Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50
IE 9.0
User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;
IE 8.0
User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)
IE 7.0
User-Agent:Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
IE 6.0
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Firefox 4.0.1 – MAC
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Firefox 4.0.1 – Windows
User-Agent:Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Opera 11.11 – MAC
User-Agent:Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11
Opera 11.11 – Windows
User-Agent:Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11
Chrome 17.0 – MAC
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11
傲遊(Maxthon)
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)
騰訊TT
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)
世界之窗(The World) 2.x
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)
世界之窗(The World) 3.x
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)
搜狗瀏覽器 1.x
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)
360瀏覽器
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)
Avant
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)
Green Browser
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)
移動設備端
safari iOS 4.33 – iPhone
User-Agent:Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5
safari ios 4.33 – iPod Touch
User-Agent:Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5
safari iOS 4.33 – iPad
User-Agent:Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5
Android N1
User-Agent: Mozilla/5.0 (Linux; U; android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1
Android QQ瀏覽器 For android
User-Agent: MQQBrowser/26 Mozilla/5.0 (linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1
Android Opera Mobile
User-Agent: Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10
Android Pad Moto Xoom
User-Agent: Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13
BlackBerry
User-Agent: Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+
WebOS HP Touchpad
User-Agent: Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0
Nokia N97
User-Agent: Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124
Windows Phone Mango
User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)
UC無
User-Agent: UCWEB7.0.2.37/28/999
UC標準
User-Agent: NOKIA5700/ UCWEB7.0.2.37/28/999
UCOpenwave
User-Agent: Openwave/ UCWEB7.0.2.37/28/999
UC Opera
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999
爬蟲爬取網站數據時,有時候需要通過user agent來模擬移動端和PC端標識