The Scrapy framework


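Introduction:

Scrapy is built on Twisted, a popular event-driven networking framework for Python. As a result, Scrapy uses non-blocking (asynchronous) code to achieve concurrency. The overall architecture is roughly as follows: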

'''
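Components:

1. Engine (ENGINE)
The engine controls the data flow between all components of the system and triggers events when certain actions occur. For details, see the data flow section of the official Scrapy architecture overview.

2. Scheduler (SCHEDULER)
Receives requests from the engine, pushes them onto a queue, and hands them back when the engine asks for the next one. Think of it as a priority queue of URLs: it decides which URL is crawled next and also filters out duplicate URLs.

3. Downloader (DOWNLOADER)
Downloads page content and returns it to the ENGINE. The downloader is built on Twisted's efficient asynchronous model.

4. Spiders (SPIDERS)
SPIDERS are classes written by the developer. They parse responses and extract items from them, or issue new requests.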

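5. Item pipelines (ITEM PIPELINES)
Process items after they have been extracted. Typical work includes cleaning, validation, and persistence (for example, storing them in a database).

6. Downloader middlewares (Downloader Middlewares)
Sit between the Scrapy engine and the downloader. They process requests passing from the ENGINE to the DOWNLOADER and responses passing from the DOWNLOADER back to the ENGINE.
You can use a downloader middleware to:
    (1) process a request just before it is sent to the Downloader (i.e. right before Scrapy sends the request to the website);
    (2) change received response before passing it to a spider;
    (3) send a new Request instead of passing received response to a spider;
    (4) pass response to a spider without fetching a web page;
    (5) silently drop some requests.

7. Spider middlewares (Spider Middlewares)
Sit between the ENGINE and the SPIDERS. Their main job is to process spider input (responses) and spider output (items and requests).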
'''
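
To make the downloader middleware hooks concrete, here is a minimal sketch. Scrapy middlewares are plain classes, so the sketch deliberately imports nothing from Scrapy. The class name FallbackUserAgentMiddleware, the user-agent string, and the FakeRequest stand-in are all hypothetical and exist only for illustration.

```python
class FallbackUserAgentMiddleware:
    """Hypothetical middleware: adds a User-Agent header to requests that lack one."""

    def __init__(self, user_agent="example-bot/0.1"):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Called for each request on its way from the engine to the downloader.
        request.headers.setdefault("User-Agent", self.user_agent)
        # Returning None lets the request continue through the remaining middlewares.
        return None

    def process_response(self, request, response, spider):
        # Called for each response on its way back to the engine.
        # A real middleware could return a modified response or a new request here;
        # this sketch passes the response through unchanged.
        return response


class FakeRequest:
    """Stand-in for a Scrapy request so the sketch runs without Scrapy installed."""

    def __init__(self, url):
        self.url = url
        self.headers = {}


if __name__ == "__main__":
    middleware = FallbackUserAgentMiddleware()
    request = FakeRequest("https://example.com")
    middleware.process_request(request, spider=None)
    print(request.headers)
```

In a real project, a class like this would be enabled by listing its import path in the DOWNLOADER_MIDDLEWARES setting.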


Installation:

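# Windows
    1. pip3 install wheel  # enables installing packages from wheel files; wheel files can be downloaded from https://www.lfd.uci.edu/~gohlke/pythonlibs
    2. pip3 install lxml
    3. pip3 install pyopenssl  # pyopenssl is a Python wrapper around OpenSSL, handy for encryption and decryption tasks
    # pywin32 can be incompatible with Python 3: download a build that matches your Python version and install it with pip install <path to the .whl file>
    4. Download and install pywin32: https://sourceforge.net/projects/pywin32/files/pywin32/  # in the 220 directory, pick the build that matches your OS and Python interpreter
    # Scrapy is built on Twisted, so Twisted has to be installed as well
    5. Download the Twisted wheel file: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    6. Run pip3 install <download directory>\Twisted-17.9.0-cp36-cp36m-win_amd64.whl  # install the local Twisted wheel
       # cmd: >> pip3 install D:\tank_files\Twisted-18.9.0-cp36-cp36m-win_amd64.whl
    7. pip3 install scrapy  # install Scrapy only after steps 1-6 are done, otherwise the installation fails with an error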
  
# Linux
    1、pip3 install scrapy
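
After installing, it can help to confirm that the packages from the steps above are importable. The following is a small sketch using only the standard library. The helper name missing_modules is hypothetical, and the list assumes the usual import names (pyopenssl is imported as OpenSSL).

```python
from importlib.util import find_spec

# Import names of the packages installed above (assumed; pyopenssl imports as "OpenSSL").
REQUIRED = ["lxml", "OpenSSL", "twisted", "scrapy"]


def missing_modules(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("missing:", missing if missing else "none")
```

An empty result only shows that the packages can be found; running scrapy version -v is still the more direct check.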


Command-line tools:

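#1 Show help
    scrapy -h
    scrapy <command> -h

#2 There are two kinds of commands: project-only commands must be run from inside a project directory, while global commands can be run anywhere
    Global commands:
        startproject # create a project
        genspider    # create a spider
        settings     # show settings; inside a project directory, the project's values are shown
        runspider    # run a spider from a standalone Python file, without creating a project
        shell        # scrapy shell <url>: interactive debugging, e.g. to check whether a selector rule is correct
        fetch        # fetch a single page independently of any project; can also show the headers
        view         # open the downloaded page in a browser, which helps tell which data is loaded via AJAX
        version      # scrapy version shows the Scrapy version; scrapy version -v also shows the versions of its dependencies
    Project-only commands:
        crawl        # run a spider; requires a project. Set ROBOTSTXT_OBEY = False in the settings file if requests are being filtered by robots.txt
        check        # run contract checks on the project's spiders
        list         # list the spiders contained in the project
        edit         # open a spider in an editor; rarely used
        parse        # scrapy parse <url> --callback <callback>: verify that a callback works as expected
        bench        # scrapy bench: run a quick benchmark

#3 Official documentation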
    https://docs.scrapy.org/en/latest/topics/commands.html


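Example usage:

#1. Run global commands: make sure you are not inside another project's directory, so that project's settings do not affect the result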
scrapy startproject MyProject

cd MyProject
scrapy genspider baidu www.baidu.com

scrapy settings --get XXX # inside a project directory, this shows the project's value

scrapy runspider baidu.py

scrapy shell https://www.baidu.com
    response
    response.status
    response.body
    view(response)
    
scrapy view https://www.taobao.com # if the page looks incomplete, the missing parts are loaded via AJAX; a quick way to locate the problem

scrapy fetch --nolog --headers https://www.taobao.com

scrapy version # the Scrapy version

scrapy version -v # the versions of the dependencies as well


#2. Run project commands: switch to the project directory first
scrapy crawl baidu
scrapy check
scrapy list
scrapy parse http://quotes.toscrape.com/ --callback parse
scrapy bench

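Project structure:

File descriptions:

scrapy.cfg: the project's main configuration, used when deploying Scrapy. Spider-related settings live in settings.py.
items.py: defines the templates for scraped data and is used to structure it, similar to a Django Model.
pipelines.py: data-processing behaviour, typically persisting the structured data.
settings.py: the settings file, e.g. crawl depth, concurrency, download delay. Note: setting names must be uppercase or they are ignored. Correct form: USER_AGENT='xxxx'
spiders: the spider directory, where you create files and write the crawling rules.

Note: spider files are usually named after the site's domain.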
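
The options mentioned for settings.py (crawl depth, concurrency, download delay) correspond to uppercase setting names. The fragment below is a sketch of what such entries might look like. The values and the user-agent string are illustrative assumptions, not recommendations.

```python
# settings.py (fragment); all values are illustrative
USER_AGENT = "example-bot/0.1"  # hypothetical identity sent with each request
CONCURRENT_REQUESTS = 16        # concurrency: maximum number of simultaneous requests
DOWNLOAD_DELAY = 0.5            # download delay, in seconds, between requests to the same site
DEPTH_LIMIT = 3                 # crawl depth limit; 0 means unlimited
```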

project_name/
   scrapy.cfg
   project_name/
       __init__.py
       items.py
       pipelines.py
       settings.py
       spiders/
           __init__.py
           spider1.py
           spider2.py
           spider3.py
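
As an illustration of what might go into pipelines.py, here is a minimal sketch of a pipeline that persists items as JSON lines. Pipelines are plain classes, so nothing is imported from Scrapy. The class name JsonLinesPipeline and the default file name items.jl are hypothetical.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: writes each scraped item as one JSON object per line."""

    def __init__(self, path="items.jl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes: release the file handle.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item; returning it passes it on to later pipelines.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

A pipeline only runs once it is registered in settings.py, for example ITEM_PIPELINES = {"project_name.pipelines.JsonLinesPipeline": 300}, where the number sets the order among pipelines.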

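Using a Scrapy project in PyCharm:

# Create entrypoint.py in the project directory:
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'xiaohua'])  # equivalent to running "scrapy crawl xiaohua" on the command line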

Parsing response data with XPath in Scrapy

response.selector.css()
response.selector.xpath()
can be shortened to
response.css()
response.xpath()

#1 // versus /
response.xpath('//body/a')
response.css('div a::text')

>>> response.xpath('//body/a') # the leading // searches the whole document; the / after body selects only direct children of body
[]
>>> response.xpath('//body//a') # the leading // searches the whole document; the // after body selects all descendants of body
[<Selector xpath='//body//a' data='<a href="image1.html">Name: My image 1 <'>, <Selector xpath='//body//a' data='<a href="image2.html">Name: My image 2 <'>, <Selector xpath='//body//a' data='<a href="image3.html">Name: My image 3 <'>, <Selector xpath='//body//a' data='<a href="image4.html">Name: My image 4 <'>, <Selector xpath='//body//a' data='<a href="image5.html">Name: My image 5 <'>]

#2 text
>>> response.xpath('//body//a/text()')
>>> response.css('body a::text')

#3. extract and extract_first: pull the content out of selector objects
>>> response.xpath('//div/a/text()').extract()
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
>>> response.css('div a::text').extract()
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']

>>> response.xpath('//div/a/text()').extract_first()
'Name: My image 1 '
>>> response.css('div a::text').extract_first()
'Name: My image 1 '

#4. Attributes: in XPath, attribute names take the @ prefix
>>> response.xpath('//div/a/@href').extract_first()
'image1.html'
>>> response.css('div a::attr(href)').extract_first()
'image1.html'

#5. Nested lookups
>>> response.xpath('//div').css('a').xpath('@href').extract_first()
'image1.html'

#6. Set a default value
>>> response.xpath('//div[@id="xxx"]').extract_first(default="not found")
'not found'

#7. Find by attribute
response.xpath('//div[@id="images"]/a[@href="image3.html"]/text()').extract()
response.css('#images a[href="image3.html"]::text').extract()

#8. Fuzzy (substring) matching on attributes
response.xpath('//a[contains(@href,"image")]/@href').extract()
response.css('a[href*="image"]::attr(href)').extract()

response.xpath('//a[contains(@href,"image")]/img/@src').extract()
response.css('a[href*="image"] img::attr(src)').extract()

response.xpath('//*[@href="image1.html"]')
response.css('*[href="image1.html"]')

#9. Regular expressions
response.xpath('//a/text()').re(r'Name: (.*)')
response.xpath('//a/text()').re_first(r'Name: (.*)')

#10. Relative XPath
>>> res=response.xpath('//a[contains(@href,"3")]')[0]
>>> res.xpath('img')
[<Selector xpath='img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('./img')
[<Selector xpath='./img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('.//img')
[<Selector xpath='.//img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('//img') # a leading // scans from the document root again, ignoring res
[<Selector xpath='//img' data='<img src="image1_thumb.jpg">'>, <Selector xpath='//img' data='<img src="image2_thumb.jpg">'>, <Selector xpath='//img' data='<img src="image3_thumb.jpg">'>, <Selector xpath='//img' data='<img src="image4_thumb.jpg">'>, <Selector xpath='//img' data='<img src="image5_thumb.jpg">'>]

#11. XPath with variables
>>> response.xpath('//div[@id=$xxx]/a/text()',xxx='images').extract_first()
'Name: My image 1 '
>>> response.xpath('//div[count(a)=$yyy]/@id',yyy=5).extract_first() # the id of the div that has exactly 5 <a> children
'images'

posted @ 2019-03-13 19:50 ChuckXue
