Python Crawlers --- 2.2 An Introduction to Scrapy Selectors

Original link: https://www.fkomm.cn/article/2018/8/2/27.html

Before using the Scrapy framework, we first need to understand how it filters data.

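Scrapy has its own mechanism for extracting data, called selectors. A selector picks out part of an HTML document using an XPath or CSS expression. XPath is a language designed for selecting nodes in XML documents, and it also works on HTML. CSS is a language for styling HTML documents; its selectors associate styles with specific HTML elements. Scrapy's selectors are built on top of lxml (through the parsel library), so their speed and parsing accuracy are close to those of lxml itself.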

Basic selectors:

A Scrapy spider can extract information in several ways:

Beautiful Soup
Lxml
re
XPath Selector
CSS Selector

Below we introduce how to use the XPath and CSS selectors:

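XPath selectors

About XPath:

XPath is a language for finding information in XML documents; it can traverse the elements and attributes of an XML document. The XPath 2.0 function library defines more than 100 built-in functions for comparing and manipulating strings, numbers, dates, and times. Note that lxml, which Scrapy relies on, implements XPath 1.0, whose core function set is much smaller.
In a web crawler we only use XPath to collect data, so a grasp of the basic syntax is enough to get started.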

The basic syntax is shown in the table below:
Example:

We will use the following book.xml as our example:

<html>
     <body>
         <bookstore>
             <book>
                 <title>水滸傳</title>
                 <author>施耐庵</author>
                 <price>58.95</price>
             </book>
             <book>
                 <title>西遊記</title>
                 <author>吳承恩</author>
                 <price>58.3</price>
             </book>
             <book>
                 <title>三國演義</title>
                 <author>羅貫中</author>
                 <price>48.3</price>
             </book>
             <book>
                 <title>紅樓夢</title>
                 <author>曹雪芹</author>
                 <price>75</price>
             </book>
         </bookstore>
     </body>
 </html>

First, import the module we need (the debugging environment is IPython):

In [1]: from scrapy.selector import Selector

  In [2]: body = open('book.xml', 'r', encoding='utf-8').read()

  In [3]: print(body)
  <html>
      <body>
          <bookstore>
              <book>
                  <title>水滸傳</title>
                  <author>施耐庵</author>
                  <price>58.95</price>
              </book>
              <book>
                  <title>西遊記</title>
                  <author>吳承恩</author>
                  <price>58.3</price>
              </book>
              <book>
                  <title>三國演義</title>
                  <author>羅貫中</author>
                  <price>48.3</price>
              </book>
              <book>
                  <title>紅樓夢</title>
                  <author>曹雪芹</author>
                  <price>75</price>
              </book>
          </bookstore>
      </body>
  </html>

  In [4]: body
  Out[4]: '<html>\n\t<body>\n\t\t<bookstore>\n\t\t\t<book>\n\t\t\t\t<title>水滸傳</title>\n\t\t\t\t<author>施耐庵</author>\n\t\t\t\t<price>58.95</price>\n\t\t\t</book>\n\t\t\t<book>\n\t\t\t\t<title>西遊記</title>\n\t\t\t\t<author>吳承恩</author>\n\t\t\t\t<price>58.3</price>\n\t\t\t</book>\n\t\t\t<book>\n\t\t\t\t<title>三國演義</title>\n\t\t\t\t<author>羅貫中</author>\n\t\t\t\t<price>48.3</price>\n\t\t\t</book>\n\t\t\t<book>\n\t\t\t\t<title>紅樓夢</title>\n\t\t\t\t<author>曹雪芹</author>\n\t\t\t\t<price>75</price>\n\t\t\t</book>\n\t\t</bookstore>\n\t</body>\n</html>'


Here are a few small examples showing how to find the data we want with XPath:

In [5]: print("If we want the content of the first book")
  If we want the content of the first book

  In [7]: Selector(text=body).xpath('/html/body/bookstore/book[1]').extract()
  Out[7]: ['<book>\n\t\t\t\t<title>水滸傳</title>\n\t\t\t\t<author>施耐庵</author>\n\t\t\t\t<price>58.95</price>\n\t\t\t</book>']

  In [8]: print("If we want the content of the last book")
  If we want the content of the last book

  In [9]: Selector(text=body).xpath('/html/body/bookstore/book[last()]').extract()
  Out[9]: ['<book>\n\t\t\t\t<title>紅樓夢</title>\n\t\t\t\t<author>曹雪芹</author>\n\t\t\t\t<price>75</price>\n\t\t\t</book>']

  In [10]: print("If we want the text of the author element of the last book")
  If we want the text of the author element of the last book

  In [11]: Selector(text=body).xpath('/html/body/bookstore/book[last()]/author/text()').extract()
  Out[11]: ['曹雪芹']

  In [12]: print("Nested use of XPath")
  Nested use of XPath

  In [13]: subbody=Selector(text=body).xpath('/html/body/bookstore/book[3]').extract()

  In [14]: Selector(text=subbody[0]).xpath('//author/text()').extract()
  Out[14]: ['羅貫中']

  In [15]: Selector(text=subbody[0]).xpath('//book/author/text()').extract()
  Out[15]: ['羅貫中']

  In [16]: Selector(text=subbody[0]).xpath('//book/title/text()').extract()
  Out[16]: ['三國演義']

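CSS selectors

About CSS:

Compared with XPath selectors, CSS selectors feel somewhat easier: the expressions are written much as they would be in a .css stylesheet. The one difference to watch for is how content is retrieved, which is not the same as in XPath.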
The basic syntax is shown in the table below:
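Example:

We again use book.xml as our example:

The XPath section already showed how to import the module, so here are a few small examples showing how to find the data we want with CSS: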

In [2]: print("If we want the content of all nodes")
  If we want the content of all nodes

  In [3]: Selector(text=body).css('*').extract()
  Out[3]:
  ['<html>\n\t<body>\n\t\t<bookstore>\n\t\t\t<book>\n\t\t\t\t<title>水滸傳</title>\n\t\t\t\t<author>施耐庵</author>\n\t\t\t\t<price>58.95</price>\n\t\t\t</book>\n\t\t\t<book>\n\t\t\t\t<title>西遊記</title>\n\t\t\t\t<author>吳承恩</author>\n\t\t\t\t<price>58.3</price>\n\t\t\t</book>\n\t\t\t<book>\n\t\t\t\t<title>三國演義</title>\n\t\t\t\t<author>羅貫中</author>\n\t\t\t\t<price>48.3</price>\n\t\t\t</book>\n\t\t\t<book>\n\t\t\t\t<title>紅樓夢</title>\n\t\t\t\t<author>曹雪芹</author>\n\t\t\t\t<price>75</price>\n\t\t\t</book>\n\t\t</bookstore>\n\t</body>\n</html>',
  '<body>\n\t\t<bookstore>\n\t\t\t<book>\n\t\t\t\t<title>水滸傳</title>\n\t\t\t\t<author>施耐庵</author>\n\t\t\t\t<price>58.95</price>\n\t\t\t</book>\n\t\t\t<book>\n\t\t\t\t<title>西遊記</title>\n\t\t\t\t<author>吳承恩</author>\n\t\t\t\t<price>58.3</price>\n\t\t\t</book>\n\t\t\t<book>\n\t\t\t\t<title>三國演義</title>\n\t\t\t\t<author>羅貫中</author>\n\t\t\t\t<price>48.3</price>\n\t\t\t</book>\n\t\t\t<book>\n\t\t\t\t<title>紅樓夢</title>\n\t\t\t\t<author>曹雪芹</author>\n\t\t\t\t<price>75</price>\n\t\t\t</book>\n\t\t</bookstore>\n\t</body>',
  '<bookstore>\n\t\t\t<book>\n\t\t\t\t<title>水滸傳</title>\n\t\t\t\t<author>施耐庵</author>\n\t\t\t\t<price>58.95</price>\n\t\t\t</book>\n\t\t\t<book>\n\t\t\t\t<title>西遊記</title>\n\t\t\t\t<author>吳承恩</author>\n\t\t\t\t<price>58.3</price>\n\t\t\t</book>\n\t\t\t<book>\n\t\t\t\t<title>三國演義</title>\n\t\t\t\t<author>羅貫中</author>\n\t\t\t\t<price>48.3</price>\n\t\t\t</book>\n\t\t\t<book>\n\t\t\t\t<title>紅樓夢</title>\n\t\t\t\t<author>曹雪芹</author>\n\t\t\t\t<price>75</price>\n\t\t\t</book>\n\t\t</bookstore>',
  '<book>\n\t\t\t\t<title>水滸傳</title>\n\t\t\t\t<author>施耐庵</author>\n\t\t\t\t<price>58.95</price>\n\t\t\t</book>',
  '<title>水滸傳</title>',
  '<author>施耐庵</author>',
  '<price>58.95</price>',
  '<book>\n\t\t\t\t<title>西遊記</title>\n\t\t\t\t<author>吳承恩</author>\n\t\t\t\t<price>58.3</price>\n\t\t\t</book>',
  '<title>西遊記</title>',
  '<author>吳承恩</author>',
  '<price>58.3</price>',
  '<book>\n\t\t\t\t<title>三國演義</title>\n\t\t\t\t<author>羅貫中</author>\n\t\t\t\t<price>48.3</price>\n\t\t\t</book>',
  '<title>三國演義</title>',
  '<author>羅貫中</author>',
  '<price>48.3</price>',
  '<book>\n\t\t\t\t<title>紅樓夢</title>\n\t\t\t\t<author>曹雪芹</author>\n\t\t\t\t<price>75</price>\n\t\t\t</book>',
  '<title>紅樓夢</title>',
  '<author>曹雪芹</author>',
  '<price>75</price>']

  In [4]: print("If we want all book elements under bookstore")
  If we want all book elements under bookstore

  In [5]: Selector(text=body).css('bookstore book').extract()
  Out[5]:
  ['<book>\n\t\t\t\t<title>水滸傳</title>\n\t\t\t\t<author>施耐庵</author>\n\t\t\t\t<price>58.95</price>\n\t\t\t</book>',
  '<book>\n\t\t\t\t<title>西遊記</title>\n\t\t\t\t<author>吳承恩</author>\n\t\t\t\t<price>58.3</price>\n\t\t\t</book>',
  '<book>\n\t\t\t\t<title>三國演義</title>\n\t\t\t\t<author>羅貫中</author>\n\t\t\t\t<price>48.3</price>\n\t\t\t</book>',
  '<book>\n\t\t\t\t<title>紅樓夢</title>\n\t\t\t\t<author>曹雪芹</author>\n\t\t\t\t<price>75</price>\n\t\t\t</book>']

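Because book.xml contains no attributes (such as class or id), only plain elements, these are the only examples we can show here. As you can see, CSS selectors are more concise than XPath selectors.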

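Summary

That concludes this introduction to Scrapy selectors and their basic usage. Later posts will gradually cover how to use the Scrapy framework itself.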
