[Web Scraping] Python Scrapy Selectors

[Original article] https://doc.scrapy.org/en/latest/topics/selectors.html#topics-selectors

 

When you’re scraping web pages, the most common task you need to perform is to extract data from the HTML source. There are several libraries available to achieve this:

  • BeautifulSoup is a very popular web scraping library among Python programmers. It builds a Python object based on the structure of the HTML code and also deals with bad markup reasonably well, but it has one drawback: it’s slow.
  • lxml is an XML parsing library (which also parses HTML) with a pythonic API based on ElementTree. (lxml is not part of the Python standard library.)

Scrapy comes with its own mechanism for extracting data. They’re called selectors because they “select” certain parts of the HTML document specified either by XPath or CSS expressions.

XPath is a language for selecting nodes in XML documents, which can also be used with HTML. CSS is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.

Scrapy selectors are built over the lxml library, which means they’re very similar in speed and parsing accuracy.

This page explains how selectors work and describes their API which is very small and simple, unlike the lxml API which is much bigger because the lxml library can be used for many other tasks, besides selecting markup documents.

For a complete reference of the selectors API, see the Selector reference.

Using selectors

Constructing selectors

Scrapy selectors are instances of the Selector class, constructed by passing either text or a TextResponse object. The class automatically chooses the best parsing rules (XML vs. HTML) based on the input type:

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse

Constructing from text:

>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').extract()

['good']

Constructing from response:

>>> response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
>>> Selector(response=response).xpath('//span/text()').extract()

['good']

For convenience, response objects expose a selector on the .selector attribute; it’s perfectly fine to use this shortcut when possible:

>>> response.selector.xpath('//span/text()').extract()

['good']

Using selectors

To explain how to use the selectors we’ll use the Scrapy shell (which provides interactive testing) and an example page located in the Scrapy documentation server:

https://doc.scrapy.org/en/latest/_static/selectors-sample1.html

Here’s its HTML code:

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

First, let’s open the shell:

scrapy shell https://doc.scrapy.org/en/latest/_static/selectors-sample1.html

Then, after the shell loads, you’ll have the response available as the response shell variable, with its attached selector in the response.selector attribute.

Since we’re dealing with HTML, the selector will automatically use an HTML parser.

So, by looking at the HTML code of that page, let’s construct an XPath for selecting the text inside the title tag:

In [1]: response.selector.xpath('//title/text()')

Out[1]: [<Selector xpath='//title/text()' data='Example website'>]

Querying responses using XPath and CSS is so common that responses include two convenience shortcuts: response.xpath() and response.css():

In [2]: response.xpath('//title/text()')
Out[2]: [<Selector xpath='//title/text()' data='Example website'>]

In [3]: response.css('title::text')
Out[3]: [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]

To actually extract the textual data, you must call the selector’s .extract() method, as follows:

In [5]: response.xpath('//title/text()').extract()

Out[5]: ['Example website']

As you can see, .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors. This API can be used for quickly selecting nested data:

In [4]: response.css('img').xpath('@src').extract()

Out[4]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

If you want to extract only the first matched element, you can call the selector’s .extract_first() method:

In [6]: response.xpath('//div[@id="images"]/a/text()').extract_first()

Out[6]: 'Name: My image 1 '

It returns None if no element was found:

In [7]: response.xpath('//div[@id="not-exists"]/text()').extract_first() is None

Out[7]: True
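The behaviour of .extract_first() can be pictured as a small helper over a plain list (a sketch of the documented behaviour, not Scrapy’s actual implementation). Note that the real method also accepts a default= keyword, so you can get something other than None back when nothing matched:

```python
def extract_first(results, default=None):
    # Return the first extracted result, or `default` when the list is
    # empty -- instead of raising IndexError like results[0] would.
    return results[0] if results else default

extract_first(['Name: My image 1 '])          # -> 'Name: My image 1 '
extract_first([])                             # -> None
extract_first([], default='not-found')        # -> 'not-found'
```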

Now we’re going to get the base URL and some image links:

In [8]: response.xpath('//base/@href').extract()
Out[8]: ['http://example.com/']

In [9]: response.css('base::attr(href)').extract()
Out[9]: ['http://example.com/']

In [11]: response.xpath('//a[contains(@href, "image")]/@href').extract()
Out[11]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In [12]: response.xpath('//a[contains(@href, "image1")]/@href').extract()
Out[12]: ['image1.html']

In [13]: response.xpath('//a[contains(@href, "image")]/img/@src').extract()
Out[13]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

Nesting selectors

The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods for those selectors too. Here’s an example:

In [14]: links = response.xpath('//a[contains(@href, "image")]')
In [15]: links.extract()
Out[15]:
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

In [17]: for index, link in enumerate(links):
    ...:     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
    ...:     print('Link number %d points to url %s and image %s' % args)
    ...:
Link number 0 points to url ['image1.html'] and image ['image1_thumb.jpg']
Link number 1 points to url ['image2.html'] and image ['image2_thumb.jpg']
Link number 2 points to url ['image3.html'] and image ['image3_thumb.jpg']
Link number 3 points to url ['image4.html'] and image ['image4_thumb.jpg']
Link number 4 points to url ['image5.html'] and image ['image5_thumb.jpg']

Using selectors with regular expressions

Selectors also have a .re() method for extracting data using regular expressions. However, unlike the .xpath() and .css() methods, .re() returns a list of unicode strings, so you can’t construct nested .re() calls.

Here’s an example used to extract image names from the HTML code above:

In [18]: response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')

Out[18]: ['My image 1 ', 'My image 2 ', 'My image 3 ', 'My image 4 ', 'My image 5 ']

There’s an additional helper for .re(), analogous to .extract_first(), named .re_first(). Use it to extract just the first matching string:

In [22]: response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')

Out[22]: 'My image 1 '
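Conceptually, .re(pattern) applies the pattern to the text content of each matched selector and collects all capture groups — roughly like running the stdlib re.findall() over the extracted strings (a sketch of the idea, not Scrapy internals):

```python
import re

# The extracted text of each <a> element from the example page:
texts = ['Name: My image 1 ', 'Name: My image 2 ']

# Apply the same pattern used with .re() above to each string and
# collect the capture groups, preserving order:
names = [m for t in texts for m in re.findall(r'Name:\s*(.*)', t)]
# names -> ['My image 1 ', 'My image 2 ']
```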

Working with relative XPaths

Keep in mind that if you are nesting selectors and use an XPath that starts with /, that XPath will be absolute to the document and not relative to the Selector you’re calling it from.

For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all <div> elements:

>>> divs = response.xpath('//div')
>>> for p in divs.xpath('.//p'):  # extracts all <p> inside
...     print(p.extract())

Note the dot prefixing the .//p XPath. At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the document, not only those inside <div> elements:

>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print(p.extract())

Another common case would be to extract all direct <p> children:

>>> for p in divs.xpath('p'):
...     print(p.extract())

For more details about relative XPaths see the Location Paths section in the XPath specification.
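The same relative-versus-descendant distinction exists in the standard library’s ElementTree API, which supports a small XPath subset (a stdlib analogy only, not Scrapy code — note ElementTree has no absolute // form, so only the two relative cases are shown):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring('<div><p>a</p><section><p>b</p></section></div>')

# './/p' matches all <p> descendants of the context node...
descendant_ps = [p.text for p in doc.findall('.//p')]   # ['a', 'b']

# ...while a bare 'p' matches only direct <p> children:
child_ps = [p.text for p in doc.findall('p')]           # ['a']
```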

Variables in XPath expressions

XPath lets you reference variables in your XPath expressions, using the $somevariable syntax. This is somewhat similar to parameterized queries or prepared statements in the SQL world, where you replace some arguments in your queries with placeholders like ?, which are then substituted with values passed with the query.
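The SQL comparison can be made concrete with the stdlib sqlite3 module, where the ? placeholder plays the role $somevariable plays in XPath (an illustration of the analogy only — the table and values here are made up, not Scrapy code):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE links (id TEXT, text TEXT)")
conn.execute("INSERT INTO links VALUES ('images', 'Name: My image 1 ')")

# The ? placeholder is bound at query time, just like a $variable
# is bound when calling .xpath() with a named argument:
row = conn.execute("SELECT text FROM links WHERE id = ?",
                   ('images',)).fetchone()
# row -> ('Name: My image 1 ',)
```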

Here’s an example that matches an element based on its “id” attribute value, without hard-coding it (the hard-coded form was shown previously):

In [34]: response.xpath('//div[@id=$val]/a/text()', val='images').extract_first()

Out[34]: 'Name: My image 1 '

Here’s another example, to find the “id” attribute of a <div> tag containing five <a> children (here we pass the value 5 as an integer):

In [35]: response.xpath('//div[count(a)=$cnt]/@id', cnt=5).extract_first()

Out[35]: 'images'

All variable references must have a binding value when calling .xpath() (otherwise you’ll get a ValueError: XPath error: exception). This is done by passing as many named arguments as necessary.

parsel, the library powering Scrapy selectors, has more details and examples on XPath variables.

Using EXSLT extensions

Being built atop lxml, Scrapy selectors also support some EXSLT extensions (a community initiative to provide extensions to XSLT, broken down into a number of modules) and come with these pre-registered namespaces to use in XPath expressions:

prefix  namespace                               usage
re      http://exslt.org/regular-expressions    regular expressions
set     http://exslt.org/sets                   set manipulation

  • Regular expressions

The test() function, for example, can prove quite useful when XPath’s starts-with() or contains() are not sufficient.

Example selecting links in list items with a “class” attribute ending with a digit:

In [36]: from scrapy import Selector
In [38]: doc = """
    ...: <div>
    ...:     <ul>
    ...:         <li class="item-0"><a href="link1.html">first item</a></li>
    ...:         <li class="item-1"><a href="link2.html">second item</a></li>
    ...:         <li class="item-inactive"><a href="link3.html">third item</a></li>
    ...:         <li class="item-1"><a href="link4.html">fourth item</a></li>
    ...:         <li class="item-0"><a href="link5.html">fifth item</a></li>
    ...:     </ul>
    ...: </div>
    ...: """
In [39]: sel = Selector(text=doc, type="html")
In [40]: sel.xpath('//li//@href').extract()
Out[40]: ['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']

In [41]: sel.xpath('//li[re:test(@class, "item-\d$")]//@href').extract()
Out[41]: ['link1.html', 'link2.html', 'link4.html', 'link5.html']

Warning: C library libxslt doesn’t natively support EXSLT regular expressions so lxml’s implementation uses hooks to Python’s re module. Thus, using regexp functions in your XPath expressions may add a small performance penalty.

  • Set operations

These can be handy for excluding parts of a document tree before extracting text elements for example.

Example extracting microdata (sample content taken from http://schema.org/Product) with groups of itemscopes and corresponding itemprops:

In [46]: doc = """
    ...:  <div itemscope itemtype="http://schema.org/Product">
    ...:  
    ...:    <span itemprop="name">Kenmore White 17" Microwave</span>
    ...:    <img src="kenmore-microwave-17in.jpg" alt='Kenmore 17" Microwave' />
    ...:  
    ...:    <div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
    ...:     Rated <span itemprop="ratingValue">3.5</span>/5
    ...:     based on <span itemprop="reviewCount">11</span> customer reviews
    ...:    </div>
    ...:
    ...:    <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    ...:      <span itemprop="price">$55.00</span>
    ...:      <link itemprop="availability" href="http://schema.org/InStock" />In stock
    ...:    </div>
    ...:
    ...:    Product description:
    ...:    <span itemprop="description">0.7 cubic feet countertop microwave.
    ...:    Has six preset cooking categories and convenience features like
    ...:    Add-A-Minute and Child Lock.</span>
    ...:
    ...:    Customer reviews:
    ...:
    ...:    <div itemprop="review" itemscope itemtype="http://schema.org/Review">
    ...:      <span itemprop="name">Not a happy camper</span> -
    ...:      by <span itemprop="author">Ellie</span>,
    ...:      <meta itemprop="datePublished" content="2011-04-01">April 1, 2011
    ...:      <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
    ...:        <meta itemprop="worstRating" content = "1">
    ...:        <span itemprop="ratingValue">1</span>/
    ...:        <span itemprop="bestRating">5</span>stars
    ...:      </div>
    ...:      <span itemprop="description">The lamp burned out and now I have to replace
    ...:      it. </span>
    ...:    </div>
    ...:
    ...:    <div itemprop="review" itemscope itemtype="http://schema.org/Review">
    ...:      <span itemprop="name">Value purchase</span> -
    ...:      by <span itemprop="author">Lucas</span>,
    ...:      <meta itemprop="datePublished" content="2011-03-25">March 25, 2011
    ...:      <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
    ...:        <meta itemprop="worstRating" content = "1"/>
    ...:        <span itemprop="ratingValue">4</span>/
    ...:        <span itemprop="bestRating">5</span>stars
    ...:      </div>
    ...:      <span itemprop="description">Great microwave for the price. It is small and
    ...:      fits in my apartment.</span>
    ...:    </div>
    ...:
    ...:  </div>
    ...:  """
In [47]: sel = Selector(text=doc, type="html")
In [49]: for scope in sel.xpath('//div[@itemscope]'):
    ...:     print("current scope:", scope.xpath('@itemtype').extract())
    ...:     props = scope.xpath('set:difference(./descendant::*/@itemprop, .//*[@itemscope]/*/@itemprop)')
    ...:     print("    properties:", props.extract())
    ...:     print()
    ...:

current scope: ['http://schema.org/Product']
    properties: ['name', 'aggregateRating', 'offers', 'description', 'review', 'review']
current scope: ['http://schema.org/AggregateRating']
    properties: ['ratingValue', 'reviewCount']
current scope: ['http://schema.org/Offer']
    properties: ['price', 'availability']
current scope: ['http://schema.org/Review']
    properties: ['name', 'author', 'datePublished', 'reviewRating', 'description']
current scope: ['http://schema.org/Rating']
    properties: ['worstRating', 'ratingValue', 'bestRating']
current scope: ['http://schema.org/Review']
    properties: ['name', 'author', 'datePublished', 'reviewRating', 'description']
current scope: ['http://schema.org/Rating']
    properties: ['worstRating', 'ratingValue', 'bestRating']

Here we first iterate over itemscope elements, and for each one we look for all itemprop elements, excluding those that are themselves inside another itemscope.
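The set:difference() step can be pictured in plain Python: collect every itemprop value below a scope, then filter out the ones belonging to a nested itemscope. This is a conceptual sketch only — the property lists below are hand-copied from the Product scope for illustration, not computed by XPath:

```python
# All itemprop values found anywhere below the Product scope,
# in document order (hand-copied from the example markup):
all_props = ['name', 'aggregateRating', 'ratingValue', 'reviewCount',
             'offers', 'price', 'availability', 'description', 'review']

# Those that belong to nested itemscopes (AggregateRating, Offer):
nested_props = {'ratingValue', 'reviewCount', 'price', 'availability'}

# set:difference preserves document order, so filter rather than
# using Python set arithmetic (which would lose the ordering):
own_props = [p for p in all_props if p not in nested_props]
# own_props -> ['name', 'aggregateRating', 'offers', 'description', 'review']
```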

Some XPath tips

Here are some tips that you may find useful when using XPath with Scrapy selectors, based on this post from ScrapingHub’s blog. If you are not yet very familiar with XPath, you may want to first take a look at this XPath tutorial.

  • Using text nodes in a condition

When you need to use the text content as argument to an XPath string function, avoid using .//text() and use just . instead.

This is because the expression .//text() yields a collection of text elements – a node-set. And when a node-set is converted to a string, which happens when it is passed as an argument to a string function like contains() or starts-with(), it results in the text of the first element only.

A node converted to a string, however, puts together the text of itself plus that of all its descendants. Example:

In [50]: from scrapy import Selector
In [51]: sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
In [52]: sel.xpath('//a//text()').extract() # take a peek at the node-set
Out[52]: ['Click here to go to the ', 'Next Page']

In [53]: sel.xpath("string(//a[1]//text())").extract() # convert it to string
Out[53]: ['Click here to go to the ']

In [54]: sel.xpath("//a[1]").extract() # select the first node
Out[54]: ['<a href="#">Click here to go to the <strong>Next Page</strong></a>']

In [55]: sel.xpath("string(//a[1])").extract() # convert it to string
Out[55]: ['Click here to go to the Next Page']

So, using the .//text() node-set won’t select anything in this case, but using . to mean the node works:

In [56]: sel.xpath("//a[contains(.//text(), 'Next Page')]").extract()
Out[56]: []

In [57]: sel.xpath("//a[contains(., 'Next Page')]").extract()
Out[57]: ['<a href="#">Click here to go to the <strong>Next Page</strong></a>']
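XPath’s string() conversion, which concatenates a node’s own text with that of all its descendants, behaves much like ElementTree’s itertext() in the standard library (a stdlib analogy, not Scrapy code):

```python
import xml.etree.ElementTree as ET

a = ET.fromstring(
    '<a href="#">Click here to go to the <strong>Next Page</strong></a>')

# itertext() walks the subtree and yields every text fragment in
# document order; joining them mirrors XPath's string(//a[1]):
full_text = ''.join(a.itertext())
# full_text -> 'Click here to go to the Next Page'
```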
  • Beware of the difference between //node[1] and (//node)[1]

//node[1] selects all the nodes occurring first under their respective parents.

(//node)[1] selects all the nodes in the document, and then gets only the first of them.

Example:

In [58]: from scrapy import Selector
In [59]: sel = Selector(text="""<ul class="list">
    ...:                             <li>1</li>
    ...:                             <li>2</li>
    ...:                             <li>3</li>
    ...:                         </ul>
    ...:                         <ul class="list">
    ...:                             <li>4</li>
    ...:                             <li>5</li>
    ...:                             <li>6</li>
    ...:                         </ul>""")

In [60]: xp = lambda x: sel.xpath(x).extract()

In [62]: xp("//li[1]")  # This gets all first <li> elements under their respective parents
Out[62]: ['<li>1</li>', '<li>4</li>']

In [63]: xp("(//li)[1]")  # And this gets the first <li> element in the whole document
Out[63]: ['<li>1</li>']

In [64]: xp("//ul/li[1]")  # This gets all first <li> elements under an <ul> parent
Out[64]: ['<li>1</li>', '<li>4</li>']

In [65]: xp("(//ul/li)[1]")  # And this gets the first <li> element under an <ul> parent in the whole document
Out[65]: ['<li>1</li>']
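The distinction can be restated over plain nested lists (a conceptual sketch, not an XPath evaluation): //li[1] means “the first item within each parent”, while (//li)[1] means “the first item of the flattened whole”:

```python
# Two <ul> lists, each holding the text of its <li> children:
uls = [['1', '2', '3'], ['4', '5', '6']]

# Like //li[1]: the first <li> under each parent <ul>:
first_in_each = [items[0] for items in uls]      # ['1', '4']

# Like (//li)[1]: flatten everything, then take the first element:
first_overall = [li for ul in uls for li in ul][:1]   # ['1']
```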
  • When querying by class, consider using CSS (omitted)

Built-in Selectors reference (omitted)
