Scrapy Crawler Tutorial for Beginners: Installation and Basic Usage

<div class="markdown_views">

<p><a href="http://blog.csdn.net/inke88/article/details/59761696" target="_blank">Python version management: pyenv and pyenv-virtualenv</a></p>




<p><strong>Development environment:</strong> <br>
<code>Python 3.6.0</code> (the latest release at the time of writing) <br>
<code>Scrapy 1.3.2</code> (the latest release at the time of writing)</p>


<p></p><div class="toc">
<ul>
<li><ul>
<li><ul>
<li><a href="#scrapy安裝" target="">Installing Scrapy</a></li>
<li><a href="#創建項目" target="">Creating a project</a></li>
<li><a href="#如何運行我們爬蟲" target="">How to run our spider</a></li>
<li><a href="#提取數據" target="">Extracting data</a><ul>
<li><a href="#css選擇元素" target="">Selecting elements with CSS</a></li>
<li><a href="#提取標題" target="">Extracting the title</a></li>
<li><a href="#xpath選擇元素" target="">Selecting elements with XPath</a></li>
<li><a href="#提取引號和作者" target="">Extracting quotes and authors</a></li>
</ul>
</li>
<li><a href="#存取數據" target="">Storing the data</a></li>
<li><a href="#鏈接界面包含的鏈接" target="">Following links</a></li>
<li><a href="#更多示例和模式" target="">More examples and patterns</a></li>
<li><a href="#使用爬蟲參數" target="">Using spider arguments</a></li>
</ul>
</li>
</ul>
</li>
</ul>
</div>
<p></p>






<h3 id="scrapy安裝"><a name="t0" target="_blank"></a>Installing Scrapy</h3>


<p>Scrapy runs on Python 2.7 and on Python 3.3 or later (except on Windows, where Python 3 is not yet supported).</p>


<p>The general approach is to install Scrapy and its dependencies with pip: <br>
<code>pip install Scrapy</code></p>






<h3 id="創建項目"><a name="t1" target="_blank"></a>Creating a project</h3>


<p><code>scrapy startproject tutorial</code> <br>
<img src="http://om2o4m4w0.bkt.clouddn.com/14912226418048.gif" alt="-w200" title=""></p>


<p>Project structure:</p>






<pre class="prettyprint" name="code"><code class="hljs avrasm has-numbering">tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # the project's Python module; your code goes in this directory
        __init__.py

        items.py          # project items definition file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # directory for our spiders
            __init__.py
</code></pre>


<p>Our first spider <br>
Create the first spider class in tutorial/spiders/quotes_spider.py:</p>






<pre class="prettyprint" name="code"><code class="hljs python has-numbering">import scrapy


class QuotesSpider(scrapy.Spider):
    # Unique name used to run the spider: scrapy crawl quotes
    name = "quotes"

    def start_requests(self):
        # Initial requests; each downloaded response is passed to self.parse
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The second-to-last URL segment is the page number
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        # response.body is bytes, so the file is opened in binary mode
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)</code></pre>


<ul>
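<li><p>The class must subclass scrapy.Spider.</p></li>
<li><p>name: identifies the spider. It must be unique within a project; that is, you cannot give different spiders the same name.</p></li>
<li><p>start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) from which the spider will begin to crawl. Subsequent requests are generated successively from these initial requests.</p></li>
<li><p>parse(): the method that is called to handle the response downloaded for each request. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods for handling it.</p>


<p>The parse() method usually parses the response, extracts the scraped data as dicts, and also finds new URLs to follow and creates new requests (Request) from them.</p></li>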
</ul>
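<p>As a rough illustration of how parse() derives each output filename: the page number is simply the second-to-last segment of the URL once it is split on "/". The helper name filename_for below is hypothetical and is not part of Scrapy; it only mirrors the string handling in the spider above.</p>

```python
def filename_for(url):
    """Build the output filename the same way parse() does."""
    # 'http://quotes.toscrape.com/page/1/'.split("/") ends with
    # [..., 'page', '1', ''], so index -2 is the page number.
    page = url.split("/")[-2]
    return "quotes-%s.html" % page


for url in ("http://quotes.toscrape.com/page/1/",
            "http://quotes.toscrape.com/page/2/"):
    print(url, "->", filename_for(url))
```

<p>Note that this relies on the trailing slash in the URL; without it, index -2 would point at a different segment.</p>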






<h3 id="如何運行我們爬蟲"><a name="t2" target="_blank"></a>How to run our spider</h3>


<p>Go to the project's top-level directory, that is, the tutorial directory shown above: <br>
<code>cd tutorial</code> <br>
Run the spider: <br>
<code>scrapy crawl quotes</code></p>


<blockquote>
  <p>quotes is the name of the spider written above.</p>
</blockquote>






<pre class="prettyprint" name="code"><code class="hljs avrasm has-numbering">... (omitted for brevity)
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) &lt;GET http://quotes.toscrape.com/robots.txt&gt; (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) &lt;GET http://quotes.toscrape.com/page/1/&gt; (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) &lt;GET http://quotes.toscrape.com/page/2/&gt; (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
...</code></pre>


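<p>Now check the files in the current directory. You should notice that two new files have been created, quotes-1.html and quotes-2.html, holding the content of the respective URLs as saved by the parse method.</p>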


<p><img src="http://om2o4m4w0.bkt.clouddn.com/14885332051166.jpg" alt="-w300" title=""> <br>
The screenshot above was taken in the PyCharm IDE.</p>






<h3 id="提取數據"><a name="t3" target="_blank"></a>Extracting data</h3>


<p>The best way to learn how to extract data with Scrapy is to try out selectors in the Scrapy shell.</p>


<p><code>scrapy shell 'http://quotes.toscrape.com/page/1/'</code></p>


<blockquote>
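  <p>Remember to always enclose the URL in quotes when running the Scrapy shell from the command line; otherwise URLs that contain arguments (i.e. the &amp; character) will not work. <br>
  On Windows, use double quotes instead: <br>
  scrapy shell "<a href="http://quotes.toscrape.com/page/1/" target="_blank">http://quotes.toscrape.com/page/1/</a>"</p>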
</blockquote>
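<p>If the command line is built from a script, the quoting can be left to the standard library. The sketch below assumes a POSIX shell (shlex.quote does not target the Windows command prompt), and the URL with a query string is made up for illustration.</p>

```python
import shlex

# Hypothetical URL with a query string; unquoted, a POSIX shell would treat
# the "&" as "run the command so far in the background".
url = "http://quotes.toscrape.com/search?tag=love&page=2"

# shlex.quote wraps the argument in single quotes when it contains
# characters that are special to the shell.
command = "scrapy shell " + shlex.quote(url)
print(command)
```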


<p>You will see something like:</p>






<pre class="prettyprint" name="code"><code class="hljs r has-numbering">[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) &lt;GET http://quotes.toscrape.com/page/1/&gt; (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    &lt;scrapy.crawler.Crawler object at 0x7fa91d888c90&gt;
[s]   item       {}
[s]   request    &lt;GET http://quotes.toscrape.com/page/1/&gt;
[s]   response   &lt;200 http://quotes.toscrape.com/page/1/&gt;
[s]   settings   &lt;scrapy.settings.Settings object at 0x7fa91d888c10&gt;
[s]   spider     &lt;DefaultSpider 'default' at 0x7fa91c8af990&gt;
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
&gt;&gt;&gt;</code></pre>






<h4 id="css選擇元素"><a name="t4" target="_blank"></a>Selecting elements with CSS</h4>






<h4 id="提取標題"><a name="t5" target="_blank"></a>Extracting the title</h4>


<p>Try selecting elements using CSS with the response object:</p>






<pre class="prettyprint" name="code"><code class="hljs python has-numbering">&gt;&gt;&gt; response.css('title')
[&lt;Selector xpath='descendant-or-self::title' data='&lt;title&gt;Quotes to Scrape&lt;/title&gt;'&gt;]</code></pre>


<p>This returns a collection of Selector objects.</p>


<p>To extract the text from the title above, you can do:</p>






<pre class="prettyprint" name="code"><code class="hljs python has-numbering">&gt;&gt;&gt; response.css('title::text').extract()
['Quotes to Scrape']</code></pre>


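<p>There are two things to note here. The first is that we added ::text to the CSS query, which means we want to select only the text directly inside the &lt;title&gt; element. If we don't specify ::text, we get the full title element, including its tags:</p>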






<pre class="prettyprint" name="code"><code class="hljs vbnet has-numbering">&gt;&gt;&gt; response.css('title').extract()
['&lt;title&gt;Quotes to Scrape&lt;/title&gt;']</code></pre>


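<p>The other thing is that the result of calling .extract() is a list, because we are dealing with an instance of SelectorList. When you know you only want the first result, as in this case, you can do:</p>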






<pre class="prettyprint" name="code"><code class="hljs python has-numbering">&gt;&gt;&gt; response.css('title::text').extract_first()
'Quotes to Scrape'</code></pre>


<p>Alternatively, you could write:</p>






<pre class="prettyprint" name="code"><code class="hljs python has-numbering">&gt;&gt;&gt; response.css('title::text')[0].extract()
'Quotes to Scrape'</code></pre>


<p>However, using .extract_first() avoids an IndexError and returns None when no element matches the selection.</p>
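<p>The difference can be pictured with plain Python lists. This is only an analogy, not Scrapy's implementation; first_or_none is a hypothetical helper that behaves the way the text describes extract_first().</p>

```python
def first_or_none(values, default=None):
    """Return the first item of a list, or a default when the list is empty."""
    return values[0] if values else default


matches = []  # stands in for the extracted result of a selector that matched nothing

try:
    matches[0]
except IndexError:
    print("indexing an empty result raises IndexError")

print(first_or_none(matches))
print(first_or_none(["Quotes to Scrape"]))
```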


<p>Besides the extract() and extract_first() methods, you can also use the re() method to extract using regular expressions:</p>






<pre class="prettyprint" name="code"><code class="hljs python has-numbering">&gt;&gt;&gt; response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
&gt;&gt;&gt; response.css('title::text').re(r'Q\w+')
['Quotes']
&gt;&gt;&gt; response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']</code></pre>
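<p>The same patterns can be tried outside the shell with Python's re module, applied to the title string directly. This is a sketch for experimenting with patterns, not a description of how the selector's re() method is implemented; note that re.findall returns tuples when a pattern has several groups, so they are flattened here to match the shell output above.</p>

```python
import re

title = "Quotes to Scrape"

print(re.findall(r"Quotes.*", title))
print(re.findall(r"Q\w+", title))

# With more than one group, re.findall yields one tuple per match;
# flatten the tuples to get a flat list of captured strings.
pairs = re.findall(r"(\w+) to (\w+)", title)
flat = [group for match in pairs for group in match]
print(flat)
```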


<p>To find the right CSS selectors to use, you can inspect the page with the developer tools in Chrome or Firefox.</p>






<h4 id="xpath選擇元素"><a name="t6" target="_blank"></a>Selecting elements with XPath</h4>


<p>Besides CSS, Scrapy selectors also support XPath expressions:</p>






<pre class="prettyprint" name="code"><code class="hljs python has-numbering">&gt;&gt;&gt; response.xpath('//title')
[&lt;Selector xpath='//title' data='&lt;title&gt;Quotes to Scrape&lt;/title&gt;'&gt;]
&gt;&gt;&gt; response.xpath('//title/text()').extract_first()
'Quotes to Scrape'</code></pre>


<p>XPath expressions are very powerful and are the foundation of Scrapy selectors. In fact, CSS selectors are converted to XPath under the hood.</p>


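<p>While perhaps not as popular as CSS selectors, XPath expressions offer more power because, besides navigating the structure, they can also look at the content. With XPath you can, for example, select the link that contains the text "Next Page". This makes XPath very well suited to scraping, and we encourage you to learn it even if you already know how to write CSS selectors; it will make scraping much easier.</p>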
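<p>To get a feel for selecting by content, the sketch below uses the limited XPath subset in Python's standard library (xml.etree.ElementTree, Python 3.7 or later for the text predicate) rather than Scrapy itself. The pager markup is made up for illustration. With Scrapy's full XPath support, an expression along the lines of //a[contains(text(), "Next Page")] should express the same idea.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical pager markup, written as well-formed XML so that the
# standard library parser accepts it.
markup = """
<ul>
    <li><a href="/page/2/">Next Page</a></li>
    <li><a href="/login">Login</a></li>
</ul>
"""

root = ET.fromstring(markup)

# The predicate [.='Next Page'] filters on the element's text content,
# not on its position, tag attributes, or class.
links = root.findall(".//a[.='Next Page']")
hrefs = [a.get("href") for a in links]
print(hrefs)
```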


<p><strong>Don't worry about covering everything at once; the details will all be explained later.</strong></p>


<ul>
<li>XPath resources: <br>
<ul><li>Using XPath with Scrapy selectors: <a href="http://scrapy.readthedocs.io/en/latest/topics/selectors.html#topics-selectors" target="_blank">http://scrapy.readthedocs.io/en/latest/topics/selectors.html#topics-selectors</a></li></ul></li>
</ul>






<h4 id="提取引號和作者"><a name="t7" target="_blank"></a>Extracting quotes and authors</h4>


<p>Each quote on <a href="http://quotes.toscrape.com" target="_blank">http://quotes.toscrape.com</a> is represented by HTML elements that look like this:</p>






<pre class="prettyprint" name="code"><code class="hljs livecodeserver has-numbering">&lt;div class="quote"&gt;
    &lt;span class="text"&gt;“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”&lt;/span&gt;
    &lt;span&gt;
        by &lt;small class="author"&gt;Albert Einstein&lt;/small&gt;
        &lt;a href="/author/Albert-Einstein"&gt;(about)&lt;/a&gt;
    &lt;/span&gt;
    &lt;div class="tags"&gt;
        Tags:
        &lt;a class="tag" href="/tag/change/page/1/"&gt;change&lt;/a&gt;
        &lt;a class="tag" href="/tag/deep-thoughts/page/1/"&gt;deep-thoughts&lt;/a&gt;
        &lt;a class="tag" href="/tag/thinking/page/1/"&gt;thinking&lt;/a&gt;
        &lt;a class="tag" href="/tag/world/page/1/"&gt;world&lt;/a&gt;
    &lt;/div&gt;
&lt;/div&gt;</code></pre>


<p>Open the Scrapy shell: <br>
<code>$ scrapy shell 'http://quotes.toscrape.com'</code> <br>
The site (which may require a proxy to reach from some networks) looks like this: <br>
<img src="http://om2o4m4w0.bkt.clouddn.com/14912244561352.jpg" alt="" title=""></p>


<p>Get the list of selectors for the quote elements: <br>
<code>&gt;&gt;&gt; response.css("div.quote")</code></p>


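<p>Each of these selectors lets us run further queries on its child elements. <br>
Assign the first selector to a variable so that we can run our CSS selectors directly on one particular quote: <br>
<code>&gt;&gt;&gt; quote = response.css("div.quote")[0]</code></p>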


<p>Now extract the title, author and tags from the quote object we just created:</p>






<pre class="prettyprint" name="code"><code class="hljs applescript has-numbering">&gt;&gt;&gt; title = quote.css("span.text::text").extract_first()
&gt;&gt;&gt; title
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
&gt;&gt;&gt; author = quote.css("small.author::text").extract_first()
&gt;&gt;&gt; author
'Albert Einstein'</code></pre>


<p>Given that the tags are a list of strings, we can use the .extract() method to get all of them:</p>






<pre class="prettyprint" name="code"><code class="hljs python has-numbering">&gt;&gt;&gt; tags = quote.css("div.tags a.tag::text").extract()
&gt;&gt;&gt; tags
['change', 'deep-thoughts', 'thinking', 'world']</code></pre>


<p>Now we can iterate over all the quote elements and put each one together into a Python dictionary:</p>






<pre class="prettyprint" name="code"><code class="hljs r has-numbering">&gt;&gt;&gt; for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))
{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
{'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
    ... a few more of these, omitted for brevity
&gt;&gt;&gt;</code></pre>


<p>The demo above covered some basic ways of extracting data; now let's integrate them into the spider we created earlier.</p>






<pre class="prettyprint" name="code"><code class="hljs python has-numbering">import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
</code></pre>


<p>If you run this spider, it will output the extracted data along with the log:</p>






<pre class="prettyprint" name="code"><code class="hljs cs has-numbering">2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from &lt;200 http://quotes.toscrape.com/page/1/&gt;
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from &lt;200 http://quotes.toscrape.com/page/1/&gt;
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}
</code></pre>






<h3 id="存取數據"><a name="t8" target="_blank"></a>Storing the data</h3>


<p>The simplest way is to specify an export file directly: <br>
<code>scrapy crawl quotes -o quotes.json</code></p>


<p>This generates a quotes.json file containing all the scraped items, serialized as JSON.</p>


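<p>For historical reasons, <strong>Scrapy appends to the given file instead of overwriting its contents. If you run this command twice without deleting the file before the second run, you will end up with a broken JSON file</strong>.</p>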
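<p>Why appending breaks the file can be seen with the standard json module alone. The two strings below are hypothetical stand-ins for the output of two export runs; a JSON document may hold only one top-level value, so two arrays written back to back no longer parse.</p>

```python
import json

# Hypothetical output of two separate export runs.
first_run = json.dumps([{"author": "Albert Einstein"}])
second_run = json.dumps([{"author": "J.K. Rowling"}])

# Appending the second export to the first leaves two top-level
# arrays in the same file.
file_contents = first_run + second_run

try:
    json.loads(file_contents)
    is_valid = True
except json.JSONDecodeError as error:
    is_valid = False
    print("not valid JSON:", error)
```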


<p>You can also use other formats, such as JSON Lines: <br>
<code>scrapy crawl quotes -o quotes.jl</code></p>
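<p>The .jl extension stands for JSON Lines, where every line is a separate JSON object. A minimal sketch, with made-up records, of why such a file stays readable even when several runs append to it:</p>

```python
import json

# Two hypothetical export runs, each writing one JSON object per line.
first_run = [{"author": "Albert Einstein"}]
second_run = [{"author": "J.K. Rowling"}]

lines = []
for items in (first_run, second_run):
    lines.extend(json.dumps(item) for item in items)
file_contents = "\n".join(lines) + "\n"

# Every line parses on its own, so appended output can still be read
# back one record at a time.
records = [json.loads(line) for line in file_contents.splitlines()]
print(records)
```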








<h3 id="鏈接界面包含的鏈接"><a name="t9" target="_blank"></a>Following links</h3>


<p>Let's say that, instead of just scraping the first two pages of <a href="http://quotes.toscrape.com" target="_blank">http://quotes.toscrape.com</a>, you want the quotes from every page on the site.</p>


<p>Now that you know how to extract data from pages, let's see how to follow the links on them.</p>


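<p>The first step is to extract the link to the page we want to follow. Examining our page, we can see that there is a link to the next page with the following markup:</p>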






<pre class="prettyprint" name="code"><code class="hljs xml has-numbering"><span class="hljs-tag">&lt;<span class="hljs-title">ul</span> <span class="hljs-attribute">class</span>=<span class="hljs-value">"pager"</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-title">li</span> <span class="hljs-attribute">class</span>=<span class="hljs-value">"next"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-title">a</span> <span class="hljs-attribute">href</span>=<span class="hljs-value">"/page/2/"</span>&gt;</span>Next <span class="hljs-tag">&lt;<span class="hljs-title">span</span> <span class="hljs-attribute">aria-hidden</span>=<span class="hljs-value">"true"</span>&gt;</span>&amp;rarr;<span class="hljs-tag">&lt;/<span class="hljs-title">span</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-title">a</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-title">li</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-title">ul</span>&gt;</span>
</code></pre>


<p>We can try extracting it in the shell:</p>






<pre class="prettyprint" name="code"><code class="hljs xml has-numbering">&gt;&gt;&gt; response.css('li.next a').extract_first()
'<span class="hljs-tag">&lt;<span class="hljs-title">a</span> <span class="hljs-attribute">href</span>=<span class="hljs-value">"/page/2/"</span>&gt;</span>Next <span class="hljs-tag">&lt;<span class="hljs-title">span</span> <span class="hljs-attribute">aria-hidden</span>=<span class="hljs-value">"true"</span>&gt;</span>→<span class="hljs-tag">&lt;/<span class="hljs-title">span</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-title">a</span>&gt;</span>'</code></pre>


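<p>This returns the anchor element, but what we want is its href attribute. For that, Scrapy supports a CSS extension that lets you select the attribute contents, like this:</p>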






<pre class="prettyprint" name="code"><code class="hljs python has-numbering"><span class="hljs-prompt">&gt;&gt;&gt; </span>response.css(<span class="hljs-string">'li.next a::attr(href)'</span>).extract_first()
<span class="hljs-string">'/page/2/'</span></code></pre>


<p>Here is our spider, modified to recursively follow the link to the next page and extract data from it:</p>






<pre class="prettyprint" name="code"><code class="hljs python has-numbering"><span class="hljs-keyword">import</span> scrapy




<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">QuotesSpider</span><span class="hljs-params">(scrapy.Spider)</span>:</span>
    name = <span class="hljs-string">"quotes"</span>
    start_urls = [
        <span class="hljs-string">'http://quotes.toscrape.com/page/1/'</span>,
    ]


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">parse</span><span class="hljs-params">(self, response)</span>:</span>
        <span class="hljs-keyword">for</span> quote <span class="hljs-keyword">in</span> response.css(<span class="hljs-string">'div.quote'</span>):
            <span class="hljs-keyword">yield</span> {
                <span class="hljs-string">'text'</span>: quote.css(<span class="hljs-string">'span.text::text'</span>).extract_first(),
                <span class="hljs-string">'author'</span>: quote.css(<span class="hljs-string">'small.author::text'</span>).extract_first(),
                <span class="hljs-string">'tags'</span>: quote.css(<span class="hljs-string">'div.tags a.tag::text'</span>).extract(),
            }


        next_page = response.css(<span class="hljs-string">'li.next a::attr(href)'</span>).extract_first()
        <span class="hljs-keyword">if</span> next_page <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">None</span>:
            next_page = response.urljoin(next_page)
            <span class="hljs-keyword">yield</span> scrapy.Request(next_page, callback=self.parse)</code></pre>


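<p>Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL with the urljoin() method (since the link can be relative), and yields a new request for the next page. It registers itself as the callback for that request, so the same extraction runs on the next page and the crawl keeps going through all the pages.</p>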
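<p>The URL resolution step can be tried on its own. This sketch uses the standard library's <code>urllib.parse.urljoin</code>, which follows the same resolution rules, with a plain string standing in for the URL of the response being parsed.</p>

```python
from urllib.parse import urljoin

# URL of the page being parsed (what response.url would hold).
current_url = "http://quotes.toscrape.com/page/1/"

# A root-relative href, like the one in the pager markup.
absolute = urljoin(current_url, "/page/2/")

# An href that is already absolute comes back unchanged.
unchanged = urljoin(current_url, "http://quotes.toscrape.com/page/3/")

print(absolute)
print(unchanged)
```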


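<p>What you see here is Scrapy's mechanism for following links: when you yield a request from a callback method, Scrapy schedules that request to be sent and registers a callback method to be executed when that request finishes.</p>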
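<p>The schedule-then-callback idea can be illustrated without Scrapy at all. The following is a toy model, not Scrapy's actual engine: <code>FAKE_SITE</code>, <code>crawl</code>, and the <code>(url, callback)</code> tuples that stand in for requests are all assumptions made for this sketch, and nothing is downloaded.</p>

```python
from collections import deque

# Hypothetical site: each URL maps to (quotes on the page, next page or None).
FAKE_SITE = {
    "/page/1/": (["quote A", "quote B"], "/page/2/"),
    "/page/2/": (["quote C"], None),
}


def parse(url):
    """Callback: yield items as dicts and follow-up requests as tuples."""
    quotes, next_page = FAKE_SITE[url]
    for text in quotes:
        yield {"text": text}
    if next_page is not None:
        yield (next_page, parse)  # (url, callback) stands in for a Request


def crawl(start_url, callback):
    """Run callbacks until no scheduled requests remain; return the items."""
    queue = deque([(start_url, callback)])
    items = []
    while queue:
        url, cb = queue.popleft()
        for result in cb(url):
            if isinstance(result, dict):
                items.append(result)  # a scraped item
            else:
                queue.append(result)  # a new request to schedule
    return items


items = crawl("/page/1/", parse)
print([item["text"] for item in items])
```

<p>The callback never fetches the next page itself. It only describes what should be fetched and which function should handle the result, and the loop decides when to run it.</p>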






<h3 id="更多示例和模式"><a name="t10" target="_blank"></a>More examples and patterns</h3>


<p>Here is another spider that illustrates callbacks and following links, this time scraping author information:</p>






<pre class="prettyprint" name="code"><code class="hljs python has-numbering"><span class="hljs-keyword">import</span> scrapy




<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">AuthorSpider</span><span class="hljs-params">(scrapy.Spider)</span>:</span>
    name = <span class="hljs-string">'author'</span>


    start_urls = [<span class="hljs-string">'http://quotes.toscrape.com/'</span>]


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">parse</span><span class="hljs-params">(self, response)</span>:</span>
        <span class="hljs-comment"># follow links to author pages</span>
        <span class="hljs-keyword">for</span> href <span class="hljs-keyword">in</span> response.css(<span class="hljs-string">'.author + a::attr(href)'</span>).extract():
            <span class="hljs-keyword">yield</span> scrapy.Request(response.urljoin(href),
                                 callback=self.parse_author)


        <span class="hljs-comment"># follow pagination links</span>
        next_page = response.css(<span class="hljs-string">'li.next a::attr(href)'</span>).extract_first()
        <span class="hljs-keyword">if</span> next_page <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">None</span>:
            next_page = response.urljoin(next_page)
            <span class="hljs-keyword">yield</span> scrapy.Request(next_page, callback=self.parse)


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">parse_author</span><span class="hljs-params">(self, response)</span>:</span>
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">extract_with_css</span><span class="hljs-params">(query)</span>:</span>
            <span class="hljs-keyword">return</span> response.css(query).extract_first().strip()


        <span class="hljs-keyword">yield</span> {
            <span class="hljs-string">'name'</span>: extract_with_css(<span class="hljs-string">'h3.author-title::text'</span>),
            <span class="hljs-string">'birthdate'</span>: extract_with_css(<span class="hljs-string">'.author-born-date::text'</span>),
            <span class="hljs-string">'bio'</span>: extract_with_css(<span class="hljs-string">'.author-description::text'</span>),
        }</code></pre>


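<p>This spider starts from the main page. It follows every link to an author page, calling the parse_author callback for each one, and it also follows the pagination links with the parse callback, as we saw before.</p>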


<p>The parse_author callback defines a helper function that extracts and cleans up the data from a CSS query, and yields a Python dict with the author data.</p>


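<p>Even if there are many quotes from the same author, we don't need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicate requests to URLs it has already visited, which avoids hitting the server too often because of a programming mistake. This behavior can be configured through the DUPEFILTER_CLASS setting.</p>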
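<p>The idea behind duplicate filtering can be sketched with a plain set. This is a simplified illustration rather than Scrapy's implementation: the class name <code>SeenUrlFilter</code> is made up, the URLs are examples, and the sketch compares raw URL strings whereas Scrapy's built-in filter works on request fingerprints.</p>

```python
class SeenUrlFilter:
    """Toy duplicate filter that remembers every URL it has been shown."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, url):
        """Return True if url was seen before; otherwise record it."""
        if url in self.seen:
            return True
        self.seen.add(url)
        return False


dupefilter = SeenUrlFilter()
author_links = [
    "http://quotes.toscrape.com/author/Albert-Einstein",
    "http://quotes.toscrape.com/author/Jane-Austen",
    "http://quotes.toscrape.com/author/Albert-Einstein",  # repeated author
]

# Only links that have not been seen yet are kept for fetching.
to_fetch = [url for url in author_links if not dupefilter.request_seen(url)]
print(to_fetch)
```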


<p>Another common pattern is to build an item from data spread across more than one page, using a trick to pass additional data to the callbacks.</p>


<p><strong>There is no need to take in everything at once; the details will all be covered in later posts.</strong></p>


<p><br></p>






<h3 id="使用爬蟲參數"><a name="t11" target="_blank"></a>Using spider arguments</h3>


<p>You can pass command-line arguments to your spiders by using the -a option when running them: <br>
<code>scrapy crawl quotes -o quotes-humor.json -a tag=humor</code></p>


<p>These arguments are passed to the Spider's <code>__init__</code> method and become spider attributes by default.</p>
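<p>As a rough illustration of "arguments become attributes", the sketch below uses a made-up <code>MiniSpider</code> class and a hypothetical <code>parse_cli_args</code> helper. Neither is part of Scrapy; they only mimic the behavior described above. Note that values given on the command line arrive as strings.</p>

```python
class MiniSpider:
    """Hypothetical stand-in for a spider base class, not scrapy.Spider itself."""

    def __init__(self, **kwargs):
        # Every keyword argument becomes an instance attribute.
        self.__dict__.update(kwargs)


def parse_cli_args(pairs):
    """Turn ['tag=humor', ...] (the values given to -a) into a dict."""
    return dict(pair.split("=", 1) for pair in pairs)


spider = MiniSpider(**parse_cli_args(["tag=humor"]))

print(spider.tag)
# An argument that was never passed is simply absent, so use getattr with a default.
print(getattr(spider, "missing", None))
```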


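<p>In the <code>scrapy crawl</code> command above, the value given for the tag argument is available as self.tag. You can use it to make your spider fetch only the quotes with a specific tag, building the URL from the argument:</p>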






<pre class="prettyprint" name="code"><code class="hljs python has-numbering"><span class="hljs-keyword">import</span> scrapy




<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">QuotesSpider</span><span class="hljs-params">(scrapy.Spider)</span>:</span>
    name = <span class="hljs-string">"quotes"</span>


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">start_requests</span><span class="hljs-params">(self)</span>:</span>
        url = <span class="hljs-string">'http://quotes.toscrape.com/'</span>
        tag = getattr(self, <span class="hljs-string">'tag'</span>, <span class="hljs-keyword">None</span>)
        <span class="hljs-keyword">if</span> tag <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">None</span>:
            url = url + <span class="hljs-string">'tag/'</span> + tag
        <span class="hljs-keyword">yield</span> scrapy.Request(url, self.parse)


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">parse</span><span class="hljs-params">(self, response)</span>:</span>
        <span class="hljs-keyword">for</span> quote <span class="hljs-keyword">in</span> response.css(<span class="hljs-string">'div.quote'</span>):
            <span class="hljs-keyword">yield</span> {
                <span class="hljs-string">'text'</span>: quote.css(<span class="hljs-string">'span.text::text'</span>).extract_first(),
                <span class="hljs-string">'author'</span>: quote.css(<span class="hljs-string">'small.author::text'</span>).extract_first(),
            }


        next_page = response.css(<span class="hljs-string">'li.next a::attr(href)'</span>).extract_first()
        <span class="hljs-keyword">if</span> next_page <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">None</span>:
            next_page = response.urljoin(next_page)
            <span class="hljs-keyword">yield</span> scrapy.Request(next_page, self.parse)</code></pre>


<p>If you pass the tag=humor argument to this spider, you will notice that it only visits URLs for the humor tag, such as <a href="http://quotes.toscrape.com/tag/humor" target="_blank">http://quotes.toscrape.com/tag/humor</a>.</p></div>