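[Preface] While scraping website text with Scrapy, I found that footer text such as copyright notices hurts the quality of the downstream word segmentation, so I decided to discard everything footer-related from the page's HTML. Below is all the text extracted from a page when the footer is not excluded: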
response.selector.xpath('//*[not(self::script or self::style or self::title)]/text()[normalize-space(.)]').extract()
['400-004-3535',
'一鍵匹配貸款',
'(爲您獲取精準貸款方案)',
'貸款金額',
'萬元',
'搜索',
'信用貸',
'經營貸',
'房貸',
'車貸',
'貸款攻略',
'客服熱線',
'快速申請',
'貸款計算器',
'熱門貸款產品',
'紅本抵押貸款',
'總利息:',
'0.19',
'萬元 \xa0月供:',
'4325',
'元',
'查看',
'\r\n\t\ufeff',
'電腦版',
'\xa0|\xa0',
'關於我們',
'版權所有©貸上我 m.dai35.com ',
'深圳貸上我金融服務有限公司',
'電話諮詢',
'400-004-3535',
'貸款產品多?太難選',
'一鍵委託',
'專業爲您推薦']
Explore HTML Contents of Various Pages
Generally speaking, a footer appears in one of the following forms:
- <div class="footer">
<div class="footer">
<div class="topBtn"><a id="btn" href="#"></a></div>
<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">電腦版</a> | <a href="about.php">關於我們</a></div>
<div class="copyRight" style="font-size:16px;line-height:2em;">版權所有©貸上我 m.dai35.com </div>
<div class="copyRight" style="color:#818181;font-size:16px; ">深圳貸上我金融服務有限公司</div>
</div>
- <footer>
<footer>
<div class="down" onclick="toIndex()"><a href="javascript:;"><span><b class="zrLogoSmall"></b>下載自如APP,立即簽約好房源</span></a></div>
<ul class="ub">
<li class="ub-f1"><a href="//www.ziroom.com?is_m=1" target="_blank">電腦版</a></li>
<li class="ub-f1 borderLeft"><a href="/">觸屏版</a></li>
<li class="ub-f1 borderLeft"><a href="https://lnk0.com/easylink/ELxdgoYd">客戶端</a></li>
</ul>
<ul class="ub">
<li class="ub-f1"><a href="/">首頁</a></li>
<li class="ub-f1 borderLeft"><a href="/list">自如找房</a></li>
</ul>
<p class="version">Copyright©2017 ziroom.com</p>
</footer>
- id="footer"
<div id="footer">
<div class="area">
<div class="clearfix">
<div class="glbLeft">
<dl class="fList">
<dt>關於我們</dt>
<dd>
<a href="http://www.ziroom.com/zhaopin/index.php?r=site/about">關於自如</a>
<a href="http://www.ziroom.com/about/lianxi.html">聯繫自如</a>
<a href="http://www.ziroom.com/zhaopin/">加入自如</a>
</dd>
</dl>
<dl class="fList">
<dt>自如業務</dt>
<dd>
<a href="http://www.ziroom.com/about/fuwu.html">業務體系</a>
<a href="http://www.ziroom.com/about/fuwu.html">自如產品</a>
<a href="http://www.ziroom.com/servicecentre/">自如服務</a>
<a href="http://www.ziroom.com/purchase/">自如採購</a>
</dd>
</dl>
<dl class="fList">
<dt>關注自如</dt>
<dd>
<a>自如客微信</a>
<a>下載app</a>
</dd>
</dl>
</div>
<div class="glbRight">
<div class="img">
<img src="//static8.ziroom.com/phoenix/pc/images/zrk_ewm.png?v=20180102">
<p>關注自如客微信</p>
</div>
<div class="img">
<img src="http://www.ziroom.com/static/2015/images/common/app-min-qrcode.png?v=20180102">
<p>下載自如app</p>
</div><!--/img-->
</div><!--/glbRight-->
</div><!--/clearfix-->
<div class="linksFooter"></div>
<div class="footerBottom pr">
<p>北京自如信息科技有限公司 Copyright@2018 ziroom.com 版權所有 京ICP備16015349號-1</p>
<p>本網站所有頁面的數據統計均來源於自如數據庫 聯繫客服:自如客微信 週一至週日09:00-22:00</p>
<a key ="553dfddf58725379d18ae6b4" style="position: absolute; right: 0; top: 0;" logo_size="124x47" logo_type="business" href="http://www.anquan.org" ><script src="http://static.anquan.org/static/outer/js/aq_auth.js"></script></a>
</div>
</div><!--/area-->
</div><!--/footer-->
How to Extract Footers Using XPath
Open the Scrapy shell and fetch a page:
scrapy shell "http://m.dai35.com/"
response.selector.xpath('//*').extract()
......
'<a href="#">熱門貸款產品</a>',
'<div class="prolist">\r\n <a class="prolistLink relative wid01" href="loanshow.php?cid=12&tid=0&id=46&m=5&t=12">\r\n <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n <h3 class="prolist_name">紅本抵押貸款</h3>\r\n <p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n <p class="prolist_infop2"></p>\r\n <span class="prolist_jiantou">查看</span>\r\n </a>\r\n </div>',
'<a class="prolistLink relative wid01" href="loanshow.php?cid=12&tid=0&id=46&m=5&t=12">\r\n <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n <h3 class="prolist_name">紅本抵押貸款</h3>\r\n <p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n <p class="prolist_infop2"></p>\r\n <span class="prolist_jiantou">查看</span>\r\n </a>',
'<img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">',
'<h3 class="prolist_name">紅本抵押貸款</h3>',
'<p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>',
'<font color="#e10014">0.19</font>',
'<font color="#003f97">4325</font>',
'<p class="prolist_infop2"></p>',
'<span class="prolist_jiantou">查看</span>',
'<div class="footer">\r\n\t<div class="topBtn"><a id="btn" href="#"></a></div>\r\n\t<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">電腦版</a>\xa0|\xa0<a href="about.php">關於我們</a></div>\r\n\t<div class="copyRight" style="font-size:16px;line-height:2em;">版權所有©貸上我 m.dai35.com </div>\r\n\t<div class="copyRight" style="color:#818181;font-size:16px; ">深圳貸上我金融服務有限公司</div>\r\n</div>',
'<div class="topBtn"><a id="btn" href="#"></a></div>',
'<a id="btn" href="#"></a>',
'<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">電腦版</a>\xa0|\xa0<a href="about.php">關於我們</a></div>',
'<a href="http://www.dai35.com/" target="_blank">電腦版</a>',
'<a href="about.php">關於我們</a>',
'<div class="copyRight" style="font-size:16px;line-height:2em;">版權所有©貸上我 m.dai35.com </div>',
'<div class="copyRight" style="color:#818181;font-size:16px; ">深圳貸上我金融服務有限公司</div>',
'<br>',
'<br>',
'<script>\r\nvar _hmt = _hmt || [];\r\n(function() {\r\n var hm = document.createElement("script");\r\n hm.src = "//hm.baidu.com/hm.js?019c6f23eb312175c188d45037833554";\r\n var s = document.getElementsByTagName("script")[0];\r\n s.parentNode.insertBefore(hm, s);\r\n})();\r\n</script>',
......
response.selector.xpath('//*[self::footer or contains(@id,"footer") or contains(@class,"footer")]').extract()
['<div class="footer">\r\n\t<div class="topBtn"><a id="btn" href="#"></a></div>\r\n\t<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">電腦版</a>\xa0|\xa0<a href="about.php">關於我們</a></div>\r\n\t<div class="copyRight" style="font-size:16px;line-height:2em;">版權所有©貸上我 m.dai35.com </div>\r\n\t<div class="copyRight" style="color:#818181;font-size:16px; ">深圳貸上我金融服務有限公司</div>\r\n</div>']
A natural first thought is that, to leave this part out, we can simply negate the condition:
response.selector.xpath('//*[not(self::footer or contains(@id,"footer") or contains(@class,"footer"))]').extract()
......
'<a href="#">熱門貸款產品</a>',
'<div class="prolist">\r\n <a class="prolistLink relative wid01" href="loanshow.php?cid=12&tid=0&id=46&m=5&t=12">\r\n <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n <h3 class="prolist_name">紅本抵押貸款</h3>\r\n <p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n <p class="prolist_infop2"></p>\r\n <span class="prolist_jiantou">查看</span>\r\n </a>\r\n </div>',
'<a class="prolistLink relative wid01" href="loanshow.php?cid=12&tid=0&id=46&m=5&t=12">\r\n <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n <h3 class="prolist_name">紅本抵押貸款</h3>\r\n <p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n <p class="prolist_infop2"></p>\r\n <span class="prolist_jiantou">查看</span>\r\n </a>',
'<img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">',
'<h3 class="prolist_name">紅本抵押貸款</h3>',
'<p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>',
'<font color="#e10014">0.19</font>',
'<font color="#003f97">4325</font>',
'<p class="prolist_infop2"></p>',
'<span class="prolist_jiantou">查看</span>',
'<div class="topBtn"><a id="btn" href="#"></a></div>',
'<a id="btn" href="#"></a>',
'<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">電腦版</a>\xa0|\xa0<a href="about.php">關於我們</a></div>',
'<a href="http://www.dai35.com/" target="_blank">電腦版</a>',
'<a href="about.php">關於我們</a>',
'<div class="copyRight" style="font-size:16px;line-height:2em;">版權所有©貸上我 m.dai35.com </div>',
'<div class="copyRight" style="color:#818181;font-size:16px; ">深圳貸上我金融服務有限公司</div>',
'<br>',
'<br>',
'<script>\r\nvar _hmt = _hmt || [];\r\n(function() {\r\n var hm = document.createElement("script");\r\n hm.src = "//hm.baidu.com/hm.js?019c6f23eb312175c188d45037833554";\r\n var s = document.getElementsByTagName("script")[0];\r\n s.parentNode.insertBefore(hm, s);\r\n})();\r\n</script>',
......
However, the only difference between this result and the earlier `//*` result is that one entry is missing:
'<div class="footer">\r\n\t<div class="topBtn"><a id="btn" href="#"></a></div>\r\n\t<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">電腦版</a>\xa0|\xa0<a href="about.php">關於我們</a></div>\r\n\t<div class="copyRight" style="font-size:16px;line-height:2em;">版權所有©貸上我 m.dai35.com </div>\r\n\t<div class="copyRight" style="color:#818181;font-size:16px; ">深圳貸上我金融服務有限公司</div>\r\n</div>'
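The footer's child elements (for example those listed below) are still present in the result, which means the text we eventually extract would still contain footer content. So how should the expression be written to really remove the footer content from the result?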
'<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">電腦版</a>\xa0|\xa0<a href="about.php">關於我們</a></div>',
'<a href="http://www.dai35.com/" target="_blank">電腦版</a>',
'<a href="about.php">關於我們</a>',
'<div class="copyRight" style="font-size:16px;line-height:2em;">版權所有©貸上我 m.dai35.com </div>',
'<div class="copyRight" style="color:#818181;font-size:16px; ">深圳貸上我金融服務有限公司</div>',
Exclude Footers of Any Kind in Results
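What we actually need is to exclude the footer node itself together with all of its descendants, that is, every element that has a footer on its ancestor-or-self axis (the element test is written self::footer so that it applies to the node being tested rather than to its children). With this approach, all footer-related content is gone:
response.selector.xpath('//*[not(ancestor-or-self::*[self::footer or contains(@id,"footer") or contains(@class,"footer")])]').extract()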
......
'<a href="#">熱門貸款產品</a>',
'<div class="prolist">\r\n <a class="prolistLink relative wid01" href="loanshow.php?cid=12&tid=0&id=46&m=5&t=12">\r\n <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n <h3 class="prolist_name">紅本抵押貸款</h3>\r\n <p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n <p class="prolist_infop2"></p>\r\n <span class="prolist_jiantou">查看</span>\r\n </a>\r\n </div>',
'<a class="prolistLink relative wid01" href="loanshow.php?cid=12&tid=0&id=46&m=5&t=12">\r\n <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n <h3 class="prolist_name">紅本抵押貸款</h3>\r\n <p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n <p class="prolist_infop2"></p>\r\n <span class="prolist_jiantou">查看</span>\r\n </a>',
'<img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">',
'<h3 class="prolist_name">紅本抵押貸款</h3>',
'<p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>',
'<font color="#e10014">0.19</font>',
'<font color="#003f97">4325</font>',
'<p class="prolist_infop2"></p>',
'<span class="prolist_jiantou">查看</span>',
'<br>',
'<br>',
'<script>\r\nvar _hmt = _hmt || [];\r\n(function() {\r\n var hm = document.createElement("script");\r\n hm.src = "//hm.baidu.com/hm.js?019c6f23eb312175c188d45037833554";\r\n var s = document.getElementsByTagName("script")[0];\r\n s.parentNode.insertBefore(hm, s);\r\n})();\r\n</script>',
......
Function not(boolean) in XPath
The key player here is the not function named in the heading. If we want to exclude both the elements whose ancestor or self is a footer and the elements that are themselves script, title, or style, we write:
response.selector.xpath('//*[not(ancestor-or-self::*[self::footer or contains(@id,"footer") or contains(@class,"footer")]) and not(self::script or self::style or self::title)]').extract()
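Finally, we select all the text content that remains after these exclusions (see below). Compared with the text obtained at the beginning of the article, the footer entries ('電腦版', '關於我們', and the two copyright lines) are gone:
response.selector.xpath('//*[not(ancestor-or-self::*[self::footer or contains(@id,"footer") or contains(@class,"footer")]) and not(self::script or self::style or self::title)]/text()[normalize-space(.)]').extract()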
['400-004-3535',
'一鍵匹配貸款',
'(爲您獲取精準貸款方案)',
'貸款金額',
'萬元',
'搜索',
'信用貸',
'經營貸',
'房貸',
'車貸',
'貸款攻略',
'客服熱線',
'快速申請',
'貸款計算器',
'熱門貸款產品',
'紅本抵押貸款',
'總利息:',
'0.19',
'萬元 \xa0月供:',
'4325',
'元',
'查看',
'\r\n\t\ufeff',
'電話諮詢',
'400-004-3535',
'貸款產品多?太難選',
'一鍵委託',
'專業爲您推薦']
[Reference] https://stackoverflow.com/questions/49221014/scrapy-linkextractor-restrict-paths-exclude-tags