[Crawler] Dropping all footer-related content with XPath in Scrapy

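[Preface] While scraping website text with Scrapy, I noticed that footer text such as copyright notices hurts the quality of the word segmentation done afterwards, so I decided to drop everything footer-related from the page HTML. Here is the full text extracted from a page when footer content is not excluded: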

response.selector.xpath('//*[not(self::script or self::style or self::title)]/text()[normalize-space(.)]').extract()

['400-004-3535',
 '一鍵匹配貸款',
 '(爲您獲取精準貸款方案)',
 '貸款金額',
 '萬元',
 '搜索',
 '信用貸',
 '經營貸',
 '房貸',
 '車貸',
 '貸款攻略',
 '客服熱線',
 '快速申請',
 '貸款計算器',
 '熱門貸款產品',
 '紅本抵押貸款',
 '總利息:',
 '0.19',
 '萬元 \xa0月供:',
 '4325',
 '元',
 '查看',
 '\r\n\t\ufeff',
 '電腦版',
 '\xa0|\xa0',
 '關於我們',
 '版權所有©貸上我 m.dai35.com  ',
 '深圳貸上我金融服務有限公司',
 '電話諮詢',
 '400-004-3535',
 '貸款產品多?太難選',
 '一鍵委託',
 '專業爲您推薦']

 

Explore HTML Contents of Various Pages

Generally speaking, footers show up in one of the following forms:

  • <div class="footer">
	<div class="footer">
	<div class="topBtn"><a id="btn" href="#"></a></div>
	<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">電腦版</a>&nbsp;|&nbsp;<a href="about.php">關於我們</a></div>
	<div class="copyRight" style="font-size:16px;line-height:2em;">版權所有&copy;貸上我 m.dai35.com  </div>
	<div class="copyRight" style="color:#818181;font-size:16px; ">深圳貸上我金融服務有限公司</div>
	</div>
  • <footer>
            <footer>
                <div class="down" onclick="toIndex()"><a href="javascript:;"><span><b class="zrLogoSmall"></b>下載自如APP,立即簽約好房源</span></a></div>
                <ul class="ub">
                    <li class="ub-f1"><a href="//www.ziroom.com?is_m=1" target="_blank">電腦版</a></li>
                    <li class="ub-f1 borderLeft"><a href="/">觸屏版</a></li>
                    <li class="ub-f1 borderLeft"><a href="https://lnk0.com/easylink/ELxdgoYd">客戶端</a></li>
                </ul>
                <ul class="ub">
                    <li class="ub-f1"><a href="/">首頁</a></li>
                    <li class="ub-f1 borderLeft"><a href="/list">自如找房</a></li>
                </ul>
                <p class="version">Copyright©2017 ziroom.com</p>
            </footer>
  • <div id="footer">
<div id="footer">
    <div class="area">
        <div class="clearfix">
            <div class="glbLeft">
                <dl class="fList">
                    <dt>關於我們</dt>
                    <dd>
                        <a href="http://www.ziroom.com/zhaopin/index.php?r=site/about">關於自如</a>
                        <a href="http://www.ziroom.com/about/lianxi.html">聯繫自如</a>
                        <a href="http://www.ziroom.com/zhaopin/">加入自如</a>
                    </dd>
                </dl>
                <dl class="fList">
                    <dt>自如業務</dt>
                    <dd>
                        <a href="http://www.ziroom.com/about/fuwu.html">業務體系</a>
                        <a href="http://www.ziroom.com/about/fuwu.html">自如產品</a>
                        <a href="http://www.ziroom.com/servicecentre/">自如服務</a>
                        <a href="http://www.ziroom.com/purchase/">自如採購</a>
                    </dd>
                </dl>
                <dl class="fList">
                    <dt>關注自如</dt>
                    <dd>
                        <a>自如客微信</a>
                        <a>下載app</a>
                    </dd>
                </dl>
            </div>

            <div class="glbRight">
                <div class="img">
                    <img src="//static8.ziroom.com/phoenix/pc/images/zrk_ewm.png?v=20180102">
                    <p>關注自如客微信</p>
                </div>
                <div class="img">
                    <img src="http://www.ziroom.com/static/2015/images/common/app-min-qrcode.png?v=20180102">
                    <p>下載自如app</p>
                </div><!--/img-->
            </div><!--/glbRight-->
        </div><!--/clearfix-->
		
        <div class="linksFooter"></div>

        <div class="footerBottom pr">
            <p>北京自如信息科技有限公司 Copyright@2018 ziroom.com 版權所有 京ICP備16015349號-1</p>
            <p>本網站所有頁面的數據統計均來源於自如數據庫 &nbsp;&nbsp;聯繫客服:自如客微信  週一至週日09:00-22:00</p>
            <a key ="553dfddf58725379d18ae6b4" style="position: absolute; right: 0; top: 0;"  logo_size="124x47"  logo_type="business"  href="http://www.anquan.org" ><script src="http://static.anquan.org/static/outer/js/aq_auth.js"></script></a>
        </div>
    </div><!--/area-->
</div><!--/footer-->

How to Extract Footers Using XPath

Open the Scrapy shell and fetch a page:

scrapy shell "http://m.dai35.com/"
response.selector.xpath('//*').extract()
......
 '<a href="#">熱門貸款產品</a>',
 '<div class="prolist">\r\n        <a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">紅本抵押貸款</h3>\r\n         <p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>\r\n        </div>',
 '<a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">紅本抵押貸款</h3>\r\n         <p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>',
 '<img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">',
 '<h3 class="prolist_name">紅本抵押貸款</h3>',
 '<p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>',
 '<font color="#e10014">0.19</font>',
 '<font color="#003f97">4325</font>',
 '<p class="prolist_infop2"></p>',
 '<span class="prolist_jiantou">查看</span>',
 '<div class="footer">\r\n\t<div class="topBtn"><a id="btn" href="#"></a></div>\r\n\t<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">電腦版</a>\xa0|\xa0<a href="about.php">關於我們</a></div>\r\n\t<div class="copyRight" style="font-size:16px;line-height:2em;">版權所有©貸上我 m.dai35.com  </div>\r\n\t<div class="copyRight" style="color:#818181;font-size:16px; ">深圳貸上我金融服務有限公司</div>\r\n</div>',
 '<div class="topBtn"><a id="btn" href="#"></a></div>',
 '<a id="btn" href="#"></a>',
 '<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">電腦版</a>\xa0|\xa0<a href="about.php">關於我們</a></div>',
 '<a href="http://www.dai35.com/" target="_blank">電腦版</a>',
 '<a href="about.php">關於我們</a>',
 '<div class="copyRight" style="font-size:16px;line-height:2em;">版權所有©貸上我 m.dai35.com  </div>',
 '<div class="copyRight" style="color:#818181;font-size:16px; ">深圳貸上我金融服務有限公司</div>',
 '<br>',
 '<br>',
 '<script>\r\nvar _hmt = _hmt || [];\r\n(function() {\r\n  var hm = document.createElement("script");\r\n  hm.src = "//hm.baidu.com/hm.js?019c6f23eb312175c188d45037833554";\r\n  var s = document.getElementsByTagName("script")[0];\r\n  s.parentNode.insertBefore(hm, s);\r\n})();\r\n</script>',
......


response.selector.xpath('//*[self::footer or contains(@id,"footer") or contains(@class,"footer")]').extract()
['<div class="footer">\r\n\t<div class="topBtn"><a id="btn" href="#"></a></div>\r\n\t<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">電腦版</a>\xa0|\xa0<a href="about.php">關於我們</a></div>\r\n\t<div class="copyRight" style="font-size:16px;line-height:2em;">版權所有©貸上我 m.dai35.com  </div>\r\n\t<div class="copyRight" style="color:#818181;font-size:16px; ">深圳貸上我金融服務有限公司</div>\r\n</div>']

The natural next thought is that, to leave this part out, we can simply negate the condition:

response.selector.xpath('//*[not(self::footer or contains(@id,"footer") or contains(@class,"footer"))]').extract()

......
 '<a href="#">熱門貸款產品</a>',
 '<div class="prolist">\r\n        <a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">紅本抵押貸款</h3>\r\n         <p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>\r\n        </div>',
 '<a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">紅本抵押貸款</h3>\r\n         <p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>',
 '<img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">',
 '<h3 class="prolist_name">紅本抵押貸款</h3>',
 '<p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>',
 '<font color="#e10014">0.19</font>',
 '<font color="#003f97">4325</font>',
 '<p class="prolist_infop2"></p>',
 '<span class="prolist_jiantou">查看</span>',
 '<div class="topBtn"><a id="btn" href="#"></a></div>',
 '<a id="btn" href="#"></a>',
 '<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">電腦版</a>\xa0|\xa0<a href="about.php">關於我們</a></div>',
 '<a href="http://www.dai35.com/" target="_blank">電腦版</a>',
 '<a href="about.php">關於我們</a>',
 '<div class="copyRight" style="font-size:16px;line-height:2em;">版權所有©貸上我 m.dai35.com  </div>',
 '<div class="copyRight" style="color:#818181;font-size:16px; ">深圳貸上我金融服務有限公司</div>',
 '<br>',
 '<br>',
 '<script>\r\nvar _hmt = _hmt || [];\r\n(function() {\r\n  var hm = document.createElement("script");\r\n  hm.src = "//hm.baidu.com/hm.js?019c6f23eb312175c188d45037833554";\r\n  var s = document.getElementsByTagName("script")[0];\r\n  s.parentNode.insertBefore(hm, s);\r\n})();\r\n</script>',
......

However, the only difference between this result and the previous one is that a single entry is missing:

'<div class="footer">\r\n\t<div class="topBtn"><a id="btn" href="#"></a></div>\r\n\t<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">電腦版</a>\xa0|\xa0<a href="about.php">關於我們</a></div>\r\n\t<div class="copyRight" style="font-size:16px;line-height:2em;">版權所有©貸上我 m.dai35.com  </div>\r\n\t<div class="copyRight" style="color:#818181;font-size:16px; ">深圳貸上我金融服務有限公司</div>\r\n</div>'

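The entries below are still present, though. `//*` visits every element on its own and the predicate only tests that one element, so the children of the footer pass the test even though the footer itself does not. As a result, the text we finally extract would still contain footer content. So how should the expression be written to really remove footer content from the result?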

'<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">電腦版</a>\xa0|\xa0<a href="about.php">關於我們</a></div>',
 '<a href="http://www.dai35.com/" target="_blank">電腦版</a>',
 '<a href="about.php">關於我們</a>',
 '<div class="copyRight" style="font-size:16px;line-height:2em;">版權所有©貸上我 m.dai35.com  </div>',
 '<div class="copyRight" style="color:#818181;font-size:16px; ">深圳貸上我金融服務有限公司</div>',

Exclude Footers of Any Kind in Results

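In fact, what we need to exclude is the footer node itself together with all of its descendants, that is, every element that either is a footer or has a footer among its ancestors. The ancestor-or-self axis expresses exactly this. Note that self::footer tests the element's own name, whereas a bare footer inside the predicate would test for a child element named footer. With this expression, all footer-related content is gone from the result:

response.selector.xpath('//*[not(ancestor-or-self::*[contains(@id,"footer") or contains(@class,"footer") or self::footer])]').extract()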

......
 '<a href="#">熱門貸款產品</a>',
 '<div class="prolist">\r\n        <a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">紅本抵押貸款</h3>\r\n         <p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>\r\n        </div>',
 '<a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">紅本抵押貸款</h3>\r\n         <p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>',
 '<img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">',
 '<h3 class="prolist_name">紅本抵押貸款</h3>',
 '<p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>',
 '<font color="#e10014">0.19</font>',
 '<font color="#003f97">4325</font>',
 '<p class="prolist_infop2"></p>',
 '<span class="prolist_jiantou">查看</span>',
 '<br>',
 '<br>',
 '<script>\r\nvar _hmt = _hmt || [];\r\n(function() {\r\n  var hm = document.createElement("script");\r\n  hm.src = "//hm.baidu.com/hm.js?019c6f23eb312175c188d45037833554";\r\n  var s = document.getElementsByTagName("script")[0];\r\n  s.parentNode.insertBefore(hm, s);\r\n})();\r\n</script>',
......

Function not(boolean) in XPath

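The function doing the real work here is not(), the one named in the heading. If we want to exclude elements that are footers or sit inside one, and also exclude elements that are themselves script, title or style, we combine two not() calls with and: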

response.selector.xpath('//*[not(ancestor-or-self::*[contains(@id,"footer") or contains(@class,"footer") or self::footer]) and not(self::script or self::style or self::title)]').extract()

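Finally, we select every text() node under the elements that remain after these exclusions (see below). Compared with the text at the top of the article, a lot of the noise is gone: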

response.selector.xpath('//*[not(ancestor-or-self::*[contains(@id,"footer") or contains(@class,"footer") or self::footer]) and not(self::script or self::style or self::title)]/text()[normalize-space(.)]').extract()

['400-004-3535',
 '一鍵匹配貸款',
 '(爲您獲取精準貸款方案)',
 '貸款金額',
 '萬元',
 '搜索',
 '信用貸',
 '經營貸',
 '房貸',
 '車貸',
 '貸款攻略',
 '客服熱線',
 '快速申請',
 '貸款計算器',
 '熱門貸款產品',
 '紅本抵押貸款',
 '總利息:',
 '0.19',
 '萬元 \xa0月供:',
 '4325',
 '元',
 '查看',
 '\r\n\t\ufeff',
 '電話諮詢',
 '400-004-3535',
 '貸款產品多?太難選',
 '一鍵委託',
 '專業爲您推薦']

[Reference] https://stackoverflow.com/questions/49221014/scrapy-linkextractor-restrict-paths-exclude-tags
