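[Preface] While crawling website text with Scrapy, I found that footer text such as copyright notices hurts the quality of the downstream word segmentation, so I decided to discard everything in a page's HTML that belongs to the footer. Below is all the text extracted from a page when the footer is not excluded: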

response.selector.xpath('//*[not(self::script or self::style or self::title)]/text()[normalize-space(.)]').extract()

['400-004-3535',
 '一鍵匹配貸款',
 '(爲您獲取精準貸款方案)',
 '貸款金額',
 '萬元',
 '搜索',
 '信用貸',
 '經營貸',
 '房貸',
 '車貸',
 '貸款攻略',
 '客服熱線',
 '快速申請',
 '貸款計算器',
 '熱門貸款產品',
 '紅本抵押貸款',
 '總利息:',
 '0.19',
 '萬元 \xa0月供:',
 '4325',
 '元',
 '查看',
 '\r\n\t\ufeff',
 '電腦版',
 '\xa0|\xa0',
 '關於我們',
 '版權所有©貸上我 m.dai35.com  ',
 '深圳貸上我金融服務有限公司',
 '電話諮詢',
 '400-004-3535',
 '貸款產品多？太難選',
 '一鍵委託',
 '專業爲您推薦']

Explore HTML Contents of Various Pages

Footers generally appear in one of the following forms:

<div class="footer">

	<div class="footer">
	<div class="topBtn"><a id="btn" href="#"></a></div>
	<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">電腦版</a>&nbsp;|&nbsp;<a href="about.php">關於我們</a></div>
	<div class="copyRight" style="font-size:16px;line-height:2em;">版權所有&copy;貸上我 m.dai35.com  </div>
	<div class="copyRight" style="color:#818181;font-size:16px; ">深圳貸上我金融服務有限公司</div>
	</div>

<footer>

            <footer>
                <div class="down" onclick="toIndex()"><a href="javascript:;"><span><b class="zrLogoSmall"></b>下載自如APP,立即簽約好房源</span></a></div>
                <ul class="ub">
                    <li class="ub-f1"><a href="//www.ziroom.com?is_m=1" target="_blank">電腦版</a></li>
                    <li class="ub-f1 borderLeft"><a href="/">觸屏版</a></li>
                    <li class="ub-f1 borderLeft"><a href="https://lnk0.com/easylink/ELxdgoYd">客戶端</a></li>
                </ul>
                <ul class="ub">
                    <li class="ub-f1"><a href="/">首頁</a></li>
                    <li class="ub-f1 borderLeft"><a href="/list">自如找房</a></li>
                </ul>
                <p class="version">Copyright©2017 ziroom.com</p>
            </footer>

id="footer"

<div id="footer">
    <div class="area">
        <div class="clearfix">
            <div class="glbLeft">
                <dl class="fList">
                    <dt>關於我們</dt>
                    <dd>
                        <a href="http://www.ziroom.com/zhaopin/index.php?r=site/about">關於自如</a>
                        <a href="http://www.ziroom.com/about/lianxi.html">聯繫自如</a>
                        <a href="http://www.ziroom.com/zhaopin/">加入自如</a>
                    </dd>
                </dl>
                <dl class="fList">
                    <dt>自如業務</dt>
                    <dd>
                        <a href="http://www.ziroom.com/about/fuwu.html">業務體系</a>
                        <a href="http://www.ziroom.com/about/fuwu.html">自如產品</a>
                        <a href="http://www.ziroom.com/servicecentre/">自如服務</a>
                        <a href="http://www.ziroom.com/purchase/">自如採購</a>
                    </dd>
                </dl>
                <dl class="fList">
                    <dt>關注自如</dt>
                    <dd>
                        <a>自如客微信</a>
                        <a>下載app</a>
                    </dd>
                </dl>
            </div>

            <div class="glbRight">
                <div class="img">
                    <img src="//static8.ziroom.com/phoenix/pc/images/zrk_ewm.png?v=20180102">
                    <p>關注自如客微信</p>
                </div>
                <div class="img">
                    <img src="http://www.ziroom.com/static/2015/images/common/app-min-qrcode.png?v=20180102">
                    <p>下載自如app</p>
                </div><!--/img-->
            </div><!--/glbRight-->
        </div><!--/clearfix-->
		
        <div class="linksFooter"></div>

        <div class="footerBottom pr">
            <p>北京自如信息科技有限公司 Copyright@2018 ziroom.com 版權所有 京ICP備16015349號-1</p>
            <p>本網站所有頁面的數據統計均來源於自如數據庫 &nbsp;&nbsp;聯繫客服：自如客微信  週一至週日09:00-22:00</p>
            <a key ="553dfddf58725379d18ae6b4" style="position: absolute; right: 0; top: 0;"  logo_size="124x47"  logo_type="business"  href="http://www.anquan.org" ><script src="http://static.anquan.org/static/outer/js/aq_auth.js"></script></a>
        </div>
    </div><!--/area-->
</div><!--/footer-->
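The three forms above can be summarised as one test on a tag name and its attributes. The following is a minimal sketch in plain Python, not part of Scrapy; the helper name is_footer_like is hypothetical, and it assumes a case-sensitive substring match on id and class, which may over-match or under-match on other sites.

```python
def is_footer_like(tag, attrs):
    """Return True for <footer>, or for any tag whose id or class contains "footer".

    `attrs` is a list of (name, value) pairs. The match is a case-sensitive
    substring test, so it is only a heuristic.
    """
    attrs = dict(attrs)
    return (
        tag == "footer"
        or "footer" in (attrs.get("id") or "")
        or "footer" in (attrs.get("class") or "")
    )


examples = [
    ("div", [("class", "footer")]),           # first form
    ("footer", []),                           # second form
    ("div", [("id", "footer")]),              # third form
    ("div", [("class", "footerBottom pr")]),  # nested block in the third sample
    ("div", [("class", "linksFooter")]),      # capital F: not matched
    ("div", [("class", "copyRight")]),        # only identifiable through its parent
]
for tag, attrs in examples:
    print(tag, attrs, is_footer_like(tag, attrs))
```

Because the test is case-sensitive, class="linksFooter" is not recognised on its own; it is only covered when the exclusion also takes the enclosing footer element into account.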

How to Extract Footers Using XPath

Open the Scrapy shell, fetch a page, and list every element:

scrapy shell "http://m.dai35.com/"
response.selector.xpath('//*').extract()
......
 '<a href="#">熱門貸款產品</a>',
 '<div class="prolist">\r\n        <a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">紅本抵押貸款</h3>\r\n         <p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>\r\n        </div>',
 '<a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">紅本抵押貸款</h3>\r\n         <p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>',
 '<img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">',
 '<h3 class="prolist_name">紅本抵押貸款</h3>',
 '<p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>',
 '<font color="#e10014">0.19</font>',
 '<font color="#003f97">4325</font>',
 '<p class="prolist_infop2"></p>',
 '<span class="prolist_jiantou">查看</span>',
 '<div class="footer">\r\n\t<div class="topBtn"><a id="btn" href="#"></a></div>\r\n\t<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">電腦版</a>\xa0|\xa0<a href="about.php">關於我們</a></div>\r\n\t<div class="copyRight" style="font-size:16px;line-height:2em;">版權所有©貸上我 m.dai35.com  </div>\r\n\t<div class="copyRight" style="color:#818181;font-size:16px; ">深圳貸上我金融服務有限公司</div>\r\n</div>',
 '<div class="topBtn"><a id="btn" href="#"></a></div>',
 '<a id="btn" href="#"></a>',
 '<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">電腦版</a>\xa0|\xa0<a href="about.php">關於我們</a></div>',
 '<a href="http://www.dai35.com/" target="_blank">電腦版</a>',
 '<a href="about.php">關於我們</a>',
 '<div class="copyRight" style="font-size:16px;line-height:2em;">版權所有©貸上我 m.dai35.com  </div>',
 '<div class="copyRight" style="color:#818181;font-size:16px; ">深圳貸上我金融服務有限公司</div>',
 '<br>',
 '<br>',
 '<script>\r\nvar _hmt = _hmt || [];\r\n(function() {\r\n  var hm = document.createElement("script");\r\n  hm.src = "//hm.baidu.com/hm.js?019c6f23eb312175c188d45037833554";\r\n  var s = document.getElementsByTagName("script")[0];\r\n  s.parentNode.insertBefore(hm, s);\r\n})();\r\n</script>',
......


response.selector.xpath('//*[self::footer or contains(@id,"footer") or contains(@class,"footer")]').extract()
['<div class="footer">\r\n\t<div class="topBtn"><a id="btn" href="#"></a></div>\r\n\t<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">電腦版</a>\xa0|\xa0<a href="about.php">關於我們</a></div>\r\n\t<div class="copyRight" style="font-size:16px;line-height:2em;">版權所有©貸上我 m.dai35.com  </div>\r\n\t<div class="copyRight" style="color:#818181;font-size:16px; ">深圳貸上我金融服務有限公司</div>\r\n</div>']

A natural first attempt at leaving this part out is simply to negate the condition:

response.selector.xpath('//*[not(self::footer or contains(@id,"footer") or contains(@class,"footer"))]').extract()

......
 '<a href="#">熱門貸款產品</a>',
 '<div class="prolist">\r\n        <a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">紅本抵押貸款</h3>\r\n         <p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>\r\n        </div>',
 '<a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">紅本抵押貸款</h3>\r\n         <p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>',
 '<img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">',
 '<h3 class="prolist_name">紅本抵押貸款</h3>',
 '<p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>',
 '<font color="#e10014">0.19</font>',
 '<font color="#003f97">4325</font>',
 '<p class="prolist_infop2"></p>',
 '<span class="prolist_jiantou">查看</span>',
 '<div class="topBtn"><a id="btn" href="#"></a></div>',
 '<a id="btn" href="#"></a>',
 '<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">電腦版</a>\xa0|\xa0<a href="about.php">關於我們</a></div>',
 '<a href="http://www.dai35.com/" target="_blank">電腦版</a>',
 '<a href="about.php">關於我們</a>',
 '<div class="copyRight" style="font-size:16px;line-height:2em;">版權所有©貸上我 m.dai35.com  </div>',
 '<div class="copyRight" style="color:#818181;font-size:16px; ">深圳貸上我金融服務有限公司</div>',
 '<br>',
 '<br>',
 '<script>\r\nvar _hmt = _hmt || [];\r\n(function() {\r\n  var hm = document.createElement("script");\r\n  hm.src = "//hm.baidu.com/hm.js?019c6f23eb312175c188d45037833554";\r\n  var s = document.getElementsByTagName("script")[0];\r\n  s.parentNode.insertBefore(hm, s);\r\n})();\r\n</script>',
......

However, the only difference between this result and the earlier //* result is that one entry is now missing:

'<div class="footer">\r\n\t<div class="topBtn"><a id="btn" href="#"></a></div>\r\n\t<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">電腦版</a>\xa0|\xa0<a href="about.php">關於我們</a></div>\r\n\t<div class="copyRight" style="font-size:16px;line-height:2em;">版權所有©貸上我 m.dai35.com  </div>\r\n\t<div class="copyRight" style="color:#818181;font-size:16px; ">深圳貸上我金融服務有限公司</div>\r\n</div>'

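But the entries below are still present. //* matches every element independently and the predicate only tests the element itself, so the footer's descendants still pass, and the text we finally extract still contains footer content. How should the expression be written so that footer content is really removed from the result?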

'<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">電腦版</a>\xa0|\xa0<a href="about.php">關於我們</a></div>',
 '<a href="http://www.dai35.com/" target="_blank">電腦版</a>',
 '<a href="about.php">關於我們</a>',
 '<div class="copyRight" style="font-size:16px;line-height:2em;">版權所有©貸上我 m.dai35.com  </div>',
 '<div class="copyRight" style="color:#818181;font-size:16px; ">深圳貸上我金融服務有限公司</div>',
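The behaviour can be reproduced outside Scrapy. The sketch below uses the standard library's xml.etree.ElementTree on a small, well-formed stand-in for the page (the snippet and the helper is_footer_like are assumptions made for illustration, and real HTML would need an HTML parser). It applies the same self-only test to every element, as the negated expression does.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the crawled page.
SNIPPET = """
<body>
  <p class="prolist_infop1">content</p>
  <div class="footer">
    <div class="about"><a href="about.php">About</a></div>
    <div class="copyRight">Copyright</div>
  </div>
</body>
""".strip()


def is_footer_like(el):
    """Self-only test: a <footer> tag, or an id/class containing "footer"."""
    return (
        el.tag == "footer"
        or "footer" in el.get("id", "")
        or "footer" in el.get("class", "")
    )


root = ET.fromstring(SNIPPET)

# Every element is tested on its own, so only the footer <div> itself
# is dropped; its children are separate elements and still pass.
kept = [(el.tag, el.get("class")) for el in root.iter() if not is_footer_like(el)]
print(kept)
# [('body', None), ('p', 'prolist_infop1'), ('div', 'about'), ('a', None), ('div', 'copyRight')]
```

The footer container disappears from the list, but the about and copyRight blocks inside it remain, which matches what the shell output shows.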

Exclude Footers of Any Kind in Results

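What we actually need to exclude is the footer node together with all of its descendants, that is, every element that is a footer itself or has a footer among its ancestors. With this approach, all footer-related content is gone from the result: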

response.selector.xpath('//*[not(ancestor-or-self::*[contains(@id,"footer") or contains(@class,"footer") or self::footer])]').extract()

......
 '<a href="#">熱門貸款產品</a>',
 '<div class="prolist">\r\n        <a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">紅本抵押貸款</h3>\r\n         <p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>\r\n        </div>',
 '<a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">紅本抵押貸款</h3>\r\n         <p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>',
 '<img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">',
 '<h3 class="prolist_name">紅本抵押貸款</h3>',
 '<p class="prolist_infop1">總利息:<font color="#e10014">0.19</font>萬元 \xa0月供:<font color="#003f97">4325</font>元</p>',
 '<font color="#e10014">0.19</font>',
 '<font color="#003f97">4325</font>',
 '<p class="prolist_infop2"></p>',
 '<span class="prolist_jiantou">查看</span>',
 '<br>',
 '<br>',
 '<script>\r\nvar _hmt = _hmt || [];\r\n(function() {\r\n  var hm = document.createElement("script");\r\n  hm.src = "//hm.baidu.com/hm.js?019c6f23eb312175c188d45037833554";\r\n  var s = document.getElementsByTagName("script")[0];\r\n  s.parentNode.insertBefore(hm, s);\r\n})();\r\n</script>',
......
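To make the "element itself or any ancestor" idea concrete, here is a minimal sketch that emulates it with the standard library's xml.etree.ElementTree instead of Scrapy. The snippet, the helper names, and the parent map are assumptions for illustration only; ElementTree's own XPath subset does not support the ancestor-or-self axis, so the walk up the tree is written by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet containing all three footer forms.
SNIPPET = """
<body>
  <p>content</p>
  <div class="footer">
    <div class="copyRight">Copyright</div>
  </div>
  <footer><p class="version">Copyright</p></footer>
  <div id="footer"><dl><dt>About</dt></dl></div>
</body>
""".strip()


def is_footer_like(el):
    """A <footer> tag, or an id/class containing "footer"."""
    return (
        el.tag == "footer"
        or "footer" in el.get("id", "")
        or "footer" in el.get("class", "")
    )


root = ET.fromstring(SNIPPET)

# ElementTree elements have no parent pointer, so build a child -> parent map once.
parent_of = {child: parent for parent in root.iter() for child in parent}


def inside_footer(el):
    """True if el or any of its ancestors is footer-like (ancestor-or-self)."""
    while el is not None:
        if is_footer_like(el):
            return True
        el = parent_of.get(el)
    return False


kept = [el.tag for el in root.iter() if not inside_footer(el)]
print(kept)  # ['body', 'p']
```

Only the body and the content paragraph survive; the paragraph nested inside footer is dropped because one of its ancestors matches.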

Function not(boolean) in XPath

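The key ingredient here is the not() function named in the heading. To exclude both the elements that are a footer or sit inside one, and the elements that are themselves script, title, or style, write: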

response.selector.xpath('//*[not(ancestor-or-self::*[contains(@id,"footer") or contains(@class,"footer") or self::footer]) and not(self::script or self::style or self::title)]').extract()

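Finally, we select the text of every element that remains after these exclusions (see below). Compared with the text obtained at the beginning of the article, the footer noise is gone: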

response.selector.xpath('//*[not(ancestor-or-self::*[contains(@id,"footer") or contains(@class,"footer") or self::footer]) and not(self::script or self::style or self::title)]/text()[normalize-space(.)]').extract()

['400-004-3535',
 '一鍵匹配貸款',
 '(爲您獲取精準貸款方案)',
 '貸款金額',
 '萬元',
 '搜索',
 '信用貸',
 '經營貸',
 '房貸',
 '車貸',
 '貸款攻略',
 '客服熱線',
 '快速申請',
 '貸款計算器',
 '熱門貸款產品',
 '紅本抵押貸款',
 '總利息:',
 '0.19',
 '萬元 \xa0月供:',
 '4325',
 '元',
 '查看',
 '\r\n\t\ufeff',
 '電話諮詢',
 '400-004-3535',
 '貸款產品多？太難選',
 '一鍵委託',
 '專業爲您推薦']
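The list above still contains '\r\n\t\ufeff' and a '\xa0' inside one fragment. XPath's normalize-space() only treats space, tab, carriage return, and line feed as whitespace, so a text node holding a byte order mark (\ufeff) is not considered empty. A possible post-processing step before word segmentation, sketched in plain Python (the function name clean_fragments is hypothetical):

```python
def clean_fragments(fragments):
    """Normalise whitespace in extracted text and drop fragments that end up empty."""
    cleaned = []
    for text in fragments:
        # Turn non-breaking spaces into plain spaces and remove byte order marks.
        text = text.replace("\xa0", " ").replace("\ufeff", "")
        # Collapse any run of whitespace into a single space and trim the ends.
        text = " ".join(text.split())
        if text:
            cleaned.append(text)
    return cleaned


sample = ['總利息:', '0.19', '萬元 \xa0月供:', '4325', '元', '查看', '\r\n\t\ufeff', '電話諮詢']
print(clean_fragments(sample))
# ['總利息:', '0.19', '萬元 月供:', '4325', '元', '查看', '電話諮詢']
```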

[Reference] https://stackoverflow.com/questions/49221014/scrapy-linkextractor-restrict-paths-exclude-tags

