Filtering HTML tags out of a document with Python

Original post

FanWinter

2018-08-22 08:45

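While practicing web scraping recently, I needed to extract the body text of HTML documents. The methods I tried are summarized below.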

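Method 1:

The lxml.html.clean module provides a Cleaner class for cleaning up HTML pages. It can strip embedded or script content, special tags, CSS styles, comments, and more.

Note: set page_structure and safe_attrs_only to False to keep the page intact. Otherwise the Cleaner also strips the structural tags (html, head, title) and every attribute that is not on its safe list.

Cleaner parameter reference: http://lxml.de/api/lxml.html.clean.Cleaner-class.html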

from lxml.html.clean import Cleaner
import requests

url ='http://www.csh.com.cn/'
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}

html = requests.get(url, headers=headers).content
# Strip the unwanted tags
cleaner = Cleaner(style=True, scripts=True, comments=True, javascript=True, page_structure=False, safe_attrs_only=False)

content = cleaner.clean_html(html.decode('utf-8'))
# The printed result has the filtered tags removed; tags that were not filtered are still present.
print(content)
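
To see what page_structure and safe_attrs_only actually change, here is a minimal sketch that runs two Cleaner configurations over a small inline document instead of a live page. The sample markup, the data-id attribute, and the variable names are made up for illustration. The fallback import assumes the separate lxml_html_clean package, which is where the cleaner lives in newer lxml releases.

```python
try:
    from lxml.html.clean import Cleaner
except ImportError:
    # Assumption: newer lxml releases ship the cleaner as the lxml_html_clean package
    from lxml_html_clean import Cleaner

# Hypothetical sample page: a style block, a comment, a script, and a custom attribute
sample = (
    '<html><head><title>Demo</title><style>p {color: red}</style></head>'
    '<body><!-- note --><p data-id="42" onclick="run()">Hello</p>'
    '<script>run()</script></body></html>'
)

# Same flags as in the article: keep page structure and all non-script attributes
keep = Cleaner(style=True, scripts=True, comments=True, javascript=True,
               page_structure=False, safe_attrs_only=False)

# Stricter variant: also drop structural tags and attributes outside the safe list
strict = Cleaner(style=True, scripts=True, comments=True, javascript=True,
                 page_structure=True, safe_attrs_only=True)

kept = keep.clean_html(sample)
stripped = strict.clean_html(sample)

print(kept)      # script, style, and comment are gone; data-id survives
print(stripped)  # data-id is gone as well
```

Comparing the two printed strings side by side is a quick way to decide which flags a scraper needs before pointing it at real pages.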

Method 2:

Filtering tags with regular expressions.

(1) Filtering all tags:

import re
import requests

url ='http://www.csh.com.cn/'
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}

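# Use .text so the regex below operates on str, not bytes
html = requests.get(url, headers=headers).text
# Strip every tag: the pattern matches anything between < and >.
reg = re.compile(r'<[^>]*>')
# Note: this also deletes all newlines and all spaces from the text
content = reg.sub('', html).replace('\n', '').replace(' ', '')

# The result is the page text with no tag markup left (text inside script and style elements is still there).
print(content)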
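
The same pattern can be tried on an inline string without any network access. The sketch below is a variation rather than the code above: it replaces each tag with a space so neighboring words do not fuse, decodes HTML entities with the standard html module, and collapses whitespace instead of deleting every space. The function name strip_tags and the sample markup are made up for illustration.

```python
import html
import re

TAG_RE = re.compile(r'<[^>]*>')

def strip_tags(markup):
    """Remove anything that looks like a tag, then tidy the remaining text."""
    text = TAG_RE.sub(' ', markup)   # a space keeps adjacent words apart
    text = html.unescape(text)       # decode entities such as &amp; and &nbsp;
    return ' '.join(text.split())    # collapse runs of whitespace to one space

sample = '<div><h1>Title</h1><p>Tom &amp; Jerry<br>second&nbsp;line</p></div>'
print(strip_tags(sample))  # Title Tom & Jerry second line
```

Whether to keep or drop spaces depends on the language of the page: deleting them is fine for Chinese text but glues English words together.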

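(2) Filtering specific tags:

In my tests this approach did not remove every script tag, most likely because the [^<]* part of the pattern cannot cross a < inside the script body, so such blocks go unmatched. Judge whether it suits your case.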

import re

def filter_tags(htmlstr):
    # Compile the patterns (raw strings keep the backslashes intact)
    re_cdata = re.compile(r'//<!\[CDATA\[[^>]*//\]\]>', re.I)  # CDATA sections
    re_script = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.I)  # script blocks
    re_style = re.compile(r'<\s*style[^>]*>[^<]*<\s*/\s*style\s*>', re.I)  # style blocks
    re_br = re.compile(r'<br\s*?/?>')  # line breaks
    re_h = re.compile(r'</?\w+[^>]*>')  # HTML tags
    re_comment = re.compile(r'<!--[^>]*-->')  # HTML comments
    blank_line = re.compile(r'\n+')

    # Apply the filters in order
    s = re_cdata.sub('', htmlstr)  # remove CDATA
    s = re_script.sub('', s)  # remove script blocks
    s = re_style.sub('', s)  # remove style blocks
    s = re_br.sub('\n', s)  # turn <br> into a newline
    s = re_h.sub('', s)  # remove HTML tags
    s = re_comment.sub('', s)  # remove HTML comments
    s = blank_line.sub('\n', s)  # collapse consecutive blank lines

    return s
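
One possible way around the script-tag problem is a non-greedy match with re.S, so the pattern can cross < characters and newlines and stops at the first closing tag. This is a sketch of that idea, not a full HTML parser; the names strip_blocks, RE_SCRIPT, OLD_SCRIPT and the sample string are made up for illustration.

```python
import re

# Pattern from filter_tags above, kept only for comparison
OLD_SCRIPT = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.I)

# '.*?' with re.S matches any characters, including '<' and newlines,
# up to the first closing tag
RE_SCRIPT = re.compile(r'<\s*script\b[^>]*>.*?<\s*/\s*script\s*>', re.I | re.S)
RE_STYLE = re.compile(r'<\s*style\b[^>]*>.*?<\s*/\s*style\s*>', re.I | re.S)
RE_COMMENT = re.compile(r'<!--.*?-->', re.S)
RE_TAG = re.compile(r'</?\w+[^>]*>')

def strip_blocks(htmlstr):
    """Drop script, style, and comment blocks with their contents, then all tags."""
    s = RE_SCRIPT.sub('', htmlstr)
    s = RE_STYLE.sub('', s)
    s = RE_COMMENT.sub('', s)
    return RE_TAG.sub('', s)

sample = ('<p>before</p>'
          '<script>if (a < b) { document.write("<b>x</b>"); }</script>'
          '<p>after</p>')

print(OLD_SCRIPT.search(sample))  # None: the old pattern misses this script
print(strip_blocks(sample))       # beforeafter
```

Regular expressions still cannot handle every malformed page, so for anything beyond quick cleanup a real parser is the safer choice.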

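Method 3:

BeautifulSoup provides three methods for removing specific tags:

clear(): removes the contents of the current tag.

extract(): removes the current tag from the document tree and returns it.

decompose(): removes the current node from the document tree and destroys it completely.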

from bs4 import BeautifulSoup

# clear() removes the contents of the current tag
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_clear = soup.a
i_clear = soup.i.clear()

# extract() removes the current tag from the document tree and returns it
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_extract = soup.a
i_extract = soup.i.extract()

# decompose() removes the current node from the document tree and destroys it
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_decompose = soup.a
i_decompose = soup.i.decompose()

# Output
print(a_clear)         # <a href="http://example.com/">I linked to <i></i></a>
print(i_clear)         # None
print(a_extract)       # <a href="http://example.com/">I linked to </a>
print(i_extract)       # <i>example.com</i>
print(a_decompose)     # <a href="http://example.com/">I linked to </a>
print(i_decompose)     # None
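
Applied to the original goal of extracting body text, decompose() can drop script and style elements before the remaining text is collected. The following is a minimal sketch under that assumption; the sample markup is made up, and get_text() is used here as one convenient way to gather the text that is left.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page with a style block and a script block
markup = """
<html><head><style>p { color: red; }</style></head>
<body><p>Hello <b>world</b></p><script>var x = "<p>fake</p>";</script></body></html>
"""

soup = BeautifulSoup(markup, 'html.parser')

# Calling the soup like a function is shorthand for find_all()
for tag in soup(['script', 'style']):
    tag.decompose()  # remove the element together with its contents

# Join the remaining text nodes with single spaces, skipping empty ones
text = soup.get_text(separator=' ', strip=True)
print(text)  # Hello world
```

Because the parser understands where a script element ends, the fake tag inside the JavaScript string never leaks into the result.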

These methods are for reference only; pick whichever best fits your situation.

Reference:

python-27: clear(), extract(), decompose(): http://www.w2bc.com/article/89892
