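Basic approach: use lxml + beautifulsoup for HTML parsing and URL extraction.
According to a Python HTML parser performance benchmark, lxml parses faster, while beautifulsoup is more tolerant of malformed markup.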
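I downloaded lxml-2.3-py2.7-win32.egg. Installing it requires setuptools first;
then run setuptools.exe xxx.egg, which installs lxml.
lxml wraps beautifulsoup, but you have to install beautifulsoup yourself. After downloading and unpacking it, you will find it ships with its own setup script. Run python.exe setup.py install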
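After both installs, it can help to confirm that the packages are actually visible to the interpreter before parsing anything. The following is only a minimal sketch, written for Python 3 rather than the Python 2.7 used in this post; the helper names is_installed and check_parsers are made up for illustration. It assumes the module names lxml, bs4 (BeautifulSoup 4) and BeautifulSoup (the older 3.x release).

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located (it is not imported)."""
    return importlib.util.find_spec(module_name) is not None


def check_parsers():
    # Assumed module names: "bs4" for BeautifulSoup 4,
    # "BeautifulSoup" for the older 3.x release.
    names = ["lxml", "bs4", "BeautifulSoup"]
    return {name: is_installed(name) for name in names}


if __name__ == "__main__":
    for name, found in check_parsers().items():
        print("%-14s %s" % (name, "found" if found else "missing"))
```

If lxml shows as missing, the egg install did not land in this interpreter's site-packages; if both beautifulsoup names are missing, lxml.html.soupparser will fail at import time.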
Now we can start parsing a file.
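# -*- coding: utf-8 -*-
import lxml.html.soupparser as soupparser
import lxml.etree as etree

print("hello html parser !")

# Path to a local HTML file; soupparser.parse() accepts a filename or a file object.
html = r'i:\temp\test.html'
dom = soupparser.parse(html)
# To parse an HTML string instead of a file:
# dom = soupparser.fromstring(html)

# Walk every element in the tree and print the href of each <a> tag.
count = 0
for ele in dom.iter():
    if ele.tag == 'a':
        count += 1
        print(ele.attrib.get('href'))

print("parse finished ! find url = %d" % count)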
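To sanity-check the link count from the lxml run, the same extraction can be approximated with no third-party packages. This is a rough sketch for Python 3 that uses the standard library's html.parser instead of lxml or beautifulsoup, so it is not the method used above. LinkCollector and extract_links are hypothetical names. Unlike the loop above, it skips <a> tags that have no href instead of printing None.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before this call.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html_text):
    """Return all href values found in html_text, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


if __name__ == "__main__":
    sample = '<p><a href="http://example.com/a">A</a> <a name="x">no href</a></p>'
    links = extract_links(sample)
    for link in links:
        print(link)
    print("parse finished ! find url = %d" % len(links))
```

The standard-library parser is stricter about what it recovers from broken markup than beautifulsoup, so on badly formed pages the two counts may differ.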