-
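Fixing the "lxml package has no etree module" problem:
Environment: Python 3.7 + lxml 4.4.4
etree is a compiled C extension module, so the IDE cannot inspect it and offers no autocompletion on import. The module is there; just type the import by hand: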
from lxml import etree
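A quick way to confirm that the module really is present even though the IDE shows nothing is to import it and print a couple of its attributes. This is only a sanity check; the values printed depend on the local installation.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed lxml release.
print("lxml version:", etree.LXML_VERSION)

# __file__ points at the compiled extension (.pyd / .so), which is why
# an IDE that relies on Python source cannot offer completions for it.
print("loaded from:", etree.__file__)
```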
-
Calling etree.parse raises an error. Cause: by default it uses the XML parser, so it fails on any HTML file that is not well-formed XML:
htmlElement = etree.parse('renren.html')
File "src\lxml\etree.pyx", line 3467, in lxml.etree.parse
File "src\lxml\parser.pxi", line 1839, in lxml.etree._parseDocument
File "src\lxml\parser.pxi", line 1865, in lxml.etree._parseDocumentFromURL
File "src\lxml\parser.pxi", line 1769, in lxml.etree._parseDocFromFile
File "src\lxml\parser.pxi", line 1163, in lxml.etree._BaseParser._parseDocFromFile
File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 640, in lxml.etree._raiseParseError
File "renren.html", line 4
lxml.etree.XMLSyntaxError: StartTag: invalid element name, line 4, column 2
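The same kind of failure can be reproduced without renren.html. The sketch below feeds a tiny made-up snippet to the default XML parser; the helper name parses_as_xml and the sample markup are illustrative only, not part of the original script.

```python
from lxml import etree


def parses_as_xml(markup):
    """Return True if the default XML parser accepts the markup."""
    try:
        etree.fromstring(markup)
        return True
    except etree.XMLSyntaxError as exc:
        print("XML parser rejected it:", exc)
        return False


# A self-closed <br/> is well-formed XML.
print(parses_as_xml("<div><br/></div>"))

# An unclosed <br> is normal in HTML but is a syntax error in XML.
print(parses_as_xml("<div><br></div>"))
```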
Solution:
Create an HTML parser yourself and pass it through the parser argument:
parser = etree.HTMLParser(encoding='utf-8')
htmlElement = etree.parse('renren.html', parser=parser)
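To try the fix without a file on disk, an in-memory byte stream can stand in for renren.html, since etree.parse also accepts file-like objects. The broken markup below is a made-up sample, assumed only for illustration.

```python
import io

from lxml import etree

# Deliberately sloppy HTML: unclosed <p>, <br>, <ul> and <li>.
broken_html = b"<html><body><p>unclosed paragraph<br><ul><li>item</body></html>"

# The HTML parser recovers from the missing closing tags.
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.parse(io.BytesIO(broken_html), parser=parser)

items = tree.xpath("//li/text()")
print(items)
print(etree.tostring(tree, encoding="utf-8").decode("utf-8"))
```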
Many etree methods get no autocompletion either; just type them by hand. The complete script:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Purpose: using lxml and XPath
Environment: Python 3.7 + lxml 4.4.4
Date: 2019/8/14 21:41
Author: 指尖魔法師
Version: 1.0
"""
from lxml import etree


def fromstring():
    text = '''
    <div>
        <ul>
            <li class="item-0"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link3.html">second item</a></li>
            <li class="item-inactive"><a href="link4.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''
    # etree.HTML uses the HTML parser and wraps the fragment in <html><body>
    htmlElement = etree.HTML(text)
    result = etree.tostring(htmlElement, encoding='utf-8').decode('utf-8')
    print(result)


def fromfile():
    # etree.parse defaults to the XML parser, which fails on HTML that is
    # not well-formed XML, so pass an HTML parser explicitly
    parser = etree.HTMLParser(encoding='utf-8')
    htmlElement = etree.parse('renren.html', parser=parser)
    result = etree.tostring(htmlElement, encoding='utf-8').decode('utf-8')
    print(result)


def main():
    # Read HTML from a string
    # fromstring()
    # Read HTML from a file
    fromfile()


if __name__ == '__main__':
    main()
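The script's docstring mentions XPath, but the script itself only serializes the tree. As a small sketch of the next step, the queries below run against the same sample list used in fromstring(); the two XPath expressions are illustrative choices, not taken from the original.

```python
from lxml import etree

text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link3.html">second item</a></li>
        <li class="item-inactive"><a href="link4.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''

html = etree.HTML(text)

# Link text of every <li> whose class is exactly "item-0".
item0_texts = html.xpath('//li[@class="item-0"]/a/text()')
print(item0_texts)  # ['first item', 'fifth item']

# The href attribute of every link, in document order.
hrefs = html.xpath('//li/a/@href')
print(hrefs)
```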