Part 19: ValueError: Unicode strings with encoding declaration are not supported when parsing scraped bilibili danmaku with lxml

Original post

2020-06-16 08:27

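I wrote this post after reading someone else's: "Python crawler: scraping bilibili danmaku + generating a word cloud". Since that author parsed the data with BeautifulSoup, I figured lxml should not be left behind.
Here is a problem I ran into while scraping the danmaku of a bilibili video: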

    html = etree.HTML(text)
  File "src\lxml\etree.pyx", line 3170, in lxml.etree.HTML
  File "src\lxml\parser.pxi", line 1872, in lxml.etree._parseMemoryDocument
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

The error is raised when lxml tries to parse the text.
First, here is the code up to that point:

import requests
from lxml import etree
url = 'https://comment.bilibili.com/128589248.xml'
response = requests.get(url)
print(response.content.decode('utf-8'))  # decode the response bytes as UTF-8

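Where does that URL come from? Open bilibili, click any video, and press F12 while it plays.
Under XHR in the Network tab, playback produces many heartbeat requests. Pick any one of them, open its parameters, find cid, and copy its value into the URL below: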

https://comment.bilibili.com/<cid>.xml (with the cid above: https://comment.bilibili.com/128589248.xml)
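The substitution above can be written as a tiny helper. This is only a sketch: the function name danmaku_url is made up for illustration, and the URL pattern is simply the one shown in this post.

```python
def danmaku_url(cid: int) -> str:
    """Build the danmaku XML URL for a cid (pattern taken from this post)."""
    # danmaku_url is a hypothetical helper name, not part of any library.
    return f"https://comment.bilibili.com/{cid}.xml"


print(danmaku_url(128589248))
```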

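Then open that URL.
What it returns is all of the video's danmaku. Fetching the response takes just the few lines of code shown above. After that, as usual, import the etree module from lxml with from lxml import etree (and don't forget to import requests as well, which I won't go into here). Then use a variable html to hold the parsed document: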

text = response.content.decode('utf-8')
html = etree.HTML(text)

Running this raises the error: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
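The failure can be reproduced without touching the network. The sketch below assumes a made-up two-line XML document as a stand-in for the real response (the element names i and d are assumptions for illustration); the only thing that matters is that it starts with an XML declaration naming an encoding.

```python
from lxml import etree

# Made-up stand-in for the downloaded XML; note the encoding declaration.
xml_text = '<?xml version="1.0" encoding="UTF-8"?><i><d>hello</d><d>彈幕</d></i>'
raw = xml_text.encode('utf-8')   # what response.content would look like
text = raw.decode('utf-8')       # what the failing code passes to lxml

try:
    etree.fromstring(text)       # str input plus encoding declaration
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# The same document parses fine when handed over as bytes.
root = etree.fromstring(raw)
texts = root.xpath('//d/text()')
print(texts)
```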
Naturally, I searched Baidu and found this:
A Concise lxml Tutorial (lxml簡明教程)
The very first example in it was an eye-opener:

>>> xml_string = '<root><foo id="foo-id" class="foo zoo">Foo</foo><bar>中文</bar><baz></baz></root>'
>>> root = etree.fromstring(xml_string.encode('utf-8'))  # best to pass a byte string
>>> etree.tostring(root)
# returns a byte string by default
b'<root><foo id="foo-id" class="foo zoo">Foo</foo><bar>&#20013;&#25991;</bar><baz/></root>'

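This approach encodes the string as UTF-8 before parsing, so lxml receives a byte string and reads the encoding declaration itself instead of rejecting the input. Let's try it: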
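Going the other way, etree.tostring returns bytes by default and escapes non-ASCII characters as numeric references; passing encoding='unicode' returns a str instead. A small sketch, using a shortened version of the tutorial's sample string:

```python
from lxml import etree

root = etree.fromstring('<root><bar>中文</bar><baz></baz></root>'.encode('utf-8'))

as_bytes = etree.tostring(root)                     # bytes, non-ASCII escaped
as_text = etree.tostring(root, encoding='unicode')  # str, characters kept as-is

print(as_bytes)
print(as_text)
```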

html = etree.fromstring(text.encode('utf-8'))
danmu = html.xpath('//d/text()')
print(danmu)
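Since response.content is already bytes, the decode and re-encode round trip can probably be skipped altogether. The sketch below wraps that idea in a helper; parse_danmaku is a hypothetical name, and the sample document is made up so the example runs offline.

```python
from lxml import etree


def parse_danmaku(payload):
    """Return the text of every <d> element; accepts bytes or str."""
    if isinstance(payload, str):
        # lxml rejects str input that carries an encoding declaration,
        # so hand it bytes. Assumes the declared encoding is UTF-8.
        payload = payload.encode('utf-8')
    root = etree.fromstring(payload)
    return root.xpath('//d/text()')


# Made-up sample standing in for the real response body.
sample = '<?xml version="1.0" encoding="UTF-8"?><i><d>first</d><d>second</d></i>'
print(parse_danmaku(sample))
print(parse_danmaku(sample.encode('utf-8')))
```

With requests, this would be called as parse_danmaku(response.content).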

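Sure enough, it worked. Looking back at the response we fetched earlier, it begins with an XML declaration that names an encoding, which is exactly what lxml refuses to accept in a str.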
