Reading files of unknown encoding in Python

Original post

mofei12138

2020-06-27 18:45

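Background

While building a log-analysis feature, I had to read files in several different encodings and then parse their contents. The first problem to solve is how to detect each file's encoding.

Test files

To make the demonstration easier, first create five test files, each named after its encoding: utf8-file, utf8bom-file, gbk-file, utf16le-file and utf16be-file. All five contain the same text: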

abcd
1234
一二三四
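
For anyone following along, the five files can be generated with a short script. This is a minimal sketch rather than the script used for this post: the helper names `encoded_samples` and `write_test_files` are made up here, and it assumes that both UTF-16 files start with a byte order mark and that the last line has no trailing newline.

```python
import codecs
from pathlib import Path

# Same three lines as above; the last line has no trailing newline.
TEXT = "abcd\n1234\n一二三四"


def encoded_samples(text=TEXT):
    """Map each test file name to the bytes it should contain."""
    return {
        "utf8-file": text.encode("utf-8"),
        # "utf-8-sig" prepends the UTF-8 byte order mark (BOM).
        "utf8bom-file": text.encode("utf-8-sig"),
        "gbk-file": text.encode("gbk"),
        # Assumption: both UTF-16 files start with a BOM.
        "utf16le-file": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
        "utf16be-file": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
    }


def write_test_files(directory="."):
    """Write the five sample files into `directory`."""
    directory = Path(directory)
    for name, data in encoded_samples().items():
        (directory / name).write_bytes(data)
```

Calling `write_test_files()` writes the files into the current directory. Writing raw bytes avoids any newline translation or default encoding chosen by the platform.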

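Detecting the encoding with chardet

chardet is a module for encoding detection. Given a sequence of bytes in an unknown encoding, it guesses which encoding the bytes are in.

Detecting the encoding of a small file

chardet's detect function takes a bytes object (not a str) and returns a dictionary containing the detected encoding, a confidence value, and the detected language.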

>>> import chardet
>>> with open('utf8-file', 'rb') as f:
...     result = chardet.detect(f.read())
...     print(result)
...
{'encoding': 'utf-8', 'confidence': 0.938125, 'language': ''}
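
The result dictionary still has to be turned into a decision about which encoding to use. Below is a minimal sketch of that step. The helper names `choose_encoding` and `detect_file_encoding` are hypothetical, and the defaults (`min_confidence=0.5`, `fallback="utf-8"`) are arbitrary choices for illustration, not chardet settings. The `encoding` value can be `None` when chardet cannot make a guess, so the helper covers that case as well.

```python
def choose_encoding(result, min_confidence=0.5, fallback="utf-8"):
    """Return the detected encoding, or `fallback` if the guess is weak.

    `result` is a dict shaped like the return value of chardet.detect().
    The threshold and the fallback are illustrative, not chardet defaults.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


def detect_file_encoding(path, **kwargs):
    """Read the whole file and classify it (suitable for small files only)."""
    # Third-party module, imported here so choose_encoding works without it.
    import chardet

    with open(path, "rb") as f:
        return choose_encoding(chardet.detect(f.read()), **kwargs)
```

With the result shown above, `choose_encoding` returns 'utf-8' because the confidence is well above the threshold.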

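Detecting the encoding of a large file

Some files are very large, and reading one in full just to determine its encoding is inefficient, so incremental detection is a better fit. Here the detector is fed one line at a time, and the result can be read as soon as the detector reaches its minimum confidence threshold. Compared with the approach above, this may read much less of the file, which shortens detection time. A further benefit is that the file is read in pieces, so it puts little pressure on memory.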

>>> import chardet
>>> from chardet.universaldetector import UniversalDetector
>>> detector = UniversalDetector()
>>> with open('utf8-file', 'rb') as f:
...     for line in f:
...         detector.feed(line)
...         if detector.done:
...             break
...     detector.close()
...     print(detector.result)
...
{'encoding': 'utf-8', 'confidence': 0.938125, 'language': ''}
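
Feeding line by line works well for ordinary text, but a file with very long lines would still be handed to the detector in large pieces. One possible variant is to feed fixed-size chunks and cap the total number of bytes read. This is a sketch under assumptions: the function name `feed_in_chunks` and the `chunk_size` and `max_bytes` defaults are invented for illustration, and the `detector` argument is any object that offers `feed()`, `done`, `close()` and `result` the way `UniversalDetector` does.

```python
def feed_in_chunks(stream, detector, chunk_size=64 * 1024, max_bytes=1024 * 1024):
    """Feed `stream` to `detector` in fixed-size chunks and return its result.

    Stops as soon as the detector reports it is done, the stream ends,
    or `max_bytes` have been read.
    """
    bytes_read = 0
    while not detector.done and bytes_read < max_bytes:
        chunk = stream.read(chunk_size)
        if not chunk:  # end of file
            break
        detector.feed(chunk)
        bytes_read += len(chunk)
    # close() lets the detector finalise its guess from what it has seen.
    detector.close()
    return detector.result
```

With chardet installed, this could be used by opening the file in 'rb' mode and calling `feed_in_chunks(f, UniversalDetector())`. Capping the bytes read trades some accuracy for a bounded detection time on files where the detector never becomes confident.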

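Combining detection and reading

Here detection and reading are wrapped in a single function and tested against files in all five encodings. The code passes LanguageFilter.CHINESE when creating the UniversalDetector, which narrows the candidate encodings the detector considers and can make the result more accurate for Chinese text.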

>>> import io
>>> from chardet.universaldetector import UniversalDetector, LanguageFilter
>>> def reading_unknown_encoding_file(filename):
...     # Only consider Chinese among the language-specific encodings
...     detector = UniversalDetector(LanguageFilter.CHINESE)
...     with open(filename, 'rb') as f:
...         # Feed raw lines until the detector is confident enough
...         for line in f:
...             detector.feed(line)
...             if detector.done:
...                 break
...         detector.close()
...         encoding = detector.result['encoding']
...         # Reuse the open handle: wrap it as text and rewind to the start
...         text = io.TextIOWrapper(f, encoding=encoding)
...         text.seek(0)
...         for line in text:
...             print(repr(line))
...
>>> reading_unknown_encoding_file('utf8-file')
'abcd\n'
'1234\n'
'一二三四'
>>> reading_unknown_encoding_file('utf8bom-file')
'abcd\n'
'1234\n'
'一二三四'
>>> reading_unknown_encoding_file('gbk-file')
'abcd\n'
'1234\n'
'一二三四'
>>> reading_unknown_encoding_file('utf16le-file')
'abcd\n'
'1234\n'
'一二三四'
>>> reading_unknown_encoding_file('utf16be-file')
'abcd\n'
'1234\n'
'一二三四'
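
For log analysis the lines are usually needed as values rather than printed. The following is a hedged variant that takes the detected encoding as an argument and yields the decoded lines. The name `iter_decoded_lines` and the `fallback` default are assumptions made for this sketch, and `errors="replace"` is a deliberate choice so that a wrong guess produces replacement characters instead of an exception.

```python
def iter_decoded_lines(path, encoding, fallback="utf-8"):
    """Yield the lines of `path` decoded with `encoding`.

    `encoding` is whatever the detector reported. It may be None when
    detection failed, in which case `fallback` is used instead.
    """
    # errors="replace" turns undecodable bytes into U+FFFD rather than raising.
    with open(path, encoding=encoding or fallback, errors="replace") as f:
        for line in f:
            yield line
```

A caller would pass `detector.result['encoding']` as the second argument and loop over the generator, so the parsing code never has to deal with bytes.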

