Reading Files of Unknown Encoding in Python

Original

mofei12138

2020-06-27 18:45

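Background

While building a log analysis feature, I needed to read files in various encodings and then parse their contents, so the first problem to solve was how to detect a file's encoding.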

Test files

For demonstration, first create five test files, each named after its encoding: utf8-file, utf8bom-file, gbk-file, utf16le-file, and utf16be-file. All five contain the same text:

abcd
1234
一二三四
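
The post does not say how the test files were produced, so here is one possible way to generate them, as a rough sketch. The function name create_test_files and the SPECS table are hypothetical. Writing a byte order mark (BOM) at the start of the two UTF-16 files is also my assumption; the outputs later in the post are consistent with it, but the original does not state it.

```python
import codecs
from pathlib import Path

# The text shared by all five files (no trailing newline after the last line).
TEXT = "abcd\n1234\n一二三四"

# file name -> (BOM bytes to prepend, codec used to encode the text)
# Assumption: the UTF-16 files start with a BOM; the post does not say so.
SPECS = {
    "utf8-file": (b"", "utf-8"),
    "utf8bom-file": (codecs.BOM_UTF8, "utf-8"),
    "gbk-file": (b"", "gbk"),
    "utf16le-file": (codecs.BOM_UTF16_LE, "utf-16-le"),
    "utf16be-file": (codecs.BOM_UTF16_BE, "utf-16-be"),
}


def create_test_files(directory="."):
    """Write the five test files into directory and return their paths."""
    paths = []
    for name, (bom, codec) in SPECS.items():
        path = Path(directory) / name
        # Write raw bytes so no newline or encoding translation takes place.
        path.write_bytes(bom + TEXT.encode(codec))
        paths.append(path)
    return paths
```

Calling create_test_files() with no argument writes the files into the current directory, which is where the examples in this post expect to find them.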

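Detecting the encoding with the chardet module

chardet is a module for character encoding detection: given a sequence of bytes in an unknown format, it guesses which encoding they are in.

Detecting the encoding of a small file

The detect function of chardet takes a non-Unicode string argument (in Python 3, a bytes or bytearray object) and returns a dictionary. The dictionary contains the detected encoding, a confidence value, and the detected language.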

>>> import chardet
>>> with open('utf8-file', 'rb') as f:
...     result = chardet.detect(f.read())
...     print(result)
...
{'encoding': 'utf-8', 'confidence': 0.938125, 'language': ''}
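
The result dictionary still has to be turned into a decision. The sketch below shows one way to do that, falling back to a default when chardet cannot make a guess (it then reports None as the encoding) or when the confidence is low. The names pick_encoding and detect_small_file, the 0.5 threshold, and the utf-8 fallback are all assumptions made for illustration; none of them come from chardet.

```python
def pick_encoding(result, min_confidence=0.5, fallback="utf-8"):
    """Pick an encoding name from a chardet-style result dictionary.

    min_confidence and fallback are illustrative defaults, not chardet settings.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    # chardet reports encoding=None when it cannot make a guess.
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


def detect_small_file(path, **kwargs):
    """Read a whole (small) file and return the encoding to decode it with."""
    import chardet  # third-party; imported lazily so pick_encoding works without it

    with open(path, "rb") as f:
        return pick_encoding(chardet.detect(f.read()), **kwargs)
```

With the result shown above, pick_encoding returns 'utf-8' because the confidence is well above the threshold. The right threshold depends on your data, so treat 0.5 as a starting point only.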

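Detecting the encoding of a large file

Some files are very large, and reading the whole file before detecting its encoding, as above, becomes inefficient, so we use incremental detection instead. Here we feed the detector one line at a time; as soon as it reaches its minimum confidence threshold, we stop and fetch the result. Compared with the approach above, this may read far less of the file and so shorten detection time. Another benefit is that the file is read piece by piece, which keeps memory pressure low.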

>>> import chardet
>>> from chardet.universaldetector import UniversalDetector
>>> detector = UniversalDetector()
>>> with open('utf8-file', 'rb') as f:
...     for line in f:
...         detector.feed(line)
...         if detector.done:
...             break
...     detector.close()
...     print(detector.result)
...
{'encoding': 'utf-8', 'confidence': 0.938125, 'language': ''}
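
Feeding line by line works well for ordinary text, but a file with very long lines (or none at all) can still pull a lot of data into memory in one step. A possible variation, sketched below, feeds fixed-size chunks and stops after a byte budget even if the detector never becomes confident. The function name detect_encoding and the parameters chunk_size, max_bytes, and detector are hypothetical; only feed, done, close, and result belong to UniversalDetector.

```python
def detect_encoding(path, chunk_size=64 * 1024, max_bytes=1024 * 1024, detector=None):
    """Feed a file to a chardet detector in fixed-size chunks.

    chunk_size and max_bytes are illustrative defaults. Any object with
    feed(), close(), done and result can be passed as detector.
    """
    if detector is None:
        # Third-party import kept local so a stand-in detector can be used instead.
        from chardet.universaldetector import UniversalDetector
        detector = UniversalDetector()

    bytes_read = 0
    with open(path, "rb") as f:
        # Stop when the detector is confident, the budget is spent, or the file ends.
        while not detector.done and bytes_read < max_bytes:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            detector.feed(chunk)
            bytes_read += len(chunk)
    detector.close()  # finalizes detector.result
    return detector.result
```

Calling detect_encoding('utf8-file') should give the same kind of dictionary as the session above. The byte budget trades accuracy for speed: a file whose first megabyte is plain ASCII will be judged on that part alone.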

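Combining detection and reading

We wrap encoding detection and file reading into a single function and test it against files in all five encodings. The code below passes LanguageFilter.CHINESE when creating the UniversalDetector object, which can make the detection result more accurate.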

>>> import io
>>> from chardet.universaldetector import UniversalDetector, LanguageFilter
>>> def reading_unknown_encoding_file(filename):
...     detector = UniversalDetector(LanguageFilter.CHINESE)
...     with open(filename, 'rb') as f:
...         # Feed the detector line by line until it is confident.
...         for line in f:
...             detector.feed(line)
...             if detector.done:
...                 break
...         detector.close()
...         encoding = detector.result['encoding']
...         # Rewind, then decode the same file object with the detected encoding.
...         f.seek(0)
...         text = io.TextIOWrapper(f, encoding=encoding)
...         for line in text:
...             print(repr(line))
...
>>> reading_unknown_encoding_file('utf8-file')
'abcd\n'
'1234\n'
'一二三四'
>>> reading_unknown_encoding_file('utf8bom-file')
'abcd\n'
'1234\n'
'一二三四'
>>> reading_unknown_encoding_file('gbk-file')
'abcd\n'
'1234\n'
'一二三四'
>>> reading_unknown_encoding_file('utf16le-file')
'abcd\n'
'1234\n'
'一二三四'
>>> reading_unknown_encoding_file('utf16be-file')
'abcd\n'
'1234\n'
'一二三四'
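
The function above prints each line, which is handy for a demo; for actual log parsing it is usually more convenient to get the decoded lines back. Below is a minimal sketch of that idea, not a drop-in replacement. The name iter_decoded_lines and the parameters encoding, fallback, and errors are hypothetical. It also guards against the case where detection fails and the reported encoding is None.

```python
def iter_decoded_lines(path, encoding=None, fallback="utf-8", errors="replace"):
    """Yield the decoded lines of a file.

    If encoding is None, it is detected with chardet; fallback is used when
    detection gives no answer. errors="replace" is an illustrative choice that
    keeps parsing going when a few bytes cannot be decoded.
    """
    if encoding is None:
        # Third-party import kept local: it is only needed when detecting.
        from chardet.universaldetector import UniversalDetector

        detector = UniversalDetector()
        with open(path, "rb") as f:
            for line in f:
                detector.feed(line)
                if detector.done:
                    break
        detector.close()
        encoding = detector.result["encoding"] or fallback

    # Reopen in text mode so decoding and newline handling are done by Python.
    with open(path, "r", encoding=encoding, errors=errors) as f:
        yield from f
```

A caller can then write a loop such as `for line in iter_decoded_lines('gbk-file'):` and parse each line, or pass encoding explicitly when it is already known and skip detection altogether.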

References

chardet documentation
