Encoding troubleshooting notes: UnicodeDecodeError: 'gbk' codec can't decode byte 0xe3: illegal multibyte sequence, and how to read a docx file

When I tried to read the contents of a docx document, I assumed I could read it the same way as a txt file, so I wrote the following code:

with open('C:\\Users\\Administrator\\Desktop\\案例二.docx', 'r') as f:
    contents = f.read()
    print(contents)

The result was an error: UnicodeDecodeError: 'gbk' codec can't decode byte 0xe3 in position 55: illegal multibyte sequence
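The error names gbk even though the code never mentions it. When open() is called in text mode without an encoding argument, Python falls back to the locale's preferred encoding, which on a Chinese-language Windows system is typically gbk (cp936). A minimal sketch of both halves follows; the two-byte input is a made-up example, not taken from the actual file.

```python
import locale

# The encoding that open() uses in text mode when none is given.
# The value depends on the machine; on Chinese-language Windows it is usually cp936 (gbk).
print(locale.getpreferredencoding(False))

# Hypothetical input: 0xe3 is a valid gbk lead byte, but 0x20 is not a valid trail byte.
sample = b"\xe3\x20"
try:
    sample.decode("gbk")
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print(exc)
```

The decoder is not complaining about Chinese text here; it is complaining that the byte sequence is not gbk at all.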

Attempt 1:
At first glance this is an encoding error, so I applied the time-honored fix that has always worked for me: encoding='utf-8'

with open('C:\\Users\\Administrator\\Desktop\\案例二.docx', 'r', encoding='utf-8') as f:
    contents = f.read()
    print(contents)

It failed in the same way: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 10: invalid start byte

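I was puzzled. How could UTF-8 fail to decode something? UTF-8 is billed as the "universal code" (it is an encoding of Unicode that covers the scripts of every country, which is why it is the recommended choice for websites that involve several languages). Normally, once you use it, everything just works. So why the error? All I had typed into the file was "你好"!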
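One way to see what the message means: UTF-8 can represent any character, but that does not make every byte sequence valid UTF-8. The sketch below contrasts real UTF-8 text with a made-up stand-in for the first bytes of a binary file (these bytes are an assumption for illustration, not copied from the actual document).

```python
# Text that really is UTF-8 round-trips without trouble.
round_trip = "你好".encode("utf-8").decode("utf-8")
print(round_trip)

# Made-up stand-in for the first bytes of a binary file (not copied from the real document).
binary_header = b"PK\x03\x04\x14\x00\x06\x00\x08\x00\x87"

error = None
try:
    binary_header.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    # 0x87 is a continuation byte, so it can never begin a UTF-8 character.
    print(exc.reason, "at position", exc.start)
```

If the bytes on disk are not encoded text in the first place, no choice of encoding argument can recover the words typed into the document.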

Still, since it looked like an encoding error, I kept going.
Following the article "UnicodeDecodeError: 'gbk' codec can't decode byte 0xe9 in position 7581: illegal multibyte sequence",
I tried the encodings one by one:

gbk
gb2312
gb18030
utf-8
utf-16
utf-32
ISO-8859-1

None of them worked. Two of the errors, for example:

utf-16: UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 92-93: illegal encoding

ISO-8859-1: UnicodeEncodeError: 'gbk' codec can't encode character '\x87' in position 11: illegal multibyte sequence
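The ISO-8859-1 attempt is worth a closer look, because its error is a UnicodeEncodeError rather than a UnicodeDecodeError. ISO-8859-1 assigns a character to every possible byte, so the read itself cannot fail; the failure most likely came from print(), which has to encode the text for a gbk console. A small sketch of that distinction:

```python
# ISO-8859-1 maps every byte 0x00-0xFF to a code point, so decoding cannot fail.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso-8859-1")
print(len(decoded))

# Printing to a gbk console means encoding back to gbk, and that is where it breaks.
encode_failed = False
try:
    "\x87".encode("gbk")
except UnicodeEncodeError as exc:
    encode_failed = True
    print(exc)
```

A decode that "succeeds" this way proves nothing about the file: the result is just the raw bytes relabelled as characters.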
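Attempt 2:
Maybe it was an encoding I had not thought of. So, following the article "使用chardet判斷編碼方式" (using chardet to determine the encoding), I used chardet to detect the encoding automatically and pass it to open().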

import chardet

def chardets():
    path = 'C:\\Users\\Administrator\\Desktop\\案例二.docx'
    # Open in binary mode so chardet sees the raw bytes
    with open(path, 'rb') as f:
        #print(chardet.detect(f.read())['encoding'])
        return chardet.detect(f.read())['encoding']
#chardets()
with open('C:\\Users\\Administrator\\Desktop\\案例二.docx', 'r', encoding=chardets()) as f:
    contents = f.read()
    print(contents)

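It still failed: UnicodeDecodeError: 'gbk' codec can't decode byte 0xe3 in position 55: illegal multibyte sequence

So let's see which encoding is proving so hard to handle:

# inside chardets(), where f is the file opened in 'rb' mode
print(chardet.detect(f.read())['encoding'])

Why is it None?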

def chardets():
    path = 'C:\\Users\\Administrator\\Desktop\\案例二.docx'
    with open(path, 'rb') as f:
        # chardet.detect() returns a dict, which is why ['encoding'] was needed earlier
        print(chardet.detect(f.read()))
        #return chardet.detect(f.read())['encoding']
chardets()

The detected encoding was still None.
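A None result also explains why the chardet version produced exactly the same gbk error as the very first attempt: passing encoding=None to open() is the same as not passing an encoding at all, so Python quietly uses the locale default again. A minimal sketch, using a throwaway temporary file as a stand-in:

```python
import os
import tempfile

# Write a small throwaway text file to experiment with.
fd, tmp_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"hello")

try:
    with open(tmp_path, "r") as default_file:
        default_encoding = default_file.encoding
    with open(tmp_path, "r", encoding=None) as none_file:
        none_encoding = none_file.encoding
finally:
    os.remove(tmp_path)

# Both report the same locale-dependent default.
print(default_encoding, none_encoding)
```

It is therefore worth checking the detection result before handing it to open(), rather than passing it straight through.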

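So I went back and tried reading a txt file, and it was read normally.

Interestingly, though, its detected encoding turned out to be:

{'encoding': 'TIS-620', 'confidence': 0.3598212120361634, 'language': 'Thai'}

I had never seen this encoding before (and the confidence is only about 0.36), so I suspected the file contained some abnormal bytes. (What I had originally written into the document was not the two characters "你好" but a whole article.)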
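The confidence value in that result deserves attention: chardet reports a guess together with how sure it is, and 0.36 is a weak guess. One possible guard, sketched below, accepts a detection only above a threshold. The helper name trusted_encoding and the 0.8 cut-off are assumptions made for this illustration, not part of chardet.

```python
def trusted_encoding(result, min_confidence=0.8):
    """Return the detected encoding only if it exists and is confident enough.

    `result` is a dict shaped like chardet.detect() output; the 0.8 threshold
    is an arbitrary choice for illustration.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return None
    return encoding

# The detection result quoted above for the txt file.
txt_result = {'encoding': 'TIS-620', 'confidence': 0.3598212120361634, 'language': 'Thai'}
print(trusted_encoding(txt_result))  # low confidence, so the guess is rejected
print(trusted_encoding({'encoding': None, 'confidence': 0.0, 'language': None}))
```

With a guard like this the caller has to decide explicitly what to do when detection fails, instead of silently falling back to the locale default.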
Attempt 3:
If the encoding itself cannot be fixed, then deal with the bytes that cause the problem and skip over them.
Add errors='ignore':

import chardet

def chardets():
    path = 'C:\\Users\\Administrator\\Desktop\\案例二.docx'
    with open(path, 'rb') as f:
        # Read only once: a second f.read() would return b'' because the file is already at its end
        result = chardet.detect(f.read())
        print(result)
        return result['encoding']
#chardets()
with open('C:\\Users\\Administrator\\Desktop\\案例二.docx', 'r', encoding=chardets(), errors='ignore') as f:
    contents = f.read()
    print(contents)

The result was garbled text.
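Garbled output is what errors='ignore' should be expected to produce on data like this: it does not repair anything, it only drops whatever bytes cannot be decoded and keeps the rest. A small sketch with made-up bytes (an assumption for illustration, not the real file contents):

```python
# Made-up bytes: some ASCII mixed with bytes that are not valid UTF-8.
data = b"PK\x03\x04\x87abc\xff"

# 'ignore' silently discards undecodable bytes.
ignored = data.decode("utf-8", errors="ignore")
# 'replace' marks each undecodable byte with U+FFFD instead, so the damage stays visible.
replaced = data.decode("utf-8", errors="replace")

print(repr(ignored))
print(repr(replaced))
```

Using errors='replace' while debugging at least shows how much of the input was not text.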

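After that I tried deleting parts of the document to find out whether it contained some abnormal content.
Even after cutting it down to just the two characters "你好" mentioned above, I still got either an error or garbled text.

I also tried writing the content out in binary (bytes) mode and reading it back as gbk and as utf-8, but everything that came back was binary data.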

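Attempt 4 (the final solution):
Since decoding kept failing, and since even the txt file had been detected as TIS-620, I began to wonder whether the file format itself was the problem. I searched for modules that read docx and found that dedicated ones exist, so the document format was evidently the cause.
Following the article "Python學習筆記(28)-Python讀取word文本" (Python study notes 28: reading Word text with Python), I installed python-docx.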

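import docx  # provided by the python-docx package

# Document() parses the .docx file; paragraphs holds the body paragraphs in order
file = docx.Document("C:\\Users\\Administrator\\Desktop\\案例二.docx")
for para in file.paragraphs:
    print(para.text)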
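The reason no encoding could ever work is that a .docx file is not text: it is a ZIP archive containing XML parts, with the body text stored in word/document.xml. The sketch below shows this using only the standard library. It builds a tiny stand-in archive in memory rather than opening the real file; make_fake_docx and paragraphs_from_docx_bytes are hypothetical helpers, and the stand-in has the same layout for the one part that matters but is not a complete, valid Word document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def make_fake_docx(text):
    """Build a tiny stand-in archive laid out like a .docx (not a valid Word file)."""
    xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body><w:p><w:r>'
        f'<w:t>{text}</w:t></w:r></w:p></w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", xml.encode("utf-8"))
    return buf.getvalue()

def paragraphs_from_docx_bytes(data):
    """Unzip the archive and collect the text of each <w:p> paragraph."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return paragraphs

data = make_fake_docx("你好")
print(data[:4])                              # ZIP magic bytes, not text
print(zipfile.is_zipfile(io.BytesIO(data)))
paragraphs = paragraphs_from_docx_bytes(data)
print(paragraphs)
```

For real documents python-docx remains the more practical choice, since it also handles styles, tables and the other parts of the package; the sketch is only meant to show where the text actually lives.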

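Summary: sometimes an error is caused by a problem one level up rather than by the code you are staring at. When you cannot find the mistake, check whether the things your code depends on, here the assumption that the file was plain text, actually hold.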
[1] https://blog.csdn.net/Katrina_ALi/article/details/80638972?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522158848764019725256734556%2522%252C%2522scm%2522%253A%252220140713.130102334.pc%255Fall.57662%2522%257D&request_id=158848764019725256734556&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2_allfirst_rank_v2~rank_v25-6

[2] https://blog.csdn.net/woshisangsang/article/details/75221723?ops_request_misc=&request_id=&biz_id=102&utm_medium=distribute.pc_search_result.none-task-blog-2_blogsobaiduweb~default-0
