String Encoding in Python 2 and Python 3 Explained

References for this article:

https://www.cnblogs.com/saolv/p/8158159.html

https://blog.csdn.net/mycar001/article/details/78364357

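To begin with: Python 2 and Python 3 use different default string encodings!

First, you need to know what Unicode is. Unicode is a character set that assigns every character a number (a code point). ASCII is a much smaller character set with its own encoding, while UTF-8 is one of several ways to encode Unicode code points as bytes. Python uses Unicode as the intermediate form when converting strings between encodings.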

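Next, what encoding and decoding mean:

Characters have to be stored and transmitted as bytes, so they are encoded into a binary byte sequence. Decoding is the reverse: recovering the original content from the byte sequence according to the specified rules.

characters → encode → byte sequence

byte sequence → decode → characters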
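As a minimal sketch of the two arrows above in Python 3 (the variable names are only illustrative), the character 嚴 is a single Unicode code point that becomes three bytes in UTF-8 and decodes back to the same character:

```python
# Python 3: one character, its code point, and its UTF-8 bytes.
ch = "嚴"

code_point = ord(ch)               # the Unicode code point, an integer
encoded = ch.encode("utf-8")       # characters -> byte sequence
decoded = encoded.decode("utf-8")  # byte sequence -> characters

print(hex(code_point))  # 0x56b4
print(encoded)          # b'\xe5\x9a\xb4'
print(decoded == ch)    # True
```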

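Encoding and decoding in Python:

In Python, decoding turns content stored in some encoding into a Unicode string, and encoding turns a Unicode string into content in some encoding. When encode and decode are called without an argument they use a fixed default encoding (ASCII in Python 2, UTF-8 in Python 3); you can also pass the encoding explicitly. That default has nothing to do with the encoding of your module, the global encoding of your environment, or the encoding declared on the first line of the file!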
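The Python 3 half of that claim can be checked with a short sketch. The sample string and variable names are assumptions chosen for illustration; the Python 2 behaviour (an ASCII default) is not exercised here:

```python
import sys

text = "abc嚴"

# With no argument, str.encode and bytes.decode use UTF-8 in Python 3.
default_bytes = text.encode()
explicit_bytes = text.encode("utf-8")

print(sys.getdefaultencoding())         # utf-8
print(default_bytes == explicit_bytes)  # True
print(default_bytes.decode() == text)   # True
```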

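In Python 2:

str (bytes in UTF-8 or another encoding) → decode → unicode (Unicode string)

unicode (Unicode string) → encode → str (bytes in UTF-8 or another encoding)

In Python 3:

str (Unicode string) → encode → bytes (byte sequence)

bytes (byte sequence) → decode → str (Unicode string)

The str type in Python 3 is equivalent to the unicode type in Python 2!

Python 2 has two string types, str and unicode. Let's look at how they differ.

Consider this code: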

# coding:utf-8
str1 = "嚴"
print type(str1)
print str1
print len(str1)

str2 = u"嚴"
print type(str2)
print str2
print len(str2)


Output:
<type 'str'>
嚴
3
<type 'unicode'>
嚴
1

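Each part is explained below.

When you write a plain string literal in Python 2, its type is str: a sequence of bytes in some encoding. The source file itself is assumed to be ASCII by default.

The first line declares the encoding of this module. Because the default is ASCII, Chinese characters cannot appear in the source without this line. With the declaration in place, str1 is a str holding UTF-8 encoded bytes. Printing it shows the character itself, and its length is 3 because len counts bytes on a Python 2 str and UTF-8 uses three bytes for this character.

str2 is a unicode string. Printing it also shows the character itself, and its length is 1 (one Unicode character).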
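For comparison, the same byte-versus-character count can be reproduced in Python 3, where the two kinds of string are separate types. This is a sketch with illustrative variable names, not a translation of the Python 2 code above:

```python
text = "嚴"                  # str: a Unicode string
raw = text.encode("utf-8")   # bytes: what a Python 2 str would hold

print(len(text))  # 1  (characters)
print(len(raw))   # 3  (bytes)
```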

Here is another example:

# coding:utf8
str1 = "嚴"  # a str holding UTF-8 bytes
print type(str1)
print str1
print len(str1)

str2 = str1.decode("utf8")  # the argument must match the declaration on the first line; decodes to a unicode string
print type(str2)
print str2
print len(str2)

str3 = str2.encode("utf8")  # encode the unicode string into a UTF-8 str
print type(str3)
print str3
print len(str3)

str4 = str2.encode("gbk")  # encode the unicode string into a GBK str
print type(str4)
print str4
print len(str4)

Output:
<type 'str'>
嚴
3
<type 'unicode'>
嚴
1
<type 'str'>
嚴
3
<type 'str'>
��
2

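str1 is a str holding UTF-8 bytes (determined by the first line). The second block of code decodes it into the unicode string str2.

In Python 2, encode and decode default to ASCII; pass an argument to choose another encoding.

In the third block, the unicode string str2 is encoded as UTF-8 into the str str3. Likewise, in the fourth block str2 is encoded as GBK into the str str4. This Chinese character takes 3 bytes in UTF-8 and 2 bytes in GBK. The garbled output from the fourth block is most likely because the console expects UTF-8: the two GBK bytes are not valid UTF-8, so they are shown as replacement characters.

Note that in Python 2 both str and unicode values print to the console as readable text rather than as an escaped byte sequence (bytes).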
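The mismatch behind that garbled output can be imitated in Python 3. The sketch below assumes the console interprets incoming bytes as UTF-8, which is a guess about the author's environment rather than something the output proves:

```python
gbk_bytes = "嚴".encode("gbk")

# Strict decoding rejects the GBK bytes because they are not valid UTF-8.
try:
    gbk_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="replace" substitutes U+FFFD, the character many consoles draw as "?" in a box.
shown = gbk_bytes.decode("utf-8", errors="replace")

print(len(gbk_bytes))     # 2
print(strict_ok)          # False
print("\ufffd" in shown)  # True
```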

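Now for Python 3:

Python 3 has two types: str (Unicode string) and bytes (byte sequence).

str (Unicode string) → encode → bytes (byte sequence)

bytes (byte sequence) → decode → str (Unicode string)

The str type in Python 3 is equivalent to the unicode type in Python 2!

Let's look at how the two differ.

Consider this code: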

str1 = "我是abc"
print(type(str1))
print(str1)
print(len(str1))

str2 = bytes("我是abc", encoding="utf8")
print(type(str2))
print(str2)
print(len(str2))

Output:
<class 'str'>
我是abc
5
<class 'bytes'>
b'\xe6\x88\x91\xe6\x98\xafabc'
9

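Note: string literals in Python 3 are Unicode strings (str) by default, and source files are read as UTF-8 by default, so the (# coding:utf8) line from the Python 2 code is not needed.

str1 is a str (Unicode string); it prints directly and has length 5.

str2 is a bytes object (an encoding must be given to build it from a str). Printing it shows a byte sequence prefixed with b rather than the text, and its length is 9 because UTF-8 uses three bytes for each of these Chinese characters.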
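A short sketch of how the bytes constructor relates to str.encode may help here. The variable names are illustrative, and the final check only demonstrates that an encoding is required:

```python
text = "我是abc"

via_constructor = bytes(text, encoding="utf8")
via_method = text.encode("utf8")

# Building bytes from a str without naming an encoding is rejected.
try:
    bytes(text)
    needs_encoding = False
except TypeError:
    needs_encoding = True

print(via_constructor == via_method)  # True
print(via_constructor[0])             # 230, the integer value of the first byte (0xe6)
print(text[0])                        # 我, the first character
print(needs_encoding)                 # True
```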

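Note: in Python 2 both str and unicode print as readable text. In Python 3 str (Unicode string) prints as readable text, while bytes prints as an escaped byte sequence that is hard to read.

Here is another example: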

str1 = "我是abc"
print(type(str1))
print(str1)
print(len(str1))

str2 = str1.encode()  # defaults to UTF-8, same as encode("utf8")
print(type(str2))
print(str2)
print(len(str2))

str3 = str2.decode()  # defaults to UTF-8, same as decode("utf8")
print(type(str3))
print(str3)
print(len(str3))

str4 = str1.encode("gbk")
print(type(str4))
print(str4)
print(len(str4))

str5 = str4.decode("gbk")  # must match the encoding used above
print(type(str5))
print(str5)
print(len(str5))

str6 = str4.decode("utf16", errors="ignore")  # decoding with a different encoding raises an error or yields garbage, depending on the errors argument
print(str6)
print(str6)


Output:
<class 'str'>
我是abc
5
<class 'bytes'>
b'\xe6\x88\x91\xe6\x98\xafabc'
9
<class 'str'>
我是abc
5
<class 'bytes'>
b'\xce\xd2\xca\xc7abc'
7
<class 'str'>
我是abc
5
틎쟊扡

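str1 is a str (Unicode string). str2 is the bytes object produced by encoding str1 with the default encoding (UTF-8), and str3 is the str obtained by decoding str2. str4 and str5 correspond to str2 and str3 but use GBK, and the different encoding is also why the byte length differs. str6 shows the garbled text you get when the bytes are decoded with a different encoding than they were encoded with; this would normally raise an error, which the errors argument can suppress.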
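The str6 case can be made reproducible with a small sketch. With the codec name "utf16" and no byte order mark in the data, Python falls back to the platform's native byte order, so the code below names little-endian explicitly; that choice is an assumption made to match the output shown above.

```python
gbk_bytes = "我是abc".encode("gbk")  # b'\xce\xd2\xca\xc7abc', 7 bytes

# Strict decoding fails: 7 bytes cannot be split into whole 2-byte units.
try:
    gbk_bytes.decode("utf-16-le")
    raised = False
except UnicodeDecodeError:
    raised = True

# errors="ignore" drops the dangling last byte and keeps the three wrong characters.
garbled = gbk_bytes.decode("utf-16-le", errors="ignore")

print(raised)   # True
print(garbled)  # 틎쟊扡
```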

That's all for now; I'll come back and add more when I think of something. If anything here is wrong, please point it out.
