Working through a str/bytes encoding problem triggered by the return value of time.tzname

Windows 10 Home (Chinese edition), Python 3.6.4

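This afternoon I reviewed the time module and got familiar with converting between its time formats: floating-point timestamps, struct_time, and strings. That went fairly smoothly.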

However, when I tested the time.tzname attribute, the result was garbled:

>>> import time
>>> time.tzname
('ÖÐ¹ú±ê×¼Ê±¼ä', 'ÖÐ¹úÏÄÁîÊ±')

It returned a tuple, but nobody can read garbled text like this!

Note: time.tzname

A tuple of two strings: the first is the name of the local non-DST timezone, the second is the name of the local DST timezone.

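Judging from the result, the return value is a tuple of two Unicode strings (str).

So which encoding do these two strings use? How can I turn them into something I can actually read?

I found a post online (https://www.oschina.net/question/2927993_2199064?sort=default) whose solution is:

a = time.tzname[0]
b = a.encode('latin-1').decode('gbk')
print(b)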

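Note: replacing gbk with gb2312 also works.

The resulting b:

中国标准时间

Explanation of the code above (reference link 1 explains it more clearly):

encode converts the string to bytes, and decode converts those bytes back to a string. The final result is a str decoded from GBK bytes, which Python IDLE displays correctly.

Now it is readable. But why is this conversion needed? Why latin-1 and gbk? Time to keep digging.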

Note:

Besides encode and decode, str() and bytes() can also convert between str and bytes; they are used below.

Note:

str.encode(encoding="utf-8", errors="strict")
Return an encoded version of the string as a bytes object. Default encoding is 'utf-8'.

bytes.decode(encoding="utf-8", errors="strict")
bytearray.decode(encoding="utf-8", errors="strict")
Return a string decoded from the given bytes. Default encoding is 'utf-8'.
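To make the errors parameter in these signatures concrete, here is a small sketch of my own (not from the documentation quoted above). The sample text and the byte string are arbitrary values picked for illustration.

```python
# Sketch: how the `errors` argument changes encode/decode behaviour.
text = "中a"  # arbitrary sample: one non-ASCII character plus one ASCII letter

# errors="strict" (the default) raises when a character cannot be encoded.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# errors="replace" substitutes "?" for each unencodable character.
replaced = text.encode("ascii", errors="replace")

# errors="ignore" silently drops unencodable characters.
ignored = text.encode("ascii", errors="ignore")

# On decode, errors="replace" inserts U+FFFD for each undecodable sequence.
raw = b"\xd6a"  # 0xd6 followed by "a" is not valid UTF-8
decoded = raw.decode("utf-8", errors="replace")

print(strict_failed, replaced, ignored, repr(decoded))
```

The strict default is what produces the exceptions seen in the tests later in this post.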

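Question: how can I tell which encoding a string uses?

A string can be thought of as an array of characters. So how is each character of the string represented in memory? What integer is it? Of course, Python has no standalone character type; everything is a string.

In reference link 2 I found the function that converts a character to an integer: ord.

Here it is in use: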

>>> tzname = time.tzname
>>> for ch in tzname[0]:
    print("0x%x" % ord(ch))

    
0xd6
0xd0
0xb9
0xfa
0xb1
0xea
0xd7
0xbc
0xca
0xb1
0xbc
0xe4

Unfortunately, with my limited knowledge, I could not tell from the information above which encoding was used.
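One thing the hex dump does show is that every code point is below 256. A string like that can always be turned back into bytes with latin-1 and then tried against a few candidate encodings. The sketch below does that; the function name candidate_decodings and the list of candidates are my own assumptions, and the garbled literal is copied from the IDLE output above rather than read from time.tzname.

```python
# Sketch: recover the raw bytes behind a garbled str and try several codecs.
def candidate_decodings(text, candidates=("utf-8", "gbk", "big5", "shift_jis")):
    """Return {codec: decoded text} for every candidate codec that decodes cleanly.

    Raises UnicodeEncodeError if `text` contains a code point above 255,
    because such a string cannot have come from a latin-1 decode.
    """
    raw = text.encode("latin-1")  # code point N becomes byte N
    results = {}
    for codec in candidates:
        try:
            results[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            pass  # this codec cannot interpret the bytes
    return results


garbled = "ÖÐ¹ú±ê×¼Ê±¼ä"  # copied from the time.tzname output above
for codec, decoded in candidate_decodings(garbled).items():
    print(codec, decoded)
```

A codec that decodes without error is only a candidate; a person still has to judge which result reads as real text.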

Further tests follow.

item = time.tzname[0]

Test 1:

bsx0 = bytes(item, encoding="gbk")

This raises an exception:

UnicodeEncodeError: 'gbk' codec can't encode character '\xd6' in position 0: illegal multibyte sequence

So that approach is wrong. The bytes() call above does the same job as encode: it converts a str to bytes.

Test 2:

import chardet  # third-party package, needed for chardet.detect below

bs0 = bytes(item, encoding="utf-8")
print(bs0)
print(chardet.detect(bs0))
print("OK 1? ", str(bs0, 'utf-8'))
print('0x%x' % ord(str(bs0, 'utf-8')[0]))

Results:

b'\xc3\x96\xc3\x90\xc2\xb9\xc3\xba\xc2\xb1\xc3\xaa\xc3\x97\xc2\xbc\xc3\x8a\xc2\xb1\xc2\xbc\xc3\xa4'
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}    (the encoding reported by chardet.detect)
OK 1? ÖÐ¹ú±ê×¼Ê±¼ä    (still garbled, same as in IDLE)
0xd6    (bytes() first, then str(), both with utf-8: the first character's code point is still 0xd6)

Test 3:

bs = bytes(item, encoding="latin-1")
print(bs)
print(chardet.detect(bs))
str_bs = str(bs, 'gbk')
print(str_bs)
print('0x%x' % ord(str_bs[0]))

print(bytes(str_bs, encoding='gbk'))

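Results:

b'\xd6\xd0\xb9\xfa\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4'    (the string was converted to bytes with latin-1; the bytes match the per-character hex values printed earlier)
{'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}    (yet chardet.detect reports GB2312 for these bytes)
中国标准时间    (the bytes converted to str with gbk, gb2312 also works: no longer garbled)
0x4e2d    (the first character's code point is different this time)

b'\xd6\xd0\xb9\xfa\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4'    (the bytes obtained by encoding with gbk are identical to the bytes obtained earlier with latin-1!)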
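The identical bytes in Test 3 stop being surprising once the same byte sequence is decoded two ways. The sketch below is my own illustration, using the byte string copied from the output above: the bytes carry no encoding of their own, and each codec simply reads them differently.

```python
# Sketch: one byte sequence, two interpretations.
raw = b"\xd6\xd0\xb9\xfa\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4"

as_latin1 = raw.decode("latin-1")  # one character per byte
as_gbk = raw.decode("gbk")         # one character per two bytes here

print(len(raw), len(as_latin1), len(as_gbk))
print(repr(as_latin1))
print(as_gbk)

# Encoding each string with the codec that produced it gives back the same bytes.
latin1_round_trip = as_latin1.encode("latin-1")
gbk_round_trip = as_gbk.encode("gbk")
```

So the two str objects differ (12 characters versus 6), while the bytes underneath are the same.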

Test 4:

Continuing from Test 3: str_bs is the string obtained above by decoding with gbk.

bs2 = bytes(str_bs, encoding='utf-8')
print(bs2)
print(chardet.detect(bs2))
print("OK 2? ", str(bs2, 'utf-8'))
print('0x%x' % ord(str(bs2, 'utf-8')[0]))

Results:

b'\xe4\xb8\xad\xe5\x9b\xbd\xe6\xa0\x87\xe5\x87\x86\xe6\x97\xb6\xe9\x97\xb4'    (the string decoded from GBK, converted to bytes with utf-8; the bytes differ from those in Test 3)
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}    (detected as utf-8)
OK 2? 中国标准时间    (also a readable string)
0x4e2d    (the first character's code point)

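Questions

Where is the problem? Why would I need to convert the string to UTF-8?

How exactly are Unicode, UTF-8, GBK, and GB2312 related?

How does

b'\xd6\xd0\xb9\xfa\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4'

get converted to

b'\xe4\xb8\xad\xe5\x9b\xbd\xe6\xa0\x87\xe5\x87\x86\xe6\x97\xb6\xe9\x97\xb4'

and what is the computation behind it?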
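As far as I can tell, there is no direct formula from GBK bytes to UTF-8 bytes. The conversion goes through Unicode code points: GBK bytes are looked up in the GBK table to get code points, and each code point is then packed into UTF-8 bytes by bit arithmetic. The sketch below spells out that second step by hand; utf8_encode_code_point is a name I made up, and it ignores surrogate code points, which the built-in codec rejects.

```python
# Sketch: GBK bytes -> code points -> UTF-8 bytes, with UTF-8 packing done by hand.
def utf8_encode_code_point(cp):
    """Pack one Unicode code point into UTF-8 bytes (surrogates not checked)."""
    if cp < 0x80:          # 7 bits: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:         # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


gbk_bytes = b"\xd6\xd0\xb9\xfa\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4"
text = gbk_bytes.decode("gbk")             # step 1: table lookup to code points
code_points = [hex(ord(ch)) for ch in text]
manual_utf8 = b"".join(utf8_encode_code_point(ord(ch)) for ch in text)  # step 2

print(code_points)
print(manual_utf8)
print(manual_utf8 == text.encode("utf-8"))
```

For example, 0x4e2d falls in the three-byte range, and its bits split into 0100, 111000, and 101101, which gives e4 b8 ad. The same rule explains Test 2: 0xd6 falls in the two-byte range and becomes c3 96.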

Test 5:

new_item = item.encode('latin-1').decode('gbk')
print('OK 3?', new_item)
print('0x%x' % ord(new_item[0]))

new_item2 = new_item.encode('gbk').decode('utf8')
print(new_item2)

Results:

OK 3? 中国标准时间
0x4e2d
Traceback (most recent call last):
  File "D:\eclipse\workspace\zl0425\src\test\aug\time01.py", line 52, in <module>
    new_item2 = new_item.encode('gbk').decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position 0: invalid continuation byte    (an error!)

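Puzzle:

In Test 3, the original string was converted to bytes with latin-1 and then converted to a string with gbk;

in Test 4, the string obtained in Test 3 was converted to bytes with utf-8 and then back to a string with utf-8;

latin-1 --> gbk --> utf-8 worked without any error, so why does Test 5 fail when it uses encode and decode?

After changing gbk to utf8 in the failing statement, new_item2 displays correctly:

new_item2 = new_item.encode('utf8').decode('utf8')

Result:

中国标准时间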
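My reading of the Test 5 failure is that the encode and decode calls in a single chain must name the same codec. Test 4 encoded and decoded with utf-8 on both sides, while the failing line encoded with gbk and decoded with utf8. The sketch below reproduces both cases; the variable names are my own.

```python
# Sketch: a round trip works only when encode and decode use the same codec.
text = "中国标准时间"

# Matching codecs: the round trip returns the original string.
round_trips = {codec: text.encode(codec).decode(codec) for codec in ("gbk", "utf-8")}

# Mismatched codecs: GBK bytes are not valid UTF-8, so decoding fails.
gbk_bytes = text.encode("gbk")
try:
    gbk_bytes.decode("utf-8")
    mismatch_error = None
except UnicodeDecodeError as exc:
    mismatch_error = exc

print(round_trips)
print(mismatch_error)
```

The error reports byte 0xd6 at position 0, the same byte shown in the traceback above.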

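17:23 I still don't fully understand; I'll look again later.

18:05 Powered back on with a full battery; time to take on this problem again.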

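After reading reference links 4 and 5 and running some experiments on the Chinese character “汉”, I found that both encode and decode work on the bytes in memory: encode maps a string's code points to bytes, and decode maps bytes back to code points.

Below are the results of testing with the bytes() and str() functions:

# Already a Unicode character
>>> han = '汉'
# Print the code point of 汉 in the Unicode character set
>>> ord(han)
27721
>>> print('0x%x' % ord(han))
0x6c49


# Convert a Unicode character with a code point above 255 to bytes using latin-1: exception, it cannot be encoded
# Characters with code points below 256 work fine!
>>> bytes(han, encoding='latin-1')
Traceback (most recent call last):
  File "<pyshell#63>", line 1, in <module>
    bytes(han, encoding='latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u6c49' in position 0: ordinal not in range(256)

# Convert the Unicode character to bytes using utf-8: success
# Whatever codec is used to encode must also be used to decode
>>> hanbs = bytes(han, encoding='utf-8')
# A three-byte UTF-8 encoding
>>> hanbs
b'\xe6\xb1\x89'

# Decode the UTF-8 bytes into a string using latin-1
# Each byte is handled on its own, giving a string of length 3 that contains unreadable garbage
# \xe6, \xb1, and \x89 each stand for one character
# These are Unicode characters, but with code points below 256
>>> han_latin = str(hanbs, 'latin-1')
>>> han_latin
'æ±\x89'

# Convert the latin-1-decoded string to bytes using utf-8
# Each character is encoded separately: three characters, three encodings
# The result is the UTF-8 bytes below, now 6 bytes long
# Every two bytes stand for one character of han_latin above (a Unicode character)
>>> hanbs2 = bytes(han_latin, encoding='utf-8')
>>> hanbs2
b'\xc3\xa6\xc2\xb1\xc2\x89'
>>> str(hanbs2, encoding='utf-8')
'æ±\x89'

# Better to convert the latin-1-decoded string back to bytes using latin-1,
# then decode the bytes as utf-8, which restores “汉”
>>> hanbs3 = bytes(han_latin, encoding='latin-1')
>>> hanbs3
b'\xe6\xb1\x89'
>>> str(hanbs3, encoding='utf-8')
'汉'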


# Test with the letter a
>>> zimu = 'a'
>>> ord(zimu)
97
>>> print('0x%x' % ord(zimu))
0x61

>>> zimubs = bytes(zimu, encoding='utf-8')
>>> zimubs
b'a'
>>> zimu_latin = str(zimubs, 'latin-1')
>>> zimu_latin
'a'

# However it is converted, the resulting bytes are always b'a',
# because latin-1 and utf-8 agree for code points below 128 (the ASCII range)
>>> bytes(zimu_latin, 'utf-8')
b'a'
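The experiments above rely on two properties of latin-1. It maps every byte value 0 to 255 to the code point with the same number, and it produces the same bytes as utf-8 only in the ASCII range. The sketch below is my own check of both; the variable names are arbitrary.

```python
# Sketch: latin-1 is a lossless byte <-> code point mapping for 0..255.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")

# Byte N decodes to the character with code point N, and encodes back unchanged.
code_points = [ord(ch) for ch in as_text]
round_trip = as_text.encode("latin-1")

# Code points whose latin-1 and utf-8 encodings are identical.
same_in_both = [cp for cp in range(256)
                if chr(cp).encode("latin-1") == chr(cp).encode("utf-8")]

print(code_points == list(range(256)), round_trip == all_bytes)
print(min(same_in_both), max(same_in_both))
```

This is why encoding a garbled string with latin-1 recovers the original bytes exactly, and why 'a' comes out the same under both codecs while '\xe6' does not.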