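A few commonly used concepts (Python 2):
1. unicode and str:
A unicode object is text that has not been encoded: a sequence of code points. A str object is a byte string that has already been encoded with one specific encoding, such as gbk, utf-8, or ascii. Both are subclasses of basestring.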
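To make the distinction concrete, here is a minimal sketch. It is written to run on both Python 2 and Python 3 (where the same two roles are played by str and bytes); the variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Text: two code points, with no encoding attached.
text = u'中國'

# The same text as two different byte strings.
utf8_bytes = text.encode('utf-8')   # 3 bytes per character for these characters
gbk_bytes = text.encode('gbk')      # 2 bytes per character for these characters

# Decoding with the matching codec recovers the same text.
same_from_utf8 = utf8_bytes.decode('utf-8')
same_from_gbk = gbk_bytes.decode('gbk')

print(len(text), len(utf8_bytes), len(gbk_bytes))
```

The byte strings differ, but both decode back to the same text, which is why the encoding of a byte string always has to be known before it can be interpreted.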
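2. System encoding, source code encoding, file system encoding, terminal input/output encoding
System encoding: the default encoding of the environment. It is normally gbk on a Simplified Chinese Windows system and utf-8 on Linux. It can be queried with locale.getdefaultlocale() and set with locale.setlocale(). It matters when text is encoded for output.
Source code encoding: the encoding of the Python source file. The default is ascii; another can be declared with a comment such as "# -*- coding: utf-8 -*-" on the first or second line. Separately, Python's default encoding for implicit conversions between str and unicode can be read with sys.getdefaultencoding() and changed with sys.setdefaultencoding(), which is only reachable after reload(sys).
File system encoding: sys.getfilesystemencoding(), the encoding used for file names (not file contents)
Terminal input encoding: sys.stdin.encoding
Terminal output encoding: sys.stdout.encoding, which is taken from the locale. print encodes unicode objects with it, so it must match the encoding the terminal actually uses for them to display correctly. A str is written as raw bytes, so it displays correctly only when it is already in the terminal's encoding.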
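The settings listed above can be inspected in one place. The following is a small sketch that runs on both Python 2 and Python 3; report_encodings is a hypothetical helper name, and the values it returns depend entirely on the machine it runs on.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import locale
import sys


def report_encodings():
    """Collect the encoding-related settings described above."""
    return {
        'locale preferred': locale.getpreferredencoding(),
        'python default': sys.getdefaultencoding(),
        'file system': sys.getfilesystemencoding(),
        # The streams may lack .encoding or report None (for example when
        # redirected under Python 2), so read the attribute defensively.
        'stdin': getattr(sys.stdin, 'encoding', None),
        'stdout': getattr(sys.stdout, 'encoding', None),
    }


for name, value in sorted(report_encodings().items()):
    print('%-16s %s' % (name, value))
```

Comparing the 'stdout' value with the encoding the terminal emulator is configured to use is usually the quickest way to spot a mismatch.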
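3. For encoding conversions, work with unicode throughout the code wherever possible: decode to unicode at the input boundary, and encode to the appropriate str at the output boundary.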
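The decode-at-input, encode-at-output pattern can be sketched as follows. The function name transcode and the strip() step are illustrative assumptions standing in for whatever processing a real program does; the code runs on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def transcode(raw, source_encoding, target_encoding):
    """Decode at the input edge, work on text, encode at the output edge."""
    text = raw.decode(source_encoding)    # bytes -> text, done once on input
    text = text.strip()                   # all processing happens on text
    return text.encode(target_encoding)   # text -> bytes, done once on output


utf8_input = u'  中國\n'.encode('utf-8')
gbk_output = transcode(utf8_input, 'utf-8', 'gbk')
print(repr(gbk_output))
```

Keeping every intermediate value as text means the program never has to guess which encoding a byte string is in.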
Example 1:
#coding:utf-8  # required because this .py file is saved as utf-8
import sys
import locale
import os
import codecs
reload(sys)
print sys.getdefaultencoding() + " - sys.getdefaultencoding()"
sys.setdefaultencoding('utf8')  # affects encode() called without an argument
print sys.getdefaultencoding() + " - sys.getdefaultencoding()"
print sys.stdout.encoding + " - sys.stdout.encoding:"
#sys.stdout = codecs.getwriter('utf8')(sys.stdout)  # affects print
print sys.stdout.encoding + " - sys.stdout.encoding:"
u = u'中國'
print u + " - u"
a = '中國'
print a + " - a"
print a.decode('utf-8') + " - a.decode('utf-8')"
print a.decode('utf-8').encode('gbk') + " - a.decode('utf-8').encode('gbk')"
print a.decode('utf-8').encode('utf-8') + " - a.decode('utf-8').encode('utf-8')"
print a.decode('utf-8').encode() + " - a.decode('utf-8').encode()"
print (sys.stdout.encoding) + " - (sys.stdout.encoding)"
print (sys.stdout.isatty())
print (locale.getpreferredencoding())
print (sys.getfilesystemencoding())
Results:
1. Terminal: utf-8, locale: gbk
ascii - sys.getdefaultencoding()
utf8 - sys.getdefaultencoding()
GBK - sys.stdout.encoding:
GBK - sys.stdout.encoding:
�й� - u
中國 - a
�й� - a.decode('utf-8')
�й� - a.decode('utf-8').encode('gbk')
中國 - a.decode('utf-8').encode('utf-8')
中國 - a.decode('utf-8').encode()
GBK - (sys.stdout.encoding)
True
GBK
utf-8
2. Terminal: utf-8, locale: utf-8
ascii - sys.getdefaultencoding()
utf8 - sys.getdefaultencoding()
UTF-8 - sys.stdout.encoding:
UTF-8 - sys.stdout.encoding:
中國 - u
中國 - a
中國 - a.decode('utf-8')
�й� - a.decode('utf-8').encode('gbk')
中國 - a.decode('utf-8').encode('utf-8')
中國 - a.decode('utf-8').encode()
UTF-8 - (sys.stdout.encoding)
True
UTF-8
utf-8
3. Terminal: gbk, locale: gbk
ascii - sys.getdefaultencoding()
utf8 - sys.getdefaultencoding()
GBK - sys.stdout.encoding:
GBK - sys.stdout.encoding:
中國 - u
涓???? - a
中國 - a.decode('utf-8')
中國 - a.decode('utf-8').encode('gbk')
涓???? - a.decode('utf-8').encode('utf-8')
涓???? - a.decode('utf-8').encode()
GBK - (sys.stdout.encoding)
True
GBK
utf-8
4. Terminal: gbk, locale: utf-8
ascii - sys.getdefaultencoding()
utf8 - sys.getdefaultencoding()
UTF-8 - sys.stdout.encoding:
UTF-8 - sys.stdout.encoding:
涓???? - u
涓???? - a
涓???? - a.decode('utf-8')
中國 - a.decode('utf-8').encode('gbk')
涓???? - a.decode('utf-8').encode('utf-8')
涓???? - a.decode('utf-8').encode()
UTF-8 - (sys.stdout.encoding)
True
UTF-8
utf-8
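The garbled lines in these results all come from one cause: the bytes sent to the terminal were in a different encoding from the one the terminal used to display them. The sketch below approximates this without a real terminal by decoding bytes with a chosen "terminal" encoding; shown_by_terminal is a hypothetical helper, and a real terminal may render undecodable bytes differently. It runs on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def shown_by_terminal(data, terminal_encoding):
    """Approximate what a terminal displays for the given bytes.

    Undecodable bytes become U+FFFD, the replacement character.
    """
    return data.decode(terminal_encoding, 'replace')


text = u'中國'
utf8_bytes = text.encode('utf-8')
gbk_bytes = text.encode('gbk')

# Bytes and terminal agree: the text survives.
matched_utf8 = shown_by_terminal(utf8_bytes, 'utf-8')
matched_gbk = shown_by_terminal(gbk_bytes, 'gbk')

# Bytes and terminal disagree: the text is garbled.
gbk_on_utf8 = shown_by_terminal(gbk_bytes, 'utf-8')
utf8_on_gbk = shown_by_terminal(utf8_bytes, 'gbk')

print(matched_utf8 == text, matched_gbk == text)   # True True
print(gbk_on_utf8 == text, utf8_on_gbk == text)    # False False
```

In the four runs, a unicode object prints correctly only when the locale (and therefore sys.stdout.encoding) matches the terminal, while a str prints correctly only when its own bytes are already in the terminal's encoding.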