Python中的bytes、str以及unicode区别

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"从Python发展历史谈起"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Python3和Python2表示字符序列的方式有所不同。"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"strong"}],"text":"Python3字符序列的两种表示为"},{"type":"text","marks":[{"type":"italic"},{"type":"size","attrs":{"size":16}},{"type":"strong"}],"text":"byte和str"},{"type":"text","marks":[{"type":"italic"},{"type":"strong"}],"text":"。"},{"type":"text","marks":[{"type":"italic"},{"type":"size","attrs":{"size":14}},{"type":"strong"}],"text":"前者的实例包含原始的8位值,即原始的字节;后者的实例包括Unicode字符。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"strong"}],"text":"Python2字符序列的两种表示为"},{"type":"text","marks":[{"type":"italic"},{"type":"size","attrs":{"size":16}},{"type":"strong"}],"text":"str和unicode"},{"type":"text","marks":[{"type":"italic"},{"type":"strong"}],"text":"。"},{"type":"text","marks":[{"type":"italic"},{"type":"size","attrs":{"size":14}},{"type":"strong"}],"text":"与Python3不同的是,str实例包含原始的8位值;而unicode的实例,则包含Unicode字符。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"类型转换"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"把Unicode字符表示为二进制数据有许多方法。做常见的编码方式是UTF-8。但是python3的str实例和Python2的unicode实例都没有和特定的二进制编码形式相关联。要想把Unicode字符转换为二进制数据,就必须使用encode方法。要想把二进制数据转换成Unicode字符,则必须使用decode方法。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因此,在编写大型复杂的Python程序的时候,一般把编码和解码的相关操作放到最外层来做。程序核心部分使用Unicode字符类型,也就是Python3的str以及Python2的unicode,而且不要对字符编码做任何假设。这种办法既可以让程序接受多种类型的文本编码,又可以保证输出的文本信息只采用一种编码形式(最好是UTF-8),灵活性很高。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所以,可以编写两个辅助函数,以便对序列进行转换,使得转换后的输入数据符合开发者预期。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"对于Python3"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在Python3中,我们需要编写一个接收str或bytes,并总是返回str的方法:"}]},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"def to_str(bytes_or_str):\n if isinstance(bytes_or_str, bytes):\n return bytes_or_str.decode('utf-8')\n return bytes_or_str # instance of str"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以及一个总是返回bytes的方法:"}]},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"def to_bytes(bytes_or_str):\n if isinstance(bytes_or_str, str):\n return bytes_or_str.encode('utf-8')\n return bytes_or_str # instance of bytes"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"对于Python2"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在Python2中,我们需要编写一个接收str或unicode,并总是返回unicode的方法:"}]},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"def to_unicode(unicode_or_str):\n if isinstance(unicode_or_str, str):\n return unicode_or_str.decode('utf-8')\n return unicode_or_str # instance of unicode"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以及一个总是返回str的方法:"}]},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"def to_str(unicode_or_str):\n if isinstance(unicode_or_str, unicode):\n return unicode_or_str.encode('utf-8')\n return unicode_or_str # instance of str"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"面临的问题"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在Python中使用原始8位值与Unicode字符时,通常有两个问题需要注意。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"underline"},{"type":"strong"}],"text":"第一个问题"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一个问题通常出现在Python2中,如果你用的是Python3,可以暂且忽略这个问题。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果str只包含7位的ASCII字符,那么unicode和str实例似乎就成了同一种类型。"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可以用+操作符把str与unicode连接起来"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可以进行比较操作比如, "},{"type":"text","marks":[{"type":"strong"}],"text":"=="},{"type":"text","text":"、"},{"type":"text","marks":[{"type":"strong"}],"text":"!=、"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"这些行为一位着,只处理7位的ASCII的情况下,如果某个函数接受str,那么可以给它传入unicode;如果某个函数接收unicode,也可以传入str。而在Python3中,bytes和str实例绝对不会等价,更不能进行比较,即使是空字符串也不可以,因为这是完全不同的两个类型。所以,在传入字符序列的时候必须留意其类型。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"underline"},{"type":"strong"}],"text":"第二个问题"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二个问题可能会出现在Python3上,也是经常遇到的一些问题。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果通过内置的open函数获取讴歌文件句柄,那么需要注意的是,该句柄默认会采用UTF-8编码格式来操作文件。而在Python2中,文件操作的默认格式是二进制的,这可能会导致程序出现奇怪的错误。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"例如,现在要向文件中随机写入二进制数据,下面这个写法在Python2中不会有什么问题,但在Python3中会有异常,提示"},{"type":"codeinline","content":[{"type":"text","text":"TypeError: write() argument must be str, not bytes"}]}]},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"with open('./random.bin', 'w') as f:\n f.write(os.urandom(10))\n\n>>> TypeError: write() argument must be str, not bytes"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"发生上述问题的原因是Python3给open函数添加了encode参数,而这个新参数的默认值是UTF-8。这样一来,系统就会要求开发者必须传入包含Unicode字符的str实例,而不是包含二进制数据的bytes实例。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"为了解决这个问题,我们必须用二进制写入模式,即将原来的'w'修改为'wb',这样一来就可以同时适配Python2和Python3。从文件中读取数据的时候也会有类似的问题,解决方法与写入类似,使用"},{"type":"text","marks":[{"type":"strong"}],"text":"'rb'"},{"type":"text","text":"模式打开文件,而不是"},{"type":"text","marks":[{"type":"strong"}],"text":"'r'"},{"type":"text","text":"模式。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"总结"}]},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":1,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"在Python3中,bytes是一种包含8位值的序列,str是一种包含Unicode字符的序列。开发者不用"},{"type":"text","marks":[{"type":"strong"}],"text":"比较操作"},{"type":"text","text":"来混合处理。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":1,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"在Python2中,str是一种包含8位值的序列,unicode是一种包含Unicode字符的序列。如果str只有7位ASCII字符,那么可以进行比较运算以及连接操作。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":1,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"在对输入的数据操作之前,使用辅助函数来保证字符序列的类型与开发者的期望一致。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":1,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"从文件中读取二进制数据,或向其中写入二进制数据时,总应该以'rb'或'wb'等二进制模式来开启文件。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"参考阅读:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://book.douban.com/subject/26312313/","title":""},"content":[{"type":"text","text":"《Effective Python》"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章