What's the difference between ASCII and Unicode?

Translated from: What's the difference between ASCII and Unicode?

What exactly is the difference between Unicode and ASCII?

ASCII has a total of 128 characters (256 in the extended set).

Is there any size specification for Unicode characters?


Reply #1

Reference: https://stackoom.com/question/1Ibzu/ASCII和Unicode有什麼區別


Reply #2

ASCII defines 128 characters, which map to the numbers 0–127. Unicode defines (less than) 2^21 characters, which, similarly, map to numbers 0–2^21 (though not all numbers are currently assigned, and some are reserved).

Unicode is a superset of ASCII, and the numbers 0–127 have the same meaning in ASCII as they have in Unicode. For example, the number 65 means "Latin capital 'A'".

Because Unicode characters don't generally fit into one 8-bit byte, there are numerous ways of storing Unicode characters in byte sequences, such as UTF-32 and UTF-8.
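To make the "numbers" in this answer concrete, here is a small Python sketch. It only uses the built-ins `ord`, `chr` and `sys.maxunicode`; the variable names are mine and purely illustrative.

```python
import sys

# A code point is just an integer; ord/chr convert between it and the character.
assert ord("A") == 65   # the same number in ASCII and in Unicode
assert chr(65) == "A"

# Every ASCII value 0-127 decodes to the Unicode code point with the same number.
ascii_text = bytes(range(128)).decode("ascii")
ascii_matches_unicode = all(ord(ch) == i for i, ch in enumerate(ascii_text))

# The Unicode code space ends at U+10FFFF, which is below 2**21.
max_code_point = sys.maxunicode
fits_in_21_bits = max_code_point < 2**21

print(ascii_matches_unicode, fits_in_21_bits, max_code_point + 1)
```

The last value printed is the total number of code positions, assigned or not.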


Reply #3

ASCII has 128 code positions, allocated to graphic characters and control characters (control codes).

Unicode has 1,114,112 code positions. About 100,000 of them have currently been allocated to characters, some code points have been made permanently noncharacters (i.e., not used to encode any character ever), and most code points are not yet assigned.

The only things that ASCII and Unicode have in common are: 1) They are character codes. 2) The first 128 code positions of Unicode have been defined to have the same meanings as in ASCII, except that the code positions of ASCII control characters are just defined as denoting control characters, with names corresponding to their ASCII names, but their meanings are not defined in Unicode.

Sometimes, however, Unicode is characterized (even in the Unicode standard!) as “wide ASCII”. This is a slogan that mainly tries to convey the idea that Unicode is meant to be a universal character code the same way as ASCII once was (though the character repertoire of ASCII was hopelessly insufficient for universal use), as opposed to using different codes in different systems and applications and for different languages.

Unicode as such defines only the “logical size” of characters: each character has a code number in a specific range. These code numbers can be presented using different transfer encodings, and internally, in memory, Unicode characters are usually represented using one or two 16-bit quantities per character, depending on character range, sometimes using one 32-bit quantity per character.


Reply #4

ASCII has 128 code points, 0 through 127. It fits in a single 8-bit byte, and the values 128 through 255 tended to be used for other characters, with incompatible choices, causing the code page disaster. Text encoded in one code page cannot be read correctly by a program that assumes or guesses another code page.
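A quick way to see the code page problem is to decode the same byte under several legacy code pages. The sketch below uses three codecs that ship with Python (`latin-1`, `cp437`, `cp1251`); which code pages to compare is my choice, not something from the answer.

```python
# One byte above 127, interpreted under three different legacy code pages.
raw = bytes([0xE9])

as_latin1 = raw.decode("latin-1")   # ISO 8859-1 (Western European)
as_cp437 = raw.decode("cp437")      # original IBM PC code page
as_cp1251 = raw.decode("cp1251")    # Windows Cyrillic

print(as_latin1, as_cp437, as_cp1251)

# Text written under one code page and read under another gets garbled.
written = "café".encode("latin-1")
misread = written.decode("cp437")
print(misread)
```

The ASCII part ("caf") survives because all three code pages agree on 0–127; only the byte above 127 changes meaning.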

Unicode came about to solve this disaster. Version 1 started out with 65536 code points, commonly encoded in 16 bits. It was later extended in version 2 to 1.1 million code points. At the time of writing, the current version is 6.3, using 110,187 of the available 1.1 million code points. That doesn't fit in 16 bits anymore.

Encoding in 16 bits was common when v2 came around, used by Microsoft and Apple operating systems for example, and by language runtimes like Java. The v2 spec came up with a way to map those 1.1 million code points into 16 bits: an encoding called UTF-16, a variable length encoding where one code point can take either 2 or 4 bytes. The original v1 code points take 2 bytes, added ones take 4.
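The 2-or-4-byte rule can be sketched by hand. The helper below is a hypothetical illustration (the name `utf16_code_units` is mine) of the surrogate-pair arithmetic; it does no validation, so it assumes the input is a valid code point outside the surrogate range U+D800–U+DFFF.

```python
def utf16_code_units(code_point: int) -> list:
    """Return the 16-bit code units UTF-16 uses for one code point."""
    if code_point < 0x10000:
        return [code_point]            # one unit (2 bytes)
    offset = code_point - 0x10000      # 20 bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return [high, low]                 # surrogate pair (4 bytes)

units_a = utf16_code_units(ord("A"))
units_emoji = utf16_code_units(0x1F600)

print([hex(u) for u in units_a], [hex(u) for u in units_emoji])

# Cross-check against Python's own encoder.
encoded = "\U0001F600".encode("utf-16-be")
print(encoded.hex())
```

In practice you would just call `str.encode("utf-16-be")`; the function only shows where the two 16-bit units come from.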

Another very common variable length encoding, used in *nix operating systems and tools, is UTF-8. A code point can take between 1 and 4 bytes: the original ASCII codes take 1 byte, the rest take more. The only non-variable length encoding is UTF-32, which takes 4 bytes for a code point. It is not often used since it is pretty wasteful. There are other ones, like UTF-1 and UTF-7, widely ignored.
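As a rough illustration of those sizes, the following sketch encodes four sample characters (my choice of samples) and counts the bytes. It uses the `utf-32-be` codec so that no byte order mark is added to the count.

```python
# U+0041, U+00E9, U+20AC, U+1F600: one sample for each UTF-8 length.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
utf32_lengths = [len(ch.encode("utf-32-be")) for ch in samples]

for ch, n8, n32 in zip(samples, utf8_lengths, utf32_lengths):
    print(f"U+{ord(ch):04X}: UTF-8 {n8} byte(s), UTF-32 {n32} bytes")
```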

An issue with the UTF-16/32 encodings is that the order of the bytes will depend on the endianness of the machine that created the text stream. So add to the mix UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE.

Having these different encoding choices brings back the code page disaster to some degree, along with heated debates among programmers about which UTF choice is "best". Their association with operating system defaults pretty much draws the lines. One counter-measure is the definition of a BOM, the Byte Order Mark, a special code point (U+FEFF, zero width no-break space) at the beginning of a text stream that indicates how the rest of the stream is encoded. It indicates both the UTF encoding and the endianness and is neutral to a text rendering engine. Unfortunately it is optional and many programmers claim their right to omit it, so accidents are still pretty common.
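Here is a minimal Python sketch of the byte order issue and of how a BOM resolves it. It relies on the standard `codecs` module constants and on the fact that Python's generic `utf-16` decoder reads a leading BOM to pick the byte order.

```python
import codecs

# The same character, serialized with each byte order.
big = "A".encode("utf-16-be")      # 00 41
little = "A".encode("utf-16-le")   # 41 00
print(big.hex(), little.hex())

# U+FEFF serialized in each byte order gives the two UTF-16 BOMs.
bom_be = "\ufeff".encode("utf-16-be")
bom_le = "\ufeff".encode("utf-16-le")
print(bom_be.hex(), bom_le.hex())

# With a BOM in front, the generic decoder picks the right order and drops the BOM.
decoded_be = (codecs.BOM_UTF16_BE + big).decode("utf-16")
decoded_le = (codecs.BOM_UTF16_LE + little).decode("utf-16")
print(decoded_be, decoded_le)

# UTF-8 has no byte order problem, but U+FEFF can still mark a stream as UTF-8.
utf8_bom = "\ufeff".encode("utf-8")
print(utf8_bom.hex())
```

Without the BOM, a reader has to be told (or guess) whether the stream is UTF-16BE or UTF-16LE.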


Reply #5

ASCII defines 128 characters, whereas Unicode contains a repertoire of more than 120,000 characters.


Reply #6

Understanding why ASCII and Unicode were created in the first place helped me understand the differences between the two.

ASCII, Origins

As stated in the other answers, ASCII uses 7 bits to represent a character. By using 7 bits, we can have a maximum of 2^7 (= 128) distinct combinations*. Which means that we can represent 128 characters maximum.

Wait, 7 bits? But why not 1 byte (8 bits)?

The last bit (8th) was used as a parity bit for detecting errors. This was relevant years ago.
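For readers who have never seen a parity bit, here is a small sketch. It assumes even parity (the answer doesn't say which scheme was used, and both even and odd parity existed); the function names are hypothetical.

```python
def with_even_parity(code: int) -> int:
    """Put an even-parity bit in the 8th bit of a 7-bit ASCII code."""
    if not 0 <= code < 128:
        raise ValueError("expected a 7-bit value")
    ones = bin(code).count("1")
    parity = ones % 2              # 1 if the count of 1 bits is odd
    return (parity << 7) | code    # total number of 1 bits is now even

def parity_ok(byte: int) -> bool:
    """Check that an 8-bit value has an even number of 1 bits."""
    return bin(byte).count("1") % 2 == 0

sent_a = with_even_parity(ord("A"))   # 1000001 has two 1 bits, parity bit 0
sent_c = with_even_parity(ord("C"))   # 1000011 has three 1 bits, parity bit 1
corrupted = sent_a ^ 0b00000100       # flip a single bit "in transit"

print(format(sent_a, "08b"), format(sent_c, "08b"))
print(parity_ok(sent_a), parity_ok(sent_c), parity_ok(corrupted))
```

A single flipped bit is detected; two flipped bits cancel out and go unnoticed, which is one reason parity is only a weak check.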

Most ASCII characters are printable characters of the alphabet such as abc, ABC, 123, ?&!, etc. The others are control characters such as carriage return, line feed, tab, etc.

See below the binary representation of a few characters in ASCII:

0100101 -> % (Percent Sign - 37)
1000001 -> A (Capital letter A - 65)
1000010 -> B (Capital letter B - 66)
1000011 -> C (Capital letter C - 67)
0001101 -> Carriage Return (13)
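The rows above can be reproduced with Python's `format`, using a 7-digit binary format. This is just a sketch to check the table; `rows` is an illustrative name.

```python
# (7-bit binary string, decimal code) for the characters in the table above.
rows = [(format(ord(ch), "07b"), ord(ch)) for ch in ["%", "A", "B", "C", "\r"]]

for bits, code in rows:
    print(bits, "->", code)
```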

See the full ASCII table over here.

ASCII was meant for English only.

What? Why English only? So many languages out there!

Because the center of the computer industry was in the USA at that time. As a consequence, they didn't need to support accents or other marks such as á, ü, ç, ñ, etc. (aka diacritics).

ASCII Extended

Some clever people started using the 8th bit (the bit used for parity) to encode more characters to support their language (to support "é" in French, for example). Just using one extra bit doubled the size of the original ASCII table to map up to 256 characters (2^8 = 256 characters), and not 2^7 as before (128).

10000010 -> é (e with acute accent - 130)
10100000 -> á (a with acute accent - 160)

This "ASCII extended to 8 bits and not 7 bits as before" can simply be referred to as "extended ASCII" or "8-bit ASCII".

As @Tom pointed out in his comment below, there is no such thing as "extended ASCII", yet this is an easy way to refer to this 8th-bit trick. There are many variations of the 8-bit ASCII table, for example ISO 8859-1, also called ISO Latin-1.
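Because there are many 8-bit variations, the number assigned to "é" depends on which one you pick. The values 130 and 160 in the table above match the IBM PC code page 437; in ISO 8859-1 the same letters sit elsewhere. A short sketch using codecs bundled with Python (the choice of `mac_roman` as a third example is mine):

```python
# Byte value assigned to "é" and "á" by a few 8-bit character sets.
codes = {
    codec: ("é".encode(codec)[0], "á".encode(codec)[0])
    for codec in ("cp437", "latin-1", "mac_roman")
}

for codec, (e_acute, a_acute) in codes.items():
    print(f"{codec}: é = {e_acute}, á = {a_acute}")
```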

Unicode, The Rise

ASCII Extended solves the problem for languages that are based on the Latin alphabet... what about the others needing a completely different alphabet? Greek? Russian? Chinese and the like?

We would have needed an entirely new character set... that's the rationale behind Unicode. Unicode doesn't contain every character from every language, but it sure contains a gigantic amount of characters (see this table).

You cannot save text to your hard drive as "Unicode". Unicode is an abstract representation of the text. You need to "encode" this abstract representation. That's where an encoding comes into play.

Encodings: UTF-8 vs UTF-16 vs UTF-32

This answer does a pretty good job at explaining the basics:

  • UTF-8 and UTF-16 are variable length encodings.
  • In UTF-8, a character may occupy a minimum of 8 bits.
  • In UTF-16, a character length starts with 16 bits.
  • UTF-32 is a fixed length encoding of 32 bits.

UTF-8 uses the ASCII set for the first 128 characters. That's handy because it means ASCII text is also valid in UTF-8.
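The "ASCII text is also valid UTF-8" point can be checked directly. A minimal sketch, with an arbitrary sample string of my own:

```python
text = "Hello, ASCII!"

# Pure ASCII text produces byte-for-byte identical output in both encodings.
same_bytes = text.encode("ascii") == text.encode("utf-8")

# So bytes written as ASCII can be read back by a UTF-8 decoder unchanged.
roundtrip = text.encode("ascii").decode("utf-8")

# Non-ASCII characters use only bytes >= 0x80, so they never collide with ASCII.
non_ascii = "é".encode("utf-8")
all_high = all(b >= 0x80 for b in non_ascii)

print(same_bytes, roundtrip == text, len(non_ascii), all_high)
```

The reverse is not true: UTF-8 text containing "é" is not valid ASCII, and `"é".encode("ascii")` raises `UnicodeEncodeError`.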

Mnemonics:

  • UTF-8: minimum 8 bits.
  • UTF-16: minimum 16 bits.
  • UTF-32: minimum and maximum 32 bits.

Note:

*Why 2^7?

This is obvious for some, but just in case. We have seven slots available, each filled with either 0 or 1 (binary code). Each slot can take two values. If we have seven slots, we have 2 * 2 * 2 * 2 * 2 * 2 * 2 = 2^7 = 128 combinations. Think about this as a combination lock with seven wheels, each wheel having two numbers only.
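If you would rather count than multiply, the combination lock can be enumerated with `itertools.product` from the standard library. A tiny sketch:

```python
from itertools import product

# Every setting of seven wheels, each showing either 0 or 1.
patterns = ["".join(bits) for bits in product("01", repeat=7)]

print(len(patterns), patterns[0], patterns[-1])

# The patterns come out in counting order, so index 65 is the pattern for 'A'.
print(patterns[65])
```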

Source: Wikipedia, this great blog post, and Mocki, where I initially posted this summary.
