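I have been mulling over this post for a long time without daring to publish it. To be frank, this piece of basic computing knowledge has always been murky to me. I was never sure what to write, and I kept worrying whether my understanding was correct and whether I would be looked down on. I have tried to back things up with tests, but my level is not high, so I am still afraid that I have not really mastered something this basic.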
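Before going into the details of text encoding, there are a few concepts to get to know first:
- Character
- Character set
- Character code
- Character encoding scheme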
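Character:
A symbol used to represent a language system visually. This is easy to grasp: it is the kind of thing we see every day, such as A, B, C, D, 啊, 喔, 額.
Character set:
The total number of characters, symbols, and digits we see in daily life is huge, and handling all of them at once is impractical, so people decide in advance which characters will be used. That collection of characters is called a character set. Well-known representative character sets are ASCII from the United States, ISO 8859 for European languages, GB 2312 from China, and the later Unicode character set, which aims to cover many languages.
Character code:
Within a character set, every character is assigned a number. That number is called the character code.
Character encoding scheme:
The way a computer represents character codes using nothing but integers (in practice, sequences of bytes) is called the character encoding scheme.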
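To make the difference between a character code and an encoding scheme concrete, here is a small Java sketch of my own. The class name `CodePointSketch` and the helpers `hex` and `encode` are illustrative names, not anything from a library. The character is written as a `\u` escape so that the sketch does not depend on the encoding of the source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodePointSketch {
    // Formats bytes as space-separated upper-case hex, e.g. "E5 A4 A9".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // Encodes text with the given charset and returns the bytes as hex.
    static String encode(String text, Charset cs) {
        return hex(text.getBytes(cs));
    }

    public static void main(String[] args) {
        String tian = "\u5929"; // the character 天

        // Character code: the number Unicode assigns to the character.
        System.out.printf("code point: U+%04X%n", tian.codePointAt(0));

        // Encoding schemes: different byte sequences for the same code.
        System.out.println("UTF-8    : " + encode(tian, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE : " + encode(tian, StandardCharsets.UTF_16BE));
        System.out.println("ASCII 'A': " + encode("A", StandardCharsets.US_ASCII));
    }
}
```

The one character code U+5929 becomes the three bytes E5 A4 A9 in UTF-8 but the two bytes 59 29 in UTF-16BE, which is exactly the distinction between "character code" and "character encoding scheme" above.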
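Now things seem a little clearer. A computer can handle images, animation, all kinds of programs, and all kinds of data, but the CPU can only process binary numbers, so every kind of object it handles has to be converted to binary. The people who first built computers spoke English, so early character sets such as ASCII contain only letters, digits, and basic symbols. As computers spread to other countries, including China, ASCII was no longer enough, so national character sets such as GB 2312 appeared, and later Unicode, which aims to cover all of them.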
OK, now let's write some code and look at the details.
First, let's use C# to list the encodings that .NET supports:
    using System;
    using System.IO;
    using System.Text;

    namespace Text
    {
        class Program
        {
            static void Main(string[] args)
            {
                // Collect one line per encoding known to this .NET runtime.
                StringBuilder sb = new StringBuilder();
                foreach (EncodingInfo coif in Encoding.GetEncodings())
                {
                    sb.Append("Display Name: " + coif.DisplayName + "----Name: " + coif.Name + "\n");
                }

                // "Unicode" is .NET's name for UTF-16 little endian.
                byte[] coByte = Encoding.GetEncoding("Unicode").GetBytes(sb.ToString());

                // FileMode.Create truncates an existing file, so bytes left over
                // from an earlier, longer run cannot survive at the end.
                using (FileStream fs = File.Open("c:\\code.txt", FileMode.Create))
                {
                    fs.Write(coByte, 0, coByte.Length);
                }
                Console.ReadKey();
            }
        }
    }
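I originally printed this to the console, but there turned out to be so much output that I had to write it to a file instead. Here is the output (the display names are localized, and my system is a Chinese one):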
    Display Name: IBM EBCDIC (美國-加拿大)----Name: IBM037
    Display Name: OEM 美國----Name: IBM437
    Display Name: IBM EBCDIC (國際)----Name: IBM500
    Display Name: 阿拉伯字符(ASMO-708)----Name: ASMO-708
    Display Name: 阿拉伯字符(DOS)----Name: DOS-720
    Display Name: 希臘字符(DOS)----Name: ibm737
    Display Name: 波羅的海字符(DOS)----Name: ibm775
    Display Name: 西歐字符(DOS)----Name: ibm850
    Display Name: 中歐字符(DOS)----Name: ibm852
    Display Name: OEM 西里爾語----Name: IBM855
    Display Name: 土耳其字符(DOS)----Name: ibm857
    Display Name: OEM 多語言拉丁語 I----Name: IBM00858
    Display Name: 葡萄牙語(DOS)----Name: IBM860
    Display Name: 冰島語(DOS)----Name: ibm861
    Display Name: 希伯來字符(DOS)----Name: DOS-862
    Display Name: 加拿大法語(DOS)----Name: IBM863
    Display Name: 阿拉伯字符(864)----Name: IBM864
    Display Name: 北歐字符(DOS)----Name: IBM865
    Display Name: 西里爾字符(DOS)----Name: cp866
    Display Name: 現代希臘字符(DOS)----Name: ibm869
    Display Name: IBM EBCDIC (多語言拉丁語 2)----Name: IBM870
    Display Name: 泰語(Windows)----Name: windows-874
    Display Name: IBM EBCDIC (現代希臘語)----Name: cp875
    Display Name: 日語(Shift-JIS)----Name: shift_jis
    Display Name: 簡體中文(GB2312)----Name: gb2312
    Display Name: 朝鮮語----Name: ks_c_5601-1987
    Display Name: 繁體中文(Big5)----Name: big5
    Display Name: IBM EBCDIC (土耳其拉丁語 5)----Name: IBM1026
    Display Name: IBM 拉丁語 1----Name: IBM01047
    Display Name: IBM EBCDIC (美國-加拿大-歐洲)----Name: IBM01140
    Display Name: IBM EBCDIC (德國-歐洲)----Name: IBM01141
    Display Name: IBM EBCDIC (丹麥-挪威-歐洲)----Name: IBM01142
    Display Name: IBM EBCDIC (芬蘭-瑞典-歐洲)----Name: IBM01143
    Display Name: IBM EBCDIC (意大利-歐洲)----Name: IBM01144
    Display Name: IBM EBCDIC (西班牙-歐洲)----Name: IBM01145
    Display Name: IBM EBCDIC (英國-歐洲)----Name: IBM01146
    Display Name: IBM EBCDIC (法國-歐洲)----Name: IBM01147
    Display Name: IBM EBCDIC (國際-歐洲)----Name: IBM01148
    Display Name: IBM EBCDIC (冰島語-歐洲)----Name: IBM01149
    Display Name: Unicode----Name: utf-16
    Display Name: Unicode (Big-Endian)----Name: utf-16BE
    Display Name: 中歐字符(Windows)----Name: windows-1250
    Display Name: 西里爾字符(Windows)----Name: windows-1251
    Display Name: 西歐字符(Windows)----Name: Windows-1252
    Display Name: 希臘字符(Windows)----Name: windows-1253
    Display Name: 土耳其字符(Windows)----Name: windows-1254
    Display Name: 希伯來字符(Windows)----Name: windows-1255
    Display Name: 阿拉伯字符(Windows)----Name: windows-1256
    Display Name: 波羅的海字符(Windows)----Name: windows-1257
    Display Name: 越南字符(Windows)----Name: windows-1258
    Display Name: 朝鮮語(Johab)----Name: Johab
    Display Name: 西歐字符(Mac)----Name: macintosh
    Display Name: 日語(Mac)----Name: x-mac-japanese
    Display Name: 繁體中文(Mac)----Name: x-mac-chinesetrad
    Display Name: 朝鮮語(Mac)----Name: x-mac-korean
    Display Name: 阿拉伯字符(Mac)----Name: x-mac-arabic
    Display Name: 希伯來字符(Mac)----Name: x-mac-hebrew
    Display Name: 希臘字符(Mac)----Name: x-mac-greek
    Display Name: 西里爾字符(Mac)----Name: x-mac-cyrillic
    Display Name: 簡體中文(Mac)----Name: x-mac-chinesesimp
    Display Name: 羅馬尼亞語(Mac)----Name: x-mac-romanian
    Display Name: 烏克蘭語(Mac)----Name: x-mac-ukrainian
    Display Name: 泰語(Mac)----Name: x-mac-thai
    Display Name: 中歐字符(Mac)----Name: x-mac-ce
    Display Name: 冰島語(Mac)----Name: x-mac-icelandic
    Display Name: 土耳其字符(Mac)----Name: x-mac-turkish
    Display Name: 克羅地亞語(Mac)----Name: x-mac-croatian
    Display Name: Unicode (UTF-32)----Name: utf-32
    Display Name: Unicode (UTF-32 Big-Endian)----Name: utf-32BE
    Display Name: 繁體中文(CNS)----Name: x-Chinese-CNS
    Display Name: TCA 臺灣----Name: x-cp20001
    Display Name: 繁體中文(Eten)----Name: x-Chinese-Eten
    Display Name: IBM5550 臺灣----Name: x-cp20003
    Display Name: TeleText 臺灣----Name: x-cp20004
    Display Name: Wang 臺灣----Name: x-cp20005
    Display Name: 西歐字符(IA5)----Name: x-IA5
    Display Name: 德語(IA5)----Name: x-IA5-German
    Display Name: 瑞典語(IA5)----Name: x-IA5-Swedish
    Display Name: 挪威語(IA5)----Name: x-IA5-Norwegian
    Display Name: US-ASCII----Name: us-ascii
    Display Name: T.61----Name: x-cp20261
    Display Name: ISO-6937----Name: x-cp20269
    Display Name: IBM EBCDIC (德國)----Name: IBM273
    Display Name: IBM EBCDIC (丹麥-挪威)----Name: IBM277
    Display Name: IBM EBCDIC (芬蘭-瑞典)----Name: IBM278
    Display Name: IBM EBCDIC (意大利)----Name: IBM280
    Display Name: IBM EBCDIC (西班牙)----Name: IBM284
    Display Name: IBM EBCDIC (UK)----Name: IBM285
    Display Name: IBM EBCDIC (日語片假名)----Name: IBM290
    Display Name: IBM EBCDIC (法國)----Name: IBM297
    Display Name: IBM EBCDIC (阿拉伯語)----Name: IBM420
    Display Name: IBM EBCDIC (希臘語)----Name: IBM423
    Display Name: IBM EBCDIC (希伯來語)----Name: IBM424
    Display Name: IBM EBCDIC (朝鮮語擴展)----Name: x-EBCDIC-KoreanExtended
    Display Name: IBM EBCDIC (泰語)----Name: IBM-Thai
    Display Name: 西里爾字符(KOI8-R)----Name: koi8-r
    Display Name: IBM EBCDIC (冰島語)----Name: IBM871
    Display Name: IBM EBCDIC (西里爾俄語)----Name: IBM880
    Display Name: IBM EBCDIC (土耳其語)----Name: IBM905
    Display Name: IBM 拉丁語 1----Name: IBM00924
    Display Name: 日語(JIS 0208-1990 和 0212-1990)----Name: EUC-JP
    Display Name: 簡體中文(GB2312-80)----Name: x-cp20936
    Display Name: 朝鮮語 Wansung----Name: x-cp20949
    Display Name: IBM EBCDIC (西里爾塞爾維亞-保加利亞語)----Name: cp1025
    Display Name: 西里爾字符(KOI8-U)----Name: koi8-u
    Display Name: 西歐字符(ISO)----Name: iso-8859-1
    Display Name: 中歐字符(ISO)----Name: iso-8859-2
    Display Name: 拉丁語 3 (ISO)----Name: iso-8859-3
    Display Name: 波羅的海字符(ISO)----Name: iso-8859-4
    Display Name: 西里爾字符(ISO)----Name: iso-8859-5
    Display Name: 阿拉伯字符(ISO)----Name: iso-8859-6
    Display Name: 希臘字符(ISO)----Name: iso-8859-7
    Display Name: 希伯來字符(ISO-Visual)----Name: iso-8859-8
    Display Name: 土耳其字符(ISO)----Name: iso-8859-9
    Display Name: 愛沙尼亞語(ISO)----Name: iso-8859-13
    Display Name: 拉丁語 9 (ISO)----Name: iso-8859-15
    Display Name: 歐羅巴----Name: x-Europa
    Display Name: 希伯來字符(ISO-Logical)----Name: iso-8859-8-i
    Display Name: 日語(JIS)----Name: iso-2022-jp
    Display Name: 日語(JIS-允許 1 字節假名)----Name: csISO2022JP
    Display Name: 日語(JIS-允許 1 字節假名 - SO/SI)----Name: iso-2022-jp
    Display Name: 朝鮮語(ISO)----Name: iso-2022-kr
    Display Name: 簡體中文(ISO-2022)----Name: x-cp50227
    Display Name: 日語(EUC)----Name: euc-jp
    Display Name: 簡體中文(EUC)----Name: EUC-CN
    Display Name: 朝鮮語(EUC)----Name: euc-kr
    Display Name: 簡體中文(HZ)----Name: hz-gb-2312
    Display Name: 簡體中文(GB18030)----Name: GB18030
    Display Name: ISCII 梵文----Name: x-iscii-de
    Display Name: ISCII 孟加拉語----Name: x-iscii-be
    Display Name: ISCII 泰米爾語----Name: x-iscii-ta
    Display Name: ISCII 泰盧固語----Name: x-iscii-te
    Display Name: ISCII 阿薩姆語----Name: x-iscii-as
    Display Name: ISCII 奧裏雅語----Name: x-iscii-or
    Display Name: ISCII 卡納達語----Name: x-iscii-ka
    Display Name: ISCII 馬拉雅拉姆語----Name: x-iscii-ma
    Display Name: ISCII 古吉拉特語----Name: x-iscii-gu
    Display Name: ISCII 旁遮普語----Name: x-iscii-pa
    Display Name: Unicode (UTF-7)----Name: utf-7
    Display Name: Unicode (UTF-8)----Name: utf-8
Now let's look at the same thing in Java:
    package code;

    import java.nio.charset.Charset;
    import java.util.SortedMap;

    public class Code {

        public static void main(String[] args) {
            // availableCharsets() maps canonical names to the charsets this JVM supports.
            SortedMap<String, Charset> availableSet = Charset.availableCharsets();
            for (Charset charset : availableSet.values()) {
                System.out.println("DisplayName: " + charset.displayName() + " Name: " + charset.name());
            }
        }

    }
Here is the output:
    DisplayName: Big5 Name: Big5
    DisplayName: Big5-HKSCS Name: Big5-HKSCS
    DisplayName: EUC-JP Name: EUC-JP
    DisplayName: EUC-KR Name: EUC-KR
    DisplayName: GB18030 Name: GB18030
    DisplayName: GB2312 Name: GB2312
    DisplayName: GBK Name: GBK
    DisplayName: IBM-Thai Name: IBM-Thai
    DisplayName: IBM00858 Name: IBM00858
    DisplayName: IBM01140 Name: IBM01140
    DisplayName: IBM01141 Name: IBM01141
    DisplayName: IBM01142 Name: IBM01142
    DisplayName: IBM01143 Name: IBM01143
    DisplayName: IBM01144 Name: IBM01144
    DisplayName: IBM01145 Name: IBM01145
    DisplayName: IBM01146 Name: IBM01146
    DisplayName: IBM01147 Name: IBM01147
    DisplayName: IBM01148 Name: IBM01148
    DisplayName: IBM01149 Name: IBM01149
    DisplayName: IBM037 Name: IBM037
    DisplayName: IBM1026 Name: IBM1026
    DisplayName: IBM1047 Name: IBM1047
    DisplayName: IBM273 Name: IBM273
    DisplayName: IBM277 Name: IBM277
    DisplayName: IBM278 Name: IBM278
    DisplayName: IBM280 Name: IBM280
    DisplayName: IBM284 Name: IBM284
    DisplayName: IBM285 Name: IBM285
    DisplayName: IBM297 Name: IBM297
    DisplayName: IBM420 Name: IBM420
    DisplayName: IBM424 Name: IBM424
    DisplayName: IBM437 Name: IBM437
    DisplayName: IBM500 Name: IBM500
    DisplayName: IBM775 Name: IBM775
    DisplayName: IBM850 Name: IBM850
    DisplayName: IBM852 Name: IBM852
    DisplayName: IBM855 Name: IBM855
    DisplayName: IBM857 Name: IBM857
    DisplayName: IBM860 Name: IBM860
    DisplayName: IBM861 Name: IBM861
    DisplayName: IBM862 Name: IBM862
    DisplayName: IBM863 Name: IBM863
    DisplayName: IBM864 Name: IBM864
    DisplayName: IBM865 Name: IBM865
    DisplayName: IBM866 Name: IBM866
    DisplayName: IBM868 Name: IBM868
    DisplayName: IBM869 Name: IBM869
    DisplayName: IBM870 Name: IBM870
    DisplayName: IBM871 Name: IBM871
    DisplayName: IBM918 Name: IBM918
    DisplayName: ISO-2022-CN Name: ISO-2022-CN
    DisplayName: ISO-2022-JP Name: ISO-2022-JP
    DisplayName: ISO-2022-JP-2 Name: ISO-2022-JP-2
    DisplayName: ISO-2022-KR Name: ISO-2022-KR
    DisplayName: ISO-8859-1 Name: ISO-8859-1
    DisplayName: ISO-8859-13 Name: ISO-8859-13
    DisplayName: ISO-8859-15 Name: ISO-8859-15
    DisplayName: ISO-8859-2 Name: ISO-8859-2
    DisplayName: ISO-8859-3 Name: ISO-8859-3
    DisplayName: ISO-8859-4 Name: ISO-8859-4
    DisplayName: ISO-8859-5 Name: ISO-8859-5
    DisplayName: ISO-8859-6 Name: ISO-8859-6
    DisplayName: ISO-8859-7 Name: ISO-8859-7
    DisplayName: ISO-8859-8 Name: ISO-8859-8
    DisplayName: ISO-8859-9 Name: ISO-8859-9
    DisplayName: JIS_X0201 Name: JIS_X0201
    DisplayName: JIS_X0212-1990 Name: JIS_X0212-1990
    DisplayName: KOI8-R Name: KOI8-R
    DisplayName: KOI8-U Name: KOI8-U
    DisplayName: Shift_JIS Name: Shift_JIS
    DisplayName: TIS-620 Name: TIS-620
    DisplayName: US-ASCII Name: US-ASCII
    DisplayName: UTF-16 Name: UTF-16
    DisplayName: UTF-16BE Name: UTF-16BE
    DisplayName: UTF-16LE Name: UTF-16LE
    DisplayName: UTF-32 Name: UTF-32
    DisplayName: UTF-32BE Name: UTF-32BE
    DisplayName: UTF-32LE Name: UTF-32LE
    DisplayName: UTF-8 Name: UTF-8
    DisplayName: windows-1250 Name: windows-1250
    DisplayName: windows-1251 Name: windows-1251
    DisplayName: windows-1252 Name: windows-1252
    DisplayName: windows-1253 Name: windows-1253
    DisplayName: windows-1254 Name: windows-1254
    DisplayName: windows-1255 Name: windows-1255
    DisplayName: windows-1256 Name: windows-1256
    DisplayName: windows-1257 Name: windows-1257
    DisplayName: windows-1258 Name: windows-1258
    DisplayName: windows-31j Name: windows-31j
    DisplayName: x-Big5-HKSCS-2001 Name: x-Big5-HKSCS-2001
    DisplayName: x-Big5-Solaris Name: x-Big5-Solaris
    DisplayName: x-euc-jp-linux Name: x-euc-jp-linux
    DisplayName: x-EUC-TW Name: x-EUC-TW
    DisplayName: x-eucJP-Open Name: x-eucJP-Open
    DisplayName: x-IBM1006 Name: x-IBM1006
    DisplayName: x-IBM1025 Name: x-IBM1025
    DisplayName: x-IBM1046 Name: x-IBM1046
    DisplayName: x-IBM1097 Name: x-IBM1097
    DisplayName: x-IBM1098 Name: x-IBM1098
    DisplayName: x-IBM1112 Name: x-IBM1112
    DisplayName: x-IBM1122 Name: x-IBM1122
    DisplayName: x-IBM1123 Name: x-IBM1123
    DisplayName: x-IBM1124 Name: x-IBM1124
    DisplayName: x-IBM1364 Name: x-IBM1364
    DisplayName: x-IBM1381 Name: x-IBM1381
    DisplayName: x-IBM1383 Name: x-IBM1383
    DisplayName: x-IBM33722 Name: x-IBM33722
    DisplayName: x-IBM737 Name: x-IBM737
    DisplayName: x-IBM833 Name: x-IBM833
    DisplayName: x-IBM834 Name: x-IBM834
    DisplayName: x-IBM856 Name: x-IBM856
    DisplayName: x-IBM874 Name: x-IBM874
    DisplayName: x-IBM875 Name: x-IBM875
    DisplayName: x-IBM921 Name: x-IBM921
    DisplayName: x-IBM922 Name: x-IBM922
    DisplayName: x-IBM930 Name: x-IBM930
    DisplayName: x-IBM933 Name: x-IBM933
    DisplayName: x-IBM935 Name: x-IBM935
    DisplayName: x-IBM937 Name: x-IBM937
    DisplayName: x-IBM939 Name: x-IBM939
    DisplayName: x-IBM942 Name: x-IBM942
    DisplayName: x-IBM942C Name: x-IBM942C
    DisplayName: x-IBM943 Name: x-IBM943
    DisplayName: x-IBM943C Name: x-IBM943C
    DisplayName: x-IBM948 Name: x-IBM948
    DisplayName: x-IBM949 Name: x-IBM949
    DisplayName: x-IBM949C Name: x-IBM949C
    DisplayName: x-IBM950 Name: x-IBM950
    DisplayName: x-IBM964 Name: x-IBM964
    DisplayName: x-IBM970 Name: x-IBM970
    DisplayName: x-ISCII91 Name: x-ISCII91
    DisplayName: x-ISO-2022-CN-CNS Name: x-ISO-2022-CN-CNS
    DisplayName: x-ISO-2022-CN-GB Name: x-ISO-2022-CN-GB
    DisplayName: x-iso-8859-11 Name: x-iso-8859-11
    DisplayName: x-JIS0208 Name: x-JIS0208
    DisplayName: x-JISAutoDetect Name: x-JISAutoDetect
    DisplayName: x-Johab Name: x-Johab
    DisplayName: x-MacArabic Name: x-MacArabic
    DisplayName: x-MacCentralEurope Name: x-MacCentralEurope
    DisplayName: x-MacCroatian Name: x-MacCroatian
    DisplayName: x-MacCyrillic Name: x-MacCyrillic
    DisplayName: x-MacDingbat Name: x-MacDingbat
    DisplayName: x-MacGreek Name: x-MacGreek
    DisplayName: x-MacHebrew Name: x-MacHebrew
    DisplayName: x-MacIceland Name: x-MacIceland
    DisplayName: x-MacRoman Name: x-MacRoman
    DisplayName: x-MacRomania Name: x-MacRomania
    DisplayName: x-MacSymbol Name: x-MacSymbol
    DisplayName: x-MacThai Name: x-MacThai
    DisplayName: x-MacTurkish Name: x-MacTurkish
    DisplayName: x-MacUkraine Name: x-MacUkraine
    DisplayName: x-MS932_0213 Name: x-MS932_0213
    DisplayName: x-MS950-HKSCS Name: x-MS950-HKSCS
    DisplayName: x-MS950-HKSCS-XP Name: x-MS950-HKSCS-XP
    DisplayName: x-mswin-936 Name: x-mswin-936
    DisplayName: x-PCK Name: x-PCK
    DisplayName: x-SJIS_0213 Name: x-SJIS_0213
    DisplayName: x-UTF-16LE-BOM Name: x-UTF-16LE-BOM
    DisplayName: X-UTF-32BE-BOM Name: X-UTF-32BE-BOM
    DisplayName: X-UTF-32LE-BOM Name: X-UTF-32LE-BOM
    DisplayName: x-windows-50220 Name: x-windows-50220
    DisplayName: x-windows-50221 Name: x-windows-50221
    DisplayName: x-windows-874 Name: x-windows-874
    DisplayName: x-windows-949 Name: x-windows-949
    DisplayName: x-windows-950 Name: x-windows-950
    DisplayName: x-windows-iso2022jp Name: x-windows-iso2022jp
Java seems to support somewhat more encodings than C# (166 entries versus 140 in these two listings).
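The names in these listings are canonical names, and in Java most charsets can also be looked up by an alias. Here is a small sketch of that lookup. `CharsetLookupSketch` and `canonicalName` are illustrative names of my own, and "no-such-charset" is a deliberately made-up name used to show the unsupported case.

```java
import java.nio.charset.Charset;

class CharsetLookupSketch {
    // Returns the canonical name for a charset name or alias,
    // or null when this JVM does not provide that charset.
    static String canonicalName(String nameOrAlias) {
        if (!Charset.isSupported(nameOrAlias)) {
            return null;
        }
        return Charset.forName(nameOrAlias).name();
    }

    public static void main(String[] args) {
        // Lookup is case-insensitive; "utf8" is an alias of UTF-8.
        for (String name : new String[] {"utf8", "UTF-16", "GBK", "no-such-charset"}) {
            System.out.println(name + " -> " + canonicalName(name));
        }
        System.out.println("aliases of UTF-8: " + Charset.forName("UTF-8").aliases());
    }
}
```

Whether a name such as GBK resolves depends on the JVM installation, which is one reason the list above can differ from machine to machine; only the charsets in `StandardCharsets` are guaranteed everywhere.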
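Setting the default character set in Eclipse
This is simple. Different computers and programs may use different encodings as their default, so a program copied from one machine to another will not necessarily compile. Next, let's look at the default character set a program sees:
Java: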
    package code;

    import java.nio.charset.Charset;

    public class Code {

        public static void main(String[] args) {
            System.out.println("Default CharSet: " + Charset.defaultCharset());
        }

    }
Output:
    Default CharSet: UTF-8
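Since the default differs between machines, one common way to stay independent of it is to name the charset explicitly whenever text is turned into bytes or back. The following is only a sketch under that assumption; `ExplicitCharsetSketch`, `roundTrip`, and the temporary file prefix are hypothetical names chosen for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetSketch {
    // Writes text to a temporary file as UTF-8 and reads it back as UTF-8,
    // so the result does not depend on Charset.defaultCharset().
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("charset-sketch", ".txt");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u5929\u6DFB"; // 天添, escaped so the source encoding does not matter
        System.out.println("round trip ok: " + text.equals(roundTrip(text)));
    }
}
```

For source files the same idea applies at compile time: telling the compiler which encoding the files use (for example with `javac -encoding UTF-8`, or the project encoding setting in Eclipse) avoids relying on the machine's default.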
The default encoding for C# in my environment:
    using System;
    using System.Text;

    namespace Text
    {
        class Program
        {
            static void Main(string[] args)
            {
                // On the .NET Framework, Encoding.Default is the system's ANSI code page.
                Console.WriteLine(Encoding.Default.EncodingName);
                Console.ReadKey();
            }
        }
    }
Output: [screenshot]
Now for something interesting: let's see which of the encodings C# supports can handle Chinese, reusing the program from the beginning:
    using System;
    using System.IO;
    using System.Text;

    namespace Text
    {
        class Program
        {
            static void Main(string[] args)
            {
                string testStr = "天添";
                StringBuilder sb = new StringBuilder();
                foreach (EncodingInfo coif in Encoding.GetEncodings())
                {
                    // Round trip: string -> bytes -> string in the same encoding.
                    // Characters the encoding cannot represent come back as '?'.
                    Encoding enc = Encoding.GetEncoding(coif.Name);
                    byte[] desBytes = enc.GetBytes(testStr);
                    string desStr = enc.GetString(desBytes);

                    sb.Append("Display Name: " + coif.DisplayName + "----Name: " + coif.Name + "----And The result is: " + desStr + "\n");
                }
                byte[] coByte = Encoding.GetEncoding("Unicode").GetBytes(sb.ToString());

                // FileMode.Create truncates any earlier, longer output file.
                using (FileStream fs = File.Open("c:\\code.txt", FileMode.Create))
                {
                    fs.Write(coByte, 0, coByte.Length);
                }
                Console.ReadKey();
            }
        }
    }
Output:
    Display Name: IBM EBCDIC (美國-加拿大)----Name: IBM037----And The result is: ??
    Display Name: OEM 美國----Name: IBM437----And The result is: ??
    Display Name: IBM EBCDIC (國際)----Name: IBM500----And The result is: ??
    Display Name: 阿拉伯字符(ASMO-708)----Name: ASMO-708----And The result is: ??
    Display Name: 阿拉伯字符(DOS)----Name: DOS-720----And The result is: ??
    Display Name: 希臘字符(DOS)----Name: ibm737----And The result is: ??
    Display Name: 波羅的海字符(DOS)----Name: ibm775----And The result is: ??
    Display Name: 西歐字符(DOS)----Name: ibm850----And The result is: ??
    Display Name: 中歐字符(DOS)----Name: ibm852----And The result is: ??
    Display Name: OEM 西里爾語----Name: IBM855----And The result is: ??
    Display Name: 土耳其字符(DOS)----Name: ibm857----And The result is: ??
    Display Name: OEM 多語言拉丁語 I----Name: IBM00858----And The result is: ??
    Display Name: 葡萄牙語(DOS)----Name: IBM860----And The result is: ??
    Display Name: 冰島語(DOS)----Name: ibm861----And The result is: ??
    Display Name: 希伯來字符(DOS)----Name: DOS-862----And The result is: ??
    Display Name: 加拿大法語(DOS)----Name: IBM863----And The result is: ??
    Display Name: 阿拉伯字符(864)----Name: IBM864----And The result is: ??
    Display Name: 北歐字符(DOS)----Name: IBM865----And The result is: ??
    Display Name: 西里爾字符(DOS)----Name: cp866----And The result is: ??
    Display Name: 現代希臘字符(DOS)----Name: ibm869----And The result is: ??
    Display Name: IBM EBCDIC (多語言拉丁語 2)----Name: IBM870----And The result is: ??
    Display Name: 泰語(Windows)----Name: windows-874----And The result is: ??
    Display Name: IBM EBCDIC (現代希臘語)----Name: cp875----And The result is: ??
    Display Name: 日語(Shift-JIS)----Name: shift_jis----And The result is: 天添
    Display Name: 簡體中文(GB2312)----Name: gb2312----And The result is: 天添
    Display Name: 朝鮮語----Name: ks_c_5601-1987----And The result is: 天添
    Display Name: 繁體中文(Big5)----Name: big5----And The result is: 天添
    Display Name: IBM EBCDIC (土耳其拉丁語 5)----Name: IBM1026----And The result is: ??
    Display Name: IBM 拉丁語 1----Name: IBM01047----And The result is: ??
    Display Name: IBM EBCDIC (美國-加拿大-歐洲)----Name: IBM01140----And The result is: ??
    Display Name: IBM EBCDIC (德國-歐洲)----Name: IBM01141----And The result is: ??
    Display Name: IBM EBCDIC (丹麥-挪威-歐洲)----Name: IBM01142----And The result is: ??
    Display Name: IBM EBCDIC (芬蘭-瑞典-歐洲)----Name: IBM01143----And The result is: ??
    Display Name: IBM EBCDIC (意大利-歐洲)----Name: IBM01144----And The result is: ??
    Display Name: IBM EBCDIC (西班牙-歐洲)----Name: IBM01145----And The result is: ??
    Display Name: IBM EBCDIC (英國-歐洲)----Name: IBM01146----And The result is: ??
    Display Name: IBM EBCDIC (法國-歐洲)----Name: IBM01147----And The result is: ??
    Display Name: IBM EBCDIC (國際-歐洲)----Name: IBM01148----And The result is: ??
    Display Name: IBM EBCDIC (冰島語-歐洲)----Name: IBM01149----And The result is: ??
    Display Name: Unicode----Name: utf-16----And The result is: 天添
    Display Name: Unicode (Big-Endian)----Name: utf-16BE----And The result is: 天添
    Display Name: 中歐字符(Windows)----Name: windows-1250----And The result is: ??
    Display Name: 西里爾字符(Windows)----Name: windows-1251----And The result is: ??
    Display Name: 西歐字符(Windows)----Name: Windows-1252----And The result is: ??
    Display Name: 希臘字符(Windows)----Name: windows-1253----And The result is: ??
    Display Name: 土耳其字符(Windows)----Name: windows-1254----And The result is: ??
    Display Name: 希伯來字符(Windows)----Name: windows-1255----And The result is: ??
    Display Name: 阿拉伯字符(Windows)----Name: windows-1256----And The result is: ??
    Display Name: 波羅的海字符(Windows)----Name: windows-1257----And The result is: ??
    Display Name: 越南字符(Windows)----Name: windows-1258----And The result is: ??
    Display Name: 朝鮮語(Johab)----Name: Johab----And The result is: 天添
    Display Name: 西歐字符(Mac)----Name: macintosh----And The result is: ??
    Display Name: 日語(Mac)----Name: x-mac-japanese----And The result is: 天添
    Display Name: 繁體中文(Mac)----Name: x-mac-chinesetrad----And The result is: 天添
    Display Name: 朝鮮語(Mac)----Name: x-mac-korean----And The result is: 天添
    Display Name: 阿拉伯字符(Mac)----Name: x-mac-arabic----And The result is: ??
    Display Name: 希伯來字符(Mac)----Name: x-mac-hebrew----And The result is: ??
    Display Name: 希臘字符(Mac)----Name: x-mac-greek----And The result is: ??
    Display Name: 西里爾字符(Mac)----Name: x-mac-cyrillic----And The result is: ??
    Display Name: 簡體中文(Mac)----Name: x-mac-chinesesimp----And The result is: 天添
    Display Name: 羅馬尼亞語(Mac)----Name: x-mac-romanian----And The result is: ??
    Display Name: 烏克蘭語(Mac)----Name: x-mac-ukrainian----And The result is: ??
    Display Name: 泰語(Mac)----Name: x-mac-thai----And The result is: ??
    Display Name: 中歐字符(Mac)----Name: x-mac-ce----And The result is: ??
    Display Name: 冰島語(Mac)----Name: x-mac-icelandic----And The result is: ??
    Display Name: 土耳其字符(Mac)----Name: x-mac-turkish----And The result is: ??
    Display Name: 克羅地亞語(Mac)----Name: x-mac-croatian----And The result is: ??
    Display Name: Unicode (UTF-32)----Name: utf-32----And The result is: 天添
    Display Name: Unicode (UTF-32 Big-Endian)----Name: utf-32BE----And The result is: 天添
    Display Name: 繁體中文(CNS)----Name: x-Chinese-CNS----And The result is: 天添
    Display Name: TCA 臺灣----Name: x-cp20001----And The result is: 天添
    Display Name: 繁體中文(Eten)----Name: x-Chinese-Eten----And The result is: 天添
    Display Name: IBM5550 臺灣----Name: x-cp20003----And The result is: 天添
    Display Name: TeleText 臺灣----Name: x-cp20004----And The result is: 天添
    Display Name: Wang 臺灣----Name: x-cp20005----And The result is: 天添
    Display Name: 西歐字符(IA5)----Name: x-IA5----And The result is: ??
    Display Name: 德語(IA5)----Name: x-IA5-German----And The result is: ??
    Display Name: 瑞典語(IA5)----Name: x-IA5-Swedish----And The result is: ??
    Display Name: 挪威語(IA5)----Name: x-IA5-Norwegian----And The result is: ??
    Display Name: US-ASCII----Name: us-ascii----And The result is: ??
    Display Name: T.61----Name: x-cp20261----And The result is: ??
    Display Name: ISO-6937----Name: x-cp20269----And The result is: ??
    Display Name: IBM EBCDIC (德國)----Name: IBM273----And The result is: ??
    Display Name: IBM EBCDIC (丹麥-挪威)----Name: IBM277----And The result is: ??
    Display Name: IBM EBCDIC (芬蘭-瑞典)----Name: IBM278----And The result is: ??
    Display Name: IBM EBCDIC (意大利)----Name: IBM280----And The result is: ??
    Display Name: IBM EBCDIC (西班牙)----Name: IBM284----And The result is: ??
    Display Name: IBM EBCDIC (UK)----Name: IBM285----And The result is: ??
    Display Name: IBM EBCDIC (日語片假名)----Name: IBM290----And The result is: ??
    Display Name: IBM EBCDIC (法國)----Name: IBM297----And The result is: ??
    Display Name: IBM EBCDIC (阿拉伯語)----Name: IBM420----And The result is: ??
    Display Name: IBM EBCDIC (希臘語)----Name: IBM423----And The result is: ??
    Display Name: IBM EBCDIC (希伯來語)----Name: IBM424----And The result is: ??
    Display Name: IBM EBCDIC (朝鮮語擴展)----Name: x-EBCDIC-KoreanExtended----And The result is: ??
    Display Name: IBM EBCDIC (泰語)----Name: IBM-Thai----And The result is: ??
    Display Name: 西里爾字符(KOI8-R)----Name: koi8-r----And The result is: ??
    Display Name: IBM EBCDIC (冰島語)----Name: IBM871----And The result is: ??
    Display Name: IBM EBCDIC (西里爾俄語)----Name: IBM880----And The result is: ??
    Display Name: IBM EBCDIC (土耳其語)----Name: IBM905----And The result is: ??
    Display Name: IBM 拉丁語 1----Name: IBM00924----And The result is: ??
    Display Name: 日語(JIS 0208-1990 和 0212-1990)----Name: EUC-JP----And The result is: 天添
    Display Name: 簡體中文(GB2312-80)----Name: x-cp20936----And The result is: 天添
    Display Name: 朝鮮語 Wansung----Name: x-cp20949----And The result is: 天添
    Display Name: IBM EBCDIC (西里爾塞爾維亞-保加利亞語)----Name: cp1025----And The result is: ??
    Display Name: 西里爾字符(KOI8-U)----Name: koi8-u----And The result is: ??
    Display Name: 西歐字符(ISO)----Name: iso-8859-1----And The result is: ??
    Display Name: 中歐字符(ISO)----Name: iso-8859-2----And The result is: ??
    Display Name: 拉丁語 3 (ISO)----Name: iso-8859-3----And The result is: ??
    Display Name: 波羅的海字符(ISO)----Name: iso-8859-4----And The result is: ??
    Display Name: 西里爾字符(ISO)----Name: iso-8859-5----And The result is: ??
    Display Name: 阿拉伯字符(ISO)----Name: iso-8859-6----And The result is: ??
    Display Name: 希臘字符(ISO)----Name: iso-8859-7----And The result is: ??
    Display Name: 希伯來字符(ISO-Visual)----Name: iso-8859-8----And The result is: ??
    Display Name: 土耳其字符(ISO)----Name: iso-8859-9----And The result is: ??
    Display Name: 愛沙尼亞語(ISO)----Name: iso-8859-13----And The result is: ??
    Display Name: 拉丁語 9 (ISO)----Name: iso-8859-15----And The result is: ??
    Display Name: 歐羅巴----Name: x-Europa----And The result is: ??
    Display Name: 希伯來字符(ISO-Logical)----Name: iso-8859-8-i----And The result is: ??
    Display Name: 日語(JIS)----Name: iso-2022-jp----And The result is: 天添
    Display Name: 日語(JIS-允許 1 字節假名)----Name: csISO2022JP----And The result is: 天添
    Display Name: 日語(JIS-允許 1 字節假名 - SO/SI)----Name: iso-2022-jp----And The result is: 天添
    Display Name: 朝鮮語(ISO)----Name: iso-2022-kr----And The result is: 天添
    Display Name: 簡體中文(ISO-2022)----Name: x-cp50227----And The result is: 天添
    Display Name: 日語(EUC)----Name: euc-jp----And The result is: 天添
    Display Name: 簡體中文(EUC)----Name: EUC-CN----And The result is: 天添
    Display Name: 朝鮮語(EUC)----Name: euc-kr----And The result is: 天添
    Display Name: 簡體中文(HZ)----Name: hz-gb-2312----And The result is: 天添
    Display Name: 簡體中文(GB18030)----Name: GB18030----And The result is: 天添
    Display Name: ISCII 梵文----Name: x-iscii-de----And The result is: ??
    Display Name: ISCII 孟加拉語----Name: x-iscii-be----And The result is: ??
    Display Name: ISCII 泰米爾語----Name: x-iscii-ta----And The result is: ??
    Display Name: ISCII 泰盧固語----Name: x-iscii-te----And The result is: ??
    Display Name: ISCII 阿薩姆語----Name: x-iscii-as----And The result is: ??
    Display Name: ISCII 奧裏雅語----Name: x-iscii-or----And The result is: ??
    Display Name: ISCII 卡納達語----Name: x-iscii-ka----And The result is: ??
    Display Name: ISCII 馬拉雅拉姆語----Name: x-iscii-ma----And The result is: ??
    Display Name: ISCII 古吉拉特語----Name: x-iscii-gu----And The result is: ??
    Display Name: ISCII 旁遮普語----Name: x-iscii-pa----And The result is: ??
    Display Name: Unicode (UTF-7)----Name: utf-7----And The result is: 天添
    Display Name: Unicode (UTF-8)----Name: utf-8----And The result is: 天添
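Looking through the list, 34 of the encodings bring the Chinese text back intact, and they include Japanese, Korean, and Taiwanese ones. Interesting. Strictly speaking, this only shows that these two particular characters survive the round trip in each of them, not that every Chinese character would.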
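The round-trip test above can also be asked as a direct question: can this charset represent the text at all? Here is a rough Java sketch of that check, limited to charsets from `StandardCharsets` so that it runs on any JVM. `CanEncodeSketch`, `canEncode`, and `lossyRoundTrip` are names I made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CanEncodeSketch {
    // True when every character of text can be represented in the charset.
    static boolean canEncode(String text, Charset cs) {
        // Some charsets are decode-only; newEncoder() would throw for those.
        return cs.canEncode() && cs.newEncoder().canEncode(text);
    }

    // Encodes and decodes with the same charset; getBytes() silently replaces
    // anything the charset cannot represent with its replacement byte.
    static String lossyRoundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String text = "\u5929\u6DFB"; // 天添
        System.out.println("US-ASCII:   " + canEncode(text, StandardCharsets.US_ASCII));
        System.out.println("ISO-8859-1: " + canEncode(text, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8:      " + canEncode(text, StandardCharsets.UTF_8));
        System.out.println("ASCII round trip: " + lossyRoundTrip(text, StandardCharsets.US_ASCII));
    }
}
```

For US-ASCII the replacement byte is `?`, so the lossy round trip yields `??`, the same pattern that fills most of the listing above.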
So several encodings can handle Chinese, but do they really produce the same thing? Let's pick a few and look:
    using System;
    using System.Text;

    namespace Text
    {
        class Program
        {
            // Prints a label followed by each byte in decimal, e.g. [229][164][169].
            static void PrintBytes(string label, byte[] bytes)
            {
                Console.Write(label);
                foreach (byte b in bytes)
                {
                    Console.Write("[{0}]", b);
                }
                Console.WriteLine();
            }

            static void Main(string[] args)
            {
                string testStr = "天添";

                Encoding ascii = new ASCIIEncoding();
                Encoding utf8 = new UTF8Encoding();
                // hz-gb-2312 is the 7-bit HZ encoding of the GB2312 character set.
                Encoding hzGb2312 = Encoding.GetEncoding("hz-gb-2312");
                Encoding jis = Encoding.GetEncoding("iso-2022-jp");

                Console.WriteLine("Original string: " + testStr);

                // Same string, four encodings, four different byte sequences.
                byte[] asciiBytes = ascii.GetBytes(testStr);
                byte[] utf8Bytes = utf8.GetBytes(testStr);
                byte[] gb2312Bytes = hzGb2312.GetBytes(testStr);
                byte[] jpBytes = jis.GetBytes(testStr);
                PrintBytes("ASCII bytes: ", asciiBytes);
                PrintBytes("UTF-8 bytes: ", utf8Bytes);
                PrintBytes("hz-gb-2312 bytes: ", gb2312Bytes);
                PrintBytes("iso-2022-jp bytes: ", jpBytes);

                // Decode each byte array with the encoding that produced it.
                Console.WriteLine("ASCII result: " + ascii.GetString(asciiBytes));
                Console.WriteLine("UTF-8 result: " + utf8.GetString(utf8Bytes));
                Console.WriteLine("hz-gb-2312 result: " + hzGb2312.GetString(gb2312Bytes));
                Console.WriteLine("iso-2022-jp result: " + jis.GetString(jpBytes));
                Console.ReadKey();
            }
        }
    }
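Result: [screenshot]
One thing stands out:
Even for UTF-8 and GB2312 (encoded here as hz-gb-2312), which both decode back to the right text, the byte arrays produced in between are not the same. That is easy to understand: they are different character encodings.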
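Because the bytes differ, decoding them with a different encoding than the one that produced them gives garbage. Here is a Java sketch of that effect and of why it is sometimes reversible. `MojibakeSketch`, `decodeWrongly`, and `repair` are illustrative names, and ISO-8859-1 is picked as the "wrong" charset only because it maps every byte to exactly one character.

```java
import java.nio.charset.StandardCharsets;

class MojibakeSketch {
    // Encodes text as UTF-8, then decodes those bytes with the wrong charset.
    static String decodeWrongly(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undoes decodeWrongly: ISO-8859-1 maps each byte to one char and back,
    // so the original UTF-8 bytes can be recovered and decoded properly.
    static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u5929\u6DFB"; // 天添
        String garbled = decodeWrongly(text);
        // Two characters became six: one per UTF-8 byte.
        System.out.println("garbled length: " + garbled.length());
        System.out.println("repaired equals original: " + repair(garbled).equals(text));
    }
}
```

The repair only works here because no information was lost on the way; with a charset that replaces unknown bytes (as ASCII does with `?`), the original text cannot be recovered.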
Next, let's look at the classes the .NET Framework derives from Encoding to handle specific encodings:
- ASCIIEncoding
- UTF7Encoding
- UTF8Encoding
- UnicodeEncoding (UTF-16)
- UTF32Encoding
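ASCIIEncoding and UTF8Encoding have already been used a little above, so now let's try the other three. While trying them I noticed a couple of differences, which correspond to two known issues with Unicode encodings:
The NUL issue: C marks the end of a string with a NUL (zero) byte, while a C# string stores its length and may contain NULs. UTF-16 and UTF-32 encode ASCII characters with zero bytes in them, so C code that treats such data as a NUL-terminated byte string stops too early. (I am not especially familiar with this part, I admit.)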
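The zero bytes behind the NUL issue are easy to see by counting them. This is a small Java sketch of my own (the names `NulByteSketch` and `countZeroBytes` are illustrative), using plain ASCII text as input:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NulByteSketch {
    // Counts the 0x00 bytes produced when text is encoded with cs.
    static int countZeroBytes(String text, Charset cs) {
        int zeros = 0;
        for (byte b : text.getBytes(cs)) {
            if (b == 0) {
                zeros++;
            }
        }
        return zeros;
    }

    public static void main(String[] args) {
        String text = "ABC";
        // UTF-8 keeps ASCII characters as single non-zero bytes.
        System.out.println("UTF-8    zero bytes: " + countZeroBytes(text, StandardCharsets.UTF_8));
        // UTF-16LE encodes 'A' as 41 00, so every ASCII character adds a zero byte.
        System.out.println("UTF-16LE zero bytes: " + countZeroBytes(text, StandardCharsets.UTF_16LE));
    }
}
```

For "ABC" the UTF-8 bytes contain no zero at all, while the UTF-16LE bytes are 41 00 42 00 43 00, so a C function such as `strlen` would see a string of length 1.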
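The byte-order issue: there are two ways to order the bytes of a 16-bit integer. One is little endian, where the low-order 8 bits come first; Intel's x86 CPUs are designed this way. The other is big endian, where the high-order 8 bits come first; Sun's SPARC CPUs are the typical example. This causes a problem: text written in one byte order is misread by a program that assumes the other, so writer and reader have to agree on the order (or mark it, as the Unicode byte order mark U+FEFF does), and a machine whose native order differs has to swap the bytes, which costs extra work.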
    using System;
    using System.Text;

    namespace Text
    {
        class Program
        {
            // Prints a label followed by each byte in decimal.
            static void PrintBytes(string label, byte[] bytes)
            {
                Console.Write(label);
                foreach (byte b in bytes)
                {
                    Console.Write("[{0}]", b);
                }
                Console.WriteLine();
            }

            static void Main(string[] args)
            {
                string testStr = "天添";

                // First argument: true = big endian, false = little endian.
                // Second argument: whether GetPreamble() returns a byte order mark.
                UnicodeEncoding unicodingBigEnd = new UnicodeEncoding(true, true);
                UnicodeEncoding unicodingLittleEnd = new UnicodeEncoding(false, true);
                Console.WriteLine("Original string: " + testStr);

                byte[] unicodingBigEndBytes = unicodingBigEnd.GetBytes(testStr);
                PrintBytes("BigEnd bytes: ", unicodingBigEndBytes);
                byte[] unicodingLittleBytes = unicodingLittleEnd.GetBytes(testStr);
                PrintBytes("Little bytes: ", unicodingLittleBytes);

                // "utf-16BE" decodes big endian, "utf-16" decodes little endian.
                string unicodeBigEnd = Encoding.GetEncoding("utf-16BE").GetString(unicodingBigEndBytes);
                string unicodeLittleEnd = Encoding.GetEncoding("utf-16").GetString(unicodingLittleBytes);

                Console.WriteLine("BigEnd result: " + unicodeBigEnd);
                Console.WriteLine("Little result: " + unicodeLittleEnd);
                Console.ReadKey();
            }
        }
    }
Here is the result: [screenshot]
Sure enough, the bytes come out in a different order. UTF32Encoding has the same issue.
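The same byte-order difference shows up in Java, together with the byte order mark (BOM) that lets a decoder tell the two orders apart. This is a sketch using only `StandardCharsets`; `BomSketch`, `hex`, and `decodeUtf16` are hypothetical helper names.

```java
import java.nio.charset.StandardCharsets;

class BomSketch {
    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // Decodes with the generic "UTF-16" charset, which reads a leading BOM
    // to choose the byte order (and assumes big endian when there is none).
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String tian = "\u5929"; // 天
        System.out.println("UTF-16BE: " + hex(tian.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("UTF-16LE: " + hex(tian.getBytes(StandardCharsets.UTF_16LE)));
        // Java's generic "UTF-16" encoder writes a big-endian BOM (FE FF) first.
        System.out.println("UTF-16  : " + hex(tian.getBytes(StandardCharsets.UTF_16)));

        // Little-endian data preceded by the little-endian BOM (FF FE).
        byte[] littleWithBom = {(byte) 0xFF, (byte) 0xFE, 0x29, 0x59};
        System.out.println("decoded via BOM: " + decodeUtf16(littleWithBom).equals(tian));
    }
}
```

The character U+5929 is 59 29 in big endian and 29 59 in little endian. Without a BOM or a prior agreement, a reader has no reliable way to know which of the two it is looking at.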
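It feels as if I have said a lot and yet said nothing, and rather messily at that. Still, I think I have gained a slightly better understanding of encodings. I am not sure whether I have understood everything correctly, so comments and discussion are welcome. Here is a picture: [image]
How programming languages handle text data, using the UCS approach or the CSI approach, is a topic for another time.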