Character Encoding

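      I have been sitting on this post for a long time without daring to publish it. Character encoding is basic computing knowledge, yet I admit I have always been hazy about it. I was never sure what to write or whether my understanding was correct, and I worried about being looked down on. I try to back up each point with a test, but my skill is limited, so I may still have parts of this basic material wrong.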

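     Before getting into the details of character encoding, a few concepts need to be introduced:

  • Character
  • Character set
  • Character code
  • Character encoding scheme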

     Character:

           A symbol used to represent a language system visually. This one is easy: it is what we see every day, such as A, B, C, D, 啊, 喔, 額 and so on.

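     Character set:

           The total number of letters, symbols and digits we meet in daily life is enormous, and handling all of them at once is not practical. So we decide in advance which characters will be used, and that collection is called a character set. Well-known examples are ASCII from the United States, ISO 8859 from Europe, GB 2312 from China, and the later Unicode character set, which aims to cover many languages.

           [Figure: the ASCII table]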
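The printable part of the ASCII table is easy to regenerate. Here is a minimal Java sketch; the class name `AsciiTable` and the layout of 16 characters per row are illustrative choices, not taken from the original figure.

```java
// Prints the printable ASCII range (codes 32 to 126), 16 characters per row.
class AsciiTable {
    // Builds one row: the starting code in hex, then up to 16 characters.
    static String row(int start) {
        StringBuilder sb = new StringBuilder(String.format("0x%02X:", start));
        for (int code = start; code < start + 16 && code <= 126; code++) {
            sb.append(' ').append((char) code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int start = 32; start <= 126; start += 16) {
            System.out.println(row(start));
        }
    }
}
```

Each row starts at a multiple of 16, so the hex label plus the column position gives the character code directly: `A` is the second character of row `0x40`, so its code is `0x41` (65).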

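     Character code:

           Each character in a character set is assigned a number; that number is called the character code.

     Character encoding scheme:

           The way a character code is represented on a computer as integer values (bytes) is called the character encoding scheme.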
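To make the difference between a character code and an encoding scheme concrete, here is a small Java sketch. It uses Unicode as the character set; the class and method names are made up for illustration. The source uses the escape `\u5929` for 天 so that it compiles whatever the source file encoding is.

```java
import java.nio.charset.StandardCharsets;

class CharacterCodeDemo {
    // Formats the character code (Unicode code point) of the first character of s.
    static String codeOf(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        String tian = "\u5929"; // the character "tian" used later in this post

        // Character -> character code.
        System.out.println(codeOf("A"));  // U+0041
        System.out.println(codeOf(tian)); // U+5929

        // Character code -> character.
        System.out.println(new String(Character.toChars(0x41))); // A

        // The encoding scheme decides how many bytes that one code occupies.
        System.out.println(tian.getBytes(StandardCharsets.UTF_8).length);    // 3
        System.out.println(tian.getBytes(StandardCharsets.UTF_16BE).length); // 2
    }
}
```

The code U+5929 is the same in both cases; only its byte representation changes with the encoding scheme.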

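      Things are a little clearer now. A computer can handle images, animation, programs and all kinds of data, but the CPU only works with binary numbers, so every kind of object has to be converted to binary. The people who first built computers spoke English, so early character sets such as ASCII contained only letters, digits and basic symbols. As computers spread to other countries, including China, ASCII was no longer enough, which led to national character sets such as GB 2312 and, later, to Unicode.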

      Now let's write some code to look at the details.

      First, list the character encodings that C# (.NET) supports:

    using System;
    using System.IO;
    using System.Text;

    namespace Text
    {
        class Program
        {
            static void Main(string[] args)
            {
                StringBuilder sb = new StringBuilder();
                // Collect the display name and the registered name of every encoding.
                foreach (EncodingInfo coif in Encoding.GetEncodings())
                {
                    sb.Append("Display Name: " + coif.DisplayName + "----Name: " + coif.Name + "\n");
                }
                // "Unicode" in .NET naming means UTF-16 little endian.
                byte[] coByte = Encoding.Unicode.GetBytes(sb.ToString());

                // FileMode.Create truncates an existing file, so no stale bytes are left behind;
                // "using" closes the stream even if the write fails.
                using (FileStream fs = File.Open("c:\\code.txt", FileMode.Create))
                {
                    fs.Write(coByte, 0, coByte.Length);
                }
                Console.ReadKey();
            }
        }
    }

 

         I originally printed this to the console, but there was so much output that I had to write it to a file instead. Here is the content:

  1 Display Name: IBM EBCDIC (美國-加拿大)----Name: IBM037    2 Display Name: OEM 美國----Name: IBM437    3 Display Name: IBM EBCDIC (國際)----Name: IBM500    4 Display Name: 阿拉伯字符(ASMO-708)----Name: ASMO-708    5 Display Name: 阿拉伯字符(DOS)----Name: DOS-720    6 Display Name: 希臘字符(DOS)----Name: ibm737    7 Display Name: 波羅的海字符(DOS)----Name: ibm775    8 Display Name: 西歐字符(DOS)----Name: ibm850    9 Display Name: 中歐字符(DOS)----Name: ibm852   10 Display Name: OEM 西里爾語----Name: IBM855   11 Display Name: 土耳其字符(DOS)----Name: ibm857   12 Display Name: OEM 多語言拉丁語 I----Name: IBM00858   13 Display Name: 葡萄牙語(DOS)----Name: IBM860   14 Display Name: 冰島語(DOS)----Name: ibm861   15 Display Name: 希伯來字符(DOS)----Name: DOS-862   16 Display Name: 加拿大法語(DOS)----Name: IBM863   17 Display Name: 阿拉伯字符(864)----Name: IBM864   18 Display Name: 北歐字符(DOS)----Name: IBM865   19 Display Name: 西里爾字符(DOS)----Name: cp866   20 Display Name: 現代希臘字符(DOS)----Name: ibm869   21 Display Name: IBM EBCDIC (多語言拉丁語 2)----Name: IBM870   22 Display Name: 泰語(Windows)----Name: windows-874   23 Display Name: IBM EBCDIC (現代希臘語)----Name: cp875   24 Display Name: 日語(Shift-JIS)----Name: shift_jis   25 Display Name: 簡體中文(GB2312)----Name: gb2312   26 Display Name: 朝鮮語----Name: ks_c_5601-1987   27 Display Name: 繁體中文(Big5)----Name: big5   28 Display Name: IBM EBCDIC (土耳其拉丁語 5)----Name: IBM1026   29 Display Name: IBM 拉丁語 1----Name: IBM01047   30 Display Name: IBM EBCDIC (美國-加拿大-歐洲)----Name: IBM01140   31 Display Name: IBM EBCDIC (德國-歐洲)----Name: IBM01141   32 Display Name: IBM EBCDIC (丹麥-挪威-歐洲)----Name: IBM01142   33 Display Name: IBM EBCDIC (芬蘭-瑞典-歐洲)----Name: IBM01143   34 Display Name: IBM EBCDIC (意大利-歐洲)----Name: IBM01144   35 Display Name: IBM EBCDIC (西班牙-歐洲)----Name: IBM01145   36 Display Name: IBM EBCDIC (英國-歐洲)----Name: IBM01146   37 Display Name: IBM EBCDIC (法國-歐洲)----Name: IBM01147   38 Display Name: IBM EBCDIC (國際-歐洲)----Name: IBM01148   39 Display Name: IBM EBCDIC (冰島語-歐洲)----Name: IBM01149   40 Display Name: 
Unicode----Name: utf-16   41 Display Name: Unicode (Big-Endian)----Name: utf-16BE   42 Display Name: 中歐字符(Windows)----Name: windows-1250   43 Display Name: 西里爾字符(Windows)----Name: windows-1251   44 Display Name: 西歐字符(Windows)----Name: Windows-1252   45 Display Name: 希臘字符(Windows)----Name: windows-1253   46 Display Name: 土耳其字符(Windows)----Name: windows-1254   47 Display Name: 希伯來字符(Windows)----Name: windows-1255   48 Display Name: 阿拉伯字符(Windows)----Name: windows-1256   49 Display Name: 波羅的海字符(Windows)----Name: windows-1257   50 Display Name: 越南字符(Windows)----Name: windows-1258   51 Display Name: 朝鮮語(Johab)----Name: Johab   52 Display Name: 西歐字符(Mac)----Name: macintosh   53 Display Name: 日語(Mac)----Name: x-mac-japanese   54 Display Name: 繁體中文(Mac)----Name: x-mac-chinesetrad   55 Display Name: 朝鮮語(Mac)----Name: x-mac-korean   56 Display Name: 阿拉伯字符(Mac)----Name: x-mac-arabic   57 Display Name: 希伯來字符(Mac)----Name: x-mac-hebrew   58 Display Name: 希臘字符(Mac)----Name: x-mac-greek   59 Display Name: 西里爾字符(Mac)----Name: x-mac-cyrillic   60 Display Name: 簡體中文(Mac)----Name: x-mac-chinesesimp   61 Display Name: 羅馬尼亞語(Mac)----Name: x-mac-romanian   62 Display Name: 烏克蘭語(Mac)----Name: x-mac-ukrainian   63 Display Name: 泰語(Mac)----Name: x-mac-thai   64 Display Name: 中歐字符(Mac)----Name: x-mac-ce   65 Display Name: 冰島語(Mac)----Name: x-mac-icelandic   66 Display Name: 土耳其字符(Mac)----Name: x-mac-turkish   67 Display Name: 克羅地亞語(Mac)----Name: x-mac-croatian   68 Display Name: Unicode (UTF-32)----Name: utf-32   69 Display Name: Unicode (UTF-32 Big-Endian)----Name: utf-32BE   70 Display Name: 繁體中文(CNS)----Name: x-Chinese-CNS   71 Display Name: TCA 臺灣----Name: x-cp20001   72 Display Name: 繁體中文(Eten)----Name: x-Chinese-Eten   73 Display Name: IBM5550 臺灣----Name: x-cp20003   74 Display Name: TeleText 臺灣----Name: x-cp20004   75 Display Name: Wang 臺灣----Name: x-cp20005   76 Display Name: 西歐字符(IA5)----Name: x-IA5   77 Display Name: 德語(IA5)----Name: x-IA5-German   78 Display Name: 
瑞典語(IA5)----Name: x-IA5-Swedish   79 Display Name: 挪威語(IA5)----Name: x-IA5-Norwegian   80 Display Name: US-ASCII----Name: us-ascii   81 Display Name: T.61----Name: x-cp20261   82 Display Name: ISO-6937----Name: x-cp20269   83 Display Name: IBM EBCDIC (德國)----Name: IBM273   84 Display Name: IBM EBCDIC (丹麥-挪威)----Name: IBM277   85 Display Name: IBM EBCDIC (芬蘭-瑞典)----Name: IBM278   86 Display Name: IBM EBCDIC (意大利)----Name: IBM280   87 Display Name: IBM EBCDIC (西班牙)----Name: IBM284   88 Display Name: IBM EBCDIC (UK)----Name: IBM285   89 Display Name: IBM EBCDIC (日語片假名)----Name: IBM290   90 Display Name: IBM EBCDIC (法國)----Name: IBM297   91 Display Name: IBM EBCDIC (阿拉伯語)----Name: IBM420   92 Display Name: IBM EBCDIC (希臘語)----Name: IBM423   93 Display Name: IBM EBCDIC (希伯來語)----Name: IBM424   94 Display Name: IBM EBCDIC (朝鮮語擴展)----Name: x-EBCDIC-KoreanExtended   95 Display Name: IBM EBCDIC (泰語)----Name: IBM-Thai   96 Display Name: 西里爾字符(KOI8-R)----Name: koi8-r   97 Display Name: IBM EBCDIC (冰島語)----Name: IBM871   98 Display Name: IBM EBCDIC (西里爾俄語)----Name: IBM880   99 Display Name: IBM EBCDIC (土耳其語)----Name: IBM905  100 Display Name: IBM 拉丁語 1----Name: IBM00924  101 Display Name: 日語(JIS 0208-19900212-1990)----Name: EUC-JP  102 Display Name: 簡體中文(GB2312-80)----Name: x-cp20936  103 Display Name: 朝鮮語 Wansung----Name: x-cp20949  104 Display Name: IBM EBCDIC (西里爾塞爾維亞-保加利亞語)----Name: cp1025  105 Display Name: 西里爾字符(KOI8-U)----Name: koi8-u  106 Display Name: 西歐字符(ISO)----Name: iso-8859-1  107 Display Name: 中歐字符(ISO)----Name: iso-8859-2  108 Display Name: 拉丁語 3 (ISO)----Name: iso-8859-3  109 Display Name: 波羅的海字符(ISO)----Name: iso-8859-4  110 Display Name: 西里爾字符(ISO)----Name: iso-8859-5  111 Display Name: 阿拉伯字符(ISO)----Name: iso-8859-6  112 Display Name: 希臘字符(ISO)----Name: iso-8859-7  113 Display Name: 希伯來字符(ISO-Visual)----Name: iso-8859-8  114 Display Name: 土耳其字符(ISO)----Name: iso-8859-9  115 Display Name: 愛沙尼亞語(ISO)----Name: iso-8859-13  116 Display Name: 拉丁語 9 
(ISO)----Name: iso-8859-15  117 Display Name: 歐羅巴----Name: x-Europa  118 Display Name: 希伯來字符(ISO-Logical)----Name: iso-8859-8-i  119 Display Name: 日語(JIS)----Name: iso-2022-jp  120 Display Name: 日語(JIS-允許 1 字節假名)----Name: csISO2022JP  121 Display Name: 日語(JIS-允許 1 字節假名 - SO/SI)----Name: iso-2022-jp  122 Display Name: 朝鮮語(ISO)----Name: iso-2022-kr  123 Display Name: 簡體中文(ISO-2022)----Name: x-cp50227  124 Display Name: 日語(EUC)----Name: euc-jp  125 Display Name: 簡體中文(EUC)----Name: EUC-CN  126 Display Name: 朝鮮語(EUC)----Name: euc-kr  127 Display Name: 簡體中文(HZ)----Name: hz-gb-2312  128 Display Name: 簡體中文(GB18030)----Name: GB18030  129 Display Name: ISCII 梵文----Name: x-iscii-de  130 Display Name: ISCII 孟加拉語----Name: x-iscii-be  131 Display Name: ISCII 泰米爾語----Name: x-iscii-ta  132 Display Name: ISCII 泰盧固語----Name: x-iscii-te  133 Display Name: ISCII 阿薩姆語----Name: x-iscii-as  134 Display Name: ISCII 奧裏雅語----Name: x-iscii-or  135 Display Name: ISCII 卡納達語----Name: x-iscii-ka  136 Display Name: ISCII 馬拉雅拉姆語----Name: x-iscii-ma  137 Display Name: ISCII 古吉拉特語----Name: x-iscii-gu  138 Display Name: ISCII 旁遮普語----Name: x-iscii-pa  139 Display Name: Unicode (UTF-7)----Name: utf-7  140 Display Name: Unicode (UTF-8)----Name: utf-8

 

Now the Java version:

    package code;

    import java.nio.charset.Charset;
    import java.util.Map;
    import java.util.SortedMap;

    public class Code {

        public static void main(String[] args) {
            // Every charset this JVM supports, keyed by canonical name.
            SortedMap<String, Charset> availableSet = Charset.availableCharsets();
            for (Map.Entry<String, Charset> entry : availableSet.entrySet()) {
                Charset cs = entry.getValue();
                System.out.println("DisplayName: " + cs.displayName() + " Name: " + cs.name());
            }
        }

    }

Output:

  1 DisplayName: Big5 Name: Big5    2 DisplayName: Big5-HKSCS Name: Big5-HKSCS    3 DisplayName: EUC-JP Name: EUC-JP    4 DisplayName: EUC-KR Name: EUC-KR    5 DisplayName: GB18030 Name: GB18030    6 DisplayName: GB2312 Name: GB2312    7 DisplayName: GBK Name: GBK    8 DisplayName: IBM-Thai Name: IBM-Thai    9 DisplayName: IBM00858 Name: IBM00858   10 DisplayName: IBM01140 Name: IBM01140   11 DisplayName: IBM01141 Name: IBM01141   12 DisplayName: IBM01142 Name: IBM01142   13 DisplayName: IBM01143 Name: IBM01143   14 DisplayName: IBM01144 Name: IBM01144   15 DisplayName: IBM01145 Name: IBM01145   16 DisplayName: IBM01146 Name: IBM01146   17 DisplayName: IBM01147 Name: IBM01147   18 DisplayName: IBM01148 Name: IBM01148   19 DisplayName: IBM01149 Name: IBM01149   20 DisplayName: IBM037 Name: IBM037   21 DisplayName: IBM1026 Name: IBM1026   22 DisplayName: IBM1047 Name: IBM1047   23 DisplayName: IBM273 Name: IBM273   24 DisplayName: IBM277 Name: IBM277   25 DisplayName: IBM278 Name: IBM278   26 DisplayName: IBM280 Name: IBM280   27 DisplayName: IBM284 Name: IBM284   28 DisplayName: IBM285 Name: IBM285   29 DisplayName: IBM297 Name: IBM297   30 DisplayName: IBM420 Name: IBM420   31 DisplayName: IBM424 Name: IBM424   32 DisplayName: IBM437 Name: IBM437   33 DisplayName: IBM500 Name: IBM500   34 DisplayName: IBM775 Name: IBM775   35 DisplayName: IBM850 Name: IBM850   36 DisplayName: IBM852 Name: IBM852   37 DisplayName: IBM855 Name: IBM855   38 DisplayName: IBM857 Name: IBM857   39 DisplayName: IBM860 Name: IBM860   40 DisplayName: IBM861 Name: IBM861   41 DisplayName: IBM862 Name: IBM862   42 DisplayName: IBM863 Name: IBM863   43 DisplayName: IBM864 Name: IBM864   44 DisplayName: IBM865 Name: IBM865   45 DisplayName: IBM866 Name: IBM866   46 DisplayName: IBM868 Name: IBM868   47 DisplayName: IBM869 Name: IBM869   48 DisplayName: IBM870 Name: IBM870   49 DisplayName: IBM871 Name: IBM871   50 DisplayName: IBM918 Name: IBM918   51 DisplayName: ISO-2022-CN Name: ISO-2022-CN  
 52 DisplayName: ISO-2022-JP Name: ISO-2022-JP   53 DisplayName: ISO-2022-JP-2 Name: ISO-2022-JP-2   54 DisplayName: ISO-2022-KR Name: ISO-2022-KR   55 DisplayName: ISO-8859-1 Name: ISO-8859-1   56 DisplayName: ISO-8859-13 Name: ISO-8859-13   57 DisplayName: ISO-8859-15 Name: ISO-8859-15   58 DisplayName: ISO-8859-2 Name: ISO-8859-2   59 DisplayName: ISO-8859-3 Name: ISO-8859-3   60 DisplayName: ISO-8859-4 Name: ISO-8859-4   61 DisplayName: ISO-8859-5 Name: ISO-8859-5   62 DisplayName: ISO-8859-6 Name: ISO-8859-6   63 DisplayName: ISO-8859-7 Name: ISO-8859-7   64 DisplayName: ISO-8859-8 Name: ISO-8859-8   65 DisplayName: ISO-8859-9 Name: ISO-8859-9   66 DisplayName: JIS_X0201 Name: JIS_X0201   67 DisplayName: JIS_X0212-1990 Name: JIS_X0212-1990   68 DisplayName: KOI8-R Name: KOI8-R   69 DisplayName: KOI8-U Name: KOI8-U   70 DisplayName: Shift_JIS Name: Shift_JIS   71 DisplayName: TIS-620 Name: TIS-620   72 DisplayName: US-ASCII Name: US-ASCII   73 DisplayName: UTF-16 Name: UTF-16   74 DisplayName: UTF-16BE Name: UTF-16BE   75 DisplayName: UTF-16LE Name: UTF-16LE   76 DisplayName: UTF-32 Name: UTF-32   77 DisplayName: UTF-32BE Name: UTF-32BE   78 DisplayName: UTF-32LE Name: UTF-32LE   79 DisplayName: UTF-8 Name: UTF-8   80 DisplayName: windows-1250 Name: windows-1250   81 DisplayName: windows-1251 Name: windows-1251   82 DisplayName: windows-1252 Name: windows-1252   83 DisplayName: windows-1253 Name: windows-1253   84 DisplayName: windows-1254 Name: windows-1254   85 DisplayName: windows-1255 Name: windows-1255   86 DisplayName: windows-1256 Name: windows-1256   87 DisplayName: windows-1257 Name: windows-1257   88 DisplayName: windows-1258 Name: windows-1258   89 DisplayName: windows-31j Name: windows-31j   90 DisplayName: x-Big5-HKSCS-2001 Name: x-Big5-HKSCS-2001   91 DisplayName: x-Big5-Solaris Name: x-Big5-Solaris   92 DisplayName: x-euc-jp-linux Name: x-euc-jp-linux   93 DisplayName: x-EUC-TW Name: x-EUC-TW   94 DisplayName: x-eucJP-Open Name: x-eucJP-Open   95 
DisplayName: x-IBM1006 Name: x-IBM1006   96 DisplayName: x-IBM1025 Name: x-IBM1025   97 DisplayName: x-IBM1046 Name: x-IBM1046   98 DisplayName: x-IBM1097 Name: x-IBM1097   99 DisplayName: x-IBM1098 Name: x-IBM1098  100 DisplayName: x-IBM1112 Name: x-IBM1112  101 DisplayName: x-IBM1122 Name: x-IBM1122  102 DisplayName: x-IBM1123 Name: x-IBM1123  103 DisplayName: x-IBM1124 Name: x-IBM1124  104 DisplayName: x-IBM1364 Name: x-IBM1364  105 DisplayName: x-IBM1381 Name: x-IBM1381  106 DisplayName: x-IBM1383 Name: x-IBM1383  107 DisplayName: x-IBM33722 Name: x-IBM33722  108 DisplayName: x-IBM737 Name: x-IBM737  109 DisplayName: x-IBM833 Name: x-IBM833  110 DisplayName: x-IBM834 Name: x-IBM834  111 DisplayName: x-IBM856 Name: x-IBM856  112 DisplayName: x-IBM874 Name: x-IBM874  113 DisplayName: x-IBM875 Name: x-IBM875  114 DisplayName: x-IBM921 Name: x-IBM921  115 DisplayName: x-IBM922 Name: x-IBM922  116 DisplayName: x-IBM930 Name: x-IBM930  117 DisplayName: x-IBM933 Name: x-IBM933  118 DisplayName: x-IBM935 Name: x-IBM935  119 DisplayName: x-IBM937 Name: x-IBM937  120 DisplayName: x-IBM939 Name: x-IBM939  121 DisplayName: x-IBM942 Name: x-IBM942  122 DisplayName: x-IBM942C Name: x-IBM942C  123 DisplayName: x-IBM943 Name: x-IBM943  124 DisplayName: x-IBM943C Name: x-IBM943C  125 DisplayName: x-IBM948 Name: x-IBM948  126 DisplayName: x-IBM949 Name: x-IBM949  127 DisplayName: x-IBM949C Name: x-IBM949C  128 DisplayName: x-IBM950 Name: x-IBM950  129 DisplayName: x-IBM964 Name: x-IBM964  130 DisplayName: x-IBM970 Name: x-IBM970  131 DisplayName: x-ISCII91 Name: x-ISCII91  132 DisplayName: x-ISO-2022-CN-CNS Name: x-ISO-2022-CN-CNS  133 DisplayName: x-ISO-2022-CN-GB Name: x-ISO-2022-CN-GB  134 DisplayName: x-iso-8859-11 Name: x-iso-8859-11  135 DisplayName: x-JIS0208 Name: x-JIS0208  136 DisplayName: x-JISAutoDetect Name: x-JISAutoDetect  137 DisplayName: x-Johab Name: x-Johab  138 DisplayName: x-MacArabic Name: x-MacArabic  139 DisplayName: x-MacCentralEurope Name: 
x-MacCentralEurope  140 DisplayName: x-MacCroatian Name: x-MacCroatian  141 DisplayName: x-MacCyrillic Name: x-MacCyrillic  142 DisplayName: x-MacDingbat Name: x-MacDingbat  143 DisplayName: x-MacGreek Name: x-MacGreek  144 DisplayName: x-MacHebrew Name: x-MacHebrew  145 DisplayName: x-MacIceland Name: x-MacIceland  146 DisplayName: x-MacRoman Name: x-MacRoman  147 DisplayName: x-MacRomania Name: x-MacRomania  148 DisplayName: x-MacSymbol Name: x-MacSymbol  149 DisplayName: x-MacThai Name: x-MacThai  150 DisplayName: x-MacTurkish Name: x-MacTurkish  151 DisplayName: x-MacUkraine Name: x-MacUkraine  152 DisplayName: x-MS932_0213 Name: x-MS932_0213  153 DisplayName: x-MS950-HKSCS Name: x-MS950-HKSCS  154 DisplayName: x-MS950-HKSCS-XP Name: x-MS950-HKSCS-XP  155 DisplayName: x-mswin-936 Name: x-mswin-936  156 DisplayName: x-PCK Name: x-PCK  157 DisplayName: x-SJIS_0213 Name: x-SJIS_0213  158 DisplayName: x-UTF-16LE-BOM Name: x-UTF-16LE-BOM  159 DisplayName: X-UTF-32BE-BOM Name: X-UTF-32BE-BOM  160 DisplayName: X-UTF-32LE-BOM Name: X-UTF-32LE-BOM  161 DisplayName: x-windows-50220 Name: x-windows-50220  162 DisplayName: x-windows-50221 Name: x-windows-50221  163 DisplayName: x-windows-874 Name: x-windows-874  164 DisplayName: x-windows-949 Name: x-windows-949  165 DisplayName: x-windows-950 Name: x-windows-950  166 DisplayName: x-windows-iso2022jp Name: x-windows-iso2022jp

Java appears to support more encodings than C#, at least on my machine.

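 Setting the default charset in Eclipse

This is simple. Different computers and programs may use different encodings as their default, so source code copied from one computer to another will not necessarily compile. Next, let's check the default charset from inside a program:

Java: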

    package code;

    import java.nio.charset.Charset;

    public class Code {

        public static void main(String[] args) {
            // The JVM's default charset; it depends on the OS, the locale and JVM settings.
            System.out.println("Default CharSet: " + Charset.defaultCharset());
        }

    }

Output:

    Default CharSet: UTF-8
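Because the default differs from machine to machine, it is usually safer to name the charset explicitly whenever bytes are turned into text. The sketch below is only an illustration: ISO-8859-1 stands in for "some other machine's default", and the class and method names are invented. The test string 天添 is written as `\u5929\u6dfb`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    static final String TEXT = "\u5929\u6dfb"; // the two-character test string

    // Decodes UTF-8 bytes with whatever charset the reader assumes.
    static String decode(byte[] utf8Bytes, Charset assumed) {
        return new String(utf8Bytes, assumed);
    }

    public static void main(String[] args) {
        byte[] bytes = TEXT.getBytes(StandardCharsets.UTF_8); // 6 bytes on every machine

        // Correct assumption: the text survives.
        System.out.println(decode(bytes, StandardCharsets.UTF_8).equals(TEXT)); // true

        // Wrong assumption: each byte becomes its own Latin-1 character,
        // so the two characters turn into six unrelated ones.
        String garbled = decode(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length());     // 6
        System.out.println(garbled.equals(TEXT)); // false
    }
}
```

The same idea applies to compiling: telling the compiler or IDE which encoding the source files use avoids depending on the default of the machine that happens to build them.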

The default encoding of C# in my environment:

    using System;
    using System.Text;

    namespace Text
    {
        class Program
        {
            static void Main(string[] args)
            {
                // On .NET Framework this is the system ANSI code page;
                // on .NET Core and later it is always UTF-8.
                Console.WriteLine(Encoding.Default.EncodingName);
                Console.ReadKey();
            }
        }
    }

Output: [screenshot of the console output]

 

Now for something interesting: let's see which of the encodings supported by C# can handle Chinese, reusing the program from the beginning:

    using System;
    using System.IO;
    using System.Text;

    namespace Text
    {
        class Program
        {
            static void Main(string[] args)
            {
                string testStr = "天添";
                StringBuilder sb = new StringBuilder();
                foreach (EncodingInfo coif in Encoding.GetEncodings())
                {
                    // Round trip: string -> bytes -> string in the same encoding.
                    // Characters the encoding cannot represent come back as '?'.
                    Encoding enc = Encoding.GetEncoding(coif.Name);
                    byte[] desBytes = enc.GetBytes(testStr);
                    string desStr = enc.GetString(desBytes);

                    sb.Append(" Display Name: " + coif.DisplayName + "----Name: " + coif.Name + "----And The result is:  " + desStr + "\n");
                }
                byte[] coByte = Encoding.Unicode.GetBytes(sb.ToString());

                // FileMode.Create truncates the report left by the earlier run.
                using (FileStream fs = File.Open("c:\\code.txt", FileMode.Create, FileAccess.ReadWrite))
                {
                    fs.Write(coByte, 0, coByte.Length);
                }
                Console.ReadKey();
            }
        }
    }

Output:

  1  Display Name: IBM EBCDIC (美國-加拿大)----Name: IBM037----And The result is:  ??    2  Display Name: OEM 美國----Name: IBM437----And The result is:  ??    3  Display Name: IBM EBCDIC (國際)----Name: IBM500----And The result is:  ??    4  Display Name: 阿拉伯字符(ASMO-708)----Name: ASMO-708----And The result is:  ??    5  Display Name: 阿拉伯字符(DOS)----Name: DOS-720----And The result is:  ??    6  Display Name: 希臘字符(DOS)----Name: ibm737----And The result is:  ??    7  Display Name: 波羅的海字符(DOS)----Name: ibm775----And The result is:  ??    8  Display Name: 西歐字符(DOS)----Name: ibm850----And The result is:  ??    9  Display Name: 中歐字符(DOS)----Name: ibm852----And The result is:  ??   10  Display Name: OEM 西里爾語----Name: IBM855----And The result is:  ??   11  Display Name: 土耳其字符(DOS)----Name: ibm857----And The result is:  ??   12  Display Name: OEM 多語言拉丁語 I----Name: IBM00858----And The result is:  ??   13  Display Name: 葡萄牙語(DOS)----Name: IBM860----And The result is:  ??   14  Display Name: 冰島語(DOS)----Name: ibm861----And The result is:  ??   15  Display Name: 希伯來字符(DOS)----Name: DOS-862----And The result is:  ??   16  Display Name: 加拿大法語(DOS)----Name: IBM863----And The result is:  ??   17  Display Name: 阿拉伯字符(864)----Name: IBM864----And The result is:  ??   18  Display Name: 北歐字符(DOS)----Name: IBM865----And The result is:  ??   19  Display Name: 西里爾字符(DOS)----Name: cp866----And The result is:  ??   20  Display Name: 現代希臘字符(DOS)----Name: ibm869----And The result is:  ??   21  Display Name: IBM EBCDIC (多語言拉丁語 2)----Name: IBM870----And The result is:  ??   22  Display Name: 泰語(Windows)----Name: windows-874----And The result is:  ??   23  Display Name: IBM EBCDIC (現代希臘語)----Name: cp875----And The result is:  ??   
24  Display Name: 日語(Shift-JIS)----Name: shift_jis----And The result is:  天添   25  Display Name: 簡體中文(GB2312)----Name: gb2312----And The result is:  天添   26  Display Name: 朝鮮語----Name: ks_c_5601-1987----And The result is:  天添   27  Display Name: 繁體中文(Big5)----Name: big5----And The result is:  天添   28  Display Name: IBM EBCDIC (土耳其拉丁語 5)----Name: IBM1026----And The result is:  ??   29  Display Name: IBM 拉丁語 1----Name: IBM01047----And The result is:  ??   30  Display Name: IBM EBCDIC (美國-加拿大-歐洲)----Name: IBM01140----And The result is:  ??   31  Display Name: IBM EBCDIC (德國-歐洲)----Name: IBM01141----And The result is:  ??   32  Display Name: IBM EBCDIC (丹麥-挪威-歐洲)----Name: IBM01142----And The result is:  ??   33  Display Name: IBM EBCDIC (芬蘭-瑞典-歐洲)----Name: IBM01143----And The result is:  ??   34  Display Name: IBM EBCDIC (意大利-歐洲)----Name: IBM01144----And The result is:  ??   35  Display Name: IBM EBCDIC (西班牙-歐洲)----Name: IBM01145----And The result is:  ??   36  Display Name: IBM EBCDIC (英國-歐洲)----Name: IBM01146----And The result is:  ??   37  Display Name: IBM EBCDIC (法國-歐洲)----Name: IBM01147----And The result is:  ??   38  Display Name: IBM EBCDIC (國際-歐洲)----Name: IBM01148----And The result is:  ??   39  Display Name: IBM EBCDIC (冰島語-歐洲)----Name: IBM01149----And The result is:  ??   40  Display Name: Unicode----Name: utf-16----And The result is:  天添   41  Display Name: Unicode (Big-Endian)----Name: utf-16BE----And The result is:  天添   42  Display Name: 中歐字符(Windows)----Name: windows-1250----And The result is:  ??   43  Display Name: 西里爾字符(Windows)----Name: windows-1251----And The result is:  ??   44  Display Name: 西歐字符(Windows)----Name: Windows-1252----And The result is:  ??   45  Display Name: 希臘字符(Windows)----Name: windows-1253----And The result is:  ??   46  Display Name: 土耳其字符(Windows)----Name: windows-1254----And The result is:  ??   47  Display Name: 希伯來字符(Windows)----Name: windows-1255----And The result is:  ??   
48  Display Name: 阿拉伯字符(Windows)----Name: windows-1256----And The result is:  ??   49  Display Name: 波羅的海字符(Windows)----Name: windows-1257----And The result is:  ??   50  Display Name: 越南字符(Windows)----Name: windows-1258----And The result is:  ??   51  Display Name: 朝鮮語(Johab)----Name: Johab----And The result is:  天添   52  Display Name: 西歐字符(Mac)----Name: macintosh----And The result is:  ??   53  Display Name: 日語(Mac)----Name: x-mac-japanese----And The result is:  天添   54  Display Name: 繁體中文(Mac)----Name: x-mac-chinesetrad----And The result is:  天添   55  Display Name: 朝鮮語(Mac)----Name: x-mac-korean----And The result is:  天添   56  Display Name: 阿拉伯字符(Mac)----Name: x-mac-arabic----And The result is:  ??   57  Display Name: 希伯來字符(Mac)----Name: x-mac-hebrew----And The result is:  ??   58  Display Name: 希臘字符(Mac)----Name: x-mac-greek----And The result is:  ??   59  Display Name: 西里爾字符(Mac)----Name: x-mac-cyrillic----And The result is:  ??   60  Display Name: 簡體中文(Mac)----Name: x-mac-chinesesimp----And The result is:  天添   61  Display Name: 羅馬尼亞語(Mac)----Name: x-mac-romanian----And The result is:  ??   62  Display Name: 烏克蘭語(Mac)----Name: x-mac-ukrainian----And The result is:  ??   63  Display Name: 泰語(Mac)----Name: x-mac-thai----And The result is:  ??   64  Display Name: 中歐字符(Mac)----Name: x-mac-ce----And The result is:  ??   65  Display Name: 冰島語(Mac)----Name: x-mac-icelandic----And The result is:  ??   66  Display Name: 土耳其字符(Mac)----Name: x-mac-turkish----And The result is:  ??   67  Display Name: 克羅地亞語(Mac)----Name: x-mac-croatian----And The result is:  ??   
68  Display Name: Unicode (UTF-32)----Name: utf-32----And The result is:  天添   69  Display Name: Unicode (UTF-32 Big-Endian)----Name: utf-32BE----And The result is:  天添   70  Display Name: 繁體中文(CNS)----Name: x-Chinese-CNS----And The result is:  天添   71  Display Name: TCA 臺灣----Name: x-cp20001----And The result is:  天添   72  Display Name: 繁體中文(Eten)----Name: x-Chinese-Eten----And The result is:  天添   73  Display Name: IBM5550 臺灣----Name: x-cp20003----And The result is:  天添   74  Display Name: TeleText 臺灣----Name: x-cp20004----And The result is:  天添   75  Display Name: Wang 臺灣----Name: x-cp20005----And The result is:  天添   76  Display Name: 西歐字符(IA5)----Name: x-IA5----And The result is:  ??   77  Display Name: 德語(IA5)----Name: x-IA5-German----And The result is:  ??   78  Display Name: 瑞典語(IA5)----Name: x-IA5-Swedish----And The result is:  ??   79  Display Name: 挪威語(IA5)----Name: x-IA5-Norwegian----And The result is:  ??   80  Display Name: US-ASCII----Name: us-ascii----And The result is:  ??   81  Display Name: T.61----Name: x-cp20261----And The result is:  ??   82  Display Name: ISO-6937----Name: x-cp20269----And The result is:  ??   83  Display Name: IBM EBCDIC (德國)----Name: IBM273----And The result is:  ??   84  Display Name: IBM EBCDIC (丹麥-挪威)----Name: IBM277----And The result is:  ??   85  Display Name: IBM EBCDIC (芬蘭-瑞典)----Name: IBM278----And The result is:  ??   86  Display Name: IBM EBCDIC (意大利)----Name: IBM280----And The result is:  ??   87  Display Name: IBM EBCDIC (西班牙)----Name: IBM284----And The result is:  ??   88  Display Name: IBM EBCDIC (UK)----Name: IBM285----And The result is:  ??   89  Display Name: IBM EBCDIC (日語片假名)----Name: IBM290----And The result is:  ??   90  Display Name: IBM EBCDIC (法國)----Name: IBM297----And The result is:  ??   91  Display Name: IBM EBCDIC (阿拉伯語)----Name: IBM420----And The result is:  ??   92  Display Name: IBM EBCDIC (希臘語)----Name: IBM423----And The result is:  ??   
93  Display Name: IBM EBCDIC (希伯來語)----Name: IBM424----And The result is:  ??   94  Display Name: IBM EBCDIC (朝鮮語擴展)----Name: x-EBCDIC-KoreanExtended----And The result is:  ??   95  Display Name: IBM EBCDIC (泰語)----Name: IBM-Thai----And The result is:  ??   96  Display Name: 西里爾字符(KOI8-R)----Name: koi8-r----And The result is:  ??   97  Display Name: IBM EBCDIC (冰島語)----Name: IBM871----And The result is:  ??   98  Display Name: IBM EBCDIC (西里爾俄語)----Name: IBM880----And The result is:  ??   99  Display Name: IBM EBCDIC (土耳其語)----Name: IBM905----And The result is:  ??  100  Display Name: IBM 拉丁語 1----Name: IBM00924----And The result is:  ??  101  Display Name: 日語(JIS 0208-19900212-1990)----Name: EUC-JP----And The result is:  天添  102  Display Name: 簡體中文(GB2312-80)----Name: x-cp20936----And The result is:  天添  103  Display Name: 朝鮮語 Wansung----Name: x-cp20949----And The result is:  天添  104  Display Name: IBM EBCDIC (西里爾塞爾維亞-保加利亞語)----Name: cp1025----And The result is:  ??  105  Display Name: 西里爾字符(KOI8-U)----Name: koi8-u----And The result is:  ??  106  Display Name: 西歐字符(ISO)----Name: iso-8859-1----And The result is:  ??  107  Display Name: 中歐字符(ISO)----Name: iso-8859-2----And The result is:  ??  108  Display Name: 拉丁語 3 (ISO)----Name: iso-8859-3----And The result is:  ??  109  Display Name: 波羅的海字符(ISO)----Name: iso-8859-4----And The result is:  ??  110  Display Name: 西里爾字符(ISO)----Name: iso-8859-5----And The result is:  ??  111  Display Name: 阿拉伯字符(ISO)----Name: iso-8859-6----And The result is:  ??  112  Display Name: 希臘字符(ISO)----Name: iso-8859-7----And The result is:  ??  113  Display Name: 希伯來字符(ISO-Visual)----Name: iso-8859-8----And The result is:  ??  114  Display Name: 土耳其字符(ISO)----Name: iso-8859-9----And The result is:  ??  115  Display Name: 愛沙尼亞語(ISO)----Name: iso-8859-13----And The result is:  ??  116  Display Name: 拉丁語 9 (ISO)----Name: iso-8859-15----And The result is:  ??  117  Display Name: 歐羅巴----Name: x-Europa----And The result is:  ??  
118  Display Name: 希伯來字符(ISO-Logical)----Name: iso-8859-8-i----And The result is:  ??  119  Display Name: 日語(JIS)----Name: iso-2022-jp----And The result is:  天添  120  Display Name: 日語(JIS-允許 1 字節假名)----Name: csISO2022JP----And The result is:  天添  121  Display Name: 日語(JIS-允許 1 字節假名 - SO/SI)----Name: iso-2022-jp----And The result is:  天添  122  Display Name: 朝鮮語(ISO)----Name: iso-2022-kr----And The result is:  天添  123  Display Name: 簡體中文(ISO-2022)----Name: x-cp50227----And The result is:  天添  124  Display Name: 日語(EUC)----Name: euc-jp----And The result is:  天添  125  Display Name: 簡體中文(EUC)----Name: EUC-CN----And The result is:  天添  126  Display Name: 朝鮮語(EUC)----Name: euc-kr----And The result is:  天添  127  Display Name: 簡體中文(HZ)----Name: hz-gb-2312----And The result is:  天添  128  Display Name: 簡體中文(GB18030)----Name: GB18030----And The result is:  天添  129  Display Name: ISCII 梵文----Name: x-iscii-de----And The result is:  ??  130  Display Name: ISCII 孟加拉語----Name: x-iscii-be----And The result is:  ??  131  Display Name: ISCII 泰米爾語----Name: x-iscii-ta----And The result is:  ??  132  Display Name: ISCII 泰盧固語----Name: x-iscii-te----And The result is:  ??  133  Display Name: ISCII 阿薩姆語----Name: x-iscii-as----And The result is:  ??  134  Display Name: ISCII 奧裏雅語----Name: x-iscii-or----And The result is:  ??  135  Display Name: ISCII 卡納達語----Name: x-iscii-ka----And The result is:  ??  136  Display Name: ISCII 馬拉雅拉姆語----Name: x-iscii-ma----And The result is:  ??  137  Display Name: ISCII 古吉拉特語----Name: x-iscii-gu----And The result is:  ??  138  Display Name: ISCII 旁遮普語----Name: x-iscii-pa----And The result is:  ??  139  Display Name: Unicode (UTF-7)----Name: utf-7----And The result is:  天添  140  Display Name: Unicode (UTF-8)----Name: utf-8----And The result is:  天添

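Looking through the list, 34 of the 140 entries bring the Chinese test string back intact, including encodings designed for Japanese, Korean and Taiwanese text, presumably because these two characters also exist in those character sets. Interesting.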
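The same survey can be sketched in Java. Note the difference in technique: instead of encoding and decoding and then looking for `?`, this version asks each charset's encoder whether it can represent the string (`CharsetEncoder.canEncode`). The class and method names are illustrative, and the count it prints depends on the JVM, so no expected number is given.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

class RoundTripDemo {
    // True if the charset can encode every character of text without substitution.
    static boolean canRepresent(Charset cs, String text) {
        // Some charsets are decode-only; newEncoder() would throw for them,
        // so check canEncode() on the charset first.
        return cs.canEncode() && cs.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String text = "\u5929\u6dfb"; // the test string from the post
        List<String> ok = new ArrayList<>();
        for (Charset cs : Charset.availableCharsets().values()) {
            if (canRepresent(cs, text)) {
                ok.add(cs.name());
            }
        }
        System.out.println(ok.size() + " charsets can represent the test string:");
        System.out.println(ok);
    }
}
```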

Several encodings support Chinese, but do they really produce the same thing? Let's compare a few:

    using System;
    using System.Text;

    namespace Text
    {
        class Program
        {
            // Prints a label followed by each byte as [n].
            static void PrintBytes(string label, byte[] bytes)
            {
                Console.Write(label);
                foreach (byte b in bytes)
                {
                    Console.Write("[{0}]", b);
                }
                Console.WriteLine();
            }

            static void Main(string[] args)
            {
                string testStr = "天添";
                Console.WriteLine("Original string: " + testStr);

                // Note: "hz-gb-2312" is HZ, a 7-bit wrapping of GB2312,
                // so its bytes differ from the 8-bit "gb2312" encoding.
                Encoding hz = Encoding.GetEncoding("hz-gb-2312");
                Encoding jp = Encoding.GetEncoding("iso-2022-jp");

                byte[] asciiBytes = Encoding.ASCII.GetBytes(testStr);
                byte[] utf8Bytes = Encoding.UTF8.GetBytes(testStr);
                byte[] gb2312Bytes = hz.GetBytes(testStr);
                byte[] jpBytes = jp.GetBytes(testStr);

                PrintBytes("ASCII bytes: ", asciiBytes);
                PrintBytes("UTF8 bytes: ", utf8Bytes);
                PrintBytes("GB2312 (HZ) bytes: ", gb2312Bytes);
                PrintBytes("iso-2022-jp bytes: ", jpBytes);

                // Decode each byte array with the encoding that produced it.
                Console.WriteLine("ascii result: " + Encoding.ASCII.GetString(asciiBytes));
                Console.WriteLine("utf8 result:  " + Encoding.UTF8.GetString(utf8Bytes));
                Console.WriteLine("gb2312 result: " + hz.GetString(gb2312Bytes));
                Console.WriteLine("iso-2022-jp result: " + jp.GetString(jpBytes));
                Console.ReadKey();
            }
        }
    }

 

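Output: [screenshot of the console output]

One thing stands out:

     UTF-8 and GB2312 both decode back to the original string, but the byte arrays they produce in between are different. That is easy to understand: they are different character encodings.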
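The difference in the intermediate bytes is easy to see by dumping them in hex. A small Java sketch follows; the helper name `hex` is made up, and GB2312 is guarded with `Charset.isSupported` because a Java runtime is not required to ship it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteDumpDemo {
    // Renders the encoded form of text as space-separated hex bytes.
    static String hex(String text, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u5929\u6dfb"; // the test string from the post
        System.out.println("UTF-8:    " + hex(text, StandardCharsets.UTF_8));    // E5 A4 A9 E6 B7 BB
        System.out.println("UTF-16BE: " + hex(text, StandardCharsets.UTF_16BE)); // 59 29 6D FB
        // GB2312 is optional in a Java runtime, so check before using it.
        if (Charset.isSupported("GB2312")) {
            System.out.println("GB2312:   " + hex(text, Charset.forName("GB2312")));
        }
    }
}
```

UTF-8 needs three bytes per character here while UTF-16 needs two, so even the lengths of the byte arrays differ.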

   

Next, the classes derived from Encoding that the .NET Framework provides for working with encodings:

  • ASCIIEncoding
  • UTF7Encoding
  • UTF8Encoding
  • UnicodeEncoding(UTF-16)
  • UTF32Encoding

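  ASCIIEncoding and UTF8Encoding were already used briefly above, so let's try the other three. While experimenting I noticed some small differences, which correspond to two known problems with Unicode encodings:

     The NUL problem: C treats a NUL (0x00) byte as the end of a string, while a C# string stores its length and may contain NUL. UTF-16 and UTF-32 text contains 0x00 bytes even for plain ASCII characters, so C string functions cut it short. (I am not very familiar with this one.)

     The byte order problem: there are two ways to order the bytes of a 16-bit integer. One is little endian, which stores the low 8 bits first; Intel's x86 CPUs are designed this way. The other is big endian, typified by Sun's SPARC CPUs. The choice matters: data written in one byte order has to be byte-swapped before a CPU that uses the other order can process it, which costs extra time.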
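One common reading of the NUL problem is that UTF-16 output contains 0x00 bytes even for ordinary ASCII letters, which a NUL-terminated C string cannot carry. The Java sketch below just counts those bytes; the class and method names are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NulByteDemo {
    // Counts the 0x00 bytes in the encoded form of text.
    static int zeroBytes(String text, Charset cs) {
        int count = 0;
        for (byte b : text.getBytes(cs)) {
            if (b == 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "ABC";
        System.out.println(zeroBytes(text, StandardCharsets.UTF_8));    // 0
        System.out.println(zeroBytes(text, StandardCharsets.UTF_16LE)); // 3
        // The UTF-16LE bytes are 41 00 42 00 43 00, so a C routine such as
        // strlen would stop right after the first byte.
    }
}
```

UTF-8 avoids the issue because it never produces a 0x00 byte except for the NUL character itself.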

    using System;
    using System.Text;

    namespace Text
    {
        class Program
        {
            static void Main(string[] args)
            {
                string testStr = "天添";

                // UnicodeEncoding(bigEndian, byteOrderMark)
                UnicodeEncoding unicodingBigEnd = new UnicodeEncoding(true, true);
                UnicodeEncoding unicodingLittleEnd = new UnicodeEncoding(false, true);
                Console.WriteLine("Original string: " + testStr);

                byte[] unicodingBigEndBytes = unicodingBigEnd.GetBytes(testStr);
                Console.Write("BigEnd bytes: ");
                foreach (byte b in unicodingBigEndBytes)
                {
                    Console.Write("[{0}]", b);
                }
                Console.WriteLine();

                byte[] unicodingLittleBytes = unicodingLittleEnd.GetBytes(testStr);
                Console.Write("Little bytes: ");
                foreach (byte b in unicodingLittleBytes)
                {
                    Console.Write("[{0}]", b);
                }
                Console.WriteLine();

                // Decode each array with the matching byte order ("utf-16" is little endian in .NET).
                string unicodeBigEnd = Encoding.GetEncoding("utf-16BE").GetString(unicodingBigEndBytes);
                string unicodeLittleEnd = Encoding.GetEncoding("utf-16").GetString(unicodingLittleBytes);

                Console.WriteLine("BigEnd result: " + unicodeBigEnd);
                Console.WriteLine("Little result: " + unicodeLittleEnd);
                Console.ReadKey();
            }
        }
    }

Output: [screenshot of the console output]

As expected, the byte order is different. UTF32Encoding has the same issue.
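What goes wrong when the reader guesses the byte order incorrectly can be sketched in Java as follows. The example also shows a byte-order mark (BOM), the usual way to tell the reader which order was used; that term does not appear in the text above, and the class and helper names are invented for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteOrderDemo {
    // Decodes raw bytes with the given charset.
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        // U+5929 in little-endian UTF-16: low byte 0x29 first, then 0x59.
        byte[] littleEndian = {0x29, 0x59};

        // Read with the right order: the original character.
        System.out.println(decode(littleEndian, StandardCharsets.UTF_16LE).equals("\u5929")); // true

        // Read with the wrong order: the same bytes are taken as U+2959, another character.
        System.out.println(decode(littleEndian, StandardCharsets.UTF_16BE).equals("\u2959")); // true

        // With a byte-order mark in front (FF FE means little endian),
        // Java's "UTF-16" decoder works out the order by itself.
        byte[] withBom = {(byte) 0xFF, (byte) 0xFE, 0x29, 0x59};
        System.out.println(decode(withBom, StandardCharsets.UTF_16).equals("\u5929")); // true
    }
}
```

No error is raised in the wrong-order case; the text is simply decoded into different characters, which is why the byte order has to be agreed on or marked.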

 

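I seem to have said a lot and yet nothing at all, and in a rather messy way. Still, I feel I understand encodings a little better now. I am not sure my understanding is correct, so comments are welcome. [Figure]

How programming languages handle text data, the UCS approach versus the CSI approach, is a topic for another time.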

 

 

 

   

      

      

 
