Windows via C/C++: Learning While Translating (2) Working with Characters and Strings, Part 1

Chapter 2: Working with Characters and Strings

Overview

 Windows has always offered support to help developers localize their applications. An application can get country-specific information from various functions and can examine Control Panel settings to determine the user's preferences. Windows even supports different fonts for our applications. Last but not least, Windows Vista now supports Unicode 5.0. (Read "Extend The Global Reach Of Your Applications With Unicode 5.0" at http://msdn.microsoft.com/msdnmag/issues/07/01/Unicode/default.aspx for a high-level presentation of Unicode 5.0.)

 Buffer overrun errors (which are typical when manipulating character strings) have become a vector for security attacks against applications and even against parts of the operating system. In previous years, Microsoft put a lot of internal and external effort into raising the security bar in the Windows world. The second part of this chapter presents new functions provided by Microsoft in the C run-time library. You should use these new functions to protect your code against buffer overruns when manipulating strings.

 If you have a code base that is non-Unicode, you'll be best served by moving that code base to Unicode, as this will improve your application's execution performance as well as prepare it for localization. It will also help when interoperating with COM and the .NET Framework.

Character Encodings

 The real problem with localization has always been manipulating different character sets. For years, most of us have been coding text strings as a series of single-byte characters with a zero character ('\0') at the end. This is second nature to us. When we call strlen, it returns the number of characters in a zero-terminated array of ANSI single-byte characters.
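 The zero-terminated convention can be made concrete with a small sketch. The helper name `count_until_zero` is hypothetical and only for illustration; it mimics what strlen does by counting bytes up to the terminating zero, which matches the character count only while every character occupies exactly one byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper for illustration (not a library function):
// counts bytes up to, but not including, the terminating zero byte,
// which is the same quantity strlen reports.
std::size_t count_until_zero(const char* s) {
    std::size_t n = 0;
    while (s[n] != '\0') {
        ++n;
    }
    return n;
}
```

 For the single-byte string "Hello" this returns 5, the same value as strlen. For a string whose characters take more than one byte each, the result is still a byte count, so it no longer equals the number of characters.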

 Unicode is a standard founded by Apple and Xerox in 1988. In 1991, a consortium was created to develop and promote Unicode. The consortium consists of companies such as Apple, Compaq, Hewlett-Packard, IBM, Microsoft, Oracle, Silicon Graphics, Sybase, Unisys, and Xerox. (A complete and updated list of consortium members is available at http://www.Unicode.org.) This group of companies is responsible for maintaining the Unicode standard. The full description of Unicode can be found in The Unicode Standard, published by Addison-Wesley. (This book is available through http://www.Unicode.org.)

 In Windows Vista, each Unicode character is encoded using UTF-16 (where UTF is an acronym for Unicode Transformation Format). UTF-16 normally encodes each character as 2 bytes (or 16 bits). In this book, when we talk about Unicode, we are always referring to UTF-16 encoding unless we state otherwise. Windows uses UTF-16 because characters from most languages used throughout the world can easily be represented via a 16-bit value, allowing programs to easily traverse a string and calculate its length. However, 16 bits is not enough to represent all characters from certain languages. For these languages, UTF-16 supports surrogates, which are a way of using 32 bits (or 4 bytes) to represent a single character. Because few applications need to represent the characters of these languages, UTF-16 is a good compromise between saving space and providing ease of coding. Note that the .NET Framework always encodes all characters and strings using UTF-16, so using UTF-16 in your Windows application will improve performance and reduce memory consumption if you need to pass characters or strings between native and managed code.
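 The surrogate mechanism described above can be sketched in portable C++. The function names `append_utf16` and `count_code_points` are hypothetical, and the sketch uses char16_t rather than the Windows WCHAR type so that it does not depend on any platform header. It assumes its input is a valid code point (at most 0x10FFFF and not itself in the surrogate range).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: append one Unicode code point to a UTF-16 string.
// Assumes cp is valid (<= 0x10FFFF and outside 0xD800-0xDFFF).
void append_utf16(std::u16string& out, char32_t cp) {
    if (cp < 0x10000) {
        // Fits in a single 16-bit value.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Needs a surrogate pair: 32 bits (4 bytes) for one character.
        char32_t v = cp - 0x10000;  // 20 significant bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
}

// Hypothetical helper: count characters (code points) in a UTF-16 string,
// treating a high surrogate followed by a low surrogate as one character.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        bool low_next = i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (high && low_next) {
            ++i;  // skip the low half of the pair
        }
        ++n;
    }
    return n;
}
```

 Appending U+0041 and then U+1F600 yields three 16-bit values but only two characters, which is why a plain element count is a character count only for text that contains no surrogates.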

 There are other UTF standards for representing characters, including the following ones:

 UTF-8: UTF-8 encodes some characters as 1 byte, some characters as 2 bytes, some characters as 3 bytes, and some characters as 4 bytes. Characters with a value below 0x0080 are compressed to 1 byte, which works very well for characters used in the United States. Characters between 0x0080 and 0x07FF are converted to 2 bytes, which works well for European and Middle Eastern languages. Characters of 0x0800 and above are converted to 3 bytes, which works well for East Asian languages. Finally, surrogate pairs are written out as 4 bytes. UTF-8 is an extremely popular encoding format, but it's less efficient than UTF-16 if you encode many characters with values of 0x0800 or above.

 UTF-32: UTF-32 encodes every character as 4 bytes. This encoding is useful when you want to write a simple algorithm to traverse characters (used in any language) and you don't want to have to deal with characters taking a variable number of bytes. For example, with UTF-32, you do not need to think about surrogates because every character is 4 bytes. Obviously, UTF-32 is not an efficient encoding format in terms of memory usage. Therefore, it's rarely used for saving or transmitting strings to a file or network. This encoding format is typically used inside the program itself.
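 The byte counts quoted for UTF-8, UTF-16, and UTF-32 can be checked with a short sketch. All function names here are hypothetical, and the encoder assumes a valid code point (at most 0x10FFFF and not in the surrogate range); it is an illustration of the ranges above, not a validating converter.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes needed per code point in each encoding, following the ranges above.
std::size_t utf8_length(char32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;  // characters that UTF-16 stores as a surrogate pair
}

std::size_t utf16_length(char32_t cp) { return cp < 0x10000 ? 2 : 4; }

std::size_t utf32_length(char32_t) { return 4; }

// Hypothetical encoder: produce the UTF-8 bytes for one valid code point.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

 For a character such as U+4E2D, UTF-8 needs 3 bytes where UTF-16 needs 2, which is the inefficiency mentioned above for values of 0x0800 or above. For U+0041 the situation is reversed: 1 byte in UTF-8 against 2 in UTF-16.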

 Currently, Unicode code points (a code point is a character's position in the character table) are defined for the Arabic, Chinese bopomofo, Cyrillic (Russian), Greek, Hebrew, Japanese kana, Korean hangul, and Latin (English) alphabets (called scripts) and more. Each version of Unicode brings new characters in existing scripts and even new scripts such as Phoenician (an ancient Mediterranean alphabet). A large number of punctuation marks, mathematical symbols, technical symbols, arrows, dingbats, diacritics, and other characters are also included in the character sets. These 65,536 characters are divided into regions. Table 2-1 shows some of the regions and the characters that are assigned to them.

 Table 2-1: Unicode Character Sets and Alphabets

16-Bit Code   Characters           16-Bit Code   Alphabet/Scripts
0000-007F     ASCII                0300-036F     Generic diacritical marks
0080-00FF     Latin1 characters    0400-04FF     Cyrillic
0100-017F     European Latin       0530-058F     Armenian
0180-01FF     Extended Latin       0590-05FF     Hebrew
0250-02AF     Standard phonetic    0600-06FF     Arabic
02B0-02FF     Modified letters     0900-097F     Devanagari
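
 The regions in Table 2-1 amount to a range lookup on the 16-bit code. The following is a minimal sketch of that idea; the `Region` struct, the `kRegions` array, and the `region_of` function are hypothetical names, and the data covers only the rows listed in this excerpt of the table.

```cpp
#include <cassert>
#include <string>

// One row of Table 2-1: an inclusive range of 16-bit codes and its label.
struct Region {
    char16_t first;
    char16_t last;
    const char* name;
};

// Only the regions shown in this excerpt of Table 2-1.
const Region kRegions[] = {
    {0x0000, 0x007F, "ASCII"},
    {0x0080, 0x00FF, "Latin1 characters"},
    {0x0100, 0x017F, "European Latin"},
    {0x0180, 0x01FF, "Extended Latin"},
    {0x0250, 0x02AF, "Standard phonetic"},
    {0x02B0, 0x02FF, "Modified letters"},
    {0x0300, 0x036F, "Generic diacritical marks"},
    {0x0400, 0x04FF, "Cyrillic"},
    {0x0530, 0x058F, "Armenian"},
    {0x0590, 0x05FF, "Hebrew"},
    {0x0600, 0x06FF, "Arabic"},
    {0x0900, 0x097F, "Devanagari"},
};

// Hypothetical lookup: return the region label for a 16-bit code,
// or a placeholder when the code is outside the listed rows.
std::string region_of(char16_t code) {
    for (const Region& r : kRegions) {
        if (code >= r.first && code <= r.last) {
            return r.name;
        }
    }
    return "(not in this excerpt)";
}
```

 With this data, the code 0x0041 falls in the ASCII region and 0x0416 in the Cyrillic region, while a code such as 0x4E2D is reported as outside the listed rows.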

 

This article is translated from Windows via C/C++.
