Windows Via C/C++: Learning While Translating (3), Working with Characters and Strings, Part 2

ANSI and Unicode Character and String Data Types  ANSI、Unicode字符及字符串數據類型

I'm sure you're aware that the C language uses the char data type to represent an 8-bit ANSI character. By default, when you declare a literal string in your source code, the C compiler turns the string's characters into an array of 8-bit char data types:

你一定知道C語言用char類型來表示8位的ANSI字符。當你在代碼中聲明一個字符串時,C編譯器默認將其轉化爲8位的char型數組。

// An 8-bit character    8位字符
char c = 'A';

// An array of 99 8-bit characters and an 8-bit terminating zero.
// 由99個8位字符和1個8位零結束符組成的數組
char szBuffer[100] = "A String";

Microsoft's C/C++ compiler defines a built-in data type, wchar_t, which represents a 16-bit Unicode (UTF-16) character. Because earlier versions of Microsoft's compiler did not offer this built-in data type, the compiler defines this data type only when the /Zc:wchar_t compiler switch is specified. By default, when you create a C++ project in Microsoft Visual Studio, this compiler switch is specified. We recommend that you always specify this compiler switch, as it is better to work with Unicode characters by way of the built-in primitive type understood intrinsically by the compiler.

微軟的C/C++編譯器定義了內建數據類型wchar_t,用來表示16位的Unicode(UTF-16)字符。微軟早期版本的編譯器並未提供這種類型,因此在指定/Zc:wchar_t編譯開關時,編譯器纔對其作定義。在Visual Studio中新建一個C/C++工程時此編譯開關是默認打開的。建議總是打開此編譯開關,通過編譯器本身能理解的內建類型,能夠更好地使用Unicode字符。

Note: Prior to the built-in compiler support, a C header file defined a wchar_t data type as follows:

注意: 內建在編譯器中支持之前,一個C的頭文件定義了wchar_t類型,如下:

typedef unsigned short wchar_t;
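One practical consequence of the built-in type can be sketched in standard C++: when wchar_t is a distinct type rather than a typedef for unsigned short, overload resolution can tell the two apart. The function names Describe and CheckOverloads are hypothetical and exist only for this illustration. With the typedef shown above, the two Describe overloads would be the same function defined twice, and the static_assert would fail.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// With a built-in wchar_t these are two different overloads.
// If wchar_t were only a typedef for unsigned short, the second
// definition would redefine the first and the code would not compile.
inline std::string Describe(unsigned short) { return "integer"; }
inline std::string Describe(wchar_t)        { return "wide character"; }

// Standard C++ treats wchar_t as a distinct built-in type.
static_assert(!std::is_same<wchar_t, unsigned short>::value,
              "wchar_t is expected to be a distinct built-in type");

// Hypothetical self-check showing which overload each argument selects.
inline void CheckOverloads() {
    assert(Describe(L'A') == "wide character");
    assert(Describe(static_cast<unsigned short>(65)) == "integer");
}
```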

Here is how you declare a Unicode character and string:

以下是如何聲明Unicode字符及字符串:

// A 16-bit character    16位字符
wchar_t c = L'A';

// An array up to 99 16-bit characters and a 16-bit terminating zero.
// 由99個16位字符和1個16位零結束符所組成的數組
wchar_t szBuffer[100] = L"A String";

An uppercase L before a literal string informs the compiler that the string should be compiled as a Unicode string. When the compiler places the string in the program's data section, it encodes each character using UTF-16, which in this simple case means that every ASCII character is followed by a zero byte.

在字符串前面放一個大寫的”L”,會告訴編譯器將其編譯爲Unicode字符串。當編譯器將此字符串放入程序數據段時,將使用UTF-16對每個字符進行編碼,這樣的簡單情形下會在每個ASCII字符之間填補零字節。
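The zero bytes mentioned above can be made visible with a minimal, portable sketch. It is an illustration under stated assumptions, not code from the book: it uses char16_t and the u"" prefix, which standard C++ defines as UTF-16, because wchar_t is 16 bits wide with Microsoft's compiler but not on every platform. The helper names ToUtf16LeBytes and CheckAsciiGetsZeroHighBytes are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: writes each UTF-16 code unit as two bytes,
// low byte first (little-endian, the order used on x86/x64 Windows).
inline std::vector<unsigned char> ToUtf16LeBytes(const std::u16string& text) {
    std::vector<unsigned char> bytes;
    for (char16_t unit : text) {
        bytes.push_back(static_cast<unsigned char>(unit & 0xFF));  // low byte
        bytes.push_back(static_cast<unsigned char>(unit >> 8));    // high byte
    }
    return bytes;
}

// For ASCII text, every second byte is zero.
inline void CheckAsciiGetsZeroHighBytes() {
    const std::vector<unsigned char> bytes = ToUtf16LeBytes(u"A String");
    assert(bytes.size() == 16);  // 8 characters, 2 bytes each
    assert(bytes[0] == 'A');
    for (std::size_t i = 1; i < bytes.size(); i += 2) {
        assert(bytes[i] == 0);
    }
}
```

Characters outside the ASCII range would produce non-zero high bytes, which is why the book limits the remark to "this simple case".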

The Windows team at Microsoft wants to define its own data types to isolate itself a little bit from the C language. And so, the Windows header file, WinNT.h, defines the following data types:

微軟Windows開發組想通過定義自己的數據類型來與C語言的類型進行區別。在Windows的WinNT.h頭文件中定義了以下數據類型:

typedef char    CHAR;     // An 8-bit character    8位字符
typedef wchar_t WCHAR;    // A 16-bit character    16位字符

Furthermore, the WinNT.h header file defines a bunch of convenience data types for working with pointers to characters and pointers to strings:

此外,WinNT.h中還定義了一組方便指向字符和字符串的指針類型:

// Pointer to 8-bit character(s)    指向8位字符的指針
typedef CHAR *PCHAR;
typedef CHAR *PSTR;
typedef CONST CHAR *PCSTR;

// Pointer to 16-bit character(s)    指向16位字符的指針
typedef WCHAR *PWCHAR;
typedef WCHAR *PWSTR;
typedef CONST WCHAR *PCWSTR;

Note:  If you take a look at WinNT.h, you'll find the following definition:

注意: 如果查看WinNT.h會發現以下定義:

typedef __nullterminated WCHAR *NWPSTR, *LPWSTR, *PWSTR;

The __nullterminated prefix is a header annotation that describes how types are expected to be used as function parameters and return values. In the Enterprise version of Visual Studio, you can set the Code Analysis option in the project properties. This adds the /analyze switch to the command line of the compiler that detects when your code calls functions in a way that breaks the semantic defined by the annotations. Notice that only Enterprise versions of the compiler support this /analyze switch. To keep the code more readable in this book, the header annotations are removed. You should read the "Header Annotations" documentation on MSDN at http://msdn2.microsoft.com/En-US/library/aa383701.aspx for more details about the header annotations language.

前綴__nullterminated是描述類型作爲函數參數或返回值如何使用的標示註解(header annotation)。Visual Studio的企業版,可以在工程屬性裏設置代碼分析選項,即對編譯器的命令行增加了/analyze開關項,這樣就能夠檢測代碼在調用函數時是否違反了標示註解(header annotation)所定義的語義規則。注意只有企業版才提供對/analyze開關的支持。本書爲保持代碼的易讀性,將標示註解(header annotation)都去掉了。可以通過閱讀MSDN,http://msdn2.microsoft.com/En-US/library/aa383701.aspx的"Header Annotations"文檔,瞭解更多有關注解語言的細節。

In your own source code, it doesn't matter which data type you use, but I'd recommend you try to be consistent to improve maintainability in your code. Personally, as a Windows programmer, I always use the Windows data types because the data types match up with the MSDN documentation, making things easier for everyone reading the code.

在你自己的源代碼中使用哪種數據類型都沒關係,但是應該注意,應該堅持提供代碼的可維護性。就我個人來講,作爲一名Windows程序員,我總是使用Windows的數據類型,因爲它們有對應的MSDN文檔相匹配,使其他人能更容易地閱讀你的代碼。

It is possible to write your source code so that it can be compiled using ANSI or Unicode characters and strings. In the WinNT.h header file, the following types and macros are defined:

可以編寫你的代碼,使之能用ANSI又能用Unicode編碼對字符或字符串進行編譯。在WinNT.h中,定義了以下類型和宏:

#ifdef UNICODE

typedef WCHAR TCHAR, *PTCHAR, *PTSTR;
typedef CONST WCHAR *PCTSTR;
#define __TEXT(quote) L##quote

#else

typedef CHAR TCHAR, *PTCHAR, *PTSTR;
typedef CONST CHAR *PCTSTR;
#define __TEXT(quote) quote

#endif

#define TEXT(quote) __TEXT(quote)

These types and macros (plus a few less commonly used ones that I do not show here) are used to create source code that can be compiled using either ANSI or Unicode characters and strings, for example:

使用這些類型和宏定義(極少使用的未列出),生成代碼時既能用ANSI又能用Unicode來編碼字符和字符串,例如:

// If UNICODE defined, a 16-bit character; else an 8-bit character
// 如果定義了Unicode,編碼爲16位字符;否則爲8位。
TCHAR c = TEXT('A');

// If UNICODE defined, an array of 16-bit characters; else 8-bit characters
//如果定義了Unicode,編碼爲16位字符數組;否則爲8位數組。
TCHAR szBuffer[100] = TEXT("A String");
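The mechanism behind TCHAR and TEXT can be tried without the Windows headers. The sketch below is a stand-in, not the WinNT.h definitions themselves: DEMO_UNICODE, DemoTChar, DEMO_TEXT_IMPL, DEMO_TEXT, DEMO_GREETING and CheckDemoText are hypothetical names that mirror UNICODE, TCHAR, __TEXT and TEXT. It also shows one reason for the two macro levels: the outer macro lets its argument expand before the L prefix is pasted on.

```cpp
#include <cassert>
#include <cwchar>

#define DEMO_UNICODE  // hypothetical stand-in for UNICODE; remove for the 8-bit build

#ifdef DEMO_UNICODE
typedef wchar_t DemoTChar;               // plays the role of TCHAR
#define DEMO_TEXT_IMPL(quote) L##quote   // plays the role of __TEXT
#else
typedef char DemoTChar;
#define DEMO_TEXT_IMPL(quote) quote
#endif

// The extra level of indirection expands a macro argument first,
// so DEMO_TEXT(DEMO_GREETING) becomes L"A String" and not LDEMO_GREETING.
#define DEMO_TEXT(quote) DEMO_TEXT_IMPL(quote)

#define DEMO_GREETING "A String"  // a macro, not a literal

inline void CheckDemoText() {
    DemoTChar szBuffer[100] = DEMO_TEXT(DEMO_GREETING);
#ifdef DEMO_UNICODE
    static_assert(sizeof(szBuffer[0]) == sizeof(wchar_t), "wide build");
    assert(std::wcscmp(szBuffer, L"A String") == 0);
#else
    static_assert(sizeof(szBuffer[0]) == 1, "8-bit build");
#endif
}
```

Removing the DEMO_UNICODE line switches the same source to char without touching the declaration of szBuffer, which is the point of writing code against TCHAR and TEXT.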

 

 

Unicode and ANSI Functions in Windows  Windows的Unicode及ANSI函數

Since Windows NT, all Windows versions are built from the ground up using Unicode. That is, all the core functions for creating windows, displaying text, performing string manipulations, and so forth require Unicode strings. If you call any Windows function passing it an ANSI string (a string of 1-byte characters), the function first converts the string to Unicode and then passes the Unicode string to the operating system. If you are expecting ANSI strings back from a function, the system converts the Unicode string to an ANSI string before returning to your application. All these conversions occur invisibly to you. Of course, there is time and memory overhead involved for the system to carry out all these string conversions.

從Windows NT起的所有Windows版本都建立在Unicode背景之上,所有創建窗口、顯示字符串、執行字符串操作等核心函數,都要求使用Unicode字符串。如果你在調用任何Windows函數時傳給它一個ANSI字符串(由單字節字符組成的字符串),函數會首先將字符串轉換爲Unicode編碼並將Unicode字符串傳給操作系統。如果你希望函數返回ANSI字符串,系統會在返回前將Unicode字符串轉化爲ANSI。當然,系統在進行這些字符串轉換時會涉及到時間和存儲空間的開支。

When Windows exposes a function that takes a string as a parameter, two versions of the same function are usually provided—for example, a CreateWindowEx that accepts Unicode strings and a second CreateWindowEx that accepts ANSI strings. This is true, but the two functions are actually prototyped as follows:

如果Windows所暴露得函數接口含有字符串作參數的話,會提供一個函數的兩種版本——例如,函數CreateWindowEx的一個版本接受Unicode字符串,而第二種版本接受ANSI字符串。這些是事實,但兩種函數版本的實際原型如下:

HWND WINAPI CreateWindowExW(
   DWORD dwExStyle,
   PCWSTR pClassName,    // A Unicode string
   PCWSTR pWindowName,   // A Unicode string
   DWORD dwStyle,
   int X,
   int Y,
   int nWidth,
   int nHeight,
   HWND hWndParent,
   HMENU hMenu,
   HINSTANCE hInstance,
   PVOID pParam);

HWND WINAPI CreateWindowExA(
   DWORD dwExStyle,
   PCSTR pClassName,     // An ANSI string
   PCSTR pWindowName,    // An ANSI string
   DWORD dwStyle,
   int X,
   int Y,
   int nWidth,
   int nHeight,
   HWND hWndParent,
   HMENU hMenu,
   HINSTANCE hInstance,
   PVOID pParam);

CreateWindowExW is the version that accepts Unicode strings. The uppercase W at the end of the function name stands for wide. Unicode characters are 16 bits wide, so they are frequently referred to as wide characters. The uppercase A at the end of CreateWindowExA indicates that the function accepts ANSI character strings.

CreateWindowExW是接受Unicode字符串的版本,函數名末尾的大寫字母"W”代表“寬(字符)”。Unicode字符是16位寬,因此常被當作寬字符。CreateWindowExA結尾的大寫字母"A”指明函數接受ANSI字符串。

But usually we just include a call to CreateWindowEx in our code and don't directly call either CreateWindowExW or CreateWindowExA. In WinUser.h, CreateWindowEx is actually a macro defined as

但是通常我們在代碼中調用CreateWindowEx,而不是直接調用CreateWindowExW或CreateWindowExA。在WinUser.h中,CreateWindowEx實際上是個宏定義:

#ifdef UNICODE
#define CreateWindowEx CreateWindowExW
#else
#define CreateWindowEx CreateWindowExA
#endif

Whether or not UNICODE is defined when you compile your source code module determines which version of CreateWindowEx is called. When you create a new project with Visual Studio, it defines UNICODE by default. So, by default, any calls you make to CreateWindowEx expand the macro to call CreateWindowExW—the Unicode version of CreateWindowEx.

當你在調用你的代碼模塊時,是否定義了UNICODE宏將決定那個版本的CreateWindowEx被調用。在Visual Studio中新建一個工程時,默認是定義UNICODE的,因此默認情況下調用CreateWindowEx將被宏展開爲調用函數CreateWindowExW——CreateWindowEx的Unicode版本。
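The same A/W selection pattern can be reproduced in a few lines of portable C++. This is a sketch with hypothetical names (DEMO_UNICODE, DemoStringLengthA, DemoStringLengthW, DemoStringLength, CheckDispatch), not anything from the Windows headers: two real functions carry the suffixes, and the unsuffixed name is only a macro that resolves to one of them at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

#define DEMO_UNICODE  // hypothetical stand-in for UNICODE

// Two real functions, distinguished by an A or W suffix.
inline std::size_t DemoStringLengthA(const char* text)    { return std::strlen(text); }
inline std::size_t DemoStringLengthW(const wchar_t* text) { return std::wcslen(text); }

// The unsuffixed name is only a macro that picks one of them.
#ifdef DEMO_UNICODE
#define DemoStringLength DemoStringLengthW
#else
#define DemoStringLength DemoStringLengthA
#endif

inline void CheckDispatch() {
#ifdef DEMO_UNICODE
    assert(DemoStringLength(L"Hello") == 5);  // expands to DemoStringLengthW
#else
    assert(DemoStringLength("Hello") == 5);   // expands to DemoStringLengthA
#endif
}
```

Because the choice is made by the preprocessor, a debugger or linker error message shows the suffixed name, never the unsuffixed one.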

Under Windows Vista, Microsoft's source code for CreateWindowExA is simply a translation layer that allocates memory to convert ANSI strings to Unicode strings; the code then calls CreateWindowExW, passing the converted strings. When CreateWindowExW returns, CreateWindowExA frees its memory buffers and returns the window handle to you. So, for functions that fill buffers with strings, the system must convert from Unicode to non-Unicode equivalents before your application can process the string. Because the system must perform all these conversions, your application requires more memory and runs slower. You can make your application perform more efficiently by developing your application using Unicode from the start. Also, Windows has been known to have some bugs in these translation functions, so avoiding them also eliminates some potential bugs.

Windows Vista中,CreateWindowExA源代碼只是簡單的轉譯層,它分配內存來將ANSI字符串轉換爲Unicode字符串,然後調用CreateWindowExW,並傳遞轉換後字符串。當CreateWindowExW返回時,CreateWindowExA釋放存儲緩衝區並返回窗口句柄。因此,函數向緩衝區填入字符串時,系統必須爲它將Unicode轉換爲等價的非Unicode字符,你的應用程序會需要更多內存並且運行較慢。在開發應用程序開始就使用Unicode編碼可以使程序更有效率。並且,Windows的這些轉換函數已經發現存在一些Bug,所以應避免使用以排除一些潛在的Bug。

If you're creating dynamic-link libraries (DLLs) that other software developers will use, consider using this technique: supply two exported functions in the DLL—an ANSI version and a Unicode version. In the ANSI version, simply allocate memory, perform the necessary string conversions, and call the Unicode version of the function. I'll demonstrate this process later in this chapter in "Exporting ANSI and Unicode DLL Functions" on page 29.

如果你要生成動態連接庫(DLLs)供其它軟件開發者使用,請考慮使用此技術:在DLL中提供兩種輸出函數——一個ANSI版本和一個Unicode版本。在ANSI版本中,簡單地做分配內存和必要字符串轉換操作,並調用此函數的Unicode版本。我將在本章稍後的"Exporting ANSI and Unicode DLL Functions"中示範此過程。
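As a rough preview of that technique, here is a minimal portable sketch; it is not the book's later example. The function names CountSpacesA, CountSpacesW and CheckBothVersionsAgree are hypothetical, and std::mbstowcs stands in for the conversion step so the code builds anywhere. Windows code would normally call MultiByteToWideChar there instead. Returning 0 on a failed conversion is a simplification chosen for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

// Wide version: does the real work.
inline std::size_t CountSpacesW(const wchar_t* text) {
    std::size_t count = 0;
    for (; *text != L'\0'; ++text) {
        if (*text == L' ') ++count;
    }
    return count;
}

// Narrow version: allocate, convert, delegate to the wide version.
inline std::size_t CountSpacesA(const char* text) {
    // A multibyte string never converts to more wide characters than it
    // has bytes, so strlen + 1 leaves room for the terminating zero.
    std::vector<wchar_t> wide(std::strlen(text) + 1);
    const std::size_t converted = std::mbstowcs(wide.data(), text, wide.size());
    if (converted == static_cast<std::size_t>(-1)) {
        return 0;  // invalid multibyte input; simplified error handling
    }
    return CountSpacesW(wide.data());  // the buffer is released on return
}

inline void CheckBothVersionsAgree() {
    assert(CountSpacesA("a b c") == CountSpacesW(L"a b c"));
}
```

Keeping all logic in the wide version means a fix or feature lands in one place, and the narrow entry point stays a thin conversion layer.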

Certain functions in the Windows API, such as WinExec and OpenFile, exist solely for backward compatibility with 16-bit Windows programs that supported only ANSI strings. These methods should be avoided by today's programs. You should replace any calls to WinExec and OpenFile with calls to the CreateProcess and CreateFile functions. Internally, the old functions call the new functions anyway. The big problem with the old functions is that they don't accept Unicode strings and they typically offer fewer features. When you call these functions, you must pass ANSI strings. On Windows Vista, most non-obsolete functions have both Unicode and ANSI versions. However, Microsoft has started to get into the habit of producing some functions offering only Unicode versions—for example, ReadDirectoryChangesW and CreateProcessWithLogonW.

Windows API中的某些函數,比如WinExec和OpenFile,爲向後兼容16位的Windows程序,只提供了支持ANSI字符串一種版本。這些函數應在現今的編程中避免使用。應使用CreateProcess和CreateFile函數來代替WinExec和OpenFile。舊函數內部都調用了新函數。關於舊函數的最大問題是它們不接受Unicode字符串且通常提供較少的特性。當你在調用這些舊函數時,必須傳遞ANSI字符串。在Windows Vista中,大多數未過時的函數都有Unicode和ANSI兩種版本。然而,微軟已經開始只對一些函數提供Unicode單一版本,比如ReadDirectoryChangesW和CreateProcessWithLogonW。

When Microsoft was porting COM from 16-bit Windows to Win32, an executive decision was made that all COM interface methods requiring a string would accept only Unicode strings. This was a great decision because COM is typically used to allow different components to talk to each other and Unicode is the richest way to pass strings around. Using Unicode throughout your application makes interacting with COM easier too.

當微軟將COM從16位Windows移植到Win32上時,做出了所有COM接口方法都只接受Unicode字符串的決定。這是一個極好的決定,因爲COM通常用於使不同組件彼此通話,而Unicode是在各處傳遞字符串最融合 (richest)的方式。應用程序使用Unicode也能更容易地和COM進行交互。

Finally, when the resource compiler compiles all your resources, the output file is a binary representation of the resources. String values in your resources (string tables, dialog box templates, menus, and so on) are always written as Unicode strings. Under Windows Vista, the system performs internal conversions if your application doesn't define the UNICODE macro. For example, if UNICODE is not defined when you compile your source module, a call to LoadString will actually call the LoadStringA function. LoadStringA will then read the Unicode string from your resources and convert the string to ANSI. The ANSI representation of the string will be returned from the function to your application.

最後,當資源編譯器編譯所有資源時,輸出文件是二進制表示的。資源中的字符串值(string tables, dialog box templates, menus等等)總是寫爲Unicode型。在Windows Vista中,如果你的應用程序沒有定義UNICODE宏,系統會執行內部轉換。例如,當編譯程序模塊時如果未定義UNICODE宏,調用函數LoadString實際會調用LoadStringA。LoadStringA會從資源中讀取Unicode字符串並轉換成ANSI。字符串的ANSI表示會從函數返回給應用程序。

 

This article is translated from Windows Via C/C++.
