Windows via C/C++: Learning While Translating (3) Working with Characters and Strings, Part 2

ANSI and Unicode Character and String Data Types

I'm sure you're aware that the C language uses the char data type to represent an 8-bit ANSI character. By default, when you declare a literal string in your source code, the C compiler turns the string's characters into an array of 8-bit char data types:

// An 8-bit character
char c = 'A';

// An array of 99 8-bit characters and an 8-bit terminating zero.
char szBuffer[100] = "A String";

Microsoft's C/C++ compiler defines a built-in data type, wchar_t, which represents a 16-bit Unicode (UTF-16) character. Because earlier versions of Microsoft's compiler did not offer this built-in data type, the compiler defines this data type only when the /Zc:wchar_t compiler switch is specified. By default, when you create a C++ project in Microsoft Visual Studio, this compiler switch is specified. We recommend that you always specify this compiler switch, as it is better to work with Unicode characters by way of the built-in primitive type understood intrinsically by the compiler.

Note: Prior to the built-in compiler support, a C header file defined a wchar_t data type as follows:

typedef unsigned short wchar_t;

Here is how you declare a Unicode character and string:

// A 16-bit character
wchar_t c = L'A';

// An array of up to 99 16-bit characters and a 16-bit terminating zero.
wchar_t szBuffer[100] = L"A String";

An uppercase L before a literal string informs the compiler that the string should be compiled as a Unicode string. When the compiler places the string in the program's data section, it encodes each character using UTF-16. In this simple case every character is ASCII, so each 16-bit code unit consists of the ASCII byte plus a zero byte.
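The zero-byte layout described above can be checked with a small sketch. It uses the C11 type char16_t as a stand-in for Microsoft's 16-bit wchar_t, because wchar_t is often 32 bits wide on other compilers. The helper name count_zero_bytes is hypothetical, and the sketch assumes char16_t occupies exactly two bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>   /* char16_t (C11) */

/* Counts the zero bytes in the first cch UTF-16 code units of psz.
   Assumption: char16_t is exactly 2 bytes wide on this platform. */
size_t count_zero_bytes(const char16_t *psz, size_t cch)
{
    const unsigned char *pb = (const unsigned char *)psz;
    size_t zeros = 0;
    size_t i;

    assert(psz != NULL);
    for (i = 0; i < cch * sizeof(char16_t); ++i) {
        if (pb[i] == 0)
            ++zeros;
    }
    return zeros;
}
```

Calling count_zero_bytes(u"A String", 8) inspects 16 bytes. Each of the eight ASCII characters contributes one nonzero byte and one zero byte, whatever the byte order of the machine.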

The Windows team at Microsoft wants to define its own data types to isolate itself a little bit from the C language. And so, the Windows header file, WinNT.h, defines the following data types:

typedef char    CHAR;     // An 8-bit character
typedef wchar_t WCHAR;    // A 16-bit character

Furthermore, the WinNT.h header file defines a bunch of convenience data types for working with pointers to characters and pointers to strings:

// Pointer to 8-bit character(s)
typedef CHAR *PCHAR;
typedef CHAR *PSTR;
typedef CONST CHAR *PCSTR;

// Pointer to 16-bit character(s)
typedef WCHAR *PWCHAR;
typedef WCHAR *PWSTR;
typedef CONST WCHAR *PCWSTR;

Note:  If you take a look at WinNT.h, you'll find the following definition:

typedef __nullterminated WCHAR *NWPSTR, *LPWSTR, *PWSTR;

The __nullterminated prefix is a header annotation that describes how types are expected to be used as function parameters and return values. In the Enterprise version of Visual Studio, you can set the Code Analysis option in the project properties. This adds the /analyze switch to the compiler's command line, which detects when your code calls functions in a way that breaks the semantics defined by the annotations. Notice that only Enterprise versions of the compiler support this /analyze switch. To keep the code more readable in this book, the header annotations are removed. You should read the "Header Annotations" documentation on MSDN at http://msdn2.microsoft.com/En-US/library/aa383701.aspx for more details about the header annotations language.

In your own source code, it doesn't matter which data type you use, but I'd recommend you try to be consistent to improve maintainability in your code. Personally, as a Windows programmer, I always use the Windows data types because the data types match up with the MSDN documentation, making things easier for everyone reading the code.

It is possible to write your source code so that it can be compiled using ANSI or Unicode characters and strings. In the WinNT.h header file, the following types and macros are defined:

#ifdef UNICODE

typedef WCHAR TCHAR, *PTCHAR, *PTSTR;
typedef CONST WCHAR *PCTSTR;
#define __TEXT(quote) L##quote      // r_winnt

#else

typedef CHAR TCHAR, *PTCHAR, *PTSTR;
typedef CONST CHAR *PCTSTR;
#define __TEXT(quote) quote

#endif

#define TEXT(quote) __TEXT(quote)

These types and macros (plus a few less commonly used ones that I do not show here) are used to create source code that can be compiled using either ANSI or Unicode characters and strings, for example:

// If UNICODE defined, a 16-bit character; else an 8-bit character
TCHAR c = TEXT('A');

// If UNICODE defined, an array of 16-bit characters; else 8-bit characters
TCHAR szBuffer[100] = TEXT("A String");
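To see how these definitions work without including the Windows headers, here is a portable sketch that re-creates them under hypothetical MY_-prefixed names (MY_UNICODE, MY_TCHAR, MY_TEXT, and APP_NAME are illustrative and not part of WinNT.h). It also suggests why TEXT is defined through a second macro: the extra level lets an argument that is itself a macro expand before the L prefix is pasted on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the WinNT.h definitions. */
#ifdef MY_UNICODE
typedef wchar_t MY_TCHAR;
#define MY_TEXT_IMPL(quote) L##quote
#else
typedef char MY_TCHAR;
#define MY_TEXT_IMPL(quote) quote
#endif

/* Two-level macro: the argument is fully expanded here, then pasted in
   MY_TEXT_IMPL. With one level, L##APP_NAME would become LAPP_NAME. */
#define MY_TEXT(quote) MY_TEXT_IMPL(quote)

#define APP_NAME "A String"   /* a macro that expands to a string literal */

static const MY_TCHAR g_szName[] = MY_TEXT(APP_NAME);

/* Returns the number of characters (not bytes) in g_szName,
   excluding the terminating zero. */
size_t name_length(void)
{
    return sizeof(g_szName) / sizeof(g_szName[0]) - 1;
}
```

The character count from name_length is the same in both configurations. Only the element size changes: compiling with -DMY_UNICODE makes each element a wchar_t instead of a char.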

 

 

Unicode and ANSI Functions in Windows

Since Windows NT, all Windows versions are built from the ground up using Unicode. That is, all the core functions for creating windows, displaying text, performing string manipulations, and so forth require Unicode strings. If you call any Windows function passing it an ANSI string (a string of 1-byte characters), the function first converts the string to Unicode and then passes the Unicode string to the operating system. If you are expecting ANSI strings back from a function, the system converts the Unicode string to an ANSI string before returning to your application. All these conversions occur invisibly to you. Of course, there is time and memory overhead involved for the system to carry out all these string conversions.

When Windows exposes a function that takes a string as a parameter, two versions of the same function are usually provided—for example, a CreateWindowEx that accepts Unicode strings and a second CreateWindowEx that accepts ANSI strings. This is true, but the two functions are actually prototyped as follows:

HWND WINAPI CreateWindowExW(
   DWORD dwExStyle,
   PCWSTR pClassName,    // A Unicode string
   PCWSTR pWindowName,   // A Unicode string
   DWORD dwStyle,
   int X,
   int Y,
   int nWidth,
   int nHeight,
   HWND hWndParent,
   HMENU hMenu,
   HINSTANCE hInstance,
   PVOID pParam);

HWND WINAPI CreateWindowExA(
   DWORD dwExStyle,
   PCSTR pClassName,     // An ANSI string
   PCSTR pWindowName,    // An ANSI string
   DWORD dwStyle,
   int X,
   int Y,
   int nWidth,
   int nHeight,
   HWND hWndParent,
   HMENU hMenu,
   HINSTANCE hInstance,
   PVOID pParam);

CreateWindowExW is the version that accepts Unicode strings. The uppercase W at the end of the function name stands for wide. Unicode characters are 16 bits wide, so they are frequently referred to as wide characters. The uppercase A at the end of CreateWindowExA indicates that the function accepts ANSI character strings.

But usually we just include a call to CreateWindowEx in our code and don't directly call either CreateWindowExW or CreateWindowExA. In WinUser.h, CreateWindowEx is actually a macro defined as follows:

#ifdef UNICODE
#define CreateWindowEx CreateWindowExW
#else
#define CreateWindowEx CreateWindowExA
#endif

Whether or not UNICODE is defined when you compile your source code module determines which version of CreateWindowEx is called. When you create a new project with Visual Studio, it defines UNICODE by default. So, by default, any calls you make to CreateWindowEx expand the macro to call CreateWindowExW—the Unicode version of CreateWindowEx.
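The same name-selection trick can be tried on a small scale. The sketch below is not Windows code: TitleLengthA, TitleLengthW, MY_UNICODE, and MY_T are hypothetical names that mimic the A/W convention and the UNICODE switch, so the example compiles without the Windows headers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical function pair following the A/W naming convention. */
size_t TitleLengthA(const char *pszTitle)
{
    assert(pszTitle != NULL);
    return strlen(pszTitle);
}

size_t TitleLengthW(const wchar_t *pszTitle)
{
    assert(pszTitle != NULL);
    return wcslen(pszTitle);
}

/* The generic name is only a macro; the preprocessor picks the version. */
#ifdef MY_UNICODE
#define TitleLength TitleLengthW
#define MY_T(s) L##s
#else
#define TitleLength TitleLengthA
#define MY_T(s) s
#endif

/* Calls whichever version the macro selected at compile time. */
size_t demo_title_length(void)
{
    return TitleLength(MY_T("A String"));
}
```

Because the choice is made by the preprocessor, the literal passed to the generic name must be switched in the same way, which is the job TEXT does in real Windows code.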

Under Windows Vista, Microsoft's source code for CreateWindowExA is simply a translation layer that allocates memory to convert ANSI strings to Unicode strings; the code then calls CreateWindowExW, passing the converted strings. When CreateWindowExW returns, CreateWindowExA frees its memory buffers and returns the window handle to you. So, for functions that fill buffers with strings, the system must convert from Unicode to non-Unicode equivalents before your application can process the string. Because the system must perform all these conversions, your application requires more memory and runs slower. You can make your application perform more efficiently by developing your application using Unicode from the start. Also, Windows has been known to have some bugs in these translation functions, so avoiding them also eliminates some potential bugs.
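The translation-layer pattern described above can be sketched in portable C. This is not Microsoft's source: CountSpacesW and CountSpacesA are hypothetical functions, and the standard mbstowcs call stands in for the conversion, which on Windows would typically be done with MultiByteToWideChar.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical "real" implementation: works only on wide strings. */
size_t CountSpacesW(const wchar_t *psz)
{
    size_t n = 0;

    assert(psz != NULL);
    for (; *psz != L'\0'; ++psz) {
        if (*psz == L' ')
            ++n;
    }
    return n;
}

/* Narrow-string wrapper: allocate, convert, forward, free.
   Returns (size_t)-1 if allocation or conversion fails. */
size_t CountSpacesA(const char *psz)
{
    size_t cch, result;
    wchar_t *pwsz;

    assert(psz != NULL);
    cch = strlen(psz) + 1;   /* upper bound on wide chars, with terminator */
    pwsz = malloc(cch * sizeof(wchar_t));
    if (pwsz == NULL)
        return (size_t)-1;

    if (mbstowcs(pwsz, psz, cch) == (size_t)-1) {
        free(pwsz);
        return (size_t)-1;
    }
    pwsz[cch - 1] = L'\0';   /* guarantee termination */

    result = CountSpacesW(pwsz);
    free(pwsz);
    return result;
}
```

Every call to the narrow version pays for an allocation and a conversion that the wide version avoids, which is the overhead the paragraph above refers to.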

If you're creating dynamic-link libraries (DLLs) that other software developers will use, consider using this technique: supply two exported functions in the DLL, an ANSI version and a Unicode version. In the ANSI version, simply allocate memory, perform the necessary string conversions, and call the Unicode version of the function. I'll demonstrate this process later in this chapter in "Exporting ANSI and Unicode DLL Functions".

Certain functions in the Windows API, such as WinExec and OpenFile, exist solely for backward compatibility with 16-bit Windows programs that supported only ANSI strings. These functions should be avoided by today's programs. You should replace any calls to WinExec and OpenFile with calls to the CreateProcess and CreateFile functions. Internally, the old functions call the new functions anyway. The big problem with the old functions is that they don't accept Unicode strings and they typically offer fewer features. When you call these functions, you must pass ANSI strings. On Windows Vista, most non-obsolete functions have both Unicode and ANSI versions. However, Microsoft has started to get into the habit of producing some functions offering only Unicode versions, for example, ReadDirectoryChangesW and CreateProcessWithLogonW.

When Microsoft was porting COM from 16-bit Windows to Win32, an executive decision was made that all COM interface methods requiring a string would accept only Unicode strings. This was a great decision because COM is typically used to allow different components to talk to each other and Unicode is the richest way to pass strings around. Using Unicode throughout your application makes interacting with COM easier too.

Finally, when the resource compiler compiles all your resources, the output file is a binary representation of the resources. String values in your resources (string tables, dialog box templates, menus, and so on) are always written as Unicode strings. Under Windows Vista, the system performs internal conversions if your application doesn't define the UNICODE macro. For example, if UNICODE is not defined when you compile your source module, a call to LoadString will actually call the LoadStringA function. LoadStringA will then read the Unicode string from your resources and convert the string to ANSI. The ANSI representation of the string will be returned from the function to your application.
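The reverse direction, where a function fills a caller's buffer, can be sketched the same way. MyLoadStringW, MyLoadStringA, and the in-memory string table are hypothetical stand-ins for LoadString and real resources, and the standard wcstombs call stands in for the conversion that Windows performs internally.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical resource table, stored as wide strings only. */
static const wchar_t *const g_rgpszStrings[] = { L"A String", L"Another String" };
#define STRING_COUNT (sizeof(g_rgpszStrings) / sizeof(g_rgpszStrings[0]))

/* Wide version: copies string `id` into pszBuf (cchBuf wide chars),
   truncating if needed. Returns characters copied, or 0 on failure. */
int MyLoadStringW(unsigned id, wchar_t *pszBuf, int cchBuf)
{
    size_t cch;

    if (id >= STRING_COUNT || pszBuf == NULL || cchBuf <= 0)
        return 0;
    cch = wcslen(g_rgpszStrings[id]);
    if (cch >= (size_t)cchBuf)
        cch = (size_t)cchBuf - 1;
    wmemcpy(pszBuf, g_rgpszStrings[id], cch);
    pszBuf[cch] = L'\0';
    return (int)cch;
}

/* Narrow version: fetch the wide string into a temporary buffer,
   then convert it into the caller's char buffer. */
int MyLoadStringA(unsigned id, char *pszBuf, int cchBuf)
{
    wchar_t *pwszTemp;
    size_t cb = 0;

    if (pszBuf == NULL || cchBuf <= 0)
        return 0;
    pwszTemp = malloc((size_t)cchBuf * sizeof(wchar_t));
    if (pwszTemp == NULL)
        return 0;

    if (MyLoadStringW(id, pwszTemp, cchBuf) > 0)
        cb = wcstombs(pszBuf, pwszTemp, (size_t)cchBuf - 1);
    free(pwszTemp);

    if (cb == (size_t)-1)    /* conversion failed */
        cb = 0;
    pszBuf[cb] = '\0';       /* wcstombs may not terminate on truncation */
    return (int)cb;
}
```

As with the earlier wrapper, the narrow version needs a temporary buffer and a second pass over the string before the caller sees any data.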

This article is translated from Windows via C/C++.
