Understanding code points
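My understanding of code units and code points is:
1. In UTF-16, a code point may consist of one or two code units.
2. In my test program, the character "我" occupies only one code unit. The string "我 " (the character followed by a space) therefore has two code units and two code points, i.e. the number of code points equals the number of code units.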
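To make point 1 concrete, here is a minimal sketch that compares a character inside the Basic Multilingual Plane with one outside it. The class name CodeUnitDemo and its two helper methods are hypothetical names chosen for illustration. U+20000 was picked as a sample CJK ideograph that needs a surrogate pair in UTF-16.

```java
// Sketch: compare UTF-16 code units with Unicode code points.
class CodeUnitDemo {
    // Number of UTF-16 code units (Java chars) in s.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in s.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+6211 is the character 我, which lies inside the BMP.
        String bmp = "\u6211";
        // U+20000 lies outside the BMP; it is built from its code point
        // so the example does not depend on the source file encoding.
        String supplementary = new String(Character.toChars(0x20000));

        // prints "1 code unit(s), 1 code point(s)"
        System.out.println(codeUnits(bmp) + " code unit(s), "
                + codePoints(bmp) + " code point(s)");
        // prints "2 code unit(s), 1 code point(s)"
        System.out.println(codeUnits(supplementary) + " code unit(s), "
                + codePoints(supplementary) + " code point(s)");
    }
}
```

So the two counts only differ when the string contains characters outside the BMP. For "我 " they are equal.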
Below are some notes on Chinese, Korean, and Japanese support in Unicode that I found on the official Unicode website:
Q: I have heard that UTF-8 does not support some Japanese characters. Is this correct?
A: There is a lot of misinformation floating around about the support of Chinese, Japanese and Korean (CJK) characters. The Unicode Standard supports all of the CJK characters from JIS X 0208, JIS X 0212, JIS X 0221, or JIS X 0213, for example, and many more. This is true no matter which encoding form of Unicode is used: UTF-8, UTF-16, or UTF-32.
Unicode supports over 70,000 CJK characters right now, and work is underway to encode further additions. The International Standard ISO/IEC 10646 and the Unicode Standard are completely synchronized in repertoire and content. And that means that Unicode has the same repertoire as GB 18030, since that also is synchronized with ISO 10646 — although with a different ordering and byte format.
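Does this mean that Chinese is supported equally well regardless of the encoding form (UTF-8, UTF-16, or UTF-32)? In other words, do all three encodings support the same number of Chinese characters?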
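One way to probe this question is to encode the same text in all three forms and decode it again. The sketch below does that for "我" followed by U+20000. The class name EncodingFormsDemo and its helpers are hypothetical. It also assumes the running JDK provides a "UTF-32BE" charset, which OpenJDK does, although it is not among the charsets every Java platform is required to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Sketch: the same code points survive a round trip through each encoding form.
class EncodingFormsDemo {
    // Encode s with the given charset, then decode the bytes again.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    // Number of bytes s occupies in the given charset.
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        // U+6211 (我) followed by U+20000, a CJK ideograph outside the BMP.
        String text = "\u6211" + new String(Character.toChars(0x20000));

        // Assumption: the JDK ships a UTF-32BE charset.
        Charset utf32 = Charset.forName("UTF-32BE");
        Charset[] forms = { StandardCharsets.UTF_8, StandardCharsets.UTF_16BE, utf32 };

        for (Charset cs : forms) {
            System.out.println(cs.name() + ": " + byteLength(text, cs)
                    + " bytes, round trip ok = " + roundTrip(text, cs).equals(text));
        }
    }
}
```

The byte counts differ (7, 6, and 8 bytes for this text), but each form decodes back to the same two code points. This is consistent with the FAQ answer: the encoding forms differ in size, not in which characters they can represent.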
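My test program is as follows:
public class test0 {
    public static void main(String[] args) {
        // The character 我 (U+6211) followed by a space.
        String a = "我 ";
        // length() returns the number of UTF-16 code units (Java chars).
        int cuCount = a.length();
        System.out.println("the number of code units required for string \"" + a + "\" in the UTF-16 encoding is " + cuCount);
        // codePointCount() returns the number of Unicode code points in the range.
        int cpCount = a.codePointCount(0, a.length());
        System.out.println("the number of code points is " + cpCount);
        // charAt() returns a single code unit; here it is the trailing space.
        System.out.println("the end of string \"" + a + "\" is " + a.charAt(a.length() - 1));
    }
}
Output:
the number of code units required for string "我 " in the UTF-16 encoding is 2
the number of code points is 2
the end of string "我 " is [space]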
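The test program reads the end of the string with charAt, which returns one code unit. That works here because the last character is a space. As a hedged sketch of what happens otherwise, the example below ends the string with U+20000 and reads the last code point with codePointBefore instead. The class and method names are hypothetical.

```java
// Sketch: read the last code point of a string rather than the last code unit.
class LastCodePointDemo {
    // Last complete code point of a non-empty string.
    static int lastCodePoint(String s) {
        return s.codePointBefore(s.length());
    }

    public static void main(String[] args) {
        // U+6211 (我) followed by U+20000, which needs a surrogate pair.
        String s = "\u6211" + new String(Character.toChars(0x20000));

        // charAt() sees only the second half of the surrogate pair.
        char lastUnit = s.charAt(s.length() - 1);
        // prints "last code unit is a low surrogate: true"
        System.out.println("last code unit is a low surrogate: "
                + Character.isLowSurrogate(lastUnit));

        // prints "last code point: U+20000"
        System.out.println("last code point: U+"
                + Integer.toHexString(lastCodePoint(s)).toUpperCase());
    }
}
```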