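Preface

During development we may run into problems converting between character encodings in Java. Below we look at how to convert between encodings such as UTF-8 and GBK.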

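Practice

When converting between encodings, we use ISO-8859-1 to receive and temporarily hold the data, and then convert it to the encoding we actually need.

Why use ISO-8859-1 as the intermediate storage encoding?

Let's verify it with a program: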

Holding the bytes as ISO-8859-1:

import java.io.UnsupportedEncodingException;

public class EncodingTest {
    public static void test(String str1, String encode) throws UnsupportedEncodingException {
        System.out.println("String: " + str1);
        // Encode str1 into bytes using the original encoding
        byte[] byteArray1 = str1.getBytes(encode);
        System.out.println(byteArray1.length);
        // Decode those bytes as an ISO-8859-1 string
        String str2 = new String(byteArray1, "ISO-8859-1");
        System.out.println("As ISO-8859-1: " + str2);
        // Encode the ISO-8859-1 string back into a byte array
        byte[] byteArray2 = str2.getBytes("ISO-8859-1");
        System.out.println(byteArray2.length);
        // Decode the bytes again with the original encoding
        String str3 = new String(byteArray2, encode);
        System.out.println("String: " + str3);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String str1 = "你好";
        String str2 = "你好呀";
        test(str1, "UTF-8");
        test(str2, "UTF-8");
    }
}

Output:

String: 你好
6
As ISO-8859-1: ä½ å¥½
6
String: 你好
String: 你好呀
9
As ISO-8859-1: ä½ å¥½å‘€
9
String: 你好呀

Holding the bytes as GBK:

import java.io.UnsupportedEncodingException;

public class EncodingTest {
    public static void test(String str1, String encode) throws UnsupportedEncodingException {
        System.out.println("String: " + str1);
        // Encode str1 into bytes using the original encoding
        byte[] byteArray1 = str1.getBytes(encode);
        System.out.println(byteArray1.length);
        // Decode those bytes as a GBK string
        String str2 = new String(byteArray1, "GBK");
        System.out.println("As GBK: " + str2);
        // Encode the GBK string back into a byte array
        byte[] byteArray2 = str2.getBytes("GBK");
        System.out.println(byteArray2.length);
        // Decode the bytes again with the original encoding
        String str3 = new String(byteArray2, encode);
        System.out.println("String: " + str3);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String str1 = "你好";
        String str2 = "你好呀";
        test(str1, "UTF-8");
        test(str2, "UTF-8");
    }
}

Output:

String: 你好
6
As GBK: 浣犲ソ
6
String: 你好
String: 你好呀
9
As GBK: 浣犲ソ鍛�
9
String: 你好�?

As you can see, when a GBK string is used to hold UTF-8 encoded bytes, the Chinese text can come back garbled.
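To check this programmatically rather than by eye, we can compare the bytes before and after the round trip. The following is only a minimal sketch: the class and method names (`RoundTripCheck`, `survivesRoundTrip`) are made up for illustration, the test string 你好呀 is written with Unicode escapes so the source file's own encoding does not matter, and it assumes the JVM ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripCheck {
    // Returns true if decoding the bytes with the intermediate charset and
    // encoding the resulting string again yields exactly the same bytes.
    static boolean survivesRoundTrip(byte[] original, Charset intermediate) {
        String stored = new String(original, intermediate);
        byte[] restored = stored.getBytes(intermediate);
        return Arrays.equals(original, restored);
    }

    public static void main(String[] args) {
        // "\u4f60\u597d\u5440" is the three-character test string used in this post
        byte[] utf8 = "\u4f60\u597d\u5440".getBytes(StandardCharsets.UTF_8);
        System.out.println("ISO-8859-1 lossless: "
                + survivesRoundTrip(utf8, StandardCharsets.ISO_8859_1));
        System.out.println("GBK lossless: "
                + survivesRoundTrip(utf8, Charset.forName("GBK")));
    }
}
```

Run as-is, this should report that ISO-8859-1 preserves the nine bytes while GBK does not, matching the results above.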

Why does this happen?

Analysis

We add a method that prints a byte array in hexadecimal:

public static void printHex(byte[] byteArray) {
  StringBuilder sb = new StringBuilder();
  for (byte b : byteArray) {
    // High nibble first, then low nibble, giving two hex digits per byte
    sb.append(Integer.toHexString((b >> 4) & 0xF));
    sb.append(Integer.toHexString(b & 0xF));
    sb.append(" ");
  }
  System.out.println(sb.toString());
}

Printing the byte arrays with this method instead of their lengths, the two programs above give the following results.

ISO-8859-1:

String: 你好
e4 bd a0 e5 a5 bd 
As ISO-8859-1: ä½ å¥½
e4 bd a0 e5 a5 bd 
String: 你好
String: 你好呀
e4 bd a0 e5 a5 bd e5 91 80 
As ISO-8859-1: ä½ å¥½å‘€
e4 bd a0 e5 a5 bd e5 91 80 
String: 你好呀

GBK:

String: 你好
e4 bd a0 e5 a5 bd 
As GBK: 浣犲ソ
e4 bd a0 e5 a5 bd 
String: 你好
String: 你好呀
e4 bd a0 e5 a5 bd e5 91 80 
As GBK: 浣犲ソ鍛�
e4 bd a0 e5 a5 bd e5 91 3f 
String: 你好�?

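As you can see, when the UTF-8 bytes are held as a GBK string and then converted back, the final 80 has turned into 3f. Why?

Let's analyze the three characters "你好呀". Their UTF-8 byte stream is:

[e4 bd a0] [e5 a5 bd] [e5 91 80]

Here the bytes are grouped three at a time, which is how UTF-8 encodes these characters. When the same bytes are processed as GBK, which is a double-byte encoding, they are grouped in pairs instead:

[e4 bd] [a0 e5] [a5 bd] [e5 91] [80 ?]

The last group is one byte short. What happens then? The decoder treats the leftover 0x80 as an unknown character (the � in the output), and when the string is encoded back to GBK that character is replaced by a half-width ASCII "?" (0x3f), giving:

[e4 bd] [a0 e5] [a5 bd] [e5 91] [3f]

The data has been corrupted.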
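The substitution happens silently because `new String(bytes, charset)` replaces input it cannot decode. If we want to be told about it, `java.nio.charset.CharsetDecoder` can be configured to report errors instead. The sketch below assumes that approach; `StrictDecode` and `isWellFormed` are hypothetical names, and GBK is assumed to be available in the JVM.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecode {
    // Returns true only if every byte can be decoded with the given charset.
    // REPORT makes the decoder throw instead of inserting a replacement character.
    static boolean isWellFormed(byte[] bytes, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        // Two and three characters of the test string, encoded as UTF-8
        byte[] sixBytes = "\u4f60\u597d".getBytes(StandardCharsets.UTF_8);
        byte[] nineBytes = "\u4f60\u597d\u5440".getBytes(StandardCharsets.UTF_8);
        System.out.println("6 UTF-8 bytes decodable as GBK: " + isWellFormed(sixBytes, gbk));
        System.out.println("9 UTF-8 bytes decodable as GBK: " + isWellFormed(nineBytes, gbk));
    }
}
```

Note that the six-byte case passes only because those bytes happen to pair up into valid GBK characters; a strict decoder catches the leftover byte in the nine-byte case, but it cannot tell us that the six-byte text was never GBK in the first place.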

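Why does ISO-8859-1 have no such problem?

Because ISO-8859-1 is a single-byte encoding, its grouping is:

[e4] [bd] [a0] [e5] [a5] [bd] [e5] [91] [80]

Each byte maps to exactly one character and back again, so nothing is altered along the way and the data comes back unchanged.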
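We can convince ourselves of this by pushing every possible byte value through ISO-8859-1 and checking that nothing changes. This is just an illustrative sketch; `Latin1RoundTrip` and its methods are names invented here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1RoundTrip {
    // Builds an array containing every possible byte value, 0x00 through 0xFF.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    // True if the data decodes to one char per byte and encodes back unchanged.
    static boolean roundTripsThroughLatin1(byte[] data) {
        String asText = new String(data, StandardCharsets.ISO_8859_1);
        byte[] back = asText.getBytes(StandardCharsets.ISO_8859_1);
        return asText.length() == data.length && Arrays.equals(data, back);
    }

    public static void main(String[] args) {
        System.out.println(roundTripsThroughLatin1(allByteValues()));
    }
}
```

Since all 256 values survive, any byte sequence at all, whatever encoding it was produced in, can be parked in an ISO-8859-1 string without loss.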

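Question

You may ask: if the three characters "你好嗎" are first converted from UTF-8 to ISO-8859-1 and then from ISO-8859-1 to GBK, the result is garbled as well. Isn't the first snippet below essentially the same as the second?

// Variant 1: via ISO-8859-1
String isoFont = new String(chinese.getBytes("UTF-8"),"ISO-8859-1");
String gbkFont = new String(isoFont.getBytes("ISO-8859-1"),"GBK");

// Variant 2: directly
String gbkFont = new String(chinese.getBytes("UTF-8"),"GBK");

The two are indeed the same in nature.

Doesn't that contradict what was said above?

It does not. In the code above, the first step produces the string's bytes in UTF-8 and the second step decodes those bytes as GBK, so the result is bound to be garbled. In effect you are taking UTF-8 bytes and reading them as if they were a GBK string, and writing code this way is simply a mistake. ISO-8859-1 only carries the bytes through unchanged; it does not turn UTF-8 bytes into GBK bytes.

Consider the following code: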

public static void test2() throws UnsupportedEncodingException {
    String chinese = "你好呀";
    // GBK test: encode as GBK, hold as ISO-8859-1, decode as GBK again
    String gbkChinese = new String(chinese.getBytes("GBK"), "ISO-8859-1");
    System.out.println(gbkChinese);
    printHex(gbkChinese.getBytes("ISO-8859-1"));
    String gbkTest = new String(gbkChinese.getBytes("ISO-8859-1"), "GBK");
    System.out.println(gbkTest);

    // UTF-8 test: encode as UTF-8, hold as ISO-8859-1, decode as UTF-8 again
    String utf8Chinese = new String(chinese.getBytes("UTF-8"), "ISO-8859-1");
    System.out.println(utf8Chinese);
    printHex(utf8Chinese.getBytes("ISO-8859-1"));
    String utfTest = new String(utf8Chinese.getBytes("ISO-8859-1"), "UTF-8");
    System.out.println(utfTest);
}

Output:

ÄãºÃÑ½
c4 e3 ba c3 d1 bd 
你好呀
ä½ å¥½å‘€
e4 bd a0 e5 a5 bd e5 91 80 
你好呀

As you can see:

GBK grouping: [c4 e3] -> 你 [ba c3] -> 好 [d1 bd] -> 呀

UTF-8 grouping: [e4 bd a0] -> 你 [e5 a5 bd] -> 好 [e5 91 80] -> 呀

The string "你好呀" produces different byte streams under GBK and under UTF-8.
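Because the byte streams differ, turning GBK data into UTF-8 data means decoding the bytes with the charset they were really written in and then encoding the resulting string with the target charset. A minimal in-memory sketch follows; `Transcode`, `transcode` and `toHex` are hypothetical helper names, and the input is the GBK byte sequence shown above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Transcode {
    // Re-encodes bytes from one charset to another by going through a String.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String text = new String(input, from); // decode with the real source charset
        return text.getBytes(to);              // encode with the target charset
    }

    // Formats bytes as space-separated two-digit hex values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // GBK bytes of the three-character test string
        byte[] gbk = {(byte) 0xc4, (byte) 0xe3, (byte) 0xba, (byte) 0xc3, (byte) 0xd1, (byte) 0xbd};
        byte[] utf8 = transcode(gbk, Charset.forName("GBK"), StandardCharsets.UTF_8);
        System.out.println(toHex(utf8)); // prints: e4 bd a0 e5 a5 bd e5 91 80
    }
}
```

The six GBK bytes become the nine UTF-8 bytes from the grouping above; no intermediate ISO-8859-1 step is involved, because the data is never misread in the first place.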

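Conclusion

So how do we correctly convert data from one encoding to another?

Note: conversion here means, for example, that a GBK-encoded file containing the string "你好呀" still reads "你好呀" after its content is written to a UTF-8 encoded file.

We create a GBK-encoded file containing the three characters 你好呀, and write those three characters to another file in UTF-8.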

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class Test2 {
    public static void main(String[] args) throws Exception {
        // Read the file as GBK, then write the same text out as UTF-8
        String line = readInFile("/Users/zhangwentong/junrongdai/gbk.txt", "GBK");
        System.out.println(line);
        writeInFile("/Users/zhangwentong/junrongdai/utf8.txt", line, "UTF-8");
    }

    // Reads the whole file, decoding its bytes with the given charset.
    // Line terminators are not preserved.
    public static String readInFile(String fileName, String charset) {
        StringBuilder content = new StringBuilder();
        // try-with-resources closes the reader and the underlying streams
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(fileName), charset))) {
            String rline;
            while ((rline = br.readLine()) != null) {
                content.append(rline);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return content.toString();
    }

    // Writes the content to the file, encoding it with the given charset.
    public static void writeInFile(String fileName, String content, String charset) {
        try (FileOutputStream fos = new FileOutputStream(fileName)) {
            fos.write(content.getBytes(charset));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

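You can run the code above to confirm that the GBK text is converted into UTF-8 text. The reverse, writing the content of a UTF-8 file out as GBK, works the same way.

So when reading and writing text, specify the text's encoding explicitly and no garbling will occur. Otherwise the read or write falls back to the JVM's default charset, which depends on the platform and the Java version and may not match the actual encoding of the data.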
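The same idea can be written with `java.nio.file.Files`, which takes the charset as an explicit argument on both the reading and the writing side. The sketch below is only one possible way to do it: `FileTranscode` is a hypothetical class, and it works on temporary files so that it does not depend on any particular path. It also prints `Charset.defaultCharset()`, the charset that APIs fall back to when none is specified.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class FileTranscode {
    // Copies text from source to target, decoding with `from` and encoding with `to`.
    static void transcode(Path source, Charset from, Path target, Charset to) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(source, from);
             BufferedWriter writer = Files.newBufferedWriter(target, to)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // The fallback used when no charset is passed explicitly
        System.out.println("default charset: " + Charset.defaultCharset());

        Charset gbk = Charset.forName("GBK");
        Path gbkFile = Files.createTempFile("gbk", ".txt");
        Path utf8File = Files.createTempFile("utf8", ".txt");
        // Write the three-character test string as GBK, then convert the file
        Files.write(gbkFile, "\u4f60\u597d\u5440".getBytes(gbk));
        transcode(gbkFile, gbk, utf8File, StandardCharsets.UTF_8);
        System.out.println(Files.readAllBytes(utf8File).length + " bytes written as UTF-8");
    }
}
```

Copying through a character buffer rather than line by line keeps the original line terminators intact.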

Closing

You are welcome to visit my blog:

https://www.sakuratears.top

My GitHub:

https://github.com/javazwt
