MySQL character encoding problems: from latin1 to utf8

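Test goal:

Read data from MySQL in Java and print it.

Test environment 1

a. The environment on one node of the Hadoop cluster is utf8, and the Java source code is also utf8-encoded.

b. The MySQL server to read from, its database, and its tables all use latin1.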



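Run mysql -u* -p* -A -h to connect to the MySQL server.

(1) Basic ways to check the MySQL encoding

First confirm inside MySQL which encoding the original data uses. The three steps below confirm that it is latin1.

mysql> use db;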

mysql> show variables like 'character%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | gbk                        |
| character_set_connection | gbk                        |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | gbk                        |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)


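// SET NAMES changes three variables (highlighted in red in the original post): character_set_client, character_set_connection, and character_set_results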

mysql> set names 'latin1';
Query OK, 0 rows affected (0.00 sec)

mysql> show variables like 'character%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

8 rows in set (0.00 sec)


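SET NAMES 'utf8' is equivalent to the following three statements:

SET character_set_client = utf8: the character set of the data the client sends to the MySQL server
SET character_set_results = utf8: the character set the server uses when returning query results
SET character_set_connection = utf8: the MySQL server converts data received from the client from character_set_client to character_set_connection

MySQL's data flow and encoding conversions

Receiving: client --> connection --> database

Returning: database --> connection --> results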
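
A minimal Java sketch of what the return path can do to the data, assuming the latin1 column actually holds UTF-8 bytes (the situation the references further down describe). Java's ISO_8859_1 stands in for MySQL's latin1 here because it maps every byte to exactly one character; that substitution, and the class and method names, are illustrative assumptions rather than part of the original test.

```java
import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {
    // Simulates the server converting a latin1 column for the client:
    // every stored byte is treated as one latin1 character.
    public static String asSeenByClient(byte[] storedBytes) {
        // ISO_8859_1 is a stand-in for MySQL latin1 (assumption).
        return new String(storedBytes, StandardCharsets.ISO_8859_1);
    }

    // Undoes the conversion: map each character back to a single byte,
    // then decode the byte sequence as UTF-8.
    public static String recover(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "gm" followed by two Chinese characters, written with escapes so the
        // result does not depend on the encoding of this source file.
        String original = "gm\u547d\u4ee4";
        byte[] stored = original.getBytes(StandardCharsets.UTF_8);
        String garbled = asSeenByClient(stored);
        System.out.println("chars seen by client: " + garbled.length());
        System.out.println("recovered equals original: " + recover(garbled).equals(original));
    }
}
```

The garbled string has one character per stored byte, which is why Chinese text read this way looks like a run of accented Latin letters until the bytes are reinterpreted.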



mysql> SELECT LOWER(consumption_name),consumption FROM dimen_table;
+-------------------------+-------------+
| LOWER(consumption_name) | consumption |
+-------------------------+-------------+
| gm命令                |       -1019 |
| 神器解鎖            |       -1018 |
| 購買經驗藥水      |       -1017 |
| 購買時裝            |       -1016 |
| 運營活動(領物品) |       -1015 |
| 神器精煉(棍)       |       -1014 |
| 神器精煉(拳)       |       -1013 |
| 商店購買            |       -1012 |

(2) Testing from Java code

Test 1: Java connection code

jdbc:mysql://106.2.67.10/sdc_hdfs?useUnicode=true&characterEncoding=UTF-8

Statement statement = con.createStatement();

statement.execute("set names 'utf8'"); // has much the same effect as the characterEncoding=UTF-8 URL parameter

String sql = "SELECT LOWER(consumption_name),consumption FROM dimen_table";

ResultSet rs = statement.executeQuery(sql);

while (rs.next()) {

        // latin1 can be decoded as ISO-8859-1 or Cp1252; in my tests cp1252 turned out to be the right one
        System.out.println(new String(rs.getBytes(1), "cp1252")); // column 1 is LOWER(consumption_name)

}

// List the character sets MySQL supports, with their descriptions

mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |      2 |
| dec8     | DEC West European           | dec8_swedish_ci     |      1 |
| cp850    | DOS West European           | cp850_general_ci    |      1 |
| hp8      | HP West European            | hp8_english_ci      |      1 |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |      1 |
| latin1   | cp1252 West European        | latin1_swedish_ci   |      1 |
| latin2   | ISO 8859-2 Central European | latin2_general_ci   |      1 |
| swe7     | 7bit Swedish                | swe7_swedish_ci     |      1 |
| ascii    | US ASCII                    | ascii_general_ci    |      1 |
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |      3 |



[Reference]

http://stackoverflow.com/questions/21689665/mysql-latin1-to-utf-8-using-java-hibernate-jpa :

It notes that latin1 should be decoded as cp1252:

MySQL's version of latin1 is an extended version of CP1252: it uses the 5 bytes that CP1252 leaves undefined. Unfortunately the current Connector/J has a "bug" in that it uses the original CP1252 rather than MySQL's own version. Therefore it's impossible to recover strings whose encoding uses one of these 5 bytes. Patching the Connector/J source to fix the bug could solve the problem, but ideally you should migrate the tables to UTF-8.

It also suggests a fix that rewrites the SQL statement: SELECT CONVERT(CONVERT(CONVERT( column_name USING latin1) USING binary) using utf8) FROM...
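
The point about the bytes that CP1252 leaves undefined can be checked without a database. The sketch below is an illustration under assumptions: it uses Java's own windows-1252 and ISO-8859-1 charsets (not Connector/J itself), and the class name is made up. It takes a Chinese character whose UTF-8 encoding contains the byte 0x81, one of the undefined CP1252 positions, and tests whether the bytes survive a decode and re-encode.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Cp1252GapDemo {
    // Decodes bytes with a single-byte charset, then encodes the result back.
    public static byte[] roundTrip(byte[] bytes, Charset singleByte) {
        return new String(bytes, singleByte).getBytes(singleByte);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // U+54C1 encodes to E5 93 81 in UTF-8; 0x81 has no character in CP1252.
        byte[] utf8 = "\u54c1".getBytes(StandardCharsets.UTF_8);

        boolean viaCp1252 = Arrays.equals(utf8, roundTrip(utf8, cp1252));
        boolean viaLatin1 = Arrays.equals(utf8, roundTrip(utf8, StandardCharsets.ISO_8859_1));
        System.out.println("windows-1252 round trip intact: " + viaCp1252);
        System.out.println("ISO-8859-1 round trip intact: " + viaLatin1);
    }
}
```

When a byte has no mapping, Java substitutes a replacement character on decode, so the original byte cannot be recovered afterwards. That is consistent with the answer's claim that strings using those five bytes are lost on the client side.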


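Test 2: rewrite the SQL and let MySQL handle the encoding conversion

Set the URL to jdbc:mysql://123.*.*.108/db

Set the SQL to SELECT CONVERT(CONVERT(CONVERT(LOWER(consumption_name) USING latin1) USING binary) USING utf8),consumption FROM dimen_table

Before running the SQL, first execute statement.execute("set names 'utf8'");

rs.getString() on the ResultSet then returns the text correctly decoded from utf-8.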
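
Put together, the steps of Test 2 might look like the sketch below. It assumes an already opened java.sql.Connection to the database above; the class name and both helper methods are hypothetical, and only the SQL text and the SET NAMES call come from the test itself.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class ConvertQuerySketch {
    // Wraps a column expression so MySQL reinterprets its latin1 bytes as utf8.
    public static String latin1BytesAsUtf8(String expr) {
        return "CONVERT(CONVERT(CONVERT(" + expr + " USING latin1) USING binary) USING utf8)";
    }

    // Runs the Test 2 query on an open connection and collects the first column.
    public static List<String> readNames(Connection con) throws SQLException {
        String sql = "SELECT " + latin1BytesAsUtf8("LOWER(consumption_name)")
                + ",consumption FROM dimen_table";
        List<String> names = new ArrayList<>();
        try (Statement st = con.createStatement()) {
            st.execute("set names 'utf8'"); // issued before the query, as in Test 2
            try (ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    names.add(rs.getString(1)); // no manual byte decoding needed
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints only the generated expression; no database is contacted here.
        System.out.println(latin1BytesAsUtf8("LOWER(consumption_name)"));
    }
}
```

Keeping the CONVERT wrapper in one helper makes it easy to apply to every text column in a query without retyping the nested calls.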


On Windows the output displays correctly.

On Linux the output is broken:

gm命令  -1019
神器解鎖        -1018
購買經驗藥水    -1017
購買時裝        -1016
運營活動(領物-1015
神器精煉(-1014
神器精煉(-1013

商店購買        -1012

(Note: the truncated rows, highlighted in red in the original post, are the ones that cannot be decoded properly.)


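This is caused by the encoding settings of the Linux environment. Possible fixes:

1) Run locale to check the system locale, from which the JVM derives its file.encoding system property. If it is not utf8, run: export LANG=zh_CN.utf8

or set it for a single run only: LANG=zh_CN.utf8  java -Djava.ext.dirs=/sdfls/asdlfjal/test Main

2) Or run: java -Dfile.encoding=utf8 MainClass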
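
To see what the JVM actually picked up, and to avoid depending on it when printing, something like the following sketch can help. The class and helper names are illustrative assumptions; it reports the default charset and writes through a PrintStream with an explicit encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

public class ConsoleEncodingDemo {
    // Returns the bytes a PrintStream with the given charset writes for the text.
    public static byte[] printedBytes(String text, String charsetName) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(buf, true, charsetName);
            out.print(text);
            out.flush();
            return buf.toByteArray();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // The charset the JVM derived from the locale or from -Dfile.encoding.
        System.out.println("default charset: " + Charset.defaultCharset());

        // A stream with an explicit charset does not depend on that default.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println("\u795e\u5668");

        // The same two characters produce different byte counts per charset.
        System.out.println("UTF-8 bytes: " + printedBytes("\u795e\u5668", "UTF-8").length);
        System.out.println("US-ASCII bytes: " + printedBytes("\u795e\u5668", "US-ASCII").length);
    }
}
```

An encoder that cannot represent a character writes a substitute byte instead, which is one way Chinese output ends up broken on a console whose locale is not utf8.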



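Test 3: change the test environment

a. The environment on one node of the Hadoop cluster is utf8, and the Java source code is also utf8-encoded.

b. Change the encoding of the latin1 database above to utf8 (to verify that the change took effect: newly created tables default to utf8, whereas before the change the default was latin1).

Steps and results


Set the URL to jdbc:mysql://123.*.*.108/db?CharSet=utf8&useUnicode=true&characterEncoding=utf8

The SQL stays in its plain form: SELECT LOWER(consumption_name),consumption FROM info.dimen_table

Before running this SQL, check the character sets: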


System.out.println(SqlTest.getSqlResutl(statement, "show variables like '%char%'")); // check the character sets right after the initial connection
/** Result 5
character_set_client    utf8
character_set_connection        utf8
character_set_database  utf8
character_set_filesystem        binary
character_set_results
character_set_server    latin1
character_set_system    utf8
character_sets_dir      /usr/share/mysql/charsets/
*/

statement.execute("set names 'utf8'"); // change the connection character sets
//statement.execute("set character_set_server='utf8'");
System.out.println(SqlTest.getSqlResutl(statement, "show variables like '%char%'")); // check the character sets the connection uses after the change
/** Result 6
character_set_client    utf8
character_set_connection        utf8
character_set_database  utf8
character_set_filesystem        binary
character_set_results
character_set_server    utf8
character_set_system    utf8
character_sets_dir      /usr/share/mysql/charsets/
*/

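In the ResultSet, interpret the value as utf8: new String(rs.getString(i).getBytes("UTF-8")); // by contrast, this variant produces complete garbage: new String(rs.getString(i).getBytes(),"UTF-8")

The final output is the same as in Test 2.


(In a pure utf8 environment, where the server encoding is utf8 and MySQL's default encoding is also utf8, checking the character sets after connecting shows Result 6, i.e. character_set_server=utf8 by default.)

The root of all these problems seems to be the default character_set_server=latin1.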


http://stackoverflow.com/questions/27866533/whacky-latin1-to-utf8-conversion-in-jdbc  : notes that JDBC inserts a special replacement character for latin1 bytes it does not recognize

JDBC seems to insert a utf8 replacement character when asked to read from a latin1 column containing undefined latin1 codepage characters



[Reference] Using Java to read and write UTF-8 encoded Chinese stored in a latin1-encoded MySQL database

Character set bug at server with utf8 column and latin1 connection

Description: // This bug is reproduced using a MySQL Linux default installation where "character_set_server" is "latin1"


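Other related material and notes

Setting every character set to utf8 appears to be the most thorough fix. Also, this problem only shows up with JDBC; when reading with Python, everything comes back correctly.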

    conn10 = MySQLdb.connect(host=db10, user=mdUser, passwd=mdPasswd, db="sdc_hdfs" )
    cursor10 = conn10.cursor()
    cursor10.execute("set @@autocommit=1")
    cursor10.execute("SHOW VARIABLES LIKE 'character_set_database'")
    data=cursor10.fetchone()
    if data[1]=='utf8':
        cursor10.close()
        cursor10 = MySQLdb.connect(db10, mdUser, mdPasswd, 'sdc_hdfs', charset='utf8', use_unicode=False).cursor() # reconnect using utf8
        cursor10.execute("set @@autocommit=1")


Decoding and encoding Strings in Java

Strings: although Java uses Unicode all the time under the hood, when you convert between String and byte[] using String#getBytes() or String(byte[]), you should rather use the overloaded method/constructor which takes the character encoding:


byte[] bytesInDefaultEncoding = someString.getBytes(); // May generate corrupt bytes.
byte[] bytesInUTF8 = someString.getBytes("UTF-8"); // Correct.
String stringUsingDefaultEncoding = new String(bytesInUTF8); // Unknown bytes becomes "?".
String stringUsingUTF8 = new String(bytesInUTF8, "UTF-8"); // Correct.

Otherwise the platform default encoding will be used, which can be the one of the underlying operating system or the IDE(!).
  

Unicode - How to get the characters right?

Garbled text when converting latin1 to gbk: a JDBC bug

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

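Final findings:

1. Different JDBC drivers differ in how well they handle encodings.

2. Test 2 does solve the problem: set names 'utf8' plus CONVERT() in the SQL works in every case, because it sidesteps the differences between JDBC drivers.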


