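Test goal:
Read data from MySQL with Java and print it.
Test environment 1
a. A node in the Hadoop cluster runs with a UTF-8 locale, and the Java source code is also UTF-8 encoded.
b. The MySQL server to read from, together with its database and tables, is entirely latin1.
Run mysql -u* -p* -A -h to log in to the MySQL server.
(1) Basic ways to check MySQL encodings
First confirm in mysql which encoding the raw data uses; the three steps below confirm that it is latin1.
After logging in: mysql> use db;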
mysql> show variables like 'character%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | gbk |
| character_set_connection | gbk |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | gbk |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
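// SET NAMES changes the three variables that were highlighted in red above: character_set_client, character_set_connection and character_set_results. After running it with latin1: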
Query OK, 0 rows affected (0.00 sec)
mysql> show variables like 'character%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
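SET NAMES 'utf8' is equivalent to the following three statements:
SET character_set_client = utf8      sets the character set of the data the client sends to the MySQL server
SET character_set_results = utf8     sets the character set the server uses when returning query results
SET character_set_connection = utf8  the MySQL server converts incoming client data from character_set_client to character_set_connection
MySQL data flow and encoding conversion
Receiving: client --> connection --> database
Responding: database --> connection --> results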
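The response path can be imitated in plain Java to see why the result changes with character_set_results. This is only a sketch under two assumptions that are not stated in the original: the application wrote UTF-8 bytes into the latin1 column, and ISO-8859-1 stands in for MySQL's latin1. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Imitates the server converting column data to character_set_results. */
final class ResultConversionSketch {

    /** Decode the stored bytes with the column charset, then re-encode with the results charset. */
    static byte[] convertForResults(byte[] stored, Charset columnCharset, Charset resultsCharset) {
        return new String(stored, columnCharset).getBytes(resultsCharset);
    }

    public static void main(String[] args) {
        // Assumption: these UTF-8 bytes were stored in a column declared as latin1.
        // The escapes spell the sample value with consumption -1012 from the table above.
        byte[] stored = "\u5546\u5E97\u8CFC\u8CB7".getBytes(StandardCharsets.UTF_8);

        // results = latin1: the bytes pass through unchanged.
        byte[] asLatin1 = convertForResults(stored, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1);

        // results = utf8: every stored byte is treated as one latin1 character and re-encoded.
        byte[] asUtf8 = convertForResults(stored, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        System.out.println(stored.length + " bytes stored");
        System.out.println(asLatin1.length + " bytes returned with results=latin1");
        System.out.println(asUtf8.length + " bytes returned with results=utf8");
    }
}
```

Under these assumptions the 12 stored bytes come back as 24 bytes once the results charset is utf8, which is the double encoding that later has to be undone.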
mysql> SELECT LOWER(consumption_name),consumption FROM dimen_table;
+-------------------------+-------------+
| LOWER(consumption_name) | consumption |
+-------------------------+-------------+
| gm命令 | -1019 |
| 神器解鎖 | -1018 |
| 購買經驗藥水 | -1017 |
| 購買時裝 | -1016 |
| 運營活動(領物品) | -1015 |
| 神器精煉(棍) | -1014 |
| 神器精煉(拳) | -1013 |
| 商店購買 | -1012 |
(2) Code tests in Java
Test 1: Java connection code
jdbc:mysql://106.2.67.10/sdc_hdfs?useUnicode=true&characterEncoding=UTF-8
Statement statement = con.createStatement();
statement.execute("set names 'utf8'"); // similar in effect to the characterEncoding=UTF-8 URL parameter
sql = "SELECT LOWER(consumption_name),consumption FROM dimen_table";
ResultSet rs = statement.executeQuery(sql);
while (rs.next()) {
    // latin1 corresponds to decoding with ISO-8859-1 or Cp1252; in my tests Cp1252 turned out to be the right one
    System.out.println(new String(rs.getBytes(i + 1), "cp1252"));
}
// List the character sets MySQL supports, with their descriptions
mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset | Description | Default collation | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 |
| dec8 | DEC West European | dec8_swedish_ci | 1 |
| cp850 | DOS West European | cp850_general_ci | 1 |
| hp8 | HP West European | hp8_english_ci | 1 |
| koi8r | KOI8-R Relcom Russian | koi8r_general_ci | 1 |
| latin1 | cp1252 West European | latin1_swedish_ci | 1 |
| latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 |
| swe7 | 7bit Swedish | swe7_swedish_ci | 1 |
| ascii | US ASCII | ascii_general_ci | 1 |
| ujis | EUC-JP Japanese | ujis_japanese_ci | 3 |
[Reference]
http://stackoverflow.com/questions/21689665/mysql-latin1-to-utf-8-using-java-hibernate-jpa
notes that latin1 should be decoded as cp1252:
MySQL's version of latin1 is an extended version of CP1252: it uses the 5 bytes that CP1252 leaves undefined. Unfortunately the current Connector/J has a "bug" in that it uses the original CP1252 rather than MySQL's own version. Therefore it's impossible to recover strings whose encoding uses one of these 5 bytes. Patching the Connector/J source to fix the bug could solve the problem, but ideally you should migrate the tables to UTF-8.
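The five undefined bytes mentioned in the quote can be looked at from the Java side. The sketch below probes every byte value and records those that a charset decodes to the replacement character U+FFFD. It only exercises the JDK's own windows-1252 and ISO-8859-1 charsets, not Connector/J, and the class and method names are invented for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Finds the byte values a single-byte charset cannot decode to a real character. */
final class Cp1252GapSketch {

    /** Returns every byte value (0-255) that decodes to the replacement character U+FFFD. */
    static List<Integer> undecodableBytes(Charset charset) {
        List<Integer> gaps = new ArrayList<>();
        for (int b = 0; b < 256; b++) {
            String decoded = new String(new byte[] {(byte) b}, charset);
            if (decoded.equals("\uFFFD")) {
                gaps.add(b);
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        // "windows-1252" is the JDK's canonical name for Cp1252.
        System.out.println("windows-1252 gaps: " + undecodableBytes(Charset.forName("windows-1252")));
        System.out.println("ISO-8859-1 gaps:   " + undecodableBytes(StandardCharsets.ISO_8859_1));
    }
}
```

Any byte reported for windows-1252 is lost when data is decoded with Cp1252 and encoded back, whereas ISO-8859-1 maps all 256 byte values and round-trips raw bytes without loss.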
Test 2: change the SQL and let MySQL handle the encoding conversion
Set the URL to jdbc:mysql://123.*.*.108/db
Set the SQL to SELECT CONVERT(CONVERT(CONVERT(LOWER(consumption_name) USING latin1) USING binary) USING utf8),consumption FROM dimen_table
Before executing the SQL, first run statement.execute("set names 'utf8'");
In the ResultSet, rs.getString() then returns the text correctly decoded from UTF-8.
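The CONVERT chain in the SQL above reinterprets the column's bytes: back to latin1 bytes, then to raw binary, then read as utf8. The same idea can be written on the Java side as a sketch. It assumes the column really holds UTF-8 bytes, and it uses ISO-8859-1 rather than Cp1252 for the byte-preserving step because ISO-8859-1 maps every byte value. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Java-side analogue of CONVERT(CONVERT(CONVERT(x USING latin1) USING binary) USING utf8). */
final class Latin1RepairSketch {

    /** Repairs a string whose UTF-8 bytes were wrongly decoded one byte per character. */
    static String repair(String misdecoded) {
        // Like CONVERT(... USING latin1) followed by USING binary: get the original bytes back.
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        // Like CONVERT(... USING utf8): read those bytes as what they really are.
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String expected = "\u5546\u5E97\u8CFC\u8CB7"; // sample value from dimen_table
        // Simulate the damage: UTF-8 bytes decoded as if they were latin1.
        String misdecoded = new String(expected.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println("repaired correctly: " + repair(misdecoded).equals(expected));
    }
}
```

Doing the conversion in SQL, as in test 2, has the advantage that it does not depend on how a particular JDBC driver maps latin1.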
On Windows the output displays correctly.
On Linux the output is broken:
gm命令 -1019
神器解鎖 -1018
購買經驗藥水 -1017
購買時裝 -1016
運營活動(領物-1015
神器精煉(-1014
神器精煉(-1013
商店購買 -1012
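(Note: the entries that were highlighted in red, i.e. the truncated names above, are not decoded correctly.)
This is caused by the encoding settings of the Linux environment. Fixes:
1) Use locale to check the system locale, from which Java derives the file.encoding system property. If it is not UTF-8, run export LANG=zh_CN.utf8
or set it for a single run: LANG=zh_CN.utf8 java -Djava.ext.dirs=/sdfls/asdlfjal/test Main
2) Or run java -Dfile.encoding=utf8 MainClass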
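A third option, not part of the original tests, is to stop depending on the default encoding when printing. The sketch below wraps the output stream in a PrintStream with an explicit UTF-8 charset; the class and helper names are made up, and the terminal still has to be able to display UTF-8.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

/** Prints text as UTF-8 regardless of the JVM's file.encoding. */
final class Utf8ConsoleSketch {

    /** Wraps a stream so that everything printed to it is encoded as UTF-8. */
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this is not expected to happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Shows which default the JVM picked up from the environment.
        System.out.println("file.encoding = " + System.getProperty("file.encoding"));

        PrintStream out = utf8Stream(System.out);
        out.println("\u5546\u5E97\u8CFC\u8CB7 -1012"); // sample row from the output above
    }
}
```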
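Test 3: change the test environment
a. A node in the Hadoop cluster runs with a UTF-8 locale, and the Java source code is also UTF-8 encoded.
b. Change the encoding of the latin1 database above to utf8 (how to tell the change took effect: newly created tables default to utf8, whereas before the change the default was latin1).
Steps and results
Set the URL to jdbc:mysql://123.*.*.108/db?CharSet=utf8&useUnicode=true&characterEncoding=utf8
The SQL is written the ordinary way: SELECT LOWER(consumption_name),consumption FROM info.dimen_table
Before executing this SQL, check the character sets:
System.out.println(SqlTest.getSqlResutl(statement, "show variables like '%char%'")); // character sets right after the initial connection
/** Result 5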
character_set_client utf8
character_set_connection utf8
character_set_database utf8
character_set_filesystem binary
character_set_results
character_set_server latin1
character_set_system utf8
character_sets_dir /usr/share/mysql/charsets/
*/
statement.execute("set names 'utf8'"); // change the character set of the connection
//statement.execute("set character_set_server='utf8'");
System.out.println(SqlTest.getSqlResutl(statement, "show variables like '%char%'")); // after the change, check the character sets the connection uses
/** Result 6
character_set_client utf8
character_set_connection utf8
character_set_database utf8
character_set_filesystem binary
character_set_results
character_set_server utf8
character_set_system utf8
character_sets_dir /usr/share/mysql/charsets/
*/
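In the ResultSet, interpret the value as utf8: new String(rs.getString(i).getBytes("UTF-8")); // the following variant produced complete garbage: new String(rs.getString(i).getBytes(),"UTF-8")
The final output is the same as in test 2.
(In a pure utf8 environment, where the server encoding is utf8 and MySQL's default encoding is also utf8, checking the character sets after logging in to mysql shows exactly result 6, i.e. character_set_server=utf8 by default.)
It looks as if the root of all these problems is the default character_set_server=latin1.
http://stackoverflow.com/questions/27866533/whacky-latin1-to-utf8-conversion-in-jdbc mentions that JDBC inserts a special replacement character for latin1 bytes it does not recognize: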
JDBC seems to insert a utf8 replacement character when asked to read from a latin1 column containing undefined latin1 codepage characters
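[Reference] Using Java to read and write UTF-8 encoded Chinese stored in a latin1-encoded MySQL
Character set bug at server with utf8 column and latin1 connection
Description: // This bug is reproduced using a MySQL Linux default installation where "character_set_server" is "latin1"
Other related material and notes
Setting every character set to utf8 appears to be the most thorough fix. Also, this problem only shows up with JDBC: when the same data is read with Python, everything comes back correctly.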
import MySQLdb

conn10 = MySQLdb.connect(host=db10, user=mdUser, passwd=mdPasswd, db="sdc_hdfs")
cursor10 = conn10.cursor()
cursor10.execute("set @@autocommit=1")
cursor10.execute("SHOW VARIABLES LIKE 'character_set_database'")
data = cursor10.fetchone()
if data[1] == 'utf8':
    # the database is utf8, so reconnect with a utf8 connection
    cursor10.close()
    cursor10 = MySQLdb.connect(db10, mdUser, mdPasswd, 'sdc_hdfs', charset='utf8', use_unicode=False).cursor()
    cursor10.execute("set @@autocommit=1")
Decoding and encoding Strings in Java
Strings: although Java uses Unicode all the time under the hood, when you convert between String and byte[] using String#getBytes() or String(byte[]), you should rather use the overloaded method/constructor which takes the character encoding:
byte[] bytesInDefaultEncoding = someString.getBytes(); // May generate corrupt bytes.
byte[] bytesInUTF8 = someString.getBytes("UTF-8"); // Correct.
String stringUsingDefaultEncoding = new String(bytesInUTF8); // Unknown bytes becomes "?".
String stringUsingUTF8 = new String(bytesInUTF8, "UTF-8"); // Correct.
Otherwise the platform default encoding will be used, which can be the one of the underlying operating system or the IDE(!).
Unicode - How to get the characters right?
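To make the point about the platform default concrete, the sketch below imitates the two one-argument calls from the quote by passing the "default" charset in as a parameter, so that the effect can be compared without restarting the JVM with a different file.encoding. This is an illustration only; the class and method names are invented.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Shows how results change with the charset that the no-argument conversions fall back to. */
final class DefaultCharsetSketch {

    /** Imitates new String(s.getBytes("UTF-8")) on a JVM whose default charset is platformDefault. */
    static String encodeUtf8DecodeDefault(String s, Charset platformDefault) {
        return new String(s.getBytes(StandardCharsets.UTF_8), platformDefault);
    }

    /** Imitates new String(s.getBytes(), "UTF-8") on a JVM whose default charset is platformDefault. */
    static String encodeDefaultDecodeUtf8(String s, Charset platformDefault) {
        return new String(s.getBytes(platformDefault), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u5546\u5E97"; // two Chinese characters
        Charset[] defaults = {StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1};
        for (Charset cs : defaults) {
            System.out.println("default=" + cs
                    + " variant1 intact=" + encodeUtf8DecodeDefault(text, cs).equals(text)
                    + " variant2 intact=" + encodeDefaultDecodeUtf8(text, cs).equals(text));
        }
    }
}
```

Both variants keep the text intact only when the default charset is UTF-8, which is why naming the charset on both the encoding and the decoding side is the safer habit.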
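Final findings:
1. Different JDBC drivers differ in how well they handle encodings.
2. Test 2 does solve the problem: running set names 'utf8' and then transcoding with convert() inside the SQL works everywhere, because it sidesteps the differences between JDBC drivers.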