MySQL character encoding problem: from latin1 to utf8

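Goal of the test:

Read data from MySQL in Java and print it.

Test environment 1

a. A node in the Hadoop cluster whose system encoding is utf8; the Java source code is also utf8-encoded.

b. The MySQL server to be read, where the server, the database, and the tables all use latin1.

Run mysql -u* -p* -A -h to connect to the MySQL server.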

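(1) Basic ways to check the MySQL encoding

First confirm inside mysql which encoding the original data uses; the three steps below show that it is latin1.

In the mysql client: mysql > use db;

// set names changes three parameters (the three highlighted in red in the original output)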

8 rows in set (0.00 sec)

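set names 'utf8' is equivalent to the following three statements:

SET character_set_client = utf8      the character set of the data the client sends to the MySQL server
SET character_set_results = utf8     the character set the server uses when returning query results
SET character_set_connection = utf8  the character set into which the server converts incoming client data (from character_set_client)

How data flows through MySQL and where the encoding is converted

Receiving: client --> connection --> database

Returning: database --> connection --> results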
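To make the conversion chain concrete, here is a minimal Java sketch of what goes wrong when a column is declared latin1 but really holds UTF-8 bytes. It is only an illustration under two assumptions: that the stored bytes are UTF-8, and that Java's ISO-8859-1 is an acceptable stand-in for MySQL's latin1 (the two differ for some bytes in the 0x80-0x9F range). The class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

class Latin1MojibakeDemo {
    // Roughly what happens when the stored UTF-8 bytes are treated as
    // latin1 text: every single byte becomes one character.
    static String misreadAsLatin1(String original) {
        byte[] storedBytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(storedBytes, StandardCharsets.ISO_8859_1);
    }

    // Undo the misreading: map each character back to its single byte,
    // then decode those bytes as the UTF-8 they always were.
    static String repair(String garbled) {
        byte[] rawBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Four CJK characters, written as escapes so the source file
        // encoding does not matter.
        String original = "\u5546\u5e97\u8cfc\u8cb7";
        String garbled = misreadAsLatin1(original);
        System.out.println(garbled.length());                  // 12: one char per UTF-8 byte
        System.out.println(repair(garbled).equals(original));  // true
    }
}
```

The repair step works here only because ISO-8859-1 maps every byte to exactly one character and back, so no information is lost along the way.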

mysql> SELECT LOWER(consumption_name),consumption FROM dimen_table;
+-------------------------+-------------+
| LOWER(consumption_name) | consumption |
+-------------------------+-------------+
| gm命令                |       -1019 |
| 神器解鎖            |       -1018 |
| 購買經驗藥水      |       -1017 |
| 購買時裝            |       -1016 |
| 運營活動(領物品) |       -1015 |
| 神器精煉(棍)       |       -1014 |
| 神器精煉(拳)       |       -1013 |
| 商店購買            |       -1012 |

(2) Testing from Java

Test 1: Java connection code

jdbc:mysql://106.2.67.10/sdc_hdfs?useUnicode=true&characterEncoding=UTF-8

Statement statement = con.createStatement();

statement.execute("set names 'utf8'"); // similar in effect to the characterEncoding=UTF-8 parameter

String sql = "SELECT LOWER(consumption_name),consumption FROM dimen_table";

ResultSet rs = statement.executeQuery(sql);

while (rs.next()) {

System.out.println(new String(rs.getBytes(1), "cp1252")); // column 1 is LOWER(consumption_name); latin1 can be decoded as ISO-8859-1 or Cp1252, and in my tests Cp1252 was the one that worked

}

// List the character sets MySQL supports and their descriptions, e.g. with SHOW CHARACTER SET;

[References]

http://stackoverflow.com/questions/21689665/mysql-latin1-to-utf-8-using-java-hibernate-jpa :

It notes that latin1 should be decoded as cp1252:

MySQL's version of latin1 is an extended version of CP1252: it uses the 5 bytes that CP1252 leaves undefined. Unfortunately the current Connector/J has a "bug" in that it uses the original CP1252 rather than MySQL's own version. Therefore it's impossible to recover strings whose encoding uses one of these 5 bytes. Patching the Connector/J source to fix the bug could solve the problem, but ideally you should migrate the tables to UTF-8.
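The point about undefined bytes can be checked from the Java side. The sketch below (a hypothetical helper, not taken from the answer) decodes each of the 256 byte values and re-encodes it, then reports which values do not survive the round trip. It assumes the JDK ships the windows-1252 charset, which is not one of the charsets the Java specification guarantees.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SingleByteRoundTrip {
    // Returns the byte values (0-255) that change when decoded with the
    // given charset and encoded again with the same charset.
    static List<Integer> lossyBytes(Charset charset) {
        List<Integer> lossy = new ArrayList<>();
        for (int value = 0; value < 256; value++) {
            byte[] in = {(byte) value};
            byte[] out = new String(in, charset).getBytes(charset);
            if (out.length != 1 || out[0] != in[0]) {
                lossy.add(value);
            }
        }
        return lossy;
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1 lossy bytes: "
                + lossyBytes(StandardCharsets.ISO_8859_1));
        for (int value : lossyBytes(Charset.forName("windows-1252"))) {
            System.out.printf("windows-1252 cannot round-trip 0x%02X%n", value);
        }
    }
}
```

Any byte reported for windows-1252 is one that a decode-as-cp1252 approach cannot recover, which is the limitation the quoted answer describes.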

It also suggests a fix that changes the SQL statement: SELECT CONVERT(CONVERT(CONVERT( column_name USING latin1) USING binary) using utf8) FROM...

Test 2: change the SQL and let MySQL handle the encoding conversion

Set the url to jdbc:mysql://123.*.*.108/db

Set the sql to SELECT CONVERT(CONVERT(CONVERT(LOWER(consumption_name) USING latin1) USING binary) USING utf8),consumption FROM dimen_table

Before running the sql, execute statement.execute("set names 'utf8'");

rs.getString() on the ResultSet then returns text that was correctly decoded from utf-8
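The steps of test 2 can be put together roughly as follows. This is a sketch under assumptions rather than the exact code used in the test: the class name, the asUtf8 helper, and readNames are invented for illustration, and the Connection is assumed to be opened elsewhere with the url above.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

class ConvertQuerySketch {
    // Wraps a column expression in the latin1 -> binary -> utf8 chain.
    // The expression is concatenated as-is, so it must come from trusted
    // code and never from user input.
    static String asUtf8(String columnExpr) {
        return "CONVERT(CONVERT(CONVERT(" + columnExpr
                + " USING latin1) USING binary) USING utf8)";
    }

    // Runs the query from test 2 and returns one tab-separated line per row.
    static List<String> readNames(Connection con) throws SQLException {
        String sql = "SELECT " + asUtf8("LOWER(consumption_name)")
                + ",consumption FROM dimen_table";
        List<String> rows = new ArrayList<>();
        try (Statement statement = con.createStatement()) {
            statement.execute("set names 'utf8'");
            try (ResultSet rs = statement.executeQuery(sql)) {
                while (rs.next()) {
                    rows.add(rs.getString(1) + "\t" + rs.getInt(2));
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Without a database at hand, just show the generated expression.
        System.out.println(asUtf8("LOWER(consumption_name)"));
    }
}
```

Keeping the CONVERT chain in one helper makes it easy to apply to other latin1 columns without retyping the nested expression.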

On Windows the output is displayed correctly.

On Linux the output is broken:

gm命令 -1019
神器解鎖        -1018
購買經驗藥水    -1017
購買時裝        -1016
運營活動(領物-1015
神器精煉(-1014
神器精煉(-1013
商店購買        -1012

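(Note: the parts marked in red in the original post are not decoded correctly.)

This is caused by the encoding settings of the Linux environment. Fixes:

1) Run locale to check the system locale (LANG / LC_ALL), from which the JVM derives its default file.encoding. If it is not utf8, run export LANG=zh_CN.utf8

or set it for a single run: LANG=zh_CN.utf8 java -Djava.ext.dirs=/sdfls/asdlfjal/test Main

2) Alternatively: java -Dfile.encoding=utf8 MainClass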
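Besides changing the environment, the program can also stop depending on it. The sketch below prints the JVM's current defaults for diagnosis and wraps System.out in a PrintStream with an explicit UTF-8 encoding. The class and method names are made up for this example, and it assumes the terminal itself expects UTF-8.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

class Utf8Stdout {
    // Wraps any stream in a PrintStream that always encodes text as UTF-8,
    // regardless of LANG or -Dfile.encoding.
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Diagnostics: what the JVM picked up from the environment.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));

        PrintStream out = utf8(System.out);
        out.println("\u8cfc\u8cb7\u6642\u88dd\t-1016");
    }
}
```

If the first two lines report something other than UTF-8, output written through plain System.out is the likely source of the broken rows.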

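Test 3: change the test environment

a. A node in the Hadoop cluster whose system encoding is utf8; the Java source code is also utf8-encoded.

b. Change the encoding of the latin1 database above to utf8 (how to tell the change took effect: newly created tables default to utf8, whereas before the change the default was latin1).

Steps and results

Set the url to jdbc:mysql://123.*.*.108/db?CharSet=utf8&useUnicode=true&characterEncoding=utf8

The sql is written the plain way: SELECT LOWER(consumption_name),consumption FROM info.dimen_table

Before running this sql, check the character sets: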

System.out.println(SqlTest.getSqlResutl(statement, "show variables like '%char%'")); // character sets right after connecting
/** Result 5 (character_set_server was highlighted in red in the original)
character_set_client    utf8
character_set_connection        utf8
character_set_database  utf8
character_set_filesystem        binary
character_set_results
character_set_server    latin1
character_set_system    utf8
character_sets_dir      /usr/share/mysql/charsets/
*/

statement.execute("set names 'utf8'"); // change the connection character sets
//statement.execute("set character_set_server='utf8'");
System.out.println(SqlTest.getSqlResutl(statement, "show variables like '%char%'")); // character sets used by the connection after the change
/** Result 6 (character_set_server was highlighted in red in the original)
character_set_client    utf8
character_set_connection        utf8
character_set_database  utf8
character_set_filesystem        binary
character_set_results
character_set_server    utf8
character_set_system    utf8
character_sets_dir      /usr/share/mysql/charsets/
*/

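On the ResultSet, decode as utf8: new String(rs.getString(i).getBytes("UTF-8")); // whereas this variant produces complete garbage: new String(rs.getString(i).getBytes(),"UTF-8")

Note that both forms rely on the platform default charset for one of their two steps, so the result depends on the environment.

The final output is the same as in test 2.

(In a pure utf8 environment, where the server encoding is utf8 and MySQL's default encoding is also utf8, checking the character sets right after connecting shows result 6, i.e. character_set_server=utf8 by default.)

The root of all these problems seems to be the default character_set_server=latin1.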

http://stackoverflow.com/questions/27866533/whacky-latin1-to-utf8-conversion-in-jdbc : notes that JDBC inserts a special replacement character for latin1 byte values it does not recognize

JDBC seems to insert a utf8 replacement character when asked to read from a latin1 column containing undefined latin1 codepage characters

[References] Using Java to read and write UTF-8 encoded Chinese stored in a latin1-encoded MySQL

Character set bug at server with utf8 column and latin1 connection

Description: // This bug is reproduced using a MySQL Linux default installation where "character_set_server" is "latin1"

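Other related material and notes

Setting every character set to utf8 appears to be the most thorough fix. Also, this problem only shows up with JDBC; reading the same data from Python works fine: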

    conn10 = MySQLdb.connect(host=db10, user=mdUser, passwd=mdPasswd, db="sdc_hdfs" )
    cursor10 = conn10.cursor()
    cursor10.execute("set @@autocommit=1")
    cursor10.execute("SHOW VARIABLES LIKE 'character_set_database'")
    data=cursor10.fetchone()
    if data[1]=='utf8':
        cursor10.close()
        cursor10 = MySQLdb.connect(db10, mdUser, mdPasswd, 'sdc_hdfs', charset='utf8', use_unicode=False).cursor() # reconnect using utf8
        cursor10.execute("set @@autocommit=1")

Decoding and encoding Strings in Java

Strings: although Java uses Unicode all the time under the hood, when you convert between String and byte[] using String#getBytes() or String(byte[]), you should rather use the overloaded method/constructor which takes the character encoding:

byte[] bytesInDefaultEncoding = someString.getBytes(); // May generate corrupt bytes.
byte[] bytesInUTF8 = someString.getBytes("UTF-8"); // Correct.
String stringUsingDefaultEncoding = new String(bytesInUTF8); // Unknown bytes becomes "?".
String stringUsingUTF8 = new String(bytesInUTF8, "UTF-8"); // Correct.

Otherwise the platform default encoding will be used, which can be the one of the underlying operating system or the IDE(!).
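The two failure modes mentioned in the comments above ("?" and the replacement character) can be reproduced deterministically by picking a charset that cannot represent the text. The sketch below uses US-ASCII for that purpose; the class and method names are invented for this example.

```java
import java.nio.charset.StandardCharsets;

class ReplacementDemo {
    // Decoding with the wrong charset: every UTF-8 byte outside ASCII
    // turns into U+FFFD, the Unicode replacement character.
    static String decodeUtf8BytesAsAscii(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.US_ASCII);
    }

    // Encoding with a charset that lacks the characters: each unmappable
    // character is replaced by '?' before any byte leaves the JVM.
    static String encodeToAsciiAndBack(String text) {
        byte[] asciiBytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(asciiBytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String text = "gm\u547d\u4ee4"; // "gm" followed by two CJK characters
        System.out.println(decodeUtf8BytesAsAscii(text).length()); // 8: "gm" + 6 replacement chars
        System.out.println(encodeToAsciiAndBack(text));            // gm??
    }
}
```

Once a character has been turned into "?" or U+FFFD the original bytes are gone, which is why the charset has to be right at the first conversion rather than patched afterwards.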

Unicode - How to get the characters right?

Garbled text when converting latin1 to gbk: a JDBC bug

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

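Final findings:

1. Different JDBC drivers (and driver versions) support encodings differently.

2. Test 2 does solve the problem: set names 'utf8' plus convert() in the SQL is a general-purpose fix that sidesteps the differences between JDBC drivers.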
