Detecting a File's Encoding in Java

Original

2020-02-24 21:42

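When reading a file, we normally specify the encoding of the text or string. When the encoding is unknown, we have to analyze the file's bytes to guess it. The following code does this heuristically: it checks for a byte order mark (BOM) first, then scans the content for UTF-8 byte patterns, and falls back to GBK if nothing matches.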

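import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;

/**
 * Guesses the character encoding of a file.
 *
 * @param file the file to analyze
 * @return the name of the detected charset; "GBK" if nothing else matches
 */
public static String getCharset(File file) {
	String charset = "GBK"; // default encoding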
	byte[] first3Bytes = new byte[3];
	BufferedInputStream bis = null;
	try {
		boolean checked = false;
		bis = new BufferedInputStream(new FileInputStream(file));
		bis.mark(3); // the read limit must cover the 3 BOM bytes read before reset()
		int read = bis.read(first3Bytes, 0, 3);
		if (read == -1)
			return charset;
		if (first3Bytes[0] == (byte) 0xFF && first3Bytes[1] == (byte) 0xFE) {
			charset = "UTF-16LE";
			checked = true;
		} else if (first3Bytes[0] == (byte) 0xEF
				&& first3Bytes[1] == (byte) 0xBB
				&& first3Bytes[2] == (byte) 0xBF) {
			charset = "UTF-8";
			checked = true;
		}
		bis.reset();
		if (!checked) {
			int loc = 0;
			while ((read = bis.read()) != -1) {
				loc++;
				if (read >= 0xF0)
					break;
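				// A lone byte in 0x80-0xBF is also treated as GBK
				if (0x80 <= read && read <= 0xBF)
					break;
				if (0xC0 <= read && read <= 0xDF) {
					read = bis.read();
					// Two-byte pattern (0xC0-0xDF, then 0x80-0xBF): valid UTF-8,
					// but it can also occur in GB encodings, so keep scanning
					if (0x80 <= read && read <= 0xBF)
						continue;
					else
						break;
					// The three-byte check can also be wrong, but this is unlikely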
				} else if (0xE0 <= read && read <= 0xEF) {
					read = bis.read();
					if (0x80 <= read && read <= 0xBF) {
						read = bis.read();
						if (0x80 <= read && read <= 0xBF) {
							charset = "UTF-8";
							break;
						} else
							break;
					} else
						break;
				}
			}
		}
	} catch (Exception e) {
		e.printStackTrace();
	} finally {
		if (bis != null) {
			try {
				bis.close();
			} catch (Exception e) {
				e.printStackTrace();
			}
		}
	}
	return charset;
}
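
The method above declares UTF-8 as soon as it sees a single well-formed three-byte sequence, so it can misjudge some files. As a stricter alternative (not the approach used above), the sketch below asks the JDK's own CharsetDecoder to validate the entire byte array as UTF-8 and falls back to GBK otherwise. The class name StrictUtf8Check and the helpers isValidUtf8 and guessCharset are hypothetical names chosen for illustration, and the GBK fallback is only an assumption carried over from the code above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {

    // Hypothetical helper: true only if every byte sequence in data is well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // throw instead of substituting U+FFFD
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Hypothetical helper: assumes GBK as the fallback, as in the method above.
    static String guessCharset(byte[] data) {
        return isValidUtf8(data) ? "UTF-8" : "GBK";
    }

    public static void main(String[] args) {
        // The two characters U+4E2D U+6587 encoded as UTF-8
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        // The same two characters encoded as GBK (raw bytes D6 D0 CE C4)
        byte[] gbk = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

        System.out.println(guessCharset(utf8)); // prints "UTF-8"
        System.out.println(guessCharset(gbk));  // prints "GBK"
    }
}
```

Note that pure ASCII content is also valid UTF-8, so this sketch reports "UTF-8" for it, whereas the method above would return its GBK default. For large files, it may be preferable to validate only a leading portion of the bytes, at the cost of possibly cutting a multi-byte sequence at the boundary.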
