Filtering Invalid Unicode Characters in Java

Original

2020-02-24 03:11

When parsing XML documents, a program can fail with exceptions like the following:

Quote
org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x{2}) was found in the value of attribute "{1}" and element is "1f".

Quote
An invalid XML character (Unicode: 0x1d) was found in the CDATA section.

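These errors are caused by invisible special characters that are illegal in an XML document, so the XML parser throws an exception when it meets one. Among the control characters below 0x20, the XML 1.0 specification allows only tab (0x09), line feed (0x0a) and carriage return (0x0d), which leaves three invalid ranges: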

0x00 - 0x08
0x0b - 0x0c
0x0e - 0x1f
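
The three ranges above cover only the control characters below 0x20. The Char production in the XML 1.0 specification also excludes surrogate code points and 0xFFFE/0xFFFF. As a minimal sketch (the class name XmlChars and the method name are made up for illustration), a per-code-point check could look like this:

```java
final class XmlChars {
    private XmlChars() {}

    /**
     * Returns true if the code point is allowed by the XML 1.0 Char production:
     * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
     */
    static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isValidXml10Char(0x1F)); // false: inside 0x0e - 0x1f
        System.out.println(isValidXml10Char(0x0A)); // true: line feed is allowed
    }
}
```

For input that only ever contains stray control characters, the three ranges are usually enough; the wider check matters when the text may also carry unpaired surrogates or 0xFFFE/0xFFFF.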

The fix is to strip these illegal characters from the string before parsing:

string.replaceAll("[\\x00-\\x08\\x0b-\\x0c\\x0e-\\x1f]", "")
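
String.replaceAll compiles the regular expression on every call. If the filter runs often, one option is to compile the pattern once and reuse it. The sketch below assumes a hypothetical helper class named XmlControlCharFilter; the character class is the same as in the one-liner above.

```java
import java.util.regex.Pattern;

final class XmlControlCharFilter {
    // Same character class as the replaceAll call, compiled a single time.
    private static final Pattern INVALID =
            Pattern.compile("[\\x00-\\x08\\x0b-\\x0c\\x0e-\\x1f]");

    private XmlControlCharFilter() {}

    /** Removes the invalid control characters; returns null for null input. */
    static String strip(String input) {
        if (input == null) {
            return null;
        }
        return INVALID.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // 5 chars in: 'a', 0x1f, 'b', tab, 'c'. Tab is legal and stays.
        String dirty = "a\u001fb\tc";
        System.out.println(strip(dirty).length()); // 4
    }
}
```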

Test code: TestXmlInvalidChar.java

package michael.xml;

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

/**
 * Shows that an invisible control character breaks XML parsing
 * unless it is filtered out first.
 *
 * @author michael
 */
public class TestXmlInvalidChar {

    /**
     * @param args unused
     */
    public static void main(String[] args) {

        // The test string should be: <r><c d="s" n="j"></c></r>
        // Byte array of the well-formed string:
        byte[] ba1 = new byte[] { 60, 114, 62, 60, 99, 32, 100, 61, 34, 115,
                34, 32, 110, 61, 34, 106, 34, 62, 60, 47, 99, 62, 60, 47, 114,
                62 };
        System.out.println("ba1 length=" + ba1.length);
        // Decode with an explicit charset instead of the platform default.
        String ba1str = new String(ba1, StandardCharsets.UTF_8);
        System.out.println(ba1str);
        System.out.println("ba1str length=" + ba1str.length());
        System.out.println("-----------------------------------------");
        // Same as the array above, plus one invisible byte: 31 (0x1f)
        byte[] ba2 = new byte[] { 60, 114, 62, 60, 99, 32, 100, 61, 34, 115,
                34, 32, 110, 61, 34, 106, 31, 34, 62, 60, 47, 99, 62, 60, 47,
                114, 62 };
        System.out.println("ba2 length=" + ba2.length);
        String ba2str = new String(ba2, StandardCharsets.UTF_8);
        System.out.println(ba2str);
        System.out.println("ba2str length=" + ba2str.length());
        System.out.println("-----------------------------------------");
        try {
            DocumentBuilderFactory dbfactory = DocumentBuilderFactory
                    .newInstance();
            dbfactory.setIgnoringComments(true);
            DocumentBuilder docBuilder = dbfactory.newDocumentBuilder();

            // Strip the illegal invisible characters; without this step
            // the XML parser throws an exception.
            String filter = ba2str.replaceAll(
                    "[\\x00-\\x08\\x0b-\\x0c\\x0e-\\x1f]", "");
            System.out.println("length after filtering=" + filter.length());
            ByteArrayInputStream bais = new ByteArrayInputStream(filter
                    .getBytes(StandardCharsets.UTF_8));
            Document doc = docBuilder.parse(bais);
            Element rootEl = doc.getDocumentElement();
            System.out.println("parsed OK after filtering, root child length="
                    + rootEl.getChildNodes().getLength());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}

The test code produces the following output:
Quote

ba1 length=26
<r><c d="s" n="j"></c></r>
ba1str length=26
-----------------------------------------
ba2 length=27
<r><c d="s" n="j"></c></r>
ba2str length=27
-----------------------------------------
length after filtering=26
parsed OK after filtering, root child length=1

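Comparing the two cases, the byte arrays and the strings differ in length, yet they look identical when printed to the console because the extra character is invisible. After filtering, the string length drops back from 27 to 26.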
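
Because the console hides the offending character, it can help to print control characters as escapes while debugging. The following is only a sketch; the class ShowControlChars and the \xNN output format are choices made here for illustration, not part of the original test.

```java
final class ShowControlChars {
    private ShowControlChars() {}

    /** Replaces each control character (below 0x20, or 0x7f) with a \xNN escape. */
    static String visualize(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20 || c == 0x7f) {
                sb.append(String.format("\\x%02x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The attribute value from the test, with the hidden 0x1f after 'j'.
        System.out.println(visualize("n=\"j\u001f\"")); // n="j\x1f"
    }
}
```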

References: http://sjsky.iteye.com/blog/1055063
http://www.blogjava.net/fingki/archive/2008/09/04/226969.html

-- Error raised when searching the Fudan library:
org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x{2}) was found in the value of attribute "{1}" and element is "1f".

String xmlCode2 = HttpClientUtil.getWebInfoByHttpClientGetMethodGBK(searchURL); // fetch the web page
xmlCode2 = xmlCode2.replaceAll("[\\x00-\\x08\\x0b-\\x0c\\x0e-\\x1f]", ""); // filter out the invalid Unicode characters
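
Once the fetched text is already a Java String, as xmlCode2 is here, it can be handed to the parser through a Reader, which avoids converting it back to bytes and choosing a charset again. A minimal sketch, assuming a hypothetical helper class ParseFilteredXml (the HttpClientUtil call from the snippet above is not reproduced):

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class ParseFilteredXml {
    private ParseFilteredXml() {}

    /** Strips the invalid control characters, then parses the string as XML. */
    static Document parse(String rawXml) {
        String cleaned = rawXml.replaceAll(
                "[\\x00-\\x08\\x0b-\\x0c\\x0e-\\x1f]", "");
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // A character stream needs no byte encoding step.
            return builder.parse(new InputSource(new StringReader(cleaned)));
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for fetched content: contains a hidden 0x1f character.
        String raw = "<r><c d=\"s\" n=\"j\u001f\"></c></r>";
        Document doc = parse(raw);
        System.out.println(doc.getDocumentElement().getTagName()); // r
    }
}
```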
