Garbled text (mojibake) with dom4j

1) Background

A long-running crawler that fetches XML suddenly started failing: garbled characters in the XML caused validation to fail.

2) How the garbled text arises

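Different sites return XML in different encodings: some use gb2312, others utf-8.
The crawler passed the byte stream from urlConnection.getInputStream() straight to SAXReader to build the Document.
Unfortunately SAXReader was not smart enough here. It received only a byte stream with no encoding information, so it decoded the bytes with the platform default encoding, and that is where the problem lies.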
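To make the failure concrete, here is a minimal sketch (not the crawler's code) that encodes a small XML fragment as gb2312 and then decodes it twice, once with the right charset and once with a wrong one. The class and method names are made up for illustration, and it assumes the JDK in use ships the GB2312 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongCharsetDemo {
    // Decode raw bytes with an explicitly chosen charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character word for "Chinese".
        String original = "<name>\u4e2d\u6587</name>";
        byte[] wire = original.getBytes(Charset.forName("GB2312"));

        // Decoding with the charset the sender used restores the text.
        System.out.println(decode(wire, Charset.forName("GB2312")).equals(original));
        // Decoding the same bytes as UTF-8 does not.
        System.out.println(decode(wire, StandardCharsets.UTF_8).equals(original));
    }
}
```

The bytes on the wire are identical in both calls; only the charset chosen by the reader differs.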

3) How SAXReader handles a byte stream when no encoding is given

org.gjt.xpp.sax2.Driver.parse(InputSource source)

if(encoding == null)
reader = new InputStreamReader(stream);

java.io.InputStreamReader

sd = StreamDecoder.forInputStreamReader(in, this, (String)null)
The charset name passed down is null.

sun.nio.cs.StreamDecoder.forInputStreamReader

               (InputStream in,Object lock,String charsetName)
               if (csn == null)
                        csn = Charset.defaultCharset().name();
               With no charset name, it falls back to the default charset.

java.nio.charset.Charset.defaultCharset()

           java.security.PrivilegedAction pa =
           new GetPropertyAction("file.encoding");
          String csn = (String)AccessController.doPrivileged(pa);
          Charset cs = lookup(csn);
          if (cs != null)
           defaultCharset = cs;
          else
           defaultCharset = forName("UTF-8");
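          The -Dfile.encoding system property is consulted first; if it was not set explicitly, its value is the operating system's default encoding. If that charset cannot be found, "UTF-8" is used.
          When the program runs inside Eclipse, Eclipse passes -Dfile.encoding itself, with the value set to your file encoding.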
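A quick way to see which default a given JVM ended up with is to print both values. This is only a diagnostic sketch with a made-up class name. On recent JDKs (18 and later, as far as I know) the default charset is UTF-8 unless overridden, so the result can differ from what an older JVM reports on the same machine.

```java
import java.nio.charset.Charset;

class DefaultCharsetCheck {
    // Report the system property and the charset the JVM actually resolved.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once from the IDE and once from the production start script shows whether the two environments decode bytes differently.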

4) How to determine the XML encoding

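See com.sun.syndication.io.XmlReader for reference.

Check the first line of the file for <?xml .... encoding="xx" ...?>
Check whether the HTTP response headers contain Content-Type: text/xml; charset=xx
Detect the BOM (byte order mark, for example the UTF-8 signature)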

            Take the first 3 bytes
            UTF_16BE:0xFE 0xFF
            UTF_16LE:0xFF   0xFE
            UTF_8: 0xEF 0xBB 0xBF
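The table above translates almost directly into code. The following is a sketch under the assumption that only these three BOMs matter; the class and method names are hypothetical, and UTF-32 BOMs are not handled (a UTF-32LE BOM would be reported as UTF-16LE).

```java
class BomDetector {
    /** Returns the charset name implied by a leading BOM, or null if no BOM is found. */
    static String detect(byte[] head) {
        // Mask with 0xFF because Java bytes are signed.
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }
}
```

A null result simply means "no BOM", which is the common case for gb2312 and for most UTF-8 feeds, so another method is still needed as a fallback.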

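            What testing actually shows:
            // how utf-16BE, utf-16LE, utf-16 and utf-8 encode "<"
            System.out.println(Arrays.toString("<".getBytes("utf-16BE"))); // [0, 60]
            System.out.println(Arrays.toString("<".getBytes("utf-16LE"))); // [60, 0]
            System.out.println(Arrays.toString("<".getBytes("utf-16")));   // [-2, -1, 0, 60]
            System.out.println(Arrays.toString("<".getBytes("utf-8")));    // [60]
            // is the BOM recognized when decoding?
            byte[] b1=new byte[]{-2,-1,0,60};
            System.out.println(new String(b1,"UTF-16BE")); // U+FEFF then <  (BOM kept as a character)
            System.out.println(new String(b1,"UTF-16")); // <  (BOM consumed)

            byte[] b2=new byte[]{-1,-2,60,0};
            System.out.println(new String(b2,"UTF-16LE")); // ?<  (BOM kept as a character)
            System.out.println(new String(b2,"UTF-16")); // <  (BOM consumed)

            byte[] b3=new byte[]{-17,-69,-65,60};
            System.out.println(new String(b3,"UTF-8")); // ?<  (BOM kept as a character)

           The results that keep the BOM in front of "<" are wrong; a plain "<" is correct. U+FEFF prints as "?" on consoles that cannot display it and as nothing at all on others.
            So in Java the BOM only serves the generic UTF-16 charset, which uses it to choose big endian or little endian. The UTF-16BE, UTF-16LE and UTF-8 decoders do not consume it, so decoding by itself cannot tell UTF-16BE, UTF-16LE, UTF-16 and UTF-8 apart.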

Guessing

          Take the first 4 bytes and check whether they match <?xm
          UTF_16BE: 0x00 0x3C 0x00 0x3F
          UTF_16LE: 0x3C 0x00 0x3F 0x00
          UTF_8: 0x3C 0x3F 0x78 0x6D
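The same guess can be sketched in a few lines. The names here are hypothetical. One caveat worth stating: the single-byte pattern 3C 3F 78 6D matches any ASCII-compatible encoding (gb2312 included), so a "UTF-8" result really means "ASCII-compatible, go read the encoding declaration".

```java
class PrologGuesser {
    /** Guesses the encoding family from the first four bytes of "<?xm", or returns null. */
    static String guess(byte[] head) {
        if (head.length < 4) {
            return null;
        }
        int b0 = head[0] & 0xFF, b1 = head[1] & 0xFF, b2 = head[2] & 0xFF, b3 = head[3] & 0xFF;
        if (b0 == 0x00 && b1 == 0x3C && b2 == 0x00 && b3 == 0x3F) {
            return "UTF-16BE"; // "<?" with the high byte first
        }
        if (b0 == 0x3C && b1 == 0x00 && b2 == 0x3F && b3 == 0x00) {
            return "UTF-16LE"; // "<?" with the low byte first
        }
        if (b0 == 0x3C && b1 == 0x3F && b2 == 0x78 && b3 == 0x6D) {
            return "UTF-8"; // "<?xm" in any ASCII-compatible encoding
        }
        return null;
    }
}
```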

5) The fix

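Use the first approach (read the encoding from the XML declaration):

Wrap the stream in a PushbackInputStream and pre-read a small amount of data (200 bytes)
Push the bytes that were read back (PushbackInputStream.unread) so the input stream stays complete
Use a regular expression to grab the first line of the data and extract the encoding
Construct new InputStreamReader(pis,encoding) and pass it to the XML reader (SAXReader in this crawler), so that it is never left guessing which encoding to decode with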
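The steps above could be sketched roughly as follows. This is not the original crawler code: the class name, the regular expression and the fallback parameter are assumptions made for illustration. The peeked bytes are decoded as ISO-8859-1 only to run the regex, which works for ASCII-compatible encodings such as gb2312 and utf-8 but not for UTF-16 input.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XmlEncodingSniffer {
    private static final int PEEK = 200; // number of bytes to pre-read
    private static final Pattern ENCODING = Pattern.compile(
            "<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");

    /**
     * Peeks at the XML declaration and returns a Reader that decodes the whole
     * stream with the declared encoding, or with the (non-null) fallback.
     */
    static Reader open(InputStream in, String fallback) throws IOException {
        PushbackInputStream pis = new PushbackInputStream(in, PEEK);
        byte[] head = new byte[PEEK];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until full or EOF.
        while (n < PEEK) {
            int r = pis.read(head, n, PEEK - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        if (n > 0) {
            pis.unread(head, 0, n); // put the bytes back so nothing is lost
        }
        String prolog = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(prolog);
        String encoding = m.find() ? m.group(1) : fallback;
        return new InputStreamReader(pis, encoding);
    }

    public static void main(String[] args) throws IOException {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"gb2312\"?><a>\u4e2d\u6587</a>"
                .getBytes("GB2312");
        Reader reader = open(new ByteArrayInputStream(xml), "UTF-8");
        StringBuilder sb = new StringBuilder();
        for (int c; (c = reader.read()) != -1; ) {
            sb.append((char) c);
        }
        System.out.println(sb);
    }
}
```

With dom4j, the returned Reader would then go to SAXReader.read(Reader) instead of handing SAXReader the raw InputStream.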

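6) A bug in org.dom4j.Document.asXML()

      After the steps above the input is decoded correctly and the Document parses successfully, so why is the output of org.dom4j.Document.asXML() still garbled?

     Look at the code: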
     public String asXML() {
       try {
           ByteArrayOutputStream out = new ByteArrayOutputStream();
           XMLWriter writer = new XMLWriter(out, outputFormat);
           writer.write(this);
           return out.toString();
       }
       catch (IOException e) {
           throw new RuntimeException("IOException while generating textual representation: " + e.getMessage());
       }
    }

6.1) Where is the problem?

out.toString()

java.io.ByteArrayOutputStream.toString()

return new String(buf, 0, count);

java.lang.String(byte bytes[], int offset, int length)

char[] v = StringCoding.decode(bytes, offset, length);

java.lang.StringCoding.decode(byte[] ba, int off, int len)

             String csn = Charset.defaultCharset().name();
             try {
               return decode(csn, ba, off, len);
             } catch (UnsupportedEncodingException x) {
               warnUnsupportedCharset(csn);
             }
      The bytes are decoded with the platform default encoding, which is what causes the problem.

6.2) How to fix it?

First check which encoding XMLWriter uses when it writes the document.

org.dom4j.io.XMLWriter(OutputStream out, OutputFormat format)

        this.writer = createWriter(out, format.getEncoding());
       It uses format.getEncoding().

        Once the encoding used for writing is known, the rest is easy:
        out.toString(outputFormat.getEncoding());
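The principle can be checked without dom4j. The sketch below (hypothetical class and method names) writes text into a ByteArrayOutputStream through a Writer with an explicit encoding and then decodes with ByteArrayOutputStream.toString(String charsetName) using the same encoding, which is the same pairing as out.toString(outputFormat.getEncoding()).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

class ToStringWithEncoding {
    /** Encodes text into a byte buffer, then decodes the buffer with the same encoding. */
    static String roundTrip(String text, String encoding) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Closing the writer flushes all encoded bytes into the buffer.
        try (Writer writer = new OutputStreamWriter(out, encoding)) {
            writer.write(text);
        }
        // Decode with the encoding that was used for writing,
        // not with the platform default as out.toString() would.
        return out.toString(encoding);
    }

    public static void main(String[] args) throws IOException {
        String text = "<a>\u4e2d\u6587</a>";
        System.out.println(roundTrip(text, "UTF-8").equals(text));
    }
}
```

Because the encoding is named on both sides, the result no longer depends on -Dfile.encoding or on the operating system's locale.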

7) Summary

Whenever a program processes a byte stream, it has to find out how those bytes are encoded; otherwise things will go wrong.

pwlazy
