How to get the charset from a string and from a file

a. Get charset from string

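import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

public String getCharsetFromString(String srcString) throws IOException {
   // Note: getBytes() encodes with the platform default charset, so this inspects
   // the first two bytes of that encoding, not the bytes the text originally came from.
   BufferedInputStream bin = new BufferedInputStream(
           new ByteArrayInputStream(srcString.getBytes()));
   // Combine the first two bytes into one 16-bit value.
   int p = (bin.read() << 8) + bin.read();
   String code = null;
   // The values 0xefbb, 0xfffe, 0xfeff and 0x5c75 at the beginning of the data can be used to guess the charset.
   switch (p) {
   case 0xefbb: // first two bytes of the UTF-8 BOM (EF BB BF)
       code = "UTF-8";
       break;
   case 0xfffe: // UTF-16 little-endian BOM (FF FE)
       code = "Unicode";
       break;
   case 0xfeff: // UTF-16 big-endian BOM (FE FF)
       code = "UTF-16BE";
       break;
   case 0x5c75: // the ASCII characters "\u", i.e. the start of a Unicode escape
       code = "ANSI|ASCII";
       break;
   default:
       code = "ISO-8859-1";
   }
   return code;
}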
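
Since a BOM only exists in raw bytes, the same check is usually more reliable when it is run on the bytes themselves before they are decoded into a String. Here is a minimal sketch of that idea; the class name BomSniffer and its methods are made up for this example, and it only knows the UTF-8 and UTF-16 marks (a UTF-32LE BOM would be reported as UTF-16LE).

```java
import java.nio.charset.StandardCharsets;

class BomSniffer {
    // Returns the charset name implied by a leading byte-order mark,
    // or null when the data does not start with a known BOM.
    static String charsetFromBom(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Compares the first bytes of data (as unsigned values) with the given prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(charsetFromBom(utf8WithBom));
        System.out.println(charsetFromBom("hi".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

A null result only means "no BOM found"; most UTF-8 files have no BOM at all, so the caller still needs a fallback.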


b. Get charset from file (not sure)

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public String getCharsetFromFile(String filePath) throws IOException {
   FileInputStream fis = null;
   InputStreamReader isr = null;
   String s = null;
   try {
       // a new input stream reader is created (it uses the platform default charset)
       fis = new FileInputStream(filePath);
       isr = new InputStreamReader(fis);
       // the name of the character encoding the reader was set up with
       s = isr.getEncoding();
   } catch (Exception e) {
       // print error
       System.out.print("Could not read the file: " + e);
   } finally {
       // closes the streams and releases the associated resources
       if (isr != null)
           isr.close();
       if (fis != null)
           fis.close();
   }
   return s;
}

You cannot determine the encoding of an arbitrary byte stream. This is the nature of encodings. An encoding is a mapping between byte values and the characters they represent, so every encoding "could" be the right one.

The getEncoding() method will return the encoding which was set up (read the JavaDoc) for the stream. It will not guess the encoding for you.
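
A small sketch makes this visible: the bytes below are UTF-8, but the reader is opened as ISO-8859-1, and getEncoding() simply reports what the reader was given. The class and method names are invented for the example. Note that getEncoding() may return a historical name for the charset rather than the canonical one.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class GetEncodingDemo {
    // Opens a reader over the bytes with the given charset name and returns
    // whatever getEncoding() reports for it. The bytes are never inspected.
    static String reportedEncoding(byte[] data, String charsetName) throws IOException {
        try (InputStreamReader reader =
                 new InputStreamReader(new ByteArrayInputStream(data), charsetName)) {
            return reader.getEncoding();
        }
    }

    public static void main(String[] args) throws IOException {
        // "caf\u00e9" encoded as UTF-8, then (wrongly) opened as ISO-8859-1.
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(reportedEncoding(utf8Bytes, "ISO-8859-1"));
        System.out.println(reportedEncoding(utf8Bytes, "UTF-8"));
    }
}
```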

Some streams tell you which encoding was used to create them: XML, HTML. But not an arbitrary byte stream.

Anyway, you could try to guess an encoding on your own if you have to. Every language has a typical frequency for each character. In English the character e appears very often, but ê appears very rarely. In an ISO-8859-1 stream there are usually no 0x00 bytes, but a UTF-16 stream of mostly Latin text has a lot of them.
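
The 0x00 observation can be turned into a very rough check. The sketch below counts zero bytes and calls the data "probably UTF-16" when they make up more than a fifth of it. The class name and the 0.2 threshold are assumptions picked for illustration, not tuned values, and the check will fail for UTF-16 text that is mostly non-Latin.

```java
import java.nio.charset.StandardCharsets;

class ZeroByteHeuristic {
    // Fraction of bytes in the sample that are 0x00 (0.0 for an empty sample).
    static double zeroByteRatio(byte[] data) {
        if (data.length == 0) return 0.0;
        int zeros = 0;
        for (byte b : data) {
            if (b == 0) zeros++;
        }
        return (double) zeros / data.length;
    }

    // Guess only: 0.2 is an arbitrary threshold chosen for this example.
    static boolean looksLikeUtf16(byte[] data) {
        return zeroByteRatio(data) > 0.2;
    }

    public static void main(String[] args) {
        String text = "hello world";
        System.out.println(looksLikeUtf16(text.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println(looksLikeUtf16(text.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```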

Or: you could ask the user. I've already seen applications which present you a snippet of the file in different encodings and ask you to select the "correct" one.
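
The "ask the user" approach could look roughly like this: decode the first bytes of the file with each candidate charset and show the results side by side. The class name, method name and the choice of candidates are all made up for this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EncodingPreview {
    // Decodes at most maxBytes of the data with every candidate charset.
    // The result maps charset name to decoded snippet, in candidate order.
    static Map<String, String> previews(byte[] data, int maxBytes, List<Charset> candidates) {
        int len = Math.min(maxBytes, data.length);
        Map<String, String> result = new LinkedHashMap<>();
        for (Charset cs : candidates) {
            result.put(cs.name(), new String(data, 0, len, cs));
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] data = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        List<Charset> candidates =
                Arrays.asList(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        // Print one line per candidate so the user can pick the readable one.
        for (Map.Entry<String, String> e : previews(data, 64, candidates).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

Cutting the sample at a fixed byte count can split a multi-byte character at the end, so the last character of a snippet may show up as a replacement character even in the right encoding.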


c. Get charset from file (sure)

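import java.io.IOException;
import java.io.InputStream;
import org.mozilla.universalchardet.UniversalDetector;

private String getCharsetByInputStream(InputStream ins){
   String charset = "";
   if(null != ins){
       UniversalDetector detector = new UniversalDetector(null);
       try {
           // use a fixed-size buffer: available() may return 0, which would
           // give an empty buffer and end the loop before any data is read
           byte[] buf = new byte[4096];
           int nread;

           // feed the detector until it is confident or the stream ends
           while ((nread = ins.read(buf)) > 0 && !detector.isDone()) {
               detector.handleData(buf, 0, nread);
           }
       } catch (IOException e) {
           LOG.error("--getCharsetByInputStream:error happened while getting charset from inputstream. ",e);
           charset = "utf-8";
           return charset;
       }
       detector.dataEnd();
       charset = detector.getDetectedCharset();
       // fall back to utf-8 when the detector cannot decide
       if (charset == null || "".equals(charset)) {
           charset = "utf-8";
       }

       detector.reset();
   }else{
       charset = "utf-8";
   }

   return charset;
}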

link to http://code.google.com/p/juniversalchardet/


Then read the InputStream as a string with the detected charset

private String parseTruncaredSizeBinaryResourceToString(Integer resourceKey, Integer limitedSize){
   String truncaredResourceText = null;
   InputStream ins = proactiveAnalysisService.retrieveGlobalResourceBinary(resourceKey, null, limitedSize);
   if (null != ins) {
       String charset = getCharsetByInputStream(ins);
       InputStreamReader reader = null;

       try {
           // Detecting the charset reads the stream to its end, so go back to the beginning.
           // reset() only works if the stream supports mark/reset (e.g. ByteArrayInputStream).
           ins.reset();
       } catch (IOException e) {
           LOG.error("-- parseTruncaredSizeBinaryResourceToString:error happened while resetting resource inputstream. ",e);
       }

       try {
           reader = new InputStreamReader(ins, charset);
       } catch (UnsupportedEncodingException e) {
           LOG.error("-- parseTruncaredSizeBinaryResourceToString:unsupported charset for resource inputstream. ",e);
           return truncaredResourceText;
       }

       // Collect decoded chars in a StringBuilder. Writing them to a byte stream
       // would keep only the low 8 bits of each char and corrupt non-ASCII text.
       StringBuilder out = new StringBuilder();
       try {
           int i = -1;
           while ((i = reader.read()) != -1) {
               out.append((char) i);
           }

           truncaredResourceText = out.toString();
       } catch (IOException e) {
           LOG.error("-- parseTruncaredSizeBinaryResourceToString:Error happened when reading inputstream. ", e);
       } finally {
           try{
               if(null != reader){
                  reader.close();
               }
               if (null != ins) {
                   ins.close();
               }
           }catch(IOException e){
               LOG.error("-- parseTruncaredSizeBinaryResourceToString:Error happened while closing inputstream. ",e);
           }

       }
   }
   return truncaredResourceText;

}


link to http://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream

and how to change the index position of an InputStream: http://stackoverflow.com/questions/3474911/changing-the-index-positioning-in-inputstream
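
Going back to the start of a stream after sniffing it depends on mark/reset support, which not every InputStream has. Here is a minimal sketch, with a made-up class and method name, of reading the first bytes for detection and then rewinding. Wrapping the source in a BufferedInputStream is one way to get mark/reset on a stream that does not support it natively.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MarkResetDemo {
    // Reads up to limit bytes for inspection, then rewinds the stream so the
    // caller can read it again from the same position.
    static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream does not support mark/reset");
        }
        in.mark(limit); // remember the current position for up to limit bytes
        byte[] buf = new byte[limit];
        int total = 0;
        while (total < limit) {
            int n = in.read(buf, total, limit - total);
            if (n < 0) break; // end of stream
            total += n;
        }
        in.reset(); // jump back to the marked position
        return Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream("hello".getBytes(StandardCharsets.US_ASCII)));
        byte[] head = peek(in, 3);
        System.out.println(new String(head, StandardCharsets.US_ASCII));
        System.out.println((char) in.read()); // reading starts from the beginning again
    }
}
```

The limit matters: once more than limit bytes have been read after mark(), the stream is allowed to forget the mark, so the detection step should not read past it.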

