Notes on Chinese Web Data Extraction in Java (part 1)

Reposted from http://isaacyang.wordpress.com/

Note. The code was developed with Eclipse and tested under JDK 1.6. To make the code run correctly, set the encoding of the project to utf-8 and include the necessary libraries. All the code will be available at http://sourceforge.net/projects/ptawebdataextra.

1. Correctly Loading a Chinese Web Page

Correctly loading a Chinese Web page in Java is not a trivial task. Given a target URL, you need to read the raw content from the URL and then decode it using the right encoding. Chinese Web pages may be encoded in utf-8, gbk, gb2312, gb18030, big5, and so on. If you decode a page with the wrong encoding, you get meaningless characters interspersed with HTML tags. The same holds for most Web pages not written in English. Java does not detect the encoding automatically, so your code is responsible for detecting the encoding of each Web page.
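To make the problem concrete, here is a minimal sketch that decodes the same bytes with the right and the wrong encoding. The class name WrongCharsetDemo and its decode helper are made up for this illustration, and the sketch assumes the JDK in use ships the GBK charset (most full JDKs do).

```java
import java.nio.charset.Charset;

class WrongCharsetDemo {
    /** Decodes raw bytes with the named charset; malformed input becomes U+FFFD. */
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is a two-character Chinese word, written with Unicode
        // escapes so that the encoding of this source file does not matter.
        String original = "\u4e2d\u6587";

        // Pretend these bytes arrived from a gbk-encoded Web page.
        byte[] raw = original.getBytes(Charset.forName("GBK"));

        String right = decode(raw, "GBK");   // same encoding the page used
        String wrong = decode(raw, "UTF-8"); // a guess that does not match

        System.out.println("decoded as GBK matches:   " + right.equals(original));
        System.out.println("decoded as UTF-8 matches: " + wrong.equals(original));
    }
}
```

The bytes never change; only the interpretation does, which is why the encoding has to be known before the content is turned into a String.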

The first step is to open an HTTP connection to the target URL. This can be done using the following code.

// Imports needed by this method (it lives in the NetUtilities class used later).
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

public static HttpURLConnection getConnection(URL url)
    throws IOException
{
    HttpURLConnection httpurlconnection = null;
    try {
        URLConnection urlconnection = url.openConnection();
        // Timeouts are in milliseconds and must be set before connect().
        urlconnection.setConnectTimeout(60000);
        urlconnection.setReadTimeout(60000);
        urlconnection.connect();

        // Only HTTP connections are handled here.
        if (!(urlconnection instanceof HttpURLConnection)) {
            return null;
        }

        httpurlconnection = (HttpURLConnection) urlconnection;
        int responsecode = httpurlconnection.getResponseCode();
        switch (responsecode) {
        case HttpURLConnection.HTTP_OK:
        case HttpURLConnection.HTTP_MOVED_PERM:
        case HttpURLConnection.HTTP_MOVED_TEMP:
            break;
        default:
            System.err.println("Invalid response code: " +
                responsecode + " " + url);
            httpurlconnection.disconnect();
            return null;
        }
    } catch (IOException ioexception) {
        System.err.println("unable to connect: " + ioexception);
        if (httpurlconnection != null) {
            httpurlconnection.disconnect();
        }
        throw ioexception;
    }
    return httpurlconnection;
}

The code first gets a URLConnection instance and then sets the timeout parameters. These parameters must be set before calling the connect() method. It then calls the getResponseCode() method to get the response code. If the code is acceptable (200, 301 or 302), the method returns the connection cast to HttpURLConnection; otherwise it disconnects and returns null.

The next step is to get an InputStream from the HttpURLConnection. The code tries up to 3 times and returns null if every attempt fails.

// Additional import needed by this method: java.io.InputStream
public static InputStream getInputStream(HttpURLConnection connection)
{
    InputStream inputstream = null;
    // Try up to 3 times; stop at the first success.
    for (int i = 0; i < 3; ++i) {
        try {
            inputstream = connection.getInputStream();
            break;
        } catch (IOException e) {
            System.err.println("error opening connection " + e);
        }
    }
    return inputstream;
}
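The same retry-then-give-up pattern can be written once and reused. The following is only a sketch: RetryDemo, retry and maxAttempts are names invented for this illustration, and it assumes Java 8 or later for the lambda, which is newer than the JDK 1.6 mentioned in the note above.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

class RetryDemo {
    /**
     * Runs the action up to maxAttempts times and returns its first successful
     * result, or null if every attempt throws (mirroring getInputStream above).
     */
    static <T> T retry(Callable<T> action, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                System.err.println("attempt " + attempt + " failed: " + e);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        // Simulated flaky operation: fails twice, then succeeds.
        String result = RetryDemo.<String>retry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated failure");
            }
            return "connected";
        }, 3);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Like the original loop, this sketch retries immediately; a pause between attempts could be added if the server needs time to recover.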

The third step is the most important part: it reads the content from the connection and detects the encoding at the same time. The code is as follows.

// Additional imports needed by this method; UniversalDetector comes from juniversalchardet.
import java.util.ArrayList;
import java.util.List;
import org.mozilla.universalchardet.UniversalDetector;

public static final int STREAM_BUFFER_SIZE = 4096;
public static final String DEFAULT_ENCODING = "utf-8";

// Returns { encoding, decoded content }, or null if no input stream could be opened.
public static String[] getContent(HttpURLConnection connection)
    throws IOException
{
    InputStream inputstream = null;
    try {
        // Raw chunks and the number of valid bytes in each chunk.
        List<byte[]> byteList = new ArrayList<byte[]>();
        List<Integer> byteLength = new ArrayList<Integer>();
        inputstream = getInputStream(connection);
        if (inputstream == null) {
            return null;
        }
        UniversalDetector detector = new UniversalDetector(null);
        byte[] buf = new byte[STREAM_BUFFER_SIZE];
        int nread = 0, nTotal = 0;
        // Phase 1: read and feed the detector until it has made up its mind.
        while ((nread = inputstream.read(buf, 0, STREAM_BUFFER_SIZE)) > 0) {
            byteList.add(buf.clone());
            byteLength.add(nread);
            nTotal += nread;
            detector.handleData(buf, 0, nread);
            if (detector.isDone())
                break;
        }
        detector.dataEnd();
        String encoding = detector.getDetectedCharset();
        detector.reset();
        if (encoding == null) {
            encoding = DEFAULT_ENCODING;
        }
        // Phase 2: read the remaining bytes without decoding them yet.
        while ((nread = inputstream.read(buf, 0, STREAM_BUFFER_SIZE)) > 0) {
            byteList.add(buf.clone());
            byteLength.add(nread);
            nTotal += nread;
        }
        // Concatenate all chunks into one array, then decode once.
        byte[] contentByte = new byte[nTotal];
        int offSet = 0, l = byteList.size();
        for (int i = 0; i < l; ++i) {
            byte[] bytes = byteList.get(i);
            int length = byteLength.get(i);
            System.arraycopy(bytes, 0, contentByte, offSet, length);
            offSet += length;
        }
        return new String[] { encoding, new String(contentByte, encoding) };
    } finally {
        if (inputstream != null) {
            inputstream.close();
        }
    }
}

The encoding detection is done with a library called ‘juniversalchardet’. It is a Java implementation of ‘universalchardet’, the encoding detector library of Mozilla. To use the library, you construct an instance of org.mozilla.universalchardet.UniversalDetector and feed some data to the detector by calling UniversalDetector.handleData(). After notifying the detector of the end of data by calling UniversalDetector.dataEnd(), you can get the detected encoding name by calling UniversalDetector.getDetectedCharset(). Please refer to http://code.google.com/p/juniversalchardet for more details.
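When the detector returns null, the code above falls back to DEFAULT_ENCODING. A complementary idea, which is not part of the original code, is to look at the charset declared in the HTTP Content-Type header before giving up. The sketch below uses only the standard library; HeaderCharsetDemo and charsetFromContentType are hypothetical names introduced for this illustration.

```java
import java.nio.charset.Charset;
import java.util.Locale;

class HeaderCharsetDemo {
    /**
     * Returns the charset named in a Content-Type value such as
     * "text/html; charset=gb2312", or the fallback if none is declared
     * or the declared name is not supported by this JVM.
     */
    static String charsetFromContentType(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (!p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                continue;
            }
            String name = p.substring("charset=".length()).trim();
            // Some servers quote the value: charset="utf-8"
            if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                name = name.substring(1, name.length() - 1);
            }
            try {
                if (Charset.isSupported(name)) {
                    return name;
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name: ignore and fall through.
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContentType("text/html; charset=UTF-8", "utf-8"));
        System.out.println(charsetFromContentType("text/html", "utf-8"));
    }
}
```

In getContent, the header value could be obtained with connection.getContentType(). Servers sometimes declare the wrong charset, so this is best treated as a hint rather than a replacement for detection.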

The getContent function reads bytes from the input stream and feeds them to the encoding detector. The bytes are also stored in a list. When the encoding detection is done, the function reads the remaining bytes and concatenates all the bytes into one array. It then decodes the whole array using the detected encoding. Here is a small trick: do not decode the remaining bytes on their own using the detected encoding, because the detection may stop in the middle of a character (a Chinese character usually takes two bytes in gbk or big5 and three in utf-8). If the detection stops in the middle of a character, the remaining bytes start with the tail of that character followed by complete characters, and decoding them separately produces unreadable characters.
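The effect of a cut in the middle of a character can be reproduced without any network access. The following sketch uses utf-8, where each of the sample characters takes three bytes; SplitCharacterDemo and its helper methods are illustrative names, and it assumes Java 7 or later for StandardCharsets.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SplitCharacterDemo {
    /** Fragile: decodes every chunk on its own and joins the strings. */
    static String decodeChunkByChunk(byte[][] chunks, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte[] chunk : chunks) {
            sb.append(new String(chunk, charset));
        }
        return sb.toString();
    }

    /** Robust: joins the raw bytes first and decodes once, as getContent does. */
    static String decodeAfterJoining(byte[][] chunks, Charset charset) {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            all.write(chunk, 0, chunk.length);
        }
        return new String(all.toByteArray(), charset);
    }

    /** Splits data into two chunks at the given byte index. */
    static byte[][] splitAt(byte[] data, int index) {
        return new byte[][] {
            Arrays.copyOfRange(data, 0, index),
            Arrays.copyOfRange(data, index, data.length)
        };
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes; 6 bytes in utf-8.
        String original = "\u4e2d\u6587";
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);

        // Cut after byte 4, i.e. inside the second character.
        byte[][] chunks = splitAt(bytes, 4);

        String chunked = decodeChunkByChunk(chunks, StandardCharsets.UTF_8);
        String joined = decodeAfterJoining(chunks, StandardCharsets.UTF_8);

        System.out.println("chunk-by-chunk intact: " + chunked.equals(original));
        System.out.println("join-then-decode intact: " + joined.equals(original));
    }
}
```

A ByteArrayOutputStream is used here instead of the two lists in getContent purely to keep the sketch short; the principle of decoding only after all bytes are joined is the same.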

Finally, putting the above three functions together gives the following function, which reads the content from a specific URL.

public static String getContent(URL url)
{
    HttpURLConnection connection = null;
    try {
        connection = NetUtilities.getConnection(url);
        if (connection != null) {
            // resource[0] is the encoding, resource[1] the decoded content.
            String[] resource = NetUtilities.getContent(connection);
            if (resource != null) {
                return resource[1];
            }
        }
    } catch (Exception e) {
        // Errors are swallowed here; the caller only sees a null result.
    } finally {
        if (connection != null) {
            connection.disconnect();
        }
    }
    return null;
}