Fetching HTTPS web page data with HttpClient

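I was recently given a task that required scraping part of the content of an HTTPS site, and I used HttpClient (4.5.x) for it. HttpClient started out as a subproject of Apache Jakarta Commons (it now lives in Apache HttpComponents) and provides an efficient, up-to-date, feature-rich client-side toolkit for the HTTP protocol that tracks the protocol's standards and recommendations. First, a look at basic HttpClient usage.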

  • Wrap an HttpClient GET request in a helper method:
public String getHtml(String url) {
        String html = null;
        for (int i = 1; i <= 3; i++) {// try at most 3 times
            HttpGet httpget = new HttpGet(url);// request the URL with GET
            httpget.addHeader(HttpHeaders.USER_AGENT,
                    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0");
            RequestConfig requestConfig = RequestConfig.custom()
                    .setSocketTimeout(10000).setConnectTimeout(10000).build();// socket (read) and connect timeouts, in ms
            httpget.setConfig(requestConfig);
            // try-with-resources closes the response and the client on every path;
            // it replaces httpclient.getConnectionManager().shutdown(), which is deprecated since 4.3
            try (CloseableHttpClient httpclient = HttpClients.createDefault();
                    CloseableHttpResponse response = httpclient.execute(httpget)) {
                int resStatu = response.getStatusLine().getStatusCode();// status code
                System.out.println("Status code: " + resStatu);
                if (resStatu == HttpStatus.SC_OK) {// 200 OK
                    // get the response entity
                    HttpEntity entity = response.getEntity();
                    if (entity != null) {
                        html = EntityUtils.toString(entity, "UTF-8");
                        html = html.replace("&nbsp;", " ");
                        break;
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return html;
    }

This method works fine for ordinary HTTP pages, but using it to fetch some HTTPS pages throws the following exception:
unable to find valid certification path to requested target
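The exception means the JVM does not trust the site's certificate chain, so the site's certificate has to be imported into a trust store. Now let's actually fetch an HTTPS page, using a Google mirror site I have been using a lot recently (https://www.xichuan.pub/scholar?hl=zh-CN&q=hand&btnG=&lr=) as the example.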
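Because this failure is a trust problem rather than a transient network error, retrying the request the way getHtml does will not make it go away. A minimal sketch of how such a failure could be recognized by walking the exception's cause chain follows; the class name CertPathErrors and the method isUntrustedCertificate are hypothetical helpers, not part of HttpClient, and matching on the message text is an assumption that may not hold on every JVM.

```java
/** Hypothetical helper for spotting certificate trust failures; not part of HttpClient. */
final class CertPathErrors {

    // Message fragment taken from the exception shown above.
    private static final String MARKER = "unable to find valid certification path";

    private CertPathErrors() {
    }

    /** Returns true if the throwable or any of its causes carries the marker message. */
    static boolean isUntrustedCertificate(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            String msg = cur.getMessage();
            if (msg != null && msg.contains(MARKER)) {
                return true;
            }
        }
        return false;
    }
}
```

A catch block could call CertPathErrors.isUntrustedCertificate(e) and stop retrying when it returns true.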
1. Export the site's certificate (Google Chrome):
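Click the padlock icon in the browser's address bar, click "view certificate" in the panel that opens on the right, switch to the Details tab, and export the certificate as Base64-encoded X.509 (.cer).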
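For orientation, the Base64-encoded X.509 option writes the certificate as text between BEGIN CERTIFICATE and END CERTIFICATE marker lines. The sketch below shows how that text form can be produced from a certificate's raw DER bytes using only the JDK; the class name PemFormat and the method toPem are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper illustrating the Base64 (PEM) text form of a certificate. */
final class PemFormat {

    private PemFormat() {
    }

    /** Wraps DER bytes in BEGIN/END CERTIFICATE lines with 64-character Base64 rows. */
    static String toPem(byte[] der) {
        Base64.Encoder encoder =
                Base64.getMimeEncoder(64, "\n".getBytes(StandardCharsets.US_ASCII));
        return "-----BEGIN CERTIFICATE-----\n"
                + encoder.encodeToString(der)
                + "\n-----END CERTIFICATE-----\n";
    }
}
```

Opening the exported .cer file in a text editor should show the same two marker lines if the Base64 option was chosen.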
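2. Import the certificate into a keystore:
Use the keytool utility that ships with Java to import the exported .cer certificate into a keystore that HttpClient can use. Open cmd and change to the JDK's bin directory.
Run the following command: keytool -import -alias Root -file d:/Root.cer -keystore "d:/trust.keystore" -storepass 123456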
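To check that the import worked, the keystore can be opened from Java and its aliases listed. This is a minimal sketch using only the JDK's java.security.KeyStore; the class name TrustStoreInspector and the method listAliases are hypothetical, and it assumes the JDK's default keystore type can read the file keytool produced.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.KeyStore;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper that lists the entries of a keystore file. */
final class TrustStoreInspector {

    private TrustStoreInspector() {
    }

    /** Loads the keystore at the given path and returns the aliases it contains. */
    static List<String> listAliases(String path, char[] password) throws Exception {
        KeyStore keyStore = KeyStore.getInstance(KeyStore.getDefaultType());
        try (InputStream in = new FileInputStream(path)) {
            keyStore.load(in, password); // throws if the password is wrong or the file is corrupt
        }
        return Collections.list(keyStore.aliases());
    }
}
```

With the command above, a call such as TrustStoreInspector.listAliases("d:/trust.keystore", "123456".toCharArray()) should return a list containing the imported entry.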
3. Access the HTTPS site with an SSL-enabled HttpClient instance.

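import java.io.File;
import java.io.IOException;

import javax.net.ssl.SSLContext;

import org.apache.http.HttpEntity;
import org.apache.http.HttpHeaders;
import org.apache.http.HttpStatus;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.conn.ssl.TrustSelfSignedStrategy;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.ssl.SSLContexts;
import org.apache.http.util.EntityUtils;

public class SSLHttpClient {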

    public static String gethtml(String url) {
        String html = "";
        CloseableHttpClient httpclient = null;
        CloseableHttpResponse response = null;
        try {
            SSLConnectionSocketFactory sslsf = createSSLConnSocketFactory();
            httpclient = HttpClients.custom()
                .setSSLSocketFactory(sslsf).build();
            HttpGet httpget = new HttpGet(url);
            httpget.addHeader(HttpHeaders.USER_AGENT,
                    "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0");
            RequestConfig requestConfig = RequestConfig.custom()
                    .setSocketTimeout(10000).setConnectTimeout(10000).build();// socket (read) and connect timeouts, in ms
            httpget.setConfig(requestConfig);
            System.out.println("Executing request " + httpget.getRequestLine());
            response = httpclient.execute(httpget);
            HttpEntity entity = response.getEntity();
            System.out.println("----------------------------------------");
            System.out.println(response.getStatusLine());
            int resStatu = response.getStatusLine().getStatusCode();// status code
            if (resStatu == HttpStatus.SC_OK) {// 200 OK; anything else is treated as a failure
                // get the response entity
                if (entity != null) {
                    html = EntityUtils.toString(entity, "UTF-8");
                    html = html.replace("&nbsp;", " ");
                }
            }
            EntityUtils.consume(entity);
        } catch(Exception e){
            e.printStackTrace();
        }finally{
            if(response!=null){
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if(httpclient!=null){
                try {
                    httpclient.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return html;
    }

    // build the SSL socket factory from the trust store created in step 2
    private static SSLConnectionSocketFactory createSSLConnSocketFactory()
            throws Exception {
        SSLContext sslcontext = SSLContexts
                .custom()
                .loadTrustMaterial(
                        new File(
                                "C://Users//cloud//Desktop//證書//trust.keystore"),
                        "123456".toCharArray(), new TrustSelfSignedStrategy())
                .build();
        SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(
                sslcontext, new String[] { "TLSv1.2" }, null, // TLSv1 is disabled by default on current JDKs
                SSLConnectionSocketFactory.getDefaultHostnameVerifier());
        return sslsf;
    }
}
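For comparison, roughly the same trust setup can be sketched with the JDK alone, which makes it easier to see what the trust store contributes. The class name JdkTrustStoreContext and the method fromTrustStore are hypothetical; unlike the code above, this sketch does not add TrustSelfSignedStrategy, so only certificates present in the store are trusted.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.KeyStore;

import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

/** Hypothetical JDK-only helper that builds an SSLContext from a trust store file. */
final class JdkTrustStoreContext {

    private JdkTrustStoreContext() {
    }

    static SSLContext fromTrustStore(String path, char[] password) throws Exception {
        // Load the keystore produced by keytool
        KeyStore trustStore = KeyStore.getInstance(KeyStore.getDefaultType());
        try (InputStream in = new FileInputStream(path)) {
            trustStore.load(in, password);
        }

        // Derive trust managers that accept only chains anchored in this store
        TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trustStore);

        // No client certificate (key managers) and the default SecureRandom
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, tmf.getTrustManagers(), null);
        return context;
    }
}
```

In HttpClient 4.5 the resulting context could be handed to HttpClients.custom().setSSLContext(context) in place of a custom socket factory.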

Test method:

public static void main(String[] args){
        String html = SSLHttpClient.gethtml("https://www.xichuan.pub/scholar?hl=zh-CN&q=hand&btnG=&lr=");
        if(html!=null&&!html.equals("")){
            Document doc = Jsoup.parse(html);
            if(doc!=null){
                Elements eles = doc.select("#gs_ccl_results div.gs_r h3.gs_rt a");
                if(eles!=null&&eles.size()!=0){
                    for(int i=0;i<eles.size();i++){
                        System.out.println(i+1+"-"+eles.get(i).text());
                    }
                }
            }
        }
    }

Test results

All done.
