Java crawler: crawl high, crawl low, crawl through life, and crawl Sina

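Introduction to HttpClient

HttpClient began as a subproject of Apache Jakarta Commons (it is now maintained under Apache HttpComponents). It is an efficient, up-to-date, feature-rich client-side toolkit for the HTTP protocol, and it keeps up with recent HTTP standards and recommendations. Its main features are:

(1) Implements all HTTP methods (GET, POST, PUT, HEAD, etc.)

(2) Follows redirects automatically

(3) Supports HTTPS

(4) Supports proxy servers, and more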

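Introduction to Jsoup

jsoup is an HTML parser for Java that can parse HTML directly from a URL or from a string of HTML text. It offers a convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. Its main features are:

(1) Parse HTML from a URL, a file, or a string;

(2) Find and extract data with DOM traversal or CSS selectors;

(3) Manipulate HTML elements, attributes, and text.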

Usage steps

Code
import org.apache.http.HttpEntity;

import org.apache.http.client.config.RequestConfig;

import org.apache.http.client.methods.CloseableHttpResponse;

import org.apache.http.client.methods.HttpGet;

import org.apache.http.client.protocol.HttpClientContext;

import org.apache.http.impl.client.CloseableHttpClient;

import org.apache.http.impl.client.HttpClientBuilder;

import org.apache.http.util.EntityUtils;

import org.jsoup.Jsoup;

import org.jsoup.nodes.Document;

import org.jsoup.nodes.Element;

import org.junit.Test;

import java.util.List;

/**
 * HttpClient & Jsoup library test class
 * Created by xuyh at 2017/11/6 15:28.
 */
public class HttpClientJsoupTest {

@Test

public void test() {

    //Fetch the page with HttpClient and read the response body as plain text

    HttpGet httpGet = new HttpGet("http://sports.sina.com.cn/");

    httpGet.setConfig(RequestConfig.custom().setSocketTimeout(30000).setConnectTimeout(30000).build());

    CloseableHttpClient httpClient = null;

    CloseableHttpResponse response = null;

    String responseStr = "";

    try {

        httpClient = HttpClientBuilder.create().build();

        HttpClientContext context = HttpClientContext.create();

        response = httpClient.execute(httpGet, context);

        int state = response.getStatusLine().getStatusCode();

        HttpEntity entity = response.getEntity();

        //Keep the body only for a 200 response; otherwise responseStr stays empty
        if (state == 200 && entity != null)

            responseStr = EntityUtils.toString(entity, "utf-8");

    } catch (Exception e) {

        e.printStackTrace();

    } finally {

        try {

            if (response != null)

                response.close();

            if (httpClient != null)

                httpClient.close();

        } catch (Exception ex) {

            ex.printStackTrace();

        }

    }

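    //responseStr is never null here, so check for an empty body instead
    if (responseStr.isEmpty())
        return;

    //Parse the text into a Jsoup Document and work on it
    Document document = Jsoup.parse(responseStr);

    //first() returns null when no element carries this class, e.g. after a page redesign
    Element newsBlock = document.getElementsByAttributeValue("class", "phdnews_txt fr").first();
    if (newsBlock == null)
        return;

    List<Element> elements = newsBlock.getElementsByAttributeValue("class", "phdnews_hdline");

    //Print the link target and the link text of every headline
    elements.forEach(element -> {
        for (Element e : element.getElementsByTag("a")) {
            System.out.println(e.attr("href"));
            System.out.println(e.text());
        }
    });

}

}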

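Detailed explanation

Create an HttpGet object that sends a GET request to http://sports.sina.com.cn/, and set both the socket timeout and the connection timeout to 30000 ms. The client executes the request, the response body is read as a UTF-8 string, and Jsoup parses that string into a Document. The code then finds the elements whose class is phdnews_hdline inside the phdnews_txt fr block and prints the href and text of each link.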
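To make the two timeouts more concrete, here is a rough analogue that uses only the JDK's built-in java.net.http client (Java 11 or later), not the Apache HttpClient used in this article. It is a sketch under assumptions: the class name TimeoutSketch and its helper methods are hypothetical, and the JDK request timeout bounds the wait for the response, which is only approximately the same thing as Apache's socket timeout.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

public class TimeoutSketch {
    // Hypothetical constant mirroring the 30000 ms used in the article
    static final Duration TIMEOUT = Duration.ofMillis(30000);

    // Client-level limit on how long establishing the connection may take
    static HttpClient buildClient() {
        return HttpClient.newBuilder().connectTimeout(TIMEOUT).build();
    }

    // Request-level limit on how long to wait for the response
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url)).timeout(TIMEOUT).GET().build();
    }

    public static void main(String[] args) {
        HttpClient client = buildClient();
        HttpRequest request = buildRequest("http://sports.sina.com.cn/");
        // Nothing is sent here; the sketch only shows where each timeout is configured
        System.out.println(client.connectTimeout().get().toMillis());
        System.out.println(request.timeout().get().toMillis());
    }
}
```

Keeping the two limits separate means a slow connection attempt and a slow response can be bounded independently, which is the same reason the article's code sets both values on RequestConfig.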

The calls to HttpClient and Jsoup can be wrapped in a utility class, as follows:

import org.apache.http.HttpEntity;

import org.apache.http.NameValuePair;

import org.apache.http.client.CookieStore;

import org.apache.http.client.config.RequestConfig;

import org.apache.http.client.entity.UrlEncodedFormEntity;

import org.apache.http.client.methods.CloseableHttpResponse;

import org.apache.http.client.methods.HttpGet;

import org.apache.http.client.methods.HttpPost;

import org.apache.http.client.protocol.HttpClientContext;

import org.apache.http.conn.ssl.SSLConnectionSocketFactory;

import org.apache.http.cookie.Cookie;

import org.apache.http.entity.ContentType;

import org.apache.http.entity.StringEntity;

import org.apache.http.impl.client.CloseableHttpClient;

import org.apache.http.impl.client.HttpClientBuilder;

import org.apache.http.impl.client.HttpClients;

import org.apache.http.message.BasicNameValuePair;

import org.apache.http.ssl.SSLContextBuilder;

import org.apache.http.util.EntityUtils;

import org.jsoup.Jsoup;

import org.jsoup.nodes.Document;

import javax.net.ssl.*;

import java.io.IOException;

import java.security.GeneralSecurityException;

import java.util.ArrayList;

import java.util.HashMap;

import java.util.List;

import java.util.Map;

/**
 * HTTP utilities, including:
 * a general-purpose request tool (sends http and https requests with HttpClient)
 * Created by xuyh at 2017/7/17 19:08.
 */
public class HttpUtils {

/**
 * Request timeout, 20000 ms by default
 */
private int timeout = 20000;

/**
 * Cookie map
 */
private Map<String, String> cookieMap = new HashMap<>();

/**
 * Charset used to decode responses, UTF-8 by default
 */
private String charset = "UTF-8";

private static HttpUtils httpUtils;

private HttpUtils() {
}

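/**
 * Get the shared instance
 *
 * @return the singleton HttpUtils
 */
public static synchronized HttpUtils getInstance() {
    //synchronized so that concurrent first calls cannot create two instances
    if (httpUtils == null)
        httpUtils = new HttpUtils();
    return httpUtils;
}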

/**
 * Clear the cookie map
 */
public void invalidCookieMap() {
    cookieMap.clear();
}

public int getTimeout() {
    return timeout;
}

/**
 * Set the request timeout
 *
 * @param timeout timeout in milliseconds
 */
public void setTimeout(int timeout) {
    this.timeout = timeout;
}

public String getCharset() {
    return charset;
}

/**
 * Set the charset used to decode responses
 *
 * @param charset charset name, e.g. "UTF-8"
 */
public void setCharset(String charset) {
    this.charset = charset;
}

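/**
 * Parse an HTML string into a Document
 *
 * @param html HTML text
 * @return the parsed Document
 * @throws Exception if parsing fails
 */
public static Document parseHtmlToDoc(String html) throws Exception {
    return removeHtmlSpace(html);
}

//Parse once, strip every &nbsp; entity from the serialized HTML, then parse again
private static Document removeHtmlSpace(String str) {
    Document doc = Jsoup.parse(str);
    String result = doc.html().replace("&nbsp;", "");
    return Jsoup.parse(result);
}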

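/**
 * Execute a GET request and return the result as a Document
 *
 * @param url request URL
 * @return the parsed Document
 * @throws Exception if the request fails
 */
public Document executeGetAsDocument(String url) throws Exception {
    return parseHtmlToDoc(executeGet(url));
}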

/**
 * Execute a GET request
 *
 * @param url request URL
 * @return the response body; an empty string if the server answers 404
 * @throws Exception if the request fails
 */
public String executeGet(String url) throws Exception {
    HttpGet httpGet = new HttpGet(url);
    httpGet.setHeader("Cookie", convertCookieMapToString(cookieMap));
    httpGet.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
    String str = "";
    //try-with-resources closes the response and the client even if an exception is thrown
    try (CloseableHttpClient httpClient = HttpClientBuilder.create().build()) {
        HttpClientContext context = HttpClientContext.create();
        try (CloseableHttpResponse response = httpClient.execute(httpGet, context)) {
            //Remember the cookies the server set so later requests can send them back
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            int state = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            //Keep the body unless the server answered 404
            if (state != 404 && entity != null) {
                str = EntityUtils.toString(entity, charset);
            }
        }
    }
    return str;
}

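/**
 * Execute a GET request over https and return the result as a Document
 *
 * @param url request URL
 * @return the parsed Document
 * @throws Exception if the request fails
 */
public Document executeGetWithSSLAsDocument(String url) throws Exception {
    return parseHtmlToDoc(executeGetWithSSL(url));
}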

/**
 * Execute a GET request over https
 *
 * @param url request URL
 * @return the response body; an empty string if the server answers 404
 * @throws Exception if the request fails or the SSL client cannot be created
 */
public String executeGetWithSSL(String url) throws Exception {
    HttpGet httpGet = new HttpGet(url);
    httpGet.setHeader("Cookie", convertCookieMapToString(cookieMap));
    httpGet.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
    String str = "";
    //try-with-resources closes the response and the client even if an exception is thrown
    try (CloseableHttpClient httpClient = createSSLInsecureClient()) {
        HttpClientContext context = HttpClientContext.create();
        try (CloseableHttpResponse response = httpClient.execute(httpGet, context)) {
            //Remember the cookies the server set so later requests can send them back
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            int state = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            //Keep the body unless the server answered 404
            if (state != 404 && entity != null) {
                str = EntityUtils.toString(entity, charset);
            }
        }
    }
    return str;
}

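/**
 * Execute a POST request and return the result as a Document
 *
 * @param url request URL
 * @param params form parameters
 * @return the parsed Document
 * @throws Exception if the request fails
 */
public Document executePostAsDocument(String url, Map<String, String> params) throws Exception {
    return parseHtmlToDoc(executePost(url, params));
}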

/**
 * Execute a POST request with form parameters
 *
 * @param url request URL
 * @param params form parameters
 * @return the response body
 * @throws Exception if the request fails
 */
public String executePost(String url, Map<String, String> params) throws Exception {
    String reStr = "";
    HttpPost httpPost = new HttpPost(url);
    httpPost.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
    httpPost.setHeader("Cookie", convertCookieMapToString(cookieMap));
    //Convert the parameter map into the name/value pairs the form entity expects
    List<NameValuePair> paramsRe = new ArrayList<>();
    for (Map.Entry<String, String> entry : params.entrySet()) {
        paramsRe.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
    }
    httpPost.setEntity(new UrlEncodedFormEntity(paramsRe));
    //try-with-resources closes the response and the client even if an exception is thrown
    try (CloseableHttpClient httpclient = HttpClientBuilder.create().build()) {
        HttpClientContext context = HttpClientContext.create();
        try (CloseableHttpResponse response = httpclient.execute(httpPost, context)) {
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            HttpEntity entity = response.getEntity();
            //A response may carry no body, so guard against a null entity
            if (entity != null) {
                reStr = EntityUtils.toString(entity, charset);
            }
        }
    }
    return reStr;
}

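/**
 * Execute a POST request over https and return the result as a Document
 *
 * @param url request URL
 * @param params form parameters
 * @return the parsed Document
 * @throws Exception if the request fails
 */
public Document executePostWithSSLAsDocument(String url, Map<String, String> params) throws Exception {
    return parseHtmlToDoc(executePostWithSSL(url, params));
}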

/**
 * Execute a POST request with form parameters over https
 *
 * @param url request URL
 * @param params form parameters
 * @return the response body
 * @throws Exception if the request fails or the SSL client cannot be created
 */
public String executePostWithSSL(String url, Map<String, String> params) throws Exception {
    String re = "";
    HttpPost post = new HttpPost(url);
    //Convert the parameter map into the name/value pairs the form entity expects
    List<NameValuePair> paramsRe = new ArrayList<>();
    for (Map.Entry<String, String> entry : params.entrySet()) {
        paramsRe.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
    }
    post.setHeader("Cookie", convertCookieMapToString(cookieMap));
    post.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
    post.setEntity(new UrlEncodedFormEntity(paramsRe));
    //try-with-resources closes the response and the client even if an exception is thrown
    try (CloseableHttpClient httpClientRe = createSSLInsecureClient()) {
        HttpClientContext contextRe = HttpClientContext.create();
        try (CloseableHttpResponse response = httpClientRe.execute(post, contextRe)) {
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                re = EntityUtils.toString(entity, charset);
            }
            getCookiesFromCookieStore(contextRe.getCookieStore(), cookieMap);
        }
    }
    return re;
}

/**
 * Send a POST request with a JSON body
 *
 * @param url request URL
 * @param jsonBody JSON body
 * @return the response body
 * @throws Exception if the request fails
 */
public String executePostWithJson(String url, String jsonBody) throws Exception {
    String reStr = "";
    HttpPost httpPost = new HttpPost(url);
    httpPost.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
    httpPost.setHeader("Cookie", convertCookieMapToString(cookieMap));
    //ContentType.APPLICATION_JSON sets the Content-Type header and encodes the body as UTF-8
    httpPost.setEntity(new StringEntity(jsonBody, ContentType.APPLICATION_JSON));
    //try-with-resources closes the response and the client even if an exception is thrown
    try (CloseableHttpClient httpclient = HttpClientBuilder.create().build()) {
        HttpClientContext context = HttpClientContext.create();
        try (CloseableHttpResponse response = httpclient.execute(httpPost, context)) {
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            HttpEntity entity = response.getEntity();
            //A response may carry no body, so guard against a null entity
            if (entity != null) {
                reStr = EntityUtils.toString(entity, charset);
            }
        }
    }
    return reStr;
}

/**
 * Send a POST request with a JSON body over https
 *
 * @param url request URL
 * @param jsonBody JSON body
 * @return the response body
 * @throws Exception if the request fails or the SSL client cannot be created
 */
public String executePostWithJsonAndSSL(String url, String jsonBody) throws Exception {
    String re = "";
    HttpPost post = new HttpPost(url);
    post.setHeader("Cookie", convertCookieMapToString(cookieMap));
    post.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
    //ContentType.APPLICATION_JSON sets the Content-Type header and encodes the body as UTF-8
    post.setEntity(new StringEntity(jsonBody, ContentType.APPLICATION_JSON));
    //try-with-resources closes the response and the client even if an exception is thrown
    try (CloseableHttpClient httpClientRe = createSSLInsecureClient()) {
        HttpClientContext contextRe = HttpClientContext.create();
        try (CloseableHttpResponse response = httpClientRe.execute(post, contextRe)) {
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                re = EntityUtils.toString(entity, charset);
            }
            getCookiesFromCookieStore(contextRe.getCookieStore(), cookieMap);
        }
    }
    return re;
}

//Copy every cookie from the request context's cookie store into the map
private void getCookiesFromCookieStore(CookieStore cookieStore, Map<String, String> cookieMap) {
    List<Cookie> cookies = cookieStore.getCookies();
    for (Cookie cookie : cookies) {
        cookieMap.put(cookie.getName(), cookie.getValue());
    }
}

//Build the Cookie header value: name=value pairs separated by "; "
private String convertCookieMapToString(Map<String, String> map) {
    StringBuilder cookie = new StringBuilder();
    for (Map.Entry<String, String> entry : map.entrySet()) {
        if (cookie.length() > 0) {
            cookie.append("; ");
        }
        cookie.append(entry.getKey()).append("=").append(entry.getValue());
    }
    return cookie.toString();
}

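/**
 * Create a client for https requests.
 * Note: this client trusts every certificate and skips hostname verification,
 * so it gives no protection against man-in-the-middle attacks.
 *
 * @return an HttpClient with certificate checks disabled
 * @throws GeneralSecurityException if the SSL context cannot be built
 */
private static CloseableHttpClient createSSLInsecureClient() throws GeneralSecurityException {
    //Trust strategy that accepts any certificate chain
    SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, (chain, authType) -> true).build();
    //Hostname verifier that accepts any host
    SSLConnectionSocketFactory sslConnectionSocketFactory = new SSLConnectionSocketFactory(sslContext,
            (hostname, session) -> true);
    return HttpClients.custom().setSSLSocketFactory(sslConnectionSocketFactory).build();
}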

}
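
The cookie handling in convertCookieMapToString can also be written with streams. Below is a minimal sketch that uses only the JDK; the class name CookieHeaderSketch, the method toHeader, and the sample cookie values are hypothetical and not part of the utility class above. It uses a LinkedHashMap so that the output order is predictable, whereas the HashMap in HttpUtils gives no ordering guarantee.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class CookieHeaderSketch {
    // Join name=value pairs with "; ", the separator used in a Cookie request header
    static String toHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the header is deterministic
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("sid", "abc123");
        cookies.put("lang", "zh");
        System.out.println(toHeader(cookies)); // prints "sid=abc123; lang=zh"
    }
}
```

Collectors.joining inserts the separator only between elements, so an empty map yields an empty string and no trailing "; " has to be trimmed.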
