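I recently got a task that required scraping part of an HTTPS website, and I used HttpClient (4.5.x) for it. HttpClient is part of the Apache HttpComponents project (the successor to the HttpClient sub-project of Jakarta Commons). It provides an efficient, up-to-date, feature-rich client-side toolkit for the HTTP protocol and tracks the latest HTTP versions and recommendations. Let's first look at a simple use of HttpClient.
- Wrap an HttpClient query method: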
    public String getHtml(String url) {
        String html = null;
        // Timeouts for connecting and for waiting on data (milliseconds)
        RequestConfig requestConfig = RequestConfig.custom()
                .setSocketTimeout(10000).setConnectTimeout(10000).build();
        for (int i = 1; i <= 3; i++) { // try up to three times
            HttpGet httpget = new HttpGet(url); // request the URL with GET
            httpget.addHeader(HttpHeaders.USER_AGENT,
                    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0");
            httpget.setConfig(requestConfig);
            // try-with-resources closes the response and the client after every attempt
            try (CloseableHttpClient httpclient = HttpClients.createDefault();
                 CloseableHttpResponse response = httpclient.execute(httpget)) {
                int status = response.getStatusLine().getStatusCode();
                System.out.println("Status code: " + status);
                if (status == HttpStatus.SC_OK) { // 200 OK
                    HttpEntity entity = response.getEntity(); // response body
                    if (entity != null) {
                        html = EntityUtils.toString(entity, "UTF-8");
                        // replace non-breaking-space entities with plain spaces
                        html = html.replace("&nbsp;", " ");
                        break;
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return html;
    }
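The three-attempt loop in the method above can also be pulled out into a small helper, so the fetch logic and the retry policy stay separate. This is only a minimal sketch using the JDK alone; the class name `Retry` and the method `firstSuccess` are made up for illustration and are not part of HttpClient.

```java
import java.util.concurrent.Callable;

// Hypothetical helper (not part of HttpClient): generic "try up to N times" logic.
class Retry {

    /**
     * Calls the task up to maxAttempts times and returns the first non-null result.
     * Returns null if every attempt throws or yields null.
     */
    static <T> T firstSuccess(int maxAttempts, Callable<T> task) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                T result = task.call();
                if (result != null) {
                    return result;
                }
            } catch (Exception e) {
                // Log and fall through to the next attempt
                System.err.println("Attempt " + attempt + " failed: " + e);
            }
        }
        return null;
    }
}
```

With a helper like this, a single-attempt fetch method (call it `fetchOnce(url)`, also hypothetical) could be retried with `Retry.firstSuccess(3, () -> fetchOnce(url))`.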
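This method works fine for ordinary HTTP pages. But if you use it to fetch certain HTTPS pages, you get the following exception:
unable to find valid certification path to requested target
The exception means that Java cannot build a trusted certificate chain for the site, so the site's certificate has to be imported. Now let's actually scrape an HTTPS page, taking as an example a Google mirror site I have been using a lot lately (https://www.xichuan.pub/scholar?hl=zh-CN&q=hand&btnG=&lr=).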
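Since retrying does not help when the certificate is not trusted, it can be useful to tell this failure apart from transient network errors. On the JDKs I know of, the message above arrives wrapped in a `javax.net.ssl.SSLHandshakeException`, so one rough way to detect it is to walk the cause chain. The class and method names below are hypothetical, and the check is a sketch under that assumption.

```java
import javax.net.ssl.SSLHandshakeException;

// Hypothetical helper: detects TLS handshake failures anywhere in a cause chain.
class CertErrors {

    /** Returns true if the throwable or one of its causes is an SSLHandshakeException. */
    static boolean isHandshakeFailure(Throwable t) {
        int depth = 0;
        // The depth limit guards against a cyclic cause chain
        for (Throwable c = t; c != null && depth < 20; c = c.getCause(), depth++) {
            if (c instanceof SSLHandshakeException) {
                return true;
            }
        }
        return false;
    }
}
```

A handshake failure can have other causes besides a missing certificate (for example a protocol mismatch), so this only narrows things down.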
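1. Export the site's certificate (Chrome):
Click the lock icon in the browser's address bar, click View certificate in the panel that opens on the right, switch to the Details tab, and export the certificate in the Base64-encoded X.509 (.cer) format.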
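2. Import the certificate into a keystore:
Use the keytool utility that ships with Java to import the exported .cer certificate into a keystore that HttpClient can use. Open cmd and change to the JDK's bin directory.
Then run the following command: keytool -import -alias Root -file d:/Root.cer -keystore "d:/trust.keystore" -storepass 123456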
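Before wiring the keystore into HttpClient, it may be worth checking that the import actually worked. The sketch below uses only the JDK's `java.security.KeyStore` API to load a keystore file and list its aliases; the class name `KeystoreCheck` is made up for illustration, and it assumes the keystore was created by the keytool of the same JDK, so the default keystore type matches.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.KeyStore;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: lists the entries of a keystore file.
class KeystoreCheck {

    /** Loads the keystore at the given path and returns all alias names it contains. */
    static List<String> listAliases(String path, char[] password) throws Exception {
        KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
        try (InputStream in = new FileInputStream(path)) {
            ks.load(in, password); // fails if the password or the format is wrong
        }
        return Collections.list(ks.aliases());
    }
}
```

For the command above, `KeystoreCheck.listAliases("d:/trust.keystore", "123456".toCharArray())` should return a single alias. It will likely show up in lower case (`root`), since the JDK keystore implementations I am aware of normalize alias names.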
3. Use an SSL-enabled HttpClient instance to access the HTTPS site.
    import java.io.File;

    import javax.net.ssl.SSLContext;

    import org.apache.http.HttpEntity;
    import org.apache.http.HttpHeaders;
    import org.apache.http.HttpStatus;
    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
    import org.apache.http.conn.ssl.TrustSelfSignedStrategy;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.ssl.SSLContexts;
    import org.apache.http.util.EntityUtils;

    public class SSLHttpClient {

        public static String gethtml(String url) {
            String html = "";
            HttpGet httpget = new HttpGet(url);
            httpget.addHeader(HttpHeaders.USER_AGENT,
                    "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0");
            // Timeouts for connecting and for waiting on data (milliseconds)
            RequestConfig requestConfig = RequestConfig.custom()
                    .setSocketTimeout(10000).setConnectTimeout(10000).build();
            httpget.setConfig(requestConfig);
            System.out.println("Executing request " + httpget.getRequestLine());
            // try-with-resources closes the response first, then the client
            try (CloseableHttpClient httpclient = HttpClients.custom()
                         .setSSLSocketFactory(createSSLConnSocketFactory()).build();
                 CloseableHttpResponse response = httpclient.execute(httpget)) {
                HttpEntity entity = response.getEntity();
                System.out.println("----------------------------------------");
                System.out.println(response.getStatusLine());
                int status = response.getStatusLine().getStatusCode();
                if (status == HttpStatus.SC_OK) { // 200 OK; anything else is treated as a failure
                    if (entity != null) {
                        html = EntityUtils.toString(entity, "UTF-8");
                        // replace non-breaking-space entities with plain spaces
                        html = html.replace("&nbsp;", " ");
                    }
                }
                EntityUtils.consume(entity); // release any unread content
            } catch (Exception e) {
                e.printStackTrace();
            }
            return html;
        }

        // Builds the SSL socket factory from the keystore created in step 2
        private static SSLConnectionSocketFactory createSSLConnSocketFactory()
                throws Exception {
            SSLContext sslcontext = SSLContexts
                    .custom()
                    .loadTrustMaterial(
                            // adjust to wherever your trust.keystore is stored
                            new File("C://Users//cloud//Desktop//證書//trust.keystore"),
                            "123456".toCharArray(), new TrustSelfSignedStrategy())
                    .build();
            // TLSv1 is disabled by default on recent JDKs, so request TLSv1.2
            return new SSLConnectionSocketFactory(
                    sslcontext, new String[] { "TLSv1.2" }, null,
                    SSLConnectionSocketFactory.getDefaultHostnameVerifier());
        }
    }
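To make it clearer what the trust-material step boils down to, here is roughly the same idea written with the JDK's own JSSE classes only. It is a simplified sketch, not HttpClient's actual implementation: it trusts only what is in the keystore file and has no counterpart to `TrustSelfSignedStrategy`. The class name `TrustStoreContext` is made up for illustration.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.KeyStore;

import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

// Hypothetical helper: builds an SSLContext that trusts the certificates in one keystore file.
class TrustStoreContext {

    static SSLContext fromTrustStore(String path, char[] password) throws Exception {
        // Load the keystore file produced by keytool
        KeyStore trustStore = KeyStore.getInstance(KeyStore.getDefaultType());
        try (InputStream in = new FileInputStream(path)) {
            trustStore.load(in, password);
        }
        // Derive trust managers that accept only chains anchored in this keystore
        TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trustStore);
        // No client certificates (null key managers), default secure random
        SSLContext context = SSLContext.getInstance("TLSv1.2");
        context.init(null, tmf.getTrustManagers(), null);
        return context;
    }
}
```

The resulting context could be handed to `new SSLConnectionSocketFactory(context)` in place of the one built with `SSLContexts.custom()`.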
Test method:
    public static void main(String[] args) {
        String html = SSLHttpClient.gethtml("https://www.xichuan.pub/scholar?hl=zh-CN&q=hand&btnG=&lr=");
        if (html != null && !html.equals("")) {
            Document doc = Jsoup.parse(html);
            if (doc != null) {
                // Select the title link of each search result
                Elements eles = doc.select("#gs_ccl_results div.gs_r h3.gs_rt a");
                if (eles != null && eles.size() != 0) {
                    for (int i = 0; i < eles.size(); i++) {
                        System.out.println((i + 1) + "-" + eles.get(i).text());
                    }
                }
            }
        }
    }
And that's it, all done.