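Recently I was given a task that required scraping part of the content of an HTTPS website, and I used HttpClient (4.5.x) for it. HttpClient started out as a subproject of Apache Jakarta Commons and is now maintained under Apache HttpComponents. It is an efficient, feature-rich client-side toolkit for the HTTP protocol (the 4.5.x line implements HTTP/1.1). Let's first look at a simple use of HttpClient.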
- A wrapper method that fetches a page with HttpClient:
public String getHtml(String url) {
    String html = null;
    for (int i = 1; i <= 3; i++) { // try up to three times
        CloseableHttpClient httpclient = HttpClients.createDefault(); // create the HttpClient object
        CloseableHttpResponse response = null;
        HttpGet httpget = new HttpGet(url); // request the URL with GET
        httpget.addHeader(HttpHeaders.USER_AGENT,
                "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0");
        RequestConfig requestConfig = RequestConfig.custom()
                .setSocketTimeout(10000).setConnectTimeout(10000).build(); // socket (read) and connect timeouts, in ms
        httpget.setConfig(requestConfig);
        try {
            response = httpclient.execute(httpget); // obtain the response object
            int resStatu = response.getStatusLine().getStatusCode(); // status code
            System.out.println("Status code: " + resStatu);
            if (resStatu == HttpStatus.SC_OK) { // 200 OK
                // get the response entity
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    html = EntityUtils.toString(entity, "UTF-8");
                    html = html.replace("&nbsp;", " "); // replace non-breaking-space entities with plain spaces
                    break;
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // close the response and the client (getConnectionManager().shutdown() is deprecated in 4.5)
            try {
                if (response != null) {
                    response.close();
                }
                httpclient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    return html;
}
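The retry loop in getHtml mixes "try up to three times" with the HTTP details. As a minimal sketch of the same idea in isolation, the helper below calls a task until it returns a non-null value or the attempts run out. The names RetrySketch and firstSuccess are made up for this illustration and are not part of HttpClient.

```java
import java.util.concurrent.Callable;

final class RetrySketch {

    /**
     * Calls the task up to maxAttempts times and returns the first non-null
     * result. Returns null if every attempt throws or yields null.
     */
    static <T> T firstSuccess(Callable<T> task, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                T result = task.call();
                if (result != null) {
                    return result;
                }
            } catch (Exception e) {
                // Log and fall through to the next attempt.
                System.err.println("Attempt " + attempt + " failed: " + e);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated task: fails twice, then succeeds on the third call.
        String value = firstSuccess(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new IllegalStateException("simulated failure");
            }
            return "ok";
        }, 3);
        System.out.println(value + " after " + calls[0] + " attempts");
    }
}
```

With such a helper, the fetch itself could be written as a single attempt and wrapped as firstSuccess(() -> fetchOnce(url), 3), where fetchOnce is a hypothetical single-try version of getHtml.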
This method works fine for ordinary HTTP pages. But if you use it to fetch certain HTTPS pages, the following exception is thrown:
unable to find valid certification path to requested target
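The exception means that the JVM does not trust the site's certificate chain, so you need to import the site's certificate. Now let's scrape an HTTPS page for real, taking a Google mirror site that I have been using a lot lately (https://www.xichuan.pub/scholar?hl=zh-CN&q=hand&btnG=&lr=) as the example.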
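Before going through the steps, note that retrying does not help with this kind of failure, so it can be useful to recognise it in code. The sketch below walks an exception's cause chain and looks for the JDK's certificate exception types. It assumes the usual JSSE behaviour, where the handshake error wraps a CertificateException or CertPathBuilderException; the class and method names are invented for this example.

```java
import java.security.cert.CertPathBuilderException;
import java.security.cert.CertificateException;
import javax.net.ssl.SSLHandshakeException;

final class CertErrorSketch {

    /**
     * Returns true if the error or one of its causes is a certificate
     * validation failure. The depth limit guards against cyclic cause chains.
     */
    static boolean isCertificateTrustFailure(Throwable error) {
        int depth = 0;
        for (Throwable t = error; t != null && depth < 20; t = t.getCause(), depth++) {
            if (t instanceof CertificateException || t instanceof CertPathBuilderException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Simulated failure shaped like the one an untrusted HTTPS site produces.
        Throwable simulated = new SSLHandshakeException("handshake failed")
                .initCause(new CertificateException("PKIX path building failed"));
        System.out.println(isCertificateTrustFailure(simulated));
        System.out.println(isCertificateTrustFailure(new java.io.IOException("timeout")));
    }
}
```

A retry loop could call this check in its catch block and stop immediately when it returns true.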
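1. Export the site's certificate (Chrome):
Click the lock icon in the browser's address bar, click "View certificate" in the panel that opens, switch to the Details tab, and export the certificate as Base64-encoded X.509 (.cer).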
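For reference, the "Base64-encoded X.509" option produces the text (PEM) form of the certificate: the binary DER bytes are Base64-encoded in 64-character lines between BEGIN and END markers. The following sketch shows that wrapping and its inverse using only the JDK. The class name PemSketch is hypothetical, and the bytes in main are placeholders, not a real certificate.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class PemSketch {

    private static final String BEGIN = "-----BEGIN CERTIFICATE-----";
    private static final String END = "-----END CERTIFICATE-----";

    /** Wraps DER bytes as Base64 text with 64-character lines and PEM markers. */
    static String toPem(byte[] der) {
        Base64.Encoder encoder =
                Base64.getMimeEncoder(64, "\n".getBytes(StandardCharsets.US_ASCII));
        return BEGIN + "\n" + encoder.encodeToString(der) + "\n" + END + "\n";
    }

    /** Strips the markers and whitespace, then decodes back to the DER bytes. */
    static byte[] fromPem(String pem) {
        String body = pem.replace(BEGIN, "")
                .replace(END, "")
                .replaceAll("\\s", "");
        return Base64.getDecoder().decode(body);
    }

    public static void main(String[] args) {
        // Placeholder bytes standing in for a DER-encoded certificate.
        byte[] placeholder = new byte[100];
        for (int i = 0; i < placeholder.length; i++) {
            placeholder[i] = (byte) i;
        }
        System.out.print(toPem(placeholder));
    }
}
```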
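2. Import the certificate into a keystore:
Use the keytool utility that ships with Java to import the exported .cer certificate into a keystore that HttpClient can use. In a cmd window, change to the JDK's bin directory.
Then run: keytool -import -alias Root -file d:/Root.cer -keystore "d:/trust.keystore" -storepass 123456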
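To check that the import worked, the keystore can be opened with the JDK's KeyStore API and its aliases listed. This is only a sketch: it assumes the file was written in the JVM's default keystore type, and the class name TrustStoreCheckSketch is invented here. The default path and password in main are the ones from the keytool command above.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.KeyStore;
import java.util.Collections;
import java.util.List;

final class TrustStoreCheckSketch {

    /** Loads the keystore file and returns its aliases in sorted order. */
    static List<String> listAliases(Path keystoreFile, char[] password) throws Exception {
        // Assumption: the file uses the JVM's default keystore type.
        KeyStore keyStore = KeyStore.getInstance(KeyStore.getDefaultType());
        try (InputStream in = Files.newInputStream(keystoreFile)) {
            keyStore.load(in, password);
        }
        List<String> aliases = Collections.list(keyStore.aliases());
        Collections.sort(aliases);
        return aliases;
    }

    public static void main(String[] args) throws Exception {
        Path file = Paths.get(args.length > 0 ? args[0] : "d:/trust.keystore");
        for (String alias : listAliases(file, "123456".toCharArray())) {
            System.out.println(alias);
        }
    }
}
```

The alias given to keytool (Root) will typically be listed in lowercase, since the JDK's built-in keystore types store aliases case-insensitively.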
3. Use an SSL-enabled HttpClient instance to access the HTTPS site.
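import java.io.File;
import java.io.IOException;

import javax.net.ssl.SSLContext;

import org.apache.http.HttpEntity;
import org.apache.http.HttpHeaders;
import org.apache.http.HttpStatus;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.conn.ssl.TrustSelfSignedStrategy;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.ssl.SSLContexts;
import org.apache.http.util.EntityUtils;

public class SSLHttpClient {
    public static String gethtml(String url) {
        String html = "";
        CloseableHttpClient httpclient = null;
        CloseableHttpResponse response = null;
        try {
            SSLConnectionSocketFactory sslsf = createSSLConnSocketFactory();
            httpclient = HttpClients.custom()
                    .setSSLSocketFactory(sslsf).build();
            HttpGet httpget = new HttpGet(url);
            httpget.addHeader(HttpHeaders.USER_AGENT,
                    "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0");
            RequestConfig requestConfig = RequestConfig.custom()
                    .setSocketTimeout(10000).setConnectTimeout(10000).build(); // socket (read) and connect timeouts, in ms
            httpget.setConfig(requestConfig);
            System.out.println("Executing request " + httpget.getRequestLine());
            response = httpclient.execute(httpget);
            HttpEntity entity = response.getEntity();
            System.out.println("----------------------------------------");
            System.out.println(response.getStatusLine());
            int resStatu = response.getStatusLine().getStatusCode(); // status code
            if (resStatu == HttpStatus.SC_OK) { // 200 OK; anything else is treated as a failure
                // get the response entity
                if (entity != null) {
                    html = EntityUtils.toString(entity, "UTF-8");
                    html = html.replace("&nbsp;", " "); // replace non-breaking-space entities with plain spaces
                }
            }
            EntityUtils.consume(entity); // make sure the entity content is fully consumed
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (httpclient != null) {
                try {
                    httpclient.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return html;
    }

    // Build the SSL socket factory from the keystore created in step 2
    private static SSLConnectionSocketFactory createSSLConnSocketFactory()
            throws Exception {
        SSLContext sslcontext = SSLContexts
                .custom()
                // keystore file and its password; adjust the path to wherever trust.keystore was saved
                .loadTrustMaterial(
                        new File(
                                "C://Users//cloud//Desktop//证书//trust.keystore"),
                        "123456".toCharArray(), new TrustSelfSignedStrategy())
                .build();
        // TLSv1 is disabled by default on current JDKs, so TLSv1.2 is requested here
        SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(
                sslcontext, new String[] { "TLSv1.2" }, null,
                SSLConnectionSocketFactory.getDefaultHostnameVerifier());
        return sslsf;
    }
}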
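The protocol array passed to SSLConnectionSocketFactory in createSSLConnSocketFactory has to name protocol versions that the running JVM actually supports. As a rough sketch, the helper below filters a preferred list against what the JVM's default SSLContext reports. The names TlsProtocolSketch and choose are hypothetical, and the preference order shown in main is an assumption, not something the original setup prescribes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLContext;

final class TlsProtocolSketch {

    /** Returns the preferred protocol names that also appear in supported, in preference order. */
    static String[] choose(String[] supported, String... preferred) {
        List<String> available = Arrays.asList(supported);
        List<String> chosen = new ArrayList<>();
        for (String protocol : preferred) {
            if (available.contains(protocol)) {
                chosen.add(protocol);
            }
        }
        return chosen.toArray(new String[0]);
    }

    public static void main(String[] args) throws Exception {
        // Protocol names the JVM's default SSLContext knows about.
        String[] supported = SSLContext.getDefault()
                .getSupportedSSLParameters().getProtocols();
        System.out.println("Supported: " + Arrays.toString(supported));
        System.out.println("Chosen:    "
                + Arrays.toString(choose(supported, "TLSv1.3", "TLSv1.2")));
    }
}
```

The resulting array could be passed as the second constructor argument in place of the hard-coded one. Passing null there instead leaves the choice to HttpClient's defaults.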
Test method:
// Requires jsoup: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.select.Elements
public static void main(String[] args) {
    String html = SSLHttpClient.gethtml("https://www.xichuan.pub/scholar?hl=zh-CN&q=hand&btnG=&lr=");
    if (html != null && !html.equals("")) {
        Document doc = Jsoup.parse(html);
        if (doc != null) {
            // select the title link of each search result
            Elements eles = doc.select("#gs_ccl_results div.gs_r h3.gs_rt a");
            if (eles != null && eles.size() != 0) {
                for (int i = 0; i < eles.size(); i++) {
                    System.out.println((i + 1) + "-" + eles.get(i).text());
                }
            }
        }
    }
}
And that's it, all done.