[Web Crawler][Java] Weibo Crawler (2): How to Fetch HTML Pages and Use HttpClient

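I. Preface

The previous article used a NetEase Weibo crawler as an example and walked through a very simple crawl, showing roughly what a web crawler amounts to. The example may look a little complicated at first sight, but that is fine: it was only meant to give an overview of the crawling process. The rest of this series works through each stage step by step.

The overall crawler workflow was laid out in the previous article. If you have not read it, see: [Web Crawler][Java] Weibo Crawler (1): NetEase Weibo Crawler (crawling Weibo data by custom keyword)

Here is a recap of the crawling process: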

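step1: Request the URL and get the HTML back as a String, using HttpClient 4.3.1 with both a socket timeout and a connect timeout (connectTimeout) set. This article covers this step in detail.

step2: Check that the HTML from the previous step is valid and is an actual search-result page, because some requested HTML pages do not exist.

step3: Save the HTML string locally by writing it to a txt file.

step4: Parse the Weibo data out of the txt file: userid, timestamp, and so on. Parsing is the real core of the crawler; analysing different page structures and extracting features will be covered in detail in Part 3.

step5: Write the parsed data to txt and XML files. This mainly uses jsoup to parse HTML and dom4j to read and write XML, and will be covered in Part 4.

Part 5 will then present some ways to avoid getting blocked, such as accessing the site through proxy IPs or parsing a local IP database (provided you have one stored). More on that later.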

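II. The HttpClient toolkit

Anyone who has done web development will already be familiar with it, so little needs to be said. It is a very basic toolkit: a code-level HTTP client that lets you emulate a browser and send requests to an HTTP server. HttpClient is part of the HttpComponents (hc for short) project, and the components can be downloaded directly. Using HttpClient also requires HttpCore, which provides the code-level abstractions of HTTP requests and responses. It makes sending HTTP requests from client code easy, and working with it also deepens your understanding of the HTTP protocol.

The HttpComponents components can be downloaded from http://hc.apache.org/.

Keep the following points in mind first: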

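1. Releasing connections after use matters a great deal with HttpClient, just as a database connection must be released.

2. HTTPS sites use SSL-encrypted transport, so take care with importing certificates.

3. You need a basic understanding of HTTP, such as what the status codes 200, 301, 302, 400, 404 and 500 mean (this is the bare minimum), plus the cookie and session mechanisms. (The "simulated login" method in Part 3 of the later Python crawler series requires capturing and analysing packets, mainly to inspect cookies, so learn how to analyse packets.)

4. HttpClient follows redirects automatically by default, which is very convenient for developers (for example, for cookies obtained through authorization), but sometimes it has to be configured manually. For example, you may run into a CircularRedirectException. It occurs when the Location value in the response headers points back to an address that was already visited (the port number may differ), which could lead to an endless redirect loop. In that case you can turn redirects off manually: method.setFollowRedirects(false) in HttpClient 3.x, or RequestConfig.custom().setRedirectsEnabled(false) in HttpClient 4.3.

5. Imitating a browser is very important for a crawler. Some sites first check whether a request comes from a browser and refuse access if it does not. Simply disguise the request as a browser request by adding some header information when fetching with HttpClient: header.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36");

6. When submitting data in a POST request, change the default encoding, otherwise the submitted data will arrive garbled. In HttpClient 3.x, overriding the setContentCharSet() method of postMethod is enough; in HttpClient 4.x, pass the charset to the form entity constructor, as example (1) below does.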
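As a rough illustration of point 3, the sketch below maps the status codes listed above to what a crawler might do next. The class name StatusCodeSketch, the Action labels and the chosen reactions are all illustrative assumptions, not part of HttpClient or of this crawler's code; only the meaning of the codes comes from HTTP itself.

```java
final class StatusCodeSketch {

    /** Hypothetical reactions a crawler could take; the names are made up for this sketch. */
    enum Action { PARSE, FOLLOW_REDIRECT, SKIP, RETRY_LATER }

    /** Maps an HTTP status code to one of the hypothetical crawler actions. */
    static Action actionFor(int statusCode) {
        if (statusCode >= 200 && statusCode < 300) {
            return Action.PARSE;            // 200 OK: the body can be parsed
        }
        if (statusCode == 301 || statusCode == 302) {
            return Action.FOLLOW_REDIRECT;  // 301 Moved Permanently, 302 Found: see the Location header
        }
        if (statusCode >= 400 && statusCode < 500) {
            return Action.SKIP;             // 400 Bad Request, 404 Not Found: retrying unchanged will not help
        }
        if (statusCode >= 500 && statusCode < 600) {
            return Action.RETRY_LATER;      // 500 Internal Server Error: the server may recover
        }
        return Action.SKIP;                 // anything else is ignored in this sketch
    }
}
```

For example, StatusCodeSketch.actionFor(404) yields SKIP, so the crawler would move on to the next URL instead of trying to parse an error page.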

Here are a few examples:

(1) Send a POST request to a local application, which returns different results depending on the parameters passed

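import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.List;

import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public void post() {
		// Create a default HttpClient instance
		CloseableHttpClient httpclient = HttpClients.createDefault();
		// Create the POST request
		HttpPost httppost = new HttpPost("http://localhost:8088/weibo/Ajax/service.action");
		// Build the list of form parameters
		List<NameValuePair> formparams = new ArrayList<NameValuePair>();
		formparams.add(new BasicNameValuePair("name", "alice"));
		UrlEncodedFormEntity uefEntity;
		try {
			// Encode the form as UTF-8 so that non-ASCII values are not garbled
			uefEntity = new UrlEncodedFormEntity(formparams, "utf-8");
			httppost.setEntity(uefEntity);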
			System.out.println("executing request " + httppost.getURI());
			CloseableHttpResponse response = httpclient.execute(httppost);
			try {
				HttpEntity entity = response.getEntity();
				if(entity != null) {
					System.out.println("Response content: " + EntityUtils.toString(entity, "utf-8"));
				}
			} finally {
				response.close();
			}
		} catch (ClientProtocolException e) {
			e.printStackTrace();
		} catch (UnsupportedEncodingException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		} finally {
			// Close the client and release its resources
			try {
				httpclient.close();
			} catch (IOException e) {
				e.printStackTrace();
			}
		}
	}

(2) Send a GET request

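import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.ParseException;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public void get() {
		CloseableHttpClient httpclient = HttpClients.createDefault();
		try {
			// Create the GET request
			HttpGet httpget = new HttpGet("http://www.baidu.com");
			System.out.println("executing request " + httpget.getURI());
			// Execute the GET request
			CloseableHttpResponse response = httpclient.execute(httpget);
			try {
				// Get the response entity
				HttpEntity entity = response.getEntity();
				// Response status line
				System.out.println(response.getStatusLine());
				if(entity != null) {
					// Response content length (-1 if unknown)
					System.out.println("response length: " + entity.getContentLength());
					// Response content
					System.out.println("response content: " + EntityUtils.toString(entity));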
				}
			} finally {
				response.close();
			}
		} catch (ClientProtocolException e) {
			e.printStackTrace();
		} catch (ParseException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		} finally {
			// Close the client and release its resources
			try {
				httpclient.close();
			} catch(IOException e) {
				e.printStackTrace();
			}
		}
	}

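(3) Set headers

For example, search Baidu for the keyword "httpclient", then press F12 in Chrome to open the developer tools and inspect the request in the Network tab. The packet details are shown there, such as the contents of the Request Headers.

Sometimes you need to imitate a browser, and setting the headers is all it takes. Just adapt the values shown here.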

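import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public void header() {
		// DefaultHttpClient is deprecated as of 4.3, so the client is built with HttpClients instead;
		// try-with-resources closes both the client and the response
		try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
			HttpGet httpget = new HttpGet("http://www.baidu.com");
			httpget.setHeader("Accept", "text/html, */*; q=0.01");
			// sdch is left out because HttpClient cannot decode it
			httpget.setHeader("Accept-Encoding", "gzip, deflate");
			httpget.setHeader("Accept-Language", "zh-CN,zh;q=0.8");
			httpget.setHeader("Connection", "keep-alive");
			httpget.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36");

			try (CloseableHttpResponse response = httpClient.execute(httpget)) {
				HttpEntity entity = response.getEntity();
				System.out.println(response.getStatusLine()); // status line
				if(entity != null) {
					System.out.println(entity.getContentLength());
					// Print the body text; printing getContent() would only show the InputStream object
					System.out.println(EntityUtils.toString(entity, "utf-8"));
				}
			}
		} catch (Exception e) {
			e.printStackTrace();
		}
	}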

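III. Getting an HTML page from a URL

Everything so far has been preparation, mainly basic HttpClient usage. There is much more to it, and other material online goes into greater detail, but that is not the focus here. Now let us look at how to get an HTML page from a URL. The method was in fact already described in the previous article: [Web Crawler][Java] Weibo Crawler (1): NetEase Weibo Crawler (crawling Weibo data by custom keyword)

Sina Weibo and NetEase Weibo (pay special attention to the addresses and parameters here!):

Sina Weibo topic search URL: http://s.weibo.com/weibo/蘋果手機&nodup=1&page=50

NetEase Weibo topic search URL: http://t.163.com/tag/蘋果手機

Here the &nodup and &page=50 parameters cover the first 50 HTML pages returned by the search, with crawling starting from page 50. You can change the parameter values to crawl a different number of pages.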
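The paging described above can be sketched as a small URL builder. This is only a sketch under assumptions: the class and method names are made up for illustration, the nodup=1 value is copied from the example address, and percent-encoding the keyword as UTF-8 is an assumption about what the site expects.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

final class SearchUrlSketch {

    // Base address taken from the Sina Weibo example above
    static final String SINA_SEARCH = "http://s.weibo.com/weibo/";

    /** Builds the search URL for one result page of the given keyword. */
    static String pageUrl(String keyword, int page) {
        try {
            // URLEncoder produces form encoding, so spaces come out as '+'; turn them into %20
            String encoded = URLEncoder.encode(keyword, "UTF-8").replace("+", "%20");
            return SINA_SEARCH + encoded + "&nodup=1&page=" + page;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen
            throw new IllegalStateException(e);
        }
    }

    /** Builds the URLs for pages firstPage..lastPage (inclusive). */
    static List<String> pageUrls(String keyword, int firstPage, int lastPage) {
        List<String> urls = new ArrayList<String>();
        for (int page = firstPage; page <= lastPage; page++) {
            urls.add(pageUrl(keyword, page));
        }
        return urls;
    }
}
```

Each URL returned by pageUrls(keyword, 1, 50) could then be passed to one of the getHTML methods shown in this section.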

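Three methods are written here: one that sets a user-defined cookie policy, a default general-purpose one, and one that goes through a proxy IP. The basic idea is the same in all three; the custom configuration is mainly done through RequestConfig and through HttpClients.custom(), which builds the CloseableHttpClient.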

/** 
 * @note Three ways to connect to a URL and fetch its HTML (default method, custom-cookie method, proxy-IP method)
 * @author DianaCody
 * @since 2014-09-26 16:03
 * 
 */

import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.URISyntaxException;
import java.text.ParseException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.config.CookieSpecs;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.cookie.Cookie;
import org.apache.http.cookie.CookieOrigin;
import org.apache.http.cookie.CookieSpec;
import org.apache.http.cookie.CookieSpecProvider;
import org.apache.http.cookie.MalformedCookieException;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.DefaultProxyRoutePlanner;
import org.apache.http.impl.cookie.BestMatchSpecFactory;
import org.apache.http.impl.cookie.BrowserCompatSpec;
import org.apache.http.impl.cookie.BrowserCompatSpecFactory;
import org.apache.http.protocol.HttpContext;
import org.apache.http.util.EntityUtils;

public class HTML {
	
	/** Default method: returns {status code, HTML body}; html[1] stays "null" if the request fails */
	public String[] getHTML(String url) throws ClientProtocolException, IOException {
		String[] html = new String[2];
		html[1] = "null";
		RequestConfig requestConfig = RequestConfig.custom()
				.setSocketTimeout(5000)   // socket (read) timeout in ms
				.setConnectTimeout(5000)   // connect timeout in ms
				.build();
		CloseableHttpClient httpClient = HttpClients.custom()
				.setDefaultRequestConfig(requestConfig)
				.build();
		HttpGet httpGet = new HttpGet(url);
		try {
			CloseableHttpResponse response = httpClient.execute(httpGet);
			try {
				html[0] = String.valueOf(response.getStatusLine().getStatusCode());
				html[1] = EntityUtils.toString(response.getEntity(), "utf-8");
			} finally {
				response.close(); // release the connection
			}
		} catch (IOException e) {
			// Timeouts end up here, but so does any other I/O failure
			System.out.println("----------Request failed (timeout or other I/O error)--------");
		} finally {
			httpClient.close();
		}
		return html;
	}
	
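	/** Cookie variant of getHTML(): registers a lenient custom cookie policy so that cookies are not refused (avoids the "cookie rejected" problem); requests go through the given proxy. Overload with 3 parameters: url, hostName, port */
	public String getHTML(String url, String hostName, int port) throws URISyntaxException, ClientProtocolException, IOException {
		// Use a user-defined cookie policy; hostName and port identify the proxy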
		HttpHost proxy = new HttpHost(hostName, port);
		DefaultProxyRoutePlanner routePlanner = new DefaultProxyRoutePlanner(proxy);
		CookieSpecProvider cookieSpecProvider = new CookieSpecProvider() {
			public CookieSpec create(HttpContext context) {
				return new BrowserCompatSpec() {
					@Override
					public void validate(Cookie cookie, CookieOrigin origin) throws MalformedCookieException {
						//Oh, I am easy...
					}
				};
			}
		};
		Registry<CookieSpecProvider> r = RegistryBuilder
				.<CookieSpecProvider> create()
				.register(CookieSpecs.BEST_MATCH, new BestMatchSpecFactory())
				.register(CookieSpecs.BROWSER_COMPATIBILITY, new BrowserCompatSpecFactory())
				.register("easy", cookieSpecProvider)
				.build();
		RequestConfig requestConfig = RequestConfig.custom()
				.setCookieSpec("easy")
				.setSocketTimeout(5000) // socket (read) timeout in ms
				.setConnectTimeout(5000) // connect timeout in ms
				.build();
		CloseableHttpClient httpClient = HttpClients.custom()
				.setDefaultCookieSpecRegistry(r)
				.setRoutePlanner(routePlanner)
				.build();
		HttpGet httpGet = new HttpGet(url);
		httpGet.setConfig(requestConfig);
		String html = "null"; // sentinel used to check whether the HTML was actually fetched
		try {
			CloseableHttpResponse response = httpClient.execute(httpGet);
			try {
				html = EntityUtils.toString(response.getEntity(), "utf-8");
			} finally {
				response.close(); // release the connection
			}
		} catch (IOException e) {
			System.out.println("----Request failed (timeout or other I/O error)----");
		} finally {
			httpClient.close();
		}
		return html;
	}

	/** Proxy-IP method: fetches targetUrl through the proxy at hostName:port */
	public String getHTMLbyProxy(String targetUrl, String hostName, int port) throws ClientProtocolException, IOException {
		HttpHost proxy = new HttpHost(hostName, port);
		String html = "null";
		DefaultProxyRoutePlanner routePlanner = new DefaultProxyRoutePlanner(proxy);
		RequestConfig requestConfig = RequestConfig.custom()
				.setSocketTimeout(5000) // socket (read) timeout in ms
				.setConnectTimeout(5000) // connect timeout in ms
				.build();
		CloseableHttpClient httpClient = HttpClients.custom()
				.setRoutePlanner(routePlanner)
				.setDefaultRequestConfig(requestConfig)
				.build();
		HttpGet httpGet = new HttpGet(targetUrl);
		try {
			CloseableHttpResponse response = httpClient.execute(httpGet);
			try {
				int statusCode = response.getStatusLine().getStatusCode();
				if(statusCode == HttpStatus.SC_OK) { // status code 200: OK
					// Decoded as gb2312 here; change this to match the target site's charset
					html = EntityUtils.toString(response.getEntity(), "gb2312");
				}
			} finally {
				response.close(); // always release the connection, even on a non-200 status
			}
			//System.out.println(html); // print the returned HTML
		} catch (IOException e) {
			System.out.println("----Request failed (timeout or other I/O error)----");
		} finally {
			httpClient.close();
		}
		return html;
	}
}

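IV. Checking whether the HTML page exists

Sometimes the requested HTML does not exist, as in the case mentioned in the previous article, so a check function is added here.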

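import java.util.regex.Matcher;
import java.util.regex.Pattern;

private boolean isExistHTML(String html) throws InterruptedException {
        // Message shown on the page (in Chinese): "No related Weibo posts found, try another keyword!"
        // The regex matches the escaped form of that message: a literal backslash, 'u' and four hex digits per character
        Pattern pNoResult = Pattern.compile("\\\\u6ca1\\\\u6709\\\\u627e\\\\u5230\\\\u76f8"
                + "\\\\u5173\\\\u7684\\\\u5fae\\\\u535a\\\\u5462\\\\uff0c\\\\u6362\\\\u4e2a"
                + "\\\\u5173\\\\u952e\\\\u8bcd\\\\u8bd5\\\\u5427\\\\uff01");
        Matcher mNoResult = pNoResult.matcher(html);
        // The page counts as existing only if the "no results" message is absent
        return !mNoResult.find();
}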
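The four backslashes in the pattern deserve a note: the regex matches the literal text \u6ca1\u6709..., that is, the message in its escaped form rather than as Chinese characters. Below is a minimal sketch of the same check that builds the escaped marker from readable text instead of typing it by hand. The class and method names are hypothetical, and lowercase hex digits are assumed because that is what the pattern above uses.

```java
import java.util.regex.Pattern;

final class EscapedMarkerSketch {

    /** Rewrites every non-ASCII char as a backslash, 'u' and four lowercase hex digits. */
    static String toUnicodeEscapes(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                sb.append(c); // ASCII characters stay as they are
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    /** True when the escaped "no results" message does NOT occur in the page. */
    static boolean hasResults(String html, String noResultMessage) {
        // Pattern.quote makes the backslashes in the marker match literally
        Pattern marker = Pattern.compile(Pattern.quote(toUnicodeEscapes(noResultMessage)));
        return !marker.matcher(html).find();
    }
}
```

Passing the page's "no results" message as noResultMessage gives the same kind of test as isExistHTML, with the escaping done by toUnicodeEscapes rather than written out by hand.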

V. Saving the HTML strings returned by Weibo

Write all the HTML to local txt files.

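import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

/** Write each fetched HTML page to its own local txt file */
	public static void writeHTML2txt(String html, int num) throws IOException {
		String savePath = "e:/weibo/weibohtml/" + num + ".txt";
		File f = new File(savePath);
		// Write as UTF-8 explicitly; FileWriter would use the platform default charset
		BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(f), "utf-8"));
		try {
			bw.write(html);
		} finally {
			bw.close(); // close the file even if the write fails
		}
	}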

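The crawled HTML is now stored locally. The head of each HTML page holds some general data, while the Weibo post content is in JSON format and carries several fields.

How to parse it and extract the key data will be covered in the next article.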

Original article. If you repost it, please credit the source: http://blog.csdn.net/dianacody/article/details/39695285
