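Crawler Study Notes: 2. Download Pages

Note: everything here is my own study notes on the book 《自己動手寫網絡爬蟲》 (Write Your Own Web Crawler). If anything is unclear, please refer to the original text.

1. Fetching web pages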

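Fetching a web page means reading the network resource identified by a URL from the network stream and saving it locally.

This is similar to using a program to simulate what the IE browser does: the URL is sent to the server as the content of an HTTP request, and the resource in the server's response is then read.

Java was designed with networking in mind. It treats a network resource as a kind of file, so accessing a network resource is as convenient as accessing a local file. Requests and responses are wrapped as streams.

We can therefore obtain the response stream from the response and then read data from the stream byte by byte.

For example, the java.net.URL class can send a request to the corresponding Web server and obtain the response document.

The java.net.URL class has a constructor that takes the URL address as a string argument and builds a URL object: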

URL pageURL = new URL(path);

Next, the URL object can be used to obtain the network stream, so the network resource can be handled just like a local file:

InputStream stream = pageURL.openStream();
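The two lines above can be combined into a small, self-contained sketch that reads the whole stream into a byte array. The class name UrlReader and the method readAll are hypothetical names chosen for illustration, not something from the book. The demo in main uses a file: URL so that it runs offline; this assumes, as the text says, that an http: URL is read in the same way.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class UrlReader {
    /** Reads every byte from the stream behind the given URL. */
    public static byte[] readAll(URL pageURL) throws IOException {
        // try-with-resources closes the stream even if reading fails
        try (InputStream stream = pageURL.openStream();
             ByteArrayOutputStream buffer = new ByteArrayOutputStream()) {
            byte[] chunk = new byte[4096];
            int n;
            // read() returns -1 once the end of the stream is reached
            while ((n = stream.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return buffer.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        // A file: URL keeps the demo offline; an http: URL is opened the same way
        Path tmp = Files.createTempFile("page", ".html");
        Files.write(tmp, "<html>hello</html>".getBytes(StandardCharsets.UTF_8));
        byte[] data = readAll(tmp.toUri().toURL());
        System.out.println("Read " + data.length + " bytes");
    }
}
```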

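In real projects the network environment is fairly complex, so simulating an IE client with only the API in the java.net package takes a lot of code.

You have to handle the status codes returned over HTTP, configure HTTP proxies, deal with the HTTPS protocol, and so on.

To make application development easier, real projects often use HttpClient, Apache's open source HTTP client project.

It handles the common problems that come up with HTTP connections and is very convenient to use.

Just add the HttpClient.jar package (version 3.0) to the project and you can simulate IE to fetch web page content. For example: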

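// Create a client, similar to opening a browser
HttpClient httpclient = new HttpClient();
// Create a get method, similar to typing an address into the browser's address bar
GetMethod getMethod = new GetMethod("http://www.blablabla.com");
// "Press Enter" and get the response status code
int statusCode = httpclient.executeMethod(getMethod);
// Inspect the result; much more is available, such as headers and cookies
System.out.println("response=" + getMethod.getResponseBodyAsString());
// Release the connection
getMethod.releaseConnection();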

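2. Passing parameters: Get and Post

With a Get request, the parameters for the server are passed as part of the URL.

However, URL length is limited in practice: the HTTP protocol itself does not define a maximum, but browsers and servers do impose one. So a large number of parameters cannot be passed to the server this way.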
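To make "parameters as part of the URL" concrete, here is a minimal sketch that appends URL-encoded parameters to a base URL using only the JDK. The class QueryStringBuilder, the method withParams, and the example.com address are hypothetical names used for illustration. The sketch assumes the base URL does not already contain a query string.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryStringBuilder {
    /** Appends URL-encoded parameters to a base URL that has no query string yet. */
    public static String withParams(String baseUrl, Map<String, String> params)
            throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder(baseUrl);
        char separator = '?'; // first parameter follows '?', later ones follow '&'
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(separator)
              .append(URLEncoder.encode(e.getKey(), "UTF-8"))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            separator = '&';
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // LinkedHashMap keeps the parameters in insertion order
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "web crawler");
        params.put("page", "2");
        // prints http://www.example.com/search?q=web+crawler&page=2
        System.out.println(withParams("http://www.example.com/search", params));
    }
}
```

Every parameter makes the URL longer, which is why the length limits mentioned above matter for Get requests.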

To avoid URL length limits, HTTP requests that carry many parameters usually use the Post method, which the HttpClient package also supports well. For example:

// Create the post method
PostMethod postMethod = new PostMethod("http://www.saybot.com/postme");
// Pass the parameters as an array
NameValuePair[] postData = new NameValuePair[2];
// Set the parameters
postData[0] = new NameValuePair("weapon", "gun");
postData[1] = new NameValuePair("whichGun", "magicGun");
postMethod.addParameters(postData);
// "Press Enter" and get the response status code
int statusCode = httpclient.executeMethod(postMethod);
// Inspect the result; much more is available, such as headers and cookies
System.out.println("response=" + postMethod.getResponseBodyAsString());
// Release the connection
postMethod.releaseConnection();

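The example above shows how to access a Web resource with the post method. Unlike Get, Post sends the parameters set through NameValuePair in the request body rather than in the URL, so the number of parameters is not bounded by URL length limits and is effectively "unlimited".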

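3. Handling HTTP status codes

When HttpClient accesses a Web resource, HTTP status codes are involved. For example:

int statusCode = httpClient.executeMethod(getMethod); // "press Enter" and get the response status code

The HTTP status code indicates the status of the response returned over HTTP.

For example, when a client sends a request to the server and the requested resource is obtained successfully, the returned status code is 200, which means the response succeeded.

If the requested resource does not exist, a 404 error is usually returned.

Here only responses with status code 200 are handled; all other status codes are discarded.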

// Check the status code of the request
if ( statusCode != HttpStatus.SC_OK ) {
    System.err.println("Method failed: " + getMethod.getStatusLine());
    filePath = null;
}
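As a rough illustration of how a crawler might branch on the status code, the sketch below maps a code to an action using the constants in java.net.HttpURLConnection. The class StatusCodes, the enum Action, and the method actionFor are hypothetical. The redirect branch is an assumption added for illustration only; the notes above keep 200 and discard everything else.

```java
import java.net.HttpURLConnection;

public class StatusCodes {
    /** What the crawler does with a response (hypothetical categories). */
    public enum Action { SAVE, FOLLOW_REDIRECT, SKIP }

    public static Action actionFor(int statusCode) {
        switch (statusCode) {
            case HttpURLConnection.HTTP_OK:         // 200: resource obtained
                return Action.SAVE;
            case HttpURLConnection.HTTP_MOVED_PERM: // 301
            case HttpURLConnection.HTTP_MOVED_TEMP: // 302
            case HttpURLConnection.HTTP_SEE_OTHER:  // 303
                // Assumption: a fuller crawler would fetch the new location
                return Action.FOLLOW_REDIRECT;
            default:                                // e.g. 404: resource missing
                return Action.SKIP;
        }
    }

    public static void main(String[] args) {
        System.out.println(actionFor(200)); // prints SAVE
        System.out.println(actionFor(404)); // prints SKIP
    }
}
```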

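4. Implementing Download Pages

It mainly consists of three functions:

1. String getFileNameByUrl(String url, String contentType); // Replaces the illegal characters in the URL to produce the file name to save under.

2. void saveToLocal(byte[] data, String filePath); // Saves the downloaded byte array to a local file.

3. String downloadFile(String url); // Downloads the web page that the URL points to.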
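Before the full listing, a worked example may help show what the first function is meant to produce. This is a standalone sketch with hypothetical names (FileNames, fileNameFor) and example.com URLs. It assumes the content type may carry a parameter such as "; charset=UTF-8", which it strips before choosing the extension.

```java
public class FileNames {
    /** Derives a file name from a URL and a Content-Type value. */
    public static String fileNameFor(String url, String contentType) {
        // Drop the scheme prefix ("http://" or "https://") if present
        String name = url.replaceFirst("^https?://", "");
        // Replace characters that are not allowed in Windows file names
        name = name.replaceAll("[\\\\?/:*|<>\"]", "_");
        // Strip parameters such as "; charset=UTF-8" from the content type
        String mime = contentType.split(";")[0].trim();
        if (mime.contains("html")) {
            return name + ".html";
        }
        // Other types, e.g. application/pdf: use the subtype as the extension
        return name + "." + mime.substring(mime.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // prints www.example.com_docs_index_id=1.html
        System.out.println(fileNameFor("http://www.example.com/docs/index?id=1",
                "text/html; charset=UTF-8"));
        // prints www.example.com_paper.pdf
        System.out.println(fileNameFor("https://www.example.com/paper",
                "application/pdf"));
    }
}
```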

package chici.util;

import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class DownloadFile {
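	/**
	 * Builds the file name to save a page under from its URL and content type,
	 * replacing the characters that are not valid in a file name.
	 */
	public String getFileNameByUrl(String url, String contentType) {
		// Strip the scheme ("http://" or "https://")
		url = url.replaceFirst("^https?://", "");
		// Replace characters that are illegal in file names
		String name = url.replaceAll("[\\?/:*|<>\"]", "_");
		// text/html type
		if (contentType.indexOf("html") != -1) {
			return name + ".html";
		}
		// Other types such as application/pdf: use the subtype as the extension
		else {
			return name + "." + contentType.substring(contentType.lastIndexOf("/") + 1);
		}
	}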
	
	/**
	 * Saves the page's byte array to a local file; filePath is the path of the file to write.
	 */
	private void saveToLocal(byte[] data, String filePath) {
		// try-with-resources closes the stream even if the write fails
		try (DataOutputStream out = new DataOutputStream(new FileOutputStream(new File(filePath)))) {
			// Write the whole array at once instead of one byte at a time
			out.write(data);
			out.flush();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
	
	/**
	 * Downloads the web page that the URL points to and returns the path of
	 * the saved file, or null if the request did not return status 200.
	 */
	public String downloadFile(String url) {
		String filePath = null;
		// 1. Create the HttpClient object and set its parameters
		HttpClient httpClient = new HttpClient();
		// Set the HTTP connection timeout to 5 s
		httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(5000);

		// 2. Create the GetMethod object and set its parameters
		GetMethod getMethod = new GetMethod(url);
		// Set the socket timeout of the get request to 5 s
		getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
		// Set the retry handler for the request
		getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER, new DefaultHttpMethodRetryHandler());

		// 3. Execute the HTTP GET request
		try {
			int statusCode = httpClient.executeMethod(getMethod);
			// Check the status code; anything other than 200 is discarded
			if (statusCode != HttpStatus.SC_OK) {
				System.err.println("Method failed: " + getMethod.getStatusLine());
				// Return here so that an error page is not saved as if it were the content
				return null;
			}

			// 4. Process the HTTP response content
			byte[] responseBody = getMethod.getResponseBody(); // read as a byte array
			// Build the file name to save under from the page URL
			filePath = "E:\\chici\\" + getFileNameByUrl(url, getMethod.getResponseHeader("Content-Type").getValue());
			saveToLocal(responseBody, filePath);
		} catch (HttpException e) {
			// Fatal exception: the protocol may be wrong or the returned content may be invalid
			System.out.println("Please check your provided http address!");
			e.printStackTrace();
		} catch (IOException e) {
			// A network exception occurred
			e.printStackTrace();
		} finally {
			// Release the connection
			getMethod.releaseConnection();
		}
		return filePath;
	}
		
}
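
The save step in the listing assumes that the target directory (E:\chici\ above) already exists. As a hedged alternative, the sketch below uses java.nio.file to create any missing parent directories before writing. The class LocalSaver and the method save are hypothetical names, and returning a boolean instead of printing a stack trace is a design choice of this sketch, not of the original code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LocalSaver {
    /** Writes data to filePath, creating parent directories; returns false on failure. */
    public static boolean save(byte[] data, String filePath) {
        try {
            Path target = Paths.get(filePath).toAbsolutePath();
            Path parent = target.getParent();
            if (parent != null) {
                // No-op if the directories already exist
                Files.createDirectories(parent);
            }
            // Creates the file, or truncates and overwrites an existing one
            Files.write(target, data);
            return true;
        } catch (IOException e) {
            System.err.println("Could not save " + filePath + ": " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo: save into a sub-directory of a fresh temporary directory
        Path dir = Files.createTempDirectory("chici");
        String target = dir.resolve("pages").resolve("index.html").toString();
        boolean ok = save("<html></html>".getBytes(StandardCharsets.UTF_8), target);
        System.out.println(ok ? "Saved to " + target : "Save failed");
    }
}
```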
