Writing a Crawler in .NET Core: HttpClient Usage in Detail

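Most crawlers today are written in Python, but any language can do the job. Before Python became popular, many of the companies I knew of wrote their crawlers in Java, and other languages were surely in use too. But to get to the point: I have been learning .NET Core recently, so I tried writing a crawler in C#. .NET Core is cross-platform as well, so a single command runs the program on Linux, much like a Python script. A crawler inevitably needs a library for sending HTTP requests. In Python the usual choice is the requests library, with aiohttp for asynchronous work; the C# counterpart is the HttpClient class, which supports asynchronous, highly concurrent requests very well.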

1. Setting up a test service

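Before sending any HTTP requests we need a service or website to send them to. You could pick any website, but an arbitrary site is not ideal for learning. A ready-made service that echoes back the parameters and headers of every request in a neatly formatted response is far better for testing and learning, and that service is the well-known httpbin.org. The official instance at http://httpbin.org is rather slow, so you can host your own, which is very simple (see my notes "Setting up an httpbin service with Docker"), or start with the instance I have already set up at http://zhousonglin.cn:8080/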
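Because httpbin echoes each request back as JSON, the reply can be inspected in code as well as by eye. The following is a minimal sketch, assuming the reply carries a top-level "headers" object as httpbin's /get endpoint does; the class name HttpBinEcho, the method ReadEchoedHeaders, and the sample JSON (written by hand, not captured output) are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class HttpBinEcho
{
    // Copies the "headers" object of an httpbin-style JSON reply into a dictionary.
    // Returns an empty dictionary when the reply has no "headers" property.
    public static Dictionary<string, string> ReadEchoedHeaders(string json)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        using JsonDocument doc = JsonDocument.Parse(json);
        if (doc.RootElement.TryGetProperty("headers", out JsonElement headers))
        {
            foreach (JsonProperty header in headers.EnumerateObject())
            {
                result[header.Name] = header.Value.GetString() ?? string.Empty;
            }
        }
        return result;
    }

    public static void Main()
    {
        // Hand-written sample in the shape httpbin returns from /get (an assumption for the demo).
        string sample = @"{
          ""args"": {},
          ""headers"": { ""Host"": ""httpbin.org"", ""User-Agent"": ""demo-crawler/1.0"" },
          ""origin"": ""203.0.113.7"",
          ""url"": ""http://httpbin.org/get""
        }";

        foreach (KeyValuePair<string, string> pair in ReadEchoedHeaders(sample))
        {
            Console.WriteLine($"{pair.Key}: {pair.Value}");
        }
    }
}
```

Checking the echoed headers this way is a quick test of whether a header set on the client really went out under the name you intended.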

2. Sending GET requests

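GET is the request we send most often, since most of the time we are fetching data; POST is generally needed only for things like logging in. The code below is my asynchronous wrapper for GET requests. Microsoft also recommends using async wherever possible because of its many benefits, which I won't go into here.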

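// Requires: using System.Net.Http; using System.Threading.Tasks;
/// <summary>
/// Sends a GET request and returns the response body as a string.
/// </summary>
/// <param name="requestUrl">The URL to request.</param>
/// <returns>The response body.</returns>
public static async Task<string> HtmlGet(string requestUrl)
{
	string responseBody = string.Empty;
	using (HttpClient httpClient = new HttpClient())
	{
		// GetAsync already determines the HTTP method, so no "Method" header is needed.
		// ConnectionClose = true sends "Connection: close", which turns keep-alive off.
		httpClient.DefaultRequestHeaders.ConnectionClose = true;
		// The real header name is "User-Agent", with a hyphen.
		httpClient.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");

		HttpResponseMessage response = await httpClient.GetAsync(requestUrl);
		// Shortcut: string body = await httpClient.GetStringAsync(requestUrl);
		response.EnsureSuccessStatusCode(); // throws HttpRequestException on a non-success status
		responseBody = await response.Content.ReadAsStringAsync();
	}
	return responseBody;
}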

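Usage:

string urlRequestGet = "http://zhousonglin.cn:8080/get";
string responseStr = string.Empty;
// .Result blocks the calling thread until the task finishes; inside an async method, use await instead.
responseStr = HtmlGet(urlRequestGet).Result;
Console.WriteLine(responseStr);

Result:
[Screenshot: GET request output]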
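The async wrapper pays off once several pages are fetched at the same time. Below is a rough sketch of that idea, assuming one HttpClient instance is shared across all requests and a SemaphoreSlim caps how many run at once; ConcurrentFetch, FetchAllAsync, and maxParallel are hypothetical names, and the two httpbin URLs are only placeholders for real crawl targets.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class ConcurrentFetch
{
    // Downloads every URL through one shared client, running at most maxParallel
    // requests at a time. Results come back in the same order as the input URLs.
    public static async Task<string[]> FetchAllAsync(
        HttpClient client, IEnumerable<string> urls, int maxParallel)
    {
        using var gate = new SemaphoreSlim(maxParallel);
        IEnumerable<Task<string>> tasks = urls.Select(async url =>
        {
            await gate.WaitAsync();          // wait for a free slot
            try
            {
                return await client.GetStringAsync(url);
            }
            finally
            {
                gate.Release();              // free the slot even if the request failed
            }
        });
        return await Task.WhenAll(tasks);
    }

    public static async Task Main()
    {
        // One client for the whole run; creating a new one per request wastes sockets.
        using var client = new HttpClient();
        string[] urls =
        {
            "http://httpbin.org/get?page=1",
            "http://httpbin.org/get?page=2"
        };
        string[] bodies = await FetchAllAsync(client, urls, maxParallel: 2);
        Console.WriteLine($"Fetched {bodies.Length} pages");
    }
}
```

Sharing the client is a deliberate choice here: each HttpClient owns its own connection pool, so a long-running crawler that creates one per request can run out of sockets under load.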

3. Sending POST requests

A POST request can carry its parameters in two ways: as form fields, or as a JSON string.
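To make the difference concrete, the sketch below builds both kinds of body without sending anything and prints each Content-Type together with the raw payload. PostBodies, BuildForm, and BuildJson are illustrative names, and the field values are just sample data.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class PostBodies
{
    // Form body: key/value pairs encoded as application/x-www-form-urlencoded.
    public static HttpContent BuildForm()
    {
        var fields = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("say", "Hello"),
            new KeyValuePair<string, string>("ask", "question")
        };
        return new FormUrlEncodedContent(fields);
    }

    // JSON body: an already serialized string, labelled as application/json.
    public static HttpContent BuildJson()
    {
        return new StringContent("{\"say\":\"Hello\"}", Encoding.UTF8, "application/json");
    }

    public static async Task Main()
    {
        foreach (HttpContent content in new[] { BuildForm(), BuildJson() })
        {
            Console.WriteLine(content.Headers.ContentType);
            Console.WriteLine(await content.ReadAsStringAsync());
        }
    }
}
```

The server decides how to parse the body from the Content-Type header, which is why httpbin reports form fields under "form" and a JSON payload under "json".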

3.1 Sending form parameters
/// <summary>
/// Sends a POST request with form-encoded parameters.
/// </summary>
/// <param name="requestUrl">The URL to request.</param>
/// <param name="postParams">The form fields to send.</param>
/// <returns>The response body.</returns>
public static async Task<string> HtmlPost(string requestUrl, Dictionary<string, string> postParams)
{
	string responseBody = string.Empty;
	using (HttpClient httpClient = new HttpClient())
	{
		// PostAsync already determines the HTTP method, so no "Method" header is needed.
		httpClient.DefaultRequestHeaders.ConnectionClose = true; // sends "Connection: close"
		httpClient.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
		// Encodes the dictionary as application/x-www-form-urlencoded.
		HttpContent postContent = new FormUrlEncodedContent(postParams);
		HttpResponseMessage response = await httpClient.PostAsync(requestUrl, postContent);
		response.EnsureSuccessStatusCode();
		responseBody = await response.Content.ReadAsStringAsync();
	}
	return responseBody;
}

Usage:

string urlRequestPost = "http://zhousonglin.cn:8080/post";
string responseStr = string.Empty;
Dictionary<string, string> postParams = new Dictionary<string, string>()
{
	{"say","Hello" },
	{"ask","question" }
};
responseStr = HtmlPost(urlRequestPost, postParams).Result;
Console.WriteLine(responseStr);

Result:
[Screenshot: form POST request output]

3.2 Sending JSON parameters
/// <summary>
/// Sends a POST request with a JSON body.
/// </summary>
/// <param name="requestUrl">The URL to request.</param>
/// <param name="jsonParams">The JSON string to send as the request body.</param>
/// <returns>The response body.</returns>
public static async Task<string> HtmlPostJson(string requestUrl, string jsonParams)
{
	string responseBody = string.Empty;
	using (HttpClient httpClient = new HttpClient())
	{
		httpClient.DefaultRequestHeaders.ConnectionClose = true; // sends "Connection: close"
		httpClient.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");

		HttpContent content = new StringContent(jsonParams);
		// StringContent defaults to text/plain, so mark the body as JSON explicitly.
		content.Headers.ContentType = new System.Net.Http.Headers.MediaTypeHeaderValue("application/json");
		HttpResponseMessage response = await httpClient.PostAsync(requestUrl, content);

		response.EnsureSuccessStatusCode();
		responseBody = await response.Content.ReadAsStringAsync();
	}
	return responseBody;
}

Usage:

public class User
{
	public User()
	{ }
	public string Name {get;set;}
	public string Sex {get; set;}
}
string urlRequestPost = "http://zhousonglin.cn:8080/post";
User user = new User()
{
	Name = "Dahlin",
	Sex = "male"
};
// JsonConvert comes from the Newtonsoft.Json NuGet package (using Newtonsoft.Json;).
string jsonParam = JsonConvert.SerializeObject(user);
string responseStr = HtmlPostJson(urlRequestPost, jsonParam).Result;
Console.WriteLine(responseStr);

Result:
[Screenshot: JSON POST request output]
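If you would rather not add the Newtonsoft.Json package, the System.Text.Json serializer built into .NET Core 3.0 and later can produce the same payload. This is a sketch of that alternative, not the approach used above; JsonBody and ToJsonContent are hypothetical names, and User mirrors the class from the usage example.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Text.Json;

public class User
{
    public string Name { get; set; }
    public string Sex { get; set; }
}

public static class JsonBody
{
    // Serializes the object with System.Text.Json and wraps it as a JSON request body.
    public static StringContent ToJsonContent(User user)
    {
        string json = JsonSerializer.Serialize(user);
        return new StringContent(json, Encoding.UTF8, "application/json");
    }

    public static void Main()
    {
        var user = new User { Name = "Dahlin", Sex = "male" };
        // Default options keep the property names as declared and emit compact JSON.
        Console.WriteLine(JsonSerializer.Serialize(user)); // {"Name":"Dahlin","Sex":"male"}
    }
}
```

The returned StringContent can be passed straight to PostAsync in place of the manually built content.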

4. Downloading files

Crawlers mostly fetch text, but sometimes we also want images or video files. HttpClient supports file downloads as well; the wrapper looks like this:

// Requires: using System.IO; using System.Net.Http; using System.Threading.Tasks;
/// <summary>
/// Downloads a file and saves it to disk.
/// </summary>
/// <param name="requestUrl">The URL of the file.</param>
/// <param name="fileName">The path to save the file to.</param>
/// <returns></returns>
public static async Task HtmlDownloadFile(string requestUrl, string fileName)
{
	using HttpClient httpClient = new HttpClient();

	httpClient.DefaultRequestHeaders.ConnectionClose = true; // sends "Connection: close"
	httpClient.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");

	HttpResponseMessage response = await httpClient.GetAsync(requestUrl);
	response.EnsureSuccessStatusCode();

	// Read the whole body into memory, then write it out asynchronously.
	// FileMode.Create semantics: an existing file with the same name is overwritten.
	byte[] data = await response.Content.ReadAsByteArrayAsync();
	await File.WriteAllBytesAsync(fileName, data);
}

Usage:

string urlPicture = "http://qn.zhousonglin.cn/DaGuanYuan34.jpg?imageslim";
HtmlDownloadFile(urlPicture, "1.jpg").Wait();
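For large files such as videos, holding the entire body in a byte array can be costly. One possible variation, sketched here under the assumption that a shared HttpClient is passed in, streams the response to disk as it arrives; StreamingDownload, DownloadToFileAsync, and SaveStreamAsync are hypothetical names, and the httpbin image URL is only a stand-in for a real file.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

public static class StreamingDownload
{
    // Copies any readable stream into a newly created (or overwritten) file.
    public static async Task SaveStreamAsync(Stream source, string fileName)
    {
        using FileStream file = new FileStream(
            fileName, FileMode.Create, FileAccess.Write, FileShare.None, 81920, useAsync: true);
        await source.CopyToAsync(file);
    }

    // ResponseHeadersRead returns as soon as the headers arrive, so the body is
    // pulled from the network chunk by chunk instead of being buffered in full.
    public static async Task DownloadToFileAsync(HttpClient client, string requestUrl, string fileName)
    {
        using HttpResponseMessage response =
            await client.GetAsync(requestUrl, HttpCompletionOption.ResponseHeadersRead);
        response.EnsureSuccessStatusCode();

        using Stream body = await response.Content.ReadAsStreamAsync();
        await SaveStreamAsync(body, fileName);
    }

    public static async Task Main()
    {
        using var client = new HttpClient();
        await DownloadToFileAsync(client, "http://httpbin.org/image/jpeg", "sample.jpg");
        Console.WriteLine(new FileInfo("sample.jpg").Length + " bytes written");
    }
}
```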

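For HttpClient, the methods above cover most needs. There are more advanced uses as well, such as customizing the message handler, HttpClientHandler, or sending cookies with a request, as follows: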

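CookieContainer cookieContainer = new CookieContainer();
// A cookie must belong to a domain. Passing the target site's URI supplies it;
// Add(Cookie) alone throws an ArgumentException when Cookie.Domain is empty.
cookieContainer.Add(new Uri("http://zhousonglin.cn:8080/"), new Cookie("XXXXXX", "XXXXXXX"));
HttpClientHandler httpClientHandler = new HttpClientHandler()
{
   CookieContainer = cookieContainer,
   AllowAutoRedirect = true,
   UseCookies = true
};
HttpClient httpClient = new HttpClient(httpClientHandler);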
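A quick way to see what the handler will send is to ask the container for the Cookie header it would produce for a given URL. The sketch below does that without making any request; CookieDemo and BuildHandler are illustrative names, and the cookie name and value are made up.

```csharp
using System;
using System.Net;
using System.Net.Http;

public static class CookieDemo
{
    // Builds a handler whose cookie container holds one cookie scoped to the given site.
    public static HttpClientHandler BuildHandler(Uri site, string name, string value)
    {
        var cookies = new CookieContainer();
        cookies.Add(site, new Cookie(name, value));
        return new HttpClientHandler { CookieContainer = cookies, UseCookies = true };
    }

    public static void Main()
    {
        var site = new Uri("http://httpbin.org/");
        using HttpClientHandler handler = BuildHandler(site, "session", "abc123");

        // The header that would accompany a request to this site.
        Console.WriteLine(handler.CookieContainer.GetCookieHeader(site)); // session=abc123
    }
}
```

Cookies are matched by domain and path, so a request to a different host gets none of them.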

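Adding a proxy and similar options work in much the same way; press F12 (Go to Definition) on HttpClientHandler and it all becomes clear, so I won't go into detail here. If I end up needing them I may write a follow-up on advanced usage, which really comes down to extending and overriding the base classes that the framework exposes.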
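As a rough illustration of both ideas, the sketch below routes traffic through a proxy and wraps the handler in a small custom DelegatingHandler. Everything named here is an assumption made for the example: HeaderStampHandler, ProxyDemo, the X-Crawler header, and the proxy address 127.0.0.1:8888, which must be replaced with a proxy that is actually running.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Custom message handler: stamps a header on every outgoing request,
// then hands the request to the next handler in the chain.
public class HeaderStampHandler : DelegatingHandler
{
    public HeaderStampHandler(HttpMessageHandler inner) : base(inner) { }

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        request.Headers.TryAddWithoutValidation("X-Crawler", "demo");
        return base.SendAsync(request, cancellationToken);
    }
}

public static class ProxyDemo
{
    // Builds a handler that sends all requests through the given proxy.
    public static HttpClientHandler BuildProxyHandler(string proxyAddress)
    {
        return new HttpClientHandler
        {
            Proxy = new WebProxy(proxyAddress),
            UseProxy = true
        };
    }

    public static async Task Main()
    {
        // Placeholder proxy address; the request fails unless a proxy listens there.
        var pipeline = new HeaderStampHandler(BuildProxyHandler("http://127.0.0.1:8888"));
        using var client = new HttpClient(pipeline);
        Console.WriteLine(await client.GetStringAsync("http://httpbin.org/headers"));
    }
}
```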
