A Simple Crawler for Querying Remaining Campus Network Traffic

The school's official account needs a crawler to query campus network traffic, so this post records how I built this simple crawler.

Development tools:

Eclipse, Chrome/Firefox

Third-party libraries:

jsoup: parses the page data; usage guide: http://www.open-open.com/jsoup/

HttpClient: connects to web pages and simulates GET and POST requests

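Step 1: Define the goal

Put simply, crawling means simulating what a browser does: GET requests fetch page data and POST requests submit data.

So, the first step is to spell out how the campus traffic query works:

1: Identify the target page: http://zyzfw.xidian.edu.cn/ is the user login page for querying campus network traffic

2: Enter the student ID, password, and verification code as the POST data, then click log in

3: After login the browser is redirected to http://zyzfw.xidian.edu.cn/home/base/index, the traffic information page

4: Record the traffic information shown there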

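Step 2: Java programming

List of Java files

HttpClientManager.java: holds a singleton HttpClient, which is used for every connection to the site

HttpOperate.java: the HttpClient network request functions

1. GET method that fetches the page's cookies and the verification code;

2. POST method that logs the account in;

3. GET method that fetches the traffic information from the page shown after login.

DocHandle.java: analyzes the Document that jsoup builds from the page's HTML source, extracts the page content, and saves it

1. method that extracts the page's error message,

2. method that extracts the traffic information,

3. method that extracts the token,

4. method that fetches and recognizes the verification code

ImageOP.java: downloads the verification code image given a URL and cookies

NetConstans.java: constants such as URLs

PictureOperate.java: operations on the downloaded verification code image

1. read the image (returns an int[][] two-dimensional array);

2. crop the image (splits the 4 digits of the verification code into 4 images that can be processed separately);

3. save the image

4. simple recognition of the digit in an image

UserInfo.java: stores the user's information

1. basic user information

2. cookies

3. image path

4. token

5. traffic information

6. error information

MainRunning.java: the main program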

MainRunning.java

public class MainRunning {

	public static void main(String[] args) {
		// TODO Auto-generated method stub
		HttpClientManager.init();

		UserInfo user = new UserInfo("0000000001","000000001");
		if(user.getUserName().equals(""))
		{
			System.out.println("Please enter a user name");
		}
		//login
		boolean loginOk=false;
		int cc=0;
		do{
			cc++;
			if(cc>5)
			{
				System.out.println("login error! try later.");
				break;
			}
			if(HttpOperate.getLoginInfo(user)){
				loginOk=HttpOperate.loginFlowQuery(user);
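				// Assumption: userError is set for account/password problems, which a retry cannot fix;
				// a wrong verification code falls through and is retried.
				if(!loginOk&&!user.userError.equals("")){
					System.out.println(user.userError);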
					break;
				}
			}
			else{
				System.out.println("getLoginInfo error!!");
				break;
			}
			
		}while(!loginOk);
		if(loginOk&&HttpOperate.getFlowInfo(user)){
			user.printFlowInfo();
		}
		else{
			System.out.println("get FlowInfo error!");
		}
	

		
		
	}

}

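As the main method shows, the logic is very simple:

1. Initialization:

HttpClientManager.init();

initializes the HttpClient instance

UserInfo user = new UserInfo("0000000001","000000001");

initializes the user's account and password

2. Account login:

do{……}while(……) tries to log in up to 5 times (forgive my rather crude verification code recognition); if none of the 5 attempts succeeds, the user is told to try again later.

HttpOperate.getLoginInfo(user)

fetches the page's cookies and the verification code. It simulates the GET request a user makes by typing the target page http://zyzfw.xidian.edu.cn/ into the address bar, then reads the site's response, mainly to obtain the cookies, the token (explained later), and the verification code image

loginOk=HttpOperate.loginFlowQuery(user);

logs in with the user name, password, verification code, and token, i.e. simulates one POST request

3. After a successful login, query the traffic information on the page shown after login

HttpOperate.getFlowInfo(user)

simulates a GET request to the page the login redirects to: http://zyzfw.xidian.edu.cn/home/base/index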
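The bounded retry in step 2 can be looked at in isolation. The following is only a minimal sketch: `RetrySketch` and `retry` are hypothetical names that do not appear in the project, and the login attempt is reduced to a `BooleanSupplier` so the example runs without any network access.

```java
import java.util.function.BooleanSupplier;

final class RetrySketch {
    private RetrySketch() {}

    /**
     * Runs the attempt until it reports success or maxAttempts attempts have been made.
     * Returns true as soon as one attempt succeeds, false if all of them fail.
     */
    static boolean retry(int maxAttempts, BooleanSupplier attempt) {
        for (int i = 1; i <= maxAttempts; i++) {
            if (attempt.getAsBoolean()) {
                return true;
            }
        }
        return false;
    }
}
```

With such a helper the login loop would reduce to something like `retry(5, () -> HttpOperate.getLoginInfo(user) && HttpOperate.loginFlowQuery(user))`, although the early exits of the original loop would still have to be handled separately.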

The three steps in detail:

1. Initialization

Code of HttpClientManager.java

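import org.apache.http.client.HttpClient;
import org.apache.http.impl.client.HttpClients;

public class HttpClientManager {

    // volatile is required for double-checked locking to be safe
    private static volatile HttpClient httpClient = null;

    private HttpClientManager(){

    }

    // Creates the shared instance on the first call; later calls leave it untouched.
    public static void init(){
        if(httpClient == null){
            synchronized (HttpClientManager.class){
                if (httpClient == null){
                    httpClient = HttpClients.createDefault();
                }
            }
        }
    }

    public static HttpClient getInstance(){

        return httpClient;
    }

}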

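A simple singleton: the constructor is private, the only instance is created in the static method init, and it is obtained only through the static method getInstance. This HttpClient performs the GET and POST requests to the site; settings such as timeouts would also be configured here.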
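For comparison, here is a sketch of a different way to get a lazily created shared client, the initialization-on-demand holder idiom. It is not the code used in this project: to stay dependency-free it uses the JDK's own `java.net.http.HttpClient` (Java 11+) as a stand-in for Apache HttpClient, and the class name `SharedClient` and the 10 second timeout are assumptions chosen for illustration.

```java
import java.net.CookieManager;
import java.net.http.HttpClient;
import java.time.Duration;

final class SharedClient {
    private SharedClient() {}

    // The nested class is initialized on the first call to get(). The JVM runs a
    // static initializer exactly once, so no explicit locking is needed.
    private static final class Holder {
        static final HttpClient INSTANCE = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10)) // assumed value, tune as needed
                .cookieHandler(new CookieManager())     // keeps session cookies across requests
                .build();
    }

    static HttpClient get() {
        return Holder.INSTANCE;
    }
}
```

The timeout is configured in the same place where the instance is built, which matches the remark that such settings belong with the singleton.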

The UserInfo class in UserInfo.java needs little explanation: it is just a set of member fields that hold the data obtained from the pages.

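2. Account login
2-1: Fetch the page's cookies and the verification code. This simulates the GET request a user makes by typing the target page http://zyzfw.xidian.edu.cn/ into the address bar, then reads the site's response, mainly to obtain the cookies, the token (explained later), and the verification code image

First, let's use Chrome to see how a browser sends the request when it browses (GETs) a page.

Open Chrome, press F12 to open the developer tools, and select Network (it lists every request and response, including the request information and everything the page loads). Find the entry for http://zyzfw.xidian.edu.cn/, click it, and select Headers to view the HTTP headers

The Request Headers are the header information sent to the target server in order to open the web page. They contain a number of settings.

Next, expand the Response Headers.

This is what the server returns for the GET request. A response consists of two main parts: the response headers (the HTTP header information) and the response entity (the HTML source of the page you see)

Now look at the HttpOperate.getLoginInfo(user) method:

HttpOperate.getLoginInfo(user) is in HttpOperate.java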

 public static boolean getLoginInfo(UserInfo user){

	        HttpGet httpGet = new HttpGet(NetConstans.LOGIN_URL);
	        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0");
	        httpGet.setHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
	        httpGet.setHeader("Accept-Language", "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3");
	        httpGet.setHeader("Accept-Encoding", "gzip, deflate");
	        httpGet.setHeader("Connection", "keep-alive");
	        httpGet.setHeader("Host","zyzfw.xidian.edu.cn");
	        httpGet.setHeader("Cache-Control","max-age=0");
	        httpGet.setHeader("Referer","http://pay.xidian.edu.cn/");
	        try {
	            HttpResponse response = HttpClientManager.getInstance().execute(httpGet);
	           

	            Header[] headers = response.getHeaders("Set-Cookie");
	            StringBuilder sb=new StringBuilder();
	            for(int i=0;i<headers.length;i++){
	            	sb.append(headers[i].toString());
	            }
	            user.setCookiesString(sb.toString());
	            //System.out.println(user.getCookiesString());
	            //System.out.println(user.getUserCookies().toString());
	            
	            String loginWebStr = EntityUtils.toString(response.getEntity());
	            Document document = Jsoup.parse(loginWebStr);
	            DocHandle.getCsrfToken(document, user);
	            DocHandle.getVerifyCode(document, user);
	            return true;
	        } catch (IOException e) {
	            e.printStackTrace();
	            return false;
	        }


	    }

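Create an HttpGet request: set the target address first, then set the headers one by one to match the header information in the request captured in Chrome.

Once the headers are set, obtain the HttpClient instance and execute the GET request: HttpResponse response = HttpClientManager.getInstance().execute(httpGet);

The return value is our response:

1. First read the headers of the response, mainly the contents of the Set-Cookie headers. As seen earlier there are 3 Set-Cookie entries, so loop over them and save these 3 cookies (they may never be needed, because the HttpClient keeps these cookies by default for the requests that follow)

2. Then read the entity of the response, i.e. the page content. Save the HTML as a string, then use Jsoup.parse(String) to turn the HTML string into a Document that can be queried.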
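If the saved cookies are later sent by hand in a Cookie request header, only the leading name=value pair of each Set-Cookie value is needed; attributes such as path or HttpOnly are instructions for the client and are not sent back. The sketch below shows that reduction. `CookieHeaderSketch` and `toCookieHeader` are hypothetical names, and the input is assumed to be the header values without the "Set-Cookie:" prefix.

```java
import java.util.ArrayList;
import java.util.List;

final class CookieHeaderSketch {
    private CookieHeaderSketch() {}

    /** Keeps the leading name=value pair of each Set-Cookie value and joins the pairs with "; ". */
    static String toCookieHeader(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        for (String value : setCookieValues) {
            int semicolon = value.indexOf(';');
            String pair = (semicolon >= 0 ? value.substring(0, semicolon) : value).trim();
            if (!pair.isEmpty()) {
                pairs.add(pair);
            }
        }
        return String.join("; ", pairs);
    }
}
```

For example, the value `PHPSESSID=a3c2f1p3rnf94ktcvfgi0vrvs4; path=/; HttpOnly` contributes just `PHPSESSID=a3c2f1p3rnf94ktcvfgi0vrvs4` to the resulting header.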

From this HTML page we mainly need two things: the verification code and the token (I am not sure why I call it a token; the name just seemed vaguely familiar)

1. The token: DocHandle.getCsrfToken(document, user); in DocHandle.java

	public static void getCsrfToken(Document document,UserInfo user){
		  Elements es=document.select("meta");
          for (Element element : es) {
			if(element.attr("name").equals("csrf-token"))
			{
				user.setCsrf_token(element.attr("content"));
				return;
			}
		  }
          System.out.print("token error");
	}

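This token is needed later when the form data is submitted by POST. At first I noticed that the POST carried a _csrf parameter, but I had no idea where its value came from, so the POST request kept failing. I then looked at the HTML source of http://zyzfw.xidian.edu.cn/ and found a meta tag with name="csrf-token" inside <head>. That tag's content is what we need.

<meta name="csrf-token" content="YUt1dfdtgno4KDcyaRUHDhczTRBaIzIDFikXT0woDhEVAx0xXjAKDw==">

Document's select method selects all meta tags; the code then iterates over them and reads each tag's name attribute. When the name is "csrf-token", it saves the content attribute to user, which gives us the token.

All Document operations come from jsoup and are straightforward; see http://www.open-open.com/jsoup/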
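To make the extraction step concrete without pulling in jsoup, here is a dependency-free sketch that finds the token with a regular expression. It is a stand-in and not the project's method: `CsrfTokenSketch` is a hypothetical name, and the pattern assumes a tag of the form `<meta name="csrf-token" content="...">` with the name attribute before content. With jsoup itself, the attribute selector `meta[name=csrf-token]` would select the same tag directly.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CsrfTokenSketch {
    private CsrfTokenSketch() {}

    // Assumption: double-quoted attributes, with name appearing before content.
    private static final Pattern META = Pattern.compile(
            "<meta\\s+name\\s*=\\s*\"csrf-token\"\\s+content\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    /** Returns the content of the csrf-token meta tag, or an empty Optional if it is absent. */
    static Optional<String> find(String html) {
        Matcher matcher = META.matcher(html);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }
}
```

Returning an `Optional` makes the "token not found" case explicit instead of only printing a message.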

2. The verification code

DocHandle.getVerifyCode(document, user); in DocHandle.java

	public static void getVerifyCode(Document document,UserInfo user){
		Element eee  = document.getElementById(NetConstans.VERIFYCODEID);
		String url = NetConstans.LOGIN_URL+eee.attr("src");
		user.imagePath=eee.attr("src").split("=")[1];
		//System.out.println(url);
		ImageOP.downloadImageByURL(url,user);
		
		int[][] data=PictureOperate.readPic2IntArray(user.imagePath+".png");
		//File f1 = new File(user.imagePath+".png");
		//if(f1.exists())
			//f1.deleteOnExit();
		String newPath2=user.imagePath+"-";
		PictureOperate.cutPicture(data,newPath2);
		StringBuilder sb=new StringBuilder();
		for(int i=0;i<4;i++){
			String newPath3=user.imagePath+"-"
					+Integer.toString(i+1)+".png";
			data = PictureOperate.readPic2IntArray(newPath3);
			float[] f=PictureOperate.changeDataToInt9(data);
			double res;
			double minRes=9999;
			int val=-1;
			//0..9 = 10numbers
			for(int j =0;j<10;j++){
				res=0;
				//9 blocks
				for(int k =0;k<9;k++){
					res=res+Math.abs(f[k]-PictureOperate.training[j][k]);
				}
				if(res<minRes){
					minRes=res;
					val=j;
				}
				
			}
			sb.append(Integer.toString(val));
		}
		
		//String s=HttpOperate.recognizeCodeByORCKingWebsite(url);
		user.setCode(sb.toString());
		System.out.println(sb.toString());

		for(int i=1;i<=4;i++){
			File ff=new File(user.imagePath+"-"+Integer.toString(i)+".png");
			ff.delete();
		}
		File fx = new File(user.imagePath+".png");
		if(fx.exists())
		{  
			if(!fx.delete())
			{

			    System.gc();

			    fx.delete();

			}
		}
		
	}

File fx = new File(user.imagePath+".png");
		if(fx.exists())
		{  
			if(!fx.delete())
			{

			    System.gc();

			    fx.delete();

			}
		}

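This snippet makes sure the image really gets deleted: something was still holding the file open (an unclosed stream is the usual cause), so the first delete could fail.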
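A common way to avoid this kind of lock is to close every stream before deleting the file, which try-with-resources does automatically. The sketch below assumes that an open stream was the cause here, which the original code does not confirm. `TempImageSketch` and `writeThenDelete` are hypothetical names, and a temporary file stands in for the downloaded image.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class TempImageSketch {
    private TempImageSketch() {}

    /** Writes the bytes to a temporary file, closes the stream, then deletes the file. */
    static boolean writeThenDelete(byte[] bytes) {
        try {
            Path file = Files.createTempFile("verifycode-", ".png");
            try (OutputStream out = Files.newOutputStream(file)) {
                out.write(bytes);
            } // the stream is closed here, before the delete below
            return Files.deleteIfExists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the stream is closed before the delete, no `System.gc()` call is needed to release the file.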

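First the <img> tag of the verification code is located by its id, NetConstans.VERIFYCODEID="loginform-verifycode-image"; then the tag's src is read, and downloadImageByURL is called with that address to download the verification code image.

Once downloaded, the verification code is just an image on disk, which brings us to

A layman's image recognition of the verification code digits

A quick outline of the approach:

1. Train on samples ------ done ahead of time and only once

1-1. Collect enough verification code images: keep refreshing the target page and save the image each time

1-2. Split the images: cut each one into 4 single-digit images (the hard part is that many digits touch each other and are difficult to separate; my splitting approach is crude)

1-3. Sort the digit images into classes 0-9, with enough samples per digit

1-4. Analyze the samples: extract a feature for each of the digits 0123456789. I divide each image into a 3x3 grid and use the share of all black pixels that falls in each cell as the digit's feature.

1-5. Save the training result as an array of 9 (cells) * 10 (classes)

2: Split the downloaded verification code into 4 single digits as well, compare each one with the sample features, and pick the digit whose feature is closest. Joining the digits gives the verification code.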
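The feature extraction and comparison can be sketched as follows. This is an illustration of the idea, not the contents of PictureOperate.java: `DigitMatcherSketch`, `gridFeatures`, and `nearest` are hypothetical names, and the image is assumed to be already binarized into an int[][] where 1 means a black pixel and 0 a white one.

```java
final class DigitMatcherSketch {
    private DigitMatcherSketch() {}

    /**
     * Divides a binary image (1 = black, 0 = white) into a 3x3 grid and returns,
     * for each of the 9 cells, the share of all black pixels that lie in that cell.
     */
    static float[] gridFeatures(int[][] pixels) {
        int rows = pixels.length;
        int cols = pixels[0].length;
        float[] features = new float[9];
        int total = 0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                if (pixels[r][c] == 1) {
                    // Integer division maps the row and column to a grid index 0..2
                    int cell = (r * 3 / rows) * 3 + (c * 3 / cols);
                    features[cell]++;
                    total++;
                }
            }
        }
        if (total > 0) {
            for (int k = 0; k < 9; k++) {
                features[k] /= total;
            }
        }
        return features;
    }

    /** Returns the index of the template with the smallest sum of absolute differences. */
    static int nearest(float[] features, float[][] templates) {
        int best = -1;
        double bestDistance = Double.MAX_VALUE;
        for (int j = 0; j < templates.length; j++) {
            double distance = 0;
            for (int k = 0; k < features.length; k++) {
                distance += Math.abs(features[k] - templates[j][k]);
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = j;
            }
        }
        return best;
    }
}
```

With a 10 x 9 template array, one row per digit, `nearest(gridFeatures(digitImage), templates)` returns the recognized digit. Because the features are shares rather than counts, images of different sizes remain comparable.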

ImageOP.downloadImageByURL(url,user); in ImageOP.java

	public static void downloadImageByURL(String s,UserInfo user){
		try {
			URL url = new URL(s);
			URLConnection uc = url.openConnection();
			uc.setRequestProperty("Accept", "*/*");
			uc.setRequestProperty("Connection", "Keep-Alive");
			uc.setRequestProperty("User-Agent","Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)");
			// Send the cookies from the login page so the server returns the matching verification code
			uc.setRequestProperty("Cookie",user.getUserCookies().toString());
			uc.connect();
			File file = new File(user.imagePath+".png");
			// try-with-resources closes both streams, so the file can be deleted afterwards
			try (InputStream is = uc.getInputStream();
			     FileOutputStream out = new FileOutputStream(file)) {
				byte[] buffer = new byte[4096];
				int n;
				while ((n = is.read(buffer)) != -1) {
					out.write(buffer, 0, n);
				}
			}
		} catch (MalformedURLException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}

	}

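PS: this GET request must set the Cookie header, otherwise the image returned is not the verification code that belongs to the login session. URLConnection and HttpClient are separate request mechanisms and do not seem to share cookies.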

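2-2: Log in with the user name, password, verification code, and token, i.e. simulate one POST request

Again, first look at what Chrome and the server exchange when the login button is clicked

In the same way, find the request to the target site and look at its Request Headers and Response Headers first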

Request Headers

Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate
Accept-Language:zh-CN,zh;q=0.8
Cache-Control:max-age=0
Connection:keep-alive
Content-Length:180
Content-Type:application/x-www-form-urlencoded
Cookie:safedog-flow-item=C864AD7C523216AFDE4807601B; lzstat_uv=118312236|3401870; PHPSESSID=21sa2hufe124312aswrwqhghvcxc3; _csrf=f4559712aba7b9sadasda5d7655fcf96ede5b1df95febc673124452ca%3A2%3A%7Bi%3A0%3Bs%3A5%3A%22_csrf%22%3Bi%3A1%3Bs%3A32%3A%22YcBDPpAtvx8fcFtywbb9uMHktHhGgULu%22%3B%7D; BIGipServerzyzfw.xidian.edu.cn=13412690.24610.0000
Host:zyzfw.xidian.edu.cn
Origin:http://zyzfw.xidian.edu.cn
Referer:http://zyzfw.xidian.edu.cn/
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36

Response Headers

Cache-Control:no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Connection:Keep-Alive
Content-Length:0
Content-Type:text/html; charset=UTF-8
Date:Fri, 29 Apr 2016 12:47:32 GMT
Expires:Thu, 19 Nov 1981 08:52:00 GMT
Keep-Alive:timeout=1, max=250
Location:http://zyzfw.xidian.edu.cn/home/base/index
Pragma:no-cache
Server:Apache/2.4.12 (Unix) OpenSSL/1.0.1g-fips PHP/5.5.23
Set-Cookie:PHPSESSID=a3c2f1p3rnf94ktcvfgi0vrvs4; path=/; HttpOnly
X-Powered-By:PHP/5.5.23

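Cookies are handled automatically; there is no need to set them

Location:http://zyzfw.xidian.edu.cn/home/base/index is the address the browser is redirected to after a successful login, i.e. the traffic information page.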
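The Location header suggests another way to recognize a successful login: check for a redirect to the traffic page. The sketch below is an assumption-laden illustration and not what the project does. The capture does not show the status code, so any 3xx status is accepted, and `LoginResultSketch` and `isLoginSuccess` are hypothetical names.

```java
final class LoginResultSketch {
    private LoginResultSketch() {}

    // The redirect target taken from the Location header in the capture.
    static final String HOME_URL = "http://zyzfw.xidian.edu.cn/home/base/index";

    /** Treats a 3xx response whose Location header is the traffic page as a successful login. */
    static boolean isLoginSuccess(int statusCode, String locationHeader) {
        boolean redirect = statusCode >= 300 && statusCode < 400;
        return redirect && HOME_URL.equals(locationHeader);
    }
}
```

A check like this depends on the client not following the redirect automatically, so that the redirect response itself is what the code gets to inspect.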

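Besides the request and response headers, scrolling further down reveals Form Data: the parameters that the POST request has to send.

Apart from the user name, password, and verification code there is one more key parameter, _csrf, which is why we had to obtain it earlier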

Form Data

_csrf:U3NMOVJpWFcKEA59AhkZIyULdF8xLywuJBEuACckEDwnOyR.NTwUIg==
LoginForm[username]:00000001
LoginForm[password]:00000001
LoginForm[verifyCode]:7567
login-button:
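These fields travel in the request body as application/x-www-form-urlencoded, the Content-Type seen in the request headers. To show what that encoding does to names such as LoginForm[username], here is a dependency-free sketch using the JDK's URLEncoder (the Charset overload needs Java 10+). `FormBodySketch` is a hypothetical name; the project itself builds the body with HttpClient's UrlEncodedFormEntity.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

final class FormBodySketch {
    private FormBodySketch() {}

    /** Encodes the fields as name=value pairs joined by '&', in the map's iteration order. */
    static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }
}
```

Passing a LinkedHashMap keeps the fields in insertion order. The square brackets in the field names are percent-encoded as %5B and %5D, and the trailing == of the token as %3D%3D.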

The method loginOk=HttpOperate.loginFlowQuery(user); is in HttpOperate.java

 public static boolean loginFlowQuery(UserInfo user){


	        HttpPost httpPost = new HttpPost(NetConstans.LOGIN_URL);
	        httpPost.setHeader("Host", "zyzfw.xidian.edu.cn");
	        httpPost.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0");
	        httpPost.setHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
	        httpPost.setHeader("Accept-Language", "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3");
	        httpPost.setHeader("Accept-Encoding", "gzip, deflate");
	        httpPost.setHeader("Referer", "http://zyzfw.xidian.edu.cn/");
	        httpPost.setHeader("Origin", "http://zyzfw.xidian.edu.cn");
	        httpPost.setHeader("Connection", "keep-alive");
	        

	        // set param
	        List<BasicNameValuePair> formparams = new ArrayList<BasicNameValuePair>();
	        
	        formparams.add(new BasicNameValuePair("_csrf", user.getCsrf_token()));
	        formparams.add(new BasicNameValuePair("LoginForm[username]", user.getUserName()));
	        formparams.add(new BasicNameValuePair("LoginForm[password]", user.getPassword()));
	        formparams.add(new BasicNameValuePair("LoginForm[verifyCode]", user.getCode()));
	        formparams.add(new BasicNameValuePair("login-button", ""));
	        
	        UrlEncodedFormEntity encodedFormEntity = new UrlEncodedFormEntity(formparams, Consts.UTF_8);
	        httpPost.setEntity(encodedFormEntity);


	        //System.out.println("Bfore Login");
	        HttpResponse response = null;
	        try {
	            response = HttpClientManager.getInstance().execute(httpPost);
	            if(response == null){
	            	System.out.println("null");
	                return false;
	            }
	            String loginWebStr = EntityUtils.toString(response.getEntity());
	            //System.out.println(loginWebStr);
	            
	            if(loginWebStr==null||loginWebStr.equals(""))
	            	return  true;// an empty body means the login succeeded
	            else {
	            	Document document = Jsoup.parse(loginWebStr);
		            DocHandle.getErrorInfo(document, user);
	            	return false;// a non-empty body means an error; read the error message
	            }
	        } catch (IOException e) {
	            e.printStackTrace();
	            return false;
	        }

	    }

3: After a successful login, GET the page at the redirect address, extract its content with jsoup, and save it in user.
