Java Crawler: Bulk-Scraping Second-Hand Housing Pages and Storing Them in a Cloud Database, Explained in Detail (Part 1)

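Analysis:
The crawler is structured in three parts:
1. Fetch the page links and parse the pages.
2. Temporarily store the parsed information in objects of a purpose-written class (this step may be skipped).
3. Write the cached information to the cloud database.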
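
As a rough sketch of how these three stages fit together (this is not the author's actual code; `CrawlPipeline`, `run`, `parser`, and `sink` are hypothetical names introduced for illustration), the flow can be written as a small loop in which the parsing step and the storage step are passed in as functions:

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical skeleton of the three-stage flow; names are illustrative only.
final class CrawlPipeline {
    /**
     * Runs every URL through the parser and hands each parsed record to the sink.
     * Returns the number of records that were passed to the sink.
     */
    static <T> int run(List<String> urls, Function<String, T> parser, Consumer<T> sink) {
        int stored = 0;
        for (String url : urls) {
            T record = parser.apply(url);   // stage 1: fetch and parse
            if (record == null) {
                continue;                   // skip pages that could not be parsed
            }
            sink.accept(record);            // stages 2 and 3: cache and/or store
            stored++;
        }
        return stored;
    }
}
```

In a real run the parser would be the jsoup-based method from this article and the sink would write to the database; here both are left abstract so the skeleton stays independent of either.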

**Tool used:** Eclipse

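This part mainly covers fetching the page links and parsing the pages.
Because the second-hand housing pages are static, I used jsoup, a third-party Java library, to process them.
PS: jsoup is a powerful HTML parsing library for Java that makes parsing web pages very convenient; I won't go into detail here.
First, look at the listing pages:
As you can see, each listing on the page has its own link, and there are 100 such pages. The list-page URLs therefore clearly follow a pattern, so we write a method that takes a page number and returns the links to the house pages listed on that page: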



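import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

//*********************************** Chongqing second-hand housing ****************************//
	// Returns the URLs of all listings on the given list page
	public static String[] getEershoufangUrlList(int page) throws Exception {
		// Build the list-page URL from the page number
		String pageUrl = "https://cq.ke.com/ershoufang/pg" + page + "/";
		Connection con = Jsoup.connect(pageUrl);
		con.header("User-Agent", user_agent);
		Document doc = con.get();
		// Every "title" element inside the "sellListContent" container holds one listing link
		Elements lis = doc.getElementsByClass("sellListContent").get(0).getElementsByClass("title");
		int sum = lis.size();
		String[] urlList = new String[sum];
		// Collect the href of each listing
		for (int i = 0; i < sum; i++) {
			urlList[i] = lis.get(i).getElementsByTag("a").get(0).attr("href");
		}
		return urlList;
	}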

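// A brief explanation: pageUrl is the address of one list page. The method fetches the whole page through jsoup, pulls the house-page address out of each matching element, and returns the addresses as an array.
// PS: Sites like this usually have anti-crawler measures, so a User-Agent header is set to make the request look like it comes from a browser. This only gets past simple checks; against more sophisticated anti-crawler measures it is unlikely to work.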

// Browser User-Agent string, used to get past simple anti-crawler checks
	private static String user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6788.400 QQBrowser/10.3.2727.400";
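
To cover all 100 list pages, the same URL pattern can be generated for every page number up front. The following is a minimal sketch of that step only; it does no network access, and `ListPageUrls` and `build` are hypothetical names, not part of the original code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: generates the list-page URLs using the pattern from the article.
final class ListPageUrls {
    // Base address taken from the article's code; the page number and a trailing "/" follow it.
    private static final String BASE = "https://cq.ke.com/ershoufang/pg";

    /** Builds the list-page URLs for pages firstPage..lastPage (both inclusive). */
    static List<String> build(int firstPage, int lastPage) {
        if (firstPage < 1 || lastPage < firstPage) {
            throw new IllegalArgumentException("invalid page range");
        }
        List<String> urls = new ArrayList<>();
        for (int page = firstPage; page <= lastPage; page++) {
            urls.add(BASE + page + "/");
        }
        return urls;
    }
}
```

Calling `ListPageUrls.build(1, 100)` yields one URL per list page. When fetching them in turn, pausing briefly between requests is a reasonable precaution given the anti-crawler measures mentioned above.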

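2. Next, use jsoup to fetch the page for each house URL obtained above:

// Process the page information
	public static SecondHouse getErshoufangInfo(String urlStr) throws Exception {
		// Create the house object
		SecondHouse house = new SecondHouse();
		Connection con = Jsoup.connect(urlStr);
		con.header("User-Agent", user_agent);
		Document doc = con.get();
		// ... parsing continues; the full method appears later in this article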

PS: SecondHouse is a class I wrote myself to hold the parsed house information.
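
The class itself is not shown in this part. As a hedged sketch, a plain data holder along the following lines would satisfy the calls made in the parsing code. The field names and types are inferred from the setters used in this article, and the getters and the output format of `printHouseInfo` are assumptions:

```java
/** Hypothetical data holder; the field set is inferred from the setters used in this article. */
class SecondHouse {
    private String elemName;    // listing title
    private int price;          // total price
    private int unit_price;     // unit price
    private int area;           // area derived from the two prices
    private String houseStyle;  // layout
    private String direction;   // orientation
    private String floor;
    private String decoration;
    private int buildTime;      // year built
    private String community;   // community name
    private String location;    // district
    private int viewerSum;      // number of followers

    public String getElemName() { return elemName; }
    public void setElemName(String elemName) { this.elemName = elemName; }
    public int getPrice() { return price; }
    public void setPrice(int price) { this.price = price; }
    public int getUnit_price() { return unit_price; }
    public void setUnit_price(int unit_price) { this.unit_price = unit_price; }
    public int getArea() { return area; }
    public void setArea(int area) { this.area = area; }
    public String getHouseStyle() { return houseStyle; }
    public void setHouseStyle(String houseStyle) { this.houseStyle = houseStyle; }
    public String getDirection() { return direction; }
    public void setDirection(String direction) { this.direction = direction; }
    public String getFloor() { return floor; }
    public void setFloor(String floor) { this.floor = floor; }
    public String getDecoration() { return decoration; }
    public void setDecoration(String decoration) { this.decoration = decoration; }
    public int getBuildTime() { return buildTime; }
    public void setBuildTime(int buildTime) { this.buildTime = buildTime; }
    public String getCommunity() { return community; }
    public void setCommunity(String community) { this.community = community; }
    public String getLocation() { return location; }
    public void setLocation(String location) { this.location = location; }
    public int getViewerSum() { return viewerSum; }
    public void setViewerSum(int viewerSum) { this.viewerSum = viewerSum; }

    /** Prints a one-line summary; the exact format is an assumption. */
    public static void printHouseInfo(SecondHouse house) {
        System.out.println(house.elemName + " | " + house.community + " | " + house.price);
    }
}
```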
The detail page of a listing is where the required information is parsed from.

Oh, and since I know next to nothing about regular expressions, I used just one expression throughout:

toString().replaceAll("</?[^>]+>", "").trim()

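This takes the HTML fragment containing the relevant information and turns it into a string with the HTML tags removed and no leading or trailing whitespace. In practice this one statement is entirely sufficient for the parsing here.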
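
To see what the expression does in isolation, here is a small self-contained sketch that applies it to a made-up HTML fragment. `TagStripper` and `stripTags` are hypothetical names, and the sample markup is invented for illustration:

```java
// Hypothetical wrapper around the expression quoted above.
final class TagStripper {
    /** Removes anything that looks like an HTML tag, then trims surrounding whitespace. */
    static String stripTags(String html) {
        return html.replaceAll("</?[^>]+>", "").trim();
    }

    public static void main(String[] args) {
        String html = "<div class=\"mainInfo\"> 3 rooms <b>2</b> halls </div>";
        System.out.println(stripTags(html)); // prints "3 rooms 2 halls"
    }
}
```

Note that this only deletes tags; character entities such as `&nbsp;` are left as they are. jsoup's own `Element.text()` also returns an element's text without tags, and may be worth considering as an alternative.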
In addition, a method for extracting the digits from a string is needed, and it is quite important:

//*************************** Helper methods ***********************************//
	// Extracts a number from a string, keeping only the integer part; used as a helper
	private static int getNumFromStr(String str) {
		str = str.trim();
		StringBuilder digits = new StringBuilder();
		for (int i = 0; i < str.length(); i++) {
			char c = str.charAt(i);
			// Keep only the integer part: stop at the first '.'
			if (c == '.') {
				break;
			}
			if (c >= '0' && c <= '9') {
				digits.append(c);
			}
		}
		// No digits found
		if (digits.length() == 0) {
			return 0;
		}
		return Integer.parseInt(digits.toString());
	}
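
The same behaviour can be written more compactly with `String` methods. The sketch below is an equivalent formulation under that reading of the helper, not the author's code; `NumberText` and `intBeforeDot` are hypothetical names:

```java
// Hypothetical compact equivalent of the digit-extraction helper.
final class NumberText {
    /**
     * Collects every ASCII digit that appears before the first '.' and parses
     * the result as an int; returns 0 when no digit is found.
     */
    static int intBeforeDot(String text) {
        int dot = text.indexOf('.');
        String head = (dot >= 0) ? text.substring(0, dot) : text;
        String digits = head.replaceAll("[^0-9]", "");
        return digits.isEmpty() ? 0 : Integer.parseInt(digits);
    }

    public static void main(String[] args) {
        System.out.println(intBeforeDot("<span class=\"total\">128.5</span>")); // prints 128
        System.out.println(intBeforeDot("3 rooms 2 halls"));                    // prints 32
    }
}
```

The second call shows a property worth keeping in mind: all digits before the first dot are concatenated, so the helper is only safe on fragments that contain a single number. Any digit inside a tag attribute would be picked up as well.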

The full jsoup method for parsing a house page is as follows:

// Process the page information
	public static SecondHouse getErshoufangInfo(String urlStr) throws Exception {
		// Create the house object
		SecondHouse house = new SecondHouse();
		Connection con = Jsoup.connect(urlStr);
		con.header("User-Agent", user_agent);
		Document doc = con.get();
		// Get the title
		String title = doc.getElementsByClass("title-wrapper").get(0).getElementsByClass("main").get(0).attr("title");
		house.setElemName(title);
		Element houseInfo = doc.getElementsByClass("overview").get(0).getElementsByClass("content").get(0);
		// Read each item and write it into the object
		// Total price
		int price = getNumFromStr(houseInfo.getElementsByClass("total").toString());
		house.setPrice(price);
		// Unit price
		int unit_price = getNumFromStr(houseInfo.getElementsByClass("unitPriceValue").toString());
		house.setUnit_price(unit_price);
		// Area, derived from total price and unit price (0 if the unit price could not be parsed)
		house.setArea(unit_price == 0 ? 0 : price * 10000 / unit_price);
		Elements mainInfos = houseInfo.getElementsByClass("mainInfo");
		// Layout
		String houseStyle = mainInfos.get(0).toString().replaceAll("</?[^>]+>", "").trim();
		house.setHouseStyle(houseStyle);
		// Orientation
		String direction = mainInfos.get(1).toString().replaceAll("</?[^>]+>", "").trim();
		house.setDirection(direction);
		Elements subInfos = houseInfo.getElementsByClass("subInfo");
		// Floor
		String floor = subInfos.get(0).toString().replaceAll("</?[^>]+>", "").trim();
		house.setFloor(floor);
		// Decoration
		String decoration = subInfos.get(1).toString().replaceAll("</?[^>]+>", "").trim();
		house.setDecoration(decoration);
		// Year built
		int buildTime = getNumFromStr(subInfos.get(2).toString());
		house.setBuildTime(buildTime);
		// Community name and district
		Elements infos = houseInfo.getElementsByClass("info");
		String community = infos.get(0).toString().replaceAll("</?[^>]+>", "");
		house.setCommunity(community);
		// Only the first two characters are kept as the district
		String location = infos.get(1).toString().replaceAll("</?[^>]+>", "").substring(0, 2);
		house.setLocation(location);
		// Number of followers
		int viewerSum = getNumFromStr(doc.getElementById("favCount").toString());
		house.setViewerSum(viewerSum);
		return house;
	}
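
The area in the method above is not read from the page but derived from the two prices. The factor of 10000 suggests that the total price is expressed in units of 10,000 yuan and the unit price in yuan per square metre; that reading is an inference from the code, not something stated on the page. Under that assumption, the calculation can be sketched on its own as follows, with `AreaCalc` and `areaSquareMeters` as hypothetical names:

```java
// Hypothetical standalone version of the area calculation, under the unit assumptions above.
final class AreaCalc {
    /**
     * totalPriceWan: total price in units of 10,000 yuan (assumed).
     * unitPriceYuan: price per square metre in yuan (assumed).
     * Returns the area in whole square metres, or 0 if the unit price is not positive.
     */
    static int areaSquareMeters(int totalPriceWan, int unitPriceYuan) {
        if (unitPriceYuan <= 0) {
            return 0; // avoid division by zero when the unit price could not be parsed
        }
        // Use long arithmetic so that large prices cannot overflow an int
        return (int) ((long) totalPriceWan * 10_000L / unitPriceYuan);
    }

    public static void main(String[] args) {
        System.out.println(areaSquareMeters(120, 12000)); // prints 100
    }
}
```

Because this is integer division, the result is truncated: a total price of 85 and a unit price of 9500 give 89, not 89.47.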

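As for the details of how jsoup parses page elements, I won't go through them one by one here. I only used a few of the common getElement methods, and anyone interested can look them up.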

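Also: temporarily storing the parsed house information will be covered next time. This is my first Java crawler and I am still quite inexperienced, so I would appreciate pointers from more experienced readers on anything that can be optimized or is lacking.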
