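I recently needed to collect some hotel data for work, so I took a look at Ctrip. The crawling process is recorded below.
1. The hotel data is grouped by city name, so the first step was to find the names of all cities.
They are listed on this page: http://hotels.ctrip.com/domestic-city-hotel.html
The page is easy to find from the site map.
2. Next, look at the page source.
All of the data we need turns out to be inside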
<dl class = "pinyin_filter_detail layoutfix"></dl>
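3. Now fetch the dl element and all of its child elements.
We use the jsoup jar to parse the fetched HTML. The official site, https://jsoup.org/, provides the API documentation and the jar download.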
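// HttpUtil is a helper class from the project (see the GitHub repository at the end of this post)
String result = HttpUtil.getInstance().httpGet(null, "http://hotels.ctrip.com/domestic-city-hotel.html");
Document root_document = Jsoup.parse(result);
// getElementsByClass expects a single class name, so match both classes with a CSS selector instead
Elements pinyin_filter_elements = root_document.select("dl.pinyin_filter_detail.layoutfix");
// The Element that contains all cities
Element pinyin_filter = pinyin_filter_elements.first();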
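The HttpUtil class used above is not shown in this post. As a rough idea of what such a helper might do, here is a minimal sketch of a plain GET request built on the JDK's java.net.http client (Java 11+). The class name PageFetcher, the User-Agent value, and the timeouts are assumptions made for illustration, not the project's actual implementation.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical stand-in for the HttpUtil helper, which is not shown in this post.
class PageFetcher {

    // Builds a plain GET request. The User-Agent and timeout values are illustrative assumptions.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .header("User-Agent", "Mozilla/5.0")
                .GET()
                .build();
    }

    // Sends the request and returns the response body as a String.
    static String httpGet(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .connectTimeout(Duration.ofSeconds(10))
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }
}
```

With a helper like this, the fetch in step 3 could be written as String result = PageFetcher.httpGet("http://hotels.ctrip.com/domestic-city-hotel.html"); and passed to Jsoup.parse as before.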
4. I wanted to store the fetched city data in MySQL, so the code below connects to a local MySQL database.
// Connect to the database (SqlDBUtils is a helper class from the project)
Connection conn = SqlDBUtils.getConnection();
String create_table_sql = "create table if not exists ctrip_hotel_city (id integer primary key auto_increment, city_id integer not null, city_name varchar(255) not null, head_pinyin varchar(80) not null, pinyin varchar(255) not null)";
PreparedStatement preparedStatement;
try {
    // Drop the table on every run so the same rows are not inserted twice
    preparedStatement = conn.prepareStatement("DROP TABLE IF EXISTS ctrip_hotel_city");
    preparedStatement.execute();
    // Create the ctrip_hotel_city table that stores the city data
    preparedStatement = conn.prepareStatement(create_table_sql);
    preparedStatement.execute();
} catch (SQLException e) {
    e.printStackTrace();
}
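The snippet above never closes its statements. One possible variation, shown here only as a sketch, wraps the same two SQL statements in a try-with-resources block so the Statement is released even when an error occurs. The class name CityTableSetup and the method name recreateTable are made up for this example. The SQL is the same as in step 4.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper; the class and method names are assumptions for illustration.
class CityTableSetup {
    static final String DROP_SQL = "DROP TABLE IF EXISTS ctrip_hotel_city";
    static final String CREATE_SQL =
            "CREATE TABLE IF NOT EXISTS ctrip_hotel_city ("
            + "id INTEGER PRIMARY KEY AUTO_INCREMENT, "
            + "city_id INTEGER NOT NULL, "
            + "city_name VARCHAR(255) NOT NULL, "
            + "head_pinyin VARCHAR(80) NOT NULL, "
            + "pinyin VARCHAR(255) NOT NULL)";

    // Drops and recreates the table; the Statement is closed automatically.
    static void recreateTable(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            stmt.executeUpdate(DROP_SQL);
            stmt.executeUpdate(CREATE_SQL);
        }
    }
}
```

It would be called as CityTableSetup.recreateTable(conn) with the connection obtained from SqlDBUtils.getConnection().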
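5. Get every dt and dd under the dl, then extract the fields the database table needs and store them.
// Elements holding the first pinyin letter of each group
Elements pinyins = pinyin_filter.getElementsByTag("dt");
// Elements for all dd tags
Elements hotelsLinks = pinyin_filter.getElementsByTag("dd");
6. Data extraction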
for (int i = 0; i < pinyins.size(); i++) {
    Element head_pinyin = pinyins.get(i);
    Element head_hotelsLink = hotelsLinks.get(i);
    Elements links = head_hotelsLink.children();
    for (Element link : links) {
        String href = link.attr("href");
        // StringUtil is a helper class from the project
        String cityId = StringUtil.getNumbers(href);
        String cityName = link.html();
        String head_pinyin_str = head_pinyin.html();
        String pinyin_cityId = href.replace("/hotel/", "");
        String pinyin = pinyin_cityId.replace(cityId, "");
        StringBuilder insert_sql = new StringBuilder();
        insert_sql.append("insert into ctrip_hotel_city (city_id, city_name, head_pinyin, pinyin) values (");
        insert_sql.append(cityId);
        insert_sql.append(", '" + cityName + "'");
        insert_sql.append(", '" + head_pinyin_str + "'");
        // Note: pinyin can contain ', which causes an error if inserted as-is, so replace each ' with ''
        insert_sql.append(", '" + pinyin.replace("'", "''") + "')");
        try {
            preparedStatement = conn.prepareStatement(insert_sql.toString());
            preparedStatement.execute();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}
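The loop above builds each INSERT by string concatenation, which is why the apostrophe in the pinyin has to be escaped by hand. An alternative worth considering is to bind the values as PreparedStatement parameters, so that no manual escaping is needed. The sketch below also shows one way to split an href into its city id and pinyin with a regular expression. It assumes hrefs of the form "/hotel/" + pinyin + numeric id. The class CityLinkParser and its methods are hypothetical, not the project's StringUtil.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not the project's StringUtil.
class CityLinkParser {
    // Assumed href shape: "/hotel/" + pinyin + numeric city id, e.g. "/hotel/beijing1".
    private static final Pattern HREF = Pattern.compile("^/hotel/(\\D*)(\\d+)$");

    static final String INSERT_SQL =
            "INSERT INTO ctrip_hotel_city (city_id, city_name, head_pinyin, pinyin) VALUES (?, ?, ?, ?)";

    private static Matcher match(String href) {
        Matcher m = HREF.matcher(href);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected href: " + href);
        }
        return m;
    }

    // Returns the trailing digits of the href as the city id.
    static int cityId(String href) {
        return Integer.parseInt(match(href).group(2));
    }

    // Returns the part between "/hotel/" and the trailing digits.
    static String pinyin(String href) {
        return match(href).group(1);
    }

    // Inserts one city row; bound parameters make manual quote escaping unnecessary.
    static void insertCity(Connection conn, String href, String cityName, String headPinyin)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setInt(1, cityId(href));
            ps.setString(2, cityName);
            ps.setString(3, headPinyin);
            ps.setString(4, pinyin(href));
            ps.executeUpdate();
        }
    }
}
```

Inside the inner loop this would replace the string building with a single call such as CityLinkParser.insertCity(conn, link.attr("href"), link.html(), head_pinyin.html()).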
7. After running the program, check the ctrip_hotel_city table in the MySQL database to see the stored city data.
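That covers the approach to fetching the hotel cities. The next step is to use each city to fetch the data for all of its hotels.
GitHub source: https://github.com/jianiuqi/CTripSpider
The post "Java Data Crawling: Crawling Ctrip Hotel Data (Part 2)" describes how to crawl hotel data by region and save it to a MySQL database.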