Web crawlers are an important means of collecting big data from the web, and crawling is by now a very mature technology; even so, I wanted to see how it performs in a Spark environment.
It turns out to be quite simple: build the job around a JavaSparkContext, and the familiar Java page-fetching code can be reused almost as-is.
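For completeness, here is a minimal sketch of creating that JavaSparkContext; the application name and the local master setting are placeholder assumptions, not part of the original code:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// "SparkCrawler" and "local[4]" are placeholders; point the master at a real cluster as needed
SparkConf conf = new SparkConf().setAppName("SparkCrawler").setMaster("local[4]");
JavaSparkContext sc = new JavaSparkContext(conf);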
First, provide a few seed URLs and turn them into a JavaRDD:
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

//## seed URLs (placeholder); note that parallelize() takes a List, not a single String
List<String> urlList = Arrays.asList("http://example.com/");
JavaRDD<String> rdd = sc.parallelize(urlList);
JavaRDD<String> content = rdd.map(new Function<String, String>() {
    public String call(String url) throws Exception {
        System.out.println(url);
        CloseableHttpClient client = null;
        CloseableHttpResponse response = null;
        try {
            //## create the default HTTP client and issue a GET request
            client = HttpClients.createDefault();
            HttpGet get = new HttpGet(url);
            response = client.execute(get);
            HttpEntity entity = response.getEntity();
            //## copy the response body into a byte stream
            ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
            entity.writeTo(byteArrayOutputStream);
            //## decode the bytes into an HTML string
            String html = new String(byteArrayOutputStream.toByteArray(), StandardCharsets.UTF_8);
            //## parse into a Document here if links are to be extracted in the same pass
            Document document = Jsoup.parse(html);
            return html;
        } catch (Exception ex) {
            ex.printStackTrace();
            return "";
        } finally {
            if (response != null) {
                response.close();
            }
            if (client != null) {
                client.close();
            }
        }
    }
});
Of course, you can then extract the sub-page links from each HTML document and keep crawling depth-first or breadth-first; a sketch of one breadth-first round follows.
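As a rough sketch of a breadth-first expansion, assuming the HttpClient logic above is wrapped in a helper called fetchHtml (a hypothetical name), each round fetches the current frontier, extracts the absolute link targets with Jsoup, and deduplicates them with distinct(). The FlatMapFunction here returns an Iterable, matching the Spark 1.x Java API; Spark 2.x expects an Iterator instead:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

JavaRDD<String> frontier = rdd;           // start from the seed RDD above
for (int depth = 0; depth < 2; depth++) { // crawl two levels deep (placeholder depth)
    frontier = frontier.flatMap(new FlatMapFunction<String, String>() {
        public Iterable<String> call(String url) throws Exception {
            List<String> links = new ArrayList<String>();
            try {
                //## fetchHtml is a hypothetical helper wrapping the HttpClient code above
                String html = fetchHtml(url);
                //## pass the page URL as base URI so abs:href can resolve relative links
                Document doc = Jsoup.parse(html, url);
                for (Element a : doc.select("a[href]")) {
                    links.add(a.attr("abs:href"));
                }
            } catch (Exception ex) {
                ex.printStackTrace(); // skip pages that fail to download or parse
            }
            return links;
        }
    }).distinct();                        // deduplicate the new frontier
}

In practice you would also want to track already-visited URLs, for example by calling subtract() against the previous frontiers, so the crawl does not revisit the same pages.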