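For my data mining course, I planned the data-preparation step as follows: read a configuration file, open the URLs it lists, and save each page to disk. The saved files would then go through content parsing, text extraction, conversion to a matrix, clustering, and so on.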
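// Needs: java.io.IOException, java.io.InputStream, java.util.Properties,
//        java.util.concurrent.ExecutorService, java.util.concurrent.Executors
public static void main(String[] args) {
    final int THREAD_COUNT = 5;
    Properties p = new Properties();
    // Load config.properties from the classpath; try-with-resources closes the stream.
    try (InputStream inputStream = CsdnBlogMining.class.getClassLoader()
            .getResourceAsStream("config.properties")) {
        if (inputStream == null) {
            System.err.println("config.properties not found on the classpath");
            return;
        }
        p.load(inputStream);
    } catch (IOException e) {
        e.printStackTrace();
        return;
    }
    String baseUrl = p.getProperty("baseUrl");
    String fileDir = p.getProperty("fileDir");
    String searchBlogs = p.getProperty("searchBlogs");
    // String category = new String(p.getProperty("category").getBytes("ISO-8859-1"), "UTF-8");

    // Check the contents with isEmpty(); `!=` on strings only compares references.
    if (searchBlogs == null || searchBlogs.trim().isEmpty()) {
        System.err.println("searchBlogs is empty; nothing to download");
        return;
    }
    // searchBlogs is a semicolon-separated list of blog names.
    String[] blogs = searchBlogs.split(";");
    ExecutorService pool = Executors.newFixedThreadPool(THREAD_COUNT);
    for (String s : blogs) {
        pool.submit(new SaveWeb(baseUrl + s, fileDir + "/" + s + ".html"));
    }
    // Stop accepting new tasks; the downloads already submitted still run to completion.
    pool.shutdown();
}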
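The handling of the `searchBlogs` property can be tried on its own, without any network access. The following is only a minimal sketch: the class name `BlogListConfig`, the helper `parseBlogs`, and every sample value (`blog.example.com`, `/tmp/blogs`, the three blog names) are made up for illustration and do not come from the real config file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class BlogListConfig {

    /** Splits a semicolon-separated value into names, trimming and dropping blanks. */
    static List<String> parseBlogs(String searchBlogs) {
        List<String> names = new ArrayList<>();
        if (searchBlogs == null) {
            return names; // property missing: nothing to download
        }
        for (String part : searchBlogs.split(";")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for config.properties (all values are assumptions).
        String config = "baseUrl=http://blog.example.com/\n"
                      + "fileDir=/tmp/blogs\n"
                      + "searchBlogs=alice; bob;;carol\n";
        Properties p = new Properties();
        p.load(new StringReader(config));

        String baseUrl = p.getProperty("baseUrl");
        String fileDir = p.getProperty("fileDir");
        // Print the URL/file pair that each download task would receive.
        for (String name : parseBlogs(p.getProperty("searchBlogs"))) {
            System.out.println(baseUrl + name + " -> " + fileDir + "/" + name + ".html");
        }
    }
}
```

Returning an empty list for a missing or blank property means the download loop simply does nothing, instead of failing on a `null` array.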
The module that opens a page and saves it:
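import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class SaveWeb implements Runnable {
    private final String url;
    private final String filename;

    public SaveWeb(String url, String filename) {
        this.url = url;
        this.filename = filename;
    }

    @Override
    public void run() {
        HttpClient httpclient = new DefaultHttpClient();
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2");
        try {
            HttpResponse response = httpclient.execute(httpGet);
            HttpEntity entity = response.getEntity();
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK && entity != null) {
                // UTF-8 is only the fallback when the response declares no charset.
                String res = EntityUtils.toString(entity, "UTF-8");
                // Open the file only after a successful response, so a failed
                // request leaves no empty file behind; the stream closes itself.
                try (OutputStream out = new BufferedOutputStream(new FileOutputStream(filename))) {
                    out.write(res.getBytes("UTF-8"));
                }
            } else {
                // Discard the body so the connection is released.
                EntityUtils.consume(entity);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Each task owns its client, so release its connections when done.
            httpclient.getConnectionManager().shutdown();
        }
    }
}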
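`DefaultHttpClient` has been deprecated since Apache HttpClient 4.3. As a rough sketch of the same fetch-and-save step without that dependency, here is a version built on the JDK's own `java.net.http.HttpClient` (Java 11 or later). The class name `SavePage`, the method `save`, the shortened User-Agent, and the timeout values are my own assumptions, not part of the original code.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

final class SavePage {
    // One shared client; HttpClient is thread-safe and can serve all pool threads.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .connectTimeout(Duration.ofSeconds(10))
            .build();

    /** Downloads url into target; returns true only if the server answered 200. */
    static boolean save(String url, Path target) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "Mozilla/5.0")
                .timeout(Duration.ofSeconds(30))
                .GET()
                .build();
        // Buffer the body in memory so that no file is created for non-200 replies.
        HttpResponse<byte[]> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.ofByteArray());
        if (response.statusCode() != 200) {
            return false;
        }
        // Create the output directory if it does not exist yet.
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        // Write the raw bytes as received, without re-encoding the text.
        Files.write(target, response.body());
        return true;
    }
}
```

A `Runnable` wrapper for the thread pool would call `SavePage.save(url, Path.of(filename))` and handle the two checked exceptions itself.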
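Follow-up:
The assignment is done, but it ended up having almost nothing to do with the code above. I first wanted to delete the whole post, but on reflection nothing in it is wrong; it just went unused, so I am keeping it.
In the end, the Java code (a loop plus a thread pool) was only used to collect a list of URLs and store it in a file. The actual mining was done in R: fetching the pages, extracting the body text, word segmentation, clustering, and writing out the results. R really saves effort; a few dozen lines of code covered everything. The final clustering result was not satisfactory, though. Features computed over the full text apparently do not discriminate well, so the resulting clusters are inaccurate, and the approach still needs improvement.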
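The matrix-conversion step from the plan at the top was done in R in the end, and that code is not shown here. Purely as an illustration of what that step produces, this is a minimal Java sketch that turns already-tokenized documents into a term-count matrix (one row per document, one column per term), which is the kind of input a clustering routine works on. `TermMatrix`, `vocabulary`, and `counts` are hypothetical names, the sample tokens are invented, and real Chinese text would first need a word segmenter, which is not covered here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class TermMatrix {

    /** Sorted list of the distinct terms that occur in any document. */
    static List<String> vocabulary(List<List<String>> docs) {
        TreeSet<String> terms = new TreeSet<>();
        for (List<String> doc : docs) {
            terms.addAll(doc);
        }
        return new ArrayList<>(terms);
    }

    /** matrix[d][t] = number of times vocab.get(t) occurs in document d. */
    static int[][] counts(List<List<String>> docs, List<String> vocab) {
        // Map each term to its column index for constant-time lookup.
        Map<String, Integer> column = new HashMap<>();
        for (int i = 0; i < vocab.size(); i++) {
            column.put(vocab.get(i), i);
        }
        int[][] matrix = new int[docs.size()][vocab.size()];
        for (int d = 0; d < docs.size(); d++) {
            for (String term : docs.get(d)) {
                Integer col = column.get(term);
                if (col != null) { // ignore terms outside the vocabulary
                    matrix[d][col]++;
                }
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        // Invented sample documents, already split into tokens.
        List<List<String>> docs = List.of(
                List.of("java", "thread", "java"),
                List.of("r", "cluster"));
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab);
        for (int[] row : counts(docs, vocab)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

With raw counts over the full text, words that appear in nearly every post carry as much weight as the distinctive ones, which may be part of why features of this kind separate the clusters poorly.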