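I have recently been learning about search engines and wanted to build an intranet search engine with Heritrix + Solr. Heritrix crawls web pages and saves them to a local repository; Solr builds an index on top of that repository and then serves searches. While integrating the two, I found that Solr only reads files correctly when they are encoded as UTF-8; anything else comes out garbled. Heritrix, however, saves files in ANSI encoding, so it has to be modified to save them as UTF-8. My fundamentals are weak and reading the source was very hard going; it took me a whole day to figure this out.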
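To make the garbling concrete, here is a minimal sketch of what happens when bytes written in one encoding are read back as UTF-8. The class name `EncodingMismatchDemo` and the helper `reinterpret` are made up for illustration, and ISO-8859-1 merely stands in for whatever "ANSI" code page the machine uses; none of this is Heritrix or Solr code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Heritrix or Solr.
class EncodingMismatchDemo {
    /** Encodes text with one charset, then decodes the same bytes with another. */
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        byte[] raw = text.getBytes(writtenAs);
        return new String(raw, readAs);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "café", written with an escape to avoid source-encoding issues

        // Written as ISO-8859-1 (stand-in for an ANSI code page), read as UTF-8.
        String garbled = reinterpret(original, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        // Written and read with the same charset.
        String intact = reinterpret(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        System.out.println(garbled.equals(original)); // false
        System.out.println(intact.equals(original));  // true
    }
}
```

The bytes on disk are fine in both cases; only the reader's assumption about the encoding differs, which is why converting the saved files to UTF-8 is enough to fix the index.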
Modify the writeToPath method in org.archive.crawler.writer.MirrorWriterProcessor. The original source is:
private void writeToPath(RecordingInputStream recis, File dest)
        throws IOException {
    ReplayInputStream replayis = recis.getContentReplayInputStream();
    File tf = new File (dest.getPath() + "N");
    FileOutputStream fos = new FileOutputStream(tf);
    try {
        replayis.readFullyTo(fos);
    } finally {
        fos.close();
        replayis.close();
    }
    if (!tf.renameTo(dest)) {
        throw new IOException("Can not rename " + tf.getAbsolutePath()
                + " to " + dest.getAbsolutePath());
    }
}
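After the change (the method is renamed, so the call to writeToPath in the same class must be updated as well):
// Additional imports: java.io.InputStreamReader, java.io.OutputStreamWriter,
// java.io.Reader, java.io.Writer, java.nio.charset.Charset
private void writeToPathToUtf8(RecordingInputStream recis, File dest)
        throws IOException {
    ReplayInputStream replayis = recis.getContentReplayInputStream();
    Writer out = null;
    try {
        // Decode with the platform default ("ANSI") charset through a Reader,
        // so a multi-byte character split across two buffers is not corrupted.
        Reader in = new InputStreamReader(replayis, Charset.defaultCharset());
        out = new OutputStreamWriter(new FileOutputStream(dest), "UTF-8");
        char[] buf = new char[4096];
        for (int n; (n = in.read(buf)) != -1;) {
            out.write(buf, 0, n);
        }
        out.flush();
    } finally {
        // Close the writer even when the copy fails, then the replay stream.
        try {
            if (out != null) {
                out.close();
            }
        } finally {
            replayis.close();
        }
    }
}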
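The same conversion can be tried outside Heritrix. The following is a rough, self-contained sketch under a few assumptions: the class `Utf8Transcoder`, its `transcode` method, the `sourceCharset` parameter, and the ".tmp" suffix are all made up for illustration and are not Heritrix API. It keeps the write-to-a-temporary-file-then-rename step of the original writeToPath, and it takes the source encoding as a parameter instead of assuming the platform default.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper class, not part of Heritrix.
class Utf8Transcoder {
    /** Decodes `in` with `sourceCharset` and writes the text to `dest` as UTF-8. */
    static void transcode(InputStream in, Charset sourceCharset, File dest) throws IOException {
        // Write to a temporary file first so a failed copy never leaves a partial `dest`.
        File tmp = new File(dest.getPath() + ".tmp");
        try (Reader reader = new InputStreamReader(in, sourceCharset);
             Writer writer = new OutputStreamWriter(new FileOutputStream(tmp), StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            for (int n; (n = reader.read(buf)) != -1;) {
                writer.write(buf, 0, n); // chars, not bytes: the Reader already decoded them
            }
        }
        if (!tmp.renameTo(dest)) {
            throw new IOException("Can not rename " + tmp.getAbsolutePath()
                    + " to " + dest.getAbsolutePath());
        }
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("mirror").toFile();
        File dest = new File(dir, "page.html");
        byte[] page = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        transcode(new ByteArrayInputStream(page), StandardCharsets.ISO_8859_1, dest);

        String readBack = new String(Files.readAllBytes(dest.toPath()), StandardCharsets.UTF_8);
        System.out.println(readBack.equals("caf\u00e9"));
    }
}
```

Reading through a Reader matters here: the decoder keeps its state between reads, whereas turning each byte chunk into a String on its own can cut a multi-byte character in half at a buffer boundary. In a real crawl the source encoding would more plausibly come from each page's Content-Type header than from a fixed value, but that is beyond this sketch.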