Making Heritrix Store Files in UTF-8 Encoding

Original post

2020-02-23 02:43

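I have recently been learning about search engines and wanted to build an intranet search engine with Heritrix + Solr. Heritrix crawls web pages and saves them to a local repository; Solr builds an index on top of that repository and answers searches. While integrating the two, I found that Solr could only read files encoded as UTF-8 and showed garbled text for anything else, while Heritrix was saving the files in ANSI encoding (the mirror writer copies the response bytes to disk as-is, without re-encoding them). So Heritrix has to be modified to save files as UTF-8. My fundamentals are weak and reading the source was very hard; it took me a full day to figure this out.

The change goes in the writeToPath method of org.archive.crawler.writer.MirrorWriterProcessor. The original source is: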
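Before changing Heritrix, it can help to confirm which mirrored files are actually not UTF-8. The sketch below is one possible check using only the JDK; `Utf8Check` and `isValidUtf8` are hypothetical names made up for this illustration and are not part of Heritrix or Solr. Note that plain ASCII also passes, since ASCII is a subset of UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of Heritrix or Solr.
class Utf8Check {

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently
        // substituting replacement characters for bad input.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Usage: java Utf8Check <path to a mirrored file>
    public static void main(String[] args) throws Exception {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(isValidUtf8(bytes) ? "valid UTF-8" : "not UTF-8");
    }
}
```

A file that fails this check is a candidate for the garbled text described above; a file that passes is at least safe to read as UTF-8.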

private void writeToPath(RecordingInputStream recis, File dest)
        throws IOException {
        ReplayInputStream replayis = recis.getContentReplayInputStream();
        File tf = new File (dest.getPath() + "N");
        FileOutputStream fos = new FileOutputStream(tf);
        try {
            replayis.readFullyTo(fos);
        } finally {
            fos.close();
            replayis.close();
        }
        if (!tf.renameTo(dest)) {
            throw new IOException("Can not rename " + tf.getAbsolutePath()
                                  + " to " + dest.getAbsolutePath());
        }

    }

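After the change (the method is renamed, so the call site inside MirrorWriterProcessor has to call writeToPathToUtf8 instead):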

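    // Also needs imports: java.io.InputStreamReader, java.io.OutputStreamWriter,
    // java.io.Reader, java.io.Writer, java.nio.charset.Charset
    private void writeToPathToUtf8(RecordingInputStream recis, File dest)
        throws IOException {
        ReplayInputStream replayis = recis.getContentReplayInputStream();
        // Assumption: the page bytes are in the platform default charset (the
        // "ANSI" code page); pass the page's real charset here if it is known.
        Reader in = new InputStreamReader(replayis, Charset.defaultCharset());
        // Unlike the original, this writes straight to dest (no temp file + rename).
        Writer out = new OutputStreamWriter(new FileOutputStream(dest), "UTF-8");
        try {
            // Copy characters rather than decoding each byte buffer separately,
            // so a multi-byte character split across two reads stays intact.
            char[] buf = new char[4096];
            for (int n; (n = in.read(buf)) != -1;) {
                out.write(buf, 0, n);
            }
            out.flush();
        } finally {
            try {
                out.close();
            } finally {
                in.close(); // also closes replayis
            }
        }
    }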
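One pitfall when converting a byte stream buffer by buffer: if each buffer is decoded on its own (for example with `new String(bytes, 0, n)`), a multi-byte character that straddles a buffer boundary is decoded as garbage, whereas a `Reader` keeps the partial bytes until the rest arrives. The sketch below illustrates the difference using only the JDK. `ChunkDecodeDemo` and its methods are hypothetical names, and UTF-8 is used as the multi-byte source encoding only because every JVM is guaranteed to support it; the same boundary problem should apply to other multi-byte encodings such as GBK.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration; not part of Heritrix.
class ChunkDecodeDemo {

    /** Decodes each chunk independently, the way new String(b, 0, n) would. */
    static String decodePerChunk(byte[] data, Charset cs, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            // A character cut in half at the chunk edge becomes U+FFFD.
            sb.append(new String(data, off, len, cs));
        }
        return sb.toString();
    }

    /** Decodes through a Reader, which carries partial characters across reads. */
    static String decodeWithReader(byte[] data, Charset cs) throws IOException {
        StringBuilder sb = new StringBuilder();
        Reader reader = new InputStreamReader(new ByteArrayInputStream(data), cs);
        try {
            char[] buf = new char[4096];
            for (int n; (n = reader.read(buf)) != -1;) {
                sb.append(buf, 0, n);
            }
        } finally {
            reader.close();
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "\u641c\u7d22"; // two CJK characters, three UTF-8 bytes each
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        // A chunk size of 4 cuts the second character after its first byte.
        System.out.println(decodePerChunk(utf8, StandardCharsets.UTF_8, 4).equals(text)); // false
        System.out.println(decodeWithReader(utf8, StandardCharsets.UTF_8).equals(text));  // true
    }
}
```

With a 4096-byte buffer the corruption only shows up on pages larger than the buffer, and only when a character happens to fall on the boundary, which makes it easy to miss in testing.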
