Building Your Own Search Engine with Lucene (2): Setting Up the Environment and Building the Index Files with Indexer

Original

2020-02-21 06:40

Source: http://www.wenbanana.com/?p=708

1. Downloading the Lucene package

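I am learning Lucene from the second edition of "Lucene in Action", which uses Lucene 3.x for its examples, so I downloaded lucene-3.6.2 (the file is lucene-3.6.2.zip). I recommend that you also use a 3.x release: the API differs between versions, and a different version may cause avoidable errors when you write the code.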

The official download page: http://www.apache.org/dyn/closer.cgi/lucene/java/3.6.2

2. Using lucene-core-3.6.2.jar

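After downloading, just unzip the archive anywhere on disk; then we can start writing code. A search engine boils down to three steps: (1) crawling web pages, (2) building an index, and (3) serving users. Crawling would normally come first, but this series is about searching, that is, about document retrieval, so we skip the crawler. Before we can search anything, we need an indexing program, Indexer, to build the index files that the engine will search.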
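To make the split between "building an index" and "serving users" concrete, here is a minimal sketch in plain Java, without Lucene. The class TinyInvertedIndex and its methods are made up for illustration, and the sketch assumes Java 8 or later; Lucene's real index does far more (analysis, scoring, on-disk storage).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy class for illustration only; it is not part of Lucene.
class TinyInvertedIndex {
    // term -> sorted set of ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Indexing step: split the text into lowercase terms and record the document id.
    public void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Serving step: look the term up instead of scanning every document.
    public Set<Integer> search(String term) {
        Set<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<Integer>emptySet() : hits;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Lucene is a search library");
        index.add(2, "An index makes search fast");
        System.out.println(index.search("search")); // [1, 2]
        System.out.println(index.search("lucene")); // [1]
    }
}
```

The point of the sketch is only that the expensive work happens once, at indexing time, so that each query becomes a cheap lookup.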

3. Creating the Indexer program

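step1: Put lucene-core-3.6.2.jar on the CLASSPATH. You can either add it to the CLASSPATH environment variable, or right-click the Java project, open its properties, and add the jar there. I used the second method.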
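A quick way to confirm that the jar is really visible is to ask the JVM to locate one of its classes. The helper below is a hypothetical sketch (the class and method names are mine, not from Lucene); it uses only the standard Class.forName call.

```java
// Hypothetical helper for illustration; the names are not from the original post or from Lucene.
class ClasspathCheck {
    // Returns true if the named class can be located on the current classpath.
    public static boolean isOnClasspath(String className) {
        try {
            // initialize = false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String lucene = "org.apache.lucene.index.IndexWriter";
        System.out.println(lucene + " found: " + isOnClasspath(lucene));
    }
}
```

If this reports that the class was not found, the jar was not added to the classpath the program is actually running with.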

step2: Create a Java project named LuceneInAction. The Indexer class shown in step4 lives in a package named Lucene inside this project.

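step3: Before writing any code, create two folders: an index folder to hold the index files, and a data folder to hold the data files (such as txt files). The folders can go anywhere; I created them under the directory where I unzipped Lucene, so in the code they appear as "E:\\lucene-3.6.2\\index" and "E:\\lucene-3.6.2\\data".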

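Next, create a few txt files in the data folder to be indexed. Their content must be in English: we have not added Chinese word segmentation yet, so for now only English text is handled properly.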
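If you would rather not create the folders and files by hand, something like the following sketch can do it. The class name SampleDataSetup, the default folder name lucene-demo, and the sample sentences are all assumptions made for illustration; only the index and data folder names come from the step above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; the names and sample sentences are made up.
class SampleDataSetup {
    // Creates <base>/index and <base>/data, then writes a few English .txt files into data.
    public static Path prepare(Path base) throws IOException {
        Files.createDirectories(base.resolve("index"));
        Path data = Files.createDirectories(base.resolve("data"));
        String[] texts = {
            "Lucene is a full-text search library written in Java.",
            "An inverted index maps each term to the documents that contain it.",
            "The indexer reads text files and adds them to the index."
        };
        for (int i = 0; i < texts.length; i++) {
            Path file = data.resolve((i + 1) + ".txt"); // 1.txt, 2.txt, 3.txt
            Files.write(file, texts[i].getBytes(StandardCharsets.US_ASCII));
        }
        return data;
    }

    public static void main(String[] args) throws IOException {
        // Pass the base folder as the first argument; defaults to ./lucene-demo
        Path base = Paths.get(args.length > 0 ? args[0] : "lucene-demo");
        System.out.println("Sample files written to " + prepare(base).toAbsolutePath());
    }
}
```

Running it with the folder where you unzipped Lucene as the argument gives the same layout as the one described above.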

step4: Now we can write the code:

package Lucene;

import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * Builds a Lucene index from the .txt files in a directory.
 * @author Administrator
 *
 */
public class Indexer {

	/**
	 * @param args not used; the directories are hard-coded below
	 */
	public static void main(String[] args) throws Exception{
	
		String indexDir = "E:\\lucene-3.6.2\\index";// directory in which the index files are created
		String dataDir = "E:\\lucene-3.6.2\\data";// directory whose ".txt" files are indexed
		
		long start = System.currentTimeMillis();
		Indexer indexer = new Indexer(indexDir);
		int numIndexed;
		try{
			numIndexed = indexer.index(dataDir, new TextFilesFilter());
		}finally{
			indexer.close();
		}
		long end = System.currentTimeMillis();
		
		System.out.println("Indexed "+ numIndexed + " files in "+
		(end - start) + "ms");
		
	}
	
	private IndexWriter writer;
	
	// Create the Lucene IndexWriter
	public Indexer(String indexDir)throws IOException{
		Directory dir = FSDirectory.open(new File(indexDir));
		/*
		 * Version.LUCENE_30 is a version parameter: Lucene matches its
		 * behavior to the release named by this value.
		 * The third argument (true) creates a new index, overwriting
		 * any index that already exists in the directory.
		 */
		writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30), true,
				IndexWriter.MaxFieldLength.UNLIMITED);
	}
	
	// Close the IndexWriter
	public void close()throws IOException{
		writer.close();
	}
	
	// Index every accepted file in dataDir and return the number of documents in the index
	public int index(String dataDir, FileFilter filter)throws Exception{
		File[] files = new File(dataDir).listFiles();
		// listFiles() returns null if dataDir is not a directory or cannot be read
		if(files == null){
			throw new IOException("Cannot list files in " + dataDir);
		}
		
		for(File f:files){
			if(!f.isDirectory() &&
					!f.isHidden()&&
					f.exists()&&
					f.canRead()&&
					(filter == null || filter.accept(f))){
				indexFile(f);
			}
		}
		return writer.numDocs();
	}
	
	// Accept only .txt files, using a FileFilter
	private static class TextFilesFilter implements FileFilter{

		@Override
		public boolean accept(File pathname) {
			return pathname.getName().toLowerCase().endsWith(".txt");
		}
		
	}
	
	protected Document getDocument(File f) throws Exception{
		Document doc = new Document();
		doc.add(new Field("contents", new FileReader(f)));// index the file contents
		doc.add(new Field("filename", f.getName(),// index the file name
				Field.Store.YES, Field.Index.NOT_ANALYZED));
		doc.add(new Field("fullpath", f.getCanonicalPath(),// index the full path of the file
				Field.Store.YES, Field.Index.NOT_ANALYZED));
		
		return doc;
	}
	
	// Add a document to the Lucene index
	private void indexFile(File f) throws Exception{
		System.out.println("Indexing "+f.getCanonicalPath());
		Document doc = getDocument(f);
		writer.addDocument(doc);
	}

}

Now compile and run the code. If nothing goes wrong, you will see output like this:

Indexing E:\lucene-3.6.2\data\1.txt
Indexing E:\lucene-3.6.2\data\2.txt
Indexing E:\lucene-3.6.2\data\3.txt
Indexing E:\lucene-3.6.2\data\4.txt
Indexed 4 files in 259ms

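Many new files will also appear in the index folder, which shows that the index was built successfully. You may find some of the code above puzzling; don't worry, I will go through these classes one by one in later posts. For now a general understanding is enough.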
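One part of Indexer that can be explored right away, without Lucene, is how index() decides which files to pass to indexFile(). The sketch below mirrors those checks using only java.io; the class IndexableFiles and its members are hypothetical names introduced for illustration, and the sorting step is my addition so that the listing is stable.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; it mirrors the checks in Indexer.index() without Lucene.
class IndexableFiles {
    // Same rule as TextFilesFilter: accept names ending in ".txt", ignoring case.
    public static final FileFilter TXT_ONLY =
            f -> f.getName().toLowerCase(Locale.ROOT).endsWith(".txt");

    // Returns the names of the files in dataDir that would be indexed.
    public static List<String> select(File dataDir, FileFilter filter) {
        List<String> names = new ArrayList<>();
        File[] files = dataDir.listFiles();
        if (files == null) {
            return names; // not a directory, or it could not be read
        }
        Arrays.sort(files); // listFiles() makes no promise about order
        for (File f : files) {
            // Regular, visible, readable files that the filter accepts (null filter accepts all)
            if (f.isFile() && !f.isHidden() && f.canRead()
                    && (filter == null || filter.accept(f))) {
                names.add(f.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        File dataDir = new File(args.length > 0 ? args[0] : "data");
        for (String name : select(dataDir, TXT_ONLY)) {
            System.out.println("Would index " + name);
        }
    }
}
```

Pointing it at the data folder should list the same txt files that the Indexer run reported, while subfolders and files with other extensions are skipped.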

w踏雪w
