Building Your Own Search Engine with Lucene (3): The Basic Classes in the Indexer

Original

w踏雪w

2020-02-21 06:40

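(1) Directory:

The Directory class describes where a Lucene index is stored. It is an abstract class, and its subclasses determine how and where the index is actually stored. FSDirectory.open returns a Directory backed by a path in the file system, and that Directory is then passed to the IndexWriter constructor.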

Directory dir = FSDirectory.open(new File(indexDir));

(2) IndexWriter:

IndexWriter creates a new index or opens an existing one, and adds, deletes, or updates the documents in that index.

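(3) Analyzer:

Before text is indexed, it has to pass through an Analyzer. The Analyzer is specified in the IndexWriter constructor. It extracts tokens from the text being indexed and discards everything else.

The code looks like this: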

writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30), true,
				IndexWriter.MaxFieldLength.UNLIMITED);
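To make "extract tokens and discard the rest" more concrete, here is a toy sketch in plain Java. It is not Lucene's StandardAnalyzer and does not use any Lucene API; the class name ToyAnalyzer and its stop-word list are assumptions made up for this illustration. It only shows the general idea: split the text, lowercase each piece, and drop common words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration only: this is NOT Lucene's StandardAnalyzer.
class ToyAnalyzer {
    // Hypothetical stop-word list chosen for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "is", "of", "and"));

    // Splits text on runs of non-letter, non-digit characters,
    // lowercases each token, and drops stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [quick, brown, fox, animal]
        System.out.println(tokenize("The quick brown fox is an animal."));
    }
}
```

A real analyzer does considerably more than this, but the output of either one is a stream of tokens, and those tokens are what end up in the index.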

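(4) Document:

A Document object represents a collection of fields (Field). You can think of it as one unit of content, such as a web page or a text file.

Its structure is simple: it is a container that holds multiple Field objects.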

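(5) Field:

Field is the class that holds the text content to be indexed. Every document in the index has one or more distinct fields, and each of them is represented by a Field object.

Each field has a name, a corresponding value, and a set of options that control precisely how Lucene indexes that value.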
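The relationship between the two classes can be sketched with a toy model in plain Java. ToyField and ToyDocument below are hypothetical names invented for this illustration; they are not Lucene's Field and Document, and the two boolean flags are only a simplified stand-in for the kind of per-field options Lucene offers.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these are NOT Lucene's Field and Document classes.
class ToyField {
    final String name;      // field name, e.g. "filename"
    final String value;     // field value, e.g. "notes.txt"
    final boolean stored;   // keep the original value alongside the index
    final boolean analyzed; // split the value into tokens before indexing

    ToyField(String name, String value, boolean stored, boolean analyzed) {
        this.name = name;
        this.value = value;
        this.stored = stored;
        this.analyzed = analyzed;
    }
}

class ToyDocument {
    // A document is just a container of fields.
    private final List<ToyField> fields = new ArrayList<ToyField>();

    void add(ToyField field) {
        fields.add(field);
    }

    // Returns the value of the first field with the given name, or null.
    String get(String name) {
        for (ToyField f : fields) {
            if (f.name.equals(name)) {
                return f.value;
            }
        }
        return null;
    }

    int size() {
        return fields.size();
    }

    public static void main(String[] args) {
        ToyDocument doc = new ToyDocument();
        doc.add(new ToyField("filename", "notes.txt", true, false));
        doc.add(new ToyField("contents", "some text to search", false, true));
        System.out.println(doc.size() + " fields, filename=" + doc.get("filename"));
    }
}
```

In this sketch the file name is kept verbatim and not tokenized, while the contents are tokenized and not kept, which mirrors the kind of per-field choice the options are there for.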

Code walkthrough:

public Indexer(String indexDir)throws IOException{  
        Directory dir = FSDirectory.open(new File(indexDir));  
        /*
         * Version.LUCENE_30 is the version argument. Lucene uses the
         * value passed in to match the behavior of that release.
         */
        writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30), true,  
                IndexWriter.MaxFieldLength.UNLIMITED);  
    }

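First, FSDirectory.open opens a Directory at the path where the index will be stored. Next, an IndexWriter is created to write to the index, as in the later call:

writer.addDocument(doc)

which adds a document to the index. The IndexWriter constructor also specifies which Analyzer to use. The third argument, true, tells IndexWriter to create a new index, overwriting any index that already exists at that path.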

protected Document getDocument(File f) throws Exception{
		Document doc = new Document();
		/**
		 * "contents" is the field name; new FileReader(f) is the field value
		 * "filename" is the field name; f.getName() is the field value
		 * .......
		 */
		doc.add(new Field("contents", new FileReader(f)));// index the file contents
		doc.add(new Field("filename", f.getName(),// index the file name
				Field.Store.YES, Field.Index.NOT_ANALYZED));
		doc.add(new Field("fullpath", f.getCanonicalPath(),// index the full file path
				Field.Store.YES, Field.Index.NOT_ANALYZED));
		
		return doc;
	}

One Document object is created for each text file.

numIndexed = indexer.index(dataDir, new TextFilesFilter());

...........

// Returns the number of documents indexed
	public int index(String dataDir, FileFilter filter)throws Exception{
		File[] files = new File(dataDir).listFiles();
		
		for(File f:files){
			if(!f.isDirectory() &&
					!f.isHidden()&&
					f.exists()&&
					f.canRead()&&
					(filter == null || filter.accept(f))){
				indexFile(f);
			}
		}
		return writer.numDocs();
	}

// Adds a document to the Lucene index
	private void indexFile(File f) throws Exception{
		System.out.println("Indexing "+f.getCanonicalPath());
		Document doc = getDocument(f);
		writer.addDocument(doc);
	}

Each processed document is added to the index stored in the index folder.
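The TextFilesFilter passed to index() is used above but never shown. A minimal sketch of what it might look like follows, assuming the intent is to index only .txt files; the body is a guess at that intent rather than the author's code, and TextFilesFilterDemo is a hypothetical class added only to exercise it. It relies on nothing beyond java.io.FileFilter from the standard library.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.Locale;

// Hypothetical sketch: the article uses TextFilesFilter but does not show it.
class TextFilesFilter implements FileFilter {
    // Accepts only files whose name ends in ".txt" (case-insensitive).
    @Override
    public boolean accept(File path) {
        return path.getName().toLowerCase(Locale.ROOT).endsWith(".txt");
    }
}

class TextFilesFilterDemo {
    public static void main(String[] args) {
        FileFilter filter = new TextFilesFilter();
        System.out.println(filter.accept(new File("notes.txt"))); // true
        System.out.println(filter.accept(new File("photo.png"))); // false
    }
}
```

Because index() already checks filter == null before calling accept, passing null instead of a filter would index every readable, non-hidden regular file in the directory.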
