Getting Started with Lucene

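Lucene is an open-source full-text search toolkit from the Apache Software Foundation. It provides a complete query engine and indexing engine, plus part of a text-analysis engine. Its goal is to give developers a simple, easy-to-use toolkit for adding full-text search to a target system.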

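How Lucene Implements Full-Text Search

Indexing (shown in green in the original flow diagram) builds an index from the raw content to be searched. The steps are: identify the raw content to search --> collect documents --> create documents --> analyze documents --> index documents

Searching (shown in red in the original flow diagram) retrieves content from the index. The steps are: the user submits a request through the search interface --> create the query --> execute the search against the index --> render the search results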

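I. Creating an Index

1. Obtain the raw documents

Raw documents are the content to be indexed and searched, such as web pages on the Internet or data in a database. Software that collects information from the Internet is usually called a crawler or spider. Lucene does not provide a collection tool, so you need to write your own or use an open-source one.

2. Create document objects

Before indexing, the raw content must be turned into documents (Document). A document consists of fields (Field), which are comparable to file attributes such as file name, file size, file content and file path; the content is stored in the fields. Each document can have multiple fields, different documents can have different fields, and a single document can contain the same field more than once (same field name and field value). Every document has a unique number, the document id, which we cannot change.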

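3. Analyze the documents

Once the raw content has been turned into documents containing fields, the content of the fields must be analyzed. Analysis extracts words from the raw text, converts letters to lowercase, removes punctuation and removes stop words, producing the final tokens; a token can be thought of as a single word. Each word is called a term, and a term has two parts: the name of the field it came from and the text of the word. For example, apache in the file name and apache in the file content are different terms.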
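The analysis steps described above can be sketched in plain Java. This is only an illustration and does not use Lucene's Analyzer API; the class name `AnalysisSketch`, the stop-word list and the `field:text` notation for a term are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analyzer (hypothetical, not Lucene's API): split, lowercase, drop stop words. */
final class AnalysisSketch {

    // Stop-word list chosen for this example only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and"));

    /** Splits on anything that is not a letter or digit, lowercases, removes stop words. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** A term pairs a field name with a token; it is written here as "field:text". */
    static List<String> terms(String field, String text) {
        List<String> result = new ArrayList<>();
        for (String token : analyze(text)) {
            result.add(field + ":" + token);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(terms("fileName", "The Apache Lucene guide."));
        System.out.println(terms("fileContent", "Apache is a foundation."));
    }
}
```

Because the field name is part of the term, the token apache produced from `fileName` and the one produced from `fileContent` end up as two distinct terms.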

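4. Build the index

The tokens produced by analyzing all documents are then indexed. The purpose of the index is search: a search looks only at the indexed tokens and uses them to find the documents (Document).

Because the index is built over tokens and leads from words to documents, this structure is called an inverted index.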
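To make the idea of an inverted index concrete, here is a minimal sketch in plain Java. It is not how Lucene stores its index internally; the class `InvertedIndexSketch` and its methods are hypothetical names, and the analysis is reduced to lowercasing and splitting on non-word characters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index: maps each term ("field:word") to the ids of documents containing it. */
final class InvertedIndexSketch {

    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    /** Indexes one field of a new document and returns the id assigned to it. */
    int addDocument(String field, String text) {
        int docId = nextDocId++;
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(field + ":" + word, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    /** Looks up a single term and returns the matching document ids in ascending order. */
    List<Integer> search(String field, String word) {
        Set<Integer> ids = postings.getOrDefault(field + ":" + word, Collections.emptySet());
        return new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument("fileContent", "Lucene is a search library");
        index.addDocument("fileContent", "Apache Lucene builds an inverted index");
        index.addDocument("fileContent", "Spring is a framework");
        System.out.println(index.search("fileContent", "lucene"));
    }
}
```

A lookup goes straight from the word to the set of document ids, so no document text has to be scanned at search time.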

II. Using Lucene in Java

1. Set up the development environment and import the jar packages

2. Code for creating the index

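	// Imports needed by this method (Lucene 4.10.3, commons-io, IK Analyzer)
	import java.io.File;
	import java.io.IOException;
	import org.apache.commons.io.FileUtils;
	import org.apache.lucene.analysis.Analyzer;
	import org.apache.lucene.document.Document;
	import org.apache.lucene.document.Field;
	import org.apache.lucene.document.Field.Store;
	import org.apache.lucene.document.LongField;
	import org.apache.lucene.document.StoredField;
	import org.apache.lucene.document.TextField;
	import org.apache.lucene.index.IndexWriter;
	import org.apache.lucene.index.IndexWriterConfig;
	import org.apache.lucene.store.Directory;
	import org.apache.lucene.store.FSDirectory;
	import org.apache.lucene.util.Version;
	import org.wltea.analyzer.lucene.IKAnalyzer;

	public void test1() throws IOException {
		// Step 1: create the IndexWriter
		/*
		 * 1) Create the Directory that says where the index is stored
		 * 2) Create the analyzer needed by the IndexWriterConfig
		 * 3) Create the IndexWriterConfig needed by the IndexWriter
		 * 4) Create the IndexWriter
		 */
		Directory directory = FSDirectory.open(new File("D:\\temp\\index"));
		Analyzer analyzer = new IKAnalyzer();
		// IndexWriterConfig takes the Lucene version and the analyzer
		IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_3, analyzer);
		IndexWriter indexWriter = new IndexWriter(directory, config);
		try {
			// Step 2: walk the files in the source folder and create one Document per file
			File f = new File("D:\\searchsource");
			File[] listfiles = f.listFiles();
			if (listfiles == null) {
				throw new IOException("Not a readable directory: " + f);
			}
			for (File file : listfiles) {
				if (!file.isFile()) {
					continue; // skip sub-directories
				}
				Document document = new Document();
				// Step 3: create the Field objects and add them to the Document
				// File name
				String file_name = file.getName();
				Field fileNameField = new TextField("fileName", file_name, Store.YES);
				// File size
				long file_size = FileUtils.sizeOf(file);
				Field fileSizeField = new LongField("fileSize", file_size, Store.YES);
				// File path
				String file_path = file.getPath();
				Field filePathField = new StoredField("filePath", file_path);
				// File content (indexed but not stored)
				String file_content = FileUtils.readFileToString(file, "UTF-8");
				Field fileContentField = new TextField("fileContent", file_content, Store.NO);
				document.add(fileNameField);
				document.add(fileSizeField);
				document.add(filePathField);
				document.add(fileContentField);
				// Step 4: write the Document through the IndexWriter; this builds the index
				// and writes both the index and the Document into the index directory
				indexWriter.addDocument(document);
			}
		} finally {
			// Step 5: close the IndexWriter
			indexWriter.close();
		}
	}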

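Field attributes

Analyzed: whether the content of the field is tokenized. This only matters for fields whose content we want to query.

Indexed: whether the tokens produced by analysis, or the whole field value, are added to the index. Only indexed content can be found by a search.

For example, product names and product descriptions are analyzed and then indexed; order numbers and ID card numbers are not analyzed but are still indexed. All of these will be used as query conditions later.

Stored: whether the field value is stored in the document. Only stored fields can be read back from a Document.

For example, product names and order numbers: any field that will later be read from a Document must be stored.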

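Field class	Data type	Analyzed	Indexed	Stored	Description
StringField(FieldName, FieldValue, Store.YES)	String	N	Y	Y or N	Builds a string field that is not analyzed; the whole string is put into the index as a single term (for example an order number or a name). Store.YES or Store.NO decides whether the value is stored in the document.
LongField(FieldName, FieldValue, Store.YES)	Long	Y	Y	Y or N	Builds a long numeric field that is analyzed and indexed (for example a price). Store.YES or Store.NO decides whether the value is stored in the document.
StoredField(FieldName, FieldValue)	Overloaded, supports several types	N	N	Y	Builds a field of one of several types that is neither analyzed nor indexed, but whose value is stored in the document.
TextField(FieldName, FieldValue, Store.NO) or TextField(FieldName, reader)	String or stream	Y	Y	Y or N	When the value is a Reader, Lucene assumes the content is large and does not store it.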
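The difference between "indexed" and "stored" can be sketched with a toy model in plain Java. This is not Lucene's Field API: the class `FieldFlagsSketch` and its methods are hypothetical, it models a single document, and each value is treated as one unanalyzed term.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of one document's fields with separate "indexed" and "stored" flags. */
final class FieldFlagsSketch {

    private final Set<String> indexedTerms = new HashSet<>();          // what a search can see
    private final Map<String, String> storedValues = new HashMap<>();  // what can be read back

    void addField(String name, String value, boolean indexed, boolean stored) {
        if (indexed) {
            indexedTerms.add(name + ":" + value);
        }
        if (stored) {
            storedValues.put(name, value);
        }
    }

    /** True if the field value was indexed and can therefore be found by a search. */
    boolean matches(String name, String value) {
        return indexedTerms.contains(name + ":" + value);
    }

    /** Returns the stored value, or null if the field was not stored. */
    String get(String name) {
        return storedValues.get(name);
    }

    public static void main(String[] args) {
        FieldFlagsSketch doc = new FieldFlagsSketch();
        doc.addField("orderNo", "A1001", true, true);            // indexed and stored
        doc.addField("filePath", "D:/docs/a.txt", false, true);  // stored only
        doc.addField("fileContent", "lucene", true, false);      // indexed only
        System.out.println(doc.matches("fileContent", "lucene") + " " + doc.get("fileContent"));
        System.out.println(doc.matches("filePath", "D:/docs/a.txt") + " " + doc.get("filePath"));
    }
}
```

In this model a field that is indexed but not stored can be searched while reading it back yields null, and a field that is stored but not indexed can be read back while a search cannot find it.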

The Luke tool can be used to inspect the details of an index.

3. Code for querying the index

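	// Imports needed by this method (Lucene 4.10.3)
	import java.io.File;
	import java.io.IOException;
	import org.apache.lucene.document.Document;
	import org.apache.lucene.index.DirectoryReader;
	import org.apache.lucene.index.IndexReader;
	import org.apache.lucene.index.Term;
	import org.apache.lucene.search.IndexSearcher;
	import org.apache.lucene.search.Query;
	import org.apache.lucene.search.ScoreDoc;
	import org.apache.lucene.search.TermQuery;
	import org.apache.lucene.search.TopDocs;
	import org.apache.lucene.store.Directory;
	import org.apache.lucene.store.FSDirectory;

	public void testIndexReader() throws IOException {
		// Step 1: create a Directory, i.e. the location where the index is stored
		Directory directory = FSDirectory.open(new File("D:\\temp\\index"));

		/* If the index lives in memory instead:
		Directory directory2 = new RAMDirectory(); */

		// Step 2: create an IndexReader on the Directory
		IndexReader indexReader = DirectoryReader.open(directory);
		// Step 3: create an IndexSearcher on the IndexReader
		IndexSearcher searcher = new IndexSearcher(indexReader);
		// Step 4: create a TermQuery with the field to search and the keyword
		Term term = new Term("fileName","全文檢索"); // field name and field value of the term
		Query query = new TermQuery(term);
		// Step 5: run the query and ask for the top 3 hits
		TopDocs topDocs = searcher.search(query,3);
		// Step 6: iterate over the results and print them
		ScoreDoc[] scoreDocs = topDocs.scoreDocs;
		for (ScoreDoc scoreDoc : scoreDocs) {
			int doc = scoreDoc.doc;  // document id
			// Load the document
			Document document = indexReader.document(doc);
			// File name
			String fileName = document.get("fileName");
			// File size
			String fileSize = document.get("fileSize");
			// File path
			String filePath = document.get("filePath");
			// File content: null here, because the field was indexed with Store.NO
			String fileContent = document.get("fileContent");
			System.out.println(fileName+"...."+fileSize+"...."+filePath+"...."+fileContent);
		}
		// Step 7: close the IndexReader
		indexReader.close();
	}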

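IndexSearcher search methods

indexSearcher.search(query, n): searches by Query and returns the n highest-scoring records

indexSearcher.search(query, filter, n): searches by Query with a filter applied and returns the n highest-scoring records

indexSearcher.search(query, n, sort): searches by Query with a sort order applied and returns the first n records in that order

indexSearcher.search(query, filter, n, sort): searches by Query with both a filter and a sort order applied and returns the first n records in that order

Lucene's default analyzer handles English well but Chinese poorly, so a third-party analyzer is needed. IKAnalyzer is used here.

Note: the analyzer used for searching must be the same as the one used to build the index.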
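Why the two sides must agree can be shown with a small plain-Java sketch. It assumes a toy analyzer that only lowercases and splits on non-word characters; `AnalyzerMismatchSketch` and its methods are hypothetical names and nothing here is Lucene API.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Toy demonstration: a keyword only matches if it is processed like the indexed text was. */
final class AnalyzerMismatchSketch {

    /** Index-time analysis of this sketch: lowercase, then split on non-word characters. */
    static Set<String> analyze(String text) {
        Set<String> tokens = new HashSet<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    /** Looks the keyword up exactly as typed, without any analysis. */
    static boolean rawTermMatches(Set<String> indexed, String keyword) {
        return indexed.contains(keyword);
    }

    /** Runs the keyword through the same analysis before looking it up. */
    static boolean analyzedMatches(Set<String> indexed, String keyword) {
        Set<String> queryTokens = analyze(keyword);
        return !queryTokens.isEmpty() && indexed.containsAll(queryTokens);
    }

    public static void main(String[] args) {
        Set<String> indexed = analyze("Apache Lucene");
        System.out.println(rawTermMatches(indexed, "Apache"));
        System.out.println(analyzedMatches(indexed, "Apache"));
    }
}
```

The index holds only the tokens that analysis produced, so a keyword that has not gone through the same processing may simply not exist in it.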

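4. Maintaining the index

1) Delete everything in the index

	// getIndexWriter() is a helper that the original does not show; presumably it opens an
	// IndexWriter on the index directory in the same way as the indexing example.
	public void test3() throws IOException {
		IndexWriter writer = getIndexWriter();
		writer.deleteAll(); // removes every document from the index
		writer.close();
	}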

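2) Delete documents that match a condition

	public void test4() throws IOException {
		IndexWriter writer = getIndexWriter();
		// Delete every document whose fileName field contains the term "apache"
		Query query = new TermQuery(new Term("fileName","apache"));
		writer.deleteDocuments(query);
		writer.close();
	}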

3) Update the index. An update works by deleting first and then adding.

	public void test5() throws IOException {
		IndexWriter writer = getIndexWriter();
		// The replacement document
		Document document = new Document();
		TextField text1 = new TextField("fileN","test1",Store.YES);
		TextField text2 = new TextField("fileC","test2",Store.YES);
		document.add(text1);
		document.add(text2);
		// Delete the documents containing the term fileName:lucene, then add the new document
		writer.updateDocument(new Term("fileName","lucene"),document,new IKAnalyzer());
		writer.close();
	}
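The delete-then-add behaviour of an update can be sketched with a toy document store in plain Java. `UpdateSketch` and its methods are hypothetical and only model the semantics; they are not Lucene's IndexWriter.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Toy store in which an update is "delete everything matching the term, then add". */
final class UpdateSketch {

    private final List<Map<String, String>> docs = new ArrayList<>();

    void add(Map<String, String> doc) {
        docs.add(doc);
    }

    /** Removes every document whose field has exactly this value; returns how many were removed. */
    int deleteByTerm(String field, String value) {
        int before = docs.size();
        docs.removeIf(doc -> value.equals(doc.get(field)));
        return before - docs.size();
    }

    /** Update = delete all matches, then add the replacement (even if nothing matched). */
    void update(String field, String value, Map<String, String> replacement) {
        deleteByTerm(field, value);
        add(replacement);
    }

    boolean contains(String field, String value) {
        for (Map<String, String> doc : docs) {
            if (value.equals(doc.get(field))) {
                return true;
            }
        }
        return false;
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        UpdateSketch store = new UpdateSketch();
        store.add(Collections.singletonMap("fileName", "lucene"));
        store.add(Collections.singletonMap("fileName", "lucene"));
        store.add(Collections.singletonMap("fileName", "spring"));
        store.update("fileName", "lucene", Collections.singletonMap("fileN", "test1"));
        System.out.println(store.size());
    }
}
```

Two consequences follow from this model: if several documents match the term they are all replaced by the single new document, and the new document does not need to have the same fields as the ones it replaces.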

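5. Querying the index

1) Querying with subclasses of Query

The examples in this section call getIndexSearcher() and printResult(searcher, query), two helpers that the original does not show; presumably the first opens an IndexSearcher on the index directory and the second prints the stored fields of each hit, as in the search code above.

** MatchAllDocsQuery matches all documents in the index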

	public void testMatchAllDocsQuery() throws Exception {
		IndexSearcher searcher = getIndexSearcher();
		Query query = new MatchAllDocsQuery();
		TopDocs topDocs = searcher.search(query, 10);
		ScoreDoc[] scoreDocs = topDocs.scoreDocs;
		for (ScoreDoc scoreDoc : scoreDocs) {
			int doc = scoreDoc.doc;
			Document document = searcher.getIndexReader().document(doc);
			String fileName = document.get("fileName");
			System.out.println(fileName);
			String filePath = document.get("filePath");
			System.out.println(filePath);
			String fileSize = document.get("fileSize");
			System.out.println(fileSize);
			String fileContent = document.get("fileContent");
			System.out.println(fileContent);
			System.out.println("--------------------");
		}
		searcher.getIndexReader().close();
	}

** TermQuery takes the field to search and the keyword to look for.

		Term term = new Term("fileName","全文檢索"); // field name and field value of the term
		Query query = new TermQuery(term);

** NumericRangeQuery searches by numeric range

	public void test6() throws Exception {
		IndexSearcher searcher = getIndexSearcher();
		Query query = NumericRangeQuery.newLongRange("fileSize", 47L, 200L, false, true);
		printResult(searcher, query);
	}

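This example searches by file size. When creating the query, the first argument is the field name, the second is the lower bound, the third is the upper bound, the fourth says whether the lower bound itself is included, and the fifth says whether the upper bound itself is included.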
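The effect of the two inclusive flags can be checked with a small plain-Java predicate. This is a sketch of the bound semantics only; `RangeBoundsSketch.inRange` is a hypothetical helper, not part of Lucene.

```java
/** Toy predicate mirroring the (min, max, minInclusive, maxInclusive) arguments. */
final class RangeBoundsSketch {

    static boolean inRange(long value, long min, long max,
                           boolean minInclusive, boolean maxInclusive) {
        boolean aboveMin = minInclusive ? value >= min : value > min;
        boolean belowMax = maxInclusive ? value <= max : value < max;
        return aboveMin && belowMax;
    }

    public static void main(String[] args) {
        // Same bounds and flags as the example above: 47 excluded, 200 included.
        long[] sizes = {47L, 48L, 200L, 201L};
        for (long size : sizes) {
            System.out.println(size + " -> " + inRange(size, 47L, 200L, false, true));
        }
    }
}
```

With the flags used in the example, a file of exactly 47 bytes is left out while a file of exactly 200 bytes is matched.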

** BooleanQuery combines several conditions

	public void testBooleanQuery() throws Exception {
		IndexSearcher searcher = getIndexSearcher();
		BooleanQuery query = new BooleanQuery();
		Query query2 = new TermQuery(new Term("fileName","apache"));
		Query query3 = new TermQuery(new Term("fileName","lucene"));
		query.add(query2,Occur.SHOULD);
		query.add(query3,Occur.MUST);
		printResult(searcher, query);
	}

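Occur.MUST: the condition must be satisfied, equivalent to AND

Occur.SHOULD: the condition should be satisfied but does not have to be, equivalent to OR. When the query also contains a MUST clause, as in the example above, a SHOULD clause is optional and only influences the ranking of the hits.

Occur.MUST_NOT: the condition must not be satisfied, equivalent to NOT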
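How the three kinds of clause combine can be sketched as set operations over document ids. This is a simplified plain-Java model that ignores scoring; `BooleanClauseSketch` and its methods are hypothetical names, and each clause is represented by the set of document ids it matches.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of combining MUST, SHOULD and MUST_NOT clauses over document-id sets. */
final class BooleanClauseSketch {

    static Set<Integer> combine(List<Set<Integer>> must,
                                List<Set<Integer>> should,
                                List<Set<Integer>> mustNot) {
        Set<Integer> result = new TreeSet<>();
        if (!must.isEmpty()) {
            // Every MUST clause has to match: intersection.
            // SHOULD clauses are then optional and would only affect ranking (not modelled).
            result.addAll(must.get(0));
            for (Set<Integer> ids : must) {
                result.retainAll(ids);
            }
        } else {
            // Without MUST clauses, at least one SHOULD clause has to match: union.
            for (Set<Integer> ids : should) {
                result.addAll(ids);
            }
        }
        // MUST_NOT clauses remove documents from whatever matched so far.
        for (Set<Integer> ids : mustNot) {
            result.removeAll(ids);
        }
        return result;
    }

    static Set<Integer> ids(Integer... values) {
        return new TreeSet<>(Arrays.asList(values));
    }

    public static void main(String[] args) {
        Set<Integer> lucene = ids(1, 2, 3); // documents whose fileName contains "lucene"
        Set<Integer> apache = ids(2, 9);    // documents whose fileName contains "apache"
        // Same shape as the example above: apache is SHOULD, lucene is MUST.
        System.out.println(combine(
                Collections.singletonList(lucene),
                Collections.singletonList(apache),
                Collections.<Set<Integer>>emptyList()));
    }
}
```

Note that in this model a query made only of MUST_NOT clauses matches nothing, because there is no positive clause to select documents from.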

2) Querying with QueryParser

Syntax: fieldName:value

The queryparser jar must be added to the project.

	public void testQueryParser() throws Exception {
		// Argument 1: the default field to search
		// Argument 2: the analyzer to use
		QueryParser queryParser = new QueryParser("fileName", new IKAnalyzer());
		// "*:*" matches every document; the parser turns it into a MatchAllDocsQuery
		Query query = queryParser.parse("*:*");
		IndexSearcher searcher = getIndexSearcher();
		printResult(searcher, query);
		// Release resources
		searcher.getIndexReader().close();
	}

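A Query can also be created through QueryParser. QueryParser provides a parse method that builds the query directly from a string written in the query syntax. To see the syntax that a Query object will execute, print it with System.out.println(query);.

An analyzer is required. The analyzer used when building the index and the one used when querying it should be the same.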

Query syntax

1. Basic syntax, keyword query:

field name + ":" + search keyword

For example: content:java

2. Range query

field name + ":" + [min TO max]

For example: size:[1 TO 1000]

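Note that the classic QueryParser turns this syntax into a term range query that compares values as strings, so it does not behave as expected on numeric fields such as a LongField; for those, build a NumericRangeQuery in code as shown earlier. Solr can run numeric range queries from this syntax because its schema records the type of each field.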
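One reason ranges over text terms and ranges over numbers behave differently is that text compares lexicographically. The following plain-Java sketch illustrates just that ordering difference; `RangeOrderSketch` and its methods are hypothetical helpers and do not reproduce how Lucene encodes numeric fields.

```java
/** Toy comparison of a range check on strings versus a range check on numbers. */
final class RangeOrderSketch {

    /** Inclusive range check using lexicographic string order. */
    static boolean inTextRange(String value, String min, String max) {
        return value.compareTo(min) >= 0 && value.compareTo(max) <= 0;
    }

    /** Inclusive range check using numeric order. */
    static boolean inNumericRange(long value, long min, long max) {
        return value >= min && value <= max;
    }

    public static void main(String[] args) {
        // The value 50 against the range [1 TO 1000], once as text and once as a number.
        System.out.println(inTextRange("50", "1", "1000"));
        System.out.println(inNumericRange(50L, 1L, 1000L));
    }
}
```

As text, "50" sorts after "1000" because the comparison stops at the first differing character, so it falls outside the range even though the number 50 lies inside it.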

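3. Combined conditions

1) +condition1 +condition2: both conditions must hold (AND)

For example: +filename:apache +content:apache

2) +condition1 condition2: the first condition must be satisfied, the second should be satisfied

For example: +filename:apache content:apache

3) condition1 condition2: satisfying either condition is enough

For example: filename:apache content:apache

4) -condition1 condition2: condition 1 must not be satisfied, condition 2 must be

For example: -filename:apache content:apache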

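Occur.MUST: the condition must be satisfied, equivalent to AND. Symbol: + (plus sign)

Occur.SHOULD: the condition is optional, equivalent to OR. Symbol: none (no prefix)

Occur.MUST_NOT: the condition must not be satisfied, equivalent to NOT. Symbol: - (minus sign)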

Alternative notation:

condition1 AND condition2

condition1 OR condition2

condition1 NOT condition2
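The mapping from the + and - prefixes to the three Occur values can be sketched with a toy parser in plain Java. It is far simpler than the real QueryParser: it only splits on whitespace and looks at the first character of each clause, and it does not handle AND/OR/NOT, phrases or parentheses. `ClausePrefixSketch` and its own `Occur` enum are hypothetical names for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy parser: maps each whitespace-separated clause to MUST, SHOULD or MUST_NOT. */
final class ClausePrefixSketch {

    enum Occur { MUST, SHOULD, MUST_NOT }

    static Map<String, Occur> parse(String query) {
        Map<String, Occur> clauses = new LinkedHashMap<>();
        for (String part : query.trim().split("\\s+")) {
            if (part.isEmpty()) {
                continue;
            }
            if (part.startsWith("+")) {
                clauses.put(part.substring(1), Occur.MUST);      // "+" prefix
            } else if (part.startsWith("-")) {
                clauses.put(part.substring(1), Occur.MUST_NOT);  // "-" prefix
            } else {
                clauses.put(part, Occur.SHOULD);                 // no prefix
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(parse("+filename:apache content:apache"));
        System.out.println(parse("-filename:apache content:lucene"));
    }
}
```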

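Querying with MultiFieldQueryParser

Compared with QueryParser, MultiFieldQueryParser takes an array of default fields, so there can be more than one default field. (This is of limited use, because the same result can be achieved with a QueryParser query that names each field explicitly.)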

	public void testMultiFieldQueryParser() throws Exception {
		// Argument 1: the default fields to search (an array)
		// Argument 2: the analyzer to use
		String[] fields = {"fileName","fileContent"};
		MultiFieldQueryParser parser = new MultiFieldQueryParser(fields, new IKAnalyzer());
		// "java" names no field, so it is searched in every default field
		Query query = parser.parse("java");
		IndexSearcher searcher = getIndexSearcher();
		printResult(searcher, query);
		// Release resources
		searcher.getIndexReader().close();
	}
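Conceptually, a keyword without a field name is tried against each default field. The plain-Java sketch below writes out that expansion as a query string; it is an illustration of the idea rather than the parser's actual implementation, and `MultiFieldSketch.expand` is a hypothetical helper.

```java
/** Toy expansion of a bare keyword across several default fields. */
final class MultiFieldSketch {

    /** Builds "field1:keyword field2:keyword ...", i.e. one optional clause per field. */
    static String expand(String[] fields, String keyword) {
        StringBuilder sb = new StringBuilder();
        for (String field : fields) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(field).append(':').append(keyword);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] fields = {"fileName", "fileContent"};
        System.out.println(expand(fields, "java"));
    }
}
```

The resulting string uses the same syntax as the combined-condition queries above, which is why an ordinary QueryParser query that names each field can express the same search.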
