2018_02_27 Full-Text Search Technology: Lucene

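Data saved in a database and data saved on disk both count as data. Commonly used data falls into two kinds. Structured data has a fixed format and a bounded length, such as the rows stored in a database. Unstructured data has neither a fixed format nor a fixed length, such as document files stored on disk.

Structured data can be queried with SQL, which is convenient.

Unstructured data is queried with the technique described here: full-text search, implemented with Lucene, which turns unstructured data into structured data. When the data is small, for example a single short document, it is reasonable to read it through an IO stream and match it with ordinary string operations. When the documents are large, their content has to be split into terms and an index built from the resulting term list. This process of first creating an index and then querying the index is full-text search.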
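The difference between scanning documents and querying an index can be sketched in plain Java. This is only an illustration of the idea, not how Lucene stores its index; the class name, the method names, and the sample sentences are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Plain-Java illustration only; this is not Lucene's index format.
class InvertedIndexSketch {

    // Sequential scan: every query re-reads every document.
    static List<Integer> scan(List<String> docs, String word) {
        List<Integer> hits = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String token : docs.get(id).toLowerCase().split("\\s+")) {
                if (token.equals(word)) {
                    hits.add(id);
                    break;
                }
            }
        }
        return hits;
    }

    // Build once: map each word to the sorted ids of the documents that contain it.
    static Map<String, TreeSet<Integer>> buildIndex(List<String> docs) {
        Map<String, TreeSet<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String token : docs.get(id).toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    index.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
                }
            }
        }
        return index;
    }

    // Query many times: a map lookup, no document is re-read.
    static List<Integer> lookup(Map<String, TreeSet<Integer>> index, String word) {
        return new ArrayList<>(index.getOrDefault(word, new TreeSet<>()));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Lucene is a search library",
                "A database stores structured data",
                "Full text search needs an index");
        Map<String, TreeSet<Integer>> index = buildIndex(docs);
        System.out.println(scan(docs, "search"));     // [0, 2]
        System.out.println(lookup(index, "search"));  // [0, 2]
        System.out.println(lookup(index, "missing")); // []
    }
}
```

Both approaches return the same document ids; the index pays the cost of reading the documents once, up front.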

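Where full-text search applies: data that is large in volume and has no fixed structure, for example in search engines such as Baidu and Google.

(Figure: indexing and search flow.)

Lucene's advantage is that an index is created once and then queried many times.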

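Steps for implementing full-text search with Lucene:

1. Obtain the original documents.

2. Create document objects. Create one Document object for each original document. A document contains a number of fields (Field). Different documents may contain different fields, and a single document may contain several fields with the same name. Each document has a unique id.

3. Analyze the documents, that is, analyze the content of the relevant fields:

(1) split the text on whitespace; (2) remove punctuation; (3) remove stop words (words that carry no meaning); (4) convert all words to lowercase; (5) the result is a list of keywords.

4. Create the index:

(1) build the index from the keyword list produced by analysis; (2) save the mapping between each word and the documents that contain it.

The index library has two parts, the index and the document objects, and it maintains the relationship between them. (Figure: structure of the index library.)

5. Query the index:

(1) Create a query object. Build a Query object from the search criteria; Lucene searches the index library with it.

(2) Execute the query. The query returns a list of Document ids, and the content of each document is then fetched by its id.

(3) Render the results, for example by highlighting matches and paging through the hits.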
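The analysis sub-steps of step 3 and the rendering in step 5 (3) can be sketched in plain Java. This is a simplified illustration under stated assumptions, not Lucene's analyzer or highlighter: the stop-word list, the `<em>` markup, and all names are chosen for the example, and lowercasing runs before stop-word removal so that capitalized stop words such as "The" are caught.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Plain-Java illustration of analysis and result rendering; not Lucene API.
class AnalysisSketch {

    // Assumed stop-word list, chosen only for this example.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of"));

    // Split on whitespace, strip ASCII punctuation, lowercase, drop stop words.
    static List<String> analyze(String text) {
        List<String> keywords = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            String token = raw.replaceAll("\\p{Punct}", "").toLowerCase();
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                keywords.add(token);
            }
        }
        return keywords;
    }

    // Highlighting: wrap each whole-word, case-insensitive match of the keyword.
    static String highlight(String text, String keyword) {
        return text.replaceAll("(?i)\\b(" + Pattern.quote(keyword) + ")\\b", "<em>$1</em>");
    }

    // Paging: return the requested 1-based page of a hit list (empty if out of range).
    static <T> List<T> page(List<T> hits, int pageNo, int pageSize) {
        int from = Math.min(Math.max(0, (pageNo - 1) * pageSize), hits.size());
        int to = Math.min(from + pageSize, hits.size());
        return new ArrayList<>(hits.subList(from, to));
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Lucene library is a search engine, written in Java!"));
        // [lucene, library, search, engine, written, in, java]
        System.out.println(highlight("Lucene is a search library", "search"));
        // Lucene is a <em>search</em> library
        System.out.println(page(Arrays.asList(3, 7, 9, 12, 15), 2, 2)); // [9, 12]
    }
}
```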

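Getting Started

1. Steps to create the index library:

Step 1: Create a Java project.

Step 2: Import the jars: lucene-core-4.10.3.jar, lucene-analyzers-common-4.10.3.jar, commons-io-2.4.jar

Step 3: Create a Directory object that specifies where the index library is stored. It can live in memory or on disk; it is normally kept on disk.

Step 4: Create an IndexWriter object. It needs two arguments: a Directory object and an IndexWriterConfig that carries the analyzer object.

Step 5: Loop over the documents in the XXX/XX directory on disk.

Step 6: Create a Document object for each file.

Step 7: Add fields to the Document object.

A field has three attributes: whether it is analyzed, whether it is indexed, and whether it is stored.

Analyzed: whether the field's content is split into terms.

Indexed: whether the field can be searched on.

Stored: whether the field's content needs to be shown to the user.

Common Field types (the sample program below uses TextField, StoredField, and LongField):

Step 8: Write the document object to the index library using the IndexWriter.

Step 9: Close the IndexWriter.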
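How the three field attributes from step 7 affect searching and retrieval can be shown with a toy model in plain Java. The classes and methods here are hypothetical and are not Lucene API; the field names mirror the sample program, and the values are invented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the analyzed / indexed / stored attributes; not Lucene API.
class FieldFlagsSketch {

    static final class ToyField {
        final String name;
        final String value;
        final boolean analyzed;
        final boolean indexed;
        final boolean stored;

        ToyField(String name, String value, boolean analyzed, boolean indexed, boolean stored) {
            this.name = name;
            this.value = value;
            this.analyzed = analyzed;
            this.indexed = indexed;
            this.stored = stored;
        }
    }

    // "field:term" -> ids of the documents that contain the term in that field
    final Map<String, List<Integer>> index = new HashMap<>();
    // document id -> values of the stored fields
    final List<Map<String, String>> storedFields = new ArrayList<>();

    int addDocument(ToyField... fields) {
        int id = storedFields.size();
        Map<String, String> stored = new HashMap<>();
        for (ToyField f : fields) {
            if (f.indexed) {
                // Analyzed: one term per word. Not analyzed: the whole value is one term.
                String[] terms = f.analyzed
                        ? f.value.toLowerCase().split("\\s+")
                        : new String[] { f.value };
                for (String term : terms) {
                    List<Integer> ids = index.computeIfAbsent(f.name + ":" + term, k -> new ArrayList<>());
                    if (!ids.contains(id)) {
                        ids.add(id);
                    }
                }
            }
            if (f.stored) {
                stored.put(f.name, f.value);
            }
        }
        storedFields.add(stored);
        return id;
    }

    List<Integer> search(String field, String term) {
        return index.getOrDefault(field + ":" + term, new ArrayList<>());
    }

    // Returns null when the field was not stored.
    String get(int docId, String field) {
        return storedFields.get(docId).get(field);
    }

    public static void main(String[] args) {
        FieldFlagsSketch lib = new FieldFlagsSketch();
        int id = lib.addDocument(
                new ToyField("title", "Lucene Notes.txt", true, true, true),
                new ToyField("content", "lucene builds an index", true, true, false),
                new ToyField("path", "/docs/Lucene Notes.txt", false, false, true));
        System.out.println(lib.search("content", "index"));               // [0]
        System.out.println(lib.get(id, "content"));                       // null
        System.out.println(lib.search("path", "/docs/Lucene Notes.txt")); // []
        System.out.println(lib.get(id, "path"));                          // /docs/Lucene Notes.txt
    }
}
```

The content field is searchable but cannot be read back, while the path field can be read back but not searched.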

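2. Steps to query the index library:

Step 1: Create a Directory object that points to the directory where the index library is stored.

Step 2: Create an IndexReader object to open the index library.

Step 3: Create an IndexSearcher object; its constructor takes the IndexReader.

Step 4: Create a Query object with two arguments: the field to search and the keyword to search for.

Step 5: Execute the query with the IndexSearcher.

Step 6: Get back a list of document ids.

Step 7: Fetch each document object by its id.

Step 8: Read the field contents from the document object and show them to the user.

Step 9: Close the IndexReader.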
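Steps 5 to 7 return only a limited number of hits while still reporting how many documents matched in total. A plain-Java sketch of that shape follows. It is not Lucene's search implementation: it scans the documents instead of using an index, and ranking by raw term frequency is an assumption made for the example. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of "total hits plus top-n id list"; not Lucene API.
class TopHitsSketch {

    static final class Hit {
        final int doc;   // document id
        final int score; // assumed score: number of occurrences of the term

        Hit(int doc, int score) {
            this.doc = doc;
            this.score = score;
        }
    }

    static final class Result {
        final int totalHits;  // all matching documents
        final List<Hit> hits; // at most n of them, best first

        Result(int totalHits, List<Hit> hits) {
            this.totalHits = totalHits;
            this.hits = hits;
        }
    }

    static Result search(List<String> docs, String term, int n) {
        List<Hit> all = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            int count = 0;
            for (String token : docs.get(id).toLowerCase().split("\\s+")) {
                if (token.equals(term)) {
                    count++;
                }
            }
            if (count > 0) {
                all.add(new Hit(id, count));
            }
        }
        // Highest score first; ties broken by the smaller document id.
        all.sort((a, b) -> a.score != b.score
                ? Integer.compare(b.score, a.score)
                : Integer.compare(a.doc, b.doc));
        return new Result(all.size(), new ArrayList<>(all.subList(0, Math.min(n, all.size()))));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "index index index", "search index", "nothing here", "index the index");
        Result result = search(docs, "index", 2);
        System.out.println("Total hits: " + result.totalHits); // Total hits: 3
        for (Hit hit : result.hits) {
            // Fetch the document by id, as in step 7.
            System.out.println(hit.doc + " -> " + docs.get(hit.doc));
        }
    }
}
```

With a limit of 2, three documents match but only ids 0 and 3 are returned.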

package lucene_demo;

import java.io.File;
import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class FirstIndexDemo {
	public static void main(String[] args) {
		try {
			// Run testCreateIndex() once to build the index, then comment it out again.
			/*testCreateIndex();
			System.out.println("Index created.");*/
			
			testSearchIndex();
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
	
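	/**
	 * Builds the index library from the files in a source directory.
	 * @throws Exception if the index or a source file cannot be read or written
	 */
	public static void testCreateIndex() throws Exception{
		// Create a Directory object that specifies where the index library is stored.
		// It can live in memory or on disk; it is normally kept on disk.
		// In memory:
		//Directory directory = new RAMDirectory();
		// On disk:
		Directory directory  = FSDirectory.open(new File("E:\\lucene"));
		// Create the IndexWriter from the Directory and a config that carries the analyzer.
		Analyzer analyzer = new StandardAnalyzer();
		// First argument: the Lucene version
		// Second argument: the analyzer object
		IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
		IndexWriter indexWriter = new IndexWriter(directory,config);
		// Loop over the files in the source directory.
		File dir = new File("E:\\pachong\\text");
		File[] files = dir.listFiles();
		if (files == null) {
			// listFiles() returns null when dir is missing or is not a directory.
			indexWriter.close();
			throw new IllegalStateException("Not a readable directory: " + dir);
		}
		for(File f : files){
			// Subdirectories cannot be read as text, so skip them.
			if (!f.isFile()) {
				continue;
			}
			// File name
			String fileName = f.getName();
			// File path
			String filePath = f.getAbsolutePath();
			// File content (read with the platform default charset)
			String fileContent = FileUtils.readFileToString(f);
			// File size in bytes
			long fileSize = FileUtils.sizeOf(f);
			// Create a Document object.
			Document document = new Document();
			// First argument: field name
			// Second argument: field value
			// Third argument (TextField, LongField): whether the value is stored
			Field fileNameField = new TextField("title",fileName,Store.YES);
			Field fileContentField = new TextField("content",fileContent,Store.NO);
			Field filePathField = new StoredField("path",filePath);
			Field fileSizeField = new LongField("size",fileSize,Store.YES);
			// Add the fields to the Document object.
			document.add(fileNameField);
			document.add(fileContentField);
			document.add(filePathField);
			document.add(fileSizeField);
			// Write the document to the index library.
			indexWriter.addDocument(document);
		}
		// Close the IndexWriter.
		indexWriter.close();
	}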
	
	/**
	 * Queries the index library.
	 * @throws Exception if the index cannot be opened or read
	 */
	public static void testSearchIndex() throws Exception{
		// Create a Directory object that points to the index library.
		Directory directory = FSDirectory.open(new File("E:\\lucene"));
		// Create an IndexReader object to open the index library.
		IndexReader indexReader = DirectoryReader.open(directory);
		// Create an IndexSearcher object; its constructor takes the IndexReader.
		IndexSearcher indexSearcher = new IndexSearcher(indexReader);
		// Create a Query object from the field to search and the keyword.
		Query query = new TermQuery(new Term("content","個"));
		// Execute the query with the IndexSearcher.
		// First argument: the query object; second argument: the maximum number of hits to return
		TopDocs topDocs = indexSearcher.search(query, 10);
		// Total number of matching documents
		System.out.println("Total hits: "+topDocs.totalHits);
		// Get the list of document ids.
		ScoreDoc[] scoreDocs = topDocs.scoreDocs;
		// Fetch each document object by its id.
		for (ScoreDoc scoreDoc : scoreDocs) {
			// The document's id
			int id = scoreDoc.doc;
			// Fetch the Document object by id.
			Document document = indexSearcher.doc(id);
			// Read the field contents from the document and show them to the user.
			System.out.println(document.get("title"));
			// Prints null: "content" was indexed with Store.NO, so its value is not stored.
			System.out.println(document.get("content"));
			System.out.println(document.get("size"));
			System.out.println(document.get("path"));
			System.out.println("----------------------------");
		}
		// Close the IndexReader.
		indexReader.close();
	}
}
