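Preface

Our team is currently researching topic-based search engines. To have a baseline for comparison, I took a quick look at the Lucene framework. In this post I share the whole process, from setting it up to getting its search features working.

Problems encountered while learning

First, I could not find suitable comparison text. Given a passage, how do you tell which domain it belongs to? I considered using JSON and a map to tell them apart, and none of the many tutorials I found at the time showed what their sample texts looked like.
Second, the query types: how Lucene's fuzzy, exact, match, range, and multi-condition queries are actually used in practice. I finally understood this from the article at https://www.cnblogs.com/fan-yuan/p/9228822.html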

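Step 1: Import the required jar packages

lucene-core-7.1: Lucene's core code
lucene-analyzers-common-7.1: Lucene's analyzers (tokenization tools)
lucene-queryparser-7.1: QueryParser, which parses the user's input into a query
lucene-highlighter-7.1: highlights the matches in Lucene's search results
You can also add whatever else you need. For example, I need to parse the text as JSON, so I added Alibaba's fastjson library: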

        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.61</version>
        </dependency>

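Step 2: Create the index

Before creating the index, prepare the texts you want to search. I prepared about 10,000 texts for indexing: 1,000 related to the topic and a little over 9,000 unrelated to it. I then run queries, compute the precision and recall of the results, and compare them with our topic-based search.

Each text is a JSON string with an id and a content field. The id is used to tell which group the text belongs to, and content holds the text itself.

Next, create a class responsible for building the index: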
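To make the file format concrete, here is a minimal sketch of what one such JSON text could look like. Judging from the id checks in the Searcher listing later in this post, related texts appear to carry numeric ids while unrelated ones carry an id containing "other"; the exact ids, the sample sentences, and the class name `SampleDocs` below are assumptions for illustration, not the real data.

```java
public class SampleDocs {

    // Builds one JSON text with the two fields described above.
    // Assumption: content contains no quotes or backslashes, so no escaping is done here.
    static String toJson(String id, String content) {
        return "{\"id\":\"" + id + "\",\"content\":\"" + content + "\"}";
    }

    // Mirrors the id check used later in the Searcher listing:
    // splitting on "other" yields more than one part for ids such as "other12".
    static boolean isUnrelated(String id) {
        return id.split("other").length > 1;
    }

    public static void main(String[] args) {
        // Hypothetical sample texts, one related and one unrelated
        System.out.println(toJson("57", "fever and cough for three days"));
        System.out.println(toJson("other12", "weather forecast for tomorrow"));
        System.out.println(isUnrelated("57"));      // false
        System.out.println(isUnrelated("other12")); // true
    }
}
```

Each JSON string would be saved as its own file in the data directory, so that one file becomes one indexed document.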

import org.apache.commons.io.FileUtils; // from Apache Commons IO (commons-io), which must also be on the classpath
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;


public class Indexer {

    private IndexWriter writer;

    // Instantiate the IndexWriter
    public Indexer(String indexDir) throws Exception{
        Directory dir= FSDirectory.open(Paths.get(indexDir));
        Analyzer analyzer=new StandardAnalyzer();
        IndexWriterConfig iwc=new IndexWriterConfig(analyzer);
        writer=new IndexWriter(dir,iwc);
    }

    // Close the IndexWriter
    public void close() throws Exception{
        writer.close();
    }

    // Index every file in the data directory
    public int index(String dataDir) throws Exception{
        File[] files=new File(dataDir).listFiles();
        if(files==null){
            // listFiles() returns null when dataDir is not a readable directory
            throw new IllegalArgumentException("Not a readable directory: "+dataDir);
        }
        for(File f:files){
            // Skip subdirectories; only regular files are indexed
            if(f.isFile()){
                indexFile(f);
            }
        }

        return writer.numDocs();
    }

    // Index a single file
    private void indexFile(File f) throws Exception{
        System.out.println("Indexing file: "+f.getCanonicalPath());
        Document doc=getDocument(f);
        writer.addDocument(doc);
    }

    // Build the Lucene Document for a file
    private Document getDocument(File f) throws Exception{
        Document doc=new Document();
        // Store the file content as well, so it can be read back from search results
        doc.add(new TextField("contents",FileUtils.readFileToString(f, StandardCharsets.UTF_8), Field.Store.YES));
        doc.add(new TextField("fileName",f.getName(), Field.Store.YES));
        doc.add(new TextField("fullPath",f.getCanonicalPath(),Field.Store.YES));
        return doc;
    }

    public static void main(String[] args) {
        String indexDir="D:\\jc\\Myself\\app\\demo\\kk2";// directory where the index is saved
        String dataDir="D:\\jc\\Myself\\app\\demo\\File";// directory that holds the files to index
        Indexer indexer=null;
        int numIndexed=0;
        long start =System.currentTimeMillis();
        try{
            indexer=new Indexer(indexDir);
            numIndexed=indexer.index(dataDir);
        } catch (Exception e) {
            e.printStackTrace();
        }finally {
            try {
                // indexer is still null if the constructor failed
                if(indexer!=null){
                    indexer.close();
                }
            }catch (Exception e){
                e.printStackTrace();
            }
        }
        long end=System.currentTimeMillis();
        System.out.println("Indexed "+numIndexed+" files in "+(end-start)+" ms");
    }
}

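There is not much to explain here; the main function shows the flow. It creates the index and lets the analyzer from lucene-analyzers tokenize the text. The process is like the work of the editors of the Xinhua Dictionary: they sort every character in order (A, B, C, D, and so on) so that readers can easily look up the character they need. That is all you need to know about the process. To use the code above, just change the two paths. After it runs, the index files are saved under the kk2 directory.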
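The dictionary analogy corresponds to what is usually called an inverted index: a sorted map from each term to the documents that contain it. The sketch below illustrates the idea with plain Java collections. It is not how Lucene implements its index, the class name `ToyInvertedIndex` is made up for this example, and the tokenizer is a naive split on non-letter characters rather than StandardAnalyzer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: an illustration of the idea only, not Lucene's implementation.
public class ToyInvertedIndex {

    // term -> ids of the documents containing it; TreeMap keeps terms sorted like a dictionary
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Naive tokenizer (assumption): lowercase, then split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Record every term of the text under the given document id.
    public void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Look a term up directly instead of scanning every document.
    public Set<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, "Lucene builds an index");
        index.add(2, "An index makes search fast");
        System.out.println(index.lookup("index"));  // [1, 2]
        System.out.println(index.lookup("lucene")); // [1]
    }
}
```

Because the terms are sorted and each one points straight at its documents, a lookup never has to read the documents themselves, which is the point of the dictionary comparison.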

Step 3: Create the search class, which queries the index just built and finds the files you need

Here is the code first:

package cn.cigit.contextquery.servlet;

import cn.cigit.contextquery.beans.lucenes;
import com.alibaba.fastjson.JSONObject;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.StringReader;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;


public class Searcher {


    public  Map<String,Object> search(String indexDir,String q) throws Exception{
        Map<String,Object> NumOF=new HashMap<>();
        List<lucenes> lucenes=new ArrayList<>();
        // Step 1: create a Directory object, i.e. the location where the index is stored.
        Directory dir= FSDirectory.open(Paths.get(indexDir));
        // Step 2: create an IndexReader object on that Directory.
        IndexReader reader= DirectoryReader.open(dir);
        // Step 3: create an IndexSearcher object on that IndexReader.
        IndexSearcher is=new IndexSearcher(reader);
        // Step 4: create the standard analyzer and parse the query string.
        Analyzer analyzer=new StandardAnalyzer();
        // QueryParser is the all-purpose query (the query types mentioned earlier can all be expressed through it)
        QueryParser parser=new QueryParser("contents",analyzer);
        Query query=parser.parse(q);
        long start =System.currentTimeMillis();
        // Step 5: run the query, keeping the top 10 hits.
        TopDocs hits=is.search(query,10);
        long end =System.currentTimeMillis();
        System.out.println("Query: "+q+", total time: "+(end-start)+" ms, found "+hits.totalHits+" records");
        // Number of documents that matched
        long docSize=hits.totalHits;
        // Number of hits related to the topic
        int likeDoc = 0;
        // Number of hits unrelated to the topic
        int notLikeDoc=0;
        // Step 6: return the results. Iterate over the hits and print them.
        for(ScoreDoc scoreDoc:hits.scoreDocs){
            lucenes kk=new lucenes();
            int docID=scoreDoc.doc;
            Document doc=is.doc(scoreDoc.doc);
            System.out.println("============="+doc.get("fileName")+"=============");
            System.out.println("File path: "+doc.get("fullPath"));
            kk.setPath(doc.get("fullPath"));
            JSONObject jsonObject=JSONObject.parseObject(doc.get("contents"));
            kk.setContent((String) jsonObject.get("content"));
            System.out.println("File content: "+jsonObject.get("content"));
            Explanation explanation=is.explain(query,docID);
            System.out.println("Score: "+explanation.getValue());
            kk.setValue(String.valueOf(explanation.getValue()));
            String text=doc.get("contents");
            // Check for null before using text.length() below
            if(text!=null){
                SimpleHTMLFormatter simpleHTMLFormatter=new SimpleHTMLFormatter("<font color='red'>", "</font>");
                Highlighter highlighter=new Highlighter(simpleHTMLFormatter,new QueryScorer(query));
                // One fragment spanning the whole text, so the highlighted result is still complete JSON
                highlighter.setTextFragmenter(new SimpleFragmenter(text.length()));
                TokenStream tokenStream=analyzer.tokenStream("contents",new StringReader(text));
                String high=highlighter.getBestFragment(tokenStream,text);
                // getBestFragment returns null when no query term is found in the text
                if(high!=null){
                    JSONObject jsonObject2=JSONObject.parseObject(high);
                    System.out.println("Highlighted: "+jsonObject2.get("content"));
                    kk.setHighContent((String) jsonObject2.get("content"));
                }
            }
            String[] ids=jsonObject.getString("id").split("other");
            if(ids.length>1) {
                // A document unrelated to the 1000 disease documents
                notLikeDoc++;
            }else{
                // Ids in this range are counted as related to the disease being queried
                if(jsonObject.getInteger("id")>=1&&jsonObject.getInteger("id")<=100){
                    likeDoc++;
                }
            }
            lucenes.add(kk);
        }
        // Build the statistics to return
        Map<String,String> num=new HashMap<>();
        num.put("likeDoc",String.valueOf(likeDoc));
        num.put("docSize",String.valueOf(docSize));
        num.put("notLikeDoc",String.valueOf(notLikeDoc));
        // Recall: related hits divided by 10612.0, the denominator used in the original experiment
        num.put("recallRate",String.format("%.6f", likeDoc/10612.0));
        // Precision: related hits divided by docSize; note docSize counts all matches, while only the top 10 are inspected above
        num.put("Accuracy",String.format("%.6f", likeDoc/(docSize*1.0)));
        num.put("text",q);
        System.out.println("========================================================");
        System.out.println("Hits related to this disease: "+likeDoc);
        System.out.println("Total matching files: "+docSize);
        System.out.println("Hits unrelated to the 1000 disease files: "+notLikeDoc);
        System.out.println("Recall: "+String.format("%.6f", likeDoc/10612.0));
        System.out.println("Precision: "+String.format("%.6f", likeDoc/(docSize*1.0)));
        System.out.println("========================================================");
        reader.close();
        NumOF.put("lucene",lucenes);
        NumOF.put("num",num);
        return NumOF;
    }
}

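The code above performs a single query. It takes a string, searches the index, and keeps only the top 10 hits. For each of these 10 hits it parses the stored content as JSON to obtain the id and content. The range the id falls in tells which disease the hit belongs to, so each hit can be counted as related or unrelated. Recall and precision are then computed from those counts, and everything is packed into a map that holds both the statistics and the content to return.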
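For reference, the two metrics can be sketched on their own using the standard information-retrieval definitions: precision is the share of retrieved documents that are relevant, and recall is the share of all relevant documents in the corpus that were retrieved. The class name `RetrievalMetrics` and the numbers in `main` are illustrative assumptions, not results from the experiment, and the denominators here follow the textbook definitions rather than the constants hard-coded in the listing.

```java
import java.util.Locale;

// Standard precision/recall definitions; a sketch, not the exact computation used in the listing.
public class RetrievalMetrics {

    // precision = relevant documents retrieved / all documents retrieved
    static double precision(int relevantRetrieved, int retrieved) {
        return retrieved == 0 ? 0.0 : (double) relevantRetrieved / retrieved;
    }

    // recall = relevant documents retrieved / all relevant documents in the corpus
    static double recall(int relevantRetrieved, int totalRelevant) {
        return totalRelevant == 0 ? 0.0 : (double) relevantRetrieved / totalRelevant;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: 10 hits inspected, 8 of them relevant, 1000 relevant texts overall
        int retrieved = 10;
        int relevantRetrieved = 8;
        int totalRelevant = 1000;
        System.out.println(String.format(Locale.ROOT, "precision: %.6f", precision(relevantRetrieved, retrieved)));
        System.out.println(String.format(Locale.ROOT, "recall: %.6f", recall(relevantRetrieved, totalRelevant)));
    }
}
```

Since only the top 10 hits are inspected, recall measured this way is capped at 10 divided by the number of relevant texts, which is worth keeping in mind when comparing against another search method.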

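A word on context: what we are building is a web project. The user types in a query, and the Lucene search results are returned and displayed. There are two search modes: plain Lucene, and Lucene plus LDA. What is LDA? It stands for Latent Dirichlet Allocation, a topic model, and we use it as a tool to break a sentence down into a few keywords, which are then passed to the Lucene query. If you want to dig deeper, add me on WeChat (y958231955) and we can discuss it together.

Of course, as back-end developers we only need to implement the functionality and can leave the rest to the designers, haha. That's all for today.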

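Conclusion

If you do not know how to download the jar packages used above, follow the WeChat official account 程序員PG

and reply: lucene

It also includes 10 sample texts so that you can try the process yourself. I hope it helps you finish your study and work sooner.

Feel free to reach out to me about anything related to programming or life as well.

WeChat: y958231955

Toutiao: 超廠長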
