Java Documentation Search Engine

Project details: https://github.com/BlackerGod/Java_search_api
Live demo: click to view
(PS: the demo is a bit slow because the server has limited bandwidth and CPU.)

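I. Project requirements

When we run into an unfamiliar function or package in Java, we have to look it up in the official documentation. The problem is that the official documentation has no search interface: we need to know which package something lives in and then browse for it page by page, which is slow and tedious. So in this project we build a search feature for the official Java documentation.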

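II. Project analysis

To search the documentation we first need the full set of pages, and we need to know where each one lives. A crawler could collect them, but that is more work; instead we download the whole documentation bundle, preprocess it, and search the processed files. Each result returns the page's URL, description, title, and so on.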

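III. Project design

1. Preprocessing module: do a first pass over the downloaded HTML documents (a simple structural parse that strips the HTML tags).
Every HTML file under the api directory is processed into one single output file, organized as lines of text (one document per line) so that building the index is easy. Each line has three columns:

Column 1: document title
Column 2: document URL
Column 3: document body (HTML tags removed)

2. Index module: build a forward index and an inverted index from the preprocessing output.
3. Search module: carry out the basic flow of one search, from the user's query to the final results.
4. Front-end module: a page where the user enters a query and the results are displayed.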
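
The three-column line format from the preprocessing step can be sketched as follows. This is only an illustration: the class and method names (RecordFormatSketch, toLine, fromLine) are made up for the example, and the \3 separator is the one the Parser code in the next section uses.

```java
import java.util.Arrays;

public class RecordFormatSketch {
    // The character with ASCII code 3; it should not occur in ordinary page text.
    static final String SEP = "\3";

    // Join the three columns into one line of the output file.
    static String toLine(String title, String url, String content) {
        return title + SEP + url + SEP + content + "\n";
    }

    // Split one line (already without its trailing newline) back into columns.
    static String[] fromLine(String line) {
        return line.split(SEP);
    }

    public static void main(String[] args) {
        String line = toLine("String",
                "https://docs.oracle.com/javase/8/docs/api/java/lang/String.html",
                "The String class represents character strings.");
        // Drop the trailing newline, as BufferedReader.readLine() would.
        String[] cols = fromLine(line.substring(0, line.length() - 1));
        System.out.println(cols.length);          // 3
        System.out.println(Arrays.toString(cols));
    }
}
```

Keeping one document per line means the index builder can read the file with readLine() and recover the columns with a single split.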

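IV. Implementation

1. Preprocessing module

First we download the documentation:
We only index the api directory. It contains nothing but HTML pages, and each one corresponds to a page of the official online documentation.
So we walk the whole folder and save each HTML file's title, URL, and content (the text with tags removed).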

package parser;

import java.io.*;
import java.util.ArrayList;

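/**
 * Walks the documentation directory, reads every HTML document, and writes the result as a line-oriented text file.
 * Each line corresponds to one document and holds that document's information.
 * Parser is a standalone executable class (it has a main method).
 */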
public class Parser {
    private static final String INPUT_PATH = null; // path of the downloaded api directory; set this before running
    private static final String OUTPUT_PATH = null; // path of the output file; set this before running

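    /**
     * Runs the preprocessing:
     * 1. Enumerate all HTML files under INPUT_PATH (recursively).
     * 2. Iterate over the HTML file paths, opening each file in turn and reading its content.
     * 3. Convert the content into one structured record (later loaded as a DocInfo object) and write it to the output file.
     * @param args unused
     */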
    public static void main(String[] args) throws IOException {
        FileWriter fileWriter = new FileWriter(new File(OUTPUT_PATH));
        ArrayList<File> fileList = new ArrayList<>();
        enumFile(INPUT_PATH,fileList);
        for (File f : fileList){
            //System.out.println("converting" + f.getAbsolutePath() + "...");
            String line = convertLine(f);
            //System.out.println(line);
            fileWriter.write(line);
        }
        fileWriter.close();
    }

    /**
     * Builds one output line for a file.
     * @param f an HTML file
     * @return  the title, URL, and content of the file, joined into a single line
     */
    private static String convertLine(File f) throws IOException {
        String title = convertTitle(f);
        String url = convertUrl(f);
        String content = convertContent(f);
        // \3 (the character with ASCII code 3) separates the three parts.
        return title + "\3" + url + "\3" + content + "\n";
    }

    private static String convertContent(File f) throws IOException {
        // Strip the tags and replace line breaks with spaces
        StringBuilder output = new StringBuilder();
        // try-with-resources closes the reader even if read() throws
        try (FileReader fileReader = new FileReader(f)) {
            boolean isContent = true;
            while (true){
                int ret = fileReader.read();
                if(ret == -1){
                    break;
                }
                char c = (char)ret;
                if(isContent){ // inside body text
                    if(c == '<'){
                        isContent = false;
                        continue;
                    }
                    if(c == '\n' || c == '\r'){  // \n is a line feed, \r is a carriage return
                        c = ' ';
                    }
                    output.append(c);
                } else { // inside a tag
                    if(c == '>'){
                        isContent = true;
                    }
                }
            }
        }
        return output.toString();
    }
    private static String convertUrl(File f) {
        // Base URL of the corresponding online documentation
        String prev = "https://docs.oracle.com/javase/8/docs/api";
        String text = f.getAbsolutePath().substring(INPUT_PATH.length());
        text = text.replaceAll("\\\\","/"); // normalize Windows path separators; optional
        return prev + text;
    }
    private static String convertTitle(File f) {
        // Use the file name, minus the ".html" suffix, as the title
        String name = f.getName();
        return name.substring(0,name.length() - ".html".length());
    }

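    /**
     * Recursively collects the HTML files under a directory.
     * @param inputPath  the directory to scan
     * @param fileList   the list that collects the files found so far
     */
    private static void enumFile(String inputPath, ArrayList<File> fileList) {
        // Recursively visit every directory and file under inputPath
        File root = new File(inputPath);
        File[] files = root.listFiles(); // everything directly under this path (files and directories)
        if(files == null){
            // inputPath is not a directory, or it could not be read
            return;
        }
        for (File f : files){
            if(f.isDirectory()){
                enumFile(f.getAbsolutePath(),fileList);
                // recurse into the subdirectory
            } else if(f.getAbsolutePath().endsWith(".html")){
                // keep the file only if it is an .html file
                fileList.add(f);
            }
        }
    }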
}

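When the processing finishes, we get a single file:
It holds the name, URL, and body text of every page. This step only needs to run once; from then on we work with tmp.txt alone.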
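
To get a feel for what ends up in the body-text column, here is a small sketch that applies the same two-state scan as convertContent to an in-memory string instead of a file. The class name StripTagsSketch and the sample HTML are made up for illustration.

```java
public class StripTagsSketch {
    // Two states: inside body text, or inside a tag (between '<' and '>').
    static String stripTags(String html) {
        StringBuilder out = new StringBuilder();
        boolean inText = true;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inText) {
                if (c == '<') {          // a tag starts: stop copying
                    inText = false;
                    continue;
                }
                if (c == '\n' || c == '\r') {
                    c = ' ';             // keep each document on one line
                }
                out.append(c);
            } else if (c == '>') {       // the tag ends: resume copying
                inText = true;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<h1>Class String</h1>\n<p>Immutable text.</p>";
        System.out.println(stripTags(html)); // Class String Immutable text.
    }
}
```

A scan this simple has known limits: it keeps the contents of script and style elements, and it does not decode entities such as &lt;. For a keyword search over API pages that is usually acceptable.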

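2. Index module

First, two concepts:
[Forward index]: given a document ID, look up the document (and so check whether a search term occurs in it).
[Inverted index]: given a search term, look up the IDs of the documents in which it occurs.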
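
As a minimal sketch of the two structures, assuming whitespace splitting in place of a real tokenizer (the class name TwoIndexesSketch and the sample texts are made up for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoIndexesSketch {
    // Forward index: the position in the list is the document ID.
    final List<String> forward = new ArrayList<>();
    // Inverted index: word -> IDs of the documents that contain it.
    final Map<String, List<Integer>> inverted = new HashMap<>();

    int add(String text) {
        int docId = forward.size();
        forward.add(text);
        // Whitespace splitting stands in for a real tokenizer (an assumption).
        for (String word : text.toLowerCase().split("\\s+")) {
            List<Integer> ids = inverted.computeIfAbsent(word, k -> new ArrayList<>());
            // Record each document at most once per word.
            if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) {
                ids.add(docId);
            }
        }
        return docId;
    }

    public static void main(String[] args) {
        TwoIndexesSketch idx = new TwoIndexesSketch();
        idx.add("array list of strings");
        idx.add("linked list");
        System.out.println(idx.forward.get(1));       // linked list
        System.out.println(idx.inverted.get("list")); // [0, 1]
    }
}
```
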
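That raises another problem: a keyword may occur in many documents, so how do we know which document is the one we want most?
We need to compute a [weight] for each match and sort by it. I use a simple formula here:
[weight] = (occurrences of the keyword in the title) x 10 + (occurrences of the keyword in the body) x 1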
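
The formula can be checked by hand with a small sketch. It assumes whitespace-separated tokens rather than the tokenizer the project actually uses, and the class name WeightSketch and the sample strings are made up for illustration.

```java
public class WeightSketch {
    // Count how often `word` occurs among the whitespace-separated tokens of `text`.
    static int count(String text, String word) {
        int n = 0;
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.equals(word)) {
                n++;
            }
        }
        return n;
    }

    // weight = occurrences in title * 10 + occurrences in body
    static int weight(String title, String content, String word) {
        return count(title, word) * 10 + count(content, word);
    }

    public static void main(String[] args) {
        String title = "ArrayList";
        String content = "arraylist is a resizable array and arraylist is not synchronized";
        // 1 hit in the title, 2 hits in the body
        System.out.println(weight(title, content, "arraylist")); // 12
    }
}
```

The factor of 10 means that a single hit in the title outranks up to nine hits in the body, which matches the intuition that the page named after a class is the best result for that class.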

package common;

public class DocInfo {

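    private int docId; // document ID; must be unique
    private String title; // document title, taken from the file name
    private String url; // online URL, constructed from the local path
    private String content; // page text with the HTML tags removed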

    public int getDocId() {
        return docId;
    }

    public void setDocId(int docId) {
        this.docId = docId;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    @Override
    public String toString() {
        return "DocInfo{" +
                "docId=" + docId +
                ", title='" + title + '\'' +
                ", url='" + url + '\'' +
                ", content='" + content + '\'' +
                '}';
    }
}

Next, load the file into memory:

package index;

import common.DocInfo;
import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;

import java.io.*;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Builds the indexes: the forward index (document ID => document) and the inverted index (word => document IDs).
 */
public class Index {
    /**
     * One entry of the inverted index: the weight of a word in one document.
     */
    static public class Weight{
        public String word;
        public int docId;
        public int weight;
        // weight = occurrences in title * 10 + occurrences in body
    }

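    // Forward index
    private ArrayList<DocInfo> forwardIndex = new ArrayList<>();

    // Inverted index: records not only which docIds contain a word but also the weight,
    // i.e. how strongly the word is associated with the document
    private HashMap<String,ArrayList<Weight>> invertedIndex = new HashMap<>();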


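    /**
     * Looks up the forward index.
     * @param docId document ID
     * @return  the document's information
     */
    public DocInfo getDocInfo(int docId){
        return forwardIndex.get(docId);
    }

    /**
     * Looks up the inverted index.
     * @param term the word to look up
     * @return the list of entries for documents containing the word, or null if there are none
     */
    public ArrayList<Weight> getInverted(String term){
        return invertedIndex.get(term);
    }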

    /**
     * Reads the text file and loads its content into the in-memory structures above.
     * The fields on each line are separated by \3.
     */
    public void build(String path) throws IOException {
        long startTime = System.currentTimeMillis();
        System.out.println("build start");

        // 1. Open the file and read it line by line
        BufferedReader bw = new BufferedReader(new FileReader(new File(path)));

        // 2. Holds the current line
        String line = "";

        while((line = bw.readLine()) != null){
            // 3. Build the forward index: split on \3, wrap the parts in a DocInfo, and store it
            DocInfo docInfo = buildForward(line);
            if(docInfo == null){
                // Malformed line; buildForward has already reported it, so skip it
                continue;
            }

            // 4. Build the inverted index
            buidInverted(docInfo);
            System.out.println("Build " + docInfo.getTitle() + " Finished");

        }
        bw.close();
        long finishTime = System.currentTimeMillis();
        System.out.println("build finished, time: " + (finishTime - startTime) + "ms");
    }


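    /**
     * Builds the forward index entry for one line; this is just string splitting.
     * @param line one line of the preprocessed file
     * @return the resulting DocInfo, or null if the line is malformed
     */
    private DocInfo buildForward(String line) {
        // Split the line on \3;
        // the three parts are the document's title, URL, and body
        String[] tokens = line.split("\3");
        if(tokens.length != 3){
            // The line is malformed
            System.out.println("Malformed line: " + line);
            return null;
        }
        DocInfo docInfo = new DocInfo();
        // The ID is the document's position in the forward index
        docInfo.setDocId(forwardIndex.size());
        docInfo.setTitle(tokens[0]);
        docInfo.setUrl(tokens[1]);
        docInfo.setContent(tokens[2]);
        forwardIndex.add(docInfo);
        return docInfo;
    }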


    private void buidInverted(DocInfo docInfo) {
        /**
         * Per-word occurrence counts, used to compute the weight
         */
        class WordCnt{
            public int titleCount;
            public int contentCount;

            public WordCnt(int titleCount, int contentCount) {
                this.titleCount = titleCount;
                this.contentCount = contentCount;
            }
        }

        HashMap<String,WordCnt> wordCntHashMap = new HashMap<>();
        // 1. Tokenize the title (tokenization is provided by the ansj dependency)
        List<Term> titleTerms = ToAnalysis.parse(docInfo.getTitle()).getTerms();
        // 2. Walk the tokens and count how often each word occurs in the title
        for (Term term : titleTerms){
            // word is already lowercase at this point
            String word = term.getName();
            WordCnt wordCnt = wordCntHashMap.get(word);
            if(wordCnt == null){ // first occurrence of this word
                wordCntHashMap.put(word,new WordCnt(1,0));
            } else {
                wordCnt.titleCount++;
            }
        }

        // 3. Tokenize the body
        List<Term> contentTerms = ToAnalysis.parse(docInfo.getContent()).getTerms();
        // 4. Walk the tokens and count how often each word occurs in the body
        for (Term term : contentTerms){
            String word = term.getName();
            WordCnt wordCnt = wordCntHashMap.get(word);
            if(wordCnt == null){
                wordCntHashMap.put(word,new WordCnt(0,1));
            } else {
                wordCnt.contentCount++;
            }
        }

        // 5. Walk the HashMap, build a Weight object for each word, and update the inverted index
        for (Map.Entry<String,WordCnt> entry : wordCntHashMap.entrySet()){
            Weight weight = new Weight();
            weight.word = entry.getKey();
            weight.docId = docInfo.getDocId();
            weight.weight = entry.getValue().titleCount * 10 + entry.getValue().contentCount;

            // Add the weight to the inverted index
            ArrayList<Weight> invertedList = invertedIndex.get(entry.getKey());
            if(invertedList == null){
                // The word has no entry yet: create its list
                invertedList = new ArrayList<>();
                invertedIndex.put(entry.getKey(),invertedList);
            }
            invertedList.add(weight);
        }

    }
}

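Building the inverted index is the hardest part to follow, so let me draw a picture.
Here are two documents. The forward index returns a whole document given its docId, while the inverted index returns the docIds of the documents that contain a given word.
The process of building the inverted index:
[First], for one DocInfo we create a HashMap that records how many times each word occurs in the title and in the body. When the counting is done, the map looks like the figure below.
Once that is done, we consult the overall structure, the inverted index itself.
We check whether the word already has an entry; if it does not, we create an empty list for it. Either way, the document's docId and weight are appended to the word's list.
Finally, at query time we only need to sort the Weight objects by their weight field.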
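
The final sorting step could look like the sketch below. The Searcher class itself is not shown in this post, so this is not the project's implementation: Posting is a hypothetical stand-in for Index.Weight, and rank is a made-up helper that handles a single query term.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RankSketch {
    // Hypothetical stand-in for Index.Weight: one entry of the inverted index.
    static class Posting {
        final int docId;
        final int weight;
        Posting(int docId, int weight) {
            this.docId = docId;
            this.weight = weight;
        }
    }

    // Return the IDs of the documents containing `term`, highest weight first.
    static List<Integer> rank(Map<String, List<Posting>> inverted, String term) {
        List<Integer> ids = new ArrayList<>();
        List<Posting> postings = inverted.get(term);
        if (postings == null) {
            return ids; // the term occurs in no document
        }
        // Sort a copy so the index itself is not reordered.
        List<Posting> sorted = new ArrayList<>(postings);
        sorted.sort(Comparator.comparingInt((Posting p) -> p.weight).reversed());
        for (Posting p : sorted) {
            ids.add(p.docId);
        }
        return ids;
    }

    public static void main(String[] args) {
        Map<String, List<Posting>> inverted = new HashMap<>();
        List<Posting> list = new ArrayList<>();
        list.add(new Posting(0, 3));
        list.add(new Posting(1, 12));
        list.add(new Posting(2, 7));
        inverted.put("list", list);
        System.out.println(rank(inverted, "list")); // [1, 2, 0]
    }
}
```

Each returned docId can then be passed to the forward index to fetch the title and URL for display.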

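3. Search module

The search module is simple. This is a web application, so we write a Servlet that calls the corresponding query method.
Results are sent as JSON, serialized with Gson.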

package api;

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import searcher.Result;
import searcher.Searcher;

import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.util.List;

public class DocSearcherServlet extends HttpServlet {

    private Searcher searcher = new Searcher();
    private Gson gson = new GsonBuilder().create();

    public DocSearcherServlet() throws IOException {
    }

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
        resp.setContentType("application/json;charset=utf-8");
        String query = req.getParameter("query");
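        if(query == null){
            // A missing parameter is a client error: 400 Bad Request rather than 404
            resp.setStatus(HttpServletResponse.SC_BAD_REQUEST);
            resp.getWriter().write("invalid query parameter");
            return;
        }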
        List<Result> results = searcher.search(query);
        String respString = gson.toJson(results);
        resp.getWriter().write(respString);

    }
}

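Then there is the web.xml configuration. Building the index is slow, so by default the first user to send a request would wait a long time; after that everything is in memory and queries are fast. To avoid this, we set load-on-startup so that the servlet, and with it the index, is built once when the application starts.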

<web-app>
  <display-name>Archetype Created Web Application</display-name>

  <servlet>
    <servlet-name>DocSearcherServlet</servlet-name>
    <servlet-class>api.DocSearcherServlet</servlet-class>
    <load-on-startup>1</load-on-startup>
  </servlet>
  <servlet-mapping>
    <servlet-name>DocSearcherServlet</servlet-name>
    <url-pattern>/search</url-pattern>
  </servlet-mapping>

</web-app>

4. Front-end module

I am not very good at this part; it took some research before I got it working.

<html>
<head>
    <!-- Bootstrap docs: https://v3.bootcss.com/css/ -->
    <!-- Vue docs: https://cn.vuejs.org/v2/guide/ -->
    <!-- Required meta tags -->
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

    <!-- Bootstrap CSS -->
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap@3.3.7/dist/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous">

    <title>Java API Search</title>
    <style>
        #app {
            margin-left:50px;
            margin-right:50px;
        }
        div button {
            width:100%;
        }
        .row {
            padding-top: 10px;
        }
        .col-md-5,.col-md-1 {
            padding-left:2px;
            padding-right:2px;
        }
        .title {
            font-size: 22px;
        }
        .desc {
            font-size: 18px;
        }
        .url {
            font-size: 18px;
            color: green;
        }
    </style>
</head>
<body>
<div id="app">
    <div class="row">
        <img src="image/1.jpg" width="55px" height="60px" />
    </div>
    <div class="row">
        <div class="col-md-5">
            <input type="text" class="form-control" placeholder="Enter a keyword" v-model="query">
        </div>
        <div class="col-md-1">
            <button class="btn btn-success" v-on:click="search()">Search</button>
        </div>
    </div>
    <div class="row" v-for="result in results">
        <!-- Holds the search results -->
        <div class="title"><a v-bind:href="result.clickUrl">{{result.title}}</a></div>
        <div class="desc">{{result.Desc}}</div>
        <div class="url">{{result.ShowUrl}}</div>
    </div>
</div>
</body>
<script src="https://apps.bdimg.com/libs/jquery/2.1.4/jquery.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/bootstrap@3.3.7/dist/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/vue"></script>
<script>
    var vm = new Vue({
        el: "#app",
        data: {
            query: "",
            results: [ ]
        },
        methods: {
            search() {
                $.ajax({
                    url:"/JavaAPI/search?query=" + encodeURIComponent(this.query),
                    type: "get",
                    context: this,
                    success: function(respData, status) {
                        this.results = respData;
                    }
                })
            },
        }
    })
</script>
</html>

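There is not much to the front-end page; most of the work is CSS. The JavaScript submits the query to the back end, receives the returned data, and renders each result.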

And with that, the project is complete.
