Appending to and updating the index of Lucene's autocomplete Suggest module

The version I'm using is Lucene-Suggest-4.7.jar.
While building a Baidu-style autocomplete module I ran into two problems: appending new entries to an existing suggest index, and updating the weight of an entry already in the index. This post solves both. The basic usage of Lucene's Suggest package is easy to find online, so here is only a quick recap:
Building an index with the Suggest package is quite different from building one with Lucene's IndexWriter. Roughly three classes are needed: an entity class, an InputIterator over the entities, and a service class that performs the actual operations. The entity class needs little explanation; the code follows:

public class Suggester implements Serializable {
    private static final long serialVersionUID = 1L;
    String term;
    int times;
    /**
     * @param term  the suggestion text
     * @param times  the term frequency (used as the weight)
     */
    public Suggester(String term, int times) {
        this.term = term;
        this.times = times;
    }
    public Suggester() {
        super();
    }
    /**
     * @return the term
     */
    public String getTerm() {
        return term;
    }
    /**
     * @param term the term to set
     */
    public void setTerm(String term) {
        this.term = term;
    }
    /**
     * @return the times
     */
    public int getTimes() {
        return times;
    }
    /**
     * @param times the times to set
     */
    public void setTimes(int times) {
        this.times = times;
    }
    /* (non-Javadoc)
     * @see java.lang.Object#toString()
     */
    @Override
    public String toString() {
        return term + " " + times;
    }
    /* (non-Javadoc)
     * @see java.lang.Object#hashCode()
     */
    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((term == null) ? 0 : term.hashCode());
        return result;
    }
    /*
     * Compare by term only
     * @see java.lang.Object#equals(java.lang.Object)
     */
    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        Suggester other = (Suggester) obj;
        if (term == null) {
            if (other.term != null)
                return false;
        } else if (!term.equals(other.term))
            return false;
        return true;
    }
}
The service class simply calls into the Suggest API, so the interesting part is the entity-iterator class. The source shows why it is needed: AnalyzingInfixSuggester builds its index through public void build(InputIterator iter), pulling each entry's text, weight and payload from the InputIterator you pass in. Here is the iterator class wrapping our entities:


public class SuggesterIterator implements InputIterator {
    /** Iterator over the entity collection */
    private final Iterator<Suggester> suggesterIterator;
    /** The Suggester currently being visited */
    private Suggester currentSuggester;
    /**
     * Constructor.
     * @param suggesterIterator iterator over the entities to index
     */
    public SuggesterIterator(Iterator<Suggester> suggesterIterator) {
        this.suggesterIterator = suggesterIterator;
    }
    /*
     * Advance to the next entry.
     * @see org.apache.lucene.util.BytesRefIterator#next()
     */
    @Override
    public BytesRef next() throws IOException {
        if (suggesterIterator.hasNext()) {
            currentSuggester = suggesterIterator.next();
            String term = currentSuggester.getTerm();
            try {
                return new BytesRef(term.getBytes("UTF-8"));
            } catch (UnsupportedEncodingException e) {
                e.printStackTrace();
            }
        }
        //return null on error or when iteration is complete
        return null;
    }
    /*
     * Whether entries carry payload data.
     * @see org.apache.lucene.search.suggest.InputIterator#hasPayloads()
     */
    @Override
    public boolean hasPayloads() {
        return true;
    }
    /*
     * The payload: arbitrary bytes to be retrieved at lookup time; here we store the term frequency.
     * @see org.apache.lucene.search.suggest.InputIterator#payload()
     */
    @Override
    public BytesRef payload() {
        /* If hasPayloads() returned false, the code below would never be used */
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(bos);
            dos.writeInt(currentSuggester.getTimes());
            dos.close();
            return new BytesRef(bos.toByteArray());
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
    /*
     * The weight that drives the ranking.
     * @see org.apache.lucene.search.suggest.InputIterator#weight()
     */
    @Override
    public long weight() {
        //use the term frequency as the weight
        return currentSuggester.getTimes();
    }
    /*  
     * @see org.apache.lucene.util.BytesRefIterator#getComparator()
     */
    @Override
    public Comparator<BytesRef> getComparator() {
        //null means the entries are not pre-sorted; AnalyzingInfixSuggester accepts that
        return null;
    }
}
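
Since the frequency travels through the index as opaque payload bytes, it is worth sanity-checking the encode/decode round trip that payload() and the lookup code further down both rely on. A minimal, self-contained sketch (the class name is mine):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.lucene.util.BytesRef;

public class PayloadRoundTrip {
    public static void main(String[] args) throws IOException {
        int times = 42;
        //encode, exactly as SuggesterIterator.payload() does
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(bos);
        dos.writeInt(times);
        dos.close();
        BytesRef payload = new BytesRef(bos.toByteArray());
        //decode, exactly as the lookup code in this post does
        DataInputStream dis = new DataInputStream(new ByteArrayInputStream(payload.bytes));
        System.out.println("decoded times = " + dis.readInt()); //prints 42
        dis.close();
    }
}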



 With these pieces in place we can call the Suggest package's build method to create the index:
    /**
     * Build the index.
     * @param list the data set to index
     * @param indexPath directory for the index files
     * @return elapsed time in seconds
     */
    public double create(List<Suggester> list, String indexPath) {
        //elapsed milliseconds
        long time = 0L;
        //the suggester that builds and manages the index
        AnalyzingInfixSuggester suggester = null;
        try {
            suggester = new AnalyzingInfixSuggester(Version.LUCENE_47, new File(indexPath), analyzer);
            logger.debug("building autocomplete index...");
            long begin = System.currentTimeMillis();
            //build the index
            suggester.build(new SuggesterIterator(list.iterator()));
            time = System.currentTimeMillis() - begin;
            logger.debug("autocomplete index built in " + time + "ms");
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //close the suggester; guard against a constructor failure
            if (suggester != null) {
                try {
                    suggester.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return time / 1000.0;
    }
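
The method above references an analyzer and a logger defined elsewhere in the service class. Since the article notes below that the index was built with a whitespace analyzer, a plausible declaration (the class name SuggestService is my assumption) looks like:

import org.apache.log4j.Logger;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.util.Version;

public class SuggestService {
    //whitespace analyzer, matching the note below; swap in any analyzer you prefer
    private final Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_47);
    private static final Logger logger = Logger.getLogger(SuggestService.class);
    //create(...), lookup(...) and the update code from this article go here
}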


 The main test code:
        List<Suggester> list = new ArrayList<Suggester>();
        list.add(new Suggester("张三", 1));
        list.add(new Suggester("李四", 2));
        double time = suggestService.create(list, "file/autoComplete/project/template/index");
        System.out.println(time + " s");

Once this code has run, the index contains two Documents, one for 张三 and one for 李四. As the build method reproduced later in this post shows, each Document carries an indexed text field with the term, a textgrams field with its edge n-grams, a BinaryDocValues copy of the text, a weight NumericDocValues field holding the frequency, and (because hasPayloads() returns true) a payloads BinaryDocValues field.

Quite different from a plain IndexWriter setup, as you can see. Note that the index above was built with a whitespace analyzer. If the on-disk file structure interests you, dig into it yourself.
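
If you want to check the structure without a GUI tool, the following sketch (using the 4.7 docvalues API; the index path is the one from the test above) dumps every entry together with its weight:

import java.io.File;

import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiDocValues;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class DumpSuggestIndex {
    public static void main(String[] args) throws Exception {
        DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File("file/autoComplete/project/template/index")));
        //the suggester keeps the raw text and the weight in docvalues fields
        BinaryDocValues text = MultiDocValues.getBinaryValues(reader, "text");
        NumericDocValues weight = MultiDocValues.getNumericValues(reader, "weight");
        BytesRef scratch = new BytesRef();
        for (int i = 0; i < reader.maxDoc(); i++) {
            text.get(i, scratch); //fills scratch with the term bytes
            System.out.println(scratch.utf8ToString() + " -> weight " + weight.get(i));
        }
        reader.close();
    }
}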


The query side needs little commentary; the code mostly speaks for itself:
/**
     * Autocomplete lookup.
     * @param region the query string
     * @param indexPath index directory
     * @return the matching suggestions
     */
    public List<Suggester> lookup(String region, String indexPath) {
        //the result list
        List<Suggester> reList = new ArrayList<Suggester>();
        //the index directory
        File indexFile = new File(indexPath);
        //the suggester that manages the index
        AnalyzingInfixSuggester suggester = null;
        //raw lookup results
        List<LookupResult> results = null;
        try {
            suggester = new AnalyzingInfixSuggester(Version.LUCENE_47, indexFile, analyzer);
            /*
             * Run the lookup:
             *   region - the query keyword
             *   TOPS - maximum number of results (a constant of the service class)
             *   allTermsRequired - MUST vs. SHOULD semantics between terms
             *   doHighlight - whether to highlight the matched parts
             */
            results = suggester.lookup(region, TOPS, true, true);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (suggester != null) {
                try {
                    suggester.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        if (results == null) {
            //the lookup failed; return the empty list rather than NPE below
            return reList;
        }
        /*
         * Walk the results
         */
        System.out.println("query: " + region);
        for (LookupResult result : results) {
            String str = (String) result.highlightKey;
            Integer times = null;
            try {
                //decode the term frequency from the payload
                BytesRef bytesRef = result.payload;
                DataInputStream dis = new DataInputStream(new ByteArrayInputStream(bytesRef.bytes));
                times = dis.readInt();
                dis.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
            reList.add(new Suggester(str, times == null ? 0 : times.intValue()));
        }
        /*
         * Drop the entry identical to the query itself
         */
        for (int i = 0; i < reList.size(); i++) {
            Suggester sug = reList.get(i);
            //strip the highlight tags before comparing
            if (sug.getTerm().replaceAll("<[^>]*>", "").equals(region)) {
                reList.remove(sug);
                break;
            }
        }
        return reList;
    }
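
Called like this (suggestService is the same service object as in the build test), it should print the highlighted suggestions, e.g. "<b>张</b>三 1":

        List<Suggester> hits = suggestService.lookup("张", "file/autoComplete/project/template/index");
        for (Suggester s : hits) {
            System.out.println(s); //toString() prints "term times"
        }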

With the index built and queries working, here comes the real problem: what if I want to append new entries to the index? What if I want to update an existing entry? Search as you might, you will find that Suggest provides no such methods... so the rest of this post focuses on solving these two problems.

Reading the source, you can see that build uses an IndexWriter internally as well, and that the writer's configuration comes from a getIndexWriterConfig method.

In getIndexWriterConfig the open mode is hard-coded to OpenMode.CREATE; abridged, the 4.7 version boils down to:

    IndexWriterConfig iwc = new IndexWriterConfig(matchVersion, indexAnalyzer);
    iwc.setCodec(new Lucene46Codec());
    iwc.setOpenMode(OpenMode.CREATE);
    return iwc;

So an index can only ever be created from scratch, never opened for appending.
The fix is to subclass AnalyzingInfixSuggester, override getIndexWriterConfig, and roll our own MyAnalyzingInfixSuggester.
The code:

public class MyAnalyzingInfixSuggester extends AnalyzingInfixSuggester {

    /** How to open the index (create or append) */
    private final OpenMode mode;

......

    /*
     * Overloaded constructor; initializes the relevant fields.
     * @param matchVersion Lucene version
     * @param indexPath index directory
     * @param analyzer analyzer
     * @param mode how to open the index (create or append)
     * @throws IOException 
     */
    public MyAnalyzingInfixSuggester(Version matchVersion, File indexPath, Analyzer analyzer, OpenMode mode) throws IOException {
        //call the superclass constructor
        super(matchVersion, indexPath, analyzer, analyzer, DEFAULT_MIN_PREFIX_CHARS);
        this.mode = mode;
.....
    }


 /*
     * Override the method that builds the IndexWriterConfig,
     * making the open mode (create or append) configurable.
     * @see org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester#getIndexWriterConfig(org.apache.lucene.util.Version, org.apache.lucene.analysis.Analyzer)
     */
    @Override
    protected IndexWriterConfig getIndexWriterConfig(Version matchVersion, Analyzer indexAnalyzer) {
        IndexWriterConfig iwc = new IndexWriterConfig(matchVersion, indexAnalyzer);
        iwc.setCodec(new Lucene46Codec());
        if (indexAnalyzer instanceof AnalyzerWrapper) {
            //the wrapped gram analyzer signals the temporary .tmp index, which is always created fresh
            iwc.setOpenMode(OpenMode.CREATE);
        } else {
            iwc.setOpenMode(mode);
        }
        return iwc;
    }
......
}

This way we can pass our chosen open mode when constructing MyAnalyzingInfixSuggester, which makes appending possible. But this change alone will not give you working appends, because Suggest has its own ranking scheme: it sorts the documents by weight at build time and then, at lookup time, simply returns documents in index order. So if the index already contains 张三 and 李四 and you APPEND a 王五, searching for "王" may well show you 李四. Maddening, right? The fix is to drop the sort-at-build step and sort at search time instead:
The stock build sorts while indexing: just before the temporary index is merged into the final one, it wraps the reader in a view sorted by descending weight. In MyAnalyzingInfixSuggester we override build and simply delete that wrapping; in the override shown further down, the reader r goes into addIndexes directly.
Then override lookup: remove the stock retrieval, which trusts index order (the comments in the source explain that documents were sorted by weight at build time), and sort explicitly at search time instead. The two added lines, which you will find again in the full lookup override below, are:

            Sort sort = new Sort(new SortField("weight", SortField.Type.LONG, true));
            TopDocs hits = searcher.search(finalQuery, num, sort);
With those changes in place, everything works: the APPEND problem is completely solved.
The second problem, updating an entry, is then easy: use IndexWriter's delete method to remove the matching Document, wrap the updated object in a list, and hand it back to create to build again:
            Directory fsDir = FSDirectory.open(new File(indexPath));
            IndexWriter indexWriter = new IndexWriter(fsDir, new IndexWriterConfig(ManageIndexService.LUCENE_VERSION, analyzer));
            //delete the matching entry
            indexWriter.deleteDocuments(new Term(MyAnalyzingInfixSuggester.TEXT_FIELD_NAME, sug.getTerm()));
            //expunge the deletes for good
            indexWriter.forceMergeDeletes();
            //commit and close the IndexWriter
            indexWriter.commit();
            indexWriter.close();
            logger.debug("old entry deleted: " + sug.getTerm());

            List<Suggester> list = new ArrayList<Suggester>();
            list.add(sug);
            //re-index the updated entry in APPEND mode
            this.create(list, indexPath, OpenMode.APPEND);
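
The fragment calls a three-argument create that the article never shows explicitly; presumably it is the earlier create method reworked to use MyAnalyzingInfixSuggester (shown below) with a caller-chosen open mode. A hedged sketch:

    /**
     * Build or append to the index (sketch; assumes MyAnalyzingInfixSuggester below).
     * @param list entries to index
     * @param indexPath index directory
     * @param mode OpenMode.CREATE to rebuild from scratch, OpenMode.APPEND to add to an existing index
     * @return elapsed time in seconds
     */
    public double create(List<Suggester> list, String indexPath, OpenMode mode) {
        long time = 0L;
        MyAnalyzingInfixSuggester suggester = null;
        try {
            suggester = new MyAnalyzingInfixSuggester(Version.LUCENE_47, new File(indexPath), analyzer, mode);
            long begin = System.currentTimeMillis();
            suggester.build(new SuggesterIterator(list.iterator()));
            time = System.currentTimeMillis() - begin;
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (suggester != null) {
                try {
                    suggester.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return time / 1000.0;
    }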

Below is the complete MyAnalyzingInfixSuggester code.


import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.log4j.Logger;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.AnalyzerWrapper;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.codecs.lucene46.Lucene46Codec;
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.MultiDocValues;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.suggest.InputIterator;
import org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IOUtils;
import org.apache.lucene.util.Version;


public class MyAnalyzingInfixSuggester extends AnalyzingInfixSuggester {
    //NOTE: this class reads several members of AnalyzingInfixSuggester directly
    //(searcher, payloadsDV, weightsDV, textDV, queryAnalyzer, indexAnalyzer, ...);
    //if they are not visible to subclasses in your 4.7 build, place this class in
    //the org.apache.lucene.search.suggest.analyzing package.
    /** Logger **/
    private final Logger logger = Logger.getLogger(MyAnalyzingInfixSuggester.class);

    /** Field name used for the indexed text. */
    public static final String TEXT_FIELD_NAME = "text";

    /** Default minimum number of leading characters before
     *  PrefixQuery is used (4). */
    public static final int DEFAULT_MIN_PREFIX_CHARS = 4;
    private final File indexPath;
    final int minPrefixChars;
    final Version matchVersion;
    private final Directory dir;
    /** How to open the index (create or append) */
    private final OpenMode mode;

    /*
     * Overloaded constructor; initializes the relevant fields.
     * @param matchVersion Lucene version
     * @param indexPath index directory
     * @param analyzer analyzer
     * @param mode how to open the index (create or append)
     * @throws IOException 
     */
    public MyAnalyzingInfixSuggester(Version matchVersion, File indexPath, Analyzer analyzer, OpenMode mode) throws IOException {
        //call the superclass constructor
        super(matchVersion, indexPath, analyzer, analyzer, DEFAULT_MIN_PREFIX_CHARS);
        this.mode = mode;
        this.indexPath = indexPath;
        this.minPrefixChars = DEFAULT_MIN_PREFIX_CHARS;
        this.matchVersion = matchVersion;
        dir = getDirectory(indexPath);
    }

    /*
     * Override the method that builds the IndexWriterConfig,
     * making the open mode (create or append) configurable.
     * @see org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester#getIndexWriterConfig(org.apache.lucene.util.Version, org.apache.lucene.analysis.Analyzer)
     */
    @Override
    protected IndexWriterConfig getIndexWriterConfig(Version matchVersion, Analyzer indexAnalyzer) {
        IndexWriterConfig iwc = new IndexWriterConfig(matchVersion, indexAnalyzer);
        iwc.setCodec(new Lucene46Codec());
        if (indexAnalyzer instanceof AnalyzerWrapper) {
            //the wrapped gram analyzer signals the temporary .tmp index, which is always created fresh
            iwc.setOpenMode(OpenMode.CREATE);
        } else {
            iwc.setOpenMode(mode);
        }
        return iwc;
    }

    /*
     * Override build, dropping the sort-at-index-time step.
     * @see org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester#build(org.apache.lucene.search.suggest.InputIterator)
     */
    @Override
    public void build(InputIterator iter) throws IOException {
        if (searcher != null) {
            searcher.getIndexReader().close();
            searcher = null;
        }
        Directory dirTmp = getDirectory(new File(indexPath.toString() + ".tmp"));
        IndexWriter w = null;
        IndexWriter w2 = null;
        AtomicReader r = null;
        boolean success = false;
        try {
            Analyzer gramAnalyzer = new AnalyzerWrapper(Analyzer.PER_FIELD_REUSE_STRATEGY) {
                @Override
                protected Analyzer getWrappedAnalyzer(String fieldName) {
                    return indexAnalyzer;
                }

                @Override
                protected TokenStreamComponents wrapComponents(String fieldName, TokenStreamComponents components) {
                    if (fieldName.equals("textgrams") && minPrefixChars > 0) {
                        return new TokenStreamComponents(components.getTokenizer(), new EdgeNGramTokenFilter(matchVersion, components.getTokenStream(), 1, minPrefixChars));
                    } else {
                        return components;
                    }
                }
            };
            w = new IndexWriter(dirTmp, getIndexWriterConfig(matchVersion, gramAnalyzer));
            BytesRef text;
            Document doc = new Document();
            FieldType ft = getTextFieldType();
            Field textField = new Field(TEXT_FIELD_NAME, "", ft);
            doc.add(textField);

            Field textGramField = new Field("textgrams", "", ft);
            doc.add(textGramField);

            Field textDVField = new BinaryDocValuesField(TEXT_FIELD_NAME, new BytesRef());
            doc.add(textDVField);

            Field weightField = new NumericDocValuesField("weight", 0);
            doc.add(weightField);

            Field payloadField;
            if (iter.hasPayloads()) {
                payloadField = new BinaryDocValuesField("payloads", new BytesRef());
                doc.add(payloadField);
            } else {
                payloadField = null;
            }
            long t0 = System.nanoTime();
            while ((text = iter.next()) != null) {
                String textString = text.utf8ToString();
                textField.setStringValue(textString);
                textGramField.setStringValue(textString);
                textDVField.setBytesValue(text);
                weightField.setLongValue(iter.weight());
                if (iter.hasPayloads()) {
                    payloadField.setBytesValue(iter.payload());
                }
                w.addDocument(doc);
            }
            logger.debug("initial indexing time: " + ((System.nanoTime() - t0) / 1000000) + " msec");

            r = SlowCompositeReaderWrapper.wrap(DirectoryReader.open(w, false));
            w.rollback();

            w2 = new IndexWriter(dir, getIndexWriterConfig(matchVersion, indexAnalyzer));
            //key change vs. the stock build: add the reader as-is, without
            //wrapping it in a weight-sorted view first
            w2.addIndexes(new IndexReader[] { r });
            r.close();

            searcher = new IndexSearcher(DirectoryReader.open(w2, false));
            w2.close();

            payloadsDV = MultiDocValues.getBinaryValues(searcher.getIndexReader(), "payloads");
            weightsDV = MultiDocValues.getNumericValues(searcher.getIndexReader(), "weight");
            textDV = MultiDocValues.getBinaryValues(searcher.getIndexReader(), TEXT_FIELD_NAME);
            assert textDV != null;
            success = true;
        } finally {
            if (success) {
                IOUtils.close(w, w2, r, dirTmp);
            } else {
                IOUtils.closeWhileHandlingException(w, w2, r, dirTmp);
            }
        }
    }

    /*
     * Override lookup, sorting the results at search time instead.
     * @see org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester#lookup(java.lang.CharSequence, int, boolean, boolean)
     */
    @Override
    public List<LookupResult> lookup(CharSequence key, int num, boolean allTermsRequired, boolean doHighlight) {

        if (searcher == null) {
            throw new IllegalStateException("suggester was not built");
        }

        final BooleanClause.Occur occur;
        if (allTermsRequired) {
            occur = BooleanClause.Occur.MUST;
        } else {
            occur = BooleanClause.Occur.SHOULD;
        }

        TokenStream ts = null;
        try {
            ts = queryAnalyzer.tokenStream("", new StringReader(key.toString()));
            ts.reset();
            final CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
            final OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
            String lastToken = null;
            BooleanQuery query = new BooleanQuery();
            int maxEndOffset = -1;
            final Set<String> matchedTokens = new HashSet<String>();
            while (ts.incrementToken()) {
                if (lastToken != null) {
                    matchedTokens.add(lastToken);
                    query.add(new TermQuery(new Term(TEXT_FIELD_NAME, lastToken)), occur);
                }
                lastToken = termAtt.toString();
                if (lastToken != null) {
                    maxEndOffset = Math.max(maxEndOffset, offsetAtt.endOffset());
                }
            }
            ts.end();

            String prefixToken = null;
            if (lastToken != null) {
                Query lastQuery;
                if (maxEndOffset == offsetAtt.endOffset()) {
                    // Use PrefixQuery (or the ngram equivalent) when
                    // there was no trailing discarded chars in the
                    // string (e.g. whitespace), so that if query does
                    // not end with a space we show prefix matches for
                    // that token:
                    lastQuery = getLastTokenQuery(lastToken);
                    prefixToken = lastToken;
                } else {
                    // Use TermQuery for an exact match if there were
                    // trailing discarded chars (e.g. whitespace), so
                    // that if query ends with a space we only show
                    // exact matches for that term:
                    matchedTokens.add(lastToken);
                    lastQuery = new TermQuery(new Term(TEXT_FIELD_NAME, lastToken));
                }
                if (lastQuery != null) {
                    query.add(lastQuery, occur);
                }
            }
            ts.close();

            Query finalQuery = finishQuery(query, allTermsRequired);

            //key change vs. the stock lookup: sort explicitly by weight, descending
            Sort sort = new Sort(new SortField("weight", SortField.Type.LONG, true));
            TopDocs hits = searcher.search(finalQuery, num, sort);

            List<LookupResult> results = createResults(hits, num, key, doHighlight, matchedTokens, prefixToken);
            return results;
        } catch (IOException ioe) {
            throw new RuntimeException(ioe);
        } finally {
            IOUtils.closeWhileHandlingException(ts);
        }
    }

}
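
A quick end-to-end check of the whole solution, assuming the OpenMode-aware create overload sketched earlier:

        //initial build: create the index from scratch with 张三 and 李四
        List<Suggester> base = new ArrayList<Suggester>();
        base.add(new Suggester("张三", 1));
        base.add(new Suggester("李四", 2));
        suggestService.create(base, "file/autoComplete/project/template/index", OpenMode.CREATE);

        //append 王五 to the existing index
        List<Suggester> extra = new ArrayList<Suggester>();
        extra.add(new Suggester("王五", 3));
        suggestService.create(extra, "file/autoComplete/project/template/index", OpenMode.APPEND);

        //thanks to the search-time sort, "王" now correctly suggests 王五
        System.out.println(suggestService.lookup("王", "file/autoComplete/project/template/index"));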

Next time, if I find the time, I may write about span queries, near-synonyms and the like. I already have a complete demo written, but since that material is all over the web, there is no hurry. For questions about this article or Lucene in general, add +692790242 and we can discuss. Welcome!


All rights reserved. Please credit the source when reposting! By MRC
