Lucene 倒排索引原理

原創

Qunar技术沙龙

2021-08-04 11:23

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b9/b97a05a701956e8e7aa4d3bf48095ac0.png","alt":null,"title":"","style":[{"key":"width","value":"25%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"高沛","attrs":{}}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2018年7月加入去哪儿网，目前负责酒店搜索、门票搜索、大搜等搜索相关业务，曾参与基于Lucene的搜索召回服务搭建，个人对搜索引擎、分布式技术比较感兴趣，喜欢探究技术内幕、深入了解底层原理。","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule","attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1. 前言","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Lucene 作为 Apache 开源的一款搜索工具，一直以来是实现搜索功能的神兵利器，现今火热的 Solr 和 Elasticsearch 均基于该工具包进行开发，我们搜索召回组这边也是基于 Lucene 实现了一套索引构建机制，用于酒店搜索、门票搜索、大搜等搜索相关业务。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"而 Lucene 之所以能在搜索中发挥至关重要的作用正是因为倒排索引。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因此，本文将介绍一下倒排索引的概念以及倒排索引在 Lucene 中的实现。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule","attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2. 基本原理","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.1. 什么是倒排索引","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"搜索的核心需求是全文检索，全文检索简单来说就是要在大量文档中找到包含某个单词出现的位置，在传统关系型数据库中，数据检索只能通过 like 来实现，例如需要在酒店数据中查询名称包含公寓的酒店，需要通过如下 sql 实现：","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"sql"},"content":[{"type":"text","text":"select * from hotel_table where hotel_name like '%公寓%';","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"这种实现方式实际会存在很多问题：","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"无法使用数据库索引，需要全表扫描，性能差","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"搜索效果差，只能首尾位模糊匹配，无法实现复杂的搜索需求","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"无法得到文档与搜索条件的相关性","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"搜索的核心目标实际上是保证搜索的效果和性能，为了高效的实现全文检索，我们可以通过倒排索引来解决。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"倒排索引是区别于正排索引的概念：","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"正排索引：是以文档对象的唯一 ID 作为索引，以文档内容作为记录的结构。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"倒排索引：Inverted index，指的是将文档内容中的单词作为索引，将包含该词的文档 ID 作为记录的结构。","attrs":{}}]}]}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4f/4fb5fc732895e42961ed9d48beecc6bd.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面通过一个例子来说明下倒排索引的生成过程。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"假设目前有以下两个文档内容：","attrs":{}}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"苏州街维亚大厦 ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"桔子酒店苏州街店","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其处理步骤如下：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1、正排索引给每个文档进行编号，作为其唯一的标识。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/70/702a73277848f1ac3c1b5f1a5a5ea6c2.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2、生成倒排索引：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"a.首先要对字段的内容进行分词，分词就是将一段连续的文本按照语义拆分为多个单词，这里两个文档包含的关键词有：苏州街、维亚大厦.....","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"b.然后按照单词来作为索引，对应的文档 id 建立一个链表，就能构成上述的倒排索引结构。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b0/b0f98f6f5bf831b88616ddbdf3f02a78.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有了倒排索引，能快速、灵活地实现各类搜索需求。整个搜索过程中我们不需要做任何文本的模糊匹配。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"例如，如果需要在上述两个文档中查询 ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}},{"type":"strong","attrs":{}}],"text":"苏州街桔子","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":" ","attrs":{}},{"type":"text","text":"，可以通过分词后通过 ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}},{"type":"strong","attrs":{}}],"text":"苏州街","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":" ","attrs":{}},{"type":"text","text":"查到 ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}},{"type":"strong","attrs":{}}],"text":"1、2","attrs":{}},{"type":"text","text":"，通过 ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}},{"type":"strong","attrs":{}}],"text":"桔子","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":" ","attrs":{}},{"type":"text","text":" 查到","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":" ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}},{"type":"strong","attrs":{}}],"text":"2","attrs":{}},{"type":"text","text":"，然后再进行","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"取交取并","attrs":{}},{"type":"text","text":"等操作得到最终结果。 ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4d/4d6d52784bf1cf27e4c21003b3a348c8.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.2. 倒排索引的结构","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"根据倒排索引的概念，我们可以用一个 Map来简单描述这个结构。这个 Map 的 Key 的即是分词后的单词，这里的单词称为 Term，这一系列的 Term 组成了倒排索引的第一个部分 —— Term Dictionary (索引表，可简称为 Dictionary)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"倒排索引的另一部分为 Postings List（记录表），也对应上述 Map 结构的 Value 部分集合。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"记录表由所有的 Term 对应的数据（Postings）组成，它不仅仅为文档 id 信息，可能包含以下信息：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"文档 id（DocId, Document Id），包含单词的所有文档唯一 id，用于去正排索引中查询原始数据。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"词频（TF，Term Frequency），记录 Term 在每篇文档中出现的次数，用于后续相关性算分。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"位置（Position），记录 Term 在每篇文档中的分词位置（多个），用于做词语搜索（Phrase Query）。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"偏移（Offset），记录 Term 在每篇文档的开始和结束位置，用于高亮显示等。","attrs":{}}]}]}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/19/19aa1afae71f834e38bf676a7924ff36.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule","attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3. Lucene 倒排索引的实现","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"全文搜索引擎在海量数据的情况下是需要存储大量的文本，所以面临以下问题：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dictionary 是比较大的（比如我们搜索中的一个字段可能有上千万个 Term）","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Postings 可能会占据大量的存储空间（一个Term多的有几百万个doc）","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因此上面说的基于 Map 的实现方式几乎是不可行的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在海量数据背景下，倒排索引的实现直接关系到存储成本以及搜索性能。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"为此，Lucene 引入了多种巧妙的数据结构和算法。其倒排索引实现拥有以下特性：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以较低的存储成本存储在磁盘（索引大小大约为被索引文本的20-30％）","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"快速读写","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面将根据倒排索引的结构，按 Posting List 和 Terms Dictionary 两部分来分析 Lucene 中的实现。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.1. Posting List 实现","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"PostingList 包含文档 id、词频、位置等多个信息，这些数据之间本身是相对独立的，因此 Lucene 将 Postings List 被拆成三个文件存储：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":".doc后缀文件：记录 Postings 的 docId 信息和 Term 的词频","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":".pay后缀文件：记录 Payload 信息和偏移量信息","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":".pos后缀文件：记录位置信息","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基本所有的查询都会用 .doc 文件获取文档 id，且一般的查询仅需要用到 .doc 文件就足够了，只有对于近似查询等位置相关的查询则需要用位置相关数据。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"三个文件整体实现差不太多，这里以.doc 文件为例分析其实现。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":".doc 文件存储的是每个 Term 对应的文档 Id 和词频。每个 Term 都包含一对 TermFreqs 和 SkipData 结构。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中 TermFreqs 存放 docId 和词频信息，SkipData 为跳表信息，用于实现 TermFreqs 内部的快速跳转。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e7/e7d8a710fa5d70eb08a311919a6cca72.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"3.1.1. TermFreqs","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TermFreqs 存储文档号和对应的词频，它们两是一一对应的两个 int 值。Lucene 为了尽可能的压缩数据，采用的是混合存储，由 PackedBlock 和 VIntBlocks 两种结构组成。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"PackedBlock","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其采用 PackedInts 结构将一个 int[] 压缩打包成一个紧凑的 Block。它的压缩方式是取数组中最大值所占用的 bit 长度作为一个预算的长度，然后将数组每个元素按这个长度进行截取，以达到压缩的目的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"例如：一个包含128个元素的 int 数组中最大值的是 2，那么预算长度为2个 bit, PackedInts 的长度仅是 2 * 128 / 8 = 32个字节，然后就可以通过4个 long 值存储。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7e/7ee9913980300e9d5173f854f2325ed7.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"VIntBlock","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"VIntBlock 是采用 VInt 来压缩 int 值，对于绝大多数语言，int 型都占4个字节，不论这个数据是1、100、1000、还是1000,000。VInt 采用可变长的字节来表示一个整数。数值较大的数，使用较多的字节来表示，数值较少的数，使用较少的字节来表示。每个字节仅使用第1至第7位(共7 bits)存储数据，第8位作为标识，表示是否需要继续读取下一个字节。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"举个例子：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"整数130为 int 类型时需要4个字节，转换成 VInt 后仅用2个字节，其中第一个字节的第8位为1，标识需要继续读取第二个字节。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a9/a9ea395994ac7f211c4b554539f7345d.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"根据上述两种 Block 的特点，Lucene 会每处理包含 Term 的128篇文档，将其对应的 DocId 数组和 TermFreq 数组分别处理为 PackedDocDeltaBlock 和 PackedFreqBlock 的 PackedInt 结构，两者组成一个 PackedBlock，最后不足128的文档则采用 VIntBlock 的方式来存储。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/01/018e897474c5790f13caed6c2ee9b88a.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"3.1.2. SkipData","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在搜索中存在将每个 Term 对应的 DocId 集合进行取交集的操作，即判断某个 Term 的 DocId 在另一个 Term 的 TermFreqs 中是否存在。TermFreqs 中每个 Block 中的 DocId 是有序的，可以采用顺序扫描的方式来查询，但是如果 Term 对应的 doc 特别多时搜索效率就会很低，同时由于 Block 的大小是不固定的，我们无法使用二分的方式来进行查询。因此 Lucene 为了减少扫描和比较的次数，采用了 SkipData 这个跳表结构来实现快速跳转。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"跳表","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"跳表是在原有的有序链表上面增加了多级索引，通过索引来实现快速查找。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"实质就是一种可以进行二分查找的有序链表。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ae/aeb8f5b5e562b534b2e2f1c42720e4c6.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"SkipData结构","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在 TermFreqs 中每生成一个 Block 就会在 SkipData 的第0层生成一个节点，然后第0层以上每隔 N 个节点生成一个上层节点。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"每个节点通过 Child 属性关联下层节点，节点内 DocSkip 属性保存 Block 的最大的 DocId 值，DocBlockFP、PosBlockFP、PayBlockFP 则表示 Block 数据对应在 .pay、.pos、.doc 文件的位置。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/66/6635cd475f5ee49b24825be3b000e760.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"3.1.3. Posting 最终数据","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Posting List 采用多个文件进行存储，最终我们可以得到每个 Term 的如下信息：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SkipOffset：用来描述当前 term 信息在 .doc 文件中跳表信息的起始位置。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DocStartFP：是当前 term 信息在 .doc 文件中的文档 ID 与词频信息的起始位置。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"PosStartFP：是当前 term 信息在 .pos 文件中的起始位置。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"PayStartFP：是当前 term 信息在 .pay 文件中的起始位置。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.2. Term Dictionary 实现","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Terms Dictionary（索引表）存储所有的 Term 数据，同时它也是 Term 与 Postings 的关系纽带，存储了每个 Term 和其对应的 Postings 文件位置指针。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c0/c0626a99bb0d64f28d61058f0e8203d7.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"3.2.1. 数据存储","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Terms Dictionary 通过 .tim 后缀文件存储，其内部采用 NodeBlock 对 Term 进行压缩前缀存储，处理过程会将相同前缀的的 Term 压缩为一个 NodeBlock，NodeBlock 会存储公共前缀，然后将每个 Term 的后缀以及对应 Term 的 Posting 关联信息处理为一个 Entry 保存到 Block。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7b/7bedadb0a3c3d1e03a6fff38974b7fbe.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在上图中可以看到 Block 中还包含了 Block，这里是为了处理包含相同前缀的 Term 集合内部部分 Term 又包含了相同前缀。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"举个例子，在下图中为公共前缀为 a 的 Term 集合，内部部分 Term 的又包含了相同前缀 ab，这时这部分 Term 就会处理为一个嵌套的 Block。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/63/63ab5bc1d6dfd34c4d926aee1432bb26.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"3.2.2. 数据查找","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Terms Dictionary 是按 NodeBlock 存储在.tim 文件上。当文档数量越来越多的时，Dictionary 中的 Term 也会越来越多，那查询效率必然也会逐渐变低。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因此需要一个很好的数据结构为 Dictionary 建构一个索引，这就是 Terms Index(.tip文件存储)，Lucene 采用了 FST 这个数据结构来实现这个索引。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"FST","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"FST, 全称 Finite State Transducer（有限状态转换器）。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"它具备以下特点：","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"给定一个 Input 可以得到一个 output，相当于 HashMap","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"共享前缀、后缀节省空间，FST 的内存消耗要比 HashMap 少很多","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"词查找复杂度为 O(len(str))　　","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"构建后不可变更","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如下图为 mon/1，thrus/4，tues/2 生成的 FST，可以看到 thrus 和 tues 共享了前缀 t 以及后缀 s。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ab/ab0f6969a5064cfe52201101e49a2d58.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"根据 FST 就可以将需要搜索 Term 作为 Input，对其途径的边上的值进行累加就可以得到 output，下述为以 input 为 thrus 的读取逻辑：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"初始状态0","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"输入t， FST 从0 -> 3， output=2","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"输入h，FST 从3 -> 4， output=2+2=4","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"输入r， FST 从4 -> 5， output=4+0","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"输入u，FST 从5 -> 7， output=4+0","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"输入s， FST 到达终止节点，output=4+0=4","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那么 Term Dictionary 生成的 FST 对应 input 和 output 是什么呢？可能会误认为 FST 的 input 是 Dictionary 中所有的 Term，这样通过 FST 就可以找到具体一个 Term 对应的 Posting 数据。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"实际上 FST 是通过 Dictionary 的每个 NodeBlock 的前缀构成，所以通过 FST 只可以直接找到这个 NodeBlock 在 .tim 文件上具体的 File Pointer, 然后还需要在 NodeBlock 中遍历 Entry 匹配后缀进行查找。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因此它在 Lucene 中充当以下功能：","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"快速试错，即是在 FST 上找不到可以直接跳出不需要遍历整个 Dictionary，类似于 BloomFilter。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"快速定位 Block 的位置，通过 FST 是可以直接计算出 Block 的在文件中位置。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"FST 也是一个 Automation(自动状态机)。这是正则表达式的一种实现方式，所以 FST 能提供正则表达式的能力。通过 FST 能够极大的提高近似查询的性能，包括通配符查询、SpanQuery、PrefixQuery 等。","attrs":{}}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.3. 倒排查询逻辑","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在介绍了索引表和记录表的结构后，就可以得到 Lucene 倒排索引的查询步骤：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"通过 Term Index 数据（.tip文件）中的 StartFP 获取指定字段的 FST","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"通过 FST 找到指定 Term 在 Term Dictionary（.tim 文件）可能存在的 Block","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"将对应 Block 加载内存，遍历 Block 中的 Entry，通过后缀（Suffix）判断是否存在指定 Term","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"存在则通过 Entry 的 TermStat 数据中各个文件的 FP 获取 Posting 数据","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"如果需要获取 Term 对应的所有 DocId 则直接遍历 TermFreqs，如果获取指定 DocId 数据则通过 SkipData 快速跳转","attrs":{}}]}]}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b8/b8e0cf56336056ff40f32a1fc14e57d1.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule","attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4. Lucene 数值类型处理","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上述 Terms Dictionary 与 Posting List 的实现都是处理字符串类型的 Term，而对于数值类型，如果采用上述方式实现会存在以下问题：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"数值潜在的 Term 可能会非常多，比如是浮点数，导致查询效率低","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"无法处理多维数据，比如经纬度","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所以 Lucene 为了支持高效的数值类或者多维度查询，引入了 BKDTree。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.1. KDTree","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BKDTree 是基于 KDTree，KDTree 实现起来很像是一个二叉查找树。主要的区别是，KDTree 在不同的层使用的是不同的维度值。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面是一个2维树的样例，其第一层以 x 为切分维度，将 x>30的节点传递给右子树，x<30的传递给左子树，第二层再按 y 维度切分，不断迭代到所有数据都被建立到 KD Tree 的节点上为止。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/62/622038aafb06bf4c4fdcdb7c6e4c5909.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.2. BKDTree","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BKD 树是 KD 树和 B+ 树的组合，拥有以下特性：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"内部 node 必须是一个完全二叉树","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"叶子节点存储点数据，降低层高度，减少磁盘 IO","attrs":{}}]}]}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/80/802d0f38233f1edda56855c5be205f99.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule","attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"5. 总结","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文先介绍了倒排索引的概念和结构，然后对 Lucene 倒排索引的 Terms Dictionary 和 Posting List 的整体结构以及倒排索引的查询逻辑，最后介绍了 Lucene 对数值类型所做的处理。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"倒排索引有效的解决了搜索中的很多问题，而 Lucene 对倒排索引的实现包含了很多巧妙的结构和设计，对数据存储压缩以及查询很有借鉴意义，值得深入学习。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule","attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"参考资料","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Lucene 源码分析：https://www.amazingkoala.com.cn/Lucene/ ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Lucene BKD 树：https://www.shenyanchao.cn/blog/2018/12/04/lucene-bkd/ ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Lucene 查询原理及解析：https://www.infoq.cn/article/ejEG02VRoeGVaLw4j_LL","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Lucene 倒排索引探秘：https://www.6aiq.com/article/1564413040138 Lucene","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"词典 FST 深入剖析：https://www.shenyanchao.cn/blog/2018/12/04/lucene-fst/","attrs":{}}]}],"attrs":{}}]}

發表評論

所有評論

還沒有人評論，想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.

相關文章

欧洲英国德国法国TikTok与YouTube海外网红达人的完美合作策略

【本篇由言同數字科技有限公司原創】在當今數字營銷時代，TikTok已成爲一種受歡迎的社交媒體平臺，尤其在年輕人中頗具影響力。而其中的直播帶貨更是吸引了衆多品牌的注意，成爲推廣產品和增加銷售的重要途徑。下面言同數字將針對海外TikTok網紅直

2024-05-03 22:36:01

ollama使用

ollama 僅支持。gguf的格式其他格式需要llama.cpp 轉換 curl https://ollama.ai/install.sh | sh ollama --version ollama pull llama2-chin

2024-05-01 00:42:55

「Qt Widget中文示例指南」如何实现一个快捷编辑器（一）

Qt 是目前最先進、最完整的跨平臺C++開發工具。它不僅完全實現了一次編寫，所有平臺無差別運行，更提供了幾乎所有開發過程中需要用到的工具。如今，Qt已被運用於超過70個行業、數千家企業，支持數百萬設備及應用。快捷編輯器示例展示瞭如何創建一

2024-04-30 23:36:29

解锁HDC 2024之旅：从购票到报名，全程攻略

本文分享自華爲雲社區《解鎖HDC 2024之旅：從購票到報名，全程攻略》，作者：華爲雲社區精選。 Hi，代碼界的小夥伴們，集結號已經吹響了！華爲開發者大會（HDC 2024）——這場匯聚了HarmonyOS NEXT鴻蒙星河版、盤古大模型5

2024-04-30 22:34:35

银行核心背后的落地工程体系丨Oracle - TiDB 数据迁移详解

本文作者：張顯華，孟凡輝，莊培培系列導讀：徐戟（白鱔）數據庫技術專家，Oracle ACE，PostgreSQL ACE Director 當前，國內大量的關鍵行業的核心繫統正在實現國產化替代，而與此同時，這些行業的數字化轉型也正在進入

2024-04-30 22:24:59

30 秒出服装设计稿，森马用函数计算+AIGC 整“新活”!

創新項目如何去賦能我們的業務，這件事情在森馬很重要。阿里雲函數計算幫我們屏蔽掉了想把AI落地到實際業務場景中 GPU 算力資源儲備、採購成本、技術門檻等很多難題，從而迅速做出決策，快人一步站在正確的起點，體驗新技術對整個服裝爆款設計、營銷

2024-04-30 21:12:14

消金公司2023财报解析：息差维持高位，信用成本攀升

來源 | 鐳射財經（leishecaijing） 2023年，是持牌消金行業承上啓下的關鍵一年，也是鍛造韌性、比拼內功最緊張的一年。一方面，住戶短期消費貸款餘額在2022年觸底後，伴隨經濟復甦、消費提振，於2023年重新回到上行軌道。短

2024-04-30 13:11:32

Linux下制作Nginx绿色免安装包

前言 linux下安裝nginx比較繁瑣，遇到內網部署環境更是麻煩，所以研究了下nginx綠色免安裝版的部署包製作，開箱即用，特此記錄分享，一下操作在centos8環境下安裝，如果需要其他內核系統的安裝（Debian/Ubuntu等），請在

2024-04-29 21:38:23

数字化转型新篇章：企业通往智能化的新范式

早在十多年前，一些具有前瞻視野的企業以實現“數字化”爲目標啓動轉型實踐。但時至今日，可以說尚無幾家企業能夠在真正意義上實現“數字化”。在實現“數字化”的征途上，人們發現，努力愈進，彷彿終點愈遠。究其原因，還在於轉型一直落後於技術邊界的拓展

2024-04-29 21:22:20

MindSpore强化学习：使用PPO配合环境HalfCheetah-v2进行训练

本文分享自華爲雲社區《MindSpore強化學習：使用PPO配合環境HalfCheetah-v2進行訓練》，作者： irrational。半獵豹（Half Cheetah）是一個基於MuJoCo的強化學習環境，由P. Wawrzyński

2024-04-29 10:33:13

图片旋转后保存到数据库

1、圖片通過canvas繪製 2、canvas旋轉 3、canvas 轉成blob 在實例化成文件 4、創建formData裏面append放入文件和其他的參數，再調上傳接口 <div style=" heig

2024-04-29 10:16:22

记一次北京某大学逻辑漏洞挖掘

0x01 信息收集個人覺得教育src的漏洞挖掘就不需要找真實IP了，我們直接進入正題，收集某大學的子域名，可以用oneforall，這裏給大家推薦一個在線查詢子域名的網站：https://www.virustotal.com/ 收集到的子

2024-04-28 22:47:25

1 名工程师轻松管理 20 个工作流，创业企业用 Serverless 让数据处理流程提效

作者：嶽洋、陳德全、劉靜娜北京語勢科技有限公司成立於 2023 年 6 月，語勢科技定位爲“智能投資時代的主題入口”，在資管行業從以機構爲核心轉向以用戶爲核心的變革時代，通過打造主題投資引擎，賦能普惠投資一體化，打造以投資者和資管機構爲主

2024-04-28 21:12:22

实用分享！用Axure RP构建交互的5个小技巧

Axure RP是一套專門爲網站或應用程序所設計的快速原型設計工具，可以讓應用網站策劃人員或網站功能界面設計師更加快速方便的建立Web AP和Website的線框圖、流程圖、原型和規格。在Axure RP中，交互是創建豐富而逼真的原型的

2024-04-28 11:35:53

LoRA微调语言大模型的实用技巧

一、引言隨着深度學習技術的快速發展，語言大模型在自然語言處理領域取得了顯著的進展。然而，傳統的微調方法通常需要大量的計算資源和時間，對於實際應用來說並不友好。爲了解決這個問題，LoRA微調技術應運而生。LoRA（Low-Rank Adap

2024-04-28 11:30:13

24小時熱門文章

DAPPER 事务 TRANSACTION

最新文章

最新評論文章