How the Lucene Inverted Index Works

Original

Qunar Tech Salon

2021-08-04 11:23

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b9/b97a05a701956e8e7aa4d3bf48095ac0.png","alt":null,"title":"","style":[{"key":"width","value":"25%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"高沛","attrs":{}}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Joined Qunar in July 2018 and is currently responsible for search-related services including hotel search, ticket search, and general search. Previously took part in building a Lucene-based search recall service. Interested in search engines and distributed systems, and enjoys digging into technical internals and understanding the underlying principles.","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule","attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1. 
Introduction","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Lucene, an open-source search library from Apache, has long been a go-to tool for implementing search. The popular Solr and Elasticsearch are both built on it, and our search recall team has also built an index construction mechanism on top of Lucene for hotel search, ticket search, general search, and other search-related services.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What makes Lucene so effective for search is the inverted index.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article therefore introduces the concept of the inverted index and how Lucene implements it.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule","attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2. Basic Principles","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.1. 
What Is an Inverted Index","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The core requirement of search is full-text retrieval: finding, across a large number of documents, where a given word occurs. In a traditional relational database, this kind of lookup is typically done with like. For example, to find hotels whose name contains 公寓 (apartment), you would write SQL such as:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"sql"},"content":[{"type":"text","text":"select * from hotel_table where hotel_name like '%公寓%';","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This approach has several problems:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A pattern with a leading wildcard cannot use an ordinary database index, so it requires a full table scan and performs poorly","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Search quality is poor: only wildcard matching at the start or end of the pattern is possible, so complex search requirements cannot be met","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There is no way to obtain the relevance between a document and the search condition","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The real goal of search is to guarantee both result quality and performance. To implement full-text retrieval efficiently, we can use an inverted index.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"An inverted index is defined in contrast to a forward index:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Forward index: a structure that takes each document's unique ID 
as the index key, with the document content as the record.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Inverted index: a structure that uses the words in the document content as the index keys, with the IDs of the documents containing each word as the record.","attrs":{}}]}]}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4f/4fb5fc732895e42961ed9d48beecc6bd.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The following example illustrates how an inverted index is built.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Suppose we have the following two documents:","attrs":{}}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"蘇州街維亞大廈 ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"桔子酒店蘇州街店","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The processing steps are as follows:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. The forward index assigns each document a number that serves as its unique identifier.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/70/702a73277848f1ac3c1b5f1a5a5ea6c2.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. Build the inverted index:","a
ttrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"a. First, tokenize the field content. Tokenization splits a continuous piece of text into words according to meaning. The keywords in these two documents include: 蘇州街, 維亞大廈, ...","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"b. Then use each word as an index key and build a linked list of the corresponding document IDs, which gives the inverted index structure described above.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b0/b0f98f6f5bf831b88616ddbdf3f02a78.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With an inverted index, all kinds of search requirements can be met quickly and flexibly, and no fuzzy text matching is needed anywhere in the search process.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For example, to search the two documents above for ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}},{"type":"strong","attrs":{}}],"text":"蘇州街桔子","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":" ","attrs":{}},{"type":"text","text":", we tokenize the query and then look up ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}},{"type":"strong","attrs":{}}],"text":"蘇州街","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":" ","attrs":{}},{"type":"text","text":"to get ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}},{"type":"strong","attrs":{}}],"text":"1, 2","attrs":{}},{"type":"text","text":", and look up ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}},{"type":"strong","attrs":{}}],"text":"桔子","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":" ","attrs":{}},{"type":"text","text":" 
to get","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":" ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}},{"type":"strong","attrs":{}}],"text":"2","attrs":{}},{"type":"text","text":", and then apply ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"intersection or union","attrs":{}},{"type":"text","text":" to obtain the final result. ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4d/4d6d52784bf1cf27e4c21003b3a348c8.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.2. Structure of the Inverted Index","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Based on the concept of the inverted index, we can describe the structure simply as a Map. The keys of the Map are the tokenized words, called Terms, and the full set of Terms forms the first part of the inverted index: the Term Dictionary (or simply Dictionary).","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The other part of the inverted index is the Postings List, which corresponds to the collection of values in the Map.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Postings List consists of the data (Postings) for every Term. It holds more than document IDs and may include the following information:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Document ID (DocId): identifies every document that contains the word by its unique 
ID, used to look up the original data in the forward index.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Term frequency (TF): the number of times the Term appears in each document, used later for relevance scoring.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Position: the token positions (possibly several) of the Term in each document, used for phrase search (Phrase Query).","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Offset: the start and end offsets of the Term in each document, used for highlighting and similar features.","attrs":{}}]}]}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/19/19aa1afae71f834e38bf676a7924ff36.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule","attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3. 
Lucene's Inverted Index Implementation","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A full-text search engine has to store a large amount of text when the data volume is huge, so it faces the following problems:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Dictionary is large (for example, a single field in our search may have tens of millions of Terms)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Postings can take up a lot of storage (a single Term may match several million docs)","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Map-based implementation described above is therefore nearly infeasible.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At this scale, the implementation of the inverted index directly determines storage cost and search performance.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To address this, Lucene 
introduces a number of clever data structures and algorithms. Its inverted index implementation has the following properties:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Low on-disk storage cost (the index is roughly 20-30% of the size of the indexed text)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Fast reads and writes","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Following the structure of the inverted index, the sections below analyze Lucene's implementation in two parts: the Postings List and the Term Dictionary.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.1. 
Postings List Implementation","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A Postings List contains document IDs, term frequencies, positions, and other information. These pieces of data are relatively independent of one another, so Lucene splits the Postings List into three files:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":".doc file: records the docId information of the Postings and the frequency of the Term","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":".pay file: records Payload and offset information","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":".pos file: records position information","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Almost every query uses the .doc file to obtain document IDs, and for most queries the .doc file alone is enough. Only position-dependent queries, such as proximity queries, need the position data.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The three files are implemented in much the same way, so the .doc file is used here as the example.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The .doc file stores the document IDs and term frequencies for each Term. Each Term has a pair of TermFreqs and SkipData 
structures.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TermFreqs holds the docId and term frequency information, while SkipData holds skip list information used to jump quickly within TermFreqs.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e7/e7d8a710fa5d70eb08a311919a6cca72.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"3.1.1. TermFreqs","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TermFreqs stores document numbers and their term frequencies, two int values in one-to-one correspondence. To compress the data as much as possible, Lucene uses a hybrid layout made up of two structures: PackedBlock and VIntBlocks.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"PackedBlock","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It uses the PackedInts structure to pack an int[] into a compact Block. The number of bits needed by the largest value in the array is taken as the fixed width, and every element is then stored in exactly that many bits, which achieves the compression.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For example, if the largest value in an int array of 128 elements is 2, the fixed width is 2 bits, so the PackedInts data is only 2 * 128 / 8 = 32 bytes, which fits in 4 long 
values.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7e/7ee9913980300e9d5173f854f2325ed7.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"VIntBlock","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"VIntBlock compresses int values with VInt. In most languages an int takes 4 bytes whether the value is 1, 100, 1000, or 1,000,000. VInt represents an integer with a variable number of bytes: larger values use more bytes and smaller values use fewer. Each byte uses only bits 1 through 7 (7 bits in total) for data, and bit 8 is a flag that indicates whether the next byte must also be read.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For example:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The integer 130 takes 4 bytes as an int but only 2 bytes as a VInt. Bit 8 of the first byte is 1, indicating that the second byte must also be read.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a9/a9ea395994ac7f211c4b554539f7345d.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Based on the characteristics of these two Block types, for every 128 documents containing a Term, Lucene packs the corresponding DocId array and TermFreq array into PackedInts structures named PackedDocDeltaBlock and PackedFreqBlock, which together form one PackedBlock. The final group of fewer than 128 documents is instead stored using the VIntBlock 
format.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/01/018e897474c5790f13caed6c2ee9b88a.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"3.1.2. SkipData","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Search involves intersecting the DocId sets of different Terms, that is, checking whether a DocId from one Term exists in another Term's TermFreqs. The DocIds within each Block of TermFreqs are sorted, so a sequential scan would work, but it becomes very inefficient when a Term matches a very large number of docs. In addition, because Block sizes are not fixed, binary search cannot be used. To reduce the number of scans and comparisons, Lucene therefore uses the SkipData skip list structure to jump quickly.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Skip list","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A skip list adds multiple levels of index on top of a sorted linked list and uses them for fast lookup.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In essence, it is a sorted linked list that supports binary-search-like lookup.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ae/aeb8f5b5e562b534b2e2f1c42720e4c6.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"SkipData structure","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"
origin":null},"content":[{"type":"text","text":"Each time a Block is generated in TermFreqs, a node is created at level 0 of SkipData. Above level 0, one higher-level node is created for every N nodes on the level below.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Each node links to its lower-level node through the Child attribute. The DocSkip attribute of a node stores the largest DocId in the Block, and DocBlockFP, PosBlockFP, and PayBlockFP give the position of the Block's data in the .doc, .pos, and .pay files respectively.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/66/6635cd475f5ee49b24825be3b000e760.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"3.1.3. Final Posting Data","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Postings List is stored across several files, and in the end we have the following information for each Term:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SkipOffset: the start position of the current term's skip list data in the .doc file.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DocStartFP: the start position of the current term's document ID and term frequency data in the .doc file.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"PosStartFP: the start position of the current term's data in the .pos 
file.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"PayStartFP: the start position of the current term's data in the .pay file.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.2. Term Dictionary Implementation","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Term Dictionary stores all Term data and also links Terms to Postings: for each Term it stores pointers to the positions of its Postings in the postings files.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c0/c0626a99bb0d64f28d61058f0e8203d7.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"3.2.1. 
Data Storage","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Term Dictionary is stored in the .tim file. Internally it uses NodeBlocks to store Terms with prefix compression: Terms that share a prefix are packed into one NodeBlock, which stores the common prefix once. Each Term's suffix, together with the information linking that Term to its Postings, is then saved in the Block as an Entry.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7b/7bedadb0a3c3d1e03a6fff38974b7fbe.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure above shows that a Block can contain another Block. This handles the case where some Terms within a set that shares a prefix also share a longer prefix.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For example, the figure below shows a set of Terms with the common prefix a, some of which also share the prefix ab. Those Terms are stored as a nested Block.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/63/63ab5bc1d6dfd34c4d926aee1432bb26.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"3.2.2. 
Data Lookup","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Term Dictionary is stored in the .tim file as NodeBlocks. As the number of documents grows, so does the number of Terms in the Dictionary, and lookups inevitably become slower.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A good data structure is therefore needed to index the Dictionary. This is the Term Index (stored in the .tip file), which Lucene implements with an FST.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"FST","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"FST stands for Finite State Transducer.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It has the following characteristics:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Given an input, it returns an output, much like a HashMap","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It shares prefixes and suffixes to save space, so an FST uses far less memory than a HashMap","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Looking up a word takes O(len(str))","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Immutable once built","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure below shows the FST built from mon/1, thrus/4, and tues/2. Note that thrus and tues share the prefix t and the suffix s.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ab/ab0f6969a5064cfe52201101e49a2d58.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With the FST, the Term being searched is used as the input, and summing the values on the edges along its path gives the output. The steps below trace the lookup for the input thrus:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Initial state 0","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Input t: FST moves 0 -> 3, output=2","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Input h: FST moves 3 -> 4, output=2+2=4","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Input r: FST moves 4 -> 5, 
output=4+0=4","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Input u: FST moves 5 -> 7, output=4+0=4","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Input s: FST reaches the final state, output=4+0=4","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So what are the input and output of the FST built for the Term Dictionary? It is tempting to assume that the FST's inputs are all the Terms in the Dictionary, so that the FST leads directly to the Posting data for a specific Term.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In fact, the FST is built from the prefix of each NodeBlock in the Dictionary. The FST therefore only leads directly to the File Pointer of that NodeBlock in the .tim file, and the Entries in the NodeBlock must then be scanned to match the suffix.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It therefore serves the following purposes in Lucene:","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Fail fast: if a term's prefix is not found in the FST, the lookup can stop immediately without scanning the whole Dictionary, similar to a BloomFilter.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Locate the Block quickly: the FST directly gives the Block 
position in the file.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"An FST is also an Automaton (a finite-state machine), which is one way to implement regular expressions, so an FST can provide regular-expression capabilities. It can greatly improve the performance of approximate-match queries, including wildcard queries, SpanQuery, and PrefixQuery.","attrs":{}}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.3. Inverted Index Query Logic","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Having covered the structure of the Term Dictionary and the Postings List, we can now lay out the query steps for Lucene's inverted index:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Use the StartFP in the Term Index data (.tip file) to obtain the FST for the specified field","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Use the FST to find the Block in the Term Dictionary (.tim file) that may contain the specified Term","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Load that Block into memory and scan its Entries, using the suffix (Suffix) to determine whether the specified Term exists","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"If it exists, use the FPs for each file in the Entry's TermStat data to fetch the Posting 
數據","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"如果需要獲取 Term 對應的所有 DocId 則直接遍歷 TermFreqs，如果獲取指定 DocId 數據則通過 SkipData 快速跳轉","attrs":{}}]}]}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b8/b8e0cf56336056ff40f32a1fc14e57d1.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule","attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4. Lucene 數值類型處理","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上述 Terms Dictionary 與 Posting List 的實現都是處理字符串類型的 Term，而對於數值類型，如果採用上述方式實現會存在以下問題：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數值潛在的 Term 可能會非常多，比如是浮點數，導致查詢效率低","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"無法處理多維數據，比如經緯度","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所以 Lucene 爲了支持高效的數值類或者多維度查詢，引入了 
BKDTree。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.1. KDTree","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BKDTree 是基於 KDTree，KDTree 實現起來很像是一個二叉查找樹。主要的區別是，KDTree 在不同的層使用的是不同的維度值。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面是一個2維樹的樣例，其第一層以 x 爲切分維度，將 x>30的節點傳遞給右子樹，x<30的傳遞給左子樹，第二層再按 y 維度切分，不斷迭代到所有數據都被建立到 KD Tree 的節點上爲止。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/62/622038aafb06bf4c4fdcdb7c6e4c5909.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.2. 
BKDTree","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BKD 樹是 KD 樹和 B+ 樹的組合，擁有以下特性：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"內部 node 必須是一個完全二叉樹","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"葉子節點存儲點數據，降低層高度，減少磁盤 IO","attrs":{}}]}]}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/80/802d0f38233f1edda56855c5be205f99.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule","attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"5. 
總結","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文先介紹了倒排索引的概念和結構，然後對 Lucene 倒排索引的 Terms Dictionary 和 Posting List 的整體結構以及倒排索引的查詢邏輯，最後介紹了 Lucene 對數值類型所做的處理。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"倒排索引有效的解決了搜索中的很多問題，而 Lucene 對倒排索引的實現包含了很多巧妙的結構和設計，對數據存儲壓縮以及查詢很有借鑑意義，值得深入學習。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule","attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"參考資料","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Lucene 源碼分析：https://www.amazingkoala.com.cn/Lucene/ ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Lucene BKD 樹：https://www.shenyanchao.cn/blog/2018/12/04/lucene-bkd/ ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Lucene 查詢原理及解析：https://www.infoq.cn/article/ejEG02VRoeGVaLw4j_LL","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Lucene 倒排索引探祕：https://www.6aiq.com/article/1564413040138 Lucene","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"詞典 FST 
深入剖析：https://www.shenyanchao.cn/blog/2018/12/04/lucene-fst/","attrs":{}}]}],"attrs":{}}]}
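The alternating-dimension split described in section 4.1 can be sketched in a few lines of Python. This is only an illustrative in-memory KD tree, not Lucene's BKD implementation: the names `KDNode`, `build_kdtree`, and `range_search`, as well as the sample points, are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass
class KDNode:
    point: Point                      # the point stored at this node
    axis: int                         # dimension this level splits on
    left: Optional["KDNode"] = None   # points with a smaller-or-equal value on `axis`
    right: Optional["KDNode"] = None  # points with a larger-or-equal value on `axis`


def build_kdtree(points: Sequence[Point], depth: int = 0) -> Optional[KDNode]:
    """Build a KD tree by splitting on the median, cycling through the dimensions."""
    if not points:
        return None
    axis = depth % len(points[0])     # level 0 splits on x, level 1 on y, ...
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return KDNode(
        point=ordered[mid],
        axis=axis,
        left=build_kdtree(ordered[:mid], depth + 1),
        right=build_kdtree(ordered[mid + 1:], depth + 1),
    )


def range_search(node: Optional[KDNode], lo: Point, hi: Point,
                 out: Optional[List[Point]] = None) -> List[Point]:
    """Collect every point p with lo[i] <= p[i] <= hi[i] in all dimensions."""
    if out is None:
        out = []
    if node is None:
        return out
    if all(l <= v <= h for l, v, h in zip(lo, node.point, hi)):
        out.append(node.point)
    a = node.axis
    # Descend only into subtrees that can still overlap the query box.
    if lo[a] <= node.point[a]:
        range_search(node.left, lo, hi, out)
    if node.point[a] <= hi[a]:
        range_search(node.right, lo, hi, out)
    return out


# Hypothetical 2-D sample data (x, y).
points = [(30.0, 40.0), (5.0, 25.0), (10.0, 12.0), (70.0, 70.0),
          (50.0, 30.0), (35.0, 45.0), (60.0, 10.0)]
tree = build_kdtree(points)
hits = sorted(range_search(tree, lo=(20.0, 20.0), hi=(55.0, 50.0)))
print(hits)
```

Splitting on the median keeps the tree balanced, and the two pruning checks in `range_search` are what make a range query cheaper than scanning every point. A BKD tree additionally packs many points into each leaf block, as section 4.2 describes.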
