Stored Fields Storage Optimization in Lucene

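{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#1C8D58","name":"user"}}],"text":"1 Background"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Qunar's hotel search and suggest services are built on Lucene. In our setup, recall and ranking run as two separate applications, so when the number of recalled documents is large, responses are slow and Young GC pressure is heavy, which makes it hard to raise concurrency. Our analysis showed that the main cause is fetching the stored fields of a large number of documents: the heavy deserialization this involves slows responses down and also triggers frequent GC."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In typical Lucene usage, returning this many documents is not expected: usually only one page of results is returned after sorting, so the problem is not noticeable. Although the number of returned documents can be reduced by techniques such as coarse ranking, the problem remains. We therefore wanted a more thorough solution."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Why does fetching stored fields cause speed and GC problems?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When Lucene writes Stored Fields, it encodes each document's fields in a specific format and compresses them in blocks. To fetch stored fields, it therefore first decompresses the block containing the fields, then deserializes each field value into a Java object and places it in a StoredField object; all of a document's fields are then assembled into a Document object."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The main time costs here are decompression and deserialization. The GC impact comes from two sources: deserialization produces many small Java objects, and every field also creates a StoredField Java 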
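object."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The compression cost can be avoided by disabling compression through an option, but the other costs are hard to avoid with the existing implementation."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Are there other options? Doc Values offer another way to store fields, using columnar storage, but they are not intended to replace Stored Fields. Doc Values suit fetching a few fields from many documents, whereas Stored Fields suit fetching many fields from a few documents. Doc Values are typically used for sorting, scoring, and facet aggregation."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Although storing the data in Doc Values comes close to our optimization goal, it is a poor fit when there are many fields. String values also have to be stored in binary form, and repeated encoding and decoding is time-consuming. So we asked whether we could implement field storage ourselves: cache the fields in memory and look them up directly by document ID on each access. This would largely eliminate serialization overhead and create far fewer objects, and for our case, where the data volume is not especially large, it should perform better. Based on this idea we investigated the relevant mechanisms Lucene provides and confirmed that the approach is feasible. Below we describe those mechanisms and how we used them to implement what we wanted."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#1C8D58","name":"user"}}],"text":"2 Lucene's Custom Codec Mechanism"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Internally, Lucene reads and writes index files through the codec API. The codec is a very important abstraction in Lucene: it separates the storage of index data structures from the complex higher-level logic of indexing and searching. All index access goes through the codec API, which lets us experiment with different index storage formats without affecting the higher-level search and indexing logic."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A Codec defines 10 kinds of Format for different types of index data, and each Format defines a read API and a write API: the read API is used at search time and the write API at index time. Each segment 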
can have its own Codec."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The abstract class org.apache.lucene.codecs.Codec in Lucene defines the Codec interface:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/58\/fc\/581b2e0d49b20dda6c18cec522537ffc.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Every codec must have a unique name, such as \"Lucene80\". Codecs are registered through Java's SPI (Service Provider Interface) mechanism, so the codec instance can be looked up by name. During indexing, the codec name is also written into the index metadata (SegmentInfos) for each segment, so Lucene can find the right codec from the information in the index."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Lucene 8 has 10 kinds of Format. We will not list in detail which type of index data each one handles, but here are a few of them:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"PostingsFormat handles reading and writing the inverted index. An inverted index maps Term→{docId list}, where the docId list is called the posting list."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"StoredFieldsFormat handles reading and writing stored fields. The Stored Fields index can be regarded as a kind of forward index that can be accessed directly by docId. Stored Fields use row-oriented storage and are compressed to save space. At index time, a field declared with 
stored=true is written to the StoredFields index files."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DocValuesFormat handles reading and writing Doc Values. Doc Values are also a forward storage format, a columnar format introduced for scenarios such as sorting, scoring, and facet aggregation. The performance gain is significant when the same field must be read from a large number of documents."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What we want to optimize is access to StoredFields only, leaving everything else unchanged, so there is no need to customize every Format. Lucene provides the FilterCodec class, which lets us selectively override the implementation of a particular Format while delegating the rest to the default implementation:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/96\/ae\/96c6ba2715b5dbd846dca9486f2b1cae.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So we only need to override the StoredFieldsFormat implementation and use the default Lucene80 Codec implementation for everything else:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/96\/ae\/96c6ba2715b5dbd846dca9486f2b1cae.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Lucene ships a comprehensive set of unit tests that can be used to verify that a custom Codec 
works correctly; for details see: build-your-own-lucene-codec"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/dzone.com\/articles\/build-your-own-lucene-codec","title":null,"type":null},"content":[{"type":"text","text":"https:\/\/dzone.com\/articles\/build-your-own-lucene-codec"}],"marks":[{"type":"italic"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#1C8D58","name":"user"}}],"text":"3 Implementing a Custom StoredFieldsFormat"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We want to load all Stored Fields data into memory and minimize the overhead of serialization and object creation. To achieve this, we do not actually need to define our own Stored Fields storage format from scratch. We can reuse the existing index storage format and only rewrite the StoredFieldsReader used to read the index so that it caches the data in memory; the StoredFieldsWriter used at index time and the on-disk format can both stay unchanged, which is the simplest approach. Because our architecture is a primary-replica design built on Lucene NRT replication, the Primary (master) can build the index in the normal way, and the Replica (slave) can turn the cache on through a switch when it uses the index. The overall process looks roughly like this:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/bb\/9a\/bbbf59475b060f554475b65c5e0a149a.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When building the index, the Primary node configures an IndexWriterConfig and sets our custom codec through IndexWriterConfig.setCodec; the codec information is written into the index metadata. The Primary 
side builds the index in the normal way."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When a Replica node loads segment data, it invokes the custom codec, which in turn invokes our custom StoredFieldsReader. The custom StoredFieldsReader reads the data through the original Lucene80Codec reader and caches it in memory (as multiple columnar vectors); all subsequent accesses read directly from memory."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The relationship between the custom Codec, StoredFieldsFormat, and StoredFieldsReader is shown in the figure below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/9f\/26\/9f1daee26cc8667892cf76f1244e5226.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The StoredFieldsFormat interface is defined as follows:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/bd\/2d\/bde0354640d6f6cfcfa82fb8844e022d.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We only need to override the fieldsReader method and initialize our custom MemoryStoredFieldsReader in it. The arguments passed in carry the segment and field information, so we can read the stored field data through the delegate (the original StoredFieldsReader, accessed via its visitDocument method) and store it in the in-memory data structure (described in the next section). Because Lucene segment 
data is immutable, reading it in once is enough."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once the data is in the in-memory data structure, it can be accessed through the visitDocument method of StoredFieldsReader:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/e0\/d8\/e008bf950f6441f0589bfa3e5ce510d8.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The standard StoredFieldVisitor implementations (such as DocumentStoredFieldVisitor) have a problem: they create too many intermediate objects. For example, a StoredField object is created for every field, and a String field must first be converted to a byte[] and then to a String, which produces many unnecessary intermediate objects. To make full use of the cache and reduce conversion costs, in addition to supporting the standard interface we implemented a custom StoredFieldVisitor that wraps a document-access interface directly on top of the in-memory data structure and exposes it through the StoredFieldVisitor."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A pseudocode example follows:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/e9\/32\/e9ac13246f4d1663c3e3d7f45dbb2f32.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The visitDocument method is ultimately called by IndexSearcher.doc(int docId, StoredFieldVisitor storedFieldVisitor): a search returns docIds, and stored fields are then fetched through the searcher's doc 
method."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#1C8D58","name":"user"}}],"text":"4 In-Memory Storage Structure"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We cache the data in memory for two reasons: to address the speed cost of serialization, and to reduce the number of intermediate objects. At the same time, we do not want memory usage to balloon, because then we could not hold all the data on a single machine. Choosing the right storage structure is therefore very important."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Broadly speaking there are two storage approaches: row-oriented and columnar. In Lucene, the default StoredFields storage is row-oriented and DocValues are columnar. Suppose we use row-oriented storage. If we serialize the documents before storing them, there is no advantage over the original storage in space, time, or intermediate objects. If we store them as Java Beans, access is fastest and few intermediate objects are produced, but memory consumption is very high, mainly because Java is not very economical in how it stores objects:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because many fields allow multiple values, we need arrays to store them. On a 64-bit JVM an array's object header alone takes 24 bytes (16 bytes with compressed pointers enabled; if the heap exceeds 32 GB, pointers can still be compressed, but with extra alignment overhead);"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Empty fields also consume space: for example, a null 
reference still occupies 64 bits, and an empty primitive field also takes up space;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Object alignment overhead;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The overhead of references between composite objects."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So the Java Bean approach is rather expensive in memory and does not really meet our requirements."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Columnar storage, by contrast, places the values of the same field in contiguous storage, which reduces array object header overhead. Access only adds the cost of some offset calculations, so it is a better fit in terms of both space and time."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We illustrate how columnar storage is implemented with an example. Suppose there are four documents with an alias field, hotelAliases:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/f5\/e9\/f5e7e21afe38ffcc626b73bfc872bce9.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Here the document with ID 5 has two aliases, and the documents with IDs 2 and 6 have none. With columnar storage this can be represented by two arrays: a value array that stores the aliases, and an offset 
array that marks where each document's values start and end:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/9e\/c0\/9e4eaa15e0365596d07bde8247077ec0.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The offset array is indexed by document ID. offset[docId+1] - offset[docId] gives the number of values; if it is non-zero the document has values, and they start at position offset[docId] in the value array."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If the value array holds String objects, we can remove duplicates by interning the Strings. Interning itself relies on a Map-style index for deduplication, and a global index would have to be kept around forever and would use a lot of memory, so we intern only within a single segment. Because segment data is immutable, we can release the Map used for interning once the segment has been loaded. Our tests showed that this saves space. We suspect the reason is that duplicated values in our data are concentrated in low-cardinality fields, while high-cardinality values rarely repeat, so keeping the deduplication index around would cost more space than it saves."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With columnar storage, memory consumption drops to 65% of the Java Bean approach, at a cost of somewhere between 10% and 20% in access speed."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With the encoding above, empty values take no space in the value array but still take space in the offset array. Although each document seems to use only one int, when there are many different types of documents, some types may not have a given field at all, which produces a large number of empty values and adds up to considerable waste. We therefore later optimized this basic columnar storage further: using the rank\/select operations from succinct data structures, we use two bit arrays in place of the int 
array, which reduced memory consumption by nearly another 20% (12 GB -> 10 GB). We will cover this in a future article."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Memory usage differs across data types, so besides a generic Object implementation we also provide separate implementations for primitive types."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#1C8D58","name":"user"}}],"text":"5 Closing Thoughts"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Lucene Stored Fields storage optimization described in this article targets our particular scenario: the data volume is not especially large, and each query returns a large number of documents. It reduces the number of intermediate objects generated. According to our production monitoring, Young GC frequency dropped from 2-3 times per second to once every 9-10 seconds, and response time fell by more than 80%. On the memory side, a compact in-memory storage format keeps space consumption under control, allowing us to load the full set of stored field data into memory."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Going forward, we plan further optimizations on top of this, such as:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Trying off-heap storage to reduce heap usage and make better use of compressed pointers (although this adds string encoding overhead, which needs to be tested);"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Implementing a per-field stored fields 
cache, so that only the necessary fields are cached in memory, reducing total memory usage;"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule"},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Header image: Unsplash"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Author: 王名悠"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original article: "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/zsAjhhoOy__UlSf6i-ZMSA","title":null,"type":null},"content":[{"type":"text","text":"https:\/\/mp.weixin.qq.com\/s\/zsAjhhoOy__UlSf6i-ZMSA"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original title: Stored Fields Storage Optimization in Lucene"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Source: Qunar技術沙龍, WeChat official account [ID: QunarTL]"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Reprinting: Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please credit the source."}]}]}
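The offset/value layout and the segment-local interning described in section 4 can be illustrated with a small, self-contained Java sketch. This is not the authors' implementation (their code appears only as images in the original): the class name `StringColumn`, its methods, and the sample aliases are all hypothetical, and the sketch assumes document IDs run densely from 0 within a segment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical columnar store for one multi-valued String field of one immutable segment. */
final class StringColumn {
    private final int[] offset;   // length = docCount + 1; offset[docId] is the first value slot of docId
    private final String[] value; // all values of the field, concatenated in docId order

    private StringColumn(int[] offset, String[] value) {
        this.offset = offset;
        this.value = value;
    }

    /** Builds the column; the index in the outer list is the docId (an assumption of this sketch). */
    static StringColumn build(List<List<String>> docs) {
        int[] offset = new int[docs.size() + 1];
        List<String> values = new ArrayList<>();
        // Segment-local intern table: used only during the build, then left for the GC.
        Map<String, String> intern = new HashMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            offset[docId] = values.size();
            for (String v : docs.get(docId)) {
                String canonical = intern.putIfAbsent(v, v);
                values.add(canonical == null ? v : canonical);
            }
        }
        offset[docs.size()] = values.size();
        return new StringColumn(offset, values.toArray(new String[0]));
    }

    /** Number of values for the document: offset[docId + 1] - offset[docId]. */
    int valueCount(int docId) {
        return offset[docId + 1] - offset[docId];
    }

    /** The i-th value of the document, read straight from the shared value array. */
    String valueAt(int docId, int i) {
        if (i < 0 || i >= valueCount(docId)) {
            throw new IndexOutOfBoundsException("doc " + docId + " has no value at index " + i);
        }
        return value[offset[docId] + i];
    }

    public static void main(String[] args) {
        // Hypothetical sample data: docs 1 and 3 have no alias, doc 2 has two.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("Grand Hotel"),
                Collections.<String>emptyList(),
                Arrays.asList("Sea View", "Grand Hotel"),
                Collections.<String>emptyList());
        StringColumn column = build(docs);
        for (int docId = 0; docId < docs.size(); docId++) {
            System.out.println("doc " + docId + " has " + column.valueCount(docId) + " alias(es)");
        }
    }
}
```

In this sketch a document without values costs one `int` in the offset array and nothing in the value array, which is the overhead the article later reduces with rank/select bit arrays. The intern map goes out of scope when `build` returns, mirroring the idea of releasing the deduplication index once a segment is loaded.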