How to Implement Bitmap on a Disk-Based KV Store

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Most developers are probably familiar with Bitmap. Besides serving as the underlying storage of a Bloom filter, it is also offered as an index type by many databases. For in-memory storage, a Bitmap is just a sparse array of a special element type (bit), and operating on memory does not cause read/write amplification (meaning that the amount of data physically read or written is far larger than the logical amount). Redis, for example, supports bit operations directly on its string type. For Kvrocks, a store built on a disk-based KV engine, this is a much bigger challenge. What this article really discusses is “","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"when implementing Bitmap on a disk-based KV store","attrs":{}},{"type":"text","text":", ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"how to reduce disk read/write amplification”","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Why Read/Write Amplification Occurs","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Read/write amplification mainly comes from two sources:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The minimum read/write unit at the hardware level","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The way data is organized at the software level","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At the hardware level, amplification generally comes from the minimum read/write unit. Take an SSD as an example: the smallest unit of a read or write is a page (typically 4 KiB, 8 KiB, or 16 KiB). Even if the application writes a single byte, a whole page is actually written to disk. This is what we call write amplification, and the same applies to reads. In addition, an SSD does not modify data in place within a page. It uses Read-Modify-Write instead: the original data is read out, modified, and written to a new page, and the old page is reclaimed by GC. So even repeatedly modifying a small piece of data in the same page causes write amplification because of how the hardware works, roughly as follows:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c7/c72299b261082c789955348cb3e0319d.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As you can see, large numbers of small random I/Os are very unfriendly to SSDs. Besides a significant performance impact, frequent erase/write cycles also seriously shorten the SSD's lifespan (random reads and writes are equally unfriendly to HDDs, which have to seek repeatedly). LSM-Tree mitigates this kind of problem by turning random writes into sequential batch writes.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Software-level read/write amplification mainly comes from how data is organized, and different storage layouts differ greatly in how much amplification they cause. Take RocksDB as an example: RocksDB was built by Facebook on top of Google's LevelDB and adds many practical features such as multi-threading, backup, and compaction. RocksDB organizes its data as an LSM-Tree, which addresses the disk write problem but introduces some space amplification through the way it stores data. Let's take a brief look at how an LSM-Tree organizes data:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/9c/9c671d36eac803e2a10a51e74a7d9641.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Every write to an LSM-Tree produces a new record. In the figure above, x is written 4 times, with the values 0, 1, 2, and 3. Looking at x alone, this is equivalent to 4x space amplification; the duplicate records are reclaimed during compaction. Likewise, a delete is implemented by inserting a record with an empty value. The capacity of each LSM-Tree level grows level by level. When a level reaches its maximum size, compaction is triggered to merge it into the next level, and so on. Assuming the maximum size of Level 0 is M bytes, each level is 10 times larger than the previous one, and there are at most 7 levels, the theoretical space amplification is about 1.111111. The formula is:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"go"},"content":[{"type":"text","text":"space amplification = (1 + 10 + 100 + 1000 + 10000 + 100000 + 1000000) * M / (1000000 * M)\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In real-world scenarios, however, the last level usually does not reach its maximum size, so the actual space amplification is considerably larger than this theoretical value. This is also mentioned in the RocksDB documentation; see: ","attrs":{}},{"type":"link","attrs":{"href":"https://rocksdb.org/blog/2015/07/23/dynamic-level.html","title":"","type":null},"content":[{"type":"text","text":"https://rocksdb.org/blog/2015/07/23/dynamic-level.html","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In addition, because RocksDB reads and writes in units of KV pairs, the larger the value, the larger the read/write amplification can be. For example, suppose a value is a 10 MiB JSON document. To modify a single field in it, the whole JSON has to be read out, modified, and written back, which causes huge read/write amplification. The paper “WiscKey: Separating Keys from Values in SSD-conscious Storage” optimizes LSM-Tree for large KVs by separating keys from values, which reduces the write amplification caused by compaction. Titan in TiKV is based on the WiscKey paper and optimizes RocksDB's write amplification in large-KV scenarios. RocksDB has also implemented this feature in its community edition, although it is still experimental.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Implementing Bitmap on a Disk-Based KV Store","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kvrocks is a disk-based store built on RocksDB that is compatible with the Redis protocol. It needs to support Bitmap, so Bitmap has to be implemented on top of a disk KV. Most use cases treat a Bitmap as a sparse array, which means the first write may be at offset 1 and the next at offset 1000000000 or even larger, so a Bitmap implementation faces the read/write and space amplification problems described above.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The simplest implementation still stores the whole Bitmap as a single value: on each read or write, the value is loaded into memory and then written back. This is simple, but the value can easily become huge; a single value of several GiB is possible. Besides the poor effective space utilization, it may make the whole service unavailable (because the entire value has to be read and written). Pika's Bitmap is implemented this way, but it limits the maximum value size to 128 KiB. Limiting the value size avoids the extreme case above, but it greatly restricts the scenarios in which Bitmap can be used, or even makes it unusable.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Since the core problem is that a single KV is too large, the most direct approach is to split the Bitmap into multiple KVs and keep each KV within a reasonable size, so that the amplification caused by reads and writes stays relatively controllable. The current Kvrocks implementation splits at 1 KiB per KV, so each value can hold 8192 bits. The algorithm is illustrated below:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c3/c3ef79a011e6b8dd9de364385f0c069f.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Taking ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"setbit foo 8192002 1","attrs":{}}],"attrs":{}},{"type":"text","text":" as an example, the steps are:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Compute the key for offset ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"8192002","attrs":{}}],"attrs":{}},{"type":"text","text":". Because Kvrocks stores 1 KiB per value, the key index is 8192002 / (1024 * 8) = 1000 (integer division), so this offset should be written into the value of the key \"foo\" + 1000.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Next, fetch the value of this key from RocksDB.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Compute the offset within the segment: 8192002 % 8192 equals 2, so set the bit at offset 2 in the value to 1.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Finally, write the value back to RocksDB.","attrs":{}}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A key property of this implementation is that a Bitmap's KVs are written to RocksDB only when they are actually written to. Suppose we have executed setbit only twice, ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"setbit foo 1 1","attrs":{}}],"attrs":{}},{"type":"text","text":" and ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"setbit foo 8192002 1","attrs":{}}],"attrs":{}},{"type":"text","text":". RocksDB will then contain only the two keys foo:0 and foo:1000, and the KVs actually written total at most 2 KiB. This fits the sparse-array usage of Bitmap very well: sparse writes do not cause space amplification.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"This idea is similar to how Linux maps virtual memory to physical memory. For example, when we malloc 1 GiB of memory, the operating system only allocates a range of virtual address space; physical memory is allocated only when an actual write triggers a page fault (the usual page size today is 4 KiB). In other words, if a page has never been written, merely reading it does not allocate physical memory.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GetBit works similarly: compute the key the offset belongs to, then read that key from RocksDB. If it does not exist, the segment has never been written and 0 is returned directly. If it exists, read the value and return the corresponding bit. In addition, the actual stored size of a single KV is determined by the largest offset written to it so far; a write does not always allocate the full 1 KiB. This also reduces read/write amplification within a single KV to some extent. For the implementation, see: ","attrs":{}},{"type":"link","attrs":{"href":"https://github.com/KvrocksLabs/kvrocks/blob/unstable/src/redis_bitmap.cc","title":"","type":null},"content":[{"type":"text","text":"https://github.com/KvrocksLabs/kvrocks/blob/unstable/src/redis_bitmap.cc","attrs":{}}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Summary","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As we can see, implementing the same feature on memory and on disk involves completely different problems and challenges, beyond the speed difference between the storage media themselves. A disk-based service has to keep optimizing random reads/writes and space amplification, which requires understanding the hardware as well as the software itself.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In addition, since Kvrocks is a storage service on top of a disk KV that is compatible with the Redis protocol, the question we are asked most often is how it differs from other services with similar functionality. In short, the biggest difference lies in the design decisions made by each project's maintainers; different designs make services that look the same behave completely differently. So the best approach is to read the code to understand a project's design and its problems.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"References","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[1] ","attrs":{}},{"type":"link","attrs":{"href":"https://rocksdb.org/blog/2015/07/23/dynamic-level.html","title":"","type":null},"content":[{"type":"text","text":"https://rocksdb.org/blog/2015/07/23/dynamic-level.html","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[2] ","attrs":{}},{"type":"link","attrs":{"href":"https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf","title":"","type":null},"content":[{"type":"text","text":"https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[3] ","attrs":{}},{"type":"link","attrs":{"href":"https://github.com/KvrocksLabs/kvrocks","title":"","type":null},"content":[{"type":"text","text":"https://github.com/KvrocksLabs/kvrocks","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[4] ","attrs":{}},{"type":"link","attrs":{"href":"https://github.com/facebook/rocksdb","title":"","type":null},"content":[{"type":"text","text":"https://github.com/facebook/rocksdb","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[5] ","attrs":{}},{"type":"link","attrs":{"href":"https://github.com/tikv/titan","title":"","type":null},"content":[{"type":"text","text":"https://github.com/tikv/titan","attrs":{}}]}]}]}
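The segmented SetBit/GetBit scheme described in this article can be sketched in a few lines. This is only an illustration under stated assumptions, not the Kvrocks implementation: a plain Python dict stands in for RocksDB, the class name `SegmentedBitmap` and its methods are hypothetical, the key format `name:index` follows the `foo:1000` notation used in the article, and the bit order inside a byte is an arbitrary choice made here.

```python
SEGMENT_BYTES = 1024              # 1 KiB per value, as described in the article
SEGMENT_BITS = SEGMENT_BYTES * 8  # 8192 bits per segment


class SegmentedBitmap:
    """Hypothetical bitmap split into fixed-size segments, one KV pair each."""

    def __init__(self):
        # Assumption: a dict stands in for the disk KV store (RocksDB).
        self.store = {}

    @staticmethod
    def _locate(offset):
        """Return (segment index, bit offset inside that segment)."""
        return offset // SEGMENT_BITS, offset % SEGMENT_BITS

    def setbit(self, name, offset, bit):
        """Set one bit and return its previous value."""
        index, bit_offset = self._locate(offset)
        key = f"{name}:{index}"
        # Read only the one segment that contains the offset.
        segment = bytearray(self.store.get(key, b""))
        byte_index = bit_offset // 8
        # Grow the value only up to the highest byte written so far,
        # instead of always allocating the full 1 KiB.
        if len(segment) <= byte_index:
            segment.extend(bytes(byte_index + 1 - len(segment)))
        mask = 1 << (bit_offset % 8)  # bit order in a byte: arbitrary choice
        old = 1 if segment[byte_index] & mask else 0
        if bit:
            segment[byte_index] |= mask
        else:
            segment[byte_index] &= ~mask & 0xFF
        # Write back only this segment.
        self.store[key] = bytes(segment)
        return old

    def getbit(self, name, offset):
        """Return the bit at offset; segments never written read as 0."""
        index, bit_offset = self._locate(offset)
        segment = self.store.get(f"{name}:{index}")
        byte_index = bit_offset // 8
        if segment is None or byte_index >= len(segment):
            return 0
        return (segment[byte_index] >> (bit_offset % 8)) & 1
```

With this sketch, calling `setbit("foo", 1, 1)` and `setbit("foo", 8192002, 1)` leaves only the keys `foo:0` and `foo:1000` in the store, and reading any offset in an untouched segment returns 0 without a value ever being created for it.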