Practical Insights | Ctrip's Persistent KV Storage in Practice

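{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1. Background"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Over the past few years, Ctrip's technical operations department has done a great deal of work on Redis governance, resolved operational problems, and accumulated rich experience on the private cloud. Later, by introducing "},{"type":"link","attrs":{"href":"https:\/\/github.com\/KvrocksLabs\/kvrocks","title":null,"type":null},"content":[{"type":"text","text":"Kvrocks"}]},{"type":"text","text":", we reduced costs and improved efficiency on the public cloud, which in turn supported the company's internationalization strategy."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At the same time, domestic business units have a real need to lower infrastructure costs. Some of them want a non-traditional, non-relational database for workloads that need high performance and massive capacity, with room for special customization to meet the challenges of the post-pandemic era. These changes led us to ask whether we could borrow the public-cloud approach and build a persistent database on the private cloud that meets the business need for high performance, low cost, massive capacity, and persistence."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2. The problems we faced"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The earlier public-cloud solution had a clear goal. Because memory is expensive on the public cloud, we stored Redis data on SSD to cut costs, chose Kvrocks, and implemented support for the Redis replication protocol in-house, which lowered public-cloud costs by 60% (Figure 1)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ec\/ec0ce7752dfc5a209d7d584e5b875712.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 1"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the business grew and the Redis clusters kept expanding, requirements became more diverse, and the private cloud also needed a persistent KV storage system. The requirements included:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) For KV storage and read\/write scenarios, the storage ceiling Redis can offer is too low, so a large-capacity KV storage system is needed;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) Data persistence, rather than losing data on restart as happens with Redis when it runs purely in memory;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) Lower cost of using Redis, since the Redis clusters on the private cloud are very large;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4) Semantics similar to SELECT FOR UPDATE for decrementing fields such as inventory, without relying on external components such as distributed locks;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5) Stronger data consistency than Redis, for example support for synchronous replication."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We carefully analyzed the business requirements and the options available in the industry, hoping to find a persistent KV database that is Redis-compatible, satisfies the need for large capacity and lower cost, and yet is not limited to Redis, so that it can offer more diverse capabilities for business needs."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3. Research and selection"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We surveyed most of the NoSQL\/NewSQL databases in the industry, mainly considering the following aspects."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Is it mainstream in the industry"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Mainstream has two meanings. First, popularity: the number of GitHub stars, whether it is a project of a top open-source foundation, or whether a major company backs it. Second, whether its ideas are mainstream: MySQL, the most widely used relational database today, and the NewSQL database TiDB have made concepts such as semi-synchronous replication, GTID, Raft, and separation of compute and storage widely understood."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Does it have mature middleware"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Mature middleware is a very important capability. If we pick an unsuitable database, every middleware concern, including routing, instrumentation, monitoring, degradation, circuit breaking, and DR switchover, requires a large investment of people and resources. Stable middleware also needs long-term polishing before business teams trust it. Reusing most of the existing middleware capabilities saves a great deal of effort."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Is the cluster operations and governance tooling complete"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When choosing a KV database, governance concerns beyond middleware, such as cluster scale-out, scale-in, instance migration, and resource utilization, must also be considered. For any database, it is best if post-deployment operations and governance can reuse existing capabilities. If they cannot, we need to ask: how long does it take to scale out 10x, and can the cluster scale in? Is migration easy and transparent to the business? After large-scale deployment, can resource utilization be improved?"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Does performance meet requirements, and does it support 10x scaling"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Even if all of the above are satisfied, a candidate is vetoed if its performance falls short or it cannot scale 10x. Performance is an important criterion, and we wanted a KV database with excellent performance."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Can it be extended and evolved independently"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Companies of Ctrip's size or similar mostly run persistent KV databases that are either self-developed or built on open source, such as Meituan's Cellar, Ele.me's Tidis, and 360's pika. We likewise needed a database that is easy to build on or extend, so that we can develop custom features for the business."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Space does not allow us to cover the research process in detail. In the end we again chose Kvrocks as the target for governance and evolution. The other NoSQL\/NewSQL options each had shortcomings, while Kvrocks benefits from the maturity of our Redis operations and governance: it can reuse most of the existing Redis middleware and operations capabilities, and at Ctrip it is deployed and used in almost the same way as Redis. For now it is the most suitable persistent KV database for us."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4. From Kvrocks to TRocks"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After continuous development, iteration, and use, we named the new system TRocks (Trip + Kvrocks), Ctrip's own persistent KV database. Compared with the original Kvrocks, besides speaking the replication protocol with Redis so that each can act as master or slave of the other, the main improvements are in the following areas."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.1 Functional enhancements"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Exclusive locks"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Some business teams have constraints on process coordination and execution order and often need distributed locks, for example in inventory-deduction logic. The common approach is to introduce a third-party distributed system and store a lock token there for shared access."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Although common, this approach has problems. First, it introduces an additional system whose failure cases must be handled separately, which increases the complexity of the whole application. Second, the lock token often carries meaning or can be associated with current business data, so it is effectively an extra copy of business data, which is a security risk. Third, multiple applications may share one external distributed system for locking, which adds load to that system; once it fails, every dependent application is affected, so isolation is poor."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To solve this, TRocks implements key-granularity locking internally. When TRocks is deployed in a distributed fashion and used as the application's business database, it provides distributed locking by itself (Figure 2). Lock handling lives alongside the business data, no extra system is needed, complexity drops, and business teams can focus on business code."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/65\/6575618416e5b147c5f21f7312cfed80.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 2"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To guarantee request uniqueness and to support idempotent retries in the way Raft does, each request must carry a unique clientid and an auto-incrementing seq. This metadata and the data itself are written to RocksDB as a single write batch and later replicated to the slaves, which keeps the request atomic along the whole path."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Compound commands"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because of limitations in the Redis command set, some business teams reported that a single feature, such as giving a hash key a timeout, takes two operations: one to set the value and one to set the expiry. Even though the middleware wraps this logic behind one API, it still executes two commands internally, so atomicity is not guaranteed. For such cases TRocks adds compound commands that achieve the same effect atomically. They are transparent to users, who simply call the corresponding client API."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.2 Availability enhancements"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Tunable consistency"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kvrocks's master-slave replication is similar to Redis's: both are asynchronous. With asynchronous replication, if the network drops or the master crashes before data has been replicated, data is lost. To avoid this, TRocks adds semi-synchronous replication similar to MySQL's to improve data consistency. The feature is enabled by turning on semi-sync and specifying the minimum number of slaves that must acknowledge, which improves disaster tolerance."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For example, take a cluster with 1 master and 4 slaves, configured to wait for responses from any 2 slaves."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As Figure 3 shows, once 2 slaves have responded, the semi-sync is considered complete, even though the other two slaves may not have finished replicating."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/8c\/8c073dc9624ad0901ed99e092bc076e3.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 3"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This can still be a problem in multi-data-center deployments. Because of distance, data transfer within one data center is faster, so replication from the master to slaves in the same data center is usually quicker (Figure 4)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/df\/df4712be37052fd36ce6ef3cbf3253b8.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 4"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As a result, slaves in the same data center easily get ahead of slaves in remote data centers. If a data-center-level failure takes down every service in the master's data center, data may be lost."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We therefore added an IDC mode on top of semi-synchronous replication: even when the initial condition is met, a response from at least one slave in the relevant IDC is required before replication completes. There are two IDC modes, local replication and remote replication."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Take remote mode as an example. If the number of responding slaves meets the threshold and at least one of them is in a data center different from the master's, semi-synchronous replication completes. If none of the responses so far comes from a slave outside the master's data center, the master keeps waiting until it receives a response from a remote slave (Figure 5)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/7e\/7ed20d9cb10d5e87db5f0772b95378c7.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 5"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Remote mode makes data safer but also affects overall system performance; under normal conditions the difference depends on the network latency between data centers. Depending on their performance and data-availability requirements, users can choose fully asynchronous replication (semi-sync disabled), semi-sync, semi-sync (local), or semi-sync (remote) replication."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Suppressing full synchronization"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As noted above, asynchronous replication can lose data under failure conditions, and when the operations system also adjusts master-slave relationships, data conflicts arise. TRocks is still iterating quickly, and we wanted every version upgrade to be transparent to users, but at first that was not the case."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Suppose there is a master A and a slave B. Normally the data on A and B is consistent (the green part). When A crashes, B may not yet have replicated A's latest data, and B's data stops growing. Sentinel then finds the master unreachable, promotes B to master, and B starts handling writes (the blue part). Some time later A recovers and rejoins the cluster as a slave of the new master B and tries to replicate from B, at which point there may be a conflict zone (Figure 6)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ac\/ac3e86a0a8b041793e1c44fc5521dac5.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 6"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Under Kvrocks's original replication logic, A decides its own data is bad, discards all of it, and performs a full sync from B from scratch. The behavior itself is correct. In production, however, a full sync takes a long time when the data set is large, and disk bandwidth is at least two orders of magnitude lower than memory bandwidth. Because all our instances are deployed in containers, the consequences can be disastrous: while syncing, A generates heavy IO that may affect every instance on the hosts where A and B run."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Where consistency requirements are not that strict, redoing a full sync because a few records might differ is very expensive. So under non-strong consistency we want the system to tolerate a very small amount of divergence and avoid full syncs as much as possible, to make good use of resources."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Our approach: when an inconsistency is detected, master and slave negotiate, compute the range of the conflict zone, and start syncing from the first record after the conflict zone. Why not start immediately after the conflict zone? TRocks\/Kvrocks data is append-only: every insert, delete, or update appends a record to the log file with a starting position (Sequence), and records for different Redis types have different lengths (Count). For example, a SET command advances the Sequence by 1, while an HSET command advances it by 2. A record spans from Sequence to Sequence+Count. During resync, if the end of the conflict zone falls in the middle of a valid record, the complete record cannot be fetched, so syncing must start from the first record after the conflict zone."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/af\/afc44dc887b17fa562bc7df3f7f8159f.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":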
true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 7"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The region between the conflict zone and the sync start point is the padding zone (Figure 7), which we fill by inserting blank records. For A and B, the inconsistent region is therefore the conflict zone plus the padding zone. For the conflicting part we record the differences on both sides, and when a real difference occurs we follow the way git resolves conflicts and let the user choose which data to keep."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After this feature went live, version upgrades became much easier. In most cases an upgrade is just pulling an instance out, restarting it, and putting it back; the instance is up within seconds, and the upgrade is essentially transparent to the business."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"While solving this, we also noticed that full syncs sometimes happened even when master and slave data were aligned. The cause turned out to be the pub\/sub command. Sentinel uses it to subscribe to service messages, but in Kvrocks pub\/sub is a write operation, so it produces a steady stream of writes that advance the RocksDB Sequence. If a slave crashes and recovers, and Sentinel writes an unimportant pub message to it before it has resynced with the master, the Sequence advances and triggers an unnecessary full sync. Since persisting these messages is not actually required, we changed how Kvrocks handles Sentinel's pub\/sub messages: they are no longer written, the command works purely in memory, and the RocksDB Sequence does not advance, which eliminates full syncs from this cause."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.3 Operations and governance enhancements"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Horizontal scaling"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/4c\/4cc3d553bd9a54f81a2ce384e09422dd.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 8"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In our earlier article, "},{"type":"link","attrs":{"href":"http:\/\/mp.weixin.qq.com\/s?__biz=MjM5MDI3MjA5MQ==&mid=2697271083&idx=1&sn=43eb37c7fc068aeaf5832e3084bed1c8&chksm=8376e81fb40161096b13060a937e2288bc5016c5633ad86e2504302b0b111014371a97a3bd67&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"The Evolution of Redis Governance"}]},{"type":"text","text":", we introduced a new scaling scheme that solves version upgrades and scale-out\/scale-in for Redis clusters (Figure 8). Following the same idea, we further adapted BinlogServer to scale TRocks clusters horizontally. The scheme handles not only scaling but also Redis-to-Redis migration, TRocks-to-TRocks migration, and migration between Redis and TRocks in either direction, and it helps users move smoothly from Redis to TRocks."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Redis scaling rarely has to worry about memory bandwidth, but disk bandwidth is narrow and migration traffic is heavy. All data must eventually be flushed to disk on the new cluster, so disk reads and writes on the target cluster are very high during migration. Because everything is deployed in containers, heavy disk IO can also affect unrelated applications on the same host. We therefore tuned TRocks's write rate-limit settings so that bulk writes do not hurt disk performance, and added rate limiting to BinlogServer to smooth the data transfer rate."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Multi-data-center Sentinel deployment"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For TRocks clusters to survive data-center failures, Sentinel must be deployed across multiple data centers; we currently deploy across three, as shown in Figure 9:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/2b\/2bc8db4db623f94d7eb6c14ca659ee3a.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 9"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"During deployment we hit a problem: the Sentinels often failed to elect a leader and had to wait for the next election cycle (6 minutes) to try again, so the TRocks master stayed undetermined for a long time. The problem has little to do with TRocks itself, but it caused considerable trouble for our failure handling in practice."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Leader election failed because several Sentinels started an election at the same time, each hoping to become leader, so each voted for itself and no consensus was reached. The source code shows that upstream already applies a random delay (50~100ms) before starting an election, but in practice we found this delay made election failures more likely, probably because the random window is too short. We changed the random delay to 100~200ms and made a Sentinel start an election immediately after it detects that the master is down, to avoid failed elections as far as possible."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"5. Some data"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5.1 Performance data"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Since going live on the internal network, TRocks has been widely adopted across business lines. Excluding the public cloud, the private cloud already runs nearly 2K instances holding 10T+ of data. Figures 10 and 11 compare the performance of writing the same data to TRocks and to Redis. Average response time and the 99.9th percentile are at the same level, and thanks to the custom commands the same functionality is implemented more concisely than with Redis."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c3\/c397cf190eee9d4d653f4b514c615837.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 10"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c9\/c97a3ec9021b96946fe204f1c36b0b2e.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 11"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In load tests run with business teams, one 40C host with 2 SATA SSDs in RAID0 delivers roughly 80K-100K read\/write QPS while keeping good response times (99.9% under 10ms), with values smaller than 1k. Switching to NVMe SSD raises this QPS by 3-5x."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5.2 Cost data"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Assume TRocks is fully containerized, a 40C host can run 20 instances, and each instance is 40G. Because TRocks compresses data substantially compared with Redis (a compression ratio of about 3-7x), if data imported from Redis runs stably, TRocks saves about 90% of the cost relative to Redis."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With savings this large, can every Redis be replaced with TRocks, and should we replace all Redis instances on the private cloud with TRocks?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The answer to both is no, and that was never our intent in promoting TRocks, for two reasons:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) As mentioned above, we want TRocks to have most of Redis's capabilities without being limited to Redis. We want it to be a general-purpose KV database that offers more diverse capabilities for business needs."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) Disk bandwidth is two orders of magnitude below memory bandwidth, and this inherent limitation means some Redis scenarios cannot be served. For example, response times for large keys (over 100K) still lag behind Redis. Instances with little data but high per-instance QPS are also poor candidates for TRocks, because at operational scale we must consider whether the whole host and every instance on it can run stably. In general, a single instance with more than 10G of data and fewer than 5K QPS is a good fit. NVMe SSD can of course greatly shorten large-key response times and raise the per-instance QPS ceiling."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"6. Future plans"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"6.1 More compound commands"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Our research found that business code often queries TRocks several times to fetch one piece of data, much like second-degree-connection lookups, and the repeated network IO adds latency. Designing general-purpose commands that meet these needs and reduce network IO has become important. Some users have also asked whether subkeys inside a TRocks hash can expire. The hash type still follows Redis's rules, so today the whole hash key expires together and individual fields cannot. The requirement has value, and we plan to support it with a special hash structure."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"6.2 Introducing checkpoint"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"During full replication, Kvrocks 1.X has the master create a hard backup, which copies files and generates heavy IO. The official 2.0 release solves this with RocksDB checkpoints. We have merged 2.0 for testing and plan to upgrade production when the time is right."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"6.3 Using NVMe SSD"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Many of Ctrip's TRocks instances still run on SATA SSDs. In our tests, two SATA SSDs in RAID0 provide about 800MB\/s of bandwidth, so the disks saturate easily. NVMe SSD bandwidth, by contrast, generally starts at several GB\/s. Our tests show NVMe SSD performing 3-5x better than SATA SSD under light load, and up to 10-100x better with large keys (over 100K) and heavy load. We therefore plan to replace all SATA SSDs with NVMe SSDs to further improve TRocks performance."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"6.4 Giving back to the community"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Throughout TRocks development we have benefited from the help of Kvrocks community developers, stayed in close contact with the community, and submitted a good number of PRs\/issues. We hope to give back more by sharing some of the larger standalone features. The semi-synchronous replication feature has already been submitted to the community for review, and we hope it is merged into the main branch soon."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article is reproduced from: Ctrip Technology Center (ID: ctriptech)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original link: "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/24sQPnZX9FnpxK7mcJtvHw","title":"xxx","type":null},"content":[{"type":"text","text":"Practical Insights | Ctrip's Persistent KV Storage in Practice"}]}]}]}
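The semi-synchronous completion rule from section 4.2 (an acknowledgement threshold plus an IDC mode) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `IdcMode`, `Ack`, and `semi_sync_done` are hypothetical names, not TRocks APIs, and the rule for local mode is assumed by symmetry with the remote mode the article describes.

```python
from dataclasses import dataclass
from enum import Enum


class IdcMode(Enum):
    # Hypothetical mode names; the article only says there are local and remote modes.
    ANY = "any"        # plain semi-sync: only the number of acks matters
    LOCAL = "local"    # assumed: at least one ack from the master's own IDC
    REMOTE = "remote"  # at least one ack from an IDC other than the master's


@dataclass(frozen=True)
class Ack:
    slave_id: str  # which slave acknowledged the write
    idc: str       # data center that slave lives in


def semi_sync_done(acks, master_idc, min_acks, mode=IdcMode.ANY):
    """Return True once the acks received so far satisfy the semi-sync rule."""
    # Count distinct slaves so a duplicated ack cannot satisfy the threshold.
    if len({a.slave_id for a in acks}) < min_acks:
        return False
    if mode is IdcMode.LOCAL:
        return any(a.idc == master_idc for a in acks)
    if mode is IdcMode.REMOTE:
        return any(a.idc != master_idc for a in acks)
    return True


# 1 master in idc-a with 4 slaves, waiting for any 2 of them.
local_only = [Ack("s1", "idc-a"), Ack("s2", "idc-a")]
print(semi_sync_done(local_only, "idc-a", 2))
print(semi_sync_done(local_only, "idc-a", 2, IdcMode.REMOTE))
print(semi_sync_done(local_only + [Ack("s3", "idc-b")], "idc-a", 2, IdcMode.REMOTE))
```

With two same-IDC acknowledgements the plain rule completes, while remote mode keeps waiting until a slave from another data center responds, which is the extra latency the article attributes to cross-data-center network delay.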