面向大規模商業系統的數據庫設計和實踐

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"導讀:","attrs":{}},{"type":"text","text":"目前關係型數據庫從上世紀70年代誕生以來得到了廣泛應用,各種數字化的信息系統都能見到關係型數據庫的身影。在真實的場景裏面,業務系統對關係型數據庫這種基礎軟件的要求非常簡單,那就是高可靠和高性能,同時希望儘可能藉助複雜的SQL語義來簡化業務層功能的實現。傳統數據庫產品例如Oracle、SQLServer、MySQL、PostgreSQL等都發展趨於成熟,新一代的雲原生數據庫產品例如Aurora、PolarDB、TiDB、OceanBase等又開始引發更廣泛的關注,那麼什麼樣的數據庫產品才能更好地適應業務發展?數據庫這種比較古老的軟件產品的未來又是什麼?本文主要從商業產品系統的需求出發探討數據庫技術的實踐和思考。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"全文11241字,預計閱讀時間 18分鐘。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"一、商業產品系統對數據存儲設施需求的特點","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/45/45ee053341ffd87cb1d55bdfa6fe6a8e.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"百度商業產品矩陣主要包括效果廣告(搜索廣告、信息流廣告)和展示廣告(品牌廣告、開屏聚屏廣告)兩大類廣告產品,以及基木魚和觀星盤等營銷工具,商業產品系統是連接百度客戶和廣告檢索系統的橋樑,幫助客戶表達營銷訴求,達成營銷目標。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"商業產品系統本質就是一個複雜、龐大的廣告信息管理系統,有toB、toC的多種場景,需求多樣豐富且迭代頻繁。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"orig
in":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"這些業務需求聚焦到數據存儲層面,主要有:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"投放,交易場景的","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"事務型需求","attrs":{}},{"type":"text","text":"(OLTP,On-Line Transaction Processing);","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"廣告效果分析場景的","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"分析型需求","attrs":{}},{"type":"text","text":"(OLAP,Online analytical processing)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"特定場景的","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"高QPS查詢","attrs":{}},{"type":"text","text":",例如賬戶結構,權限關係等;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"字面場景的","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"正反KV查詢","attrs":{}},{"type":"text","text":",例如關鍵詞字面和id互查等;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"物料列表場景的","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"模糊查詢;","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了應對商業場景下如
此多樣且迥異的數據存儲需求,如果使用傳統的存儲技術,至少需要使用關係型數據庫(例如MySQL)、KV存儲(例如Redis)、OLAP 數倉(例如Palo)、全文檢索(例如ElasticSearch)以及自定義內存結構的存儲等。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"那麼業務系統對數據存儲設施的要求是什麼呢?","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先是","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"穩定可靠","attrs":{}},{"type":"text","text":",不可用就意味着客戶體驗受損乃至直接的經濟損失;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其次","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"數據儘可能一致","attrs":{}},{"type":"text","text":",如果客戶在不同環節看的數據有差異則會產生誤解甚至引發錯誤的廣告投放操作;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"再次儘可能","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"低成本","attrs":{}},{"type":"text","text":"的應對數據規模持續增長,不需要預先購置大量硬件,後期擴展時也儘可能簡單;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後綜合","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"讀寫性能好","attrs":{}},{"type":"text","text":",儘量毫秒級響應,不影響客戶的操作體驗。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type"
:"strong","attrs":{}}],"text":"對於業務研發的同學來說,他們希望用到的數據存儲產品是什麼樣呢?","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"接口的使用方式單一,學習和遷移成本低,不同的數據存儲也儘量採用相同的接口形式;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據變更行爲可理解,不出現數據丟失或者覆蓋,不因併發引入異常數據;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"擴展性高,能夠適應數據規模和流量從1到N的變化,業務最好無感知;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"高可用,內建高度容錯能力,業務對數據庫異常最好無感知;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Schema變更成本低廉;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對各種讀寫模式都能提供很好的性能。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"總結起來最好什麼都可以幹,什麼負載都可以扛,什麼運維都不用管!","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"二、BaikalDB 
的發展歷程","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"商業系統最核心的存儲需求就是廣告庫,廣告庫存儲了所有的廣告物料信息,用於完成整個廣告生命週期的管理,幫助客戶完成全部廣告投放功能,獲取轉化。伴隨百度鳳巢系統的發展,廣告庫的存儲設施經歷了兩個重要的階段:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. 單庫到分庫分表的MySQL集羣","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. MySQL主存儲集羣+鏡像輔助存儲構成的異構複合存儲集羣","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.1 分庫分表的 MySQL 集羣","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最早的鳳巢廣告庫採用單機MySQL,部署在獨立的盤櫃(高性能磁盤陣列)上,這種架構受限於當時的硬件條件現在看來比較古老,但這個跟現在流行的存儲計算分離的雲原生架構從思想上是完全一致的,AWS的Aurora或者阿里雲的PolarDB就是把MySQL、PostgreSQL等單機數據庫部署到一個由EBS磁盤或者RDMA高速網絡連接的分佈式文件系統上,實現100%的SQL兼容。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"隨着業務發展,單機部署的MySQL無法支撐數據量和讀寫量的膨脹,分庫分表就成了當時乃至現在最優的選擇,通過分庫分表,MySQL可以實現容量和性能的高擴展性。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/01/01f5db594b12513df8d573cd52fa1725.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從2010年開始,鳳巢廣告庫就依次經歷了1拆4、4拆8、8拆16、16拆32的分庫過程,從一套單機集羣發展成了有33分庫(多拆出來的一個分庫是爲了解決個別大客戶購買巨量關鍵詞的場景),每分庫1主11從的多分庫集羣,存儲了
數十TB的廣告物料信息,讀寫PV達到每日數十億。拆庫的服務停頓時間從一天到6個小時,再到分鐘級別。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.2 異構複合存儲集羣","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"鳳巢廣告庫的業務場景是讀多寫少,查詢場景多樣,多分庫MySQL集羣在滿足一些查詢場景較爲喫力,比如在賬戶-計劃-單元-關鍵詞層級結構裏,獲取賬戶下關鍵詞數,計劃下的關鍵詞數等涉及全表掃的count,關鍵詞字面高qps查詢,創意模糊搜索,物料列表分篩排等,這些需求使用MySQL都難以滿足。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲解決這個問題,我們通過數據流,把MySQL的數據實時同步到一個鏡像的內存存儲,這些鏡像存儲採用針對特定查詢場景的內存結構,來滿足業務性能。同時爲了業務應用的開發便利,還專門開發了SQL代理層,按照一定規則在SQL不改變的情況路由到鏡像索引,並轉化爲鏡像存儲所需要的請求參數,這樣雖然我們使用了不同的數據源,但是業務應用仍然認爲是一個 MySQL協議的數據庫在提供服務,且無需要關注應該查詢哪種數據源,由此形成一個異構的複合存儲形態。架構如下圖所示:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/57/57997b4b19c86714b278519f7ddfa329.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這是一種常見的架構設計,在另一些業務場景中會把OLTP數據庫的數據同步到OLAP數據倉庫,隔離離線分析場景,它的優勢在於多套同種數據不同存儲引擎的系統通過分而治之來解決複雜的查詢場景,並具有一定業務隔離性。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"依靠SQL代理層能夠有效提升業務應用的使用體驗,並且可以把應用層分庫分表邏輯也下沉到這個代理層,拆庫時業務應用也無需感知。對於業務應用來說,看到的是一個單機的MySQL系統,不再需要考慮任何性能和容量的問題。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin"
:null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"但是這種架構也有明顯的缺點:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"運維更爲複雜:","attrs":{}},{"type":"text","text":"除了關注 MySQL 本身,還需要運維數據實時同步流,SQL代理層,鏡像索引這些系統。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"數據實時同步容易出現故障或者延遲:","attrs":{}},{"type":"text","text":"客戶可能感知到明顯的不一致,從鏡像索引查詢到的數據跟從MySQL查詢有差異。爲了降低這種差異的影響,SQL代理層還需要設計一定的降級能力(發現延遲時儘可能切換到MySQL查詢)。還需要有快速修正鏡像索引數據的設施。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"資源冗餘浪費:","attrs":{}},{"type":"text","text":"鏡像索引實際是數據的複製, MySQL爲扛住讀性能和同步需求需要大量的從庫。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.3 2017年的選擇","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"時間來到2017年,鳳巢廣告庫已經有33分庫,磁盤也用上了NVME SSD,對於限定場景的讀寫性能可以滿足業務需求,但是如果再進行一次拆庫,無論是資源消耗還是運維成本都更爲巨大。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"到這個階段,我們開始思考是否存在一種成本更低的解決方案。新的信息流廣告業務也在快速發展,如果再形成一套鳳巢廣告類似的存儲架構,實際成本會非常可觀。雖然4年後的今天,鳳巢廣告庫依靠硬件升級,包括CPU和內存升級、NVME 
SSD升級到單盤3T,依然維持在33分庫的部署架構,但性能瓶頸已經開始突顯,如果廣告物料繼續高速增長,預計2022年底就需要進行新的拆庫。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當時廣告系統的業界標杆Google AdWords的核心存儲是F1/Spanner,採用全球部署可以跨遠距離的數據中心多活,配備原子鐘用於實現分佈式強一致事務,具備極高的可用性和自動增容的擴展性。參考Google存儲系統的設計理念,廣告存儲系統設計也有可見的兩種路線:","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"2.3.1 基於MySQL深度定製","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MySQL是一種單機的架構,代碼規模達到百萬行級別,掌控和修改的難度都特別高。如果要把MySQL從內部改造成一種類似F1/Spanner能力的系統基本不大可能。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"這時一般有兩種解決思路,都是從外部來尋求突破:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"類似Aurora和PolarDB,在文件系統上進行突破,使用EBS或者構建一種RDMA高速連接的分佈式文件系統,這並不是研發新的數據庫系統。但是爲獲取更好的性能,依然需要深入 MySQL的存儲引擎和主從同步機制進行一些定製和深度優化。即便如此,總容量和性能也不能無限擴展,例如Aurora最高可達128TB,性能是MySQL的5倍,PolarDB最高可達100TB,性能是MySQL的6倍。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"類似鳳巢廣告存儲的設計思路,通過數據同步並藉助擴展的鏡像索引提升查詢性能,但冗餘成本高,數據一致性差。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"2.3.2 
使用滿足分佈式+雲原生+多樣化索引架構+強一致等條件的新數據庫系統","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2017年的時候,無論是Google的F1/Spanner還是OceanBase都是閉源系統,跟內部設施耦合很大。開源系統主要有兩個流派,一類是支持SQL的OLAP系統,例如百度的Palo(現開源名爲Doris)、Impala(無存儲引擎)、ClickHouse等,一類是參考F1/Spanner思想的CockroachDB和TiDB。OLAP系統肯定不太滿足我們TP(在線事務)場景的主需求,當時CockroachDB和 TiDB也處於起步階段,基本沒有生產場景的應用。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這時放眼望去,實際並沒有特別成熟的解決方案,基於MySQL的方案也走到了一個瓶頸,那麼我們能否自研一個新的分佈式數據庫系統?當時的決策依據是看團隊是否具備能力從零研發出一個高可用、高性能、低成本的OLTP爲主兼顧OLAP的數據庫(也就是HTAP,Hybrid Transactional/Analytical Processing,混合事務和分析處理)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"團隊的條件:","attrs":{}},{"type":"text","text":"已有的存儲方向團隊(4人)是C++技術棧,研發過SQL代理層和定製化存儲,熟悉MySQL協議,有實戰的工程經驗。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"技術的條件:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1、分佈式系統需要有效的通信框架,百度的brpc框架當時已經非常成熟,是工業級的RPC實現,有超大規模的應用。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2、保障數據一致性當時主流的方案就是Paxos和Raft,百度braft框架是基於brpc的Raft協議實現,發展也很迅速,有內部支持。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3、單機存儲節點需要一個可靠的KV存儲,Facebook在Google的LevelDB基礎上開發的RocksDB是基於LSM-Tree的高性能KV引擎,CockroachDB和TiDB都選擇了RocksDB。","attrs":{
}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"後來經過8個月的設計研發,我們的1.0版本數據庫就完成上線,結果也證明了我們的決策是可行的。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.4 面向商業產品系統的新一代存儲系統BaikalDB","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BaikalDB是面向商業產品系統的需求而設計的分佈式數據庫系統,核心的目標有三個:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"1、靈活的雲上部署模式:","attrs":{}},{"type":"text","text":"面向容器化環境設計,能夠與業務應用混部,靈活遷移,容量和性能支持線性擴展,成本低廉,不需要特殊硬件。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"2、一站式存儲計算能力:","attrs":{}},{"type":"text","text":"面向業務複雜需求具備綜合的適應性,主要滿足OLTP需求,兼顧OLAP需求、全文索引需求、高性能KV需求等。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"3、兼容MySQL協議:","attrs":{}},{"type":"text","text":"易於業務使用,學習成本低。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BaikalDB命名來自於Lake 
Baikal(貝加爾湖)。貝加爾湖是世界上容量最大的淡水湖,水量相當於北美洲五大湖的總和,超過整個波羅的海,淡水儲量佔全球20%以上。西伯利亞總共有336條河流注入貝加爾湖。冬天的貝加爾湖畔,淡藍色的冰柱猶如分佈式數據庫中一列列的數據,密密麻麻卻又井然有序,令人驚豔。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/8e/8edb93e2f666bb1c04a908c3a4696fcf.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BaikalDB是一個兼容MySQL協議的分佈式可擴展存儲系統,支持PB級結構化數據的隨機實時讀寫,整體系統架構如下:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3d/3d7946230fe023fb036dc8f92d32718a.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BaikalDB基於RocksDB實現單機存儲,基於Multi Raft協議(braft庫)保障副本數據一致性,基於brpc實現節點通訊交互,其中","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"BaikalStore","attrs":{}},{"type":"text","text":" 負責數據存儲,用Region組織,三個Store的三個Region形成一個Raft group實現三副本,多實例部署。Store實例宕機可以自動遷移Region數據。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"BaikalMeta ","attrs":{}},{"type":"text","text":"負責元信息管理,包括分區、容量、權限、均衡等, 
Raft保障的3副本部署,Meta宕機隻影響數據無法擴容遷移,不影響數據讀寫。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Baikaldb ","attrs":{}},{"type":"text","text":"負責前端SQL解析,查詢計劃生成執行,無狀態全同構多實例部署,宕機實例數不超過QPS承載極限即可。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"BaikalDB 的核心特性有:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"全自主化的容量管理:","attrs":{}},{"type":"text","text":"可以自動擴容和自動數據均衡,應用無感知,很容易實現雲化,目前運行在Opera PaaS平臺之上","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"高可用,無單點:","attrs":{}},{"type":"text","text":"支持自動故障恢復和遷移","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"面向查詢優化:","attrs":{}},{"type":"text","text":"支持各種二級索引,包括全文索引,支持多表join,支持常見的OLAP需求","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"兼容MySQL協議,支持分佈式事務:","attrs":{}},{"type":"text","text":"對應用方提供SQL界面,支持高性能的Schema和索引變更","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"i
ndent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"支持多租戶:","attrs":{}},{"type":"text","text":"meta信息共享,數據存儲完全隔離","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在系統研發過程中,BaikalDB以業務需求爲導向規劃快速迭代,在業務使用中深度打磨優化,隨業務成長而成長,關鍵功能的時間節點如下:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/49/49ab357db1c78bc1879ff274ea367cab.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從2018年上線以來,BaikalDB已部署1.5K+數據表,數據規模達到600+TB,存儲節點達到1.7K+。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"三、BaikalDB 關鍵設計的思考和實踐","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"分佈式數據存儲系統一般有三種架構模式,分別是Shared Everything、Shared Disk和Shared Nothing。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"1、Shared Everything:","attrs":{}},{"type":"text","text":"一般是針對單個主機,完全透明共享CPU、內存和磁盤,傳統RDBMS產品都是這種架構。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"2、Shared Disk:","attrs":{}},{"type":"text","text":"各個處理單元使用自己的私有CPU和內存,但共享磁盤系統,這可以實現存儲和計算分離,典型的代表是Oracle 
RAC(使用SAN共享數據)、Aurora(使用EBS)、PolarDB(使用RDMA)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"3、Shared Nothing:","attrs":{}},{"type":"text","text":"各個處理單元都有自己私有的CPU、內存和磁盤等,不存在資源共享,類似於MPP(大規模並行處理)模式,各處理單元之間可以互相通信,並行處理和擴展能力更好。典型代表是Hadoop,各node相互獨立,分別處理自己的數據,處理後可能向上層彙總或在節點間流轉。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Shared Disk是很多雲廠商倡導的架構,雲廠商希望在雲上提供一個完全兼容傳統RDBMS系統的雲產品,希望廣大的數據庫使用者基本沒有遷移成本,但是各家雲廠商的實現有比較多差異,主要比拼性能、容量和可靠性,這些也是各家雲廠商吸引客戶的賣點。但是該架構的Scale Out(橫向擴容)能力比較有限,所以雲廠商的存儲廣告宣傳語一般是百TB級別的數據容量。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Shared Nothing是一種分佈式的依靠多節點來工作的架構,大部分NoSQL都是這樣的架構,Sharding MySQL集羣也是一種Shared Nothing的架構,每個分片獨立工作。這類架構最大的侷限是在出現網絡分區時難以同時保障一致性和可用性,也就是受限於著名的CAP定理。大部分NoSQL系統不支持事務,優先保障可用性。但對於OLTP場景,數據一致性非常重要,事務是不可或缺的環節。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因爲BaikalDB的目標是一個面向業務需求的具備融合型能力的分佈式數據存儲,並且大規模數據場景更看重Scale Out能力(僅僅100TB 容量遠遠不夠),所以採用的是Shared Nothing的架構。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於分佈式數據系統而言,設計最需要關注存儲、計算、調度三個方面的內容。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.1 
存儲層的設計","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"存儲層的設計主要是關注用什麼樣的數據結構來描述數據存放。對於分佈式數據系統,還需要額外關注怎麼利用多節點來協同存儲同一份數據。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大規模數據場景的存儲需要優先考慮使用磁盤,而不是成本更高、數據易失的內存。面向磁盤的存儲引擎,RocksDB是比較突出的代表,其核心模型是Key-Value結構。如果使用RocksDB就需要考慮如何把數據表的結構映射到Key-Value結構。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了把數據分散到多臺機器,BaikalDB還引入了Region的概念,用於描述最小的數據管理單元,每個數據表是由若干個Region構成,分佈到多臺機器上。這樣的架構就需要考慮如何對數據進行拆分,一般有Hash(根據Key的Hash值選擇對應的機器)和Range(某一段連續的Key都保存在一個機器上)兩種。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hash的問題在於如果Region增大到需要分裂時如何動態修改Hash規則,而Hash規則的改動會涉及大量數據的重新分佈,每個Region的大小都很難均衡,即使引入一致性Hash也只能有限改善該問題。Range雖然容易實現數據的分裂拆分,但是容易有熱點,不過相對來說好克服。所以BaikalDB採用了Range拆分。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Key-Value不等同於數據庫的Table,需要把數據表的主鍵索引(也叫聚簇索引,存儲主鍵值和全部數據),和面向查詢優化的二級索引(也叫非主鍵索引、非聚簇索引,存儲索引值和主鍵值)以及全文索引都映射到Key-Value模型裏面。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":
"主鍵索引,","attrs":{}},{"type":"text","text":"爲區分Region和索引類型,除了包含主鍵值還需要包含 region_id和index_id,由於region_id可以在同一集羣全局唯一分配,也不需要table_id,同時考慮到多字段構成的聯合主鍵需要按聯合主鍵的大小順序存放,還需要引入對Key的Memcomparable編碼來提升Scan性能;整行的數據可以用protobuf進行編碼後再存儲到Value裏,這樣可有一定壓縮也能更方便地實現加列操作。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"二級索引,","attrs":{}},{"type":"text","text":"主要需要考慮採用本地(局部)二級索引(Local Secondary Index)還是全局二級索引(Global Secondary Index)。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"本地二級索引:","attrs":{}},{"type":"text","text":"只索引本機Region的數據,優點是索引和數據都在同一個節點,回表速度快,不需要實現分佈式事務。但是查詢總需要附帶上主鍵條件,否則就只能廣播給全部分區。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"全局二級索引:","attrs":{}},{"type":"text","text":"能夠索引整個表的全部分區數據,優點是沒有主鍵條件時也不需要廣播,但由於全局二級索引是一張獨立的表,跟所有主鍵數據沒法在一個存儲節點上,需要引入分佈式事務才能工作。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"二級索引在Key-Value模型裏面,不管本地的還是全局的,Key都是由region_id、index_id、索引鍵值、主鍵值(如果是二級唯一索引就無需包含)構成,Value是主鍵值,可以看出使用二級索引拿到整行數據還需要從主鍵索引再獲取一次(也就是回表操作),如果相關數據都在索引鍵值裏就不需要回表。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BaikalDB在早期沒有引入分佈式事務(實在太複雜),所以先實現了本地二級索引,在實現分佈式事務後再實現了全局二級索引。對於業務應用而言,可以按使用場景優先選擇本地二級索引。","attrs":{}}]},{"type":"paragraph","attrs
":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"全文索引,主要涉及索引構建及檢索:","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"構建:","attrs":{}},{"type":"text","text":"將正排字段切詞爲一或多個term。構建term的有序倒排拉鍊,並按照格式進行存儲。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"檢索:","attrs":{}},{"type":"text","text":"根據檢索詞,對多個倒排拉鍊進行布爾查詢。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在Key-Value模型裏面,Key爲region_id、index_id、分詞後的 term,Value爲排序好的主鍵值。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所以在存儲層面,以上核心邏輯結構,以及列存、HLL、TDigest等,物理上都以KV結構存儲。關於更多的索引細節設計請參考BaikalDB 
的索引實現(https://my.oschina.net/BaikalDB/blog/4514979)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據基於Range方式按照主鍵切分成多個分片(Region)後,同時爲提升在分佈式場景下的整體可用性,需要多個副本(Replica)來存儲同樣的Region,這時就需要考慮多個副本和多個分片的數據一致性:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"多副本(Replica)的數據一致性:","attrs":{}},{"type":"text","text":"多個副本需要數據可靠地複製,在出現故障時,能產生新的副本並且不會發生數據錯亂,這主要依靠Raft一致性協議來實現。Raft提供幾個重要的功能:Leader選舉、成員變更、日誌複製,每個數據變更都會落地爲一條Raft日誌,通過Raft的日誌複製功能,將數據安全可靠地同步到副本Group的多數節點中。在強一致性要求的情況下,讀寫都發生在Leader節點。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"多分片(Region)的數據一致性:","attrs":{}},{"type":"text","text":"涉及多個Region的操作需要保障同時成功或者失敗,避免一個修改中部分Region失敗導致的不一致問題,這主要依靠結合RocksDB單機事務的兩階段提交(Two-phase Commit,即2PC)實現悲觀事務,同時結合Percolator的思想,採用Primary Region來充當事務協調者的角色,避免協調者的單點。分佈式事務的細節比較複雜,在很多數據庫研發公司中都是一個專門的團隊在投入,更多細節可以參考","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":" ","attrs":{}},{"type":"text","text":"BaikalDB 的分佈式事務實現(https://my.oschina.net/BaikalDB/blog/4319429)。","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.2 
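Design of the Compute Layer","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The compute layer is concerned with how to parse SQL into a concrete query plan, that is, a distributed computation process, and with how to optimize that plan based on cost.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BaikalDB's SQL layer uses a distributed, layered design. The overall architecture is as follows:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e8/e84e286f532680474762cc52e1566283.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BaikalDB is not yet a full MPP architecture and differs considerably from the design of OLAP systems: the final computation and aggregation of data happens on a single Baikaldb node. Filter conditions are pushed down to the BaikalStore module wherever possible to reduce the amount of data that has to be aggregated at the end. Because the workload is mainly OLTP and result sets are limited in size, this is sufficient in practice. BaikalDB storage nodes therefore have some compute capability and can execute in a distributed fashion, which reduces data transfer, so storage and compute are not strictly separated either.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The SQL engine is implemented with the classic Volcano model, in which everything is an operator, including the interaction between Baikaldb and BaikalStore. Each operator provides open, next, and close operations, and operators can be composed flexibly, which makes the engine highly extensible.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When executing query conditions on a table that has several indexes, the engine also needs ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"the ability to choose the most suitable index automatically. There are two mainstream designs for this kind of query optimizer:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Rule-based optimizer (RBO, Rule-Based Optimization): ","attrs":{}},{"type":"text","text":"This approach determines the SQL execution plan from a set of rules hard-coded in the database. In practice, differences in data volume affect the performance of the same SQL statement, and this is the weakness of RBO: the rules are fixed while the data changes, so the plan the rules select is not necessarily optimal.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Cost-based optimizer (CBO, Cost-Based Optimization): ","attrs":{}},{"type":"text","text":"This approach generates multiple candidate execution plans from the optimization rules, then uses statistics and a cost model to compute the cost of each plan and selects the plan with the lowest cost as the one actually executed. The accuracy of the statistics affects whether the CBO can make the optimal choice.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Query optimization is a very complex topic in its own right. AI-based query optimizers now exist and are an active area of academic research, and at many database vendors query optimization is a dedicated major area of work. BaikalDB combines RBO and CBO for query optimization. For details on the CBO, see the BaikalDB write-up on its cost model implementation (https://my.oschina.net/BaikalDB/blog/4715063).","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.3 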
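Design of the Scheduling Layer","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A distributed data system involves many worker nodes, each of which may have a different hardware environment and software load. To get the most out of the cluster, each worker node should store roughly the same amount of data and carry roughly the same processing load, while the system must also route around failed nodes and recover them so that cluster performance stays stable.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scheduling systems generally have a Master role that evaluates information from every node in the cluster and makes scheduling decisions. In BaikalDB the Master role is the BaikalMeta module. BaikalStore periodically collects information and reports it to BaikalMeta in heartbeat packets. Once BaikalMeta has detailed data for the whole cluster, it generates decisions from that data and its scheduling policies and sends them back to BaikalStore in the heartbeat responses. BaikalStore carries them out flexibly according to its actual situation. An operation is not required to succeed, because BaikalStore reports the outcome to BaikalMeta in later heartbeats.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Each BaikalMeta decision is based only on the heartbeat results collected from all nodes in the current round and does not depend on earlier heartbeats, which keeps the decision logic fairly easy to implement. If BaikalMeta fails, BaikalStore heartbeats get no response, so all scheduling operations stop and the cluster stays in an unadjusted state. Meanwhile, the Baikaldb module caches the cluster information returned by BaikalMeta, so it knows all the Region information for every table and can exclude or retry failed nodes. As a result, reads and writes are not affected even when BaikalMeta is down.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In addition, BaikalMeta decisions do not need to be very timely, so BaikalStore nodes can send heartbeats at fairly long intervals, which keeps the request load on BaikalMeta under control. A single group of BaikalMeta nodes can thus manage thousands of BaikalStore nodes.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scheduling of storage nodes focuses mainly on Leader balancing and Peer balancing:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Leader balancing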
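: ","attrs":{}},{"type":"text","text":"The number of Region Leaders on each BaikalStore node should be as even as possible. Leaders carry most of the read and write load, so balancing them keeps CPU and memory load similar across BaikalStore nodes. When a BaikalStore node is under high load (which is typically much more pronounced in containerized environments), for example because other containers on the same machine consume a lot of CPU and memory, or because other Leaders on the same BaikalStore consume a lot of resources, its Leaders need to be switched to other nodes to avoid request timeouts caused by hot spots.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Peer balancing: ","attrs":{}},{"type":"text","text":"The replicas of each Raft Group are spread across the BaikalStore nodes as widely as possible so that every node holds about the same number of replicas. Because all Regions of a table are roughly the same size, the storage used on each BaikalStore is then also similar, which avoids data skew and makes full use of the cluster's disk resources. In addition, the replicas of a Region should sit on different machines, and ideally on different network segments, so that a machine or network failure cannot make a majority of a Region's replicas unavailable, which would prevent a Leader from being elected and block reads and writes. When a Peer's node fails or a Peer is migrated deliberately, a new Peer must be created and brought up to date before it can serve, and the unavailable Peer must be removed, so that the number of Peers stays stable.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A Region is the unit of scheduling, and its ability to split is one of the foundations of the scheduling mechanism. When a Region grows beyond a configured size threshold, BaikalDB splits its Range using a baseline-plus-incremental approach to produce a new Region and its replicas, which are then balanced by the scheduler based on the reported information. This allows splitting to happen automatically as data grows. Scheduling is also a fairly complex topic. Introducing additional scheduling policies can improve resource utilization and disaster tolerance, avoid hot spots, and safeguard performance, and this work is a key direction for BaikalDB's ongoing iteration.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4. Summary","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Starting from the requirements of large-scale commercial systems, this article summarized what commercial scenarios expect from data storage infrastructure. By reviewing how the database systems behind the Phoenix Nest (鳳巢) advertising library evolved, it showed the iteration history of BaikalDB, the lower-cost, more reliable, and more capable data storage system developed in-house by the Commercial Platform R&D Department.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After four years of work, BaikalDB has consolidated all the storage systems that historically existed in the commercial product systems into a single unified platform. Throughout development driven by business requirements, BaikalDB relied on as little headcount as possible, built the core feature set quickly, and iterated step by step according to the urgency of business needs. It now serves not only advertising scenarios but also new commercial scenarios such as landing pages and e-commerce, and the team continues to add features, optimize performance, reduce cost, and polish the system as a whole.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Finally, on the question of how to build a database, the article summarized some of BaikalDB's key design ideas from three angles: storage, compute, and scheduling. Databases, operating systems, and compilers are often called the three major kinds of system software and can be regarded as the infrastructure of all computer software. Database technology is likewise broad and deep, and this article offers only a partial view from a business perspective, so omissions are inevitable. Discussion and corrections are welcome.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Finally, please follow BaikalDB's open source project at github.com/baidu/BaikalDB.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"---------- END 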
----------","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"百度Geek說","attrs":{}}]}]}