OceanBase原理與實現分析

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最近看了《大規模分佈式存儲系統:原理解析與架構實踐》,主要介紹了OceanBase數據庫的原理和實現,是一本難得的好書。本文介紹書中的關鍵內容以及自己的想法。如果你有充足的時間,建議直接讀原書,沒必要花時間閱讀本文,如果沒有的話,可以閱讀本文了解OceanBase的整體架構和關鍵內容。另外書中的內容是基於OceanBase 0.4版本的,0.4發佈於2012年,距今已經8年,一些內容上可能有點陳舊,但是不影響整本書的價值。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1. 背景"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"OceanBase開發背景是爲了解決海量數據的讀寫問題。舉個例子,淘寶有個收藏夾功能,每個用戶可以收藏自己感興趣的商品方便自己訪問。收藏夾可以有多個商品,用戶可以添加或者刪除收藏夾的商品。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據庫設計了商品表item和收藏信息表info"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"商品表item有數十億條記錄,收藏信息表info有數百億條記錄"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"每個用戶可以收藏幾千個商品,熱門商品可能被數十萬用戶收藏"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"商品的價格等信息隨時變化"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於這個問題,傳統的方案有兩種。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一種方案是info表只保存item表的索引,每次按用戶ID找到info列表後,再根據索引找到每個item詳情返回給用戶。這種方案的問題在於,如果info表中的記錄比較多,那麼關聯查詢item的時間會很長,尤其是分庫分表後,查詢item可能需要到很多臺機器上查詢。這種方案對讀不夠友好。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另一種方案是info表冗餘存儲item詳情,查詢的時候只需要查info表,不需要關聯查詢item表。但這種方案的問題在於,如果商品item的信息變化了,比如價格修改了,需要修改所有幾十萬收藏了這個商品的info信息,分庫分表後,可能需要到很多臺機器上更新。這種方案對寫不夠友好。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這個問題是OceanBase面臨的主要挑戰之一。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2. OceanBase架構"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大部分互聯網業務有個特點就是寫多讀少,讀的次數可能是寫的幾倍,幾十倍,甚至更多。比如遊戲業務,讀寫比可能是5:1,熱門視頻讀寫比可能是一百萬比一。針對這個特點,OceanBase設計時採用單臺更新服務器保存最近一段時間的增量數據。歷史數據也叫基線數據保存在基線服務器中。每次查詢的時候,從基線服務器取到基線數據,從更新服務器取到增量數據,合併後返回給客戶端。更新服務器上的數據會定期同步給基線服務器,這樣可以保證更新服務器上的數據不會太大。整體架構如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ff/ffdc88deccee9773c2894fbefc22b414.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"客戶端:兼容MySQL客戶端,支持JDBC,C客戶端訪問。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"RootServer:管理服務器,管理集羣中所有服務器,子表數據分佈和副本管理。一主一備強同步。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"UpdateServer:更新服務器,一主一備,可以配置不同的同步模式。一般是強同步模式,異步模式主要用來做異地容災。異步模式下,有可能會造成一段時間的數據丟失,但因爲跨城的延時比較大,一般是30ms ~ 100ms,如果採用強同步方式,性能可能會慢得讓你懷疑人生。 另外UpdateServer和RootServer一般部署在同一臺物理服務器上。因爲兩臺服務器都是單點,這樣可以提升進程間通信的開銷。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ChunkServer:基線服務器,基線數據一般存儲三副本,相當於是存儲系統的數據存儲節點。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MergeServer:接收並解析客戶端SQL請求,將請求轉化成物理查詢計劃後發給ChunkServer或者UpdateServer,合併多個服務器返回的結果,將最終結果發給客戶端。MergeServer無狀態,相當於是存儲系統的計算節點。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.1 客戶端"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"客戶端訪問OceanBase流程:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1)請求RootServer獲取MergeServer地址列表。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2)選擇某臺MergeServer發送SQL請求。選擇的策略有隨機和一致性哈希兩種。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3)如果請求MergeServer失敗,從MergeServer列表中重新選擇一臺重試,如果某臺MergeServer失敗超過一定次數,將這臺MergeServer加入黑名單並從列表中刪除。客戶端還會定期從RootServer更新MergeServer列表。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.2 RootServer"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"RootServer主要功能包括集羣管理,數據分佈和副本管理。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"集羣管理:RootServer通過租約機制選擇主UpdateServer。另外通過和MergeServer以及ChunkServer建立心跳來感知服務器的健康狀態。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據分佈:存儲數據時,OceanBase使用主鍵對數據排序後存儲。基線數據按照主鍵劃分子表,每個子表默認256MB,默認三副本,分佈在多臺ChunkServer中。每個子表負責的主鍵範圍保存在RootServer中。一個子表劃分例子:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/41/41c1d2766a5d83e8a98827656033dd05.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"副本管理:當某臺ChunkServer故障時,RootServer檢測後觸發對這臺ChunkServer上的子表增加副本的操作。如果某臺ChunkServer的負載比較高,RootServer會定期執行負載均衡操作,將某些子表遷移到負載較低的機器上。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.3 MergeServer"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MergeServer主要功能包括協議解析,SQL解析,請求轉發,結果合併,多表操作等。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MergeServer緩存了子表分佈信息,根據請求設計的子表將請求轉發給子表所在的ChunkServer。如果是寫操作,還會轉發給UpdateServer。如果請求跨多個子表,MergeServer將請求拆分後併發發給多臺ChunkServer,合併這些ChunkServer返回的結果。如果某個ChunkServer故障,MergeServer將請求轉發給該子表其他副本所在的ChunkServer。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.4 ChunkServer"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ChunkServer主要功能包括存儲多個子表,提供讀取服務,執行定期合併和數據分發。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"OceanBase將大表劃分成256MB的子表,每個子表一般由一個SSTable組成,每個SSTable由多個Block組成,每個Block 4KB ~ 64KB之間。子表不能太大也不能太小,太大的話導致子表遷移以及負載均衡成本較高,也不能太小,太小導致元數據會比較大,增加RootServer負擔。查找數據時,先根據子表數據分佈定位到子表,然後在SSTable中執行二分查找,找到基線數據後,ChunkServer請求UpdateServer獲取增量數據,合併基線數據和增量數據的結果。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一般在凌晨的時候,OceanBase會觸發UpdateServer的數據分發和合並,這時ChunkServer向UpdateServer請求某段時間的更新操作,和本地數據合併。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.5 UpdateServer"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"UpdateServer更新數據流程:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1)同步操作日誌到備機"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2)寫主機操作日誌"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3)更新內存"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4)當內存數量超過一定值,生成快照文件,也即SSTable,存儲到SSD中。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果UpdateServer宕機,RootServer上的租約將失效,然後將備UpdateServer切換爲主UpdateServer。宕機的UpdateServer重啓後需要先加載快照文件,然後回放快照點之後的操作日誌。注意第一步和第二部的順序不能調換,書中8.3.6節這裏的順序寫反了,書中先寫本機再同步備機,如果主機在這之間宕機,數據還沒有同步到備機,這時備機變成主機,之前宕機的主機重啓後,回放快照,包含之前的最後一次更新,但是這時的新主機上沒有這次更新,那就衝突了,書中8.4.1中的描述是對的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"注意第一步日誌同步到備機,備機只要接收到日誌寫到內存就可以回覆主機,不需要寫盤後纔回復,這樣提高同步性能。而主機是需要寫盤後才能回覆客戶端的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"UpdateServer的租約一般3 ~ 5秒,正常情況下RootServer定期給UpdateServer發送延長租期命令。如果RootServer需要升級,升級過程中,UpdateServer租約過期後系統就停止服務了。書中給出的方案是RootServer在退出前,會給UpdateServer發送一個超長時間的租約,承諾這段時間不進行主UpdateServer選舉。但是這裏有個隱患,如果這時主UpdateServer宕機了,就無法選擇新的UpdateServer了,系統還是無法服務。而且因爲RootServer主備強同步,即使RootServer有備機,也不能先升級備RootServer,再升級主RootServer。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因爲集羣中只有一臺主UpdateServer,所以很容易實現事務,相當於是單機事務,而不是分佈式事務,都不需要用到兩階段協議。然後通過每日數據合併和分發可以使UpdateServer的內存消耗不至於太大。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"UpdateServer的數據結構是一個內存B+樹,結構如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/36/362e94980044c20ddeedb034edea1904.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"樹中每個葉子節點對應一行數據,key爲主鍵,value爲每行按照時間的順序操作鏈表。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"比如:對主鍵爲1的商品有3個修改操作,分別是:將商品購買人數修改爲100,刪除該商品,將商品名稱修改爲“女鞋”,那麼,該商品的行操作鏈中將保存三個Cell,分別爲:<update,購買人數,100>、<delete,*>以及<update,商品名,“女鞋”>。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"注意這裏保存的是操作列表,而不是最終值,這樣做有什麼好處呢?好處就是在進行事務回滾操作時,只需要把這個操作記錄刪除就行了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"OceanBase相當於GFS + MemSQL,ChunkServer相當於GFS,UpdateServer相當於MemSQL。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3 關鍵設計"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面介紹OceanBase關鍵點的設計方案。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.1 定期合併和數據分發"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"定期合併和數據分發都是將UpdateServer中的增量數據分發到ChunkServer的手段,兩者有相同的地方,也有不同的地方。相同的地方在於:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1)UpdateServer先凍結當前活躍的內存表,變成凍結的內存表,並開啓新的活躍內存表用於保存後續的更新操作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2)UpdateServer通知RootServer數據版本發生了變化,RootServer通過心跳通知ChunkServer。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3)ChunkServer從UpdateServer獲取凍結的更新數據保存到本地。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不同的地方在於:數據分發只是將數據保存到內存,而定期合併會將本地SSTable的基線數據與增量更新數據執行多路歸併,生成新的基線數據保存在新的SSTable中。定期合併是一個比較耗性能的操作,爲了避免影響正常服務,定期合併會進行限速,而且RootServer通知所有ChunServer數據版本變化時,每個ChunkServer會等待隨機一段時間後再開始定期合併,防止所有ChunkServer同時請求UpdateServer。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.2 順序寫磁盤"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不管是機械硬盤還是SSD,隨機寫磁盤都是一個耗時間的操作。對於機械硬盤,尋道時間比較長。對於SSD存在寫入放大效應。所以OceanBase摒棄隨機寫,採用批量順序寫,提高寫盤時的性能。對於隨機讀,SSD可以輕鬆應對。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.3 子表複製與負載均衡"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/cf/cf52f761ab22e42ccfc9ec75f6e2526b.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"RootServer有一個線程專門執行子表複製和負載均衡任務。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"子表複製是指當某個ChunkServer宕機後,爲了防止數據丟失,需要增加副本數小於閾值的子表數量。增加副本時,RootServer選擇這個子表副本所在的ChunkServer作爲源,選擇另一臺負載較低的ChunkServer作爲目的服務器,生成一個子表複製任務,加到任務隊列中。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"負載均衡是指當某臺ChunkServer上的子表個數超過了閾值,把這臺ChunkServer上的子表遷移到負載較低的ChunkServer上。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"子表複製和負載均衡操作時,RootServer還會限制每個ChunkServer同時執行的任務數量,防止任務太多壓垮ChunkServer。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.4 子表自動分裂與合併"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"隨着數據的增加和減少,每個子表的大小可能會差距非常大,有的幾GB,有的幾MB。爲了維護每個子表的大小差距不至於太大,OceanBase會自動對子表的進行分裂與合併操作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"分裂操作由ChunkServer在定期合併時執行。如果當前子表大於256MB,找到當前子表的中間點作爲分裂點,將子表分裂成大小差不多的兩個子表。雖然每個子表的副本有多個,但只要所有子表都採用同樣的分裂規則,就可以保證分裂後的子表是相同的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"合併操作流程如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1)合併準備:RootServer選擇若干個主鍵範圍連續的小子表"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2)子表遷移:將待合併的若干個小子表遷移到相同的ChunkServer"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3)子表合併:往ChunkServer發送子表合併命令,生成合並後的子表範圍"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因爲每個子表有多個副本,只要某一個副本合併成功就行,失敗的子表通過垃圾回收機制刪除。副本數小於閾值的子表將會在子表複製過程中恢復副本數量。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.5 SSTable文件結構"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SSTable的文件結構如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/9f/9f40cc10b2ae086dc59b71ff33a51d01.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1)第一列下部框圖是整個SSTable的結構。每行數據按照主鍵排序後存放在多個Block中,Block之間也是按照順序排列。接着存放的是Block Index,把每個Block最後一行的主鍵作爲Block Index。接着是布隆過濾器和表Schema。最後是Trailer和Trailer Offset。Trailer是保存的是各個區域的偏移信息,比如每個Block在哪個位置,Block Index在哪個位置,有了這些位置信息可以快速讀取到某個Block的數據,而不需要把整個文件加載到內存。Trailer Offset保存到是Trailer本身的偏移位置。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2)第二列下部框圖是Block Index的結構。上部框圖是Block結構,先是兩個Header信息,接着是每行數據,最後是Row Index,也就是每行的偏移信息,可以快速定位到行。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從SSTable查找數據時,首先讀取Trailer Offset信息,根據這個讀取Trailer信息,從Trailer中獲取到Block Index位置,把Block Index讀到內存,通過二分查找找到這個主鍵屬於哪個Block。接着讀取Block到內存,再用二分查找找到對應行。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從上面流程看到,整個ChunkServer是一個三級索引結構:子表索引,塊索引以及行索引。通過這些索引可以快速讀取到行數據。爲了進一步加速讀取,ChunkServer使用了緩存功能,包含三種:塊緩存,行緩存以及塊索引緩存。塊緩存和行緩存存儲訪問頻繁的數據塊和數據行。塊索引緩存一般常駐內存,因爲塊索引一般不會太大。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.6 UpdateServer瓶頸"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"UpdateServer的一些性能數據:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"千兆網卡每秒收發包超過50萬個,萬兆網卡每秒收發包超過100萬個;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"16核機器上B+樹每秒單行修改操作查過150萬次。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"經過網絡,內存結構,IO讀取,多線程操作等多種優化,UpdateServer的性能已經很卓越了,但是隨着應用的場景越來越多,還是有可能成爲瓶頸。一種可能的擴展方式是採用數據分片的方式,把所有數據按Hash分片,這樣同一個用戶的數據會分到同一個UpdateServer。但是這樣帶來的問題是如果要修改多個用戶的數據,就會帶來分佈式事務的挑戰,可以通過兩階段提交或者最終一致性方式實現。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.7 事務和MVCC"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"OceanBase支持多線程併發修改,寫操作拆分爲兩個階段:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1)預提交(多線程執行):事務執行線程首先鎖住待更新數據行,接着,將事務中針對數據行的操作追加到該行的未提交行操作鏈表中,最後,往提交任務隊列中加入一個提交任務。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2)提交(單線程執行):提交線程不斷地掃描並取出提交任務隊列中的提交任務,將這些任務的操作日誌追加到日誌緩衝區中。如果日誌緩衝區到達一定大小,將日誌緩衝區中的數據同步到備機,同時寫入主機的磁盤日誌文件。操作日誌寫成功後,將未提交行操作鏈表中的cell操作追加到已提交行操作鏈表的末尾,釋放鎖並回復客戶端寫操作成功。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e5/e5bb15e81819ed9a4cc64303ab5394ac.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲什麼要拆分成兩階段?多個線程同時提交會有什麼問題?我能想到的一種解釋是提交階段的寫日誌緩衝,同步備機和寫磁盤這三個操作是沒辦法多線程執行的,所以把提交階段設置爲單線程執行,而預提交多線程執行,既保證了正確性又兼顧了性能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外每個寫事務會根據提交時的系統時間生成一個事務版本,讀事務只會讀取在它之前提交的寫事務的修改操作。其中寫事務需要加鎖,版本號是在提交時生成,而讀事務不需要加鎖,版本號在預提交時生成。對於讀寫事務,需要加鎖,版本號在提交時生成。舉個例子:"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f6/f6a8ed554aa244b3047eee1e9a89c840.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對主鍵爲1的商品有2個寫事務:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"事務T1(提交版本號爲2)將商品購買人數修改爲100,事務T2(提交版本號爲4)將商品購買人數修改爲50。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"事務T2預提交時,T1已經提交,該商品的已提交行操作鏈包含一個cell:<update,購買人數,100>,未提交操作鏈包含一個cell:<update,購買人數,50>。事務T2成功提交後,該商品的已提交行操作鏈將包含兩個cell:<update,購買人數,100>以及<update,購買人數,50>,未提交行操作鏈爲空。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於讀事務:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"T3:事務版本號爲1,T1和T2均未提交,該行數據爲空。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"T4:事務版本號爲3,T1已提交,T2未提交,讀取到<update,購買人數,100>。儘管T2在T4執行過程中將購買人數修改爲50,T4第二次讀取時會過濾掉T2的修改操作,因而兩次讀取將得到相同的結果。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"T5:事務版本號爲5,T1和T2均已提交,讀取到<update,購買人數,100>以及<update,購買人數,50>,購買人數最終值爲50。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"OceanBase鎖定粒度爲行鎖,默認隔離級別是讀取已提交(read committed)。read commited會遇到不可重複讀和幻讀問題,使用MVCC功能,可以解決不可重複讀問題。讀操作讀取某個版本的數據快照,不需要加鎖。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"只寫事務:事務預提交時加鎖,事務提交時釋放鎖。如果修改多行數據,使用兩階段鎖2PL。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"讀寫事務:讀操作讀取某個版本快照,寫操作加鎖方式和只寫事務相同。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"OceanBase處理死鎖的方式爲事務執行時超過一段時間無法獲取鎖,回滾事務。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.8 淘寶收藏夾實現"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"回到文章開頭的問題,如何實現收藏夾?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"根據前面的OceanBase的特性,可以採取這樣的方案:info表中冗餘存儲item的信息,用戶查詢收藏夾時,MergeServer先發請求給ChunkServer請求基線數據,然後ChunkServer再發請求給UpdateServer,獲取用戶info記錄和各個item的增量信息,因爲UpdateServer的數據都在內存,即使這個用戶收藏了幾千個item,獲取增量信息也會很快。合併基線和增量信息後,返回給MergeServer,再返回給客戶端。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在每日定期合併時,info和item表的增量信息需要合併進ChunkServer的info表中,生成新的基線數據,這樣可以減少查詢時在UpdateServer上處理的數據量。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4 後續迭代"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"《大規模分佈式存儲系統:原理解析與架構實踐》書中介紹的OceanBase是0.4版本,發佈於2012年,後面幾年OceanBase又經過了一系列的迭代改進。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1.0版本"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2016年發佈1.0版本,主要的變化有兩點:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"引入Paxos實現真正的高可用和強一致性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"改進架構,將UpdateServer、ChunkServer、MergeServer和RootServer合併爲一個ObServer。這樣可以減少運維和部署成本。 而ObProxy主要是反向代理的作用,對用戶的SQL進行解析,將請求發給對應的ObServer。新的架構如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/96/961651de9dbdba394129a5918d9edb8f.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"實現多租戶功能SQL-VM,隔離不同用戶之間資源,包括CPU,內存和IO。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.0版本"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2018年發佈2.0版本,和1.0相比架構沒有太大變化,主要對擴展性,高可用,一致性,低成本和易用性做了提升,提供了全局一致性快照功能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"看完整本書,個人最大的感受就是如果可能的話,一切都應該自動化,減少人工參與。包括但不限於自動合併分裂子表,自動負載均衡,自動切換主備等。而且系統設計階段就應該考慮自動化的方式,否則等系統上線運營後再引入自動化可能成本就會比較大。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"參考"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://mp.weixin.qq.com/s/CXDrXxYV07O2JF8fIulqow?","title":""},"content":[{"type":"text","text":"OceanBase 1.0 分佈式技術架構"}]}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://book.douban.com/subject/25723658/","title":""},"content":[{"type":"text","text":"大規模分佈式存儲系統:原理解析與架構實踐"}]}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章