New Ideas in Cloud-Native Database Design

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"The author, Huang Dongxu, co-founder and CTO of PingCAP, shares the development trends of distributed databases and new ideas for designing cloud-native databases."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Before getting to the new ideas, here is a brief historical review for those who have not followed database technology. After that I will discuss new trends and frontier thinking in cloud-native database design. Let's start with the design patterns of some mainstream databases."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#00b38a","name":"user"}}],"text":"Common Schools of Distributed Databases"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"I group the history of distributed databases by era, and so far there are four generations. The first generation does data sharding and horizontal scaling through simple manual sharding (splitting databases and tables) or middleware. The second generation is NoSQL databases, represented by Cassandra, HBase, and MongoDB; they are used mostly by internet companies and scale horizontally very well."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The third generation, in my view, is the new wave of cloud databases represented by Google Spanner and AWS Aurora. They combine a SQL interface with NoSQL-style scalability: the application sees SQL, and the system can still scale horizontally."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The fourth generation, with the current TiDB design as an example, enters the era of mixed workloads. A single system handles high-concurrency transactions and also offers some of the capabilities of a data warehouse or analytical database, which is why it is called HTAP: a converged database product."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"As for what the future looks like, I will offer some outlook later in this talk. Looking at the whole timeline, from the 1970s to today, databases are a fairly old industry, so I will not go into the details of each stage."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/4c\/4ceab839911b26a172b2138926731164.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#00b38a","name":"user"}}],"text":"Database Middleware"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The first generation is the middleware-era systems, and there are basically two mainstream approaches. One is manual sharding in the application layer. For example, the application itself decides that Beijing's data goes into one database while Shanghai's data goes into another database or a different table. This is the simplest form of manual, application-level sharding, and anyone who has worked with databases will be familiar with it."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The second approach specifies sharding rules through a database middleware. For example, the user's city, the user ID, or a timestamp serves as the sharding key, and the middleware routes data automatically so the application layer does not have to."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The advantage of this approach is simplicity. If the workload is very simple, for example each write or read can be reduced to a single shard, then after the application is adapted accordingly, latency stays fairly low. And if the workload is spread randomly across shards, TPS can scale linearly."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The drawbacks are also obvious. For more complex applications, cross-shard operations are painful, especially queries or writes that must keep data strongly consistent across shards. Another clear drawback is that large clusters are hard to operate, particularly for operations such as schema changes. Imagine having a hundred shards and needing to add or drop a column: you have to run the change on a hundred machines, which is a real hassle."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#00b38a","name":"user"}}],"text":"NoSQL - Not Only SQL"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Around 2010, many internet companies ran into this pain point. After looking closely at their workloads, they found that the workloads were simple and did not need the more complex features of SQL, and so a new school emerged: NoSQL databases. NoSQL gives up advanced SQL capabilities, and in exchange it gains strong horizontal scalability that is transparent to the application. The flip side is that if your application was written against SQL, migrating can carry a significant rewrite cost. Representative systems are the ones I just mentioned: MongoDB, Cassandra, HBase, and so on."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The best-known system is MongoDB. Although MongoDB is distributed, you still have to choose a shard key, just as with a manual sharding scheme. Its strengths are well known: there is no fixed schema, you can write whatever you want, and it is friendly to document-shaped data. The drawbacks are also clear. Because a sharding key is chosen, data is typically partitioned by a fixed rule, so cross-shard aggregations are awkward. Second, cross-shard ACID transactions are not well supported."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/9b\/9b1f514414bb1849034da93c848d5052.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"HBase is a well-known distributed NoSQL database in the Hadoop ecosystem, built on top of HDFS. Cassandra is a distributed KV database whose distinguishing feature is that it offers multiple consistency models for KV operations. Its drawbacks are shared with many NoSQL systems, including operational complexity and the application changes that a KV interface demands."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#00b38a","name":"user"}}],"text":"Third-Generation Distributed Databases: NewSQL"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"As mentioned, both sharding and NoSQL are intrusive to the application. If your application depends heavily on SQL, neither option is comfortable. So some technically advanced companies asked whether they could combine the strengths of traditional databases, such as SQL expressiveness and transactional consistency, with the scalability of the NoSQL era, and end up with a new kind of system that scales yet is as convenient to use as a single-node database. Two schools came out of this line of thinking, one represented by Spanner and the other by Aurora, each the choice a top internet company made when it faced this problem."}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#089cbb","name":"user"}}],"text":"The Shared Nothing School"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The Shared Nothing school is represented by Google Spanner. Its first benefit is nearly unlimited horizontal scaling: the system has no single point that limits it, so whether the data is 1 TB, 10 TB, or 100 TB, the application barely has to worry about scalability. The second benefit is that it is designed to provide strong SQL support; you do not specify sharding rules or sharding strategies, and the system scales for you automatically. The third is strongly consistent transactions like those of a single-node database, which can support financial-grade workloads."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/2d\/2d5a4f24d8c0f3d287782bf8d2a9f599.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Representative products are Spanner and TiDB. These systems also have drawbacks. Fundamentally, a purely distributed database cannot behave exactly like a single-node one in every respect. Take latency: a single-node database may complete a transaction entirely on one machine, but in a distributed database the rows a transaction touches may live on different machines, so achieving the same semantics takes multiple rounds of network communication, and response time and performance cannot match a single local operation. As a result, there are still some differences from a single-node database in compatibility and behavior. Even so, for many applications a distributed database has clear advantages over manual sharding; in ease of use, for instance, it is far less intrusive."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#089cbb","name":"user"}}],"text":"The Shared Everything School"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The second school is Shared Everything, represented by AWS Aurora and Alibaba Cloud's PolarDB. Many databases call themselves a Cloud-Native Database, but I think 'Cloud-Native' here mostly reflects the fact that these offerings come from public cloud providers; whether the technology itself is cloud native has no agreed standard. From a purely technical point of view, the key point is that compute and storage are fully separated in these systems: compute nodes and storage nodes run on different machines, and the storage side feels like running a MySQL instance on a cloud disk. Personally, I do not consider an architecture like Aurora or PolarDB to be a purely distributed architecture."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/10\/105bededcf51bdad9d938c7654e404d0.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"MySQL primary-replica replication traditionally goes through the Binlog. Aurora, as a representative Shared Everything database in the cloud, replicates the entire I/O flow only in the form of redo log. It does not push writes through the full I/O path to the Binlog, ship the Binlog to another machine, and then apply it there. Aurora's I/O path is therefore much shorter, which is a major innovation."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"A smaller unit of log replication means that only the physical log is shipped, not the Binlog and not the statements themselves. Shipping the physical log directly means a shorter I/O path and smaller network packets, so the overall throughput of the database is much better than that of a traditional MySQL deployment."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c3\/c3d288b72352d1a0dc0d678f2e730062.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Aurora's advantage is that it is 100% compatible with MySQL, so application compatibility is good and applications can basically use it without changes. For internet scenarios that do not need strict consistency, reads can also scale horizontally, although with both Aurora and PolarDB read performance has an upper limit."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Aurora's weakness is also easy to see: at its core it is still a single-node database. All the data is stored together, and Aurora's compute layer is really just a MySQL instance that does not care how the underlying data is distributed. If you have heavy write volume, need large cross-shard queries, or have to support very large data sets, you still need sharding. So Aurora is best described as a better single-node database for the cloud."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#00b38a","name":"user"}}],"text":"Fourth-Generation Systems: Distributed HTAP Databases"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The fourth generation is the new form of HTAP database. HTAP stands for Hybrid Transactional and Analytical Processing, and the name says it all: the same system can process transactions and run real-time analytics. An HTAP database can scale horizontally without practical limit like NoSQL and supports SQL queries and transactions like NewSQL. More importantly, in complex mixed-workload scenarios, OLAP does not interfere with OLTP, and because both run in one system, you are spared the trouble of moving data back and forth. As far as I can see, the TiDB 4.0 plus TiFlash architecture is currently about the only one in industry that meets these requirements."}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#089cbb","name":"user"}}],"text":"Distributed HTAP Database: TiDB (with TiFlash)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Why can TiDB fully isolate OLAP from OLTP so that they do not affect each other? Because TiDB separates compute from storage, and the underlying storage keeps multiple replicas, some of which can be converted into columnar replicas. OLAP requests can go straight to the columnar replicas, that is, the TiFlash replicas, which provide high-performance columnar analytics. The same data thus serves both real-time transactions and real-time analytics, which is a major architectural innovation and breakthrough for TiDB."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/2d\/2d6f51db39ff3506a9b274f59b9a4d0e.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The figure below shows TiDB test results compared with MemSQL. A workload was constructed from a user scenario; the horizontal axis is concurrency, the vertical axis is OLTP performance, and the blue, yellow, and green series are different levels of OLAP concurrency. The goal of the experiment is to run OLTP and OLAP on one system, keep raising the concurrency of both, and see whether the two workloads affect each other. On the TiDB side, performance of both workloads stays almost unchanged as OLTP and OLAP concurrency increase together. In the same experiment on MemSQL, performance degrades substantially: as OLAP concurrency grows, OLTP performance drops noticeably."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ec\/ec5df943243534560ccd7d5fee09c328.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Next is an example from a real user workload on TiDB: while OLAP queries are running, OLTP writes continue smoothly and latency stays low throughout."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#00b38a","name":"user"}}],"text":"Where Is the Future"}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#089cbb","name":"user"}}],"text":"Snowflake"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Snowflake is a data warehouse built 100% on the cloud. Its underlying storage relies on S3, and essentially every public cloud offers an S3-like object storage service. Snowflake is also a pure compute-storage-separated architecture. The compute nodes it defines are called Virtual Warehouses, which you can think of as individual EC2 units with a local cache and a log disk. Snowflake's primary data lives on S3, and the local compute nodes run on public cloud virtual machines."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/bd\/bd7a754946f0a4a92ac761e93ff78d3d.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"This shows the characteristics of the data format Snowflake stores in S3: each S3 object is a 10 MB file that is append-only, each file carries its own metadata, and the data is laid out on disk in columnar form."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/7c\/7c0e1626fcc2d1f8c80e83bcd0136097.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The brightest highlight of Snowflake is that different compute resources can be assigned to the same data. One query may need only two machines while another needs more compute, and that is fine because the data all lives on S3. Put simply, it is as if two machines mounted the same disk and each handled a different workload. This is an important example of decoupling compute from storage."}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#089cbb","name":"user"}}],"text":"Google BigQuery"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The second system is BigQuery, the big data analytics service on Google Cloud, whose architecture is somewhat similar to Snowflake's. BigQuery data is stored on Colossus, Google's internal distributed file system; Jupiter is an internal high-performance network; and on top of these sit Google's compute nodes."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ab\/ab3ff370c9f7e4a93e6da61dad973e0a.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"BigQuery's processing performance is impressive: bisection bandwidth inside the data center can reach 1 PB per second. If you use 2,000 dedicated compute units, the cost is roughly 40,000 US dollars a month. BigQuery is pay-as-you-go: a query that uses only two slots is charged for just those two slots. Storage is relatively cheap, at about 20 US dollars per month for 1 TB."}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#089cbb","name":"user"}}],"text":"Rockset"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The third system is Rockset. As you may know, RocksDB is a well-known single-node KV database whose storage engine uses a data structure called the LSM-Tree. The core idea of the LSM-Tree is a layered design in which colder data sits in lower levels. Rockset places the lower levels on S3, while the upper levels use local disk or local memory, so the structure is naturally tiered. Your application cannot tell whether a cloud disk or a local disk is underneath; good local caching hides the cloud storage from you."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Looking at these three systems, I see a few things in common. First, they are all natively distributed. Second, they are built on standard cloud services, S3 and EBS in particular. Third, they are pay as you go and make full use of the cloud's elasticity in their architecture. Of these points, I think the most important is storage: the storage system determines the design direction of databases on the cloud."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#00b38a","name":"user"}}],"text":"Why Is S3 the Key?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Within storage, I think S3 may be the more critical piece. We have also studied EBS, and as a first phase TiDB is already being integrated with EBS block storage. But from a longer-term perspective, I find S3 the more interesting direction."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"First, S3 is very cost-effective, priced far below EBS. Second, S3 offers very high reliability of nine 9s. Third, its throughput scales linearly. Fourth, it is naturally cross-cloud, since every cloud has an object storage service with an S3 API. The problem with S3 is that random-write latency is very high, although throughput is good, so we want to exploit the throughput while avoiding the risk of high latency. This is an S3 benchmark, and you can see that throughput keeps rising as the instance type gets larger."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/72\/7242911920b5301dc43929d982031ef1.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#00b38a","name":"user"}}],"text":"How to Solve the Latency Problem?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Here are some ideas for solving S3's latency problem. You can use SSD or local disk as a cache, as Rockset does, or write the log through Kinesis to reduce overall write latency. For data replication or concurrent processing, you can use zero-copy data cloning, which is another way to reduce latency."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"You may have noticed that the examples above are all data warehouses. Why is that? A data warehouse cares more about throughput and less about latency; a query that returns in five seconds is fine, and nobody requires a result within five milliseconds. For a scenario such as point lookup in particular, a Shared Nothing database may need only one RPC from the client, whereas in a compute-storage-separated architecture the request has to cross the network twice no matter what. This is a core problem."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/4b\/4b96f528450bbd1a0600be8ae89401a6.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"You might say that it does not matter: compute and storage are already separated, so brute force works wonders and you can just add compute nodes. But I think the new approach need not be that extreme. Aurora is a compute-storage-separated architecture, but it is a single-node database. Spanner is a purely distributed database, and a pure Shared Nothing architecture does not take advantage of what cloud infrastructure offers."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"For example, a future database of ours could be designed so that the compute layer carries a little bit of state. Every EC2 instance comes with a local disk, and mainstream EC2 instances now use SSDs. The hotter data can be handled Shared Nothing style in this layer, with high availability and random reads and writes done here. Only on a cache miss does a request fall through to S3, and S3 stores only the lower levels of the data. This approach has a potential problem: once a request penetrates the local cache, latency will jitter somewhat."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The benefits of this design are as follows. First, it keeps data close to compute for real-time workloads: a lot of data sits on local disk, so many performance optimization techniques from traditional databases become usable. Second, data migration becomes very simple. The underlying storage is shared and all on S3, so moving data from machine A to machine B does not require a real migration; machine B simply reads the data."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The drawbacks of this architecture are as follows. First, latency rises once a request misses the cache. Second, compute nodes now have state, so when a compute node fails, failover has to deal with log replay, which may add some implementation complexity."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/57\/5748dde74d92df6a5ac2886d1a0bca0d.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#00b38a","name":"user"}}],"text":"Many Topics Still Worth Researching"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The architecture above is only an idea. TiDB does not actually have this architecture yet, but we may experiment or do research in this direction in the future. There are still many open questions in this area that nobody has answered: not the cloud vendors, not us, and not academia."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Here are some current research topics. First, if we use local disk, how much data should be cached, what should the LRU policy look like, and how does it relate to performance and to the workload? Second, on networking: we saw that S3's network throughput is very good, so what level of performance needs what throughput, and how many compute nodes should be provisioned, especially for the reshuffle in more complex queries? Third, what is the relationship between computational complexity and the number and type of compute nodes? These are all fairly complex questions, especially when it comes to expressing them mathematically, which is necessary because all of this has to be automated."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Even if all these problems are solved, I think that is only the beginning of the cloud database era. How to design better databases along directions such as Serverless and AI-Driven is what we are working toward. To close with a line from Qu Yuan, 'The road ahead is long and far' (路漫漫其修遠兮): we still have a great deal to do. Thank you all."}]}]}