Boosting Database Performance 100x? A Database Veteran's Path to Innovation in the Big Data Era

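{"type":"doc","content":[
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What is the hardest part of building big data applications today? Many seasoned practitioners would give the same answer: processing and operating massive volumes of data efficiently. In the big data era this is a foundational problem for the database industry, and solving it is both a challenge and an opportunity."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Among the many veterans and newcomers in the database industry, we have noticed a group of “guides” who are trying to solve the problems of using and operating databases in a big data context. Yao Yandong, the subject of today's InfoQ interview, is one of them."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the third employee of "},{"type":"link","attrs":{"href":"https:\/\/greenplum.org\/","title":"xxx","type":null},"content":[{"type":"text","text":"Greenplum"}]},{"type":"text","text":", Yao Yandong spent the past decade leading the team that built Greenplum into the world's third-ranked analytical database, the best ranking yet achieved by a database product led by a Chinese team. After decades in the database field, he chose to start his own venture, co-founding with two partners a database company named 四維縱橫 (Siwei Zongheng). Why did he choose this moment to start a company? As digital transformation of traditional industries becomes the prevailing trend, what new stories are unfolding in the database field? With these questions in mind, we talked with Siwei Zongheng founder Yao Yandong about the past and future of databases, and about the challenges and opportunities ahead."}]},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"A Kind of Inertia in the Industry's Thinking"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"“Starting a company is about breaking the industry's inertia.”"}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"“If the database field is a great forest, we are the natives who know the terrain well. When people want to cross the forest but do not know which path to take, we act as ‘guides’ and help them through. Without us, they would probably just follow their own inertia.”"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"During his ten years at Greenplum, Yao Yandong started with the peripheral modules, then moved step by step toward the core and refined the kernel modules until the team had command of the entire database kernel. He says the process was much like “encircling the cities from the countryside.” Along the way he noticed a kind of inertia in the industry's thinking, and breaking it is the reason he started a company."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So what exactly is this inertia?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Take time-series workloads. The prevailing practice is to use a dedicated time-series database, with "},{"type":"link","attrs":{"href":"https:\/\/www.influxdata.com\/","title":"xxx","type":null},"content":[{"type":"text","text":"InfluxDB"}]},{"type":"text","text":" and "},{"type":"link","attrs":{"href":"http:\/\/opentsdb.net\/","title":"xxx","type":null},"content":[{"type":"text","text":"OpenTSDB"}]},{"type":"text","text":" as typical examples. Yet almost every scenario also needs a relational database, so teams end up adopting several database products, which makes the technology stack, monitoring, and operations much more complex."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dedicated time-series databases do meet some of the demand for time-series processing, but they have many problems, most notably low performance and poor scalability. Most of them were designed for simple scenarios such as server monitoring in data centers and processing of event-tracking data, so they cannot support the large numbers of data sources and metrics found in scenarios such as the Internet of Things. In addition, practitioners face low development efficiency, the need to pair these systems with an MPP database or big data products, and data silos."}]},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"The Hyper-Converged Time-Series Database Arrives"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"“I have always viewed the technical evolution of databases by analogy with biological evolution.”"}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Since its birth in the 1960s, database technology has kept evolving and iterating, driven mainly by two forces: performance and efficiency."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the 1970s and 1980s relational databases came to dominate, and practitioners relied on them to store and process application data efficiently. Around 2000, data volumes grew sharply while big data processing technology had yet to take shape. Society was still unsure how to handle such vast amounts of information, and technology could not iterate as fast as the data grew. Performance problems in large-scale data processing began to surface, and specialized databases such as time-series, key-value, and document databases appeared to take performance “from 0 to 1.” But applications now had to talk to several databases, read data from each into application memory, and then perform joins, aggregations, and merges there. Much of the data processing logic was forced into the application, and development and operations efficiency inevitably suffered."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To address the efficiency problem, products such as "},{"type":"link","attrs":{"href":"https:\/\/prestodb.io\/","title":"xxx","type":null},"content":[{"type":"text","text":"Presto"}]},{"type":"text","text":" emerged. They wrap a query engine around the specialized databases in an attempt to separate data processing logic from application logic. This approach eases the development-efficiency problem to some extent, but performance remains a weakness, and it does not fundamentally solve the complexity of the technology stack."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After nearly 50 years of evolution, existing database technology no longer meets practitioners' needs: they want databases that are simpler to use and take less effort to run. Against this backdrop, and with the aim of giving users a simple interface and truly democratizing data, Yao Yandong and his team fused relational, time-series, and analytical databases into a single product, building what is billed as the world's only PB-scale hyper-converged time-series database: "},{"type":"link","attrs":{"href":"https:\/\/www.ymatrix.cn\/","title":"xxx","type":null},"content":[{"type":"text","text":"MatrixDB"}]},{"type":"text","text":"."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/63\/63b9c6529e5fbaa81f79d84d58825366.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"What Problems Does a Hyper-Converged Time-Series Database Solve?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Today, hyper-converged time-series databases are used mainly in two kinds of scenarios: first, time-series and spatiotemporal workloads, typically in the Internet of Things, the industrial internet, connected vehicles, and smart cities; second, real-time data analytics."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"On time-series and spatiotemporal workloads, Yao Yandong described a typical IoT scenario with a huge number of devices and heavy storage demands. “Take an international manufacturer of optical fiber and 5G communications equipment. It has about 10 million devices, and each device collects 300 metrics per collection cycle, so every cycle produces 3 billion metric values in total.” For this case MatrixDB provides real-time loading of very large data volumes, keeping latency low and load concurrency high while reducing system resource consumption. This demonstrates its "},{"type":"text","marks":[{"type":"strong"}],"text":"fast ingestion and efficient storage"},{"type":"text","text":", which address both the storage of massive data volumes and the requirement for second-level collection frequency."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"On real-time analytics, Yao Yandong gave another example. In a real-time analytics deployment, MatrixDB can collect data from both the IT operations domain and the OT production domain. Once the data has been inserted through ETL\/CDC and IoT protocols, the data from the two networks is integrated, giving a clear view of all of the company's data. Analyses based on this combined data yield more accurate and more complete conclusions."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/63\/632c7897c104bd48783edd6c90af6421.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We also noticed another important MatrixDB feature: "},{"type":"text","marks":[{"type":"strong"}],"text":"modularity and pluggability."},{"type":"text","text":" A dedicated time-series database usually consists of a storage engine and a simple executor, without classic relational components such as an optimizer and concurrency control. In essence, it turns a storage engine “into” a database in order to solve one specific problem. "},{"type":"text","marks":[{"type":"strong"}],"text":"A hyper-converged time-series database instead builds the storage engine “into” the database. By making each core function modular and pluggable, it implements multiple storage engines inside a single relational database, along with joins across tables in different engines and ACID guarantees."},{"type":"text","text":" For example, of 200 tables, the 190 that hold relational data can use the relational engine and the remaining 10 that hold time-series data can use the time-series engine, and tables of both kinds can be joined with one another."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Compared with the traditional architecture that pairs a relational database with a dedicated time-series database, a hyper-converged time-series database, by supporting multiple storage engines, can deliver 10 to 100 times the performance while substantially lowering cost and improving development and operations efficiency."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e3\/e3e0b25d725a886d0f1be2344b4d13f9.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A pleasant surprise is that, beyond fast ingestion, efficient storage, real-time analytics, and its modular, pluggable design, MatrixDB as a database product also offers machine learning capabilities."},{"type":"text","text":" With the rapid development of artificial intelligence, In-Database Machine Learning has become a direction worth watching, and building machine learning algorithms into the database is expected to become mainstream. On the one hand, the parallel computing power of a distributed database lets computation run faster than on a single machine. On the other hand, memory on a single machine is limited, so with large data volumes training has to use a sample, and model accuracy suffers. With In-Database Machine Learning, models can be trained on the full data set, which can improve accuracy."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"“In the past, practitioners had to write their own programs to do machine learning,” Yao Yandong observed, adding that the technical barrier was fairly high. “Today, by providing a SQL interface directly, MatrixDB greatly lowers the barrier to machine learning and can ease the talent shortage to some extent.”"}]},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"What Comes Next?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"“We will keep pushing on both performance and efficiency, and move in a more intelligent direction.”"},{"type":"text","text":" In concrete terms, Yao Yandong said the company will focus on two areas: improving ease of use and building an ecosystem."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Database operations are a well-known challenge for practitioners, so much so that they have given rise to a profession of their own. In distributed database environments in particular, the large number of nodes and the diversity of requirements make operations considerably harder. Yao Yandong said: “Going forward, we will keep working to lower the barrier to using databases, so that the database itself can provide features such as automatic performance tuning and health checks.”"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"On building an ecosystem, he also offered a higher-level view. A database is foundational software: nobody can solve a business problem with a database alone, and it delivers its real value only in combination with many surrounding products. For a database product, the importance of the ecosystem is therefore self-evident. “Without an ecosystem, we would in effect be pushing the complexity onto users. Building an ecosystem together with upstream and downstream partners in the industry is our next direction.”"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"“Databases can define the memory of the future.”"},{"type":"text","text":" Because MatrixDB is used mostly in scenarios such as the Internet of Things, connected vehicles, the industrial internet, and smart living, Yao Yandong also shared with InfoQ his view of databases in an era where everything is connected. The purpose of connecting everything is greater intelligence, and intelligence presupposes memory, yet things such as wind turbines and smart wristbands have no memory of their own. “In the future we hope to use the hyper-converged time-series database to give memory to devices that lack it, opening up more possibilities for intelligence.”"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What will intelligence look like in an era where everything is connected? There is no settled answer yet, and one will emerge only as the technology continues to evolve. What is certain is that giving things the ability to remember first will lay the foundation for the intelligent era that is not far off."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As for the future of databases and of Siwei Zongheng, Yao Yandong hopes to make data processing as simple as using electricity, gas, or water, and to build MatrixDB into a true one-stop data processing platform, so that practitioners no longer need to worry about the complexity of the underlying storage and computation. This is the direction Siwei Zongheng is exploring, and the ultimate goal the industry is working toward together."}]}
]}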