Is Now a Good Time to Adopt the Lakehouse?

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Big data unicorn Databricks recently announced its Series H "},{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/article\/s2a1EQomEYyaoktLPvdT","title":null,"type":null},"content":[{"type":"text","text":"funding round"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":". The $1.6 billion round lifted its valuation to $38 billion. Ali Ghodsi, Databricks co-founder and CEO, told the media that the money will mainly go toward accelerating the company's push into the lakehouse space."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"As an emerging data architecture, the "},{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/theme\/106","title":null,"type":null},"content":[{"type":"text","text":"lakehouse"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" has become hotly contested ground. The players range from startups such as Databricks and Snowflake to top cloud vendors such as Amazon, Google, and Alibaba. Can the new lakehouse architecture really be put into production? What adoption paths are feasible? Where do the costs come from? Which other trends in next-generation data architecture deserve attention? We put these questions to "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#171a1d","name":"user"}}],"text":"distributed systems and big data platform expert"},{"type":"text","text":" Guan Tao, who shared his own insights and perspective."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Different Approaches to the Lakehouse"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: The converged data lake and data warehouse architecture (the lakehouse) is one of the most important topics in big data today. The major cloud vendors have each put forward their own technical solutions, and several open source projects are very active. In your view, has the industry agreed on a definition of the lakehouse? What are the key differences between the lakehouse solutions that different vendors are promoting?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Guan Tao: "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"I think the industry is highly aligned on the overall direction of the lakehouse. Many vendors are pushing the concept hard. Databricks, for example, now positions its entire product as a Lakehouse Platform for analytics and AI, and Snowflake and Redshift have also shifted a great deal of focus and resources toward the lakehouse. In that sense, everyone has recognized that data lakes need better management capabilities and data warehouses need more flexibility."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Under that shared direction, though, vendors' implementation paths do differ, and the differences follow directly from each vendor's product direction and technical foundation. Databricks, for example, grew up around the data lake, so it mostly talks about moving from the lake toward the warehouse and eventually arriving at the lakehouse. Vendors such as Redshift, and MaxCompute as well, are centered on the data warehouse, so they move from the warehouse toward the lake."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"For warehouse-centric vendors, the lake and the warehouse usually sit side by side: a data lake on the left, a data warehouse on the right, with data flowing between the two. For lake-centric vendors, the warehouse is often built on top of the data lake, which is more of a layered, top-and-bottom structure."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: In an earlier article on the lakehouse you said that “the data lake has nothing to do with moving to the cloud.” Does that mean the lakehouse does not necessarily require the cloud either?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Guan Tao: "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"I think this question has two levels. On the first level, the most standard definition of a data lake is simply a basic file storage system: when you store files or data, you do not need to care about the data format or do complex modeling and data management. By that most traditional definition, the data lake really has little to do with the cloud. An on-premises Hadoop deployment, even one that does not separate storage from compute, counts as a data lake as long as it has the distributed storage properties just described."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"On the other level, the modern definition of a data lake is strongly tied to the cloud. The modern idea is to host storage on cloud storage so that users, through managed services, avoid the very complex work of operating and supporting a distributed stateful system. From that angle, modern data lakes are almost all built around the cloud."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"As for whether the lakehouse necessarily requires the cloud: not necessarily. When we say “moving to the cloud” we mostly mean the public cloud, but many users run their own private clouds on premises, and a lakehouse can be deployed on such a platform too. Looked at another way, the essence of the cloud has become an abstraction and definition of infrastructure. If the technology is abstracted into multiple layers, it is still a cloud architecture; whether it is deployed in the user's own data center or in a large cloud vendor's data center is only a difference in deployment form."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Adoption Paths and Costs"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: Most companies already have a big data architecture of their own. How can they adopt the lakehouse on top of what they have? What adoption paths are feasible? Where are the main costs likely to come from?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Guan Tao: "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Some companies already have their own big data architecture. These companies tend to be older, and most chose the Hadoop stack, either self-built or managed Hadoop in the cloud. They have many options: they can choose an approach like Databricks', or an approach like MaxCompute's."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Both paths are reasonably feasible, so how do you choose? It usually depends on whether the company wants to invest more in its big data technology stack. If the company sees no need to put many resources into infrastructure and would rather spend them on the business, a more fully managed lakehouse solution offers more value. Conversely, if the company has plenty of engineers and wants its underlying infrastructure to be flexible and under its own control, it can choose the model of building a warehouse on top of the lake."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"There are also newer companies, for example those founded within the past three years, many of which are in a high-growth phase. These companies were essentially born in the cloud, and the big data architecture they chose from the start may already be a cloud data warehouse. For them, evolving from the current architecture is relatively simple: as long as they use cloud infrastructure wherever possible, enabling a few cloud services is enough to form a lakehouse architecture. It is a simple, direct, and relatively uniform path."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"So where do the costs mainly come from? If a company chooses a fully managed lakehouse solution, the costs are mainly one-off expenses for handling existing data, such as warehouse migration and data cleanup. Once that work is done and data governance settles into a virtuous cycle, the overall cost will not be very high. If a company chooses to maintain a lakehouse architecture itself, the costs are mainly the staffing and hardware needed to continuously maintain and tune the whole infrastructure."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: As far as you know, what are the main problems and challenges companies run into when they try to adopt the lakehouse? Is now a good time to adopt it?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Guan Tao: "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Most companies today have not yet adopted the new lakehouse architecture; they have chosen either a data lake or a data warehouse. As an emerging architecture, the lakehouse is still at an early exploratory stage for many companies. Some companies put their data into a data lake and then find that data governance and data management on the lake are fairly hard. At that point they adopt the lakehouse model, adding a warehouse layer and a governance layer on top of data that is more flexible but less well managed, in order to manage and govern it better. For warehouse users, if the warehouse system they use supports the lakehouse architecture, they can simply mount the data lake."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Companies trying to adopt the lakehouse face several main problems and challenges. First, if the team does not have solid experience in data governance or data management, the challenge is considerable. That is why almost all the solutions we offer are moving toward a fully managed or full-service SaaS model: the aim is to lower the barrier."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Second, for companies that build their own lakehouse, the main challenge is its high complexity, especially how the lake and the warehouse work together. This involves connecting the storage of two systems, keeping metadata consistent, cross-referencing data between different engines on the lake and the warehouse, as well as bandwidth, security, and so on. In addition, because the lakehouse architecture is a dual system underneath, will users see two systems when it is presented to them? If they can, how do you distinguish the two and guide users between them? If they cannot, what kind of encapsulation does the underlying development need to provide? These are all problems a self-built lakehouse will face."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"In short, if a company is not set on investing heavily in infrastructure, adopting a fully managed lakehouse architecture is much simpler."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Finally, the lakehouse is still an emerging direction and many questions are still being explored, such as which data belongs in the warehouse and which in the lake. For now it is better suited to companies with some appetite for exploration and innovation."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Where Next-Generation Data Platform Architecture Is Heading"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: How do you see the future of the lakehouse? What opportunities and challenges lie on the road to its wider promotion and large-scale adoption?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Guan Tao: "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The rise of the lakehouse is fundamentally driven by user demand. People want better data governance and management and, at the same time, more flexibility. With the rise of AI in particular, the two-dimensional relational tables of a pure data warehouse can no longer handle semi-structured and unstructured data, and AI engines cannot run only on a pure warehouse model. So the lakehouse is certain to be the future trend. Warehouse builders will take on more data lake characteristics, lake builders will take on more warehouse characteristics, and in the end each will find a balance in between according to actual needs."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Of course, challenges are unavoidable. Compared with a standalone data lake or data warehouse, a lakehouse system really is more complex. That includes the issues mentioned earlier of overlapping data between the lake and the warehouse and of consistency, plus the question of whether the metadata systems should be connected or unified. Turning what was a dual system into an integrated one has very complex implications for the technical architecture. Moreover, accessing a data lake already runs into issues such as the bandwidth cost of separating storage from compute; with a warehouse alongside it, the problem gets worse."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Adopting the lakehouse at scale places very high demands on the design and maintenance of the whole system. That is why many cloud vendors are trying to offer more managed, service-oriented lakehouse platforms that hide the differences between the underlying systems. How to deliver lakehouse capabilities while making the system simpler and more robust is a challenge that large-scale adoption will inevitably face."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: Moving data services to the cloud is already an irreversible trend, but the industry has also started to discuss "},{"type":"link","attrs":{"href":"https:\/\/a16z.com\/2021\/05\/27\/cost-of-cloud-paradox-market-cap-cloud-lifecycle-scale-growth-repatriation-optimization\/","title":null,"type":null},"content":[{"type":"text","text":"cloud repatriation"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":" (moving to the cloud and then pulling back out). The core argument is roughly that “the cloud does optimize resources early in a company's life, but the benefit weakens as the business scales and grows, until the burden eventually outweighs the gain.” What do you make of this view? Is cloud cost something companies need to consider in advance when choosing cloud data services?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Guan Tao: "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"That argument is certainly out there in the industry, but I am a fairly committed believer in the cloud. Why choose the cloud? I believe that as society and technology develop, technical layering brings higher efficiency, while consolidation and scale bring better economics."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Take a simple example. Compare building a data center of 50 machines with building one of 5,000 or 50,000 machines: the average purchase cost per machine differs enormously. At the scale of a 50,000-machine cloud platform, you can even optimize site selection, water and power costs, and the surrounding service ecosystem. Purchasing at that scale also gives you much stronger bargaining power with hardware vendors. If you want a customized CPU, for example, chip makers will accommodate you once the volume is large enough. So under the combined effect of consolidation and technical layering, the raw cost in the cloud is bound to end up lower."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"If some users feel that the cloud costs more than building their own on-premises IDC, two factors are usually involved. First, server quality in the cloud is relatively high. Cloud servers are typically replaced every 3 years, for example, whereas an on-premises IDC might keep them for 5 years or longer. Different machine lifespans mean different SLAs. If you look only at the price on paper, the cloud may seem more expensive, but once the SLA is fully taken into account, it is not."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Second, people often compare only the price of the hardware itself when judging whether costs are high or low, but building a cloud platform and the cloud services on top of it takes a great many people. Take a system of 100 physical machines (or 1,000 virtual machines, equivalent to 100 physical machines). I estimate the annualized cost, electricity included, at roughly 30,000 yuan per machine, so 100 machines come to about 3 million yuan a year."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"But if the user builds an on-premises IDC, then beyond hardware, deploying and operating everything from the lowest layer up requires people: 100 physical machines basically need about 3 to 5 engineers. Even if those 5 engineers are not especially expensive, the staffing cost is around 3 million yuan. So for resources at the scale of 100 physical machines, the final TCO (total cost of ownership) of building on premises is actually far greater than the hardware cost you see."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"From this perspective, I believe that for the vast majority of companies, moving to the cloud is bound to reduce overall cost. That cost is not just hardware; it also includes services, staffing, and system security, and those latter parts are substantial."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: The "},{"type":"link","attrs":{"href":"https:\/\/archsummit.infoq.cn\/2021\/shenzhen\/track\/1085","title":null,"type":null},"content":[{"type":"text","text":"track"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":" you are leading at ArchSummit Shenzhen will cover where next-generation data platform architecture is heading. Could you give us an overview of the main directions, and explain why they deserve particular attention?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Guan Tao: "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Big data platforms have been developing for 20 years now, and many areas of the architecture have entered a relatively stable, mature phase. Still, I think several trends are worth watching."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The first trend is "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"the rise of real-time and near-real-time architectures."},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" Big data platforms were originally built on offline computing. Later, with the rise of stream processing, real-time computing centered on streaming took shape. Later still, as interactive engines such as ClickHouse matured, an architecture emerged in which real-time computing handles the front-line data processing and an interactive analytics engine handles very fast front-end serving. That is a standard real-time architecture."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Over the past year or two, two open source projects have become very popular: Delta and Apache Hudi, which come from the Spark and Hive communities respectively. Seen this way, Spark and Hive, the two flagships of offline big data processing, have both begun shifting toward near real time. Delta and Hudi can greatly reduce data write latency, moving data processing from offline toward near real time."},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":" We believe a near-real-time architecture will take shape between offline and real-time processing. It costs less than real-time computing yet is timelier than offline computing; think of it as another balance point between cost and latency for users."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"I think most offline computing will shift to near real time in the future. At the same time, for many workloads that currently have to use real-time computing, people may decide on reflection that they do not need to pay that much and can simply use a near-real-time architecture."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The advantage of a near-real-time architecture over a real-time one is that for writes such as upserts it mostly avoids structures like MemTable and LSM, which are more expensive but also more real-time, and stays file-based, merging files in small batches. Its latency may range from seconds to minutes, and the time from a write to the data actually becoming visible may be on the order of minutes. The exact latency depends on how the system is implemented, but data is definitely not readable the moment it is written. For most applications, though, an end-to-end latency of 5 minutes may well be acceptable if the cost is low enough."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"With the rise of near-real-time architectures, computing will eventually cover a full spectrum, from pure offline through near real time to real time. That will probably take about another year."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The second trend: around this time last year, the state designated data as a means of production as well. As a result, many organizations that hold data, whether companies or government agencies, are looking for ways to get more value from it, which involves data exchange, data sharing, and the security and privacy protection behind them. I believe "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"data security, sharing, and privacy protection"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" will become a hot area."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"On one side there is strong demand to monetize and trade data; on the other there are strict security and compliance requirements. Very active technology will emerge between the two. Sharing data more safely covers both how a single party applies differential privacy and encryption to its data and how multiple parties, without sharing their data, do federated learning through techniques such as feature extraction. We can already see many large cloud vendors and small and mid-sized startups investing substantially in this area."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The third trend is "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"the collection and processing of IoT data"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":". Today, the vast majority of big data growth comes from logs of human behavior. The data first stored in databases was usually transaction data: small in volume but very valuable, such as bank transactions, bills, and ledgers. Over the past decade or so, big data growth has been driven mainly by human behavior logs. Roughly 80% of data comes from behavior logs, and only about 20% from transactions. As the Internet of Things and smart devices take off, more and more device data will flow in, and the volume of device data may reach the tens or hundreds of billions, so how to collect and process IoT data better will become a new hot trend."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"IoT data pushes data processing requirements a step further. Traditional transactional databases hold a small amount of very valuable data, so you can afford expensive methods and equipment to process it. Big data that came later is massive in volume and lower in value, which gave rise to architectures like MapReduce. Data from IoT devices may be several orders of magnitude larger again, and its value is often extremely low; most sensor data, for instance, is almost worthless as long as nothing goes wrong. This will give rise to different optimizations and trade-offs in data processing, and to infrastructure deployed from the cloud out to the edge and devices. If massive amounts of data are generated on end devices, for example, the devices need some processing capability of their own to filter out useless data first, and then send periodic samples and anomalous data to the cloud for centralized processing."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The last trend concerns "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"AI for System"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":". As a tool, AI can do a great many things, and one of them is optimizing big data systems. As data grows, the work of maintaining data tables increases sharply. At data-rich companies it is common for a single in-house data engineer to maintain more than ten thousand tables. But someone maintaining that many tables can hardly understand the detailed state of each one. The traditional human-centered model of data development and optimization, the DBA model, is close to unsustainable."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"So how can big data systems tune and optimize themselves better? Put briefly, "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"getting big data systems to drive themselves is certainly the way forward. It may also become a core direction in the evolution of the middle platform"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":". Today, middle-platform governance usually gives people tools, metrics, and recommendations, and people do the work. But more and more governance metrics are now handled automatically by machines, so within specific domains big data systems have arguably achieved an initial self-driving capability. We expect that capability to keep improving, perhaps moving gradually from Level 3 to Level 5 autonomy."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"On November 12-13, at "},{"type":"link","attrs":{"href":"https:\/\/archsummit.infoq.cn\/2021\/shenzhen\/schedule?utm_source=web&utm_medium=infoq&utm_campaign=9&utm_term=0917&utm_content=yueduyuanwen","title":"xxx","type":null},"content":[{"type":"text","text":"ArchSummit Global Architect Summit (Shenzhen)"}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":" 2021, Guan Tao will give a keynote titled “What Is Today's Big Data Stack?” He has also curated the "},{"type":"link","attrs":{"href":"https:\/\/archsummit.infoq.cn\/2021\/shenzhen\/track\/1085?utm_source=web&utm_medium=infoq&utm_campaign=9&utm_term=0917&utm_content=yueduyuanwen","title":"xxx","type":null},"content":[{"type":"text","text":"“Next-Generation Data Architecture Born for the Cloud”"}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":" track, inviting experts from Snowflake, Databricks, Alibaba, NetEase Yanxuan, and other companies to discuss lakehouse adoption paths and the future evolution of data architecture."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"About the interviewee:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#171a1d","name":"user"}}],"text":"Guan Tao is an expert in distributed systems and big data platforms. He previously served as a researcher in Alibaba Cloud's Computing Platform Business Unit and as head of Alibaba's general-purpose computing platform, responsible for Alibaba's mainline big data platform. He also led the computing platform domain group of the Alibaba and Ant Group technical committees and the big data group of the Alibaba Cloud architecture group. Before returning to China to join Alibaba Cloud, he worked for 9 years in Microsoft's Cloud and Enterprise division, where he led and contributed to several ultra-large-scale distributed storage and computing platforms, including Azure Datalake, Cosmos\/Scope, Kirin Store, and SearchRepository. He has authored a number of conference papers and patents in China and abroad."}]}]}