Big Data + Cloud: What Do the Leaders of Kylin/Spark/ClickHouse/Hudi Think?

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At the recent Kylin fifth-anniversary celebration, the most popular session was this roundtable. Leaders from open-source communities including Spark, Hudi, ClickHouse, and Kylin held a conversation “in the cloud” that spanned time zones and regions."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What trends are shaping the next generation of cloud data analytics products? Which key technologies do the panelists favor? Everything you want to know is in this article."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Hot Topics"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What exactly are lake-warehouse integration (湖倉一體) and Lakehouse?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Is separating compute from storage now the inevitable direction?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Public cloud, private cloud, or hybrid cloud?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What makes moving data to the cloud and managing multiple clouds hard?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How can data governance and data security be achieved?"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"👏 Meet the panelists 👏"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"stron
g"}],"text":"Moderator | Li Yang (李揚)"},{"type":"text","text":": CTO of Kyligence, Apache Kylin PMC member"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Li Xiao (李瀟)"},{"type":"text","text":": Spark R&D Director at Databricks"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Guo Wei (郭煒)"},{"type":"text","text":": CTO of Analysys (易觀數科), who runs the ClickHouse Chinese-speaking community and Apache DolphinScheduler operations"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Li Shaofeng (李少鋒)"},{"type":"text","text":": Apache Hudi Committer & PPMC"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Shi Shaofeng (史少鋒)"},{"type":"text","text":": Apache Kylin PMC 
Chair"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/67\/36\/67ea98efaa5c97ef63ffa0c7928e1a36.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Watch the full session recording here 👇"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/www.bilibili.com\/video\/BV1BU4y1x7G3?from=search&seid=16883316961221610220","title":"","type":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Full video"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Read on for the full transcript of the session."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Li Yang: We know there have been many new concepts recently, such as the Lakehouse proposed by Databricks, the lake-warehouse integration (湖倉一體) that Alibaba talks about, and the vectorized computation optimizations that ClickHouse represents. At the big-picture level there have been many recent changes and trends. Could each of you share, from your own perspective, which trends in data analytics you are optimistic about?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Li Xiao"},{"type":"text","text":": At the end of 2019, our company, Databricks, had already begun promoting the idea of unifying the lake and the warehouse, which we call the Lakehouse 
concept."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/e2\/f2\/e29d5195352c037a354f66c8098024f2.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Lakehouse concept was introduced against a broad backdrop: the boundary between data warehouses and data lakes is increasingly blurred, especially in cloud-native environments. When people build a new data processing platform or data architecture, they generally do not copy an earlier data warehouse or data lake architecture unchanged, because those existing architectures have many obvious flaws. When you build a new architecture from scratch, you essentially keep the best parts and discard the rest. That is why the Lakehouse concept was proposed. Of course, it is a direction, and much still needs continual innovation and rebuilding, especially under today's public cloud framework."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With cheap, reliable storage such as Object Store, plus the experience and techniques accumulated over decades of lake and warehouse development, we believe more and more new projects will emerge, including our company's Delta Lake and the ongoing work in Hudi. That is my observation on data lakes and the Lakehouse."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Li Yang: From the Lakehouse perspective, what is the biggest difference in the value it delivers to users compared with before?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Li Xiao"},{"type":"text","text":": The biggest differences show up most clearly in two areas."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The first is how it compares with the traditional data warehouse. The main interface of a data warehouse is usually SQL, while on the data lake side there is every kind of language. The Lakehouse essentially answers this with so-called multi-source support, following user needs: for example, for systems that need to support AI or 
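BI, we need to provide the languages of their ecosystems, rather than forcing users to express everything in SQL or to extend the SQL standard."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The other point is data copies. If you use a data lake and a data warehouse at the same time, the same data has to be copied multiple times, which is a big challenge for the whole stack, whether in storage cost, timeliness, consistency, correctness, accuracy, or security. Separating storage from compute is also a clear trend. The Lakehouse will inevitably take that path rather than go back down the old road."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Li Yang: Guo Wei, from your point of view, how are the trends in data analytics changing?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Guo Wei"},{"type":"text","text":": ClickHouse is currently doing a great deal of low-level optimization at the CPU instruction level."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Looking at the overall trend, people used to think of big data as an application-layer concern, but big data is now moving further and further down the stack. Recently in particular, in both the community and enterprise use, the trend is that we need to integrate with what is already there, whether that is hardware, as ClickHouse does, or the cloud, as cloud-native big data does. There are quite a few difficulties in that. Cloud native goes back a long way: Mesos came out of the AMP Lab back then, alongside Databricks. So why do we feel it still has not really taken off? Because there are several fairly high technical barriers."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The first is that, in the past, we computed under what we call pre-allocation. When a job starts, we decide up front roughly how much CPU and memory it gets, and then run the computation. In real cloud-native systems, though, allocation is usually dynamic and can grow or shrink with the current overall load. For big data this is not like application-level frameworks such as SpringCloud; scaling out and in is not that easy, so this poses a certain difficulty for us."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The second is storage. In the past everyone called it share 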
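nothing: storage and compute sit together, nodes avoid doing much work for one another, each finishes its own small piece, and a shuffle happens at the end. With cloud native, however, it all becomes a single storage layer. As Li Xiao just mentioned, underneath it may be an integrated lake and warehouse, cheap cold storage, or fast hot storage, and the question is how to make things run faster on top of it."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The third, as I just mentioned, is high-speed computation at the CPU instruction level, whether that is vectorized computation, some combination with cloud native, or CPU computation wrapped by cloud native. These are the three mountains standing in front of the big data community as it heads toward cloud native. Overall, over the next five years more and more enterprises, on private or public cloud, will certainly choose to pool their hardware, CPUs and storage included, and then allocate it out. These are the things we need to tackle in the coming years."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I see some open-source projects, such as Ozone and Alluxio, and Kylin too this time, doing similar work. I believe some excellent projects will emerge in the next few years."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Li Yang: Let's look forward to that together. Li Shaofeng, from your perspective, or your community's, how do you see the trends in data analytics?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Li Shaofeng"},{"type":"text","text":": I'll pick up from Li Xiao on Lakehouse and lake-warehouse integration. Lakehouse was proposed earlier by Databricks. Compared with the lake-warehouse integration proposed by Alibaba Cloud, the two are similar, but I think there are still some differences."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What problems does the Lakehouse proposed by Databricks mainly solve? Data duplication, transaction support, serving different workloads on the same copy of stored data, and end-to-end support for reporting. Lake-warehouse integration, by contrast, is more about seamlessly connecting the metadata of the lake and the warehouse, and then providing a fairly unified development experience across both. Alibaba's lake-warehouse integration perhaps starts more from the project (shared metadata) level, whereas Databricks' 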
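Lakehouse is defined more in terms of architecture and a set of features. As I understand it, the two do differ somewhat."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To implement a Lakehouse architecture, you would probably build on Delta, Hudi, and the like; Databricks implements Lakehouse with the help of the three frameworks. For lake-warehouse integration, however, you may not need these three."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For example, Alibaba Cloud's MaxCompute accessing data in OSS, that is, data in the lake, sits further toward the back end, and that is lake-warehouse integration. Lakehouse perhaps puts more emphasis on the front-end architecture. Either way, it is still a fairly new architecture."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hudi, including the team I am on, is also working on Lakehouse-related things. We are still feeling our way, because no enterprise has truly put a Lakehouse architecture into production yet; Alibaba Cloud included, it is still at the exploration stage. I think it is a promising direction that can later serve customers on Alibaba Cloud. Many of the customers we talk to have Lakehouse-related needs, so this is a trend as well."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Li Yang: This is very interesting. I had heard from various channels that Lakehouse was first proposed by Databricks, and later I heard through other channels about Alibaba's lake-warehouse integration (湖倉一體). At the time I wondered whether the two terms were simply Chinese and English translations of each other. Hearing everyone's explanations today, it suddenly makes sense. From what you have said, although the two terms look alike and may in practice address the same problem, from a technical standpoint their starting points and architectural designs are different."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"What I heard Li Xiao emphasize earlier is the user's perspective: how to unify the use of all kinds of AI and BI 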
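tools, and the data being a single shared copy. Li Shaofeng spoke more from the metadata perspective, bringing the lake and the warehouse together organically."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Now let me turn to Shi Shaofeng. From your point of view, how do you see the trends in data?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Shi Shaofeng"},{"type":"text","text":": Li Yang just mentioned that the integrated lake-and-warehouse idea was probably first proposed by Databricks. Indeed, Databricks' Lakehouse gave everyone a fresh perspective, or made people start to take this trend seriously. "},{"type":"text","marks":[{"type":"strong"}],"text":"Looking back at history, though, I think the first to put this trend into practice was probably Snowflake, which started architecting its cloud data warehouse around 2014 or 2015."},{"type":"text","text":" And its data warehouse was designed entirely around cloud architecture and technology. In hindsight, it is very similar to, or even matches, what we are doing with lake-warehouse integration and Lakehouse, and what the various technical camps are doing."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"All of these are built on the cloud, on public cloud data lakes, then introduce a lightweight storage layer that provides ACID, snapshots, and unified batch and stream processing, and then adopt an architecture that separates storage from compute, so that the two can scale independently, resources can be allocated dynamically, workloads are isolated from one another, and multi-tenancy and the like can be supported."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I am glad to see that it is not just one vendor; more vendors, including Databricks, Alibaba Cloud, and Huawei, are all working on Lakehouse or lake-warehouse integration analytics. For Kylin, we feel this trend very clearly too. More and more of our users want Kylin to connect directly to their cloud storage, ideally something like Hudi or Delta Lake, whereas in the past it was mainly relatively static storage such as Hive."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"On the analytics side, Kylin currently serves mainly BI analysis, so AI analysis probably cannot be plugged in for now. But the demand for BI has always been there, and even as the trend moves toward Lakehouse, Kylin will still have room to survive. "},{"type":"text","marks":[{"type":"strong"}],"text":"With Kylin's 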
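high performance and high concurrency, we meet business users' need to explore data. However things change, the essentials stay the same: whether the layer underneath is a Data Warehouse or a Lakehouse, I believe the layer above still needs a Datamart like Kylin."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Li Yang: So far we have talked about the big trends everyone is optimistic about. People also care about which key technologies, under these trends, can support them or turn them into value. I think that is the question more technical people care about. Li Xiao, from your point of view, for Lakehouse or other trends, what do you think the next key technology is?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Li Xiao"},{"type":"text","text":": From my perspective, or my team's, on the trends in analytics technology and on current architecture, people seem to have reached something of a consensus."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, "},{"type":"text","marks":[{"type":"strong"}],"text":"the big trend of the future will be on the public cloud. Whether data is migrated from a private cloud or newly generated, this will be the overall direction."},{"type":"text","text":" On the public cloud, open source or closed source, as Shi Shaofeng said, the closed-source Snowflake started working in this direction long ago."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So on this front we believe that "},{"type":"text","marks":[{"type":"strong"}],"text":"open source, especially open source at the storage architecture level, is the way things are heading. It has many benefits; in particular, users can migrate data more easily, and external applications, especially the wide range of third-party libraries in the ecosystem, can access and use the data more easily."},{"type":"text","text":" That is also why, to some extent, the aim is to support AI scenarios and not just the SQL language. We can see that Snowflake, reportedly, is experimenting with things such as unified batch and stream processing, because they also want to support stream processing. But if stream processing is done only by extending the SQL language, ecosystem integration and use of the data become extremely hard."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If you do SQL, you will find that although SQL is a standard, few vendors support the latest SQL standard, and for streaming there is no established standard either. In other areas, such as AI, how to plug in 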
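Python and how to integrate with the ecosystem are currently blank spots, and the standardization cycle is also very long. We believe the choice of language should embrace users rather than be defined by vendors."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As for separating compute and storage, we also believe the two should be separate rather than bound together, and this trend is very clear now. On the closed-source side, Snowflake's compute and storage, for example, are currently bundled, and we cannot know how they partition compute. On the compute engine side, we believe a Vectorized Engine is also a good direction. In open source, ClickHouse is building a Vectorized Engine, and our internal, non-open-source Delta Engine is also a vectorized processing engine. It is written in C++, not on the JVM. We believe that for system-level software like this, C++ is a better language, or a more stable way to implement it."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Li Yang: You mentioned that, from the user's perspective, streaming analytics is a much-needed capability, but that on the consumption side SQL may not be the best fit. In your view, which user interface or language is better suited to consuming streaming analytics data?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Li Xiao"},{"type":"text","text":": My personal view is that this should depend on the ecosystem. In today's overall data architecture there are basically two big directions, AI and BI. "},{"type":"text","marks":[{"type":"strong"}],"text":"On the AI side, the most popular language we see is Python. In terms of user adoption and community, Python is essentially dominant, and its growth is rocket-like; even real estate developers in China are learning Python"},{"type":"text","text":", which tells you how simple and easy to use it is, with a huge number of library functions. Streaming itself has nothing inherently to do with Python; any language works, including Scala and Java. But from the data provider's point of view, when it comes to delivering data to consumers, we believe Python is where things are heading. On embracing Python: when the Spark community released 3.0, it put particular emphasis on SQL and Python. We believe these are the two mainstream languages."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Li Yang: Streaming is really a form of data 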
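processing, a way of handling data, and it is not directly coupled to consumption. But Li Xiao's point makes a lot of sense: from the user's perspective, ease of use and accessibility come first. Guo Wei, from your point of view, which technologies are more likely to shine under this trend?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Guo Wei: The first, as I said just now, is cloud native. Here I differ slightly from Li Xiao: I don't think it has to be the public cloud. Cloud native may be hybrid cloud, or it may be private cloud."},{"type":"text","text":" Right now, for some enterprises, especially in China, it will be a long time before they can move entirely to the public cloud, yet their private cloud build-out, such as internal cloud-native platforms and internal K8S-style private clouds, is becoming more and more aggressive. "},{"type":"text","marks":[{"type":"strong"}],"text":"So I think future big data development may take a converged form that spans both public and private clouds."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As people say, Snowflake has been especially hot lately, and it takes a cross-public-cloud approach. How to cover both public and private clouds is, I think, a technology with quite a few challenges ahead, but it will be a trend."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because cloud-native big data today all separates storage from compute, where your storage ultimately lives is, I think, not that important. Network bandwidth will only keep getting better, so how to handle security and encryption well in between is, to me, the first cloud-native-related technical point."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Second, in the big data field, and especially since everyone here comes from an open-source community, "},{"type":"text","marks":[{"type":"strong"}],"text":"I think using an open-source playbook to lead ultimately to cloud-native services is also a very important trend for the future"},{"type":"text","text":". Through open source you can find more user scenarios and more potential users, and ultimately capture business value through commercial services or commercial product improvements. For the near future, especially for those of us here from big data and open-source communities, this will be a very good path."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The third is slightly different from a technical point. For enterprises now, and especially for what we will be building, on top of the lake-warehouse integration everyone mentioned and the cloud native I just talked about, there is another very important thing: data governance. I quite agree with what Li Shaofeng just said: 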
"},{"type":"text","marks":[{"type":"strong"}],"text":"其實現在很多時候,阻礙企業去大面積使用雲原生,或者是 Lakehouse 這樣的東西,往往不是由於技術原因,而是在於它的數據管理能不能夠做的更好"},{"type":"text","text":" ,這個數據能不能更快去做相關的融合,中間再出現各種各樣新型的數據源。最近像我們易觀在做的用戶行爲,這些點擊流,相信將來還會有線下 IoT 這些物聯網的數據上來,每次有新的這些數據進來以後,怎麼能用更好的技術方式,把它描述好,然後融合到現有的 Data Warehouse,或者是自己的大數據平臺裏,或者是將來 Lakehouse 裏,我覺得這個的挑戰不亞於技術本身的挑戰。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所以我覺得這三個方面,都是未來我們這些從事數據的人員要去攻克的。雲原生、開源,以及這種業務相關的數據治理的平滑過渡,這個是有挑戰的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"李揚:雲原生開源和數據治理,其實前面兩位老師都提到了一點點流,和 IoT 這個大趨勢也有關係。那從 IoT 的角度,就是數據流的角度,您覺得這裏的治理技術,可能目前還比較模糊,對於數據治理,大概有哪些技術在這個方向上是有潛力的?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"郭煒"},{"type":"text","text":" :IoT 的數據,因爲當年從傳統的,我們自己的這種結構化數據到大數據,就是點擊流的時候,中間它的數據更加的稀疏了,到 IoT,數據就更加稀疏了,裏面我們叫做有數據熵的這些數據,其實它變得更加稀少。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在我看來,有幾個會要去做的,第一點就是 IoT 數據雖然種類繁多,但是針對每一個 IoT,它的行業,再抽象到概念模型,我覺得是很重要的。如果不抽象成概念模型統一起來,就會迷失在大量的 IoT 各種各樣的繁複數據裏面,你存了很多數據,可能不是大數據,那是“大垃圾”,因爲數據太多了,裏面沒有什麼能用的。你首先預定義一些,你想到的,根據這個行業,甚至某種類型設備裏面的概念數據模型,在這個數據模型之上,再通過批流一體的方式,其實不是所有的數據都要進入到最後的這個 Lakehouse 也好,數據倉庫也好。 
"},{"type":"text","marks":[{"type":"strong"}],"text":"它其實通過邊緣計算,在你的邊緣端,就做了一部分數據的整理。那麼其實不會把所有數據全都放到雲端來浪費網絡,浪費存儲。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"怎麼去表現這件事情,我覺得就是剛纔說第三點,數據治理。對於不同的業務,不同的行業,再到你去要哪些對你數據有價值的這些東西的時候,可能要有一些預先的想法,至少你的概念模型得想好。才能通過邊緣計算和其它方式,把這些這麼大量的數據,能夠有效地進行預處理,再有效地進行存儲,我覺得這個是對 IoT 數據的一些處理方法吧。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"李揚:李少鋒老師這邊,從阿里或者 Hudi 社區的角度,你是怎麼來看下一輪的一些閃亮的技術點?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"李少鋒"},{"type":"text","text":" :從我目前工作和社區來看,有以下幾點。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一個就是數據組織技術的發展,比如 Delta、Hudi、Iceberg 就是有代表性的開源數據組織技術。數據存儲,HDFS,數據的 Ozone,以及雲上的對象存儲等,其實數據就是底層存儲不斷髮展,所以越來越廉價。對於上層的,比如像 spark,Flink 這種計算引擎也是在不斷髮展。但是對於數據組織的發展還是比較少的,比如說現在三個開源框架,對於 time travel 以及各種各樣的 snapshot,ACID 等。就現在這三個框架出來之後,也會引起一波熱潮。可以基於這幾個框架去做到我們 ACID,也就是說對於數據服務層的一個統一,就是摒棄之前比較繁瑣的這種 Lambda 架構。這個對於很多企業來說,還是很有吸引力的。就是企業可以基於一套數據庫,去替換之前他們的架構,支持流批統一。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然後數據通過數據組織層,落到了數據存儲,比如說現有的雲上存儲,我們怎麼去對它進行一個數據的訪問,從我這邊來出發,其實還是用的 Presto 這種引擎,也就是 SQL 引擎去訪問 OSS 上用戶的數據,這邊的發展開始直接去拉用戶在 OSS 
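and later ran into some problems."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For example, when we pull some of a user's hot data, we are sometimes limited by access speed and queries are not fast enough. To balance local storage and cloud storage, we put a cache layer in between, a project done in collaboration between Alluxio and the community, to accelerate some user queries. The question then is how to bring together local storage, or in-memory storage, with cold and hot data in the cloud. We are doing fairly well here at the moment and can balance the two kinds of storage."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If users need their data to be faster, it may cost a bit more. If speed is not much of a concern, they can pull data straight from object storage without any acceleration."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"One more point: I also largely agree with Guo Wei's view on data security. Most open-source communities still do relatively little on security, meaning security from the data side to the client side. Some of our own products have not given security much thought either, and I think this is something to invest in going forward. If the open-source community comes up with good security solutions, they can be integrated into the Lakehouse."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So there are three main points. The first is the development of the data organization layer, such as Delta, Hudi, and Iceberg. The second is local caching, that is, how to integrate and balance cloud storage and local cache. The third is the security Guo Wei just mentioned, end-to-end security. I think these three areas have a lot of room to grow."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Li Yang: Thank you, Li Shaofeng. You touched on Lakehouse and lake-warehouse integration and on data consistency, which brings in ACID capabilities for data. You also talked about separating storage from compute, and about the data access speed problems that come with it, which is why there is usually a data cache layer close to the compute layer; the Snowflake architecture has such a layer too. The Alluxio mentioned earlier 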
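is also a good open-source project in this area. Those interested in the technology can use the keywords Li Shaofeng mentioned as a starting point for their own study; I am sure they will be helpful."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Now the other Shaofeng, Shi Shaofeng. From your point of view, could you share a couple of interesting technical points?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Shi Shaofeng"},{"type":"text","text":": The question was which key technologies need breakthroughs in pursuing lake-warehouse integration. I have summarized three points."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From Kylin's standpoint, if we want to turn Kylin into a cloud-native data warehouse or data analytics engine, we face challenges mainly in these three areas."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"The first is how to make it fit a cloud-native environment better."},{"type":"text","text":" That covers not only storage and compute but also security, as well as stability, reliability, and efficient use of resources. In the past, because Kylin was a Hadoop-native application, these were not big problems. HDFS, for example, was the main storage; its access performance is good, and its atomicity and consistency are guaranteed. For compute we made full use of Spark in on-Yarn mode, or MapReduce, both of which are very mature and stable."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As for security, because it was a totally private deployment, security was not much of a consideration in the past either. But if we want to turn it into a cloud application, almost all of this has to be torn down and redone, because storage and compute on the cloud differ from those on Hadoop. That is the first challenge, cloud native."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"The second is the challenge of data integration."},{"type":"text","text":" In the future, as we see with Lakehouse, the data it processes will come from many kinds of sources: besides transactional systems and tracking systems, there are also, as Guo Wei mentioned, 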
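IoT systems and data from all kinds of sources. The data may even be heterogeneous: not just relational data but also semi-structured data such as JSON and XML, and even other kinds such as image data."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the past much of this data was ingested in batch mode, T+1, by day or by hour. A major trend for the future is real-time ingestion, with the goal of pushing data latency lower. This also challenges data integration: what kind of data analytics engine can handle multi-source, heterogeneous data and also support mixed analysis of real-time and historical data?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Li Shaofeng mentioned earlier that they currently use engines like Presto for cross-source, cross-data access. Could Spark also make bigger breakthroughs in this area in the future?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Today we use Spark as our main engine for computation and processing. We hope it will develop further in this area as well, so that we can keep going down this path."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"The third, and a very big one, is the multi-cloud challenge."},{"type":"text","text":" Today people often say they want to go cloud native or move to the cloud, but the cloud is actually a very fragmented technology. Besides public cloud there are private cloud and hybrid cloud. Abroad, in the United States, there are mainly three public clouds, and probably not many private clouds. "},{"type":"text","marks":[{"type":"strong"}],"text":"A technology vendor that integrates with those three clouds can already capture 70% to 80% of users. In China, however, we see a different situation."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Public cloud probably still mainly serves small and medium-sized enterprises. Large financial institutions are all building their own private clouds, or industry clouds, along with some hybrid clouds, and the cloud technologies behind them are extremely varied. For vendors like Kylin and Kyligence, who want to provide big data analytics services based on this open-source technology, moving to the cloud presents a real dilemma. On one hand we want to keep 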
a single, complete technology stack or architecture. On the other, we face many different clouds. What design or architecture can work across multiple clouds will be a real test for us."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Li Yang: Shi Shaofeng covered a lot of ground. There was actually a pair of opposing views earlier; I don't know whether everyone noticed, but it is well worth discussing."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"From Li Xiao's perspective and the current situation in the United States, public cloud dominates. But Guo Wei and Shi Shaofeng both said that in China it is clearly hybrid: private and public clouds are growing intertwined."},{"type":"text","text":" The challenge we face is what technical architecture can hold up across multi-cloud environments. Shi Shaofeng, which technologies do you think can support this?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Shi Shaofeng"},{"type":"text","text":": This is a question I very much wanted to discuss with everyone today, especially with Li Xiao, since Databricks is ahead on the road to the cloud. In the past we ran Spark on Hadoop and mainly used Yarn for resource allocation. Now, moving to the cloud, and especially adapting to multiple clouds, we find that if we set aside the complex Hadoop stack, running a general-purpose distributed big data system like Spark well may require a resource scheduling platform such as Kubernetes as an abstraction. I would like to ask Li Xiao: in your experience, is Kubernetes really suitable for big data processing and analytics? Does Databricks use Kubernetes behind the scenes to run Spark jobs? If not, what architecture or scheduling approach do you favor?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/13\/29\/13616fbdd29cbc6a4df60cd8d720c829.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Li Xiao"},{"type":"text","text":" 
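: Let me answer a few points."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"First, we still firmly believe in the public cloud, and we have held that belief since seven or eight years ago. Eight years ago the United States was also a mix of private and public cloud, with private cloud the vast majority and public cloud a very small share. But this winter, and I don't know whether you noticed the news, the US credit card giant Capital One shut down all of its data centers and moved entirely to the public cloud. That is a landmark event: even a bank can move fully to the public cloud. We believe the public cloud is where things are heading."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Second, on resource scheduling: because our company was founded by the team that created Spark, we rewrote the scheduler entirely ourselves, with many extensions built on Standalone. But we also believe that if we were making the decision of seven years ago again, we might not necessarily write our own. Back then Kubernetes did not exist, or may have existed but only inside Google."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I can also share some information from our community. Apple, for example, used to rely mainly on Mesos, and this year Spark on Kubernetes should already be in production there, so many of their community contributions center on Kubernetes support. We believe Kubernetes is also where things are heading. Our company uses Kubernetes heavily internally; although we do not use it for resource scheduling, we use it for deployment and other purposes, and the Kubernetes community and ecosystem are doing very well in every respect. We believe it can develop sustainably and is worth our continued, deeper investment."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Shi Shaofeng"},{"type":"text","text":" 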
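: Thank you, Li Xiao. That gives us more confidence."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Li Yang (李揚): I think the most interesting part of a discussion like this is seeing different views on the same question. Perhaps where the United States is today is where China will be in two or three years. Both views may turn out to be right in the end, but which one fits better right now is what every company and open source community needs to consider and adapt to."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Shi Shaofeng (史少鋒)"},{"type":"text","text":": As everyone knows, Guo Daxia (郭大俠) is very active in China's tech community. He organized the ClickHouse China community, which has more than a thousand users and organizers. We are also very interested in ClickHouse, because to me it is quite a different engine from Spark or Presto. As mentioned earlier, it has a shared-nothing architecture that leans on the machine's hardware and local data storage to speed up analysis, and its distributed deployment is rather different as well."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I see that it also has plans to move to the cloud and is integrating with distributed storage such as S3. From your point of view, Guo Daxia, is it actually feasible to turn ClickHouse into a highly dynamic analytics engine on the cloud? Put another way, have you seen people in the community really run it on Kubernetes, combined with cloud storage, and achieve highly available, high-performance data analytics?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Guo Wei (郭煒): First, I can share that the original ClickHouse team is the Yandex team in Russia, and inside the company they are working on some cloud projects. Cloud is on their long-term roadmap, because for any company of that size and scale, moving to the cloud, or going cloud native, is the trend. Whether cloud native will mean K8s is still under discussion, though, and they also have Russia's own in-house technology for this."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As for China, let me give one example: Alibaba Cloud. Alibaba Cloud already offers a ClickHouse service, and it has done some work at the lower layers to integrate storage, although efficiency and other aspects still lag behind running standalone on physical machines. For a cloud, however, economies of scale probably matter in a different way than squeezing the last bit of performance out of each machine. So in my view, ClickHouse 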
will certainly become a cloud-native big data technology as well. Which cloud community or cloud architecture it ends up with has not been settled yet, but I think this is the trend."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From the original team to the ecosystem partners around it, everyone is heading to the cloud. That is the big trend, so there is no need to worry about whether it will move to the cloud. It certainly will, and the only open question is which cloud framework it will use."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"This article is reprinted from the WeChat official account apachekylin (ID: Apachekylin)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Original link"},{"type":"text","text":":"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s?__biz=MzAwODE3ODU5MA==&mid=2653082114&idx=1&sn=08f2380f7827912acb80a74f71fa268c&chksm=80a4acf3b7d325e564979a02a29b721e40a16fc5f3f9f980fbb9830e377ab873c7053b2eb648&token=1845978438&lang=zh_CN#rd","title":"","type":null},"content":[{"type":"text","text":"大數據+雲:Kylin\/Spark\/Clickhouse\/Hudi 的大佬們怎麼看?"}]}]}]}