Advanced OLAP: A Big Data Semantic Layer You Can Analyze Directly in Excel

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How can we overcome data silos on a heterogeneous, fragmented big data platform while still supporting rich OLAP capabilities and advanced analysis features such as calculated measures and many-to-many relationships? What implementation principles and technical difficulties lie behind letting users get an accessible multidimensional analysis experience simply through Excel? This talk covers:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Analysis challenges in the big data era"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Limitations of traditional OLAP"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kyligence's solution"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Some of the challenges involved"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Results showcase"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Analysis challenges in the big data era"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. 
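Flexible, ever-changing analysis requirements"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e1\/e128feb240ca6ec586dca28a9a4aab9a.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The first challenge is that users' analysis requirements are now highly flexible and change often. This figure is taken from a Gartner analysis report and describes four stages of data analytics. The first stage is descriptive analytics, which describes what happened and usually takes the form of fixed reports. The second is diagnostic analytics, which explores why a metric went up or down and which part of it moved; this requires multidimensional analysis and detail-level queries. The third is predictive analytics, which forecasts upcoming trends from historical data. The fourth is prescriptive analytics, which asks what we can do to optimize the metric."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In real customer scenarios we have found that users are no longer satisfied with fixed reports; they also want to analyze the data and the causes behind the metrics. As a result, demand for multidimensional analysis, ad hoc queries, and detail-level queries is growing explosively. Users also expect high-performance interactive analysis, rather than running a Hive statement, pouring a cup of coffee, and sitting down to wait for the result as before. This places high demands on the data analysis platform."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. 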
Data silos lead to a fragmented analysis experience"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/27\/2723439c729aa5b4fdfe85ef7d92bbf0.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The second challenge is that data silos lead to a fragmented analysis experience. In many enterprises, internal information systems operate independently with little integration between them; different departments use different data stores and follow different data standards."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Each department controls its own information system and is limited to department-level decision-making, so the company lacks unified decision-making at the company level."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Limitations of traditional OLAP"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. 
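Weaknesses of traditional OLAP"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Traditional OLAP can address some of the problems just described, but it has limitations of its own."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e5\/e51d9f3cbd1177e0c6945d2ac58454d7.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There are several. The first is the limit on data volume and number of dimensions: traditional OLAP generally uses the MOLAP model, which has a clear performance advantage on small data sets but can run into dimension explosion on large ones. The second is limited scalability. Traditional OLAP is cumbersome to scale: some OLAP databases can only scale up, which means adding memory and CPU cores to a node, and that is extremely expensive. Some MPP-based OLAP databases can scale out, but the number of nodes they can add is fairly limited, unlike Hadoop or the cloud, which can scale to thousands or even tens of thousands of nodes. Other drawbacks include high cost, poor handling of high-cardinality dimensions, and worrying performance under high concurrency."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. 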
The ideal OLAP platform"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/7e\/7e9098867b46fd466a866b186513203e.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Having covered the shortcomings of traditional OLAP, what would our ideal OLAP platform look like?"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, complete OLAP capabilities: roll-up and drill-down, plus advanced analysis features such as calculated measures, many-to-many relationships, and time intelligence."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It should also support both ANSI SQL and MDX so that it integrates well with a wide range of BI tools, especially Excel, which is still an important choice for many analysts. On top of that, it should allow interactive analysis on massive data, serve highly concurrent queries from thousands of users, and scale out well when data volume surges."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So how can all of this be achieved?"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Kyligence's solution"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. 
Kyligence: a distributed OLAP engine for big data analytics"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a6\/a6cb617813e51d2cd554876939881ad5.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kyligence connects to many kinds of data sources and supports Hadoop deployed in the cloud or on premises, so it scales out by design. Kyligence also offers intelligent modeling and intelligent query acceleration, and exposes standard ODBC\/JDBC\/MDX interfaces for integration with a wide range of BI tools. Most importantly, the semantics it exposes are unified."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let's look at some features of the Kyligence semantic layer:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"① Flexible definition of hierarchies and metrics"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ca\/ca7aa75ea0adfa74270de4613e6d6eb7.gif","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"② 
Many-to-many and multi-fact-table scenarios"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ab\/ab17bd0e3e9ee9013499c3e06405bdcf.gif","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"③ Support for the standard MDX interface"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c0\/c072f765926018353ae5214a584c33b4.gif","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"④ Support for time intelligence metrics"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/92\/92b43dcc5acf266234f014b88d74e9dd.gif","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"⑤ 
Support for multi-language translations"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/85\/85631b829b903785898c3ae9bc6622f9.gif","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. Powerful semantic modeling for business scenario analysis"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/80\/8023decb30c9c127d189cee88c736ba7.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The semantic layer hides the underlying data model, so analysts do not need to understand the table structures or how the tables are joined. What it exposes are concepts such as dimensions, measures, and hierarchies that can be dragged and dropped directly. Complex technical logic, such as multi-fact-table analysis and many-to-many analysis, is hidden as well."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. Unified security policy"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b0\/b0b2d26c453c4c5f6ae99400a31cc6ec.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kyligence also provides a unified security policy, including row-level and column-level permissions. For example, colleagues in different departments can see only the data for their own region: colleagues in North China see only North China data and cannot see East China data."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Some of the challenges involved"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. 
Cross-fact-table analysis"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/eb\/ebf880a6e164d7d8c71808174c749208.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Cross-fact-table analysis is a very common scenario. For example, to analyze the relationship between annual income and annual spending across age groups, income and spending are two completely different types of fact records."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To analyze the two kinds of metrics together, the common approach in Excel is VLOOKUP, Tableau uses data blending, and SSAS uses a fact constellation schema. How does Kyligence solve this problem? The answer is model integration."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. Many-to-many analysis"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/36\/36445cae0881bb096a08bf50796131cd.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Many-to-many relationships come up often in everyday analysis. Take books and authors: a book can have several authors, and an author can write several books. If you analyze book sales by the authors' cities, a book's sales are repeated for each of its authors and therefore repeated across several cities; but when you total the sales of all authors' books, those repeated values must be removed. This is a typical many-to-many scenario."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. 
Typical approaches to many-to-many"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Typical approaches to handling many-to-many relationships:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Allocate the data at the data source level, splitting each measure value across the related dimension values."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Flatten the model, then deduplicate."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Run a join query based on key values."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Comparison of the approaches:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Allocating at the data source level is easy to do: just create a view with the allocated values. The drawbacks are that a single allocation strategy cannot be applied across all dimensions, and the allocated data only guarantees correct totals along one particular dimension."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Flattening the model and then deduplicating: the drawback is that flattening duplicates rows and can easily cause data bloat, and deduplicating at query time has some performance cost."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Joining on key values: this approach does not require flattening the data or deduplicating row by row on the fact table's primary key; the result comes directly from joining the Cube data with the dimension table data."}]}]}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4. 
Performance optimization: MDX on Spark"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Excel queries data using MDX, a language designed specifically for multidimensional analysis. Traditional MDX query engines run on a single machine, which causes these problems:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Single-machine memory is limited. A large query can involve a huge amount of intermediate computation, and a single metric may depend on hundreds of millions of rows. This easily causes the query to fail and degrades the experience of other, smaller queries."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Single-machine compute power has a ceiling. Even if the excess intermediate data is loaded in batches or spilled to disk to relieve memory pressure, such a large volume of intermediate computation still makes the result extremely slow to compute."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Solution:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Convert the MDX syntax tree into a Spark execution plan and rely on Spark's distributed computing to remove the single-node bottleneck."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article is reprinted from: DataFunTalk (ID: datafuntalk)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original link: "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/ggwmmaqKabrW8sLZkKkwrg","title":"xxx","type":null},"content":[{"type":"text","text":"Advanced OLAP: A Big Data Semantic Layer You Can Analyze Directly in Excel"}]}]}]}