Why Precomputation Is the Future of the Big Data Industry, Explained in One Article

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Anyone familiar with Kylin will know the concept of precomputation. Opinions in the industry on the value of precomputation have long been mixed. Drawing on more than ten years of work experience, I will cover its history, its principles, its enterprise applications, and its future development to give a more complete picture."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Early Forms of Precomputation"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Precomputation is a common technique in information retrieval and analysis. "},{"type":"text","marks":[{"type":"strong"}],"text":"Its basic idea is to compute and store intermediate results ahead of time, then use those precomputed results to speed up later queries."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In fact, we were using precomputation long before we knew the term. "},{"type":"text","marks":[{"type":"strong"}],"text":"The history of precomputation can be traced back roughly 4,000 years to the multiplication tables first used by the ancient Babylonians."},{"type":"text","text":" Think back to the multiplication table you memorized in primary school (shown below). Once the table is memorized, simple multiplication can be done in your head, and that process is itself a simple form of precomputation."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/96\/44\/96f72354d37a997795f21cecf283c144.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Multiplication table. Source:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"https:\/\/en.wikipedia.org\/wiki\/Multiplication_table"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Precomputation in Databases"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type"
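:"text","text":"Precomputation is also widely used in database technology. An index in a relational database, for example, is a form of precomputation. To retrieve data quickly, the database actively maintains an index structure that summarizes the values of one or more columns in a table. Once the index has been precomputed, the database can locate data quickly without scanning every row of the table each time. If N is the number of rows in the table, an index can reduce lookup time from O(N) to O(log(N)), or even to O(1)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As a form of precomputation, an index brings convenience but also has drawbacks. Whenever new rows are inserted into the table, the index must be recomputed and stored again. "},{"type":"text","marks":[{"type":"strong"}],"text":"More indexes mean faster query responses, but they also mean more precomputation, which noticeably slows down data updates."},{"type":"text","text":" The chart below shows how insert performance drops as the number of indexes grows."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/89\/98\/896780e16fb7bd2479713418c9f19d98.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Effect of the number of indexes on insert performance. Source:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"https:\/\/use-the-index-luke.com\/sql\/dml\/insert"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A summary table, usually implemented as a materialized view, is another form of precomputation in databases. A summary table is essentially an aggregation of the original table. A transaction table with one billion rows may shrink to only a few thousand rows after aggregation by date. Analysis can then run against the summary table instead of the original table. Because the summary table holds far less data, interactive data exploration on it can be hundreds or even thousands of times faster, whereas interactive analysis of this kind on the original table is nearly impossible. Building a summary table is not cheap, and keeping it in sync with the original table costs even more. Still, given the large gain in analysis speed and the value it brings, summary tables remain a widely used tool in modern data analysis."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"OLAP and Cube 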
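Precomputation"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As database technology evolved, databases specialized by purpose. In 1993, Edgar F. Codd, the father of the relational database, coined the term OLAP (On-Line Analytical Processing). "},{"type":"text","marks":[{"type":"strong"}],"text":"Since then, databases have been divided into OLTP databases, which specialize in online transaction processing, and OLAP databases, which specialize in online analysis."},{"type":"text","text":" As you might guess, OLAP databases take the use of precomputation to a higher level."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A Cube system is a special kind of OLAP database that pushes precomputation to the extreme. Data under analysis can have any number of dimensions, and a Cube is a multidimensional array of that data. Loading relational data into a Cube is itself precomputation, including joins and aggregations over tables. A fully loaded Cube is roughly equivalent to 2^n summary tables, where n is the number of dimensions. Precomputation at this scale can take hours to complete!"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The strengths and weaknesses of a Cube are both obvious. On the one hand, once a Cube is built it delivers the fastest analysis experience, because all the computation has been done in advance. Whichever dimension of the data you want to look at, the result has already been computed. "},{"type":"text","marks":[{"type":"strong"}],"text":"Apart from fetching results from the Cube and visualizing them, almost no online computation is needed, which delivers both low latency and high concurrency."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"On the other hand, a Cube is inflexible and costly to maintain. This is not only because precomputation and storage consume resources, but even more because loading data from a relational database into a Cube usually requires a manually built data pipeline. Every change in business requirements calls for a new development cycle to update the pipeline and the Cube, which costs both time and money."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Despite the considerable investment, in big data multidimensional analysis scenarios that demand extremely low latency and high concurrency, Cube 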
technology has always been an indispensable option."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/34\/26\/342009c0aca83836e5fe6f183272c526.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Cube. Image source:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"https:\/\/en.wikipedia.org\/wiki\/OLAP_cube"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Challenges and Opportunities in the Big Data Era"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Looking ahead, what challenges and opportunities will precomputation face in the big data era?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"The conclusion first: as total data volume and the number of data users keep growing, precomputation will become an essential cornerstone of the data service layer."},{"type":"text","text":" 
To explain this better, we first need to understand the broader context of the digital transformation era and the technical characteristics of precomputation."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, consider some of the context behind today's enterprise digital transformation."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data volume keeps growing (see the chart below). There will be more data to analyze in the future, which means enterprises will have to invest more computing power each year to process each year's new data."}]}]},{"type":"listitem","attrs":{"listStyle":"none"},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/a5\/8d\/a53e544d0093e06d52cb1345c690be8d.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}}]}]},{"type":"listitem","attrs":{"listStyle":"none"},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Data growth. Chart source:"}]}]},{"type":"listitem","attrs":{"listStyle":"none"},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"https:\/\/www.statista.com\/statistics\/871513\/worldwide-data-created\/"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Moore's Law has reached its end. Research from the University of Texas shows that, from a chip manufacturing perspective, the effect of Moore's Law has weakened considerably over the past decade. Meanwhile, cloud computing prices have stayed largely flat in recent years. This means an enterprise's computing costs will grow in step with its data volume."}]}]},{"type":"listitem","attrs":{"listStyle":"none"},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/17\/15\/173b3dd3d64f5fd922c847c9f3399815.png","alt":null,"title":null,"sty
le":null,"href":null,"fromPaste":false,"pastePass":false}}]}]},{"type":"listitem","attrs":{"listStyle":"none"},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Cloud computing prices. Chart source:"}]}]},{"type":"listitem","attrs":{"listStyle":"none"},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"https:\/\/redmonk.com\/rstephens\/2020\/07\/10\/iaas-pricing-patterns-and-trends-2020\/"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The number of data users will grow significantly. Data has value only when it is used to make decisions. For data, the “new oil”, to drive the business more effectively, ideally every employee in a company would use data. That means analytics systems may have tens or even hundreds of times as many users as today. The era of the citizen data analyst is coming."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Next, a summary of the technical characteristics of precomputation"},{"type":"text","text":":"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Precomputation trades space for time. If response speed is the goal, precomputation is the natural first choice."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Precomputation increases the time and cost of data preparation but reduces the time and cost of serving data. If the goal is high concurrency and serving more consumers, precomputation should again be the first choice."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Precomputation lengthens the data pipeline and increases end-to-end data latency. This is the part that needs improvement, and it is discussed in detail later in this article."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent
":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Against this background, let us look at how precomputation helps meet some basic analysis needs."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"How can queries stay fast as data grows?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When a query runs through online computation (typically on an MPP database), its time complexity is at least O(N), which means the required computation necessarily grows linearly with the data. Suppose a query takes 3 seconds today. When the data volume doubles, the same query takes 6 seconds. To keep data analysts from complaining and hold response time within 3 seconds, the only option is to pay the MPP vendor twice as much and double the MPP system's resources. With precomputation, by contrast, queries appear almost unaffected by data growth. Because most results have been precomputed, query time complexity approaches O(1). Even when the data volume doubles, returning a result takes about as long as before, and response time remains about 3 seconds."}]}]},{"type":"listitem","attrs":{"listStyle":"none"},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/6a\/c1\/6a4649af572edcf75ed6c82095b127c1.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}}]}]},{"type":"listitem","attrs":{"listStyle":"none"},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Query time complexity of online computation versus precomputation as data volume grows"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"How can the concurrency needs of “citizen analysts” be better met?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null
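,"origin":null},"content":[{"type":"text","text":"For online computation, user growth has much the same effect as data growth: the required computation grows linearly with the number of concurrent users. The MPP vendor may urge you to double the cluster size to support twice as many analysts, but the company's IT budget may not allow it, because the price doubles too. With precomputation, on the other hand, the resources needed per query are minimized, so the extra resources needed for each new user are minimized as well."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"When data volume and user count grow at the same time, how can TCO (total cost of ownership) be managed?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"One advantage of the cloud is that all resource consumption can be quantified as cost. The chart below compares the actual cost of an MPP data service and a precomputation data service on AWS. Actual cost includes both data preparation cost and query serving cost. The workload used in the evaluation is TPC-H (a decision support benchmark) with 1 TB of data. "},{"type":"text","marks":[{"type":"strong"}],"text":"Suppose there are 40 analysts today, each running 100 queries per day. If data volume grows by 25% and the number of users grows fivefold, what will the total cost be a year from now?"}]}]},{"type":"listitem","attrs":{"listStyle":"none"},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/10\/c7\/101c6f6e460f08747bbee3452b9af5c7.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}}]}]},{"type":"listitem","attrs":{"listStyle":"none"},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Total cost of ownership: precomputation data service versus MPP service"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"The experiment shows that precomputation has a clear TCO advantage as the number of queries or users grows."},{"type":"text","text":" In particular, once the number of queries per day reaches 20,000, the TCO of the precomputation data service is only a fraction of that of the MPP 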
service: 1\/3. The more the data grows, the more pronounced the advantage of precomputation becomes."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"In short, in the era of digital transformation, precomputation will be a key technology for monetizing data at scale."},{"type":"text","text":" With precomputation, a data service system can deliver fast response times, high concurrency, and low TCO at the same time. Precomputation does have drawbacks in the extra data preparation it requires, which are discussed later in this article."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Example: Speeding Up an OLAP Query 200 Times"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Let us take a close look at a concrete case and see how Apache Kylin uses precomputation to speed up a TPC-H query by a factor of 200."},{"type":"text","text":" TPC-H is a decision support benchmark commonly used in database research."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With 100 GB of data, query 7 of the TPC-H benchmark takes 35.23 seconds on the Hive+Tez MPP engine. As the figure below shows, the query is not simple and includes a subquery. The execution plan shows that it contains several Join operations and one Aggregate operation, and these two kinds of operations are also the biggest bottlenecks in the overall execution."}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/dd\/3e\/dd4334cb18db63df51a906b3f838eb3e.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"TPC-H benchmark query 7 on the MPP 
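engine: execution plan"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From a precomputation perspective, a natural idea is to use a materialized view that computes the Join operations ahead of time, saving that cost at query time. Done by hand, the approach looks roughly as follows."}]},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/75\/51\/750d3e2d4eab97363b1ef4da8acdf451.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"TPC-H benchmark query 7 with a hand-built materialized view"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Note that the new execution plan is much simpler because the Join operations have been replaced by the materialized view. The drawback of this approach is that the materialized view has to be created and maintained through manual programming, and the application layer has to rewrite its SQL to query the new materialized view instead of the original tables. In real projects this rewrite is expensive, because it involves large-scale refactoring of the application layer and usually requires a full development cycle and full regression testing of the application. Finally, the Aggregate operation is still computed online, so precomputation still has considerable room for improvement."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To take precomputation further, Apache Kylin makes the following design choices:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"It introduces the concept of the multidimensional cube. Simply put, a Cuboid is a materialized view that contains the precomputed results of both Join and Aggregate"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"It helps users create and maintain Cuboids automatically through GUI configuration"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"It automatically optimizes the query execution plan and dynamically selects the most suitable Cuboid to run the query, with no need for users to modify their 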
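SQL"}]}]}]},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/42\/b5\/42115a931b9efff41ff6485d1a4d53b5.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Execution plan of TPC-H benchmark query 7 on Apache Kylin"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Running the same query on Apache Kylin, on comparable hardware, takes only 0.17 seconds. Thorough precomputation removes Join and Aggregate, the two biggest computational bottlenecks."},{"type":"text","text":" During execution plan optimization, the system automatically picks the most suitable Cuboid and substitutes it into the plan. The application-layer SQL needs no changes, and analysis is transparently accelerated by a factor of about 200."}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"A Promising Future for Precomputation"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Although precomputation performs well in the big data field, it does have drawbacks. For example, it can add to data pipeline latency and requires extra manual operations work. The good news is that Gartner predicts: “By 2022, manual data management tasks will be reduced by 45% through the addition of machine learning and automated service-level management.” We expect that within the next two years a new generation of intelligent database systems will mitigate or even eliminate these problems."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"In the near future, a new generation of databases will incorporate precomputation in an intelligent, automated way."},{"type":"text","text":" Here are some of our predictions:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To support larger data volumes and serve more citizen analysts, precomputation will be widely used in the data service layer."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With the help of artificial intelligence and automation, data preparation for precomputation will become fully automated. For example, Snowflake, the highly popular cloud data warehouse, automatically performs small aggregate precomputations on its underlying data blocks and materializes them (small materialized aggregates 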
[Moerkotte98]), in a process that is fully transparent to users and fully automated. "},{"type":"text","marks":[{"type":"strong"}],"text":"The big data OLAP engine Apache Kylin can also automatically load relational data into a Cube for precomputation, based on the dimension combinations the user configures. The whole configuration is done in a GUI and requires no programming or big data skills"},{"type":"text","text":", which reaches a semi-automated level."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"OLAP databases are beginning to offer intelligent or transparent precomputation. Such a database can switch transparently between online computation and precomputation. When a query needs the latest data, it can read directly from the MPP engine without being held back by pipeline latency. When a query hits precomputed results, those results minimize query cost and raise system throughput. These new databases will decide automatically when to precompute and what to precompute, and will apply precomputation intelligently to meet operational goals such as fast response time, high concurrency, and low TCO. All of this is transparent to end users and frees database administrators from the work entirely."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"References"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Multiplication table https:\/\/en.wikipedia.org\/wiki\/Multiplication_table"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Database index https:\/\/en.wikipedia.org\/wiki\/Database_index"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"More indexes, slower INSERT https:\/\/use-the-index-luke.com\/sql\/dml\/insert"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"OLAP cube 
https:\/\/en.wikipedia.org\/wiki\/OLAP_cube"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Rise and Fall of the OLAP Cube https:\/\/www.holistics.io\/blog\/the-rise-and-fall-of-the-olap-cube\/"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Worldwide data volume https:\/\/www.statista.com\/statistics\/871513\/worldwide-data-created\/"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Measuring Moore's Law 2020 https:\/\/www.nber.org\/system\/files\/chapters\/c13897\/c13897.pdf"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"IaaS Pricing Patterns and Trends 2020 https:\/\/redmonk.com\/rstephens\/2020\/07\/10\/iaas-pricing-patterns-and-trends-2020\/"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TPC-H decision support benchmark http:\/\/www.tpc.org\/tpch\/"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Augmented Data Management 
https:\/\/www.gartner.com\/en\/conferences\/apac\/data-analytics-india\/gartner-insights\/rn-top-10-data-analytics-trends\/augmented-data-management"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"About the Author"},{"type":"text","text":":"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Li Yang (李揚), Co-founder and CTO of Kyligence"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Co-creator of Apache Kylin and member of its Project Management Committee (PMC). Formerly a senior big data architect in eBay's global analytics infrastructure department, technical lead for IBM InfoSphere BigInsights, and a vice president at Morgan Stanley. Recipient of IBM's “Outstanding Technical Contribution Award”, with more than 10 years of hands-on experience in big data analytics."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Focuses on cutting-edge technologies including big data analytics, parallel computing, data indexing, relational mathematics, approximation algorithms, and compression algorithms. Over the past 15 years of his career, he has witnessed and directly taken part in the development of OLAP technology
."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"This article is republished from the WeChat official account Kyligence (ID: Kyligence)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Original link"},{"type":"text","text":":"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s?__biz=MzIyNTIyNTYwOA==&mid=2651010997&idx=1&sn=d0b1329ae5afe8d5e5a3cb1857503af1&chksm=f3f56952c482e044f560aaeb672c844cd46cf8563b60674c0be23650a78f46b1d35eb9a7d003&scene=0&xtrack=1&version=3.0.36.2330&platform=mac#rd","title":"","type":null},"content":[{"type":"text","text":"Why Precomputation Is the Future of the Big Data Industry, Explained in One Article"}]}]}]}
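
The article describes a fully loaded Cube as roughly 2 to the power n summary tables (n being the number of dimensions) and says that queries served from precomputed results approach O(1), while online computation is at least O(N). The sketch below illustrates that contrast on a toy scale. Everything in it is an assumption made for illustration: the four-row fact table, the dimension names, and the functions `scan_total`, `build_cuboids`, and `lookup_total` are hypothetical and are not Apache Kylin's API or storage format.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy fact table: (date, region, product, amount).
DIMENSIONS = ("date", "region", "product")
ROWS = [
    ("2021-01-01", "east", "book", 10),
    ("2021-01-01", "west", "book", 5),
    ("2021-01-02", "east", "pen", 7),
    ("2021-01-02", "east", "book", 3),
]


def scan_total(rows, **filters):
    """Online computation: scan every row, so the cost grows with len(rows)."""
    idx = {name: i for i, name in enumerate(DIMENSIONS)}
    return sum(
        row[-1]
        for row in rows
        if all(row[idx[dim]] == value for dim, value in filters.items())
    )


def build_cuboids(rows):
    """Precompute one summary table per subset of dimensions (2^n in total)."""
    cuboids = {}
    for k in range(len(DIMENSIONS) + 1):
        for dims in combinations(range(len(DIMENSIONS)), k):
            table = defaultdict(int)
            for row in rows:
                table[tuple(row[i] for i in dims)] += row[-1]
            cuboids[tuple(DIMENSIONS[i] for i in dims)] = dict(table)
    return cuboids


def lookup_total(cuboids, **filters):
    """Serve the query from the matching summary table with one dict lookup."""
    dims = tuple(d for d in DIMENSIONS if d in filters)
    key = tuple(filters[d] for d in dims)
    return cuboids[dims].get(key, 0)


# The expensive step happens once, ahead of query time.
CUBOIDS = build_cuboids(ROWS)

if __name__ == "__main__":
    print(len(CUBOIDS))                           # 8 summary tables for 3 dimensions
    print(scan_total(ROWS, region="east"))        # 20, computed by scanning all rows
    print(lookup_total(CUBOIDS, region="east"))   # 20, read from the precomputed table
```

The build step touches every row once per dimension subset, which mirrors the article's point that precomputation raises preparation cost and storage in exchange for cheap queries. A real engine would prune the set of cuboids instead of materializing all 2^n of them.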