傳統BI如何轉大數據數倉

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大家好,我是一哥,前幾天建了一個數據倉庫方向的小羣,收集了大家的一些問題,其中有個問題,一哥很想去談一談——現在做傳統數倉,如何快速轉到大數據數據呢?其實一哥知道的很多同事都是從傳統數據倉庫轉到大數據的,今天就結合身邊的同事經歷來一起分享一下。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"數據倉庫","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據倉庫:數據倉庫系統的主要應用主要是OLAP(On-Line Analytical Processing),支持複雜的分析操作,側重決策支持,並且提供直觀易懂的查詢結果。也就是說,數據倉庫彙總有可能有很多維度數據的統計分析結果,取百家之長(各個數據源的數據),成就自己的一方天地(規劃各種業務域的模型,指標)。可以參考之前的文章《","attrs":{}},{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=MzI4MzE4MjQxOQ==&mid=2649360704&idx=1&sn=a45ac5e44a9fc9241193af38b68a7fcc&chksm=f3903d7cc4e7b46a29a9ce061ead170b494268eca258aa4d51cdfae378474f0355fb9a881fe6&scene=21#wechat_redirect","title":"","type":null},"content":[{"type":"text","text":"數據倉庫的前世今生","attrs":{}}]},{"type":"text","text":"》","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"傳統數據倉庫開發","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/07/0747a441974736dde995113de3d49163.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"傳統的數據倉庫用Oracle的居多,多半是單機或者一個雙機環境運行。本身硬件,系統都容易形成單點故障。慢慢發展,應該會開始通過存儲形成容災的一個環境。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我瞭解的傳統的數據開發一般分爲3個崗位:數據工程師、ETL工程師、數據倉庫架構師,大多數人屬於前兩者。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"數據工程師:","attrs":{}},{"type":"text","text":"根據業務人員提交的邏輯來編寫“存儲過程”,他們能夠很輕鬆的編寫上千行的複雜邏輯SQL。在編寫SQL多年經驗中,掌握了各種關聯查詢、聚合查詢、窗口函數,甚至還可以用SQL自己編寫一些Function,最終組合成了存儲過程。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"ETL工程師:","attrs":{}},{"type":"text","text":"傳統數據倉庫只有在大型企業中一般纔會有,比如電信、銀行、保險等行業。他們都會採購一些ETL工具,比如Informatica或者和第三方共建ETL工具,比如和華爲、亞信等。這些ETL工具功能非常強大。ETL工程師可以通過在平臺上拖拉拽的形式進行數據加工處理,同時ETL平臺的組件還可以支撐一些腳本的上傳,所以ETL工程師結合數據工程師開發的複雜存儲過程,在平臺上進行加工設計,最終形成一個個定時任務。然後他們還負責每天監控這些定時任務的狀態,對於重要部門的ETL人員還經常會熬夜值班監控。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"數據倉庫架構師:","attrs":{}},{"type":"text","text":"數據倉庫是依靠規範來有序進行的,架構師就是來建立這些規範的,包括數據倉庫的分層、模型命名、指標命名、ETL任務命名、ETL任務編排規範、存儲過程開發規範等等,最終形成《XX數據倉庫建設規範》,然後數據工程師和ETL工程師按照規範進行任務開發。如果遇到重大業務變更,比如主數據變更,需要和數據倉庫架構師評審後修改完善。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"大數據開發","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f0/f03095011ae0e2e00f8ab009a5b98b3f.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"現在的大數據架構多了一些東西,比如數據採集Flume、消息對壘kafka、計算引擎MR、Spark以及實時計算框架,這些都是以前傳統數據倉庫架構下沒有的。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Flume","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Flume是一種分佈式、高可靠和高可用的服務,用於高效地收集、聚合和移動大量日誌數據。它有一個簡單而靈活的基於流數據流的體系結構。它具有可調的可靠性機制、故障轉移和恢復機制,具有強大的容錯能力。它使用一個簡單的可擴展數據模型,允許在線分析應用程序。Flume的設計宗旨是向Hadoop集羣批量導入基於事件的海量數據。系統中最核心的角色是agent,Flume採集系統就是由一個個agent所連接起來形成。每一個agent相當於一個數據傳遞員,內部有三個組件:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"source: 採集源,用於跟數據源對接,以獲取數據","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"sink:傳送數據的目的地,用於往下一級agent或者最終存儲系統傳遞數據","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"channel:agent內部的數據傳輸通道,用於從source傳輸數據到sink","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"詳細配置參考:《","attrs":{}},{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=MzI4MzE4MjQxOQ==&mid=2649359245&idx=1&sn=2320034d5c92eccc85a2d5def414528c&chksm=f39037b1c4e7bea7b39e9e1a74b42a1a92c09e901ee18ab6297b49e50741827555decec0e819&scene=21#wechat_redirect","title":"","type":null},"content":[{"type":"text","text":"日誌收集組件—Flume、Logstash、Filebeat對比","attrs":{}}]},{"type":"text","text":"》","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"kafka","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kafka是最初由Linkedin公司開發,是一個分佈式、分區的、多副本的、多訂閱者,基於zookeeper協調的分佈式日誌系統(也可以當做MQ系統),常見可以用於web/nginx日誌、訪問日誌,消息服務等等,Linkedin於2010年貢獻給了Apache基金會併成爲頂級開源項目。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"主要應用場景是:日誌收集系統和消息系統。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kafka主要設計目標如下:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"提供消息持久化能力,即使對TB級以上數據也能保證常數時間的訪問性能。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"高吞吐率。即使在非常廉價的商用機器上也能做到單機支持每秒100K條消息的傳輸。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持Kafka Server間的消息分區,及分佈式消費,同時保證每個partition內的消息順序傳輸。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同時支持離線數據處理和實時數據處理。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scale out:支持在線水平擴展","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"大數據計算引擎","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"近幾年出現了很多熱門的開源社區,其中著名的有 Hadoop、Storm,以及後來的Spark、Flink,他們都有着各自專注的應用場景。Spark掀開了內存計算的先河,也以內存爲賭注,贏得了內存計算的飛速發展。Spark的火熱或多或少的掩蓋了其他分佈式計算的系統身影。不過目前Flink在阿里的力推之下,也逐漸佔領着實時處理的市場。其實大數據的計算引擎分成了三代:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"第一代計算引擎","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"無疑就是Hadoop承載的 MapReduce。這裏大家應該都不會對MapReduce陌生,它將計算分爲兩個階段,分別爲 Map 和 Reduce。對於上層應用來說,就不得不想方設法去拆分算法,甚至於不得不在上層應用實現多個 Job 的串聯,以完成一個完整的算法,例如迭代計算。MR每次計算都會和HDFS交互,和磁盤交互意味着產生更多的IO,也就會更慢。由於這樣的弊端,催生了支持 DAG 框架和基於內存計算的產生。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"第二代計算引擎","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark的特點主要是 Job 內部的 DAG 支持(不跨越 Job),同時支持基於內存的計算。這樣的話每次計算到中間步報錯了,就不會再從頭開始計算一遍,而是接着上一個成功的狀態,同時中間計算結果數據也可以放在內存中,大大提高了計算速度。另外,Spark還支持了實時計算,滿足了大家維護一套集羣,既可以搞離線計算也可以搞實時計算。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"第三代計算引擎","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"促進了上層應用快速發展,例如各種迭代計算的性能以及對流計算和 SQL 等的支持。Flink開始嶄露頭角。這應該主要表現在 Flink 對流計算的支持,以及更一步的實時性上面。當然 Flink 也可以支持 Batch 的任務,以及 DAG 的運算。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"轉型","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有了傳統數據倉庫轉型大數據是一個好的開端,但是路並不會那麼順利,也許並沒有你想想的那麼快速。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從身邊的同事來看,大家都是經歷了很多項目經驗之後,才真正轉型成功的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先,作爲傳統的數據倉庫工程師,你已經對SQL掌握的非常熟練,這是你的優勢。可以看到,現在所有的大數據計算引擎都在支持着SQL,從最早的Hive到現在的Flink。他們大多和標準SQL語法近似,很快能夠快速掌握,現在很多離線數據倉庫還是建立在Hive之上,從這一點上,你能很快開發HiveSQL腳本。你可以把之前數據倉庫架構的那一套方法論拿過啦借鑑,數據倉庫這東西建了幾十年,還是最初的那幾套方法,足夠了。只是你會遇到很大的數據量問題,要根據實際情況進行合理的分層,合理的使用臨時表。但是,既然你選擇了轉型,相信你並不只是滿足換個環境寫SQL吧。所以你要學習Hadoop、學習Spark,身邊曾有一個同事從傳統數倉轉大數據後,就是換個環境寫SQL,寫了兩年了還不知道HDFS常用命令,不知道Spark的計算原理,每次SQL調優的時候,都去問平臺開發的同事,怎麼修改參數。所以趁着你還在轉型的興頭上,學一下這方面的知識。如果之前沒有java開發經驗,那不建議你學習MR,直接上手Spark吧,python也很好用的,使用pyspark能夠解決很多SQL解決不了的問題。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外,學一學kafka吧,離線數倉搞完了,不想搞實時數倉嗎?現在物聯網行業,每天有幾億數據上傳到平臺,每條消息有幾千個字段,我們無法選擇傳統的數據庫進行存儲。網關直接轉發數據到kafka,然後存儲到HDFS和HBase。其他行業也一樣,電商。搜索,每天那麼多人訪問,訪問數據也都是消息緩存的方式發送到HDFS落地分析的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後,推薦兩本書吧,幫助你快速轉型成功——《阿里大數據之路》、《大數據日知錄》。","attrs":{}}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章