Musings on the Data Lake

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hi everyone, I'm Yige. Over the weekend a reader messaged me privately with some questions, which reminded me of material on data lakes I had read earlier. This article is based on that reading and my own thinking.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Data Lakes","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to guide better decisions.","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This is the definition given by AWS.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After reading many introductory articles on data lakes, I think a data lake closely resembles what we usually call ODS (operational data store) data: a holding area for raw data arriving from business systems and message queues. Examples include an e-commerce site's access logs (stored as JSON when the tracking events are recorded) and data sent in real time by IoT devices, all of which land unmodified in the ODS layer of the data warehouse.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Why Data Lakes Took Off","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data warehouses already have ODS data, so why is everyone suddenly talking about data lakes?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The real reason is that data analysis and machine learning have gone mainstream over the past couple of years. Look at any job site today: there are many openings for data analysts and algorithm engineers, and this is especially noticeable in the city where I live. Back in 2015 everyone was building their own big data platform, and knowing a bit of Hadoop was already impressive. Now those platforms are mature and increasingly serve the business, and more and more companies need big data to support their operations and improve monetization.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A data warehouse built on big data usually holds data aggregated along various dimensions, and it mostly serves traditional report analysis. Machine learning, by contrast, often needs the raw data, and much of what it uses is not limited to structured data: user reviews, images, and similar content can all feed machine learning.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Why Have a Data Lake","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Take a look at the organization chart below. What a data lake changes most is how departments are organized, since most companies today care more about the value of business analysis.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a8/a83e6699f854baa7a8edcc5e5de2298d.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In a traditional enterprise, the data team is treated as part of IT and spends its days fielding requests to pull data. Now the data team only needs to provide simple, easy-to-use tools, and business departments work with the data directly. This is the idea that everyone should have data analysis skills (although making everyone a data analyst is genuinely hard).","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Data Lake vs Data Warehouse","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3c/3ca0ffc72f25c6f36062c54b3f272017.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This comparison comes from AWS, and it is fairly balanced.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The traditional data warehouse works in a centralized way: business staff hand requirements to the data team, which processes the data and builds dimension tables to spec, and business teams then query them through BI reporting tools or view them in business analysis systems.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A data lake is open and self-service: the data is open to everyone, the data team mostly provides tools and environments, and the business teams do the development and analysis themselves.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Another difference: a data warehouse has traditionally required designing the schema first and then loading the data, whereas a data lake's schema is generated on demand and varies with the analysis scenario. For the technical implementation of a data lake, the Delta Lake project is worth a look (our company's platform had already implemented some of this functionality before Delta Lake appeared).","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A data lake gives data analysts more control over the data, but it also demands more of them: they need to understand not only the business, SQL, and the data itself, but also big data processing technology. When everyone processes the data they need on their own, the result is a lot of redundant storage and wasted compute, and no shared, reusable data layer can form. This is where a data warehouse is a useful complement.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A data lake is not meant to overturn the data warehouse; it exists to meet data needs the warehouse cannot, and the two are complementary (","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"as things stand","attrs":{}},{"type":"text","text":").","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"ELT","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"You read that right: ELT, not ETL!","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Over the weekend a reader messaged me after reading an article on ETL and ELT. They understood the concepts but did not know which scenarios call for each.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Concepts on their own are often hard to grasp, so let's start with a diagram:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a2/a2e32e6c4b9c095a98cb6735453da80f.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data integration has three basic steps: Extract, Transform, and Load.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ETL: Extract pulls data out of existing data sources. Transform processes the raw data, for example using an ETL tool (Informatica, Kettle, and so on) to filter out null values or compute metrics. Load writes the data to its destination, usually a relational database.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"ELT:","attrs":{}},{"type":"text","text":" After extraction, the result is first written to the destination, such as Hive. Downstream applications then use an external compute framework, such as Spark, to process metrics and build models, completing the transform step.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In today's big data environments, many pipelines already follow an ELT architecture, and a data lake is well suited to be the “data storage destination” in that architecture.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"The Future of Data Lakes","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In early March I was chatting with a friend after dinner, and we got onto building data warehouses. First, consider why data warehouses emerged. Data volumes were growing so fast that the storage and compute engines of the time could not keep up with analysis needs. The data warehouse concept and its classic theory appeared and solved the problem of the day with “standards + storage”.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In today's big data era, technology keeps advancing and many new tools have appeared, so large-scale storage and computation are no longer that hard. Is it feasible, then, to abandon the data warehouse approach? Judging from the business I work with, if your business systems are fairly simple, without dozens of them loading data into the warehouse every day, a data lake can meet your needs and is more “agile” for being “data-driven”. If your front-line business systems are complex, a data lake used today can easily turn into a “data swamp”.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So the next direction may well be data governance for data lakes. Once that governance is clearly defined, the data lake will have its moment to shine!","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4f/4f40824d187d08c3c8a367d17c7cdaa0.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}