Data Preparation: Making ETL Agile

Data analysis is a process of continually exploring the patterns behind data to arrive at business insights. Before that work can begin, analysts must first turn raw data into analysis-oriented data that carries business meaning. Data cleaning and wrangling are key to the efficiency and quality of the whole analysis process, because analysis rests on consistent, accurate, and complete data.

Many surveys report that data analysts typically spend more than 60% of their time on data cleaning and wrangling, which leaves limited time for actually exploring data and deriving business insights. That efficiency badly needs to improve.

Facing these challenges, we asked ourselves: how can we lower the barrier to data processing, make data analysis more efficient, promote a culture of data-driven decision-making, and build a data-driven organization? In practice, we found that turning data preparation into a visual, business-oriented, self-service product effectively empowers business teams and improves how efficiently data flows through the organization. This article uses the Youshu (有數) data preparation product to describe what Youshu BI has done in this area.

## Business scenario

As a company grows, the business demands ever more timely and agile data. Yet because data development cycles are long and the process is complex, many business decisions still rest on experience alone, which is costly and inefficient. Lowering the barrier to analysis and development, and improving efficiency, has become an urgent need. (The figure below shows a typical data workflow.)

![Image](https://static001.geekbang.org/infoq/e9/e9d2fb65000b9c62ba98ae40ceaceec6.webp)

- The business raises data analysis requests with analysts, such as effect evaluation, trend forecasting, and anomaly diagnosis.
- Analysts work out an analysis approach from the business request, define the metrics, and deliver data reports and business strategies. Depending on how complete and usable the data is, they raise data requirements with the data product team.
- The data product team defines the metrics, abstracts the business process, designs the final product, and raises requirements with data development.
- Data development builds the data warehouse and physical tables according to the product requirements and delivers them to analysts.
- Analysts use the warehouse to provide the business with data reports and decision recommendations, which completes the analysis chain.

Looking back over this workflow, analysts depend on both the data product team and data development to finish their work. Requirements have to be discussed and confirmed repeatedly along the way, so efficiency easily hits a bottleneck.

Back to the starting point: if analysts could complete the whole process on their own in most scenarios, would efficiency improve significantly? For the analyst's working context, the product requirement is clear: a tool that is lightweight, simple to adopt, and convenient to operate. Such a tool lets analysts handle most data wrangling independently, which shortens the process and improves efficiency. Taken further, even business staff could carry out the analysis themselves and reach valuable business conclusions.

## Introduction to data preparation

Across the whole data analysis process, data preparation includes cleaning work, such as removing anomalies, ensuring consistency, and handling missing values, as well as wrangling work, such as combining, transposing, pivoting, and merging. It is a process of continual iteration, improvement, and optimization.

The output of data preparation can be used both for data analysis work and as datasets for scenario-specific data products. Specifically, data preparation turns raw data into data that is accurate, consistent, clear, and carries business meaning. It is the bridge between data and the business.

![Image](https://static001.geekbang.org/infoq/34/349daf908dcf29f25b7a1bff32149648.webp)

At this point many readers may ask, "Isn't this exactly what ETL tools do?" In a sense it is, but compared with ETL tools, data preparation has its own product requirements and user base. (ETL stands for Extract-Transform-Load and describes the process of extracting data from a source, transforming it, and loading it into a destination.)

In detail, ETL and data preparation compare as follows:

**Target users:** ETL generally targets data developers, while data preparation targets business users such as data analysts, product managers, and marketing operations staff.

**Use cases:** ETL is generally used for data collection and modeling, aimed at long-term, standardized data warehouse modeling. Data preparation is generally used for preprocessing before analysis, and many of its scenarios are temporary, short-term, and exploratory.

**Product form:** ETL generally processes data through jobs and code, so users need fairly strong development skills. Data preparation generally processes data visually; the product provides many built-in operators, and users only need to understand basic data concepts.

## Product introduction

In terms of origin and tooling, data cleaning and wrangling currently have to be done on a data development platform. Platform configuration and code development both require strong development skills, so for business staff the barrier is too high and the practical value is limited.

![Image](https://static001.geekbang.org/infoq/e5/e5408d9bada0e4d0c3bc308804508664.webp)

(A typical data development platform)

Starting from the end goal and designing for business users, Youshu BI turns the data development process into a visual product and tool. Users only need basic data knowledge and no help from data developers; a few drag-and-drop steps are enough for complex data processing. The barrier is low and the efficiency is high.

![Image](https://static001.geekbang.org/infoq/f5/f5c431a3a8d2f7241de53613c13bf8c4.webp)

Data preparation: covers the end-to-end processing flow of data acquisition, wrangling, and modeling.

- Data connection: data developers can obtain data after simple configuration in the product.
- Lightweight ETL: the product wraps common transformation logic into operators, and users wrangle data by dragging and dropping them.
- Data modeling: users join and merge the processed tables to get them ready for analysis.

Data analysis: analysis is done directly on the same platform, with no need to switch platforms.

Data products: users can turn data and visual reports into data products.

![Image](https://static001.geekbang.org/infoq/89/89bc0e690657e5802441ad8d7c5be1bb.webp)

In terms of design, Youshu BI is based on an MPP architecture whose compute nodes scale horizontally, so the product can keep processing data efficiently as data volume grows. For analysis scenarios, data preparation connects seamlessly with BI: tables produced by data preparation can be analyzed directly. Users can acquire, process, and analyze data without switching platforms. This continuity and consistency makes the analysis process much more convenient, lowers the barrier to use, improves efficiency, and reduces cost for the company as a whole.

## A practical scenario

Suppose the business wants to analyze the loyalty of members at different levels so that it can target its efforts and run more efficiently. The user has the following data on hand: order details for 2012-2015, order details for 2016-2019, and a member dimension table.

![Image](https://static001.geekbang.org/infoq/62/62b3f71e393f169cae2be6ac89ed3a89.webp)

With this data, the analyst faces several problems:

- The user only cares about sales data at member granularity. The current order tables are too fine-grained and too large to be suitable for analysis.
- The customer ID in the order table is a composite "name + order ID" field, which the analyst first has to split into two fields.
- The analysis requires join queries, but many data engines execute joins inefficiently.

With data preparation, the analyst can complete the entire wrangling process in the product, then model and analyze the cleaned data directly and produce reports.

![Image](https://static001.geekbang.org/infoq/7e/7e3d17e5cb469d1501c3490452c76f65.webp)

- Clean the two order tables: remove anomalous values and split the composite customer ID field into two columns.
- Merge the 2012-2015 and 2016-2019 detail data into a single dataset.
- Join the result with the member dimension table on user ID to form a wide table at detail granularity.
- Aggregate by date, member information, and sales information, so that later analysis can work directly on the aggregated data.

Finally, build a model on the resulting table, create reports, and present the data.

## Building data preparation for the future

To keep data construction globally coherent, consistent, and maintainable, and to safeguard the overall output quality of the data warehouse, frequently used and relatively stable requirements should be developed and operated by the data team. Temporary, highly uncertain exploratory analysis, by contrast, can be handled flexibly as needed: analysts can shape raw data into what they need according to business requirements and context.

Practice has shown that the overall workflow runs more smoothly when analysts handle exploratory requirements themselves and pass on to the data product team only those requirements that are well defined and widely applicable. Collaboration becomes easier, and the same effort achieves more.

![Image](https://static001.geekbang.org/infoq/d0/d0157525be9dae6550bdb6b1ea9466cf.webp)

A visual, productized data preparation flow shortens the overall analysis cycle and helps spread a data-driven culture. In day-to-day operation, however, we still face many challenges, most notably in unified operations and monitoring, product applicability, and architectural scalability. Addressing them effectively would further strengthen the product, extend its use cases, and improve the operating efficiency of the whole company.

**Unified operations and monitoring**

With Youshu BI's data preparation, exploratory data analysis no longer needs developer involvement. Analysts can complete it independently, which removes the repeated requirement confirmations and design revisions in between, saves significant working time, and raises business satisfaction. In actual operation, however, we found that tables created by analysts already far outnumber those created by developers. Many of these tables are created ad hoc, their value is hard to judge, and their volume is large, which makes them very painful to operate and maintain. In addition, without professional data development support, many tables are poorly designed, run inefficiently, and waste resources.

**Flexibility and openness of data analysis**

To lower the cost of using the product and widen its user base, data preparation wraps common algorithms into operators, which reduces the cost of wrangling data for users. But this design also reduces the flexibility of data development, and to some extent it cannot reach optimal efficiency.

**Combining different platforms**

For reasons of efficiency and quality, some computations with large data volumes and high complexity are not suited to a visual product, so data preparation needs to be used together with a big data platform. Building and running the platforms in combination, however, creates operational difficulties.
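To make the member-loyalty scenario described earlier more concrete, the four preparation steps (clean and split, merge, join, aggregate) can be sketched in plain Python. This is only an illustration of what such operators do logically, not how the product implements them: the sample rows, the field names (`customer_id`, `order_date`, `sales`, `level`), the rule that a negative amount counts as an anomaly, and the choice of the name part as the join key are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample rows; every field name here is an assumption.
orders_2012_2015 = [
    {"customer_id": "Alice-1001", "order_date": "2014-03-02", "sales": 120.0},
    {"customer_id": "Bob-1002", "order_date": "2015-07-19", "sales": -5.0},  # anomalous row
]
orders_2016_2019 = [
    {"customer_id": "Alice-1003", "order_date": "2017-01-11", "sales": 80.0},
    {"customer_id": "Carol-1004", "order_date": "2018-09-30", "sales": 200.0},
]
# Hypothetical member dimension table, keyed by the name part of the composite field.
members = {
    "Alice": {"level": "gold"},
    "Bob": {"level": "silver"},
    "Carol": {"level": "silver"},
}


def clean(rows):
    """Step 1: drop anomalous rows and split the composite field into two columns."""
    cleaned = []
    for row in rows:
        if row["sales"] < 0:  # assumed anomaly rule: negative sales amount
            continue
        # Split on the last hyphen so names that contain hyphens stay intact.
        name, order_id = row["customer_id"].rsplit("-", 1)
        cleaned.append({
            "name": name,
            "order_id": order_id,
            "year": int(row["order_date"][:4]),
            "sales": row["sales"],
        })
    return cleaned


def prepare(tables, member_dim):
    """Steps 2-4: merge the cleaned tables, join the dimension, aggregate."""
    # Step 2: merge (union) the cleaned detail tables into one dataset.
    detail = [row for table in tables for row in clean(table)]
    # Step 3: inner join with the member dimension to build a wide table;
    # rows without a matching member are dropped.
    wide = [
        {**row, "level": member_dim[row["name"]]["level"]}
        for row in detail
        if row["name"] in member_dim
    ]
    # Step 4: aggregate sales by year and member level.
    totals = defaultdict(float)
    for row in wide:
        totals[(row["year"], row["level"])] += row["sales"]
    return dict(totals)


summary = prepare([orders_2012_2015, orders_2016_2019], members)
print(summary)
```

The aggregated result is keyed by `(year, level)`, which is much smaller than the order-level detail and no longer needs a join at query time. That is the reason the scenario performs the join and aggregation once during preparation rather than in every analysis query.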