Notes on Building a Data Warehouse

{"type":"doc","content":[
{"type":"heading","attrs":{"align":null,"level":1}},
{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A data warehouse is a subject-oriented, integrated, time-variant collection of data whose content is nonetheless relatively stable (non-volatile), used to support the management decision-making process.","attrs":{}}]}],"attrs":{}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Approach to Building a Data Warehouse","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A data warehouse is built mainly around the needs of data consumers and data developers. When planning one, first analyze each party's requirements, pain points, and nice-to-haves, then design solutions and decide what to build based on those needs. ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Data consumers","attrs":{}},{"type":"text","text":" mainly want to know what data exists, whether they can find it quickly, and how to use it once found. In practice they run into three kinds of problems:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Can't find it","attrs":{}},{"type":"text","text":": they do not know whether the data exists or where it lives.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Can't understand it","attrs":{}},{"type":"text","text":": many business users are not on engineering teams and cannot tell what the data means, how to join it in queries, or which business system it comes from.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Can't use it","attrs":{}},{"type":"text","text":": they do not know how to write the SQL, or which products expose the metrics they need.","attrs":{}}]}]}
],"attrs":{}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For data consumers, therefore, the warehouse must make data ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"findable, understandable, and correctly usable","attrs":{}},{"type":"text","text":". Data development engineers care more about convenient and efficient development and about locating problems quickly, so data developers have the following main needs:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Data reuse","attrs":{}},{"type":"text","text":": building each data request as its own silo (stovepipe development) produces a lot of duplicated logic. Reusing data shortens delivery time for data requests, raises development efficiency, and meets the business's demand for agile data development.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Problem tracing","attrs":{}},{"type":"text","text":": the source of a problem can be located quickly during data processing and data-quality analysis.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Impact analysis","attrs":{}},{"type":"text","text":": the impact of changing a data rule, or of publishing or retiring a dataset, can be analyzed quickly and efficiently.","attrs":{}}]}]}
],"attrs":{}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"What to Build","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once the goals are clear, warehouse capabilities are built out mainly in the following areas:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Layered architecture","attrs":{}},{"type":"text","text":": layering simplifies data cleansing and lays the foundation for reusable data and models.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Master data management","attrs":{}},{"type":"text","text":": master data connects the business chains, unifies data vocabulary and data standards, and enables data sharing.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Metric system","attrs":{}},{"type":"text","text":": a metric system organizes individual metrics within a specific framework, which unifies metric names and definitions, clarifies how metrics are composed, and avoids duplicated work.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Root-word management","attrs":{}},{"type":"text","text":": root words are used to standardize table names, field names, subject-area names, and so on.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Data lineage","attrs":{}},{"type":"text","text":": data tracing, plus assessment of data value and data quality.","attrs":{}}]}]}
],"attrs":{}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Layered architecture","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Managing data in layers makes warehouse data easier to organize, manage, and maintain, and it simplifies development work. The processing logic in each layer is relatively simple and easy to understand, so the correctness of each step is easier to guarantee, which in turn ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"simplifies the data-cleansing process","attrs":{}},{"type":"text","text":". Layering ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"trades space for time","attrs":{}},{"type":"text","text":": heavy preprocessing improves the responsiveness (efficiency) of downstream applications, so a data warehouse holds a lot of redundant data. Without layers, a change to a business rule in a source system affects the entire cleansing pipeline, and the rework is enormous. Layering also ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"lays the foundation for reusable data and models","attrs":{}},{"type":"text","text":". Many data-quality problems arise because data and models cannot be reused, so business definitions and technical definitions drift apart; every new request is recomputed from raw data, which breeds further quality problems. A typical layering looks like this:","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/5d/5db7d2dd89eb585911bbec5d2e954a83.webp","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"ODS layer","attrs":{}},{"type":"text","text":": a staging area for loading and processing source data from business systems. ODS is system-oriented and mirrors the source. Data structure and granularity are left unchanged, but dirty data must be cleansed.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"DWD","attrs":{}},{"type":"text","text":": the single, integrated, accurate version of the enterprise's data. Data is organized by subject area, restructured into entities and relationships, and kept at the finest granularity. It uses E-R modeling.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"DWS","attrs":{}},{"type":"text","text":": business-oriented, with dimensional modeling. Data is organized by business process, restructured into fact and dimension tables, and aggregated to whatever granularity the business needs.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"ADS","attrs":{}},{"type":"text","text":": oriented to application scenarios; it uses tools suited to each scenario to make data storage and processing more efficient, and so provides data services.","attrs":{}}]}]}
],"attrs":{}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Master data management","attrs":{}}]},
{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Master data is the basic information of an enterprise (organization) that meets the need for cross-department collaboration and reflects the state and attributes of core business entities; its attributes are relatively stable, its accuracy requirements are higher, and it is uniquely identifiable. It is referred to as MDM. This is the definition given in the Master Data Management Practice White Paper (《主數據管理實踐白皮書》).","attrs":{}}]}],"attrs":{}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Master data describes the key facts of the core business, such as customers, products, employees, and regions, together with the relationships among them. Master data management delivers value mainly in the following ways:","attrs":{}}]},
{"type":"bulletedlist","content":[
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Eliminating data redundancy","attrs":{}},{"type":"text","text":": when different systems and departments acquire data by their own rules and needs, the same data tends to be stored repeatedly. Master data connects the business chains, unifies data vocabulary and data standards, and enables data sharing, which removes as much redundancy as possible.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Improving data-processing efficiency","attrs":{}},{"type":"text","text":": systems and departments define data differently, versions disagree, and a single core subject may exist in several versions, all of which takes a great deal of labor and time to reconcile. Master data management allows data to be consolidated, replicated, distributed, and shared dynamically.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Strengthening strategic alignment","attrs":{}},{"type":"text","text":": data is the “common language” of internal business analysis and decision support. Once it is unified across departments, it helps break down barriers between departments and systems, enables information integration and sharing, and strengthens the company's overall strategic alignment.","attrs":{}}]}]}
],"attrs":{}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure below shows a sample master-data asset inventory. Master data management is achieved mainly through the following:","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/86/861fd5226262f6b2e4908495769343b3.webp","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Go deep into the business","attrs":{}},{"type":"text","text":": the key entities of each business line both differ and overlap. Only a deep understanding of the business can keep master data consistent, accurate, complete, and controllable.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Manage by subject area","attrs":{}},{"type":"text","text":": manage master data in a three-level catalog of business line, subject area, and business process. Classifying master data hierarchically makes it more efficient to manage.","attrs":{}}]}]}
],"attrs":{}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Metric system","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A metric is a numeric value that quantifies how much of something there is; it is sometimes called a measure. DNU (daily new users) and retention rate are both metrics. A metric system organizes individual metrics within a specific framework, which unifies metric names and definitions, clarifies how metrics are composed, and avoids duplicated work. The figure below shows a sample metric system.","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/56/5658636e46e50e654879856de210cb99.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Business process","attrs":{}},{"type":"text","text":": a business process is an indivisible behavioral event in the enterprise's activities.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Dimension/attribute","attrs":{}},{"type":"text","text":": a dimension is a perspective from which a business process is observed and analyzed; an attribute is information that describes a dimension.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Atomic metric","attrs":{}},{"type":"text","text":": an atomic metric is a measure of a specific business process, or a count over a specific dimension/attribute. It has a clear business meaning and cannot be decomposed further at the logical level.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Modifier","attrs":{}},{"type":"text","text":": a modifier qualifies an atomic metric. It corresponds to a definite business scenario and business rule, and it delimits the scope over which the atomic metric is computed.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Derived metric","attrs":{}},{"type":"text","text":": a derived metric is an atomic metric combined with one or more modifiers.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Composite metric","attrs":{}},{"type":"text","text":": the direct result of applying a formula to atomic and derived metrics.","attrs":{}}]}]}
],"attrs":{}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Root-word management","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Root words are the enterprise's finest-grained business terms and the basis for managing dimensions and metrics. They are used to unify table names, field names, and subject-area names. Build and maintain a root-word library that converges over time: business domains and subject areas can be enumerated as root words and refined continuously. Granularities are handled the same way, chiefly time granularities such as day, week, month, and year, each with an abbreviation defined as a root word. Field names in warehouse development can then be composed from root words. Root words are divided into common and proprietary root words:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Common root word: the smallest unit that describes a thing, e.g. 交易 (transaction) - trade.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Proprietary root word: a term established by convention or specific to an industry, e.g. 美元 (US dollar) - USD.","attrs":{}}]}]}
],"attrs":{}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Sample root words:","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6e/6ee3493cf09f87de5d9b2eec44634076.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Data lineage","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At every stage of processing, from the data source to the final output, data-quality problems can be introduced. If the source data is of poor quality and later stages neither check nor correct it, the data that finally lands in the target table will be of poor quality too. Alternatively, an inappropriate transformation at some stage can degrade the quality of everything downstream. For data lineage, therefore, quality must be checked and handled at every stage; only then does downstream data inherit good genes, that is, high data quality.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What data lineage is used for","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Data tracing","attrs":{}},{"type":"text","text":": lineage shows where data came from and where it goes, which helps trace its source and the processing it went through.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Assessing data value","attrs":{}},{"type":"text","text":": data value matters a great deal in data trading. Lineage provides a basis for assessing value in terms of the data's audience, its update volume, and its update frequency.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Assessing data quality","attrs":{}},{"type":"text","text":": with clear data sources and processing methods, the quality of the data at each node can be determined. The lineage graph also makes it easy to see the list of cleansing standards applied.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"A reference for archiving and destruction","attrs":{}},{"type":"text","text":": from a lifecycle-management perspective, lineage helps determine where data is in its lifecycle and serves as a reference for archiving and destroying it.","attrs":{}}]}]}
],"attrs":{}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A sample data-lineage diagram:","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e1/e12bc05af5b931d41b9cdae8e202a0cb.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}
]}
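
The data tracing and impact analysis described in the data-lineage section can be sketched as a graph traversal. This is only an illustrative sketch, not the article's implementation: the table names (`ods_order`, `dwd_order`, `dws_trade_day`, `ads_trade_report`, and so on) and the helper functions are hypothetical, and the edge list stands in for whatever a real lineage tool would collect from job metadata.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (upstream table, downstream table).
# The names follow the ODS -> DWD -> DWS -> ADS layering for illustration only.
EDGES = [
    ("ods_order", "dwd_order"),
    ("ods_user", "dwd_user"),
    ("dwd_order", "dws_trade_day"),
    ("dwd_user", "dws_trade_day"),
    ("dws_trade_day", "ads_trade_report"),
]


def build_graph(edges):
    """Build downstream and upstream adjacency maps from the edge list."""
    downstream = defaultdict(set)
    upstream = defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return downstream, upstream


def reachable(graph, start):
    """Breadth-first search: all nodes reachable from start, excluding start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # only matters if the lineage graph contains a cycle
    return seen


DOWNSTREAM, UPSTREAM = build_graph(EDGES)


def trace_sources(table):
    """Data tracing: every table that feeds into the given table."""
    return reachable(UPSTREAM, table)


def impacted_tables(table):
    """Impact analysis: every table affected if the given table changes."""
    return reachable(DOWNSTREAM, table)


if __name__ == "__main__":
    print(sorted(trace_sources("ads_trade_report")))
    print(sorted(impacted_tables("ods_order")))
```

Keeping both an upstream and a downstream map means the same traversal answers both questions: walking upstream from a report locates candidate problem sources, and walking downstream from a source table lists what a rule change or table retirement would touch.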