Does Data Need a Paternity Test, Too?

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"背景","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最近我們在建設數據平臺的過程中,遇到了一些問題。舉兩個典型的例子:一、報表裏的數據看着不太對,就得讓熟悉這部分的同事先去檢查所有相關的數據和處理過程,確認沒有處理上的錯誤;二、修改了某個部分的數據處理邏輯,間接引起了報表變化,需要通知被影響的用戶,卻沒有一個便捷的機制知道誰被影響了。我們的業務雖說比不上大廠複雜,每次遇到問題現想,卻也不是容易的事。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"認識數據血緣關係","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"解決問題的核心是搞清數據之間的關係。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大數據系統在處理數據時,是從原始數據入手,通過清洗、轉換、融合一步步加工,得到不同粒度和形態的數據,匯聚成數據的海洋,通過不同的報表展現給使用者。整個過程形成了一個由數據和處理過程組成的單向網絡,分析查找問題就需要在這個網絡裏向上溯源、向下擴散。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這個單向網絡的數據結構,就是有向無環圖,像下面這樣。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4a/4a2203ec233f83ff6e6a1558e9f23ea2.png","alt":null,"title":"","style":[{"key":"width","value":"25%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中 a、b、c…… 這些是不同的數據,而它們之間連線就是數據處理過程。比如 a 和 b 之間的連線,從 a 指向 b,說的是源數據 a 經過處理,得到了目標數據 
b。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這樣,如果數據 a 發生了變化,必然會影響數據 b,進而傳遞到這張圖中的所有數據。而如果只是數據 b 發生了變化,那就只會影響到 d 和 e。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這種數據之間的關係,和生物之間代際相傳的血緣關係比較相似,上一代數據中的信息會傳遞到下一代,而問題也會隨之遺傳到下一代。因此,人們稱之爲數據血緣關係(Data Lineage),也叫數據沿襲關係。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果能得到數據之間的血緣關係,除了解決本文開頭提到的兩個典型問題之外,也可以讓下面的這些事情成爲可能:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在上游數據發生變化時,能夠精準確定需要更新的下游數據和報表範圍。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"能夠識別上游數據在下游數據和和報表中的擴散方式,控制對敏感數據的訪問。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有助於優化數據處理鏈條,降本增效。","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"管理數據血緣關係","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果數據關係特別龐雜,就需要用數據血緣系統來管理了。數據血緣系統一般分爲主動和被動兩種,主動數據血緣系統需要預先描述數據和它們之間的關係;而被動數據血緣系統,則不需要預先對數據進行定義,而是通過分析操作日誌,逆向推導出數據之間的關係。","attrs":{}}]},{"typ
e":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們的數據目前相對簡單,可以不急於上系統,先理清楚血緣關係。人工整理數據血緣關係,可以使用一個古老的工具 ","attrs":{}},{"type":"link","attrs":{"href":"https://graphviz.org/","title":"","type":null},"content":[{"type":"text","text":"Graphviz","attrs":{}}]},{"type":"text","text":",它在 1991 年誕生於 AT&T 實驗室,用於展現各種抽象的圖和網絡結構。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"比如像下面這樣:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c8/c86388418f252a850a04b12935d42bf1.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有向無環圖有很多用途,列舉幾個我用到過的:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"表達代碼之間的依賴關係","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"分析因果關係,魚骨圖本質上也是一種有向無環圖","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content
":[{"type":"text","text":"Git 代碼庫的版本歷史","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"還有思維導圖","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"一點感想","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"遇到問題之後,要使用模型來理解問題的本質是什麼,找到了合適的模型,就可以把問題抽象成我們熟悉的樣子。不僅便於我們進一步分析,也便於我們拿和模型相對應的工具,只要模型適用,工具就能發揮作用。","attrs":{}}]}]}