Does Data Need a Blood-Drop Kinship Test Too?

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Background","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"While building our data platform recently, we ran into a few problems. Two typical examples: first, when the numbers in a report look off, a colleague who knows that area has to check every related dataset and processing step to confirm that nothing was processed incorrectly; second, when we change the processing logic for one part of the data and a report changes as an indirect result, we need to notify the affected users, yet we have no convenient way to find out who they are. Our business is nowhere near as complex as that of a big tech company, but working this out from scratch every time a problem comes up is still not easy.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Understanding data lineage","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The key to solving these problems is working out how the data is related.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A big data system starts from raw data and processes it step by step through cleaning, transformation, and merging, producing data at different granularities and in different forms. These converge into a sea of data that is presented to users through various reports. The whole process forms a one-way network made up of data and processing steps, and analyzing or tracking down a problem means tracing upstream and fanning out downstream within that network.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The data structure behind this one-way network is a directed acyclic graph (DAG), like the one below.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4a/4a2203ec233f83ff6e6a1558e9f23ea2.png","alt":null,"title":"","style":[{"key":"width","value":"25%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Here a, b, c, and so on are different datasets, and the lines between them are processing steps. For example, the line between a and b points from a to b, meaning that the source data a is processed to produce the target data b.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So if data a changes, it inevitably affects data b, and the change then propagates to all the data in the graph. If only data b changes, however, just d and e are affected.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This kind of relationship between data resembles the blood ties passed between generations of living things: information in one generation of data is handed down to the next, and so are its problems. That is why it is called data lineage (数据血缘关系 in Chinese, also translated as 数据沿袭关系).","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once the lineage between datasets is known, it not only solves the two typical problems described at the start of this article but also makes the following possible:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When upstream data changes, pinpoint exactly which downstream data and reports need to be updated.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Identify how upstream data spreads into downstream data and reports, and control access to sensitive data.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Help optimize the data processing chain, cutting costs and improving efficiency.","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Managing data lineage","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When the relationships between data become especially large and tangled, a data lineage system is needed to manage them. Such systems generally come in two kinds, active and passive. An active data lineage system requires the data and its relationships to be described in advance, whereas a passive one needs no up-front definitions and instead infers the relationships by analyzing operation logs.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Our data is relatively simple for now, so there is no rush to adopt a system; we can sort out the lineage first. To map data lineage by hand, you can use a venerable tool, ","attrs":{}},{"type":"link","attrs":{"href":"https://graphviz.org/","title":"","type":null},"content":[{"type":"text","text":"Graphviz","attrs":{}}]},{"type":"text","text":", which was created at AT&T Labs in 1991 for visualizing all kinds of abstract graphs and network structures.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For example, like this:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c8/c86388418f252a850a04b12935d42bf1.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Directed acyclic graphs have many uses. Here are a few I have used them for:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Expressing dependencies between pieces of code","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Analyzing cause and effect; a fishbone diagram is essentially a directed acyclic graph too","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The version history of a Git repository","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"And mind maps","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"A closing thought","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When we run into a problem, we should use a model to understand what the problem really is. Once we find a suitable model, we can abstract the problem into a familiar shape. That makes further analysis easier, and it also lets us reach for the tools that go with the model: as long as the model fits, the tools will work.","attrs":{}}]}]}