伴魚數據質量中心的設計與實現

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"日常工作中,數據開發工程師開發上線完一個任務後並不是就可以高枕無憂了,時常會因爲上游鏈路數據異常或者自身處理邏輯的 BUG 導致產出的數據結果不可信。而這個問題的發現可能會經歷一個較長的週期(尤其是離線場景),往往是業務方通過上層數據報表發現數據異常後 push 數據方去定位問題(對於一個較冷的報表,這個週期可能會更長)。同時,由於數據加工鏈路較長需要藉助數據的血緣關係逐個任務排查,也會導致問題的定位難度增大,嚴重影響開發人員的工作效率。更有甚者,如果數據問題沒有被及時發現,可能導致業務方作出錯誤的決策。此類問題可統一歸屬爲大數據領域數據質量的問題。本文將向大家介紹伴魚基礎架構數據團隊在應對該類問題時推出的平臺化產品-數據質量中心(Data Quality Center, DQC)的設計與實現。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"調研"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"業內關於數據質量平臺化的產品介紹不多,我們主要對兩個開源產品和一個雲平臺產品進行了調研,下面將一一介紹。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Apache Griffin"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/github.com\/apache\/griffin","title":null,"type":null},"content":[{"type":"text","text":"Apache Griffin"}]},{"type":"text","text":" 是 eBay 開源的一款基於 Apache Hadoop 和 Apache Spark 的數據質量服務平臺。其架構圖如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/67\/674dc3e0dff8631e5cd1e376fc3d9342.png","alt":"griffin","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"架構圖從 High Level 
層面清晰地展示了數據質量平臺的三個核心流程:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Define:數據質檢規則(指標)的定義。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Measure:數據質檢任務的執行,基於 Spark 引擎實現。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Analyze:數據質檢結果量化及可視化展示。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同時,平臺對數據質檢規則進行了分類(這也是目前業內普遍認可的數據質量的六大標準):"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Accuracy:準確性。如是否符合表的加工邏輯。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Completeness:完備性。如數據是否存在丟失。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Timeliness:及時性。如表數據是否按時產生。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Uniqueness:唯一性。如主鍵字段是否唯一。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Validity:合規性。如字段長度是否合規、枚舉值集合是否合規。"}]}]},{"type":"listitem","attrs":{"listSty
le":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Consistency:一致性。如表與表之間在某些字段上是否存在矛盾。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"目前該開源項目僅在 Accuracy 類的規則上進行了實現。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Griffin 是一個完全閉環的平臺化產品。其質檢任務的執行依賴於內置定時調度器的調度,調度執行時間由用戶在 UI 上設定。任務將通過 Apache Livy 組件提交至配置的 Spark 集羣。這也就意味着質檢的實時性難以保障,我們無法對產出異常數據的任務進行強行阻斷,二者不是在同一個調度平臺被調度,時序上也不能保持串行。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Qualitis"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/github.com\/WeBankFinTech\/Qualitis","title":null,"type":null},"content":[{"type":"text","text":"Qualitis"}]},{"type":"text","text":" 是微衆銀行開源的一款數據質量管理系統。同樣,它提供了一整套統一的流程來定義和檢測數據集的質量並及時報告問題。從整個流程上看我們依然可以用 Define、Measure 和 Analyze 描述。它是基於其開源的另一款組件 Linkis 進行計算任務的代理分發,底層依賴 Spark 引擎,同時可以與其開源的 DataSphereStudio 任務開發平臺無縫銜接,也就實現了在任務執行的工作流中嵌入質檢任務,滿足質檢時效性的要求。可見,Qualitis 需要藉助微衆銀行開源的一系列產品才能達到滿意的效果。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"DataWorks 數據質量"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/help.aliyun.com\/document_detail\/73660.html?spm=a2c4g.11186623.6.1172.69876ef1ShbJMp","title":null,"type":null},"content":[{"type":"text","text":"DataWorks"}]},{"type":"text","text":" 是阿里雲上提供的一站式大數據工場,其中就包括了數據質量在內的產品解決方案。同樣,它的實現依賴於阿里雲上其他產品組件的支持。不過不得不說 DataWorks 
數據質量部分的使用介紹從產品形態上給了我們很大的幫助,對於我們的產品設計非常具有指導性的作用。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"設計目標"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"經過一番調研,我們確定了 DQC 的設計目標,主要包括以下幾點:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"目前暫且只支持離線部分的數據質量管理。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持通用的規則描述和規則管理。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"質檢任務由公司內部統一的調度引擎調度執行,可支持對質檢結果異常的任務進行強阻斷。同時,儘量降低質檢功能對調度引擎的代碼侵入。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持質檢結果的可視化。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"系統設計"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"背景補充"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"伴魚離線調度開發平臺是基於 "},{"type":"link","attrs":{"href":"https:\/\/github.com\/apache\/dolphinscheduler","title":null,"type":null},"content":[{"type":"text","text":"Apache Dolphinscheduler"}]},{"type":"text","text":"(下文簡稱 DS)實現的。它是一個分佈式去中心化,易擴展的可視化 DAG 調度系統,支持包括 Shell、Python、Spark、Flink 等多種類型的 Task 
任務,並具有很好的擴展性。架構如下圖所示:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/81\/8158fe8b0331456e30c64bd0cdd31673.png","alt":"ds","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Master 節點負責任務的監聽和調度,Worker 節點則負責任務的執行。值得注意的是,每一個需要被調度的任務必然需要設置一個調度時間的表達式(cron 表達式),由 Quartz 定時爲任務生成待執行的 DAG Command,有且僅有一個 Master 節點獲得執行權,掌管該 DAG 各任務節點的調度執行。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"整體架構"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以下是平臺整體的架構圖:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/f7\/f781d6ec9fdb194210b1e9aaae864b46.png","alt":"dqc","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由以下幾部分組成:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DQC Web 
UI:質檢規則等前端操作頁面。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DQC(GO):簡單的實體元數據管理後臺。主要包括:規則、規則模板、質檢任務和質檢結果幾個實體。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DS(數據質量部分):質檢任務依賴 DS 調度執行,需要對 DS 進行一定的改造。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DQC SDK(JAR):DS 調度執行任務時,檢測到任務綁定了質檢規則,將生成一類新的任務 DQC Task (與 DS 中其他類型的 Task 同級,DS 對於 Task 進行了很好的抽象可以方便擴展),本質上該 Task 將以腳本形式調用執行 DQC SDK 的邏輯。DQC SDK 涵蓋了規則解析、執行的全部邏輯。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下文主要闡述我們在各模塊設計上的一些思考和權衡。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"規則表述"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"標準與規則"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"前文在調研部分提及了業內普遍認可的數據質量的六大標準。那麼問題來了:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如何將標準與平臺的規則對應起來?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"標準中涉及到的現實場景我們是否可以一一枚舉?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"即便我們可以將標準一一細化,數據開發人員是否可以輕鬆地理解?"}]}]}
]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可以將這些問題統一歸類爲:平臺在規則設定上是否需要和業界數據質量標準所抽象出來的概念進行綁定。很遺憾我們並沒有找到有關數據質量標準更加細化和指導性的描述,事實上作爲一個開發人員這些概念對於我來說是比較費解的,而更貼近程序員視角的方式是「show me the code」,因此我們決定將這一層概念弱化。未來更深入的實踐過程後再做更細化的思考。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"標量化"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"接下來我們着重討論下另一個問題:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如何對規則提供一種通用的描述(or maybe a kind of DSL)?"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其實當我們跳脫出前文所描述的一切背景和概念,仔細思考下數據質檢的過程,會發現本質上就是通過一次真實的任務執行產出結果,然後對比輸出結果與期望是否滿足,以驗證任務邏輯的正確性。這個過程可形象得和 Unit Testing 進行類比,只不過 Unit Testing 是通過模擬數據構造的一次代碼邏輯的執行。另外數據任務執行產生的結果是一張二維結構的 Hive 表,需要進行加工方能獲取到想要的統計結果,這也是兩者的區別之一。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"順着這個思路,我們可以利用 Unit Testing 的概念從以下三方面繼續深入:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Actual 
Value"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據任務執行產出的結果是一張 Hive 表,我們需要對這張 Hive 表的數據進行加工、提取以獲得需要的 Actual Value。涉及到對 Hive 表的加工,必然想到是以 SQL 的方式來實現,通過 Query 和 一系列 Aggregation 操作拿到結果,此結果的結構又可分爲以下三類:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"二維數組"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"單行或者單列的一維數組"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"單行且單列的標量"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"顯然單行且單列的標量是我們期望得到的,因爲它更易於結果的比較(事實上就目前我們所能想到的規則,都可以通過 SQL 方式提取爲一個標量結果)。因此,在規則設計中,需要規則創建者輸入一段用於結果提取的 SQL,該段 SQL 的執行結果需要爲一個標量。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Expected Value"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"既然 Actual Value 是一個標量,那麼 Expected Value 
同樣也是一個標量,需要規則創建者在平臺輸入。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Assert"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上述標量的類型決定了斷言的比較方式。目前我們只支持了數值型標量的比較方式,包含「大於」、「等於」及「小於」三種比較算子。如出現其他類型標量,需要擴充比較的方式。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以上三要素即可完整的描述規則想要表達的核心邏輯。如我們想要表述「字段爲空異常」的規則(潛在含義:字段爲空的行數大於 0 時判定異常),就可以通過以下設定滿足:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Actual Value :出現字段爲空的行數"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Expected Value:0"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Assert: 「大於」"}]}]}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"規則管理"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"規則模板"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"規則模板是爲了規則複用抽象出的一個概念,模板中包含規則的 SQL 定義、規則的比較方式、參數定義(注:SQL 
中包含一些佔位符,這些佔位符將以參數的形式被定義,在規則實體定義時需要用戶明確具體含義)以及其他的一些元信息。下圖爲「字段空值的行數」模板的示例:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c5\/c53a9e069070230be06e10b537d26bea.png","alt":"rule_template","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"規則實體"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"規則實體是基於規則模板構建的,是規則的具象表達。在規則實體中將明確規則的 Expected Value、比較方式中具體的比較算子、參數的含義以及其他的一些元信息。基於同一個規則模板,可以構造多個規則實體。下圖爲「某表 user_id 唯一性校驗」規則的示例:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/f8\/f835b96d919c56fbf19bf1e930f1b829.png","alt":"rule","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"值得一提的是,規則可能不僅僅只是針對單表的校驗,對於多表的情況我們這套規則模板同樣是適用的,只要我們可以將邏輯使用 SQL 表達。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"規則綁定"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在 DS 的前端交互上支持爲任務直接綁定校驗規則,規則列表通過 API 從 DQC 獲取,這種方式在用戶的使用體驗上存在一定的割裂(規則創建和綁定在兩個平臺完成)。同時,在 DQC 的前端亦可以直接設置關聯調度,爲已有任務綁定質檢規則,任務列表通過 API 從 DS 獲取。同一個任務可綁定多個質檢規則,這些信息將存儲至 DS 的 DAG 
元信息中。那麼這裏需要考慮幾個問題:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"規則的哪些信息應該存儲至 DAG 的元信息中?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"規則的更新 DAG 元信息是否可以實時同步?"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"主要有兩種方式:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以大 Json 方式將規則信息打包存儲,計算時解析 Json 逐個執行校驗。在規則更新時,需要同步調用修改 Json 信息。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以 List 方式存儲規則 ID,計算時需執行一次 Pull 操作獲取規則具體信息然後執行校驗。規則更新,無須同步更新 List 信息。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們選擇了後者,ID List 方式可以使對 DS 
的侵入降到最低。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/23\/230d7ea54e577f12b27bd60e9bbff227.png","alt":"rule_bind","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"規則執行"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"強規則和弱規則"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"規則的強弱性質由用戶爲任務綁定規則時設定,此性質決定了規則執行的方式。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"強規則"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"和當前所執行的任務節點同步執行,一旦規則檢測失敗整個任務節點將置爲執行失敗的狀態,後續任務節點的執行會被阻斷。對應 DS 中的執行過程表述如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Step1:某一個 Master 節點獲取 DAG 的執行權,將 DAG 拆分成不同的 Job Task 先後下發給 Worker 節點執行。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Step2:執行 Job Task 邏輯,並設置 Job Task 的 
ExitStatusCode。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Step3:判斷 Job Task 是否綁定了強規則。若是,則生成 DQC Task 並觸發執行,最後根據執行結果修正 Job Task 的 ExitStatusCode。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Step4:Master 節點根據 Job Task 的 ExitStatusCode 判定任務是否成功執行,繼續進入後續的調度邏輯。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"弱規則"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"和當前所執行的任務節點異步執行,規則檢測結果對於原有的任務執行狀態無影響,從而也就不能阻斷後續任務的執行。對應 DS 中的執行過程表述如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Step1:某一個 Master 節點獲取 DAG 的執行權,將 DAG 拆分成不同的 Job Task 先後下發給 Worker 節點執行。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Step2:執行 Job Task 邏輯,並設置 Job Task 的 ExitStatusCode。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Step3:判斷 Job Task 是否綁定了弱規則。若是,則在 Job Task 的 Context 中設置弱規則的標記 
。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Step4:Master 節點根據 Job Task 的 ExitStatusCode 判定任務是否成功執行,若成功執行再判定是否 Context 中帶有弱規則標記,若有則生成一個新的 DAG(有且僅有一個 DQC Task,且新生成的 DAG 與 當前執行的 DAG 沒有任何的關聯) 然後繼續進入後續的調度邏輯。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Step5:各 Master 節點競爭新生成的 DAG 的執行權。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可以看出在強弱規則的執行方式上,對 DS 調度部分的代碼有一定的侵入,但這個改動不大,成本是可以接受的。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"DQC Task & DQC SDK"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上文提及到一個 Job Task 綁定的規則(可能有多個)將被轉換爲一個 DQC Task 被 DS 調度執行,接下來我們就討論下 DQC Task 的實現細節以及由此引出的 DQC SDK 的設計和實現。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DQC Task 繼承自 DS 中的抽象類 AbstractTask,只需要實現抽象方法 handle(任務執行的具體實現)即可。那麼對於我們的質檢任務,實際上執行邏輯可以拆分成以下幾步:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"提取 Job Task 綁定的待執行的 Rule ID 
List。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"拉取各個 Rule ID 對應的詳情信息。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"構建完整的執行 Query 語句(將規則參數填充至模板 SQL 中)。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"執行 Query。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"執行 Assert。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最核心的步驟爲 Query 的執行。Query 的實現方式又可分爲兩種:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Spark 實現"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"優點:實現可控,靈活性更高。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"缺點:配置性要求較高。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Presto SQL 
實現"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"優點:不需要額外配置,開發量少,拼接 SQL 即可。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"缺點:速度沒有 Spark 快。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們選擇了後者,這種方式最易實現,離線場景這部分的計算耗時也可以接受。同時由於一個 DQC Task 包含多條規則,在拼接 SQL 時將同表的規則聚合以減少 IO 次數。不同的 SQL 交由不同的線程並行執行。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上述執行邏輯其實是一個完整且閉環的功能模塊,因此我們想到將其作爲一個單獨的 SDK 對外提供,並以 Jar 包的形式被 DS 依賴,後續即便是更換調度引擎,這部分的邏輯可直接遷移使用(當然概率很低)。那麼 DS 中 DQC Task 的 handle 邏輯也就變得異常簡單,直接以 Shell 形式調用 SDK ,進一步降低對 DS 
代碼的侵入。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"執行結果"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"單條規則的質檢結果將在平臺上直接展現,目前我們還未對任務級的規則進行聚合彙總,這是接下來需要完善的。對於質檢失敗的任務將向報警接收人發送報警。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/97\/97d93052e2d7b6cc9bac778ff8c84e11.png","alt":"result","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"實踐中的問題"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"平臺解決了規則創建、規則執行的問題,而在實踐過程中,對用戶而言更關心的問題是:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一個任務應該需要涵蓋哪些的規則纔能有效地保證數據的質量?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們不可能對全部的表和字段都添加規則,那麼到底哪些是需要添加的?"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這些是很難通過平臺自動實現的,因爲平臺理解不了業務的信息,平臺能做的只能是通過質量檢測報告給與用戶反饋。因此這個事情需要具體的開發人員對核心場景進行梳理,在充分理解業務場景後根據實際情況進行設定。話又說回來,平臺只是工具,每一個數據開發人員應當提升保證數據質量的意識,這又涉及到組織內規範落地的問題了。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"未來工作"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"a
lign":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據質量管理是一個長期的過程,未來在平臺化方向我們還有幾個關鍵的部分有待繼續推進:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於血緣關係建立全鏈路的數據質量監控。當前的監控粒度是任務級的,如果規則設置的是弱規則,下游對於數據問題依舊很難感知。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據質量的結果量化。需要建立起一套指標用於定量地衡量數據的質量。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持實時數據的質量檢測。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"參考"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/github.com\/apache\/griffin","title":null,"type":null},"content":[{"type":"text","text":"Apache Griffin"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/github.com\/WeBankFinTech\/Qualitis","title":null,"type":null},"content":[{"type":"text","text":"Qualitis"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/help.aliyun.com\/document_detail\/73660.html?spm=a2c4g.11186623.6.1172.69876ef1ShbJMp","title":null,"type":null},"content":[{"type":"text","text":"DataWorks"}]}]}]}