Uber數據文化建設實踐

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"數據賦能Uber"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Uber通過賦能數十億打車和快遞服務,連接數以百萬計的乘客、企業、餐館、司機和快遞員,徹底改變了世界的出行方式。這個龐大的交通平臺的核心是大數據和數據科學,它們支撐着Uber的所有工作,比如更好的定價和匹配、欺詐檢測、降低預計達到時間(ETA)和實驗。每天PB級的數據被收集和處理,成千上萬用戶根據這些數據進行分析決策,從而構建\/改進這些產品。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/fe\/d9\/fe9674f9f7c404944bdaf13e175967d9.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"規模擴展帶來的問題"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"雖然我們能夠擴展我們的數據系統,但以前,對於一些重要的數據問題,我們沒有給予足夠的關注,在規模擴大之後,它們變得更加重要,涉及的具體問題包括:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"數據重複:"},{"type":"text","text":" 一些關鍵數據和指標缺少一個真實的數據來源,這導致了重複、不一致,並且在使用時會有很多困惑。消費者必須從解決業務問題中抽出時間來做大量的盡職調查,從而彌補這一點。使用自助服務工具創建的數十萬個數據集加劇了這個問題,因爲我們無法明顯看出哪個數據集更重要。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"發現問題:"},{"type":"text","text":" 如果沒有豐富的元數據和分面搜索,在數十萬數據集中發現數據是很困難的。糟糕的發現導致了重複的數據集、重複的工作和不一致的答案(這取決於回答問題時所使用的數據)。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"**工具互不"},{"type":"text","marks":[{"type":"strong"},{"type":"strong"}],"text":"連"},{"type":"text","text":"通:**數據流經許多工具、系統和組織。但是我們的工具沒有相互集成,導致工作重複和糟糕的開發體驗——例如,必須在多個工具之間複製粘貼文檔和所有者信息;開發者無法自信地修改數據模式,因爲不清楚它在下游是如何使用的。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"日誌不一致:"},{"type":"text","text":" 在移動設備上的日誌是手動完成的;日誌沒有統一的結構,我們無法通過簡單、一致的方法度量用戶的實際行爲,只能通過推斷來判定(這低效且容易出錯)。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"流程缺失:"},{"type":"text","text":" 缺乏跨團隊的數據工程流程,導致各個團隊的成熟度不同,團隊間沒有一致的數據質量定義或指標。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"所有權和SLA缺失:"},{"type":"text","text":" 數據集沒有明確的屬主——它們通常沒有質量保證、錯誤修復的SLA不一致,電話支持、事件管理與我們對服務的管理方式相去甚遠。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這些問題並不是Uber獨有的——根據我們與其它公司的工程師和數據科學家的交流,這些問題很常見,特別是對於那些增長非常快的公司。由於服務故障\/中斷即時可見,所以我們對服務和服務質量的關注比較多,而對數據和相關工具的關注往往比較少。但在規模比較大時,解決這些問題,並使其與服務工具\/管理的嚴格程度保持一致,變得極其重要,尤其是如果數據在產品功能和創新中扮演着關鍵角色,正如數據在Uber的角色一樣。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"需要一種整體的數據解決方案"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下圖顯示了從移動應用程序和服務到數據倉庫和最終消費面的高級數據流。我們最初只在數據流中出現數據問題的地方應急性地解決了這些問題的症狀,而沒有解決根本問題。我們認識到,需要一種整體的方案來解決這些問題,並徹底解決其根源。我們的目標是重新組織數據日誌系統、工具和流程,從而逐步改變整個Uber的數據質量。我們召集了橫跨端到端數據流堆棧的團隊,其中包括來自堆棧各個部分的工程師和數據科學家,最終修改了20多個現有系統。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/48\/03\/4870f6afebc991c70b4333765ff88e03.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了將精力集中在整體思考上,我們從Rider應用程序上獲取了與行程和會話信息相關的關鍵數據“切片”,並嘗試爲它們構建一個真實數據源(source of truth,SoT),並且修復應用程序上的日誌記錄、數據處理工具、數據本身以及將其維護成SoT所需的過程。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"數據處理的基本原則"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"與試圖隱藏數據並向服務外部公開狹窄接口的服務不同,倉庫中的離線數據更多的是公開來自相關服務和領域的數據,以便一起進行分析。我們的一個關鍵認識是,爲了做好這一點,我們不僅要解決數據工具的問題,還要解決數據的人員和流程方面的問題。因此我們提出了一些指導原則:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"數據即代碼:"},{"type":"text","text":" 數據應該作爲代碼對待。對數據工件的創建、棄用和關鍵更改應該通過設計評審流程,並使用適當的書面文檔,而且是以客戶視角編寫的文檔。必須爲模型更改指定審閱者,他們在更改落地之前進行評審。模型複用\/擴展優先於創建新模型。數據工件具有與之相關聯的測試,我們要對其進行持續測試。通常,這是用於API服務的實踐,在考慮數據時,我們需要同樣嚴格。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"數據要有屬主:"},{"type":"text","text":" 數據就是代碼,所有的代碼必須有屬主。每個數據工件必須有一個明確的所有者,一個明確的目的,並且在用完後廢棄。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"數據質量可知:"},{"type":"text","text":" 數據工件必須有數據質量SLA,以及事件報告和管理,就像我們對服務所做的那樣。所有者負責維護這些服務協議。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"加速數據生產效率:"},{"type":"text","text":" 數據工具的設計必須可以改進生產者和消費者之間的協作,必要時必須有所有者、文檔和審閱者。數據工具必須與其它相關工具無縫集成,讓我們不必再考慮必要的元數據。數據工具應配備與服務相同級別的開發人員,提供在更改落地之前編寫和運行測試的能力,在轉入生產環境之前在準生產環境中測試這些更改的能力,並與現有的監控\/告警生態系統很好地集成。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"針對數據的組織:"},{"type":"text","text":" 團隊的目標應該是“全棧”配置,因此,必要的數據工程人才就可以對數據的整個生命週期進行長期的觀察。雖然更爲核心的團隊有複雜的數據集,但是大多數生成數據的團隊應該以本地所有權爲目標。我們應該有必要的培訓材料,並優先培訓工程師,使其相當精通數據的生產和消費實踐。最後,團隊領導應該對他們生產和使用的數據的所有權和質量負責。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"已經解決的問題"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在本文的餘下部分,我們將重點介紹我們編程體驗中一些最有用也最有趣的收穫。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"數據質量和等級"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於數據質量差,我們經受了許多艱苦的工作。我們看到過這樣的例子:實驗測量值不準確,導致了大量的體力勞動,降低了驗證和修正數據的效率。事實證明,隨着大數據的應用,這個問題變得越來越普遍——據"},{"type":"link","attrs":{"href":"https:\/\/www.ibm.com\/blogs\/journey-to-ai\/?fileGuid=9DPRRvD9vRXQrPCv","title":"","type":null},"content":[{"type":"text","text":"IBM"}]},{"type":"link","attrs":{"href":"https:\/\/www.ibm.com\/blogs\/journey-to-ai\/?fileGuid=9DPRRvD9vRXQrPCv","title":"","type":null},"content":[{"type":"text","text":"的一份研究"}]},{"type":"text","text":"和哈佛商業評論("},{"type":"link","attrs":{"href":"https:\/\/hbr.org\/2016\/09\/bad-data-costs-the-u-s-3-trillion-per-year?fileGuid=9DPRRvD9vRXQrPCv","title":"","type":null},"content":[{"type":"text","text":"HBR"}]},{"type":"text","text":")估計,由數據驅動的企業將因數據不足而遭受巨大的負面影響。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了減少繁重的工作和不良的業務影響,我們希望開發一種用於討論數據質量的通用語言和框架,以便任何人都可以以一致的預期來生產和消費數據。爲了實現這一點,我們發展了兩個主要概念:標準數據質量檢查和數據集等級定義。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"數據質量"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據質量是一個複雜的話題,有許多不同的方面值得深入研究,因此我們將把數據質量的討論侷限於我們已經取得重大進展的領域,而將其它方面留待以後討論。Uber生成和使用數據的環境對我們選擇關注哪些數據質量領域起着重要的作用。雖然其中有些也能適用於其他人,但有些不是。在Uber,數據生產者和消費者面臨的一系列常見問題是:如何在分析最新數據和完整數據之間進行權衡?如果在不同的數據中心並行運行管道,我們如何解釋不同數據中心的數據一致性?在給定的數據集上應該運行什麼語義質量檢查?我們想要選擇一組檢查,爲解釋這些問題提供一個框架。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"數據質量檢查"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"經過幾次迭代,我們得到了下面描述的5種主要類型的數據質量檢查。每個數據集都必須附帶這些檢查並配置默認的SLA:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"新鮮度:"},{"type":"text","text":" 從數據生成到數據在目標系統中達到99.9%完成度之間的時間延遲,包括完整性水印(默認設置爲39秒),因爲只優化新鮮度而不考慮完整性會導致低質量的決策。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"完整性:"},{"type":"text","text":" 目標系統中的行數與源系統中的行數之比。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"重複:"},{"type":"text","text":" 具有重複主鍵或唯一鍵的行數百分比,在原始數據表中默認爲0%重複,而在建模表中允許少量重複。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"跨數據中心的一致性:"},{"type":"text","text":" 將當前數據中心中的數據集副本與另一個數據中心中的副本進行對比時,數據丟失的百分比。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"語義檢查:"},{"type":"text","text":" 捕獲數據字段的關鍵屬性,例如null\/not-null、唯一性、不同值的百分比和值的範圍。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據集所有者可以選擇提供不同的SLA,並向消費者提供適當的文檔和解釋——例如,根據數據集的性質,人們可能想要犧牲完整性來換取新鮮度(比如流式數據集)。相似地,消費者可以選擇基於這些指標來消費數據集——基於完整性觸發器而不是簡單地基於時間觸發器來運行管道。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們正繼續研究更復雜的檢查,包括跨數據集概念的一致性,以及在上述時間維度檢查的基礎上進行異常檢測。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"數據等級"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了質量度量之外,擁有一種將數據集與業務不同等級的重要性關聯起來的方法也是必要的,這樣就很容易突出顯示最重要的數據。對於服務,我們就是通過分配“等級”(基於數據的業務重要性)實現這一點的。這些等級有助於確定停機的影響,並提供了關於哪些數據等級應用於哪些目的的指導原則。例如,如果某些數據影響合規性、收入或品牌,那麼它應該被標記爲第1級或第2級。由用戶創建的用於不太重要的臨時搜索的臨時數據,默認標記爲第5級,如果不使用則可以在一段固定時間後刪除。數據等級還確定提交的事件的級別,以及針對數據集創建的修復bug的SLA。分級的一個副產品是數據資產的系統性清單,我們依賴這些資產來做業務關鍵決策。這樣做的另一個好處是,對相似或不再作爲事實來源的數據集進行顯式去重。最後,通過分級實現的可見性有助於我們重構數據集,從而可以改進建模、數據粒度一致性和規範化級別。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們已經開發了自動化方法來爲機構生成“分級報告”,顯示需要分級的數據集、分級數據的使用情況等,作爲機構的“數據健康”的衡量指標。我們還跟蹤這些指標作爲“工程卓越性”標準。隨着越來越多的採用和反饋,我們不斷迭代具體的定義和度量方法,進一步改進它們。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"數據質量工具"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果我們不將這些定義自動化並使其易於使用和應用,僅僅擁有這些定義是不夠的。我們將多個現有的數據質量工具整合到一個實現了這些定義的工具中。如果有意義,我們可以自動生成測試(對於原始數據,即由Kafka主題轉儲到數據倉庫的數據,除了語義測試之外,我們還可以自動生成四類測試),並且通過最小化數據集所有者的輸入簡化了測試創建過程。這些標準檢查爲每個數據集提供了一個最小測試集,同時,該工具也爲生產者提供了足夠的靈活性,他們只需提供一個SQL查詢就可以創建新的測試。我們學到了許多有趣的經驗,包括如何以較低的開銷擴展這些測試,如何簡化爲數據集構建一套測試的抽象,何時調度測試以減少誤報和噪音警報,如何將這些測試應用於流式數據集,以及我們希望在以後的文章中發佈的更多內容。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Databook和元數據"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如前所述,我們有成千上萬的數據集和成千上萬的用戶。如果我們考慮其它數據資產——報表、機器學習特徵、度量、儀表盤——我們管理的資產數量會大得多。我們希望確保:a)消費者使用正確的數據做出決策,b)生產者做出明智的決策以改進數據、確定錯誤修復的優先級等。爲此,我們需要一個單一的目錄,該目錄收集所有數據資產的元數據,並根據用戶的需求向他們提供正確的信息。事實上,我們意識到,糟糕的發現曾導致生產者和消費者產生重複、冗餘數據集的惡性循環,然後這些數據集就被廢棄了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們希望向用戶提供關於每個數據工件(表、列、度量)的詳細元數據:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"基礎元數據:"},{"type":"text","text":" 例如文檔、所有權信息、管道、生成數據的源代碼、示例數據、譜系和工件層"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"使用元數據:"},{"type":"text","text":" 關於什麼人在什麼時候使用這些元數據、流行查詢和一起使用的工件的統計"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"質量元數據:"},{"type":"text","text":" 對數據進行測試,何時運行、哪些測試通過,以及數據提供的聚合SLA"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"成本元數據:"},{"type":"text","text":" 用於計算和存儲數據的資源,包括貨幣成本"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Bug和SLA:"},{"type":"text","text":" 針對工件、事件、近期預警和總體SLA提交的bug,從而響應所有者的問題"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"創建這個單一的元數據目錄並提供一個功能強大的用戶界面(具有基於上下文的搜索和發現功能),對於實現生產者和消費者之間的協作、減少使用數據的工作量以及提升總體數據質量至關重要。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了實現這個目標,我們對內部元數據目錄Databook的後端和UI進行了徹底的改進。我們對元數據詞彙表進行了標準化,使其易於向現有實體添加新的元數據屬性,設計了擴展性,以便以最小的工作量輕鬆定義新的實體類型,並將我們的大多數關鍵工具集成到該系統中,並將它們的元數據發佈到這個中心位置,把各種數據資產、工具和用戶連接起來。改進後的用戶界面更清晰,並且用戶可以更方便地過濾和縮小所需數據的範圍。經過這些改進後,工具使用量急劇增加。我們在這篇博客中詳細介紹了這些變化:"},{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/metadata-insights-databook\/?fileGuid=9DPRRvD9vRXQrPCv","title":"","type":null},"content":[{"type":"text","text":"Turning Metadata Into Insights with Databook"}]},{"type":"text","text":"。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"應用程序上下文日誌"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了瞭解和改進產品,讓我們的應用程序打印日誌來獲取實際的用戶體驗是至關重要的。我們希望測量用戶體驗,而不是推斷用戶體驗,但是每個團隊都有一個自定義的日誌打印方法,導致在如何測量用戶體驗方面存在不一致。我們希望標準化整個應用系統中各個團隊的日誌打印方式,甚至“平臺化”日誌打印,這樣開發者就可以在開發所有產品功能時免於去考慮如何通過日誌打印必需信息,例如:向用戶展示了什麼、與用戶交互時應用程序的狀態、交互類型和交互持續時間。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在深入研究了Uber用來構建應用程序的移動框架之後,我們意識到,移動應用開發框架("},{"type":"link","attrs":{"href":"https:\/\/github.com\/uber\/RIBs?fileGuid=9DPRRvD9vRXQrPCv","title":"","type":null},"content":[{"type":"text","text":"之前是開源的"}]},{"type":"text","text":")已經內置了一個天然的結構,當用戶交互時,可以提供有關應用程序狀態的關鍵信息。自動獲取"},{"type":"link","attrs":{"href":"https:\/\/github.com\/uber\/RIBs\/wiki?fileGuid=9DPRRvD9vRXQrPCv","title":"","type":null},"content":[{"type":"text","text":"RIB層次"}]},{"type":"text","text":"將使我們瞭解應用程序的狀態,以及哪個RIB(大致可以將它們當作組件)當前是活躍的。應用程序上的不同屏幕映射到不同的RIB層次。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於這種直覺,我們開發了一個庫來捕獲當前的RIB層次,將其序列化,並且自動將其附加到應用程序觸發的每個分析事件。在接收這些信息的後端網關中,我們實現了從RIB層次結構到一組靈活的元數據(例如屏幕名稱、應用程序中階段的名稱,等等)的輕量級映射。這些元數據可以獨立演化,生產者或消費者都可以添加更多信息,而不必依賴移動應用程序的更改(由於構建和發佈週期長達數週,因此更改速度慢且成本高)。在後端,網關在寫入Kafka之前,除了序列化狀態之外,還會將這些額外的元數據附加到分析事件中。網關上的這個映射也可以通過API獲得,這樣倉庫作業就可以在映射演進時回填數據。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了上面的核心問題之外,我們還必須解決一些其它問題,這裏我們將不再詳細介紹,例如:優化序列化的RIB層次結構來減少分析負載大小,使映射高效,通過自定義測試框架在應用程序更改時保持映射正確,將RIB樹正確映射到狀態,對屏幕和狀態名稱進行標準化,等等。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"雖然這個庫並沒有完全解決我們要解決的所有日誌問題,但它確實爲日誌提供了一個結構,使許多分析變得更容易,如下所述。我們正在迭代這個庫來解決提出的其它問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"騎手漏斗分析"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"使用上面的日誌框架生成的數據,我們能夠極大地簡化對騎手行爲的漏斗分析。我們在幾個小時內就建立了一個儀表盤,這在過去可能需要幾個星期的時間。這些數據目前正在爲許多實驗監控和其它儀表盤提供支持,讓我們可以瞭解用戶行爲。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"度量標準化"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當我們啓動Data180時,公司中有許多度量代碼庫。我們評估了這些解決方案的優缺點,並在一個名爲uMetric的代碼庫上進行了標準化。事實上,它不僅僅是一個代碼庫——它具有高級功能,例如讓用戶專注於YAML格式的定義,並通過爲不同的查詢系統(例如Hive\/Presto\/Spark)生成查詢、爲度量生成流式處理和批處理管道、自動創建數據質量測試等省去了大量的工作。這個系統正在得到更廣泛的採用,我們也在投資進一步加強它。我們正在自動化重複和近似重複的度量檢測,將此係統與Databook和其它數據消費界面集成,這樣消費者就可以直接消費度量結果,而不是複製和運行度量SQL(調整SQL更容易出錯及導致度量重複),改進了自助服務的性質,並在事故發生之前檢測到錯誤,等等。這種標準化幫助我們大大減少了消費時的重複和混亂。這個系統在這個博客中有詳細的描述——"},{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/umetric\/?fileGuid=9DPRRvD9vRXQrPCv","title":"","type":null},"content":[{"type":"text","text":"The Journey Towards Metric Standardization"}]},{"type":"text","text":"。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"其它工具和流程變更"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了上面列出的更改之外,我們還實施了其它幾個工具和流程更改,以改進我們的數據文化,簡要介紹如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"共享數據模型:"},{"type":"text","text":" 爲了避免重複定義相同概念的模式(這很常見),我們改進了模式定義的工具,允許導入和共享現有類型和數據模型。我們正在構建其它特性和流程,從而推動共享數據模型的採用,減少重複和近似重複的數據模型的創建。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"移動分析強制代碼評審者和單元測試:"},{"type":"text","text":" 我們重新組織了移動分析事件的模式,並允許生產者和消費者將自己添加爲強制評審者,以避免在沒有適當評審和通知的情況下進行更改。我們還構建了一個移動日誌測試框架,以確保數據測試在構建時運行。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"強制所有權:"},{"type":"text","text":" 我們改進了數據生成的底層數據工具和界面(模式定義、Kafka主題創建、創建數據的管道、度量創建、儀表盤創建,等等),以便在無法自動推斷所有者時強制提供所有權信息。所有權信息進一步標準化爲整個公司的單一服務,跟蹤團隊和組織,而不僅僅是單個創建者。這一修改可以避免新增無主數據。我們進一步運行啓發式算法,將所有者分配到“廢棄”的數據集,這些數據集沒有所有者或所有者已不在公司,這使我們有望達到100%的所有權覆蓋率。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"跨工具集成:"},{"type":"text","text":" 我們集成了工具,這樣一旦在源工具上設置了文檔、所有權和其它關鍵元數據,它就無縫地跨所有下游工具流動。我們將管道工具與標準告警和監控工具集成在一起,使服務和數據管道在生成和管理告警方面具有一致性。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"後續工作"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們從這樣一個假設開始:對數據進行全面的思考——考慮跨人員和系統的端到端數據流——可以提高整體數據質量。我們認爲,這一努力已經顯示出支持這一假設的有力證據。然而,最初的工作僅僅是我們向更好的數據文化轉型之旅的開始。在這項工作取得成功的基礎上,我們將這個程序推廣到Uber不同的組織和應用程序。項目團隊專注於分級、構建真實數據源、提高數據質量和數據SLA,而平臺團隊則繼續改進上述以及更多的工具。雙方共同努力來改進流程,在Uber建立一種強大的數據文化。舉例來說,下面是一些正在進行中的工作:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對工具進行了更多的基礎性改進,實現了更多的自動化以支持不同的數據質量檢查;實現了更多的集成以減少工作量"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"增強應用程序日誌框架,進一步捕獲更多有關用戶在應用程序上實際“看了什麼”和“做了什麼”的直觀信息"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"改進流程和工具,改善生產者和消費者之間的協作"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對數據資產實施生命週期管理,以便從系統中移除未使用和不必要的工件"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在工程師和數據科學家的日常數據開發流程中進一步應用上述原則"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們希望在未來走向更好的數據文化的過程中分享更多的經驗教訓。如果你有興趣在Uber解決具有挑戰性的數據問題,請在"},{"type":"link","attrs":{"href":"https:\/\/www.uber.com\/us\/en\/careers\/list\/?query=%22intelligence%22&fileGuid=9DPRRvD9vRXQrPCv","title":"","type":null},"content":[{"type":"text","text":"這裏"}]},{"type":"text","text":"或[這裏](https:\/\/www.uber.com\/us\/en\/careers\/list\/?query=Engineering - Data&department=Engineering&team=Data&fileGuid=9DPRRvD9vRXQrPCv)申請。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"作者介紹"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/author\/krishnap\/?fileGuid=9DPRRvD9vRXQrPCv","title":"","type":null},"content":[{"type":"text","text":"Krishna Puttaswamy"}]},{"type":"text","text":"是Uber高級工程師。他在市場團隊中處理各種數據和實驗問題。這篇博客中描述的工作,是他在應用數據改進Uber應用程序和服務時所面臨的實際問題的解決方案。他目前領導DataNG和一個重寫實驗平臺的項目。他以前在Airbnb和LinkedIn處理過數據\/機器學習問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/author\/sureshs\/?fileGuid=9DPRRvD9vRXQrPCv","title":"","type":null},"content":[{"type":"text","text":"Suresh Srinivas"}]},{"type":"text","text":"是一位主要致力於數據平臺的架構師,專注於讓用戶成功地從Uber的數據中實現價值。這篇博客中描述的工作,是這一努力的一部分。在Uber之前,他聯合創建了Hortonworks,這是一家圍繞Apache開源項目建立的公司,旨在將Hadoop生態系統引入企業。Suresh是Apache Hadoop和相關項目的長期貢獻者,也是Hadoop PMC成員之一。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"原文鏈接"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/ubers-journey-toward-better-data-culture-from-first-principles\/?fileGuid=9DPRRvD9vRXQrPCv","title":"","type":null},"content":[{"type":"text","text":"https:\/\/eng.uber.com\/ubers-journey-toward-better-data-culture-from-first-principles\/"}]}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章