Minerva -- Airbnb的大規模數據指標系統 Part 3

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://xie.infoq.cn/article/c9e946329f726b22d770cc765","title":"","type":null},"content":[{"type":"text","text":"Minerva -- Airbnb的大規模數據指標系統 Part 1","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://xie.infoq.cn/article/eaaffd7a0fe9bb700d5df0b98","title":"","type":null},"content":[{"type":"text","text":"Minerva -- Airbnb的大規模數據指標系統 Part 2","attrs":{}}]}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"簡介","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在本系列的","attrs":{}},{"type":"link","attrs":{"href":"https://xie.infoq.cn/article/c9e946329f726b22d770cc765","title":"","type":null},"content":[{"type":"text","text":"第一篇文章","attrs":{}}]},{"type":"text","text":"中,我們介紹了Minerva在改善Airbnb數據分析工作方面所起的作用。在","attrs":{}},{"type":"link","attrs":{"href":"https://xie.infoq.cn/article/eaaffd7a0fe9bb700d5df0b98","title":"","type":null},"content":[{"type":"text","text":"第二篇文章","attrs":{}}]},{"type":"text","text":"中,我們深入探討了Minerva的核心計算基礎設施,並介紹了我們如何保證數據集和團隊數據的一致性。在第三篇也是最後一篇文章中,我們將重點講述Minerva如何極大的簡化和改善用戶的數據消費體驗。具體來說,我們將展示統一指標層(我們稱之爲Minerva API)如何幫助我們構建爲具有廣泛背景和不同級別數據專業知識的用戶量身定製的多功能數據消費體驗。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"以指標爲中心的方法(A Metric-Centric 
Approach)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當用戶採用數據來探索業務問題時,通常會考慮不同的指標和維度。例如,業務負責人可能想知道長期住宿(維度)佔預訂(指標)的百分比是多少。要回答這個問題,首先要找到正確的表單(where),通過必要的聯合(joins)或過濾(filters)(how),最終聚合數據(how)以得到正確的答案。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"雖然許多傳統BI工具試圖代表用戶把這些工作抽象出來,但大多數數據服務邏輯仍然嚴重依賴用戶來確定“where”和“how”。在Airbnb,我們希望提供更好的用戶體驗——用戶只要簡單的請求獲取指標和維度,就可以直接得到答案,而不必擔心“where”或“how”的問題。我們將這一願景稱之爲“以指標爲中心的方法”,最終發現這是一個艱鉅的工程挑戰。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"挑戰一:“Where”","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在大多數傳統數據倉庫中,數據以表的形式組織。這意味着要響應某個查詢,BI工具需要將相關的指標和維度與包含相關答案的物理表關聯起來。然而,對於給定的指標和維度的組合,也許有許多數據集包含有相關答案。這些表通常具有不同程度的數據質量和正確性保證,因此選擇正確的表來服務數據並非易事。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"挑戰二:“How”","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了“where”之外,負責“how”的數據服務邏輯也有許多細微差別。首先,有不同的指標類型:由單個物理事件(例如:預定量)組成的","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"簡單指標(simple metrics)","attrs":{}},{"type":"text","text":";基於維度過濾產生的一組簡單指標組成的","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"過濾指標(filtered metrics)","attrs":{}},{"type":"text","text":"(例如:中國的預訂量);由一個或多個非派生指標組成的","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"派生指標(derived 
metrics)","attrs":{}},{"type":"text","text":"(例如:搜索到預訂的轉化率,search-to-book rate)。此外,雖然有許多指標是可加的(例如:預訂量),但也有許多指標不是:去重計數、百分位數和時間點快照等,不能簡單的通過彙總單個事件來計算。始終如一的在所有場景中正確的計算這些不同類型的指標是一個巨大的挑戰。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"挑戰三:與下游應用程序集成","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後,只有在各種上下文環境、應用程序和工具中使用數據,才能做出基於數據的決策。指標越通用、越重要,就越有可能被廣泛應用於各種場合。例如,總預訂價值(Gross Booking Value,GBV)、預訂夜數(nights booked)和收入(revenue)是Airbnb最常用的指標,被廣泛應用於跟蹤業務表現、作爲隨機控制實驗的基準比較指標,並用於比較機器學習模型。在不同用例中基於這些指標提供服務,同時爲用戶提供上下文信息從而可以以正確的方式使用它們,是我們面臨的另一個核心挑戰。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"解決方案","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們通過構建Minerva API來解決這些挑戰,這是一個指標服務層(metric-serving layer),充當上游數據模型和下游應用程序之間的接口。有了Minerva API,任何下游應用程序都能夠以一致和正確的方式消費數據,而不用知道數據存儲在哪裏,也不用知道應該如何計算指標。本質上,Minerva API通過連接“what”和“where”來充當“how”。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Minerva API","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Minerva API由API web服務、元數據獲取應用以及客戶端(與Apache 
Superset","attrs":{}},{"type":"sup","content":[{"type":"text","text":"[2]","attrs":{}}],"attrs":{}},{"type":"text","text":"、Tableau","attrs":{}},{"type":"sup","content":[{"type":"text","text":"[3]","attrs":{}}],"attrs":{}},{"type":"text","text":"、Python","attrs":{}},{"type":"sup","content":[{"type":"text","text":"[4]","attrs":{}}],"attrs":{}},{"type":"text","text":"和R","attrs":{}},{"type":"sup","content":[{"type":"text","text":"[5]","attrs":{}}],"attrs":{}},{"type":"text","text":"集成)組成。這些組件爲下游應用程序提供本地NoSQL和SQL指標查詢。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/71/71bc8503e4f0466badcc575fdbeee2f8.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":9}}],"text":"Minerva API充當消費者和底層數據集之間的接口","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"元數據獲取器:抽象“Where”","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們前面提到過,用戶只需要向Minerva提供指標和規格參數,而不需要指定“where”。當發出數據請求時,Minerva會花費大量精力來確定應該使用哪個數據集來響應該請求。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Minerva在幕後選擇最佳數據源之前需要綜合考慮多個因素,其中最重要的因素之一是數據完整性。這意味着選擇用於查詢的任何數據源都應該包含給定用戶查詢請求所需的所有列,並且必須涵蓋查詢請求所需的時間範圍。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲此,我們構建了一個名爲元數據獲取器(Metadata 
Fetcher)的服務,該服務每15分鐘定期從數據源獲取元數據,並將其緩存到MySQL數據庫中。具體來說,我們定期從S3獲取Minerva配置的最新副本(存儲在Thrift二進制文件中),從而獲取Druid中每個有效Minerva數據源的列表。對於每個數據源,我們查詢Druid代理以讀取它的名稱以及相關的指標和維度列表。此外,我們還可以從代理獲取最小日期、最大日期以及日期計數,以確定是否有任何丟失的數據。每次獲取新信息時,我們都會更新MySQL數據庫,以維護真實數據源。通過元數據獲取器,我們能夠在任何給定的時間使用最好的數據源來服務數據請求。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"數據API:抽象“How”","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"假設用戶希望瞭解2021年8月的4周時間內,除私人房間外,各目的地地區的日均價格(average daily price,ADR)下降趨勢。示例查詢的完整規格定義如下所示:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"{\n metric: 'price_per_night',\n groupby_dimension: 'destination_region',\n global_filter: 'dim_room_type!=\"private-room\"',\n aggregation_granularity: 'W-SAT',\n start_date: '2021-08-01',\n end_date: '2021-09-01',\n truncate_incomplete_leading_data: 'true',\n truncate_incomplete_trailing_data: 'true',\n}\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當Minerva接收到這樣的請求時,它不僅需要確定從哪裏獲取數據,還需要知道如何過濾、組合以及聚合數據以獲得最終的結果。它採用了一種策略,通過Split-Apply-Combine範式","attrs":{}},{"type":"sup","content":[{"type":"text","text":"[6]","attrs":{}}],"attrs":{}},{"type":"text","text":"來實現,該範式通常用於數據分析。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f2/f2e8876d3dc1e43ec8a005ac080a81da.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":9}}],"text":"對'price_per_night'指標應用Split-Apply-Combine範式","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"
text","text":"步驟一:將請求拆分爲原子指標請求","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當Minerva API接收到如上所述的查詢請求時,它所做的第一件事就是通過創建一組相關的子查詢,將任何派生指標分解爲我們稱爲Minerva“原子”指標。如果一個用戶查詢只指定一個原子的Minerva指標,那麼第一步基本上是一個空操作。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在上面的例子中,給定‘price_per_night’指標是一個比率指標(派生指標的一種特例),它包含一個分子(‘gross_booking_value_stays’)和一個分母(‘nights_booking’),Minerva API將這個請求分解爲兩個子請求。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"步驟二:採用並執行每個子查詢","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於第1步中確定的原子指標,Minerva利用S3中存儲的指標配置來推斷相關的指標表達式和元數據,從而生成子查詢。我們繼續討論這個例子:Minerva數據API查找“gross_booking_value_stays”的指標定義,發現它是一個SUM聚合,類似的,“nights_booking”指標也是如此。在這兩個請求中,通過全局過濾器' dim_room_type != \" private-room\" '用於確保私人房間不在計算範圍內。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/cc/cc7aff0ee66852e4d0cb859d672f1bd2.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":9}}],"text":"對ADR指標應用Split-Apply-Combine範式","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一旦爲每個原子指標都生成了關聯的子查詢,Minerva 
API最終將查詢發送給Druid或Presto。如果受資源限制而有必要,它會將查詢分割成幾個跨越更小時間範圍的“片”,然後將結果合併到單個數據幀中。在基於聚合粒度拼接數據幀之前,API還會丟棄任何不完整的前置或後置數據。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"步驟三:將原子指標結果合併到單個數據幀中","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一旦Minerva獲取到每個原子指標的數據幀,它會在時間戳列上連接這些數據幀,將它們組合成一個單獨的數據幀。作爲最後一步,Minerva API在以序列化JSON格式將最終結果返回給客戶端之前將執行任何必要的聚合後計算、排序和限制操作。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"總之,通過Minerva的數據源API和數據API,我們可以抽象出確定從哪裏獲取數據以及如何返回數據的過程。這個API作爲Minerva的單一抽象層,可以滿足來自下游應用程序的任何請求。然而,我們的故事並沒有就此結束:我們的許多工程挑戰都涉及到如何將不同的應用程序與這個API集成,我們將在下一節探討這些挑戰。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"數據消費體驗","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"考慮到Airbnb內部數據消費者的多樣性,我們開始構建針對不同角色和用例的工具。通過Minerva API,我們構建了廣泛的用戶界面,這些用戶界面提供了一致的數據消費體驗。正如我們在","attrs":{}},{"type":"link","attrs":{"href":"https://xie.infoq.cn/article/c9e946329f726b22d770cc765","title":"","type":null},"content":[{"type":"text","text":"第一篇文章","attrs":{}}]},{"type":"text","text":"中簡要提到的,有四個主要的集成點,每個點支持一組不同的工具和用戶:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"數據分析(Data Analysis):","attrs":{}},{"type":"text","text":"與Python和R集成,主要用於高級數據分析","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"數據探索(Data 
Exploration):","attrs":{}},{"type":"text","text":"與BI工具(如Superset、Metric Explorer","attrs":{}},{"type":"sup","content":[{"type":"text","text":"[7]","attrs":{}}],"attrs":{}},{"type":"text","text":"和Tableau)的集成,爲精通數據的分析師量身定製,以幫助商業洞察","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"報告(Reporting):","attrs":{}},{"type":"text","text":"與XRF(eXecutive Reporting Framework,執行報告框架)","attrs":{}},{"type":"sup","content":[{"type":"text","text":"[8]","attrs":{}}],"attrs":{}},{"type":"text","text":"集成,爲希望瞭解當前業務狀態的管理層量身定製","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"實驗(Experimentation):","attrs":{}},{"type":"text","text":"與ERF(Experimentation Reporting Framework,實驗報告框架)集成,專爲在Airbnb進行A/B測試的數據科學家、工程師或產品經理量身定製","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當我們構建這些特性時,我們總是在一致性、靈活性和可訪問性之間進行權衡。例如,Metric Explorer主要是爲非數據專家的非技術用戶構建的,這意味需要爲它優化一致性和可訪問性,而不是靈活性。Metric Explorer有嚴格的執行保護,防止用戶做錯誤的事情,並且幾乎沒有機會偏離確定的道路。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"作爲另一個極端,通常受數據科學家青睞的R和Python客戶端要靈活得多。用戶可以完全控制如何利用客戶端API來執行定製分析或可視化。在接下來的幾節中,我們將介紹這些消費體驗是如何被創建的。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"與Metric Explorer集成","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Metric 
Explorer由Airbnb創建,任何人(無論他們的數據專業水平如何)都可以利用數據做出明智的決策。由於其面向廣泛的目標用戶,Metric Explorer優化了可訪問性和數據一致性,而不是靈活性。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d4/d42ca5a090ac1d11f1d14bf01a3fae53.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":9}}],"text":"Metric Explorer對於想要回答高級業務問題的非技術用戶來說非常適合","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所有Metric Explorer的指標、維度和相關元數據都來自Minerva的指標存儲庫,並被注入到Elasticsearch","attrs":{}},{"type":"sup","content":[{"type":"text","text":"[9]","attrs":{}}],"attrs":{}},{"type":"text","text":"中。在用戶對數據執行任何操作之前,這些元數據作爲上下文方便的顯示在右側欄上。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當用戶選擇執行Group By和Filter之類的數據操作時,Metrics Explorer按等級順序顯示維度,這樣只有很少或沒有業務上下文的用戶可以輕鬆的挖掘信息,而不需要提前知道維度值(如上所示)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當用戶對數據進行切片時,Minerva API會自動確定哪個組合是有效的,並且只會對有效的數據組合進行切割。在這種體驗中,用戶不需要知道任何有關所涉及指標來源的底層物理表的信息。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"與Apache Superset集成","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"雖然Metrics Explorer提供了有關參數的高級信息,但更有探索精神的用戶可以在Superset中進行更多操作。Apache 
Superset","attrs":{}},{"type":"sup","content":[{"type":"text","text":"[10]","attrs":{}}],"attrs":{}},{"type":"text","text":"是Airbnb自助BI解決方案的核心工具。考慮到Superset在公司內的廣泛應用,我們知道需要提供類SQL的功能,從而與Superset進行集成,以便Minerva能夠被廣泛採用。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"用於Apache Superset和Tableau等BI工具的客戶端接口要複雜得多,很多應用可以選擇直接通過RESTful接口調用Minerva API。這些BI工具通常使用SQL(通過客戶端),而不是HTTP請求進行訪問。這意味着Minerva API需要支持類SQL接口,該接口需要遵循OLAP","attrs":{}},{"type":"sup","content":[{"type":"text","text":"[11]","attrs":{}}],"attrs":{}},{"type":"text","text":"查詢結構。爲了構建這樣一個接口,我們利用sqlparse","attrs":{}},{"type":"sup","content":[{"type":"text","text":"[12]","attrs":{}}],"attrs":{}},{"type":"text","text":"在Minerva API中添加了一個SQL解析器,用於將SQL語句解析爲AST,然後對其進行驗證並將其轉換爲本地HTTP請求。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"遵循DRY原則,我們複用Apache Calcite Avatica","attrs":{}},{"type":"sup","content":[{"type":"text","text":"[13]","attrs":{}}],"attrs":{}},{"type":"text","text":"定義了客戶機和服務器之間的通用數據庫連接API。Minerva API充當Avatica HTTP服務器,客戶端要麼是基於SQLAlchemy","attrs":{}},{"type":"sup","content":[{"type":"text","text":"[14]","attrs":{}}],"attrs":{}},{"type":"text","text":"定製的Python Database 
API","attrs":{}},{"type":"sup","content":[{"type":"text","text":"[15]","attrs":{}}],"attrs":{}},{"type":"text","text":"數據庫驅動程序,要麼是Avatica提供的JDBC連接器(Tableau)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"傳統BI工具在工具內部實現自定義業務邏輯,而Minerva通過類SQL的AGG指標表達式來整合這些邏輯。下表中,我們比較了在傳統BI工具和Superset工具中運行的查詢:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/aa/aa43b6e666fde0925058e6389d785880.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在左邊的查詢中,用戶不需要指定指標應該從哪裏計算,也不需要指定正確的聚合函數——這些細節都被Minerva抽象了。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後,Minerva中有12,000個指標和5,000個維度,但並不是所有的指標-維度組合都是有效的。例如,活躍房源(active listings)可以按房東所在的位置來切割,但不能按客人的出發位置來切割(因爲每次預訂的客人屬性可能不同)。我們在圖表控件中添加了事件監聽器,以確保左側視窗中只顯示符合條件的指標和維度的組合。這種設計有助於減少認知負載,簡化數據挖掘過程。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/cc/cc101bc6fc69b6c141322e73b0ae047f.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":9}}],"text":"Superset是以指標爲中心的,用戶可以從單個虛擬源查詢所有指標和維度","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"
text","text":"與XRF(eXecutive Reporting Framework)集成","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如","attrs":{}},{"type":"link","attrs":{"href":"https://xie.infoq.cn/article/c9e946329f726b22d770cc765","title":"","type":null},"content":[{"type":"text","text":"第一篇","attrs":{}}]},{"type":"text","text":"所述,XRF是一個框架,用於生成由執行人員和領導團隊使用的簡潔、高保真的業務關鍵報告。這個框架是通過Minerva的配置來配置的,並且完全由Minerva API提供支持。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b2/b28a379881e0aff349fb2f430260e955.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":9}}],"text":"XRF自動化了大量重複的手工工作,並允許我們標準化高保真的業務關鍵報告","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"要管理XRF報告,用戶首先需要定義報告配置,並指定所需的業務指標、維度切片以及需要應用的全局篩選器。此外,用戶還可以配置其他控制行爲,比如某個指標是否需要執行聚合(如MTD、QTD或YTD)操作,以及爲基於時間的比較增長率(如YoY、MoM或WoW)指定合適的單位。一旦指定了這些設置,Minerva API就會執行必要的聚合操作以及生成最終的報告。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"XRF輸出的數據可以通過自定義GoogleSheetHook渲染在Google表格中,也可以通過Presto連接到Tableau中。通過利用Minerva及其聚合邏輯中的指標定義,我們在用戶選擇的表示層中強制執行一致性保障。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"與ERF(Experimentation Reporting 
Framework)集成","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"與分析或報告用例不同,實驗用例比較特殊,用於報告的指標只是一個起點。爲了做出正確的因果推論,在將指標轉換爲可用於有效統計比較的彙總統計數據之前,必須將指標與實驗分配的數據結合起來。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通常,Minerva向ERF提供“原始事件”。根據隨機單元和分析單元,我們使用不同的主題鍵將Minerva數據加入到分配日誌中,以便每個事件都有相關的主題,以及與之相關的實驗組。最後計算並彙總統計信息(如平均值、百分比變化和p值)並顯示在ERF記分卡中。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/13/135d8c492ccbd0b2feb965873b13a383.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":9}}],"text":"顯示實驗統計摘要的ERF記分卡","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"實驗UI還會直接顯示相關的Minerva元數據,用戶還可以查看Minerva事件的描述和所有權信息。一個帶有ETA信息的譜系視圖允許用戶跟蹤ERF指標進展","attrs":{}},{"type":"sup","content":[{"type":"text","text":"[16]","attrs":{}}],"attrs":{}},{"type":"text","text":",並幫助他們在出現延遲的情況下聯繫相關的Minerva指標所有者。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e0/e015f6736997acafd4989aa76593c53c.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","mar
ks":[{"type":"size","attrs":{"size":9}}],"text":"ERF顯示指標元數據,鏈接到SLA Tracker[17]從而可視化數據譜系和時間線","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"總之,Minerva及其多種集成工具幫助用戶能夠在他們的計劃報告中輕鬆跟蹤指標,測量實驗產生的變化,並探索意外的變化——所有這些都讓用戶相信數據是正確和一致的,這種信心極大的減少了獲取洞察所花費的時間,增加了對數據的信任,並有助於支持數據驅動的決策。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"尾聲","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Minerva引入了一種思考數據的新方法,不僅意味着以業務和指標爲中心的用戶接口,還需要我們調整傳統BI工具(主要使用SQL)來適應Minerva API的接口。在某種意義上,這類似於將一個新的方釘(Minerva)插入一個現有的圓孔(BI Tools)中。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"隨着越來越多的組織接受類似Minerva的指標層的理念,我們相信將會有一系列新的挑戰等着我們。也就是說,一些開創性的工作肯定會把分析帶到新的水平,我們爲能夠爲這一領域做出創新工作而感到自豪,我們也希望會有更多公司跟我們一樣在這一領域做出貢獻。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"感謝","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"感謝每一個爲這篇博文所介紹的工作和成果做出貢獻的人","attrs":{}},{"type":"sup","content":[{"type":"text","text":"[18]","attrs":{}}],"attrs":{}},{"type":"text","text":"。除了之前的致謝,我們還想感謝那些與我們合作,在工作中採用Minerva的人。","attrs":{}}]},{"type":"horizontalrule","attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所有商標都是各自所有者的財產,相關資料的使用僅用於身份識別的目的,並不意味着贊助或背書。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"
,"attrs":{}}],"text":"Reference:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[1] ","attrs":{}},{"type":"link","attrs":{"href":"https://medium.com/airbnb-engineering/how-airbnb-enables-consistent-data-consumption-at-scale-1c0b6a8b9206","title":"","type":null},"content":[{"type":"text","text":"https://medium.com/airbnb-engineering/how-airbnb-enables-consistent-data-consumption-at-scale-1c0b6a8b9206","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[2] ","attrs":{}},{"type":"link","attrs":{"href":"https://superset.apache.org/","title":"","type":null},"content":[{"type":"text","text":"https://superset.apache.org/","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[3] ","attrs":{}},{"type":"link","attrs":{"href":"https://www.tableau.com/","title":"","type":null},"content":[{"type":"text","text":"https://www.tableau.com/","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[4] ","attrs":{}},{"type":"link","attrs":{"href":"https://www.python.org/","title":"","type":null},"content":[{"type":"text","text":"https://www.python.org/","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[5] ","attrs":{}},{"type":"link","attrs":{"href":"https://www.r-project.org/","title":"","type":null},"content":[{"type":"text","text":"https://www.r-project.org/","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[6] 
","attrs":{}},{"type":"link","attrs":{"href":"https://www.jstatsoft.org/article/view/v040i01","title":"","type":null},"content":[{"type":"text","text":"https://www.jstatsoft.org/article/view/v040i01","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[7] ","attrs":{}},{"type":"link","attrs":{"href":"https://medium.com/airbnb-engineering/supercharging-apache-superset-b1a2393278bd#c576","title":"","type":null},"content":[{"type":"text","text":"https://medium.com/airbnb-engineering/supercharging-apache-superset-b1a2393278bd#c576","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[8] ","attrs":{}},{"type":"link","attrs":{"href":"https://medium.com/airbnb-engineering/how-airbnb-achieved-metric-consistency-at-scale-f23cc53dea70#efb9","title":"","type":null},"content":[{"type":"text","text":"https://medium.com/airbnb-engineering/how-airbnb-achieved-metric-consistency-at-scale-f23cc53dea70#efb9","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[9] ","attrs":{}},{"type":"link","attrs":{"href":"https://www.elastic.co/elasticsearch/","title":"","type":null},"content":[{"type":"text","text":"https://www.elastic.co/elasticsearch/","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[10] ","attrs":{}},{"type":"link","attrs":{"href":"https://medium.com/airbnb-engineering/supercharging-apache-superset-b1a2393278bd","title":"","type":null},"content":[{"type":"text","text":"https://medium.com/airbnb-engineering/supercharging-apache-superset-b1a2393278bd","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[11] 
","attrs":{}},{"type":"link","attrs":{"href":"https://en.wikipedia.org/wiki/Online_analytical_processing","title":"","type":null},"content":[{"type":"text","text":"https://en.wikipedia.org/wiki/Online_analytical_processing","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[12] ","attrs":{}},{"type":"link","attrs":{"href":"https://pypi.org/project/sqlparse/v","title":"","type":null},"content":[{"type":"text","text":"https://pypi.org/project/sqlparse/v","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[13] ","attrs":{}},{"type":"link","attrs":{"href":"https://calcite.apache.org/avatica/","title":"","type":null},"content":[{"type":"text","text":"https://calcite.apache.org/avatica/","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[14] ","attrs":{}},{"type":"link","attrs":{"href":"https://www.sqlalchemy.org/","title":"","type":null},"content":[{"type":"text","text":"https://www.sqlalchemy.org/","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[15] ","attrs":{}},{"type":"link","attrs":{"href":"https://www.python.org/dev/peps/pep-0249/","title":"","type":null},"content":[{"type":"text","text":"https://www.python.org/dev/peps/pep-0249/","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[16] 
","attrs":{}},{"type":"link","attrs":{"href":"https://medium.com/airbnb-engineering/visualizing-data-timeliness-at-airbnb-ee638fdf4710","title":"","type":null},"content":[{"type":"text","text":"https://medium.com/airbnb-engineering/visualizing-data-timeliness-at-airbnb-ee638fdf4710","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[17] ","attrs":{}},{"type":"link","attrs":{"href":"https://medium.com/airbnb-engineering/visualizing-data-timeliness-at-airbnb-ee638fdf4710","title":"","type":null},"content":[{"type":"text","text":"https://medium.com/airbnb-engineering/visualizing-data-timeliness-at-airbnb-ee638fdf4710","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[18] ","attrs":{}},{"type":"link","attrs":{"href":"https://medium.com/airbnb-engineering/how-airbnb-achieved-metric-consistency-at-scale-f23cc53dea70#8a0a","title":"","type":null},"content":[{"type":"text","text":"https://medium.com/airbnb-engineering/how-airbnb-achieved-metric-consistency-at-scale-f23cc53dea70#8a0a","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"你好,我是俞凡,在Motorola做過研發,現在在Mavenir做技術總監,對通信、網絡、後端架構、雲原生、DevOps、CICD、區塊鏈、AI等技術始終保持着濃厚的興趣,平時喜歡閱讀、思考,相信持續學習、終身成長,歡迎一起交流學習。微信公衆號:DeepNoMind","attrs":{}}]}],"attrs":{}}]}