Feature Stores: A Hierarchy of Needs

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Preface"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This is part 1 of the "},{"type":"link","attrs":{"href":"https:\/\/tech.ipalfish.com\/blog\/2021\/05\/31\/mlsys-we-love\/","title":null,"type":null},"content":[{"type":"text","text":"'A Survey of ML Engineering Practices'"}]},{"type":"text","text":" series. It is a translation of "},{"type":"link","attrs":{"href":"https:\/\/twitter.com\/eugeneyan","title":null,"type":null},"content":[{"type":"text","text":"Eugene Yan"}]},{"type":"text","text":"'s blog post "},{"type":"link","attrs":{"href":"https:\/\/eugeneyan.com\/writing\/feature-stores\/","title":null,"type":null},"content":[{"type":"text","text":"Feature Stores - A Hierarchy of Needs"}]},{"type":"text","text":" [1]."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"While building the feature store at Palfish, I recently read many write-ups on feature stores in practice, yet I kept feeling that I was missing the forest for the trees: every company is at a different stage of ML engineering, so their solutions emphasize different things and their architectures differ widely. As my former colleague Chang She aptly put it in a 2019 article, we lack a "},{"type":"text","marks":[{"type":"strong"}],"text":"systematic"},{"type":"text","text":" framework for thinking about feature stores. [2]"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Fortunately, Eugene's post offers exactly such a framework and applies it to the feature stores in use today. With Eugene's permission, I translated the full post for Chinese-speaking readers. The translation follows."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Feature stores are hot right now. In December 2020, AWS "},{"type":"link","attrs":{"href":"https:\/\/aws.amazon.com\/about-aws\/whats-new\/2020\/12\/introducing-amazon-sagemaker-feature-store\/","title":null,"type":null},"content":[{"type":"text","text":"announced"}]},{"type":"text","text":" the SageMaker Feature Store. Last month, the big data platform Splice Machine also "},{"type":"link","attrs":{"href":"https:\/\/splicemachine.com\/press-releases\/splice-machine-launches-the-splice-machine-feature-store-to-simplify-feature-engineering-and-democratize-machine-learning\/","title":null,"type":null},"content":[{"type":"text","text":"launched"}]},{"type":"text","text":" a feature store. Quoting a co-founder of Tecton.ai, Datanami called 2021 "},{"type":"link","attrs":{"href":"https:\/\/www.datanami.com\/2021\/01\/19\/2021-the-year-of-the-feature-store\/","title":null,"type":null},"content":[{"type":"text","text":"the year of the feature store"}]},{"type":"text","text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In our experience, managing features is one of the biggest bottlenecks in taking machine learning to production. - Uber"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Features and labels are the inputs to a machine learning model."},{"type":"text","text":" In regression, the label is the dependent variable and the features are the independent variables. In a table, the label is the column we want to predict, and the features are all columns other than the ID 
column."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"People understand '"},{"type":"text","marks":[{"type":"strong"}],"text":"what a feature store is"},{"type":"text","text":"' in many different ways. Some define it simply as 'a central place to store features'. Others say a feature store lets you 'create a feature once and use it everywhere' or 'deploy models 100x faster'. The answers vary so widely because "},{"type":"text","marks":[{"type":"strong"}],"text":"everyone wants a feature store to do something different"},{"type":"text","text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I studied "},{"type":"link","attrs":{"href":"https:\/\/github.com\/eugeneyan\/applied-ml#feature-stores","title":null,"type":null},"content":[{"type":"text","text":"many industry write-ups"}]},{"type":"text","text":" to understand which problems feature stores solve in different settings. Inspired by the psychologist Maslow, I found that a feature store's capabilities map onto several levels of need. I call this the 'feature store hierarchy of needs'. I will walk through the needs level by level and discuss what industry feature stores do to meet each one."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"The feature store hierarchy of needs"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Maslow's hierarchy of needs is a theory of motivation in psychology that describes five levels of human needs, arranged as a pyramid. According to the theory, people first satisfy the largest, most basic needs (the base of the pyramid) before turning to higher-level needs (the top)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"About Maslow's hierarchy of needs"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/d2\/e3\/d23ebfdf6yy55d03a319e6262b5c68e3.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number"
:0,"align":"center","origin":null},"content":[{"type":"text","text":"(Figure: Maslow's pyramid of needs, with the basic needs at the bottom. "},{"type":"link","attrs":{"href":"https:\/\/en.wikipedia.org\/wiki\/Maslow's_hierarchy_of_needs","title":null,"type":null},"content":[{"type":"text","text":"Source"}]},{"type":"text","text":")"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Physiological needs:"},{"type":"text","text":" What is required to survive, such as air, water, food, and shelter. If these are not met, the human body cannot function properly. Only once these most important needs are met do other needs come into consideration."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Safety needs:"},{"type":"text","text":" Once physiological needs are met, people need security, including personal safety, health, financial security, employment, law and order, and social stability. These needs are usually met by family, society, and government."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Social needs:"},{"type":"text","text":" A sense of being recognized and accepted by social groups. Such groups include co-workers, religious communities, professional organizations, sports clubs, and online communities, as well as family, friends, and mentors."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Esteem needs:"},{"type":"text","text":" Based on competence and achievement. The lower form of esteem comes from others, including their respect and recognition and one's reputation among them. The higher form comes from oneself, including self-respect and an inner sense of accomplishment."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Self-actualization needs:"},{"type":"text","text":" The highest level: realizing one's full potential. What this means differs from person to person. Some want to be ideal parents, while others strongly want to succeed financially, academically, or athletically. People express this need through invention, art, writing, and so on."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":nu
ll}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Likewise, a feature store first has to meet the most essential and urgent needs (feature access and feature serving) before higher-level needs come into play."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/db\/dbad811f20b0686a47c099e808965b8d.jpeg","alt":"pyramid","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"(Figure: the feature store hierarchy of needs)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"At the bottom is the need for access."},{"type":"text","text":" This level covers access to features, transparency of feature transformation logic, and traceable feature lineage. Together these let features be discovered, shared, and reused, which reduces duplicated work."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ML engineers used to spend 60% of their machine learning development time writing feature transformation logic. - Airbnb"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Next is the need for serving."},{"type":"text","text":" The core need at this level is high-throughput, low-latency feature reads for online services, without querying the data warehouse through SQL. Other needs include integration with the existing offline feature store, so that features can be synced from the offline store to an online store (such as Redis), and real-time feature transformation."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Typically, data engineers re-implement the features that data scientists built as feature pipelines that can run in production. This re-implementation can delay a project by months and makes cross-team collaboration extremely complicated. - 
GoJek"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Once the two bottom levels are met, we turn to the need for integrity."},{"type":"text","text":" The most common need is minimizing train-serve skew, so that features are consistent between the training and serving environments. Another common need is point-in-time correctness (also called time-travel), which ensures that historical features and labels used for training and evaluation contain no data leaks."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Training is usually offline, while serving is usually real-time. Keeping data consistent between the training and serving environments is critical. - Uber"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Above that is the need for convenience."},{"type":"text","text":" A feature store has to be simple and easy to pick up, for example with simple and intuitive interfaces that are easy to interact with and easy to debug, before people will adopt it and benefit from it."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Remember, we are a platform team. We build tools and hand them to our users so that they can help themselves. - Uber"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Last is the need for autopilot,"},{"type":"text","text":" which includes automatic feature backfilling and monitoring and alerting on feature distributions. I know some companies work at this level, but I have not come across much written about it."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Feature backfilling is the main bottleneck when iterating on training sets. Solving it greatly speeds up data scientists' workflows. - 
Airbnb"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Not every team needs all five levels;"},{"type":"text","text":" most teams already benefit a great deal from meeting levels one and two and part of level three. Teams also differ in how far they need to go at each level. A team with few online use cases naturally cares less about feature serving than DoorDash, which handles millions of requests per second. And if models and features are refreshed several times a day, point-in-time correctness matters less."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Having gone through the needs level by level, let's look at how different companies meet them."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Borrowing from 'The Innovator's Dilemma'"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Access: deduplication and reuse"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When features are hard to access, the following happens:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Different teams implement the same feature over and over, so a single feature can end up with as many as 10 
versions."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Several near-identical feature pipelines are deployed, wasting compute and storage."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because a feature exists in several versions, different models use different versions, and consistent results are hard to obtain."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Iteration slows down."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Features expressing the same business concept are developed again and again by multiple teams, and existing work cannot be reused. - GoJek"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To address this, "},{"type":"link","attrs":{"href":"https:\/\/www.gojek.io\/blog\/feast-bridging-ml-models-and-data","title":null,"type":null},"content":[{"type":"text","text":"GoJek built Feast"}]},{"type":"text","text":" as the interface through which data engineers, data scientists, and ML engineers collaborate. Data engineers and data scientists create features and contribute them to the feature store. ML engineers then consume those features without having to build them."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(I find this approach hard to judge, because I believe data scientists should be "},{"type":"link","attrs":{"href":"https:\/\/eugeneyan.com\/writing\/end-to-end-data-science\/","title":null,"type":null},"content":[{"type":"text","text":"end-to-end"}]},{"type":"text","text":". That said, GoJek's approach may reflect its org structure: its data engineering team is mostly in India, while its ML team is mostly in Singapore. Feast serves as the interface between the teams.)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Uber took a similar approach and built the "},{"type":"link","attrs":{"href":"https:\/\/www.infoq.com\/presentations\/michelangelo-palette-uber\/","title":null,"type":null},"content":[{"type":"text","text":"Palette feature store"}]},{"type":"text","text":" to encourage teams to share and reuse the features in Palette. This minimizes duplicated work, makes machine learning results more consistent, and speeds up machine learning development."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Discoverability makes features easy to find and use. I have not seen much discussion of discoverability in the feature store context; I suspect it looks a lot like the "},{"type":"link","attrs":{"href":"https:\/\/eugeneyan.com\/writing\/data-discovery-platforms\/","title":null,"type":null},"content":[{"type":"text","text":"open-source data discovery platforms"}]},{"type":"text","text":" I wrote about earlier."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At this level, a feature store is essentially a store holding many features, not much different from a data warehouse. What sets the two apart is that a feature store also meets the next level of need: serving."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Serving: using features in real time"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We usually train models offline on batch data, yet online model serving needs to read those features in real time. This stumps many teams: how do we serve these features to online models with high throughput and low latency?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"While developing models, we found that many features used for training were not available in production. - Monzo Bank"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Monzo Bank could get features from its offline analytics environment (used for model training) but not from its production environment (used for model serving). Monzo Bank 
adopted a "},{"type":"link","attrs":{"href":"https:\/\/nlathia.github.io\/2020\/12\/Building-a-feature-store.html","title":null,"type":null},"content":[{"type":"text","text":"lightweight solution"}]},{"type":"text","text":" that syncs features from the offline analytics store (BigQuery) to the online store (Cassandra)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, tags are added to the SQL table-creation statements in the offline analytics environment. These tables are refreshed hourly or daily."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A Go service in the feature store validates the schema of each feature table, for example that the required "},{"type":"codeinline","content":[{"type":"text","text":"subject_type"}]},{"type":"text","text":" and "},{"type":"codeinline","content":[{"type":"text","text":"subject_id"}]},{"type":"text","text":" columns exist."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A cron job watches for updates to the feature tables and syncs the changes from BigQuery to Cassandra, staged through Google Cloud Storage."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Uber's Palette uses a similar dual-store design. The offline store (Hive) holds feature snapshots for training. The online store (Cassandra) serves the same features in real time. Features are generated by Flink and written to Cassandra. The two stores are kept in sync: features added to Hive are copied to Cassandra, and features added to Cassandra are ETL-ed to 
Hive."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/fb\/fb8caa62ad03d49fe0e25ea434e5db97.jpeg","alt":"uber-dual-store","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"(Figure: creating batch features (left) and real-time features (right), and syncing them between stores. "},{"type":"link","attrs":{"href":"https:\/\/www.infoq.com\/presentations\/michelangelo-palette-uber\/","title":null,"type":null},"content":[{"type":"text","text":"Source"}]},{"type":"text","text":")"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DoorDash built a "},{"type":"link","attrs":{"href":"https:\/\/doordash.engineering\/2020\/11\/19\/building-a-gigascale-ml-feature-store-with-redis\/","title":null,"type":null},"content":[{"type":"text","text":"gigascale feature store"}]},{"type":"text","text":" that takes feature serving to the extreme, meeting the following requirements:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hold billions of feature records in durable, scalable storage. DoorDash has entities in the millions and features in the billions."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Handle millions of queries per second (QPS). The feature store has several use cases, including restaurant ranking, which uses a large number of features and makes more than one million predictions per second. Overall, the feature store serves more than ten million QPS."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Support fast once-a-day batch refreshes for non-real-time features, and continuous updates throughout the day for real-time features (for example, a restaurant's average delivery time over the past 20 
minutes)."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After evaluating Redis, Cassandra, CockroachDB, ScyllaDB, and YugabyteDB, DoorDash chose Redis. This "},{"type":"link","attrs":{"href":"https:\/\/doordash.engineering\/2020\/11\/19\/building-a-gigascale-ml-feature-store-with-redis\/","title":null,"type":null},"content":[{"type":"text","text":"excellent write-up"}]},{"type":"text","text":" covers DoorDash's evaluation and the Redis optimizations that followed."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Another approach is to compute features in real time. For example, Alibaba's feature serving platform computes statistics over user behavior (clicks, likes, purchases, and so on) in real time for its 'Guess You Like' "},{"type":"link","attrs":{"href":"https:\/\/102.alibaba.com\/detail?id=183","title":null,"type":null},"content":[{"type":"text","text":"real-time recommendations"}]},{"type":"text","text":". "},{"type":"link","attrs":{"href":"https:\/\/eugeneyan.com\/writing\/real-time-recommendations\/","title":null,"type":null},"content":[{"type":"text","text":"This post"}]},{"type":"text","text":" covers real-time recommendations in more depth."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Integrity: creating correct online and offline features"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With serving needs met, we turn to integrity. The main pain points at this level are:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It is hard to create point-in-time correct features that mimic the production environment. Getting this wrong leads to data 
leaks."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Features differ between the training and serving environments, so models underperform once they go live."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To address the first pain point, Netflix built "},{"type":"link","attrs":{"href":"https:\/\/netflixtechblog.com\/distributed-time-travel-for-feature-generation-389cccdd3907","title":null,"type":null},"content":[{"type":"text","text":"distributed time-travel"}]},{"type":"text","text":". It takes snapshots of offline and online data, capturing context such as member type, device, and time of day."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/59\/592c6c14270f1e80f86dec2b9388c329.jpeg","alt":"netflix-snapshots","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"(Figure: creating snapshots from offline and online microservices. "},{"type":"link","attrs":{"href":"https:\/\/netflixtechblog.com\/distributed-time-travel-for-feature-generation-389cccdd3907","title":null,"type":null},"content":[{"type":"text","text":"Source"}]},{"type":"text","text":")"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"However, snapshotting every context is expensive. Netflix therefore takes stratified samples over offline attributes such as viewing patterns, device type, time spent on device, and region, so that the samples represent the distribution of the data used for model training and evaluation. Sampling runs on Spark, and the snapshots are stored in S3."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Netflix also snapshots online features. The data comes from hundreds of microservices and includes viewing history, personalized watch lists, and predicted ratings. Spark fetches it in parallel through "},{"type":"link","attrs":{"href":"https:\/\/medium.com\/@Netflix_Techblog\/prana-a-sidecar-for-your-netflix-paas-based-applications-and-services-258a5790a015","title":null,"type":null},"content":[{"type":"text","text":"Prana"}]},{"type":"text","text":", builds the snapshots, and stores them on S3 in Parquet format."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To address the second pain point, train-serve skew, GoJek built its data processing pipelines on Apache Beam. They consume data from batch and streaming sources (such as BigQuery and Kafka), ingest it into offline and online stores (such as BigQuery and Redis), and expose a unified interface for reading historical and real-time data. This avoids the train-serve skew that creeps in when feature pipelines are rewritten for production."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/cf\/cf354f1555d6511c807b2d4b9adc6f92.jpeg","alt":"gojek-feast","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"(Figure: GoJek's feature ingestion built on Apache Beam. "},{"type":"link","attrs":{"href":"https:\/\/www.gojek.io\/blog\/feast-bridging-ml-models-and-data","title":null,"type":null},"content":[{"type":"text","text":"Source"}]},{"type":"text","text":")"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Netflix 
takes a different route to the same pain point: shared feature encoders. Although the offline (Spark) and online environments run different feature generation pipelines, the pipelines share the same feature encoders (that is, the same classes, libraries, and data formats). This likewise keeps feature generation consistent between training and serving."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/02\/029a040e4e54ae5357d88e87a094aa45.jpeg","alt":"netflix-shared-encoders","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"(Figure: Netflix's offline and online feature generation share the same encoders. "},{"type":"link","attrs":{"href":"https:\/\/databricks.com\/session\/fact-store-scale-for-netflix-recommendations","title":null,"type":null},"content":[{"type":"text","text":"Source"}]},{"type":"text","text":")"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We saw earlier how Uber keeps its offline (Hive) and online (Cassandra) feature stores in sync. A new feature in either store is copied to the other, which keeps data consistent between the training and serving environments."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ensuring feature integrity also requires monitoring that answers the following questions:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When was the feature last updated?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Is the schema 
correct? Has the data distribution drifted?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Does feature serving meet its throughput and latency requirements?"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Airbnb Zipline UI shows data scientists feature distributions, correlations between features and labels, and a clustering analysis (it is unclear what the clustering is based on). Similarly, Uber's "},{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/monitoring-data-quality-at-scale\/","title":null,"type":null},"content":[{"type":"text","text":"Data Quality Monitor"}]},{"type":"text","text":" shows users daily data quality scores and anomaly alerts, produced as follows:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, collect metrics on each feature, such as the mean, median, maximum, and minimum of numeric features, and the number of unique values and missing values of categorical features."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next, build a multi-dimensional time series from those metrics and use principal component analysis (PCA) to select the principal components to keep."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Finally, build a time series on the principal components. If the current measurement does not match the forecast made at the previous step, flag the feature as anomalous."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b4\/b4a421d3c894645897631ef3ad44adbd.jpeg","alt":"uber-dqm","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"(Figure: data quality over time, including when incidents occurred. "},{"typ
e":"link","attrs":{"href":"https:\/\/eng.uber.com\/monitoring-data-quality-at-scale\/","title":null,"type":null},"content":[{"type":"text","text":"Source"}]},{"type":"text","text":")"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Convenience: keep it simple"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There is not much discussion yet about convenience in feature stores. But ease of use clearly matters for tools and platforms (think of PyTorch versus TensorFlow)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The best example I found comes from GoJek. GoJek provides consistent Python, Java, and Go SDKs, so users can call "},{"type":"codeinline","content":[{"type":"text","text":"get_batch_features()"}]},{"type":"text","text":" and "},{"type":"codeinline","content":[{"type":"text","text":"get_online_features()"}]},{"type":"text","text":" in nearly the same way across languages, which simplifies reading features from the offline and online stores."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"
# Names of the features to read from the feature store
customer_features = ['credit_score', 'balance', 'total_purchases', 'last_active']

# Training: read historical features from the offline store
# (feast, ml and customer_ids are placeholders in this pseudo code)
historical_features_df = feast.get_historical_features(customer_ids, customer_features)
model = ml.fit(historical_features_df)  # pseudo code

# Serving: read the same features from the online store
online_features = feast.get_online_features(customer_ids, customer_features)
prediction = model.predict(online_features)
"}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Netflix also built a simple interface that lets data scientists easily create point-in-time correct features and labels. The example below shows how to read a snapshot of viewing history for the movie "},{"type":"link","attrs":{"href":"http:\/\/outatimemovie.com\/","title":null,"type":null},"content":[{"type":"text","text":"OUTATIME"}]},{"type":"text","text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"
// Read the viewing-history snapshot for one context as of the given timestamp
val snapshot = new SnapshotDataManager(sqlContext)
  .withTimestamp(1445470140000L)
  .withContextID(OUTATIME)
  .getViewingHistory
"}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Given such snapshots, users only need to provide the following to time-travel and create features for training and evaluation:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Contexts: where, when, and how the model is used, for example country, device, member profile, movie, and time. Time is especially important."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Items: what is to be scored or ranked, for example movies, recommendation lists, and search terms."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Labels: the targets for supervised learning, for example clicks, plays, and minutes watched. Unsupervised learning does not need them."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Feature encoders: how contexts and items are "},{"type":"link","attrs":{"href":"https:\/\/developers.google.com\/machine-learning\/crash-course\/feature-crosses\/encoding-nonlinearity","title":null,"type":null},"content":[{"type":"text","text":"combined"}]},{"type":"text","text":" to create features, for example country-movie and user ID-movie."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Uber Palette's DSL (with code examples)"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Uber 
also "},{"type":"link","attrs":{"href":"https:\/\/www.infoq.com\/presentations\/michelangelo-palette-uber\/","title":null,"type":null},"content":[{"type":"text","text":"describes in detail"}]},{"type":"text","text":" how they extend "},{"type":"link","attrs":{"href":"https:\/\/spark.apache.org\/docs\/latest\/ml-pipeline.html#transformers","title":null,"type":null},"content":[{"type":"text","text":"Spark Transformer"}]},{"type":"text","text":" and create a DSL to read and transform features."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"- Transformer: mainly used to read features."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"- Estimator: mainly used to create features, rather than to train models as a "},{"type":"link","attrs":{"href":"https:\/\/spark.apache.org\/docs\/latest\/ml-pipeline.html#estimators","title":null,"type":null},"content":[{"type":"text","text":"Spark Estimator"}]},{"type":"text","text":" does."}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Autonomy: Automate as Much as Possible"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At the top of the hierarchy is the need for autonomy. Autonomy lowers development difficulty and operating costs; without it, data scientists spend their time on tedious manual work. Some companies have shared their experience here, but autonomy is not yet common in the industry."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Airbnb found that data backfilling had become a bottleneck for data scientists iterating on model experiments, so Zipline supports automatic feature backfills. In a simple UI, data scientists define new features and specify the start date, the end date, and the number of parallel processes for the backfill job. The features are then added to the existing training feature set "},{"type":"link","attrs":{"href":"https:\/\/speakerdeck.com\/artwr\/using-apache-airflow-as-a-platform-for-data-engineering-frameworks?slide=16","title":null,"type":null},"content":[{"type":"text","text":"through an Airflow pipeline"}]},{"type":"text","text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/80\/8019d3b76013d72f0957ec6f75296ac4.jpeg","alt":"airbnb-backfill","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"(Figure: Airbnb's feature backfill UI and the Airflow DAG it creates. "},{"type":"link","attrs":{"href":"https:\/\/speakerdeck.com\/artwr\/using-apache-airflow-as-a-platform-for-data-engineering-frameworks?slide=17","title":null,"type":null},"content":[{"type":"text","text":"Source"}]},{"type":"text","text":")"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Other practices related to autonomy:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Netflix "},{"type":"link","attrs":{"href":"https:\/\/netflixtechblog.com\/metacat-making-big-data-discoverable-and-meaningful-at-netflix-56fb36a53520","title":null,"type":null},"content":[{"type":"text","text":"Metacat"}]},{"type":"text","text":" provides cost and storage-space metrics for feature tables, making it easier to delete unused tables and save costs."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Uber "},{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/monitoring-data-quality-at-scale\/","title":null,"type":null},"content":[{"type":"text","text":"Data Quality Monitor"}]},{"type":"text","text":" 
performs automatic anomaly detection and sends alerts based on data quality metrics and a daily data quality score."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Uber is experimenting with automatic feature selection. Users only provide the label to predict, and Palette recommends features related to that label."}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Summary: It Depends on Your Needs"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I hope you now have a clearer understanding of what a feature store is. If we are starting from scratch, all we need is access and serving, plus a little accuracy. If we are building a feature store at a large company, we need to consider convenience and autonomy earlier. If you have questions, feel free to reach out to "},{"type":"link","attrs":{"href":"https:\/\/twitter.com\/eugeneyan","title":null,"type":null},"content":[{"type":"text","text":"@eugeneyan"}]},{"type":"text","text":"!"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Want to start building a feature store? "},{"type":"link","attrs":{"href":"https:\/\/github.com\/feast-dev\/feast","title":null,"type":null},"content":[{"type":"text","text":"Feast"}]},{"type":"text","text":" is a good choice. It covers the access and serving needs and provides a consistent interface, so training and serving can use similar code. Best of all, it is open source (free). Let me know how it goes!"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"References"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[1] Feature Stores - A Hierarchy of Needs. "},{"type":"link","attrs":{"href":"https:\/\/eugeneyan.com\/writing\/feature-stores\/","title":null,"type":null},"content":[{"type":"text","text":"https:\/\/eugeneyan.com\/writing\/feature-stores\/"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[2] Rethinking Feature Stores. 
"},{"type":"link","attrs":{"href":"https:\/\/medium.com\/data-for-ai\/rethinking-feature-stores-74963c2596f0","title":null,"type":null},"content":[{"type":"text","text":"https:\/\/medium.com\/data-for-ai\/rethinking-feature-stores-74963c2596f0"}]}]}]}