特征平台需求层次理论

原創

伴鱼技术团队

2021-06-07 10:53

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"前言"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文是"},{"type":"link","attrs":{"href":"https:\/\/tech.ipalfish.com\/blog\/2021\/05\/31\/mlsys-we-love\/","title":null,"type":null},"content":[{"type":"text","text":"「算法工程化实践调研」"}]},{"type":"text","text":"系列的第 1 篇，翻译 "},{"type":"link","attrs":{"href":"https:\/\/twitter.com\/eugeneyan","title":null,"type":null},"content":[{"type":"text","text":"Eugene Yan"}]},{"type":"text","text":" 的技术博客 "},{"type":"link","attrs":{"href":"https:\/\/eugeneyan.com\/writing\/feature-stores\/","title":null,"type":null},"content":[{"type":"text","text":"Feature Stores - A Hierarchy of Needs"}]},{"type":"text","text":" [1]。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"出于开发伴鱼特征平台的需要，我最近阅读了很多关于特征平台的实践文章，但总有「一叶障目，不见泰山」之感——每个公司的算法工程化现状不尽相同，导致解决方案的侧重点不同，在架构上的区别也很大。正如我的前同事佘昶在他 2019 年的一篇文章中，到位地总结：我们缺乏一个"},{"type":"text","marks":[{"type":"strong"}],"text":"系统性"},{"type":"text","text":"地思考特征平台的框架。[2]"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"幸运的是，Eugene 的博客正好提供了这样一个思考框架，并将这个思考框架用于分析当前的各个特征平台上。我在征得 Eugene 的同意后，全文翻译，以飨中文读者。以下是译文。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"特征平台（feature store）最近很火。2020 年 12 月，AWS "},{"type":"link","attrs":{"href":"https:\/\/aws.amazon.com\/about-aws\/whats-new\/2020\/12\/introducing-amazon-sagemaker-feature-store\/","title":null,"type":null},"content":[{"type":"text","text":"发布"}]},{"type":"text","text":"了 SageMaker 特征平台。上个月，大数据平台 Splice Machine 也"},{"type":"link","attrs":{"href":"https:\/\/splicemachine.com\/press-releases\/splice-machine-launches-the-splice-machine-feature-store-to-simplify-feature-engineering-and-democratize-machine-learning\/","title":null,"type":null},"content":[{"type":"text","text":"发布"}]},{"type":"text","text":"了一款特征平台。Datanami 引用 Tecton.ai 联合创始人的话，称 2021 年为"},{"type":"link","attrs":{"href":"https:\/\/www.datanami.com\/2021\/01\/19\/2021-the-year-of-the-feature-store\/","title":null,"type":null},"content":[{"type":"text","text":"特征平台之年"}]},{"type":"text","text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"根据我们的经验，管理特征是机器学习上线最大的瓶颈之一。—— Uber"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"特征和标签是机器学习模型的输入。"},{"type":"text","text":"在回归中，标签是因变量，特征是自变量。在表格中，标签是我们想要预测的列，特征是除 ID 外的其它列。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大家对于「"},{"type":"text","marks":[{"type":"strong"}],"text":"特征平台是什么"},{"type":"text","text":"」有很多种理解。有人把它简单地定义为「一个集中存储特征的地方」。也有人称特征平台能帮你「实现特征的一次创建，多处使用」或「百倍地提高模型部署效率」。之所以回答五花八门，是因为"},{"type":"text","marks":[{"type":"strong"}],"text":"每个人想要特征平台做的事情都不同"},{"type":"text","text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我研究了"},{"type":"link","attrs":{"href":"https:\/\/github.com\/eugeneyan\/applied-ml#feature-stores","title":null,"type":null},"content":[{"type":"text","text":"大量业界实践"}]},{"type":"text","text":"，试图理解特征平台在不同场景下解决的问题。受心理学家马斯洛的启发，我发现特征平台的能力可以满足多个层次的需求。我称之为「特征平台的需求层次」，我将逐层介绍这些需求，并讨论业界的特征平台为满足该层次需求所做的实践。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"特征平台的需求层次"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"马斯洛需求层次理论是一个心理动力理论，认为人类有五个层次的需求，呈金字塔形。该理论认为，人会首先满足最大最基本的需求（金字塔底），才会考虑更高层次的要求（金字塔顶）。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"关于「马斯洛需求层次理论」"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/d2\/e3\/d23ebfdf6yy55d03a319e6262b5c68e3.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"（图注：马斯洛的需求层次金字塔，底层代表基本需求。"},{"type":"link","attrs":{"href":"https:\/\/en.wikipedia.org\/wiki\/Maslow's_hierarchy_of_needs","title":null,"type":null},"content":[{"type":"text","text":"来源"}]},{"type":"text","text":"）"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"生理需求："},{"type":"text","text":"是生存所必需的，例如空气、水、食物、住处等。不满足这个需求，人类身体将无法正常工作。只有满足了这一最重要的需求，才会考虑其它需求。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"安全需求："},{"type":"text","text":"在生理需求被满足的前提下，需要保障安全，包括人身安全、健康、经济安全、就业、法律秩序、社会稳定等。这一需求通常由家庭、社会和政府满足。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"社交需求："},{"type":"text","text":"感到社会群体对自己的认可和接受。这类群体包括同事、教友、职业机构、体育俱乐部、在线社区等，也包括家庭、朋友和导师。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"尊严需求："},{"type":"text","text":"基于能力和成就。较低层次的尊严需求来自他人，包括他人的尊重认可和在其他人中的声誉。较高层次的尊严需求来自自己，包括对自己的尊重和内在的成就感。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"自我实现需求："},{"type":"text","text":"最高层的需求，是个人潜力的完全实现。对自我实现的需求因人而异。有的人希望成为完美父母，而有的人强烈希望在经济上、学术上、体育上取得成功。人们通过发明、艺术、写作等方式表达这种需求。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同样，特征平台会首先满足最必要和急迫的需求（包括特征读取和特征服务），再去考虑高阶需求。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/db\/dbad811f20b0686a47c099e808965b8d.jpeg","alt":"pyramid","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"（图注：特征平台需求层次）"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"最底层是访问（access）的需求。"},{"type":"text","text":"这一层需求包括特征可读取、特征转换逻辑透明和特征血缘可溯。它们使得特征能被发现、分享和复用，减少重复。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"从前，算法工程师进行机器学习开发时，60%的时间都花在编写特征转换逻辑上。—— Airbnb"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"其次是服务（serving）的需求。"},{"type":"text","text":"这一层的核心需求是为线上服务提供高吞吐、低延迟的特征读取能力，而无需通过 SQL 去数据仓库读取。其它需求还包括：与已有的离线特征存储集成，使得特征能够从离线特征存储同步到在线特征存储（例如 Redis）；实时的特征转换等。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通常，数据工程师会将数据科学家实现的特征重新实现为可以在生产环境运行的特征管道。这个重复实现的过程会让项目推迟数月交付，让跨团队合作极为复杂。—— GoJek"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"两个底层需求被满足后，我们诉诸准确（integrity）需求。"},{"type":"text","text":"最常见的需求是最小化 train-serve skew，确保特征在训练和服务环境下是一致的。另一个常见需求是 point-in-time correctness（又称 time-travel），以确保历史特征和标签被用于训练和评估时不存在 data leaks。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"训练通常是离线的，而服务通常是实时的。保证训练和服务环境下的数据一致性极为重要。—— Uber"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"再往上，是便利的需求。"},{"type":"text","text":"特征平台需要足够简单好入手，例如提供简单直观的接口、易交互、易 debug 等，才能让大家采纳和受益。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"记住，我们是个平台组。我们要搭建工具把提供给用户，让他们能够自己动手丰衣足食。—— Uber"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"最后是自治（autopilot）的需求，"},{"type":"text","text":"包括自动回填特征、对特征的分布进行监控和报警等。我知道有些公司有做这一层的事情，但我没怎么读到相关材料。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"特征回填是训练集迭代最主要的瓶颈。解决这一问题能极大地加速数据科学家的工作流。—— Airbnb"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"并非所有团队都有全部五层需求，"},{"type":"text","text":"对大部分团队而言，满足第一、二层和部分第三层的需求就很受益了。不同团队对于每一层需求的程度要求也不同。在线场景少的团队相比每秒需要处理几百万请求的的 DoorDash 团队，当然更少关心特征服务的需求；如果模型和特征每天更新多次，则更少需要关心 point-in-time correctness。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在逐层了解对特征平台的需求后，让我们来看看不同的公司是如何实现这些需求的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"对「创新者窘境」的借鉴"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"访问：去重和复用"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果特征难以访问，以下情况会发生："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不同团队反复实现同一个特征，导致同一特征可能有多达 10 个版本。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"部署多个相近的特征管道，浪费计算和存储资源。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因为同一特征有多个版本，不同模型会使用不同版本的特征，很难得到一致的结果。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"迭代变慢。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"表达同一业务概念的特征被多个团队反复开发，已有工作无法复用。—— GoJek"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"为了解决这一问题，"},{"type":"link","attrs":{"href":"https:\/\/www.gojek.io\/blog\/feast-bridging-ml-models-and-data","title":null,"type":null},"content":[{"type":"text","text":"GoJek 搭建 Feast"}]},{"type":"text","text":" 作为数据工程师、数据科学家和算法工程师合作的接口。数据工程师和数据科学家创建特征，并提交给特征平台。随后，算法工程师消费特征，而无需自己创建。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"（我很难评价这种做法，因为我认为数据科学家应该"},{"type":"link","attrs":{"href":"https:\/\/eugeneyan.com\/writing\/end-to-end-data-science\/","title":null,"type":null},"content":[{"type":"text","text":"端到端"}]},{"type":"text","text":"。当然，GoJek 的做法或许和组织架构有关，因为它的数据工程团队主要在印度，而算法团队主要在新加坡。Feast 扮演了团队间沟通的接口。）"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Uber 也采用了相似的做法，通过搭建 "},{"type":"link","attrs":{"href":"https:\/\/www.infoq.com\/presentations\/michelangelo-palette-uber\/","title":null,"type":null},"content":[{"type":"text","text":"Palette 特征平台"}]},{"type":"text","text":"，鼓励不同部门分享和复用 Palette 中的特征。这种做法最小化了重复工作，让机器学习的的结果更加一致，加速机器学习的进程。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"特征的可发现性使得特征易于发现和使用。我没有读到多少特征平台语境下的可发现性的相关讨论，估计它和我之前写过的"},{"type":"link","attrs":{"href":"https:\/\/eugeneyan.com\/writing\/data-discovery-platforms\/","title":null,"type":null},"content":[{"type":"text","text":"开源数据发现平台"}]},{"type":"text","text":"很相似。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在这个层次上，特征平台基本上是个包含很多特征的存储，和数据仓库的区别不大。把两者区分开的是特征平台还能满足特征下一层次的服务需求。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"服务：在实时环境使用特征"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们常用批数据离线训练模型，然而在线模型服务需要实时读取这些特征。这难住了很多团队——应该如何为在线模型服务高吞吐、低延迟地提供（serve）这些特征？"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们在开发模型的过程中发现：很多用于训练的特征，并无法在生产环境中获取。—— Monzo Bank"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Monzo Bank 能从离线分析环境（用于模型训练）中获取特征，但无法从生产环境（用于模型服务）中获取特征。Monzo Bank 采用了一个"},{"type":"link","attrs":{"href":"https:\/\/nlathia.github.io\/2020\/12\/Building-a-feature-store.html","title":null,"type":null},"content":[{"type":"text","text":"轻量的解决方案"}]},{"type":"text","text":"，将离线分析存储（BigQuery）中的特征同步至在线存储（Cassandra）。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先，在离线分析环境的 SQL 建表语句中加入标签。这些表的更新频率在小时或天级别。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"特征平台中的 Go 服务检查特征表 schema 的正确性，例如必需的 "},{"type":"codeinline","content":[{"type":"text","text":"subject_type"}]},{"type":"text","text":" 和 "},{"type":"codeinline","content":[{"type":"text","text":"subject_id"}]},{"type":"text","text":" 列是否存在。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有个 cron job 监听特征表的更新，将数据变动从 BigQuery 经过 Google Cloud Storage 的中转同步至 Cassandra。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Uber 的 Palette 采取了类似的双存储设计。离线存储（Hive）保存特征快照，用于训练。在线存储（Cassandra）实时提供同样的特征。特征由 Flink 生成，写入 Cassandra。两个存储之间会进行特征同步：添加到 Hive 的特征会被复制到 Cassandra，添加到 Cassandra 的特征会被 ETL 到 Hive。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/fb\/fb8caa62ad03d49fe0e25ea434e5db97.jpeg","alt":"uber-dual-store","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"（图注：创建批特征（左边）和实时特征（右边），并在存储间同步。"},{"type":"link","attrs":{"href":"https:\/\/www.infoq.com\/presentations\/michelangelo-palette-uber\/","title":null,"type":null},"content":[{"type":"text","text":"来源"}]},{"type":"text","text":"）"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DoorDash 搭建了"},{"type":"link","attrs":{"href":"https:\/\/www.infoq.com\/presentations\/michelangelo-palette-uber\/","title":null,"type":null},"content":[{"type":"text","text":"超大规模的特征平台"}]},{"type":"text","text":"，将特征服务做到极致，满足以下需求："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持在可持久、可伸缩的存储中保存十亿级别条数的特征。DoorDash 有百万级别的特征实体（entity）和十亿级别的特征。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持百万级别 QPS（Queries per second）。特征平台有多个使用场景，其中包括餐厅排序。这一场景使用大量特征，每秒做出超过一百万次预测。综合来看，特征平台的 QPS 超过一千万。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持非实时特征每日一次的快速批更新，和实时特征（例如餐厅过去 20 分钟的平均送餐时长）一天内不断的更新。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DoorDash 在评估了 Redis、Cassandra、CockroachDB、ScyllaDB 和 YugabyteDB 后，选择了 Redis。这篇"},{"type":"link","attrs":{"href":"https:\/\/doordash.engineering\/2020\/11\/19\/building-a-gigascale-ml-feature-store-with-redis\/","title":null,"type":null},"content":[{"type":"text","text":"好文"}]},{"type":"text","text":"介绍了 DoorDash 的评估过程和针对 Redis 做的后续优化。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另一个方法是实时计算特征。例如，阿里巴巴的特征服务平台实时计算用户行为特征的统计量（点击、点赞、购买等），用于「猜你喜欢」"},{"type":"link","attrs":{"href":"https:\/\/102.alibaba.com\/detail?id=183","title":null,"type":null},"content":[{"type":"text","text":"实时推荐"}]},{"type":"text","text":"。"},{"type":"link","attrs":{"href":"https:\/\/eugeneyan.com\/writing\/real-time-recommendations\/","title":null,"type":null},"content":[{"type":"text","text":"这篇文章"}]},{"type":"text","text":"介绍了更多实时推荐的内容。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"准确：创建正确的在线和离线特征"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在满足服务的需求后，我们来看准确需求。准确性解决的主要痛点是："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"难以创建 point-in-time correct 的特征，用于模拟生产环境。做不对的话，会导致 data leaks。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"训练和服务环境特征的不一致，导致模型上线后表现欠佳。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"为了解决第一个痛点，Netflix 实现了"},{"type":"link","attrs":{"href":"https:\/\/netflixtechblog.com\/distributed-time-travel-for-feature-generation-389cccdd3907","title":null,"type":null},"content":[{"type":"text","text":"分布式 time-travel"}]},{"type":"text","text":"。它给离线和在线数据建立快照，快照内容包含成员类型、设备、当天时间等。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/59\/592c6c14270f1e80f86dec2b9388c329.jpeg","alt":"netflix-snapshots","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"（图注：从离线和在线的微服务创建快照。"},{"type":"link","attrs":{"href":"https:\/\/netflixtechblog.com\/distributed-time-travel-for-feature-generation-389cccdd3907","title":null,"type":null},"content":[{"type":"text","text":"来源"}]},{"type":"text","text":"）"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然而，为每一个 context 都建立快照的成本很高。因此，Netflix 对观看模式、设备类型、设备使用时长、地区等离线特征进行分层抽样，这些样本很好地代表了用于模型训练和评估的数据的分布。抽样通过 Spark 完成，快照存储在 S3 中。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Netflix 也会对在线特征建立快照。数据产生自数百个微服务，数据包括观看历史、个性化观看列表、评分预测等。数据由 Spark 通过 "},{"type":"link","attrs":{"href":"https:\/\/medium.com\/@Netflix_Techblog\/prana-a-sidecar-for-your-netflix-paas-based-applications-and-services-258a5790a015","title":null,"type":null},"content":[{"type":"text","text":"Prana"}]},{"type":"text","text":" 并行获取，制成快照，以 Parquet 格式存储在 S3 上。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"为了解决 train-serve skew 这第二个痛点，GoJek 用 Apache Beam 实现数据处理管道，消费来自批和流数据源（例如 BigQuery 和 Kafka）的数据，注入离线和在线存储（例如 BigQuery 和 Redis），并提供统一的接口来读取历史和实时数据。这种做法避免了因在生产环境重写特征管道而引入 train-serve skew。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/cf\/cf354f1555d6511c807b2d4b9adc6f92.jpeg","alt":"gojek-feast","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"（图注：GoJek 基于 Apache Beam 的特征注入。"},{"type":"link","attrs":{"href":"https:\/\/www.gojek.io\/blog\/feast-bridging-ml-models-and-data","title":null,"type":null},"content":[{"type":"text","text":"来源"}]},{"type":"text","text":"）"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Netflix 则通过共享的特征编码器（encoder）解决这一痛点。尽管他们在离线（Spark）和在线环境实现了不同的特征生成管道，但不同管道共用特征编码器（即同样的类、库和数据格式）。这也保证了特征生成过程在训练和服务环境的一致性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/02\/029a040e4e54ae5357d88e87a094aa45.jpeg","alt":"netflix-shared-encoders","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"（图注：Netflix 的离线和在线特征生成使用同一个编码器。"},{"type":"link","attrs":{"href":"https:\/\/databricks.com\/session\/fact-store-scale-for-netflix-recommendations","title":null,"type":null},"content":[{"type":"text","text":"来源"}]},{"type":"text","text":"）"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们在前文介绍过 Uber 如何保持离线（Hive）和在线（Cassandra）特征存储的数据同步。任意一个存储的新特征都会被复制到另一个存储，确保训练和服务环境中数据的一致性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"为确保特征的准确，还需引入监控，回答以下问题："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"特征最近一次更新是什么时候？"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"schema 正确吗？数据分布发生了偏移吗？"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"特征服务达到了吞吐和延迟的要求吗？"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Airbnb Zipline UI 向数据科学家展示特征的分布、特征和标签之间的相关性、聚类分析（尚不清楚基于什么做聚类分析）。类似地，Uber "},{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/monitoring-data-quality-at-scale\/","title":null,"type":null},"content":[{"type":"text","text":"Data Quality Monitor"}]},{"type":"text","text":" 通过以下方法给用户展示每日数据质量分数和异常报警："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先，搜集特征的指标，例如数值特征的均值、中位数、最大值、最小值，以及类别特征的唯一值个数和缺失值个数。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其次，基于指标建立多维时间序列，使用主成分分析（PCA）选取出要保留的主成分。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最后，使用主成分建立时间序列。如果当前测量值和上一步的预测值不匹配，则将该特征标记为异常。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b4\/b4a421d3c894645897631ef3ad44adbd.jpeg","alt":"uber-dqm","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"（图注：数据质量随时间的变化，以及当事故发生时。"},{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/monitoring-data-quality-at-scale\/","title":null,"type":null},"content":[{"type":"text","text":"来源"}]},{"type":"text","text":"）"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"便利：尽量简单"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"当前，关于特征平台的便利需求，讨论并不多。但显然，好用的工具和平台非常重要（想想 PyTorch 和 Tensorflow 的对比）。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我找到的最好的例子来自 GoJek。GoJek 实现提供了统一的 Python、Java 和 Go SDKs，让用户可以在不同语言中几乎无区别地使用 "},{"type":"codeinline","content":[{"type":"text","text":"get_batch_features()"}]},{"type":"text","text":" 和 "},{"type":"codeinline","content":[{"type":"text","text":"get_online_features()"}]},{"type":"text","text":" 接口，简化从离线存储和在线存储中获取特征的过程。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"

customer_features = ['credit_score', 'balance', 'total_purchases', 'last_active']

historical_features_df = feast.get_historical_features(customer_ids, customer_features)
model = ml.fit(historical_features_df)  # pseudo code

online_features = feast.get_online_features(customer_ids, customer_features)
prediction = model.predict(online_features)
"}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"此外，Netflix 实现了简单的接口，让数据科学家能够容易地创建 point-in-time correct 和特征和标签。下面的例子展示如何读取电影 "},{"type":"link","attrs":{"href":"http:\/\/outatimemovie.com\/","title":null,"type":null},"content":[{"type":"text","text":"OUTATIME"}]},{"type":"text","text":" 的观看历史快照。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"1
2
3
4
val snapshot = new SnapshotDataManager(sqlContext)
  .withTimestamp(1445470140000L)
  .withContextID(OUTATIME)
  .getViewingHistory
"}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基于该快照，用户只需提供以下内容，即可进行 time-travel，创建用于训练和评估的特征："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上下文：模型在何地何时被如何使用，例如国家、设备、成员档案、电影、时间等，其中时间非常关键。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"物品：要评分或排序的物品，例如电影、推荐名单、搜索项等。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"标签：监督学习的目标，例如点击、已看、观看分钟数等。无监督学习不需要这些内容。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"特征编码器：如何将上下文和物品"},{"type":"link","attrs":{"href":"https:\/\/developers.google.com\/machine-learning\/crash-course\/feature-crosses\/encoding-nonlinearity","title":null,"type":null},"content":[{"type":"text","text":"组合"}]},{"type":"text","text":"起来创建特征，例如国家-电影、用户 ID-电影等。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Uber Palette 的 DSL（含代码示例）"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Uber 也"},{"type":"link","attrs":{"href":"https:\/\/www.infoq.com\/presentations\/michelangelo-palette-uber\/","title":null,"type":null},"content":[{"type":"text","text":"详细介绍"}]},{"type":"text","text":"了他们如何通过拓展 "},{"type":"link","attrs":{"href":"https:\/\/spark.apache.org\/docs\/latest\/ml-pipeline.html#transformers","title":null,"type":null},"content":[{"type":"text","text":"Spark Transformer"}]},{"type":"text","text":" 和创建 DSL，进行特征读取和转换。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"- Transformer：主要用于特征读取。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"- Estimator：主要用于创建特征，而不是像 "},{"type":"link","attrs":{"href":"https:\/\/spark.apache.org\/docs\/latest\/ml-pipeline.html#estimators","title":null,"type":null},"content":[{"type":"text","text":"Spark Estimator"}]},{"type":"text","text":" 一样用于模型训练。"}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"自治：尽量自动"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最顶层是自治需求。自治可以降低开发难度和运维成本，否则数据科学家需要花时间进行枯燥的手动操作。目前，有些公司分享了相关经验，但自治在业界还不普遍。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Airbnb 发现数据回填成为数据科学家迭代模型实验的瓶颈。因此，Zipline 支持自动特征回填。数据科学家可以在简单的 UI 上定义新特征，指定开始和结束日期，以及回填任务的并行进程个数，随后这些特征就会"},{"type":"link","attrs":{"href":"https:\/\/speakerdeck.com\/artwr\/using-apache-airflow-as-a-platform-for-data-engineering-frameworks?slide=16","title":null,"type":null},"content":[{"type":"text","text":"通过 Airflow 管道"}]},{"type":"text","text":"添加到到已有的训练特征集中。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/80\/8019d3b76013d72f0957ec6f75296ac4.jpeg","alt":"airbnb-backfill","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"（图注：Airbnb 特征回填 UI，和它所创建的 Airflow DAG。"},{"type":"link","attrs":{"href":"https:\/\/speakerdeck.com\/artwr\/using-apache-airflow-as-a-platform-for-data-engineering-frameworks?slide=17","title":null,"type":null},"content":[{"type":"text","text":"来源"}]},{"type":"text","text":"）"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面是自治方面的其它实践："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Netflix "},{"type":"link","attrs":{"href":"https:\/\/netflixtechblog.com\/metacat-making-big-data-discoverable-and-meaningful-at-netflix-56fb36a53520","title":null,"type":null},"content":[{"type":"text","text":"Metacat"}]},{"type":"text","text":" 提供了特征表的成本和存储空间指标，便于删除不用的特征表，节约成本。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Uber "},{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/monitoring-data-quality-at-scale\/","title":null,"type":null},"content":[{"type":"text","text":"Data Quality Monitor"}]},{"type":"text","text":" 进行自动的异常检测，基于数据质量指标和每日数据质量分数进行通知。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Uber 实验支持自动特征选择。用户只需提供要预测的标签，Palette 就能推荐出与标签有关的特征。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"总结：取决于你的需求"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我希望现在大家对「特征平台是什么」有了更清晰的理解。如果我们从零开始，我们所需的不过是访问和服务，加上一点点准确性。如果我们在大厂搭建特征平台，则需更早考虑便利和自治。如有问题，欢迎联系 "},{"type":"link","attrs":{"href":"https:\/\/twitter.com\/eugeneyan","title":null,"type":null},"content":[{"type":"text","text":"@eugeneyan"}]},{"type":"text","text":" ！"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"想开始搭建特征平台吗？"},{"type":"link","attrs":{"href":"https:\/\/github.com\/feast-dev\/feast","title":null,"type":null},"content":[{"type":"text","text":"Feast"}]},{"type":"text","text":" 是个不错的选择。它满足了访问和服务的需求，并提供了一致的接口，让训练和服务可以使用相似的代码。最棒的一点是，它是开源（免费）的。不妨告诉我进展如何！"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"参考文献"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[1] Feature Stores - A Hierarchy of Needs. "},{"type":"link","attrs":{"href":"https:\/\/eugeneyan.com\/writing\/feature-stores\/","title":null,"type":null},"content":[{"type":"text","text":"https:\/\/eugeneyan.com\/writing\/feature-stores\/"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[2] Rethinking Feature Stores. "},{"type":"link","attrs":{"href":"https:\/\/medium.com\/data-for-ai\/rethinking-feature-stores-74963c2596f0","title":null,"type":null},"content":[{"type":"text","text":"https:\/\/medium.com\/data-for-ai\/rethinking-feature-stores-74963c2596f0"}]}]}]}

發表評論

所有評論

還沒有人評論，想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.

相關文章

从NoSQL到NewSQL——10年代大数据浪潮下的技术革新

引言在數字化浪潮的推動下，數據庫技術已成爲支撐數字經濟的堅實基石。騰訊雲 TVP《技術指針》聯合《明說三人行》特別策劃的直播系列——【中國數據庫前世今生】，我們將通過五期直播，帶您穿越五個十年，深入探討每個時代的數據庫演變

2024-04-28 23:12:26

“百团大战”下，20年代的国产数据库如何乘风破浪？

引言在數字化浪潮的推動下，數據庫技術已成爲支撐數字經濟的堅實基石。騰訊雲 TVP《技術指針》聯合《明說三人行》特別策劃的直播系列——【中國數據庫前世今生】，我們將通過五期直播，帶您穿越五個十年，深入探討每個時代的數據庫

2024-04-28 23:12:24

大数据小白的测试成长之路

引言 22年校招入職京東後，我一直在數據中臺測試部從事測試開發的工作。畢業後，寫的最多的文檔是測試計劃和測試報告，鮮有機會就自己的成長碼字進行回顧和總結。借“up技術人”欄目，也終於是在工作之餘回頭望，對自己這近兩年時光進行一個小總結

2024-04-28 11:17:19

如何从0到1设计诊断系统

引言在整車電子電氣體系中，診斷系統的設計扮演着至關重要的角色，負責支持整車的刷寫、故障排查和EOL(End of Line)等關鍵操作。這一重要性在於這些操作的實現都依賴於診斷系統的全面支持。因此，在設計診斷系統時，必須確保

2024-04-26 22:43:26

华为云Stack8.3面向香港正式发布，六大亮点激发云上跃迁

本文分享自華爲雲社區《華爲雲Stack8.3面向香港正式發佈，六大亮點激發雲上躍遷》，作者：華爲雲頭條。 2024年4月23日，在華爲雲香港峯會2024上，華爲混合雲副總裁胡玉海面向香港市場發佈華爲雲Stack8.3，提供110+本地

2024-04-26 10:33:21

对接HiveMetaStore，拥抱开源大数据

本文分享自華爲雲社區《對接HiveMetaStore，擁抱開源大數據》，作者：睡覺是大事。 1. 前言適用版本：9.1.0及以上在大數據融合分析時代，面對海量的數據以及各種複雜的查詢，性能是我們使用一款數據處理引擎最重要的考量

2024-04-24 22:33:08

重磅新品发布！云耀数据库HRDS，享受轻量级的极致体验

本文分享自華爲雲社區《重磅新品發佈！雲耀數據庫HRDS，享受輕量級的極致體驗！》，作者：GaussDB 數據庫。所謂，凡有井水處，即能歌柳詞。大數據時代，凡有數據處，必有數據庫。隨着業務需求的不斷擴大和數據量的激增，數

2024-04-23 22:32:33

沙特2030年愿景和对中国IT企业的市场机会分析

沙特2030年願景和對中國IT企業的市場機會分析前言：最近“開源老DJ，帶你去沙特”欄目第一期已經播出，收到了不錯的反響。見COPU官網的回顧。（https://mp.weixin.qq.com/s/3B0jNVhybxTF1xPiy

2024-04-23 22:24:54

03-为啥大模型LLM还没能完全替代你？

1 不具備記憶能力的它是零狀態的，我們平常在使用一些大模型產品，尤其在使用他們的API的時候，我們會發現那你和它對話，尤其是多輪對話的時候，經過一些輪次後，這些記憶就消失了，因爲它也記不住那麼多。 2 上下文窗口的限制大模型對其inpu

2024-04-23 01:07:00

入职3年-我如何做一名AI产品经理

引言從2021年校招加入京東開始，我一直從事AI產品經理的工作，有幸見證了AI行業的熱情從一臺臺服務器燒到了全世界各個角落，也見證了京東AI中臺團隊的影響力如何一步步的擴大。從21年的迷茫到24年的堅定，很慶幸我正走在適合自己的道路上，

2024-04-22 11:16:31

01-大语言模型发展

AI大模型的相關的一些基礎知識，一些背景和基礎知識。多模型強應用AI 2.0時代應用開發者的機會。 0 大綱 AI產業的拆解和常見名詞應用級開發者，在目前這樣一個大背景下的一個職業上面的一些機會實戰部分的，做這個agent，即所謂智

2024-04-22 01:12:50

WhaleScheduler为银行业全信创环境打造统一调度管理平台解决方案

項目背景數字金融是數字經濟的重要支撐和驅動力。近年來，我國針對數字金融的發展政策頻頻出臺，《金融科技發展規劃（2022-2025年）》、《“十四五”數字經濟發展規劃》、《關於銀行業保險業數字化轉型的指導意見》、《金融標準化“十四五”

2024-04-19 21:18:25

用户行为分析模型实践（四）—— 留存分析模型

作者：vivo 互聯網大數據團隊- Wu Yonggang、Li Xiong 本文是vivo互聯網大數據團隊《用戶行爲分析模型實踐》系列文章第4篇 -留存分析模型。本文詳細介紹了留存分析模型的概念及基本原理，並

2024-04-19 11:26:00

京东内部研效架构师训练营，首次对外公开课，不可错过的研效之旅！

五月繁花似錦，讓我們帶你走進京東，開啓研效實戰之旅！四大單位聯合發起本次活動由“全國雲計算技術行業產教融合共同體”發起，聯合工業和信息化部電子第五研究所、E³CI軟件研發效能度量工作委員會、京東雲共同主辦，重磅推出“卓越研效架構師”

京東雲開發者

2024-04-19 11:16:30

软件测试从自动化到智能化，大模型开始加入

隨着科技的飛速發展，軟件行業也在不斷地演進和創新。作爲軟件行業的關鍵環節之一，軟件測試行業也在經歷着前所未有的變革。從最初的手動測試，到自動化測試，再到如今的智能化測試，軟件測試行業正在經歷一場深刻的技術革命。在這場革命中，Testin雲測

2024-04-19 00:53:25

24小時熱門文章

最新文章

最新評論文章