Flink 助力美团数仓增量生产

原創

2021-01-26 19:33

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文由美团研究员、实时计算负责人鞠大升分享，主要介绍 Flink 助力美团数仓增量生产的应用实践。内容包括：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":"1","normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"数仓增量生产","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"流式数据集成","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"流式数据处理","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"流式 OLAP 应用","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"未来规划","attrs":{}}]}],"attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"一、数仓增量生产","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1.美团数仓架构","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"先介绍一下美团数仓的架构以及增量生产。如下图所示，这是美团数仓的简单架构，我把它叫做三横四纵。所谓三横，第一是贯穿全链路的元数据以及血缘，贯穿数据集成、数据处理、数据消费、以及数据应用的全过程链路。另外一块贯穿全链路的是数据安全，包括受限域的认证系统、权限系统、整体的审计系统。根据数据的流向，我们把数据处理的过程分为数据集成、数据处理、数据消费、以及数据应用这 4 个阶段。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在数据集成阶段，我们对于公司内部的，比如说用户行为数据、日志数据、DB 数据、还有文件数据，都有相应的集成的系统把数据统一到我们的数据处理的存储中，比如说 Kafka 中。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在数据处理阶段，分为流式处理链路、批处理链路以及基于这套链路的数仓工作平台（万象平台）。生产出来的数据，经过 Datalink 导入到消费的存储中，最终通过应用以不同的形式呈现出来。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们目前在 Flink 上面应用比较广泛的地方，包括从 Kafka 把数据导到 Hive，包括实时的处理，数据导出的过程。今天的分享就集中在这些方面。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/89/89cc8e161b31c7623aae77ad3d13577d.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.美团 Flink 应用概况","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"美团的 Flink 目前大概有 6000 台左右的物理机，支撑了 3 万左右的作业。我们消费的 Topic 数在 5 万左右，每天的高峰流量在 1.8 亿条每秒这样的水平上。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1e/1e332e2839c8479161ece79e42b3c3e6.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.美团 Flink 应用场景","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"美团 Flink 主要应用的场景包括四大块。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一，实时数仓、经营分析、运营分析、实时营销。","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二，推荐、搜索。","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第三，风控、系统监控。","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第四，安全审计。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/01/013912605ffdcc73c82a2d637de7c94b.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.实时数仓 vs 数仓增量生产","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"接下来我要引入增量生产的概念。离线数仓关注的三块需求，第一个就是时效性。第二个就是质量，产出的数据的质量。第三个就是成本。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"关于时效性，有两个更深层次的含义，第一个叫做实时，第二个叫准时。并不是所有的业务需求都是实时的，很多时候我们的需求是准时。比如做经营分析，每天拿到相应的昨天的经营数据情况即可。实时数仓更多的是解决实时方面的需求。但是在准时这一块，作为一个企业，更希望在准时跟成本之间做一个权衡。所以，我把数仓的增量生产定义为对离线数仓的一个关于准时跟成本的权衡。另外，数仓增量生产解决比较好的一个方面是质量，问题能够及时发现。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b9/b9559d36eb5a5896fa6761ca7d656c5c.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5.数仓增量生产的优势","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"数仓增量生产的优势有两点。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"能够及时发现数据质量问题，避免 T+1 修复数据。","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"充分利用资源，提前数据产出时间。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如下图所示，我们期望做的实际上是第二幅图。我们期望把离线的生产占用的资源降低，但同时希望它的产出时间能够提前一步。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e8/e85bcfb58c3dad40bfa2740ea1f56518.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"二、流式数据集成","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1.数据集成 V1.0","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们来看一下流式数据集成的第一代。当数据量非常小以及库非常少的时候，直接做一个批的传输系统。在每天凌晨的时候把相应的 DB 数据全部 load 一遍，导到数仓里面。这个架构优势是非常简单，易于维护，但是它的缺点也非常明显，对于一些大的 DB 或者大的数据，load 数据的时间可能需要 2~3 个小时，非常影响离线数仓的产出时间。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6b/6b86d6216f3b4da5b5991670f85b77a4.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.数据集成 V2.0","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基于这个架构，我们增加了流式传递的链路，我们会有经过流式传输的采集系统把相应的 Binlog 采集到 Kafka，同时会经过一个 Kafka 2 Hive 的程序把它导入到原始数据，再经过一层 Merge，产出下游需要的 ODS 数据。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"数据集成 V2.0 的优势是非常明显的，我们把数据传输的时间放到了 T+0 这一天去做，在第二天的时候只需要去做一次 merge 就可以了。这个时间可能就从 2~3 个小时减少到一个小时了，节省出来的时间是非常可观的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/65/65af7218474c389af74fe822d0ef7e7a.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.数据集成 V3.0","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在形式上，数据集成的第三代架构前面是没什么变化的，因为它本身已经做到了流式的传输。关键是后面 merge 的流程。每天凌晨 merge 一个小时，仍然是非常浪费时间资源的，甚至对于 HDFS 的压力都会非常大。所以在这块，我们就迭代了 HIDI 架构。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"这是我们内部基于 HDFS 做的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e3/e38ff04fa2821dd14da1d86effd9a1b6.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.HIDI","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们设计 HIDI，核心的诉求有四点。第一，支持 Flink 引擎读写。第二，通过 MOR 模式支持基于主键的 Upsert/Delete。第三，小文件管理 Compaction。第四，支持 Table Schema。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e2/e2a31e47680cbeaf4f85beb007f2d03a.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基于这些考虑，我们来对比一下 HIDI，Hudi 和 Iceberg。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"HIDI 的优势包括：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持基于主键的 Upsert/Delete ","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持和 Flink 集成 ","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"小文件管理 Compaction","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"劣势包括：不支持增量读。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Hudi 的优势包括：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持基于主键的 Upsert/Delete ","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"小文件管理 Compaction","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"劣势包括：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"写入限定 Spark/DeltaStreamer","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"流读写支持 SparkStreaming","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Iceberg 的优势包括：支持和 Flink 集成。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"劣势包括：","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持基于 Join 的 Upsert/Delete ","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"流式读取未支持。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/35/357d58cc237b3d95b740d0c7c28f09ff.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5.流式数据集成效果","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如下图所示，我们有数据产生，数据集成，ETL 生产三个阶段。把流式数据集成做到 T+0，ETL 的生产就可以提前了，节省了我们的成本。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3a/3a997ce77d35b9578c2e40502a3e9434.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"三、流式数据处理","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1.ETL 增量生产","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们来讲一下 ETL 的增量生产过程。我们的数据从前面进来，到 Kafka 之后，有 Flink 实时，然后到 Kafka，再到事件的服务，甚至到分析的场景中，这是我们自己做的分析链路。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面是批处理的一个链路，我们通过 Flink 的集成，集成到 HDFS，然后通过 Spark 去做离线生产，再经过 Flink 把它导出到 OLAP 的应用中。在这样的架构中，增量的生产实际上就是下图标记为绿色的部分，我们期望用 Flink 的增量生产的结构去替换掉 Spark。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/dd/dd57edc2cf62f86dfb9c17cf81532543.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.SQL 化是 ETL 增量生产的第一步","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"这样的一个架构有三个核心的能力。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一， Flink 的 SQL 的能力要对齐 Spark。","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二，我们的 Table Format 这一层需要能够支持 Upsert/Delete 这样的主键更新的实时操作。","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第三，我们的 Table Format 能够支持全量和增量的读取。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们的全量用于查询和修复数据，而我们的增量是用来进行增量的生产。SQL 化是 ETL 增量生产的第一步，今天分享的主要是说我们基于 Flink SQL 做的实时数仓平台对这一块的支持。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/22/220c86493493affe23d09c90dc712253.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.实时数仓模型","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如下图所示，这是实时数仓的模型。业界应该都看过这样的一个模型。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6b/6b9eeb9e972906da78bdc0018f256548.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.实时数仓平台架构","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"实时数仓的平台架构，分为资源层、存储层、引擎层、SQL 层、平台层、还有应用层。在这里重点强调两点。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一，是对于 UDF 的支持。因为 UDF 是弥补算子能力中的非常重要的一环，我们希望在这里面做的 UDF 能够加大对于 SQL 能力的支持。","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二，是在这个架构里面只支持了 Flink Streaming 的能力，我们并没有去做 Flink 的批处理的能力，因为我们设想未来所有的架构都是基于 streaming 去做的，这跟社区的发展方向也是一致的。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/db/dbacf6b07f7a6d5c88faf54531ed039f.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5.实时数仓平台 Web IDE","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"这是我们数仓平台的一个 Web IDE。在这样的一个 IDE，我们支持了一个 SQL 的建模的过程，支持了 ETL 的开发的能力。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/71/71139ec3c62774d831a7bbcac253e97c.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"四、流式 OLAP 应用","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1.异构数据源同步","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面看关于流式的导出跟 OLAP 的应用这一块。如下图所示，是异构数据源的同步图。业界有很多开源的产品做这一块。比如说，不同的存储里面，数据总是在其中进行交换。我们的想法是做一个 Datalink 这样的一个中间件，或者是中间的平台。然后我们把 N 对 N 的数据交换的过程，抽象成一个 N 对 1 的交换过程。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/dc/dcc4cdf3e9638f6148e53cfe91ea7c49.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.基于 DataX 的同步架构","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"异构数据源的第一版是基于 DataX 来做同步的架构。在这套架构里面，包含了工具平台层、调度层、执行层。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"工具平台层的任务非常简单，主要是对接用户，配置同步任务，配置调度，运维。","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"调度层负责的是任务的调度，当然对于任务的状态管理，以及执行机的管理，很多的工作都需要我们自己去做。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在真正的执行层，通过 DataX 的进程，以及 Task 多线程的一个形式，真正执行把数据从源同步到目的地。","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在这样的一个架构里面，发现两个核心的问题。第一个问题就是扩展性的问题。开源的单机版的 DataX 是一个单机多线程的模型，当我们需要传输的数据量非常大的时候，单机多线程模型的可扩展性是很大的问题。第二个问题在调度层，我们需要去管理机器、同步的状态、同步的任务，这个工作非常繁琐。当我们的调度执行机发生故障的时候，整个灾备都需要我们单独去做这块的事情。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/fd/fd7d2cbc27b5e2adee1820fbc2a6eee5.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.基于 Flink 的同步架构","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基于这样的架构，我们把它改成了一个 Flink 的同步的架构。前面不变，还是工具平台层。在原有的架构里面，我们把调度层里面关于任务调度和执行机的管理这一块都交给了 Yarn 去做，这样我们就从中解脱出来了。第二个，我们在调度层里面的任务状态管理可以直接迁移到 cluster 里面去。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基于 Flink 的 Datalink 的架构优势非常明显。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一，可扩展性问题得到解决了，同时架构也非常简单。现在当我们把一个同步的任务拆细之后，它在 TaskManager 里面可以扩散到分布式的集群中。","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二，离线跟实时的同步任务，都统一到了 Flink 框架。我们所有同步的 Source 和 Sink 的主键，都可以进行共用，这是非常大的一个优势。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/9a/9a4e09902dd487aacb935b751a96496d.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.基于 Flink 的同步架构关键设计","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们看一下基于 Flink 的同步架构的关键设计，这里总结的经验有四点。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一，避免跨 TaskManager 的 Shuffle，避免不必要的序列化成本；","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二，务必设计脏数据收集旁路和失败反馈机制；","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第三，利用 Flink 的 Accumulators 对批任务设计优雅退出机制；","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第四，利用 S3 统一管理 Reader/Writer 插件，分布式热加载，提升部署效率。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/9a/9abb34ccf18b82039cd8aa198f196af1.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4.基于 Flink 的 OLAP 生产平台","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基于 Flink 我们做了 Datalink 这样的一个数据导出的平台，基于 Datalink 的导出平台做了 OLAP 的生产平台，在这边除了底层的引擎层之外，我们做了平台层。在这上面，我们对于资源、模型、任务、权限，都做了相应的管理，使得我们进行 OLAP 的生产非常快捷。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/bc/bc9546b065d6312c2765b81cec9c5775.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"这是我们的 OLAP 生产的两个截图。一个是对于 OLAP 中的模型的管理，一个是对于 OLAP 中的任务配置的管理。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/9e/9e224abf64e5959cd3afc0e49712e5c5.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"五、未来规划","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"经过相应的迭代，我们把 Flink 用到了数据集成、数据处理、离线数据的导出，以及 OLAP 生产的过程中。我们期望未来对于流批的处理能够是统一的，希望数据也是流批统一的。我们希望，不管是实时的链路，还是增量处理的链路，在未来数据统一之后，统一用 Flink 处理，达到真正的流批一体。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/dd/ddf7b18aac088d8647e12f4f0be97ee4.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}

發表評論

所有評論

還沒有人評論，想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.

相關文章

京东广告研发——效率为王：广告统一检索平台实践

1、系統概述實踐證明，將互聯網流量變現的在線廣告是互聯網最成功的商業模式，而電商場景是在線廣告的核心場景。京東服務中國數億的用戶和大量的商家，商品池海量。平臺在兼顧用戶體驗、平臺、廣告主收益的前提推送商品具有挑戰性。京東廣告檢索平臺

2024-04-25 23:17:47

RocketMQ 之 IoT 消息解析：物联网需要什么样的消息技术?

前言：從初代開源消息隊列崛起，到 PC 互聯網、移動互聯網爆發式發展，再到如今 IoT、雲計算、雲原生引領了新的技術趨勢，消息中間件的發展已經走過了 30 多個年頭。目前，消息中間件在國內許多行業的關鍵應用中扮演着至關重要的角色。隨着數

2024-04-24 23:40:04

“企业创新新引擎”数据库专项赋能会，让云原生技术普惠千行百业！

本文分享自華爲雲社區《“企業創新新引擎”數據庫專項賦能會，讓雲原生技術普惠千行百業！》，作者： GaussDB 數據庫。 4月19日，由福州軟件園科技創新發展公司和華爲技術有限公司聯合主辦的HCDG城市行福州站——“企業創新新引擎”數據庫專

2024-04-24 10:32:53

如何增强Java API 的导入和导出性能

前言 GrapeCity Documents for Excel (以下簡稱GcExcel) 是葡萄城公司的一款服務端表格組件，它提供了一組全面的 API 以編程方式生成 Excel (XLSX) 電子表格文檔的功能，支持爲多個平臺創建、操

2024-04-23 10:23:02

SLS 查询新范式：使用 SPL 对日志进行交互式探索

作者：無哲引言在構建現代數據和業務系統的過程中，可觀測性已經變得至關重要，日誌服務（SLS）爲 Log/Trace/Metric 數據提供了大規模、低成本、高性能的一站式平臺服務，並提供數據採集、加工、投遞、分析、告警、可視化等功能，從

2024-04-22 21:12:05

WhaleScheduler为银行业全信创环境打造统一调度管理平台解决方案

項目背景數字金融是數字經濟的重要支撐和驅動力。近年來，我國針對數字金融的發展政策頻頻出臺，《金融科技發展規劃（2022-2025年）》、《“十四五”數字經濟發展規劃》、《關於銀行業保險業數字化轉型的指導意見》、《金融標準化“十四五”

2024-04-19 21:18:25

千帆杯AI原生应用创意挑战赛-效率工具常规赛重磅上线！

賽題內容本期比賽爲開放賽題，參賽者需要圍繞“效率工具”主題，結合自身的專業背景和創意想法，設計並開發出具有創新性和實用性的AI原生應用。要求使用工具：AppBuilder。參賽者可用0代碼創建應用調試指令，也可自定義組件與workf

2024-04-19 11:29:42

文档图像大模型

隨着信息技術的快速發展，文檔處理已經成爲日常生活和工作中不可或缺的一部分。傳統的文檔處理方法往往需要人工參與，效率低下且易出錯。近年來，隨着深度學習技術的突破，文檔圖像大模型在智能文檔處理領域嶄露頭角，爲提升文檔處理性能提供了新的解決方案。

2024-04-18 11:29:52

GaussDB(DWS)基于Flink的实时数仓构建

本文分享自華爲雲社區《GaussDB(DWS)基於Flink的實時數倉構建》，作者：胡辣湯。大數據時代，廠商對實時數據分析的訴求越來越強烈，數據分析時效從T+1時效趨向於T+0時效，爲了給客戶提供極速分析查詢能力，華爲雲數倉GaussDB

2024-04-18 10:32:57

五一假期畅游指南：Python技术构建的热门景点分析系统解读

導言五一假期即將到來，作爲一名熱愛旅遊的技術達人，我總是希望能夠通過技術手段更好地規劃我的旅行路線。在這篇文章中，我將向大家介紹一款基於Python技術的熱門景點分析系統，幫助您在五一假期中游玩得更加盡興！ 1. 系統概述熱門景點

2024-04-16 23:25:46

裁员了！别错过2024年大数据工程师必备的10项技能

在當今快速發展的世界中，數據被視爲新的石油。隨着對數據驅動洞察的日益依賴，大數據工程師的角色比以往任何時候都更爲關鍵。這些專業人員在管理和優化組織內的數據操作中扮演着至關重要的角色。在本文中，我們將探索2024年大數據工程師必須具備的十

2024-04-16 11:00:53

还在担心报表不好做？不用怕，试试这个方法（四）

系列文章：《還在擔心報表不好做？不用怕，試試這個方法》（一）《還在擔心報表不好做？不用怕，試試這個方法》（二）《還在擔心報表不好做？不用怕，試試這個方法》（三）概要在上一篇文章《還在擔心報表不好做？不用怕，試試這個方法》（三）中，

2024-04-16 10:23:03

MaxCompute 近实时增全量处理一体化新架构和使用场景介绍

隨着當前數據處理業務場景日趨複雜，對於大數據處理平臺基礎架構的能力要求也越來越高，既要求數據湖的大存儲能力，也要求具備海量數據高效批處理能力，同時還可能對延時敏感的近實時鏈路有強需求，本文主要介紹基於 MaxCompute 的離線近實時一體

2024-04-15 23:41:52

普元信息顾伟：用更简单的方式来建设数据中台

近日，普元信息與鏡舟科技聯合舉辦“數據中臺新範式”雲端峯會，深入解析湖倉一體、批流一體、治理與運營一體的數據中臺新範式特徵，闡述以一站式聯合方案賦能企業提質增效的實踐經驗。普元信息數智研究院院長顧偉發表主旨演講《基於湖倉一體，構建開發

2024-04-12 11:43:03

Sql优化之回表

前言： MySQL的性能是大家在使用時十分關心的問題，比如在高併發訪問時，並且有慢sql存在的情況下，MySQL的性能會明顯下降，這會導致數據庫響應時間變慢，甚至導致數據庫宕機。那麼爲了避免Mysql性能問題，比較常用的方式創建適當的索引

2024-04-08 23:16:30

24小時熱門文章

最新文章

最新評論文章