百信银行基于 Apache Hudi 实时数据湖演进方案

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"简介:","attrs":{}},{"type":"text","text":" 本文介绍了百信银行实时计算平台的建设情况,实时数据湖构建在 Hudi 上的方案和实践方法,以及实时计算平台集成 Hudi 和使用 Hudi 的方式。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文介绍了百信银行实时计算平台的建设情况,实时数据湖构建在 Hudi 上的方案和实践方法,以及实时计算平台集成 Hudi 和使用 Hudi 的方式。内容包括:","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"背景","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"百信银行基于 Flink 的实时计算平台设计与实践","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"百信银行实时计算平台与实时数据湖的集成实践","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"百信银行实时数据湖的未来","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"总结","attrs":{}}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"一、背景","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"百信银行,全称为 “中信百信银行股份有限公司”,是首家获批独立法人形式的直销银行。作为首家国有控股的互联网银行,相比于传统金融行业,百信银行对数据敏捷性有更高的要求。数据敏捷,不仅要求数据的准确性,还要求数据到达的实时性,和数据传输的安全性。为了满足我行数据敏捷性的需求,百信银行大数据部承担起了建设实时计算平台的职责,保证了数据快速,安全且标准得在线送达。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"受益于大数据技术的发展和更新迭代,目前广为人知的批流一体的两大支柱分别是:“统一计算引擎” 与 “统一存储引擎”。","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Flink,作为大数据实时计算领域的佼佼者,1.12 版本的发布让它进一步提升了","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"统一计算引擎","attrs":{}},{"type":"text","text":"的能力;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同时随着数据湖技术 Hudi 的发展,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"统一存储引擎","attrs":{}},{"type":"text","text":"也迎来了新一代技术变革。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在 Flink 和 Hudi 社区发展的基础上,百信银行构建了实时计算平台,同时将实时数据湖 Hudi 集成到实时计算平台之上。结合行内数据治理的思路,实现了数据实时在线、安全可靠、标准统一,且有敏捷数据湖的目标。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"二、百信银行基于 Flink 的实时计算平台设计与实践","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. 实时计算平台的定位实时计算平台作为行级实时计算平台,由大数据 IaaS 团队自主研发,是一款实现了实时数据 ”端到端“ 的在线数据加工处理的企业级产品。","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其核心功能具备了实时采集、实时计算、实时入库、复杂时间处理、规则引擎、可视化管理、一键配置、自主上线,和实时监控预警等。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"目前其支持的场景有实时数仓、断点召回、智能风控、统一资产视图、反欺诈,和实时特征变量加工等。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"并且,它服务着行内小微、信贷、反欺诈、消金、财务,和风险等众多业务线。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"截止目前,在线稳定运行的有 320+ 的实时任务,且在线运行的任务 QPS 日均达到 170W 左右。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. 实时计算平台的架构","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"按照功能来划分的话,实时计算平台的架构主要分为三层:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"■ 1)数据采集层","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"采集层目前主要分为两个场景:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一个场景是采集 MySQL 备库的 Binlog 日志到 Kafka 中。我行所使用的数据采集方案并没有采用业界普遍用的如 Canal,Debezium 等现有的 CDC 方案。1、因为我们的 MySQL 版本为百信银行内部的版本,Binlog 协议有所不同,所以现有的技术方案不能很好的支持兼容我们获取 Binlog 日志。2、同时,为了解决我们数据源 MySQL 的备库随时可能因为多机房切换,而造成采集数据丢失的情况。我们自研了读取 MySQL Binlog 的 Databus 项目,我们也将 Databus 逻辑转化成了 Flink 应用程序,并将其部署到了 Yarn 资源框架中,使 Databus 数据抽取可以做到高可用,且资源可控。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二个场景是,我们对接了第三方的应用,这个第三方应用会将数据写入 Kafka,而写入 Kafka 有两种方式:1、一种方式是依据我们定义的 Json shcema 协议。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(UMF协议:{col_name:””,umf_id\":\"\",\"umf_ts\":,\"umf_op_\":\"i/u/d\"})协议定义了 ”唯一 id”,”时间戳“ 和 ”操作类型“。根据此协议,用户可以指定对该消息的操作类型,分别是 \"insert\",\"update\" 和 \"delete\",以便下游对消息进行针对性处理。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2、另外一种方式,用户直接把 JSON 类型的数据写到 kafka 中,不区分操作类型。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"■ 2)数据计算转换层","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"消费 Kafka 数据进行一层转换逻辑,支持用户自定义函数,将数据标准化,做敏感数据的脱敏加密等。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"■ 3)数据存储层","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"数据存储到 HDFS,Kudu,TiDB,Kafka,Hudi,MySQL 等储存介质中。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/00/00a15f4b8771dba178f697a85e9a1c48.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在上图所示的架构图中,我们可以看到整体实时计算平台支持的主要功能有:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"开发层面:","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1、支持标准化的 DataBus 采集功能,该功能对于支持 MySQL Binglog 同步到 Kafka 做了同步适配,不需要用户干预配置过多。用户只需要指定数据源 MySQL 的实例就可以完成到 Kafka 的标准化同步。2、支持用户可视化编辑 FlinkSQL。3、支持用户自定义 Flink UDF 函数。4、支持复杂事件处理(CEP)。5、支持用户上传打包编译好 Flink 应用程序。","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"运维层面:","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1、支持不同类型任务的状态管理,支持savepoint。2、支持端到端的延迟监控,告警。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在实时计算平台升级迭代的过程中,社区 Flink 版本之间存在一些向下不兼容的情况。为了平滑的升级 Flink 版本,我们对计算引擎的多版本模块进行统一的抽象,将多版本之间做了严格的 JVM 级别隔离,使版本之间不会产生 Jar 包冲突,Flink Api 不兼容的情况。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/00/00cfd83c04cecc195963635cb2e80776.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如上图所示,我们将不同的 Flink 版本封装到一个独立的虚拟机中,使用 Thrift Server 启动一个独立的 JVM 虚拟机,每个版本的 Flink 都会有一个独立的 Thrift Server。在使用的过程中,只要用户显示指定的 Flink 版本,Flink 应用程序就会被指定的 Thrift Server 启动。同时,我们也将实时计算的后端服务嵌入一个常用的 Flink 版本,避免因为启动 Thrift Server 而占用过多的启动时间。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同时为了满足金融系统高可用和多备的需求,实时计算平台也开发了多 Hadoop 集群的支持,支持实时计算任务在失败后可以迁移到备集群上去。整体的方案是,支持多集群 checkpoint,savepoint,支持任务失败后,可以在备机房重启实时任务。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"三、百信银行实时计算平台与实时数据湖集成实践","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在介绍本内容之前,我们先来了解一些我行目前在数据湖的现状。目前的实时数据湖,我行依然采用主流的 Lambda 架构来构建数据仓库。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ec/ec27615276e48beb3af3b50bbaa9746c.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"1. Lambda","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Lambda 架构下,数仓的缺点:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同样的需求,开发和维护两套代码逻辑:批和流两套逻辑代码都需要开发和维护,并且需要维护合并的逻辑,且需同时上线;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"计算和存储资源占用多:同样的计算逻辑计算两次,整体资源占用会增多;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"数据具有二义性:两套计算逻辑,实时数据和批量数据经常对不上,准确性难以分辨;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"重用 Kafka 消息队列:Kafka 保留往往按照天或者月保留,不能全量保留数据,无法使用现有的 adhoc 查询引擎分析。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"2. Hudi","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"为了解决 Lambda 架构的痛点,我行准备了新一代的数据湖技术架构,同时我们也花大量的时间调研了现有的数据湖技术,最终选择 Hudi 作为我们的存储引擎。","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Update / Delete 记录:Hudi 使用细粒度的文件/记录级别索引,来支持 Update / Delete 记录,同时还提供写操作的事务保证,支持 ACID 语义。查询会处理最后一个提交的快照,并基于此输出结果;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"变更流:Hudi 对获取数据变更提供了流的支持,可以从给定的时间点获取给定表中已 updated / inserted / deleted 的所有记录的增量流,可以查询不同时间的状态数据;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"技术栈统一:可以兼容我们现有的 adhoc 查询引擎 presto,spark。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"社区更新迭代速度快:已经支持 Flink 两种不同方式的的读写操作,如 COW 和 MOR。","attrs":{}}]}]}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4a/4aad6e1d75f76421e02260ff2d44f886.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在新的架构中可以看到,我们将实时和批处理贴源层的数据全部写到 Hudi 存储中,并重新写入到新的数据湖层 datalake(Hive 的数据库)。出于历史的原因,为了兼容之前的数据仓库的模型,我们依然保留之前的 ODS 层,历史的数仓模型保持不变,只不过 ODS 贴源层的数据需要从 datalake 层获取。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/5e/5e523d3950bb0c95bbf219d7094c3592.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先,我们可以看到,对于新的表的入仓逻辑,我们通过实时计算平台使用 Flink 写入到 datalake 中(新的贴源层,Hudi 格式存储),数据分析师和数据科学家,可以直接使用 datalake 层的数据进行数据分析和机器学习建模。如果数据仓库的模型需要使用 datalake 的数据源,需要一层转换 ODS 的逻辑,这里的转换逻辑分为两种情况:","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1、第一种,对于增量模型,用户只需要将最新 datalake 的分区使用快照查询放到 ODS 中即可。2、第二种,对于全量模型,用户需要把 ODS 前一天的快照和 datalake 最新的快照查询的结果进行一次合并,形成最新的快照再放到 ODS 当前的分区中,以此类推。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们这么做的原因是,对于现有的数仓模型不用改造,只是把 ODS 的数据来源换成 datalake,时效性强。同时满足了数据分析和数据科学家准实时获取数据的诉求。","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外,对于原始的 ODS 存在的数据,我们开发了将 ODS 层的数据进行了一次初始化入 datalake 的脚本。1、如果 ODS 层数据每天是全量的快照,我们只将最新的一次快照数据初始化到 datalake 的相同分区,然后实时入 datalake 的链路接入;2、如果 ODS 层的数据是增量的,我们暂时不做初始化,只在 datalake 中重新建一个实时入湖的链路,然后每天做一次增量日切到 ODS 中。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最后,如果是一次性入湖的数据,我们使用批量入湖的工具导入到 datalake 中即可。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"整体湖仓转换的逻辑如图:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d7/d76167903ca41f93ed9b954a6df1a1bd.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"3. 技术挑战","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在我们调研的初期,Hudi 对 Flink 的支持不是很成熟,我们对 Spark - StrunctStreaming 做了大量的开发和测试。从我们 PoC 测试结果上看,1、如果使用无分区的 COW 写入的方式,在千万级写入量的时候会发现写入越来越慢;","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2、后来我们将无分区的改为增量分区的方式写入,速度提升了很多。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"之所以会产生这个问题,是因为 spark 在写入时会读取 basefile 文件索引,文件越大越多,读取文件索引就会越慢,因此会产生写入越来越慢的情况。","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同时,随着 Flink 对 hudi 支持越来越好,我们的目标是打算将 Hudi 入湖的功能集成到实时计算平台。因此,我们把实时计算平台对 Hudi 做了集成和测试,期间也遇到一些问题,典型的问题有:","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1、类冲突2、不能找到 class 文件3、rocksdb 冲突","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"为了解决这些不兼容的问题,我们将对 Hudi 的依赖,重新构造了一个独立的模块,这个工程只是把 Hudi 的依赖打包成一个 shade package。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4、当有依赖冲突时,我们会把 Flink 模块相关或者 Hudi 模块相关的冲突依赖 exclude 掉。 5、而如果有其他依赖包找不到的情况,我们会把相关的依赖通过 pom 文件引入进来。","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在使用 Hudi on Flink 的方案中,也遇到了相关的问题,比如,checkpoint 太大导致 checkpoint 时间过长而引起的失败。这个问题,我们设置状态的 TTL 时间,把全量 checkpoint 改为增量 checkpoint,且提高并行度来解决。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"COW 和 MOR 的选择。目前我们使用的 Hudi 表以 COW 居多,之所以选择 COW,1、第一是因为我们目前历史存量 ODS 的数据都是一次性导入到 datalake 数据表中,不存在写放大的情况。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2、另外一个原因是,COW 的工作流比较简单,不会涉及到 compaction 这样的额外操作。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果是新增的 datalake 数据,并且存在大量的 update,并且实时性要求较高的情况下,我们更多的选择 MOR 格式来写,尤其写 QPS 比较大的情况下,我们会采用异步 compaction 的操作,避免写放大。除了这种情况外,我们还是会更倾向以 COW 的格式来写。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"四、百信银行实时数据湖的未来","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在我行实时数据湖的架构中,我们的目标是将实时数仓的整个链路构建在Hudi之上,架构体系如图:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7d/7dc5493acf7bae75cd077b965590534e.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们整体的目标规划是替代 kafka,把 Hudi 作为中间存储,将数仓建设在 Hudi 之上,并以 Flink 作为流批一体计算引擎。这样做的好处有:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MQ 不再担任实时数据仓库存储的中间存储介质,而 Hudi 存储在 HDFS 上,可以存储海量数据集;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"实时数据仓库中间层可以使用 OLAP 分析引擎查询中间结果数据;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"真正意义上的批流一体,数据 T+1 延迟的问题得到解决;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"读时 Schema 不再需要严格定义 Schema 类型,支持 schema evolution;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持主键索引,数据查询效率数倍增加,并且支持 ACID 语义,保证数据不重复不丢失;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hudi 具有 Timeline 的功能,可以更多存储数据中间的状态数据,数据完备性更强。","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"五、总结","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文介绍了百信银行实时计算平台的建设情况,实时数据湖构建在 Hudi 上的方案和实践方法,以及实时计算平台集成 Hudi 和使用 Hudi 的方式。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在使用 Hudi 的过程中,也遇到一些问题,由衷感谢社区同学的帮助。特别感谢社区 Danny chan,leesf 解疑答惑。在实时数据湖架构体系下,构建我们实时数仓,流批一体方案还是在摸索中。","attrs":{}}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章