Adobe基于Iceberg的数据湖性能提升实践

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文介绍了Adobe公司在使用Iceberg时遇到的小文件问题以及高并发写入的一致性问题。针对这两个问题,Adobe给出了有指导意义的解决方案。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"strong"}],"text":"本文最初发表于Adobe技术博客("},{"type":"link","attrs":{"href":"https:\/\/medium.com\/adobetech\/high-throughput-ingestion-with-iceberg-ccf7877a413f","title":"xxx","type":null},"content":[{"type":"text","text":"《High Throughput Ingestion with Iceberg》"}]},{"type":"text","marks":[{"type":"italic"},{"type":"strong"}],"text":"),经InfoQ翻译并分享。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"企业用户通过使用"},{"type":"link","attrs":{"href":"https:\/\/www.adobe.com\/experience-platform.html","title":null,"type":null},"content":[{"type":"text","text":"Adobe Experience Platform"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"对企业数据进行汇集和标准化处理,可以获得企业数据360度全方位的展示。我们之前发布了一篇名为《"},{"type":"link","attrs":{"href":"https:\/\/medium.com\/@jaeness\/88cf1950e866","title":null,"type":null},"content":[{"type":"text","text":"Iceberg at Adobe"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"》的博客,在这个博客中,介绍了我们在海量数据和一致性方面面临的挑战,以及系统向apache Iceberg迁移的需求。Adobe Experience Platform作为一个开放可拓展的平台,面对海量的数据,不仅可以支持低延时的"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"流式计算"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":",也可以高效的对数据进行"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"批处理"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"在使用Iceberg时,我们遇到了将小文件高频地发送到Adobe Experience Platform进行批处理的场景。该场景在大数据行业中被称为“小文件问题”,这是一个常见的典型案例。我们在尝试将这些小文件大规模地提交到Iceberg时,该问题很快就暴露出来,我们亟需一个解决方案。本文详细介绍了磁盘缓冲写入模式下,我们是如何通过关键的数据写入方案来解决这个问题的。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"数据的流式处理"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"数据的流式提取允许你将数据从客户端和服务器端实时地发送到Experience Platform。Adobe Experience Platform支持将获取的数据转化为流式数据,这些数据会持久化存储在数据湖中支持流式传输的数据集中。在从数据源获取数据时,可以对数据进行自动验证的配置操作,以确保数据来源真实可靠。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"负责数据提取的关键组件是Siphon Stream。它使用了Apache "},{"type":"link","attrs":{"href":"https:\/\/spark.apache.org\/docs\/latest\/structured-streaming-programming-guide.html","title":null,"type":null},"content":[{"type":"text","text":"Spark Structured Streaming"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"流处理框架,该框架是一个分布式可拓展的数据处理引擎,支持多种连接器以及对接多种类型的数据文件。鉴于Apache Spark有着丰富的消息处理语义,因此被Adobe Experience Platform选为默认的数据处理引擎。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"在生产环境下,一般架构下的数据流提取遇到了一些挑战,我们在其他文章中详细介绍了改进措施,具体点击"},{"type":"link","attrs":{"href":"https:\/\/docs.google.com\/document\/d\/1k5m_q35LtXfBto7q0FfTzbHmqZ_pZeTsdUiCbCX5Yhw\/edit#heading=h.5mpa15l58hub","title":null,"type":null},"content":[{"type":"text","text":"这里"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"可以阅读了解。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"数据的批处理"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"直接数据提取是批处理的一部分。批处理文件将逐批上传到暂存区中,然后进行处理,最后提交到主存储区中。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/9b\/9bae2fe1d2d8ff9f0544f1d673148e20.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"图一:直接数据提取架构"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"对于数据消费者而言,此过程的优点是数据的抽取速度更快,但是该方案的文件存储方式存在着一些缺点,比如会引起消费者的一些使用问题。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"这些缺点主要体现在高吞吐量场景下会引发两个我们熟知的技术问题:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"小文件问题"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"高并发写入导致资源竞争"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"系统的吞吐量"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"我们当前所讨论的高吞吐量是以某个数据集每天的文件数来衡量的。我们通常会在时序数据库中看到一些高频小文件,这些文件记录了点击流和其他事件的数据。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"通过观察时序数据,我们每天需要处理的数据量统计如下:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"总计文件抽取数量在"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"700,000"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"左右。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"总计数据抽取大小在"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"1.5 TB"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 左右。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"每个数据集支持的文件吞吐量约为"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"100,000。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"每个数据集支持的吞吐量大小约为 "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"250 GB"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/0c\/0c8da0614ecd445ca38be223420c1ea6.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"图二:每日数据吞吐量统计"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"小文件问题"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"小文件问题是分布式存储中已知的问题。对于HDFS,当存储多个小于"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"块大小"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"的文件时,就会出现此问题。 HDFS旨在处理以"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"大文件"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"形式存储的大量数据。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"一般情况下,HDFS中的每个文件,目录和块都表示为namenode机器内存中的一个对象,每个对象占用150个字节。因此,如果有1000万个小文件,每个文件占用一个块,那么总共将占用3GB内存。当前的硬件可以毫无压力地支持这个级别的内存。当然,如果有10亿个文件的话,硬件上可能就无法很好地进行支持了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"小文件进行读取检索时,通常会导致从一个数据节点到另外一个数据节点的大量查找和跳转,这些都是效率十分低下的数据访问模式。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Iceberg通过优化文件扫描并更加准确地定位需要加载的文件,实现读取效率的提升。Iceberg读取器会借助元数据,对分区和列存储数据进行修剪操作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"上面的解决方案在一定程度上会有所帮助,但是在诸如Adobe Experience Platform之类的高吞吐量数据抽取系统中,将会遇到在Iceberg表中"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"高并发写入"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"的问题。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"高并发写入问题"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Iceberg使用乐观锁来处理并发写入问题。当出现数据版本冲突时,Iceberg会自动重试以确保数据更新成功。 在这种情况下,Iceberg可能会进行多次重试,以较低的时延高效地进行数据操作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Iceberg使用快照在并发写入之间进行隔离,并且这些快照是在Iceberg表的每次提交(数据更新或更改)时进行创建。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Iceberg的初始测试报告显示每个数据集每分钟限制15次提交,而我们需要每个数据集每分钟至少处理30次以上。 每次提交均从先前的快照开始,并产生一个新的快照。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"进行这种限制的原因是,Iceberg会一直扫描先前快照的元数据文件,并在写入数据后生成新的快照。读取以前的快照然后创建新的快照,这个过程十分耗时,并且耗时会随着元数据增加而增加。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"我们需要解决Iceberg的这一局限性问题,并在不引入不可预测的数据延迟的情况下,满足我们对数据提取机制的吞吐量期望。 "}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"解决方案"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"我们提出与Iceberg集成的策略是在Adobe Experience Platform中缓冲我们写入的数据。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"缓冲写入"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"是一种批处理模式,可以满足我们的数据需求,因为它可以解决我们在Adobe Experience Platform这样的高吞吐量数据处理环境中出现的两个主要问题:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"HDFS(Hadoop分布式文件系统)中已知的小文件问题"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Iceberg表上的高并发写入问题"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"该解决方案意味着存在一个单独的服务,该服务提供缓冲点,负责确定何时以及如何打包数据,并将数据从该缓冲点迁移到数据湖。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"使用单独的服务的有以下优点:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"优化写入效率:减少写入次数的同时,增加写入的数据量。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"优化读取效率:读取器打开的文件数量进一步缩小。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"弹性伸缩:由于缓冲写入使用单独的按需作业,因此该服务可以根据业务需要进行弹性伸缩。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"架构介绍"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"下图说明了在Adobe Experience Platform中,从生产者到消费者的数据提取的缓冲写入解决方案。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e8\/e8a6398044c47f926741a708b9df80d2.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"图三: 缓冲写入解决方案架构图"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"该架构图按照数据处理阶段可以简要概括为三个方面:"}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"生产者"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":":在数据处理的第一阶段,客户端或外部解决方案生成的数据可以是事件、CRM数据或维度数据,通常这些数据都以文件形式存在。它们将会通过批量提取服务(Bulk Ingest)进行上传或推送。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"数据处理平台"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":":这是数据处理的第二阶段,在这里数据会被消费,并以Parquet文件写入到缓冲区中,在缓冲区,数据会停留一段时间。此缓冲区是数据提取过程的第一站。此时,Flux组件将被通知新数据已就绪,并对数据读取以进行合并。根据指定的规则和条件,Flux会处理一定数量和(或)一定时间的缓冲数据,然后将数据高效地写入用户可以访问的存储当中。数据跟踪器(Data Tracker)会依次负责生成高级元数据和指标数据。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"消费者"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":":在数据处理的最后阶段,客户可以使用Adobe Experience Platform SDK和API高效提取聚合后的数据,以训练他们的机器学习模型,或者运行SQL查询,绘制仪表看板和生成报告,丰富统一配置管理数据,对用户群体进行画像和细分等等 。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"数据平台包括以下组件:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Bulk Ingest"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":":批量提取模块,负责提取数据并将其保存到缓冲存储中"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Data Lake Buffer Zone"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":":数据湖缓冲区,缓冲数据的存储位置"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Pipeline Services"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":":数据管道服务,即Adobe Experience Platform的Kafka中间件,主要用于状态管理"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Flux"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":":Spark状态流应用程序,包含缓冲数据的逻辑处理"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Compute"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":":基于Azure Databricks构建的Adobe Experience Platform计算服务,该服务提供了对Apache Spark的支持,可以用于执行短生命周期的作业任务以及长期运行的应用程序"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Consolidation Worker"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":":数据合并计算单元,使用Iceberg语义处理后将缓冲区存储中的数据分类合并,然后将其写入主存储中"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Data Lake Main Storage"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":":数据湖主存储,可对用户提供数据存储和读取服务"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Data Tracker"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":":数据跟踪器模块,负责对高级元数据进行管理"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Catalog"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":":为用户提供高级元数据服务"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"数据流"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"数据平台收集到的文件数据大小不一,并且可能属于不同的数据分区。这些文件在到达数据处理平台时会进行缓冲,然后经过重新组合以写入最少数量的文件,同时生成最少数量的提交。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"为了了解其工作原理,接下来逐一介绍两种场景:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"无缓冲写入"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"和"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"缓冲后写入"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"如果不使用任何缓冲写入操作,那么任何新到达的文件都会被立即提取,不会进行任何优化。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/f8\/f865d2ed67b3ac0575ddcab5904741fa.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"图四:没有使用缓冲写入方案的数据流示意图"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"从图中可以看出,15个小文件意味着15次的数据提交。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"使用缓冲写入方案升级后,数据会一直等待适当的时间来被提取和优化,以使文件和提交的数量最少。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/29\/29e3112ae92ebff00b99a4ae1a909a40.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"图五:使用缓冲写入方案的数据流示意图"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"从上图可以看出,使用缓冲写入方案,同样的15个小文件,最终合并成了4个文件,并且最后只需要4次提交操作。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"数据合并计算单元"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"实现缓冲写入的主要组件是“数据合并计算单元(Consolidation Worker)”。这是一个生命周期很短的过程,当缓存了足够的数据并且必须将其写入主存储时会触发该过程。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"该计算单元以租户形式工作,并会尝试优化分区内的数据,但功能不仅仅局限于此,它还可以处理跨多个分区的文件数据。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Iceberg社区帮助我们克服了使用Consolidation Worker的两个障碍。首先,通过使用Apache Iceberg中已经存在的Write Audit Publish Flow(WAP)功能,在提交数据时会提供准确一次性语义保证。其次,它降低了由于并行覆写version-hint.txt文件而导致的提交错误。对于这些感兴趣的读者,可以阅读一下文章结尾列出的资源链接。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"写入审核发布流程 (WAP)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Iceberg具有“写入审核发布”的功能,该功能为数据存储提供分段提交支持,并保证提交后数据可用。WAP功能依赖于外部给定的ID(wapID),之后可以通过该ID检索到对应的分段提交。最重要的一个作用是它可以确保分段提交的唯一性——确保两段提交不会存在相同的ID。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/41\/4106ce5b3247e9a10a5745e521d6ae5b.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"图六:数据合并计算中的写入审核发布流程"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"WAP工作流程是在Consolidation Worker中实现的:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"通过提供的ID检查数据是否已经存在于Iceberg中,如果是,则只需更新高级元数据"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"检查数据是否是作为单独的提交缓存,如果是,则将其写入到表中以供用户使用"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"如果以上两种情况均不存在,那么就加载并使用WAP功能写入数据:暂存具有特定ID的数据,随后对数据进行选择写入"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"最后,更新高级元数据存储。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Iceberg的version-hint.txt文件改进"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Iceberg存在高吞吐量并发写入的问题是因为要确保当前正在使用的表是最新版本。在某些情况下,由于已知的非原子性操作,比如重命名或者移动文件操作,会导致提供Iceberg SDK当前版本的元文件(version-hint.txt)丢失。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"当Iceberg提交新版本时,它将通过调用HDFS CREATE(overwrite = true)API,用新的版本值(当前版本值的增量)替换version-hint.txt的当前内容。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"而ADLS(Azure DataLake Store)选择将此API实现为DELETE + CREATE操作。因此会存在这样一种可能, 在特定的时间内进行提交操作时,version-hint.txt文件可能会不存在。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"由于Iceberg读取器和写入器均依赖于版本提示文件version-hint,在解析并选择要加载适当的元数据版本文件时会引起数据不一致问题。 "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d0\/d0e4e127b4fdc45e85ba7df4972c92fe.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"图七: Iceberg默认的版本文件实现方案"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"该问题的解决方案依赖于ADLS文件系统API的一致性保证来持久化表版本,比如CREATE(overwrite = false)操作的原子性和列表目录的读写一致性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"因此,实现方案是使用目录文件来持久化保存版本信息,因此每个写入都将使用CREATE(overwrite = false)创建新的文件来标识新版本信息,而读取版本信息时必须列出版本文件目录,并选择此时目录中版本最高的那个值。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ea\/eabb411b184fd4dccb3d301df91e52b9.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"图八:Iceberg版本文件实现改进方案"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"该改进是Adobe解决一致性问题所采用的方法,我们承认Iceberg Community可能会以不同的方法来解决这个问题。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"结论"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"借助缓冲写入解决方案,系统可以远远以超过我们制定的目标吞吐量的负载正常运行。我们进行了一项基准测试,每小时将大约20万个小文件提取到单个Iceberg表中,最终测试非常成功。此外,我们还可以支持水平扩展以适应未来海量数据处理的需求。唯一的限制就是计算设置的有限的可用资源,这是一个非常值得研究的问题。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"在读取数据方面,由于平台已针对读取进行了优化,这完全可以满足我们的业务需求。当读取数据时,Iceberg本身会带来很多好处,例如矢量化读取,元数据过滤和数据过滤。通过借助其他工具来进一步的优化,还会带来运行时优势。我们将在下一个博客《Taking Query Optimizations to the Next Level with Iceberg》中进一步讨论这些好处。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"参考资料"}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/www.adobe.com\/experience-platform.html","title":null,"type":null},"content":[{"type":"text","text":"Adobe Experience Platform"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/docs.adobe.com\/content\/help\/en\/experience-platform\/ingestion\/home.html","title":null,"type":null},"content":[{"type":"text","text":"Adobe Experience Platform Data Ingestion Help"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/spark.apache.org\/docs\/latest\/structured-streaming-programming-guide.html","title":null,"type":null},"content":[{"type":"text","text":"Spark Structured Streaming"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/medium.com\/adobetech\/redesigning-siphon-stream-for-streamlined-data-processing-eccbcb2f3038","title":null,"type":null},"content":[{"type":"text","text":"Redesigning Siphon Stream for Streamlined Data Processing (Part 1)"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/medium.com\/adobetech\/part-2-redesigning-siphon-stream-for-streamlined-data-processing-in-adobe-experience-platform-1eca98fc85ad","title":null,"type":null},"content":[{"type":"text","text":"Redesigning Siphon Stream for Streamlined Data Processing in Adobe Experience Platform (Part 2)"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":6,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/medium.com\/adobetech\/creating-the-adobe-experience-platform-pipeline-with-kafka-4f1057a11ef","title":null,"type":null},"content":[{"type":"text","text":"Creating Adobe Experience Platform Pipeline with Kafka"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":7,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"["},{"type":"link","attrs":{"href":"https:\/\/www.mail-archive.com\/[email protected]\/msg00738.html","title":null,"type":null},"content":[{"type":"text","text":"DISCUSS] Write-audit-publish support"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":8,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/github.com\/apache\/iceberg\/issues\/1496","title":null,"type":null},"content":[{"type":"text","text":"Update version-hint.txt atomically (Github ticket discussion)"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":9,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/github.com\/apache\/iceberg\/pull\/1465","title":null,"type":null},"content":[{"type":"text","text":"Core: Enhance version-hint.txt recovery with file listing (Github Pull Request)"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"英文原文链接:"},{"type":"link","attrs":{"href":"https:\/\/medium.com\/adobetech\/high-throughput-ingestion-with-iceberg-ccf7877a413f","title":null,"type":null},"content":[{"type":"text","text":"High Throughput Ingestion with Iceberg"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章