Data Warehouse Series | The Application and Implementation of Flink Windows

Introduction: This article is compiled from the Apache Flink live-broadcast series, shared by Zhang Jun, Apache Flink Contributor and head of big data platform R&D at OPPO. It covers three topics:

1. Overall approach and learning path
2. Application scenarios and programming model
3. Workflow and implementation mechanisms

Author | Zhang Jun (head of big data platform R&D, OPPO)
Compiled by | Zhu Shang (Flink community volunteer)
Proofread by | Zou Zhiye (Flink community volunteer)

Tips: follow the link below for more videos in the data warehouse live-broadcast series~

Data warehouse series live broadcasts:
https://ververica.cn/developers/flink-training-course-data-warehouse/

Overall Approach and Learning Path

(Figure: https://static001.geekbang.org/infoq/d6/d6f34297440e5c72a45a9997ace01836.png)

When we encounter a new technology, how should we learn it and put it to use? In my view, the learning path splits into two parts: application and implementation. Start from the application, then dig into the implementation.

The application side has three parts. First, understand the application scenarios — for windows, the typical situations in which they are used. Then move on to the programming interfaces, and finally the abstract concepts, because any framework's programming model is made up of its interfaces and its abstractions. Reading the documentation is usually enough to become familiar with the application side. With an initial grasp of these three parts, we can then read the code to learn the implementation.

The implementation side also has three stages. Start with the workflow, drilling down step by step from the API layer. Next comes the overall design pattern: a framework that manages to build a mature ecosystem usually has something distinctive in its design patterns that gives it good extensibility. Last come the data structures and algorithms: to process massive amounts of data at high performance, these must have strengths of their own and are worth studying in depth.

That is roughly the learning path. Understanding the implementation feeds back into the application: concepts that are confusing when first encountered become clear once you know how they are implemented.

Why Care About the Implementation

An example:

(Figure: https://static001.geekbang.org/infoq/36/36fd64e984186426fdcdcc1bc4963c6b.png)

Looking at this example, a few questions may come to mind:

- Why does the ReduceFunction not need to compute the aggregate value for each key itself?
- When the key cardinality is very large, how is the window computation for each key triggered efficiently?
- Where are the intermediate results of a window computation stored, and when are they cleaned up?
- How does window computation tolerate late data?

Once you understand the implementation and come back to the application side, everything falls into place.

Application Scenarios and Programming Model

Typical Architectures of a Real-Time Data Warehouse

(Figure: https://static001.geekbang.org/infoq/a6/a6231a5f944503286f311d01e931d858.png)

■ The first and simplest architecture: Kafka data in the ODS layer is processed by a Flink ETL job and written to Kafka in the DW layer, then aggregated by Flink and written into MySQL in the ADS layer for real-time reporting.

Drawback: MySQL can hold only a limited amount of data, so the aggregation time granularity cannot be too fine and the number of dimension combinations cannot be too large.

■ The second architecture introduces an OLAP engine on top of the first, and no longer uses Flink for aggregation; aggregation is instead done by Druid's rollup.

Drawback: Druid is a storage and query engine, not a compute engine. With very large volumes — tens or hundreds of billions of records per day — the ingestion pressure on Druid becomes severe.

■ The third architecture builds on the second: Flink performs the aggregation and writes to Kafka, with the results finally landing in Druid.

Drawback: when the window granularity is long, the results are output with a delay.

■ The fourth architecture builds on the third by combining Flink aggregation with Druid rollup: Flink does light aggregation and Druid does the rollup summarization. The benefit is that Druid can see Flink's aggregation results in real time.

Window Application Scenarios

(Figure: https://static001.geekbang.org/infoq/a9/a9f46651208fd9f04871aea366a7683d.png)

■ Aggregate statistics: read data from Kafka, aggregate it over 1-minute or 5-minute windows along various dimensions, and write the results to MySQL or Druid.

■ Record merging: merge records from multiple Kafka sources within a window and write the results to ES. For example, merging each user's behavior events reduces the volume of data written downstream and lowers the write pressure on ES.

■ Stream-to-stream join: a full join of two streams is extremely expensive, so a window-based join should be considered instead.

Window Abstractions

(Figure: https://static001.geekbang.org/infoq/c7/c7b9f09c4e8fbe1eb2160467708c53ab.png)

■ TimestampAssigner: the timestamp assigner. Under EventTime semantics, it tells the Flink framework which field of each element is the event time, to be used in the subsequent window computation.

■ KeySelector: the key selector, which tells the Flink framework which dimensions to aggregate by.

■ WindowAssigner: the window assigner, which determines which windows each piece of data is assigned to.

■ State: stores the elements inside the window; if an AggregateFunction is used, it stores the intermediate result of the incremental aggregation instead.

■ AggregateFunction (optional): the incremental aggregation function, used to compute the window incrementally and reduce the storage pressure on the window's State.

■ Trigger: the trigger, which determines when the window computation fires.

■ Evictor (optional): the evictor, which filters out data matching the eviction condition before (or after) the window function runs.

■ WindowFunction: the window function, which performs the computation over the data in the window.

■ Collector: the collector, which emits the window's computation results downstream.

The parts shown in red in the figure can all be customized; by combining custom implementations of these modules we can build advanced window applications. Flink also provides built-in implementations that can be used for simple applications.

Window Programming Interface

```
stream
  .assignTimestampsAndWatermarks(…)
```
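The abstractions listed above — assigner, incremental aggregation, watermark-driven trigger, state cleanup — can be illustrated with a small self-contained sketch. This is not Flink's API, just a plain-Python simulation of keyed tumbling event-time windows; the window size, lateness bound, and function names are all illustrative assumptions:

```python
# Simulation (not Flink's API) of keyed tumbling event-time windows
# with incremental aggregation and watermark-based triggering.
from collections import defaultdict

WINDOW_SIZE = 60_000  # assumed: 1-minute tumbling windows, in milliseconds


def window_start(ts, size=WINDOW_SIZE):
    # WindowAssigner: a tumbling assigner maps each timestamp to exactly
    # one window, aligned on the window size.
    return ts - (ts % size)


def process(events, max_out_of_orderness=5_000):
    """events: iterable of (key, timestamp_ms, value) in arrival order."""
    state = defaultdict(int)       # State: (key, window_start) -> running sum
    watermark = float("-inf")      # tracks event-time progress
    results = []
    for key, ts, value in events:
        start = window_start(ts)
        # AggregateFunction: fold each element into the running aggregate
        # instead of buffering every element in state.
        state[(key, start)] += value
        # TimestampAssigner + watermark generation: the watermark trails the
        # maximum seen event time by the allowed out-of-orderness.
        watermark = max(watermark, ts - max_out_of_orderness)
        # Trigger: fire a window once the watermark passes its end,
        # emitting the result and clearing its state.
        for (k, s) in list(state):
            if s + WINDOW_SIZE <= watermark:
                results.append((k, s, state.pop((k, s))))
    # End of stream: flush whatever windows remain open.
    for (k, s), v in sorted(state.items()):
        results.append((k, s, v))
    return results
```

For example, `process([("a", 1_000, 1), ("a", 2_000, 2), ("b", 61_000, 5), ("a", 70_000, 3)])` fires key `a`'s first window once the watermark passes 60 000 ms, then flushes the two still-open windows at end of stream. It also hints at the answers to the earlier questions: incremental aggregation is applied per element as it arrives (so the window function need not re-aggregate per key), the watermark triggers each key's window, and state is cleared when the window fires.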