干货 | 实时数据聚合怎么破

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"实时数据分析一直是个热门话题,需要实时数据分析的场景也越来越多,如金融支付中的风控,基础运维中的监控告警,实时大盘之外,AI模型也需要消费更为实时的聚合结果来达到很好的预测效果。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"实时数据分析如果讲的更加具体些,基本上会牵涉到数据聚合分析。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"数据聚合分析在实时场景下,面临的新问题是什么,要解决的很好,大致有哪些方面的思路和框架可供使用,本文尝试做一下分析和厘清。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在实时数据分析场景下,最大的制约因素是时间,时间一变动,所要处理的源头数据会发生改变,处理的结果自然也会因此而不同。在此背景下,引申出来的三大子问题就是:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通过何种机制观察到变化的数据 "}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通过何种方式能最有效的处理变化数据,将结果并入到原先的聚合分析结果中"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"分析后的数据如何让使用方及时感知并获取"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可以说,"},{"type":"text","marks":[{"type":"strong"}],"text":"数据新鲜性和处理及时性"},{"type":"text","text":"是实时数据处理中的一对基本矛盾。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/7c\/7c4f2c7a2d59d363c82dcd3d05b54964.png","alt":"图片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外实时是一个相对的概念,在不同场景下对应的时延也差异很大,借用Uber给出的定义,大体来区分一下实时处理所能接受的时延范围。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/1a\/1a43575116ca0752abc36c20734399bb.png","alt":"图片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"一、数据新鲜性"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"为简单起见,把数据分成两大类,一类是关键的交易性数据,以存储在关系型数据库为主,另一类是日志型数据,以存储在日志型消息队列(如kafka)为主。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二类数据,消费端到感知到最新的变化数据,采用内嵌的pull机制,比较容易实现,同时日志类数据,绝大部分是append-only,不涉及到删改,无论是采用ClickHouse还是使用TimeScaleDB都可以达到很好的实时聚合效果,这里就不再赘述。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"针对第一类存储在数据库中的数据,要想实时感知到变化的数据(这里的变化包含有增\/删\/改三种操作类型),有两种打法。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"打法一:"},{"type":"text","text":"基于时间戳方式的数据同步,假设在表设计时,每张表中都有datachange_lasttime字段表示最近一次操作发生的时间,同步程序会定期扫描目标表,把datachange_lasttime不小于上次同步时间的数据拉出进行同步。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"这种处理方式的主要缺点是无法感知到数据删除操作,为了规避这个不足,可以采用逻辑删除的表设计方式。数据删除并不是采取物理删除,只是修改表示数据已经删除的列中的值标记为删除或无效。使用这种方法虽然让同步程序可以感知到删除操作,但额外的成本是让应用程序在删除和查询时,操作语句和逻辑都变得复杂,降低了数据库的可维护性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"打法一的变种是基于触发器方式,把变化过的数据推送给同步程序。这种方式的成本,一方面是需要设计实现触发器,另一方面是了降低了insert\/update\/delete操作的性能, 提升了时延,降低了吞吐量。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"打法二:"},{"type":"text","text":"基于CDC(Change Data Capture)的方式进行增量数据同步,这种方式对数据库设计的侵入性最小,性能影响也最低,同时可以获得丰富的开源组件支持,如Cannal对MySQL有很好支持,Debezium对PostgreSQL有支持。利用这些同步组件,把变化数据写入到Kafka,然后供后续实时数据分析进一步处理。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"二、数据关联"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"新鲜数据在获取到之后,第一步常见操作是进行数据补全(Data Enrichment), 数据补全自然涉及到多表之间的关联。这里有一个痛点,要关联的数据并不一定也会在增量数据中,如机票订单数据状态发生变化,要找到变化过订单涉及到的航段信息。由于订单信息和航段信息是两张不同的表维护,如果只是拿增量数据进行关联,那么有可能找不到航段信息。这是一个典型的实时数据和历史数据关联的例子。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"解决实时数据和历史数据关联一种非常容易想到的思路就是当实时数据到达的时候,去和数据库中的历史数据进行关联,这种做法一是加大了数据库的访问,导致数据库负担增加,另一方面是关联的时延会大大加长。为了让历史数据迅速可达,自然想到添加缓存,缓存的引入固然可以减少关联处理时延,但容易引起缓存数据和数据库中的数据不一致问题,另外缓存容量不易估算,成本增加。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有没有别的套路可以尝试?这个必须要有。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可以在数据库侧先把数据进行补全,利用行转列的方式,形成一张宽表,实现数据自完备,宽表的变化内容,利用CDC机制,让外界实时感知。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"三、计算及时性"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在解决好数据变化实时感知和数据完备两个问题之后,进入最关键一环,数据聚合分析。为了达到结果准确和处理及时之间的平衡,有两大解决方法:一为全量,一为增量。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.1 全量计算(1m
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章