Database Kernel Talk (21): An Introduction to Stream Processing Systems

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Although this is a database internals blog, stream processing systems have become one of the mainstream kinds of data systems, they offer SQL-like interfaces, and there is now a trend toward unifying stream and batch processing (personally I think that trend still needs watching: the data sources differ and the applications served differ, so it seems hard to have it both ways with a single system). In this installment, let's talk about stream processing systems. The material comes from a paper Facebook published at "},{"type":"link","attrs":{"href":"https:\/\/sigmod.org\/","title":"xxx","type":null},"content":[{"type":"text","text":"SIGMOD"}]},{"type":"text","text":" in 2016, titled "},{"type":"link","attrs":{"href":"https:\/\/blog.acolyer.org\/2016\/07\/11\/realtime-data-processing-at-facebook\/","title":"xxx","type":null},"content":[{"type":"text","text":"Realtime Data Processing at Facebook"}]},{"type":"text","text":" (Facebook is now Meta)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, why do we need stream processing systems? Because some applications need low latency: real-time analytics such as performance metrics and error metrics, or recommendation systems that want to pick up certain real-time features in order to produce the best recommendations."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The paper starts by discussing five important properties of a stream processing (or real-time data processing) system:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) Ease of use: how do programmers declare the stream processing logic? Is SQL (or a SQL-like language) enough, or must the system support general purpose processing logic, for example letting programmers implement the logic in C++ or Java and hand it to the system to run (similar to map-reduce)? How long does the whole lifecycle take, from declaring to testing to deploying?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) Performance: performance generally means latency and throughput requirements. Is the latency at the millisecond, second, or minute level? How high does the throughput need to be?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) Fault tolerance: what level of crash recovery can the system support? For data processing, what service level agreement can it offer: at least once, at most once, or exactly once? If a task crashes, how is its in-memory state recovered, and so on."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4) Scalability: can the data processing be sharded or resharded to raise throughput? Can the system scale up and down dynamically (elasticity)?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5) Correctness: does the system offer database-like ACID guarantees? Can data be lost (this overlaps with fault tolerance above)?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Facebook's design decisions for its stream systems rest on this premise: a few seconds of latency with hundreds of GB\/s throughput. Under that premise, the different processing stages can be connected through a persistent message bus (Scribe, similar to Kafka) that transports the data. Decoupling data transport from data processing lets the overall system do better on the properties listed above."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Overview of Facebook's (Meta's) stream processing systems"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Facebook stream processing ecosystem offers three different systems. Let's go through them one by one, using the data flow diagram below."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/c0\/e0\/c0b1312cf38e952yy4c93260a61151e0.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data is produced on mobile clients or on (web) servers and is first recorded as logs in Scribe (the persistent message bus mentioned above). The stream systems Puma, Stylus, and Swift can read data from Scribe, process it, and write it back to Scribe. In this way, the three systems combined with Scribe can form a complex data processing DAG. Finally, the processed data is written through Scribe into three kinds of data stores: Laser, Scuba, and Hive."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Scribe"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scribe is a highly scalable, persistent (file system backed) "},{"type":"link","attrs":{"href":"https:\/\/www.cnblogs.com\/svenzhang9527\/p\/7354684.html","title":"xxx","type":null},"content":[{"type":"text","text":"message bus"}]},{"type":"text","text":" system, similar to the open source Kafka. Data is organized into categories (called topics in Kafka), and each category can be sharded into multiple buckets to raise throughput. A bucket is the basic unit of work for the stream processing systems. Scribe stores its data on HDFS, and retention is typically up to a few days."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Puma"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Puma offers a SQL-like syntax and supports extensible UDFs (user defined functions) written in Java. Puma's strength is a very fast development cycle (thanks to the SQL-like syntax): the whole lifecycle can be completed within hours. Puma handles simple SQL-like aggregations very efficiently. The paper gives a simple example that computes the top K events over a 5-minute sliding window. The simplified Puma code is shown below; even if you have never seen Puma syntax, it should not be hard to follow."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/22\/1b\/22efc757b0bef09a61c7ef8d05f09b1b.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Another strength of Puma is that for simple filtering logic, such as selecting only the relevant data, it can deliver second-level latency (the processed data can be written to another Scribe category right away). Unlike a traditional database, Puma chooses to favor long-running apps over ad-hoc analytics, so it can use code generation to produce optimized processing code."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Swift"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(A side note: before reading this paper I did not know this system existed, and after reading the overview I am still rather in the dark.) Swift offers only a very simple API: read from a Scribe stream, checkpointing every N strings or bytes, over and over. If the app crashes partway through, it can restart from the most recent checkpoint. Swift is typically used for very low throughput, stateless data processing."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Stylus"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Stylus is a general purpose stream processing system written in C++. Its API resembles those of other stream processing systems such as Storm, Samza, and MillWheel, and it supports both stateless and stateful stream processing. Because it is implemented in C++, Stylus not only supports all kinds of operations (including calling external systems for information) but also performs very well."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let's also take a quick look at the data stores. These systems can import data from Scribe but do not export back to Scribe; instead they serve data through their own APIs."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Laser"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Laser is a high throughput, low latency key-value store. It can import data through Scribe, after which the data can be accessed by other applications, including Stylus, Swift, and Puma."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Scuba"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scuba can be viewed as a high performance in-memory database that only supports single-table queries. It supports very low latency data ingestion, and the data can then be queried through a SQL-like language (limited to a single table) or through a UI, with queries also completing at millisecond-level latency. As a result, Scuba is widely used for all sorts of performance, monitoring, and debugging metrics."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Hive data warehouse"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I will skip the Hive data warehouse; everyone knows it."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With all the systems introduced, let's walk through a simple example. The paper gives the following one: find the hottest event topics in an event stream (by sorting event counts from high to low). The input event stream carries basic information about each event, such as event timestamp, event type, dimension_id (used to look up the related dimension information), and event text. The output is the top K events for each topic."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/bf\/32\/bfd61a5f3f9b40f88f9ef59737110f32.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) Filterer: filters out events that do not meet the criteria, and redistributes the event stream to the downstream Scribe category sharded by each event's dimension_id (so that downstream processing can run in parallel by dimension_id)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) Joiner: the Joiner fetches the dimension information for each dimension_id and calls a classification system to obtain the event topic. Because the upstream Scribe category is sharded by dimension_id, the Joiner can cache the corresponding dimension information to save network bandwidth (this is stateful processing). The processed data is sent to the downstream Scribe category as (event, topic) pairs."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) Scorer: the Scorer computes a score by collecting the event count of each topic within a sliding window. Because the score has to take both the long-term trend and the current count into account, the Scorer needs to keep the long-term trend as state. Its output goes downstream sharded by topic."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4) Ranker: finally, the Ranker computes the top K events of the current sliding window for each topic."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The paper notes that all of this logic can be implemented in Stylus. The Filterer and the Ranker, however, can be implemented more quickly in Puma."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Design decisions"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Now we get to the heart of the paper: it presents design decisions along five dimensions and discusses how they affect the five properties of a stream processing system introduced at the beginning (ease of use, performance, fault tolerance, scalability, correctness)."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Language paradigm"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The language paradigm affects ease of use and performance. The paper describes three broad categories. Declarative, like SQL, is probably the easiest to understand and pick up; its drawback is limited expressiveness. Functional models the whole application as a composition of functions (operators); it is not as easy to pick up as SQL but gives more control. Finally there is procedural, which directly exposes a language interface such as C++ or Java. Procedural gives the most control and can also go a long way toward ensuring performance; the drawback is a longer development cycle. Each of the three has pros and cons. Inside Facebook, Puma is declarative and Stylus is procedural."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Data transport"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Complex stream processing logic is usually expressed as a DAG. How data gets from one node to the next affects the fault tolerance, performance, and scalability of the whole stream pipeline, and to some extent ease of use (especially when debugging)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The paper again describes three categories:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) Direct message transfer: data is sent directly, for example over RPC or an in-memory message queue. The benefit is very low latency."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) Broker based: an intermediate broker decouples upstream from downstream. The broker adds performance overhead but improves scalability and makes it easy to scale out."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) Persistent storage based broker: like Scribe or Kafka. This approach is clearly the heaviest, but it brings all the benefits of a message bus: decoupling, scaling, publish\/subscribe fan-out, durable storage, and so on. Facebook uses this third category internally to improve fault tolerance, scalability, and ease of use."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Processing semantics"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Processing semantics determine correctness and fault tolerance. The paper describes three kinds of activities a processing node can perform:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) Update internal state: read an event, process it (for example by querying an external system), and then update the in-memory state;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) Generate output events: after processing an event, emit an output event downstream;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) Save state to an external system, such as a database: this can involve saving offsets and checkpoints for disaster recovery. A stateless node can only generate output events, while a stateful node may do all three."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As for the correctness of event processing: with at least once, a node should save its in-memory state first and then update the offset; with at most once, a node should save the offset first and then save its in-memory state; with exactly once (strong consistency), the two must be updated atomically, for example with a transaction mechanism. Among the systems covered, Puma chose at least once, while Scuba chose at most once. Scuba already samples its data, and its queries are best effort for the sake of efficiency, so a small amount of data loss is acceptable."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"State-saving mechanism"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How should a stateful processing node save its state? The paper lists the following options:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) Replicate to other nodes;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) Local database or file storage;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) Remote database or file storage;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4) Rely on upstream nodes to keep the data;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5) Global snapshot storage."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Among the systems covered, Stylus offers state storage in both a local database and a remote database. Local storage saves bandwidth and makes recovery from a process crash fast. Remote storage can cope with hardware-level machine failures (a new node has to be provisioned and the state loaded into it)."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Reprocessing\/backfill mechanism"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In some scenarios we need to reprocess old data. For example, a newly introduced stream processing logic needs to be tested against a stretch of past data, or a newly added metric requires rerunning the data to compute it. Handling old data requires one of the following mechanisms: 1) keep stream data long enough, for example by configuring a longer retention in Scribe; 2) let the stream processing system process data from the data warehouse (batch processing). At Facebook, Scribe retention usually cannot be very long, typically a few days. The stream systems therefore need to connect to the data warehouse, which is done by introducing a tailer. The backfill mechanism affects the system's ease of use, scalability, and correctness."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"To sum up: in this installment we introduced Facebook's internal stream processing ecosystem and discussed design decisions along five dimensions and how they affect the five key properties of a stream processing system (the figure below shows which properties each design dimension affects, for reference). Thanks for reading!"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/94\/68\/9467bd6e8e19c1a37dab16a542ab3868.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}