保姆级教程:滴滴如何基于开源引擎,打造自主可控服务体系

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"一、分享背景"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在滴滴负责过LogAgent、Kafka、Flink、Elasticsearch、Clickhouse等开源大数据引擎服务体系建设工作,走过很多弯路,趟过很多坑,积累了一些实战经验;近年疫情肆虐,加速了企业数字化转型的步伐,与数十家互联网、金融、证券、教育企业进行了深度交流,大家对基于开源数据引擎建设自主、可控、安全的服务体系有强烈诉求,常见的困惑是如何基于开源引擎,结合企业特点与发展阶段,进行高ROI的服务体系建设。 "}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"二、建设实践"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"滴滴基于开源引擎搭建大数据基础设施,始于数据驱动业务运营与商业决策的BI需求,随着实时数据流量达到百MB\/S,存储达到PB级,开源数据引擎的服务运营会遇到各种各样的稳定性、易用性、运维友好性挑战。经历了四个阶段:引擎体验期、引擎发展期、引擎突破期、引擎治理期,不同阶段遇到的痛点问题与服务挑战各不相同:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"引擎体验期:伴随业务快速发展,引擎的选择、版本的选择,机型的选择、部署架构的设计,覆盘来看是早期稳定性工作的关键抓手。 "}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"引擎发展期:随着引擎用户规模的增长,日常运营过程中的问题答疑、最佳实践落地、线上问题诊断消耗团队60%+的精力,亟需大数据PaaS层建设,降低用户引擎技术学习、应用、运维门槛,提升用户自助服务能力。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"引擎突破期:随着集群规模的膨胀、业务场景的多元,势必触碰开源引擎的能力边界,亟需构建基于开源引擎的内部迭代机制,既需要与开源社区紧密协同,平滑版本升级,享受社区的技术红利,又需要在开源引擎的基础上进行BUG FIX与企业特性增强。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"引擎治理期:随着PaaS平台的构建,引擎版本的快速迭代,会衍生三大类问题:未区分SLA场景的混用、超出引擎能力边界的误用、没有成本意识的滥用。导致引擎服务口碑低、资源(机器+人力)ROI业务价值低,亟需基于元数据驱动的引擎治理体系建设。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1、引擎体验期"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"解决特定技术问题的开源引擎众多,比如消息队列有Kafka、Pulsar、RocketMQ等,技术选型对于服务的SLA、运维保障至关重要,严重影响幸福感与价值感。 "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1)引擎选择"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"要综合考虑社区的Star数、Contributor数,国内是否有PMC或Committer;应用的广泛度,有无众多大厂生产实践背书,线下Meetup是否频繁,线上问题答疑响应速度,线上最佳实践资料是否丰富,部署架构是否放精简等多种因素。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2)机型选择"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"业务从应用场景上可以分为IOPS场景与TPS场景,开源引擎可以分为CPU密集型、IO密集型、混合型。以分布式搜索引擎Elasticsearch为例,企业级搜索场景,随机IO频繁,对磁盘的IOPS能力要求高,SSD磁盘是刚需;CPU消耗需要根据查询复杂度、QPS来评估此业务场景对CPU的诉求;推荐的做法是模拟业务场景做BenchMark,摸底引擎在特定场景下的性能表现,为机型选择提供依据,下图是滴滴Elasticsearch引擎在做机型选择时的压测验证:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/6f\/6f2ea42f8e6130fef3d5bc971bfc2e7b.png","alt":"图片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基于引擎原理与业内最佳实践建议,结合应用场景与压测结果,根据当下可选机型做选择,理想状态是CPU、磁盘容量、网络IO、磁盘IO资源均衡使用,尽量让CPU成为瓶颈。另外随着引擎的不断优化,软硬件基础设施的发展,机器过保置换是常态,最佳机型选择是一个动态演进的过程,滴滴运维保障团队2020年进行了新一轮机型优化与场景的调优,Elasticsearch日志集群成本降低了一半,CPU峰值平均利用率达到了50%。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3)部署选择"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以Elasticsearch为例,最小高可用部署集群是3个节点,单节点承担Master Node、Client Node、 Data Node三种角色。一般3到5个节点集群规模影响不大,一旦到几十个节点,写入吞吐量达到百MB\/S,节点网络处理线程池、内存资源竞争突显,元数据处理与索引数据处理资源未隔离,会导致集群元信息出现同步性能或一致性问题,进而诱发集群不可用,所以达到了一定体量后,需要考虑分角色部署。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a7\/a7f94c64c679bd2e8955bfb0c78d751b.png","alt":"图片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"滴滴Elasticsearch分角色部署实践"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"用户来申请所需引擎服务,需要结合业务重要程度给出合理部署架构,常见的有单租户单集群VS多租户大集群两种模式。引擎体验阶段,对引擎的掌控力有限,线上服务稳定性要求高,延迟与抖动敏感,建议选择独立集群方案;线下场景在乎的是整体资源利用率,RT抖动不敏感,建议走大集群多租户的部署架构。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2、引擎发展期"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"随着引擎服务业务方的持续增长,集群个数(10+)与集群规模快速增长(100+)。一方面会遇到用户咨询与答疑量激增的问题,有引擎上手门槛高的原因,更多的是资源申请、Schema变更等高频用户操作未平台化赋能有关;另一方面会触碰到运维保障能力边界,指标体系不完备,问题诊断低效,缺乏降级、限流、安全、跨AZ高可用的服务体系支撑,运维人员疲于奔命 。这个阶段需要做好两个方面的工作,一方面亟需提升引擎用户、运维保障人员的工作效率,将高频的变更或操作平台化实现,另一方面需要补足开源引擎在运维友好性上的短板,构建引擎的指标体系,高可用体系,进行服务架构升级。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1)引擎PaaS平台建设"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"保姆式人肉支撑用户高频资源创建、Schema变更等操作,在引擎建设初期用户不多的情况下尚能应对。随着引擎服务用户的增多,引擎上手门槛高,最多有一个Quick Star的小白用户指南;官方用户手册,侧重功能性描述,缺乏结合业务场景,给出最佳实践的场景化指导的问题充分暴露, ALL IN ME的用户服务体系成为引擎后向前赋能业务的瓶颈。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一般开源引擎都有自己的Metric指标体系,庞杂且晦涩难懂,缺乏对引擎业务过程的深度理解;另外引擎都有自己的原子API能力,脚本化的实现了高频操作,操作过程不透明,缺乏Double Check的强制机制容易存在安全与归零风险。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"综上亟需一套PaaS云管系统,面对普通用户降低接入和使用成本,针对常见问题给出FAQ,针对业务场景沉淀最佳实践。针对运维保障人员,打造全托管的引擎服务,需要体系化的构建引擎的指标体系来提升问题定位的效率;需要平台化的实现集群的安装、部署、升级、扩缩容的自动化执行,提升集群变更效率与安全性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以滴滴Kafka PaaS云平台Logi-KafkaManager为例介绍平台建设理念,具体设计参见:"},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s?__biz=MzU1ODEzNjI2NA==&mid=2247527050&idx=1&sn=bafb07f4145d3125d427e3481a158436&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"滴滴开源Logi-KafkaManager 一站式Kafka监控与管控平台"}]},{"type":"text","text":",已开源https:\/\/github.com\/didi\/Logi-KafkaManager。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2)引擎服务架构升级"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"开源项目一般会开放其核心能力,定义好客户端与服务端的交互协议,周边生态对接、不同语言SDK版本的维护基本都交给开发者自己贡献,很多企业级特性比如租户定义,租户认证,租户的限流等都交给用户自己拓展实现。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在开源服务应用早期,往往是研发人员因业务需要引入,对引擎只有基本原理的了解,为了填补开源引擎在安全、限流、灾备、监控等方面的不足,往往都由业务\/中间件架构师选择自己熟悉语言对开源SDK做一层切面包装,对业务侵入性做到最低,自己维护SDK版本,统一推广与升级。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"随着业务的不断拓展,业务中台化和微服务化的落地,伴随着人员和组织的膨胀与熵增,SDK的能力增强和BugFix变更变得异常艰难,推动业务升级SDK,沟通与落地的成本极高,需要要做好压测、降低业务侵入、论证业务收益、配合业务发布节奏等各种工作,SDK收敛的周期以年为单位,SDK的引擎服模式已无法支撑业务快速发展的需要。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基于以上这些原因,业界普遍采用了经典Proxy代理架构,在服务端与客户端之间架起了桥梁,滴滴内部很多引擎都有对应的落地实践,比如Kafka-GateWay,ES-GateWay,DB-Proxy,Redis-Proxy等,下面以滴滴Elasticsearch服务为例,介绍Proxy架构建设实践。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在Elastic Proxy层承接企业特性的拓展如安全、限流、高可用,突破大数据引擎的集群规模扩展瓶颈,用户既能享受多租户共享集群、独享集群、独立集群不同SLA保障等级的服务模式,又不用关心底层物理集群与资源的细节,构建了一套全托管的服务形态。"}]}]}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/7c\/7cc01633c48ec8d3bc926d01284c16a7.png","alt":"图片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"跨大版本升级,不管是存储格式、通讯协议、数据模型、都有许多不兼容的点,可以依托于Proxy架构实现跨大版本平滑升级,以下是滴滴2019年从ES2.3跨版本升级6.6.1的架构实践,详情参见:"},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s?__biz=MzU1ODEzNjI2NA==&mid=2247487159&idx=1&sn=49c915854e9e32cfb02f20e6874fcc6e&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"滴滴ElasticSearch平台跨版本升级以及平台重构之路"}]}]}]}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ad\/ad8b18464c4226330dfacf1b5b153730.png","alt":"图片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"依托Proxy的平台架构真正做到了底层引擎服务与上层业务架构的解耦,为后续底层引擎技术创新、服务架构快速迭代打下了坚实的基础。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3、引擎突破期"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"随着业务的快速发展,实时峰值流量达到GB\/S,离线数据增量达TB\/天,不论是数据吞吐量还是集群规模都达到了开源引擎的能力边界。一方面低频故障场景比如磁盘故障、机器死机、Linux内核问题,在集群实例达到数百近千规模时频出;另一方面引擎在极端场景的BUG高概率被触发。都需要对引擎有深度的掌控力,能进行日常Bug Fix与内部版本的迭代,最终做到对引擎的自主与可控。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1)引擎原理深度掌控"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"随着引擎服务业务方的拓展,引擎的商业价值凸显,企业逐步搭建专门的引擎研发团队,如何展开对开源引擎的深度学习与改进,以及后续如何Follow社区的节奏,在享受社区技术迭代红利的同时实现企业增强特性的落地,结合在滴滴的实践经验,以Kafka为例抛出一些自己的理解。"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"快速熟悉开源代码,需要搭建本地调测环境,熟悉打包、编译、部署各个环节,跑通测试用例,学会基于测试用例进行功能调试,熟读官方用户与开发文档:https:\/\/kafka.apache.org"}]}]}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/14\/1430f34ae16795bd734772200dbe94aa.png","alt":"图片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"熟悉引擎的启动与日常运行日志,熟悉功能模块的主干流程与运行原理。"}]}]}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/18\/18bb392133bbfc76f83a53aac521b3d9.png","alt":"图片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"定期开展引擎原理分享与源码交流,参加社区Meetup,跟进社区Issue列表。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"线上问题覆盘:熟读源码有利于建立引擎整体的宏观认知,线上问题才是最好的学习场地。线上故障深追Root Cause及时覆盘,探寻引擎每个细节,是将知识点变成认知的高效途径。引擎同学看到问题应该像看到金子一样,眼睛发光,追根究底,才能提升引擎的掌控力。 "}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/25\/25d88c137492fd16235367ddd5c65353.png","alt":"图片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"展开混沌工程,梳理引擎各异常场景,结合异常运行日志与监控指标,掌握该场景引擎的运行原理,明确引擎能力边界,做好稳定性保障的预案。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/00\/003534d6a1775ddd1e1af836c65510bf.png","alt":"图片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2)引擎内部分支迭代"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"线上引擎Bug:一般有两种解法,一种是社区有对应的Issue\/Feature已解决,我们合并对应的Patch到本地维护的分支版本,进行Bug修复;另一种我们向社区提交Issue,提出解决方案,按照社区Patch提交流程进行测试与Review,最终合并到当前版本。滴滴每年累计向Apache社区包括Hadoop,Spark,Hive,Flink,HBase, Kafka,Elasticsearch,Submarine,Kylin开源项目贡献150+Patches。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"企业级特性研发:引擎在内部服务的时候,一方面需要跟企业的LDAP、安全体系、管控平台对接,另一方面,企业增强的引擎特性,实现方式比较本地化或者是极端小众的场景,社区不接收对应的Patch,我们需要维护自己的特性列表和拓展实现。在维护企业特性方面修改尽可能内聚,能够拓展引擎接口实现的尽量插件化的实现,方便后续版本升级与本地分支维护,以滴滴Elasticsearch为例。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d1\/d102e25fd95b36b9a36159912fa3263f.png","alt":"图片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"随着社区版本的发展,企业内部分支的演进,需要定期跟进社区版本,享受社区红利,升级方案的制定需要非常严谨,涉及的环节和评估的事项较多,以滴滴Elasticsearch 7.6版本为例进行说明。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/53\/53246acb42315a4f02d6e0d4ee2dfdae.png","alt":"图片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一般涉及版本特性梳理,新特性的体验,评估升级收益,评估版本的兼容性;内部特性反向合并到开源版本,做好功能测试、性能测试、兼容性测试;制定详细升级与回滚方案,集群的升级计划,评估用户的改动与影响面等工作。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4、引擎治理期"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"引擎治理的核心目标是向上透传业务价值,向下索要技术红利,核心是通过数据看清问题,找到ROI最高的切入点,同时配套推动业务改造的核心抓手,讲清引擎服务的长期价值,保障技术持续的投入和价值创造!"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1)透传业务价值"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"随着引擎服务业务方的增多,业务方应用场景持续动态演进,一方面我们可以通过资源利用率看清资源ROI情况,另一方通过用户对RT的敏感度,服务运行稳定性关注度,应用场景的重要程度,定期Review分级保障体系。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"核心业务,在运维保障的精力投入、核心硬件资源的倾斜程度,独立集群的服务模式,高可用架构的设计,不遗余力的守好稳定性底线,就是创造了最大的业务价值。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"非核心业务,通过业务价值分模型,配合组织建设抓手,进行红黑榜排名运营,鼓励“越用越好,越好越用“的正向增长飞轮;对用信用分高的客户,服务响应、资源申请、业务保障都给予更多的倾斜,最终保障整体资源ROI的最大化。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2)索要技术红利"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"服务运营通常都满足2-8原则,20%的业务特别核心,分级保障机制运营合理,对于服务运营方压力可控,在保障核心业务价值基本面的情况下,需要对剩下的80%的业务方,在资源、性能、成本、稳定性上,从平台层面进行统一治理与优化,结构化降低服务成本,提升服务质量,打造一个人人为我,我为人人的正向服务口碑体系建设,基于完善的指标体系,我们可以洞察出开源系统的软件瓶颈,根据ROI分阶段优化与创新,具体案例可以参考,我们在ES上的几个技术创新:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s?__biz=MzU1NDA4NjU2MA==&mid=2247500894&idx=2&sn=75fc7035e8cf387ff74d06da787188e0&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"滴滴离线索引快速构建FastIndex架构实践 "}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s?__biz=MzU1ODEzNjI2NA==&mid=2247496570&idx=1&sn=a11ff747ce40d3aa1ccc5667b5767f21&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"滴滴 ElasticSearch 千万级 TPS 写入性能翻倍技术剖析"}]}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"当业务的体量达到一定的规模,技术创新的商业价值得以体现,以上的优化每年都给公司节省近千万的成本,技术的价值得到了充分的体现。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"嘉宾介绍:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"张亮,"},{"type":"text","text":"滴滴云-商业数据 负责人。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2014年加入滴滴,负责过LogAgent、Kafka、ElasticSearch、OLAP的引擎建设工作,具有丰富的高并发、高吞吐场景的架构设计与研发经验,主持构建过任务调度系统、监控系统、日志服务、实时计算、同步中心等数据体系的平台设计与研发工作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文转载自:dbaplus社群(ID:dbaplus)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文链接:"},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/-h1CXJJkQ8bqHUs3pHxSsA","title":"xxx","type":null},"content":[{"type":"text","text":"保姆级教程:滴滴如何基于开源引擎,打造自主可控服务体系"}]}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章