Logi-KafkaManager's Road to Open Source: A One-Stop Platform for Kafka Cluster Metrics Monitoring and Operations Management

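{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Introduction","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From planning the open-source release in April 2019 to completing it on January 14, 2021, it took 22 months to finally reach this milestone. The journey was not easy: we had no front-end engineers, designers, or product managers, so we relied on interns, partners, and external resources. The Didi Kafka service team was reorganized several times and the platform went through three major internal versions, but in the end we overcame the difficulties and got it done! Since the release, the project has been widely recognized by community users: at the time of writing it has 1,140 stars, the DingTalk user group has passed 550 members, and the article ","attrs":{}},{"type":"link","attrs":{"href":"http://way.xiaojukeji.com/article-edit/%E6%BB%B4%E6%BB%B4%E5%BC%80%E6%BA%90Logi-KafkaManager%20%E4%B8%80%E7%AB%99%E5%BC%8FKafka%E7%9B%91%E6%8E%A7%E4%B8%8E%E7%AE%A1%E6%8E%A7%E5%B9%B3%E5%8F%B0","title":null},"content":[{"type":"text","text":"Didi Open-Sources Logi-KafkaManager: A One-Stop Kafka Monitoring and Management Platform","attrs":{}}]},{"type":"text","text":" has passed 10,000 unique visitors (UV).","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"About Logi-KafkaManager","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As Didi's big-data message queue, Kafka carries the production and consumption of trillions of messages every day, handles peak ingestion traffic of more than 100 GB/s, serves nearly a thousand Kafka users inside the company, and hosts dozens of Kafka clusters and tens of thousands of Kafka topics, with more than 300 brokers in a single cluster. After four years of refinement, we have built Didi's Kafka platform service system around Logi-KafkaManager, and its internal satisfaction score has reached 90.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Logi-KafkaManager is a shared, multi-tenant Kafka cloud platform built for Kafka users and Kafka operations engineers. It focuses on core scenarios such as Kafka resource requests, operations management, monitoring and alerting, and resource governance. Free trial address: http://117.51.150.133:8080/kafka 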
, account admin/admin. You are welcome to ","attrs":{}},{"type":"link","attrs":{"href":"https://github.com/didi/Logi-KafkaManager","title":null},"content":[{"type":"text","text":"Star","attrs":{}}]},{"type":"text","text":" the project.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Why We Built Logi-KafkaManager","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Didi runs dozens of Kafka clusters internally, with more than 450 nodes and more than 500 weekly unique users who need to create topics, request topics, view metrics, and so on. Operations engineers also carry out a large number of topic management, governance, and cluster operations tasks every day. We therefore needed a Kafka management platform to carry these needs. We surveyed comparable community products, but none met our needs well in terms of the completeness of monitoring metrics, operations management capabilities, or service-operation philosophy.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Logi-KafkaManager Feature Highlights","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. Product design: separation of concerns","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The open-source KafkaManager used in the industry is positioned as a monitoring tool for operations engineers. At Didi, we position our product as a tool-oriented platform for a fully managed Kafka service, and we divide its audience into Kafka users and Kafka operations engineers.","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kafka users: focus on topic-related operations, such as topic resource requests and expansion, topic metrics monitoring, topic consumption alerts, topic message sampling, and topic consumer offset reset.","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kafka operations engineers: focus on cluster-related operations, such as cluster monitoring, cluster installation, cluster upgrades, topic migration within a cluster, and cluster capacity planning.","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. Turning Kafka's runtime behavior into data","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As message middleware, Kafka's core capability is producing and consuming messages, and the questions users raise most often relate to exactly that. As the service provider, we need detailed visibility into how long a topic's produce and consume requests spend in each stage on the server side, so that we can quickly determine whether a problem lies on the server or on the client and, if it is on the server, which stage is responsible, as shown in the figure below.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b1/b1e47e47046d0d0c3668b8dd61c4570a.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Time spent waiting in the request queue (RequestQueueTimeMs)","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Broker local processing time (LocalTimeMs)","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Time spent waiting for remote completion (RemoteTimeMs)
","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Request throttle time (ThrottleTimeMs)","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Time spent waiting in the response queue (ResponseQueueTimeMs)","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Time to send the response back to the client (ResponseSendTimeMs)","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Total time from receiving the request to completion (TotalTimeMs)","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Presenting these server-side runtime metrics at topic granularity has significantly improved the efficiency with which we serve users, as shown in the figure below:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d8/d85d9778ef1562e3a835b5d37d0991cd.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3. Strong control for Kafka service assurance","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kafka has clients in many languages and versions, and the official project only has the capacity to maintain the Java SDK. Limited by staffing, Didi does not restrict client languages or versions; instead, we extended the server side so that it can strictly manage client metadata.","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We extended the server so that it reliably detects each client's connection address and protocol type, which makes it easier for the engine to observe and strictly control user behavior later on.","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"image","attrs":{"src":"https:/
/static001.geekbang.org/infoq/8d/8dd3de9526752261d6c215be797483bd.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We extended the Kafka server with security authentication. An account mechanism records application metadata, including owner, business, and permission information; controlled topic creation records metadata such as compression type, partition count, and quota. Together these give the server strong control over how clients produce and consume.","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/59/59bd94ec110ee01ef5b7cf0c175890ee.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4. Best practices distilled into expert services","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Over years of operating the Kafka service, we have accumulated many best practices for service assurance. Based on real application scenarios, we have so far built the following expert services, and we will continue to refine and improve them.","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Migration for unevenly distributed topics: the number of leaders may be uneven across brokers; leaders may be unevenly distributed across the disks of a single broker; and a single topic may be unevenly distributed across the disks of a broker. We need to detect these hot spots and recommend a migration plan to users.","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/8f/8f911176efeb3f2390082c05e87faafd.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Expansion hints for insufficient partitions: based on the traffic carried by each partition, the platform proactively recommends expansion according to the business scenario and the underlying hardware resources. The thresholds used in Didi's practice are 3 MB/s per partition for TPS scenarios and 10 messages/s per partition for IOPS scenarios.","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b4/b4647576c2a0bc2f948aaa803a4d2c64.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Decommissioning idle topics: for online topics that have had no traffic and no producer or consumer connections for a full month, the platform notifies users to release the resources proactively.","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/35/3547d39b5e675df76c47dacc0e9e4900.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Logi-KafkaManager Architecture","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From the very beginning of the design, we built the platform with open source in mind, following the principles of minimal dependencies, a layered architecture, API-exposed capabilities, and 100% compatibility with earlier open-source versions. The overall architecture is as follows:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/34/34398b223e42ced06d25099bf4d490ff.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","conten
t":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Resource layer: apart from ZooKeeper, the Kafka engine and Logi-KafkaManager depend only on MySQL, so dependencies are minimal and deployment is simple;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Engine layer: Didi's Kafka engine is currently at version 2.5. On top of it we developed features of our own, such as disk overload protection, separation of I/O thread pools, and optimized resource allocation at topic creation, while remaining fully compatible with the open-source community's Kafka 0.10.x releases;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Gateway layer: above the engine layer, Didi designed the kafkaGateway layer, which provides security control, topic rate limiting, service discovery, and degradation capabilities;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Service layer: building on kafkaGateway, Logi-KafkaManager offers a rich set of features, mainly topic management, cluster monitoring, and cluster management;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Platform layer: provides separate feature sets for regular users and operations users, moves as many high-frequency day-to-day operations onto the platform as possible to lower the cost of use, and exposes core capabilities as APIs so that users can integrate them with their own ecosystems.","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Closing Thoughts","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Open-sourcing the project is only the first step of a long march, and the product still needs continued refinement and investment. But we do good work without asking what the future holds. Thank you to everyone who has contributed to this project, especially our current teammates. The past year was far from easy, and the dream of open-source technology kept us closely united. With this article we pay tribute to Zhang Wensong, a pioneer of open source!","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Logi-KafkaManager is a small step toward the team's open-source dream in 2021 and an important part of the overall open-source plan for Didi's ","attrs":{}},{"type":"link","attrs":{"href":"https://mp.weixin.qq.com/s/-KQp-Qo3WKEOc9wIR2iFnw","title":null},"content":[{"type":"text","text":"Logi log service suite","attrs":{}}]},{"type":"text","text":". You are welcome to follow the Obsuite WeChat official account or join the Didi Logi user group on DingTalk, share your feedback on our products, and recommend them to engineers around you who might need them.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/bb/bbb815b09efbfffad2c4f038b581975d.png","alt":null,"title":"","style":[{"key":"width","value":"25%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/32/32b183eeaa34b1ecba279ef8ed920f35.png","alt":null,"title":"","style":[{"key":"width","value":"25%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
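The article describes using the broker-side stage timings (RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs, ThrottleTimeMs, ResponseQueueTimeMs, ResponseSendTimeMs) to decide whether a slow request is a server or a client problem and which stage is responsible. A minimal sketch of that reasoning might look like the following. The function name `diagnose`, the `client_observed_ms` input, and the 80% threshold are assumptions made for illustration; they are not taken from Logi-KafkaManager.

```python
# Illustrative sketch only: the stage names are the broker metrics named in the
# article; diagnose(), client_observed_ms and server_share are hypothetical.
STAGES = (
    "RequestQueueTimeMs",
    "LocalTimeMs",
    "RemoteTimeMs",
    "ThrottleTimeMs",
    "ResponseQueueTimeMs",
    "ResponseSendTimeMs",
)


def diagnose(stage_ms, client_observed_ms, server_share=0.8):
    """Return (side, slowest_stage) for one request sample.

    stage_ms maps stage names to milliseconds spent on the broker.
    client_observed_ms is the latency the client measured for the same request.
    If the broker stages explain at least server_share of the client latency,
    the sample is attributed to the server; otherwise to the client or network.
    """
    if client_observed_ms <= 0:
        raise ValueError("client_observed_ms must be positive")
    unknown = set(stage_ms) - set(STAGES)
    if unknown:
        raise ValueError(f"unknown stages: {sorted(unknown)}")

    # Missing stages count as zero so partial samples still work.
    timings = {name: float(stage_ms.get(name, 0.0)) for name in STAGES}
    server_total = sum(timings.values())
    slowest_stage = max(timings, key=timings.get)

    if server_total >= server_share * client_observed_ms:
        side = "server"
    else:
        side = "client or network"
    return side, slowest_stage


if __name__ == "__main__":
    sample = {"RequestQueueTimeMs": 2, "LocalTimeMs": 5, "RemoteTimeMs": 180,
              "ThrottleTimeMs": 0, "ResponseQueueTimeMs": 1, "ResponseSendTimeMs": 2}
    print(diagnose(sample, client_observed_ms=200))
```

In this sample the broker stages account for 190 ms of the 200 ms the client saw, so the request is attributed to the server and RemoteTimeMs is reported as the slowest stage. A real platform would aggregate such samples per topic rather than judge a single request.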