Reflections on Microservices: Observability

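{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Observability is a critical part of running microservices reliably. Operating in production without good observability tools and practices is like flying a plane at altitude with no instrument panel: you are flying blind, surrounded by uncertainty and unknown risk, and cannot detect, locate, mitigate, or fix failures in time.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The industry usually divides observability into three broad categories: Metrics, Tracing, and Logging. Metrics focus on collecting and observing technical indicators such as service QPS, response time, error rate, and resource utilization; Logging focuses on collecting, storing, and searching runtime logs; Tracing focuses on stitching together and following call chains, along with APM analysis.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/16/162f43e271919c6be9ca2169fca48b4d.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Data Flow","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Observing runtime data in real time requires a robust, stable data pipeline that continuously supports the collection, transport, storage, and use of that data, including monitoring metrics, log data, and trace data.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Data Collection","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data is usually collected by having the application send messages (for example via Kafka or RocketMQ) or call a REST API. This must be done asynchronously so that the main thread cannot be blocked in extreme cases. Logs can also be collected by an agent installed on the server, in which case the application only needs to write its logs to the agreed location on disk. Typical agents include Flume, Scribe, Logstash, and in-house tools.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Data Transport","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data transport generally covers platforms that move data, such as message middleware, as well as data processing platforms. Common stream-processing platforms include Storm, Spark 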
Streaming, and Flink.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In practical engineering experience, it is advisable to write all collected data into Kafka in the end. Kafka performs very well and offers decoupling, buffering, and standardization.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Note that the collection layer sometimes writes directly to the storage layer with no extra transport or processing. This generally suits scenarios with relatively small data volumes.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Data Storage","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data storage generally has to balance data volume against storage performance. Common options include time-series databases such as OpenTSDB, InfluxDB, Graphite, and Prometheus; pre-aggregation systems such as Druid and Kylin (which does not support real-time ingestion); the Hadoop family, such as HDFS, HBase, and Hive; and the Lucene family, such as Elasticsearch.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Each storage system has its own characteristics. Generally speaking, Prometheus is widely used for metrics monitoring, while most log storage uses Elasticsearch.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Data Applications","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Observability data is mainly used for search, analysis, and visualization. Common tools include OLAP engines such as Hive and Impala, and third-party analysis and visualization tools such as Grafana, Superset, and Kibana. Companies with sufficient engineering resources often build their own analysis tools as well.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/55/553c1976d072deff77784e7f18bccaff.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Monitoring","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Why Layering Is Needed","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":nul
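l,"origin":null},"content":[{"type":"text","text":"Many microservices are rushed into production without comprehensive, layered, multi-dimensional monitoring. They ship with only a handful of metrics, such as QPS and response time for API endpoints. Such one-dimensional coverage makes problems hard to discover and investigate: when something does go wrong, the failure is either not noticed in time or cannot be drilled into and located. Here is a real case:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"During the evening peak (around 21:00), a video playback service raised alerts on some of its APIs, the monitoring dashboard showed rising latency, and clients even began to time out. Beyond that, the engineers had no other information. They knew nothing about the endpoints' success rate, tail latency (TP90, TP99, and so on), the state of external dependencies, runtime metrics (thread pools, JVM, connection pools, and so on), or call-chain anomalies, and querying the business logs showed no obvious anomaly either. After a lengthy investigation, the cause turned out to be a failing third-party service that drove up the playback service's latency, and the availability of the whole service was affected for a long time.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A mature, multi-dimensional monitoring system that observes and detects problems from several angles can greatly shorten the time needed to handle a problem and make incident response more efficient. This section focuses on a layered monitoring system that has been proven in practice.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The layered monitoring system divides monitoring into basic monitoring, access-layer monitoring, middleware monitoring, application-layer monitoring, trace monitoring, and end-to-end monitoring. This layered design considerably improves observability: it monitors service QPS, average and tail response times, error rate, system load, and so on around the clock (7*24), and provides real-time alerting so that problems are discovered faster.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Basic Monitoring","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Basic monitoring generally means monitoring system metrics on individual hosts, such as CPU usage, memory usage, disk utilization, load average, and network status. It is usually implemented with open-source software such as Nagios, Zabbix, or Ganglia; companies with the resources build their own monitoring software to meet custom needs.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Middleware Monitoring","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Middleware monitoring broadly covers resources outside the application itself, typically storage, message middleware, and search. Examples include monitoring data volume, connection counts, QPS, primary-replica replication, and slow queries (with analysis) for MySQL, MongoDB, HBase, and Redis, and monitoring message production and consumption and message backlog for ActiveMQ, RocketMQ, and Kafka.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Access-Layer Monitoring","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This layer monitors the traffic entry point: request QPS, latency (average / percentile distribution / interval distribution), error rate, HTTP status codes, API business codes, and so on. These metrics are generally obtained by collecting and analyzing the access logs of the access layer (for example Nginx).","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Application Monitoring","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Application monitoring covers the metrics of the application service itself, such as QPS, latency (average / percentile distribution / interval distribution), error rate, resource saturation, inbound monitoring, and outbound monitoring. These are generally produced by metrics instrumentation inside the application.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Trace Monitoring","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Trace monitoring broadly covers a set of standard capabilities: topology analysis, error analysis, resource analysis, monitoring and alerting (QPS / latency as percentile or interval distribution / error rate), and trace search. It makes it possible to stitch a request together and analyze it, which substantially speeds up troubleshooting.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"End-to-End Monitoring","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"End-to-end monitoring, also called black-box monitoring, usually relies on many probe points set up across the country and even overseas, covering the major geographic regions (such as East, North, Central, and South China) and the major carriers (such as China Mobile, China Unicom, and China Telecom). The probe points send periodic requests to the target service from remote locations, validate the responses as needed (HTTP status code, latency, payload checks, and so on), and raise alerts promptly for failed probes.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"End-to-end monitoring differs from the other kinds of monitoring above in that it simulates the real public-network environment and can find failures the others cannot, such as heavy packet loss and latency on the network path from South China to the service's primary data center.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Intrusive or Non-Intrusive","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For code that involves metrics instrumentation, there is no blanket answer as to whether the intrusive or the non-intrusive approach is better. Intrusive instrumentation is not always a bad, tightly coupled, hard-to-extend design, and non-intrusive instrumentation is not necessarily elegant.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In most cases intrusive instrumentation is finer-grained: metrics can be woven in at the level of individual lines of code, with better performance and flexibility. Non-intrusive instrumentation mostly relies on runtime weaving (RTW) or agent technology, so it usually carries some performance overhead, and its granularity generally stops at the method level.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Can It Be Standardized?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origi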
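n":null},"content":[{"type":"text","text":"Consider how monitoring for a microservice typically goes live. Engineers identify the technical metrics that need monitoring and instrument the code; once the service is released, they configure the corresponding dashboards and the necessary alerts in the monitoring system.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This looks ideal, but over long-running project iterations some problems emerge:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"·       Instrumentation is inconsistent: every metric seems to be there, yet none is useful at the critical moment;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"·       There is a large amount of manual and repetitive instrumentation, which is inefficient and error-prone;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In view of this, it is worth abstracting and summarizing the common monitoring metrics into standardized monitoring. Common kinds of standardized monitoring are:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Inbound monitoring: monitors the service's QPS, latency (average / percentile distribution / interval distribution), error rate, HTTP status codes, and business codes, and supports automatic generation of monitoring dashboards.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Outbound monitoring: monitors the QPS, latency (average / percentile distribution / interval distribution), and error rate of third-party dependencies, automatically generates monitoring views and reports, and provides alerting that can notify the third party's technical owner.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Resource monitoring: monitors the application's runtime, such as JVM memory distribution, thread pools (core size, queue usage, and so on), and connection pools (maximum connections, minimum active connections, connection availability, and so on).","attrs":{}}]}]}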
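The inbound metrics listed above (QPS, latency as average / percentile distribution / interval distribution, error rate, HTTP status codes) can be illustrated with a small in-memory recorder. This is only a sketch under stated assumptions, not the article's implementation: the `InboundStats` class, its method names, the nearest-rank percentile rule, the default bucket bounds, and the choice to count HTTP 5xx responses as errors are all hypothetical. A real service would more likely use a metrics library and export to a backend such as Prometheus.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class InboundStats:
    """Hypothetical recorder for one API endpoint over one time window."""

    window_seconds: float
    latencies_ms: list = field(default_factory=list)
    status_codes: Counter = field(default_factory=Counter)

    def record(self, latency_ms: float, http_status: int) -> None:
        """Record one finished request."""
        self.latencies_ms.append(latency_ms)
        self.status_codes[http_status] += 1

    def qps(self) -> float:
        """Requests per second over the window."""
        return len(self.latencies_ms) / self.window_seconds

    def error_rate(self) -> float:
        """Share of requests with a 5xx status (an assumption of this sketch)."""
        total = len(self.latencies_ms)
        errors = sum(n for code, n in self.status_codes.items() if code >= 500)
        return errors / total if total else 0.0

    def average(self) -> float:
        """Mean latency in milliseconds."""
        if not self.latencies_ms:
            raise ValueError("no samples recorded")
        return sum(self.latencies_ms) / len(self.latencies_ms)

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile, e.g. p=99 for TP99."""
        if not self.latencies_ms:
            raise ValueError("no samples recorded")
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]

    def buckets(self, bounds_ms=(50, 100, 500)) -> dict:
        """Interval distribution: request count per latency bucket."""
        overflow = f">{bounds_ms[-1]}ms"
        counts = {f"<={b}ms": 0 for b in bounds_ms}
        counts[overflow] = 0
        for latency in self.latencies_ms:
            for b in bounds_ms:
                if latency <= b:
                    counts[f"<={b}ms"] += 1
                    break
            else:
                # Slower than every bound: falls into the overflow bucket.
                counts[overflow] += 1
        return counts


if __name__ == "__main__":
    stats = InboundStats(window_seconds=10)
    for i in range(1, 101):
        stats.record(latency_ms=i * 10, http_status=500 if i % 20 == 0 else 200)
    print(stats.qps(), stats.error_rate(), stats.percentile(99), stats.buckets())
```

Keeping raw samples makes the percentile exact but costs memory proportional to traffic; fixed buckets trade precision for constant memory, which is why both views appear side by side in the list above.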