系统稳定性建设实践总结

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2020年,注定是个不平凡的一年。疫情的蔓延打乱了大家既定的原有的计划,同时也催生了一些在线业务办理能力的应用诉求,作为技术同学,需要在短时间内快速支持建设系统能力并保障其运行系统稳定性。恰逢年终月份,正好梳理总结下自己的系统稳定性建设经验和思考。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"开篇","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在开始介绍服务稳定性之前,我们先聊一下","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"SLA","attrs":{}},{"type":"text","text":"。","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"SLA","attrs":{}},{"type":"text","text":"(service-level agreement,即 服务级别协议)也称服务等级协议,经常被用来衡量服务稳定性指标。通常被称作“几个9”,9越多代表服务全年可用时间越长服务也就越可靠,即停机时间越短。通常作为服务提供商与受服务用户之间具体达成承诺的服务指标——质量、可用性,责任。","attrs":{}}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3个9,即99.9%,全年停机时间:365 * 24 * 60 *(1-99.9%)= 52.56min","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4个9,即99.99%,全年停机时间:365 * 24 * 60 *(1-99.99%)= 5.256min","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5个9,即99.999%,全年停机时间:365 * 24 * 60 *(1-99.999%)= 0.5256min","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在严苛的服务级别协议背后,其实是一些列规范要求来进行保障。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"一、系统稳定性建设是指什么?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"关于系统稳定性是指什么这一问题,相信好多开发同学都会有自己的理解和认知,但可能会存在是否理解片面或者是否标准的疑惑,那到底有什么判定标准和划分边界呢?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们不妨看下来自于维基百科的解释:","attrs":{}}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"稳定性是数学或工程上的用语,判别一系统在有界的输入是否也产生有界的输出。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"若是,称系统为稳定;若否,则称系统为不稳定。","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"简单理解,系统稳定性","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"本质上是系统的确定性应答","attrs":{}},{"type":"text","text":"。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"从另一个角度解释,服务稳定性建设就是如何保障系统能够满足SLA所要求的服务等级协议。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"二、为什么需要系统稳定性建设?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可以确定的一点,服务稳定性建设是非常必要的,不管是满足日常系统正常运行还是重大节庆活动的稳定有序运营。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们来看几个由于服务稳定性故障造成影响的案例:","attrs":{}}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1)2020年国庆前一天,受“2020年最难打车日”的需求影响,滴滴平台和嘀嗒平台相继出现宕机故障;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2)2018年亚马逊prime day:亚马逊会员日故障(顾客无法将商品添加到购物车结账),导致公司损失高达9900万美元。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3)2015年由于中国工商银行部分地区因计算机系统升级,造成柜面和电子渠道业务办理缓慢,甚至不能受理业务;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4)2012年12306铁路订票网站因机房空调系统故障,导致暂停互联网售票、退票、改签业务。","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/70/70d72b31bb80abe4d9b7135fe3881841.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"服务稳定性对于企业来说非常重要,不仅仅会对企业带来直接的经济损失,甚至会对行业、人们的生活造成非常严重的影响。所以说服务稳定性建设的意义非常重大。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"三、系统稳定性建设为什么难?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"关于稳定性以及如何提升稳定性指标,我们可以想到很多的优化项:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"eg. 加服务器、扩容、超时重试、服务降级、资源隔离&备份、代码逻辑优化、异步事件化...","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那系统稳定性建设的主要难点是什么呢?","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.1 面对的挑战比较大","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"流量未知","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"尤其对于一个新改革上线的新业务而言,系统稳定性建设主要是流量洪峰的是个未知数,由于没有经验可以参考,我不确定是百万级别还是千万级别,还是更高级别?","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"改动量大","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"往往这种系统稳定性建设需要考虑需求主要是短时间内支持XX能力的上线,这其中往往涉及系统层面从下到上的多处变更,包括底层数据结构调整、业务逻辑改造以及用户交互方式的优化等等。时间短,改动大,质量难以保证。","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"不确定性","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"软件工程往往被用来描述“研究用工程化方法构建和维护有效的、实用的和高质量的软件”。其包括软件建设的方方面面,凡事事无巨细,任何细微的疏忽都可能造成全盘故障问题,不确定性问题尤其严重。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.2 系统稳定性建设是一个系统性的大工程","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"多环节分工精细复杂,不容一点疏忽。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"从系统构成来看,可以区分为单服务系统稳定性和多服务集群稳定性。","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"单服务稳定性","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"主要包括:功能配置可控、缓存加速(利器) 、服务隔离(第三方)、场景异常兜底方案、服务监控与及时响应等等","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"集群稳定性","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"主要包括:合理的系统架构、优秀的集群部署、科学的熔断限流、压测机制、精细的监控体系等等","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"四、系统稳定性建设如何入手?","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.1 系统稳定性建设前提","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在提出系统稳定性建设解决方案之前,我们需要明确一下前提条件:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"业务熟悉","attrs":{}},{"type":"text","text":" 需要对业务全貌流程熟悉,具备较强的掌控力;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"架构明确","attrs":{}},{"type":"text","text":" 需要对系统技术架构熟知并具有一定的实操经验。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"只有这样,对业务、架构都具备掌控能力之后,才谈得上去做稳定性建设的拆解和优化,才有基本的保障。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.2 流程划分","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一般情况下,我们提到系统稳定性建设,更像将系统稳定性作为一个专项Topic来搞,从其运行流程来看,主要存在以下几个方面:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"前提","attrs":{}},{"type":"text","text":" 明确目标 (基准)","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"事前","attrs":{}},{"type":"text","text":" 请求链路优化、服务性能优化&压测、应急预案制定、故障演练","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"事中","attrs":{}},{"type":"text","text":" 故障监控、定位问题、故障止损、问题修复","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"事后","attrs":{}},{"type":"text","text":" 故障覆盘、整改优化、经验总结沉淀","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"服务稳定性建设其实是一个系统性的大工程,包括了方方面面。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1c/1cdd0211502cfccbecda60cce86dccb0.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"五、系统稳定性建设的关键动作","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"从上一Part工作拆解来看,稳定性建设囊括的点比较多,而且杂。更多情况下,我们会做服务稳定性专项,针对某些特定场景下的特定问题而梳理出对应的方案。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那我们可以以小见大,从单服务系统本身出发,提炼看看存在哪些稳定性建设的关键点。其实只有每个单服务环节都稳定可靠,那集群系统乃至整个工程系统的稳定性才有保障。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"假如系统面对突增的请求流量情况下,如何做好服务稳定性建设呢?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"稳定性建设关键动作","attrs":{}},{"type":"text","text":"拆分如下几类:","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5.1 削峰限流","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"例如,经典的秒杀场景,春节的火车票抢购、电商平台的双11秒杀等等,都是短时间上亿的用户涌入,瞬间流量巨大(高并发)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不管前期对服务器资源做了如何的扩容,都会存在一个处理上限,所以一定要进行必要的削峰限流策略,类似于城市早晚高峰错峰限行的解决方案。同样,秒杀场景也需要类似的解决方案。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那具体如何实现呢?","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"利用消息队列来削峰","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"消息队列来缓冲瞬时流量,把同步的直接调用转换成异步的间接推送,中间通过一个队列在一端承接瞬时的流量洪峰,在另一端平滑地将消息推送出去。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"消息队列就像“水库”一样,拦蓄上游的洪水,削减进入下游河道的洪峰流量,从而达到减免洪水灾害的目的。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4a/4a5aea3eecedc173e97aef50a7095599.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"利用挡板过滤无效请求","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"流量挡板过滤,主要是建立","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"一种验证机制过滤掉无效请求,保障核心服务避免受更多外界无效请求的影响","attrs":{}},{"type":"text","text":"。比较常用的方案就是“布隆过滤器”。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b2/b2f0df34957d7f6e180be8da1a54325a.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"产品策略的调整","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"产品策略调整是一种特别有效的手段,效果甚至会优于技术层面的改进优化。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"例如:利用排队策略,有效打散高并发请求;调整活动宣传时间分散点,避免同一时刻出现高并发请求…","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5.2 缓存加速","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"缓存是解决并发的利器,可以有效的提高系统的吞吐量。按照业务以及技术的纬度必要时可以增加多级缓存来保证其命中率。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"主要应用思路:在数据库与服务端之间利用 Redis 做缓存服务,减少请求直接冲击到数据库。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/06/06969177fb8341906dc2a8383514f5f8.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5.3 异步化处理","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"与异步对应的就是同步,即所有事情排队一件件的有序进行,等上件事情完成后才会去做下一件事情。有点像一根签子串起来的糖葫芦。需要实时处理并响应,一旦超过时间会结束会话,在该过程中调用方一直在等待响应方处理完成并返回。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"异步处理不用阻塞当前线程来等待处理完成,而是允许后续操作,直至其它线程将处理完成,并回调通知此线程。","attrs":{}}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"需要强调一点:异步是一种设计理念,异步操作不等于多线程,常见的消息中间件、发布订阅的广播模式等,都可以实现异步处理的方式。","attrs":{}}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"六、稳定性建设过程中的一些经验","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"6.1 做好压测","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"提前做好系统压测,做到心中有数,防患于未然,压力预估要切合实际,不要盲目过大。对于性能瓶颈点,尽量提前做好改","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"进优化或者重点关注布防","attrs":{}},{"type":"text","text":"。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"6.2 应急预案必备","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"应急预案一定要有","attrs":{}},{"type":"text","text":",研发人员往往比较自信,这是好事也是坏事,我们需要做最坏的打算。因为经验再丰富的工程师,也无法穷举未来可能发生的意外事件,而故障往往出现在预案之外的地方(墨菲定律)。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"6.3 完善监控体系","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"建立完善的监控、告警机制,避免我们成为瞎子聋子,保障报错及时感知。在监控点的设置上,主要原则是:","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"所有的依赖都是不可信的!","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"6.4 快速响应能力","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"类似于在行驶的飞机上换引擎,过程中无论发生什么样的故障,立即要动用一切力量“快速”止损。服务要有等级划分,保障抓大放小,保护核心服务原则,如确实存在不能快速定位问题时,可逐层降级。主要目标:","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"防止问题扩大,故障止损,快速恢复","attrs":{}},{"type":"text","text":"。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"总结","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"稳定性建设关键点","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"削峰限流","attrs":{}},{"type":"text","text":" 面对资源上限,做技术、业务层面的处理,达到流量削峰保障服务稳定性;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"缓存加速","attrs":{}},{"type":"text","text":" 利用缓存解决并发,有效提升系统的吞吐量,同时需注意避免热Key、大Key问题;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"异步化处理","attrs":{}},{"type":"text","text":"(同步->异步),有效提升响应效率,保障数据的最终一致性。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"技术服务于业务","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"技术还是要解决实际问题来落地","attrs":{}},{"type":"text","text":"。应用场景很关键,所有的优化工作不要单纯为了技术而技术,技术归根结底还是为应用场景和产业落地服务。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可以尝试将业务视角目标做为最终目标,通过一切技术手段来保障目标的达成,从而实现技术价值最大化。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"不拘泥于形式,灵活运用","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"稳定性方案需要视场景而灵活调整应用,切忌生搬硬套。在具体实现过程中,关键要把控主要行动路径,多条路径情况下选取投入产出比最高的那一条。推进一个行动路径:问题驱动(问题感知->问题分析->问题控制->问题解决)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Thanks for reading!","attrs":{}}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章