SRE in Practice (01): Getting to Know SRE and Exploring How It Drives Technical Debt Remediation at Haodf Online (好大夫在線)

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Are you stuck in a miserable working state where product iteration keeps accelerating (schedule pressure, deadline demands) while production bugs\/incidents multiply and a flood of user complaints needs a response, so that you spend most of every day handling emergencies and firefighting instead of putting your main energy into regular project work?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Keeping a website highly available is an industry-wide pain point and one of the sources of engineers' anxiety. Everyone is actively trying to solve this kind of problem. Haodf Online (好大夫在線) has drawn on Google's SRE ideas, combined them with the experience of other companies in China and our own characteristics, worked to put SRE into practice, and made some progress."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next, let's explore SRE with these questions in mind:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How do we measure the stability of a service?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The average latency of a service API is just over 30 ms, so why can't single-machine QPS go any higher?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Which metrics should capacity planning, scaling, circuit breaking, and rate limiting mainly be based on?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We handle user complaints and suggestions all the time; how can we detect problems before users do?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Service availability gradually declines over time and production bugs keep appearing; how do we monitor service availability?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"al
ign":null,"origin":null},"content":[{"type":"text","text":"How does technical debt arise, and how does it drag engineers step by step into a hopeless situation?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Where should important-but-not-urgent technical debt go, and how does SRE push technical debt remediation forward?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Can we identify potential risks ahead of time and resolve them early so that services stay healthy?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"..."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In this series of articles we will take these questions, combine them with the real problems Haodf Online faces, and explore how SRE is put into practice and how SRE can change the way everyone thinks about their work."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"SRE 
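Fundamentals"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"SRE Responsibilities"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Is SRE a job title? A firefighter on call? Does it require becoming a full-stack engineer, or is it just someone on duty watching monitoring dashboards?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The main responsibility of SRE is to safeguard the availability of the website. It is a shared understanding backed by a set of methodologies, and a systematic way of thinking. From failure prevention, to failure handling, to post-incident review, it forms a closed loop. [Note 1]"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/93\/933d56bfc5189c33cb9bef8491842325.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the figure shows, SRE covers different stages, and the specific responsibilities differ at each one: log collection, analysis, risk identification, circuit breaking and rate limiting, alerting, and notification. Each part also depends on the others and involves different departments and roles. It is one integrated system that needs every direction to work together. We can provide some levers to get the whole system running and thereby safeguard overall availability."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"SRE Workflow"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1. Failure prevention:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As microservices are rolled out, business modules are split into separate, independent services. Request chains grow more complex and depend on more and more middleware. This brings many challenges to service governance and raises the bar for engineers' programming skills. On one hand, engineers need a stronger awareness of designing for failure; on the other hand, the framework's governance capabilities need to be strengthened. Examples include paying attention to the error codes of failed RPC requests, retrying on failure, tolerating intermediate data states, considering the consistency of distributed transactions, and applying sensible circuit-breaking and rate-limiting policies."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2. Failure detection:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This includes generic instrumentation logs from the framework (RPC chains\/middleware dependencies) and instrumentation logs on core business chains (order state transition events). Logs are collected into ClickHouse, different analysis rules turn them into the corresponding metrics, and alerts are then triggered through Prometheus and sent to the developers of the relevant business team."}]},{"type":"paragraph","attr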
s":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3. Failure handling:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Members acting in the SRE role (business developers or the system architecture group) receive the alert and use Grafana dashboards to locate the problem quickly; some operations can be completed on the governance platform. We use pre-configured screenshots as the interaction model, capturing common troubleshooting approaches in dashboard screenshots so that other on-call SREs can handle later failures more easily."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"4. Post-incident review:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Review promptly and capture the troubleshooting experience in the dashboards, so that the next time an alert fires the dashboard screenshots help locate the problem quickly. Common remediation actions should be integrated into the governance platform."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Challenges Facing SRE"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"How Do We Measure Service Availability?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The first approach looks at duration:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MTTR, MTTF, and MTBF are important metrics of system reliability [Note 2]"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MTTF (Mean Time To Failure): the average time the system runs without failure, taken as the mean of all intervals from when the system starts running normally until a failure occurs. MTTF = ∑T1 \/ N;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MTTR (Mean Time To Repair): the mean of the intervals from when a failure occurs until the repair is finished. MTTR = ∑(T2+T3) \/ 
N;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MTBF (Mean Time Between Failures): the mean of the intervals between the occurrence of two consecutive failures. MTBF = ∑(T2+T3+T1) \/ N"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Clearly: MTBF = MTTF + MTTR"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Measuring stability: "},{"type":"text","marks":[{"type":"strong"}],"text":"AO = MTTF \/ (MTTF + MTTR) = MTTF \/ MTBF"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ae\/ae8aa8041278affab9411049742d10e4.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The second approach looks at service quality, at the request level: success rate = successful requests \/ 
total requests"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This cannot be computed from a single request; instead, look at the distribution over a period of time, for example the share of 5xx responses: the 5xx share over a window reaches 5% and stays there for 10 min. This is usually expressed as a number of nines, such as three nines or four nines. [Note 3]"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/10\/106fa7ce231ce61c58e836e87f84a07d.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Setting a stability target generally requires weighing cost, business tolerance, and the current state of the business. In most cases a four-nines target costs far more than a three-nines target. Some low-level services need higher stability than ordinary supporting services. For example, the doctor service needs 99.99% stability, because any problem with it is likely to affect the whole site. For many historical reasons, a “big ball of mud” service accumulates a lot of technical debt. In that case, setting a reasonable target for that particular service is much better than applying a standard one."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Choosing a stability target involves both how to measure and how to judge, and includes three elements:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The metric, for example the 5xx ratio;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The target, for example the share of code=200 responses in total traffic (QPM) falling below 95%;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The duration of impact. QPM is itself an aggregated metric, which means a single occurrence cannot simply be counted. This matters especially when setting alert rules, for example: the service's Sentry errors exceed 10 for 5 consecutive minutes."}]
}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The metric here is the SLI, the target is the SLO, and where service quality commitments are concerned there is also the SLA."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The next question is how to choose suitable SLIs and set reasonable SLOs."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"The Five “Golden Metrics” "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We touched on choosing metrics above. There are many service-related metrics: should we care when CPU load is high, when the thread count is high, when QPS rises, and so on? The metrics should be few and sensibly chosen. For example, if CPU load is high but the service still serves requests reliably, the service can still be considered healthy."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Following Google's Site Reliability Engineering book (《SRE:Google運維解密》) and the course by Zhao Cheng (趙誠), we generally refer to the following five “golden metrics”, which are commonly used to define policies for scaling, circuit breaking and rate limiting, and service degradation:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1. Capacity"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Mainly service QPS\/QPM; core-chain QPS\/QPM; single-machine QPS; the minimum number of live service instances; and resource utilization such as CPU load."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2. Availability"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Mainly whether core chains are abnormal; the number of Sentry events per minute for the service; and the server-side share of 6xx\/5xx\/429\/430 (rate-limited) responses."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3. Latency"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because of the long-tail effect, average latency does not necessarily reflect the real situation; the 95th\/99th percentile of the latency distribution (T95\/99) is generally used."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"4. Error rate"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Mainly for user-facing chains and ingress gateway traffic, such as the share of 5xx\/4xx among requests at nginx\/kong."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"5. Number of manual interventions"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Mainly about the robustness of the program: support automatic failover; support idempotency; support retry on failure (taking care to avoid cascading failures); and reduce the number of manual interventions."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Some examples are given below:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ff\/ff4380b372f190e3ebd8af490ae2621d.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"How Do We Detect Problems in Time?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once the metrics are chosen, the next steps are collecting data and distilling it into the desired metrics, building monitoring dashboards, setting alert rules, and triggering alerts. These components have developed rapidly in recent years, and the surrounding ecosystem is very rich. We too have gone through an evolution from manual -> tool-based -> systematic -> 
platform-based."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"SRE Ecosystem and Toolset"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b6\/b698ad0ab0e124c7794f3796344387a0.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The SRE toolset starts from collecting data sources, analyzes them to produce monitoring metrics, evaluates those metrics against thresholds to trigger alerts, and then responds promptly. It consists of the following six parts:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1. Data collection"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tools in the Java ecosystem have a relatively high performance overhead, so we chose Fluent-bit and gohangout, and publish the collected data to Kafka."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2. Data analysis"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There are mainly streaming analysis and offline analysis, and we combine the two. Chain analysis is based on TracerLog, which we developed in-house. We analyze the dependency topology of call chains to find common risk points such as loop calls, slow interfaces, slow SQL, and bidirectional dependencies."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3. Data storage"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The monitoring system uses Prometheus. Because Prometheus does not natively support distributed storage, we use 
ClickHouse as remote storage. For data with long retention we use a sparse storage mode and make heavy use of materialized views to aggregate data."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"4. Monitoring profiles"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Log queries are mainly based on the ELK stack, while Grafana is used to display aggregated metrics after analysis; over time this has also grown into a dashboard culture within the company."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"5. Alert notification"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Building on Alertmanager's webhook mode, we developed phone\/SMS\/WeCom notifications so that the whole handling process can be done on mobile."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"6. Alert response"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To turn day-to-day handling into a platform capability and cut operations costs, we launched a PaaS cloud platform that assists developers with routine operations."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"SRE's Entry Point: Driving Technical Debt Remediation"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"How does technical debt arise?"},{"type":"text","text":" 
Technical debt generally falls into three categories:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Accumulation of bad code;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Business modeling;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Business architecture design."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Accumulation of Bad Code "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This kind of technical debt comes to mind easily. As a service keeps iterating it becomes more complex; changing requirements, the introduction of bad code, and unsound modeling inevitably bring technical debt. The more technical debt there is, the less stable the service; the easier it becomes to write bad code and the more technical debt accumulates, so stability gets worse and worse, forming a positive feedback loop."}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/68\/68800bcec2b6d73f485921dba0c86b6a.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The main causes of this are:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. Engineers' overconfidence. They assume there will be time to fix things later and focus on finishing the current project's features. In most cases, however, nothing happens after the project ends, and nobody ever follows up on the places marked TODO."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. Engineers' over-reliance on search engines. Some engineers settle for “good enough” and practice “search-engine-driven programming”: when they hit a problem they Google it first, and ctrl+c & ctrl+v turns programming into manual labor. On the surface nothing seems wrong, but this is often where uncertainty hides. Leaving aside that the quality of such third-party code varies widely, it is also likely to violate the principle of minimal dependencies: often only one feature of a component is needed, yet the whole component is pulled in, and that component in turn depends on others. Worse still, nobody later remembers how these dependencies were introduced, nobody dares to modify them, and nobody is responsible for upgrading them, so they turn into time bombs (legacy code handed down through generations)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3. Misunderstanding agile development. Only the superficial “speed” remains: requirements are rough and change too fast, the team struggles to keep up, and no time is scheduled for subsequent refactoring (cleanup)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Murphy's law for programs:"}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A program keeps growing in size until fixing production issues has filled all of the limited coding time."}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Only when this behemoth keeps failing, and solving one hard problem introduces another, do people realize that the important problems never got any time, and by then engineers have been dragged into the mire."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Eisenhower matrix:"}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I have two kinds of problems, the urgent and the important. The urgent are not important, and the important are never urgent. -- Eisenhower 
[Note 4]"}]}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/74\/747c81c0c4895bc85667e61afd5d5cb8.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A mistake business departments and developers often share is promoting third-priority items to first priority. In other words, they fail to separate features that are truly urgent and important from features that are urgent but not important. This mistake leads to important things being neglected: important system architecture issues give way to unimportant system behavior features."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"An effective software development team tackles the difficulty head on and unabashedly squabbles with all the other stakeholders as equals. Remember, as a software developer, you are a stakeholder too. The maintainability of the software system is yours to safeguard; that is part of your role and an indispensable part of your duty. It is a big part of why the company hired you."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Business Modeling & Business Architecture Design"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In fact, the technical debt that is harder to resolve often comes from unsound business modeling and flaws in business architecture design. The main causes are:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1. Not understanding the SOLID design principles"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SRP: Single Responsibility Principle;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"OCP: Open-Closed Principle;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"LSP: Liskov Substitution Principle;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"par
agraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ISP: Interface Segregation Principle;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DIP: Dependency Inversion Principle."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We won't expand on this here; we'll discuss it another time."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2. Introducing unneeded design too early"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Although SOA and microservice architecture are mainstream today, when a new project is split into separate services from the very start, everyone has to pay a huge extra cost in effort just to cope with the sprawling service call flow. A good system architecture supports incremental evolution: let domain-specific logic settle first and split it out when the time is right, and as the business changes it may also need to support merging services back together. Do not lock in the choice of database or framework too early; leave more room to maneuver."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3. Poorly drawn boundaries"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Software architecture design is itself an art of drawing boundaries. Boundaries divide the software into elements and constrain the dependencies between the two sides of each boundary. For more on drawing boundaries, see [Note 5]."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"4. Unsound dependencies"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"High-level policies depending on low-level components, and so on; this can be addressed with the DIP design principle. Another case is unsound call dependencies between microservices, with high coupling."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Our Attempt: Identifying Technical Debt and Tracking Its Remediation"},{"type":"text","text":" "}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/59\/5978006b888d10201eb9987b2457b450.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Identify potential risks through chain analysis:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1. Slow interfaces:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Slowness seriously hurts the throughput of the whole service and ultimately shows up in user experience, causing customer churn. We do not use average latency here because interface latency follows a long-tail distribution: the higher the latency, the greater the harm, so interfaces with high latency are optimized first. The 99th percentile of interface latency (P99) is generally used as the measure. Our current goal is to optimize backend services to P99 < 100 ms and frontend services to P99 < 600 ms."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2. Abnormal interfaces:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The interface error rate reflects service availability in real time and is generally used as the indicator for circuit breaking and degradation. Depending on business tolerance, the availability target can be set to three nines or four nines."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3. Slow SQL:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the business iterates, unoptimized SQL can trigger an avalanche. In early 2019, search-engine crawlers fetching deep-pagination data caused a database avalanche that brought on a site-wide failure. We now analyze slow SQL, collect SQL fingerprints, and provide corresponding optimization suggestions."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"4. Stable dependencies:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dependencies must point in the direction of greater stability. Following the principle of minimal dependencies, avoid a dependency when you can and depend on as little as possible; high-level policies must not depend on low-level components. Of course, a service with no dependencies at all becomes an island and is meaningless. Microservices remain mainstream today, and services differ in how stable they need their dependencies to be; foundational services have higher requirements than front-facing services. One metric we can refer to is: "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Fan-in: incoming dependencies, the number of classes outside the component that depend on classes inside it;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Fan-out: outgoing dependencies, the number of classes inside the component that depend on classes outside it;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"I: instability, I = Fan-out \/ (Fan-in + Fan-out). The metric ranges over [0, 1]; I = 0 means the component is maximally stable and I = 1 means it is maximally unstable. [Note 6]"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Common risk points in service dependencies include:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Multiple dependency: within one call chain, a downstream service calls different interfaces of an upstream service multiple times;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Loop dependency: within one call chain, a downstream service calls the same interface of an upstream service multiple times;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Bidirectional dependency: within one call chain, two services are each other's upstream and downstream and call each other in a coupled way."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}
,"content":[{"type":"text","text":"If you are interested in this topic, see one of our earlier articles: "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s?__biz=MzUxOTg4NDAxMg==&mid=2247483825&idx=1&sn=2b7b77f81b7dd27c9e0b1180755ac990&chksm=f9f39c32ce841524bc934de8280757aa227d67b725f231a6194414fd342338409343339a382e&mpshare=1&scene=21&srcid=0923DcPSwO9xPKYPpfG9DsZk&sharer_sharetime=1632397680071&sharer_shareid=4a2b708e2f0fccaed635228731ef77a2&version=3.1.8.90238&platform=mac#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"線上系統優化祕笈(Ⅰ) -- 慢接口分析 (Secrets of Production System Optimization (I) -- Slow Interface Analysis)"}]},{"type":"text","text":". Later on we will also explore in depth how to identify and quantify technical debt, so stay tuned."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Summary"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article mainly covered the fundamentals of SRE and how SRE uses technical debt as an entry point for improving system health. Technical debt has always been a burden engineers carry, and dealing with it demands a great deal from them, including a readiness for a long struggle. The experience accumulated here is the very foundation of any architect's ambitions, and only persistence leads to transformation. There are many classic books on the subject: Clean Code (《代碼整潔之道》), Clean Architecture (《架構整潔之道》), and Refactoring: Improving the Design of Existing Code (《重構:改善既有代碼的設計》) are must-reads for engineers. If you are interested in SRE, you can read Site Reliability Engineering (《SRE Google運維解密》), 《SRE生存指南》, and Zhao Cheng's Geek Time (極客時間) course 《SRE實戰手冊》."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"References:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[Note 1] Zhao Cheng, 《SRE實戰手冊》, Lesson 01 | SRE Myths"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[Note 2] MTTR\/MTTF\/MTBF illustrated"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[Note 3] 
Zhao Cheng, 《SRE實戰手冊》, Lesson 02 | System Availability"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[Note 4] Clean Architecture (《架構整潔之道》), Chapter 2, the Eisenhower matrix"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[Note 5] Clean Architecture (《架構整潔之道》), Chapter 17, drawing boundaries"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[Note 6] Clean Architecture (《架構整潔之道》), Chapter 14, the Stable Dependencies Principle"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"About the author:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Fang Yong (方勇): senior engineer in the Infrastructure Department at Haodf, focused on stability and availability work for SRE, microservices, and middleware, and responsible overall for the design and construction of Haodf's service governance cloud platform."}]}]}]}]}
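As a supplementary sketch, not part of the original article, the formulas discussed above (MTTF, MTTR, MTBF, availability as MTTF / MTBF, and the instability metric I) can be worked through in a few lines of Python. The names `Incident`, `reliability`, `instability`, and `downtime_budget_minutes` are hypothetical, and the 30-day window is an assumption chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One failure cycle. Field names are illustrative, not from the article."""
    uptime_s: float   # T1: time running normally before the failure
    repair_s: float   # T2 + T3: time from the failure until repair is finished


def reliability(incidents):
    """Return (MTTF, MTTR, MTBF, availability) for a list of incidents."""
    n = len(incidents)
    if n == 0:
        raise ValueError("need at least one incident")
    mttf = sum(i.uptime_s for i in incidents) / n
    mttr = sum(i.repair_s for i in incidents) / n
    mtbf = mttf + mttr            # MTBF = MTTF + MTTR
    return mttf, mttr, mtbf, mttf / mtbf


def instability(fan_in, fan_out):
    """I = Fan-out / (Fan-in + Fan-out); 0 is most stable, 1 least stable."""
    total = fan_in + fan_out
    if total == 0:
        raise ValueError("component has no dependencies in either direction")
    return fan_out / total


def downtime_budget_minutes(slo, days=30):
    """Allowed downtime in minutes for an availability target over `days`."""
    return (1 - slo) * days * 24 * 60
```

Under these assumptions, a three-nines target leaves roughly 43 minutes of downtime in a 30-day window and a four-nines target roughly 4.3 minutes, which is one way to see why the stricter target costs so much more.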