System High Availability: Health Checks and Health Metrics

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1. Introduction","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As living standards keep rising, people pay more and more attention to their health. Many have had a physical examination, and most companies offer an annual check-up as a benefit, so the health check-up has become a household concept.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With the rapid growth of the internet, competition among similar, homogeneous products keeps intensifying, and user experience has become a key differentiator. Besides product design, technology also shapes user experience, mainly through service availability and response time. Improving both requires the right means, and health checking is an essential prerequisite for keeping services available and responsive.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What items, metrics, and methods does a health check involve? This article walks through them one by one.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2. What Is a Health Check","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A medical check-up examines a person by medical means in order to understand their state of health and to detect early signs of disease and hidden health risks. A system health check, by analogy, uses technical means to detect whether a set of objects such as networks, hosts, applications, and services is healthy and available.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7d/7d47db5bb2b527ba41f452cf1cc2c98c.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3. Why Health Checks Are Needed","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Internet products place high demands on user experience, yet technical problems often cause slow responses or outright unavailability. These problems interrupt business, reduce revenue, and can seriously damage the company's brand and reputation.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null
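,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Many factors can make a service unavailable or slow: server hardware may fail or a fiber cable may be cut; excessive request volume may drive database CPU load and disk I/O too high; or a colleague may have planted a landmine, so that a newly released feature hits an OOM the first time it runs...","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So how do we keep a system highly available? Some would say: just add redundant nodes to eliminate single points of failure. That is right; eliminating single points is a common technique for high availability. But it rests on an important prerequisite: the faulty node must first be detected, so that it can be removed or its traffic switched to healthy nodes.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"“Detecting the faulty node” is exactly the job of the system health check.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4. How to Perform Health Checks","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Before discussing how to perform health checks, we first need to be clear about what is being checked. The object can be a network connection, a small functional component, a process, a service cluster, or an entire data-center unit. So to achieve “high availability”, first determine at which level high availability is needed and which objects may be single points of failure; in other words, pin down the “object”.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So how are health checks performed? There are generally two modes: active and passive.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.1 Active Mode","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The checker takes the initiative and sends health check requests on a timer. The content or format of these requests is usually designed specifically for the purpose, and the checked object returns a response after a simple self-check. For example:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"check interval=3000 rise=2 fall=5 timeout=1000 type=http;\ncheck_http_send \"HEAD /check.do HTTP/1.0\\r\\n\\r\\n\";\ncheck_http_expect_alive http_2xx 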
http_3xx;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This configuration sends a check request to the backend web server's http://(ip:port)/check.do endpoint every 3000 milliseconds (interval=3000). If the number of consecutive failures reaches fall=5, the server is considered down; if the number of consecutive successes reaches rise=2, the server is considered up. A response counts as healthy only if its status code is 2xx or 3xx.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/90/906a4f3f98424f6c4e61b7ea00697806.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.2 Passive Mode","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Passive health checking does not use dedicated health check requests. Instead, it judges the health of the checked object from normal connection status or from the responses to business requests. For example, the passive health check configuration in the official open-source version of nginx:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"server 127.0.0.1:8080 max_fails=3 fail_timeout=30s;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Nginx judges health from its connection attempts: if 3 attempts fail within 30s, the backend web server is considered unavailable, and nginx stops sending it requests for the following 30s.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3e/3e12f5caabaf5ea12b6c44397ee278bf.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.3 
Eliminating Single Points of Failure","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As mentioned above, achieving high availability means eliminating single points of failure. The simplest and most direct approach is to add a standby service node: when periodic heartbeat health checks detect that the primary node is down, the standby node takes over the primary's work and clients switch their request traffic to the standby.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/34/34b9deb61788782c25f6941826dcf26e.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The primary and standby nodes check each other's health over a dedicated heartbeat link. Because of a network partition or similar causes, they may stop receiving each other's heartbeats. The standby then assumes the primary is down and the primary assumes the standby is down, although both nodes are in fact healthy. Clients can still reach both nodes, which leads to “dual writes”. The industry calls this phenomenon “split-brain”.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1b/1bdc9e80a5dad4a2ee90626bc6792a95.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Split-brain can cause disastrous data corruption and compromise business correctness. Introducing a third party to arbitrate can effectively prevent it.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.4 
Third-Party Arbitration","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Since the primary and standby cannot confirm each other's liveness, a third-party arbiter can make the decision when they disagree: it has the final say on which node is the primary. The arbiter is usually implemented with a highly available solution such as ZooKeeper.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f3/f3db0994783325e2afafe6f7d0a00a42.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"5. Health Check Examples","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5.1 Network Devices","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Keepalived is service software for keeping clusters highly available. Its function is similar to heartbeat, and it is used to prevent single points of failure. It is rarely used alone, however; it usually works together with load-balancing technologies such as LVS, HAProxy, or Nginx to make the cluster highly available.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d8/d83a446621ebb1ce333718be4ca9b83f.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Its health checking has two aspects. The first is health checking between Keepalived instances (through VRRP heartbeat advertisements), as shown in the figure below:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1c/1cdbf72fd5d6c16866c85a402f49657d.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The second is the health check between Keepalived and the local load-balancing component, configured as follows:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"vrrp_script check_nginx_running {\n script \"/usr/local/bin/check_running\" # script to execute\n interval 10 # seconds between script runs\n weight 
-10 # lower this node's VRRP priority by 10 while the script fails\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Here the application's health check is implemented by a custom script.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Keepalived instances check each other's health through the VRRP protocol. If the master server goes down, the backup server is elected as the new master through VRRP and takes over the virtual IP from the old master, achieving high availability.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"VRRP packets are encapsulated directly in IP packets, so the mechanism works for any upper-layer protocol. Network devices such as switches, routers, and firewalls commonly use VRRP to implement master/backup failover as well.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/95/958febaa823d0b40ffab3a760f975f96.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When a network device fails, the VRRP mechanism elects a new device to carry the traffic, keeping network communication reliable.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5.2 Network Connections","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/96/960ab72020bcbcf0a2f3e6d5607e5f6d.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Mobile devices reach the internet through NAT. Push notifications for a mobile app require a long-lived connection to the server, but most mobile network operators evict a connection's entry from the NAT table once it has carried no data for a while, which breaks the connection. To keep the connection “healthy” and usable, once it is established the app and the server can periodically send each other Ping 
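Pong heartbeat messages to keep the connection alive.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The above is an application-layer approach to connection health checking. The operating system also supports connection health checking at the lower network layer, namely Keepalive. After a connection has been idle for a while, TCP Keepalive sends an empty probe packet so that the TCP connection is not closed by the client or by intermediate network devices such as firewalls. On Linux, three parameters configure Keepalive: the idle time before the first probe (tcp_keepalive_time, in seconds), the interval between probes (tcp_keepalive_intvl, in seconds), and the number of unanswered probes after which the connection is considered dead (tcp_keepalive_probes):","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"net.ipv4.tcp_keepalive_time = 7200\nnet.ipv4.tcp_keepalive_intvl = 75\nnet.ipv4.tcp_keepalive_probes = 9","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5.3 Hosts and Processes","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Reachability between hosts can be tested with the Ping command, which uses the ICMP protocol and verifies network connectivity along the whole path from the client to the target host. Ping is typically used to test manually whether a host is up and whether the network is connected.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ICMP is a network-layer protocol and has nothing to do with any particular process, so Ping cannot tell whether a process exists. A process, however, has ports and process information, so its existence can be checked by connecting to its port with telnet or by using the ps command. A process may be killed when memory runs out or may exit abnormally for other reasons; a scheduled cron script can detect this and restart the process automatically. This approach is very effective at improving the availability of applications in old legacy projects that can only be deployed as a single instance.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5.4 Middleware - RocketMQ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/5d/5db2f310ed40669ec53d5df425d4519f.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"NameServer is the routing center of RocketMQ. It keeps track of the Producer cluster, the Broker cluster, and the 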
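Consumer cluster, including their service status and routing information. When a new Consumer joins the cluster, it reports its own information and also fetches the addresses, Topics, queues, and other information of every Broker, so it knows on which Broker and queue the messages of the Topic it consumes are stored.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Multiple NameServers can be deployed; they are independent and do not communicate with each other. Producer, Broker, and Consumer services must be given multiple NameServer addresses at startup, and each service registers its information with all of the specified NameServers at the same time, which provides high availability.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Each Broker node keeps a long-lived TCP connection to every NameServer and sends a heartbeat to each one every 30s to report that it is still alive. Each NameServer checks the most recent heartbeat time of every Broker every 10s. If a Broker has sent no heartbeat for more than 120s, the NameServer considers it down, closes the corresponding network channel, and removes it from the routing information.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5.5 Application Layer - Spring Boot Actuator","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A service instance or process reports its liveness to other services through periodic heartbeats, but a heartbeat alone is not enough to reflect its health. For example, the disk may be full so the service can no longer write data, yet it still answers heartbeats; the service may depend on Redis, and Redis may be unreachable, yet it still answers heartbeats; some features may depend on a distributed storage service that has become unavailable, yet the service still answers heartbeats. Clearly, determining whether a service instance is alive and “healthy” involves many aspects. Spring Boot Actuator addresses this problem fairly well: it reflects the health of the whole service, including the health of the subsystems it depends on.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spring Boot Actuator is a sub-project of Spring Boot. It exposes Endpoints that external applications can access and interact with. Actuator includes many features, such as health checks, auditing, and metrics collection, which help us monitor and manage Spring Boot applications. Health is one of these Endpoints: it provides basic health information about a Spring 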
Boot application, so that cloud services, k8s, and similar platforms can probe the application's health periodically and respond to abnormal conditions in time.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Suppose a microservice application uses resources such as MySQL, Amazon S3, Elasticsearch, and DynamoDB. Its health check result should then include the health of all these subsystems:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/5c/5c08ccde0979b4d2f25036381eaa4a35.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Actuator's health checks are implemented through the HealthIndicator interface, whose single abstract method, health(), returns a Health object.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"@FunctionalInterface\npublic interface HealthIndicator {\n \n /**\n * Return an indication of health.\n * @return the health\n */\n Health health();\n \n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A Health object has two fields, status and details. By default status takes one of four values: UNKNOWN, UP, DOWN, or OUT_OF_SERVICE, and users can define additional ones. details is a key-value structure in which users can return any data they choose.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"@JsonInclude(Include.NON_EMPTY)\npublic final class Health extends HealthComponent {\n \n private final Status status;\n \n private final Map<String, Object> details;\n \n ...\n 
\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Actuator ships with many commonly used HealthIndicators:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/07/072354a92ca9c972cd671fab7150f3cb.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Users can also define their own as needed, for example:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"@Override\npublic Health health() {\n int errorCode = check(); // perform some specific health check\n if (errorCode != 0) {\n return Health.down().withDetail(\"Error Code\", errorCode).build();\n }\n return Health.up().build();\n}\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By default the health endpoint is enabled and exposed. Querying http://localhost:8080/actuator/health returns the application's health status: {\"status\": \"UP\"}. This is an aggregated status; detailed health information can be turned on with the configuration property management.endpoint.health.show-details=always. A complete health check response including details looks like this:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/12/125894c73d6c8d6c43342252d63ff2c4.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The aggregated health status is produced by HealthAggregator (superseded by StatusAggregator in Spring Boot 2.2 and later) 
using the following algorithm: the health statuses of all subsystems are sorted in the order DOWN, OUT_OF_SERVICE, UP, UNKNOWN, and the first status in the sorted list is taken.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For example, if ehCache is UP, MySQL is UNKNOWN, and diskSpace is OUT_OF_SERVICE, the sorted order is OUT_OF_SERVICE, UP, UNKNOWN. The first one, OUT_OF_SERVICE, is taken, meaning the service is unavailable.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"6. Summary","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"High availability is a complex engineering problem made up of a series of sub-problems, and health checking and health metrics are only one of them. For business to continue without interruption and for systems to keep running, every node participating along the entire request path must be highly available, so that no single point of failure exists.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Health checks play a very important role both in detecting unhealthy or failed nodes promptly and raising alerts, and in failing fast or failing over in time when a node becomes unhealthy or fails, so that cascading failures are avoided.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Author: vivo Internet Server Team - Chen Jianbo","attrs":{}}]}],"attrs":{}}]}
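
Appendix: the rise/fall thresholds from the active-mode example in section 4.1 can be sketched as a small state tracker. This is only an illustrative sketch, not the code of nginx or of any health check module; the class name RiseFallTracker and its methods are hypothetical names chosen for this example. It assumes a probe result is simply success or failure, and that a node changes state only after `rise` consecutive successes or `fall` consecutive failures.

```java
/**
 * Hypothetical helper (not part of any library): tracks the UP/DOWN state
 * of one backend from consecutive probe results.
 */
final class RiseFallTracker {
    private final int rise;   // consecutive successes needed to go DOWN -> UP
    private final int fall;   // consecutive failures needed to go UP -> DOWN
    private boolean up;       // current state
    private int successes;    // current run of consecutive successes
    private int failures;     // current run of consecutive failures

    RiseFallTracker(int rise, int fall, boolean initiallyUp) {
        if (rise < 1 || fall < 1) {
            throw new IllegalArgumentException("rise and fall must be >= 1");
        }
        this.rise = rise;
        this.fall = fall;
        this.up = initiallyUp;
    }

    /** Records one probe result and returns the state after applying it. */
    boolean record(boolean probeSucceeded) {
        if (probeSucceeded) {
            successes++;
            failures = 0;                       // a success breaks the failure run
            if (!up && successes >= rise) {
                up = true;
            }
        } else {
            failures++;
            successes = 0;                      // a failure breaks the success run
            if (up && failures >= fall) {
                up = false;
            }
        }
        return up;
    }

    boolean isUp() {
        return up;
    }

    public static void main(String[] args) {
        // Same thresholds as the example configuration: rise=2, fall=5.
        RiseFallTracker tracker = new RiseFallTracker(2, 5, true);
        boolean[] probes = {false, false, false, false, false, true, true};
        for (boolean probe : probes) {
            boolean state = tracker.record(probe);
            System.out.println("probe=" + (probe ? "ok" : "fail")
                    + " -> " + (state ? "UP" : "DOWN"));
        }
    }
}
```

Requiring several consecutive results before changing state is what keeps a single lost probe from removing a healthy node, and a single lucky probe from re-admitting a broken one.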