Reflections on Microservices: Observability

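{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Observability is essential to running microservices reliably. Operating in production without good observability tools and practices is like flying a plane at altitude with no instrument panel: you are flying blind, surrounded by uncertainty and unknown risk, unable to detect, locate, route around, or fix failures in time.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The industry usually divides observability into three broad categories: Metrics, Tracing, and Logging. Metrics focus on collecting and observing technical indicators such as call QPS, response time, error rate, and resource utilization; Logging focuses on collecting, storing, and searching runtime logs; Tracing focuses on stitching together and following call chains and on APM analysis.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/16/162f43e271919c6be9ca2169fca48b4d.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Data Flow","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Observing runtime data in real time requires a robust, stable data pipeline that continuously supports the collection, transport, storage, and use of data, including metrics, logs, and trace data.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Data Collection","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Applications usually report data by sending messages (for example through Kafka or RocketMQ) or by calling a REST API. Reporting must be asynchronous so that the main thread cannot be blocked in extreme cases. Logs can also be collected by an agent installed on the server, in which case the application only needs to write its logs to the agreed location on disk. Typical agents include Flume, Scribe, Logstash, and in-house tools.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Data Transport","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The transport layer generally covers platforms that move data, such as message middleware, as well as data processing platforms. Common stream processing platforms include Storm, Spark Streaming, and Flink.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Based on practical engineering experience, it is advisable to write all collected data into Kafka in the end. Kafka offers excellent performance and brings the benefits of decoupling, buffering, and standardization.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Note that the collection layer sometimes writes directly to the storage layer with no separate transport or processing step. This generally suits scenarios with relatively small data volumes.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Data Storage","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Storage has to balance data volume against storage performance. Common choices include time-series databases such as OpenTSDB, InfluxDB, Graphite, and Prometheus; pre-aggregation systems such as Druid and Kylin (no real-time ingestion); the Hadoop family, such as HDFS, HBase, and Hive; and Lucene-based systems such as Elasticsearch.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Each storage system has its own characteristics. In general, Prometheus is the more common choice for metrics, while logs are mostly stored in Elasticsearch.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Data Applications","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Observability data is mainly used for search, analysis, and visualization. Common tools include OLAP engines such as Hive and Impala and third-party analysis and visualization tools such as Grafana, Superset, and Kibana. Companies with sufficient resources often build their own analysis tools as well.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/55/553c1976d072deff77784e7f18bccaff.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Monitoring","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Why Layer","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Many microservices are rushed into production without comprehensive, layered, multi-dimensional monitoring. They ship with only a handful of metrics, such as API QPS and response time, and such a narrow view makes problems hard to detect and diagnose. When something does go wrong, either the failure is not noticed in time or there is no way to drill down and locate the cause. The following is a real case:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"During the evening peak (around 21:00), a video playback service raised alerts on some APIs, dashboard latency climbed, and clients even began timing out. Beyond that, the engineers had nothing to go on: they had no data on API success rate, tail latency (TP90, TP99, and so on), external dependencies, runtime state (thread pools, JVM, connection pools, and so on), or call-chain anomalies, and searching the business logs showed nothing obviously wrong. After a long investigation, the cause turned out to be a failing third-party service that drove up the playback service's latency, and the service's availability was affected for an extended period.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A mature, multi-dimensional monitoring system that observes the system from several angles can greatly shorten the time needed to handle such problems and make incident response more efficient. This chapter introduces a layered monitoring system that has been validated in practice.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The layered system divides monitoring into basic monitoring, access-layer monitoring, middleware monitoring, application monitoring, trace monitoring, and end-to-end monitoring. Layering substantially improves observability: the service's QPS, average and tail response times, error rate, and system load can be monitored 24/7, with real-time alerting so that problems are discovered sooner.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Basic Monitoring","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Basic monitoring generally means monitoring system metrics on individual machines, such as CPU usage, memory usage, disk utilization, load average, and network status. It is usually implemented with open-source software such as Nagios, Zabbix, or Ganglia; companies with the resources build their own monitoring software to meet custom requirements.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Middleware Monitoring","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Middleware monitoring covers resources outside the application, typically storage, message middleware, and search. For MySQL, MongoDB, HBase, and Redis, this includes data volume, connection count, QPS, primary-replica replication, and slow-query monitoring and analysis. For ActiveMQ, RocketMQ, and Kafka, it includes message production and consumption and message backlog.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Access-Layer Monitoring","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Access-layer monitoring watches the traffic entry point: request QPS, latency (average / percentile distribution / bucket distribution), error rate, HTTP status codes, and API business codes. These are usually derived by collecting and analyzing the access logs of the access layer (for example Nginx).","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Application Monitoring","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Application monitoring covers the metrics of the application service itself, such as QPS, latency (average / percentile distribution / bucket distribution), error rate, resource saturation, inbound monitoring, and outbound monitoring. These are generally produced by metrics instrumentation inside the application.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Trace Monitoring","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Trace monitoring covers a set of standard capabilities: topology analysis, error analysis, resource analysis, monitoring and alerting (QPS, latency as percentile and bucket distributions, error rate), and trace search. It links the hops of a request together for analysis and considerably speeds up troubleshooting.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"End-to-End Monitoring","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"End-to-end monitoring, also called black-box monitoring, usually relies on many probe points set up across the country and even overseas, covering the major regions (such as East, North, Central, and South China) and the major carriers (such as China Mobile, China Unicom, and China Telecom). The probe points send periodic requests to probe the target service remotely, validate the responses (HTTP status code, latency, response body checks, and so on), and raise alerts promptly when a probe fails.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Unlike the other layers, end-to-end monitoring simulates real public-network conditions and can uncover failures the other layers cannot see, such as heavy packet loss and latency between South China and the service's primary data center.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Intrusive or Non-Intrusive","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When it comes to metrics instrumentation code, neither the intrusive nor the non-intrusive approach is categorically better. Intrusive instrumentation is not necessarily a poor, tightly coupled, hard-to-extend design, and non-intrusive instrumentation is not necessarily an elegant one.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In most cases intrusive instrumentation is finer-grained: metrics can be woven in at the level of individual lines of code, with better performance and flexibility. Non-intrusive instrumentation mostly relies on runtime weaving (RTW) or agent technology, which tends to carry some performance overhead, and its granularity usually stops at the method level.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Can It Be Standardized","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Consider how monitoring for a microservice usually goes live. Engineers identify the technical metrics to monitor and instrument the code; once the service is deployed, they configure the corresponding dashboards and the necessary alerts in the monitoring system.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This looks fine at first, but problems emerge over long-running project iterations:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"· Instrumentation is inconsistent: every metric seems to be present, yet none proves useful at the critical moment;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"· Large amounts of manual and repetitive instrumentation are inefficient and error-prone;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Given this, it is worth abstracting and consolidating the common metrics into standardized monitoring. Common forms of standardized monitoring are:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Inbound monitoring tracks the service's QPS, latency (average / percentile distribution / bucket distribution), error rate, HTTP status codes, and business codes, and supports automatic dashboard generation.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Outbound monitoring tracks the QPS, latency (average / percentile distribution / bucket distribution), and error rate of third-party dependencies, generates monitoring views and reports automatically, and can alert the technical owner of the third-party dependency.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Resource monitoring tracks the application's runtime, such as JVM memory distribution, thread pools (core size, queue occupancy, and so on), and connection pools (maximum connections, minimum active connections, connection availability, and so on).","attrs":{}}]}]}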
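The standardized inbound metrics listed above (QPS, average and tail latency, error rate) can be sketched as a small per-endpoint accumulator. This is only an illustration under stated assumptions, not the article's implementation: the `InboundStats` class, its field names, and the fixed reporting window are hypothetical, and percentiles are computed with the simple nearest-rank method rather than the histogram buckets a real metrics backend would typically use.

```python
import math
from dataclasses import dataclass, field


@dataclass
class InboundStats:
    """Hypothetical accumulator for one endpoint over one reporting window."""

    window_seconds: float                       # length of the reporting window
    latencies_ms: list = field(default_factory=list)
    errors: int = 0

    def record(self, latency_ms: float, ok: bool = True) -> None:
        """Record one finished request and whether it succeeded."""
        self.latencies_ms.append(latency_ms)
        if not ok:
            self.errors += 1

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile, e.g. p=99 for TP99; 0.0 when empty."""
        if not self.latencies_ms:
            return 0.0
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]

    def snapshot(self) -> dict:
        """Summarize the window as the standard inbound metrics."""
        n = len(self.latencies_ms)
        return {
            "qps": n / self.window_seconds,
            "avg_ms": sum(self.latencies_ms) / n if n else 0.0,
            "tp90_ms": self.percentile(90),
            "tp99_ms": self.percentile(99),
            "error_rate": self.errors / n if n else 0.0,
        }
```

A request handler would call `record(latency_ms, ok)` once per request, and a reporter would call `snapshot()` at the end of each window and ship the result to the metrics pipeline. Because every service emits the same five fields, dashboards and alerts can be generated from a template instead of being configured by hand, which is the point of standardizing.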