Key Points and Case Studies in Building Cloud-Native Observability

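{"type":"doc","content":[
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Over the past few years, observability has come up again and again across the industry, and many people ask how it actually differs from traditional monitoring. Observability is not a new concept; it is an extension of traditional monitoring. Traditional monitoring mostly looks at a system from the outside, watching its behavior in order to map out the system's failure modes, and it does so mainly from an operations perspective."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Extending the concept from monitoring to observability means taking a white-box view of how the system runs internally, working from the inside out. It combines several observation methods, including the Metrics we have traditionally relied on, to support in-depth analysis and get at the root causes behind the system's running state."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/6a\/6a629267d86c8ffcb05a624e51ae947f.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From the user's point of view, traditional monitoring is operations-centric: it observes a handful of conventional metrics from the outside to infer what is happening inside. Cloud-native observability runs through the whole application and its entire development lifecycle, including development, testing, launch, deployment, and release. Across that lifecycle it relies not only on Metrics but also on system logs, business logs, and distributed tracing to monitor the system from every angle. In other words, it diagnoses from the inside out: what went wrong, why it happened, and how to recover. These are the core concerns of observability."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"The Cloud-Native Era Demands Greater Stability"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As containers and microservices have become mainstream, we have entered the cloud-native era. For traditional enterprises, a cloud-native transformation raises the bar for monitoring and stability in several ways:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/0e\/0e97c81d6e6d91fe91755978ae8b8df9.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"First, supporting rapid business iteration. "},{"type":"text","text":"For example, Alibaba runs roughly 8,000 to 9,000 applications internally and performs close to 4,000 application releases every day. Iteration this frequent places very high demands on system stability, observability, and operations, all of which have to be supported robustly by a range of means."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Second, complex call topologies. "},{"type":"text","text":"With the rise of microservices, splitting a large monolith brings real elasticity and convenience, but it also makes the application's call chains far more complex. Under the traditional approach, we could only rely on the experience of a few experts to investigate individual problems, which is inherently a bottleneck."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Third, an excellent user experience. "},{"type":"text","text":"Enterprises embracing cloud native have a strong appetite for digital transformation and expect a much better IT experience. Incident response has to be faster, the time from detection to recovery has to shrink, and handling time has to come down as well. That is a challenge."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Finally, efficient operations collaboration. "},{"type":"text","text":"Traditional ticket-based workflows can be inefficient, so how to improve coordination across the organization is another area we focus on."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Scenarios Covered by Cloud-Native Observability"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Cloud-native observability focuses on the following scenarios:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1. Application release and deployment. "},{"type":"text","text":"Specific scenarios such as this one call for monitoring and observation capabilities tailored to the scenario."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2. Panoramic monitoring. "},{"type":"text","text":"This one is easy to understand. From the application front end, from the user side to the application side, through middleware, and down to the underlying IaaS and infrastructure layers, every link in the end-to-end chain needs to be brought into the enterprise's monitoring system. This is something enterprises should commit to."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3. Intelligent alerting. "},{"type":"text","text":"Collecting good observations is far from enough; more advanced techniques are needed. Over the years Alibaba has brought artificial intelligence techniques into observability to lighten the operations workload, which is covered in detail later."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"4. Performance diagnosis. "},{"type":"text","text":"When a problem is found, or during performance stress testing, quick diagnosis depends on tools that surface the problematic call chain and on practices distilled from expert experience. This is where performance diagnosis needs strengthening. Together these scenarios span the whole application lifecycle. We also need to support all kinds of cloud environments, including public, private, and hybrid clouds. These are the core cloud-native scenarios."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Key Points for Building Cloud-Native Observability"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next are the key points we have distilled from our best practices. There are three: 1. Where the data comes from. 2. How to build an observability data model on top of it. 3. How to make good use of that model and analyze the data."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Where the Data Comes From"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ad\/ad0113c1791b5a3d6ebbfbab4609fa84.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Today's observability stack is already very rich. Everything end to end, from the business layer on top through applications, distributed systems, and middleware down to the underlying infrastructure, has to be brought into it. At the core are Alibaba Cloud's monitoring products, including CloudMonitor & Prometheus, ARMS, and SLS. Used together, they let users bring every observable point into one system: Metrics across all dimensions, Traces across all dimensions (including open-source-compatible and self-built tracing), and business, system, and component logs. This is the first step, and it mainly answers where the observability data comes from and whether coverage is complete."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"How to Build the Observability Data Model"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once the data is in hand, how should it be modeled? Operations teams in traditional enterprises tend to need a unified monitoring view, for example a 2D or 3D display, and one is shown here."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The bottom layer is IaaS, including containers and virtual machines. The next layer up is applications, which covers microservice applications and components such as distributed databases, distributed messaging, and caches. Above that is the application service layer. Every layer supports drill-down, so when you look at a problem node you can get to a very fine level of detail."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/10\/10e71603742cd4e519cbaae39b9a5d6e.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In container scenarios, where containers are now the standard for infrastructure, you can build a platform that fits your own needs quickly on top of the monitoring system. We monitor the key core data components, including application status, RT (response time), CPU, and messaging. Cache status can be displayed as well, along with RDS distributed databases, managed databases, MQ, and core databases. There are also many database diagnostic tools covering, for example, ES search databases and MQ messages. Together these make up a unified monitoring dashboard for operations staff that can be customized quickly and easily."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/07\/072de72baf350f00a2bc8af341563a5e.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Finally, integrated console-based troubleshooting. Once the data is being monitored, the next step is locating problems. Both inside Alibaba and in the products we offer externally, problems can now be located quickly through a graphical console: you no longer need to dig through code or log in to machines to read logs. Any problem can be traced through the full-chain view to the root node of the chain. Memory, CPU, JVM parameters, and thread parameters are all presented in the console as well. This reflects how tightly the observability system has been integrated."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"How to Analyze"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/57\/578075270259980f5411fa95f3d3f0d9.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Collecting the data and building the model is still far from enough. The observability data also has to be mined in depth, which involves two aspects:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"One is artificial intelligence. We have found, however, that AI alone sometimes does not work; it also needs guidance from expert experience, and that experience has to be captured. The mainstream industry approach to fault diagnosis uses algorithms to produce baselines that reveal problems, but diagnosing the root cause still comes down to the skill of the person investigating. What we want to surface is not just the anomaly itself but the end-to-end root-cause analysis and the key information that goes with it. This depends on two things: "},{"type":"text","marks":[{"type":"strong"}],"text":"human skill"},{"type":"text","text":" and "},{"type":"text","marks":[{"type":"strong"}],"text":"machine algorithms"},{"type":"text","text":"."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Take human skill first. Expert experience can help solve problems to a degree, but how to capture it is an open question. On the machine side, we use deterministic AI algorithms that address problems by running detection on metrics."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Our approach is to combine the two: apply AI algorithms, and guide them with accumulated expert experience. In practice we found that AI relied on by itself turns into 'artificial stupidity' in some scenarios, so the algorithms have to be guided by that accumulated expertise."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To bring the two together, Alibaba Cloud has built this expert experience into its products. ARMS, Alibaba Cloud's observability product, covers more than 50 failure scenarios, including application changes, sudden RT spikes, bursts of large requests, single-machine problems, and MySQL issues. The corresponding experience is codified so that these scenarios can be diagnosed quickly and automatically. Through extensive practice, expert experience has been captured in the console and is presented to users automatically."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"An indispensable part of this is anomaly prediction and detection on observability data, which is used heavily within the group. Anomaly detection inside Alibaba works at three levels. The first is the "},{"type":"text","marks":[{"type":"strong"}],"text":"server level"},{"type":"text","text":", where models are trained on the characteristics of server failures. The second is the "},{"type":"text","marks":[{"type":"strong"}],"text":"cluster level"},{"type":"text","text":", where models are trained on abnormal metrics and features of clusters. The last is the "},{"type":"text","marks":[{"type":"strong"}],"text":"application level"},{"type":"text","text":", which targets applications and logs and trains anomaly models on signals such as API egress."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/78\/785112d6f2fe273e2912dfcd542f6ff4.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It starts with "},{"type":"text","marks":[{"type":"strong"}],"text":"monitoring data"},{"type":"text","text":". After multi-metric monitoring data is collected, preprocessing removes irrelevant metrics and deduplicates correlated ones. We also use a normalization-based model here, which maps the difference between normal and abnormal behavior into a fixed range for analysis. After preprocessing comes "},{"type":"text","marks":[{"type":"strong"}],"text":"feature engineering"},{"type":"text","text":". We use single-metric anomaly scores as anomaly features, an area we have worked on a lot; the aim is to improve the accuracy of the anomaly and time-series features. After the features comes the "},{"type":"text","marks":[{"type":"strong"}],"text":"multi-metric model"},{"type":"text","text":", where Alibaba uses time-series forecasting to make the multi-metric model more accurate. Once a model is live, we keep feeding results back and correcting the trained model, forming a "},{"type":"text","marks":[{"type":"strong"}],"text":"closed loop of positive feedback"},{"type":"text","text":". This is the basic framework behind anomaly detection in the observability products. Alibaba Cloud will gradually open this up to users in the near future, so stay tuned."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Best Practices for Building Cloud-Native Observability"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/0d\/0d085a54128e3574f362c3bb897f1723.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scenario 1: panoramic observability for containers. Containers have become the standard at the IaaS layer. Observability in container scenarios includes full-chain, end-to-end "},{"type":"text","marks":[{"type":"strong"}],"text":"diverse data ingestion"},{"type":"text","text":", bringing in APM vendors, Prometheus, active probing, traffic monitoring, network monitoring, and more. Next is "},{"type":"text","marks":[{"type":"strong"}],"text":"comprehensive data visualization"},{"type":"text","text":": the observability model is presented in a way that is easy to understand and friendly to operations staff and customers, each layer can be shown as a 2D or 3D topology, and each layer can be drilled into step by step to help with root-cause analysis. Another feature is "},{"type":"text","marks":[{"type":"strong"}],"text":"fast problem discovery"},{"type":"text","text":": with expert experience and console-based diagnostics, any problem can be drilled into layer by layer until the lowest-level information is found, which helps resolve it."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/3e\/3e5f390f515c12a6388fb7d29ef24c92.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Scenario 2: intelligent diagnosis of complex call chains. "},{"type":"text","text":"This rests on two things. The first is expert experience: we have captured expert knowledge for more than 50 scenarios and use it for console-based root-cause analysis. The second is deterministic AI techniques for anomaly prediction and detection. The Alibaba Cloud observability system also integrates deeply with core Alibaba Cloud products, including ECS, EDAS, AHAS, and MSE, and then adds self-healing capabilities for components. The result is automated problem discovery, automated diagnosis, and automated recovery."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Scenario 3: best practices for AI in anomaly detection"},{"type":"text","text":". There are two main cases:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/12\/12f6f8d185161dfa008b91b9c14d583c.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The first is the more common one, runtime anomaly detection. Alibaba runs roughly 8,000 to 10,000 applications internally, and monitoring is frequent, at intervals of a minute or less, so at almost any moment there are more than a million metrics to check. That volume is clearly too much for operations staff to handle by hand, so we use an anomaly detection algorithm platform. The main idea is baseline prediction based on STL time-series decomposition, plus estimation of upper and lower bounds."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For baselines, we have several internal best practices. The first is STL decomposition of the time series for prediction. Another is ARIMA-PRO, an extension of the ARIMA framework that predicts periodic series better and updates the ARIMA parameters automatically, including the p, d, and q parameters (for example the differencing and AR orders), so that these core parameters are tuned in a closed loop."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A further one is a model based on Holt-Winters time-series forecasting, used to predict the residuals. For estimating the upper and lower bounds, we use IQR combined with a prior from the same period in history, which has worked well internally. That is the runtime detection framework in brief. It has a few core innovations; one is ARIMA-PRO, an enhancement to the framework that updates key parameters automatically."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Another is residual detection based on STL and IQR, with the upper and lower bounds also drawn using IQR. These are among the core techniques in the algorithm platform. Measured by RMSE (root mean square error), the error dropped from 0.74 to 0.59, a reduction of about 20%. In 2019 the framework successfully predicted an incident triggered by a drop in refunds, with good results."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The second is anomaly detection at release time. Many incidents happen when a new version goes live, so we have built a lot of algorithms around application releases. As mentioned earlier, Alibaba has some 8,000 applications and perhaps 4,000 releases a day, so internal iteration is very fast. Monitoring with fixed, manually set thresholds is inflexible, scales poorly, and needs constant manual updates, which makes it very inefficient."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/00\/004ccd25d9ca38d21d0251727b11554d.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At the core of our release-time framework is a closed loop that lets the models adapt themselves. For the models and instances running online, false positives and missed anomalies are filtered out and used to update the positive and negative samples; the training set is then refreshed on our big data platform and the models are retrained. The framework also uses SBD, a sequence analysis method that handles correlation between shifted series well, which makes the training loop work very well. These were the core innovations of the release-time anomaly detection framework. After it went live, detection on individual dimensions improved by a factor of 3 to 5 internally, and overall detection improved by roughly 5 to 10 percentage points, a clearly visible gain."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/87\/87240188c126a021070c180fab4730d5.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article has covered detection at runtime and at release time. There are many other cases, such as detecting anomalies that arise in day-to-day operation, which is a typical example in the industry, and anomaly detection on business metrics. We have built these AI techniques and expert experience into our cloud-native products. Besides using them internally, we are gradually opening them up in our external cloud products, and you are welcome to try them."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Reprinted from: Alibaba Middleware (阿里巴巴中間件, ID: Aliware_2018)"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original link: "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/On5wUXFQYTRl2j1cU7tQEQ","title":"xxx","type":null},"content":[{"type":"text","text":"Key Points and Case Studies in Building Cloud-Native Observability"}]}]}
]}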