Operations in the Digital-Intelligence Era: JD Digits AIOps in Practice (Part 1)

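{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1. Background"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Since Gartner introduced the AIOps concept in 2016, platformization and intelligence have become the dominant trends in how operations systems evolve. Broadly, operations has developed through five stages: manual and script-based operations, standardized tool-based operations, platform-based automated operations, DevOps, and AIOps."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Automated operations is far more efficient than manual and script-based operations, but the software can only run the preset workflows that we define. It cannot adapt on its own, and it cannot handle a “new” problem that merely resembles a known one. AI is well suited to several of these pain points, and AIOps emerged to bring AI capabilities to IT operations."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"AIOps (Artificial Intelligence for IT Operations) applies big data and AI techniques to learn patterns from massive volumes of monitoring data and from complex IT hardware and software. It detects anomalies, localizes faults, and predicts risks automatically, accurately, and quickly, which improves the availability of enterprise IT systems and the efficiency of operations, and it can handle problems that automated operations cannot. The growth of AIOps has two main drivers, the accumulation of standardized operations data and the demands of growing operations workloads:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Mature integrated operations platforms, such as CMDB platforms, monitoring platforms, and process management centers, have laid a solid foundation for accumulating standardized operations data;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Monitoring coverage is broad enough, but systematic handling processes and methods are lacking, and larger data volumes and more complex, dynamic environments make automated operations alone unsustainable."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To meet these demands, AIOps focuses on the following:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"o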
rigin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Empowering DevOps"},{"type":"text","text":": using AI to handle problems that automated operations cannot solve;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Real-time analysis and handling"},{"type":"text","text":": with intelligent algorithms and a steadily rising level of automation, AIOps diagnoses problems in real time and recommends actions, substantially reducing mean time to detect (MTTD) and mean time to repair (MTTR);"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Reducing alert noise"},{"type":"text","text":": AIOps correlates data to establish the relationships among infrastructure, applications, and business services; as the algorithms are refined, alert-noise filtering keeps improving and false alarms decrease;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Root cause analysis and prediction"},{"type":"text","text":": by analyzing data at scale, AIOps helps identify the root cause of a problem and mines event patterns in time series to check behavior against what is expected and to offer operational recommendations."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To pursue AIOps, the JD Digits intelligent operations team assigns team roles deliberately: operations engineers, development engineers, and algorithm engineers each play a distinct role, and all three are indispensable. The following summarizes what the team has tried and learned in dividing responsibilities internally."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/03\/4b\/036ec1d561feaea557d19f14a488254b.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listi
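tem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Operations engineers"},{"type":"text","text":": distill requirements for intelligent capabilities from the technical operation of the business. Before development begins, they think through the requirements and standardize data formats. Early on, they can use simulation to explore and validate feasibility and draft a suitable solution;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Development engineers"},{"type":"text","text":": build the platform's features and modules to lower the barrier to entry and improve user efficiency, and present the data delivered by the operations data engineers to users in a friendly way;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Algorithm engineers"},{"type":"text","text":": understand and organize the input from operations engineers together with the candidate algorithmic approaches, and produce the final implementation plan. In engineering delivery, they account for robustness, agility, and similar concerns, break the work into sensible tasks, and make sure results ship, ultimately improving the quality of business operations."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The JD Digits intelligent operations team has worked in this field for a long time. It has invested heavily in each of these focus areas, connected the stages end to end, and steadily built up operations knowledge and algorithms, so it serves internal teams and can also offer services externally. It applies AI across operations scenarios to provide reliable algorithm services, and its algorithm learnware (學件) has performed well both in day-to-day operations and during major sales promotions. With high performance as a precondition, our foremost goals for operations solutions are generality, automation, and robustness. By continuing to explore AIOps and to study the full range of operations scenarios, we can keep iterating on AIOps learnware and products that serve internal and external users, using AI to drive the digital transformation of operations."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2. Common AIOps Scenarios"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"AIOps builds intelligent operations scenarios step by step around three basic concerns: quality assurance, cost management, and efficiency improvement. Quality assurance breaks down into anomaly detection, fault diagnosis, fault prediction, and self-healing. Cost management breaks down into metric monitoring, anomaly detection, resource optimization, capacity planning, and performance optimization. Efficiency improvement breaks down into intelligent prediction, intelligent change management, intelligent question answering, and intelligent decision-making."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"sr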
c":"https:\/\/static001.infoq.cn\/resource\/image\/45\/8c\/45c98bdf2c8a6a861737196aafced18c.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3. Building AIOps Capabilities"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Building AIOps can start with exploring a single scenario, then gradually refine and chain scenarios together until the operations learnware solves a complete problem end to end. That learnware is then polished into a general, process-oriented intelligent operations solution. The evolution path commonly followed in the industry is:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"AI capabilities are being tried out, but no single-point application is mature yet."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Single-scenario AI operations capabilities exist, and early learnware for internal use takes shape."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Multiple single-scenario AI modules are chained into process-level AI operations capabilities, and reliable operations learnware can be offered externally."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"All major operations scenarios have process-level AI operations capabilities that run without human intervention, and reliable AIOps services can be offered externally."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A central AI weighs cost, quality, and efficiency together, meets the different targets that each stage of the business lifecycle sets for all three, and achieves an optimum across multiple objectives or an on-demand optimum."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"cont
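ent":[{"type":"text","text":"At present, the JD Digits intelligent operations team serves internal users through four product platforms: a metric identification platform (指標鑑明平臺), an alert discrimination platform (告警辨明平臺), a log interpretation platform (日誌闡明平臺), and a fault investigation platform (故障探明平臺). It can also supply algorithm model files for specific scenarios and containerized deployment packages for its learnware."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4. AIOps in Practice"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In 2020, while connecting digital operations end to end and accelerating AIOps adoption, the JD Digits intelligent operations team applied AI across every scenario of its intelligent solutions. Anomaly detection and root cause localization are described here. Drawing on the characteristics of the two main data sources, numeric metrics and log text, we built an interlocking “mortise-and-tenon” (榫卯) algorithm design. While keeping the platform portable, this design broadens the range of scenarios the algorithms can match, improves the accuracy of automatic algorithm orchestration, and gives more freedom to customize and extend algorithms. We will keep increasing our investment, building up business and operations knowledge while bringing AIOps to business R&D, product, and operations teams: internally to cut costs and raise productivity, and externally to drive industrial digital transformation with AI."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/0c\/60\/0c1c4da649996fbf403df6b2d080e060.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The JD Digits intelligent operations platform embeds many pluggable learnware components. It is simple to configure and easy to use, and it offers high accuracy and timeliness. The three modules for fault detection, fault localization, and fault remediation are described below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Fault detection module"},{"type":"text","text":": quickly detects anomalies in time-series monitoring data."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Fault localization module"},{"type":"text","text":": pinpoints the root cause of problems in complex systems."}]}]},{"type":"listitem","attrs":{"listStyle":null}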
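,"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Fault remediation module"},{"type":"text","text":": combines the operations knowledge graph with operations experts' experience to recommend solutions and fix faults quickly."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The three modules build on one another and together improve the operations experience and efficiency. The overall workflow detects faults quickly, localizes anomalies automatically, recommends solutions for anomalous events, and self-heals in some scenarios. This greatly reduces the cost of developers configuring fixed thresholds and of operations staff troubleshooting, and it raises service quality and business availability."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Apart from static configuration attributes, almost all data in an operations monitoring system is time-series data, in the form of time-series metrics and time-series logs. Judging whether the business is behaving abnormally from this mass of time-series data is a key means of fault detection. For numeric metrics that are numerous and intricately related, the metric anomaly detection learnware group can be orchestrated quickly and automatically, and it covers anomaly types such as sudden spikes and drops, cliff-like peaks and troughs, and abnormal trends. It also has adaptive algorithms and predefined strategies for influencing factors such as metric dimensions, periodicity or hidden regularities, holidays and promotions, and unexpected events. No manual thresholds or rules are required, which helps developers and operations staff quickly find anomalies that rules struggle to catch. Users can configure how anomaly alerts are delivered, which helps avoid false alarms and alert storms. In the metric anomaly detection module we introduced waveform analysis, which combines spatial and temporal features to analyze how anomalies propagate across related metrics and improves detection accuracy. Spatiotemporal analysis is an important means of finding anomalies that rules and strategies struggle to catch."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For golden business metrics and other key monitored metrics, configuring alert log analysis captures transient anomalies at the text-log level. It also parses log content to identify the entity behind an anomaly and to merge anomalous events by type, and it cross-checks the results of anomaly detection and of the later root cause correlation analysis. After extensive practice on anomalous events and theoretical validation, the three learnware groups are orchestrated in a specific way, and the learnware inside them adapts automatically to the metric data that is connected, covering the full range of operations scenarios."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/46\/73\/462b5aaeddb62f711933cd0ce3456673.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Traditional fault localization depends heavily on the experience of operations staff and on whether they investigate in the right direction, so capturing expert experience and turning it into an intelligent capability is the key to the problem. Static CMDB configuration and call-chain relationships can be looked up, but anomalies tend to arise while the system is changing dynamically. The operations knowledge graph is our team's most effective tool for this scenario. Intelligent fault localization addresses root cause localization in large, tangled systems, and combining the operations knowledge graph with reinforcement learning is a strong way to apply AI here."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The reinforcement learning algorithm searches globally, level by level. It visits every node that might be related, which helps ensure the accuracy of root cause localization. The operations knowledge graph gives the search its constraints and direction, so the search does not run in isolation but takes both call changes and configuration changes into account."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The operations knowledge graph we use is dynamic and extensible; configuration data, logs, alerts, and change records are already connected to it. Standardized data is ingested quickly and automatically, and the graph is highly compatible with other operations systems."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When the search finishes, the algorithm automatically refines and ranks the candidate root causes and calls the log analysis system to compute a confidence score for each recommended root cause. Localization results are stored in the format of a fault analysis report, so they can be queried during postmortems and used to check the algorithm's accuracy."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/cf\/42\/cf829da0043a037788394f0eece5d642.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the intelligent remediation stage, expert experience also guides how we analyze fault events and produce feasible action recommendations along with operational risk indicators. When the fault localization module issues its recommended root causes, it also retrieves the related data from the knowledge graph, mines fault correlations with a correlation analysis algorithm, and generates a report describing the event. The operations knowledge graph then checks the full chain according to call-chain dependencies and provides remediation suggestions and operational risk warnings. Self-healing is already in place for some scenarios."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/50\/cc\/50508e8bd9279a6ayy0e1618c9205ccc.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Reposted from: 京東數科技術說 (ID: JDDTechTalk)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original link: "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/LyNl8zITQkWt_wIqWq04rw","title":"xxx","type":null},"content":[{"type":"text","text":"Operations in the Digital-Intelligence Era: JD Digits AIOps in Practice (Part 1)"}]}]}]}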