Five AIOps Strategies for Augmenting Enterprise IT Operations

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"在現代化的企業中工作,我們希望AIOps(中文資料中也稱爲“智能運維”——譯者注)能強化IT運維,使企業在提高性能的同時降低成本、預防IT事故並提高業務的敏捷性。但在市場上存在着多種差異化的AIOps產品,我們如何能確保所選路線的正確性?一旦決定採用AIOps,應如何最大化地發揮其作用?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"正如題目所示,本文將給出五種策略,可確保企業能夠針對自身業務制定正確的AIOps規劃。我們先用一定的篇幅給出“AIOps”這一術語的確切定義。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"“AIOps”是Gartner於2016年創立的術語,指結合大數據、人工智能和機器學習實現IT運維流程的自動化和提升。當時,這個非常寬泛的定義在一定程度上引發了理解上的混淆,各IT供應商基於自身實際提供的產品,對AIOps給出了各自的闡釋。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"時至今日,業界領先供應商的產品已確定AIOps的落地現狀,這些產品響應了各家客戶正面對的挑戰。AIOps當前已更深入人心,定義也更明確,應用和趨勢也更實際。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"AIOps平臺涵蓋了基礎設施和運維 (I&O)、DevOps、SRE、服務管理等領域,大範圍地強化了IT實踐和功能。其中,I&O是最能體現AIOps優點的領域,涉及異常檢測、故障診斷、事件關聯和根本致因分析 (RCA,root cause analysis) 等,切實全面地改進了監控、服務管理和自動化任務。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"在闡釋了AIOps定義後,下面言歸正傳,列出前面提及的五種策略。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"腳踏實地,不要好高騖遠"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"提出一個宏偉的願景,通常情況下是件好事。一旦樹立了一個遠大的目標,即便沒能達成,也會走得更遠。但在實施AIOps解決方案時,如果行事目標過於籠統,可能會導致項目延期數月甚至數年。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"公司的高管可能會自上而下地頒佈命令,在整個組織中推進和實施人工智能和機器學習,但並沒有明確定義需解決哪些具體需求。事實上,在細化落實AIOps能力的構建中,好的做法是確定可逐步實現的各個短期目標,而不是隻給出一個長線的願景目標。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"例如,在“報警-工單”流程引入AIOps平臺時,落地過程中最好採用漸進的方式。即在保持現有的“報警-工單”工作流基礎設施運作的同時,逐步實施各個新的AIOps功能。基於此,我們可以先將部分監控報警輸入到AIOps事件關聯平臺,並將輸出返回給工單處理系統。這樣提供了一個能在實際投入生產之前對結果進行對比的基線。一旦用戶感到滿意,我們可將更多的工具逐步地添加到AIOps平臺中,直到實現監控層和可觀察層的完全集成。此後,我們才能着手去考慮如何額外添加新的AIOps功能,例如根本致因變更、修復的自動化等。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"逐步推進的方法不僅保證了在完全依賴AIOps平臺前確證其切實可用,而且可以讓團隊有機會在此過程中同步積累所需的各項技能,不必一步到位去掌握全部。一步到位可能會操之過急,甚至適得其反。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"選擇領域爲中心,還是選擇領域無關?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"在Gartner最新的AIOps市場指南中(“Gartner Market Guide for AIOps”,2021年4月6日,作者Pankaj Prasad,Padraig Byrne和Josh Chessman),給出了兩類AIOps解決方案,即“領域爲中心”和“領域無關”。領域爲中心的AIOps功能,是基於網絡、應用、基礎設施或雲監控等特定領域(實踐)的數據。相比之下,優秀的領域無關AIOps解決方案能跨多個領域工作,組合並管理抽取自多來源和多供應商IT技術的數據,以及體現環境變更情況的數據,從中獲得洞察力。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"在近期的一次AIOps視頻會議中提出,好的策略是將領域爲中心的AIOps功能內置於監控工具中,服務於一次性的特定用例;同時持續規劃部署能兼顧多種用例的、領域無關的獨立解決方案。例如,對於光學設施中的信號質量監控,使用領域爲中心的AIOps工具可瞭解連接的丟失情況。但負責維護運行在光學設施上的高質量視頻通話時,則應選擇領域無關的AIOps工具。因爲服務水平(SLA)的下降存在多種可能致因,涉及構成服務的多個領域和多種技術,瞭解根本致因需要關聯所有可能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"需注意的是,Gartner同時也指出:“隨着組織在AIOps採用上逐漸成熟,他們需要的是一個能跨I&O、DevOps、SRE甚至在某些情況下包括安全實踐的統一的、領域無關的平臺”。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"借力數據富集(Enrichment),驅動智能運維"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"數據富集(Enrichment)是實現事件關聯全過程的幕後英雄。第一手的報警數據只是出發點,並不足以確定問題的根本致因,進而可着手執行有效修復。收到來自多個領域的報警,是很難將它們相互關聯,形成一組精細粒度的工單的。如果使用時間戳或故障原點(point of origin)的話,它們提供的洞察信息非常有限,並且無法關聯其它來源或時間窗的相關報警。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"易於部署的數據富集告警,實現了對單個報警的增值,爲確定報警的相互關聯和關聯方式提供了額外的理解層級,讓用戶專注於高層級的關聯事件,避免糾結於每個進入AIOps平臺的低層報警。好的數據富集過程會減低“數據噪音”,有助於向用戶的CMDB、APM和編排工具中增添拓撲信息,在變更管理和CI\/CD流水線中增添變更信息,以及將業務場景引入團隊的知識和過程。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"選擇提供內置的、可擴展的數據富集功能的AIOps工具,將推動運維全過程的智能化。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"過程自動化"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"自動化具有許多優點,包括一致性、節省時間和最大限度地減少錯誤。一旦工單系統實現AIOps自動化,平均確認時間 (MTTA,Mean Time to Acknowledge) 可降低到毫秒級!"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"將運行手冊(Runbook)納入工單系統,意味着一旦出現特定的報警,就會觸發特定的工作流。運行手冊會自動執行所有不需要做額外考慮的技術步驟,例如檢查網絡資源狀態、獲取服務器或系統的信息等。將這些步驟全部置入工單,儘可能在無需人工參與的情況下完成識別和實施的必要修復。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"過程自動化不僅減低了IT運維團隊的工作負擔,加快了事故和宕機的解決速度,而且能解放運維團隊,去聚焦於高價值、有挑戰性的工作,在驅動業務創新的同時改進生產率。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"驅動持續洞察"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"實施AIOps解決方案的最大價值,並不僅僅是爲分析和改進性能問題提供專屬方案。AIOps支持用戶去分析各個階段,從事件檢測到開展調查和RCA,瞭解各階段所需的時間,形成補救措施和解決方案,在過程中持續推動流程的改進。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"設置KPI可實現進度跟蹤,確定導致延遲和性能問題的致因,進而給出滿足過程效率改進需求中應關注的方面,確定可提供價值最大化的下一步過程,進一步提高團隊的生產力。例如,識別並跟蹤受IT故障影響最大的應用或業務的持續變化情況,可提供對運維熱點的鳥瞰視圖。進一步跟蹤最頻繁檢查、最頻繁報警類別及其MTBF(平均故障間隔時間),有助於定位確切的問題位置。跟蹤和測量一定時間內的事件是否屬於L1、L2、L3或是企業特定的運維層級,可以確定並改進運維整體的效率。持續跟蹤MTTA(平均確認時間)、MTTD(平均檢測時間)和 MTTR(平均解決時間)等KPI,有助於分析和改進事件管理生命週期中的各個階段。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"謹記,無論採用何種策略,IT運維團隊都是企業運維過程中的關鍵合作者。與團隊保持密切的溝通,確保AIOps解決方案能降低團隊的工作量,而不是帶來更多的工作內容。企業可能已經發現了需更新或調整的關聯模式,團隊也可能已經從進一步的數據富集中受益。無論如何,企業用戶都需要與運維團隊共進退,找出並解決痛點,確定運行良好之處。確保團隊意識到自身的作用,最大化地發揮團隊的作用。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"AIOps正迅速發展,如何確保選擇正確的路線,如何確保能從市場衆多可用的AIOps平臺中做出一個明智的選擇,這是非常具有挑戰性的抉擇。確定AIOps對企業未來發展的意義,採用上述五種策略,實施AIOps平臺就能帶來非常卓越的收益和效率,幫助企業真正地改進運維。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"作者簡介"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Yoram Pollack"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"是BigPanda公司的市場產品部門負責人,主要關注IT運維和安全中的新興技術,尤其是AIOps。具體涉及:探索如何在IT運維中實現機器學習和人工智能以降低IT噪聲,檢測並探究可能的根本致因,人工IT故障管理的自動化。Yoram具有工程領域背景,並經過20多年的表述能力訓練。他目前的工作職責是助力企業理解技術如何滿足自身需求並實現業務增長。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"原文鏈接:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"},{"type":"link","attrs":{"href":"https:\/\/www.infoq.com\/articles\/aiops-it-operations\/","title":null,"type":null},"content":[{"type":"text","text":"AIOps Strategies for Augmenting Your IT Operations"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]}]}]}