The Road to Building Tencent Cloud's Network Operations Platform

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"I. Introduction to the Tencent Cloud Network"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/87\/87e176c98790b71fc07e318f8b8f5379.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure above shows the underlay architecture of the Tencent Cloud network. From top to bottom, the hierarchy runs from the Region level, to the availability zones, and finally to the network planning modules. In this figure, going up leads to Tencent Cloud's internal network, and going down leads to its external network."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The internal network has three kinds of connections: between network planning modules, between availability zones, and across regions. The external network mainly connects Tencent Cloud's three-carrier bandwidth as well as BGP, and it also handles external traffic scheduling."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d9\/d9240d2fc11346624cee23460fc22e8d.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure above shows the overlay architecture of the Tencent Cloud network. The overlay is built on top of the underlay, and it is the overlay network that cloud users actually use. The overlay network has two main kinds of nodes: network nodes and compute nodes."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Put simply, the overlay can be understood as: "},{"type":"t
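ext","marks":[{"type":"strong"}],"text":"point-to-point tunnels built by Tencent's self-developed SDN controller"},{"type":"text","text":". For example, for traffic between two guest instances, a tunnel is built between the host machines they run on; when a guest instance talks to a PaaS service, the SDN controller builds a tunnel from the host machine to the gateway cluster."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/50\/50cdf06805027d1b4dc877ad0caa1d27.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tencent Cloud now has more than 40 availability zones, more than 100 Zones, and over 1 million servers. That is a very large footprint, and Tencent Cloud is still evolving: the network architecture iterates quickly, the underlying fiber is intricate, changes to both the underlay and the overlay are frequent, and network faults come in every variety."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the infrastructure of Tencent Cloud, the network carries all data transmitted on the cloud, and its stability determines cloud network quality and user reputation. We have therefore set a higher bar for network stability: detect a fault within 1 minute and recover from it within 3 minutes."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Often a latent network risk does not turn into a network fault right away. As I see it, a network fault has a life cycle with three stages: the latent-risk stage before the event, the network-change stage during the event, and the network-fault stage after the event."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The before stage is the latent-risk stage: some abnormal events may occur, but not enough to affect normal use of the service. The during stage sees many network emergencies caused by network changes. The after stage is when an unexpected event has already caused a network fault."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/0f\/0f0da3859f3d1ab056f6c0e5af06d18c.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste"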
:true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To address this, we introduced chaos engineering in the latent-risk stage. For the change stage, we introduced network change health checks to curb emergencies caused by network changes. For the stage where a fault has already occurred, we built network monitoring to locate faults quickly and recover as soon as possible, improving network availability."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/92\/927558a7fceeeae5f479173c9833e82b.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As of November 2020, on the chaos engineering side we had supported more than 500 drills over the year and found more than 30 network problems. More than 1,000 network changes had gone through the change health check, keeping the total duration of change-induced faults under 20 minutes. On the monitoring side, we detect network problems within 15 to 30 seconds."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"II. Building the Tencent Cloud Network Operations Platform"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. 
Chaos Engineering"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As described above, we introduced chaos engineering because we wanted to resolve latent risks before they become network faults. So how is chaos engineering done, and how was it put into practice on the Tencent Cloud network?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, what is chaos engineering? In my view, "},{"type":"text","marks":[{"type":"strong"}],"text":"chaos engineering means running exploratory experiments in the production environment to find the weak points of the live system, and then continuously improving the system's resilience"},{"type":"text","text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/22\/2230e37b4d63ef2e5a6fda4722f4edd6.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With the spread of service-oriented and microservice architectures and the adoption of CI\/CD, the whole path from development to release has become very convenient. In a complex distributed system, however, this greatly increases the likelihood of random, unpredictable service failures, which can in turn disrupt the whole network and make users' services unavailable."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Although we have monitoring and handling in place when faults occur, we still want to dig out latent risks before they turn into network faults. That is why we introduced chaos engineering."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Chaos engineering differs from testing, and I think the main difference is the environment. Chaos engineering ultimately aims to run verification drills in production, whereas testing mostly takes place in non-production environments."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","t
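ext":"Drills also test the operations staff and demand strong emergency response skills. The other main difference is the input: testing usually verifies functionality, so inputs and outputs are generally predictable, whereas chaos engineering is more about introducing unexpected events."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"In Tencent Cloud's network fault products, chaos engineering takes the form of network drills"},{"type":"text","text":", and our drill scenarios generally come from faults on the live network. Typical network anomalies include packet loss, traffic surges, and uneven hash-based load balancing. Once the event is understood, a drill needs to identify its critical path and then its business metrics. For the cloud network, business metrics include route convergence, network quality, and traffic."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/61\/613cdc556b48edd4f502739f60d9e4a7.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Throughout this process you need some views to guide the drill, or it is easy to get lost. Once steady-state metrics are in place, abnormal events have to be handled during task execution. If you need to isolate something, for example, you have to consider whether the isolation tool works well, whether the device responds, and whether the network is abnormal."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Most importantly, a chaos engineering experiment must not turn into a real network fault. You need to keep its blast radius as small as possible and have rollback measures ready in case it grows. This comes down to fault injection and fault teardown: fault injection introduces the anomaly, and fault teardown removes it promptly when the drill is aborted or finishes."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Finally, we want all these drills to evolve into processes that run automatically, so there must be measures for judging steady-state metrics, tearing down faults automatically, handling exceptions, and isolating faults, so that a surprise never becomes a real fault. When a drill ends, we also summarize and analyze the drill report and the problems found, abstracting them into scenarios and into optimization plans for future drills."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/43\/434e35eb6fdc6228a4f1e257067fbd52.png","alt":"Image","title":null,
"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. Change Health Checks"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As Tencent Cloud keeps growing and its network architecture keeps evolving, the number of network changes rises with it. Tencent Cloud has a fairly standard management process for network changes: a baseline must be established, every change needs a time window, and the request, approval, implementation, and announcement of each change must all be thorough."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/9f\/9fdf18252f720f3a8e555ffc4dab45c7.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Network changes need to be classified into scenarios, from which good implementation plans can be distilled. Changes also require approval, which mainly reviews the technical steps of the change, its risk controls, and the assessment of its lateral impact."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Finally, when implementing changes, we also need to build up a body of risk-control practice, push risk as low as possible, and identify best practices. Once a plan is mature or low-risk enough, we move it into automated change execution so that it can run unattended."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Even with these rules, network changes still cause problems in practice. What are the main ones? First, network changes are not transparent to business teams. Second, the people making network changes have no visibility into business metrics and so cannot notice faults, while the business side cannot quickly link a problem to a network change when troubleshooting."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ef\/ef3d91995b0a58d47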
20e98aedf49a1f2.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Our analysis showed that "},{"type":"text","marks":[{"type":"strong"}],"text":"the main problem is that network changes lack monitoring of their own business metrics"},{"type":"text","text":". That is why we introduced the network change health check."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At which step does the change health check happen? Once a network change is approved, we add the corresponding change task. During the implementation window of that task, we monitor and analyze its execution. Because a network change usually targets a single point, there are often metrics that probe the business well and can serve as the criterion for anomalies."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For some changes, however, it is hard to collect samples of business metrics. What do we do then? In those cases we analyze alerts on related business metrics."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a5\/a5b2a633399e9bb47075e066e34e16c2.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. 
Network Monitoring"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Besides network changes, one more measure is indispensable: network monitoring. "},{"type":"text","marks":[{"type":"strong"}],"text":"Our requirements for network monitoring are that it be fast, accurate, and comprehensive, with sufficiently fine granularity"},{"type":"text","text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tencent Cloud network monitoring has to cover a great many scenarios, including external carriers, internal LAN and DCI, gateway cluster quality, forwarding quality, and dedicated lines. The probing methods are just as varied, including Ping, TraceRoute, Curl, and Socket."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We also need to improve alert accuracy so that faults can be located quickly and precisely and their impact time reduced. Monitoring granularity is on the order of 5 to 10 seconds, and a problem must be detected within 15 to 30 seconds of a fault occurring."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ad\/ad460740a5bc524be6d14ee1ada934e2.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How do we achieve accuracy? First, "},{"type":"text","marks":[{"type":"strong"}],"text":"the probe source must be stable"},{"type":"text","text":" and must not run under high load. Second, "},{"type":"text","marks":[{"type":"strong"}],"text":"the path between the probe source and the probe target should be short"},{"type":"text","text":"; if the path is long, the problem is often hard to pin down when an anomaly occurs. Third, "},{"type":"text","marks":[{"type":"strong"}],"text":"the collected samples must be reasonably stable"},{"type":"text","text":"; a sample cannot be alive one moment and unreachable the next."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What challenges do we face in network monitoring? The first is collecting target metrics. More samples are not necessarily better: we want a small number of accurate samples that reflect the real situation, but those samples must also stay active over the long term. A sample that is only 'half alive' will add a lot of noise to the monitoring data. The probes also need to cover problems comprehensively."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e4\/e4f70d3414a1b262ea81d422ca7943a3.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The second challenge is detecting problems quickly. Only when probing is fine-grained enough can monitoring and localization speed up, but once probes become too frequent and too numerous, the probed party may notice and take action to limit you."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next, we need strategies for running fast, comprehensive anomaly detection on the data once it has been collected. Network anomalies are not only sudden, sustained ones; we also need to catch anomalies such as intermittent network jitter."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In response, Tencent Cloud designed the following monitoring scheme. A liveness-probing stage selects high-quality samples and feeds them into a probe pool, where probing runs in a loop. The probers do nothing but probe, and the results are written to storage quickly. Once the data has landed, monitoring is no longer just about probing to find problems; it also needs the ability to analyze them."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When probing, we combine the probe path with the logs of the network devices along it, together with metrics such as whether traffic has changed, to locate network problems."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/4f\/4f4c8776d9cb6b2a63d070dc195be36a.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"
number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"III. Thoughts on the Future of the Tencent Cloud Network Operations Platform"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As described above, for network troubleshooting we use chaos engineering experiments against latent risks, we introduced change health checks for network changes, and our network monitoring already covers live-network problems fairly comprehensively and accurately."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Looking ahead, there is more to explore. For latent risks, are there better methods than chaos engineering? In addition, much of our fault localization today relies on packet capture, but once the path gets long this becomes hard to control and hard to coordinate, so we are also asking whether other methods could work for fault localization. Finally, we hope the system can achieve some degree of self-healing when a network fault occurs."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We have made many attempts in these directions. For fault prediction, we want to combine data such as syslog and SNMP from network devices to uncover latent risks in advance."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For troubleshooting, we hope to achieve full-path troubleshooting, combining network topology, traffic coloring, mirroring, and other analysis to locate network faults better."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Finally, self-healing depends on automated scheduling of network traffic and automated isolation of faulty network devices and links."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a0\/a059953e29d66217dd461a5e761a759b.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align"
:null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"IV. Q&A"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q: The chaos engineering and post-change controls you just described generally monitor point to point rather than across a whole surface. So how does Tencent Cloud currently do network monitoring?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A:"},{"type":"text","text":" We monitor by point mainly because point-level monitoring is more precise, as long as business metrics can be collected at that point. Probing must also stay close to the target over a short path, so that a detected problem is almost certainly a problem at that point. But what if we cannot get business metrics for that point? Given the hierarchical structure of the network, the point is linked to upstream and downstream network devices. We build that association, and if the upstream or downstream links show a problem, we immediately check whether this point caused it, because those links normally have no problems and issues are usually caused by a network change. As mentioned earlier, we also have a traffic-light mechanism: for associations with high accuracy we require an immediate rollback to reduce the impact of the fault. Surface-level problems like this usually need operations staff to step in."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q: You mentioned that long links call for shorter probe paths. A very long link gets split into multiple probe segments with many branches, so a single event may flag many faulty points at once, which is hard to analyze by hand. Is there a technical way to make the judgment?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A:"},{"type":"text","text":" One scheme we have used is to cover a point with two probe sources: if both report a problem with it, the problem most likely lies with that point. Another is Full Mesh probing, where the problem gets amplified, because it propagates and grows as links get longer, which makes it hard to solve. A third idea is to aggregate anomalous paths: when probe data shows anomalies, check whether the paths they traversed overlap."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q:"},{"type":"text","text":" Is that judgment made manually?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A:"},{"type":"text","text":" It is automated analysis. Data for an anomalous target comes with its probe path, so we can aggregate over the probe paths and find the node they all share; that node is most likely the culprit."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q: With so many logs, beyond automated analysis, do you also use deep learning or machine learning methods?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A:"},{"type":"text","text":" We have made some attempts."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q:"},{"type":"text","text":" Has any of it been deployed on the live network?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A:"},{"type":"text","text":" Yes, but the accuracy is not yet good enough. We are also working on log normalization."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q:"},{"type":"text","text":" Do you use template matching?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A:"},{"type":"text","text":" Some of it is template matching, some is rule-based, and some is algorithm-based."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q:"},{"type":"text","text":" Is any of it based on deep learning?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"cont
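ent":[{"type":"text","marks":[{"type":"strong"}],"text":"A:"},{"type":"text","text":" There have been some attempts, mainly for log anomaly detection."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q: Monitoring data needs to be labeled. Was that labeling done in advance, or how is it handled?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A:"},{"type":"text","text":" We currently rely mostly on unsupervised methods. Labeling logs is time-consuming and labor-intensive, though not impossible. A team is working on it now and does some rule-based log labeling."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q: You mentioned that network changes can cause network faults. If a service's own monitoring is incomplete and it cannot diagnose its failure, can a network change be carried out for a single service? For example, an application without proper disaster recovery fails; afterwards I want to find out why and fix it, and I also want to reproduce the scenario. Do I need help from the operations team?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A:"},{"type":"text","text":" We deal with surface-level problems rather than purely business-specific ones like yours. Cloud network monitoring is decoupled from the business and treats all services equally. Apart from monitoring for SVIP-level customers, everything is aggregate, dashboard-level monitoring that rarely captures point-level problems. Even with SVIP-level monitoring for point problems, the small number of samples makes it harder to pick high-quality sample points for a change, so good steady-state metrics are hard to select."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q:"},{"type":"text","text":" Is there any method that can help business teams troubleshoot the problems they run into?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A:"},{"type":"text","text":" That is where full-path troubleshooting comes in: it lets us analyze the problem using simulated traffic."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule"},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Header image: Unsplash"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content"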
:[{"type":"text","text":"Author: 陳政產"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original article: https:\/\/mp.weixin.qq.com\/s\/E4XrGIGDZVBb0NQHRBD1lQ"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original title: 騰訊雲網絡運維平臺建設之路"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Source: 雲加社區 - WeChat official account [ID: QcloudCommunity]"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Reprinting: Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source."}]}]}