Observability and Resilience in Microservices Environments

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubernetes simplifies managing and scaling microservices. For developers and operations teams, though, keeping track of so many moving parts is often a major challenge. Even a simple task such as working out what changed in the system, and who changed it, gradually becomes close to impossible. Clear observability, which enables better monitoring and troubleshooting, is key to improving the development process."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Change Tracking in Distributed Systems and Its Challenges"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I'm Itiel, CTO of Komodor. Today I'm going to talk about change tracking in distributed systems, and about the dark side of change. Komodor is a startup that has built the industry's first Kubernetes-native troubleshooting platform."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I'm a big believer in the DEV Empowerment philosophy, which is basically about moving fast and “shifting left.” Before this, I worked at eBay, Forter, and Rookout. I have a lot of backend and infrastructure experience, and I'm also a big fan of Kubernetes."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I'll start with why you should care about changes and what kinds of things change your environment. Then I'll narrow the scope: what I mean by a change, and which changes carry the most risk in today's modern environments. After that, I'll discuss why it is so hard to find out what changed in a system, and where change tracking is heading. Finally, I'll offer some practical tips to ease the pain of tracking changes in today's modern systems."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/00\/00f2abd3d84e9a81f8b6a30bc11a5cae.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Why Care About Changes?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"numb
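er":0,"align":null,"origin":null},"content":[{"type":"text","text":"So first of all, why should you care about changes? We've mentioned things like tracing, downtime, and the cost of microservices. None of this should be news, but for some companies an issue can come up every hour, or even every minute."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The word “issue” covers a lot of ground: anything from a full system outage to a small, short-lived problem or a single bug. 85% of all incidents can be traced back to a system change, which means that someone, somewhere in the organization, changed something, and now your application has a problem."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"My point is that most troubleshooting time goes into exactly this: finding the root cause, that is, what happened in the system that could explain the symptoms you are seeing now. As I said, those symptoms might be a full outage or just an error in your UI."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"What Exactly Is a Change?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I keep using the word “change,” but what do I actually mean by it?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In this talk I'm focusing specifically on system-wide changes. That means code deployments, which are the first thing that comes to mind, and infrastructure changes, such as modifying a security group on AWS."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It also means configuration changes, toggling feature flags, dark launches, and changes to jobs in Jenkins, ArgoCD, or similar job platforms, as well as database changes, migrations, and third-party changes. What I won't cover in this talk is changes in usage or in data."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Of course, sometimes your application goes down because user behavior changed: maybe users started sending a different kind of data, or hit your system with a huge load. I won't discuss those cases today. As I said, most incidents stem from system changes, not from these kinds of changes."}]},{"type":"paragraph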
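","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hopefully that makes clear what I mean, and you probably already know how much these changes matter. When you troubleshoot a problem you are playing detective: you are basically trying to find the change that explains the problem in front of you."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Why Are Changes So Hard to Find?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So why is it so hard to find out what changed in a system?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because today's modern stack (or haystack, if you prefer) is very complex, as Chinmay summed up well. It includes many third-party services, such as Xero, your cloud provider, and dozens of different REST APIs that your application needs in order to run properly."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"On top of that, we have broken the big monoliths of the past apart onto Lambda and Kubernetes, into dozens, hundreds, or even thousands of small microservices running all over the place. The frequency of change has also shifted dramatically: good, modern organizations can now deploy hundreds or even thousands of times a day."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"And that is only code changes. As we saw on the previous slide, many changes are not counted as deployments at all, yet they can break your entire system: configuration changes, feature flags, and so on. In the past, the people who deployed to production were usually IT or operations staff."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In today's modern systems, the person deploying to production may well be a developer. Even product managers can now switch customer-facing feature flags on and off. Trying to understand what changed in a modern system is basically like studying a very complex, constantly shifting puzzle while also trying to work out what that puzzle looked like five minutes ago."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{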
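"type":"text","text":"I've tried to boil this down to roughly three main obstacles to troubleshooting. First, everything is interconnected, and companies like Epsagon have done excellent work on distributed tracing to deal with that. A change in one microservice can affect many others, even ones that seem unrelated to it."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The impact may not even be on a first-degree connection, but on a second-, third-, or even fourth-degree one. A single change can ripple through the whole system. On top of that, many changes today are made in tools that have no audit log at all, or whose audit logs are really hard to use."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ea\/ea3b0ae5743255a4b26e2e4ded9d61dc.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"AWS is a good example. Every time you change something in the AWS console, it is basically recorded in CloudTrail logs. Yet almost nobody uses them, because they are too complicated to work with. And many other changes, such as those made directly in production, are not audited in any way, shape, or form."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Finally, even if every change is audited, and even with a great tool like Epsagon to help you understand the connections, you still need to open dozens of different tools and trace the changes in each one to really understand what changed."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"You also need expertise: you have to know how to open all of these tools and troubleshoot effectively with them. Troubleshooting in a modern system today goes roughly like this. You see an alert in Slack, then you go to Epsagon, which tells you loud and clear that something is wrong with your system."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"You go to Kubernetes to try to work out what actually happened. From Kubernetes you move on to the CI\/CD pipeline to find out who deployed to production, why, and when. That takes you to Jenkins, and from Jenkins you trace back to the source code. The source is on GitHub, so you go to GitHub and look for any commit related to the failure, find nothing, and are left none the wiser."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"You ask your team: who changed what, and why? Who can help me with the problem I'm facing right now? In the end you figure out that some seemingly unrelated service was the root cause of all these failures. You simply missed the connection and overlooked the change in that unrelated GitHub deployment."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So what about the future? Will things get better? In short, no. All the indicators suggest that things will only get worse from here."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The pace keeps increasing: today even small companies deploy to production dozens of times a day. With the rise of the shift-left movement, development teams can deploy and product managers can change things too. Even QA can now make risky changes to your production environment, and these trends are not ending any time soon."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because microservices are so easy to adopt in a modern stack, systems keep getting more complex. Everything also keeps getting smaller, shrinking from microservices to nano-services and so on, so things will only become more complex and more distributed."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So the trends we see today will make it even harder to understand which changes in a system explain the problems it runs into. I know this all sounds grim, but there is a lot you can do to mitigate these risks."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I won't go through all ten points one by one, but the first and most important thing to do is audit your changes. Auditing can be automated, or it can be a dedicated written process. Without auditing, troubleshooting only gets more complicated."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/04\/04db5e0ef1a4e0e42d100c7a4e64f13c.png","alt":null,"title":n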
ull,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For more details, watch the video:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https:\/\/youtu.be\/J32ZoiRVvPg?fileGuid=YpxJxqQjG99vVr3r"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original article:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https:\/\/komodor.com\/blog\/observability-and-resilience-in-microservices-based-environments\/?fileGuid=YpxJxqQjG99vVr3r"}]}]}
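The talk's closing advice, to audit changes automatically instead of opening dozens of tools during an incident, can be illustrated with a small sketch. This is not Komodor's implementation: the `ChangeEvent` record, the source names, and the 30-minute lookback window are all assumptions made up for illustration. The idea is only to normalize change events from several tools into one timeline and list the ones that landed shortly before an alert.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical normalized record of one change, whatever tool it came from."""
    timestamp: datetime
    source: str   # illustrative names such as "jenkins" or "feature-flags"
    service: str
    author: str
    summary: str


def merge_timeline(*feeds):
    """Merge per-tool lists of events into one list sorted oldest first."""
    merged = [event for feed in feeds for event in feed]
    return sorted(merged, key=lambda event: event.timestamp)


def changes_before(timeline, alert_time, lookback=timedelta(minutes=30)):
    """Return changes inside [alert_time - lookback, alert_time], newest first."""
    window_start = alert_time - lookback
    recent = [e for e in timeline if window_start <= e.timestamp <= alert_time]
    return sorted(recent, key=lambda event: event.timestamp, reverse=True)


if __name__ == "__main__":
    # Made-up sample data: one deployment and two feature-flag toggles.
    alert = datetime(2021, 5, 1, 12, 0)
    deploys = [
        ChangeEvent(datetime(2021, 5, 1, 11, 50), "jenkins", "checkout", "dana", "deploy v42"),
    ]
    flags = [
        ChangeEvent(datetime(2021, 5, 1, 11, 58), "feature-flags", "pricing", "pm-lee", "enable new-pricing"),
        ChangeEvent(datetime(2021, 5, 1, 9, 0), "feature-flags", "search", "pm-lee", "disable beta-search"),
    ]
    for event in changes_before(merge_timeline(deploys, flags), alert):
        print(event.timestamp.isoformat(), event.source, event.service, event.author, event.summary)
```

Because a flag toggled by a product manager and a Jenkins deployment end up in the same list, a change in a seemingly unrelated service is harder to miss. A real setup would fill the feeds from each tool's own audit trail, which this sketch deliberately leaves out.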