零故障上雲全過程再現,PB級數據遷移如何保障一致性?

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"一、美圖秀秀介紹"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/1b\/1b63f8125cb008db917277776bdffed3.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們來看一組數據,截止今年上半年,美圖應用月活躍用戶數超過3億,獨立設備數超過20億,每個月美圖秀秀要處理的圖片以及視頻數超過30億。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在這些數據背後還有一些看不見的數據,在全國各地我們有6大自建機房,數以百計的服務器,十幾款軟硬件產品,PB級的業務數據。在支撐這些數據的過程中,我們逐漸的發現了一些問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先是運維管理複雜,因爲有這麼多的服務器,隨着一些服務器過保老化,故障率逐步上升。爲了業務的穩定性,需要投入大量的研發人力去監控這些機器的監控狀況。每隔一段時間,我們就會看到公司大羣裏,內購一些硬盤。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"美圖秀秀的業務流量有極強的週期性,週末比平時峯值流量高,元旦春節峯值會是平時的很多倍。爲了支撐每年元旦春節流量,業務部門首先會預估流量,然後採購部門都會提前按照預估峯值流量採購服務器,然而春節過後流量下來,一部分服務器資源會閒置,變成一種浪費。需要付出的成本較高。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"美圖內部多個部門服務及組件重複建設的情況也比較普遍。但是由於有的服務組件與業務綁定的過於緊密,使得美圖內部孵化的一些創新產品無法快速使用,導致創新產品的發佈週期較長。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們看到了這麼些問題。應該怎麼解決?那換個角度思考,從這些問題,看看我們需要的是什麼?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/53\/533c09caf90a35112fcc308cc5513fae.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其實我們需要的是運維簡單的彈性計算和存儲資源,開箱即用的服務和組件共同支撐起應用的快速開發。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過什麼樣的方式能夠實現我們想要的這些能力?經過公司仔細調研,我們給出的答案是雲服務。公司在19年下半年提出了上雲的計劃。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"二、上雲的挑戰與應對"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那對於這麼龐大體量的產品要上雲,應該怎麼做呢?又會遇到些什麼問題呢?接下來就給大家帶來的是上雲的什麼挑戰,以及我們應該怎麼去應對呢?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可能有人會說上雲有什麼難的?把服務部署上去不就好了。一開始我們接到上雲任務的時候也是這麼想的。然後在討論這個事情的時候,組內的資深技術專家給大家提了這樣的一個問題?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/6f\/6fd4c6d3b358c4e14f4a5f1d0640fd2e.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"把大象裝進冰箱裏,總共分幾步?當然我們都知道答案。3步,第一步把冰箱門打開,第二步把大象裝進去,第三步把冰箱門關上。上雲就相當於把大象裝進冰箱。簡單來說上雲也分3步:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一步在雲上創建對應的服務和資源"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二步把數據同步上去"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第三步把流量切換到雲上。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"每一步聽起來都很簡單,對不對?但是具體每一步該怎麼做,細節就是魔鬼?我們有3億的月活,海外用戶的佔比也不小,我們需要爲用戶提供的是7X24小時的不間斷服務,在上雲的過程中也不能中斷。另外是我們有PB級數據,用戶數據是我們最核心的資產,不管在哪個環境下,我們都需要保證用戶數據的完整性和一致性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"簡單來說就是用戶需要完全無感知,這就要求我們整個上雲的過程需要做到平滑和穩定。接下來主要圍繞着如何做到平滑和穩定來跟大家分享上雲的一些方案設計。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/50\/50c4e1b096095f2945912bfd96dbd382.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"去掉一些細枝末節,我們可以將業務系統的架構簡化爲3層,最上層的負載均衡,中間的web服務,下面是資源層,對應着3個層面的遷移,流量的切換,應用的遷移和資源的遷移。另外還有依賴的外部服務。如果外部依賴不遷移,跨IDC的訪問,專線有1ms的延遲,公網有幾毫秒的時延。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們將依賴服務按照跨IDC的時間敏感度分爲強依賴服務和弱依賴服務。對於強依賴服務,需要在我們業務上雲之前部署在雲上,弱依賴服務可以按照服務提供方自身的上雲順序進行,但是我們仍然需要關注的是依賴服務跨IDC訪問的一個帶寬的情況。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"每一個遷移操作我們都會按照“準備”—>\"執行”—>\"驗證”—>”觀察”的方式來操作,確保每一個操作都沒有問題才進行下一個操作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"接下來我會按照大象裝冰箱的一個步驟,以應用、資源和流量的順序跟大家分享各部分內容具體是如何遷移的。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1、把冰箱門打開:應用遷移"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先是應用的遷移。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a7\/a71f2f0550d699efc5a8cca874a3b9de.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在雲下我們有自己的一套容器化平臺,可以支撐業務的容器化需求。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因此應用遷移,我們會將其劃分爲兩種類型,一種是容器化應用,另一種是非容器化的應用。由於容器化平臺的建設和改造提前上雲,可以讓業務在雲上部署應用的時候採用跟IDC相同的方式,所以大大簡化了應用上雲的複雜性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於另一類非容器化應用,因爲我們的應用都是無狀態的服務,所以上雲主要考慮的是在雲上部署的問題。那是部署在虛機上,還是各方面性能都更好的裸金屬服務器?另外在選擇組件的時候,我們是應該自建,還是使用雲上的開源組件,或者雲上自研的組件?其實這裏涉及到一個雲上服務及組件的選型問題。那我們遵循的一個選型原則是:最大限度的複用已有的流程和系統。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因爲我們的目的是服務上雲,不是做架構或者應用優化。上雲本身是一個複雜的工程,我們希望在上雲的過程中儘量控制變量的發生。在上雲的過程中不會刻意去改動架構和流程,因爲這樣會進一步的增加上雲的複雜度和時間,導致故障的發生。當然也不是絕對的不變,爲了解決上雲中的一些問題,該變的還是要變。就是不要爲了“順便”做個架構優化而去變。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2、把大象裝進去:數據遷移"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"說完了應用遷移,相當於打開了冰箱門,接下來是數據遷移,該裝大象了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d4\/d48cb0bac45587cb12322f0206e09266.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據分爲兩部分,一個是業務數據,一個是媒體數據。業務數據也包括了很多,比如MySQL的數據,緩存Memcache,Redis,Elasticsearch等等。今天業務數據我們只跟大家介紹MySQL和Memcached。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在下面的分享中,大家可以看到MySQL中的持久化數據,Memcache中的緩存和媒體文件作爲3種常見的數據,分別採取了不同的遷移方案。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/73\/739b61a9c6c91bf9137a4cadcb354103.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"要遷移PB級的數據量,僅僅依靠公網帶寬做數據傳輸是遠遠不夠的。那應該怎麼辦?很多人會想到搭專線,沒錯。我們也是使用專線來進行數據的傳輸。但其實公有云還提供了一種方式,就是數據快遞。使用方式是,你預約一個數據快遞服務,然後雲廠商會給你郵寄一個幾十或者幾百T容量的硬件設備,真的是郵寄,你把這個設備插到服務器上開始同步數據,同步完成後寄給雲廠商,雲廠商會把這些數據在雲上用對應的存儲恢復出來。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們有PB級的數據,好幾百個數據快遞在路上跑,聽起來總有一種不靠譜的感覺。當然我們也沒有采用這種方案。說回專線。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們在各個機房之間提早搭建了專門用於數據傳輸的專線。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"業務跨IDC訪問流量走業務專線,業務數據傳輸通過數據傳輸專線,然後媒體文件走存儲專線。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在右邊公有云上,針對不同的集羣,我們會劃分vpc來做網絡隔離,vpc之間通過虛擬路由來連接。另外還會搭建一套雲上的測試環境。主要用於服務組件的壓測和遷移方案的演練。後面絕大多數方案中涉及到的一些問題都是通過在測試環境壓測和演練發現的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在專線帶寬搭建好之後,我們開始進入mysql的數據遷移。在數據遷移前,我們會在雲上搭建一套跟IDC配置一樣的mysql的主從集羣。因爲本身業務做了讀寫分離,所以主從節點都分配有各自的域名。一開始公有云上的讀寫域名都指向IDC。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們依靠雲上提供的數據複製服務來做數據同步。將IDC的數據同步到雲上。同步分爲兩個階段,剛開始是存量數據同步,之後是增量數據同步。雲上數據複製服務會告訴你什麼時候存量數據同步完了,什麼時候開始增量數據同步。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/3b\/3b47b373b01cf27dceab4bc8b954cd1b.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那麼MySQL數據遷移的完整性如何保證?我們從兩方面來看,首先是數據複製服務的設計,我們可以將數據複製服務理解爲也是MySQL的實例,使用的同步機制跟MySQL的主從同步一樣,理論上來說就是相當於是掛載了一個MySQL的從庫,由MySQL的主從機制來保證數據的完整性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外我們還可以從兩個維度來驗證數據的完整性,一個是數據複製服務提供了按照表行數和表內容的對比功能,同時,從業務層面我們也提供了相應的接口可以抽樣對比兩個數據庫中的數據。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在數據同步完成以及校驗完成之後,下一步就是讀寫流量的切換?切換順序是先切讀再切寫。流量切換隻需要變更對應域名的DNS配置。當域名的DNS配置變更完成後,並且經過驗證變更已經生效後,mysql的流量應該切換到雲上。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但是在驗證環境,我們通過流量監測腳本發現IDC的mysql集羣仍然有來自雲上的讀寫流量?通過ip對比,確認這些流量來自一部分服務,通過在這些服務的機器上去查看tcp連接,我們並沒有發現來自雲上服務的ip。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那我們首先想到的是不是DNS的緩存?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在對應的機器上執行nslookup域名解析,域名確實已經解析到雲上了。機器上域名解析正常。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"再一個懷疑的是不是服務裏面有dns緩存的機制?當我們在服務裏面去做域名解析的操作發現依然是解析到雲上。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"都不是,那會是什麼問題?重啓大法。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們重啓一部分服務,發現服務起來之後連接全部到雲上。沒有IDC的連接。說明不是域名解析的問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然後我們在另外的沒有重啓的服務上發現了一個現象,就是新建的連接都來自雲上,老的一些連接來自IDC。從這裏發現我們不同服務使用的數據庫組件,有的有dns感知,有的沒有dns感知功能。當域名變化時,沒有dns感知功能的服務不會主動釋放老的連接創建新連接。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因爲本身服務使用的組件都比較老,變更版本帶來了很多依賴兼容性問題,所以我們放棄在服務本身上變更。轉而在方案上做文章。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b9\/b99509e665ec6e276162dd6856b254de.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在計算機系統設計領域有這樣一句經典語錄:”計算機系統設計裏面的任何問題都可以通過增加一箇中間層來解決\"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們通過增加一個VIP來解決域名變更後長連接不斷開的問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"切完寫之後,流量全部切換到雲上,此時IDC沒有寫入流量,這時候IDC的MySQL主從集羣已經不能用了。整個切換操作就完成了。如果方案到這裏就結束了,那這只是一個正常的切換操作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/0d\/0d35b6992976c46d3e0191c013626bcb.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那再問一個問題,如果這個時候雲上服務出現故障,需要回滾怎麼辦?已經回滾不回去了,數據已經不一致了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那麼我們該怎麼解決切寫之後的回滾問題呢?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/bd\/bda2f8195199c57251a792cc976f3487.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"想一想這裏回滾不了是因爲什麼?因爲IDC沒有完整的數據了。那現在完整的數據在哪裏在雲上,那如果我把雲上的數據同步到IDC一份不就又是一份完整的數據了嗎?但是同步的時機肯定不是出問題之後,而是應該在出問題之前。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/55\/5513dead9dcbd8964190baf3b2001d17.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"順着這個思路,在一開始的時候,我們就在IDC重新準備了一個備庫集羣。將公有云上的數據又同步一份到備庫集羣。這樣可以保證在切寫之後,IDC還是存在一份完整的數據,可以保證流量的回滾。在回滾後,將數據複製服務的流向變更爲從最右邊的集羣到共有云。然後重複前面的過程。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"針對MySQL的數據遷移,這裏有些小Tips跟大家分享。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/4e\/4e8aa4729bd343c4ac89683df03c00fb.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一點,是數據遷移需要做數據對比,對比從兩方面入手,一個是遷移工具的驗證,一個是業務層面的驗證。另外要注意遷移源庫的內存使用率,因爲在做數據對比的時候,源庫的內存使用率會上升,如果源庫本身內存容量不足,有可能造成源庫oom。在實際遷移過程中,我們就遇到了源庫oom的情況。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二點,針對一些按年,按月建表的業務,不要忘記了數據庫表創建的腳本要同步到雲上。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第三點是數據庫複製服務的限速,一個是複製的過程中不要給源庫造成太大的壓力,另外遷移過程中有多個應用多種數據遷移,需要協調專線的使用率。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第四點是 在做切換操作時,需要將切換操作腳本話,一鍵切換,只需要執行一條命令。可以減少出錯的機率。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"介紹完MySQL的數據遷移,下面介紹Memcached。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/5c\/5c622497ce5f0f782313a0b594f0c474.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"看下作爲緩存的memcache是如何遷移的。美圖秀秀重度依賴mc作爲緩存。mc的同步採用的是雙寫+就近讀的方案。雲上和IDC會互相寫數據,但是讀流量只會讀本數據中心的mc。另外在流量切換之前需要有一個預熱的過程,預熱既可以在業務層面做,也可以通過灰度的流量自動預熱,利用灰度預熱需要我們實現預估好穿透的量,避免給mysql造成太大的壓力。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在測試環境使用mc的過程中發現了不少的問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/3d\/3dfc7d6d2dcd1b0b4f648402d39e6613.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在穩定運行期間,mc容量寫滿之後性能下降比較明顯。而且數據寫得越多,性能下降越厲害。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過雲上監控發現,容量滿後,對應虛機swap開始上漲。發現雲上mc實例默認開啓了swap。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們使用mc的姿勢就是申請一定容量的mc實例,然後一直寫數據,寫滿後依賴mc自身內存管理機制來淘汰數據。有一些業務方跟我們使用姿勢不一樣,會開啓mc的swap。開啓swap,當訪問到磁盤上的數據時性能就下降的比較厲害。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"mc平穩運行一段時間後會突然掛掉?"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這個問題跟前面關閉swap有一定關係,因爲mc創建在虛機上。比如我們申請的mc實例是8G,那虛機內存8G,因爲操作系統會佔用一定的內存,所以mc實際使用的內存可能不會超過6.4G,當我們不斷寫入數據之後,由於關閉了swap沒法做數據交換,在內存分配和銷燬的過程中不斷產生內存碎片。當整個mc的內存+操作系統內存超過8G後,虛機會重啓,導致mc掛掉。這裏需要的是設置mc的max_memory的值。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"mc的穿透率不平穩,時高時低?"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"穿透率與mc的淘汰策略有關係,雖然業務也會設置mc的過期時間,但是在業務設置的mc過期之前,就存在穿透率不平穩的問題。然後發現雲上mc的淘汰策略設置的是隨機淘汰。互聯網應用大多是讀多寫少的模式。因爲我們首頁請求量很大,需要緩存來抗讀的量,越熱的數據越需要有被緩存住,如果是隨機淘汰可能會因爲淘汰掉熱數據帶來一定的穿透量。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於我們重度依賴mc作爲緩存,mc的查詢性能直接影響到上雲後系統的穩定性。因此在上雲前會在壓測平臺上拷貝線上真實環境的流量對雲上的mc進行了壓測。但是壓測的結果顯示,mc的讀性能跟idc相比有一定的差異,差異大得有點令人詫異。具體現象是壓測流量到達某個點後,mc會出現大量的數據淘汰,同時伴隨着有大量的新連接創建,CPU利用率升高,ops降低,然後5-10分鐘後又自動恢復的現象。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"跟雲上工程是溝通發現,雲上mc不是原生的mc實現,是使用redis+mc協議實現的。當然雲上有自己的考慮,使用redis主要是爲了數據持久化以及HA。但是我們知道, redis內存管理基於Jemalloc內存分配器,Jemalloc頻繁分配和釋放內存會導致內存碎片。而原生的mc使用slab的內存管理,內存滿後,淘汰同等大小的slab,直接覆蓋數據,不會造成內存碎片。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當碎片率到達一定程度會觸發內存的主動淘汰,由於主動淘汰時間過長,影響了mc的讀寫性能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外一個我們業務上使用的mc客戶端,當發現mc時延超過一定閾值後,會重新創建連接。由於服務數過多,導致單個mc上會看到大量的新建連接過來。每一個連接又會佔用一定的mc內存,從而進一步惡化整個mc的性能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"針對這個問題,雲上進行了mc的性能優化攻關,最終將mc的性能提升到跟idc的一個量級。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在mc的性能這點,其實前面的參數都不重要,重要的是我們對於雲上服務和組件應該保持的一個態度。引用一句當年美蘇中程核武器談判時,美國總統里根對戈爾巴喬夫說的一句話俄語諺語:“Trust,But Verify”。信任,但要覈實。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有點兒奇怪的一個場景,一箇中國人演講裏面引用的是美國人對俄國人說的俄語諺語。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"媒體文件的遷移。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/15\/15fef0e949e180e8a3b98ef8623c2b84.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"特點數據量大,遷移時間長,緩存數據雙寫同步按天記,數據庫同步按周,媒體要按月。數據遷移依然使用雲上的obs遷移工具,支持斷點續傳。同時還需要我們在業務層面進一步做數據對比。全量數據做etag對比,確保數據完整遷移到雲上。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"切讀依賴CDN更改域名。域名指向變更後讀流量就逐步遷移到雲上。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"切寫流量依賴上傳SDK組件變更上傳域名。我們可以下發需要上傳的域名給到sdk組件。但是客戶端會有一段時間的緩存,因此始終都會存在一部分客戶端將數據上傳到IDC,一部分上傳到公有云。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因此,對於媒體數據的遷移,我們並沒有做什麼快速切換的方案。過渡態是一定會出現的。因此主要解決過渡狀態下,數據訪問的問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e5\/e5bf233a30e2d6806ecf5c842116a4aa.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這裏有一個302系統,302系統是幹什麼用的呢?302的含義來自於HTTP 狀態碼302,是頁面跳轉的意思。這裏也一樣。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"302系統在這裏可以解決什麼問題呢?當我們在切換的過程中,首先讀肯定是讀本數據中心的obs,比如讀公有云上的obs,當公有云上的obs返回數據不存在時,經過302系統,將請求跳轉到IDC的obs上去查詢,如果有就返回,如果沒有就返回404。在IDC也一樣。有了302系統就不擔心流量在切換的過程中找不到媒體文件。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/6c\/6cb80587222dd66aa4d6ccc8d6ecfaa8.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3、把冰箱門關上:流量切換"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後我們看一下如何關閉冰箱門。流量是如何切換的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/08\/089f8d726af585caa6342a22dc02bda0.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從圖中可以看到,我們對外的API域名有兩種配置,一個是A記錄,一個是CDN。A記錄流量佔比大約30%,CDN70%。我們會將A記錄的流量當做灰度流量來操作。先將A記錄的一部分流量切到雲上,指向雲上的LB。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/95\/950721f68a68b9d6b47b281148380d67.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"穩定灰度一段時間,觀察服務狀態沒問題後,然後開始切CDN,一家一家的切。cdn切完,全部流量都到雲上。此刻檢查,IDC是否還有流量訪問。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a8\/a87d48438119f68e8f5f0207696bcb4e.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大家可能也發現了,這裏存在一個問題,切換DNS和CDN的生效時間比較長,同時回滾的話也沒辦法快速回滾。那應該怎麼解決?這時候想起了一個計算機系統設計的經典語錄:計算機系統裏面的所有問題都可以通過增加一箇中間層來解決。通過增加一層LB來解決切換生效時間比較長的問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"前面我們做了應用遷移,數據遷移,當這裏流量遷移完成後。整個上雲的步驟就結束了。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"三、雲上穩定性建設"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其實不管是在雲上,還是雲下,我們都會因爲下面這3個因素導致系統故障。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/1d\/1daa75562e53b6a5c387f0b95b44c99e.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"自然災害,物理損壞,邏輯錯誤。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在上雲的時候,我們的服務都是部署在同一個機房的,上雲完成後,爲了進一步提升服務的穩定性,我們進行了一些雲上的容災建設。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在雲上我們採用的是兩地三中心的方案。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/88\/880ffa2d91bdbb908ddbd5568d031cf7.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"針對機房級的故障,我們會在雲上的RegionA部署兩個AZ。一個AZ相當於是一個機房。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"利用雲上同步組件同步資源數據。通過雲上的DNS解析,將流量劃分到不同的AZ下。對於AZ我們也有一個主次的劃分,AZ1作爲主機房承擔70%的流量,AZ2承擔30%的流量。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"幫助我們壓測的時候可以將流量全部切到一個AZ,對於兩一個AZ進行壓測。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外爲了避免Region級的故障,我們還在Region B的一個機房裏部署了另外一套服務,這套服務通過雲專線跟Region A聯通。在RegionB我們只做核心服務的業務容災,非核心服務只負責數據容災。regionB可以保證我們遇到Region級的故障的時候,可以將核心服務切到RegionB,保證用戶的主要功能可用,另外可以通過RegionB的數據恢復非核心服務。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二個想跟大家分享的是全鏈路的監控體系。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/80\/8086f98bf13aa65a48d872fcadc2f4aa.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們的監控體系是包括客戶端和服務端的一體化監控。客戶端的監控體系在整個上雲過程中發揮了重要的作用。在前面的方案中,多次提到了灰度,回滾。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那麼在灰度操作之後,是應該繼續灰度還是回滾,我們的判斷依據是什麼?依據就是觀察到的客戶端監控體系裏面的網絡質量,錯誤率,錯誤率,慢請求這些指標。監控觸達到離用戶越近的地方,越能體現每個操作的真實影響範圍,影響程度。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們內部對於故障的定級有一套定級標準,針對影響的功能,用戶人數,時長分別對應到不同的故障等級。那回滾標準就可以簡單的依據客戶端監控的指標來判斷。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那上雲後,服務監控和基礎監控這兩塊更多的是利用雲上的監控數據來構建。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"監控報警就像我們人的一雙眼睛,可以幫助我們更快的發現系統的一些問題。對於監控報警,我們做了如下的一些優化。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c6\/c6e0c7367aa70794680d51c0dac571b7.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先是通過grafana統一了所有的監控報表,每個業務的所有相關報表都統一到grafana,當系統出現問題的時候,我們不再需要到處去找監控頁面查看。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外,使用grafana的flowcharting插件,我們將各業務從客戶端到服務端的整個鏈路的健康狀況展現在一張監控大盤上。用紅黃綠來代表服務的健康程度,綠色代表健康,黃色代表不健康,需要修復,紅色代表異常,需要立即人工介入。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然後從大盤上我們就可以清晰的看到哪些服務是正常的,哪些服務目前處於非正常狀況。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同時,針對那些不正常的服務,我們利用grafana的另外一個插件Grafana image Renderer將當時的服務狀況發送到企業微信報警羣裏,實現圖文報警。業務開發和運維就能及時介入處理問題。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"四、避坑指南"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後跟大家分享的是上雲過程中的避坑指南。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"主要想跟大家介紹下上雲的整個流程。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/0e\/0ec54b93b6abc82065d2cad7c2ff10c6.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"整個上雲流程我們分爲4步,評估調研,規劃設計,實施和驗收。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"評估調研分爲信息收集,業務分析,風險評估和遷移策略。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"評估調研主要是在設計方案前的準備工作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"信息收集分爲兩方面,一個是自身業務的信息收集,一個是公有云上的信息收集。信息收集完成之後,結合業務分析,來制定遷移策略。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在規劃設計階段,我們在設計遷移方案的時候需要考慮方案如何灰度,各種可能出問題後的回滾操作是什麼樣的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外就是準備應急預案,提前列出關鍵的可能發生問題的事件,針對該事件的處理方式。墨菲定律。然後需要準備非常詳盡的遷移手冊。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"實施階段主要想強調一下正式遷移前的演練。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後驗收,驗收階段一個是需要通過監控看到上雲後的性能情況。另外一個也是需要通過監控對上雲後的一些問題做進一步的優化。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d2\/d232197701ac5dad0ecb593e2dbff2d4.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"包括在遷移過程中和上雲後出現的一些問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"資源實例的配額,全方位的限制。木桶效應,流量達到一定程度後,總會遇到遇到最短的那塊板子。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先是可以通過全鏈路壓測來觸發雲上一些資源的瓶頸點。另外是我們的監控有時候可能過多的關注了業務本身的情況。這就要求我們將監控的範圍進一步的擴大,更完善的監控指標讓我們發現問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"雲上組件異常。沒有什麼好的方式來避免,更多的通過監控來發現問題,另外就是針對組件做好容災、降級的服務建設。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"針對人爲操作的問題,更多的是規範化流程。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"計劃與現實的差異,道路坎坷,那我們是如何做到上雲過程中0故障的了,我們的壓艙石是什麼?前面分享的具體的方案,參數,操作步驟都不重要。給大家總結的就3句話:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/60\/10\/606152e58795b76a6967dcbfd8aaf610.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過信息收集和業務分析設計可灰度可回滾的遷移方案。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"制定詳細的遷移手冊在測試環境提前演練。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在遷移過程中利用貼近用戶的監控體系及時發現問題。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Q&A"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q1:數據遷移會不會導致數據錯亂?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A1:"},{"type":"text","text":"數據遷移過程中可能導致數據錯亂的情況主要是切寫的場景。因爲兩個數據中心之間做數據同步有1-2ms的延遲,在做寫流量的切換時,如果存在在1ms以內針對同一條數據的相同字段的更新操作,且兩個更新操作分別在不同的數據中心執行,當自建機房的數據再次同步到雲上的時候理論上可能存在新的更新操作被覆蓋的情況。但是在我們這種應用場景下,這種數據錯亂的情況出現的機率會極低。主要是因爲我們的內容都是用戶主動生產的,用戶針對同一條內容的數據變更不會很快發生。所以在切換過程中不會存在數據錯亂的情況。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q2:obs的地址存放在數據庫中存一份嗎?遷移後是否需要修改?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A2:"},{"type":"text","text":"obs的絕對路徑地址並沒有保存在數據庫中,只會將obs文件的文件名保存在數據庫,另外還有一些obs文件對應的bucket等屬性。針對obs的遷移,遷移過程中obs文件名不會變化,bucket也不會變化。除非雲上已經存在同名的bucket或者bucket名稱不符合雲上的規範,我們需要變更bucket的時候,需要修改對應的數據庫中的bucket的值。其他的不需要修改。變化的只是訪問域名,遷移後只需要修改訪問域名就可以了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q3:遷移如何做數據一致性校驗?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A3:"},{"type":"text","text":"分爲兩個維度來做一致性校驗。一個是利用雲上的數據複製服務,在數據遷移過程中,數據複製服務會提供數據的存量以及增量數據的對比,對比包括行數的對比和內容的對比。另一個是在業務層面,針對遷移的數據進行業務校驗,可以抽樣一批數據進行業務層面的數據校驗。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q4:你們遷雲之後,制定了怎樣的風險策略?如何規避安全漏洞呢?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A4:"},{"type":"text","text":"這裏說的風險策略我可以理解爲我們提前做的預案。預案主要是針對遷移過程中可能發生的情況採取的執行操作。其實主要是在每一個重要操作之前問一下如果這個操作失敗了該怎麼處理?針對這樣的問題,逐一設計方案解決。另外就是針對一些不可控的因素做假設。比如專線斷了怎麼處理?專線帶寬滿了怎麼處理等等。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"安全漏洞這一塊兒因爲不是我負責的,我也不太瞭解。如果同學非常感興趣,我可以找下相關人員幫忙解答一下。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q5:請問你們的業務遷雲之後,成本是如何控制的?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A5:"},{"type":"text","text":"首先遷移前我們制定了架構平移的策略,所以遷移過程中不會刻意壓縮上雲的成本。爲了保證上雲服務的穩定,雲上所使用的的服務會略微高於IDC的配置。即使這樣的策略下,上雲後成本也比IDC要低。另外在上雲完成後,需要成立成本控制中心,從DBA,運維,一直到開發都會需要有成本控制的意識。優化已有架構,業務邏輯等等來降低成本。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q6:數據庫遷移上雲有哪些注意事項?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A6: "}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據庫遷移要提前啓動,因爲數據遷移相比應用和流量遷移的時間要更久,出現問題影響面也更大;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據庫遷移需要針對數據做一致性校驗;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有一些月表或者年表的創建腳本不要忘記了遷移到雲上;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"儘量在低峯期執行,操作要頁面化,腳本化。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"嘉賓介紹:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"靳洪兵,"},{"type":"text","text":"美圖技術專家。"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"美圖用戶產品研發部技術專家,主要負責美圖秀秀社區服務研發,主導參與美圖秀秀調度系統、內容中臺等多個項目的研發與設計。在分佈式系統、消息中間件等方面具有豐富經驗;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"曾就職於新浪微博,負責平臺研發部消息中間件、配置服務等系統的研發與設計。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文轉載自:dbaplus社羣(ID:dbaplus)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文鏈接:"},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/b8nZgMU_jXB_KkH1BgZFzg","title":"xxx","type":null},"content":[{"type":"text","text":"零故障上雲全過程再現,PB級數據遷移如何保障一致性?"}]}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章