How Does Massive-Scale MySQL Operations Management Safeguard JD's Major Promotions?

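{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"This article is compiled from the talk given by Gao Xingang at the 2020 Gdevops Global Agile Operations Summit."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/06\/06b1feef8fac30a09b47638a36db180d.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When we hear the term 'massive scale', which database topics come to mind first? Typically massive data volumes, a large number of database nodes, and highly concurrent business access. Massive data volumes raise storage and elastic-scaling problems, large numbers of database nodes make batch operations difficult, and highly concurrent access creates performance problems."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In my view, there are three general approaches to solving most massive-data problems:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, we need a management system that covers the full data lifecycle: from writes into the database, to storage, to TP and AP queries, to hot/cold data separation and real-time or asynchronous extraction into big-data systems. A suite of management tools is needed to make each of these steps efficient;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Second, we need a stable and efficient architecture. However you scale elastically in or out, and however you migrate data, everything should run within the same relatively fixed TP and AP architectural framework;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Third, we need an automated operations management platform. With a mature platform ecosystem for database management, operating databases at massive scale becomes far easier."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Today's talk picks a few of these points. I start with an overview of massive-scale operations: how we manage the data ecosystem and what our database management components do. Next, I cover how to build a highly available disaster recovery system, and how to choose a high-availability solution, when operating a large number of database nodes. I then use resource management and alert management to explain our thinking on automated operations. Finally, I show how these ideas are applied in preparing for major promotions."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"I. Overview of Massive-Scale Operations"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/8c\/8cd8b8c6b8cf0ab0dc058af37df7b7cd.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This is our company's database operations topology. In the middle is the database cluster. Whether a database is a single instance or horizontally sharded, applications reach it through smart DNS or a VIP. A single database connects directly to a primary-replica setup; a sharded database is accessed through the Sharding-Sphere or CDS database middleware, which routes to the data nodes behind it."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At the middleware layer we can meet the data governance goals of encryption and masking. For a single database, encryption and masking live in the application code; for a sharded database, the distributed middleware supports them natively."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once data has been written to the database, the components on the left and right of the diagram handle data flow. DBRep and DTS move data to the AP query platform and to the query databases that serve other business logic, such as operations back ends. On the left, Archiver is the data archiving platform that separates hot and cold data. PillBOX is the backup management platform: it ships database backups to Hadoop according to policy rules and then on to the tape library, so the whole backup lifecycle is well managed. Backups move from the online environment to the nearline environment and finally into the offline tape library; combined with retention policies and restore in the reverse direction, the backup platform covers all database backup and recovery needs."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"ind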
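ent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Below that are the DBA operations controls. DBCM is the modeling platform: all database and table creation must go through it. Without it, as you can imagine, serious data quality problems surface later when big-data systems are built on top, in both metadata and business data. The modeling platform lets us standardize and guide developers' business model design from the very start."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next is the query platform. Developers, as well as business operations staff, need to query and retrieve data from the databases. The query platform includes encryption, masking, and limits on the number of rows queried and exported, which lets us enforce enterprise-grade security and compliance controls."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"CleverDB is the performance management platform. It gives developers, DBAs, and anyone else who cares about the databases a friendly visual interface. To developers the database is a black box; visualization turns specialized performance metrics into numbers and charts they can read, which drives them to manage their own business databases. In large-scale operations, developer self-management and self-service database offerings are becoming a new operational force."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The OnLine platform is the workflow management platform. Whether it is a change or a large-scale architecture adjustment, the action must go through an enterprise-grade compliance workflow. Finally there is the automated operations platform; these platform tools are how we keep operations efficient while managing more than ten thousand servers. In short, having a complete database tooling ecosystem, and progressively building automation, standardized workflows, and intelligence on top of it, is the only way to handle massive-scale operations."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"II. High Availability for Massive-Scale Operations"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ac\/ace248ed9493bfe6819b17d86d9cdeb4.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"t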
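ype":"text","text":"Next, I want to walk through the high-availability system to show how to build a disaster recovery system, and how to choose a solution, when operating a massive number of nodes."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. High-Quality Disaster Recovery Service"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The goal is high-quality disaster recovery with an RTO of less than 10 seconds. The flow is as follows: second-level monitoring detects a failure quickly; sentinel probing then verifies that the monitoring result is accurate; third, the core switchover module is invoked to check the switchover environment, verify data consistency, perform the primary-replica switchover, update the VIP mapping, update the metadata, and send switchover notifications. The end result is automated disaster recovery and self-healing from failures."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. Good Compatibility and Adaptability"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Second, consider the compatibility and adaptability of the disaster recovery system, because database architectures inside an enterprise are never perfectly uniform. Services are tiered, and the number of primaries and replicas varies by tier; some services have cross-data-center disaster recovery, and some run active-active or multi-center architectures. A high-availability system therefore has to accommodate different architectures, versions, access methods, and resource environments."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. A Reliable Switchover Decision Model"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When a failure occurs, failure detection, analysis of the failure scenario, and the switchover decision logic matter far more than the switchover itself. The decision model helps us recognize extreme scenarios and avoid erroneous switchovers or split-brain. As an analogy, when someone suddenly falls ill, the most important things are to notice promptly that something is wrong, to get them to hospital promptly, and for the doctor to diagnose the problem promptly; only then can emergency treatment begin. The decision model performs exactly this detection and diagnosis."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4. Availability of the High-Availability Service Itself"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Fourth is the availability of the service itself. If an entire data center goes down, the high-availability service goes down with it. You need to be able to bring the high-availability service up in another data center, because data-center-level disaster recovery can only start once that service is running. This capability also needs to be planned for."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5. Automated Platform Management"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Fifth is automated platform management. Unified configuration management, together with features such as switchover progress and history queries, improves disaster recovery efficiency in large-scale database operations. Rich APIs and visualized information also provide good compatibility and adaptability."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","tex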
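t":"6. A Rich Set of Disaster Recovery Types"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Last is a rich set of disaster recovery types. Primary failover is the common case; MHA, for example, provides primary failover but not replica failover. The system must also support both manual and automatic switchover, because we run switchover drills, frequently so when preparing for major promotions. Primary-replica switchover, primary-standby switchover, and batch cross-data-center switchover all need to be supported by the high-availability system."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"III. Massive-Scale Resource Management"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/98\/98446433f96fc60343a3d3bfce3d89d0.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When we were very small, we managed database resources in Excel. As the company grew and the number of database nodes increased, and once the automation platform was in place, we found that metadata along many dimensions had to be managed."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. Automatic Resource Reporting"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, resource nodes report themselves automatically. Suppose the company buys a hundred machines; how do they get into the system? We install an agent on every machine, which reports its IP automatically. Information such as the machine's network segment, data center, and rack is pulled through the API of the IT operations system, so that the record is complete."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. Server Usage Status Management"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Second is server usage status: which services are running on the server, whether it has reported errors, whether a repair has been requested, and how far the repair has progressed. All of this must be recorded in our system."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We have had this kind of problem before: a machine was due for repair, but because the metadata was inaccurate another service was still running on it, and that service was interrupted when the machine was shut down. Metadata accuracy is therefore critical."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. Mapping Databases to Services and Developers"}]},{"type":"pa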
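ragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Third is keeping the database view and the business view in sync and linked. If the database records only describe the architecture, the mapping to services is missing: which service a database belongs to, which other services access it, and which developers own those services. We need to link all of this into a lineage graph, or knowledge topology, so that everyone can see clearly what the metadata looks like."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4. Metadata Change Management"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Fourth is metadata change. Primary-replica switchovers, server repairs, and server renames all change metadata, as do changes of development owner, company reorganizations, and department updates. Each of these triggers a series of metadata changes in database management."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5. Asset Management"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Fifth is asset management. DBAs should be able to report database server usage to each business department, telling its head how server assets were used over the quarter or half-year. Reports or views of this kind also need to be provided."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"6. API Services"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Last is API services. The database platform is just one platform among the IT systems, so it has to exchange information with the others; only through that exchange can data across the whole operations system be linked together."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"IV. Massive-Scale Alert Management"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/40\/4040c4c54357fa729748e430857ffdaf.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Alert management is another critical dimension of operations. With a very large number of machines, we receive many alerts every day. In private conversations people often say they received several thousand alerts today, as proof of how large their company's database estate is. I think that is the wrong message. The right one is: our database estate is very large, yet we receive only a few dozen alerts a day, which shows how strong our operational control is."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"t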
ype":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How do we do it? Through alert management: a series of methods that reduce the number of alerts so that the effective ones are exposed directly and clearly instead of being drowned in a flood of alerts."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. Triggering Alerts"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Many companies' monitoring and alerting systems are still at the first stage: define alert baselines for each metric, set alert levels, and send SMS alerts to the relevant business owners and operations staff. Nothing is filtered at this stage; whenever a baseline is crossed, an alert goes out."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What happens in practice? Take primary-replica replication lag. When you run a large transaction, the lag can last a long time, perhaps half an hour or an hour, and during that half hour you may receive an alert every two or three minutes. The operations engineer does not care about these alerts: they know it is a large transaction and that it has to run to completion. That leads to the second method, aggregation and analysis."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. Aggregation and Analysis"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What is aggregation and analysis? Put simply, it is alert convergence: consecutive duplicate alerts are collapsed so that recipients receive fewer messages."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Another part is single-cause, multiple-effect analysis. When a machine goes down you receive a host-down alert, and along with it a pile of others, such as abnormal connection counts and broken replication. The only really useful alert is the host-down one."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So we use single-cause, multiple-effect correlation analysis to find the root cause. This is one of the root cause analysis methods we use in operations: the model identifies the most fundamental cause and sends it to the relevant operations staff, so that the useful information is not drowned in a flood of alerts."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. Extrapolating from a Single Point"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Third is extrapolating from a single point. There are two scenarios. The first is the one I just mentioned: when a large transaction runs on the primary, I can predict that every replica will lag for the next half hour. The operations engineer can step into the alert decision and tell the system that these replicas need not send lag alerts for the next half hour."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The second is the disaster recovery switchover drills we run before a major promotion. A primary-replica switchover inevitably triggers alerts, and the drills are usually run at night, so sending those alerts would disturb many people's rest: business staff, related or not, would all receive the switchover alerts. Planned switchovers are predictable, so we adjust the alert parameters in advance to avoid this."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4. Adjusting Baselines"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next is baseline adjustment. Before the peak of a major promotion such as Double 11 or Double 12, we can predict that alert baselines and alert parameters will need substantial adjustment. Concurrency and transaction timeout thresholds, for example, may need to be changed to accommodate the peak load, so that your phone is not flooded with SMS messages the moment the promotion starts."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5. Precise Alerts"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After this series of adjustments and model building, the result is fewer useless alerts and less alert fatigue for operations and development staff, as well as a lower cost for alert SMS messages."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"V. Preparing for Major Promotions"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/80\/8038110169e440c4bd94f24907de6206.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Now let me share our overall process for a major promotion, which is divided into three parts:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The first is preparation, which generally starts a month to a month and a half in advance. With servers at this scale we cannot manage every database node one by one, so how do we operate them?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, empower developers so that each business team can manage its own databases. Give them guidance, or something like a database inspection report, so they can see what is wrong with their own databases and run a first round of optimization, which reduces the pressure on database operations as a whole;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Second, focus on the databases on the core path. At JD during Double 11 or Double 12, for example, the core path runs through payment, Baitiao, and shipping insurance; these services need direct DBA attention;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Third, DBAs should keep checking that the operations management systems themselves are running normally: monitoring and alerting, disaster recovery, backup and restore, and change workflows. Keeping these base services stable, that is, monitoring the monitoring and serving the services, is very important;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Fourth, contingency plans."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If DBAs keep these points well under control, the promotion itself should be relatively easy."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure below shows the key points to watch during a promotion: the aspects to check when supporting database operations for a promotion or running a database inspection."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d3\/d30f29575459de1ee58f9415f14989de.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"para
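graph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Several items can be handed to developers: inspection reports and the visualization platform push the relevant information to them. Auto-increment usage, table partition information, and slow queries, for example, can be sent to developers so that they make the adjustments themselves."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For DBAs, the main thing is the last item, “business review”: helping developers work out whether the service has a hard dependency on the database, how many reads and writes a transaction performs, what the upstream and downstream logic is, and whether circuit breaking is possible. DBAs also need to check whether a failure in one business system would affect other teams' systems."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Another is inspection of hardware and data centers. Most problems during a promotion tend to be hardware problems, most often disks and RAID cards. A RAID card's battery charge/discharge cycle can stall the database for several seconds or even half a minute, and during a promotion that is catastrophic."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/84\/84cd039ed5e60ec7cfeecd3da9203268.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To summarize: whatever other preparation you have done, these six points must be addressed before a major promotion."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Disaster recovery switchover drills. Verify through pre-promotion drills that the high-availability system works and that same-data-center and cross-data-center switchovers run smoothly;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Performance optimization. Developers may make optimizations from their own perspective; DBAs should review from a governance perspective whether those optimizations are sound;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Stress testing. Every year before a major promotion we run a group-wide exercise: the whole group and the whole business chain are stress tested to expose the peak load on each business line and system, such as the maximum peak we can sustain;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data governance. More data is not always better. We can separate data, and we should bring message queues (MQ) or caching into the read and write paths so that reads and writes at the database layer stay relatively smooth;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Resource control. Not every company is as well funded as a state-owned enterprise; most are careful with server assets. We need to review the utilization of every machine and decide before the promotion whether to scale out or in, because many services need to scale out and need machines during the promotion, so resources must be allocated sensibly;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Business optimization. Reduce the number of round trips per transaction, simplify transaction logic, and prepare circuit-breaking and degradation plans. Pay particular attention to connection pools, a frequent problem during promotions: at the peak moment database connections are exhausted, often because application-side connection pools are too large. These are the points to prepare for a major promotion."}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Q&A"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q1: We run primary-replica databases and scale them out through a platform. Because the data volume is very large, scaling out takes a very long time. What do you suggest?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A1: "},{"type":"text","text":"Scaling out should not affect business access performance before the cutover. Say you split one database into two shards: data has to be migrated from one primary onto two nodes. First, the data synchronization must be controlled; do not pull data from the primary aggressively, and watch I/O and concurrency pressure. Second, once both the full and the incremental data have caught up, choose a suitable time for the business cutover, which keeps the business impact to a minimum. Many distributed databases scale elastically; where does that elasticity come from? They move data quietly in the background without affecting the business."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q2: What is your view on containerizing databases?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A2: "},{"type":"text","text":"Good question; JD Mall is probably a case of database containers used at large scale. Containerization depends on the scenario. JD Mall uses containers at scale, but at JD Digits the degree of containerization is very low. Why? JD Digits deals with payments and finance, and most of its databases still run on physical machines. Whether containerization is appropriate depends on the scale and size of the databases and on the business scenario."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q3: With databases as large as those at JD Digits, when do you shard databases and tables?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A3: "},{"type":"text","text":"Also a good question, and there is no standard answer. In our company the basic threshold is 50 million rows in a single table. Above that, the DBA proactively contacts the developers to discuss performance and horizontal sharding. In the other case, developers notice in production that response times no longer meet business requirements and come to the DBA themselves. If a series of optimizations and adjustments still cannot meet the requirements, we help them shard horizontally."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q4: For database backup and restore, we may only have the first two of those items in place. Do you have any other considerations in this area?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A4: "},{"type":"text","text":"Backup and restore really is a headache, especially with large data volumes. What should we do? Backup storage needs tiers. If you do need to restore data, take the backup from local disk where possible; the restore is faster because it avoids network transfer. We also require every backup to go through a restore once a quarter as a backup validity check. The database restored during that check can often help you retrieve data quickly: when you need to recover data through a restore, a copy of that backup may already have been restored a few days earlier during the validity check."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q5: You said you have not containerized. A single node presumably hosts more than one instance, so how do you handle that? Do multiple instances interfere with each other?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A5: "},{"type":"text","text":"We do run multiple instances per machine. Resource isolation was a consideration when we chose this approach, and there is no particularly good solution at present. The most effective measures for us are to avoid building the disks into one large RAID group, which gives I/O isolation, and to use cgroup to isolate CPU, memory, and network."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q6: What if service A affects service B, that is, A's resource usage is very high and it impacts B's database?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A6: "},{"type":"text","text":"Then we generally migrate it. A service is placed on a multi-instance machine only because its workload is very small. In the situation you describe, we either reduce the load through optimization or migrate the service away, because it may genuinely no longer be suited to a multi-instance machine."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"About the speaker:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Gao Xingang is head of the database team at JD Digits, responsible for managing and maintaining the JD Digits database platform. He has led the team through multiple 6·18 and 11·11 promotions without incident. He has extensive hands-on experience in database architecture design for varied business scenarios, high-concurrency solutions, and data ecosystem governance, and has explored in depth the design and development of database middleware, distributed transactional databases, and automated, intelligent operations platforms. He has long focused on productizing databases and on research into Chinese domestic databases."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"conten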
t":[{"type":"text","text":"Reposted from: dbaplus community (ID: dbaplus)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original link: "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/i8JPjo5NWEQKLMOm-uvidg","title":"xxx","type":null},"content":[{"type":"text","text":"How Does Massive-Scale MySQL Operations Management Safeguard JD's Major Promotions?"}]}]}]}
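As a supplement to the alert-management section (四、海量告警管理), the three filtering ideas described there can be sketched in a few lines: collapsing consecutive duplicate alerts, keeping only the root cause when one host-down event produces many symptom alerts, and letting an operator silence alerts that are expected in advance. This is a minimal illustration under stated assumptions, not the platform described in the talk: the `Alert` type, the alert kind names, the `SYMPTOMS_OF_HOST_DOWN` rule set, and the 600-second window are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    host: str
    kind: str  # hypothetical kinds, e.g. "host_down", "replication_lag"
    ts: int    # timestamp in seconds


# Hypothetical rule: on a host that is down, these alerts are only effects.
SYMPTOMS_OF_HOST_DOWN = {"connection_error", "replication_broken"}


def converge(alerts, window=600, silenced=frozenset()):
    """Return the alerts worth sending, in time order.

    silenced: set of (host, kind) pairs an operator muted in advance,
              e.g. lag alerts during a known large transaction or a drill.
    window:   a repeat of the same (host, kind) within this many seconds
              of the last sent one is dropped as a duplicate.
    Simplification: a host counts as down for the whole batch if any
    host_down alert for it appears anywhere in the batch.
    """
    down_hosts = {a.host for a in alerts if a.kind == "host_down"}
    last_sent = {}
    kept = []
    for a in sorted(alerts, key=lambda x: x.ts):
        key = (a.host, a.kind)
        if key in silenced:
            continue  # expected in advance, so not sent
        if a.host in down_hosts and a.kind in SYMPTOMS_OF_HOST_DOWN:
            continue  # single cause, multiple effects: keep only the cause
        prev = last_sent.get(key)
        if prev is not None and a.ts - prev < window:
            continue  # duplicate inside the convergence window
        last_sent[key] = a.ts
        kept.append(a)
    return kept
```

Feeding it three lag alerts two minutes apart plus a host-down event with two symptom alerts would leave one lag alert and the host-down alert; passing the lag pair in `silenced` models the operator intervention described under 以點推面. A production system would more likely derive the symptom rules from topology metadata than from a fixed set.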