A Textbook Example: Building a Bank's Fault-Tolerance and Disaster-Recovery System and Designing Practical Drills

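{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"I. What does fault tolerance and disaster recovery cover, and why is it a system?"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. Business continuity requirements and implementation: fault-tolerance and disaster-recovery capability is the foundation of business continuity"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let us start with what fault tolerance and disaster recovery cover, and why they form a system. The background is business continuity management (BCM). Many readers will know the term already, and standard definitions are easy to find online. The point to stress here is that BCM targets unplanned events that affect a company's business: it prepares targeted continuity plans for them and so keeps operational risk under control. As we see it, it is also a means of providing customers with quality service."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/02\/02949b99ac3d519fd87d0b8f03613424.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Concretely, the system is built around the scenarios in which things can go wrong."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The first is the low-probability, high-impact disaster scenario, which is what disaster recovery conventionally covers. Suppose the primary data center loses power or network connectivity and every system has to switch over to a disaster-recovery center or to the peer center of an active-active pair. This is very unlikely, but the impact is bound to be very large."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Situations with a fairly high probability and a fairly large impact belong in the system too, in my view. Serious faults and errors fall into this category. Disaster recovery is not only about switching the whole site over to the backup center, which is why fault tolerance deserves special emphasis here."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As for quality service to customers, there is also a class of problems where the system is running normally but a logic bug in the application, or bad reference data, prevents some customers from completing their transactions. I think this should be covered as well."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So, as I understand it, building real-world fault-tolerance and disaster-recovery capability involves three strands of work."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, architecture and control."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Reliability design of the overall system architecture is the foundation. On top of it we need availability control of the services, meaning the ability to detect, locate and handle problems. Only the two together meet the demands of real incidents."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Second, drills and issue tracking."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The scenarios we define will not necessarily work as intended, so we need a cycle of drills, issue tracking, root-cause analysis, codified standards and iterative development. This cycle is not a one-off crash effort; it has to be part of day-to-day operations."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Third, making it routine."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Financial institutions often draw up a plan for a few drills a year, prepare thoroughly beforehand and then run them. A real failure never unfolds the way a drill does, so the capability has to be maintained through everyday working mechanisms; only then can an incident be handled whenever it happens. That is the short version."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. Main fault-tolerance and disaster-recovery scenarios: access, security, scheduling, composition, paths, applications, sessions, data and infrastructure"}]},{"type":"image","attrs":{"src":"https:\/\/static001.ge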
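ekbang.org\/infoq\/83\/8330cb1a34383a55bd3965ae7aaccb31.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Which scenarios does fault tolerance and disaster recovery mainly cover? Several were mentioned above. Here are a few examples, many of which readers may have run into themselves. Start with the application system: what can go wrong as customers connect to it?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There are security problems, scheduling problems such as the familiar load balancing, and problems in how transaction logic is composed. Banks are especially exposed to the last kind, because a banking transaction is complex and is assembled from several application systems."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Disaster-recovery or multi-active data centers inevitably introduce multiple paths. That raises the questions of how to handle those paths, what baseline conditions the application itself must satisfy, and how session modes are controlled."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Most important of all is data. A running application produces both temporary and permanent data, and temporary data raises the question of stateful versus stateless handling."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A transaction may be asynchronous, with request and response travelling over different channels. If its data lands on one path and the other path cannot find it, the transaction is likely to fail. All of this has to be solved."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Then there is the most basic layer: the figure above also lists many technical-environment issues, all of which are scenarios to consider. I will not go through them one by one. Some we have experienced ourselves, for example the composition of transaction logic. The credit process, highlighted in color in the figure, is a fairly complex business process."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"co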
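ntent":[{"type":"text","text":"However an application is developed, it cannot cover every error scenario. When a teller makes an input error, for instance, the whole process cannot simply be restarted from the beginning; it has to be dealt with through an ad hoc error-handling mechanism. Our engineers handle a large number of such cases every day."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Banks also have something of their own called online batch transactions, that is, bulk payments and bulk collections such as the familiar payroll run. When the volume is very large the job is split into many files that are processed concurrently, and if the locking and control of that concurrency are not handled well, problems are certain to follow."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Another point concerns permanent data. People tend to focus on transactions in multi-active data centers, but a complete disaster-recovery center must do more than carry the transaction load: it must also keep its data consistent, especially long-term historical archives."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Banks have to store a great deal of transaction history as well as other data such as credit records. Under regulatory or accounting rules this data must be kept for at least 15 years, sometimes longer. The data centers must be peers in this respect; otherwise the second one provides no disaster recovery at all."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"These are just the scenarios I have listed briefly. Each one may raise different problems in different organizations and has to be assessed against real experience."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. Composition of the system: fault tolerance and disaster recovery is capability building, not resource provisioning, and it is a process"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Why is it a system? As I understand it, fault tolerance and disaster recovery is a capability to be built, not a set of resources to be provisioned."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/9a\/9a138412b396c95a5777fa8dfaf4e385.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":nul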
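l,"origin":null},"content":[{"type":"text","text":"This capability actually grows out of everyday events. Day to day we cannot necessarily foresee a disaster or a major failure; it may begin as a minor event or an early sign. So the first thing we focus on is the monitoring event. By that I mean the alerts that fire routinely, which either need no action or are relatively simple to handle, such as clearing an application queue and restarting the service afterwards."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Fault-tolerance scenarios are more complex and their impact is more serious. Handling them requires both human intervention and coordinated processing across systems. Some transaction chains are long, linking systems A, B and C in series; when A is handled, B and C downstream must be handled along with it. That is a fault-tolerance scenario."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Disaster recovery concerns events that severely affect service. Before, during and after handling, the relationships inside and outside the systems and across all business scenarios have to be fully controlled; it is a joint effort by technology and business. This is also why BCM, mentioned above, is driven by business scenarios and has to be broken down in detail."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We have seen this in some of our own incidents: they were not purely technical problems, and the business side needed its own set of preparations. Regulators likewise require business continuity to be considered from the business perspective."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When designing an application system we must consider reliability design: how reliability is designed into the system as a whole. Once designed, it must go through destructive testing, which assumes that a fault-tolerance or disaster-recovery scenario has occurred and checks what happens and whether the design is effective."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After that step comes adaptation. As noted above, availability control matters a great deal, so the supporting monitoring, management and control system has to be in place, and the result has to be standardized operating procedures. Only then can front-line engineers on duty spot these scenarios quickly and handle them as required, which is what turns the design into real capability."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The last step is tool-based verification and training, and it is iterative. Drills and real incidents will certainly expose gaps, because every system has them. Each gap has to be traced back to the original reliability design, and the technical standards have to be refined continually to control design quality across the system and raise overall availability. That is why fault tolerance and disaster recovery is called a system."}]}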
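,{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4. How fault-tolerance and disaster-recovery management relates to operations management, and how management data flows between them"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So much for the concepts. How do we put them into practice in daily work? We are still summing up our own experience here and pushing our work forward along these lines. How does this combine with everyday operations, in other words how do we make it routine?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/16\/1689961bebe96d1b393d6c271981ae49.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Operations teams work with a full life-cycle management process: we design the system together with development, it goes into production, and then the operations body of practice applies, including ITIL incident and problem management as well as change and configuration management. The outcome must be an effective configuration management database (CMDB), which is essential."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How to build out a CMDB is a large topic, but it can be scoped to our own needs. My focus here is on detecting and handling events, so the information in it mainly serves monitoring. From the configuration information we generate the monitoring system's own configuration store: monitoring has to know which systems it monitors, which of them have standby provision, and which have scenarios defined for handling them."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next comes a set of monitoring tools and methods, and the same configuration information has to be passed on to what we call disaster-recovery management."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Disaster recovery is a subset of monitoring. Any monitored item that relates to the fault-tolerance and disaster-recovery scenarios described above must be pulled in and given extra checks to confirm whether the scenario has actually occurred. If it has, there must be a handling procedure. Where operations automation tools exist, use them, but standardization must come first: whether the work is done by hand or managed by people, standardize, then automate, then act. Handling then loops back to invoking the operations tools that have already been set up, as the figure above outlines. The aim is to merge all this with normal operations work, make it routine, and turn it into genuine real-world capability."}]},{"type":"he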
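ading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"II. Architectural reliability design is the foundation of fault tolerance and disaster recovery"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next is reliability design of the architecture, which is the foundation of, and a precondition for, fault tolerance and disaster recovery. Without it there is no real high availability to speak of. Architecture design necessarily differs from system to system and cannot be treated as one size fits all."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. Differentiated reliability design"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a8\/a8de05790c0b12fa0f32f5ca40d6c492.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Service access channels: "},{"type":"text","text":"Taking a typical bank as an example, access systems are the familiar online banking, mobile banking and so on, which serve customers directly. They are characterized by multi-active deployment, elastic scaling and multi-path access."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For disaster recovery, session consistency is the classic concern, and we have hit it in our own practice. With active-active operation across centers, for example, some systems cannot simply be switched over, especially when the outbound and return paths of a synchronous or asynchronous transaction are not the same and a load balancer that pools connections sits in between. How to guarantee consistency in that situation needs particular attention."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I will not go deeper here. Topics such as load balancing have been covered in depth and very professionally on our forum and in earlier articles I have read."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Session consistency needs both the load balancer's policy configuration and matching application code from development; the two must fit together. We have had problems precisely because they did not, which left sessions inconsistent across paths. Sometimes only a small number of customers see their transactions time out or fail."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number"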
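:0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Business logic platforms: "},{"type":"text","text":"In a bank, typical examples are service platforms such as online banking, or the credit business platform. These demand a high degree of consistency in business logic, and their data-processing logic is also fairly complex."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For reliability design, the concern is the multi-path consistency of landed data mentioned above. Many transactions have to write files to disk; not everything can travel as messages. Once a file is written, what happens if the inbound and outbound legs land on different paths?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A common approach today is a shared file system on NAS storage. In the detailed design, each business channel should write to separate directories for inbound and outbound files so they do not get mixed up, which also makes change releases easier. That is a detail I will not expand on."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Core accounting services: "},{"type":"text","text":"Banks and other enterprises alike have their own core systems. The deeper a layer sits in the architecture, the stricter the requirements on transaction time and completion speed. Reliability design at this layer varies from company to company, as do the number and skill mix of engineers and the resources invested. Given our own situation, we chose a simple, controllable model."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Take the HA design of a core system running on midrange servers. Because senior technical experts are scarce in our organization, we did not adopt a faster but more complex solution; we chose the fairly traditional model of HA plus database resources. The resources placed in the HA group must be kept to a minimum, just basic storage and the database, so that switchover time and success rate are predictable. Any switchover completed within 10 minutes is acceptable to us."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Data warehouse and analytics: "},{"type":"text","text":"A data warehouse is not an online transaction system, but it produces important outputs such as regulatory reports and back-office general-ledger data, so its data sourcing must be coordinated with disaster-recovery switchover. Once an upstream business system has switched away, a poorly controlled extraction may pull the previous day's data or simply the wrong data. Worse, nobody may notice, and the figures or totals finally reported will be wrong."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null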
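,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This may not sound hard, but mapping data distribution paths is difficult in any large enterprise, including any bank of reasonable size. Who sends what data to whom? If the supplier has switched away and its IP address has changed, can extraction follow it accurately? All of this takes a mechanism, which is the point stressed earlier about fault tolerance and disaster recovery: unlike routine monitoring, it requires coordination across systems. Differentiated reliability design therefore has to take every type of application system into account."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. Top-down reliability design of the whole"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/46\/4695818d1dd2b1d9511a098e449b603e.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Architecture has to be considered top-down and as a whole. The figure above shows the structure of a typical bank's transaction systems: how customers come in, how requests reach the business platforms and then the back-end core systems, all through the enterprise service bus, and finally the back-line systems such as the data warehouse. Each hop raises the question of how to handle the different session modes mentioned above."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(1) Access"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Some external services are delivered directly to customers; others are interconnections between institutions, such as a bank and a third-party payment provider. For both C2S and S2S access the target has to be addressed by domain name. This matters most for the disaster-recovery center: when external leased lines switch from the primary center to another center, the carrier cannot guarantee the same public address, so domain names and ports need a plan. There is also fault tolerance for the IPv4-to-IPv6 transition, which we are working on now. These are all access issues."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(2) Session modes"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Session modes involve two things. One is coordinated control of disconnection and reconnection for long-lived connections, mentioned above. In a complex landscape some links need reconnecting and some do not. This has to be established through routine trials and, once confirmed, written up as operating rules for the disaster scenario."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The other is multi-path balancing and flow control for short-lived connections. When traffic enters through multiple paths across two or more centers, how is the load balanced? Suppose a back-end system such as core transaction processing switches from center A to center B. Since that system now sits on a single local network, the weights at the front access layer must be adjusted: raise the share of the faster path and lower the share of the slower one to keep the external service available."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(3) Temporary data"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"On temporary data, note the transaction types shown at the top left of the figure above. There are one-way online transactions and two-way online transactions. Two-way means each side acts as a client of the other, so the ports to be opened and the reconnection mechanisms differ."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There are also one-way batch transfers, which move data as landed files, and two-way ones. All of them must take into account the stateful versus stateless control mentioned above."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Then there is concurrent file processing. Concurrency exists to improve throughput, but its locking must not go wrong. We have had our share of problems here, and we improve the mechanism using what we learn from resolving each one."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(4) Flow control"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With multi-active, multi-path operation a flow-control method is indispensable. Without one, abnormal traffic or congestion on one path sets off a vicious circle that cannot be contained. This is a question for the architecture as a whole."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. Reliability design of scheduling between application systems"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/87\/874ecaa918e896be94ef05fab4d860ae.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How application systems are scheduled and controlled is familiar ground: it comes down to load balancing. Whether to use open-source software, a commercial product or a pure hardware appliance depends on your own traffic and performance requirements. For critical business systems, banks generally choose dedicated hardware appliances."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(1) Design requirements for load-balancing systems"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When choosing a load-balancing system, the first consideration is reliability: redundancy with no single point of failure, observable running state, and so on."}]},{"typ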
e":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next is performance. The minimum configuration, a single center or even a single device, should be able to carry the full concurrent load, so that peak-concurrency business is still served in an extreme situation."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Then there is control: the system should expose standard APIs so that start, stop, operation, switchover and data collection can all be automated. Those are the design requirements for load balancing."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(2) Strategies for using load-balancing resources"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dispatch scheduling applies to short-lived connection requests. It provides weighted traffic distribution, monitoring of back-end resources, rescheduling when a service fails, and protection against transactions crossing between flows."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Seamless switchover applies to long-lived connections. It provides back-end status detection, smooth handling of front-end connections under abnormal conditions, and session persistence."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There is also the cross-center case, where back-end services sit in different IP address ranges and weighted scheduling has to allow for the differences between layer-2 and layer-3 connectivity."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4. Reliability design inside the application program"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e3\/e357b2bc03bed9ca8480d20862f4622c.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What we call an application program is different from an application system. The application program can be thought of as the code."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":n
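ull,"origin":null},"content":[{"type":"text","text":"Here are some issues in reliability design inside the application program:"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(1) Keeping access consistent"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"One issue is controlling duplicate access by users and systems. We have seen it ourselves: in a multi-active, multi-path setup a user can end up logged in twice, which may cause business problems."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Another is design that matches the load balancer. As noted above, when the load balancer is configured with a given parameter, the back-end application must be designed and coded to that same mode."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A third is how access across layer 2 and layer 3 is implemented."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(2) Reliability of business logic"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Business logic is a matter for the application code. Many developers do not necessarily think about boundary conditions or extreme cases. If the business logic is interrupted abnormally, can it resume from the breakpoint? Are there mechanisms for inquiry, reversal and error output? Duplicate bulk payment deserves a special mention because it has happened to us: after an abnormal interruption the lock failed, which can lead to payroll being paid twice."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(3) Log integrity control"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Log integrity control is important. An application system should log normal transactions too, so that the logs can be used when investigating application or business problems. Error logs must carry complete information, and log output must be standardized."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Cross-site, multi-path log collection deserves special emphasis. Many transactions are asynchronous, entering and leaving through different channels, so tracing a problem with a customer's account, or any other complex problem, is hard. If the logs are scattered across individual machines, business and technical staff have to log in to each one, and piecing the whole transaction together is very difficult. Cross-path collection is what makes log collection and analysis useful."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5. Reliability design of the application system as a whole"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\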
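/infoq\/df\/df797407126e92ee98bf3f6d893b2940.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Why application system? Because the program code is installed into a complete set of systems. The outward-facing part of the application, dynamic or static, goes on the outward-facing server, and the application program goes on the app server. A bank also has its core system, many of whose batch programs are installed on its DB server. The whole network environment, the load balancers and DNS all have to be configured, as do application-related network access controls, HA and so on. Only when all of this is configured do we call it an application system, and only as a complete whole can it provide service and do its job."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(1) Monitoring application service status"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For application-system reliability, the first requirement is that the state of application services can be monitored and collected; otherwise we cannot be sure they are running properly."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"That is easier said than done. The simplest check is whether a process, its service or its port is present, but being present does not mean it is working: it may have hung without our knowing. So some method is needed."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"One method is to have the application write a heartbeat to its log, reporting an all-is-well status every 10 or 20 seconds. If no heartbeat arrives for a set period, the service may have hung. A decision is then needed: a simple case is handled as a monitoring event, a complex one through fault-tolerance or disaster-recovery handling."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(2) Controlling calls to system resources"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"An application program also calls system resources, chiefly databases, file transfer, and encryption and decryption devices. Wherever there are call relationships and structure there is unreliability, so improving the reliability of those calls has to be considered."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(3) Keeping back-line systems running"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This concerns data unloading for the warehouse after a switchover. Once the core system has switched, the unloading jobs and the data warehouse behind them must follow, and so must data backup. As mentioned earlier, in a dual-center or multi-center setup, centers with equal transaction capacity must also hold equal data."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Whatever full backups and archives center A holds, possibly 30 or 40 years of data, center B must hold too. If an application system goes down, customers may be unable to transact for a while, which only affects customer experience. If backup data is lost, it can never be recovered, and that is catastrophic. These are the points to weigh together in application-system reliability design."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"6. Scenario-based reliability design for databases"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/59\/59a1cfe08789d5b0eaa063c037c78a79.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"An application system, as the name implies, is the application program plus its resources. The most important resource, and one most application systems cannot avoid, is the database. Databases of critical systems are under heavy load: a bank's core system runs 24 hours a day with no window set aside for maintenance. Data volumes can be very large, and full strong consistency is required, so the possible failure scenarios have to be thought through. A short list: database hang, server hardware failure, same-city site failure, database logical failure, database software bugs, city-wide disaster, and other unforeseeable problems."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"One way of handling these scenarios is a local restart, that is, restarting the host after it fails. I will not go into the others."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure above mentions clones. A homogeneous clone runs the same software: if the primary is Oracle, the clone is Oracle too, so that switchover works cleanly. Our goal is a heterogeneous clone, mainly as protection against bugs in the database software. That covers the database scenarios; each one would have its own detailed design if expanded."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Primary-standby HA switchover is usually seen as the most traditional and direct approach, yet maintaining an HA setup for a database on midrange servers is very complex. Any overlooked detail can make the switchover fail, and that is hard to deal with because nobody knows what state the database is in. That is scenario-based reliability design for databases."}]},{"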
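type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"7. Reliability design of the network infrastructure"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/f4\/f44b770e1935d63309c6572b887b8ce9.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"One layer down is the network infrastructure for what banks call two sites, three centers, or possibly more centers. Our own setup is fairly traditional: three centers in two cities linked by optical fiber. The same-city centers are close enough for synchronous storage replication; the remote center can only replicate asynchronously. In other words, the RPO to the remote site is not zero, while the same-city RPO can be zero."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As noted above, the architecture has to reflect the characteristics of different systems. Layer 2 may be extended between centers for a small scope, so that core systems can be placed inside it. All their devices in the primary center and the same-city disaster-recovery center then share one IP address range, which makes it easier to switch front-end connections and forwarding and keeps transaction latency relatively low."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The scope must be kept tight, because extending layer 2 carries risk, as everyone knows. Spanning-tree problems can be solved nowadays, but we still limit the number of hosts inside the extended layer-2 domain. With few hosts, mostly midrange servers that seldom change, the chance of trouble is small."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Everything else, such as the familiar online banking, must sit at layer 3. Web, app and DB tiers alike have to solve the problem of access after switchover, and components that cannot use DNS need some other approach."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When applications connect to databases, some of our databases sit at layer 3, so the primary and standby databases are necessarily in different IP address ranges. A stopgap or otherwise simple method is needed."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"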
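align":null,"origin":null},"content":[{"type":"text","text":"We keep two variants of the application's database connection file, suffixed .a and .b. Depending on the situation, one of them is copied over the live configuration file, suffixed .online, and the application is restarted so that it connects to the intended database. This is of course an operational detail of disaster-recovery switchover that has to be controlled."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The layer-2 extension also needs a dedicated channel for storage, so that backups to the same-city center travel over an uncongested path without affecting the network used by transaction systems."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"8. Reliability design of storage resources"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c2\/c2b2b9a64dd3b786b0a8f1be4b91ab0c.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The next layer down is storage. Storage comes in many forms, including distributed storage and software-defined storage, but this is not just a list of options. As the right side of the figure above shows, we need to know where application data is placed when it arrives at the front end, the channel tier."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The temporary landed files mentioned earlier are one kind of storage requirement, and here the data divides into structured and unstructured."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What does the middle business tier require, in terms of performance and dual-center operation? For disaster recovery, stateful components are almost always the most demanding. One tier further forward may be stateless, in which case disks that are fast enough will do."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Further in is the tier of true core transactions, which also has structured and unstructured data. How should it be treated? And how should the back-line long-term archive be handled? Applications in that environment may not need to be active-active, but their data must be replicated across."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"nu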
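mber":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once these requirements are clear, storage resources are configured and managed according to the tier and characteristics of each storage type, covering capacity as well as the reliability and feasibility of switchover. Those are the storage questions."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"9. Peer design for data backup and archiving"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/cd\/cd4b0fd07b663d370b4b1a1594a1191f.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As mentioned above, the two centers must be peers in backup and archiving of the data itself, so archiving and backup need specific attention in the reliability design of the architecture."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Backup is well known to consume a great deal of bandwidth, so the network design should reserve a separate layer-2 channel for it. The total bandwidth of the dark fiber is limited, so some bandwidth control is needed; I will skip the other details. Backup jobs must be fully schedulable across both centers, and after a switch from one center to the other they must be able to run in the reverse direction."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Suppose an application switches from center A to center B. Its backup jobs then write to center B, but the backups must also be delivered asynchronously to center A."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Asynchronous delivery means a time lag. In the worst case a center loses power within that lag and the archive for that period is missing. Because backups exist, the data cannot be lost entirely, but a mechanism is needed to identify the archive segments missed during the outage and backfill them to center A, so that both centers hold complete and consistent data."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This takes a great deal of manpower as well as system work. When we later came to evaluate domestically developed backup software, we found in discussions that others were raising the same requirements. Earlier backup software never considered archive consistency across multiple centers, so its management features in this area are weak."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"10. Reliability design of architecture and control"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/98\/986ecd06005eb4967c7647998aaffd8d.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Having briefly gone through reliability at the key layers from top to bottom, we come to architecture and control as a whole."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Architecture design, as I see it, is the static side; control is itself a form of reliability design. The figure above shows that part of the link between the primary production center and the same-city disaster-recovery center is extended at layer 2. It also shows how applications write their data and how it is written asynchronously to the disaster-recovery center as well."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This brings up the reliability design of control, where control of change releases must never be left out."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With multi-active applications a missed change is relatively easy to spot: if the instance in center B or C is skipped, logic errors are bound to appear. With applications that are not multi-active, the skipped instance is not in daily use, so the problem only surfaces when we actually switch over to it, and it cannot be fixed quickly."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Changes occur at the application level and also in the network, load balancers, security devices and so on. Taken further, this is what we call operations automation, including standardized control. It must be part of the design, because it is the basis on which the overall architecture achieves reasonably complete reliability."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"III. Availability control is the core capability of fault tolerance and disaster recovery"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可用性的控制,是容錯容災的核心能力,所以容錯容災這個體系的基本構成要素是很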
重要的。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1、容錯容災體系的基本構成要素"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/7c\/7c5354b4a76a6c88afc2208465afe573.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從我們自身實踐來講,有三個最主要的要素。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一個就是要準確的掌握容錯容災的配置信息。這個理解很簡單。一個應用系統,它的數據庫和高可用性對我是非常關鍵的。即使是一些渠道類系統,像手機銀行、網上銀行等,存的都是交易流水信息,哪怕丟一些數據也無所謂,但一定要儘快恢復,否則客戶就沒法使用。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這種的話,如果給它配了一個克隆數據庫,克隆的硬件、軟件環境一定要納入在災備的配置信息裏頭,如果沒有納入,根本就談不上實現可用性控制。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二個,我們既然知道配置了,就一定要有一個檢查的場景,知道在什麼條件下一定要去用它。當數據庫不可用了,不一定就要用克隆庫,像前文列舉的數據庫場景,主備切換了或者重啓了,也許都能解決它。當然出現邏輯壞塊等錯誤,數據庫無法重啓了,就要用它了。所以平時無論是實時的、定時的,還是人工的檢查方法,一定要確認下來。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第三,精確的處置操作控制能力。強調一下實戰跟演練是不同的,它一定不是發生在你預想的時間點上,也可能系統的管理員和技術工程師沒有在現場,所以一定要有實際的操作控制能力,這是容災體系的基本構成要素之一。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這裏把我們日常的工作做一些分享,我們的技術工程師和專家們花費了很大精力和時間去做下面
這個事情。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(1)容錯容災配置管理參考:以應用系統爲對象實體,按場景維護管控與關聯信息"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/bf\/bf8441010d089c1584f7653d960df85d.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一個容災的配置就是以應用系統爲對象實體,按場景維護管控與關聯信息。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一個應用系統,它在主中心有哪些服務器,包括具體的IP地址名稱,它的用途,還有對於服務器的一些健康情況和圖中展示的啓停,以及一些檢查的腳本是怎麼檢查的。這個腳本要標準化,我們內部交流後,認爲它一定要放在標準目錄下,方便運維自動化。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外在同城的中心是什麼環境,還有跨中心共享的網絡存儲等設備一定要清楚。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"再就是容錯環境,這裏要特殊舉例,像上文提到的克隆庫是一定要掌握的,當然這個是我們在完全手工管理模式下做的,每個工程師掌握一個表。靠人工,比如最近做了一些重大變更之後,我就要同步地維護表,這個效率是比較低的,也比較耗費人的工作精力,下文會單獨談怎麼去做常態化工作。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(2)容錯容災場景管理工作參考:單系統容錯場景,前提條件(監控)、集成、授權"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/88\/88c003eebfeef4b6629694f95f685e83.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text
","text":"第二個就上文提到的場景,既然有這套配置了,針對這個場景,它是檢查什麼條件,它前提條件是什麼?它檢查的目的是什麼?如果發生了怎麼處理?這些技術場景的設立,都肯定是要詳細考慮的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這裏還有跟業務有關的。我們也去真正的操演練過,如果我的系統徹底停掉了,暫時起不來。但是我有些客戶要緊急用款,這時我們考慮到,在覈心繫統不能工作之前有克隆庫,能夠查出客戶最後一個時間點的餘額是多少,緊急用款上,我們也跟業務人員共同梳理一套流程來。當然也會有一套授權機制給客戶提現金,保證客戶自身需求的滿足。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"關於它實現授權和預演,就不一一細說了,上文已經反覆提到過,這是在它的場景管理。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(3)容錯容災場景管理工作參考:多系統組合場景,業務與技術的組合,邏輯預演"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/58\/58711a163d61231d4af6862317d5c3b7.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"再有個場景管理就是上文提到的容災,這個是比較複雜的,也是我們實際切換的一個場景。大家都知道上文說的在應用的架構設計上,系統是互相牽連的,當一個系統走的時候,或者切,或者處理的時候,其他系統必須配合。我們在演練過程中可能有十幾個系統來去切換。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"假如核心系統,它切換有幾個大的步驟或者幾個任務。它開始的時候,有的系統是不能動的,有的系統是可以同時動,但是完成了第一步以後,後面其他系統可以做哪幾步,這個是在容災場景管理中必須要反覆的邏輯的預演出來以後給它固化的。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(4)容錯容災操作過程的標準化:對象、環境、操作、輸出等,是工具化的基礎"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/33\/332989f22470e768a46e756f1698782a.png","alt":"圖片","title":null,"style":[{"key":"width","value":
"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"操作的標準化,在沒有任何工具的情況下,這個是基礎,有了這個基礎,才具備實施工具化的可能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"操作標準化的話,在前圖的這個邏輯時序,它只是個大的任務,整體性的說是停止核心的一個服務,但是它具體登到不同的機器上,要去處理不同的操作等。這個操作怎麼做,怎麼讓它標準化,都是要細細考慮的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二個,整體持續控制之後,就具體到每一個系統中的每一個操作任務,第一要明確在哪個 IP地址,操作它的用戶環境是root用戶還是其他用戶?再是腳本的具體名稱或者是命令,包括輸出的結果正常是什麼?如果輸出正常了,下一步是串行還是並行都要標識清楚。這個是之後工具化的一個基礎。這些也是在平時的工作中,我們的系統專家和工程師在負責一些系統的時候,會很細化的去處理的這問題。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(5)容錯容災流程的自動化控制:分層級的流程控制,整體、分類、局部三層流程"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ec\/ec07bdf305b523172e5de8c1a9ad05a1.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上文說過,我們從場景的檢查、場景確定之後的處理等都涉及到流程,但可能不是一個簡單的大流程,在我們實踐中,實際是三層表。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一是整體的,主要是針對容錯容災比較大的場景,我們要知道整體場景下都涉及到哪些系統。假設上文提到的最核心的是核心系統,它切換時跟着它走的大概有三十幾個系統,整體有一個調度邏邏輯時序,這個爲了方便指揮,並不是實際到某個機器上進行操作的一個命令。"}]},{"type":"paragraph","attrs":
{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二就是分類,處理的時候可能有前臺、中臺、後臺這樣的大類,大類內部是有時序邏輯的,分類之間也有它的關係,把它分層,每一層就簡化了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上圖表格,是操作的一個詳細流程。我們可以用一鍵式切換工具,或者自動化工具,在已經驗證過後,在標準的操作流程基礎上給它自動化,這樣會大量節省時間。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"真正發生故障的情況下,我可能不在場,所以一定要工具化,讓一線值班的工程師有能力去處理,這就是流程。從整體到分類到局部,我們也在朝這個方向去努力,現在有些局部已經能做到,自動化處理的速度也是比較快的。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2、有效的監控體系是容錯容災的基礎"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/f9\/f95cec62db1d254e26771487ffb798c8.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有效的監控體系是容災容錯的基礎,監控也一樣,它可能涵蓋的範圍更大,因爲一些小故障它也要監控。像直播間失效了,但我們後臺技術上還沒有發現,這個是監控配置的管理。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外是監控的策略,例如針對一個部件,像Oracle數據庫,怎麼證明它是可用的?直接連接,Select一張表,是一種方式。另外像數據庫日誌的報錯以及一些等待事件等,這些都是標準策略。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(1)有效監控體系的建設策略:以結果爲導向,以準確配置爲基礎,以有效指標爲手段"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c7\/c792999bb91930a3e59dcee0c6bc602e.png","alt":"圖片","title
":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"監和控,在目前很多談的監控實際是監,即我查看CPU,查看內存之間是否有空,發現了一個現象,我雖然能處理,但處理是規避,並不是真正找到原因,從根本上解決。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"或要做變更,怎麼規避?最簡單的方式是多路多活的時候,如果某一路出現問題,像擁堵,則直接停掉這一路,讓別的路去工作,可以避免某一些客戶進到一個通路以後,全堵死在一個通路里頭。這只是一種規避方式。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所以上文一再強調有效的監控體系,它的一個建設策略就是以結果爲導向,以配置爲基礎,以有效指標爲手段。剛開始建的時候,都不知道有哪些策略,就是由結果去反推有效率。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"監控報警有效率有下面這些點:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"應用整體監控標準化;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"運維技術人員主動優化;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"能夠發現系統深層次問題及應用系統業務可用性問題的新增技術手段。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"ind
ent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外要保證監控的覆蓋度足夠,如果有死角沒有監控,就發現不了故障。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除此外,還有監控策略準確性,我們根據監控報警的效果分析,持續調整優化監控指標、補充缺失監控指標,確保監控策略的準確性。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(2)監控管理與系統建設策略:以對象、指標、策略爲基礎,跟蹤監控發現率、持續優化策略"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/fd\/fdc3b5e99eaf0e409555bda202aca34b.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"監控檔案:在現實中存在這樣的問題,雖然有很多技術人員,但有的專注於工具開發,有的專注於系統集成,有的去被動改進,我們沒有整體規劃,都是出事後亡羊補牢,這雖然是一種方式,但跟實際的運維是有偏差的。實際運維需要有一個明確的監控配置,就是檔案。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外場景要有針對性的設計,有持續更新計劃。像上圖右側是我在網上看到的一個運維的技術專家面試的時候,面試人員大量提了我熟悉Nagios、Cacti等等,但他忽視了作爲一個監控的管理人員,應該知道Oracle要如何檢查判斷是否正常,以及某一類的應用系統要如何確定是否正常。我們有效監控體系裏,大家更關注於工具和開發,卻往往忽視了它的有效性。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(3)詳細準確的監控配置信息,是有效監控的基礎:實現與系統變更同步的監控配置控制"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/5b\/5be83d6dc58def334c89d1be4c4fee7f.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上文提到監控的基礎是準確的配置信息,配置信息一定是跟研發有關的,肯定要跟研發同步。所
以在日常的管理中,在跟研發的應用產品設計的時候,一定是從設計開始,如現在有的應用系統數量以及它應用系統模塊數量,這個信息要跟後臺的配置。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上圖右側,大家可以看到監控自身的屬性,有系統架構、邏輯結構,以及底下配的系統資源。上文監控系統部分已經提到,它調的存儲,調的數據庫,調的安全設備等等,都一定要很清晰的控制起來。研發和運維整個管理工作的銜接,涉及到大家熟知的一些領域,運維和開發理解角度可能不同,但統一後大家可以清晰瞭解,所以詳細準確的監控配置信息是監控的一個基礎。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(4)監與控是容錯容災能力的關鍵構成:發現、定位、處置,通過場景判定嚴重性"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/63\/6387700025a1123f57cbbc18befb75af.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"監和控,是容災容錯能力構成的關鍵部分。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"什麼是控?實際是我們從一些實踐場景中逐漸去分析它,去控制它。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"控前提是要怎麼控,所以有三個關鍵詞:發現、定位和處置。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上圖右側是以前很早的時候遇到過的一故障,柱狀圖是交易的量,折線圖是它的響應時間,可以看到響應時間基本接近50送達那就斷了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"面對這個系統,例如直播間後臺系統出問題了,我們處理的時候,一定會考慮網絡情況、系統的報文、資源等,一層一
層去想,如果都登錄系統自己去查會很耗時間,更何況一個大系統的維護環境裏,我們不可能有權限登錄所有的,所以需要把它定位的場景逐項進行梳理,列成一個表格。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果有工具,可以工具化,用自動化來檢查,檢查發生場景,知道如何處理,這是監和控的關係。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(5)容錯容災管理與監控管理的關係:對象、組件、場景、流程、任務、關聯"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/95\/955c7acf3fad2f70035dcdd62492e00e.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因爲監控是容災基礎,他們之間如果要去做工作,或者是做一個工具平臺系統的時候要怎麼集成?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因爲監控管理一定要有準確的監控配置信息,監控配置之後,每一個部件要布一些策略,它產生的報警信息要給到容災管理的工作環節中,而容災是以應用系統爲基本單元,把它拆分構成的組件,剛纔說到它調用這些組件,它自己服務器數據庫等等。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"每一個組件有自己場景,確認這個場景發生了,就要處置場景,這指的就是一個任務。上文我也提到,流程分成整體的、局部的和細節的,有一二三層,這個任務相當於二層,具體任務的執行可能需要自動化工具或者人工去執行。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當然也要考慮跟其他的應用系統間的邏輯和時序關係,即相互間切換的時候,我先走一步,你再去走你的第一步,那之後我走第二步,我倆234之間都可以並行處理,但是我4,到了你4的時候,一定要等我的第4步,這是有可能會有邏輯關係的,所以要做控制,容災管理和監控之間要形成一個集成關係和工作上的一個互動關係。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"四、如何將容錯容災演練形成
實戰能力?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有了上面基礎後,我們要討論的一個話題就是如何將容災容錯的演練形成實戰能力。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們2020年已經做過演練,6月份計劃9月份做一次演練,在這期間,大家都會檢查自己環境,把應該測試的腳本進行測試,反覆的討論邏輯、過程,等都已經討論清楚後,大家就開始在一個指定日期去做,這就是一個演練過程。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1、容錯容災演練的設計與意義"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/da\/dafde3bdfe8762e731119a4f3121eccb.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"演練目的一定是爲了找問題,因爲如果不實際操作,一定不能把所有問題都發現了。只有找到問題,分析它的原因和改進方法,才能爲形成實戰能力打下一個好的基礎。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"演練的過程,第一要有實操的任務,任務有去執行的人,同時也一定要有觀察記錄的人,記錄耗時跟預計是否相符以及出現的問題等等。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當然還有一個授權,真正演練的時候,因爲是實操性演練,就要授權。關鍵點上一個控制或一些關鍵操作,像存儲的同步調頭等等,一定是靠人來控制,不能完全自動化,否則風險太高。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這個就是演練它的設計和意義。大家往往演練時忘了去詳細記錄,等演練完了,特別大的問題也許會記住,但一些細節問題可能記不住,這樣演練後,就浪費了機會,沒有完全發現問題。當真正實戰時,可能就會因爲一個小的點造成整個容災切換結果不是太好,所以這就是一個目的。"}]},{"type":"heading","attrs":{"align"
:null,"level":3},"content":[{"type":"text","text":"2、容災容錯實操性演練"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/10\/1051dce8041679be321f9be0d0856b97.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外做實操性演練,我們還要確認一點,在把一個關鍵系統切到同城的一個中心的時候,有的系統本身是多活的,涉及到切只是流量配比調一下,但像銀行核心例如賬務系統等,是很難做到或者沒必要去做到跨站點的這種數據庫的雙活,而且這個是很複雜的,它就是一道切的動作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"切過去之後,因爲相當於它到另一個中心,它跟有些系統相當於跨了中心跑。跨中心跑的,它之間是互聯的裸光纖,每一個具體的網絡會話可能會延遲幾毫秒,累積起來會有幾十毫秒的延遲。如果不實操,我們就不清楚這個系統切過去後,在日常的工作場景下,它性能是否可承載業務,所以這個是我們實操之後得到的一個結果。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上圖可看到,整個系統的交易成功率和響應時間是不同的。請求了,搜索端的網卡馬上有反應,圖上左側是切換前,右側是切換後,切換前後基本一致,但是有差異,時間、成功率稍稍有下降。它後臺的賬戶系統是一個原子交易,但對於前臺客戶,這個在他可接受範圍內,交易在兩秒鐘之內返回,可以接受,也影響不大,這樣評估下來,可以確認我們容災中心是可用的,這個也是一個演練目的。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3、容災容錯演練與實戰之間的差異"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/24\/24e4bafd2efb173eae257ab4b18a4d82.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"容災容錯演練和實戰之間的差異,有兩個是最大的差異。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"inden
t":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上文說到了演練都是有計劃性的,是準備好的,實戰是我們不能預計的,不知道什麼時候發生的。所以可以總結這麼幾點:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一,演練是有計劃性的,肯定要預先梳理檢查。我們可能會花了幾個月時間去準備、報備和協調,並且所有環節都要經過測試。最關鍵的是操作時候,還是由專門系統的技術工程師去操作的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二,實戰的隨時性。不可預計發生時間,隨時開始實戰。真正發生故障時,我們會面臨系統的高頻率變更,因爲像一些大的企業數據中心,例如銀行一樣,幾乎每天每週都在變化,它變更如果我們沒跟上,變更環節沒有納入我們視野和控制範圍,切換一定會失敗,所以操作一定要隨着變更去同步,這是實戰中隨時要解決的問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外最關鍵的一點是,好多系統是365天7×24小時運行的,我們具體負責的工程師不可能24小時準備着,所以故障發生的時候,操作很可能是由一線的值班工程師或者B角實施的,這個是演練和實戰的不同。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第三,演練是有確定性的。因爲我們可以知道今天是針對核心系統做切換,而且可以事先梳理它相關聯的系統。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第四,實戰的隨機性。真正實戰發生的時候,往往是從一個小事情發生了,但大家沒有意識到它是個災難,災難發生,不知道它是什麼情況,這個就是所謂的隨機性。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4、通過日常事件分析,完備容錯場景與流程,從源頭設計實施"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c8\/c8ded3e72182f44a543fda404f66c27c.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"num
ber":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"實戰和演練是完全不同的,所以一定要有很強的發現、定位、處理能力,才能夠真正的達到實戰性的要求,這就是演練和實戰之間的差異。我們可能經常演練,卻忽視了實戰性的建設和培養,等真正發生災難的時候,完全處理不了。其實對於一個企業來講,巨大的人財物的投入,在真正需要的時候沒發生作用,這是最不可接受的了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這就是整個演練和實戰的差異。將容錯容災演練形成實戰能力,就一定要通過日常的實踐分析,不斷的去完備監控體系,也同時形成完備的容錯容災體系。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從源頭上設計。因爲總有達不到100%可用性的情況,達不到了什麼程度也一定要分析原因。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原因分析是針對於特殊標記的客戶問題的容錯處理。例如代發工資,企業有1萬人發工資,但是有一部分人因爲它的卡狀態或者什麼歷史數據問題,或者是因爲簽約關係或應用的bug問題,一部分人發不了工資。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們一定要進行容錯處理,分析情況發生的原因。或者去改應用系統的業務邏輯,或開發一個關於錯誤處理的工具,在特殊情況下我們繞過業務,直接把應發的客戶工資發下去,這個就叫一個整體性設計,整個過程是迭代的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過這種方式,包括上文提到的,先把工作標準化,之後再把它工具化,再加上培訓和迭代開發,最終讓我們的容災容錯能力極大的接近於實戰的最高要求,這就是整體的一個建設思路。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5、運維工作中,應系網變更->監控管理->監控開發配置->容錯容災"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/37\/9b\/37c853f44cd797eb801ba63da013a69b.j
pg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"日常工作中可能更細節的一些。最關鍵、最直接的就是前文提到的,我們整個的數據中心的系統,它在不停的變更,監控要跟上,監控跟上以後容錯也要跟上,我們自己內部也在討論,在逐漸朝這個方向努力。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"應用、系統、網絡的管理員一定要遵循變更的管理要求,去把變更分成標準的、常規的、緊急的等等,包括按照風險級別。我們的技術人員,尤其是要求高可靠性、高安全性的人員,一定要遵守合規審計等的要求,這也是一項IT的管理。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們要嚴格按照着做,只要做了變更,就一定要及時登記,當然要有工具來自動掃描、自動化。現在有些領域是完全可以的,像IT資產是完全可以做到的,要保證它是準確的,它是我們一切工作的基礎。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"監控的管理員,只要是一個生產的對象,我們就一定要維護到監控的檔案中,並且這個對象,我們一定要給配上標準的和非標準的監控策略,由監控的開發人員部署下去,這樣整個體系纔會完備,監控沒有漏洞和死角。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在這個之上,我們再把需要容災的部分單獨標識出來,如果有專職的容災的管理員會更好。系統可能有幾百個,但真正的重要系統可能是一小部分,系統的管理員需要識別出自己哪些系統是容錯的,像上文的監控事件,容錯場景它大體的區別就需要標註上。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這樣的話,後線有專門容災的管理,對於它的配置,巡檢的場景和處置,要不停的迭代更新和驗證,這樣的話把它形成一個常態化工作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs
":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所以今天主要分享的內容是這些,可能每一個層面只是點到了沒有去展開講,爲大家整體梳理一個思路,關於怎麼建設整體的容災容錯的能力。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":">>>>"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q&A"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q1:容災可視化平臺可以集成哪些雙活的指標?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A1 :"},{"type":"text","text":"這個實際是監控的一個展現問題,因爲我們本身哪些系統是跨中心雙活的,我從實時監控上一定能看到,包括配套一些巡檢。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q2 : Oracle數據庫的雙活中心?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A2 
:"},{"type":"text","text":"如果說的是跨站點雙活,我沒有做,因爲跨站點的話,雙活一定是有延遲的,無論你裸光纖速度再快、再可靠,都一定有延遲。甚至於同中心,我們也沒有做數據庫Oracle的,因爲選擇某種技術方案,要看實際的環境。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一,對於我們來講,單獨的小機,假如每天1000萬的交易量是能夠承載的,我沒必要去做集羣,集羣有集羣的複雜性,包括一個故障的處理,以前我們拿別的系統驗證過。另外,我們自身沒法配那麼多的高級DBA去維護,所以我們退而求其次。當然有的特別大的數據中心和銀行只能用集羣,單機不夠,容量不夠。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"單機夠的話,我這個HA配成最簡單的這種勸阻裝置,因爲我們數據庫是多實例的,實例更簡單的可以用克隆方式來更快切換,一兩分鐘之內。越簡單,它是越可預期的,可靠性越強的。一旦這個環節出現問題,10分鐘內我能恢復,那就行,這也滿足監管對於30分鐘恢復的要求。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q3 : 對於自動化運維監控,運維工具有什麼要求?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A3 : 
"},{"type":"text","text":"實施方案不具體談了。最關鍵的點,第一要知道整個體系架構中,有哪些部件要納入監控範圍。可能我們像看一部車,它的所有的部件過程部件都要有一定的監測,否則這個車一定是失控的。監測的策略是什麼?有了這些我再去選擇工具,工具無非說開源的等等,這個工具是替你把數據採集回來,你一定要有統一的監控。把它從這種message消息一定處理成帶有結構信息、可利用的監控數據。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後對於定位和處置,尤其在定位準了之後,才能去處置,這個是真正發揮效果的。也就是說監和最終的控非常重要,不能是隻監不控。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q4:銀行自己容災體系要求達到幾級標準?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A4 : "},{"type":"text","text":"我們具體現在是按照監管要求,把重要的系統,同城的情況下,它的 RPO肯定是0,因爲銀行的數據不能丟,丟了出賬務風險了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"RTO在監管要求是30分鐘之內,當然你會力求越短越好,因爲我們很多系統多活之後,它涉及不到切的動作了,那可能就是一瞬間把可能堵塞的a中心的某些路給它停掉,這個停我們是利用監控自動化的方式,實際類似於我們紅綠燈的調度方式,只要它標識了之後,負載均衡就不再給它分流量了,馬上就停掉。它速度會更快。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但是具體到核心系統,它沒有辦法,只能切。這個切,我們演練的方式,也是最後能夠達到驗證,等待30分鐘內能切完,這也滿足監管的要求。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q5 : 規避容災應急切換的風險點?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A5 
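:"},{"type":"text","text":"Some steps in a switchover are bound to fail. For databases, for example, we have listed fallback paths. As operations engineers we must have a plan B or plan C and must think through the worst case: even if the RTO has already run long and we have to report to the regulator, the business and the data must be restored, whatever the consequences. That is the first thing to guarantee."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q6: When planning emergency disaster-recovery switchover drills, how far should a system's high availability and disaster recovery be built out?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A6: "},{"type":"text","text":"As for planning, out of several hundred systems there are those the regulator watches closely; these must reach an RTO within 30 minutes and an RPO of 0. That is achievable within the same city but not for a remote site, where there will probably have to be some time bound instead."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Beyond that, we have our own requirements, because we must guarantee quality service to customers, so we identify the systems those requirements apply to."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We have a screening method that looks at several dimensions, such as business impact, loss, and technical criticality. Some foundational application systems cause no direct business loss yet have a large impact, so we give each system a composite score and configure disaster recovery according to it: same-city active-active, a same-city 2+2 or 2+1 arrangement, or same-city data-level disaster recovery. The configuration follows that conclusion, weighed broadly against cost-benefit and the specific circumstances."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"About the speaker:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Jiang Yan (姜巖)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"General Manager, Data Center, a city commercial bank 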
"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"擁有27年銀行應用系統開發、運維管理、架構管理以及新業務科技實現的設計等相關工作經驗;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"歷經多次核心系統更新換代,對金融科技與業務的配套發展有着深刻的思考。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文轉載自:dbaplus社羣(ID:dbaplus)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文鏈接:"},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/y4uUPcRR3UiRBGbXD93atQ","title":"xxx","type":null},"content":[{"type":"text","text":"教科書範本級:銀行容錯容災體系建設與實操性演練設計"}]}]}]}