6 Best Practices for Cloud-Native Cost Optimization at Weibo

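{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"In the early, fast-growing years of the mobile internet, companies focused mainly on business expansion and spent freely to buy market share. After nearly a decade of rapid growth, however, the demographic dividend of the internet is fading, and almost no company can afford to ignore cost any longer. Apple's supply-chain and inventory management, Tesla's cost reduction through local production in China, and the major cloud vendors' optimization of enterprise IT costs are all important reasons these companies have sustained rapid growth."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"In their early days, companies treat meeting business demand as the top priority and manage resource utilization rather loosely. Whether the issue is the proliferation of machine specifications, capacity planning, or the integration of online and offline resources, it boils down to the same thing: resources that are not needed have been reserved, and that wastes money."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"With the rise and adoption of cloud native in recent years, services can now be migrated quickly. As a result, capacity can be requested on demand to minimize idle resources, different workloads can be deployed on the same machines at different times of day, and workloads that used to need machines of different specifications can be co-located on very large machines (256C\/2T), consolidating many small allocations into one and cutting resource cost substantially. During its cloud-native transformation, Weibo built a range of such capabilities and distilled the following 6 effective cost-optimization techniques."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Technique 1: Mixed Orchestration (Raising Per-Machine Utilization)"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"When a service chooses a machine specification, finding one that fits exactly is "},{"type":"text","marks":[{"type":"italic"},{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"almost impossible"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":". Requirements vary enormously across services, from CPU-intensive and memory-intensive to network-intensive, IO-intensive, and even storage-capacity-intensive, adding up to hundreds or thousands of distinct shapes, whereas the specifications actually on offer are limited, whether the machines are purchased for an IDC or rented from a public cloud. This mismatch wastes a great deal of money. A typical memory-intensive workload may need something like 1C\/128G, so most of the CPU it is given is wasted; being able to use a machine shape that fits would clearly reduce cost."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Cloud-native infrastructure built on Docker and Kubernetes solves the problems that arise when many services are co-located: port conflicts, the limit on the number of ports in a single network namespace, and resource isolation between services. Overselling and sharing redundancy on large machines (256C\/2T) save even more, but they also introduce a serious problem: deployment becomes too concentrated. If one large machine goes down, 30+ business applications and 200+ resource instances may go down with it. Facing that risk requires resource and service governance that keeps the business stable through fast degradation, migration, and decommissioning."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/b9\/f9\/b9c92d13306423b4a57470d9743f09f9.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Shenlong bin-packing diagram"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Most machines running Weibo's business applications are 16C32G. For throughput and latency reasons the heap is usually set below 10G; even counting off-heap memory, overall memory utilization stays under 50%, while average CPU utilization is 44.5%. Backend resources such as MC, Redis, and Pika are memory-hungry by nature, averaging 64.7% memory utilization, but their CPU utilization is low at only 16.8%. Given these profiles, co-locating business applications with backend resources lets the two fully complement each other."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"To let different services be orchestrated onto machines of a single specification, we replaced the assorted small machines with large ones. On top of a Kubernetes-based container orchestration system we schedule NUMA nodes, CPU pinning, disks, and network at a fine granularity, and we use the service\/resource governance platform to react quickly to problems. Combined with fast scale-out and fast data distribution, this generally gives minute-level recovery and lets us migrate an entire machine within 30 minutes."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"In terms of results, after high-density container co-location, whole-machine CPU utilization rose from 22.3% to 50.6% and memory utilization from 44.5% to 67.8%, for a cost reduction of more than 15%."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Technique 2: Remote-Region Deployment (Physical Cost)"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Beyond the servers themselves, IT cost includes server deployment, data-center construction, electricity, and network bandwidth and dedicated lines. On one hand, data centers in popular first-tier cities cost considerably more to build than elsewhere because land is scarce, among other reasons. On the other hand, some regions have abundant power and therefore cheaper electricity. Taking all of these into account, building or renting a data center in a region with lower overall cost, or choosing a public-cloud availability zone with a lower unit price, noticeably reduces IT cost. On public clouds, for example, prices differ by 5% to 30% between regions."}]},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/90\/ce\/90800e46799a4d4f3c2c6a0c2959b5ce.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Official Alibaba Cloud ECS pricing ("},{"type":"text","marks":[{"type":"color","attrs":{"color":"#666666","name":"user"}}],"text":"ecs.c6.4xlarge"},{"type":"text","text":")"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Ideally, every service runs in the cloud and new services do not need to talk to existing ones, in which case deploying directly in a cheaper availability zone lowers cost immediately. In practice, most services are already deployed in self-built data centers or in other public-cloud availability zones, and they depend heavily on one another. The cost of dedicated-line bandwidth and the impact on request latency must then be factored in to maximize the overall benefit."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Take Beijing as an example. Many internet companies headquartered there also rent data centers in the city or choose public-cloud availability zones in the Beijing region. Data centers in Zhangbei County or Ulanqab, to the northwest of Beijing, are cheaper, but because they are 200 to 300-odd kilometers away, ping times between servers at the two sites range from a few milliseconds to around 10 ms."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"To limit the effect of latency on service quality and to reduce dedicated-line traffic and cost, heavily accessed backend resources can be deployed alongside the online application servers, which eliminates the vast majority of requests that would otherwise cross the dedicated line. We also use a multi-data-center message synchronization component so that requests are served entirely within a single data center, further reducing cross-data-center traffic and improving overall availability."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/10\/d9\/10a34029b37a5a36b981910dfe4b77d9.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Weibo remote-region deployment architecture"}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Technique 3: Autoscaling (Lowering Standing Redundancy)"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Regular traffic generally follows a pattern, and most services show clear peaks and troughs over the course of a day. A service normally has to provision enough machines to handle its daily peak and, to be safe, reserve some headroom on top of that. This headroom is what we call standing redundancy."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"For most of the day, however, traffic is below the daily peak. Permanently keeping enough machines online to handle the daily peak plus 30% headroom wastes a lot of resources. If redundancy can be lowered while still guaranteeing stability and peak capacity, or if standing machines do not even need to cover the daily peak, cost drops considerably. Weighing the usage time and price of standing machines against pay-as-you-go machines, services with clear peaks and troughs can typically cut cost by about 30%, and by more when the peaks are short and pronounced."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/c6\/25\/c68447578d83de332c55285a70fe0e25.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Dynamic redundancy of one service"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"At Weibo, we scale out and in dynamically based on real-time traffic so that production always keeps a set level of redundancy, while the standing machines only need to handle the daily peak. Even if autoscaling fails, the service keeps running normally. This approach uses resources efficiently and still maintains the redundancy the business needs."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Trending events on Weibo are a special case: some services see traffic grow tenfold within ten minutes. The simplest way to stay stable is to keep enough machines on hand for whatever multiple of traffic might arrive. But as the user base grows, traffic grows geometrically, and so does the number of machines required. Responding dynamically to traffic that may exceed the historical peak at any moment requires very strong elastic scaling. As we described in an earlier article, "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s?__biz=MjM5MDE0Mjc4MA==&mid=2651072332&idx=2&sn=8ca5bc34dd3687011a9a343a897bd1cd&chksm=bdb9df1f8ace5609a5b83409cb6c37941673d3460844be7bc494beca228669fae5eb8fcab684&token=1597781820&lang=zh_CN#rd","title":null,"type":null},"content":[{"type":"text","text":"Unprecedented in the Industry: Deploying 100,000-Scale Service Resources in 10 Minutes and Rebuilding the Weibo Backend in Another Region in 1 Hour"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":", Weibo can already deliver 2,000 machines in 5 minutes, and with Shenlong bare-metal servers it will be able to deliver 80,000 machines in 5 minutes. In addition, our in-house data recovery center provides fast data distribution: it can restore data on the order of 100 TB in 10 minutes and rebuild the entire Weibo backend in 1 hour. Once redundancy has been lowered, only fast scale-out and fast data distribution can keep the service stable, so that capacity is reduced without reducing quality."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"For a fast-growing business, precise capacity planning of a complex system is very hard. Stateless services can now scale out quickly through containerization and K8s orchestration. Large stateful workloads, however, must migrate massive amounts of data before they can serve traffic, and moving hundreds of terabytes can take days, so the typical answer to capacity planning there is still to add redundancy."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Technique 4: Off-Peak Sharing (Sharing Redundancy Across Services)"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"As noted above, the mean and peak of a service's traffic follow a pattern. Beyond that, the trend usually has periodic peaks and troughs and is highly autocorrelated. Different services peak at different times, so provisioning standing redundancy for each service's own peak leaves resources visibly idle during the rest of the cycle. Periodic campaigns such as check-ins, sign-ins, and flash sales, which belong to different services, each waste resources to some degree. If some of these services' machines are coordinated centrally so that services share redundancy, much of this waste, and the cost that goes with it, can be avoided."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"For this scenario we built the Weibo service data profiling platform. Besides the usual time-series features, it forecasts metric trends: by collecting and analyzing historical metrics, it predicts how they will evolve over the coming period, identifies when each service's traffic peaks and troughs occur, and guides subsequent resource scheduling."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"In our experience, when traffic is highly autocorrelated, a simple linear or exponential average of history already forecasts well. Common time-series models (Holt-Winters, time-series decomposition, ARIMA) generally keep the Mean Percentage Error (MPE) below 5%, and fitting with XGBoost or another boosting model can bring it down to roughly 1% 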
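MPE."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/08\/bf\/08ce5e805c6e0f99ec022ee92b3cf4bf.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Traffic and autocorrelation of one backend service"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"With the forecasts from the profiling platform, the scheduler can allocate resources globally. For services that peak at different times, it plans resource lending between them, which keeps standing redundancy from sitting idle, cuts waste, and raises utilization. For services whose troughs coincide, the scheduling layer reclaims resources centrally into a shared redundancy pool and offers them to offline workloads (machine-learning training, data analysis and mining, and so on) on a fixed schedule. In Weibo's practice so far, these measures save more than 15% of overall machine cost."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Technique 5: Online-Offline Integration (Sharing Redundancy Across Services)"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Online-offline integration is a typical case of off-peak sharing. Put simply, servers run applications during the day and offline jobs at night, so CPU and disk IO are both fully used. Online services have low average CPU utilization, and their machines sit largely idle at night. Offline workloads, mainly large-scale data computation, log processing, and AI training, tend to run hot at night and, as the business grows, may run short of resources and fail to finish, while during the day the situation is reversed. Only by planning online and offline services together can resources be fully used and utilization raised, which reduces how many online and offline servers must be purchased."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Online-offline integration does not suit every scenario, though. It is a complex engineering effort that spans network architecture, hardware architecture, the operating system, service architecture, and the surrounding scheduling system. The hard part is the fully automated scheduling system, which must keep online services unaffected while offline jobs still run well."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/06\/90\/066e8458bc7b03fe153c26e9d5yy6390.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Mode switching in online-offline integration"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"At Weibo this works in two directions. First, offline supplies online: Weibo has a midday peak and an evening peak every day, and during those periods some offline instances are handed over to online services. Second, online supplies offline: most services have very little traffic at night, less than one tenth of the evening peak, so online instances are consolidated into offline instances. The redundancy of the online service is the key metric, and scheduling decisions (who supplies whom, and how much) are based on it. Mechanisms such as offline yielding and fallback scale-out for online services ensure that online is unaffected and offline runs well. For example, for an online service of 1,000 machines paired with an offline workload of 1,000 machines in a typical peak-and-trough pattern, the expected saving is 15%."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Technique 6: Replacing Intel with AMD (Physical Cost)"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"In the x86 server market, Intel CPUs hold the vast majority of the share thanks to their long record of stability and performance. With the launch of AMD's Zen-based EPYC processors, however, the price per core at equal performance is lower than Intel's. In addition, AMD 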
EPYC CPUs offer up to 64 cores and 128 threads per chip, so a single dual-socket server tops out at 128 cores and 256 threads. Paired with 2048 GB of memory and 40GE or faster NICs, one server can host more services. With the same number of servers per rack, each rack supports more services, which raises rack utilization and further lowers base IT construction cost. On public clouds, for example, AMD-based servers of equivalent specification cost roughly 5% less."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/0e\/7b\/0ed7fc331ae66396ab7d03b54eedc57b.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"AMD NUMA deployment layout"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Large servers save money but also create new problems. As our earlier article "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s?__biz=MjM5MDE0Mjc4MA==&mid=2651073533&idx=2&sn=f747e07a2cc4827fb96c3920fb61b80e&chksm=bdb9d3ae8ace5ab846493cdddefdc47934d1df43658f04f313a7d45e142f6ff14f8cc57c6aa2&token=1597781820&lang=zh_CN#rd","title":null,"type":null},"content":[{"type":"text","text":"How to Support a Million DAU with a Single Server, Using Weibo's Core Services as an Example"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" also noted, packing more instances onto one machine can bring availability risks, and a dual-socket AMD server has 8 NUMA nodes, which complicates CPU pinning. The configuration Weibo uses most in production is a 256-thread AMD EPYC 
CPU setup with 2048 GB of memory, which gives each NUMA node 32 threads and 256 GB of memory. On each NUMA node we deploy 2 Pods of 12 threads and 16 GB each for core online applications and 1 Pod of 7 cores and 210 GB for cache resources such as Redis and Memcached. Every Pod is pinned to the CPUs of its own NUMA node, for a total of 24 Pods per server."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"To safeguard availability, we have defined service governance policies for both online-application Pods and cache-resource Pods. When service access fails or a core metric of a Pod or machine becomes abnormal, the policies trigger fast handling, including service degradation, resource removal, and service\/data migration."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Summary"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"We have summarized 6 cost-optimization techniques that apply to scenarios common at Weibo today. Whether they lower physical cost, share redundancy, or reduce per-service or per-machine redundancy, the ultimate goal is to raise per-machine utilization and lower per-machine cost, "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#303030","name":"user"}}],"text":"splitting and consolidating across region, hardware, service, and machine to optimize cost. At the same time, these techniques bring "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"major challenges for service stability and make service and resource governance harder. Solving them requires the metrics system, the scheduling system, the hybrid cloud, service profiling, hardware resources, and more to work together."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"We have put each of these into practice at Weibo to varying degrees. The savings and difficulty of the 6 techniques are summarized below for reference."}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/73\/73e6dc1d0d1cbd63bdcfb08f80190639.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Weibo also has some "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#303030","name":"user"}}],"text":"business scenarios of its own that can be optimized further through "},{"type":"text","text":"bandwidth optimization, storage optimization, hot-cold data separation, application optimization, and other techniques. For reasons of length, we do not discuss them here."}]},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"About the authors:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Infrastructure Department, Weibo R&D Center: 
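胡云鵬、孫雲晨、胡忠想、劉燕和、胡春林"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"胡云鵬"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":", cloud platform technical lead in Weibo's Infrastructure Department. Joined Weibo in 2017 and is mainly responsible for the architectural overhaul and upgrade of Weibo's foundational services. Current "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#262525","name":"user"}}],"text":"focus areas: "},{"type":"text","text":"hybrid cloud, resource cloud, and service\/resource governance."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"孫雲晨, "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"lead for business system re-architecture in Weibo's Infrastructure Department. Joined Weibo in 2015 and has participated in and led architecture upgrades of several Weibo business systems. Currently focused on offering resources as services and on improving development efficiency."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#323232","name":"user"}},{"type":"strong"}],"text":"Weibo's Infrastructure Department"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#323232","name":"user"}}],"text":", formerly the Weibo Platform Department, has been responsible for developing and safeguarding Weibo's core systems since 2011. It built Weibo's backend infrastructure and tooling, including high-availability architecture, caching, storage, message queues, stability, monitoring, service-oriented architecture, hybrid cloud, and A\/B testing, and it handles business and resource operations for site-wide trending events, the three major holidays, and everyday traffic. The team is among the industry leaders in hybrid cloud, elastic scheduling, high-availability architecture, and handling trending-event peaks, and it warmly welcomes like-minded engineers to join and build a first-class technical team together."}]}]}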