Uber容器化Apache Hadoop基礎設施的實踐

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"隨着Uber的業務持續增長,我們用了5年時間擴展Apache Hadoop(本文中稱爲“Hadoop”),部署到了21000多臺主機上,以支持多種分析和機器學習用例。我們組建了一支擁有多樣化專業知識的團隊來應對在裸金屬服務器上運行Hadoop所面臨的各種挑戰,這些挑戰包括:主機生命週期管理、部署和自動化,Hadoop核心開發以及面向客戶的門戶。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"隨着Hadoop基礎設施的複雜性和規模越來越大,團隊越來越難以應對管理如此龐大系統時需要承擔的各種職責。基於腳本和工具鏈的全系統規模運營工作消耗了大量的工程時間。損壞的主機越來越多,修復工作逐漸跟不上節奏。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在我們繼續爲Hadoop維護自己的裸金屬部署時,公司的其他部門在微服務領域取得了重大進展。容器編排、主機生命週期管理、服務網格和安全性等領域的解決方案逐漸成型,讓微服務管理起來更加高效和簡便。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2019年,我們開始了重構Hadoop部署技術棧的旅程。兩年時間過去了,如今有超過60%的Hadoop運行在Docker容器中,爲團隊帶來了顯著的運營優勢。作爲這一計劃的成果之一,團隊將他們的許多職責移交給了其他基礎設施團隊,並且得以更專注於核心Hadoop的開發工作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/cd\/9c\/cd02ddab7193630d5a6210ce898fb79c.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖1:團隊職責轉移"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文總結了這一過程中我們面臨的各種問題,以及我們解決這些問題的方法和途徑。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"回顧過去"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在具體分析架構之前,有必要簡要介紹一下我們之前運維Hadoop的方法及其缺陷。彼時,幾個互相分離的解決方案協同工作,驅動Hadoop的裸金屬部署。這些方案包括:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"讓主機設置就地突變的自動化方案"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"手動觸發和監控,基於非冪等動作的腳本"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"鬆散耦合的主機生命週期管理解決方案"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在底層,它們是通過幾個Golang服務、大量Python和Bash腳本、Puppet清單和一些Scala代碼實現的。早期我們使用了Cloudera Manager(免費版)並評估了Apache Ambari。然而,由於Uber使用了自定義部署模型,這兩個系統都被證明是不夠用的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們舊的運維方法遇到了一些挑戰,包括但不僅限於以下方面:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"生產主機的手動就地突變導致了許多漂移,後來這讓我們感到很驚訝。工程師經常就部署過程發生爭論,因爲在事件響應期間某些更改沒有經過審查和識別。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"全系統範圍的更改需要漫長的時間來做手動計劃和編排。我們上次的操作系統升級被推遲了,最終花了2年多的時間才完成。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"幾個月後,管理不善的配置導致了諸多事故。我們錯誤地配置了dfs.blocksize,最終導致我們的一個集羣中的HDFS RPC隊列時間降級。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"自動化與人類交互之間缺乏良好的契約,這會導致一些意想不到的嚴重後果。由於一些主機意外退役,我們丟失了一些副本。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/www.hava.io\/blog\/cattle-vs-pets-devops-explained","title":"","type":null},"content":[{"type":"text","text":"“寵物”"}]},{"type":"link","attrs":{"href":"https:\/\/www.hava.io\/blog\/cattle-vs-pets-devops-explained","title":"","type":null},"content":[{"type":"text","text":"主機"}]},{"type":"text","text":"的存在和越來越多的“寵物”所需的人工處理過程導致了一些影響嚴重的事件。我們的一次HDFS NameNode遷移引發了一起影響整個批處理分析棧的事件。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們在重構期間仔細考慮了從以前的經驗中吸取的各種教訓。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"架構"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"開始設計新系統時,我們遵循以下原則:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對Hadoop核心的更改應該保持在最低水平,以免同開源代碼分道揚鑣(例如用於安全性的Kerberos)"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hadoop守護進程必須容器化,以實現不可變和可重複的部署"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"必須使用聲明性概念(而不是基於動作的命令式模型)來建模集羣運維"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"集羣中的任何主機都必須在發生故障或降級時易於更換"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"應儘可能重用和利用Uber內部的基礎設施,以避免重複"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以下部分詳細介紹了定義新架構的一些關鍵解決方案。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"集羣管理"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們之前使用命令式、基於動作的腳本來配置主機和運維集羣的方法已經到了無以爲繼的地步。鑑於負載的有狀態(HDFS)和批處理(YARN)性質,以及部署運維所需的自定義部分,我們決定將Hadoop納入Uber的內部有狀態集羣管理系統。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"集羣管理系統通過幾個基於通用框架和庫構建的鬆散耦合組件來運維Hadoop。下圖代表了我們今天所用的架構的簡化版本。黃色組件描述了核心集羣管理系統,而綠色標記的組件代表了專門爲Hadoop構建的自定義組件。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/75\/9c\/755f38b77f131267211a59473991959c.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖2:集羣管理架構"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"集羣管理員(Cluster Admin)與集羣管理器界面(Cluster Manager Interface)Web控制檯交互,來觸發對集羣的運維操作。管理員的意圖被傳播到集羣管理器(Cluster Manager)服務,然後觸發突變集羣目標狀態(Goal State)的"},{"type":"link","attrs":{"href":"https:\/\/cadenceworkflow.io\/","title":"","type":null},"content":[{"type":"text","text":"Cadence"}]},{"type":"text","text":"工作流。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"集羣管理(Cluster Management)系統維護預配置的主機,稱爲託管主機(Managed Hosts)。一個節點(Node)代表一組部署在一個託管主機上的Docker容器。目標狀態定義了集羣的整個拓撲結構,包括節點位置信息(主機位置)、集羣到節點的歸屬、節點資源(CPU、內存、磁盤)的定義及其環境變量。一個持久數據存儲負責存儲目標狀態,使集羣管理系統可以從非常嚴重的故障中快速恢復。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們非常依賴Uber開發的開源解決方案Cadence來編排集羣上的狀態變化。Cadence工作流負責所有運維操作,諸如添加或停用節點、升級整個隊列中的容器等等。Hadoop管理器(Hadoop Manager)組件定義了所有工作流。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"集羣管理器(Cluster Manager)不瞭解Hadoop的內部運維操作以及管理Hadoop基礎設施的複雜性。Hadoop管理器實現了自定義邏輯(類似於"},{"type":"link","attrs":{"href":"https:\/\/kubernetes.io\/docs\/concepts\/extend-kubernetes\/operator\/","title":"","type":null},"content":[{"type":"text","text":"K8s Custom Operator"}]},{"type":"text","text":"),以在Hadoop的運維範圍內以安全的方式管理Hadoop集羣和模型工作流。例如,我們所有的HDFS集羣都有兩個NameNode;而Hadoop管理器組件內有一些相關的防護措施,例如禁止同時重啓這兩個NameNode。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hadoop Worker是在分配給Hadoop的每個節點上啓動的第一個代理(agent)。系統中的所有節點都在"},{"type":"link","attrs":{"href":"https:\/\/spiffe.io\/docs\/latest\/spire\/understand\/concepts\/","title":"","type":null},"content":[{"type":"text","text":"SPIRE"}]},{"type":"text","text":"註冊,SPIRE是一個開源身份管理和負載證明系統。Hadoop Worker組件在容器啓動時使用SPIRE進行身份驗證,並接收一個"},{"type":"link","attrs":{"href":"https:\/\/spiffe.io\/docs\/latest\/spiffe\/concepts\/#spiffe-verifiable-identity-document-svid","title":"","type":null},"content":[{"type":"text","text":"SVID"}]},{"type":"text","text":"(X.509證書)。Hadoop Worker使用它與其他服務通信,以獲取其他配置和密鑰(例如Kerberos密鑰表)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hadoop容器(Hadoop Container)代表在Docker容器中運行的任何Hadoop組件。在我們的架構中,所有Hadoop組件(HDFS NameNode、HDFS DataNode等)都部署爲Docker容器。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hadoop Worker定期從集羣管理器中獲取節點的目標狀態,並在節點本地執行各種動作以實現目標狀態(這是一個控制循環,也是K8s的核心概念)。該狀態定義要啓動、停止或停用的Hadoop容器以及其他設置。在運行HDFS NameNode和YARN ResourceManager的節點上,Hadoop Worker負責更新“主機文件”(例如"},{"type":"link","attrs":{"href":"https:\/\/hadoop.apache.org\/docs\/r3.0.1\/hadoop-project-dist\/hadoop-hdfs\/HdfsDataNodeAdminGuide.html#Hostname-only_configuration","title":"","type":null},"content":[{"type":"text","text":"dfs.hosts和dfs.hosts.exclude"}]},{"type":"text","text":")。這些文件指示需要包含在集羣中或從集羣中排除的DataNodes\/NodeManager主機。Hadoop Worker還負責將節點的實際(Actual)狀態(或當前狀態)回報給集羣管理器。集羣管理器在啓動新的Cadence工作流時,根據實際狀態和目標狀態將集羣收斂到定義的目標狀態。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一個與集羣管理器良好集成的系統負責持續檢測主機問題。集羣管理器能做出很多智能決策,例如限制速率以避免同時停用過多的損壞主機。Hadoop管理器在採取任何行動之前會確保集羣在不同的系統變量下都是健康的。Hadoop管理器中包含的檢查可確保集羣中不存在丟失或複製不足的塊,並且在運行關鍵運維操作之前確保數據在DataNode之間保持均衡,並執行其他必要檢查。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"使用聲明式運維模型(使用目標狀態)後,我們減少了運維集羣時的人工操作。一個很好的例子是系統可以自動檢測到損壞主機並將其安全地從集羣中停用以待修復。每退役一臺損壞主機,系統都會補充一個新主機來保持集羣容量不變(維持目標狀態中所定義的容量)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下圖顯示了由於各種問題在一週時間段內的各個時間點退役的HDFS DataNode數量。每種顏色描繪了一個HDFS集羣。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/1f\/62\/1f4dfef2a539d10ayy6yy2df2bc1d362.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖3:自動檢測和停用損壞的HDFS數據節點"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"容器化Hadoop"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在過去,我們基礎設施的內在可變性曾多次給我們帶來了意想不到的麻煩。有了新架構後,我們得以在不可變Docker容器中運行所有Hadoop組件(NodeManager、DataNode等)和YARN應用程序。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當我們開始重構時,我們在生產環境中爲HDFS運行的是Hadoop v2.8,爲YARN集羣運行的是v2.6。v2.6中不存在對YARN的Docker支持。鑑於依賴YARN的多個系統(Hive、Spark等)對v2.x存在緊密依賴,將YARN升級到v3.x(以獲得更好的Docker支持)是一項艱鉅的任務。我們最終將YARN升級到了支持Docker容器運行時的v2.9,並從v3.1向後移植了幾個補丁("},{"type":"link","attrs":{"href":"https:\/\/issues.apache.org\/jira\/browse\/YARN-5366","title":"","type":null},"content":[{"type":"text","text":"YARN-5366"}]},{"type":"text","text":"、"},{"type":"link","attrs":{"href":"https:\/\/issues.apache.org\/jira\/browse\/YARN-5534","title":"","type":null},"content":[{"type":"text","text":"YARN-5534"}]},{"type":"text","text":")。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"YARN NodeManager運行在主機上的Docker容器中。主機Docker套接字掛載到NodeManager容器,使用戶的應用程序容器能夠作爲兄弟容器啓動。這繞過了運行"},{"type":"link","attrs":{"href":"https:\/\/www.develves.net\/blogs\/asd\/2016-05-27-alternative-to-docker-in-docker\/","title":"","type":null},"content":[{"type":"text","text":"Docker-in-Docker"}]},{"type":"text","text":"會引入的所有複雜性,並使我們能夠在不影響客戶應用程序的情況下管理YARN NodeManager容器的生命週期(例如"},{"type":"link","attrs":{"href":"https:\/\/hadoop.apache.org\/docs\/current\/hadoop-yarn\/hadoop-yarn-site\/NodeManager.html#NodeManager_Restart","title":"","type":null},"content":[{"type":"text","text":"重啓"}]},{"type":"text","text":")。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/32\/d7\/324f033ac2db626eaa6eab3392d337d7.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖4:YARN NodeManager和應用程序兄弟容器"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了讓超過150,000多個應用程序從裸金屬JVM("},{"type":"link","attrs":{"href":"https:\/\/github.com\/apache\/hadoop\/blob\/branch-2.9.1\/hadoop-yarn-project\/hadoop-yarn\/hadoop-yarn-server\/hadoop-yarn-server-nodemanager\/src\/main\/java\/org\/apache\/hadoop\/yarn\/server\/nodemanager\/containermanager\/linux\/runtime\/DefaultLinuxContainerRuntime.java","title":"","type":null},"content":[{"type":"text","text":"DefaultLinuxContainerRuntime"}]},{"type":"text","text":")無縫遷移到Docker容器("},{"type":"link","attrs":{"href":"https:\/\/github.com\/apache\/hadoop\/blob\/branch-2.9.1\/hadoop-yarn-project\/hadoop-yarn\/hadoop-yarn-server\/hadoop-yarn-server-nodemanager\/src\/main\/java\/org\/apache\/hadoop\/yarn\/server\/nodemanager\/containermanager\/linux\/runtime\/DockerLinuxContainerRuntime.java","title":"","type":null},"content":[{"type":"text","text":"DockerLinuxContainerRuntime"}]},{"type":"text","text":"),我們添加了一些補丁以在NodeManager啓動應用程序時支持一個默認Docker鏡像。此鏡像包含所有依賴項(python、numpy、scipy等),使環境看起來與裸金屬主機完全一樣。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在應用程序容器啓動期間拉取Docker鏡像會產生額外的開銷,這可能會"},{"type":"link","attrs":{"href":"https:\/\/hadoop.apache.org\/docs\/r2.9.1\/hadoop-yarn\/hadoop-yarn-site\/DockerContainers.html#Cluster_Configuration","title":"","type":null},"content":[{"type":"text","text":"導致超時"}]},{"type":"text","text":"。爲了規避這個問題,我們通過"},{"type":"link","attrs":{"href":"https:\/\/github.com\/uber\/kraken","title":"","type":null},"content":[{"type":"text","text":"Kraken"}]},{"type":"text","text":"分發Docker鏡像。Kraken是一個最初在Uber內部開發的開源點對點Docker註冊表。我們在啓動NodeManager容器時預取默認應用程序Docker鏡像,從而進一步優化了設置。這可確保在請求進入之前默認應用程序Docker鏡像是可用的,以啓動應用程序容器。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所有Hadoop容器(DataNode、NodeManager)都使用卷掛載(volume mount)來存儲數據(YARN應用程序日誌、HDFS塊等)。這些卷在節點放在託管主機上時可用,並在節點從主機退役24小時後刪除。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在遷移過程中,我們逐漸讓應用轉向使用默認Docker鏡像啓動。我們還有一些客戶使用了自定義Docker鏡像,這些鏡像讓他們能夠帶來自己的依賴項。通過容器化Hadoop,我們通過不可變部署減少了可變性和出錯的機率,併爲客戶提供了更好的體驗。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Kerberos集成"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們所有的Hadoop集羣都由Kerberos負責安全性。集羣中的每個節點都需要在Kerberos(dn\/hdfs-dn-host-1.example.com)中"},{"type":"link","attrs":{"href":"https:\/\/hadoop.apache.org\/docs\/r2.8.2\/hadoop-project-dist\/hadoop-common\/SecureMode.html#Kerberos_principals_for_Hadoop_Daemons","title":"","type":null},"content":[{"type":"text","text":"註冊"}]},{"type":"text","text":"主機特定服務主體Principal(身份)。在啓動任何Hadoop守護程序之前,需要生成相應的密鑰表(Keytab)並將其安全地發送到節點。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Uber使用SPIRE來做負載證明。SPIRE實現了"},{"type":"link","attrs":{"href":"https:\/\/spiffe.io\/docs\/latest\/spiffe-about\/overview\/","title":"","type":null},"content":[{"type":"text","text":"SPIFFE"}]},{"type":"text","text":"規範。形式爲spiffe:\/\/example.com\/some-service的"},{"type":"link","attrs":{"href":"https:\/\/spiffe.io\/docs\/latest\/spiffe-about\/spiffe-concepts\/#spiffe-id","title":"","type":null},"content":[{"type":"text","text":"SPIFFE ID"}]},{"type":"text","text":"用於表示負載。這通常與部署服務的主機名無關。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"很明顯,SPIFFE和Kerberos都用的是它們自己獨特的身份驗證協議,其身份和負載證明具有不同的語義。在Hadoop中重新連接整個安全模型以配合SPIRE並不是一個可行的解決方案。我們決定同時利用SPIRE和Kerberos,彼此之間沒有任何交互\/交叉證明。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這簡化了我們的技術解決方案,方案中涉及以下自動化步驟序列。我們“信任”集羣管理器和它爲從集羣中添加\/刪除節點而執行的目標狀態運維操作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/0c\/db\/0c2d9a19c26f85b26f0c5b5cffaa70db.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖5:Kerberos主體註冊和密鑰表分發"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"使用位置信息(目標狀態)從集羣拓撲中獲取所有節點。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"將所有節點的對應主體註冊到Kerberos中並生成相應的密鑰表。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"在"},{"type":"link","attrs":{"href":"https:\/\/github.com\/hashicorp\/vault","title":"","type":null},"content":[{"type":"text","text":"Hashicorp"}]},{"type":"link","attrs":{"href":"https:\/\/github.com\/hashicorp\/vault","title":"","type":null},"content":[{"type":"text","text":" "}]},{"type":"link","attrs":{"href":"https:\/\/github.com\/hashicorp\/vault","title":"","type":null},"content":[{"type":"text","text":"Vault"}]},{"type":"text","text":"中保存密鑰表。設置適當的ACL,使其只能由Hadoop Worker讀取。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"集羣管理器代理獲取節點的目標狀態並啓動Hadoop Worker。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"Hadoop Worker由SPIRE代理驗證。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":6,"align":null,"origin":null},"content":[{"type":"text","text":"Hadoop Worker:"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"獲取密鑰表(在步驟2中生成)"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"將其寫入Hadoop容器可讀的一個只讀掛載(mount)"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"啓動Hadoop容器"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":7,"align":null,"origin":null},"content":[{"type":"text","text":"Hadoop容器(DataNode、NodeManager等):"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"從掛載讀取密鑰表"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"在加入集羣之前使用Kerberos進行身份驗證。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一般來說,人工干預會導致密鑰表管理不善,從而破壞系統的安全性。通過上述設置,Hadoop Worker由SPIRE進行身份驗證,Hadoop容器由Kerberos進行身份驗證。上述整個過程是端到端的自動化,無需人工參與,確保了更嚴格的安全性。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"用戶組管理"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在YARN中,分佈式應用程序的容器作爲提交應用程序的用戶(或服務帳戶)運行。用戶組(UserGroup)在活動目錄("},{"type":"link","attrs":{"href":"https:\/\/en.wikipedia.org\/wiki\/Active_Directory","title":"","type":null},"content":[{"type":"text","text":"Active Directory"}]},{"type":"text","text":",AD)中管理。我們的舊架構需要通過Debian包安裝用戶組定義(從AD生成)的定期快照。這導致了全系統範圍的不一致現象,這種不一致是由包版本差異和安裝失敗引起的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"未被發現的不一致現象會持續數小時到數週,直到影響用戶爲止。在過去4年多的時間裏,由於跨主機的用戶組信息不一致引發的權限問題和應用程序啓動失敗,讓我們遇到了不少麻煩。此外,這還導致了大量的手動調試和修復工作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Docker容器中YARN的用戶組管理自身存在一系列"},{"type":"link","attrs":{"href":"https:\/\/hadoop.apache.org\/docs\/current\/hadoop-yarn\/hadoop-yarn-site\/DockerContainers.html#User_Management_in_Docker_Container","title":"","type":null},"content":[{"type":"text","text":"技術挑戰"}]},{"type":"text","text":"。維護另一個守護進程"},{"type":"link","attrs":{"href":"https:\/\/github.com\/SSSD\/sssd","title":"","type":null},"content":[{"type":"text","text":"SSSD"}]},{"type":"text","text":"(如Apache文檔中所建議的)會增加團隊的開銷。由於我們正在重新構建整個部署模型,因此我們花費了額外的精力來設計和構建用於用戶組管理的穩定系統。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/d0\/6d\/d092cdfc8507d6621e984dcc766eca6d.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖6:容器內的用戶組"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們的設計是利用一個經過內部強化、信譽良好的配置分發系統(Config Distribution System)將用戶組定義中繼到部署YARN NodeManager容器的所有主機上。NodeManager容器運行用戶組進程(UserGroups Process),該進程觀察用戶組定義(在"},{"type":"text","marks":[{"type":"strong"}],"text":"配置分發系統內"},{"type":"text","text":")的更改,並將其寫入一個卷掛載,該掛載與所有應用程序容器(Application Container)以只讀方式共享。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"應用程序容器使用一個自定義"},{"type":"link","attrs":{"href":"http:\/\/www.linuxfromscratch.org\/blfs\/view\/svn\/postlfs\/nss.html","title":"","type":null},"content":[{"type":"text","text":"NSS庫"}]},{"type":"text","text":"(內部開發並安裝在Docker鏡像中)來查找用戶組定義文件。有了這套解決方案,我們能夠在2分鐘內實現用戶組在全系統範圍內的一致性,從而爲客戶顯著提高可靠性。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"配置生成"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們運營着40多個服務於不同用例的集羣。在舊系統中,我們在單個Git存儲庫中獨立管理每個集羣的配置(每個集羣一個目錄)。結果複製粘貼配置和管理跨多個集羣的部署變得越來越困難。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過新系統,我們改進了管理集羣配置的方式。新系統利用了以下3個概念:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"針對.xml和.properties文件的Jinja模板,與集羣無關"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Starlark在部署前爲不同類別\/類型的集羣生成配置"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"節點部署期間的運行時環境變量(磁盤掛載、JVM設置等)注入"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/25\/99\/253eec58c941e6dd59b657db18cf9199.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖7:Starlark文件定義不同集羣類型的配置"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們將模板和Starlark文件中總共66,000多行的200多個.xml配置文件減少到了約4,500行(行數減少了93%以上)。事實證明,這種新設置對團隊來說更具可讀性和可管理性,尤其是因爲它與集羣管理系統集成得更好了。此外,該系統被證明有利於爲批處理分析棧中的其他相關服務(例如Presto)自動生成客戶端配置。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"發現與路由"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在以前,將Hadoop控制平面(NameNode和ResourceManager)移動到不同的主機一直是很麻煩的操作。這些遷移通常會導致整個Hadoop集羣滾動重啓,還需要與許多客戶團隊協調以重啓相關服務,因爲客戶端要使用主機名來發現這些節點。更糟糕的是,某些客戶端傾向於緩存主機IP並且不會在出現故障時重新解析它們——我們從一次重大事件中學到了這一點,該事件讓整個區域批處理分析棧降級了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Uber的微服務和在線存儲系統在很大程度上依賴於內部開發的服務網格來執行發現和路由任務。Hadoop對服務網格的支持遠遠落後於其他Apache項目,例如Apache Kafka。Hadoop的用例以及將其與內部服務網格集成所涉及的複雜性無法滿足工程工作的投資回報率目標。取而代之的是,我們選擇利用基於DNS的解決方案,並計劃將這些更改逐步貢獻回開源社區("},{"type":"link","attrs":{"href":"https:\/\/issues.apache.org\/jira\/browse\/HDFS-14118","title":"","type":null},"content":[{"type":"text","text":"HDFS-14118"}]},{"type":"text","text":"、"},{"type":"link","attrs":{"href":"https:\/\/issues.apache.org\/jira\/browse\/HDFS-15785","title":"","type":null},"content":[{"type":"text","text":"HDFS-15785"}]},{"type":"text","text":")。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們有100多個團隊每天都在與Hadoop交互。他們中的大多數都在使用過時的客戶端和配置。爲了提高開發人員的生產力和用戶體驗,我們正在對整個公司的Hadoop客戶端進行標準化。作爲這項工作的一部分,我們正在遷移到一箇中心化配置管理解決方案,其中客戶無需爲初始化客戶端指定典型的*-site.xml文件。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"利用上述配置生成系統,我們能夠爲客戶端生成配置並將配置推送到我們的內部配置分發系統。配置分發系統以可控和安全的方式在整個系統範圍內部署它們。服務\/應用程序使用的Hadoop客戶端將從主機本地配置緩存(Config Cache)中獲取配置。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/ca\/ef\/ca0608ca3d6c1caffec9562ed14195ef.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖8:客戶端配置管理"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"標準化客戶端(具有DNS支持)和中心化配置從Hadoop客戶那裏完全抽象出了發現和路由操作。此外,它還提供了一組豐富的可觀察性指標和日誌記錄,讓我們可以更輕鬆地進行調試。這進一步改善了我們的客戶體驗,並使我們能夠在不中斷客戶應用程序的情況下輕鬆管理Hadoop控制平面。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"心態的轉變"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"自從Hadoop於2016年首次部署在生產環境中以來,我們已經開發了很多(100多個)鬆散耦合的python和bash腳本來運維集羣。重新構建Hadoop的自動化技術棧意味着我們要重寫所有這些邏輯。這一過程意味着重新實現累積超過4年的邏輯,同時還要保證系統的可擴展性和可維護性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對21,000多臺Hadoop主機大動干戈以遷移到容器化部署,同時放棄正常運維多年的腳本積累,這樣的方案在一開始招致了不少懷疑聲。我們開始將該系統用於沒有SLA的新的開發級集羣,然後再用於集成測試。幾個月後,我們開始向我們的主要集羣(用於數據倉庫和分析)添加DataNodes和NodeManagers,並逐漸建立了信心。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們通過一系列內部演示和編寫良好的運行手冊幫助其他人學會使用新系統之後,大家開始意識到了轉移到容器化部署的好處。此外,新架構解鎖了舊系統無法支持的某些原語(提升效率和安全性)。團隊開始認可新架構的收益。很快,我們在新舊系統之間架起了幾個組件,以搭建一條從現有系統到新系統的遷移路徑。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"搬運猛獁象"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們新架構採用的原則之一是機羣中的每一臺主機都必須是可更換的。由舊架構管理的可變主機積累了多年的技術債務(陳舊的文件和配置)。作爲遷移的一部分,我們決定重新鏡像我們機羣中的每臺主機。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"目前,自動化流程在編排遷移時需要的人工干預是極少的。宏觀來看,我們的遷移流程是一系列Cadence活動,迭代大量節點。這些活動執行各種檢查以確保集羣穩定,並會智能地選擇和停用節點,爲它們提供新配置,並將它們添加回集羣。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"完成遷移任務的最初預期時間是2年以上。我們花了大量時間調整我們的集羣,以找到一個儘量提升遷移速度,同時不會損害我們SLA的甜點。在9個月內,我們成功遷移了約60%(12,500\/21,000臺主機)。我們正走在快車道上,預計在接下來的6個月內完成大部分機羣遷移工作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/d4\/c1\/d492cf7yy141f1ca02b61e92b4127cc1.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖9:在大約7天內遷移200多臺主機"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"要記住的傷疤"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果我們聲稱我們的遷移非常平滑,那肯定是在撒謊。遷移的初始階段非常順利。然而,當我們開始遷移對更改更敏感的集羣時,發現了很多意想不到的問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"丟失塊和添加更多防護措施"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們的一個最大的集羣有多個運維工作流同時執行。一個集羣範圍的DataNode升級與集羣另一部分發生的遷移一起觸發了NameNode RPC延遲的降級。後來發生了一系列意外事件,我們最後輸掉了戰鬥,導致集羣中丟失了一些塊,我們不得不從另一個區域恢復它們。這迫使我們爲自動化和運維程序設置了更多的防護措施和安全機制。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":2,"normalizeStart":2},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"使用目錄遍歷進行磁盤使用統計"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"集羣管理器代理定期執行磁盤使用情況統計以備使用率分析,並將其提供給公司範圍的效率計劃。不幸的是,該邏輯意味着“"},{"type":"link","attrs":{"href":"https:\/\/golang.org\/pkg\/path\/filepath\/#Walk","title":"","type":null},"content":[{"type":"text","text":"遍歷"}]},{"type":"text","text":"”存儲在DataNode上的24x4TB磁盤上的所有HDFS塊。這導致了大量的磁盤i\/o。它不會影響我們不太忙的集羣,但對我們最繁忙的一個集羣產生了負面影響,增加了HDFS客戶端讀\/寫延遲,這促使我們增強了這一邏輯。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"要點和未來計劃"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在過去2年中,我們對Hadoop的運行方式做出了巨大的改變。我們升級了我們的部署,從一大堆腳本和Puppet清單轉向了在Docker容器中運行大型Hadoop生產集羣。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從腳本和工具過渡到通過成熟的UI運維Hadoop,是團隊的重大文化轉變。我們花在集羣運維上的時間減少了90%以上。我們讓自動化操作控制來整個系統的運維、更換損壞主機。我們不再讓Hadoop來管理我們的時間了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以下是我們旅途中的主要收穫:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果沒有先進的卓越運維技術,Hadoop可能會變成一個難以馴服的龐然大物。組織應該定期重新評估部署架構並定期償還技術債務,以免亡羊補牢。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大規模基礎設施的重構需要時間。組織應該爲此建立強大的工程團隊,專注於進步而不是追求完美,並隨時準備好應付生產環境中出現的問題。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"感謝所有參與各個架構團隊的工程師,我們的架構非常穩固。建議與Hadoop領域之外的人們合作以收集不同的觀點。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"隨着我們的遷移任務逐漸步入尾聲,我們正在將重點轉移到更多令人興奮的工作上。利用底層容器編排的優勢,我們爲未來制定了以下計劃:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"單擊一次開通多集羣和一個區域內跨多個專區的集羣平衡"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"主動檢測和修復服務降級和故障的自動化解決方案"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過橋接雲和本地容量來提供按需彈性存儲和計算資源"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"作者介紹"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":10}}],"text":"Mithun Mathew是Uber數據團隊的二級高級軟件工程師。 他之前作爲Apache Ambari的開發人員參與Hadoop部署工作。他目前在Uber專注於數據基礎設施的容器化。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":10}}],"text":"Qifan Shi是Uber數據基礎設施團隊的高級軟件工程師,也是Hadoop容器化的核心貢獻者。他一直在參與能夠高效編排大規模HDFS集羣的多個系統的研發工作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":10}}],"text":"Shuyi Zhang是Uber數據基礎設施團隊的高級軟件工程師。她是Hadoop容器化的核心貢獻者。她目前專注於Uber的計算資源管理系統。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":10}}],"text":"Jeffrey Zhong是Uber數據基礎設施團隊的二級工程經理,負責管理數據湖存儲(Hadoop HDFS)和批量計算調度(Hadoop YARN)。在加入 Uber 之前,他曾參與Apache HBase和 Apache Phoenix的工作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"原文鏈接:"},{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/hadoop-container-blog\/","title":"","type":null},"content":[{"type":"text","text":"https:\/\/eng.uber.com\/hadoop-container-blog\/"}]}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章