Worker, like Master, is an endpoint (RpcEndpoint) in Spark's RPC framework, so their startup processes are similar; the main difference is that Worker and Master are implemented differently, so their onStart methods do different work.
一、Worker's main method
Let's look at the Worker's main method:
Step 3 creates the rpcEnv, again through RpcEnv.create(); this is the same path the Master's rpcEnv creation takes, and it initializes the set of components the RPC layer needs.
Now let's look at step 5.
Step 5 registers the Worker: an EndpointData is added to the Dispatcher's receivers, and an OnStart message is placed in its Inbox's messages queue; that message is what starts the Worker.
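The effect of that registration can be shown with a small self-contained model. The names below mirror Spark's internals (Dispatcher, EndpointData, Inbox), but this is a toy sketch, not the real code:

```scala
import java.util.concurrent.{ConcurrentHashMap, LinkedBlockingQueue}

sealed trait InboxMessage
case object OnStart extends InboxMessage

// Each endpoint owns an Inbox; the Inbox seeds its own message queue with
// OnStart, so the endpoint's onStart() runs before any other message.
class Inbox(val endpointName: String) {
  val messages = new java.util.LinkedList[InboxMessage]()
  messages.add(OnStart)
}

class EndpointData(val name: String) {
  val inbox = new Inbox(name)
}

class Dispatcher {
  private val endpoints = new ConcurrentHashMap[String, EndpointData]()
  val receivers = new LinkedBlockingQueue[EndpointData]()

  def registerRpcEndpoint(name: String): Unit = {
    val data = new EndpointData(name)
    endpoints.put(name, data)
    receivers.offer(data) // a dispatch thread will later drain this inbox
  }
}

val dispatcher = new Dispatcher
dispatcher.registerRpcEndpoint("Worker")
// The first message waiting for the new endpoint is OnStart:
println(dispatcher.receivers.peek().inbox.messages.peek())
```

So "registering the Worker" really just queues up the OnStart that triggers the initialization described next.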
二、Worker initialization
2.1 Parameter initialization
As before, only the relatively important parameters are listed here (important for interviews, or for showing off).
```scala
// Send a heartbeat every (heartbeat timeout) / 4 milliseconds
// Heartbeat interval: one heartbeat per interval. It defaults to a quarter of
// the worker timeout (the timeout defaults to 1 min, so the default heartbeat
// interval is 15s).
private val HEARTBEAT_MILLIS = conf.getLong("spark.worker.timeout", 60) * 1000 / 4

// Model retries to connect to the master, after Hadoop's model.
// The first six attempts to reconnect are in shorter intervals (between 5 and 15 seconds)
// Afterwards, the next 10 attempts are between 30 and 90 seconds.
// A bit of randomness is introduced so that not all of the workers attempt to reconnect at
// the same time.
// Number of initial registration attempts (these 6 attempts are 5-15 seconds apart).
private val INITIAL_REGISTRATION_RETRIES = 6
// Total number of registration attempts (the later 10 are 30-90s apart).
// Randomness is introduced so that the workers don't all try to reconnect at once.
private val TOTAL_REGISTRATION_RETRIES = INITIAL_REGISTRATION_RETRIES + 10

/**
 * The master address to connect in case of failure. When the connection is broken, worker will
 * use this address to connect. This is usually just one of `masterRpcAddresses`. However, when
 * a master is restarted or takes over leadership, it will be an address sent from master, which
 * may not be in `masterRpcAddresses`.
 */
// The address the master sends after a restart or failover
private var masterAddressToConnect: Option[RpcAddress] = None
// Address of the active master
private var activeMasterUrl: String = ""
// Web UI of the active master
private[worker] var activeMasterWebUiUrl : String = ""
// The worker's own web UI
private var workerWebUiUrl: String = ""
// The worker's URI
private val workerUri = RpcEndpointAddress(rpcEnv.address, endpointName).toString
// Whether the worker has registered
private var registered = false
// Whether the worker is connected
private var connected = false
// Worker id
private val workerId = generateWorkerId()
// sparkHome
private val sparkHome =
  if (testing) {
    assert(sys.props.contains("spark.test.home"), "spark.test.home is not set!")
    new File(sys.props("spark.test.home"))
  } else {
    new File(sys.env.get("SPARK_HOME").getOrElse("."))
  }
// The worker's working directory
var workDir: File = null
// Finished executors
val finishedExecutors = new LinkedHashMap[String, ExecutorRunner]
// driver id -> DriverRunner
val drivers = new HashMap[String, DriverRunner]
// executor id -> ExecutorRunner
val executors = new HashMap[String, ExecutorRunner]
// Finished drivers
val finishedDrivers = new LinkedHashMap[String, DriverRunner]
// Per-application local directories
val appDirectories = new HashMap[String, Seq[String]]
// Finished apps
val finishedApps = new HashSet[String]
// How many finished executors the web UI retains
val retainedExecutors = conf.getInt("spark.worker.ui.retainedExecutors",
  WorkerWebUI.DEFAULT_RETAINED_EXECUTORS)
// How many finished drivers the web UI retains
val retainedDrivers = conf.getInt("spark.worker.ui.retainedDrivers",
  WorkerWebUI.DEFAULT_RETAINED_DRIVERS)

// The shuffle service is not actually started unless configured.
private val shuffleService = if (externalShuffleServiceSupplier != null) {
  externalShuffleServiceSupplier.get()
} else {
  new ExternalShuffleService(conf, securityMgr)
}

// Publicly visible address
private val publicAddress = {
  val envVar = conf.getenv("SPARK_PUBLIC_DNS")
  if (envVar != null) envVar else host
}
// The worker's web UI
private var webUi: WorkerWebUI = null
// Number of connection attempts so far
private var connectionAttemptCount = 0
// Metrics system
private val metricsSystem = MetricsSystem.createMetricsSystem("worker", conf, securityMgr)
// Metrics source exposing this worker's state (executor count, cores/memory used)
private val workerSource = new WorkerSource(this)
// Whether the UI sits behind a reverse proxy
val reverseProxy = conf.getBoolean("spark.ui.reverseProxy", false)
private var registerMasterFutures: Array[JFuture[_]] = null
private var registrationRetryTimer: Option[JScheduledFuture[_]] = None

// A thread pool for registering with masters. Because registering with a master is a blocking
// action, this thread pool must be able to create "masterRpcAddresses.size" threads at the same
// time so that we can register with all masters.
// Thread pool for registering with masters; the thread count equals the number of masters
private val registerMasterThreadPool = ThreadUtils.newDaemonCachedThreadPool(
  "worker-register-master-threadpool",
  masterRpcAddresses.length // Make sure we can register with all masters at the same time
)

var coresUsed = 0
var memoryUsed = 0
```
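The 5-15s and 30-90s interval ranges quoted in the comments above come from a per-worker random fuzz factor applied to base intervals of 10s and 60s. A small sketch of the arithmetic (illustrative, not the exact Spark code):

```scala
import scala.util.Random

val INITIAL_REGISTRATION_RETRIES = 6
val TOTAL_REGISTRATION_RETRIES = INITIAL_REGISTRATION_RETRIES + 10

// A fuzz multiplier in [0.5, 1.5) is picked once per worker, so a given
// worker keeps its spacing while different workers spread out over the range.
val fuzzMultiplier = new Random().nextDouble() + 0.5
val initialIntervalSeconds = math.round(10 * fuzzMultiplier)   // lands in 5..15
val prolongedIntervalSeconds = math.round(60 * fuzzMultiplier) // lands in 30..90

println(s"17 attempts total: 1 initial + $TOTAL_REGISTRATION_RETRIES retries")
println(s"initial interval: ${initialIntervalSeconds}s, prolonged: ${prolongedIntervalSeconds}s")
```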
2.2 onStart
There isn't much to say about it; let's look at registerWithMaster().
2.2.1 tryRegisterAllMasters()
First, let's look at tryRegisterAllMasters().
Here the worker sends a registration request to every master, one thread per master. Reading on:
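The pattern can be sketched in a self-contained way (the addresses here are made up; the real code resolves an RpcEndpointRef per master and keeps the returned futures so they can be cancelled later):

```scala
import java.util.concurrent.{Executors, Future => JFuture}

// One registration task per master, each on its own thread,
// because registering with a master is a blocking call.
val masterRpcAddresses = Seq("spark://master-1:7077", "spark://master-2:7077")
val registerMasterThreadPool = Executors.newFixedThreadPool(masterRpcAddresses.size)

def tryRegisterAllMasters(): Array[JFuture[_]] =
  masterRpcAddresses.map { masterAddress =>
    registerMasterThreadPool.submit(new Runnable {
      override def run(): Unit =
        // In Spark this resolves the master endpoint and sends RegisterWorker
        println(s"registering with $masterAddress")
    })
  }.toArray

val registerMasterFutures = tryRegisterAllMasters()
registerMasterFutures.foreach(_.get()) // wait here only for the demo
registerMasterThreadPool.shutdown()
```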
As you can see, the worker uses its reference to the master to send it a RegisterWorker message, which the Master receives and handles in its receive() method. Let's see how the master processes this request.
org/apache/spark/deploy/master/Master.scala:266
The master can reply in one of four ways:
1. If the Master is in STANDBY state, it returns MasterInStandby.
2. If the Master's cached state already contains this worker's ID, it returns RegisterWorkerFailed (duplicate worker ID).
3. If registration succeeds, it returns RegisteredWorker.
4. If a worker is already registered at the same address, it returns RegisterWorkerFailed (worker re-registered).
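The branching can be modeled in a few lines. This is a self-contained toy; the real handler also builds a WorkerInfo, persists it, and triggers scheduling:

```scala
import scala.collection.mutable

sealed trait RegisterWorkerResponse
case object MasterInStandby extends RegisterWorkerResponse
case class RegisteredWorker(masterWebUiUrl: String) extends RegisterWorkerResponse
case class RegisterWorkerFailed(message: String) extends RegisterWorkerResponse

class MasterModel(var standby: Boolean = false) {
  private val idToWorker = mutable.Map[String, String]() // worker id -> address
  private val registeredAddresses = mutable.Set[String]()

  def handleRegisterWorker(id: String, address: String): RegisterWorkerResponse =
    if (standby) MasterInStandby                          // case 1
    else if (idToWorker.contains(id))
      RegisterWorkerFailed("Duplicate worker ID")         // case 2
    else if (registeredAddresses.contains(address))
      RegisterWorkerFailed(                               // case 4
        s"Attempted to re-register worker at same address: $address")
    else {
      idToWorker(id) = address
      registeredAddresses += address
      RegisteredWorker("http://master:8080")              // case 3
    }
}

val masterModel = new MasterModel()
println(masterModel.handleRegisterWorker("worker-1", "10.0.0.1:7078"))
println(masterModel.handleRegisterWorker("worker-1", "10.0.0.2:7078"))
```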
Let's see how the worker responds to these four cases (the messages above are all subclasses of RegisterWorkerResponse; the two failure cases share the RegisterWorkerFailed class, so there are three classes in total):
org/apache/spark/deploy/worker/Worker.scala:477
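The worker's side of the exchange, reduced to its control flow (a sketch; in Spark, the success branch also calls changeMaster, schedules SendHeartbeat, and sends the worker's current executor/driver state to the master):

```scala
sealed trait RegisterWorkerResponse
case class RegisteredWorker(masterWebUiUrl: String) extends RegisterWorkerResponse
case class RegisterWorkerFailed(message: String) extends RegisterWorkerResponse
case object MasterInStandby extends RegisterWorkerResponse

var registered = false
var retriesCancelled = false
var exited = false

def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = msg match {
  case RegisteredWorker(_) =>
    registered = true
    retriesCancelled = true    // changeMaster(...) cancels the retry timer
  case RegisterWorkerFailed(_) =>
    if (!registered) exited = true // Spark logs the error and calls System.exit(1)
  case MasterInStandby =>          // ignored: the standby master is not in charge
}

handleRegisterResponse(MasterInStandby)
handleRegisterResponse(RegisteredWorker("http://master:8080"))
```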
Let's keep reading:
org/apache/spark/deploy/worker/Worker.scala:428
Here is the changeMaster from section 1.2 again:
org.apache.spark.deploy.worker.Worker#changeMaster
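The effect of changeMaster reduces to a handful of state updates. A simplified sketch, using the field names from section 2.1 (the timer is modeled with a plain Option here; the real code cancels a scheduled future via cancelLastRegistrationRetry()):

```scala
var activeMasterUrl = ""
var activeMasterWebUiUrl = ""
var masterAddressToConnect: Option[String] = None
var connected = false
var registrationRetryTimer: Option[String] = Some("scheduled-retry")

// Record the new active master, mark the worker connected,
// and cancel any pending registration retry.
def changeMaster(masterUrl: String, uiUrl: String, masterAddress: String): Unit = {
  activeMasterUrl = masterUrl
  activeMasterWebUiUrl = uiUrl
  masterAddressToConnect = Some(masterAddress)
  connected = true
  registrationRetryTimer = None // cancelLastRegistrationRetry() in the real code
}

changeMaster("spark://master-1:7077", "http://master-1:8080", "master-1:7077")
```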
2.2.2 Option(self).foreach(_.send(ReregisterWithMaster))
Now let's look at the registration retry path. ReregisterWithMaster is a message the worker sends to itself, so let's find where the worker handles it.
org.apache.spark.deploy.worker.Worker#receive
Let's keep reading:
It's all in the screenshot, so I won't belabor it.
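The retry bookkeeping boils down to the following control flow (self-contained sketch; Spark additionally switches the retry timer to the longer 30-90s interval once the 6th attempt is reached):

```scala
val INITIAL_REGISTRATION_RETRIES = 6
val TOTAL_REGISTRATION_RETRIES = INITIAL_REGISTRATION_RETRIES + 10

var registered = false
var connectionAttemptCount = 0
var gaveUp = false

// Stop once registered, keep retrying up to the cap, then give up.
def reregisterWithMaster(): Unit = {
  connectionAttemptCount += 1
  if (registered) {
    // already registered: cancelLastRegistrationRetry()
  } else if (connectionAttemptCount <= TOTAL_REGISTRATION_RETRIES) {
    // re-send RegisterWorker (to the known master, or to all masters)
  } else {
    gaveUp = true // "All masters are unresponsive! Giving up." + System.exit(1)
  }
}

(1 to TOTAL_REGISTRATION_RETRIES).foreach(_ => reregisterWithMaster())
println(gaveUp)        // false: still within the 16 allowed retries
reregisterWithMaster() // the 17th retry exceeds the cap
println(gaveUp)        // true
```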
三、 Summary
The part of worker startup most worth attention is registration with the Master. After the worker is initialized, its onStart method is invoked, and onStart calls registerWithMaster to begin registering with the Master.
Registration is attempted at most 17 times: 1 initial registration plus 16 retries. The first 6 attempts are spaced 5 to 15 seconds apart, and the later 10 retries 30 to 90 seconds apart. If registration still has not succeeded after 17 attempts, the worker reports a registration failure and exits.
After the Master receives a RegisterWorker from a worker, it replies with one of three message types depending on the situation:
MasterInStandby: the Master is in STANDBY state; registration fails and the worker ignores the message.
RegisterWorkerFailed (two causes: a duplicate worker ID, or a duplicate registration at the same address): registration failed; if the worker has not yet registered successfully, it logs the error and exits.
RegisteredWorker: registration succeeded. The worker updates its master information, cancels the registration retries, and starts sending heartbeats to the Master (every 15 seconds by default), which the Master uses to refresh that worker's heartbeat timestamp. The worker also reports its latest executor and driver state to the Master; the Master checks it and tells the worker to kill any executors or drivers it does not recognize.
There are some other details as well, but none important enough to cover here.