Spark Source Code Analysis: RpcEndpoint and RpcEnv

Environment and Versions

Component	Version
JDK	java version "1.8.0_231" (HotSpot)
Scala	Scala-2.11.12
Spark	spark-2.4.4

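1. Preface

RpcEndpoint and RpcEnv are arguably the most fundamental part of the whole Spark system:
- RpcEndpoint represents a communication endpoint
- RpcEnv represents the communication environment
An endpoint interacts with other endpoints through the communication environment. All communication inside a Spark cluster is built on this, including heartbeats, task dispatch, block management, status reporting, and monitoring.
In early versions of Spark the whole communication layer was implemented with Akka (whose remoting itself used Netty). To get better performance, Spark 2 and later implement the whole communication mechanism directly on Netty. The mechanism still imitates Akka's Actor model: communication between RpcEndpoints is still event-driven, so readers who want to understand this part may first look at an example of Akka's event-driven model.
Starting with RpcEndpoint and RpcEnv gives a solid understanding of the distributed foundation of Spark as a computing framework, and makes the rest of the code base much easier to read.
I drew an overview diagram of RpcEndpoint and RpcEnv:
[Figure: overview diagram of RpcEndpoint and RpcEnv]
If you only want the overall structure, jump straight to the summary at the end! 😃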

2. RpcEndpoint

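2.1 Core UML Diagram

UML diagram
Description
- Master: the master node in standalone mode; it manages the cluster and allocates resources to applications
- Worker: the worker node in standalone mode; it launches Executors, which run the actual application
- ClientEndpoint: what we usually call the client; in cluster mode it asks the cluster for a Driver and submits the configuration, the Jar, and so on (in client mode the Driver is started locally, so no ClientEndpoint is needed)
- DriverEndpoint: part of what we usually call the Driver; it is created when user code calls new SparkContext(), and handles resource requests, interaction with Executors, and launching/killing tasks
- CoarseGrainedExecutorBackend: the coarse-grained Executor backend process; it interacts with DriverEndpoint and handles Executor registration, startup, shutdown, task launch, and so on
- HeartbeatReceiver: handles heartbeats, so that RpcEndpoints know whether their peers are still alive
- BlockManagerEndpoint: comes in a Master flavor (on the Driver) and a Slave flavor (on the Executors); instantiated by SparkEnv, it mainly manages data blocks while the application runs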

2.2 RpcEndpoint Source Code Analysis

org.apache.spark.rpc.RpcEndpoint

/**
 * The life-cycle of RpcEndpoint is documented here:
 * The life-cycle of an endpoint is:
 *
 * {@code constructor -> onStart -> receive* -> onStop}
 * 
 */
private[spark] trait RpcEndpoint {

  val rpcEnv: RpcEnv

  /**
   * A reference to this endpoint itself, used to send messages to itself
   */
  final def self: RpcEndpointRef = {
    require(rpcEnv != null, "rpcEnv has not been initialized")
    rpcEnv.endpointRef(this)
  }

  /**
   * A Scala partial function.
   * Handles messages sent via `RpcEndpointRef.send` or `RpcCallContext.reply`; no reply to the sender is required.
   * Subclasses implement it with pattern matching.
   */
  def receive: PartialFunction[Any, Unit] = {
    case _ => throw new SparkException(self + " does not implement 'receive'")
  }

  /**
   * A Scala partial function.
   * Handles messages sent via `RpcEndpointRef.ask`; a reply to the sender is required.
   * Subclasses implement it with pattern matching.
   */
  def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case _ => context.sendFailure(new SparkException(self + " won't reply anything"))
  }

  // Some code omitted

  /**
   * Subclasses put their initialization work here
   */
  def onStart(): Unit = {
    // By default, do nothing.
  }

  def onStop(): Unit = {
    // By default, do nothing.
  }
  
  // Some code omitted
}

org.apache.spark.rpc.ThreadSafeRpcEndpoint

// Thread-safe variant of RpcEndpoint: the RpcEnv must deliver messages to it one at a time
private[spark] trait ThreadSafeRpcEndpoint extends RpcEndpoint

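First of all, RpcEndpoint is a trait that defines an abstract template; subclasses implement the actual behavior. Most of the Endpoints we come across extend its thread-safe variant, ThreadSafeRpcEndpoint.
Its life-cycle is as stated in the doc comment at the top: the constructor is called first, then onStart; after that a message may arrive at any time, and each message triggers receive or receiveAndReply; when the endpoint ends, onStop is called. (The RpcEnv section below explains why the life-cycle follows this order.)
For each RpcEndpoint:
- onStart is used for initialization. For example, Master's onStart initializes the WebUI, the MetricsSystem, the heartbeat-check timer, and so on.
- receive and receiveAndReply are the core methods for interacting with other RpcEndpoints; they parse and act on the messages exchanged between RpcEndpoints. The principle is the same as the receive method of an Akka Actor: define case classes that serve as the protocol (for example RegisterWorker, Heartbeat) and pattern-match on them to decode the message, then continue with the handling. (A tip for reading the source: Ctrl + left-click a class matched by a case in receive to find the protocol Message; click again to find the RpcEndpoint that sends this Message and the RpcEndpoint that receives it. This is a quick way to trace the message flow between RpcEndpoints.)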
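To make the pattern-matching idea concrete, here is a minimal sketch that does not depend on Spark. `MiniEndpoint`, `CallContext` and `MiniMaster` are simplified stand-ins invented for this illustration (they are not Spark classes), and `CountWorkers` is a hypothetical message; only the shape of `receive` / `receiveAndReply` mirrors the trait shown above.

```scala
import scala.collection.mutable.ArrayBuffer

object EndpointSketch {
  // Protocol messages modelled as case classes (CountWorkers is hypothetical)
  case class RegisterWorker(id: String, cores: Int)
  case class Heartbeat(id: String)
  case object CountWorkers

  // Simplified stand-in for RpcCallContext (assumption: reply only)
  trait CallContext { def reply(response: Any): Unit }

  // Simplified stand-in for the RpcEndpoint trait
  trait MiniEndpoint {
    def onStart(): Unit = ()
    def receive: PartialFunction[Any, Unit]
    def receiveAndReply(context: CallContext): PartialFunction[Any, Unit]
    def onStop(): Unit = ()
  }

  class MiniMaster extends MiniEndpoint {
    val log = ArrayBuffer.empty[String]
    private var workers = Map.empty[String, Int]

    override def onStart(): Unit = { log += "started" }

    // One-way messages: nothing is sent back to the sender
    override def receive: PartialFunction[Any, Unit] = {
      case RegisterWorker(id, cores) =>
        workers += (id -> cores)
        log += s"registered $id"
      case Heartbeat(id) =>
        log += s"heartbeat from $id"
    }

    // Request/response messages: the answer goes back through the context
    override def receiveAndReply(context: CallContext): PartialFunction[Any, Unit] = {
      case CountWorkers => context.reply(workers.size)
    }
  }
}
```

Because both methods are partial functions, a message that matches no case is simply "not defined" for that endpoint, which is what lets the caller decide how to report unsupported messages.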

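3. RpcEndpointRef

3.1 RpcEndpointRef

org.apache.spark.rpc.RpcEndpointRef
It wraps this node's reference to another RpcEndpoint; through it, that RpcEndpoint can be reached. It mainly defines the following methods:
- send - sends a one-way asynchronous message
- ask - sends a message that expects a reply and returns a Future; the caller decides when to block and wait for the response
- askSync - calls ask internally and blocks on the Future until the response arrives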
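The difference between the three methods can be sketched with plain `scala.concurrent` types. `MiniRef` below is a hypothetical in-process reference written for this illustration, not Spark's RpcEndpointRef: its "remote" side is just a function, and the timeout parameter of `askSync` is an assumption about how a blocking variant would be bounded.

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._

object RefSketch {
  // Hypothetical reference whose target handler runs in the same process
  class MiniRef(handler: Any => Any) {
    var oneWayCount = 0

    // send: fire-and-forget, nothing is returned to the caller
    def send(message: Any): Unit = {
      handler(message)
      oneWayCount += 1
    }

    // ask: returns a Future; the caller decides when (or whether) to block
    def ask[T](message: Any): Future[T] = {
      val promise = Promise[T]()
      try promise.success(handler(message).asInstanceOf[T])
      catch { case e: Exception => promise.failure(e) }
      promise.future
    }

    // askSync: ask plus a blocking wait bounded by a timeout
    def askSync[T](message: Any, timeout: FiniteDuration): T =
      Await.result(ask[T](message), timeout)
  }
}
```

A caller that can continue with other work uses `ask` and attaches callbacks to the Future; a caller that needs the answer right away uses `askSync` and accepts that its thread is blocked meanwhile.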

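3.2 NettyRpcEndpointRef

org.apache.spark.rpc.netty.NettyRpcEndpointRef
RpcEndpointRef's only implementation is NettyRpcEndpointRef, whose communication is implemented with Netty.
Its core methods send and ask all delegate to NettyRpcEnv's send and ask. When sending, NettyRpcEnv needs the message wrapped in a RequestMessage, which mainly consists of the sender, the receiver, and the content.
The source is fairly simple, so it is not shown here. Note that `this` in the NettyRpcEndpointRef source refers to the NettyRpcEndpointRef itself, i.e. the reference to some target RpcEndpoint, not to the RpcEndpoint that calls it (although an endpoint may of course send to itself).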

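4. RpcEnv

4.1 Core UML Diagram

UML diagram
Description
- This UML diagram shows the core of how an RpcEnv is created and what it does; this is the code to look at first when reading the source
- Spark creates the RpcEnv through an RpcEnvFactory, whose only implementation at present is NettyRpcEnvFactory. Given an RpcEnvConfig, its create method builds a NettyRpcEnv.
- RpcEnv's only implementation at present is NettyRpcEnv, through which endpoints interact with other RpcEndpoints. The diagram shows its core fields and methods:
  - dispatcher - registers the RpcEndpoints of this node and processes the messages that arrive in their Inboxes
  - outboxes - holds one Outbox per remote RpcAddress, through which messages are sent to the RpcEndpoints at that address
  - transportContext - the object that actually creates the TransportServer. It is instantiated with a NettyRpcHandler, which handles the messages coming from Netty (when a message arrives, the handler's receive is called and uses the dispatcher to deliver it to an Inbox)
  - startServer(…) - called by NettyRpcEnvFactory inside create(…); it in turn asks transportContext to create and start the TransportServer (Netty underneath)
  - postToOutbox(…) - private method of this class; it sends a message either directly through the OutboxMessage or through the matching Outbox, and in both cases a TransportClient finally writes the message to the connected channel
  - send(…) - the method other classes call to send a message; it ends up in the dispatcher (local receiver) or in postToOutbox(…) (remote receiver)
  - ask(…) - same as send(…), except that it returns a Future and leaves it to the caller how to handle the reply

4.2 NettyRpcEnv Source Code Analysis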

org.apache.spark.rpc.netty.NettyRpcEnv

private[netty] class NettyRpcEnv(
    val conf: SparkConf,
    javaSerializerInstance: JavaSerializerInstance,
    host: String,
    securityManager: SecurityManager,
    numUsableCores: Int) extends RpcEnv(conf) with Logging {

  // Some code omitted

  private val dispatcher: Dispatcher = new Dispatcher(this, numUsableCores)

  private val transportContext = new TransportContext(transportConf, 
    new NettyRpcHandler(dispatcher, this, streamManager))

  // Some code omitted

  /**
   * Maps each remote RpcAddress to its Outbox
   */
  private val outboxes = new ConcurrentHashMap[RpcAddress, Outbox]()

  // Some code omitted

  /**
   * Creates the TransportServer; called from NettyRpcEnvFactory.create(...)
   */
  def startServer(bindAddress: String, port: Int): Unit = {
    // Add the auth bootstrap only if Spark authentication is enabled (it is off by default)
    val bootstraps: java.util.List[TransportServerBootstrap] =
      if (securityManager.isAuthenticationEnabled()) {
        java.util.Arrays.asList(new AuthServerBootstrap(transportConf, securityManager))
      } else {
        java.util.Collections.emptyList()
      }
    // Create the TransportServer; internally it is built with Netty's ServerBootstrap
    server = transportContext.createServer(bindAddress, port, bootstraps)
    // Register an RpcEndpoint that lets remote RpcEnvs check whether a given RpcEndpoint exists
    dispatcher.registerRpcEndpoint(
      RpcEndpointVerifier.NAME, new RpcEndpointVerifier(this, dispatcher))
  }

  // Some code omitted
  /**
   * Registers an RpcEndpoint; an RpcEndpoint is normally registered through this method when it is created,
   * e.g. Master, Worker, CoarseGrainedExecutorBackend
   */
  override def setupEndpoint(name: String, endpoint: RpcEndpoint): RpcEndpointRef = {
    dispatcher.registerRpcEndpoint(name, endpoint)
  }

  // Some code omitted

  /**
   * Private method of this class, called by send and ask; sends a message to the target node through its Outbox
   */
  private def postToOutbox(receiver: NettyRpcEndpointRef, message: OutboxMessage): Unit = {
    if (receiver.client != null) {
      // When client is not null, the OutboxMessage is sent directly:
      // this calls one of the send* methods of receiver.client,
      // which holds the channel connected to the target node and performs the actual send
      message.sendWith(receiver.client)
    } else {
      require(receiver.address != null,
        "Cannot send message to client endpoint with no listen address.")
      // Get the outbox
      val targetOutbox = {
        // Look it up in outboxes first
        val outbox = outboxes.get(receiver.address)
        if (outbox == null) {
          // Not found in outboxes, so create a new one
          val newOutbox = new Outbox(this, receiver.address)
          val oldOutbox = outboxes.putIfAbsent(receiver.address, newOutbox)
          if (oldOutbox == null) {
            newOutbox
          } else {
            oldOutbox
          }
        } else {
          outbox
        }
      }
      // Has this RpcEnv already been stopped?
      if (stopped.get) {
        outboxes.remove(receiver.address)
        targetOutbox.stop()
      } else {
        // Not stopped: put the message into the Outbox,
        // which calls drainOutbox() to send every message queued in it.
        // In the end this also calls message.sendWith(...), just like the client != null branch above
        targetOutbox.send(message)
      }
    }
  }

  /**
   * Sends a one-way message to the target node
   */
  private[netty] def send(message: RequestMessage): Unit = {
    val remoteAddr = message.receiver.address
    if (remoteAddr == address) {
      // Local receiver: use the dispatcher to post the message straight into the local Inbox
      try {
        dispatcher.postOneWayMessage(message)
      } catch {
        case e: RpcEnvStoppedException => logDebug(e.getMessage)
      }
    } else {
      // Otherwise call postToOutbox to send the message to the remote node
      postToOutbox(message.receiver, OneWayOutboxMessage(message.serialize(this)))
    }
  }

  // Some code omitted

  /**
   * Sends a message to the target node and returns a Future
   */
  private[netty] def ask[T: ClassTag](message: RequestMessage, timeout: RpcTimeout): Future[T] = {
    val promise = Promise[Any]()
    val remoteAddr = message.receiver.address

    // Callback invoked on failure
    def onFailure(e: Throwable): Unit = {
      if (!promise.tryFailure(e)) {
        e match {
          case e : RpcEnvStoppedException => logDebug (s"Ignored failure: $e")
          case _ => logWarning(s"Ignored failure: $e")
        }
      }
    }
    // Callback invoked on success
    def onSuccess(reply: Any): Unit = reply match {
      case RpcFailure(e) => onFailure(e)
      case rpcReply =>
        if (!promise.trySuccess(rpcReply)) {
          logWarning(s"Ignored message: $reply")
        }
    }

    try {
      if (remoteAddr == address) {
        // Local receiver: use the dispatcher to post the message straight into the local Inbox
        val p = Promise[Any]()
        p.future.onComplete {
          case Success(response) => onSuccess(response)
          case Failure(e) => onFailure(e)
        }(ThreadUtils.sameThread)
        dispatcher.postLocalMessage(message, p)
      } else {
        // Wrap the onFailure and onSuccess callbacks into the message
        val rpcMessage = RpcOutboxMessage(message.serialize(this),
          onFailure,
          (client, response) => onSuccess(deserialize[Any](client, response)))
        // Remote receiver: call postToOutbox to send the message to the remote node
        postToOutbox(message.receiver, rpcMessage)
        promise.future.failed.foreach {
          case _: TimeoutException => rpcMessage.onTimeout()
          case _ =>
        }(ThreadUtils.sameThread)
      }
      // Schedule a timer task that fails the promise when the timeout expires
      val timeoutCancelable = timeoutScheduler.schedule(new Runnable {
        override def run(): Unit = {
          onFailure(new TimeoutException(s"Cannot receive any reply from ${remoteAddr} " +
            s"in ${timeout.duration}"))
        }
      }, timeout.duration.toNanos, TimeUnit.NANOSECONDS)
      promise.future.onComplete { v =>
        timeoutCancelable.cancel(true)
      }(ThreadUtils.sameThread)
    } catch {
      case NonFatal(e) =>
        onFailure(e)
    }
    promise.future.mapTo[T].recover(timeout.addMessageIfTimeout)(ThreadUtils.sameThread)
  }

  // Some code omitted

  /**
   * Returns the RpcEndpointRef of the given RpcEndpoint
   */
  override def endpointRef(endpoint: RpcEndpoint): RpcEndpointRef = {
    dispatcher.getRpcEndpointRef(endpoint)
  }

  // Some code omitted

  /**
   * After a node has created its RpcEnv and RpcEndpoints, it calls this method to block,
   * e.g. Master, Worker, CoarseGrainedExecutorBackend
   */
  override def awaitTermination(): Unit = {
    // The dispatcher in turn calls awaitTermination on its internal thread pool, which blocks
    dispatcher.awaitTermination()
  }

  // Some code omitted

}
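The timeout handling inside ask(...) is the least obvious part of the listing above, so here is a small sketch of the same idea under simplifying assumptions: one Promise, one scheduled task that tries to fail it, and a success callback that tries to complete it, with whichever comes first winning. `AskTimeoutSketch`, the `deliver` hook and the daemon-thread scheduler are all invented for this illustration and are not Spark code.

```scala
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit, TimeoutException}
import scala.concurrent.{Future, Promise}

object AskTimeoutSketch {
  // Daemon thread so the scheduler never keeps the JVM alive (sketch-level choice)
  private val timeoutScheduler = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
    override def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "ask-timeout-sketch")
      t.setDaemon(true)
      t
    }
  })

  // `deliver` is a hypothetical hook: it receives the success callback and
  // may invoke it later (or never) with the reply
  def ask(deliver: (Any => Unit) => Unit, timeoutMs: Long): Future[Any] = {
    val promise = Promise[Any]()
    // Schedule the failure; tryFailure is a no-op if a reply already arrived
    val timeoutTask = timeoutScheduler.schedule(new Runnable {
      override def run(): Unit = {
        promise.tryFailure(new TimeoutException(s"No reply in $timeoutMs ms"))
      }
    }, timeoutMs, TimeUnit.MILLISECONDS)
    deliver { reply =>
      // trySuccess returns false if the timeout already failed the promise
      if (promise.trySuccess(reply)) timeoutTask.cancel(true)
    }
    promise.future
  }
}
```

Using `trySuccess` / `tryFailure` instead of `success` / `failure` is what makes the race between the reply and the timer harmless: the loser of the race is simply ignored, which matches the "Ignored failure" / "Ignored message" log lines in the original method.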

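As we can see, NettyRpcEnv is mainly made up of the following parts:
- the Dispatcher, which maintains the RpcEndpoints and processes their Inboxes (how it does so is explained later)
- the Outboxes, which store the messages waiting to be sent
- postToOutbox, send, and ask, which send the messages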
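The routing decision made by send and postToOutbox can be condensed into a few lines. The sketch below is a simplified model, not Spark code: `Address`, `MiniOutbox` and `MiniEnv` are hypothetical names, the "inbox" and "outbox" are plain buffers, and the direct-client branch of postToOutbox is left out.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.collection.mutable.ArrayBuffer

object RoutingSketch {
  case class Address(host: String, port: Int)

  // Hypothetical stand-in for Outbox: it only buffers messages per remote address
  class MiniOutbox(val address: Address) {
    val sent = ArrayBuffer.empty[Any]
    def send(message: Any): Unit = synchronized { sent += message }
  }

  class MiniEnv(val address: Address) {
    val localInbox = ArrayBuffer.empty[Any]
    val outboxes = new ConcurrentHashMap[Address, MiniOutbox]()

    def send(receiver: Address, message: Any): Unit = {
      if (receiver == address) {
        // Local receiver: hand the message straight to the local inbox
        localInbox.synchronized { localInbox += message }
      } else {
        // Remote receiver: reuse the outbox for that address or create it lazily
        val existing = outboxes.get(receiver)
        val target = if (existing != null) existing else {
          val fresh = new MiniOutbox(receiver)
          // putIfAbsent returns the previous value if another thread won the race
          val raced = outboxes.putIfAbsent(receiver, fresh)
          if (raced == null) fresh else raced
        }
        target.send(message)
      }
    }
  }
}
```

The get-then-putIfAbsent pattern guarantees that two threads sending to the same new address end up sharing one outbox, so message order per address is decided in a single place.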

4.3 Outbox Source Code Analysis

org.apache.spark.rpc.netty.Outbox

private[netty] class Outbox(nettyEnv: NettyRpcEnv, val address: RpcAddress) {

  outbox => // Alias for this, for convenience

  // Some code omitted

  /**
   * Sends a message; called from outside this class
   */
  def send(message: OutboxMessage): Unit = {
    val dropped = synchronized {
      if (stopped) {
        true
      } else {
        // Add the message to the queue
        messages.add(message)
        false
      }
    }
    if (dropped) {
      message.onFailure(new SparkException("Message is dropped because Outbox is stopped"))
    } else {
      // Send the messages queued in the Outbox until it is empty
      drainOutbox()
    }
  }

  /**
   * Sends the messages queued in the Outbox until it is empty
   */
  private def drainOutbox(): Unit = {
    var message: OutboxMessage = null
    synchronized {
      if (stopped) {
        return
      }
      if (connectFuture != null) {
        // We are connecting to the remote address, so just exit
        return
      }
      if (client == null) {
        // Use nettyEnv to create a client, which is needed to send messages later
        launchConnectTask()
        return
      }
      if (draining) {
        // There is some thread draining, so just exit
        return
      }
      // Take a message
      message = messages.poll()
      if (message == null) {
        return
      }
      draining = true
    }
    while (true) {
      try {
        // Get the client
        val _client = synchronized { client }
        if (_client != null) {
          // Depending on the message type,
          // use one of the client's send* methods to send it over the channel to the target node
          message.sendWith(_client)
        } else {
          assert(stopped == true)
        }
      } catch {
        case NonFatal(e) =>
          handleNetworkFailure(e)
          return
      }
      synchronized {
        if (stopped) {
          return
        }
        message = messages.poll()
        if (message == null) {
          draining = false
          return
        }
      }
    }
  }

  // Some code omitted
  
}

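The core of Outbox is send(…) and drainOutbox(); these two methods are what send the messages.
Note that OutboxMessage has two implementations:
- OneWayOutboxMessage - a one-way message that needs no reply
- RpcOutboxMessage - an RPC request that needs a reply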
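The interesting property of drainOutbox() is that messages are buffered until a connection exists and that only one thread drains at a time. The following sketch models just that behaviour under simplifying assumptions: `MiniOutbox`, the `connected` flag (standing in for `client != null`), the `onConnected` hook and the `wire` buffer are hypothetical, and stop and failure handling are left out.

```scala
import java.util.LinkedList
import scala.collection.mutable.ArrayBuffer

object OutboxSketch {
  class MiniOutbox {
    private val messages = new LinkedList[String]()
    private var connected = false          // stands in for `client != null`
    private var draining = false
    val wire = ArrayBuffer.empty[String]   // what actually got "sent"

    def send(message: String): Unit = {
      synchronized { messages.add(message) }
      drain()
    }

    // Hypothetical hook: called once the (asynchronous) connect has finished
    def onConnected(): Unit = {
      synchronized { connected = true }
      drain()
    }

    private def drain(): Unit = {
      var message: String = null
      synchronized {
        // Not connected yet, or another thread is already draining: leave it queued
        if (!connected || draining) return
        message = messages.poll()
        if (message == null) return
        draining = true
      }
      while (true) {
        wire += message                    // the send itself happens outside the lock
        synchronized {
          message = messages.poll()
          if (message == null) {
            draining = false
            return
          }
        }
      }
    }
  }
}
```

Sending outside the lock keeps slow network writes from blocking threads that only want to enqueue, while the `draining` flag still preserves FIFO order.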

4.4 Dispatcher and Inbox Source Code Analysis

org.apache.spark.rpc.netty.Dispatcher

private[netty] class Dispatcher(nettyEnv: NettyRpcEnv, numUsableCores: Int) extends Logging {

  private class EndpointData(
      val name: String,
      val endpoint: RpcEndpoint,
      val ref: NettyRpcEndpointRef) {
    // Instantiating EndpointData also instantiates the Inbox of that RpcEndpoint
    val inbox = new Inbox(ref, endpoint)
  }

  // Maps endpoint names to EndpointData; each EndpointData holds the Inbox of its Endpoint.
  // Every message sent to an Endpoint goes through this map
  private val endpoints: ConcurrentMap[String, EndpointData] =
    new ConcurrentHashMap[String, EndpointData]
    
  // Maps each RpcEndpoint to its own ref, so the reference can be fetched quickly without creating a new RpcEndpointRef.
  // The self method of RpcEndpoint gets its RpcEndpointRef from here
  private val endpointRefs: ConcurrentMap[RpcEndpoint, RpcEndpointRef] =
    new ConcurrentHashMap[RpcEndpoint, RpcEndpointRef]

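  // Queue of EndpointData entries, marking which Inboxes may contain messages.
  // The MessageLoop below takes entries from this queue and processes them (i.e. processes the messages in their Inboxes)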
  private val receivers = new LinkedBlockingQueue[EndpointData]

  /**
   * Registers an RpcEndpoint
   */
  def registerRpcEndpoint(name: String, endpoint: RpcEndpoint): NettyRpcEndpointRef = {
    val addr = RpcEndpointAddress(nettyEnv.address, name)
    // Build a NettyRpcEndpointRef for this address
    val endpointRef = new NettyRpcEndpointRef(nettyEnv.conf, addr, nettyEnv)
    synchronized {
      if (stopped) {
        throw new IllegalStateException("RpcEnv has been stopped")
      }
      // If no EndpointData exists under this name, create one and put it in.
      // Note that new EndpointData also instantiates an Inbox internally
      if (endpoints.putIfAbsent(name, new EndpointData(name, endpoint, endpointRef)) != null) {
        throw new IllegalArgumentException(s"There is already an RpcEndpoint called $name")
      }
      val data = endpoints.get(name)
      endpointRefs.put(data.endpoint, data.ref)
      // Note this line: the EndpointData is put into the receivers queue.
      // When the loop later takes it out, the first message in its Inbox is OnStart (so RpcEndpoint.onStart is called first)
      receivers.offer(data)
    }
    endpointRef
  }

  // Some code omitted

  /**
   * Iterates over all registered RpcEndpoints and posts the message to each of them
   */
  def postToAll(message: InboxMessage): Unit = {
    val iter = endpoints.keySet().iterator()
    while (iter.hasNext) {
      val name = iter.next
        postMessage(name, message, (e) => { e match {
          case e: RpcEnvStoppedException => logDebug (s"Message $message dropped. ${e.getMessage}")
          case e: Throwable => logWarning(s"Message $message dropped. ${e.getMessage}")
        }}
      )}
  }

  /** 
   * Posts a message sent by a remote endpoint.
   * Netty's server receives the message and calls NettyRpcHandler.receive, which calls this method
   */
  def postRemoteMessage(message: RequestMessage, callback: RpcResponseCallback): Unit = {
    val rpcCallContext =
      new RemoteNettyRpcCallContext(nettyEnv, callback, message.senderAddress)
    val rpcMessage = RpcMessage(message.senderAddress, message.content, rpcCallContext)
    postMessage(message.receiver.name, rpcMessage, (e) => callback.onFailure(e))
  }

  /** Posts a message sent by a local endpoint */
  def postLocalMessage(message: RequestMessage, p: Promise[Any]): Unit = {
    val rpcCallContext =
      new LocalNettyRpcCallContext(message.senderAddress, p)
    val rpcMessage = RpcMessage(message.senderAddress, message.content, rpcCallContext)
    postMessage(message.receiver.name, rpcMessage, (e) => p.tryFailure(e))
  }

  /** Posts a one-way message; called for both remote and local senders */
  def postOneWayMessage(message: RequestMessage): Unit = {
    postMessage(message.receiver.name, OneWayMessage(message.senderAddress, message.content),
      (e) => throw e)
  }

  /**
   * Posts a message to the given endpoint.
   * This only puts the message into the Inbox; the internal MessageLoop threads process the Inbox messages later
   */
  private def postMessage(
      endpointName: String,
      message: InboxMessage,
      callbackIfStopped: (Exception) => Unit): Unit = {
    val error = synchronized {
      // Get the EndpointData of this endpoint
      val data = endpoints.get(endpointName)
      if (stopped) {
        Some(new RpcEnvStoppedException())
      } else if (data == null) {
        Some(new SparkException(s"Could not find $endpointName."))
      } else {
        // Put the message into the inbox
        data.inbox.post(message)
        receivers.offer(data)
        None
      }
    }
    // We don't need to call `onStop` in the `synchronized` block
    error.foreach(callbackIfStopped)
  }

  // Some code omitted

  def awaitTermination(): Unit = {
    threadpool.awaitTermination(Long.MaxValue, TimeUnit.MILLISECONDS)
  }

  // Some code omitted

  /** Initialized when the Dispatcher is instantiated */
  private val threadpool: ThreadPoolExecutor = {
    val availableCores =
      if (numUsableCores > 0) numUsableCores else Runtime.getRuntime.availableProcessors()
    val numThreads = nettyEnv.conf.getInt("spark.rpc.netty.dispatcher.numThreads",
      math.max(2, availableCores))
    val pool = ThreadUtils.newDaemonFixedThreadPool(numThreads, "dispatcher-event-loop")
    for (i <- 0 until numThreads) {
      // The thread pool starts running a MessageLoop, i.e. calls its run method
      pool.execute(new MessageLoop)
    }
    pool
  }

  private class MessageLoop extends Runnable {
    override def run(): Unit = {
      try {
        // This loop keeps running
        while (true) {
          try {
            // First take an EndpointData from the queue of entries marked as possibly having messages
            val data = receivers.take()
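            // PoisonPill tells the current Thread that it should leave the loop (i.e. it poisons the thread ^_^)
            if (data == PoisonPill) {
              // Before exiting, put the pill back so that the other Threads take it and exit too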
              receivers.offer(PoisonPill)
              return
            }
            // Call the inbox's process to handle its messages
            data.inbox.process(Dispatcher.this)
          } catch {
            case NonFatal(e) => logError(e.getMessage, e)
          }
        }
      } catch {
        // Some code omitted
      }
    }
  }

}

org.apache.spark.rpc.netty.Inbox

private[netty] class Inbox(
    val endpointRef: NettyRpcEndpointRef,
    val endpoint: RpcEndpoint)
  extends Logging {

  inbox =>  // Alias for this, for convenience

  // Runs when the Inbox is instantiated:
  // OnStart becomes the first entry of the messages queue, so in the RpcEndpoint life-cycle onStart follows the constructor
  inbox.synchronized {
    messages.add(OnStart)
  }

  /**
   * Processes the messages in the inbox; called by the MessageLoop in the dispatcher
   */
  def process(dispatcher: Dispatcher): Unit = {
    var message: InboxMessage = null
    inbox.synchronized {
      if (!enableConcurrent && numActiveThreads != 0) {
        return
      }
      // Take a message
      message = messages.poll()
      if (message != null) {
        numActiveThreads += 1
      } else {
        return
      }
    }
    while (true) {
      // Note: safelyCall is a curried method; the block below is its second argument list
      safelyCall(endpoint) {
        // Match on the message type and handle each type accordingly
        message match {
          case RpcMessage(_sender, content, context) =>
            try {
              // An RpcMessage comes from an ask (local or remote), so the sender is waiting for a response.
              // Therefore call this endpoint's receiveAndReply to handle it and reply
              endpoint.receiveAndReply(context).applyOrElse[Any, Unit](content, { msg =>
                throw new SparkException(s"Unsupported message $message from ${_sender}")
              })
            } catch {
              case e: Throwable =>
                context.sendFailure(e)
                // Throw the exception -- this exception will be caught by the safelyCall function.
                // The endpoint's onError function will be called.
                throw e
            }

          case OneWayMessage(_sender, content) =>
            // A one-way message, which may come from a local or a remote sender,
            // so only this endpoint's receive method needs to be called
            endpoint.receive.applyOrElse[Any, Unit](content, { msg =>
              throw new SparkException(s"Unsupported message $message from ${_sender}")
            })

          case OnStart =>
            // OnStart was put into the message queue when the Inbox was instantiated;
            // it triggers the endpoint's onStart life-cycle method
            endpoint.onStart()
            if (!endpoint.isInstanceOf[ThreadSafeRpcEndpoint]) {
              inbox.synchronized {
                if (!stopped) {
                  enableConcurrent = true
                }
              }
            }

          // Some code omitted
        }
      }

      // Some code omitted
    }
  }

  /**
   * Used by outside callers to post a message into the Inbox's message queue
   */
  def post(message: InboxMessage): Unit = inbox.synchronized {
    if (stopped) {
      // We already put "OnStop" into "messages", so we should drop further messages
      onDrop(message)
    } else {
      // Add the message to the message queue
      messages.add(message)
      false
    }
  }

  // Some code omitted

}

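Because the Dispatcher and the Inbox work hand in hand, we analyze them together here.
The Dispatcher provides registerRpcEndpoint to register an RpcEndpoint; only then can messages be delivered to it through its Inbox. For delivering messages it offers several public methods: postToAll, postRemoteMessage, postLocalMessage, and postOneWayMessage. At their core they all call postMessage, which puts the message into the message queue of the Inbox. Finally, the MessageLoop threads, which are started when the Dispatcher is instantiated, call inbox.process(…) in a loop to process the messages in the Inboxes.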
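The MessageLoop and its PoisonPill shutdown can be sketched in isolation. This is a simplified model under stated assumptions, not Spark's Dispatcher: `Item`, `MiniDispatcher`, `post` and `stop` are hypothetical names, and the queue holds closures rather than EndpointData.

```scala
import java.util.concurrent.{Executors, LinkedBlockingQueue, TimeUnit}

object LoopSketch {
  // Hypothetical work item; Spark queues EndpointData instead
  final class Item(val run: () => Unit)
  private val PoisonPill = new Item(() => ())

  class MiniDispatcher(numThreads: Int) {
    private val receivers = new LinkedBlockingQueue[Item]()
    private val pool = Executors.newFixedThreadPool(numThreads)

    // Start one message loop per thread as soon as the dispatcher is created
    for (_ <- 0 until numThreads) pool.execute(new Runnable {
      override def run(): Unit = {
        var alive = true
        while (alive) {
          val item = receivers.take()       // blocks until something is queued
          if (item eq PoisonPill) {
            receivers.offer(PoisonPill)     // pass the pill on to the next thread
            alive = false
          } else {
            try item.run() catch { case _: Exception => () }
          }
        }
      }
    })

    def post(work: () => Unit): Unit = { receivers.offer(new Item(work)) }

    // Returns true if all loops exited within the (arbitrary) 5 second limit
    def stop(): Boolean = {
      receivers.offer(PoisonPill)
      pool.shutdown()
      pool.awaitTermination(5, TimeUnit.SECONDS)
    }
  }
}
```

A single pill is enough for any number of threads because each thread re-offers it before exiting, and since the queue is FIFO, work posted before the pill is still taken first.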
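The core of the Inbox is:
- post - puts a message into the message queue
- process - handles each message according to its type, eventually calling receiveAndReply or receive of the corresponding Endpoint
In addition, when an Inbox is instantiated it puts an OnStart message at the head of its message queue; the MessageLoop calls inbox.process(…), which in turn calls the Endpoint's onStart method.
This explains the description of the RpcEndpoint life-cycle given at the beginning:
- constructor -> onStart -> receive* -> onStop
Readers who like to get to the bottom of things may feel that the order is still not entirely clear, because we have not yet seen the code that actually makes these calls. We will analyze that later, when looking at how concrete RpcEndpoint implementations are invoked.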
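The ordering argument can be checked with a toy Inbox. The sketch below is an illustration under assumptions, not Spark's Inbox: `MiniInbox` records method names instead of calling an endpoint, and its `stop()` (which queues OnStop and drops later messages) is modelled on the "We already put OnStop into messages" comment in the listing above.

```scala
import java.util.LinkedList
import scala.collection.mutable.ArrayBuffer

object InboxSketch {
  sealed trait InboxMessage
  case object OnStart extends InboxMessage
  case object OnStop extends InboxMessage
  case class OneWay(content: Any) extends InboxMessage

  class MiniInbox {
    val calls = ArrayBuffer.empty[String]
    private val messages = new LinkedList[InboxMessage]()
    private var stopped = false

    // Queued in the constructor, so it is always the first message processed
    messages.add(OnStart)

    def post(message: InboxMessage): Unit = synchronized {
      if (!stopped) messages.add(message)   // messages posted after stop are dropped
    }

    def stop(): Unit = synchronized {
      if (!stopped) {
        stopped = true
        messages.add(OnStop)
      }
    }

    // Processes everything currently queued, in FIFO order
    def process(): Unit = {
      var message = synchronized { messages.poll() }
      while (message != null) {
        message match {
          case OnStart => calls += "onStart"
          case OneWay(content) => calls += s"receive($content)"
          case OnStop => calls += "onStop"
        }
        message = synchronized { messages.poll() }
      }
    }
  }
}
```

Because life-cycle events travel through the same FIFO queue as ordinary messages, the order constructor -> onStart -> receive* -> onStop falls out of the queue discipline rather than from any special-case code.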

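4.5 How Does NettyRpcEnv Receive External Messages?

The key here is the transportContext in NettyRpcEnv. It starts the Netty server and supplies the NettyRpcHandler that processes the messages, which makes it the heart of NettyRpcEnv's communication.

First, look at the transportContext in NettyRpcEnv: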

private val transportContext = new TransportContext(transportConf,
  new NettyRpcHandler(dispatcher, this, streamManager))

Here, when NettyRpcEnv is instantiated, a TransportContext is created and a NettyRpcHandler is instantiated along with it.

Next, look at the startServer(…) method of NettyRpcEnv:

def startServer(bindAddress: String, port: Int): Unit = {
  val bootstraps: java.util.List[TransportServerBootstrap] =
    if (securityManager.isAuthenticationEnabled()) {
      java.util.Arrays.asList(new AuthServerBootstrap(transportConf, securityManager))
    } else {
      java.util.Collections.emptyList()
    }
  server = transportContext.createServer(bindAddress, port, bootstraps)
  dispatcher.registerRpcEndpoint(
    RpcEndpointVerifier.NAME, new RpcEndpointVerifier(this, dispatcher))
}

Here, transportContext's createServer method is called to create the TransportServer.

Then look at what TransportContext's createServer(…) does:

public TransportServer createServer(
    String host, int port, List<TransportServerBootstrap> bootstraps) {
  return new TransportServer(this, host, port, rpcHandler, bootstraps);
}

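Here, a TransportServer is instantiated and the rpcHandler is passed in (that is, the NettyRpcHandler from before).

Now look at what TransportServer does (a basic Netty usage example is useful background here):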

public TransportServer(
    TransportContext context,
    String hostToBind,
    int portToBind,
    RpcHandler appRpcHandler,
    List<TransportServerBootstrap> bootstraps) {
  this.context = context;
  this.conf = context.getConf();
  this.appRpcHandler = appRpcHandler;
  this.bootstraps = Lists.newArrayList(Preconditions.checkNotNull(bootstraps));

  boolean shouldClose = true;
  try {
    // Initialize with the given host and port
    init(hostToBind, portToBind);
    shouldClose = false;
  } finally {
    if (shouldClose) {
      JavaUtils.closeQuietly(this);
    }
  }
}

private void init(String hostToBind, int portToBind) {

  IOMode ioMode = IOMode.valueOf(conf.ioMode());
  // bossGroup handles channel connections from other nodes
  EventLoopGroup bossGroup =
    NettyUtils.createEventLoop(ioMode, conf.serverThreads(), conf.getModuleName() + "-server");
  // workerGroup handles the data received and sent on the channels (here it is the same group as bossGroup)
  EventLoopGroup workerGroup = bossGroup;

  PooledByteBufAllocator allocator = NettyUtils.createPooledByteBufAllocator(
    conf.preferDirectBufs(), true /* allowCache */, conf.serverThreads());
  
  // Build the ServerBootstrap used to start the Netty server
  bootstrap = new ServerBootstrap()
    .group(bossGroup, workerGroup)
    .channel(NettyUtils.getServerChannelClass(ioMode))
    .option(ChannelOption.ALLOCATOR, allocator)
    .option(ChannelOption.SO_REUSEADDR, !SystemUtils.IS_OS_WINDOWS)
    .childOption(ChannelOption.ALLOCATOR, allocator);

  this.metrics = new NettyMemoryMetrics(
    allocator, conf.getModuleName() + "-server", conf);

  if (conf.backLog() > 0) {
    bootstrap.option(ChannelOption.SO_BACKLOG, conf.backLog());
  }

  if (conf.receiveBuf() > 0) {
    bootstrap.childOption(ChannelOption.SO_RCVBUF, conf.receiveBuf());
  }

  if (conf.sendBuf() > 0) {
    bootstrap.childOption(ChannelOption.SO_SNDBUF, conf.sendBuf());
  }

  // Set up the Pipeline that processes channel data
  bootstrap.childHandler(new ChannelInitializer<SocketChannel>() {
    @Override
    protected void initChannel(SocketChannel ch) {
      logger.debug("New connection accepted for remote address {}.", ch.remoteAddress());
	
      RpcHandler rpcHandler = appRpcHandler;
      for (TransportServerBootstrap bootstrap : bootstraps) {
        rpcHandler = bootstrap.doBootstrap(ch, rpcHandler);
      }
      // The rpcHandler is passed in and wrapped into a TransportRequestHandler,
      // which is then wrapped into a TransportChannelHandler.
      // TransportChannelHandler extends Netty's ChannelInboundHandlerAdapter
      // (see the channelRead method of TransportChannelHandler; readers new to Netty should learn its basics first).
      // Finally the TransportChannelHandler is added at the end of the Pipeline
      context.initializePipeline(ch, rpcHandler);
    }
  });

  // Use the bootstrap to start the server
  InetSocketAddress address = hostToBind == null ?
      new InetSocketAddress(portToBind): new InetSocketAddress(hostToBind, portToBind);
  channelFuture = bootstrap.bind(address);
  channelFuture.syncUninterruptibly();

  port = ((InetSocketAddress) channelFuture.channel().localAddress()).getPort();
  logger.debug("Shuffle server started on port: {}", port);
}

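We can see that the constructor calls init(…), and init(…) is where the ServerBootstrap is actually built with Netty and started.
The key point is that bootstrap.childHandler(…) calls context.initializePipeline(…), which wraps the RpcHandler we passed in earlier into a TransportChannelHandler and adds it at the end of the Pipeline. This way, every message sent from a remote node is ultimately handled by the RpcHandler.
So when the Netty-based server receives a message, it travels through the Pipeline and finally a method of the RpcHandler is called (receive being the most important one).

The code of org.apache.spark.rpc.netty.NettyRpcHandler is as follows: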

private[netty] class NettyRpcHandler(
    dispatcher: Dispatcher,
    nettyEnv: NettyRpcEnv,
    streamManager: StreamManager) extends RpcHandler with Logging {

  private val remoteAddresses = new ConcurrentHashMap[RpcAddress, RpcAddress]()

  override def receive(
      client: TransportClient,
      message: ByteBuffer,
      callback: RpcResponseCallback): Unit = {
    val messageToDispatch = internalReceive(client, message)
    // Post the message through the dispatcher
    dispatcher.postRemoteMessage(messageToDispatch, callback)
  }

  override def receive(
      client: TransportClient,
      message: ByteBuffer): Unit = {
    val messageToDispatch = internalReceive(client, message)
    // Post the message through the dispatcher
    dispatcher.postOneWayMessage(messageToDispatch)
  }

  // Some code omitted; the other methods also use the dispatcher to post messages
}

As we can see, every received message is handed to the dispatcher and finally reaches the Inbox of the target Endpoint. That completes the receiving side! ^_^

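5. Summary

Finally, let's summarize what we have covered. For convenience, refer once more to the overview diagram from the beginning ^_^
It shows the structure inside a single node (process); other nodes look the same.
First, a single node may host several RpcEndpoints. The Driver node, for example, contains DriverEndpoint, HeartbeatReceiver, and other RpcEndpoints.
Once an RpcEndpoint is registered in the RpcEnv, it gets its own Inbox for receiving messages; all Inboxes are maintained by the Dispatcher.
Before the Dispatcher gets involved, remote socket connections are handled by the TransportServer, a wrapper around Netty. The message finally travels through the Pipeline to the NettyRpcHandler, which uses the Dispatcher to deliver it to the right RpcEndpoint. That completes message reception.
When an RpcEndpoint needs to send a message, it first obtains the RpcEndpointRef of the receiver and then calls the ask/send methods of the RpcEnv, which put the message into a local Inbox, send it through an Outbox, or send it directly:
- If the receiver is on the local node, the Dispatcher posts the message into the local Inbox
- If the receiver is on a remote node, postToOutbox is called, which either sends the message directly or sends it through an Outbox
In the end, outgoing messages are written to the channel maintained by a TransportClient (the TransportClient is created by createClient in TransportClientFactory and is again a wrapper around Netty).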
