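[Source Code Analysis] TensorFlow Distributed Environment (2) --- Master Static Logic

Before looking at the individual TensorFlow distribution Strategies, we first need to understand what they are built on: the distributed environment. A solid grasp of this foundation removes most of the obstacles in the later analysis. This article walks through the static logic of the Master.

Other articles in this series:

[Translation] TensorFlow Distributed, Paper Series: "TensorFlow : Large-Scale Machine Learning on Heterogeneous Distributed Systems"

[Translation] TensorFlow Distributed, Paper Series: "Implementation of Control Flow in TensorFlow"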

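1. Overview

A Server runs two RPC services: MasterService and WorkerService. When a Client connects to a Server, that Server takes the Master role and the Client talks to its MasterService (MasterService also coordinates and controls the execution of multiple WorkerService instances).

The Master role is implemented by the Master Service. The Master Service is a gRPC service that interacts with a set of remote distributed devices in order to coordinate multiple worker services.

The Master Service corresponds to "//tensorflow/core/protobuf/master_service.proto", which defines interfaces such as CreateSession and RunStep. Every TensorFlow Server implements the Master Service.
A client interacts with the Master Service to run distributed TensorFlow computations.
A Master Service keeps track of multiple "master sessions". Each master session encapsulates a computation graph and its associated state.
A master session runs on the Master. Once the session is created, the master returns a handle to the client, and this handle ties the client to its master session.
Each master session usually corresponds to one "client session". The client sends an initial graph to the master by calling CreateSession and adds nodes to the graph by calling ExtendSession.
Note that "Master" is both a conceptual role (for example, a Master node) and a concrete class named Master.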

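2. Interface

2.1 Interface specification

The Client calls the Master Service through GrpcSession. Since this is an RPC service, the Client and MasterService need an agreed interface specification. It is defined in master_service.proto, which declares each RPC method together with its request and response message types.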

service MasterService {
  // Creates a session.
  rpc CreateSession(CreateSessionRequest) returns (CreateSessionResponse);

  // Extends a session.
  rpc ExtendSession(ExtendSessionRequest) returns (ExtendSessionResponse);

  // Prepares future partial run calls.
  rpc PartialRunSetup(PartialRunSetupRequest) returns (PartialRunSetupResponse);

  // Drives the graph computation.
  rpc RunStep(RunStepRequest) returns (RunStepResponse);

  // Closes a session.
  rpc CloseSession(CloseSessionRequest) returns (CloseSessionResponse);

  // List the devices usable by the master.
  rpc ListDevices(ListDevicesRequest) returns (ListDevicesResponse);

  // Close and abandon all existing sessions.  Ongoing computations
  // will no longer affect fresh ones via the resources in containers listed in
  // the ResetRequest.  See ResetRequest for more details.
  rpc Reset(ResetRequest) returns (ResetResponse);

  // Registers a callable for execution with RunCallable.
  rpc MakeCallable(MakeCallableRequest) returns (MakeCallableResponse);

  // Executes a callable registered with MakeCallable.
  rpc RunCallable(RunCallableRequest) returns (RunCallableResponse);

  // Frees resources associated with a callable registered with MakeCallable.
  rpc ReleaseCallable(ReleaseCallableRequest) returns (ReleaseCallableResponse);
}

2.2 MasterInterface

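The Client uses MasterInterface to reach the MasterService. MasterInterface is an abstract interface class for communication between the Client and the TensorFlow Master service. It supports both an RPC-based master implementation and an in-process master implementation that needs no RPC round trip. All MasterInterface methods are synchronous, so the Client calls the services of a remote MasterService as if they were local functions.

MasterInterface has two implementations, both used to communicate with the Master service:

LocalMaster is used for direct in-process communication, when the Client and the Master live in the same process.
GrpcRemoteMaster uses gRPC to communicate with the Master service, when the Client and the Master are deployed in two different processes.
- The factory function NewGrpcMaster creates a GrpcRemoteMaster instance.
- GrpcRemoteMaster is a gRPC client: it accesses the MasterService on the remote Master through a Stub, and the concrete service behind it is GrpcMasterService.
- Because all MasterInterface methods are synchronous, the Client accesses MasterService as if it were calling a local function.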

class MasterInterface {
 public:
  virtual ~MasterInterface() {}
  virtual Status CreateSession(CallOptions* call_options,
                               const CreateSessionRequest* request,
                               CreateSessionResponse* response) = 0;

  virtual Status ExtendSession(CallOptions* call_options,
                               const ExtendSessionRequest* request,
                               ExtendSessionResponse* response) = 0;

  virtual Status PartialRunSetup(CallOptions* call_options,
                                 const PartialRunSetupRequest* request,
                                 PartialRunSetupResponse* response) {
    return errors::Unimplemented("Partial run not implemented for this master");
  }

  virtual Status RunStep(CallOptions* call_options,
                         RunStepRequestWrapper* request,
                         MutableRunStepResponseWrapper* response) = 0;

  virtual Status RunStep(CallOptions* call_options,
                         const RunStepRequest* request,
                         RunStepResponse* response) {
    std::unique_ptr<RunStepRequestWrapper> wrapped_request(
        new ProtoRunStepRequest(request));
    std::unique_ptr<MutableRunStepResponseWrapper> wrapped_response(
        new NonOwnedProtoRunStepResponse(response));
    return RunStep(call_options, wrapped_request.get(), wrapped_response.get());
  }

  virtual MutableRunStepRequestWrapper* CreateRunStepRequest() {
    MutableProtoRunStepRequest* ret = new MutableProtoRunStepRequest;
    ret->request_.set_request_id(GetUniqueRequestId());
    return ret;
  }

  virtual MutableRunStepResponseWrapper* CreateRunStepResponse() {
    return new OwnedProtoRunStepResponse;
  }

  virtual Status CloseSession(CallOptions* call_options,
                              const CloseSessionRequest* request,
                              CloseSessionResponse* response) = 0;

  virtual Status ListDevices(CallOptions* call_options,
                             const ListDevicesRequest* request,
                             ListDevicesResponse* response) = 0;

  virtual Status Reset(CallOptions* call_options, const ResetRequest* request,
                       ResetResponse* response) = 0;

  virtual Status MakeCallable(CallOptions* call_options,
                              const MakeCallableRequest* request,
                              MakeCallableResponse* response) = 0;
  virtual Status RunCallable(CallOptions* call_options,
                             const RunCallableRequest* request,
                             RunCallableResponse* response) = 0;
  virtual Status ReleaseCallable(CallOptions* call_options,
                                 const ReleaseCallableRequest* request,
                                 ReleaseCallableResponse* response) = 0;

 protected:
  // NOTE: This should only be called by implementations of this
  // interface whose CreateRunStepResponse() method returns a
  // proto-based wrappers for the RunStepResponse message.
  RunStepResponse* get_proto_from_wrapper(
      MutableRunStepResponseWrapper* wrapper) {
    return wrapper->get_proto();
  }
};

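In practice, if the Client and the Master are in the same process, LocalMaster is used directly; otherwise GrpcRemoteMaster is used to reach the remote GrpcMasterService over gRPC. The two rectangles labeled Master in the figure represent the actual Master class, which implements the Master functionality.

Figure 1. Master logical structure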
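
To make the "one interface, two transports" idea concrete, here is a minimal sketch. It is not TensorFlow code: ToyMasterImpl, ToyMasterInterface, ToyLocalMaster, ToyRemoteMaster and ToyChannel are hypothetical names, and the "channel" is only a std::function standing in for a real gRPC channel.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Toy stand-in for the real Master class (hypothetical, not TensorFlow code).
class ToyMasterImpl {
 public:
  std::string CreateSession(const std::string& graph) {
    return "handle:" + graph;
  }
};

// Toy counterpart of MasterInterface: a single synchronous method.
class ToyMasterInterface {
 public:
  virtual ~ToyMasterInterface() = default;
  virtual std::string CreateSession(const std::string& graph) = 0;
};

// In-process variant: forwards straight to the implementation object.
class ToyLocalMaster : public ToyMasterInterface {
 public:
  explicit ToyLocalMaster(ToyMasterImpl* impl) : impl_(impl) {}
  std::string CreateSession(const std::string& graph) override {
    return impl_->CreateSession(graph);
  }

 private:
  ToyMasterImpl* impl_;  // Not owned.
};

// Assumption: a "channel" is just a function of (method name, payload).
// A real gRPC channel would serialize the request and send it over the wire.
using ToyChannel =
    std::function<std::string(const std::string&, const std::string&)>;

// Remote variant: goes through the channel instead of touching the impl.
class ToyRemoteMaster : public ToyMasterInterface {
 public:
  explicit ToyRemoteMaster(ToyChannel channel) : channel_(std::move(channel)) {}
  std::string CreateSession(const std::string& graph) override {
    return channel_("/tensorflow.MasterService/CreateSession", graph);
  }

 private:
  ToyChannel channel_;
};
```

A caller that only holds a ToyMasterInterface pointer cannot tell which variant it is using, which is the property the real MasterInterface gives to GrpcSession.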

2.3 Invocation

The pseudocode below shows how a client interacts with the master. In distributed mode, this is what happens when GrpcRemoteMaster talks to the remote MasterService over gRPC.

stub = NewStub("/job:mnist/replica:0/task:0")
{handle} = stub->CreateSession({graph_def})
  
do {
   stub->RunStep({handle, {feeds}, {fetches}})
   // The client can evaluate a predicate locally, based on the
   // result of fetches, to determine whether to terminate. For
   // example, it might fetch the loss and evaluate whether it is less
   // than some threshold.
} while (!should_stop({fetches}));

stub->CloseSession({handle})
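
The same loop can be written as compilable C++ against a fake stub. This is only a sketch: FakeMasterStub and TrainUntil are hypothetical names, the handle value is made up, and the "loss" simply halves on each step so that the stopping predicate has something to evaluate.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a MasterService stub; not the generated gRPC class.
class FakeMasterStub {
 public:
  std::string CreateSession(const std::string& graph_def) {
    (void)graph_def;  // A real master would build a MasterSession from it.
    open_ = true;
    return "session-0";
  }

  // Pretends to run one training step and returns the fetched loss,
  // which halves on every call.
  double RunStep(const std::string& handle) {
    assert(open_ && handle == "session-0");
    loss_ *= 0.5;
    ++steps_;
    return loss_;
  }

  void CloseSession(const std::string& handle) {
    assert(handle == "session-0");
    open_ = false;
  }

  int steps() const { return steps_; }
  bool open() const { return open_; }

 private:
  bool open_ = false;
  double loss_ = 1.0;
  int steps_ = 0;
};

// Drives CreateSession -> RunStep (repeated) -> CloseSession and returns the
// number of steps needed to push the loss below `threshold`.
int TrainUntil(FakeMasterStub& stub, double threshold) {
  const std::string handle = stub.CreateSession("graph_def");
  double loss = 0.0;
  do {
    loss = stub.RunStep(handle);  // Fetch the loss for this step.
  } while (loss >= threshold);    // Predicate evaluated on the client side.
  stub.CloseSession(handle);
  return stub.steps();
}
```

The point of the sketch is the division of labor: the stopping decision stays on the client, while every step is a separate request that carries the session handle.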

3. LocalMaster

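When the Client makes a call, GrpcSession first tries to obtain a local master through LocalMaster, and only falls back to GrpcRemoteMaster if none is found. In the local case the Client and the master do not cross a node boundary: LocalMaster lets them communicate directly inside the process, which gives an in-process Client a more efficient Master service.

3.1 Definition

LocalMaster is defined as follows. Its main member variable is master_impl_. LocalMaster is only a thin shell that forwards every call to master_impl_, the object that is invoked directly when the Client and the master are in the same process.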

class LocalMaster : public MasterInterface {
 private:
  Master* master_impl_;  // Not owned.
  const int64 default_timeout_in_ms_;

  // See LocalMaster::Lookup for the factory function that creates
  // objects of this type.
  LocalMaster(Master* master_impl, const int64 default_timeout_in_ms);

  TF_DISALLOW_COPY_AND_ASSIGN(LocalMaster);
};

3.2 Registration

Registration goes through a static variable, local_master_registry_, which is reached via the local_master_registry() function.

typedef std::unordered_map<string, MasterInfo> LocalMasterRegistry;

LocalMasterRegistry* local_master_registry() {
  static LocalMasterRegistry* local_master_registry_ = new LocalMasterRegistry;
  return local_master_registry_;
}

When GrpcServer initializes, it calls the following code to register the Master created for its target (a "grpc://" address) with LocalMaster.

LocalMaster::Register(target(), master_impl_.get(), config.operation_timeout_in_ms());

Register simply stores the master in the static local_master_registry_ map.

/* static */
void LocalMaster::Register(const string& target, Master* master,
                           int64 default_timeout_in_ms) {
  mutex_lock l(*get_local_master_registry_lock());
  local_master_registry()->insert(
      {target, MasterInfo(master, default_timeout_in_ms)});
}

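3.3 Lookup

When GrpcSession::Create is called and the Client and the Master are in the same process, Lookup finds the registered Master locally and returns a new LocalMaster whose master_impl_ points to that Master. If nothing is found, Lookup returns null and GrpcSession::Create creates a GrpcRemoteMaster instead, which talks to the remote Master.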

/* static */
std::unique_ptr<LocalMaster> LocalMaster::Lookup(const string& target) {
  std::unique_ptr<LocalMaster> ret;
  mutex_lock l(*get_local_master_registry_lock());
  auto iter = local_master_registry()->find(target);
  if (iter != local_master_registry()->end()) {
    ret.reset(new LocalMaster(iter->second.master,
                              iter->second.default_timeout_in_ms));
  }
  return ret;
}

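The figure below shows the same-process case, where Lookup succeeds and a LocalMaster is created for local operation.

Figure 2. Same-process master operation

Now consider the cross-process case. In process 1 no local Server was started, so nothing is registered and no LocalMaster can point to a Master: the Lookup in the first step of GrpcSession::Create fails and returns null, and GrpcSession::Create moves on to its second step, creating a GrpcRemoteMaster for remote interaction. In process 2 no client calls GrpcSession::Create, so no LocalMaster is created there either.

Figure 3. Cross-process master operation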
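
The register-then-lookup decision in both figures can be sketched in a few lines. This is a simplified illustration under assumptions: ToyMaster, ToyRegistry and ChooseTransport are hypothetical names, and the sketch returns a label instead of building a LocalMaster or GrpcRemoteMaster object.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the Master class.
struct ToyMaster {
  std::string name;
};

// A process-wide table from target string to master, guarded by a mutex.
class ToyRegistry {
 public:
  void Register(const std::string& target, ToyMaster* master) {
    std::lock_guard<std::mutex> lock(mu_);
    masters_.insert({target, master});  // insert() keeps an existing entry.
  }

  // Returns nullptr when nothing is registered for `target`.
  ToyMaster* Lookup(const std::string& target) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = masters_.find(target);
    return it == masters_.end() ? nullptr : it->second;
  }

 private:
  std::mutex mu_;
  std::unordered_map<std::string, ToyMaster*> masters_;  // Values not owned.
};

// Mirrors the decision in GrpcSession::Create: use the in-process master if
// one is registered for the target, otherwise fall back to gRPC.
std::string ChooseTransport(ToyRegistry& registry, const std::string& target) {
  return registry.Lookup(target) != nullptr ? "local" : "grpc";
}
```

With a server registered under "grpc://localhost:2222" (an arbitrary example address), a session created for that target stays in-process, while any other target takes the gRPC path.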

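3.4 Functionality

LocalMaster delegates to its member variable master_impl_ to do the actual work. Each method starts the asynchronous call on master_impl_, then blocks on a Notification until the callback fires or the timeout expires.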

Status LocalMaster::CreateSession(CallOptions* call_options,
                                  const CreateSessionRequest* request,
                                  CreateSessionResponse* response) {
  Notification n;
  Status ret;
  master_impl_->CreateSession(request, response, [&n, &ret](const Status& s) {
    ret.Update(s);
    n.Notify();
  });
  TF_RETURN_IF_ERROR(
      WaitForNotification(call_options, default_timeout_in_ms_, &n));
  return ret;
}

Status LocalMaster::ExtendSession(CallOptions* call_options,
                                  const ExtendSessionRequest* request,
                                  ExtendSessionResponse* response) {
  Notification n;
  Status ret;
  master_impl_->ExtendSession(request, response, [&n, &ret](const Status& s) {
    ret.Update(s);
    n.Notify();
  });
  TF_RETURN_IF_ERROR(
      WaitForNotification(call_options, default_timeout_in_ms_, &n));
  return ret;
}

Status LocalMaster::RunStep(CallOptions* call_options,
                            RunStepRequestWrapper* request,
                            MutableRunStepResponseWrapper* response) {
  Notification n;
  Status ret;
  master_impl_->RunStep(call_options, request, response,
                        [&n, &ret](const Status& s) {
                          ret.Update(s);
                          n.Notify();
                        });
  TF_RETURN_IF_ERROR(
      WaitForNotification(call_options, default_timeout_in_ms_, &n));
  return ret;
}
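
All three methods follow one pattern: start an asynchronous call, then block until its callback runs or a timeout expires. The sketch below shows that pattern with the standard library only. Notification here is a small hypothetical class written for this example (TensorFlow has its own), CallSync, Done and AsyncCall are made-up names, and the status is reduced to a string.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <string>

// Minimal one-shot notification, written for this sketch only.
class Notification {
 public:
  void Notify() {
    std::lock_guard<std::mutex> lock(mu_);
    notified_ = true;
    cv_.notify_all();
  }

  // Returns true if Notify() was called before the timeout elapsed.
  bool WaitFor(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mu_);
    return cv_.wait_for(lock, timeout, [this] { return notified_; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool notified_ = false;
};

using Done = std::function<void(const std::string&)>;
using AsyncCall = std::function<void(Done)>;

// Turns an asynchronous, callback-style call into a blocking one.
// Shared ownership keeps the state alive if the callback fires after a timeout.
std::string CallSync(const AsyncCall& call, std::chrono::milliseconds timeout) {
  auto n = std::make_shared<Notification>();
  auto status = std::make_shared<std::string>();
  call([n, status](const std::string& s) {
    *status = s;
    n->Notify();
  });
  if (!n->WaitFor(timeout)) return "DEADLINE_EXCEEDED";
  return *status;
}
```

One design difference is worth noting: the LocalMaster code above captures stack variables by reference, whereas this sketch uses shared_ptr so that a late callback cannot touch destroyed state.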

4. GrpcRemoteMaster

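GrpcRemoteMaster is a gRPC client implementation. It ultimately calls the GrpcMasterService on the remote Master through a Stub, so that the call behaves like a local function call. The remote GrpcMasterService implements every interface defined by the MasterService service and is the real entity behind it. When a GrpcRemoteMaster instance is created, the target specifies the address and port of the Master service, and a corresponding RPC channel is created. Strictly speaking, GrpcSession and GrpcRemoteMaster are both part of the Client implementation.

4.1 Definition

GrpcRemoteMaster is defined as follows. It is built around MasterServiceStub.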

// GrpcRemoteMaster is an implementation of the MasterInterface
// that uses gRPC to talk to the Master service.
class GrpcRemoteMaster : public MasterInterface {
  using MasterServiceStub = grpc::MasterService::Stub;

 public:
  explicit GrpcRemoteMaster(const SharedGrpcChannelPtr& client_channel)
      : stub_(grpc::MasterService::NewStub(client_channel)) {}

  ~GrpcRemoteMaster() override {}

  std::unique_ptr<MasterServiceStub> stub_;
};

4.2 Functionality

GrpcRemoteMaster does one simple thing: it calls the corresponding interface of the remote Master service through a gRPC stub.

4.2.1 CreateSession

Take CreateSession as an example. It delegates to CallWithRetry.

Status CreateSession(CallOptions* call_options,
                     const CreateSessionRequest* request,
                     CreateSessionResponse* response) override {
  return CallWithRetry(call_options, request, response,
                       &MasterServiceStub::CreateSession);
}

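CallWithRetry is shown below. It invokes the Stub method through the member-function pointer, s = FromGrpcStatus((stub_.get()->*pfunc)(&ctx, *request, response)), and retries with backoff only when the call fails with an Unavailable status, up to kMaxRetries times or until the timeout expires.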

template <typename Request, typename Response>
Status CallWithRetry(CallOptions* call_options, const Request* request,
                     Response* response,
                     ::grpc::Status (MasterServiceStub::*pfunc)(
                         ::grpc::ClientContext*, const Request&, Response*),
                     string trace_string = {}) {
  absl::Duration timeout = absl::Milliseconds(call_options->GetTimeout());
  absl::Time expired_time = absl::FromUnixMicros(Env::Default()->NowMicros());
  if (timeout > absl::ZeroDuration()) {
    expired_time += timeout;
  }
  Status s;
  for (int num_retries = 0;; ++num_retries) {
    ::grpc::ClientContext ctx;
    std::unique_ptr<profiler::TraceMe> trace;
    if (!trace_string.empty()) {
      trace.reset(NewTraceRpc(trace_string, &ctx));
    }
    ctx.set_fail_fast(false);
    if (timeout > absl::ZeroDuration()) {
      // We do not modify the timeout here to match legacy behavior. However,
      // this could violate the contract of tensorflow::Session. If we retry
      // an RPC just before the deadline is exceeded, we will still set the
      // timeout to the original value. This leads to the overall timeout
      // being double what was expected.
      ctx.set_deadline(absl::ToChronoTime(absl::Now() + timeout));
    }
    s = FromGrpcStatus((stub_.get()->*pfunc)(&ctx, *request, response));
    if (!errors::IsUnavailable(s)) {
      return s;
    }
    // TODO(b/117162170): we may want to make this configurable.
    constexpr int kMaxRetries = 10;
    if (num_retries >= kMaxRetries) {
      return s;
    }
    absl::Time now = absl::FromUnixMicros(Env::Default()->NowMicros());
    const absl::Time deadline_with_backoff =
        now + absl::Microseconds(ComputeBackoffMicroseconds(num_retries));
    // Wait for a short period of time before retrying the RPC.  If our
    // backoff would put us past the RPC deadline, we truncate it to ensure
    // our RPC starts before the deadline.
    const auto backoff_until = (timeout <= absl::ZeroDuration() ||
                                expired_time > deadline_with_backoff)
                                   ? deadline_with_backoff
                                   : expired_time;
    Env::Default()->SleepForMicroseconds(
        absl::ToInt64Microseconds(backoff_until - now));
    now = absl::FromUnixMicros(Env::Default()->NowMicros());
    if (now > expired_time && timeout > absl::ZeroDuration()) {
      // If timeout_in_ms is set, exit the retry loop on timeout.
      return errors::DeadlineExceeded(ctx.debug_error_string());
    }
  }
}
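
The control flow of the retry loop is easier to follow once gRPC and real time are stripped away. The sketch below is an approximation under assumptions: Code, FakeClock and RetryUnavailable are hypothetical names, BackoffMicros uses made-up constants (1 ms doubling up to 1 s) rather than those of ComputeBackoffMicroseconds, and the deadline check uses >= because the fake clock never overshoots the way a real sleep does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>

enum class Code { kOk, kUnavailable, kDeadlineExceeded, kInvalidArgument };

// Assumed backoff curve: 1 ms, doubling per retry, capped at 1 s.
int64_t BackoffMicros(int num_retries) {
  const int64_t kCap = 1000000;
  int64_t backoff = 1000;
  for (int i = 0; i < num_retries && backoff < kCap; ++i) backoff *= 2;
  return std::min(backoff, kCap);
}

// A fake clock, so the example needs no real sleeping.
struct FakeClock {
  int64_t now_micros = 0;
  void Sleep(int64_t micros) { now_micros += micros; }
};

// Retries `rpc` only while it reports kUnavailable. A timeout of 0 means
// "no deadline"; otherwise the backoff is truncated at the deadline.
Code RetryUnavailable(const std::function<Code()>& rpc, int64_t timeout_micros,
                      int max_retries, FakeClock& clock) {
  const int64_t expires = clock.now_micros + timeout_micros;
  for (int num_retries = 0;; ++num_retries) {
    const Code code = rpc();
    if (code != Code::kUnavailable) return code;  // Success or a hard error.
    if (num_retries >= max_retries) return code;  // Out of retries.
    int64_t wake = clock.now_micros + BackoffMicros(num_retries);
    if (timeout_micros > 0) wake = std::min(wake, expires);
    clock.Sleep(wake - clock.now_micros);
    if (timeout_micros > 0 && clock.now_micros >= expires) {
      return Code::kDeadlineExceeded;
    }
  }
}
```

Only the transient Unavailable status is retried; any other status, including other errors, is returned to the caller immediately.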

4.2.2 Master Service Stub

Next comes the Stub, which is implemented with gRPC based on "//tensorflow/core/protobuf/master_service.proto".

class Stub final : public StubInterface {
 public:
  Stub(const std::shared_ptr< ::grpc::ChannelInterface>& channel);
  ::grpc::Status CreateSession(::grpc::ClientContext* context,
                               const CreateSessionRequest& request,
                               CreateSessionResponse* response) override;
  ::grpc::Status ExtendSession(::grpc::ClientContext* context,
                               const ExtendSessionRequest& request,
                               ExtendSessionResponse* response) override;
  ::grpc::Status PartialRunSetup(::grpc::ClientContext* context,
                                 const PartialRunSetupRequest& request,
                                 PartialRunSetupResponse* response) override;
  ::grpc::Status RunStep(::grpc::ClientContext* context,
                         const RunStepRequest& request,
                         RunStepResponse* response) override;
  ::grpc::Status CloseSession(::grpc::ClientContext* context,
                              const CloseSessionRequest& request,
                              CloseSessionResponse* response) override;
  ::grpc::Status ListDevices(::grpc::ClientContext* context,
                             const ListDevicesRequest& request,
                             ListDevicesResponse* response) override;
  ::grpc::Status Reset(::grpc::ClientContext* context,
                       const ResetRequest& request,
                       ResetResponse* response) override;
  ::grpc::Status MakeCallable(::grpc::ClientContext* context,
                              const MakeCallableRequest& request,
                              MakeCallableResponse* response) override;
  ::grpc::Status RunCallable(::grpc::ClientContext* context,
                             const RunCallableRequest& request,
                             RunCallableResponse* response) override;
  ::grpc::Status ReleaseCallable(::grpc::ClientContext* context,
                                 const ReleaseCallableRequest& request,
                                 ReleaseCallableResponse* response) override;

 private:
  std::shared_ptr< ::grpc::ChannelInterface> channel_;
  const ::grpc::internal::RpcMethod rpcmethod_CreateSession_;
  const ::grpc::internal::RpcMethod rpcmethod_ExtendSession_;
  const ::grpc::internal::RpcMethod rpcmethod_PartialRunSetup_;
  const ::grpc::internal::RpcMethod rpcmethod_RunStep_;
  const ::grpc::internal::RpcMethod rpcmethod_CloseSession_;
  const ::grpc::internal::RpcMethod rpcmethod_ListDevices_;
  const ::grpc::internal::RpcMethod rpcmethod_Reset_;
  const ::grpc::internal::RpcMethod rpcmethod_MakeCallable_;
  const ::grpc::internal::RpcMethod rpcmethod_RunCallable_;
  const ::grpc::internal::RpcMethod rpcmethod_ReleaseCallable_;
};

The corresponding remote methods are:

static const char* grpcMasterService_method_names[] = {
    "/tensorflow.MasterService/CreateSession",
    "/tensorflow.MasterService/ExtendSession",
    "/tensorflow.MasterService/PartialRunSetup",
    "/tensorflow.MasterService/RunStep",
    "/tensorflow.MasterService/CloseSession",
    "/tensorflow.MasterService/ListDevices",
    "/tensorflow.MasterService/Reset",
    "/tensorflow.MasterService/MakeCallable",
    "/tensorflow.MasterService/RunCallable",
    "/tensorflow.MasterService/ReleaseCallable",
};

std::unique_ptr<MasterService::Stub> MasterService::NewStub(
    const std::shared_ptr< ::grpc::ChannelInterface>& channel,
    const ::grpc::StubOptions& options) {
  std::unique_ptr<MasterService::Stub> stub(new MasterService::Stub(channel));
  return stub;
}

Internally, the Stub uses gRPC to send the request.

::grpc::Status MasterService::Stub::CreateSession(
    ::grpc::ClientContext* context, const CreateSessionRequest& request,
    CreateSessionResponse* response) {
  return ::grpc::internal::BlockingUnaryCall(
      channel_.get(), rpcmethod_CreateSession_, context, request, response);
}

So with GrpcRemoteMaster the call flow is: GrpcRemoteMaster receives the request from GrpcSession and hands it to GrpcMasterService. The full path is GrpcSession -> GrpcRemoteMaster -> GrpcMasterService -> Master -> MasterSession.

4.3 Creation

When a GrpcSession is created, the Create method first looks for a registered Master. If one is found, it uses the LocalMaster returned by Lookup, as described earlier. If Lookup finds nothing, it calls NewGrpcMaster to create a GrpcRemoteMaster.

/* static */
Status GrpcSession::Create(const SessionOptions& options,
                           std::unique_ptr<GrpcSession>* out_session) {
  std::unique_ptr<GrpcSession> session(new GrpcSession(options));
  std::unique_ptr<MasterInterface> master;
  // For testing, we enable the client to disable the use of the local
  // master registry, so that the RPC stack is exercised.
  if (!options.config.rpc_options().use_rpc_for_inprocess_master()) {
    master = LocalMaster::Lookup(options.target); 
  }
  if (!master) {
    SharedGrpcChannelPtr master_channel;
    TF_RETURN_IF_ERROR(
        NewHostPortGrpcChannel(options.target.substr(kSchemePrefixLength),
                               &options.config.rpc_options(), &master_channel));
    // Create a GrpcRemoteMaster to talk to the remote Master.
    master.reset(NewGrpcMaster(master_channel));
  } else {
    session->is_local_ = true;
  }
  session->SetRemoteMaster(std::move(master));
  *out_session = std::move(session);
  return Status::OK();
}

NewGrpcMaster is defined as follows:

MasterInterface* NewGrpcMaster(const SharedGrpcChannelPtr& channel) {
  return new GrpcRemoteMaster(channel);
}

5. GrpcMasterService

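GrpcMasterService implements the RPC-facing MasterService. GrpcMasterService:

Knows in advance which local devices are available to clients, and also discovers remote devices and tracks their statistics.
Maintains and manages live graph-computation sessions (MasterSession), which use local or remote devices to compute the graphs they receive.
Each session analyzes and prunes the graph it receives, places the nodes on available devices, and runs the graph on the workers by calling RunGraph.

5.1 Creation

In GrpcServer, master_service_ holds the GrpcMasterService instance.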

  // Create the Master and the corresponding GrpcMasterService.
  master_impl_ = CreateMaster(&master_env_);
  master_service_ = NewGrpcMasterService(master_impl_.get(), config, &builder);

GrpcServer runs GrpcMasterService's HandleRPCsLoop method on the master_thread_ thread.

master_thread_.reset(
    env_->StartThread(ThreadOptions(), "TF_master_service",
                      [this] { master_service_->HandleRPCsLoop(); }));

5.2 Definition

GrpcMasterService is defined as follows. master_impl_ is the master pointer passed in by the Server and points to an instance of the Master class:

class GrpcMasterService : public AsyncServiceInterface {
  Master* master_impl_ = nullptr;  // Not owned.
  std::unique_ptr<::grpc::ServerCompletionQueue> cq_;
  grpc::MasterService::AsyncService master_service_;

  mutex mu_;
  bool is_shutdown_ TF_GUARDED_BY(mu_);
  const ConfigProto default_session_config_;
  ::grpc::Alarm* shutdown_alarm_ = nullptr;

  template <class RequestMessage, class ResponseMessage>
  using MasterCall = Call<GrpcMasterService, grpc::MasterService::AsyncService,
                          RequestMessage, ResponseMessage>;
};

When GrpcMasterService is constructed, it registers itself with the server builder and obtains its gRPC completion queue, cq_.

GrpcMasterService(Master* master, const ConfigProto& default_session_config,
                  ::grpc::ServerBuilder* builder)
    : master_impl_(master),
      is_shutdown_(false),
      default_session_config_(default_session_config) {
  builder->RegisterService(&master_service_);
  cq_ = builder->AddCompletionQueue();
}

5.3 Main loop

As mentioned above, the master_thread_ thread runs GrpcMasterService's HandleRPCsLoop method, which dispatches incoming RPC messages to the handler functions inside GrpcMasterService. The main loop looks like this:

void HandleRPCsLoop() override {
  ENQUEUE_REQUEST(CreateSession, true);
  ENQUEUE_REQUEST(ExtendSession, false);
  for (int i = 0; i < 100; ++i) {
    ENQUEUE_REQUEST(PartialRunSetup, false);
    ENQUEUE_REQUEST(RunStep, true);
  }
  ENQUEUE_REQUEST(CloseSession, false);
  ENQUEUE_REQUEST(ListDevices, false);
  ENQUEUE_REQUEST(Reset, false);
  ENQUEUE_REQUEST(MakeCallable, false);
  for (int i = 0; i < 100; ++i) {
    ENQUEUE_REQUEST(RunCallable, true);
  }
  ENQUEUE_REQUEST(ReleaseCallable, false);

  void* tag;
  bool ok;
  while (cq_->Next(&tag, &ok)) {
    UntypedCall<GrpcMasterService>::Tag* callback_tag =
        static_cast<UntypedCall<GrpcMasterService>::Tag*>(tag);
    if (callback_tag) {
      callback_tag->OnCompleted(this, ok);
    } else {
      // NOTE(mrry): A null callback_tag indicates that this is
      // the shutdown alarm.
      cq_->Shutdown();
    }
  }
}

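The code above follows a few good practices built around ENQUEUE_REQUEST:

this->cq_ is the gRPC completion queue.
The ENQUEUE_REQUEST macro creates a new pending request for the given RPC method name (for example, ENQUEUE_REQUEST(CloseSession, false) creates a pending CloseSession request), and these requests are queued on this->cq_.
A number of pending requests are placed on cq_ in advance. Whenever a handler picks one up, it calls ENQUEUE_REQUEST() to add another request of the same kind, so cq_ always has enough pending requests to accept incoming calls. Incoming calls therefore do not have to wait, which improves overall throughput.
The while loop at the end reads events from the gRPC completion queue and calls OnCompleted on each tag, which drives the next stage of the corresponding call.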

#define ENQUEUE_REQUEST(method, supports_cancel)                              \
  do {                                                                        \
    mutex_lock l(mu_);                                                        \
    if (!is_shutdown_) {                                                      \
      Call<GrpcMasterService, grpc::MasterService::AsyncService,              \
           method##Request, method##Response>::                               \
          EnqueueRequest(&master_service_, cq_.get(),                         \
                         &grpc::MasterService::AsyncService::Request##method, \
                         &GrpcMasterService::method##Handler,                 \
                         (supports_cancel));                                  \
    }                                                                         \
  } while (0)
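
The "arm in advance, re-arm in the handler" pattern can be modeled without gRPC. The sketch below is a toy under assumptions: ToyAsyncService is a hypothetical class, a counter per method name stands in for the pending requests on the completion queue, and Deliver simulates an incoming RPC.

```cpp
#include <cassert>
#include <map>
#include <string>

class ToyAsyncService {
 public:
  // Counterpart of ENQUEUE_REQUEST: make one more pending slot available.
  void Enqueue(const std::string& method) { ++pending_[method]; }

  // Simulates an incoming RPC. Returns false if no slot was waiting for it.
  bool Deliver(const std::string& method) {
    auto it = pending_.find(method);
    if (it == pending_.end() || it->second == 0) return false;
    --it->second;    // The incoming call consumes one pending slot.
    Handle(method);  // The handler runs and re-arms the slot.
    return true;
  }

  int pending(const std::string& method) const {
    auto it = pending_.find(method);
    return it == pending_.end() ? 0 : it->second;
  }
  int handled() const { return handled_; }

 private:
  void Handle(const std::string& method) {
    ++handled_;
    Enqueue(method);  // Same move as the ENQUEUE_REQUEST call in each handler.
  }

  std::map<std::string, int> pending_;
  int handled_ = 0;
};
```

Because every handler puts back exactly what it consumed, the number of pending slots per method stays at its initial value no matter how many calls arrive, which matches the intent of pre-enqueuing 100 RunStep requests.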

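5.4 Message handling

Each handler calls master_impl_ to do the work. When the Master finishes, it invokes the lambda passed in as a callback, which sends the response message back to the Client; in the example below the Client eventually receives a CreateSessionResponse. The handler ends by calling ENQUEUE_REQUEST to queue another request of the same type.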

// RPC handler for creating a session.
void CreateSessionHandler(
    MasterCall<CreateSessionRequest, CreateSessionResponse>* call) {
  CreateSessionRequest* rewritten_req = new CreateSessionRequest;
  rewritten_req->mutable_config()->MergeFrom(default_session_config_);
  rewritten_req->MergeFrom(call->request);
  master_impl_->CreateSession(rewritten_req, &call->response,
                              [call, rewritten_req](const Status& status) {
                                call->SendResponse(ToGrpcStatus(status));
                                delete rewritten_req;
                              });
  ENQUEUE_REQUEST(CreateSession, true);
}

5.5 Functionality

GrpcMasterService exposes the following API:

static const char* grpcMasterService_method_names[] = {
    "/tensorflow.MasterService/CreateSession",
    "/tensorflow.MasterService/ExtendSession",
    "/tensorflow.MasterService/PartialRunSetup",
    "/tensorflow.MasterService/RunStep",
    "/tensorflow.MasterService/CloseSession",
    "/tensorflow.MasterService/ListDevices",
    "/tensorflow.MasterService/Reset",
    "/tensorflow.MasterService/MakeCallable",
    "/tensorflow.MasterService/RunCallable",
    "/tensorflow.MasterService/ReleaseCallable",
};

Let us look at three of them in detail:

5.5.1 CreateSession

The CreateSessionRequest message carries the computation graph and the configuration set by the Client. On receiving it, the Master creates a MasterSession instance for this Client and a session_handle that uniquely identifies that instance. The mapping is kept in the Master member variable std::unordered_map<string, MasterSession*> sessions_, so session_handle is a string.

The Master returns a CreateSessionResponse message to the Client. CreateSessionResponse carries:

session_handle. The Client's GrpcSession uses it to associate itself with the MasterSession on the Master side. Every later request from the Client carries this session_handle, and the Master uses it to find the matching MasterSession instance in std::unordered_map<string, MasterSession*> sessions_.
The initial graph_version. It is used by later ExtendSession calls, which append new nodes to the original graph.

Figure 4. CreateSession

The handler code is:

// RPC handler for creating a session.
void CreateSessionHandler(
    MasterCall<CreateSessionRequest, CreateSessionResponse>* call) {
  CreateSessionRequest* rewritten_req = new CreateSessionRequest;
  rewritten_req->mutable_config()->MergeFrom(default_session_config_);
  rewritten_req->MergeFrom(call->request);
  master_impl_->CreateSession(rewritten_req, &call->response,
                              [call, rewritten_req](const Status& status) {
                                call->SendResponse(ToGrpcStatus(status));
                                delete rewritten_req;
                              });
  ENQUEUE_REQUEST(CreateSession, true);
}

5.5.2 ExtendSession

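Once a session exists, the Client can use ExtendSession to tell the Master that it wants to extend the existing graph (subgraphs can only be appended; nothing can be modified or deleted).

The ExtendSessionRequest message contains:

session_handle : identifies the MasterSession instance to extend;
graph_def : the nodes to add to the graph;
current_graph_version : the version number of the graph being extended;

The ExtendSessionResponse message returns new_graph_version, which is used for the next ExtendSession call.

Figure 5. ExtendSession

The code is: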

// RPC handler for extending a session.
void ExtendSessionHandler(
    MasterCall<ExtendSessionRequest, ExtendSessionResponse>* call) {
  master_impl_->ExtendSession(&call->request, &call->response,
                              [call](const Status& status) {
                                call->SendResponse(ToGrpcStatus(status));
                              });
  ENQUEUE_REQUEST(ExtendSession, false);
}
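
The interplay of session_handle and the graph version across CreateSession and ExtendSession can be sketched as a small table. This is an illustration, not the Master implementation: ToySession and ToySessionTable are hypothetical names, the handle format is invented, nodes are plain strings, and rejecting a stale current_graph_version is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct ToySession {
  std::vector<std::string> nodes;
  int64_t graph_version = 0;
};

class ToySessionTable {
 public:
  // Creates a session for the initial graph and reports its handle and version.
  std::string CreateSession(std::vector<std::string> nodes,
                            int64_t* graph_version) {
    const std::string handle = "session_" + std::to_string(next_id_++);
    ToySession& session = sessions_[handle];
    session.nodes = std::move(nodes);
    *graph_version = session.graph_version;
    return handle;
  }

  // Appends nodes. Fails for an unknown handle or a stale version.
  bool ExtendSession(const std::string& handle, int64_t current_graph_version,
                     const std::vector<std::string>& new_nodes,
                     int64_t* new_graph_version) {
    auto it = sessions_.find(handle);
    if (it == sessions_.end()) return false;
    ToySession& session = it->second;
    if (session.graph_version != current_graph_version) return false;
    session.nodes.insert(session.nodes.end(), new_nodes.begin(),
                         new_nodes.end());
    *new_graph_version = ++session.graph_version;
    return true;
  }

  std::size_t NodeCount(const std::string& handle) const {
    auto it = sessions_.find(handle);
    return it == sessions_.end() ? 0 : it->second.nodes.size();
  }

 private:
  std::unordered_map<std::string, ToySession> sessions_;
  int next_id_ = 0;
};
```

The version number lets the master detect a client whose view of the graph is out of date: an extension is applied only if the client names the version the master currently holds.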

5.5.3 RunStep

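The client calls RunStep repeatedly. The RunStepRequest message has many fields, for example:

session_handle : identifies the MasterSession instance to run;
feed : the list of input NamedTensor values;
fetch : the list of names of the tensors to return;
target : the list of nodes to execute without fetching their outputs;

The RunStepResponse message mainly carries:

tensor : the list of output tensors;

Figure 6. RunStep

The messages are defined as follows: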

message RunStepRequest {
  // REQUIRED: session_handle must be returned by a CreateSession call
  // to the same master service.
  string session_handle = 1;

  // Tensors to be fed in the step. Each feed is a named tensor.
  repeated NamedTensorProto feed = 2;

  // Fetches. A list of tensor names. The caller expects a tensor to
  // be returned for each fetch[i] (see RunStepResponse.tensor). The
  // order of specified fetches does not change the execution order.
  repeated string fetch = 3;

  // Target Nodes. A list of node names. The named nodes will be run
  // to but their outputs will not be fetched.
  repeated string target = 4;

  // Options for the run call.
  RunOptions options = 5;

  // Partial run handle (optional). If specified, this will be a partial run
  // execution, run up to the specified fetches.
  string partial_run_handle = 6;

  // If true then some errors, e.g., execution errors that have long
  // error messages, may return an OK RunStepResponse with the actual
  // error saved in the status_code/status_error_message fields of the
  // response body. This is a workaround since the RPC subsystem may
  // truncate long metadata messages.
  bool store_errors_in_response_body = 7;

  // Unique identifier for this request. Every RunStepRequest must
  // have a unique request_id, and retried RunStepRequest must have
  // the same request_id. If request_id is zero, retry detection is disabled.
  int64 request_id = 8;
}

message RunStepResponse {
  // NOTE: The order of the returned tensors may or may not match
  // the fetch order specified in RunStepRequest.
  repeated NamedTensorProto tensor = 1;

  // Returned metadata if requested in the options.
  RunMetadata metadata = 2;

  // If store_errors_in_response_body is true in the request, then
  // optionally the server may return an OK status for the RPC and
  // fill the true status into the fields below, to allow for messages
  // that are too long to fit in metadata.
  error.Code status_code = 3;
  string status_error_message = 4;
}

The handler code is:

// RPC handler for running one step in a session.
void RunStepHandler(MasterCall<RunStepRequest, RunStepResponse>* call) {
  auto* trace = TraceRpc("RunStep/Server", call->client_metadata());
  CallOptions* call_opts = new CallOptions;
  if (call->request.options().timeout_in_ms() > 0) {
    call_opts->SetTimeout(call->request.options().timeout_in_ms());
  } else {
    call_opts->SetTimeout(default_session_config_.operation_timeout_in_ms());
  }
  RunStepRequestWrapper* wrapped_request =
      new ProtoRunStepRequest(&call->request);
  MutableRunStepResponseWrapper* wrapped_response =
      new NonOwnedProtoRunStepResponse(&call->response);
  call->SetCancelCallback([call_opts]() { call_opts->StartCancel(); });
  master_impl_->RunStep(
      call_opts, wrapped_request, wrapped_response,
      [call, call_opts, wrapped_request, trace](const Status& status) {
        call->ClearCancelCallback();
        delete call_opts;
        delete wrapped_request;
        delete trace;
        if (call->request.store_errors_in_response_body() && !status.ok()) {
          call->response.set_status_code(status.code());
          call->response.set_status_error_message(status.error_message());
          call->SendResponse(ToGrpcStatus(Status::OK()));
        } else {
          call->SendResponse(ToGrpcStatus(status));
        }
      });
  ENQUEUE_REQUEST(RunStep, true);
}
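
The branch on store_errors_in_response_body is the subtle part of this handler: a failed step can still travel as a successful RPC, with the real error placed in the response body. A minimal sketch of just that decision follows. ToyStatus, ToyResponse and FinishRunStep are hypothetical names, and status codes are plain integers with 0 meaning OK.

```cpp
#include <cassert>
#include <string>

// Simplified status: code 0 means OK.
struct ToyStatus {
  int code = 0;
  std::string message;
  bool ok() const { return code == 0; }
};

// Only the two error fields of RunStepResponse are modeled here.
struct ToyResponse {
  int status_code = 0;
  std::string status_error_message;
};

// Returns the status to send at the RPC layer. When the client asked for
// errors in the body, a failure is moved into `response` and the RPC itself
// reports OK; otherwise the failure is returned as the RPC status.
ToyStatus FinishRunStep(const ToyStatus& status,
                        bool store_errors_in_response_body,
                        ToyResponse* response) {
  if (store_errors_in_response_body && !status.ok()) {
    response->status_code = status.code;
    response->status_error_message = status.message;
    return ToyStatus{};
  }
  return status;
}
```

A client that sets the flag therefore has to inspect the response body even when the RPC status is OK.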

6. Business logic: the Master class

6.1 Creation

As mentioned earlier, GrpcServer creates an instance of the Master class.

std::unique_ptr<Master> GrpcServer::CreateMaster(MasterEnv* master_env) {
  return std::unique_ptr<Master>(new Master(master_env, 0.0));
}

So when a message arrives from the Client, the GrpcMasterService thread calls master_impl_ inside the handler, delegating the business logic to the Master class. Next we look at how Master handles it.

// RPC handler for creating a session.
void CreateSessionHandler(
    MasterCall<CreateSessionRequest, CreateSessionResponse>* call) {
  CreateSessionRequest* rewritten_req = new CreateSessionRequest;
  rewritten_req->mutable_config()->MergeFrom(default_session_config_);
  rewritten_req->MergeFrom(call->request);
  master_impl_->CreateSession(rewritten_req, &call->response,
                              [call, rewritten_req](const Status& status) {
                                call->SendResponse(ToGrpcStatus(status));
                                delete rewritten_req;
                              });
  ENQUEUE_REQUEST(CreateSession, true);
}

6.2 Definition

Master is not a subclass of MasterInterface. Its implementation is in tensorflow/core/distributed_runtime/master.cc. As the member variable sessions_ shows, its main job is to manage MasterSession instances.

class Master {

 private:
  typedef Master ME;

  // Not owned.
  MasterEnv* env_ = nullptr;

  // Owned.
  mutex mu_;

  // shutdown_ is set to true by the dtor.
  condition_variable shutdown_cv_;
  bool shutdown_ TF_GUARDED_BY(mu_) = false;
  Thread* gc_thread_;

  // Maps session handles to sessions.
  std::unordered_map<string, MasterSession*> sessions_ TF_GUARDED_BY(mu_);

  // Moving average of step times.
  MovingAverage last_1000_steps_ TF_GUARDED_BY(mu_);

  // Cumulative number of steps executed.
  int64 step_count_ TF_GUARDED_BY(mu_);

  // If a session is not active for this many seconds, it will be
  // closed automatically.
  const double session_gc_seconds_;

  // Used to track ids for incoming requests so we can detect duplicates.
  RecentRequestIds recent_request_ids_;
};
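
The recent_request_ids_ member exists to detect duplicate requests, which matters because GrpcRemoteMaster may retry an RPC. One plausible way to build such a tracker is a bounded set backed by a ring buffer. The sketch below is an assumption about the general technique, not a copy of TensorFlow's RecentRequestIds: ToyRecentRequestIds and Track are hypothetical names. Treating id 0 as "detection disabled" follows the request_id comment in RunStepRequest.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Remembers the last `capacity` request ids (capacity must be at least 1).
class ToyRecentRequestIds {
 public:
  explicit ToyRecentRequestIds(std::size_t capacity) : ring_(capacity, 0) {}

  // Returns false if `request_id` was seen recently, true otherwise.
  // An id of 0 is never tracked, so it always returns true.
  bool Track(int64_t request_id) {
    if (request_id == 0) return true;
    if (!seen_.insert(request_id).second) return false;  // Duplicate.
    // Evict the oldest id. An unused slot holds 0, which is never in the set.
    seen_.erase(ring_[next_]);
    ring_[next_] = request_id;
    next_ = (next_ + 1) % ring_.size();
    return true;
  }

 private:
  std::vector<int64_t> ring_;         // Ids in arrival order, overwritten.
  std::unordered_set<int64_t> seen_;  // Fast membership test.
  std::size_t next_ = 0;
};
```

The bound keeps memory constant: a retried request is caught as long as fewer than `capacity` other requests arrived in between.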

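6.3 Functionality

Let us recall what was covered earlier.

Distributed execution is all about operating on the computation graph, and that work is split across three roles: Client, Master and Worker.

The Client builds the graph and the Workers run the actual computation, but how does a Worker know what to compute? TensorFlow inserts the Master role between the two to coordinate and schedule.

Although Master is not a subclass of MasterInterface, it implements the actual business logic of MasterService. Master is responsible for the following: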

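The Master knows in advance which local devices are available to clients, and it also discovers remote devices and tracks their statistics.
A Master contains multiple "master sessions". Each master session encapsulates a computation graph and its associated state.
A master session:
- Simplifies and optimizes the graph, for example by pruning, partitioning, and inserting send and receive operators.
- Coordinates and schedules resources, deciding which computation runs on which device. Concretely, it follows the graph -> Partition -> Device strategy to place subgraphs on hardware devices.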
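- Sends the partitioned subgraphs to the workers and finally drives the graph computation by starting RunGraph on the workers. All subgraphs cut from one client graph are managed by the same MasterSession.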
The Master maintains the state of live graph-computation sessions.
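
The graph -> Partition -> Device step can be illustrated with a toy graph. The sketch below is a heavy simplification: ToyNode, PartitionByDevice and CountCrossDeviceEdges are hypothetical names, devices are already assigned to nodes, and every cross-device edge is counted once as a place where a send/receive pair would be needed, whereas a real partitioner can share transfers.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct ToyNode {
  std::string name;
  std::string device;               // Device this node was placed on.
  std::vector<std::size_t> inputs;  // Indices of the producer nodes.
};

// Groups node names by device: one subgraph per device.
std::map<std::string, std::vector<std::string>> PartitionByDevice(
    const std::vector<ToyNode>& graph) {
  std::map<std::string, std::vector<std::string>> partitions;
  for (const ToyNode& node : graph) {
    partitions[node.device].push_back(node.name);
  }
  return partitions;
}

// Counts edges whose producer and consumer live on different devices.
int CountCrossDeviceEdges(const std::vector<ToyNode>& graph) {
  int count = 0;
  for (const ToyNode& node : graph) {
    for (std::size_t input : node.inputs) {
      if (graph[input].device != node.device) ++count;
    }
  }
  return count;
}
```

For a three-node graph where a and b sit on one worker and c, which consumes both, sits on another, this yields two partitions and two cross-device edges.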

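This completes the tour of the Master's static structure. The Master's behavior is covered in detail in the later article on Sessions.

Finally, two authors are strongly recommended:

[TensorFlow Internals] (https://github.com/horance-liu/tensorflow-internals). Although it does not analyze the latest code, anyone interested in TF internals should read it; it is well worth the time.
https://home.cnblogs.com/u/deep-learning-stacks/ 西門宇少. Beyond TensorFlow, his WeChat official account covers many other areas at the industry frontier.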