[Source Code Analysis] TensorFlow Distributed: ParameterServerStrategy V2

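For ParameterServerStrategy V2, we look at several aspects: how it connects to the cluster, how variables are created, how data is obtained, and how it runs. Variables and scopes were covered in earlier articles, and running was also described for MirroredStrategy, so this article focuses on how to use and how to initialize the strategy. The next article focuses on how computation is dispatched.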

Two GitHub repositories worth recommending, both excellent learning material:

https://github.com/yuhuiaws/ML-study

https://github.com/Jack47/hack-SysML

Also recommended is 西門宇少's recent article on token-level pipeline parallelism for Transformer LMs: TeraPipe.

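Other articles in this series:

[Translation] TensorFlow Distributed, papers: "TensorFlow : Large-Scale Machine Learning on Heterogeneous Distributed Systems"

[Translation] TensorFlow Distributed, papers: "Implementation of Control Flow in TensorFlow"

[Source Code Analysis] TensorFlow Distributed Environment (1) --- Overall architecture

[Source Code Analysis] TensorFlow Distributed Environment (2) --- Master static logic

[Source Code Analysis] TensorFlow Distributed Environment (3) --- Worker static logic

[Source Code Analysis] TensorFlow Distributed Environment (4) --- WorkerCache

[Source Code Analysis] TensorFlow Distributed Environment (5) --- Session

[Source Code Analysis] TensorFlow Distributed Environment (7) --- Worker dynamic logic

[Source Code Analysis] TensorFlow Distributed Environment (8) --- Communication mechanism

[Translation] Distributed training with TensorFlow

[Source Code Analysis] TensorFlow Distributed: DistributedStrategy basics

[Source Code Analysis] TensorFlow distributed variables

[Source Code Analysis] TensorFlow Distributed: MirroredStrategy

[Source Code Analysis] TensorFlow Distributed: MirroredStrategy computation dispatch

[Source Code Analysis] TensorFlow Distributed: ParameterServerStrategy V1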

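1. How to use

In TensorFlow 2, parameter server training is supported by the tf.distribute.experimental.ParameterServerStrategy class, which distributes the training steps to a cluster that scales up to thousands of workers (accompanied by parameter servers).

1.1 Training methods

Two main training methods are supported:

The Keras Model.fit API. Recommended if you prefer a high-level abstraction for training.
A custom training loop. Consider this if you need to implement or define the training details yourself.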

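1.2 Cluster

Whichever API you choose (Model.fit or a custom training loop), distributed training in TensorFlow 2 involves the following concepts: a "cluster" has several "jobs", and each job may contain one or more "tasks". For parameter server training, the recommended configuration is:

One coordinator job (job name chief).
Multiple worker jobs (job name worker).
Multiple parameter server jobs (job name ps).

The coordinator creates resources, dispatches training tasks, writes checkpoints, and handles task failures. Workers and parameter servers run tf.distribute.Server instances that listen for requests from the coordinator.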

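1.3 Training with the Model.fit API

With the Model.fit API, parameter server training requires the coordinator to use a tf.distribute.experimental.ParameterServerStrategy object, with a tf.keras.utils.experimental.DatasetCreator as the input. As with other strategies, the workflow is: create and compile the model, prepare the callbacks, then call Model.fit.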

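1.4 Training with a custom loop

TensorFlow 2 recommends a central-coordination architecture for parameter server training. Each worker and parameter server runs a tf.distribute.Server, and on top of that a coordinator task creates resources on the workers and parameter servers, dispatches functions, and coordinates training. The coordinator uses tf.distribute.experimental.coordinator.ClusterCoordinator to coordinate the cluster and tf.distribute.experimental.ParameterServerStrategy to define the variables on parameter servers and the computation on workers. In a custom training loop, the tf.distribute.experimental.coordinator.ClusterCoordinator class is the key component used by the coordinator.

The ClusterCoordinator class needs to work together with a tf.distribute.Strategy object.
For parameter server training, ClusterCoordinator needs to work with a tf.distribute.experimental.ParameterServerStrategy.
This tf.distribute.Strategy object needs the cluster information from the user and uses it to define the training step. The ClusterCoordinator object then dispatches the execution of these training steps to remote workers.

The most important API provided by ClusterCoordinator is schedule.

The schedule API enqueues a tf.function and immediately returns a future-like RemoteValue.
The queued functions are dispatched to remote workers in background threads, and their RemoteValue objects are filled in asynchronously.
Because schedule does not require a worker assignment, the tf.function passed in can be executed on any available worker.
If the worker executing it becomes unavailable before it finishes, the tf.function is retried on another available worker.
Because of this retry behavior, and because function execution is not atomic, a function may be executed more than once.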
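
The schedule semantics listed above can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions, not TensorFlow code: ToyCoordinator, flaky_step, and the use of ConnectionError to simulate an unavailable worker are hypothetical names chosen for illustration, and concurrent.futures.Future stands in for RemoteValue.

```python
import queue
import threading
from concurrent.futures import Future


class ToyCoordinator:
    """Toy model of the schedule/join pattern (hypothetical, not TensorFlow)."""

    def __init__(self, num_workers):
        self._queue = queue.Queue()
        for _ in range(num_workers):
            threading.Thread(target=self._worker_loop, daemon=True).start()

    def schedule(self, fn, args=()):
        # Enqueue the function and return a future-like value immediately.
        remote_value = Future()
        self._queue.put((fn, args, remote_value))
        return remote_value

    def _worker_loop(self):
        while True:
            fn, args, remote_value = self._queue.get()
            try:
                remote_value.set_result(fn(*args))
            except ConnectionError:
                # Simulated worker failure: requeue so any worker can retry.
                self._queue.put((fn, args, remote_value))
            except Exception as error:
                remote_value.set_exception(error)
            finally:
                self._queue.task_done()

    def join(self):
        # Block until every scheduled function, including retries, is done.
        self._queue.join()


attempts = []


def flaky_step(x):
    attempts.append(x)
    if len(attempts) == 1:
        raise ConnectionError("simulated worker failure")
    return x * 2


coordinator = ToyCoordinator(num_workers=2)
value = coordinator.schedule(flaky_step, args=(21,))
coordinator.join()
result = value.result()
```

In this toy run flaky_step is invoked twice for a single schedule call, which mirrors the note that a scheduled function may be executed more than once.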

Besides dispatching remote functions, ClusterCoordinator also helps create datasets on all workers and rebuild those datasets when a worker recovers from a failure.

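1.5 Setting up the cluster

As described above, a parameter server training cluster needs a coordinator task that runs your training program, one or several worker and parameter server tasks that run TensorFlow servers (tf.distribute.Server), and possibly an additional evaluation task that runs side-car evaluation. The requirements for setting them up are:

The coordinator task needs to know the addresses and ports of all other TensorFlow servers (except the evaluator).
Workers and parameter servers need to know which port to listen on. For simplicity, you can usually pass in the complete cluster information when creating TensorFlow servers on these tasks.
The evaluator task does not need to know the setup of the training cluster, and it should not try to connect to the training cluster.
Workers and parameter servers should have the task types "worker" and "ps" respectively. For legacy reasons, the coordinator should use "chief" as its task type.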

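2. Initialization

2.1 Example

The following shows how to initialize ParameterServerStrategy; this step is required whether you use Model.fit or a custom loop. To train with GPUs, allocate GPUs visible to each worker. ParameterServerStrategy uses all the available GPUs on each worker, with the restriction that all workers must have the same number of GPUs available.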

variable_partitioner = (
    tf.distribute.experimental.partitioners.MinSizePartitioner(
        min_shard_bytes=(256 << 10),
        max_shards=NUM_PS))

strategy = tf.distribute.experimental.ParameterServerStrategy(
    cluster_resolver,
    variable_partitioner=variable_partitioner)

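variable_partitioner is a distribute.experimental.partitioners.Partitioner that specifies how variables are partitioned. If it is None, variables are not partitioned. Its characteristics are:

Its value is one of the predefined partitioners in tf.distribute.experimental.partitioners. A commonly used one is MinSizePartitioner(min_shard_bytes = 256 << 10, max_shards = num_ps), which allocates at least 256K per shard, with each ps getting at most one shard.
variable_partitioner is called for every variable created under the strategy scope to indicate how that variable should be partitioned. A variable that has only one partition along the partitioning axis (that is, it needs no partitioning) is created as a normal tf.Variable.
Only partitioning along the first/outermost axis is supported.
The div partition strategy is used to partition variables. Assuming consecutive integer ids are assigned along the first axis of a variable, the ids are assigned to shards contiguously while trying to keep each shard the same size. If the ids do not divide evenly among the shards, each of the first few shards gets one extra id. For example, a variable whose first dimension is 13 has 13 ids, and splitting them into 5 shards gives: [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10], [11, 12]].
Variables created under strategy.extended.colocate_vars_with are not partitioned.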
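
The div partition rule described above can be reproduced with a few lines of plain Python. This is a minimal sketch for illustration only; div_partition is a hypothetical helper name, not a TensorFlow API.

```python
def div_partition(num_ids, num_shards):
    """Assign contiguous ids to shards; the first shards absorb the remainder."""
    base, extra = divmod(num_ids, num_shards)
    shards = []
    start = 0
    for shard_index in range(num_shards):
        # The first `extra` shards each receive one additional id.
        size = base + (1 if shard_index < extra else 0)
        shards.append(list(range(start, start + size)))
        start += size
    return shards


shards = div_partition(13, 5)
print(shards)
```

For 13 ids and 5 shards, the first three shards hold three ids each and the last two hold two each, matching the example in the text.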

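2.2 Cluster setup

In a real production environment, you run the training tasks in different processes on different machines. The simplest way to configure the cluster information on each task is to set the "TF_CONFIG" environment variable and use tf.distribute.cluster_resolver.TFConfigClusterResolver to parse "TF_CONFIG". If you start your training tasks with Kubernetes or other configuration templates, those templates have very likely already set "TF_CONFIG".

2.2.1 Setting the "TF_CONFIG" environment variable

Suppose you have 3 workers and 2 parameter servers. The "TF_CONFIG" of worker 1 can then look like this: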

import json
import os

os.environ["TF_CONFIG"] = json.dumps({
   "cluster": {
       "worker": ["host1:port","host2:port","host3:port"],
       "ps": ["host4:port","host5:port"],
       "chief": ["host6:port"]
    },
   "task": {"type":"worker","index": 1}
})
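
To make the structure of "TF_CONFIG" concrete, the following plain-Python sketch reads such a value back and extracts what a task needs to know about itself. It is only an illustration: describe_task is a hypothetical helper, the port 2222 is made up, and in real code TFConfigClusterResolver does this parsing.

```python
import json

# Hypothetical cluster with 3 workers, 2 parameter servers and 1 chief.
tf_config_json = json.dumps({
    "cluster": {
        "worker": ["host1:2222", "host2:2222", "host3:2222"],
        "ps": ["host4:2222", "host5:2222"],
        "chief": ["host6:2222"],
    },
    "task": {"type": "worker", "index": 1},
})


def describe_task(config_json):
    """Return the role, address and cluster sizes encoded in a TF_CONFIG string."""
    config = json.loads(config_json)
    cluster = config["cluster"]
    task = config["task"]
    return {
        "type": task["type"],
        "index": task["index"],
        # The task's own address is the entry at its index in its job list.
        "address": cluster[task["type"]][task["index"]],
        "num_workers": len(cluster.get("worker", [])),
        "num_ps": len(cluster.get("ps", [])),
    }


info = describe_task(tf_config_json)
print(info)
```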

2.2.2 Using a single binary

If you prefer to run all these tasks with a single binary, the program needs to branch into the different roles at the very beginning.

cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
if cluster_resolver.task_type in ("worker", "ps"):
  # Start a TensorFlow server and wait.
  pass
elif cluster_resolver.task_type == "evaluator":
  # Run side-car evaluation.
  pass
else:
  # Run the coordinator.
  pass

The following code starts a TensorFlow server and waits for it to finish.

# Set the environment variable to allow reporting worker and ps failure to the
# coordinator. This is a workaround and won't be necessary in the future.
os.environ["GRPC_FAIL_FAST"] = "use_caller"

server = tf.distribute.Server(
    cluster_resolver.cluster_spec(),
    job_name=cluster_resolver.task_type,
    task_index=cluster_resolver.task_id,
    protocol=cluster_resolver.rpc_layer or "grpc",
    start=True)
server.join()

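2.3 Initialization method

The initialization method is shown below. Its main job is to connect to the cluster and then create _extended to continue the initialization.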

  def __init__(self, cluster_resolver, variable_partitioner=None):
   """Initializes the TF2 parameter server strategy.

    This initializes the  tf.distribute.experimental.ParameterServerStrategy 
    object to be ready for use with
     tf.distribute.experimental.coordinator.ClusterCoordinator .
   """
    # pyformat: enable
    self._cluster_resolver = cluster_resolver

    self._verify_args_and_config(cluster_resolver)
    self._cluster_coordinator = None

    self._connect_to_cluster(coordinator_name="chief")  # Connect to the cluster.
    self._extended = ParameterServerStrategyV2Extended(self, cluster_resolver,
                                                       variable_partitioner)
    super(ParameterServerStrategyV2, self).__init__(self._extended)
    distribute_lib.distribution_strategy_gauge.get_cell("V2").set(
       "ParameterServerStrategy")
    self._should_use_with_coordinator = True
    # Used while constructing distributed iterators.
    self._canonicalize_devices = False

2.4 Connecting to the cluster

_connect_to_cluster connects to the cluster. Its main logic is to set up device filters and then call remote.connect_to_cluster.

  def _connect_to_cluster(self, coordinator_name):
    if coordinator_name in ["worker","ps"]:
      raise ValueError("coordinator name should not be 'worker' or 'ps'.")
    cluster_spec = self._cluster_resolver.cluster_spec()
    self._num_workers = len(cluster_spec.as_dict().get("worker", ()))
    self._num_ps = len(cluster_spec.as_dict().get("ps", ()))

    device_filters = server_lib.ClusterDeviceFilters()
    # For any worker, only the devices on ps and coordinator nodes are visible
    for i in range(self._num_workers):
      device_filters.set_device_filters(
         "worker", i, ["/job:ps","/job:%s" % coordinator_name])
    # Similarly for any ps, only the devices on workers and coordinator are
    # visible
    for i in range(self._num_ps):
      device_filters.set_device_filters(
         "ps", i, ["/job:worker","/job:%s" % coordinator_name])

    # Allow at most one outstanding RPC for each worker at a certain time. This
    # is to simplify worker failure handling in the runtime
    os.environ["TF_ENABLE_EAGER_CLIENT_STREAMING_ENQUEUE"] = "False"

    remote.connect_to_cluster(
        cluster_spec,
        job_name=coordinator_name,
        protocol=self._cluster_resolver.rpc_layer,
        cluster_device_filters=device_filters)

    distribute_lib.distribution_strategy_replica_gauge.get_cell(
       "ps_strategy_num_workers").set(self._num_workers)
    distribute_lib.distribution_strategy_replica_gauge.get_cell(
       "ps_strategy_num_ps").set(self._num_ps)

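The connect_to_cluster method connects to the given cluster and makes the devices on the cluster available. If the given local job name is not present in the cluster spec, it is added automatically, using an unused port on localhost.

A worker that accesses resources or launches ops/functions on filtered-out remote devices gets an unknown-device error. For any remote task, if no device filter is present, all cluster devices are visible; if device filters are specified, the task sees only the devices that match at least one filter. Devices on the task itself are always visible.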
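
The visibility rule above can be modeled in plain Python. This is a simplified sketch, not TensorFlow's matching logic: visible_devices is a hypothetical helper, filters are treated as path prefixes of full device names (an assumption), and the device names are made up.

```python
def visible_devices(own_task_prefix, all_devices, filters=None):
    """Return the devices a task can see under the given device filters."""
    if not filters:
        # No filter present: every cluster device is visible.
        return list(all_devices)

    def matches(device, prefix):
        # Match whole path components so "/job:ps" does not match "/job:ps2".
        return device == prefix or device.startswith(prefix + "/")

    return [
        device for device in all_devices
        # Devices on the task itself are always visible.
        if matches(device, own_task_prefix)
        or any(matches(device, f) for f in filters)
    ]


all_devices = [
    "/job:worker/replica:0/task:0/device:CPU:0",
    "/job:worker/replica:0/task:1/device:CPU:0",
    "/job:ps/replica:0/task:0/device:CPU:0",
    "/job:chief/replica:0/task:0/device:CPU:0",
]

seen_by_worker0 = visible_devices(
    "/job:worker/replica:0/task:0", all_devices, ["/job:ps", "/job:chief"])
print(seen_by_worker0)
```

With these filters worker 0 sees its own device, the ps device, and the chief device, but not the device of worker 1, which is the isolation between workers that _connect_to_cluster sets up.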

The following is a usage example.

cdf = tf.config.experimental.ClusterDeviceFilters()
# For any worker, only the devices on PS nodes and itself are visible
for i in range(num_workers):
  cdf.set_device_filters('worker', i, ['/job:ps'])
# Similarly for any ps, only the devices on workers and itself are visible
for i in range(num_ps):
  cdf.set_device_filters('ps', i, ['/job:worker'])

tf.config.experimental_connect_to_cluster(cluster_def,
                                          cluster_device_filters=cdf)

The code of connect_to_cluster is as follows.

@tf_export("config.experimental_connect_to_cluster")
def connect_to_cluster(cluster_spec_or_resolver,
                       job_name="localhost",
                       task_index=0,
                       protocol=None,
                       make_master_device_default=True,
                       cluster_device_filters=None):
 """Connects to the given cluster.

  Will make devices on the cluster available to use. Note that calling this more
  than once will work, but will invalidate any tensor handles on the old remote
  devices.

  If the given local job name is not present in the cluster specification, it
  will be automatically added, using an unused port on the localhost.

  Device filters can be specified to isolate groups of remote tasks to avoid
  undesired accesses between workers. Workers accessing resources or launching
  ops / functions on filtered remote devices will result in errors (unknown
  devices). For any remote task, if no device filter is present, all cluster
  devices will be visible; if any device filter is specified, it can only
  see devices matching at least one filter. Devices on the task itself are
  always visible. Device filters can be particially specified.

  Args:
    cluster_spec_or_resolver: A  ClusterSpec  or  ClusterResolver  describing
      the cluster.
    job_name: The name of the local job.
    task_index: The local task index.
    protocol: The communication protocol, such as "grpc" . If unspecified, will
      use the default from  python/platform/remote_utils.py .
    make_master_device_default: If True and a cluster resolver is passed, will
      automatically enter the master task device scope, which indicates the
      master becomes the default device to run ops. It won't do anything if
      a cluster spec is passed. Will throw an error if the caller is currently
      already in some device scope.
    cluster_device_filters: an instance of
       tf.train.experimental/ClusterDeviceFilters  that specify device filters
      to the remote tasks in cluster.
 """
  if not context.executing_eagerly():
    raise ValueError(
       " tf.config.experimental_connect_to_cluster  can only be called in"
       "eager mode."
    )
  protocol = protocol or remote_utils.get_default_communication_protocol()
  if isinstance(cluster_spec_or_resolver, server_lib.ClusterSpec):
    cluster_spec = cluster_spec_or_resolver
  elif isinstance(cluster_spec_or_resolver, cluster_resolver.ClusterResolver):
    if cluster_spec_or_resolver.master() in _LOCAL_MASTERS:
      # Do nothing if the master is local.
      return
    cluster_spec = cluster_spec_or_resolver.cluster_spec()
  else:
    raise ValueError(
       " cluster_spec_or_resolver  must be a  ClusterSpec  or a"
       " ClusterResolver .")

  cluster_def = copy.deepcopy(cluster_spec.as_cluster_def())
  if cluster_device_filters:
    if isinstance(cluster_device_filters, server_lib.ClusterDeviceFilters):
      cluster_device_filters = copy.deepcopy(
          cluster_device_filters._as_cluster_device_filters())  
    else:
      raise ValueError(" cluster_device_filters  must be an instance of"
                      " tf.train.experimental.ClusterDeviceFilters .")

  # Automatically add local job, if not part of the cluster spec.
  if job_name not in cluster_spec.jobs:
    local_port = pywrap_tfe.TF_PickUnusedPortOrDie()
    job_def = cluster_def.job.add()
    job_def.name = job_name
    job_def.tasks[0] ="localhost:{}".format(local_port)

  server_def = ServerDef(
      cluster=cluster_def,
      job_name=job_name,
      task_index=task_index,
      protocol=protocol,
      default_session_config=context.context().config,
      cluster_device_filters=cluster_device_filters)

  if context.get_server_def() is None:
    context.set_server_def(server_def)  # Devices are processed here.
  else:
    context.update_server_def(server_def)

  # Configure the master device.
  if make_master_device_default and isinstance(
      cluster_spec_or_resolver,
      cluster_resolver.ClusterResolver) and cluster_spec_or_resolver.master():
    master = cluster_spec_or_resolver.master()
    master_job_name = None
    master_task_id = None
    for job_name in cluster_spec.jobs:
      for task_id in cluster_spec.task_indices(job_name):
        task_address = cluster_spec.task_address(job_name, task_id)
        if master in task_address or task_address in master:
          master_job_name = job_name
          master_task_id = task_id
          break

    if not master_job_name:
      raise ValueError(
         " make_master_device_default  is set to True but cannot find"
         "master %s in the cluster" % master)

    master_device ="/job:{}/replica:0/task:{}".format(master_job_name,
                                                       master_task_id)
    master_device = device_util.canonicalize(master_device)
    current_device = device_util.current()
    if current_device:
      current_device = device_util.canonicalize(current_device)
    if current_device and current_device != master_device:
      raise ValueError(" connect_to_cluster  is called inside existing device"
                      "scope %s, which is different from the master device"
                      "scope %s to enter. This is not allowed." %
                       (current_device, master_device))

    if not current_device:
      logging.info("Entering into master device scope: %s", master_device)
      ops.device(master_device).__enter__()
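
The master lookup in the last part of connect_to_cluster can be sketched in plain Python. This is an illustration under stated assumptions: find_master_device is a hypothetical helper, the cluster dictionary and addresses are made up, and unlike the loop in the source this simplified version returns on the first match.

```python
def find_master_device(cluster, master):
    """Return the device string of the task whose address matches master."""
    for job_name, addresses in cluster.items():
        for task_id, task_address in enumerate(addresses):
            # Same substring test as the source: the master string may carry
            # a scheme prefix such as "grpc://".
            if master in task_address or task_address in master:
                return "/job:{}/replica:0/task:{}".format(job_name, task_id)
    return None


cluster = {
    "worker": ["host1:2222", "host2:2222"],
    "ps": ["host4:2222"],
    "chief": ["host6:2222"],
}

master_device = find_master_device(cluster, "grpc://host6:2222")
print(master_device)
```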

2.5 Initializing devices

set_server_def calls _initialize_logical_devices to initialize the logical devices.

  def set_server_def(self, server_def, keep_alive_secs=_KEEP_ALIVE_SECS):
   """Allow setting a server_def on the context.

    When a server def is replaced, it effectively clears a bunch of caches
    within the context. If you attempt to use a tensor object that was pointing
    to a tensor on the remote device, it will raise an error.

    Args:
      server_def: A tensorflow::ServerDef proto. Enables execution on remote
        devices.
      keep_alive_secs: Num. seconds after which the remote end will hang up. As
        long as the client is still alive, the server state for the context will
        be kept alive. If the client is killed (or there is some failure), the
        server will clean up its context keep_alive_secs after the final RPC it
        receives.

    Raises:
      ValueError: if server_def is None.
   """
    if not server_def:
      raise ValueError("server_def is None.")

    self._server_def = server_def

    if self._context_handle:
      server_def_str = server_def.SerializeToString()
      pywrap_tfe.TFE_ContextSetServerDef(self._context_handle, keep_alive_secs,
                                         server_def_str)
      self._initialize_logical_devices()

    # Clear all the caches in case there are remote tensors in them.
    self._clear_caches()

_initialize_logical_devices in turn calls methods of the context object, along with some other helpers, to do its work.

  def _initialize_logical_devices(self):
   """Helper to initialize devices."""
    # Store list of devices
    logical_devices = []
    context_devices = []
    device_list = pywrap_tfe.TFE_ContextListDevices(self._context_handle)
    try:
      self._num_gpus = 0
      for i in range(pywrap_tfe.TF_DeviceListCount(device_list)):
        dev_name = pywrap_tfe.TF_DeviceListName(device_list, i)
        context_devices.append(pydev.canonical_name(dev_name))
        spec = pydev.DeviceSpec.from_string(dev_name)
        # If the job is localhost, we assume that the cluster has not yet been
        # configured and thus clear the job, replica & task.
        if spec.job =="localhost":
          spec = spec.replace(job=None, replica=None, task=None)
        logical_devices.append(
            LogicalDevice(name=spec.to_string(), device_type=spec.device_type))
        dev_type = pywrap_tfe.TF_DeviceListType(device_list, i)
        if dev_type =="GPU":
          self._num_gpus += 1

    finally:
      self._logical_devices = logical_devices
      self._context_devices = context_devices
      pywrap_tfe.TF_DeleteDeviceList(device_list)

Taking TFE_ContextListDevices as an example, it calls the ListDevices method of the Context.

TF_DeviceList* TFE_ContextListDevices(TFE_Context* ctx, TF_Status* status) {
  TF_DeviceList* l = new TF_DeviceList;
  tensorflow::unwrap(ctx)->ListDevices(&l->response);
  return l;
}

How the context is implemented depends on the specific case; for example, the following code creates a context.

TFE_Context* TFE_NewContext(const TFE_ContextOptions* opts, TF_Status* status) {
  if (opts->use_tfrt) {
#if defined(PLATFORM_GOOGLE) && !defined(LIBTPU_ON_GCE)
    tfrt::tf::ContextInterface* tfrt_context = new tfrt::tf::ContextInterface(
        opts->session_options.options,
        static_cast<tensorflow::ContextDevicePlacementPolicy>(
            opts->device_placement_policy),
        opts->async, opts->use_tfrt_distributed_runtime);
#if !defined(IS_MOBILE_PLATFORM)
    tfrt_context->SetDistributedManager(
        tfrt::tf::CreateDistributedManagerContext(
            tfrt_context->GetCoreRuntime()->GetHostContext()));
#endif  // !IS_MOBILE_PLATFORM
    return tensorflow::wrap(tfrt_context);
#else
    status->status = tensorflow::errors::Unimplemented("TFRT is not supported");
    return nullptr;
#endif  // PLATFORM_GOOGLE && !LIBTPU_ON_GCE
  }
  std::vector<std::unique_ptr<tensorflow::Device>> devices;
  status->status = tensorflow::DeviceFactory::AddDevices(
      opts->session_options.options,"/job:localhost/replica:0/task:0",
      &devices);
  if (!status->status.ok()) return nullptr;
  std::unique_ptr<tensorflow::DeviceMgr> device_mgr(
      new tensorflow::DynamicDeviceMgr(std::move(devices)));

  tensorflow::Rendezvous* r =
      new tensorflow::IntraProcessRendezvous(device_mgr.get());
  tensorflow::EagerContext* eager_context = new tensorflow::EagerContext(
      opts->session_options.options,
      static_cast<tensorflow::ContextDevicePlacementPolicy>(
          opts->device_placement_policy),
      opts->async, device_mgr.release(),
      /*device_mgr_owned*/ true, r,
      /*cluster_flr=*/nullptr,
      /*collective_executor_mgr=*/nullptr,
      /*run_eager_op_as_function=*/opts->run_eager_op_as_function);
#if !defined(IS_MOBILE_PLATFORM)
  eager_context->SetDistributedManager(
      std::make_unique<tensorflow::EagerContextDistributedManager>(
          eager_context));
#endif  // !IS_MOBILE_PLATFORM
  return tensorflow::wrap(eager_context);
}

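2.6 Master device

connect_to_cluster calls ops.device(master_device).__enter__() to set the master device. The code is in tensorflow/python/framework/ops.py. The device_name_or_function argument can be a device name string, a device function, or None:

If it is a device name string, all operations constructed in this context are assigned to the device with that name, unless overridden by a nested device() context.
If it is a function, it is treated as a function from operation objects to device name strings and is invoked each time a new operation is created. The operation is assigned to the device with the returned name.
If it is None, all device() invocations from the enclosing context are ignored.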

@tf_export(v1=["device"])
def device(device_name_or_function):
 """Wrapper for  Graph.device()  using the default graph.

  See  tf.Graph.device  for more details.

  Args:
    device_name_or_function: The device name or function to use in the context.

  Returns:
    A context manager that specifies the default device to use for newly
    created ops.

  Raises:
    RuntimeError: If eager execution is enabled and a function is passed in.
 """
  if context.executing_eagerly():
    if callable(device_name_or_function):
      raise RuntimeError(
         "tf.device does not support functions when eager execution"
         "is enabled.")
    return context.device(device_name_or_function)
  elif executing_eagerly_outside_functions():
    @tf_contextlib.contextmanager
    def combined(device_name_or_function):
      with get_default_graph().device(device_name_or_function):
        if not callable(device_name_or_function):
          with context.device(device_name_or_function):
            yield
        else:
          yield
    return combined(device_name_or_function)
  else:
    return get_default_graph().device(device_name_or_function)

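3. Training with Model.fit

Keras provides an easy-to-use training API via Model.fit. It handles the training loop under the hood, offers flexibility through an overridable train_step and callbacks, and provides functionality such as checkpoint saving and summary saving for TensorBoard. With Model.fit, the same training code can be used with other strategies by simply swapping the strategy object.

3.1 Input data

Model.fit with parameter server training requires the input data to be provided in a callable that takes a single argument of type tf.distribute.InputContext and returns a tf.data.Dataset. Then create a tf.keras.utils.experimental.DatasetCreator object that takes this callable and, optionally, a tf.distribute.InputOptions object via the input_options argument.

Note that with parameter server training it is recommended to shuffle and repeat the dataset and to specify steps_per_epoch in the fit call, so that the library knows the epoch boundaries.

For more information about the InputContext argument, see the official Distributed input tutorial.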

def dataset_fn(input_context):
  global_batch_size = 64
  batch_size = input_context.get_per_replica_batch_size(global_batch_size)

  x = tf.random.uniform((10, 10))
  y = tf.random.uniform((10,))

  dataset = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(10).repeat()
  dataset = dataset.shard(
      input_context.num_input_pipelines,
      input_context.input_pipeline_id)
  dataset = dataset.batch(batch_size)
  dataset = dataset.prefetch(2)

  return dataset

dc = tf.keras.utils.experimental.DatasetCreator(dataset_fn)

The code in dataset_fn is invoked on the input device of each worker, which is usually the CPU.
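
The two InputContext-driven steps in dataset_fn, computing the per-replica batch size and sharding by input pipeline, can be mimicked in plain Python to see what each worker ends up with. This is a sketch only: per_replica_batch_size and shard are hypothetical helpers that imitate the arithmetic, and the replica and pipeline counts are made-up values.

```python
def per_replica_batch_size(global_batch_size, num_replicas_in_sync):
    """Split a global batch evenly across replicas."""
    if global_batch_size % num_replicas_in_sync != 0:
        raise ValueError("global batch size must be divisible by replica count")
    return global_batch_size // num_replicas_in_sync


def shard(elements, num_shards, index):
    """Keep every num_shards-th element, starting at position index."""
    return elements[index::num_shards]


elements = list(range(10))

# Assumed setup: 2 input pipelines (one per worker), 4 replicas in total.
pipeline_0 = shard(elements, num_shards=2, index=0)
pipeline_1 = shard(elements, num_shards=2, index=1)
batch_size = per_replica_batch_size(64, num_replicas_in_sync=4)

print(pipeline_0, pipeline_1, batch_size)
```

Each pipeline receives a disjoint half of the elements, so sharding before batching keeps workers from reading the same examples within one pass.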

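3.2 Model construction and compilation

Once the data is ready, create a tf.keras.Model and then call Model.compile to incorporate components such as an optimizer, metrics, or parameters such as steps_per_execution.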

with strategy.scope():
  model = tf.keras.models.Sequential([tf.keras.layers.Dense(10)])
  model.compile(tf.keras.optimizers.SGD(), loss='mse', steps_per_execution=10)

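3.3 Callbacks and training

Before calling model.fit for the actual training, prepare the callbacks needed for common tasks, for example:

ModelCheckpoint: saves the model weights.
BackupAndRestore: makes sure the training progress is backed up automatically and restored if the cluster becomes unavailable (for example, through abort or preemption).
TensorBoard: saves progress reports as summary files that can be visualized in the TensorBoard tool.

Note: for performance reasons, custom callbacks cannot override batch-level callbacks when used with ParameterServerStrategy. Change your custom callbacks to epoch-level calls and adjust steps_per_epoch to a suitable value. In addition, steps_per_epoch is a required argument of Model.fit when used with ParameterServerStrategy.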

working_dir = '/tmp/my_working_dir'
log_dir = os.path.join(working_dir, 'log')
ckpt_filepath = os.path.join(working_dir, 'ckpt')
backup_dir = os.path.join(working_dir, 'backup')

callbacks = [
    tf.keras.callbacks.TensorBoard(log_dir=log_dir),
    tf.keras.callbacks.ModelCheckpoint(filepath=ckpt_filepath),
    tf.keras.callbacks.experimental.BackupAndRestore(backup_dir=backup_dir),
]

model.fit(dc, epochs=5, steps_per_epoch=20, callbacks=callbacks)

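3.4 Using ClusterCoordinator directly (optional)

Even if you choose the Model.fit training path, you can optionally instantiate a tf.distribute.experimental.coordinator.ClusterCoordinator object to schedule other functions that you would like to run on the workers.

4. Custom training

A custom training loop with tf.distribute.Strategy provides great flexibility in defining the training loop. With the ParameterServerStrategy defined above (as strategy), you can use a tf.distribute.experimental.coordinator.ClusterCoordinator to dispatch the execution of training steps to remote workers.

As with training loops for other tf.distribute.Strategy classes, you need to create a model, define a dataset, and define a step function. To ensure efficient dataset prefetching, use the distributed dataset creation APIs mentioned below. Also, make sure to call Strategy.run inside worker_fn so that the GPUs allocated to the workers are fully used.

The following sections show how to create these components.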

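4.1 Setting up the data

First, write a function that creates a dataset, including the preprocessing logic implemented by Keras preprocessing layers. The layers are created outside dataset_fn but the transformation is applied inside dataset_fn, because dataset_fn will be wrapped in a tf.function, which does not allow variables to be created inside it.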

feature_vocab = [
   "avenger","ironman","batman","hulk","spiderman","kingkong","wonder_woman"
]
label_vocab = ["yes","no"]

with strategy.scope():
  feature_lookup_layer = tf.keras.layers.StringLookup(
      vocabulary=feature_vocab,
      mask_token=None)
  label_lookup_layer = tf.keras.layers.StringLookup(
      vocabulary=label_vocab,
      num_oov_indices=0,
      mask_token=None)

  raw_feature_input = tf.keras.layers.Input(
      shape=(3,),
      dtype=tf.string,
      name="feature")
  feature_id_input = feature_lookup_layer(raw_feature_input)
  feature_preprocess_stage = tf.keras.Model(
      {"features": raw_feature_input},
      feature_id_input)

  raw_label_input = tf.keras.layers.Input(
      shape=(1,),
      dtype=tf.string,
      name="label")
  label_id_input = label_lookup_layer(raw_label_input)

  label_preprocess_stage = tf.keras.Model(
      {"label": raw_label_input},
      label_id_input)

The following code generates the example data.

import random

def feature_and_label_gen(num_examples=200):
  examples = {"features": [], "label": []}
  for _ in range(num_examples):
    # Draw 3 distinct features; the label is "yes" only if "avenger" is drawn.
    features = random.sample(feature_vocab, 3)
    label = ["yes"] if "avenger" in features else ["no"]
    examples["features"].append(features)
    examples["label"].append(label)
  return examples

examples = feature_and_label_gen()

Then wrap the training dataset in a dataset_fn.

def dataset_fn(_):
  raw_dataset = tf.data.Dataset.from_tensor_slices(examples)

  train_dataset = raw_dataset.map(
      lambda x: (
          {"features": feature_preprocess_stage(x["features"])},
          label_preprocess_stage(x["label"])
      )).shuffle(200).batch(32).repeat()
  return train_dataset

4.2 Building the model

Next, build the model and the other objects. Make sure these variables are created under strategy.scope.

# These variables created under the  strategy.scope  will be placed on parameter
# servers in a round-robin fashion.
with strategy.scope():
  # Create the model. The input needs to be compatible with Keras processing layers.
  model_input = tf.keras.layers.Input(
      shape=(3,), dtype=tf.int64, name="model_input")

  emb_layer = tf.keras.layers.Embedding(
      input_dim=len(feature_lookup_layer.get_vocabulary()), output_dim=16384)
  emb_output = tf.reduce_mean(emb_layer(model_input), axis=1)
  dense_output = tf.keras.layers.Dense(units=1, activation="sigmoid")(emb_output)
  model = tf.keras.Model({"features": model_input}, dense_output)

  optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.1)
  accuracy = tf.keras.metrics.Accuracy()

Next, confirm that the use of FixedShardsPartitioner split all variables into two shards and that each shard was assigned to a different parameter server.

assert len(emb_layer.weights) == 2
assert emb_layer.weights[0].shape == (4, 16384)
assert emb_layer.weights[1].shape == (4, 16384)
assert emb_layer.weights[0].device == "/job:ps/replica:0/task:0/device:CPU:0"
assert emb_layer.weights[1].device == "/job:ps/replica:0/task:1/device:CPU:0"
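
The shard shapes in the assertions above can be checked with some arithmetic. The sketch below is an approximation for illustration, not TensorFlow's partitioner code: approx_num_shards and shard_shapes are hypothetical helpers, and it assumes two parameter servers, float32 weights (4 bytes per element), the MinSizePartitioner settings from section 2.1, and a vocabulary of 8 rows (7 features plus 1 out-of-vocabulary index).

```python
import math


def approx_num_shards(shape, bytes_per_element, min_shard_bytes, max_shards):
    """Approximate shard count so each shard holds at least min_shard_bytes."""
    total_bytes = bytes_per_element * math.prod(shape)
    # Never exceed the row count or max_shards, and always keep one shard.
    return max(1, min(shape[0], max_shards, total_bytes // min_shard_bytes))


def shard_shapes(shape, num_shards):
    """Split the first axis with the div rule; earlier shards get extra rows."""
    base, extra = divmod(shape[0], num_shards)
    return [
        (base + (1 if i < extra else 0),) + tuple(shape[1:])
        for i in range(num_shards)
    ]


embedding_shape = (8, 16384)  # 7 vocabulary entries + 1 OOV index, 16384 dims
num_shards = approx_num_shards(
    embedding_shape, bytes_per_element=4,
    min_shard_bytes=256 << 10, max_shards=2)
shapes = shard_shapes(embedding_shape, num_shards)
print(num_shards, shapes)
```

The embedding table takes 8 * 16384 * 4 = 524288 bytes, which is exactly two shards of 256K, so each shard has shape (4, 16384).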

4.3 Defining the training step

The third step is to create the training step with tf.function.

@tf.function
def step_fn(iterator):

  def replica_fn(batch_data, labels):
    with tf.GradientTape() as tape:
      pred = model(batch_data, training=True)
      per_example_loss = tf.keras.losses.BinaryCrossentropy(
              reduction=tf.keras.losses.Reduction.NONE)(labels, pred)
      loss = tf.nn.compute_average_loss(per_example_loss)
      gradients = tape.gradient(loss, model.trainable_variables)

    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    actual_pred = tf.cast(tf.greater(pred, 0.5), tf.int64)
    accuracy.update_state(labels, actual_pred)
    return loss

  batch_data, labels = next(iterator)
  losses = strategy.run(replica_fn, args=(batch_data, labels))
  return strategy.reduce(tf.distribute.ReduceOp.SUM, losses, axis=None)

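In the step function above, calling Strategy.run and Strategy.reduce inside step_fn supports multiple GPUs per worker. If the workers have GPUs allocated, Strategy.run distributes the datasets across multiple replicas.

4.4 Dispatching computation to remote workers

After all the computations are defined with ParameterServerStrategy, use the tf.distribute.experimental.coordinator.ClusterCoordinator class to create resources and distribute the training steps to remote workers.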

coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator(strategy)

Then create a per-worker dataset and an iterator. In per_worker_dataset_fn below, wrapping dataset_fn in strategy.distribute_datasets_from_function is recommended so that data is prefetched to GPUs seamlessly and efficiently.

@tf.function
def per_worker_dataset_fn():
  return strategy.distribute_datasets_from_function(dataset_fn)

per_worker_dataset = coordinator.create_per_worker_dataset(per_worker_dataset_fn)
per_worker_iterator = iter(per_worker_dataset)

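The final step is to distribute the computation to remote workers with ClusterCoordinator.schedule.

The schedule method enqueues a tf.function and immediately returns a future-like RemoteValue. The queued functions are dispatched to remote workers in background threads, and the RemoteValue is filled in asynchronously.
The join method (ClusterCoordinator.join) can be used to wait until all scheduled functions have finished executing.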

num_epoches = 4
steps_per_epoch = 5
for i in range(num_epoches):
  accuracy.reset_states()
  for _ in range(steps_per_epoch):
    coordinator.schedule(step_fn, args=(per_worker_iterator,))
  # Wait at epoch boundaries.
  coordinator.join()
  print("Finished epoch %d, accuracy is %f." % (i, accuracy.result().numpy()))

The following shows how to fetch the result of a RemoteValue.

loss = coordinator.schedule(step_fn, args=(per_worker_iterator,))
print("Final loss is %f" % loss.fetch())

Alternatively, you can launch all the steps and do something else while waiting for them to complete.

for _ in range(total_steps):
  coordinator.schedule(step_fn, args=(per_worker_iterator,))
while not coordinator.done():
  time.sleep(10)
  # Do something like logging metrics or writing checkpoints.

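4.5 Creating the dataset

The dataset in the code above is created with the ClusterCoordinator.create_per_worker_dataset API. It creates one dataset per worker and returns a container object. You can call iter on it to create a per-worker iterator. Before the function is executed on a worker, the per-worker input arguments of ClusterCoordinator.schedule are replaced with that worker's corresponding slice.

Currently, ClusterCoordinator.schedule assumes that the workers are equivalent and therefore that the datasets on different workers are the same, except that they may be shuffled differently if they contain a Dataset.shuffle operation. Because of this, it is recommended to schedule a finite number of steps rather than rely on the OutOfRangeError from a dataset.

Another important note is that tf.data datasets do not support implicit serialization and deserialization across task boundaries. It is therefore important to create the whole dataset inside the function passed to ClusterCoordinator.create_per_worker_dataset.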

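5. Running

5.1 Running directly

If run is called directly, ParameterServerStrategy follows the same pattern as the other strategies. For example, parameter_server_strategy_v2 calls into mirrored_run, so we do not repeat the details here.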

  def _call_for_each_replica(self, fn, args, kwargs):
    self._assert_being_scheduled_by_cluster_coordinator()

    return mirrored_run.call_for_each_replica(self._container_strategy(), fn,
                                              args, kwargs)

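5.2 ClusterCoordinator

The other way is to run through ClusterCoordinator, which we analyze in the next article together with the custom training loop.

6. Performance improvement

If you see performance issues when training with ParameterServerStrategy and ClusterResolver, there are several possible reasons.

One common reason is that the parameter servers have unbalanced load and some heavily loaded parameter servers have reached capacity. There can also be multiple root causes. Some simple ways to mitigate this are:

Shard your large model variables by specifying a variable_partitioner when constructing the ParameterServerStrategy.
If possible, avoid creating a hotspot variable that is required by all parameter servers in a single step. For example, use a constant learning rate or subclass tf.keras.optimizers.schedules.LearningRateSchedule in optimizers, because by default the learning rate becomes a variable placed on one particular parameter server and requested by all other parameter servers in each step.
Shuffle your large vocabularies before passing them to Keras preprocessing layers.

Another possible reason for performance issues is the coordinator. The initial implementation of schedule/join is Python-based and may therefore have threading overhead. The latency between the coordinator and the workers can also be large. If that is the case:

For Model.fit, you can set the steps_per_execution argument of Model.compile to a value larger than 1.
For a custom training loop, you can pack multiple steps into a single tf.function.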

steps_per_invocation = 10

@tf.function
def step_fn(iterator):
  for _ in range(steps_per_invocation):
    features, labels = next(iterator)
    def replica_fn(features, labels):
      ...

    strategy.run(replica_fn, args=(features, labels))

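As the library is optimized further, most users will hopefully not have to pack steps manually in the future. In addition, a small trick for better performance is to schedule functions without a return value.

7. Known limitations

Most of the known limitations have already been covered in the sections above. This section provides a summary.

7.1 ParameterServerStrategy

os.environ["GRPC_FAIL_FAST"] = "use_caller" is needed on every task, including the coordinator, for fault tolerance to work properly.
Synchronous parameter server training is not supported.
It is usually necessary to pack multiple steps into a single function to achieve optimal performance.
Loading a saved_model that contains sharded variables via tf.saved_model.load is not supported. Note that loading such a saved_model with TensorFlow Serving is expected to work.
Loading a checkpoint that contains sharded optimizer slot variables into a different number of shards is not supported.
Recovering from a parameter server failure without restarting the coordinator task is not supported.
Using tf.lookup.StaticHashTable (commonly employed by some Keras preprocessing layers such as tf.keras.layers.IntegerLookup, tf.keras.layers.StringLookup, and tf.keras.layers.TextVectorization) currently results in resources being placed on the coordinator during parameter server training. This affects the performance of lookup RPCs from workers to the coordinator, and addressing it is currently a high priority.

7.2 Model.fit

The steps_per_epoch argument is required in Model.fit. You can select a value that provides appropriate intervals within an epoch.
For performance reasons, ParameterServerStrategy does not support custom callbacks with batch-level calls. Convert those calls into epoch-level calls with a suitably chosen steps_per_epoch, so that they are called every steps_per_epoch steps. Built-in callbacks are not affected: their batch-level calls have been modified to be performant. Support for batch-level calls with ParameterServerStrategy is being planned.
For the same reason, unlike with other strategies, the progress bar and metrics are logged only at epoch boundaries.
run_eagerly is not supported.

7.3 Custom loops

ClusterCoordinator.schedule does not support visitation guarantees for a dataset.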

8. References

https://www.youtube.com/watch?v=B2Tpv_N7wkg&ab_channel=TensorFlow

[Chinese subtitles] TFRT: A New TensorFlow Runtime - TF Dev Summit '20

Understanding TFRT in depth

Inside TensorFlow: Eager execution runtime

[Deep learning framework TensorFlow: Inside TensorFlow] Inside TensorFlow (collection)

https://github.com/tensorflow/docs-l10n/blob/07e15a23c7fa397bc44acbf20f997f7cb268ab1c/site/en-snapshot/tutorials/distribute/parameter_server_training.ipynb