Wasted-money warning: with distributed TensorFlow, watch out for ps servers burning resources while idle

Original post

2020-02-20 14:15

Praying for Wuhan.

Problem 1

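A ps server never stops on its own, under any circumstances. This issue was first raised in 2016, and to this day there is still no clean, simple fix. It can be serious: if you are renting compute, you will waste a lot of money.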

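I noticed that a ps server never stops on its own, whether it runs on GPU or CPU resources. It keeps running after the workers have finished training and exited, and even after they hit an error. The process therefore keeps holding the node's resources, even when it is not using the GPU, and this is billed as full usage!!! Early on, when one of my buggy programs got into this state, my support engineer did not tell me that the ps does not stop, even though he knew it would be billed. As a result, my first successful distributed program ran idle for several hours, and it hurts to think of the core-hours we rented. None of the tutorials warn about this, which is why I wrote this post.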

The root cause is:


  if FLAGS.job_name == "ps":
      server.join()

This makes the ps wait for the workers forever, because server.join() never returns...
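
As a rough analogy using only Python's standard library (this is not TensorFlow's implementation, and `serve_forever` and `stop_event` are hypothetical names), the sketch below shows a "server" that only returns when something explicitly tells it to stop. Nobody does that for a ps, so a plain join on it blocks for good. The timeout is there only so the demo can finish.

```python
import threading


def serve_forever(stop_event):
    # Stand-in for the ps server loop: returns only once stop_event is set.
    stop_event.wait()


stop_event = threading.Event()
server_thread = threading.Thread(
    target=serve_forever, args=(stop_event,), daemon=True)
server_thread.start()

# A join() without a timeout would block here indefinitely, like server.join().
server_thread.join(timeout=0.2)
still_running = server_thread.is_alive()
print("server still running after waiting:", still_running)

# Only an explicit stop signal lets the server finish.
stop_event.set()
server_thread.join()
```

The fix described next follows the same idea: give the ps an explicit signal to wait for instead of joining forever.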

For a solution, see:

https://stackoverflow.com/questions/39810356/shut-down-server-in-tensorflow

The answers there are already very detailed. I followed the approach of the user "I'll eat my hat". Below is my complete code as a worked example, for reference:

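# Assumed context: TensorFlow 1.x and the DeepLab train.py script. FLAGS and
# _train_deeplab_model are defined elsewhere in that script; the deeplab
# module paths below are assumed.
import tensorflow as tf
from deeplab import model
from deeplab.datasets import data_generator
from deeplab.utils import train_utils

def main(unused_argv):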
  tf.logging.set_verbosity(tf.logging.INFO)

  tf.gfile.MakeDirs(FLAGS.train_logdir)
  tf.logging.info('Training on %s set', FLAGS.train_split)
  # Set up the cluster for distributed training.
  ps_hosts=FLAGS.ps_hosts.split(",")
  worker_hosts=FLAGS.worker_hosts.split(",")
  cluster=tf.train.ClusterSpec({"ps":ps_hosts,"worker":worker_hosts})
  server=tf.train.Server(cluster,job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)
  if FLAGS.job_name == "ps":
      # Each ps task owns one shared queue; every worker enqueues its task index when it finishes.
      with tf.device('/job:ps/task:%d' % FLAGS.task_index):
          queue = tf.FIFOQueue(cluster.num_tasks('worker'), tf.int32,
                               shared_name='done_queue%d' % FLAGS.task_index)
          dequeue_op = queue.dequeue()

      # Block until every worker has reported "done", then let the ps process exit.
      with tf.Session(server.target) as sess:
          for _ in range(cluster.num_tasks('worker')):
              worker_id = sess.run(dequeue_op)
              print('ps:%d received "done" from worker:%d' % (FLAGS.task_index, worker_id))
          print('ps:%d quitting' % FLAGS.task_index)
  elif FLAGS.job_name =="worker":
      graph = tf.Graph()
      with graph.as_default():
        with tf.device(tf.train.replica_device_setter(worker_device="/job:worker/task:%d" % (FLAGS.task_index),
                                                      cluster=cluster)):#, ps_tasks=FLAGS.num_ps_tasks

          # For each ps task, open its shared "done" queue (pinned to /job:ps/task:i)
          # and build an op that enqueues this worker's task index into it.
          done_ops = []
          for i in range(cluster.num_tasks('ps')):
            with tf.device('/job:ps/task:%d' % i):
              done_queue = tf.FIFOQueue(cluster.num_tasks('worker'), tf.int32,
                                        shared_name='done_queue' + str(i))
              done_ops.append(done_queue.enqueue(FLAGS.task_index))
          assert FLAGS.train_batch_size % FLAGS.num_clones == 0, (
              'Training batch size not divisible by number of clones (GPUs).')
          clone_batch_size = FLAGS.train_batch_size // FLAGS.num_clones

          dataset = data_generator.Dataset(
              dataset_name=FLAGS.dataset,
              split_name=FLAGS.train_split,
              dataset_dir=FLAGS.dataset_dir,
              batch_size=clone_batch_size,
              crop_size=[int(sz) for sz in FLAGS.train_crop_size],
              min_resize_value=FLAGS.min_resize_value,
              max_resize_value=FLAGS.max_resize_value,
              resize_factor=FLAGS.resize_factor,
              min_scale_factor=FLAGS.min_scale_factor,
              max_scale_factor=FLAGS.max_scale_factor,
              scale_factor_step_size=FLAGS.scale_factor_step_size,
              model_variant=FLAGS.model_variant,
              num_readers=2,
              is_training=True,
              should_shuffle=True,
              should_repeat=True)

          train_tensor, summary_op = _train_deeplab_model(
              dataset.get_one_shot_iterator(), dataset.num_of_classes,
              dataset.ignore_label)

          # Soft placement allows placing on CPU ops without GPU implementation.
          session_config = tf.ConfigProto(
              allow_soft_placement=True, log_device_placement=False)
          # Added for the cloud cluster: allocate GPU memory on demand.
          session_config.gpu_options.allow_growth = True

          last_layers = model.get_extra_layer_scopes(
              FLAGS.last_layers_contain_logits_only)
          init_fn = None
          if FLAGS.tf_initial_checkpoint:
            init_fn = train_utils.get_model_init_fn(
                FLAGS.train_logdir,
                FLAGS.tf_initial_checkpoint,
                FLAGS.initialize_last_layer,
                last_layers,
                ignore_missing_vars=True)

          scaffold = tf.train.Scaffold(
              init_fn=init_fn,
              summary_op=summary_op,
          )

          stop_hook = tf.train.StopAtStepHook(
              last_step=FLAGS.training_number_of_steps
          )
          # FinalOpsHook runs done_ops when the session ends, telling every ps this worker is done.
          hooks = [stop_hook, tf.train.FinalOpsHook(done_ops)]

          profile_dir = FLAGS.profile_logdir
          if profile_dir is not None:
            tf.gfile.MakeDirs(profile_dir)

          with tf.contrib.tfprof.ProfileContext(
              enabled=profile_dir is not None, profile_dir=profile_dir):
            with tf.train.MonitoredTrainingSession(
                master=server.target,
                is_chief=(FLAGS.task_index == 0),
                config=session_config,
                scaffold=scaffold,
                checkpoint_dir=FLAGS.train_logdir,
                summary_dir=FLAGS.train_logdir,
                log_step_count_steps=FLAGS.log_steps,
                save_summaries_steps=FLAGS.save_summaries_secs,
                save_checkpoint_secs=FLAGS.save_interval_secs,
                hooks=hooks) as sess:
              while not sess.should_stop():
                sess.run([train_tensor])
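
Stripped of TensorFlow, the shutdown handshake in the code above comes down to a shared queue: every worker puts its index into the queue when it finishes, and the ps exits after it has taken out one entry per worker. Here is a minimal standard-library sketch of that idea. The threads stand in for separate processes, and `NUM_WORKERS`, `worker` and `parameter_server` are hypothetical names, not part of my training script.

```python
import queue
import threading

NUM_WORKERS = 3  # hypothetical cluster size


def worker(task_index, done_queue):
    # ... training would happen here ...
    # Counterpart of the enqueue op that FinalOpsHook runs at session end.
    done_queue.put(task_index)


def parameter_server(done_queue, num_workers):
    finished = []
    for _ in range(num_workers):
        # Blocks until some worker reports in, like sess.run(dequeue_op).
        finished.append(done_queue.get())
    return sorted(finished)


done_queue = queue.Queue(maxsize=NUM_WORKERS)
threads = [threading.Thread(target=worker, args=(i, done_queue))
           for i in range(NUM_WORKERS)]
for t in threads:
    t.start()

finished_workers = parameter_server(done_queue, NUM_WORKERS)
for t in threads:
    t.join()
print("ps quitting after workers", finished_workers)
```

The ps counts entries and does not care who sent them, so it only exits once all workers have reported.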

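This still leaves one problem: if the code itself has a bug, the ps will not exit on its own. As far as I can tell, a worker that crashes never sends its "done" signal, so the ps keeps waiting. I will have to think about this one some more.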
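
One possible direction, which I have not tried on the real cluster, is to combine two safeguards: have each worker report "done" in a `finally` block so that a crash still sends the signal, and have the ps give up after a timeout so it does not wait forever. The standard-library sketch below only illustrates that idea. `run_worker`, `wait_for_workers` and the timeout values are hypothetical and are not part of the TensorFlow code above.

```python
import queue
import threading


def run_worker(task_index, done_queue, train_fn):
    try:
        train_fn()
    except Exception as exc:
        print("worker %d failed: %r" % (task_index, exc))
    finally:
        # Report in even when training raised, so the ps is not left waiting.
        done_queue.put(task_index)


def wait_for_workers(done_queue, num_workers, timeout_secs):
    """Collect 'done' signals; stop early if one does not arrive in time."""
    finished = []
    for _ in range(num_workers):
        try:
            finished.append(done_queue.get(timeout=timeout_secs))
        except queue.Empty:
            break  # a worker stayed silent: stop waiting and let the ps exit
    return sorted(finished)


def good_training():
    pass


def bad_training():
    raise RuntimeError("simulated bug")


done_queue = queue.Queue()
threads = [
    threading.Thread(target=run_worker, args=(0, done_queue, good_training)),
    threading.Thread(target=run_worker, args=(1, done_queue, bad_training)),
]
for t in threads:
    t.start()
reported = wait_for_workers(done_queue, 2, timeout_secs=1.0)
for t in threads:
    t.join()

# A worker that never reports at all: the ps still returns after the timeout.
silent = wait_for_workers(queue.Queue(), 1, timeout_secs=0.1)
print(reported, silent)
```

The timeout needs to be generous in practice, since a ps that quits while workers are still training would break the job.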

The same problem is also discussed on Zhihu. You can try the approach there as well, but I did not adopt it.

https://www.zhihu.com/question/51181456?from=profile_question_card

Problem 2

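Here is a rather incidental error that keeps even the workers from stopping. The ps prints UnknownError: Could not start gRPC server.

This happens because the port is already in use. The address looks something like:

hostname:2223 (for example 192.18.49.1:2223, or 1:2223)

Here 2223 is the port. If something else is already using 2223, the workers will not stop after they finish.

A node that is not released keeps burning resources, which costs money.

The fix is to watch the ps output as soon as the job starts. If it reports UnknownError: Could not start gRPC server, switch to a different address (another node or another port), for example: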

hostname:2333 (any free port will do; valid port numbers go up to 65535)
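
Besides watching the ps output, you could probe the port before launching the job. The sketch below is one way to do that with Python's standard library. `port_is_free` is a hypothetical helper, it has to run on the node that will host the ps, and another process can still grab the port between the check and the actual start.

```python
import socket


def port_is_free(host, port):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except (OSError, OverflowError):
            # OSError: address already in use (or not available on this host).
            # OverflowError: port number outside the valid 0-65535 range.
            return False
    return True


# Demo: occupy a port picked by the OS, then probe it before and after release.
occupied = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
occupied.bind(("127.0.0.1", 0))
occupied.listen(1)
busy_port = occupied.getsockname()[1]

free_while_occupied = port_is_free("127.0.0.1", busy_port)
occupied.close()
free_after_close = port_is_free("127.0.0.1", busy_port)
print(busy_port, free_while_occupied, free_after_close)
```

If the probe returns False for the port in your ps_hosts entry, pick another port or node before submitting the job.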

TinaO-O
