Running tf.slim evaluation loop on CPU

I'm trying to fine-tune a network using train_image_classifier.py from the TensorFlow-Slim image classification library, and I'd like to run an evaluation script in parallel. I made the script by modifying eval_image_classifier.py so that it uses evaluation_loop() instead of evaluate_once().

It all seems to work fine until the moment I run both the training and the evaluation processes at the same time. As soon as the evaluation process tries to restore a checkpoint, the GPU memory is depleted and the evaluation process crashes (while the training process seems to freeze).

Looking through the documentation, I found on this page about using GPUs that:

By default, TensorFlow maps nearly all of the GPU memory of all GPUs (subject to CUDA_VISIBLE_DEVICES) visible to the process. This is done to more efficiently use the relatively precious GPU memory resources on the devices by reducing memory fragmentation.

I'm aware of this, and of the fact that I could turn it off to reduce the memory footprint, but since the evaluation doesn't need to be as fast as the training (and given the large number of CPU cores and the amount of RAM available on my machine), I would like to run the evaluation on the CPU, leaving the GPU entirely to the training.
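(For reference, a minimal sketch of what "turning it off" would look like: letting the session grab GPU memory on demand via allow_growth, which reduces, but does not eliminate, the evaluator's GPU footprint:)

    import tensorflow as tf

    # Allocate GPU memory on demand instead of mapping
    # (nearly) all of it up front.
    config = tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))
    # ...then pass `config` as the session_config of the evaluation call.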

My naive approach was to edit the main(_) function of eval_image_classifier.py as follows:

def main(_):
    # flag validity checks
    with tf.device('/cpu:0'): # Run evaluation on the CPUs
        # all my code that was previously in main()

Alas, naive was not good enough: the eval process still hogs the GPU and crashes. How can I force the computation to run exclusively on the CPU?

Update:

Looking for alternative solutions, I found this inception-v3 model, whose docstring examples include:

    # Force all Variables to reside on the CPU.
    with slim.arg_scope([slim.variables.variable], device='/cpu:0'):
        ...

I thought I'd try this in my code as well, so I modified main(_) as:

def main(_):
    # flag validity checks
    with slim.arg_scope([slim.variable], device='/cpu:0'): # Run evaluation on the CPUs
        # all my code that was previously in main()

Sadly, when the script restores the model from a checkpoint, it still loads it onto the GPU: device placement directives control where ops run, but by default TensorFlow still initializes every visible GPU and maps its memory as soon as the session is created.
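(Note: given the CUDA_VISIBLE_DEVICES mention in the docs quoted above, an alternative that should also work is hiding the GPUs from the process before TensorFlow initializes CUDA; a sketch:)

    import os
    # Hide all GPUs from this process; this must run before TensorFlow
    # initializes CUDA, i.e. before the first `import tensorflow`.
    os.environ['CUDA_VISIBLE_DEVICES'] = ''

    import tensorflow as tf  # this process now sees no GPUs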

After some more research, I solved the problem by passing evaluation_loop() a session_config with device_count={'GPU': 0}:

  config = tf.ConfigProto(device_count={'GPU':0}) # mask GPUs visible to the session so it falls back on CPU
  slim.evaluation.evaluation_loop(
      master=FLAGS.master,
      checkpoint_dir=FLAGS.checkpoint_dir,
      logdir=FLAGS.eval_dir,
      num_evals=num_batches,
      eval_op=list(names_to_updates.values()),
      variables_to_restore=variables_to_restore,
      summary_op=tf.summary.merge_all(),
      eval_interval_secs=FLAGS.eval_interval,
      session_config=config) # <---- the actual fix
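This works because device_count={'GPU': 0} makes the session create no GPU devices at all, so every op in the evaluation (including the checkpoint restore) falls back on the CPU, while the training process keeps the GPU entirely to itself.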
