Running tf.slim evaluation loop on CPU

I'm trying to fine-tune a network using train_image_classifier.py from the TensorFlow-Slim image classification library, and I'd like to run an evaluation script in parallel. I made the script by modifying eval_image_classifier.py so that it uses evaluation_loop() instead of evaluate_once().

It all seems to work fine until the moment I run both the training and the evaluation processes at the same time. As soon as the evaluation process tries to restore a checkpoint, the GPU memory is depleted and the evaluation process crashes (while the training process seems to freeze).

Looking through the documentation, I found on this page about using GPUs that:

By default, TensorFlow maps nearly all of the GPU memory of all GPUs (subject to CUDA_VISIBLE_DEVICES) visible to the process. This is done to more efficiently use the relatively precious GPU memory resources on the devices by reducing memory fragmentation.

I'm aware of this, and of the fact that I could turn it off to reduce the memory footprint, but since the evaluation doesn't need to be as fast as the training (and given the large number of CPU cores and the amount of RAM available on my machine), I would like to run the evaluation on the CPU, leaving the GPU entirely to the training.
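(For reference, a minimal sketch of what "turning it off" would look like: letting the session grab GPU memory on demand via allow_growth, which reduces, but does not eliminate, the evaluator's GPU footprint:)

    import tensorflow as tf

    # Allocate GPU memory on demand instead of mapping
    # (nearly) all of it up front.
    config = tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))
    # ...then pass `config` as the session_config of the evaluation call.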

My naive approach was to edit the main(_) function of eval_image_classifier.py as follows:

def main(_):
    # flag validity checks
    with tf.device('/cpu:0'): # Run evaluation on the CPUs
        # all my code that was previously in main()

Alas, naive was not good enough: the eval process still hogs the GPU and crashes. How can I force the computation to run exclusively on the CPU?

Update:

Looking for alternative solutions, I found this inception-v3 model, whose docstring examples include:

    # Force all Variables to reside on the CPU.
    with slim.arg_scope([slim.variables.variable], device='/cpu:0'):
        ...

I thought I'd try this in my code as well, so I modified main(_) as:

def main(_):
    # flag validity checks
    with slim.arg_scope([slim.variable], device='/cpu:0'): # Run evaluation on the CPUs
        # all my code that was previously in main()

Sadly, when the script restores the model from a checkpoint, it still loads it onto the GPU: device placement directives control where ops run, but by default TensorFlow still initializes every visible GPU and maps its memory as soon as the session is created.
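(Note: given the CUDA_VISIBLE_DEVICES mention in the docs quoted above, an alternative that should also work is hiding the GPUs from the process before TensorFlow initializes CUDA; a sketch:)

    import os
    # Hide all GPUs from this process; this must run before TensorFlow
    # initializes CUDA, i.e. before the first `import tensorflow`.
    os.environ['CUDA_VISIBLE_DEVICES'] = ''

    import tensorflow as tf  # this process now sees no GPUs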

After some more research, I solved the problem by passing evaluation_loop() a session_config with device_count={'GPU': 0}:

  config = tf.ConfigProto(device_count={'GPU':0}) # mask GPUs visible to the session so it falls back on CPU
  slim.evaluation.evaluation_loop(
      master=FLAGS.master,
      checkpoint_dir=FLAGS.checkpoint_dir,
      logdir=FLAGS.eval_dir,
      num_evals=num_batches,
      eval_op=list(names_to_updates.values()),
      variables_to_restore=variables_to_restore,
      summary_op=tf.summary.merge_all(),
      eval_interval_secs=FLAGS.eval_interval,
      session_config=config) # <---- the actual fix
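This works because device_count={'GPU': 0} makes the session create no GPU devices at all, so every op in the evaluation (including the checkpoint restore) falls back on the CPU, while the training process keeps the GPU entirely to itself.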
