TensorFlow Mixed Precision Training

Mixed precision means using both 16-bit and 32-bit floating-point types in a model during training, so that it runs faster and uses less memory. By keeping certain parts of the model in 32-bit for numerical stability, the model gets shorter step times while still training to the same evaluation metrics (such as accuracy). On modern GPUs and TPUs this can noticeably improve performance.

TensorFlow ships with built-in support for mixed precision training, but it requires version 2.1 or higher.

Mixed precision training also has hardware requirements. Although mixed precision will run on most hardware, it only speeds up models on recent NVIDIA GPUs and Cloud TPUs. NVIDIA GPUs support mixing float16 and float32, while TPUs support mixing bfloat16 and float32.

Among NVIDIA GPUs, those with compute capability 7.0 or higher benefit most from mixed precision, because they have special hardware units called Tensor Cores that accelerate float16 matrix multiplications and convolutions. Older GPUs do not get the math-performance benefit, but the memory and bandwidth savings can still provide some speedup. You can look up your GPU's compute capability on NVIDIA's CUDA GPU web page. The GPUs that benefit most from mixed precision include RTX GPUs, the Titan V, and the V100.

A list of compute capabilities is also available at https://blog.csdn.net/iefenghao/article/details/97956440. (A budget GPU does not qualify for the speedup, sadly.)
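
If you are on TensorFlow 2.3 or newer, you can also query the compute capability directly from Python via tf.config.experimental.get_device_details (on 2.1/2.2, use nvidia-smi or the page above instead):

import tensorflow as tf

for gpu in tf.config.list_physical_devices('GPU'):
    details = tf.config.experimental.get_device_details(gpu)
    # 'compute_capability' is reported as a (major, minor) tuple when available
    print(gpu.name, details.get('compute_capability'))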

# Mixed precision is only supported in TensorFlow 2.1 and above
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.mixed_precision import experimental as mixed_precision

# Mixed precision policy: use 'mixed_float16' on GPUs and 'mixed_bfloat16' on TPUs
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)

model = tf.keras.applications.MobileNetV2(include_top=False, weights=None)
inputs = tf.keras.layers.Input(shape=(224, 224, 3))
print(inputs.dtype.name)
x = model(inputs)  # x is the feature map output by MobileNetV2 with its top layers removed
x = tf.keras.layers.GlobalAveragePooling2D()(x)

# Under mixed precision the Dense layer computes in float16, but the model's output needs to be float32,
# otherwise numerical instability may occur.
# Therefore the Dense layer and the activation are written separately, and the final output is set to float32.
outputs = layers.Dense(2, name='Logits')(x)
print(outputs.dtype.name)
outputs = layers.Activation('softmax', dtype='float32', name='predictions')(outputs)
print(outputs.dtype.name)

print('Outputs dtype: %s' % outputs.dtype.name)
model = tf.keras.models.Model(inputs=inputs, outputs=outputs)
model.summary()

When computing the loss, backpropagation can cause the float16 gradients to underflow, because the range of values representable in float16 is much narrower than in float32. The loss therefore needs loss scaling: multiply it by a large factor such as 1024 before backpropagation, then divide the gradients by 1024 afterwards to prevent underflow. In TensorFlow, training with tf.keras.Model.fit automatically applies dynamic loss scaling to prevent underflow.
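
For example, here is a minimal sketch of compiling and fitting the model built above on random dummy data (the optimizer, loss, and data shapes are illustrative assumptions); no extra code is needed for loss scaling, since fit handles it:

import numpy as np

# Dummy data purely to demonstrate the call; shapes match the (224, 224, 3)
# input and the 2-class softmax output defined above.
x_train = np.random.rand(32, 224, 224, 3).astype('float32')
y_train = np.random.randint(0, 2, size=(32,))

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Under the mixed_float16 policy, Model.fit wraps the optimizer in a
# LossScaleOptimizer internally and applies dynamic loss scaling automatically.
model.fit(x_train, y_train, batch_size=8, epochs=1)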

A custom training loop, however, needs the steps below to prevent underflow (see the sketch after the pseudocode).

Loss scaling

Loss scaling is a technique which tf.keras.Model.fit automatically performs with the mixed_float16 policy to avoid numeric underflow. This section describes loss scaling and how to customize its behavior.

Underflow and Overflow

The float16 data type has a narrow dynamic range compared to float32. This means values above 65504 will overflow to infinity and values below 6.0×10⁻⁸ will underflow to zero. float32 and bfloat16 have a much higher dynamic range so that overflow and underflow are not a problem.
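
A quick way to see this narrow range, using NumPy half-precision values (the limits are those quoted above):

import numpy as np

print(np.finfo(np.float16).max)           # 65504.0, the largest representable float16
print(np.float16(65504) * np.float16(2))  # inf -> overflow
print(np.float16(1e-8))                   # 0.0 -> underflow, below the smallest float16 value
print(np.float32(1e-8))                   # 1e-08, still representable in float32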

Loss scaling background

The basic concept of loss scaling is simple: simply multiply the loss by some large number, say 1024. We call this number the loss scale. This will cause the gradients to scale by 1024 as well, greatly reducing the chance of underflow. Once the final gradients are computed, divide them by 1024 to bring them back to their correct values.

The pseudocode for this process is:

loss_scale = 1024
loss = model(inputs)
loss *= loss_scale
# We assume `grads` are float32. We do not want to divide float16 gradients
grads = compute_gradient(loss, model.trainable_variables)
grads /= loss_scale
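
In a custom training loop, TensorFlow exposes this mechanism through a LossScaleOptimizer. The sketch below assumes the mixed_float16 policy and the model defined earlier; the loss function and optimizer are illustrative choices:

loss_object = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.RMSprop()
# Wrap the optimizer so it can scale the loss and unscale the gradients.
optimizer = mixed_precision.LossScaleOptimizer(optimizer, loss_scale='dynamic')

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        predictions = model(x, training=True)
        loss = loss_object(y, predictions)
        # Multiply the loss by the loss scale before computing gradients...
        scaled_loss = optimizer.get_scaled_loss(loss)
    scaled_gradients = tape.gradient(scaled_loss, model.trainable_variables)
    # ...then divide the gradients by the loss scale to restore their true values.
    gradients = optimizer.get_unscaled_gradients(scaled_gradients)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss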

For details, see https://tensorflow.google.cn/guide/keras/mixed_precision?hl=en.

 

When designing the model, computation is most efficient when layer dimensions (units, filters, batch size) are multiples of 8, for example:

  • tf.keras.layers.Dense(units=64)
  • tf.keras.layers.Conv2D(filters=48, kernel_size=7, strides=3)
    • And similarly for other convolutional layers, such as tf.keras.layers.Conv3D
  • tf.keras.layers.LSTM(units=64)
    • And similarly for other RNNs, such as tf.keras.layers.GRU
  • tf.keras.Model.fit(epochs=2, batch_size=128)

Summary

  • You should use mixed precision if you use TPUs or NVIDIA GPUs with at least compute capability 7.0, as it will improve performance by up to 3x.
  • You can use mixed precision with the following lines:
    # On TPUs, use 'mixed_bfloat16' instead
    policy = tf.keras.mixed_precision.experimental.Policy('mixed_float16')
    tf.keras.mixed_precision.experimental.set_policy(policy)
    
  • If your model ends in softmax, make sure it is float32. And regardless of what your model ends in, make sure the output is float32.

  • If you use a custom training loop with mixed_float16, in addition to the above lines, you need to wrap your optimizer with a tf.keras.mixed_precision.experimental.LossScaleOptimizer. Then call optimizer.get_scaled_loss to scale the loss, and optimizer.get_unscaled_gradients to unscale the gradients.
  • Double the training batch size if it does not reduce evaluation accuracy
  • On GPUs, ensure most tensor dimensions are a multiple of 8 to maximize performance

 

References:

Official documentation: https://tensorflow.google.cn/guide/keras/mixed_precision?hl=en

NVIDIA compute capability: https://blog.csdn.net/iefenghao/article/details/97956440
