TensorFlow Mixed Precision Training

Mixed precision means using both 16-bit and 32-bit floating-point types in a model during training, so that it runs faster and uses less memory. By keeping certain parts of the model in 32-bit for numerical stability, the model gets shorter step times while training to the same quality on evaluation metrics such as accuracy. This can significantly improve performance on modern GPUs and TPUs.

TensorFlow ships with built-in mixed precision training, but it requires version 2.1 or later.

Mixed precision training also has hardware requirements. Although mixed precision will run on most hardware, it only speeds up models on recent NVIDIA GPUs and Cloud TPUs. NVIDIA GPUs support mixing float16 and float32, while TPUs support mixing bfloat16 and float32.

Among NVIDIA GPUs, those with compute capability 7.0 or higher benefit most from mixed precision, because they have special hardware units called Tensor Cores that accelerate float16 matrix multiplications and convolutions. Older GPUs get no math-performance benefit from mixed precision, but the memory and bandwidth savings can still provide some speedup. You can look up your GPU's compute capability on NVIDIA's CUDA GPU web page. GPUs that benefit most from mixed precision include RTX GPUs, the Titan V, and the V100.

For specific compute capabilities, see https://blog.csdn.net/iefenghao/article/details/97956440. (My budget GPU does not qualify for the speedup, sadly.)
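If you want to check the compute capability programmatically, the sketch below queries it through TensorFlow. Note that tf.config.experimental.get_device_details is only available in TF 2.4 and later; on older versions, use nvidia-smi or NVIDIA's website instead.

import tensorflow as tf

# Print the compute capability of each visible GPU (requires TF 2.4+).
for gpu in tf.config.list_physical_devices('GPU'):
    details = tf.config.experimental.get_device_details(gpu)
    cc = details.get('compute_capability')  # e.g. (7, 0) for a V100
    print(gpu.name, 'compute capability:', cc)
    if cc is not None and cc[0] >= 7:
        print('Tensor Cores available: mixed precision should give a large speedup.')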

# Only supported on TensorFlow 2.1 and above
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.mixed_precision import experimental as mixed_precision

# Mixed precision setup: use 'mixed_float16' on GPUs, 'mixed_bfloat16' on TPUs
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)
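With the mixed_float16 policy active, layers compute in float16 while their variables stay in float32 for numerical stability; this can be verified directly on the policy object:

# Optional sanity check of the policy's dtypes.
print('Compute dtype: %s' % policy.compute_dtype)    # float16
print('Variable dtype: %s' % policy.variable_dtype)  # float32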

model = tf.keras.applications.MobileNetV2(include_top=False, weights=None)
inputs = tf.keras.layers.Input(shape=(224, 224, 3))
print(inputs.dtype.name)
x = model(inputs)  # x is the feature map output by MobileNetV2 with its top layers removed
x = tf.keras.layers.GlobalAveragePooling2D()(x)

# Under mixed precision, Dense layers compute in float16, but the model's output needs to be float32,
# otherwise numerical-stability problems may occur. The Dense layer and the activation are therefore
# written separately, with the final activation forced to output float32.
outputs = layers.Dense(2, name='Logits')(x)
print(outputs.dtype.name)
outputs = layers.Activation('softmax', dtype='float32', name='predictions')(outputs)
print(outputs.dtype.name)

print('Outputs dtype: %s' % outputs.dtype.name)
model = tf.keras.models.Model(inputs=inputs, outputs=outputs)
model.summary()

During backpropagation, float16 gradients can underflow, because float16 has a much smaller representable range than float32. Loss scaling is therefore needed: multiply the loss by some large number such as 1024, then divide the gradients by 1024 afterwards to prevent underflow. In TensorFlow, training with tf.keras.Model.fit automatically performs dynamic loss scaling to prevent underflow.
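For example, with the policy set as above, an ordinary compile/fit call is enough; Keras wraps the optimizer with loss scaling behind the scenes. In this sketch, x_train and y_train are placeholder arrays, not part of the original code.

# Ordinary Keras training; loss scaling is applied automatically under mixed_float16.
# `x_train` and `y_train` are placeholders for illustration.
model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=128, epochs=2)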

A custom training loop, however, requires the following steps to prevent underflow.

Loss scaling

Loss scaling is a technique which tf.keras.Model.fit automatically performs with the mixed_float16 policy to avoid numeric underflow. This section describes loss scaling and how to customize its behavior.

Underflow and Overflow

The float16 data type has a narrow dynamic range compared to float32. This means values above 65504 will overflow to infinity and values below 6.0e-8 will underflow to zero. float32 and bfloat16 have a much higher dynamic range so that overflow and underflow are not a problem.

Loss scaling background

The basic concept of loss scaling is simple: simply multiply the loss by some large number, say 1024. We call this number the loss scale. This will cause the gradients to scale by 1024 as well, greatly reducing the chance of underflow. Once the final gradients are computed, divide them by 1024 to bring them back to their correct values.

The pseudocode for this process is:

loss_scale = 1024
loss = model(inputs)
loss *= loss_scale
# We assume `grads` are float32. We do not want to divide float16 gradients
grads = compute_gradient(loss, model.trainable_variables)
grads /= loss_scale
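In practice, a custom training loop does this by wrapping the optimizer in a LossScaleOptimizer and calling get_scaled_loss / get_unscaled_gradients. Below is a minimal sketch reusing the mixed_precision alias imported above; loss_object and the (x, y) batch are placeholders for illustration.

# Custom training step with dynamic loss scaling (TF 2.1-2.3 experimental API).
optimizer = tf.keras.optimizers.RMSprop()
optimizer = mixed_precision.LossScaleOptimizer(optimizer, loss_scale='dynamic')
loss_object = tf.keras.losses.SparseCategoricalCrossentropy()

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        predictions = model(x, training=True)
        loss = loss_object(y, predictions)
        scaled_loss = optimizer.get_scaled_loss(loss)        # multiply by the loss scale
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    grads = optimizer.get_unscaled_gradients(scaled_grads)   # divide by the loss scale
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss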

For details, see: https://tensorflow.google.cn/guide/keras/mixed_precision?hl=en

 

In model design, compute efficiency is highest when layer dimensions (units, filters, batch size, and so on) are multiples of 8, for example:

  • tf.keras.layers.Dense(units=64)
  • tf.keras.layers.Conv2D(filters=48, kernel_size=7, strides=3)
    • And similarly for other convolutional layers, such as tf.keras.layers.Conv3D
  • tf.keras.layers.LSTM(units=64)
    • And similar for other RNNs, such as tf.keras.layers.GRU
  • tf.keras.Model.fit(epochs=2, batch_size=128)

Summary

  • You should use mixed precision if you use TPUs or NVIDIA GPUs with at least compute capability 7.0, as it will improve performance by up to 3x.
  • You can use mixed precision with the following lines:
    # On TPUs, use 'mixed_bfloat16' instead
    policy = tf.keras.mixed_precision.experimental.Policy('mixed_float16')
    mixed_precision.set_policy(policy)
    
  • If your model ends in softmax, make sure it is float32. And regardless of what your model ends in, make sure the output is float32.

  • If you use a custom training loop with mixed_float16, in addition to the above lines, you need to wrap your optimizer with a tf.keras.mixed_precision.experimental.LossScaleOptimizer. Then call optimizer.get_scaled_loss to scale the loss, and optimizer.get_unscaled_gradients to unscale the gradients.
  • Double the training batch size if it does not reduce evaluation accuracy
  • On GPUs, ensure most tensor dimensions are a multiple of 8 to maximize performance

 

References:

Official documentation: https://tensorflow.google.cn/guide/keras/mixed_precision?hl=en

NVIDIA compute capability: https://blog.csdn.net/iefenghao/article/details/97956440
