Building a Dataset with ImageDataGenerator

ImageDataGenerator is Keras's image-preprocessing utility; in TensorFlow 2.0 the Keras API is integrated directly into TensorFlow.

This article uses ImageDataGenerator to walk through a basic machine-learning workflow:

  1. Examine and understand the data
  2. Build an input pipeline
  3. Build the model
  4. Train the model
  5. Test the model
  6. Improve the model and repeat the process

1. Examine and understand the data

  • Import the necessary packages

from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Flatten, Dropout, MaxPooling2D
from tensorflow.keras.preprocessing.image import ImageDataGenerator

import os
import numpy as np
import matplotlib.pyplot as plt
  • Download the image data

This article uses the cats-vs-dogs classification dataset as its example.

_URL = 'https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip'

path_to_zip = tf.keras.utils.get_file('cats_and_dogs.zip', origin=_URL, extract=True)

PATH = os.path.join(os.path.dirname(path_to_zip), 'cats_and_dogs_filtered')

You can print the PATH variable to see where the images were saved: print(PATH)

The directory structure is as follows:

cats_and_dogs_filtered
|__ train
    |______ cats: [cat.0.jpg, cat.1.jpg, cat.2.jpg ....]
    |______ dogs: [dog.0.jpg, dog.1.jpg, dog.2.jpg ...]
|__ validation
    |______ cats: [cat.2000.jpg, cat.2001.jpg, cat.2002.jpg ....]
    |______ dogs: [dog.2000.jpg, dog.2001.jpg, dog.2002.jpg ...]
  • Split into training/validation datasets

Since this dataset already ships with separate training and validation folders, the datasets can be generated directly from those folders.

The following sections use the .flow_from_directory(directory) method to generate the datasets, so first build the path names for the training/validation directories:

train_dir = os.path.join(PATH, 'train')
validation_dir = os.path.join(PATH, 'validation')

train_cats_dir = os.path.join(train_dir, 'cats')  # directory with our training cat pictures
train_dogs_dir = os.path.join(train_dir, 'dogs')  # directory with our training dog pictures
validation_cats_dir = os.path.join(validation_dir, 'cats')  # directory with our validation cat pictures
validation_dogs_dir = os.path.join(validation_dir, 'dogs')  # directory with our validation dog pictures

Check the sizes of the training/validation datasets:

num_cats_tr = len(os.listdir(train_cats_dir))
num_dogs_tr = len(os.listdir(train_dogs_dir))

num_cats_val = len(os.listdir(validation_cats_dir))
num_dogs_val = len(os.listdir(validation_dogs_dir))

total_train = num_cats_tr + num_dogs_tr
total_val = num_cats_val + num_dogs_val

print('total training cat images:', num_cats_tr)
print('total training dog images:', num_dogs_tr)

print('total validation cat images:', num_cats_val)
print('total validation dog images:', num_dogs_val)
print("--")
print("Total training images:", total_train)
print("Total validation images:", total_val)

2. Build an input pipeline

Define a few parameters for convenience later:

batch_size = 128
epochs = 15
IMG_HEIGHT = 150
IMG_WIDTH = 150
  • Construct the ImageDataGenerator

The ImageDataGenerator class exposes many image-preprocessing parameters; for example, rescale normalizes pixel values. It can also apply data-augmentation operations, such as horizontal flips and random rotations, to help prevent overfitting. The full set of ImageDataGenerator constructor arguments is listed below (a short usage sketch follows the list):

Class ImageDataGenerator

Generate batches of tensor image data with real-time data augmentation.

Arguments:

  • featurewise_center: Boolean. Set input mean to 0 over the dataset, feature-wise.
  • samplewise_center: Boolean. Set each sample mean to 0.
  • featurewise_std_normalization: Boolean. Divide inputs by std of the dataset, feature-wise.
  • samplewise_std_normalization: Boolean. Divide each input by its std.
  • zca_epsilon: epsilon for ZCA whitening. Default is 1e-6.
  • zca_whitening: Boolean. Apply ZCA whitening.
  • rotation_range: Int. Degree range for random rotations.
  • width_shift_range: Float, 1-D array-like or int
    • float: fraction of total width, if < 1, or pixels if >= 1.
    • 1-D array-like: random elements from the array.
    • int: integer number of pixels from interval (-width_shift_range, +width_shift_range)
    • With width_shift_range=2 possible values are integers [-1, 0, +1], same as with width_shift_range=[-1, 0, +1], while with width_shift_range=1.0 possible values are floats in the interval [-1.0, +1.0).
  • height_shift_range: Float, 1-D array-like or int
    • float: fraction of total height, if < 1, or pixels if >= 1.
    • 1-D array-like: random elements from the array.
    • int: integer number of pixels from interval (-height_shift_range, +height_shift_range)
    • With height_shift_range=2 possible values are integers [-1, 0, +1], same as with height_shift_range=[-1, 0, +1], while with height_shift_range=1.0 possible values are floats in the interval [-1.0, +1.0).
  • brightness_range: Tuple or list of two floats. Range for picking a brightness shift value from.
  • shear_range: Float. Shear Intensity (Shear angle in counter-clockwise direction in degrees)
  • zoom_range: Float or [lower, upper]. Range for random zoom. If a float, [lower, upper] = [1-zoom_range, 1+zoom_range].
  • channel_shift_range: Float. Range for random channel shifts.
  • fill_mode: One of {"constant", "nearest", "reflect" or "wrap"}. Default is 'nearest'. Points outside the boundaries of the input are filled according to the given mode:
    • 'constant': kkkkkkkk|abcd|kkkkkkkk (cval=k)
    • 'nearest': aaaaaaaa|abcd|dddddddd
    • 'reflect': abcddcba|abcd|dcbaabcd
    • 'wrap': abcdabcd|abcd|abcdabcd
  • cval: Float or Int. Value used for points outside the boundaries when fill_mode = "constant".
  • horizontal_flip: Boolean. Randomly flip inputs horizontally.
  • vertical_flip: Boolean. Randomly flip inputs vertically.
  • rescale: rescaling factor. Defaults to None. If None or 0, no rescaling is applied, otherwise we multiply the data by the value provided (after applying all other transformations).
  • preprocessing_function: function that will be applied on each input. The function will run after the image is resized and augmented. The function should take one argument: one image (Numpy tensor with rank 3), and should output a Numpy tensor with the same shape.
  • data_format: Image data format, either "channels_first" or "channels_last". "channels_last" mode means that the images should have shape (samples, height, width, channels), "channels_first" mode means that the images should have shape (samples, channels, height, width). It defaults to the image_data_format value found in your Keras config file at ~/.keras/keras.json. If you never set it, then it will be "channels_last".
  • validation_split: Float. Fraction of images reserved for validation (strictly between 0 and 1).
  • dtype: Dtype to use for the generated arrays.
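As a minimal sketch (the parameter values are illustrative, not tuned recommendations), a generator combining several of these arguments could look like the following; note that featurewise_center and featurewise_std_normalization require dataset statistics to be computed first via datagen.fit():

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative values only: rotate up to +/-20 degrees, shift up to 10%
# of the width, fill exposed border pixels with the nearest edge value,
# and flip images horizontally at random.
datagen = ImageDataGenerator(rescale=1./255,
                             rotation_range=20,
                             width_shift_range=0.1,
                             fill_mode='nearest',
                             horizontal_flip=True)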

 

  • Build the training dataset:

train_image_generator = ImageDataGenerator(rescale=1./255) # Generator for our training data
train_data_gen = train_image_generator.flow_from_directory(batch_size=batch_size,
                                                           directory=train_dir,
                                                           shuffle=True,
                                                           target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                           class_mode='binary')
  • Build the validation dataset:

validation_image_generator = ImageDataGenerator(rescale=1./255)
val_data_gen = validation_image_generator.flow_from_directory(batch_size=batch_size,
                                                              directory=validation_dir,
                                                              target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                              class_mode='binary')
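
flow_from_directory infers the class labels from the sub-folder names; the resulting mapping can be inspected through the generator's class_indices attribute (worth verifying against your own directory layout):

print(train_data_gen.class_indices)  # expected: {'cats': 0, 'dogs': 1}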

Visualize a few images to check the preprocessing:

# next() returns one batch from the generator as a tuple (x_train, y_train),
# where x_train holds the image batch and y_train the corresponding labels.
# The labels are discarded here because we only want to plot the images.
sample_training_images, _ = next(train_data_gen)

def plot_images(images_arr):
    fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(20, 20))
    axes = axes.flatten()

    for img, ax in zip(images_arr, axes):
        ax.imshow(img)
        ax.axis('off')

    plt.tight_layout()
    plt.show()

plot_images(sample_training_images[:5])

3. Build the model

model = Sequential([
    Conv2D(filters=16, kernel_size=3, padding='same', activation='relu', input_shape=(IMG_HEIGHT, IMG_WIDTH, 3)),
    MaxPooling2D(),
    Conv2D(filters=32, kernel_size=3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(filters=64, kernel_size=3, padding='same', activation='relu'),
    MaxPooling2D(),  
    Flatten(),
    Dense(units=512, activation='relu'),
    Dense(units=1, activation='sigmoid')
])
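
In short, the model stacks three Conv2D + MaxPooling2D blocks (with 16, 32, and 64 filters), flattens the feature maps, and passes them through a 512-unit dense layer to a single sigmoid output suitable for binary cat/dog classification.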
  • Compile the model

    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    
    model.summary()

4. Train the model

During training, save the model's weights every 5 epochs.

checkpoint_path = 'training/cp-{epoch:04d}.ckpt'

""" Set up model checkpointing """
# Create a callback that saves the model's weights every 5 epochs.
# Note that an integer save_freq counts batches, not epochs, so the
# interval is converted to the equivalent number of batches here.
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 verbose=1,
                                                 save_weights_only=True,
                                                 save_freq=5 * (total_train // batch_size))
model.save_weights(checkpoint_path.format(epoch=0))

""" Train the model """
history = model.fit_generator(
    generator=train_data_gen,
    steps_per_epoch=total_train//batch_size,
    epochs=epochs,
    callbacks=[cp_callback],
    validation_data=val_data_gen,
    validation_steps=total_val//batch_size
)
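
Note that fit_generator is deprecated from TensorFlow 2.1 onward; Model.fit accepts Python generators directly, so the equivalent call would be:

history = model.fit(train_data_gen,
                    steps_per_epoch=total_train // batch_size,
                    epochs=epochs,
                    callbacks=[cp_callback],
                    validation_data=val_data_gen,
                    validation_steps=total_val // batch_size)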

After training completes, the results can be visualized:

""" Visualize training results """
print(history.history)
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))

plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

As the plots show, there is a large gap between training and validation accuracy; the model reaches only about 70% accuracy on the validation set.

5. Test the model

Load the checkpoints saved during training and evaluate the model on the validation data:

checkpoint_dir = os.path.dirname(checkpoint_path)
latest = tf.train.latest_checkpoint(checkpoint_dir)
model.load_weights(latest)
loss, acc = model.evaluate_generator(generator=val_data_gen, verbose=2)
print("Restored model, accuracy: {:5.2f}%".format(100*acc))

Note that if a validation set is passed to model.fit_generator(), validation is run automatically at the end of every epoch during training.

6. Improve the model and repeat the process

Notice that the model reaches about 90% accuracy on the training data but only about 70% on the validation data: the model is overfitting.

Overfitting arises easily when there are few training samples but the model has many trainable parameters.

There are many ways to counter overfitting: enlarge the training set, add Dropout layers to the model, add regularization terms (a brief sketch of this option follows), and so on.
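
As a sketch of the regularization option (the penalty strength 1e-4 is an illustrative choice, not tuned for this dataset), an L2 penalty can be attached to a layer's weights via kernel_regularizer:

from tensorflow.keras import regularizers

# Hypothetical variant of the model's dense layer with an L2 weight penalty.
Dense(units=512, activation='relu',
      kernel_regularizer=regularizers.l2(1e-4))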

  • Here, rotation_range, width_shift_range, height_shift_range, horizontal_flip, and zoom_range are used for data augmentation:
train_image_generator = ImageDataGenerator(rescale=1./255,
                                           rotation_range=45,
                                           width_shift_range=.15,
                                           height_shift_range=.15,
                                           horizontal_flip=True,
                                           zoom_range=0.5)
train_data_gen = train_image_generator.flow_from_directory(batch_size=batch_size,
                                                           directory=train_dir,
                                                           shuffle=True,
                                                           target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                           class_mode='binary')
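
To check what the augmentation actually produces, one can draw the first image of the first batch several times; each indexing call applies a fresh random transform (this reuses the plot_images helper defined earlier):

# Each train_data_gen[0] access re-runs the random augmentation pipeline.
augmented_images = [train_data_gen[0][0][0] for _ in range(5)]
plot_images(augmented_images)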
  • Also add Dropout layers to the model:
model = Sequential([
    Conv2D(filters=16, kernel_size=3, padding='same', activation='relu', input_shape=(IMG_HEIGHT, IMG_WIDTH, 3)),
    MaxPooling2D(),
    Dropout(rate=0.2),
    Conv2D(filters=32, kernel_size=3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(filters=64, kernel_size=3, padding='same', activation='relu'),
    MaxPooling2D(),
    Dropout(rate=0.2),  
    Flatten(),
    Dense(units=512, activation='relu'),
    Dense(units=1, activation='sigmoid')
])

Training the model again shows that the overfitting has been suppressed.


The complete code is as follows:

from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Flatten, Dropout, MaxPooling2D
from tensorflow.keras.preprocessing.image import ImageDataGenerator

import os
import numpy as np
import matplotlib.pyplot as plt

have_data = False
training_mode = True

batch_size = 128
epochs = 15
IMG_HEIGHT = 150
IMG_WIDTH = 150

checkpoint_path = 'training_adv/cp-{epoch:04d}.ckpt'

""" Load date """
if have_data:
    PATH = '/home/<user-id>/.keras/datasets/cats_and_dogs_filtered'
else:
    _URL = 'https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip'
    path_to_zip = tf.keras.utils.get_file(fname='cats_and_dogs.zip', origin=_URL, extract=True)
    PATH = os.path.join(os.path.dirname(path_to_zip), 'cats_and_dogs_filtered')

train_dir = os.path.join(PATH, 'train')
validation_dir = os.path.join(PATH, 'validation')

train_cats_dir = os.path.join(train_dir, 'cats')  # directory with our training cat pictures
train_dogs_dir = os.path.join(train_dir, 'dogs')  # directory with our training dog pictures
validation_cats_dir = os.path.join(validation_dir, 'cats')  # directory with our validation cat pictures
validation_dogs_dir = os.path.join(validation_dir, 'dogs')  # directory with our validation dog pictures

""" Understand the data counts """
num_cats_tr = len(os.listdir(train_cats_dir))
num_dogs_tr = len(os.listdir(train_dogs_dir))
num_cats_val = len(os.listdir(validation_cats_dir))
num_dogs_val = len(os.listdir(validation_dogs_dir))

total_train = num_cats_tr + num_dogs_tr
total_val = num_cats_val + num_dogs_val

print('total training cat images:', num_cats_tr)
print('total training dog images:', num_dogs_tr)

print('total validation cat images:', num_cats_val)
print('total validation dog images:', num_dogs_val)
print("--")
print("Total training images:", total_train)
print("Total validation images:", total_val)

""" Data preparation """
train_image_generator = ImageDataGenerator(rescale=1./255,
                                           rotation_range=45,
                                           width_shift_range=.15,
                                           height_shift_range=.15,
                                           horizontal_flip=True,
                                           zoom_range=0.5)

validation_image_generator = ImageDataGenerator(rescale=1./255)

train_data_gen = train_image_generator.flow_from_directory(batch_size=batch_size,
                                                           directory=train_dir,
                                                           shuffle=True,
                                                           target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                           class_mode='binary')
val_data_gen = validation_image_generator.flow_from_directory(batch_size=batch_size,
                                                              directory=validation_dir,
                                                              target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                              class_mode='binary')

""" Visualize training images """
# next() returns one batch from the generator as a tuple (x_train, y_train),
# where x_train holds the image batch and y_train the corresponding labels.
# The labels are discarded here because we only want to plot the images.
sample_training_images, _ = next(train_data_gen)

def plot_images(images_arr):
    fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(20, 20))
    axes = axes.flatten()

    for img, ax in zip(images_arr, axes):
        ax.imshow(img)
        ax.axis('off')

    plt.tight_layout()
    plt.show()

plot_images(sample_training_images[:5])

""" Create the model """
model = Sequential([
    Conv2D(filters=16, kernel_size=3, padding='same', activation='relu', input_shape=(IMG_HEIGHT, IMG_WIDTH, 3)),
    MaxPooling2D(),
    Dropout(rate=0.2),
    Conv2D(filters=32, kernel_size=3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(filters=64, kernel_size=3, padding='same', activation='relu'),
    MaxPooling2D(),
    Dropout(rate=0.2),  
    Flatten(),
    Dense(units=512, activation='relu'),
    Dense(units=1, activation='sigmoid')
])

""" Compile the model """
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.summary()

if training_mode:
    """ Set up model checkpointing """
    # Create a callback that saves the model's weights every 5 epochs.
    # Note that an integer save_freq counts batches, not epochs, so the
    # interval is converted to the equivalent number of batches here.
    cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                     verbose=1,
                                                     save_weights_only=True,
                                                     save_freq=5 * (total_train // batch_size))
    model.save_weights(checkpoint_path.format(epoch=0))

    """ Train the model """
    history = model.fit_generator(
        generator=train_data_gen,
        steps_per_epoch=total_train//batch_size,
        epochs=epochs,
        callbacks=[cp_callback],
        validation_data=val_data_gen,
        validation_steps=total_val//batch_size
    )

    """ Visualize training results """
    print(history.history)
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']
    loss = history.history['loss']
    val_loss = history.history['val_loss']

    epochs_range = range(epochs)

    plt.figure(figsize=(8, 8))

    plt.subplot(1, 2, 1)
    plt.plot(epochs_range, acc, label='Training Accuracy')
    plt.plot(epochs_range, val_acc, label='Validation Accuracy')
    plt.legend(loc='lower right')
    plt.title('Training and Validation Accuracy')

    plt.subplot(1, 2, 2)
    plt.plot(epochs_range, loss, label='Training Loss')
    plt.plot(epochs_range, val_loss, label='Validation Loss')
    plt.legend(loc='upper right')
    plt.title('Training and Validation Loss')
    plt.show()

else:
    checkpoint_dir = os.path.dirname(checkpoint_path)
    latest = tf.train.latest_checkpoint(checkpoint_dir)
    model.load_weights(latest)
    loss, acc = model.evaluate_generator(generator=val_data_gen, verbose=2)
    print("Restored model, accuracy: {:5.2f}%".format(100*acc))

 
