Building a Dataset with ImageDataGenerator

ImageDataGenerator is part of Keras's image preprocessing module; in TensorFlow 2.0 the Keras API is integrated directly into TensorFlow.

This article uses ImageDataGenerator to walk through a basic machine learning workflow:

  1. Examine and understand the data
  2. Build an input pipeline
  3. Build the model
  4. Train the model
  5. Test the model
  6. Improve the model and repeat the process

1. Examine and understand the data

  • Import the necessary packages

from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Flatten, Dropout, MaxPooling2D
from tensorflow.keras.preprocessing.image import ImageDataGenerator

import os
import numpy as np
import matplotlib.pyplot as plt
  • Download the image data

This article uses the cats-vs-dogs classification dataset as an example.

_URL = 'https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip'

path_to_zip = tf.keras.utils.get_file('cats_and_dogs.zip', origin=_URL, extract=True)

PATH = os.path.join(os.path.dirname(path_to_zip), 'cats_and_dogs_filtered')

You can print the PATH variable to check where the images were saved: print(PATH). By default, tf.keras.utils.get_file caches downloads under ~/.keras/datasets.

The directory structure is as follows:

cats_and_dogs_filtered
|__ train
    |______ cats: [cat.0.jpg, cat.1.jpg, cat.2.jpg ....]
    |______ dogs: [dog.0.jpg, dog.1.jpg, dog.2.jpg ...]
|__ validation
    |______ cats: [cat.2000.jpg, cat.2001.jpg, cat.2002.jpg ....]
    |______ dogs: [dog.2000.jpg, dog.2001.jpg, dog.2002.jpg ...]
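
To verify this layout after the download, here is a minimal sketch using only the standard library:

# Walk the dataset root and print each sub-directory with its file count
for root, dirs, files in os.walk(PATH):
    print('{}: {} files'.format(root, len(files)))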
  • Split into training/validation sets

Because this dataset already comes split into training and validation folders, the different datasets can be generated directly from those folders (for datasets that are not pre-split, see the note after the path definitions below).

The following sections use the .flow_from_directory(directory) method to generate the datasets, so first build the training/validation directory paths:

train_dir = os.path.join(PATH, 'train')
validation_dir = os.path.join(PATH, 'validation')

train_cats_dir = os.path.join(train_dir, 'cats')  # directory with our training cat pictures
train_dogs_dir = os.path.join(train_dir, 'dogs')  # directory with our training dog pictures
validation_cats_dir = os.path.join(validation_dir, 'cats')  # directory with our validation cat pictures
validation_dogs_dir = os.path.join(validation_dir, 'dogs')  # directory with our validation dog pictures
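
As an aside: if a dataset does not come pre-split like this one, the validation_split argument of ImageDataGenerator combined with the subset argument of flow_from_directory can perform the split. A minimal sketch, assuming a single directory all_images_dir of class sub-folders (the name is a placeholder) and the IMG_HEIGHT/IMG_WIDTH values defined in the next section:

# Hypothetical: split one folder of class sub-directories 80/20
image_generator = ImageDataGenerator(rescale=1./255, validation_split=0.2)
train_gen = image_generator.flow_from_directory(directory=all_images_dir,
                                                target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                class_mode='binary',
                                                subset='training')
val_gen = image_generator.flow_from_directory(directory=all_images_dir,
                                              target_size=(IMG_HEIGHT, IMG_WIDTH),
                                              class_mode='binary',
                                              subset='validation')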

Check the sizes of the training/validation sets:

num_cats_tr = len(os.listdir(train_cats_dir))
num_dogs_tr = len(os.listdir(train_dogs_dir))

num_cats_val = len(os.listdir(validation_cats_dir))
num_dogs_val = len(os.listdir(validation_dogs_dir))

total_train = num_cats_tr + num_dogs_tr
total_val = num_cats_val + num_dogs_val

print('total training cat images:', num_cats_tr)
print('total training dog images:', num_dogs_tr)

print('total validation cat images:', num_cats_val)
print('total validation dog images:', num_dogs_val)
print("--")
print("Total training images:", total_train)
print("Total validation images:", total_val)

2. Build the input pipeline

Define a few parameters for later use:

batch_size = 128
epochs = 15
IMG_HEIGHT = 150
IMG_WIDTH = 150
  • Construct the ImageDataGenerator

The ImageDataGenerator class exposes many image preprocessing parameters; for example, rescale normalizes pixel values. To help prevent the model from overfitting, it can also apply data augmentation such as horizontal flips and random rotations. The full set of ImageDataGenerator constructor arguments is listed below, followed by a short sketch for previewing an augmentation setting:

Class ImageDataGenerator

Generate batches of tensor image data with real-time data augmentation.

Arguments:

  • featurewise_center: Boolean. Set input mean to 0 over the dataset, feature-wise.
  • samplewise_center: Boolean. Set each sample mean to 0.
  • featurewise_std_normalization: Boolean. Divide inputs by std of the dataset, feature-wise.
  • samplewise_std_normalization: Boolean. Divide each input by its std.
  • zca_epsilon: epsilon for ZCA whitening. Default is 1e-6.
  • zca_whitening: Boolean. Apply ZCA whitening.
  • rotation_range: Int. Degree range for random rotations.
  • width_shift_range: Float, 1-D array-like or int
    • float: fraction of total width, if < 1, or pixels if >= 1.
    • 1-D array-like: random elements from the array.
    • int: integer number of pixels from interval (-width_shift_range, +width_shift_range)
    • With width_shift_range=2 possible values are integers [-1, 0, +1], same as with width_shift_range=[-1, 0, +1], while with width_shift_range=1.0 possible values are floats in the interval [-1.0, +1.0).
  • height_shift_range: Float, 1-D array-like or int
    • float: fraction of total height, if < 1, or pixels if >= 1.
    • 1-D array-like: random elements from the array.
    • int: integer number of pixels from interval (-height_shift_range, +height_shift_range)
    • With height_shift_range=2 possible values are integers [-1, 0, +1], same as with height_shift_range=[-1, 0, +1], while with height_shift_range=1.0 possible values are floats in the interval [-1.0, +1.0).
  • brightness_range: Tuple or list of two floats. Range for picking a brightness shift value from.
  • shear_range: Float. Shear Intensity (Shear angle in counter-clockwise direction in degrees)
  • zoom_range: Float or [lower, upper]. Range for random zoom. If a float, [lower, upper] = [1-zoom_range, 1+zoom_range].
  • channel_shift_range: Float. Range for random channel shifts.
  • fill_mode: One of {"constant", "nearest", "reflect" or "wrap"}. Default is 'nearest'. Points outside the boundaries of the input are filled according to the given mode:
    • 'constant': kkkkkkkk|abcd|kkkkkkkk (cval=k)
    • 'nearest': aaaaaaaa|abcd|dddddddd
    • 'reflect': abcddcba|abcd|dcbaabcd
    • 'wrap': abcdabcd|abcd|abcdabcd
  • cval: Float or Int. Value used for points outside the boundaries when fill_mode = "constant".
  • horizontal_flip: Boolean. Randomly flip inputs horizontally.
  • vertical_flip: Boolean. Randomly flip inputs vertically.
  • rescale: rescaling factor. Defaults to None. If None or 0, no rescaling is applied, otherwise we multiply the data by the value provided (after applying all other transformations).
  • preprocessing_function: function that will be applied on each input. The function will run after the image is resized and augmented. The function should take one argument: one image (Numpy tensor with rank 3), and should output a Numpy tensor with the same shape.
  • data_format: Image data format, either "channels_first" or "channels_last". "channels_last" mode means that the images should have shape (samples, height, width, channels), "channels_first" mode means that the images should have shape (samples, channels, height, width). It defaults to the image_data_format value found in your Keras config file at ~/.keras/keras.json. If you never set it, then it will be "channels_last".
  • validation_split: Float. Fraction of images reserved for validation (strictly between 0 and 1).
  • dtype: Dtype to use for the generated arrays.
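
A quick way to preview what a particular augmentation configuration does is to apply it repeatedly to one image. A minimal sketch, where sample_img stands for any single rank-3 image array (for example, one image taken from a batch later in this section):

# Preview one augmentation configuration on a single image.
# random_transform applies the random geometric transforms to a rank-3 array.
demo_gen = ImageDataGenerator(rotation_range=45, horizontal_flip=True)
augmented = [demo_gen.random_transform(sample_img) for _ in range(5)]

fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(20, 4))
for img, ax in zip(augmented, axes):
    ax.imshow(img)
    ax.axis('off')
plt.show()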


  • Build the training dataset:

train_image_generator = ImageDataGenerator(rescale=1./255) # Generator for our training data
train_data_gen = train_image_generator.flow_from_directory(batch_size=batch_size,
                                                           directory=train_dir,
                                                           shuffle=True,
                                                           target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                           class_mode='binary')
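
With class_mode='binary' each label is a single 0/1 scalar; the mapping from sub-folder name to label can be checked on the generator itself:

# Label indices are assigned from the alphanumeric order of the sub-folder names
print(train_data_gen.class_indices)  # {'cats': 0, 'dogs': 1}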
  • Build the validation dataset:

validation_image_generator = ImageDataGenerator(rescale=1./255)
val_data_gen = validation_image_generator.flow_from_directory(batch_size=batch_size,
                                                              directory=validation_dir,
                                                              target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                              class_mode='binary')

Visualize a few images to check the preprocessing:

# Calling next() on the generator returns one batch as a tuple
# (x_train, y_train), where x_train is the batch of images and
# y_train the corresponding labels.
# Discard the labels here to visualize only the training images.
sample_training_images, _ = next(train_data_gen)

def plot_images(images_arr):
    fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(20, 20))
    axes = axes.flatten()

    for img, ax in zip(images_arr, axes):
        ax.imshow(img)
        ax.axis('off')

    plt.tight_layout()
    plt.show()

plot_images(sample_training_images[:5])

3. Build the model

model = Sequential([
    Conv2D(filters=16, kernel_size=3, padding='same', activation='relu', input_shape=(IMG_HEIGHT, IMG_WIDTH, 3)),
    MaxPooling2D(),
    Conv2D(filters=32, kernel_size=3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(filters=64, kernel_size=3, padding='same', activation='relu'),
    MaxPooling2D(),  
    Flatten(),
    Dense(units=512, activation='relu'),
    Dense(units=1, activation='sigmoid')
])
  • Compile the model

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.summary()

4. Train the model

During training, save the model weights every 5 epochs.

checkpoint_path = 'training/cp-{epoch:04d}.ckpt'

""" Create the model save method """
# Create a callback that saves the model's weights every 5 epochs
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                    verbose=1,
                                                    save_weights_only=True,
                                                    save_freq=5)
model.save_weights(checkpoint_path.format(epoch=0))

""" Train the model """
history = model.fit_generator(
    generator=train_data_gen,
    steps_per_epoch=total_train//batch_size,
    epochs=epochs,
    callbacks=[cp_callback],
    validation_data=val_data_gen,
    validation_steps=total_val//batch_size
)
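
Note that fit_generator comes from the standalone Keras API; in later TensorFlow 2.x releases it is deprecated and model.fit accepts generators directly. An equivalent call, assuming a recent TF version, would be:

# Equivalent training call on newer TF 2.x, where fit() accepts generators
history = model.fit(train_data_gen,
                    steps_per_epoch=total_train // batch_size,
                    epochs=epochs,
                    callbacks=[cp_callback],
                    validation_data=val_data_gen,
                    validation_steps=total_val // batch_size)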

After training completes, visualize the training results:

""" Visualize training results """
print(history.history)
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))

plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

As the plots show, there is a large gap between training and validation accuracy; the model reaches only about 70% accuracy on the validation set.

5. Test the model

Load the checkpoints saved during training and feed in the validation data to test:

checkpoint_dir = os.path.dirname(checkpoint_path)
latest = tf.train.latest_checkpoint(checkpoint_dir)
model.load_weights(latest)
loss, acc = model.evaluate_generator(generator=val_data_gen, verbose=2)
print("Restored model, accuracy: {:5.2f}%".format(100*acc))

In fact, if a validation set is passed to model.fit_generator(), validation runs automatically at the end of every epoch during training.

6. Improve the model and repeat the process

Notice that the model reaches roughly 90% accuracy during training but only about 70% on the validation samples: the model is overfitting.

Overfitting arises easily when the number of training samples is small while the model has many trainable parameters.

There are many ways to reduce overfitting: enlarge the training set, add Dropout layers to the model, or add regularization terms (a sketch of the last option follows).
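
For the regularization option, a minimal sketch of adding an L2 weight penalty to the large dense layer (the factor 1e-4 is purely illustrative):

from tensorflow.keras import regularizers

# L2 penalty on the dense layer's weights; 1e-4 is an illustrative value
Dense(units=512, activation='relu',
      kernel_regularizer=regularizers.l2(1e-4))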

  • Here rotation_range, width_shift_range, height_shift_range, horizontal_flip, and zoom_range are used to implement data augmentation.
train_image_generator = ImageDataGenerator(rescale=1./255,
                                           rotation_range=45,
                                           width_shift_range=.15,
                                           height_shift_range=.15,
                                           horizontal_flip=True,
                                           zoom_range=0.5)
train_data_gen = train_image_generator.flow_from_directory(batch_size=batch_size,
                                                           directory=train_dir,
                                                           shuffle=True,
                                                           target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                           class_mode='binary')
  • Also add Dropout layers to the model:
model = Sequential([
    Conv2D(filters=16, kernel_size=3, padding='same', activation='relu', input_shape=(IMG_HEIGHT, IMG_WIDTH, 3)),
    MaxPooling2D(),
    Dropout(rate=0.2),
    Conv2D(filters=32, kernel_size=3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(filters=64, kernel_size=3, padding='same', activation='relu'),
    MaxPooling2D(),
    Dropout(rate=0.2),  
    Flatten(),
    Dense(units=512, activation='relu'),
    Dense(units=1, activation='sigmoid')
])

Training the model again shows that the overfitting has been suppressed.


The complete code is as follows:

from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Flatten, Dropout, MaxPooling2D
from tensorflow.keras.preprocessing.image import ImageDataGenerator

import os
import numpy as np
import matplotlib.pyplot as plt

have_data = False       # set to True if the dataset has already been downloaded
training_mode = True    # True: train and save checkpoints; False: restore and evaluate

batch_size = 128
epochs = 15
IMG_HEIGHT = 150
IMG_WIDTH = 150

checkpoint_path = 'training_adv/cp-{epoch:04d}.ckpt'

""" Load date """
if have_data:
    PATH = '/home/<user-id>/.keras/datasets/cats_and_dogs_filtered'
else:
    _URL = 'https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip'
    path_to_zip = tf.keras.utils.get_file(fname='cats_and_dogs.zip', origin=_URL, extract=True)
    PATH = os.path.join(os.path.dirname(path_to_zip), 'cats_and_dogs_filtered')

train_dir = os.path.join(PATH, 'train')
validation_dir = os.path.join(PATH, 'validation')

train_cats_dir = os.path.join(train_dir, 'cats')  # directory with our training cat pictures
train_dogs_dir = os.path.join(train_dir, 'dogs')  # directory with our training dog pictures
validation_cats_dir = os.path.join(validation_dir, 'cats')  # directory with our validation cat pictures
validation_dogs_dir = os.path.join(validation_dir, 'dogs')  # directory with our validation dog pictures

""" Understand the data counts """
num_cats_tr = len(os.listdir(train_cats_dir))
num_dogs_tr = len(os.listdir(train_dogs_dir))
num_cats_val = len(os.listdir(validation_cats_dir))
num_dogs_val = len(os.listdir(validation_dogs_dir))

total_train = num_cats_tr + num_dogs_tr
total_val = num_cats_val + num_dogs_val

print('total training cat images:', num_cats_tr)
print('total training dog images:', num_dogs_tr)

print('total validation cat images:', num_cats_val)
print('total validation dog images:', num_dogs_val)
print("--")
print("Total training images:", total_train)
print("Total validation images:", total_val)

""" Data preparation """
train_image_generator = ImageDataGenerator(rescale=1./255,
                                           rotation_range=45,
                                           width_shift_range=.15,
                                           height_shift_range=.15,
                                           horizontal_flip=True,
                                           zoom_range=0.5)

validation_image_generator = ImageDataGenerator(rescale=1./255)

train_data_gen = train_image_generator.flow_from_directory(batch_size=batch_size,
                                                           directory=train_dir,
                                                           shuffle=True,
                                                           target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                           class_mode='binary')
val_data_gen = validation_image_generator.flow_from_directory(batch_size=batch_size,
                                                              directory=validation_dir,
                                                              target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                              class_mode='binary')

""" Visualize training images """
# Calling next() on the generator returns one batch as a tuple
# (x_train, y_train), where x_train is the batch of images and
# y_train the corresponding labels.
# Discard the labels here to visualize only the training images.
sample_training_images, _ = next(train_data_gen)

def plot_images(images_arr):
    fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(20, 20))
    axes = axes.flatten()

    for img, ax in zip(images_arr, axes):
        ax.imshow(img)
        ax.axis('off')

    plt.tight_layout()
    plt.show()

plot_images(sample_training_images[:5])

""" Create the model """
model = Sequential([
    Conv2D(filters=16, kernel_size=3, padding='same', activation='relu', input_shape=(IMG_HEIGHT, IMG_WIDTH, 3)),
    MaxPooling2D(),
    Dropout(rate=0.2),
    Conv2D(filters=32, kernel_size=3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(filters=64, kernel_size=3, padding='same', activation='relu'),
    MaxPooling2D(),
    Dropout(rate=0.2),  
    Flatten(),
    Dense(units=512, activation='relu'),
    Dense(units=1, activation='sigmoid')
])

""" Compile the model """
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.summary()

if training_mode:
    """ Create the model save method """
    # Create a callback that saves the model's weights every 5 epochs
    cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                     verbose=1,
                                                     save_weights_only=True,
                                                     save_freq=5)
    model.save_weights(checkpoint_path.format(epoch=0))

    """ Train the model """
    history = model.fit_generator(
        generator=train_data_gen,
        steps_per_epoch=total_train//batch_size,
        epochs=epochs,
        callbacks=[cp_callback],
        validation_data=val_data_gen,
        validation_steps=total_val//batch_size
    )

    """ Visualize training results """
    print(history.history)
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']
    loss = history.history['loss']
    val_loss = history.history['val_loss']

    epochs_range = range(epochs)

    plt.figure(figsize=(8, 8))

    plt.subplot(1, 2, 1)
    plt.plot(epochs_range, acc, label='Training Accuracy')
    plt.plot(epochs_range, val_acc, label='Validation Accuracy')
    plt.legend(loc='lower right')
    plt.title('Training and Validation Accuracy')

    plt.subplot(1, 2, 2)
    plt.plot(epochs_range, loss, label='Training Loss')
    plt.plot(epochs_range, val_loss, label='Validation Loss')
    plt.legend(loc='upper right')
    plt.title('Training and Validation Loss')
    plt.show()

else:
    checkpoint_dir = os.path.dirname(checkpoint_path)
    latest = tf.train.latest_checkpoint(checkpoint_dir)
    model.load_weights(latest)
    loss, acc = model.evaluate_generator(generator=val_data_gen, verbose=2)
    print("Restored model, accuracy: {:5.2f}%".format(100*acc))

 
