【深度学习走进tensorflow2.0】TensorFlow 2.0 常用模块TFRecord

原創

2020-02-22 09:52

主要介绍TensorFlow 另一个数据处理的利器——TFRecord。

一、什么是TFRecord ？
TFRecord 是 TensorFlow 中的数据集存储格式。当我们将数据集整理成 TFRecord 格式后，TensorFlow 就可以高效地读取和处理这些数据集，从而帮助我们更高效地进行大规模的模型训练。

TFRecord 可以理解为一系列序列化的 tf.train.Example 元素所组成的列表文件，而每一个 tf.train.Example 又由若干个 tf.train.Feature 的字典组成。形式如下：

# dataset.tfrecords
 [
     {   # example 1 (tf.train.Example)
         'feature_1': tf.train.Feature,
         ...
         'feature_k': tf.train.Feature
     },
     ...
     {   # example N (tf.train.Example)
        'feature_1': tf.train.Feature,
        ...
        'feature_k': tf.train.Feature
    }
]


为了将形式各样的数据集整理为 TFRecord 格式可按以下步骤：
1、读取该数据元素到内存；
2、将该元素转换为 tf.train.Example 对象（每一个 tf.train.Example 由若干个 tf.train.Feature 的字典组成，因此需要先建立 Feature 的字典）；
3、将该 tf.train.Example 对象序列化为字符串，并通过一个预先定义的 tf.io.TFRecordWriter 写入 TFRecord 文件。

而读取 TFRecord 数据则可按照以下步骤：
1、通过 tf.data.TFRecordDataset 读入原始的 TFRecord 文件（此时文件中的 tf.train.Example 对象尚未被反序列化），获得一个 tf.data.Dataset 数据集对象；
2、通过 Dataset.map 方法，对该数据集对象中的每一个序列化的 tf.train.Example 字符串执行 tf.io.parse_single_example 函数，从而实现反序列化。

二、实战例子理解，猫狗数据集转换为tfrecord文件，并读取
猫狗数据集下载地址


# 将数据集存储为 TFRecord 文件

import tensorflow as tf
import os

data_dir = 'D:/深度学习/tensorflow2.0/catsAndDogs/datasets/'
train_cats_dir = data_dir + '/train/cats/'
train_dogs_dir = data_dir + '/train/dogs/'


tfrecord_file = data_dir + '/train/train.tfrecords'

train_cat_filenames = [train_cats_dir + filename for filename in os.listdir(train_cats_dir)]
train_dog_filenames = [train_dogs_dir + filename for filename in os.listdir(train_dogs_dir)]


train_filenames = train_cat_filenames + train_dog_filenames
train_labels = [0] * len(train_cat_filenames) + [1] * len(train_dog_filenames)  # 将 cat 类的标签设为0，dog 类的标签设为1


# 迭代读取每张图片，建立 tf.train.Feature 字典和 tf.train.Example 对象，序列化并写入 TFRecord 文件。
with tf.io.TFRecordWriter(tfrecord_file) as writer:
    for filename, label in zip(train_filenames, train_labels):
        image = open(filename, 'rb').read()     # 读取数据集图片到内存，image 为一个 Byte 类型的字符串
        feature = {                             # 建立 tf.train.Feature 字典
            'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image])),  # 图片是一个 Bytes 对象
            'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))   # 标签是一个 Int 对象
        }
        example = tf.train.Example(features=tf.train.Features(feature=feature)) # 通过字典建立 Example
        writer.write(example.SerializeToString())   # 将Example序列化并写入 TFRecord 文件

# 运行以上代码，不出片刻，我们即可在 tfrecord_file 所指向的文件地址获得一个 500MB 左右的 train.tfrecords 文件。

# 读取 TFRecord 文件
raw_dataset = tf.data.TFRecordDataset(tfrecord_file)    # 读取 TFRecord 文件
feature_description = { # 定义Feature结构，告诉解码器每个Feature的类型是什么
    'image': tf.io.FixedLenFeature([], tf.string),
    'label': tf.io.FixedLenFeature([], tf.int64),
}

def _parse_example(example_string): # 将 TFRecord 文件中的每一个序列化的 tf.train.Example 解码
    feature_dict = tf.io.parse_single_example(example_string, feature_description)
    feature_dict['image'] = tf.io.decode_jpeg(feature_dict['image'])    # 解码JPEG图片
    return feature_dict['image'], feature_dict['label']


dataset = raw_dataset.map(_parse_example)
# 运行以上代码后，我们获得一个数据集对象 dataset ，这已经是一个可以用于训练的 tf.data.Dataset 对象了！

import matplotlib.pyplot as plt
for image, label in dataset:
   plt.title('cat' if label == 0 else 'dog')
   plt.imshow(image.numpy())
   plt.show()

开心果汁

发布了653 篇原创文章 · 获赞 795 · 访问量 188万+

他的留言板关注

發表評論

所有評論

還沒有人評論，想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.

【深度学习走进tensorflow2.0】TensorFlow 2.0 常用模块TFRecord

Python 潮流周刊#50：我最喜欢的 Python 3.13 新特性！

【深度學習走進tensorflow2.0】TensorFlow binary was not compiled to use: AVX2

【matlab 圓周率計算】matlab 求圓周率的兩種算法實現比較

【深度學習走進tensorflow2.0】TensorFlow 2.0 常用模塊@tf.function

【機器學習非線性迴歸模型】10分鐘瞭解下8種常見的非線性迴歸模型

【深度學習走開tensorflow2.0】TensorFlow 2.0 常用模塊tf.TensorArray

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結

【深度学习 走进tensorflow2.0】TensorFlow 2.0 常用模块TFRecord

【深度学习走进tensorflow2.0】TensorFlow 2.0 常用模块TFRecord