TensorFlow Data Processing (Input File Queues)


2020-06-21 03:46

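Generating sample data

First, generate sample data in TFRecord format. Each Example has the structure below; this one represents the first record in the first file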

{
    'i':0,
    'j':0
}

The code for generating the data is as follows (all code below is taken from the book 《TensorFlow：實戰Google深度學習框架》)

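import os

import tensorflow as tf


# Helper for building an int64 feature for a TFRecord Example
def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


# Simulate writing a large dataset across several files
num_shards = 2  # total number of files to write
instances_per_shard = 2  # number of records per file

# Make sure the output directory exists before writing
os.makedirs('data', exist_ok=True)

for i in range(num_shards):
    # Files are distinguished by a 0000n-of-0000m suffix: n is the index
    # of the current file and m is the total number of files
    filename = ('data/data.tfrecords-%.5d-of-%.5d' % (i, num_shards))
    writer = tf.python_io.TFRecordWriter(filename)

    # Wrap each record in an Example and write it to the TFRecord file
    for j in range(instances_per_shard):
        example = tf.train.Example(
            features=tf.train.Features(feature={
                'i': _int64_feature(i),
                'j': _int64_feature(j)
            }))
        writer.write(example.SerializeToString())
    writer.close()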

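Running it creates two files in the data directory. Each file name ends with a 0000n-of-0000m suffix, where n is the index of the current file and m is the total number of files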

data/
    data.tfrecords-00000-of-00002
    data.tfrecords-00001-of-00002
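
The naming scheme can be checked without TensorFlow. The following is a minimal sketch; the helper name shard_filenames is hypothetical and does not appear in the original code.

```python
def shard_filenames(prefix, num_shards):
    """Return one file name per shard, using the 0000n-of-0000m suffix."""
    # %.5d zero-pads the integer to five digits, e.g. 1 -> '00001'.
    return ['%s-%.5d-of-%.5d' % (prefix, i, num_shards)
            for i in range(num_shards)]


names = shard_filenames('data/data.tfrecords', 2)
for name in names:
    print(name)
```

Because the total shard count is part of every name, a reader can tell from any single file how many files the full dataset should contain.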

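Reading file data

Two functions do most of the work of building the file queue

tf.train.match_filenames_once(): gets the list of files that match a glob pattern (a shell-style wildcard such as data.tfrecords-*, not a regular expression)
tf.train.string_input_producer(): creates an input queue from a list of files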
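
To illustrate what a glob pattern selects, the sketch below uses Python's standard fnmatch module as a stand-in. It is not TensorFlow's matcher, which may differ in details, and the candidate file names are made up for the example.

```python
import fnmatch

pattern = 'data/data.tfrecords-*'
candidates = [
    'data/data.tfrecords-00000-of-00002',
    'data/data.tfrecords-00001-of-00002',
    'data/other.txt',  # hypothetical file that should not match
]

# fnmatch.filter keeps only the names that match the glob pattern.
matched = fnmatch.filter(candidates, pattern)
print(matched)
```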

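When the shuffle parameter is set to True, string_input_producer shuffles the order in which files are enqueued within each epoch, so the dequeue order is random. Shuffling and enqueueing run on a separate thread, so they do not slow down dequeueing

Once every file in the input queue has been processed, the files in the list are added to the queue again. The num_epochs parameter limits the maximum number of times the initial file list is loaded

The code for reading data from the file queue is as follows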
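
The epoch behaviour described above can be modelled in plain Python. This is a simplified sketch and not TensorFlow's implementation: the generator filename_stream and the file names are hypothetical, and it runs in a single thread. Here the stream simply ends after num_epochs passes.

```python
import random


def filename_stream(filenames, shuffle=True, num_epochs=None, seed=None):
    """Yield file names epoch by epoch, like a simplified filename queue."""
    rng = random.Random(seed)
    epoch = 0
    # num_epochs=None means cycle through the list without limit.
    while num_epochs is None or epoch < num_epochs:
        order = list(filenames)
        if shuffle:
            rng.shuffle(order)  # reshuffle at the start of every epoch
        for name in order:
            yield name
        epoch += 1


files = ['shard-0', 'shard-1']
two_epochs = list(filename_stream(files, shuffle=False, num_epochs=2))
print(two_epochs)
```

With shuffle=True every epoch still contains each file exactly once; only the order within the epoch changes.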

import tensorflow as tf

# Get the list of files
files = tf.train.match_filenames_once('data/data.tfrecords-*')

# Create the filename input queue
filename_queue = tf.train.string_input_producer(files, shuffle=False)

# Read and parse an Example
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)
features = tf.parse_single_example(
    serialized_example,
    features={
        'i': tf.FixedLenFeature([], tf.int64),
        'j': tf.FixedLenFeature([], tf.int64)
    })

with tf.Session() as sess:
    # match_filenames_once creates a local variable, so
    # local_variables_initializer must be run as well
    sess.run(
        [tf.global_variables_initializer(),
         tf.local_variables_initializer()])

    # Print the file names
    print(sess.run(files))

    # Use a Coordinator to manage the threads, then start them
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    # Fetch the data
    for i in range(6):
        print(sess.run([features['i'], features['j']]))
    coord.request_stop()
    coord.join(threads)

tf.local_variables_initializer() is required here to initialize the local variable created by tf.train.match_filenames_once(); without it, the following error is raised

tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value matching_filenames

The output is as follows

$ python read.py

[b'data/data.tfrecords-00000-of-00002'
 b'data/data.tfrecords-00001-of-00002']

[0, 0]
[0, 1]
[1, 0]
[1, 1]
[0, 0]
[0, 1]

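The last two lines of output are a second pass over the first file, which shows that string_input_producer added the initial file list to the queue again

Batching examples

Two functions can combine single examples into batches. Each dequeue from either one returns a batch of examples; the difference is that shuffle_batch also shuffles the order of the data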

tf.train.batch()
tf.train.shuffle_batch()

tf.train.batch() is used as follows

import tensorflow as tf

# Get the list of files
files = tf.train.match_filenames_once('data/data.tfrecords-*')

# Create the filename input queue
filename_queue = tf.train.string_input_producer(files, shuffle=False)

# Read and parse an Example
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)
features = tf.parse_single_example(
    serialized_example,
    features={
        'i': tf.FixedLenFeature([], tf.int64),
        'j': tf.FixedLenFeature([], tf.int64)
    })

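# i stands for the feature vector and j for the label
example, label = features['i'], features['j']

# Number of examples in one batch
batch_size = 3

# Maximum number of examples the batching queue can hold
capacity = 1000 + 3 * batch_size

# Combine the examples into batches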
example_batch, label_batch = tf.train.batch(
    [example, label], batch_size=batch_size, capacity=capacity)

with tf.Session() as sess:
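    # match_filenames_once creates a local variable, so
    # local_variables_initializer must be run as well
    sess.run(
        [tf.global_variables_initializer(),
         tf.local_variables_initializer()])

    # Use a Coordinator to manage the threads, then start them
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    # Fetch and print the batched examples. In a real problem they
    # would usually be fed to a neural network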
    for i in range(2):
        cur_example_batch, cur_label_batch = sess.run(
            [example_batch, label_batch])
        print(cur_example_batch, cur_label_batch)

    coord.request_stop()
    coord.join(threads)

The output is as follows

$ python batching.py

[0 0 1] [0 1 0]
[1 0 0] [1 0 1]

As shown, the individual records are grouped into batches of 3
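
The grouping can be traced by hand with a plain-Python model of the pipeline. This is only a sketch under simplifying assumptions (one reader, no threads, unshuffled file order); the helpers record_stream and batches are hypothetical names, not TensorFlow functions.

```python
import itertools


def record_stream(num_shards, instances_per_shard):
    """Yield (i, j) pairs shard by shard, repeating the shard list forever."""
    for i in itertools.cycle(range(num_shards)):
        for j in range(instances_per_shard):
            yield i, j


def batches(records, batch_size):
    """Group consecutive records into (examples, labels) batches."""
    while True:
        chunk = list(itertools.islice(records, batch_size))
        if len(chunk) < batch_size:
            return  # not enough records left for a full batch
        examples = [i for i, _ in chunk]
        labels = [j for _, j in chunk]
        yield examples, labels


stream = record_stream(num_shards=2, instances_per_shard=2)
first_two = list(itertools.islice(batches(stream, batch_size=3), 2))
for examples, labels in first_two:
    print(examples, labels)
```

The two batches contain the same values as the output above: the first batch takes both records of file 0 and the first record of file 1, and the second batch starts with the last record of file 1 before wrapping around to file 0.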

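tf.train.shuffle_batch() is used in the same way, with one extra parameter: min_after_dequeue sets the minimum number of elements that must remain in the queue after a dequeue. When the queue holds too few elements, shuffling has little effect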
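
The effect of min_after_dequeue can be seen in a small plain-Python model of a shuffle buffer. This is a sketch and not TensorFlow's implementation: the generator name shuffled is hypothetical, and draining the buffer once the input ends is an assumption made to keep the example finite.

```python
import random


def shuffled(records, min_after_dequeue, seed=None):
    """Yield records in random order through a simplified shuffle buffer."""
    rng = random.Random(seed)
    buffer = []
    for record in records:
        buffer.append(record)
        # Only dequeue while more than min_after_dequeue elements are
        # buffered, so at least that many remain after each dequeue.
        if len(buffer) > min_after_dequeue:
            yield buffer.pop(rng.randrange(len(buffer)))
    # Assumption of this sketch: once the input ends, drain what is left.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))


data = list(range(10))
unmixed = list(shuffled(data, min_after_dequeue=0, seed=0))
mixed = list(shuffled(data, min_after_dequeue=5, seed=0))
print(unmixed)
print(mixed)
```

With min_after_dequeue=0 the buffer never holds more than one element, so the records come out in their original order; a larger value gives each dequeue more candidates to choose from.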

Reposted from: https://blog.csdn.net/white_idiot/article/details/78847091
