Introduction to the TensorFlow Dataset API

Original

coderhzx

2018-08-31 19:55

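A tfrecord built from your files can run to tens of gigabytes. Want to build a Dataset directly from a large collection of image files first instead?
The notes below should help.

For an introduction, see the first entry in the reference list below; it explains things very clearly.

These notes are organized as follows
- 1 Creation
- 2 Transformations
- 3 Getting the sample Tensors
- 4 Usage
- 5 Performance Considerations
- reference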

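1 Creation

There are too many images to load them all into the Dataset, so in this step the Dataset only stores the file paths.

import os
import tensorflow as tf

# os.listdir returns bare file names, so join them with img_dir to get usable paths
filelist = [os.path.join(img_dir, f) for f in os.listdir(img_dir)]
# label_list = ... # build the label list according to your own data
# two tensors
t_flist = tf.constant(filelist)
t_labellist = tf.constant(label_list)
# build the Dataset; the two tensors must be passed as a single tuple
dataset = tf.data.Dataset.from_tensor_slices((t_flist, t_labellist))

That completes construction; each element of the dataset is one sample (file, label).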
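To make the element structure concrete, here is a minimal pure-Python sketch of what pairing two parallel lists into (file, label) samples amounts to. It does not use TensorFlow: `slice_pairs`, the file names and the labels are all hypothetical and only mimic how slicing along the first dimension pairs the i-th file with the i-th label.

```python
def slice_pairs(files, labels):
    """Pair the i-th file with the i-th label, like slicing along dimension 0."""
    if len(files) != len(labels):
        raise ValueError("files and labels must have the same length")
    return list(zip(files, labels))


# Hypothetical data, for illustration only.
files = ["img/cat_0.bmp", "img/dog_0.bmp", "img/cat_1.bmp"]
labels = [0, 1, 0]
samples = slice_pairs(files, labels)
```

The length check reflects the one constraint worth remembering: the file list and the label list must line up one to one, otherwise samples end up with the wrong label.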

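2 Transformations

The commonly used ones are these; cache, map and shuffle are described below
- cache
- map
- shuffle
- repeat
- batch
- prefetch

A common combination is given at the end.

cache

Caches the dataset in memory or on local disk.

cache(filename='')

It takes a single argument, a tf.Tensor of type tf.string naming a file path on the file system; the default (empty string) caches in memory.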

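map

map(
    map_func,
    num_parallel_calls=None
)

Two parameters
- map_func: the function used to transform each sample
- num_parallel_calls: the number of elements processed in parallel, usually set to the number of CPU threads

For the images above we need to turn (file, label) into (img_data, label), so the function looks like this

def _mapfunc(file, label):
  with tf.device('/cpu:0'):
    img_raw = tf.read_file(file)
    decoded = tf.image.decode_bmp(img_raw)
    resized = tf.image.resize_images(decoded, [h, w])
  return resized, label
  # the return value may also include the file name; the number of items is up to you
  # each member returned by running Iter.get_next() below corresponds to the items here

The GPU and CPU process the computation graph asynchronously. The GPU should concentrate on the heavy numerical computation, so image decoding is left to the CPU, as Google recommends.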
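The idea behind num_parallel_calls can be illustrated with a small pure-Python analogy. This is only a sketch using a thread pool, not how TensorFlow implements map; `parallel_map` and `fake_decode` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(func, items, num_parallel_calls):
    """Apply func to every item using a pool of worker threads.

    Executor.map returns results in input order, so the output
    lines up with the input even though the work runs concurrently.
    """
    with ThreadPoolExecutor(max_workers=num_parallel_calls) as pool:
        return list(pool.map(func, items))


def fake_decode(sample):
    """Stand-in for an expensive decode step: (file, label) -> (data, label)."""
    path, label = sample
    return (path.upper(), label)


samples = [("a.bmp", 0), ("b.bmp", 1), ("c.bmp", 0)]
decoded = parallel_map(fake_decode, samples, num_parallel_calls=2)
```

More workers only help while there are idle CPU threads, which is why the thread count of the machine is a sensible starting value.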

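shuffle

Shuffles the samples in the dataset; this is essential once samples of different classes are mixed into one dataset.

shuffle(
    buffer_size,
    seed=None,
    reshuffle_each_iteration=None
)

reshuffle_each_iteration defaults to True
buffer_size: use the number of samples + 1, so the buffer covers the whole dataset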
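Why buffer_size should cover the whole dataset is easier to see with a simplified model of a buffer-based shuffle. The sketch below is pure Python and is an assumption about the general mechanism (fill a buffer, then emit a random element from it as each new one arrives), not TensorFlow's actual implementation; `buffered_shuffle` is a hypothetical name.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in an order randomized within a sliding buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Buffer is full: emit one randomly chosen element.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    while buffer:
        # Input exhausted: drain what is left, still in random order.
        idx = rng.randrange(len(buffer))
        buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
        yield buffer.pop()


# buffer_size=1 cannot reorder anything; a buffer larger than the
# dataset can produce any permutation of it.
small = list(buffered_shuffle(range(10), buffer_size=1, seed=0))
full = list(buffered_shuffle(range(10), buffer_size=11, seed=0))
```

With a small buffer, the first element emitted can only come from the first buffer_size inputs, so a dataset stored class by class would stay mostly sorted.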

Common combination

dset = dataset.map(_mapfunc)

3 Getting the sample Tensors

First, the iterators
- make_one_shot_iterator()
- make_initializable_iterator(shared_name=None)

There are two more that are not listed here; I will add them when I need them.

Then the sample Tensors

get_next(name=None)

Returns a nested structure of tf.Tensors containing the next element.

Sample code

_iter = dset.make_one_shot_iterator()
next_one = _iter.get_next() # type: tuple

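4 Usage

Treat next_one as a tensor and start building the graph.
For example, compute a global average pooling

img, label = next_one
# In tf 1.4 this tensor carries no static shape information, so many ops fail with shape errors,
# typically complaining that the last (channel) dimension is None.
# The reshape below is required; h, w, c are the known height, width and channel count.
out = tf.reshape(img, [-1, h, w, c])
out = tf.reduce_mean(out, axis=(1, 2)) # channels last
# out shape: (n, c)

Without the reshape, the error looks like this

ValueError: The channel dimension of the inputs should be defined. Found `None`.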
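For reference, this is what global average pooling over a channels-last batch computes, written as a minimal pure-Python sketch on nested lists. It is an illustration of the arithmetic only, without TensorFlow; `global_average_pool` and the toy batch are hypothetical.

```python
def global_average_pool(batch):
    """Average over height and width.

    batch: nested list shaped [n][h][w][c] (channels last).
    Returns a nested list shaped [n][c].
    """
    pooled = []
    for image in batch:
        height = len(image)
        width = len(image[0])
        channels = len(image[0][0])
        sums = [0.0] * channels
        for row in image:
            for pixel in row:
                for k, value in enumerate(pixel):
                    sums[k] += value
        pooled.append([s / (height * width) for s in sums])
    return pooled


# One 2x2 image with 2 channels: n=1, h=2, w=2, c=2.
batch = [[[[1.0, 10.0], [3.0, 30.0]],
          [[5.0, 50.0], [7.0, 70.0]]]]
pooled = global_average_pool(batch)
```

Each channel is averaged independently over all h * w positions, which is why the channel count has to be known before the op can be built.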

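5 Performance Considerations

The transformations above are convenient, but the order in which they are applied affects performance.

Map and Batch

map calls a user-defined function once per element. If that function does very little work, the per-call overhead dominates, so batching first and mapping afterwards is recommended.
The map function then has to handle a whole batch, which means adding a loop; for the map function above, the only option I can think of is tf.while_loop.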
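The saving from batching first is simply fewer function calls. The pure-Python sketch below counts the calls both ways; it is a toy model with hypothetical names (`batched`, `scale_one`, `scale_batch`) and says nothing about the actual per-call cost inside TensorFlow.

```python
def batched(items, batch_size):
    """Group items into lists of batch_size; the last one may be shorter."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


calls = {"per_element": 0, "per_batch": 0}


def scale_one(x):
    """A cheap per-element function: call overhead dominates."""
    calls["per_element"] += 1
    return x * 2


def scale_batch(xs):
    """The same work applied to a whole batch in a single call."""
    calls["per_batch"] += 1
    return [x * 2 for x in xs]


data = list(range(10))
map_then_batch = list(batched((scale_one(x) for x in data), 4))
batch_then_map = [scale_batch(b) for b in batched(data, 4)]
```

Both orders produce the same batches, but the per-element version is called once per sample while the batched version is called once per batch.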

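Map and Cache

The official guide says: if the map function is expensive, apply map -> cache as long as the result fits in memory or on disk.
Without cache, map is recomputed every time just before the GPU needs the data; an expensive map function does slow the GPU down, and computing the results ahead of time and caching them reduces the wait when demand spikes.
However, as with prefetch, a buffer of some sufficient size n is enough; it does not need to hold everything.
In that case the main benefit of cache may come down to avoiding frequent disk reads.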
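The effect of map -> cache across epochs can be sketched in pure Python: the expensive function runs during the first full pass only, and later passes replay the stored results. `CachedMap` is a hypothetical class for illustration, not a TensorFlow API, and it caches in memory only.

```python
class CachedMap:
    """Apply func to each source item once, then replay cached results."""

    def __init__(self, source, func):
        self._source = source
        self._func = func
        self._cache = None  # filled after the first complete pass
        self.calls = 0      # how many times func has actually run

    def __iter__(self):
        if self._cache is not None:
            # Later epochs: no recomputation, just replay.
            yield from self._cache
            return
        computed = []
        for item in self._source:
            self.calls += 1
            result = self._func(item)
            computed.append(result)
            yield result
        self._cache = computed


# str.upper stands in for an expensive decode-and-resize step.
dataset = CachedMap(["a.bmp", "b.bmp", "c.bmp"], str.upper)
epoch1 = list(dataset)
epoch2 = list(dataset)
```

Note that the cache is only filled once the first pass runs to completion, so a pass that stops early leaves nothing to replay.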

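Map and Interleave / Prefetch / Shuffle

The latter three each hold on to a buffer of dataset elements, so if map changes the size of a sample, where map sits relative to them affects memory usage.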
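A rough back-of-the-envelope calculation shows how large the difference can be. All the numbers below are assumptions chosen for illustration (a 1000-element buffer, about 100 bytes per file path, and 224 x 224 x 3 float32 images after map); `buffer_bytes` is a hypothetical helper.

```python
def buffer_bytes(buffer_size, bytes_per_element):
    """Approximate memory held by a buffer of equally sized elements."""
    return buffer_size * bytes_per_element


BUFFER_SIZE = 1000               # assumed shuffle/prefetch buffer size
PATH_BYTES = 100                 # assumed size of one file-path string
IMAGE_BYTES = 224 * 224 * 3 * 4  # assumed decoded float32 image

# Buffer placed before map holds paths; after map it holds decoded images.
before_map = buffer_bytes(BUFFER_SIZE, PATH_BYTES)
after_map = buffer_bytes(BUFFER_SIZE, IMAGE_BYTES)
ratio = after_map / before_map
```

Under these assumptions, buffering before map costs about 100 KB while buffering after map costs about 600 MB, so a large shuffle buffer is much cheaper when it sits ahead of the decoding step.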

Repeat and Shuffle

The official recommendation is tf.contrib.data.shuffle_and_repeat; if that is not available, use
shuffle -> repeat
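The difference between the two orders can be modelled in pure Python. This sketch assumes a shuffle buffer large enough to hold everything it sees and is not TensorFlow code; both function names are hypothetical.

```python
import random


def shuffle_then_repeat(items, epochs, seed=None):
    """Shuffle each epoch separately, then play the epochs back to back."""
    rng = random.Random(seed)
    for _ in range(epochs):
        epoch = list(items)
        rng.shuffle(epoch)  # a fresh order for every epoch
        yield from epoch


def repeat_then_shuffle(items, epochs, seed=None):
    """Repeat first, then shuffle the whole stream: epochs blend together."""
    rng = random.Random(seed)
    stream = list(items) * epochs
    rng.shuffle(stream)
    yield from stream


separate = list(shuffle_then_repeat(range(5), epochs=3, seed=0))
blended = list(repeat_then_shuffle(range(5), epochs=3, seed=0))
# With shuffle -> repeat, every consecutive block of 5 is a full epoch.
blocks = [sorted(separate[i:i + 5]) for i in range(0, 15, 5)]
```

With shuffle -> repeat every sample is seen exactly once per epoch; with repeat -> shuffle a sample may appear several times before another appears at all.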

reference

A New Way to Read Data in TensorFlow: An Introductory Tutorial to the Dataset API
https://zhuanlan.zhihu.com/p/30751039
Importing Data
https://www.tensorflow.org/programmers_guide/datasets
Module: tf.data
https://www.tensorflow.org/versions/master/api_docs/python/tf/local_variables
Introduction to TensorFlow Datasets and Estimators
https://developers.googleblog.com/2017/09/introducing-tensorflow-datasets.html
Input Pipeline Performance Guide
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/docs_src/performance/datasets_performance.md
