tf.data Official Tutorial - Based on TF v2

This is my second post on tf.data. The first one covered tf.data in detail based on TF v1, but v1 and v2 are incompatible in many places, so here is a look at what is new in the tf.data module in v2.

TensorFlow version: 2.1.0

First, here is the link to the TF v1 version of this tf.data post: 《TensorFlow tf.data 導入數據(tf.data官方教程)》


Building a data input pipeline with tf.data

The tf.data API lets you write data input pipelines that are simple and highly reusable, and it can express very complex pipelines. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training. The pipeline for a text model might involve extracting symbols from raw text data, converting them to embedding identifiers with a lookup table, and batching together sequences of different lengths. The tf.data API makes it possible to handle large amounts of data, read from different data formats, and perform complex transformations.

The tf.data API introduces the tf.data.Dataset abstraction: a sequence of elements, where each element consists of one or more components. For example, in an image input pipeline, an element might be a single training example made up of an image and its label.

There are two ways to create a dataset:

  • building a Dataset from data in memory or from one or more files on disk
  • applying a transformation to an existing Dataset to produce a new Dataset
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf

import pathlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

np.set_printoptions(precision=4)

1. Basics

To build an input pipeline, you generally start from a data source. If your data is stored in memory, you can create a Dataset with tf.data.Dataset.from_tensors() or tf.data.Dataset.from_tensor_slices(). If your data is stored in the TFRecord format, you can create a Dataset with tf.data.TFRecordDataset().

Once you have a Dataset object, you can produce a new Dataset by calling transformation methods on it.
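For example, a minimal sketch of chaining transformations (map and batch are covered in detail later in this guide):

dataset = tf.data.Dataset.from_tensor_slices([8, 3, 0, 8, 2, 1])
# Each transformation returns a new Dataset: here the elements are doubled
# and then grouped into batches of 3.
dataset = dataset.map(lambda x: x * 2).batch(3)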

A Dataset is a Python iterable, so you can consume its elements with a for loop:

dataset = tf.data.Dataset.from_tensor_slices([8, 3, 0, 8, 2, 1])
dataset

<TensorSliceDataset shapes: (), types: tf.int32>

for elem in dataset:
  print(elem.numpy())

8
3
0
8
2
1

Alternatively, you can explicitly create a Python iterator with iter and consume its elements with next:

it = iter(dataset)

print(next(it).numpy())

8

You can also consume dataset elements with the reduce() transformation, which reduces all elements to a single result. The following example shows how to use reduce to compute the sum of a dataset of integers.

print(dataset.reduce(0, lambda state, value: state + value).numpy())

22

1.1 Dataset structure

A Dataset consists of elements that all share the same (nested) structure, and each element consists of components that can be represented by tf.TypeSpec (commonly Tensor, SparseTensor, RaggedTensor, TensorArray, or Dataset).

The Dataset.element_spec property lets you inspect the type of each element's components. It returns a nested structure of tf.TypeSpec objects that matches the structure of the dataset's elements. For example:

dataset1 = tf.data.Dataset.from_tensor_slices(tf.random.uniform([4, 10]))

dataset1.element_spec

TensorSpec(shape=(10,), dtype=tf.float32, name=None)

dataset2 = tf.data.Dataset.from_tensor_slices(
   (tf.random.uniform([4]),
    tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)))

dataset2.element_spec

(TensorSpec(shape=(), dtype=tf.float32, name=None),
 TensorSpec(shape=(100,), dtype=tf.int32, name=None))

dataset3 = tf.data.Dataset.zip((dataset1, dataset2))

dataset3.element_spec

(TensorSpec(shape=(10,), dtype=tf.float32, name=None),
 (TensorSpec(shape=(), dtype=tf.float32, name=None),
  TensorSpec(shape=(100,), dtype=tf.int32, name=None)))

# Dataset containing a sparse tensor.
dataset4 = tf.data.Dataset.from_tensors(tf.SparseTensor(indices=[[0, 0], [1, 2]], values=[1, 2], dense_shape=[3, 4]))

dataset4.element_spec

SparseTensorSpec(TensorShape([3, 4]), tf.int32)

# Use value_type to see the type of value represented by the element spec
dataset4.element_spec.value_type

tensorflow.python.framework.sparse_tensor.SparseTensor

Dataset transformations support datasets of any structure. When using the Dataset.map(), Dataset.flat_map(), and Dataset.filter() transformations (which apply a function to each element), the element structure determines the arguments of that function:

dataset1 = tf.data.Dataset.from_tensor_slices(
    tf.random.uniform([4, 10], minval=1, maxval=10, dtype=tf.int32))

dataset1

<TensorSliceDataset shapes: (10,), types: tf.int32>

for z in dataset1:
  print(z.numpy())

[6 7 1 1 5 6 7 8 7 6]
[8 3 3 7 9 3 8 4 8 4]
[2 3 6 9 4 2 1 8 1 6]
[6 7 1 9 6 2 4 7 9 1]

dataset2 = tf.data.Dataset.from_tensor_slices(
   (tf.random.uniform([4]),
    tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)))

dataset2

<TensorSliceDataset shapes: ((), (100,)), types: (tf.float32, tf.int32)>

dataset3 = tf.data.Dataset.zip((dataset1, dataset2))

dataset3

<ZipDataset shapes: ((10,), ((), (100,))), types: (tf.int32, (tf.float32, tf.int32))>

for a, (b,c) in dataset3:
  print('shapes: {a.shape}, {b.shape}, {c.shape}'.format(a=a, b=b, c=c))

shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
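As a sketch of how this determines the function's arguments (using the dataset3 defined above): the mapped function receives one argument per component of the outer tuple, and nested components arrive as tuples:

# Each element of dataset3 has the structure (a, (b, c)), so the mapped
# function is called with two arguments: a tensor and a pair of tensors.
shape_ds = dataset3.map(lambda a, pair: (tf.shape(a), tf.shape(pair[0]), tf.shape(pair[1])))
for sa, sb, sc in shape_ds.take(1):
  print(sa.numpy(), sb.numpy(), sc.numpy())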

:爲 Dataset 中的元素的各個組件命名通常會帶來便利性(例如,元素的各個組件表示不同特徵時)。除了元組之外,還可以使用 命名元組(collections.namedtuple) 或 字典 來表示 Dataset 的單個元素。

dataset = tf.data.Dataset.from_tensor_slices(
   {"a": tf.random.uniform([4]),
    "b": tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)})

dataset.element_spec

{'a': TensorSpec(shape=(), dtype=tf.float32, name=None), 'b': TensorSpec(shape=(100,), dtype=tf.int32, name=None)}
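The same works with collections.namedtuple; a minimal sketch:

import collections

Point = collections.namedtuple("Point", ["x", "y"])

point_ds = tf.data.Dataset.from_tensor_slices(
    Point(x=tf.random.uniform([4]), y=tf.random.uniform([4])))
point_ds.element_spec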

2. Reading input data

2.1 Consuming NumPy arrays

See Loading NumPy arrays for more examples.

If your data fits in memory, the simplest way to create a Dataset is with Dataset.from_tensor_slices().

train, test = tf.keras.datasets.fashion_mnist.load_data() # out is np array

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
32768/29515 [=================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
26427392/26421880 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
8192/5148 [===============================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz
4423680/4422102 [==============================] - 0s 0us/step

images, labels = train
images = images/255

dataset = tf.data.Dataset.from_tensor_slices((images, labels)) # auto convert np array to constant tensor
dataset

<TensorSliceDataset shapes: ((28, 28), ()), types: (tf.float64, tf.uint8)>

Note: the code above embeds the features and labels arrays in the TensorFlow graph as tf.constant() operations. This works well for small datasets, but it wastes memory because the array contents are copied multiple times, and it can run into the 2GB limit of the tf.GraphDef protocol buffer.
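One way to sidestep that limit is to feed the arrays through a generator instead of embedding them as constants; a rough sketch (it reuses the images and labels arrays loaded above and the Dataset.from_generator API covered in the next section):

def mnist_gen():
  # Yield one (image, label) pair at a time instead of embedding the arrays in the graph.
  for image, label in zip(images, labels):
    yield image, label

dataset = tf.data.Dataset.from_generator(
    mnist_gen,
    output_types=(tf.float64, tf.uint8),
    output_shapes=((28, 28), ()))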

2.2 Consuming Python generators

Another common data source is a Python generator.

Note: although using a Python generator is convenient, this approach has limited portability and scalability. It must run in the same Python process that created the generator, and it is still subject to the Python GIL.

def count(stop):
  i = 0
  while i<stop:
    yield i
    i += 1
for n in count(5):
  print(n)

0
1
2
3
4

Dataset.from_generator converts a Python generator into a tf.data.Dataset. It takes a callable (not an iterator) as input, so the generator can be restarted when it reaches its end. It also has an optional args argument, which is passed to the callable as its arguments.

The output_types argument is required because tf.data builds a tf.Graph internally, and graph edges require a tf.dtype.

ds_counter = tf.data.Dataset.from_generator(count, args=[25], output_types=tf.int32, output_shapes = (), )
for count_batch in ds_counter.repeat().batch(10).take(10):
  print(count_batch.numpy())

[ 0  1  2  3  4  5  6  7  8  9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24  0  1  2  3  4]
[ 5  6  7  8  9 10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24]
[ 0  1  2  3  4  5  6  7  8  9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24  0  1  2  3  4]
[ 5  6  7  8  9 10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24]

The output_shapes argument is not required, but it is strongly recommended because many TensorFlow operations do not support tensors with an unknown rank. If the length of a particular axis is unknown or variable, set it to None in output_shapes.

Note that other dataset methods follow similar conventions for output_shapes and output_types.

Here is an example generator that returns tuples of arrays, where the second array is a vector of unknown length:

def gen_series(): # generator
  i = 0
  while True:
    size = np.random.randint(0, 10)
    yield i, np.random.normal(size=(size,)) # array of shape (size,)
    i += 1
for i, series in gen_series():
  print(i, ":", str(series))
  if i > 5:
    break

0 : [ 1.9201 0.2124 -0.3383 -0.1141 0.7749 -0.1499]
1 : []
2 : [ 0.5885 -1.1092 0.4577 2.2978 -1.1854]
3 : [-1.7452 1.0516]
4 : []
5 : []
6 : [-0.8563 -1.2055 -0.291 1.0448 0.1486 1.0402 1.8017]

The first array is an int32 with shape (); the second is a float32 with shape (None,).

ds_series = tf.data.Dataset.from_generator(
    gen_series, 
    output_types=(tf.int32, tf.float32),  # required
    output_shapes=((), (None,))) # optional, but strongly recommended for the reasons above

ds_series

<FlatMapDataset shapes: ((), (None,)), types: (tf.int32, tf.float32)>

Now the tf.data.Dataset is built. But note: when batching a dataset with variable shapes, you need to use Dataset.padded_batch.

ds_series_batch = ds_series.shuffle(20).padded_batch(10, padded_shapes=([], [None]))

ids, sequence_batch = next(iter(ds_series_batch))
print(ids.numpy())
print()
print(sequence_batch.numpy())

[ 6 1 10 0 3 17 12 9 5 23]


[[ 0.5812 -0.825 0.6075 -1.3856 -0.8151 -1.1908 0. 0. ]
[-0.7208 0.0611 0.0084 0.6592 0.8364 0.8327 -0.7164 0.8826]
[ 0.0391 -2.0019 0.4077 0.9304 0. 0. 0. 0. ]
[ 0.4397 -0.0901 -0.4993 0.3485 0.2481 0. 0. 0. ]
[ 0.0346 0. 0. 0. 0. 0. 0. 0. ]
[-1.0478 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 0.3163 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. ]]

Note: as of TensorFlow 2.2, the padded_shapes argument is no longer required; the default behavior is to pad all axes to the longest in the batch.

ds_series_batch = ds_series.shuffle(20).padded_batch(10)

For a more realistic example, try wrapping a preprocessing.image.ImageDataGenerator as a tf.data.Dataset.

First download the data:

flowers = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    untar=True)

Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz
228818944/228813984 [==============================] - 5s 0us/step

Create the image.ImageDataGenerator:

img_gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255, rotation_range=20)
images, labels = next(img_gen.flow_from_directory(flowers))

Found 3670 images belonging to 5 classes.

print(images.dtype, images.shape)
print(labels.dtype, labels.shape)

float32 (32, 256, 256, 3)
float32 (32, 5)

ds = tf.data.Dataset.from_generator(
    img_gen.flow_from_directory, args=[flowers], 
    output_types=(tf.float32, tf.float32), 
    output_shapes=([32,256,256,3], [32,5])
)

ds

<FlatMapDataset shapes: ((32, 256, 256, 3), (32, 5)), types: (tf.float32, tf.float32)>

2.3 Consuming TFRecord data

See Loading TFRecords for an end-to-end example.

The tf.data API supports a variety of file formats, so you can process datasets that do not fit in memory. For example, the TFRecord file format is a simple record-oriented binary format that many TensorFlow applications use for training data. The tf.data.TFRecordDataset class lets you use the contents of one or more TFRecord files as the input of your data pipeline.

The following uses the test file from the French Street Name Signs (FSNS) dataset as an example:

# Creates a dataset that reads all of the examples from two files.
fsns_test_file = tf.keras.utils.get_file("fsns.tfrec", "https://storage.googleapis.com/download.tensorflow.org/data/fsns-20160927/testdata/fsns-00000-of-00001")

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/fsns-20160927/testdata/fsns-00000-of-00001
7905280/7904079 [==============================] - 0s 0us/step

The filenames argument of TFRecordDataset can be a string, a list of strings, or a tf.Tensor of strings. So if you have two sets of files, one for training and one for validation, you can create a factory function that produces the dataset, taking the filenames as an input argument.
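A hypothetical factory function along those lines (the name make_tfrecord_dataset and the validation file list are illustrative, not part of the original tutorial):

def make_tfrecord_dataset(filenames, batch_size=32):
  # `filenames` may be a string, a list of strings, or a tf.Tensor of strings.
  dataset = tf.data.TFRecordDataset(filenames=filenames)
  return dataset.batch(batch_size)

train_ds = make_tfrecord_dataset([fsns_test_file])
# val_ds = make_tfrecord_dataset(validation_files)  # hypothetical second set of files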

dataset = tf.data.TFRecordDataset(filenames = [fsns_test_file])
dataset

<TFRecordDatasetV2 shapes: (), types: tf.string>

Many TensorFlow projects use serialized tf.train.Example records in their TFRecord files. These need to be decoded before they can be inspected:

raw_example = next(iter(dataset))
parsed = tf.train.Example.FromString(raw_example.numpy())

parsed.features.feature['image/text']

bytes_list {
  value: "Rue Perreyon"
}

2.4 Consuming text data

See Loading Text for an end to end example.

Many datasets are distributed as one or more text files. tf.data.TextLineDataset extracts lines from one or more text files: given one or more filenames, it produces one string-valued element per line of those files.

directory_url = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
file_names = ['cowper.txt', 'derby.txt', 'butler.txt']

file_paths = [
    tf.keras.utils.get_file(file_name, directory_url + file_name)
    for file_name in file_names
]

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt
819200/815980 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt
811008/809730 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt
811008/807992 [==============================] - 0s 0us/step

dataset = tf.data.TextLineDataset(file_paths)

Here are the first few lines of the first file:

for line in dataset.take(5):
  print(line.numpy())

b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus’ son;"
b'His wrath pernicious, who ten thousand woes'
b"Caused to Achaia’s host, sent many a soul"
b'Illustrious into Ades premature,'
b'And Heroes gave (so stood the will of Jove)'

Dataset.interleave lets you alternate between reading the individual files, which makes it easier to mix lines from different files together.

files_ds = tf.data.Dataset.from_tensor_slices(file_paths)
lines_ds = files_ds.interleave(tf.data.TextLineDataset, cycle_length=3)

for i, line in enumerate(lines_ds.take(9)):
  if i % 3 == 0:
    print()
  print(line.numpy())

b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus’ son;"
b"\xef\xbb\xbfOf Peleus’ son, Achilles, sing, O Muse,"
b’\xef\xbb\xbfSing, O goddess, the anger of Achilles son of Peleus, that brought’


b’His wrath pernicious, who ten thousand woes’
b’The vengeance, deep and deadly; whence to Greece’
b’countless ills upon the Achaeans. Many a brave soul did it send’


b"Caused to Achaia’s host, sent many a soul"
b’Unnumbered ills arose; which many a soul’
b’hurrying down to Hades, and many a hero did it yield a prey to dogs and’

By default, TextLineDataset yields every line of each file, which may not be desirable: for example, if a file starts with a header line or contains comments. These lines can be removed with the Dataset.skip() and Dataset.filter() transformations.

Here is an example with the Titanic dataset, demonstrating how to strip the header line and filter to find only survivors:

titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")
titanic_lines = tf.data.TextLineDataset(titanic_file)

Downloading data from https://storage.googleapis.com/tf-datasets/titanic/train.csv
32768/30874 [===============================] - 0s 0us/step

for line in titanic_lines.take(10):
  print(line.numpy())

b'survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone'
b'0,male,22.0,1,0,7.25,Third,unknown,Southampton,n'
b'1,female,38.0,1,0,71.2833,First,C,Cherbourg,n'
b'1,female,26.0,0,0,7.925,Third,unknown,Southampton,y'
b'1,female,35.0,1,0,53.1,First,C,Southampton,n'
b'0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y'
b'0,male,2.0,3,1,21.075,Third,unknown,Southampton,n'
b'1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n'
b'1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n'
b'1,female,4.0,1,1,16.7,Third,G,Southampton,n'

def survived(line):
  return tf.not_equal(tf.strings.substr(line, 0, 1), "0")

survivors = titanic_lines.skip(1).filter(survived)
for line in survivors.take(10):
  print(line.numpy())

b'1,female,38.0,1,0,71.2833,First,C,Cherbourg,n'
b'1,female,26.0,0,0,7.925,Third,unknown,Southampton,y'
b'1,female,35.0,1,0,53.1,First,C,Southampton,n'
b'1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n'
b'1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n'
b'1,female,4.0,1,1,16.7,Third,G,Southampton,n'
b'1,male,28.0,0,0,13.0,Second,unknown,Southampton,y'
b'1,female,28.0,0,0,7.225,Third,unknown,Cherbourg,y'
b'1,male,28.0,0,0,35.5,First,A,Southampton,y'
b'1,female,38.0,1,5,31.3875,Third,unknown,Southampton,n'

2.5 Consuming CSV data

See Loading CSV Files, and Loading Pandas DataFrames for more examples.

CSV is a popular file format for storing tabular data as plain text.

For example:

titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")  # download the data
df = pd.read_csv(titanic_file, index_col=None)
df.head()
survived sex age n_siblings_spouses parch fare class deck embark_town alone
0 0 male 22.0 1 0 7.2500 Third unknown Southampton n
1 1 female 38.0 1 0 71.2833 First C Cherbourg n
2 1 female 26.0 0 0 7.9250 Third unknown Southampton y
3 1 female 35.0 1 0 53.1000 First C Southampton n
4 0 male 28.0 0 0 8.4583 Third unknown Queenstown y

If your data is small enough to fit in memory, the Dataset.from_tensor_slices method also accepts a dictionary, which makes importing this data very convenient:

titanic_slices = tf.data.Dataset.from_tensor_slices(dict(df))

for feature_batch in titanic_slices.take(1):
  for key, value in feature_batch.items():
    print("  {!r:20s}: {}".format(key, value))

'survived'          : 0
'sex'               : b'male'
'age'               : 22.0
'n_siblings_spouses': 1
'parch'             : 0
'fare'              : 7.25
'class'             : b'Third'
'deck'              : b'unknown'
'embark_town'       : b'Southampton'
'alone'             : b'n'

By comparison, reading the data directly from disk is a more flexible approach.

The tf.data module provides methods to extract records from one or more CSV files that comply with RFC 4180.

The experimental.make_csv_dataset function is a high-level API for reading CSV files. It supports column type inference and many other features (such as batching and shuffling) to simplify its use.

titanic_batches = tf.data.experimental.make_csv_dataset(
    titanic_file, batch_size=4,
    label_name="survived")
for feature_batch, label_batch in titanic_batches.take(1):
  print("'survived': {}".format(label_batch))
  print("features:")
  for key, value in feature_batch.items():
    print("  {!r:20s}: {}".format(key, value))

'survived': [1 1 0 0]
features:
  'sex'               : [b'female' b'female' b'male' b'male']
  'age'               : [28. 24. 29. 28.]
  'n_siblings_spouses': [0 1 0 0]
  'parch'             : [0 0 0 0]
  'fare'              : [ 7.2292 26. 30. 7.725 ]
  'class'             : [b'Third' b'Second' b'First' b'Third']
  'deck'              : [b'unknown' b'unknown' b'D' b'unknown']
  'embark_town'       : [b'Cherbourg' b'Southampton' b'Southampton' b'Queenstown']
  'alone'             : [b'y' b'n' b'y' b'y']

If you only need a subset of the columns, use the select_columns argument:

titanic_batches = tf.data.experimental.make_csv_dataset(
    titanic_file, batch_size=4,
    label_name="survived", select_columns=['class', 'fare', 'survived'])
for feature_batch, label_batch in titanic_batches.take(1):
  print("'survived': {}".format(label_batch))
  for key, value in feature_batch.items():
    print("  {!r:20s}: {}".format(key, value))

'survived': [1 0 1 0]
  'fare'              : [ 10.5 7.25 23. 106.425]
  'class'             : [b'Second' b'Third' b'Second' b'First']

There is also a lower-level experimental.CsvDataset class, which provides finer-grained control but does not support column type inference; instead, you must specify the type of each column.

titanic_types  = [tf.int32, tf.string, tf.float32, tf.int32, tf.int32, tf.float32, tf.string, tf.string, tf.string, tf.string] 
dataset = tf.data.experimental.CsvDataset(titanic_file, titanic_types , header=True)

for line in dataset.take(10):
  print([item.numpy() for item in line])

[0, b'male', 22.0, 1, 0, 7.25, b'Third', b'unknown', b'Southampton', b'n']
[1, b'female', 38.0, 1, 0, 71.2833, b'First', b'C', b'Cherbourg', b'n']
[1, b'female', 26.0, 0, 0, 7.925, b'Third', b'unknown', b'Southampton', b'y']
[1, b'female', 35.0, 1, 0, 53.1, b'First', b'C', b'Southampton', b'n']
[0, b'male', 28.0, 0, 0, 8.4583, b'Third', b'unknown', b'Queenstown', b'y']
[0, b'male', 2.0, 3, 1, 21.075, b'Third', b'unknown', b'Southampton', b'n']
[1, b'female', 27.0, 0, 2, 11.1333, b'Third', b'unknown', b'Southampton', b'n']
[1, b'female', 14.0, 1, 0, 30.0708, b'Second', b'unknown', b'Cherbourg', b'n']
[1, b'female', 4.0, 1, 1, 16.7, b'Third', b'G', b'Southampton', b'n']
[0, b'male', 20.0, 0, 0, 8.05, b'Third', b'unknown', b'Southampton', b'y']

If some columns are empty, this low-level interface lets you provide default values instead of column types. (The %%writefile cell magic below only works in IPython/Jupyter.)

%%writefile missing.csv
1,2,3,4
,2,3,4
1,,3,4
1,2,,4
1,2,3,
,,,

Writing missing.csv

# Creates a dataset that reads all of the records from two CSV files, each with
# four float columns which may have missing values.

record_defaults = [999,999,999,999]
dataset = tf.data.experimental.CsvDataset("missing.csv", record_defaults)
dataset = dataset.map(lambda *items: tf.stack(items))
dataset

<MapDataset shapes: (4,), types: tf.int32>

for line in dataset:
  print(line.numpy())

[  1   2   3   4]
[999   2   3   4]
[  1 999   3   4]
[  1   2 999   4]
[  1   2   3 999]
[999 999 999 999]

By default, a CsvDataset yields every column of every line of the file, which may not be desirable: for example, if the file starts with a header line that should be ignored, or if some columns should be dropped. Use the header and select_cols arguments to handle these cases.

# Creates a dataset that reads all of the records from two CSV files with
# headers, extracting float data from columns 2 and 4.
record_defaults = [999, 999] # Only provide defaults for the selected columns
dataset = tf.data.experimental.CsvDataset("missing.csv", record_defaults, select_cols=[1, 3])
dataset = dataset.map(lambda *items: tf.stack(items))
dataset

<MapDataset shapes: (2,), types: tf.int32>

for line in dataset:
  print(line.numpy())

[2 4]
[2 4]
[999 4]
[2 4]
[2 999]
[999 999]

2.6 Consuming sets of files

Many datasets consist of a large number of files, where each file stores a single example.

flowers_root = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    untar=True)
flowers_root = pathlib.Path(flowers_root)

The root directory contains a directory for each class:

for item in flowers_root.glob("*"):
  print(item.name)

sunflowers
daisy
LICENSE.txt
roses
tulips
dandelion

The files in each class directory are the examples of that class:

list_ds = tf.data.Dataset.list_files(str(flowers_root/'*/*'))

for f in list_ds.take(5):
  print(f.numpy())

b'/home/kbuilder/.keras/datasets/flower_photos/roses/2980099495_cf272e90ca_m.jpg'
b'/home/kbuilder/.keras/datasets/flower_photos/sunflowers/14678298676_6db8831ee6_m.jpg'
b'/home/kbuilder/.keras/datasets/flower_photos/tulips/485266837_671def8627.jpg'
b'/home/kbuilder/.keras/datasets/flower_photos/daisy/7377004908_5bc0cde347_n.jpg'
b'/home/kbuilder/.keras/datasets/flower_photos/dandelion/9726260379_4e8ee66875_m.jpg'

Read the data with the tf.io.read_file function and extract the label from the path, returning (image, label) pairs:

def process_path(file_path):
  label = tf.strings.split(file_path, '/')[-2]
  return tf.io.read_file(file_path), label

labeled_ds = list_ds.map(process_path)
for image_raw, label_text in labeled_ds.take(1):
  print(repr(image_raw.numpy()[:100]))
  print()
  print(label_text.numpy())

b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xfe\x00\x1ccmp3.10.3.2Lq3 0xad6b4f35\x00\xff\xdb\x00C\x00\x03\x02\x02\x03\x02\x02\x03\x03\x03\x03\x04\x03\x03\x04\x05\x08\x05\x05\x04\x04\x05\n\x07\x07\x06\x08\x0c\n\x0c\x0c\x0b\n\x0b\x0b\r\x0e\x12\x10\r\x0e\x11\x0e\x0b\x0b\x10'


b'roses'

3. Batching dataset elements

3.1 Simple batching (stacking)

The simplest form of batching stacks n consecutive elements of a dataset into a single element. Dataset.batch() does exactly this, with the same restrictions as the tf.stack() operator applied to each component of the elements: i.e. for each component i, all elements must have a tensor of exactly the same shape.

inc_dataset = tf.data.Dataset.range(100)
dec_dataset = tf.data.Dataset.range(0, -100, -1)
dataset = tf.data.Dataset.zip((inc_dataset, dec_dataset))
batched_dataset = dataset.batch(4)

for batch in batched_dataset.take(4):
  print([arr.numpy() for arr in batch])

[array([0, 1, 2, 3]), array([ 0, -1, -2, -3])]
[array([4, 5, 6, 7]), array([-4, -5, -6, -7])]
[array([ 8, 9, 10, 11]), array([ -8, -9, -10, -11])]
[array([12, 13, 14, 15]), array([-12, -13, -14, -15])]

Dataset.batch tends to produce batches with an unknown size, because the last batch may not be full. Note the Nones in the shapes:

batched_dataset

<BatchDataset shapes: ((None,), (None,)), types: (tf.int64, tf.int64)>

Use the drop_remainder argument to ignore that last batch and get full shape propagation:

batched_dataset = dataset.batch(7, drop_remainder=True)
batched_dataset

<BatchDataset shapes: ((7,), (7,)), types: (tf.int64, tf.int64)>

3.2 Batching tensors with padding

The recipe above works for tensors that all have the same size. However, many models (e.g. sequence models) work with input data whose size can vary (e.g. sequences of different lengths). To handle this case, Dataset.padded_batch() lets you batch tensors of different shapes by specifying one or more dimensions in which they may be padded.

dataset = tf.data.Dataset.range(100)
dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))
dataset = dataset.padded_batch(4, padded_shapes=(None,))

for batch in dataset.take(2):
  print(batch.numpy())
  print()

[[0 0 0]
 [1 0 0]
 [2 2 0]
 [3 3 3]]


[[4 4 4 4 0 0 0]
 [5 5 5 5 5 0 0]
 [6 6 6 6 6 6 0]
 [7 7 7 7 7 7 7]]

Dataset.padded_batch() lets you set different padding for each dimension of each component, and the padding may be variable length (indicated by None) or constant length. You can also change the padding value, which defaults to 0.
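A small sketch of those options, reusing the same variable-length dataset as above: pad every sequence to a constant length of 10 and use -1 as the padding value instead of 0 (the specific values are illustrative):

dataset = tf.data.Dataset.range(100)
dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))

# Constant-length padding (10) with a custom padding value (-1).
padded = dataset.padded_batch(
    4, padded_shapes=(10,), padding_values=tf.constant(-1, dtype=tf.int64))

for batch in padded.take(1):
  print(batch.numpy())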

4. Training workflows

4.1 Repeating data over multiple epochs

The tf.data API offers two main ways to process multiple epochs of the same data.

  • The simplest way is to use Dataset.repeat().

The example below demonstrates this:

titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")
titanic_lines = tf.data.TextLineDataset(titanic_file)
def plot_batch_sizes(ds):
  batch_sizes = [batch.shape[0] for batch in ds]
  plt.bar(range(len(batch_sizes)), batch_sizes)
  plt.xlabel('Batch number')
  plt.ylabel('Batch size')

If you call Dataset.repeat() with no arguments, the input is repeated indefinitely.

Dataset.repeat with no arguments concatenates its repetitions seamlessly, without signaling the end of one epoch and the beginning of the next. Because of this, a Dataset.batch applied after Dataset.repeat will yield batches that straddle epoch boundaries:

titanic_batches = titanic_lines.repeat(3).batch(128)

plot_batch_sizes(titanic_batches)

[Figure: batch sizes when repeat() comes before batch(); only the very last batch is smaller]
If you need clear epoch separation, put Dataset.batch before Dataset.repeat:

titanic_batches = titanic_lines.batch(128).repeat(3)

plot_batch_sizes(titanic_batches)

[Figure: batch sizes when batch() comes before repeat(); the last batch of every epoch is smaller]
If you would like to perform a custom computation (e.g. to collect statistics) at the end of each epoch, it is simplest to restart the dataset iteration on each epoch:

epochs = 3
dataset = titanic_lines.batch(128)

for epoch in range(epochs):
  for batch in dataset:
    print(batch.shape)
  print("End of epoch: ", epoch)

(128,)
(128,)
(128,)
(128,)
(116,)
End of epoch: 0
(128,)
(128,)
(128,)
(128,)
(116,)
End of epoch: 1
(128,)
(128,)
(128,)
(128,)
(116,)
End of epoch: 2

4.2 Randomly shuffling input data

Dataset.shuffle() maintains a fixed-size buffer and picks the next element uniformly at random from that buffer.

Note: a larger buffer_size shuffles more thoroughly, but it can take a lot of memory and significant time to fill (elements are only produced once the buffer is full). If this becomes a problem, consider using Dataset.interleave instead.

Add an index to the dataset so the effect can be seen:

lines = tf.data.TextLineDataset(titanic_file)
counter = tf.data.experimental.Counter()

dataset = tf.data.Dataset.zip((counter, lines))
dataset = dataset.shuffle(buffer_size=100)
dataset = dataset.batch(20)
dataset

<BatchDataset shapes: ((None,), (None,)), types: (tf.int64, tf.string)>

Since buffer_size is 100 and the batch size is 20, the first batch contains no elements with an index above 120.

n,line_batch = next(iter(dataset))
print(n.numpy())

[ 92 84 52 3 27 100 44 26 2 63 54 93 69 97 10 101 32 65
109 40]

The ordering of batch/repeat relative to shuffle matters:

Dataset.shuffle doesn't signal the end of an epoch until the shuffle buffer is empty.

Shuffle before repeat

So with a shuffle placed before a repeat, every element of one epoch is shown before moving to the next (only then does the next epoch's data enter the shuffle buffer):

dataset = tf.data.Dataset.zip((counter, lines))
shuffled = dataset.shuffle(buffer_size=100).batch(10).repeat(2)

print("Here are the item ID's near the epoch boundary:\n")
for n, line_batch in shuffled.skip(60).take(5):
  print(n.numpy())

Here are the item ID's near the epoch boundary:


[523 318 510 467 627 433 514 594 454 560]
[596 566 205 613 493 570 615 411 556 496]
[598 528 623 559 299 473 391 536]
[41 14 51 3 97 70 34 99 63 52]
[ 49 69 104 0 112 90 38 88 11 83]

shuffle_repeat = [n.numpy().mean() for n, line_batch in shuffled]
plt.plot(shuffle_repeat, label="shuffle().repeat()")
plt.ylabel("Mean item ID")
plt.legend()

[Figure: mean item ID per batch for shuffle().repeat(); the mean drops sharply at the epoch boundary]
Repeat before shuffle

With a repeat placed before a shuffle, when the current epoch ends, the beginning of the next epoch enters the shuffle buffer and is mixed with the tail of the previous epoch.

dataset = tf.data.Dataset.zip((counter, lines))
shuffled = dataset.repeat(2).shuffle(buffer_size=100).batch(10)

print("Here are the item ID's near the epoch boundary:\n")
for n, line_batch in shuffled.skip(55).take(15):
  print(n.numpy())

Here are the item ID's near the epoch boundary:


[545 576 610 588 0 595 582 10 597 495]
[353 540 7 490 440 563 559 27 600 504]
[624 476 25 519 608 525 477 30 560 363]
[468 34 3 32 47 22 609 449 627 20 ]
[611 599 577 541 62 13 601 606 15 18 ]
[26 43 607 434 73 616 55 552 57 6 ]
[587 544 584 1 16 51 596 614 21 50 ]
[39 46 76 40 78 71 37 28 2 69 ]
[574 24 88 12 543 100 89 68 445 83 ]
[441 619 557 97 113 96 38 79 613 92 ]
[29 414 65 462 537 232 126 118 75 11 ]
[87 121 80 585 114 72 99 112 102 589]
[77 61 542 369 8 133 129 567 136 344]
[81 91 139 128 49 66 565 64 152 90 ]
[538 494 154 547 131 147 166 158 111 165]

repeat_shuffle = [n.numpy().mean() for n, line_batch in shuffled]

plt.plot(shuffle_repeat, label="shuffle().repeat()")
plt.plot(repeat_shuffle, label="repeat().shuffle()")
plt.ylabel("Mean item ID")
plt.legend()

[Figure: mean item ID per batch for shuffle().repeat() vs repeat().shuffle(); repeat().shuffle() blurs the epoch boundary]

5. Preprocessing data

Dataset.map(f) applies the function f to each element of the dataset and returns the transformed dataset. It is extremely convenient for data preprocessing.

Note: the arguments to f, and its return values, must all be tf.Tensor objects.
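As a minimal sketch of that contract, the mapped function below receives a tf.Tensor for each element and returns a tf.Tensor:

dataset = tf.data.Dataset.range(5)
# x is a scalar tf.Tensor; the returned value is also a tf.Tensor.
doubled = dataset.map(lambda x: x * 2)
print([x.numpy() for x in doubled])  # [0, 2, 4, 6, 8]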

5.1 Preprocessing data with Dataset.map()

When training a neural network on real-world data, it is often necessary to convert images to a common size so that they can be batched together. This section demonstrates how to use Dataset.map() to decode images and resize them.

Again using the flowers dataset as an example:

list_ds = tf.data.Dataset.list_files(str(flowers_root/'*/*'))

Write a function that parses each element of list_ds:

# Reads an image from a file, decodes it into a dense tensor, and resizes it
# to a fixed shape.
def parse_image(filename):
  parts = tf.strings.split(filename, '/')
  label = parts[-2]

  image = tf.io.read_file(filename)
  image = tf.image.decode_jpeg(image)
  image = tf.image.convert_image_dtype(image, tf.float32)
  image = tf.image.resize(image, [128, 128])
  return image, label

Test that the function works:

file_path = next(iter(list_ds))
image, label = parse_image(file_path)

def show(image, label):
  plt.figure()
  plt.imshow(image)
  plt.title(label.numpy().decode('utf-8'))
  plt.axis('off')

show(image, label)

[Figure: a decoded, resized flower image with its label as the title]
Apply the parse_image function to the whole list_ds dataset:

images_ds = list_ds.map(parse_image)

for image, label in images_ds.take(2): # look at 2 examples to verify correctness
  show(image, label)

[Figure: two example images with their labels]

5.2 Preprocessing data with non-TF functions

Preprocessing with non-TF functions performs worse than using built-in TF functions (Python is slower than C++, and crossing the language boundary is also a bottleneck), so use built-in TF functions whenever possible. Sometimes, however, it is convenient to call Python library functions for preprocessing. You can do so by calling tf.py_function() inside a Dataset.map() transformation.

For example, suppose you want to apply an arbitrary rotation to the images, but tf.image only has tf.image.rot90, which is not very useful for data augmentation.

Note: tensorflow_addons has a TF-compatible rotate function in tensorflow_addons.image.rotate.
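A sketch of that TF-native alternative, assuming the tensorflow_addons package is installed (tfa.image.rotate expects the angle in radians):

import tensorflow_addons as tfa

def tfa_random_rotate_image(image):
  # Random angle of roughly +/-30 degrees, expressed in radians.
  angle = tf.random.uniform([], minval=-np.pi / 6, maxval=np.pi / 6)
  return tfa.image.rotate(image, angle)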

Alternatively, to implement the random rotation above with a plain Python library, we can use the scipy.ndimage.rotate function:

import scipy.ndimage as ndimage

def random_rotate_image(image):
  image = ndimage.rotate(image, np.random.uniform(-30, 30), reshape=False)
  return image
image, label = next(iter(images_ds))
image = random_rotate_image(image)
show(image, label)

Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).

[Figure: a randomly rotated flower image]
To use random_rotate_image inside Dataset.map, we need to describe the returned shapes and types:

def tf_random_rotate_image(image, label):
  im_shape = image.shape
  [image,] = tf.py_function(random_rotate_image, [image], [tf.float32])
  image.set_shape(im_shape)
  return image, label
rot_ds = images_ds.map(tf_random_rotate_image)

for image, label in rot_ds.take(2):
  show(image, label)

Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).

[Figure: two randomly rotated example images]

5.3 Parsing tf.Example protocol buffer messages

Many input pipelines extract tf.train.Example protocol buffer messages from TFRecord files. Each tf.train.Example record contains one or more "features," and the input pipeline typically converts these features into tensors.

fsns_test_file = tf.keras.utils.get_file("fsns.tfrec", "https://storage.googleapis.com/download.tensorflow.org/data/fsns-20160927/testdata/fsns-00000-of-00001")
dataset = tf.data.TFRecordDataset(filenames = [fsns_test_file])
dataset

<TFRecordDatasetV2 shapes: (), types: tf.string>

You can work with tf.train.Example protos outside of a tf.data.Dataset to understand the data:

raw_example = next(iter(dataset))
parsed = tf.train.Example.FromString(raw_example.numpy())

feature = parsed.features.feature
raw_img = feature['image/encoded'].bytes_list.value[0]
img = tf.image.decode_png(raw_img)
plt.imshow(img)
plt.axis('off')
_ = plt.title(feature["image/text"].bytes_list.value[0])

[Figure: the decoded FSNS image, titled with its street-name text]

raw_example = next(iter(dataset))
def tf_parse(eg):
  example = tf.io.parse_example(
      eg[tf.newaxis], {
          'image/encoded': tf.io.FixedLenFeature(shape=(), dtype=tf.string),
          'image/text': tf.io.FixedLenFeature(shape=(), dtype=tf.string)
      })
  return example['image/encoded'][0], example['image/text'][0]
img, txt = tf_parse(raw_example)
print(txt.numpy())
print(repr(img.numpy()[:20]), "...")

b'Rue Perreyon'
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x02X' ...

decoded = dataset.map(tf_parse)
decoded

<MapDataset shapes: ((), ()), types: (tf.string, tf.string)>

image_batch, text_batch = next(iter(decoded.batch(10)))
image_batch.shape

TensorShape([10])

5.4 Time series windowing

For an end to end time series example see: Time series forecasting.

Time series data is usually organized with the time axis intact.

Use Dataset.range to simulate a time series:

range_ds = tf.data.Dataset.range(100000)

Typically, models based on this kind of data want contiguous time slices.

The simplest approach is to batch the data:

5.4.1 Using batch

batches = range_ds.batch(10, drop_remainder=True)

for batch in batches.take(5):
  print(batch.numpy())

[0 1 2 3 4 5 6 7 8 9 ]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47 48 49]

To make dense predictions one step into the future, you can shift the features and labels by one step relative to each other:

def dense_1_step(batch):
  # Shift features and labels one step relative to each other.
  return batch[:-1], batch[1:]

predict_dense_1_step = batches.map(dense_1_step)

for features, label in predict_dense_1_step.take(3):
  print(features.numpy(), " => ", label.numpy())

[0 1 2 3 4 5 6 7 8 ] => [1 2 3 4 5 6 7 8 9 ]
[10 11 12 13 14 15 16 17 18] => [11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28] => [21 22 23 24 25 26 27 28 29]

To predict a whole window instead of a fixed offset, you can split the batches into two parts:

batches = range_ds.batch(15, drop_remainder=True)

def label_next_5_steps(batch):
  return (batch[:-5],   # Take the first 5 steps
          batch[-5:])   # take the remainder

predict_5_steps = batches.map(label_next_5_steps)

for features, label in predict_5_steps.take(3):
  print(features.numpy(), " => ", label.numpy())

[0 1 2 3 4 5 6 7 8 9 ] => [10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24] => [25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39] => [40 41 42 43 44]

To allow some overlap between the features of one batch and the labels of another, use Dataset.zip:

feature_length = 10
label_length = 5

features = range_ds.batch(feature_length, drop_remainder=True)
labels = range_ds.batch(feature_length).skip(1).map(lambda labels: labels[:-5])

predict_5_steps = tf.data.Dataset.zip((features, labels))

for features, label in predict_5_steps.take(3):
  print(features.numpy(), " => ", label.numpy())

[0 1 2 3 4 5 6 7 8 9 ] => [10 11 12 13 14]
[10 11 12 13 14 15 16 17 18 19] => [20 21 22 23 24]
[20 21 22 23 24 25 26 27 28 29] => [30 31 32 33 34]

5.4.2 Using window

While using Dataset.batch works, there are situations where you may need finer control. The Dataset.window method gives you complete control, but requires some care: it returns a Dataset of Datasets. See Section 1.1 for details.

window_size = 5

windows = range_ds.window(window_size, shift=1)
for sub_ds in windows.take(5):
  print(sub_ds)

<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>

The Dataset.flat_map method can take a dataset of datasets and flatten it into a single dataset:

 for x in windows.flat_map(lambda x: x).take(30):
   print(x.numpy(), end=' ')

WARNING:tensorflow:AutoGraph could not transform <function <lambda> at 0x7f973007e6a8> and will run it as-is.
Cause: could not parse the source code:


for x in windows.flat_map(lambda x: x).take(30):


This error may be avoided by creating the lambda in a standalone statement.


To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING: AutoGraph could not transform <function <lambda> at 0x7f973007e6a8> and will run it as-is.
Cause: could not parse the source code:


for x in windows.flat_map(lambda x: x).take(30):


This error may be avoided by creating the lambda in a standalone statement.


To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
0 1 2 3 4 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9

In nearly all cases, you will want to batch the dataset first:

def sub_to_batch(sub):
  return sub.batch(window_size, drop_remainder=True)

for example in windows.flat_map(sub_to_batch).take(5):
  print(example.numpy())

[0 1 2 3 4]
[1 2 3 4 5]
[2 3 4 5 6]
[3 4 5 6 7]
[4 5 6 7 8]

Now you can see that the shift argument controls how much each window moves over.

Putting all of this together, you might build the following function:

def make_window_dataset(ds, window_size=5, shift=1, stride=1):
  windows = ds.window(window_size, shift=shift, stride=stride)

  def sub_to_batch(sub):
    return sub.batch(window_size, drop_remainder=True)

  windows = windows.flat_map(sub_to_batch)
  return windows
ds = make_window_dataset(range_ds, window_size=10, shift = 5, stride=3)

for example in ds.take(10):
  print(example.numpy())

[0 3 6 9 12 15 18 21 24 27]
[5 8 11 14 17 20 23 26 29 32]
[10 13 16 19 22 25 28 31 34 37]
[15 18 21 24 27 30 33 36 39 42]
[20 23 26 29 32 35 38 41 44 47]
[25 28 31 34 37 40 43 46 49 52]
[30 33 36 39 42 45 48 51 54 57]
[35 38 41 44 47 50 53 56 59 62]
[40 43 46 49 52 55 58 61 64 67]
[45 48 51 54 57 60 63 66 69 72]

Then it is easy to extract labels, as before:

dense_labels_ds = ds.map(dense_1_step)

for inputs,labels in dense_labels_ds.take(3):
  print(inputs.numpy(), "=>", labels.numpy())

[0 3 6 9 12 15 18 21 24] => [3 6 9 12 15 18 21 24 27]
[5 8 11 14 17 20 23 26 29] => [8 11 14 17 20 23 26 29 32]
[10 13 16 19 22 25 28 31 34] => [13 16 19 22 25 28 31 34 37]

5.5 Resampling

When working with a dataset that is very class-imbalanced, you may want to resample the dataset. tf.data provides two methods to do this. The credit card fraud dataset is a good example of this kind of problem.

Note: see the Imbalanced data tutorial for a full walkthrough.

zip_path = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/download.tensorflow.org/data/creditcard.zip',
    fname='creditcard.zip',
    extract=True)

csv_path = zip_path.replace('.zip', '.csv')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/creditcard.zip
69156864/69155632 [==============================] - 2s 0us/step

creditcard_ds = tf.data.experimental.make_csv_dataset(
    csv_path, batch_size=1024, label_name="Class",
    # Set the column types: 30 floats and an int.
    column_defaults=[float()]*30+[int()])

Check the class distribution; the classes are highly imbalanced:

def count(counts, batch):
  features, labels = batch
  class_1 = labels == 1
  class_1 = tf.cast(class_1, tf.int32)

  class_0 = labels == 0
  class_0 = tf.cast(class_0, tf.int32)

  counts['class_0'] += tf.reduce_sum(class_0)
  counts['class_1'] += tf.reduce_sum(class_1)

  return counts
counts = creditcard_ds.take(10).reduce(
    initial_state={'class_0': 0, 'class_1': 0},
    reduce_func = count)

counts = np.array([counts['class_0'].numpy(),
                   counts['class_1'].numpy()]).astype(np.float32)

fractions = counts/counts.sum()
print(fractions)

[0.995 0.005]

A common approach to training with an imbalanced dataset is to balance it. tf.data includes a few methods that enable this workflow:

5.5.1 Datasets.sampling

One approach to resampling a dataset is to use sample_from_datasets. This works well when you have a separate data.Dataset for each class.

Here, just use filter to generate the per-class datasets from the credit card fraud data:

negative_ds = (
  creditcard_ds
    .unbatch()
    .filter(lambda features, label: label==0)
    .repeat())
positive_ds = (
  creditcard_ds
    .unbatch()
    .filter(lambda features, label: label==1)
    .repeat())

WARNING:tensorflow:AutoGraph could not transform <function <lambda> at 0x7f9730114598> and will run it as-is.
Cause: could not parse the source code:


.filter(lambda features, label: label==0)


This error may be avoided by creating the lambda in a standalone statement.


To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING: AutoGraph could not transform <function <lambda> at 0x7f9730114598> and will run it as-is.
Cause: could not parse the source code:


.filter(lambda features, label: label==0)


This error may be avoided by creating the lambda in a standalone statement.


To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function <lambda> at 0x7f97301149d8> and will run it as-is.
Cause: could not parse the source code:


.filter(lambda features, label: label==1)


This error may be avoided by creating the lambda in a standalone statement.


To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING: AutoGraph could not transform <function <lambda> at 0x7f97301149d8> and will run it as-is.
Cause: could not parse the source code:


.filter(lambda features, label: label==1)


This error may be avoided by creating the lambda in a standalone statement.


To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert

for features, label in positive_ds.batch(10).take(1):
  print(label.numpy())

[1 1 1 1 1 1 1 1 1 1]

To balance the datasets with tf.data.experimental.sample_from_datasets, do the following:

balanced_ds = tf.data.experimental.sample_from_datasets(
    [negative_ds, positive_ds], [0.5, 0.5]).batch(10)

Now the dataset produces examples of each class with a 50/50 probability:

for features, labels in balanced_ds.take(10):
  print(labels.numpy())

[1 0 1 1 0 1 0 0 0 0]
[1 1 0 0 0 0 0 1 0 1]
[1 1 0 0 0 1 0 0 1 1]
[1 0 0 1 0 1 1 0 0 0]
[0 0 1 0 0 0 0 1 1 1]
[0 1 1 1 1 0 0 1 0 1]
[0 0 0 0 1 0 1 1 1 1]
[0 0 0 1 1 1 0 0 0 1]
[1 1 0 1 1 1 1 1 1 0]
[1 1 1 1 0 1 0 0 1 1]

5.5.2 experimental.rejection_resample

One problem with using experimental.sample_from_datasets is that it needs a separate tf.data.Dataset per class. This can be built with Dataset.filter, but it causes the data to be loaded twice.

The data.experimental.rejection_resample function can be applied to a dataset to rebalance it while only loading the data once. Elements will be dropped from the dataset to achieve balance.

data.experimental.rejection_resample takes a class_func argument. The class_func is applied to each dataset element and is used to determine which class an example belongs to for the purposes of balancing.

The elements of creditcard_ds are already (features, label) pairs, so the class_func just needs to return those labels:

def class_func(features, label):
  return label

The resampler also needs a target distribution, and optionally an initial distribution estimate:

resampler = tf.data.experimental.rejection_resample(
    class_func, target_dist=[0.5, 0.5], initial_dist=fractions)

The resampler deals with individual examples, so you must unbatch the dataset before applying it:

resample_ds = creditcard_ds.unbatch().apply(resampler).batch(10)

WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/data/experimental/ops/resampling.py:156: Print (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2018-08-20.
Instructions for updating:
Use tf.print instead of tf.Print. Note that tf.print returns a no-output operator that directly prints the output. Outside of defuns or eager mode, this operator will not be executed unless it is directly specified in session.run or used as a control dependency for other operators. This is only a concern in graph mode. Below is an example of how to ensure tf.print executes in graph mode:

The resampler returns (class, example) pairs, where the class is the output of class_func. In this case, each example is already a (feature, label) pair, so use map to drop the extra copy of the labels:

balanced_ds = resample_ds.map(lambda extra_label, features_and_label: features_and_label)

Now the dataset produces examples of each class with a 50/50 probability:

for features, labels in balanced_ds.take(10):
  print(labels.numpy())

[1 1 1 1 0 0 1 0 0 0]
[1 1 1 1 1 1 1 1 0 1]
[1 1 0 1 0 0 0 1 0 0]
[0 0 0 1 1 0 0 1 1 0]
[1 0 1 0 0 1 1 0 1 0]
[1 1 0 1 0 1 0 0 1 0]
[0 1 1 1 0 1 1 1 1 1]
[0 0 1 0 1 0 0 1 0 1]
[1 1 0 1 1 0 0 1 1 1]
[1 1 0 0 0 1 0 1 1 0]

6. Using tf.data with high-level APIs

6.1 Using tf.data with tf.keras

The tf.keras API greatly simplifies creating and using machine learning models. Its .fit(), .evaluate(), and .predict() APIs accept tf.data datasets as input. Here is a quick example:

train, test = tf.keras.datasets.fashion_mnist.load_data()

images, labels = train
images = images/255.0
labels = labels.astype(np.int32)
fmnist_train_ds = tf.data.Dataset.from_tensor_slices((images, labels))
fmnist_train_ds = fmnist_train_ds.shuffle(5000).batch(32)

model = tf.keras.Sequential([
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(10)
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=['accuracy'])

Model.fit and Model.evaluate both expect data and labels:

model.fit(fmnist_train_ds, epochs=2)

Epoch 1/2
WARNING:tensorflow:Layer flatten is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2. The layer has dtype float32 because its dtype defaults to floatx.


If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.


To change all layers to have dtype float64 by default, call tf.keras.backend.set_floatx('float64'). To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.


1875/1875 [==============================] - 3s 2ms/step - loss: 0.6013 - accuracy: 0.7970
Epoch 2/2
1875/1875 [==============================] - 3s 2ms/step - loss: 0.4617 - accuracy: 0.8418


<tensorflow.python.keras.callbacks.History at 0x7f97801f1588>

As you can see, tf.keras works well with tf.data.

If the input pipeline you pass to .fit() calls Dataset.repeat() during its construction, you also need to pass the steps_per_epoch argument to .fit().

model.fit(fmnist_train_ds.repeat(), epochs=2, steps_per_epoch=20)

Epoch 1/2
20/20 [==============================] - 0s 2ms/step - loss: 0.4650 - accuracy: 0.8422
Epoch 2/2
20/20 [==============================] - 0s 2ms/step - loss: 0.3897 - accuracy: 0.8797


<tensorflow.python.keras.callbacks.History at 0x7f97801f1908>

For evaluation, you pass the evaluation dataset in the same way:

loss, accuracy = model.evaluate(fmnist_train_ds)
print("Loss :", loss)
print("Accuracy :", accuracy)

1875/1875 [==============================] - 3s 2ms/step - loss: 0.4423 - accuracy: 0.8473
Loss : 0.44227170944213867
Accuracy : 0.847266674041748

For long (or repeated) datasets, you can set the number of evaluation steps:

loss, accuracy = model.evaluate(fmnist_train_ds.repeat(), steps=10)
print("Loss :", loss)
print("Accuracy :", accuracy)

10/10 [==============================] - 0s 2ms/step - loss: 0.4557 - accuracy: 0.8188
Loss : 0.45573288202285767
Accuracy : 0.8187500238418579

Labels are not required when calling Model.predict:

predict_ds = tf.data.Dataset.from_tensor_slices(images).batch(32)
result = model.predict(predict_ds, steps = 10)
print(result.shape)

(320, 10)

If your dataset does contain labels, predict will ignore them automatically.

result = model.predict(fmnist_train_ds, steps = 10)
print(result.shape)

(320, 10)

6.2 Using tf.data with tf.estimator

To use a Dataset in the input_fn of a tf.estimator.Estimator, you just need to make sure the input_fn returns a Dataset.

The official tutorial is rather brief on this topic; I recommend reading 《TensorFlow Estimator 官方文檔之----Dataset for Estimator》, which explains in more detail how to use tf.data with tf.estimator.

import tensorflow_datasets as tfds

def train_input_fn():
  titanic = tf.data.experimental.make_csv_dataset(
      titanic_file, batch_size=32,
      label_name="survived")
  titanic_batches = (
      titanic.cache().repeat().shuffle(500)
      .prefetch(tf.data.experimental.AUTOTUNE))
  return titanic_batches
embark = tf.feature_column.categorical_column_with_hash_bucket('embark_town', 32)
cls = tf.feature_column.categorical_column_with_vocabulary_list('class', ['First', 'Second', 'Third']) 
age = tf.feature_column.numeric_column('age')
import tempfile
model_dir = tempfile.mkdtemp()
model = tf.estimator.LinearClassifier(
    model_dir=model_dir,
    feature_columns=[embark, cls, age],
    n_classes=2
)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmp7xfmvz5w', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

model = model.train(input_fn=train_input_fn, steps=100)

WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/feature_column/feature_column_v2.py:560: Layer.add_variable (from tensorflow.python.keras.engine.base_layer_v1) is deprecated and will be removed in a future version.
Instructions for updating:
Please use layer.add_weight method instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/keras/optimizer_v2/ftrl.py:143: calling Constant.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0…
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmp7xfmvz5w/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0…
INFO:tensorflow:loss = 0.6931472, step = 0
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 100…
INFO:tensorflow:Saving checkpoints for 100 into /tmp/tmp7xfmvz5w/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 100…
INFO:tensorflow:Loss for final step: 0.5968354.

result = model.evaluate(train_input_fn, steps=10)

for key, value in result.items():
  print(key, ":", value)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-03-28T01:27:11Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmp7xfmvz5w/model.ckpt-100
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [1/10]
INFO:tensorflow:Evaluation [2/10]
INFO:tensorflow:Evaluation [3/10]
INFO:tensorflow:Evaluation [4/10]
INFO:tensorflow:Evaluation [5/10]
INFO:tensorflow:Evaluation [6/10]
INFO:tensorflow:Evaluation [7/10]
INFO:tensorflow:Evaluation [8/10]
INFO:tensorflow:Evaluation [9/10]
INFO:tensorflow:Evaluation [10/10]
INFO:tensorflow:Inference Time : 0.65018s
INFO:tensorflow:Finished evaluation at 2020-03-28-01:27:11
INFO:tensorflow:Saving dict for global step 100: accuracy = 0.684375, accuracy_baseline = 0.603125, auc = 0.73216105, auc_precision_recall = 0.6447562, average_loss = 0.60841894, global_step = 100, label/mean = 0.396875, loss = 0.60841894, precision = 0.76, prediction/mean = 0.31196585, recall = 0.2992126
INFO:tensorflow:Saving ‘checkpoint_path’ summary for global step 100: /tmp/tmp7xfmvz5w/model.ckpt-100
accuracy : 0.684375
accuracy_baseline : 0.603125
auc : 0.73216105
auc_precision_recall : 0.6447562
average_loss : 0.60841894
label/mean : 0.396875
loss : 0.60841894
precision : 0.76
prediction/mean : 0.31196585
recall : 0.2992126
global_step : 100

for pred in model.predict(train_input_fn):
  for key, value in pred.items():
    print(key, ":", value)
  break

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmp7xfmvz5w/model.ckpt-100
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
logits : [-0.1131]
logistic : [0.4717]
probabilities : [0.5283 0.4717]
class_ids : [0]
classes : [b'0']
all_class_ids : [0 1]
all_classes : [b'0' b'1']


Note: this post is based on the official TensorFlow guide to importing data with tf.data (Learn > Guide > tf.data).

Updated on March 29, 2020.
