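Preface

I recently set up a new conda environment and installed TensorFlow 2.0 (Beta). TF 2.0 really does change a lot; for example, Session is gone. For those of us used to building a graph first and then running it in a Session, that is rather unnerving. I still have no idea how to run things in graph mode in 2.0 (I did just notice that tf.compat contains the older API versions, lol), so for now I am nervously sticking with Keras, which the new TF strongly recommends.

Today I wanted to try a small image task with TF 2.0. The first step is loading the data, but the dataset is 11 GB, so I decided to pack it into a TFRecord file to make later use easier.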

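Introduction

TFRecord is a unified binary file format that TensorFlow provides and recommends for storing data. In principle it can hold any kind of information. Each record in the file is laid out as follows: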

type	value
uint64	length
uint32	masked_crc32_of_length
byte	data[length]
uint32	masked_crc32_of_data

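As shown above, each record in the file consists of the length of the data, a checksum of that length, the data itself, and a checksum of the data.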
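To make the layout concrete, here is a minimal pure-Python sketch that frames and reads back records using only the standard library. The function names (crc32c, masked_crc, frame_record, read_records) are made up for illustration. It assumes little-endian fields and the masked CRC-32C checksum used by the TFRecord format. In practice you would let tf.io.TFRecordWriter do this for you.

```python
import io
import struct

_CRC32C_POLY = 0x82F63B78  # reflected Castagnoli polynomial
_MASK_DELTA = 0xA282EAD8   # constant added by the TFRecord CRC mask


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C: slow, but needs no third-party package."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (_CRC32C_POLY if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def masked_crc(data: bytes) -> int:
    """Rotate the CRC right by 15 bits and add a constant, modulo 2**32."""
    crc = crc32c(data)
    return (((crc >> 15) | (crc << 17)) + _MASK_DELTA) & 0xFFFFFFFF


def frame_record(payload: bytes) -> bytes:
    """Wrap one payload as: length, CRC of length, data, CRC of data."""
    length = struct.pack("<Q", len(payload))
    return (length + struct.pack("<I", masked_crc(length))
            + payload + struct.pack("<I", masked_crc(payload)))


def read_records(stream):
    """Yield each payload from a stream of framed records, checking both CRCs."""
    while True:
        header = stream.read(12)  # uint64 length + uint32 masked CRC of the length
        if len(header) < 12:
            return  # end of stream
        length, length_crc = struct.unpack("<QI", header)
        if length_crc != masked_crc(header[:8]):
            raise ValueError("corrupted length field")
        payload = stream.read(length)
        (data_crc,) = struct.unpack("<I", stream.read(4))
        if data_crc != masked_crc(payload):
            raise ValueError("corrupted payload")
        yield payload


buffer = io.BytesIO(frame_record(b"hello") + frame_record(b"world"))
print(list(read_records(buffer)))
```

Each record therefore costs 16 bytes of framing on top of its payload: 8 for the length and 4 for each checksum.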

The core of a TFRecord file is the series of Examples it contains. An Example is a message defined with Protocol Buffers.
For instance, the Example I use here looks like this:

exam = tf.train.Example(
    features=tf.train.Features(
        feature={
            'name' : tf.train.Feature(bytes_list=tf.train.BytesList(value=[splits[-1].encode('utf-8')])),
            'shape': tf.train.Feature(int64_list=tf.train.Int64List(value=[img.shape[0], img.shape[1], img.shape[2]])),
            'data' : tf.train.Feature(bytes_list=tf.train.BytesList(value=[bytes(img.numpy())]))
        }
    )
)

As you can see, an Example message contains one Features message, and Features in turn holds a map called feature, that is, a set of key-value pairs. Each key is a string, and each value is a Feature message, which can hold one of 3 kinds of value:

BytesList
FloatList
Int64List

Note that all three are lists, so even a single value has to be wrapped in a list.
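Since every value has to be wrapped in one of these three list types, a few small helpers cut down on the boilerplate. The sketch below is one way to do it. The helper names (bytes_feature, float_feature, int64_feature) and the sample values are made up for illustration, and it assumes TensorFlow 2.x is installed.

```python
import tensorflow as tf


def bytes_feature(value: bytes) -> tf.train.Feature:
    """Wrap a single bytes value in a BytesList."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def float_feature(values) -> tf.train.Feature:
    """Wrap an iterable of floats in a FloatList."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=list(values)))


def int64_feature(values) -> tf.train.Feature:
    """Wrap an iterable of ints in an Int64List."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))


# Hypothetical sample values, only to show the three value types side by side.
example = tf.train.Example(features=tf.train.Features(feature={
    'name': bytes_feature('cat.jpg'.encode('utf-8')),
    'shape': int64_feature([480, 640, 3]),
    'score': float_feature([0.5]),
}))

# Serialize to bytes and parse back to check that nothing was lost.
restored = tf.train.Example.FromString(example.SerializeToString())
print(restored.features.feature['shape'].int64_list.value)
```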

How to create a TFRecord file

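As we saw above, a TFRecord file is made up of a series of Examples, and each Example can represent one data sample.

In TensorFlow 2.0 Beta, the API for writing a TFRecord file is tf.io.TFRecordWriter(filename, options=None). The second argument controls output options for the file (such as compression) and can usually be left alone. The first argument is the name of the file to write. Calling it returns a Writer instance.

With a Writer, we can call Writer.write(example) repeatedly to write our Examples to the file. Note that this method takes a serialized string (bytes), so each Example has to be serialized first: Writer.write(example.SerializeToString())

Once all the Examples have been written, call Writer.close() to close the file.

Example: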

import tensorflow as tf

# file_name is the output TFRecord path and file_list the list of image paths (both defined elsewhere)
writer = tf.io.TFRecordWriter(file_name)
for item in file_list:
    # item looks like .\\data\\xx(label)\\xxx.jpg
    splits = item.split('\\')
    label = splits[2]  # the directory name is the label
    img = tf.io.read_file(item)
    img = tf.image.decode_jpeg(img)  # EagerTensor of shape (height, width, channels)
    exam = tf.train.Example(
        features=tf.train.Features(
            feature={
                'name' : tf.train.Feature(bytes_list=tf.train.BytesList(value=[splits[-1].encode('utf-8')])),
                'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
                'shape': tf.train.Feature(int64_list=tf.train.Int64List(value=[img.shape[0], img.shape[1], img.shape[2]])),
                'data' : tf.train.Feature(bytes_list=tf.train.BytesList(value=[bytes(img.numpy())]))
            }
        )
    )
    writer.write(exam.SerializeToString())  # write() expects serialized bytes
writer.close()

Because TensorFlow 2.0 uses eager mode by default, img here is an EagerTensor and has to be converted to a NumPy array before its raw bytes can be stored.
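The same write loop can also be sketched with a with block, so the file is closed even if an exception is raised halfway through. The example below uses small synthetic NumPy arrays in place of decoded JPEGs and writes to a temporary path. The array sizes, labels, and file name are made up for illustration, and it assumes TensorFlow 2.x and NumPy are installed.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Hypothetical stand-ins for decoded images: three tiny 2x3 RGB arrays.
images = [np.full((2, 3, 3), i, dtype=np.uint8) for i in range(3)]
path = os.path.join(tempfile.mkdtemp(), 'demo.tfrecord')

# The writer is closed automatically when the with block exits.
with tf.io.TFRecordWriter(path) as writer:
    for label, img in enumerate(images):
        exam = tf.train.Example(features=tf.train.Features(feature={
            'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
            'shape': tf.train.Feature(int64_list=tf.train.Int64List(value=list(img.shape))),
            'data': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img.tobytes()])),
        }))
        writer.write(exam.SerializeToString())

# Count the records in the file to confirm that everything was written.
num_records = sum(1 for _ in tf.data.TFRecordDataset(path))
print(num_records)
```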

How to read a TFRecord file

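In older versions we could use tf.TFRecordReader(), but I could not find it in 2.0, so we use tf.data.TFRecordDataset(filename) instead. It returns a Dataset (tf.data.Dataset) which, as the name suggests, holds all the Examples we wrote earlier.

Remember that we serialized every Example when writing. To get the original Examples back, we have to parse those serialized strings. The function tf.io.parse_single_example(example_proto, feature_description) parses a single Example.

A quick explanation of this function:
The first argument is the string to parse. The important one is the second argument, which specifies the format of the Example to be parsed. For parsing to work, it has to correspond to the Example we wrote.
For instance, if the Example we wrote was: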

exam = tf.train.Example(
    features=tf.train.Features(
        feature={
            'name' : tf.train.Feature(bytes_list=tf.train.BytesList(value=[splits[-1].encode('utf-8')])),
            'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
            'shape': tf.train.Feature(int64_list=tf.train.Int64List(value=[img.shape[0], img.shape[1], img.shape[2]])),
            'data' : tf.train.Feature(bytes_list=tf.train.BytesList(value=[bytes(img.numpy())]))
        }
    )
)

then the description we need to pass is:

feature_description = {
    'name' : tf.io.FixedLenFeature([], tf.string, default_value='Nan'),
    'label': tf.io.FixedLenFeature([], tf.int64, default_value=-1),  # choose your own default value
    'shape': tf.io.FixedLenFeature([3], tf.int64),
    'data' : tf.io.FixedLenFeature([], tf.string)
}

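As you can see, each entry corresponds to a feature in the Example we wrote. The keys have to match the ones used when writing: a key that is not in the record (say, id instead of name) does not raise an error when a default_value is given, but you silently get that default back instead of the stored value, and without a default the parse fails.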
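A small experiment shows what happens when a key in feature_description does not exist in the record. This is a sketch with a made-up one-field Example, and it assumes TensorFlow 2.x running in eager mode.

```python
import tensorflow as tf

# A record that only contains the key 'name'.
serialized = tf.train.Example(features=tf.train.Features(feature={
    'name': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'cat.jpg'])),
})).SerializeToString()

# Asking for 'id' with a default: no error, but we get the default, not 'cat.jpg'.
with_default = tf.io.parse_single_example(
    serialized, {'id': tf.io.FixedLenFeature([], tf.string, default_value='Nan')})
print(with_default['id'].numpy())

# Asking for 'id' without a default: parsing fails because the key is required.
try:
    tf.io.parse_single_example(
        serialized, {'id': tf.io.FixedLenFeature([], tf.string)})
    missing_key_failed = False
except tf.errors.InvalidArgumentError:
    missing_key_failed = True
print(missing_key_failed)
```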

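OK, so far we can parse a single Example, but a Dataset contains many of them. No problem: TensorFlow's Dataset provides Dataset.map(func), which takes a mapping function and applies it to every element of the dataset, much like Python's built-in map.

So we can pass our single-Example parsing function to Dataset.map(func) to parse every Example.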

import tensorflow as tf

reader = tf.data.TFRecordDataset(file_name)  # open a TFRecord file

feature_description = {
    'name' : tf.io.FixedLenFeature([], tf.string, default_value='Nan'),
    'label': tf.io.FixedLenFeature([], tf.int64, default_value=-1),
    'shape': tf.io.FixedLenFeature([3], tf.int64),
    'data' : tf.io.FixedLenFeature([], tf.string)
}

def _parse_function(exam_proto):  # mapping function that parses a single example
    return tf.io.parse_single_example(exam_proto, feature_description)

reader = reader.map(_parse_function)

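To read the records, we can use a for loop:

import numpy as np

for row in reader.take(10):  # only take the first 10 records
# for row in reader:  # iterate over all examples
    print(row['name'])
    print(np.frombuffer(row['data'].numpy(), dtype=np.uint8))  # flat array; reshape it to recover the 3-D array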
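The 'data' field holds nothing but a flat run of bytes, which is why the 'shape' field was stored next to it. The following pure-Python sketch shows how a (height, width, channels) shape maps an index into that flat buffer, assuming the usual row-major (C order) layout. The function name pixel_at and the tiny buffer are made up for illustration.

```python
def pixel_at(flat: bytes, shape, y: int, x: int, c: int) -> int:
    """Look up one channel value in a row-major (height, width, channels) buffer."""
    height, width, channels = shape
    if not (0 <= y < height and 0 <= x < width and 0 <= c < channels):
        raise IndexError("pixel index out of range")
    return flat[(y * width + x) * channels + c]


flat = bytes(range(24))  # 24 bytes could be 2x4x3, 4x2x3, ...

# The same index gives different bytes depending on which shape we assume.
print(pixel_at(flat, (2, 4, 3), 1, 2, 0))
print(pixel_at(flat, (4, 2, 3), 1, 0, 0))
```

Without the stored shape, the buffer alone cannot tell a 2x4 image from a 4x2 one.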

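But we can get fancier than that:
Dataset also offers many other methods, such as batch, shuffle, and repeat. You can explore the rest on the official site (at some point the TF site suddenly became reachable without any workaround).

So we can do this: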

import numpy as np
import tensorflow as tf

reader = tf.data.TFRecordDataset(file_name)

feature_description = {
    'name' : tf.io.FixedLenFeature([], tf.string, default_value='Nan'),
    'label': tf.io.FixedLenFeature([], tf.int64, default_value=-1),
    'shape': tf.io.FixedLenFeature([3], tf.int64),
    'data' : tf.io.FixedLenFeature([], tf.string)
}

def _parse_function(exam_proto):
    return tf.io.parse_single_example(exam_proto, feature_description)

reader = reader.repeat(1)  # go through the data once; this plays the role of the epoch count
reader = reader.shuffle(buffer_size=2000)  # shuffle the data within a buffer
reader = reader.map(_parse_function)  # parse the data
batch = reader.batch(batch_size=10)  # every 10 records form a batch, giving a new Dataset

for item in batch.take(1):  # test: take only 1 batch
    shape = item['shape'][0].numpy()  # all images are assumed to share one shape
    # one item is one batch: decode each image's raw bytes and restore its shape
    batch_data_x = np.stack([
        np.frombuffer(data.numpy(), dtype=np.uint8).reshape(shape)
        for data in item['data']
    ])
    batch_data_y = item['label'].numpy()

print(batch_data_x.shape, batch_data_y.shape)  # = (10, 480, 640, 3) (10,)
# my images are 480*640*3

This makes it easy to read the data batch by batch, shuffle it, and so on.
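As a variation on the loop above, the raw bytes can also be decoded inside the mapping function, so that the dataset yields ready-to-use image tensors. The sketch below writes four tiny synthetic images to a temporary file and reads them back in batches. The sizes, file name, and the function name parse_and_decode are made up for illustration, and it assumes TensorFlow 2.x and NumPy are installed.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Write four hypothetical 2x3 RGB images so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'demo.tfrecord')
with tf.io.TFRecordWriter(path) as writer:
    for label in range(4):
        img = np.full((2, 3, 3), label, dtype=np.uint8)
        writer.write(tf.train.Example(features=tf.train.Features(feature={
            'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
            'shape': tf.train.Feature(int64_list=tf.train.Int64List(value=list(img.shape))),
            'data': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img.tobytes()])),
        })).SerializeToString())

feature_description = {
    'label': tf.io.FixedLenFeature([], tf.int64),
    'shape': tf.io.FixedLenFeature([3], tf.int64),
    'data': tf.io.FixedLenFeature([], tf.string),
}


def parse_and_decode(exam_proto):
    """Parse one record, then turn its raw bytes back into an image tensor."""
    parsed = tf.io.parse_single_example(exam_proto, feature_description)
    image = tf.io.decode_raw(parsed['data'], tf.uint8)  # flat uint8 tensor
    image = tf.reshape(image, parsed['shape'])          # restore (height, width, channels)
    return image, parsed['label']


dataset = (tf.data.TFRecordDataset(path)
           .map(parse_and_decode)
           .shuffle(buffer_size=4)
           .batch(2))

batches = [(images.numpy(), labels.numpy()) for images, labels in dataset]
for images, labels in batches:
    print(images.shape, labels)
```

Batching like this only works when every image in a batch has the same shape, which matches the single-shape assumption made in the loop above.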

