Python_ML-Day05: TensorFlow Thread Queues and IO Operations, Reading and Writing TFRecords Files

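1. TensorFlow queues
    - When training, we want the samples to be read in a well-defined order
    - A queue provides this: first in, first out (FIFO)
    - tf.FIFOQueue(capacity, dtypes, name='fifo_queue')
        1. A first-in, first-out queue
        2. Parameters
            - capacity: integer; the upper bound on the number of elements that may be stored in the queue
            - dtypes: a list of DType objects; its length must equal the number of tensors
            in each queue element, and it determines the type of the elements enqueued later
        3. Methods
            - dequeue(name=None): returns an op that removes one element from the front of the queue
            - enqueue(vals, name=None): returns an op that adds one element to the queue
            - enqueue_many(vals, name=None): vals is a list or tuple;
            returns an op that adds several elements at once
            - size(name=None): returns the number of elements in the queue
    - tf.RandomShuffleQueue: a queue that dequeues elements in random order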
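The first-in, first-out behaviour described above can be mimicked with the standard library. The following is only a minimal sketch: it uses `collections.deque` instead of `tf.FIFOQueue`, and the helper name `rotate_and_increment` is hypothetical. It follows the same dequeue, add 1, enqueue cycle as the `fifo()` function in the source code section.

```python
from collections import deque


def rotate_and_increment(values, steps):
    """Dequeue the oldest element, add 1, and enqueue it again, `steps` times."""
    q = deque(values)            # left end = front of the queue
    for _ in range(steps):
        oldest = q.popleft()     # dequeue: always the element that entered first
        q.append(oldest + 1)     # enqueue: new elements go to the back
    return list(q)


result = rotate_and_increment([0.1, 0.2, 0.3], steps=100)
print([round(v, 1) for v in result])
```

Because elements leave in the order they entered, 100 steps over three elements increment the first element 34 times and the other two 33 times each.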

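2. Queue runner
    - Create a QueueRunner: tf.train.QueueRunner(queue, enqueue_ops=None)
    - Parameters:
        1. queue: a queue
        2. enqueue_ops: a list of enqueue ops, each run in its own thread; e.g. [enqueue_op] * 2 runs the same op in two threads
    - After creating the queue runner, start its threads inside a session
        1. create_threads(sess, coord=None, start=False)
            - start: boolean; if True the threads are started, if False the caller
                must call start() on them
            - coord: a thread coordinator, used later to manage the threads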

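3. Thread coordinator
    - A coordinator implements a simple mechanism for coordinating the termination of a set of threads
    - tf.train.Coordinator()
        1. returns: a coordinator instance
    - Methods:
        1. request_stop(): request that the threads stop
        2. should_stop(): check whether a stop has been requested
        3. join(threads=None, stop_grace_period_secs=120): wait for the threads to terminate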
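The interplay between enqueue threads and a coordinator can be illustrated without TensorFlow. The sketch below is an analogy only, built on the standard library's `threading` and `queue` modules; it is not how `tf.train.QueueRunner` or `tf.train.Coordinator` are implemented, and the function name `run_producers` is hypothetical. A `threading.Event` stands in for `request_stop()` / `should_stop()`.

```python
import queue
import threading


def run_producers(num_threads=2, capacity=1000, to_read=300):
    q = queue.Queue(maxsize=capacity)
    stop = threading.Event()      # set() ~ request_stop(), is_set() ~ should_stop()
    counter = [0]
    lock = threading.Lock()

    def enqueue_loop():
        while not stop.is_set():
            with lock:            # increment the shared counter atomically
                counter[0] += 1
                value = counter[0]
            # Retry with a timeout so a producer blocked on a full queue
            # still notices the stop request.
            while not stop.is_set():
                try:
                    q.put(value, timeout=0.05)
                    break
                except queue.Full:
                    continue

    threads = [threading.Thread(target=enqueue_loop) for _ in range(num_threads)]
    for t in threads:
        t.start()

    # The main thread dequeues while the worker threads keep enqueuing.
    consumed = [q.get() for _ in range(to_read)]

    stop.set()                    # ask the workers to stop
    for t in threads:
        t.join(timeout=5)         # wait for them to terminate
    return consumed, threads


consumed, threads = run_producers()
print(len(consumed), any(t.is_alive() for t in threads))
```

With more than one producer the values may arrive slightly out of order, since each thread enqueues independently of the others.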

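4. Reading files
    - Workflow
        1. Build a file name queue
            - Feeds strings (e.g. file names) into a queue for the input pipeline
            - tf.train.string_input_producer(string_tensor, num_epochs=None, shuffle=True)
            - Parameters:
                1. string_tensor: a 1-D tensor containing the file names
                2. num_epochs: how many passes to make over the data; by default it cycles indefinitely
                3. returns: a queue that outputs the strings (file names)
        2. Read the contents of the files in the queue and decode them [by default one sample per read]
            - What one read returns:
                1. CSV file: one line
                2. Binary file: the fixed number of bytes that make up one sample
                3. Image file: one whole image
            - Choose the reader that matches the file format:
                1. tf.TextLineReader
                    - Reads text files such as comma-separated CSV files, one line at a time by default
                2. tf.FixedLengthRecordReader(record_bytes)
                    - Reads binary files, a fixed number of bytes per read
                    - record_bytes: integer; the number of bytes per read
                3. tf.TFRecordReader
                    - Reads TFRecords files
            - Reader.read(file_queue)
                1. Reads the contents of the files in the file queue
                2. Returns a tuple of tensors (key, value): key identifies the source file, value is the content that was read
            - Decoding
                1. Readers return strings, so a decoding function is needed to parse them into tensors
                2. Convert CSV records to tensors; used together with tf.TextLineReader
                    - tf.decode_csv(records, record_defaults, field_delim=',', name=None)
                    - records: a string tensor in which each string is one CSV record line; the value returned by the reader
                    - field_delim: the field delimiter, "," by default
                    - record_defaults: determines the type of each resulting tensor. Each comma-separated column may have a different type, specified here one column at a time. The same entry is used as the default value when a field is empty
                3. Convert bytes to a vector of numbers; the bytes are a tensor of type string. Used together with tf.FixedLengthRecordReader; binary data is read as uint8
                    - tf.decode_raw(bytes, out_type, little_endian=True, name=None)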
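        3. Batching
            - Samples are read one at a time and then added to a batch
            - tf.train.batch(tensors, batch_size, num_threads=1, capacity=32, name=None)
                1. Reads a batch of tensors of the specified size (count)
                2. tensors: a list of tensors
                3. batch_size: the number of samples pulled from the queue per batch
                4. num_threads: the number of threads enqueuing tensors
                5. capacity: integer; the maximum number of elements in the queue
                6. returns: the batched tensors
            - tf.train.shuffle_batch(tensors, batch_size, capacity, min_after_dequeue, num_threads=1)
                1. Reads a batch of the specified size in shuffled order
                2. min_after_dequeue: the minimum number of elements left in the queue after a dequeue, so that enough remain to shuffle properly
        4. Worker threads read the files and assemble batches; the main thread takes batches and trains the model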

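    - Another way to start the threads
        1. Collects all queue runners in the graph and starts their threads
        2. tf.train.start_queue_runners(sess=None, coord=None)
            - sess: the session to run in
            - coord: the thread coordinator
            - returns: a list of all the threads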
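The decode and batch steps of the workflow above can be approximated in plain Python to make the role of `record_defaults` concrete. This is a sketch under stated assumptions, not TensorFlow's implementation: `decode_csv_line` and `batches` are hypothetical helpers, the sample lines are invented, and the batching helper simply drops an incomplete final batch.

```python
import csv
import io


def decode_csv_line(line, record_defaults, field_delim=","):
    """Split one CSV line. An empty field falls back to its column default,
    and every other field is cast to the type of that default."""
    fields = next(csv.reader(io.StringIO(line), delimiter=field_delim))
    if len(fields) != len(record_defaults):
        raise ValueError("expected %d fields, got %d"
                         % (len(record_defaults), len(fields)))
    decoded = []
    for raw, (default,) in zip(fields, record_defaults):
        decoded.append(default if raw == "" else type(default)(raw))
    return decoded


def batches(records, batch_size):
    """Group decoded records into column-wise batches of `batch_size`."""
    for start in range(0, len(records) - batch_size + 1, batch_size):
        chunk = records[start:start + batch_size]
        yield [list(column) for column in zip(*chunk)]


# Invented sample data: a string column and an integer column.
lines = ["Alice,3", "Bob,", ",7", "Dan,1"]
records = [decode_csv_line(line, [["None"], [0]]) for line in lines]
for name_batch, count_batch in batches(records, batch_size=2):
    print(name_batch, count_batch)
```

Each batch comes back as one list per column, which mirrors how `tf.train.batch` returns one batched tensor per input tensor.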

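5. Reading images
    - Three properties of an image: height, width, number of channels
        1. Height and width: measured in pixels
        2. One channel: grayscale values / three channels: RGB values
    - How the three properties map to a tensor
        1. [height, width, channels]
    - Basic image handling
        1. Every sample must have the same number of features
        2. So all images must be brought to the same number of features, i.e. the same pixel dimensions, by changing their height and width
        3. Resize images: tf.image.resize_images(images, size)
            - images: image data as a 4-D tensor of shape [batch, height, width, channels] or a 3-D tensor of shape [height, width, channels]
            - size: a 1-D int32 tensor, new_height, new_width: the new size of the images
            - Returns the images in the same 4-D or 3-D format
    - Image reader: tf.WholeFileReader
        1. A reader that outputs the entire contents of a file as the value
        2. returns: a reader instance
        3. Methods:
            - read(file_queue)
            - Outputs key, value
    - Image decoders
        1. tf.image.decode_jpeg(contents)
            - Decodes a JPEG-encoded image to a uint8 tensor
            - returns: a uint8 tensor, 3-D with shape [height, width, channels]
        2. tf.image.decode_png(contents)
            - Decodes a PNG-encoded image to a uint8 or uint16 tensor
            - returns: a tensor, 3-D with shape [height, width, channels]
    - Workflow for batching images
        1. Build the image file queue
        2. Build the image reader
        3. Read the image data and decode it
        4. Process the image data (size)
        5. Batch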
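To see why resizing gives every sample the same number of features, here is a small pure-Python sketch. It uses nearest-neighbour sampling on nested lists, which is a simplification chosen for brevity and not the interpolation TensorFlow applies by default; `resize_nearest` and `shape` are hypothetical helper names.

```python
def resize_nearest(image, new_height, new_width):
    """Resize a [height][width][channels] nested list by nearest-neighbour sampling."""
    height, width = len(image), len(image[0])
    resized = []
    for y in range(new_height):
        src_y = min(height - 1, y * height // new_height)   # source row for this output row
        row = []
        for x in range(new_width):
            src_x = min(width - 1, x * width // new_width)  # source column for this output column
            row.append(list(image[src_y][src_x]))           # copy the pixel's channel values
        resized.append(row)
    return resized


def shape(image):
    """Return [height, width, channels] of a nested-list image."""
    return [len(image), len(image[0]), len(image[0][0])]


# Two invented images with different sizes: 3 x 2 x 3 and 4 x 5 x 3.
small = [[[r * 10 + c] * 3 for c in range(2)] for r in range(3)]
wide = [[[7, 8, 9] for _ in range(5)] for _ in range(4)]

same = [shape(resize_nearest(img, 4, 4)) for img in (small, wide)]
print(same)
```

After resizing, both images have 4 * 4 * 3 = 48 values, so they can be stacked into one batch.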

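6. Reading binary files
    - Download the binary data from: http://www.cs.toronto.edu/~kriz/cifar.html
    - Binary file reader
        1. reader = tf.FixedLengthRecordReader(record_bytes)
            - record_bytes: the number of bytes per record
    - Decoding binary data
        1. label_image = tf.decode_raw(value, tf.uint8)
    - Workflow
        1. Build the file queue
        2. Create the binary file reader and read the files
        3. Decode the bytes to uint8 [each uint8 value corresponds to one byte]
        4. Split each record
            - The first byte is the label
            - The remaining 32*32*3 bytes are the image
            - tf.slice()
        5. Batch
        6. Process with multiple threads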
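The record layout used in the split step can be checked on a synthetic record without TensorFlow. This sketch assumes the layout described on the CIFAR-10 page for the binary version: one label byte, then 1024 red, 1024 green and 1024 blue values, each plane stored row by row. The function name `split_record` and the test record are made up for illustration.

```python
LABEL_BYTES = 1
HEIGHT, WIDTH, CHANNELS = 32, 32, 3
IMAGE_BYTES = HEIGHT * WIDTH * CHANNELS
RECORD_BYTES = LABEL_BYTES + IMAGE_BYTES      # 3073 bytes per sample


def split_record(record):
    """Split one fixed-length record into (label, image[height][width][channels])."""
    if len(record) != RECORD_BYTES:
        raise ValueError("expected %d bytes, got %d" % (RECORD_BYTES, len(record)))
    label = record[0]                         # indexing bytes yields an int in 0..255
    pixels = record[LABEL_BYTES:]
    plane = HEIGHT * WIDTH                    # size of one colour plane
    # The file is channel-major, so pixel (y, x) of channel c sits at
    # offset c * plane + y * WIDTH + x.
    image = [[[pixels[c * plane + y * WIDTH + x] for c in range(CHANNELS)]
              for x in range(WIDTH)]
             for y in range(HEIGHT)]
    return label, image


# Synthetic record: label 7, red plane all 10, green all 20, blue all 30.
record = bytes([7]) + bytes([10]) * 1024 + bytes([20]) * 1024 + bytes([30]) * 1024
label, image = split_record(record)
print(label, len(image), len(image[0]), image[0][0])
```

Every pixel of the synthetic image comes out as [10, 20, 30], which shows that a plain reshape to [32, 32, 3] without reordering the planes would mix the channels.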

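7. TFRecords files
    - TFRecords is TensorFlow's built-in file format. It is a binary format that makes better use of memory and is easier to copy and move
    - It stores the binary data and the labels (the training class labels) together in one file
    - File extension: *.tfrecords
    - File contents: Example protocol buffers, a dictionary-like format
    - Writing a tfrecords file
        1. Create the writer
            - tf.python_io.TFRecordWriter(path)
            - path: path of the tfrecords file
            - returns: a writer
            - Methods:
                1. write(record): writes one string record to the file
                    - The string record is a serialized Example
                    - Produced with Example.SerializeToString()
                2. close(): closes the writer
        2. Build an Example protocol buffer for each sample
            - Build a single feature
                1. tf.train.Feature(**options)
                    - **options: one list in one of the supported formats, e.g. bytes_list=tf.train.BytesList(value=[...])
                    - Supported formats are Int64, Bytes and Float:
                        1. tf.train.Int64List(value=[int])
                        2. tf.train.BytesList(value=[bytes])
                        3. tf.train.FloatList(value=[float])
                    - value: the actual values, of the matching type
                    - Notes:
                        1. For image data: value = [img_tensor.eval().tobytes()]
                        2. For an int label: value = [int(label_tensor.eval())]

            - Build the key-value pairs for each sample
                1. tf.train.Features(feature=None)
                2. Parameters:
                    - feature: a dict; each key is the name to store under, each value is a tf.train.Feature instance
                3. Return value:
                    - returns: a Features instance
            - Build the Example protocol buffer for each sample and hand it to the writer as a record
                1. tf.train.Example(features=None)
                2. Parameters:
                    - features: a tf.train.Features instance
                3. Return value:
                    - returns: an Example protocol buffer
        3. Serialize the Example
            - example.SerializeToString()
            - Returns the serialized Example, ready to be stored
        4. Write
            writer.write(example.SerializeToString())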

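    - Reading a tfrecords file
        1. Build the file queue
        2. Build the file reader and read the examples
            tf.TFRecordReader()
        3. Parse the serialized Example protocol buffers
            - Parse a single Example proto
                1. tf.parse_single_example(serialized, features=None, name=None)
                2. Parameters:
                    - serialized: a scalar string Tensor holding one serialized Example
                    - features: a dict template; keys are the names to read, values are FixedLenFeature instances
                    - returns: a dict of key-value pairs keyed by the names that were read
            - tf.FixedLenFeature(shape, dtype)
                - shape: the shape of the input data; usually left as an empty list
                - dtype: the type of the input data, which must match the type stored in the file; only float32, int64 and string are allowed
        4. Decode: values read as string need decoding; other types do not
            tf.decode_raw(value, tf.uint8)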

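8. Source code
    import os
    import tensorflow as tf

    def fifo():
        """
        Simulate queue operations
        :return:
        """
        # 1. Define the queue and the op that fills it with the initial data
        Q = tf.FIFOQueue(3, tf.float32)
        en_Q = Q.enqueue_many([[0.1, 0.2, 0.3]])
        # 2. Define how data is read: dequeue one element
        out_dt = Q.dequeue()
        # 3. Process the data (+1) and
        # 4. put the result back on the queue
        de_Q = Q.enqueue(out_dt + 1)

        with tf.Session() as sess:
            # Fill the queue
            sess.run(en_Q)
            # Run the dequeue / +1 / enqueue cycle
            for i in range(100):
                sess.run(de_Q)
            # Consume the data (reuse the dequeue op instead of creating a new one per iteration)
            for i in range(sess.run(Q.size())):
                print(sess.run(out_dt))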


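    def queueRunner():
        """
        Queue runner: process data asynchronously
        :return:
        """
        # Define a queue with capacity 1000
        Q = tf.FIFOQueue(1000, tf.float32)
        # Define a variable; dtype is passed by keyword because the second
        # positional argument of tf.Variable is `trainable`
        var = tf.Variable(0.0, dtype=tf.float32)
        # Increment the variable by 1
        data = var.assign_add(1)
        # Enqueue op
        en_q = Q.enqueue(data)
        # Dequeue op, created once and reused in the loop below
        de_q = Q.dequeue()
        # Define a queue runner that runs the enqueue op in two threads
        qr = tf.train.QueueRunner(Q, enqueue_ops=[en_q] * 2)
        # Variable initialization op
        init_op = tf.global_variables_initializer()
        with tf.Session() as sess:
            # Initialize the variables
            sess.run(init_op)
            # Create and start the worker threads, managed by a coordinator
            coord = tf.train.Coordinator()
            threads = qr.create_threads(sess, coord=coord, start=True)
            # The main thread reads the data, then trains on it, etc.
            for i in range(300):
                print(sess.run(de_q))
            # Ask the worker threads to stop
            coord.request_stop()
            # Wait for them to finish and release their resources
            coord.join(threads)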


    def readCSVFile():
        """
        Read CSV files
        :return:
        """
        # 1. Locate the directory and build a list of file paths
        file_name = os.listdir("./data")
        file_path = [os.path.join('./data', file) for file in file_name]
        file = tf.constant(file_path)
        # 2. Build the file queue
        queue = tf.train.string_input_producer(file, shuffle=True)
        # 3. Build the reader
        reader = tf.TextLineReader()
        key, value = reader.read(queue)
        # 4. Decode
        # Each record has two columns, both strings, with the defaults "None" and "default"
        records = [["None"], ["default"]]
        line1, line2 = tf.decode_csv(value, record_defaults=records)
        # 5. Batch
        line1_batch, line2_batch = tf.train.batch(tensors=[line1, line2], batch_size=9, num_threads=1, capacity=9)
        init_op = tf.global_variables_initializer()
        with tf.Session() as sess:
            sess.run(init_op)
            coord = tf.train.Coordinator()
            threads = tf.train.start_queue_runners(sess, coord=coord)
            print(sess.run([line1_batch]))
            coord.request_stop()
            coord.join(threads)


    def readImags():
        """
        Read image files
        :return:
        """
        # 1. Locate the directory and build a list of file paths
        file_name = os.listdir("./image")
        file_path = [os.path.join('./image', file) for file in file_name]
        file = tf.constant(file_path)
        # 2. Build the file queue
        queue = tf.train.string_input_producer(file, shuffle=True)
        # 3. Build the reader; each read returns one whole file
        reader = tf.WholeFileReader()
        key, value = reader.read(queue)
        # 4. Decode
        img = tf.image.decode_jpeg(value)
        # Resize every image to 200 x 200
        img_resize = tf.image.resize_images(img, [200, 200])
        # Fix the static shape to [200, 200, 3] so the images can be batched
        img_resize.set_shape([200, 200, 3])
        # 5. Batch
        img_batch = tf.train.batch(tensors=[img_resize], batch_size=10, num_threads=1, capacity=10)
        print(img_batch)
        init_op = tf.global_variables_initializer()
        with tf.Session() as sess:
            sess.run(init_op)
            coord = tf.train.Coordinator()
            threads = tf.train.start_queue_runners(sess, coord=coord)
            print(sess.run([img_batch]))
            coord.request_stop()
            coord.join(threads)


    # Define the command-line flags
    FLAGS = tf.app.flags.FLAGS
    tf.app.flags.DEFINE_string("cifar_dir", "./bin", "directory of the CIFAR binary files")
    tf.app.flags.DEFINE_string("tfrecords_dir", "./tfrecords/a.tfrecords", "path of the tfrecords file")


    class CifarRead(object):
        """
        Read the binary files and write them to TFRecords
        """

        def __init__(self, pathlist):
            self.path_list = pathlist
            # Each record: 1 label byte + 32 * 32 * 3 image bytes
            self.bytes = 32 * 32 * 3 + 1

        def getRecord(self):
            # Build the file queue
            queue = tf.train.string_input_producer(self.path_list)
            # Read one fixed-length record
            reader = tf.FixedLengthRecordReader(record_bytes=self.bytes)
            key, value = reader.read(queue)
            # Decode the bytes to uint8
            label_image = tf.decode_raw(value, tf.uint8)
            print(label_image)
            # Split into label and image
            label = tf.cast(tf.slice(label_image, [0], [1]), tf.int32)
            image = tf.slice(label_image, [1], [self.bytes - 1])
            print(label)
            print(image)
            # The file stores the red, green and blue planes one after another, so reshape
            # to [channels, height, width] and transpose to [height, width, channels]
            reshape = tf.transpose(tf.reshape(image, [3, 32, 32]), [1, 2, 0])
            print(reshape)
            # Batch the data
            img_batch, label_batch = tf.train.batch([reshape, label], batch_size=10, num_threads=1, capacity=10)
            print(img_batch)
            print(label_batch)
            return img_batch, label_batch

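        def write_to_tfrecords(self, img_batch, label_batch, batch_size=10):
            """
            Store one batch of data in a tfrecords file.
            Must be called inside a session after the queue runners have been started.
            :param img_batch: batch of image data
            :param label_batch: batch of label data
            :param batch_size: number of samples in the batch
            :return: None
            """
            # 1. Create a writer
            writer = tf.python_io.TFRecordWriter(FLAGS.tfrecords_dir)
            # 2. Fetch images and labels in a single run so that they belong to the same batch
            #    (evaluating them one by one would dequeue a new batch on every call)
            images, labels = tf.get_default_session().run([img_batch, label_batch])
            # 2.1 Split the batch and build one Example per sample
            for i in range(batch_size):
                image = images[i].tobytes()
                label = int(labels[i][0])
                example = tf.train.Example(features=tf.train.Features(
                    feature={
                        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image])),
                        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))
                    }
                ))
                # 3. Serialize and write
                writer.write(example.SerializeToString())
            # Close the writer when done
            writer.close()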

        def readTfrecords(self):
            # Build the file queue
            queue = tf.train.string_input_producer([FLAGS.tfrecords_dir])
            # Create the reader
            reader = tf.TFRecordReader()
            # Read one record
            key, value = reader.read(queue)
            # value is a string produced by example.SerializeToString(), so deserialize it with parse_single_example
            features = tf.parse_single_example(value, features={
                "image": tf.FixedLenFeature(shape=[], dtype=tf.string),
                "label": tf.FixedLenFeature(shape=[], dtype=tf.int64)
            })
            # The image was stored as a string, so decode it to uint8
            image = tf.decode_raw(features["image"], tf.uint8)
            label = features["label"]
            print(image, label)
            # Fix the shape of the image to simplify later processing
            image_reshape = tf.reshape(image, [32, 32, 3])
            # Batch
            img_batch, label_batch = tf.train.batch([image_reshape, label], batch_size=5, num_threads=1, capacity=10)
            return img_batch, label_batch


    if __name__ == '__main__':
        # 1. Find the binary files and put their paths in a list
        file_list = os.listdir(FLAGS.cifar_dir)
        path_list = [os.path.join(FLAGS.cifar_dir, file) for file in file_list if file.startswith("data_")]
        print(path_list)
        cr = CifarRead(path_list)
        # 2. Read batches back from the tfrecords file
        #    (assumes the file was already written with getRecord() and write_to_tfrecords())
        img_batch, label_batch = cr.readTfrecords()
        with tf.Session() as sess:
            coord = tf.train.Coordinator()
            threads = tf.train.start_queue_runners(sess, coord=coord)
            # Each run dequeues the next batch of labels
            print(sess.run([label_batch]))
            print(sess.run([label_batch]))
            print(sess.run([label_batch]))
            coord.request_stop()
            coord.join(threads)