Attention Is All You Need implemented in TF2 (with detailed comments), Part 1: Data Processing

I found a tutorial online, but I still had to work through the code line by line, and there was far too much I didn't understand, so I'm recording what I looked up here.

Github項目地址:https://github.com/princewen/tensorflow_practice/tree/master/basic/Basic-Transformer-Demo

I. Hyperparameter Settings

A Hyperparams class is defined; its attributes are the model's hyperparameter values.

  • min_cnt: words that occur fewer than min_cnt times are not added to the vocab word set
  • hidden_units: the num_units in the model, i.e. the dimensionality of a word's encoding

attention/hyperparams.py

class Hyperparams:
    '''Hyperparameters'''
    # data
    source_train = 'data/train.tags.de-en.de'
    target_train = 'data/train.tags.de-en.en'
    source_test = 'data/IWSLT16.TED.tst2014.de-en.de.xml'
    target_test = 'data/IWSLT16.TED.tst2014.de-en.en.xml'

    # training
    batch_size = 32  # alias = N
    lr = 0.0001  # learning rate. In paper, learning rate is adjusted to the global step.
    logdir = 'logdir'  # log directory

    # model
    maxlen = 10  # Maximum number of words in a sentence. alias = T.
    # Feel free to increase this if you are ambitious.
    min_cnt = 20  # words whose occurred less than min_cnt are encoded as <UNK>.
    hidden_units = 512  # alias = C
    num_blocks = 6  # number of encoder/decoder blocks
    num_epochs = 20
    num_heads = 8
    dropout_rate = 0.1
    sinusoid = False  # If True, use sinusoidal positional encoding. If False, use a learned positional embedding.

II. Data Reading

1. Dataset Introduction

IWSLT is the International Workshop on Spoken Language Translation evaluation. Here is the description from its readme file:

For each language pair x-y, the in-domain parallel training data is provided through the following files:

train.tags.x-y.x

train.tags.x-y.y

They contain the transcripts and manual translations of the TED talks available on the TED website as of April 1, 2016 for each pair x-y. The texts are given as plain text (UTF-8 encoded), one or more sentences per line, and are aligned (at the language-pair level, not across pairs). Monolingual training data is contained in the files train.y. In addition to the German-English pair, some development sets based on TEDx talks were released as well. The UTF-8 encoded texts are segmented into sentences, given between <seg id="N"> and </seg> (N=1,2,...), which may contain more than one sentence. The segments of the files *.x-y.x.xml and *.x-y.y.xml are aligned.
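load_test_data below relies on exactly this layout. As a minimal sketch (assuming the test-file paths from the hyperparameters above), the segment text can be pulled out of one of the *.xml test files like this:

import codecs
import regex  # third-party module; supports \p{...} Unicode properties

def read_segments(path):
    # keep only the <seg id="N"> ... </seg> lines described above
    lines = codecs.open(path, 'r', 'utf-8').read().split('\n')
    segs = [line for line in lines if line.startswith('<seg')]
    # drop the XML tags, keep the text between them
    return [regex.sub(r'<[^>]+>', '', line).strip() for line in segs]

# e.g. read_segments('data/IWSLT16.TED.tst2014.de-en.de.xml')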

  • attention/data/de.vocab.tsv (the second column is the word's occurrence count; format sketched below)

  • attention/data/en.vocab.tsv

  • the encoded data (Sources)
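From the way data_load.py reads these files (one word and its count per line, tab-separated, with the special tokens first so that <PAD> maps to index 0 and <UNK> to index 1), the vocab files look roughly like this (the counts shown are only illustrative):

<PAD>	1000000000
<UNK>	1000000000
<S>	1000000000
</S>	1000000000
die	85235
und	77279
...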

2. Terminology Notes

  • OOV: out-of-vocabulary words, i.e. words in the training or test data that do not appear in the vocab word set; they are encoded with the default index 1 (<UNK>)
  • tsv: tab-separated values
  • u + string: the u prefix marks the string literal that follows as Unicode-encoded
  • np.lib.pad: pads an array with given values (an alias of np.pad; see the short demo after this list)
  • regex: the third-party regular-expression module
  • strip(): removes the specified characters (whitespace and newlines by default) or character sequence from both ends of a string
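A quick demo of the last few items (values are illustrative):

import numpy as np
import regex  # note: the built-in re module does not support \p{Latin}

# np.lib.pad (alias of np.pad): extend a length-7 sentence to maxlen=10
x = np.array([129, 1622, 6, 358, 7, 6349, 3])
print(np.lib.pad(x, [0, 10 - len(x)], 'constant', constant_values=(0, 0)))
# -> [ 129 1622    6  358    7 6349    3    0    0    0]

# regex with a Unicode property class, as used in data_load.py below
print(regex.sub(r"[^\s\p{Latin}']", "", "Hallo, Welt! 123"))  # -> 'Hallo Welt '

# strip(): remove leading/trailing whitespace (the default characters)
print("  Hallo Welt \n".strip())  # -> 'Hallo Welt'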

3. Data Loading

  • tf.train.slice_input_producer

A tensor generator: following its settings, each call takes one tensor from a tensor list, in order or at random, and places it in the filename queue. In TensorFlow 2, however, this method (and the others below) has been removed; the tf.data module replaces tf.train, and the corresponding method is tf.data.Dataset.from_tensor_slices.

Note what happens when the two tensors are fed in together:

input_queues = tf.data.Dataset.from_tensor_slices([X,Y])
x, y = tf.data.Dataset.shuffle(input_queues, buffer_size=hp.batch_size * 32).batch(batch_size=hp.batch_size)

This raises: ValueError: not enough values to unpack (expected 2, got 1)

Printing input_queues shows that X and Y still live in two separate arrays; my understanding is that each element of input_queues is a single tensor, while the unpacking expects two values (x and y). The official documentation describes how to feed two tensors:

# Two tensors can be combined into one Dataset object.
features = tf.constant([[1, 3], [2, 1], [3, 3]]) # ==> 3x2 tensor
labels = tf.constant(['A', 'B', 'A']) # ==> 3x1 tensor
dataset = Dataset.from_tensor_slices((features, labels))
# Both the features and the labels tensors can be converted
# to a Dataset object separately and combined after.
features_dataset = Dataset.from_tensor_slices(features)
labels_dataset = Dataset.from_tensor_slices(labels)
dataset = Dataset.zip((features_dataset, labels_dataset))
# A batched feature and label set can be converted to a Dataset
# in similar fashion.
batched_features = tf.constant([[[1, 3], [2, 3]],
                                [[2, 1], [1, 2]],
                                [[3, 3], [3, 2]]], shape=(3, 2, 2))
batched_labels = tf.constant([['A', 'A'],
                              ['B', 'B'],
                              ['A', 'B']], shape=(3, 2, 1))
dataset = Dataset.from_tensor_slices((batched_features, batched_labels))
for element in dataset.as_numpy_iterator():
  print(element)

This shows the fix:

input_queues = tf.train.slice_input_producer([X,Y]) # TF1
input_queues = tf.data.Dataset.from_tensor_slices([X,Y]) # TF2: each element is a single tensor
input_queues = tf.data.Dataset.from_tensor_slices((X,Y)) # each element is a pair of tensors

Now X and Y correspond element-wise inside input_queues, as the sketch below shows:
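In TF2's eager mode this pairing can be verified directly (a small sketch; X and Y are the padded index arrays returned by load_train_data):

import tensorflow as tf

input_queues = tf.data.Dataset.from_tensor_slices((X, Y))
for x, y in input_queues.take(2):  # peek at the first two (x, y) pairs
    print(x.numpy(), y.numpy())    # each is a length-maxlen index vector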

  • tf.train.shuffle_batch([example, label], batch_size=batch_size, capacity=capacity, min_after_dequeue)

This, too, is a TensorFlow 1 method. In TF2 both of the following work:

dataset = tf.data.Dataset.shuffle(input_queues, buffer_size=hp.batch_size * 64).batch(batch_size=hp.batch_size)

# dataset = input_queues.shuffle(hp.batch_size * 64).batch(hp.batch_size)

The min_after_dequeue parameter must be kept smaller than capacity, otherwise it errors. It means that once the queue holds more elements than min_after_dequeue, batches are dequeued in shuffled order; in other words the function outputs batches whose samples are arranged randomly, not sequentially.

Oddly, the min_after_dequeue parameter written in that documentation does not apply here: passing it raises TypeError: shuffle() got an unexpected keyword argument 'min_after_dequeue'. In TF2, Dataset.shuffle takes only the three parameters buffer_size, seed, and reshuffle_each_iteration; min_after_dequeue belonged to the removed tf.train.shuffle_batch, so those queue-tuning parameters were dropped.
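In other words, the old capacity/min_after_dequeue pair collapses into the single buffer_size argument. For reference, a sketch using the full TF2 signature:

# Dataset.shuffle(buffer_size, seed=None, reshuffle_each_iteration=None)
dataset = (input_queues
           .shuffle(buffer_size=hp.batch_size * 64,  # plays the role of capacity
                    reshuffle_each_iteration=True)   # reshuffle on every epoch
           .batch(hp.batch_size))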

  • Iterator: running the original code's x, y = tf.train.shuffle_batch(...) raises ValueError: too many values to unpack (expected 2). After digging through the documentation for quite a while I finally understood how iterators are used. An example:
def get_batch_data():
    X, Y = load_train_data()
    num_batch = len(X) // hp.batch_size
    X = tf.convert_to_tensor(X, tf.int32)
    Y = tf.convert_to_tensor(Y, tf.int32)

    # pair X and Y element-wise first (this line was missing originally)
    input_queues = tf.data.Dataset.from_tensor_slices((X, Y))
    dataset = input_queues.shuffle(hp.batch_size * 64).batch(hp.batch_size)

    # iterating input_queues yields one (x, y) sentence pair per step,
    # which is what the printout below shows; iterate `dataset` instead
    # to get shuffled batches of shape (batch_size, maxlen)
    iterator = tf.compat.v1.data.make_one_shot_iterator(input_queues)
    x, y = iterator.get_next()
    return x, y, num_batch

def main():
    x, y, batch_num = get_batch_data()
    # sess.run requires graph mode; under TF2 call
    # tf.compat.v1.disable_eager_execution() first
    with tf.compat.v1.Session() as sess:
        sess.run(tf.compat.v1.global_variables_initializer())
        for i in range(10):
            print(i, sess.run(x))

if __name__ == '__main__':
    main()

Printing x gives:

0 [ 129 1622    6  358    7 6349    3    0    0    0]
1 [  59 2320 2736    7  249 1486    3    0    0    0]
2 [  59  265  572  276   10   22 5922    3    0    0]
3 [34  7 16  1  3  0  0  0  0  0]
4 [ 37  63 136   9 935 396   3   0   0   0]
5 [ 672   14  165    4 1550  746    3    0    0    0]
6 [209  40 624  11   3   0   0   0   0   0]
7 [37 51  1  4 36  1  3  0  0  0]
8 [  37    7   20 4103   17 1286    3    0    0    0]
9 [159   7   8   1   3   0   0   0   0   0]
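For comparison, in TF2 eager mode the Session and iterator machinery is unnecessary; a more idiomatic sketch of the same loop (assuming load_train_data from data_load.py below) would be:

def get_batch_data_eager():
    # eager-mode sketch: return the batched tf.data.Dataset itself
    X, Y = load_train_data()
    num_batch = len(X) // hp.batch_size
    dataset = (tf.data.Dataset.from_tensor_slices((X, Y))
               .shuffle(hp.batch_size * 64)
               .batch(hp.batch_size))
    return dataset, num_batch

dataset, num_batch = get_batch_data_eager()
for i, (x, y) in enumerate(dataset.take(10)):
    print(i, x.numpy().shape)  # (batch_size, maxlen)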
  • attention/data_load.py
from hyperparams import Hyperparams as hp
import tensorflow as tf
import numpy as np
import codecs # encoding-aware file reading
import regex # regular expressions (supports \p{...} Unicode properties)

def load_de_vocab():
    # words occurring at least min_cnt times are added to the vocab
    vocab = [line.split()[0] for line in codecs.open('attention/data/de.vocab.tsv','r','utf-8').read().splitlines()
             if int(line.split()[1])>=hp.min_cnt]
    # number the words, in file order
    word2idx = {word:idx for idx,word in enumerate(vocab)}
    idx2word = {idx:word for idx,word in enumerate(vocab)}

    return word2idx,idx2word

def load_en_vocab():
    # English vocab
    vocab = [line.split()[0] for line in codecs.open('attention/data/en.vocab.tsv','r','utf-8').read().splitlines()
             if int(line.split()[1])>=hp.min_cnt]

    word2idx = {word:idx for idx,word in enumerate(vocab)}
    idx2word = {idx:word for idx,word in enumerate(vocab)}
    return word2idx,idx2word



def create_data(source_sents,target_sents):
    de2idx,idx2de = load_de_vocab() # load the German vocab
    en2idx,idx2en = load_en_vocab() # load the English vocab

    x_list ,y_list,Sources,Targets = [],[],[],[]
    for source_sent,target_sent in zip(source_sents,target_sents):
        # encode the sentences as word indices
        x = [de2idx.get(word,1) for word in (source_sent+u" </S>").split()] # 1: OOV, </S>: End of Text
        y = [en2idx.get(word,1) for word in (target_sent+u" </S>").split()]

        if max(len(x),len(y)) <= hp.maxlen: # cap on the number of words per sentence
            x_list.append(np.array(x))
            y_list.append(np.array(y))
            Sources.append(source_sent)
            Targets.append(target_sent)

    #Pad
    X = np.zeros([len(x_list),hp.maxlen],np.int32) # dim 0: number of sentences; dim 1: sentence length
    Y = np.zeros([len(y_list),hp.maxlen],np.int32)

    for i,(x,y) in enumerate(zip(x_list,y_list)): # (x, y) = (source sentence, target sentence)
        X[i] = np.lib.pad(x,[0,hp.maxlen-len(x)],'constant',constant_values=(0,0)) # zero-pad to maxlen
        Y[i] = np.lib.pad(y,[0,hp.maxlen-len(y)],'constant',constant_values=(0,0))
    return X,Y,Sources,Targets



def load_train_data():
    def _refine(line):
        line = regex.sub(r"[^\s\p{Latin}']", "", line) # keep only whitespace, Latin letters and apostrophes
        return line.strip()

    de_sents = [_refine(line) for line in codecs.open(hp.source_train, 'r', 'utf-8').read().split('\n') if
                line and line[0] != "<"]
    en_sents = [_refine(line) for line in codecs.open(hp.target_train, 'r', 'utf-8').read().split('\n') if
                line and line[0] != '<']

    X, Y, Sources, Targets = create_data(de_sents, en_sents)
    # X, Y are the index-encoded sentences
    return X, Y


def load_test_data():
    def _refine(line):
        line = regex.sub(r"<[^>]+>", "", line) # strip the XML tags
        line = regex.sub(r"[^\s\p{Latin}']", "", line)
        return line.strip() # remove leading/trailing whitespace

    de_sents = [_refine(line) for line in codecs.open(hp.source_test,'r','utf-8').read().split('\n') if line and line[:4] == "<seg"]
    # test sentences are the lines that start with <seg
    en_sents = [_refine(line) for line in codecs.open(hp.target_test,'r','utf-8').read().split('\n') if line and line[:4] == '<seg']

    X,Y,Sources,Targets = create_data(de_sents,en_sents)
    return X,Sources,Targets



def get_batch_data():
    X, Y = load_train_data()

    num_batch = len(X) // hp.batch_size

    #print("train_X:\n", X[:10])
    #print("train_Y:\n", Y[:10])
    X = tf.convert_to_tensor(X,tf.int32)
    Y = tf.convert_to_tensor(Y,tf.int32)
    # print(X)
    # print(Y)
    '''
    input_queues = tf.train.slice_input_producer([X,Y])
    '''
    input_queues = tf.data.Dataset.from_tensor_slices((X,Y))

    # print(list(input_queues.as_numpy_iterator()))
    # print("shape of input_queues:", input_queues)
    '''
    i = 0
    for element in input_queues.as_numpy_iterator():
        print(i, element)
        i += 1
        if i == 10:
            break
    '''
    '''
    x,y = tf.train.shuffle_batch(input_queues,
                                 num_threads=8,
                                 batch_size=hp.batch_size,
                                 capacity = hp.batch_size*64,
                                 min_after_dequeue=hp.batch_size * 32,
                                 allow_smaller_final_batch=False)
    '''
    dataset = tf.data.Dataset.shuffle(input_queues, buffer_size=hp.batch_size * 64).batch(batch_size=hp.batch_size)
    # dataset = input_queues.shuffle(hp.batch_size * 64).batch(hp.batch_size)

    # print(tf.compat.v1.data.get_output_shapes(input_queues))
    # print(tf.compat.v1.data.get_output_types(input_queues))

    # build the iterator from the batched dataset so that get_next()
    # yields (batch_size, maxlen) batches rather than single sentences
    iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
    x, y = iterator.get_next()
    print(x, y)
    return x, y, num_batch
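To make create_data concrete, here is a hand-traced toy example (the vocab and its indices are hypothetical; real indices come from de.vocab.tsv):

import numpy as np

# hypothetical toy vocab: 0=<PAD>, 1=<UNK>, 2=<S>, 3=</S>
de2idx = {'<PAD>': 0, '<UNK>': 1, '<S>': 2, '</S>': 3, 'hallo': 4, 'welt': 5}

sent = "hallo welt xyz"  # 'xyz' is not in the vocab
x = [de2idx.get(w, 1) for w in (sent + u" </S>").split()]
print(x)  # -> [4, 5, 1, 3]  ('xyz' became 1 = <UNK>; 3 = </S> appended)

maxlen = 10
print(np.lib.pad(x, [0, maxlen - len(x)], 'constant', constant_values=(0, 0)))
# -> [4 5 1 3 0 0 0 0 0 0]  (zero-padded to maxlen; 0 = <PAD>)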
