A Detailed Walkthrough of the Transformer Code

1.Code download:

I downloaded a fairly popular implementation of the Transformer from GitHub. The repository is at: https://github.com/Kyubyong/transformer

2.prepro.py

This script generates the preprocessed corpus files and uses the sentencepiece package to segment the raw corpus.

2.1 Corpus preprocessing

    # train
    _prepro = lambda x:  [line.strip() for line in open(x, 'r').read().split("\n") \
                      if not line.startswith("<")]
    prepro_train1, prepro_train2 = _prepro(train1), _prepro(train2)
    assert len(prepro_train1)==len(prepro_train2), "Check if train source and target files match."

    # eval
    _prepro = lambda x: [re.sub("<[^>]+>", "", line).strip() \
                     for line in open(x, 'r').read().split("\n") \
                     if line.startswith("<seg id")]
    prepro_eval1, prepro_eval2 = _prepro(eval1), _prepro(eval2)
    assert len(prepro_eval1) == len(prepro_eval2), "Check if eval source and target files match."

    # test
    prepro_test1, prepro_test2 = _prepro(test1), _prepro(test2)
    assert len(prepro_test1) == len(prepro_test2), "Check if test source and target files match."

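As the code shows, the train, eval, and test sets are each preprocessed. For the training files, every line that starts with "<" (a markup line) is dropped. For the eval and test files, only lines that start with "<seg id" are kept, and the XML tags inside them are stripped with a regular expression. Note that the filtering targets markup, not ordinary punctuation.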
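The two filters can be tried on a few made-up lines. This is a minimal sketch: the function names `prepro_train` and `prepro_eval` and the sample lines are hypothetical, and only the filtering expressions are taken from the code above.

```python
import re

def prepro_train(lines):
    """Drop markup lines (those starting with '<') and strip the rest."""
    return [line.strip() for line in lines if not line.startswith("<")]

def prepro_eval(lines):
    """Keep only '<seg id' lines and remove every XML tag from them."""
    return [re.sub("<[^>]+>", "", line).strip()
            for line in lines if line.startswith("<seg id")]

# Hypothetical sample lines, only for illustration.
train_lines = ["<url>http://example.com</url>", "Guten Morgen .",
               "<title>Demo</title>", "Wie geht es dir ?"]
eval_lines = ['<doc docid="1">', '<seg id="1"> Hallo Welt . </seg>', "</doc>"]

train_clean = prepro_train(train_lines)
eval_clean = prepro_eval(eval_lines)
print(train_clean)
print(eval_clean)
```

Punctuation such as "." and "?" survives both filters, which is why the sentences stay intact.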

2.2 Writing the preprocessed datasets

    def _write(sents, fname):
        with open(fname, 'w') as fout:
            fout.write("\n".join(sents))
    _write(prepro_train1, "iwslt2016/prepro/train.de")
    _write(prepro_train2, "iwslt2016/prepro/train.en")
    _write(prepro_train1+prepro_train2, "iwslt2016/prepro/train")
    _write(prepro_eval1, "iwslt2016/prepro/eval.de")
    _write(prepro_eval2, "iwslt2016/prepro/eval.en")
    _write(prepro_test1, "iwslt2016/prepro/test.de")
    _write(prepro_test2, "iwslt2016/prepro/test.en")

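The generated "train" file is the concatenation of "train.de" and "train.en"; it is the file passed to sentencepiece as training input in the next step. Files ending in "de" are German, and files ending in "en" are the corresponding English translations.

2.3 sentencepiece processing

The code uses the BPE algorithm from sentencepiece for segmentation.
BPE (Byte Pair Encoding) was applied to machine translation in 2016 to address the out-of-vocabulary (OOV) and rare-word problems.
BPE is one of the subword algorithms used in preprocessing.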

    import sentencepiece as spm
    # --input: the training file; --model_prefix: prefix of the output file names.
    # (Comments must stay outside the string, otherwise they become part of the arguments.)
    train = '--input=iwslt2016/prepro/train --pad_id=0 --unk_id=1 \
             --bos_id=2 --eos_id=3\
             --model_prefix=iwslt2016/segmented/bpe --vocab_size={} \
             --model_type=bpe'.format(hp.vocab_size)
    spm.SentencePieceTrainer.Train(train)

    logging.info("# Load trained bpe model")
    sp = spm.SentencePieceProcessor()
    sp.Load("iwslt2016/segmented/bpe.model")

    logging.info("# Segment")
    def _segment_and_write(sents, fname):
        with open(fname, "w") as fout:
            for sent in sents:
                pieces = sp.EncodeAsPieces(sent)    # split the sentence into subword pieces
                fout.write(" ".join(pieces) + "\n")

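(1) BPE algorithm

Byte pair encoding (BPE), also called digram coding, is a simple form of data compression in which the most common pair of consecutive bytes is replaced with a byte that does not occur in the data. A replacement table is needed later to reconstruct the original data. OpenAI GPT-2 and Facebook RoBERTa both use this method to build their subword vocabularies.

Advantage: it effectively balances vocabulary size against the number of steps (the number of tokens needed to encode a sentence).
Disadvantage: it relies on greedy, deterministic symbol replacement, so it cannot provide multiple segmentations with probabilities.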

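(2) BPE procedure

1) Prepare a sufficiently large training corpus.

2) Decide on the desired subword vocabulary size.

3) Split each word into a sequence of characters, append the suffix "</w>", and count word frequencies. At this stage the subword granularity is a single character. For example, if "low" has frequency 5, it is rewritten as "l o w </w>": 5.

4) Count the frequency of every pair of consecutive symbols (byte pair) and merge the most frequent pair into a new subword.

5) Repeat step 4 until the vocabulary size set in step 2 is reached, or until the next most frequent pair has frequency 1.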

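The stop symbol "</w>" marks a subword as a word suffix. For example, the subword "st" without "</w>" can appear at the start of a word, as in "st ar", whereas "st</w>" indicates that the subword sits at the end of a word, as in "wide st</w>". The two have quite different meanings.

After each merge, the vocabulary size can change in one of three ways:

+1: the merged subword is added and both original subwords are kept (the two subwords do not always occur next to each other).
+0: the merged subword is added, one of the original subwords is kept, and the other is eliminated (one subword only ever occurs immediately next to the other).
-1: the merged subword is added and both original subwords are eliminated (the two subwords only ever occur together).

In practice, as the number of merges grows, the vocabulary size usually first increases and then decreases.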

(3) BPE example

Input:
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

Iter 1: the pair "e" and "s" occurs 6+3=9 times, the highest count (tied with "s t" and "t </w>"), and is merged into "es". Output:
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w es t </w>': 6, 'w i d es t </w>': 3}

Iter 2: the pair "es" and "t" occurs 6+3=9 times and is merged into "est". Output:
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est </w>': 6, 'w i d est </w>': 3}

Iter 3: likewise, the most frequent pair is now "est" and "</w>". Output:
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est</w>': 6, 'w i d est</w>': 3}

Iter n: keep iterating until the preset subword vocabulary size is reached, or until the next most frequent pair has frequency 1.
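The merge loop above can be sketched in a few lines of Python. This is a toy illustration of the procedure, not sentencepiece's actual implementation; the function names `get_pair_counts`, `merge_pair`, and `learn_bpe` are hypothetical, and ties are assumed to be broken in favour of the pair seen first.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for word, freq in vocab.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged

def learn_bpe(vocab, num_merges):
    """Run up to `num_merges` merges; stop early once the best pair occurs only once."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # assumption: first pair seen wins ties
        if pairs[best] < 2:
            break
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return vocab, merges

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
final_vocab, merges = learn_bpe(vocab, num_merges=3)
print(merges)
print(final_vocab)
```

With three merges on the example input, the sketch performs the same merges as Iter 1 to Iter 3.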

3.data_load.py

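This module is mainly responsible for producing batches of data.

3.1 Main function: get_batch

It produces the batches used for training and evaluation.
It mainly relies on "tf.data.Dataset.from_generator" to load the data, and maps each token in a sentence to its id in the vocabulary.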

def get_batch(fpath1, fpath2, maxlen1, maxlen2, vocab_fpath, batch_size, shuffle=False):
    '''Gets training / evaluation mini-batches
    fpath1: source file path. string.
    fpath2: target file path. string.
    maxlen1: source sent maximum length. scalar.
    maxlen2: target sent maximum length. scalar.
    vocab_fpath: string. vocabulary file path.
    batch_size: scalar
    shuffle: boolean

    Returns
    batches
    num_batches: number of mini-batches
    num_samples
    '''
    sents1, sents2 = load_data(fpath1, fpath2, maxlen1, maxlen2)
    batches = input_fn(sents1, sents2, vocab_fpath, batch_size, shuffle=shuffle)
    num_batches = calc_num_batches(len(sents1), batch_size)
    return batches, num_batches, len(sents1)

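Return value	Description
batches	A tf.data.Dataset whose elements contain the tuple xs and the tuple ys
num_batches	The number of mini-batches needed to go through the dataset once, i.e. the number of batches per epoch (not the number of epochs)
len(sents1)	The size of the dataset (number of sentence pairs)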
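The helper calc_num_batches is not shown in this article. A plausible reading, given that num_batches counts mini-batches, is a ceiling division of the number of samples by the batch size. The sketch below is written under that assumption, and the name `num_batches_for` is hypothetical.

```python
def num_batches_for(total_num, batch_size):
    """Ceiling division: a final, partially filled batch still counts as one batch."""
    return total_num // batch_size + int(total_num % batch_size != 0)

print(num_batches_for(10, 3))  # 3 full batches plus 1 partial batch
print(num_batches_for(9, 3))   # exactly 3 full batches
```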

3.2 load_data

Loads the data.

def load_data(fpath1, fpath2, maxlen1, maxlen2):
    '''Loads source and target data and filters out too lengthy samples.
    fpath1: source file path. string.
    fpath2: target file path. string.
    maxlen1: source sent maximum length. scalar.
    maxlen2: target sent maximum length. scalar.

    Returns
    sents1: list of source sents
    sents2: list of target sents
    '''
    sents1, sents2 = [], []
    with open(fpath1, 'r') as f1, open(fpath2, 'r') as f2:
        for sent1, sent2 in zip(f1, f2):
            if len(sent1.split()) + 1 > maxlen1: continue # 1: </s>
            if len(sent2.split()) + 1 > maxlen2: continue  # 1: </s>
            sents1.append(sent1.strip())
            sents2.append(sent2.strip())
    return sents1, sents2

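A sentence pair is discarded when the source is longer than maxlen1 or the target is longer than maxlen2 (one extra position is counted for the "</s>" token).
The remaining sentences are kept, and the lists of source and target sentences are returned.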
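The length filter can be reproduced on in-memory lists instead of files. This is a minimal sketch; the name `filter_pairs` and the sample sentences are hypothetical, while the length test is the one used in load_data.

```python
def filter_pairs(src_lines, tgt_lines, maxlen1, maxlen2):
    """Keep a pair only if both sides fit, counting one extra slot for </s>."""
    sents1, sents2 = [], []
    for sent1, sent2 in zip(src_lines, tgt_lines):
        if len(sent1.split()) + 1 > maxlen1:
            continue
        if len(sent2.split()) + 1 > maxlen2:
            continue
        sents1.append(sent1.strip())
        sents2.append(sent2.strip())
    return sents1, sents2

src = ["a b c\n", "a b c d e f\n", "a\n"]
tgt = ["x y\n", "x\n", "x y z w\n"]
kept_src, kept_tgt = filter_pairs(src, tgt, maxlen1=4, maxlen2=4)
print(kept_src, kept_tgt)
```

The second pair is dropped because the source is too long, and the third because the target is too long, so a pair is removed as a whole and the two lists stay aligned.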

3.3 input_fn

def encode(inp, type, dict):
    '''Converts string to number. Used for `generator_fn`.
    inp: 1d byte array.
    type: "x" (source side) or "y" (target side)
    dict: token2idx dictionary

    Returns
    list of numbers
    '''
    inp_str = inp.decode("utf-8")
    if type=="x": tokens = inp_str.split() + ["</s>"]
    else: tokens = ["<s>"] + inp_str.split() + ["</s>"]

    x = [dict.get(t, dict["<unk>"]) for t in tokens]
    return x

def generator_fn(sents1, sents2, vocab_fpath):
    '''Generates training / evaluation data
    sents1: list of source sents
    sents2: list of target sents
    vocab_fpath: string. vocabulary file path.

    yields
    xs: tuple of
        x: list of source token ids in a sent
        x_seqlen: int. sequence length of x
        sent1: str. raw source (=input) sentence
    labels: tuple of
        decoder_input: decoder_input: list of encoded decoder inputs
        y: list of target token ids in a sent
        y_seqlen: int. sequence length of y
        sent2: str. target sentence
    '''
    token2idx, _ = load_vocab(vocab_fpath)
    for sent1, sent2 in zip(sents1, sents2):
        x = encode(sent1, "x", token2idx)
        y = encode(sent2, "y", token2idx)
        decoder_input, y = y[:-1], y[1:]

        x_seqlen, y_seqlen = len(x), len(y)
        yield (x, x_seqlen, sent1), (decoder_input, y, y_seqlen, sent2)

def input_fn(sents1, sents2, vocab_fpath, batch_size, shuffle=False):
    '''Batchify data
    sents1: list of source sents
    sents2: list of target sents
    vocab_fpath: string. vocabulary file path.
    batch_size: scalar
    shuffle: boolean

    Returns
    xs: tuple of
        x: int32 tensor. (N, T1) # token ids of each source sentence
        x_seqlens: int32 tensor. (N,) # original (unpadded) sentence lengths
        sents1: str tensor. (N,) # raw source sentences
    ys: tuple of
        decoder_input: int32 tensor. (N, T2) # token ids fed into the decoder
        y: int32 tensor. (N, T2) # token ids the decoder should output
        y_seqlen: int32 tensor. (N, ) # original (unpadded) sentence lengths
        sents2: str tensor. (N,) # raw target sentences
    '''
    shapes = (([None], (), ()),
              ([None], [None], (), ()))
    types = ((tf.int32, tf.int32, tf.string),
             (tf.int32, tf.int32, tf.int32, tf.string))
    paddings = ((0, 0, ''),
                (0, 0, 0, ''))

    dataset = tf.data.Dataset.from_generator(
        generator_fn,
        output_shapes=shapes,
        output_types=types,
        args=(sents1, sents2, vocab_fpath))  # <- arguments for generator_fn. converted to np string arrays

    if shuffle: # for training
        dataset = dataset.shuffle(128*batch_size)

    dataset = dataset.repeat()  # iterate forever
    dataset = dataset.padded_batch(batch_size, shapes, paddings).prefetch(1)

    return dataset

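"tf.data.Dataset.from_generator" is used to load the data; the generator function is generator_fn.
Each element of the returned dataset has two parts, xs and ys. xs contains x, x_seqlens, and sents1; ys contains decoder_input, y, y_seqlen, and sents2.
For the encoder input sentence x, the end token "</s>" is appended; for the target sentence y, the start token "<s>" and the end token "</s>" are both added.
In the decoder, decoder_input is y[:-1].
In the decoder, the expected output is y[1:].
All tokens are converted to their vocabulary ids, and finally the sentences are padded with "0".
Description of xs and ys:

xs: tuple of
    x: int32 tensor. (N, T1) # token ids of each source sentence
    x_seqlens: int32 tensor. (N,) # original (unpadded) sentence lengths
    sents1: str tensor. (N,) # raw source sentences
ys: tuple of
    decoder_input: int32 tensor. (N, T2) # token ids fed into the decoder
    y: int32 tensor. (N, T2) # token ids the decoder should output
    y_seqlen: int32 tensor. (N, ) # original (unpadded) sentence lengths
    sents2: str tensor. (N,) # raw target sentences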
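The encoding and the one-position shift between decoder_input and y are easy to see on a toy vocabulary. This is a minimal sketch: the vocabulary, the sentences, and the name `encode_tokens` are hypothetical, while the special ids (0 to 3) follow the pad/unk/bos/eos ids passed to sentencepiece earlier.

```python
# Hypothetical toy vocabulary; ids 0-3 mirror --pad_id/--unk_id/--bos_id/--eos_id.
token2idx = {"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3,
             "guten": 4, "morgen": 5, "good": 6, "morning": 7}

def encode_tokens(sentence, side, token2idx):
    """Source side gets </s> appended; target side gets <s> ... </s>."""
    tokens = sentence.split() + ["</s>"]
    if side == "y":
        tokens = ["<s>"] + tokens
    return [token2idx.get(t, token2idx["<unk>"]) for t in tokens]

x = encode_tokens("guten morgen", "x", token2idx)
y_full = encode_tokens("good morning all", "y", token2idx)  # "all" is out of vocabulary

# The decoder reads everything except the last token and predicts everything except the first.
decoder_input, y = y_full[:-1], y_full[1:]

# Padding with 0 up to a common length, as padded_batch does per batch.
x_padded = x + [token2idx["<pad>"]] * (5 - len(x))
print(x, decoder_input, y, x_padded)
```

At every position t, decoder_input[t] is the token the decoder has just seen and y[t] is the token it should predict next.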

4.model.py

This file contains the main code that implements the Transformer model.

4.1 Initialization

    def __init__(self, hp):
        self.hp = hp
        self.token2idx, self.idx2token = load_vocab(hp.vocab)
        self.embeddings = get_token_embeddings(self.hp.vocab_size, self.hp.d_model, zero_pad=True)

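get_token_embeddings: builds the token embedding matrix. The matrix is randomly initialized, and self.embeddings is created through tf.get_variable so that its parameters are shared (the same matrix is used by the encoder, the decoder, and the final projection).
self.embeddings: has shape (vocab_size, d_model); with the author's defaults this is (32001, 512).

4.2 The encode method

This part of the code implements the encoder.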

    def encode(self, xs, training=True):
        '''
        Returns
        memory: encoder outputs. (N, T1, d_model)
        '''
        with tf.variable_scope("encoder", reuse=tf.AUTO_REUSE):
            x, seqlens, sents1 = xs

            # src_masks
            src_masks = tf.math.equal(x, 0) # (N, T1)

            # embedding
            enc = tf.nn.embedding_lookup(self.embeddings, x) # (N, T1, d_model)
            enc *= self.hp.d_model**0.5 # scale

            enc += positional_encoding(enc, self.hp.maxlen1)
            enc = tf.layers.dropout(enc, self.hp.dropout_rate, training=training)

            ## Blocks
            for i in range(self.hp.num_blocks):
                with tf.variable_scope("num_blocks_{}".format(i), reuse=tf.AUTO_REUSE):
                    # self-attention
                    enc = multihead_attention(queries=enc,
                                              keys=enc,
                                              values=enc,
                                              key_masks=src_masks,
                                              num_heads=self.hp.num_heads,
                                              dropout_rate=self.hp.dropout_rate,
                                              training=training,
                                              causality=False)
                    # feed forward
                    enc = ff(enc, num_units=[self.hp.d_ff, self.hp.d_model])
        memory = enc
        return memory, sents1, src_masks

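(1) Meaning of the symbols:

Symbol	Meaning
N	batch_size
T1	Source sentence length
d_model	Embedding dimension

(2) What it does

The input is the token embeddings plus positional_encoding.
The encoder stacks 6 blocks (hp.num_blocks); each block consists of multi-head attention followed by the feed-forward layer ff.

(3) Detailed analysis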

4.2.1 positional encoding

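On its own, the Transformer has no way of accounting for the order of the words in the input sequence, which sets it apart from recurrent sequence models. To address this, the Transformer adds an extra vector, the positional encoding, to the inputs of the encoder and decoder layers. Its dimension is the same as the embedding dimension. The vector follows a specific pattern that lets the model determine the position of the current word, or the distance between different words in a sentence. In this implementation the encoding is computed from a fixed formula, not learned. There are many ways to compute such a position vector; the paper uses:

$$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$$

Here pos is the position of the current word in the sentence and i indexes the dimensions of the vector. Even dimensions (2i) use the sine encoding and odd dimensions (2i+1) use the cosine encoding.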

def positional_encoding(inputs,
                        maxlen,
                        masking=True,
                        scope="positional_encoding"):
    '''Sinusoidal Positional_Encoding. See 3.5
    inputs: 3d tensor. (N, T, E)
    maxlen: scalar. Must be >= T
    masking: Boolean. If True, padding positions are set to zeros.
    scope: Optional scope for `variable_scope`.

    returns
    3d tensor that has the same shape as inputs.
    '''

    E = inputs.get_shape().as_list()[-1] # static
    N, T = tf.shape(inputs)[0], tf.shape(inputs)[1] # dynamic
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        # position indices
        position_ind = tf.tile(tf.expand_dims(tf.range(T), 0), [N, 1]) # (N, T)

        # First part of the PE function: sin and cos argument
        position_enc = np.array([
            [pos / np.power(10000, (i-i%2)/E) for i in range(E)]
            for pos in range(maxlen)])

        # Second part, apply sin to even columns and cos to odd columns.
        position_enc[:, 0::2] = np.sin(position_enc[:, 0::2])  # dim 2i
        position_enc[:, 1::2] = np.cos(position_enc[:, 1::2])  # dim 2i+1
        position_enc = tf.convert_to_tensor(position_enc, tf.float32) # (maxlen, E)

        # lookup
        outputs = tf.nn.embedding_lookup(position_enc, position_ind)

        # masks
        if masking:
            outputs = tf.where(tf.equal(inputs, 0), inputs, outputs)

        return tf.to_float(outputs)

The resulting positional encoding is then added to the (scaled) token embeddings.
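The table of encodings can also be built with vectorized NumPy, which makes the even/odd split easy to inspect. This is a minimal sketch of the same sinusoidal formula; the function name `sinusoidal_positions` is hypothetical, and masking of padded positions is left out.

```python
import numpy as np

def sinusoidal_positions(maxlen, d_model):
    """Return a (maxlen, d_model) table: sin on even columns, cos on odd columns."""
    pos = np.arange(maxlen)[:, None]   # (maxlen, 1)
    col = np.arange(d_model)[None, :]  # (1, d_model)
    # Columns 2i and 2i+1 share the same frequency 1 / 10000^(2i / d_model).
    angles = pos / np.power(10000.0, (col - col % 2) / d_model)
    table = np.zeros((maxlen, d_model))
    table[:, 0::2] = np.sin(angles[:, 0::2])
    table[:, 1::2] = np.cos(angles[:, 1::2])
    return table

pe = sinusoidal_positions(maxlen=50, d_model=16)
print(pe.shape)
print(pe[0, :4])  # position 0: sin(0) = 0 on even columns, cos(0) = 1 on odd columns
```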

4.2.2 multihead_attention

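The core formula behind multi-head attention is scaled dot-product attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

Multi-head attention applies this formula in parallel to num_heads slices of the projected Q, K, and V and concatenates the results.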

def scaled_dot_product_attention(Q, K, V, key_masks,
                                 causality=False, dropout_rate=0.,
                                 training=True,
                                 scope="scaled_dot_product_attention"):
    '''See 3.2.1.
    Q: Packed queries. 3d tensor. [N, T_q, d_k].
    K: Packed keys. 3d tensor. [N, T_k, d_k].
    V: Packed values. 3d tensor. [N, T_k, d_v].
    key_masks: A 2d tensor with shape of [N, key_seqlen]
    causality: If True, applies masking for future blinding
    dropout_rate: A floating point number of [0, 1].
    training: boolean for controlling dropout
    scope: Optional scope for `variable_scope`.
    '''
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        d_k = Q.get_shape().as_list()[-1]

        # dot product
        outputs = tf.matmul(Q, tf.transpose(K, [0, 2, 1]))  # (N, T_q, T_k)

        # scale
        outputs /= d_k ** 0.5

        # key masking
        outputs = mask(outputs, key_masks=key_masks, type="key")

        # causality or future blinding masking
        if causality:
            outputs = mask(outputs, type="future")

        # softmax
        outputs = tf.nn.softmax(outputs)
        attention = tf.transpose(outputs, [0, 2, 1])
        tf.summary.image("attention", tf.expand_dims(attention[:1], -1))

        # # query masking
        # outputs = mask(outputs, Q, K, type="query")

        # dropout
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=training)

        # weighted sum (context vectors)
        outputs = tf.matmul(outputs, V)  # (N, T_q, d_v)

    return outputs

def multihead_attention(queries, keys, values, key_masks,
                        num_heads=8, 
                        dropout_rate=0,
                        training=True,
                        causality=False,
                        scope="multihead_attention"):
    '''Applies multihead attention. See 3.2.2
    queries: A 3d tensor with shape of [N, T_q, d_model].
    keys: A 3d tensor with shape of [N, T_k, d_model].
    values: A 3d tensor with shape of [N, T_k, d_model].
    key_masks: A 2d tensor with shape of [N, key_seqlen]
    num_heads: An int. Number of heads.
    dropout_rate: A floating point number.
    training: Boolean. Controller of mechanism for dropout.
    causality: Boolean. If true, units that reference the future are masked.
    scope: Optional scope for `variable_scope`.
        
    Returns
      A 3d tensor with shape of (N, T_q, C)  
    '''
    d_model = queries.get_shape().as_list()[-1]
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        # Linear projections
        Q = tf.layers.dense(queries, d_model, use_bias=True) # (N, T_q, d_model)
        K = tf.layers.dense(keys, d_model, use_bias=True) # (N, T_k, d_model)
        V = tf.layers.dense(values, d_model, use_bias=True) # (N, T_k, d_model)
        
        # Split and concat
        Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0) # (h*N, T_q, d_model/h)
        K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0) # (h*N, T_k, d_model/h)
        V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0) # (h*N, T_k, d_model/h)

        # Attention
        outputs = scaled_dot_product_attention(Q_, K_, V_, key_masks, causality, dropout_rate, training)

        # Restore shape
        outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2 ) # (N, T_q, d_model)
              
        # Residual connection
        outputs += queries
              
        # Normalize
        outputs = ln(outputs)
 
    return outputs

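Here d_k = d_model / num_heads = 512 / 8 = 64.
A residual connection and layer normalization are both applied.
Note that in the encoder, queries, keys, and values are all the same tensor (enc), which is what makes it self-attention.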
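The head splitting and the masked softmax can be checked with plain NumPy. This is a minimal sketch under simplifying assumptions: no learned projections, no dropout, no causal mask, and the names `attention`, `split_heads`, and `merge_heads` are hypothetical. Padded keys are masked by setting their scores to a large negative number before the softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, key_pad=None):
    """Scaled dot-product attention. key_pad: (N, T_k) bool, True where the key is padding."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (N, T_q, T_k)
    if key_pad is not None:
        scores = np.where(key_pad[:, None, :], -1e9, scores)
    weights = softmax(scores)
    return weights @ V, weights

def split_heads(x, h):
    """(N, T, d_model) -> (h*N, T, d_model/h), as tf.split followed by tf.concat does."""
    return np.concatenate(np.split(x, h, axis=2), axis=0)

def merge_heads(x, h):
    """(h*N, T, d_model/h) -> (N, T, d_model)."""
    return np.concatenate(np.split(x, h, axis=0), axis=2)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 8))  # N=2, T=3, d_model=8
key_pad = np.array([[False, False, True], [False, False, False]])
out, w = attention(x, x, x, key_pad)
print(out.shape, split_heads(x, 2).shape)
```

Stacking the heads along the batch axis lets one batched matmul serve all heads at once, which is why the code concatenates on axis 0 and later restores the shape on axis 2.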

4.3 The decode method

This part of the code implements the decoder.

    def decode(self, ys, memory, src_masks, training=True):
        '''
        memory: encoder outputs. (N, T1, d_model)
        src_masks: (N, T1)

        Returns
        logits: (N, T2, V). float32.
        y_hat: (N, T2). int32
        y: (N, T2). int32
        sents2: (N,). string.
        '''
        with tf.variable_scope("decoder", reuse=tf.AUTO_REUSE):
            decoder_inputs, y, seqlens, sents2 = ys

            # tgt_masks
            tgt_masks = tf.math.equal(decoder_inputs, 0)  # (N, T2)

            # embedding
            dec = tf.nn.embedding_lookup(self.embeddings, decoder_inputs)  # (N, T2, d_model)
            dec *= self.hp.d_model ** 0.5  # scale

            dec += positional_encoding(dec, self.hp.maxlen2)
            dec = tf.layers.dropout(dec, self.hp.dropout_rate, training=training)

            # Blocks
            for i in range(self.hp.num_blocks):
                with tf.variable_scope("num_blocks_{}".format(i), reuse=tf.AUTO_REUSE):
                    # Masked self-attention (Note that causality is True at this time)
                    dec = multihead_attention(queries=dec,
                                              keys=dec,
                                              values=dec,
                                              key_masks=tgt_masks,
                                              num_heads=self.hp.num_heads,
                                              dropout_rate=self.hp.dropout_rate,
                                              training=training,
                                              causality=True,
                                              scope="self_attention")

                    # Vanilla attention
                    dec = multihead_attention(queries=dec,
                                              keys=memory,
                                              values=memory,
                                              key_masks=src_masks,
                                              num_heads=self.hp.num_heads,
                                              dropout_rate=self.hp.dropout_rate,
                                              training=training,
                                              causality=False,
                                              scope="vanilla_attention")
                    ### Feed Forward
                    dec = ff(dec, num_units=[self.hp.d_ff, self.hp.d_model])

        # Final linear projection (embedding weights are shared)
        weights = tf.transpose(self.embeddings) # (d_model, vocab_size)
        logits = tf.einsum('ntd,dk->ntk', dec, weights) # (N, T2, vocab_size)
        y_hat = tf.to_int32(tf.argmax(logits, axis=-1))

        return logits, y_hat, y, sents2

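The decoder differs from the encoder in that each block contains two multi-head attention layers.

The first multi-head attention layer is self-attention, as in the encoder, except that causality=True masks out future positions.
The second multi-head attention layer takes different inputs for Q, K, and V: the queries come from the decoder, while keys and values are memory, which is the output of the encoder.
tf.einsum is used to perform the matrix multiplication. Generally, a plain matmul cannot multiply a 3-D tensor by a 2-D matrix directly, whereas einsum can: with the subscripts "ntd,dk->ntk" the result has shape [n, t, k].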
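NumPy's einsum accepts the same subscript string, so the final projection can be illustrated without TensorFlow. This is a minimal sketch with made-up shapes and random values; it only demonstrates what "ntd,dk->ntk" computes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T2, d_model, vocab_size = 2, 4, 8, 10          # made-up sizes
embeddings = rng.normal(size=(vocab_size, d_model))
dec = rng.normal(size=(N, T2, d_model))           # stand-in for the decoder output

weights = embeddings.T                             # (d_model, vocab_size), shared with the embedding
logits = np.einsum("ntd,dk->ntk", dec, weights)   # (N, T2, vocab_size)

# The same result, one sentence at a time with ordinary 2-D matrix products.
logits_loop = np.stack([dec[n] @ weights for n in range(N)])

y_hat = logits.argmax(axis=-1)                     # (N, T2) predicted token ids
print(logits.shape, y_hat.shape)
```

The shared index d is summed over, while n and t are carried through unchanged, which is exactly a per-position matrix product.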

4.4 The train function

This function trains the model.
The label_smoothing function smooths the one-hot label vectors:

def label_smoothing(inputs, epsilon=0.1):
    '''Applies label smoothing. See 5.4 and https://arxiv.org/abs/1512.00567.
    inputs: 3d tensor. [N, T, V], where V is the number of vocabulary.
    epsilon: Smoothing rate.
    
    For example,
    
    ''
    import tensorflow as tf
    inputs = tf.convert_to_tensor([[[0, 0, 1], 
       [0, 1, 0],
       [1, 0, 0]],

      [[1, 0, 0],
       [1, 0, 0],
       [0, 1, 0]]], tf.float32)
       
    outputs = label_smoothing(inputs)
    
    with tf.Session() as sess:
        print(sess.run([outputs]))
    
    >>
    [array([[[ 0.03333334,  0.03333334,  0.93333334],
        [ 0.03333334,  0.93333334,  0.03333334],
        [ 0.93333334,  0.03333334,  0.03333334]],

       [[ 0.93333334,  0.03333334,  0.03333334],
        [ 0.93333334,  0.03333334,  0.03333334],
        [ 0.03333334,  0.93333334,  0.03333334]]], dtype=float32)]   
    ''   
    '''
    V = inputs.get_shape().as_list()[-1] # number of channels
    return ((1-epsilon) * inputs) + (epsilon / V)

The loss is the cross-entropy, computed so that padding positions do not contribute.

    y_ = label_smoothing(tf.one_hot(y, depth=self.hp.vocab_size))
    ce = tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits, labels=y_)
    nonpadding = tf.to_float(tf.not_equal(y, self.token2idx["<pad>"]))  # 0: <pad>
    loss = tf.reduce_sum(ce * nonpadding) / (tf.reduce_sum(nonpadding) + 1e-7)
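The effect of the nonpadding mask can be verified numerically. This is a minimal NumPy sketch of the same four steps (smooth, cross-entropy, mask, average); the function name `smoothed_masked_loss` is hypothetical and pad_id=0 is assumed, matching the "<pad>" id used above.

```python
import numpy as np

def smoothed_masked_loss(logits, y, pad_id=0, epsilon=0.1):
    """Label-smoothed cross-entropy averaged over non-padding positions only."""
    V = logits.shape[-1]
    targets = (1 - epsilon) * np.eye(V)[y] + epsilon / V      # (N, T, V)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    ce = -(targets * log_probs).sum(axis=-1)                  # (N, T)
    nonpadding = (y != pad_id).astype(np.float64)             # 1 for real tokens, 0 for <pad>
    return (ce * nonpadding).sum() / (nonpadding.sum() + 1e-7)

rng = np.random.default_rng(0)
logits = rng.normal(size=(1, 3, 4))
y = np.array([[2, 3, 0]])                                     # last position is padding

loss = smoothed_masked_loss(logits, y)
changed = logits.copy()
changed[0, 2] += 5.0                                          # perturb only the padded position
loss_changed = smoothed_masked_loss(changed, y)
print(loss, loss_changed)
```

Perturbing the logits at the padded position leaves the loss unchanged, which is the point of multiplying by nonpadding.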

The learning rate is also scheduled with a warmup: it rises gradually in the initial phase and decays slowly in later iterations.

    global_step = tf.train.get_or_create_global_step()
    lr = noam_scheme(self.hp.lr, global_step, self.hp.warmup_steps)
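The body of noam_scheme is not shown in this article. The sketch below assumes the warmup schedule from the Transformer paper, rescaled so that the rate peaks at init_lr exactly when step equals warmup_steps; the name `noam_lr` and the sample values are hypothetical.

```python
def noam_lr(init_lr, step, warmup_steps=4000):
    """Linear warmup for warmup_steps, then decay proportional to step ** -0.5."""
    step = max(step, 1)  # guard against 0 ** -0.5
    return init_lr * warmup_steps ** 0.5 * min(step * warmup_steps ** -1.5, step ** -0.5)

for step in (100, 2000, 4000, 16000):
    print(step, noam_lr(3e-4, step))
```

Before warmup_steps the first argument of min is the smaller one, so the rate grows linearly; afterwards the second one takes over and the rate decays.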

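4.5 The eval function

It is broadly similar to the train function.
The decoder input is not taken from the target data. Instead it is built from the shape of the input xs: decoder_inputs has shape [N, 1], where N is the batch size, so only the start token <s> is fed in. The next token is then predicted one at a time, and each prediction is concatenated onto the decoder input.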

    decoder_inputs, y, y_seqlen, sents2 = ys

    decoder_inputs = tf.ones((tf.shape(xs[0])[0], 1), tf.int32) * self.token2idx["<s>"]
    ys = (decoder_inputs, y, y_seqlen, sents2)

    memory, sents1, src_masks = self.encode(xs, False)

    logging.info("Inference graph is being built. Please be patient.")
    for _ in tqdm(range(self.hp.barrages_maxlen2)):
        logits, y_hat, y, sents2 = self.decode(ys, memory, src_masks, False)
        if tf.reduce_sum(y_hat, 1) == self.token2idx["<pad>"]: break

        _decoder_inputs = tf.concat((decoder_inputs, y_hat), 1)
        ys = (_decoder_inputs, y, y_seqlen, sents2)

The code above finally produces $y_{hat}$, a matrix of shape $[N,T2]$, where $N$ is $batch\_size$ and $T2$ is the sentence length.
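The token-by-token loop is easier to follow for a single sentence in plain Python. This is a minimal sketch of greedy autoregressive decoding, not the graph-building code above: `greedy_decode` and `toy_next_token` are hypothetical, and the toy function stands in for a full encoder/decoder pass.

```python
def greedy_decode(next_token, bos_id, eos_id, max_len):
    """Start from <s>, append one predicted token per step, stop at </s> or max_len."""
    tokens = [bos_id]
    for _ in range(max_len):
        tok = next_token(tokens)  # the model sees everything generated so far
        tokens.append(tok)
        if tok == eos_id:
            break
    return tokens[1:]  # drop the leading <s>

def toy_next_token(tokens):
    """Hypothetical stand-in for the model: emits 5, then 6, then </s> (id 3)."""
    script = [5, 6, 3]
    return script[len(tokens) - 1]

full = greedy_decode(toy_next_token, bos_id=2, eos_id=3, max_len=10)
truncated = greedy_decode(toy_next_token, bos_id=2, eos_id=3, max_len=2)
print(full, truncated)
```

The max_len bound plays the role of the loop limit in the eval code: decoding stops there even if no end token has been produced.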

5.train.py

This script trains the model; the trained model is saved in the log folder.
It also computes the BLEU score, using the file 'multi-bleu.perl'.

洛克-李
