Google BERT Model Explained: A Source Code Walkthrough

Table of Contents

Model Overview

1. A High-Level Look

2. Encoder(http://jalammar.github.io/illustrated-transformer/)

3. Self-Attention(http://jalammar.github.io/illustrated-transformer/)

4. Matrix Calculation of Self-Attention(http://jalammar.github.io/illustrated-transformer/)

5. Multi-headed Self-Attention(http://jalammar.github.io/illustrated-transformer/) 

Part 1: Generating Pre-training Data

1. tokenizer

2. create_training_instances

3. create_instances_from_document

4. create_masked_lm_predictions

5. write_instance_to_example_files

Part 2: Building the Model

1. embedding_postprocessor

2. transformer_model

2.1 attention_layer

Part 3: The Pre-training Process

1. model_fn_builder

1.1 Building the Masked LM Loss

1.2 Building the Next Sentence Prediction Loss

Reference


P.S. This post is fairly long, but reading it to the end should be a real help for understanding both the principles behind BERT and its implementation. If you spot any mistakes, corrections are very welcome.

Model Overview

        On October 11, 2018, the Google team published "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", which caused a sensation in the NLP community. Why is it so popular? First, it works: BERT achieved state-of-the-art results on 11 NLP tasks, including NER and question answering. Second, it applies the "pre-training + fine-tuning" approach to NLP in a more mature way (as in computer vision; "more mature" because it builds on predecessors such as word2vec, GloVe, and fastText and on contemporaries such as GPT and ELMo), which makes it something of a milestone for the field. As Dr. Zhang Junlin puts it in his blog, "objectively speaking, it is more accurate to regard BERT as the culmination of the major NLP advances of the past two years."

       There are plenty of introductions to the model online; here I only list the points I consider most essential:

1. A High-Level Look

       The name BERT already reflects the model itself:

       1) Deep: the base model has 12 layers and 110M parameters; the large model has 24 layers and 340M parameters;

       2) Bidirectional: every token can see its context on both sides;

       3) Transformer: instead of classic recurrent structures such as RNNs and LSTMs, the authors use the Transformer as the basic unit, i.e. self-attention (the sequence attends to itself). This not only captures long-range dependencies but also greatly improves parallelism compared with traditional attention over recurrent networks.

       The model itself is not hugely novel: the core Transformer comes from "Attention Is All You Need" (also a Google paper, 2017). The pre-training, however, uses a few tricks that significantly improve the model's expressive power.

       1) Input Representation: BERT no longer uses WordPiece embeddings alone; it adds segment embeddings and position embeddings on top, as shown in the figure in the paper (https://arxiv.org/pdf/1810.04805.pdf). The three embeddings are explained in detail in the code walkthrough, and a minimal sketch of how they are combined appears right after this list.

       2) Masked LM: input tokens are randomly masked and the model predicts the original tokens at the corresponding output positions (details in the code section). The authors say the idea was inspired by the Cloze task; in essence it resembles CBOW (as many have observed).

       3) Next Sentence Prediction: sentences A and B are concatenated as the input, and at the output the model predicts whether B is the actual next sentence of A. BERT pre-training is therefore multi-task: it predicts the masked tokens and also whether B follows A, which lets the model learn something about the relationship between sentences.
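
       To make item 1) concrete, here is a minimal sketch (plain NumPy, not BERT's code; the shapes and toy ids are made up for illustration) of how the three embeddings are looked up and summed into the input representation:

import numpy as np

np.random.seed(0)
vocab_size, type_vocab_size, max_position, hidden = 30522, 2, 512, 768

# Toy embedding tables, analogous to word_embeddings, token_type_embeddings and
# position_embeddings in modeling.py.
word_table = np.random.randn(vocab_size, hidden) * 0.02
segment_table = np.random.randn(type_vocab_size, hidden) * 0.02
position_table = np.random.randn(max_position, hidden) * 0.02

# One example: [CLS] A1 A2 [SEP] B1 [SEP]  (the ids here are arbitrary)
input_ids = np.array([101, 2023, 2001, 102, 2012, 102])
segment_ids = np.array([0, 0, 0, 0, 1, 1])
positions = np.arange(len(input_ids))

# The input representation is the element-wise sum of the three lookups.
embeddings = word_table[input_ids] + segment_table[segment_ids] + position_table[positions]
print(embeddings.shape)  # (6, 768)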

2. Encoder(http://jalammar.github.io/illustrated-transformer/)

(figure: the Transformer encoder, from The Illustrated Transformer)

       Each encoder block contains two sub-layers, self-attention and a feed-forward neural network (FFNN), each followed by a residual connection and layer normalization.

3. Self-Attention(http://jalammar.github.io/illustrated-transformer/)

       Jay Alammar's The Illustrated Transformer walks through the self-attention computation step by step and explains the Transformer encoder in detail; q (Query), k (Key), and v (Value) are covered below.

4. Matrix Calculation of Self-Attention(http://jalammar.github.io/illustrated-transformer/)

       Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

       This formula is the matrix form of the Transformer's self-attention forward computation, where d_k is the per-head dimension (size_per_head in the code).

5. Multi-headed Self-Attention(http://jalammar.github.io/illustrated-transformer/)

       In multi-headed self-attention, each head produces its own Z_i; all Z_i are concatenated and multiplied by a matrix W^O to obtain Z. The main effect of multiple heads is to add representation subspaces: each head has its own set of W^Q, W^K, W^V, so the different Z_i live in different representation subspaces, which increases the model's representational capacity. A minimal sketch of this computation follows.
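
       A minimal, self-contained sketch of the multi-headed computation (plain NumPy with toy sizes; this is not BERT's code, just the calculation described above):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

np.random.seed(0)
seq_len, hidden, num_heads = 4, 8, 2
size_per_head = hidden // num_heads

x = np.random.randn(seq_len, hidden)      # one sequence of token representations
w_q = np.random.randn(hidden, hidden)     # W^Q, W^K, W^V for all heads at once
w_k = np.random.randn(hidden, hidden)
w_v = np.random.randn(hidden, hidden)
w_o = np.random.randn(hidden, hidden)     # W^O

def split_heads(t):
    # [seq_len, hidden] -> [num_heads, seq_len, size_per_head]
    return t.reshape(seq_len, num_heads, size_per_head).transpose(1, 0, 2)

q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

# Scaled dot-product attention per head: softmax(Q K^T / sqrt(d_k)) V
scores = q @ k.transpose(0, 2, 1) / np.sqrt(size_per_head)   # [heads, seq, seq]
z_per_head = softmax(scores) @ v                             # [heads, seq, size_per_head]

# Concatenate the heads' Z_i and project with W^O to obtain Z.
z = z_per_head.transpose(1, 0, 2).reshape(seq_len, hidden) @ w_o
print(z.shape)  # (4, 8)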

       That concludes the model overview; the rest of the post dissects the implementation details through the source code.

Part 1: Generating Pre-training Data

       Let's start with a concrete feel for what the generated data looks like:

INFO:tensorflow:*** Example ***
INFO:tensorflow:tokens: [CLS] this was nearly opposite . [SEP] at last dunes reached the quay at [MASK] opposite end of [MASK] street [MASK] and there burst on [MASK] ##am ##mon [MASK] s [MASK] eyes a vast semi [MASK] ##rcle of blue sea , ring ##ed with palaces and towers . [MASK] stopped in ##vo ##lun ##tar [MASK] ; and his little guide [MASK] also , and looked ask ##ance at the young monk , [MASK] watch the effect which that [MASK] panorama should produce on him . [SEP]
INFO:tensorflow:input_ids: 101 2023 2001 3053 4500 1012 102 2012 2197 17746 2584 1996 21048 2012 103 4500 2203 1997 103 2395 103 1998 2045 6532 2006 103 3286 8202 103 1055 103 2159 1037 6565 4100 103 21769 1997 2630 2712 1010 3614 2098 2007 22763 1998 7626 1012 103 3030 1999 6767 26896 7559 103 1025 1998 2010 2210 5009 103 2036 1010 1998 2246 3198 6651 2012 1996 2402 8284 1010 103 3422 1996 3466 2029 2008 103 23652 2323 3965 2006 2032 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_positions: 9 14 18 20 25 28 30 35 48 54 60 72 78 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_ids: 2027 1996 1996 1025 6316 1005 22741 6895 2002 6588 3030 2000 2882 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
INFO:tensorflow:next_sentence_labels: 1

       The pre-training data for BERT is generated mainly in create_pretraining_data.py and tokenization.py.

       First, the official command line for generating pre-training data:

python create_pretraining_data.py \
         --input_file=./sample_text.txt \     # input file(s), comma-separated (see sample_text.txt in the repo for the format)
         --output_file=/tmp/tf_examples.tfrecord \ # output file(s), comma-separated
         --vocab_file=$BERT_BASE_DIR/vocab.txt \   # vocabulary file provided by Google
         --do_lower_case=True \                    # if True, the input is lowercased
         --max_seq_length=128 \ # maximum length of each training example; longer sequences are truncated, shorter ones padded
         --max_predictions_per_seq=20 \            # maximum number of masked tokens per example
         --masked_lm_prob=0.15 \  # mask tokens with 15% probability (not exactly, see the code)
         --random_seed=12345 \                     # random seed
         --dupe_factor=5    # outermost loop: intuitively, each sentence is turned into
                              dupe_factor examples (each with different next sentences and masks)

       We will start from main in create_pretraining_data.py...

def main(_):
  tf.logging.set_verbosity(tf.logging.INFO)
  # 1. Use FullTokenizer to preprocess the text. See "1. tokenizer" below.
  tokenizer = tokenization.FullTokenizer(
      vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
  # expand the comma-separated input patterns into a list of files
  input_files = []
  for input_pattern in FLAGS.input_file.split(","):
    input_files.extend(tf.gfile.Glob(input_pattern))

  tf.logging.info("*** Reading from input files ***")
  for input_file in input_files:
    tf.logging.info("  %s", input_file)

  rng = random.Random(FLAGS.random_seed)
  # 2. Build the training instances. See "2. create_training_instances" below.
  instances = create_training_instances(
      input_files, tokenizer, FLAGS.max_seq_length, FLAGS.dupe_factor,
      FLAGS.short_seq_prob, FLAGS.masked_lm_prob, FLAGS.max_predictions_per_seq,
      rng)

  output_files = FLAGS.output_file.split(",")
  tf.logging.info("*** Writing to output files ***")
  for output_file in output_files:
    tf.logging.info("  %s", output_file)

  write_instance_to_example_files(instances, tokenizer, FLAGS.max_seq_length,
                                  FLAGS.max_predictions_per_seq, output_files)

1. tokenizer

       tokenizer is defined in tokenization.py, which provides three tokenizers: BasicTokenizer, WordpieceTokenizer, and FullTokenizer. Each one has a tokenize method used to preprocess text. FullTokenizer combines BasicTokenizer and WordpieceTokenizer, hence the name "Full".

       1) BasicTokenizer

       Purpose: preprocess the raw text, i.e. remove invalid characters, normalize whitespace to spaces, put spaces around Chinese (and some Japanese/Korean) characters, strip accents, and finally split on whitespace, returning a list of tokens.

       It has a single parameter, do_lower_case, which controls whether the text is lowercased.

  def tokenize(self, text):
    """Tokenizes a piece of text."""
    text = convert_to_unicode(text)
    text = self._clean_text(text)    # 1.1 remove invalid characters, convert all whitespace to spaces

    # This was added on November 1st, 2018 for the multilingual and Chinese
    # models. This is also applied to the English models now, but it doesn't
    # matter since the English models were not trained on any Chinese data
    # and generally don't have any Chinese data in them (there are Chinese
    # characters in the vocabulary because Wikipedia does have some Chinese
    # words in the English Wikipedia.).
    text = self._tokenize_chinese_chars(text)    # 1.2 add spaces around Chinese (and some Japanese/Korean) characters; CJK text already separated by spaces is left as is

    orig_tokens = whitespace_tokenize(text)    # 2. split the cleaned text on whitespace into a list
    split_tokens = []
    # 3. for each token, strip accents (via unicodedata) and further split tokens that contain punctuation
    for token in orig_tokens:
      if self.do_lower_case:
        token = token.lower()
        token = self._run_strip_accents(token)
      split_tokens.extend(self._run_split_on_punc(token))

    output_tokens = whitespace_tokenize(" ".join(split_tokens))
    # return the list of words/characters after basic preprocessing
    return output_tokens

       2) WordpieceTokenizer

       Purpose: further split the token list produced by BasicTokenizer into word pieces (according to the vocabulary shipped with Google's BERT models) and return a list of in-vocabulary tokens.

       It takes three parameters:

       a) vocab: the vocabulary (a dict);

       b) unk_token (str, default '[UNK]'): unknown strings are replaced with [UNK], as are tokens longer than max_input_chars_per_word;

       c) max_input_chars_per_word (default 200): the maximum length of a single token.

  def tokenize(self, text):
    """Tokenizes a piece of text into its word pieces.
    This uses a greedy longest-match-first algorithm to perform tokenization
    using the given vocabulary.

    For example:
      input = "unaffable"
      output = ["un", "##aff", "##able"]

    Args:
      text: A single token or whitespace separated tokens. This should have
        already been passed through `BasicTokenizer.

    Returns:
      A list of wordpiece tokens.
    """

    text = convert_to_unicode(text)

    output_tokens = []    # output list of word pieces
    for token in whitespace_tokenize(text):
      chars = list(token)
      # tokens longer than max_input_chars_per_word are replaced with [UNK]
      if len(chars) > self.max_input_chars_per_word:
        output_tokens.append(self.unk_token)
        continue

      is_bad = False
      start = 0
      sub_tokens = []    # sub-word pieces for this token
      while start < len(chars):
        end = len(chars)
        cur_substr = None    # current candidate sub-word
        while start < end:   # start moves forward one sub-word at a time; end moves backward one character at a time
          substr = "".join(chars[start:end])
          if start > 0:
            substr = "##" + substr    # non-initial pieces get the '##' prefix
          if substr in self.vocab:    # accept the candidate only if it is in the vocabulary
            cur_substr = substr
            break
          end -= 1    # shrink the candidate by one character from the end
        if cur_substr is None:
          is_bad = True
          break
        sub_tokens.append(cur_substr)
        start = end    # advance start past the matched sub-word

      if is_bad:
        output_tokens.append(self.unk_token)
      else:
        output_tokens.extend(sub_tokens)
    return output_tokens

       3) FullTokenizer 

       Purpose: a wrapper around BasicTokenizer and WordpieceTokenizer; its tokenize method returns the fully preprocessed token list.

       The class is straightforward and is not covered here.

       That wraps up preprocessing. What is worth borrowing is the overall pipeline: there is no complicated rule-based cleaning such as regexes for stripping IPs, phone numbers, or URLs; it simply removes invalid characters, whitespace, and accents, and then greedily splits tokens against the vocabulary. It works well and is worth keeping in mind. A short usage sketch follows.
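
       A minimal usage sketch (hedged: it assumes the BERT repo is on the Python path and a vocab.txt from a pre-trained checkpoint is available locally; the exact word pieces depend on that vocabulary):

import tokenization  # from the google-research/bert repo

# vocab.txt ships with each pre-trained BERT checkpoint.
tokenizer = tokenization.FullTokenizer(
    vocab_file="vocab.txt", do_lower_case=True)

tokens = tokenizer.tokenize(u"John Johanson's house was unaffable.")
print(tokens)
# e.g. ['john', 'johan', '##son', "'", 's', 'house', 'was', 'un', '##aff', '##able', '.']
# (the exact pieces depend on the vocabulary)

ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)  # the corresponding vocabulary ids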

2. create_training_instances

       Purpose: build the data set from the raw text of the input documents.

def create_training_instances(input_files, tokenizer, max_seq_length,
                              dupe_factor, short_seq_prob, masked_lm_prob,
                              max_predictions_per_seq, rng):
  """Create `TrainingInstance`s from raw text.
  Inputs:
    input_files: list of input file paths
    tokenizer: the preprocessing class described above (FullTokenizer)
    max_seq_length: command-line flag, maximum sequence length of a single example
    dupe_factor: command-line flag, number of times training data is generated from each
        document (i.e. each sentence appears in dupe_factor examples, since the masks are
        sampled randomly)
    short_seq_prob: command-line flag, default 0.1; generate shorter examples 10% of the
        time to improve robustness
    masked_lm_prob: command-line flag, default 0.15; mask tokens with 15% probability
        (not exactly, see the code)
    max_predictions_per_seq: command-line flag, default 20; together with masked_lm_prob
        it determines the number of masked tokens, see the code
    rng: a random.Random instance
  Outputs:
    instances: a list of TrainingInstance, one per training example, containing the masked
        tokens, segment_ids (sentence ids), is_random_next, masked_lm_positions (masked
        indices), and masked_lm_labels (the original tokens)
  """
  # all_documents is a nested list: the first level is a document, each document is a
  # list of lines, and each line is a list of tokens
  all_documents = [[]]

  # Input file format (see Google's sample_text.txt for an example):
  # (1) one sentence per line
  # (2) documents are separated by blank lines (this makes it easy to build
  #     "next sentence prediction" examples)
  # Read the input files and store their contents in all_documents
  for input_file in input_files:
    with tf.gfile.GFile(input_file, "r") as reader:
      while True:
        line = tokenization.convert_to_unicode(reader.readline())
        if not line:
          break
        line = line.strip()

        # Empty lines are used as document delimiters
        if not line:
          all_documents.append([])
        # call FullTokenizer.tokenize (which internally calls BasicTokenizer's and
        # WordpieceTokenizer's tokenize methods)
        tokens = tokenizer.tokenize(line)
        if tokens:
          all_documents[-1].append(tokens)

  # Remove empty documents
  all_documents = [x for x in all_documents if x]
  # shuffle the first dimension (documents)
  rng.shuffle(all_documents)

  vocab_words = list(tokenizer.vocab.keys())    # all words in the vocabulary
  instances = []
  # dupe_factor (from the command line) is the outermost loop: intuitively, each sentence
  # produces dupe_factor examples (though each example gets different next sentences and masks)
  for _ in range(dupe_factor):    
    for document_index in range(len(all_documents)):
      # because of the "next sentence prediction" task, examples are actually generated by
      # create_instances_from_document, one document (the first dimension of all_documents) at a time
      instances.extend(
          create_instances_from_document(
              all_documents, document_index, max_seq_length, short_seq_prob,
              masked_lm_prob, max_predictions_per_seq, vocab_words, rng))

  rng.shuffle(instances)
  return instances

3. create_instances_from_document

def create_instances_from_document(
    all_documents, document_index, max_seq_length, short_seq_prob,
    masked_lm_prob, max_predictions_per_seq, vocab_words, rng):
  """Generates training examples from a single document.
  Inputs:
    all_documents: the list of preprocessed documents
    document_index: index of the current document within all_documents
  Outputs:
    instances: a list of TrainingInstance, one per training example, containing the masked
        tokens, segment_ids (sentence ids), is_random_next, masked_lm_positions (masked
        indices), and masked_lm_labels (the original tokens)
  """
  document = all_documents[document_index]    # the document currently used to generate examples

  # Account for [CLS], [SEP], [SEP]
  # Each example consists of two "sentences" A and B; the sequence starts with '[CLS]'
  # and each sentence ends with '[SEP]', i.e. [CLS] A [SEP] B [SEP], so the effective
  # maximum number of tokens is max_seq_length - 3
  max_num_tokens = max_seq_length - 3

  # We *usually* want to fill up the entire sequence since we are padding
  # to `max_seq_length` anyways, so short sequences are generally wasted
  # computation. However, we *sometimes*
  # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
  # sequences to minimize the mismatch between pre-training and fine-tuning.
  # The `target_seq_length` is just a rough target however, whereas
  # `max_seq_length` is a hard limit.
  # To make the model more robust and help it generalize to other NLP tasks, shorter
  # training examples are generated with 10% probability. In practice the fraction ends up
  # around 40% (with sample_text.txt as input), because
  # 1) sentence A may hit the end of the document;
  # 2) there may be too few candidate sentences when building sentence B, etc.
  target_seq_length = max_num_tokens
  if rng.random() < short_seq_prob:
    target_seq_length = rng.randint(2, max_num_tokens)

  instances = []
  current_chunk = []    # candidate sentences for the current example
  current_length = 0    # total number of tokens in the candidate sentences
  i = 0
  while i < len(document):    
    segment = document[i]
    current_chunk.append(segment)    # add this sentence to the candidate list
    current_length += len(segment)
    # only proceed when we reach the last sentence of the document or the candidates'
    # total token count reaches the target length; otherwise keep appending sentences
    if i == len(document) - 1 or current_length >= target_seq_length:
      if current_chunk:
        # `a_end` is how many segments from `current_chunk` go into the `A`
        # (first) sentence.
        a_end = 1    # number of sentences from current_chunk that go into A
        if len(current_chunk) >= 2:
          a_end = rng.randint(1, len(current_chunk) - 1)

        tokens_a = []
        # build sentence A
        for j in range(a_end):
          tokens_a.extend(current_chunk[j])

        tokens_b = []
        # Random next  
        is_random_next = False
        # sentence B is random when either 1) we have reached the end of the current
        # document (no other choice), or 2) a 50% coin flip says so
        if len(current_chunk) == 1 or rng.random() < 0.5:
          is_random_next = True
          target_b_length = target_seq_length - len(tokens_a)

          # try up to 10 times to sample a different document index; failing all 10 times is practically impossible
          for _ in range(10):
            random_document_index = rng.randint(0, len(all_documents) - 1)
            if random_document_index != document_index:
              break
          # build sentence B; at this point len(A) + len(B) may exceed target_seq_length, it is trimmed later
          random_document = all_documents[random_document_index]
          random_start = rng.randint(0, len(random_document) - 1)
          for j in range(random_start, len(random_document)):
            tokens_b.extend(random_document[j])
            if len(tokens_b) >= target_b_length:
              break
          # to avoid wasting raw text, roll i back so the unused segments (those after a_end) are reused
          num_unused_segments = len(current_chunk) - a_end
          i -= num_unused_segments
        # sentence B is the actual next text
        else:
          is_random_next = False
          for j in range(a_end, len(current_chunk)):
            tokens_b.extend(current_chunk[j])
        # trim the pair down to max_num_tokens: always trim the longer of A and B, removing
        # a token from its front or back with equal probability
        truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng)

        assert len(tokens_a) >= 1
        assert len(tokens_b) >= 1

        tokens = []       # becomes the token input described in 'Model Overview - Input Representation'
        segment_ids = []  # becomes the segment ids described in 'Model Overview - Input Representation'
        tokens.append("[CLS]")
        segment_ids.append(0)    # sentence A maps to segment id 0
        for token in tokens_a:
          tokens.append(token)
          segment_ids.append(0)

        tokens.append("[SEP]")
        segment_ids.append(0)

        for token in tokens_b:
          tokens.append(token)
          segment_ids.append(1)    # sentence B maps to segment id 1
        tokens.append("[SEP]")
        segment_ids.append(1)
        # the raw example is now complete; next apply random masking to obtain the final
        # training example, see create_masked_lm_predictions
        (tokens, masked_lm_positions,
         masked_lm_labels) = create_masked_lm_predictions(
             tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng)
        instance = TrainingInstance(
            tokens=tokens,
            segment_ids=segment_ids,
            is_random_next=is_random_next,
            masked_lm_positions=masked_lm_positions,
            masked_lm_labels=masked_lm_labels)
        instances.append(instance)
      current_chunk = []
      current_length = 0
    i += 1

  return instances

4. create_masked_lm_predictions

def create_masked_lm_predictions(tokens, masked_lm_prob,
                                 max_predictions_per_seq, vocab_words, rng):
  """Creates the predictions for the masked LM objective.
  Inputs:
    tokens: one training example (a list of tokens)
    masked_lm_prob: the masking probability
    max_predictions_per_seq: maximum number of masked tokens per example
    vocab_words: the vocabulary
    rng: a random.Random instance
  Outputs:
    output_tokens: the masked training example
    masked_lm_positions: indices of the replaced tokens
    masked_lm_labels: the original tokens at those positions
  """

  cand_indexes = []    # candidate indices for masking
  for (i, token) in enumerate(tokens):
    if token == "[CLS]" or token == "[SEP]":
      continue
    cand_indexes.append(i)

  rng.shuffle(cand_indexes)    # shuffle randomly

  output_tokens = list(tokens)
  # number of tokens to mask/predict (controlled jointly by max_predictions_per_seq and masked_lm_prob)
  num_to_predict = min(max_predictions_per_seq,
                       max(1, int(round(len(tokens) * masked_lm_prob))))

  masked_lms = []
  covered_indexes = set()
  for index in cand_indexes:    
    if len(masked_lms) >= num_to_predict:    # only the first num_to_predict shuffled candidates are replaced
      break
    if index in covered_indexes:
      continue
    covered_indexes.add(index)

    masked_token = None
    # 80% of the time, replace with '[MASK]'
    if rng.random() < 0.8:
      masked_token = "[MASK]"
    else:
      # 10% of the time, keep the original token
      if rng.random() < 0.5:
        masked_token = tokens[index]
      # 10% of the time, replace with a random word from the vocabulary
      else:
        masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]

    output_tokens[index] = masked_token
    # MaskedLmInstance is a namedtuple with fields ('index', 'label')
    masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))

  masked_lms = sorted(masked_lms, key=lambda x: x.index)

  masked_lm_positions = []
  masked_lm_labels = []
  for p in masked_lms:
    masked_lm_positions.append(p.index)
    masked_lm_labels.append(p.label)

  return (output_tokens, masked_lm_positions, masked_lm_labels)
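
       To make the interplay between masked_lm_prob and max_predictions_per_seq concrete, a quick check of num_to_predict for a few lengths (toy Python, mirroring the formula in the code above):

# num_to_predict = min(max_predictions_per_seq, max(1, round(len(tokens) * masked_lm_prob)))
for seq_len in (16, 90, 128):
    print(seq_len, min(20, max(1, int(round(seq_len * 0.15)))))
# 16 -> 2, 90 -> 14, 128 -> 19; the cap of 20 only binds for somewhat longer sequences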

5. write_instance_to_example_files

def write_instance_to_example_files(instances, tokenizer, max_seq_length,
                                    max_predictions_per_seq, output_files):
  """Create TF example files from `TrainingInstance`s.
  Writes the instances to the output files as serialized tf.train.Examples.
  """
  writers = []
  # one TFRecordWriter per output file
  for output_file in output_files:
    writers.append(tf.python_io.TFRecordWriter(output_file))

  writer_index = 0

  total_written = 0
  for (inst_index, instance) in enumerate(instances):
    # convert the example's tokens to a list of vocabulary ids
    input_ids = tokenizer.convert_tokens_to_ids(instance.tokens)
    input_mask = [1] * len(input_ids)
    segment_ids = list(instance.segment_ids)
    assert len(input_ids) <= max_seq_length
    # pad examples shorter than max_seq_length (id 0 corresponds to '[PAD]' in the vocabulary)
    while len(input_ids) < max_seq_length:
      input_ids.append(0)
      input_mask.append(0)
      segment_ids.append(0)

    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length

    masked_lm_positions = list(instance.masked_lm_positions)    # masked positions
    # original tokens at the masked positions, converted to ids
    masked_lm_ids = tokenizer.convert_tokens_to_ids(instance.masked_lm_labels)
    # mask weights, used later when computing the loss (see below)
    masked_lm_weights = [1.0] * len(masked_lm_ids)
    # pad up to max_predictions_per_seq with 0 / 0.0
    while len(masked_lm_positions) < max_predictions_per_seq:
      masked_lm_positions.append(0)
      masked_lm_ids.append(0)
      masked_lm_weights.append(0.0)

    next_sentence_label = 1 if instance.is_random_next else 0
    # build a tf.train.Example and write it to the output files
    features = collections.OrderedDict()
    features["input_ids"] = create_int_feature(input_ids)
    features["input_mask"] = create_int_feature(input_mask)
    features["segment_ids"] = create_int_feature(segment_ids)
    features["masked_lm_positions"] = create_int_feature(masked_lm_positions)
    features["masked_lm_ids"] = create_int_feature(masked_lm_ids)
    features["masked_lm_weights"] = create_float_feature(masked_lm_weights)
    features["next_sentence_labels"] = create_int_feature([next_sentence_label])

    tf_example = tf.train.Example(features=tf.train.Features(feature=features))

    writers[writer_index].write(tf_example.SerializeToString())
    writer_index = (writer_index + 1) % len(writers)

    total_written += 1
    # log the first 20 examples
    if inst_index < 20:
      tf.logging.info("*** Example ***")
      tf.logging.info("tokens: %s" % " ".join(
          [tokenization.printable_text(x) for x in instance.tokens]))

      for feature_name in features.keys():
        feature = features[feature_name]
        values = []
        if feature.int64_list.value:
          values = feature.int64_list.value
        elif feature.float_list.value:
          values = feature.float_list.value
        tf.logging.info(
            "%s: %s" % (feature_name, " ".join([str(x) for x in values])))

  for writer in writers:
    writer.close()

  tf.logging.info("Wrote %d total instances", total_written)

       Now let's look at a generated example again:

INFO:tensorflow:*** Example ***
INFO:tensorflow:tokens: [CLS] this was nearly opposite . [SEP] at last dunes reached the quay at [MASK] opposite end of [MASK] street [MASK] and there burst on [MASK] ##am ##mon [MASK] s [MASK] eyes a vast semi [MASK] ##rcle of blue sea , ring ##ed with palaces and towers . [MASK] stopped in ##vo ##lun ##tar [MASK] ; and his little guide [MASK] also , and looked ask ##ance at the young monk , [MASK] watch the effect which that [MASK] panorama should produce on him . [SEP]
INFO:tensorflow:input_ids: 101 2023 2001 3053 4500 1012 102 2012 2197 17746 2584 1996 21048 2012 103 4500 2203 1997 103 2395 103 1998 2045 6532 2006 103 3286 8202 103 1055 103 2159 1037 6565 4100 103 21769 1997 2630 2712 1010 3614 2098 2007 22763 1998 7626 1012 103 3030 1999 6767 26896 7559 103 1025 1998 2010 2210 5009 103 2036 1010 1998 2246 3198 6651 2012 1996 2402 8284 1010 103 3422 1996 3466 2029 2008 103 23652 2323 3965 2006 2032 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_positions: 9 14 18 20 25 28 30 35 48 54 60 72 78 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_ids: 2027 1996 1996 1025 6316 1005 22741 6895 2002 6588 3030 2000 2882 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
INFO:tensorflow:next_sentence_labels: 1
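
       As a quick sanity check (a hedged sketch, not part of the BERT repo; it assumes TF 1.x and that the command above wrote /tmp/tf_examples.tfrecord), one record can be read back and inspected like this:

import tensorflow as tf  # TF 1.x

record_path = "/tmp/tf_examples.tfrecord"
# Iterate over the serialized records and parse the first one.
for serialized in tf.python_io.tf_record_iterator(record_path):
    example = tf.train.Example.FromString(serialized)
    features = example.features.feature
    print(features["input_ids"].int64_list.value[:10])        # first 10 token ids
    print(features["masked_lm_positions"].int64_list.value)   # masked positions
    print(features["next_sentence_labels"].int64_list.value)  # 1 = B was a random sentence
    break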

       With the training examples generated, the model comes next...

Part 2: Building the Model

       The main BERT model lives in modeling.py. The configuration is stored in a BertConfig instance, and the model itself is built in BertModel.

       Model configuration (BertConfig):

    vocab_size,                        # vocabulary size
    hidden_size=768,                   # hidden size (embedding dimension)
    num_hidden_layers=12,              # number of hidden layers (Transformer layers)
    num_attention_heads=12,            # number of attention heads per Transformer layer
    intermediate_size=3072,            # intermediate (feed-forward) size; the dimension is expanded and then projected back down (described below)
    hidden_act="gelu",                 # activation function
    hidden_dropout_prob=0.1,           # dropout probability for the embedding and hidden layers
    attention_probs_dropout_prob=0.1,  # dropout probability for the attention probabilities
    max_position_embeddings=512,       # position embeddings shape: [max_position_embeddings, hidden_size],
                                       # i.e. inputs of at most 512 tokens are supported
    type_vocab_size=16,                # segment embeddings shape: [type_vocab_size, hidden_size],
                                       # i.e. at most 16 segments are supported
    initializer_range=0.02             # stddev used to initialize the parameters
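
       In practice the configuration is not constructed by hand but loaded from the bert_config.json that ships with each checkpoint (exactly what run_pretraining.py does below); a minimal sketch, assuming that file is in the working directory:

import modeling  # from the google-research/bert repo

# bert_config.json is distributed together with the pre-trained checkpoint.
bert_config = modeling.BertConfig.from_json_file("bert_config.json")
print(bert_config.hidden_size, bert_config.num_hidden_layers)  # e.g. 768 12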

       BertModel::__init__

  def __init__(self,
               config,
               is_training,
               input_ids,
               input_mask=None,
               token_type_ids=None,
               use_one_hot_embeddings=True,
               scope=None):
    """Constructor for BertModel.

    Args:
      config: `BertConfig` instance.
      is_training: bool. true for training model, false for eval model. Controls
        whether dropout will be applied.
      input_ids: int32 Tensor of shape [batch_size, seq_length].
      input_mask: (optional) int32 Tensor of shape [batch_size, seq_length].
      token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      use_one_hot_embeddings: (optional) bool. Whether to use one-hot word
        embeddings or tf.embedding_lookup() for the word embeddings. On the TPU,
        it is much faster if this is True, on the CPU or GPU, it is faster if
        this is False.
      scope: (optional) variable scope. Defaults to "bert".

    Raises:
      ValueError: The config is invalid or one of the input tensor shapes
        is invalid.
    """
    config = copy.deepcopy(config)
    # disable dropout when not training (eval/prediction)
    if not is_training:
      config.hidden_dropout_prob = 0.0
      config.attention_probs_dropout_prob = 0.0
    
    # shape: [batch_size, seq_length]
    input_shape = get_shape_list(input_ids, expected_rank=2)
    batch_size = input_shape[0]
    seq_length = input_shape[1]

    if input_mask is None:
      input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)

    if token_type_ids is None:
      token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)

    with tf.variable_scope(scope, default_name="bert"):
      # embedding layer
      with tf.variable_scope("embeddings"):
        # Perform embedding lookup on the word ids.
        # returns embedding_output ([batch_size, seq_length, embedding_size]) and
        # embedding_table ([vocab_size, embedding_size], i.e. the word vectors)
        (self.embedding_output, self.embedding_table) = embedding_lookup(
            input_ids=input_ids,
            vocab_size=config.vocab_size,
            embedding_size=config.hidden_size,
            initializer_range=config.initializer_range,
            word_embedding_name="word_embeddings",
            use_one_hot_embeddings=use_one_hot_embeddings)

        # Add positional embeddings and token type embeddings, then layer
        # normalize and perform dropout.
        # 1. embedding_postprocessor adds position and segment embeddings, see below
        self.embedding_output = embedding_postprocessor(
            input_tensor=self.embedding_output,
            use_token_type=True,
            token_type_ids=token_type_ids,
            token_type_vocab_size=config.type_vocab_size,
            token_type_embedding_name="token_type_embeddings",
            use_position_embeddings=True,
            position_embedding_name="position_embeddings",
            initializer_range=config.initializer_range,
            max_position_embeddings=config.max_position_embeddings,
            dropout_prob=config.hidden_dropout_prob)

      with tf.variable_scope("encoder"):
        # convert input_mask from [batch_size, seq_length] to [batch_size, seq_length, seq_length]
        attention_mask = create_attention_mask_from_input_mask(
            input_ids, input_mask)

        # 2. the 12/24 Transformer layers are built in transformer_model, see below.
        # `sequence_output` shape = [batch_size, seq_length, hidden_size].
        self.all_encoder_layers = transformer_model(
            input_tensor=self.embedding_output,
            attention_mask=attention_mask,
            hidden_size=config.hidden_size,
            num_hidden_layers=config.num_hidden_layers,
            num_attention_heads=config.num_attention_heads,
            intermediate_size=config.intermediate_size,
            intermediate_act_fn=get_activation(config.hidden_act),
            hidden_dropout_prob=config.hidden_dropout_prob,
            attention_probs_dropout_prob=config.attention_probs_dropout_prob,
            initializer_range=config.initializer_range,
            do_return_all_layers=True)
      # output of the final Transformer layer
      self.sequence_output = self.all_encoder_layers[-1]
      # The 'pooler' below can be adapted to the needs of the downstream task.
      # Here the 'pooler' takes the first token's output and adds a dense layer on top.
      # From the paper: "The first token of every sequence is always the special
      # classification embedding ([CLS]). The final hidden state (i.e., output of
      # Transformer) corresponding to this token is used as the aggregate sequence
      # representation for classification tasks. For non-classification tasks,
      # this vector is ignored."
      # In other words, the first input token is [CLS], and its final hidden state (which
      # the paper treats as a summary of the whole input) can be used to classify the input.
      # input shape: [batch_size, hidden_size], output shape: [batch_size, hidden_size]
      with tf.variable_scope("pooler"):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token. We assume that this has been pre-trained
        # first token's hidden state, shape: [batch_size, hidden_size]
        first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
        self.pooled_output = tf.layers.dense(
            first_token_tensor,
            config.hidden_size,
            activation=tf.tanh,
            kernel_initializer=create_initializer(config.initializer_range))

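       A minimal usage sketch of the class (hedged: TF 1.x session code with placeholder ids, not taken from the repo) showing the inputs and the two commonly used outputs:

import tensorflow as tf  # TF 1.x
import modeling          # from the google-research/bert repo

config = modeling.BertConfig(vocab_size=30522)  # all other fields keep their defaults

# [batch_size=2, seq_length=5]; the ids and masks here are made up for illustration
input_ids = tf.constant([[101, 2023, 2001, 102, 0], [101, 2012, 102, 0, 0]])
input_mask = tf.constant([[1, 1, 1, 1, 0], [1, 1, 1, 0, 0]])
segment_ids = tf.zeros_like(input_ids)

model = modeling.BertModel(
    config=config, is_training=False, input_ids=input_ids,
    input_mask=input_mask, token_type_ids=segment_ids,
    use_one_hot_embeddings=False)

sequence_output = model.get_sequence_output()  # [2, 5, 768], per-token representations
pooled_output = model.get_pooled_output()      # [2, 768], the [CLS]-based summary

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(tf.shape(sequence_output)), sess.run(tf.shape(pooled_output)))
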
1. embedding_postprocessor

def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  """Performs various post-processing on a word embedding tensor. 
     Here it mainly adds position information (position embeddings) and sentence information (segment embeddings) to the word embeddings.

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length,
      embedding_size].
    use_token_type: bool. Whether to add embeddings for `token_type_ids`.
    token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      Must be specified if `use_token_type` is True.
    token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
    token_type_embedding_name: string. The name of the embedding table variable
      for token type ids.
    use_position_embeddings: bool. Whether to add position embeddings for the
      position of each token in the sequence.
    position_embedding_name: string. The name of the embedding table variable
      for positional embeddings.
    initializer_range: float. Range of the weight initialization.
    max_position_embeddings: int. Maximum sequence length that might ever be
      used with this model. This can be longer than the sequence length of
      input_tensor, but cannot be shorter.
    dropout_prob: float. Dropout probability applied to the final output tensor.

  Returns:
    float tensor with same shape as `input_tensor`.

  Raises:
    ValueError: One of the tensor shapes or input values is invalid.
  """
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]

  output = input_tensor

  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if"
                       "`use_token_type` is True.")
    # segment embedding table (initialized like the word embedding table)
    # shape: token_type_vocab_size * hidden_size (16 * 768); at most 16 segments by default
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # This vocab will be small so we always do one-hot here, since it is always
    # faster for a small vocabulary.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings

  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      # position embedding table (initialized like the word embedding table)
      # shape: max_position_embeddings * hidden_size (512 * 768); sequences of at most 512 tokens
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # Since the position embedding table is a learned variable, we create it
      # using a (long) sequence length `max_position_embeddings`. The actual
      # sequence length might be shorter than this, for faster training of
      # tasks that do not have long sequences.
      #
      # So `full_position_embeddings` is effectively an embedding table
      # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
      # sequence has positions [0, 1, 2, ... seq_length-1], so we can just
      # perform a slice.
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      # Only the last two dimensions are relevant (`seq_length` and `width`), so
      # we broadcast among the first dimensions, which is typically just
      # the batch size.
      # position embeddings are the same for every example in a batch, so broadcast over the batch dimension
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings
  # finally apply layer normalization and dropout to the output
  output = layer_norm_and_dropout(output, dropout_prob)
  return output
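
       The broadcasting trick above in a few lines of NumPy (toy shapes, purely illustrative): the sliced position embeddings are reshaped to [1, seq_length, width] so the addition broadcasts over the batch dimension:

import numpy as np

batch_size, seq_length, width = 2, 5, 8
output = np.random.randn(batch_size, seq_length, width)   # word + segment embeddings
full_position_embeddings = np.random.randn(512, width)    # learned table, max length 512

position_embeddings = full_position_embeddings[:seq_length]               # slice to the actual length
position_embeddings = position_embeddings.reshape(1, seq_length, width)   # [1, seq_length, width]
output = output + position_embeddings                                     # broadcasts over the batch
print(output.shape)  # (2, 5, 8)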

2. transformer_model

       transformer_model stacks num_hidden_layers layers; each layer contains multi-headed self-attention (with a residual connection) and a feed-forward network (also with a residual connection).

def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):
  """Multi-headed, multi-layer Transformer from "Attention is All You Need".
     Builds the Transformer encoder stack.
  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
    attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
      seq_length], with 1 for positions that can be attended to and 0 in
      positions that should not be.
    hidden_size: int. Hidden size of the Transformer.
    num_hidden_layers: int. Number of layers (blocks) in the Transformer.
    num_attention_heads: int. Number of attention heads in the Transformer.
    intermediate_size: int. The size of the "intermediate" (a.k.a., feed
      forward) layer.
    intermediate_act_fn: function. The non-linear activation function to apply
      to the output of the intermediate/feed-forward layer.
    hidden_dropout_prob: float. Dropout probability for the hidden layers.
    attention_probs_dropout_prob: float. Dropout probability of the attention
      probabilities.
    initializer_range: float. Range of the initializer (stddev of truncated
      normal).
    do_return_all_layers: Whether to also return all layers or just the final
      layer.

  Returns:
    float Tensor of shape [batch_size, seq_length, hidden_size], the final
    hidden layer of the Transformer.

  Raises:
    ValueError: A Tensor shape or parameter is invalid.
  """
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # We keep the representation as a 2D tensor to avoid re-shaping it back and
  # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
  # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
  # help the optimizer.
  # reshape input_tensor from 3D to 2D (avoids reshapes that are expensive on TPU)
  prev_output = reshape_to_matrix(input_tensor)
  # build the Transformer layer by layer; each layer has "attention", "intermediate", and "output" scopes
  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      layer_input = prev_output
      # 2.1 build multi-headed self-attention, see attention_layer
      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)

        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        # a dense layer after attention projects the attention output back to hidden_size
        # (this dense layer can be regarded as part of the attention sub-layer)
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # FFN: expand attention_output to intermediate_size, then project back down (the activation is only used here)
      # The activation is only applied to the "intermediate" hidden layer.
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Down-project back to `hidden_size` then add the residual.
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output    # output of this Transformer layer
        all_layer_outputs.append(layer_output)    # keep every layer's output so it can be retrieved later

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output

2.1 attention_layer

       A quick recap of self-attention: "self-attention" literally means the sequence attends to itself (multiplicative attention), so that every token can "see" the whole sentence, i.e. every token ends up carrying global semantic information. Multi-heading then splits hidden_size evenly across several heads so that each head learns a different semantic subspace; the outputs of all heads are concatenated to form the final output of this attention layer.

def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
  """Performs multi-headed attention from `from_tensor` to `to_tensor`.

  # from_tensor ---> Q;  to_tensor ---> K, V
  # (multiplicative, i.e. scaled dot-product, attention)
  # per-head output = softmax((Q * K^T) / sqrt(size_per_head)) * V
  # shape: [batch_size, seq_len, size_per_head]
  # then concatenate all heads and return [batch_size, seq_len, hidden_size]

  Args:
    from_tensor: float Tensor of shape [batch_size, from_seq_length,
      from_width].
    to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
    attention_mask: (optional) int32 Tensor of shape [batch_size,
      from_seq_length, to_seq_length]. The values should be 1 or 0. The
      attention scores will effectively be set to -infinity for any positions in
      the mask that are 0, and will be unchanged for positions that are 1.
    num_attention_heads: int. Number of attention heads.
    size_per_head: int. Size of each attention head.
    query_act: (optional) Activation function for the query transform.
    key_act: (optional) Activation function for the key transform.
    value_act: (optional) Activation function for the value transform.
    attention_probs_dropout_prob: (optional) float. Dropout probability of the
      attention probabilities (i.e. dropout applied to the attention matrix).
    initializer_range: float. Range of the weight initializer.
    do_return_2d_tensor: bool. If True, the output will be of shape [batch_size
      * from_seq_length, num_attention_heads * size_per_head]. If False, the
      output will be of shape [batch_size, from_seq_length, num_attention_heads
      * size_per_head].
    batch_size: (Optional) int. If the input is 2D, this might be the batch size
      of the 3D version of the `from_tensor` and `to_tensor`.
    from_seq_length: (Optional) If the input is 2D, this might be the seq length
      of the 3D version of the `from_tensor`.
    to_seq_length: (Optional) If the input is 2D, this might be the seq length
      of the 3D version of the `to_tensor`.

  Returns:
      3D: [batch_size, from_seq_length, num_attention_heads * size_per_head], or
      2D: [batch_size * from_seq_length, num_attention_heads * size_per_head]

  Raises:
    ValueError: Any of the arguments or tensor shapes are invalid.
  """

  def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
                           seq_length, width):
    output_tensor = tf.reshape(
        input_tensor, [batch_size, seq_length, num_attention_heads, width])

    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
    return output_tensor

  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

  if len(from_shape) != len(to_shape):
    raise ValueError(
        "The rank of `from_tensor` must match the rank of `to_tensor`.")

  if len(from_shape) == 3:
    batch_size = from_shape[0]
    from_seq_length = from_shape[1]
    to_seq_length = to_shape[1]
  elif len(from_shape) == 2:
    if (batch_size is None or from_seq_length is None or to_seq_length is None):
      raise ValueError(
          "When passing in rank 2 tensors to attention_layer, the values "
          "for `batch_size`, `from_seq_length`, and `to_seq_length` "
          "must all be specified.")

  # Scalar dimensions referenced here:
  #   B = batch size (number of sequences)
  #   F = `from_tensor` sequence length
  #   T = `to_tensor` sequence length
  #   N = `num_attention_heads`
  #   H = `size_per_head`

  from_tensor_2d = reshape_to_matrix(from_tensor)
  to_tensor_2d = reshape_to_matrix(to_tensor)

  # `query_layer` = [B*F, N*H]
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query",
      kernel_initializer=create_initializer(initializer_range))

  # `key_layer` = [B*T, N*H]
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key",
      kernel_initializer=create_initializer(initializer_range))

  # `value_layer` = [B*T, N*H]
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value",
      kernel_initializer=create_initializer(initializer_range))
  
  # reshape query and key to [batch_size, num_attention_heads, seq_length, size_per_head]
  # `query_layer` = [B, N, F, H]
  query_layer = transpose_for_scores(query_layer, batch_size,
                                     num_attention_heads, from_seq_length,
                                     size_per_head)
  
  # `key_layer` = [B, N, T, H]
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)

  # Take the dot product between "query" and "key" to get the raw
  # attention scores.
  # `attention_scores` = [B, N, F, T]
  # softmax((Q * K^T) / sqrt(size_per_head)) * V  (scaled dot-product attention)
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))

  if attention_mask is not None:
    # `attention_mask`: [B, F, T] ---> [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

    # positions where the mask is 1 map to 0 and positions where it is 0 map to -10000;
    # adding this to attention_scores before the softmax is an efficient way to mask
    # those positions out
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

    # Since we are adding it to the raw scores before the softmax, this is
    # effectively the same as removing these entirely.
    attention_scores += adder

  # Normalize the attention scores to probabilities.
  # `attention_probs` = [B, N, F, T]
  attention_probs = tf.nn.softmax(attention_scores)

  # This is actually dropping out entire tokens to attend to, which might
  # seem a bit unusual, but is taken from the original Transformer paper.
  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)
  
  # reshape and transpose V, then attention_probs * V ===> [B, N, F, H]
  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])

  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

  # `context_layer` = [B, N, F, H]
  context_layer = tf.matmul(attention_probs, value_layer)

  # `context_layer` = [B, F, N, H]
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

  if do_return_2d_tensor:
    # `context_layer` = [B*F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size * from_seq_length, num_attention_heads * size_per_head])
  else:
    # `context_layer` = [B, F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size, from_seq_length, num_attention_heads * size_per_head])

  return context_layer

Part 3: The Pre-training Process

       BERT pre-training involves two tasks: 1) next sentence prediction; 2) masked token prediction. The main program is in run_pretraining.py. Since training uses the high-level (TPU)Estimator API, the whole pre-training flow basically follows the structure that Estimator requires.

       The main program:

def main(_):
  tf.logging.set_verbosity(tf.logging.INFO)    # set the logging level to INFO

  if not FLAGS.do_train and not FLAGS.do_eval:
    raise ValueError("At least one of `do_train` or `do_eval` must be True.")
  # the BERT architecture hyperparameters are defined in a json file
  bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)

  tf.gfile.MakeDirs(FLAGS.output_dir)
  # list of input files
  input_files = []
  for input_pattern in FLAGS.input_file.split(","):
    input_files.extend(tf.gfile.Glob(input_pattern))

  tf.logging.info("*** Input Files ***")
  for input_file in input_files:
    tf.logging.info("  %s" % input_file)

  tpu_cluster_resolver = None
  if FLAGS.use_tpu and FLAGS.tpu_name:
    tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
        FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)
  # TPU run configuration (I have not used a TPU, so I have not dug into these parameters; this does not affect understanding the overall training flow)
  is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
  run_config = tf.contrib.tpu.RunConfig(
      cluster=tpu_cluster_resolver,
      master=FLAGS.master,
      model_dir=FLAGS.output_dir,
      save_checkpoints_steps=FLAGS.save_checkpoints_steps,
      tpu_config=tf.contrib.tpu.TPUConfig(
          iterations_per_loop=FLAGS.iterations_per_loop,
          num_shards=FLAGS.num_tpu_cores,
          per_host_input_for_training=is_per_host))
  
  # 1. the model_fn builder, one of the Estimator's inputs. See below
  model_fn = model_fn_builder(
      bert_config=bert_config,
      init_checkpoint=FLAGS.init_checkpoint,
      learning_rate=FLAGS.learning_rate,
      num_train_steps=FLAGS.num_train_steps,
      num_warmup_steps=FLAGS.num_warmup_steps,
      use_tpu=FLAGS.use_tpu,
      use_one_hot_embeddings=FLAGS.use_tpu)

  # If TPU is not available, this will fall back to normal Estimator on CPU
  # or GPU.
  # build the Estimator
  estimator = tf.contrib.tpu.TPUEstimator(
      use_tpu=FLAGS.use_tpu,
      model_fn=model_fn,
      config=run_config,
      train_batch_size=FLAGS.train_batch_size,
      eval_batch_size=FLAGS.eval_batch_size)

  if FLAGS.do_train:
    tf.logging.info("***** Running training *****")
    tf.logging.info("  Batch size = %d", FLAGS.train_batch_size)
    # builds the training input_fn
    train_input_fn = input_fn_builder(
        input_files=input_files,
        max_seq_length=FLAGS.max_seq_length,
        max_predictions_per_seq=FLAGS.max_predictions_per_seq,
        is_training=True)
    # start training
    estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps)

  if FLAGS.do_eval:
    tf.logging.info("***** Running evaluation *****")
    tf.logging.info("  Batch size = %d", FLAGS.eval_batch_size)
    # builds the eval input_fn
    eval_input_fn = input_fn_builder(
        input_files=input_files,
        max_seq_length=FLAGS.max_seq_length,
        max_predictions_per_seq=FLAGS.max_predictions_per_seq,
        is_training=False)
    # start evaluation
    result = estimator.evaluate(
        input_fn=eval_input_fn, steps=FLAGS.max_eval_steps)

    output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
    with tf.gfile.GFile(output_eval_file, "w") as writer:
      tf.logging.info("***** Eval results *****")
      for key in sorted(result.keys()):
        tf.logging.info("  %s = %s", key, str(result[key]))
        writer.write("%s = %s\n" % (key, str(result[key])))

1. model_fn_builder

def model_fn_builder(bert_config, init_checkpoint, learning_rate,
                     num_train_steps, num_warmup_steps, use_tpu,
                     use_one_hot_embeddings):
  """Returns `model_fn` closure for TPUEstimator.
  Builds model_fn in the form the Estimator expects:
                input: 'features', 'labels', 'mode' (optional),
                       'params' (optional), 'config' (optional)
                output: a tf.estimator.EstimatorSpec
  """

  def model_fn(features, labels, mode, params):  # pylint: disable=unused-argument
    """The `model_fn` for TPUEstimator."""

    tf.logging.info("*** Features ***")
    for name in sorted(features.keys()):
      tf.logging.info("  name = %s, shape = %s" % (name, features[name].shape))

    input_ids = features["input_ids"]
    input_mask = features["input_mask"]
    segment_ids = features["segment_ids"]
    masked_lm_positions = features["masked_lm_positions"]
    masked_lm_ids = features["masked_lm_ids"]
    masked_lm_weights = features["masked_lm_weights"]
    next_sentence_labels = features["next_sentence_labels"]

    is_training = (mode == tf.estimator.ModeKeys.TRAIN)
    # build the network
    model = modeling.BertModel(
        config=bert_config,
        is_training=is_training,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids,
        use_one_hot_embeddings=use_one_hot_embeddings)
    # 1. build the masked LM loss. See below
    (masked_lm_loss,
     masked_lm_example_loss, masked_lm_log_probs) = get_masked_lm_output(
         bert_config, model.get_sequence_output(), model.get_embedding_table(),
         masked_lm_positions, masked_lm_ids, masked_lm_weights)
    # 2. build the next sentence prediction loss. See below
    (next_sentence_loss, next_sentence_example_loss,
     next_sentence_log_probs) = get_next_sentence_output(
         bert_config, model.get_pooled_output(), next_sentence_labels)
    # total loss of the two training tasks
    total_loss = masked_lm_loss + next_sentence_loss
    # tvars holds all trainable variables
    tvars = tf.trainable_variables()

    initialized_variable_names = {}
    scaffold_fn = None
    if init_checkpoint:
      # initialize the variables in tvars from init_checkpoint via assignment_map ({scope/var: scope/var})
      (assignment_map, initialized_variable_names
      ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
      if use_tpu:

        def tpu_scaffold():
          tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
          return tf.train.Scaffold()

        scaffold_fn = tpu_scaffold
      else:
        tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

    tf.logging.info("**** Trainable Variables ****")
    for var in tvars:
      init_string = ""
      if var.name in initialized_variable_names:
        init_string = ", *INIT_FROM_CKPT*"
      tf.logging.info("  name = %s, shape = %s%s", var.name, var.shape,
                      init_string)

    output_spec = None
    if mode == tf.estimator.ModeKeys.TRAIN:    # training
      # create the optimizer training op
      train_op = optimization.create_optimizer(
          total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
      # return a TPUEstimatorSpec instance
      output_spec = tf.contrib.tpu.TPUEstimatorSpec(
          mode=mode,
          loss=total_loss,
          train_op=train_op,
          scaffold_fn=scaffold_fn)
    elif mode == tf.estimator.ModeKeys.EVAL:    # evaluation

      def metric_fn(masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids,
                    masked_lm_weights, next_sentence_example_loss,
                    next_sentence_log_probs, next_sentence_labels):
        """Computes the loss and accuracy of the model.
        input:
          masked_lm_example_loss: (batch_size * max_predictions_per_seq, )
          masked_lm_log_probs: (batch_size * max_predictions_per_seq, vocab_size)
          masked_lm_ids: (batch_size, max_predictions_per_seq)
          masked_lm_weights: (batch_size, max_predictions_per_seq)
          next_sentence_example_loss: (batch_size, )
          next_sentence_log_probs: (batch_size, 2)
          next_sentence_labels: (batch_size, )
        """
        masked_lm_log_probs = tf.reshape(masked_lm_log_probs,
                                         [-1, masked_lm_log_probs.shape[-1]])
        # shape: (batch_size * max_predictions_per_seq, )
        masked_lm_predictions = tf.argmax(
            masked_lm_log_probs, axis=-1, output_type=tf.int32)
        masked_lm_example_loss = tf.reshape(masked_lm_example_loss, [-1])
        masked_lm_ids = tf.reshape(masked_lm_ids, [-1])
        masked_lm_weights = tf.reshape(masked_lm_weights, [-1])
        masked_lm_accuracy = tf.metrics.accuracy(
            labels=masked_lm_ids,
            predictions=masked_lm_predictions,
            weights=masked_lm_weights)
        masked_lm_mean_loss = tf.metrics.mean(
            values=masked_lm_example_loss, weights=masked_lm_weights)

        next_sentence_log_probs = tf.reshape(
            next_sentence_log_probs, [-1, next_sentence_log_probs.shape[-1]])
        next_sentence_predictions = tf.argmax(
            next_sentence_log_probs, axis=-1, output_type=tf.int32)
        next_sentence_labels = tf.reshape(next_sentence_labels, [-1])
        next_sentence_accuracy = tf.metrics.accuracy(
            labels=next_sentence_labels, predictions=next_sentence_predictions)
        next_sentence_mean_loss = tf.metrics.mean(
            values=next_sentence_example_loss)

        return {
            "masked_lm_accuracy": masked_lm_accuracy,
            "masked_lm_loss": masked_lm_mean_loss,
            "next_sentence_accuracy": next_sentence_accuracy,
            "next_sentence_loss": next_sentence_mean_loss,
        }

      eval_metrics = (metric_fn, [
          masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids,
          masked_lm_weights, next_sentence_example_loss,
          next_sentence_log_probs, next_sentence_labels
      ])
      output_spec = tf.contrib.tpu.TPUEstimatorSpec(
          mode=mode,
          loss=total_loss,
          eval_metrics=eval_metrics,
          scaffold_fn=scaffold_fn)
    else:
      raise ValueError("Only TRAIN and EVAL modes are supported: %s" % (mode))

    return output_spec

  return model_fn
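
       Before moving on to the two loss functions, a quick note on how this closure is consumed: run_pretraining.py hands the returned model_fn to a TPUEstimator, which then drives training and evaluation. The snippet below is a minimal sketch under that assumption (the flag names, run_config and the input_fn variables are placeholders for illustration, not copied verbatim from the source):

model_fn = model_fn_builder(
    bert_config=bert_config,
    init_checkpoint=FLAGS.init_checkpoint,
    learning_rate=FLAGS.learning_rate,
    num_train_steps=FLAGS.num_train_steps,
    num_warmup_steps=FLAGS.num_warmup_steps,
    use_tpu=FLAGS.use_tpu,
    use_one_hot_embeddings=FLAGS.use_tpu)

# TPUEstimator falls back to CPU/GPU when use_tpu=False.
estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=FLAGS.use_tpu,
    model_fn=model_fn,
    config=run_config,                      # a tf.contrib.tpu.RunConfig (placeholder here)
    train_batch_size=FLAGS.train_batch_size,
    eval_batch_size=FLAGS.eval_batch_size)

estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps)
result = estimator.evaluate(input_fn=eval_input_fn, steps=FLAGS.max_eval_steps)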

1.1 Building the loss for masked word prediction (Masked LM)

       I find masked LM easiest to understand by comparison with word2vec: much as skip-gram trains word vectors, BERT first encodes its input into this function's input_tensor and then decodes it through output_weights (for the masked outputs, a fully connected layer maps hidden_size to vocab_size and treats prediction as a vocab_size-way classification; the weights of this layer are exactly the embedding table used at BERT's input, i.e. the actual word vectors, analogous to the skip-gram decoder layer). The loss function, however, differs from word2vec's negative sampling and hierarchical softmax.
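
       To make the analogy concrete, here is a toy NumPy sketch of the "decode with the tied embedding table" idea for a single masked position (all names and numbers below are made up for illustration; the actual TF graph code follows):

import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)              # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

vocab_size, hidden_size = 8, 4
rng = np.random.RandomState(0)

embedding_table = rng.randn(vocab_size, hidden_size)   # BERT's input word embeddings
hidden = rng.randn(hidden_size)                        # BERT output at one masked position
label_id = 3                                           # true token id at that position

# "Decode" by multiplying with the transposed embedding table (weight tying),
# the same idea as tf.matmul(input_tensor, output_weights, transpose_b=True) below.
logits = embedding_table @ hidden                      # (vocab_size,)
log_probs = np.log(softmax(logits))
per_example_loss = -log_probs[label_id]                # negative log-likelihood of the label
print(per_example_loss)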

def get_masked_lm_output(bert_config, input_tensor, output_weights, positions,
                         label_ids, label_weights):
  """Get loss and log probs for the masked LM.
  input:
      input_tensor: (batch_size, seq_length, hidden_size)
      output_weights: (vocab_size, hidden_size)
      positions: (batch_size, max_predictions_per_seq)
      label_ids: (batch_size, max_predictions_per_seq)
      label_weights: (batch_size, max_predictions_per_seq)
  return:
      loss: scalar
      per_example_loss: shape (batch_size * max_predictions_per_seq, )
      log_probs: shape (batch_size * max_predictions_per_seq, vocab_size)
  """
  # shape: (batch_size * max_predictions_per_seq, hidden_size)
  # Gather the output vectors at the masked positions (batch_size * max_predictions_per_seq positions per batch)
  input_tensor = gather_indexes(input_tensor, positions)

  with tf.variable_scope("cls/predictions"):
    # We apply one more non-linear transformation before the output layer.
    # This matrix is not used after pre-training.
    # During pre-training an extra non-linear transform is applied to the vectors
    # at the masked positions; this layer is not needed outside pre-training.
    # Output: (batch_size * max_predictions_per_seq, hidden_size)
    with tf.variable_scope("transform"):
      input_tensor = tf.layers.dense(
          input_tensor,
          units=bert_config.hidden_size,
          activation=modeling.get_activation(bert_config.hidden_act),
          kernel_initializer=modeling.create_initializer(
              bert_config.initializer_range))
      input_tensor = modeling.layer_norm(input_tensor)

    # The output weights are the same as the input embeddings, but there is
    # an output-only bias for each token.
    output_bias = tf.get_variable(
        "output_bias",
        shape=[bert_config.vocab_size],
        initializer=tf.zeros_initializer())
    # The weights here are the BERT model's input embedding table (weight tying)
    logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    # Compute ln(softmax); shape: (batch_size * max_predictions_per_seq, vocab_size)
    log_probs = tf.nn.log_softmax(logits, axis=-1)   
 
    # shape: batch_size * max_predictions_per_seq
    label_ids = tf.reshape(label_ids, [-1])
    # shape: batch_size * max_predictions_per_seq
    label_weights = tf.reshape(label_weights, [-1])

    # Ground-truth labels for the masked positions
    # shape: (batch_size * max_predictions_per_seq, vocab_size)
    one_hot_labels = tf.one_hot(
        label_ids, depth=bert_config.vocab_size, dtype=tf.float32)

    # The `positions` tensor might be zero-padded (if the sequence is too
    # short to have the maximum number of predictions). The `label_weights`
    # tensor has a value of 1.0 for every real prediction and 0.0 for the
    # padding predictions.
    # log_probs * one_hot_labels shape: (batch_size * max_predictions_per_seq,
    # vocab_size) ---> only one non-zero entry per row
    # shape: (batch_size * max_predictions_per_seq, )
    # Objective: maximize the probability at the labels; the minus sign turns
    # this into a minimization problem (unlike word2vec's sampled objectives)
    per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])
    # Drop the loss contributed by zero-padded (padding) predictions
    numerator = tf.reduce_sum(label_weights * per_example_loss)
    denominator = tf.reduce_sum(label_weights) + 1e-5    # number of real predictions in this batch
    loss = numerator / denominator    # average loss per prediction

  return (loss, per_example_loss, log_probs)
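
       The helper gather_indexes, called at the top of get_masked_lm_output, is not reproduced here. Conceptually it flattens the batch and sequence dimensions and then gathers the rows at the masked positions; a minimal sketch of that idea (my own reconstruction, not necessarily line-for-line identical to the original helper) looks like this:

import tensorflow as tf   # TF 1.x

def gather_indexes_sketch(sequence_tensor, positions):
  """Gathers the vectors at `positions` from a (batch_size, seq_length, width) tensor.

  Returns a tensor of shape (batch_size * max_predictions_per_seq, width).
  """
  batch_size = tf.shape(sequence_tensor)[0]
  seq_length = tf.shape(sequence_tensor)[1]
  width = sequence_tensor.shape[-1].value          # hidden size is static in BERT

  # Offset of each example's first token inside the flattened (batch*seq, width) tensor.
  flat_offsets = tf.reshape(tf.range(batch_size, dtype=tf.int32) * seq_length, [-1, 1])
  flat_positions = tf.reshape(positions + flat_offsets, [-1])
  flat_sequence_tensor = tf.reshape(sequence_tensor, [-1, width])
  return tf.gather(flat_sequence_tensor, flat_positions)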

1.2 Building the loss for Next Sentence Prediction

def get_next_sentence_output(bert_config, input_tensor, labels):
  """Get loss and log probs for the next sentence prediction.
  input:
      input_tensor: (batch_size, hidden_size)
  return:
      loss: scalar
      per_example_loss: (batch_size, )
      log_probs: (batch_size, 2)
  """
  # Binary classification: attach two output units to input_tensor, with parameters output_weights and output_bias
  # Simple binary classification. Note that 0 is "next sentence" and 1 is
  # "random sentence". This weight matrix is not used after pre-training.
  with tf.variable_scope("cls/seq_relationship"):
    output_weights = tf.get_variable(
        "output_weights",
        shape=[2, bert_config.hidden_size],
        initializer=modeling.create_initializer(bert_config.initializer_range))
    output_bias = tf.get_variable(
        "output_bias", shape=[2], initializer=tf.zeros_initializer())
    # shape: (batch_size, 2)
    logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    # log probabilities shape: (batch_size, 2)
    log_probs = tf.nn.log_softmax(logits, axis=-1)
    labels = tf.reshape(labels, [-1])
    # shape: (batch_size, 2)
    one_hot_labels = tf.one_hot(labels, depth=2, dtype=tf.float32)
    # Keep only the log-probability at the true label
    # shape: (batch_size, )
    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    loss = tf.reduce_mean(per_example_loss)
    return (loss, per_example_loss, log_probs)
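
       As a quick sanity check of the arithmetic (toy numbers, not taken from a real run): suppose one example has logits [2.0, 0.0]. The log-softmax is roughly [-0.127, -2.127], so per_example_loss ≈ 0.127 if the label is 0 ("is next sentence") and ≈ 2.127 if the label is 1 ("random sentence"); loss is simply the mean of these per-example values over the batch.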

Reference

1. BERT model

2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

3. Attention Is All You Need

4. The Illustrated Transformer 

5. 從Word Embedding到Bert模型—自然語言處理中的預訓練技術發展史

 

Note: This article may be freely reposted; please credit the author and the source when doing so.
