Reposted from: https://zhuanlan.zhihu.com/p/139898040
Author: feelingzhou, applied researcher at Tencent WXG.

1. Overview

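In November 2019, ELECTRA, developed by NLP luminary Christopher Manning together with Google, was released and quickly took the NLP community by storm. The ELECTRA-small model has only about 1/10 of the parameters of BERT-base, yet its performance is still comparable to models such as BERT and RoBERTa, thanks to ELECTRA's cleverly designed loss. Google open-sourced the code in March 2020. This article walks through that code, in the hope that readers can use it to train a good pretrained model on their own text data or behavior-sequence data and create value for their business.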

2. The ELECTRA model

2.1 Overall framework

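ELECTRA (the BASE version) is essentially a different way of training the parameters of a BERT model. BERT trains its parameters mainly through MLM (masked language modeling): the tokens to be predicted are masked out directly, at a ratio of 15%. Because each training step only learns from 15% of the tokens in a passage, the model converges slowly and needs a very large corpus. To also handle tasks such as reading comprehension, BERT adds NSP (next sentence prediction), a binary classification task that decides whether two sentences are consecutive.

ELECTRA instead borrows the generator/discriminator idea of GANs from the image domain, as shown in the figure below. Its pretraining has two parts. The generator is still an MLM with a BERT-like structure: it predicts the 15% of tokens that were masked out and fills in its predictions. If a filled-in token differs from the original token, that position is labeled as replaced; every other token in the sentence is labeled as not replaced. The discriminator is trained to detect, at every position, whether the token was replaced, so the prediction task becomes binary classification. This improves efficiency: because predictions are made at all positions, convergence is much faster. The total loss is the generator loss plus the discriminator loss multiplied by a weight (50 in the official code).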
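To make the labeling scheme concrete, here is a minimal pure-Python sketch. It is not code from the ELECTRA repository: the sentence, the masked positions, and the generator's guesses are all made up for illustration, and the helper name `rtd_labels` is hypothetical.

```python
def rtd_labels(original, corrupted):
    """Return 1 where the token was replaced, 0 where it equals the original."""
    return [int(o != c) for o, c in zip(original, corrupted)]


original = ["the", "chef", "cooked", "the", "meal"]
masked_positions = [1, 2]                  # assumed: positions picked for masking
generator_guess = {1: "chef", 2: "ate"}    # assumed: what the generator filled in

# Write the generator's guesses back into the masked positions.
corrupted = list(original)
for pos in masked_positions:
    corrupted[pos] = generator_guess[pos]

# Position 1 was masked, but the generator guessed the original word,
# so it is labeled "not replaced".
labels = rtd_labels(original, corrupted)

# Fraction of positions that contribute to each objective.
mlm_signal = len(masked_positions) / len(original)   # only the masked slots
rtd_signal = len(labels) / len(original)             # every position

print(corrupted, labels, mlm_signal, rtd_signal)
```

The discriminator receives `corrupted` as input and `labels` as targets, which is why every position contributes to its loss while the MLM objective only sees the masked slots.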

2.2 Code structure

In March 2020 Google open-sourced the ELECTRA model code (see the code link). Its main code structure is as follows:

Each part is described below:

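finetune: This folder contains example code for fine-tuning a trained ELECTRA model on tasks such as text classification, NER, and reading comprehension. The tasks are the same as for BERT, so they are not covered in detail here.

model/modeling.py: Implements the BERT model and the code that reads its configuration. It is called by both the generator and the discriminator during ELECTRA pretraining, and by the various fine-tuning tasks.

model/optimization.py: Implements the optimizer, mainly AdamWeightDecay. You can add other optimizers such as LAMB yourself.

model/tokenization.py: Implements the WordPiece tokenizer, which handles both English and Chinese. It is used when converting text to tfrecord files.

pretrain/pretrain_data.py: Implements the logic for reading tfrecord files during pretraining and for updating the collections.namedtuple that holds the inputs.

pretrain/pretrain_helpers.py: Implements the core helpers of the pretraining stage, mainly dynamic random masking of a sequence and unmasking of an already masked sequence.

util/training_utils.py: Implements a Hook used during training to print more log information.

util/utils.py: Basic utilities, such as creating and deleting directories, serialization and deserialization, and reading configuration files.

build_openwebtext_pretraining_dataset.py, build_pretraining_dataset.py: These two files do much the same thing on different data sources: they convert text files into tfrecord files whose keys are input_ids, input_mask, and segment_ids. Unlike the files needed for BERT pretraining, these tfrecord files do not contain masked_lm_positions, masked_lm_ids, or masked_lm_weights. Those keys are generated automatically during pretraining, so the mask is random and dynamic, similar to RoBERTa, rather than fixed as in BERT. The scripts parallelize the conversion to speed it up; for very large files, converting the text with Spark is still more suitable.

configure_finetuning.py: Hyperparameter configuration for the fine-tuning stage. The code Google released this time does not use tf.flags for its parameters.

configure_pretraining.py: Hyperparameter configuration for the pretraining stage. The code Google released this time does not use tf.flags for its parameters.

run_finetuning.py: The fine-tuning logic; it loads a trained ELECTRA model and fine-tunes it.

run_pretraining.py: The pretraining logic, which is the core of the ELECTRA model and is described in detail below.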
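As a rough picture of what one record produced by the dataset builders holds, the sketch below builds the three keys for a single sequence in plain Python. It is a simplification rather than the repository's code: it handles a single segment only, `build_example` is a hypothetical helper, and the ids 101, 102, and 0 for [CLS], [SEP], and padding are assumptions borrowed from the usual BERT vocab.

```python
def build_example(token_ids, max_seq_length, cls_id=101, sep_id=102, pad_id=0):
    """Wrap token ids in [CLS]/[SEP] and pad to max_seq_length.

    cls_id, sep_id and pad_id are assumed values, not read from a vocab file.
    """
    ids = [cls_id] + token_ids[:max_seq_length - 2] + [sep_id]
    n_pad = max_seq_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "input_mask": [1] * len(ids) + [0] * n_pad,   # 1 = real token, 0 = padding
        "segment_ids": [0] * max_seq_length,          # single segment in this sketch
    }


example = build_example([7, 8, 9], max_seq_length=8)
print(example)
```

Note that nothing about masking is stored: there is no masked_lm_positions, masked_lm_ids, or masked_lm_weights key, which matches the description above.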

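2.3 The pretraining stage

The pretraining stage is the core logic of ELECTRA. The code lives in run_pretraining.py and is described in detail below. The overall flow is summarized in the following diagram:

2.3.1 Main entry point

The main method in run_pretraining.py takes three arguments, of which --data-dir and --model-name are required and --hparams is optional: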

def main():
  parser = argparse.ArgumentParser(description=__doc__)
  parser.add_argument("--data-dir", required=True,
                      help="Location of data files (model weights, etc).")
  parser.add_argument("--model-name", required=True,
                      help="The name of the model being fine-tuned.")
  parser.add_argument("--hparams", default="{}",
                      help="JSON dict of model hyperparameters.")
  args = parser.parse_args()
  if args.hparams.endswith(".json"):
    hparams = utils.load_json(args.hparams)
  else:
    hparams = json.loads(args.hparams)
  tf.logging.set_verbosity(tf.logging.ERROR)
  train_or_eval(configure_pretraining.PretrainingConfig(
      args.model_name, args.data_dir, **hparams))

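The three arguments are as follows.

--data-dir: the location of the tfrecord files, which are usually named like pretrain_data.tfrecord-0-of-*. They are generated by build_pretraining_dataset.py, which writes 1000 tfrecord files by default; you can change that number. Note that tokenization requires a vocab.txt. For a Chinese model, point it at the vocab.txt of the Chinese BERT model; the same applies to English models.

--model-name: the name of the pretrained model, usually electra; you can choose your own.

--hparams: usually a JSON file through which you pass your own parameters, for example whether the model to train is small, base, or large, and vocab_size (usually 21128 for Chinese and 30522 for English). It also carries parameters such as whether the run is in test mode. The config.json I typically pass as hparams when training a Chinese model is: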

{
  "model_size": "base",
  "vocab_size": 21128
}

See configure_pretraining.py for the full list of parameters. The values you pass in override the default hyperparameters defined there.
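The override behaviour can be pictured with a small stand-in class. This is only a sketch: `TinyPretrainingConfig` is hypothetical, the default values are illustrative, and the path `/tmp/data` is a placeholder. The real class is configure_pretraining.PretrainingConfig.

```python
import json


class TinyPretrainingConfig:
    """Hypothetical stand-in for configure_pretraining.PretrainingConfig."""

    def __init__(self, model_name, data_dir, **kwargs):
        self.model_name = model_name
        self.data_dir = data_dir
        # Defaults (illustrative values only).
        self.model_size = "small"
        self.vocab_size = 30522
        self.mask_prob = 0.15
        # Anything passed through --hparams overrides the defaults above.
        self.update(kwargs)

    def update(self, kwargs):
        for key, value in kwargs.items():
            if not hasattr(self, key):
                raise ValueError("Unknown hparam " + key)
            setattr(self, key, value)


# Same content as the config.json shown above, passed inline as a JSON string.
hparams = json.loads('{"model_size": "base", "vocab_size": 21128}')
config = TinyPretrainingConfig("electra", "/tmp/data", **hparams)
print(config.model_size, config.vocab_size, config.mask_prob)
```

Rejecting unknown keys, as this sketch does, is a cheap guard against typos in the JSON file silently leaving a default in place.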

The entry point that trains the model:

train_or_eval(configure_pretraining.PretrainingConfig(
      args.model_name, args.data_dir, **hparams))

There is another entry point that calls sess.run() only once, for testing:

train_one_step(configure_pretraining.PretrainingConfig(
       args.model_name, args.data_dir, **hparams))

2.3.2 Data masking

The training logic is mainly defined in the PretrainingModel class. Inside PretrainingModel, the program first applies random masking to the inputs read from the tfrecord files:

# Mask the input
    masked_inputs = pretrain_helpers.mask(
        config, pretrain_data.features_to_inputs(features), config.mask_prob)

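This produces inputs that carry the keys masked_lm_positions, masked_lm_ids, and masked_lm_weights; they are built on the fly rather than stored in the tfrecord files. The random masking is implemented in pretrain_helpers.mask(), which uses tf.random.categorical, a function that samples from a categorical (multinomial) distribution, to randomly draw masked_lm_positions and masked_lm_weights. It then calls tf.gather_nd with masked_lm_positions as indices to obtain masked_lm_ids.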
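The idea of dynamic masking can be sketched without TensorFlow. The function below is a simplified illustration, not the logic of pretrain_helpers.mask(): it picks positions uniformly with Python's `random` module, always writes the mask token, and assumes 0, 101, 102, and 103 are the ids of [PAD], [CLS], [SEP], and [MASK]. The name `dynamic_mask` is hypothetical.

```python
import random


def dynamic_mask(input_ids, mask_prob, mask_id, max_predictions, rng,
                 special_ids=(0, 101, 102)):
    """Mask a random subset of the non-special tokens of one sequence."""
    candidates = [i for i, t in enumerate(input_ids) if t not in special_ids]
    n_mask = min(max_predictions, max(1, round(len(candidates) * mask_prob)))
    positions = sorted(rng.sample(candidates, n_mask))

    masked_ids = list(input_ids)
    for p in positions:
        masked_ids[p] = mask_id

    pad = max_predictions - n_mask   # pad the prediction slots to a fixed length
    return {
        "input_ids": masked_ids,
        "masked_lm_positions": positions + [0] * pad,
        "masked_lm_ids": [input_ids[p] for p in positions] + [0] * pad,
        "masked_lm_weights": [1.0] * n_mask + [0.0] * pad,
    }


ids = [101, 11, 12, 13, 14, 15, 16, 17, 18, 102, 0, 0]
out = dynamic_mask(ids, mask_prob=0.15, mask_id=103, max_predictions=3,
                   rng=random.Random(0))
print(out)
```

Because the positions are drawn each time the function is called, the same stored sequence can be masked differently on every pass over the data, which is the RoBERTa-style behaviour described earlier.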

2.3.3 Generator BERT

Once the data is ready, the next step is to build the generator BERT model. It is called as follows:

generator = self._build_transformer(
    masked_inputs, is_training,
    bert_config=get_generator_config(config, self._bert_config),
    embedding_size=(None if config.untied_generator_embeddings
                    else embedding_size),
    untied_embeddings=config.untied_generator_embeddings,
)

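This step builds the generator-stage BERT model and produces both the MLM loss and the fake data; the fake data is the crucial part. The MLM loss is computed by the code below, which is almost identical to BERT's logic: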

def _get_masked_lm_output(self, inputs: pretrain_data.Inputs, model):
    """Masked language modeling softmax layer."""
    masked_lm_weights = inputs.masked_lm_weights
    with tf.variable_scope("generator_predictions"):
      if self._config.uniform_generator:
        logits = tf.zeros(self._bert_config.vocab_size)
        logits_tiled = tf.zeros(
            modeling.get_shape_list(inputs.masked_lm_ids) +
            [self._bert_config.vocab_size])
        logits_tiled += tf.reshape(logits, [1, 1, self._bert_config.vocab_size])
        logits = logits_tiled
      else:
        relevant_hidden = pretrain_helpers.gather_positions(
            model.get_sequence_output(), inputs.masked_lm_positions)
        hidden = tf.layers.dense(
            relevant_hidden,
            units=modeling.get_shape_list(model.get_embedding_table())[-1],
            activation=modeling.get_activation(self._bert_config.hidden_act),
            kernel_initializer=modeling.create_initializer(
                self._bert_config.initializer_range))
        hidden = modeling.layer_norm(hidden)
        output_bias = tf.get_variable(
            "output_bias",
            shape=[self._bert_config.vocab_size],
            initializer=tf.zeros_initializer())
        logits = tf.matmul(hidden, model.get_embedding_table(),
                           transpose_b=True)
        logits = tf.nn.bias_add(logits, output_bias)

      oh_labels = tf.one_hot(
          inputs.masked_lm_ids, depth=self._bert_config.vocab_size,
          dtype=tf.float32)

      probs = tf.nn.softmax(logits)
      log_probs = tf.nn.log_softmax(logits)
      label_log_probs = -tf.reduce_sum(log_probs * oh_labels, axis=-1)

      numerator = tf.reduce_sum(inputs.masked_lm_weights * label_log_probs)
      denominator = tf.reduce_sum(masked_lm_weights) + 1e-6
      loss = numerator / denominator
      preds = tf.argmax(log_probs, axis=-1, output_type=tf.int32)

      MLMOutput = collections.namedtuple(
          "MLMOutput", ["logits", "probs", "loss", "per_example_loss", "preds"])
      return MLMOutput(
          logits=logits, probs=probs, per_example_loss=label_log_probs,
          loss=loss, preds=preds)
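Stripped of the TensorFlow plumbing, the loss in the function above is a weighted average of negative log-likelihoods over the prediction slots. The sketch below reproduces that arithmetic in pure Python on made-up logits; the helper names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def masked_lm_loss(logits_per_slot, label_ids, weights):
    """Weighted mean of -log p(label), mirroring numerator / denominator above."""
    nll = [-log_softmax(logits)[label]
           for logits, label in zip(logits_per_slot, label_ids)]
    numerator = sum(w * l for w, l in zip(weights, nll))
    denominator = sum(weights) + 1e-6
    return numerator / denominator


# Three prediction slots over a toy vocab of size 3; the last slot is padding.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 1.0, 1.0]]
labels = [0, 1, 0]
weights = [1.0, 1.0, 0.0]
loss = masked_lm_loss(logits, labels, weights)
print(loss)
```

The zero weight on the padded slot is what keeps it out of both the numerator and the denominator, so padding never affects the loss.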

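The fake data is generated by the code below. It calls unmask, which does the opposite of the mask function described above: it restores the randomly masked input_ids to their original values. Tokens are then sampled from the logits produced by the generator (sample_from_softmax, followed by an argmax over the sampled one-hot vectors) and written into the masked positions to produce updated_input_ids. Comparing updated_input_ids with the restored input_ids at the masked positions gives the ground-truth labels: 1 where the token was replaced and 0 elsewhere.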

def _get_fake_data(self, inputs, mlm_logits):
    """Sample from the generator to create corrupted input."""
    inputs = pretrain_helpers.unmask(inputs)
    disallow = tf.one_hot(
        inputs.masked_lm_ids, depth=self._bert_config.vocab_size,
        dtype=tf.float32) if self._config.disallow_correct else None
    sampled_tokens = tf.stop_gradient(pretrain_helpers.sample_from_softmax(
        mlm_logits / self._config.temperature, disallow=disallow))
    sampled_tokids = tf.argmax(sampled_tokens, -1, output_type=tf.int32)
    updated_input_ids, masked = pretrain_helpers.scatter_update(
        inputs.input_ids, sampled_tokids, inputs.masked_lm_positions)
    labels = masked * (1 - tf.cast(
        tf.equal(updated_input_ids, inputs.input_ids), tf.int32))
    updated_inputs = pretrain_data.get_updated_inputs(
        inputs, input_ids=updated_input_ids)
    FakedData = collections.namedtuple("FakedData", [
        "inputs", "is_fake_tokens", "sampled_tokens"])
    return FakedData(inputs=updated_inputs, is_fake_tokens=labels,
                     sampled_tokens=sampled_tokens)
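The same flow can be traced on a toy example in pure Python. This is a sketch under stated assumptions rather than the repository's implementation: it draws directly from the softmax with `random.Random.choices` instead of going through pretrain_helpers.sample_from_softmax, and the vocab size, token ids, and helper names are invented for illustration.

```python
import math
import random


def sample_from_logits(logits, temperature, rng):
    """Draw one token id from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]   # unnormalized probabilities
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def make_fake_data(original_ids, masked_positions, mlm_logits, rng,
                   temperature=1.0):
    """Fill masked positions with sampled tokens and label the replacements."""
    updated = list(original_ids)
    for pos, logits in zip(masked_positions, mlm_logits):
        updated[pos] = sample_from_logits(logits, temperature, rng)
    is_masked = [int(i in masked_positions) for i in range(len(original_ids))]
    labels = [m * int(u != o)
              for m, u, o in zip(is_masked, updated, original_ids)]
    return updated, labels


def peaked(token, vocab_size=10):
    """Logits that put (numerically) all probability on one token."""
    return [0.0 if t == token else -1000.0 for t in range(vocab_size)]


original_ids = [5, 6, 7, 8]
masked_positions = [1, 3]
# Assumed generator output: it recovers token 6 at position 1
# but proposes token 2 instead of 8 at position 3.
mlm_logits = [peaked(6), peaked(2)]
updated, labels = make_fake_data(original_ids, masked_positions, mlm_logits,
                                 random.Random(0))
print(updated, labels)
```

Position 1 was masked, yet it ends up with label 0 because the sampled token equals the original. This is the same rule as `labels = masked * (1 - equal)` in the function above.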

2.3.4 Discriminator BERT

The fake data generated in the previous step is the input to the discriminator BERT:

if config.electra_objective:
      discriminator = self._build_transformer(
          fake_data.inputs, is_training, reuse=not config.untied_generator,
          embedding_size=embedding_size)
      disc_output = self._get_discriminator_output(
          fake_data.inputs, discriminator, fake_data.is_fake_tokens)

The binary classification loss is computed as follows:

def _get_discriminator_output(self, inputs, discriminator, labels):
    """Discriminator binary classifier."""
    with tf.variable_scope("discriminator_predictions"):
      hidden = tf.layers.dense(
          discriminator.get_sequence_output(),
          units=self._bert_config.hidden_size,
          activation=modeling.get_activation(self._bert_config.hidden_act),
          kernel_initializer=modeling.create_initializer(
              self._bert_config.initializer_range))
      logits = tf.squeeze(tf.layers.dense(hidden, units=1), -1)
      weights = tf.cast(inputs.input_mask, tf.float32)
      labelsf = tf.cast(labels, tf.float32)
      losses = tf.nn.sigmoid_cross_entropy_with_logits(
          logits=logits, labels=labelsf) * weights
      per_example_loss = (tf.reduce_sum(losses, axis=-1) /
                          (1e-6 + tf.reduce_sum(weights, axis=-1)))
      loss = tf.reduce_sum(losses) / (1e-6 + tf.reduce_sum(weights))
      probs = tf.nn.sigmoid(logits)
      preds = tf.cast(tf.round((tf.sign(logits) + 1) / 2), tf.int32)
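      DiscOutput = collections.namedtuple(
          "DiscOutput", ["loss", "per_example_loss", "probs", "preds",
                         "labels"])
      return DiscOutput(
          loss=loss, per_example_loss=per_example_loss, probs=probs,
          preds=preds, labels=labels)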

2.3.5 Total loss

The previous step yields disc_output.loss, the sigmoid loss. The total loss is then:

self.total_loss = config.gen_weight * mlm_output.loss
self.total_loss += config.disc_weight * disc_output.loss

Here config.gen_weight = 1 and config.disc_weight = 50. The sigmoid loss is weighted by 50; the authors have not given a clear explanation for this value.
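To see how the two pieces combine numerically, the sketch below recomputes the discriminator's masked sigmoid cross-entropy and the weighted sum in pure Python. The logits, labels, and the MLM loss value of 2.0 are made-up numbers, and the function names are hypothetical; only the weights 1 and 50 come from the text above.

```python
import math


def sigmoid_cross_entropy(logit, label):
    """Stable form: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def discriminator_loss(logits, labels, input_mask):
    """Average the per-token loss over real (non-padding) tokens only."""
    losses = [sigmoid_cross_entropy(x, z) * w
              for x, z, w in zip(logits, labels, input_mask)]
    return sum(losses) / (1e-6 + sum(input_mask))


def total_loss(mlm_loss, disc_loss, gen_weight=1.0, disc_weight=50.0):
    return gen_weight * mlm_loss + disc_weight * disc_loss


logits = [0.0, 0.0, 3.0]      # one logit per token position
labels = [0, 1, 1]            # 1 = replaced, 0 = original
input_mask = [1, 1, 0]        # the last position is padding
disc = discriminator_loss(logits, labels, input_mask)
total = total_loss(mlm_loss=2.0, disc_loss=disc)
print(disc, total)
```

A per-token binary loss is typically much smaller in magnitude than a cross-entropy over the whole vocabulary, so a large weight on the discriminator term keeps it from being dwarfed by the MLM term. That is one plausible reading of the value 50, not something the code itself states.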

2.3.6 Model optimization and checkpoints

With the total loss computed, the next step is to optimize the model and write checkpoints. The entry point is in train_or_eval():

is_per_host = tf.estimator.tpu.InputPipelineConfig.PER_HOST_V2
  tpu_cluster_resolver = None
  if config.use_tpu and config.tpu_name:
    tpu_cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
        config.tpu_name, zone=config.tpu_zone, project=config.gcp_project)
  tpu_config = tf.estimator.tpu.TPUConfig(
      iterations_per_loop=config.iterations_per_loop,
      num_shards=(config.num_tpu_cores if config.do_train else
                  config.num_tpu_cores),
      tpu_job_name=config.tpu_job_name,
      per_host_input_for_training=is_per_host)
  run_config = tf.estimator.tpu.RunConfig(
      cluster=tpu_cluster_resolver,
      model_dir=config.model_dir,
      save_checkpoints_steps=config.save_checkpoints_steps,
      tpu_config=tpu_config)
  model_fn = model_fn_builder(config=config)
  estimator = tf.estimator.tpu.TPUEstimator(
      use_tpu=config.use_tpu,
      model_fn=model_fn,
      config=run_config,
      train_batch_size=config.train_batch_size,
      eval_batch_size=config.eval_batch_size)

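As you can see, Google's open-source code is mostly wired up with TPU hooks. Switching to GPU is fairly simple, since BERT already contains the GPU-related hooks; this is covered below.

2.4 The fine-tuning stage

ELECTRA ships quite a few fine-tuning examples in the finetune folder. They are fairly simple and similar to BERT, so they are not described further here. The only thing to change is to switch the TPU-related settings to GPU or CPU.

2.5 Improvements for sequence training

The code above has two main problems. First, the TPU setup: not everyone has deep pockets, so training needs to be adapted to GPUs. Second, if you want to train a sequence model with a very large vocab, the model above cannot cope, because the MLM loss computes a full softmax over the whole vocabulary; the loss is therefore changed to a negative-sampling form.

2.5.1 Switching from TPU to GPU training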

# Inside model_fn: build the training op and return a plain EstimatorSpec.
# grads, tvars, probabilities and total_loss are computed earlier in model_fn.
global_step = tf.train.get_or_create_global_step()
optimizer = optimization.AdamWeightDecayOptimizer(learning_rate=learning_rate)
train_op = optimizer.apply_gradients(zip(grads, tvars), global_step)

# AdamWeightDecayOptimizer does not advance global_step, so do it explicitly.
update_global_step = tf.assign(global_step, global_step + 1, name='update_global_step')
output_spec = tf.estimator.EstimatorSpec(
    mode=mode,
    predictions=probabilities,
    loss=total_loss,
    train_op=tf.group(train_op, update_global_step))

# In the main function: use tf.estimator.Estimator instead of TPUEstimator.
# FLAGS, input_fn and N (the number of epochs) are defined elsewhere.
run_config = tf.estimator.RunConfig(model_dir=FLAGS.modelpath, save_checkpoints_steps=500)

bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)

model_fn = model_fn_builder(
    bert_config=bert_config,
    num_labels=FLAGS.n_class,
    is_training=True,
    init_checkpoint=FLAGS.init_checkpoint,
    learning_rate=FLAGS.learning_rate,
    use_one_hot_embeddings=False)

estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)

# Hold out one randomly chosen tfrecord file for evaluation.
total_files = glob.glob("/tf*")
random.shuffle(total_files)
eval_files = total_files.pop()
input_fn_train = lambda: input_fn(total_files, FLAGS.batch_size, num_epochs=N)
input_fn_eval = lambda: input_fn(eval_files, FLAGS.batch_size, is_training=False)

train_spec = tf.estimator.TrainSpec(input_fn=input_fn_train, max_steps=10000)
eval_spec = tf.estimator.EvalSpec(input_fn=input_fn_eval, steps=None, start_delay_secs=30, throttle_secs=30)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

2.5.2 Negative-sampling modification

The change is mainly to the MLM loss:

def get_masked_lm_output(bert_config, input_tensor, output_weights, positions,
                         label_ids, label_weights):

  with tf.variable_scope("cls/predictions"):
    with tf.variable_scope("transform"):
      input_tensor = tf.layers.dense(
          input_tensor,
          units=bert_config.hidden_size,
          activation=modeling.get_activation(bert_config.hidden_act),
          kernel_initializer=modeling.create_initializer(
              bert_config.initializer_range))
      input_tensor = modeling.layer_norm(input_tensor) #batch*10*embeding_size

    # The output weights are the same as the input embeddings, but there is
    # an output-only bias for each token.
    output_bias = tf.get_variable(
        "output_bias",
        shape=[bert_config.vocab_size],
        initializer=tf.zeros_initializer())
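    # sampled_softmax_loss expects labels of shape [num_examples, 1] and
    # inputs of shape [num_examples, dim].
    label_ids = tf.reshape(label_ids, [-1,1])
    label_weights = tf.reshape(label_weights, [-1])
    # N is the number of negative classes to sample; choose it yourself.
    per_example_loss = tf.nn.sampled_softmax_loss(
        weights=output_weights,
        biases=output_bias,
        labels=label_ids,
        inputs=input_tensor,
        num_sampled=N,
        num_classes=bert_config.vocab_size)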

    numerator = tf.reduce_sum(label_weights*per_example_loss )
    denominator = tf.reduce_sum(label_weights) + 1e-5
    loss = numerator / denominator
  return (loss, per_example_loss)
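The intuition behind the change can be shown with a pure-Python toy. This is a deliberately simplified sketch and not what tf.nn.sampled_softmax_loss does internally: negatives are drawn uniformly without replacement and no sampling-probability correction is applied, whereas the TensorFlow function by default uses a log-uniform sampler and corrects the logits for it. The function names and the toy vocab size of 1000 are assumptions.

```python
import math
import random


def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def full_softmax_loss(logits, label):
    """Cross-entropy with the normalizer taken over the whole vocab."""
    return logsumexp(logits) - logits[label]


def sampled_softmax_loss(logits, label, num_sampled, rng):
    """Cross-entropy with the normalizer taken over the true class plus
    num_sampled randomly chosen negative classes."""
    negatives = [i for i in range(len(logits)) if i != label]
    sampled = rng.sample(negatives, num_sampled)
    kept = [logits[label]] + [logits[i] for i in sampled]
    return logsumexp(kept) - logits[label]


rng = random.Random(0)
vocab_logits = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # toy vocab of 1000
full = full_softmax_loss(vocab_logits, label=3)
sampled = sampled_softmax_loss(vocab_logits, label=3, num_sampled=20, rng=rng)
print(full, sampled)
```

For clarity the sketch starts from precomputed logits for the whole vocab. In a real model the saving comes from only computing the logits of the true class and the sampled negatives, which is what makes very large vocabs trainable.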

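3. Summary

I have been quite busy recently, and a number of new ideas came out that I had not had time to look at. Last weekend I spent a day reading the ELECTRA source code and writing up these notes. I have also seen several teams release Chinese ELECTRA pretrained models. Although ELECTRA does not reach the state of the art, its gap to RoBERTa is negligible, and this training approach is a great idea: its convergence speed is unmatched by other BERT-based variants, and it has significant research value for sequence modeling. Feel free to get in touch to discuss.

4. References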

[1]https://openreview.net/forum?id=r1xMH1BtvB
[2]https://zhuanlan.zhihu.com/p/104956125

A code walkthrough of ELECTRA, the model that surpasses BERT