


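If you train with PyTorch, you can combine it with Apex to speed up training and reduce GPU memory usage.

Essential PyTorch tool | Speed above all: mixed-precision acceleration with Apex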



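Here we use BertForSequenceClassification for a sentiment classification task, again with the binary sentiment classification dataset compiled by 蘇劍林 (Su Jianlin).

This post can be read alongside the earlier one. Since I have not yet studied the official examples closely, the example here trains the model in the more traditional PyTorch style.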



Personal cloud drive link: https://pan.baidu.com/s/1OAhNbRYpU1HW25_vChdRng (extraction code: uxax)



# Hyperparameters
hidden_dropout_prob = 0.3
num_labels = 2
learning_rate = 1e-5
weight_decay = 1e-2
epochs = 2
batch_size = 16

Subclass PyTorch's Dataset to write a class that represents the dataset. Here each sample and its label are returned as a dictionary.

from torch.utils.data import Dataset
import pandas as pd

class SentimentDataset(Dataset):
    def __init__(self, path_to_file):
        self.dataset = pd.read_csv(path_to_file, sep="\t", names=["text", "label"])
    def __len__(self):
        return len(self.dataset)
    def __getitem__(self, idx):
        text = self.dataset.loc[idx, "text"]
        label = self.dataset.loc[idx, "label"]
        sample = {"text": text, "label": label}
        return sample


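Transformers already implements a model for classification, so we do not write our own here and simply use BertForSequenceClassification to load the pretrained model.

Custom configuration options can be passed to BertForSequenceClassification through BertConfig.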

import torch
from transformers import BertConfig, BertForSequenceClassification
# Use the GPU when one is available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
config = BertConfig.from_pretrained("bert-base-uncased", num_labels=num_labels, hidden_dropout_prob=hidden_dropout_prob)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", config=config)
# Move the model to the selected device
model.to(device)


config = BertConfig.from_pretrained("bert-base-uncased", author="DogeCheng")


Use DataLoader to obtain an iterator that yields batch_size samples at a time.

from torch.utils.data import DataLoader
data_path = "/data/sentiment/"
# Load the datasets
sentiment_train_set = SentimentDataset(data_path + "sentiment.train.data")
sentiment_train_loader = DataLoader(sentiment_train_set, batch_size=batch_size, shuffle=True, num_workers=2)

# Validation split (assumes the file is named sentiment.valid.data)
sentiment_valid_set = SentimentDataset(data_path + "sentiment.valid.data")
sentiment_valid_loader = DataLoader(sentiment_valid_set, batch_size=batch_size, shuffle=False, num_workers=2)
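
Because each sample is a dictionary, a batch drawn from the loader is also a dictionary: batch["text"] holds the texts and batch["label"] the matching labels. The pure-Python sketch below mimics that grouping under simplifying assumptions. `collate_dicts` is a hypothetical helper, not PyTorch's own collate function, which additionally converts numeric labels into a tensor, and the samples are made up.

```python
def collate_dicts(samples):
    """Group a list of {"text", "label"} samples into one batch dictionary."""
    batch = {}
    for sample in samples:
        for key, value in sample.items():
            batch.setdefault(key, []).append(value)
    return batch


# Made-up samples in the same shape that SentimentDataset.__getitem__ returns.
samples = [
    {"text": "great phone", "label": 1},
    {"text": "battery died in a day", "label": 0},
]
batch = collate_dicts(samples)
print(batch["text"])
print(batch["label"])
```

This is why the training code later receives a list of strings in batch["text"] and has to tokenize and pad it itself.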


Next come the functions that tokenize and pad the text.

from transformers import BertTokenizer

vocab_file = "PyTorch_Pretrained_Model/chinese_wwm_pytorch/vocab.txt"
tokenizer = BertTokenizer(vocab_file)

def convert_text_to_ids(tokenizer, text, max_len=100):
    # A single string yields one id list; a list of strings yields a list of id lists
    if isinstance(text, str):
        tokenized_text = tokenizer.encode_plus(text, max_length=max_len, add_special_tokens=True)
        input_ids = tokenized_text["input_ids"]
        token_type_ids = tokenized_text["token_type_ids"]
    elif isinstance(text, list):
        input_ids = []
        token_type_ids = []
        for t in text:
            tokenized_text = tokenizer.encode_plus(t, max_length=max_len, add_special_tokens=True)
            input_ids.append(tokenized_text["input_ids"])
            token_type_ids.append(tokenized_text["token_type_ids"])
    else:
        raise TypeError("Unexpected input")
    return input_ids, token_type_ids

def seq_padding(tokenizer, X):
    pad_id = tokenizer.convert_tokens_to_ids("[PAD]")
    if len(X) <= 1:
        return torch.tensor(X)
    L = [len(x) for x in X]
    ML = max(L)
    X = torch.Tensor([x + [pad_id] * (ML - len(x)) if len(x) < ML else x for x in X])
    return X
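
The padding step can be illustrated without PyTorch. The sketch below pads ragged id lists to the longest sequence in the batch and also derives a 0/1 attention mask, which the tutorial code does not build. `pad_batch`, the pad id 0 and the sample ids are assumptions made for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every id list to the longest length and build a 0/1 mask."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


ids = [[101, 2769, 102], [101, 2769, 4263, 872, 102]]  # made-up token ids
padded, mask = pad_batch(ids)
print(padded)
print(mask)
```

A mask like this could be passed to the model as attention_mask so that padded positions are ignored; the tutorial omits it, so treat that as an optional refinement rather than part of the original code.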


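In fact, the source code shows that BertForSequenceClassification already includes a loss function, so we need not implement one. To show a more general pattern, this example defines its own loss function anyway.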

import torch
import torch.nn as nn
from transformers import AdamW
# Define the optimizer and the loss function
# Prepare optimizer and schedule (linear warmup and decay)
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
        {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': weight_decay},
        {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
#optimizer = AdamW(model.parameters(), lr=learning_rate)
optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate)
criterion = nn.CrossEntropyLoss()

As shown above, neither bias nor LayerNorm.weight uses weight decay. See the blog post below for details; the main reason is that the bias update is independent of weight decay.

Weight decay and learning rate decay:
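
To see what the two list comprehensions select, the sketch below applies the same substring filter to a handful of parameter names. The names are illustrative examples written in the style of model.named_parameters(), not a dump from the actual model.

```python
no_decay = ["bias", "LayerNorm.weight"]

# Illustrative parameter names in the style of model.named_parameters().
names = [
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.0.attention.self.query.bias",
    "bert.encoder.layer.0.attention.output.LayerNorm.weight",
    "classifier.weight",
    "classifier.bias",
]

# A name goes into the no-decay group if it contains any of the substrings.
decayed = [n for n in names if not any(nd in n for nd in no_decay)]
not_decayed = [n for n in names if any(nd in n for nd in no_decay)]

print(decayed)
print(not_decayed)
```

Because the filter is a plain substring match, any parameter whose name contains "bias" ends up in the no-decay group.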



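Unlike Keras, where calling fit is enough, PyTorch mostly requires you to implement the training loop yourself. For reusability, simple training and evaluation routines are written as functions here.

Because BertForSequenceClassification already contains a CrossEntropyLoss(), the loss function we just instantiated is not strictly needed; see the comments in the train() function.

Each function returns two values: the loss and the accuracy over one epoch. Other evaluation metrics have to be implemented yourself (or computed with the help of sklearn.metrics).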

def train(model, iterator, optimizer, criterion, device):
    # Enable dropout for training
    model.train()
    epoch_loss = 0
    epoch_acc = 0
    for i, batch in enumerate(iterator):
        label = batch["label"]
        text = batch["text"]
        input_ids, token_type_ids = convert_text_to_ids(tokenizer, text)
        input_ids = seq_padding(tokenizer, input_ids)
        token_type_ids = seq_padding(tokenizer, token_type_ids)
        # Reshape the labels to (batch_size, 1)
        label = label.unsqueeze(1)
        # The model expects LongTensor inputs
        input_ids, token_type_ids, label = input_ids.long(), token_type_ids.long(), label.long()
        # Reset the gradients
        optimizer.zero_grad()
        # Move the tensors to the GPU
        input_ids, token_type_ids, label = input_ids.to(device), token_type_ids.to(device), label.to(device)
        output = model(input_ids=input_ids, token_type_ids=token_type_ids, labels=label)
        y_pred_prob = output[1]
        y_pred_label = y_pred_prob.argmax(dim=1)
        # Compute the loss
        # This loss is the same as output[0]
        loss = criterion(y_pred_prob.view(-1, 2), label.view(-1))
        #loss = output[0]
        # Count the correct predictions
        acc = ((y_pred_label == label.view(-1)).sum()).item()
        # Backpropagate and update the parameters
        loss.backward()
        optimizer.step()
        # Accumulate loss and acc over the epoch
        epoch_loss += loss.item()
        epoch_acc += acc
        if i % 200 == 0:
            print("current loss:", epoch_loss / (i+1), "\t", "current acc:", epoch_acc / ((i+1)*len(label)))
    # return epoch_loss / len(iterator), epoch_acc / (len(iterator) * iterator.batch_size)
    # Fixed following a suggestion in the comment section
    return epoch_loss / len(iterator), epoch_acc / len(iterator.dataset.dataset)

def evaluate(model, iterator, criterion, device):
    # Disable dropout for evaluation
    model.eval()
    epoch_loss = 0
    epoch_acc = 0
    with torch.no_grad():
        for _, batch in enumerate(iterator):
            label = batch["label"]
            text = batch["text"]
            input_ids, token_type_ids = convert_text_to_ids(tokenizer, text)
            input_ids = seq_padding(tokenizer, input_ids)
            token_type_ids = seq_padding(tokenizer, token_type_ids)
            label = label.unsqueeze(1)
            input_ids, token_type_ids, label = input_ids.long(), token_type_ids.long(), label.long()
            input_ids, token_type_ids, label = input_ids.to(device), token_type_ids.to(device), label.to(device)
            output = model(input_ids=input_ids, token_type_ids=token_type_ids, labels=label)
            y_pred_label = output[1].argmax(dim=1)
            loss = output[0]
            acc = ((y_pred_label == label.view(-1)).sum()).item()
            epoch_loss += loss.item()
            epoch_acc += acc
    # return epoch_loss / len(iterator), epoch_acc / (len(iterator) * iterator.batch_size)
    # Fixed following a suggestion in the comment section
    return epoch_loss / len(iterator), epoch_acc / len(iterator.dataset.dataset)
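
train() and evaluate() report only loss and accuracy. As a minimal sketch of "implementing other metrics yourself", the pure-Python function below computes precision, recall and F1 for binary labels. `binary_metrics` and the sample labels are hypothetical; in practice you would collect the true and predicted labels inside the evaluation loop.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


y_true = [1, 0, 1, 1, 0, 1]  # made-up gold labels
y_pred = [1, 0, 0, 1, 1, 1]  # made-up predictions
precision, recall, f1 = binary_metrics(y_true, y_pred)
print(precision, recall, f1)
```

sklearn.metrics offers ready-made equivalents, so a hand-written version like this is mainly useful for understanding what is being measured.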


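Only 2 epochs were run here, and the accuracy on the validation set reached 92%.

# Train, then evaluate on the validation set after each epoch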
for i in range(epochs):
    train_loss, train_acc = train(model, sentiment_train_loader, optimizer, criterion, device)
    print("train loss: ", train_loss, "\t", "train acc:", train_acc)
    valid_loss, valid_acc = evaluate(model, sentiment_valid_loader, criterion, device)
    print("valid loss: ", valid_loss, "\t", "valid acc:", valid_acc)

First epoch

Second epoch



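Here we use the run_squad.py script provided by Transformers to explain the process.

The dataset used is SQuAD v1.1; the download address is

After downloading the data, place it in $SQUAD_DIR and run the following commands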

export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
  --model_type bert \
  --model_name_or_path bert-base-cased \
  --do_train \
  --do_eval \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --per_gpu_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/




We mainly use the contents of the paragraphs field, which includes context and qas

  • qas contains answers, question and id
  • answers in turn contains answer_start and text
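
The nesting described above can be walked with a few loops. The record below is made up to mirror the SQuAD v1.1 layout and is not taken from the dataset; note that answer_start is a character offset into context.

```python
import json

# A made-up record in the SQuAD v1.1 layout, not taken from the dataset.
raw = """
{"version": "1.1", "data": [{"title": "Example", "paragraphs": [{
  "context": "BERT was released by Google in 2018.",
  "qas": [{"id": "q1", "question": "Who released BERT?",
           "answers": [{"text": "Google", "answer_start": 21}]}]}]}]}
"""

squad = json.loads(raw)
rows = []
for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            answer = qa["answers"][0]
            start = answer["answer_start"]
            # Slice the context by character offset to recover the answer.
            span = context[start:start + len(answer["text"])]
            rows.append((qa["id"], qa["question"], span))

print(rows)
```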


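read_squad_examples() is responsible for extracting the original JSON data into a clearer form. Each record becomes a SquadExample object that stores the following

  • qas_id: the id
  • question_text: the question
  • doc_tokens: the context after tokenization
  • orig_answer_text: the answer text
  • start_position: the start position of the answer
  • end_position: the end position of the answer
  • is_impossible: whether the sample has no answer

is_impossible is a field that exists only in SQuAD v2. read_squad_examples() checks the dataset version, which we can set with the --version_2_with_negative flag when running run_squad.py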

if version_2_with_negative:
    is_impossible = qa["is_impossible"]

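The dataset provides only answer_start, so end_position has to be computed: take the start offset plus the answer length minus 1, then map both character offsets to word positions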

if not is_impossible:
    answer = qa["answers"][0]
    orig_answer_text = answer["text"]
    answer_offset = answer["answer_start"]
    answer_length = len(orig_answer_text)
    start_position = char_to_word_offset[answer_offset]
    end_position = char_to_word_offset[answer_offset + answer_length - 1]
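
The snippet relies on char_to_word_offset, a list that maps every character index of the context to the index of the whitespace-separated word containing it. The sketch below shows one way such a table can be built and used. `build_word_offsets` and the sample sentence are assumptions for illustration, and the real script handles a few more whitespace characters.

```python
def build_word_offsets(context):
    """Split on whitespace and map every character index to its word index."""
    doc_tokens, char_to_word_offset = [], []
    prev_is_whitespace = True
    for ch in context:
        if ch.isspace():
            prev_is_whitespace = True
        else:
            if prev_is_whitespace:
                doc_tokens.append(ch)      # first character of a new word
            else:
                doc_tokens[-1] += ch       # continue the current word
            prev_is_whitespace = False
        char_to_word_offset.append(len(doc_tokens) - 1)
    return doc_tokens, char_to_word_offset


context = "BERT was released by Google in 2018."  # made-up context
answer_text, answer_offset = "Google in 2018", 21
doc_tokens, char_to_word_offset = build_word_offsets(context)

start_position = char_to_word_offset[answer_offset]
end_position = char_to_word_offset[answer_offset + len(answer_text) - 1]
print(start_position, end_position)
print(doc_tokens[start_position:end_position + 1])
```

Positions are word indices at this stage; they are mapped to sub-word token positions later, when the features are built.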


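read_squad_examples() reads the data from JSON and does some preprocessing, but the result cannot yet be fed into the BERT model.

So convert_examples_to_features() is also needed to convert the examples into a format BERT accepts, mainly by truncating, padding and converting tokens to ids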


if len(query_tokens) > max_query_length:
    query_tokens = query_tokens[0:max_query_length]

Converting tokens to ids, building the mask, and related steps

input_ids = tokenizer.convert_tokens_to_ids(tokens)

# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)

Padding

# Zero-pad up to the sequence length.
while len(input_ids) < max_seq_length:
    input_ids.append(pad_token)
    input_mask.append(0 if mask_padding_with_zero else 1)
    segment_ids.append(pad_token_segment_id)
    p_mask.append(1)

Finally, the processed data of each sample is stored in an InputFeatures object
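
Putting truncation, id conversion, masking and padding together, a simplified feature builder for the BERT input layout might look like the sketch below. `build_features` and the toy vocabulary are hypothetical. The real function also splits long documents into overlapping windows controlled by --doc_stride instead of simply truncating them, which this sketch leaves out.

```python
def build_features(query_tokens, doc_tokens, vocab, max_seq_length, max_query_length):
    """Build input_ids, input_mask and segment_ids for [CLS] query [SEP] doc [SEP]."""
    query_tokens = query_tokens[:max_query_length]        # truncate the question
    max_doc_len = max_seq_length - len(query_tokens) - 3   # room for [CLS], [SEP], [SEP]
    doc_tokens = doc_tokens[:max_doc_len]                  # simplification: truncate the document

    tokens = ["[CLS]"] + query_tokens + ["[SEP]"] + doc_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question and the first [SEP]; segment 1 the rest.
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    input_ids = [vocab[t] for t in tokens]
    input_mask = [1] * len(input_ids)

    # Zero-pad up to the sequence length.
    n_pad = max_seq_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * n_pad
    input_mask += [0] * n_pad
    segment_ids += [0] * n_pad
    return input_ids, input_mask, segment_ids


# Toy vocabulary, made up for this example.
vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "who": 3, "released": 4, "bert": 5, "google": 6}
input_ids, input_mask, segment_ids = build_features(
    ["who", "released", "bert"], ["google", "released", "bert"],
    vocab, max_seq_length=10, max_query_length=64)
print(input_ids)
print(input_mask)
print(segment_ids)
```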


run_squad.py keeps the Config, Model and Tokenizer classes used by each model in a dictionary, selected with the --model_type argument

MODEL_CLASSES = {
    'bert': (BertConfig, BertForQuestionAnswering, BertTokenizer),
    'xlnet': (XLNetConfig, XLNetForQuestionAnswering, XLNetTokenizer),
    'xlm': (XLMConfig, XLMForQuestionAnswering, XLMTokenizer),
    'distilbert': (DistilBertConfig, DistilBertForQuestionAnswering, DistilBertTokenizer),
    'albert': (AlbertConfig, AlbertForQuestionAnswering, AlbertTokenizer)
}

We pick the model we need with the --model_type argument, and the config, model and tokenizer objects are instantiated in main()

config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
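
The lookup is an ordinary registry pattern: a string key selects a tuple of classes, which are then instantiated. The sketch below reproduces the idea with toy stand-in classes; every name in it is hypothetical and none of it comes from Transformers.

```python
class ToyConfig:
    def __init__(self, name):
        self.name = name


class ToyModel:
    def __init__(self, config):
        self.config = config


class ToyTokenizer:
    def tokenize(self, text):
        return text.lower().split()


# Hypothetical registry with the same shape as MODEL_CLASSES.
TOY_MODEL_CLASSES = {"toy": (ToyConfig, ToyModel, ToyTokenizer)}


def build(model_type):
    """Look up the three classes for a model type and instantiate them."""
    config_class, model_class, tokenizer_class = TOY_MODEL_CLASSES[model_type]
    config = config_class(name=model_type)
    return config, model_class(config), tokenizer_class()


toy_config, toy_model, toy_tokenizer = build("toy")
print(type(toy_model).__name__, toy_tokenizer.tokenize("Hello BERT"))
```

Supporting another architecture then only requires a new dictionary entry, which is presumably why the script is organized this way.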


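The data is processed with read_squad_examples, and convert_examples_to_features then represents each sample as an InputFeatures object.

load_and_cache_examples() then turns all the data into tensors. The code is roughly as follows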

examples = read_squad_examples(input_file=input_file,
                               is_training=not evaluate,
                               version_2_with_negative=args.version_2_with_negative)
features = convert_examples_to_features(examples=examples,
                                        tokenizer=tokenizer,
                                        max_seq_length=args.max_seq_length,
                                        doc_stride=args.doc_stride,
                                        max_query_length=args.max_query_length,
                                        is_training=not evaluate,
                                        cls_token_segment_id=2 if args.model_type in ['xlnet'] else 0,
                                        pad_token_segment_id=3 if args.model_type in ['xlnet'] else 0,
                                        cls_token_at_end=True if args.model_type in ['xlnet'] else False,
                                        sequence_a_is_doc=True if args.model_type in ['xlnet'] else False)
...
# Convert to Tensors and build dataset
all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
all_cls_index = torch.tensor([f.cls_index for f in features], dtype=torch.long)
all_p_mask = torch.tensor([f.p_mask for f in features], dtype=torch.float)
if evaluate:
    all_example_index = torch.arange(all_input_ids.size(0), dtype=torch.long)
    dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids,
                            all_example_index, all_cls_index, all_p_mask)
else:
    all_start_positions = torch.tensor([f.start_position for f in features], dtype=torch.long)
    all_end_positions = torch.tensor([f.end_position for f in features], dtype=torch.long)
    dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids,
                            all_start_positions, all_end_positions,
                            all_cls_index, all_p_mask)
if output_examples:
    return dataset, examples, features
return dataset

After main() turns the data into tensors, they are passed to the train() function

train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False)
global_step, tr_loss = train(args, train_dataset, model, tokenizer)


DataLoader is used here as well to turn the dataset into an iterator that yields one batch of training data at a time

train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)


no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)


scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total)
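
The schedule scales the base learning rate by a factor that rises linearly from 0 to 1 during warmup and then falls linearly back to 0 by the last training step. The function below is a plain-Python sketch of that factor. `linear_warmup_decay` is a hypothetical name, and the step counts are made up; only the 3e-5 base rate comes from the command shown earlier.

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 down to 0 at the final step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 3e-5
for step in (0, 50, 100, 550, 1000):
    factor = linear_warmup_decay(step, num_warmup_steps=100, num_training_steps=1000)
    print(step, factor, base_lr * factor)
```

With warmup_steps left at 0, the schedule reduces to a pure linear decay from the base learning rate.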

If apex is installed on the machine, training can be accelerated by setting the --fp16 flag

if args.fp16:
    try:
        from apex import amp
    except ImportError:
        raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
    model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)

The training loop saves checkpoints, by default every 50 steps; this can be changed with --save_steps

if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
    # Save model checkpoint
    output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step))
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
    model_to_save.save_pretrained(output_dir)
    torch.save(args, os.path.join(output_dir, 'training_args.bin'))
    logger.info("Saving model checkpoint to %s", output_dir)

In BertForQuestionAnswering, sequence_output is passed through a fully connected layer that maps the hidden_size dimension to 2; the output is then split to obtain start_logits and end_logits

The following snippet from BertForQuestionAnswering shows this process

logits = self.qa_outputs(sequence_output)
start_logits, end_logits = logits.split(1, dim=-1)

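start_logits and end_logits are BERT's confidence scores for each token being the start and the end of the answer. The process is roughly as shown in the figure below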
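
To make the role of the two score vectors concrete, the sketch below splits per-token [start, end] pairs and searches for the span with the highest combined score. It is a simplified brute-force search under assumed inputs, not the post-processing that run_squad.py actually performs; `best_span` and the logits are made up.

```python
def best_span(start_logits, end_logits, max_answer_length=30):
    """Return the (start, end) pair with the highest summed logit, start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_length)
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score


# One [start, end] pair per token, as produced by a 2-unit output layer.
logits = [[0.1, 0.0], [2.5, 0.3], [0.2, 3.1], [0.0, 0.4]]
start_logits = [pair[0] for pair in logits]  # plays the role of split(1, dim=-1)
end_logits = [pair[1] for pair in logits]

span, score = best_span(start_logits, end_logits)
print(span, score)
```

Requiring the end index to be no smaller than the start index rules out spans that would otherwise score well but make no sense as answers.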

BertForQuestionAnswering computes the loss by computing a cross-entropy loss for start_logits and for end_logits separately, then summing the two and taking the average

loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
start_loss = loss_fct(start_logits, start_positions)
end_loss = loss_fct(end_logits, end_positions)
total_loss = (start_loss + end_loss) / 2
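
For a single example, the same computation can be written out by hand. The sketch below implements cross-entropy over token positions in plain Python and averages the start and end losses. The logits and gold positions are made up, and the real module additionally averages over the batch and skips targets equal to ignored_index.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax of the target index, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


start_logits = [0.1, 2.5, 0.2, 0.0]  # made-up scores, one per token
end_logits = [0.0, 0.3, 3.1, 0.4]
start_position, end_position = 1, 2   # made-up gold positions

start_loss = cross_entropy(start_logits, start_position)
end_loss = cross_entropy(end_logits, end_position)
total_loss = (start_loss + end_loss) / 2
print(start_loss, end_loss, total_loss)
```

Each loss treats the sequence positions as classes, so predicting the start and the end becomes two classification problems over the tokens.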


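If the --do_eval flag is set, the model is evaluated after training finishes

After prediction, the generated files include predictions_.json and nbest_predictions_.json

predictions_.json records the best answer for each question id, as shown in the figure below

nbest_predictions_.json records the n best answers for each question id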


