How to Prune the Vocab (Vocabulary) of an LLM (Large Language Model)?
Part 1: Introduction
Multilingual large language models often come with a very large vocabulary. When using such a model downstream, we may not need most of those languages — say we only need Chinese and English. In that case we can prune the vocab, which greatly reduces the parameter count while preserving the model's performance. Below we take the Bloom model as an example and walk through how this is done.
The code comes from: https://github.com/yangjianxin1/LLMPruner
Part 2: Pruning Bloom's Vocab
Let's start with a small text-generation example using Bloom:
from transformers import BloomTokenizerFast, BloomForCausalLM
model_name = "bigscience/bloom-560m"
tokenizer = BloomTokenizerFast.from_pretrained(model_name)
model = BloomForCausalLM.from_pretrained(model_name)
input_ids = tokenizer.encode('長風破浪會有時', return_tensors='pt')
outputs = model.generate(input_ids, max_length=64)
print(tokenizer.batch_decode(outputs))
# ['長風破浪會有時,直掛雲帆濟滄海。 願你,在人生的旅途中,能遇見最美的風景,遇見最美的自己。</s>']
Before pruning, open the tokenizer.json file of bloom-560m on Hugging Face. The tokens inside look like gibberish, yet the model clearly generates Chinese just fine — what's going on? The tokenizer applies a further, byte-level encoding to these tokens (each UTF-8 byte is mapped to a printable unicode character); if you're curious about the details, it's worth reading up on byte-level BPE. Let's look at an example:
print(tokenizer("長風破浪會有時"))
# {'input_ids': [2523, 6295, 8238, 19490, 954, 39509], 'attention_mask': [1, 1, 1, 1, 1, 1]}
for i in [2523, 6295, 8238, 19490, 954, 39509]:
    print(tokenizer.decode([i]), tokenizer.convert_ids_to_tokens(i))
"""
長 éķ¿
風 é£İ
破 çł´
浪 浪
會 ä¼ļ
有時 æľīæŶ
"""
Next, prepare your own pruned tokenizer.json in the same format as the original tokenizer.json; a ready-made one can be found here: bloom-396m-zh. Then comes the conversion code:
import os.path

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from tqdm import tqdm


class VocabularyPruner(object):

    def check(self, old_model_name_or_path, new_model_name_or_path, text):
        # Check whether the pruned model generates the same output as the original.
        max_length = 20

        # Encode and generate with the old model.
        old_model = AutoModelForCausalLM.from_pretrained(old_model_name_or_path)
        old_tokenizer = AutoTokenizer.from_pretrained(old_model_name_or_path)
        old_input_ids = old_tokenizer(text, return_tensors='pt').input_ids
        old_output = old_model.generate(old_input_ids, max_length=max_length)
        old_output_text = old_tokenizer.batch_decode(old_output)
        print('old_output:{}'.format(old_output_text))

        # Encode and generate with the new model.
        new_model = AutoModelForCausalLM.from_pretrained(new_model_name_or_path)
        new_tokenizer = AutoTokenizer.from_pretrained(new_model_name_or_path)
        new_input_ids = new_tokenizer(text, return_tensors='pt').input_ids
        new_output = new_model.generate(new_input_ids, max_length=max_length)
        new_output_text = new_tokenizer.batch_decode(new_output)
        print('new_output:{}'.format(new_output_text))

        if old_output_text == new_output_text:
            print('output is same, succeed to prune.')
        else:
            print('output is not same, fail to prune.')

    def update_embeddings(self, model, new2old_token_id, new_embeds, new_lm_head):
        raise NotImplementedError

    def prune(self, model_name_or_path, new_tokenizer_name_or_path, save_path, new_name_or_path=None):
        # Create the output directory.
        if not os.path.exists(save_path):
            os.makedirs(save_path)
        # Load the new vocabulary, e.g. a Chinese-only one.
        new_tokenizer = AutoTokenizer.from_pretrained(new_tokenizer_name_or_path)
        # Load the original vocabulary, usually that of a multilingual model.
        old_tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

        # Check that the new vocabulary is a subset of the original one.
        old_vocab = old_tokenizer.vocab
        new_vocab = new_tokenizer.vocab
        for token in tqdm(new_vocab.keys()):
            if token not in old_vocab:
                raise Exception('{} not exist'.format(token))
        print('new_tokenizer is subset of old_tokenizer')

        # Map each token_id in the new vocabulary to its token_id in the original one.
        new2old_token_id = {}
        for token, token_id in tqdm(new_vocab.items()):
            old_token_id = old_vocab[token]
            new2old_token_id[token_id] = old_token_id

        # Load the multilingual model.
        model = AutoModelForCausalLM.from_pretrained(model_name_or_path, torch_dtype='auto')
        # Count the parameters of the original model.
        old_params = sum(p.numel() for p in model.parameters())
        print("Total params of original model: %.2fM" % (old_params / 1e6))

        # For every token in the new vocabulary, fetch its weights and copy them into the new model.
        vocab_size = len(new_tokenizer)
        hidden_size = model.config.hidden_size
        new_embeds = torch.nn.Embedding(vocab_size, hidden_size, dtype=model.dtype)
        new_lm_head = torch.nn.Linear(in_features=hidden_size, out_features=vocab_size, bias=False, dtype=model.dtype)
        # Update the embedding weights.
        self.update_embeddings(model, new2old_token_id, new_embeds, new_lm_head)

        model.config.__dict__['vocab_size'] = vocab_size
        if new_name_or_path is not None:
            model.config.__dict__['_name_or_path'] = new_name_or_path

        # Count the parameters of the new model.
        new_params = sum(p.numel() for p in model.parameters())
        print("Total params of new model : %.2fM" % (new_params / 1e6))
        print('Vocab shrunk to {}% of the original'.format(round(len(new_tokenizer) / len(old_tokenizer), 4) * 100))
        print('Model params shrunk to {}% of the original'.format(round(new_params / old_params, 4) * 100))

        model.save_pretrained(save_path)
        new_tokenizer.save_pretrained(save_path)


class BloomVocabularyPruner(VocabularyPruner):

    def update_embeddings(self, model, new2old_token_id, new_embeds, new_lm_head):
        # Copy each new token's row from the original embedding and lm_head matrices.
        for token_id, old_token_id in tqdm(new2old_token_id.items()):
            new_embeds.weight.data[token_id] = model.transformer.word_embeddings.weight.data[old_token_id]
            new_lm_head.weight.data[token_id] = model.lm_head.weight.data[old_token_id]
        # Swap the pruned matrices into the model.
        model.transformer.word_embeddings.weight = new_embeds.weight
        model.lm_head.weight = new_lm_head.weight
In short: for each token in the new vocabulary, we take the corresponding row of parameters from the original model and copy it into the new model's input embedding layer and final lm_head layer. Let's run the conversion:
# Path of the model to prune
model_name_or_path = 'bigscience/bloom-560m'
# Path of the custom (pruned) vocabulary
new_tokenizer_name_or_path = 'YeungNLP/bloom-396m-zh'
save_path = 'path-to-save'

pruner = BloomVocabularyPruner()
# Prune
pruner.prune(model_name_or_path, new_tokenizer_name_or_path, save_path)
# Check whether the pruned model behaves the same as the original
pruner.check(model_name_or_path, save_path, text='長風破浪會有時')
Result:
100%|██████████| 46145/46145 [00:00<00:00, 1309531.65it/s]
new_tokenizer is subset of old_tokenizer
100%|██████████| 46145/46145 [00:00<00:00, 1120687.88it/s]
Total params of original model: 559.21M
100%|██████████| 46145/46145 [00:01<00:00, 41641.55it/s]
Total params of new model : 396.82M
Vocab shrunk to 18.41% of the original
Model params shrunk to 70.96000000000001% of the original
old_output:['長風破浪會有時,直掛雲帆濟滄海。 願你,在人生的旅途中,能遇見最美的風景,遇見最美的自己。</s>']
new_output:['長風破浪會有時,直掛雲帆濟滄海。 願你,在人生的旅途中,能遇見最美的風景,遇見最美的自己。</s>']
output is same, succeed to prune.
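Where do these numbers come from? Here is a quick sanity check, assuming bloom-560m's hidden size of 1024 and original vocab size of 250880, and noting that Bloom ties lm_head to word_embeddings (counted once), while the pruned model ends up with two separate matrices after the weight assignments above:

hidden_size = 1024
old_embed = 250880 * hidden_size         # ~256.90M params, counted once (tied)
new_embed = 46145 * hidden_size          # ~47.25M params
backbone = 559.21e6 - old_embed          # ~302.31M transformer params
print((backbone + 2 * new_embed) / 1e6)  # ~396.8M, matching the log above

If you would rather keep the weights tied (saving another ~47M parameters), one option is to call model.tie_weights() after the assignment, though that is not what the script above does.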
Part 3: Additional Notes
Other multilingual models can be pruned in the same way. Things to watch out for:
Keep the indices of special tokens as consistent as possible with the original model; you can verify this with a check like the sketch below.
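A minimal sketch of such a check, reusing the paths from earlier; a mismatched id for a special token can silently break generation even when the new vocab is a valid subset:

from transformers import AutoTokenizer

old_tok = AutoTokenizer.from_pretrained('bigscience/bloom-560m')
new_tok = AutoTokenizer.from_pretrained('path-to-save')

# Compare the id of every special token in the old and new vocabularies.
for tok in old_tok.all_special_tokens:
    old_id = old_tok.convert_tokens_to_ids(tok)
    new_id = new_tok.convert_tokens_to_ids(tok)
    print(tok, old_id, new_id, 'OK' if old_id == new_id else 'MISMATCH')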