How to Prune the Vocab (Vocabulary) of an LLM (Large Language Model)?
Part 1: Introduction
Multilingual large language models often come with a very large vocabulary. When using such a model downstream, we may not need most of those languages — say we only need Chinese and English. In that case we can prune the vocab, which greatly reduces the parameter count while preserving the model's performance. Below we take the Bloom model as an example and walk through how this is done.
The code comes from: https://github.com/yangjianxin1/LLMPruner
Part 2: Pruning Bloom's Vocab
Let's start with a small text-generation example using Bloom:
from transformers import BloomTokenizerFast, BloomForCausalLM
model_name = "bigscience/bloom-560m"
tokenizer = BloomTokenizerFast.from_pretrained(model_name)
model = BloomForCausalLM.from_pretrained(model_name)
input_ids = tokenizer.encode('長風破浪會有時', return_tensors='pt')
outputs = model.generate(input_ids, max_length=64)
print(tokenizer.batch_decode(outputs))
# ['長風破浪會有時,直掛雲帆濟滄海。 願你,在人生的旅途中,能遇見最美的風景,遇見最美的自己。</s>']
Before pruning, open the tokenizer.json file of bloom-560m on Hugging Face. The tokens inside look like gibberish, yet the model clearly generates Chinese just fine — what's going on? The tokenizer applies a further, byte-level encoding to these tokens (each UTF-8 byte is mapped to a printable unicode character); if you're curious about the details, it's worth reading up on byte-level BPE. Let's look at an example:
print(tokenizer("長風破浪會有時"))
# {'input_ids': [2523, 6295, 8238, 19490, 954, 39509], 'attention_mask': [1, 1, 1, 1, 1, 1]}
for i in [2523, 6295, 8238, 19490, 954, 39509]:
    print(tokenizer.decode([i]), tokenizer.convert_ids_to_tokens(i))
"""
長 éķ¿
風 é£İ
破 çł´
浪 浪
會 ä¼ļ
有時 æľīæŶ
"""
Next, prepare your own pruned tokenizer.json in the same format as the original tokenizer.json; a ready-made one can be found here: bloom-396m-zh. Then comes the conversion code:
import os.path

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from tqdm import tqdm


class VocabularyPruner(object):

    def check(self, old_model_name_or_path, new_model_name_or_path, text):
        # Check whether the pruned model generates the same output as the original.
        max_length = 20

        # Encode and generate with the old model.
        old_model = AutoModelForCausalLM.from_pretrained(old_model_name_or_path)
        old_tokenizer = AutoTokenizer.from_pretrained(old_model_name_or_path)
        old_input_ids = old_tokenizer(text, return_tensors='pt').input_ids
        old_output = old_model.generate(old_input_ids, max_length=max_length)
        old_output_text = old_tokenizer.batch_decode(old_output)
        print('old_output:{}'.format(old_output_text))

        # Encode and generate with the new model.
        new_model = AutoModelForCausalLM.from_pretrained(new_model_name_or_path)
        new_tokenizer = AutoTokenizer.from_pretrained(new_model_name_or_path)
        new_input_ids = new_tokenizer(text, return_tensors='pt').input_ids
        new_output = new_model.generate(new_input_ids, max_length=max_length)
        new_output_text = new_tokenizer.batch_decode(new_output)
        print('new_output:{}'.format(new_output_text))

        if old_output_text == new_output_text:
            print('output is same, succeed to prune.')
        else:
            print('output is not same, fail to prune.')

    def update_embeddings(self, model, new2old_token_id, new_embeds, new_lm_head):
        raise NotImplementedError

    def prune(self, model_name_or_path, new_tokenizer_name_or_path, save_path, new_name_or_path=None):
        # Create the output directory.
        if not os.path.exists(save_path):
            os.makedirs(save_path)
        # Load the new vocabulary, e.g. a Chinese-only one.
        new_tokenizer = AutoTokenizer.from_pretrained(new_tokenizer_name_or_path)
        # Load the original vocabulary, usually that of a multilingual model.
        old_tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

        # Check that the new vocabulary is a subset of the original one.
        old_vocab = old_tokenizer.vocab
        new_vocab = new_tokenizer.vocab
        for token in tqdm(new_vocab.keys()):
            if token not in old_vocab:
                raise Exception('{} not exist'.format(token))
        print('new_tokenizer is subset of old_tokenizer')

        # Map each token_id in the new vocabulary to its token_id in the original one.
        new2old_token_id = {}
        for token, token_id in tqdm(new_vocab.items()):
            old_token_id = old_vocab[token]
            new2old_token_id[token_id] = old_token_id

        # Load the multilingual model.
        model = AutoModelForCausalLM.from_pretrained(model_name_or_path, torch_dtype='auto')
        # Count the parameters of the original model.
        old_params = sum(p.numel() for p in model.parameters())
        print("Total params of original model: %.2fM" % (old_params / 1e6))

        # For every token in the new vocabulary, fetch its weights and copy them into the new model.
        vocab_size = len(new_tokenizer)
        hidden_size = model.config.hidden_size
        new_embeds = torch.nn.Embedding(vocab_size, hidden_size, dtype=model.dtype)
        new_lm_head = torch.nn.Linear(in_features=hidden_size, out_features=vocab_size, bias=False, dtype=model.dtype)
        # Update the embedding weights.
        self.update_embeddings(model, new2old_token_id, new_embeds, new_lm_head)

        model.config.__dict__['vocab_size'] = vocab_size
        if new_name_or_path is not None:
            model.config.__dict__['_name_or_path'] = new_name_or_path

        # Count the parameters of the new model.
        new_params = sum(p.numel() for p in model.parameters())
        print("Total params of new model : %.2fM" % (new_params / 1e6))
        print('Vocab shrunk to {}% of the original'.format(round(len(new_tokenizer) / len(old_tokenizer), 4) * 100))
        print('Model params shrunk to {}% of the original'.format(round(new_params / old_params, 4) * 100))

        model.save_pretrained(save_path)
        new_tokenizer.save_pretrained(save_path)


class BloomVocabularyPruner(VocabularyPruner):

    def update_embeddings(self, model, new2old_token_id, new_embeds, new_lm_head):
        # Copy each new token's row from the original embedding and lm_head matrices.
        for token_id, old_token_id in tqdm(new2old_token_id.items()):
            new_embeds.weight.data[token_id] = model.transformer.word_embeddings.weight.data[old_token_id]
            new_lm_head.weight.data[token_id] = model.lm_head.weight.data[old_token_id]
        # Swap the pruned matrices into the model.
        model.transformer.word_embeddings.weight = new_embeds.weight
        model.lm_head.weight = new_lm_head.weight
In short: for each token in the new vocabulary, we take the corresponding row of parameters from the original model and copy it into the new model's input embedding layer and final lm_head layer. Let's run the conversion:
# Path of the model to prune
model_name_or_path = 'bigscience/bloom-560m'
# Path of the custom (pruned) vocabulary
new_tokenizer_name_or_path = 'YeungNLP/bloom-396m-zh'
save_path = 'path-to-save'

pruner = BloomVocabularyPruner()
# Prune
pruner.prune(model_name_or_path, new_tokenizer_name_or_path, save_path)
# Check whether the pruned model behaves the same as the original
pruner.check(model_name_or_path, save_path, text='長風破浪會有時')
Result:
100%|██████████| 46145/46145 [00:00<00:00, 1309531.65it/s]
new_tokenizer is subset of old_tokenizer
100%|██████████| 46145/46145 [00:00<00:00, 1120687.88it/s]
Total params of original model: 559.21M
100%|██████████| 46145/46145 [00:01<00:00, 41641.55it/s]
Total params of new model : 396.82M
Vocab shrunk to 18.41% of the original
Model params shrunk to 70.96000000000001% of the original
old_output:['長風破浪會有時,直掛雲帆濟滄海。 願你,在人生的旅途中,能遇見最美的風景,遇見最美的自己。</s>']
new_output:['長風破浪會有時,直掛雲帆濟滄海。 願你,在人生的旅途中,能遇見最美的風景,遇見最美的自己。</s>']
output is same, succeed to prune.
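Where do these numbers come from? Here is a quick sanity check, assuming bloom-560m's hidden size of 1024 and original vocab size of 250880, and noting that Bloom ties lm_head to word_embeddings (counted once), while the pruned model ends up with two separate matrices after the weight assignments above:

hidden_size = 1024
old_embed = 250880 * hidden_size         # ~256.90M params, counted once (tied)
new_embed = 46145 * hidden_size          # ~47.25M params
backbone = 559.21e6 - old_embed          # ~302.31M transformer params
print((backbone + 2 * new_embed) / 1e6)  # ~396.8M, matching the log above

If you would rather keep the weights tied (saving another ~47M parameters), one option is to call model.tie_weights() after the assignment, though that is not what the script above does.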
Part 3: Additional Notes
Other multilingual models can be pruned in the same way. Things to watch out for:
Keep the indices of special tokens as consistent as possible with the original model; you can verify this with a check like the sketch below.
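A minimal sketch of such a check, reusing the paths from earlier; a mismatched id for a special token can silently break generation even when the new vocab is a valid subset:

from transformers import AutoTokenizer

old_tok = AutoTokenizer.from_pretrained('bigscience/bloom-560m')
new_tok = AutoTokenizer.from_pretrained('path-to-save')

# Compare the id of every special token in the old and new vocabularies.
for tok in old_tok.all_special_tokens:
    old_id = old_tok.convert_tokens_to_ids(tok)
    new_id = new_tok.convert_tokens_to_ids(tok)
    print(tok, old_id, new_id, 'OK' if old_id == new_id else 'MISMATCH')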