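Official environment requirements (inference and fine-tuning):
This deployment uses a single A100 40 GB GPU.
Deployment
Create the virtual environment: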
conda create -n test python=3.10.9
conda activate test  # activate the virtual environment
Clone Llama2-Chinese:
git clone https://github.com/FlagAlpha/Llama2-Chinese.git
cd Llama2-Chinese
Install the dependencies:
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple  # install from the Tsinghua mirror
Pull the Llama2-Chinese-13b-Chat model weights and code:
git lfs install
git clone https://huggingface.co/FlagAlpha/Llama2-Chinese-13b-Chat
If git lfs install reports an error, Git LFS is probably not installed yet. Git LFS is a command-line extension and specification for managing large files with Git. The client is written in Go, with pre-compiled binaries available for Mac, Windows, Linux, and FreeBSD.
Install it first:
apt-get install git-lfs
Inspect the downloaded files with ls -l:
-rw-r--r-- 1 root root 683 Aug 7 17:02 config.json
-rw-r--r-- 1 root root 175 Aug 7 17:02 generation_config.json
-rw-r--r-- 1 root root 9948728430 Aug 7 17:34 pytorch_model-00001-of-00003.bin
-rw-r--r-- 1 root root 9904165024 Aug 7 17:34 pytorch_model-00002-of-00003.bin
-rw-r--r-- 1 root root 6178983625 Aug 7 17:28 pytorch_model-00003-of-00003.bin
-rw-r--r-- 1 root root 33444 Aug 7 17:02 pytorch_model.bin.index.json
-rw-r--r-- 1 root root 1514 Aug 7 17:02 README.md
-rw-r--r-- 1 root root 414 Aug 7 17:02 special_tokens_map.json
-rw-r--r-- 1 root root 749 Aug 7 17:02 tokenizer_config.json
-rw-r--r-- 1 root root 499723 Aug 7 17:02 tokenizer.model
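If the file sizes or the number of files are wrong, the weight download failed. Run rm -rf Llama2-Chinese-13b-Chat and clone again (this may take several attempts). Alternatively, download the weight files individually: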
wget https://huggingface.co/FlagAlpha/Llama2-Chinese-13b-Chat/resolve/main/pytorch_model-00001-of-00003.bin
wget https://huggingface.co/FlagAlpha/Llama2-Chinese-13b-Chat/resolve/main/pytorch_model-00002-of-00003.bin
wget https://huggingface.co/FlagAlpha/Llama2-Chinese-13b-Chat/resolve/main/pytorch_model-00003-of-00003.bin
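Note: the weight files are large and Hugging Face is hosted overseas, so downloads are often slow or fail outright. A failed download can leave model files missing or incomplete and cause errors at run time. Verified copies of the model files have been uploaded to OSS and can be downloaded from there directly at a much higher speed.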
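The size check described above can be automated. The following is a minimal sketch, assuming the expected sizes are the ones shown in the ls -l listing (the upstream repository may have changed since then); the function name find_bad_files is hypothetical and not part of Llama2-Chinese.

```python
from pathlib import Path

# Expected sizes in bytes, copied from the `ls -l` listing above.
# Assumption: the upstream files have not changed since that listing was made.
EXPECTED_SIZES = {
    "pytorch_model-00001-of-00003.bin": 9948728430,
    "pytorch_model-00002-of-00003.bin": 9904165024,
    "pytorch_model-00003-of-00003.bin": 6178983625,
}


def find_bad_files(model_dir, expected=EXPECTED_SIZES):
    """Return {filename: reason} for files that are missing or have the wrong size."""
    problems = {}
    for name, size in expected.items():
        path = Path(model_dir) / name
        if not path.is_file():
            problems[name] = "missing"
        elif path.stat().st_size != size:
            problems[name] = f"size {path.stat().st_size}, expected {size}"
    return problems


if __name__ == "__main__":
    bad = find_bad_files("Llama2-Chinese-13b-Chat")
    for name, reason in bad.items():
        print(f"{name}: {reason}")
    print("OK" if not bad else "Re-download the files listed above.")
```

A weight file of only a few hundred bytes usually means that Git LFS fetched a pointer file instead of the real weights.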
In the Llama2-Chinese directory, create a Python file named generate.py:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'Llama2-Chinese-13b-Chat',
    device_map='auto',
    torch_dtype=torch.float16,
    load_in_8bit=True,  # quantize the weights to 8-bit (requires bitsandbytes)
)
model = model.eval()
tokenizer = AutoTokenizer.from_pretrained('Llama2-Chinese-13b-Chat', use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
# The prompt asks: "Introduce South China University of Technology".
input_ids = tokenizer(['<s>Human: 介紹一下華南理工大學\n</s><s>Assistant: '], return_tensors="pt", add_special_tokens=False).input_ids.to('cuda')
generate_input = {
    "input_ids": input_ids,
    "max_new_tokens": 512,
    "do_sample": True,
    "top_k": 50,
    "top_p": 0.95,
    "temperature": 0.3,
    "repetition_penalty": 1.3,
    "eos_token_id": tokenizer.eos_token_id,
    "bos_token_id": tokenizer.bos_token_id,
    "pad_token_id": tokenizer.pad_token_id,
}
generate_ids = model.generate(**generate_input)
# The decoded text contains the prompt followed by the generated answer.
text = tokenizer.decode(generate_ids[0])
print(text)
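Because generate returns the prompt tokens followed by the new tokens, the printed text includes the prompt. The sketch below shows one way to keep only the answer; extract_reply is a hypothetical helper, and it assumes the decoded string still contains the literal marker "Assistant: " and the token "</s>", which depends on how the tokenizer renders special tokens.

```python
def extract_reply(decoded: str, marker: str = "Assistant: ", eos: str = "</s>") -> str:
    """Return the text after the last `marker`, cut at the first `eos` that follows it."""
    # rpartition returns the whole string as the tail when the marker is absent.
    _, _, tail = decoded.rpartition(marker)
    # Drop the end-of-sequence token and anything after it.
    reply, _, _ = tail.partition(eos)
    return reply.strip()


print(extract_reply("<s>Human: hi\n</s><s>Assistant: hello</s>"))  # prints "hello"
```

An alternative that avoids string matching is to slice the prompt tokens off before decoding, for example tokenizer.decode(generate_ids[0][input_ids.shape[1]:]).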
Run the test
Run python generate.py and check the output:
The output shows that the model makes many factual errors and organizes Chinese text poorly.
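Fine-tuning
Data preparation
First prepare your own training data and validation data in CSV format.
Each CSV file contains a single column named "text", and each row is one training sample. Build the training and validation sets by combining each question and answer into model input in the following format:
"<s>Human: " + question + "\n</s><s>Assistant: " + answer
For example (the sample asks "Who are you?" and answers "I am Llama-Chinese-13B-chat, an AI assistant developed by Xiao Ming in 2023."):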
<s>Human: 你是誰?</s><s>Assistant: 我是Llama-Chinese-13B-chat,一個由小明在2023年開發的人工智能助手。</s>
Upload the data to a dedicated folder.
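The format above can be produced with a few lines of Python. This is a minimal sketch, not part of Llama2-Chinese: format_sample and write_dataset are hypothetical names, and the sketch follows the template literally, so it does not append the trailing </s> that appears in the example (whether the training script adds it is not stated here).

```python
import csv


def format_sample(question: str, answer: str) -> str:
    """Join one question/answer pair using the template given above."""
    return "<s>Human: " + question + "\n</s><s>Assistant: " + answer


def write_dataset(pairs, csv_path):
    """Write (question, answer) pairs to a CSV file with a single 'text' column."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text"])  # header: the one column named "text"
        for question, answer in pairs:
            # The csv module quotes the field because it contains a newline.
            writer.writerow([format_sample(question, answer)])
```

For instance, write_dataset([("你是誰?", "我是Llama-Chinese-13B-chat。")], "train.csv") would create a one-sample training file; the file name train.csv is only an illustration.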
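Then edit the parameters in the fine-tuning script train/sft/finetune.sh. The main ones to change are the model output path, the original model path, and the paths to the training and validation data:
Run
bash finetune.sh
to start fine-tuning and obtain the fine-tuned model: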
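Merging the model
Once the fine-tuned model has been generated, create a new Python file, merge.py, to merge the original model with the newly generated model files. The reference code is as follows: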
import argparse
import json
import os
import gc
import torch
import sys
sys.path.append("./")
import peft
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer
from huggingface_hub import hf_hub_download

parser = argparse.ArgumentParser()
parser.add_argument('--base_model', default=None, required=True,
                    type=str, help="Please specify a base_model")
parser.add_argument('--lora_model', default=None, required=True,
                    type=str, help="Please specify LoRA models to be merged (ordered); use commas to separate multiple LoRA models.")
parser.add_argument('--offload_dir', default=None, type=str,
                    help="(Optional) Please specify a temp folder for offloading (useful for low-RAM machines). Default None (disable offload).")
parser.add_argument('--output_type', default='pth', choices=['pth', 'huggingface'], type=str,
                    help="save the merged model in pth or huggingface format.")
parser.add_argument('--output_dir', default='./', type=str)

emb_to_model_size = {
    4096: '7B',
    5120: '13B',
    6656: '30B',
    8192: '65B',
}
num_shards_of_models = {'7B': 1, '13B': 2}
params_of_models = {
    '7B':
        {
            "dim": 4096,
            "multiple_of": 256,
            "n_heads": 32,
            "n_layers": 32,
            "norm_eps": 1e-06,
            "vocab_size": -1,
        },
    '13B':
        {
            "dim": 5120,
            "multiple_of": 256,
            "n_heads": 40,
            "n_layers": 40,
            "norm_eps": 1e-06,
            "vocab_size": -1,
        },
}

def transpose(weight, fan_in_fan_out):
    return weight.T if fan_in_fan_out else weight


# Borrowed and modified from https://github.com/tloen/alpaca-lora
def translate_state_dict_key(k):
    k = k.replace("base_model.model.", "")
    if k == "model.embed_tokens.weight":
        return "tok_embeddings.weight"
    elif k == "model.norm.weight":
        return "norm.weight"
    elif k == "lm_head.weight":
        return "output.weight"
    elif k.startswith("model.layers."):
        layer = k.split(".")[2]
        if k.endswith(".self_attn.q_proj.weight"):
            return f"layers.{layer}.attention.wq.weight"
        elif k.endswith(".self_attn.k_proj.weight"):
            return f"layers.{layer}.attention.wk.weight"
        elif k.endswith(".self_attn.v_proj.weight"):
            return f"layers.{layer}.attention.wv.weight"
        elif k.endswith(".self_attn.o_proj.weight"):
            return f"layers.{layer}.attention.wo.weight"
        elif k.endswith(".mlp.gate_proj.weight"):
            return f"layers.{layer}.feed_forward.w1.weight"
        elif k.endswith(".mlp.down_proj.weight"):
            return f"layers.{layer}.feed_forward.w2.weight"
        elif k.endswith(".mlp.up_proj.weight"):
            return f"layers.{layer}.feed_forward.w3.weight"
        elif k.endswith(".input_layernorm.weight"):
            return f"layers.{layer}.attention_norm.weight"
        elif k.endswith(".post_attention_layernorm.weight"):
            return f"layers.{layer}.ffn_norm.weight"
        elif k.endswith("rotary_emb.inv_freq") or "lora" in k:
            return None
        else:
            print(layer, k)
            raise NotImplementedError
    else:
        print(k)
        raise NotImplementedError


def unpermute(w):
    # n_heads and dim are module-level globals set in the __main__ block.
    return (
        w.view(n_heads, 2, dim // n_heads // 2, dim).transpose(1, 2).reshape(dim, dim)
    )

def save_shards(model_sd, num_shards: int):
    # Add the no_grad context manager
    with torch.no_grad():
        if num_shards == 1:
            new_state_dict = {}
            for k, v in model_sd.items():
                new_k = translate_state_dict_key(k)
                if new_k is not None:
                    if "wq" in new_k or "wk" in new_k:
                        new_state_dict[new_k] = unpermute(v)
                    else:
                        new_state_dict[new_k] = v

            os.makedirs(output_dir, exist_ok=True)
            print(f"Saving shard 1 of {num_shards} into {output_dir}/consolidated.00.pth")
            torch.save(new_state_dict, output_dir + "/consolidated.00.pth")
            with open(output_dir + "/params.json", "w") as f:
                json.dump(params, f)
        else:
            new_state_dicts = [dict() for _ in range(num_shards)]
            for k in list(model_sd.keys()):
                v = model_sd[k]
                new_k = translate_state_dict_key(k)
                if new_k is not None:
                    if new_k == 'tok_embeddings.weight':
                        print(f"Processing {new_k}")
                        assert v.size(1) % num_shards == 0
                        splits = v.split(v.size(1) // num_shards, dim=1)
                    elif new_k == 'output.weight':
                        print(f"Processing {new_k}")
                        splits = v.split(v.size(0) // num_shards, dim=0)
                    elif new_k == 'norm.weight':
                        print(f"Processing {new_k}")
                        splits = [v] * num_shards
                    elif 'ffn_norm.weight' in new_k:
                        print(f"Processing {new_k}")
                        splits = [v] * num_shards
                    elif 'attention_norm.weight' in new_k:
                        print(f"Processing {new_k}")
                        splits = [v] * num_shards
                    elif 'w1.weight' in new_k:
                        print(f"Processing {new_k}")
                        splits = v.split(v.size(0) // num_shards, dim=0)
                    elif 'w2.weight' in new_k:
                        print(f"Processing {new_k}")
                        splits = v.split(v.size(1) // num_shards, dim=1)
                    elif 'w3.weight' in new_k:
                        print(f"Processing {new_k}")
                        splits = v.split(v.size(0) // num_shards, dim=0)
                    elif 'wo.weight' in new_k:
                        print(f"Processing {new_k}")
                        splits = v.split(v.size(1) // num_shards, dim=1)
                    elif 'wv.weight' in new_k:
                        print(f"Processing {new_k}")
                        splits = v.split(v.size(0) // num_shards, dim=0)
                    elif "wq.weight" in new_k or "wk.weight" in new_k:
                        print(f"Processing {new_k}")
                        v = unpermute(v)
                        splits = v.split(v.size(0) // num_shards, dim=0)
                    else:
                        print(f"Unexpected key {new_k}")
                        raise ValueError
                    for sd, split in zip(new_state_dicts, splits):
                        sd[new_k] = split.clone()
                        del split
                    del splits
                del model_sd[k], v
                gc.collect()  # Effectively enforce garbage collection

            os.makedirs(output_dir, exist_ok=True)
            for i, new_state_dict in enumerate(new_state_dicts):
                print(f"Saving shard {i+1} of {num_shards} into {output_dir}/consolidated.0{i}.pth")
                torch.save(new_state_dict, output_dir + f"/consolidated.0{i}.pth")
            with open(output_dir + "/params.json", "w") as f:
                print(f"Saving params.json into {output_dir}/params.json")
                json.dump(params, f)

if __name__ == '__main__':
    args = parser.parse_args()
    base_model_path = args.base_model
    lora_model_paths = [s.strip() for s in args.lora_model.split(',') if len(s.strip()) != 0]
    output_dir = args.output_dir
    output_type = args.output_type
    offload_dir = args.offload_dir

    print(f"Base model: {base_model_path}")
    print(f"LoRA model(s) {lora_model_paths}:")

    if offload_dir is not None:
        # Load with offloading, which is useful for low-RAM machines.
        # Note that if you have enough RAM, please use original method instead, as it is faster.
        base_model = LlamaForCausalLM.from_pretrained(
            base_model_path,
            load_in_8bit=False,
            torch_dtype=torch.float16,
            offload_folder=offload_dir,
            offload_state_dict=True,
            low_cpu_mem_usage=True,
            device_map={"": "cpu"},
        )
    else:
        # Original method without offloading
        base_model = LlamaForCausalLM.from_pretrained(
            base_model_path,
            load_in_8bit=False,
            torch_dtype=torch.float16,
            device_map={"": "cpu"},
        )
    print(base_model)
    ## infer the model size from the checkpoint
    embedding_size = base_model.get_input_embeddings().weight.size(1)
    model_size = emb_to_model_size[embedding_size]
    print(f"Peft version: {peft.__version__}")
    print(f"Loading LoRA for {model_size} model")

    lora_model = None
    lora_model_sd = None
    for lora_index, lora_model_path in enumerate(lora_model_paths):
        print(f"Loading LoRA {lora_model_path}")
        tokenizer = LlamaTokenizer.from_pretrained(lora_model_path)
        assert base_model.get_input_embeddings().weight.size(0) == len(tokenizer)
        # if base_model.get_input_embeddings().weight.size(0) != len(tokenizer):
        #     base_model.resize_token_embeddings(len(tokenizer))
        #     print(f"Extended vocabulary size to {len(tokenizer)}")
        first_weight = base_model.model.layers[0].self_attn.q_proj.weight
        first_weight_old = first_weight.clone()

        if hasattr(peft.LoraModel, 'merge_and_unload'):
            lora_model = PeftModel.from_pretrained(
                base_model,
                lora_model_path,
                device_map={"": "cpu"},
                torch_dtype=torch.float16,
            )
            assert torch.allclose(first_weight_old, first_weight)
            print(f"Merging with merge_and_unload...")
            base_model = lora_model.merge_and_unload()
        else:
            base_model_sd = base_model.state_dict()
            try:
                lora_model_sd = torch.load(os.path.join(lora_model_path, 'adapter_model.bin'), map_location='cpu')
            except FileNotFoundError:
                print("Cannot find lora model on the disk. Downloading lora model from hub...")
                filename = hf_hub_download(repo_id=lora_model_path, filename='adapter_model.bin')
                lora_model_sd = torch.load(filename, map_location='cpu')

            lora_config = peft.LoraConfig.from_pretrained(lora_model_path)
            lora_scaling = lora_config.lora_alpha / lora_config.r
            fan_in_fan_out = lora_config.fan_in_fan_out
            lora_keys = [k for k in lora_model_sd if 'lora_A' in k]
            non_lora_keys = [k for k in lora_model_sd if not 'lora_' in k]

            for k in non_lora_keys:
                print(f"merging {k}")
                original_k = k.replace('base_model.model.', '')
                base_model_sd[original_k].copy_(lora_model_sd[k])

            for k in lora_keys:
                print(f"merging {k}")
                original_key = k.replace('.lora_A', '').replace('base_model.model.', '')
                assert original_key in base_model_sd
                lora_a_key = k
                lora_b_key = k.replace('lora_A', 'lora_B')
                base_model_sd[original_key] += (
                    transpose(lora_model_sd[lora_b_key].float() @ lora_model_sd[lora_a_key].float(), fan_in_fan_out) * lora_scaling
                )
                assert base_model_sd[original_key].dtype == torch.float16

        # did we do anything?
        assert not torch.allclose(first_weight_old, first_weight)

    tokenizer.save_pretrained(output_dir)

    if output_type == 'huggingface':
        print("Saving to Hugging Face format...")
        LlamaForCausalLM.save_pretrained(
            base_model, output_dir,
            max_shard_size="2GB"
        )  # , state_dict=deloreanized_sd)
    else:  # output_type == 'pth'
        print("Saving to pth format...")

        base_model_sd = base_model.state_dict()
        del lora_model, base_model, lora_model_sd

        params = params_of_models[model_size]
        num_shards = num_shards_of_models[model_size]
        n_layers = params["n_layers"]
        n_heads = params["n_heads"]
        dim = params["dim"]
        dims_per_head = dim // n_heads
        base = 10000.0
        inv_freq = 1.0 / (base ** (torch.arange(0, dims_per_head, 2).float() / dims_per_head))

        save_shards(model_sd=base_model_sd, num_shards=num_shards)
Usage:
CUDA_VISIBLE_DEVICES="3" python merge.py \
--base_model /hy-tmp/Llama2-Chinese/Llama2-Chinese-13b-Chat \
--lora_model /hy-tmp/Llama2-Chinese/train/sft/output \
--output_type huggingface \
--output_dir ./output_merge
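Note: base_model is the path to the original model, lora_model is the path to the model produced by training, and output_dir is the path where the merged model is written.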
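To make the merge step concrete: in the fallback branch of merge.py (used when peft.LoraModel has no merge_and_unload), each adapted weight is updated as W + (lora_B @ lora_A) * lora_alpha / r. The sketch below reproduces that arithmetic on a toy matrix. It is an illustration only: it uses plain Python lists instead of torch tensors so that it runs without dependencies, it skips the fan_in_fan_out transpose, and the helper names matmul and merge_lora are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return weight + (lora_b @ lora_a) * (lora_alpha / r) without modifying the inputs."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # shape (out, in), the same shape as weight
    return [
        [w + d * scaling for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: a 2x2 weight with a rank-1 adapter (r = 1).
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # shape (r, in)  = (1, 2)
lora_b = [[0.5], [1.0]]  # shape (out, r) = (2, 1)
merged = merge_lora(weight, lora_a, lora_b, lora_alpha=2, r=1)
print(merged)  # prints [[2.0, 2.0], [2.0, 5.0]]
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged model can be loaded by generate.py like any ordinary checkpoint.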
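After the merge script finishes successfully, the new model appears in the output directory:

Testing
After obtaining the new model, change the model path in generate.py to output_dir.
Then run the test: python generate.py
Compared with the original output, the effect of fine-tuning is clear: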
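Incremental pre-training of the Llama2 base model
Preface
Training an AI model proceeds in the following three stages.
The first stage is unsupervised learning (pre-training): a large text corpus is fed to the model (GPT in this description) so that it discovers the regularities of language on its own. This builds a huge word-vector space, but fluent output is not necessarily correct output.
The second stage is supervised learning (Supervised Fine-Tuning, SFT): humans annotate a set of examples to teach the model what it should and should not say. (This roughly plays the role of a training dataset.)
The third stage is reinforcement learning, which includes training a reward model (RM): the model's answers are scored, telling it which of its many answers are better. (This roughly plays the role of a validation dataset.)
The first stage (unsupervised learning) is divided into base-model pre-training and incremental pre-training, both of which are unsupervised. The rest of this section continues pre-training the Llama2 base model on a large amount of text.
In our tests, training required at least six A100 40 GB GPUs.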
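1. Environment preparation
In the provided pre-training folder, download the original model and place it in the llama_script/training/pretrained_model directory.
Clone the transformers project and install its dependencies: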
git clone https://github.com/huggingface/transformers.git
cd transformers
pip install -e .
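The subsequent training requires the base model in Hugging Face format. If you downloaded the PyTorch version from the official site, either download the Hugging Face version instead or convert it with the script included in transformers: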
cd /transformers
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir ./llama \
    --model_size 7B \
    --output_dir ./output
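2. Data preparation
Split the data by paragraph into multiple txt files, so that each file contains exactly one paragraph wrapped in <s> </s>:
Put these text files into the dataset_dir directory under the folder.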
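The splitting step can be scripted. The following is a minimal sketch under stated assumptions: paragraphs are separated by blank lines, each paragraph is written as <s>paragraph</s> with no extra whitespace inside the tags (the exact layout the training script expects is not specified here), and the function name and file naming scheme are hypothetical.

```python
import re
from pathlib import Path


def split_into_paragraph_files(text, dataset_dir, prefix="part"):
    """Write each non-empty paragraph of `text` to its own .txt file, wrapped in <s> </s>."""
    out_dir = Path(dataset_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    paths = []
    for i, paragraph in enumerate(paragraphs):
        path = out_dir / f"{prefix}_{i:05d}.txt"
        path.write_text(f"<s>{paragraph}</s>", encoding="utf-8")
        paths.append(path)
    return paths
```

Calling split_into_paragraph_files(open("corpus.txt", encoding="utf-8").read(), "dataset_dir") would fill dataset_dir with one file per paragraph; corpus.txt is a placeholder for your own corpus file.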
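3. Start training
Go to the project's train directory and edit the parameters of the pretrain script, mainly model_name_or_path (base model path), tokenizer_name_or_path (tokenizer path), and dataset_dir (dataset path). If you placed everything in the folders specified in the previous steps, no changes are needed.
Start training. Single-GPU training is the default; to use multiple GPUs, change the nproc_per_node parameter in the script.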
cd training
bash pretrain.sh
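4. Merge the files
After training, the LoRA weights and configuration are stored in output/pt_lora_model. Run the merge script in the folder:
python merge_llama.py \
    --base_model /hy-tmp/Llama-2-7b-hf \
    --lora_model /hy-tmp/output_dir/pt_lora_model \
    --output_dir /hy-tmp/output_merge \
    --output_type huggingface
Adjust the paths and the output type to match your setup, then start the merge.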
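5. Inference test
Run the generation code to test the model. The results are as follows:
Original model:
After training:
The results show that the model successfully learned the new knowledge through incremental pre-training.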