Fine-tuning llama2-70B on 8x RTX 3090 GPUs

Preface

Many issues are still not fully understood; I am continuing to tune the setup.

What is known so far

I am running on 8x RTX 3090 GPUs.

Training runs with DeepSpeed ZeRO-3; the ZeRO-3 configuration is below.

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
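
Before launching, it is worth sanity-checking whether ZeRO-3 can shard a ~70B model's states across 8 GPUs. DeepSpeed ships a memory estimator for this; below is a minimal sketch, assuming transformers, accelerate, and deepspeed are installed. It instantiates the model on the meta device so no real weights are loaded, and the estimate covers parameters, gradients, and optimizer states only, not activations.

# Minimal sketch: estimate ZeRO-3 memory needs without loading real weights.
# Assumes transformers, accelerate, and deepspeed are installed; the model
# path mirrors the launch command below.
from transformers import AutoConfig, AutoModelForCausalLM
from accelerate import init_empty_weights
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

config = AutoConfig.from_pretrained("/hy-tmp/tigerbot-70b-chat-v4-4k")
with init_empty_weights():  # meta-device tensors: shapes only, no memory used
    model = AutoModelForCausalLM.from_config(config)

# Prints estimated per-GPU and CPU memory for model states under the
# available offload combinations; activation memory is not included.
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)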

Here is the launch command.

It is already known that per_device_train_batch_size must be increased.

deepspeed --num_gpus=8 src/train_bash.py \
    --stage sft \
    --model_name_or_path /hy-tmp/tigerbot-70b-chat-v4-4k \
    --do_train True \
    --finetuning_type lora \
    --template tigerbot \
    --dataset_dir data \
    --dataset self_cognition_golden \
    --cutoff_len 1024 \
    --learning_rate 0.01 \
    --num_train_epochs 1.0 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --save_steps 100 \
    --lora_rank 256 \
    --lora_dropout 0.1 \
    --lora_target q_proj,v_proj \
    --output_dir saves \
    --fp16 True \
    --plot_loss True \
    --deepspeed deepspeed.json
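
Once training finishes, the LoRA adapter written to saves can be smoke-tested by loading it on top of the base model. Below is a minimal sketch with PEFT, assuming peft, transformers, and accelerate are installed; for real use the prompt should follow the tigerbot template, and the merge step is optional.

# Minimal sketch: load the trained LoRA adapter from `saves` on top of the
# base model for inference. Paths mirror the launch command above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "/hy-tmp/tigerbot-70b-chat-v4-4k"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, "saves")  # LoRA weights from --output_dir
model = model.merge_and_unload()  # optional: fold LoRA deltas into the base weights

inputs = tokenizer("你好", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))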

 
