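Setting up a multi-node, multi-GPU DeepSpeed environment with Docker for distributed fine-tuning of large models

Prerequisites: two CentOS servers that can reach each other over the network (server 1 and server 2), each with Docker and the NVIDIA driver installed.

I. Create a shared overlay network with Docker

1) Use server 1 as the manager node and initialize the swarm by running docker swarm init: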

Swarm initialized: current node (ly4dipghh8734nys3t7t68sfx) is now a manager.

To add a worker to this swarm, run the following command:

    docker swarm join --token SWMTKN-1-1zag1cf7n1o21cu88dyanbhlrbhrjicztf9qlyryt08w56k3ba-9pu6zhn2klfv2qm0bnaysm4lo 192.168.6.98:2377

To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.

2) Add server 2 to the cluster

On server 2, run the join command printed above:

docker swarm join --token SWMTKN-1-1zag1cf7n1o21cu88dyanbhlrbhrjicztf9qlyryt08w56k3ba-9pu6zhn2klfv2qm0bnaysm4lo 192.168.6.98:2377

3) On the manager node, create the network that the DeepSpeed environment will use

docker network create --driver=overlay --attachable sharednet

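4) Check the current network status

Run docker network ls on the manager node.

Then run docker network ls on the worker node.

The worker node does not list sharednet yet, because a swarm overlay network is only extended to a worker once a container on that worker attaches to it.

To fix this, start a container on the worker node and explicitly attach it to sharednet; Docker then syncs the network to that node automatically: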

docker run -dit --name alpine2 --network sharednet alpine

Check the networks on the worker node again; sharednet should now be listed.
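The check in step 4 can also be scripted. The following is a minimal sketch, assuming the network names are obtained with docker network ls --format '{{.Name}}' (one name per line). The helper name has_network and the sample outputs are illustrative and not taken from the original setup.

```python
def has_network(ls_output: str, name: str = "sharednet") -> bool:
    """Return True if `name` appears as a complete line in the output."""
    names = [line.strip() for line in ls_output.splitlines()]
    return name in names


# Hypothetical sample outputs, for illustration only.
before = "bridge\ndocker_gwbridge\nhost\ningress\nnone\n"
after = before + "sharednet\n"

print(has_network(before))  # False
print(has_network(after))   # True
```

Matching whole lines rather than substrings avoids a false positive from a similarly named network such as sharednet2.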

II. Container environment setup

Perform the following steps on both the manager and worker nodes.

Create a workspace folder with the following contents:

workspace/
├── code/
├── docker-compose.yml
└── Dockerfile

The Dockerfile is as follows:

FROM nvidia/cuda:11.7.1-devel-ubuntu22.04

# Update the package index and install build tools and utilities
RUN apt-get update && apt-get install -y git build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libsqlite3-dev libreadline-dev libffi-dev liblzma-dev libbz2-dev curl wget net-tools iputils-ping pdsh

# Build and install Python from source
WORKDIR /home/user

RUN wget https://www.python.org/ftp/python/3.10.6/Python-3.10.6.tgz && \
  tar -zvxf Python-3.10.6.tgz && cd Python-3.10.6 && \
  ./configure --enable-optimizations && make -j 4 && make install

The docker-compose.yml file is as follows:

version: "3"
services:
  llm:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: llm
    tty: true
    restart: always
    ulimits:
      memlock: -1
      stack: 67108864
    shm_size: 40G
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    volumes:
      - ./code:/home/user/code:cached
    networks:
      - sharednet

networks:
  sharednet:
    external: true

Fine-tune chatglm2-6b with the LLaMA-Factory project:

# Enter the working directory
$ cd workspace

# Clone the project into the code folder
$ git clone https://github.com/hiyouga/LLaMA-Factory code

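Start the container with docker compose up -d --build.

On each node, enter the container with docker exec -it <container ID> /bin/bash (the container name llm set in docker-compose.yml also works in place of the ID).

1) Verify network connectivity

Run ifconfig inside the container on both the manager and worker nodes to check its IP address.

In this setup the containers have the following addresses on sharednet, which the remaining steps use:

Manager: 10.0.1.2

Worker: 10.0.1.4

2) Ping each container from the other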

root@453417f7b6d7:/home/user# ping 10.0.1.2
PING 10.0.1.2 (10.0.1.2) 56(84) bytes of data.
64 bytes from 10.0.1.2: icmp_seq=1 ttl=64 time=0.580 ms
64 bytes from 10.0.1.2: icmp_seq=2 ttl=64 time=0.293 ms
64 bytes from 10.0.1.2: icmp_seq=3 ttl=64 time=0.321 ms
64 bytes from 10.0.1.2: icmp_seq=4 ttl=64 time=0.318 ms

root@082431c6895e:/home/user# ping 10.0.1.4
PING 10.0.1.4 (10.0.1.4) 56(84) bytes of data.
64 bytes from 10.0.1.4: icmp_seq=1 ttl=64 time=0.429 ms
64 bytes from 10.0.1.4: icmp_seq=2 ttl=64 time=0.347 ms
64 bytes from 10.0.1.4: icmp_seq=3 ttl=64 time=0.394 ms
64 bytes from 10.0.1.4: icmp_seq=4 ttl=64 time=0.364 ms

The two containers can reach each other over the network.

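III. Enable passwordless SSH access

Install and start the openssh-server service in the containers on both the manager and worker nodes:

# Install the SSH service
apt-get install openssh-server -y
# Start the SSH service
/etc/init.d/ssh start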

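Configure passwordless login

1) In the containers on both the manager and worker nodes, run ssh-keygen -t rsa and press Enter at every prompt.

2) Append the contents of ~/.ssh/id_rsa.pub from the manager node to ~/.ssh/authorized_keys on both the manager and worker nodes.

Likewise, append the contents of ~/.ssh/id_rsa.pub from the worker node to ~/.ssh/authorized_keys on both the manager and worker nodes.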
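Because each public key has to be added to two authorized_keys files, it is easy to paste the same key twice or to glue two keys onto one line. The following is a minimal sketch of an idempotent append, assuming the key text has already been copied between containers by hand. The helper add_authorized_key and the placeholder keys are hypothetical and only for illustration.

```python
import tempfile
from pathlib import Path


def add_authorized_key(authorized_keys: Path, public_key: str) -> bool:
    """Append public_key unless it is already present.

    Returns True if the file was modified.
    """
    key = public_key.strip()
    text = authorized_keys.read_text() if authorized_keys.exists() else ""
    if key in (line.strip() for line in text.splitlines()):
        return False
    authorized_keys.parent.mkdir(mode=0o700, parents=True, exist_ok=True)
    # Start on a fresh line if the existing file lacks a trailing newline.
    separator = "" if not text or text.endswith("\n") else "\n"
    with authorized_keys.open("a") as handle:
        handle.write(f"{separator}{key}\n")
    authorized_keys.chmod(0o600)  # sshd expects restrictive permissions
    return True


# Demo with placeholder keys in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / ".ssh" / "authorized_keys"
    print(add_authorized_key(target, "ssh-rsa AAAA...manager root@manager"))  # True
    print(add_authorized_key(target, "ssh-rsa AAAA...worker root@worker"))    # True
    print(add_authorized_key(target, "ssh-rsa AAAA...manager root@manager"))  # False
```

In a real container the target would be Path.home() / ".ssh" / "authorized_keys".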

3) Add the following to /etc/hosts in the containers on both the manager and worker nodes:

10.0.1.2 node1
10.0.1.4 node2

4) Test that the two containers can log in to each other over SSH

Manager node: ssh root@10.0.1.4

Worker node: ssh root@10.0.1.2
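If the SSH login hangs or is refused, it helps to first confirm that the SSH port is reachable at all. The following is a minimal sketch using only the Python standard library; port_open and check_nodes are hypothetical helper names, and port 22 is assumed because the steps above use the default sshd configuration.

```python
import socket


def port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def check_nodes(hosts, port: int = 22) -> dict:
    """Map each host name to whether its SSH port accepts connections."""
    return {host: port_open(host, port) for host in hosts}
```

Calling check_nodes(["node1", "node2"]) inside either container returns a dictionary keyed by host name. A False entry points at sshd not running or at a wrong /etc/hosts entry, rather than at a key problem.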

Configure the NCCL network:

Check which network interface the container uses: ifconfig

In this setup the interface is eth0.

Set it with: export NCCL_SOCKET_IFNAME=eth0
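The export above only applies to the shell in which it is run. DeepSpeed's documentation describes a .deepspeed_env file of KEY=VALUE lines, placed in the home directory, whose variables the launcher sets on every node. The following is a minimal sketch of generating that file. The helper write_deepspeed_env is hypothetical, and whether this mechanism applies to your DeepSpeed version is an assumption to verify.

```python
import tempfile
from pathlib import Path


def write_deepspeed_env(path: Path, variables: dict) -> str:
    """Write one KEY=VALUE line per variable and return the text written."""
    text = "".join(f"{key}={value}\n" for key, value in variables.items())
    path.write_text(text)
    return text


# Demo in a temporary directory; on a real node the target would be
# Path.home() / ".deepspeed_env".
with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".deepspeed_env"
    written = write_deepspeed_env(env_file, {"NCCL_SOCKET_IFNAME": "eth0"})
    print(written, end="")  # NCCL_SOCKET_IFNAME=eth0
```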

IV. Distributed training configuration

1) Prepare the chatglm2-6b model and place it in /home/user/code

2) Install dependencies

pip3 install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

3) Configure the hostfile

Create a file named hostfile in /home/user/code/ with the following contents:

node1 slots=2
node2 slots=2

slots is the number of GPUs to use on that node; here each node uses 2 GPUs.
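The hostfile determines the total number of training processes. The following is a minimal sketch that parses the format shown above and sums the slots. parse_hostfile is a hypothetical helper for illustration and is not part of DeepSpeed.

```python
def parse_hostfile(text: str) -> dict:
    """Parse 'host slots=N' lines into a {host: N} mapping."""
    slots = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments and blanks
        if not line:
            continue
        host, spec = line.split()
        key, value = spec.split("=")
        if key != "slots":
            raise ValueError(f"unexpected field in hostfile line: {raw!r}")
        slots[host] = int(value)
    return slots


hostfile = "node1 slots=2\nnode2 slots=2\n"
slots = parse_hostfile(hostfile)
print(slots)                # {'node1': 2, 'node2': 2}
print(sum(slots.values()))  # 4
```

With the hostfile above, training runs as 4 processes in total, one per GPU.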

4) Configure ds_config.json

Create a file named ds_config.json in /home/user/code/ and write:

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": false,
    "contiguous_gradients": true
  }
}
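Several values in this file are the string "auto". With the Hugging Face Trainer integration, which LLaMA-Factory builds on, these are filled in from the training command's arguments instead of being fixed in the file. The following is a minimal sketch that lists which settings are deferred in this way; find_auto is a hypothetical helper.

```python
import json

DS_CONFIG = """
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": false,
    "contiguous_gradients": true
  }
}
"""


def find_auto(node: dict, prefix: str = "") -> list:
    """Return dotted paths of every value equal to the string "auto"."""
    paths = []
    for key, value in node.items():
        path = f"{prefix}{key}"
        if value == "auto":
            paths.append(path)
        elif isinstance(value, dict):
            paths.extend(find_auto(value, path + "."))
    return paths


config = json.loads(DS_CONFIG)
print(find_auto(config))
# ['train_batch_size', 'train_micro_batch_size_per_gpu',
#  'gradient_accumulation_steps', 'gradient_clipping', 'fp16.enabled']
```

Running json.loads on the file is also a quick way to catch a syntax error before launching a multi-node job.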

V. Training

Run the following command on the manager node, from /home/user/code:

deepspeed --hostfile hostfile src/train_bash.py \
    --deepspeed ds_config.json \
    --stage sft \
    --model_name_or_path /home/user/code/chatglm2-6b \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --template chatglm2 \
    --finetuning_type lora \
    --lora_target query_key_value \
    --output_dir chatglm2_sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 10000 \
    --learning_rate 5e-5 \
    --num_train_epochs 0.25 \
    --plot_loss \
    --fp16
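The global batch size implied by this command follows from the per-GPU batch size, the gradient accumulation steps, and the number of GPUs listed in the hostfile. DeepSpeed requires train_batch_size to equal the product of these three values. A minimal sketch of the arithmetic, with global_batch_size as a hypothetical helper:

```python
def global_batch_size(per_device: int, grad_accum: int, world_size: int) -> int:
    """Samples consumed per optimizer step across all GPUs."""
    return per_device * grad_accum * world_size


world_size = 2 * 2  # 2 nodes x 2 slots each, from the hostfile
print(global_batch_size(per_device=4, grad_accum=4, world_size=world_size))  # 64
```

Changing the number of nodes or slots therefore changes the effective batch size unless per_device_train_batch_size or gradient_accumulation_steps is adjusted to compensate.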

Training log: [screenshot not included]

Reference:

https://juejin.cn/post/7304182487683923994
