TigerBot-70b-4k-v4 Inference Deployment

Local model deployment (based on HuggingFace)

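In our tests, loading the model takes about 129 GB of GPU memory, so at least six RTX 3090 cards are needed (pipeline parallelism).

For accelerated inference with vllm (tensor parallelism), plan on eight RTX 3090 cards or four A100-40G cards, because of how the model has to be partitioned across the cards.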
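
The card counts above can be sanity-checked with a little arithmetic. The sketch below assumes 24 GB per RTX 3090 and 40 GB per A100-40G, and it assumes 64 attention heads (the Llama-2-70B figure, taken here as an assumption for TigerBot-70b; check num_attention_heads in the model's config.json). It ignores KV-cache and activation overhead, so treat the results as lower bounds rather than a sizing guarantee.

```python
import math


def min_gpus_for_memory(model_gb: float, gpu_gb: float) -> int:
    """Smallest number of cards whose combined memory holds the weights."""
    return math.ceil(model_gb / gpu_gb)


def min_gpus_tensor_parallel(model_gb: float, gpu_gb: float, num_heads: int) -> int:
    """Smallest card count that fits the weights and evenly divides the head count."""
    n = min_gpus_for_memory(model_gb, gpu_gb)
    while n <= num_heads and num_heads % n != 0:
        n += 1
    if n > num_heads:
        raise ValueError("no valid tensor-parallel size for this head count")
    return n


MODEL_GB = 129   # measured figure quoted in the text
NUM_HEADS = 64   # assumption: Llama-2-70B-style attention

print(min_gpus_for_memory(MODEL_GB, 24))                  # pipeline parallel, RTX 3090
print(min_gpus_tensor_parallel(MODEL_GB, 24, NUM_HEADS))  # tensor parallel, RTX 3090
print(min_gpus_tensor_parallel(MODEL_GB, 40, NUM_HEADS))  # tensor parallel, A100-40G
```

Under these assumptions the three calls give 6, 8 and 4, which matches the card counts quoted above.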

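Model download

As of this writing, the model weights are hosted only on HuggingFace. On Hengyuan Cloud (恆源雲), download them as follows:

Enable the Hengyuan Cloud proxy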

export https_proxy=http://turbo.gpushare.com:30000 http_proxy=http://turbo.gpushare.com:30000

Fetch the model files

Using wget is recommended here because it can resume interrupted downloads (with the -c flag). An example:

wget -c https://huggingface.co/TigerResearch/tigerbot-70b-chat-v4-4k/resolve/main/pytorch_model-00001-of-00015.bin
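
The file name in the example suggests the weights are split into 15 shards. As a rough sketch, the helper below generates one URL per shard, assuming every shard follows the same naming pattern as the first one; the function name and the 15-shard default are illustrative. The configuration and tokenizer files in the repository are also needed and are not covered here.

```python
BASE_URL = "https://huggingface.co/TigerResearch/tigerbot-70b-chat-v4-4k/resolve/main"


def shard_urls(total: int = 15, base_url: str = BASE_URL) -> list[str]:
    """Build the download URL of every weight shard, numbered 1..total."""
    return [
        f"{base_url}/pytorch_model-{i:05d}-of-{total:05d}.bin"
        for i in range(1, total + 1)
    ]


if __name__ == "__main__":
    # One URL per line, e.g. to save as urls.txt
    print("\n".join(shard_urls()))
```

Saving the output to a file and passing it to wget with -c -i lets wget fetch every shard and resume any that were interrupted.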

Disable the Hengyuan Cloud proxy

unset http_proxy && unset https_proxy

Installing dependencies

Clone the official GitHub repository

git clone https://github.com/TigerResearch/TigerBot.git && cd TigerBot

Install the required libraries

pip install -r requirements.txt

Model inference

For ordinary multi-GPU inference, an example command is:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python infer.py --model_path /path/to/your/model --max_input_length 1024 --max_generate_length 1024 --streaming True

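Parameters

--model_path: path to the model
--model_type=chat: base or chat
--max_input_length=1024: maximum input length
--max_generate_length=1024: maximum output length
--rope_scaling=None: length-extrapolation method (dynamic and yarn are currently supported)
--rope_factor=8.0: extrapolation factor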

Accelerated inference with vllm

Install vllm

pip install vllm

Create a new Python file for inference

from vllm import LLM, SamplingParams

# Number of GPUs to shard the model across (tensor parallelism)
num_gpus = 8  # Change this to the number of GPUs you have

# Define your prompts and sampling parameters
prompts = """
### Instruction:
first instruction

### Instruction:
second instruction

### Response:
"""
sampling_params = SamplingParams(temperature=1, top_p=0.9, top_k=50, max_tokens=512, stop="</s>")

# Initialize the vllm engine; tensor_parallel_size already places the model on the GPUs,
# so no torch.nn.DataParallel wrapper is needed (LLM is not an nn.Module)
llm = LLM(model="/hy-tmp/tigerbot-70b-chat-v4-4k", tensor_parallel_size=num_gpus, trust_remote_code=True)

# Generate outputs
outputs = llm.generate(prompts, sampling_params)

# Print the outputs
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Note that the prompt format differs from llama2's. TigerBot prompts follow the format below (note the two blank lines at the top):



### Instruction:
first instruction

### Response:
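
Getting the leading blank lines right by hand is easy to miss, so a small helper can build the string. This is a minimal sketch of the single-turn format described above; build_prompt is an illustrative name, not part of TigerBot or vllm.

```python
def build_prompt(instruction: str) -> str:
    """Wrap one instruction in the TigerBot single-turn prompt format."""
    # Two leading newlines, then the instruction, then an open Response section.
    return f"\n\n### Instruction:\n{instruction}\n\n### Response:\n"


prompt = build_prompt("3+5=?")
print(repr(prompt))
```

The resulting string can be passed to llm.generate in the vllm script or used as the prompt field of a completions request.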

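Troubleshooting

Most errors during installation come from dependency version mismatches and go away once the versions are adjusted.

flash-attn installation error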

/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZNK3c106SymIntltEl

Fix: rebuild the flash-attn library

pip uninstall flash-attn
FLASH_ATTENTION_FORCE_BUILD=TRUE pip install flash-attn
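
The missing symbol in the error demangles to a c10::SymInt operator, which belongs to PyTorch, so the failure usually means the compiled flash-attn extension does not match the installed PyTorch build. After rebuilding, a quick import check confirms whether the extension now loads. The sketch below is illustrative; check_import is a hypothetical helper name.

```python
import importlib


def check_import(module_name: str) -> tuple[bool, str]:
    """Try to import a module and report whether it loaded."""
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        # An "undefined symbol" failure in a compiled extension surfaces as ImportError.
        return False, f"{module_name}: {exc}"
    return True, f"{module_name}: OK"


if __name__ == "__main__":
    for name in ("torch", "flash_attn"):
        ok, message = check_import(name)
        print(message)
```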

OpenAI-compatible API deployment

Deployment command

On the same 8-card 3090 machine, a single command deploys the TigerBot model:

python -m vllm.entrypoints.openai.api_server \
    --model="/hy-tmp/tigerbot-70b-chat-v4-4k" \
    --tensor-parallel-size 8 \
    --served-model-name "tigerbot" \
    --chat-template tiger_template.jinja \
    --host 0.0.0.0 \
    --port 8080

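The parameters mean the following:

--model: location of the model weights, either a local path or a remote repository; here the model is loaded from a local path
--tensor-parallel-size: the tensor-parallel degree; the machine has 8 cards, so it is set to 8 (note that this number must evenly divide the number of attention heads)
--served-model-name: the model name exposed by the service. By default it is the same as the --model value, which here would be an unsightly path name and might even invite attacks, so this flag overrides it
--host, --port: the IP address and port the API listens on
--chat-template: a template that adapts the multi-turn message roles of the OpenAI API to TigerBot's multi-turn dialogue format. It must be a jinja template; the one I wrote is: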

{{ "" }}
{% for message in messages %}
{% if message['role'] == 'user' %}
{{ "\n### Instruction:" }}
{% else %}
{{ "\n### Response:" }}
{% endif %}
{{ message['content'] }}
{% endfor %}
{{ "\n### Response:\n" }}

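The chat_template here is simply the chat_template format used in huggingface.

Note that this feature is fairly new and is only supported from vllm 0.2.3 onward. If you hit the error below, upgrade vllm right away.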

api_server.py: error: unrecognized arguments: --chat-templat

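In the jinja template above, keep the first line (it produces one extra \n), and do not indent any line (indentation would leak extra spaces into the prompt).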
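
To see what the template is meant to produce, the mapping can be written out in plain Python. This is a sketch under an assumption: it mirrors the output I would expect when the template is rendered with newlines after block tags trimmed (Jinja's trim_blocks option, which Hugging Face's chat-template rendering enables). With other whitespace settings the rendered string may contain extra newlines. messages_to_prompt is an illustrative name.

```python
def messages_to_prompt(messages: list[dict]) -> str:
    """Convert OpenAI-style chat messages into a TigerBot multi-turn prompt."""
    parts = ["\n"]  # the extra newline produced by the template's first line
    for message in messages:
        if message["role"] == "user":
            header = "\n### Instruction:"
        else:
            header = "\n### Response:"
        parts.append(f"{header}\n{message['content']}\n")
    parts.append("\n### Response:\n")  # open section for the model to fill in
    return "".join(parts)


history = [
    {"role": "user", "content": "3+5=?"},
    {"role": "assistant", "content": "3+5=8"},
    {"role": "user", "content": "add 4 more"},
]
print(messages_to_prompt(history))
```

For a single user message this reduces to the single-turn format, two leading newlines included.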

Verifying startup

If you see the following output, the server has started successfully:

INFO:     Started server process [49087]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

You can query the server with curl to see which models are available:

curl http://localhost:8080/v1/models

On success, you will see a response like this:

{"object":"list","data":[{"id":"tigerbot","object":"model","created":1701951473,"owned_by":"vllm","root":"tigerbot","parent":null,"permission":[{"id":"modelperm-e084351f42514fd88aee16661312eaea","object":"model_permission","created":1701951473,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
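
When scripting against the server, it can help to pull just the model names out of that response. A minimal sketch using only the standard library follows; the sample string is an abbreviated version of the response above, and served_model_ids is an illustrative name.

```python
import json


def served_model_ids(response_text: str) -> list[str]:
    """Return the ids of all models listed in a /v1/models response body."""
    payload = json.loads(response_text)
    return [entry["id"] for entry in payload.get("data", [])]


# Abbreviated sample of the response shown above
sample = '{"object":"list","data":[{"id":"tigerbot","object":"model","owned_by":"vllm"}]}'
print(served_model_ids(sample))
```

The id returned here is the value to pass as "model" in later requests.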

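Calling the API

We can use curl to send requests for the model to process.

The request below follows OpenAI's completions API, but with the prompt wrapped in TigerBot's dialogue format.

Completion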

curl http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "tigerbot",
        "prompt": "\n\n### Instruction:\n你是誰？\n\n### Response:\n",
        "max_tokens": 1024,
        "temperature": 1
    }'

A standard single-turn chat

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "tigerbot",
        "messages": [
            {"role": "user", "content": "3+5=?"}
        ]
    }'

The response:

{
  "id": "cmpl-002b8cd331814cb6b8dde2d70340a024",
  "object": "chat.completion",
  "created": 10628423,
  "model": "tigerbot",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " 3+5=8"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "total_tokens": 16,
    "completion_tokens": 7
  }
}
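
In a script, the two parts of this response that usually matter are the assistant's text and the token usage. The sketch below parses a trimmed copy of the response above with the standard library; extract_reply is an illustrative name.

```python
import json


def extract_reply(response_text: str) -> tuple[str, dict]:
    """Return the assistant's text and the usage block of a chat completion."""
    payload = json.loads(response_text)
    choice = payload["choices"][0]
    # The model's text may start with a space, as in the sample, so strip it.
    return choice["message"]["content"].strip(), payload["usage"]


# Trimmed copy of the response shown above
sample = """{
  "object": "chat.completion",
  "model": "tigerbot",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": " 3+5=8"},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 9, "total_tokens": 16, "completion_tokens": 7}
}"""

reply, usage = extract_reply(sample)
print(reply)
print(usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"])
```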

Below is a multi-turn chat test:

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "tigerbot",
        "messages": [
            {"role": "user", "content": "3+5=?"},
            {"role": "assistant", "content": "3+5=8"},
            {"role": "user", "content": "再加上4"}
        ]
    }'
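
The server keeps no conversation state, so each request has to carry the whole history, as the messages array above does. One way to manage that on the client side is sketched below. The Conversation class and the fake_send stand-in are hypothetical; in real use, send would post the messages to /v1/chat/completions and return the assistant's text.

```python
from typing import Callable

Message = dict[str, str]


class Conversation:
    """Keep the running message list and resend it on every turn."""

    def __init__(self, send: Callable[[list[Message]], str]):
        self.send = send
        self.messages: list[Message] = []

    def ask(self, text: str) -> str:
        self.messages.append({"role": "user", "content": text})
        reply = self.send(self.messages)
        # Store the reply so the next turn includes it as context.
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def fake_send(messages: list[Message]) -> str:
    """Stand-in backend that echoes the last user message."""
    return f"(reply to: {messages[-1]['content']})"


chat = Conversation(fake_send)
chat.ask("3+5=?")
chat.ask("add 4 more")
print([m["role"] for m in chat.messages])
```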

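Public access

I ran this test deployment on Hengyuan Cloud.

Serve the API on port 8080, then enable Hengyuan Cloud's custom API service; it gives you a link that you substitute for the local address.

When I tested, the link was http://i-1.gpushare.com:30028/v1/chat/completions.

In principle, you could also expose the service through frp forwarding.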

Using the OpenAI Python client

The code is the same as usual, except that the API base URL must be changed.

Note the api_key; the default is EMPTY.

from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"

# Use the internal or the public address, depending on where you connect from
openai_api_base = "http://i-1.gpushare.com:30028/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

completion = client.chat.completions.create(
    model="tigerbot",
    messages=[
        {"role": "user", "content": "你是誰"},
    ]
)
print("Chat response:", completion.choices[0].message.content)
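
If the openai package is not available, the same request can be made with the standard library alone. This is a sketch, not the client's actual implementation: build_chat_request and chat are illustrative names, and the Bearer header simply mirrors the placeholder key used above (a server started without an API key is assumed to ignore it).

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, messages: list[dict],
                       api_key: str = "EMPTY") -> urllib.request.Request:
    """Prepare a POST request for the /chat/completions endpoint."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def chat(base_url: str, model: str, messages: list[dict]) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, messages)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]


if __name__ == "__main__":
    request = build_chat_request(
        "http://localhost:8080/v1", "tigerbot",
        [{"role": "user", "content": "3+5=?"}],
    )
    print(request.get_method(), request.full_url)
```

Calling chat with the base URL from the example above sends the request; the main block only builds it, so it runs without a server.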

vllm load test

Single-threaded output speed is about 23 tokens per second.

With multiple threads it can reach 320 tokens per second.
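
The text does not say how these figures were measured. One plausible way to obtain numbers of this kind is to fire concurrent requests from a thread pool and divide the generated tokens by the wall-clock time. The sketch below shows the idea with a fake backend; measure_throughput and fake_send are hypothetical names, and a real send would call the API and return usage.completion_tokens.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def measure_throughput(send: Callable[[], int], n_requests: int, workers: int) -> float:
    """Run n_requests calls on a thread pool; return generated tokens per second.

    `send` performs one request and returns its completion token count.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(send) for _ in range(n_requests)]
        total_tokens = sum(future.result() for future in futures)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def fake_send() -> int:
    """Stand-in for one API call: sleep briefly, report 7 generated tokens."""
    time.sleep(0.01)
    return 7


rate = measure_throughput(fake_send, n_requests=16, workers=8)
print(f"{rate:.0f} tokens/s (fake backend)")
print(f"reported speedup: {320 / 23:.1f}x")
```

By the figures quoted above, concurrency raises aggregate throughput roughly 14-fold over a single request stream.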

Deploying a Chinese llama2-70b model and an OpenAI-format API with vllm on an 8-card 3090 GPU cloud server