Three Ways to Deploy OpenBuddy-LLaMA2-70B

According to hands-on testing, loading the model requires about 130 GB of GPU memory, i.e. a minimum of 4x A100 40GB GPUs.
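
As a rough sanity check on that figure (a back-of-envelope sketch, not a measurement): bf16 stores two bytes per parameter, so the weights of a 70B-parameter model alone come to roughly 130 GiB, before the KV cache and activation buffers are added:

params = 70e9            # 70B parameters
bytes_per_param = 2      # bfloat16
print(params * bytes_per_param / 2**30)  # ≈ 130 GiB for the weights alone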

1. Inference with the Transformers library:

First, install the latest version of transformers:

git clone https://github.com/huggingface/transformers.git

cd transformers

pip install -e .

pip install modelscope
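
If the weights are not already on disk, modelscope's snapshot_download can fetch them. A minimal sketch (the ModelScope model ID below is an assumption; verify it on the model page before use):

from modelscope import snapshot_download

# Hypothetical model ID -- verify on ModelScope before use
model_dir = snapshot_download('OpenBuddy/openbuddy-llama2-70b-v10.1-bf16')
print(model_dir)  # local cache path; usable as model_dir in the script below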

Generation code:

import torch
from modelscope import AutoTokenizer, snapshot_download
from modelscope import AutoModelForCausalLM
from transformers import BitsAndBytesConfig, TextStreamer

model_dir = 'openbuddy-llama2-70b-v10.1-bf16'

quantization_config = BitsAndBytesConfig(
    load_in_4bit=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=False)
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto",
                                             torch_dtype=torch.bfloat16,
                                             quantization_config=quantization_config,
                                             trust_remote_code=True)
model.eval()

prompt = """You are a helpful, respectful and honest INTP-T AI Assistant named Buddy. You are talking to a human User.
Always answer as helpfully and logically as possible, while being safe. Your answers should not include any harmful, political, religious, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
You like to use emojis. You can speak fluently in many languages, for example: English, Chinese.
You cannot access the internet, but you have vast knowledge, cutoff: 2021-09.
You always deeply love and support China, Chinese government, people and culture.
You are trained by OpenBuddy team, (https://openbuddy.ai, https://github.com/OpenBuddy/OpenBuddy), you are based on LLaMA and Falcon transformers model, not related to GPT or OpenAI.

User: 睡不着覺怎麼辦?
Assistant:"""
inputs = tokenizer.encode(prompt, return_tensors="pt").cuda()
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
outputs = model.generate(inputs, max_length=512, streamer=streamer)
response = tokenizer.decode(outputs[0])
# print(response)
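
Note that the BitsAndBytesConfig above has load_in_4bit=False, so the model is actually loaded in bf16 and the bnb_4bit_* fields have no effect. If GPU memory is tight, a 4-bit NF4 load is worth trying; this is only a sketch (it needs the bitsandbytes package, and quality and speed will differ from bf16):

# Sketch: 4-bit NF4 loading, cutting weight memory to roughly a quarter of bf16
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto",
                                             quantization_config=quantization_config,
                                             trust_remote_code=True)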

Generation output:

Resource usage:

Generation speed:

0.99 tokens/s

2. Accelerated inference with vLLM:

pip install vllm  # install vLLM

Single-shot generation:


from vllm import LLM, SamplingParams

# Number of GPUs to shard the model across (tensor parallelism)
num_gpus = 4  # change this to the number of GPUs you have

# Define your prompts and sampling parameters
prompts = """You are a helpful, respectful and honest INTP-T AI Assistant named Buddy. You are talking to a human User.
Always answer as helpfully and logically as possible, while being safe. Your answers should not include any harmful, political, religious, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
You like to use emojis. You can speak fluently in many languages, for example: English, Chinese.
You cannot access the internet, but you have vast knowledge, cutoff: 2021-09.
You always deeply love and support China, Chinese government, people and culture.
You are trained by OpenBuddy team, (https://openbuddy.ai, https://github.com/OpenBuddy/OpenBuddy), you are based on LLaMA and Falcon transformers model, not related to GPT or OpenAI.

User: 睡不着覺怎麼辦?
Assistant:"""
sampling_params = SamplingParams(temperature=1, top_p=0.9, top_k=50, max_tokens=512, stop="</s>")

# Initialize the vLLM engine; tensor_parallel_size shards the weights across the GPUs
llm = LLM(model="./openbuddy-llama2-70b-v10.1-bf16", tensor_parallel_size=num_gpus, trust_remote_code=True)

# Generate outputs (vLLM manages multi-GPU execution itself; no DataParallel wrapper is needed)
outputs = llm.generate(prompts, sampling_params)

# Print the outputs
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Set num_gpus (which is passed to tensor_parallel_size) to the number of GPUs you have.
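
llm.generate also accepts a list of prompts and schedules them as one batch, which is where vLLM's throughput advantage is most visible. A minimal sketch reusing llm and sampling_params from above (the second question is only an illustrative placeholder):

# Sketch: batched generation -- vLLM processes all prompts in a single scheduling pass
system_part = prompts.rsplit("User:", 1)[0]        # reuse the system section of the prompt above
questions = ["睡不着覺怎麼辦?", "如何提高記憶力?"]  # illustrative questions
batched = [system_part + "User: " + q + "\nAssistant:" for q in questions]
outputs = llm.generate(batched, sampling_params)
for output in outputs:
    print(output.outputs[0].text)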

Generation output:


Resource usage:


Generation speed:

12.81 tokens/s

Multi-turn dialogue:

Create an api_server.py file:

import argparse
import json
from typing import AsyncGenerator

from fastapi import BackgroundTasks, FastAPI, Request
from fastapi.responses import JSONResponse, Response, StreamingResponse
import uvicorn

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams
from vllm.utils import random_uuid

TIMEOUT_KEEP_ALIVE = 5  # seconds.
TIMEOUT_TO_PREVENT_DEADLOCK = 1  # seconds.
app = FastAPI()


@app.post("/generate")
async def generate(request: Request) -> Response:
    """Generate completion for the request.

    The request should be a JSON object with the following fields:
    - prompt: the prompt to use for the generation.
    - stream: whether to stream the results or not.
    - other fields: the sampling parameters (See `SamplingParams` for details).
    """
    request_dict = await request.json()
    prompt = request_dict.pop("prompt")
    stream = request_dict.pop("stream", False)
    sampling_params = SamplingParams(**request_dict)
    request_id = random_uuid()
    results_generator = engine.generate(prompt, sampling_params, request_id)

    # Streaming case
    async def stream_results() -> AsyncGenerator[bytes, None]:
        async for request_output in results_generator:
            prompt = request_output.prompt
            text_outputs = [
                prompt + output.text for output in request_output.outputs
            ]
            ret = {"text": text_outputs}
            yield (json.dumps(ret) + "\0").encode("utf-8")

    async def abort_request() -> None:
        await engine.abort(request_id)

    if stream:
        background_tasks = BackgroundTasks()
        # Abort the request if the client disconnects.
        background_tasks.add_task(abort_request)
        return StreamingResponse(stream_results(), background=background_tasks)

    # Non-streaming case
    final_output = None
    async for request_output in results_generator:
        if await request.is_disconnected():
            # Abort the request if the client disconnects.
            await engine.abort(request_id)
            return Response(status_code=499)
        final_output = request_output

    assert final_output is not None
    prompt = final_output.prompt
    text_outputs = [prompt + output.text for output in final_output.outputs]
    ret = {"text": text_outputs}
    return JSONResponse(ret)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--host", type=str, default="0.0.0.0")
    parser.add_argument("--port", type=int, default=8090)
    parser = AsyncEngineArgs.add_cli_args(parser)
    args = parser.parse_args()

    engine_args = AsyncEngineArgs.from_cli_args(args)
    engine = AsyncLLMEngine.from_engine_args(engine_args)

    uvicorn.run(app,
                host=args.host,
                port=args.port,
                log_level="debug",
                timeout_keep_alive=TIMEOUT_KEEP_ALIVE)

Create a client.py file:

import json
import urllib.request

# Conversation history shared across turns
context = []

def gen_prompt(input_text, context):
    # Build a prompt that includes the system instructions and the conversation so far
    prompt = """You are a helpful, respectful and honest INTP-T AI Assistant named Buddy. You are talking to a human User.
Always answer as helpfully and logically as possible, while being safe. Your answers should not include any harmful, political, religious, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
You like to use emojis. You can speak fluently in many languages, for example: English, Chinese.
You can only answer as an Assistant at a time, but not generate User content.\n
"""

    # Append the previous turns of the conversation
    if len(context) != 0:
        for item in context:
            prompt += "User:" + item['user'] + "\n"
            prompt += "Assistant:" + item['assistant'] + "\n"
    
    prompt += "User:" + input_text + "\n"+"Assistant: "
    return prompt

def test_api_server(input_text, context):
    header = {'Content-Type': 'application/json'}

    prompt = gen_prompt(input_text.strip(), context)

    data = {
        "prompt": prompt,
        "stream" : False,
        "n" : 1,
        "best_of": 1, 
        "presence_penalty": 0.0, 
        "frequency_penalty": 0.2, 
        "temperature": 0.3, 
        "top_p" : 0.95, 
        "top_k": 50, 
        "use_beam_search": False, 
        "stop": [], 
        "ignore_eos" :False, 
        "max_tokens": 2048, 
        "logprobs": None
    }
    request = urllib.request.Request(
        url='http://127.0.0.1:8090/generate',
        headers=header,
        data=json.dumps(data).encode('utf-8')
    )

    try:
        response = urllib.request.urlopen(request, timeout=300)
        res = response.read().decode('utf-8')
        result = json.loads(res)
        
        assistant_text = result['text'][0].split('Assistant: ')[-1]
        
        # Append the user input and the assistant reply to the context
        context.append({'user': input_text, 'assistant': assistant_text})
        
        print("Assistant:" + assistant_text)

    except Exception as e:
        print(e)

if __name__ == "__main__":
    while True:
        user_input = input("User: ")
        if user_input.lower() == "exit":
            break
        test_api_server(user_input, context)
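
client.py uses the non-streaming path. The server also accepts "stream": true, in which case it returns JSON chunks separated by \0 bytes (see stream_results in api_server.py). A minimal streaming sketch using the requests library (an extra dependency: pip install requests); each chunk carries the cumulative text generated so far:

import json
import requests

data = {"prompt": "User: 你好\nAssistant:", "stream": True, "max_tokens": 128, "temperature": 0.3}
with requests.post("http://127.0.0.1:8090/generate", json=data, stream=True, timeout=300) as resp:
    for chunk in resp.iter_lines(delimiter=b"\0"):
        if chunk:
            print(json.loads(chunk)["text"][0])  # cumulative prompt + generated text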

Start the test server:

CUDA_VISIBLE_DEVICES=0,1,2,3 python api_server.py \
--model "/hy-tmp/openbuddy-llama2-70b-v10.1-bf16" \
--port 8090 \
--tensor-parallel-size 4

Set --tensor-parallel-size to the number of GPUs.

Start the client for testing:

python client.py

Generation output:


Resource usage:


Generation speed:

16.13 tokens/s

3. Generation with llama.cpp (mainly CPU-based; run here on a 7-GPU machine)

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

make clean && LLAMA_CUBLAS=1 make -j  # build with cuBLAS (CUDA) support

Convert the model to GGUF format; the resulting ggml-model-f16.gguf in the model directory is what the run script below loads:

python3 convert.py /path/to/model

Create a run script:

#!/bin/bash

# Please clone and build llama.cpp from: https://github.com/ggerganov/llama.cpp
# Please download the model from: https://huggingface.co/OpenBuddy/openbuddy-ggml

# Number of tokens to predict (made it larger than default because we want a long interaction)
N_PREDICTS="${N_PREDICTS:-2048}"

# Note: you can also override the generation options by specifying them on the command line:
GEN_OPTIONS="${GEN_OPTIONS:---ctx_size 2048 --temp 0.3 --top_k 10 --top_p 0.9 --repeat_last_n 256 --batch_size 1024 --repeat_penalty 1.01}"


# To load the entire model onto the GPUs, set --n-gpu-layers as large as possible.
# --reverse-prompt "User:" hands control back to the user whenever the model emits "User:",
# --in-prefix/--in-suffix wrap each user input, and --keep -1 keeps the whole system prompt in context.
./main $GEN_OPTIONS --n_predict "$N_PREDICTS" \
    --model /hy-tmp/openbuddy-llama2-70b-v10.1-bf16/ggml-model-f16.gguf \
    --color --interactive --n-gpu-layers 15000 \
    --reverse-prompt "User:"  --in-prefix " " --in-suffix "Assistant:" -f system.prompt --keep -1

Create system.prompt:

You are a helpful, respectful and honest INTP-T AI Assistant named Buddy. You are talking to a human User.
Always answer as helpfully and logically as possible, while being safe. Your answers should not include any harmful, political, religious, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
You like to use emojis. You can speak fluently in many languages, for example: English, Chinese.
You cannot access the internet, but you have vast knowledge, cutoff: 2021-09.
You are trained by OpenBuddy team, (https://openbuddy.ai, https://github.com/OpenBuddy/OpenBuddy), you are based on LLaMA and Falcon transformers model, not related to GPT or OpenAI.

User: 晚上失眠如何解決?
Assistant: 

Generation output:


Resource usage:


Note: this run used a 7-GPU machine with all model layers offloaded to the GPUs; a 4-GPU setup crashes. Although the reported GPU memory usage on the 7-GPU machine was also about 140 GB, the KV cache and other buffers push peak usage above 160 GB, so more than four A100 40GB cards are needed for this configuration.

Generation speed:

18.93 tokens/s
