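Quickly setting up a ChatGLM2 / Llama 2 runtime environment with text-generation-webui

text-generation-webui is an open-source, Gradio-based web UI for LLMs. It makes it quick to set up an environment for running all kinds of text-generation models.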

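1. Installation

The text-generation-webui README is already quite detailed, so I won't repeat it here. I'll only point out one pitfall you may run into:

Installation hangs while installing peft

Some of the dependencies in requirements.txt have to be fetched from github.com, which is often unreachable from mainland China. Here is what the file looks like: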

aiofiles==23.1.0
fastapi==0.95.2
gradio_client==0.2.5
gradio==3.33.1

accelerate==0.21.0
colorama
datasets
einops
markdown
numpy
pandas
Pillow>=9.5.0
pyyaml
requests
safetensors==0.3.1
scipy
sentencepiece
tensorboard
tqdm
wandb

git+https://github.com/huggingface/peft@4b371b489b9850fd583f204cdf8b5471e098d4e4
git+https://github.com/huggingface/transformers@baf1daa58eb2960248fd9f7c3af0ed245b8ce4af

bitsandbytes==0.41.1; platform_system != "Windows"
https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.1-py3-none-win_amd64.whl; platform_system == "Windows"
https://github.com/PanQiWei/AutoGPTQ/releases/download/v0.4.1/auto_gptq-0.4.1+cu117-cp310-cp310-win_amd64.whl; platform_system == "Windows"
https://github.com/PanQiWei/AutoGPTQ/releases/download/v0.4.1/auto_gptq-0.4.1+cu117-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
https://github.com/jllllll/exllama/releases/download/0.0.10/exllama-0.0.10+cu117-cp310-cp310-win_amd64.whl; platform_system == "Windows"
https://github.com/jllllll/exllama/releases/download/0.0.10/exllama-0.0.10+cu117-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"

# llama-cpp-python without GPU support
llama-cpp-python==0.1.78; platform_system != "Windows"
https://github.com/abetlen/llama-cpp-python/releases/download/v0.1.78/llama_cpp_python-0.1.78-cp310-cp310-win_amd64.whl; platform_system == "Windows"
# llama-cpp-python with CUDA support
https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.1.78+cu117-cp310-cp310-win_amd64.whl; platform_system == "Windows"
https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.1.78+cu117-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"

# GPTQ-for-LLaMa
https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.0/gptq_for_llama-0.1.0+cu117-cp310-cp310-win_amd64.whl; platform_system == "Windows"
https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.0/gptq_for_llama-0.1.0+cu117-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"

# ctransformers
https://github.com/jllllll/ctransformers-cuBLAS-wheels/releases/download/AVX2/ctransformers-0.2.22+cu117-py3-none-any.whl

If the peft line cannot be downloaded during installation, comment that line out. Then open another terminal and run

pip install peft -i  https://pypi.tuna.tsinghua.edu.cn/simple
pip install transformers -i  https://pypi.tuna.tsinghua.edu.cn/simple
pip install accelerate -i  https://pypi.tuna.tsinghua.edu.cn/simple

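to install from a mirror in China, then go back to the original terminal and run pip install -r requirements.txt again. The same approach works for any other dependency that fails to download. Note that this installs the current PyPI releases rather than the exact commits pinned in requirements.txt.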
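The "comment the line out" step can also be scripted. The following is a minimal sketch, not part of text-generation-webui: the helper name comment_out_git_requirements is made up for illustration, and it assumes you want to disable every git+https://github.com line (both peft and transformers in the file above), leaving the plain wheel URLs untouched.

```python
import re

# Matches requirement lines that pip would have to clone from GitHub via git.
GIT_REQUIREMENT = re.compile(r"^\s*git\+https://github\.com/")


def comment_out_git_requirements(text: str) -> str:
    """Return the requirements text with each git+https GitHub line commented out.

    Hypothetical helper for illustration; all other lines are returned unchanged.
    """
    lines = []
    for line in text.splitlines():
        if GIT_REQUIREMENT.match(line):
            lines.append("# " + line)
        else:
            lines.append(line)
    # Preserve a trailing newline if the input had one.
    return "\n".join(lines) + ("\n" if text.endswith("\n") else "")


# Small in-memory demo using two lines from the file shown above.
sample = (
    "accelerate==0.21.0\n"
    "git+https://github.com/huggingface/peft@4b371b489b9850fd583f204cdf8b5471e098d4e4\n"
)
print(comment_out_git_requirements(sample))
```

To apply it to the real file, read requirements.txt into a string, pass it through the function, and write the result back. Keeping a backup copy first is a good idea.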

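2. Startup and downloading models

Run python server.py to start it. On first launch there are no models, so you have to download one yourself.

text-generation-webui can download any model on huggingface.co that is open for download, but the download is slow and tends to get interrupted. I personally recommend downloading manually instead (for example with Git LFS, or from one of the mirrors or cloud drives in China where helpful users have uploaded copies). If you download manually, just put the model under the text-generation-webui/models directory.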
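For the Git LFS route, here is a small sketch that builds the clone command and the destination folder. The helper name build_clone_command, the default webui_root, and the "owner_model" folder naming are all assumptions for illustration; any folder name under models should work. It also assumes git and git-lfs are installed and that git lfs install has been run once.

```python
from pathlib import Path


def build_clone_command(repo_id: str, webui_root: str = "text-generation-webui") -> list[str]:
    """Build a `git clone` command that places a Hugging Face repo under models/.

    Hypothetical helper: the "owner_model" folder name is only a convention
    chosen here, not something the web UI requires.
    """
    target = Path(webui_root) / "models" / repo_id.replace("/", "_")
    return ["git", "clone", f"https://huggingface.co/{repo_id}", target.as_posix()]


cmd = build_clone_command("TheBloke/Llama-2-7B-Chat-GGML")
# Print the command so it can be copied into a terminal.
print(" ".join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) to run it directly. Keep in mind that cloning a repository fetches every file in it, so for repositories that ship many variants of the same model it may be cheaper to download only the file you need.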

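3. Loading a model

3.1 Loading a Llama 2 model

Here I picked the Llama 2 model TheBloke/Llama-2-7B-Chat-GGML from Hugging Face for testing. After selecting it, the UI automatically defaulted to llama.cpp, the C++ loader (note: the C++ implementation gives faster inference).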
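To make the automatic loader choice concrete, here is a purely illustrative guess at the kind of name-based rule that could produce this behavior. It is not text-generation-webui's actual code: the function guess_loader, the GPTQ rule, and the "Transformers" fallback are assumptions. Only the GGML to llama.cpp mapping matches what was observed above.

```python
def guess_loader(model_name: str) -> str:
    """Guess a loader from the model's name (hypothetical heuristic).

    Only the GGML -> llama.cpp case reflects the behavior described in this
    post; the other branches are assumptions added for illustration.
    """
    name = model_name.lower()
    if "ggml" in name:
        return "llama.cpp"
    if "gptq" in name:
        return "AutoGPTQ"
    return "Transformers"


print(guess_loader("TheBloke/Llama-2-7B-Chat-GGML"))
```

If the default choice is not the one you want, the loader can still be changed by hand in the UI before loading the model.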