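Using word embeddings and a vector database in C# to give a large language model (LLM) long-term memory for a private-domain Q&A bot: open-source replacements for the OpenAI API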


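The previous article outlined how word embeddings can give a large language model long-term memory so that it can be used for private-domain Q&A. That approach relied on the OpenAI API both to generate the embeddings and to call the chat model.

For well-known reasons, calling the OpenAI API from mainland China is inconvenient, so this article introduces two open-source replacements, one for embedding generation and one for text generation.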

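As usual, here is a simple topology diagram:

The topology is fairly simple: one backend service for business logic, two AI model services (one for embeddings, one for text generation), and a vector database (still Elasticsearch, "es", here and below). Next, the flow chart:

As the flow chart shows, there are still two phases. In the first phase we prepare the texts that private-domain answers will be drawn from. Each text is sent as a string to the embedding endpoint, and the returned embedding vector is written to the vector store as an es index entry. In the second phase, when the service is live, we call the embedding endpoint on the user's question to get its embedding vector, then use similarity matching in the vector database to retrieve the closest texts. Take the question "How much salt should I add when stir-frying pork with green peppers?": if the vector store contains texts about that dish, it returns one or more of them. The backend then builds a prompt, much as in the previous article, and finally calls the text generation model to answer the question. That completes the flow.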
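The two phases described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used later in the article: `toy_embed` is a hypothetical stand-in for the embedding service (it simply counts characters), the in-memory list stands in for the vector database, and the prompt wording is made up for the example.

```python
import math
from collections import Counter

def toy_embed(text):
    # Hypothetical stand-in for the embedding service: character counts as a sparse vector.
    return Counter(text)

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(texts):
    # Phase 1: embed every private-domain text and store (text, vector) pairs.
    return [(t, toy_embed(t)) for t in texts]

def answer_prompt(index, question):
    # Phase 2: embed the question, retrieve the closest text, and build the prompt.
    q = toy_embed(question)
    best_text, _ = max(index, key=lambda item: cosine(q, item[1]))
    return f"Answer based on the following fact: {best_text}\nQuestion: {question}\nAnswer:"

index = build_index([
    "Add about 3 grams of salt when stir-frying pork with green peppers.",
    "The office is closed on public holidays.",
])
prompt = answer_prompt(index, "How much salt for stir-frying pork with green peppers?")
print(prompt)
```

In the real setup the prompt would be sent to the text generation service, and `toy_embed` would be replaced by an HTTP call to the embedding service.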

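Next, let's look at how to deploy and use these models, and at the C# code involved.

Important: before you start, make sure the deployment machine has either an Nvidia GPU with 16 GB of VRAM or at least 48 GB of RAM. The former runs inference on the GPU, which gives better results at a reasonable generation speed. The latter runs inference on the CPU, which is slow and only suitable for test deployments. For GPU deployment you also need to install CUDA 11.8 and the nvidia-docker2 package for GPU support in Docker; the installation steps are not covered here.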

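First, download the embedding model. The recommended model here is text2vec-large-chinese, which has been fine-tuned on Chinese text and performs well.

Download link: https://huggingface.co/GanymedeNil/text2vec-large-chinese/tree/main

We need three files to build the embedding service: pytorch_model.bin, config.json, and vocab.txt.

Next, in the folder containing the downloaded files, create a file named web.py with the following content: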

from fastapi import FastAPI
from pydantic import BaseModel
from typing import List
from transformers import AutoTokenizer, AutoModel
import torch

app = FastAPI()

# Load the model and tokenizer
model = AutoModel.from_pretrained("/app").half().cuda()
tokenizer = AutoTokenizer.from_pretrained("/app")


# Request body
class Sentence(BaseModel):
    sentence: str


@app.post("/embed")
async def embed(sentence: Sentence):
    # Tokenize the sentence and get the input tensors
    inputs = tokenizer(sentence.sentence, return_tensors='pt', padding=True, truncation=True, max_length=512)

    # Move inputs to GPU
    for key in inputs.keys():
        inputs[key] = inputs[key].to('cuda')

    # Run the model
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the embeddings
    embeddings = outputs.last_hidden_state[0].cpu().numpy()

    # Return the embeddings as a JSON response
    return embeddings.tolist()

The code above is the GPU version of the API. If you have no GPU, use the following code instead:

from fastapi import FastAPI
from pydantic import BaseModel
from typing import List
from transformers import AutoTokenizer, AutoModel
import torch

app = FastAPI()

# Load the model and tokenizer
# (kept in float32: half precision is poorly supported by many CPU kernels)
model = AutoModel.from_pretrained("/app")
tokenizer = AutoTokenizer.from_pretrained("/app")

# Request body
class Sentence(BaseModel):
    sentence: str

@app.post("/embed")
async def embed(sentence: Sentence):
    # Tokenize the sentence and get the input tensors
    inputs = tokenizer(sentence.sentence, return_tensors='pt', padding=True, truncation=True, max_length=512)

    # No need to move inputs to GPU as we are using CPU

    # Run the model
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the embeddings
    embeddings = outputs.last_hidden_state[0].cpu().numpy()

    # Return the embeddings as a JSON response
    return embeddings.tolist()

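Here we use FastAPI, a simple Python web framework, to expose the service. Next, put the downloaded model files and the Python code in the same folder and create a requirements.txt so that dependencies are installed when the image is built. requirements.txt contains: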

torch
transformers
fastapi
uvicorn

The first two are the libraries the model needs, and the last two are for the web service. Next, write a Dockerfile to build the image:

FROM python:3.8-slim-buster

# Set the working directory to /app
WORKDIR /app

# Copy the current directory contents into the container at /app
ADD . /app

# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt

# Settings for uvicorn: serve the `app` object defined in web.py
ENV MODULE_NAME=web 
ENV VARIABLE_NAME=app 
ENV HOST=0.0.0.0 
ENV PORT=80

# Run the application: 
CMD uvicorn ${MODULE_NAME}:${VARIABLE_NAME} --host ${HOST} --port ${PORT}

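Now we can build the image from the above. Run docker build . -t myembed:latest and wait for the build to finish.

Once the image is built, run it locally: docker run -dit --gpus all -p 8080:80 myembed:latest. If you are in a CPU-only environment, omit "--gpus all". Then use Postman to call the endpoint and check that it produces vectors. If all goes well, it returns a nested array, with one vector per token of the input sentence.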
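Because the endpoint returns one vector per token, a caller that wants a single sentence vector has to pool them. A minimal sketch of mean pooling, assuming the response has already been parsed into a list of equal-length lists (the function name `mean_pool` is hypothetical and not part of any library):

```python
def mean_pool(token_vectors):
    # Average the per-token vectors column by column into one sentence vector.
    if not token_vectors:
        raise ValueError("expected at least one token vector")
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]

# Toy example with 3 tokens and 2 dimensions (the real model returns 1024 dimensions).
sentence_vector = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(sentence_vector)
```

Mean pooling is only one option; the C# backend later in this article does the same averaging on its side.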

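Next we set up the large language model endpoint in the same way. Here we use ChatGLM-6B, a relatively mature open-source LLM from China. First create a new folder, then use git to pull the code for its web service: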

git clone https://github.com/THUDM/ChatGLM-6B.git

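Next, download the model weights from https://huggingface.co/THUDM/chatglm-6b/tree/main. Download the eight weight files, pytorch_model-00001-of-00008.bin through pytorch_model-00008-of-00008.bin, and put them in the root of the git checkout. Because the code below loads everything from that directory, the configuration, tokenizer, and model code files from the same repository need to be there as well.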
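Before building the image it can help to confirm that all eight shards actually landed in the folder. A small sketch, assuming only the file naming pattern quoted above; the helper names `expected_shards` and `missing_shards` are hypothetical:

```python
from pathlib import Path

def expected_shards(total=8):
    # File names follow the pattern pytorch_model-0000N-of-0000M.bin.
    return [f"pytorch_model-{i:05d}-of-{total:05d}.bin" for i in range(1, total + 1)]

def missing_shards(folder, total=8):
    # Return the shard names that are not present in the given folder.
    root = Path(folder)
    return [name for name in expected_shards(total) if not (root / name).is_file()]

if __name__ == "__main__":
    missing = missing_shards(".")
    print("all shards present" if not missing else f"missing: {missing}")
```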

Then modify api.py as follows:

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from transformers import AutoTokenizer, AutoModel
import uvicorn, json, datetime
import torch
import asyncio

DEVICE = "cuda"
DEVICE_ID = "0"
CUDA_DEVICE = f"{DEVICE}:{DEVICE_ID}" if DEVICE_ID else DEVICE


def torch_gc():
    if torch.cuda.is_available():
        with torch.cuda.device(CUDA_DEVICE):
            torch.cuda.empty_cache()
            torch.cuda.ipc_collect()

app = FastAPI()

@app.post("/chat", response_class=StreamingResponse)
async def create_item(request: Request):
    global model, tokenizer
    json_post_raw = await request.json()
    json_post = json.dumps(json_post_raw)
    json_post_list = json.loads(json_post)
    prompt = json_post_list.get('prompt')
    history = json_post_list.get('history')
    max_length = json_post_list.get('max_length')
    top_p = json_post_list.get('top_p')
    temperature = json_post_list.get('temperature')
    
    last_response = ''
    async def stream_chat():
        nonlocal last_response,history
        for response, history in model.stream_chat(tokenizer,
                                                prompt,
                                                history=history,
                                                max_length=max_length if max_length else 2048,
                                                top_p=top_p if top_p else 0.7,
                                                temperature=temperature if temperature else 0.95):
            new_part = response[len(last_response):]
            last_response = response
            yield json.dumps(new_part,ensure_ascii=False)
            
    return StreamingResponse(stream_chat(), media_type="text/plain")


if __name__ == '__main__':
    tokenizer = AutoTokenizer.from_pretrained("/app", trust_remote_code=True)
    model = AutoModel.from_pretrained("/app", trust_remote_code=True).half().cuda()
    model.eval()
    uvicorn.run(app, host='0.0.0.0', port=80, workers=1)

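Likewise, if you are in a CPU-only environment, change the following line. (Note: if you have a GPU with less than 16 GB of VRAM, consider 8-bit or 4-bit quantization instead; see the readme.md at https://github.com/THUDM/ChatGLM-6B.)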

model = AutoModel.from_pretrained("/app", trust_remote_code=True).half().cuda()

to

model = AutoModel.from_pretrained("/app", trust_remote_code=True).float()
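As a rough sanity check on the hardware requirements and the quantization options mentioned above, the memory taken by the weights alone can be estimated from the parameter count and the precision. This is a back-of-the-envelope sketch: the 6 billion figure is an assumed round number for a "6B" model, and real usage is higher because activations, the generation cache, and framework overhead are not counted.

```python
def weight_memory_gib(num_params, bits_per_param):
    # Bytes needed for the weights alone, converted to GiB.
    return num_params * bits_per_param / 8 / 1024 ** 3

PARAMS = 6_000_000_000  # assumed round figure for a "6B" model

for label, bits in [("float32 (CPU)", 32), ("float16 (GPU)", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: about {weight_memory_gib(PARAMS, bits):.1f} GiB for weights")
```

The estimate shows why half precision fits on a 16 GB card while full precision on the CPU needs considerably more system memory.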

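The rest of the process is similar to deploying the embedding model. The project already includes a requirements.txt, so we only need to create a Dockerfile like the one for the embedding service and build the image.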

FROM python:3.8-slim-buster
WORKDIR /app
ADD . /app
RUN pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
CMD ["python", "api.py"]

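After building the image (tagged myllm:latest here), start it for testing with docker run -dit --gpus all -p 8081:80 myllm:latest. Again, use Postman to call the endpoint. If all goes well you should see the generated text streamed back. Ignore any garbled parts; they are emoji that were not parsed correctly.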
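One detail worth knowing when consuming this endpoint: the modified api.py yields each new fragment through json.dumps, so the response body is a run of JSON string literals written back to back with no separator. A minimal sketch of how a client could split such a buffer, assuming the text received so far is held in one string (the helper name `decode_chunks` is hypothetical):

```python
import json

def decode_chunks(buffer):
    # Split a buffer of back-to-back JSON string literals into decoded pieces.
    decoder = json.JSONDecoder()
    pieces, pos = [], 0
    while pos < len(buffer):
        try:
            value, pos = decoder.raw_decode(buffer, pos)
        except json.JSONDecodeError:
            break  # incomplete literal at the end: keep it until more data arrives
        pieces.append(value)
    return pieces, buffer[pos:]

# Two complete fragments followed by one that has not fully arrived yet.
pieces, rest = decode_chunks('"Hel""lo""wor')
print("".join(pieces), "| pending:", rest)
```

The leftover text is returned so that the caller can prepend it to the next chunk read from the network.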

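Next we build the C# backend that connects these services. Here I use a local static dictionary to simulate storing the embedding vectors and finding similar texts by cosine similarity, rather than repeating how to use es as the vector store; the two give essentially the same results. If you are interested, search for the NEST library and cosine-similarity search in es.

The core code is below. It provides two endpoints: the first takes text from the frontend, computes its embedding, and stores it; the second answers questions.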

/// Simulates the vector store (static so that it survives across requests)
private static Dictionary<string, List<double>> MemoryList = new Dictionary<string, List<double>>();
/// Computes the cosine similarity of two vectors
double Compute(List<double> vector1, List<double> vector2) => vector1.Zip(vector2, (a, b) => a * b).Sum() / (Math.Sqrt(vector1.Sum(a => a * a)) * Math.Sqrt(vector2.Sum(b => b * b)));
...
    [HttpPost("/api/save")]
    public async Task<int> SaveMemory(string str)
    {
        if (!string.IsNullOrEmpty(str))
        {
            // One memory entry per line of the submitted text
            foreach (var x in str.Split("\n").ToList())
            {
                if (!MemoryList.ContainsKey(x))
                {
                    MemoryList.Add(x, await GetEmbeding(x));
                }
            }
            }
        }
        return MemoryList.Count; 
    }
...
    [HttpPost("/api/chat")]
    public async IAsyncEnumerable<string> SendData(string content)
    {
        if (!string.IsNullOrEmpty(content))
        {
            var userquestionEmbeding = await GetEmbeding(content);
            var prompt = "";
            if (MemoryList.Any())
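            {  // Take only the closest entry from the vector store; adjust as needed, e.g. apply a similarity threshold or return several entries
                prompt = MemoryList.OrderByDescending(x => Compute(userquestionEmbeding, x.Value)).FirstOrDefault().Key;
                prompt = $"You are a Q&A assistant. Answer the question based on the following facts: {prompt}. The user's question is: {content}. Do not make up facts. Please answer:";
            }
            else
                prompt = content;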
            await foreach (var item in ChatStream(prompt))
            {
                yield return item;
            }
        }
    }
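The comment in the chat endpoint mentions that a similarity threshold or several results could be used instead of only the top match. A minimal sketch of that variation, written in Python for brevity; the function names, the default `k`, and the default `threshold` are illustrative assumptions and would need tuning against real embeddings:

```python
import math

def cosine(a, b):
    # Cosine similarity of two dense vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_matches(memory, query_vector, k=3, threshold=0.5):
    # Score every stored text, drop weak matches, and keep the k best.
    scored = [(cosine(query_vector, vec), text) for text, vec in memory.items()]
    kept = [item for item in scored if item[0] >= threshold]
    return [text for _, text in sorted(kept, reverse=True)[:k]]

# Toy 2-dimensional vectors standing in for real embeddings.
memory = {"salt": [1.0, 0.0], "sugar": [0.8, 0.6], "holiday": [0.0, 1.0]}
matches = top_matches(memory, [1.0, 0.1], k=2, threshold=0.5)
print(matches)
```

When nothing clears the threshold the list is empty, which gives the backend a natural signal to fall back to the plain question without any injected facts.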

We also need two helper functions that use HttpClient to call the AI model APIs:

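async IAsyncEnumerable<string> ChatStream(string x)
    {
        HttpClient hc = new HttpClient();
        var reqcontent = new StringContent(System.Text.Json.JsonSerializer.Serialize(new { prompt = x }));
        reqcontent.Headers.ContentType = new System.Net.Http.Headers.MediaTypeHeaderValue("application/json");
        var request = new HttpRequestMessage(HttpMethod.Post, "http://192.168.1.100:8081/chat") { Content = reqcontent };
        // ResponseHeadersRead returns as soon as the headers arrive, so the body can be read while it is still streaming
        var response = await hc.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);
        if (response.IsSuccessStatusCode)
        {
            var responseStream = await response.Content.ReadAsStreamAsync();
            using (var reader = new StreamReader(responseStream, Encoding.UTF8))
            {
                // The server sends no line breaks, so read whatever characters are available instead of whole lines
                var buffer = new char[256];
                int read;
                while ((read = await reader.ReadAsync(buffer, 0, buffer.Length)) > 0)
                {
                    yield return new string(buffer, 0, read);
                }
            }
        }
    }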
    async Task<List<double>> GetEmbeding(string x)
    {
        HttpClient hc = new HttpClient();
        var reqcontent = new StringContent(System.Text.Json.JsonSerializer.Serialize(new { sentence = x }));
        reqcontent.Headers.ContentType = new System.Net.Http.Headers.MediaTypeHeaderValue("application/json");
        var result = await hc.PostAsync("http://192.168.1.100:8080/embed", reqcontent);
        var content = await result.Content.ReadAsStringAsync();
        var embed = System.Text.Json.JsonSerializer.Deserialize<List<List<double>>>(content);
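        var embedresult = new List<double>();
        // Mean-pool the per-token vectors into one 1024-dimensional sentence vector
        for (var i = 0; i < 1024; i++)
        {
            double sum = 0;
            foreach (List<double> sublist in embed)
            {
                sum += (sublist[i]);
            }
            // Divide by the number of tokens, not by the vector dimension
            embedresult.Add(sum / embed.Count);
        }
        return embedresult;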
    }

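Now let's test the result. Without any memory, the model makes up an answer when asked a question:

After adding several memories to the vector store and asking again, the model answers largely correctly.

That is all for this post. Compared with the previous article, which relied on the OpenAI API, a locally deployed setup should suit most people's situation better.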
