Trying Out Image Content Recognition with Semantic Kernel

Preface

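A few days ago, while browsing devblogs.microsoft.com, I came across an article titled Image to Text with Semantic Kernel and HuggingFace. It describes how to use Semantic Kernel together with HuggingFace to recognize the content of an image. Note that this is image content recognition (image captioning), not OCR: the model produces a rough description of the main content of a picture. I was curious enough to try it, and this post records that experience.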

Example

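Without further ado, let's go straight to the code. According to the documentation, building an application with HuggingFace ImageToText requires the following packages: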

Microsoft.SemanticKernel
Microsoft.SemanticKernel.Connectors.HuggingFace

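The first package is Semantic Kernel itself, which provides the basic building blocks for AI applications. The second is the HuggingFace connector, which wraps the HuggingFace API so that HuggingFace models are easy to call. Note that this package is a prerelease, so when adding it in Visual Studio you need to tick "Include prerelease". Usage is simple, as the following code shows: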

var kernel = Kernel.CreateBuilder().AddHuggingFaceImageToText("Salesforce/blip-image-captioning-base").Build();
IImageToTextService service = kernel.GetRequiredService<IImageToTextService>();
var imageBinary = File.ReadAllBytes(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "demo.jpg"));
var imageContent = new ImageContent(imageBinary) { MimeType = "image/jpeg" };
var textContent = await service.GetTextContentAsync(imageContent);
Console.WriteLine($"已識別圖片中描述的內容: {textContent.Text}");

The code is simple, so let's run it. It fails immediately with the following error:

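Microsoft.SemanticKernel.HttpOperationException: "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (api-inference.huggingface.co:443)"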

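The cause is simple: my machine cannot reach huggingface directly, which would need a different network route to fix. It looks like the connector requests api-inference.huggingface.co:443 by default. I checked the source (HuggingFaceClient.cs#L41) and confirmed that this is the case: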

internal sealed class HuggingFaceClient
{
    private readonly IStreamJsonParser _streamJsonParser;
    private readonly string _modelId;
    private readonly string? _apiKey;
    private readonly Uri? _endpoint;
    private readonly string _separator;
    private readonly HttpClient _httpClient;
    private readonly ILogger _logger;

    internal HuggingFaceClient(
        string modelId,
        HttpClient httpClient,
        Uri? endpoint = null,
        string? apiKey = null,
        IStreamJsonParser? streamJsonParser = null,
        ILogger? logger = null)
    {
        Verify.NotNullOrWhiteSpace(modelId);
        Verify.NotNull(httpClient);
        // Default request endpoint
        endpoint ??= new Uri("https://api-inference.huggingface.co");
        this._separator = endpoint.AbsolutePath.EndsWith("/", StringComparison.InvariantCulture) ? string.Empty : "/";
        this._endpoint = endpoint;
        this._modelId = modelId;
        this._apiKey = apiKey;
        this._httpClient = httpClient;
        this._logger = logger ?? NullLogger.Instance;
        this._streamJsonParser = streamJsonParser ?? new TextGenerationStreamJsonParser();
    }
}

api-inference.huggingface.co is only the default. To send requests somewhere else, you implement your own API and pass its address as the endpoint, so that Semantic Kernel calls it instead.

A Workaround

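As noted above, I cannot reach the huggingface API. I also don't much like the online approach, because it depends heavily on the stability of a third-party service. I prefer something I can deploy locally, so that neither the network nor availability is a concern. That led to a workaround: could I implement the API myself and have Semantic Kernel call it? The answer is yes.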

The blip-image-captioning-base Model

The example above uses Salesforce/blip-image-captioning-base as its ImageToText model, and we can download this model ourselves. Huggingface itself is hard to reach from here, but there is a mirror site in China, https://hf-mirror.com/. Find the model Salesforce/blip-image-captioning-base there, open the Files and versions tab, and download all of the files into a local folder; the total is roughly 1.84 GB. I put them in D:\Users\User\blip-image-captioning-base.

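The model has no special hardware requirements; it runs on my PC with 16 GB of RAM and an i5 processor. Next, let's call the model. It is compatible with the transformers framework, so calling it is straightforward: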

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("D:\\Users\\User\\blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("D:\\Users\\User\\blip-image-captioning-base")

img_url = '01f8115545963d0000019ae943aaad.jpg@1280w_1l_2o_100sh.jpg'
raw_image = Image.open(img_url).convert('RGB')

inputs = processor(raw_image, return_tensors="pt")
out = model.generate(**inputs)
en_text = processor.decode(out[0], skip_special_tokens=True)
print(f'已識別圖片中描述的內容：{en_text}')

I ran it against a picture on my local disk.

Running the code prints the following:

已識別圖片中描述的內容：a kitten is standing on a tree stump

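The description broadly matches the picture, so the results for simple images look decent. The one drawback is that the caption is in English, which is not ideal for Chinese readers, so the next step is to translate the English into Chinese.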

The opus-mt-en-zh Model

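As we saw, blip-image-captioning-base works reasonably well, but it returns English, which is inconvenient for anyone whose English is below CET-6 level. So the English needs to be translated into Chinese. I did not want to call a translation API, so I again looked for a model. After some searching on Bing, the frequently recommended opus-mt-en-zh model looked promising, so I gave it a try. Download it from hf-mirror.com into a local folder in the same way as blip-image-captioning-base. It is about 1.41 GB and also runs on CPU. I downloaded it to D:\Users\User\opus-mt-en-zh.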

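As before, let's call the model and see how it does. The model's huggingface repository does not include a usage example, so I looked at two similar questions on stackoverflow for reference.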

Those references show that, helpfully, this model is also compatible with the transformers framework, so calling it is very simple. Let's feed it the English caption from above:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# opus-mt-en-zh is an encoder-decoder (Marian) model, so load it with AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("D:\\Users\\User\\opus-mt-en-zh")
tokenizer = AutoTokenizer.from_pretrained("D:\\Users\\User\\opus-mt-en-zh")
# English source text
en_text = 'a kitten is standing on a tree stump'

# Tokenize the text and generate the translation
encoded = tokenizer([en_text], return_tensors="pt")
translation = model.generate(**encoded)
# Decode the first (and only) result into Chinese text
zh_text = tokenizer.batch_decode(translation, skip_special_tokens=True)[0]
print(f'已識別圖片中描述的內容：\r\n英文：{en_text}\r\n中文：{zh_text}')

Running the code prints the following:

已識別圖片中描述的內容：
英文：a kitten is standing on a tree stump
中文：一隻小貓站在樹樁上

That is much easier to read, with no translation tool needed. That covers the models; next, let's see how to wire them together.

Integrating with Microsoft.SemanticKernel.Connectors.HuggingFace

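Having tried out the captioning model and the translation model, let's see how to use Microsoft.SemanticKernel.Connectors.HuggingFace with our local models. We learned above that the connector calls the model over HTTP, so the path is clear: once we know the request path, the request parameters, and the response format, we can write our own endpoint that mimics it. That means reading the relevant Semantic Kernel code. The core class is HuggingFaceClient; here is its GenerateTextAsync method together with the helpers that build requests: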

public async Task<IReadOnlyList<TextContent>> GenerateTextAsync(
        string prompt,
        PromptExecutionSettings? executionSettings,
        CancellationToken cancellationToken)
{
	string modelId = executionSettings?.ModelId ?? this._modelId;
	var endpoint = this.GetTextGenerationEndpoint(modelId);
	var request = this.CreateTextRequest(prompt, executionSettings);
	using var httpRequestMessage = this.CreatePost(request, endpoint, this._apiKey);

	string body = await this.SendRequestAndGetStringBodyAsync(httpRequestMessage, cancellationToken)
		.ConfigureAwait(false);

	var response = DeserializeResponse<TextGenerationResponse>(body);
	var textContents = GetTextContentFromResponse(response, modelId);

	return textContents;
}

//Builds the request URL
private Uri GetTextGenerationEndpoint(string modelId)
	=> new($"{this._endpoint}{this._separator}models/{modelId}");

private HttpRequestMessage CreateImageToTextRequest(ImageContent content, PromptExecutionSettings? executionSettings)
{
	var endpoint = this.GetImageToTextGenerationEndpoint(executionSettings?.ModelId ?? this._modelId);

	var imageContent = new ByteArrayContent(content.Data?.ToArray());
	imageContent.Headers.ContentType = new(content.MimeType);

	var request = new HttpRequestMessage(HttpMethod.Post, endpoint)
	{
		Content = imageContent
	};

	this.SetRequestHeaders(request);

	return request;
}

private Uri GetImageToTextGenerationEndpoint(string modelId)
	=> new($"{this._endpoint}{this._separator}models/{modelId}");

The code above gives us everything we need to define our own endpoint:

First, the request path. GetTextGenerationEndpoint and GetImageToTextGenerationEndpoint both build the URL as <service address>/models/<model id>. For the Salesforce/blip-image-captioning-base model used above, the path is models/Salesforce/blip-image-captioning-base.
Second, CreateImageToTextRequest shows that the request body is a ByteArrayContent whose Content-Type is the image's MIME type, image/jpeg in our case. In other words, the image bytes go into the request body and are POSTed to the service.
Third, the TextGenerationResponse type tells us the shape of the response.
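The path rule from the first point can be sketched in Python. This is a simplified illustration and not the connector's code: build_model_url is a hypothetical helper, and it only checks the endpoint string for a trailing slash rather than reproducing how .NET's Uri normalizes addresses.

```python
DEFAULT_ENDPOINT = "https://api-inference.huggingface.co"


def build_model_url(model_id: str, endpoint: str = DEFAULT_ENDPOINT) -> str:
    """Join the endpoint and the model id as '<endpoint>/models/<model_id>'."""
    # Add a separator only when the endpoint does not already end with "/".
    separator = "" if endpoint.endswith("/") else "/"
    return f"{endpoint}{separator}models/{model_id}"


if __name__ == "__main__":
    # The local service used later in this post is assumed to be on port 8000.
    print(build_model_url("Salesforce/blip-image-captioning-base", "http://127.0.0.1:8000"))
```

Whatever address is passed as the endpoint, a self-hosted service therefore has to answer POST requests on /models/<model id>.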

Here is the definition of the TextGenerationResponse class:

internal sealed class TextGenerationResponse : List<GeneratedTextItem>
{
    internal sealed class GeneratedTextItem
    {
        [JsonPropertyName("generated_text")]
        public string? GeneratedText { get; set; }
    }
}

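The response is simple: an array of objects with a generated_text field, which in JSON is [{"generated_text":"recognition result"}]. What remains is to wrap the models in an HTTP endpoint that Microsoft.SemanticKernel.Connectors.HuggingFace can call. I chose Python's FastAPI web framework to build the web API; any other framework works as long as the request and response formats match. The combined service looks like this: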

import io
import uvicorn
from fastapi import FastAPI, Request
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration, AutoTokenizer, AutoModelForSeq2SeqLM

app = FastAPI()

# Image captioning model
processor = BlipProcessor.from_pretrained("D:\\Users\\User\\blip-image-captioning-base")
blipModel = BlipForConditionalGeneration.from_pretrained("D:\\Users\\User\\blip-image-captioning-base")

# English-to-Chinese translation model (encoder-decoder, hence AutoModelForSeq2SeqLM)
tokenizer = AutoTokenizer.from_pretrained("D:\\Users\\User\\opus-mt-en-zh")
opusModel = AutoModelForSeq2SeqLM.from_pretrained("D:\\Users\\User\\opus-mt-en-zh")

# Endpoint on the same path the connector requests: /models/<model id>
@app.post("/models/Salesforce/blip-image-captioning-base", summary="Image captioning")
async def blip_image_captioning_base(request: Request):
    # Read the raw request body (the image bytes)
    request_object_content: bytes = await request.body()
    # Decode the bytes into an RGB image
    raw_image = Image.open(io.BytesIO(request_object_content)).convert('RGB')

    # Generate the English caption
    inputs = processor(raw_image, return_tensors="pt")
    out = blipModel.generate(**inputs)
    en_text = processor.decode(out[0], skip_special_tokens=True)

    # Translate English to Chinese
    encoded = tokenizer([en_text], return_tensors="pt")
    translation = opusModel.generate(**encoded)
    zh_text = tokenizer.batch_decode(translation, skip_special_tokens=True)[0]
    # Return the shape the connector expects: [{"generated_text": "..."}]
    return [{"generated_text": zh_text}]


if __name__ == '__main__':
    # Start the FastAPI app; "snownlpdemo" must be the name of the module (file) that defines `app`
    uvicorn.run(app="snownlpdemo:app", host="0.0.0.0", port=8000, reload=True)

The service listens on port 8000. Once it has started, change the Microsoft.SemanticKernel.Connectors.HuggingFace code as follows:

//Pass the address of the FastAPI service we just built
var kernel = Kernel.CreateBuilder().AddHuggingFaceImageToText("Salesforce/blip-image-captioning-base", new Uri("http://127.0.0.1:8000")).Build();
IImageToTextService service = kernel.GetRequiredService<IImageToTextService>();
var imageBinary = File.ReadAllBytes(Path.Combine(Directory.GetCurrentDirectory(), "01f8115545963d0000019ae943aaad.jpg@1280w_1l_2o_100sh.jpg"));
var imageContent = new ImageContent(imageBinary) { MimeType = "image/jpeg" };
var textContent = await service.GetTextContentAsync(imageContent);
Console.WriteLine($"已識別圖片中描述的內容: {textContent.Text}");

That completes the change. Note that the FastAPI service must be up and running before you start the dotnet project. Running it gives the following result:

已識別圖片中描述的內容: 一隻小貓站在樹樁上
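If the .NET call fails, it can help to check the FastAPI service on its own. The following is a minimal sketch using only Python's standard library, assuming the service above is listening on 127.0.0.1:8000. The names build_request, parse_caption and caption_image are illustrative and not part of any library.

```python
import json
import urllib.request

# Assumed address of the local FastAPI service from this post.
SERVICE_URL = "http://127.0.0.1:8000/models/Salesforce/blip-image-captioning-base"


def build_request(image_bytes: bytes, mime_type: str = "image/jpeg",
                  url: str = SERVICE_URL) -> urllib.request.Request:
    """Build a POST whose body is the raw image bytes, like the connector's request."""
    return urllib.request.Request(
        url, data=image_bytes, headers={"Content-Type": mime_type}, method="POST"
    )


def parse_caption(body: bytes) -> str:
    """Extract generated_text from a body shaped like [{"generated_text": "..."}]."""
    items = json.loads(body.decode("utf-8"))
    return items[0]["generated_text"]


def caption_image(path: str) -> str:
    """Send one local image file to the service and return its caption."""
    with open(path, "rb") as f:
        request = build_request(f.read())
    with urllib.request.urlopen(request) as response:
        return parse_caption(response.read())
```

With the service running, caption_image("demo.jpg") should return the translated caption for that file. If this works but the .NET side does not, the problem is more likely in the endpoint address or model id passed to AddHuggingFaceImageToText.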

Turning It into a Plugin

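Calling the service directly like this is rather rigid. Anyone familiar with Semantic Kernel knows that it supports custom plugins: based on the prompt, the model decides which plugin to call, and that in turn calls our own endpoint. This is a very practical feature that makes Semantic Kernel more flexible and extends what the AI can do, since it can call whatever API or service we want. So let's define a plugin that wraps the image recognition, which can then be invoked the Semantic Kernel way. The plugin is defined as follows: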

public class ImageToTextPlugin
{
    private IImageToTextService _service;
    public ImageToTextPlugin(IImageToTextService service)
    {
        _service = service;
    }

    [KernelFunction]
    [Description("根據圖片路徑分析圖片內容")]
    public async Task<string> GetImageContent([Description("圖片路徑")] string imagePath)
    {
        var imageBinary = File.ReadAllBytes(imagePath);
        var imageContent = new ImageContent(imageBinary) { MimeType = "image/jpeg" };
        var textContent = await _service.GetTextContentAsync(imageContent);
        return $"圖片[{imagePath}]分析內容爲:{textContent.Text!}";
    }
}

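Pay attention to the Description attributes on the method and on its parameter. The Description on GetImageContent is what gets sent to the model to describe the function, so that a matching prompt can trigger a call to it. The Description on the imagePath parameter tells OpenAI how to extract the corresponding argument from the prompt. Now let's see how to use the plugin: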

using HttpClient httpClient = new HttpClient(new RedirectingHandler());
var executionSettings = new OpenAIPromptExecutionSettings()
{
    ToolCallBehavior = ToolCallBehavior.EnableKernelFunctions,
    Temperature = 1.0
};
var builder = Kernel.CreateBuilder().AddHuggingFaceImageToText("Salesforce/blip-image-captioning-base", new Uri("http://127.0.0.1:8000"));
var kernel = builder.Build();
ImageToTextPlugin imageToTextPlugin = new ImageToTextPlugin(kernel.GetRequiredService<IImageToTextService>());
kernel.Plugins.AddFromObject(imageToTextPlugin);

var chatCompletionService = new OpenAIChatCompletionService("gpt-3.5-turbo-0125", "你的apiKey", httpClient: httpClient);

Console.WriteLine("現在你可以開始和我聊天了，輸入quit退出。等待你的問題：");
do
{
    var prompt = Console.ReadLine();
    if (!string.IsNullOrWhiteSpace(prompt))
    {
        if (prompt.ToLowerInvariant() == "quit")
        {
            Console.WriteLine("非常感謝！下次見。");
            break;
        }
        else
        {
            var history = new ChatHistory();
            history.AddUserMessage(prompt);
            //Call the GPT chat completion API
            var result = await chatCompletionService.GetChatMessageContentAsync(history,
                        executionSettings: executionSettings,
                        kernel: kernel);
            //Check whether GPT's reply asks for a plugin call
            var functionCall = ((OpenAIChatMessageContent)result).GetOpenAIFunctionToolCalls().FirstOrDefault();
            if (functionCall != null)
            {
                kernel.Plugins.TryGetFunctionAndArguments(functionCall, out KernelFunction? pluginFunction, out KernelArguments? arguments);
                var content = await kernel.InvokeAsync(pluginFunction!, arguments);
                Console.WriteLine(content);
            }
            else
            {
                //Not a plugin call, so print the reply directly
                Console.WriteLine(result.Content);
            }
        }
    }
} while (true);

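Note the custom RedirectingHandler. If you are not calling OpenAI's API directly but a self-hosted or proxied OpenAI-compatible endpoint, you need to define your own HttpClientHandler that rewrites the address of the GPT service.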

public class RedirectingHandler : HttpClientHandler
{
    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        request.RequestUri = new UriBuilder(request.RequestUri!) { Scheme = "http", Host = "你的服務地址", Path= "/v1/chat/completions" }.Uri;
        return base.SendAsync(request, cancellationToken);
    }
}

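With this in place, our custom plugin can be called during a conversation with GPT: when we enter a relevant prompt, the OpenAI API uses the prompt and the plugin metadata to decide which plugin to call. I tried it with a few local pictures and the results were decent; it picks out the rough content of each image.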

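This is far more flexible, since local functionality becomes available in the middle of a conversation. Plugins make Semantic Kernel considerably richer. The mechanism behind them is fairly simple: it relies on the tool-calling capability of the OpenAI chat API, and all we do is define the plugin and its descriptions. For the example above, you can capture the outgoing request with Fiddler or Charles. It is an ordinary HTTP request whose body looks like this: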

{
    "messages": [
        {
            "content": "Assistant is a large language model.",
            "role": "system"
        },
        {
            "content": "請幫我分析這張圖片的內容D:\\Software\\AI.Lossless.Zoomer-2.1.0-x64\\Release\\output\\20200519160906.png",
            "role": "user"
        }
    ],
    "temperature": 1,
    "top_p": 1,
    "n": 1,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "model": "gpt-3.5-turbo-0125",
    "tools": [
        {
            "function": {
                "name": "ImageToTextPlugin-GetImageContent",
                "description": "根據圖片路徑分析圖片內容",
                "parameters": {
                    "type": "object",
                    "required": [
                        "imagePath"
                    ],
                    "properties": {
                        "imagePath": {
                            "type": "string",
                            "description": "圖片路徑"
                        }
                    }
                }
            },
            "type": "function"
        }
    ],
    "tool_choice": "auto"
}
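To make the shape of the "tools" entry in this request concrete, here is a minimal Python sketch that derives a similar entry from a function's signature. It is only an illustration of the idea, not how Semantic Kernel generates it: build_tool and get_image_content are hypothetical names, and every parameter is assumed to be a string.

```python
import inspect
import json


def build_tool(plugin_name: str, func, description: str, param_descriptions: dict) -> dict:
    """Describe a Python function in the same shape as a "tools" entry."""
    properties = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        # Every parameter is treated as a string to keep the sketch short.
        properties[name] = {"type": "string", "description": param_descriptions.get(name, "")}
        # Parameters without a default value are reported as required.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "type": "function",
        "function": {
            "name": f"{plugin_name}-{func.__name__}",
            "description": description,
            "parameters": {"type": "object", "required": required, "properties": properties},
        },
    }


def get_image_content(imagePath: str) -> str:
    """Hypothetical stand-in for the plugin method; no model is called here."""
    return f"caption for {imagePath}"


tool = build_tool("ImageToTextPlugin", get_image_content,
                  "根據圖片路徑分析圖片內容", {"imagePath": "圖片路徑"})
print(json.dumps(tool, ensure_ascii=False, indent=2))
```

The printed entry has the same structure as the captured one, except that the function name follows the Python function (get_image_content) rather than the C# method.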

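The request sent to OpenAI's /v1/chat/completions endpoint gives a rough picture of how this works. Semantic Kernel scans the metadata of the plugins we defined (the Class-Method name, the method description, and the parameter descriptions) and puts it into the request JSON. The text in our Description attributes is what the model uses to match a prompt to a specific plugin. Now let's look at the response from this endpoint: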

{
    "id": "chatcmpl-996IuJbsTrXHcHAM3dqtguwNi9M3Z",
    "object": "chat.completion",
    "created": 1711956212,
    "model": "gpt-35-turbo",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": null,
                "tool_calls": [
                    {
                        "id": "call_4aN9xUhly2cEbNmzRcIh1it0",
                        "type": "function",
                        "function": {
                            "name": "ImageToTextPlugin-GetImageContent",
                            "arguments": "{\"imagePath\":\"D:\\\\Software\\\\AI.Lossless.Zoomer-2.1.0-x64\\\\Release\\\\output\\\\20200519160906.png\"}"
                        }
                    }
                ]
            },
            "finish_reason": "tool_calls"
        }
    ],
    "usage": {
        "prompt_tokens": 884,
        "completion_tokens": 49,
        "total_tokens": 933
    },
    "system_fingerprint": "fp_2f57f81c11"
}

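The OpenAI API returns the plugin it selected: it tells us to call ImageToTextPlugin-GetImageContent, with the arguments {\"imagePath\":\"D:\\\\Software\\\\AI.Lossless.Zoomer-2.1.0-x64\\\\Release\\\\output\\\\20200519160906.png\"}. This is the result of GPT's analysis, and Semantic Kernel uses it to call our local plugin and do the actual work. GPT's role here is this: we submit the plugin metadata with the request, GPT uses the prompt and the metadata to work out which plugin to call and extracts the arguments, and we then invoke the local plugin based on what it returns.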
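The dispatch step described here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Semantic Kernel's implementation: PLUGINS, dispatch_tool_call and get_image_content are hypothetical names, the sample response is trimmed to the fields that matter, and the image path in it is made up.

```python
import json


def get_image_content(imagePath: str) -> str:
    """Hypothetical stand-in for the plugin method; no model is called here."""
    return f"caption for {imagePath}"


# Local registry keyed by (plugin name, function name).
PLUGINS = {("ImageToTextPlugin", "GetImageContent"): get_image_content}


def dispatch_tool_call(response: dict):
    """Run the first tool call in a chat completion response, or return its text."""
    message = response["choices"][0]["message"]
    tool_calls = message.get("tool_calls") or []
    if not tool_calls:
        # No plugin was selected, so the reply is ordinary text.
        return message.get("content")
    function = tool_calls[0]["function"]
    # The name has the form "<plugin>-<function>".
    plugin_name, _, function_name = function["name"].partition("-")
    # The arguments arrive as a JSON-encoded string, not as an object.
    arguments = json.loads(function["arguments"])
    return PLUGINS[(plugin_name, function_name)](**arguments)


sample = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "type": "function",
                "function": {
                    "name": "ImageToTextPlugin-GetImageContent",
                    "arguments": "{\"imagePath\":\"D:\\\\pics\\\\cat.png\"}",
                },
            }],
        }
    }]
}
print(dispatch_tool_call(sample))
```

The point to notice is that the model never runs anything itself. It only names a function and supplies arguments, and the local code decides whether and how to execute the call.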

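Note that, in my testing so far, only the chat APIs from OpenAI and Azure OpenAI support plugins. I tried several Chinese models, including ERNIE Bot (文心一言), iFlytek Spark (訊飛星火), Tongyi Qianwen (通義千問), and Baichuan (百川), and none of them did, at least not when accessed through OneApi. Perhaps I was doing something wrong.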

Reference Links

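These are some of the links I consulted while researching this, listed for reference. They cover learning material, troubleshooting, and finding resources. In unfamiliar territory, you need something to point the way.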

Summary

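This post started from an article I saw on devblogs that looked interesting, so I tried it hands-on. When I ran into a problem, I deployed the models locally, and ended up with Microsoft.SemanticKernel.Connectors.HuggingFace calling local models for image content recognition. Finally I wrapped it in a plugin, so that Semantic Kernel can reach the local models through a plugin call. Small models like these, which run locally and do one specific job, are fun to work with: they are not large, they run on an ordinary PC, and they suit beginners or anyone who is curious.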

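I always encourage people to engage with and learn new technology. That does not mean studying everything in depth; our energy is limited and cannot all go into these areas. But we should at least stay curious, know what these technologies are, and understand their basic principles, so that when the day comes to apply them we are better prepared. Even if we never become experts in a field, what we know about it becomes part of how we think and gives us more options when solving problems. So don't be afraid to try new things. Staying curious and willing to learn is the key to continued progress.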
