Optical Character Recognition with the Tesseract-OCR Engine Integrated in EmguCV

Original post

2020-06-15 05:33

Source code: https://github.com/tesseract-ocr/tesseract

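Overview: Tesseract was originally developed at Hewlett-Packard and was open-sourced in 2005; Google took over its maintenance in 2006. It is a solid engine for optical character recognition, and it is open source. The project ships detailed documentation that is worth reading closely, since you may find something unexpectedly useful there. EmguCV also integrates the engine, which is good news for anyone working in C#. The resource files have to be downloaded from the official site, and the download is quite slow; if you need them, they are also available at the link below: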
https://download.csdn.net/download/IT_BOY__/12009964

Core calling code:

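using System;
using System.IO;
using System.Xml;
using Emgu.CV;
using Emgu.CV.CvEnum;
using Emgu.CV.OCR;
using Emgu.CV.Structure;

// Shared OCR engine used by GetConfidence
private static Tesseract _ocr;

/// <summary>
/// Logs the recognized text and its confidence for every image in a folder
/// </summary>
/// <param name="fileImagePath">Path of the folder containing the images</param>
public static void GetConfidence(string fileImagePath)
{
    XmlDocument doc = new XmlDocument();
    // "Tessdata" is the folder that holds the resource files
    _ocr = new Tesseract(@"Tessdata", "eng", OcrEngineMode.TesseractOnly);
    // Restrict recognition to the characters in this whitelist
    _ocr.SetVariable("tessedit_char_whitelist", "qwertyuioplkjhgfdaazxcvb0123456789");
    DirectoryInfo TheFolder = new DirectoryInfo(fileImagePath);
    foreach (FileInfo NextFile in TheFolder.GetFiles())
    {
        using (Image<Gray, byte> image = new Image<Gray, byte>(NextFile.FullName))
        {
            _ocr.SetImage(image);
            _ocr.Recognize();
            // XML string that carries the confidence values; GetHOCRText is available in EmguCV 4.1
            string s1 = _ocr.GetHOCRText();
            try
            {
                // Load from the string (to load from a file instead, use doc.Load(path))
                doc.LoadXml(s1);
                // Walk page > block > paragraph > line > first word
                var elem = doc.FirstChild.FirstChild.FirstChild.FirstChild.FirstChild;
                // The title looks like "bbox 2 6 26 34; x_wconf 69"; take the number after x_wconf
                string x_wconf = elem.Attributes["title"].Value.Split(';')[1].Trim().Split()[1];
                // Recognized text of the word
                string value = elem.InnerText;
                Console.WriteLine("file path: {0}   x_wconf {1}  value {2}", NextFile.FullName, x_wconf, value);
            }
            catch (Exception ex)
            {
                // Report files whose output could not be parsed instead of silently ignoring them
                Console.WriteLine("could not parse OCR output for {0}: {1}", NextFile.FullName, ex.Message);
            }
        }
    }
}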

Sample XML output:

<div class='ocr_page' id='page_1' title='image ""; bbox 0 0 26 39; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 2 6 26 34">
    <p class='ocr_par' id='par_1_1' lang='eng' title="bbox 2 6 26 34">
     <span class='ocr_line' id='line_1_1' title="bbox 2 6 26 34; baseline 0 -7; x_size 41; x_descenders 8; x_ascenders 11">
      <span class='ocrx_word' id='word_1_1' title='bbox 2 6 26 34; x_wconf 69'><strong>4</strong></span>
     </span>
    </p>
   </div>
  </div>
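
The core code above reads only the first word, by walking down a fixed chain of FirstChild nodes. As a rough sketch of a more tolerant approach, the snippet below selects every ocrx_word span with an XPath query and pulls x_wconf out of the title attribute by name rather than by position. The class name HocrConfidence and the method ParseWords are hypothetical names made up for this illustration; they are not part of EmguCV or Tesseract. It also assumes the hOCR string is a namespace-free fragment like the sample above, and that x_wconf is an integer.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper for illustration only; not part of EmguCV or Tesseract.
public static class HocrConfidence
{
    // Returns (text, confidence) pairs for every ocrx_word span in an hOCR fragment.
    // Assumes the fragment is well-formed XML without a default namespace.
    public static List<KeyValuePair<string, int>> ParseWords(string hocr)
    {
        var doc = new XmlDocument();
        doc.LoadXml(hocr);
        var result = new List<KeyValuePair<string, int>>();
        foreach (XmlNode word in doc.SelectNodes("//span[@class='ocrx_word']"))
        {
            string title = word.Attributes["title"]?.Value ?? "";
            // The title is a ';'-separated list such as "bbox 2 6 26 34; x_wconf 69"
            foreach (string field in title.Split(';'))
            {
                string[] parts = field.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
                if (parts.Length >= 2 && parts[0] == "x_wconf" && int.TryParse(parts[1], out int conf))
                {
                    result.Add(new KeyValuePair<string, int>(word.InnerText, conf));
                }
            }
        }
        return result;
    }

    public static void Main()
    {
        // Condensed copy of the sample XML shown above
        string sample =
            "<div class='ocr_page' id='page_1' title='image \"\"; bbox 0 0 26 39; ppageno 0'>" +
            "<div class='ocr_carea' id='block_1_1' title='bbox 2 6 26 34'>" +
            "<p class='ocr_par' id='par_1_1' lang='eng' title='bbox 2 6 26 34'>" +
            "<span class='ocr_line' id='line_1_1' title='bbox 2 6 26 34; baseline 0 -7'>" +
            "<span class='ocrx_word' id='word_1_1' title='bbox 2 6 26 34; x_wconf 69'><strong>4</strong></span>" +
            "</span></p></div></div>";

        foreach (var w in ParseWords(sample))
        {
            Console.WriteLine($"{w.Key}: {w.Value}"); // prints "4: 69"
        }
    }
}
```

In GetConfidence, the string returned by _ocr.GetHOCRText() could be passed to a helper like this in place of the FirstChild chain, so that images containing more than one word, or none at all, do not depend on the exact nesting depth.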

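Summary: This article explored English character recognition, and the results were quite good. Tesseract-OCR can also recognize Chinese, although how well it does so remains to be verified. In computer vision it pays to keep studying and collecting well-established classic algorithms, to keep learning, and to read plenty of English-language material; that is how you keep strengthening your own competitiveness.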
