Optical Character Recognition with the Tesseract-OCR Engine Integrated in EmguCV

Original post

2020-06-15 05:33

Source code: https://github.com/tesseract-ocr/tesseract

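Overview: Tesseract was originally developed by HP and open-sourced in 2005; Google took over its maintenance in 2006. It is a solid engine for optical character recognition, and it is open source as well. The project ships detailed documentation that is worth reading closely; you may find something unexpectedly useful there. EmguCV integrates the engine, which is good news for anyone working in C#. The resource files (the trained language data) have to be downloaded from the official site, and the download is rather slow; if needed, they can also be fetched from the link below: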
https://download.csdn.net/download/IT_BOY__/12009964
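
Before constructing the engine, it can help to confirm that the data folder really contains the language file. The following is a minimal sketch, assuming the `Tessdata` folder and the `eng` language used in the code later in this post; `TessdataCheck` and `HasLanguageData` are hypothetical helper names for illustration, not part of EmguCV. It relies only on the fact that Tesseract names its trained data files `<language>.traineddata`.

```csharp
using System;
using System.IO;

public static class TessdataCheck
{
    // Hypothetical helper (not an EmguCV API): returns true when
    // <tessdataDir>/<language>.traineddata exists on disk.
    public static bool HasLanguageData(string tessdataDir, string language)
    {
        if (string.IsNullOrWhiteSpace(tessdataDir) || string.IsNullOrWhiteSpace(language))
        {
            return false;
        }

        // Tesseract stores one trained data file per language, e.g. eng.traineddata
        string path = Path.Combine(tessdataDir, language + ".traineddata");
        return File.Exists(path);
    }
}
```

Calling `TessdataCheck.HasLanguageData("Tessdata", "eng")` before creating the `Tesseract` object gives a clear early signal when the download is missing or was unpacked into the wrong folder.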

Core calling code:

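using System;
using System.IO;
using System.Xml;
using Emgu.CV;
using Emgu.CV.CvEnum;
using Emgu.CV.OCR;
using Emgu.CV.Structure;

/// <summary>
/// Runs OCR on every image in a folder and prints the confidence of the first recognized word.
/// </summary>
/// <param name="fileImagePath">Path of the folder that contains the images</param>
public static void GetConfidence(string fileImagePath)
{
    XmlDocument doc = new XmlDocument();
    // "Tessdata" is the folder that holds the trained data files
    using (Tesseract ocr = new Tesseract(@"Tessdata", "eng", OcrEngineMode.TesseractOnly))
    {
        // Restrict recognition to the characters listed here
        ocr.SetVariable("tessedit_char_whitelist", "qwertyuioplkjhgfdaazxcvb0123456789");
        DirectoryInfo theFolder = new DirectoryInfo(fileImagePath);
        foreach (FileInfo nextFile in theFolder.GetFiles())
        {
            try
            {
                using (Image<Gray, byte> image = new Image<Gray, byte>(nextFile.FullName))
                {
                    ocr.SetImage(image);
                    ocr.Recognize();
                    // XML string that carries the confidence; GetHOCRText is available in EmguCV 4.1
                    string s1 = ocr.GetHOCRText();
                    // Load from a string (use doc.Load(path) to load from a file instead)
                    doc.LoadXml(s1);
                    // Walk page -> block -> paragraph -> line -> first word
                    var elem = doc.FirstChild.FirstChild.FirstChild.FirstChild.FirstChild;
                    // Parse x_wconf from a title such as "bbox 2 6 26 34; x_wconf 69"
                    string x_wconf = elem.Attributes["title"].Value.Split(';')[1].Trim().Split()[1];
                    // Recognized text
                    string value = elem.InnerText;
                    Console.WriteLine("file path: {0}   x_wconf: {1}   value: {2}", nextFile.FullName, x_wconf, value);
                }
            }
            catch (Exception ex)
            {
                // Unreadable files or empty results end up here; report them instead of failing silently
                Console.WriteLine("skipped {0}: {1}", nextFile.FullName, ex.Message);
            }
        }
    }
}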

A sample of the XML output:

<div class='ocr_page' id='page_1' title='image ""; bbox 0 0 26 39; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 2 6 26 34">
    <p class='ocr_par' id='par_1_1' lang='eng' title="bbox 2 6 26 34">
     <span class='ocr_line' id='line_1_1' title="bbox 2 6 26 34; baseline 0 -7; x_size 41; x_descenders 8; x_ascenders 11">
      <span class='ocrx_word' id='word_1_1' title='bbox 2 6 26 34; x_wconf 69'><strong>4</strong></span>
     </span>
    </p>
   </div>
  </div>
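
The chain of `FirstChild` calls in the code above only reaches the first word and breaks as soon as the nesting differs. As a hedged alternative, the sketch below selects every `ocrx_word` span with XPath and looks up the `x_wconf` field by name instead of by position. It assumes the XML is a namespace-free fragment like the sample above; `HocrParser` and `ParseWords` are hypothetical names introduced for illustration and are not part of EmguCV.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class HocrParser
{
    // Hypothetical helper: returns (text, confidence) pairs for every word span
    // in a namespace-free hOCR fragment shaped like the sample above.
    public static List<KeyValuePair<string, int>> ParseWords(string hocr)
    {
        var results = new List<KeyValuePair<string, int>>();
        var doc = new XmlDocument();
        doc.LoadXml(hocr);

        // Select all word-level spans regardless of how deeply they are nested
        XmlNodeList words = doc.SelectNodes("//span[@class='ocrx_word']");
        foreach (XmlNode word in words)
        {
            XmlAttribute title = word.Attributes["title"];
            if (title == null)
            {
                continue;
            }

            // The title looks like "bbox 2 6 26 34; x_wconf 69"; find x_wconf by name
            foreach (string field in title.Value.Split(';'))
            {
                string[] parts = field.Trim().Split(' ');
                int conf;
                if (parts.Length == 2 && parts[0] == "x_wconf" && int.TryParse(parts[1], out conf))
                {
                    results.Add(new KeyValuePair<string, int>(word.InnerText, conf));
                    break;
                }
            }
        }

        return results;
    }
}
```

For the sample above this yields a single entry with the text `4` and a confidence of 69. If the engine returns a full XHTML document with a default namespace instead of a bare fragment, the XPath query would need an `XmlNamespaceManager`, which this sketch does not cover.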

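Summary: This post explores English character recognition, and the results were quite good, which is perhaps unsurprising for an engine that originated abroad. Tesseract-OCR can also recognize Chinese, but the recognition quality there remains to be verified. In computer vision it pays to keep studying and accumulating the classic algorithms, to keep learning, and to read more English-language material; that is how you keep strengthening your own competitiveness.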
