Calling Tesseract from C++

Tesseract-ocr is a well-known open-source OCR engine. This post is a short introduction to using its C++ API.

This post draws mainly on:

and on the API documentation: https://ub-mannheim.github.io/tesseract/index.html

I won't go into how to build tesseract here. With Visual C++ it takes a single command: vcpkg install tesseract.

Let's start with an official example:

#include <cstdio>   // fprintf, printf
#include <cstdlib>  // exit
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main()
{
    char *outText;

    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    // Initialize tesseract-ocr with English, without specifying tessdata path
    if (api->Init(NULL, "eng")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }

    // Open input image with leptonica library
    Pix *image = pixRead("phototest.png");
    if (image == NULL) {
        fprintf(stderr, "Could not read phototest.png.\n");
        exit(1);
    }
    api->SetImage(image);
    // Get OCR result
    outText = api->GetUTF8Text();
    printf("OCR output:\n%s", outText);

    // Destroy used object and release memory
    api->End();
    delete api;
    delete [] outText;
    pixDestroy(&image);

    return 0;
}

api->Init(NULL, "eng") loads eng.traineddata. NULL means the file is loaded from the default location; you can also pass in the directory that contains eng.traineddata.

To load training data for additional languages at the same time, write: api->Init(NULL, "eng+deu")

This loads both the English and the German data.
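When the set of languages is only known at run time, the "eng+deu" string has to be assembled by hand. A minimal sketch, assuming a hypothetical helper named join_languages (it is not part of Tesseract):

```cpp
#include <string>
#include <vector>

// Hypothetical helper: joins language codes with '+', the separator
// that Init() expects between languages, e.g. {"eng", "deu"} -> "eng+deu".
inline std::string join_languages(const std::vector<std::string>& langs) {
    std::string out;
    for (const std::string& lang : langs) {
        if (!out.empty()) {
            out += '+';
        }
        out += lang;
    }
    return out;
}
```

The result can then be passed on as api->Init(NULL, join_languages(langs).c_str()).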

api->Init(NULL, "xxx") can be called more than once in a program. Each call reinitializes the OCR engine.

api->SetImage(image); loads the image. After that, recognition can be restricted to part of the image with a call like this:

api->SetRectangle(left, top, width, height);

api->GetUTF8Text() returns the recognized text. Note that it returns a C string allocated by the library, and the caller has to free that memory:

delete [] outText;

This shows how low-level the Tesseract API is; it could at least return a std::string. As it stands, it is easy to leak memory.
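One way to reduce the risk of a leak is to hand the raw pointer to a small wrapper straight away. The sketch below assumes a hypothetical helper named take_utf8 (not part of Tesseract) and relies on the buffer being freed with delete [], as described above:

```cpp
#include <memory>
#include <string>

// Hypothetical helper: takes ownership of a buffer that must be freed
// with delete[], copies it into a std::string, and frees the buffer.
// A null pointer yields an empty string.
inline std::string take_utf8(char* raw) {
    std::unique_ptr<char[]> owner(raw);  // delete[] runs when owner is destroyed
    return owner ? std::string(owner.get()) : std::string();
}
```

With this in place the call site becomes std::string text = take_utf8(api->GetUTF8Text()); and no explicit delete [] is needed. The cost is one extra copy of the text.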

After OCR you usually also want to look at the confidence value of the result:

api->MeanTextConf();

This value ranges from 0 to 100; the higher it is, the more likely the result is correct.

When you are done, call api->End(); to release the memory held by the engine.

That is about the simplest possible usage. The example uses an image, which I include here:

On my machine the output is:

1284567890 4934567890

This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.

The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.

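A few digits are misrecognized. If you use SetRectangle to enclose just that string of digits and run recognition again, all of them come out correctly.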

Now a slightly more advanced example:

  Pix *image = pixRead("phototest.png");
  tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
  api->Init(NULL, "eng");
  api->SetImage(image);
  Boxa* boxes = api->GetComponentImages(tesseract::RIL_TEXTLINE, true, NULL, NULL);
  printf("Found %d textline image components.\n", boxes->n);
  for (int i = 0; i < boxes->n; i++) {
    BOX* box = boxaGetBox(boxes, i, L_CLONE);
    api->SetRectangle(box->x, box->y, box->w, box->h);
    char* ocrResult = api->GetUTF8Text();
    int conf = api->MeanTextConf();
    fprintf(stdout, "Box[%d]: x=%d, y=%d, w=%d, h=%d, confidence: %d, text: %s",
                    i, box->x, box->y, box->w, box->h, conf, ocrResult);
    boxDestroy(&box);
  }

This example splits the text in the image into lines, using the following function:

api->GetComponentImages(tesseract::RIL_TEXTLINE, true, NULL, NULL);

RIL_TEXTLINE splits by text line. You can also split by paragraph (RIL_PARA), word (RIL_WORD), or character (RIL_SYMBOL).

The output of this example is shown below. The recognition accuracy is clearly poor.

Found 9 textline image components.
Box[0]: x=42, y=33, w=321, h=33, confidence: 40, text: 123496 /890 1234567890
Box[1]: x=36, y=92, w=544, h=30, confidence: 92, text: This Is a lot of 12 point text to test the
Box[2]: x=36, y=126, w=582, h=31, confidence: 89, text: ocr code and see If it works on ail types
Box[3]: x=36, y=160, w=187, h=24, confidence: 88, text: of tie format.
Box[4]: x=36, y=194, w=549, h=31, confidence: 90, text: The quick brown dog Jumped over the
Box[5]: x=37, y=228, w=548, h=31, confidence: 75, text: lazy Tox. 1ne quick brown dog Jumped
Box[6]: x=36, y=262, w=561, h=31, confidence: 93, text: over the lazy fox. [he quick brown dog
Box[7]: x=43, y=296, w=518, h=31, confidence: 89, text: jumped over the lazy Tox. [ne quick
Box[8]: x=37, y=330, w=524, h=31, confidence: 82, text: brown dog Jumped over the lazy Tox.

The accuracy is poor because of a problem with the line api->SetRectangle(box->x, box->y, box->w, box->h);. If it is changed to this:

api->SetRectangle(box->x, box->y-1, box->w, box->h+1);

the accuracy improves considerably. The output then becomes:

Found 9 textline image components.
Box[0]: x=42, y=33, w=321, h=33, confidence: 91, text: 1234567890 1234567890
Box[1]: x=36, y=92, w=544, h=30, confidence: 95, text: This is a lot of 12 point text to test the
Box[2]: x=36, y=126, w=582, h=31, confidence: 95, text: ocr code and see if it works on all types
Box[3]: x=36, y=160, w=187, h=24, confidence: 94, text: of file format.
Box[4]: x=36, y=194, w=549, h=31, confidence: 95, text: The quick brown dog jumped over the
Box[5]: x=37, y=228, w=548, h=31, confidence: 93, text: lazy fox. The quick brown dog jumped
Box[6]: x=36, y=262, w=561, h=31, confidence: 95, text: over the lazy fox. The quick brown dog
Box[7]: x=43, y=296, w=518, h=31, confidence: 95, text: jumped over the lazy fox. The quick
Box[8]: x=37, y=330, w=524, h=31, confidence: 93, text: brown dog jumped over the lazy fox.
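The fix above moves the top edge of the box up by one pixel. A more general variant grows the box by a margin on every side and clips it to the image. This is only a sketch: Rect and pad_rect are hypothetical names, and padding all four sides (rather than only the top) is an assumption, not something the results above were tested with.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical rectangle type using the same convention as SetRectangle:
// left, top, width, height.
struct Rect {
    int left, top, width, height;
};

// Grows r by `margin` pixels on every side, clipped to an image of
// img_w x img_h pixels so the result never leaves the image.
inline Rect pad_rect(const Rect& r, int margin, int img_w, int img_h) {
    assert(margin >= 0 && img_w > 0 && img_h > 0);
    const int x1 = std::max(0, r.left - margin);
    const int y1 = std::max(0, r.top - margin);
    const int x2 = std::min(img_w, r.left + r.width + margin);
    const int y2 = std::min(img_h, r.top + r.height + margin);
    return Rect{x1, y1, x2 - x1, y2 - y1};
}
```

Inside the loop the padded rectangle would be passed on as api->SetRectangle(p.left, p.top, p.width, p.height), with the image size taken from pixGetWidth(image) and pixGetHeight(image).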

Another problem with the code above is that the allocated strings are never freed. The correct code looks like this:

    Boxa* boxes = api->GetComponentImages(tesseract::RIL_TEXTLINE, true, NULL, NULL);
    printf("Found %d textline image components.\n", boxes->n);
    for (int i = 0; i < boxes->n; i++)
    {
        BOX* box = boxaGetBox(boxes, i, L_CLONE);
        // Start one pixel higher so the top of the text line is not cut off
        api->SetRectangle(box->x, box->y-1, box->w, box->h+1);
        char* ocrResult = api->GetUTF8Text();
        int conf = api->MeanTextConf();
        fprintf(stdout, "Box[%d]: x=%d, y=%d, w=%d, h=%d, confidence: %d, text: %s",
                    i, box->x, box->y, box->w, box->h, conf, ocrResult);
        delete [] ocrResult;  // the string returned by GetUTF8Text() is ours to free
        boxDestroy(&box);
    }
    boxaDestroy(&boxes);      // release the box array itself

Finally, one more example:

  Pix *image = pixRead("phototest.png");
  tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
  api->Init(NULL, "eng");
  api->SetImage(image);
  api->Recognize(0);
  tesseract::ResultIterator* ri = api->GetIterator();
  tesseract::PageIteratorLevel level = tesseract::RIL_WORD;
  if (ri != 0) {
    do {
      const char* word = ri->GetUTF8Text(level);
      float conf = ri->Confidence(level);
      int x1, y1, x2, y2;
      ri->BoundingBox(level, &x1, &y1, &x2, &y2);
      printf("word: '%s';  \tconf: %.2f; BoundingBox: %d,%d,%d,%d;\n",
               word, conf, x1, y1, x2, y2);
      delete[] word;
    } while (ri->Next(level));
  }

This code is much like the previous example, except that it uses an iterator. I won't explain it further.

The program prints the following:

word: '1284567890';     conf: 64.73; BoundingBox: 42,33,170,50;
word: '4934567890';     conf: 56.32; BoundingBox: 190,47,363,66;
word: 'This';   conf: 96.59; BoundingBox: 36,92,96,116;
word: 'is';     conf: 96.92; BoundingBox: 109,92,129,116;
word: 'a';      conf: 96.33; BoundingBox: 141,98,156,116;
word: 'lot';    conf: 96.33; BoundingBox: 169,92,201,116;
word: 'of';     conf: 96.45; BoundingBox: 212,92,240,116;
word: '12';     conf: 96.45; BoundingBox: 251,92,282,116;
word: 'point';          conf: 96.47; BoundingBox: 296,92,364,122;
word: 'text';   conf: 96.47; BoundingBox: 374,93,427,116;
word: 'to';     conf: 96.88; BoundingBox: 437,93,463,116;
word: 'test';   conf: 96.98; BoundingBox: 474,93,526,116;
word: 'the';    conf: 96.37; BoundingBox: 536,92,580,116;
word: 'ocr';    conf: 96.07; BoundingBox: 36,132,81,150;
word: 'code';   conf: 96.07; BoundingBox: 91,126,160,150;
word: 'and';    conf: 96.62; BoundingBox: 172,126,223,150;
word: 'see';    conf: 96.53; BoundingBox: 236,132,286,150;
word: 'if';     conf: 94.37; BoundingBox: 299,126,314,150;
word: 'it';     conf: 94.37; BoundingBox: 325,126,339,150;
word: 'works';          conf: 95.96; BoundingBox: 348,126,433,150;
word: 'on';     conf: 93.54; BoundingBox: 445,132,478,150;
word: 'all';    conf: 93.54; BoundingBox: 500,126,529,150;
word: 'types';          conf: 96.90; BoundingBox: 541,127,618,157;
word: 'of';     conf: 96.23; BoundingBox: 36,160,64,184;
word: 'file';   conf: 95.72; BoundingBox: 72,160,113,184;
word: 'format.';        conf: 95.68; BoundingBox: 123,160,223,184;
word: 'The';    conf: 96.51; BoundingBox: 36,194,91,218;
word: 'quick';          conf: 96.63; BoundingBox: 102,194,177,224;
word: 'brown';          conf: 96.82; BoundingBox: 189,194,274,218;
word: 'dog';    conf: 95.79; BoundingBox: 287,194,339,225;
word: 'jumped';         conf: 95.79; BoundingBox: 348,194,456,225;
word: 'over';   conf: 96.60; BoundingBox: 468,200,531,218;
word: 'the';    conf: 96.49; BoundingBox: 540,194,585,218;
word: 'lazy';   conf: 96.40; BoundingBox: 37,228,92,259;
word: 'fox.';   conf: 96.44; BoundingBox: 103,228,153,252;
word: 'The';    conf: 96.70; BoundingBox: 165,228,220,252;
word: 'quick';          conf: 96.63; BoundingBox: 232,228,307,258;
word: 'brown';          conf: 96.62; BoundingBox: 319,228,404,252;
word: 'dog';    conf: 95.80; BoundingBox: 417,228,468,259;
word: 'jumped';         conf: 95.80; BoundingBox: 478,228,585,259;
word: 'over';   conf: 96.29; BoundingBox: 36,268,99,286;
word: 'the';    conf: 96.28; BoundingBox: 109,262,153,286;
word: 'lazy';   conf: 96.51; BoundingBox: 165,262,221,293;
word: 'fox.';   conf: 96.30; BoundingBox: 231,262,281,286;
word: 'The';    conf: 96.65; BoundingBox: 294,262,349,286;
word: 'quick';          conf: 96.61; BoundingBox: 360,262,435,292;
word: 'brown';          conf: 96.12; BoundingBox: 447,262,532,286;
word: 'dog';    conf: 96.12; BoundingBox: 545,262,597,293;
word: 'jumped';         conf: 96.73; BoundingBox: 43,296,150,327;
word: 'over';   conf: 96.38; BoundingBox: 162,302,226,320;
word: 'the';    conf: 96.38; BoundingBox: 235,296,279,320;
word: 'lazy';   conf: 96.80; BoundingBox: 292,296,347,327;
word: 'fox.';   conf: 96.77; BoundingBox: 357,296,407,320;
word: 'The';    conf: 96.17; BoundingBox: 420,296,475,320;
word: 'quick';          conf: 96.95; BoundingBox: 486,296,561,326;
word: 'brown';          conf: 96.83; BoundingBox: 37,330,122,354;
word: 'dog';    conf: 96.32; BoundingBox: 135,330,187,361;
word: 'jumped';         conf: 96.80; BoundingBox: 196,330,304,361;
word: 'over';   conf: 96.95; BoundingBox: 316,336,379,354;
word: 'the';    conf: 96.56; BoundingBox: 388,330,433,354;
word: 'lazy';   conf: 95.99; BoundingBox: 445,330,500,361;
word: 'fox.';   conf: 96.61; BoundingBox: 511,330,561,354;

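Some recognition errors remain. For the misrecognized words, you can record their positions, enlarge the region slightly, and recognize them again with SetRectangle. Never do this while iterating, though, because recognizing again invalidates the iterator's state.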
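The record-first, recognize-later approach can be sketched as follows. Word, Region, and low_confidence_regions are hypothetical names, and the confidence threshold is an assumption that has to be chosen per application. Note that BoundingBox reports two corners (x1, y1, x2, y2), while SetRectangle expects left, top, width, height, so a conversion is needed.

```cpp
#include <string>
#include <vector>

// Hypothetical record of one word collected inside the iterator loop.
struct Word {
    std::string text;
    float conf;
    int x1, y1, x2, y2;  // corners as reported by BoundingBox()
};

// Hypothetical region in SetRectangle() convention.
struct Region {
    int left, top, width, height;
};

// Returns one region for every word whose confidence is below `threshold`,
// converting corner coordinates into left/top/width/height.
inline std::vector<Region> low_confidence_regions(const std::vector<Word>& words,
                                                  float threshold) {
    std::vector<Region> out;
    for (const Word& w : words) {
        if (w.conf < threshold) {
            out.push_back(Region{w.x1, w.y1, w.x2 - w.x1, w.y2 - w.y1});
        }
    }
    return out;
}
```

The loop over the iterator only fills the vector of Word records. Once iteration is finished, each returned region can be enlarged slightly and passed to SetRectangle for a second pass.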
