5 Tesseract-ocr series: training and recognition with jTessBoxEditor and tesseract-ocr-3.04

Original post

2018-09-04 06:32

First, my local environment: Ubuntu 16.04 + Tesseract-ocr + jTessBoxEditor 1.7.3
(Setting up this environment was covered in earlier posts of this series, so I won't repeat it here.)

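This post walks through training with image files as input. My training material is ten images of digits in a particular font, and the language data I end up with after training is num1.traineddata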

Change to the jTessBoxEditor installation directory and launch the jTessBoxEditor client:

cd /your/path/jTessBoxEditor-1.7.3
java -Xms128m -Xmx1024m -jar jTessBoxEditor.jar
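
The commands in the rest of this post all work on files named after one base name, which follows Tesseract 3.x's `<lang>.<fontname>.exp<N>` pattern. The sketch below only spells that pattern out for this post's names. The `makebox` call is an assumption about how the draft .box file might be produced before correcting it in jTessBoxEditor; it is not a step taken from this post, and it runs only if tesseract and the image are present.

```shell
# Training files follow the pattern <lang>.<fontname>.exp<N>.
lang=num1          # language name used throughout this post
font=invoicenum    # font name; must match the entry in font_properties
base="${lang}.${font}.exp0"
echo "image: ${base}.tif  box: ${base}.box"

# Assumed step: let Tesseract draft the box file, then fix it in jTessBoxEditor.
if command -v tesseract >/dev/null 2>&1 && [ -f "${base}.tif" ]; then
    tesseract "${base}.tif" "${base}" batch.nochop makebox
fi
```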

From the .box file (together with its .tif image) -> .tr

tesseract num1.invoicenum.exp0.tif num1.invoicenum.exp0 box.train.stderr

From the .box file -> unicharset

unicharset_extractor num1.invoicenum.exp0.box

Create font_properties // essentially a plain-text (.txt) file, but with no extension
```
echo invoicenum 0 1 1 0 1 >font_properties
```
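
Each line of font_properties describes one font as `<fontname> <italic> <bold> <fixed> <serif> <fraktur>`, with 0 or 1 for each flag. As a small illustration, the snippet below splits this post's entry into those fields so the flags are easier to read; it does not write any file.

```shell
# Field order: <fontname> <italic> <bold> <fixed> <serif> <fraktur>
line='invoicenum 0 1 1 0 1'     # the entry used in this post
read -r name italic bold fixed serif fraktur <<EOF
$line
EOF
echo "font=$name italic=$italic bold=$bold fixed=$fixed serif=$serif fraktur=$fraktur"
```

The font name here has to be the same as the middle part of the training file name (invoicenum in num1.invoicenum.exp0).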

From font_properties, unicharset, and .tr -> shapetable

shapeclustering -F font_properties -U unicharset num1.invoicenum.exp0.tr

From font_properties, unicharset, and .tr -> lang.unicharset (here num1.unicharset), inttemp, pffmtable

mftraining -F font_properties -U unicharset -O num1.unicharset num1.invoicenum.exp0.tr
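
The rename step that follows expects a normproto file, and mftraining does not write one. In Tesseract 3.x that file comes from cntraining, run on the same .tr file. The sketch below assumes cntraining is on the PATH and the .tr file is in the current directory; otherwise it just reports that it skipped the step.

```shell
# normproto is produced by cntraining from the same .tr file.
tr_file=num1.invoicenum.exp0.tr
if command -v cntraining >/dev/null 2>&1 && [ -f "$tr_file" ]; then
    cntraining "$tr_file"
else
    echo "skipping: cntraining or $tr_file not available" >&2
fi
```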

Rename inttemp, normproto, pffmtable, and shapetable so that each starts with the lang. prefix (here num1.)

mv inttemp num1.inttemp
mv normproto num1.normproto
mv pffmtable num1.pffmtable
mv shapetable num1.shapetable
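
The same renaming can be written as a loop, which also makes it obvious when one of the expected outputs was never generated. This is only a sketch: `add_lang_prefix` is a helper name made up for this example, not part of Tesseract.

```shell
# Prefix each training output with "<lang>."; report files that do not exist.
add_lang_prefix() {
    lang=$1
    for f in inttemp normproto pffmtable shapetable; do
        if [ -f "$f" ]; then
            mv "$f" "${lang}.${f}"
        else
            echo "missing: $f" >&2
        fi
    done
}
```

Running `add_lang_prefix num1` in the training directory does the same as the four mv commands above.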

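// After renaming, the directory contains num1.inttemp, num1.normproto, num1.pffmtable, num1.shapetable, and num1.unicharset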

Combine them into the lang.traineddata file, here num1.traineddata
```
combine_tessdata num1.
```
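
combine_tessdata picks up whatever files share the given prefix, so it can help to confirm first that everything produced in this post is actually there. The check below is a sketch under that assumption: `check_components` is a hypothetical helper, and the list of five files is simply the set this post generates, not an official list of requirements.

```shell
# Return non-zero if any of this post's component files is missing for the prefix.
check_components() {
    prefix=$1
    missing=0
    for part in unicharset inttemp pffmtable normproto shapetable; do
        if [ ! -f "${prefix}.${part}" ]; then
            echo "missing: ${prefix}.${part}" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

It could be used as `check_components num1 && combine_tessdata num1.` so that the combine step runs only when nothing is missing.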
Copy the generated lang.traineddata file into the tessdata/ directory on your system.
For example, my tessdata directory is /usr/local/share/tessdata/
```
sudo cp /your/path/num1.traineddata /usr/local/share/tessdata/
```
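
To confirm that Tesseract can see the new language after the copy, one option is to look for it in the output of `tesseract --list-langs`. This is a sketch, assuming tesseract is on the PATH; if it is not installed the check is simply skipped.

```shell
lang=num1
if command -v tesseract >/dev/null 2>&1; then
    # Depending on the version, the list goes to stdout or stderr, so merge both.
    if tesseract --list-langs 2>&1 | grep -qx "$lang"; then
        echo "$lang is installed"
    else
        echo "$lang not found; check the tessdata path or TESSDATA_PREFIX" >&2
    fi
fi
```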
Test
Image to recognize:

a) Recognition result using the bundled language data eng.traineddata:

b) Recognition result using the newly trained num1.traineddata:
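
The two comparisons above might be run roughly as follows. The image name `test.png` is a placeholder, not the file used in this post; `-l` selects the language data, and `stdout` as the output base prints the recognized text to the terminal instead of writing a .txt file.

```shell
img=test.png   # hypothetical file name; substitute your own image
if command -v tesseract >/dev/null 2>&1 && [ -f "$img" ]; then
    tesseract "$img" stdout -l eng    # a) bundled English data
    tesseract "$img" stdout -l num1   # b) the newly trained data
else
    echo "skipping: tesseract or $img not available" >&2
fi
```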