How to update Tesseract traineddata

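Tesseract occasionally updates its training data, usually by publishing an incremental update; the current training data for version 4.0 is one such incremental update. The combine_tessdata command can merge an incremental update with the earlier training data. The steps are as follows: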

Preparing the environment

  1. Download the traineddata
    Go to: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
    Download "Data Files for Version 4.00 (November 29, 2016)" and
    "Updated Data Files for Version 4.00 (September 15, 2017)"

  2. Create a directory to hold the unpacked traineddata components
    PS E:\tesseract4> mkdir chi_sim

  3. Example directory layout

PS E:\tesseract4> tree.com /F
Folder PATH listing
Volume serial number is 76B2-83BC
E:.
│  chi_sim.traineddata
│  eng.traineddata
│  equ.traineddata
│  tesseract-ocr-w32-setup-v4.0.0-beta.4.20180912.exe
│
├─20170915-Updated Data Files for Version 4.00
│      chi_sim.traineddata
│      chi_sim_vert.traineddata
│
└─chi_sim
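
The layout above can also be prepared and checked from a script. The following is a minimal Python sketch, assuming the file and directory names shown in the tree; prepare_workspace is a hypothetical helper for illustration and is not part of Tesseract.

```python
from pathlib import Path


def prepare_workspace(root, lang="chi_sim",
                      update_dir="20170915-Updated Data Files for Version 4.00"):
    """Create the extraction directory and report which input files are missing.

    The default names are copied from the directory listing in this post;
    adjust them if your downloads were unpacked elsewhere.
    """
    root = Path(root)
    workdir = root / lang
    # Same effect as `mkdir chi_sim`, but harmless if the directory exists.
    workdir.mkdir(parents=True, exist_ok=True)
    expected = [
        root / f"{lang}.traineddata",               # original data file
        root / update_dir / f"{lang}.traineddata",  # incremental update
    ]
    missing = [p for p in expected if not p.is_file()]
    return workdir, missing
```

Calling prepare_workspace(r"E:\tesseract4") would return the chi_sim directory together with a list of any input files that still need to be downloaded.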

Unpacking and repacking the traineddata

  • Unpack the original traineddata into the working directory
PS E:\tesseract4> & 'C:\Program Files (x86)\Tesseract-OCR\combine_tessdata.exe' -u .\chi_sim.traineddata .\chi_sim\chi_sim.
Extracting tessdata components from .\chi_sim.traineddata
Wrote .\chi_sim\chi_sim.config
Wrote .\chi_sim\chi_sim.unicharset
Wrote .\chi_sim\chi_sim.unicharambigs
Wrote .\chi_sim\chi_sim.inttemp
Wrote .\chi_sim\chi_sim.pffmtable
Wrote .\chi_sim\chi_sim.normproto
Wrote .\chi_sim\chi_sim.punc-dawg
Wrote .\chi_sim\chi_sim.word-dawg
Wrote .\chi_sim\chi_sim.number-dawg
Wrote .\chi_sim\chi_sim.freq-dawg
Wrote .\chi_sim\chi_sim.shapetable
Wrote .\chi_sim\chi_sim.lstm
Wrote .\chi_sim\chi_sim.lstm-punc-dawg
Wrote .\chi_sim\chi_sim.lstm-word-dawg
Wrote .\chi_sim\chi_sim.lstm-number-dawg
Wrote .\chi_sim\chi_sim.version
Version string:Pre-4.0.0
0:config:size=1930, offset=192
1:unicharset:size=382937, offset=2122
2:unicharambigs:size=1, offset=385059
3:inttemp:size=39926030, offset=385060
4:pffmtable:size=50194, offset=40311090
5:normproto:size=618655, offset=40361284
6:punc-dawg:size=290, offset=40979939
7:word-dawg:size=652386, offset=40980229
8:number-dawg:size=74, offset=41632615
9:freq-dawg:size=1042, offset=41632689
13:shapetable:size=455944, offset=41633731
17:lstm:size=9924750, offset=42089675
18:lstm-punc-dawg:size=18, offset=52014425
19:lstm-word-dawg:size=648082, offset=52014443
20:lstm-number-dawg:size=74, offset=52662525
23:version:size=9, offset=52662599
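
The component listing printed at the end of the output is easy to read by machine. As a rough sketch (the regular expression and the helper names are my own, not part of Tesseract), the following Python parses the "index:name:size=..., offset=..." lines and checks that each component starts where the previous one ends, using the first four lines of the listing above as input.

```python
import re

# One listing line looks like: 1:unicharset:size=382937, offset=2122
LINE = re.compile(r"^(\d+):([\w-]+):size=(\d+), offset=(\d+)$")


def parse_listing(text):
    """Return (index, name, size, offset) tuples from a component listing."""
    entries = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            entries.append((int(m[1]), m[2], int(m[3]), int(m[4])))
    return entries


def is_contiguous(entries):
    """True if every component starts where the previous one ends."""
    return all(offset + size == nxt[3]
               for (_, _, size, offset), nxt in zip(entries, entries[1:]))


listing = """\
0:config:size=1930, offset=192
1:unicharset:size=382937, offset=2122
2:unicharambigs:size=1, offset=385059
3:inttemp:size=39926030, offset=385060
"""
entries = parse_listing(listing)
print(entries[1])              # (1, 'unicharset', 382937, 2122)
print(is_contiguous(entries))  # True
```
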
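  • Unpack the incremental update into the same directory; components that share a name with the original (config, lstm, the lstm-*-dawg files, and version) overwrite the earlier copies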
PS E:\tesseract4> & 'C:\Program Files (x86)\Tesseract-OCR\combine_tessdata.exe' -u '.\20170915-Updated Data Files for Version 4.00\chi_sim.traineddata' .\chi_sim\chi_sim.
Extracting tessdata components from .\20170915-Updated Data Files for Version 4.00\chi_sim.traineddata
Wrote .\chi_sim\chi_sim.config
Wrote .\chi_sim\chi_sim.lstm
Wrote .\chi_sim\chi_sim.lstm-punc-dawg
Wrote .\chi_sim\chi_sim.lstm-word-dawg
Wrote .\chi_sim\chi_sim.lstm-number-dawg
Wrote .\chi_sim\chi_sim.lstm-unicharset
Wrote .\chi_sim\chi_sim.lstm-recoder
Wrote .\chi_sim\chi_sim.version
Version string:4.00.00alpha:chi_sim:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
0:config:size=1966, offset=192
17:lstm:size=12152851, offset=2158
18:lstm-punc-dawg:size=282, offset=12155009
19:lstm-word-dawg:size=590634, offset=12155291
20:lstm-number-dawg:size=82, offset=12745925
21:lstm-unicharset:size=258834, offset=12746007
22:lstm-recoder:size=72494, offset=13004841
23:version:size=84, offset=13077335
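
Comparing the two listings shows what the update actually changes. The sketch below hard-codes the component names from the two outputs above and sorts them into replaced, newly added, and untouched; classify is a hypothetical helper written for this post.

```python
# Component names copied from the two listings above.
BASE = ["config", "unicharset", "unicharambigs", "inttemp", "pffmtable",
        "normproto", "punc-dawg", "word-dawg", "number-dawg", "freq-dawg",
        "shapetable", "lstm", "lstm-punc-dawg", "lstm-word-dawg",
        "lstm-number-dawg", "version"]
UPDATE = ["config", "lstm", "lstm-punc-dawg", "lstm-word-dawg",
          "lstm-number-dawg", "lstm-unicharset", "lstm-recoder", "version"]


def classify(base, update):
    """Split components into those replaced, added, and kept by the update."""
    replaced = [c for c in update if c in base]
    added = [c for c in update if c not in base]
    kept = [c for c in base if c not in update]
    return replaced, added, kept


replaced, added, kept = classify(BASE, UPDATE)
print(added)      # ['lstm-unicharset', 'lstm-recoder']
print(len(kept))  # 10
```

The ten untouched components are the non-LSTM ones from the original file, which is why both files have to be unpacked into the same directory before repacking.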
  • Pack the files in the directory into a complete traineddata file
    This writes the complete traineddata file into that same directory
PS E:\tesseract4> cd .\chi_sim\
PS E:\tesseract4\chi_sim> & 'C:\Program Files (x86)\Tesseract-OCR\combine_tessdata.exe' chi_sim
Combining tessdata files
Output chi_sim.traineddata created successfully.
Version string:4.00.00alpha:chi_sim:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
0:config:size=1966, offset=192
1:unicharset:size=382937, offset=2158
2:unicharambigs:size=1, offset=385095
3:inttemp:size=39926030, offset=385096
4:pffmtable:size=50194, offset=40311126
5:normproto:size=618655, offset=40361320
6:punc-dawg:size=290, offset=40979975
7:word-dawg:size=652386, offset=40980265
8:number-dawg:size=74, offset=41632651
9:freq-dawg:size=1042, offset=41632725
13:shapetable:size=455944, offset=41633767
17:lstm:size=12152851, offset=42089711
18:lstm-punc-dawg:size=282, offset=54242562
19:lstm-word-dawg:size=590634, offset=54242844
20:lstm-number-dawg:size=82, offset=54833478
21:lstm-unicharset:size=258834, offset=54833560
22:lstm-recoder:size=72494, offset=55092394
23:version:size=84, offset=55164888
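
The combined listing can be cross-checked by hand: every component present in the update should carry the update's size, everything else the original size, and the offsets should simply accumulate. Here is a small Python sketch of that arithmetic. The sizes are copied from the three listings above, and the starting offset of 192 is taken from the logs rather than from any specification.

```python
# Component sizes keyed by index, copied from the listings above.
BASE = {0: 1930, 1: 382937, 2: 1, 3: 39926030, 4: 50194, 5: 618655,
        6: 290, 7: 652386, 8: 74, 9: 1042, 13: 455944, 17: 9924750,
        18: 18, 19: 648082, 20: 74, 23: 9}
UPDATE = {0: 1966, 17: 12152851, 18: 282, 19: 590634, 20: 82,
          21: 258834, 22: 72494, 23: 84}


def layout(sizes, first_offset=192):
    """Return {index: offset}, placing components back to back in index order."""
    offsets, position = {}, first_offset
    for index in sorted(sizes):
        offsets[index] = position
        position += sizes[index]
    return offsets


# Entries from the update win over entries from the original.
merged = {**BASE, **UPDATE}
offsets = layout(merged)
print(offsets[17])  # 42089711
print(offsets[23])  # 55164888
```

Both values agree with the lstm and version lines in the combined listing.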
  • Copy the new traineddata file into the tessdata directory under the Tesseract installation path
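
For repeated use, the whole procedure can be scripted. The following Python is only a sketch under a few assumptions: merge_steps, run_steps and install are hypothetical helpers, the three commands mirror the PowerShell calls in this post, and keeping a .bak copy of the previously installed file is my own addition. The tessdata location (for example C:\Program Files (x86)\Tesseract-OCR\tessdata) is inferred from the install path in the transcript and may differ on your machine.

```python
import shutil
import subprocess
from pathlib import Path


def merge_steps(tool, base, update, workdir, lang):
    """Return (argv, cwd) pairs mirroring the three combine_tessdata calls."""
    prefix = str(Path(workdir) / lang) + "."  # e.g. chi_sim\chi_sim.
    return [
        ([str(tool), "-u", str(base), prefix], None),    # unpack the original
        ([str(tool), "-u", str(update), prefix], None),  # unpack the update on top
        ([str(tool), lang], str(workdir)),               # repack inside workdir
    ]


def run_steps(steps):
    """Run each step in order, raising on the first non-zero exit code."""
    for argv, cwd in steps:
        subprocess.run(argv, cwd=cwd, check=True)


def install(new_file, tessdata_dir):
    """Copy new_file into tessdata_dir, keeping a .bak copy of any existing file."""
    new_file, tessdata_dir = Path(new_file), Path(tessdata_dir)
    target = tessdata_dir / new_file.name
    if target.exists():
        shutil.copy2(target, target.with_name(target.name + ".bak"))
    shutil.copy2(new_file, target)
    return target
```

A typical call would build the steps with the full path to combine_tessdata.exe as tool, pass them to run_steps, and then call install on chi_sim\chi_sim.traineddata. Writing into a directory under Program Files usually requires an elevated prompt.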