Tesseract 4.0 training scripts (part 1)

lstmeval

NAME
       lstmeval - Evaluation program for LSTM-based networks.

SYNOPSIS
       lstmeval --model lang.lstm|langtrain_checkpoint|pluscharsN.NNN_NN.checkpoint [--traineddata
       lang/lang.traineddata] --eval_listfile lang.eval_files.txt [--verbosity N] [--max_image_MB NNNN]

DESCRIPTION
       lstmeval(1) evaluates LSTM-based networks. Either a recognition model or a training checkpoint can be given as
       input for evaluation along with a list of lstmf files. If evaluating a training checkpoint, the --traineddata
       file that was used to produce that checkpoint should also be specified.
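
As a rough illustration of the two input modes described above, the following sketch assembles an lstmeval command line for either a recognition model or a training checkpoint. The helper name `lstmeval_cmd` and all file names are hypothetical, and telling a checkpoint apart by its file-name suffix is an assumption of this sketch, not something lstmeval itself does. Only the flags come from the synopsis. The command is printed rather than run, so it can be inspected first.

```shell
#!/bin/sh
# Sketch: print an lstmeval command for a recognition model or a checkpoint.
# lstmeval_cmd and the file names below are hypothetical; the flags
# (--model, --traineddata, --eval_listfile) are taken from the synopsis.
lstmeval_cmd() {
    model=$1
    listfile=$2
    traineddata=${3:-}   # only needed when $model is a training checkpoint

    case $model in
        *checkpoint)
            # Assumption: checkpoint files end in "checkpoint", as in the synopsis.
            if [ -z "$traineddata" ]; then
                echo "error: a checkpoint needs --traineddata" >&2
                return 1
            fi
            printf 'lstmeval --model %s --traineddata %s --eval_listfile %s\n' \
                "$model" "$traineddata" "$listfile"
            ;;
        *)
            printf 'lstmeval --model %s --eval_listfile %s\n' "$model" "$listfile"
            ;;
    esac
}

# Recognition model: no traineddata required.
lstmeval_cmd eng.lstm eng.eval_files.txt
# Training checkpoint: pass the traineddata that was given to the trainer.
lstmeval_cmd engtrain_checkpoint eng.eval_files.txt eng/eng.traineddata
```

Printing the command keeps the sketch usable on a machine without Tesseract installed; to actually evaluate, run the printed line in a shell where lstmeval is on the PATH.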

OPTIONS
       --model FILE
           Name of model file (training or recognition) (type:string default:)

       --traineddata FILE
           If model is a training checkpoint, then traineddata must be the traineddata file that was given to the
           trainer (type:string default:)

       --eval_listfile FILE
           Text file listing sample files in lstmf training format. (type:string default:)

       --max_image_MB INT
           Max memory to use for images. (type:int default:2000)

       --verbosity INT
           Amount of diagnostic information to output (0-2). (type:int default:1)
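
The --eval_listfile option expects a file that names the lstmf samples. A minimal sketch of producing one is shown below, assuming one path per line and a hypothetical directory of .lstmf files. The helper name `make_eval_list` is made up for illustration.

```shell
#!/bin/sh
# Sketch: collect *.lstmf files under a directory into a list file.
# make_eval_list and the directory names are hypothetical.
make_eval_list() {
    src_dir=$1
    list_file=$2
    # One path per line, sorted so the order is stable between runs.
    find "$src_dir" -type f -name '*.lstmf' | sort > "$list_file"
    # Return non-zero if nothing was found, since an empty list is not useful.
    [ -s "$list_file" ]
}

# Demo on a throwaway directory with empty placeholder files.
demo_dir=$(mktemp -d)
touch "$demo_dir/b.lstmf" "$demo_dir/a.lstmf" "$demo_dir/notes.txt"
make_eval_list "$demo_dir" "$demo_dir/eval_files.txt"
cat "$demo_dir/eval_files.txt"
```

The resulting file can be passed as --eval_listfile, optionally together with --verbosity and --max_image_MB as listed above.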

HISTORY
       lstmeval(1) was first made available for tesseract4.00.00alpha.

RESOURCES
       Main web site: https://github.com/tesseract-ocr
       Information on training tesseract LSTM:
       https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

SEE ALSO
       tesseract(1)

COPYING
       Copyright (C) 2012 Google, Inc. Licensed under the Apache License, Version 2.0

AUTHOR
       The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and
       Google (2006-present).

tesstrain.sh

# This script provides an easy way to execute various phases of training
# Tesseract.  For a detailed description of the phases, see
# https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract
#
# USAGE:
#
# tesstrain.sh
#    --fontlist FONTS           # A list of fontnames to train on.
#    --fonts_dir FONTS_PATH     # Path to font files.
#    --lang LANG_CODE           # ISO 639 code (three-letter form).
#    --langdata_dir DATADIR     # Path to tesseract/training/langdata directory.
#    --output_dir OUTPUTDIR     # Location of output traineddata file.
#    --overwrite                # Safe to overwrite files in output_dir. (Not tried here.)
#    --linedata_only            # Only generate training data for lstmtraining.
#    --run_shape_clustering     # Run shape clustering (use for Indic langs). (Not tried here.)
#    --exposures EXPOSURES      # A list of exposure levels to use (e.g. "-1 0 1"). (Effect not yet clear to the author.)
#
# OPTIONAL flags for input data. If unspecified we will look for them in
# the langdata_dir directory.
#    --training_text TEXTFILE   # Text to render and use for training.
#    --wordlist WORDFILE        # Word list for the language ordered by
#                               # decreasing frequency.
#
# OPTIONAL flag to specify location of existing traineddata files, required
# during feature extraction. If unspecified will use TESSDATA_PREFIX defined in
# the current environment.
#    --tessdata_dir TESSDATADIR     # Path to tesseract/tessdata directory.
#
# NOTE:
# The font names specified in --fontlist need to be recognizable by Pango using
# fontconfig. An easy way to list the canonical names of all fonts available on
# your system is to run text2image with --list_available_fonts and the
# appropriate --fonts_dir path.
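
Putting the usage text together, a run that only generates LSTM training data might look roughly like the sketch below. The language code, the font name, and every directory are placeholders, and the helper names `tesstrain_cmd` and `list_fonts_cmd` are hypothetical. The commands are printed rather than executed, so nothing here depends on Tesseract being installed.

```shell
#!/bin/sh
# Sketch: print a tesstrain.sh command built from the flags in the usage text.
# Every value below (eng, "Arial", the directories) is a placeholder.
LANG_CODE=eng
FONTS_DIR=/usr/share/fonts
LANGDATA_DIR=./langdata
TESSDATA_DIR=./tessdata
OUTPUT_DIR=./train_output

# Print the font-listing command mentioned in the NOTE, so that the
# canonical font names can be looked up before choosing --fontlist.
list_fonts_cmd() {
    printf 'text2image --list_available_fonts --fonts_dir %s\n' "$FONTS_DIR"
}

# Print a tesstrain.sh command for one font name passed as $1.
tesstrain_cmd() {
    printf 'tesstrain.sh --fonts_dir %s --lang %s --linedata_only' \
        "$FONTS_DIR" "$LANG_CODE"
    printf ' --langdata_dir %s --tessdata_dir %s --output_dir %s' \
        "$LANGDATA_DIR" "$TESSDATA_DIR" "$OUTPUT_DIR"
    printf ' --fontlist "%s"\n' "$1"
}

list_fonts_cmd
tesstrain_cmd "Arial"
```

If --tessdata_dir is left out, the usage text says the script falls back to TESSDATA_PREFIX from the environment, so that variable would need to be set instead.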