Paper translation: "Adapting the Tesseract Open Source OCR Engine for Multilingual OCR"

This post is about the changes made to Tesseract for multilingual OCR. I happen to be doing related work at the moment, so I worked through the paper and am sharing it here; reading it gave me some useful ideas for my own work. So far only the parts on OCR layout analysis, character preprocessing, and the construction of the classifier are covered; the post-processing part is not included yet because I have not needed it, and I will add it later if I do. If you are interested, you can download and read the paper yourself. Without further ado, here is the link.

Paper download link: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/35248.pdf


 

Abstract


We describe efforts to adapt the Tesseract open source OCR engine for multiple scripts and languages. Effort has been concentrated on enabling generic multi-lingual operation such that negligible customization is required for a new language beyond providing a corpus of text. Although change was required to various modules, including physical layout analysis, and linguistic post-processing, no change was required to the character classifier beyond changing a few limits. The Tesseract classifier has adapted easily to Simplified Chinese. Test results on English, a mixture of European languages, and Russian, taken from a random sample of books, show a reasonably consistent word error rate between 3.72% and 5.78%, and Simplified Chinese has a character error rate of only 3.77%.


Keywords


Tesseract, Multi-Lingual OCR.


1. Introduction


Research interest in Latin-based OCR faded away more than a decade ago, in favor of Chinese, Japanese, and Korean (CJK) [1,2], followed more recently by Arabic [3,4], and then Hindi [5,6]. These languages provide greater challenges specifically to classifiers, and also to the other components of OCR systems. Chinese and Japanese share the Han script, which contains thousands of different character shapes. Korean uses the Hangul script, which has several thousand more of its own, as well as using Han characters. The number of characters is one or two orders of magnitude greater than Latin. Arabic is mostly written with connected characters, and its characters change shape according to the position in a word. Hindi combines a small number of alphabetic letters into thousands of shapes that represent syllables. As the letters combine, they form ligatures whose shape only vaguely resembles the original letters. Hindi then combines the problems of CJK and Arabic, by joining all the symbols in a word with a line called the shiro-rekha.

Research approaches have used language-specific work-arounds to avoid the problems in some way, since that is simpler than trying to find a solution that works for all languages. For instance, the large character sets of Han, Hangul, and Hindi are mostly made up of a much smaller number of components, known as radicals in Han, Jamo in Hangul, and letters in Hindi. Since it is much easier to develop a classifier for a small number of classes, one approach has been to recognize the radicals [1, 2, 5] and infer the actual characters from the combination of radicals. This approach is easier for Hangul than for Han or Hindi, since the radicals don't change shape much in Hangul characters, whereas in Han, the radicals often are squashed to fit in the character and mostly touch other radicals. Hindi takes this a step further by changing the shape of the consonants when they form a conjunct consonant ligature. Another example of a more language-specific work-around is for Arabic, where it is difficult to determine the character boundaries to segment connected components into characters. A commonly used method is to classify individual vertical pixel strips, each of which is a partial character, and combine the classifications with a Hidden Markov Model that models the character boundaries [3].

Google is committed to making its services available in as many languages as possible [7], so we are also interested in adapting the Tesseract Open Source OCR Engine [8, 9] to many languages. This paper discusses our efforts so far in fully internationalizing Tesseract, and the surprising ease with which some of it has been possible. Our approach is to use language-generic methods, to minimize the manual effort to cover many languages.


2. Review Of Tesseract For Latin


 

Fig. 1 is a block diagram of the basic components of Tesseract. The new page layout analysis for Tesseract [10] was designed from the beginning to be language-independent, but the rest of the engine was developed for English, without a great deal of thought as to how it might work for other languages. After noting that the commercial engines at the time were strictly for black-on-white text, one of the original design goals of Tesseract was that it should recognize white-on-black (inverse video) text as easily as black-on-white. This led the design (fortuitously as it turned out) in the direction of connected component (CC) analysis and operating on outlines of the components. The first step after CC analysis is to find the blobs in a text region. A blob is a putative classifiable unit, which may be one or more horizontally overlapping CCs, and their inner nested outlines or holes. A problem is detecting inverse text inside a box vs. the holes inside a character. For English, there are very few characters (maybe © and ®) that have more than 2 levels of outline, and it is very rare to have more than 2 holes, so any blob that breaks these rules is "clearly" a box containing inverse characters, or even the inside or outside of a frame around black-on-white characters.
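To make the outline-nesting test concrete, here is a minimal Python sketch of the heuristic described above. The Outline/Blob classes and the depth and hole thresholds are invented for illustration; they are not Tesseract's actual data structures.

```python
# Hypothetical sketch of the outline-nesting heuristic described above.
# The Outline/Blob classes and thresholds are illustrative, not Tesseract's API.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Outline:
    children: List["Outline"] = field(default_factory=list)

    def depth(self) -> int:
        # 1 for the outline itself plus the deepest chain of nested outlines.
        return 1 + max((c.depth() for c in self.children), default=0)

@dataclass
class Blob:
    outlines: List[Outline]

    def max_nesting(self) -> int:
        return max((o.depth() for o in self.outlines), default=0)

    def hole_count(self) -> int:
        # Holes are the outlines nested directly inside a top-level outline.
        return sum(len(o.children) for o in self.outlines)

def looks_like_inverse_text_box(blob: Blob,
                                max_depth: int = 2,
                                max_holes: int = 2) -> bool:
    """For Latin text, more than 2 outline levels or more than 2 holes is
    taken as evidence that the blob is really a box of inverse text (or a
    frame around text), not a single character."""
    return blob.max_nesting() > max_depth or blob.hole_count() > max_holes
```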

 

Fig. 2 is a block diagram of the word recognizer. In most cases, a blob corresponds to a character, so the word recognizer first classifies each blob, and presents the results to a dictionary search to find a word in the combinations of classifier choices for each blob in the word. If the word result is not good enough, the next step is to chop poorly recognized characters, where this improves the classifier confidence. After the chopping possibilities are exhausted, a best-first search of the resulting segmentation graph puts back together chopped character fragments, or parts of characters that were broken into multiple CCs in the original image. At each step in the best-first search, any new blob combinations are classified, and the classifier results are given to the dictionary again. The output for a word is the character string that had the best overall distance-based rating, after weighting according to whether the word was in a dictionary and/or had a sensible arrangement of punctuation around it. For the English version, most of these punctuation rules were hard-coded.
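The best-first search over the segmentation graph can be sketched as a uniform-cost search in which adjacent blobs are grouped into candidate characters and the cheapest partial word is expanded first. The classify() function below is a toy stand-in for Tesseract's shape classifier and rating, and all costs are invented.

```python
# A minimal best-first search over a segmentation graph, in the spirit of the
# word recognizer described above. classify() and its costs are stand-ins for
# Tesseract's shape classifier and rating system, not the real thing.
import heapq
from typing import Callable, Sequence, Tuple

def best_first_segmentation(blobs: Sequence[str],
                            classify: Callable[[str], Tuple[str, float]],
                            max_group: int = 3) -> Tuple[str, float]:
    """Search over ways of grouping adjacent blobs into characters,
    expanding the cheapest partial segmentation first."""
    # Heap entries: (accumulated cost, next blob index, text so far).
    heap = [(0.0, 0, "")]
    while heap:
        cost, i, text = heapq.heappop(heap)
        if i == len(blobs):
            return text, cost          # first completed state is the best
        for j in range(i + 1, min(i + max_group, len(blobs)) + 1):
            piece = "".join(blobs[i:j])
            label, distance = classify(piece)
            heapq.heappush(heap, (cost + distance, j, text + label))
    return "", float("inf")

# Toy classifier: pretend the fragments 'r' + 'n' really form a broken 'm',
# which mirrors how chopped or broken pieces are reassembled by the search.
def toy_classify(piece: str) -> Tuple[str, float]:
    if piece == "rn":
        return "m", 0.1
    return piece, 0.3 * len(piece)

print(best_first_segmentation(["r", "n", "a"], toy_classify))  # ('ma', 0.4)
```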

 

The words in an image are processed twice. On the first pass, successful words, being those that are in a dictionary and are not dangerously ambiguous, are passed to an adaptive classifier for training. As soon as the adaptive classifier has sufficient samples, it can provide classification results, even on the first pass. On the second pass, words that were not good enough on pass 1 are processed for a second time, in case the adaptive classifier has gained more information since the first pass over the word.
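A rough sketch of that two-pass flow is below; the Word and classifier interfaces, the confidence test, and the sample threshold are all assumptions made for illustration.

```python
# Sketch of the two-pass flow, with made-up Word/classifier interfaces.
from typing import Iterable, List

class AdaptiveClassifier:
    def __init__(self, min_samples: int = 10):
        self.samples: List[tuple] = []
        self.min_samples = min_samples

    def train(self, word) -> None:
        # Pair each blob with the character it was recognized as.
        self.samples.extend(zip(word.blobs, word.text))

    def ready(self) -> bool:
        return len(self.samples) >= self.min_samples

    def reclassify(self, word) -> None:
        ...  # re-run classification using the adapted templates

def recognize_page(words: Iterable, static_recognize, dictionary) -> None:
    adaptive = AdaptiveClassifier()
    deferred = []
    # Pass 1: recognize everything; train the adaptive classifier only on
    # confident, unambiguous dictionary words.
    for word in words:
        static_recognize(word)
        if word.text in dictionary and not word.ambiguous:
            adaptive.train(word)
        else:
            deferred.append(word)
    # Pass 2: revisit the words that were not good enough the first time.
    if adaptive.ready():
        for word in deferred:
            adaptive.reclassify(word)
```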

From the foregoing description, there are clearly problems with this design for non-Latin languages, and some of the more complex issues will be dealt with in sections 3, 4 and 5, but some of the problems were simply complex engineering. For instance, the one byte code for the character class was inadequate, but should it be replaced by a UTF-8 string, or by a wider integer code? At first we adapted Tesseract for the Latin languages, and changed the character code to a UTF-8 string, as that was the most flexible, but that turned out to yield problems with the dictionary representation (see section 5), so we ended up using an index into a table of UTF-8 strings as the internal class code.
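The final choice, an index into a table of UTF-8 strings, can be illustrated with a small two-way table like the one below. The real engine keeps this mapping in its UNICHARSET class; the Python here is only a toy.

```python
# A minimal table mapping UTF-8 character strings to compact integer class
# ids, in the spirit of the internal class code described above.
class CharTable:
    def __init__(self):
        self._to_id = {}      # UTF-8 string -> int id
        self._to_str = []     # int id -> UTF-8 string

    def id_of(self, ch: str) -> int:
        """Return the class id for ch, adding it if it is new."""
        if ch not in self._to_id:
            self._to_id[ch] = len(self._to_str)
            self._to_str.append(ch)
        return self._to_id[ch]

    def str_of(self, class_id: int) -> str:
        return self._to_str[class_id]

table = CharTable()
ids = [table.id_of(c) for c in ["T", "e", "s", "s", "你", "好"]]
print(ids, table.str_of(ids[-1]))   # [0, 1, 2, 2, 3, 4] 好
```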


3. Layout Preprocessing


Several aspects of the "textord" (text-ordering) module of Tesseract required changes to make it more language-independent. This section discusses these changes.


3.1 Vertical Text Layout


Chinese, Japanese, and Korean, to a varying degree, all read text lines either horizontally or vertically, and often mix directions on a single page. This problem is not unique to CJK, as English language magazine pages often use vertical text at the side of a photograph or article to credit the photographer or author. Vertical text is detected by the page layout analysis. If a majority of the CCs on a tab-stop have both their left side on a left tab and their right side on a right tab, then everything between the tab-stops could be a line of vertical text. To prevent false-positives in tables, a further restriction requires vertical text to have a median vertical gap between CCs that is less than the mean width of the CCs. If the majority of CCs on a page are vertically aligned, the page is rotated by 90 degrees and page layout analysis is run again to reduce the chance of finding false columns in the vertical text. The minority originally horizontal text will then become vertical text in the rotated page, and the body of the text will be horizontal.
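A simplified version of this vertical-text test is sketched below; the CC fields, the tab-stop tolerance, and the exact form of the majority rule are assumptions for illustration.

```python
# Hypothetical sketch of the vertical-text test described above: the CC and
# tab-stop bookkeeping is invented for illustration.
from dataclasses import dataclass
from statistics import median, mean
from typing import List

@dataclass
class CC:
    left: int
    right: int
    top: int
    bottom: int   # top < bottom in image coordinates

def is_vertical_text_column(ccs: List[CC], left_tab: int, right_tab: int,
                            tol: int = 2) -> bool:
    if not ccs:
        return False
    # Majority of CCs must span from the left tab-stop to the right tab-stop.
    spanning = sum(1 for c in ccs
                   if abs(c.left - left_tab) <= tol
                   and abs(c.right - right_tab) <= tol)
    if spanning * 2 <= len(ccs):
        return False
    # Guard against tables: the median vertical gap between consecutive CCs
    # must be smaller than the mean CC width.
    ccs = sorted(ccs, key=lambda c: c.top)
    gaps = [b.top - a.bottom for a, b in zip(ccs, ccs[1:])]
    widths = [c.right - c.left for c in ccs]
    return bool(gaps) and median(gaps) < mean(widths)
```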

As originally designed, Tesseract had no capability to handle vertical text, and there are a lot of places in the code where some assumption is made over characters being arranged on a horizontal text line. Fortunately, Tesseract operates on outlines of CCs in a signed integer coordinate space, which makes rotations by multiples of 90 degrees trivial, and it doesn't care whether the coordinates are positive or negative. The solution is therefore simply to differentially rotate the vertical and horizontal text blocks on a page, and rotate the characters as needed for classification. Fig. 3 shows an example of this for English text. The page in Fig. 3(a) contains vertical text at the lower-right, which is detected in Fig. 3(b), along with the rest of the text. In Fig. 4, the vertical text region is rotated 90 degrees clockwise (centered at the bottom-left of the image), so it appears well below the original image, but in horizontal orientation.

Fig. 5 shows an example for Chinese text. The mainly-vertical body text is rotated out of the image, to make it horizontal, and the header, which was originally horizontal, stays where it started. The vertical and horizontal text blocks are separated in coordinate space, but all Tesseract cares about is that the text lines are horizontal. The data structure for a text block records the rotations that have been performed on a block, so that the inverse rotation can be applied to the characters as they are passed to the classifier, to make them upright. Automatic orientation detection [12] can be used to ensure that the text is upright when passed to the classifier, as vertical text could have characters that are in at least 3 different orientations relative to the reading direction. After Tesseract processes the rotated text blocks, the coordinate space is re-rotated back to the original image orientation so that reported character bounding boxes are still accurate.
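Because the outlines live in signed integer coordinates, a 90-degree rotation is exact and reversible, so a block only needs to remember how many quarter turns were applied. A minimal sketch, with an invented Block record:

```python
# Sketch of 90-degree rotation bookkeeping on integer outline coordinates.
# The Block record and its fields are illustrative only.
from typing import List, Tuple

Point = Tuple[int, int]

def rotate90(points: List[Point], quarter_turns: int) -> List[Point]:
    """Rotate integer points counter-clockwise by quarter_turns * 90 degrees.
    Exact for integers, so it can be undone without loss."""
    q = quarter_turns % 4
    out = points
    for _ in range(q):
        out = [(-y, x) for (x, y) in out]
    return out

class Block:
    def __init__(self, outlines: List[List[Point]]):
        self.outlines = outlines
        self.quarter_turns = 0   # rotation applied to make text lines horizontal

    def make_horizontal(self, quarter_turns: int) -> None:
        self.outlines = [rotate90(o, quarter_turns) for o in self.outlines]
        self.quarter_turns = quarter_turns

    def to_image_coords(self, points: List[Point]) -> List[Point]:
        # Undo the stored rotation so reported boxes match the original image.
        return rotate90(points, -self.quarter_turns)
```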


3.2 Text-line and Word Finding


The original Tesseract text-line finder [11] assumed that CCs that make up characters mostly vertically overlap the bulk of the text line. The one real exception is i dots. For general languages this is not true, since many languages have diacritics that sit well above and/or below the bulk of the text-line. For Thai for example, the distance from the body of the text line to the diacritics can be quite extreme. The page layout analysis for Tesseract is designed to simplify text-line finding by sub-dividing text regions into blocks of uniform text size and line spacing. This makes it possible to force-fit a line-spacing model, so the text-line finding has been modified to take advantage of this. The page layout analysis also estimates the residual skew of the text regions, which means the text-line finder no longer has to be insensitive to skew.

 

The modified text-line finding algorithm works independently for each text region from layout analysis, and begins by searching the neighborhood of small CCs (relative to the estimated text size) to find the nearest body-text-sized CC. If there is no nearby body-text-sized CC, then a small CC is regarded as likely noise, and discarded. (An exception has to be made for dotted/dashed leaders, as typically found in a table of contents.) Otherwise, a bounding box that contains both the small CC and its larger neighbor is constructed and used in place of the bounding box of the small CC in the following projection.

A "horizontal" projection profile is constructed, parallel to the estimated skewed horizontal, from the bounding boxes of the CCs using the modified boxes for small CCs. A dynamic programming algorithm then chooses the best set of segmentation points in the

projection profile. The cost function is the sum of profile entries at the cut points plus a measure of the variance of the spacing between them. For most text, the sum of profile entries is zero,and the variance helps to choose the most regular line-spacing.For more complex situations, the variance and the modified bounding boxes for small CCs combine to help direct the line cuts to maximize the number of diacriticals that stay with their

appropriate body characters.
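A much-simplified dynamic program over the projection profile is shown below. Penalizing each gap's deviation from an estimated line spacing stands in for the variance term described above; the toy profile, weight, and search window are invented for the example.

```python
# A simplified dynamic program for choosing text-line cut positions from a
# projection profile. A per-gap spacing penalty stands in for the variance
# term in the paper; all names and numbers are illustrative.
from typing import List

def choose_cuts(profile: List[int], est_spacing: float,
                weight: float = 0.5) -> List[int]:
    n = len(profile)
    INF = float("inf")
    best = [(INF, -1)] * n              # (cost of cuts ending at y, previous cut)
    best[0] = (float(profile[0]), -1)   # force a cut at the very top
    for y in range(1, n):
        for prev in range(max(0, y - 2 * int(est_spacing)), y):
            if best[prev][0] == INF:
                continue
            gap_penalty = weight * (y - prev - est_spacing) ** 2
            cost = best[prev][0] + profile[y] + gap_penalty
            if cost < best[y][0]:
                best[y] = (cost, prev)
    # Force a cut at the very bottom and back-track through the best chain.
    cuts, y = [], n - 1
    while y != -1:
        cuts.append(y)
        y = best[y][1]
    return sorted(cuts)

# Toy profile: two blank gaps (zeros) around two dense text bands.
profile = [0, 0, 5, 9, 7, 0, 0, 6, 8, 5, 0, 0]
print(choose_cuts(profile, est_spacing=5))   # [0, 5, 11]
```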

Once the cut lines have been determined, whole connected components are placed in the text-line that they vertically overlap the most (still using the modified boxes), except where a component strongly overlaps multiple lines. Such CCs are presumed to be either characters from multiple lines that touch, and so need cutting at the cut line, or drop-caps, in which case they are placed in the top overlapped line. This algorithm works well, even for Arabic.

After text lines are extracted, the blobs on a line are organized into recognition units. For Latin languages, the logical recognition units correspond to space-delimited words, which is naturally suited for a dictionary-based language model. For languages that are not space-delimited, such as Chinese, it is less clear what the corresponding recognition unit should be. One possibility is to treat each Chinese symbol as a recognition unit. However, given that Chinese symbols are composed of multiple glyphs (radicals), it would be difficult to get the correct character segmentation without the help of recognition. Considering the limited amount of information that is available at this early stage of processing, the solution is to break up the blob sequence at punctuations, which can be detected quite reliably based on their size and spacing to the next blob. Although this does not completely resolve the issue of a very long blob sequence, which is a crucial factor in determining the efficiency and quality when searching the segmentation graph, this would at least reduce the lengths of recognition units into more manageable sizes.
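The punctuation-based splitting can be sketched as follows; the blob fields and the size and gap thresholds are assumptions, not Tesseract internals.

```python
# Illustrative splitter that breaks a blob sequence into recognition units at
# punctuation-like blobs, using only size and spacing.
from dataclasses import dataclass
from typing import List

@dataclass
class Blob:
    left: int
    right: int
    height: int

def split_at_punctuation(blobs: List[Blob], body_height: int,
                         small: float = 0.4,
                         wide_gap: float = 0.5) -> List[List[Blob]]:
    units, current = [], []
    for i, b in enumerate(blobs):
        current.append(b)
        gap_after = (blobs[i + 1].left - b.right) if i + 1 < len(blobs) else 0
        is_small = b.height < small * body_height
        has_space = gap_after > wide_gap * body_height
        if is_small and has_space:   # looks like a comma/period before a gap
            units.append(current)
            current = []
    if current:
        units.append(current)
    return units
```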

As described in Section 2, detection of white-on-black text is based on the nesting complexity of outlines. This same process also rejects non-text, including halftone noise, black regions on the side, or large container boxes as in sidebar or reversed-video region. Part of the filtering is based on a measure of the topological complexity of the blobs, estimated based on the number of interior components, layers of nested holes, perimeter to area ratio, and so on. However, the complexity of Traditional Chinese characters, by any measure, often exceeds that of an English word enclosed in a box. The solution is to apply a different complexity threshold for different languages, and rely on subsequent analysis to recover any incorrectly rejected blobs.


 

3.3 Estimating x-height in Cyrillic Text


After completing the text line finding step and organizing blocks of blobs into rows, Tesseract estimates the x-height for each text line. The x-height estimation algorithm first determines the bounds on the maximum and minimum acceptable x-height based on the initial line size computed for the block. Then, for each line separately, the heights of the bounding boxes of the blobs occurring on the line are quantized and aggregated into a histogram. From this histogram the x-height finding algorithm looks for the two most commonly occurring height modes that are far enough apart to be the potential x-height and ascender height. In order to achieve robustness against the presence of some noise, the algorithm ensures that the height modes picked to be the x-height and ascender height have a sufficient number of occurrences relative to the total number of blobs on the line.
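In outline, the per-line estimate amounts to a histogram-mode search like the sketch below; the bucket size, mode-separation ratio, and minimum fraction are invented for the example.

```python
# A simplified version of the x-height estimate described above: quantize
# blob heights, find two common height modes that are far enough apart, and
# require each mode to cover a minimum fraction of the blobs on the line.
from collections import Counter
from typing import List, Optional, Tuple

def estimate_x_height(heights: List[int], bucket: int = 2,
                      min_ratio: float = 1.25,
                      min_fraction: float = 0.1) -> Optional[Tuple[int, int]]:
    if not heights:
        return None
    hist = Counter((h // bucket) * bucket for h in heights)
    modes = [h for h, n in hist.most_common()
             if n >= min_fraction * len(heights)]
    for x in modes:
        for asc in modes:
            if asc > min_ratio * x:      # ascender mode clearly taller
                return x, asc            # (x-height, ascender height)
    return None

print(estimate_x_height([20, 21, 20, 19, 30, 31, 20, 29]))   # (20, 30)
```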

 

This algorithm works quite well for most Latin fonts. However, when applied as-is to Cyrillic, Tesseract fails to find the correct x-height for most of the lines. As a result, on a data set of Russian books the word error-rate of Tesseract turns out to be 97%. The reason for such a high error rate is two-fold. First of all the ascender statistics in Cyrillic fonts differ significantly from Latin ones. Simply lowering the threshold for the expected number of ascenders per line is not an effective solution, since it is not infrequent that a line of text would contain one or no ascender letters. The second reason for such poor performance is a high degree of case ambiguity in Cyrillic fonts. For example, out of 33 upper-case modern Russian letters only 6 have a lower-case shape that is significantly different from the upper-case in most fonts. Thus, when working with Cyrillic, Tesseract can be easily misled by the incorrect x-height information and would readily recognize lower-case letters as upper-case.

 

Our approach to fixing the x-height problem for Cyrillic was to adjust the minimum expected number of ascenders on the line, take into account the descender statistics, and use x-height information from the neighboring lines in the same block of text more effectively (a block is a text region identified by the page layout analysis that has a consistent size of text blobs and line spacing, and therefore is likely to contain letters of the same or similar font sizes).

For a given block of text, the improved x-height finding algorithm first tries to find the x-height of each line individually. Based on the result of this computation each line falls into one of the following four categories: (1) the lines where the x-height and ascender modes were found, (2) where descenders were found, (3) where a common blob height that could be used as an estimate of either cap-height or x-height was found, (4) the lines where none of the above were identified (i.e. most likely lines containing noise with blobs that are too small, too large or just inconsistent in size). If any lines from the first category with reliable x-height and ascender height estimates were found in the block, their height estimates are used for the lines in the second category (lines with descenders present) that have a similar x-height estimate. The same x-height estimate is utilized for those lines in the third category (no ascenders or descenders found), whose most common height is within a small margin of the x-height estimate. If the line-by-line approach does not result in finding any reliable x-height and ascender height modes, the statistics for all the blobs in the text block are aggregated and the same search for x-height and ascender height modes is repeated using this cumulative information.
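The fallback logic can be sketched roughly as below; the four line categories follow the paper, but the margins, helper names, and line objects are invented.

```python
# Sketch of the per-line x-height strategy for Cyrillic described above:
# classify each line, reuse reliable estimates within the block, then fall
# back to block-level statistics. Categories and helpers are illustrative.
from enum import Enum, auto
from typing import List, Optional

class LineKind(Enum):
    X_AND_ASCENDER = auto()   # (1) x-height and ascender modes found
    DESCENDERS = auto()       # (2) descenders found
    SINGLE_MODE = auto()      # (3) one common height: cap- or x-height
    UNKNOWN = auto()          # (4) noise / inconsistent blob sizes

def estimate_from_heights(heights: List[int]) -> Optional[float]:
    # Crude block-level fallback: the most common blob height.
    return float(max(set(heights), key=heights.count)) if heights else None

def resolve_block_x_heights(lines) -> Optional[float]:
    reliable = [ln.x_height for ln in lines if ln.kind == LineKind.X_AND_ASCENDER]
    if reliable:
        block_x = sum(reliable) / len(reliable)
        for ln in lines:
            if (ln.kind == LineKind.DESCENDERS
                    and abs(ln.x_height - block_x) < 0.2 * block_x):
                ln.x_height = block_x
            elif (ln.kind == LineKind.SINGLE_MODE
                    and abs(ln.common_height - block_x) < 0.1 * block_x):
                ln.x_height = block_x
        return block_x
    # No reliable line: aggregate all blob heights in the block and retry.
    all_heights = [h for ln in lines for h in ln.blob_heights]
    return estimate_from_heights(all_heights)
```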

As the result of the improvements described above the word error rate on a test set of Russian books was reduced to 6%. After the improvements the test set still contained some errors due to the failure to estimate the correct x-height of the text line. However, in many of such cases even a human reader would have to use the information from the neighboring blocks of text or knowledge about the common organization of the books to determine whether the given line is upper- or lower-case.


4. Character / Word Recognition


One of the main challenges to overcome in adapting Tesseract for multilingual OCR is extending what is primarily designed for alphabetical languages to handle ideographical languages like Chinese and Japanese. These languages are characterized by having a large set of symbols and lacking clear word boundaries, which pose serious tests for a search strategy and classification engine designed for well delimited words from small alphabets. We will discuss classification of a large set of ideographs in the next section, and describe the modifications required to address the search issue first.


4.1 Segmentation and Search


As mentioned in section 3.2, for non-space-delimited languages like Chinese, recognition units that form the equivalent of words in western languages now correspond to punctuation-delimited phrases. Two problems need to be considered to deal with these phrases: they involve deeper search than typical words in Latin and they do not correspond to entries in the dictionary. Tesseract uses a best-first-search strategy over the segmentation graph, which grows exponentially with the length of the blob sequence. While this approach worked on shorter Latin words with fewer segmentation points and a termination condition when the result is found in the dictionary, it often exhausts available resources when classifying a Chinese phrase. To resolve this issue, we need to dramatically reduce the number of segmentation points evaluated in the permutation and devise a termination condition that is easier to meet.

In order to reduce the number of segmentation points, we incorporate the constraint of roughly constant character widths in a mono-spaced language like Chinese and Japanese. In these languages, characters mostly have similar aspect ratios, and are either full-pitch or half-pitch in their positioning. Although the normalized width distribution would vary across fonts, and the spacing would shift due to line justification and inclusion of digits or Latin words, which is not uncommon, by and large these constraints provide a strong guideline for whether a particular segmentation point is compatible with another. Therefore, using the deviation from the segmentation model as a cost, we can eliminate a lot of implausible segmentation states and effectively reduce the search space. We also use this estimate to prune the search space based on the best partial solution, making it effectively a beam search. This also provides a termination condition when no further expansion is likely to produce a better solution.
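A toy version of the width-model cost and the beam pruning it enables is shown below; the pitch estimate, weights, and beam width are assumptions.

```python
# Toy mono-spaced width constraint: score a candidate segmentation by how far
# the resulting character widths deviate from full- or half-pitch spacing.
from typing import List

def width_model_cost(widths: List[float], pitch: float) -> float:
    """Sum of squared deviations of each candidate character width from the
    nearest of full pitch and half pitch."""
    cost = 0.0
    for w in widths:
        cost += min((w - pitch) ** 2, (w - 0.5 * pitch) ** 2)
    return cost

def prune_beam(candidates: List[dict], beam_width: int = 8) -> List[dict]:
    """Keep only the cheapest partial segmentations (classifier rating plus
    width-model cost), turning best-first search into a beam search."""
    ranked = sorted(candidates, key=lambda c: c["rating"] + c["width_cost"])
    return ranked[:beam_width]

seg = {"rating": 1.2,
       "width_cost": width_model_cost([32, 31, 16, 33], pitch=32)}
print(seg["width_cost"])   # small: the widths fit full- and half-pitch well
```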

Another powerful constraint is the consistency of character script within a phrase. As we include shape classes from multiple scripts, confusion errors between characters across different scripts become inevitable. Although we can establish the dominant script or language for the page, we must allow for Latin characters as well, since the occurrence of English words inside foreign language books is so common. Under the assumption that characters within a recognition unit would have the same script, we would promote a character interpretation if it improves the overall script consistency of the whole unit. However, blindly promoting script characters based on prior could actually hurt the performance if the word or phrase is truly mixed script. So we apply the constraint only if over half the characters in the top interpretation belong to the same script, and the adjustment is weighted against the shape recognition score, like any other permutation.
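The script-consistency adjustment might be sketched like this, using Unicode character names as a crude script lookup; the majority rule follows the text, while the boost factor is invented.

```python
# Sketch of the script-consistency adjustment: only boost a character choice
# when more than half the characters in the current best interpretation
# already agree on one script.
import unicodedata
from collections import Counter
from typing import List

def dominant_script(chars: List[str]) -> str:
    def script_of(ch: str) -> str:
        name = unicodedata.name(ch, "")
        return "Han" if "CJK" in name else "Latin"
    counts = Counter(script_of(c) for c in chars)
    script, n = counts.most_common(1)[0]
    return script if n * 2 > len(chars) else ""   # "" = no clear majority

def adjusted_rating(rating: float, chars: List[str], char_script: str,
                    boost: float = 0.9) -> float:
    """Improve (reduce) the rating of a character whose script matches the
    dominant script of the whole recognition unit."""
    majority = dominant_script(chars)
    return rating * boost if majority and char_script == majority else rating

print(dominant_script(list("谷歌OCR")))   # 'Latin' (3 of 5 characters)
```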


4.2 Shape Classification


Classifiers for large numbers of classes are still a research problem even today, especially when they are required to operate at the speeds needed for OCR [13, 14]. The curse of dimensionality is largely to blame. The Tesseract shape classifier works surprisingly well on 5000 Chinese characters without requiring any major modifications, so it seems to be well suited to large class-size problems. This result deserves some explanation, so in this section we describe the Tesseract shape classifier.

The features are components of a polygonal approximation of the outline of a shape. In training, a 4-dimensional feature vector of (x, y-position, direction, length) is derived from each element of the polygonal approximation, and clustered to form prototypical feature vectors. (Hence the name: Tesseract.) In recognition, the elements of the polygon are broken into shorter pieces of equal length, so that the length dimension is eliminated from the feature vector. Multiple short features are matched against each prototypical feature from training, which makes the classification process more robust against broken characters.
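The two feature forms can be illustrated as follows: a training feature per polygon edge, and recognition features made by cutting the same edge into short pieces of fixed length. The step size and coordinate handling here are invented for illustration.

```python
# Rough illustration of the feature scheme described above: training features
# are (x, y, direction, length) per polygon edge; at recognition time each
# edge is cut into short pieces so the length dimension disappears.
import math
from typing import List, Tuple

Point = Tuple[float, float]

def edge_feature(p0: Point, p1: Point) -> Tuple[float, float, float, float]:
    """Training-time feature for one polygon edge: midpoint, direction, length."""
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    return ((p0[0] + p1[0]) / 2, (p0[1] + p1[1]) / 2,
            math.atan2(dy, dx), math.hypot(dx, dy))

def short_features(p0: Point, p1: Point,
                   step: float = 4.0) -> List[Tuple[float, float, float]]:
    """Recognition-time features: the same edge broken into equal short pieces,
    each described only by (x, y, direction)."""
    _, _, theta, length = edge_feature(p0, p1)
    n = max(1, int(length / step))
    feats = []
    for i in range(n):
        t = (i + 0.5) / n
        feats.append((p0[0] + t * (p1[0] - p0[0]),
                      p0[1] + t * (p1[1] - p0[1]), theta))
    return feats

print(len(short_features((0, 0), (0, 20))))   # a 20-unit edge -> 5 short features
```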

Fig.7(a) shows an example prototype of the letter ‘h’ for the font Times Roman. The green line-segments represent cluster means of significant clusters that contain samples from almost every sample of ‘h’ in Times Roman. Blue segments are cluster means that were merged with another cluster to form a significant cluster. Magenta segments were not used, as they matched an existing significant cluster. Red segments did not contain enough samples to be significant, and could not be merged with any neighboring cluster to form a significant cluster.

Fig. 7(b) shows how the shorter features of the unknown match against the prototype to achieve insensitivity to broken characters. The short, thick lines are the features of the unknown, being a broken 'h', and the longer lines are the prototype features. Colors represent match quality: black -> good, magenta -> reasonable, cyan -> poor, and yellow -> no match. The vertical prototypes are all well matched, despite the fact that the h is broken.

The shape classifier operates in two stages. The first stage, called the class pruner, reduces the character set to a short-list of 1-10 characters, using a method closely related to Locality Sensitive Hashing (LSH) [13]. The final stage computes the distance of the unknown from the prototypes of the characters in the short-list.

Originally designed as a simple and vital time-saving optimization, the class pruner partitions the high-dimensional feature space, by considering each 3-D feature individually. In place of the hash table of LSH, there is a simple look-up table, which returns a vector of integers in the range [0, 3], one for each class in the character set, with the value representing the approximate goodness of match of that feature to a prototype of the character class. The vector results are summed across all features of the unknown, and the classes that have a total score within a fraction of the highest are returned as the shortlist to be classified by the second stage. The class pruner is relatively fast, but its time scales linearly with the number of classes and also with the number of features.
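A small model of the class pruner described above; the lookup-table contents are fabricated, but the scoring in [0, 3], the summation over features, and the keep-within-a-fraction rule follow the text.

```python
# A small model of the class pruner: each quantized 3-D feature indexes a
# table row of per-class scores in [0, 3]; rows are summed over all features
# of the unknown, and classes within a fraction of the best total survive.
from typing import Dict, List, Tuple

Feature = Tuple[int, int, int]          # quantized (x, y, direction)

def class_pruner(features: List[Feature],
                 table: Dict[Feature, List[int]],
                 num_classes: int,
                 keep_fraction: float = 0.8) -> List[int]:
    totals = [0] * num_classes
    for f in features:
        row = table.get(f, [0] * num_classes)
        for cls, score in enumerate(row):
            totals[cls] += score
    best = max(totals)
    return [cls for cls, t in enumerate(totals) if t >= keep_fraction * best]

# Two classes, three features: class 1 matches the features better.
table = {(1, 1, 0): [1, 3], (2, 1, 0): [2, 3], (3, 1, 1): [0, 2]}
print(class_pruner([(1, 1, 0), (2, 1, 0), (3, 1, 1)], table, num_classes=2))  # [1]
```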

