Translation and Notes -- FOTS: Fast Oriented Text Spotting with a Unified Network

Notes

FOTS is a fast, end-to-end framework that integrates text detection and recognition. Compared with other two-stage methods, FOTS is considerably faster: the two tasks share trained features and supervise each other in a complementary way, which squeezes out most of the time spent on feature extraction.
[Figure 1 of the paper: runtime comparison]
In that figure, the blue boxes correspond to FOTS and the red boxes to other two-stage methods; FOTS takes roughly half the time of the two-stage pipeline.
The overall structure of FOTS consists of four parts (a high-level sketch of how they connect follows this list):

  1. Shared convolutions
    The backbone of the shared network is ResNet-50. Inspired by FPN, low-level feature maps are concatenated with high-level semantic feature maps, and the feature maps produced by the shared convolutions have 1/4 the resolution of the input image.
    In other words, FOTS uses ResNet-50 as its base network and a U-Net-like sharing scheme in the shared convolution layers to fuse low-level and high-level features, the same feature-sharing strategy as in EAST. The final feature map is 1/4 the size of the original image.
    [Figure 3 of the paper: architecture of the shared convolutions]

  2. The text detection branch
    A fully convolutional network is adopted as the text detector. Since there are many small text boxes in natural scene images, the feature maps are upscaled from 1/32 to 1/4 of the original input size in the shared convolutions. After the shared features are extracted, one convolution is applied to output dense per-pixel word predictions. The first channel computes the probability of each pixel being a positive sample; as in EAST, pixels inside a shrunk version of the original text region are considered positive. For each positive sample, the following 4 channels predict its distances to the top, bottom, left and right sides of the bounding box that contains the pixel, and the last channel predicts the orientation of that bounding box. Final detection results are produced by applying thresholding and NMS to these positive samples.
    This part is the same as in EAST. The loss function combines a classification loss (cross entropy) and a coordinate regression loss (IoU loss); the balancing factor used in the experiments is $\gamma_{reg}=1$.

  3. The RoIRotate operation
    Oriented text blocks are converted into regular, axis-aligned ones through an affine transformation.
    In this work the output height is fixed and the aspect ratio is kept unchanged to handle variation in text length. In contrast to RRoI pooling, which warps a rotated region into a fixed-size one via max pooling, RoIRotate uses bilinear interpolation to compute the output values. This avoids misalignment between the RoI and the extracted features and keeps the length of the output features variable, which is better suited to text recognition. The process has two steps:
    ① compute the affine transformation parameters from the predicted or ground-truth coordinates of the text proposals;
    ② apply the affine transformation to the shared feature map of each region separately, obtaining canonical horizontal feature maps of the text regions.

  4. The text recognition branch
    The text recognition branch predicts text labels from the region features extracted by the shared convolutions and transformed by RoIRotate. Considering the length of the label sequences in text regions, the input features to the LSTM are down-sampled only twice along the width axis by the shared convolutions on the original image; otherwise the discriminable features in compact text regions, especially those of narrow characters, would be eliminated. The branch consists of VGG-like sequential convolutions, poolings that reduce only along the height axis, one bi-directional LSTM, one fully-connected layer, and a final CTC decoder.
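A minimal sketch of how these four parts fit together at inference time is given below. Every argument is a hypothetical callable standing in for one of the components described above (including a `decode_and_nms` helper for thresholding and NMS); none of these names come from an actual API.

```python
def fots_forward(image, shared_conv, detector, decode_and_nms, roirotate, recognizer):
    """High-level FOTS pipeline sketch; all arguments are hypothetical callables."""
    feats = shared_conv(image)                   # shared features at 1/4 resolution
    score, dists, angle = detector(feats)        # dense per-pixel text predictions
    boxes = decode_and_nms(score, dists, angle)  # thresholding + NMS over positive pixels
    rois = roirotate(feats, boxes)               # axis-aligned, fixed-height region features
    texts = recognizer(rois)                     # sequence recognition with CTC decoding
    return boxes, texts
```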

1.FOTS: Fast Oriented Text Spotting with a Unified Network

2.Abstract

Incidental scene text spotting is considered one of the most difficult and valuable challenges in the document analysis community. Most existing methods treat text detection and recognition as separate tasks. In this work, we propose a unified end-to-end trainable Fast Oriented Text Spotting (FOTS) network for simultaneous detection and recognition, sharing computation and visual information among the two complementary tasks.

Specifically, RoIRotate is introduced to share convolutional features between detection and recognition. Benefiting from the convolution sharing strategy, our FOTS has little computation overhead compared to the baseline text detection network, and the joint training method learns more generic features to make our method perform better than these two-stage methods.

Experiments on ICDAR 2015, ICDAR 2017 MLT, and ICDAR 2013 datasets demonstrate that the proposed method outperforms state-of-the-art methods significantly, which further allows us to develop the first real-time oriented text spotting system which surpasses all previous state-of-the-art results by more than 5% on the ICDAR 2015 text spotting task while keeping 22.6 fps.

3.Introduction

Reading text in natural images has attracted increasing attention in the computer vision community [49, 43, 53, 44, 14, 15, 34], due to its numerous practical applications in document analysis, scene understanding, robot navigation, and image retrieval. Although previous works have made significant progress in both text detection and text recognition, it is still challenging due to the large variance of text patterns and highly complicated background.

The most common way in scene text reading is to divide it into text detection and text recognition, which are handled as two separate tasks [20, 34]. Deep learning based approaches have become dominant in both parts. In text detection, usually a convolutional neural network is used to extract feature maps from a scene image, and then different decoders are used to decode the regions [49, 43, 53].

While in text recognition, a network for sequential prediction is conducted on top of text regions, one by one [44, 14]. It leads to heavy time cost especially for images with a number of text regions. Another problem is that it ignores the correlation in visual cues shared in detection and recognition. A single detection network cannot be supervised by labels from text recognition, and vice versa.

In this paper, we propose to simultaneously consider text detection and recognition. It leads to the fast oriented text spotting system (FOTS) which can be trained end-to-end. In contrast to previous two-stage text spotting, our method learns more generic features through a convolutional neural network, which are shared between text detection and text recognition, and the supervision from the two tasks is complementary. Since feature extraction usually takes most of the time, it shrinks the computation to a single detection network, shown in Fig. 1. The key to connecting detection and recognition is RoIRotate, which gets proper features from feature maps according to the oriented detection bounding boxes.

[Figure 1: speed comparison between FOTS and a two-stage text spotting pipeline]

The architecture is presented in Fig. 2. Feature maps are firstly extracted with shared convolutions. The fully convolutional network based oriented text detection branch is built on top of the feature map to predict the detection bounding boxes. The RoIRotate operator extracts text proposal features corresponding to the detection results from the feature map. The text proposal features are then fed into Recurrent Neural Network (RNN) encoder and Connectionist Temporal Classification (CTC) decoder [9] for text recognition.

Since all the modules in the network are differentiable, the whole system can be trained end-to-end. To the best of our knowledge, this is the first end-to-end trainable framework for oriented text detection and recognition. We find that the network can be easily trained without complicated post-processing and hyper-parameter tuning.
[Figure 2: overall architecture of FOTS]

The contributions are summarized as follows.

  • We propose an end-to-end trainable framework for fast oriented text spotting. By sharing convolutional features, the network can detect and recognize text simultaneously with little computation overhead, which leads to real-time speed.

  • We introduce the RoIRotate, a new differentiable operator to extract the oriented text regions from convolutional feature maps. This operation unifies text detection and recognition into an end-to-end pipeline.

Without bells and whistles, FOTS significantly surpasses state-of-the-art methods on a number of text detection and text spotting benchmarks, including ICDAR 2015 [26], ICDAR 2017 MLT [1] and ICDAR 2013 [27].

4.Related Work

Text spotting is an active topic in computer vision and document analysis. In this section, we present a brief introduction to related works including text detection, text recognition and text spotting methods that combine both.

4.1.Text Detection

Most conventional methods of text detection consider text as a composition of characters. These character based methods first localize characters in an image and then group them into words or text lines. Sliding-window-based methods [22, 28, 3, 54] and connected-components based methods [18, 40, 2] are two representative categories in conventional methods.

Recently, many deep learning based methods are proposed to directly detect words in images.

Tian et al. [49] employ a vertical anchor mechanism to predict the fixed-width sequential proposals and then connect them.

Ma et al. [39] introduce a novel rotation-based framework for arbitrarily oriented text by proposing Rotation RPN and Rotation RoI pooling.

Shi et al. [43] first predict text segments and then link them into complete instances using the linkage prediction.

With dense predictions and one step post processing, Zhou et al. [53] and He et al. [15] propose deep direct regression methods for multi-oriented scene text detection.

4.2.Text Recognition

Generally, scene text recognition aims to decode a sequence of labels from regularly cropped but variable-length text images. Most previous methods [8, 30] capture individual characters and refine misclassified characters later. Apart from character level approaches, recent text region recognition approaches can be classified into three categories: word classification based, sequence-to-label decode based and sequence-to-sequence model based methods.

Jaderberg et al. [19] pose the word recognition problem as a conventional multi-class classification task with a large number of class labels (about 90K words).

Su et al. [48] frame text recognition as a sequence labelling problem, where an RNN is built upon HOG features and CTC is adopted as the decoder.

Shi et al. [44] and He et al. [14] propose deep recurrent models to encode the max-out CNN features and adopt CTC to decode the encoded sequence.

Fujii et al. [5] propose an encoder and summarizer network to produce input sequence for CTC.

Lee et al. [31] use an attention-based sequence-to-sequence structure to automatically focus on certain extracted CNN features and implicitly learn a character level language model embodied in RNN.

To handle irregular input images, Shi et al. [45] and Liu et al. [37] introduce spatial attention mechanism to transform a distorted text region into a canonical pose suitable for recognition.

4.3.Text Spotting

Most previous text spotting methods first generate text proposals using a text detection model and then recognize them with a separate text recognition model.

Jaderberg et al. [20] first generate holistic text proposals with a high recall using an ensemble model, and then use a word classifier for word recognition.

Gupta et al. [10] train a Fully-Convolutional Regression Network for text detection and adopt the word classifier in [19] for text recognition.

Liao et al. [34] use an SSD [36] based method for text detection and CRNN [44] for text recognition.

Recently Li et al. [33] propose an end-to-end text spotting method, which uses a text proposal network inspired by RPN [41] for text detection and LSTM with attention mechanism [38, 45, 3] for text recognition.

Our method has two main advantages compared to them: (1) We introduce RoIRotate and use a totally different text detection algorithm to solve more complicated and difficult situations, while their method is only suitable for horizontal text. (2) Our method is much better than theirs in terms of speed and performance, and in particular, the nearly cost-free text recognition step enables our text spotting system to run at real-time speed, while their method takes approximately 900ms to process an input image of 600×800 pixels.


5.Methodology

FOTS is an end-to-end trainable framework that detects and recognizes all words in a natural scene image simultaneously. It consists of four parts: shared convolutions, the text detection branch, RoIRotate operation and the text recognition branch.

5.1.Overall Architecture

An overview of our framework is illustrated in Fig. 2. The text detection branch and recognition branch share convolutional features, and the architecture of the shared network is shown in Fig. 3. The backbone of the shared network is ResNet-50 [12]. Inspired by FPN [35], we concatenate low-level feature maps and high-level semantic feature maps. The resolution of feature maps produced by shared convolutions is 1/4 of the input image.
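The following is a minimal sketch of such a shared backbone, assuming a ResNet-50 from torchvision and FPN-style top-down merging by upsampling and concatenation down to 1/4 resolution. The channel widths and the 1×1 merge convolutions are illustrative assumptions, not the exact configuration of Fig. 3.

```python
import torch
import torch.nn.functional as F
from torch import nn
from torchvision.models import resnet50

class SharedConv(nn.Module):
    """Sketch: ResNet-50 stages merged top-down by upsampling + concatenation,
    producing a feature map at 1/4 of the input resolution."""

    def __init__(self):
        super().__init__()
        r = resnet50()
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)   # 1/4
        self.layer1, self.layer2 = r.layer1, r.layer2                  # 1/4, 1/8
        self.layer3, self.layer4 = r.layer3, r.layer4                  # 1/16, 1/32
        # 1x1 convolutions that reduce channels after each concatenation
        # (illustrative widths, not the authors' exact choice).
        self.merge3 = nn.Conv2d(2048 + 1024, 256, 1)
        self.merge2 = nn.Conv2d(256 + 512, 128, 1)
        self.merge1 = nn.Conv2d(128 + 256, 64, 1)

    def forward(self, x):
        c1 = self.layer1(self.stem(x))   # 1/4 resolution, 256 channels
        c2 = self.layer2(c1)             # 1/8, 512
        c3 = self.layer3(c2)             # 1/16, 1024
        c4 = self.layer4(c3)             # 1/32, 2048
        f = self.merge3(torch.cat([F.interpolate(c4, scale_factor=2), c3], 1))
        f = self.merge2(torch.cat([F.interpolate(f, scale_factor=2), c2], 1))
        f = self.merge1(torch.cat([F.interpolate(f, scale_factor=2), c1], 1))
        return f                         # 1/4 of the input resolution, 64 channels
```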

The text detection branch outputs dense per-pixel prediction of text using features produced by shared convolutions. With oriented text region proposals produced by detection branch, the proposed RoIRotate converts corresponding shared features into fixed-height representations while keeping the original region aspect ratio.

Finally, the text recognition branch recognizes words in region proposals. CNN and LSTM are adopted to encode text sequence information, followed by a CTC decoder. The structure of our text recognition branch is shown in Tab. 1.
[Table 1: structure of the text recognition branch]

5.2.Text Detection Branch

Inspired by [53, 15], we adopt a fully convolutional network as the text detector. As there are a lot of small text boxes in natural scene images, we upscale the feature maps from 1/32 to 1/4 size of the original input image in shared convolutions. After extracting shared features, one convolution is applied to output dense per-pixel predictions of words. The first channel computes the probability of each pixel being a positive sample. Similar to [53], pixels in shrunk version of the original text regions are considered positive. For each positive sample, the following 4 channels predict its distances to top, bottom, left, right sides of the bounding box that contains this pixel, and the last channel predicts the orientation of the related bounding box. Final detection results are produced by applying thresholding and NMS to these positive samples.
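As a concrete illustration, such a dense detection head can be a single convolution over the shared features whose 6 output channels are read as one score, four distances and one angle. The sketch below assumes sigmoid activations and a fixed distance scale; these are illustrative choices, not the paper's exact parameterization.

```python
import math
import torch
from torch import nn

class DetectionHead(nn.Module):
    """Sketch of the per-pixel detection head: 1 text score, 4 distances
    (top, bottom, left, right) and 1 orientation per location."""

    def __init__(self, in_channels=64, max_dist=512.0):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 6, kernel_size=1)
        self.max_dist = max_dist  # assumed upper bound for the regressed distances

    def forward(self, feats):
        out = self.conv(feats)
        score = torch.sigmoid(out[:, 0:1])                    # text/non-text probability
        dists = torch.sigmoid(out[:, 1:5]) * self.max_dist    # t, b, l, r in pixels
        angle = (torch.sigmoid(out[:, 5:6]) - 0.5) * math.pi  # orientation in (-pi/2, pi/2)
        return score, dists, angle
```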


In our experiments, we observe that many patterns similar to text strokes are hard to classify, such as fences, lattices, etc. We adopt online hard example mining (OHEM) [46] to better distinguish these patterns, which also solves the class imbalance problem. This provides a F-measure improvement of about 2% on ICDAR 2015 dataset.

The detection branch loss function is composed of two terms: a text classification term and a bounding box regression term. The text classification term can be seen as a pixel-wise classification loss for a down-sampled score map. Only the shrunk version of the original text region is considered as the positive area, while the area between the bounding box and the shrunk version is considered as “NOT CARE”, and does not contribute to the loss for the classification. Denoting the set of positive elements selected by OHEM in the score map as $\Omega$, the loss function for classification can be formulated as:
$$L_{cls} = \frac{1}{|\Omega|} \sum_{x \in \Omega} H(p_x, p^*_x)$$

where $|\cdot|$ is the number of elements in a set, and $H(p_x, p^*_x)$ represents the cross entropy loss between $p_x$, the prediction of the score map, and $p^*_x$, the binary label that indicates text or non-text. As for the regression loss, we adopt the IoU loss in [52] and the rotation angle loss in [53], since they are robust to variation in object shape, scale and orientation:
$$L_{reg} = \frac{1}{|\Omega|} \sum_{x \in \Omega} \Big( \mathrm{IoU}(R_x, R^*_x) + \gamma_\theta \big(1 - \cos(\theta_x - \theta^*_x)\big) \Big)$$

Here, $\mathrm{IoU}(R_x, R^*_x)$ is the IoU loss between the predicted bounding box $R_x$ and the ground truth $R^*_x$. The second term is the rotation angle loss, where $\theta_x$ and $\theta^*_x$ represent the predicted orientation and the ground truth orientation respectively. We set the hyper-parameter $\gamma_\theta$ to 10 in experiments. Therefore the full detection loss can be written as:

$$L_{detect} = L_{cls} + \gamma_{reg} L_{reg}$$

where a hyper-parameter $\gamma_{reg}$ balances the two losses, and is set to 1 in our experiments.
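A hedged sketch of this loss is given below. It assumes (N, C, H, W) tensor layouts, that the score map is already a probability (post-sigmoid), and it uses the EAST-style axis-aligned IoU computed from the four distances; it illustrates the structure of $L_{detect}$ rather than reproducing the authors' implementation.

```python
import torch
import torch.nn.functional as F

def detection_loss(score, dists, angle, gt_score, gt_dists, gt_angle,
                   sel_mask, gamma_theta=10.0, gamma_reg=1.0):
    """Sketch of L_detect = L_cls + gamma_reg * L_reg.

    Assumed shapes: score, gt_score, angle, gt_angle: (N, 1, H, W);
    dists, gt_dists: (N, 4, H, W) ordered (t, b, l, r);
    sel_mask: (N, 1, H, W) bool mask of OHEM-selected pixels.
    """
    # Classification term: cross entropy over the OHEM-selected pixels.
    l_cls = F.binary_cross_entropy(score[sel_mask], gt_score[sel_mask])

    # Regression terms over positive (text) pixels only.
    pos = ((gt_score > 0.5) & sel_mask)[:, 0]            # (N, H, W) bool
    if not pos.any():
        return l_cls
    t, b, l, r = (dists[:, i][pos] for i in range(4))
    tg, bg, lg, rg = (gt_dists[:, i][pos] for i in range(4))
    inter = (torch.min(t, tg) + torch.min(b, bg)) * (torch.min(l, lg) + torch.min(r, rg))
    union = (t + b) * (l + r) + (tg + bg) * (lg + rg) - inter
    l_iou = -torch.log((inter + 1.0) / (union + 1.0))    # EAST-style IoU loss
    l_angle = 1.0 - torch.cos(angle[:, 0][pos] - gt_angle[:, 0][pos])
    l_reg = (l_iou + gamma_theta * l_angle).mean()

    return l_cls + gamma_reg * l_reg
```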

5.3.RoIRotate

RoIRotate applies transformation on oriented feature regions to obtain axis-aligned feature maps, as shown in Fig. 4. In this work, we fix the output height and keep the aspect ratio unchanged to deal with the variation in text length. Compared to RoI pooling [6] and RoIAlign [11], RoIRotate provides a more general operation for extracting features for regions of interest. We also compare to RRoI pooling proposed in RRPN [39]. RRoI pooling transforms the rotated region to a fixed size region through max-pooling, while we use bilinear interpolation to compute the values of the output. This operation avoids misalignments between the RoI and the extracted features, and additionally it makes the lengths of the output features variable, which is more suitable for text recognition.

This process can be divided into two steps. First, affine transformation parameters are computed via predicted or ground truth coordinates of text proposals. Then, affine transformations are applied to shared feature maps for each region respectively, and canonical horizontal feature maps of text regions are obtained. The first step can be formulated as:
[Equations omitted in the original post: they define the affine transformation matrix $M$ in terms of the point $(x, y)$, the distances $(t, b, l, r)$, the orientation $\theta$, and the fixed output height $h_t$.]

where $M$ is the affine transformation matrix. $h_t$, $w_t$ represent the height (equal to 8 in our setting) and width of the feature maps after the affine transformation. $(x, y)$ represents the coordinates of a point in the shared feature maps, $(t, b, l, r)$ stands for the distances to the top, bottom, left and right sides of the text proposal respectively, and $\theta$ for the orientation. $(t, b, l, r)$ and $\theta$ can be given by the ground truth or the detection branch. With the transformation parameters, it is easy to produce the final RoI feature using the affine transformation:
[Equation omitted in the original post: the bilinear sampling formula that computes each output value $V^c_{ij}$ from the input values $U^c_{nm}$ using the sampling kernel $k$, as in spatial transformer networks.]

where $V^c_{ij}$ is the output value at location $(i, j)$ in channel $c$ and $U^c_{nm}$ is the input value at location $(n, m)$ in channel $c$. $h_s$, $w_s$ represent the height and width of the input, and $\phi_x$, $\phi_y$ are the parameters of a generic sampling kernel $k()$, which defines the interpolation method, specifically bilinear interpolation in this work. As the width of text proposals may vary, in practice, we pad the feature maps to the longest width and ignore the padding parts in the recognition loss function.
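For intuition, the sketch below reproduces this kind of sampling with PyTorch's `affine_grid`/`grid_sample`, which implement exactly this bilinear sampling. It takes boxes as (center, width, height, angle) in feature-map pixels, fixes the output height at 8 and preserves the aspect ratio; the coordinate and sign conventions, and the per-box loop with later padding, are assumptions made for illustration rather than the authors' implementation.

```python
import math
import torch
import torch.nn.functional as F

def roirotate(feats, boxes, out_height=8):
    """Sketch of RoIRotate via bilinear sampling.

    feats: (1, C, H, W) shared feature map; boxes: list of (cx, cy, w, h, theta)
    in feature-map pixels, theta in radians. Crops have varying widths, so the
    caller is expected to pad them to the longest width afterwards.
    """
    _, C, H, W = feats.shape
    crops = []
    for cx, cy, w, h, theta in boxes:
        out_w = max(1, int(round(out_height * w / h)))          # keep aspect ratio
        cos, sin = math.cos(theta), math.sin(theta)
        # Affine matrix mapping normalized output coords to normalized input coords.
        m = feats.new_tensor([
            [(w / 2) * cos * 2 / (W - 1), -(h / 2) * sin * 2 / (W - 1), 2 * cx / (W - 1) - 1],
            [(w / 2) * sin * 2 / (H - 1),  (h / 2) * cos * 2 / (H - 1), 2 * cy / (H - 1) - 1],
        ]).unsqueeze(0)
        grid = F.affine_grid(m, size=(1, C, out_height, out_w), align_corners=True)
        crops.append(F.grid_sample(feats, grid, mode='bilinear', align_corners=True))
    return crops  # list of (1, C, 8, out_w_i) tensors
```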

Spatial transformer network [21] uses affine transformation in a similar way, but gets transformation parameters via a different method and is mainly used in the image domain, i.e. transforming images themselves. RoIRotate takes feature maps produced by shared convolutions as input, and generates the feature maps of all text proposals, with fixed height and unchanged aspect ratio.

Different from object classification, text recognition is very sensitive to detection noise. A small error in predicted text region could cut off several characters, which is harmful to network training, so we use ground truth text regions instead of predicted text regions during training. When testing, thresholding and NMS are applied to filter predicted text regions. After RoIRotate, transformed feature maps are fed to the text recognition branch.

5.4.Text Recognition Branch

The text recognition branch aims to predict text labels using the region features extracted by shared convolutions and transformed by RoIRotate. Considering the length of the label sequence in text regions, input features to the LSTM are reduced only twice (to 1/4 as described in Sec. 3.2) along the width axis through shared convolutions from the original image. Otherwise discriminable features in compact text regions, especially those of narrow shaped characters, will be eliminated. Our text recognition branch consists of VGG-like [47] sequential convolutions, poolings with reduction along the height axis only, one bi-directional LSTM [42, 16], one fully-connected layer and the final CTC decoder [9].

First, spatial features are fed into several sequential convolutions and poolings along height axis with dimension reduction to extract higher-level features. For simplicity, all reported results here are based on VGG-like sequential layers as shown in Tab. 1.

Next, the extracted higher-level feature maps $L \in \mathbb{R}^{C \times H \times W}$ are permuted to time-major form as a sequence $l_1, ..., l_W \in \mathbb{R}^{C \times H}$ and fed into RNN for encoding. Here we use a bi-directional LSTM, with D = 256 output channels per direction, to capture range dependencies of the input sequential features. Then, the hidden states $h_1, ..., h_W \in \mathbb{R}^D$ calculated at each time step in both directions are summed up and fed into a fully-connected layer, which gives each state its distribution $x_t \in \mathbb{R}^{|S|}$ over the character classes $S$. To avoid overfitting on small training datasets like ICDAR 2015, we add dropout before the fully-connected layer. Finally, CTC is used to transform the frame-wise classification scores into a label sequence. Given the probability distribution $x_t$ over $S$ of each $h_t$, and the ground truth label sequence $y^* = \{y_1, ..., y_T\}$, $T \leq W$, the conditional probability of the label $y^*$ is the sum of the probabilities of all paths $\pi$ agreeing with it [9]:
$$p(y^* \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(y^*)} p(\pi \mid x)$$

where $\mathcal{B}$ defines a many-to-one map from the set of possible labellings with blanks and repeated labels to $y^*$. The training process attempts to maximize the log likelihood of the summation of Eq. (11) over the whole training set. Following [9], the recognition loss can be formulated as:
$$L_{recog} = -\frac{1}{N} \sum_{n=1}^{N} \log p(y^*_n \mid x)$$

where N is the number of text regions in an input image, and $y^*_n$ is the recognition label. Combined with the detection loss $L_{detect}$ in Eq. (3), the full multi-task loss function is:

$$L = L_{detect} + \gamma_{recog} L_{recog}$$

where a hyper-parameter $\gamma_{recog}$ controls the trade-off between the two losses. $\gamma_{recog}$ is set to 1 in our experiments.
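A hedged sketch of such a recognition branch is shown below: VGG-like convolutions with pooling along the height axis only, a bi-directional LSTM with 256 channels per direction, dropout, a fully-connected classifier, and CTC for training. The layer counts, channel widths and character-set size are illustrative assumptions, not the exact configuration of Tab. 1.

```python
import torch
from torch import nn

class RecognitionHead(nn.Module):
    """Sketch of the recognition branch operating on RoIRotate crops of height 8."""

    def __init__(self, in_channels=64, num_classes=37, hidden=256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),          # halve height, keep width
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),          # height 8 -> 1
        )
        self.lstm = nn.LSTM(256, hidden, bidirectional=True, batch_first=False)
        self.dropout = nn.Dropout(0.5)                 # dropout before the FC layer
        self.fc = nn.Linear(2 * hidden, num_classes)   # class 0 reserved for the CTC blank
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, rois):                           # rois: (N, C, 8, W)
        f = self.convs(rois)                           # (N, 256, 1, W)
        seq = f.squeeze(2).permute(2, 0, 1)            # time-major: (W, N, 256)
        seq, _ = self.lstm(seq)                        # (W, N, 512)
        return self.fc(self.dropout(seq)).log_softmax(-1)  # (W, N, num_classes)

    def loss(self, log_probs, targets, input_lengths, target_lengths):
        return self.ctc(log_probs, targets, input_lengths, target_lengths)
```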

5.5.Implementation Details

We use model trained on ImageNet dataset [29] as our pre-trained model. The training process includes two steps: first we use Synth800k dataset [10] to train the network for 10 epochs, and then real data is adopted to fine-tune the model until convergence. Different training datasets are adopted for different tasks, which will be discussed in Sec. 4. Some blurred text regions in ICDAR 2015 and ICDAR 2017 MLT datasets are labeled as “DO NOT CARE”, and we ignore them in training.

Data augmentation is important for robustness of deep neural networks, especially when the number of real data is limited, as in our case. First, longer sides of images are resized from 640 pixels to 2560 pixels. Next, images are rotated in the range [-10°, 10°] randomly. Then, the heights of images are rescaled with a ratio from 0.8 to 1.2 while their widths keep unchanged. Finally, 640×640 random samples are cropped from the transformed images.
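A minimal sketch of this augmentation chain, using Pillow and interpreting the first step as sampling the longer-side length uniformly from [640, 2560], is given below. Ground-truth box geometry would have to be transformed identically, which is omitted here, so treat it as illustration only.

```python
import random
from PIL import Image

def augment(img: Image.Image) -> Image.Image:
    """Image-side sketch of the four augmentation steps described above."""
    # 1. Resize the longer side to a random length in [640, 2560] pixels
    #    (the sampling range is an interpretation of the text above).
    scale = random.uniform(640, 2560) / max(img.size)
    img = img.resize((round(img.width * scale), round(img.height * scale)))

    # 2. Rotate by a random angle in [-10, 10] degrees.
    img = img.rotate(random.uniform(-10, 10), expand=True)

    # 3. Rescale the height by a factor in [0.8, 1.2], keeping the width unchanged.
    img = img.resize((img.width, round(img.height * random.uniform(0.8, 1.2))))

    # 4. Crop a random 640x640 patch (assumes the image is at least that large).
    x = random.randint(0, max(0, img.width - 640))
    y = random.randint(0, max(0, img.height - 640))
    return img.crop((x, y, x + 640, y + 640))
```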

As described in Sec. 3.2, we adopt OHEM for better performance. For each image, 512 hard negative samples, 512 random negative samples and all positive samples are selected for classification. As a result, positive-to-negative ratio is increased from 1:60 to 1:3. And for bounding box regression, we select 128 hard positive samples and 128 random positive samples from each image for training.
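The selection itself can be sketched as below, assuming per-pixel classification losses are already available and that "hard" negatives are the negatives with the largest loss; the sampling criterion and tensor layout are assumptions for illustration.

```python
import torch

def ohem_classification_mask(cls_loss, pos_mask, n_hard_neg=512, n_rand_neg=512):
    """Build the pixel mask used for the classification loss (sketch).

    cls_loss: (P,) per-pixel loss; pos_mask: (P,) bool. All positives are kept,
    plus the n_hard_neg negatives with the largest loss and n_rand_neg random
    negatives, matching the 512 + 512 + positives selection described above.
    """
    neg_idx = torch.nonzero(~pos_mask, as_tuple=False).squeeze(1)
    n_hard = min(n_hard_neg, neg_idx.numel())
    hard_neg = neg_idx[torch.topk(cls_loss[neg_idx], n_hard).indices]

    remaining = neg_idx[~torch.isin(neg_idx, hard_neg)]
    n_rand = min(n_rand_neg, remaining.numel())
    rand_neg = remaining[torch.randperm(remaining.numel())[:n_rand]]

    mask = pos_mask.clone()
    mask[hard_neg] = True
    mask[rand_neg] = True
    return mask
```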

At test time, after getting predicted text regions from the text detection branch, the proposed RoIRotate applies thresholding and NMS to these text regions and feeds the selected text features to the text recognition branch to get the final recognition result. For multi-scale testing, results from all scales are combined and fed to NMS again to get the final results.
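The thresholding-plus-NMS step can be sketched as follows. For simplicity it uses torchvision's axis-aligned NMS and treats the score and IoU thresholds as assumptions; the paper works with oriented boxes, so this is only an approximation of that step.

```python
import torch
from torchvision.ops import nms

def filter_detections(boxes, angles, scores, score_thresh=0.8, iou_thresh=0.2):
    """Score thresholding followed by NMS (axis-aligned approximation).

    boxes: (N, 4) as (x1, y1, x2, y2); angles, scores: (N,). Thresholds are
    illustrative, not values reported in the paper.
    """
    keep = scores > score_thresh
    boxes, angles, scores = boxes[keep], angles[keep], scores[keep]
    kept = nms(boxes, scores, iou_thresh)
    return boxes[kept], angles[kept], scores[kept]
```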

6.Experiments

We evaluate the proposed method on three recent challenging public benchmarks: ICDAR 2015 [26], ICDAR 2017 MLT [1] and ICDAR 2013 [27], and it surpasses state-of-the-art methods in both text localization and text spotting tasks. All the training data we use is publicly available.

6.1.Benchmark Datasets

  • ICDAR 2015

ICDAR 2015 is the Challenge 4 of ICDAR 2015 Robust Reading Competition, which is commonly used for oriented scene text detection and spotting. This dataset includes 1000 training images and 500 testing images. These images are captured by Google glasses without taking care of position, so text in the scene can be in arbitrary orientations. For text spotting task, it provides 3 specific lists of words as lexicons for reference in the test phase, named as “Strong”, “Weak” and “Generic”. “Strong” lexicon provides 100 words per-image including all words that appear in the image. “Weak” lexicon includes all words that appear in the entire test set. And “Generic” lexicon is a 90k word vocabulary. In training, we first train our model using 9000 images from ICDAR 2017 MLT training and validation datasets, then we use 1000 ICDAR 2015 training images and 229 ICDAR 2013 training images to fine-tune our model.

  • ICDAR 2017 MLT

ICDAR 2017 MLT is a large scale multi-lingual text dataset, which includes 7200 training images, 1800 validation images and 9000 testing images. The dataset is composed of complete scene images which come from 9 languages, and text regions in this dataset can be in arbitrary orientations, so it is more diverse and challenging. This dataset does not have text spotting task so we only report our text detection result. We use both training set and validation set to train our model.

  • ICDAR 2013

ICDAR 2013 consists of 229 training images and 233 testing images, and similar to ICDAR 2015, it also provides “Strong”, “Weak” and “Generic” lexicons for the text spotting task. Different from the above datasets, it contains only horizontal text. Though our method is designed for oriented text, results on this dataset indicate the proposed method is also suitable for horizontal text. Since there are too few training images, we first use 9000 images from ICDAR 2017 MLT training and validation datasets to train a pre-trained model and then use 229 ICDAR 2013 training images to fine-tune.

6.2.Comparison with Two-Stage Method

Different from previous works which divide text detection and recognition into two unrelated tasks, our method trains these two tasks jointly, and both text detection and recognition can benefit from each other. To verify this, we build a two-stage system, in which text detection and recognition models are trained separately. The detection network is built by removing the recognition branch from our proposed network, and similarly, the detection branch is removed from the original network to get the recognition network. For the recognition network, text line regions cropped from source images are used as training data, similar to previous text recognition methods [44, 14, 37].


As shown in Tab. 2,3,4, our proposed FOTS significantly outperforms the two-stage method “Our Detection” in text localization task and “Our Two-Stage” in text spotting task. Results show that our joint training strategy pushes model parameters to a better converged state.

FOTS performs better in detection because text recognition supervision helps the network to learn detailed character level features. To analyze in detail, we summarize four common issues for text detection. Miss: missing some text regions; False: wrongly regarding some non-text regions as text regions; Split: wrongly splitting a whole text region into several individual parts; Merge: wrongly merging several independent text regions together. As shown in Fig. 5, FOTS greatly reduces all of these four types of errors compared to “Our Detection” method. Specifically, “Our Detection” method focuses on the whole text region feature rather than character level feature, so this method does not work well when there is a large variance inside a text region or a text region has similar patterns with its background, etc. As the text recognition supervision forces the model to consider fine details of characters, FOTS learns the semantic information among different characters in one word that have different patterns. It also enhances the difference among characters and background that have similar patterns.

As shown in Fig. 5, for the Miss case, “Our Detection” method misses the text regions because their color is similar to their background. For the False case, “Our Detection” method wrongly recognizes a background region as text because it has “text-like” patterns (e.g., repetitive structured stripes with high contrast), while FOTS avoids this mistake after training with recognition loss which considers details of characters in the proposed region. For the Split case, “Our Detection” method splits a text region to two because the left and right sides of this text region have different colors, while FOTS predicts this region as a whole because patterns of characters in this text region are continuous and similar. For the Merge case, “Our Detection” method wrongly merges two neighboring text bounding boxes together because they are too close and have similar patterns, while FOTS utilizes the character level information given by text recognition and captures the space between two words.

[Figure 5: examples of the four detection error types (Miss, False, Split, Merge) for FOTS and “Our Detection”]

6.3.Comparisons with State-of-the-Art Results

In this section, we compare FOTS to state-of-the-art methods. As shown in Tab. 2, 3, 4, our method outperforms all others by a large margin in all datasets. Since ICDAR 2017 MLT does not have text spotting task, we only report our text detection result. All text regions in ICDAR 2013 are labeled by horizontal bounding box while many of them are slightly tilted. As our model is pre-trained using ICDAR 2017 MLT data, it also can predict orientations of text regions. Our final text spotting results keep predicted orientations for better performance, and due to the limitation of the evaluation protocol, our detection results are the minimum horizontal circumscribed rectangles of network predictions. It is worth mentioning that in ICDAR 2015 text spotting task, our method outperforms previous best method [43, 44] by more than 15% in terms of F-measure.

For single-scale testing, FOTS resizes longer side of input images to 2240, 1280, 920 respectively for ICDAR 2015, ICDAR 2017 MLT and ICDAR 2013 to achieve the best results, and we apply 3-5 scales for multi-scale testing.

6.4.Speed and Model Size

As shown in Tab. 5, benefiting from our convolution sharing strategy, FOTS can detect and recognize text jointly with little computation and storage increment compared to a single text detection network (7.5 fps vs. 7.8 fps, 22.0 fps vs. 23.9 fps), and it is almost twice as fast as “Our Two-Stage” method (7.5 fps vs. 3.7 fps, 22.0 fps vs. 11.2 fps). As a consequence, our method achieves state-of-the-art performance while keeping real-time speed. All of these methods are tested on ICDAR 2015 and ICDAR 2013 test sets. These datasets have 68 text recognition labels, and we evaluate all test images and calculate the average speed. For ICDAR 2015, FOTS uses 2240×1260 size images as inputs, while “Our Two-Stage” method uses 2240×1260 images for detection and 32-pixel-height cropped text region patches for recognition. As for ICDAR 2013, we resize the longer side of input images to 920 and also use 32-pixel-height image patches for recognition. To achieve real-time speed, “FOTS RT” replaces ResNet-50 with ResNet-34 and uses 1280×720 images as inputs. All results in Tab. 5 are tested on a modified version of Caffe [23] using a TITAN-Xp GPU.

7.Conclusion

In this work, we presented FOTS, an end-to-end trainable framework for oriented scene text spotting. A novel RoIRotate operation is proposed to unify detection and recognition into an end-to-end pipeline. By sharing convolutional features, the text recognition step is nearly cost-free, which enables our system to run at real-time speed. Experiments on standard benchmarks show that our method significantly outperforms previous methods in terms of efficiency and performance.
