TextDragon: An End-to-End Framework for Arbitrary Shaped Text Spotting (Reading Notes)

The paper was published at ICCV 2019.
[Paper]: http://openaccess.thecvf.com/content_ICCV_2019/html/Feng_TextDragon_An_End-to-End_Framework_for_Arbitrary_Shaped_Text_Spotting_ICCV_2019_paper.html
[Code]: not found yet

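Abstract

The paper proposes RoISlide, a differentiable operation that connects text detection and text recognition so that the whole model can be trained end to end. The method achieves state-of-the-art results on two curved-text datasets, CTW1500 and Total-Text, and competitive results on the regular-text dataset ICDAR 2015.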

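Introduction

Most existing text spotting methods work in two separate steps: text detection followed by text recognition. This design has two drawbacks: it is slow, and it ignores the relationship between detection and recognition.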

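TextDragon is inspired by TextSnake [32], which describes a text instance with a series of local units and can therefore detect text of arbitrary shape. Earlier arbitrary-shape spotting, however, relies on character-level labels during training (for example Mask TextSpotter [33]). Many datasets do not provide such labels, so obtaining them can require a large amount of manual annotation.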

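To detect text of arbitrary shape, the paper localizes complex text with a series of local quadrangles.
As shown in Figure 2, RoISlide connects the detection and recognition modules. It extracts features from the feature map and rectifies arbitrarily shaped text regions, which reduces the variation in character size and orientation. The rectified text features are then fed to a CNN and Connectionist Temporal Classification (CTC) to produce the final result. In addition, TextDragon is the first end-to-end trainable model for arbitrary-shaped text that needs only word-level or line-level labels for training.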

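Three main contributions:
(1) The end-to-end TextDragon model
(2) The differentiable RoISlide operation, which unifies detection and recognition
(3) Training with only word-level or line-level annotations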

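Method

Overview: a backbone network extracts features from the image, and the text detector then describes each text instance with a series of quadrangles located along its centerline. RoISlide extracts features from the feature map along that centerline, and its Local Transform Network converts the features inside each quadrangle into rectified features. Finally, a CNN classifies the features of each quadrangle, and a CTC decoder produces the final text sequence.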

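Text Detection

To handle text of different scales, the paper fuses feature maps from multiple layers and upsamples the fused map to 1/4 of the original image size.
The output module consists of Centerline Segmentation and Local Box Regression.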

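Centerline Segmentation: the goal is to find the centerline of each text instance. Pixels near a text centerline are predicted as 1 and all other pixels (the non-text region) as 0. Centerline pixels are far outnumbered by non-text pixels, so to handle this imbalance the paper follows [40] and uses **online hard example mining (OHEM)**.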

Loss function: $L_{seg}=\frac{1}{|S|} \sum_{s \in S} L\left(p_{s}, p_{s}^{*}\right) =\frac{1}{|S|} \sum_{s \in S}\left(-p_{s}^{*} \log p_{s}-\left(1-p_{s}^{*}\right) \log \left(1-p_{s}\right)\right)$
where $|S|$ is the number of elements selected by OHEM, $p_s$ is the network's binary classification output for pixel $s$, and $p_s^*$ is the ground truth, with $p_{s}^{*} \in\{0,1\}$.
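The loss above can be illustrated with a small sketch. It is not the paper's implementation: the selection rule (keep all positives plus the highest-loss negatives, capped by a hypothetical `neg_pos_ratio`) is one common way to realize OHEM and is an assumption here, as are the function and parameter names.

```python
import math


def ohem_bce_loss(probs, labels, neg_pos_ratio=3, eps=1e-7):
    """Binary cross-entropy averaged over an OHEM-selected pixel set S.

    probs         -- predicted centerline probabilities p_s (flattened)
    labels        -- ground truth p_s^* in {0, 1} (flattened)
    neg_pos_ratio -- assumed cap on negatives kept per positive (hypothetical)
    """
    # Per-pixel cross-entropy: -p* log p - (1 - p*) log(1 - p).
    losses = []
    for p, t in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        losses.append(-t * math.log(p) - (1 - t) * math.log(1.0 - p))

    pos = [l for l, t in zip(losses, labels) if t == 1]
    neg = [l for l, t in zip(losses, labels) if t == 0]

    # Keep every positive, but only the hardest (highest-loss) negatives.
    n_neg = min(len(neg), max(1, neg_pos_ratio * len(pos)))
    hard_neg = sorted(neg, reverse=True)[:n_neg]

    selected = pos + hard_neg  # this plays the role of the set S
    if not selected:
        return 0.0
    return sum(selected) / len(selected)
```

With one positive pixel and `neg_pos_ratio=1`, only the single most confidently wrong background pixel contributes to the average, which is how the easy background is kept from dominating the loss.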

Local Box Regression: this step produces the bounding boxes. Each box is described by two parameters, a height and an angle, as shown in Figure 3.

Loss function:
$\left[\begin{array}{c}L_{B} \\ L_{\theta}\end{array}\right]=\frac{1}{|P|} \sum_{i \in P} \operatorname{Smooth}_{L_{1}}\left[\begin{array}{c}B_{i}-B_{i}^{*} \\ \theta_{i}-\theta_{i}^{*}\end{array}\right]$
$L_{reg}=L_{B}+\lambda_{\theta} L_{\theta}$
where $P$ is the positive region (the text centerline region), $B_i$ and $\theta_i$ are the predicted box and angle, $B_i^*$ and $\theta_i^*$ are the ground truth, and $\lambda_{\theta}$ is a hyperparameter (set to 10 in the paper's experiments). The paper chooses the Smooth L1 loss [36] because it is robust to variation in object shape.
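As a worked illustration of the regression loss, the sketch below uses the standard Smooth L1 definition (quadratic below 1, linear above). Summing the loss over the components of each box before averaging over $P$ is an assumption, and the function names are hypothetical.

```python
def smooth_l1(x, beta=1.0):
    """Smooth L1: quadratic for |x| < beta, linear otherwise."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def regression_loss(pred_boxes, gt_boxes, pred_angles, gt_angles,
                    lambda_theta=10.0):
    """L_reg = L_B + lambda_theta * L_theta, averaged over positive pixels P.

    pred_boxes / gt_boxes   -- one list of box parameters per positive pixel
    pred_angles / gt_angles -- one angle per positive pixel
    """
    n = len(pred_boxes)  # |P|
    l_box = sum(
        smooth_l1(b - b_star)
        for box, box_star in zip(pred_boxes, gt_boxes)
        for b, b_star in zip(box, box_star)
    ) / n
    l_theta = sum(
        smooth_l1(t - t_star) for t, t_star in zip(pred_angles, gt_angles)
    ) / n
    return l_box + lambda_theta * l_theta
```

Because $\lambda_{\theta}=10$, an angle error of 0.5 contributes $10 \times 0.125 = 1.25$ to the total, comparable to a box-parameter error of about 1.75.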

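RoISlide

RoISlide transforms each local quadrangle in sequence and thereby indirectly converts the whole text feature into an axis-aligned feature. It consists of two steps: 1. The quadrangles distributed along the text centerline are arranged in order. 2. A Local Transform Network (LTN) converts the feature map cropped from each quadrangle into a rectified feature map in a sliding manner. After these two steps, the features become an ordered sequence of square feature maps, as shown in Figure 4.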
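The idea of rectifying quadrangles one by one can be sketched as follows. This is only a geometric illustration under stated assumptions: the paper's LTN is a learned component, whereas here each quadrangle is warped with a fixed bilinear mapping of its four corners, the feature map has a single channel stored as a list of rows, and all function names are hypothetical.

```python
def bilinear_sample(fmap, x, y):
    """Bilinearly interpolate a 2-D feature map (list of rows) at point (x, y)."""
    h, w = len(fmap), len(fmap[0])
    x = min(max(x, 0.0), w - 1.0)  # clamp to the map borders
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def rectify_quad(fmap, quad, size):
    """Warp one quadrangle, given as (tl, tr, br, bl) corners, to size x size."""
    tl, tr, br, bl = quad
    steps = [k / (size - 1) if size > 1 else 0.5 for k in range(size)]
    patch = []
    for v in steps:      # v moves from the top edge to the bottom edge
        row = []
        for u in steps:  # u moves from the left edge to the right edge
            top_x = tl[0] + u * (tr[0] - tl[0])
            top_y = tl[1] + u * (tr[1] - tl[1])
            bot_x = bl[0] + u * (br[0] - bl[0])
            bot_y = bl[1] + u * (br[1] - bl[1])
            x = top_x + v * (bot_x - top_x)
            y = top_y + v * (bot_y - top_y)
            row.append(bilinear_sample(fmap, x, y))
        patch.append(row)
    return patch


def roi_slide(fmap, ordered_quads, size):
    """Rectify each quadrangle in reading order and join the patches side by side."""
    patches = [rectify_quad(fmap, q, size) for q in ordered_quads]
    return [[val for p in patches for val in p[i]] for i in range(size)]
```

The output has height `size` and width `size * len(ordered_quads)`, so a curved text line becomes a straight strip of square patches that a sequence recognizer can read from left to right.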

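Text Recognition

The recognizer uses a stack of convolutional layers rather than an LSTM (cf. [45][46]). The layer configuration is given in Table 1.

Recognition consists of two parts: a text classifier and a transcription layer. The classifier converts the square feature maps produced by the previous step into per-position character probabilities, and the transcription layer maps these probabilities to English characters.

The transcription layer uses a CTC decoder [9], whose purpose is to convert the probability distributions into a text sequence.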
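To make the transcription step concrete, here is a minimal sketch of best-path (greedy) CTC decoding: take the most probable class at each position, merge consecutive repeats, then drop blanks. The notes do not say which decoding strategy TextDragon uses, so greedy decoding, the blank being class 0, and the function name are all assumptions.

```python
def ctc_greedy_decode(probs, alphabet):
    """Best-path CTC decoding with class 0 as the blank symbol.

    probs    -- one probability list per position (e.g. per rectified patch)
    alphabet -- string whose i-th character corresponds to class i + 1
    """
    blank = 0
    # Most probable class at every position.
    best = [max(range(len(step)), key=step.__getitem__) for step in probs]

    chars = []
    prev = blank
    for k in best:
        # Emit a character only when it is not blank and not a repeat.
        if k != blank and k != prev:
            chars.append(alphabet[k - 1])
        prev = k
    return "".join(chars)
```

A blank between two identical classes is what allows doubled letters: the path `a a _ a b b _` decodes to `aab`, while `a a a b b` would decode to `ab`.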

The recognition loss is: $L_{rec}=-\frac{1}{M} \sum_{m=1}^{M} \log p\left(y_{m} \mid X\right)$
The loss for the whole end-to-end training is then: $L=L_{seg}+\lambda_{reg} L_{reg}+\lambda_{rec} L_{rec}$, where $\lambda_{rec}$ and $\lambda_{reg}$ are hyperparameters.

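Inference

The inference steps are shown in Figure 5:

Grouping: boxes are grouped according to their geometric relationships.
Sorting: first, determine whether the boxes in a group are laid out horizontally or vertically overall.
Sampling: to generate the boundary, the ordered boxes are sampled uniformly to form the vertices of a polygon, and the vertices are then connected in order to produce the text boundary.
Recognition: run RoISlide and CTC.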
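The sorting and sampling steps can be sketched as below. Several details are assumptions rather than the paper's procedure: each box is represented as a hypothetical tuple `(cx, cy, h, theta)`, the layout is judged by comparing the spread of box centers in x and y, boxes are ordered along the dominant axis, and the polygon is built from the midpoints of each sampled box's top and bottom edges.

```python
import math


def order_boxes(boxes):
    """Sort boxes (cx, cy, h, theta) along the group's dominant direction."""
    xs = [b[0] for b in boxes]
    ys = [b[1] for b in boxes]
    horizontal = (max(xs) - min(xs)) >= (max(ys) - min(ys))
    return sorted(boxes, key=lambda b: b[0] if horizontal else b[1])


def sample_evenly(items, n):
    """Pick n items at evenly spaced indices, keeping both ends."""
    if n >= len(items):
        return list(items)
    if n == 1:
        return [items[0]]
    return [items[round(i * (len(items) - 1) / (n - 1))] for i in range(n)]


def boxes_to_polygon(boxes, n_samples=7):
    """Build a closed text boundary from a group of local boxes."""
    sampled = sample_evenly(order_boxes(boxes), n_samples)
    top, bottom = [], []
    for cx, cy, h, theta in sampled:
        # Normal to the local text direction (image y axis points down).
        nx, ny = -math.sin(theta), math.cos(theta)
        top.append((cx - 0.5 * h * nx, cy - 0.5 * h * ny))
        bottom.append((cx + 0.5 * h * nx, cy + 0.5 * h * ny))
    # Walk along the top edge, then back along the bottom edge.
    return top + bottom[::-1]
```

For three unrotated boxes of height 2 centered at x = 0, 2, 4 on the line y = 0, the result is a six-vertex polygon running along y = -1 and returning along y = 1.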

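Experiments

End-to-end vs. non-end-to-end: Figure 6 shows that end-to-end training improves the detection rate for inconspicuous text.
RoISlide vs. RoIRotate: Tables 2 and 3 and Figure 6(c, d) show that RoIRotate [29] is not suited to curved text, while RoISlide and RoIRotate perform similarly on regular text.
Spotting with vs. without LSTM: the CNN-based text recognizer is 4 times faster than the LSTM-based one.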

References

Selected references from the original paper that are cited in this post:

[32] Shangbang Long, Jiaqiang Ruan, Wenjie Zhang, Xin He, Wenhao Wu, and Cong Yao. Textsnake: A flexible representation for detecting text of arbitrary shapes. In European Conference on Computer Vision (ECCV), pages 19–35. Springer, 2018.

[31] Yuliang Liu, Lianwen Jin, Shuaitao Zhang, and Sheng Zhang. Detecting curve text in the wild: New dataset and new solution. In arXiv preprint arXiv:1712.02170, 2017.

[46] Xiaobing Wang, Yingying Jiang, Zhenbo Luo, Cheng-Lin Liu, Hyunsoo Choi, and Sungjin Kim. Arbitrary shape scene text detection with adaptive text region representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[39] Baoguang Shi, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. Robust scene text recognition with automatic rectification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4168–4176, 2016.

[28] Wei Liu, Chaofeng Chen, Kwan-Yee K Wong, Zhizhong Su, and Junyu Han. Star-net: A spatial attention residue network for scene text recognition. In BMVC, volume 2, page 7, 2016.

[5] Zhanzhan Cheng, Xuyang Liu, Fan Bai, Yi Niu, Shiliang Pu, and Shuigeng Zhou. Arbitrarily-oriented text recognition. In arXiv preprint arXiv:1711.04226, 2017.

[25] Hui Li, Peng Wang, and Chunhua Shen. Towards end-to-end text spotting with convolutional recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5238–5246, 2017.

[29] Xuebo Liu, Ding Liang, Shi Yan, Dagui Chen, Yu Qiao, and Junjie Yan. Fots: Fast oriented text spotting with a unified network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5676–5685, 2018.

[35] Yash Patel, Michal Bušta, and Jiri Matas. E2e-mlt - an unconstrained end-to-end method for multi-language scene text. In arXiv preprint arXiv:1801.09919, 2018.

[40] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 761–769, 2016.

[36] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), pages 91–99, 2015.

[45] Tao Wang, David J Wu, Adam Coates, and Andrew Y Ng. End-to-end text recognition with convolutional neural networks. In Proceedings of the International Conference on Pattern Recognition (ICPR), pages 3304–3308. IEEE, 2012.

[9] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the International Conference on Machine Learning (ICML), pages 369–376. ACM, 2006.

[11] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2980–2988. IEEE, 2017.

[33] Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, and Xiang Bai. Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In European Conference on Computer Vision (ECCV), September 2018.
