Over-segmentation and beam search

Reposted from: http://blog.csdn.net/peaceinmind/article/details/51347679

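Earlier chapters covered how to extract text lines. This post covers a traditional recognition approach built on over-segmentation, beam search, and a character classifier. It mainly follows reference [1] and the code of the text module in opencv_contrib [5]. The usual recognition pipeline is binarization, projection, connected-component segmentation, and classifier-based verification. However, no single binarization method works reliably everywhere yet, and even when binarization goes well, segmentation is still hard when characters touch each other or, as in Chinese, when a character is built from separate radicals and components. Researchers therefore proposed over-segmentation.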

1. Over-segmentation

Figure taken from reference [3]

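As the figure above shows, the core idea is to cut more often than necessary. An image that may contain only a few characters can be cut 10, 20, or even 100 times, provided that the true segmentation points are included among the cuts as far as possible. Beam search (described below) then selects the best combination of cuts. The idea is simple, but how are the candidate cut points obtained? Different papers use different methods.

Reference [3], for example, binarizes the image using a feature related to double edges; reference [1] uses a sliding window with a binary classifier; OpenCV uses a sliding window with a multi-class classifier. If a window is likely to contain a character, or is very likely to be one particular character, it is kept as a potential segmentation point. The next step is to find a way to choose a suitable combination of those points.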

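2. Scoring a segmentation

Before running beam search for the best combination of cuts, we need to know how to define and compute the score of a combination. Here we walk through the sliding-window, multi-class classifier approach used in OpenCV. Start with a text line containing an English word and define a sliding window. The window has two important parameters: its size and its step. The window height usually matches the height of the text line (for example, a 32*32 window), and the step is, say, 5 pixels (reference [1] uses one tenth of the text-line height). Then Rect(x:0, y:0, width:32, height:32), Rect(5,0,32,32), Rect(10,0,32,32), ... are the windows to evaluate, and every step of 5 pixels is a potential segmentation point. See the figure below; the window drawn there is a little too large, but the idea is the same (the author was too lazy to redraw it).

Sliding window illustration [6]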
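The window enumeration described above can be sketched in a few lines. This is only an illustration under simple assumptions: `Window` is a hypothetical stand-in for `cv::Rect` so the snippet has no OpenCV dependency, `sliding_windows` is a made-up helper name, and windows that would run past the right edge of the line are simply skipped.

```cpp
#include <vector>

// Hypothetical stand-in for cv::Rect, used only for this illustration.
struct Window {
    int x, y, width, height;
};

// Enumerate square windows of side `win_size` slid along a text line of
// width `line_width`, advancing `step` pixels each time. Windows that would
// extend past the right edge are skipped (an assumption of this sketch).
std::vector<Window> sliding_windows(int line_width, int win_size, int step) {
    std::vector<Window> windows;
    if (win_size <= 0 || step <= 0) return windows;
    for (int x = 0; x + win_size <= line_width; x += step) {
        windows.push_back({x, 0, win_size, win_size});
    }
    return windows;
}
```

With a 100-pixel-wide line, `sliding_windows(100, 32, 5)` yields windows starting at x = 0, 5, 10, and so on, matching the Rect sequence in the text.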

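Next we need a trained classifier whose input is the image of one window and whose output is a probability for every class: the probability of a, the probability of b, and so on for the remaining classes. Reference [1] uses HOG + ANN; OpenCV [5] uses a single-layer CNN [7].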

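An intuitive score is the average of the highest class probability in each window. Suppose a text line is cut twice, which gives three windows: the most likely class in the first window is b with probability 0.3, in the second it is 1 with probability 0.2, and in the third it is s with probability 0.4. The score of this segmentation is the mean of the three probabilities, 0.3. The drawback is that context is ignored. In the example, if the first character is b, then even though 1 has the highest probability in the second window, 1 and l often look alike, and in real text b is followed by l more often than by 1, so the second window is more likely to be l. How do we handle this? We add transition probabilities to the score.

Figure: excerpt of the transition probabilities computed in OpenCV over 62 classes (lowercase letters, uppercase letters, and digits)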

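Once transition probabilities are included, computing the score of a segmentation becomes more involved and calls for the Viterbi algorithm [2]. It is touched on in the HMM post I reposted earlier and is essentially dynamic programming, so I will not repeat it here. The OpenCV code is attached for reference.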

    double score_segmentation(vector<int> &segmentation, string& outstring)
    {
        // Score Heuristics:
        // No need to use Viterbi to know a given segmentation is bad
        // e.g.: in some cases we discard a segmentation because it includes a very large character
        //       in other cases we do it because the overlapping between two chars is too large
        // TODO Add more heuristics (e.g. penalize large inter-character variance)

        Mat interdist((int)segmentation.size()-1, 1, CV_32F, 1);
        for (size_t i=0; i<segmentation.size()-1; i++)
        {
            interdist.at<float>((int)i,0) = (float)oversegmentation[segmentation[(int)i+1]]*step_size
                                            - (float)oversegmentation[segmentation[(int)i]]*step_size;
            if ((float)interdist.at<float>((int)i,0)/win_size > 2.25) // TODO explain how did you set this thrs
            {
                return -DBL_MAX;
            }
            if ((float)interdist.at<float>((int)i,0)/win_size < 0.15) // TODO explain how did you set this thrs
            {
                return -DBL_MAX;
            }
        }
        Scalar m, std;
        meanStdDev(interdist, m, std);
        //double interdist_std = std[0];

        //TODO Extracting start probs from lexicon (if we have it) may boost accuracy!
        vector<double> start_p(vocabulary.size());
        for (int i=0; i<(int)vocabulary.size(); i++)
            start_p[i] = log(1.0/vocabulary.size());

        Mat V = Mat::ones((int)segmentation.size(), (int)vocabulary.size(), CV_64FC1);
        V = V * -DBL_MAX;
        vector<string> path(vocabulary.size());

        // Initialize base cases (t == 0)
        for (int i=0; i<(int)vocabulary.size(); i++)
        {
            V.at<double>(0,i) = start_p[i] + recognition_probabilities[segmentation[0]][i];
            path[i] = vocabulary.at(i);
        }

        // Run Viterbi for t > 0
        for (int t=1; t<(int)segmentation.size(); t++)
        {
            vector<string> newpath(vocabulary.size());

            for (int i=0; i<(int)vocabulary.size(); i++)
            {
                double max_prob = -DBL_MAX;
                int best_idx = 0;
                for (int j=0; j<(int)vocabulary.size(); j++)
                {
                    double prob = V.at<double>(t-1,j) + transition_p.at<double>(j,i) + recognition_probabilities[segmentation[t]][i];
                    if (prob > max_prob)
                    {
                        max_prob = prob;
                        best_idx = j;
                    }
                }

                V.at<double>(t,i) = max_prob;
                newpath[i] = path[best_idx] + vocabulary.at(i);
            }

            // Don't need to remember the old paths
            path.swap(newpath);
        }

        double max_prob = -DBL_MAX;
        int best_idx = 0;
        for (int i=0; i<(int)vocabulary.size(); i++)
        {
            double prob = V.at<double>((int)segmentation.size()-1,i);
            if (prob > max_prob)
            {
                max_prob = prob;
                best_idx = i;
            }
        }

        outstring = path[best_idx];
        return (max_prob / (segmentation.size()-1));
    }
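To see the effect of the transition term on the b / l / 1 example from earlier, here is a small standalone Viterbi sketch in the same spirit as the OpenCV code. It is not the OpenCV implementation: `viterbi_decode` and `decode_toy_example` are hypothetical names, it assumes all scores are log-probabilities and that every emission row has one entry per class, and the probabilities in the toy example are made up purely for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Viterbi decoding over log-probabilities.
// emission[t][i]   : log P(window t shows class i)
// transition[j][i] : log P(class i follows class j)
// Returns the most likely string and writes its total log-score to best_score.
std::string viterbi_decode(const std::string& vocabulary,
                           const std::vector<std::vector<double>>& emission,
                           const std::vector<std::vector<double>>& transition,
                           double& best_score) {
    const std::size_t n = vocabulary.size();
    const double neg_inf = -std::numeric_limits<double>::infinity();
    best_score = neg_inf;
    if (n == 0 || emission.empty()) return "";

    // Uniform start probabilities, as in the code above.
    const double start = std::log(1.0 / static_cast<double>(n));
    std::vector<double> score(n);
    std::vector<std::string> path(n);
    for (std::size_t i = 0; i < n; ++i) {
        score[i] = start + emission[0][i];
        path[i] = std::string(1, vocabulary[i]);
    }

    for (std::size_t t = 1; t < emission.size(); ++t) {
        std::vector<double> new_score(n, neg_inf);
        std::vector<std::string> new_path(n);
        for (std::size_t i = 0; i < n; ++i) {
            std::size_t best_prev = 0;
            for (std::size_t j = 0; j < n; ++j) {
                const double s = score[j] + transition[j][i] + emission[t][i];
                if (s > new_score[i]) { new_score[i] = s; best_prev = j; }
            }
            new_path[i] = path[best_prev] + vocabulary[i];
        }
        score.swap(new_score);  // only the previous column is needed
        path.swap(new_path);
    }

    std::size_t best = 0;
    for (std::size_t i = 1; i < n; ++i)
        if (score[i] > score[best]) best = i;
    best_score = score[best];
    return path[best];
}

// Toy example with made-up numbers: window 0 clearly looks like 'b',
// window 1 looks slightly more like '1' than like 'l'.
std::string decode_toy_example() {
    const std::string vocab = "bl1";
    auto lg = [](double p) { return std::log(p); };
    const double u = lg(1.0 / 3.0);
    std::vector<std::vector<double>> emission = {
        {lg(0.8), lg(0.1), lg(0.1)},
        {lg(0.1), lg(0.4), lg(0.5)}};
    // Assumed transitions: 'l' follows 'b' far more often than '1' does.
    std::vector<std::vector<double>> transition = {
        {lg(0.3), lg(0.6), lg(0.1)},  // from 'b'
        {u, u, u},                    // from 'l'
        {u, u, u}};                   // from '1'
    double score = 0.0;
    return viterbi_decode(vocab, emission, transition, score);
}
```

Picking the best class per window independently would read the toy input as "b1"; with the transition term the decoder prefers "bl".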

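3. Beam search

OpenCV first applies non-maximum suppression (NMS): if neighbouring windows overlap and are assigned the same class, the one with the lower probability is suppressed. It then looks for the best combination among the remaining potential segmentation points. From the analysis above, 10 potential segmentation points give roughly 2^10 = 1024 combinations, so the search space is still fairly large. Beam search is breadth-first search with some pruning added.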
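The suppression step can be sketched roughly as follows. This is not OpenCV's implementation, only one way to realize the rule stated above: `Detection` and `suppress_non_maxima` are hypothetical names, detections are assumed to be sorted by x, and two windows are treated as overlapping when their left edges are less than one window width apart.

```cpp
#include <vector>

// One classified window: its left edge, its top class, and that class's probability.
struct Detection {
    int x;
    int label;
    double prob;
};

// Walk the detections from left to right. When a detection overlaps the last
// kept one and has the same label, only the more probable of the two survives.
std::vector<Detection> suppress_non_maxima(const std::vector<Detection>& dets,
                                           int win_size) {
    std::vector<Detection> kept;
    for (const Detection& d : dets) {
        if (!kept.empty() && kept.back().label == d.label &&
            d.x - kept.back().x < win_size) {
            if (d.prob > kept.back().prob) kept.back() = d;  // stronger one wins
        } else {
            kept.push_back(d);
        }
    }
    return kept;
}
```

The left edges of the surviving windows are then the potential segmentation points handed to the search.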

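For example, let the maximum beam size be 10.

(1) At the start, add every set that contains a single segmentation point to the candidate solutions:

{{point 1}, {point 2}, {point 3}, …, {point 10}}

(2) Sort the candidates by score in descending order; if there are more candidates than the beam size, drop the ones at the end.

(3) Add new segmentation points to form candidates with 2 segmentation points:

{

{point 1}, {point 2}, {point 3}, …, {point 10},

{point 1, point 2}, {point 1, point 3}, …, {point 1, point 10},

{point 2, point 3}, {point 2, point 4}, …, {point 2, point 10},

…

{point 9, point 10}
}

(4) Sort the candidates by score in descending order again; if there are more candidates than the beam size, drop the ones at the end.

Iterate until candidates with 10 segmentation points have been generated; the candidate with the highest score is then the segmentation we want.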
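The steps above can be sketched as a small generic routine. Treat it as an illustration under assumptions rather than the OpenCV code: `beam_search`, `prune`, `Segmentation`, and `ScoreFn` are hypothetical names, cut points are numbered from 0, the scoring function is passed in by the caller (in the real decoder that role is played by `score_segmentation`), and only candidates that survive pruning are extended in the next round.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using Segmentation = std::vector<int>;  // increasing indices of chosen cut points
using ScoreFn = std::function<double(const Segmentation&)>;
using Candidate = std::pair<double, Segmentation>;  // (score, segmentation)

// Sort best-first and keep at most beam_width candidates.
static void prune(std::vector<Candidate>& beam, std::size_t beam_width) {
    std::stable_sort(beam.begin(), beam.end(),
                     [](const Candidate& a, const Candidate& b) {
                         return a.first > b.first;
                     });
    if (beam.size() > beam_width) beam.resize(beam_width);
}

Segmentation beam_search(int num_cuts, std::size_t beam_width,
                         const ScoreFn& score) {
    // Step (1): every single cut point is a candidate.
    std::vector<Candidate> beam;
    for (int c = 0; c < num_cuts; ++c) {
        Segmentation s{c};
        beam.emplace_back(score(s), s);
    }
    prune(beam, beam_width);  // step (2)

    for (int len = 2; len <= num_cuts; ++len) {
        // Step (3): extend surviving candidates of length len-1 by a later cut.
        std::vector<Candidate> extended;
        for (const Candidate& cand : beam) {
            if (static_cast<int>(cand.second.size()) != len - 1) continue;
            for (int c = cand.second.back() + 1; c < num_cuts; ++c) {
                Segmentation t = cand.second;
                t.push_back(c);
                extended.emplace_back(score(t), t);
            }
        }
        if (extended.empty()) break;  // nothing left to grow
        beam.insert(beam.end(), extended.begin(), extended.end());
        prune(beam, beam_width);  // step (4)
    }
    return beam.empty() ? Segmentation{} : beam.front().second;
}
```

Because the beam is truncated after every round, the result is not guaranteed to be the global optimum; a wider beam trades running time for a better chance of keeping the right candidate.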

That is all for this post; corrections of any errors or omissions are welcome.

References

[1] Bissacco A, Cummins M, Netzer Y, et al. PhotoOCR: Reading text in uncontrolled conditions[C]//Proceedings of the IEEE International Conference on Computer Vision. 2013: 785-792.

[2] Li Hang. 統計學習方法 (Statistical Learning Methods)[M]. Tsinghua University Press, 2012.

[3] Bai, Jinfeng, et al. "Chinese image text recognition on grayscale pixels." Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014.

[4] Xiangyun Ye, M. Cheriet, and C.Y. Suen, "Stroke-model-based character extraction from gray-level document images," Image Processing, IEEE Transactions on, vol. 10, no. 8, pp. 1152-1161, Aug. 2001.

[5] OpenCV text module: https://github.com/Itseez/opencv_contrib/tree/master/modules/text

[6] He, Pan, et al. "Reading scene text in deep convolutional sequences." arXiv preprint arXiv:1506.04395 (2015).

[7] Coates, Adam, Andrew Y. Ng, and Honglak Lee. "An analysis of single-layer networks in unsupervised feature learning." International Conference on Artificial Intelligence and Statistics. 2011.
