Captcha Recognition: Splitting Connected Characters

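As the previous posts showed, when the characters in a captcha do not touch each other, any classical machine learning algorithm (KNN, SVM, and so on) makes it easy to cut them apart, label them, and recognize them. In practice, most websites do not use captchas that simple, so the hard part becomes how to split characters that are joined together. Ask around and most people will tell you to use a CNN (convolutional neural network), which sidesteps segmentation altogether. But is it really impossible to do this with classical machine learning algorithms?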

Let's start with a captcha that is fairly simple but cannot be split with the projection method:

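In this captcha the X and the N are joined, so they cannot be separated with a simple cut. The characters are also slanted to some degree, so a vertical projection may not show a clear boundary. For a pair like N and h, of course, other methods work, such as splitting by connected components.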
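To make the limitation concrete, here is a minimal sketch of projection-based cutting on a tiny hand-made binary image (1 = stroke, 0 = background). The helper name `projection_cuts` and the toy images are my own illustration, not code from the earlier posts, and plain nested lists are used to keep it dependency-free.

```python
def projection_cuts(binary):
    """Return column ranges (start, end) that contain stroke pixels.

    binary: list of rows, 1 = stroke, 0 = background.
    Columns whose projection is 0 are treated as gaps between characters.
    """
    # Vertical projection: number of stroke pixels in each column
    projection = [sum(col) for col in zip(*binary)]
    segments, start = [], None
    for j, count in enumerate(projection):
        if count > 0 and start is None:
            start = j                        # a character begins
        elif count == 0 and start is not None:
            segments.append((start, j - 1))  # a character ends
            start = None
    if start is not None:
        segments.append((start, len(projection) - 1))
    return segments


# Two separated blobs: the projection has an empty column between them
separated = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
]
# The same blobs joined by a single pixel: no empty column remains
touching = [
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
]
print(projection_cuts(separated))  # [(0, 1), (3, 4)]
print(projection_cuts(touching))   # [(0, 4)]
```

One connecting pixel is enough to merge two characters into a single segment, which is exactly the situation with the X and the N.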

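Before tackling this, consider another question: why do we split characters at all, and is it strictly necessary? It is not. Even KNN or SVM can recognize a captcha without segmentation. But if you treat the captcha as a whole, a class is no longer a single character but an entire string. The labeling workload explodes: instead of 36 classes (26 letters plus 10 digits), you get 36^4 = 1,679,616 classes, one for every 4-character string over that alphabet. That is no small change; labeling by hand, you would be a grandparent before you finished. And that is just for 4-character captchas.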
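A quick calculation shows how fast the label space grows with captcha length. This assumes characters may repeat and that position matters; the function name `label_space` is just for illustration.

```python
def label_space(charset_size, length):
    """Number of distinct strings of `length` characters over the charset."""
    return charset_size ** length


charset_size = 26 + 10  # letters plus digits, as counted above

print(label_space(charset_size, 1))  # 36
print(label_space(charset_size, 4))  # 1679616
print(label_space(charset_size, 5))  # 60466176
print(label_space(charset_size, 6))  # 2176782336
```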

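So how do we split connected characters? With the drop-fall algorithm. The idea is simple: pick a starting position for a water drop, say a pixel somewhere above the X and the N, and let it fall according to a set of rules. When the drop reaches the bottom of the image, the path it took is the cut boundary (a curved cut). A picture makes this easier to follow: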

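The drop can move in 5 directions: left, right, lower-left, down, and lower-right. Which one it takes depends on whether the pixels at those 5 positions are black or white (note that the drop-fall algorithm only works on binarized images). The traditional drop-fall algorithm uses 6 rules to decide the move (here I say "background" for white pixels and "stroke" for black pixels):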

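All background or all stroke -> move down
Lower-left is background and at least one other point is stroke -> move lower-left
Lower-left is stroke, directly below is background -> move down
Lower-left and directly below are stroke, lower-right is background -> move lower-right
All three points below are stroke and right is background -> move right
Everything except left is stroke, left is background -> move left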

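We don't need to memorize these rules; the cases only have to be separated out when writing the program. The six rules boil down to something simple: go wherever there is a way through. If all five neighbors look the same, whether all open or all blocked, go straight down (when everything is stroke, this means trampling a path through it). Otherwise, take the first background neighbor in this priority order: lower-left > down > lower-right > right > left.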
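As a sanity check on that summary, the sketch below writes the six rules twice, once literally and once as "first background neighbor wins", and compares the two on all 32 neighbor combinations. The order that reproduces the rules as listed is lower-left, down, lower-right, right, left. The function and constant names are mine, chosen for illustration.

```python
from itertools import product

# Moves are (row_delta, col_delta); pixels are 0 = background, 1 = stroke.
DOWN, DOWN_LEFT, DOWN_RIGHT, RIGHT, LEFT = (1, 0), (1, -1), (1, 1), (0, 1), (0, -1)


def move_by_rules(n1, n2, n3, n4, n5):
    """The six rules in the order listed above.
    n1 = left, n2 = lower-left, n3 = below, n4 = lower-right, n5 = right."""
    if n1 == n2 == n3 == n4 == n5:
        return DOWN
    elif n2 == 0 and any((n1, n3, n4, n5)):
        return DOWN_LEFT
    elif n2 == 1 and n3 == 0:
        return DOWN
    elif all((n2, n3)) and n4 == 0:
        return DOWN_RIGHT
    elif all((n2, n3, n4)) and n5 == 0:
        return RIGHT
    elif all((n2, n3, n4, n5)) and n1 == 0:
        return LEFT
    raise ValueError('no rule matched')


def move_by_priority(n1, n2, n3, n4, n5):
    """The same decision written as a priority list."""
    if len({n1, n2, n3, n4, n5}) == 1:  # all open or all blocked
        return DOWN
    ordered = [(n2, DOWN_LEFT), (n3, DOWN), (n4, DOWN_RIGHT), (n5, RIGHT), (n1, LEFT)]
    # Not all pixels are equal here, so at least one of them is background
    return next(move for pixel, move in ordered if pixel == 0)


agree = all(move_by_rules(*n) == move_by_priority(*n)
            for n in product((0, 1), repeat=5))
print(agree)  # True
```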

Let's implement it in Python:

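import numpy as np


def dropfall(img, start):
    '''
    The drop starts falling at (0, start).
    The drop must stay at least one column away from the left and
    right image edges, otherwise the 3-pixel slices below come up short.
    '''
    a = np.array(img)
    a = (a < 200) * 1  # binarize: 1 = stroke, 0 = background
    height, _ = a.shape
    x, y = 0, start  # x is the row, y is the column
    way = []  # path taken by the drop
    while x+1 < height:
        n1, _, n5 = a[x, y-1:y+2]  # left (n1) and right (n5)
        n2, n3, n4 = a[x+1, y-1:y+2]  # lower-left (n2), below (n3), lower-right (n4)
        # The if/elif conditions are the 6 rules above, in the same order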
        if n1 == n2 == n3 == n4 == n5:
            x += 1
        elif n2 == 0 and any((n1, n3, n4, n5)):
            x += 1
            y -= 1
        elif n2 == 1 and n3 == 0:
            x += 1
        elif all((n2, n3)) and n4 == 0:
            x += 1
            y += 1
        elif all((n2, n3, n4)) and n5 == 0:
            y += 1
            # Avoid bouncing right/left forever between this step and the next
            if (x, y) in way:
                x += 1
        elif all((n2, n3, n4, n5)) and n1 == 0:
            y -= 1
        way.append((x, y))
    return way

Now that we have the algorithm, let's cut the captcha. To make the cut easier to see, we use matplotlib to display the captcha and the cut paths:

import numpy as np
import os
from PIL import Image
import matplotlib.pyplot as mp


def dropfall(img, start):
    a = np.array(img)
    a = (a < 200) * 1
    height, _ = a.shape
    x, y = 0, start
    way = []
    while x+1 < height:
        n1, _, n5 = a[x, y-1:y+2]
        n2, n3, n4 = a[x+1, y-1:y+2]
        if n1 == n2 == n3 == n4 == n5:
            x += 1
        elif n2 == 0 and any((n1, n3, n4, n5)):
            x += 1
            y -= 1
        elif n2 == 1 and n3 == 0:
            x += 1
        elif all((n2, n3)) and n4 == 0:
            x += 1
            y += 1
        elif all((n2, n3, n4)) and n5 == 0:
            y += 1
            if (x, y) in way:
                x += 1
        elif all((n2, n3, n4, n5)) and n1 == 0:
            y -= 1
        way.append((x, y))
    return way
        
os.chdir('G:\\knn\\')
img = Image.open('3.png').convert('L')
a = np.array(img)
a = (a > 200) * 255
# Coordinates (row, col) of every stroke pixel
x = np.argwhere(a == 0)

mp.scatter(x[:,1], x[:, 0], s=10)
ax = mp.gca()                               
ax.xaxis.set_ticks_position('top') 
ax.invert_yaxis() 

# Draw one cut path per hand-picked start column
for start in (54, 71, 89):
    way = dropfall(img, start)
    way_x = [i[0] for i in way]
    way_y = [i[1] for i in way]
    mp.scatter(way_y, way_x, marker='*')
mp.show()

Cut result:

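As you can see, the cut is not ideal: part of the N is assigned to the X, and part of the Y is cut off as well. This is not the algorithm's fault, though. The N has a gap at its upper left, and the Y loses a piece because the start point we chose for that cut was poor. Even with the cuts shown in the figure, each character keeps its distinguishing features, so feeding these pieces straight into recognition should not do too badly.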
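The code above only draws the paths. To actually separate the pixels on either side of a path, something like the following could be used. This is a sketch under my own assumptions: the helper `split_by_path` is hypothetical, the image is a nested list with 1 = stroke and 0 = background, and when the drop moves sideways within a row, the last column it visited on that row is taken as the cut.

```python
def split_by_path(binary, way, start):
    """Split a binary image (list of rows) into left/right parts along a path.

    way:   list of (row, col) positions, in the format dropfall returns
    start: start column, used for rows the path never recorded (e.g. row 0)
    """
    # For each row keep the last column the drop visited on that row
    cut = {}
    for row, col in way:
        cut[row] = col
    left, right = [], []
    for r, pixels in enumerate(binary):
        c = cut.get(r, start)
        # Pixels left of the cut column stay in `left`; the rest go to `right`
        left.append([p if j < c else 0 for j, p in enumerate(pixels)])
        right.append([p if j >= c else 0 for j, p in enumerate(pixels)])
    return left, right


binary = [
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
]
way = [(1, 2), (2, 2)]  # a hand-written path straight down column 2
left, right = split_by_path(binary, way, start=2)
print(left)   # [[1, 1, 0, 0, 0], [1, 1, 0, 0, 0], [1, 1, 0, 0, 0]]
print(right)  # [[0, 0, 0, 1, 1], [0, 0, 1, 1, 1], [0, 0, 0, 1, 1]]
```

Assigning the cut column itself to the right-hand part is an arbitrary choice; either side works as long as it is applied consistently.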

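All three start points in the code above were picked by me by looking at the captcha. How can the program find the cut boundaries automatically? As the result above shows, the quality of the start point directly determines the quality of the cut. The traditional drop-fall algorithm finds the start point like this: scanning from left to right, look for a white pixel that has a black pixel on its left and black pixels on its right. This is not accurate, however. For characters such as X and Y, the boundary found this way falls in the middle of the X or the Y, and the algorithm splits the character itself in two.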
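Here is a rough sketch of that traditional start-point search, under my reading of the rule: scan rows top to bottom and columns left to right, and take the first background pixel whose immediate left neighbor is stroke and which has at least one stroke pixel further right in the same row. The name `find_start` and the toy X shape are illustrative only.

```python
def find_start(binary):
    """Return (row, col) of the first background pixel with a stroke pixel
    immediately to its left and at least one stroke pixel to its right,
    or None if no such pixel exists. 1 = stroke, 0 = background."""
    for r, row in enumerate(binary):
        for c in range(1, len(row) - 1):
            if row[c] == 0 and row[c - 1] == 1 and 1 in row[c + 1:]:
                return r, c
    return None


# A single X-shaped character
x_shape = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
]
print(find_start(x_shape))  # (0, 1)
```

The start point lands between the two upper arms of the X, that is, inside the character, which is how the cut ends up going through it.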

Drop-fall was actually not my first idea for splitting characters; clustering was. But clustering gives poor results. Let's look at an example:

from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
import numpy as np
import os
from PIL import Image
import matplotlib.pyplot as mp


os.chdir('G:\\knn\\')
img = Image.open('3.png').convert('L')
a = np.array(img)
a = (a > 200) * 255
# Coordinates (row, col) of every stroke pixel
x = np.argwhere(a == 0)
model = KMeans(n_clusters=4)
# # model = AgglomerativeClustering(n_clusters=4)
model.fit(x)

mp.scatter(x[:,1], x[:, 0], c=model.labels_, s=10, cmap='brg')
ax = mp.gca()                               
ax.xaxis.set_ticks_position('top') 
ax.invert_yaxis() 

mp.show()

The result of running the code:

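Is this result bad? No, but only on this particular image. Because every character here keeps some distance from its neighbors, clustering does well. I tried several captchas, and only a few came out as cleanly as this one. Among the clustering algorithms, AgglomerativeClustering and KMeans performed best, but they behave differently from captcha to captcha: sometimes one is better, sometimes the other, and on some captchas both do badly.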

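So what if we use clustering to find the start points for the drop-fall algorithm? Still not ideal, but better than clustering alone. Here are the code and the result: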

from sklearn.cluster import KMeans
import numpy as np
import os
from PIL import Image
import matplotlib.pyplot as mp


def dropfall(img, start):
    a = np.array(img)
    a = (a < 200) * 1
    height, _ = a.shape
    x, y = 0, start
    way = []
    while x+1 < height:
        n1, _, n5 = a[x, y-1:y+2]
        n2, n3, n4 = a[x+1, y-1:y+2]
        if n1 == n2 == n3 == n4 == n5:
            x += 1
        elif n2 == 0 and any((n1, n3, n4, n5)):
            x += 1
            y -= 1
        elif n2 == 1 and n3 == 0:
            x += 1
        elif all((n2, n3)) and n4 == 0:
            x += 1
            y += 1
        elif all((n2, n3, n4)) and n5 == 0:
            y += 1
        elif all((n2, n3, n4, n5)) and n1 == 0:
            y -= 1
            if (x, y) in way:
                x += 1
        way.append((x, y))
    return way
        
os.chdir('G:\\knn\\')
img = Image.open('3.png').convert('L')
a = np.array(img)
a = (a > 200) * 255
# Coordinates (row, col) of every stroke pixel
x = np.argwhere(a == 0)
model = KMeans(n_clusters=4)
model.fit(x)
# Compute the start columns for the drops
cols = x[:, 1]
# Left edge of every cluster except the leftmost one
x_min = sorted(cols[model.labels_ == k].min() for k in range(4))[1:]
# Right edge of every cluster except the rightmost one
x_max = sorted(cols[model.labels_ == k].max() for k in range(4))[:-1]
# Midpoint between a character's right edge and its neighbor's left edge
x1, x2, x3 = [(i+j)//2 for i, j in zip(x_min, x_max)]
# Draw the captcha
mp.scatter(x[:,1], x[:, 0], c=model.labels_, s=10, cmap='brg')
ax = mp.gca()                               
ax.xaxis.set_ticks_position('top') 
ax.invert_yaxis() 
# Draw the cut paths
for start in (x1, x2, x3):
    way = dropfall(img, start)
    way_x = [i[0] for i in way]
    way_y = [i[1] for i in way]
    mp.scatter(way_y, way_x, marker='*')

mp.show()

In the code, to reduce error, each start column is the average of one character's right edge and the left edge of its neighbor.
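The same midpoint idea can be written for any number of clusters and checked on toy data without running KMeans. In this sketch the function name `start_columns` and the sample points and labels are made up for illustration; cluster labels can come in any order, just as KMeans assigns them.

```python
def start_columns(points, labels):
    """points: list of (row, col) stroke pixels; labels: cluster id per point.
    Returns one start column per gap between horizontally adjacent clusters."""
    lefts, rights = {}, {}
    for (_, col), k in zip(points, labels):
        lefts[k] = min(col, lefts.get(k, col))
        rights[k] = max(col, rights.get(k, col))
    # Drop the leftmost left edge and the rightmost right edge, then pair
    # each remaining left edge with the right edge of the cluster before it
    left_edges = sorted(lefts.values())[1:]
    right_edges = sorted(rights.values())[:-1]
    return [(l + r) // 2 for l, r in zip(left_edges, right_edges)]


# Three clusters spanning columns 2..10, 12..20 and 19..30
points = [(0, 2), (0, 10), (0, 12), (0, 20), (0, 19), (0, 30)]
labels = [2, 2, 0, 0, 1, 1]
print(start_columns(points, labels))  # [11, 19]
```

Note that the second and third clusters overlap horizontally (columns 19 and 20), so their midpoint falls inside both; this is the kind of case where the drop-fall path has to do the real work.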

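Even so, the result is still not ideal. The reason is that the characters are hollow. With solid characters the drop-fall cut would work better than this, although for solid characters the boundaries found by clustering would be somewhat worse.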

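This is as far as I have gotten for now. If I come up with improvements or new ideas later, I will share them. And if you have a bold idea of your own, say so; it might just work well.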
