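One of the most common anti-scraping measures a crawler runs into is the CAPTCHA. The kind we meet most often is the alphanumeric CAPTCHA; for those, it is practical to collect a dataset and train a model until the recognition rate is dependable. Another very common kind is the arithmetic CAPTCHA, as shown in the figure below:
At first glance this CAPTCHA looks very simple, because plain OCR can read it quite accurately. But how do we actually go about it? First, let's look at how the image is put together.
We can clearly see that this CAPTCHA is actually made up of 4 images: "9", "x", "1" and "=" are placed side by side to form the CAPTCHA. So when we run OCR, we can stitch them into a single image first and recognize that in one pass, which saves computation time.
The core code is as follows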
from PIL import Image


def join(png1, png2, png3, png4, count):
    """
    Stitch four tile images side by side into one CAPTCHA image.

    :param png1: path to the first tile
    :param png2: path to the second tile
    :param png3: path to the third tile
    :param png4: path to the fourth tile
    :param count: index used in the output directory name ./image_<count>/
    :return: None; the stitched image is saved as 7.png
    """
    img1, img2, img3, img4 = Image.open(png1), Image.open(png2), Image.open(png3), Image.open(png4)
    size1, size2, size3, size4 = img1.size, img2.size, img3.size, img4.size

    # Join tiles 1 and 2 into 5.png
    joint = Image.new('RGB', (size1[0] + size2[0], size1[1]))
    loc1, loc2 = (0, 0), (size1[0], 0)
    joint.paste(img1, loc1)
    joint.paste(img2, loc2)
    joint.save('./image_{0}/5.png'.format(count))

    # Join tiles 3 and 4 into 6.png; tile 4 starts where tile 3 ends
    joint = Image.new('RGB', (size3[0] + size4[0], size3[1]))
    loc3, loc4 = (0, 0), (size3[0], 0)
    joint.paste(img3, loc3)
    joint.paste(img4, loc4)
    joint.save('./image_{0}/6.png'.format(count))

    # Join the two halves into the final image, 7.png
    img1, img2 = Image.open('./image_{0}/5.png'.format(count)), Image.open('./image_{0}/6.png'.format(count))
    size1, size2 = img1.size, img2.size
    joint = Image.new('RGB', (size1[0] + size2[0], size1[1]))
    loc1, loc2 = (0, 0), (size1[0], 0)
    joint.paste(img1, loc1)
    joint.paste(img2, loc2)
    joint.save('./image_{0}/7.png'.format(count))
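The stitching idea does not depend on there being exactly four tiles or on writing intermediate files. The following is only a sketch of a more general version, not the code used for this site: `layout` and `stitch` are hypothetical helper names, and it assumes the tiles should simply be placed left to right on a white canvas as tall as the tallest tile.

```python
def layout(sizes):
    """Return (canvas_size, offsets) for tiles placed left to right.

    sizes is a list of (width, height) tuples, one per tile.
    """
    offsets = []
    x = 0
    for width, _height in sizes:
        # Each tile starts where the previous one ended.
        offsets.append((x, 0))
        x += width
    canvas_height = max(height for _width, height in sizes)
    return (x, canvas_height), offsets


def stitch(paths, out_path):
    """Paste the images at `paths` side by side and save the result."""
    # Imported here so that layout() stays usable without Pillow installed.
    from PIL import Image

    tiles = [Image.open(p) for p in paths]
    canvas_size, offsets = layout([t.size for t in tiles])
    # White background so any gap under a shorter tile stays blank.
    canvas = Image.new('RGB', canvas_size, 'white')
    for tile, offset in zip(tiles, offsets):
        canvas.paste(tile, offset)
    canvas.save(out_path)
```

With this sketch, a call such as `stitch([png1, png2, png3, png4], './image_0/7.png')` would produce the final image in one step, assuming the output directory already exists.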
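Part of the core code
from PIL import Image
import pytesseract


# Binarization
def two_value(filename):
    image = Image.open(filename)
    # Convert to grayscale
    lim = image.convert('L')
    # Gray threshold of 165: pixels below it become black (0), the rest white (1)
    threshold = 165
    table = []
    for j in range(256):
        if j < threshold:
            table.append(0)
        else:
            table.append(1)
    bim = lim.point(table, '1')
    # Overwrite the original file with the binarized image
    bim.save(filename)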
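To make the lookup table less abstract, here is a tiny pure-Python sketch of what it does to individual gray values. `build_table` and `binarize` are hypothetical helpers written for illustration only; they are not part of Pillow, and the real work above is done by `point`.

```python
def build_table(threshold):
    """Map each gray level 0..255 to 0 (below threshold) or 1 (at or above)."""
    return [0 if gray < threshold else 1 for gray in range(256)]


def binarize(pixels, threshold=165):
    """Apply the lookup table to a flat list of gray values."""
    table = build_table(threshold)
    return [table[p] for p in pixels]


# Dark strokes (low gray values) map to 0, light background maps to 1.
print(binarize([0, 164, 165, 255]))  # [0, 0, 1, 1]
```

If the strokes of a particular CAPTCHA come out too thin or too noisy, the threshold is the knob to adjust; 165 is simply the value that suited this site.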


def extract_numbers(string):
    # Replace every non-digit character with a space, then split into integers
    cleaned = ''.join(c if c.isdigit() else ' ' for c in string)
    return [int(token) for token in cleaned.split()]


def add(string):
    # Sum of all numbers found in the recognized text
    return sum(extract_numbers(string))


def sub(string):
    # First number minus all the following ones; this also works when the
    # first operand or an intermediate result is 0
    nums = extract_numbers(string)
    if not nums:
        return 0
    return nums[0] - sum(nums[1:])


def mul(string):
    # Product of all numbers found in the recognized text
    m = 1
    for n in extract_numbers(string):
        m *= n
    return m
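

def get_num(filename):
    # Binarize the image in place, then run OCR on the cleaned result
    two_value(filename)
    image = Image.open(filename)
    vcode = pytesseract.image_to_string(image)
    print(vcode)
    # Dispatch on the operator found in the recognized text
    if "+" in vcode:
        return add(vcode)
    elif "X" in vcode or "x" in vcode:
        return mul(vcode)
    elif "-" in vcode:
        return sub(vcode)
    # No known operator was recognized
    return None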
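Since the arithmetic CAPTCHAs on this site only involve addition, subtraction and multiplication, I simply wrote three functions to do the calculation.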
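As an alternative to three separate functions, the recognized text can also be parsed in a single step. The snippet below is a minimal sketch under a few assumptions that do not come from the site itself: the expression always has exactly two operands, and OCR may return the multiplication sign as `x`, `X`, `*` or `×`. `evaluate` is a hypothetical name.

```python
import re

# Supported operations, keyed by a normalized operator character.
OPERATORS = {
    '+': lambda a, b: a + b,
    '-': lambda a, b: a - b,
    'x': lambda a, b: a * b,
}

# <number> <operator> <number>, with optional whitespace in between.
PATTERN = re.compile(r'(\d+)\s*([+\-xX*×])\s*(\d+)')


def evaluate(text):
    """Return the value of the first 'a op b' expression in text, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    left, op, right = match.groups()
    # Treat every multiplication-like symbol as 'x'.
    if op in 'xX*×':
        op = 'x'
    return OPERATORS[op](int(left), int(right))


print(evaluate("9x1="))  # 9
```

Returning `None` when nothing matches lets the caller decide whether to refresh the CAPTCHA and try again.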
Result of a run: