Image CAPTCHA Recognition (Two Approaches)

Original

2020-06-17 07:07

Required libraries:

PIL   pytesseract

PIL: processes the captcha image (installed today as the Pillow package)

pytesseract: recognizes the text in the image

Required tool:

Tesseract OCR

Download link:

http://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-3.05.00dev.exe
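
After installing Tesseract, it helps to confirm that the executable can actually be found, since pytesseract only wraps it. The following is a minimal sketch, assuming the binary is named tesseract and is expected on PATH; the helper name find_tesseract is hypothetical and not part of any library.

```python
import shutil
from typing import Optional


def find_tesseract(name: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(name)


path = find_tesseract()
if path is None:
    # If Tesseract is installed elsewhere, pytesseract can be pointed at it by
    # setting pytesseract.pytesseract.tesseract_cmd to the full path.
    print("tesseract was not found on PATH")
else:
    print("tesseract found at:", path)
```

If the check fails on Windows, the usual cause is that the install directory was not added to PATH.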

In short, there are three steps:

Download the captcha image.
Load (and preprocess) the captcha image.
Recognize the text in the image.

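Download the captcha image

We'll use the CNKI captcha as the example.

The captcha appears on the registration page:

http://my.cnki.net/elibregister/#

Find the URL of the captcha image on that page and download it.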

import requests

# URL of the CNKI captcha image
url = 'http://my.cnki.net/elibregister/CheckCode.aspx'
response = requests.get(url)
with open('CheckCode.png', 'wb') as fw:
    fw.write(response.content)
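
The download step saves whatever the server returns, so an error page would be written to CheckCode.png just as readily as an image. A small sketch of one way to guard against that, by checking the leading bytes against well-known image signatures before saving. The helper name sniff_image_type is hypothetical, and the set of formats checked is an assumption about what a captcha endpoint might return.

```python
from typing import Optional

# Leading "magic" bytes of a few common image formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"\xff\xd8\xff": "jpeg",
}


def sniff_image_type(data: bytes) -> Optional[str]:
    """Return 'png', 'gif' or 'jpeg' if data starts with a known signature, else None."""
    for magic, kind in SIGNATURES.items():
        if data.startswith(magic):
            return kind
    return None


# Sample inputs standing in for response.content.
print(sniff_image_type(b"GIF89a" + b"\x00" * 10))       # → gif
print(sniff_image_type(b"<html>error page</html>"))     # → None
```

In the download code, the check would be applied to response.content before the file is written.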

Load (and preprocess) the captcha image

Use the PIL.Image module to process the image.

from PIL import Image

img = Image.open('CheckCode.png')

img = img.convert('L')  # convert to grayscale

# Binarize: map each gray level to 0 or 1
threshold = 128     # binarization threshold
t_list = []
for i in range(256):
    if i < threshold:
        t_list.append(0)
    else:
        t_list.append(1)
img = img.point(t_list, '1')
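
To make the binarization step easier to follow, here is a sketch of the lookup table on its own, without PIL. It builds the same 256-entry table and looks up a few sample gray levels. The function name build_threshold_table and the sample values are illustrative only.

```python
from typing import List


def build_threshold_table(threshold: int = 128) -> List[int]:
    """Build a 256-entry table: gray levels below the threshold map to 0, the rest to 1."""
    return [0 if level < threshold else 1 for level in range(256)]


table = build_threshold_table(128)

# Look up a few sample gray levels the way a per-pixel table lookup would.
sample_pixels = [0, 100, 127, 128, 200, 255]
print([table[p] for p in sample_pixels])  # → [0, 0, 0, 1, 1, 1]
```

Raising or lowering the threshold moves the boundary between dark and light pixels, which is worth tuning if the captcha's characters are faint or the background is dark.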

Recognize the image

Recognition uses the pytesseract module and takes a single line of code.

import pytesseract
result = pytesseract.image_to_string(img)
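
The string returned by OCR often carries extra whitespace or stray punctuation around the characters. A minimal cleanup sketch, assuming the captcha contains only ASCII letters and digits (an assumption about this captcha, not something the OCR output guarantees). The helper name clean_captcha_text is hypothetical.

```python
import re


def clean_captcha_text(raw: str) -> str:
    """Drop every character that is not an ASCII letter or digit."""
    return re.sub(r"[^0-9A-Za-z]", "", raw)


# Sample string standing in for the OCR result.
print(clean_captcha_text(" aB3 d\n\x0c"))  # → aB3d
```

The cleaned string is what would be submitted as the captcha answer.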

Full code:

# -*- coding: utf-8 -*-

# @Time    : 2019/8/14 14:58
# @Author  : hccfm
# @File    : 图像验证码.py
# @Software: PyCharm

"""
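Character image CAPTCHA handling:
uses Tesseract to recognize the captcha. When the image is noisy, recognition
errors are frequent, so the image can be preprocessed first.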
"""
import requests
from PIL import Image
import pytesseract

url = 'http://my.cnki.net/elibregister/CheckCode.aspx'
# url = 'http://www.yundama.com//index/captcha'

def main():
    response = requests.get(url)
    with open('CheckCode.png','wb') as fw:
        fw.write(response.content)

    img = Image.open('CheckCode.png')
    result = pytesseract.image_to_string(img)
    print("Without preprocessing (error-prone):", result)

    img = img.convert('L')  # convert to grayscale

    threshold = 128     # binarization threshold
    t_list = []
    for i in range(256):
        if i < threshold:
            t_list.append(0)
        else:
            t_list.append(1)
    img = img.point(t_list, '1')

    img.show()
    result = pytesseract.image_to_string(img)
    print("After preprocessing:", result)

if __name__ == '__main__':
    main()

Result: