Image CAPTCHA Recognition (Two Approaches)

Original

2020-06-17 07:07

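Required libraries:

PIL   pytesseract

PIL: used to process the CAPTCHA image (installed through the Pillow package)

pytesseract: used to recognize the text in the image

Required tool:

Tesseract OCR

Download link: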

http://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-3.05.00dev.exe
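
Before running any OCR code, it can help to confirm that the Tesseract executable is actually reachable. The following is a minimal sketch using only the standard library. It assumes the executable is named `tesseract`, and `find_tesseract` is a hypothetical helper name, not part of pytesseract:

```python
import shutil


def find_tesseract(name="tesseract"):
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH")
    else:
        print("tesseract found at:", path)
```

If the executable is installed but not on PATH, which is common on Windows, pytesseract can be pointed at it by assigning the full path to `pytesseract.pytesseract.tesseract_cmd`.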

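In short, there are three steps:

Download the CAPTCHA image.
Load (preprocess) the CAPTCHA image.
Recognize the image.

Download the CAPTCHA image

This example uses the CAPTCHA from CNKI's registration page.

The registration page: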

http://my.cnki.net/elibregister/#

Find the URL that serves the CAPTCHA image on that page, then download it.

import requests

url = 'http://my.cnki.net/elibregister/CheckCode.aspx'
response = requests.get(url)
# Save the raw image bytes to disk
with open('CheckCode.png', 'wb') as fw:
    fw.write(response.content)
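
The download code saves whatever the server returns under a `.png` name, but the server decides the real format, and a failed request may return an HTML page instead of an image. As a hedged sketch, the leading bytes of the response can be checked against a few well-known file signatures before saving. `sniff_image_format` is a hypothetical helper introduced here for illustration:

```python
def sniff_image_format(data):
    """Guess the image format from the leading bytes; return None if unknown."""
    signatures = [
        (b"\x89PNG\r\n\x1a\n", "png"),
        (b"GIF87a", "gif"),
        (b"GIF89a", "gif"),
        (b"\xff\xd8\xff", "jpeg"),
        (b"BM", "bmp"),
    ]
    for magic, name in signatures:
        if data.startswith(magic):
            return name
    return None


if __name__ == "__main__":
    # In the download step this would be called as sniff_image_format(response.content)
    print(sniff_image_format(b"GIF89a" + b"\x00" * 10))  # prints: gif
    print(sniff_image_format(b"<html>error</html>"))     # prints: None
```

A result of `None` suggests the response was not an image at all, in which case there is nothing useful to pass on to the later steps.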

Load (preprocess) the CAPTCHA image

Use the PIL.Image module to process the image.

from PIL import Image

img = Image.open('CheckCode.png')

img = img.convert('L')  # convert to grayscale

# Binarization: gray values below the threshold map to 0, the rest to 1
threshold = 128     # binarization threshold
t_list = []
for i in range(256):
    if i < threshold:
        t_list.append(0)
    else:
        t_list.append(1)
img = img.point(t_list, '1')
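
The best threshold depends on the CAPTCHA, so it is convenient to wrap the lookup table in a function and try several values. This is a minimal sketch of the same binarization as above. The function names `build_threshold_table` and `binarize` are hypothetical, and `binarize` assumes it receives a PIL image:

```python
def build_threshold_table(threshold=128):
    """Return a 256-entry lookup table: 0 below the threshold, 1 otherwise."""
    return [0 if value < threshold else 1 for value in range(256)]


def binarize(img, threshold=128):
    """Convert a PIL image to grayscale, then to a 1-bit image via the table."""
    return img.convert('L').point(build_threshold_table(threshold), '1')


if __name__ == "__main__":
    table = build_threshold_table(128)
    print(len(table), table[127], table[128])  # prints: 256 0 1
```

A lower threshold turns only the darker pixels black, while a higher one keeps more of the image black. Comparing, for example, `binarize(img, 100)` and `binarize(img, 160)` with `img.show()` is a quick way to tune it.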

Recognize the image

Recognition uses the pytesseract module and takes a single line of code.

import pytesseract
result = pytesseract.image_to_string(img)
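
The string returned by `image_to_string` often carries surrounding whitespace or a trailing newline, and leftover noise can show up as stray punctuation. The sketch below filters the result, assuming the CAPTCHA contains only ASCII letters and digits. That is an assumption about the CAPTCHA, not something the recognition step guarantees, and `clean_ocr_text` is a hypothetical helper name:

```python
import string

# Assumed CAPTCHA alphabet: ASCII letters and digits only
ALLOWED = set(string.ascii_letters + string.digits)


def clean_ocr_text(raw, allowed=ALLOWED):
    """Keep only allowed characters, dropping whitespace and stray symbols."""
    return "".join(ch for ch in raw if ch in allowed)


if __name__ == "__main__":
    print(clean_ocr_text(" aB3 d\n"))  # prints: aB3d
```

In the recognition step it would be used as `clean_ocr_text(result)`. If the cleaned string does not have the expected length, downloading a fresh CAPTCHA is usually simpler than trying to repair the text.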

Full code:

# -*- coding: utf-8 -*-

# @Time    : 2019/8/14 14:58
# @Author  : hccfm
# @File    : 圖像驗證碼.py
# @Software: PyCharm

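"""
Character image CAPTCHA handling:
uses Tesseract to recognize the CAPTCHA. When the image contains heavy noise,
recognition errors are likely, so the image can be preprocessed first.
"""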
import requests
from PIL import Image
import pytesseract

url = 'http://my.cnki.net/elibregister/CheckCode.aspx'
# url = 'http://www.yundama.com//index/captcha'

def main():
    response = requests.get(url)
    with open('CheckCode.png','wb') as fw:
        fw.write(response.content)

    img = Image.open('CheckCode.png')
    result = pytesseract.image_to_string(img)
    print("Without preprocessing (errors are likely):", result)

    img = img.convert('L')  # convert to grayscale

    threshold = 128     # binarization threshold
    t_list = []
    for i in range(256):
        if i < threshold:
            t_list.append(0)
        else:
            t_list.append(1)
    img = img.point(t_list, '1')

    img.show()
    result = pytesseract.image_to_string(img)
    print("After preprocessing:", result)

if __name__ == '__main__':
    main()

Result: