Image and CAPTCHA Recognition with Python

Original

jinhua_110

2020-05-22 03:28

1. Image recognition: OCR technology and the Tesseract-OCR tool
2. Third-party Python packages for calling OCR
3. Worked example and implementation
4. Notes on the process

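Approach:
Overview:
To be clear from the start: OCR of images and CAPTCHAs is not something Python does by itself. A third-party Python package calls an external OCR tool to do the work.

OCR tool used: Tesseract-OCR
Third-party Python packages: PIL (installed as Pillow), pytesseract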

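1 Image recognition: OCR technology and the Tesseract-OCR tool

1.1 What OCR is
OCR (Optical Character Recognition) is the process in which an electronic device (for example a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then uses character recognition methods to translate those shapes into computer text.
In other words, the text in an image is recognized from the image's patterns of dark and light.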
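To make the idea of "dark and light patterns" concrete, here is a minimal sketch. It is not how Tesseract works internally; it only thresholds a tiny hand-made grayscale grid into dark (1) and light (0) cells. The grid values, the threshold of 128 and the function name `to_dark_light` are assumptions chosen for illustration.

```python
def to_dark_light(gray_rows, threshold=128):
    """Map grayscale values (0 = black, 255 = white) to 1 (dark) or 0 (light)."""
    return [[1 if value < threshold else 0 for value in row] for row in gray_rows]


# A hypothetical 5x3 grayscale patch containing one dark vertical stroke,
# roughly what the letter "I" might look like after scanning.
patch = [
    [250, 20, 245],
    [240, 15, 250],
    [255, 10, 248],
    [250, 25, 251],
    [245, 30, 249],
]

pattern = to_dark_light(patch)
for row in pattern:
    # Print dark cells as "#" and light cells as "." to show the shape.
    print("".join("#" if cell else "." for cell in row))
```

A recognizer would then compare shapes like this one against the shapes of known characters.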
1.2 The Tesseract-OCR tool
OCR tool used: Tesseract-OCR

Windows download: https://digi.bib.uni-mannheim.de/tesseract/

To install, simply click Next through the installer.

1.3 Installing Tesseract-OCR on other systems
Tesseract-OCR can be installed on other systems as follows.
For CentOS 8, run the following commands as root:

dnf config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_8/
rpm --import https://build.opensuse.org/projects/home:Alexander_Pozdnyakov/public_key
dnf install tesseract
dnf install tesseract-langpack-deu

For CentOS 7, run the following commands as root:

yum-config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_7/
sudo rpm --import https://build.opensuse.org/projects/home:Alexander_Pozdnyakov/public_key
yum update
yum install tesseract 
yum install tesseract-langpack-deu

For Fedora 30, run the following commands as root:

dnf config-manager --add-repo https://download.opensuse.org/repositories/home:Alexander_Pozdnyakov/Fedora_30/home:Alexander_Pozdnyakov.repo
dnf install tesseract
dnf install tesseract-langpack-deu

For openSUSE Leap 15.0, run the following commands as root:

zypper addrepo https://download.opensuse.org/repositories/home:Alexander_Pozdnyakov/openSUSE_Leap_15.0/home:Alexander_Pozdnyakov.repo
zypper refresh
zypper install tesseract-ocr
zypper install tesseract-ocr-traineddata-german

Those are the installation steps for several systems. For other systems, see https://tesseract-ocr.github.io/tessdoc/Home.html

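1.4 Configuring the environment variable
On Windows, once the tool is installed you need to add it to the Path environment variable.
Open Control Panel -> System and Security -> System -> Advanced system settings, then open Environment Variables and edit Path.

In the dialog, enter the installation directory, for example C:\Program Files\Tesseract-OCR\; (including the trailing semicolon).

Click [OK].

For configuring environment variables on Windows 10, see https://www.cnblogs.com/sea-stream/p/10961580.html

Open the Start menu and type cmd to open a new terminal.
In the terminal, run tesseract --list-langs

If the command prints the list of available languages, the environment variable is configured correctly.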
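The same check can be done from Python. The sketch below uses only the standard library to look for the tesseract executable on the PATH and to run `tesseract --list-langs`. The helper names `find_tesseract` and `list_languages` are made up for this example, and the parsing assumes the output begins with a header line starting with "List of", which may differ between Tesseract versions.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable found on PATH, or None."""
    return shutil.which("tesseract")


def list_languages():
    """Return the language codes printed by `tesseract --list-langs`, or [] if unavailable."""
    exe = find_tesseract()
    if exe is None:
        return []
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Depending on the Tesseract version, the list is written to stdout or stderr.
    lines = (result.stdout + result.stderr).splitlines()
    names = [line.strip() for line in lines if line.strip()]
    # Assumption: the header line starts with "List of"; drop it and keep the rest.
    return [name for name in names if not name.lower().startswith("list of")]


print("tesseract found at:", find_tesseract())
print("languages:", list_languages())
```

If the first line prints None, the directory was not added to Path correctly, or the terminal was opened before the change was saved.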
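1.5 Installing language data packs
First download the language data files.
Data files for version 4.0.0: https://tesseract-ocr.github.io/tessdoc/Data-Files.html#data-files-for-version-400-november-29-2016
Then copy the .traineddata files into the "tessdata" directory, for example C:\Program Files\Tesseract-OCR\tessdata
Then run tesseract --list-langs in the terminal again.
The new languages appear in the output if they were added successfully.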
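As a complementary check, the tessdata folder can be inspected directly. This is a minimal sketch using only the standard library; the function name `installed_languages` is hypothetical, and the directory is the example path from above, which you should adjust to your own installation.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the sorted language codes that have a .traineddata file in tessdata_dir."""
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        return []
    # "chi_sim.traineddata" -> "chi_sim"
    return sorted(path.stem for path in folder.glob("*.traineddata"))


# Example path from the text above; it will simply yield [] if it does not exist.
tessdata = r"C:\Program Files\Tesseract-OCR\tessdata"
langs = installed_languages(tessdata)
print("installed:", langs)
print("chi_sim installed:", "chi_sim" in langs)
```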

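2 Third-party Python packages for calling OCR

2.1 Overview
Third-party Python packages: PIL, pytesseract
2.2 Package descriptions
pytesseract, in full Python-tesseract, is a standalone wrapper around Google's Tesseract-OCR. Current versions depend on PIL to load images, so it can read the formats PIL supports, including jpeg, gif, png, bmp and tiff.
PIL (Python Imaging Library) is the most widely used image processing library for Python. The original PIL is no longer maintained; its maintained fork, Pillow, is what gets installed today and is still imported under the name PIL.
The Image class is one of the most important classes in PIL. An Image instance can be created in three ways: by loading an image file, from an image that has already been processed, or from a captured (grabbed) image.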
2.3 Installing the packages
Install pytesseract and PIL (through its maintained fork, Pillow) for Python.
Open a terminal and run:

pip install pytesseract

pip install Pillow
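After installing, it can be useful to confirm that Python can actually find both packages. The following is a small sketch using the standard library; `is_importable` is a hypothetical helper name. Note that the imaging library is installed from the Pillow distribution but is imported under the module name PIL.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# The imaging library is imported as "PIL" even when installed as Pillow.
for module_name in ("PIL", "pytesseract"):
    status = "OK" if is_importable(module_name) else "missing"
    print(f"{module_name}: {status}")
```

If either line reports "missing", the package was probably installed into a different Python environment than the one running the script.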

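3 Worked example and implementation

# -*- coding: utf-8 -*-
"""
Created on Sun May 17 23:49:31 2020
@author: Zhuzi
"""
from PIL import Image
import pytesseract
import sys  # not used below; kept because note 4.2 refers to it


class ImageToStr:

    def imageToStr(self, image_url, lang):
        im = Image.open(image_url)   # 1. Open the image
        im = im.convert('L')         # 2. Convert the colour image to grayscale
        # 3. Run Tesseract on the grayscale image with the given language
        im_str = pytesseract.image_to_string(im, lang=lang)
        return im_str


its = ImageToStr()

# Recognize English (requires the eng language data)
img_str = its.imageToStr('驗證碼.png', 'eng')
print('CAPTCHA recognition output:', img_str)

# Recognize Simplified Chinese (requires the chi_sim language data)
cn_img_str = its.imageToStr('202005186.jpg', 'chi_sim')
print('Text recognition output:', cn_img_str)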

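Sample images:
1. CAPTCHA:

2. Image of Chinese text:

Execution result

Note that the interference lines in a CAPTCHA disturb recognition of the letters, so the clearer the image, the better.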
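One rough way to reduce the effect of thin interference lines is to drop dark pixels that have very few dark neighbours. The sketch below is an illustration under stated assumptions, not part of the original program: it works on a small hand-made grid of 1 (dark) and 0 (light) rather than on a real image, and the function name `remove_thin_noise` and the cut-off of 3 neighbours are hypothetical choices. Thin strokes that belong to real characters would be damaged in the same way, so treat it as a starting point only.

```python
def remove_thin_noise(pattern, min_neighbours=3):
    """Clear dark cells (1) that have fewer than min_neighbours dark cells among their 8 neighbours."""
    height = len(pattern)
    width = len(pattern[0]) if height else 0
    cleaned = [row[:] for row in pattern]
    for y in range(height):
        for x in range(width):
            if pattern[y][x] != 1:
                continue
            dark = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        dark += pattern[ny][nx]
            if dark < min_neighbours:
                cleaned[y][x] = 0
    return cleaned


# A solid 3x3 "character" block plus a one-pixel-wide horizontal interference line.
noisy = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
]

for row in remove_thin_noise(noisy):
    print("".join("#" if cell else "." for cell in row))
```

In this example the block survives and the line is removed, because each pixel on the line has at most two dark neighbours.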

A later post will cover using TensorFlow machine learning to recognize blurry images.

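4 Notes on the process

4.1 The environment variable for Tesseract-OCR must be configured correctly, otherwise exceptions will be raised.
4.2 If NameError: name 'exit' is not defined is raised, add import sys at the top of the program.
4.3 Handling the error "Python tesseract is not installed or it's not in your path":
The pytesseract.py file needs to be modified.
1 Open pytesseract.py
2 Find the line tesseract_cmd = 'tesseract'
3 Above it there is a comment: # CHANGE THIS IF TESSERACT IS NOT IN YOUR PATH, OR IS NAMED DIFFERENTLY
That is, if Tesseract is installed in a directory that is not on your PATH, the path must be set by hand.
Change the line to
tesseract_cmd = r'D:\Program Files (x86)\Tesseract-OCR\tesseract.exe'
Save the file.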
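An alternative that avoids editing the installed package (such edits are lost when pytesseract is upgraded) is to set `pytesseract.pytesseract.tesseract_cmd` from your own script. The sketch below shows one way to do that; the helper `pick_tesseract_path` and the candidate locations are assumptions for illustration, so replace them with the path on your own machine.

```python
import os


def pick_tesseract_path(candidates):
    """Return the first candidate path that exists as a file, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


# Hypothetical install locations; adjust to match your installation.
CANDIDATES = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    r"D:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
]

chosen = pick_tesseract_path(CANDIDATES)

try:
    import pytesseract
except ImportError:
    pytesseract = None  # pytesseract is not installed in this environment

if chosen and pytesseract is not None:
    # Point pytesseract at the executable for this process only.
    pytesseract.pytesseract.tesseract_cmd = chosen

print("Using tesseract at:", chosen)
```

The assignment has to run before the first call to `pytesseract.image_to_string`.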
