6. Scraping CAPTCHAs from a university web page and recognizing them

Implemented by following this blog post: https://blog.csdn.net/qq_40962368/article/details/89331608

Code: https://github.com/Chakid/ImageProcess

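Websites today put many obstacles in the login flow to keep people from easily obtaining the pages behind it, and CAPTCHAs are one of them. Every defense eventually meets a stronger countermeasure, so people can always find a way around, but the cost of doing so keeps rising, which to some extent raises the bar for anti-anti-scraping. The goal of this post is to measure how accurately Tesseract recognizes ordinary CAPTCHA images, in preparation for later work.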

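Approach:
(1) Collect a batch of CAPTCHA images (from the login page of a university);
(2) Label the CAPTCHA images (I would rather not label them by hand, but it is necessary, because the labels must be 100% correct);
(3) Run Tesseract-OCR on the CAPTCHA images and measure the recognition results;
(4) Ideas for follow-up work on improving recognition accuracy.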

Scraping the CAPTCHAs from the university page:

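Label each CAPTCHA image by hand, putting the text shown in the image into the image's file name so that it can be used for testing later.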
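A minimal sketch of how a file name can carry its label. The helper names `labeled_name` and `label_from_name` and the `_<label>.png` suffix layout are assumptions for illustration. The only thing the recognition script later in this post relies on is that the four characters before the extension are the label.

```python
import os

# Assumed from the slicing used in the recognition script: labels have four characters
LABEL_LENGTH = 4


def labeled_name(original_name, label):
    """Build a file name that ends with the label, e.g. verify_1_ab3d.png."""
    if len(label) != LABEL_LENGTH:
        raise ValueError('expected a %d-character label' % LABEL_LENGTH)
    stem, ext = os.path.splitext(original_name)
    return stem + '_' + label.lower() + ext


def label_from_name(file_name):
    """Read the label back: the last LABEL_LENGTH characters before the extension."""
    stem, _ = os.path.splitext(os.path.basename(file_name))
    return stem[-LABEL_LENGTH:]


print(labeled_name('verify_1.png', 'AB3d'))
print(label_from_name('img/hb/verify_1_ab3d.png'))
```

Lowercasing the label at naming time keeps it comparable with OCR output that has also been lowercased.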

Running the CAPTCHA recognition code:

Scraping code:
Request the URL repeatedly through a proxy to fetch CAPTCHA images, and save each one as a PNG file.

# coding:utf-8
# Download CAPTCHA images
import os
import random
import time
from urllib import request

SAVE_DIR = 'img/hb/'


def get_and_save_verify(i):
    try:
        url = 'http://jwxt.qlu.edu.cn/verifycode.servlet'
        request.urlretrieve(url, SAVE_DIR + 'verify_' + str(i) + '.png')
        print('Image ' + str(i) + ' downloaded successfully')
    except Exception:
        print('Image ' + str(i) + ' failed to download')


def get_proxy():
    # Steps for using a proxy
    # - 1. Define the proxy addresses
    proxys = [{'http': '39.137.69.10:8080'},
              {'http': '111.206.6.101:80'},
              {'http': '120.210.219.101:8080'},
              {'https': '120.237.156.43:8088'}]
    # - 2. Create a ProxyHandler from a randomly chosen proxy
    proxy = random.choice(proxys)
    proxy_handler = request.ProxyHandler(proxy)
    # - 3. Build an opener
    opener = request.build_opener(proxy_handler)
    # - 4. Install the opener so that later requests go through it
    request.install_opener(opener)


if __name__ == '__main__':
    # Make sure the output directory exists before saving into it
    os.makedirs(SAVE_DIR, exist_ok=True)
    for i in range(1, 101):
        get_proxy()
        # Wait a random 1 to 4 seconds between requests
        time.sleep(random.randint(1, 4))
        get_and_save_verify(i)

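CAPTCHA recognition code:
Use Tesseract-OCR to read the text in each image, then compare the recognized text with the label stored in the image's file name to measure recognition accuracy.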
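The comparison step can be sketched on its own, separate from the OCR call. The function name `accuracy` and the sample pairs below are made up for illustration and are not results from this post.

```python
def accuracy(pairs):
    """Return the fraction of (predicted, expected) pairs that match exactly."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = sum(1 for predicted, expected in pairs if predicted == expected)
    return correct / len(pairs)


# Made-up example values: two of the four predictions match their labels
sample = [('ab3d', 'ab3d'), ('x7k', 'x7kq'), ('m2n8', 'm2n8'), ('', 'p4rt')]
print(accuracy(sample))
```

Dividing by the number of pairs actually compared, rather than by a fixed count, keeps the figure correct when the number of labeled images changes.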

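Noise reduction:
Denoise the images with Gaussian filtering, median filtering, and bilateral filtering in turn, adjusting the parameters repeatedly to find the best parameters for each method.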
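The tuning loop described above might look roughly like the sketch below. `best_parameter` is a hypothetical helper, and `evaluate` is an assumed callback that would, for a given parameter, filter every image (for example with `cv2.bilateralFilter`), run the OCR, and return the accuracy. A stand-in scoring function is used here so that the sketch runs without images.

```python
def best_parameter(candidates, evaluate):
    """Return (parameter, score) for the candidate with the highest score."""
    best_param, best_score = None, float('-inf')
    for param in candidates:
        score = evaluate(param)
        if score > best_score:
            best_param, best_score = param, score
    return best_param, best_score


def fake_evaluate(sigma):
    # Stand-in for "filter + OCR + accuracy"; the values are NOT real measurements
    return 1.0 - abs(sigma - 300) / 1000.0


print(best_parameter(range(100, 601, 100), fake_evaluate))
```

The same helper can be reused for each of the three filters by passing a different `evaluate` callback.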

# Post-process the OCR result (a is the raw string returned by Tesseract)
st = re.sub(r'[^A-Za-z0-9]+', '', a)
st = st.lower()
if len(st) > 4:
    b = st[-4:]
else:
    b = st
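The same post-processing can be wrapped in a function so it is easy to try on sample strings. The name `clean_ocr_text` and the `length` parameter are assumptions for illustration.

```python
import re


def clean_ocr_text(raw, length=4):
    """Keep only ASCII letters and digits, lowercase them,
    and keep at most the last `length` characters."""
    st = re.sub(r'[^A-Za-z0-9]+', '', raw).lower()
    return st[-length:] if len(st) > length else st


print(clean_ocr_text(' A b-3D\n'))  # spaces, punctuation and newline are dropped
print(clean_ocr_text('xxAB3D'))     # only the last four characters are kept
print(clean_ocr_text('a1'))         # shorter results are returned unchanged
```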

The complete recognition code:

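# coding:utf-8
# Recognize the CAPTCHAs
import os
import re

import cv2
import numpy as np
import pytesseract


path = 'img/hb/'

# Collect the labeled PNG files in the image directory
file_name = [f for f in os.listdir(path) if f.lower().endswith('.png')]

print('predicted' + '-----' + 'actual')
num = 0
for i in file_name:
    # Read the file as bytes and decode it as a 3-channel color image
    img = cv2.imdecode(np.fromfile(path + i, dtype=np.uint8), 1)

    # Preprocessing: denoise the image
    # blur = cv2.GaussianBlur(img, (3, 3), 0)  # Gaussian filter
    # blur = cv2.medianBlur(img, 3)  # median filter
    blur = cv2.bilateralFilter(img, 3, 560, 560)  # bilateral filter 560: 0.28

    a = pytesseract.image_to_string(blur)

    # Post-process the OCR result
    st = re.sub(r'[^A-Za-z0-9]+', '', a)
    st = st.lower()
    if len(st) > 4:
        b = st[-4:]
    else:
        b = st

    # The label is the four characters before the '.png' extension;
    # lowercase it because the prediction was lowercased too
    true_value = i[-8:-4].lower()
    print(b + '-----' + true_value)
    if b == true_value:
        num += 1

# Divide by the number of images actually evaluated
total = len(file_name)
print('Recognition accuracy: ' + str(num / total if total else 0.0))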
