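As an undergraduate I always meant to learn web scraping with Python but never got around to it. Unexpectedly, it was assigned to me in my very first week in the lab as a graduate student. This week I mainly studied three topics: getting past User-Agent blacklists, JavaScript obfuscation, and CAPTCHA handling. CAPTCHAs took the most time by far; the problems I ran into and the code I wrote are below.
1. Text-input CAPTCHAs
The user reads an image and types in the digits and letters it shows. This type of CAPTCHA appeared early and is still common, and it is also among the easier ones for a Python scraper to handle.
The solution is the third-party Python library tesserocr, a wrapper around the Tesseract OCR engine. The code is as follows: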
from PIL import Image
import tesserocr
image = Image.open('./1.png')
result = tesserocr.image_to_text(image)
print(result)
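The code is simple, but its accuracy is limited: when the image background contains many lines, for example, recognition is poor. The usual remedy is to convert the image to grayscale and then binarize it, which improves the recognition rate. The code is as follows: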
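image = Image.open('./1.png')
image.show()
# Convert to grayscale ('L' mode)
image = image.convert('L')
threshold = 127
# Lookup table: gray levels below the threshold map to 0 (black), the rest to 1 (white)
table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)
# Apply the table and convert to a 1-bit image
image = image.point(table, '1')
image.show()
result = tesserocr.image_to_text(image)
print(result)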
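This approach has its own limitation: after the grayscale conversion, if the background texture and the characters both have gray values above 127, or both below 127 (that is, their brightness is similar), the threshold cannot separate them and accuracy drops sharply. Since deep learning is popular right now, I think training a model for this would give a much higher recognition rate.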
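To make the limitation concrete, here is a minimal pure-Python sketch of the same lookup-table idea applied to a handful of gray levels instead of a real image. The helper names `build_table` and `binarize` and the sample values (characters at 150, background at 200) are made up for illustration.

```python
def build_table(threshold):
    """Return a 256-entry lookup table: 0 below the threshold, 1 otherwise."""
    return [0 if level < threshold else 1 for level in range(256)]


def binarize(gray_levels, threshold):
    """Map each gray level (0-255) through the lookup table."""
    table = build_table(threshold)
    return [table[level] for level in gray_levels]


# Hypothetical gray levels: characters at 150, background at 200.
pixels = [150, 200, 150, 200]

print(binarize(pixels, 127))  # [1, 1, 1, 1]: both classes land on the same side
print(binarize(pixels, 175))  # [0, 1, 0, 1]: a threshold between them separates the two
```

With the fixed threshold of 127 the characters and the background collapse into the same value, which is exactly the case where OCR has nothing left to read.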
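2. Slider CAPTCHAs
The most typical slider CAPTCHA is the one on the Bilibili login page.
The approach is to save three images: the complete image, the image with the gap, and the slider piece. First locate the gap in the image, then compute the sliding distance and trajectory, and finally simulate the drag with selenium. The code is as follows: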
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver import ActionChains
from selenium.webdriver.common.keys import Keys
import time
import random
from PIL import Image
web='http://literallycanvas.com/'
# Initialization
def init():
    # Shared state used by the functions below
    global url, browser, username, password, wait
    url = 'https://passport.bilibili.com/login'
    browser = webdriver.Chrome()
    username = '************'
    password = '************'
    wait = WebDriverWait(browser, 20)
# Log in
def login():
    browser.get(url)
    user = wait.until(EC.presence_of_element_located((By.ID, 'login-username')))
    passwd = wait.until(EC.presence_of_element_located((By.ID, 'login-passwd')))
    user.send_keys(username)
    passwd.send_keys(password)
    # Alternative: submit by pressing Enter, as a user might
    # passwd.send_keys(Keys.ENTER)
    login_btn = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'a.btn.btn-login')))
    # Wait a random 0-3 seconds before clicking
    time.sleep(random.random() * 3)
    login_btn.click()
# Toggle an element's visibility so it can be captured in a screenshot
def show_element(element):
    browser.execute_script("arguments[0].style = arguments[1]", element, "display: block;")
def hide_element(element):
    browser.execute_script("arguments[0].style = arguments[1]", element, "display: none;")
# Take a screenshot and crop it to the given element
def save_pic(obj, name):
    try:
        browser.save_screenshot('.\\bilibili.png')
        # Get the element's position and size
        left = obj.location['x']
        top = obj.location['y']
        right = left + obj.size['width']
        bottom = top + obj.size['height']
        im = Image.open('.\\bilibili.png')
        im = im.crop((left, top, right, bottom))
        # The underscore matches the names slide() opens later (bili_back.png, bili_full.png)
        file_name = 'bili_' + name + '.png'
        im.save(file_name)
    except Exception as msg:
        print("Screenshot failed: %s" % msg)
def cut():
    c_background = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'canvas.geetest_canvas_bg.geetest_absolute')))
    c_slice = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'canvas.geetest_canvas_slice.geetest_absolute')))
    c_full_bg = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'canvas.geetest_canvas_fullbg.geetest_fade.geetest_absolute')))
    # Hide the slider piece so the background with the gap is captured on its own
    hide_element(c_slice)
    save_pic(c_background, 'back')
    show_element(c_slice)
    save_pic(c_slice, 'slice')
    show_element(c_full_bg)
    save_pic(c_full_bg, 'full')
# Check whether two pixels are (nearly) the same
def is_pixel_equal(bg_image, fullbg_image, x, y):
    # bg_image is the image with the gap
    # fullbg_image is the complete image
    bg_pixel = bg_image.load()[x, y]
    fullbg_pixel = fullbg_image.load()[x, y]
    threshold = 60
    # Pixels match when every RGB channel differs by less than the threshold.
    # abs() must wrap only the difference, not the whole comparison.
    return all(abs(bg_pixel[k] - fullbg_pixel[k]) < threshold for k in range(3))
# Compute how far the slider has to move
def get_distance(bg_image, fullbg_image):
    # Start scanning at x = 57, which presumably skips the slider piece at the left edge
    distance = 57
    for i in range(distance, fullbg_image.size[0]):
        for j in range(fullbg_image.size[1]):
            if not is_pixel_equal(bg_image, fullbg_image, i, j):
                return i
    # No differing pixel was found
    return None
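# Build the sliding trajectory
def get_trace(distance):
    # distance is the distance from the slider to the gap
    trace = []
    # Accelerate over the first 4/5 of the distance, then decelerate
    faster_distance = distance * (4 / 5)
    start, v0, t = 0, 0, 0.2
    while start < distance:
        if start < faster_distance:
            a = 1.5
        else:
            a = -3
        # Displacement in this time step: s = v0*t + a*t^2/2
        move = v0 * t + 1 / 2 * a * t * t
        v = v0 + a * t
        v0 = v
        start += move
        # Each step is rounded to whole pixels, so the steps may not sum exactly to distance
        trace.append(round(move))
    return trace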
# Simulate the drag
def move_to_gap(trace):
    slider = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'div.geetest_slider_button')))
    # click_and_hold() presses and holds the slider; perform() executes the action
    ActionChains(browser).click_and_hold(slider).perform()
    for x in trace:
        # move_by_offset() drags the slider by x pixels; perform() executes the action
        ActionChains(browser).move_by_offset(xoffset=x, yoffset=0).perform()
    # Pause briefly before releasing the mouse button
    time.sleep(0.5)
    ActionChains(browser).release().perform()
def slide():
    distance = get_distance(Image.open('.\\bili_back.png'), Image.open('.\\bili_full.png'))
    # Subtract a 5-pixel correction before building the trajectory
    trace = get_trace(distance - 5)
    move_to_gap(trace)
    time.sleep(3)
init()
login()
cut()
slide()
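The gap-detection step can be tried without a browser or image files. The sketch below applies the same column-by-column comparison to two tiny synthetic images stored as nested lists of RGB tuples. The name `find_gap_x`, the `start_x` parameter, and the sample pixel values are illustrative assumptions, not part of the script above.

```python
def is_pixel_equal(p, q, threshold=60):
    """True when every RGB channel differs by less than the threshold."""
    return all(abs(a - b) < threshold for a, b in zip(p, q))


def find_gap_x(gap_img, full_img, start_x=0):
    """Return the first column, scanning left to right, that holds a differing pixel.

    Images are lists of rows; each row is a list of (r, g, b) tuples.
    Returns None when the two images match everywhere.
    """
    height, width = len(full_img), len(full_img[0])
    for x in range(start_x, width):
        for y in range(height):
            if not is_pixel_equal(gap_img[y][x], full_img[y][x]):
                return x
    return None


# Synthetic 3x6 example: a uniform gray image, and a copy with a dark "gap" in columns 3-4.
full = [[(200, 200, 200)] * 6 for _ in range(3)]
gap = [row[:] for row in full]
for y in (1, 2):
    for x in (3, 4):
        gap[y][x] = (40, 40, 40)

print(find_gap_x(gap, full))   # 3
print(find_gap_x(full, full))  # None
```

Returning None for identical images makes the "no gap found" case explicit, so a caller can retry the screenshot instead of failing on arithmetic with a missing distance.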
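3. Click CAPTCHAs
The most common click CAPTCHAs are those on 12306 and Jianshu; here I use Jianshu as the example. The approach is: capture the click-CAPTCHA image, send it to a third-party recognition service, read the coordinates the service returns, and simulate the user's clicks with selenium. The third-party service I used is Chaojiying (超級鷹). It is a paid service, but after registering and following its official account you get a free trial quota, which is enough for testing. The code has two parts: the Chaojiying API client, and the sequence of operations described above. The code is as follows: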
import requests
from hashlib import md5
class Chaojiying(object):
    def __init__(self, username, password, soft_id):
        self.username = username
        # Send the MD5 hex digest rather than the plain-text password
        self.password = md5(password.encode('utf-8')).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }
    def post_pic(self, im, codetype):
        """
        im: image bytes
        codetype: CAPTCHA type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
                          headers=self.headers)
        return r.json()
    # Call this when verification fails; the service then does not deduct points for that recognition
    def report_error(self, im_id):
        """
        im_id: image ID of the CAPTCHA being reported
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()
import time
from PIL import Image
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
def crack():
    # Save a screenshot of the page
    browser.save_screenshot('222.jpg')
    # Locate the CAPTCHA confirm button
    button = browser.find_element(By.XPATH, '//div[@class="geetest_panel"]/a/div')
    # Get the position of the CAPTCHA image
    img1 = browser.find_element(By.XPATH, '//div[@class="geetest_widget"]')
    location = img1.location
    size = img1.size
    top, bottom = location['y'], location['y'] + size['height']
    left, right = location['x'], location['x'] + size['width']
    print('Image width:', size['width'])
    print(top, bottom, left, right)
    # Crop the CAPTCHA out of the page screenshot and save it; the bottom 54 pixels are cut off
    img_1 = Image.open('222.jpg')
    capcha1 = img_1.crop((left, top, right, bottom - 54))
    capcha1.save('tu1-1.png')
    # Send the image to the Chaojiying API (the response is a dict)
    cjy = Chaojiying('*********', '************', '900751')
    with open('tu1-1.png', 'rb') as f:
        im = f.read()
    content = cjy.post_pic(im, 9004)
    print(content)
    # Extract the coordinates of the Chinese characters in the image
    positions = content.get('pic_str').split('|')
    locations = [[int(number) for number in group.split(',')] for group in positions]
    print(positions)
    print(locations)
    # Click each returned position on the CAPTCHA image.
    # Note: newer Selenium 4 releases measure this offset from the element's center
    # rather than its top-left corner, so the coordinates may need adjusting there.
    for location1 in locations:
        print(location1)
        ActionChains(browser).move_to_element_with_offset(img1, location1[0], location1[1]).click().perform()
        time.sleep(1)
    button.click()
    time.sleep(1)
    # Check the result and retry on failure
    lower = browser.find_element(By.XPATH, '//div[@class="geetest_table_box"]/div[2]').text
    print('Result:', lower)
    if lower != '驗證失敗 請按提示重新操作' and lower is not None:
        print('Login succeeded')
        time.sleep(3)
    else:
        time.sleep(3)
        print('Login failed')
        # Report the failed recognition so the service does not deduct points for it
        pic_id = content.get('pic_id')
        print('Image id:', pic_id)
        cjy.report_error(pic_id)
        crack()
if __name__ == '__main__':
    # Assumes chromedriver can be found on PATH
    browser = webdriver.Chrome()
    browser.get('https://www.jianshu.com/sign_in')
    browser.save_screenshot('lodin.png')
    # Fill in the form and click "sign in" to bring up the CAPTCHA
    login = browser.find_element(By.ID, 'sign-in-form-submit-btn')
    username = browser.find_element(By.ID, 'session_email_or_mobile_number')
    password = browser.find_element(By.ID, 'session_password')
    username.send_keys('***********')
    password.send_keys('***********')
    login.click()
    time.sleep(5)
    crack()
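The coordinate-parsing step in crack() can be pulled out into a small pure function, which makes it easy to check without calling the paid service. This is only a sketch: the 'x1,y1|x2,y2' format is inferred from how the code above splits pic_str, and the function name and the sample string are made up.

```python
def parse_positions(pic_str):
    """Parse 'x1,y1|x2,y2|...' into a list of (x, y) integer tuples."""
    points = []
    for group in pic_str.split('|'):
        x, y = group.split(',')
        points.append((int(x), int(y)))
    return points


# Hypothetical response value; real coordinates come from the recognition service.
sample = '120,85|230,140|61,172'
print(parse_positions(sample))  # [(120, 85), (230, 140), (61, 172)]
```

Each tuple is an offset inside the cropped CAPTCHA image, which is why the clicks are issued relative to the CAPTCHA element rather than the page.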
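4. Grid CAPTCHAs
Grid CAPTCHAs here mainly means the four-square grid CAPTCHA that Weibo once used; it has since been retired. I still looked into one way of cracking it: enumeration. A four-square grid has only 24 possible orderings, so you first use code to collect every case, then manually label each image with its sliding order, and finally simulate the drag with selenium.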
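The figure of 24 is simply the number of ways to visit four cells once each (4! = 24). A minimal sketch of the enumeration, assuming the cells are labelled 1 to 4 (the labels and the function name are chosen here for illustration):

```python
from itertools import permutations

# The four grid cells, numbered 1-4 (labels chosen here for illustration).
CELLS = (1, 2, 3, 4)


def all_orders(cells=CELLS):
    """Return every order in which the cells can be visited exactly once."""
    return list(permutations(cells))


orders = all_orders()
print(len(orders))  # 24
print(orders[0])    # (1, 2, 3, 4)
```

Each tuple could then serve as the label attached by hand to the matching CAPTCHA image.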
宮格驗證碼主要是指微博曾經使用的四宮格驗證碼,但是現在該驗證碼已經取消了。筆者仍然瞭解了一些這種驗證碼的破解辦法——枚舉。因爲四宮格只用24種可能,先用代碼獲取所有的情況後,再手動輸入對應每張圖片的滑動順尋,最後再用selenium模擬。