The problem of slow Selenium

Original

2022-11-22 14:09

# -*- coding: utf-8 -*-
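r'''
createTime : 2022-08-04 10:22
@software: : spiderSystem
author :
@File : spider_0_douyin_user_list.py
Copyright: shannanai

Selenium tends to be slow, mainly for three reasons:
1. It loads images, CSS files, and other resources the scraper does not need.
2. driver.get(url) blocks until the page has loaded, which is slow; changing the page load strategy shortens the wait.
3. Only one page is opened at a time.

How can downloads be run concurrently?
1. Use multiple processes, for example 20.
2. Each process opens one browser, and each browser opens 50 URLs at once, so 20 * 50 = 1000 pages are in flight at the same time. Even allowing for pages that fail to load, this is very fast.
3. This approach is mainly for sites whose JavaScript is hard to reverse-engineer, where driving a real browser is the practical way to download data quickly.

Intercepting network requests
Requests are intercepted in order to:
1. Skip unnecessary requests such as images, CSS, and JS, which saves time.
2. Block anti-scraping requests, or modify the request data or the response data, so that the scraper is not detected.

Two ways to intercept:
1. Block requests by domain in the hosts file.
On Windows, edit C:\Windows\System32\drivers\etc\hosts and point the domain at a different IP address.
2. Use Fiddler to intercept requests or modify data.
In Fiddler, the bpu command sets a breakpoint on matching requests.

What matters is the approach. The code below only illustrates the idea and is not usable as-is.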

'''

import ast
import multiprocessing
import time

import redis
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Project-local helpers; crawl_data() and time_sleep() are assumed to come from here.
from common.contest import *

r2 = redis.Redis(host='127.0.0.1', port=6379, db=2)


def spider(item):
    """Open a batch of pages in one browser, then poll the tabs for results.

    `item` is only a task index; the pages to open come from the Redis queue.
    """
    options = Options()
    # options.add_argument("headless")  # headless mode
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)

    # Disable GPU acceleration
    options.add_argument("disable-gpu")

    # Fix slow loading: with 'none', driver.get() returns without waiting for the page to load
    options.page_load_strategy = 'none'

    # Disable image loading (option 1)
    # options.add_argument('blink-settings=imagesEnabled=false')

    # Disable image loading (option 2)
    # No_Image_loading = {"profile.managed_default_content_settings.images": 2}
    # options.add_experimental_option("prefs", No_Image_loading)

    # prefs = {
    #     'profile.default_content_setting_values': {
    #         'images': 2,
    #         # 'permissions.default.stylesheet': 2,
    #         # 'javascript': 2
    #     }
    # }
    # options.add_experimental_option('prefs', prefs)

    # Reportedly prevents the Douyin slider captcha from appearing
    # options.add_argument('--disable-blink-features=AutomationControlled')
    drivers = webdriver.Chrome(options=options)
    # Maximize the browser window
    drivers.maximize_window()
    # Hide navigator.webdriver from driver-detection scripts
    drivers.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
            Object.defineProperty(navigator, 'webdriver', {
                get: () => false
            })
        """
    })

    # Open 20 pages per round; adjust the count to your needs (50 or 100 also work)
    for page in range(20):
        raw = r2.lpop("douyin_user_data_queue")
        if raw is None:
            break  # the queue is empty
        # Queue entries are assumed to be Python dict literals;
        # literal_eval parses them without executing arbitrary code, unlike eval
        data_dict = ast.literal_eval(raw.decode("utf-8"))
        # Open the page in the browser
        crawl_data(data_dict, drivers)
        time_sleep(0.5)

    time_sleep(60)

    # Check the tabs; pages that loaded successfully can be closed right away to save resources
    for pages in range(4):
        print("pages", pages)
        all_handles = drivers.window_handles
        print("all_handles", all_handles)

        if len(all_handles) > 0:
            print(len(all_handles))

            for handle in all_handles:  # iterate over every window handle
                drivers.switch_to.window(handle)  # switch to that tab
                html = drivers.page_source

                if 'RENDER_DATA' in html:
                    print("RENDER_DATA found")
                    # Store the data in Redis here
                else:
                    print("RENDER_DATA not found yet")

                # Close the login pop-up if it is present
                try:
                    if 'dy-account-close' in html:
                        drivers.find_element(By.CLASS_NAME, "dy-account-close").click()
                except Exception as e:
                    print(e)

                if 'ECMy_Zdt' in html or 'Eie04v01' in html:
                    print("Data downloaded successfully", html.count('ECMy_Zdt'))
                    drivers.close()
                elif '驗證碼中間頁' in html:  # text shown on the captcha interstitial page
                    print("Hit a captcha page, reopening the URL...")
                    current_url = drivers.current_url
                    print(current_url)
                    # Note: open the new tab first, then close the tab showing the captcha.
                    # Passing the URL as an argument avoids quoting problems in the script string.
                    drivers.execute_script("window.open(arguments[0])", current_url)
                    drivers.close()
                    print("Captcha tab closed")

        time_sleep(30)

    drivers.quit()
    print("Browser closed; a new browser will be opened shortly to continue downloading")


if __name__ == "__main__":
    starttime1 = time.time()
    result_list = list(range(10000))

    download_choice = 2  # 1: process pool; anything else: run sequentially
    if download_choice == 1:
        pool = multiprocessing.Pool(processes=50)
        results = []
        for item in result_list:
            results.append(pool.apply_async(spider, args=(item,)))
        pool.close()
        pool.join()
    else:
        for index, item in enumerate(result_list):
            print("Currently downloading item:", index)
            spider(item)
    print("============", time.time() - starttime1)
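The "20 processes, 50 tabs each" arithmetic from the docstring can be sketched without Selenium. The helper below is a minimal illustration, not part of the original script: `plan_batches` and the example.com URLs are hypothetical names introduced here, and each inner list stands for the tabs one browser process would open in a single round.

```python
from typing import List, Sequence


def plan_batches(urls: Sequence[str], processes: int, tabs_per_browser: int) -> List[List[str]]:
    """Split urls into per-browser batches for one round of downloads.

    At most processes * tabs_per_browser URLs are scheduled per round;
    each inner list is the set of tabs one browser process would open.
    """
    if processes <= 0 or tabs_per_browser <= 0:
        raise ValueError("processes and tabs_per_browser must be positive")
    capacity = processes * tabs_per_browser
    round_urls = list(urls[:capacity])
    return [
        round_urls[start:start + tabs_per_browser]
        for start in range(0, len(round_urls), tabs_per_browser)
    ]


# Hypothetical URLs, used only to show the arithmetic.
urls = [f"https://example.com/user/{i}" for i in range(1200)]
batches = plan_batches(urls, processes=20, tabs_per_browser=50)
print(len(batches), sum(len(b) for b in batches))  # prints "20 1000"
```

Each batch could then be handed to one worker with `pool.apply_async`, so the pool size matches the number of browsers rather than the number of URLs.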

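For the hosts-file approach mentioned in the docstring, the lines to add can be generated rather than typed by hand. This is a sketch under assumptions: `hosts_entries` is a hypothetical helper, the domains are placeholders, and 0.0.0.0 is chosen here as the "different IP address" so that blocked requests fail quickly.

```python
from typing import Iterable, List

SINKHOLE_IP = "0.0.0.0"  # assumption: a non-routable address, so blocked requests fail fast


def hosts_entries(domains: Iterable[str], ip: str = SINKHOLE_IP) -> List[str]:
    """Return one hosts-file line per unique domain, in first-seen order."""
    seen = set()
    lines = []
    for domain in domains:
        name = domain.strip().lower()
        if not name or name in seen:
            continue  # skip blanks and duplicates (hostnames are case-insensitive)
        seen.add(name)
        lines.append(f"{ip} {name}")
    return lines


# Hypothetical domains; replace them with the hosts you actually want to block.
blocked = hosts_entries(["img.example.com", "static.example.com", "IMG.example.com"])
for line in blocked:
    print(line)
```

The hosts file blocks whole domains, not resource types, so this only helps when the unwanted resources are served from their own hosts. Appending the lines to the hosts file requires administrator rights.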