Can a crawler be detected by the checks described below?
How a crawler can pass the detection at https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html
1. Crawler code
For headless Chrome configuration, basic installation, and usage, see:
http://www.voidcn.com/article/p-hwlrznzi-bpz.html
https://blog.csdn.net/xc_zhou/article/details/80823855
The crawler script:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.wait import WebDriverWait

chrome_options = Options()
# Before starting Chromedriver, set Chrome's experimental option excludeSwitches
# to ['enable-automation'] to counter WebDriver detection.
chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])
chrome_options.add_argument('--headless')
chrome_options.add_argument('--proxy-server=http://127.0.0.1:8080')  # send traffic through the mitmproxy instance
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')  # turn off sandbox mode
chrome_options.add_argument('--disable-setuid-sandbox')
# chrome_options.add_argument('--single-process')  # run in a single process
# chrome_options.add_argument('--process-per-tab')  # one process per tab
# chrome_options.add_argument('--process-per-site')  # one process per site
# chrome_options.add_argument('--in-process-plugins')  # do not run plugins in separate processes
# chrome_options.add_argument('--disable-popup-blocking')  # disable popup blocking
chrome_options.add_argument('--disable-images')  # disable images
chrome_options.add_argument('--blink-settings=imagesEnabled=false')
chrome_options.add_argument('--incognito')  # start in incognito mode
chrome_options.add_argument('--lang=zh-CN')  # set the language to Simplified Chinese
chrome_options.add_argument(
    '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36')
chrome_options.add_argument('--hide-scrollbars')
chrome_options.add_argument('--disable-bundled-ppapi-flash')
chrome_options.add_argument('--mute-audio')
# chrome_options.add_extension(r'C:\hdmbdioamgdkppmocchpkjhbpfmpjiei-3.0.1-Crx4Chrome.com.crx')  # add an extension
# chrome_options.add_argument('--disable-extensions')  # disable extensions
# chrome_options.add_argument('--disable-plugins')

# Selenium 3 style call; Selenium 4 passes the driver path through a Service object instead.
# The raw string keeps the backslash in the Windows path literal.
DRIVER = webdriver.Chrome(executable_path=r"C:\chromedriver.exe",
                          options=chrome_options)
DRIVER.get("http://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html")
# WebDriverWait only blocks when .until() is called, so wait (up to 10 seconds)
# for the document to finish loading before reading the source.
WebDriverWait(DRIVER, 10).until(
    lambda d: d.execute_script("return document.readyState") == "complete")
page_source = DRIVER.page_source
DRIVER.quit()
print(page_source)
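Reading the printed page_source by eye is tedious, so the test results can be pulled out of the HTML instead. The following is only a sketch: it assumes the results sit in an ordinary HTML table with one test per row, which should be checked against the real page, and the names ResultTableParser, extract_rows, and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class ResultTableParser(HTMLParser):
    """Collect each table row as a list of cell strings (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return every table row found in the given HTML string."""
    parser = ResultTableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Assumed sample markup standing in for page_source; the real page may differ.
sample = """
<table>
  <tr><th>Test Name</th><th>Result</th></tr>
  <tr><td>WebDriver</td><td>missing (passed)</td></tr>
  <tr><td>Plugins Length</td><td>6</td></tr>
</table>
"""

for name, result in extract_rows(sample):
    print(f"{name}: {result}")
```

In the crawler, extract_rows(page_source) would replace the sample string, provided the table assumption holds.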
2. mitmproxy script code
import mitmproxy.http


class JsCheckPass:
    def response(self, flow: mitmproxy.http.HTTPFlow):
        # JavaScript overrides prepended to the response so they run before the page's own checks.
        t = 'window.chrome = true;'
        t0 = 'Object.defineProperties(navigator,{webdriver:{get:() => false}});'
        t1 = 'window.navigator.chrome = {runtime: {}};'
        t2 = '''
        Object.defineProperty(navigator, 'plugins', {
            get: () => [1, 2, 3, 4, 5, 6],
        });
        '''
        # Only touch responses whose URL contains 'chrome-headless-test' or 'um.js'.
        if 'chrome-headless-test' in flow.request.url or 'um.js' in flow.request.url:
            flow.response.text = t + t0 + t1 + t2 + flow.response.text
            # Rewrite the permissions comparison so it can never match 'prompt'.
            flow.response.text = flow.response.text.replace(
                "permissionStatus.state === 'prompt'",
                "permissionStatus.state === 'promptzzzzzzzzz'")


addons = [
    JsCheckPass(),
]
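The rewriting itself is plain string manipulation, so it can be tried out without starting the proxy or the browser. The sketch below pulls that logic into a standalone function; should_patch, patch_body, and the example.com URLs are hypothetical names used only for illustration, and the override strings mirror the ones in the addon above.

```python
# JavaScript overrides that should run before the page's own detection code.
OVERRIDES = (
    "window.chrome = true;"
    "Object.defineProperties(navigator, {webdriver: {get: () => false}});"
    "window.navigator.chrome = {runtime: {}};"
    "Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3, 4, 5, 6]});"
)

PERMISSION_CHECK = "permissionStatus.state === 'prompt'"
PERMISSION_PATCH = "permissionStatus.state === 'promptzzzzzzzzz'"


def should_patch(url):
    """Same URL filter as the addon: only the test resources and um.js."""
    return "chrome-headless-test" in url or "um.js" in url


def patch_body(url, body):
    """Return the body with overrides prepended and the permission check defused."""
    if not should_patch(url):
        return body
    return (OVERRIDES + body).replace(PERMISSION_CHECK, PERMISSION_PATCH)


# Made-up script body and URL, just to exercise the function.
js = "if (permissionStatus.state === 'prompt') { markFailed(); }"
print(patch_body("https://example.com/chrome-headless-test.js", js))
```

Inside the addon, the response hook could then reduce to a single assignment such as flow.response.text = patch_body(flow.request.url, flow.response.text), which keeps the proxy-specific code small and the rewriting easy to check offline.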
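3. Running
From cmd, enter the virtual environment and run mitmweb -s addons.py to start the proxy. Then run the headless browser with that proxy configured, and requests and responses can be modified on the way through.
Whether a visitor is a crawler is usually decided in JavaScript: the request is sent, the server returns a JS file, and whatever that file concludes after it runs in the browser is the final verdict. Once you find that JS file and modify its code, you can change the result the page ends up showing.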
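Finding the JS file that performs the check is the manual part. One rough way to narrow it down is to scan captured response bodies for the browser properties that detection scripts tend to read. This is only a sketch: the hint list is a guess based on the properties overridden in section 2 (plus navigator.languages), and detection_hints, rank_scripts, and the sample URLs are hypothetical.

```python
# Substrings that suggest a script is probing for automation (assumed list).
DETECTION_HINTS = (
    "navigator.webdriver",
    "navigator.plugins",
    "navigator.languages",
    "permissions.query",
    "window.chrome",
)


def detection_hints(body):
    """Return the hints that occur in one response body."""
    return [hint for hint in DETECTION_HINTS if hint in body]


def rank_scripts(responses):
    """Rank URLs by how many hints their bodies contain, most suspicious first.

    `responses` maps URL -> response text; URLs without any hint are dropped.
    """
    hits = [(url, detection_hints(body)) for url, body in responses.items()]
    hits = [(url, hints) for url, hints in hits if hints]
    return sorted(hits, key=lambda item: len(item[1]), reverse=True)


# Made-up captured traffic for illustration.
captured = {
    "https://example.com/jquery.js": "function $(sel) { return document.querySelector(sel); }",
    "https://example.com/check.js": "if (navigator.webdriver || navigator.plugins.length === 0) { flag(); }",
    "https://example.com/track.js": "var isChrome = !!window.chrome;",
}

for url, hints in rank_scripts(captured):
    print(url, hints)
```

The same detection_hints call could be made from a mitmproxy response hook to print candidate URLs while browsing the target site, before deciding which responses to rewrite.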
Reference tutorials:
https://www.cnblogs.com/yangjintao/p/10599868.html
https://blog.csdn.net/freeking101/article/details/83901842
https://blog.csdn.net/Chen_chong__/article/details/85526088
https://blog.wolfogre.com/posts/usage-of-mitmproxy/
https://www.jianshu.com/p/0eb46f21fee9
These tutorials are enough to cover basic usage.
There is also another way to disguise the browser; see:
https://ask.csdn.net/questions/382674
https://blog.csdn.net/sinly100/article/details/79184559