Can a crawler be detected by the following checks?
How a crawler can pass the checks at https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html
1. Crawler code
For Chrome headless configuration, basic installation, and usage, see:
http://www.voidcn.com/article/p-hwlrznzi-bpz.html
https://blog.csdn.net/xc_zhou/article/details/80823855
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.wait import WebDriverWait
chrome_options = Options()
# Before starting Chromedriver, set Chrome's experimental option excludeSwitches to ['enable-automation'], which helps against WebDriver detection
chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])
chrome_options.add_argument('--headless')
chrome_options.add_argument('--proxy-server=http://127.0.0.1:8080')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox') # disable the sandbox
chrome_options.add_argument('--disable-setuid-sandbox')
# chrome_options.add_argument('--single-process') # run in a single process
# chrome_options.add_argument('--process-per-tab') # one process per tab
# chrome_options.add_argument('--process-per-site') # one process per site
# chrome_options.add_argument('--in-process-plugins') # do not run plugins in a separate process
# chrome_options.add_argument('--disable-popup-blocking') # disable popup blocking
chrome_options.add_argument('--disable-images') # disable images
chrome_options.add_argument('--blink-settings=imagesEnabled=false')
chrome_options.add_argument('--incognito') # start in incognito mode
chrome_options.add_argument('--lang=zh-CN') # set the language to Simplified Chinese
chrome_options.add_argument(
'--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36')
chrome_options.add_argument('--hide-scrollbars')
chrome_options.add_argument('--disable-bundled-ppapi-flash')
chrome_options.add_argument('--mute-audio')
chrome_options.add_argument('lang=zh_CN.UTF-8')
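# chrome_options.add_extension(r'C:\hdmbdioamgdkppmocchpkjhbpfmpjiei-3.0.1-Crx4Chrome.com.crx') # add an extension
# chrome_options.add_argument('--disable-extensions') # disable extensions
# chrome_options.add_argument('--disable-plugins')
# Raw string so the backslash in the Windows path is not treated as an escape.
# executable_path is the Selenium 3 style; Selenium 4 passes the driver path through a Service object instead.
DRIVER = webdriver.Chrome(executable_path=r"C:\chromedriver.exe",
                          options=chrome_options)
DRIVER.get("http://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html")
# WebDriverWait only waits when .until() is called; wait here for the document to finish loading.
WebDriverWait(DRIVER, 10).until(
    lambda driver: driver.execute_script("return document.readyState") == "complete")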
page_source = DRIVER.page_source
DRIVER.quit()
print(page_source)
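Reading the printed page source by eye is tedious. The following is a minimal sketch of how the results could be summarized with the standard library alone. It assumes that the test page gives each result cell an id ending in `-result` and adds a `passed` or `failed` class to it; that structure, and the names `ResultCollector` and `summarize`, are assumptions for illustration, so check them against the actual page source before relying on this.

```python
from html.parser import HTMLParser


class ResultCollector(HTMLParser):
    """Collect pass/fail classes from elements whose id ends in '-result'.

    Assumption: the test page tags each result cell this way.
    """

    def __init__(self):
        super().__init__()
        self.results = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Attribute values can be None (valueless attributes), so fall back to "".
        element_id = attributes.get("id") or ""
        if not element_id.endswith("-result"):
            return
        classes = (attributes.get("class") or "").split()
        if "passed" in classes:
            status = "passed"
        elif "failed" in classes:
            status = "failed"
        else:
            status = "unknown"
        self.results[element_id] = status


def summarize(page_source):
    """Return a dict mapping result-cell ids to 'passed', 'failed' or 'unknown'."""
    collector = ResultCollector()
    collector.feed(page_source)
    return collector.results


# Hand-written sample markup standing in for DRIVER.page_source.
sample = (
    '<table><tr><td id="webdriver-result" class="result passed">ok</td>'
    '<td id="chrome-result" class="result failed">missing</td></tr></table>'
)
print(summarize(sample))
```

In the crawler above, `summarize(page_source)` could replace the final `print(page_source)` if the assumed markup holds.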
2. mitmproxy script
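import mitmproxy.http


class JsCheckPass:
    def response(self, flow: mitmproxy.http.HTTPFlow):
        # JavaScript overrides that are prepended to the detection script
        t = 'window.chrome = true;'
        t0 = 'Object.defineProperties(navigator,{webdriver:{get:() => false}});'
        # Keep this a complete statement: a trailing // comment would swallow the closing "};" once the strings are joined
        t1 = 'window.navigator.chrome = {runtime: {}};'
        t2 = '''
        Object.defineProperty(navigator, 'plugins', {
            get: () => [1, 2, 3, 4, 5,6],
        });
        '''
        # Note: 'chrome-headless-test' matches the test page's HTML URL as well as its script
        if 'chrome-headless-test' in flow.request.url or 'um.js' in flow.request.url:
            flow.response.text = t + t0 + t1 + t2 + flow.response.text
            # Rewrite the permissions check so that its condition can never match
            flow.response.text = flow.response.text.replace("permissionStatus.state === 'prompt'",
                                                            "permissionStatus.state === 'promptzzzzzzzzz'")


addons = [
    JsCheckPass(),
]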
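The rewriting done in the addon is plain string manipulation, so it can be tried out without running mitmproxy at all. Here is a rough sketch of that step as a standalone function; the names `STUB`, `patch_js`, `ORIGINAL_CHECK` and `NEUTRALIZED_CHECK` are made up for this example, and the stub is a shortened version of the overrides used above.

```python
# Shortened override stub (illustrative; the addon above prepends more overrides).
STUB = (
    "window.chrome = true;"
    "Object.defineProperties(navigator,{webdriver:{get:() => false}});"
)
ORIGINAL_CHECK = "permissionStatus.state === 'prompt'"
NEUTRALIZED_CHECK = "permissionStatus.state === 'promptzzzzzzzzz'"


def patch_js(body, stub=STUB):
    """Prepend the override stub and neutralize the permissions check."""
    patched = stub + "\n" + body
    return patched.replace(ORIGINAL_CHECK, NEUTRALIZED_CHECK)


# A made-up fragment standing in for the detection script served by the site.
detection_js = "if (permissionStatus.state === 'prompt') { markFailed(); }"
print(patch_js(detection_js))
```

Keeping the string logic in a function like this makes it easy to check a new replacement against a saved copy of the script before wiring it into the `response` hook.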
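3. Running it
From cmd, activate the virtual environment and run mitmweb -s addons.py to start the proxy. Then launch headless Chrome with the proxy configured (the --proxy-server option in the crawler code), so that its requests and responses pass through mitmproxy and can be modified.
Crawler detection is usually done in JavaScript: the browser sends a request, the server returns a JS file, and the verdict is whatever that script computes once it runs in the browser. So once you find that JS file and modify its code, you change the final result shown on the page.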
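Finding the right JS file is the manual part of this workflow. One possible shortcut, sketched below, is to scan each response body for strings that detection scripts commonly touch. The marker list and the function name `find_detection_markers` are assumptions chosen for illustration, not a complete or authoritative list.

```python
# Illustrative markers only; real detection scripts may use other properties.
DETECTION_MARKERS = (
    "navigator.webdriver",
    "HeadlessChrome",
    "navigator.plugins",
    "navigator.languages",
    "permissions.query",
)


def find_detection_markers(body, markers=DETECTION_MARKERS):
    """Return the markers that occur in a response body, in marker order."""
    return [marker for marker in markers if marker in body]


# A made-up script fragment used as sample input.
script = "if (navigator.webdriver || /HeadlessChrome/.test(navigator.userAgent)) { flag(); }"
print(find_detection_markers(script))
```

Called from the addon's `response` hook with `flow.response.text`, a helper like this could log `flow.request.url` whenever the returned list is non-empty, which narrows down the files worth reading.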
Reference tutorials:
https://www.cnblogs.com/yangjintao/p/10599868.html
https://blog.csdn.net/freeking101/article/details/83901842
https://blog.csdn.net/Chen_chong__/article/details/85526088
https://blog.wolfogre.com/posts/usage-of-mitmproxy/
https://www.jianshu.com/p/0eb46f21fee9
These tutorials are enough to cover basic mitmproxy usage.
There is also another way to disguise the crawler; see:
https://ask.csdn.net/questions/382674
https://blog.csdn.net/sinly100/article/details/79184559