Blocking Certain User Agents from Crawling Your Site with Nginx

Original

SitVen

2020-06-21 13:53

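Everyone knows the web is full of crawlers, and they cut both ways. Crawlers such as Baidu's spider (Baiduspider) make a site easier for other people to find through search.

The problem is that some crawlers ignore robots.txt rules and put pressure on the server, or maliciously crawl pages and scrape data. Badly behaved crawlers consume a large share of server resources and degrade the experience of normal users. Some services also bill by traffic, so traffic consumed by crawlers shows up as extra charges, for example on Qiniu (七牛).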

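Blocking crawlers with Nginx

In the conf directory under the Nginx installation directory, create a spider directory, and inside it create agent_deny.conf:

cd /usr/local/nginx/conf         # enter the nginx configuration directory
mkdir spider                     # create the spider directory
cd spider                        # enter the spider directory
vim agent_deny.conf              # create and edit agent_deny.conf with vim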

The contents of agent_deny.conf are as follows:

# Block scraping by tools such as Scrapy
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
     return 403;
}

# Block the listed UAs and requests with an empty UA
if ($http_user_agent ~* "FeedDemon|YisouSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|FlightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|^$" ) {
     return 403;
}

# Block requests that use any method other than GET, HEAD or POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}

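Note: these rules reject requests that carry one of the listed User-Agent headers or an empty User-Agent header. The match is a case-insensitive substring match (~*), and the trailing ^$ alternative is what catches the empty value.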
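The deny list can be sanity-checked offline before it goes anywhere near a server. The sketch below is not from the original post: ua_blocked is a hypothetical helper, and it approximates the nginx ~* match with a case-insensitive extended-regex grep, which should behave the same for plain alternations like these but is not the PCRE engine nginx uses. DENY_PATTERN is a shortened copy of the list above.

```shell
#!/bin/sh
# Hypothetical helper (assumption, not part of the original configuration):
# approximates `$http_user_agent ~* "<pattern>"` with grep -E -i.

# Shortened copy of the deny list; `^$` matches an empty User-Agent.
DENY_PATTERN='Scrapy|Curl|HttpClient|FeedDemon|YisouSpider|AhrefsBot|MJ12bot|Python-urllib|^$'

# ua_blocked UA -> exit status 0 if the UA matches the deny list, 1 otherwise.
ua_blocked() {
    printf '%s\n' "$1" | grep -Eiq -- "$DENY_PATTERN"
}

# Print the status nginx would be expected to return for a few sample UAs.
for ua in 'YisouSpider' '' 'Baiduspider' 'curl/7.68.0'; do
    if ua_blocked "$ua"; then
        echo "403 [$ua]"
    else
        echo "200 [$ua]"
    fi
done
```

Because the match is an unanchored substring match, the Curl entry also catches curl's default User-Agent (curl/x.y.z), which is why the tests later in this post always pass an explicit -A value.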

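Below is a list of junk UAs commonly seen on the web, for reference only:

FeedDemon                      # content scraping
BOT/0.1 (BOT for JCE)          # SQL injection
CrawlDaddy                     # SQL injection
Java                           # content scraping
Jullo                          # content scraping
Feedly                         # content scraping
UniversalFeedParser            # content scraping
ApacheBench                    # CC attack tool
Swiftbot                       # useless crawler
YandexBot                      # useless crawler
AhrefsBot                      # useless crawler
YisouSpider                    # useless crawler (since acquired by UC Shenma Search; this spider can be allowed)
MJ12bot                        # useless crawler
ZmEu phpmyadmin                # vulnerability scanning
WinHttp                        # scraping, CC attacks
EasouSpider                    # useless crawler
HttpClient                     # TCP attack
Microsoft URL Control          # scanning
YYSpider                       # useless crawler
jaunty                         # WordPress brute-force scanner
oBot                           # useless crawler
Python-urllib                  # content scraping
Indy Library                   # scanning
FlightDeckReports Bot          # useless crawler
Linguee Bot                    # useless crawler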
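If a list like this is kept in its own text file, the alternation used in the if condition can be generated instead of edited by hand. This is only a sketch under assumptions: build_pattern and the file name ua_list.txt are hypothetical, and entries are joined as-is, so any entry containing regex metacharacters (for example "BOT/0.1 (BOT for JCE)") would still need escaping by hand.

```shell
#!/bin/sh
# Hypothetical helper (assumption, not from the original post).
# build_pattern FILE -> prints the entries of FILE joined with "|".
# Trailing "# comment" text and blank lines are dropped first.
build_pattern() {
    sed -e 's/[[:space:]]*#.*$//' -e '/^[[:space:]]*$/d' "$1" | paste -s -d '|' -
}
```

For example, build_pattern ua_list.txt prints a single line that can be pasted between the quotes of the $http_user_agent condition.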

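Then insert the following line after location / { in the site's configuration. The path must point to where agent_deny.conf actually lives: the commands above created it under /usr/local/nginx/conf/spider/, while this example uses /etc/nginx/conf.d/spider/, so adjust it to match your installation: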

include /etc/nginx/conf.d/spider/agent_deny.conf;

For example, the configuration of the sitven blog:

location / {
        add_header Content-Security-Policy upgrade-insecure-requests;
        uwsgi_pass mysite_server;
        include /etc/nginx/uwsgi_params;
        # add the following line
        include /etc/nginx/conf.d/spider/agent_deny.conf;
    }

After saving, reload the configuration so nginx restarts gracefully:

/usr/local/nginx/sbin/nginx -s reload
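A syntax error in agent_deny.conf would make the reload fail, so it can help to test the configuration first with nginx -t. The wrapper below is a minimal sketch, not part of the original post: safe_reload and the NGINX_BIN variable are hypothetical names, and the default path is simply the one used in the reload command above.

```shell
#!/bin/sh
# Path taken from the reload command above; set NGINX_BIN if yours differs.
NGINX_BIN="${NGINX_BIN:-/usr/local/nginx/sbin/nginx}"

# safe_reload (hypothetical helper): reload only if `nginx -t` succeeds.
safe_reload() {
    if "$NGINX_BIN" -t; then
        "$NGINX_BIN" -s reload
    else
        echo "configuration test failed; not reloading" >&2
        return 1
    fi
}
```

Calling safe_reload after editing the file keeps a broken include from being pushed to the running server.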

For the equivalent Apache and PHP configuration, see Zhang Ge's (張戈) blog post "Server Anti-Crawler Guide" (服務器反爬蟲攻略).

Testing the result

Use curl -A to simulate crawler requests, for example:

Simulate a request from YisouSpider:

curl -I -A 'YisouSpider' sitven.cn

Simulate a request with an empty UA:

curl -I -A '' sitven.cn

Simulate a request from Baiduspider:

curl -I -A 'Baiduspider' sitven.cn

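[Screenshot: results of the three requests]

As shown, the YisouSpider request and the empty-UA request both return 403 Forbidden, while the Baiduspider request returns 200, which shows the rules are in effect.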
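The three manual checks can also be scripted so that only the status code is printed for each User-Agent. This is a sketch rather than the author's procedure: status_for_ua and check_site are hypothetical helper names, and the curl options used are -s (silent), -o /dev/null (discard the headers), -I (HEAD request), -A (User-Agent) and -w '%{http_code}' (print the status code).

```shell
#!/bin/sh
# status_for_ua URL UA (hypothetical helper) -> prints only the HTTP status
# code of a HEAD request sent with the given User-Agent.
status_for_ua() {
    curl -s -o /dev/null -I -A "$2" -w '%{http_code}\n' "$1"
}

# check_site URL (hypothetical helper) -> prints one "status<TAB>[UA]" line
# for each of the three User-Agents tested above.
check_site() {
    for ua in 'YisouSpider' '' 'Baiduspider'; do
        printf '%s\t[%s]\n' "$(status_for_ua "$1" "$ua")" "$ua"
    done
}
```

Running check_site sitven.cn against a server with the rules above would be expected to show 403 for the first two entries and 200 for Baiduspider.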
