A Complete Practical Guide to Website Log Analysis

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Analyzing website logs helps us understand where users come from, measure their behavior, and uncover defects in the site. In practice, a few questions come up: "}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How do you use the log analysis tool Splunk? "}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The log format is missing information; how do you configure logging to record everything you need? "}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Much of the traffic comes from crawlers rather than real users; how do you filter it out? "}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If the site sits behind a CDN, reverse proxy, or gateway, the log records the IP of those devices; how do you obtain the user's real IP? "}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article works through each of these questions."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"splunk"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Installing and using Splunk"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Splunk is commercial software, but it has a free edition that is limited to 500 MB of log data per day. For simple analysis, 500 MB is enough to extract a lot of information. This article uses the free edition of Splunk to analyze offline Apache logs. "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First register an account on the Splunk website. Apart from the email address, the registration details do not need to be real; the company field can be anything. After registering, go to the download page and choose the 64-bit Linux version. 
"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/79/79b349b0ad2cbf6452b28edfec815698.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Choose the command-line download option, and the page gives you a wget command. "}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ad/ad8f4c358e94973623cc1c7192bf18de.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Copy the wget command and run it on the Linux machine to download the archive. (The xxx in splunk-8.0.5-xxx differs for each user.) "}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"[root@localhost splunk]# wget -O splunk-8.0.5-xxx-Linux-x86_64.tgz 'https://www.splunk.com/bin/splunk/DownloadActivityServlet?architecture=x86_64&platform=linux&version=8.0.5&product=splunk&filename=splunk-8.0.5-xxx-Linux-x86_64.tgz&wget=true'"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Extract the archive, change to the bin directory, and run: "}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"./splunk start"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When the license terms appear, press q, then enter y at the prompt asking whether you agree. "}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"Do you agree with this license? 
[y/n]: y "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Enter admin as the username. "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Enter adminroot as the password. "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When the following line appears, "}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"The Splunk web interface is at http://192.168.56.106:8000 "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Splunk has started successfully. Compared with open-source tools (ELK, Graylog) it really is far less hassle. Make sure the Linux firewall is turned off, or at least allows port 8000, then open the address above in a browser and log in. On first login, a guided tutorial shows how to upload a file. Later on, to upload a file, click splunk->enterprise in the top-left corner to go to the main page, then choose Add Data. "}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ea/ea8e3bd36fd7194608105f2475c94a78.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There are several ways to add data. Here we choose Upload: the log file is already on my computer and is sent to Splunk like an email attachment. Keep the defaults throughout; the upload takes a while. For Apache logs, when setting the source type, choose access_combined under Web."}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/88/88bbdc85d40ceb83b6df6ec3c469497c.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For the following Review and Submit steps, keep the defaults. Once the page reports that the file was uploaded successfully, click the Start Searching button to search the log data you just uploaded. 
"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/72/723ed85eda8befeb39b9ec2b9d502288.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The search box holds the search conditions, the lower right shows the raw log events, and the left lists the field names. These fields are built in: any log that matches the format is parsed into them automatically. For example, every log entry begins with the client IP, which is the clientip field on the left. Click clientip to see its statistics; by default this shows the 10 most frequent IPs. To see more, run a query in the search box using the appropriate syntax."}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/73/7342e99c6bfc5526026c4590316e0081.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Introduction to the Splunk search language (SPL)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SPL is used in the search box to narrow the scope of a search and compute the metrics you need. It reads like a mix of search engine, SQL, and shell, and it is very powerful once you are fluent in it. "}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Basic syntax"}]}]}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"source=\"access2020-09-11.log\" host=\"basicserver\" 
sourcetype=\"access_combined\""}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"source is the name of the file the data came from, host is the name of the host where Splunk runs, and sourcetype is what was configured at import time. All of these can be changed: to report on data from the 10th, use access2020-09-10.log as the source. To see all data on basicserver regardless of date, drop the source condition and keep only host and sourcetype. The search box needs at least one condition."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The 200 most frequent IPs "}]}]}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"host=\"basicserver\" | top clientip limit=200"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Geographic information for client IPs "}]}]}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"host=\"basicserver\" | iplocation clientip"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After running this, the Interesting Fields list at the lower left gains City, Country, and Region fields, which correspond to the geographic location of each client IP."}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The ten cities with the most requests"}]}]}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"host=\"basicserver\" | iplocation clientip | top City limit=10"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"IP distribution on a map "}]}]}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"host=\"basicserver\" | iplocation clientip | geostats 
count"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/cb/cb087fddfc21c9f9ba70a9a361ab3849.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Number of distinct IPs visiting the site"}]}]}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"host=\"basicserver\" | stats dc(clientip)"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"All log records in ascending time order "}]}]}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"host=\"basicserver\" | sort _time "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By default, results are in descending order, with the newest events first."}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Least frequent IPs "}]}]}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"host=\"basicserver\" | rare clientip"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Most requested URIs "}]}]}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"host=\"basicserver\" | top uri 
limit=20"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Records whose clientip is outside two given network ranges "}]}]}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"host=basicserver clientip!=\"158.111.2.*\" clientip!=\"192.190.2.*\" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"One more note: the search box accepts many conditions and grows as you add them, so there is no need to worry about running out of room."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Data visualization"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Below the search bar are the Events, Patterns, Statistics, and Visualization tabs. The Visualization tab generates charts. It works best when the search command has already computed a statistic before you click Visualization; if you click it without computing one, producing a chart takes considerably more configuration. "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Suppose the search computes the 20 clientip values with the most requests on a given day:"}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"source=\"access2020-09-11.log\" | top clientip limit=20"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After it runs, the top 20 IPs are listed under Statistics. Click Visualization and choose the column chart."}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d6/d65d86cc85ab6e84aa2cc58f087685da.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once the column chart appears, click Format to rotate the IP labels on the axis to vertical, which is easier to read."}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1f/1f400b2d7cd176376a4b2e7688788697.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Updating the IP geolocation database"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"IP geolocation relies on a geolocation database. The installation ships with a built-in one, but it is not the latest. To update it, download the latest GeoLite2-City.mmdb from https://dev.maxmind.com/zh-hans/geoip/geoip2/geolite2/ (registration required) and copy it into the splunk/share directory, overwriting the existing file of the same name."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Deleting data"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Delete all data (stop Splunk first with ./splunk stop): ./splunk clean eventdata -f "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Delete only the data belonging to the index indexname: ./splunk clean eventdata -index indexname -f"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Things to note about Apache logs"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"The common and combined formats"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Apache logs come in two formats, common and combined. The combined format is more complete: it adds the Referer and User-Agent fields. Below are the definitions of both formats in httpd.conf under apache/conf:"}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"LogFormat \"%h %l %u %t \\\"%r\\\" %>s %b \\\"%{Referer}i\\\" 
\\\"%{User-Agent}i\\\"\" combined\nLogFormat \"%h %l %u %t \\\"%r\\\" %>s %b\" common"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If common logs do not meet your analysis needs, switch to the combined format. To do so, edit httpd.conf under apache/conf and set the last argument of CustomLog to combined:"}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"CustomLog \"|/opt/apache/bin/rotatelogs /opt/apache/logs/access%Y-%m-%d.log 86400\" combined"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"What to do when the user's real IP is not directly visible"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If the site sits behind a reverse proxy or gateway, or uses a CDN, the clientip in the log is the IP of the proxy server, gateway, or CDN edge server, which has little analytical value."}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/34/349f5c51e99fbf28fedf99749fcd1555.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To get the user's real IP, modify LogFormat in httpd.conf and add %{X-FORWARDED-FOR}i (XFF for short). I added XFF directly after %h:"}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"LogFormat \"%h %{X-FORWARDED-FOR}i %l %u %t \\\"%r\\\" %>s %b \\\"%{Referer}i\\\" \\\"%{User-Agent}i\\\"\" combined\nLogFormat \"%h %{X-FORWARDED-FOR}i %l %u %t \\\"%r\\\" %>s %b\" 
common"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Suppose the site uses a CDN (other cases are analogous). With the format above, each log entry starts with the CDN edge server's IP, followed by the XFF IP (the user's real IP). If the user reaches the site directly without going through the CDN, XFF is a single dash \"-\". The figure below shows a user connecting directly; in that case clientip is the user's real IP."}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/47/4794f7819458e1f280a06c7ada07487a.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Reloading the Apache configuration without a restart"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After editing the configuration file there is no need to stop and start Apache. Running ./apachectl graceful in Apache/bin reloads the configuration without interrupting service, and the new configuration takes effect right away."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Getting Splunk to parse the XFF field"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Neither of Splunk's built-in access_combined and access_common formats can parse XFF. To parse it correctly, modify splunk/etc/system/default/transforms.conf. (Splunk's documentation recommends putting overrides in etc/system/local rather than editing files under default.) "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Add an [xff] stanza containing a regex for XFF:"}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"[xff]\nREGEX = \\d{1,3}(\\.\\d{1,3}){2,3}"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Modify the [access-extractions] stanza, adding ([[nspaces:xff]]\\s++)? after clientip to match XFF:"}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"[access-extractions]\nREGEX = 
^[[nspaces:clientip]]\\s++([[nspaces:xff]]\\s++)?[[nspaces:ident]]\\s++[[nspaces:user]]\\s++[[sbstring:req_time]]\\s++[[access-request]]\\s++[[nspaces:status]]\\s++[[nspaces:bytes]](?:\\s++\"(?<referer>[[bc_domain:referer_]]?+[^\"]*+)\"(?:\\s++[[qstring:useragent]](?:\\s++[[qstring:cookie]])?+)?+)?[[all:other]]"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The position of the [xff] stanza does not matter; it can go anywhere in the file. Once configured, restart Splunk and upload a log that includes XFF; xff now appears under Interesting Fields on the left. "}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/78/78f0bfa5cf487a2cf06ec28f7155c440.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The xff field can be analyzed exactly like clientip, except that it now holds the real user's IP."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Dealing with crawlers"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Based on log analysis, the following behaviors indicate a crawler:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The IP accounts for an unusually high share of requests"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The useragent explicitly identifies itself as a particular search engine's crawler"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The requested URIs clearly do not need to be fetched that often"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Needless requests in the small hours (does it never sleep?)"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A high request rate (over a thousand URLs in two minutes) "}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Search engine crawlers visit a site in order to index its content. Some malicious crawlers do harm: besides scraping data, they also try to log in, execute scripts, and so on. Crawlers request pages at a high rate and put load on the site, so they should be restricted to a degree that suits the site. Malicious crawlers can only be stopped by blocking their IPs. Search engine crawlers can be restricted through robots.txt, and through settings or complaints on that engine's webmaster platform."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"robots.txt"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Before crawling, a search engine first reads the robots.txt file in the site's root directory. The file's rules are written according to the robots protocol, and they are the rules the search engine is expected to follow. For example, https://www.taobao.com/robots.txt shows that Taobao forbids the Baidu crawler from fetching anything. "}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"User-agent: Baiduspider\nDisallow: /"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To forbid every crawler from fetching anything, write: "}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"User-agent: *\nDisallow: 
/"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The robots protocol also supports more complex rules, such as allowing a given engine to crawl some directories but not others."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The robots protocol is a gentlemen's agreement: it does not restrict crawlers by technical means and depends on crawlers complying voluntarily."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In my experience, Baidu, Google, 360, and ByteDance all honor the protocol, while Sogou behaves badly and does not."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Some requests carry a useragent of Baiduspider but may only be impersonating the Baidu crawler, since the useragent can be set to anything. To check whether an IP really belongs to a search engine crawler, use the nslookup or host command; the domain name each returns shows whether it is a crawler. "}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"# nslookup 49.7.21.76\nServer: 219.141.136.10\nAddress: 219.141.136.10#53\nNon-authoritative answer:\n76.21.7.49.in-addr.arpa name = sogouspider-49-7-21-76.crawl.sogou.com.\n\n# host 111.206.198.69\n69.198.206.111.in-addr.arpa domain name pointer 
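baiduspider-111-206-198-69.crawl.baidu.com."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In addition, the major search engines' webmaster platforms explain how to tell whether an IP is one of their crawlers. Baidu's webmaster platform, for instance, has an article titled “輕鬆兩步,教你快速識別百度蜘蛛” (two easy steps to quickly identify Baiduspider), which describes the Baiduspider useragent format and how to verify it."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Webmaster platforms"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Every search engine has a webmaster platform with tutorials on getting more out of the engine. When registering, you must prove that you administer the site; one way is to place a specified file in the site's root directory. Once verified, you can check how much of your site is indexed and view metrics such as the traffic the engine sends you. You can also complain about overly frequent crawling and set the crawl rate. Some platforms publish an email address for complaints."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Blocking IPs"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For crawlers that are malicious or ignore the robots protocol, the only option is to block their IPs. On the origin server, use the firewall; CDN providers also offer IP blocking. A site behind a CDN must block on the xff IP, because most clientip values are CDN edge server addresses, and blocking those would cut off many legitimate users."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Summary"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Log analysis is a way to understand a system from the data, and the results may overturn your assumptions about it. It provides valuable information for development, operations, and business teams alike, so give it a try when you get the chance. If you would rather not block crawler IPs, you can exclude their requests in the search bar (xff!=\"crawler ip\"), which removes the noise while coexisting peacefully with the crawlers."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}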
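The idea of counting real user IPs while leaving out known crawlers can also be tried outside Splunk. The following is only a minimal Python sketch, assuming log lines written with the article's modified LogFormat (%h, then %{X-FORWARDED-FOR}i, then the usual fields) and an X-Forwarded-For value that holds a single address; lines with several comma-separated addresses do not match and are skipped. The names LINE_RE, real_ip, top_ips, and SAMPLE are hypothetical, and the sample lines are made up for illustration (only 49.7.21.76 comes from the nslookup output in the article).

```python
import re
from collections import Counter

# Start of a line in the assumed format:
#   %h %{X-FORWARDED-FOR}i %l %u %t "%r" %>s %b ...
LINE_RE = re.compile(
    r'^(?P<clientip>\S+)\s+(?P<xff>\S+)\s+\S+\s+\S+\s+'
    r'\[[^\]]+\]\s+"[^"]*"\s+(?P<status>\d{3})\s+\S+'
)

def real_ip(line):
    """Return the user's IP for one log line, or None if the line does not parse."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    xff = m.group("xff")
    # "-" means there was no X-Forwarded-For header: the user connected directly,
    # so clientip is already the real address.
    return m.group("clientip") if xff == "-" else xff

def top_ips(lines, crawler_ips=frozenset(), limit=20):
    """Count requests per real IP, skipping unparsable lines and known crawler IPs."""
    counts = Counter()
    for line in lines:
        ip = real_ip(line)
        if ip is not None and ip not in crawler_ips:
            counts[ip] += 1
    return counts.most_common(limit)

# Made-up sample lines (documentation-range addresses), not real traffic.
SAMPLE = [
    '10.0.0.1 203.0.113.5 - - [11/Sep/2020:10:00:00 +0800] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '10.0.0.1 203.0.113.5 - - [11/Sep/2020:10:00:01 +0800] "GET /a HTTP/1.1" 200 100 "-" "Mozilla/5.0"',
    '10.0.0.1 49.7.21.76 - - [11/Sep/2020:10:00:02 +0800] "GET /b HTTP/1.1" 200 100 "-" "Sogou web spider"',
    '198.51.100.7 - - - [11/Sep/2020:10:00:03 +0800] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

print(top_ips(SAMPLE, crawler_ips={"49.7.21.76"}))
```

This mirrors the Splunk search `xff!="crawler ip" | top xff`, with the extra step of falling back to clientip for direct connections.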