网站日志分析最完整实践

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"分析网站日志可以帮助我们了解用户地域信息,统计用户行为,发现网站缺陷。操作会面临几个问题 "}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"日志分析工具splunk如何使用? "}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"日志格式信息不全,如何配置日志打印出全面信息? "}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有大量爬虫访问,并非真实流量,如何去掉这些流量? "}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果网站用了CDN加速,反向代理,网关等,日志打印的是这些设备的ip,那么用户的真实ip如何获得呢? "}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"阅读本文能帮您有效解决上述问题"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"splunk"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"splunk安装使用"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"日志分析工具splunk是一款商业软件,但有免费版,免费版每天处理日志限制500M以内。对于简单分析,500M也可以挖掘出很多信息了。本文使用免费版splunk分析Apache离线日志。 "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"先要到splunk官网注册用户,注册用户填写的信息除邮箱外不需要真实,公司瞎填即可。注册完成到下载页面选择Linux 64位版本, "}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/79/79b349b0ad2cbf6452b28edfec815698.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"选择命令行下载,会给出一个wget的指令, "}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ad/ad8f4c358e94973623cc1c7192bf18de.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"复制wgt指令,到Linux上执行,得到压缩包。 (wget指令splunk-8.0.5-xxx的xxx每个人都不同) "}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"[root@localhost splunk]# wget -O splunk-8.0.5-xxx-Linux-x86_64.tgz 'https://www.splunk.com/bin/splunk/DownloadActivityServlet?architecture=x86_64&platform=linux&version=8.0.5&product=splunk&filename=splunk-8.0.5-xxx-Linux-x86_64.tgz&wget=true'"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"解压压缩包,到bin目录下执行 "}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"./splunk start"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"看见协议条款按q,是否同意协议位置输入y "}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"Do you agree with this license? [y/n]: y "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"用户名输入 admin "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"密码输入 adminroot "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"出现 "}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"The Splunk web interface is at http://192.168.56.106:8000 "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"表示启动成功。相比于开源工具(ELK,graylog)确实省心太多了。确保Linux防火墙是关闭状态,然后浏览器输入前面8000地址打开登录。首次会有引导教程指导如何上传文件。日常如果想上传文件,直接点击左上角splunk->enterprise进入主界面,然后选择添加数据, "}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ea/ea8e3bd36fd7194608105f2475c94a78.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有多种添加数据的方式,这里选择上载,就是日志文件已经在我电脑里了,像上传附件一样传给splunk。过程全部默认,上载文件需要等一段时间。Apache日志设置“来源类型”时选择web里的access_combined。"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/88/88bbdc85d40ceb83b6df6ec3c469497c.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下一步,“检查”,“提交”都是默认。显示文件上载成功,点击“开始搜索”按钮,搜索刚才上传的日志数据。 "}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/72/723ed85eda8befeb39b9ec2b9d502288.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"搜索框是放搜索限制条件的,右下方是原始日志,左侧是各个字段的名称,这些字段是工具内置的,满足格式的日志会自动被解析出这些字段,比如每条日志开头都有个客户端的ip,就是左侧的clientip,鼠标点击clientip会看见统计结果,默认是出现频率最高的前10个ip。如果要统计更多,需要在搜索框用对应语法查询。"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/73/7342e99c6bfc5526026c4590316e0081.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"splunk搜索语言介绍(SPL语法)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"语法用于在搜索框中使用,达到限制范围,统计所需要指标的目的。语法像“搜索引擎 + SQL + shell”的混合体。如果能熟练运用功能非常强大。 "}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基本语法,"}]}]}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"source=\"access2020-09-11.log\" host=\"basicserver\" sourcetype=\"access_combined\""}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"source表示数据来源文件名,host是splunk所在的主机名称,sourcetype是导入时配置的。这些都可以变换,我想统计10号的数据,将access2020-09-10.log作为source就达到了效果。如果想查看basicserver里的所有数据不分日期,把source条件去掉,只保留host和sourcetype两个条件。搜索框最少要有一个条件。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"访问频率最高的200个ip "}]}]}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"host=\"basicserver\" | top clientip limit=200"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"客户端ip的地理信息 "}]}]}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"host=\"basicserver\" | iplocation clientip"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"执行后左侧下方“感兴趣的字段”会比刚才多出City Country Region字段,这些和客户端ip的地理位置是对应的。"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"访问频率最高的十个城市"}]}]}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"host=\"basicserver\" | iplocation clientip | top City limit=10"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"地图查看ip分布 "}]}]}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"host=\"basicserver\" | iplocation clientip | geostats count"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/cb/cb087fddfc21c9f9ba70a9a361ab3849.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有多少不同的ip访问网站"}]}]}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"host=\"basicserver\" | stats dc(clientip)"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所有日志记录按时间正序排列 "}]}]}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"host=\"basicserver\" | sort _time "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"默认按照倒序,最新的日志排在最前面"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"访问次数最少的ip "}]}]}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"host=\"basicserver\" | rare clientip"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"被访问最多的uri "}]}]}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"host=\"basicserver\" | top uri limit=20"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"clientip不等于某两个网段的记录 "}]}]}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"host=basicserver clientip!=\"158.111.2.*\" clientip!=\"192.190.2.*\" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"补充一句,搜索框可以输入很多条件,条件增多搜索框会随着变大,不要担心条件多装不下。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"数据可视化"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"搜索栏下方依次有 事件、模式、统计信息、可视化 选项,最后的可视化选项能生成图表,最好是在搜索命令计算了某个统计指标,然后点击可视化。如果没计算指标直接点击可视化,配置会比较繁琐才能生成图表。 "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"假设搜索栏统计某天访问次数最高的20个clientip,命令为"}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"source=\"access2020-09-11.log\" | top clientip limit=20"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"执行完会在统计信息下方列出前20个ip,点击可视化,选择柱状图。"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d6/d65d86cc85ab6e84aa2cc58f087685da.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"柱状图出来后,点击格式可以配置让座标ip竖着显示,看着更舒服。"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1f/1f400b2d7cd176376a4b2e7688788697.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"ip地址的地理信息数据库如何更新"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"统计ip的地理位置依赖于地理信息库,安装时有个内置的库,不是最新的。如果要更新到最新的需要到https://dev.maxmind.com/zh-hans/geoip/geoip2/geolite2/下载最新的GeoLite2-City.mmdb(要先注册),把这个文件复制到splunk/share目录下覆盖原来的同名文件即可。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"删除数据"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"删除所有数据./splunk clean eventdata -f "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"删除属于索引indexname的数据 ./splunk clean eventdata -index indexname -f"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Apache日志需要注意的"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":" common和combined两种格式"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"日志格式有common和combined两种格式,combined格式信息更全面,比common格式多了refer和useragent信息。下面是apache/conf下的httpd.conf文件里对两种格式的定义"}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"LogFormat \"%h %l %u %t \\\"%r\\\" %>s %b \\\"%{Referer}i\\\" \\\"%{User-Agent}i\\\"\" combined\nLogFormat \"%h %l %u %t \\\"%r\\\" %>s %b\" common"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果common日志满足不了分析要求,可以把格式改成common格式。方法是修改apache/conf下的httpd.conf文件,把里面CustomLog末尾配置为combined"}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"CustomLog \"|/opt/apache/bin/rotatelogs /opt/apache/logs/access%Y-%m-%d.log 86400\" combined"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"无法直接看到用户真实ip怎么办"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果网站前方有反向代理或者网关,或者网站做了CDN加速,那么日志的clientip是代理服务器、网关或者CDN加速服务器的ip,没什么分析价值。"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/34/349f5c51e99fbf28fedf99749fcd1555.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"要获取用户真实ip可以修改httpd.conf的LogFormat,加上%{X-FORWARDED-FOR}i (简称XFF),我直接将XFF加到了%h的后面,"}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"LogFormat \"%h %{X-FORWARDED-FOR}i %l %u %t \\\"%r\\\" %>s %b \\\"%{Referer}i\\\" \\\"%{User-Agent}i\\\"\" combined\nLogFormat \"%h %{X-FORWARDED-FOR}i %l %u %t \\\"%r\\\" %>s %b\" common"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"假设网站有CDN加速(其它情况同理分析),按上面格式,每条日志首先打印的是CDN加速服务器ip,然后是XFF的ip(也就是用户真实ip)。如果用户没有经过CDN直接访问,那么XFF就是一条横线\"-\"。下图就是用户直连网站的情况,clientip就是用户的真实ip。"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/47/4794f7819458e1f280a06c7ada07487a.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Apache动态载入配置文件"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"修改完配置文件,不需要重启Apache,到Apache/bin下执行./apachectl graceful可以动态载入配置文件,不停止服务,新的配置立刻生效。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"splunk如何解析XFF字段"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"splunk内置的access"},{"type":"text","marks":[{"type":"italic"}],"text":"combined和access"},{"type":"text","text":"common格式都无法解析XFF,如果要正确解析需要修改splunk/etc/system/default/transforms.conf "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"新增[xff]段配置XFF的正则"}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"[xff]\nREGEX = \\d{1,3}(\\.\\d{1,3}){2,3}"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"修改[access-extractions]段,在clientip后增加([[nspaces:xff]]\\s++)?,用来匹配XFF"}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"[access-extractions]\nREGEX = ^[[nspaces:clientip]]\\s++([[nspaces:xff]]\\s++)?[[nspaces:ident]]\\s++[[nspaces:user]]\\s++[[sbstring:req_time]]\\s++[[access-request]]\\s++[[nspaces:status]]\\s++[[nspaces:bytes]](?:\\s++\"(?[[bc_domain:referer_]]?+[^\"]*+)\"(?:\\s++[[qstring:useragent]](?:\\s++[[qstring:cookie]])?+)?+)?[[all:other]]"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[xff]段的位置不重要,写在哪里都行。配置完成,重启splunk,上传带有XFF的日志,左侧会看见“感兴趣的字段”出现了xff "}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/78/78f0bfa5cf487a2cf06ec28f7155c440.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"xff字段的分析统计和clientip完全一样,只不过这是真实用户的ip了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"如何对付爬虫"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通过分析日志,下列行为可以判断为爬虫"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"该ip访问占比特高"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"useragent明确说自己是哪家搜索引擎爬虫"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"访问的uri明显不需要那么频繁访问"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"非必要的凌晨访问(不睡觉吗?)"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"访问频率高(两分钟访问上千个url) "}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"搜索引擎的爬虫访问网站是为了收录网站数据。有一些恶意的爬虫会做坏事,除了抓数据还尝试登陆执行脚本等。爬虫访问的频率都很高会给网站带来负载,应该根据网站情况进行不同程度的限制。限制恶意爬虫只能封对方ip。搜索引擎的爬虫可以通过配置robots.txt文件,以及在该引擎的站长平台配置或投诉来限制。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"robots.txt"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"搜索引擎抓取数据会先读取网站根目录下的robots.txt文件,文件根据robots协议书写规则,文件的规则就是搜索引擎要遵守的规则。比如打开https://www.taobao.com/robots.txt可以看到淘宝的协议规定百度爬虫任何数据都不可以爬。 "}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"User-agent: Baiduspider\nDisallow: /"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果要任何爬虫都不能爬任何数据,就写成 "}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"User-agent: *\nDisallow: /"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"复杂的规则如指定引擎可以爬取某些目录,某些目录不可以爬,robots协议也是支持的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"robots协议是“君子协定”,它并没有通过技术手段限制爬虫,要靠爬虫的自觉遵守。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"按我经验,百度、谷歌、360、字节、都能遵守协议,搜狗很流氓,不遵守协议。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有些请求的useragent写的是Baiduspider,但可能是冒充百度爬虫,useragent是可以自己设置的。要想判断一个ip是否是搜索引擎的爬虫可以使用,nslookup或者host命令。这两个命令返回的域名信息可以看出来是否是爬虫。 "}]},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"# nslookup 49.7.21.76\nServer: 219.141.136.10\nAddress: 219.141.136.10#53\nNon-authoritative answer:\n76.21.7.49.in-addr.arpa name = sogouspider-49-7-21-76.crawl.sogou.com.\n\n# host 111.206.198.69\n69.198.206.111.in-addr.arpa domain name pointer baiduspider-111-206-198-69.crawl.baidu.com."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外,各大搜索引擎的站长平台会教如何判断ip是否是自己的爬虫,百度站长平台就有“轻松两步,教你快速识别百度蜘蛛”,介绍了百度蜘蛛useragent的格式和判断方法。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"站长平台"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"搜索引擎都有站长平台,里面有很多相关的教程帮助更好的使用搜索引擎。注册站长平台时要证明自己有网站的管理权限,验证方法是可以将指定文件放置到网站根目录。成为站长后可以查询自己网站的索引收录情况,查询搜索引擎给网站带来的流量等指标。还可以投诉爬虫抓取频繁,设定抓取频率。有些平台公布邮箱可以投诉。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"封IP"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"对于恶意或者不遵守robots协议的爬虫,只能封ip。网站源站用防火墙来封,CDN加速服务器也都提供了封ip功能。配置了CDN加速的网站一定要封xff的IP,因为大部分clientip都是CDN加速服务器的地址,封了这些地址很多正常用户就不能正常访问了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"总结"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"日志分析是从理性方面了解系统,分析结果可能会颠覆之前对系统的固有认知。对开发,运维,运营都能提供有价值的信息,建议大家有机会尝试一下。如果不想封禁爬虫ip,可以在搜索栏排除爬虫ip的访问记录(xff!=\"爬虫ip\"),这样既能排除干扰,还能和爬虫和平共处。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章