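NCPA is the monitoring agent that Nagios has released in recent years. It has matured steadily and is intended to replace the aging NRPE.
First, the strengths of Nagios:
1. The industry standard in monitoring, focused on alerting for nearly twenty years (it was created in 1999)
   A common saying in the industry is that every monitoring system has the shadow of Nagios behind it.
2. Good design never goes out of date: no database
   Compared with the bloat of Zabbix, Nagios is a model of the Unix philosophy: do one thing and do it well.
   With no database in the design, a database can never become the bottleneck.
3. Written in C, very high performance
   Before Nagios 4.0 it used a model similar to Apache prefork, which limited performance for a time, although before event-driven models appeared this was still the best option available.
   Since Nagios 4.0 it uses an event model similar to nginx, which gives a qualitative jump in performance at a very small memory cost; a scale of 10k+ is not a problem.
4. An excellent and very flexible plugin mechanism
   Nagios has accumulated a huge number of community-contributed plugins over more than ten years, and writing your own plugin is also easy.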
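Where NCPA is better than NRPE:
1. It supports passive monitoring, that is, NCPA reports to Nagios on its own initiative (through NRDP).
2. Like SNMP, NCPA needs almost no configuration and ships with the basic checks built in, such as CPU, memory, services, and processes.
   NRPE, by contrast, requires a pile of checks to be defined on the client and then defined again on the Nagios server, which is very tedious.
3. It keeps the existing Nagios plugins.
4. With a little scripting, the Nagios server can scan for NCPA clients with nmap and add the basic checks automatically.
5. Apart from a dependency on Python 2.7, it is not intrusive to the system in any way.
This article describes active monitoring based on Nagios + NCPA, as a replacement for NRPE.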
Environment
Server: CentOS 7 + Nagios 4, IP: 192.168.1.200
Client: CentOS 7 + NCPA 2.0.6, IP: 192.168.1.50
Client configuration
1. Install NCPA
rpm -ivh https://assets.nagios.com/downloads/ncpa/ncpa-2.0.6.el7.x86_64.rpm
2. Start the NCPA services and enable them at boot
/etc/init.d/ncpa_passive start
/etc/init.d/ncpa_listener start
chkconfig ncpa_listener on
chkconfig ncpa_passive on
3. Open firewall port 5693 on the client
iptables -A INPUT -p tcp --dport 5693 -j ACCEPT
or, to allow only the Nagios server:
iptables -A INPUT -s 192.168.1.200 -p tcp --dport 5693 -j ACCEPT
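
Once the port is open, it can help to confirm from the Nagios server that the listener actually answers. The sketch below is one way to do that with curl. It assumes the NCPA 2 HTTP API layout `https://HOST:5693/api/METRIC?token=TOKEN` and that the token `mytoken` used later in this article matches the `community_string` configured in the client's ncpa.cfg. The function names `ncpa_url` and `ncpa_query` are made up for this example.

```shell
#!/bin/sh
# Build the NCPA API URL for a metric.
# Assumed layout: https://HOST:PORT/api/METRIC?token=TOKEN
# $1 = host, $2 = token, $3 = metric path (e.g. cpu/percent), $4 = port (default 5693)
ncpa_url() {
    printf 'https://%s:%s/api/%s?token=%s\n' "$1" "${4:-5693}" "$3" "$2"
}

# Query the listener. -k skips certificate verification, since NCPA
# normally serves a self-signed certificate; -s silences progress output.
ncpa_query() {
    curl -sk --max-time 10 "$(ncpa_url "$@")"
}

# Example, run from the Nagios server (not executed here):
#   ncpa_query 192.168.1.50 mytoken cpu/percent
```

If the call hangs or is refused, the firewall rule or the listener service is the first thing to check; an authentication error points at the token instead.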
Server configuration
Install Nagios (abridged)
yum install epel-release -y
yum install nagios httpd php php-pecl-zendopcache fping nmap -y
systemctl enable httpd nagios
systemctl start httpd nagios
iptables -A INPUT -p tcp --dport 80 -j ACCEPT

Create the working directories and tell Nagios to load every configuration file under hosts/ and services/:
mkdir -p /etc/nagios/bin
mkdir -p /etc/nagios/hosts
mkdir -p /etc/nagios/services
mkdir -p /etc/nagios/template
echo "cfg_dir=/etc/nagios/hosts" >> /etc/nagios/nagios.cfg
echo "cfg_dir=/etc/nagios/services" >> /etc/nagios/nagios.cfg
service nagios restart
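I. Automatic host discovery
Automatic discovery means scanning the LAN with a scanner:
1. If an IP is already monitored, skip it.
2. If an IP is new, create a configuration file for it from a fixed template and notify the administrator.
3. If an IP that was discovered earlier disappears, Nagios raises an alert and notifies the administrator.
Together these form a closed loop for managing the IP addresses on the LAN.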
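
The three cases above amount to a set comparison between the IPs that are already monitored and the IPs the scan just found. The sketch below only illustrates that logic; `classify_hosts` is a hypothetical helper, not part of the script in this article, which handles the first two cases itself and leaves the third to the normal Nagios host check.

```shell
#!/bin/sh
# classify_hosts KNOWN_FILE SCANNED_FILE
# Each file holds one IP per line. Prints one line per IP:
#   "known IP"   - already monitored and still answering
#   "new IP"     - answering but not yet monitored
#   "missing IP" - monitored but absent from this scan
classify_hosts() {
    known=$(mktemp)
    scanned=$(mktemp)
    # comm requires sorted input
    sort -u "$1" > "$known"
    sort -u "$2" > "$scanned"
    comm -12 "$known" "$scanned" | sed 's/^/known /'
    comm -13 "$known" "$scanned" | sed 's/^/new /'
    comm -23 "$known" "$scanned" | sed 's/^/missing /'
    rm -f "$known" "$scanned"
}
```

For example, the known list could be built from the file names under /etc/nagios/hosts and the scanned list from the fping output.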
Configuring host auto-discovery with fping
Create the host template file /etc/nagios/template/host.cfg with the following content:
define host {
    host_name               HOST
    address                 HOST
    check_command           check-host-alive
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    contacts                nagiosadmin
    notification_interval   60
    notification_period     24x7
    notifications_enabled   1
}

Create the script /etc/nagios/bin/find-hosts.sh with the following content:
#!/usr/bin/env bash
# Discover live hosts with fping and create a Nagios host file for each new one.

if [ ! -f /usr/sbin/fping ]; then
    yum install fping -y
fi

network=$1

echo_usage() {
    echo -e "\e[1;31mUsage: $0 [network] \e[0m"
    echo -e "example: \e[1;32m $0 192.168.0.0/24 \e[0m"
    echo
    exit 3
}

if [ -z "$network" ]; then
    echo_usage
fi

dir=/etc/nagios/hosts
host_template=/etc/nagios/template/host.cfg
result=$(mktemp /tmp/fping-XXXXXX)
mkdir -p "$dir"

# -a: print only hosts that are alive, -q: quiet, -g: generate targets from the network
fping -a -q -g "$network" > "$result"

# Create a config file for every host that does not have one yet
i=0
while read -r host; do
    if [ ! -f "$dir/$host.cfg" ]; then
        echo "new host found $host"
        #mailx -s "new host found :$host" root@localhost
        sed "s/HOST/$host/g" "$host_template" > "$dir/$host.cfg"
        i=$((i + 1))
    fi
done < "$result"
rm -f "$result"

if [ "$i" -eq 0 ]; then
    echo "no new host found"
    exit 0
fi

# Restart Nagios only if the new configuration passes verification
if nagios -v /etc/nagios/nagios.cfg | grep -q "Things look okay"; then
    echo "nagios configuration is OK"
    sleep 1
    service nagios restart
    echo "nagios restart successfully"
else
    echo "nagios restart failed. please check"
    exit 1
fi
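Run this script from a scheduled job and new hosts are added to monitoring automatically. The script can also be modified to email the administrator each time a new machine is found.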
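
As a sketch of the scheduled job, the helper below writes a system crontab entry for the discovery script. The 30-minute interval, the log path, the file name `nagios-find-hosts`, and the function name `write_discovery_cron` are all assumptions made for this example; adjust them to the local setup.

```shell
#!/bin/sh
# write_discovery_cron NETWORK [CRON_DIR]
# Writes a cron.d entry that runs find-hosts.sh every 30 minutes as root.
# CRON_DIR defaults to /etc/cron.d; pass another directory to preview the file.
write_discovery_cron() {
    cron_dir=${2:-/etc/cron.d}
    printf '*/30 * * * * root /etc/nagios/bin/find-hosts.sh %s >> /var/log/find-hosts.log 2>&1\n' \
        "$1" > "$cron_dir/nagios-find-hosts"
}

# Example (needs root to write into /etc/cron.d, not executed here):
#   write_discovery_cron 192.168.1.0/24
```

Since the script restarts Nagios whenever it finds a new host, a fairly long interval keeps restarts infrequent.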
II. Automatic service discovery
Service auto-discovery is implemented with nmap + check_ncpa.
1. Download check_ncpa
wget https://assets.nagios.com/downloads/ncpa/check_ncpa.tar.gz
tar zxvf check_ncpa.tar.gz
cp check_ncpa.py /usr/lib64/nagios/plugins/
cp check_ncpa.py /usr/bin/
2. Configure check_ncpa
Create the file /etc/nagios/conf.d/check_ncpa.cfg with the following content:
# 'check_ncpa' command definition
define command {
    command_name    check_ncpa
    command_line    $USER1$/check_ncpa.py -H $HOSTADDRESS$ -P 5693 -t mytoken $ARG1$
}
3. Test check_ncpa.py
python check_ncpa.py -H 192.168.1.50 -P 5693 -t mytoken -l
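
When testing by hand, the exit status of the plugin matters as much as its output, because Nagios decides the service state from it. The standard plugin convention is 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN. The small helper below, whose name `nagios_state` is made up for this example, turns the status into a readable label.

```shell
#!/bin/sh
# nagios_state EXIT_CODE
# Maps a plugin exit status to the state name Nagios would assign.
nagios_state() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Example (not executed here):
#   python check_ncpa.py -H 192.168.1.50 -P 5693 -t mytoken -M cpu/percent -w 50 -c 80
#   echo "state: $(nagios_state $?)"
```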
4. Create the service discovery templates
Routine checks fall into two categories: basic metrics such as CPU, swap, load, and disk, and services such as nginx.
Create the file /etc/nagios/template/ncpa-service.cfg with the following content:
define service {
    host_name               HOST
    service_description     SERVICE
    check_command           check_ncpa!-M service/SERVICE
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contacts                nagiosadmin
}
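
The HOST and SERVICE words in this template are plain placeholders that get filled in with sed. A minimal sketch of that substitution follows. The helper names `sed_escape` and `render_template` are hypothetical, and the escaping step is an extra precaution for values containing `/` or `&`, which would otherwise break the sed expression.

```shell
#!/bin/sh
# sed_escape STRING
# Escapes the characters that are special in a sed replacement (/, & and \).
sed_escape() {
    printf '%s' "$1" | sed -e 's/[\/&]/\\&/g'
}

# render_template HOST SERVICE
# Reads a template on stdin and replaces the HOST and SERVICE placeholders.
render_template() {
    host=$(sed_escape "$1")
    service=$(sed_escape "$2")
    sed -e "s/HOST/$host/g" -e "s/SERVICE/$service/g"
}

# Example (not executed here):
#   render_template 192.168.1.50 nginx < /etc/nagios/template/ncpa-service.cfg
```

Because the replacement is purely textual, the template itself must not contain the words HOST or SERVICE anywhere except as placeholders.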
Create the file /etc/nagios/template/ncpa-basic.cfg with the following content:
# Monitor uptime to detect reboots
define service {
    host_name               HOST
    service_description     system uptime
    check_command           check_ncpa!-M system/uptime -w @60:120 -c @1:60
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contacts                nagiosadmin
}

# Monitor CPU usage
define service {
    host_name               HOST
    service_description     CPU Usage
    check_command           check_ncpa!-M cpu/percent -w 50 -c 80 -q 'aggregate=avg'
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contacts                nagiosadmin
}

# Monitor swap
define service {
    host_name               HOST
    service_description     swap Usage
    check_command           check_ncpa!-M memory/swap -w 512 -c 1024 -u mb
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contacts                nagiosadmin
}

# Monitor the total number of processes
define service {
    host_name               HOST
    service_description     Process Count
    check_command           check_ncpa!-M processes -w 500 -c 1000
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contacts                nagiosadmin
}

# Monitor disk space
define service {
    host_name               HOST
    service_description     Disk Usage
    check_command           check_ncpa!-M 'plugins/check_disk' -a "-w 20 -c 10 --local"
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contacts                nagiosadmin
}

# Monitor system load
define service {
    host_name               HOST
    service_description     Load average
    check_command           check_ncpa!-M 'plugins/check_load' -a "-w 8,4,4 -c 12,8,8"
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contacts                nagiosadmin
}

# Monitor zombie processes
# (the description must be unique per host, so it cannot reuse "Load average")
define service {
    host_name               HOST
    service_description     Zombie Processes
    check_command           check_ncpa!-M 'plugins/check_procs' -a "-w 3 -c 5 -s Z"
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contacts                nagiosadmin
}
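
Thresholds such as `@60:120` follow the range notation from the Nagios plugin guidelines: a bare `N` alerts outside 0..N, `A:B` alerts outside A..B, `A:` alerts below A, `~:B` alerts above B, and a leading `@` inverts the test so the alert fires inside the range. So `-w @60:120 -c @1:60` on uptime means critical during the first minute after boot and warning during the second. The sketch below is a simplified, integer-only illustration of those rules, assuming NCPA applies them the same way; `in_alert_range` is a name made up for this example.

```shell
#!/bin/sh
# in_alert_range VALUE RANGE
# Returns 0 (true) if VALUE should raise an alert for the given range.
# Integer values only; a simplified reading of the Nagios range notation.
in_alert_range() {
    value=$1
    range=$2
    inside=0
    # A leading @ means "alert when the value is INSIDE the range"
    case "$range" in
        @*) inside=1; range=${range#@} ;;
    esac
    # Split into start and end; a bare number N means 0:N
    case "$range" in
        *:*) start=${range%%:*}; end=${range#*:} ;;
        *)   start=0; end=$range ;;
    esac
    start=${start:-0}
    # "~" as start means minus infinity, an empty end means plus infinity
    within=1
    if [ "$start" != "~" ] && [ "$value" -lt "$start" ]; then within=0; fi
    if [ -n "$end" ] && [ "$value" -gt "$end" ]; then within=0; fi
    if [ "$inside" -eq 1 ]; then
        [ "$within" -eq 1 ]
    else
        [ "$within" -eq 0 ]
    fi
}

# Example: an uptime of 90 seconds matches the warning range @60:120
#   in_alert_range 90 @60:120 && echo "WARNING: recently rebooted"
```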
Create the auto-discovery script /etc/nagios/bin/find-ncpa.sh with the following content:
#!/usr/bin/env bash
# Scan for NCPA listeners with nmap and generate service checks for each new host.

if [ ! -f /usr/bin/nmap ]; then
    yum install nmap -y
fi

network=$1

usage() {
    echo -e "\e[1;31mUsage: $0 [ip|ip-range|network] \e[0m"
    echo -e "example1: \e[1;32m $0 192.168.0.100 \e[0m"
    echo -e "example2: \e[1;32m $0 192.168.0.1-200 \e[0m"
    echo -e "example3: \e[1;32m $0 192.168.2.0/24 \e[0m"
    echo
    exit 0
}

if [ -z "$network" ]; then
    usage
fi

dir="/etc/nagios/services"
ncpa_basic_template="/etc/nagios/template/ncpa-basic.cfg"
ncpa_service_template="/etc/nagios/template/ncpa-service.cfg"

# Find the hosts that have the NCPA port (5693) open
nmap -sS -p 5693 --open "$network" | awk '/Nmap scan report for/{print $5}' > /tmp/ncpa_hosts.txt

while read -r host; do
    if [ ! -f "$dir/$host.cfg" ]; then
        touch "$dir/$host.cfg"
        # Basic checks (CPU, swap, load, ...)
        sed "s/HOST/$host/g" "$ncpa_basic_template" >> "$dir/$host.cfg"
        # One check per running service, skipping templated (@) and systemd units.
        # check_ncpa.py was copied to /usr/bin in step 1; the list is written to
        # and read from the same file.
        /usr/bin/check_ncpa.py -H "$host" -t mytoken -M services -l | awk '/running/{print $1}' | tr -d '":' | egrep -v "@|systemd" > "/tmp/$host.servicelist.txt"
        while read -r service; do
            sed -e "s/HOST/$host/g" -e "s/SERVICE/$service/g" "$ncpa_service_template" >> "$dir/$host.cfg"
        done < "/tmp/$host.servicelist.txt"
        rm -f "/tmp/$host.servicelist.txt"
    fi
done < /tmp/ncpa_hosts.txt
rm -f /tmp/ncpa_hosts.txt

# Restart Nagios only if the new configuration passes verification
if nagios -v /etc/nagios/nagios.cfg | grep -q "Things look okay"; then
    echo "nagios configuration is OK"
    sleep 1
    service nagios restart
    echo "nagios restart successfully"
else
    echo "nagios restart failed. please check"
    exit 1
fi
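
One detail of the nmap parsing is worth a note: when nmap can resolve a reverse DNS name, the report line reads `Nmap scan report for NAME (IP)`, so the fifth field is the name rather than the address. The sketch below shows a variant that always yields the IP by taking the last field and stripping the parentheses. `extract_ncpa_hosts` is a hypothetical helper, and the sample output in the comment is illustrative only. Passing `-n` to nmap, which disables DNS resolution, is another way to get the same result.

```shell
#!/bin/sh
# extract_ncpa_hosts
# Reads nmap output on stdin and prints one IP per discovered host.
# Handles both report formats:
#   Nmap scan report for 192.168.1.50
#   Nmap scan report for web01.example.com (192.168.1.51)
extract_ncpa_hosts() {
    awk '/^Nmap scan report for/ {
        ip = $NF              # last field is the IP, possibly in parentheses
        gsub(/[()]/, "", ip)  # strip the parentheses if present
        print ip
    }'
}

# Example (needs root for -sS, not executed here):
#   nmap -sS -p 5693 --open 192.168.1.0/24 | extract_ncpa_hosts
```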
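Business monitoring
Auto-discovery takes away much of the workload, but checks that are specific to the business still have to be added by hand.
For example, to detect whether nginx has been restarted (critical while its elapsed running time is under 1800 seconds, warning while it is under 3600 seconds):
# Monitor how long the process has been running
define service {
    host_name               HOST
    service_description     nginx uptime
    check_command           check_ncpa!-M plugins/check_procs -a "-a nginx -m ELAPSED -w @1800:3600 -c @1:1800"
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contacts                nagiosadmin
}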
With a dynamic process model such as php-fpm, a master process is started as root, while the child processes are owned by an unprivileged user and vary in number. It is therefore enough to monitor the running time of the master process, following the same pattern:
# Monitor php-fpm
define service {
    host_name               HOST
    service_description     php-fpm uptime
    check_command           check_ncpa!-M plugins/check_procs -a "-u root -a php-fpm -m ELAPSED -w @1800:3600 -c @1:1800"
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contacts                nagiosadmin
}