Nagios Event Handler Mechanism

zz handed me a task: automatically handle one particular Nagios alert.

I vaguely remembered having experimented with this feature offline before.

I. First, check the official Nagios documentation to confirm this is feasible. Below are the parts I found useful.

1)
How event handler commands are defined
Writing Event Handler Commands
Event handler commands will likely be shell or perl scripts, but they can be any type of executable that can run from a
command prompt. At a minimum, the scripts should take the following macros as arguments:
For Services: $SERVICESTATE$, $SERVICESTATETYPE$, $SERVICEATTEMPT$
For Hosts: $HOSTSTATE$, $HOSTSTATETYPE$, $HOSTATTEMPT$
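This passage lists the macros a handler should receive: one set for hosts, one set for services. The commands themselves are defined in objects/commands.cfg.
From the official documentation: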
define command{
command_name restart-httpd
command_line /usr/local/nagios/libexec/eventhandlers/restart-httpd $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
}

2)
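How the host/service configuration file refers to the handler (in my production setup the host and service definitions live in one merged file; normally they are kept in separate files)
From the official documentation: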
define service{
host_name somehost
service_description HTTP
max_check_attempts 4
event_handler restart-httpd
...
}


II. Some explanations:

1)
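Macro meanings:
$SERVICESTATE$: the current state of the service (OK, WARNING, UNKNOWN, CRITICAL)
$SERVICESTATETYPE$: the state type of the service, either SOFT or HARD
$SERVICEATTEMPT$: the number of the current check attempt, which is what counts the retries while the service is in a SOFT state
These are the three arguments the self-recovery script has to handle.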
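Since the handler depends entirely on these three arguments, it can help to check them before acting. The following is only a sketch: the function name validate_handler_args is my own invention, not something Nagios provides.

```shell
#!/bin/sh
# Hypothetical helper: sanity-check the three macros passed to a service
# event handler. Returns 0 when they look valid, 1 otherwise.
validate_handler_args() {
    state="$1"; state_type="$2"; attempt="$3"

    case "$state" in
        OK|WARNING|UNKNOWN|CRITICAL) ;;
        *) echo "unexpected state: $state" >&2; return 1 ;;
    esac

    case "$state_type" in
        SOFT|HARD) ;;
        *) echo "unexpected state type: $state_type" >&2; return 1 ;;
    esac

    # The attempt must be a non-empty string of digits
    case "$attempt" in
        ''|*[!0-9]*) echo "attempt is not a number: $attempt" >&2; return 1 ;;
    esac

    return 0
}
```

A handler could call validate_handler_args "$@" || exit 1 as its first line, so that a wrong macro order in commands.cfg fails loudly instead of silently doing nothing.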

2)
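HARD: hard state
SOFT: soft state
When a check of a previously healthy service fails for the first time, Nagios puts the service in a SOFT state and retries the check. Once the check has failed max_check_attempts times, the state becomes HARD.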
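To make the SOFT/HARD transition concrete, here is a small sketch that mimics the rule for a service that was OK and keeps failing. It is an illustration only, not Nagios code, and the function name state_type_for_attempt is made up.

```shell
#!/bin/sh
# Illustration only: derive the state type from the attempt number for a
# service that keeps failing.
# Usage: state_type_for_attempt ATTEMPT MAX_CHECK_ATTEMPTS
state_type_for_attempt() {
    if [ "$1" -lt "$2" ]; then
        echo SOFT
    else
        echo HARD
    fi
}

# With max_check_attempts 4 (the value in the service definition above),
# attempts 1 to 3 are SOFT and attempt 4 is HARD.
for attempt in 1 2 3 4; do
    echo "attempt $attempt: $(state_type_for_attempt "$attempt" 4)"
done
```

This is why the official example restarts the web server on the 3rd soft attempt: it is the last chance to fix things before the 4th check turns the state HARD and contacts are notified.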

3)
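Configuration options related to event handling
event_handler_timeout=30  maximum time, in seconds, an event handler is allowed to run
enable_event_handlers=1   turns event handling on
event_handler             set in a host or service definition; names the command to run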


III. Steps:
1)
Confirm that event handling is switched on in nagios.cfg
enable_event_handlers=1
0: disabled
1: enabled
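A quick way to check the switch without opening the file is to grep for it. This is a minimal sketch: the function name event_handlers_enabled is hypothetical, and the config path is the one used by the verification command later in this post.

```shell
#!/bin/sh
# Usage: event_handlers_enabled CONFIG_FILE
# Succeeds when the file contains an uncommented enable_event_handlers=1 line.
event_handlers_enabled() {
    grep -Eq '^[[:space:]]*enable_event_handlers[[:space:]]*=[[:space:]]*1[[:space:]]*$' "$1"
}

cfg=/usr/local/nagios/etc/nagios.cfg
if [ -r "$cfg" ]; then
    if event_handlers_enabled "$cfg"; then
        echo "event handlers are enabled"
    else
        echo "event handlers are disabled or not set"
    fi
fi
```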

2)
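Write the self-recovery script
There are several arguments to handle, so I recommend a case statement. Below is the example script from the official site; it can of course be written some other way.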
#!/bin/sh
#
# Event handler script for restarting the web server on the local machine
#
# Note: This script will only restart the web server if the service is
# retried 3 times (in a "soft" state) or if the web service somehow
# manages to fall into a "hard" error state.
#
# What state is the HTTP service in?
case "$1" in
OK)
# The service just came back up, so don't do anything...
;;
WARNING)
# We don't really care about warning states, since the service is probably still running...
;;
UNKNOWN)
# We don't know what might be causing an unknown error, so don't do anything...
;;
CRITICAL)
# Aha! The HTTP service appears to have a problem - perhaps we should restart the server...
# Is this a "soft" or a "hard" state?
case "$2" in
# We're in a "soft" state, meaning that Nagios is in the middle of retrying the
# check before it turns into a "hard" state and contacts get notified...
SOFT)
# What check attempt are we on? We don't want to restart the web server on the first
# check, because it may just be a fluke!
case "$3" in
# Wait until the check has been tried 3 times before restarting the web server.
# If the check fails on the 4th time (after we restart the web server), the state
# type will turn to "hard" and contacts will be notified of the problem.
# Hopefully this will restart the web server successfully, so the 4th check will
# result in a "soft" recovery. If that happens no one gets notified because we
# fixed the problem!
3)
echo -n "Restarting HTTP service (3rd soft critical state)..."
# Call the init script to restart the HTTPD server
/etc/rc.d/init.d/httpd restart
;;
esac
;;
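# If by some chance the service is in a "hard" state (the 4th check failed),
# restart the web server anyway...
HARD)
echo -n "Restarting HTTP service..."
# Call the init script to restart the HTTPD server
/etc/rc.d/init.d/httpd restart
;;
esac
;;
esac
exit 0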

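3) Once an alert fires and reaches the Nagios server, how does the server act on it and repair the service itself? (objects/commands.cfg)
The command definition in objects/commands.cfg specifies how the script on the remote host gets executed when the alert is raised,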
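One possible shape for the server-side script is sketched below. Everything specific in it is an assumption rather than the setup described here: it assumes NRPE is the transport, that the remote nrpe.cfg defines a command called restart_httpd (a made-up name), and that the command_line passes $HOSTADDRESS$ as a fourth argument after $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$.

```shell
#!/bin/sh
# Sketch of a handler that runs on the Nagios server and triggers a restart
# on the monitored host. Assumed, not from the original setup: NRPE is in
# use and the remote side defines a command named restart_httpd.

NRPE="${NRPE:-/usr/local/nagios/libexec/check_nrpe}"

# Usage: should_restart STATE STATETYPE ATTEMPT
# Same decision as the official example: act on the 3rd SOFT CRITICAL
# attempt, or on any HARD CRITICAL state.
should_restart() {
    [ "$1" = "CRITICAL" ] || return 1
    case "$2" in
        SOFT) [ "$3" = "3" ] ;;
        HARD) return 0 ;;
        *)    return 1 ;;
    esac
}

# Usage: remote_restart STATE STATETYPE ATTEMPT HOSTADDRESS
remote_restart() {
    if should_restart "$1" "$2" "$3"; then
        "$NRPE" -H "$4" -c restart_httpd
    fi
}

# Only act when all four arguments were supplied
if [ "$#" -eq 4 ]; then
    remote_restart "$@"
fi
```

Keeping the decision in should_restart, separate from the remote call, makes it possible to try the logic by hand with sample arguments before pointing it at a real host.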


4)
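Enable the event handler in the host/service configuration file
event_handler  shell_name
(shell_name stands for the command_name defined in objects/commands.cfg)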


/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
The command above verifies the modified configuration files and reports any errors.
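The check can also gate whatever comes next, so a broken configuration never gets loaded. The sketch below assumes that nagios -v exits non-zero when it finds errors, which is how the stock init script uses it; the function names and the reload command in the usage comment are my own assumptions.

```shell
#!/bin/sh
# Paths as used in this post; override through the environment if needed.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"

# Succeeds when the pre-flight check finds no errors.
verify_config() {
    "$NAGIOS_BIN" -v "$NAGIOS_CFG" >/dev/null 2>&1
}

# Usage: reload_if_valid COMMAND [ARGS...]
# Runs the given command only if the configuration verifies cleanly.
reload_if_valid() {
    if verify_config; then
        "$@"
    else
        echo "configuration has errors, not reloading" >&2
        return 1
    fi
}

# Example (the init script path is an assumption about the local install):
#   reload_if_valid /etc/rc.d/init.d/nagios reload
```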

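I killed a service as a quick test, and it came back on its own: OK.

zz's trust was not misplaced; the problem is solved.

Done, time to call it a day.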
