A small problem encountered with Nagios: how to use timeperiod


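1. Introduction: how the problem arose

The problem arose as follows. My application server (App server) runs a business system on WebLogic middleware. For security reasons, the system has to be shut down after working hours and started again when work begins the next day. This is not hard to implement: write two shell scripts, one that starts the WebLogic service and one that stops it (that is, kills the corresponding java process), add both to crontab, and schedule them to run at the right times.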
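As a rough sketch of the two scripts and their crontab entries, something like the following could work. The function names, the `WLS_DOMAIN_HOME` path, the `weblogic.Server` process pattern, the script location, and the 08:00/17:00 schedule are all assumptions for illustration; they are not taken from the original setup.

```shell
#!/bin/sh
# Hypothetical start/stop helper for a WebLogic domain.
# WLS_DOMAIN_HOME and the "weblogic.Server" pattern are assumptions; adjust them to your install.
WLS_DOMAIN_HOME="${WLS_DOMAIN_HOME:-/path/to/domain}"

# Read `ps -ef` output on stdin and print the PID (second column)
# of every line that mentions the WebLogic main class.
filter_wls_pids() {
    awk '/weblogic\.Server/ { print $2 }'
}

# Start the server in the background, detached from the cron session.
wls_start() {
    nohup "$WLS_DOMAIN_HOME/startWebLogic.sh" >/dev/null 2>&1 &
}

# Stop the server by killing the matching java processes, if there are any.
wls_stop() {
    pids=$(ps -ef | filter_wls_pids)
    if [ -n "$pids" ]; then
        kill $pids
    fi
}

# Dispatch on the first argument; do nothing when called without one.
case "${1:-}" in
    start) wls_start ;;
    stop)  wls_stop ;;
esac

# Possible crontab entries (Monday to Friday), assuming the script is saved as /path/to/wls_ctl.sh:
#   0 8  * * 1-5 /path/to/wls_ctl.sh start
#   0 17 * * 1-5 /path/to/wls_ctl.sh stop
```

Keeping start and stop in one script means the process pattern is defined in a single place, so the crontab lines differ only in their argument.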

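Scheduled start and stop now worked, but this introduced a new problem. I use Nagios to monitor the overall state of the system: not only the hosts, but also the database, the middleware, and so on. Nagios watches the whole system around the clock (very dutifully). After hours the WebLogic service has been stopped, which is the normal state, yet Nagios still checks WebLogic. Predictably, the check gets no information back and the service goes CRITICAL. Nagios then reports the CRITICAL state to me (I have configured email notification), and my mailbox fills up with junk mail that carries no useful information.

So how do we handle monitoring that is tied to a time window? A careful read of the official Nagios documentation turns up an object definition called timeperiod, which controls time ranges. Below I briefly describe how I handled it.

For how to monitor WebLogic with Nagios, see my other post, "Monitoring the WebLogic Service with Nagios" (通過Nagios監控Weblogic服務): http://skymax.blog.51cto.com/365901/101603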

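2. Solving the problem

2.1. Configuration

There are many configuration files and they are long, so I list here only the configuration relevant to this problem.

· Service check configuration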

# The check_wls_server_adminserver check on the remote host.
define service{
    use                     generic-service
    host_name               HPUX_XX.XX.XX.XX
    service_description     Weblogic Server adminserver
    check_command           check_nrpe!check_wls_server_adminserver
}

· Definition of generic-service

###############################################################################
# SERVICE TEMPLATES
###############################################################################
# Generic service definition template - This is NOT a real service, just a template!
define service{
    name                            generic-service ; The 'name' of this service template
    active_checks_enabled           1               ; Active service checks are enabled
    passive_checks_enabled          1               ; Passive service checks are enabled/accepted
    parallelize_check               1               ; Active service checks should be parallelized (disabling this can lead to major performance problems)
    obsess_over_service             1               ; We should obsess over this service (if necessary)
    check_freshness                 0               ; Default is to NOT check service 'freshness'
    notifications_enabled           1               ; Service notifications are enabled
    event_handler_enabled           1               ; Service event handler is enabled
    flap_detection_enabled          1               ; Flap detection is enabled
    failure_prediction_enabled      1               ; Failure prediction is enabled
    process_perf_data               1               ; Process performance data
    retain_status_information       1               ; Retain status information across program restarts
    retain_nonstatus_information    1               ; Retain non-status information across program restarts
    is_volatile                     0               ; The service is not volatile
    check_period                    24x7            ; The service can be checked at any time of the day
    max_check_attempts              3               ; Re-check the service up to 3 times in order to determine its final (hard) state
    normal_check_interval           10              ; Check the service every 10 minutes under normal conditions
    retry_check_interval            2               ; Re-check the service every two minutes until a hard state can be determined
    contact_groups                  admins          ; Notifications get sent out to everyone in the 'admins' group
    notification_options            w,u,c,r         ; Send notifications about warning, unknown, critical, and recovery events
    notification_interval           60              ; Re-notify about service problems every hour
    notification_period             24x7            ; Notifications can be sent out at any time
}

· Definition of 24x7

# This defines a timeperiod where all times are valid for checks,
# notifications, etc. The classic "24x7" support nightmare. :-)
define timeperiod{
    timeperiod_name 24x7
    alias           24 Hours A Day, 7 Days A Week
    sunday          00:00-24:00
    monday          00:00-24:00
    tuesday         00:00-24:00
    wednesday       00:00-24:00
    thursday        00:00-24:00
    friday          00:00-24:00
    saturday        00:00-24:00
}

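As the configuration above shows, my service definition uses the generic-service template, and both check_period and notification_period in that template use the timeperiod 24x7. That timeperiod explicitly covers Monday through Sunday, 24 hours a day.

This is the crux of the problem: the timeperiod definition. If we change the service's check window (check_period) to the working hours we actually want (08:00 to 17:00, Monday to Friday), the problem is solved.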
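To make the intended window concrete, here is a small shell sketch of the rule "Monday to Friday, 08:00 to 17:00". It is only a model for reasoning about which check times fall inside the window. Nagios evaluates timeperiods internally, and the function name `in_work_time`, its argument format, and the choice to treat 17:00 itself as outside the window are assumptions of this sketch.

```shell
#!/bin/sh
# in_work_time DOW HHMM
#   DOW  : day of week, 1 = Monday ... 7 = Sunday (the format printed by `date +%u`)
#   HHMM : 24-hour time without the colon, e.g. 0830
# Succeeds (exit status 0) when the moment lies in Monday-Friday,
# from 08:00 up to but not including 17:00; fails otherwise.
in_work_time() {
    [ "$1" -ge 1 ] && [ "$1" -le 5 ] && [ "$2" -ge 800 ] && [ "$2" -lt 1700 ]
}
```

For example, `in_work_time "$(date +%u)" "$(date +%H%M)"` tests the current moment, which can be handy when checking by hand whether a notification arrived inside or outside the window.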

2.2. Modifying the configuration files

· Define a new timeperiod.

# Some P.R.C. statutory holidays
define timeperiod{
    name            cn-holidays
    timeperiod_name cn-holidays
    alias           CN Holidays
    january 1       00:00-00:00     ; 1.1
    may 1           00:00-00:00     ; 5.1
    october 1       00:00-00:00     ; 10.1
}

# Work time
# Week: Monday to Friday
# Time: 8:00 to 17:00
define timeperiod{
    timeperiod_name cn_work_time_8x5
    alias           CN Work Time 8x5
    use             cn-holidays     ; Use the cn-holidays template
    sunday          00:00-00:00
    monday          08:00-17:00
    tuesday         08:00-17:00
    wednesday       08:00-17:00
    thursday        08:00-17:00
    friday          08:00-17:00
    saturday        00:00-00:00
}

· Create a new service template that uses the timeperiod just defined.

# 8x5 service definition template - This is NOT a real service, just a template!
define service{
    name                    generic-service-8x5 ; The name of this service template
    use                     generic-service     ; Inherit default values from the generic-service definition
    check_period            cn_work_time_8x5
    notification_period     cn_work_time_8x5
}

· Change the actual service definition to use the new template.

# The check_wls_server_adminserver check on the remote host.
define service{
    use                     generic-service-8x5
    host_name               HPUX_XX.XX.XX.XX
    service_description     Weblogic Server adminserver
    check_command           check_nrpe!check_wls_server_adminserver
}

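The configuration changes are done; the next step is to verify them.

· First, check that the configuration files are written correctly.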

bash-3.00$ ./nagios -v ../etc/nagios.cfg
Nagios 3.0.3
Last Modified: 06-25-2008
License: GPL
Reading configuration data...
Running pre-flight check on configuration data...
Checking services...
    Checked 111 services.
Checking hosts...
    Checked 7 hosts.
Checking host groups...
    Checked 1 host groups.
Checking service groups...
    Checked 1 service groups.
Checking contacts...
    Checked 2 contacts.
Checking contact groups...
    Checked 1 contact groups.
Checking service escalations...
    Checked 0 service escalations.
Checking service dependencies...
    Checked 0 service dependencies.
Checking host escalations...
    Checked 0 host escalations.
Checking host dependencies...
    Checked 0 host dependencies.
Checking commands...
    Checked 25 commands.
Checking time periods...
    Checked 7 time periods.
Checking for circular paths between hosts...
Checking for circular host and service dependencies...
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...
Total Warnings: 0
Total Errors: 0
Things look okay - No serious problems were detected during the pre-flight check

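Good, the configuration is fine. Next, restart the Nagios service. My operating system is Solaris 10, and I have set Nagios up as an SMF-managed service, which makes restarting easy.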

bash-3.00# svcadm restart nagios
bash-3.00# svcs nagios
STATE          STIME    FMRI
online          9:48:11 svc:/site/nagios:default
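The two manual steps above (run the pre-flight check, then restart) can be chained so that a broken configuration never reaches the running daemon. The sketch below assumes that `nagios -v` exits with a non-zero status when the check fails. The function name `restart_if_valid` and its argument order are hypothetical.

```shell
#!/bin/sh
# restart_if_valid NAGIOS_BIN NAGIOS_CFG RESTART_COMMAND [ARGS...]
# Runs the pre-flight check and executes the restart command only when the check succeeds.
# Assumption: the Nagios binary returns a non-zero exit status on a failed check.
restart_if_valid() {
    nagios_bin=$1
    nagios_cfg=$2
    shift 2    # whatever remains in "$@" is the restart command
    if "$nagios_bin" -v "$nagios_cfg" >/dev/null 2>&1; then
        "$@"
    else
        echo "pre-flight check failed; Nagios was not restarted" >&2
        return 1
    fi
}
```

With the paths and the SMF service name used in this post, the call would be `restart_if_valid ./nagios ../etc/nagios.cfg svcadm restart nagios`. Passing the restart command as arguments keeps the function usable on systems that do not use SMF.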

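2.3. Verification

Watch how monitoring actually behaves, mainly whether alerts are still sent outside working hours. My mailbox no longer receives that useless junk mail, so the problem is solved.

3. Conclusion

The above is one concrete problem I ran into while using Nagios for monitoring, together with how I solved it. Monitored environments are complex and changeable, so using Nagios brings up all kinds of special problems and special requirements. Fortunately, the overall design of Nagios is quite powerful, and most of these problems can be solved. And of course, if you have the time, read the official Nagios documentation carefully; you will get a lot out of it.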
