【Distributed Centreon troubleshooting】Centreon does not draw graphs

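I spent the whole day on this today: no matter what I did, no graphs showed up. In the end I finally solved it.
I went through a lot of material. Normally you do not have to worry about graphing in Centreon at all,
but once the graphs stop appearing it is really hard to sort out. The troubleshooting steps below should solve most cases; my own problem came from my rather unusual architecture.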
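【Symptom】
After the monitoring is added in Centreon, perfdata is output normally, but the graphs show no data at all.
【Cause】
It was down to my architecture: one poller node in the distributed setup used to act as the central server before that role was migrated elsewhere, but centstorage, centcore and mysql were all still running on that node. The problem hosts I had just added happened to sit on exactly this poller.
【Solution】
Stop centcore, centstorage, mysql and related services on that poller node. In other words, only one machine in a distributed setup may run centcore and centstorage.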
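To see whether a poller still runs the central daemons, a small filter over the process list is enough. This is only a sketch: the function name is made up for illustration, and it assumes the daemons appear with "centcore" or "centstorage" somewhere in their command line.

```shell
#!/bin/sh
# Sketch: keep only the lines of a process listing that mention the
# Centreon central daemons. Reads `ps` output on stdin so the filter
# can also be tried on sample text.
# Assumption: the daemons show up as "centcore" / "centstorage" in the
# command line column.
find_central_daemons() {
    grep -E 'centcore|centstorage' | grep -v grep
}

# On a poller this should print nothing:
#   ps ax | find_central_daemons
```

Any line it prints on a poller points at a daemon that should be stopped there.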


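【Further notes】
The cause above is a bit of an oddity. The proper way to troubleshoot (graph generation is mainly centstorage's job) is:
Look first at centstorage.log and check it for errors
Also confirm that the monitored services actually produce perfdata
Then check the related centstorage configuration
After that, search the wiki or the forum for the relevant keywords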

Tips:
When many servers are being monitored, centstorage sometimes seems to hang (?); you can set up a crontab entry to restart it once an hour

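[root@localhost ~]# cat /home/admin/restart_centstorage.sh
#!/bin/sh
# Stop centstorage through its init script
/etc/init.d/centstorage stop
# Give it a few seconds to exit
sleep 8
# Terminate any centstorage process that is still hanging around
killall -TERM centstorage
/etc/init.d/centstorage start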
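A crontab entry for the hourly restart could look like the following. It assumes the script is executable and is installed in root's crontab; running at minute 0 and discarding the output are arbitrary choices for illustration.

```text
# m h dom mon dow  command
0 * * * * /home/admin/restart_centstorage.sh >/dev/null 2>&1
```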


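Here is a detailed troubleshooting session. I went through the English conversation step by step, checking every point, and the cause still turned out to be the one above. Argh.
This wiki page is pretty good: one person asks the questions and the other helps to track the problem down and fix it.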

Troubleshooting:Graphs

.. I know, currently this is only a log from an IRC session, but I (or any volunteer) may turn this into a good, structured HowTo :)
(14:22:09) grandmoun: How can i create graphs in centreon??
(14:28:27) nfilus: graphs are autogenerated, if the service you defined returns performance data
(14:29:01) zelia5: how do i generate perfdata ? ^^
(14:31:09) nfilus: # /usr/local/nagios/libexec/check_centreon_ping -H www.google.de
(14:31:15) nfilus: GPING OK - rtt min/avg/max/mdev = 23.269/23.269/23.269/0.000 ms|time=23.269ms;20;40;; ok=1
(14:31:25) nfilus: |time .... is the perfdata
Compare this with the Service Details in Monitoring -> Services -> Details -> [your_service], as shown in the image Perfdata.png.
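As a quick check on the command line, the perfdata part of a plugin's output is simply everything after the first "|". A minimal sketch, with a function name made up for illustration:

```shell
#!/bin/sh
# Sketch: print the perfdata part of a plugin output line, i.e. the
# text after the first "|". Prints nothing if the line has no "|".
extract_perfdata() {
    case $1 in
        *'|'*) printf '%s\n' "${1#*|}" ;;
    esac
}

extract_perfdata 'GPING OK - rtt min/avg/max/mdev = 23.269/23.269/23.269/0.000 ms|time=23.269ms;20;40;; ok=1'
# prints: time=23.269ms;20;40;; ok=1
```

If this prints nothing for your plugin's output, the plugin returns no performance data and no graph can be generated for that service.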
(09:43:30) nfilus: let's try to analyze step by step
(09:43:35) dharrison: ok cool
(09:44:25) nfilus: your service is running and the last check timestamp is quite recent in centreon?
(09:44:58) nfilus: look for "last check" at the main page in centreon or in monitoring
(09:45:58) dharrison: everything seems to be running ok
(09:46:40) nfilus: go to administration -> options -> centstorage -> options
(09:47:38) nfilus: no empty fields?
(09:47:49) dharrison: nope
(09:48:06) nfilus: what storage type :)
(09:48:17) nfilus: rrd & mysql?
(09:48:24) dharrison: yup
(09:48:54) nfilus: check on filesystem if service-perfdata file exists and is 644 user:nagios group:nagios
(09:50:20) dharrison: looks like its 777 nagios & www-data
(09:50:43) nfilus: that's too much, but shouldn't be the problem
(09:50:53) nfilus: ok
(09:51:07) nfilus: goto centstorage -> manage in left menu
(09:51:35) nfilus: and choose the service you are interested in
(09:51:43) dharrison: theres nothing there
(09:51:48) dharrison: its empty
(09:52:35) nfilus: that's a symptom, lets look for the cause ...
(09:52:47) nfilus: na values - no graphs, sorry! :)
(09:52:59) dharrison: lol that would make sense  :-)
(09:53:29) nfilus: go to monitoring to your service details
(09:53:41) dharrison: any service?
(09:53:54) nfilus: the one you are interested in mostly
(09:54:31) dharrison: ok i have picked a host, and we will go for CPU Usage
(09:55:08) nfilus: ok
(09:55:31) nfilus: in status details: you have a status and performance data?
(09:56:16) dharrison: yes
(09:56:35) nfilus: please paste the perfdata here
(09:56:54) dharrison: '5 min avg Load'=1%;85;90;0;100
(09:57:35) nfilus: looks ok
(09:57:55) nfilus: so, perfdata is generated, but not processed
(09:58:56) nfilus: go to config -> command -> misc 
(09:59:31) nfilus: you should have sth like a process-service-perfdata command
(09:59:47) nfilus: (i think my definition is not standard)
(09:59:55) dharrison: yup i have that
(10:00:13) nfilus: open it and paste the command line
(10:01:13) dharrison: $USER1$/process-service-perfdata  "$LASTSERVICECHECK$" "$HOSTNAME$" "$SERVICEDESC$" "$LASTSERVICESTATE$" "$SERVICESTATE$" "$SERVICEPERFDATA$"
(10:02:04) nfilus: looks ok
(10:04:20) nfilus: config -> nagios -> nagios.cfg -> data
(10:05:10) dharrison: ok
(10:05:16) nfilus: perfdata option is yes
(10:05:26) nfilus: service command is process-service-perfdata
(10:05:37) nfilus: service data file is /usr/local/nagios/var/service-perfdata
(10:06:13) nfilus: ok?
(10:06:16) dharrison: its /var/log/nagios3/service-perfdata
(10:06:28) dharrison: and perfdata option is yes
(10:07:22) nfilus: is this the same path as defined in administration -> options -> centstorage -> options?
(10:08:14) dharrison: yes, just checked
(10:08:38) nfilus: so, this is the file you checked before for access, right?
(10:09:10) dharrison: yup
(10:09:36) dharrison: but its not the same file that $USER1$ points to. is that correct?
(10:10:24) nfilus: you mean  $USER1$/process-service-perfdata?
(10:11:08) dharrison: yup
(10:11:40) nfilus: no, this was the command that gets the perfdata from service checks and writes them into /var/log/nagios3/service-perfdata
(10:11:47) dharrison: oh ok
(10:12:30) nfilus: please do
(10:12:36) nfilus: tail -f /var/log/nagios3/service-perfdata
(10:13:04) nfilus: and watch for changes for 1-2 minutes
(10:13:27) dharrison: ok running now
(10:13:33) nfilus: is there any data coming in?
(10:13:37) dharrison: yes
(10:14:37) nfilus: ok, 
(10:14:38) nfilus:  ps ax | grep cent
(10:14:43) nfilus: centstorage is running?
(10:16:24) dharrison: seems to be
(10:17:07) nfilus: ok, do
(10:17:18) nfilus: tail -f /usr/local/centreon/log/centstorage.log
(10:17:28) nfilus: any errors or warnings?
(10:18:18) dharrison: no such log file
(10:20:11) nfilus: path centreon is in usr local, yes?
(10:20:59) dharrison: yes
(10:24:57) nfilus: grep LOG /usr/local/centreon/bin/centstorage
(10:25:07) nfilus: what's the log path?
(10:26:07) dharrison: "/usr/local/centreon/log/centstorage.log";
(10:26:46) nfilus: ls -lad  /usr/local/centreon/log
(10:26:57) nfilus: drwxrwxr-x 2 www-data nagios ?
(10:27:42) dharrison: yup   lol
(10:28:50) nfilus: that's not normal, that no log file is there if centstorage is running!
(10:29:11) nfilus: is there a logAnalyser.log?
(10:29:21) dharrison: yes
(10:34:56) nfilus: can you restart centstorage
(10:35:05) dharrison: yeah 2secs
(10:36:15) dharrison: it did bring this up when i stopped it No lock file found in /var/run/centreon/centstorage.pid
(10:36:49) dharrison: ive stopped it but says its still running????
(10:37:04) dharrison: whats the process name for centstorage?
(10:37:40) nfilus: something like /usr/bin/perl -w /usr/local/centreon/bin/centstorage
(10:38:59) dharrison: hey hey  can't write /usr/local/centreon/log/centstorage.log: Permission denied
(10:39:17) dharrison: when i typed that command above
(10:40:23) nfilus: you are root?
(10:41:10) nfilus: there is no centstorage.log until now and  /usr/local/centreon/log is writeable, yes?
(10:41:30) dharrison: i have now ran that as sudo and came back ok
(10:42:55) dharrison: i ran  /usr/bin/perl -w /usr/local/centreon/bin/centstorage   as sudo which i should have done tbh. sorry
(10:43:03) dharrison: and there is now a centstorage.log
(10:43:39) nfilus: watch it for progress and errors
(10:43:41) nfilus: tail -f 
(10:44:16) dharrison: just two lines at the mo.
(10:44:26) dharrison: 1 stating that its starting
(10:44:32) dharrison: 2 with the PID Number
(10:44:44) nfilus: woow, that's progress :)
(10:44:52) dharrison: lol certainly is
(10:45:26) dharrison: nothing else is coming through
(10:46:13) nfilus: it should stay silent if no errors occur
(10:46:27) nfilus: like in my case:
(10:46:29) nfilus: 22/10/2009 10:47:01 - ERROR while updating /var/lib/centreon/status/186.rrd at 1256201216 -> 100 : illegal attempt to update using time 1256201216 when last update time is 1529719541 (minimum one second step)
(10:47:31) dharrison: lol
(10:47:42) dharrison: nope still silent......but no graphs still
(10:48:46) nfilus: wait 5 minutes and then go back to admin -> options -> centstorage -> manage
(10:48:58) nfilus: there should be some data now
(10:49:44) dharrison: ok currently still empty. but you reckon to wait a few more minutes?
(10:50:51) nfilus: yes, the perfdata needs to be filled in
(10:51:30) dharrison: ok
(10:56:10) nfilus: so, .... is there any data?
(10:56:32) dharrison: WHOA DUDE!
(10:56:33) nfilus: ... or any errors
(10:56:37) dharrison: data
(10:56:39) dharrison: lots
Graph
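Three of the checks from the session above can be sketched as small shell functions. The function names are made up for illustration, and the paths in the usage comments are the ones from this log, so they may differ on your system. Note that the directory check tests the current user, so run it as the user centstorage runs as (as root it always succeeds).

```shell
#!/bin/sh
# Sketch of three checks from the IRC session. All paths are passed in
# as arguments; nothing is hard-coded.

# 1. Does the perfdata file exist and contain data?
perfdata_file_ok() {
    [ -s "$1" ]
}

# 2. Is a centstorage process running? Reads `ps ax` output on stdin
#    and ignores the line of a grep that is itself searching for it.
centstorage_running() {
    awk '/centstorage/ && !/grep/ { found = 1 } END { exit !found }'
}

# 3. Can the current user write to the log directory? In the session,
#    starting centstorage without sufficient rights gave
#    "Permission denied" on centstorage.log.
log_dir_writable() {
    [ -d "$1" ] && [ -w "$1" ]
}

# Usage:
#   perfdata_file_ok /var/log/nagios3/service-perfdata || echo "no perfdata"
#   ps ax | centstorage_running || echo "centstorage not running"
#   log_dir_writable /usr/local/centreon/log || echo "log dir not writable"
```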

centstorage.log errors

uninitialized value ...

Use of uninitialized value in multiplication (*) at /usr/local/centreon/bin/centstorage line 506
(14:54:33) nfilus: the problem is : $interval = getServiceCheckIntervalWithSVCid($index) * getIntervalLenght($con_oreon);
(14:55:27) nfilus: either the global interval (Configuration -> Nagios -> nagios.cfg -> Tuning : Timing Interval) 
           is not defined in config, or there is no check interval for some services
(14:56:56) iLLiZT: Hmm, there might not be a check interval defined for a couple of services, but shouldn't they use some kind of default then?
(14:58:40) nfilus: no
(14:58:59) iLLiZT: Ok, so I have to define the normal check interval and retry check interval for all services?
(14:59:32) nfilus: either for every service or in the used templates

timestamp error while updating - case A

31/1/2010 13:31:30 - ERROR while updating /var/lib/centreon/metrics/561.rrd at 1264941084 -> 31 : illegal attempt to update using time 1264941084 when last update time is 1264941084 (minimum one second step)
In this case, where both timestamps are the same (1264941084), the cause was the check_smart service and a very old smartctl producing malformed perfdata by repeating a metric twice (... temp=55234323 temp=34 ...). To find out which service a metric id (here: 561) belongs to, query mysql:
mysql> select host_name, service_description from metrics, index_data where index_id = id and metric_id = 561;
Afterwards execute the check_command for service_description on host_name on the command line, to see the unparsed performance data output.
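Once you have the raw plugin output, a repeated metric label can be spotted mechanically. The following is only a sketch: the function name is made up, and it assumes labels contain no spaces, so quoted labels such as '5 min avg Load' would need a real parser.

```shell
#!/bin/sh
# Sketch: print every metric label that occurs more than once in a
# perfdata string.
# Assumption: labels contain no spaces, so splitting on blanks yields
# one "label=value..." token per metric.
duplicate_labels() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/=.*//p' | sort | uniq -d
}

duplicate_labels 'temp=55234323 temp=34 raw_read_error=0'
# prints: temp
```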

timestamp error while updating - case B

31/1/2010 13:31:30 - ERROR while updating /var/lib/centreon/metrics/561.rrd at 1264941084 -> 31 : illegal attempt to update using time 1264941084 when last update time is 1564941084 (minimum one second step)
In this second case the last timestamp in the error message is much greater than the first one, that is, the RRD believes its last update happened far in the future (here 2019, against a log entry from 2010). Please check the system clock on your monitoring server. The system time may be jumping or being re-adjusted by NTP, /etc/adjtime or vmware-tools.
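The two cases can be told apart from the two timestamps in the error message alone. A minimal sketch, with a made-up function name and message wording:

```shell
#!/bin/sh
# Sketch: classify an "illegal attempt to update" error from the two
# timestamps it reports.
#   $1 = time of the attempted update
#   $2 = last update time stored in the RRD
classify_rrd_time_error() {
    if [ "$2" -eq "$1" ]; then
        echo "same timestamp: duplicate update (case A)"
    elif [ "$2" -gt "$1" ]; then
        echo "RRD is ahead by $(( $2 - $1 )) s: check the system clock (case B)"
    else
        echo "last update is older: no timestamp conflict"
    fi
}

classify_rrd_time_error 1264941084 1564941084
# prints: RRD is ahead by 300000000 s: check the system clock (case B)
```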

Can't use string (...) as a HASH ref while "strict refs"

Can't use string ("HOSTSTATE::UP") as a HASH ref while "strict refs" in use at 419
This error is common for people migrating from pnp4nagios, or who imported their old nagios commands into centreon and chose to overwrite the default values. For centstorage to work correctly it must be able to process the performance data coming from the plugins, which it expects in a well-defined format. If the format deviates, centstorage can no longer parse the values. The format is determined by the command definition that nagios uses as Service Performance Data Processing Command in Configuration -> Nagios -> nagios.cfg -> Data (default: process-service-perfdata). Please check the parameters of this command as defined in Configuration -> Commands -> Miscellaneous -> "command-name", which should be:
$USER1$/process-service-perfdata  "$LASTSERVICECHECK$" "$HOSTNAME$" "$SERVICEDESC$" "$LASTSERVICESTATE$" "$SERVICESTATE$" "$SERVICEPERFDATA$"
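If you want to compare your own command line with this one, a rough check is to look for each of the six macros. This sketch (the function name is made up) only tests that the macros are present; it does not verify their order or quoting.

```shell
#!/bin/sh
# Sketch: print the name of every expected macro that is missing from
# a perfdata command line. Prints nothing if all six are present.
missing_perfdata_macros() {
    for macro in LASTSERVICECHECK HOSTNAME SERVICEDESC \
                 LASTSERVICESTATE SERVICESTATE SERVICEPERFDATA; do
        case $1 in
            *"\$$macro\$"*) ;;          # macro found, nothing to report
            *) echo "$macro" ;;
        esac
    done
}

# Usage (single quotes keep the shell from expanding the $ signs):
#   missing_perfdata_macros '$USER1$/process-service-perfdata "$HOSTNAME$" ...'
```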

Customize graphs

Q: Where and how do I configure Centreon that it has to use the performance data to create a graph?
A: Centreon uses the data as soon as it has been parsed by centstorage and copied into the configured storage (RRD, or RRD and DB). Go to Views -> Curves and define colors for your metrics (time, temperature, total, ...). In Administration -> Options -> CentStorage -> Manage you can hide performance metrics you do not need on the graphs. For more control over the graph output, use the graph templates.