Ganglia Distributed Monitoring System

1. Introduction
Ganglia is an open-source monitoring project started at UC Berkeley, designed to monitor thousands of nodes. Every server runs a daemon named gmond that collects and sends monitoring data gathered from the operating system and from specified hosts. A host that receives all of this monitoring data can display it and pass a condensed form of it further up the hierarchy. It is this hierarchical architecture that allows Ganglia to scale so well. gmond puts very little load on the system, which makes it practical to run it on every server in a cluster without affecting user performance.

Ganglia is a distributed monitoring system used mainly to watch system performance. Its graphs make it easy to see the working state of each node, which helps considerably in tuning and allocating system resources and improving overall system performance. It is accessed through a web browser, but it cannot monitor hardware-level metrics of the nodes.

Ganglia components
Ganglia consists of the following programs, which exchange monitoring data in XML format.
Server side, the ganglia meta daemon (gmetad): collects the data of each cluster and writes it into RRD databases.
Client side, the ganglia monitoring daemon (gmond): collects monitoring data on the local host and sends it to other servers, and also collects the monitoring data sent by other servers, so that gmetad can read it.
Web frontend, the ganglia PHP web frontend: a web-based monitoring interface that must be installed on the same node as gmetad, since it fetches data from gmetad and reads the RRD databases to render the graphs.

Ganglia operating modes
Ganglia can collect data in unicast or multicast mode; the default is multicast.
Unicast: a node sends the monitoring data it has collected to one or more specific servers; this can cross network segments.
Multicast: a node sends the monitoring data it has collected to all servers on the same network segment, and at the same time receives the monitoring data sent by all servers on that segment. Because the data travels as multicast packets, the nodes must be on the same segment; within one segment, however, different send channels can be defined.
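As a minimal sketch, switching a node to unicast only requires replacing the multicast channel in gmond.conf with a host-directed one (node1 stands for whichever server should receive the data):
udp_send_channel {
host = node1
port = 8649
}
udp_recv_channel {
port = 8649
}
In unicast mode, send_metadata_interval in the globals section should also be set to a value greater than 0, so that metric metadata is periodically resent to the receiving node.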

2. Installation
# apt-get install libconfuse-dev expat libpcre3-dev libpango1.0-dev libxml2-dev libapr1-dev libexpat-dev rrdtool librrds-perl librrd2-dev python-dev
# wget http://nchc.dl.sourceforge.net/project/ganglia/ganglia%20monitoring%20core/3.2.0/ganglia-3.2.0.tar.gz
# tar zxvf ganglia-3.2.0.tar.gz -C ../software/
# cd ../software/ganglia-3.2.0
# ./configure --prefix=/usr/local/ganglia-3.2.0 --with-gmetad --enable-gexec
# make
# make install
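To sanity-check the build, the freshly installed gmond can print its built-in default configuration (the sbin path assumes the default install layout under the prefix chosen above):
# /usr/local/ganglia-3.2.0/sbin/gmond -t | head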

3. Configuration
# vim gmetad.conf
data_source "cluster-db" node1 node2 //定義集羣名稱,以及集羣中的節點。由於採用multicast模式,每臺gmond節點都有本集羣內節點服務器的所有監控數據,因此不必把所有節點都列出來。建議寫入不低於2個,在node1節點當掉後,會自動找node2節點取數據。啓動gmetad時,會進行域名解析的。
data_source "cluster-
memcache" 192.168.41.129
rrd_rootdir "/data/ganglia/rrds" // directory where the RRD databases are stored; gmetad writes the monitoring data it collects into the corresponding RRD files under this path.
case_sensitive_hostnames 1 // keep the case of hostnames when naming the RRD directories (0 lowercases them)
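To verify the configuration, gmetad can be run in the foreground with debugging enabled and watched while it polls the data_source hosts (the -d/--debug flag is assumed here from the 3.2 series):
# /usr/local/ganglia-3.2.0/sbin/gmetad -d 1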

# /usr/local/ganglia-3.2.0/sbin/gmond -t > /usr/local/ganglia-3.2.0/etc/gmond.conf
# vim gmond.conf
globals {
daemonize = yes // run as a daemon
setuid = yes // drop privileges to the user set below
user = nobody // user to run as
debug_level = 0 // debug level; any value above 0 keeps gmond in the foreground
max_udp_msg_len = 1472 // maximum UDP packet length
mute = no // "mute": if yes, this node stops broadcasting any data it collects to the network
deaf = no // "deaf": if yes, this node stops receiving data packets broadcast by other nodes
allow_extra_data = yes // send the EXTRA_ELEMENT/EXTRA_DATA fields; can be disabled to save bandwidth
host_dmax = 0 /*secs */ // remove a host after this many seconds of silence; 0 means never remove it
host_tmax = 20 /*secs */ // a host is considered down after 4 x host_tmax seconds without a report
cleanup_threshold = 300 /*secs */ // interval at which expired hosts and metrics are purged
gexec = no // whether to allow gexec on this node
send_metadata_interval = 0 /*secs */ // how often metric metadata is resent; 0 sends it only at startup (must be > 0 in unicast mode)
}
cluster {
name = "cluster-db" //本節點屬於哪個cluster,需要與data_source對應
owner = "xuhh" //誰是該節點的所有者
latlong = "unspecified"
url = "unspecified"
}
host {
location = "node1"
}
udp_send_channel { // UDP send channel
mcast_join = 239.2.11.71 // multicast address: this node works on the 239.2.11.71 channel. For unicast mode, use host = node1 instead; multiple udp_send_channel blocks can be configured in unicast mode
port = 8649 // destination port
ttl = 1
}
udp_recv_channel { // UDP receive channel
mcast_join = 239.2.11.71
port = 8649
bind = 239.2.11.71
}
tcp_accept_channel { // TCP accept channel; multiple tcp_accept_channel blocks can be configured to share the cluster's monitoring data
port = 8649 // remote hosts can fetch the monitoring data by connecting to port 8649
}
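/* Anything that connects to this TCP port receives the current cluster state
as an XML dump, which is also what gmetad polls. From another host, for
example: nc node1 8649 | head */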
modules { // monitoring modules
module {
name = "core_metrics"
}
module {
name = "cpu_module"
path = "modcpu.so"
}
module {
name = "disk_module"
path = "moddisk.so"
}
module {
name = "load_module"
path = "modload.so"
}
module {
name = "mem_module"
path = "modmem.so"
}
module {
name = "net_module"
path = "modnet.so"
}
module {
name = "proc_module"
path = "modproc.so"
}
module {
name = "sys_module"
path = "modsys.so"
}
}
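/* A note on the collection_group blocks below: collect_once / collect_every
control how often a metric is sampled; time_threshold is the longest interval
before a sampled value is sent even if unchanged; value_threshold triggers an
immediate send when the value changes by at least that amount. */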
/* This collection group will cause a heartbeat (or beacon) to be sent every
20 seconds. In the heartbeat is the GMOND_STARTED data which expresses
the age of the running gmond. */
collection_group {
collect_once = yes
time_threshold = 20
metric {
name = "heartbeat"
}
}
/* This collection group will send general info about this host every
1200 secs.
This information doesn't change between reboots and is only collected
once. */
collection_group {
collect_once = yes
time_threshold = 1200
metric {
name = "cpu_num"
title = "CPU Count"
}
metric {
name = "cpu_speed"
title = "CPU Speed"
}
metric {
name = "mem_total"
title = "Memory Total"
}
/* Should this be here? Swap can be added/removed between reboots. */
metric {
name = "swap_total"
title = "Swap Space Total"
}
metric {
name = "boottime"
title = "Last Boot Time"
}
metric {
name = "machine_type"
title = "Machine Type"
}
metric {
name = "os_name"
title = "Operating System"
}
metric {
name = "os_release"
title = "Operating System Release"
}
metric {
name = "location"
title = "Location"
}
}
/* This collection group will send the status of gexecd for this host
every 300 secs.*/
/* Unlike 2.5.x the default behavior is to report gexecd OFF. */
collection_group {
collect_once = yes
time_threshold = 300
metric {
name = "gexec"
title = "Gexec Status"
}
}
/* This collection group will collect the CPU status info every 20 secs.
The time threshold is set to 90 seconds. In honesty, this
time_threshold could be set significantly higher to reduce
unneccessary network chatter. */
collection_group {
collect_every = 20
time_threshold = 90
/* CPU status */
metric {
name = "cpu_user"
value_threshold = "1.0"
title = "CPU User"
}
metric {
name = "cpu_system"
value_threshold = "1.0"
title = "CPU System"
}
metric {
name = "cpu_idle"
value_threshold = "5.0"
title = "CPU Idle"
}
metric {
name = "cpu_nice"
value_threshold = "1.0"
title = "CPU Nice"
}
metric {
name = "cpu_aidle"
value_threshold = "5.0"
title = "CPU aidle"
}
metric {
name = "cpu_wio"
value_threshold = "1.0"
title = "CPU wio"
}
/* The next two metrics are optional if you want more detail...
... since they are accounted for in cpu_system.
metric {
name = "cpu_intr"
value_threshold = "1.0"
title = "CPU intr"
}
metric {
name = "cpu_sintr"
value_threshold = "1.0"
title = "CPU sintr"
}
*/
}
collection_group {
collect_every = 20
time_threshold = 90
/* Load Averages */
metric {
name = "load_one"
value_threshold = "1.0"
title = "One Minute Load Average"
}
metric {
name = "load_five"
value_threshold = "1.0"
title = "Five Minute Load Average"
}
metric {
name = "load_fifteen"
value_threshold = "1.0"
title = "Fifteen Minute Load Average"
}
}
/* This group collects the number of running and total processes */
collection_group {
collect_every = 80
time_threshold = 950
metric {
name = "proc_run"
value_threshold = "1.0"
title = "Total Running Processes"
}
metric {
name = "proc_total"
value_threshold = "1.0"
title = "Total Processes"
}
}
/* This collection group grabs the volatile memory metrics every 40 secs and
sends them at least every 180 secs. This time_threshold can be increased
significantly to reduce unneeded network traffic. */
collection_group {
collect_every = 40
time_threshold = 180
metric {
name = "mem_free"
value_threshold = "1024.0"
title = "Free Memory"
}
metric {
name = "mem_shared"
value_threshold = "1024.0"
title = "Shared Memory"
}
metric {
name = "mem_buffers"
value_threshold = "1024.0"
title = "Memory Buffers"
}
metric {
name = "mem_cached"
value_threshold = "1024.0"
title = "Cached Memory"
}
metric {
name = "swap_free"
value_threshold = "1024.0"
title = "Free Swap Space"
}
}
collection_group {
collect_every = 40
time_threshold = 300
metric {
name = "bytes_out"
value_threshold = 4096
title = "Bytes Sent"
}
metric {
name = "bytes_in"
value_threshold = 4096
title = "Bytes Received"
}
metric {
name = "pkts_in"
value_threshold = 256
title = "Packets Received"
}
metric {
name = "pkts_out"
value_threshold = 256
title = "Packets Sent"
}
}
/* Different than 2.5.x default since the old config made no sense */
collection_group {
collect_every = 1800
time_threshold = 3600
metric {
name = "disk_total"
value_threshold = 1.0
title = "Total Disk Space"
}
}
collection_group {
collect_every = 40
time_threshold = 180
metric {
name = "disk_free"
value_threshold = 1.0
title = "Disk Space Available"
}
metric {
name = "part_max_used"
value_threshold = 1.0
title = "Maximum Disk Space Used"
}
}
include ("/usr/local/ganglia-3.2.0/etc/conf.d/*.conf")

# mkdir -p /data/ganglia/{rrds,dwoo}
# chown -R nobody:nobody /data/ganglia
# chmod -R 777 /data/ganglia/rrds
For servers with two network cards where the monitoring traffic between servers goes over the internal network, a multicast route must be added as follows:
# ip route add 239.2.11.71 dev eth0 // eth0 is the internal NIC
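With the directories and the multicast route in place, both daemons can be started directly from the install prefix (the sbin paths again assume the configure prefix used above):
# /usr/local/ganglia-3.2.0/sbin/gmetad
# /usr/local/ganglia-3.2.0/sbin/gmond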

Client side:
Install as shown above.
# vim gmond.conf
cluster {
name = "cluster-memcache"
owner = "xuhh"
latlong = "unspecified"
url = "unspecified"
}

host {
location = "192.168.41.129"
}
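Because the data_source entry for this cluster points at 192.168.41.129 directly, the node could keep the default multicast channel; to keep cluster-memcache traffic separate from cluster-db on the same segment, a dedicated multicast channel can be used instead (239.2.11.72 is an arbitrary example address):
udp_send_channel {
mcast_join = 239.2.11.72
port = 8649
ttl = 1
}
udp_recv_channel {
mcast_join = 239.2.11.72
port = 8649
bind = 239.2.11.72
}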

Configuring the ganglia PHP web frontend:
Setting up the PHP environment is omitted here.
# cp -a /usr/local/src/software/ganglia-3.2.0/web/ /var/www/ganglia
# vim /var/www/ganglia/conf.php
$gmetad_root = "/data/ganglia";
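With $gmetad_root pointing at the directory that holds the rrds subdirectory, the frontend should then be reachable in a browser at http://<server>/ganglia/ (assuming /var/www is the web server's document root).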

