Cluster Overview
There are currently three Ceph clusters, and we need a highly available monitoring and alerting system covering all three. Since Prometheus supports monitoring and alerting for Ceph, this article uses Prometheus + Alertmanager to build a relatively robust monitoring and alerting system.
Prometheus High Availability (to be optimized)
Three independent Prometheus instances are deployed on bj03, bj04, and k8s-test; each pulls the full set of metrics.
If one Prometheus instance goes down, Grafana can still read the same metrics from another instance.
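One way to wire this failover into Grafana is datasource provisioning: register each Prometheus instance as its own datasource so dashboards can switch between them. A sketch in Grafana's provisioning format (file under /etc/grafana/provisioning/datasources/; names and addresses are placeholders, not from the original setup):

```yaml
# Hypothetical datasource provisioning file: two identical Prometheus
# datasources. If one Prometheus dies, point the dashboard at the other;
# both hold the full metric set.
apiVersion: 1
datasources:
  - name: prometheus-bj03
    type: prometheus
    url: http://10.xx.xx.xx:9090
    access: proxy
  - name: prometheus-bj04
    type: prometheus
    url: http://10.xx.xx.xx:9090
    access: proxy
```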
Alertmanager High Availability
Three mutually aware Alertmanager instances are deployed on bj03, bj04, and k8s-test; each receives alerts from all three Prometheus instances.
Alertmanager implements the gossip protocol; with the right startup flags, the Alertmanager cluster ensures that the same alert is not sent more than once.
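As a sketch, the HA startup flags look like this on each node (addresses are placeholders); every instance lists the other two as peers so the cluster can gossip notification state and deduplicate alerts:

```shell
# Hypothetical startup flags for one of the three Alertmanager instances
# (repeat on bj03/bj04/k8s-test, each peering with the other two):
alertmanager \
  --config.file=/etc/alertmanager/config.yml \
  --web.listen-address=:9096 \
  --cluster.listen-address=:8001 \
  --cluster.peer=<peer-1-ip>:8001 \
  --cluster.peer=<peer-2-ip>:8001
```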
Monitoring and Alerting Deployment Architecture
Combining the Prometheus and Alertmanager deployments above, the architecture of the monitoring and alerting system is as follows:
Monitoring and Alerting Flow
Without further ado, here is the flow diagram:
As shown above:
1. Prometheus discovers ceph_exporter and its host via the scrape_configs section of prometheus.yml;
2. Prometheus pulls metric data from ceph_exporter;
3. Prometheus stores the pulled metrics in its embedded time-series database and sends alerts that match the alerting rules to Alertmanager;
4. Alertmanager defines routing rules and the endpoints that receive alert messages; it aggregates the alerts and sends them to a webhook;
5. The webhook calls back the alert-center API; at this point the alert has reached the alert center (see the webhook implementation reference);
6. Based on the management-platform configuration, the alert center delivers the alert to team members via V messages, SMS, and so on;
7. Grafana also supports Prometheus as a data source; simply configure it in Grafana (see the Grafana configuration reference).
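The webhook in steps 4–5 can be sketched as a small HTTP service that accepts Alertmanager's JSON payload and forwards it to the alert center. A minimal sketch in Python, assuming the standard Alertmanager webhook payload format; the actual alert-center call (e.g. the sendMsg endpoint) is left as a placeholder:

```python
# Minimal sketch of a webhook receiver for Alertmanager notifications.
# The port mirrors the 8106 receiver URL in the Alertmanager config below;
# the forwarding step to the alert center is a hypothetical placeholder.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def format_alerts(payload):
    """Flatten an Alertmanager webhook payload into one line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        lines.append("[{}] {}: {}".format(
            alert.get("status", "unknown"),
            alert.get("labels", {}).get("alertname", "unknown"),
            alert.get("annotations", {}).get("summary", ""),
        ))
    return lines

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Alertmanager POSTs a JSON body describing the alert group.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for line in format_alerts(payload):
            # Here you would call the alert-center API (V message / SMS).
            print(line)
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8106), WebhookHandler).serve_forever()
```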
Monitoring and Alerting Installation
You can deploy either with containers or without.
Containerized Deployment
#ops on 10.xxx.xxx.xxx
wget http://static.zybuluo.com/zphj1987/jiwx305b8q1hwc5uulo0z7ft/ceph_exporter-2.0.0-1.x86_64.rpm
rpm -ivh ceph_exporter-2.0.0-1.x86_64.rpm
systemctl start ceph_exporter
systemctl status ceph_exporter
#ops on 10.xxx.xxx.xxx
docker pull prom/prometheus:v2.3.2
docker pull prom/alertmanager:v0.16.0
docker pull docker.io/grafana/grafana:5.2.1
mkdir -p /etc/prometheus
cat /etc/prometheus/alert_config.yml
cat /etc/prometheus/alert_rules_szsk_04_17.yml
cat /etc/prometheus/prometheus_sz02_04_17.yml
docker run -d --name alertmanager_sz02ceph -p 9096:9093 -v /etc/prometheus/alert_config.yml:/etc/alertmanager/config.yml prom/alertmanager:v0.16.0
docker run -d --name prometheus_sz02ceph -p 9191:9090 -v /etc/prometheus/prometheus_sz02_04_17.yml:/etc/prometheus/prometheus.yml -v /etc/prometheus/alert_rules_sz02_04_17.yml:/etc/prometheus/alert_rules.yml prom/prometheus:v2.3.2
docker run -d --name=grafana -p 3000:3000 docker.io/grafana/grafana:5.2.1
Non-containerized Deployment
wget http://static.zybuluo.com/zphj1987/jiwx305b8q1hwc5uulo0z7ft/ceph_exporter-2.0.0-1.x86_64.rpm
rpm -qpl ceph_exporter-2.0.0-1.x86_64.rpm
rpm -ivh ceph_exporter-2.0.0-1.x86_64.rpm
systemctl status ceph_exporter
systemctl start ceph_exporter
systemctl enable ceph_exporter
wget http://static.zybuluo.com/zphj1987/7ro7up6r03kx52rkwy1qjuwm/prometheus-2.3.2-1.x86_64.rpm
rpm -qpl prometheus-2.3.2-1.x86_64.rpm
rpm -ivh prometheus-2.3.2-1.x86_64.rpm
vim /usr/lib/systemd/system/prometheus.service
# in the ExecStart line, point --config.file at your Prometheus config:
--config.file=.../prometheus_xxx.yml
systemctl status prometheus
systemctl start prometheus
systemctl enable prometheus
netstat -tunlp|grep 9090
wget --content-disposition https://packagecloud.io/prometheus-rpm/release/packages/el/7/alertmanager-0.16.0-1.el7.centos.x86_64.rpm/download.rpm
### Note: alertmanager 0.16 is used here; the earlier 0.13 release does not recognize --cluster.listen-address and the other HA flags
rpm -qpl alertmanager-0.16.0-1.el7.centos.x86_64.rpm
rpm -ivh alertmanager-0.16.0-1.el7.centos.x86_64.rpm
vim /usr/lib/systemd/system/alertmanager.service
# add the following flags to the ExecStart line:
--config.file=.../alert_config.yml \
--web.listen-address=:9096 \
--cluster.listen-address=:8001 \
--cluster.peer=[the other alertmanager ip:port]
systemctl status alertmanager
systemctl start alertmanager
systemctl enable alertmanager
netstat -tunlp | grep 9096
wget https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-5.2.1-1.x86_64.rpm
yum install grafana-5.2.1-1.x86_64.rpm
systemctl start grafana-server.service
netstat -tunlp|grep grafana
Configuration Files
Prometheus configuration file
# my global config
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 10.1xx.xxx.xxx:9093
      - 10.1xx.xxx.xxx:9093
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - alert_rules.yml
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'ceph-exporter'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['10.xx.xx.xx:9128']
      - targets: ['10.xx.xx.xx:9128']
      - targets: ['10.xx.xx.xx:9128']

  # The 'ceph-exporter-alias' job shows the cluster name instead of host:port by
  # rewriting labels via <relabel_configs>. See https://zhuanlan.zhihu.com/p/77020680
  # and https://github.com/prometheus/prometheus/blob/release-2.18/config/testdata/conf.good.yml
  - job_name: 'ceph-exporter-alias'
    file_sd_configs:
      - refresh_interval: 10s
        files:
          - '/etc/prometheus/ceph_exporter.yml'
    relabel_configs:
      - source_labels:
          - '__address__'
        regex: '(.*)'
        target_label: '__address__'
        action: replace
        replacement: '${1}'
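The 'ceph-exporter-alias' job reads its targets from /etc/prometheus/ceph_exporter.yml, which is not shown above. A sketch of that file in Prometheus file_sd format (addresses are placeholders; attaching a cluster label to each target is an assumed convention that lets dashboards and alerts show the cluster name instead of host:port):

```yaml
# Hypothetical /etc/prometheus/ceph_exporter.yml in file_sd format:
# one entry per exporter, each carrying a 'cluster' label.
- targets: ['10.xx.xx.xx:9128']
  labels:
    cluster: bj03
- targets: ['10.xx.xx.xx:9128']
  labels:
    cluster: bj04
- targets: ['10.xx.xx.xx:9128']
  labels:
    cluster: k8s-test
```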
Alert rules configuration file
groups:
- name: ceph.rules
  rules:
  - alert: CephTargetDown
    expr: up{job=~"ceph-exporter.*"} == 0
    for: 10m
    labels:
      severity: critical
    annotations:
      description: CEPH target down for more than 10m, please check - it could be either an exporter crash or a whole cluster crash
      summary: CEPH exporter down
  - alert: CephErrorState
    expr: ceph_health_status > 1
    for: 5m
    labels:
      severity: critical
    annotations:
      description: Ceph has been in Error state for longer than 5m, please check the status of pools and OSDs
      summary: CEPH in ERROR
  - alert: OsdDown
    expr: ceph_osd_up == 0
    for: 30m
    labels:
      severity: warning
    annotations:
      description: OSD has been down for longer than 30 min, please check its status
      summary: OSD down
  - alert: OsdApplyLatencyTooHigh
    expr: ceph_osd_perf_apply_latency_seconds > 10
    for: 90s
    labels:
      severity: warning
    annotations:
      description: OSD latency for {{ $labels.osd }} is too high. Please check whether it is stuck in a weird state
      summary: OSD latency too high {{ $labels.osd }}
  - alert: MonitorClockSkewTooHigh
    expr: abs(ceph_monitor_clock_skew_seconds) > 0.1
    for: 60s
    labels:
      severity: warning
    annotations:
      description: Monitor clock skew detected on {{ $labels.monitor }} - please check ntp and hardware clock settings
      summary: Clock skew detected on {{ $labels.monitor }}
  - alert: MonitorAvailableStorage
    expr: ceph_monitor_avail_percent < 30
    for: 60s
    labels:
      severity: warning
    annotations:
      description: Monitor storage for {{ $labels.monitor }} is less than 30% - please check why it is so low
      summary: Monitor storage for {{ $labels.monitor }} less than 30%
  - alert: MonitorAvailableStorage
    expr: ceph_monitor_avail_percent < 15
    for: 60s
    labels:
      severity: critical
    annotations:
      description: Monitor storage for {{ $labels.monitor }} is less than 15% - please check why it is so low
      summary: Monitor storage for {{ $labels.monitor }} less than 15%
  - alert: CephOSDUtilization
    expr: ceph_osd_utilization > 90
    for: 60s
    labels:
      severity: critical
    annotations:
      description: OSD utilization for {{ $labels.osd }} is higher than 90%. Please check why it is so high; reweight or add storage
      summary: OSD {{ $labels.osd }} is going out of space
  - alert: CephPgDown
    expr: ceph_pg_down > 0
    for: 3m
    labels:
      severity: critical
    annotations:
      description: Some PGs have been down (unavailable) for too long on {{ $labels.cluster }}. Please ensure that all the data is available
      summary: PG DOWN [{{ $value }}] on {{ $labels.cluster }}
  - alert: CephPgIncomplete
    expr: ceph_pg_incomplete > 0
    for: 2m
    labels:
      severity: critical
    annotations:
      description: Some PGs have been incomplete (unavailable) for too long on {{ $labels.cluster }}. Please ensure that all the data is available
      summary: PG INCOMPLETE [{{ $value }}] on {{ $labels.cluster }}
  - alert: CephPgInconsistent
    expr: ceph_pg_inconsistent > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      description: Some PGs have been inconsistent for too long on {{ $labels.cluster }}. Data is available but inconsistent across nodes
      summary: PG INCONSISTENT [{{ $value }}] on {{ $labels.cluster }}
  - alert: CephPgActivating
    expr: ceph_pg_activating > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      description: Some PGs have been activating for too long on {{ $labels.cluster }}. Those PGs have been unavailable for too long!
      summary: PG ACTIVATING [{{ $value }}] on {{ $labels.cluster }}
  - alert: CephPgBackfillTooFull
    expr: ceph_pg_backfill_toofull > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      description: Some PGs are located on full OSDs on cluster {{ $labels.cluster }}. Those PGs can become unavailable shortly. Please check OSDs, change weights, or reconfigure CRUSH rules.
      summary: PG TOO FULL [{{ $value }}] on {{ $labels.cluster }}
  - alert: CephPgUnavailable
    expr: ceph_pg_total - ceph_pg_active > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      description: Some PGs are unavailable on {{ $labels.cluster }}. Please check their detailed status and current configuration.
      summary: PG UNAVAILABLE [{{ $value }}] on {{ $labels.cluster }}
  - alert: CephOsdReweighted
    expr: ceph_osd_weight < 1
    for: 1h
    labels:
      severity: warning
    annotations:
      description: OSD {{ $labels.ceph_daemon }} on cluster {{ $labels.cluster }} has been reweighted for too long. Please either create a silence or fix the issue
      summary: OSD {{ $labels.ceph_daemon }} on {{ $labels.cluster }} reweighted - {{ $value }}
  - alert: CephAvailableBytesNotEnough
    expr: ceph_cluster_available_bytes / ceph_cluster_capacity_bytes < 0.3
    for: 1m
    labels:
      severity: warning
    annotations:
      description: ceph cluster {{ $labels.cluster }} does not have enough available bytes. Please check the cluster's available bytes.
      summary: ceph cluster {{ $labels.cluster }} available bytes [{{ $value }}].
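Before (re)loading the rules into Prometheus, the file can be validated with promtool, which ships with the Prometheus package; the path below assumes the rules were saved as /etc/prometheus/alert_rules.yml:

```shell
promtool check rules /etc/prometheus/alert_rules.yml
```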
Alertmanager configuration file
global:

# The directory from which notification templates are read.
templates:
- '/etc/alertmanager/template/*.tmpl'

# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname', 'cluster', 'service']
  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This ensures that multiple alerts for the same group that start
  # firing shortly after one another are batched together on the first
  # notification.
  group_wait: 30s
  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m
  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 30m
  # A default receiver
  receiver: 'team-ceph-ops-mails'
  # All the above attributes are inherited by all child routes and can be
  # overwritten on each.
  # The child route trees.
  #routes:
  #- receiver: 'caas'
  #  match:
  #    alertname: 'PodCpuUsage'
  routes:
  - match_re:
      alertname: ^[Cc]eph.*
    receiver: team-ceph-ops-mails
  - match_re:
      alertname: ^skidc.*
    receiver: team-skidc-ops-mails

receivers:
- name: 'team-skidc-ops-mails'
  webhook_configs:
  - url: http://10.xx.xx.xx:8101/sendmms
  - url: http://10.xx.xx.xx:8101/sendmsg
- name: 'team-ceph-ops-mails'
  webhook_configs:
  - url: http://10.xx.xx.xx:8106/webhook/sendMsg
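Similarly, the Alertmanager configuration can be validated with amtool before restarting the service; the path below assumes the file is mounted at /etc/alertmanager/config.yml as in the docker run command above:

```shell
amtool check-config /etc/alertmanager/config.yml
```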