Deploying Prometheus + Alertmanager Monitoring and Alerting

Cluster overview

There are currently three Ceph clusters, and we need a highly available monitoring and alerting setup that covers all three. Since Prometheus supports monitoring Ceph, this article uses Prometheus + Alertmanager to build a relatively robust monitoring and alerting system.

Prometheus high availability (to be improved)

Three independent Prometheus instances are deployed on bj03, bj04 and k8s-test; each of them scrapes the full set of metrics;
If one Prometheus instance goes down, Grafana can still show the metrics coming from another Prometheus instance;
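For this failover to work, Grafana only needs every Prometheus instance configured as a data source. A minimal sketch of a Grafana 5.x data-source provisioning file (the file path, data-source names, and URLs are illustrative assumptions; the data sources can equally be added through the Grafana UI):

# /etc/grafana/provisioning/datasources/prometheus.yaml (hypothetical path)
apiVersion: 1
datasources:
  - name: prometheus-bj03          # illustrative name
    type: prometheus
    access: proxy
    url: http://10.xx.xx.xx:9191   # bj03 Prometheus
  - name: prometheus-bj04          # illustrative name
    type: prometheus
    access: proxy
    url: http://10.xx.xx.xx:9191   # bj04 Prometheus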


Alertmanager high availability

Three mutually aware Alertmanager instances are deployed on bj03, bj04 and k8s-test; each of them receives alerts from all three Prometheus instances;
Alertmanager implements the gossip protocol; with the proper startup flags, the Alertmanager cluster will not send the same alert more than once;
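As a sketch, the startup flags that join the instances into one gossip cluster look like this (IPs and ports are placeholders; the same flags appear again in the non-containerized installation section below):

alertmanager \
  --config.file=/etc/prometheus/alert_config.yml \
  --web.listen-address=:9096 \
  --cluster.listen-address=:8001 \
  --cluster.peer=<bj04-alertmanager-ip>:8001 \
  --cluster.peer=<k8s-test-alertmanager-ip>:8001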


Monitoring and alerting deployment architecture

Combining the Prometheus and Alertmanager deployments described above, the architecture of the monitoring and alerting system is shown below:
[Figure: monitoring and alerting system architecture]


Monitoring and alerting workflow

Without further ado, here is the flow diagram:
[Figure: monitoring and alerting workflow]
As shown in the figure above:
1. Prometheus discovers ceph_exporter and the machine it runs on via the scrape_configs section of prometheus.yml;
2. Prometheus pulls metrics from ceph_exporter;
3. Prometheus stores the scraped metrics in its built-in time-series database and sends the metrics that match the alerting rules to Alertmanager as alerts;
4. Alertmanager defines routing rules and the endpoints that receive alerts; it aggregates the alerts and sends them to a webhook;
5. The webhook calls the alert-center API; at this point the alert has reached the alert center (see the webhook sketch below);
6. The alert center, according to the settings in the management platform, sends the alert to team members via V messages, SMS, and so on.
7. Grafana also supports Prometheus as a data source; it only needs to be configured in Grafana (see the Grafana configuration reference article).
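Step 5 depends on a webhook service that forwards alerts to the alert center. Below is a minimal sketch of such a receiver, assuming a hypothetical alert-center HTTP API: ALERT_CENTER_URL and the message fields are assumptions; only the general shape of the Alertmanager webhook payload and the /webhook/sendMsg path used in the Alertmanager config later in this article are taken as given.

from flask import Flask, request
import requests

app = Flask(__name__)

# Hypothetical alert-center endpoint; replace with the real one.
ALERT_CENTER_URL = "http://10.xx.xx.xx:8080/api/alert"

@app.route("/webhook/sendMsg", methods=["POST"])
def send_msg():
    # Alertmanager POSTs a JSON document containing a list of alerts.
    payload = request.get_json(force=True)
    for alert in payload.get("alerts", []):
        annotations = alert.get("annotations", {})
        msg = {
            "title": "[{}] {}".format(alert.get("status", "firing"),
                                      annotations.get("summary", "")),
            "content": annotations.get("description", ""),
        }
        # Forward the simplified message to the alert center (hypothetical API).
        requests.post(ALERT_CENTER_URL, json=msg, timeout=5)
    return "", 200

if __name__ == "__main__":
    # Listen on the port used by the webhook receiver URL in the Alertmanager config.
    app.run(host="0.0.0.0", port=8106)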

Installation and deployment

The components can be installed either containerized or non-containerized.

Containerized installation

#ops on 10.xxx.xxx.xxx
wget http://static.zybuluo.com/zphj1987/jiwx305b8q1hwc5uulo0z7ft/ceph_exporter-2.0.0-1.x86_64.rpm
rpm -ivh ceph_exporter-2.0.0-1.x86_64.rpm
systemctl start ceph_exporter
systemctl status  ceph_exporter
 
#ops on 10.xxx.xxx.xxx
docker pull prom/prometheus:v2.3.2
docker pull prom/alertmanager:v0.16.0
docker pull docker.io/grafana/grafana:5.2.1
 
mkdir -p /etc/prometheus
# Create the following configuration files; their contents are listed in the
# "Configuration files" section later in this article.
cat /etc/prometheus/alert_config.yml
cat /etc/prometheus/alert_rules_szsk_04_17.yml
cat /etc/prometheus/prometheus_sz02_04_17.yml
 
docker run -d --name alertmanager_sz02ceph -p 9096:9093 -v /etc/prometheus/alert_config.yml:/etc/alertmanager/config.yml prom/alertmanager:v0.16.0
docker run -d --name prometheus_sz02ceph -p 9191:9090 -v /etc/prometheus/prometheus_sz02_04_17.yml:/etc/prometheus/prometheus.yml -v /etc/prometheus/alert_rules_sz02_04_17.yml:/etc/prometheus/alert_rules.yml prom/prometheus:v2.3.2
docker run -d --name=grafana -p 3000:3000 docker.io/grafana/grafana:5.2.1
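An optional sanity check that the three containers came up and answer on the mapped ports (assuming they run on the local host):

docker ps | grep -E 'alertmanager|prometheus|grafana'
curl -s http://localhost:9191/-/healthy    # Prometheus health endpoint
curl -s http://localhost:9096/-/healthy    # Alertmanager health endpoint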

Non-containerized installation

wget http://static.zybuluo.com/zphj1987/jiwx305b8q1hwc5uulo0z7ft/ceph_exporter-2.0.0-1.x86_64.rpm
rpm -qpl ceph_exporter-2.0.0-1.x86_64.rpm
rpm -ivh ceph_exporter-2.0.0-1.x86_64.rpm
systemctl status ceph_exporter
systemctl start ceph_exporter
systemctl enable ceph_exporter
 
wget http://static.zybuluo.com/zphj1987/7ro7up6r03kx52rkwy1qjuwm/prometheus-2.3.2-1.x86_64.rpm
rpm -qpl prometheus-2.3.2-1.x86_64.rpm
rpm -ivh prometheus-2.3.2-1.x86_64.rpm
vim /usr/lib/systemd/system/prometheus.service
  --config.file=.../prometheus_xxx.yml
systemctl status prometheus
systemctl start prometheus
systemctl enable prometheus
netstat -tunlp|grep 9090
 
wget --content-disposition https://packagecloud.io/prometheus-rpm/release/packages/el/7/alertmanager-0.16.0-1.el7.centos.x86_64.rpm/download.rpm
### Note: Alertmanager 0.16 is used here; the earlier 0.13 release did not recognize flags such as --cluster.listen-address when configuring Alertmanager high availability
rpm -qpl alertmanager-0.16.0-1.el7.centos.x86_64.rpm
rpm -ivh alertmanager-0.16.0-1.el7.centos.x86_64.rpm
vim /usr/lib/systemd/system/alertmanager.service
  --config.file=.../alert_config.yml \
  --web.listen-address=:9096 \
  --cluster.listen-address=:8001 \
  --cluster.peer=[the other alertmanager ip:port]
systemctl status alertmanager
systemctl start alertmanager
systemctl enable alertmanager
netstat -tunlp | grep 9096
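To confirm that the Alertmanager instances have actually formed a cluster, one option (assuming the status API is reachable on the local listen address) is:

curl -s http://localhost:9096/api/v1/status   # the peers should appear in the cluster section of the response
journalctl -u alertmanager | grep -i cluster  # gossip / cluster membership messages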
 
wget https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-5.2.1-1.x86_64.rpm
yum install grafana-5.2.1-1.x86_64.rpm
systemctl start grafana-server.service
netstat -tunlp|grep grafana

Configuration files

Prometheus configuration file

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 10.1xx.xxx.xxx:9093
      - 10.1xx.xxx.xxx:9093
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - alert_rules.yml
  # - "first_rules.yml"
  # - "second_rules.yml"

# Scrape configurations; here the targets are the ceph_exporter instances.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'ceph-exporter'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['10.xx.xx.xx:9128']
    - targets: ['10.xx.xx.xx:9128']
    - targets: ['10.xx.xx.xx:9128']
  # The 'ceph-exporter-alias' job replaces host:port with a cluster alias via relabel_configs.
  # See: https://zhuanlan.zhihu.com/p/77020680 and https://github.com/prometheus/prometheus/blob/release-2.18/config/testdata/conf.good.yml
  - job_name: 'ceph-exporter-alias'
    file_sd_configs:
    - refresh_interval: 10s
      files:
      - '/etc/prometheus/ceph_exporter.yml'
    relabel_configs:
    - source_labels:
      - '__address__'
      regex: '(.*)'
      target_label: '__address__'
      action: replace
      replacement: '${1}'
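The 'ceph-exporter-alias' job reads its targets from /etc/prometheus/ceph_exporter.yml. A hypothetical example of that file_sd target file (target addresses and the cluster label values are illustrative; labels defined here are attached to every metric scraped from the corresponding target):

# /etc/prometheus/ceph_exporter.yml - file_sd target list (illustrative)
- targets:
  - '10.xx.xx.xx:9128'
  labels:
    cluster: 'bj03'
- targets:
  - '10.xx.xx.xx:9128'
  labels:
    cluster: 'bj04'
- targets:
  - '10.xx.xx.xx:9128'
  labels:
    cluster: 'k8s-test'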

Alerting rules configuration file

groups:
- name: ceph.rules
  rules:
  - alert: CephTargetDown
    expr: up{job="ceph"} == 0
    for: 10m
    labels:
      severity: critical
    annotations:
      description: CEPH target has been down for more than 10m - please check; it could be either an exporter crash or a whole-cluster crash
      summary: CEPH exporter down
  - alert: CephErrorState
    expr: ceph_health_status > 1
    for: 5m
    labels:
      severity: critical
    annotations:
      description: Ceph is in Error state longer than 5m, please check status of pools and OSDs
      summary: CEPH in ERROR
  - alert: OsdDown
    expr: ceph_osd_up == 0
    for: 30m
    labels:
      severity: warning
    annotations:
      description: OSD has been down for longer than 30 min - please check what its status is
      summary: OSD down
  - alert: OsdApplyLatencyTooHigh
    expr: ceph_osd_perf_apply_latency_seconds > 10
    for: 90s
    labels:
      severity: warning
    annotations:
      description: OSD latency for {{ $labels.osd }} is too high. Please check whether it is stuck in a weird state
      summary: OSD latency too high {{ $labels.osd }}
  - alert: MonitorClockSkewTooHigh
    expr: abs(ceph_monitor_clock_skew_seconds) > 0.1
    for: 60s
    labels:
      severity: warning
    annotations:
      description: Monitor clock skew detected on {{ $labels.monitor }} - please check NTP and hardware clock settings
      summary: Clock skew detected on {{ $labels.monitor }}
  - alert: MonitorAvailableStorage
    expr: ceph_monitor_avail_percent < 30
    for: 60s
    labels:
      severity: warning
    annotations:
      description: Monitor storage for {{ $labels.monitor }} is less than 30% - please check why it is running low
      summary: Monitor storage for {{ $labels.monitor }} less than 30%
  - alert: MonitorAvailableStorage
    expr: ceph_monitor_avail_percent < 15
    for: 60s
    labels:
      severity: critical
    annotations:
      description: Monitor storage for {{ $labels.monitor }} is less than 15% - please check why it is running low
      summary: Monitor storage for {{ $labels.monitor }} less than 15%
  - alert: CephOSDUtilization
    expr: ceph_osd_utilization > 90
    for: 60s
    labels:
      severity: critical
    annotations:
      description: OSD utilization for {{ $labels.osd }} is higher than 90%. Please check why it is so high; reweight or add storage
      summary: OSD {{ $labels.osd }} is going out of space
  - alert: CephPgDown
    expr: ceph_pg_down > 0
    for: 3m
    labels:
      severity: critical
    annotations:
      description: Some groups are down (unavailable) for too long on {{ $labels.cluster }}. Please ensure that all the data are available
      summary: PG DOWN [{{ $value }}] on {{ $labels.cluster }}
  - alert: CephPgIncomplete
    expr: ceph_pg_incomplete > 0
    for: 2m
    labels:
      severity: critical
    annotations:
      description: Some groups are incomplete (unavailable) for too long on {{ $labels.cluster }}. Please ensure that all the data are available
      summary: PG INCOMPLETE [{{ $value }}] on {{ $labels.cluster }}
  - alert: CephPgInconsistent
    expr: ceph_pg_inconsistent > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      description: Some groups are inconsistent for too long on {{ $labels.cluster }}. Data is available but inconsistent across nodes
      summary: PG INCONSISTENT [{{ $value }}] on {{ $labels.cluster }}
  - alert: CephPgActivating
    expr: ceph_pg_activating > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      description: Some groups are activating for too long on {{ $labels.cluster }}. Those PGs are unavailable for too long!
      summary: PG ACTIVATING [{{ $value }}] on {{ $labels.cluster }}
  - alert: CephPgBackfillTooFull
    expr: ceph_pg_backfill_toofull > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      description: Some groups are located on full OSD on cluster {{ $labels.cluster }}. Those PGs can be unavailable shortly. Please check OSDs, change weight or reconfigure CRUSH rules.
      summary: PG TOO FULL [{{ $value }}] on {{ $labels.cluster }}
  - alert: CephPgUnavailable
    expr: ceph_pg_total - ceph_pg_active > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      description: Some groups are unavailable on {{ $labels.cluster }}. Please check their detailed status and current configuration.
      summary: PG UNAVAILABLE [{{ $value }}] on {{ $labels.cluster }}
  - alert: CephOsdReweighted
    expr: ceph_osd_weight < 1
    for: 1h
    labels:
      severity: warning
    annotations:
      description: OSD {{ $labels.ceph_daemon}} on cluster {{ $labels.cluster}} was reweighted for too long. Please either create silent or fix that issue
      summary: OSD {{ $labels.ceph_daemon }} on {{ $labels.cluster }} reweighted - {{ $value }}
  - alert: CephAvailableBytesNotEnough
    expr: ceph_cluster_available_bytes / ceph_cluster_capacity_bytes < 0.3
    for: 1m
    labels:
      severity: warning
    annotations:
      description: Ceph cluster {{ $labels.cluster }} does not have enough available space. Please check the cluster's available bytes.
      summary: Ceph cluster {{ $labels.cluster }} available ratio [{{ $value }}].
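Before reloading Prometheus, the rule file can be validated with promtool, which ships with the Prometheus package (the path below assumes the file is saved as /etc/prometheus/alert_rules.yml):

promtool check rules /etc/prometheus/alert_rules.yml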


Alertmanager configuration file

global:

# The directory from which notification templates are read.
templates:
- '/etc/alertmanager/template/*.tmpl'

# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname', 'cluster', 'service']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 30m

  # A default receiver
  receiver: 'team-ceph-ops-mails'

  # All the above attributes are inherited by all child routes and can
  # be overwritten on each.

  # The child route trees.
  #routes:
  #- receiver: 'caas'
  #  match:
  #    alertname: 'PodCpuUsage'

  routes:
  - match_re:
      alertname: ^ceph.*
    receiver: team-ceph-ops-mails
  - match_re:
      alertname: ^skidc.*
    receiver: team-skidc-ops-mails

receivers:
- name: 'team-skidc-ops-mails'
  webhook_configs:
  - url: http://10.xx.xx.xx:8101/sendmms
  - url: http://10.xx.xx.xx:8101/sendmsg

- name: 'team-ceph-ops-mails'
  webhook_configs:
  - url: http://10.xx.xx.xx:8106/webhook/sendMsg
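The Alertmanager configuration can be validated in the same way with amtool, which is installed alongside Alertmanager (assuming the file is saved as /etc/prometheus/alert_config.yml):

amtool check-config /etc/prometheus/alert_config.yml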

Appendix

References:
[1] https://ceph.io/planet/快速構建ceph可視化監控系統/
