Monitoring Pulsar with Prometheus + Grafana + Alertmanager and Sending Alerts to DingTalk

Download Prometheus

https://prometheus.io/download/

Install Prometheus

tar zxvf prometheus-2.24.0.linux-amd64.tar.gz -C /workspace/
mv /workspace/prometheus-2.24.0.linux-amd64 /workspace/prometheus
cd /workspace/prometheus/


Edit the configuration file

vim prometheus.yml
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  external_labels:
    cluster: pulsar-cluster-1 # add the Pulsar cluster name
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - "localhost:9093" # point alerts at the local Alertmanager

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
   #- "rules/*.yaml"
   - "alert_rules/*.yaml" # path to the alert rule files
   #- "first_rules.yml"
   #- "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'nacos-cluster'
    honor_labels: true
    scrape_interval: 60s
    metrics_path: '/nacos/actuator/prometheus'
    static_configs:
      - targets:
         - 10.9.4.42:8848
         - 10.9.5.47:8848
         - 10.9.5.75:8848

  - job_name: "broker"
    honor_labels: true # don't overwrite job & instance labels
    static_configs:
    - targets: ['10.9.4.42:8080','10.9.5.47:8080','10.9.5.75:8080']
      

  - job_name: "bookie"
    honor_labels: true # don't overwrite job & instance labels
    static_configs:
    - targets: ['10.9.4.42:8100','10.9.5.47:8100','10.9.5.75:8100'] 
    # I changed the bookie metrics port during installation; the default is 8000
    
      

  - job_name: "zookeeper"
    honor_labels: true
    static_configs:
    - targets: ['10.9.4.42:8000','10.9.5.47:8000','10.9.5.75:8000']
 
  - job_name: "node"
    honor_labels: true
    static_configs:
    - targets: ['10.9.4.42:9100','10.9.5.47:9100','10.9.5.75:9100']


Create the rules directory

cd /workspace/prometheus/
mkdir alert_rules


Start Prometheus

nohup ./prometheus --config.file="prometheus.yml" > /dev/null 2>&1 &


Check that Prometheus started successfully

Open http://10.9.5.71:9090/targets. If the targets show as UP, Prometheus started successfully.


Install node_exporter on the hosts to monitor

Download node_exporter

https://prometheus.io/download/

node_exporter must be installed on every host.

tar -zxvf node_exporter-1.0.1.linux-amd64.tar.gz 
cd node_exporter-1.0.1.linux-amd64/

Start node_exporter

nohup ./node_exporter > /dev/null 2>&1 &

Install Grafana

Download and install Grafana

sudo yum install -y https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-5.2.4-1.x86_64.rpm
sudo service grafana-server start

Log in to Grafana

http://10.9.5.71:3000  The default username and password are both admin.



Download the Pulsar dashboard JSON files for Grafana

https://github.com/streamnative/apache-pulsar-grafana-dashboard

Extract the archive. In every JSON file under the dashboards directory, replace {{ PULSAR_CLUSTER }} with the name of your Prometheus data source, then import the files into Grafana one by one.
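
Replacing the placeholder in every file by hand is tedious. Below is a small Python sketch (a hypothetical helper, not part of the streamnative repo; it assumes its dashboards/ directory layout) that patches all the files at once:

```python
from pathlib import Path

def patch_dashboards(dashboard_dir: str, datasource: str) -> int:
    """Replace the {{ PULSAR_CLUSTER }} placeholder in every dashboard
    JSON file under dashboard_dir; return the number of files patched."""
    patched = 0
    for f in Path(dashboard_dir).glob("*.json"):
        text = f.read_text()
        if "{{ PULSAR_CLUSTER }}" in text:
            f.write_text(text.replace("{{ PULSAR_CLUSTER }}", datasource))
            patched += 1
    return patched

# Example: patch_dashboards("dashboards", "prometheus")
```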


Choose Import, then Upload .json File


Select prometheus as the data source


Open the Bookie Metrics dashboard to view the panels.


Add a DingTalk robot

Open DingTalk, go to Group Settings, then Group Assistant, and add a custom robot.


Copy the robot's webhook URL and secret; they are needed later.


Install Alertmanager

Download Alertmanager

https://prometheus.io/download/


Extract Alertmanager

tar xf alertmanager-0.21.0.linux-amd64.tar.gz -C /workspace/
cd /workspace/alertmanager-0.21.0.linux-amd64/


Edit the configuration file

vim alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity', 'namespace']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10s
  receiver: 'dingding.webhook1'
  routes:
  - receiver: 'dingding.webhook1'
    match:
      team: DevOps
    group_wait: 10s
    group_interval: 15s
    repeat_interval: 3h
  - receiver: 'dingding.webhook.all'
    match:
      team: SRE
    group_wait: 10s
    group_interval: 15s
    repeat_interval: 3h

receivers:
- name: 'dingding.webhook1'
  webhook_configs:
  - url: 'http://10.9.5.71:8060/dingtalk/webhook1/send'
    send_resolved: true
- name: 'dingding.webhook.all'
  webhook_configs:
  - url: 'http://10.9.5.71:8060/dingtalk/webhook_mention_all/send'
    send_resolved: true
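
With this routing tree, alerts labeled team=DevOps go to dingding.webhook1, team=SRE alerts go to dingding.webhook.all, and everything else falls back to the root receiver. A minimal Python model of that first-match logic (illustrative only, not the Alertmanager implementation):

```python
def pick_receiver(labels: dict) -> str:
    """The first child route whose match labels all equal the alert's
    labels wins; otherwise fall back to the root receiver."""
    routes = [
        ({"team": "DevOps"}, "dingding.webhook1"),
        ({"team": "SRE"}, "dingding.webhook.all"),
    ]
    for match, receiver in routes:
        if all(labels.get(k) == v for k, v in match.items()):
            return receiver
    return "dingding.webhook1"  # root receiver

assert pick_receiver({"team": "SRE", "alertname": "InstanceDown"}) == "dingding.webhook.all"
assert pick_receiver({"team": "QA"}) == "dingding.webhook1"
```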



Start Alertmanager

nohup ./alertmanager  > /dev/null 2>&1 &

Create the alert rule files in the Prometheus directory

cd /workspace/prometheus/alert_rules
vim node-alert.yaml 
groups:
- name: hostStatsAlert
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{$labels.instance}} down"
      description: "{{$labels.instance}} of job {{$labels.job}} has been down for more than 1 minute."
  - alert: hostCpuUsageAlert
    expr: sum(avg without (cpu)(irate(node_cpu_seconds_total{mode!='idle'}[5m]))) by (instance) > 0.85
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usage high"
      description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"
  - alert: hostMemUsageAlert
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.85
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} MEM usage high"
      description: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }})"
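
The memory rule is plain arithmetic over two gauges: used fraction = (MemTotal - MemAvailable) / MemTotal. A quick sanity check of the 0.85 threshold with sample values (not real metrics):

```python
def mem_usage(total_bytes: int, available_bytes: int) -> float:
    """Used-memory fraction, mirroring (MemTotal - MemAvailable) / MemTotal."""
    return (total_bytes - available_bytes) / total_bytes

GiB = 1024 ** 3
# A 16 GiB host with only 1 GiB available is ~94% used, so the alert fires
assert mem_usage(16 * GiB, 1 * GiB) > 0.85
# The same host with 8 GiB available is 50% used: no alert
assert mem_usage(16 * GiB, 8 * GiB) <= 0.85
```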

Create the Pulsar alert rule file

cd /workspace/prometheus/alert_rules
vim pulsar.yaml
groups:
  - name: node
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          status: danger
        annotations:
          summary: "Instance {{ $labels.instance }} down."
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

      - alert: HighCpuUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
        for: 1m
        labels:
          status: warning
        annotations:
          summary: "High CPU usage."
          description: "CPU usage on instance {{ $labels.instance }} of job {{ $labels.job }} is above 60%, current value is {{ $value }}%"

      - alert: HighIOUtils
        expr: irate(node_disk_io_time_seconds_total[1m]) > 0.6
        for: 1m
        labels:
          status: warning
        annotations:
          summary: "High IO utils."
          description: "Disk IO time on instance {{ $labels.instance }} of job {{ $labels.job }} is above 60%, current busy fraction is {{ $value }}"

      - alert: HighDiskUsage
        expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes)  / node_filesystem_size_bytes > 0.8
        for: 1m
        labels:
          status: warning
        annotations:
          summary: "High disk usage"
          description: "Disk usage on instance {{ $labels.instance }} of job {{ $labels.job }} is above 80%, current used fraction is {{ $value }}"

      - alert: HighInboundNetwork
        expr: rate(node_network_receive_bytes_total{device!="lo"}[5m]) / 1024 / 1024 > 512
        for: 1m
        labels:
          status: warning
        annotations:
          summary: "High inbound network"
          description: "High inbound network on instance {{ $labels.instance }} of job {{ $labels.job }}, more than 512MB/s, current value is {{ $value }}MB/s"

  - name: zookeeper
    rules:
      - alert: HighWatchers
        expr: zookeeper_server_watches_count{job="zookeeper"} > 1000000
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Zookeeper server has more than 1000k watchers."
          description: "Zookeeper server {{ $labels.instance }} has more than 1000k watchers, current value is {{ $value }}."

      - alert: HighEphemerals
        expr: zookeeper_server_ephemerals_count{job="zookeeper"} > 10000
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Zookeeper server has more than 10k ephemeral nodes."
          description: "Zookeeper server {{ $labels.instance }} has more than 10k ephemeral nodes, current value is {{ $value }}."

      - alert: HighConnections
        expr: zookeeper_server_connections{job="zookeeper"} > 10000
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Zookeeper server has more than 10k connections."
          description: "Zookeeper server {{ $labels.instance }} has more than 10k connections, current value is {{ $value }}."

      - alert: HighDataSize
        expr: zookeeper_server_data_size_bytes{job="zookeeper"} > 107374182400
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Data size of Zookeeper server is more than 100GB."
          description: "Data size of Zookeeper server {{ $labels.instance }} is more than 100GB (107374182400 bytes), current value is {{ $value }}."

      - alert: HighRequestThroughput
        expr: sum(irate(zookeeper_server_requests{job="zookeeper"}[30s])) by (type) > 1000
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Request throughput on Zookeeper is more than 1000 requests per second."
          description: "Request throughput of {{ $labels.type }} on Zookeeper is more than 1000 requests per second, current value is {{ $value }}."

      - alert: HighRequestLatency
        expr: zookeeper_server_requests_latency_ms{job="zookeeper", quantile="0.99"} > 100
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Request latency on Zookeeper server is more than 100ms."
          description: "p99 request latency of {{ $labels.type }} on Zookeeper server {{ $labels.instance }} is more than 100ms, current value is {{ $value }} ms."

  - name: bookie
    rules:
      - alert: HighEntryAddLatency
        expr: bookkeeper_server_ADD_ENTRY_REQUEST{job="bookie", quantile="0.99", success="true"} > 100
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Entry add latency is more than 100ms"
          description: "Entry add latency on bookie {{ $labels.instance }} is more than 100ms, current value is {{ $value }}."

      - alert: HighEntryReadLatency
        expr: bookkeeper_server_READ_ENTRY_REQUEST{job="bookie", quantile="0.99", success="true"} > 1000
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Entry read latency is more than 1s"
          description: "Entry read latency on bookie {{ $labels.instance }} is more than 1s, current value is {{ $value }}."

  - name: broker
    rules:
      - alert: StorageWriteLatencyOverflow
        expr: pulsar_storage_write_latency{job="broker"} > 1000
        for: 30s
        labels:
          status: danger
        annotations:
          summary: "Storage write latency is more than 1000ms."
          description: "Storage write latency for topic {{ $labels.topic }} is more than 1000ms, current value is {{ $value }}."

      - alert: TooManyTopics
        expr: sum(pulsar_topics_count{job="broker"}) by (cluster) > 1000000
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Topic count is more than 1000000."
          description: "Topic count in cluster {{ $labels.cluster }} is more than 1000000, current value is {{ $value }}."

      - alert: TooManyProducersOnTopic
        expr: pulsar_producers_count > 10000
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Producers on topic are more than 10000."
          description: "Producers on topic {{ $labels.topic }} are more than 10000, current value is {{ $value }}."

      - alert: TooManySubscriptionsOnTopic
        expr: pulsar_subscriptions_count > 100
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Subscriptions on topic are more than 100."
          description: "Subscriptions on topic {{ $labels.topic }} are more than 100, current value is {{ $value }}."

      - alert: TooManyConsumersOnTopic
        expr: pulsar_consumers_count > 10000
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Consumers on topic are more than 10000."
          description: "Consumers on topic {{ $labels.topic }} are more than 10000, current value is {{ $value }}."

      - alert: TooManyBacklogsOnTopic
        expr: pulsar_msg_backlog > 50000
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Backlogs of topic are more than 50000."
          description: "Backlogs of topic {{ $labels.topic }} are more than 50000, current value is {{ $value }}."

      - alert: TooManyGeoBacklogsOnTopic
        expr: pulsar_replication_backlog > 50000
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Geo backlogs of topic are more than 50000."
          description: "Geo backlogs of topic {{ $labels.topic }} are more than 50000, current value is {{ $value }}."
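
Several expressions above (HighRequestThroughput, and the CPU and IO rules) rely on rate()/irate() semantics. irate() derives a per-second rate from the last two samples of a counter; conceptually, in a simplified sketch that ignores counter resets:

```python
def irate(samples: list) -> float:
    """Per-second instant rate from the last two (timestamp_seconds, value)
    samples of a monotonically increasing counter, like PromQL irate()."""
    (t1, v1), (t2, v2) = samples[-2:]
    return (v2 - v1) / (t2 - t1)

# A request counter that grew from 100 to 160 over 30 seconds: 2 requests/s
assert irate([(0, 100), (30, 160)]) == 2.0
```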

Restart Prometheus

pkill prometheus
nohup ./prometheus --config.file="prometheus.yml" > /dev/null 2>&1 &

Open the Prometheus UI and check that Alertmanager and the rules are loaded.



Install the DingTalk plugin and configure alerting

Download the DingTalk plugin

https://github.com/timonwong/prometheus-webhook-dingtalk/releases

Install the DingTalk webhook

tar xf prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz -C /workspace/
cd /workspace/prometheus-webhook-dingtalk-1.4.0.linux-amd64/

Edit the configuration

vim config.yml
## Request timeout
timeout: 5s

## Customizable templates path
templates:
  - contrib/templates/legacy/template.tmpl

## You can also override default template using `default_message`
## The following example to use the 'legacy' template from v0.3.0
default_message:
  title: '{{ template "legacy.title" . }}'
  text: '{{ template "legacy.content" . }}'
  # alert message template

## Targets, previously known as "profiles"
targets:
  webhook1:
    # DingTalk robot webhook URL
    url: https://oapi.dingtalk.com/robot/send?access_token=0984thjkl36fd6c60eb6de6d0a0d50432df175bc38beb544d75b704e360b5fee
    # secret for the robot's signature verification
    secret: SEC07c9120064529d241891452e315b6258ed159053da681d79f21xxdbe93axxexx
  webhook_mention_all:
    url: https://oapi.dingtalk.com/robot/send?access_token=0984thjkl36fd6c60eb6de6d0a0d50432df175bc38beb544d75b704e360b5fee
    secret: SEC07c9120064529d241891452e315b6258ed159053da681d79f21xxdbe93axxexx
    mention:
      all: true
  webhook_mention_users:
    url: https://oapi.dingtalk.com/robot/send?access_token=0984thjkl36fd6c60eb6de6d0a0d50432df175bc38beb544d75b704e360b5fee
    mention:
      mobiles: ['18618666666']
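
The plugin computes the robot signature from `secret` automatically. If you ever need to call the robot API directly, DingTalk's documented signing scheme is an HMAC-SHA256 over the string "timestamp newline secret", base64- and URL-encoded, appended to the URL as &timestamp=...&sign=.... A stdlib-only sketch:

```python
import base64
import hashlib
import hmac
import time
import urllib.parse

def dingtalk_sign(secret: str, timestamp_ms: int) -> str:
    """Sign a DingTalk custom-robot request: HMAC-SHA256 of the string
    'timestamp' + newline + 'secret', keyed with the secret, then
    base64- and URL-encoded."""
    string_to_sign = f"{timestamp_ms}\n{secret}".encode("utf-8")
    digest = hmac.new(secret.encode("utf-8"), string_to_sign, hashlib.sha256).digest()
    return urllib.parse.quote_plus(base64.b64encode(digest).decode("utf-8"))

# Append to the robot URL: ...&timestamp=<ts>&sign=<sign>
ts = int(time.time() * 1000)
sign = dingtalk_sign("SEC-example-secret", ts)
```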

Start the DingTalk webhook

nohup ./prometheus-webhook-dingtalk  > /dev/null 2>&1 &

Test the DingTalk alerts

Stop node_exporter on any node; the DingTalk group should receive an alert from the robot.
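
Instead of stopping node_exporter, you can also exercise the whole pipeline by posting a synthetic alert to Alertmanager's v2 API (the labels below are assumptions; adjust them to match your routes). A sketch that builds the JSON payload:

```python
import json
from datetime import datetime, timezone

# Minimal alert for Alertmanager's v2 API; POST it with e.g.:
#   curl -XPOST -H 'Content-Type: application/json' \
#        -d "$payload" http://localhost:9093/api/v2/alerts
payload = json.dumps([{
    "labels": {"alertname": "DingTalkPipelineTest", "severity": "page", "team": "DevOps"},
    "annotations": {"summary": "Test alert", "description": "Manual pipeline test"},
    "startsAt": datetime.now(timezone.utc).isoformat(),
}])
print(payload)
```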


Start node_exporter again; the robot sends a recovery notification as well.


The Pulsar alerts fire the same way.



After a fault is resolved, a recovery notification is also received.



