Prometheus + Grafana (2): MySQL, Redis, Docker Containers, Service Endpoints, and Alerting

Continuing from the previous post, "Prometheus + Grafana (1): Monitoring", this post looks at more advanced uses of Prometheus + Grafana.

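Goals

In this post we build a visual platform that monitors microservices along several dimensions: Docker containers, MySQL, Redis, the microservice JVM, and so on. It can also send alert emails when needed.

The main components are Prometheus, Grafana, alertmanager, node_exporter, mysql_exporter, redis_exporter, and cadvisor. Their roles are: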

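  1. Prometheus: scrapes and stores monitoring data and serves it to third-party queries;
  2. Grafana: provides the web UI and visualizes the monitoring data it reads from Prometheus;
  3. alertmanager: receives the alerts that Prometheus fires from its alerting rules, then groups them and sends the notifications;
  4. node_exporter: collects metrics for the host the services run on (CPU, memory, disk, network); it is maintained as part of the Prometheus project;
  5. mysql_exporter: collects MySQL database metrics;
  6. redis_exporter: collects Redis metrics;
  7. cadvisor: collects Docker container metrics.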

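Installing Grafana, Prometheus, and the monitoring services with Docker

In the previous post we installed Grafana and Prometheus with their Windows installers. Day-to-day production environments mostly run Linux, though, so here we deploy with Docker for convenience.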

  • Create prometheus.yml in your own mount directory
# Create the Prometheus mount directory
mkdir -p /dimples/volumes/prometheus

# Create the Prometheus config file in that directory
vim /dimples/volumes/prometheus/prometheus.yml
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']
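  • Create alertmanager.yml in your own mount directory (the compose file below mounts it from /dimples/volumes/alertmanager/alertmanager.yml)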
global:
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  # authorization code generated in the QQ Mail settings
  smtp_auth_password: 'xxxxxxxxxxxxxxxxx'
  smtp_require_tls: false

#templates:
#  - '/alertmanager/template/*.tmpl'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 5m
  receiver: 'default-receiver'

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: '[email protected]'
        send_resolved: true
  • Create the docker-compose.yml file
version: '3'

services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    volumes:
      - /dimples/volumes/prometheus/:/etc/prometheus/
    ports:
      - 9090:9090
    restart: on-failure
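    command:
      # Setting `command` replaces the image's default arguments, so pass the config path explicitly
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-lifecycle'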
  grafana:
    image: grafana/grafana
    container_name: grafana
    ports:
      - 3000:3000
  node_exporter:
    image: prom/node-exporter
    container_name: node_exporter
    ports:
      - 9100:9100
  redis_exporter:
    image: oliver006/redis_exporter
    container_name: redis_exporter
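    command:
      # 127.0.0.1 is resolved inside the container; replace it with an address where Redis is reachable
      - "--redis.addr=redis://127.0.0.1:6379"
      - "--redis.password=ZHONG9602.class"    # auth password; omit this argument if Redis has no password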
    ports:
      - 9101:9121
    restart: on-failure
  mysql_exporter:
    image: prom/mysqld-exporter
    container_name: mysql_exporter
    environment:
      - DATA_SOURCE_NAME=root:123456@(127.0.0.1:3306)/
    ports:
      - 9102:9104
  cadvisor:
    image: google/cadvisor
    container_name: cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - 9103:8080
  alertmanager:
    image: prom/alertmanager
    container_name: alertmanager
    volumes:
      - /dimples/volumes/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - 9104:9093

Start the services with docker-compose up -d

# Installing without docker-compose
docker run -d --name prometheus -p 9090:9090 -v /dimples/volumes/prometheus/:/etc/prometheus/ prom/prometheus --config.file=/etc/prometheus/prometheus.yml --web.enable-lifecycle

docker run -d --name redis_exporter -p 9101:9121 oliver006/redis_exporter --redis.addr redis://127.0.0.1:6379 --redis.password 'ZHONG9602.class'
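  • Check that data is being collected

http://127.0.0.1:9090/alerts

Once the alert rules memory_over.yml and server_down.yml (created later in this post) are in place, this page lists both of them as loaded.

Next, visit http://127.0.0.1:9090/targets to see the state of each job defined in the Prometheus config file:

Every target should be in the UP state.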
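
The same check can be scripted against the Prometheus HTTP API, which exposes target state at /api/v1/targets. The following is a minimal sketch rather than a finished tool: the function names are made up for illustration, the base URL assumes the 9090 port mapping from the compose file above, and the sample payload is hand-written so the snippet runs without a live server.

```python
import json
import urllib.request


def unhealthy_targets(payload):
    """Return (job, scrape_url, last_error) for every active target that is not "up"."""
    problems = []
    for target in payload.get("data", {}).get("activeTargets", []):
        if target.get("health") != "up":
            job = target.get("labels", {}).get("job", "")
            problems.append((job, target.get("scrapeUrl", ""), target.get("lastError", "")))
    return problems


def fetch_targets(base_url="http://127.0.0.1:9090"):
    """Fetch the target list from a running Prometheus (needs network access)."""
    with urllib.request.urlopen(base_url + "/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


# Hand-written sample shaped like the API response (not real output).
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "node_exporter", "instance": "node_exporter"},
                "scrapeUrl": "http://127.0.0.1:9100/metrics",
                "health": "up",
                "lastError": "",
            },
            {
                "labels": {"job": "mysql_exporter", "instance": "mysql_exporter"},
                "scrapeUrl": "http://127.0.0.1:9102/metrics",
                "health": "down",
                "lastError": "connection refused",
            },
        ]
    },
}

print(unhealthy_targets(SAMPLE))
```

Against a live server, `unhealthy_targets(fetch_targets())` returns an empty list when every job is UP.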

You can also click each Endpoint link on that page. If the page shows collected data, that Endpoint is being scraped successfully. Taking mysql_exporter as an example, visit
http://127.0.0.1:9102/metrics
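
What that page returns is plain text in the Prometheus exposition format: comment lines starting with #, then one `name{labels} value` sample per line. The sketch below shows one way to read a value such as mysql_up out of it. It is a deliberately simple parser for illustration (the function names are hypothetical and the sample text is hand-written, not real exporter output); a real project would more likely use a client library.

```python
import urllib.request


def parse_samples(text):
    """Map each series (metric name plus any label set) to its float value."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        if "}" in line:
            # Labels may contain spaces, so cut after the closing brace.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:]
        else:
            series, _, rest = line.partition(" ")
        fields = rest.split()
        if fields:
            samples[series] = float(fields[0])  # ignore the optional timestamp
    return samples


def fetch_metrics(url="http://127.0.0.1:9102/metrics"):
    """Download the metrics page from an exporter (needs network access)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


# Hand-written sample in the exposition format.
SAMPLE = """\
# HELP mysql_up Whether the MySQL server is up.
# TYPE mysql_up gauge
mysql_up 1
mysql_global_status_threads_connected 3
mysql_global_status_commands_total{command="select"} 42
"""

metrics = parse_samples(SAMPLE)
print(metrics["mysql_up"])
```

With the exporter running, `parse_samples(fetch_metrics())["mysql_up"]` being 1.0 means the exporter can reach MySQL.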

Visit http://127.0.0.1:9104/#/status to check that the settings in alertmanager.yml have taken effect:

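Monitoring a Java application

With the configuration above, Grafana only displays the data Prometheus collects about itself. Our real aim is to have Prometheus collect data from Java applications. For that, Prometheus uses the pull model mentioned earlier to periodically scrape the metrics that Micrometer collects and Spring Boot exposes through Actuator.

  • First, integrate Micrometer into the Java application so that actuator/prometheus or /prometheus returns the metrics Micrometer collects (the previous post covered this step in detail, so it is not repeated here)
  • Go to the directory where Prometheus is deployed, open prometheus.yml, and add a job for the Java application under the scrape_configs node

Updating prometheus.yml to cover every monitored service

The Prometheus instance started above has no jobs configured other than itself, so we edit prometheus.yml to scrape the data sources we care about. The updated file is: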

global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['127.0.0.1:9090']
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['127.0.0.1:9100']
        labels:
          instance: 'node_exporter'
  - job_name: 'redis_exporter'
    static_configs:
      - targets: ['127.0.0.1:9101']
        labels:
          instance: 'redis_exporter'
  - job_name: 'mysql_exporter'
    static_configs:
      - targets: ['127.0.0.1:9102']
        labels:
          instance: 'mysql_exporter'
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['127.0.0.1:9103']
        labels:
          instance: 'cadvisor'

  - job_name: 'server-demo-actuator'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['127.0.0.1:8001']
        labels:
          instance: 'server-demo'
rule_files:
  - 'memory_over.yml'
  - 'server_down.yml'
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["127.0.0.1:9104"]

PS: each job's targets is an array, so one job can collect data from exporters on several servers.
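
Prometheus does not pick up edits to prometheus.yml on its own. Because it was started with --web.enable-lifecycle, a POST to /-/reload makes it re-read the file without a restart. A minimal sketch, with hypothetical helper names and the base URL assumed from the 9090 port mapping above:

```python
import urllib.request


def build_reload_request(base_url="http://127.0.0.1:9090"):
    """Build the POST request that asks Prometheus to reload its config."""
    return urllib.request.Request(base_url.rstrip("/") + "/-/reload", method="POST")


def reload_prometheus(base_url="http://127.0.0.1:9090"):
    """Send the reload request (needs a running Prometheus); returns the HTTP status."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=5) as resp:
        return resp.status


# Building the request does not touch the network, so this part runs anywhere.
req = build_reload_request()
print(req.get_method(), req.full_url)
```

If the edited file contains an error, the reload request fails (urlopen raises an HTTPError) and Prometheus keeps running with the previous configuration.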

Next, create the two rule files referenced above, memory_over.yml and server_down.yml

# Create memory_over.yml
vim /dimples/volumes/prometheus/memory_over.yml

Contents:

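# Group and alert names are illustrative; the expression is the host memory formula used later in this post
groups:
  - name: memory_over
    rules:
      - alert: NodeMemoryUsage
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 80
        for: 20s
        labels:
          user: Dimples
        annotations:
          summary: "{{ $labels.instance }}: high memory usage"
          description: "{{ $labels.instance }}: memory usage is above 80% (current value: {{ $value }})"

When a node's memory usage stays above 80% for more than 20 seconds, the alert fires.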
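
To make the 80% threshold concrete, here is the arithmetic behind a memory usage figure, written out as a small sketch. It assumes the formula used by the host memory rule later in this post (total minus free, buffers, and cached, divided by total); the function name and the sample numbers are invented for illustration.

```python
def memory_usage_percent(total, free, buffers, cached):
    """Memory usage in percent: (total - (free + buffers + cached)) / total * 100."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - (free + buffers + cached)) / total * 100


# Illustrative values in GiB (any unit works as long as all four match).
usage = memory_usage_percent(total=8, free=1, buffers=0.5, cached=1.5)
print(usage, usage > 80)
```

Buffers and page cache are subtracted because the kernel can reclaim them, so they do not count as memory that applications are holding.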

Then create server_down.yml:

# Create server_down.yml
vim /dimples/volumes/prometheus/server_down.yml

Contents:

groups:
  - name: server_down
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 20s
        labels:
          user: Dimples
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 20 s."

When a node has been down for more than 20 seconds (up == 0 means down, 1 means running normally), the alert fires.
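
The for: 20s clause interacts with evaluation_interval: 15s from prometheus.yml. The rule is only checked every 15 seconds, and an alert stays pending until the condition has held for at least 20 seconds across evaluations, so in practice it fires on the third consecutive failed check. The sketch below is a simplified model of that behaviour, not Prometheus code; the function name and the sample data are invented for illustration.

```python
def alert_states(up_samples, interval_s=15, for_s=20):
    """Return the alert state after each rule evaluation.

    up_samples holds the value of `up` seen at each evaluation (1 = running, 0 = down).
    """
    states = []
    active_since = None  # time of the first evaluation where up == 0
    for i, up in enumerate(up_samples):
        now = i * interval_s
        if up == 0:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= for_s else "pending")
        else:
            active_since = None  # condition cleared, the timer resets
            states.append("inactive")
    return states


print(alert_states([1, 0, 0, 0, 1]))
```

A single failed scrape followed by a recovery never leaves the pending state, which is the point of using for: it filters out short blips.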

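Using it in Grafana

Open http://127.0.0.1:3000 in a browser. The username and password are admin/admin, and the first login asks you to change the password.

Step 1: add a data source. The previous post covered this in detail, so it is not repeated here.

Once the data source is added, we can add dashboards. As before, you can pick a ready-made template from the official Grafana dashboard library: https://grafana.com/grafana/dashboards

I have collected a few useful monitoring templates and uploaded them to Weiyun; just download and import them (link: https://share.weiyun.com/XDzICKtf)

Taking MySQL monitoring as an example, here is how to import a template:

Click Upload JSON file and select the file. When it loads, the import screen opens automatically; then click Import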

Additional notes

A richer set of alerting rules (a Prometheus rule file; notifications go out through alertmanager)

groups:
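- name: example # rule group name
  rules:
  - alert: InstanceDown  # alert name
    expr: up == 0   # PromQL expression that triggers the alert
    for: 1m            # must hold for one minute
    labels:       # labels carrying the alert's severity and target
      name: instance
      severity: Critical
    annotations:  # annotations
      summary: " {{ $labels.appname }}" # alert summary, taken from the appname label
      description: " Service has stopped running "   # alert message
      value: "{{ $value }}"  # current value of the alert expression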
- name: Host
  rules:
  - alert: HostMemory Usage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 >  80
    for: 1m
    labels:
      name: Memory
      severity: Warning
    annotations:
      summary: " {{ $labels.appname }} "
      description: "Host memory usage is above 80%."
      value: "{{ $value }}"
  - alert: HostCPU Usage
    expr: sum(avg without (cpu)(irate(node_cpu_seconds_total{mode!='idle'}[5m]))) by (instance,appname) > 0.65
    for: 1m
    labels:
      name: CPU
      severity: Warning
    annotations:
      summary: " {{ $labels.appname }} "
      description: "Host CPU usage is above 65%."
      value: "{{ $value }}"
  - alert: HostLoad 
    expr: node_load5 > 4
    for: 1m
    labels:
      name: Load
      severity: Warning
    annotations:
      summary: "{{ $labels.appname }} "
      description: " Host 5-minute load average is above 4."
      value: "{{ $value }}"
  - alert: HostFilesystem Usage
    expr: 1-(node_filesystem_free_bytes / node_filesystem_size_bytes) >  0.8
    for: 1m
    labels:
      name: Disk
      severity: Warning
    annotations:
      summary: " {{ $labels.appname }} "
      description: " Host partition [ {{ $labels.mountpoint }} ] is more than 80% full."
      value: "{{ $value }}"
  - alert: HostDiskio
    expr: irate(node_disk_writes_completed_total{job=~"Host"}[1m]) > 10
    for: 1m
    labels:
      name: Diskio
      severity: Warning
    annotations:
      summary: " {{ $labels.appname }} "
      description: " Host disk [{{ $labels.device }}] write IO rate has been high over the last minute."
      value: "{{ $value }}iops"
  - alert: Network_receive
    expr: irate(node_network_receive_bytes_total{device!~"lo|bond[0-9]|cbr[0-9]|veth.*|virbr.*|ovs-system"}[5m]) / 1048576  > 3 
    for: 1m
    labels:
      name: Network_receive
      severity: Warning
    annotations:
      summary: " {{ $labels.appname }} "
      description: " Host NIC [{{ $labels.device }}] receive rate is above 3 MiB/s."
      value: "{{ $value }}MiB/s"
  - alert: Network_transmit
    expr: irate(node_network_transmit_bytes_total{device!~"lo|bond[0-9]|cbr[0-9]|veth.*|virbr.*|ovs-system"}[5m]) / 1048576  > 3
    for: 1m
    labels:
      name: Network_transmit
      severity: Warning
    annotations:
      summary: " {{ $labels.appname }} "
      description: " Host NIC [{{ $labels.device }}] transmit rate is above 3 MiB/s."
      value: "{{ $value }}MiB/s"
- name: Container
  rules:
  - alert: ContainerCPU Usage
    expr: (sum by(name,instance) (rate(container_cpu_usage_seconds_total{image!=""}[5m]))*100) > 60
    for: 1m
    labels:
      name: CPU
      severity: Warning
    annotations:
      summary: "{{ $labels.name }} "
      description: " Container CPU usage is above 60%."
      value: "{{ $value }}%"
  - alert: ContainerMem Usage
#    expr: (container_memory_usage_bytes - container_memory_cache)  / container_spec_memory_limit_bytes   * 100 > 10  
    expr:  container_memory_usage_bytes{name=~".+"}  / 1048576 > 1024
    for: 1m
    labels:
      name: Memory
      severity: Warning
    annotations:
      summary: "{{ $labels.name }} "
      description: " Container memory usage is above 1 GiB."
      value: "{{ $value }}MiB"

Besides email, alerts can also be delivered through WeChat Work (企業微信); see: https://songjiayang.gitbooks.io/prometheus/content/alertmanager/wechat.html
