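Prometheus Monitoring System

Prometheus Overview

Overview

Prometheus (written in Go) is an open-source combination of a monitoring system, an alerting system, and a time series database. It is well suited to monitoring Docker containers, and the popularity of Kubernetes (commonly abbreviated k8s) has driven its adoption.

Prometheus is an all-rounder: it supports container monitoring natively and also handles traditional applications. Every monitoring system follows the same basic flow: data collection ---> data processing ---> data storage ---> data display ---> alerting.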

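Features

Multi-dimensional data model: time series identified by a metric name and key-value pairs
PromQL: a flexible query language that uses this dimensionality to express complex queries
No dependency on distributed storage; a single server node works on its own
Time series are collected with a pull model over HTTP
Pushing time series is supported through the Pushgateway component
Targets are discovered through service discovery or static configuration
Multiple graphing modes and dashboard support (Grafana)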

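Task Analysis

Why is monitoring needed?

Monitoring collects data in real time so that problems are detected through alerts and handled promptly. The collected data also provides a basis for optimization.
What to monitor
- Host status and operating system
- Services and applications
- Resources: CPU, memory, disk
- URLs and ports
What to monitor with: Prometheus with node_exporter, mysqld_exporter, and blackbox_exporter; or zabbix-server with zabbix-agent, and so on.
When to monitor: 24x7
Who receives alerts: the administrators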

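Prometheus Components and Architecture

Name	Description
Prometheus Server	Collects metrics, stores time series data, and provides a query interface
Push Gateway	Stores metrics for a short time; mainly used for short-lived jobs
Exporters	Collect metrics from existing third-party services and expose them on a metrics endpoint
Alertmanager	Alerting component
Web UI	Simple web console

Architecture diagram from the official Prometheus site

Prometheus integrates the whole pipeline: collection, processing, storage, display, and alerting.

The rough workflow is:

The Prometheus server periodically pulls metrics from the configured jobs or exporters, pulls the metrics that short-lived jobs have pushed to the Pushgateway, or pulls metrics from other Prometheus servers.
The Prometheus server stores the collected metrics locally and evaluates the defined alert.rules, either recording new time series or pushing alerts to Alertmanager.
Alertmanager processes the alerts it receives according to its configuration file and sends out notifications.
The collected data is visualized in a graphical interface.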

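Prometheus Data Model

Prometheus stores all data as time series; samples with the same metric name and the same set of labels belong to the same time series.

After Prometheus fetches data from a source, it stores it in the built-in TSDB as time series data. Each series has a metric name: if you monitor nginx, for example, you first give the measurement a name, and that name is the metric name. A series also carries any number of labels. As a rough analogy, think of the metric name as a table name and the labels as fields. Every time series is therefore uniquely identified by its metric name and a set of key-value pairs.

The time series notation is:

<metric name>{<label name>=<label value>, ...}

The metric name is the name of the measurement and the label name is the name of a label; a series can have several labels:

nginx_http_access{method="GET",uri="/index.html"}

Here the metric name is nginx_http_access, followed by two labels and their values. You can add further labels; the more labels you define, the more dimensions you can query by.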
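
The identity rule above can be sketched in a few lines of Python. This is only an illustrative model, not how the TSDB is implemented; the helper names series_key and format_series are made up for this example.

```python
from typing import Dict, Tuple

SeriesKey = Tuple[str, Tuple[Tuple[str, str], ...]]


def series_key(metric: str, labels: Dict[str, str]) -> SeriesKey:
    """Return a hashable identity: metric name plus the sorted label pairs."""
    return metric, tuple(sorted(labels.items()))


def format_series(metric: str, labels: Dict[str, str]) -> str:
    """Render the identity in the <metric name>{<label>="<value>",...} notation."""
    body = ",".join(f'{name}="{value}"' for name, value in sorted(labels.items()))
    return f"{metric}{{{body}}}"


a = series_key("nginx_http_access", {"method": "GET", "uri": "/index.html"})
b = series_key("nginx_http_access", {"uri": "/index.html", "method": "GET"})
c = series_key("nginx_http_access", {"method": "POST", "uri": "/index.html"})

print(a == b)  # True: label order does not matter
print(a == c)  # False: a different label value is a different series
print(format_series("nginx_http_access", {"method": "GET", "uri": "/index.html"}))
```

Changing any label value, or adding a label, produces a new series, which is why labels with many distinct values should be added with care.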

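Prometheus Metric Types

Type	Description
Counter	A cumulative counter that only increases (or resets to zero on restart); suitable for values such as the number of API requests
Gauge	A value that can go up and down arbitrarily; suitable for values such as CPU usage
Histogram	Samples observations (such as request durations or response sizes) and counts them in configurable buckets; also provides the sum and the count of all observed values
Summary	Similar to Histogram; typical uses are request durations and response sizes. Provides the count and sum of observations plus configurable quantiles, so results can be tracked by percentile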
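
To make the Histogram row more concrete, here is a minimal Python sketch of how raw observations turn into cumulative bucket counters plus a count and a sum. The bucket bounds and the function name observe_all are assumptions chosen for illustration; real client libraries do this bookkeeping for you.

```python
from bisect import bisect_left
from typing import Dict, List


def observe_all(values: List[float], bounds: List[float]) -> Dict[str, float]:
    """Build cumulative histogram series from raw observations.

    bounds must be sorted ascending; an implicit +Inf bucket is appended.
    """
    counts = [0] * (len(bounds) + 1)
    for value in values:
        # Index of the first bucket whose upper bound is >= value.
        counts[bisect_left(bounds, value)] += 1

    series: Dict[str, float] = {}
    running = 0
    for bound, n in zip(bounds + [float("inf")], counts):
        running += n  # buckets are cumulative: each includes all smaller ones
        le = "+Inf" if bound == float("inf") else str(bound)
        series[f'_bucket{{le="{le}"}}'] = running
    series["_count"] = len(values)
    series["_sum"] = sum(values)
    return series


result = observe_all([0.0625, 0.25, 0.5, 0.75, 3.0], [0.1, 0.5, 1.0])
for name, value in result.items():
    print(name, value)
```

Because the buckets are cumulative, the +Inf bucket always equals the total count.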

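Installing and Deploying Prometheus

We start with a binary deployment and monitor traditional applications first.

Prometheus download page: https://github.com/prometheus/prometheus/releases

Find the release that matches your platform and download it.

cd /usr/local/
export VER="2.21.0"
wget https://github.com/prometheus/prometheus/releases/download/v${VER}/prometheus-${VER}.linux-amd64.tar.gz
# Unpack the archive before linking it
tar xzf prometheus-${VER}.linux-amd64.tar.gz
ln -s prometheus-${VER}.linux-amd64 prometheus
# The release archive keeps the prometheus and promtool binaries at its top level
echo 'PATH=/usr/local/prometheus:$PATH:$HOME/bin' >> /etc/profile
source /etc/profile

prometheus.yml is its configuration file.

There are many startup flags; the main ones are the following.

--config.file=/usr/local/prometheus/prometheus.yml   # location of the configuration file
--storage.tsdb.path=/usr/local/prometheus/data   # data storage directory
--storage.tsdb.retention=60d  # how long to keep data, 15 days by default (newer releases prefer --storage.tsdb.retention.time)

The local TSDB is not well suited to long-term storage and copes poorly with very large data volumes. External storage, for example InfluxDB, can be attached for that purpose.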

Register it as a boot-time service

vim /usr/lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus
Documentation=https://prometheus.io/
After=network.target


[Service]
Type=simple
ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --storage.tsdb.path=/usr/local/prometheus/data --storage.tsdb.retention=60d
Restart=on-failure


[Install]
WantedBy=multi-user.target

systemctl daemon-reload && systemctl enable prometheus.service && systemctl start prometheus.service

After a successful start, open http://yourip:9090 to see the Prometheus web UI.

Prometheus Configuration File Explained

prometheus.yml; official documentation: https://prometheus.io/docs/prometheus/latest/configuration/configuration/

global:
  [ scrape_interval: <duration> | default = 1m ]      ## scrape interval
  [ scrape_timeout: <duration> | default = 10s ]      ## scrape timeout
  [ evaluation_interval: <duration> | default = 1m ]  ## rule evaluation interval
  external_labels:                                    ## labels attached when talking to external systems
    [ <labelname>: <labelvalue> ... ]

Alerting rule files

rule_files:
  [ - <filepath_glob> ... ]

Monitored targets

scrape_configs:
  [ - <scrape_config> ... ]

Alerting

alerting:
  alert_relabel_configs:
    [ - <relabel_config> ... ]
  alertmanagers:
    [ - <alertmanager_config> ... ]

Remote storage

remote_write:
  [ - <remote_write> ... ]
remote_read:
  [ - <remote_read> ... ]

scrape_configs

This section defines what is monitored. The commonly used settings are:

job_name: <job_name>  ## name of the job
[ scrape_interval: <duration> | default = <global_config.scrape_interval> ]
[ scrape_timeout: <duration> | default = <global_config.scrape_timeout> ]  ## these two set the scrape timing; by default inherited from global
[ metrics_path: <path> | default = /metrics ]  ## metrics path, /metrics by default
[ honor_labels: <boolean> | default = false ]  ## on a label conflict, whether labels from the scraped data win; by default the server-side labels win
[ scheme: <scheme> | default = http ]  ## protocol used for requests, http by default
params:
  [ <string>: [<string>, ...] ]        ## URL parameters sent with each request
basic_auth:
  [ username: <string> ]
  [ password: <secret> ]
  [ password_file: <string> ]          ## username and password for the endpoint
[ bearer_token: <secret> ]
[ bearer_token_file: /path/to/bearer/token/file ]  ## bearer token for authentication
tls_config:
  [ <tls_config> ]                     ## TLS settings such as the CA certificate
[ proxy_url: <string> ]                ## reach the targets through a proxy

consul_sd_configs:                   ## discovery through Consul
  [ - <consul_sd_config> ... ]
dns_sd_configs:                      ## discovery through DNS
  [ - <dns_sd_config> ... ]
file_sd_configs:                   ## discovery through files
  [ - <file_sd_config> ... ]
kubernetes_sd_configs:               ## discovery through Kubernetes
  [ - <kubernetes_sd_config> ... ]

static_configs:
  [ - <static_config> ... ]     ## statically configured targets

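Configuration for Prometheus to monitor itself:

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

Finally, the label-related settings:

relabel_configs:
  [ - <relabel_config> ... ]          ## relabel the target's labels before the scrape
metric_relabel_configs:
  [ - <relabel_config> ... ]          ## relabel the scraped samples before they are stored
[ sample_limit: <int> | default = 0 ] ## maximum number of samples accepted per scrape; 0 means no limit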

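relabel_configs

This section rewrites labels. The key point of the Prometheus data model is that a metric name plus a set of labels forms a multi-dimensional model, and complex queries rely on those dimensions. relabel_configs processes labels: it lets you modify the labels of any target before the scrape, for example to rename a label when names collide.

instance is a label that Prometheus adds by default. relabel_configs can rewrite labels, delete labels, and filter targets by label. The configuration fields are:

relabel_configs:
  [ source_labels: '[' <labelname> [, ...] ']' ]   ## source labels: the existing labels to operate on
  [ separator: <string> | default = ; ]            ## separator used to join multiple source label values
  [ target_label: <labelname> ]                    ## label that receives the resulting value
  [ regex: <regex> | default = (.*) ]              ## regex matched against the joined source values; matches everything by default
  [ modulus: <uint64> ]                            ## modulus for the hashmod action; rarely needed
  [ replacement: <string> | default = $1 ]         ## replacement value; may reference regex groups as $1, $2, $3
  [ action: <relabel_action> | default = replace ] ## action to perform based on the regex match, replace by default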

2.1 Adding labels

- targets:
  - "192.168.227.132:9100"
  - "192.168.227.133:9100"
  labels:
    server: 'c6-node'

Restart Prometheus and the added label becomes visible.

You can then query by this label, here combined with the built-in sum aggregation:

sum(process_cpu_seconds_total{server="c6-node"})
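
What this query does can be modelled roughly in Python: select the samples whose metric name and labels match, then add their values. The sample values and the function name sum_matching are invented for this sketch; PromQL itself also handles timestamps, staleness, and grouping, which are ignored here.

```python
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, str], float]


def sum_matching(samples: List[Sample], metric: str, matchers: Dict[str, str]) -> float:
    """Sum the values of all samples with the given name and equal-matched labels."""
    total = 0.0
    for name, labels, value in samples:
        if name == metric and all(labels.get(k) == v for k, v in matchers.items()):
            total += value
    return total


samples: List[Sample] = [
    ("process_cpu_seconds_total", {"instance": "192.168.227.132:9100", "server": "c6-node"}, 1.5),
    ("process_cpu_seconds_total", {"instance": "192.168.227.133:9100", "server": "c6-node"}, 2.5),
    ("process_cpu_seconds_total", {"instance": "10.0.0.1:9100", "server": "other"}, 10.0),
]

print(sum_matching(samples, "process_cpu_seconds_total", {"server": "c6-node"}))  # 4.0
```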

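2.2 Renaming labels

This derives a new label from an existing one.

Suppose the targets currently carry job="DMC_HOST" and should also carry rmhost="c6-node1". With relabeling, the configuration looks like this:

relabel_configs:
    - action: replace
      source_labels: ['job'] ## source label
      regex: (.*)            ## regex; matches the value of job, i.e. DMC_HOST
      replacement: 'c6-node1' ## literal value to write; use $1 here to copy the matched job value instead
      target_label: 'rmhost'  ## new label, named rmhost

That is all that is needed. New samples carry the rmhost label, and the original job label is kept because nothing deletes it. You can now aggregate by the new label.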
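
The replace action used here can be approximated with a short Python model. It is a simplified sketch, not Prometheus code: the function name relabel_replace is hypothetical, Python's regex dialect differs from the RE2 dialect Prometheus uses, and special cases such as an empty result removing the label are left out.

```python
import re
from typing import Dict, List


def relabel_replace(labels: Dict[str, str], source_labels: List[str],
                    target_label: str, regex: str = "(.*)",
                    replacement: str = "$1", separator: str = ";") -> Dict[str, str]:
    """Simplified model of `action: replace`; returns a new label dict."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, joined)  # the regex must match the whole value
    if match is None:
        return dict(labels)  # no match: labels stay unchanged
    # Translate $1 / ${1} group references into Python's \g<1> form.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


before = {"job": "DMC_HOST", "instance": "192.168.227.132:9100"}
after = relabel_replace(before, ["job"], "rmhost", "(.*)", "c6-node1")
print(after)  # job is kept, rmhost="c6-node1" is added

copied = relabel_replace(before, ["job"], "job_copy")  # default replacement $1
print(copied["job_copy"])  # DMC_HOST
```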

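The action field

The available actions are:

Value	Description
replace	Default. Matches regex against the joined source_labels and, on a match, sets target_label to replacement, with group references in replacement substituted
keep	Drops targets whose joined source_labels do not match regex
drop	Drops targets whose joined source_labels match regex
labeldrop	Matches regex against all label names and removes every label that matches
labelkeep	Matches regex against all label names and removes every label that does not match
hashmod	Sets target_label to the hash of the joined source_labels modulo modulus
labelmap	Matches regex against all label names and copies the values of matching labels to label names given by replacement, with group references (${1}, ${2}, ...) substituted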
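
The filtering actions in the table are easy to confuse, so here is a small Python sketch of keep, drop, and labeldrop. The function names and the example label tmp_debug are assumptions for illustration only, and Python's regex engine stands in for the one Prometheus uses.

```python
import re
from typing import Dict, List

Labels = Dict[str, str]


def keep_or_drop(targets: List[Labels], action: str, source_labels: List[str],
                 regex: str, separator: str = ";") -> List[Labels]:
    """Model of the keep and drop actions: filter whole targets."""
    if action not in ("keep", "drop"):
        raise ValueError(f"unsupported action: {action}")
    remaining = []
    for labels in targets:
        joined = separator.join(labels.get(name, "") for name in source_labels)
        matched = re.fullmatch(regex, joined) is not None
        if matched == (action == "keep"):
            remaining.append(labels)
    return remaining


def labeldrop(labels: Labels, regex: str) -> Labels:
    """Model of labeldrop: remove every label whose name matches the regex."""
    return {name: value for name, value in labels.items()
            if re.fullmatch(regex, name) is None}


targets = [
    {"__address__": "192.168.227.132:9100", "server": "c6-node"},
    {"__address__": "192.168.227.133:9100", "server": "other"},
]
print(keep_or_drop(targets, "keep", ["server"], "c6-.*"))
print(keep_or_drop(targets, "drop", ["server"], "c6-.*"))
print(labeldrop({"job": "nodes", "tmp_debug": "1"}, "tmp_.*"))
```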

File-Based Service Discovery

For example:

- job_name: 'DMC_HOST'
  file_sd_configs:
    - files: ['/usr/local/prometheus/files_sd_configs/*.yml']

Restart Prometheus, then create yml files in that directory.
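
Target files like these are often generated by a script. The sketch below, with hypothetical helper names, renders one target group in the same YAML layout used in this article and writes it through a temporary file followed by a rename, on the assumption that this keeps Prometheus from ever reading a half-written file.

```python
import json
import os
import tempfile
from typing import Dict, List


def render_target_group(targets: List[str], labels: Dict[str, str]) -> str:
    """Render one target group in the YAML layout read by file_sd_configs."""
    lines = ["- targets:"]
    # json.dumps yields a double-quoted string, which is also a valid YAML scalar.
    lines += [f"  - {json.dumps(target)}" for target in targets]
    if labels:
        lines.append("  labels:")
        lines += [f"    {name}: {json.dumps(value)}"
                  for name, value in sorted(labels.items())]
    return "\n".join(lines) + "\n"


def write_atomically(path: str, content: str) -> None:
    """Write to a temporary file in the same directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    os.replace(tmp_path, path)


print(render_target_group(["192.168.227.132:9100"], {"server": "c6-node"}), end="")
```

The temporary file ends in .tmp, so a *.yml glob does not pick it up before the rename.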

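Monitoring with Prometheus

node_exporter

node_exporter is an exporter that collects system metrics and some metrics of running software and exposes them so that Prometheus can scrape them.

Official GitHub repository: https://github.com/prometheus/node_exporter

cd /usr/local
wget https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-amd64.tar.gz
# Unpack the archive before linking it
tar xzf node_exporter-1.0.1.linux-amd64.tar.gz
ln -s node_exporter-1.0.1.linux-amd64 /usr/local/node_exporter
cd /usr/local/node_exporter
./node_exporter --help

Configuring the boot-time service

On CentOS 7: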

[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/node_exporter/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target

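On CentOS 6:

Here we recommend supervisord.

Installing with yum install supervisor gives version 2.1.9. The 2.x releases have many problems: the supervisord process starts, but the supervisorctl command does not work properly.

pip install supervisor installs the latest release by default (4.0.3 at the time of writing), but CentOS 6.5 ships only Python 2.6, on which supervisor 4.0.3 does not run. The exact error was not recorded. Upgrading to Python 2.7 is possible but cumbersome.

We therefore suggest pinning supervisor 3.1.3, which runs on Python 2.6 and works right after installation.

yum install python-pip
pip install supervisor==3.1.3
# Create the required directories
mkdir /var/run/supervisor
mkdir /etc/supervisor
mkdir /etc/supervisor/supervisord.d
mkdir /var/log/supervisor
# Create the configuration file
echo_supervisord_conf > /etc/supervisor/supervisord.conf
vim /etc/supervisor/supervisord.conf # mostly defaults; only the last line, which includes the sub-configuration files, needs changing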

; Sample supervisor config file.

[unix_http_server]
file=/var/run/supervisor/supervisor.sock   ; (the path to the socket file)
;chmod=0700                 ; sockef file mode (default 0700)
;chown=nobody:nogroup       ; socket file uid:gid owner
;username=user              ; (default is no username (open server))
;password=123               ; (default is no password (open server))

;[inet_http_server]         ; inet (TCP) server disabled by default
;port=127.0.0.1:9001        ; (ip_address:port specifier, *:port for all iface)
;username=user              ; (default is no username (open server))
;password=123               ; (default is no password (open server))

[supervisord]
logfile=/var/log/supervisor/supervisord.log  ; (main log file;default $CWD/supervisord.log)
logfile_maxbytes=50MB       ; (max main logfile bytes b4 rotation;default 50MB)
logfile_backups=10          ; (num of main logfile rotation backups;default 10)
loglevel=info               ; (log level;default info; others: debug,warn,trace)
pidfile=/var/run/supervisord.pid ; (supervisord pidfile;default supervisord.pid)
nodaemon=false              ; (start in foreground if true;default false)
minfds=1024                 ; (min. avail startup file descriptors;default 1024)
minprocs=200                ; (min. avail process descriptors;default 200)
;umask=022                  ; (process file creation umask;default 022)
;user=chrism                 ; (default is current user, required if root)
;identifier=supervisor       ; (supervisord identifier, default is 'supervisor')
;directory=/tmp              ; (default is not to cd during start)
;nocleanup=true              ; (don't clean up tempfiles at start;default false)
;childlogdir=/tmp            ; ('AUTO' child log dir, default $TEMP)
;environment=KEY=value       ; (key value pairs to add to environment)
;strip_ansi=false            ; (strip ansi escape codes in logs; def. false)

; the below section must remain in the config file for RPC
; (supervisorctl/web interface) to work, additional interfaces may be
; added by defining them in separate rpcinterface: sections
[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

[supervisorctl]
serverurl=unix:///var/run/supervisor/supervisor.sock ; use a unix:// URL  for a unix socket
;serverurl=http://127.0.0.1:9001 ; use an http:// url to specify an inet socket
;username=chris              ; should be same as http_username if set
;password=123                ; should be same as http_password if set
;prompt=mysupervisor         ; cmd line prompt (default "supervisor")
;history_file=~/.sc_history  ; use readline history if available

; The below sample program section shows all possible program subsection values,
; create one or more 'real' program: sections to be able to control them under
; supervisor.

;[program:theprogramname]
;command=/bin/cat              ; the program (relative uses PATH, can take args)
;process_name=%(program_name)s ; process_name expr (default %(program_name)s)
;numprocs=1                    ; number of processes copies to start (def 1)
;directory=/tmp                ; directory to cwd to before exec (def no cwd)
;umask=022                     ; umask for process (default None)
;priority=999                  ; the relative start priority (default 999)
;autostart=true                ; start at supervisord start (default: true)
;autorestart=true              ; retstart at unexpected quit (default: true)
;startsecs=10                  ; number of secs prog must stay running (def. 1)
;startretries=3                ; max # of serial start failures (default 3)
;exitcodes=0,2                 ; 'expected' exit codes for process (default 0,2)
;stopsignal=QUIT               ; signal used to kill process (default TERM)
;stopwaitsecs=10               ; max num secs to wait b4 SIGKILL (default 10)
;user=chrism                   ; setuid to this UNIX account to run the program
;redirect_stderr=true          ; redirect proc stderr to stdout (default false)
;stdout_logfile=/a/path        ; stdout log path, NONE for none; default AUTO
;stdout_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)
;stdout_logfile_backups=10     ; # of stdout logfile backups (default 10)
;stdout_capture_maxbytes=1MB   ; number of bytes in 'capturemode' (default 0)
;stdout_events_enabled=false   ; emit events on stdout writes (default false)
;stderr_logfile=/a/path        ; stderr log path, NONE for none; default AUTO
;stderr_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)
;stderr_logfile_backups=10     ; # of stderr logfile backups (default 10)
;stderr_capture_maxbytes=1MB   ; number of bytes in 'capturemode' (default 0)
;stderr_events_enabled=false   ; emit events on stderr writes (default false)
;environment=A=1,B=2           ; process environment additions (def no adds)
;serverurl=AUTO                ; override serverurl computation (childutils)

; The below sample eventlistener section shows all possible
; eventlistener subsection values, create one or more 'real'
; eventlistener: sections to be able to handle event notifications
; sent by supervisor.

;[eventlistener:theeventlistenername]
;command=/bin/eventlistener    ; the program (relative uses PATH, can take args)
;process_name=%(program_name)s ; process_name expr (default %(program_name)s)
;numprocs=1                    ; number of processes copies to start (def 1)
;events=EVENT                  ; event notif. types to subscribe to (req'd)
;buffer_size=10                ; event buffer queue size (default 10)
;directory=/tmp                ; directory to cwd to before exec (def no cwd)
;umask=022                     ; umask for process (default None)
;priority=-1                   ; the relative start priority (default -1)
;autostart=true                ; start at supervisord start (default: true)
;autorestart=unexpected        ; restart at unexpected quit (default: unexpected)
;startsecs=10                  ; number of secs prog must stay running (def. 1)
;startretries=3                ; max # of serial start failures (default 3)
;exitcodes=0,2                 ; 'expected' exit codes for process (default 0,2)
;stopsignal=QUIT               ; signal used to kill process (default TERM)
;stopwaitsecs=10               ; max num secs to wait b4 SIGKILL (default 10)
;user=chrism                   ; setuid to this UNIX account to run the program
;redirect_stderr=true          ; redirect proc stderr to stdout (default false)
;stdout_logfile=/a/path        ; stdout log path, NONE for none; default AUTO
;stdout_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)
;stdout_logfile_backups=10     ; # of stdout logfile backups (default 10)
;stdout_events_enabled=false   ; emit events on stdout writes (default false)
;stderr_logfile=/a/path        ; stderr log path, NONE for none; default AUTO
;stderr_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)
;stderr_logfile_backups        ; # of stderr logfile backups (default 10)
;stderr_events_enabled=false   ; emit events on stderr writes (default false)
;environment=A=1,B=2           ; process environment additions
;serverurl=AUTO                ; override serverurl computation (childutils)

; The below sample group section shows all possible group values,
; create one or more 'real' group: sections to create "heterogeneous"
; process groups.

;[group:thegroupname]
;programs=progname1,progname2  ; each refers to 'x' in [program:x] definitions
;priority=999                  ; the relative start priority (default 999)

; The [include] section can just contain the "files" setting.  This
; setting can list multiple files (separated by whitespace or
; newlines).  It can also contain wildcards.  The filenames are
; interpreted as relative to this file.  Included files *cannot*
; include files themselves.

[include]
files = /etc/supervisor/supervisord.d/*.ini

The init script for supervisord, /etc/init.d/supervisord:

#!/bin/bash
#
# supervisord   This scripts turns supervisord on
#
# Author:       Mike McGrath <[email protected]> (based off yumupdatesd)
#
# chkconfig:    - 95 04
#
# description:  supervisor is a process control utility.  It has a web based
#               xmlrpc interface as well as a few other nifty features.
# processname:  supervisord
# config: /etc/supervisord.conf
# pidfile: /var/run/supervisord.pid
#

# source function library
. /etc/rc.d/init.d/functions
PIDFILE=/var/run/supervisord.pid
RETVAL=0

start() {
        echo -n $"Starting supervisord: "
        daemon "supervisord --pidfile=$PIDFILE -c /etc/supervisor/supervisord.conf"
        RETVAL=$?
        echo
        [ $RETVAL -eq 0 ] && touch /var/lock/subsys/supervisord
}

stop() {
        echo -n $"Stopping supervisord: "
        killproc supervisord
        RETVAL=$?   # record the result so the lock file is removed only on success
        echo
        [ $RETVAL -eq 0 ] && rm -f /var/lock/subsys/supervisord
}

restart() {
        stop
        start
}

case "$1" in
  start)
        start
        ;;
  stop)
        stop
        ;;
  restart|force-reload|reload)
        restart
        ;;
  condrestart)
        [ -f /var/lock/subsys/supervisord ] && restart
        ;;
  status)
        status supervisord
        RETVAL=$?
        ;;
  *)
        echo $"Usage: $0 {start|stop|status|restart|reload|force-reload|condrestart}"
        exit 1
esac

exit $RETVAL



# Enable at boot
chkconfig supervisord on

Create the sub-configuration file for node_exporter:

[root@c6-node1 supervisor]# cat /etc/supervisor/supervisord.d/node_exporter.ini
[program:node_exporter]
# Command that starts the program
command = /usr/local/node_exporter/node_exporter
# Start automatically when supervisord starts
autostart = true
# Restart automatically after an abnormal exit
autorestart = true
# Consider the program started if it has not exited within 5 seconds
startsecs = 5
# Number of retries after a failed start, 3 by default
startretries = 3
# User the program runs as
user = root
# Redirect stderr to stdout, false by default
redirect_stderr = true
# Standard output log
stdout_logfile=/usr/local/node_exporter/logs/out.log
# Error log
stderr_logfile=/usr/local/node_exporter/logs/err.log
# Maximum size of the standard output log, 50MB by default
stdout_logfile_maxbytes = 20MB
# Number of standard output log backups
stdout_logfile_backups = 20

# Start
/etc/init.d/supervisord start
supervisorctl restart node_exporter

You can then test with curl -s 127.0.0.1:9100/metrics | head.
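
The text returned by the /metrics endpoint is line-oriented and simple to inspect with a script. The following Python sketch parses the common case of that format; it is a simplified reader written for this article (timestamps and escaped characters in label values are not handled), and the sample text is made up rather than captured from a real host.

```python
import re
from typing import Dict, Iterator, Tuple

LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> Iterator[Tuple[str, Dict[str, str], float]]:
    """Yield (metric name, labels, value) for each sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        yield name, labels, float(value)


SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.25
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""

for row in parse_metrics(SAMPLE):
    print(row)
```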

Configuring Prometheus to monitor this host

# vim prometheus.yml
  - job_name: "nodes"
    file_sd_configs:
      - files: ['/usr/local/prometheus/nodes_sd_configs/*.yml']
        refresh_interval: 5s

# cd /usr/local/prometheus/ && mkdir nodes_sd_configs && cd nodes_sd_configs
# Reload the Prometheus configuration
# ps aux | grep prometheus.yml  | grep -v grep  | awk '{print $2}' | xargs kill -HUP
# vim nodes1.yml
- targets: ['192.168.227.132:9100']
  labels:
    name: server132

You can now verify the target in the Prometheus web UI.

mysqld_exporter

Official repository: https://github.com/prometheus/mysqld_exporter

Create a MySQL account with the required privileges

mysql> CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'exporter';
Query OK, 0 rows affected (0.02 sec)

mysql> GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
Query OK, 0 rows affected (0.00 sec)

mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)

mysql> select user,host from mysql.user;

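Download, install, and configure mysqld_exporter

cd /usr/local
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.12.1/mysqld_exporter-0.12.1.linux-amd64.tar.gz
# Unpack the archive before linking it
tar xzf mysqld_exporter-0.12.1.linux-amd64.tar.gz
ln -s mysqld_exporter-0.12.1.linux-amd64 mysqld_exporter

Boot-time service on CentOS 7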

[Unit]
Description=mysqld_exporter
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/mysqld_exporter/mysqld_exporter --collect.info_schema.processlist.processes_by_host --collect.info_schema.processlist.processes_by_user --collect.info_schema.innodb_tablespaces --collect.info_schema.innodb_metrics --collect.perf_schema.tableiowaits --collect.perf_schema.indexiowaits --collect.perf_schema.tablelocks --collect.engine_innodb_status --collect.perf_schema.file_events --collect.binlog_size --collect.info_schema.clientstats --collect.perf_schema.eventswaits --config.my-cnf=/etc/.my.cnf
Restart=on-failure

[Install]
WantedBy=multi-user.target

On CentOS 6, supervisord is again used for management.

Create the configuration file:

[program:mysqld_exporter]
# Command that starts the program
command = /usr/local/mysqld_exporter/mysqld_exporter --collect.info_schema.processlist.processes_by_host --collect.info_schema.processlist.processes_by_user --collect.info_schema.innodb_tablespaces --collect.info_schema.innodb_metrics --collect.perf_schema.tableiowaits --collect.perf_schema.indexiowaits --collect.perf_schema.tablelocks --collect.engine_innodb_status --collect.perf_schema.file_events --collect.binlog_size --collect.info_schema.clientstats --collect.perf_schema.eventswaits --config.my-cnf=/etc/my.cnf
# Start automatically when supervisord starts
autostart = true
# Restart automatically after an abnormal exit
autorestart = true
# Consider the program started if it has not exited within 5 seconds
startsecs = 5
# Number of retries after a failed start, 3 by default
startretries = 3
# User the program runs as
user = root
# Redirect stderr to stdout, false by default
redirect_stderr = true
# Standard output log
stdout_logfile=/usr/local/mysqld_exporter/logs/out.log
# Error log
stderr_logfile=/usr/local/mysqld_exporter/logs/err.log
# Maximum size of the standard output log, 50MB by default
stdout_logfile_maxbytes = 20MB
# Number of standard output log backups
stdout_logfile_backups = 20

Then start it.

Verify by opening the MySQL metrics endpoint: http://192.168.227.133:9104/metrics

Add the following to the Prometheus configuration file:

  - job_name: 'MySQL'
    file_sd_configs:
      - files: ['./mysql.yml']
        refresh_interval: 15s

# mysql.yml
- targets:
  - "192.168.227.133:9104"

You can also verify this under Targets in the Prometheus web UI.

jmx_exporter

Official repository: https://github.com/prometheus/jmx_exporter

The official project only provides a JMX exporter, so to monitor Tomcat we attach it through Tomcat's JAVA_OPTS.

Installing jmx_exporter

cd /usr/local/
mkdir jmx_exporter && cd jmx_exporter
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.14.0/jmx_prometheus_javaagent-0.14.0.jar
# Download the example yml configuration for Tomcat
wget https://raw.githubusercontent.com/prometheus/jmx_exporter/master/example_configs/tomcat.yml

Starting

## jmx_exporter can be attached as a javaagent when a jar is started directly. Here the target is Tomcat, where the usual approach is to add JAVA_OPTS in path/bin/catalina.sh so that the agent starts together with Tomcat.
## Given Tomcat's startup order, editing catalina.sh directly is not recommended. Create path/bin/setenv.sh instead; Tomcat loads it automatically at startup.
## Here the agent listens on port 38080. With several Tomcat instances, give each instance its own port.
# Create the file with the following content; path is your Tomcat directory
vi path/bin/setenv.sh
export JAVA_OPTS="-javaagent:/usr/local/jmx_exporter/jmx_prometheus_javaagent-0.14.0.jar=38080:/usr/local/jmx_exporter/tomcat.yml"
# Restart Tomcat
cd path/bin
sh shutdown.sh
sh startup.sh

Then test access:
```
curl http://127.0.0.1:38080/metrics
```

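Configuring Prometheus

  - job_name: "java"
    file_sd_configs:
      - files: ['./tomcat.yml']
        refresh_interval: 15s
# Contents of tomcat.yml
- targets:
  - "192.168.50.204:38080"

Reload Prometheus and check the target list.

After the agent is added, Tomcat no longer shuts down cleanly. A likely cause is that shutdown.sh also picks up JAVA_OPTS, so the short-lived stop JVM loads the agent and tries to bind the same port; putting the agent option in CATALINA_OPTS, which Tomcat applies only to start and run, is a common way to avoid this. Alternatively, add a shutdown script by hand:

#!/bin/bash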


tomcat_base=/usr/local/server/tomcat
TOMCAT_PATH=${tomcat_base}/bin

echo "TOMCAT_PATH is $TOMCAT_PATH"

PID=`ps aux | grep ${tomcat_base} | grep java | awk '{print $2}'`

if [ -n "$PID" ]; then
        echo "Try to shutdown Tomcat: $PID"
        sh "$TOMCAT_PATH/shutdown.sh"
                sleep 1
fi

for((i=0;i<10;i++))
do
        PID2=`ps aux | grep ${tomcat_base} | grep java | awk '{print $2}'`
            
        if [ -n "$PID2" ]; then
                        if [ $i -ge 9 ] ; then
                                echo "Try to kill Tomcat: $PID2"
                                ((i--))
                                kill -9 $PID2
                        else
                                echo "wait to kill Tomcat: $PID2"
                        fi
                        sleep 1
        else 
                echo "Tomcat is closed"
                break
        fi
done

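redis_exporter

Official repository: https://github.com/oliver006/redis_exporter

Installation

cd /usr/local/
wget https://github.com/oliver006/redis_exporter/releases/download/v1.11.1/redis_exporter-v1.11.1.linux-amd64.tar.gz
# Unpack the archive before linking it
tar xzf redis_exporter-v1.11.1.linux-amd64.tar.gz
ln -s redis_exporter-v1.11.1.linux-amd64 redis_exporter

Configuring the boot-time service

On CentOS 7: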

[root@c2 local]# cat /usr/lib/systemd/system/redis_exporter.service 
[Unit]
Description=redis_exporter
After=network.target

[Service]
Restart=on-failure
ExecStart=/usr/local/redis_exporter/redis_exporter -redis.addr 192.168.50.182:6379

[Install]
WantedBy=multi-user.target

Prometheus configuration

  - job_name: 'Redis'
    static_configs:
    - targets: ['localhost:9121']

Then reload Prometheus.

nginx-prometheus-exporter

Official repository: https://github.com/nginxinc/nginx-prometheus-exporter

Enable stub_status in nginx

server {
    listen localhost:38080;
    location /metrics {
      stub_status on;
    }
}

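Installing and configuring nginx-prometheus-exporter

cd /usr/local/
mkdir nginx-prometheus-exporter && cd nginx-prometheus-exporter
wget https://github.com/nginxinc/nginx-prometheus-exporter/releases/download/v0.8.0/nginx-prometheus-exporter-0.8.0-linux-amd64.tar.gz
# Unpack the archive; it is assumed to place the binary in the current directory
tar xzf nginx-prometheus-exporter-0.8.0-linux-amd64.tar.gz
# On CentOS 7, register the boot-time service
[root@camp204 local]# cat /usr/lib/systemd/system/nginx_prometheus_exporter.service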
[Unit]
Description=NGINX Prometheus Exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/nginx-prometheus-exporter/nginx-prometheus-exporter \
    -web.listen-address=192.168.50.204:9113 \
    -nginx.scrape-uri http://127.0.0.1:38080/metrics

SyslogIdentifier=nginx_prometheus_exporter
Restart=always

[Install]
WantedBy=multi-user.target
# On CentOS 6, supervisord is again used
[program:nginx-prometheus-exporter]
# Command that starts the program
command = /usr/local/nginx-prometheus-exporter/nginx-prometheus-exporter -web.listen-address=192.168.50.204:9113 -nginx.scrape-uri http://127.0.0.1:38080/metrics
# Start automatically when supervisord starts
autostart = true
# Restart automatically after an abnormal exit
autorestart = true
# Consider the program started if it has not exited within 5 seconds
startsecs = 5
# Number of retries after a failed start, 3 by default
startretries = 3
# User the program runs as
user = root
# Redirect stderr to stdout, false by default
redirect_stderr = true
# Standard output log
stdout_logfile=/usr/local/nginx-prometheus-exporter/logs/out.log
# Error log
stderr_logfile=/usr/local/nginx-prometheus-exporter/logs/err.log
# Maximum size of the standard output log, 50MB by default
stdout_logfile_maxbytes = 20MB
# Number of standard output log backups
stdout_logfile_backups = 20

Add the Prometheus configuration

  - job_name: "Nginx"
    static_configs:
      - targets: ['192.168.50.204:9113']

Reload the Prometheus service.

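blackbox_exporter

blackbox_exporter is one of the exporters officially provided by the Prometheus project. It collects monitoring data by probing over HTTP, DNS, TCP, and ICMP.

Official repository: https://github.com/prometheus/blackbox_exporter

Use cases

HTTP probes
- Define request header information
- Check the HTTP status, HTTP response headers, and HTTP body content
TCP probes
- Watch the port status of business components
- Define and watch application-layer protocols
ICMP probes
- Host liveness checks
POST probes
- API connectivity
SSL certificate expiry time

Installing and configuring the boot-time service

cd /usr/local/
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.17.0/blackbox_exporter-0.17.0.linux-amd64.tar.gz
# Unpack the archive before linking it
tar xzf blackbox_exporter-0.17.0.linux-amd64.tar.gz
ln -s blackbox_exporter-0.17.0.linux-amd64 blackbox_exporter
# On CentOS 6, manage it with supervisord
[program:blackbox_exporter]
# Command that starts the program
command = /usr/local/blackbox_exporter/blackbox_exporter --config.file=/usr/local/blackbox_exporter/blackbox.yml
# Start automatically when supervisord starts
autostart = true
# Restart automatically after an abnormal exit
autorestart = true
# Consider the program started if it has not exited within 5 seconds
startsecs = 5
# Number of retries after a failed start, 3 by default
startretries = 3
# User the program runs as
user = root
# Redirect stderr to stdout, false by default
redirect_stderr = true
# Standard output log
stdout_logfile=/usr/local/blackbox_exporter/logs/out.log
# Error log
stderr_logfile=/usr/local/blackbox_exporter/logs/err.log
# Maximum size of the standard output log, 50MB by default
stdout_logfile_maxbytes = 20MB
# Number of standard output log backups
stdout_logfile_backups = 10

# On CentOS 7: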
[Unit]
Description=blackbox exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/blackbox_exporter/blackbox_exporter --config.file=/usr/local/blackbox_exporter/blackbox.yml \
    --web.listen-address=:9115 

SyslogIdentifier=blackbox_exporter
Restart=always

[Install]
WantedBy=multi-user.target

Prometheus configuration

# HTTP probe
  - job_name: 'http_get_all'  # uses a blackbox_exporter module
    scrape_interval: 30s
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - https://frognew.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115 # host and port where blackbox_exporter runs
# Besides checking that an HTTP service is alive, the HTTP probe exposes probe_ssl_earliest_cert_expiry, which can be used to warn before an SSL certificate expires.

# Host liveness
  - job_name: node_status
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets: ['10.165.94.31']
        labels:
          instance: node_status
          group: 'node'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: 172.19.155.133:9115

# Port liveness on a host
  - job_name: 'prometheus_port_status'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets: ['172.19.155.133:8765']
        labels:
          instance: 'port_status'
          group: 'tcp'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 172.19.155.133:9115

# Website status
  - job_name: web_status
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ['http://www.baidu.com']
        labels:
          instance: user_status
          group: 'web'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: 172.19.155.133:9115
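
The relabeling in these jobs moves the original target into the target query parameter and points the scrape address at blackbox_exporter. The Python sketch below shows the URL that results from that arrangement; the function name probe_url is hypothetical, and the addresses are the ones used in the HTTP probe job above.

```python
from urllib.parse import urlencode


def probe_url(exporter_address: str, module: str, target: str,
              metrics_path: str = "/probe", scheme: str = "http") -> str:
    """Build the request URL that a relabeled blackbox scrape ends up calling."""
    # __param_target becomes ?target=..., params.module becomes ?module=...
    query = urlencode({"module": module, "target": target})
    return f"{scheme}://{exporter_address}{metrics_path}?{query}"


url = probe_url("127.0.0.1:9115", "http_2xx", "https://frognew.com")
print(url)
# http://127.0.0.1:9115/probe?module=http_2xx&target=https%3A%2F%2Ffrognew.com
```

Requesting such a URL by hand, for example with curl, is a quick way to check a module before wiring it into Prometheus.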

Alertmanager

Installation and configuration

Official repository: https://github.com/prometheus/alertmanager

Download and configure

cd /usr/local/src/
wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
tar zxvf alertmanager-0.21.0.linux-amd64.tar.gz -C /usr/local/
# Link inside /usr/local, where the archive was unpacked
ln -s /usr/local/alertmanager-0.21.0.linux-amd64 /usr/local/alertmanager
# Configure the boot-time service
cat /usr/lib/systemd/system/alertmanager.service 
[Unit]
Description=Prometheus: the alerting system
Documentation=http://prometheus.io/docs/
After=prometheus.service

[Service]
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml
Restart=always
StartLimitInterval=0
RestartSec=10

[Install]
WantedBy=multi-user.target

systemctl enable alertmanager && systemctl start alertmanager

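Edit the configuration file. Here we use WeCom (Enterprise WeChat) for notifications; for other channels, search online:

Go to the WeCom site (https://work.weixin.qq.com/) and register a WeCom account (no enterprise verification is required).

After logging in, go to App Management ---> create a third-party app, click the create button, and fill in the app details:

[root@IT-ECS-Prometheus alertmanager]# cat alertmanager.yml 
global:
  resolve_timeout: 2m  # an alert that has not been updated for this long is declared resolved
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'             # WeCom API
  wechat_api_secret: 'MWqZ2yaIZvBKkVi5lUxPpuEIgl7q2D6tEFadnf7Hcbs'   # secret of the third-party app
  wechat_api_corp_id: 'ww5fc93f267b956282'            # unique ID of the corporate account, shown under My Company

route:                           # how alerts are dispatched
  group_by: ['alertname']        # labels used to group alerts
  group_wait: 10s                # how long to wait after a new group appears, so alerts of the same group are sent together
  group_interval: 10s            # how long to wait before notifying about new alerts added to a group that was already notified
  repeat_interval: 8h    # how long to wait before repeating a notification that was already sent, which reduces duplicate WeChat messages
  receiver: 'wechat'     # default receiver

receivers:                  # receiver definitions
- name: 'wechat'
  wechat_configs:
  - send_resolved: true
    to_user: 'LiYang'   # user to notify; to_party can be used to notify a department instead
    agent_id: '1000016'  # ID of the third-party app
templates:
- /usr/local/alertmanager/template.tmpl  # notification template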
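
The effect of group_by can be pictured with a small Python sketch: alerts that share the same values for the listed labels land in one group and are delivered in one notification. The alert names and the function group_alerts are invented for this example, and the timing settings (group_wait, group_interval, repeat_interval) are not modelled.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, str]
GroupKey = Tuple[Tuple[str, str], ...]


def group_alerts(alerts: List[Alert], group_by: List[str]) -> Dict[GroupKey, List[Alert]]:
    """Bucket alerts by the values of the group_by labels."""
    groups: Dict[GroupKey, List[Alert]] = defaultdict(list)
    for alert in alerts:
        key = tuple((name, alert.get(name, "")) for name in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"alertname": "HostDown", "instance": "192.168.227.132:9100"},
    {"alertname": "HostDown", "instance": "192.168.227.133:9100"},
    {"alertname": "HighCPU", "instance": "192.168.227.132:9100"},
]

for key, members in group_alerts(alerts, ["alertname"]).items():
    print(key, len(members))
```

With group_by: ['alertname'], the two HostDown alerts above would arrive as a single WeChat message.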

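Configure the alerting rules. The rules have been uploaded to GitHub: https://github.com/skymyyang/Prometheus-alertrules-configs

See the PrometheusConfig configuration in that project for reference.

Restart the Prometheus service; the alerting rules then appear in the Prometheus web UI.

DingTalk alerting reference: https://mp.weixin.qq.com/s/QRK49Xa-HWjnrw9Vvkmh4A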

Grafana

Installation and configuration

cd /usr/local/src/
export VER="7.1.5"
wget https://dl.grafana.com/oss/release/grafana-${VER}-1.x86_64.rpm
yum localinstall -y grafana-${VER}-1.x86_64.rpm

Start the service

systemctl daemon-reload
systemctl enable grafana-server.service
systemctl stop grafana-server.service
systemctl restart grafana-server.service

Install the required plugins

# The pie chart plugin is required
grafana-cli plugins install grafana-piechart-panel
# Install the Consul data source plugin
grafana-cli plugins install sbueringer-consul-datasource
# Install the pmm-singlestat-panel plugin
# Percona's MySQL dashboard needs this plugin, but installing it with grafana-cli kept failing, so it is installed by hand
cd /usr/local/src/
wget https://jira.percona.com/secure/attachment/22830/22830_pmm-singlestat-panel.tgz
tar xvf 22830_pmm-singlestat-panel.tgz
mv pmm-singlestat-panel /var/lib/grafana/plugins/
systemctl restart grafana-server

Import the dashboards

https://grafana.com/grafana/dashboards/11074  basic monitoring (new)

https://grafana.com/dashboards/8919   basic monitoring

https://grafana.com/dashboards/7362   database monitoring
https://github.com/percona/grafana-dashboards/blob/master/dashboards/MySQL_Overview.json # Percona MySQL dashboard

A problem encountered with the dashboards: some machines run CentOS 6, whose kernel is too old to report MemAvailable, so that metric is missing. There we can only use node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes as an approximation of available memory when computing memory usage.

Edit the panel and change the memory usage expression to:

(1 - ((node_memory_MemFree_bytes{job=~"$job"} + node_memory_Buffers_bytes{job=~"$job"} + node_memory_Cached_bytes{job=~"$job"} ) / (node_memory_MemTotal_bytes{job=~"$job"})))* 100

After the change the panel displays correctly.
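
As a sanity check, the same arithmetic can be run in Python with made-up numbers. The function name memory_used_percent and the byte values are assumptions for this example; the formula mirrors the expression above.

```python
def memory_used_percent(total: float, free: float, buffers: float, cached: float) -> float:
    """Approximate memory usage when MemAvailable is not reported."""
    if total <= 0:
        raise ValueError("total must be positive")
    approx_available = free + buffers + cached
    return (1 - approx_available / total) * 100


GIB = 1024 ** 3
# 8 GiB total, with 1 GiB free, 0.5 GiB buffers and 2.5 GiB cached.
print(memory_used_percent(8 * GIB, 1 * GIB, 0.5 * GIB, 2.5 * GIB))  # 50.0
```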

Dashboard for blackbox_exporter

Grafana dashboard ID: 9965

Dashboard for Redis

Grafana dashboard ID: 763
