1. Overview of the components:
- prometheus:
    - The server-side daemon. It pulls the metrics (monitoring data) collected by the various exporters and stores them in its built-in TSDB time-series database. Data is kept for 15 days by default; the retention period can be changed with a startup flag.
    - The Prometheus project provides many official exporters.
    - Listens on port 9090 by default, exposing a web query UI as well as an HTTP query API.
    - Alerting rules (rules) are configured manually; alerts that fire are sent to alertmanager, which dispatches them through the notification channels configured there.
- grafana:
    - The graph pages built into Prometheus are quite crude, so grafana is used for visualization instead.
    - grafana is a dedicated visualization tool that supports many data sources; Prometheus is only one of them.
    - It has built-in alerting, and alert rules can be configured directly on a panel. However, this mode does not support template variables (the special dashboard variables used for convenient display), so every metric on every host must be configured separately, which makes it of limited practical use.
    - Default port: 3000
- node_exporter:
    - The agent side, one of the many official Prometheus exporters; installed on each monitored host.
    - It collects host and OS information -- cpu, mem, disk, network, filesystem, and so on -- very comprehensively, and publishes the collected metrics over HTTP for the Prometheus server to scrape.
    - Default port: 9100
- cadvisor:
    - The agent side for docker hosts; collects runtime data about the host and its docker containers.
    - It runs as a container itself and listens on port 8080 (the published port can be changed, and mapping it to a different port is recommended).
    - Provides a basic graph page as well as a metrics endpoint.
- alertmanager:
    - Receives alerts sent by Prometheus, groups them by configurable rules, and controls how they are sent (alert frequency, inhibition, routing to different backends, silences, etc.).
    - Supports several notification backends: email, webhook, wechat (WeChat Work) and some commercial alerting platforms.
    - Default port: 9093
- blackbox_exporter:
    - One of the official Prometheus exporters; probes targets over http, dns, tcp and icmp.
    - Can run directly on the Prometheus server node or on a separate node.
    - Default port: 9115
- nginx:
    - Prometheus and alertmanager have no built-in authentication, so nginx fronts all external access, providing basic auth and https.
    - Every component has to expose its own port, so in the docker-compose deployment all containers are placed on the same network and the externally mapped entry ports are configured centrally in nginx, which keeps management simple.
2. prometheus-server
2.1 Official address:
- GitHub project: https://github.com/prometheus/prometheus
2.2 Installing prometheus server
2.2.1 Download and install on linux (centos7)
- Create a system user to run the prometheus server process, with home directory /var/lib/prometheus as the data directory:
    ~]# useradd -r -m -d /var/lib/prometheus prometheus
- Download and install prometheus server, using 2.14.0 as an example:
    wget https://github.com/prometheus/prometheus/releases/download/v2.14.0/prometheus-2.14.0.linux-amd64.tar.gz
    tar -xf prometheus-2.14.0.linux-amd64.tar.gz -C /usr/local/
    cd /usr/local
    ln -sv prometheus-2.14.0.linux-amd64 prometheus
- Create a unit file so systemd manages prometheus:
    vim /usr/lib/systemd/system/prometheus.service
[Unit]
Description=The Prometheus 2 monitoring system and time series database.
Documentation=https://prometheus.io
After=network.target
[Service]
EnvironmentFile=-/etc/sysconfig/prometheus
User=prometheus
ExecStart=/usr/local/prometheus/prometheus \
--storage.tsdb.path=/var/lib/prometheus \
--config.file=/usr/local/prometheus/prometheus.yml \
--web.listen-address=0.0.0.0:9090 \
--web.external-url= $PROM_EXTRA_ARGS
Restart=on-failure
StartLimitInterval=1
RestartSec=3
[Install]
WantedBy=multi-user.target
- Other runtime flags: ./prometheus --help
- Start the service:
    systemctl daemon-reload
    systemctl start prometheus.service
- Remember to open the firewall port:
    iptables -I INPUT -p tcp --dport 9090 -s NETWORK/MASK -j ACCEPT
- Access from a browser:
    http://IP:PORT
2.2.2 Docker install:
- image: prom/prometheus
- Run command:
    $ docker run --name prometheus -d -v ./prometheus:/etc/prometheus/ -v ./db/:/prometheus -p 9090:9090 prom/prometheus --config.file=/etc/prometheus/prometheus.yml --web.listen-address="0.0.0.0:9090" --storage.tsdb.path=/prometheus --web.console.libraries=/usr/share/prometheus/console_libraries --web.console.templates=/usr/share/prometheus/consoles --storage.tsdb.retention=30d
2.3 Configuring prometheus:
2.3.1 Startup flags
- Commonly used flags:
    --config.file=/etc/prometheus/prometheus.yml   # the main configuration file
    --web.listen-address="0.0.0.0:9090"            # listen address and port
    --storage.tsdb.path=/prometheus                # database directory
    --web.console.libraries=/usr/share/prometheus/console_libraries
    --web.console.templates=/usr/share/prometheus/consoles   # console libs and templates
    --storage.tsdb.retention=60d                   # data retention period, default 15d
2.3.2 Configuration file:
- The main Prometheus configuration file is prometheus.yml.
  It consists mainly of the global, rule_files, scrape_configs, alerting, remote_write and remote_read sections:
    - global: the global configuration section.
    - rule_files: paths of the alerting/recording rule files.
    - scrape_configs: the collection of scrape configurations, defining the sets of targets to monitor and the parameters describing how to scrape their metrics. Usually each scrape configuration corresponds to a single job, and its targets can either be listed statically (static_configs) or discovered automatically through one of the service-discovery mechanisms Prometheus supports:
        - job_name: 'nodes'
          static_configs:     # static targets; each host:port/metrics listed here is scraped
          - targets: ['localhost:9100']
          - targets: ['172.20.94.1:9100']
        - job_name: 'docker_host'
          file_sd_configs:    # file-based service discovery; host:port entries in the files (yml or json) become scrape targets
          - files:
            - ./sd_files/docker_host.yml
            refresh_interval: 30s
    - alerting / alertmanagers: the set of Alertmanager instances Prometheus can send alerts to, plus the parameters describing how to talk to them. Each Alertmanager can be listed statically (static_configs) or discovered through service discovery.
    - remote_write: configures the "remote write" mechanism. Define this section when Prometheus should persist data in an external storage system (e.g. InfluxDB); Prometheus then ships sample data over HTTP to the adaptor at the given URL.
    - remote_read: configures the "remote read" mechanism. Prometheus hands incoming queries to the adapter at the given URL; the adapter translates the query conditions into queries against the remote storage service and converts the responses back into a format Prometheus can use.
- Monitoring and alerting rule files: *.yml
    - These define the monitoring rules.
    - They only take effect when listed under rule_files: in the main configuration:
        rule_files:
        - "test_rules.yml"   # path of the alerting-rule file
- Service-discovery definition files: yaml and json are both supported
    - These also have to be referenced from the main configuration:
        file_sd_configs:
        - files:
          - ./sd_files/http.yml
          refresh_interval: 30s
2.3.3 A simple configuration example:
- prometheus.yml example:
    global:
      scrape_interval: 15s      # scrape metrics every 15 seconds
      evaluation_interval: 15s  # evaluate alerting rules every 15 seconds
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["localhost:9093"]   # where alerts are pushed, normally the alertmanager address
    rule_files:
      - "test_rules.yml"   # path of the alerting-rule file
    scrape_configs:
      - job_name: 'node'   # a job_name of your own choosing
        static_configs:    # static targets; ip:port given directly
        - targets: ['localhost:9100']
      - job_name: 'CDG-MS'
        honor_labels: true
        metrics_path: '/prometheus'
        static_configs:
        - targets: ['localhost:8089']
        relabel_configs:
        - target_label: env
          replacement: dev
      - job_name: 'eureka'
        file_sd_configs:   # file-based service discovery
        - files:
          - "/app/enmonster/basic/prometheus/prometheus-2.2.1.linux-amd64/eureka.json"   # json and yml are both supported
          refresh_interval: 30s   # the file is re-read every 30s, so edits need no manual reload
        relabel_configs:
        - source_labels: [__job_name__]
          regex: (.*)
          target_label: job
          replacement: ${1}
        - target_label: env
          replacement: dev
- Alerting-rule file example:
    [root@host40 monitor-bak]# cat prometheus/rules/docker_monitor.yml
    groups:
    - name: "container monitor"
      rules:
      - alert: "Container down: env1"
        expr: time() - container_last_seen{name="env1"} > 60
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Container down: {{$labels.instance}} name={{$labels.name}}"
- File-based service-discovery definition files: *.yml
    [root@host40 monitor]# cat prometheus/sd_files/virtual_lan.yml
    - targets: ['10.10.11.179:9100']
    - targets: ['10.10.11.178:9100']
    [root@host40 monitor]# cat prometheus/sd_files/tcp.yml
    - targets: ['10.10.11.178:8001']
      labels:
        server_name: http_download
    - targets: ['10.10.11.178:3307']
      labels:
        server_name: xiaojing_db
    - targets: ['10.10.11.178:3001']
      labels:
        server_name: test_web
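Service-discovery files like these are often generated from an inventory rather than written by hand. A minimal sketch in Python of producing the JSON shape file_sd_configs expects (the helper name and the sample hosts are made up for illustration):

```python
import json
import os
import tempfile

# Hypothetical helper: turn (host, port, labels) tuples into the file_sd
# shape Prometheus reads: a list of {"targets": [...], "labels": {...}}.
def render_file_sd(entries):
    return [{"targets": ["%s:%d" % (host, port)], "labels": labels}
            for host, port, labels in entries]

doc = render_file_sd([
    ("10.10.11.178", 8001, {"server_name": "http_download"}),
    ("10.10.11.178", 3307, {"server_name": "xiaojing_db"}),
])

# Write atomically so Prometheus never reads a half-written file.
fd, tmp = tempfile.mkstemp(dir=".", suffix=".json")
with os.fdopen(fd, "w") as f:
    json.dump(doc, f, indent=2)
os.replace(tmp, "tcp.json")

print(json.dumps(doc))
```

Because of the refresh_interval setting, Prometheus picks up the rewritten file on its own, no reload needed.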
2.3.5 Other configuration
- Much of the Prometheus configuration is coupled to other components, so it is covered together with the component it belongs to.
2.4 prometheus web-gui
- Web UI address: http://ip:port, e.g. http://10.10.11.40:9090/
    - alerts: view the alerting rules
    - graph: query the collected metric data, with simple plotting
    - status: the runtime configuration and information about the monitored hosts
    - explore the web-gui pages for the details
3. node_exporter
3.1 Introduction
- node_exporter is installed on each monitored node; it collects host monitoring data and serves it over HTTP for Prometheus to scrape.
- The Prometheus project lists many exporters of different kinds: https://prometheus.io/docs/instrumenting/exporters/
3.2 Installing node_exporter
3.2.1 Download and install on linux (centos7):
- Download and unpack:
    wget https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz
    tar xf node_exporter-0.18.1.linux-amd64.tar.gz -C /usr/local/
    cd /usr/local
    ln -sv node_exporter-0.18.1.linux-amd64/ node_exporter
- Create the user:
    useradd -r -m -d /var/lib/prometheus prometheus
- Configure the unit file:
    vim /usr/lib/systemd/system/node_exporter.service
    [Unit]
    Description=Prometheus exporter for machine metrics, written in Go with pluggable metric collectors.
    Documentation=https://github.com/prometheus/node_exporter
    After=network.target
    [Service]
    EnvironmentFile=-/etc/sysconfig/node_exporter
    User=prometheus
    ExecStart=/usr/local/node_exporter/node_exporter \
              $NODE_EXPORTER_OPTS
    Restart=on-failure
    StartLimitInterval=1
    RestartSec=3
    [Install]
    WantedBy=multi-user.target
- Start the service:
    systemctl daemon-reload
    systemctl start node_exporter.service
- Check manually that metrics can be fetched:
    curl http://localhost:9100/metrics
- Open the firewall:
    iptables -I INPUT -p tcp --dport 9100 -s NET/MASK -j ACCEPT
3.2.2 Docker install
- image: quay.io/prometheus/node-exporter (or prom/node-exporter)
- Run command:
    docker run -d --net="host" --pid="host" -v "/:/host:ro,rslave" --name monitor-node-exporter --restart always quay.io/prometheus/node-exporter --path.rootfs=/host --web.listen-address=:9100
- On some older docker versions this fails with: Error response from daemon: linux mounts: Could not find source mount of /
  Workaround: change -v "/:/host:ro,rslave" to -v "/:/host:ro"
3.3 Configuring node_exporter
- Enabling and disabling collectors:
    ./node_exporter --help   # lists all supported collectors; enable or disable metric groups as needed
  e.g. --no-collector.cpu stops collecting CPU metrics.
- Textfile Collector:
    The startup flag --collector.textfile.directory="DIR" enables the textfile collector. It collects the metrics from every *.prom file in the directory; the metrics must be in the Prometheus exposition format.
  Example:
    echo my_batch_job_completion_time $(date +%s) > /path/to/directory/my_batch_job.prom.$$
    mv /path/to/directory/my_batch_job.prom.$$ /path/to/directory/my_batch_job.prom
    echo 'role{role="application_server"} 1' > /path/to/directory/role.prom.$$
    mv /path/to/directory/role.prom.$$ /path/to/directory/role.prom
  Sample metric lines:
    rpc_duration_seconds{quantile="0.5"} 4773
    http_request_duration_seconds_bucket{le="0.5"} 129389
  In other words, when node_exporter cannot collect some metric itself, a script can collect it and write it to a file, and node_exporter exposes it to Prometheus -- which can save you a pushgateway.
- The exposition format and the query syntax are introduced later.
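The write-then-rename pattern above can be sketched in Python too (a minimal sketch; the metric name and target directory are made up for illustration, and only the simple unescaped case is handled):

```python
import os
import tempfile
import time

def write_prom(directory, name, value, labels=None):
    """Atomically write one metric in exposition format to <name>.prom."""
    label_str = ""
    if labels:
        label_str = "{" + ",".join('%s="%s"' % (k, v) for k, v in sorted(labels.items())) + "}"
    line = "%s%s %s\n" % (name, label_str, value)
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".prom.tmp")
    with os.fdopen(fd, "w") as f:
        f.write(line)
    # rename is atomic on POSIX, so the collector never sees a partial file
    os.replace(tmp, os.path.join(directory, name + ".prom"))
    return line

line = write_prom(".", "my_batch_job_completion_time", int(time.time()))
print(line, end="")
```

The atomic rename matters: the textfile collector may read the directory at any moment, and a half-written file would produce parse errors or bogus samples.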
3.4 Configuring prometheus to scrape node_exporter
- Example: prometheus.yml
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'
        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.
        static_configs:
        - targets: ['localhost:9090']
      - job_name: 'nodes'
        static_configs:
        - targets: ['localhost:9100']
        - targets: ['172.20.94.1:9100']
      - job_name: 'node_real_lan'
        file_sd_configs:
        - files:
          - ./sd_files/real_lan.yml
          refresh_interval: 30s
        params:       # optional
          collect[]:
          - cpu
          - meminfo
          - diskstats
          - netdev
          - netstat
          - filefd
          - filesystem
          - xfs
4. cadvisor
4.1 Official address:
- https://github.com/google/cadvisor
- image: gcr.io/google_containers/cadvisor[:v0.36.0]   # requires access to Google registries
- image: google/cadvisor:v0.33.0   # Docker Hub image, older than the Google one
4.2 docker run
sudo docker run \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:ro \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--volume=/dev/disk/:/dev/disk:ro \
--publish=9080:8080 \
--detach=true \
--name=cadvisor \
--privileged \
--device=/dev/kmsg \
google/cadvisor:v0.33.0
4.3 The web page shows simple single-host graphs
4.4 Configuring the Prometheus scrape
- Configuration example:
    - job_name: 'docker'
      static_configs:
      - targets: ['localhost:9080']
5. grafana
5.1 Official addresses
- grafana download: https://grafana.com/grafana/download
- grafana dashboards: https://grafana.com/grafana/dashboards/
5.2 Installing grafana
5.2.1 linux (centos7) install
- Download and install:
    wget https://dl.grafana.com/oss/release/grafana-7.2.2-1.x86_64.rpm
    sudo yum install grafana-7.2.2-1.x86_64.rpm
- The service file (shipped with the rpm):
    [Unit]
    Description=Grafana instance
    Documentation=http://docs.grafana.org
    Wants=network-online.target
    After=network-online.target
    After=postgresql.service mariadb.service mysqld.service
    [Service]
    EnvironmentFile=/etc/sysconfig/grafana-server
    User=grafana
    Group=grafana
    Type=notify
    Restart=on-failure
    WorkingDirectory=/usr/share/grafana
    RuntimeDirectory=grafana
    RuntimeDirectoryMode=0750
    ExecStart=/usr/sbin/grafana-server \
              --config=${CONF_FILE} \
              --pidfile=${PID_FILE_DIR}/grafana-server.pid \
              --packaging=rpm \
              cfg:default.paths.logs=${LOG_DIR} \
              cfg:default.paths.data=${DATA_DIR} \
              cfg:default.paths.plugins=${PLUGINS_DIR} \
              cfg:default.paths.provisioning=${PROVISIONING_CFG_DIR}
    LimitNOFILE=10000
    TimeoutStopSec=20
    [Install]
    WantedBy=multi-user.target
- Start grafana:
    systemctl enable grafana-server.service
    systemctl restart grafana-server.service
  It listens on port 3000 by default.
- Open the firewall:
    iptables -I INPUT -p tcp --dport 3000 -s NET/MASK -j ACCEPT
5.2.2 Docker install
- image: grafana/grafana
    docker run -d --name=grafana -p 3000:3000 grafana/grafana:7.2.2
5.3 Basic grafana workflow
- Web access:
    http://ip:port
  On first login you are asked to set a password; version 7.2 asks you to change it after logging in with the initial credentials, which are admin/admin.
- Workflow:
    - Add a data source
    - Add a dashboard and configure its panels, or download a ready-made dashboard template for the service from https://grafana.com/grafana/dashboards/
    - Import the template: json, link, or template id
    - View the dashboard
- Commonly used template ids:
    - node-exporter: cn/8919, en/11074
    - k8s: 13105
    - docker: 12831
    - alertmanager: 9578
    - blackbox_exporter: 9965
- Resetting the admin password:
    Check the Grafana configuration file for the path of grafana.db.
    Configuration file path: /etc/grafana/grafana.ini
        [paths]
        ;data = /var/lib/grafana
        [database]
        # For "sqlite3" only, path relative to data_path setting
        ;path = grafana.db
    So the full path of grafana.db is /var/lib/grafana/grafana.db.
    Reset the admin password with sqlite3:
        sqlite3 /var/lib/grafana/grafana.db
        sqlite> update user set password = '59acf18b94d7eb0694c61e60ce44c110c7a683ac6a8f09580d626f90f4a242000746579358d77dd9e570e83fa24faa88a8a6', salt = 'F3FAxVm33R' where login = 'admin';
        .exit
    Then log in with admin/admin.
5.4 Grafana alerting:
- Configure the SMTP server and sender mailbox on grafana-server:
    vim /etc/grafana/grafana.ini
    [smtp]
    enabled = true
    host = smtp.126.com:465
    user = [email protected]
    password = PASS
    skip_verify = false
    from_address = [email protected]
    from_name = Grafana Alert
- Add a Notification Channel in the grafana UI:
    Alerting -> Notification Channel; you can "send test" before saving.
- Open a dashboard and add alert rules.
- Because grafana (7.2.2 at the time of writing) does not support template variables in alert queries, its alerting is of limited practical use; in production, alertmanager is recommended instead.
6. prometheus and PromQL:
6.1 PromQL in brief
- PromQL is the query language Prometheus uses against its database: it turns the metrics collected by the exporters into graphable series and into alerting rules.
- Prometheus offers a multi-dimensional data model, in which time series are identified by a metric name plus key/value label pairs.
- A flexible query language that can exploit these dimensions.
- No reliance on distributed storage; single server nodes are autonomous.
- Multiple modes of graphing and dashboarding support.
6.2 Components that use PromQL:
- prometheus server
- client libraries for instrumenting application code
- push gateway
- exporters
- alertmanager
6.3 Metrics
6.3.1 Metric types
- gauges: a single numerical value that can go up or down, e.g.:
    node_boot_time_seconds
    node_boot_time_seconds{instance="10.10.11.40:9100",job="node_real_lan"} 1574040030
- counters: cumulative counts that only ever increase.
- histograms: sample the distribution of observations, from which maximum, minimum, median and percentiles can be derived.
- summaries: quantile statistics over sampled observations.
6.3.2 Labels
- node_boot_time_seconds{instance="10.10.11.40:9100",job="node_real_lan"}
  In the example above, instance and job are labels:
    - job: the job_name defined in prometheus.yml
    - instance: host:port
- Labels can also be defined in the configuration file, e.g.:
    - targets: ['10.10.11.178:3001']
      labels:
        server_name: test_web
  Added labels can then be used when querying Prometheus:
    metric{server_name=...,}
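A sample line like the ones above is just `name{label="value",...} value`. A small parser sketch (regex-based, handling only the common case -- no escapes, timestamps, or HELP/TYPE lines):

```python
import re

# Minimal exposition-format sample-line parser: returns
# (metric name, labels dict, value) for one line.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_sample(line):
    m = LINE_RE.match(line.strip())
    if not m:
        raise ValueError("not a sample line: %r" % line)
    labels = dict(LABEL_RE.findall(m.group("labels") or ""))
    return m.group("name"), labels, float(m.group("value"))

name, labels, value = parse_sample(
    'node_boot_time_seconds{instance="10.10.11.40:9100",job="node_real_lan"} 1574040030')
print(name, labels, value)
```

This is handy for quick checks of scripts that feed the textfile collector.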
6.4 PromQL expressions
- PromQL expressions are both the statements grafana uses to draw graphs and the statements prometheus uses to define alerting rules, so being able to read and write PromQL matters a lot.
6.4.1 An example first:
- Computing CPU utilisation:
    (1-((sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by (instance))/(sum(increase(node_cpu_seconds_total[1m])) by (instance)))) * 100
  The metrics involved:
    node_cpu_seconds_total                # total CPU time used
    node_cpu_seconds_total{mode="idle"}   # idle CPU time; other mode labels: user, system, steal, softirq, irq, nice, iowait, idle
  The functions involved:
    increase(...[1m])        # the increase over one minute
    sum() / sum() by (TAG)   # TAG is a label; here instance identifies the host, so we sum per host -- otherwise multiple hosts would collapse into a single line
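What increase() and rate() compute over a counter can be reproduced with plain arithmetic. A simplified sketch (it ignores counter resets and the boundary extrapolation real PromQL performs):

```python
def increase(samples, window):
    """samples: list of (timestamp, value) for one counter series, sorted by
    time. Returns the increase over the last `window` seconds."""
    end_t, end_v = samples[-1]
    # first sample inside the window
    start_t, start_v = next((t, v) for t, v in samples if t >= end_t - window)
    return end_v - start_v

def rate(samples, window):
    """Per-second average increase over the window."""
    return increase(samples, window) / window

# idle CPU seconds sampled every 15s
samples = [(0, 100.0), (15, 114.0), (30, 128.0), (45, 142.0), (60, 156.0)]
print(increase(samples, 60))  # 56.0 idle seconds accrued over the minute
print(rate(samples, 60))      # ~0.93 idle seconds per second
```

Dividing the idle increase by the total increase, as the expression above does, then yields the idle fraction; one minus that is the utilisation.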
6.4.2 Label selection
- Match operators:
    =    # Select labels that are exactly equal to the provided string.
    !=   # Select labels that are not equal to the provided string.
    =~   # Select labels that regex-match the provided string.
    !~   # Select labels that do not regex-match the provided string.
- Examples:
    node_cpu_seconds_total{mode="idle"}   # mode: a label, a built-in attribute of the metric
    api_http_requests_total{method="POST", handler="/messages"}
    http_requests_total{environment=~"staging|testing|development",method!="GET"}
- Note: a selector must specify a metric name or at least one label matcher that does not match the empty string:
    {job=~".*"}                # Bad!
    {job=~".+"}                # Good!
    {job=~".*",method="get"}   # Good!
6.4.3 Operators
- Time ranges:
    s - seconds, m - minutes, h - hours, d - days, w - weeks, y - years
- Operators:
    + (addition)  - (subtraction)  * (multiplication)  / (division)  % (modulo)  ^ (power/exponentiation)
    == (equal)  != (not-equal)  > (greater-than)  < (less-than)  >= (greater-or-equal)  <= (less-or-equal)
    and (intersection)  or (union)  unless (complement)
- Aggregation operators:
    sum (calculate sum over dimensions)
    min (select minimum over dimensions)
    max (select maximum over dimensions)
    avg (calculate the average over dimensions)
    stddev (calculate population standard deviation over dimensions)
    stdvar (calculate population standard variance over dimensions)
    count (count number of elements in the vector)
    count_values (count number of elements with the same value)
    bottomk (smallest k elements by sample value)
    topk (largest k elements by sample value)
    quantile (calculate φ-quantile (0 ≤ φ ≤ 1) over dimensions)
6.4.4 Functions
    sum() by (instance)   # sum, optionally grouped by a label
    increase()            # the increase over a range, for counter-type metrics
      example:
        increase(node_network_receive_bytes_total[30s])   # received traffic
    rate()                # made for counters: the per-second average increase over the given time range
      example:
        rate(node_network_receive_bytes_total[30s])*8     # inbound bandwidth in bits
    topk()                # given a number x, the x highest series after sorting by value
      examples:
        topk(5,node_cpu_seconds_total)                            # the 5 largest node_cpu_seconds_total series
        topk(5,increase(node_network_receive_bytes_total[10m]))   # top 5 receivers over the last 10 minutes
      note:
        graphed over time, topk() output is jumpy, because the member series change;
        run it in the console to take a single snapshot.
    count()               # count, e.g. count(node_load1 > 5)
    avg() by ()           # average, grouped by (label)
6.4.5 Rules
  Recording rules aggregate scraped data into new time series.
  Example:
- Take the following expression:
    avg(rate(rpc_durations_seconds_count[5m])) by (job, service)
- Record it in prometheus.rules.yml:
    groups:
    - name: example
      rules:
      - record: job_service:rpc_durations_seconds_count:avg_rate5m
        expr: avg(rate(rpc_durations_seconds_count[5m])) by (job, service)
- In prometheus.yml:
    rule_files:
    - 'prometheus.rules.yml'
  This effectively produces a new metric -- one that is computed rather than scraped.
7. alertmanager
7.1 The prometheus + alertmanager alerting flow
- Documentation: https://prometheus.io/docs/alerting/latest/configuration/
- The main steps for setting up alerting and notifications:
    - Install and configure Alertmanager
    - Configure Prometheus to talk to Alertmanager
    - Create alerting rules in Prometheus
7.2 Installing alertmanager
7.2.1 linux (centos7) install:
- Download and install:
    wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.linux-amd64.tar.gz
    tar -xf alertmanager-0.20.0.linux-amd64.tar.gz -C /usr/local
    cd /usr/local && ln -sv alertmanager-0.20.0.linux-amd64/ alertmanager && cd alertmanager
  Start:
    nohup ./alertmanager --config.file="alertmanager.yml" --storage.path="data/" --web.listen-address=":9093" &
7.2.2 Docker install:
- image: prom/alertmanager
- docker run:
    docker run -dit --name monitor-alertmanager -v ./alertmanager/db/:/alertmanager -v ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml -v ./alertmanager/templates/:/etc/alertmanager/templates -p 9093:9093 --restart always --privileged prom/alertmanager --config.file="/etc/alertmanager/alertmanager.yml" --storage.path="/alertmanager" --web.listen-address=":9093"
7.3 Core concepts
- Grouping
    - Grouping collapses alerts of a similar nature into a single notification. It is especially useful when many systems fail at once and hundreds or thousands of alerts may fire at the same time.
    - Example: during a network partition, dozens or hundreds of service instances are running in a cluster and half of them can no longer reach the database. The alerting rules in Prometheus fire one alert per instance that cannot talk to the database, so hundreds of alerts hit Alertmanager. As a user you only want a single page, while still being able to see exactly which instances are affected, so Alertmanager can be configured to group alerts by cluster and alert name and send one compact notification.
    - How alerts are grouped, when grouped notifications go out, and who receives them is configured through the routing tree in the configuration file.
- Inhibition
    - Inhibition is the concept of suppressing notifications for certain alerts while certain other alerts are already firing.
    - Example: an alert is firing that says an entire cluster is unreachable. Alertmanager can be configured to mute all other alerts concerning that cluster, which prevents notifications for the hundreds or thousands of firing alerts that are unrelated to the actual problem.
    - Inhibition is configured in Alertmanager's configuration file.
- Silences
    - A silence is a straightforward way to mute alerts for a given period. Silences are configured with matchers, just like the routing tree; an incoming alert is checked against all equality or regex matchers of the active silences.
    - If it matches, no notifications are sent for that alert.
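The grouping behaviour can be pictured with a few lines of Python (an illustrative sketch, not Alertmanager's actual implementation; the alerts are made up):

```python
from collections import defaultdict

# Each alert is a dict of labels, as Prometheus would send it.
alerts = [
    {"alertname": "DBUnreachable", "cluster": "A", "instance": "10.0.0.1"},
    {"alertname": "DBUnreachable", "cluster": "A", "instance": "10.0.0.2"},
    {"alertname": "DBUnreachable", "cluster": "B", "instance": "10.0.1.7"},
    {"alertname": "HighLatency",   "cluster": "A", "instance": "10.0.0.1"},
]

def group(alerts, group_by):
    """One notification per distinct combination of the group_by labels."""
    groups = defaultdict(list)
    for a in alerts:
        key = tuple(a.get(lbl, "") for lbl in group_by)
        groups[key].append(a)
    return groups

notifications = group(alerts, ["cluster", "alertname"])
# 4 alerts collapse into 3 notifications; the first carries 2 instances.
for key, members in notifications.items():
    print(key, [m["instance"] for m in members])
```

With group_by: [cluster, alertname], the two DBUnreachable alerts from cluster A arrive in a single notification that still lists both affected instances.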
7.4 Pointing prometheus at alertmanager
- alerting:
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["127.0.0.1:9093"]
- rule_files:
    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
    - "rules/*.yml"
- scrape_configs (so alertmanager itself is monitored too):
    scrape_configs:
    - job_name: 'alertmanager'
      static_configs:
      - targets: ['127.0.0.1:9093']
7.5 Writing prometheus rules
- Example:
    [root@xiang-03 /usr/local/prometheus]#cat rules/node.yml
    groups:
    - name: "system info"
      rules:
      - alert: "server down"      # the alert name (alertname)
        expr: up == 0             # the alert expression; the alert fires when it holds
        for: 1m                   # how long it must hold, i.e. time allowed for self-recovery
        labels:                   # unlike metric labels, these are used by alertmanager to route alerts: matching a label decides which notification fires
          severity: critical      # key:value, fully user-defined
        annotations:              # the content of the notification; the labels referenced here ARE the metric labels
          summary: "{{$labels.instance}}: server down"
          description: "{{$labels.instance}}: server unreachable for more than 3mins"
      - alert: "high CPU usage"
        expr: 100-(avg(rate(node_cpu_seconds_total{mode="idle"}[1m]))by(instance)*100) > 40
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}}: high CPU usage"
          description: "{{$labels.instance}}: CPU usage above 40%"
          value: "{{$value}}"
      - alert: "CPU usage above 90%"
        expr: 100-(avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) by(instance)* 100) > 90
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.instance}}: CPU usage at 90%"
          description: "{{$labels.instance}}: CPU usage above 90% for more than 5mins"
          value: "{{$value}}"
- If the configuration files contain Chinese text, make sure they are UTF-8 encoded, otherwise loading them fails with an error.
7.6 Configuring alertmanager
- Full documentation: https://prometheus.io/docs/alerting/latest/configuration/
- Main configuration file: alertmanager.yml
- Template files: *.tmpl
- Only the parts needed here are introduced; see the official documentation for the complete configuration.
7.6.1 alertmanager.yml
- What the main configuration file needs:
    - global: sender-mailbox settings
    - templates: the mail template files (if unset, alertmanager's default templates are used)
    - route / routes: the routing rules, i.e. which label match goes to which backend
    - receivers: the notification backends: email, wechat, webhook, and so on
- An example first:
    vim alertmanager.yml
    global:
      smtp_smarthost: 'xxx'
      smtp_from: 'xxx'
      smtp_auth_username: 'xxx'
      smtp_auth_password: 'xxx'
      smtp_require_tls: false
    templates:
    - '/alertmanager/template/*.tmpl'
    route:
      receiver: 'default-receiver'
      group_wait: 1s        # wait before a group's first notification
      group_interval: 1s    # interval between notifications for the same group
      repeat_interval: 1s   # interval before repeating an already-sent alert
      group_by: [cluster, alertname]
      routes:
      - receiver: test
        group_wait: 1s
        match_re:
          severity: test
    receivers:
    - name: 'default-receiver'
      email_configs:
      - to: '[email protected]'
        html: '{{ template "xx.html" . }}'
        headers: { Subject: " {{ .CommonAnnotations.summary }}" }
    - name: 'test'
      email_configs:
      - to: '[email protected]'
        html: '{{ template "xx.html" . }}'
        headers: { Subject: "second-route match test" }

    vim test.tmpl
    {{ define "xx.html" }}
    <table border="5">
    <tr><td>alert</td>
    <td>disk</td>
    <td>threshold</td>
    <td>start time</td>
    </tr>
    {{ range $i, $alert := .Alerts }}
    <tr><td>{{ index $alert.Labels "alertname" }}</td>
    <td>{{ index $alert.Labels "instance" }}</td>
    <td>{{ index $alert.Annotations "value" }}</td>
    <td>{{ $alert.StartsAt }}</td>
    </tr>
    {{ end }}
    </table>
    {{ end }}
- In detail:
    global:
      resolve_timeout:   # time after which an alert is declared resolved when it is no longer reported
      # plus the mail-related settings shown in the example
    route:               # the root route every alert enters; defines the dispatch policy
      group_by: ['LABEL_NAME','alertname', 'cluster','job','instance',...]
        # the labels used to regroup incoming alerts; e.g. alerts carrying
        # cluster=A and alertname=LatencyHigh are batched into one group
      group_wait: 30s
        # after a new alert group is created, wait at least group_wait before the first
        # notification, so several alerts for the same group can be collected and fired together
      group_interval: 5m
        # after the first notification, wait group_interval before sending a new batch for the group
      repeat_interval: 5m
        # if a notification has already been sent successfully, wait repeat_interval before re-sending it
      match:
        label_name: NAME
        # equality match; matching alerts are sent to the receiver
      match_re:
        label_name: <regex>, ...
        # regex match; matching alerts are sent to the receiver
      receiver: receiver_name
        # where alerts matching match/match_re go (email, webhook, pagerduty, wechat, ...)
        # a default receiver is mandatory: err="root route must specify a default receiver"
      routes:
      - <route> ...
        # child routes, for configuring multiple rules
    templates:
      [ - <filepath> ... ]
      # template files, e.g. the mail layout
    receivers:   # a list
    - name: receiver_name   # the name referenced by route.receiver
      email_configs:        # email notification settings
      - to: <tmpl_string>
        send_resolved: <boolean> | default = false   # whether to notify once the problem is resolved
        # the receiving mailbox; a per-receiver sender mailbox can also be set, see
        # https://prometheus.io/docs/alerting/latest/configuration/#email_config
    - name: ...
      wechat_configs:
      - send_resolved: <boolean> | default = false
        api_secret: <secret> | default = global.wechat_api_secret
        api_url: <string> | default = global.wechat_api_url
        corp_id: <string> | default = global.wechat_api_corp_id
        message: <tmpl_string> | default = '{{ template "wechat.default.message" . }}'
        agent_id: <string> | default = '{{ template "wechat.default.agent_id" . }}'
        to_user: <string> | default = '{{ template "wechat.default.to_user" . }}'
        to_party: <string> | default = '{{ template "wechat.default.to_party" . }}'
        to_tag: <string> | default = '{{ template "wechat.default.to_tag" . }}'
        # notes:
        #   to_user: WeChat Work user id
        #   to_party: id of the group to notify
        #   corp_id: the unique id of the WeChat Work account, shown under "My Company"
        #   agent_id: the application id, application management --> open the custom application
        #   api_secret: the application secret
        # sign up for WeChat Work at https://work.weixin.qq.com
        # WeChat API docs: https://work.weixin.qq.com/api/doc#90002/90151/90854
    inhibit_rules:   # inhibition settings
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      equal: ['alertname', 'dev', 'instance']
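The semantics of such an inhibit rule can be sketched in Python (illustrative only, not Alertmanager's implementation; the alerts are made up):

```python
def inhibited(target, firing, source_match, target_match, equal):
    """True if `target` should be muted: some firing alert matches
    source_match and agrees with `target` on all labels in `equal`."""
    if not all(target.get(k) == v for k, v in target_match.items()):
        return False  # the rule does not even apply to this alert
    return any(
        all(src.get(k) == v for k, v in source_match.items())
        and all(src.get(lbl) == target.get(lbl) for lbl in equal)
        for src in firing)

firing = [{"alertname": "NodeDown", "severity": "critical", "instance": "10.0.0.1"}]
warn  = {"alertname": "NodeDown", "severity": "warning", "instance": "10.0.0.1"}
other = {"alertname": "NodeDown", "severity": "warning", "instance": "10.0.0.2"}

print(inhibited(warn, firing, {"severity": "critical"},
                {"severity": "warning"}, ["alertname", "instance"]))   # True
print(inhibited(other, firing, {"severity": "critical"},
                {"severity": "warning"}, ["alertname", "instance"]))   # False
```

So the warning for the same instance is muted while the critical alert fires, but a warning for a different instance still gets through.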
7.6.2 Configuring WeChat Work alerts
- Register a company at https://work.weixin.qq.com
  An unverified company (up to 200 members) is enough; bind your personal WeChat to use the web console.
  WeChat API docs: https://work.weixin.qq.com/api/doc#90002/90151/90854
- After registering, bind your personal WeChat and scan the code to enter the management console.
- Create the application that will send the alerts; the procedure is straightforward.
- Parameters to note:
    - corp_id: the unique id of the WeChat Work account, shown under "My Company"
    - agent_id: the application id, application management --> open the custom application
    - api_secret: the application secret
    - to_user: WeChat Work user id
    - to_party: id of the group to notify; in the address book, click the dots next to the group name
- Configuration example:
    receivers:
    - name: 'default'
      email_configs:
      - to: 'XXX'
        send_resolved: true
      wechat_configs:
      - send_resolved: true
        corp_id: 'XXX'
        api_secret: 'XXX'
        agent_id: 1000002
        to_user: XXX
        to_party: 2
        message: '{{ template "wechat.html" . }}'
- template:
    - alertmanager's default WeChat template is ugly and verbose, so a custom alert template is used; the default mail template, on the other hand, is acceptable.
    - Example 1:
        cat wechat.tmpl
        {{ define "wechat.html" }}
        {{- if gt (len .Alerts.Firing) 0 -}}{{ range .Alerts }}
        [@alert~]
        instance: {{ .Labels.instance }}
        info: {{ .Annotations.summary }}
        detail: {{ .Annotations.description }}
        value: {{ .Annotations.value }}
        time: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
        {{ end }}{{ end -}}
        {{- if gt (len .Alerts.Resolved) 0 -}}{{ range .Alerts }}
        [@resolved~]
        instance: {{ .Labels.instance }}
        info: {{ .Annotations.summary }}
        time: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
        resolved: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
        {{ end }}{{ end -}}
        {{- end }}
7.6.3 Timestamps in alert templates:
- Reference: https://blog.csdn.net/knight_zhou/article/details/106323719
- Prometheus mail-alert templates use UTC by default. Adding 28800e9 nanoseconds (8 hours) shifts the time to UTC+8:
    before: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
    after:  {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
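The 28800e9 trick is simply adding 8 hours -- 28800 seconds expressed in nanoseconds, because Go's time.Duration counts nanoseconds. The same shift in Python, with a made-up alert timestamp:

```python
from datetime import datetime, timedelta, timezone

starts_at = datetime(2020, 5, 24, 6, 30, 0, tzinfo=timezone.utc)  # what Alertmanager reports
local = starts_at + timedelta(seconds=28800)                      # 28800e9 ns == 8 h
print(local.strftime("%Y-%m-%d %H:%M:%S"))  # 2020-05-24 14:30:00
```

Note this is a fixed offset, not real timezone handling: it would be wrong for zones with daylight saving.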
7.7 Commonly used prometheus alerting rules:
- A very useful page full of ready-made rules: https://awesome-prometheus-alerts.grep.to/rules
7.7.1 Container alerting: alert when a container goes down
    vim rules/docker_monitor.yml
    groups:
    - name: "container monitor"
      rules:
      - alert: "Container down: env1"
        expr: time() - container_last_seen{name="env1"} > 60
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Container down: {{$labels.instance}} name={{$labels.name}}"
    Note:
      This metric can only detect a container going down; it cannot reliably detect recovery. Even if the container fails to start, a resolve notification arrives after a while.
7.7.2 Alerting rules for CPU, IO, disk usage, memory usage, TCP and network traffic:
groups:
- name: host-status alerts
  rules:
  - alert: host status
    expr: up == 0
    for: 1m
    labels:
      status: critical
    annotations:
      summary: "{{$labels.instance}}: server down"
      description: "{{$labels.instance}}: server unreachable for more than 5 minutes"
  - alert: CPU usage
    expr: 100-(avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100) > 60
    for: 1m
    labels:
      status: warning
    annotations:
      summary: "{{$labels.mountpoint}} CPU usage too high!"
      description: "{{$labels.mountpoint}} CPU usage above 60% (currently: {{$value}}%)"
  - alert: high CPU usage   # the query joins in the hostname label
    expr: (100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100)) * on(instance) group_left(nodename) (node_uname_info) > 85
    for: 5m
    labels:
      region: chengdu
    annotations:
      summary: "{{$labels.instance}}({{$labels.nodename}}) CPU usage too high!"
      description: 'server {{$labels.instance}}({{$labels.nodename}}) CPU usage above 85% ({{$value}}%)'
  - alert: high system load
    expr: (node_load1/count without (cpu, mode) (node_cpu_seconds_total{mode="system"})) * on(instance) group_left(nodename) (node_uname_info) > 1.1
    for: 3m
    labels:
      region: chengdu
    annotations:
      summary: "{{$labels.instance}}({{$labels.nodename}}) system load too high!"
      description: '{{$labels.instance}}({{$labels.nodename}}) load over the limit by {{printf "%.2f" $value}}'
  - alert: low memory
    expr: (100 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) * on(instance) group_left(nodename) (node_uname_info) > 80
    for: 3m
    labels:
      region: chengdu
    annotations:
      summary: "{{$labels.instance}}({{$labels.nodename}}) memory usage too high!"
      description: 'server {{$labels.instance}}({{$labels.nodename}}) memory usage above 80% ({{$value}}%)'
  - alert: IO time
    expr: 100-(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100) < 60
    for: 1m
    labels:
      status: critical
    annotations:
      summary: "{{$labels.mountpoint}} disk IO usage too high!"
      description: "{{$labels.mountpoint}} disk IO above 60% (currently: {{$value}})"
  - alert: inbound traffic
    expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
    for: 1m
    labels:
      status: critical
    annotations:
      summary: "{{$labels.mountpoint}} inbound bandwidth too high!"
      description: "{{$labels.mountpoint}} inbound bandwidth above 100M for 2 minutes. RX usage {{$value}}%"
  - alert: outbound traffic
    expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
    for: 1m
    labels:
      status: critical
    annotations:
      summary: "{{$labels.mountpoint}} outbound bandwidth too high!"
      description: "{{$labels.mountpoint}} outbound bandwidth above 100M for 2 minutes. TX usage {{$value}}%"
  - alert: network in
    expr: sum by (instance) (irate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
    for: 1m
    labels:
      name: network
      severity: Critical
    annotations:
      summary: "{{$labels.mountpoint}} inbound bandwidth too high"
      description: "{{$labels.mountpoint}} abnormal inbound traffic, above 100M"
      value: "{{ $value }}"
  - alert: network out
    expr: sum by (instance) (irate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
    for: 1m
    labels:
      name: network
      severity: Critical
    annotations:
      summary: "{{$labels.mountpoint}} outbound bandwidth too high"
      description: "{{$labels.mountpoint}} abnormal outbound traffic, above 100M"
      value: "{{ $value }}"
  - alert: TCP sessions
    expr: node_netstat_Tcp_CurrEstab > 1000
    for: 1m
    labels:
      status: critical
    annotations:
      summary: "{{$labels.mountpoint}} TCP_ESTABLISHED too high!"
      description: "{{$labels.mountpoint}} TCP_ESTABLISHED above 1000 (currently: {{$value}})"
  - alert: disk capacity
    expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes{fstype=~"ext4|xfs"}*100) > 80
    for: 1m
    labels:
      status: critical
    annotations:
      summary: "{{$labels.mountpoint}} partition usage too high!"
      description: "{{$labels.mountpoint}} partition usage above 80% (currently: {{$value}}%)"
  - alert: low disk space   # the query result carries hostname etc. as extra labels
    expr: (100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes{fstype=~"ext4|xfs"}*100)) * on(instance) group_left(nodename) (node_uname_info) > 80
    for: 3m
    labels:
      region: chengdu
    annotations:
      summary: "{{$labels.instance}}({{$labels.nodename}}) disk usage too high!"
      description: 'server {{$labels.instance}}({{$labels.nodename}}) disk usage above 80% ({{$value}}%)'
  - alert: volume full in four days   # disk predicted to fill up within 4 days
    expr: predict_linear(node_filesystem_free_bytes[2h], 4 * 24 * 3600) < 0
    for: 5m
    labels:
      name: disk
      severity: Critical
    annotations:
      summary: "{{$labels.mountpoint}} free disk space predicted to run out within 4 days"
      description: "{{$labels.mountpoint }}"
      value: "{{ $value }}%"
  - alert: disk write rate
    expr: sum by (instance) (irate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
    for: 1m
    labels:
      name: disk
      severity: Critical
    annotations:
      summary: "disk write rate (instance {{ $labels.instance }})"
      description: "disk write rate above 50MB/s"
      value: "{{ $value }}%"
  - alert: disk read latency
    expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1
    for: 1m
    labels:
      name: disk
      severity: Critical
    annotations:
      summary: "unusual disk read latency (instance {{ $labels.instance }})"
      description: "disk read latency above 100 milliseconds"
      value: "{{ $value }}%"
  - alert: disk write latency
    expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1
    for: 1m
    labels:
      name: disk
      severity: Critical
    annotations:
      summary: "unusual disk write latency (instance {{ $labels.instance }})"
      description: "disk write latency above 100 milliseconds"
      value: "{{ $value }}%"
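predict_linear(), used in the "volume full in four days" rule, fits a least-squares line through the samples in the range and extrapolates forward. A simplified Python sketch of the idea (real PromQL fits over every sample in the window; the sample data is made up):

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares fit through (t, v) samples, extrapolated forward."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + seconds_ahead)

# free bytes shrinking by ~1 GiB per hour over a 2 h window
gib = 1024 ** 3
samples = [(0, 10 * gib), (3600, 9 * gib), (7200, 8 * gib)]
print(predict_linear(samples, 4 * 24 * 3600) < 0)  # True: full well before 4 days
```

This is why the rule compares the prediction against 0: a negative predicted free-bytes value means the trend line crosses zero within the lookahead window.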
7.8 The alertmanager management API
    GET /-/healthy
    GET /-/ready
    POST /-/reload
- Examples:
    curl -u monitor:fosafer.com 127.0.0.1:9093/-/healthy
    OK
    curl -XPOST -u monitor:fosafer.com 127.0.0.1:9093/-/reload
    [root@host40 monitor]# curl -XPOST -u monitor:fosafer.com 127.0.0.1:9093/-/reload
    failed to reload config: yaml: unmarshal errors:
      line 26: field receiver already set in type config.plain
  Equivalent to: docker exec -it monitor-alertmanager kill -1 1 -- but the curl form reports an error on failure.
8. blackbox_exporter
8.1 Introduction
- blackbox_exporter is one of the official Prometheus exporters; it probes targets over http, dns, tcp and icmp.
- Use cases:
    HTTP probing: set Request Header fields; assert on HTTP status / HTTP response headers / HTTP body content
    TCP probing: port-listening checks for business components; application-layer protocol definition and checks
    ICMP probing: host liveness
    POST probing: API connectivity
    SSL certificate expiry time
8.2 Installing blackbox_exporter
8.2.1 linux (centos7) binary install
- Download and unpack:
    wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.18.0/blackbox_exporter-0.18.0.linux-amd64.tar.gz
    tar -xf blackbox_exporter-0.18.0.linux-amd64.tar.gz -C /usr/local/
    cd /usr/local
    ln -sv blackbox_exporter-0.18.0.linux-amd64 blackbox_exporter
    cd blackbox_exporter
    ./blackbox_exporter --version
- Add a systemd unit:
    vim /lib/systemd/system/blackbox_exporter.service
    [Unit]
    Description=blackbox_exporter
    After=network.target
    [Service]
    User=root
    Type=simple
    ExecStart=/usr/local/blackbox_exporter/blackbox_exporter --config.file=/usr/local/blackbox_exporter/blackbox.yml
    Restart=on-failure
    [Install]
    WantedBy=multi-user.target

    systemctl daemon-reload
    systemctl enable blackbox_exporter
    systemctl start blackbox_exporter
- Default port: 9115
8.2.2 Docker install
- image: prom/blackbox-exporter:master
- docker run:
    docker run --rm -d -p 9115:9115 --name blackbox_exporter -v `pwd`:/config prom/blackbox-exporter:master --config.file=/config/blackbox.yml
8.3 Configuring blackbox_exporter
- Default configuration file:
- The stock configuration already covers most needs; if you need custom modules later, see the official docs and the example config file in the project repo.
cat blackbox.yml
modules:
  http_2xx:
    prober: http
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
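As a sketch of extending these defaults, a hypothetical module that accepts only a 200 status and requires a body substring could look like the fragment below (the module name `http_200_body` and the matched string are made up; `valid_status_codes` and `fail_if_body_not_matches_regexp` are fields of the blackbox_exporter http prober):

```yaml
modules:
  http_200_body:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: [200]        # fail unless the response is exactly 200
      fail_if_body_not_matches_regexp:
        - "ok"                         # fail unless the body contains "ok"
```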
8.4 Configuring prometheus: <relabel_config>
- Official docs: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config
- Reference: https://blog.csdn.net/qq_25934401/article/details/84325356
- Notes:
labels:
  job: <job_name>
  __address__: <host>:<port>
  instance: defaults to __address__ unless relabeled
  __scheme__: scheme
  __metrics_path__: path
  __param_<name>: the first <name> parameter in the URL
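To make the relabeling in the examples below concrete, here is a small shell walkthrough of what the three relabel rules do to one target: the original `__address__` is copied into the `target` URL parameter, kept as the `instance` label, and the scrape address itself is rewritten to the blackbox exporter (addresses taken from the http example that follows):

```shell
target='http://example.com:8080'   # original __address__ -> __param_target (and instance)
module='http_2xx'                  # from params.module
blackbox='127.0.0.1:9115'          # rewritten __address__: the blackbox exporter
# URL prometheus actually scrapes after relabeling:
scrape_url="http://$blackbox/probe?target=$target&module=$module"
echo "$scrape_url"
```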
8.4.1 http/https probe example:
scrape_configs:
- job_name: 'blackbox'
metrics_path: /probe
params:
module: [http_2xx] # Look for a HTTP 200 response.
static_configs:
- targets:
- http://prometheus.io # Target to probe with http.
- https://prometheus.io # Target to probe with https.
- http://example.com:8080 # Target to probe with http on port 8080.
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: 127.0.0.1:9115 # The blackbox exporter's real hostname:port.
8.4.2 tcp probe example:
- job_name: "blackbox_telnet_port"
scrape_interval: 5s
metrics_path: /probe
params:
module: [tcp_connect]
static_configs:
- targets: [ '1x3.x1.xx.xx4:443' ]
labels:
group: 'xxx IDC machine-room IP monitoring'
- targets: ['10.xx.xx.xxx:443']
labels:
group: 'Process status of nginx(main) server'
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: 10.xxx.xx.xx:9115
8.4.3 icmp probe example:
- job_name: 'blackbox00_ping_idc_ip'
scrape_interval: 10s
metrics_path: /probe
params:
module: [icmp] #ping
static_configs:
- targets: [ '1x.xx.xx.xx' ]
labels:
group: 'xx nginx virtual IP'
relabel_configs:
- source_labels: [__address__]
regex: (.*)(:80)?
target_label: __param_target
replacement: ${1}
- source_labels: [__param_target]
regex: (.*)
target_label: ping
replacement: ${1}
- source_labels: []
regex: .*
target_label: __address__
replacement: 1x.xxx.xx.xx:9115
8.4.4 POST probe example:
- job_name: 'blackbox_http_2xx_post'
scrape_interval: 10s
metrics_path: /probe
params:
module: [http_post_2xx]
static_configs:
- targets:
- https://xx.xxx.com/api/xx/xx/fund/query.action
labels:
group: 'Interface monitoring'
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: 1x.xx.xx.xx:9115 # The blackbox exporter's real hostname:port.
8.4.5 SSL certificate expiry monitoring:
cat << 'EOF' > prometheus.yml
rule_files:
- ssl_expiry.rules
scrape_configs:
- job_name: 'blackbox'
metrics_path: /probe
params:
module: [http_2xx] # Look for a HTTP 200 response.
static_configs:
- targets:
- example.com # Target to probe
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: 127.0.0.1:9115 # Blackbox exporter.
EOF
cat << 'EOF' > ssl_expiry.rules
groups:
- name: ssl_expiry.rules
rules:
- alert: SSLCertExpiringSoon
expr: probe_ssl_earliest_cert_expiry{job="blackbox"} - time() < 86400 * 30
for: 10m
EOF
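The rule above fires when the earliest certificate expiry is within 30 days (86400 s × 30). The same metric can also be graphed as days remaining, e.g. in a grafana panel (a sketch, not part of the original config):

```
# days until the earliest certificate in the chain expires
(probe_ssl_earliest_cert_expiry{job="blackbox"} - time()) / 86400
```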
8.5 Inspecting a probe:
- For example (note the URL must be quoted, otherwise the shell interprets the &):
curl 'http://172.16.10.65:9115/probe?target=prometheus.io&module=http_2xx&debug=true'
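The probe output includes the module's metrics, so connectivity can be checked in a script by extracting the probe_success line with awk. A self-contained sketch (the sample metrics text is inlined here; in practice pipe the curl output in):

```shell
# Sample of the metrics a probe returns (inlined for illustration):
metrics='# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1'
# In practice:
#   curl -s 'http://172.16.10.65:9115/probe?target=prometheus.io&module=http_2xx' | awk ...
ok=$(printf '%s\n' "$metrics" | awk '$1=="probe_success"{print $2}')
echo "$ok"   # 1 = reachable, 0 = unreachable
```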
8.6 Adding alerts:
- Whether an icmp, tcp, http or post probe succeeds is reflected in the probe_success metric:
probe_success == 0  ## connectivity broken
probe_success == 1  ## connectivity OK
- Alerting likewise checks this metric: if it equals 0, an alert fires.
[sss@prometheus01 prometheus]$ cat rules/blackbox-alert.rules
groups:
- name: blackbox_network_stats
  rules:
- alert: blackbox_network_stats
expr: probe_success == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} is down"
description: "This requires immediate action!"
9. Deploying a complete prometheus monitoring stack with docker-compose
- Deployment host: 10.10.11.40
9.1 Components deployed:
prometheus
alertmanager
grafana
nginx
node_exporter
cadvisor
blackbox_exporter
- image:
prom/prometheus
prom/alertmanager
quay.io/prometheus/node-exporter ,prom/node-exporter
gcr.io/google_containers/cadvisor[:v0.36.0] # requires access to Google (gcr.io)
google/cadvisor:v0.33.0 # Docker Hub image; not as new as the gcr.io one
grafana/grafana
nginx
- After pulling the images, re-tag them and push them to the local harbor registry:
image: 10.10.11.40:80/base/nginx:1.19.3
image: 10.10.11.40:80/base/prometheus:2.22.0
image: 10.10.11.40:80/base/grafana:7.2.2
image: 10.10.11.40:80/base/alertmanager:0.21.0
image: 10.10.11.40:80/base/node_exporter:1.0.1
image: 10.10.11.40:80/base/cadvisor:v0.33.0
image: 10.10.11.40:80/base/blackbox-exporter:0.18.0
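The pull / re-tag / push cycle can be scripted. The sketch below only prints the docker commands (a dry run) so the naming logic is visible without touching a registry; the registry path is the one above, and the two image:tag pairs are illustrative:

```shell
REG=10.10.11.40:80/base
cmds=""
for img in prom/prometheus:v2.22.0 prom/alertmanager:v0.21.0; do
  name=${img##*/}                  # drop the repository prefix, keep name:tag
  cmds="$cmds
docker pull $img
docker tag $img $REG/$name
docker push $REG/$name"
done
printf '%s\n' "$cmds"
```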
9.2 Deployment layout
- Directory structure at a glance:
mkdir /home/deploy/monitor
cd /home/deploy/monitor
[root@host40 monitor]# tree
.
├── alertmanager
│   ├── alertmanager.yml
│   ├── db
│   │   ├── nflog
│   │   └── silences
│   └── templates
│       └── wechat.tmpl
├── blackbox_exporter
│   └── blackbox.yml
├── docker-compose.yml
├── grafana
│   └── db
│       ├── grafana.db
│       ├── plugins
│       ...
├── nginx
│   ├── auth
│   └── nginx.conf
├── node-exporter
│   └── textfiles
├── node_exporter_install_docker.sh
├── prometheus
│   ├── db
│   ├── prometheus.yml
│   ├── rules
│   │   ├── docker_monitor.yml
│   │   ├── system_monitor.yml
│   │   └── tcp_monitor.yml
│   └── sd_files
│       ├── docker_host.yml
│       ├── http.yml
│       ├── icmp.yml
│       ├── real_lan.yml
│       ├── real_wan.yml
│       ├── sedFDm5Rw
│       ├── tcp.yml
│       ├── virtual_lan.yml
│       └── virtual_wan.yml
└── sd_controler.sh
- File needed for nginx basic auth:
[root@host40 monitor-bak]# ls nginx/auth/ -a
.  ..  .htpasswd
- Permissions on some of the mounted paths:
The db directories of prometheus, grafana and alertmanager need mode 777.
The individually mounted config files alertmanager.yml, prometheus.yml and nginx.conf need mode 666.
For better security, place the config files in a dedicated directory, mount that directory, and point the startup arguments in command at the config file.
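The permission setup can be sketched as below; a temp directory stands in for /home/deploy/monitor so the commands are safe to run anywhere:

```shell
base=$(mktemp -d)                  # stand-in for /home/deploy/monitor in this sketch
mkdir -p "$base/prometheus/db" "$base/grafana/db" "$base/alertmanager/db"
touch "$base/prometheus/prometheus.yml"
# db directories: 777, individually mounted config files: 666
chmod 777 "$base/prometheus/db" "$base/grafana/db" "$base/alertmanager/db"
chmod 666 "$base/prometheus/prometheus.yml"
stat -c '%a' "$base/prometheus/db"              # 777
stat -c '%a' "$base/prometheus/prometheus.yml"  # 666
```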
9.3 docker-compose.yml
[root@host40 monitor-bak]# cat docker-compose.yml
version: "3"
services:
nginx:
image: 10.10.11.40:80/base/nginx:1.19.3
hostname: nginx
container_name: monitor-nginx
restart: always
privileged: false
ports:
- 3001:3000
- 9090:9090
- 9093:9093
volumes:
- ./nginx/nginx.conf:/etc/nginx/nginx.conf
- ./nginx/auth:/etc/nginx/basic_auth
networks:
monitor:
aliases:
- nginx
logging:
driver: json-file
options:
max-file: '5'
max-size: 50m
prometheus:
image: 10.10.11.40:80/base/prometheus:2.22.0
container_name: monitor-prometheus
hostname: prometheus
restart: always
privileged: true
volumes:
- ./prometheus/db/:/prometheus/
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
- ./prometheus/rules/:/etc/prometheus/rules/
- ./prometheus/sd_files/:/etc/prometheus/sd_files/
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/usr/share/prometheus/console_libraries'
- '--web.console.templates=/usr/share/prometheus/consoles'
- '--storage.tsdb.retention=60d'
networks:
monitor:
aliases:
- prometheus
logging:
driver: json-file
options:
max-file: '5'
max-size: 50m
grafana:
image: 10.10.11.40:80/base/grafana:7.2.2
container_name: monitor-grafana
hostname: grafana
restart: always
privileged: true
volumes:
- ./grafana/db/:/var/lib/grafana
networks:
monitor:
aliases:
- grafana
logging:
driver: json-file
options:
max-file: '5'
max-size: 50m
alertmanager:
image: 10.10.11.40:80/base/alertmanager:0.21.0
container_name: monitor-alertmanager
hostname: alertmanager
restart: always
privileged: true
volumes:
- ./alertmanager/db/:/alertmanager
- ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
- ./alertmanager/templates/:/etc/alertmanager/templates
networks:
monitor:
aliases:
- alertmanager
logging:
driver: json-file
options:
max-file: '5'
max-size: 50m
node-exporter:
image: 10.10.11.40:80/base/node_exporter:1.0.1
container_name: monitor-node-exporter
hostname: host40
restart: always
privileged: true
volumes:
- /:/host:ro,rslave
- ./node-exporter/textfiles/:/textfiles
network_mode: "host"
command:
- '--path.rootfs=/host'
- '--web.listen-address=:9100'
- '--collector.textfile.directory=/textfiles'
logging:
driver: json-file
options:
max-file: '5'
max-size: 50m
cadvisor:
image: 10.10.11.40:80/base/cadvisor:v0.33.0
container_name: monitor-cadvisor
hostname: cadvisor
restart: always
privileged: true
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
- /dev/disk/:/dev/disk:ro
ports:
- 9080:8080
networks:
monitor:
logging:
driver: json-file
options:
max-file: '5'
max-size: 50m
blackbox_exporter:
image: 10.10.11.40:80/base/blackbox-exporter:0.18.0
container_name: monitor-blackbox
hostname: blackbox-exporter
restart: always
privileged: true
volumes:
- ./blackbox_exporter/:/etc/blackbox_exporter
networks:
monitor:
aliases:
- blackbox
command:
- '--config.file=/etc/blackbox_exporter/blackbox.yml'
logging:
driver: json-file
options:
max-file: '5'
max-size: 50m
networks:
monitor:
ipam:
config:
- subnet: 192.168.17.0/24
9.4 nginx
- Because prometheus and alertmanager have no built-in authentication, nginx sits in front to provide routing and basic auth, proxying all backend ports in one place for easier management.
- Default ports of each component:
prometheus: 9090
grafana: 3000
alertmanager: 9093
node_exporter: 9100
cadvisor: 8080 (agent)
- Basic auth with the stock nginx image:
echo monitor:`openssl passwd -crypt 123456` > .htpasswd
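Note that the crypt scheme truncates passwords to 8 characters. nginx's auth_basic also accepts apr1 (MD5) hashes, so a stronger variant is sketched below (a temp file stands in for nginx/auth/.htpasswd here):

```shell
htfile=$(mktemp)                   # stand-in for nginx/auth/.htpasswd
# apr1 hashes start with $apr1$ and do not truncate the password
echo "monitor:$(openssl passwd -apr1 123456)" > "$htfile"
grep '^monitor:' "$htfile"
```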
- A config file mounted as a single file does not always update inside the container when replaced on the host (alternatively, mount the directory rather than the file):
chmod 666 nginx.conf
- Reload the config inside the nginx container:
docker exec -it monitor-nginx nginx -s reload
-
nginx.conf
[root@host40 monitor-bak]# cat nginx/nginx.conf
user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log;
pid /run/nginx.pid;
include /usr/share/nginx/modules/*.conf;
events {
worker_connections 10240;
}
http {
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for"';
access_log /var/log/nginx/access.log main;
sendfile on;
tcp_nopush on;
tcp_nodelay on;
keepalive_timeout 65;
types_hash_max_size 2048;
include /etc/nginx/mime.types;
default_type application/octet-stream;
proxy_connect_timeout 500ms;
proxy_send_timeout 1000ms;
proxy_read_timeout 3000ms;
proxy_buffers 64 8k;
proxy_busy_buffers_size 128k;
proxy_temp_file_write_size 64k;
proxy_redirect off;
proxy_next_upstream error invalid_header timeout http_502 http_504;
proxy_http_version 1.1;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Real-Port $remote_port;
proxy_set_header Host $http_host;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
client_max_body_size 10m;
client_body_buffer_size 512k;
client_body_timeout 180;
client_header_timeout 10;
send_timeout 240;
gzip on;
gzip_min_length 1k;
gzip_buffers 4 16k;
gzip_comp_level 2;
gzip_types application/javascript application/x-javascript text/css text/javascript image/jpeg image/gif image/png;
gzip_vary off;
gzip_disable "MSIE [1-6].";
server {
listen 3000;
server_name _;
location / {
proxy_pass http://grafana:3000;
}
}
server {
listen 9090;
server_name _;
location / {
auth_basic "auth for monitor";
auth_basic_user_file /etc/nginx/basic_auth/.htpasswd;
proxy_pass http://prometheus:9090;
}
}
server {
listen 9093;
server_name _;
location / {
auth_basic "auth for monitor";
auth_basic_user_file /etc/nginx/basic_auth/.htpasswd;
proxy_pass http://alertmanager:9093;
}
}
}
9.5 prometheus
- Note: the db directory must be writable; give it mode 777.
9.5.1 Main configuration file: prometheus.yml
[root@host40 monitor-bak]# cat prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["alertmanager:9093"]

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
# The job name is added as a label job=<job_name> to any timeseries scraped from this config.
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'alertmanager'
    static_configs:
    - targets: ['alertmanager:9093']

  - job_name: 'node_real_lan'
    file_sd_configs:
    - files:
      - ./sd_files/real_lan.yml
      refresh_interval: 30s

  - job_name: 'node_virtual_lan'
    file_sd_configs:
    - files:
      - ./sd_files/virtual_lan.yml
      refresh_interval: 30s

  - job_name: 'node_real_wan'
    file_sd_configs:
    - files:
      - ./sd_files/real_wan.yml
      refresh_interval: 30s

  - job_name: 'node_virtual_wan'
    file_sd_configs:
    - files:
      - ./sd_files/virtual_wan.yml
      refresh_interval: 30s

  - job_name: 'docker_host'
    file_sd_configs:
    - files:
      - ./sd_files/docker_host.yml
      refresh_interval: 30s

  - job_name: 'tcp'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    file_sd_configs:
    - files:
      - ./sd_files/tcp.yml
      refresh_interval: 30s
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox:9115

  - job_name: 'http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    file_sd_configs:
    - files:
      - ./sd_files/http.yml
      refresh_interval: 30s
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox:9115

  - job_name: 'icmp'
    metrics_path: /probe
    params:
      module: [icmp]
    file_sd_configs:
    - files:
      - ./sd_files/icmp.yml
      refresh_interval: 30s
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox:9115
9.5.2 All nodes use file-based service discovery:
- Just write the target hosts into the corresponding job's sd file. Examples:
ls prometheus/sd_files/
docker_host.yml http.yml icmp.yml real_lan.yml real_wan.yml sedFDm5Rw tcp.yml virtual_lan.yml virtual_wan.yml
cat prometheus/sd_files/docker_host.yml
- targets: ['10.10.11.178:9080']
- targets: ['10.10.11.99:9080']
- targets: ['10.10.11.40:9080']
- targets: ['10.10.11.35:9080']
- targets: ['10.10.11.45:9080']
- targets: ['10.10.11.46:9080']
- targets: ['10.10.11.48:9080']
- targets: ['10.10.11.47:9080']
- targets: ['10.10.11.65:9081']
- targets: ['10.10.11.61:9080']
- targets: ['10.10.11.66:9080']
- targets: ['10.10.11.68:9080']
- targets: ['10.10.11.98:9080']
- targets: ['10.10.11.75:9080']
- targets: ['10.10.11.97:9080']
- targets: ['10.10.11.179:9080']
cat prometheus/sd_files/tcp.yml
- targets: ['10.10.11.178:8001']
labels:
server_name: http_download
- targets: ['10.10.11.178:3307']
labels:
server_name: xiaojing_db
- targets: ['10.10.11.178:3001']
labels:
server_name: test_web
9.5.3 rules files:
- docker rules:
cat prometheus/rules/docker_monitor.yml
groups:
- name: "container monitor"
rules:
- alert: "Container down: env1"
expr: time() - container_last_seen{name="env1"} > 60
for: 30s
labels:
severity: critical
annotations:
summary: "Container down: {{$labels.instance}} name={{$labels.name}}"
- tcp rules:
cat prometheus/rules/tcp_monitor.yml
groups:
- name: blackbox_network_stats
  rules:
- alert: blackbox_network_stats
expr: probe_success == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} ,server-name: {{ $labels.server_name }} is down"
description: "Connection unreachable..."
- system rules: # cpu, mem, disk, network, filesystem...
cat prometheus/rules/system_monitor.yml
groups:
- name: "system info"
  rules:
  - alert: "Server down"
    expr: up == 0
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}}: server down"
      description: "{{$labels.instance}}: server unreachable for more than 3 minutes"
  - alert: "High system load"
    expr: (node_load1/count without (cpu, mode) (node_cpu_seconds_total{mode="system"}))* on(instance) group_left(nodename) (node_uname_info) > 1.1
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: high system load"
      description: "{{$labels.instance}}: system load is high."
      value: "{{$value}}"
  - alert: "CPU usage above 90%"
    expr: 100-(avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100) > 90
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}}: CPU usage above 90%"
      description: "{{$labels.instance}}: CPU usage is above 90%."
      value: "{{$value}}"
  - alert: "Memory usage above 80%"
    expr: (100 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)* on(instance) group_left(nodename) (node_uname_info) > 80
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}}: memory usage above 80%"
      description: "{{$labels.instance}}: memory usage is above 80%"
      value: "{{$value}}"
  - alert: "Disk I/O busy more than 60% of the time"
    expr: 100-(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100) < 40
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}}: disk I/O busy more than 60% of the time"
      description: "{{$labels.instance}}: disk I/O busy more than 60% of the time"
      value: "{{$value}}"
  - alert: "Disk partition above 85% full"
    expr: (100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes{fstype=~"ext4|xfs"}*100) )* on(instance) group_left(nodename) (node_uname_info)> 85
    for: 3m
    labels:
      severity: longtime
    annotations:
      summary: "{{$labels.instance}}: disk partition above 85% full"
      description: "{{$labels.instance}}: disk partition above 85% full"
      value: "{{$value}}"
  - alert: "Disk predicted to fill up within 4 days"
    expr: predict_linear(node_filesystem_free_bytes[2h], 4 * 24 * 3600) < 0
    for: 3m
    labels:
      severity: longtime
    annotations:
      summary: "{{$labels.instance}}: a disk partition is predicted to fill up within 4 days"
      description: "{{$labels.instance}}: a disk partition is predicted to fill up within 4 days"
      value: "{{$value}}"
9.6 alertmanager:
- Note: the db directory must be writable.
- Main configuration file:
cat alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtphz.qiye.163.com:25'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'XXX'
  smtp_hello: 'qiye.163.com'
  smtp_require_tls: true
route:
  group_by: ['instance']
  group_wait: 30s
  receiver: default
  routes:
  - group_interval: 3m
    repeat_interval: 10m
    match:
      severity: warning
    receiver: 'default'
  - group_interval: 3m
    repeat_interval: 30m
    match:
      severity: critical
    receiver: 'default'
  - group_interval: 5m
    repeat_interval: 24h
    match:
      severity: longtime
    receiver: 'default'
templates:
  - ./templates/*.tmpl
receivers:
- name: 'default'
  email_configs:
  - to: '[email protected]'
    send_resolved: true
  - to: '[email protected]'
  wechat_configs:
  - send_resolved: true
    corp_id: 'XXX'
    api_secret: 'XXX'
    agent_id: 1000002
    to_user: XXX
    to_party: 2
    message: '{{ template "wechat.html" . }}'
- name: 'critical'
  email_configs:
  - to: '[email protected]'
    send_resolved: true
  - to: '[email protected]'
    send_resolved: true
  - to: '[email protected]'
- Alert template file:
cat alertmanager/templates/wechat.tmpl
{{ define "wechat.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}{{ range .Alerts }}
[@Alert~]
Instance: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Details: {{ .Annotations.description }}
Value: {{ .Annotations.value }}
Time: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end -}}
{{- if gt (len .Alerts.Resolved) 0 -}}{{ range .Alerts }}
[@Resolved~]
Instance: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Time: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Resolved: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end -}}
{{- end }}
9.7 grafana
- Only the volume needs mounting; the config file needs no changes. The db directory stays small and preserves settings and dashboards.
10. Agent deployment
10.1 Monitored hosts without docker: install node_exporter directly
- Install script:
http://10.10.11.178:8001/node_exporter_install.sh
10.2 Monitored hosts running docker: install node_exporter and cadvisor via docker
- Install script:
http://10.10.11.178:8001/node_exporter_install_docker.sh
- Required images: docker hosts without the 10.10.11.40:80 registry configured can download the saved images, docker load them first, then install:
http://10.10.11.178:8001/monitor-client.tgz
11. Using and maintaining prometheus
11.1 Adding and removing monitored nodes via script
- All jobs use file-based service discovery, so targets only need to be written into the sd_file; no config reload is required.
- A text-processing script acts as a front end to the sd_files, adding and removing targets from the command line instead of editing files by hand.
- Script name: sd_controler.sh
- Usage: run ./sd_controler.sh with no arguments to print the usage text.
- Full script:
[root@host40 monitor]# cat sd_controler.sh
#!/bin/bash
#version: 1.0
#Description: add | del | show instance from|to prometheus file_sd_files.
# rl | vl | dk | rw | vw | tcp | http | icmp : short for job name, each one means a sd_file.
# tcp | http | icmp ( because with ports for service ) add with label (server_name by default) to easy read in alert emails.
# each time can only add|del for one instance.
# Notes: adds, deletes and lists entries (e.g. IP:PORT pairs) in prometheus' file-based service discovery.
# rl | vl | dk | rw | vw | tcp | http | icmp are short names for the prometheus jobs; each maps to one sd_file.
# For tcp | http | icmp, the service behind a port is often not obvious when it goes down, so adding an entry
#   requires a server_name label, letting whoever receives the alert email know immediately which service died.
# Only one record can be added or deleted per invocation; for batch changes edit the files directly with vim
#   or loop over this script with a for statement.
### vars
SD_DIR=./prometheus/sd_files
DOCKER_SD=$SD_DIR/docker_host.yml
RL_HOST_SD=$SD_DIR/real_lan.yml
VL_HOST_SD=$SD_DIR/virtual_lan.yml
RW_HOST_SD=$SD_DIR/real_wan.yml
VW_HOST_SD=$SD_DIR/virtual_wan.yml
TCP_SD=$SD_DIR/tcp.yml
HTTP_SD=$SD_DIR/http.yml
ICMP_SD=$SD_DIR/icmp.yml
SDFILE=
### funcs
usage(){
    echo -e "Usage: $0 < rl | vl | dk | rw | vw | tcp | http | icmp > < add | del | show > [ IP:PORT | FQDN ] [ server-name ]"
    echo -e " example: \n\t node add:\t $0 rl add | del 10.10.10.10:9100\n\t tcp,http,icmp add:\t $0 tcp add 10.10.10.10:3306 web-mysql\n\t del:\t $0 http del www.baidu.com\n\t show:\t $0 rl | vl | dk | rw | vw | tcp | http | icmp show."
    exit
}
add(){
    # $1: SDFILE, $2: IP:PORT
    grep -q $2 $1 || echo -e "- targets: ['$2']" >> $1
}
del(){
    # $1: SDFILE, $2: IP:PORT
    sed -i '/'$2'/d' $1
}
add_with_label(){
    # $1: SDFILE, $2: [IP:[PORT]|FQDN] $3: SERVER-NAME
    LABEL_01="server_name"
    if ! grep -q "'$2'" $1;then
        echo -e "- targets: ['$2']" >> $1
        echo -e "  labels:" >> $1
        echo -e "    ${LABEL_01}: $3" >> $1
    fi
}
del_with_label(){
    # $1: SDFILE, $2: [IP:[PORT]|FQDN]
    NUM=`cat -n $SDFILE |grep "'$2'"|awk '{print $1}'`
    let ENDNUM=NUM+2
sed -i $NUM,${ENDNUM}d $1
}
action(){
if [ "$1" == "add" ];then
add $SDFILE $2
elif [ "$1" == "del" ];then
del $SDFILE $2
elif [ "$1" == "show" ];then
cat $SDFILE
fi
}
action_with_label(){
if [ "$1" == "add" ];then
add_with_label $SDFILE $2 $3
elif [ "$1" == "del" ];then
del_with_label $SDFILE $2 $3
elif [ "$1" == "show" ];then
cat $SDFILE
fi
}
### main code
[ "$2" == "" ] || [[ ! "$2" =~ ^(add|del|show)$ ]] && usage
curl --version &>/dev/null || { echo -e "no curl found. " && exit 15; }
if [[ $1 =~ ^(rl|vl|rw|vw|dk)$ ]] && [ "$2" == "add" ];then
[ "$3" == "" ] && usage
if [ "$4" != "-f" ];then
COOD=$(curl -IL -o /dev/null --retry 3 --connect-timeout 3 -s -w "%{http_code}" http://$3/metrics)
[ "$COOD" != "200" ] && echo -e "http://$3/metrics is not reachable. check it again, or use -f to ignore the check." && exit 11
fi
fi
if [[ $1 =~ ^(tcp|http|icmp)$ ]] && [ "$2" == "add" ];then
[ "$4" == "" ] && echo -e "a server-name is required when adding tcp, http or icmp targets." && usage
fi
case $1 in
rl)
SDFILE=$RL_HOST_SD
action $2 $3 && echo $2 OK
;;
vl)
SDFILE=$VL_HOST_SD
action $2 $3 && echo $2 OK
;;
dk)
SDFILE=$DOCKER_SD
action $2 $3 && echo $2 OK
;;
rw)
SDFILE=$RW_HOST_SD
action $2 $3 && echo $2 OK
;;
vw)
SDFILE=$VW_HOST_SD
action $2 $3 && echo $2 OK
;;
tcp)
SDFILE=$TCP_SD
action_with_label $2 $3 $4 && echo $2 OK
;;
http)
SDFILE=$HTTP_SD
action_with_label $2 $3 $4 && echo $2 OK
;;
icmp)
SDFILE=$ICMP_SD
action_with_label $2 $3 $4 && echo $2 OK
;;
*)
usage
;;
esac
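For reference, a minimal reproduction of what `./sd_controler.sh tcp add 10.10.10.10:3306 web-mysql` appends to tcp.yml — the same three-line record emitted by add_with_label above, with a temp file standing in for the sd_file:

```shell
SDFILE=$(mktemp)                   # stand-in for prometheus/sd_files/tcp.yml
target='10.10.10.10:3306'; sname='web-mysql'
# skip the append if the quoted target is already present (idempotent add)
if ! grep -q "'$target'" "$SDFILE"; then
  echo "- targets: ['$target']"  >> "$SDFILE"
  echo "  labels:"               >> "$SDFILE"
  echo "    server_name: $sname" >> "$SDFILE"
fi
cat "$SDFILE"
```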