Server-side setup
https://blog.csdn.net/qq_37598011/article/details/101105086
Installing and using Pushgateway
https://prometheus.io/download/
wget https://github.com/prometheus/pushgateway/releases/download/v1.0.0/pushgateway-1.0.0.linux-amd64.tar.gz
tar -zxvf pushgateway-1.0.0.linux-amd64.tar.gz
cd pushgateway-1.0.0.linux-amd64/
./pushgateway
At the same time, the Prometheus configuration file needs a new scrape job for the Pushgateway:
cd /usr/local/prometheus
vim prometheus.yml
  - job_name: pushgateway
    static_configs:
      - targets: ['localhost:9091']
        labels:
          instance: pushgateway
Restart Prometheus:
./prometheus
Then open http://localhost:9090/targets and check that the pushgateway target is listed.
Test (the labels given in the URL path are attached to the pushed metrics: job=test, instance=192.168.78.133, hostname=ip-192.168.78.133)
curl 127.0.0.1:9100/metrics|curl --data-binary @- http://127.0.0.1:9091/metrics/job/test/instance/192.168.78.133/hostname/ip-192.168.78.133
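The same push can also be made from a script instead of curl. The following is a minimal sketch using only the Python standard library; the function names and the demo_metric sample are hypothetical, and it assumes the Pushgateway from the steps above is listening on 127.0.0.1:9091. The URL path carries the job and grouping labels exactly as in the curl test.

```python
import urllib.request
from urllib.parse import quote


def pushgateway_url(base, job, grouping):
    """Build <base>/metrics/job/<job>/<label>/<value>/... as in the curl test."""
    parts = [base.rstrip("/"), "metrics", "job", quote(job, safe="")]
    for name, value in grouping.items():
        parts += [quote(name, safe=""), quote(value, safe="")]
    return "/".join(parts)


def exposition(samples):
    """Render {metric_name: value} as untyped samples in the text format."""
    lines = [f"{name} {value}" for name, value in samples.items()]
    # The text exposition format expects the body to end with a newline.
    return ("\n".join(lines) + "\n").encode("utf-8")


def push(base, job, grouping, samples):
    """POST the samples. This is a network call and needs a running Pushgateway."""
    req = urllib.request.Request(
        pushgateway_url(base, job, grouping),
        data=exposition(samples),
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

Calling push("http://127.0.0.1:9091", "test", {"instance": "192.168.78.133"}, {"demo_metric": 1}) should make demo_metric appear on the Pushgateway's own /metrics page with those labels.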
Using a client library: https://prometheus.io/docs/instrumenting/clientlibs/
Installing and using Alertmanager
Installing Alertmanager
https://prometheus.io/download/
wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0-rc.0/alertmanager-0.20.0-rc.0.linux-amd64.tar.gz
tar -zxvf alertmanager-0.20.0-rc.0.linux-amd64.tar.gz
mv alertmanager-0.20.0-rc.0.linux-amd64 /usr/local/alertmanager
Edit the configuration file
cd /usr/local/alertmanager/
vim alertmanager.yml
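global:
  resolve_timeout: 5m  # time after which an alert is treated as resolved if it stops being updated (default 5m)
  smtp_smarthost: 'smtp.qq.com:25'  # SMTP server (host:port) used to send mail
  smtp_from: '[email protected]'  # sender address
  smtp_auth_username: '[email protected]'  # SMTP account name
  smtp_auth_password: 'xxxxxxxx'  # authorization code
  smtp_require_tls: false
templates:
  - 'template/*.tmpl'
route:
  group_by: ['alertname']  # labels used to group alerts
  group_wait: 10s  # how long to wait before sending the first notification for a new group
  group_interval: 10m  # how long to wait before notifying about new alerts added to an already-notified group
  repeat_interval: 1h  # how often a still-firing alert is re-sent; for email, do not set this too low, or the SMTP server may reject the mail for being too frequent
  receiver: 'email'  # receiver that gets the alerts; must match a name under receivers below
receivers:
  - name: 'email'
    email_configs:  # email settings
      - to: '[email protected]'  # address that receives the alerts
        headers: { Subject: "[WARN] Alert email" }  # subject of the alert email
        send_resolved: true
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']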
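How the inhibit_rules entry behaves can be easier to see in code. The sketch below is a simplified model, not Alertmanager's implementation: a warning alert is muted when some firing critical alert has the same values for every label listed in equal. Missing labels are treated as empty strings, which matches the documented rule that a missing label and an empty label are equivalent. The function names and sample alerts are hypothetical.

```python
# The rule from alertmanager.yml, expressed as plain data.
RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}


def matches(labels, matcher):
    """True if every label in the matcher has exactly that value."""
    return all(labels.get(name) == value for name, value in matcher.items())


def is_inhibited(target, firing_sources, rule=RULE):
    """Simplified model: is the target alert muted by any firing source alert?"""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing_sources:
        if not matches(source, rule["source_match"]):
            continue
        # A missing label counts as an empty value on both sides.
        if all(source.get(n, "") == target.get(n, "") for n in rule["equal"]):
            return True
    return False


critical = {"alertname": "InstanceDown", "instance": "a:9100", "severity": "critical"}
warning = {"alertname": "InstanceDown", "instance": "a:9100", "severity": "warning"}
other_host = {"alertname": "InstanceDown", "instance": "b:9100", "severity": "warning"}
```

Here is_inhibited(warning, [critical]) is true, while other_host is not inhibited because its instance label differs. The rule only takes effect for alerts that actually carry a severity label.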
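Create the template file (for the email body to use it, email_configs also needs html: '{{ template "test.html" . }}', which the configuration above does not set)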
mkdir template
cd template
vim test.tmpl
{{ define "test.html" }}
<table border="1">
  <tr>
    <td>Alert</td>
    <td>Instance</td>
    <td>Value</td>
    <td>Start time</td>
  </tr>
  {{ range $i, $alert := .Alerts }}
  <tr>
    <td>{{ index $alert.Labels "alertname" }}</td>
    <td>{{ index $alert.Labels "instance" }}</td>
    <td>{{ index $alert.Annotations "value" }}</td>
    <td>{{ $alert.StartsAt }}</td>
  </tr>
  {{ end }}
</table>
{{ end }}
Check the configuration
cd ..
./amtool check-config alertmanager.yml
Start Alertmanager
./alertmanager
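The webhook_configs entry in alertmanager.yml points at http://127.0.0.1:5001/, so something has to be listening there for the webhook notifications to be delivered. The following is a minimal sketch of such a receiver using only the Python standard library. The handler and function names are hypothetical, and it reads only the alerts, status and labels fields of the JSON body that Alertmanager posts.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(payload):
    """Return one 'status alertname instance' line per alert in the payload."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        lines.append(
            f'{alert.get("status", "unknown")} '
            f'{labels.get("alertname", "?")} '
            f'{labels.get("instance", "?")}'
        )
    return lines


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body sent by Alertmanager and print a short summary.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for line in summarize(payload):
            print(line)
        self.send_response(200)
        self.end_headers()


def serve(host="127.0.0.1", port=5001):
    """Block forever, answering webhook POSTs on the configured address."""
    HTTPServer((host, port), AlertHandler).serve_forever()
```

Running serve() in a separate terminal before triggering an alert prints one line per alert each time Alertmanager notifies the receiver.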
Configure Prometheus
Create the alerting rules
cd /usr/local/prometheus
Here I use a service going offline as the example.
vim alertmanager_rules.yml
groups:
  - name: test-rules
    rules:
      - alert: InstanceDown  # alert name
        expr: up == 0  # alert condition, written as a PromQL expression
        for: 2m  # how long the condition must hold before the alert fires
        labels:  # extra labels attached to the alert
          team: node
        annotations:  # descriptive fields with details about the alert
          summary: "{{$labels.instance}}: has been down"
          description: "{{$labels.instance}}: job {{$labels.job}} has been down"
          value: "{{$value}}"
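The effect of for: 2m is that the alert first becomes pending and only turns firing once the expression has been true at every evaluation for two minutes; a single evaluation where it is false resets the timer. The sketch below models that state machine in Python. It is a simplification for illustration, with a hypothetical function name, and it assumes evaluations every 15 seconds as set by evaluation_interval in this setup.

```python
def alert_states(condition_by_eval, interval_s=15, for_s=120):
    """Return the alert state after each rule evaluation.

    condition_by_eval: one boolean per evaluation, True when `up == 0` holds.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for i, is_true in enumerate(condition_by_eval):
        now = i * interval_s
        if not is_true:
            # Condition no longer holds: the alert goes back to inactive.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# Target stays down for nine evaluations in a row (0s to 120s).
steady = alert_states([True] * 9)
# Target recovers once in the middle, which restarts the two-minute wait.
flapping = alert_states([True] * 4 + [False] + [True] * 4)
```

In the steady case only the last evaluation, at 120 seconds, is firing. In the flapping case the alert never leaves pending, which is the behaviour that keeps short blips from sending notifications.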
Edit prometheus.yml
vim prometheus.yml
# my global config
global:
  scrape_interval: 15s  # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s  # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "alertmanager_rules.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'server'
    static_configs:
      - targets: ['localhost:9100']
  - job_name: pushgateway
    static_configs:
      - targets: ['localhost:9091']
        labels:
          instance: pushgateway
Check the configuration
./promtool check config prometheus.yml
./prometheus
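OK, it starts!
Test
Stop node_exporter. Once up == 0 has held for the 2m set in the rule, InstanceDown fires and Alertmanager sends the notification.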
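Besides waiting for the email, the alert state can be checked from a script through the Prometheus HTTP API at /api/v1/alerts. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and it assumes Prometheus is reachable at http://localhost:9090 as configured above.

```python
import json
import urllib.request


def firing_instances(api_response, alertname):
    """List the instance labels of firing alerts with the given name."""
    alerts = api_response.get("data", {}).get("alerts", [])
    return [
        alert.get("labels", {}).get("instance", "")
        for alert in alerts
        if alert.get("state") == "firing"
        and alert.get("labels", {}).get("alertname") == alertname
    ]


def fetch_alerts(base="http://localhost:9090"):
    """GET /api/v1/alerts. This is a network call and needs a running Prometheus."""
    with urllib.request.urlopen(base + "/api/v1/alerts", timeout=5) as resp:
        return json.load(resp)
```

After node_exporter has been stopped for a couple of minutes, firing_instances(fetch_alerts(), "InstanceDown") should include localhost:9100; before that the alert is listed with state pending and is filtered out.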
Example: alerting rules for CPU, disk I/O, disk usage, memory usage, TCP connections, and network traffic
groups:
  - name: host-status-alerts
    rules:
      - alert: HostStatus
        expr: up == 0
        for: 5m
        labels:
          status: critical
        annotations:
          summary: "{{$labels.instance}}: server is down"
          description: "{{$labels.instance}}: server has been unreachable for more than 5 minutes"
      - alert: CPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80
        for: 1m
        labels:
          status: warning
        annotations:
          # The expression aggregates by instance, so instance is the only label left to use here.
          summary: "{{$labels.instance}} CPU usage is too high!"
          description: "{{$labels.instance}} CPU usage is above 80% (current: {{$value}}%)"
      - alert: MemoryUsage
        expr: 100 - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          status: severe
        annotations:
          summary: "{{$labels.instance}} memory usage is too high!"
          description: "{{$labels.instance}} memory usage is above 80% (current: {{$value}}%)"
      - alert: IOPerformance
        expr: (avg(irate(node_disk_io_time_seconds_total[1m])) by (instance) * 100) > 80
        for: 1m
        labels:
          status: severe
        annotations:
          summary: "{{$labels.instance}} disk I/O utilization is too high!"
          description: "{{$labels.instance}} disk I/O utilization is above 80% (current: {{$value}})"
      - alert: NetworkReceive
        # Regex matchers are fully anchored, so virbr.* (not virbr*) is needed to exclude virbr0, and lo is enough for the loopback device.
        # rate() is in bytes/s; as written, this fires above 10,240,000 bytes/s (about 9.8 MiB/s).
        expr: ((sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) > 102400
        for: 1m
        labels:
          status: severe
        annotations:
          summary: "{{$labels.instance}} inbound network bandwidth is too high!"
          description: "{{$labels.instance}} inbound network bandwidth has stayed above the threshold for 1 minute. RX value: {{$value}}"
      - alert: NetworkTransmit
        expr: ((sum(rate(node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) > 102400
        for: 1m
        labels:
          status: severe
        annotations:
          summary: "{{$labels.instance}} outbound network bandwidth is too high!"
          description: "{{$labels.instance}} outbound network bandwidth has stayed above the threshold for 1 minute. TX value: {{$value}}"
      - alert: TCPSessions
        expr: node_netstat_Tcp_CurrEstab > 1000
        for: 1m
        labels:
          status: severe
        annotations:
          summary: "{{$labels.instance}} TCP_ESTABLISHED count is too high!"
          description: "{{$labels.instance}} TCP_ESTABLISHED count is above 1000 (current: {{$value}})"
      - alert: DiskCapacity
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
        for: 1m
        labels:
          status: severe
        annotations:
          summary: "{{$labels.mountpoint}} disk partition usage is too high!"
          description: "{{$labels.mountpoint}} disk partition usage is above 80% (current: {{$value}}%)"
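The CPU expression is the least obvious of these rules, so here is the arithmetic behind it as a small worked example. irate takes the last two samples of each per-CPU idle counter and divides the increase by the elapsed time, giving the fraction of time that CPU was idle; averaging over CPUs and subtracting from 100 gives the usage percentage. The sample values and function names below are made up for illustration.

```python
def irate(prev, last):
    """Per-second increase between the last two (timestamp, value) samples."""
    (t1, v1), (t2, v2) = prev, last
    return (v2 - v1) / (t2 - t1)


def cpu_usage_percent(idle_samples_per_cpu):
    """Mirror of: 100 - avg(irate(node_cpu_seconds_total{mode="idle"})) * 100."""
    rates = [irate(prev, last) for prev, last in idle_samples_per_cpu]
    return 100 - (sum(rates) / len(rates)) * 100


# Two CPUs, samples 15 seconds apart (the scrape interval used in this setup).
# cpu0 was idle for 1.5 of the 15 seconds, cpu1 for 4.5 of them.
samples = [
    ((0, 1000.0), (15, 1001.5)),
    ((0, 1000.0), (15, 1004.5)),
]
usage = cpu_usage_percent(samples)
should_alert = usage > 80
```

With these numbers the average idle fraction is 0.2, so usage comes out at roughly 80 percent; because the rule compares with a strict greater-than, a value sitting at the threshold is on the edge and the alert depends on it actually exceeding 80.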
Mom never has to worry about my servers again???