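Deploying Alertmanager
Prometheus needs the Alertmanager listen address and port in its configuration file, so I deploy Alertmanager and Prometheus in the same pod. Alertmanager could also run as a separate pod and be configured through a Service name and port, but for reasons I have not tracked down, I could not get that to work in testing. Add the corresponding configuration to prometheus.yml: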
prometheus.yml: |-
  global:
    scrape_interval: 90s
    evaluation_interval: 90s
  alerting:
    alertmanagers:
    - static_configs:
      - targets: ["localhost:9093"]
        #- alertmanager:9093
  rule_files:
  - /etc/prometheus/rules.yml
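For reference, the separate-pod alternative mentioned above might look roughly like the sketch below. It is untested here and is not the setup used in this post. The Service name `alertmanager`, the `app: alertmanager` selector, and the existence of a separate Alertmanager Deployment carrying that label are all assumptions made for illustration.

```yaml
# Hypothetical Service in front of a separately deployed Alertmanager pod.
# Assumes that pod carries the label app: alertmanager (not part of this post).
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: kube-system
spec:
  selector:
    app: alertmanager
  ports:
  - name: web
    port: 9093
    targetPort: 9093
---
# Matching fragment of prometheus.yml: point at the Service DNS name
# instead of localhost.
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["alertmanager.kube-system.svc:9093"]
```

With this layout the target must resolve through cluster DNS, so a quick check from inside the Prometheus pod that the name resolves and port 9093 answers is a reasonable first debugging step.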
Add the alerting rules that Alertmanager will act on as a rules.yml key next to prometheus.yml in the same ConfigMap:
rules.yml: |-
  groups:
  - name: test-rule
    rules:
    - alert: NodeFilesystemUsage
      expr: (node_filesystem_size{device="rootfs"} - node_filesystem_free{device="rootfs"}) / node_filesystem_size{device="rootfs"} * 100 > 80
      for: 2m
      labels:
        team: node
      annotations:
        summary: "{{$labels.instance}}: High Filesystem usage detected"
        description: "{{$labels.instance}}: Filesystem usage is above 80% (current value is: {{ $value }})"
    - alert: NodeMemoryUsage
      expr: (node_memory_MemTotal - (node_memory_MemFree+node_memory_Buffers+node_memory_Cached )) / node_memory_MemTotal * 100 > 80
      for: 2m
      labels:
        team: node
      annotations:
        summary: "{{$labels.instance}}: High Memory usage detected"
        description: "{{$labels.instance}}: Memory usage is above 80% (current value is: {{ $value }})"
    - alert: NodeCPUUsage
      expr: (100 - (avg by (instance) (irate(node_cpu{job="kubernetes-node-exporter",mode="idle"}[5m])) * 100)) > 80
      for: 2m
      labels:
        team: node
      annotations:
        summary: "{{$labels.instance}}: High CPU usage detected"
        description: "{{$labels.instance}}: CPU usage is above 80% (current value is: {{ $value }})"
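The metric names above match older node_exporter releases. node_exporter 0.16 and later renamed many metrics (for example `node_cpu` became `node_cpu_seconds_total`, and the memory and filesystem metrics gained a `_bytes` suffix). If the expressions above return no data, a variant along the following lines may be what you need. This is only a sketch, assuming a newer node_exporter than the one used in this post; the job label is carried over unchanged from the rules above.

```yaml
groups:
- name: test-rule
  rules:
  - alert: NodeMemoryUsage
    # Same logic as the rule above, using the _bytes names from node_exporter >= 0.16.
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 80
    for: 2m
    labels:
      team: node
    annotations:
      summary: "{{$labels.instance}}: High Memory usage detected"
      description: "{{$labels.instance}}: Memory usage is above 80% (current value is: {{ $value }})"
  - alert: NodeCPUUsage
    # node_cpu was renamed to node_cpu_seconds_total.
    expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{job="kubernetes-node-exporter",mode="idle"}[5m])) * 100)) > 80
    for: 2m
    labels:
      team: node
    annotations:
      summary: "{{$labels.instance}}: High CPU usage detected"
      description: "{{$labels.instance}}: CPU usage is above 80% (current value is: {{ $value }})"
```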
Modify prometheus-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-deployment
  namespace: kube-system
  #annotations:
  #  used to scrape app's metrics which deployed in pod
  #  prometheus.io/scrape: 'true'
  #  prometheus scrape path, default /metrics
  #  prometheus.io/path: '/metrics'
  #  prometheus.io/port relevant port
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      securityContext:
        runAsUser: 0
      containers:
      - name: prometheus
        image: prom/prometheus:v2.2.0
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - name: gluster-volume
          mountPath: /prometheus
        - name: config-volume
          mountPath: /etc/prometheus
      - name: alertmanager
        image: x.x.x.x/library/prom/alertmanager:latest
        args:
        - '--config.file=/etc/alertmanager/config.yml'
        ports:
        - name: alertmanager
          containerPort: 9093
        volumeMounts:
        - name: alert-volume
          mountPath: /etc/alertmanager
      imagePullSecrets:
      - name: my-secret
      volumes:
      - name: gluster-volume
        persistentVolumeClaim:
          claimName: gluster-prometheus
      - name: config-volume
        configMap:
          name: prometheus-server-conf
      - name: alert-volume
        configMap:
          name: alertmanager
Prepare the email settings that Alertmanager needs for sending alerts:
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: kube-system
data:
  config.yml: |-
    global:
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: '[email protected]'
      smtp_auth_username: '[email protected]'
      smtp_auth_password: 'xxxx'
    templates:
    - '/root/alertmanager/template/*.tmpl'
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 10m
      receiver: default-receiver
    receivers:
    - name: 'default-receiver'
      email_configs:
      - to: '[email protected]'
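The rules above all attach a `team: node` label, which the configuration shown does not use yet. One possible way to make use of it is a child route that sends those alerts to their own receiver. The sketch below shows only the `route` and `receivers` sections of config.yml; the receiver name `node-team` and both example.com addresses are hypothetical placeholders, not values from this setup.

```yaml
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 10m
  receiver: default-receiver      # fallback for alerts no child route matches
  routes:
  # Alerts labeled team: node (set in rules.yml) go to a dedicated receiver.
  - match:
      team: node
    receiver: node-team
receivers:
- name: 'default-receiver'
  email_configs:
  - to: 'ops@example.com'         # placeholder address
- name: 'node-team'               # hypothetical receiver name
  email_configs:
  - to: 'node-team@example.com'   # placeholder address
```

Newer Alertmanager releases prefer the `matchers` field over `match`, so check which form your version expects.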
Note: SMTP must be enabled in the 163 mailbox settings, otherwise errors like the following are reported:
level=error ts=2018-04-03T03:39:32.793284112Z caller=notify.go:303 component=dispatcher msg="Error on notify" err="*notify.loginAuth failed: 550 User has no permission"
level=error ts=2018-04-03T03:39:32.793463167Z caller=dispatch.go:266 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="*notify.loginAuth failed: 550 User has no permission"
Then create the resources to complete the deployment.