Prometheus + Alertmanager + Grafana: a powerful combination

1. Introduction to Prometheus

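Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Since its inception in 2012, many companies and organizations have adopted it, and the project has a very active developer and user community. It is now a standalone open-source project, maintained independently of any company. Prometheus joined the CNCF (Cloud Native Computing Foundation) in 2016 as the second hosted project, after Kubernetes.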

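Prometheus has the following features:

a multi-dimensional data model, with time series identified by a metric name and key/value pairs;
PromQL, a flexible query language that takes advantage of this dimensionality;
no reliance on distributed storage; each server node is autonomous;
time series collection via a pull model over HTTP;
pushing of time series supported through an intermediary Pushgateway;
targets discovered through service discovery or static configuration;
support for multiple modes of graphing and dashboards.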
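To make the data model concrete, here is a minimal Python sketch of how a sample is written in the Prometheus text exposition format and how a metric name plus its label set identifies one time series. The function names are hypothetical helpers for illustration and are not part of any Prometheus library.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(metric: str, labels: dict, value) -> str:
    """Render one sample as `metric{key="value",...} value`."""
    if not labels:
        return f"{metric} {value}"
    pairs = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{metric}{{{pairs}}} {value}"


def series_identity(metric: str, labels: dict) -> tuple:
    """A time series is identified by its metric name plus its full label set."""
    return (metric, tuple(sorted(labels.items())))
```

Two samples with the same metric name but different label values belong to different time series, which is what lets PromQL filter and aggregate along any label.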

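Prometheus has the following advantages:

Easy to manage: the core is a single binary with no third-party dependencies (databases, caches, and so on);
Powerful data model: all collected monitoring data is stored as metrics in the built-in time series database (TSDB);
Efficient: a monitoring system with many monitoring jobs inevitably produces a lot of data, and Prometheus handles it efficiently. A single Prometheus server instance can handle millions of time series and ingest hundreds of thousands of samples per second;
Rich client libraries: with the Prometheus client libraries, users can easily add Prometheus support to their applications and so expose the real internal state of their services and applications;
Scalable: each data center and each team can run its own Prometheus server. Prometheus also supports federation, which lets multiple Prometheus instances form one logical cluster. When a single Prometheus server has too much work to handle, it can be scaled out with functional sharding plus federation;
Easy to integrate: a monitoring service can be set up quickly with Prometheus and integrated into applications with little effort. Client SDKs currently exist for Java, JMX, Python, Go, Ruby, .NET, Node.js and other languages. With these SDKs an application can quickly be brought under Prometheus monitoring, or you can write your own collector. The data these clients collect is not tied to Prometheus; it can also feed other monitoring tools such as Graphite.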

2. Prometheus architecture

The official architecture diagram shows the following components:

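(1) Prometheus Server: the core of Prometheus. According to its configuration, it handles scraping, service discovery, and data storage.

(2) Service discovery: supports file-based discovery through file_sd, which watches local files (another tool has to update those files), and can also watch the Kubernetes API to discover services dynamically.

(3) Prometheus targets: exporters that expose a scrape endpoint, or applications that natively expose an endpoint in the Prometheus data model.

(4) Pushgateway: a component for the cases that require push. Monitoring data is first pushed to the Pushgateway and then pulled by the server. (If the data on the Pushgateway does not change between two scrapes, the server collects the same values twice, differing only in timestamp.)

(5) Alertmanager: the alerting component. It can send alerts to email, PagerDuty, HipChat, WeChat, and others.

(6) Prometheus web UI: a graphical interface for visualizing the collected data.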
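As an illustration of what a scrape target is, the following is a minimal sketch of an exporter written with only the Python standard library: it serves plain-text samples on /metrics, which is all the server needs in order to pull from it. The metric names and values are made up for the example, and a real exporter would normally use an official client library instead.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metric values; a real exporter would read them from the system.
METRICS = {"demo_temperature_celsius": 21.5, "demo_requests_total": 42}


def render_metrics(metrics: dict) -> str:
    """Render metrics in the text exposition format, one sample per line."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus scrapes the /metrics path by default.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet


def start_exporter(port: int = 0) -> HTTPServer:
    """Start the exporter in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling start_exporter(9200) and adding 127.0.0.1:9200 as a target would be enough for a Prometheus server to begin collecting these two series; the port number is an arbitrary choice.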

3. Environment preparation

This section records in detail how Prometheus was deployed in our production environment.

Hostname	Spec	OS	IP address	Role
prometheus	8C16G	ubuntu16.04	10.13.103.151	prometheus server,grafana server
prometheus-alertmanager	8C16G	ubuntu16.04	10.13.103.152	alertmanager server

3.1 Deploying the Prometheus server

The Prometheus server is the core of Prometheus; it scrapes and stores the data.

# Download the binary tarball and extract it

root@prometheus:~# wget https://github.com/prometheus/prometheus/releases/download/v2.4.3/prometheus-2.4.3.linux-amd64.tar.gz

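root@prometheus:~# tar -xf prometheus-2.4.3.linux-amd64.tar.gz -C /data/

# The tarball extracts to prometheus-2.4.3.linux-amd64; rename it to match the paths used below

root@prometheus:~# mv /data/prometheus-2.4.3.linux-amd64 /data/prometheus-2.4.3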

root@prometheus:~# cd /data/prometheus-2.4.3/

root@prometheus:/data/prometheus-2.4.3# mkdir log

# Edit the Prometheus configuration file

root@prometheus:/data/prometheus-2.4.3# vim prometheus.yml
# my global config
global:
  scrape_interval: 30s # Scrape targets every 30 seconds. The default is every 1 minute.
  evaluation_interval: 25s # Evaluate rules every 25 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 10.13.103.152:9093 # Alertmanager host address

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "/data/prometheus-2.4.3/rules/node_down.yml" # instance liveness alert rules
  - "/data/prometheus-2.4.3/rules/memory_over.yml" # memory alert rules
  - "/data/prometheus-2.4.3/rules/disk_over.yml" # disk alert rules
  - "/data/prometheus-2.4.3/rules/cpu_over.yml" # CPU alert rules

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'GICHOST'
    file_sd_configs:
    - files: ['./host.json'] # Monitored hosts. They could all be listed under static_configs; here they are loaded from a file through file_sd_configs.

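# The monitored hosts can be written in JSON or YAML; JSON is used here. "targets" lists the address of each monitored machine; "labels" is optional and can be defined as you like.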

root@prometheus:/data/prometheus-2.4.3# vim host.json
[
  {
    "targets": [
      "10.13.101.131:9100",
      "10.13.101.132:9100",
      "10.13.103.251:9100"
    ],
    "labels": {
      "host": "GIC_node"
    }
  },
  {
    "targets": [
      "10.13.101.10:9100",
      "10.13.101.11:9100",
      "10.13.103.22:9100"
    ],
    "labels": {
      "service": "web"
    }
  }
]
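Because file_sd only watches the file, something else has to write it. The following Python sketch shows one way to generate host.json from an inventory; the function names and the inventory shape are assumptions made for this example, and only the output layout (a list of objects with "targets" and "labels") comes from the file above.

```python
import json
import os


def build_file_sd(groups: list, port: int = 9100) -> list:
    """Turn (labels, hosts) pairs into file_sd target groups."""
    return [
        {"targets": [f"{host}:{port}" for host in hosts], "labels": labels}
        for labels, hosts in groups
    ]


def write_file_sd(path: str, groups: list) -> None:
    """Write the target groups as JSON, replacing the old file in one step."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(build_file_sd(groups), handle, indent=2)
        handle.write("\n")
    # Rename over the old file so a reader never sees a half-written file.
    os.replace(tmp_path, path)
```

For example, write_file_sd("/data/prometheus-2.4.3/host.json", [({"service": "web"}, ["10.13.101.10", "10.13.101.11"])]) would regenerate the second group above with two hosts. Writing to a temporary file and renaming it is a design choice that keeps Prometheus from picking up a partially written target list.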

# Configure the alert rules: alert when CPU usage exceeds 90%, memory usage exceeds 80%, or disk usage exceeds 80%

root@prometheus:/data/prometheus-2.4.3# mkdir rules

root@prometheus:/data/prometheus-2.4.3# cd rules

root@prometheus:/data/prometheus-2.4.3/rules# touch cpu_over.yml disk_over.yml memory_over.yml node_down.yml

root@prometheus:/data/prometheus-2.4.3/rules# ls
cpu_over.yml disk_over.yml memory_over.yml node_down.yml

# CPU alert rule
root@prometheus:/data/prometheus-2.4.3/rules# vim cpu_over.yml
groups:
- name: CPU alert rules
  rules:
  - alert: NodeCPUUsage
    expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[1m]))) * 100 > 90
    for: 1m
    annotations:
      description: "Host {{ $labels.instance }}: CPU usage is above 90%! (current value: {{ $value }}%)"
      summary: "Host {{ $labels.instance }}: CPU check"
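The arithmetic behind the CPU expression can be sketched in Python. This is an illustration of the formula only, not how Prometheus evaluates PromQL: irate takes the last two samples of the idle-time counter for each core, the per-core idle fractions are averaged per instance, and the result is subtracted from 100. Counter resets are ignored here for brevity, and the function names are hypothetical.

```python
def irate(samples: list) -> float:
    """Per-second rate from the last two (timestamp, value) counter samples."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


def cpu_usage_percent(idle_rates: list) -> float:
    """idle_rates holds, for each core, the fraction of each second spent idle."""
    if not idle_rates:
        raise ValueError("need at least one core")
    return 100 - (sum(idle_rates) / len(idle_rates)) * 100


def should_alert(idle_rates: list, threshold: float = 90.0) -> bool:
    """True when the averaged CPU usage is above the alert threshold."""
    return cpu_usage_percent(idle_rates) > threshold
```

With two cores idle 10% and 5% of the time, the average idle fraction is 7.5%, so usage is about 92.5% and the condition of the rule holds.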

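# Disk alert rule
root@prometheus:/data/prometheus-2.4.3/rules# vim disk_over.yml
groups:
- name: Disk alert rules
  rules:
  - alert: NodeDiskUsage
    expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 80
    for: 1m
    annotations:
      description: "Host {{ $labels.instance }}: disk device {{ $labels.device }} is more than 80% used! (mount point: {{ $labels.mountpoint }}, current value: {{ $value }}%)"
      summary: "Host {{ $labels.instance }}: disk check"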

# Memory alert rule
root@prometheus:/data/prometheus-2.4.3/rules# vim memory_over.yml
groups:
- name: Memory alert rules
  rules:
  - alert: NodeMemoryUsage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 80
    for: 1m
    annotations:
      description: "Host {{ $labels.instance }}: memory usage is above 80%! (current value: {{ $value }}%)"
      summary: "Host {{ $labels.instance }}: memory check"
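The memory metrics in this expression come from /proc/meminfo on the monitored host, converted from kB to bytes. As a rough cross-check of the rule, the same percentage can be computed locally with the Python sketch below; the function names are hypothetical, and the formula is simply the one in the expression above.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: value in bytes}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or not fields:
            continue
        amount = int(fields[0])
        # Fields with a "kB" unit are converted to bytes; plain counts are kept.
        values[name.strip()] = amount * 1024 if fields[1:] == ["kB"] else amount
    return values


def memory_used_percent(info: dict) -> float:
    """Used memory = total minus free, buffers and page cache, as a percentage."""
    reclaimable = info["MemFree"] + info["Buffers"] + info["Cached"]
    return (info["MemTotal"] - reclaimable) / info["MemTotal"] * 100
```

On a Linux host, memory_used_percent(parse_meminfo(open("/proc/meminfo").read())) gives a value to compare against the 80% threshold.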

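# Host liveness alert rule
root@prometheus:/data/prometheus-2.4.3/rules# vim node_down.yml
groups:
- name: Host liveness alert rules
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    annotations:
      description: "Host {{ $labels.instance }} (job {{ $labels.job }}) has been down for more than 1 minute, please check!"
      summary: "Host {{ $labels.instance }}: liveness check"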
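Each rule uses for: 1m, which means the condition has to stay true over consecutive evaluations for one minute before the alert fires; until then it is only pending. The Python sketch below simulates that behaviour under the assumption that evaluations happen exactly every evaluation_interval. It is a simplified model for illustration, not Prometheus code.

```python
def alert_states(condition_by_eval: list, eval_interval: float, for_duration: float) -> list:
    """Return the alert state after each rule evaluation."""
    states = []
    active_since = None  # time at which the condition first became true
    for index, condition in enumerate(condition_by_eval):
        now = index * eval_interval
        if not condition:
            # Any evaluation where the condition is false resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= for_duration else "pending")
    return states
```

With the 25s evaluation interval configured earlier and for: 1m, a condition that stays true is pending at 0s, 25s and 50s and fires at 75s, so the real delay before a notification is somewhat longer than one minute.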

# Start Prometheus under supervisor so that it is restarted automatically if it stops unexpectedly; a systemd unit would work as well

root@prometheus:/data/prometheus-2.4.3# apt-get install -y supervisor

root@prometheus:/data/prometheus-2.4.3# cd /etc/supervisor/conf.d/

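# Configure how Prometheus is started: config.file sets the configuration file loaded at startup, storage.tsdb.path sets where the collected data is stored, and storage.tsdb.retention sets how long the data is kept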

root@prometheus:/etc/supervisor/conf.d# vim prometheus.conf
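[program:prometheus]
# Command that starts the program
command = /data/prometheus-2.4.3/prometheus --config.file=/data/prometheus-2.4.3/prometheus.yml --storage.tsdb.path=/data/prometheus-2.4.3/data --storage.tsdb.retention=60d
# Start automatically when supervisord starts
autostart = true
# Restart automatically after an abnormal exit
autorestart = true
# Consider the program started once it has run for 5 seconds without exiting
startsecs = 5
# Number of start retries on failure; the default is 3
startretries = 3
# User to run the program as
# user = nobody
# Redirect stderr to stdout; the default is false
redirect_stderr = true
# Standard output log file
stdout_logfile=/data/prometheus-2.4.3/log/out-prometheus.log
# Error log file (not written while redirect_stderr is true)
stderr_logfile=/data/prometheus-2.4.3/log/err-prometheus.log
# Maximum size of the standard output log file; the default is 50MB
stdout_logfile_maxbytes = 20MB
# Number of rotated standard output log files to keep
stdout_logfile_backups = 20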

# Load the new program definition; because autostart = true, this also starts the program

root@prometheus:/etc/supervisor/conf.d# supervisorctl update

root@prometheus:/etc/supervisor/conf.d# supervisorctl status
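Since the data directory keeps 60 days of samples, it helps to estimate how much disk that needs. The Python sketch below applies the usual rule of thumb from the Prometheus documentation: retention time in seconds, times ingested samples per second, times roughly 1 to 2 bytes per sample. The series count in the usage note is an assumption, not a measurement from this environment.

```python
def estimate_disk_bytes(
    retention_days: float,
    series: int,
    scrape_interval_s: float,
    bytes_per_sample: float = 2.0,
) -> float:
    """Rough TSDB size: retention seconds * samples per second * bytes per sample."""
    samples_per_second = series / scrape_interval_s
    return retention_days * 86400 * samples_per_second * bytes_per_sample
```

Assuming, hypothetically, 6 hosts exposing about 500 series each (3000 series in total) at the 30s scrape interval used here, estimate_disk_bytes(60, 3000, 30) comes to about 1.04 GB for 60 days. The real figure depends on how many series node_exporter exposes on each host.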

3.2 Deploying node_exporter

The CPU, memory, and disk data that Prometheus scrapes above comes from node_exporter, which has to be deployed on every monitored machine.

# Download node_exporter and extract it

root@prometheus:~# wget https://github.com/prometheus/node_exporter/releases/download/v0.16.0/node_exporter-0.16.0.linux-amd64.tar.gz

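root@prometheus:~# tar -xf node_exporter-0.16.0.linux-amd64.tar.gz -C /data/

# The tarball extracts to node_exporter-0.16.0.linux-amd64; rename it and create the log directory used below

root@prometheus:~# mv /data/node_exporter-0.16.0.linux-amd64 /data/node_exporter-0.16.0

root@prometheus:~# mkdir /data/node_exporter-0.16.0/log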

# Configure supervisor to start node_exporter

root@prometheus:~# cd /etc/supervisor/conf.d/

root@prometheus:/etc/supervisor/conf.d# vim node_exporter.conf
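[program:node_exporter]
# Command that starts the program
command = /data/node_exporter-0.16.0/node_exporter
# Start automatically when supervisord starts
autostart = true
# Restart automatically after an abnormal exit
autorestart = true
# Consider the program started once it has run for 5 seconds without exiting
startsecs = 5
# Number of start retries on failure; the default is 3
startretries = 3
# User to run the program as
# user = nobody
# Redirect stderr to stdout; the default is false
redirect_stderr = true
# Standard output log file
stdout_logfile=/data/node_exporter-0.16.0/log/out-node_exporter.log
# Error log file (not written while redirect_stderr is true)
stderr_logfile=/data/node_exporter-0.16.0/log/err-node_exporter.log
# Maximum size of the standard output log file; the default is 50MB
stdout_logfile_maxbytes = 20MB
# Number of rotated standard output log files to keep
stdout_logfile_backups = 20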

# Load the new program definition; because autostart = true, this also starts the program

root@prometheus:/etc/supervisor/conf.d# supervisorctl update

root@prometheus:/etc/supervisor/conf.d# supervisorctl status

The monitoring data can now be viewed in Prometheus's built-in web UI at http://10.13.103.151:9090
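Besides the web UI, the same data is available over the Prometheus HTTP API, which is convenient for a quick scripted check that every target is being scraped. The Python sketch below queries the up metric through /api/v1/query and lists the instances that are down. The function names are hypothetical; the response layout (status, data.result, and value as a [timestamp, "string"] pair) follows the instant-query format of the API.

```python
import json
import urllib.parse
import urllib.request


def query_url(base_url: str, promql: str) -> str:
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def down_instances(payload: dict) -> list:
    """Return the instances whose `up` sample is 0 in an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return sorted(
        item["metric"].get("instance", "")
        for item in payload["data"]["result"]
        if float(item["value"][1]) == 0
    )


def fetch_down_instances(base_url: str) -> list:
    """Run the `up` query against a live Prometheus server."""
    with urllib.request.urlopen(query_url(base_url, "up"), timeout=5) as response:
        return down_instances(json.load(response))
```

Calling fetch_down_instances("http://10.13.103.151:9090") should return an empty list once node_exporter is running on every host listed in host.json.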

3.3 Deploying the Alertmanager server

When a metric crosses the threshold we configured, Prometheus fires an alert and passes it to Alertmanager, which sends us the notification.

# Download Alertmanager and extract it

root@prometheus-alertmanager:~# wget https://github.com/prometheus/alertmanager/releases/download/v0.15.1/alertmanager-0.15.1.linux-amd64.tar.gz

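root@prometheus-alertmanager:~# tar -xf alertmanager-0.15.1.linux-amd64.tar.gz -C /data

# The tarball extracts to alertmanager-0.15.1.linux-amd64; rename it to match the paths used below

root@prometheus-alertmanager:~# mv /data/alertmanager-0.15.1.linux-amd64 /data/alertmanager-0.15.1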

root@prometheus-alertmanager:~# cd /data/alertmanager-0.15.1/

root@prometheus-alertmanager:/data/alertmanager-0.15.1# mkdir log

# Edit the Alertmanager configuration file

root@prometheus-alertmanager:/data/alertmanager-0.15.1# vim alertmanager.yml
global:
  # The smarthost and SMTP sender used for mail notifications. Set these according to your actual mail account and password.
  smtp_smarthost: 'smtp.exmail.qq.com:25'
  smtp_from: 'XXXXXX'
  smtp_auth_username: 'XXXXXX'
  smtp_auth_password: 'XXXXXX'
  smtp_require_tls: false
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/' # WeChat API endpoint

# The directory from which notification templates are read.
templates:
- '/data/alertmanager-0.15.1/template/*.tmpl' # templates for the messages we receive

# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname', 'cluster', 'service']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 12h

  # A default receiver
  receiver: default

receivers:
- name: 'default'
  email_configs:
  - to: 'appops.capitalonline.net'
    # headers: { Subject: "Alertmanager alert email"}
  wechat_configs: # WeChat account settings for receiving alerts
  - corp_id: 'XXXXXX'
    send_resolved: true
    to_user: '@all'
    # to_party: '2'
    agent_id: '1000003'
    api_secret: 'XXXXXX'
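The effect of group_by can be illustrated with a small Python sketch: alerts that share the same values for the grouping labels end up in one notification batch. This is a simplified model of the grouping step only (no timers, no routing tree), and the function name is hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts: list, group_by: list) -> dict:
    """Group alerts by the values of the group_by labels (missing label = "")."""
    groups = defaultdict(list)
    for alert in alerts:
        labels = alert["labels"]
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(alert)
    return dict(groups)
```

With group_by set to alertname, cluster and service as above, NodeCPUUsage alerts from two hosts that carry the same service label are batched into a single message, while an InstanceDown alert forms its own group.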

# The default WeChat message format is rather messy, so we define a template for WeChat; email keeps the default format

root@prometheus-alertmanager:/data/alertmanager-0.15.1# mkdir template

root@prometheus-alertmanager:/data/alertmanager-0.15.1# cd template/

root@prometheus-alertmanager:/data/alertmanager-0.15.1/template# vim wechat.tmpl
{{ define "wechat.default.message" }}
{{ range .Alerts }}
**********start**********
[Alert source]: alertmanager
[Alert type]: {{ .Labels.alertname }}
[Affected host]: {{ .Labels.instance }}
[Summary]: {{ .Annotations.summary }}
[Details]: {{ .Annotations.description }}
[Started at]: {{ .StartsAt }}
**********end**********
{{ end }}
{{ end }}

# Configure supervisor to start Alertmanager

root@prometheus-alertmanager:/data/alertmanager-0.15.1/template# cd /etc/supervisor/conf.d/

root@prometheus-alertmanager:/etc/supervisor/conf.d# vim alertmanager.conf
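[program:alertmanager]
# Command that starts the program
command = /data/alertmanager-0.15.1/alertmanager --config.file=/data/alertmanager-0.15.1/alertmanager.yml --storage.path=/data/alertmanager-0.15.1/data/
# Start automatically when supervisord starts
autostart = true
# Restart automatically after an abnormal exit
autorestart = true
# Consider the program started once it has run for 5 seconds without exiting
startsecs = 5
# Number of start retries on failure; the default is 3
startretries = 3
# User to run the program as
# user = nobody
# Redirect stderr to stdout; the default is false
redirect_stderr = true
# Standard output log file
stdout_logfile=/data/alertmanager-0.15.1/log/out-alertmanager.log
# Error log file (not written while redirect_stderr is true)
stderr_logfile=/data/alertmanager-0.15.1/log/err-alertmanager.log
# Maximum size of the standard output log file; the default is 50MB
stdout_logfile_maxbytes = 20MB
# Number of rotated standard output log files to keep
stdout_logfile_backups = 20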

# Load the new program definition; because autostart = true, this also starts the program

root@prometheus-alertmanager:/etc/supervisor/conf.d# supervisorctl update

root@prometheus-alertmanager:/etc/supervisor/conf.d# supervisorctl status
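To check the notification path without waiting for a real incident, a hand-made alert can be posted to Alertmanager. The Python sketch below targets the /api/v1/alerts endpoint of the 0.15.1 release installed here; newer Alertmanager releases expose /api/v2/alerts instead, so the path is an assumption tied to this version. The alert name, labels and annotation text are invented for the test.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_alert(instance: str, now: datetime) -> list:
    """Build the JSON body: a list with one alert carrying labels and annotations."""
    return [
        {
            "labels": {"alertname": "TestAlert", "instance": instance},
            "annotations": {
                "summary": "Test alert",
                "description": "Manual test of the notification path",
            },
            "startsAt": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        }
    ]


def send_test_alert(alertmanager_url: str, instance: str) -> int:
    """POST the test alert and return the HTTP status code."""
    body = json.dumps(build_test_alert(instance, datetime.now(timezone.utc)))
    request = urllib.request.Request(
        f"{alertmanager_url}/api/v1/alerts",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Running send_test_alert("http://10.13.103.152:9093", "10.13.101.131:9100") should produce a notification after the 30s group_wait configured above, formatted for WeChat by the wechat.tmpl template.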

3.4 Deploying the Grafana server

Prometheus's built-in web UI is fairly basic, so we use Grafana together with Prometheus to display the collected data.

root@prometheus:~# curl https://packagecloud.io/gpg.key | sudo apt-key add -

root@prometheus:~# wget https://packagecloud.io/grafana/stable/debian/pool/stretch/main/g/grafana/grafana_5.3.4_amd64.deb

root@prometheus:~# apt-get install -y adduser libfontconfig

root@prometheus:~# dpkg -i grafana_5.3.4_amd64.deb

root@prometheus:~# systemctl start grafana-server.service

root@prometheus:~# systemctl enable grafana-server.service

root@prometheus:~# grafana-server -v

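Log in to the Grafana web UI at http://10.13.103.151:3000 and add a data source and dashboards. Grafana provides many official dashboard templates that you can download and import as needed, or you can write your own dashboards to fit your requirements.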
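The data source can also be added through Grafana's HTTP API instead of the web UI, which helps when the setup needs to be repeatable. The Python sketch below builds the request for POST /api/datasources with basic authentication. The data source name and the credentials passed in are placeholders to be replaced with your own, and the function names are hypothetical.

```python
import base64
import json
import urllib.request


def datasource_payload(name: str, prometheus_url: str) -> dict:
    """Describe a Prometheus data source that Grafana queries server-side."""
    return {
        "name": name,
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",
        "isDefault": True,
    }


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def add_datasource(grafana_url: str, user: str, password: str, payload: dict) -> int:
    """POST the data source definition and return the HTTP status code."""
    request = urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": basic_auth_header(user, password),
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

For this environment the payload would be datasource_payload("Prometheus", "http://10.13.103.151:9090"), sent to http://10.13.103.151:3000 with an admin account.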

References:

https://prometheus.io/docs/introduction/overview/

https://github.com/prometheus
