Monitoring Spark Applications with Prometheus and Grafana (Tested Hands-On)

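Background

Every developer wants to know how a job behaves at run time so that it can be tuned and debugged. The Spark web UI already exposes a lot of information, such as shuffle statistics and task progress. The run-time state of the Executor JVMs, however, is a black box. When an application fails with an out-of-memory error, a less experienced user may not be able to tell whether the Driver or an Executor ran out of memory, and so cannot adjust the right parameters.

Spark's metrics system already provides this data; all we need to do is collect and display it.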

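Implementation

Technical approach

Prometheus serves as the back-end store; comparable time-series databases include InfluxDB and OpenTSDB.
Grafana serves as the front end; Graphite or hand-drawn charts would also work.

The main advantage of this setup is that every component works out of the box.

For larger clusters, consider writing the metrics to Kafka first and then consuming them into the database. This decouples collection from storage, can improve throughput to some extent, and requires only a single Kafka sink instead of one adapter per database. An existing tool worth using here is jvm-profiler.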

Versions:
grafana-5.2.4
graphite_exporter-0.3.0
prometheus-2.3.2

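Collecting the data and writing it to the database

Spark does not ship a Prometheus sink by default, so normally you would need a custom implementation such as spark-metrics.

Prometheus, however, also offers an exporter (graphite_exporter) that converts the Graphite metrics emitted by a Spark application into Prometheus metrics and exposes them for Prometheus to scrape. Spark ships with a Graphite sink, so no custom code is needed; a little configuration is enough.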
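Before wiring Spark in, it can help to check that graphite_exporter accepts data at all. The sketch below pushes one sample in the Graphite plaintext format (`<path> <value> <timestamp>`) over TCP. The function names and the metric path are made up for illustration, and the host and port are simply the ones used in this post's configuration.

```python
import socket
import time


def format_graphite_line(path, value, timestamp=None):
    """Build one Graphite plaintext line: '<path> <value> <timestamp>' plus newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_test_metric(host, port, path, value):
    """Open a TCP connection to the exporter's Graphite port and push one sample."""
    line = format_graphite_line(path, value)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("utf-8"))
    return line
```

For example, `send_test_metric("192.168.0.150", 9109, "app-test.driver.jvm.heap.usage", 0.42)` sends a hypothetical path shaped like the ones Spark produces; if the exporter is running, the sample should then be visible on its metrics page.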

vim /home/hadoopfile/bigdataprograms/user/conf/metrics.properties

Add the following:

*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
# Host that receives the Graphite data (where graphite_exporter runs)
*.sink.graphite.host=192.168.0.150
# Port on which graphite_exporter accepts Graphite data
*.sink.graphite.port=9109
*.sink.graphite.period=5
*.sink.graphite.unit=seconds

*.source.jvm.class=org.apache.spark.metrics.source.JvmSource

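As shown in the figure below:

When submitting, remember to pass --files /path/to/spark/conf/metrics.properties so that the configuration file is distributed to every Executor; otherwise the Executor metrics will not be collected.

After the application starts, if collection is working, the metrics appear on the page at http://<metrics_hostname>:<metrics_port>/metrics, as shown in the figure below: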
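The same check can be scripted instead of done in a browser. The sketch below fetches a metrics page and keeps only the samples whose name starts with a given prefix. The function names are illustrative, and the parser handles only the simple `name{labels} value` lines of the Prometheus text format, which is enough for a quick look.

```python
from urllib.request import urlopen


def parse_samples(text, prefix=""):
    """Parse Prometheus text-format lines into (series, value) pairs."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            end = line.rindex("}") + 1  # series ends after the label set
            series, rest = line[:end], line[end:]
        else:
            series, _, rest = line.partition(" ")
        fields = rest.split()
        if fields and series.startswith(prefix):
            samples.append((series, float(fields[0])))
    return samples


def fetch_samples(url, prefix=""):
    """Download a /metrics page and return the samples matching the prefix."""
    with urlopen(url, timeout=5) as resp:
        return parse_samples(resp.read().decode("utf-8"), prefix)
```

A call such as `fetch_samples("http://192.168.0.150:9108/metrics", "jvm_")` would list the JVM series; port 9108 is assumed here as the exporter's web port, so substitute your own `<metrics_port>`.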

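The raw Graphite data (the same metrics that the Spark application's web page exposes as JSON) can be converted through a mapping file into Prometheus data with label dimensions.

The figure below shows the Spark application's JSON:

The figure below shows the mapping file (written from the metric names in that JSON):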

vim graphite_exporter_mapping.conf

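# Note: the name that goes with each match pattern does not accept certain special characters (such as - and .), so name cannot be identical to match; use _ (underscore) instead.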

mappings:
- match: '*.*.*.StreamingMetrics.streaming.lastCompletedBatch_processingDelay'
  name: StreamingMetrics_streaming_lastCompletedBatch_processingDelay
  labels:
    application: $1
    executor_id: $2
    app_name: $3

- match: '*.*.*.StreamingMetrics.streaming.lastCompletedBatch_schedulingDelay'
  name: StreamingMetrics_streaming_lastCompletedBatch_schedulingDelay
  labels:
    application: $1
    executor_id: $2
    app_name: $3

- match: '*.*.*.StreamingMetrics.streaming.lastCompletedBatch_totalDelay'
  name: StreamingMetrics_streaming_lastCompletedBatch_totalDelay
  labels:
    application: $1
    executor_id: $2
    app_name: $3

- match: '*.*.jvm.PS_MarkSweep.count'
  name: jvm_PS_MarkSweep_count
  labels:
    application: $1
    executor_id: $2

- match: '*.*.jvm.PS_MarkSweep.time'
  name: jvm_PS_MarkSweep_time
  labels:
    application: $1
    executor_id: $2

- match: '*.*.jvm.PS_Scavenge.count'
  name: jvm_PS_Scavenge_count
  labels:
    application: $1
    executor_id: $2

- match: '*.*.jvm.PS_Scavenge.time'
  name: jvm_PS_Scavenge_time
  labels:
    application: $1
    executor_id: $2

- match: '*.*.jvm.heap.usage'
  name: jvm_heap_usage
  labels:
    application: $1
    executor_id: $2

- match: '*.*.jvm.non_heap.usage'
  name: jvm_non_heap_usage
  labels:
    application: $1
    executor_id: $2

- match: '*.*.jvm.pools.Code_Cache.usage'
  name: jvm_pools_Code_Cache_usage
  labels:
    application: $1
    executor_id: $2

- match: '*.*.jvm.pools.Compressed_Class_Space.usage'
  name: jvm_pools_Compressed_Class_Space_usage
  labels:
    application: $1
    executor_id: $2

- match: '*.*.jvm.pools.Metaspace.usage'
  name: jvm_pools_Metaspace_usage
  labels:
    application: $1
    executor_id: $2

- match: '*.*.jvm.pools.PS_Eden_Space.usage'
  name: jvm_pools_PS_Eden_Space_usage
  labels:
    application: $1
    executor_id: $2

- match: '*.*.jvm.pools.PS_Old_Gen.usage'
  name: jvm_pools_PS_Old_Gen_usage
  labels:
    application: $1
    executor_id: $2

- match: '*.*.jvm.pools.PS_Survivor_Space.usage'
  name: jvm_pools_PS_Survivor_Space_usage
  labels:
    application: $1
    executor_id: $2

- match: '*.*.jvm.total.committed'
  name: jvm_total_committed
  labels:
    application: $1
    executor_id: $2

- match: '*.*.jvm.total.init'
  name: jvm_total_init
  labels:
    application: $1
    executor_id: $2

- match: '*.*.jvm.total.max'
  name: jvm_total_max
  labels:
    application: $1
    executor_id: $2

- match: '*.*.jvm.total.used'
  name: jvm_total_used
  labels:
    application: $1
    executor_id: $2

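The mapping file above converts each Graphite metric that matches a match pattern into a Prometheus metric named by name, carrying the labels application, executor_id and app_name, as shown in the figure below (only part of the data; the output is too long for one screenshot).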
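To make the conversion concrete, here is a rough Python emulation of what a mapping rule does to one Graphite path. It is not the exporter's code: it assumes that each `*` captures exactly one dot-separated component and that `$1`, `$2`, ... refer to those captures in order. Only two of the rules above are reproduced, and the sample paths in the usage note are hypothetical.

```python
import re

# Two rules copied from the mapping file above, as plain Python data.
MAPPINGS = [
    {
        "match": "*.*.*.StreamingMetrics.streaming.lastCompletedBatch_processingDelay",
        "name": "StreamingMetrics_streaming_lastCompletedBatch_processingDelay",
        "labels": {"application": "$1", "executor_id": "$2", "app_name": "$3"},
    },
    {
        "match": "*.*.jvm.heap.usage",
        "name": "jvm_heap_usage",
        "labels": {"application": "$1", "executor_id": "$2"},
    },
]


def glob_to_regex(pattern):
    """Assumption: every '*' captures one component that contains no dot."""
    parts = [r"([^.]*)" if p == "*" else re.escape(p) for p in pattern.split(".")]
    return re.compile("^" + r"\.".join(parts) + "$")


def apply_mappings(metric, mappings=MAPPINGS):
    """Return (name, labels) for the first matching rule, or None if none match."""
    for rule in mappings:
        m = glob_to_regex(rule["match"]).match(metric)
        if m:
            labels = {
                key: re.sub(r"\$(\d+)", lambda ref: m.group(int(ref.group(1))), value)
                for key, value in rule["labels"].items()
            }
            return rule["name"], labels
    return None
```

With these rules, `apply_mappings("app-1.driver.jvm.heap.usage")` yields the name `jvm_heap_usage` with `application="app-1"` and `executor_id="driver"`, which is the shape the exporter's metrics page should show.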

Load the mapping file when starting graphite_exporter:
./graphite_exporter --graphite.mapping-config=graphite_exporter_mapping.conf

Configure Prometheus to scrape the data from graphite_exporter:
/path/to/prometheus/prometheus.yml
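Once Prometheus is scraping the exporter, one way to confirm that the data arrived is an instant query against the Prometheus HTTP API (`/api/v1/query`). The sketch below assumes Prometheus is reachable at http://localhost:9090 and reuses the `jvm_heap_usage` metric and `executor_id` label from the mapping file; the helper names are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def extract_values(payload):
    """Turn an instant-query response into {executor_id: value}."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return {
        item["metric"].get("executor_id", ""): float(item["value"][1])
        for item in payload["data"]["result"]
    }


def query_prometheus(base_url, promql):
    """Run the query over HTTP and return the parsed values."""
    with urlopen(build_query_url(base_url, promql), timeout=5) as resp:
        return extract_values(json.load(resp))
```

Calling `query_prometheus("http://localhost:9090", "jvm_heap_usage")` should return one entry per Driver or Executor that has reported; an empty result usually means the scrape target or the mapping needs another look.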

That's it.
