Monitoring CPU and GPU Status of KubeEdge Edge Nodes (Jetson) on KubeSphere with Prometheus

Author: Zhu Yaguang, engineer at Zhejiang Lab and cloud-native/open-source enthusiast.

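Observability of KubeSphere Edge Nodes

In edge computing scenarios, KubeSphere builds on KubeEdge to distribute and manage applications and workloads uniformly across the cloud and edge nodes, meeting the need to deliver, operate, and control applications on a massive number of edge and end devices.

According to KubeSphere's support matrix, only K8s 1.23.x supports edge computing, and the KubeSphere console does not display monitoring information such as resource usage for edge nodes.

This article builds an integrated cloud-edge computing platform on KubeSphere and KubeEdge and uses Prometheus to monitor the status of Nvidia Jetson edge devices, bringing observability to edge nodes in KubeSphere.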

Component       Version
KubeSphere      3.4.1
containerd      1.7.2
K8s             1.26.0
KubeEdge        1.15.1
Jetson model    NVIDIA Jetson Xavier NX (16GB RAM)
Jtop            4.2.7
JetPack         5.1.3-b29
Docker          24.0.5

Deploying the K8s Environment

Refer to the KubeSphere deployment documentation. KubeKey can quickly deploy a K8s cluster.

# Deploy a single-master K8s cluster in all-in-one mode

./kk create cluster --with-kubernetes v1.26.0 --with-kubesphere v3.4.1 --container-manager containerd

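Deploying the KubeEdge Environment

Refer to "Deploying the Latest KubeEdge on KubeSphere" to deploy KubeEdge.

Enabling log queries on edge nodes

  1. vim /etc/kubeedge/config/edgecore.yaml

  2. Set enable=true (in KubeEdge this is normally the enable field of the edgeStream module).

Once enabled, pod logs can be queried conveniently, which helps when locating problems.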

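Modifying the KubeSphere Configuration

Enabling the KubeEdge edge node plugin

  1. Modify the configmap ClusterConfiguration.

  2. Set advertiseAddress to the address of the physical machine that hosts cloudhub.

KubeSphere documentation on enabling edge nodes: https://www.kubesphere.io/zh/docs/v3.3/pluggable-components/kubeedge/.

After these changes the edge node shows up, but without CPU or memory information, because no node-exporter pod is running on the edge node.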

Modifying node-exporter affinity

Running kubectl get ds -n kubesphere-monitoring-system shows that node-exporter is not scheduled onto edge nodes.

Change the affinity to:

    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-role.kubernetes.io/edgetest  # changed here so the affinity rule no longer excludes edge nodes
                operator: DoesNotExist
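
Why renaming the key defeats the rule can be seen with a small sketch of how a required node affinity term is evaluated. This is a simplified model that only handles the Exists and DoesNotExist operators; the node label set and the assumption that the original key was node-role.kubernetes.io/edge are illustrative, not taken from the cluster.

```python
from typing import Dict, List


def matches(labels: Dict[str, str], expressions: List[dict]) -> bool:
    """Return True if the node labels satisfy every match expression (they are ANDed)."""
    for expr in expressions:
        key, op = expr["key"], expr["operator"]
        if op not in ("Exists", "DoesNotExist"):
            raise ValueError(f"operator {op!r} is not handled in this sketch")
        if op == "Exists" and key not in labels:
            return False
        if op == "DoesNotExist" and key in labels:
            return False
    return True


# Hypothetical labels of an edge node joined through KubeEdge.
edge_node = {"node-role.kubernetes.io/edge": "", "kubernetes.io/os": "linux"}

# Assumed original rule: keep node-exporter away from nodes carrying the edge role.
original = [{"key": "node-role.kubernetes.io/edge", "operator": "DoesNotExist"}]
# Modified rule from the article: the key no longer exists on any node.
modified = [{"key": "node-role.kubernetes.io/edgetest", "operator": "DoesNotExist"}]

print("original rule admits edge node:", matches(edge_node, original))
print("modified rule admits edge node:", matches(edge_node, modified))
```

Because no node carries the edgetest label, the DoesNotExist expression is true everywhere, so the DaemonSet is free to schedule onto edge nodes as well.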

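node-exporter is now scheduled onto the edge node, but its pod fails to start.

Inspecting the failed pod with kubectl edit shows that the node-exporter pod contains two containers, and the kube-rbac-proxy container is the one that fails to start. Its logs show that kube-rbac-proxy tries to read the two environment variables KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT, cannot get them, and therefore fails to start.

In a K8s cluster, when a pod is created, the environment variables KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT are injected into it so that processes inside the pod can reach kube-apiserver. In pods created on a KubeEdge edge node, these two variables exist but are empty.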
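
The failure mode can be illustrated with a minimal sketch of how an in-cluster client typically derives the API server address from these two variables. This is not kube-rbac-proxy's actual code; the function name and the sample values are assumptions made for illustration.

```python
from typing import Mapping


def api_server_url(env: Mapping[str, str]) -> str:
    """Build the kube-apiserver URL from the service environment variables."""
    host = env.get("KUBERNETES_SERVICE_HOST", "")
    port = env.get("KUBERNETES_SERVICE_PORT", "")
    if not host or not port:
        # Present-but-empty values are as unusable as missing ones.
        raise RuntimeError(
            "KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must both be non-empty"
        )
    return f"https://{host}:{port}"


# Illustrative environment of a pod on a cloud node.
cloud_pod_env = {"KUBERNETES_SERVICE_HOST": "10.96.0.1", "KUBERNETES_SERVICE_PORT": "443"}
# Environment as described for a pod on a KubeEdge edge node: variables exist but are empty.
edge_pod_env = {"KUBERNETES_SERVICE_HOST": "", "KUBERNETES_SERVICE_PORT": ""}

print(api_server_url(cloud_pod_env))
try:
    api_server_url(edge_pod_env)
except RuntimeError as err:
    print("edge pod:", err)
```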

We asked the KubeEdge developers, who said that setting these two environment variables will be added in KubeEdge 1.17. See: https://github.com/wackxu/kubeedge/blob/4a7c00783de9b11e56e56968b2cc950a7d32a403/docs/proposals/edge-pod-list-watch-natively.md

In the meantime, installing EdgeMesh is recommended; once it is installed, pods on the edge can reach kubernetes.default.svc.cluster.local:443.

Deploying EdgeMesh

  1. Configure the cloudcore configmap

    Run kubectl edit cm cloudcore -n kubeedge and set dynamicController=true.

    After the change, restart cloudcore: kubectl delete pod cloudcore-776ffcbbb9-s6ff8 -n kubeedge

  2. Configure the edgecore modules: set metaServer=true and clusterDNS

    $ vim /etc/kubeedge/config/edgecore.yaml
    
    modules:
      ...
      metaManager:
        metaServer:
          enable: true   # set this
    ...
    
    modules:
      ...
      edged:
        ...
        tailoredKubeletConfig:
          ...
          clusterDNS:     # set this
          - 169.254.96.16
    ...
    
    # restart edgecore
    $ systemctl restart edgecore
    

After the change, verify that it took effect.

$ curl 127.0.0.1:10550/api/v1/services

{"apiVersion":"v1","items":[{"apiVersion":"v1","kind":"Service","metadata":{"creationTimestamp":"2021-04-14T06:30:05Z","labels":{"component":"apiserver","provider":"kubernetes"},"name":"kubernetes","namespace":"default","resourceVersion":"147","selfLink":"default/services/kubernetes","uid":"55eeebea-08cf-4d1a-8b04-e85f8ae112a9"},"spec":{"clusterIP":"10.96.0.1","ports":[{"name":"https","port":443,"protocol":"TCP","targetPort":6443}],"sessionAffinity":"None","type":"ClusterIP"},"status":{"loadBalancer":{}}},{"apiVersion":"v1","kind":"Service","metadata":{"annotations":{"prometheus.io/port":"9153","prometheus.io/scrape":"true"},"creationTimestamp":"2021-04-14T06:30:07Z","labels":{"k8s-app":"kube-dns","kubernetes.io/cluster-service":"true","kubernetes.io/name":"KubeDNS"},"name":"kube-dns","namespace":"kube-system","resourceVersion":"203","selfLink":"kube-system/services/kube-dns","uid":"c221ac20-cbfa-406b-812a-c44b9d82d6dc"},"spec":{"clusterIP":"10.96.0.10","ports":[{"name":"dns","port":53,"protocol":"UDP","targetPort":53},{"name":"dns-tcp","port":53,"protocol":"TCP","targetPort":53},{"name":"metrics","port":9153,"protocol":"TCP","targetPort":9153}],"selector":{"k8s-app":"kube-dns"},"sessionAffinity":"None","type":"ClusterIP"},"status":{"loadBalancer":{}}}],"kind":"ServiceList","metadata":{"resourceVersion":"377360","selfLink":"/api/v1/services"}}
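
To make a response like the one above easier to read, the ServiceList can be reduced to namespace, name, and clusterIP. The sketch below is only an illustration: the function name is made up, and the sample payload is a trimmed copy of the output shown above rather than a live call to the metaServer.

```python
import json
from typing import List, Tuple


def summarize_services(payload: str) -> List[Tuple[str, str, str]]:
    """Return (namespace, name, clusterIP) for every Service in a ServiceList JSON document."""
    doc = json.loads(payload)
    if doc.get("kind") != "ServiceList":
        raise ValueError(f"expected a ServiceList, got {doc.get('kind')!r}")
    return [
        (item["metadata"]["namespace"], item["metadata"]["name"], item["spec"]["clusterIP"])
        for item in doc.get("items", [])
    ]


# Trimmed version of the metaServer response shown above.
sample = """
{"kind": "ServiceList", "items": [
  {"metadata": {"name": "kubernetes", "namespace": "default"},
   "spec": {"clusterIP": "10.96.0.1"}},
  {"metadata": {"name": "kube-dns", "namespace": "kube-system"},
   "spec": {"clusterIP": "10.96.0.10"}}
]}
"""

for namespace, name, cluster_ip in summarize_services(sample):
    print(f"{namespace}/{name} -> {cluster_ip}")
```

If the kubernetes Service in the default namespace is listed, the metaServer on the edge node is answering API requests locally.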

  3. Install EdgeMesh

    git clone https://github.com/kubeedge/edgemesh.git
    cd edgemesh
    
    kubectl apply -f build/crds/istio/
    
    kubectl apply -f build/agent/resources/
    

dnsPolicy

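After EdgeMesh is deployed, the two environment variables in node-exporter on the edge node are still empty and kubernetes.default.svc.cluster.local:443 is still unreachable. The cause is a wrong DNS server configuration in the pod: it should be 169.254.96.16, but the pod has the same DNS configuration as the host.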

kubectl exec -it node-exporter-hcmfg -n kubesphere-monitoring-system -- sh
Defaulted container "node-exporter" out of: node-exporter, kube-rbac-proxy
$ cat /etc/resolv.conf
nameserver 127.0.0.53

Change dnsPolicy to ClusterFirstWithHostNet and restart node-exporter; the DNS configuration is then correct.

kubectl edit ds node-exporter -n kubesphere-monitoring-system

  dnsPolicy: ClusterFirstWithHostNet
  hostNetwork: true
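
The DNS check can also be scripted. The following is a minimal sketch that parses the contents of /etc/resolv.conf and tests whether the clusterDNS address configured in edgecore.yaml (169.254.96.16) is among the nameservers; the helper names are assumptions, and the resolv.conf texts are illustrative inputs.

```python
from typing import List

EXPECTED_DNS = "169.254.96.16"  # clusterDNS value configured in edgecore.yaml


def nameservers(resolv_conf: str) -> List[str]:
    """Extract nameserver addresses from resolv.conf text, ignoring comments."""
    servers = []
    for line in resolv_conf.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def uses_cluster_dns(resolv_conf: str) -> bool:
    """True if the pod resolves names through the expected cluster DNS address."""
    return EXPECTED_DNS in nameservers(resolv_conf)


before = "nameserver 127.0.0.53\n"  # what the pod inherited from the host
after = "nameserver 169.254.96.16\nsearch svc.cluster.local cluster.local\n"  # illustrative

print("before:", uses_cluster_dns(before))
print("after: ", uses_cluster_dns(after))
```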

Adding environment variables

vim /etc/systemd/system/edgecore.service

Environment=METASERVER_DUMMY_IP=kubernetes.default.svc.cluster.local
Environment=METASERVER_DUMMY_PORT=443

Restart edgecore after the change.

systemctl daemon-reload
systemctl restart edgecore

node-exporter is now Running.

On the edge node, curl http://127.0.0.1:9100/metrics shows that data from the edge node is being collected.
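
The /metrics endpoint returns plain text in the Prometheus exposition format. As a rough illustration of what that output contains, the sketch below parses such text into a dictionary. It is deliberately simplified (it assumes no trailing timestamps on sample lines), and the sample values are made up rather than taken from a real node.

```python
from typing import Dict


def parse_samples(text: str) -> Dict[str, float]:
    """Map each series (metric name plus labels) to its value, skipping HELP/TYPE comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated field on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Made-up excerpt in the shape node-exporter produces.
sample = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.52
node_memory_MemTotal_bytes 1.6e+10
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""

for series, value in parse_samples(sample).items():
    print(series, value)
```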

Finally, we can expose KubeSphere's Prometheus service (the k8s instance) through a NodePort so that it can be viewed in a browser.

apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.39.1
  name: prometheus-k8s-nodeport
  namespace: kubesphere-monitoring-system
spec:
  ports:
  - port: 9090
    targetPort: 9090
    protocol: TCP
    nodePort: 32143
  selector:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
  type: NodePort

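Visiting the master IP on port 32143 gives access to the edge node's node-exporter data.

CPU and memory information now also appears in the console.

With CPU and memory done, GPU is next.

Monitoring Jetson GPU Status

Installing Jtop

Jetson is an ARM device on which nvidia-smi cannot be run, so Jtop needs to be installed instead.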

sudo apt-get install python3-pip python3-dev -y
sudo -H pip3 install jetson-stats
sudo systemctl restart jtop.service

Installing the Jetson GPU Exporter

Following the referenced blog post, build the Jetson GPU Exporter image; the post also provides the matching Grafana dashboard.

Dockerfile

FROM python:3-buster
RUN pip install --upgrade pip && pip install -U jetson-stats prometheus-client
RUN mkdir -p /root
COPY jetson_stats_prometheus_collector.py /root/jetson_stats_prometheus_collector.py
WORKDIR /root
USER root
RUN chmod +x /root/jetson_stats_prometheus_collector.py
ENTRYPOINT ["python3", "/root/jetson_stats_prometheus_collector.py"]

Code of jetson_stats_prometheus_collector.py

#!/usr/bin/python3
# -*- coding: utf-8 -*-

import atexit
import os
from jtop import jtop, JtopException
from prometheus_client.core import InfoMetricFamily, GaugeMetricFamily, REGISTRY, CounterMetricFamily
from prometheus_client import make_wsgi_app
from wsgiref.simple_server import make_server

class CustomCollector(object):
    def __init__(self):
        atexit.register(self.cleanup)
        self._jetson = jtop()
        self._jetson.start()

    def cleanup(self):
        print("Closing jetson-stats connection...")
        self._jetson.close()

    def collect(self):
        # spin=True means the call does not wait for the next data read to complete
        if self._jetson.ok(spin=True):
            #
            # Board info
            #
            i = InfoMetricFamily('gpu_info_board', 'Board sys info', labels=['board_info'])
            i.add_metric(['info'], {
                'machine': self._jetson.board['info']['machine'] if 'machine' in self._jetson.board.get('info', {}) else self._jetson.board['hardware']['Module'],
                'jetpack': self._jetson.board['info']['jetpack'] if 'jetpack' in self._jetson.board.get('info', {}) else self._jetson.board['hardware']['Jetpack'],
                'l4t':  self._jetson.board['info']['L4T'] if 'L4T' in self._jetson.board.get('info', {}) else self._jetson.board['hardware']['L4T']
                })
            yield i

            i = InfoMetricFamily('gpu_info_hardware', 'Board hardware info', labels=['board_hw'])
            i.add_metric(['hardware'], {
                'codename': self._jetson.board['hardware'].get('Codename', self._jetson.board['hardware'].get('CODENAME', 'unknown')),
                'soc': self._jetson.board['hardware'].get('SoC', self._jetson.board['hardware'].get('SOC', 'unknown')),
                'module': self._jetson.board['hardware'].get('P-Number', self._jetson.board['hardware'].get('MODULE', 'unknown')),
                'board': self._jetson.board['hardware'].get('699-level Part Number', self._jetson.board['hardware'].get('BOARD', 'unknown')),
                'cuda_arch_bin': self._jetson.board['hardware'].get('CUDA Arch BIN', self._jetson.board['hardware'].get('CUDA_ARCH_BIN', 'unknown')),
                'serial_number': self._jetson.board['hardware'].get('Serial Number', self._jetson.board['hardware'].get('SERIAL_NUMBER', 'unknown')),
                })
            yield i

            #
            # NV power mode
            #
            i = InfoMetricFamily('gpu_nvpmode', 'NV power mode', labels=['nvpmode'])
            i.add_metric(['mode'], {'mode': self._jetson.nvpmodel.name})
            yield i

            #
            # System uptime
            #
            g = GaugeMetricFamily('gpu_uptime', 'System uptime', labels=['uptime'])
            days = self._jetson.uptime.days
            seconds = self._jetson.uptime.seconds
            hours = seconds//3600
            minutes = (seconds//60) % 60
            g.add_metric(['days'], days)
            g.add_metric(['hours'], hours)
            g.add_metric(['minutes'], minutes)
            yield g

            #
            # CPU usage
            #
            g = GaugeMetricFamily('gpu_usage_cpu', 'CPU % schedutil', labels=['cpu'])
            g.add_metric(['cpu_1'], self._jetson.stats['CPU1'] if ('CPU1' in self._jetson.stats and isinstance(self._jetson.stats['CPU1'], int)) else 0)
            g.add_metric(['cpu_2'], self._jetson.stats['CPU2'] if ('CPU2' in self._jetson.stats and isinstance(self._jetson.stats['CPU2'], int)) else 0)
            g.add_metric(['cpu_3'], self._jetson.stats['CPU3'] if ('CPU3' in self._jetson.stats and isinstance(self._jetson.stats['CPU3'], int)) else 0)
            g.add_metric(['cpu_4'], self._jetson.stats['CPU4'] if ('CPU4' in self._jetson.stats and isinstance(self._jetson.stats['CPU4'], int)) else 0)
            g.add_metric(['cpu_5'], self._jetson.stats['CPU5'] if ('CPU5' in self._jetson.stats and isinstance(self._jetson.stats['CPU5'], int)) else 0)
            g.add_metric(['cpu_6'], self._jetson.stats['CPU6'] if ('CPU6' in self._jetson.stats and isinstance(self._jetson.stats['CPU6'], int)) else 0)
            g.add_metric(['cpu_7'], self._jetson.stats['CPU7'] if ('CPU7' in self._jetson.stats and isinstance(self._jetson.stats['CPU7'], int)) else 0)
            g.add_metric(['cpu_8'], self._jetson.stats['CPU8'] if ('CPU8' in self._jetson.stats and isinstance(self._jetson.stats['CPU8'], int)) else 0)
            yield g

            #
            # GPU usage
            #
            g = GaugeMetricFamily('gpu_usage_gpu', 'GPU % schedutil', labels=['gpu'])
            g.add_metric(['val'], self._jetson.stats['GPU'])
            yield g

            #
            # Fan usage
            #
            g = GaugeMetricFamily('gpu_usage_fan', 'Fan usage', labels=['fan'])
            g.add_metric(['speed'], self._jetson.fan.get('speed', self._jetson.fan.get('pwmfan', {'speed': [0] })['speed'][0]))
            yield g

            #
            # Sensor temperatures
            #
            g = GaugeMetricFamily('gpu_temperatures', 'Sensor temperatures', labels=['temperature'])
            keys = ['AO', 'GPU', 'Tdiode', 'AUX', 'CPU', 'thermal', 'Tboard']
            for key in keys:
                if key in self._jetson.temperature:
                    g.add_metric([key.lower()], self._jetson.temperature[key]['temp'] if isinstance(self._jetson.temperature[key], dict) else self._jetson.temperature.get(key, 0))
            yield g
            #
            # Power
            #
            g = GaugeMetricFamily('gpu_usage_power', 'Power usage', labels=['power'])
            if isinstance(self._jetson.power, dict):
                g.add_metric(['cv'], self._jetson.power['rail']['VDD_CPU_CV']['avg'] if 'VDD_CPU_CV' in self._jetson.power['rail'] else self._jetson.power['rail'].get('CV', { 'avg': 0 }).get('avg'))
                g.add_metric(['gpu'], self._jetson.power['rail']['VDD_GPU_SOC']['avg'] if 'VDD_GPU_SOC' in self._jetson.power['rail'] else self._jetson.power['rail'].get('GPU', { 'avg': 0 }).get('avg'))
                g.add_metric(['sys5v'], self._jetson.power['rail']['VIN_SYS_5V0']['avg'] if 'VIN_SYS_5V0' in self._jetson.power['rail'] else self._jetson.power['rail'].get('SYS5V', { 'avg': 0 }).get('avg'))
            if isinstance(self._jetson.power, tuple):
                g.add_metric(['cv'], self._jetson.power[1]['CV']['cur'] if 'CV' in self._jetson.power[1] else 0)
                g.add_metric(['gpu'], self._jetson.power[1]['GPU']['cur'] if 'GPU' in self._jetson.power[1] else 0)
                g.add_metric(['sys5v'], self._jetson.power[1]['SYS5V']['cur'] if 'SYS5V' in self._jetson.power[1] else 0)
            yield g

            #
            # Processes
            #
            try:
                processes = self._jetson.processes
                # key exists in dict
                i = InfoMetricFamily('gpu_processes', 'Process usage', labels=['process'])
                for index in range(len(processes)):
                    i.add_metric(['info'], {
                        'pid': str(processes[index][0]),
                        'user': processes[index][1],
                        'gpu': processes[index][2],
                        'type': processes[index][3],
                        'priority': str(processes[index][4]),
                        'state': processes[index][5],
                        'cpu': str(processes[index][6]),
                        'memory': str(processes[index][7]),
                        'gpu_memory': str(processes[index][8]),
                        'name': processes[index][9],
                    })
                yield i
            except AttributeError:
                # this jtop object has no `processes` attribute; skip the metric
                pass

if __name__ == '__main__':
    port = os.environ.get('PORT', 9998)
    REGISTRY.register(CustomCollector())
    app = make_wsgi_app()
    httpd = make_server('', int(port), app)
    print('Serving on port: ', port)
    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        print('Goodbye!')
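
The collector above repeatedly looks up the same field under two possible key names, for example 'Codename' and then 'CODENAME', presumably because the key names differ between jetson-stats releases (an assumption, not something the source states). A small helper can express that fallback once; the helper name and the sample dictionary below are hypothetical.

```python
from typing import Any, Mapping, Sequence


def first_present(data: Mapping[str, Any], keys: Sequence[str], default: Any = "unknown") -> Any:
    """Return the value of the first key found in data, or default if none is present."""
    for key in keys:
        if key in data:
            return data[key]
    return default


# Made-up hardware dictionary mixing both naming styles.
hardware = {"CODENAME": "example-codename", "SoC": "example-soc"}

print(first_present(hardware, ["Codename", "CODENAME"]))
print(first_present(hardware, ["SoC", "SOC"]))
print(first_present(hardware, ["Serial Number", "SERIAL_NUMBER"]))
```

This behaves the same as the nested dict.get calls in the collector while keeping each metric line shorter.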

Remember to label the Jetson board so that the GPU exporter runs only on Jetson nodes. On any other node it would fail with errors because there is no data to collect.

kubectl label node edge-wpx machine.type=jetson

Creating KubeSphere Resources

Create a ServiceAccount, DaemonSet, Service, and ServiceMonitor so that the data collected by jetson-exporter is made available to KubeSphere's Prometheus.

apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: jetson-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 1.0.0
  name: jetson-exporter
  namespace: kubesphere-monitoring-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: jetson-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 1.0.0
  name: jetson-exporter
  namespace: kubesphere-monitoring-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: jetson-exporter
      app.kubernetes.io/part-of: kube-prometheus
  template:
    metadata:
      labels:
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: jetson-exporter
        app.kubernetes.io/part-of: kube-prometheus
        app.kubernetes.io/version: 1.0.0
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-role.kubernetes.io/edge
                operator: Exists
      containers:
      - image: jetson-status-exporter:v1
        imagePullPolicy: IfNotPresent
        name: jetson-exporter
        resources:
          limits:
            cpu: "1"
            memory: 500Mi
          requests:
            cpu: 102m
            memory: 180Mi
        ports:
        - containerPort: 9998
          hostPort: 9998
          name: http
          protocol: TCP
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /run/jtop.sock
          name: jtop-sock
          readOnly: true
      dnsPolicy: ClusterFirstWithHostNet
      hostNetwork: true
      hostPID: true
      nodeSelector:
        kubernetes.io/os: linux
        machine.type: jetson
      restartPolicy: Always
      schedulerName: default-scheduler
      serviceAccount: jetson-exporter
      terminationGracePeriodSeconds: 30
      tolerations:
      - operator: Exists
      volumes:
      - hostPath:
          path: /run/jtop.sock
          type: Socket
        name: jtop-sock
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: jetson-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 1.0.0
  name: jetson-exporter
  namespace: kubesphere-monitoring-system
spec:
  clusterIP: None
  clusterIPs:
  - None
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: http
    port: 9998
    protocol: TCP
    targetPort: http
  selector:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: jetson-exporter
    app.kubernetes.io/part-of: kube-prometheus
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: jetson-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/vendor: kubesphere
    app.kubernetes.io/version: 1.0.0
  name: jetson-exporter
  namespace: kubesphere-monitoring-system
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 1m
    port: http
    relabelings:
    - action: replace
      regex: (.*)
      replacement: $1
      sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: instance
    - action: labeldrop
      regex: (service|endpoint|container)
    scheme: http
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: app.kubernetes.io/name
  selector:
    matchLabels:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: jetson-exporter
      app.kubernetes.io/part-of: kube-prometheus
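
The two relabelings in the ServiceMonitor can be read as a small transformation on each target's label set: the first copies the pod's node name into the instance label, and the second drops the service, endpoint, and container labels. The sketch below simulates just these two rules. It is a simplified model of Prometheus relabeling (regexes are fully anchored, as in Prometheus), and the input labels are made-up examples.

```python
import re
from typing import Dict


def apply_relabelings(labels: Dict[str, str]) -> Dict[str, str]:
    """Simulate the 'replace' and 'labeldrop' rules from the ServiceMonitor above."""
    out = dict(labels)

    # action: replace, regex (.*), replacement $1, target label "instance"
    source = out.get("__meta_kubernetes_pod_node_name", "")
    match = re.fullmatch(r"(.*)", source)
    if match:
        out["instance"] = match.group(1)

    # action: labeldrop, regex (service|endpoint|container) matched against label names
    drop = re.compile(r"(service|endpoint|container)")
    return {name: value for name, value in out.items() if not drop.fullmatch(name)}


# Made-up label set for a discovered jetson-exporter target.
target = {
    "__meta_kubernetes_pod_node_name": "edge-wpx",
    "instance": "192.168.1.10:9998",
    "job": "jetson-exporter",
    "service": "jetson-exporter",
    "endpoint": "http",
    "container": "jetson-exporter",
}

print(apply_relabelings(target))
```

The practical effect is that each series is identified by the node name (for example edge-wpx) instead of an IP and port, which makes it easier to match GPU metrics to a node in the console.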

After deployment, the jetson-exporter pod is Running.

Restart the Prometheus pod so that it reloads its configuration; the new GPU exporter target then appears in the Prometheus UI.

kubectl delete pod prometheus-k8s-0 -n kubesphere-monitoring-system

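Viewing GPU Monitoring Data in the KubeSphere Front End

The front end requires changes to the code of the KubeSphere console. That is front-end work and is not covered in detail here.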

Next, expose the Prometheus Service through a NodePort so that the front end can query GPU status over HTTP. The manifest is the same prometheus-k8s-nodeport Service (port 9090, nodePort 32143) shown earlier in the node-exporter section.

HTTP API

Query an instant value:
GET http://masterip:32143/api/v1/query?query=gpu_info_board_info&time=1711431293.686
GET http://masterip:32143/api/v1/query?query=gpu_info_hardware_info&time=1711431590.574
GET http://masterip:32143/api/v1/query?query=gpu_usage_gpu&time=1711431590.574
Here query is the name of the metric to query and time is the evaluation timestamp.

Query the values collected over a time range:
GET http://10.11.140.87:32143/api/v1/query_range?query=gpu_usage_gpu&start=1711428221.998&end=1711431821.998&step=14
Here query is the metric name, start and end are the start and end timestamps, and step is the interval between samples in seconds.
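
As a sketch of how a client might use these endpoints, the code below builds the two kinds of URL and pulls values out of an instant-query response. The helper names and the sample response body are assumptions for illustration; the base URL reuses the masterip placeholder and the NodePort from the manifest, and no request is actually sent.

```python
import json
from typing import List, Tuple, Union
from urllib.parse import urlencode

Number = Union[int, float, str]
BASE = "http://masterip:32143"  # placeholder host, NodePort from the Service manifest


def instant_query_url(base: str, query: str, time: Number) -> str:
    """URL for /api/v1/query, evaluating the expression at a single timestamp."""
    return f"{base}/api/v1/query?{urlencode({'query': query, 'time': time})}"


def range_query_url(base: str, query: str, start: Number, end: Number, step: Number) -> str:
    """URL for /api/v1/query_range, sampling the expression every `step` seconds."""
    params = {"query": query, "start": start, "end": end, "step": step}
    return f"{base}/api/v1/query_range?{urlencode(params)}"


def instant_values(body: str) -> List[Tuple[dict, float]]:
    """Extract (labels, value) pairs from an instant-query response of type vector."""
    doc = json.loads(body)
    if doc.get("status") != "success":
        raise RuntimeError(f"query failed: {doc.get('error')}")
    return [(r["metric"], float(r["value"][1])) for r in doc["data"]["result"]]


# Made-up response body in the shape the Prometheus HTTP API returns for a vector result.
sample_body = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"__name__": "gpu_usage_gpu", "gpu": "val", "instance": "edge-wpx"},
         "value": [1711431590.574, "37"]},
    ]},
})

print(instant_query_url(BASE, "gpu_usage_gpu", "1711431590.574"))
print(range_query_url(BASE, "gpu_usage_gpu", "1711428221.998", "1711431821.998", 14))
print(instant_values(sample_body))
```

Note that Prometheus returns sample values as strings, so the sketch converts them to float before use.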

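This completes monitoring the GPU status of a KubeEdge Jetson edge node from KubeSphere.

Summary

With KubeEdge, we made edge devices observable in the KubeSphere front end, including their GPU information.

For CPU and memory monitoring on edge nodes, we first changed the affinity so that KubeSphere's built-in node-exporter can collect monitoring data on edge nodes, then used KubeEdge's EdgeMesh to make the collected data available to KubeSphere's Prometheus. That gives CPU and memory monitoring.

For GPU monitoring on edge nodes, we installed jtop to obtain GPU utilization, temperature, and other data, then built a Jetson GPU Exporter that exposes the jtop data to KubeSphere's Prometheus. By modifying the code of ks-console, the KubeSphere front end, the console retrieves the Prometheus data over the HTTP API, which gives monitoring of GPU utilization and related information.

This article was published via OpenWrite, a multi-platform blog publishing tool.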
