【k8s】Analyzing etcd cluster "took too long to execute" slow-log alerts

Background

  • The machine learning platform's backend uses Kubernetes to schedule GPU/CPU resources and orchestrate containers. As is well known, Kubernetes persists its metadata in etcd as the core backend store. The platform deploys etcd for HA using the [External etcd topology](http://way.xiaojukeji.com/article/External etcd topology).
  • The stability of the etcd cluster directly determines the stability of the Kubernetes cluster and of the machine learning platform. The odin platform ingests the etcd cluster's slow-log alerts (etcd requests taking > 100ms) to monitor etcd stability in real time.

Incident log

  • 2020-01-06: The ops team reported a large volume of etcd slow-log alerts starting in mid-December 2019, with the alert count trending upward.
  • 2020-01-20: The ops team followed up: the slow-log alert count was still climbing and showed no sign of leveling off.

Analysis

  • 2020-01-06: When the ops team first reported the alerts, we suspected disk latency on the etcd cluster. Dumping backend_commit_duration_seconds and wal_fsync_duration_seconds from the etcd metrics endpoint showed the tail latencies landing in the 128ms bucket. The official etcd FAQ entry

    what-does-the-etcd-warning-apply-entries-took-too-long-mean

    recommends investigating disk performance and keeping the "p99 duration ... less than 25ms". After discussing with the ops team, we first adjusted the slow-log alert threshold to 100 and kept tracking the problem. The metrics dump follows, with a disk-benchmark sketch after it.

    root@xxx-ser00:~$ curl -L -s http://localhost:2379/metrics | grep backend_commit_duration_seconds
    # HELP etcd_disk_backend_commit_duration_seconds The latency distributions of commit called by backend.
    # TYPE etcd_disk_backend_commit_duration_seconds histogram
    etcd_disk_backend_commit_duration_seconds_bucket{le="0.001"} 28164
    etcd_disk_backend_commit_duration_seconds_bucket{le="0.002"} 357239
    etcd_disk_backend_commit_duration_seconds_bucket{le="0.004"} 1.9119004e+07
    etcd_disk_backend_commit_duration_seconds_bucket{le="0.008"} 2.07783083e+08
    etcd_disk_backend_commit_duration_seconds_bucket{le="0.016"} 3.02929134e+08
    etcd_disk_backend_commit_duration_seconds_bucket{le="0.032"} 3.0395167e+08
    etcd_disk_backend_commit_duration_seconds_bucket{le="0.064"} 3.04047027e+08
    etcd_disk_backend_commit_duration_seconds_bucket{le="0.128"} 3.04051485e+08 # 大部分disk_backend_commit_duration <= 128ms
    etcd_disk_backend_commit_duration_seconds_bucket{le="0.256"} 3.04051486e+08
    etcd_disk_backend_commit_duration_seconds_bucket{le="0.512"} 3.04051486e+08
    etcd_disk_backend_commit_duration_seconds_bucket{le="1.024"} 3.04051486e+08
    etcd_disk_backend_commit_duration_seconds_bucket{le="2.048"} 3.04051486e+08
    etcd_disk_backend_commit_duration_seconds_bucket{le="4.096"} 3.04051486e+08
    etcd_disk_backend_commit_duration_seconds_bucket{le="8.192"} 3.04051486e+08
    etcd_disk_backend_commit_duration_seconds_bucket{le="+Inf"} 3.04051486e+08
    etcd_disk_backend_commit_duration_seconds_sum 2.193424564702583e+06
    etcd_disk_backend_commit_duration_seconds_count 3.04051486e+08
    root@xxx-ser00:~$ curl -L -s http://localhost:2379/metrics | grep wal_fsync_duration_seconds
    # HELP etcd_disk_wal_fsync_duration_seconds The latency distributions of fsync called by wal.
    # TYPE etcd_disk_wal_fsync_duration_seconds histogram
    etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 9.661794e+06
    etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 1.134905564e+09
    etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 1.638654117e+09
    etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 1.64588859e+09
    etcd_disk_wal_fsync_duration_seconds_bucket{le="0.016"} 1.646517588e+09
    etcd_disk_wal_fsync_duration_seconds_bucket{le="0.032"} 1.646521494e+09
    etcd_disk_wal_fsync_duration_seconds_bucket{le="0.064"} 1.646521531e+09
    etcd_disk_wal_fsync_duration_seconds_bucket{le="0.128"} 1.646521634e+09 # 大部分disk_wal_fsync_duration <= 128ms
    etcd_disk_wal_fsync_duration_seconds_bucket{le="0.256"} 1.646521635e+09
    etcd_disk_wal_fsync_duration_seconds_bucket{le="0.512"} 1.646521635e+09
    etcd_disk_wal_fsync_duration_seconds_bucket{le="1.024"} 1.646521635e+09
    etcd_disk_wal_fsync_duration_seconds_bucket{le="2.048"} 1.646521635e+09
    etcd_disk_wal_fsync_duration_seconds_bucket{le="4.096"} 1.646521635e+09
    etcd_disk_wal_fsync_duration_seconds_bucket{le="8.192"} 1.646521635e+09
    etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 1.646521635e+09
    etcd_disk_wal_fsync_duration_seconds_sum 3.2505764775865404e+06
    etcd_disk_wal_fsync_duration_seconds_count 1.646521635e+09
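
    To double-check raw fsync latency outside of etcd, the disk can be benchmarked directly with fio, following the approach the etcd docs describe. A minimal sketch, assuming fio is installed; the target directory is a placeholder and should live on the disk backing the etcd data dir:

    # Simulate etcd's WAL write pattern: sequential writes with one fdatasync
    # per write. In fio's output, the fsync/fdatasync p99 should stay well
    # below the 25ms bound cited above for the disk to be considered healthy.
    mkdir -p /var/lib/etcd/fio-test
    fio --rw=write --ioengine=sync --fdatasync=1 \
        --directory=/var/lib/etcd/fio-test --size=22m --bs=2300 --name=etcd-disk-check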
    
  • 2020-01-20: The ops team reported that even with the alert threshold raised to 100, the slow-log alert count kept rising. Per-instance monitoring showed the alerts concentrated on a single etcd server; whenever that server was restarted, the alerts migrated to another instance, and the count kept climbing from where the previous instance left off.

  • Logging into one of the alerting etcd servers and counting the etcd 'took too long' (>100ms) entries in /var/log/messages showed that 68% were read-only range requests, and 87% of those were queries against jobs (a sketch that generalizes this breakdown follows the commands below).

    root@xxx-ser00:/var/log$ grep 'took too long' /var/log/messages | grep 'Jan 19' | wc -l
    176013
    root@xxx-ser00:/var/log$ grep 'took too long' /var/log/messages | grep 'Jan 19' | grep 'read-only range request' | wc -l
    120199
    root@xxx-ser00:/var/log$ echo 'ibase=10; obase=10; scale=2; 120199/176013' | bc
    .68
    root@xxx-ser00:/var/log$ grep 'took too long' /var/log/messages | grep 'Jan 19' | grep 'read-only range request' | grep '\/registry\/jobs' | wc -l
    104698
    root@xxx-ser00:/var/log$ echo 'ibase=10; obase=10; scale=2; 104698/120199' | bc
    .87
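
    The per-resource breakdown generalizes: grouping the slow-log keys by their /registry prefix ranks the noisiest resources without grepping one at a time. A sketch, assuming the log format shown above; each line can match twice (once for the key, once for range_end), so the counts are indicative rather than exact:

    # Extract every /registry/<resource> prefix and rank by frequency;
    # jobs should dominate the output if the percentages above hold.
    grep 'took too long' /var/log/messages | grep 'Jan 19' \
      | grep -o '/registry/[a-z.]*' | sort | uniq -c | sort -rn | head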
    
  • We picked one slow-log entry at random and replayed it directly with etcdctl; it returned 500 job records. This pointed at some service issuing large "list jobs" requests through kube-apiserver (a timing sketch follows the replay below).

    Jan 20 19:40:09 xxx-ser00 etcd: read-only range request "key:\"/registry/jobs/65-ns/k8s-job-tdc6eu-1576247407339\\000\" range_end:\"/registry/jobs0\" limit:500 revision:1712050719 " took too long (117.960324ms) to execute
    root@xxx-ser00:/home/xiaoju/etcd/bin$ cd /home/xiaoju/etcd/bin/
    root@xxx-ser00:/home/xiaoju/etcd/bin$ export ETCDCTL_API=3
    root@xxx-ser00:/home/xiaoju/etcd/bin$ ./etcdctl get /registry/jobs/65-ns/k8s-job-tdc6eu-1576247407339\\000 --limit=500 --keys-only /registry/jobs0 | sed '/^$/d' | wc -l
    500
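
    Wrapping the replay in time shows how slow the query is from the client side. A sketch; the --rev value is the revision from the slow-log entry and assumes that revision has not been compacted yet (drop --rev otherwise):

    # Replay the same range read pinned to the logged revision and measure
    # wall-clock latency; > /dev/null discards the 500 keys themselves.
    time ./etcdctl get /registry/jobs/65-ns/k8s-job-tdc6eu-1576247407339\\000 \
        --limit=500 --keys-only --rev=1712050719 /registry/jobs0 > /dev/null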
    
  • We captured packets on one apiserver node and on the xxx-ser00 etcd server:

    root@kube-master-ser01:~$ tcpdump -v -i any -s 0 -w apiserver.pcap port 8080
    root@xxx-ser00:~$ tcpdump -v -i eth0 -s 0 -w etcd.pcap port 2379
    
  • Searching etcd.pcap for the job name k8s-job-tdc6eu-1576247407339 from the slow log showed the source IP xx.xx.128.58 (kube-master-ser01) and source port 20994, i.e. a request initiated by kube-apiserver (a scripted pcap lookup sketch follows the netstat output below).

    root@kube-master-ser01:~$ netstat -antp | grep 20994
    tcp        0      0 xx.xx.128.58:20994      xx.xx.76.33:2379        ESTABLISHED 25969/kube-apiserve
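
    The pcap lookup can also be scripted instead of being done in the Wireshark UI. A sketch, assuming tshark is available and the etcd traffic is plaintext (as it was here):

    # Print the source endpoints of every packet carrying the job name; the
    # result should match kube-apiserver's connection (xx.xx.128.58:20994).
    tshark -r etcd.pcap -Y 'frame contains "k8s-job-tdc6eu-1576247407339"' \
        -T fields -e ip.src -e tcp.srcport | sort -u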
    
  • Searching apiserver.pcap for the same job name located the matching HTTP requests. Following the Wireshark HTTP stream showed that the cronjob-controller inside kube-controller-manager was issuing the "get jobs" calls. Decoding the continue token in the request URL yielded parameters matching the etcd slow-log entry, so we exported all URLs from apiserver.pcap to a log file and confirmed there as well that the requests came from the cronjob-controller's job listing (the "invalid input" from base64 below is just missing padding; a padding sketch follows).

    root@kube-master-ser01:~$ echo eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTcxMjA1MDcxOSwic3RhcnQiOiI2NS1ucy9rOHMtam9iLXN2Z2Q5NC0xNTczNzQyNDAyNDMyXHUwMDAwIn0 | base64 -d
    {"v":"meta.k8s.io/v1","rv":1712050719,"start":"65-ns/k8s-job-svgd94-1573742402432\u0000"}base64: invalid input
    Jan 20 19:40:09 xxx-ser00 etcd: read-only range request "key:\"/registry/jobs/65-ns/k8s-job-tdc6eu-1576247407339\\000\" range_end:\"/registry/jobs0\" limit:500 revision:1712050719 " took too long (117.960324ms) to execute
    root@kube-master-ser01:~$ grep limit=500 apiserver.log | cut -d '=' -f2 | cut -d '&' -f1 | while read line; do echo $line | base64 -d | grep k8s-job-tdc6eu-1576247407339 && echo $line; done
    {"v":"meta.k8s.io/v1","rv":1712050719,"start":"65-ns/k8s-job-tdc6eu-1576247407339\u0000
    eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTcxMjA1MDcxOSwic3RhcnQiOiI2NS1ucy9rOHMtam9iLXRkYzZldS0xNTc2MjQ3NDA3MzM5XHUwMDAwIn0
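
    The "invalid input" message above is only missing padding: apiserver continue tokens are unpadded base64. A sketch that pads before decoding:

    # Append '=' until the token length is a multiple of 4, then decode.
    t=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTcxMjA1MDcxOSwic3RhcnQiOiI2NS1ucy9rOHMtam9iLXRkYzZldS0xNTc2MjQ3NDA3MzM5XHUwMDAwIn0
    while [ $(( ${#t} % 4 )) -ne 0 ]; do t="${t}="; done
    echo "$t" | base64 -d; echo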
    
  • Reading the cronjob_controller code showed that it does not use a JobInformer; instead it lists the jobs of every namespace through the clientset. On the machine learning platform's k8s cluster, a single paginated list over the jobs of all namespaces is enough to produce slow-log entries (a chunked-list sketch follows the timing below):

      root@kube-master-ser01:~$ time kubectl get jobs -A | wc -l
      xxxx
      real    2m1.287s
      user    0m32.263s
      sys 0m8.845s
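
    The same paginated list behavior can be reproduced from the client side, since kubectl also pages with limit=500 by default. A sketch, assuming a kubectl version with --chunk-size and -A support:

    # -v=8 logs every GET including the limit and continue parameters, which
    # should line up with the base64 tokens decoded from apiserver.pcap above.
    kubectl get jobs -A --chunk-size=500 -v=8 2>&1 | grep 'GET '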
    

Solution

  • Platform jobs should set ttlSecondsAfterFinished so that completed jobs expire automatically, preventing queries over a large backlog of finished jobs from putting read pressure on etcd (a patch sketch follows this list).
  • The platform does not use CronJob, so the cronjob controller can be disabled.
  • The k8s community already tracks this in the issue "CronJob controller should use shared informers"; we submitted our findings to the contributors to help move the fix forward.
  • The longer-term fix is covered by the "KEP for Graduating CronJob to GA".
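
    A sketch of the first two items; the job name, namespace, and TTL value are placeholders, it assumes the TTLAfterFinished feature gate is enabled (the feature was still alpha/beta at the time), and disabling the controller uses the standard kube-controller-manager --controllers flag:

    # Expire an existing finished job one hour after completion;
    # spec.ttlSecondsAfterFinished is mutable on finished jobs.
    kubectl patch job <job-name> -n <namespace> --type=merge \
        -p '{"spec":{"ttlSecondsAfterFinished":3600}}'

    # Disable only the cronjob controller (other flags unchanged):
    #   kube-controller-manager --controllers='*,-cronjob' ...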

Implementation

  • For historical jobs in production, patch ttlSecondsAfterFinished onto the existing job objects to set their expiry.
  • For newly created jobs in production, add ttlSecondsAfterFinished to the job spec at creation time.
  • The etcd slow-query alert curve after the change confirmed the fix.

Author: Tong Chao【Senior Software Development Engineer, Didi Chuxing】
