XGBoost Operator Source Code Analysis

Original

2020-05-10 04:26

1 Overview

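Distributed XGBoost can run on Spark, but other distributed runtimes are supported as well. With XGBoost Operator, for example, running the XGBoost algorithm in a distributed fashion is straightforward.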

2 Code

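Developing a machine-learning Operator within the Kubeflow framework is now fairly easy. First scaffold the Operator with kubebuilder, then build on the common package abstracted by the Kubeflow community, so that only the business logic needs adjusting in the new Operator. XGBoost Operator was born against this background, which is why its source code is more concise and clearer than that of early Kubeflow projects such as tf-operator.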

The focus here is the Reconcile method of XGBoost Operator.

func (r *ReconcileXGBoostJob) Reconcile(request reconcile.Request) (reconcile.Result, error) {
	// Fetch the XGBoostJob instance
	xgboostjob := &v1alpha1.XGBoostJob{}
	err := r.Get(context.Background(), request.NamespacedName, xgboostjob)
	if err != nil {
		if errors.IsNotFound(err) {
			// Object not found, return.  Created objects are automatically garbage collected.
			// For additional cleanup logic use finalizers.
			return reconcile.Result{}, nil
		}
		// Error reading the object - requeue the request.
		return reconcile.Result{}, err
	}

	// Check reconcile is required.
	needSync := r.satisfiedExpectations(xgboostjob)

	if !needSync || xgboostjob.DeletionTimestamp != nil {
		log.Info("reconcile cancelled, job does not need to do reconcile or has been deleted",
			"sync", needSync, "deleted", xgboostjob.DeletionTimestamp != nil)
		return reconcile.Result{}, nil
	}
	// Set default priorities for xgboost job
	scheme.Scheme.Default(xgboostjob)

	// Use common to reconcile the job related pod and service
	err = r.xgbJobController.ReconcileJobs(xgboostjob, xgboostjob.Spec.XGBReplicaSpecs, xgboostjob.Status.JobStatus, &xgboostjob.Spec.RunPolicy)

	if err != nil {
		logrus.Warnf("Reconcile XGBoost Job error %v", err)
		return reconcile.Result{}, err
	}

	return reconcile.Result{}, err
}

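In practice, the Operator's Reconcile method only needs to reconcile the XGBoostJob custom resource itself. Behind this method sits the Kubeflow common package, which handles Pod and Service reconciliation in a uniform way, so developers only need to focus on reconciling the custom resource.

Is that all? Yes, it really is that simple.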
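
To make the division of labour concrete, here is a minimal, self-contained sketch of the same delegation pattern. It does not use the real Kubernetes or Kubeflow APIs: Job, SharedController, JobReconciler, Store and NeedSync are all hypothetical stand-ins, with an in-memory map playing the role of the API server and a plain counter playing the role of Pods.

```go
package main

import "fmt"

// Job is a hypothetical, stripped-down stand-in for the XGBoostJob resource.
type Job struct {
	Name     string
	Deleted  bool           // stands in for DeletionTimestamp != nil
	Replicas map[string]int // desired pod count per role, e.g. "Master", "Worker"
}

// SharedController plays the role of the common package (an assumption for
// illustration): it reconciles pods for any job type and contains no
// XGBoost-specific logic.
type SharedController struct {
	Pods map[string]int // "jobName/role" -> number of pods
}

// ReconcileJob brings the recorded pod count of every role to the desired count.
func (c *SharedController) ReconcileJob(job *Job) error {
	for role, want := range job.Replicas {
		if want < 0 {
			return fmt.Errorf("job %s: negative replica count for %s", job.Name, role)
		}
		c.Pods[job.Name+"/"+role] = want
	}
	return nil
}

// JobReconciler is the thin, operator-specific layer.
type JobReconciler struct {
	Store    map[string]*Job // stands in for the API server
	NeedSync func(*Job) bool // stands in for satisfiedExpectations
	Shared   *SharedController
}

// Reconcile fetches the job, decides whether any work is needed, and then
// hands everything pod-related to the shared controller.
func (r *JobReconciler) Reconcile(name string) error {
	job, ok := r.Store[name]
	if !ok {
		// Object not found: nothing to do.
		return nil
	}
	if !r.NeedSync(job) || job.Deleted {
		// No sync required, or the job is being deleted.
		return nil
	}
	return r.Shared.ReconcileJob(job)
}

func main() {
	shared := &SharedController{Pods: map[string]int{}}
	r := &JobReconciler{
		Store: map[string]*Job{
			"demo": {Name: "demo", Replicas: map[string]int{"Master": 1, "Worker": 2}},
		},
		NeedSync: func(*Job) bool { return true },
		Shared:   shared,
	}
	if err := r.Reconcile("demo"); err != nil {
		fmt.Println("reconcile failed:", err)
		return
	}
	fmt.Println("master pods:", shared.Pods["demo/Master"], "worker pods:", shared.Pods["demo/Worker"])
}
```

The job-specific Reconcile contains only the fetch, the early-exit checks and one delegating call, which is the same shape as the real method shown earlier.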

3 Test

Next, run a demo provided by XGBoost Operator.

Following the official documentation, build the image.

docker build -f Dockerfile -t kubeflow/xgboost-dist-rabit-test:1.2 ./

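The main code that runs inside the image is xgboost_smoke_test.py.

Log of a Master running normally.

Log of a Worker running normally.

This smoke test is just a simple example that builds a rabit topology and communicates over it. A successful run shows that XGBoost Operator was deployed successfully, because the workers can establish TCP connections with each other and with the master through pod IPs.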
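
The connectivity check that the smoke test relies on can be sketched in a few lines of Go. This is not the rabit protocol and not the code in xgboost_smoke_test.py; it is only an illustration, run on the loopback interface, of workers opening TCP connections to a master and exchanging one message. The function names runTracker, runWorker and smokeTest, the "rank N" message format and the 5-second timeout are all assumptions made for the example.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"sort"
	"strings"
	"sync"
	"time"
)

// runTracker accepts exactly n connections, reads one line from each,
// replies "ok", and returns the received lines in sorted order.
func runTracker(ln net.Listener, n int) ([]string, error) {
	var seen []string
	for i := 0; i < n; i++ {
		conn, err := ln.Accept()
		if err != nil {
			return nil, err
		}
		line, err := bufio.NewReader(conn).ReadString('\n')
		if err != nil {
			conn.Close()
			return nil, err
		}
		seen = append(seen, strings.TrimSpace(line))
		fmt.Fprintln(conn, "ok")
		conn.Close()
	}
	sort.Strings(seen)
	return seen, nil
}

// runWorker dials the tracker, announces its rank, and returns the reply.
func runWorker(addr string, rank int) (string, error) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	// Hypothetical timeout so a stuck peer cannot block the worker forever.
	conn.SetDeadline(time.Now().Add(5 * time.Second))
	if _, err := fmt.Fprintf(conn, "rank %d\n", rank); err != nil {
		return "", err
	}
	reply, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(reply), nil
}

// smokeTest starts a tracker on a free loopback port plus n workers and
// returns the messages the tracker received.
func smokeTest(n int) ([]string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	defer ln.Close()
	if tl, ok := ln.(*net.TCPListener); ok {
		tl.SetDeadline(time.Now().Add(5 * time.Second))
	}

	var wg sync.WaitGroup
	errs := make(chan error, n)
	for rank := 0; rank < n; rank++ {
		wg.Add(1)
		go func(rank int) {
			defer wg.Done()
			reply, err := runWorker(ln.Addr().String(), rank)
			if err == nil && reply != "ok" {
				err = fmt.Errorf("worker %d: unexpected reply %q", rank, reply)
			}
			if err != nil {
				errs <- err
			}
		}(rank)
	}

	seen, trackerErr := runTracker(ln, n)
	wg.Wait()
	close(errs)
	if trackerErr != nil {
		return nil, trackerErr
	}
	for workerErr := range errs {
		return nil, workerErr
	}
	return seen, nil
}

func main() {
	seen, err := smokeTest(3)
	if err != nil {
		fmt.Println("smoke test failed:", err)
		return
	}
	fmt.Println("tracker heard from:", seen)
}
```

In a real cluster the address would be a pod IP rather than 127.0.0.1, which is exactly what a passing smoke test confirms is reachable.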

4 Summary

Within the framework of the Kubeflow common package, developing a distributed machine-learning Operator is now fairly convenient.
