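KubeFlow 1.2.0 deployment (Ubuntu 20.04 + k8s 1.21.0)

Deploying Kubeflow (using kfctl_k8s_istio)

This guide covers installing Kubeflow into an existing Kubernetes cluster using the kfctl_k8s_istio configuration. The manifest deploys Kubeflow's core components but no external dependencies, and it can be tuned to your environment. Here Kubeflow 1.2.0 is deployed on Ubuntu 20.04 with k8s 1.21.0; other platforms may differ slightly.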

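Prerequisites

This Kubeflow deployment requires a StorageClass backed by a dynamic volume provisioner. Check the provisioner field of your default StorageClass. If there is no provisioner yet, make sure volume provisioning is configured in the Kubernetes cluster as described in "Provisioning of Persistent Volumes in Kubernetes" below.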
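
A minimal sketch of that check, assuming kubectl is already configured for the cluster. The helper name default_storage_class is hypothetical; it relies on `kubectl get storageclass` printing "(default)" after the name of the default class.

```shell
# Print the name and provisioner of the default StorageClass, if any.
# Hypothetical helper: it parses the table printed by kubectl, where the
# default class is shown as "<name> (default)".
default_storage_class() {
  kubectl get storageclass --no-headers 2>/dev/null |
    awk '$2 == "(default)" { print $1, $3; found = 1 } END { exit !found }'
}

# Usage:
#   default_storage_class || echo "no default StorageClass found"
```

If the function prints nothing and returns a non-zero status, no StorageClass is marked as default and the PVCs created by Kubeflow are likely to stay Pending.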

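Kubeflow depends on Istio being installed.

When installing with the kfctl_k8s_istio.v1.2.0.yaml configuration, consider the following option:

  • Disabling istio installation - if the Kubernetes cluster already has Istio installed, you can skip installing it by removing the istio-crds and istio-install applications from the configuration file kfctl_k8s_istio.v1.2.0.yaml. Note that the v1.2.0 configuration reproduced later in this post lists its Istio-related applications as istio-stack, cluster-local-gateway and istio, so check the application names in your copy of the file.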

Prepare the environment

Download kfctl, the Kubeflow CLI tool, then set the environment variables manually:

wget https://github.com/kubeflow/kfctl/releases/download/v1.2.0/kfctl_v1.2.0-0-gbc038f9_linux.tar.gz

tar -vxf kfctl_v1.2.0-0-gbc038f9_linux.tar.gz

sudo cp kfctl /usr/bin/
  • Create environment variables to simplify the deployment:
# The following command is optional. It adds the kfctl binary to your path.
# If you don't add kfctl to your path, you must use the full path
# each time you run kfctl.
# Use only alphanumeric characters or - in the directory name.
export PATH=$PATH:"<path-to-kfctl>"

## Actual value used:
##export PATH=$PATH:"/home/supermap/openthings/kubeflow"

# Set KF_NAME to the name of your Kubeflow deployment. You also use this
# value as directory name when creating your configuration directory.
# For example, your deployment name can be 'my-kubeflow' or 'kf-test'.
export KF_NAME=<your choice of name for the Kubeflow deployment>

## Actual value used:
##export KF_NAME="kubeflow"

# Set the path to the base directory where you want to store one or more 
# Kubeflow deployments. For example, /opt/.
# Then set the Kubeflow application directory for this deployment.
export BASE_DIR=<path to a base directory>
export KF_DIR=${BASE_DIR}/${KF_NAME}

## Actual values used:
##export BASE_DIR="/home/supermap/openthings/"
##export KF_DIR=${BASE_DIR}/${KF_NAME}

# Set the configuration file to use when deploying Kubeflow.
# The following configuration installs Istio by default. Comment out 
# the Istio components in the config file to skip Istio installation. 
# See https://github.com/kubeflow/kubeflow/pull/3663
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_k8s_istio.v1.2.0.yaml"

## Actual value used:
##export CONFIG_URI=${BASE_DIR}/${KF_NAME}/kfctl_k8s_istio.v1.2.0.yaml

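Notes:

  • ${KF_NAME} - the name of the Kubeflow deployment. Set this to customize the deployment name, for example my-kubeflow or kf-test. KF_NAME must consist of lowercase alphanumeric characters or '-', and must start and end with an alphanumeric character. It cannot exceed 25 characters, and it must contain only a name, not a directory path. The value is also used as the name of the directory that holds the Kubeflow configurations, that is, the Kubeflow application directory.

  • ${KF_DIR} - the full path to the Kubeflow application directory.

  • ${CONFIG_URI} - the GitHub address of the configuration file, https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_k8s_istio.v1.2.0.yaml. When you run kfctl apply or kfctl build (see the next step), kfctl creates a local copy of the YAML file that you can customize further.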
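
The KF_NAME rules above are easy to get wrong, so here is a small sketch that checks a candidate name before it is exported. The function valid_kf_name is a hypothetical helper written for this post, not part of kfctl; it only encodes the rules listed above.

```shell
# Return 0 when the argument satisfies the KF_NAME rules listed above.
valid_kf_name() {
  # Reject empty names and names longer than 25 characters.
  [ -n "$1" ] && [ "${#1}" -le 25 ] || return 1
  case $1 in
    # Only lowercase alphanumerics or '-' are allowed (this also rejects '/').
    *[!abcdefghijklmnopqrstuvwxyz0123456789-]*) return 1 ;;
    # The name must start and end with an alphanumeric character.
    -*|*-) return 1 ;;
  esac
  return 0
}

# Usage:
#   valid_kf_name "$KF_NAME" || echo "invalid KF_NAME: $KF_NAME" >&2
```

The character class is spelled out instead of using a-z so that the check does not depend on the locale of the shell.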

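⚠️ Note

  • If the manifests fail to download when running kfctl build or kfctl apply -V -f xxx, download the archive manually and edit the following part of kfctl_k8s_istio.v1.2.0.yaml so that the manifests repo points to the local path: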
  repos:
  - name: manifests
    uri: /home/supermap/openthings/kubeflow/v1.2.0.tar.gz
  version: v1.2-branch
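
The same edit can be scripted. The sketch below assumes the config file contains a single uri: line (the one under repos, as in the fragment above); use_local_manifests is a hypothetical helper, and the download URL in the usage comment is an assumption that should be checked against the uri in your copy of the file.

```shell
# Rewrite the manifests repo URI in a KfDef file to a local tarball path.
# Arguments: <kfdef.yaml> <absolute path to the downloaded tarball>
# '|' is used as the sed delimiter because the path contains '/'.
use_local_manifests() {
  sed "s|^\([[:space:]]*uri:\).*|\1 $2|" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Usage (archive URL assumed, not taken from this post):
#   wget https://github.com/kubeflow/manifests/archive/v1.2.0.tar.gz
#   use_local_manifests kfctl_k8s_istio.v1.2.0.yaml "$PWD/v1.2.0.tar.gz"
```

The indentation of the uri: line is preserved, so the YAML structure stays intact.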

Set up and deploy Kubeflow

To set up and deploy Kubeflow using the default settings, run kfctl apply:

mkdir -p ${KF_DIR}
cd ${KF_DIR}
kfctl apply -V -f ${CONFIG_URI}

Check the resources deployed in the kubeflow namespace:

kubectl -n kubeflow get all

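Optional: set configuration parameters first and deploy afterwards

If you need to customize the installation when deploying Kubeflow, edit the configuration file first and then run the Kubeflow deployment command:

  1. Run kfctl build to generate the configuration files: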

    mkdir -p ${KF_DIR}
    cd ${KF_DIR}
    kfctl build -V -f ${CONFIG_URI}
    
  2. Edit the configuration file as described in the guide to customizing your Kubeflow deployment.

  3. Set an environment variable pointing to the local configuration file:

    export CONFIG_FILE=${KF_DIR}/kfctl_k8s_istio.v1.2.0.yaml
    
  4. Run kfctl apply to deploy Kubeflow:

    kfctl apply -V -f ${CONFIG_FILE}
    

Errors encountered

2021/04/28 10:24:44 absolute path error in '/home/supermap/openthings/kubeflow/.cache/manifests/namespaces/base' : evalsymlink failure on '/home/supermap/openthings/kubeflow/.cache/manifests/namespaces/base' : lstat /home/supermap/openthings/kubeflow/.cache/manifests/namespaces: no such file or directory
ERRO[0000] Error evaluating kustomization manifest for namespaces: accumulating resources: accumulating resources from '../../.cache/manifests/namespaces/base': open /home/supermap/openthings/kubeflow/.cache/manifests/namespaces/base: no such file or directory  filename="kustomize/kustomize.go:155"
Error: failed to apply:  (kubeflow.error): Code 500 with message: kfApp Apply failed for kustomize:  (kubeflow.error): Code 500 with message: error evaluating kustomization manifest for namespaces: accumulating resources: accumulating resources from '../../.cache/manifests/namespaces/base': open /home/supermap/openthings/kubeflow/.cache/manifests/namespaces/base: no such file or directory
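  • Inspecting the .cache directory shows that the manifests are located under ~/openthings/kubeflow/.cache/manifests/manifests-1.2.0, not directly under the manifests directory that the paths in the error expect.
  • One workaround is to move everything under manifests-1.2.0 up one level into the manifests directory and run kfctl apply again:
# Run from ~/openthings/kubeflow/.cache/manifests
cd manifests-1.2.0
mv * ../
  • However, kfctl apply deletes the .cache directory when it runs, so the move above does not stick.

  • Instead, edit the configuration file directly and change the repoRef path of every application. The modified configuration file looks like this: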

apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  creationTimestamp: null
  namespace: kubeflow
spec:
  applications:
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: manifests-1.2.0/namespaces/base
    name: namespaces
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: manifests-1.2.0/application/v3
    name: application
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: manifests-1.2.0/stacks/kubernetes/application/istio-1-3-1-stack
    name: istio-stack
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: manifests-1.2.0/stacks/kubernetes/application/cluster-local-gateway-1-3-1
    name: cluster-local-gateway
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: manifests-1.2.0/istio/istio/base
    name: istio
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: manifests-1.2.0/stacks/kubernetes/application/cert-manager-crds
    name: cert-manager-crds
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: manifests-1.2.0/stacks/kubernetes/application/cert-manager-kube-system-resources
    name: cert-manager-kube-system-resources
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: manifests-1.2.0/stacks/kubernetes/application/add-anonymous-user-filter
    name: add-anonymous-user-filter
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: manifests-1.2.0/metacontroller/base
    name: metacontroller
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: manifests-1.2.0/admission-webhook/bootstrap/overlays/application
    name: bootstrap
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: manifests-1.2.0/stacks/kubernetes/application/spark-operator
    name: spark-operator
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: manifests-1.2.0/stacks/kubernetes
    name: kubeflow-apps
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: manifests-1.2.0/knative/installs/generic
    name: knative
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: manifests-1.2.0/kfserving/installs/generic
    name: kfserving
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: manifests-1.2.0/stacks/kubernetes/application/spartakus
    name: spartakus
  repos:
  - name: manifests
    uri: /home/supermap/openthings/kubeflow/v1.2.0.tar.gz
  version: v1.2-branch
status: {}
  • Delete the kustomize directory, then run kfctl build and kfctl apply again.
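
Editing fifteen path entries by hand is error-prone, so the prefixing can be sketched as a small script. prefix_manifest_paths is a hypothetical helper; it assumes every repoRef path sits on its own "path: ..." line, as in the file above, and that the archive's top-level directory is manifests-1.2.0.

```shell
# Prefix every "path:" entry in a KfDef file with the archive's top-level
# directory. Entries that already start with the prefix are left alone,
# so running the function twice does no harm.
# Arguments: <kfdef.yaml> <prefix directory>
prefix_manifest_paths() {
  awk -v prefix="$2" '
    $1 == "path:" && index($2, prefix "/") != 1 {
      sub(/path: /, "path: " prefix "/")
    }
    { print }
  ' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Usage:
#   prefix_manifest_paths "${KF_DIR}/kfctl_k8s_istio.v1.2.0.yaml" manifests-1.2.0
```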

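Kubeflow pulls many images, so a full startup takes quite a while; be patient.

After some time, some of the pods are up and the main dashboard is reachable.

Checking the status shows that some images and services still have problems, including image pulls and storage volume setup; these are left to be resolved later.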

(base) supermap@xriver02:~$ kubectl get pod -n kubeflow
NAME                                                     READY   STATUS             RESTARTS   AGE
admission-webhook-bootstrap-stateful-set-0               0/1     ImagePullBackOff   0          36h
admission-webhook-deployment-5cd7dc96f5-l9rxl            1/1     Running            0          36h
application-controller-stateful-set-0                    0/1     ImagePullBackOff   0          36h
argo-ui-657cf69ff5-kn966                                 1/1     Running            0          36h
cache-deployer-deployment-5f4979f45-q6psq                1/2     ImagePullBackOff   0          36h
cache-server-7859fd67f5-kx8zm                            0/2     Init:0/1           0          36h
centraldashboard-86744cbb7b-44rbc                        1/1     Running            0          36h
jupyter-web-app-deployment-8486d5ffff-9czzl              1/1     Running            0          36h
katib-controller-7fcc95676b-tsbzx                        1/1     Running            1          36h
katib-db-manager-67867f5498-jzrgh                        0/1     Running            442        36h
katib-mysql-6b5d848bf5-gs95h                             0/1     Pending            0          36h
katib-ui-65dc4cf6f5-pqj5p                                1/1     Running            0          36h
kfserving-controller-manager-0                           1/2     ImagePullBackOff   0          36h
kubeflow-pipelines-profile-controller-797fb44db9-vznlv   1/1     Running            0          36h
metacontroller-0                                         1/1     Running            0          36h
metadata-db-c65f4bc75-m2ggv                              0/1     Pending            0          36h
metadata-envoy-deployment-67bd5954c-jl7pn                1/1     Running            0          36h
metadata-grpc-deployment-577c67c96f-29dwx                0/1     CrashLoopBackOff   433        36h
metadata-writer-756dbdd478-tlrpw                         2/2     Running            325        36h
minio-54d995c97b-jrmqq                                   0/1     Pending            0          36h
ml-pipeline-8d6749d9c-drv2h                              1/2     CrashLoopBackOff   662        36h
ml-pipeline-persistenceagent-d984c9585-mhstn             2/2     Running            0          36h
ml-pipeline-scheduledworkflow-5ccf4c9fcc-wqg4d           2/2     Running            0          36h
ml-pipeline-ui-8ccbf585c-77krb                           2/2     Running            0          36h
ml-pipeline-viewer-crd-56c68f6c85-bssgc                  1/2     ImagePullBackOff   0          36h
ml-pipeline-visualizationserver-7446b96877-ffs7b         2/2     Running            0          36h
mpi-operator-d5bfb8489-75m6b                             1/1     Running            0          36h
mxnet-operator-7576d697d6-jwks8                          1/1     Running            0          36h
mysql-74f8f99bc8-ndzqg                                   0/2     Pending            0          36h
notebook-controller-deployment-dd4c74b47-k9fng           0/1     ImagePullBackOff   0          36h
profiles-deployment-65f54cb5c4-9xtws                     0/2     ImagePullBackOff   0          36h
pytorch-operator-847c8d55d8-x6l4t                        0/1     ImagePullBackOff   0          36h
seldon-controller-manager-6bf8b45656-d7rvf               1/1     Running            0          36h
spark-operatorsparkoperator-fdfbfd99-cst9l               0/1     ImagePullBackOff   0          36h
spartakus-volunteer-558f8bfd47-tcvpn                     1/1     Running            0          36h
tf-job-operator-58477797f8-wr79t                         1/1     Running            0          36h
workflow-controller-64fd7cffc5-m6gkc                     1/1     Running            0          36h
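
With this many pods, a summary is easier to read than the full listing. The two helpers below are hypothetical names for a rough sketch; they assume the default column layout of `kubectl get pod`, where STATUS is the third column.

```shell
# Count the pods in a namespace by STATUS, most frequent status first.
# The namespace defaults to kubeflow.
pod_status_summary() {
  kubectl -n "${1:-kubeflow}" get pod --no-headers |
    awk '{ count[$3]++ } END { for (s in count) print count[s], s }' |
    sort -rn
}

# List only the pods whose STATUS is not Running, with that status.
pods_not_running() {
  kubectl -n "${1:-kubeflow}" get pod --no-headers |
    awk '$3 != "Running" { print $1, $3 }'
}

# Usage:
#   pod_status_summary
#   pods_not_running kubeflow
```

A pod can be Running without being ready (for example 0/1 in the READY column), so the READY and RESTARTS columns of the full listing are still worth a look.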

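Accessing the Kubeflow user interface (UI)

After Kubeflow is deployed, the Kubeflow Dashboard is reached through the istio-ingressgateway service. If a LoadBalancer is not available in your environment, NodePort or port forwarding can be used to access the Kubeflow Dashboard; see the Ingress Gateway guide.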
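
As a rough sketch of the NodePort route, the helper below builds the dashboard URL from the gateway pod's host IP and the NodePort of the service port named http2. The function name dashboard_nodeport_url is hypothetical, and the sketch assumes the gateway runs in the istio-system namespace with the label istio=ingressgateway.

```shell
# Print the NodePort URL of the Istio ingress gateway's http2 port.
dashboard_nodeport_url() {
  # IP of the node that hosts the ingress gateway pod.
  host=$(kubectl -n istio-system get pod -l istio=ingressgateway \
    -o jsonpath='{.items[0].status.hostIP}') || return 1
  # NodePort assigned to the service port named "http2".
  port=$(kubectl -n istio-system get service istio-ingressgateway \
    -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}') || return 1
  echo "http://${host}:${port}"
}

# Usage:
#   dashboard_nodeport_url
#
# Alternative: forward local port 8080 to the gateway, then open
# http://localhost:8080 in a browser:
#   kubectl -n istio-system port-forward svc/istio-ingressgateway 8080:80
```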

Deleting Kubeflow

Run the following commands to delete the deployment and reclaim its resources:

cd ${KF_DIR}
# If you want to delete all the resources, run:
kfctl delete -f ${CONFIG_FILE}

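Understanding the deployment process

The kfctl deployment process consists of the following commands:

  • kfctl build - (optional) creates the configuration files. You only need to run kfctl build before kfctl apply if you want to edit the configuration parameters yourself.
  • kfctl apply - creates or updates the resources.
  • kfctl delete - deletes the resources.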

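Application layout

Your Kubeflow application directory ${KF_DIR} contains the following files and directories:

  • ${CONFIG_FILE} is a YAML file that defines the parameters of the Kubeflow deployment.

  • kustomize is a directory that contains the kustomize packages for the Kubeflow applications. See: how Kubeflow uses kustomize.

    • The directory is created when you run kfctl build or kfctl apply.
    • You can customize the Kubernetes resources by editing the manifests in this directory and then running kfctl apply again to deploy the update.

We recommend keeping the contents of ${KF_DIR} under version control.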

Provisioning of Persistent Volumes in Kubernetes

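If you already have a dynamic volume provisioner, you can skip this step.

If not, choose one of the following:

  • Manually create the PVs after deploying Kubeflow.
  • Install a dynamic volume provisioner such as Local Path Provisioner. Make sure the StorageClass used by the provisioner is the default StorageClass.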
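
For the second option, the following sketch marks a StorageClass as the cluster default by setting the storageclass.kubernetes.io/is-default-class annotation. The helper name set_default_storage_class is hypothetical, and both the class name local-path and the manifest URL in the usage comment are assumptions about the Local Path Provisioner; check its README before applying anything.

```shell
# Mark a StorageClass as the cluster default via the standard
# "is-default-class" annotation. The class name defaults to "local-path",
# which is assumed to be the class created by Local Path Provisioner.
set_default_storage_class() {
  kubectl patch storageclass "${1:-local-path}" \
    -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
}

# Usage (manifest URL assumed, not taken from this post):
#   kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml
#   set_default_storage_class local-path
```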

Troubleshooting

Persistent Volume Claims are in Pending state

Check whether the PersistentVolumeClaims are Bound to PersistentVolumes:

kubectl -n kubeflow get pvc

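If the PersistentVolumeClaims (PVCs) remain in Pending state after deployment and are not bound to PersistentVolumes (PVs), you either have to create a PV manually for each PVC, or install a dynamic volume provisioner that creates PVs on demand, then delete the existing PVCs and redeploy Kubeflow.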
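
To see quickly which claims are affected, the output of the command above can be filtered. This is a small sketch with a hypothetical helper name, pending_pvcs; it assumes the default column layout of `kubectl get pvc`, where STATUS is the second column.

```shell
# Print the names of the PVCs in a namespace whose STATUS is Pending.
# The namespace defaults to kubeflow.
pending_pvcs() {
  kubectl -n "${1:-kubeflow}" get pvc --no-headers |
    awk '$2 == "Pending" { print $1 }'
}

# Usage:
#   pending_pvcs
#   kubectl -n kubeflow describe pvc <name>   # shows why a claim is not bound
```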
