Connect a Jetson Nano to Kubernetes in 15 Minutes and Easily Build a Machine Learning Cluster

In this article I will show how to connect a Jetson Nano developer board to a Kubernetes cluster as a GPU node. I will cover the NVIDIA docker setup required to run containers with GPU access, as well as joining the Jetson to a Kubernetes cluster. After the node has joined the cluster successfully, I will also show how to run a simple TensorFlow 2 training session using the GPU on the Jetson Nano.

K3s or K8s?

K3s is a lightweight Kubernetes distribution no larger than 100MB. In my opinion it is the ideal choice for single-board computers, because it requires significantly fewer resources. You can check out our earlier articles for more tutorials on K3s and its ecosystem. One open-source tool in the K3s ecosystem deserves a special mention: K3sup, developed by Alex Ellis to simplify K3s cluster installation. You can learn more about the tool on GitHub:
https://github.com/alexellis/k3sup

What do we need?

  • A K3s cluster (a single, properly configured master node is enough)
  • An NVIDIA Jetson Nano developer board with the developer kit installed

If you want to learn how to install the developer kit on the board, see the following documentation:
https://developer.nvidia.com/embedded/learn/get-started-jetson-nano-devkit#write

  • K3sup
  • 15 minutes of your time

The plan

  1. Set up NVIDIA docker
  2. Join the Jetson Nano to the K3s cluster
  3. Run a simple MNIST example to demonstrate GPU usage inside a Kubernetes pod

Setting up NVIDIA docker

Before we configure Docker to use nvidia-docker as the default runtime, I need to explain why we are doing this. By default, when a user runs a container on the Jetson Nano, it runs the same way as on any other hardware: you cannot access the GPU from inside the container, at least not without some hacks. If you want to test this yourself, you can run the following command and should see similar results:

root@jetson:~# echo "python3 -c 'import tensorflow'" | docker run -i icetekio/jetson-nano-tensorflow /bin/bash
2020-05-14 00:10:23.370761: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.2'; dlerror: libcudart.so.10.2: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.2/targets/aarch64-linux/lib:
2020-05-14 00:10:23.370859: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2020-05-14 00:10:25.946896: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.2/targets/aarch64-linux/lib:
2020-05-14 00:10:25.947219: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.2/targets/aarch64-linux/lib:
2020-05-14 00:10:25.947273: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/usr/lib/python3/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters

If you now try to run the same command, but add the --runtime=nvidia parameter to the docker command, you should see something like this:

root@jetson:~# echo "python3 -c 'import tensorflow'" | docker run --runtime=nvidia -i icetekio/jetson-nano-tensorflow /bin/bash
2020-05-14 00:12:16.767624: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-05-14 00:12:19.386354: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libnvinfer.so.7
2020-05-14 00:12:19.388700: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libnvinfer_plugin.so.7
/usr/lib/python3/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters

The nvidia-docker runtime is already set up, but it is not enabled by default. To make Docker use the nvidia-docker runtime as the default, add "default-runtime": "nvidia" to the /etc/docker/daemon.json configuration file, as shown below:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}

Now you can omit the --runtime=nvidia parameter in docker run commands and the GPU will be initialized by default. With this in place, K3s will use Docker with the nvidia-docker runtime, letting pods use the GPU without any special configuration.
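Since the daemon.json shown above already contains a "runtimes" section, it is safer to merge the new key into the existing file than to overwrite it. Below is a minimal Python sketch of that merge; the helper name set_default_runtime is mine, not part of any tool mentioned here:

```python
import json
from pathlib import Path

def set_default_runtime(path="/etc/docker/daemon.json"):
    """Add "default-runtime": "nvidia" to daemon.json while keeping any
    settings already present (such as the "runtimes" section)."""
    p = Path(path)
    config = json.loads(p.read_text()) if p.exists() else {}
    config["default-runtime"] = "nvidia"
    p.write_text(json.dumps(config, indent=4) + "\n")
    return config
```

Remember to restart the Docker daemon afterwards (for example with sudo systemctl restart docker) so the new default runtime takes effect.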

Joining the Jetson as a K8s node

Joining the Jetson as a Kubernetes node with K3sup takes just one command. However, for it to succeed we need to be able to connect to both the Jetson and the master node over SSH without a password, and either run sudo without a password or connect as the root user.

If you need to generate SSH keys and copy them over, run the following commands:

ssh-keygen -t rsa -b 4096 -f ~/.ssh/rpi -P ""
ssh-copy-id -i .ssh/rpi user@host

By default, an Ubuntu installation requires a password for the sudo command, so the simpler approach is to let K3sup use the root account. For this to work, copy your ~/.ssh/authorized_keys into the /root/.ssh/ directory.
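In practice a single cp as root does the job, but if you are provisioning several boards the step can be scripted. Here is a small sketch assuming the usual Ubuntu paths; install_root_key is a name I made up for illustration:

```python
import shutil
from pathlib import Path

def install_root_key(src="~/.ssh/authorized_keys", dst_dir="/root/.ssh"):
    """Copy the current user's authorized_keys into root's .ssh directory
    so tools like k3sup can SSH in as root."""
    src = Path(src).expanduser()
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(mode=0o700, parents=True, exist_ok=True)
    dst = dst_dir / "authorized_keys"
    shutil.copy(src, dst)
    dst.chmod(0o600)  # sshd is strict about key file permissions
    return dst
```

Run it as root on the board, or point dst_dir somewhere else first to test it.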

Before joining the Jetson, let's take a look at the cluster we want to connect to:

upgrade@ZeroOne:~$ kubectl get node -o wide
NAME      STATUS   ROLES    AGE   VERSION        INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
nexus     Ready    master   32d   v1.17.2+k3s1   192.168.0.12   <none>        Ubuntu 18.04.4 LTS   4.15.0-96-generic   containerd://1.3.3-k3s1
rpi3-32   Ready    <none>   32d   v1.17.2+k3s1   192.168.0.30   <none>        Ubuntu 18.04.4 LTS   5.3.0-1022-raspi2   containerd://1.3.3-k3s1
rpi3-64   Ready    <none>   32d   v1.17.2+k3s1   192.168.0.32   <none>        Ubuntu 18.04.4 LTS   5.3.0-1022-raspi2   containerd://1.3.3-k3s1

You may notice that the master node is a host called nexus with IP 192.168.0.12, and that it is running containerd. By default, K3s uses containerd as its runtime, but this can be changed. Since we set up nvidia-docker to work with Docker, we need to switch away from containerd. Don't worry: switching from containerd to Docker only requires passing one extra argument to the k3sup command. So, run the following command to join the Jetson to the cluster:

k3sup join --ssh-key ~/.ssh/rpi --server-ip 192.168.0.12 --ip 192.168.0.40 --k3s-extra-args '--docker'

192.168.0.40 is the IP of my Jetson Nano. As you can see, we passed the --k3s-extra-args '--docker' flag, which forwards the --docker flag to the k3s agent during installation. Thanks to this, the node uses the Docker daemon with the nvidia-docker setup instead of containerd.

To check that the node connected correctly, we can run kubectl get node -o wide:

upgrade@ZeroOne:~$ kubectl get node -o wide
NAME      STATUS   ROLES    AGE   VERSION        INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
nexus     Ready    master   32d   v1.17.2+k3s1   192.168.0.12   <none>        Ubuntu 18.04.4 LTS   4.15.0-96-generic   containerd://1.3.3-k3s1
rpi3-32   Ready    <none>   32d   v1.17.2+k3s1   192.168.0.30   <none>        Ubuntu 18.04.4 LTS   5.3.0-1022-raspi2   containerd://1.3.3-k3s1
rpi3-64   Ready    <none>   32d   v1.17.2+k3s1   192.168.0.32   <none>        Ubuntu 18.04.4 LTS   5.3.0-1022-raspi2   containerd://1.3.3-k3s1
jetson    Ready    <none>   11s   v1.17.2+k3s1   192.168.0.40   <none>        Ubuntu 18.04.4 LTS   4.9.140-tegra       docker://19.3.6
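If you script your cluster setup, this check can be automated by parsing the kubectl output. A quick sketch (node_runtimes is a hypothetical helper; it relies only on NAME being the first column and CONTAINER-RUNTIME the last):

```python
def node_runtimes(kubectl_output):
    """Turn `kubectl get node -o wide` output into a
    {node-name: container-runtime} mapping."""
    lines = [l for l in kubectl_output.strip().splitlines() if l.strip()]
    runtimes = {}
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        runtimes[fields[0]] = fields[-1]  # NAME ... CONTAINER-RUNTIME
    return runtimes
```

For the listing above, the jetson entry would start with docker:// while the other nodes report containerd://.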

A simple verification

We can now run a pod with the same docker image and command to check whether we get the same results as when running Docker directly on the Jetson Nano at the beginning of this article. To do this, we can apply this pod spec (for example with kubectl apply -f gpu-test.yaml, if you save it under that name):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  nodeSelector:
    kubernetes.io/hostname: jetson
  containers:
    - image: icetekio/jetson-nano-tensorflow
      name: gpu-test
      command:
        - "/bin/bash"
        - "-c"
        - "echo 'import tensorflow' | python3"
  restartPolicy: Never

Wait for the docker image to be pulled, then view the logs by running:

upgrade@ZeroOne:~$ kubectl logs gpu-test
2020-05-14 10:01:51.341661: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-05-14 10:01:53.996300: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libnvinfer.so.7
2020-05-14 10:01:53.998563: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libnvinfer_plugin.so.7
/usr/lib/python3/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters

As you can see, the log messages are the same as when we ran Docker directly on the Jetson earlier.

Running MNIST training

We now have a GPU-enabled node running, so we can try out the "Hello world" of machine learning and run a TensorFlow 2 example model on the MNIST dataset.

To run a simple training session that demonstrates GPU usage, apply the manifest below:

apiVersion: v1
kind: Pod
metadata:
  name: mnist-training
spec:
  nodeSelector:
    kubernetes.io/hostname: jetson
  initContainers:
    - name: git-clone
      image: iceci/utils
      command:
        - "git"
        - "clone"
        - "https://github.com/IceCI/example-mnist-training.git"
        - "/workspace"
      volumeMounts:
        - mountPath: /workspace
          name: workspace
  containers:
    - image: icetekio/jetson-nano-tensorflow
      name: mnist
      command:
        - "python3"
        - "/workspace/mnist.py"
      volumeMounts:
        - mountPath: /workspace
          name: workspace
  restartPolicy: Never
  volumes:
    - name: workspace
      emptyDir: {}

As you can see from the logs below, the GPU is in use:

...
2020-05-14 11:30:02.846289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-05-14 11:30:02.846434: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
....

If you are on the node itself, you can monitor CPU and GPU usage by running the tegrastats command:

upgrade@jetson:~$ tegrastats --interval 5000
RAM 2462/3964MB (lfb 2x4MB) SWAP 362/1982MB (cached 6MB) CPU [52%@1479,41%@1479,43%@1479,34%@1479] EMC_FREQ 0% GR3D_FREQ 9% [email protected] CPU@26C PMIC@100C GPU@24C [email protected] thermal@25C POM_5V_IN 3410/3410 POM_5V_GPU 451/451 POM_5V_CPU 1355/1355
RAM 2462/3964MB (lfb 2x4MB) SWAP 362/1982MB (cached 6MB) CPU [53%@1479,42%@1479,45%@1479,35%@1479] EMC_FREQ 0% GR3D_FREQ 9% [email protected] CPU@26C PMIC@100C GPU@24C [email protected] [email protected] POM_5V_IN 3410/3410 POM_5V_GPU 451/451 POM_5V_CPU 1353/1354
RAM 2461/3964MB (lfb 2x4MB) SWAP 362/1982MB (cached 6MB) CPU [52%@1479,38%@1479,43%@1479,33%@1479] EMC_FREQ 0% GR3D_FREQ 10% PLL@24C CPU@26C PMIC@100C GPU@24C AO@29C [email protected] POM_5V_IN 3410/3410 POM_5V_GPU 493/465 POM_5V_CPU 1314/1340
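If you want to sample GPU load programmatically, the GR3D_FREQ field (the GPU utilization percentage in the output above) can be pulled out of each tegrastats line with a small regex. A sketch, with gpu_utilization being my own helper name:

```python
import re

def gpu_utilization(tegrastats_line):
    """Extract the GPU load percentage (the GR3D_FREQ field) from one
    line of tegrastats output; returns None if the field is absent."""
    match = re.search(r"GR3D_FREQ (\d+)%", tegrastats_line)
    return int(match.group(1)) if match else None
```

For the third sample line above, gpu_utilization(...) returns 10.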

Summary

As you can see, connecting a Jetson Nano to a Kubernetes cluster is a very simple process. In just a few minutes you can have Kubernetes running machine learning workloads, powered by NVIDIA's pocket-sized GPU. You will be able to run any GPU container designed for the Jetson Nano on Kubernetes, which can simplify your development and testing.

Author: Jakub Czapliński, editor at Icetek
Original article:
https://medium.com/icetek/how-to-connect-jetson-nano-to-kubernetes-using-k3s-and-k3sup-c715cf2bf212
