Open OnDemand Deployment, Installation and Usage Manual

1. Open OnDemand

Features supported by Open OnDemand

  • Graphical file management - Upload, download, move and delete files and folders through the web browser.

  • File editor - Edit and save files without the need to launch a shell session.

  • Shell access - Pop into a command-line shell straight from the web portal.

  • Queue management - View up-to-date details of jobs pending or running on the cluster.

  • Job submission templates - Submit jobs from the web console using preset templates or customize your own (includes the ability to edit job scripts and parameters on the fly).

  • Full Linux desktop streaming via the web - Run a full, low-latency XFCE Linux desktop on the compute nodes for GUI-heavy jobs such as MATLAB, Mathematica, etc. Graphical jobs continue to run while disconnected from the compute host.

  • No need to install a local X server to run graphical jobs, as all rendering is performed on the compute nodes.

Open OnDemand architecture

Overview

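Apache is the server front end. It runs as the Apache user, accepts all requests from users, and has four main functions:

  • Authenticate users
  • Start a per-user NGINX process (PUN) for each user
  • Reverse proxy each user to their PUN over a Unix domain socket
  • Reverse proxy, over TCP sockets, to interactive apps running on compute nodes (RStudio, Jupyter, VNC desktop)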
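
As a rough illustration of these roles (a bash sketch, not how Apache is actually configured), the hypothetical function below classifies a request path by where the front-end proxy would send it. It assumes the default /pun prefix for PUN traffic and the /node and /rnode prefixes that are configured in the reverse-proxy section later in this manual.

```shell
# Hypothetical helper: classify a request path the way the OnDemand front-end
# proxy dispatches it. Assumes the default /pun prefix and the /node and
# /rnode prefixes configured later in this manual.
ood_route() {
  local path="$1"
  case "$path" in
    /pun/*)   echo "pun" ;;     # proxied to the user's PUN over a Unix domain socket
    /node/*)  echo "node" ;;    # proxied over TCP to an app on a compute node, full path kept
    /rnode/*) echo "rnode" ;;   # same, but the /rnode/<host>/<port> prefix is stripped
    *)        echo "apache" ;;  # in this sketch, anything else is handled by Apache itself
  esac
}
```

For example, ood_route /pun/sys/dashboard prints pun, and ood_route /rnode/cn0001.my_center.edu/8787/ prints rnode (the host name is made up).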

System context

Users interact with their HPC resources through OnDemand from a web browser.

Container context

The front-end proxy is the only component shared by all clients. It creates an NGINX process (PUN) for each user.

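Request flow

A user initiates a request from the browser; the request flow describes how that request propagates through the system to a specific application (including the Dashboard).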

2. Installing Open OnDemand

Prerequisites:

  • RedHat/CentOS 7+
  • a common user/group database, e.g., LDAP + NSS
  • a common host file list
  • the client binaries and libraries of the resource manager used by the batch servers (e.g., Torque, Slurm, or LSF), installed on the OnDemand node
  • configuration on both the OnDemand node and the batch servers that allows submitting, checking the status of, and deleting jobs from the command line
  • signed SSL certificate with corresponding intermediate certificate for your advertised OnDemand host name (e.g., ondemand.my_center.edu)
  • your LDAP URL, base DN, and attribute to search for (in some rare cases a bind DN and corresponding bind password)
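
Before installing, it can help to confirm that the resource manager client is actually on the OnDemand node's PATH. The bash sketch below is hypothetical; its mapping from command to adapter (sbatch to slurm, qsub to torque, bsub to lsf) is an assumption, and qsub in particular is also shipped by other PBS-family schedulers.

```shell
# Hypothetical pre-flight check: report which resource manager client is on PATH.
# Prints the assumed adapter name, or "none" (exit status 1) if nothing is found.
detect_rm_client() {
  local cmd
  for cmd in sbatch qsub bsub; do
    if command -v "$cmd" >/dev/null 2>&1; then
      case "$cmd" in
        sbatch) echo "slurm" ;;
        qsub)   echo "torque" ;;   # assumption: could also be PBS Pro
        bsub)   echo "lsf" ;;
      esac
      return 0
    fi
  done
  echo "none"
  return 1
}
```

Run detect_rm_client on the future OnDemand node; none means the client binaries are missing or not on PATH.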

Install system dependencies

Install packages from the standard RPM repositories:

sudo yum install centos-release-scl lsof sudo git

Dependencies provided by OSC

Many of OnDemand's system-level dependencies are available from

https://yum.osc.edu/ondemand/$ONDEMAND_RELEASE/web/$ENTERPRISE_LINUX_VERSION/x86_64/

and must be installed on the node that will become the OnDemand web server.

To install the dependencies for OnDemand 1.7.x on CentOS 7:

sudo yum install \
    https://yum.osc.edu/ondemand/1.7/web/el7/x86_64/cjose-0.6.1-1.el7.x86_64.rpm \
    https://yum.osc.edu/ondemand/1.7/web/el7/x86_64/cjose-devel-0.6.1-1.el7.x86_64.rpm \
    https://yum.osc.edu/ondemand/1.7/web/el7/x86_64/httpd24-mod_auth_openidc-2.4.1-1.el7.x86_64.rpm \
    https://yum.osc.edu/ondemand/1.7/web/el7/x86_64/ondemand-1.7.14-1.el7.x86_64.rpm \
    https://yum.osc.edu/ondemand/1.7/web/el7/x86_64/ondemand-apache-1.7-8.el7.x86_64.rpm \
    https://yum.osc.edu/ondemand/1.7/web/el7/x86_64/ondemand-build-1.7-8.el7.x86_64.rpm \
    https://yum.osc.edu/ondemand/1.7/web/el7/x86_64/ondemand-gems-1.7.14-1.7.14-1.el7.x86_64.rpm \
    https://yum.osc.edu/ondemand/1.7/web/el7/x86_64/ondemand-gems-1.7.14-1.el7.x86_64.rpm \
    https://yum.osc.edu/ondemand/1.7/web/el7/x86_64/ondemand-nginx-1.17.3-6.p6.0.4.el7.x86_64.rpm \
    https://yum.osc.edu/ondemand/1.7/web/el7/x86_64/ondemand-nodejs-1.7-8.el7.x86_64.rpm \
    https://yum.osc.edu/ondemand/1.7/web/el7/x86_64/ondemand-passenger-6.0.4-6.el7.x86_64.rpm \
    https://yum.osc.edu/ondemand/1.7/web/el7/x86_64/ondemand-passenger-devel-6.0.4-6.el7.x86_64.rpm \
    https://yum.osc.edu/ondemand/1.7/web/el7/x86_64/ondemand-python-1.7-8.el7.x86_64.rpm \
    https://yum.osc.edu/ondemand/1.7/web/el7/x86_64/ondemand-ruby-1.7-8.el7.x86_64.rpm \
    https://yum.osc.edu/ondemand/1.7/web/el7/x86_64/ondemand-rubygem-bundler-1.17.3-1.el7.noarch.rpm \
    https://yum.osc.edu/ondemand/1.7/web/el7/x86_64/ondemand-runtime-1.7-8.el7.x86_64.rpm \
    https://yum.osc.edu/ondemand/1.7/web/el7/x86_64/ondemand-scldevel-1.7-8.el7.x86_64.rpm \
    https://yum.osc.edu/ondemand/1.7/web/el7/x86_64/ondemand-selinux-1.7.14-1.el7.x86_64.rpm
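
Every URL above follows the pattern https://yum.osc.edu/ondemand/$ONDEMAND_RELEASE/web/$ENTERPRISE_LINUX_VERSION/x86_64/<package>. A small bash sketch that builds those URLs, so a different release or package set can be assembled without typos; the function names are hypothetical, and it only prints the yum command rather than running it.

```shell
# Hypothetical helper: build the OSC download URL for one OnDemand RPM.
ood_rpm_url() {
  local release="$1" el_version="$2" rpm="$3"
  printf 'https://yum.osc.edu/ondemand/%s/web/%s/x86_64/%s\n' "$release" "$el_version" "$rpm"
}

# Print (do not run) a yum install command for the given RPM file names.
ood_install_cmd() {
  local release="$1" el_version="$2"
  shift 2
  local urls=() rpm
  for rpm in "$@"; do
    urls+=("$(ood_rpm_url "$release" "$el_version" "$rpm")")
  done
  echo "sudo yum install ${urls[*]}"
}
```

For example, ood_install_cmd 1.7 el7 cjose-0.6.1-1.el7.x86_64.rpm prints the install line for that one package; review the output before running it.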

Install the OnDemand infrastructure

OnDemand's core infrastructure is installed under /opt/ood:

  • ondemand
  • mod_ood_proxy
  • nginx_stage
  • ood_auth_map
  • ood-portal-generator

Install the OnDemand core applications

OnDemand's core applications are installed under /var/www/ood/apps/sys/$APP:

  • ood-activejobs:/var/www/ood/apps/sys/activejobs
  • ood-dashboard:/var/www/ood/apps/sys/dashboard
  • ood-fileeditor:/var/www/ood/apps/sys/file-editor
  • ood-fileexplorer:/var/www/ood/apps/sys/files
  • ood-myjobs:/var/www/ood/apps/sys/myjobs
  • ood-shell:/var/www/ood/apps/sys/shell

Each application has its own dependencies (from NPM or RubyGems), which are installed by running:

cd /var/www/ood/apps/sys/$APP
# We have both Node and Rails applications, let's cover both in a single command
sudo NODE_ENV=production RAILS_ENV=production scl enable ondemand -- bin/setup
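
The command above has to be repeated for each of the six directories listed earlier. A bash sketch that prints one command per app so they can be reviewed first; the function name is hypothetical and the directory list is the one above.

```shell
# Directory names of the core apps under /var/www/ood/apps/sys (from the list above).
OOD_SYS_APPS=(activejobs dashboard file-editor files myjobs shell)

# Hypothetical helper: print the bin/setup command for every core app.
print_setup_cmds() {
  local root="${1:-/var/www/ood/apps/sys}" app
  for app in "${OOD_SYS_APPS[@]}"; do
    printf 'cd %q && sudo NODE_ENV=production RAILS_ENV=production scl enable ondemand -- bin/setup\n' "$root/$app"
  done
}
```

Once the output looks right, print_setup_cmds | bash runs the six setups one after another.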

Build the configuration

Update the Apache service environment

sudo sed -i 's/^HTTPD24_HTTPD_SCLS_ENABLED=.*/HTTPD24_HTTPD_SCLS_ENABLED="httpd24 rh-ruby25"/' \
/opt/rh/httpd24/service-environment

Update the sudoers list

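sudo tee /etc/sudoers.d/ood > /dev/null << EOF
Defaults:apache !requiretty, !authenticate
apache ALL=(ALL) NOPASSWD: /opt/ood/nginx_stage/sbin/nginx_stage
EOF
sudo chmod 0440 /etc/sudoers.d/ood
# Check the syntax before relying on it
sudo visudo -cf /etc/sudoers.d/ood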

Add NGINX configuration for the core web applications

sudo mkdir -p /var/lib/ondemand-nginx/config/apps/sys
sudo touch /var/lib/ondemand-nginx/config/apps/sys/activejobs.conf
sudo touch /var/lib/ondemand-nginx/config/apps/sys/dashboard.conf
sudo touch /var/lib/ondemand-nginx/config/apps/sys/file-editor.conf
sudo touch /var/lib/ondemand-nginx/config/apps/sys/files.conf
sudo touch /var/lib/ondemand-nginx/config/apps/sys/myjobs.conf
sudo touch /var/lib/ondemand-nginx/config/apps/sys/shell.conf
sudo /opt/ood/nginx_stage/sbin/update_nginx_stage &>/dev/null || :
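
The six touch commands can also be written as a loop. A bash sketch under the assumption that the target directory is the one used above (passed as an argument so it can be tried on a scratch directory first); the function name is hypothetical.

```shell
# Hypothetical helper: create the empty per-app NGINX config files in one loop.
touch_sys_app_confs() {
  local dir="${1:-/var/lib/ondemand-nginx/config/apps/sys}" app
  mkdir -p "$dir" || return 1
  for app in activejobs dashboard file-editor files myjobs shell; do
    touch "$dir/$app.conf" || return 1
  done
}
```

To run it with root privileges: sudo bash -c "$(declare -f touch_sys_app_confs); touch_sys_app_confs", then run update_nginx_stage as above.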

Add a cron job to remove long-running PUNs

Inactive PUNs are removed every 2 hours.

sudo tee /etc/cron.d/ood > /dev/null << EOF
PATH=/sbin:/bin:/usr/sbin:/usr/bin
0 */2 * * * root [ -f /opt/ood/nginx_stage/sbin/nginx_stage ] && /opt/ood/nginx_stage/sbin/nginx_stage nginx_clean 2>&1 | logger -t nginx_clean
EOF
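
The 0 */2 * * * field means "minute 0 of every second hour". If a different cleanup interval is wanted, a bash sketch like the following can generate the same entry for any number of hours; the function name is hypothetical and the command is the nginx_clean line above.

```shell
# Hypothetical helper: print the /etc/cron.d/ood entry for a cleanup every N hours.
ood_clean_cron_line() {
  local hours="${1:-2}"
  local stage=/opt/ood/nginx_stage/sbin/nginx_stage
  printf '0 */%s * * * root [ -f %s ] && %s nginx_clean 2>&1 | logger -t nginx_clean\n' \
    "$hours" "$stage" "$stage"
}
```

For example: { echo 'PATH=/sbin:/bin:/usr/sbin:/usr/bin'; ood_clean_cron_line 4; } | sudo tee /etc/cron.d/ood > /dev/null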

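Add the Apache configuration

At this point, visiting the web node still does not show the OnDemand page, because the ood-portal configuration has not been generated yet. Generate a generic one now: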

sudo /opt/ood/ood-portal-generator/sbin/update_ood_portal

This generates the basic OnDemand portal configuration.

Adjust system security

Firewall

Open ports 80 (HTTP) and 443 (HTTPS) in the firewall, typically with firewalld or iptables.

firewalld example:

$ sudo firewall-cmd --zone=public --add-port=80/tcp --permanent
$ sudo firewall-cmd --zone=public --add-port=443/tcp --permanent
$ sudo firewall-cmd --reload

iptables example:

$ sudo iptables -I INPUT -p tcp -m tcp --dport 80 -j ACCEPT
$ sudo iptables -I INPUT -p tcp -m tcp --dport 443 -j ACCEPT
$ sudo sh -c 'iptables-save > /etc/sysconfig/iptables'

Start the Apache service

sudo systemctl start httpd24-httpd

Add an account to the password file used by Apache

# -c creates (or overwrites) the file; omit -c when adding further users
sudo scl enable ondemand -- htpasswd -c /opt/rh/httpd24/root/etc/httpd/.htpasswd $USER
# New password:
# Re-type new password:
# Adding password for user .......
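
Because htpasswd -c overwrites an existing file, it is easy to wipe earlier accounts by accident. A bash sketch, with hypothetical function names, that checks for an entry and prints the htpasswd form to use (with -c only when the file does not exist yet); it assumes the usual user:hash line format.

```shell
# Hypothetical helper: succeed if USER already has an entry in the .htpasswd FILE.
htpasswd_has_user() {
  local file="$1" user="$2"
  [ -f "$file" ] && awk -F: -v u="$user" '$1 == u { found = 1 } END { exit !found }' "$file"
}

# Hypothetical helper: print the htpasswd command to run for FILE and USER.
htpasswd_cmd() {
  local file="$1" user="$2"
  if [ -f "$file" ]; then
    echo "htpasswd $file $user"      # add or update an entry, keep the others
  else
    echo "htpasswd -c $file $user"   # create the file with its first entry
  fi
}
```

For example, htpasswd_cmd /opt/rh/httpd24/root/etc/httpd/.htpasswd "$USER" prints the command to prefix with sudo scl enable ondemand --.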

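Add LDAP support

LDAP support lets users log in with their local username and password. It also removes the need for system administrators to keep updating the .htpasswd file.

  • LDAP server (openldap.my_center.edu:636)

Edit the Open OnDemand portal configuration file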

/etc/ood/config/ood_portal.yml:

# /etc/ood/config/ood_portal.yml
---
# ...

auth:
  - 'AuthType Basic'
  - 'AuthName "private"'
  - 'AuthBasicProvider ldap'
  - 'AuthLDAPURL "ldaps://openldap.my_center.edu:636/ou=People,ou=hpc,o=my_center?uid"'
  - 'AuthLDAPGroupAttribute memberUid'
  - 'AuthLDAPGroupAttributeIsDN off'
  - 'RequestHeader unset Authorization'
  - 'Require valid-user'

Build/install the updated Apache configuration file:

sudo /opt/ood/ood-portal-generator/sbin/update_ood_portal

Restart the Apache server for the changes to take effect:

sudo systemctl try-restart httpd24-httpd.service httpd24-htcacheclean.service

Users can now log in with their local username and password.
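
The AuthLDAPURL value above packs several parts into one string: scheme://host:port/base-DN?attribute. A bash sketch that assembles it from those parts, assuming (as in the example) that port 636 means LDAPS; the function name is hypothetical.

```shell
# Hypothetical helper: build an AuthLDAPURL value from its parts.
ldap_auth_url() {
  local host="$1" port="$2" base_dn="$3" attr="${4:-uid}" scheme=ldap
  if [ "$port" = "636" ]; then
    scheme=ldaps   # 636 is the conventional port for LDAP over TLS
  fi
  printf '%s://%s:%s/%s?%s\n' "$scheme" "$host" "$port" "$base_dn" "$attr"
}
```

Before restarting Apache, the same server and base DN can be checked with the OpenLDAP client, e.g. ldapsearch -x -H ldaps://openldap.my_center.edu:636 -b 'ou=People,ou=hpc,o=my_center' "(uid=$USER)" uid.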

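3. Resource manager configuration

Open OnDemand cluster configuration files

The cluster configuration files describe each cluster that users can submit jobs to, as well as the login host that users can SSH into.

The apps that need a correctly configured cluster include: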

  • Shell App (connect to a cluster login node from the Dashboard App)
  • Active Jobs App (view a list of active jobs for the various clusters)
  • Job Composer App (submit jobs to various clusters)
  • All interactive apps such as Jupyter and RStudio

Create the default directory for the cluster configuration files:

sudo mkdir -p /etc/ood/config/clusters.d

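Create one cluster YAML configuration file for each HPC cluster you want to provide access to. The files must have the *.yml extension.
The simplest cluster configuration file, for an HPC cluster that has only a login node and no resource manager, looks like this: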

# /etc/ood/config/clusters.d/my_cluster.yml
---
v2:
  metadata:
    title: "My Cluster"
  login:
    host: "my_cluster.my_center.edu"
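
The minimal file can also be written from a script, for example when adding several login-only clusters. A bash sketch with a hypothetical function name that writes exactly the v2 layout shown above; it does not validate the YAML.

```shell
# Hypothetical helper: write DIR/NAME.yml with only metadata.title and login.host.
write_min_cluster() {
  local dir="$1" name="$2" title="$3" login_host="$4"
  mkdir -p "$dir" || return 1
  cat > "$dir/$name.yml" <<EOF
---
v2:
  metadata:
    title: "$title"
  login:
    host: "$login_host"
EOF
}
```

For example: sudo bash -c "$(declare -f write_min_cluster); write_min_cluster /etc/ood/config/clusters.d my_cluster 'My Cluster' my_cluster.my_center.edu"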

Example cluster configuration (schema v2)

---
v2:
  metadata:
    title: "Owens"
    url: "https://www.osc.edu/supercomputing/computing/owens"
    hidden: false
  login:
    host: "owens.osc.edu"
  job:
    adapter: "torque"
    host: "owens-batch.ten.osc.edu"
    lib: "/opt/torque/lib64"
    bin: "/opt/torque/bin"
    version: "6.0.1"
  acls:
  - adapter: "group"
    groups:
      - "cluster_users"
      - "other_users_of_the_cluster"
    type: "whitelist"
  custom:
    grafana:
          host: "https://grafana.osc.edu"
          orgId: 3
          dashboard:
            name: "ondemand-clusters"
            uid: "aaba6Ahbauquag"
            panels:
              cpu: 20
              memory: 24
          labels:
            cluster: "cluster"
            host: "host"
            jobid: "jobid"
  batch_connect:
      basic:
        script_wrapper: "module restore\n%s"
      vnc:
        script_wrapper: "module restore\nmodule load ondemand-vnc\n%s"

v2:

Version 2 is the current schema and is the top-level mapping of the cluster configuration.

---
v2:

metadata:

metadata describes how the cluster is displayed to the user.

metadata:
    # title: is the display label that will be used anywhere the cluster is referenced
    title: "Owens"
    # url: provides the ability to show a link to information about the cluster
    url: "https://www.osc.edu/supercomputing/computing/owens"
    # hidden: setting this to true causes OnDemand to not show this cluster to the user, the cluster is still available for use by other applications
    hidden: false

login:

login controls which host the Shell app connects to over SSH. It is also used by the Dashboard and the Job Composer (MyJobs).

login:
  host: "owens.osc.edu"

job:

The job mapping is specific to the cluster's resource manager.

job:
  adapter: "torque"
  host: "owens-batch.ten.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
  version: "6.0.1"

bin_overrides:

# An example in Slurm
job:
  adapter: "slurm"
  bin: "/opt/slurm/bin"
  conf: "/opt/slurm/etc/slurm.conf"
  bin_overrides:
      squeue: "/usr/local/slurm/bin/squeue_wrapper"
      # Override just what you want/need to
      # scontrol: "/usr/local/slurm/bin/scontrol_wrapper"
      sbatch: "/usr/local/slurm/bin/sbatch_wrapper"
      # Will be ignored because bsub is not a command used in the Slurm adapter
      bsub: "/opt/lsf/bin/bsub"

ACL:

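Access control lists (ACLs) provide a way to restrict access to a cluster by group membership. ACLs are implicitly whitelists, but can be explicitly set to whitelist or blacklist.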

acls:
- adapter: "group"
  groups:
    - "cluster_users"
    - "other_users_of_the_cluster"
  type: "whitelist"  # optional, one of "whitelist" or "blacklist"

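To look up group membership, ood_core uses the ood_support library: "id -G USERNAME" is used to get the list of groups the user belongs to, and "getgrgid" is used to look up each group's name.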
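
To check in advance whether a given user would pass a group ACL, here is a rough bash equivalent of that logic (ood_core itself does this in Ruby). It is a sketch with a hypothetical function name; it uses id -Gn to get group names directly and assumes group names contain no spaces.

```shell
# Hypothetical helper: acl_allows USER TYPE GROUP...
# TYPE is whitelist (the default behaviour) or blacklist.
acl_allows() {
  local user="$1" type="$2"
  shift 2
  local user_groups g in_group=no
  # Space-padded list of the user's group names, e.g. " users hpc "
  user_groups=" $(id -Gn "$user" 2>/dev/null) " || return 1
  for g in "$@"; do
    case "$user_groups" in
      *" $g "*) in_group=yes; break ;;
    esac
  done
  if [ "$type" = "blacklist" ]; then
    [ "$in_group" = no ]    # blacklist: allowed only if in none of the groups
  else
    [ "$in_group" = yes ]   # whitelist: allowed only if in at least one group
  fi
}
```

For example, acl_allows someuser whitelist cluster_users other_users_of_the_cluster mirrors the acls block above.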

batch_connect:

batch_connect controls the default settings for interactive apps such as Jupyter or the interactive desktop.

batch_connect:
    basic:
      script_wrapper: "module restore\n%s"
    vnc:
      script_wrapper: "module restore\nmodule load ondemand-vnc\n%s"
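
In script_wrapper, %s stands for the interactive app's own script, so the wrapper lines run before it. The bash sketch below mimics that substitution with printf so a wrapper can be previewed; OnDemand performs the substitution itself, printf is only a stand-in, and the function name is hypothetical.

```shell
# Hypothetical helper: show the script produced by WRAPPER with SCRIPT in place of %s.
apply_script_wrapper() {
  local wrapper="$1" script="$2"
  # shellcheck disable=SC2059  # the wrapper is deliberately used as the format string
  printf -- "$wrapper\n" "$script"
}
```

For example, apply_script_wrapper 'module restore\nmodule load ondemand-vnc\n%s' 'echo start' prints the two module lines followed by echo start.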

Configuring Slurm

A YAML cluster configuration file for an HPC cluster using the Slurm resource manager looks like this:

# /etc/ood/config/clusters.d/my_cluster.yml
---
v2:
  metadata:
    title: "My Cluster"
  login:
    host: "my_cluster.my_center.edu"
  job:
    adapter: "slurm"
    cluster: "my_cluster"
    bin: "/path/to/slurm/bin"
    conf: "/path/to/slurm.conf"
    # bin_overrides:
      # sbatch: "/usr/local/bin/sbatch"
      # squeue: ""
      # scontrol: ""
      # scancel: ""

It supports the following configuration options:

adapter:

Set to slurm

cluster:

The Slurm cluster name

bin:

The path to the Slurm client installation binaries

conf:

The path to the Slurm configuration file

bin_overrides:

Replacements/wrappers for Slurm’s job submission and control clients.

Supports the following clients:

  • sbatch
  • squeue
  • scontrol
  • scancel

Testing the configuration

All rake tasks must be run from the root directory of the Dashboard app:

cd /var/www/ood/apps/sys/dashboard

List all the available tasks we can run:

scl enable ondemand -- bin/rake -T test:jobs

The list is dynamically generated from all available cluster configuration files under /etc/ood/config/clusters.d/*.yml.
My cluster is named linux, so I use linux.yml here.
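
Since the test:jobs:<name> tasks come from the file names, the cluster name is simply the file name without .yml. A bash sketch (hypothetical function name) that lists the names the rake task list should roughly correspond to, assuming one file per cluster:

```shell
# Hypothetical helper: list cluster names (the *.yml file names without extension).
list_ood_clusters() {
  local dir="${1:-/etc/ood/config/clusters.d}" f
  for f in "$dir"/*.yml; do
    [ -e "$f" ] || continue   # no *.yml files: the glob stays literal, skip it
    basename "$f" .yml
  done
}
```

On the setup described here, list_ood_clusters would print linux.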

Test the cluster

CLUSTER_NAME=linux   # cluster file name without .yml, as listed by rake -T
sudo su $USER -c "scl enable ondemand -- bin/rake test:jobs:$CLUSTER_NAME RAILS_ENV=production"

Sample output:

Rails Error: Unable to access log file. Please ensure that /var/www/ood/apps/sys/dashboard/log/production.log exists and is writable (ie, make it writable for user and group: chmod 0664 /var/www/ood/apps/sys/dashboard/log/production.log). The log level has been raised to WARN and the output directed to STDERR until the problem is fixed.
mkdir -p /home/export/base/systest/jiangyt/test_jobs
Testing cluster 'linux'...
Submitting job...
[2020-06-15 16:50:42 +0800 ]  INFO "execve = [{\"SLURM_CONF\"=>\"/usr/sw-slurm/slurm-16.05.3/etc/slurm.conf\"}, \"/usr/sw-slurm/slurm-16.05.3/bin/sbatch\", \"-D\", \"/home/export/base/systest/jiangyt/test_jobs\", \"-J\", \"test_jobs_linux\", \"-o\", \"/home/export/base/systest/jiangyt/test_jobs/output_linux_2020_06_15t16_50_42_08_00_log\", \"-t\", \"00:01:00\", \"--parsable\", \"-M\", \"linux\"]"
Got job id '9273109'
[2020-06-15 16:50:43 +0800 ]  INFO "execve = [{\"SLURM_CONF\"=>\"/usr/sw-slurm/slurm-16.05.3/etc/slurm.conf\"}, \"/usr/sw-slurm/slurm-16.05.3/bin/squeue\", \"--all\", \"--states=all\", \"--noconvert\", \"-o\", \"\\u001E%A\\u001F%i\\u001F%t\", \"-j\", \"9273109\", \"-M\", \"linux\"]"
Job has status of queued
[2020-06-15 16:50:48 +0800 ]  INFO "execve = [{\"SLURM_CONF\"=>\"/usr/sw-slurm/slurm-16.05.3/etc/slurm.conf\"}, \"/usr/sw-slurm/slurm-16.05.3/bin/squeue\", \"--all\", \"--states=all\", \"--noconvert\", \"-o\", \"\\u001E%A\\u001F%i\\u001F%t\", \"-j\", \"9273109\", \"-M\", \"linux\"]"
Job has status of completed
Test for 'linux' PASSED!
Finished testing cluster 'linux'

The test succeeded.

Customizing configuration files

The configuration root directory is /etc/ood. Public assets are located in /var/www/ood/public.

  • /etc/ood/profile

    • If it exists, this file is sourced by the /opt/ood/nginx_stage/sbin/nginx_stage script (running as root) instead of the default /opt/ood/nginx_stage/etc/profile, before the PUN is launched.
    • If you add a custom /etc/ood/profile, it should source /opt/ood/nginx_stage/etc/profile so that the correct software collections are loaded.
  • /etc/ood/config/nginx_stage.yml

    • YAML file to override default configuration for the PUN. You can set environment variables via key-value pairs in the mapping pun_custom_env. You can specify a list of environment variables set in /etc/ood/profile to pass through to the PUN by defining the sequence pun_custom_env_declarations.
    • An example of both of these uses may be found in nginx_stage_example.yml. Variables set here are set for all OnDemand applications.
  • /etc/ood/config/apps/$APP/env

    • Used to provide application-specific configuration.
    • env files do not override values set by prior methods.
  • /etc/ood/config/apps/$APP/initializers/ood.rb

    • Modify Rails application behavior using Ruby code. Because this is application code, environment variables can be set or removed here.
    • This method is specific to Ruby on Rails applications: Activejobs, Dashboard, File Editor, and Job Composer. You can add multiple initializer files in this directory, and they will be loaded in alphabetical order.
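
The env files hold simple KEY=value lines. A bash sketch of an idempotent way to set one entry, assuming the file contains only such lines and the value contains no double quotes; the function name is hypothetical, and OOD_DASHBOARD_TITLE is used as an example key (it is the Dashboard title variable in the OnDemand releases we know of; check the documentation for yours).

```shell
# Hypothetical helper: set KEY="value" in an app env file, replacing any old KEY= line.
set_app_env() {
  local file="$1" key="$2" value="$3" tmp
  mkdir -p "$(dirname "$file")" || return 1
  touch "$file" || return 1
  tmp="$(mktemp)" || return 1
  # Keep every line except an existing entry for this key (grep -v may select nothing).
  grep -v -- "^${key}=" "$file" > "$tmp" || true
  printf '%s="%s"\n' "$key" "$value" >> "$tmp"
  # Rewrite in place so the file keeps its owner and permissions.
  cat "$tmp" > "$file" && rm -f "$tmp"
}
```

For example: sudo bash -c "$(declare -f set_app_env); set_app_env /etc/ood/config/apps/dashboard/env OOD_DASHBOARD_TITLE 'Linux Cluster'"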

4. Setting up Interactive Apps

Interactive apps require a VNC server installed on the compute nodes, not on the OnDemand node.

For VNC server support:

  • nmap-ncat
  • TurboVNC 2.1+
  • websockify 0.8.0+
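
To compare installed versions against these minimums (TurboVNC 2.1+, websockify 0.8.0+), a plain string comparison is not enough (0.10.0 sorts before 0.8.0 as text). A bash sketch using GNU sort -V for version ordering; the function name is hypothetical.

```shell
# Hypothetical helper: succeed if version $1 >= version $2 (uses GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}
```

For example, version_ge 0.10.0 0.8.0 succeeds while version_ge 0.7.0 0.8.0 fails; feed it the version numbers found on the compute node.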

Modify the cluster configuration

vim /etc/ood/config/clusters.d/linux.yml
---
v2:
    metadata:
        title: "Linux Cluster"
    login:
        host: "173.0.20.110"
    job:
        adapter: "slurm"
        cluster: "linux"
        bin: "/usr/sw-slurm/slurm-16.05.3/bin"
        conf: "/usr/sw-slurm/slurm-16.05.3/etc/slurm.conf"
        bin_overrides:
            squeue: "/usr/sw-slurm/slurm-16.05.3/bin/squeue"
    batch_connect:
        basic:
            script_wrapper: |
                module purge
                %s
        vnc:
            script_wrapper: |
                module purge
                export PATH="/opt/TurboVNC/bin:$PATH"
                # websockify must be readable and executable by every user; a path under /root usually is not
                export WEBSOCKIFY_CMD="/root/workspace/websockify-master/run"
                %s

Enable the reverse proxy

Enable the proxy in Apache

Modify the YAML configuration file of ood-portal-generator:

  • /etc/ood/config/ood_portal.yml
# /etc/ood/config/ood_portal.yml

host_regex: '[\w.-]+\.my_center\.edu'
node_uri: '/node'
rnode_uri: '/rnode'
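
host_regex decides which compute node host names Apache will proxy to, so the host in a /node/<host>/<port>/ URL has to match it. Note that the example regex requires the .my_center.edu suffix: a short name such as cn060587 does not match unless the regex is widened. Apache evaluates the regex with PCRE, while bash's [[ =~ ]] uses POSIX extended regular expressions, so this sketch spells [\w.-] as [A-Za-z0-9_.-] and anchors the whole name; both are approximations, and the names are hypothetical.

```shell
# Hypothetical POSIX-ERE translation of host_regex '[\w.-]+\.my_center\.edu',
# anchored so that the whole host name must match.
HOST_REGEX='^[A-Za-z0-9_.-]+\.my_center\.edu$'

# Succeeds if the given host name matches HOST_REGEX.
host_allowed() {
  [[ "$1" =~ $HOST_REGEX ]]
}
```

For example, host_allowed cn060587 || echo "cn060587 does not match host_regex" flags a node that the proxy would refuse.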

Update the Apache configuration file:

sudo /opt/ood/ood-portal-generator/sbin/update_ood_portal

Restart the Apache server for the changes to take effect:

sudo systemctl try-restart httpd24-httpd.service httpd24-htcacheclean.service

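Verify that it works

  • SSH into any compute node whose host name matches the regular expression above:
ssh cn060587
  • Start a listening server:
nc -l 5432

  • In a browser, navigate to that server through the Apache reverse proxy, using a URL of the following form:
http://ondemand.my_center.edu/node/<host>/<port>/...

  • Return to your SSH session and confirm that it received the browser request.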
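
The proxy URL can also be built in a script, for instance to test from the command line instead of a browser. A bash sketch with a hypothetical function name; the optional fourth argument selects the /node (default) or /rnode prefix configured above, and cn0001.my_center.edu is a made-up host.

```shell
# Hypothetical helper: node_proxy_url BASE_URL HOST PORT [node|rnode]
node_proxy_url() {
  local base="$1" host="$2" port="$3" prefix="${4:-node}"
  # Strip one trailing slash from the base URL to avoid "//" in the result.
  printf '%s/%s/%s/%s/\n' "${base%/}" "$prefix" "$host" "$port"
}
```

For example, curl -u "$USER" "$(node_proxy_url http://ondemand.my_center.edu cn0001.my_center.edu 5432)" should make the request appear in the nc session.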