Can a container mount a volume directly at the root directory? (Part 1): Starting from runc

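This started when someone in a group chat asked whether, in K8s, a volume can be mounted directly at the root directory. My first reaction was no: a container uses a union filesystem to mount the image contents at the root, and under normal circumstances that cannot be changed. But is that the whole story? When I could not come up with a convincing explanation, I realized that my understanding of containers had stayed at a very superficial level.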

I. Starting from the runc source code

So I opened the runc source to see how it does this and whether there is an entry point for the question. We start with container creation: libcontainer/init_linux.go:78

func newContainerInit(t initType, pipe *os.File, consoleSocket *os.File, fifoFd, logFd int, mountFds []int) (initer, error) {
	var config *initConfig
	if err := json.NewDecoder(pipe).Decode(&config); err != nil {
		return nil, err
	}
	if err := populateProcessEnvironment(config.Env); err != nil {
		return nil, err
	}
	switch t {
	case initSetns:
		// mountFds must be nil in this case. We don't mount while doing runc exec.
		if mountFds != nil {
			return nil, errors.New("mountFds must be nil; can't mount from exec")
		}

		return &linuxSetnsInit{
		}, nil
	case initStandard:
		return &linuxStandardInit{
		}, nil
	}
	return nil, fmt.Errorf("unknown init type %q", t)
}

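This function does little: it reads the init configuration from the pipe, decodes it, and sets the env entries carried in the configuration on the current process. There are two init types. initSetns joins the namespaces of a container that already exists (the runc exec path, as the comment in the code notes); initStandard starts a new, standard container.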
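
To make the pipe step concrete, here is a minimal sketch of the same pattern: a parent writes a JSON configuration into a pipe, and the child decodes it and applies the env entries. The initConfig struct, readInitConfig, populateEnv, and sendAndReceive are hypothetical, reduced stand-ins for illustration; they are not runc's actual types or helpers.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
	"strings"
)

// initConfig is a hypothetical, heavily reduced stand-in for runc's initConfig.
type initConfig struct {
	Args []string `json:"args"`
	Env  []string `json:"env"`
}

// readInitConfig decodes the configuration that the parent wrote into the pipe.
func readInitConfig(pipe io.Reader) (*initConfig, error) {
	var cfg initConfig
	if err := json.NewDecoder(pipe).Decode(&cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}

// populateEnv applies KEY=VALUE entries to the current process.
func populateEnv(env []string) error {
	for _, kv := range env {
		key, value, ok := strings.Cut(kv, "=")
		if !ok || key == "" {
			return fmt.Errorf("invalid environment variable %q", kv)
		}
		if err := os.Setenv(key, value); err != nil {
			return err
		}
	}
	return nil
}

// sendAndReceive plays both roles in one process: a goroutine acts as the
// parent and writes cfg, while the caller acts as the child and reads it.
func sendAndReceive(cfg initConfig) (*initConfig, error) {
	r, w := io.Pipe()
	go func() {
		err := json.NewEncoder(w).Encode(cfg)
		w.CloseWithError(err)
	}()
	return readInitConfig(r)
}

func main() {
	cfg, err := sendAndReceive(initConfig{Args: []string{"sh"}, Env: []string{"TERM=xterm"}})
	if err != nil {
		panic(err)
	}
	if err := populateEnv(cfg.Env); err != nil {
		panic(err)
	}
	fmt.Println(cfg.Args, os.Getenv("TERM"))
}
```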

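In initStandard, the call most closely tied to rootfs is err := prepareRootfs(l.pipe, l.config, l.mountFds). Before prepareRootfs, the code mainly initializes networking, for example the lo interface and routes. We focus on the rootfs part. According to its comment, prepareRootfs sets up devices, mount points, and filesystems, and reminds you to call finalizeRootfs afterwards to finish the setup. Taking prepareRootfs as the core, let's walk through what happens in it: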

// prepareRootfs sets up the devices, mount points, and filesystems for use
// inside a new mount namespace. It doesn't set anything as ro. You must call
// finalizeRootfs after this function to finish setting up the rootfs.
func prepareRootfs(pipe io.ReadWriter, iConfig *initConfig, mountFds []int) (err error) {
	config := iConfig.Config
	if err := prepareRoot(config); err != nil {
		return fmt.Errorf("error preparing rootfs: %w", err)
	}

	if mountFds != nil && len(mountFds) != len(config.Mounts) {
		return fmt.Errorf("malformed mountFds slice. Expected size: %v, got: %v. Slice: %v", len(config.Mounts), len(mountFds), mountFds)
	}

	mountConfig := &mountConfig{
		root:            config.Rootfs,
		label:           config.MountLabel,
		cgroup2Path:     iConfig.Cgroup2Path,
		rootlessCgroups: iConfig.RootlessCgroups,
		cgroupns:        config.Namespaces.Contains(configs.NEWCGROUP),
	}
	setupDev := needsSetupDev(config)
	for i, m := range config.Mounts {
		// Just before the loop we checked that if not empty, len(mountFds) == len(config.Mounts).
		// Therefore, we can access mountFds[i] without any concerns.
		if mountFds != nil && mountFds[i] != -1 {
			mountConfig.fd = &mountFds[i]
		} else {
			mountConfig.fd = nil
		}

		if err := mountToRootfs(m, mountConfig); err != nil {
			return fmt.Errorf("error mounting %q to rootfs at %q: %w", m.Source, m.Destination, err)
		}
	}

	if setupDev {
		if err := createDevices(config); err != nil {
			return fmt.Errorf("error creating device nodes: %w", err)
		}
		if err := setupPtmx(config); err != nil {
			return fmt.Errorf("error setting up ptmx: %w", err)
		}
		if err := setupDevSymlinks(config.Rootfs); err != nil {
			return fmt.Errorf("error setting up /dev symlinks: %w", err)
		}
	}

	// Signal the parent to run the pre-start hooks.
	// The hooks are run after the mounts are setup, but before we switch to the new
	// root, so that the old root is still available in the hooks for any mount
	// manipulations.
	// Note that iConfig.Cwd is not guaranteed to exist here.
	if err := syncParentHooks(pipe); err != nil {
		return err
	}

	// The reason these operations are done here rather than in finalizeRootfs
	// is because the console-handling code gets quite sticky if we have to set
	// up the console before doing the pivot_root(2). This is because the
	// Console API has to also work with the ExecIn case, which means that the
	// API must be able to deal with being inside as well as outside the
	// container. It's just cleaner to do this here (at the expense of the
	// operation not being perfectly split).

	if err := unix.Chdir(config.Rootfs); err != nil {
		return &os.PathError{Op: "chdir", Path: config.Rootfs, Err: err}
	}

	s := iConfig.SpecState
	s.Pid = unix.Getpid()
	s.Status = specs.StateCreating
	if err := iConfig.Config.Hooks[configs.CreateContainer].RunHooks(s); err != nil {
		return err
	}

	if config.NoPivotRoot {
		err = msMoveRoot(config.Rootfs)
	} else if config.Namespaces.Contains(configs.NEWNS) {
		err = pivotRoot(config.Rootfs)
	} else {
		err = chroot()
	}
	if err != nil {
		return fmt.Errorf("error jailing process inside rootfs: %w", err)
	}

	if setupDev {
		if err := reOpenDevNull(); err != nil {
			return fmt.Errorf("error reopening /dev/null inside container: %w", err)
		}
	}

	if cwd := iConfig.Cwd; cwd != "" {
		// Note that spec.Process.Cwd can contain unclean value like  "../../../../foo/bar...".
		// However, we are safe to call MkDirAll directly because we are in the jail here.
		if err := os.MkdirAll(cwd, 0o755); err != nil {
			return err
		}
	}

	return nil
}

1. prepareRoot

1.1 RootPropagation

func prepareRoot(config *configs.Config) error {
   flag := unix.MS_SLAVE | unix.MS_REC
   if config.RootPropagation != 0 {
      flag = config.RootPropagation
   }
   if err := mount("", "/", "", "", uintptr(flag), ""); err != nil {
      return err
   }

   // Make parent mount private to make sure following bind mount does
   // not propagate in other namespaces. Also it will help with kernel
   // check pass in pivot_root. (IS_SHARED(new_mnt->mnt_parent))
   if err := rootfsParentMountPrivate(config.Rootfs); err != nil {
      return err
   }

   return mount(config.Rootfs, config.Rootfs, "", "bind", unix.MS_BIND|unix.MS_REC, "")
}

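prepareRoot begins with a mount call that is really a recursive (unix.MS_REC) change of propagation type. By default the flag is unix.MS_SLAVE. According to the Linux man pages, a slave mount receives mount events from its master in one direction only: mounts made under the slave do not propagate back to the master. Because the target here is "/" and the flag is recursive, mount operations made inside this namespace do not affect the outside, while events in the other direction (more precisely, from the master peer group) still propagate in.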

Let's simulate this: mount a tmpfs and set its propagation type to shared:

mount -t tmpfs myt /root/dir1 --make-shared

Check the propagation types with findmnt -o TARGET,PROPAGATION:

|-/var/lib/kubelet/pods/6c4a58a7-557f-4cc8-b95f-4170c6ac2ab8/volume-subpaths/dashboard-manager-secret/customer-dashboard-manager/2 private
|-/root/dir1 shared

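I then mimicked runc by cloning a new mount namespace and checked the propagation types again; the result was the same as above. After running mount --make-rslave /, the shared mount had become a slave, while the mount that was originally private stayed unchanged: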

|-/var/lib/kubelet/pods/6c4a58a7-557f-4cc8-b95f-4170c6ac2ab8/volume-subpaths/dashboard-manager-secret/customer-dashboard-manager/2 private
|-/root/dir1 private,slave
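
The same information can be read without findmnt. As a rough sketch, the function below derives a findmnt-style propagation string from one line of /proc/self/mountinfo, whose optional fields (between the mount options and the "-" separator) carry shared:N and master:N tags. propagationFromMountinfo is a hypothetical helper, the sample lines in main are made up for illustration, and unbindable mounts are ignored.

```go
package main

import (
	"fmt"
	"strings"
)

// propagationFromMountinfo extracts the mount point and a findmnt-style
// propagation string from a single /proc/self/mountinfo line.
// Layout: id parent major:minor root mountpoint options [optional...] - fstype source superopts
func propagationFromMountinfo(line string) (mountpoint, propagation string, err error) {
	fields := strings.Fields(line)
	sep := -1
	for i, f := range fields {
		if f == "-" {
			sep = i
			break
		}
	}
	if sep < 6 {
		return "", "", fmt.Errorf("malformed mountinfo line: %q", line)
	}
	shared, slave := false, false
	for _, opt := range fields[6:sep] {
		switch {
		case strings.HasPrefix(opt, "shared:"):
			shared = true
		case strings.HasPrefix(opt, "master:"):
			slave = true
		}
	}
	propagation = "private"
	if shared {
		propagation = "shared"
	}
	if slave {
		propagation += ",slave"
	}
	return fields[4], propagation, nil
}

func main() {
	// Hypothetical sample lines, before and after mount --make-rslave /.
	lines := []string{
		"36 35 0:40 / /root/dir1 rw,relatime shared:1 - tmpfs myt rw",
		"36 35 0:40 / /root/dir1 rw,relatime master:1 - tmpfs myt rw",
	}
	for _, l := range lines {
		mp, prop, err := propagationFromMountinfo(l)
		if err != nil {
			panic(err)
		}
		fmt.Println(mp, prop)
	}
}
```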

This matches the man page: mounts that are not shared are left unchanged by this operation:

MS_SLAVE
              If this is a shared mount that is a member of a peer group
              that contains other members, convert it to a slave mount.
              If this is a shared mount that is a member of a peer group
              that contains no other members, convert it to a private
              mount.  Otherwise, the propagation type of the mount is
              left unchanged.
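
The rule quoted above can be restated as a small state transition. This is only a model of the man page wording, not kernel code; the propagation type and makeSlave are hypothetical names. In the experiment, the copy of /root/dir1 in the new namespace shares a peer group with the host's mount, so it becomes a slave, while the private subpath mount is untouched.

```go
package main

import "fmt"

// propagation is a hypothetical label for a mount's propagation type.
type propagation string

const (
	shared  propagation = "shared"
	slave   propagation = "slave"
	private propagation = "private"
)

// makeSlave models what MS_SLAVE does to a single mount, following the
// man page text: only shared mounts change.
func makeSlave(current propagation, peerGroupHasOtherMembers bool) propagation {
	if current != shared {
		return current // left unchanged
	}
	if peerGroupHasOtherMembers {
		return slave
	}
	return private
}

func main() {
	fmt.Println(makeSlave(shared, true))   // shared mount with peers
	fmt.Println(makeSlave(shared, false))  // shared mount alone in its group
	fmt.Println(makeSlave(private, false)) // anything else
}
```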

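Note the hook left here: config.RootPropagation can override this default. Docker defaults to rprivate, meaning mounts do not propagate in either direction. K8s also defaults to private. K8s additionally lets you configure the propagation type per volume mount (the mountPropagation field, stable since 1.12): HostToContainer corresponds to MS_SLAVE, and Bidirectional corresponds to MS_SHARED, which shares mounts made in this namespace with the outside. This option is flexible but dangerous; for example, it lets a container mount and unmount devices.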
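
The mapping just described can be written down as a lookup. This is a sketch only: propagationFlags is a hypothetical function, not something from Kubernetes or runc, and the constants are local copies of the Linux MS_* values so that the example does not depend on golang.org/x/sys/unix.

```go
package main

import "fmt"

// Local copies of the Linux mount flag values (see mount(2)).
const (
	msRec     = 0x4000
	msPrivate = 1 << 18
	msSlave   = 1 << 19
	msShared  = 1 << 20
)

// propagationFlags maps a K8s mountPropagation value to recursive mount
// propagation flags. An empty value is treated like "None".
func propagationFlags(mode string) (uintptr, error) {
	switch mode {
	case "", "None":
		return msPrivate | msRec, nil // rprivate
	case "HostToContainer":
		return msSlave | msRec, nil // rslave
	case "Bidirectional":
		return msShared | msRec, nil // rshared
	}
	return 0, fmt.Errorf("unknown mount propagation mode %q", mode)
}

func main() {
	for _, mode := range []string{"None", "HostToContainer", "Bidirectional"} {
		flags, err := propagationFlags(mode)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-16s %#x\n", mode, flags)
	}
}
```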

1.2 rootfsParentMountPrivate

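The comments here are thorough. The function checks whether the mount that contains the future root directory (its parent mount) is shared, and if so makes it private. In other words, the container always requires this mount to be private, even if RootPropagation is set to shared or something else.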

The intent seems reasonable: if rootfs could be affected by arbitrary propagation, the container could easily break. (I am not sure this guess is correct, though.)

The comment also mentions that making it private keeps the bind mount that follows from propagating into other namespaces, and that pivot_root does not allow this mount to be shared.

1.3 bind

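A bind mount looks a little like a hard link, but the underlying mechanism is entirely different. According to the man page, a bind mount attaches part of an existing filesystem tree at another location, whereas hard and symbolic links are implemented through inodes.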

mount(config.Rootfs, config.Rootfs, "", "bind", unix.MS_BIND|unix.MS_REC, "")

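MS_REC means the same as in the propagation call above: recursion. The man page says that without MS_REC a bind mount covers only the directory itself, and the submounts beneath it are not replicated. Notice that rootfs is bind-mounted onto itself; this turns the directory into a mountpoint, which becomes the root mount of the container. For comparison, a root mount looks like this: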

➜  ~ mount
/dev/disk3s3s1 on / (apfs, sealed, local, read-only, journaled)

2. mountToRootfs

After rootfs has been bind-mounted as a mountpoint, runc creates the mounts listed in config.Mounts one by one. This is where our volume mounts are handled:

func mountToRootfs(m *configs.Mount, c *mountConfig) error {
	rootfs := c.root
	mountLabel := c.label
	mountFd := c.fd
	dest, err := securejoin.SecureJoin(rootfs, m.Destination)
	if err != nil {
		return err
	}

	switch m.Device {
		case "proc", "sysfs": ...
		case "mqueue": ...
		case "tmpfs": ...
		case "bind": ...
		case "cgroup": ...
		default: ...
	}
	if err := setRecAttr(m, rootfs); err != nil {
		return err
	}
	return nil
}

The overall flow is easy to follow: each device type has its own mount procedure. For the final setRecAttr, see mount_setattr(2) if you are interested.
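
One detail worth pausing on is the SecureJoin call at the top of mountToRootfs, which resolves m.Destination so that the result stays inside rootfs. The sketch below is a purely lexical approximation of that idea: scopedJoin is a hypothetical helper, and unlike the real library it does not resolve symlinks, so it must not be used as a security boundary. It does show what a destination of "/" resolves to lexically: rootfs itself.

```go
package main

import (
	"fmt"
	"path"
)

// scopedJoin joins dest onto rootfs after anchoring dest at "/", so that
// ".." components cannot climb above rootfs. Lexical only: symlinks are
// NOT resolved, unlike in securejoin.SecureJoin.
func scopedJoin(rootfs, dest string) string {
	return path.Join(rootfs, path.Clean("/"+dest))
}

func main() {
	rootfs := "/run/rootfs" // hypothetical rootfs location
	for _, dest := range []string{"/proc", "../../etc/passwd", "/"} {
		fmt.Printf("%-18q -> %s\n", dest, scopedJoin(rootfs, dest))
	}
}
```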

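Take proc/sysfs as an example: the code checks dest to make sure that, if it already exists, it is a directory and not a symlink. The comment mentions symlink-exchange attacks; if you are interested, see "mounts outside", which describes in detail how symlinks can let a mount escape the rootfs (I only skimmed it myself).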

	case "proc", "sysfs":
		// If the destination already exists and is not a directory, we bail
		// out This is to avoid mounting through a symlink or similar -- which
		// has been a "fun" attack scenario in the past.
		// TODO: This won't be necessary once we switch to libpathrs and we can
		//       stop all of these symlink-exchange attacks.
		if fi, err := os.Lstat(dest); err != nil {
			if !os.IsNotExist(err) {
				return err
			}
		} else if fi.Mode()&os.ModeDir == 0 {
			return fmt.Errorf("filesystem %q must be mounted on ordinary directory", m.Device)
		}
		if err := os.MkdirAll(dest, 0o755); err != nil {
			return err
		}
		// Selinux kernels do not support labeling of /proc or /sys
		return mountPropagate(m, rootfs, "", nil)
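
The Lstat check is easy to try out in isolation. The sketch below reproduces only that check (not the MkdirAll or the mount itself) and runs it against a real directory, a missing path, and a symlink in a temporary directory. checkMountDest and demoCheck are hypothetical names.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// checkMountDest accepts a destination that does not exist yet or that is
// a real directory. Because it uses Lstat, a symlink pointing at a
// directory is rejected rather than followed.
func checkMountDest(dest string) error {
	fi, err := os.Lstat(dest)
	if err != nil {
		if os.IsNotExist(err) {
			return nil
		}
		return err
	}
	if !fi.IsDir() {
		return fmt.Errorf("%q must be an ordinary directory", dest)
	}
	return nil
}

// demoCheck builds a scratch directory with a real directory and a symlink
// to it, then reports how checkMountDest treats each case.
func demoCheck() (dirOK, missingOK, symlinkRejected bool, err error) {
	tmp, err := os.MkdirTemp("", "mountdest")
	if err != nil {
		return
	}
	defer os.RemoveAll(tmp)

	dir := filepath.Join(tmp, "proc")
	if err = os.Mkdir(dir, 0o755); err != nil {
		return
	}
	link := filepath.Join(tmp, "link")
	if err = os.Symlink(dir, link); err != nil {
		return
	}
	dirOK = checkMountDest(dir) == nil
	missingOK = checkMountDest(filepath.Join(tmp, "absent")) == nil
	symlinkRejected = checkMountDest(link) != nil
	return
}

func main() {
	dirOK, missingOK, symlinkRejected, err := demoCheck()
	if err != nil {
		panic(err)
	}
	fmt.Println(dirOK, missingOK, symlinkRejected)
}
```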

All of these paths end up calling mountPropagate, runc's safety wrapper around mount that guards against malicious mounts:

// Do the mount operation followed by additional mounts required to take care
// of propagation flags. This will always be scoped inside the container rootfs.
func mountPropagate(m *configs.Mount, rootfs string, mountLabel string, mountFd *int) error {}

The other mount types are much the same: they centre on safety, running various checks on the mount before performing it.

3. setupDev

	if setupDev {
		if err := createDevices(config); err != nil {
			return fmt.Errorf("error creating device nodes: %w", err)
		}
		if err := setupPtmx(config); err != nil {
			return fmt.Errorf("error setting up ptmx: %w", err)
		}
		if err := setupDevSymlinks(config.Rootfs); err != nil {
			return fmt.Errorf("error setting up /dev symlinks: %w", err)
		}
	}

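I will skip most of this part, mainly because I am not very familiar with operations such as mknod. Readers with some Linux background will know that dev stands for devices and corresponds to the /dev directory.

Docker can attach devices with --device. createDevices places each host device into the container's /dev, either by bind-mounting it or by creating the node with mknod.

setupPtmx symlinks pts/ptmx into the container's /dev so that ptys work. Finally, setupDevSymlinks adds a few conveniences, such as symlinks under /dev for the standard input, output, and error file descriptors.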

4. Simple hooks during container initialization

4.1 syncParentHooks

	// Signal the parent to run the pre-start hooks.
	// The hooks are run after the mounts are setup, but before we switch to the new
	// root, so that the old root is still available in the hooks for any mount
	// manipulations.
	// Note that iConfig.Cwd is not guaranteed to exist here.
	if err := syncParentHooks(pipe); err != nil {
		return err
	}

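This part is not directly related to the topic, but it is interesting. runc's parent process creates the namespaces and the child process initializes the container; the pipe between them is used here to trigger the prestart hooks. This happens before chroot/pivot_root, so in theory a hook could do dangerous things. Note, however, that the hooks themselves run in the parent process: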

// syncParentHooks sends to the given pipe a JSON payload which indicates that
// the parent should execute pre-start hooks. It then waits for the parent to
// indicate that it is cleared to resume.
func syncParentHooks(pipe io.ReadWriter) error {
	// Tell parent.
	if err := writeSync(pipe, procHooks); err != nil {
		return err
	}

	// Wait for parent to give the all-clear.
	return readSync(pipe, procResume)
}
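
The handshake can be imitated in a single process. In this sketch a net.Pipe stands in for the real pipe, one goroutine plays the parent, and the caller plays the child: the child announces that it is ready for hooks, the parent runs them, and only then is the child told to resume. The message format and all function names are assumptions made for illustration; runc's actual wire format may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
)

// syncMsg is a hypothetical message exchanged over the pipe.
type syncMsg struct {
	Type string `json:"type"`
}

const (
	procHooks  = "procHooks"
	procResume = "procResume"
)

// childSyncHooks asks the parent to run hooks and blocks until it answers.
func childSyncHooks(pipe net.Conn) error {
	if err := json.NewEncoder(pipe).Encode(syncMsg{Type: procHooks}); err != nil {
		return err
	}
	var reply syncMsg
	if err := json.NewDecoder(pipe).Decode(&reply); err != nil {
		return err
	}
	if reply.Type != procResume {
		return fmt.Errorf("unexpected reply %q", reply.Type)
	}
	return nil
}

// parentServeHooks waits for the request, runs the hooks in the parent,
// then gives the child the all-clear.
func parentServeHooks(pipe net.Conn, runHooks func() error) error {
	var req syncMsg
	if err := json.NewDecoder(pipe).Decode(&req); err != nil {
		return err
	}
	if req.Type != procHooks {
		return fmt.Errorf("unexpected request %q", req.Type)
	}
	if err := runHooks(); err != nil {
		return err
	}
	return json.NewEncoder(pipe).Encode(syncMsg{Type: procResume})
}

// handshake runs both sides and returns the order in which things happened.
func handshake() ([]string, error) {
	childEnd, parentEnd := net.Pipe()
	defer childEnd.Close()
	defer parentEnd.Close()

	var events []string
	parentErr := make(chan error, 1)
	go func() {
		parentErr <- parentServeHooks(parentEnd, func() error {
			events = append(events, "parent ran hooks")
			return nil
		})
	}()
	if err := childSyncHooks(childEnd); err != nil {
		return nil, err
	}
	events = append(events, "child resumed")
	if err := <-parentErr; err != nil {
		return nil, err
	}
	return events, nil
}

func main() {
	events, err := handshake()
	if err != nil {
		panic(err)
	}
	fmt.Println(events)
}
```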

4.2 createContainerHooks

These hooks, by contrast, run in the current process (the container's init process). The code is simple:

	// The reason these operations are done here rather than in finalizeRootfs
	// is because the console-handling code gets quite sticky if we have to set
	// up the console before doing the pivot_root(2). This is because the
	// Console API has to also work with the ExecIn case, which means that the
	// API must be able to deal with being inside as well as outside the
	// container. It's just cleaner to do this here (at the expense of the
	// operation not being perfectly split).

	if err := unix.Chdir(config.Rootfs); err != nil {
		return &os.PathError{Op: "chdir", Path: config.Rootfs, Err: err}
	}

	s := iConfig.SpecState
	s.Pid = unix.Getpid()
	s.Status = specs.StateCreating
	if err := iConfig.Config.Hooks[configs.CreateContainer].RunHooks(s); err != nil {
		return err
	}

5. msMoveRoot/chroot/pivotRoot

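Once inside a container we can only see the container's own directories, and that is the work of these three functions. chroot is probably the most familiar; this jail technique has existed for many years.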
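
Which of the three gets used follows directly from the branch near the end of prepareRootfs. Restated as a pure function (jailMethod is a hypothetical name; the two inputs correspond to config.NoPivotRoot and config.Namespaces.Contains(configs.NEWNS)):

```go
package main

import "fmt"

// jailMethod mirrors the if/else chain in prepareRootfs and returns the
// name of the function runc would call to confine the process.
func jailMethod(noPivotRoot, hasMountNamespace bool) string {
	switch {
	case noPivotRoot:
		return "msMoveRoot"
	case hasMountNamespace:
		return "pivotRoot"
	default:
		return "chroot"
	}
}

func main() {
	fmt.Println(jailMethod(false, true))  // default spec: mount namespace present
	fmt.Println(jailMethod(true, true))   // --no-pivot
	fmt.Println(jailMethod(false, false)) // no mount namespace
}
```

So plain chroot is only reached when the mount namespace is removed from the spec and NoPivotRoot is not set.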

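In runc, however, chroot is not the preferred choice. chroot was never designed to create a secure, isolated environment, and it has quite a few limitations. The man page summaries already show that pivot_root and chroot work differently: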

chroot - run command or interactive shell with special root directory
pivot_root - change the root mount

chroot changes the root directory of a command or shell, whereas pivot_root changes the root mount itself. A well-known way to break out of chroot is to call chroot again from inside it. Quoting Wikipedia:

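The chroot mechanism is not designed to defend against deliberate tampering by privileged (root) users. On most systems, chroot contexts do not stack properly, so a program running under chroot with sufficient privileges may perform a second chroot to break out. To reduce this risk, a chrooted program should give up root privileges as soon as possible after the chroot, or another mechanism such as FreeBSD jails should be used instead. Some operating systems, such as FreeBSD, take precautions to prevent the second-chroot attack.

On filesystems that support device nodes, a root user inside the chroot can still create device nodes and mount the filesystems on them; the chroot mechanism is not intended to block low-level access to system devices by privileged users.
At startup, programs expect to find scratch space, configuration files, device nodes, and shared libraries at certain preset locations. For a chrooted program to start successfully, the chroot directory must be populated with a minimum set of these files. This makes chroot difficult to use as a general-purpose sandbox.
Only the root user can perform a chroot. This prevents users from putting a setuid program inside a specially crafted chroot jail (for example, one with fake /etc/passwd and /etc/shadow files) that would fool it into a privilege escalation.
The chroot mechanism itself is also not designed to restrict the use of resources such as I/O, bandwidth, disk space, or CPU time. Most Unix systems are not completely filesystem-oriented, and leave potentially disruptive functionality such as networking and process control available to a chrooted program through the system call interface.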

msMoveRoot also ends by calling chroot; it is a hardened version of chroot:

func msMoveRoot(rootfs string) error {
	// Before we move the root and chroot we have to mask all "full" sysfs and
	// procfs mounts which exist on the host. This is because while the kernel
	// has protections against mounting procfs if it has masks, when using
	// chroot(2) the *host* procfs mount is still reachable in the mount
	// namespace and the kernel permits procfs mounts inside --no-pivot
	// containers.
	//
	// Users shouldn't be using --no-pivot except in exceptional circumstances,
	// but to avoid such a trivial security flaw we apply a best-effort
	// protection here. The kernel only allows a mount of a pseudo-filesystem
	// like procfs or sysfs if there is a *full* mount (the root of the
	// filesystem is mounted) without any other locked mount points covering a
	// subtree of the mount.
	//
	// So we try to unmount (or mount tmpfs on top of) any mountpoint which is
	// a full mount of either sysfs or procfs (since those are the most
	// concerning filesystems to us).
	mountinfos, err := mountinfo.GetMounts(func(info *mountinfo.Info) (skip, stop bool) {
		// Collect every sysfs and procfs filesystem, except for those which
		// are non-full mounts or are inside the rootfs of the container.
		if info.Root != "/" ||
			(info.FSType != "proc" && info.FSType != "sysfs") ||
			strings.HasPrefix(info.Mountpoint, rootfs) {
			skip = true
		}
		return
	})
	if err != nil {
		return err
	}
	for _, info := range mountinfos {
		p := info.Mountpoint
		// Be sure umount events are not propagated to the host.
		if err := mount("", p, "", "", unix.MS_SLAVE|unix.MS_REC, ""); err != nil {
			if errors.Is(err, unix.ENOENT) {
				// If the mountpoint doesn't exist that means that we've
				// already blasted away some parent directory of the mountpoint
				// and so we don't care about this error.
				continue
			}
			return err
		}
		if err := unmount(p, unix.MNT_DETACH); err != nil {
			if !errors.Is(err, unix.EINVAL) && !errors.Is(err, unix.EPERM) {
				return err
			} else {
				// If we have not privileges for umounting (e.g. rootless), then
				// cover the path.
				if err := mount("tmpfs", p, "", "tmpfs", 0, ""); err != nil {
					return err
				}
			}
		}
	}

	// Move the rootfs on top of "/" in our mount namespace.
	if err := mount(rootfs, "/", "", "", unix.MS_MOVE, ""); err != nil {
		return err
	}
	return chroot()
}

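The code is long, but what it does is simple: it collects the full proc and sysfs mounts in the current namespace that are not under rootfs and unmounts them (or covers them with a tmpfs when unmounting is not permitted).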
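
The filter passed to mountinfo.GetMounts is the heart of that logic. Here it is inverted into a positive predicate over a reduced, hypothetical mountInfo struct (the real mountinfo.Info has more fields); needsMasking returns true exactly when the original callback would not skip the entry.

```go
package main

import (
	"fmt"
	"strings"
)

// mountInfo is a hypothetical subset of the fields the filter looks at.
type mountInfo struct {
	Root       string // root of the mount within its filesystem
	Mountpoint string
	FSType     string
}

// needsMasking reports whether a mount is a "full" proc or sysfs mount
// that lives outside the container rootfs.
func needsMasking(info mountInfo, rootfs string) bool {
	if info.Root != "/" {
		return false // not a full mount of the filesystem
	}
	if info.FSType != "proc" && info.FSType != "sysfs" {
		return false
	}
	return !strings.HasPrefix(info.Mountpoint, rootfs)
}

func main() {
	rootfs := "/root/rootfs"
	mounts := []mountInfo{
		{Root: "/", Mountpoint: "/proc", FSType: "proc"},
		{Root: "/", Mountpoint: "/root/rootfs/proc", FSType: "proc"},
		{Root: "/bus", Mountpoint: "/mnt/procbus", FSType: "proc"},
		{Root: "/", Mountpoint: "/", FSType: "ext4"},
	}
	for _, m := range mounts {
		fmt.Println(m.Mountpoint, needsMasking(m, rootfs))
	}
}
```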

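Finally it performs an MS_MOVE that moves the rootfs mount onto /. My guess is that this guards against the chdir(..) plus chroot(.) combination: once the mount has been moved, the original rootfs mount no longer exists. Only then does it call chroot. Even so, runc still discourages this mode.

6. finalizeRootfs

That was a quick pass through prepareRootfs. It does what its comment says: it sets up devices, mount points, and filesystems. finalizeRootfs then finishes the job.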

// finalizeRootfs sets anything to ro if necessary. You must call
// prepareRootfs first.
func finalizeRootfs(config *configs.Config) (err error) {
	// All tmpfs mounts and /dev were previously mounted as rw
	// by mountPropagate. Remount them read-only as requested.
	for _, m := range config.Mounts {
		if m.Flags&unix.MS_RDONLY != unix.MS_RDONLY {
			continue
		}
		if m.Device == "tmpfs" || utils.CleanPath(m.Destination) == "/dev" {
			if err := remountReadonly(m); err != nil {
				return err
			}
		}
	}

	// set rootfs ( / ) as readonly
	if config.Readonlyfs {
		if err := setReadonly(); err != nil {
			return fmt.Errorf("error setting rootfs as readonly: %w", err)
		}
	}

	if config.Umask != nil {
		unix.Umask(int(*config.Umask))
	} else {
		unix.Umask(0o022)
	}
	return nil
}

The first block of finalizeRootfs is the counterpart of mountPropagate (runc's safety wrapper around mount, seen earlier):

// Do the mount operation followed by additional mounts required to take care
// of propagation flags. This will always be scoped inside the container rootfs.
func mountPropagate(m *configs.Mount, rootfs string, mountLabel string, mountFd *int) error {
	var (
		data  = label.FormatMountLabel(m.Data, mountLabel)
		flags = m.Flags
	)
	// Delay mounting the filesystem read-only if we need to do further
	// operations on it. We need to set up files in "/dev", and other tmpfs
	// mounts may need to be chmod-ed after mounting. These mounts will be
	// remounted ro later in finalizeRootfs(), if necessary.
	if m.Device == "tmpfs" || utils.CleanPath(m.Destination) == "/dev" {
		flags &= ^unix.MS_RDONLY
	}
  
  ...
}

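tmpfs mounts and /dev may need further initialization or a chmod after mounting, so even if MS_RDONLY was requested the flag is cleared (flags &= ^unix.MS_RDONLY) for the initial mount. At finalize time, mounts configured with MS_RDONLY are remounted so that they really become read-only.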
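
The two halves of this deferral fit together as follows. The mount struct and the three functions are hypothetical reductions of what mountPropagate and finalizeRootfs do with the flag; path.Clean stands in for runc's utils.CleanPath, and msRdonly is a local copy of MS_RDONLY.

```go
package main

import (
	"fmt"
	"path"
)

const msRdonly = 0x1 // local copy of the Linux MS_RDONLY value

// mount is a hypothetical subset of configs.Mount.
type mount struct {
	Device      string
	Destination string
	Flags       uintptr
}

// delaysReadonly reports whether read-only must be postponed for m.
func delaysReadonly(m mount) bool {
	return m.Device == "tmpfs" || path.Clean(m.Destination) == "/dev"
}

// initialFlags returns the flags used for the first mount: MS_RDONLY is
// cleared when the mount still needs to be written to afterwards.
func initialFlags(m mount) uintptr {
	flags := m.Flags
	if delaysReadonly(m) {
		flags &^= msRdonly
	}
	return flags
}

// needsReadonlyRemount reports whether the finalize step must remount m
// read-only to honour the originally requested flag.
func needsReadonlyRemount(m mount) bool {
	return m.Flags&msRdonly == msRdonly && delaysReadonly(m)
}

func main() {
	m := mount{Device: "tmpfs", Destination: "/dev/shm", Flags: msRdonly}
	fmt.Println(initialFlags(m)&msRdonly != 0, needsReadonlyRemount(m))
}
```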

The second block follows the same logic: if Readonlyfs is set in the configuration, the root is made read-only only at the very end.

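The third block sets the umask. Consider the default of 022: new files and directories are created with modes 666 and 777 unless specified otherwise, and the umask removes write permission for group and others, giving 644 and 755.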
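
The arithmetic is just a bit-clear of the umask from the requested mode. A tiny worked example (effectiveMode is a hypothetical helper):

```go
package main

import "fmt"

// effectiveMode returns the permission bits that remain after the umask
// has been applied to the requested mode.
func effectiveMode(requested, umask uint32) uint32 {
	return requested &^ umask
}

func main() {
	// Prints "644 755".
	fmt.Printf("%o %o\n", effectiveMode(0o666, 0o022), effectiveMode(0o777, 0o022))
}
```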

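II. Using runc

From the analysis so far, runc does not prescribe what kind of mount the rootfs mountpoint must be; it only bind-mounts it, and it leaves us free to decide how rootfs is mounted. At this point it seems feasible to supply a mount from outside and have it end up at the root. Let's try it and see what containers runc builds under different configurations.

1. Creating and entering a container with runc

First run runc spec && cat config.json. This generates a default configuration, which follows the OCI runtime spec: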

{
	"ociVersion": "1.0.2-dev",
	"process": {
		"terminal": true,
		"user": {
			"uid": 0,
			"gid": 0
		},
		"args": [
			"sh"
		],
		"env": [
			"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
			"TERM=xterm"
		],
		"cwd": "/",
		"capabilities": {
			"bounding": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE"
			],
			"effective": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE"
			],
			"permitted": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE"
			],
			"ambient": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE"
			]
		},
		"rlimits": [
			{
				"type": "RLIMIT_NOFILE",
				"hard": 1024,
				"soft": 1024
			}
		],
		"noNewPrivileges": true
	},
	"root": {
		"path": "rootfs",
		"readonly": true
	},
	"hostname": "runc",
	"mounts": [
		{
			"destination": "/proc",
			"type": "proc",
			"source": "proc"
		},
		{
			"destination": "/dev",
			"type": "tmpfs",
			"source": "tmpfs",
			"options": [
				"nosuid",
				"strictatime",
				"mode=755",
				"size=65536k"
			]
		},
		{
			"destination": "/dev/pts",
			"type": "devpts",
			"source": "devpts",
			"options": [
				"nosuid",
				"noexec",
				"newinstance",
				"ptmxmode=0666",
				"mode=0620",
				"gid=5"
			]
		},
		{
			"destination": "/dev/shm",
			"type": "tmpfs",
			"source": "shm",
			"options": [
				"nosuid",
				"noexec",
				"nodev",
				"mode=1777",
				"size=65536k"
			]
		},
		{
			"destination": "/dev/mqueue",
			"type": "mqueue",
			"source": "mqueue",
			"options": [
				"nosuid",
				"noexec",
				"nodev"
			]
		},
		{
			"destination": "/sys",
			"type": "sysfs",
			"source": "sysfs",
			"options": [
				"nosuid",
				"noexec",
				"nodev",
				"ro"
			]
		},
		{
			"destination": "/sys/fs/cgroup",
			"type": "cgroup",
			"source": "cgroup",
			"options": [
				"nosuid",
				"noexec",
				"nodev",
				"relatime",
				"ro"
			]
		}
	],
	"linux": {
		"resources": {
			"devices": [
				{
					"allow": false,
					"access": "rwm"
				}
			]
		},
		"namespaces": [
			{
				"type": "pid"
			},
			{
				"type": "network"
			},
			{
				"type": "ipc"
			},
			{
				"type": "uts"
			},
			{
				"type": "mount"
			}
		],
		"maskedPaths": [
			"/proc/acpi",
			"/proc/asound",
			"/proc/kcore",
			"/proc/keys",
			"/proc/latency_stats",
			"/proc/timer_list",
			"/proc/timer_stats",
			"/proc/sched_debug",
			"/sys/firmware",
			"/proc/scsi"
		],
		"readonlyPaths": [
			"/proc/bus",
			"/proc/fs",
			"/proc/irq",
			"/proc/sys",
			"/proc/sysrq-trigger"
		]
	}
}

The default rootfs directory does not contain a root filesystem (in fact the directory does not exist yet), so running it directly fails:

> runc run config.json
ERRO[0000] runc run failed: invalid rootfs: stat /root/rootfs: no such file or directory

I used docker export to obtain a busybox root filesystem, placed it under /root/rootfs, and changed root.path in the configuration to /root/rootfs:

> ls
VERSION  bin  custom  dev  etc  home  json  lib  lib64  proc  root  tmp  usr  var
> pwd
/root/rootfs

Start the container with runc:

> runc run config.json

/ # ls
VERSION  bin      custom   dev      etc      home     json     lib      lib64    proc     root     sys      tmp      usr      var

/ # echo $$
1

/ # ps -ef
PID   USER     TIME  COMMAND
    1 root      0:00 sh
    8 root      0:00 ps -ef
    
/ # mount
/dev/vda1 on / type ext4 (ro,noatime)
...

/ # echo something > test.log
sh: can't create test.log: Read-only file system

As configured, rootfs is read-only, which corresponds to the second block of finalizeRootfs discussed in the first section:

	"root": {
		"path": "rootfs",
		"readonly": true
	},
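
For reference, this is roughly how the root section could be read from Go. The structs below cover only the two fields shown above and are a hypothetical reduction, not the real OCI spec types. resolveRootfs reflects the OCI rule that a relative root.path is interpreted relative to the bundle directory, which is why the earlier error message mentioned /root/rootfs.

```go
package main

import (
	"encoding/json"
	"fmt"
	"path"
)

// ociRoot and ociSpec are hypothetical, minimal views of config.json.
type ociRoot struct {
	Path     string `json:"path"`
	Readonly bool   `json:"readonly"`
}

type ociSpec struct {
	Root ociRoot `json:"root"`
}

// parseRoot extracts the root section from a config.json document.
func parseRoot(configJSON string) (ociRoot, error) {
	var spec ociSpec
	if err := json.Unmarshal([]byte(configJSON), &spec); err != nil {
		return ociRoot{}, err
	}
	return spec.Root, nil
}

// resolveRootfs turns root.path into an absolute path, treating relative
// paths as relative to the bundle directory.
func resolveRootfs(bundle, rootPath string) string {
	if path.IsAbs(rootPath) {
		return path.Clean(rootPath)
	}
	return path.Join(bundle, rootPath)
}

func main() {
	root, err := parseRoot(`{"root": {"path": "rootfs", "readonly": true}}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(resolveRootfs("/root", root.Path), root.Readonly)
}
```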

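2. Trying to break out of a chroot

Now let's try to break out of the less secure chroot mode. First edit the generated configuration to grant additional capabilities, otherwise commands such as chroot will not run. Next, remove mount (that is, NEWNS) from namespaces; as the source analysis showed, with NEWNS present runc uses pivot_root. Finally, remove maskedPaths and readonlyPaths, otherwise the configuration does not pass runc's security checks: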

{
	"ociVersion": "1.0.2-dev",
	"process": {
		"terminal": true,
		"user": {
			"uid": 0,
			"gid": 0
		},
		"args": [
			"bash"
		],
		"env": [
			"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
			"TERM=xterm"
		],
		"cwd": "/",
		"capabilities": {
			"bounding": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE",
				"CAP_SYS_CHROOT",
				"CAP_MKNOD",
				"CAP_SYS_ADMIN"
			],
			"effective": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE",
				"CAP_SYS_CHROOT",
				"CAP_MKNOD",
				"CAP_SYS_ADMIN"
			],
			"permitted": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE",
				"CAP_SYS_CHROOT",
				"CAP_MKNOD",
				"CAP_SYS_ADMIN"
			],
			"ambient": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE",
				"CAP_SYS_CHROOT",
				"CAP_MKNOD",
				"CAP_SYS_ADMIN"
			]
		},
		"rlimits": [
			{
				"type": "RLIMIT_NOFILE",
				"hard": 1024,
				"soft": 1024
			}
		],
		"noNewPrivileges": true
	},
	"root": {
		"path": "/root/ubuntu"
	},
	"hostname": "runc",
	"mounts": [
		{
			"destination": "/proc",
			"type": "proc",
			"source": "proc"
		},
		{
			"destination": "/dev",
			"type": "tmpfs",
			"source": "tmpfs",
			"options": [
				"nosuid",
				"strictatime",
				"mode=755",
				"size=65536k"
			]
		},
		{
			"destination": "/dev/pts",
			"type": "devpts",
			"source": "devpts",
			"options": [
				"nosuid",
				"noexec",
				"newinstance",
				"ptmxmode=0666",
				"mode=0620",
				"gid=5"
			]
		},
		{
			"destination": "/dev/shm",
			"type": "tmpfs",
			"source": "shm",
			"options": [
				"nosuid",
				"noexec",
				"nodev",
				"mode=1777",
				"size=65536k"
			]
		},
		{
			"destination": "/dev/mqueue",
			"type": "mqueue",
			"source": "mqueue",
			"options": [
				"nosuid",
				"noexec",
				"nodev"
			]
		},
		{
			"destination": "/sys",
			"type": "sysfs",
			"source": "sysfs",
			"options": [
				"nosuid",
				"noexec",
				"nodev",
				"ro"
			]
		},
		{
			"destination": "/sys/fs/cgroup",
			"type": "cgroup",
			"source": "cgroup",
			"options": [
				"nosuid",
				"noexec",
				"nodev",
				"relatime",
				"ro"
			]
		}
	],
	"linux": {
		"resources": {
			"devices": [
				{
					"allow": false,
					"access": "rwm"
				}
			]
		},
		"namespaces": [
			{
				"type": "pid"
			},
			{
				"type": "network"
			},
			{
				"type": "ipc"
			},
			{
				"type": "uts"
			}
		]
	}
}
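
Editing the JSON by hand works, but the three changes are mechanical enough to script. The sketch below is one possible way to do it, for throwaway experiments only, since it deliberately weakens isolation: it appends extra capabilities to every capability set, drops the mount namespace, and removes maskedPaths and readonlyPaths. relaxSpec, specView, and inspect are hypothetical helpers; the other differences in the config above (args and root.path) are not handled.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// relaxSpec applies the three edits to an OCI config document.
func relaxSpec(specJSON []byte, extraCaps []string) ([]byte, error) {
	var spec map[string]any
	if err := json.Unmarshal(specJSON, &spec); err != nil {
		return nil, err
	}
	// 1. Append the extra capabilities to every capability set.
	if process, ok := spec["process"].(map[string]any); ok {
		if caps, ok := process["capabilities"].(map[string]any); ok {
			for set, v := range caps {
				list, _ := v.([]any)
				for _, c := range extraCaps {
					list = append(list, c)
				}
				caps[set] = list
			}
		}
	}
	if linux, ok := spec["linux"].(map[string]any); ok {
		// 2. Remove maskedPaths and readonlyPaths.
		delete(linux, "maskedPaths")
		delete(linux, "readonlyPaths")
		// 3. Drop the mount namespace, keeping all others.
		if namespaces, ok := linux["namespaces"].([]any); ok {
			kept := make([]any, 0, len(namespaces))
			for _, ns := range namespaces {
				if m, ok := ns.(map[string]any); ok && m["type"] == "mount" {
					continue
				}
				kept = append(kept, ns)
			}
			linux["namespaces"] = kept
		}
	}
	return json.MarshalIndent(spec, "", "\t")
}

// specView decodes only the fields that relaxSpec touches.
type specView struct {
	Process struct {
		Capabilities map[string][]string `json:"capabilities"`
	} `json:"process"`
	Linux struct {
		Namespaces []struct {
			Type string `json:"type"`
		} `json:"namespaces"`
		MaskedPaths   []string `json:"maskedPaths"`
		ReadonlyPaths []string `json:"readonlyPaths"`
	} `json:"linux"`
}

// inspect parses a config document into a specView.
func inspect(specJSON []byte) (specView, error) {
	var v specView
	err := json.Unmarshal(specJSON, &v)
	return v, err
}

func main() {
	in := []byte(`{"process":{"capabilities":{"bounding":["CAP_KILL"]}},
		"linux":{"namespaces":[{"type":"pid"},{"type":"mount"}],"maskedPaths":["/proc/kcore"]}}`)
	out, err := relaxSpec(in, []string{"CAP_SYS_CHROOT", "CAP_MKNOD", "CAP_SYS_ADMIN"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```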

Inside the container, we run the code from a chroot breakout tutorial and successfully break out:

// Enter the container
[root@master ~]# runc run config.json

root@runc:/# ls -la
total 104
drwxr-xr-x  21 root root  4096 Feb 16 13:56 .
drwxr-xr-x  21 root root  4096 Feb 16 13:56 ..
-rwxr-xr-x   1 root root     0 Feb 14 03:20 .dockerenv
lrwxrwxrwx   1 root root     7 Jan 26 02:03 bin -> usr/bin
drwxr-xr-x   2 root root  4096 Apr 18  2022 boot
-rwxr-xr-x   1 root root 29160 Feb 16 10:22 break
drwxr-xr-x   2 root root  4096 Feb 16 14:00 d1r1
drwxr-xr-x   2 root root  4096 Feb 16 14:00 d1r2
drwxr-xr-x   2 root root  4096 Feb 16 14:01 d1r3
drwxr-xr-x   5 root root   360 Feb 16 14:40 dev
drwxr-xr-x  32 root root  4096 Feb 14 03:20 etc
drwxr-xr-x   2 root root  4096 Apr 18  2022 home
lrwxrwxrwx   1 root root     7 Jan 26 02:03 lib -> usr/lib
lrwxrwxrwx   1 root root     9 Jan 26 02:03 lib32 -> usr/lib32
lrwxrwxrwx   1 root root     9 Jan 26 02:03 lib64 -> usr/lib64
lrwxrwxrwx   1 root root    10 Jan 26 02:03 libx32 -> usr/libx32
drwxr-xr-x   2 root root  4096 Jan 26 02:03 media
drwxr-xr-x   2 root root  4096 Jan 26 02:03 mnt
drwxr-xr-x   2 root root  4096 Jan 26 02:03 opt
dr-xr-xr-x 375 root root     0 Feb 16 14:40 proc
drwx------   2 root root  4096 Feb 16 10:23 root
drwxr-xr-x   6 root root  4096 Feb 14 03:20 run
lrwxrwxrwx   1 root root     8 Jan 26 02:03 sbin -> usr/sbin
drwxr-xr-x   2 root root  4096 Jan 26 02:03 srv
dr-xr-xr-x  12 root root     0 Feb 16 14:40 sys
drwxrwxrwt   2 root root  4096 Jan 26 02:06 tmp
drwxr-xr-x  14 root root  4096 Jan 26 02:03 usr
drwxr-xr-x  11 root root  4096 Jan 26 02:06 var
drwxr-xr-x   2 root root  4096 Feb 16 10:22 waterbuffalo

// Essentially chdir(..) + chroot(.)
root@runc:/# ./break

// Breakout succeeded
[root@runc /]# ls -la
total 18920
dr-xr-xr-x  23 root root     4096 Feb 16 22:39 .
dr-xr-xr-x  23 root root     4096 Feb 16 22:39 ..
drwxr-xr-x   3 root root     4096 Jan  8  2021 agent
drwxr-xr-x   3 root root     4096 Nov 24 16:03 api-helm
-rw-r--r--   1 root root        0 Oct 30  2020 .autorelabel
lrwxrwxrwx   1 root root        7 Dec 14  2020 bin -> usr/bin
dr-xr-xr-x   5 root root     4096 Nov 30  2021 boot
-rwxr-xr-x   1 root root 19261816 Dec  8 14:30 cloud-agent
-rw-------   1 root root    12288 Nov 25 11:27 .conf.txt.swp
drwxr-xr-x  13 root root     4096 Feb 16 22:39 data
drwxr-xr-x  17 root root    14140 Sep  9 16:20 dev
drwxr-xr-x 108 root root    12288 Feb 16 10:59 etc
drwxr-xr-x   2 root root     4096 Dec 14  2020 home
lrwxrwxrwx   1 root root        7 Dec 14  2020 lib -> usr/lib
lrwxrwxrwx   1 root root        9 Dec 14  2020 lib64 -> usr/lib64
drwx------   2 root root    16384 Aug 18  2020 lost+found
drwxr-xr-x   2 root root     4096 Dec 14  2020 media
drwxr-xr-x   2 root root     4096 Dec 14  2020 mnt
drwxr-xr-x   6 root root     4096 Sep  9 16:31 opt
dr-xr-xr-x 375 root root        0 Sep  9 16:19 proc
dr-xr-x---  31 root root     4096 Feb 16 22:34 root
drwxr-xr-x   2 root root     4096 Dec  5 11:40 rot
drwxr-xr-x  33 root root     1120 Feb 16 14:59 run
lrwxrwxrwx   1 root root        8 Dec 14  2020 sbin -> usr/sbin
drwxr-xr-x   2 root root     4096 Dec 14  2020 srv
dr-xr-xr-x  12 root root        0 Sep  9 16:19 sys
drwxr-xr-x   2 root root     4096 Dec  5 11:33 test
drwxrwxrwt   4 root root     4096 Feb 16 22:35 tmp
drwxr-xr-x  12 root root     4096 Nov 30  2021 usr
drwxr-xr-x  21 root root     4096 Sep  9 15:53 var
[root@runc /]#

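That wraps up this part. We now have at least a rough picture of how a container is initialized, and we know that the container's rootfs can point at any directory we choose. The question from the beginning is still unanswered, though. Neither Docker nor K8s actually lets you do the following: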

> docker run -it -d --name xxx -p 8091:8090 -v /xxx:/ ubuntu
docker: Error response from daemon: invalid volume specification: '/xxx:/': invalid mount config for type "bind": invalid specification: destination can't be '/'.
See 'docker run --help'.

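In the next part, we will revisit the question from the perspective of OCI and CRI.

If you find any mistakes in this article, corrections are welcome.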

Reference

Man pages: https://man7.org/linux/man-pages/
runc: https://github.com/opencontainers
Blog post "Linux: Mount Shared Subtrees": https://pages.dogdog.run/tech/mount_subtree.html
