Can a Container Mount a Volume Directly onto the Root Directory? (Part 1): Starting from runc

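This started when someone in a chat group asked whether, in K8s, a volume can be mounted directly onto the root directory. My first reaction was no: a container uses a union filesystem to mount its contents at the root, and under normal circumstances that cannot be changed. But is that the whole story? When I found I couldn't give a proper explanation, I realized my understanding of containers was still very superficial.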

I. Starting from the runc source code

So I opened the runc source to see how it does things and whether there is an angle to work from. We first look at container creation: libcontainer/init_linux.go:78

func newContainerInit(t initType, pipe *os.File, consoleSocket *os.File, fifoFd, logFd int, mountFds []int) (initer, error) {
	var config *initConfig
	if err := json.NewDecoder(pipe).Decode(&config); err != nil {
		return nil, err
	}
	if err := populateProcessEnvironment(config.Env); err != nil {
		return nil, err
	}
	switch t {
	case initSetns:
		// mountFds must be nil in this case. We don't mount while doing runc exec.
		if mountFds != nil {
			return nil, errors.New("mountFds must be nil; can't mount from exec")
		}

		return &linuxSetnsInit{
		}, nil
	case initStandard:
		return &linuxStandardInit{
		}, nil
	}
	return nil, fmt.Errorf("unknown init type %q", t)
}

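What happens here is fairly simple: it reads the init config from the pipe, parses the env injected in that config, and sets it on the current process. There are two init types. initSetns joins the namespaces of an already existing container (this is the runc exec path, as the comment in the code notes). initStandard starts a standard container.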
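The flow just described (decode a JSON config from a reader, parse the env, dispatch on the init type) can be sketched as follows. This is only an illustration: the trimmed initConfig struct, the string init types, and the helper names parseEnv, describeInit and demo are assumptions made for the example, not runc's actual definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// initConfig is a hypothetical, heavily trimmed stand-in for runc's initConfig.
type initConfig struct {
	Env []string `json:"env"`
}

// parseEnv splits KEY=VALUE entries; a real init would call os.Setenv here.
func parseEnv(env []string) (map[string]string, error) {
	out := make(map[string]string, len(env))
	for _, kv := range env {
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) != 2 || parts[0] == "" {
			return nil, fmt.Errorf("invalid environment entry %q", kv)
		}
		out[parts[0]] = parts[1]
	}
	return out, nil
}

// describeInit decodes the config from r (the pipe, in runc's case) and
// reports which init path the given type selects.
func describeInit(initType string, r io.Reader) (string, error) {
	var cfg initConfig
	if err := json.NewDecoder(r).Decode(&cfg); err != nil {
		return "", err
	}
	env, err := parseEnv(cfg.Env)
	if err != nil {
		return "", err
	}
	switch initType {
	case "setns":
		return fmt.Sprintf("setns init (runc exec), %d env vars", len(env)), nil
	case "standard":
		return fmt.Sprintf("standard init (new container), %d env vars", len(env)), nil
	}
	return "", fmt.Errorf("unknown init type %q", initType)
}

// demo feeds a small JSON payload through an in-memory reader.
func demo(initType string) (string, error) {
	payload := `{"env":["PATH=/bin","TERM=xterm"]}`
	return describeInit(initType, strings.NewReader(payload))
}

func main() {
	fmt.Println(demo("standard"))
	fmt.Println(demo("bogus"))
}
```

The reader stands in for the pipe; anything that implements io.Reader works, which is why the sketch can be exercised without a real parent process.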

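Within initStandard, the call most closely tied to rootfs is err := prepareRootfs(l.pipe, l.config, l.mountFds). Before prepareRootfs, it mainly initializes networking, such as the lo interface and routes. We care about the rootfs part, though. From the comment we can see that it sets up devices, mount points, and filesystems, and that it reminds you to call finalizeRootfs afterwards to finish the job. Taking prepareRootfs as the core, let's go through what happens in it step by step: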

// prepareRootfs sets up the devices, mount points, and filesystems for use
// inside a new mount namespace. It doesn't set anything as ro. You must call
// finalizeRootfs after this function to finish setting up the rootfs.
func prepareRootfs(pipe io.ReadWriter, iConfig *initConfig, mountFds []int) (err error) {
	config := iConfig.Config
	if err := prepareRoot(config); err != nil {
		return fmt.Errorf("error preparing rootfs: %w", err)
	}

	if mountFds != nil && len(mountFds) != len(config.Mounts) {
		return fmt.Errorf("malformed mountFds slice. Expected size: %v, got: %v. Slice: %v", len(config.Mounts), len(mountFds), mountFds)
	}

	mountConfig := &mountConfig{
		root:            config.Rootfs,
		label:           config.MountLabel,
		cgroup2Path:     iConfig.Cgroup2Path,
		rootlessCgroups: iConfig.RootlessCgroups,
		cgroupns:        config.Namespaces.Contains(configs.NEWCGROUP),
	}
	setupDev := needsSetupDev(config)
	for i, m := range config.Mounts {
		// Just before the loop we checked that if not empty, len(mountFds) == len(config.Mounts).
		// Therefore, we can access mountFds[i] without any concerns.
		if mountFds != nil && mountFds[i] != -1 {
			mountConfig.fd = &mountFds[i]
		} else {
			mountConfig.fd = nil
		}

		if err := mountToRootfs(m, mountConfig); err != nil {
			return fmt.Errorf("error mounting %q to rootfs at %q: %w", m.Source, m.Destination, err)
		}
	}

	if setupDev {
		if err := createDevices(config); err != nil {
			return fmt.Errorf("error creating device nodes: %w", err)
		}
		if err := setupPtmx(config); err != nil {
			return fmt.Errorf("error setting up ptmx: %w", err)
		}
		if err := setupDevSymlinks(config.Rootfs); err != nil {
			return fmt.Errorf("error setting up /dev symlinks: %w", err)
		}
	}

	// Signal the parent to run the pre-start hooks.
	// The hooks are run after the mounts are setup, but before we switch to the new
	// root, so that the old root is still available in the hooks for any mount
	// manipulations.
	// Note that iConfig.Cwd is not guaranteed to exist here.
	if err := syncParentHooks(pipe); err != nil {
		return err
	}

	// The reason these operations are done here rather than in finalizeRootfs
	// is because the console-handling code gets quite sticky if we have to set
	// up the console before doing the pivot_root(2). This is because the
	// Console API has to also work with the ExecIn case, which means that the
	// API must be able to deal with being inside as well as outside the
	// container. It's just cleaner to do this here (at the expense of the
	// operation not being perfectly split).

	if err := unix.Chdir(config.Rootfs); err != nil {
		return &os.PathError{Op: "chdir", Path: config.Rootfs, Err: err}
	}

	s := iConfig.SpecState
	s.Pid = unix.Getpid()
	s.Status = specs.StateCreating
	if err := iConfig.Config.Hooks[configs.CreateContainer].RunHooks(s); err != nil {
		return err
	}

	if config.NoPivotRoot {
		err = msMoveRoot(config.Rootfs)
	} else if config.Namespaces.Contains(configs.NEWNS) {
		err = pivotRoot(config.Rootfs)
	} else {
		err = chroot()
	}
	if err != nil {
		return fmt.Errorf("error jailing process inside rootfs: %w", err)
	}

	if setupDev {
		if err := reOpenDevNull(); err != nil {
			return fmt.Errorf("error reopening /dev/null inside container: %w", err)
		}
	}

	if cwd := iConfig.Cwd; cwd != "" {
		// Note that spec.Process.Cwd can contain unclean value like  "../../../../foo/bar...".
		// However, we are safe to call MkDirAll directly because we are in the jail here.
		if err := os.MkdirAll(cwd, 0o755); err != nil {
			return err
		}
	}

	return nil
}

1. prepareRoot

1.1 RootPropagation

func prepareRoot(config *configs.Config) error {
   flag := unix.MS_SLAVE | unix.MS_REC
   if config.RootPropagation != 0 {
      flag = config.RootPropagation
   }
   if err := mount("", "/", "", "", uintptr(flag), ""); err != nil {
      return err
   }

   // Make parent mount private to make sure following bind mount does
   // not propagate in other namespaces. Also it will help with kernel
   // check pass in pivot_root. (IS_SHARED(new_mnt->mnt_parent))
   if err := rootfsParentMountPrivate(config.Rootfs); err != nil {
      return err
   }

   return mount(config.Rootfs, config.Rootfs, "", "bind", unix.MS_BIND|unix.MS_REC, "")
}

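At the very start of prepareRoot there is a mount call, which is really a recursive (unix.MS_REC) change of propagation type. By default the flag is unix.MS_SLAVE. According to the Linux man page, this flag makes mount events propagate one way only: events from the master reach the slave, while mounts made under the slave do not affect the master. Because the target here is "/" and the recursive flag is set, no mount operation in this namespace affects the outside, whereas the reverse direction (more precisely, events coming from the master peer group) still propagates in.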

Let's simulate this: mount a tmpfs and set its propagation type to shared:

mount -t tmpfs myt /root/dir1 --make-shared

Check the propagation type with findmnt -o TARGET,PROPAGATION:

|-/var/lib/kubelet/pods/6c4a58a7-557f-4cc8-b95f-4170c6ac2ab8/volume-subpaths/dashboard-manager-secret/customer-dashboard-manager/2 private
|-/root/dir1 shared
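The same information is available in /proc/self/mountinfo, where the optional fields (shared:N, master:N) between the mount options and the "-" separator encode the propagation state. Below is a minimal sketch of turning one such line into a label like the ones printed above. The sample lines and the function name propagation are made up for illustration, and the sketch ignores the unbindable state.

```go
package main

import (
	"fmt"
	"strings"
)

// propagation extracts the mount point and a findmnt-like propagation label
// from a single /proc/self/mountinfo line. Fields 0-5 are fixed; optional
// fields start at index 6 and run until the "-" separator.
func propagation(line string) (mountpoint, prop string) {
	fields := strings.Fields(line)
	if len(fields) < 7 {
		return "", ""
	}
	mountpoint = fields[4]
	prop = "private"
	slave := false
	for _, f := range fields[6:] {
		if f == "-" {
			break
		}
		if strings.HasPrefix(f, "shared:") {
			prop = "shared"
		}
		if strings.HasPrefix(f, "master:") {
			slave = true
		}
	}
	if slave {
		prop += ",slave"
	}
	return mountpoint, prop
}

func main() {
	// Hypothetical mountinfo lines, not taken from a real host.
	lines := []string{
		"100 25 0:45 / /root/dir1 rw,relatime shared:50 - tmpfs myt rw",
		"100 25 0:45 / /root/dir1 rw,relatime master:50 - tmpfs myt rw",
		"101 25 253:1 /x /var/lib/y rw,relatime - ext4 /dev/vda1 rw",
	}
	for _, l := range lines {
		mp, p := propagation(l)
		fmt.Println(mp, p)
	}
}
```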

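I simulated runc by cloning a new namespace and checked the propagation types again; the result was the same as above. After running mount --make-rslave / and checking once more, the shared mount has become slave, while the mount that was private stays unchanged: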

|-/var/lib/kubelet/pods/6c4a58a7-557f-4cc8-b95f-4170c6ac2ab8/volume-subpaths/dashboard-manager-secret/customer-dashboard-manager/2 private
|-/root/dir1 private,slave

This matches the man page description: mounts that are not shared are not changed by this command:

MS_SLAVE
              If this is a shared mount that is a member of a peer group
              that contains other members, convert it to a slave mount.
              If this is a shared mount that is a member of a peer group
              that contains no other members, convert it to a private
              mount.  Otherwise, the propagation type of the mount is
              left unchanged.
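The three cases in the man page text can be written down as a small state transition. This is just a model of the quoted rule, with made-up type and function names; it does not call mount(2).

```go
package main

import "fmt"

// propType is a hypothetical label for a mount's propagation type.
type propType string

const (
	private propType = "private"
	shared  propType = "shared"
	slave   propType = "slave"
)

// makeSlave models what MS_SLAVE does to one mount, following the quoted
// man page rule. otherPeers is the number of other members in its peer group.
func makeSlave(cur propType, otherPeers int) propType {
	if cur != shared {
		return cur // not shared: left unchanged
	}
	if otherPeers > 0 {
		return slave // shared with other peers: becomes a slave
	}
	return private // shared but alone in its group: becomes private
}

func main() {
	// /root/dir1 still has a peer in the host namespace after the clone.
	fmt.Println(makeSlave(shared, 1))
	fmt.Println(makeSlave(shared, 0))
	fmt.Println(makeSlave(private, 0))
}
```

In the experiment above, /root/dir1 was shared and its peer in the host namespace still existed, so it falls into the first case, while the kubelet subpath mount was private and falls into the last.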

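We can also see that an opening is left here: config.RootPropagation can override the default. Docker's default is rprivate, meaning mounts in either direction do not affect each other. K8s also defaults to private, and the per-volume mountPropagation setting (stable since K8s 1.12) lets you change it: HostToContainer is effectively MS_SLAVE, and Bidirectional is MS_SHARED, meaning mounts made in this namespace are shared with the outside. This opening is flexible but dangerous; for example, it lets a container mount and unmount devices.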
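As a quick reference, the K8s modes mentioned above can be lined up with the recursive mount flags they correspond to. The flag values below mirror the Linux MS_* constants but are defined locally so the sketch stays self-contained; the function name propagationFlags is hypothetical, and how a given runtime actually applies these flags is outside what this article has verified.

```go
package main

import "fmt"

// Local copies of the Linux mount flag values (see mount(2)).
const (
	msRec     uintptr = 1 << 14 // MS_REC
	msPrivate uintptr = 1 << 18 // MS_PRIVATE
	msSlave   uintptr = 1 << 19 // MS_SLAVE
	msShared  uintptr = 1 << 20 // MS_SHARED
)

// propagationFlags maps a Kubernetes mountPropagation mode to the
// recursive propagation flags it corresponds to.
func propagationFlags(mode string) (uintptr, error) {
	switch mode {
	case "None", "":
		return msPrivate | msRec, nil // rprivate
	case "HostToContainer":
		return msSlave | msRec, nil // rslave
	case "Bidirectional":
		return msShared | msRec, nil // rshared
	}
	return 0, fmt.Errorf("unknown mountPropagation mode %q", mode)
}

func main() {
	for _, m := range []string{"None", "HostToContainer", "Bidirectional"} {
		f, _ := propagationFlags(m)
		fmt.Printf("%-16s %#x\n", m, f)
	}
}
```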

1.2 rootfsParentMountPrivate

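The comments here are thorough. The function checks whether the parent mount of the directory that will become root is shared and, if so, changes it to private. In other words, the container requires this mount to be private no matter what, even if we set RootPropagation to shared or something else.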

The intent seems reasonable: if rootfs could be affected by propagation at will, the container could easily break. (I'm not sure this guess is right, though.)

The comment also mentions that making it private keeps the subsequent bind mount from propagating into other namespaces, and that pivot_root does not allow this mount to be shared either.
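The check described here boils down to two steps: find the mount that contains the rootfs path (the longest matching mount point), then see whether that mount is shared. A minimal sketch over already parsed mount entries follows. The mountEntry type, the sample data and the function names are assumptions for illustration; runc reads real mountinfo and then remounts the parent with MS_PRIVATE, which this sketch does not do.

```go
package main

import (
	"fmt"
	"strings"
)

// mountEntry is a hypothetical, reduced view of one mountinfo record.
type mountEntry struct {
	Mountpoint string
	Optional   string // e.g. "shared:1", "master:2", or empty
}

// isUnder reports whether path lives at or below the mount point mp.
func isUnder(path, mp string) bool {
	if mp == "/" {
		return true
	}
	return path == mp || strings.HasPrefix(path, mp+"/")
}

// parentMount returns the mount with the longest mount point containing path.
func parentMount(path string, mounts []mountEntry) (mountEntry, bool) {
	var best mountEntry
	found := false
	for _, m := range mounts {
		if !isUnder(path, m.Mountpoint) {
			continue
		}
		if !found || len(m.Mountpoint) > len(best.Mountpoint) {
			best, found = m, true
		}
	}
	return best, found
}

// needsPrivate reports whether the mount containing path is shared and
// would therefore have to be made private first.
func needsPrivate(path string, mounts []mountEntry) bool {
	m, ok := parentMount(path, mounts)
	return ok && strings.Contains(m.Optional, "shared:")
}

// sampleMounts is made-up data standing in for a parsed mountinfo.
func sampleMounts() []mountEntry {
	return []mountEntry{
		{Mountpoint: "/", Optional: "shared:1"},
		{Mountpoint: "/var/lib", Optional: ""},
	}
}

func main() {
	fmt.Println(needsPrivate("/root/rootfs", sampleMounts()))
	fmt.Println(needsPrivate("/var/lib/containers/rootfs", sampleMounts()))
}
```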

1.3 bind

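A bind mount looks a bit like a hard link, but the underlying implementation is completely different. The man page describes bind as attaching an existing filesystem subtree at another location, whereas hard and soft links work through inodes.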

mount(config.Rootfs, config.Rootfs, "", "bind", unix.MS_BIND|unix.MS_REC, "")

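The REC flag has the same meaning as in the propagation case above: recursive. The man page says that without REC, bind only mounts the directory itself, and the submounts underneath are not replicated. Notice that it binds the rootfs directory onto itself; this is done to create a mountpoint. That mountpoint is the mount of the container's root directory, for example: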

➜  ~ mount
/dev/disk3s3s1 on / (apfs, sealed, local, read-only, journaled)

2. mountToRootfs

After the rootfs mountpoint has been bind-mounted, each mount listed in config.Mounts is created in turn. This is where our volume mounts get handled:

func mountToRootfs(m *configs.Mount, c *mountConfig) error {
	rootfs := c.root
	mountLabel := c.label
	mountFd := c.fd
	dest, err := securejoin.SecureJoin(rootfs, m.Destination)
	if err != nil {
		return err
	}

	switch m.Device {
		case "proc", "sysfs": ...
		case "mqueue": ...
		case "tmpfs": ...
		case "bind": ...
		case "cgroup": ...
		default: ...
	}
	if err := setRecAttr(m, rootfs); err != nil {
		return err
	}
	return nil
}

The overall flow is easy to follow: each type has its own mount procedure. For the final setRecAttr, see mount_setattr(2) if you're interested.

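Take proc/sysfs as an example: it checks dest to make sure that, if it already exists, it is a directory and not a symlink. The comment mentions the interesting symlink-exchange attacks. If you're curious, have a look at "mounts outside", which covers mount escapes via symlinks in great detail (honestly I only skimmed it).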

	case "proc", "sysfs":
		// If the destination already exists and is not a directory, we bail
		// out This is to avoid mounting through a symlink or similar -- which
		// has been a "fun" attack scenario in the past.
		// TODO: This won't be necessary once we switch to libpathrs and we can
		//       stop all of these symlink-exchange attacks.
		if fi, err := os.Lstat(dest); err != nil {
			if !os.IsNotExist(err) {
				return err
			}
		} else if fi.Mode()&os.ModeDir == 0 {
			return fmt.Errorf("filesystem %q must be mounted on ordinary directory", m.Device)
		}
		if err := os.MkdirAll(dest, 0o755); err != nil {
			return err
		}
		// Selinux kernels do not support labeling of /proc or /sys
		return mountPropagate(m, rootfs, "", nil)
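The effect of using Lstat rather than Stat is easy to try out without any privileges. The sketch below repeats the same check against a temporary directory and a symlink pointing to it; checkMountTarget and demo are names invented for this example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// checkMountTarget mirrors the check above: a missing path is fine (it will
// be created later), an existing one must be a real directory. Lstat does
// not follow symlinks, so a symlink to a directory is rejected.
func checkMountTarget(dest string) error {
	fi, err := os.Lstat(dest)
	if err != nil {
		if os.IsNotExist(err) {
			return nil
		}
		return err
	}
	if !fi.IsDir() {
		return fmt.Errorf("%s must be an ordinary directory", dest)
	}
	return nil
}

// demo builds a directory and a symlink to it in a temp dir and reports
// whether the directory is accepted and the symlink rejected.
func demo() (dirOK, linkRejected bool, err error) {
	tmp, err := os.MkdirTemp("", "mount-target")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(tmp)

	dir := filepath.Join(tmp, "proc")
	if err := os.Mkdir(dir, 0o755); err != nil {
		return false, false, err
	}
	link := filepath.Join(tmp, "sys")
	if err := os.Symlink(dir, link); err != nil {
		return false, false, err
	}
	return checkMountTarget(dir) == nil, checkMountTarget(link) != nil, nil
}

func main() {
	fmt.Println(demo())
}
```

Note that this only shows the check itself. As the TODO in the runc comment says, a check followed by a later mount is still open to a swap in between, which is why the comment points at libpathrs.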

All of these end up calling mountPropagate, runc's safety wrapper around mount that guards against malicious mounts:

// Do the mount operation followed by additional mounts required to take care
// of propagation flags. This will always be scoped inside the container rootfs.
func mountPropagate(m *configs.Mount, rootfs string, mountLabel string, mountFd *int) error {}

The other mount types are much the same: they all revolve around safety, running various checks on the mount before performing it.

3. setupDev

	if setupDev {
		if err := createDevices(config); err != nil {
			return fmt.Errorf("error creating device nodes: %w", err)
		}
		if err := setupPtmx(config); err != nil {
			return fmt.Errorf("error setting up ptmx: %w", err)
		}
		if err := setupDevSymlinks(config.Rootfs); err != nil {
			return fmt.Errorf("error setting up /dev symlinks: %w", err)
		}
	}

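I'll skip most of this, mainly because I'm not that familiar with things like mknod. Readers who know Linux will recognize dev as devices, corresponding to the /dev directory.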

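We know docker can attach devices with --device. createDevices works along similar lines: it makes host devices available in the container directory, either by bind-mounting them or by creating the node with mknod.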

setupPtmx symlinks pts/ptmx into the container's /dev to support ptys. Finally, setupDevSymlinks adds a few conveniences, such as symlinking the standard input/output fds under /dev.
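To make the symlink step concrete, here is a small sketch that creates such links under a temporary rootfs. The list of links is an assumption based on the usual /dev conventions (fd, stdin, stdout, stderr pointing into /proc/self/fd) rather than a copy of runc's table, and linkDevEntries is a hypothetical name.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// devLinks lists {target, link} pairs. Assumed for illustration; the real
// list lives in runc's setupDevSymlinks.
var devLinks = [][2]string{
	{"/proc/self/fd", "/dev/fd"},
	{"/proc/self/fd/0", "/dev/stdin"},
	{"/proc/self/fd/1", "/dev/stdout"},
	{"/proc/self/fd/2", "/dev/stderr"},
}

// linkDevEntries creates the links inside rootfs. The targets are absolute
// paths that only resolve correctly once the process is jailed in rootfs.
func linkDevEntries(rootfs string) error {
	for _, l := range devLinks {
		link := filepath.Join(rootfs, l[1])
		if err := os.Symlink(l[0], link); err != nil && !os.IsExist(err) {
			return fmt.Errorf("symlink %s: %w", l[1], err)
		}
	}
	return nil
}

// demo sets up a throwaway rootfs with an empty dev directory and returns
// where dev/stdin points afterwards.
func demo() (string, error) {
	rootfs, err := os.MkdirTemp("", "rootfs")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(rootfs)
	if err := os.Mkdir(filepath.Join(rootfs, "dev"), 0o755); err != nil {
		return "", err
	}
	if err := linkDevEntries(rootfs); err != nil {
		return "", err
	}
	return os.Readlink(filepath.Join(rootfs, "dev", "stdin"))
}

func main() {
	fmt.Println(demo())
}
```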

4. Simple hooks during container initialization

4.1 syncParentHooks

	// Signal the parent to run the pre-start hooks.
	// The hooks are run after the mounts are setup, but before we switch to the new
	// root, so that the old root is still available in the hooks for any mount
	// manipulations.
	// Note that iConfig.Cwd is not guaranteed to exist here.
	if err := syncParentHooks(pipe); err != nil {
		return err
	}

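This part is unrelated to the main topic but is interesting. We know that in runc the parent process creates the namespaces and the child process initializes the container. A pipe is used here to implement PreStart. This point is right before chroot/pivot_root, so in theory some dangerous operations are possible, but note that the hooks themselves run in the parent process: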

// syncParentHooks sends to the given pipe a JSON payload which indicates that
// the parent should execute pre-start hooks. It then waits for the parent to
// indicate that it is cleared to resume.
func syncParentHooks(pipe io.ReadWriter) error {
	// Tell parent.
	if err := writeSync(pipe, procHooks); err != nil {
		return err
	}

	// Wait for parent to give the all-clear.
	return readSync(pipe, procResume)
}
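The handshake can be reproduced in miniature with an in-memory pipe: the child sends a "run hooks" message and blocks until the parent answers with "resume". The sketch below is a simplified model; the syncMsg struct, the message strings, and the function names are assumptions for the example and do not claim to match runc's wire format.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
)

// syncMsg is a hypothetical stand-in for the JSON payload on the pipe.
type syncMsg struct {
	Type string `json:"type"`
}

const (
	procHooks  = "procHooks"
	procResume = "procResume"
)

// childSyncHooks is the child side: ask the parent to run hooks, then wait.
func childSyncHooks(conn net.Conn) error {
	if err := json.NewEncoder(conn).Encode(syncMsg{Type: procHooks}); err != nil {
		return err
	}
	var reply syncMsg
	if err := json.NewDecoder(conn).Decode(&reply); err != nil {
		return err
	}
	if reply.Type != procResume {
		return fmt.Errorf("unexpected sync message %q", reply.Type)
	}
	return nil
}

// parentServeHooks is the parent side: wait for the request, run the hooks
// in the parent's own context, then let the child continue.
func parentServeHooks(conn net.Conn, runHooks func() error) error {
	var msg syncMsg
	if err := json.NewDecoder(conn).Decode(&msg); err != nil {
		return err
	}
	if msg.Type != procHooks {
		return fmt.Errorf("unexpected sync message %q", msg.Type)
	}
	if err := runHooks(); err != nil {
		return err
	}
	return json.NewEncoder(conn).Encode(syncMsg{Type: procResume})
}

// demo wires both sides together and reports whether the hooks ran.
func demo() (bool, error) {
	parent, child := net.Pipe()
	defer parent.Close()
	defer child.Close()

	ran := false
	done := make(chan error, 1)
	go func() {
		done <- parentServeHooks(parent, func() error {
			ran = true
			return nil
		})
	}()
	if err := childSyncHooks(child); err != nil {
		return false, err
	}
	if err := <-done; err != nil {
		return false, err
	}
	return ran, nil
}

func main() {
	fmt.Println(demo())
}
```

The point of the design is visible even in this toy version: the child never executes the hook itself, it only waits, so the hook sees the parent's view of the filesystem.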

4.2 createContainerHooks

These hooks, by contrast, run in the current process (the container's init process). The code is simple, so I won't dwell on it:

	// The reason these operations are done here rather than in finalizeRootfs
	// is because the console-handling code gets quite sticky if we have to set
	// up the console before doing the pivot_root(2). This is because the
	// Console API has to also work with the ExecIn case, which means that the
	// API must be able to deal with being inside as well as outside the
	// container. It's just cleaner to do this here (at the expense of the
	// operation not being perfectly split).

	if err := unix.Chdir(config.Rootfs); err != nil {
		return &os.PathError{Op: "chdir", Path: config.Rootfs, Err: err}
	}

	s := iConfig.SpecState
	s.Pid = unix.Getpid()
	s.Status = specs.StateCreating
	if err := iConfig.Config.Hooks[configs.CreateContainer].RunHooks(s); err != nil {
		return err
	}

5. msMoveRoot/chroot/pivotRoot

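We know that once inside a container we can only see the container's own directories. That is the work of the three functions above. chroot is probably the most familiar one; this jail technique has been around for many years.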
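Which of the three is used follows directly from the if/else chain at the end of prepareRootfs. A tiny sketch of that decision, with the two inputs reduced to booleans (jailMethod is a made-up helper name):

```go
package main

import "fmt"

// jailMethod mirrors the branch order in prepareRootfs: NoPivotRoot wins,
// then a new mount namespace selects pivot_root, otherwise plain chroot.
func jailMethod(noPivotRoot, hasMountNS bool) string {
	switch {
	case noPivotRoot:
		return "msMoveRoot"
	case hasMountNS:
		return "pivotRoot"
	default:
		return "chroot"
	}
}

func main() {
	fmt.Println(jailMethod(false, true))  // default spec: mount namespace present
	fmt.Println(jailMethod(true, true))   // NoPivotRoot set
	fmt.Println(jailMethod(false, false)) // no mount namespace
}
```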

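In runc, however, chroot is not the preferred choice. chroot was never designed to create a secure, isolated environment, and it has quite a few limitations. The man page definitions already show that pivot_root and chroot work on different principles: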

chroot - run command or interactive shell with special root directory
pivot_root - change the root mount

chroot changes the root dir of a command/shell, whereas pivot_root changes the root mount itself. A well-known way to escape a chroot is to call chroot again inside it. Quoting Wikipedia directly:


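The chroot mechanism is not intended to defend against intentional tampering by privileged (root) users. On most systems, chroot contexts do not stack properly, and chrooted programs with sufficient privileges may perform a second chroot to break out. To mitigate the risk of this security weakness, chrooted programs should relinquish root privileges as soon as practical after chrooting, or other mechanisms, such as FreeBSD jails, should be used instead. Some systems, such as FreeBSD, take precautions to prevent the second-chroot attack.

  • On systems that support device nodes on ordinary filesystems, a chrooted root user can still create device nodes and mount the file systems on them; thus, the chroot mechanism is not intended by itself to block low-level access to system devices by privileged users.

  • At startup, programs expect to find scratch space, configuration files, device nodes and shared libraries at certain preset locations. For a chrooted program to start successfully, the chroot directory must be populated with a minimum set of these files. This makes chroot difficult to use as a general sandboxing mechanism.

  • Only the root user can perform a chroot. This is intended to prevent users from putting a setuid program inside a specially crafted chroot jail (for example, one with fake /etc/passwd and /etc/shadow files) that would fool it into a privilege escalation.

  • The chroot mechanism itself is also not intended to restrict the use of resources such as I/O, bandwidth, disk space or CPU time. Most Unixes are not completely file system-oriented and leave potentially disruptive functionality like networking and process control available to a chrooted program through the system call interface.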


msMoveRoot ultimately calls chroot as well; it is a hardened version of chroot:

func msMoveRoot(rootfs string) error {
	// Before we move the root and chroot we have to mask all "full" sysfs and
	// procfs mounts which exist on the host. This is because while the kernel
	// has protections against mounting procfs if it has masks, when using
	// chroot(2) the *host* procfs mount is still reachable in the mount
	// namespace and the kernel permits procfs mounts inside --no-pivot
	// containers.
	//
	// Users shouldn't be using --no-pivot except in exceptional circumstances,
	// but to avoid such a trivial security flaw we apply a best-effort
	// protection here. The kernel only allows a mount of a pseudo-filesystem
	// like procfs or sysfs if there is a *full* mount (the root of the
	// filesystem is mounted) without any other locked mount points covering a
	// subtree of the mount.
	//
	// So we try to unmount (or mount tmpfs on top of) any mountpoint which is
	// a full mount of either sysfs or procfs (since those are the most
	// concerning filesystems to us).
	mountinfos, err := mountinfo.GetMounts(func(info *mountinfo.Info) (skip, stop bool) {
		// Collect every sysfs and procfs filesystem, except for those which
		// are non-full mounts or are inside the rootfs of the container.
		if info.Root != "/" ||
			(info.FSType != "proc" && info.FSType != "sysfs") ||
			strings.HasPrefix(info.Mountpoint, rootfs) {
			skip = true
		}
		return
	})
	if err != nil {
		return err
	}
	for _, info := range mountinfos {
		p := info.Mountpoint
		// Be sure umount events are not propagated to the host.
		if err := mount("", p, "", "", unix.MS_SLAVE|unix.MS_REC, ""); err != nil {
			if errors.Is(err, unix.ENOENT) {
				// If the mountpoint doesn't exist that means that we've
				// already blasted away some parent directory of the mountpoint
				// and so we don't care about this error.
				continue
			}
			return err
		}
		if err := unmount(p, unix.MNT_DETACH); err != nil {
			if !errors.Is(err, unix.EINVAL) && !errors.Is(err, unix.EPERM) {
				return err
			} else {
				// If we have not privileges for umounting (e.g. rootless), then
				// cover the path.
				if err := mount("tmpfs", p, "", "tmpfs", 0, ""); err != nil {
					return err
				}
			}
		}
	}

	// Move the rootfs on top of "/" in our mount namespace.
	if err := mount(rootfs, "/", "", "", unix.MS_MOVE, ""); err != nil {
		return err
	}
	return chroot()
}

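The code is long, but what it does isn't complicated: it picks out the proc/sysfs mounts in the current namespace that are not under rootfs and unmounts them (or covers them with a tmpfs when unmounting is not permitted).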
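The filter callback is the part that decides which mounts get this treatment, and it is easy to isolate. The sketch below applies the same three conditions to a hand-written list; the mountInfo struct is a reduced, hypothetical version of the library type, and the sample entries are invented.

```go
package main

import (
	"fmt"
	"strings"
)

// mountInfo is a reduced stand-in for the fields the filter looks at.
type mountInfo struct {
	Root       string // root of the mount within its filesystem
	Mountpoint string
	FSType     string
}

// hostPseudoMounts returns the mount points of full proc/sysfs mounts that
// sit outside the container rootfs, i.e. the ones msMoveRoot would mask.
func hostPseudoMounts(mounts []mountInfo, rootfs string) []string {
	var out []string
	for _, m := range mounts {
		if m.Root != "/" {
			continue // not a full mount of the filesystem
		}
		if m.FSType != "proc" && m.FSType != "sysfs" {
			continue
		}
		if strings.HasPrefix(m.Mountpoint, rootfs) {
			continue // belongs to the container
		}
		out = append(out, m.Mountpoint)
	}
	return out
}

// demo runs the filter over made-up mount entries.
func demo() []string {
	mounts := []mountInfo{
		{Root: "/", Mountpoint: "/", FSType: "ext4"},
		{Root: "/", Mountpoint: "/proc", FSType: "proc"},
		{Root: "/", Mountpoint: "/sys", FSType: "sysfs"},
		{Root: "/", Mountpoint: "/root/rootfs/proc", FSType: "proc"},
		{Root: "/bus", Mountpoint: "/mnt/procbus", FSType: "proc"},
	}
	return hostPseudoMounts(mounts, "/root/rootfs")
}

func main() {
	fmt.Println(demo())
}
```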

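Finally it does an MS_MOVE, moving the rootfs mount onto /. My guess is that this guards against the chdir(../) + chroot(/) combo: once the mount has been moved, the original rootfs mount no longer exists. Then it calls chroot. Even so, runc still does not recommend relying on chroot.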

6. finalizeRootfs

That was a quick pass over prepareRootfs. It does what its comment says: set up devices, mount points, and filesystems. finalizeRootfs wraps up what prepareRootfs started.

// finalizeRootfs sets anything to ro if necessary. You must call
// prepareRootfs first.
func finalizeRootfs(config *configs.Config) (err error) {
	// All tmpfs mounts and /dev were previously mounted as rw
	// by mountPropagate. Remount them read-only as requested.
	for _, m := range config.Mounts {
		if m.Flags&unix.MS_RDONLY != unix.MS_RDONLY {
			continue
		}
		if m.Device == "tmpfs" || utils.CleanPath(m.Destination) == "/dev" {
			if err := remountReadonly(m); err != nil {
				return err
			}
		}
	}

	// set rootfs ( / ) as readonly
	if config.Readonlyfs {
		if err := setReadonly(); err != nil {
			return fmt.Errorf("error setting rootfs as readonly: %w", err)
		}
	}

	if config.Umask != nil {
		unix.Umask(int(*config.Umask))
	} else {
		unix.Umask(0o022)
	}
	return nil
}

The first part of finalizeRootfs pairs with the earlier mountPropagate (runc's safety wrapper around mount):

// Do the mount operation followed by additional mounts required to take care
// of propagation flags. This will always be scoped inside the container rootfs.
func mountPropagate(m *configs.Mount, rootfs string, mountLabel string, mountFd *int) error {
	var (
		data  = label.FormatMountLabel(m.Data, mountLabel)
		flags = m.Flags
	)
	// Delay mounting the filesystem read-only if we need to do further
	// operations on it. We need to set up files in "/dev", and other tmpfs
	// mounts may need to be chmod-ed after mounting. These mounts will be
	// remounted ro later in finalizeRootfs(), if necessary.
	if m.Device == "tmpfs" || utils.CleanPath(m.Destination) == "/dev" {
		flags &= ^unix.MS_RDONLY
	}
  
  ...
}

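Because tmpfs mounts and /dev may need some initialization or a chmod after mounting, MS_RDONLY is cleared at mount time even if it was requested. At the finalize stage, if MS_RDONLY was configured, the mount is remounted so that it really becomes read-only.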
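The two halves of this "mount writable now, remount read-only later" pattern can be put side by side. In the sketch below the mnt struct and helper names are invented, msRdonly is a local copy of the Linux MS_RDONLY value, and path.Clean stands in for runc's utils.CleanPath.

```go
package main

import (
	"fmt"
	"path"
)

const msRdonly uintptr = 0x1 // value of MS_RDONLY on Linux

// mnt is a hypothetical, reduced version of a configured mount.
type mnt struct {
	Device      string
	Destination string
	Flags       uintptr
}

// delayed reports whether the read-only flag is deferred for this mount.
func delayed(m mnt) bool {
	return m.Device == "tmpfs" || path.Clean(m.Destination) == "/dev"
}

// initialFlags is the mountPropagate side: drop MS_RDONLY for delayed mounts.
func initialFlags(m mnt) uintptr {
	flags := m.Flags
	if delayed(m) {
		flags &^= msRdonly
	}
	return flags
}

// needsRemountRO is the finalizeRootfs side: only mounts that asked for
// read-only and were delayed get remounted.
func needsRemountRO(m mnt) bool {
	return m.Flags&msRdonly == msRdonly && delayed(m)
}

func main() {
	tmp := mnt{Device: "tmpfs", Destination: "/run", Flags: msRdonly}
	sys := mnt{Device: "sysfs", Destination: "/sys", Flags: msRdonly}
	fmt.Println(initialFlags(tmp), needsRemountRO(tmp))
	fmt.Println(initialFlags(sys), needsRemountRO(sys))
}
```

A read-only sysfs mount keeps its flag from the start, while a read-only tmpfs is first mounted writable and flagged for the later remount.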

The second part follows the same logic: if Readonlyfs is set in the config, the rootfs is made read-only only at the very end.

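The third part sets the umask. Looking only at the default of 022: files and directories are normally created with 666/777 unless specified otherwise, and the umask strips the write permission for group and others, giving 644/755.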
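The arithmetic is a bitwise "and not": every bit set in the umask is cleared from the requested mode. A short worked example (applyUmask is just a helper name for this sketch):

```go
package main

import "fmt"

// applyUmask clears the bits of mask from the requested mode.
func applyUmask(mode, mask uint32) uint32 {
	return mode &^ mask
}

func main() {
	fmt.Printf("%o\n", applyUmask(0o666, 0o022)) // regular file
	fmt.Printf("%o\n", applyUmask(0o777, 0o022)) // directory
}
```

This prints 644 and 755, the values mentioned above.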


II. Using runc

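Having come this far, we see that runc does not dictate what kind of mount the rootfs mountpoint is. It only bind-mounts it, and it lets us define freely how rootfs is mounted. At least so far, supplying a mount from outside and using it as the root looks feasible. Let's try out how runc builds a container under different configurations.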

1. Creating and entering a container with runc

First run runc spec && cat config.json. This generates a default configuration, which is in fact an OCI runtime spec:

{
	"ociVersion": "1.0.2-dev",
	"process": {
		"terminal": true,
		"user": {
			"uid": 0,
			"gid": 0
		},
		"args": [
			"sh"
		],
		"env": [
			"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
			"TERM=xterm"
		],
		"cwd": "/",
		"capabilities": {
			"bounding": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE"
			],
			"effective": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE"
			],
			"permitted": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE"
			],
			"ambient": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE"
			]
		},
		"rlimits": [
			{
				"type": "RLIMIT_NOFILE",
				"hard": 1024,
				"soft": 1024
			}
		],
		"noNewPrivileges": true
	},
	"root": {
		"path": "rootfs",
		"readonly": true
	},
	"hostname": "runc",
	"mounts": [
		{
			"destination": "/proc",
			"type": "proc",
			"source": "proc"
		},
		{
			"destination": "/dev",
			"type": "tmpfs",
			"source": "tmpfs",
			"options": [
				"nosuid",
				"strictatime",
				"mode=755",
				"size=65536k"
			]
		},
		{
			"destination": "/dev/pts",
			"type": "devpts",
			"source": "devpts",
			"options": [
				"nosuid",
				"noexec",
				"newinstance",
				"ptmxmode=0666",
				"mode=0620",
				"gid=5"
			]
		},
		{
			"destination": "/dev/shm",
			"type": "tmpfs",
			"source": "shm",
			"options": [
				"nosuid",
				"noexec",
				"nodev",
				"mode=1777",
				"size=65536k"
			]
		},
		{
			"destination": "/dev/mqueue",
			"type": "mqueue",
			"source": "mqueue",
			"options": [
				"nosuid",
				"noexec",
				"nodev"
			]
		},
		{
			"destination": "/sys",
			"type": "sysfs",
			"source": "sysfs",
			"options": [
				"nosuid",
				"noexec",
				"nodev",
				"ro"
			]
		},
		{
			"destination": "/sys/fs/cgroup",
			"type": "cgroup",
			"source": "cgroup",
			"options": [
				"nosuid",
				"noexec",
				"nodev",
				"relatime",
				"ro"
			]
		}
	],
	"linux": {
		"resources": {
			"devices": [
				{
					"allow": false,
					"access": "rwm"
				}
			]
		},
		"namespaces": [
			{
				"type": "pid"
			},
			{
				"type": "network"
			},
			{
				"type": "ipc"
			},
			{
				"type": "uts"
			},
			{
				"type": "mount"
			}
		],
		"maskedPaths": [
			"/proc/acpi",
			"/proc/asound",
			"/proc/kcore",
			"/proc/keys",
			"/proc/latency_stats",
			"/proc/timer_list",
			"/proc/timer_stats",
			"/proc/sched_debug",
			"/sys/firmware",
			"/proc/scsi"
		],
		"readonlyPaths": [
			"/proc/bus",
			"/proc/fs",
			"/proc/irq",
			"/proc/sys",
			"/proc/sysrq-trigger"
		]
	}
}

However, the default rootfs directory contains no root filesystem (the directory doesn't even exist yet), so running it directly fails:

> runc run config.json
ERRO[0000] runc run failed: invalid rootfs: stat /root/rootfs: no such file or directory

Here I used docker export to get a busybox root filesystem, placed it under /root/rootfs, and changed root.path in the config above to /root/rootfs:

> ls
VERSION  bin  custom  dev  etc  home  json  lib  lib64  proc  root  tmp  usr  var
> pwd
/root/rootfs

Start the container with runc:

> runc run config.json

/ # ls
VERSION  bin      custom   dev      etc      home     json     lib      lib64    proc     root     sys      tmp      usr      var

/ # echo $$
1

/ # ps -ef
PID   USER     TIME  COMMAND
    1 root      0:00 sh
    8 root      0:00 ps -ef
    
/ # mount
/dev/vda1 on / type ext4 (ro,noatime)
...

/ # echo something > test.log
sh: can't create test.log: Read-only file system

As configured, rootfs is read-only, which corresponds to the second part of finalizeRootfs discussed in Part I.

	"root": {
		"path": "rootfs",
		"readonly": true
	},
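To connect this fragment back to the source, here is a minimal sketch of reading just the root section of a config.json with encoding/json. The spec struct below is a cut-down assumption that covers only these two fields; the full schema is defined by the OCI runtime spec.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// spec is a hypothetical, minimal view of config.json: only the root section.
type spec struct {
	Root struct {
		Path     string `json:"path"`
		Readonly bool   `json:"readonly"`
	} `json:"root"`
}

// parseRoot returns root.path and root.readonly from a config.json payload.
// Unknown fields are ignored, and a missing "readonly" decodes as false.
func parseRoot(data []byte) (path string, readonly bool, err error) {
	var s spec
	if err := json.Unmarshal(data, &s); err != nil {
		return "", false, err
	}
	return s.Root.Path, s.Root.Readonly, nil
}

func main() {
	cfg := []byte(`{"ociVersion":"1.0.2-dev","root":{"path":"rootfs","readonly":true}}`)
	fmt.Println(parseRoot(cfg))
}
```

With "readonly": true, the Readonlyfs branch in finalizeRootfs is the one that makes the write to test.log fail above.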

2. Trying to break out under chroot

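Let's try a breakout under the insecure chroot. First edit the generated config to add more capabilities; otherwise some commands, such as chroot, won't run. Next, remove mount (i.e. NEWNS) from namespaces: as we saw in the source, with NEWNS enabled runc does a pivot_root. Finally, remove maskedPaths and readonlyPaths, otherwise the config does not pass runc's security checks: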

{
	"ociVersion": "1.0.2-dev",
	"process": {
		"terminal": true,
		"user": {
			"uid": 0,
			"gid": 0
		},
		"args": [
			"bash"
		],
		"env": [
			"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
			"TERM=xterm"
		],
		"cwd": "/",
		"capabilities": {
			"bounding": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE",
				"CAP_SYS_CHROOT",
				"CAP_MKNOD",
				"CAP_SYS_ADMIN"
			],
			"effective": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE",
				"CAP_SYS_CHROOT",
				"CAP_MKNOD",
				"CAP_SYS_ADMIN"
			],
			"permitted": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE",
				"CAP_SYS_CHROOT",
				"CAP_MKNOD",
				"CAP_SYS_ADMIN"
			],
			"ambient": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE",
				"CAP_SYS_CHROOT",
				"CAP_MKNOD",
				"CAP_SYS_ADMIN"
			]
		},
		"rlimits": [
			{
				"type": "RLIMIT_NOFILE",
				"hard": 1024,
				"soft": 1024
			}
		],
		"noNewPrivileges": true
	},
	"root": {
		"path": "/root/ubuntu"
	},
	"hostname": "runc",
	"mounts": [
		{
			"destination": "/proc",
			"type": "proc",
			"source": "proc"
		},
		{
			"destination": "/dev",
			"type": "tmpfs",
			"source": "tmpfs",
			"options": [
				"nosuid",
				"strictatime",
				"mode=755",
				"size=65536k"
			]
		},
		{
			"destination": "/dev/pts",
			"type": "devpts",
			"source": "devpts",
			"options": [
				"nosuid",
				"noexec",
				"newinstance",
				"ptmxmode=0666",
				"mode=0620",
				"gid=5"
			]
		},
		{
			"destination": "/dev/shm",
			"type": "tmpfs",
			"source": "shm",
			"options": [
				"nosuid",
				"noexec",
				"nodev",
				"mode=1777",
				"size=65536k"
			]
		},
		{
			"destination": "/dev/mqueue",
			"type": "mqueue",
			"source": "mqueue",
			"options": [
				"nosuid",
				"noexec",
				"nodev"
			]
		},
		{
			"destination": "/sys",
			"type": "sysfs",
			"source": "sysfs",
			"options": [
				"nosuid",
				"noexec",
				"nodev",
				"ro"
			]
		},
		{
			"destination": "/sys/fs/cgroup",
			"type": "cgroup",
			"source": "cgroup",
			"options": [
				"nosuid",
				"noexec",
				"nodev",
				"relatime",
				"ro"
			]
		}
	],
	"linux": {
		"resources": {
			"devices": [
				{
					"allow": false,
					"access": "rwm"
				}
			]
		},
		"namespaces": [
			{
				"type": "pid"
			},
			{
				"type": "network"
			},
			{
				"type": "ipc"
			},
			{
				"type": "uts"
			}
		]
	}
}

After entering the container, we run the code from the breakout tutorial and successfully break out:

// Enter the container
[root@master ~]# runc run config.json

root@runc:/# ls -la
total 104
drwxr-xr-x  21 root root  4096 Feb 16 13:56 .
drwxr-xr-x  21 root root  4096 Feb 16 13:56 ..
-rwxr-xr-x   1 root root     0 Feb 14 03:20 .dockerenv
lrwxrwxrwx   1 root root     7 Jan 26 02:03 bin -> usr/bin
drwxr-xr-x   2 root root  4096 Apr 18  2022 boot
-rwxr-xr-x   1 root root 29160 Feb 16 10:22 break
drwxr-xr-x   2 root root  4096 Feb 16 14:00 d1r1
drwxr-xr-x   2 root root  4096 Feb 16 14:00 d1r2
drwxr-xr-x   2 root root  4096 Feb 16 14:01 d1r3
drwxr-xr-x   5 root root   360 Feb 16 14:40 dev
drwxr-xr-x  32 root root  4096 Feb 14 03:20 etc
drwxr-xr-x   2 root root  4096 Apr 18  2022 home
lrwxrwxrwx   1 root root     7 Jan 26 02:03 lib -> usr/lib
lrwxrwxrwx   1 root root     9 Jan 26 02:03 lib32 -> usr/lib32
lrwxrwxrwx   1 root root     9 Jan 26 02:03 lib64 -> usr/lib64
lrwxrwxrwx   1 root root    10 Jan 26 02:03 libx32 -> usr/libx32
drwxr-xr-x   2 root root  4096 Jan 26 02:03 media
drwxr-xr-x   2 root root  4096 Jan 26 02:03 mnt
drwxr-xr-x   2 root root  4096 Jan 26 02:03 opt
dr-xr-xr-x 375 root root     0 Feb 16 14:40 proc
drwx------   2 root root  4096 Feb 16 10:23 root
drwxr-xr-x   6 root root  4096 Feb 14 03:20 run
lrwxrwxrwx   1 root root     8 Jan 26 02:03 sbin -> usr/sbin
drwxr-xr-x   2 root root  4096 Jan 26 02:03 srv
dr-xr-xr-x  12 root root     0 Feb 16 14:40 sys
drwxrwxrwt   2 root root  4096 Jan 26 02:06 tmp
drwxr-xr-x  14 root root  4096 Jan 26 02:03 usr
drwxr-xr-x  11 root root  4096 Jan 26 02:06 var
drwxr-xr-x   2 root root  4096 Feb 16 10:22 waterbuffalo

// Essentially chdir(..) + chroot(.)
root@runc:/# ./break

// Breakout succeeded
[root@runc /]# ls -la
total 18920
dr-xr-xr-x  23 root root     4096 Feb 16 22:39 .
dr-xr-xr-x  23 root root     4096 Feb 16 22:39 ..
drwxr-xr-x   3 root root     4096 Jan  8  2021 agent
drwxr-xr-x   3 root root     4096 Nov 24 16:03 api-helm
-rw-r--r--   1 root root        0 Oct 30  2020 .autorelabel
lrwxrwxrwx   1 root root        7 Dec 14  2020 bin -> usr/bin
dr-xr-xr-x   5 root root     4096 Nov 30  2021 boot
-rwxr-xr-x   1 root root 19261816 Dec  8 14:30 cloud-agent
-rw-------   1 root root    12288 Nov 25 11:27 .conf.txt.swp
drwxr-xr-x  13 root root     4096 Feb 16 22:39 data
drwxr-xr-x  17 root root    14140 Sep  9 16:20 dev
drwxr-xr-x 108 root root    12288 Feb 16 10:59 etc
drwxr-xr-x   2 root root     4096 Dec 14  2020 home
lrwxrwxrwx   1 root root        7 Dec 14  2020 lib -> usr/lib
lrwxrwxrwx   1 root root        9 Dec 14  2020 lib64 -> usr/lib64
drwx------   2 root root    16384 Aug 18  2020 lost+found
drwxr-xr-x   2 root root     4096 Dec 14  2020 media
drwxr-xr-x   2 root root     4096 Dec 14  2020 mnt
drwxr-xr-x   6 root root     4096 Sep  9 16:31 opt
dr-xr-xr-x 375 root root        0 Sep  9 16:19 proc
dr-xr-x---  31 root root     4096 Feb 16 22:34 root
drwxr-xr-x   2 root root     4096 Dec  5 11:40 rot
drwxr-xr-x  33 root root     1120 Feb 16 14:59 run
lrwxrwxrwx   1 root root        8 Dec 14  2020 sbin -> usr/sbin
drwxr-xr-x   2 root root     4096 Dec 14  2020 srv
dr-xr-xr-x  12 root root        0 Sep  9 16:19 sys
drwxr-xr-x   2 root root     4096 Dec  5 11:33 test
drwxrwxrwt   4 root root     4096 Feb 16 22:35 tmp
drwxr-xr-x  12 root root     4096 Nov 30  2021 usr
drwxr-xr-x  21 root root     4096 Sep  9 15:53 var
[root@runc /]#

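That's a wrap for now. At least we have a rough picture of how a container is initialized, and we know that the container's rootfs can point at any directory. But the question from the beginning remains unanswered. Neither Docker nor K8s actually lets you do the following: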

> docker run -it -d --name xxx -p 8091:8090 -v /xxx:/ ubuntu
docker: Error response from daemon: invalid volume specification: '/xxx:/': invalid mount config for type "bind": invalid specification: destination can't be '/'.
See 'docker run --help'.

In the next part, we'll revisit this question from the OCI and CRI angle.

If there are mistakes in this article, corrections are welcome.
