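Yang's program: https://github.com/Pro-YY/jail
Main process:
- Parses the command-line arguments with argp_parse.
- As root, sets up a cgroup (limits the resources of a group of processes) and rlimits (limit the resources of a single process or user).
- Calls clone() with CLONE_NEW* flags to create the child process.
- Configures the network.
- Writes to an eventfd to notify the child process, which then continues.
- Registers signal and timeout events with epoll, then runs the event loop, reading and handling those events.
Child process:
- Retrieves its arguments inside the new namespaces, including the command to execute.
- Sets the hostname.
- Blocks reading the eventfd and continues once the read returns.
- Configures the network, mounts the rootfs, restricts system calls with seccomp, and restricts capabilities with prctl.
- Calls execve to run the command.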
Usage
$ ./jail --help
Usage: jail [OPTION...] [<program> [<argument>...]]
Jail, a pretty sandbox to run program.
-b, --base=STRING Mount base dir, default to '/tmp'
-d, --detach Detach process as deaemon
-e, --env=STRING Environment variables
--ip=ADDRESS Assign ip address, within 172.17.0.1/16
-n, --name=STRING Jail name, default to random string
-r, --root=STRING Rootfs, default to '/'
-t, --timeout=SECONDS Running timeout
-v, --verbose Make the operation more talkative
-w, --writable Make rootfs writable mount
-?, --help Give this help list
--usage Give a short usage message
-V, --version Print program version
Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.
Report bugs to <brookeyang@vip.qq.com>.
Run a command in the container; the container exits when the command finishes:
$ sudo ./jail /bin/ls
bin data etc imgcreate_linux_install_0.1.23 initrd.img.old lib64 media opt root sbin srv tmp var vmlinuz.old
boot dev home initrd.img lib lost+found mnt proc run snap sys usr vmlinuz
Run an interactive command in the container:
$ sudo ./jail /usr/local/node/bin/node
>
Error: Could not open history file.
REPL session history will not be persisted.
>
>
Namespaces
User namespace
Host:
$ ps -ef | grep jail
root 20743 13977 0 20:03 pts/0 00:00:00 sudo ./jail /bin/sh -n myJail
$ cat /proc/20743/uid_map
0 0 4294967295
$ cat /proc/20743/gid_map
0 0 4294967295
Container:
# id
uid=0(root) gid=0(root) groups=0(root)
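Note that 20743 is a PID, not a uid. The map 0 0 4294967295 is the identity mapping over the whole ID range, so uid 0 inside the container corresponds to uid 0 on the host.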
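To make the uid_map format concrete, here is a small sketch that translates an ID through a map. Each line has three fields: the first ID inside the namespace, the first ID outside it, and the length of the range. The function name map_id_to_parent is hypothetical, and the 100000 offset in the usage note is an invented example rather than something taken from this program.

```python
def map_id_to_parent(map_text: str, inside_id: int):
    """Translate an ID inside a user namespace to the ID outside it.

    `map_text` is the content of /proc/<pid>/uid_map or gid_map.
    Returns None when the ID is not covered by any range.
    """
    for line in map_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside_start, outside_start, length = (int(f) for f in fields)
        if inside_start <= inside_id < inside_start + length:
            return outside_start + (inside_id - inside_start)
    return None
```

With the map shown above, map_id_to_parent("0 0 4294967295", 0) gives 0. With a remapped range such as "0 100000 65536", the same call would give 100000.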
UTS namespace
The hostname is changed inside the new UTS namespace:
$ sudo ./jail /bin/sh -n jail
# hostname
jail
PID namespace
On the host:
$ ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 Nov01 ? 00:00:09 /sbin/init
root 2 0 0 Nov01 ? 00:00:00 [kthreadd]
root 4 2 0 Nov01 ? 00:00:00 [kworker/0:0H]
root 6 2 0 Nov01 ? 00:00:00 [mm_percpu_wq]
root 7 2 0 Nov01 ? 00:00:01 [ksoftirqd/0]
root 8 2 0 Nov01 ? 00:00:27 [rcu_sched]
...
root 20743 13977 0 20:03 pts/0 00:00:00 sudo ./jail /bin/sh -n myJail
...
The container has only two processes:
# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 15:02 pts/1 00:00:00 /bin/sh
root 15 1 0 15:06 pts/1 00:00:00 ps -ef
# ls /proc
1 cmdline dma interrupts key-users loadavg mounts schedstat swaps tty zoneinfo
12 consoles driver iomem keys locks mtrr scsi sys uptime
acpi cpuinfo execdomains ioports kmsg mdstat net self sysrq-trigger version
buddyinfo crypto fb irq kpagecgroup meminfo pagetypeinfo slabinfo sysvipc version_signature
bus devices filesystems kallsyms kpagecount misc partitions softirqs thread-self vmallocinfo
cgroups diskstats fs kcore kpageflags modules sched_debug stat timer_list vmstat
Network namespace
Host:
...
jail0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.17.255.254 netmask 255.255.0.0 broadcast 172.17.255.255
ether e6:14:c9:b9:27:a0 txqueuelen 1000 (Ethernet)
RX packets 216 bytes 14220 (14.2 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth-myJail: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
ether e6:14:c9:b9:27:a0 txqueuelen 1000 (Ethernet)
RX packets 13 bytes 1006 (1.0 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Container:
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.17.0.1 netmask 255.255.0.0 broadcast 0.0.0.0
inet6 fe80::860:a7ff:fea2:10f9 prefixlen 64 scopeid 0x20<link>
ether 0a:60:a7:a2:10:f9 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 13 bytes 1006 (1.0 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
...
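Container-to-host communication
A bridge, jail0, is added on the host with the address 172.17.255.254/16 (network 172.17.0.0/16).
A veth pair is created: packets sent on one end are received on the other.
One end, veth-myJail, is attached to the jail0 bridge; the other end, eth0, is moved into the container.
Inside the container, eth0 is assigned the address 172.17.0.1.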
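The addressing can be checked with Python's standard ipaddress module. This is only a sketch to show that the bridge address and the container address fall in the same /16 network; the helper name same_subnet is hypothetical.

```python
import ipaddress


def same_subnet(bridge_cidr: str, address: str) -> bool:
    """Return True if `address` lies in the network of the bridge interface."""
    bridge = ipaddress.ip_interface(bridge_cidr)
    return ipaddress.ip_address(address) in bridge.network


# The bridge address from the ifconfig output above, in CIDR form.
bridge = ipaddress.ip_interface("172.17.255.254/16")
network = str(bridge.network)                      # network the bridge sits in
broadcast = str(bridge.network.broadcast_address)  # last address of that network
```

Here network is "172.17.0.0/16" and broadcast is "172.17.255.255", which agrees with the broadcast address reported for jail0.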
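Container-to-internet communication
Reaching the internet from inside the container requires SNAT.
First set net.ipv4.ip_forward = 1 to enable forwarding on the host, so that the host acts as a router.
Then add an iptables rule on the host: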
iptables -t nat -A POSTROUTING -s 172.17.0.0/16 -j MASQUERADE
Mount namespace
On the host:
$ cat /proc/20743/mountinfo | sed 's/ - .*//'
23 29 0:22 / /sys rw,nosuid,nodev,noexec,relatime shared:7
24 29 0:4 / /proc rw,nosuid,nodev,noexec,relatime shared:13
25 29 0:6 / /dev rw,nosuid,relatime shared:2
26 25 0:23 / /dev/pts rw,nosuid,noexec,relatime shared:3
27 29 0:24 / /run rw,nosuid,noexec,relatime shared:5
29 0 252:1 / / rw,relatime shared:1
30 23 0:7 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime shared:8
31 25 0:26 / /dev/shm rw,nosuid,nodev shared:4
32 27 0:27 / /run/lock rw,nosuid,nodev,noexec,relatime shared:6
33 23 0:28 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:9
34 33 0:29 / /sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime shared:10
35 33 0:30 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:11
36 23 0:31 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime shared:12
37 33 0:32 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:14
38 33 0:33 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:15
39 33 0:34 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:16
40 33 0:35 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:17
41 33 0:36 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:18
42 33 0:37 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:19
43 33 0:38 / /sys/fs/cgroup/rdma rw,nosuid,nodev,noexec,relatime shared:20
44 33 0:39 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:21
45 33 0:40 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:22
46 33 0:41 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:23
47 33 0:42 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:24
48 24 0:43 / /proc/sys/fs/binfmt_misc rw,relatime shared:25
49 25 0:19 / /dev/mqueue rw,relatime shared:26
51 23 0:8 / /sys/kernel/debug rw,relatime shared:27
50 25 0:44 / /dev/hugepages rw,relatime shared:28
52 23 0:20 / /sys/kernel/config rw,relatime shared:29
53 23 0:45 / /sys/fs/fuse/connections rw,relatime shared:30
262 29 0:48 / /var/lib/lxcfs rw,nosuid,nodev,relatime shared:144
276 27 0:24 /netns /run/netns rw,nosuid,noexec,relatime shared:5
283 276 0:3 net:[4026532213] /run/netns/netns1 rw shared:155
284 27 0:3 net:[4026532213] /run/netns/netns1 rw shared:155
269 27 0:49 / /run/user/500 rw,nosuid,nodev,relatime shared:148
In the container:
# cat /proc/self/mountinfo | sed 's/ - .*//'
351 313 252:1 / / ro,relatime master:1
352 351 0:6 / /dev rw,nosuid,relatime master:2
353 352 0:23 / /dev/pts rw,nosuid,noexec,relatime master:3
354 352 0:26 / /dev/shm rw,nosuid,nodev master:4
355 352 0:19 / /dev/mqueue rw,relatime master:26
356 352 0:44 / /dev/hugepages rw,relatime master:28
357 351 0:24 / /run rw,nosuid,noexec,relatime master:5
358 357 0:27 / /run/lock rw,nosuid,nodev,noexec,relatime master:6
359 357 0:24 /netns /run/netns rw,nosuid,noexec,relatime master:5
360 359 0:3 net:[4026532213] /run/netns/netns1 rw master:155
361 357 0:3 net:[4026532213] /run/netns/netns1 rw master:155
362 357 0:49 / /run/user/500 rw,nosuid,nodev,relatime master:148
363 351 0:22 / /sys rw,nosuid,nodev,noexec,relatime master:7
364 363 0:7 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8
365 363 0:28 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9
366 365 0:29 /../../.. /sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime master:10
367 365 0:30 /../../.. /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:11
368 365 0:32 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:14
369 365 0:33 /.. /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime master:15
370 365 0:34 /.. /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:16
371 365 0:35 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime master:17
372 365 0:36 /.. /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime master:18
373 365 0:37 /../../.. /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime master:19
374 365 0:38 / /sys/fs/cgroup/rdma rw,nosuid,nodev,noexec,relatime master:20
375 365 0:39 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:21
376 365 0:40 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime master:22
377 365 0:41 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime master:23
378 365 0:42 /.. /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:24
379 363 0:31 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:12
380 363 0:8 / /sys/kernel/debug rw,relatime master:27
381 363 0:20 / /sys/kernel/config rw,relatime master:29
382 363 0:45 / /sys/fs/fuse/connections rw,relatime master:30
383 351 0:4 / /proc rw,nosuid,nodev,noexec,relatime master:13
384 383 0:43 / /proc/sys/fs/binfmt_misc rw,relatime master:25
385 351 0:48 / /var/lib/lxcfs rw,nosuid,nodev,relatime master:144
386 383 0:51 / /proc ro,relatime
314 351 0:52 / /tmp rw,relatime
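The two listings are easier to compare once the fields are named. Each mountinfo line starts with the mount ID, the parent mount ID, the device as major:minor, the root within the filesystem, the mount point, and the mount options, followed by zero or more optional propagation fields such as shared:N or master:N. The parser below is a minimal sketch with hypothetical names; it also tolerates full lines that still carry the " - ..." tail removed by sed above.

```python
def parse_mountinfo_line(line: str) -> dict:
    """Split one /proc/<pid>/mountinfo line into named fields."""
    # Drop the " - fstype source superoptions" tail if it is still present.
    head = line.split(" - ", 1)[0].split()
    mount_id, parent_id, device, root, mount_point, options = head[:6]
    return {
        "mount_id": int(mount_id),
        "parent_id": int(parent_id),
        "device": device,
        "root": root,
        "mount_point": mount_point,
        "options": options.split(","),
        "propagation": head[6:],  # e.g. ["shared:1"], ["master:1"], or []
    }
```

Applied to the output above, the host mounts carry shared:N tags while the corresponding container mounts carry master:N, the container's root is mounted ro, and the last two container entries (/proc and /tmp) have no propagation tag at all.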
cgroup
Memory:
$ ls -l /sys/fs/cgroup/memory
...
drwx------ 2 root root 0 Nov 10 01:20 myJail
...
$ sudo cat /sys/fs/cgroup/memory/myJail/memory.limit_in_bytes
49999872
$ sudo cat /sys/fs/cgroup/memory/myJail/memory.kmem.limit_in_bytes
49999872
CPU:
$ ls -l /sys/fs/cgroup/cpu
lrwxrwxrwx 1 root root 11 Nov 1 14:50 /sys/fs/cgroup/cpu -> cpu,cpuacct
$ ls -l /sys/fs/cgroup/cpu,cpuacct
...
drwx------ 2 root root 0 Nov 9 22:01 myJail
...
$ sudo cat /sys/fs/cgroup/cpu,cpuacct/myJail/cpu.cfs_period_us
1000000
$ sudo cat /sys/fs/cgroup/cpu,cpuacct/myJail/cpu.cfs_quota_us
1000000
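The numbers above can be interpreted with a little arithmetic. The sketch below rests on two assumptions that do not come from the program itself: a 4096-byte page size, and a requested memory limit of 50,000,000 bytes, which is a guess that happens to be consistent with the 49999872 reported by the kernel after rounding down to a whole number of pages. The function names are hypothetical.

```python
PAGE_SIZE = 4096  # assumption: typical page size on x86-64


def round_down_to_page(limit_bytes: int, page_size: int = PAGE_SIZE) -> int:
    """Round a byte limit down to a whole number of pages."""
    return limit_bytes // page_size * page_size


def cpu_share(quota_us: int, period_us: int):
    """CPUs' worth of run time allowed per period; None when unlimited."""
    if quota_us < 0:  # a quota of -1 means no limit
        return None
    return quota_us / period_us
```

With a quota of 1000000 us in a period of 1000000 us, cpu_share returns 1.0, meaning the cgroup may use up to one full CPU.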