ignite

The walkthrough below uses containerd as the runtime.

Introduction

Ignite is an engine for launching firecracker vms; it hosts each firecracker vm inside a container. The project is currently stalled, which is a pity. I learned a lot by reading the code to understand how ignite works, and I hope to help maintain the project.

ignite works much like kubernetes: Firecracker can be seen as runc, and ignite as the CRI (there is also Footloose, which can be seen as docker-compose). In addition, ignite uses a store (referred to as Storage below) to keep the cluster's metadata (image/kernel/vm), comparable to etcd in kubernetes.

image

The figure above shows how ignite creates a vm; the vm is actually created by the Firecracker command running inside a container. Both concepts, container and vm, are used below; take care to keep them apart.

First, a containerd namespace named firecracker is created; images are pulled and containers created in that namespace, and inside the container ignite-spawn invokes firecracker to create the vm. The whole flow involves building and mounting the vm filesystem, configuring and creating the container via containerd, setting up the vm's network, and starting the vm with firecracker. The vm process is started through the following chain:

  1. /usr/bin/containerd-shim-runc-v2 -namespace firecracker -id ignite-ddf49307b5b27c34 -address /run/containerd/containerd.sock
  2. /usr/local/bin/ignite-spawn --log-level=info ddf49307b5b27c34
  3. firecracker --api-sock /var/lib/firecracker/vm/ddf49307b5b27c34/firecracker.sock

The resulting process tree:

containerd-shim─┬─ignite-spawn─┬─firecracker───2*[{firecracker}]
                │              └─14*[{ignite-spawn}]
                └─11*[{containerd-shim}]

ignite drives containers through the following interface; the functionality resembles what the docker command line offers:

type Interface interface {
	PullImage(image meta.OCIImageRef) error
	InspectImage(image meta.OCIImageRef) (*ImageInspectResult, error)
	ExportImage(image meta.OCIImageRef) (io.ReadCloser, func() error, error)

	InspectContainer(container string) (*ContainerInspectResult, error)
	AttachContainer(container string) error
	RunContainer(image meta.OCIImageRef, config *ContainerConfig, name, id string) (string, error)
	StopContainer(container string, timeout *time.Duration) error
	KillContainer(container, signal string) error
	RemoveContainer(container string) error
	ContainerLogs(container string) (io.ReadCloser, error)

	Name() Name
	RawClient() interface{}

	PreflightChecker() preflight.Checker
}

ignite has three kinds of resources: Image, Kernel and VM, representing the base image, the kernel image and the virtual machine respectively. It uses Storage to persist these resources' metadata, stored under constants.DATA_DIR, which is defined in the code as /var/lib/firecracker.

ignite uses two main directories:

  • /var/lib/firecracker/: holds the metadata of the Image, Kernel and VM objects, as well as the kernel files, the vm filesystem files, and so on. Each of ignite's three object kinds has a UID, and the related resources live in the corresponding /var/lib/firecracker/<image/kernel/vm>/<UID> directory.
  • /etc/firecracker/manifests: vm manifest files consumed by the ignited daemon, which manages vms by watching the files in this directory
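
For orientation, the on-disk layout under /var/lib/firecracker looks roughly like this (the UIDs are the example values used later in this article):

$ tree /var/lib/firecracker -L 3
/var/lib/firecracker
├── image
│   └── 669a5721d130ef1d
│       ├── image.ext4      # base filesystem file
│       └── metadata.json   # Image object metadata
├── kernel
│   └── 1bdd3b2354873157
│       ├── kernel.tar      # packed /boot and /lib/modules
│       ├── metadata.json   # Kernel object metadata
│       └── vmlinux         # uncompressed kernel file
└── vm
    └── ddf49307b5b27c34
        ├── metadata.json   # VM object metadata
        └── overlay.dm      # the vm's copy-on-write overlay file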

ignite uses several storage types. The underlying Storage interface is shown below; its methods are very close to how client-go operates on kubernetes resources. Storage holds ignite's CRD objects internally:

type Storage interface {
	// New creates a new Object for the specified kind
	New(gvk schema.GroupVersionKind) (runtime.Object, error)
	// Get returns a new Object for the resource at the specified kind/uid path, based on the file content
	Get(gvk schema.GroupVersionKind, uid runtime.UID) (runtime.Object, error)
	// GetMeta returns a new Object's APIType representation for the resource at the specified kind/uid path
	GetMeta(gvk schema.GroupVersionKind, uid runtime.UID) (runtime.Object, error)
	// Set saves the Object to disk. If the Object does not exist, the
	// ObjectMeta.Created field is set automatically
	Set(gvk schema.GroupVersionKind, obj runtime.Object) error
	// Patch performs a strategic merge patch on the Object with the given UID, using the byte-encoded patch given
	Patch(gvk schema.GroupVersionKind, uid runtime.UID, patch []byte) error
	// Delete removes an Object from the storage
	Delete(gvk schema.GroupVersionKind, uid runtime.UID) error
	// List lists Objects for the specific kind
	List(gvk schema.GroupVersionKind) ([]runtime.Object, error)
	// ListMeta lists all Objects' APIType representation. In other words,
	// only metadata about each Object is unmarshalled (uid/name/kind/apiVersion).
	// This allows for faster runs (no need to unmarshal "the world"), and less
	// resource usage, when only metadata is unmarshalled into memory
	ListMeta(gvk schema.GroupVersionKind) ([]runtime.Object, error)
	// Count returns the amount of available Objects of a specific kind
	// This is used by Caches to check if all Objects are cached to perform a List
	Count(gvk schema.GroupVersionKind) (uint64, error)
	// Checksum returns a string representing the state of an Object on disk
	// The checksum should change if any modifications have been made to the
	// Object on disk, it can be e.g. the Object's modification timestamp or
	// calculated checksum
	Checksum(gvk schema.GroupVersionKind, uid runtime.UID) (string, error)
	// RawStorage returns the RawStorage instance backing this Storage
	RawStorage() RawStorage
	// Serializer returns the serializer
	Serializer() serializer.Serializer
	// Close closes all underlying resources (e.g. goroutines) used; before the application exits
	Close() error
}
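
To make the client-go analogy concrete, here is a minimal sketch of using the Storage interface directly, assuming the import paths of the ignite repository. It builds the same GenericStorage that ignite itself uses (see SetGenericStorage below) and lists the metadata of all VM objects:

package main

import (
	"fmt"

	"github.com/weaveworks/ignite/pkg/apis/ignite/scheme"
	"github.com/weaveworks/ignite/pkg/constants"
	"github.com/weaveworks/ignite/pkg/storage"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

func main() {
	// A GenericStorage over /var/lib/firecracker, as SetGenericStorage builds it.
	s := storage.NewGenericStorage(
		storage.NewGenericRawStorage(constants.DATA_DIR), scheme.Serializer)
	defer s.Close()

	// The gvk of VM objects: ignite.weave.works/v1alpha4, kind VM.
	gvk := schema.GroupVersionKind{
		Group:   "ignite.weave.works",
		Version: "v1alpha4",
		Kind:    "VM",
	}

	// ListMeta only unmarshals uid/name/kind/apiVersion, so it is cheap.
	objs, err := s.ListMeta(gvk)
	if err != nil {
		panic(err)
	}
	fmt.Printf("found %d VM objects\n", len(objs))
}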

ignite defines the resources it manages as CRDs. The corresponding gvk: the group is ignite.weave.works; the versions are v1alpha2, v1alpha3 and v1alpha4; the kinds are Image, Kernel and VM. scheme.Serializer provides the encoding and decoding of these CRDs:

var (
	// Scheme is the runtime.Scheme to which all types are registered.
	Scheme = runtime.NewScheme()

	// codecs provides access to encoding and decoding for the scheme.
	// codecs is private, as Serializer will be used for all higher-level encoding/decoding
	codecs = k8sserializer.NewCodecFactory(Scheme)

	// Serializer provides high-level encoding/decoding functions
	Serializer = serializer.NewSerializer(Scheme, &codecs)
)

func init() {
	AddToScheme(Scheme)
}

// AddToScheme builds the scheme using all known versions of the api.
func AddToScheme(scheme *runtime.Scheme) {
	utilruntime.Must(ignite.AddToScheme(Scheme))
	utilruntime.Must(v1alpha2.AddToScheme(Scheme))
	utilruntime.Must(v1alpha3.AddToScheme(Scheme))
	utilruntime.Must(v1alpha4.AddToScheme(Scheme))
	utilruntime.Must(scheme.SetVersionPriority(v1alpha4.SchemeGroupVersion))
}

Running

ignite offers two ways to manage vms, via two commands: ignite and ignited. The former manages vms manually from the command line; the latter manages them automatically by watching vm manifest files.
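
A sketch of the ignited flow (the spec fields mirror the metadata.json examples later in this article; the file name is arbitrary): dropping a manifest into /etc/firecracker/manifests makes the daemon create and start the vm, and removing the file stops it.

$ cat > /etc/firecracker/manifests/my-vm.yaml <<EOF
apiVersion: ignite.weave.works/v1alpha4
kind: VM
metadata:
  name: my-vm
spec:
  image:
    oci: weaveworks/ignite-ubuntu:latest
  kernel:
    oci: weaveworks/ignite-kernel:5.10.51
  cpus: 1
  memory: 512MB
  diskSize: 4GB
  ssh: true
EOF
$ ignited daemon    # watches /etc/firecracker/manifests and reconciles vms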

The Storage used by ignite is called GenericStorage, while ignited uses ManifestStorage. They are initialized as follows:

func SetGenericStorage() error {
	log.Trace("Initializing the GenericStorage provider...")
	providers.Storage = cache.NewCache(
		storage.NewGenericStorage(
			storage.NewGenericRawStorage(constants.DATA_DIR), scheme.Serializer))
	return nil
}
func SetManifestStorage() (err error) {
	log.Trace("Initializing the ManifestStorage provider...")
	ManifestStorage, err = manifest.NewTwoWayManifestStorage(constants.MANIFEST_DIR, constants.DATA_DIR, scheme.Serializer)
	if err != nil {
		return
	}

	providers.Storage = cache.NewCache(ManifestStorage)
	return
}

Ignite operates on the various resources through the providers shown below: Runtime manages the containerd resources, Client manages ignite's own resources, and NetworkPlugin configures the container's CNI.

image

Building the vm filesystem

image

The vm filesystem consists of two parts, the base filesystem and the kernel files, which come from the base image and the kernel image respectively. Building the vm filesystem merges the two into one complete filesystem; later, when firecracker boots the vm, that filesystem is mounted as the vm's root fs.

The ignite cli pulls the base image and the kernel image with these two commands:

$ ignite image import <OCI image> [flags]
$ ignite kernel import <OCI image> [flags]
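
For example, with the images used throughout this article:

$ ignite image import weaveworks/ignite-ubuntu:latest
$ ignite kernel import weaveworks/ignite-kernel:5.10.51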

Building the vm base filesystem file

When pulling images, ignite needs two parameters: providers.RuntimeName (default containerd) and providers.NetworkPluginName (default cni). The former creates the containerdClient used for pulling images and creating containers; the latter configures the CNI network, which is not actually used while building the filesystem.

Creating the containerdClient

Creating the containerdClient requires two things: the containerd.sock socket and a containerd-shim. On a real system several containerd-shim versions may exist; io.containerd.runc.v2 is preferred.

ignite creates a containerd namespace named firecracker; all of ignite's images and vms live in that namespace, so the ctr command can be used to inspect them directly. The listing below shows the containers and images created by ignite. Container names start with an ignite- prefix, but ignite vm ls does not show that prefix (you can also read that command as listing the names of the vms inside the containers):

$ ctr -n firecracker containers ls 
CONTAINER                  IMAGE                                  RUNTIME                  
ignite-ddf49307b5b27c34    docker.io/weaveworks/ignite:v0.10.0    io.containerd.runc.v2    
$ ctr -n firecracker images ls 
REF           TYPE             DIGEST                                                                  SIZE     PLATFORMS                  LABELS 
docker.io/weaveworks/ignite-kernel:5.10.51 application/vnd.docker.distribution.manifest.list.v2+json sha256:c1d99eafa5b2bcaeab26c0a093d83d709a560e4721f52b6e7c5ef7e9e771189d 15.0 MiB linux/amd64,linux/arm64    -      
docker.io/weaveworks/ignite-ubuntu:latest  application/vnd.docker.distribution.manifest.list.v2+json sha256:11550e0912d24aeaad847f06fdf2133302f2af2fd2ce231723d078ffce9216ba 78.1 MiB linux/amd64,linux/arm64/v8 -      
docker.io/weaveworks/ignite:v0.10.0        application/vnd.docker.distribution.manifest.list.v2+json sha256:b8cc53c5cba81d685b1dc95a0f34ca3fa732ddd450b6f0eba0c829ccc1c67462 16.5 MiB linux/amd64,linux/arm64    -   

Below is how the containerdClient is created; it just resolves the containerd socket and the runtime. Note that when resolving the runtime, if RuntimeRuncV2 does not exist it falls back to RuntimeRuncV1:

func GetContainerdClient() (*ctdClient, error) {
	ctdSocket, err := StatContainerdSocket()
	if err != nil {
		return nil, err
	}

	runtime, err := getNewestAvailableContainerdRuntime() // find an available runtime
	if err != nil {
		// proceed with the default runtime -- our PATH can't see a shim binary, but containerd might be able to
		log.Warningf("Proceeding with default runtime %q: %v", runtime, err)
	}

	cli, err := containerd.New(
		ctdSocket,
		containerd.WithDefaultRuntime(runtime),
	)
	if err != nil {
		return nil, err
	}

	return &ctdClient{
		client: cli,
		ctx:    namespaces.WithNamespace(context.Background(), ctdNamespace), // set ignite's containerd namespace
	}, nil
}
func getNewestAvailableContainerdRuntime() (string, error) {
	for _, rt := range v2ShimRuntimes {
		binary := v2shim.BinaryName(rt)
		if binary == "" {
			// this shouldn't happen if the matching test is passing, but it's not fatal -- just log and continue
			log.Errorf("shim binary could not be found -- %q is an invalid runtime/v2/shim", rt)
		} else if _, err := exec.LookPath(binary); err == nil {
			return rt, nil
		}
	}
	...
}
v2ShimRuntimes = []string{
  plugin.RuntimeRuncV2,
  plugin.RuntimeRuncV1,
}
const (
	// RuntimeLinuxV1 is the legacy linux runtime
	RuntimeLinuxV1 = "io.containerd.runtime.v1.linux"
	// RuntimeRuncV1 is the runc runtime that supports a single container
	RuntimeRuncV1 = "io.containerd.runc.v1"
	// RuntimeRuncV2 is the runc runtime that supports multiple containers per shim
	RuntimeRuncV2 = "io.containerd.runc.v2"
)

Finally, containerd.New creates the containerdClient, which is stored in the providers.Runtime variable:

	cli, err := containerd.New(
		ctdSocket,
		containerd.WithDefaultRuntime(runtime),
	)
  • The containerd-related defaults are as follows:

    const (
    	// DefaultRootDir is the default location used by containerd to store
    	// persistent data
    	DefaultRootDir = "/var/lib/containerd"
    	// DefaultStateDir is the default location used by containerd to store
    	// transient data
    	DefaultStateDir = "/run/containerd"
    	// DefaultAddress is the default unix socket address
    	DefaultAddress = "/run/containerd/containerd.sock"
    	// DefaultDebugAddress is the default unix socket address for pprof data
    	DefaultDebugAddress = "/run/containerd/debug.sock"
    	// DefaultFIFODir is the default location used by client-side cio library
    	// to store FIFOs.
    	DefaultFIFODir = "/run/containerd/fifo"
    	// DefaultRuntime is the default linux runtime
    	DefaultRuntime = "io.containerd.runc.v2"
    	// DefaultConfigDir is the default location for config files.
    	DefaultConfigDir = "/etc/containerd"
    )
    
  • For the differences between the containerd-shim versions, see 《Containerd shim 原理深入解讀》 (a deep dive into containerd shims)

Creating the cniInstance

The cniInstance configures the container's network. It is not used while building the vm filesystem, but it is initialized along the way.

Creating the cniInstance depends on the providers.Runtime obtained above, meaning it configures CNI networking for that particular container runtime. The result is stored in providers.NetworkPlugin.

This step simply initializes a cni instance, cniInstance, via gocni.New; later cniInstance.Setup is called to set up the container's network (see the "CNI" section below).

func GetCNINetworkPlugin(runtime runtime.Interface) (network.Plugin, error) {
	// If the CNI configuration directory doesn't exist, create it
	if !util.DirExists(CNIConfDir) {
		if err := os.MkdirAll(CNIConfDir, constants.DATA_DIR_PERM); err != nil {
			return nil, err
		}
	}

	binDirs := []string{CNIBinDir}
	cniInstance, err := gocni.New(gocni.WithMinNetworkCount(2),
		gocni.WithPluginConfDir(CNIConfDir),
		gocni.WithPluginDir(binDirs))
	if err != nil {
		return nil, err
	}

	return &cniNetworkPlugin{
		runtime: runtime,
		cni:     cniInstance,
		once:    &sync.Once{},
	}, nil
}
	// CNIBinDir describes the directory where the CNI binaries are stored
	CNIBinDir = "/opt/cni/bin"
	// CNIConfDir describes the directory where the CNI plugin's configuration is stored
	CNIConfDir = "/etc/cni/net.d"
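
For reference, a minimal go-cni sketch of what happens under the hood (the container id and netns path are illustrative; in ignite this is driven by SetupContainerNetwork):

package main

import (
	"context"
	"log"

	gocni "github.com/containerd/go-cni"
)

func main() {
	// Mirror GetCNINetworkPlugin: a cni instance over /etc/cni/net.d and /opt/cni/bin.
	cni, err := gocni.New(
		gocni.WithMinNetworkCount(2),
		gocni.WithPluginConfDir("/etc/cni/net.d"),
		gocni.WithPluginDir([]string{"/opt/cni/bin"}))
	if err != nil {
		log.Fatal(err)
	}

	// Load the network configuration lists (plus the loopback network).
	if err := cni.Load(gocni.WithLoNetwork, gocni.WithDefaultConf); err != nil {
		log.Fatal(err)
	}

	// Attach a container's network namespace to the configured networks.
	result, err := cni.Setup(context.Background(), "ignite-example", "/proc/12345/ns/net")
	if err != nil {
		log.Fatal(err)
	}
	for name, iface := range result.Interfaces {
		log.Printf("interface %s: %+v", name, iface.IPConfigs)
	}
}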
Pulling the base image

ignite first decides whether the image already exists by looking it up in its image metadata. If not, the containerdClient searches containerd's local store (when the runtime is containerd; similar to running ctr --namespace firecracker images ls), and only if that also fails is the image pulled from the remote registry.

func FindOrImportImage(c *client.Client, ociRef meta.OCIImageRef) (*api.Image, error) {
	log.Debugf("Ensuring image %s exists, or importing it...", ociRef)
	image, err := c.Images().Find(filter.NewIDNameFilter(ociRef.String())) // check whether the required image already exists in the metadata
	if err == nil {
		// Return the image found
		log.Debugf("Found image with UID %s", image.GetUID())
		return image, nil
	}

	switch err.(type) {
	case *filterer.NonexistentError:
		return importImage(c, ociRef) // load the image from local containerd storage or remote
	default:
		return nil, err
	}
}

Let's look at the imageClient initialization: it is given the Storage and the gvk of the image resource, the same logic as looking up kubernetes resources with client-go. The kernel client and vm client are initialized the same way as the image client, only with kind set to the corresponding type.

func newImageClient(s storage.Storage, gv schema.GroupVersion) ImageClient {
	return &imageClient{
		storage:  s,
		filterer: filterer.NewFilterer(s),
		gvk:      gv.WithKind(api.KindImage.Title()),
	}
}

The main handler importImage is shown below. After the image is loaded from local containerd storage or the remote registry, an image object of the given gvk is initialized and populated: the image name, the image's OCI reference (e.g. weaveworks/ignite-ubuntu:latest) and the image UID. The UID identifies a unique image object (note that the UID refers to the CRD object, not the image's SHA digest); the image metadata can be inspected in /var/lib/firecracker/image/<UID>/metadata.json.

Once the image object is configured, ignite (via dmlegacy.CreateImageFilesystem) creates a file named image.ext4 under /var/lib/firecracker/image/<UID>/, sizes it with truncate, and formats it as an empty ext4 filesystem with "mkfs.ext4 -b 4096 -I 256 -F -E lazy_itable_init=0,lazy_journal_init=0 /var/lib/firecracker/image/<UID>/image.ext4". image.ext4 is then attached to a /dev/loop device to form a virtual filesystem, which is mounted so the base image files can be imported into it (details in "Creating the base filesystem file"); the filesystem is finally unmounted, completing the base filesystem file (image.ext4). Lastly the image metadata is saved into ignite's storage for later lookup:

func importImage(c *client.Client, ociRef meta.OCIImageRef) (*api.Image, error) {
	log.Debugf("Importing image with ociRef %q", ociRef)
	// Parse the source
	dockerSource := source.NewDockerSource()
	src, err := dockerSource.Parse(ociRef) // load the image from local containerd storage or pull it from remote
	if err != nil {
		return nil, err
	}

	image := c.Images().New() // initialize an ignite image object
	// Set the image name
	image.Name = ociRef.String()
	// Set the image's ociRef
	image.Spec.OCI = ociRef
	// Set the image's ociSource
	image.Status.OCISource = *src

	// Generate UID automatically
	if err := metadata.SetNameAndUID(image, c); err != nil { // set the image object's name and UID
		return nil, err
	}

	log.Infoln("Starting image import...")

	// Truncate a file for the filesystem, format it with ext4, and copy in the files from the source
	if err := dmlegacy.CreateImageFilesystem(image, dockerSource); err != nil { // create the ext4 filesystem and import the image files
		return nil, err
	}

	if err := c.Images().Set(image); err != nil { // save the new image's metadata
		return nil, err
	}

	log.Infof("Imported OCI image %q (%s) to base image with UID %q", ociRef, image.Status.OCISource.Size, image.GetUID())
	return image, nil
}

Now let's see how ignite loads images from containerd or a remote registry. The implementation is straightforward and uses providers.Runtime: it first looks up the local image with providers.Runtime.InspectImage and pulls it from remote if missing (ctr --namespace firecracker images pull):

func (ds *DockerSource) Parse(ociRef meta.OCIImageRef) (*api.OCIImageSource, error) {
	res, err := providers.Runtime.InspectImage(ociRef)
	if err != nil {
		log.Infof("%s image %q not found locally, pulling...", providers.Runtime.Name(), ociRef)
		if err := providers.Runtime.PullImage(ociRef); err != nil {
			return nil, err
		}

		if res, err = providers.Runtime.InspectImage(ociRef); err != nil {
			return nil, err
		}
	}

	if res.Size == 0 || res.ID == nil {
		return nil, fmt.Errorf("parsing %s image %q data failed", providers.Runtime.Name(), ociRef)
	}

	ds.imageRef = ociRef

	return &api.OCIImageSource{
		ID:   res.ID,
		Size: meta.NewSizeFromBytes(uint64(res.Size)),
	}, nil
}

Once the image is loaded, its metadata can be found under constants.DATA_DIR (/var/lib/firecracker). Below is the image metadata of weaveworks/ignite-ubuntu:latest, stored under /var/lib/firecracker/image/<UID>. metadata.json holds the image metadata as JSON; the CRD gv is ignite.weave.works/v1alpha4 and the kind is Image:

# cd /var/lib/firecracker/image/669a5721d130ef1d

# ll
-rw-r--r--. 1 root root 295698432 Jul 14 10:53 image.ext4
-rw-r--r--. 1 root root       464 Jul 14 10:53 metadata.json

# cat metadata.json 
{
  "kind": "Image",
  "apiVersion": "ignite.weave.works/v1alpha4",
  "metadata": {
    "name": "weaveworks/ignite-ubuntu:latest",
    "uid": "669a5721d130ef1d",
    "created": "2023-07-14T02:53:01Z"
  },
  "spec": {
    "oci": "weaveworks/ignite-ubuntu:latest"
  },
  "status": {
    "ociSource": {
      "id": "oci://docker.io/weaveworks/ignite-ubuntu@sha256:52414720f26c808bc1273845c6d0f0a99472dfa8eaf8df52429261cbac27f1ba",
      "size": "249308KB"
    }
  }
}

The image object is defined as follows; it corresponds directly to the metadata.json above:

type Image struct {
	runtime.TypeMeta `json:",inline"`
	// runtime.ObjectMeta is also embedded into the struct, and defines the human-readable name, and the machine-readable ID
	// Name is available at the .metadata.name JSON path
	// ID is available at the .metadata.uid JSON path (the Go type is k8s.io/apimachinery/pkg/types.UID, which is only a typed string)
	runtime.ObjectMeta `json:"metadata"`

	Spec   ImageSpec   `json:"spec"`
	Status ImageStatus `json:"status"`
}

// ImageSpec declares what the image contains
type ImageSpec struct {
	OCI meta.OCIImageRef `json:"oci"`
}

type OCIImageSource struct {
	// ID defines the source's content ID (e.g. the canonical OCI path or Docker image ID)
	ID *meta.OCIContentID `json:"id"`
	// Size defines the size of the source in bytes
	Size meta.Size `json:"size"`
}

// ImageStatus defines the status of the image
type ImageStatus struct {
	// OCISource contains the information about how this OCI image was imported
	OCISource OCIImageSource `json:"ociSource"`
}

Creating the base filesystem file

As mentioned above, ignite's Storage holds the image metadata, while the image content itself is imported into the filesystem file created with mkfs.ext4. Let's look at that process.

  1. First locate the image.ext4 file created by mkfs, create a temporary directory, and mount image.ext4 onto it.
  2. Export the image as a tar archive (similar to docker export) and extract that tar into the temporary directory (/dev/loop presents the file as a virtual filesystem)
  3. Configure the /etc/resolv.conf file, mainly making sure it exists
  4. Unmount and remove the temporary directory; this completes the base image filesystem file. The implementation, addFiles, follows:
func addFiles(img *api.Image, src source.Source) (err error) {
	log.Debugf("Copying in files to the image file from a source...")
	p := path.Join(img.ObjectPath(), constants.IMAGE_FS) // path of the ext4 file created by mkfs
	tempDir, err := ioutil.TempDir("", "") // create a temporary directory
	if err != nil {
		return
	}
	defer os.RemoveAll(tempDir)

	if _, err := util.ExecuteCommand("mount", "-o", "loop", p, tempDir); err != nil { //掛載ext4文件系統到臨時目錄
		return fmt.Errorf("failed to mount image %q: %v", p, err)
	}
	defer util.DeferErr(&err, func() error {
		_, execErr := util.ExecuteCommand("umount", tempDir)
		return execErr
	})

	err = source.TarExtract(src, tempDir) // extract the base image into the temporary directory
	if err != nil {
		return
	}

	err = setupResolvConf(tempDir) // make sure /etc/resolv.conf exists

	return
}
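
These steps can be reproduced by hand as a sketch (the size, file names and tar source are illustrative):

# create and format an empty ext4 filesystem file
$ truncate -s 256M image.ext4
$ mkfs.ext4 -b 4096 -I 256 -F -E lazy_itable_init=0,lazy_journal_init=0 image.ext4

# loop-mount it and unpack an exported container filesystem into it
$ mkdir -p /tmp/rootfs
$ mount -o loop image.ext4 /tmp/rootfs
$ tar -xf rootfs.tar -C /tmp/rootfs
$ umount /tmp/rootfs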

Building the vm kernel files

image

Like building the base filesystem file, building the kernel files requires pulling the needed kernel image. It likewise involves "Creating the containerdClient" and "Creating the cniInstance"; the difference is that the kind in the gvk is Kernel.

The method that creates the kernel files is shown below. It broadly mirrors the base filesystem flow, but not every file in the kernel image is needed: only the image's /boot and /lib directories, and /boot must contain a vmlinux file (vmlinux is required for kvm to create a vm). The steps:

  1. Find the kernel image (locally, or pull it from remote)
  2. Create a kernel object and set its fields, such as the name and UID
  3. Extract the /boot and /lib/modules directories from the kernel image
  4. Copy the vmlinux file into the constants.DATA_DIR path
  5. Pack the extracted files into kernel.tar under constants.DATA_DIR, to be merged with the base filesystem later. The directory layout:
$ pwd
/var/lib/firecracker/kernel/1bdd3b2354873157

$ ll
-rw-r--r--. 1 root root 73574400 Jul 14 10:53 kernel.tar
-rw-r--r--. 1 root root      492 Jul 14 10:53 metadata.json
-rwxr-xr-x. 1 root root 43526368 Jul 14 10:53 vmlinux

Since the kernel files are later placed into the filesystem, no separate filesystem is built for them; the needed files are just copied and packed locally. During "Create vm" the packed kernel files are extracted into the base filesystem and merged with it:

// importKernel imports a kernel from an OCI image
func importKernel(c *client.Client, ociRef meta.OCIImageRef) (*api.Kernel, error) {
	log.Debugf("Importing kernel with ociRef %q", ociRef)
	// Parse the source
	dockerSource := source.NewDockerSource()
	src, err := dockerSource.Parse(ociRef) // load the image from local containerd storage or a remote registry
	if err != nil {
		return nil, err
	}

	kernel := c.Kernels().New() // initialize a kernel object
	// Set the kernel name
	kernel.Name = ociRef.String()
	// Set the kernel's ociRef
	kernel.Spec.OCI = ociRef
	// Set the kernel's ociSource
	kernel.Status.OCISource = *src

	// Generate UID automatically
	if err := metadata.SetNameAndUID(kernel, c); err != nil { // set the kernel object's name and UID
		return nil, err
	}

	// Cache the kernel contents in the kernel tar file
	kernelTarFile := path.Join(kernel.ObjectPath(), constants.KERNEL_TAR)

	// vmlinuxFile describes the uncompressed kernel file at /var/lib/firecracker/kernel/<id>/vmlinux
	vmlinuxFile := path.Join(kernel.ObjectPath(), constants.KERNEL_FILE)

	// Create both the kernel tar file and the vmlinux file if either doesn't exist
	if !util.FileExists(kernelTarFile) || !util.FileExists(vmlinuxFile) {
		// Create a temporary directory for extracting
		// the necessary files from the OCI image
		tempDir, err := ioutil.TempDir("", "")
		if err != nil {
			return nil, err
		}

		// Extract only the /boot and /lib directories of the tar stream into the tempDir
		err = source.TarExtract(dockerSource, tempDir, "boot", "lib/modules") // extract only the needed kernel files into the temp dir
		if err != nil {
			return nil, err
		}

		// Locate the kernel file in the temporary directory
		kernelTmpFile, err := findKernel(tempDir) // locate the vmlinux file
		if err != nil {
			return nil, err
		}

		// Copy the vmlinux file
		if err := util.CopyFile(kernelTmpFile, vmlinuxFile); err != nil {
			return nil, fmt.Errorf("failed to copy kernel file %q to kernel %q: %v", kernelTmpFile, kernel.GetUID(), err)
		}

		// pack the extracted kernel files into /var/lib/firecracker/kernel/<UID>/kernel.tar
		if _, err := util.ExecuteCommand("tar", "-cf", kernelTarFile, "-C", tempDir, "."); err != nil {
			return nil, err
		}

		// remove the temporary directory
		if err := os.RemoveAll(tempDir); err != nil {
			return nil, err
		}
	}

	// Populate the kernel version field if possible
	if len(kernel.Status.Version) == 0 {
		cmd := fmt.Sprintf("strings %s | grep 'Linux version' | awk '{print $3}'", vmlinuxFile)
		// Use the pipefail option to return an error if any of the pipeline commands is not available
		out, err := util.ExecuteCommand("/bin/bash", "-o", "pipefail", "-c", cmd)
		if err != nil {
			kernel.Status.Version = "<unknown>"
		} else {
			kernel.Status.Version = out
		}
	}

	if err := c.Kernels().Set(kernel); err != nil { // save the kernel object into Storage
		return nil, err
	}

	log.Infof("Imported OCI image %q (%s) to kernel image with UID %q", ociRef, kernel.Status.OCISource.Size, kernel.GetUID())
	return kernel, nil
}

The kernel image metadata looks like this:

# cat metadata.json 
{
  "kind": "Kernel",
  "apiVersion": "ignite.weave.works/v1alpha4",
  "metadata": {
    "name": "weaveworks/ignite-kernel:5.10.51",
    "uid": "1bdd3b2354873157",
    "created": "2023-07-14T02:53:10Z"
  },
  "spec": {
    "oci": "weaveworks/ignite-kernel:5.10.51"
  },
  "status": {
    "version": "5.10.51",
    "ociSource": {
      "id": "oci://docker.io/weaveworks/ignite-kernel@sha256:a992aa9f7b6f5e7945e72610017c3f4f38338ff1452964e30410bb6110a794a7",
      "size": "72588KB"
    }
  }
}

The kernel object is defined as follows, corresponding to the metadata.json above:

type Kernel struct {
	runtime.TypeMeta `json:",inline"`
	// runtime.ObjectMeta is also embedded into the struct, and defines the human-readable name, and the machine-readable ID
	// Name is available at the .metadata.name JSON path
	// ID is available at the .metadata.uid JSON path (the Go type is k8s.io/apimachinery/pkg/types.UID, which is only a typed string)
	runtime.ObjectMeta `json:"metadata"`

	Spec   KernelSpec   `json:"spec"`
	Status KernelStatus `json:"status"`
}

// KernelSpec describes the properties of a kernel
type KernelSpec struct {
	OCI meta.OCIImageRef `json:"oci"`
	// Optional future feature, support per-kernel specific default command lines
	// DefaultCmdLine string
}

// KernelStatus describes the status of a kernel
type KernelStatus struct {
	Version   string         `json:"version"`
	OCISource OCIImageSource `json:"ociSource"`
}

Create vm

image

The command that creates a vm is ignite vm create. This step only prepares the vm for boot; to actually boot it, run ignite vm start.

Configuring the vm object

First a vm object is initialized, which includes:

  1. Setting the vm object's image, runtime and network
  2. Merging in custom parameters passed on the command line
  3. Validating the vm object
  4. Pulling the base and kernel images if needed, and setting the image and kernel info on the vm object

To run a vm you can either execute ignite vm create followed by ignite vm start, or simply ignite vm run.
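
For example (the flag names follow the upstream CLI; treat them as illustrative):

$ ignite vm create weaveworks/ignite-ubuntu:latest --name my-vm --ssh
$ ignite vm start my-vm
# or, in one step:
$ ignite vm run weaveworks/ignite-ubuntu:latest --name my-vm --ssh

The NewCreateOptions flow that backs vm create is shown below: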

func (cf *CreateFlags) NewCreateOptions(args []string, fs *flag.FlagSet) (*CreateOptions, error) {
	// Create a new base VM and configure it by combining the component config,
	// VM config file and flags.
	baseVM := providers.Client.VMs().New() // initialize a vm object

	// If component config is in use, set the VMDefaults on the base VM.
	if providers.ComponentConfig != nil {
		baseVM.Spec = providers.ComponentConfig.Spec.VMDefaults
	}

	// Resolve registry configuration used for pulling image if required.
	cmdutil.ResolveRegistryConfigDir()

	// Initialize the VM's Prefixer
	baseVM.Status.IDPrefix = providers.IDPrefix // set the vm object's basic info
	// Set the runtime and network-plugin on the VM, then override the global config.
	baseVM.Status.Runtime.Name = providers.RuntimeName // set the runtime and CNI plugin
	baseVM.Status.Network.Plugin = providers.NetworkPluginName
	// Populate the runtime and network-plugin providers.
	if err := config.SetAndPopulateProviders(providers.RuntimeName, providers.NetworkPluginName); err != nil {
		return nil, err
	}

	// Set the passed image argument on the new VM spec.
	// Image is necessary while serializing the VM spec.
	if len(args) == 1 {
		ociRef, err := meta.NewOCIImageRef(args[0])
		if err != nil {
			return nil, err
		}
		baseVM.Spec.Image.OCI = ociRef
	}

	// Generate a VM name and UID if not set yet.
	if err := metadata.SetNameAndUID(baseVM, providers.Client); err != nil { // set the vm's name and UID
		return nil, err
	}

	// Apply the VM config on the base VM, if a VM config is given.
	if len(cf.ConfigFile) != 0 { // if a vm config file was given, merge it into the vm object
		if err := applyVMConfigFile(baseVM, cf.ConfigFile); err != nil {
			return nil, err
		}
	}

	// Apply flag overrides.
	if err := applyVMFlagOverrides(baseVM, cf, fs); err != nil { // override vm object fields with command-line flags
		return nil, err
	}

	// If --require-name is true, VM name must be provided.
	if cf.RequireName && len(baseVM.Name) == 0 {
		return nil, fmt.Errorf("must set VM name, flag --require-name set")
	}

	// Assign the new VM to the configFlag.
	cf.VM = baseVM

	// Validate the VM object.
	if err := validation.ValidateVM(cf.VM).ToAggregate(); err != nil { // validate the vm object
		return nil, err
	}

	co := &CreateOptions{CreateFlags: cf}
	// pull the base and kernel images below, equivalent to ignite image import and ignite kernel import
	// Get the image, or import it if it doesn't exist.
	var err error
	co.image, err = operations.FindOrImportImage(providers.Client, cf.VM.Spec.Image.OCI)
	if err != nil {
		return nil, err
	}

	// Populate relevant data from the Image on the VM object.
	cf.VM.SetImage(co.image) // populate the vm object's image info

	// Get the kernel, or import it if it doesn't exist.
	co.kernel, err = operations.FindOrImportKernel(providers.Client, cf.VM.Spec.Kernel.OCI)
	if err != nil {
		return nil, err
	}

	// Populate relevant data from the Kernel on the VM object.
	cf.VM.SetKernel(co.kernel) // populate the vm object's kernel info
	return co, nil
}

The vm object's metadata:

$ cat metadata.json 
{
  "kind": "VM",
  "apiVersion": "ignite.weave.works/v1alpha4",
  "metadata": {
    "name": "restless-waterfall",
    "uid": "ddf49307b5b27c34",
    "created": "2023-07-18T08:33:25Z"
  },
  "spec": {
    "image": {
      "oci": "weaveworks/ignite-ubuntu:latest"
    },
    "sandbox": {
      "oci": "weaveworks/ignite:v0.10.0"
    },
    "kernel": {
      "oci": "weaveworks/ignite-kernel:5.10.51",
      "cmdLine": "console=ttyS0 reboot=k panic=1 pci=off ip=dhcp"
    },
    "cpus": 1,
    "memory": "512MB",
    "diskSize": "4GB",
    "network": {
    },
    "storage": {
    },
    "ssh": true
  },
  "status": {
    "running": true,
    "runtime": {
      "id": "ignite-ddf49307b5b27c34",
      "name": "containerd"
    },
    "startTime": "2023-07-18T08:33:25Z",
    "network": {
      "plugin": "cni",
      "ipAddresses": [
        "10.61.0.3"
      ]
    },
    "image": {
      "id": "oci://docker.io/weaveworks/ignite-ubuntu@sha256:52414720f26c808bc1273845c6d0f0a99472dfa8eaf8df52429261cbac27f1ba",
      "size": "249308KB"
    },
    "kernel": {
      "id": "oci://docker.io/weaveworks/ignite-kernel@sha256:a992aa9f7b6f5e7945e72610017c3f4f38338ff1452964e30410bb6110a794a7",
      "size": "72588KB"
    },
    "idPrefix": "ignite"
  }
}

The vm object is defined as follows, corresponding to the metadata.json above:

type VM struct {
	runtime.TypeMeta `json:",inline"`
	// runtime.ObjectMeta is also embedded into the struct, and defines the human-readable name, and the machine-readable ID
	// Name is available at the .metadata.name JSON path
	// ID is available at the .metadata.uid JSON path (the Go type is k8s.io/apimachinery/pkg/types.UID, which is only a typed string)
	runtime.ObjectMeta `json:"metadata"`

	Spec   VMSpec   `json:"spec"`
	Status VMStatus `json:"status"`
}

// VMSpec describes the configuration of a VM
type VMSpec struct {
	Image    VMImageSpec   `json:"image"`
	Sandbox  VMSandboxSpec `json:"sandbox"`
	Kernel   VMKernelSpec  `json:"kernel"`
	CPUs     uint64        `json:"cpus"`
	Memory   meta.Size     `json:"memory"`
	DiskSize meta.Size     `json:"diskSize"`
	// TODO: Implement working omitempty without pointers for the following entries
	// Currently both will show in the JSON output as empty arrays. Making them
	// pointers requires plenty of nil checks (as their contents are accessed directly)
	// and is very risky for stability. APIMachinery potentially has a solution.
	Network VMNetworkSpec `json:"network,omitempty"`
	Storage VMStorageSpec `json:"storage,omitempty"`
	// This will be done at either "ignite start" or "ignite create" time
	// TODO: We might revisit this later
	CopyFiles []FileMapping `json:"copyFiles,omitempty"`
	// SSH specifies how the SSH setup should be done
	// nil here means "don't do anything special"
	// If SSH.Generate is set, Ignite will generate a new SSH key and copy it in to authorized_keys in the VM
	// Specifying a path in SSH.Generate means "use this public key"
	// If SSH.PublicKey is set, this struct will marshal as a string using that path
	// If SSH.Generate is set, this struct will marshal as a bool => true
	SSH *SSH `json:"ssh,omitempty"`
}

Creating the vm filesystem

At this point we have created the base filesystem file, extracted the necessary kernel files, and created a vm object, but a vm still needs a complete filesystem. "Building the vm filesystem" above only produced the base filesystem file and the kernel files separately; they still need to be merged into one complete filesystem.

Below is the entry point of the ignite vm create command. It first sets the vm's UID and name (already done in "Configuring the vm object"; here it just ensures they exist) and its labels, then saves the vm object into Storage and creates the vm filesystem:

func Create(co *CreateOptions) (err error) {
	// Generate a random UID and Name
	if err = metadata.SetNameAndUID(co.VM, providers.Client); err != nil {
		return
	}
	// Set VM labels.
	if err = metadata.SetLabels(co.VM, co.Labels); err != nil {
		return
	}
	defer util.DeferErr(&err, func() error { return metadata.Cleanup(co.VM, false) })

	if err = providers.Client.VMs().Set(co.VM); err != nil { // save the vm object into Storage
		return
	}

	// Allocate and populate the overlay file
	if err = dmlegacy.AllocateAndPopulateOverlay(co.VM); err != nil { // create the vm filesystem
		return
	}

	err = metadata.Success(co.VM)

	return
}

AllocateAndPopulateOverlay is the entry point for building the filesystem, ultimately producing a devicemapper device:

  1. First resolve the image UID from the image reference (name:tag) in the vm spec, used to locate the local /var/lib/firecracker/image/<UID>/image.ext4
  2. Use that UID to locate the base image filesystem /var/lib/firecracker/image/<imageUID>/image.ext4 and get its size; it later serves as the origin device of a devicemapper snapshot. (devicemapper snapshots are introduced below)
  3. Create the directory /var/lib/firecracker/vm/<vmUID> and the file /var/lib/firecracker/vm/<vmUID>/overlay.dm, whose size comes from the command line or the vm config file and must not be smaller than image.ext4; it later serves as the COW device of the devicemapper snapshot.
  4. Create a snapshot-type devicemapper device from image.ext4 and overlay.dm; at this point the snapshot device contains the base filesystem
  5. Extract the kernel files into the snapshot device and adjust the remaining vm filesystem configuration, such as the hostname and DNS. This yields a complete vm filesystem. The implementation:
func AllocateAndPopulateOverlay(vm *api.VM) error {
	requestedSize := vm.Spec.DiskSize.Bytes()
	// Truncate only accepts an int64
	if requestedSize > math.MaxInt64 {
		return fmt.Errorf("requested size %d too large, cannot truncate", requestedSize)
	}
	size := int64(requestedSize)

	// get the base image's UID, used to locate image.ext4 under /var/lib/firecracker/image
	imageUID, err := lookup.ImageUIDForVM(vm, providers.Client)
	if err != nil {
		return err
	}

	// Get the size of the image ext4 file
	fi, err := os.Stat(path.Join(constants.IMAGE_DIR, imageUID.String(), constants.IMAGE_FS)) // stat image.ext4
	if err != nil {
		return err
	}
	imageSize := fi.Size()

	// The overlay needs to be at least as large as the image
	if size < imageSize { //調整overlay.dm的大小
		log.Warnf("warning: requested overlay size (%s) < image size (%s), using image size for overlay\n",
			vm.Spec.DiskSize.String(), meta.NewSizeFromBytes(uint64(imageSize)).String())
		size = imageSize
	}

	// Make sure the all directories above the snapshot directory exists
	if err := os.MkdirAll(path.Dir(vm.OverlayFile()), constants.DATA_DIR_PERM); err != nil {
		return err
	}

	overlayFile, err := os.Create(vm.OverlayFile()) // create the vm's overlay file
	if err != nil {
		return fmt.Errorf("failed to create overlay file for %q, %v", vm.GetUID(), err)
	}
	defer overlayFile.Close()

	if err := overlayFile.Truncate(size); err != nil { // size the overlay file
		return fmt.Errorf("failed to allocate overlay file for VM %q: %v", vm.GetUID(), err)
	}

	// populate the filesystem
	return copyToOverlay(vm) // create the snapshot-type devicemapper device
}

The implementation of copyToOverlay follows. ActivateSnapshot creates the devicemapper snapshot storage the vm runs on; everything else adjusts the vm filesystem, such as importing the kernel files and setting up ssh.

func copyToOverlay(vm *api.VM) (err error) {
	_, err = ActivateSnapshot(vm) // create the devicemapper snapshot device the vm boots from
	if err != nil {
		return
	}
	defer util.DeferErr(&err, func() error { return DeactivateSnapshot(vm) })

	mp, err := util.Mount(vm.SnapshotDev()) // mount the snapshot device
	if err != nil {
		return
	}
	defer util.DeferErr(&err, mp.Umount)

	// Copy the kernel files to the VM. TODO: Use snapshot overlaying instead.
	// extract /var/lib/firecracker/kernel/<UID>/kernel.tar into the mount path, merging it with the base filesystem
	if err = copyKernelToOverlay(vm, mp.Path); err != nil { 
		return
	}

	// do not mutate vm.Spec.CopyFiles
	fileMappings := vm.Spec.CopyFiles

	if vm.Spec.SSH != nil { // if ssh is requested, create an ssh keypair for the vm
		pubKeyPath := vm.Spec.SSH.PublicKey
		if vm.Spec.SSH.Generate {
			// generate a key if PublicKey is empty
			pubKeyPath, err = newSSHKeypair(vm)
			if err != nil {
				return
			}
		}

		if len(pubKeyPath) > 0 {
			fileMappings = append(fileMappings, api.FileMapping{
				HostPath: pubKeyPath,
				VMPath:   vmAuthorizedKeys,
			})
		}
	}

	// TODO: File/directory permissions?
	for _, mapping := range fileMappings { // copy host files into the vm according to the file mappings
		vmFilePath := path.Join(mp.Path, mapping.VMPath)
		if err = os.MkdirAll(path.Dir(vmFilePath), constants.DATA_DIR_PERM); err != nil {
			return
		}

		if err = util.CopyFile(mapping.HostPath, vmFilePath); err != nil {
			return
		}
	}

	ip := net.IP{127, 0, 0, 1}
	if len(vm.Status.Network.IPAddresses) > 0 {
		ip = vm.Status.Network.IPAddresses[0]
	}

	// Write /etc/hosts for the VM, mapping the hostname to its IP
	if err = writeEtcHosts(mp.Path, vm.GetUID().String(), ip); err != nil {
		return
	}

	// Write the UID to /etc/hostname for the VM as its hostname
	if err = writeEtcHostname(mp.Path, vm.GetUID().String()); err != nil {
		return
	}

	// Populate /etc/fstab with the volume mounts defined in vm.Spec.Storage
	if err = populateFstab(vm, mp.Path); err != nil {
		return
	}

	// Set overlay root permissions
	err = os.Chmod(mp.Path, constants.DATA_DIR_PERM)

	return
}

ActivateSnapshot creates the devicemapper block device used by the vm. The main steps are:

  1. Use losetup to attach the image file image.ext4 to a /dev/loop device, so the file can be presented as a virtual filesystem

  2. Use losetup to attach the overlay file overlay.dm to a /dev/loop device as well; the attached devices can be listed with losetup:

    $ losetup 
    NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE                                              DIO LOG-SEC
    /dev/loop1         0      0         1  1 /var/lib/firecracker/image/669a5721d130ef1d/image.ext4   0     512
    /dev/loop2         0      0         1  0 /var/lib/firecracker/vm/ddf49307b5b27c34/overlay.dm      0     512
    
  3. If the overlay loop device is larger than the image loop device, the image loop device has to be extended (required by the devicemapper snapshot target). This is done by creating a linear-type devicemapper device that maps the image loop device, extended with a zero-type devicemapper target, like so:

    A linear dm target joins multiple devices into one, or splits one device into multiple dm devices.

    $ dmsetup create test-snapshot <<EOF
    "0 8388608 linear /dev/loop0 0"
    "8388608 12582912 zero"
    EOF
    
  4. Use dmsetup to create a snapshot-type devicemapper device as follows, with the image loop device as the origin device and the overlay loop device as the COW device:

    $ dmsetup create ignite-ddf49307b5b27c34 --table '0 8388608 snapshot /dev/{loop0,mapper/ignite-<uid>-base} /dev/loop1 P 8'
    

    The created dm devices can be listed with:

    $ dmsetup status
    ignite-ddf49307b5b27c34: 0 8388608 snapshot 274328/8388608 1080 # the snapshot device that was created
    ignite-ddf49307b5b27c34-base: 0 577536 linear      # dm device created to extend the image loop device, mapped onto it
    ignite-ddf49307b5b27c34-base: 577536 8388608 zero  # zero target used to extend ignite-ddf49307b5b27c34-base
    

    The kernel documentation describes the snapshot target as follows: writes go only to the COW device, while reads come from the COW device, or from the origin for unchanged data. It also notes that the COW device will often be smaller than the origin, and must be expanded before it fills up.

    *) snapshot <origin> <COW device> <persistent?> <chunksize>
    
    A snapshot of the <origin> block device is created. Changed chunks of
    <chunksize> sectors will be stored on the <COW device>.  Writes will
    only go to the <COW device>.  Reads will come from the <COW device> or
    from <origin> for unchanged data.  <COW device> will often be
    smaller than the origin and if it fills up the snapshot will become
    useless and be disabled, returning errors.  So it is important to monitor
    the amount of free space and expand the <COW device> before it fills up.
    
    <persistent?> is P (Persistent) or N (Not persistent - will not survive
    after reboot).  O (Overflow) can be added as a persistent store option
    to allow userspace to advertise its support for seeing "Overflow" in the
    snapshot status.  So supported store types are "P", "PO" and "N".
    

    Creating a snapshot-type dm device takes two parts: an origin device, which is read-only, and a COW device, which is readable and writable. The following walks through how this target type is used:

    $ mkdir -p /tmp/mnt
    
    # copy a vm image file and attach it to /dev/loop5
    $ cp /var/lib/firecracker/image/669a5721d130ef1d/image.ext4 /home
    $ losetup /dev/loop5 /home/image.ext4
    
    # create an overlay file of the same size as the vm image and attach it to /dev/loop6
    $ dd if=/dev/zero of=overlay.dm  bs=512 count=577536
    $ losetup /dev/loop6 overlay.dm
    
    # get the device size in 512-byte sectors
    $ blockdev --getsz /dev/loop5
    577536
    
    # create the snapshot device and mount it at /tmp/mnt
    $ dmsetup create test-snapshot --table '0 577536 snapshot /dev/loop5 /dev/loop6 P 8'
    $ mount /dev/mapper/test-snapshot /tmp/mnt
    
    # detach the loop devices so they are released automatically once the snapshot device is removed
    $ losetup -d /dev/loop5
    $ losetup -d /dev/loop6
    

    Inspecting the mounted directory shows a complete linux filesystem. Changes made to it do not affect the vm image (after modifying, you can umount the snapshot device and mount the image's /dev/loop5 alone to confirm nothing changed; recreating the snapshot brings the changes back):

    $ ll /tmp/mnt/
    total 76
    lrwxrwxrwx.  1 root root     7 Oct  7  2021 bin -> usr/bin
    drwxr-xr-x.  2 root root  4096 Apr 15  2020 boot
    drwxr-xr-x.  2 root root  4096 Oct  7  2021 dev
    drwxr-xr-x. 52 root root  4096 Jul 14 10:52 etc
    drwxr-xr-x.  2 root root  4096 Apr 15  2020 home
    lrwxrwxrwx.  1 root root     7 Oct  7  2021 lib -> usr/lib
    lrwxrwxrwx.  1 root root     9 Oct  7  2021 lib32 -> usr/lib32
    lrwxrwxrwx.  1 root root     9 Oct  7  2021 lib64 -> usr/lib64
    lrwxrwxrwx.  1 root root    10 Oct  7  2021 libx32 -> usr/libx32
    drwx------.  2 root root 16384 Jul 14 10:52 lost+found
    drwxr-xr-x.  2 root root  4096 Oct  7  2021 media
    drwxr-xr-x.  2 root root  4096 Oct  7  2021 mnt
    drwxr-xr-x.  2 root root  4096 Oct  7  2021 opt
    drwxr-xr-x.  2 root root  4096 Apr 15  2020 proc
    drwx------.  2 root root  4096 Oct  7  2021 root
    drwxr-xr-x.  8 root root  4096 Nov  9  2021 run
    lrwxrwxrwx.  1 root root     8 Oct  7  2021 sbin -> usr/sbin
    drwxr-xr-x.  2 root root  4096 Oct  7  2021 srv
    drwxr-xr-x.  2 root root  4096 Apr 15  2020 sys
    drwxrwxrwt.  2 root root  4096 Oct  7  2021 tmp
    drwxr-xr-x. 13 root root  4096 Oct  7  2021 usr
    drwxr-xr-x. 11 root root  4096 Oct  7  2021 var
    

    Cleanup:

    $ umount  /tmp/mnt
    $ dmsetup remove  test-snapshot
    
  5. Run e2fsck to repair any filesystem errors that may exist:

    $ e2fsck -p -f /dev/mapper/<snapshot>
    
  6. Use losetup to detach the image and overlay loop devices, so that when the snapshot device is removed, the underlying loop devices are also removed automatically:

    $ losetup -d /dev/loop0
    $ losetup -d /dev/loop1
    

    This completes the vm's filesystem device; it only needs to be mounted to be used by the vm.

Configuring ssh

While creating the vm filesystem (in the copyToOverlay method), ssh is configured; the code is below. It mainly copies a public key (a .pub file) into the vm's /root/.ssh/authorized_keys. If vm.Spec.SSH.Generate is true, a new keypair is generated with ssh-keygen: the private key at /var/lib/firecracker/vm/<UID>/id_<UID> and the public key alongside it. The public key is again copied into the vm's /root/.ssh/authorized_keys, while the private key is used by the ssh client to connect.

This makes it possible to ssh into the vm later:

	if vm.Spec.SSH != nil {
		pubKeyPath := vm.Spec.SSH.PublicKey
		if vm.Spec.SSH.Generate {
			// generate a key if PublicKey is empty
			pubKeyPath, err = newSSHKeypair(vm)
			if err != nil {
				return
			}
		}

		if len(pubKeyPath) > 0 {
			fileMappings = append(fileMappings, api.FileMapping{
				HostPath: pubKeyPath,
				VMPath:   vmAuthorizedKeys,
			})
		}
	}

	for _, mapping := range fileMappings {
		vmFilePath := path.Join(mp.Path, mapping.VMPath)
		if err = os.MkdirAll(path.Dir(vmFilePath), constants.DATA_DIR_PERM); err != nil {
			return
		}

		if err = util.CopyFile(mapping.HostPath, vmFilePath); err != nil {
			return
		}
	}

The ssh keypair is generated as follows:

// Generate a new SSH keypair for the vm
func newSSHKeypair(vm *api.VM) (string, error) {
	privKeyPath := path.Join(vm.ObjectPath(), fmt.Sprintf(constants.VM_SSH_KEY_TEMPLATE, vm.GetUID()))
	// TODO: In future versions, let the user specify what key algorithm to use through the API types
	sshKeyAlgorithm := "ed25519"
	if util.FIPSEnabled() {
		// Use rsa on FIPS machines
		sshKeyAlgorithm = "rsa"
	}
	_, err := util.ExecuteCommand("ssh-keygen", "-q", "-t", sshKeyAlgorithm, "-N", "", "-f", privKeyPath)
	if err != nil {
		return "", err
	}

	return fmt.Sprintf("%s.pub", privKeyPath), nil
}
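
With a generated keypair, connecting to the vm should look roughly like this (the UID and IP are the example values from this article's vm metadata):

$ ssh -i /var/lib/firecracker/vm/ddf49307b5b27c34/id_ddf49307b5b27c34 root@10.61.0.3
# or let ignite resolve the key and IP itself:
$ ignite ssh restless-waterfall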

Start vm

An ignite vm is really just a vm created by the firecracker command inside a container. So creating a vm starts with starting a container. The container has its own filesystem, which in the figure below is provided by the ignite image. The other filesystem is the vm's: the devicemapper device we created above, which firecracker later mounts as the vm's root fs.

image

The command to start a vm is ignite vm start. It essentially boots the vm object produced by ignite vm create.

image

The first step looks up the vm object in Storage by its name, then starts it. The parameter so carries the vm object and its options. After the vm starts, ssh connections and vm attach are handled as needed:

func Start(so *StartOptions, fs *flag.FlagSet) error {
	// Check if the given VM is already running
	if so.vm.Running() {
		return fmt.Errorf("VM %q is already running", so.vm.GetUID())
	}
  
	// configure the names and clients of the runtime and network plugin below
	// Stopped VMs don't contain the runtime and network information. Set the
	// default runtime and network from the providers if empty.
	if so.vm.Status.Runtime.Name == "" {
		so.vm.Status.Runtime.Name = providers.RuntimeName
	}
	if so.vm.Status.Network.Plugin == "" {
		so.vm.Status.Network.Plugin = providers.NetworkPluginName
	}

	// In case the runtime and network-plugin are specified explicitly at
	// start, set the runtime and network-plugin on the VM. This overrides the
	// global config and config on the VM object, if any.
	if fs.Changed("runtime") {
		so.vm.Status.Runtime.Name = providers.RuntimeName
	}
	if fs.Changed("network-plugin") {
		so.vm.Status.Network.Plugin = providers.NetworkPluginName
	}

	// Set the runtime and network-plugin providers from the VM status.
	if err := config.SetAndPopulateProviders(so.vm.Status.Runtime.Name, so.vm.Status.Network.Plugin); err != nil {
		return err
	}

	// preflight checks, mainly for file existence: required executables, CNI files, and devices such as /dev/kvm, /dev/net/tun and /dev/mapper/control
	ignoredPreflightErrors := sets.NewString(util.ToLower(so.StartFlags.IgnoredPreflightErrors)...)
	if err := checkers.StartCmdChecks(so.vm, ignoredPreflightErrors); err != nil {
		return err
	}

	// start the vm
	if err := operations.StartVM(so.vm, so.Debug); err != nil {
		return err
	}

	// wait for the ssh service to become ready
	// When --ssh is enabled, wait until SSH service started on port 22 at most N seconds
	if ssh := so.vm.Spec.SSH; ssh != nil && ssh.Generate && len(so.vm.Status.Network.IPAddresses) > 0 {
		if err := waitForSSH(so.vm, constants.SSH_DEFAULT_TIMEOUT_SECONDS, constants.IGNITE_SPAWN_TIMEOUT); err != nil {
			return err
		}
	}

	// If starting interactively, attach after starting
	if so.Interactive {
		return Attach(so.AttachOptions)
	}
	return nil
}

StartVM is the entry point for booting the vm. vmChans.SpawnFinished is used to verify that the vm object was successfully saved into Storage; the timeout is 2 minutes, after which a start failure error is returned.

func StartVM(vm *api.VM, debug bool) error {

	vmChans, err := StartVMNonBlocking(vm, debug)
	if err != nil {
		return err
	}

	if err := <-vmChans.SpawnFinished; err != nil {
		return err
	}

	return nil
}

Starting a vm requires some things to be prepared in advance, such as the filesystem, the network and the directory mounts. Below is the method that starts a vm; once you understand how the vm starts, you basically understand how ignite works.

  1. First check whether the vm's container already exists, and remove it if so. Note that RemoveContainer asks the containerdClient to delete the container; a running container cannot be deleted, in which case this returns directly and interrupts the rest of the flow (a missing kill?)
  2. Call ActivateSnapshot to set up the snapshot device the container needs
  3. Resolve the vm directory (/var/lib/firecracker/vm/<UID>) and the kernel directory (/var/lib/firecracker/kernel/<UID>), used to mount the vm's metadata.json file and the kernel's vmlinux file; firecracker later boots the vm from these two files
  4. Add the environment variables, the mounted devices (such as /dev/mapper/control and /dev/net/tun) and custom directories, including the vm's filesystem. Use ctr --namespace firecracker containers info <UID> to inspect a vm's mounts.
  5. Call providers.Runtime.RunContainer to start the vm's container
  6. Set up the container's cni network
  7. Set the vm object's runtime field, used later to determine which runtime the vm uses
  8. Save the vm metadata into Storage.
  9. Wait via vmChans.SpawnFinished for the vm to be created. The implementation:
func StartVMNonBlocking(vm *api.VM, debug bool) (*VMChannels, error) {
	// Inspect the VM container and remove it if it exists
	inspectResult, _ := providers.Runtime.InspectContainer(vm.PrefixedID())
	RemoveVMContainer(inspectResult)

	// Make sure we always initialize all channels
	vmChans := &VMChannels{
		SpawnFinished: make(chan error),
	}

	// Setup the snapshot overlay filesystem
	snapshotDevPath, err := dmlegacy.ActivateSnapshot(vm)
	if err != nil {
		return vmChans, err
	}

	kernelUID, err := lookup.KernelUIDForVM(vm, providers.Client)
	if err != nil {
		return vmChans, err
	}

	// resolve the vm and kernel directories, used to mount the vm metadata and the kernel's vmlinux file
	vmDir := filepath.Join(constants.VM_DIR, vm.GetUID().String())
	kernelDir := filepath.Join(constants.KERNEL_DIR, kernelUID.String())

	// Verify that the image containing ignite-spawn is pulled
	// TODO: Integrate automatic pulling into pkg/runtime
	// verify that the sandbox image (which contains ignite-spawn) exists, pulling it if missing
	if err := verifyPulled(vm.Spec.Sandbox.OCI); err != nil {
		return vmChans, err
	}

	// configure the mounts: mainly the /var/lib/firecracker/vm/<UID>/ directory, the vmlinux file, and some devices under /dev
	config := &runtime.ContainerConfig{
		Cmd: []string{
			fmt.Sprintf("--log-level=%s", logs.Logger.Level.String()),
			vm.GetUID().String(),
		},
		Labels: map[string]string{"ignite.name": vm.GetName()},
		Binds: []*runtime.Bind{
			{
				HostPath:      vmDir,
				ContainerPath: vmDir,
			},
			{
				// Mount the metadata.json file specifically into the container, to a well-known place for ignite-spawn to access
				HostPath:      path.Join(vmDir, constants.METADATA),
				ContainerPath: constants.IGNITE_SPAWN_VM_FILE_PATH,
			},
			{
				// Mount the vmlinux file specifically into the container, to a well-known place for ignite-spawn to access
				HostPath:      path.Join(kernelDir, constants.KERNEL_FILE),
				ContainerPath: constants.IGNITE_SPAWN_VMLINUX_FILE_PATH,
			},
		},
		CapAdds: []string{
			"SYS_ADMIN", // Needed to run "dmsetup remove" inside the container
			"NET_ADMIN", // Needed for removing the IP from the container's interface
		},
		Devices: []*runtime.Bind{
			runtime.BindBoth("/dev/mapper/control"), // This enables containerized Ignite to remove its own dm snapshot
			runtime.BindBoth("/dev/net/tun"),        // Needed for creating TAP adapters
			runtime.BindBoth("/dev/kvm"),            // Pass through virtualization support
			runtime.BindBoth(snapshotDevPath),       // The block device to boot from
		},
		StopTimeout:  constants.STOP_TIMEOUT + constants.IGNITE_TIMEOUT,
		PortBindings: vm.Spec.Network.Ports, // Add the port mappings to Docker
	}

	// configure environment variables
	var envVars []string
	for k, v := range vm.GetObjectMeta().Annotations {
		if strings.HasPrefix(k, constants.IGNITE_SANDBOX_ENV_VAR) {
			k := strings.TrimPrefix(k, constants.IGNITE_SANDBOX_ENV_VAR)
			envVars = append(envVars, fmt.Sprintf("%s=%s", k, v))
		}
	}
	config.EnvVars = envVars

	// add custom volume mounts
	for _, volume := range vm.Spec.Storage.Volumes {
		if volume.BlockDevice == nil {
			continue // Skip all non block device volumes for now
		}

		config.Devices = append(config.Devices, &runtime.Bind{
			HostPath:      volume.BlockDevice.Path,
			ContainerPath: path.Join(constants.IGNITE_SPAWN_VOLUME_DIR, volume.Name),
		})
	}

	// Prepare the networking for the container, for the given network plugin
	if err := providers.NetworkPlugin.PrepareContainerSpec(config); err != nil {
		return vmChans, err
	}

	// If we're not debugging, remove the container post-run
	if !debug {
		config.AutoRemove = true
	}

	// Run the VM container in Docker
	containerID, err := providers.Runtime.RunContainer(vm.Spec.Sandbox.OCI, config, vm.PrefixedID(), vm.GetUID().String())
	if err != nil {
		return vmChans, fmt.Errorf("failed to start container for VM %q: %v", vm.GetUID(), err)
	}

	// set up the CNI network
	result, err := providers.NetworkPlugin.SetupContainerNetwork(containerID, vm.Spec.Network.Ports...)
	if err != nil {
		return vmChans, err
	}

	if !logs.Quiet {
		log.Infof("Networking is handled by %q", providers.NetworkPlugin.Name())
		log.Infof("Started Firecracker VM %q in a container with ID %q", vm.GetUID(), containerID)
	}

	// Set the container ID for the VM
	vm.Status.Runtime.ID = containerID
	vm.Status.Runtime.Name = providers.RuntimeName

	// Append non-loopback runtime IP addresses of the VM to its state
	for _, addr := range result.Addresses {
		if !addr.IP.IsLoopback() {
			vm.Status.Network.IPAddresses = append(vm.Status.Network.IPAddresses, addr.IP)
		}
	}
	vm.Status.Network.Plugin = providers.NetworkPluginName

	// write the API object in a non-running state before we wait for spawn's network logic and firecracker
	if err := providers.Client.VMs().Set(vm); err != nil {
		return vmChans, err
	}

	// TODO: This is temporary until we have proper communication to the container
	// It's best to perform any imperative changes to the VM object pointer before this go-routine starts
	go waitForSpawn(vm, vmChans)

	return vmChans, nil
}

The RunContainer method below involves a snapshotService. This snapshot is not the devicemapper snapshot from before: the snapshotService here unpacks and mounts the container image, providing the filesystem the container needs to start.

Below is containerd's main directory. containerd itself is plugin-based, and the subdirectories here are created by different plugins. Use ctr plugin list to see the supported plugins:

$ cd /var/lib/containerd/
$ ll
drwxr-xr-x. 4 root root 33 Mar 15 14:14 io.containerd.content.v1.content
drwx--x--x. 2 root root 21 Mar 15 11:05 io.containerd.metadata.v1.bolt
drwx--x--x. 2 root root  6 Mar 15 11:05 io.containerd.runtime.v1.linux
drwx--x--x. 4 root root 37 Jul 14 10:53 io.containerd.runtime.v2.task
drwx------. 3 root root 23 Mar 15 11:05 io.containerd.snapshotter.v1.native
drwx------. 3 root root 42 Jul 14 10:52 io.containerd.snapshotter.v1.overlayfs
  • io.containerd.content.v1.content: stores OCI images; see the oci image spec
  • io.containerd.metadata.v1.bolt: stores the metadata of the images, containers and snapshots managed by containerd; see the source for what exactly is stored
  • io.containerd.snapshotter.v1.<type>: snapshotter directories; see the Snapshotters docs
    • io.containerd.snapshotter.v1.btrfs: directory for container snapshots created on the btrfs filesystem
    • io.containerd.snapshotter.v1.overlayfs: the default snapshotter, which creates snapshots with overlayfs.

As described above, image content is placed into io.containerd.content.v1.content, then unpacked and mounted by the snapshotter into io.containerd.snapshotter.v1.overlayfs (overlayfs here) for containers to use.

Below is an ignite vm started on this machine; the snapshotter provides the lowerdir, upperdir and workdir that overlayfs needs:

$ mount|grep ignit
overlay on /run/containerd/io.containerd.runtime.v2.task/firecracker/ignite-272a0eab-75be-4131-a022-1fde8012f9f6/rootfs type overlay (rw,relatime,seclabel,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/10/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/17/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/17/work)

containerd snapshots come in two kinds, Active and Committed, corresponding to a running container's container layer (upperdir and workdir) and the image layers (lowerdir). Changes to an Active snapshot are not persisted as a layer; to keep them, use snapshot commit to turn the snapshot into the Committed state. Use ctr --namespace firecracker snapshot ls to list the current snapshots. Snapshots form a hierarchy; use ctr --namespace firecracker snapshot tree to view it.

$ ctr --namespace firecracker snapshot ls
KEY                                                                     PARENT                                                                  KIND      
ignite-272a0eab-75be-4131-a022-1fde8012f9f6                             sha256:6d1a1092846de7c30d76df9c7aa787b50ad4dee32d32daebe0c7a87ffede14b9 Active    
sha256:0c949a3342f6400d49b4d378bf7b20b768cf09bef107cc6c5d58a1f3e50e06f3 sha256:38facc6304c0b9270805fab2c549a3fef82dce370cab3f24d922e0a3b46c2541 Committed 
sha256:38facc6304c0b9270805fab2c549a3fef82dce370cab3f24d922e0a3b46c2541                                                                         Committed 
sha256:6d1a1092846de7c30d76df9c7aa787b50ad4dee32d32daebe0c7a87ffede14b9                                                                         Committed 
sha256:9f54eef412758095c8079ac465d494a2872e02e90bf1fb5f12a1641c0d1bb78b                                                                         Committed 
sha256:ac9030d17ea3c723f7ff631b7e9c16f0d914ecf43f37b3e0f7cb5cae8012b39d sha256:f0e76d36d3129de5a1ddb77efc4963b2dfec81f9c5ca21e117198a3c2ae9f397 Committed 
sha256:bc98849e95ef9484381c1a36ce97339d7cd8675f23a37766ed47b7fcc947bb91 sha256:9f54eef412758095c8079ac465d494a2872e02e90bf1fb5f12a1641c0d1bb78b Committed 
sha256:f0e76d36d3129de5a1ddb77efc4963b2dfec81f9c5ca21e117198a3c2ae9f397 sha256:bc98849e95ef9484381c1a36ce97339d7cd8675f23a37766ed47b7fcc947bb91 Committed 
sha256:f9e99b137a1976a6aaa287cb3cddea2f6e6545707ad1302c454fd4d06ffbb2ab sha256:ac9030d17ea3c723f7ff631b7e9c16f0d914ecf43f37b3e0f7cb5cae8012b39d Committed 

Let's look at how snapshots work with an example.

# create a containerd namespace named test
$ ctr ns create test

# prepare a mount point
$ mkdir /var/lib/containerd/custom_dir  

# first commit (the root layer)
$ ctr -n test snapshot prepare activeLayer0                 # prepare creates a layer in the active state
# generate and run the snapshot filesystem mount command (a bind mount for this root layer)
$ ctr -n test snapshot mount /var/lib/containerd/custom_dir activeLayer0  | xargs sudo
$ echo "1" > /var/lib/containerd/custom_dir/add01           # add a changed file
$ umount /var/lib/containerd/custom_dir                     # umount
$ ctr  -n test snapshot commit commit_add01 activeLayer0    # commit: turn the snapshot Committed, persisting the layer

The mount command generated by snapshot mount above is: "mount -t bind /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21866/fs /var/lib/containerd/custom_dir -o rw,rbind"

Listing the snapshots shows a snapshot in the Committed state was produced:

$ ctr -n test snapshot ls
KEY          PARENT KIND      
commit_add01        Committed 

Looking into the io.containerd.snapshotter.v1.overlayfs/snapshots directory, a new folder 21866 has appeared; it is the filesystem produced by the commit:

$ ll io.containerd.snapshotter.v1.overlayfs/snapshots/21866/fs/
-rw-r----- 1 root root    2 Jul 26 01:33 add01

Now let's commit another change; first create an active snapshot with commit_add01 as its parent:

# second commit, using the first layer as the parent
$ ctr -n test snapshot prepare activeLayer0 commit_add01

Listing the snapshots shows that the active snapshot's parent is the snapshot committed above:

$ ctr -n test snapshot ls  
KEY          PARENT       KIND      
activeLayer0 commit_add01 Active    
commit_add01              Committed 

Commit the second change. In this step the mount command generated by ctr is: "mount -t overlay overlay /var/lib/containerd/custom_dir -o index=off,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21867/work,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21867/fs,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21866/fs". This is exactly the overlay filesystem a container runs on; the lowerdir is the filesystem produced by commit_add01.

$ ctr -n test snapshot mount /var/lib/containerd/custom_dir activeLayer0 | xargs sudo
$ echo "2" > /var/lib/containerd/custom_dir/add02
$ umount /var/lib/containerd/custom_dir
$ ctr -n test snapshot commit commit_add02 activeLayer0

Listing the snapshots again shows a new snapshot, commit_add02, whose parent is commit_add01; in other words, the commit produced a child snapshot.

$ ctr -n test snapshot ls
KEY          PARENT       KIND      
commit_add01              Committed 
commit_add02 commit_add01 Committed 

If we keep going and create an overlay with the new snapshot commit_add02 as the parent, will it merge in the changes from commit_add01?

$ ctr -n test snapshot prepare activeLayer0 commit_add02
$ ctr -n test snapshot mount /var/lib/containerd/custom_dir activeLayer0 

Below is the mount command generated by snapshot mount; the lowerdir now contains the filesystems of both commit_add01 and commit_add02.

$ mount -t overlay overlay /var/lib/containerd/custom_dir -o index=off,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21869/work,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21869/fs,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21867/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21866/fs

Cleanup. Note that child snapshots must be removed first, otherwise you get the error "cannot remove snapshot with child: failed precondition".

$ ctr -n test snapshot rm commit_add02
$ ctr -n test snapshot rm commit_add01

In addition, ctr snapshot view creates a read-only snapshot; writing into the mounted directory then fails with a "Read-only file system" error.
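
A quick sketch of the read-only variant (reusing a committed parent such as commit_add01, before it is removed):

$ ctr -n test snapshot view viewLayer0 commit_add01
$ ctr -n test snapshot mount /var/lib/containerd/custom_dir viewLayer0 | xargs sudo
$ echo "3" > /var/lib/containerd/custom_dir/add03
-bash: /var/lib/containerd/custom_dir/add03: Read-only file system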

To summarize: first create an active snapshot with prepare or view (read-only), mount it with the generated mount command, and after modifying the active snapshot, persist the changes with commit.

See Snapshots for more.

containerd has two concepts: container and task. A container can be seen as the environment prepared for a container to run in, such as cgroups and mounted volumes, while a task is the process running inside the container. As shown below, listing containers shows the image and runtime a container uses, while listing tasks shows the process and its status:

$ ctr -n firecracker container ls 
CONTAINER                                      IMAGE                                  RUNTIME                  
ignite-4a64e75d-c7fb-43ba-aaed-6e7923374ba5    docker.io/weaveworks/ignite:v0.10.0    io.containerd.runc.v2 
$ ctr -n firecracker task ls 
TASK                                           PID      STATUS    
ignite-4a64e75d-c7fb-43ba-aaed-6e7923374ba5    10332    RUNNING
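
The same view is available programmatically through containerd's Go client; a minimal sketch, assuming the default socket path and the firecracker namespace used above:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "firecracker")

	containers, err := client.Containers(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, c := range containers {
		info, err := c.Info(ctx)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("container %s image=%s runtime=%s\n", c.ID(), info.Image, info.Runtime.Name)

		// A container may have no task if nothing is currently running in it
		task, err := c.Task(ctx, nil)
		if err != nil {
			continue
		}
		status, err := task.Status(ctx)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("task %s pid=%d status=%s\n", task.ID(), task.Pid(), status.Status)
	}
}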

With this background, the RunContainer flow is easy to follow:

  1. First remove any existing (non-running) container with the same name
  2. Write the content of the host's /etc/resolv.conf into the runtime.containerd.resolv.conf file under the vm directory and add it to the mount configuration; it is later mounted as the container's /etc/resolv.conf
  3. Configure the OCI spec options for the container: environment variables, hostname, bind mounts, and the mounted /dev devices
  4. Create a containerd snapshot service to provide the container's rootfs
  5. Configure the container-creation options, using the OCI spec options and rootfs prepared above
  6. Create the containerd container and task, then start the task

The code below is the standard flow for starting a containerd container; if interested, you can also find usage examples in the _test.go files of the containerd source:

func (cc *ctdClient) RunContainer(image meta.OCIImageRef, config *runtime.ContainerConfig, name, id string) (s string, err error) {
	img, err := cc.client.GetImage(cc.ctx, image.Normalized())
	if err != nil {
		return
	}

	// Remove the container if it exists
	if err = cc.RemoveContainer(name); err != nil {
		return
	}

	// Load the default snapshotter
	snapshotter := cc.client.SnapshotService(containerd.DefaultSnapshotter)

	// Add the /etc/resolv.conf mount, this isn't done automatically by containerd
	// Ensure a resolv.conf exists in the vmDir. Calculate path using the vm id
	resolvConfPath := filepath.Join(constants.VM_DIR, id, resolvConfName)
	// Read the host's /etc/resolv.conf and write it into runtime.containerd.resolv.conf in the vm directory
	err = resolvconf.EnsureResolvConf(resolvConfPath, constants.DATA_DIR_FILE_PERM) 
	if err != nil {
		return
	}
	config.Binds = append(
		config.Binds,
		&runtime.Bind{
			HostPath:      resolvConfPath,
			ContainerPath: "/etc/resolv.conf", // Mount runtime.containerd.resolv.conf into the container
		},
	)

	// Add the stop timeout as a label, as containerd doesn't natively support it
	config.Labels[stopTimeoutLabel] = strconv.FormatUint(uint64(config.StopTimeout), 10)

	// Build the OCI specification
	opts := []oci.SpecOpts{
		oci.WithDefaultSpec(),
		oci.WithDefaultUnixDevices,
		oci.WithTTY,
		oci.WithImageConfigArgs(img, config.Cmd),
		oci.WithEnv(config.EnvVars),
		withAddedCaps(config.CapAdds),
		withHostname(config.Hostname),
		withMounts(config.Binds),    // Bind mounts (volumes)
		withDevices(config.Devices), // Devices
	}

	// Known limitations, containerd doesn't support the following config fields:
	// - StopTimeout
	// - AutoRemove
	// - NetworkMode (only CNI supported)
	// - PortBindings

	snapshotOpt := containerd.WithSnapshot(name)
	if _, err = snapshotter.Stat(cc.ctx, name); errdefs.IsNotFound(err) {
		// Even if "read only" is set, we don't use a KindView snapshot here (#1495).
		// We pass the writable snapshot to the OCI runtime, and the runtime remounts
		// it as read-only after creating some mount points on-demand.
		snapshotOpt = containerd.WithNewSnapshot(name, img)
	} else if err != nil {
		return
	}

	cOpts := []containerd.NewContainerOpts{
		containerd.WithImage(img),
		snapshotOpt,
		//containerd.WithImageStopSignal(img, "SIGTERM"),
		containerd.WithNewSpec(opts...),
		containerd.WithContainerLabels(config.Labels),
	}

	cont, err := cc.client.NewContainer(cc.ctx, name, cOpts...)
	if err != nil {
		return
	}

	// This is a dummy PTY to silence output
	// when starting without attach breaking
	con, _, err := console.NewPty()
	if err != nil {
		return
	}
	defer util.DeferErr(&err, con.Close)

	// We need a temporary dummy stdin reader that
	// actually works, can't use nullReader here
	dummyReader, _, err := os.Pipe()
	if err != nil {
		return
	}
	defer util.DeferErr(&err, dummyReader.Close)

	// Spawn the Creator with the dummy streams
	ioCreator := cio.NewCreator(cio.WithTerminal, cio.WithStreams(dummyReader, con, con))

	task, err := cont.NewTask(cc.ctx, ioCreator)
	if err != nil {
		return
	}

	if err = task.Start(cc.ctx); err != nil {
		return
	}

	// TODO: Save task.Pid() somewhere for attaching?
	s = task.ID()
	return
}

At this point the container has been started. After the container starts, the ignite-spawn command invokes firecracker to boot the vm. As the Dockerfile shows, the container's entrypoint is:

ENTRYPOINT ["/usr/local/bin/ignite-spawn"]

firecracker start vm

Parse the configuration

The first step converts the mounted IGNITE_SPAWN_VM_FILE_PATH into a VM object. This file is the /var/lib/firecracker/vm/<UID>/metadata.json file mounted at vm start; it contains the vm's specification, such as CPU, memory, disk, and network. Note that starting the vm happens inside the container.

func decodeVM(vmID string) (*api.VM, error) {
	filePath := constants.IGNITE_SPAWN_VM_FILE_PATH
	obj, err := scheme.Serializer.DecodeFile(filePath, true)
	if err != nil {
		return nil, err
	}

	vm, ok := obj.(*api.VM)
	if !ok {
		return nil, fmt.Errorf("object couldn't be converted to VM")
	}

	// Explicitly set the GVK on this object
	vm.SetGroupVersionKind(api.SchemeGroupVersion.WithKind(api.KindVM.Title()))
	return vm, nil
}
Start the vm

Starting the vm involves three steps:

  1. Configure the container network: verify the interface addresses are usable and create the interfaces for the vm
  2. Configure DHCP
  3. Start the vm: firecracker boots the vm, using the interfaces prepared in step 1, the host's devicemapper device, and so on
func StartVM(vm *api.VM) (err error) {

	// Setup networking inside of the container, return the available interfaces
	fcIfaces, dhcpIfaces, err := container.SetupContainerNetworking(vm) // Configure the container network
	if err != nil {
		return fmt.Errorf("network setup failed: %v", err)
	}

	// Serve DHCP requests for those interfaces
	// This function returns the available IP addresses that are being
	// served over DHCP now
	if err = container.StartDHCPServers(vm, dhcpIfaces); err != nil { // Configure DHCP
		return
	}

	// Serve metrics over an unix socket in the VM's own directory
	metricsSocket := path.Join(vm.ObjectPath(), constants.PROMETHEUS_SOCKET)
	serveMetrics(metricsSocket)

	// Patches the VM object to set state to stopped, and clear IP addresses
	defer util.DeferErr(&err, func() error { return patchStopped(vm) })

	// Remove the snapshot overlay post-run, which also removes the detached backing loop devices
	defer util.DeferErr(&err, func() error { return dmlegacy.DeactivateSnapshot(vm) })

	// Remove the Prometheus socket post-run
	defer util.DeferErr(&err, func() error { return os.Remove(metricsSocket) })

	// Execute Firecracker
	if err = container.ExecuteFirecracker(vm, fcIfaces); err != nil { // Start the vm
		return fmt.Errorf("runtime error for VM %q: %v", vm.GetUID(), err)
	}

	return
}
Configure the container network

firecracker creates the vm inside the container, so the network environment for the vm must be prepared there. Extra interfaces can be added to the vm via the ignite.weave.works/interface annotation on the VM object.

Note: ignite supports two network modes, MODE_DHCP and MODE_TC; MODE_DHCP is the one currently used.
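
As an illustration, the following sketch shows how such annotations could be parsed into the vmIntfs map used by the code below; the annotation key prefix and the concrete mode strings ("dhcp-bridge"/"tc-redirect") are assumptions, mirroring what parseExtraIntfs does:

package main

import "strings"

const (
	// Assumed annotation key prefix and mode values
	ifaceAnnotationPrefix = "ignite.weave.works/interface/"
	MODE_DHCP             = "dhcp-bridge"
	MODE_TC               = "tc-redirect"
)

// parseExtraIntfsSketch turns annotations such as
//   ignite.weave.works/interface/eth1: "tc-redirect"
// into a map of interface name -> mode.
func parseExtraIntfsSketch(annotations map[string]string) map[string]string {
	vmIntfs := map[string]string{}
	for k, mode := range annotations {
		name := strings.TrimPrefix(k, ifaceAnnotationPrefix)
		if name == k || name == "" {
			continue // not an interface annotation
		}
		if mode == MODE_DHCP || mode == MODE_TC {
			vmIntfs[name] = mode
		}
	}
	return vmIntfs
}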

func SetupContainerNetworking(vm *api.VM) (firecracker.NetworkInterfaces, []DHCPInterface, error) {
   var dhcpIntfs []DHCPInterface
   var fcIntfs firecracker.NetworkInterfaces

   // Extra interfaces can be added via the ignite.weave.works/interface annotation in the vm's metadata.json
   vmIntfs := parseExtraIntfs(vm) // vmIntfs: map[<interface_name>]<interface_mode>

   // If the eth0 interface is missing, add it and set it to DHCP mode
   if _, ok := vmIntfs[mainInterface]; !ok {
      vmIntfs[mainInterface] = MODE_DHCP
   }

   interval := 1 * time.Second

   // Wait for the interfaces to become ready; the poll function returns true once they are
   err := wait.PollImmediate(interval, constants.IGNITE_SPAWN_TIMEOUT, func() (bool, error) {

      // Check that the interfaces exist and are configured correctly
      retry, err := collectInterfaces(vmIntfs)

      if err == nil {
         // We're done here
         return true, nil
      }
      if retry {
         // We got an error, but let's ignore it and try again
         log.Warnf("Got an error while trying to set up networking, but retrying: %v", err)
         return false, nil
      }
      // The error was fatal, return it
      return false, err
   })

   if err != nil {
      return nil, nil, err
   }

   // Prepare the network environment (interfaces etc.) for the vm
   if err := networkSetup(&fcIntfs, &dhcpIntfs, vmIntfs); err != nil {
      return nil, nil, err
   }

   return fcIntfs, dhcpIntfs, nil
}

The collectInterfaces method checks that the interfaces exist and are configured correctly:

  1. Get all of the current network interfaces
  2. Verify that the expected interfaces in vmIntfs exist and that each has an IP address configured
func collectInterfaces(vmIntfs map[string]string) (bool, error) {
   // Get all interfaces
   allIntfs, err := net.Interfaces()
   if err != nil || allIntfs == nil || len(allIntfs) == 0 {
      return false, fmt.Errorf("cannot get local network interfaces: %v", err)
   }

   // create a map of candidate interfaces
   foundIntfs := make(map[string]net.Interface)
   for _, intf := range allIntfs {
      if _, ok := ignoreInterfaces[intf.Name]; ok {
         continue
      }

      foundIntfs[intf.Name] = intf

      // If the interface is explicitly defined, no changes are needed
      if _, ok := vmIntfs[intf.Name]; ok { // If the interface is already defined, no mode needs to be assigned
         continue
      }

      // default fallback behaviour to always consider intfs with an address
      addrs, _ := intf.Addrs()
      if len(addrs) > 0 {
         vmIntfs[intf.Name] = MODE_DHCP
      }
   }

   // Verify that the expected interfaces have been created
   for intfName, mode := range vmIntfs {
      if _, ok := foundIntfs[intfName]; !ok {
         return true, fmt.Errorf("interface %q (mode %q) is still not found", intfName, mode)
      }

      // for DHCP interface, we need to make sure IP and route exist
      if mode == MODE_DHCP {
         intf := foundIntfs[intfName]
        _, _, _, noIPs, err := getAddress(&intf) // getAddress returns the interface's IP/mask, gateway and link; here we only check whether an IP is configured
         if err != nil {
            return true, err
         }

         if noIPs {
            return true, fmt.Errorf("IP is still not found on %q", intfName)
         }
      }
   }
   return false, nil
}

After the vmIntfs interfaces have been validated, the network can be configured for the vm. The main flow is:

  1. Iterate over all expected interfaces in the container, take the first address from each, delete that address from the interface, and return the address info; it later becomes the firecracker vm's address, so the vm effectively borrows the container's address.
  2. For each expected interface, create a tap interface and a bridge interface, then attach both the expected interface and the tap interface to that bridge. The tap interface is later handed to the vm created by firecracker and configured with the address info from step 1.
func networkSetup(fcIntfs *firecracker.NetworkInterfaces, dhcpIntfs *[]DHCPInterface, vmIntfs map[string]string) error {

	// The order in which interfaces are plugged in is intentionally deterministic
	// All interfaces are sorted alphabetically and 'eth0' is always first
	var keys []string
	for k := range vmIntfs {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	sort.Slice(keys, func(i, j int) bool {
		return keys[i] == mainInterface
	})

	for _, intfName := range keys {

		intf, err := net.InterfaceByName(intfName) // Look up the container's interface instance by name
		if err != nil {
			return fmt.Errorf("cannot find interface %q: %s", intfName, err)
		}

		switch vmIntfs[intfName] {
		case MODE_DHCP:
			ipNet, gw, err := takeAddress(intf) // Take the first address off the container interface, remove it, and return it; it later becomes the vm's interface address
			if err != nil {
				return fmt.Errorf("error parsing interface %q: %s", intfName, err)
			}

			dhcpIface, err := bridge(intf) // Create the tap and bridge interfaces and wire up the bridging; returns the dhcpIface handed to the vm
			if err != nil {
				return fmt.Errorf("bridging interface %q failed: %v", intfName, err)
			}

			dhcpIface.VMIPNet = ipNet
			dhcpIface.GatewayIP = gw

			*dhcpIntfs = append(*dhcpIntfs, *dhcpIface) // Record the DHCP interface

			*fcIntfs = append(*fcIntfs, firecracker.NetworkInterface{
				StaticConfiguration: &firecracker.StaticNetworkConfiguration{
					MacAddress:  dhcpIface.MACFilter,
					HostDevName: dhcpIface.VMTAP,
				},
			})
		case MODE_TC:
			tcInterface, err := addTcRedirect(intf)
			if err != nil {
				log.Errorf("Failed to setup tc redirect %v", err)
				continue
			}

			*fcIntfs = append(*fcIntfs, *tcInterface)
		}
	}

	return nil
}

Below are the container interfaces created when using the bridge CNI. Note that the IP on eth0 has been removed (by takeAddress); vm_eth0 and br_eth0 are the tap and bridge interfaces created for eth0:

$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
3: eth0@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br_eth0 state UP group default 
    link/ether 8e:c9:3a:f0:50:67 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::8cc9:3aff:fef0:5067/64 scope link 
       valid_lft forever preferred_lft forever
4: vm_eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master br_eth0 state UP group default qlen 1000
    link/ether d6:20:6b:d4:3e:2a brd ff:ff:ff:ff:ff:ff
    inet6 fe80::d420:6bff:fed4:3e2a/64 scope link 
       valid_lft forever preferred_lft forever
5: br_eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 92:79:7d:39:28:b6 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::9079:7dff:fe39:28b6/64 scope link 
       valid_lft forever preferred_lft forever
$ ip link show master br_eth0
3: eth0@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br_eth0 state UP mode DEFAULT group default 
    link/ether 8e:c9:3a:f0:50:67 brd ff:ff:ff:ff:ff:ff link-netnsid 0
4: vm_eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master br_eth0 state UP mode DEFAULT group default qlen 1000
    link/ether d6:20:6b:d4:3e:2a brd ff:ff:ff:ff:ff:ff

The vm's interfaces and routes are shown below; the vm's eth0 is backed by vm_eth0 in the container:

$ ip a 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 1e:7d:6c:90:99:18 brd ff:ff:ff:ff:ff:ff
3: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether e6:18:3d:88:a9:ff brd ff:ff:ff:ff:ff:ff
4: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 76:6f:77:5d:6d:1c brd ff:ff:ff:ff:ff:ff
    inet 10.61.0.41/16 brd 10.61.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::746f:77ff:fe5d:6d1c/64 scope link 
       valid_lft forever preferred_lft forever
$ ip route
default via 10.61.0.1 dev eth0 

The resulting interface layout looks like this:

image
Configure DHCP

This step starts a DHCP server on each bridge interface:

// StartDHCPServers starts multiple DHCP servers for the VM, one per interface
// It returns the IP addresses that the API object may post in .status, and a potential error
func StartDHCPServers(vm *api.VM, dhcpIfaces []DHCPInterface) error {

	// Fetch the DNS servers given to the container
	clientConfig, err := dns.ClientConfigFromFile("/etc/resolv.conf")
	if err != nil {
		return fmt.Errorf("failed to get DNS configuration: %v", err)
	}

	for i := range dhcpIfaces {
		dhcpIface := &dhcpIfaces[i]
		// Set the VM hostname to the VM ID
		dhcpIface.Hostname = vm.GetUID().String()

		// Add the DNS servers from the container
		dhcpIface.SetDNSServers(clientConfig.Servers)

		go func() {
			log.Infof("Starting DHCP server for interface %q (%s)\n", dhcpIface.Bridge, dhcpIface.VMIPNet.IP)
			if err := dhcpIface.StartBlockingServer(); err != nil {
				log.Errorf("%q DHCP server error: %v\n", dhcpIface.Bridge, err)
			}
		}()
	}

	return nil
}
Start the vm

Booting the vm with firecracker requires the following basic configuration:

  • Get the configured CPU and memory resources from the vm metadata
  • Configure the devicemapper device, network interfaces, and volumes
  • Initialize and start a firecracker vm
func ExecuteFirecracker(vm *api.VM, fcIfaces firecracker.NetworkInterfaces) (err error) {
	drivePath := vm.SnapshotDev() // Get the vm's devicemapper device; the host's /dev is mounted into the container, so it can be used directly

	vCPUCount := int64(vm.Spec.CPUs) // Get the CPU and memory resources
	memSizeMib := int64(vm.Spec.Memory.MBytes())

	cmdLine := vm.Spec.Kernel.CmdLine
	if len(cmdLine) == 0 {
		// if for some reason cmdline would be unpopulated, set it to the default
		cmdLine = constants.VM_DEFAULT_KERNEL_ARGS
	}

	// Convert the logrus error level to a Firecracker compatible error level.
	// Firecracker accepts "Error", "Warning", "Info", and "Debug", case-sensitive.
	fcLogLevel := "Debug"
	switch logs.Logger.Level {
	case log.InfoLevel:
		fcLogLevel = "Info"
	case log.WarnLevel:
		fcLogLevel = "Warning"
	case log.ErrorLevel, log.FatalLevel, log.PanicLevel:
		fcLogLevel = "Error"
	}

	firecrackerSocketPath := path.Join(vm.ObjectPath(), constants.FIRECRACKER_API_SOCKET)
	logSocketPath := path.Join(vm.ObjectPath(), constants.LOG_FIFO)
	metricsSocketPath := path.Join(vm.ObjectPath(), constants.METRICS_FIFO)
	cfg := firecracker.Config{
		SocketPath:      firecrackerSocketPath,
		KernelImagePath: constants.IGNITE_SPAWN_VMLINUX_FILE_PATH, // Path of the vmlinux file mounted into the container
		KernelArgs:      cmdLine,
		Drives: []models.Drive{{
			DriveID:      firecracker.String("1"),
			IsReadOnly:   firecracker.Bool(false),
			IsRootDevice: firecracker.Bool(true),
			PathOnHost:   &drivePath, // Use the devicemapper device as the root drive
		}},
		NetworkInterfaces: fcIfaces, // Set the vm's network interfaces
		MachineCfg: models.MachineConfiguration{
			VcpuCount:  &vCPUCount,
			MemSizeMib: &memSizeMib,
			HtEnabled:  firecracker.Bool(true),
		},
		//JailerCfg: firecracker.JailerConfig{
		//	GID:      firecracker.Int(0),
		//	UID:      firecracker.Int(0),
		//	ID:       vm.ID,
		//	NumaNode: firecracker.Int(0),
		//	ExecFile: "firecracker",
		//},

		LogLevel: fcLogLevel,
		// TODO: We could use /dev/null, but firecracker-go-sdk issues Mkfifo which collides with the existing device
		LogFifo:     logSocketPath,
		MetricsFifo: metricsSocketPath,
	}

	// Add the volumes to the VM
	for i, volume := range vm.Spec.Storage.Volumes {
		volumePath := path.Join(constants.IGNITE_SPAWN_VOLUME_DIR, volume.Name)
		if !util.FileExists(volumePath) {
			log.Warnf("Skipping nonexistent volume: %q", volume.Name)
			continue // Skip all nonexistent volumes
		}

		cfg.Drives = append(cfg.Drives, models.Drive{
			DriveID:      firecracker.String(strconv.Itoa(i + 2)),
			IsReadOnly:   firecracker.Bool(false), // TODO: Support read-only volumes
			IsRootDevice: firecracker.Bool(false),
			PathOnHost:   &volumePath, // Attach the volume; it flows host --> container --> vm
		})
	}

	// Remove these FIFOs for now
	defer os.Remove(logSocketPath)
	defer os.Remove(metricsSocketPath)

	ctx, vmmCancel := context.WithCancel(context.Background())
	defer vmmCancel()

	cmd := firecracker.VMCommandBuilder{}.
		WithBin("firecracker").
		WithSocketPath(firecrackerSocketPath).
		WithStdin(os.Stdin).
		WithStdout(os.Stdout).
		WithStderr(os.Stderr).
		Build(ctx)

	m, err := firecracker.NewMachine(ctx, cfg, firecracker.WithProcessRunner(cmd))
	if err != nil {
		return fmt.Errorf("failed to create machine: %s", err)
	}

	//defer os.Remove(cfg.SocketPath)

	//if opts.validMetadata != nil {
	//	m.EnableMetadata(opts.validMetadata)
	//}

	if err = m.Start(ctx); err != nil { // Start the vm
		return fmt.Errorf("failed to start machine: %v", err)
	}
	defer util.DeferErr(&err, m.StopVMM)

	installSignalHandlers(ctx, m)

	// wait for the VMM to exit
	if err = m.Wait(ctx); err != nil {
		return fmt.Errorf("wait returned an error %s", err)
	}

	return
}

Run vm

Below is the entry point for running a vm. Internally it just calls the vm create and vm start methods, i.e. it executes the "Create vm" and "Start vm" steps above:

func Run(ro *RunOptions, fs *flag.FlagSet) error {
   if err := Create(ro.CreateOptions); err != nil {
      return err
   }

   // Copy the pointer over for Start
   // TODO: This is pretty bad, fix this
   ro.vm = ro.VM

   return Start(ro.StartOptions, fs)
}

Kill VM

kill vm force-stops a vm but does not delete it; the vm's metadata and storage remain under the /var/lib/firecracker/vm/<UID> directory.

kill vm mainly uses the same StopVM method that remove vm calls, but it executes providers.Runtime.KillContainer to stop the containerd task.

A containerd task must be killed before it can be deleted.

Note that networking resources are released here; running ignite vm start will reconfigure the container's networking.

func StopVM(vm *api.VM, kill, silent bool) error {
	var err error
	container := vm.PrefixedID()
	action := "stop"

	if !vm.Running() && !logs.Quiet {
		log.Warnf("VM %q is not running but trying to cleanup networking for stopped container\n", vm.GetUID())
	}

	// Release networking resources
	if err = removeNetworking(vm.Status.Runtime.ID, vm.Spec.Network.Ports...); err != nil {
		log.Warnf("Failed to cleanup networking for stopped container %s %q: %v", vm.GetKind(), vm.GetUID(), err)

		return err
	}

	if vm.Running() {
		// Stop or kill the VM container
		if kill {
			action = "kill"
			err = providers.Runtime.KillContainer(container, signalSIGQUIT) // TODO: common constant for SIGQUIT
		} else {
			err = providers.Runtime.StopContainer(container, nil)
		}

		if err != nil {
			return fmt.Errorf("failed to %s container for %s %q: %v", action, vm.GetKind(), vm.GetUID(), err)
		}

		if silent {
			return nil
		}

		if logs.Quiet {
			fmt.Println(vm.GetUID())
		} else {
			log.Infof("Stopped %s with name %q and ID %q", vm.GetKind(), vm.GetName(), vm.GetUID())
		}
	}

	return nil
}

KillContainer is implemented as follows: it loads the container's task via cont.Task, sends it syscall.SIGQUIT to force-stop the container process, and uses task.Wait to wait for the process to exit.

func (cc *ctdClient) KillContainer(container, signal string) (err error) {
	cont, err := cc.client.LoadContainer(cc.ctx, container)
	if err != nil {
		// If the container is not found, return nil, no-op.
		if errdefs.IsNotFound(err) {
			log.Warn(err)
			err = nil
		}
		return
	}

	task, err := cont.Task(cc.ctx, cio.Load)
	if err != nil {
		// If the task is not found, return nil, no-op.
		if errdefs.IsNotFound(err) {
			log.Warn(err)
			err = nil
		}
		return
	}

	// Initiate a wait
	waitC, err := task.Wait(cc.ctx)
	if err != nil {
		return
	}

	// Send a SIGQUIT signal to force stop
	if err = task.Kill(cc.ctx, syscall.SIGQUIT); err != nil {
		return
	}

	// Wait for the container to stop
	<-waitC

	// Delete the task
	_, err = task.Delete(cc.ctx)
	return
}

Stop VM

stop vm also uses the StopVM method, but it executes providers.Runtime.StopContainer; compared with kill vm it adds a grace period, making it more graceful.

  1. First send syscall.SIGTERM to the container process to request a graceful shutdown
  2. If the process has not exited within the timeout (30s), send syscall.SIGQUIT to force it to stop
  3. Finally call task.Delete to remove the task

The core code is:

	waitC, err := task.Wait(cc.ctx)
	if err != nil {
		return
	}

	// Send a SIGTERM signal to request a clean shutdown
	if err = task.Kill(cc.ctx, syscall.SIGTERM); err != nil {
		return
	}

	// After sending the signal, start the timer to force-kill the task
	timeoutC := make(chan error)
	timer := time.AfterFunc(*timeout, func() {
		timeoutC <- task.Kill(cc.ctx, syscall.SIGQUIT)
	})

	// Wait for the task to stop or the timer to fire
	select {
	case exitStatus := <-waitC:
		timer.Stop()             // Cancel the force-kill timer
		err = exitStatus.Error() // TODO: Handle exit code
	case err = <-timeoutC: // The kill timer has fired
	}

	// Delete the task
	if _, e := task.Delete(cc.ctx); e != nil {
		if err != nil {
			err = fmt.Errorf("%v, task deletion failed: %v", err, e) // TODO: Multierror
		} else {
			err = e
		}
	}

Remove vm

Below is the entry point for deleting a vm:

func Rm(ro *RmOptions) error {
	for _, vm := range ro.vms {
		// If the vm is running, deletion requires the force flag, just as when removing a running container with the docker CLI
		if vm.Running() && !ro.Force {
			return fmt.Errorf("%s is running", vm.GetUID())
		}

		// Runtime and network info are present only when the VM is running.
		if vm.Running() {
			// Set the runtime and network-plugin providers from the VM status.
			if err := config.SetAndPopulateProviders(vm.Status.Runtime.Name, vm.Status.Network.Plugin); err != nil {
				return err
			}
		}

		// This will first kill the VM container, and then remove it
		if err := operations.DeleteVM(providers.Client, vm); err != nil {
			return err
		}
	}

	return nil
}

A running vm involves several resources: the containerd task, the containerd container, the CNI network, the devicemapper snapshot device the vm mounts, the vm log file, and the VM object saved in Storage. Removing a vm means cleaning up all of these.

func DeleteVM(c *client.Client, vm *api.VM) error {
	if err := CleanupVM(vm); err != nil {
		return err
	}

  // Remove the VM object and the /var/lib/firecracker/vm/<UID>/ directory
	return c.VMs().Delete(vm.GetUID())
}

CleanupVM is the main cleanup method. It first calls StopVM (see the "Kill VM" and "Stop VM" sections) to stop and delete the container process and remove the container network, then calls RemoveVMContainer to clean up containerd-related resources, and finally calls dmlegacy.DeactivateSnapshot to remove the vm's filesystem (internally invoking the dmsetup remove command). The steps are:

  1. If the vm is running, call StopVM to remove the network and stop the containerd container's task

    When removing the vm, its container must be removed as well, otherwise resources leak; see: issue

  2. Delete the container the vm runs in

  3. Remove the devicemapper snapshot device the vm mounts, as well as the vm log file

func CleanupVM(vm *api.VM) error {
	// Runtime information is available only when the VM is running.
	if vm.Running() {
		// Inspect the container before trying to stop it and it gets auto-removed
		inspectResult, _ := providers.Runtime.InspectContainer(vm.PrefixedID())

		// If the VM is running, try to kill it first so we don't leave dangling containers. Otherwise, try to cleanup VM networking.
		if err := StopVM(vm, true, true); err != nil {
			if vm.Running() {
				return err
			}
		}

		// Remove the VM container if it exists
		// TODO should this function return a proper error?
		RemoveVMContainer(inspectResult)
	}

	// After removing the VM container, if the Snapshot Device is still there, clean up
	if _, err := os.Stat(vm.SnapshotDev()); err == nil {
		// try remove it again with DeactivateSnapshot
		if err := dmlegacy.DeactivateSnapshot(vm); err != nil {
			return err
		}
	}

	if logs.Quiet {
		fmt.Println(vm.GetUID())
	} else {
		log.Infof("Removed %s with name %q and ID %q", vm.GetKind(), vm.GetName(), vm.GetUID())
	}

	return nil
}

RemoveContainer performs the following cleanup:

  1. Load the vm's container from containerd by name
  2. Get and delete the container's task
  3. Delete the container itself
  4. Remove the vm log file /tmp/<containerName>.log
func (cc *ctdClient) RemoveContainer(container string) error {
	// Remove the container if it exists
	cont, contLoadErr := cc.client.LoadContainer(cc.ctx, container)
	if errdefs.IsNotFound(contLoadErr) {
		log.Debug(contLoadErr)
		return nil
	} else if contLoadErr != nil {
		return contLoadErr
	}

	// Load the container's task without attaching
	task, taskLoadErr := cont.Task(cc.ctx, nil)
	if errdefs.IsNotFound(taskLoadErr) {
		log.Debug(taskLoadErr)
	} else if taskLoadErr != nil {
		return taskLoadErr
	} else {
		_, taskDeleteErr := task.Delete(cc.ctx)
		if taskDeleteErr != nil {
			log.Debug(taskDeleteErr)
		}
	}

	// Delete the container
	deleteContErr := cont.Delete(cc.ctx, containerd.WithSnapshotCleanup)
	if errdefs.IsNotFound(contLoadErr) {
		log.Debug(contLoadErr)
	} else if deleteContErr != nil {
		return deleteContErr
	}

	// Remove the log file if it exists
	logFile := fmt.Sprintf(logPathTemplate, container)
	if util.FileExists(logFile) {
		logDeleteErr := os.RemoveAll(logFile)
		if logDeleteErr != nil {
			return logDeleteErr
		}
	}

	return nil
}

Auxiliary commands

vm logs

Fetching vm logs amounts to retrieving the output of the task in the vm's container and writing it to the /tmp/ignite-<UID>.log file:

func (cc *ctdClient) ContainerLogs(container string) (r io.ReadCloser, err error) {
	var (
		cont containerd.Container
	)

	if cont, err = cc.client.LoadContainer(cc.ctx, container); err != nil {
		return
	}

	var retriever *logRetriever
	if retriever, err = newlogRetriever(fmt.Sprintf(logPathTemplate, container)); err != nil {
		return
	}

	if _, err = cont.Task(cc.ctx, cio.NewAttach(retriever.Opt())); err != nil {
		return
	}

	// Currently we have no way of detecting if the task's attach has filled the stdout and stderr
	// buffers without asynchronous I/O (syscall.Conn and syscall.Splice). If the read reaches
	// the end, the application hangs indefinitely waiting for new output from the container.
	// TODO: Get rid of this, implement asynchronous I/O and read until the streams have been exhausted
	time.Sleep(time.Second)

	// Close the writer to signal EOF
	if err = retriever.CloseWriter(); err != nil {
		return
	}

	return retriever, nil
}

Attach vm

ignite offers two ways to get a terminal: attach and ssh. The difference is that each ssh invocation creates a new session, while attach operates on the system console; ssh is therefore the usual way to obtain a terminal session.

This part of ignite's code is based on containerd's attach implementation.

The attach operation first acquires the current terminal and then wires up input and output. ignite configures the terminal at startup via oci.WithTTY.

func (cc *ctdClient) AttachContainer(container string) (err error) {
	var (
		cont containerd.Container
		spec *oci.Spec
	)

	if cont, err = cc.client.LoadContainer(cc.ctx, container); err != nil {
		return
	}

	if spec, err = cont.Spec(cc.ctx); err != nil {
		return
	}

	var (
		con console.Console
		tty = spec.Process.Terminal
	)

	if tty {
		con = console.Current() // Get the current terminal
		defer util.DeferErr(&err, con.Reset)
		if err = con.SetRaw(); err != nil {
			return
		}
	}

	var (
		task     containerd.Task
		statusC  <-chan containerd.ExitStatus
		igniteIO *igniteIO
	)

	if igniteIO, err = newIgniteIO(fmt.Sprintf(logPathTemplate, container)); err != nil {
		return
	}
	defer util.DeferErr(&err, igniteIO.Close)

	if task, err = cont.Task(cc.ctx, cio.NewAttach(igniteIO.Opt())); err != nil { // Attach with log-related I/O configured
		return
	}

	if statusC, err = task.Wait(cc.ctx); err != nil {
		return
	}

	if tty {
		if err := HandleConsoleResize(cc.ctx, task, con); err != nil {
			log.Errorf("console resize failed: %v", err)
		}
	} else {
		sigc := ForwardAllSignals(cc.ctx, task)
		defer StopCatch(sigc)
	}

	var code uint32
	select {
	case ec := <-statusC:
		code, _, err = ec.Result()
	case <-igniteIO.Detach():
		fmt.Println() // Use a new line for the log entry
		log.Println("Detached")
	}

	if code != 0 && err == nil {
		err = fmt.Errorf("attach exited with code %d", code)
	}

	return
}

Inspect vm

inspect can show all three resource kinds (image/kernel/vm): it loads the object from Storage, then decodes and prints it.
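
For the VM kind this boils down to something like the sketch below, reusing the client type seen in DeleteVM above. This is illustrative only: imports of ignite's client and runtime packages are assumed, and the real implementation encodes via scheme.Serializer, for which json.MarshalIndent stands in here.

// inspectVM loads a VM object from Storage by UID and prints it.
// Sketch only: ignite's client and runtime packages are assumed imported.
func inspectVM(c *client.Client, uid runtime.UID) error {
	vm, err := c.VMs().Get(uid) // load the object from Storage
	if err != nil {
		return err
	}
	out, err := json.MarshalIndent(vm, "", "  ") // stand-in for scheme.Serializer
	if err != nil {
		return err
	}
	fmt.Println(string(out))
	return nil
}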

ssh vm

ssh is enabled by passing the --ssh flag at create time, after which you can connect with:

$ ignite ssh my-vm
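
The flag itself is passed when the vm is first created or run; a hedged example (image name illustrative):

$ ignite run weaveworks/ignite-ubuntu --name my-vm --ssh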

在"create vm->配置ssh"中已經介紹了vm是如何配置ssh服務的。這裏看下客戶端是如何連接vm的ssh的。

A key pair is used for the ssh connection; most of this is standard ssh client code, see this demo.

// runSSH creates and runs ssh session based on the provided arguments.
// If the command list is empty, ssh shell is created, else the ssh command is
// executed.
func runSSH(vm *api.VM, privKeyFile string, command []string, tty bool, timeout uint32) (err error) {
	// Check if the VM is running.
	if !vm.Running() {
		return fmt.Errorf("VM %q is not running", vm.GetUID())
	}

	// Get the IP address.
	ipAddrs := vm.Status.Network.IPAddresses // Get the IP addresses for the ssh connection
	if len(ipAddrs) == 0 {
		return fmt.Errorf("VM %q has no usable IP addresses", vm.GetUID())
	}

	// Get private key file path.
	if len(privKeyFile) == 0 { // Fall back to the VM's own private key
		privKeyFile = path.Join(vm.ObjectPath(), fmt.Sprintf(constants.VM_SSH_KEY_TEMPLATE, vm.GetUID()))
		if !util.FileExists(privKeyFile) {
			return fmt.Errorf("no private key found for VM %q", vm.GetUID())
		}
	}

	// Create a new ssh signer for the private key.
	signer, err := newSignerForKey(privKeyFile)
	if err != nil {
		return fmt.Errorf("unable to create signer for private key: %v", err)
	}

	// Defer exit here and set the exit code based on any ssh error, so that
	// this ssh command returns the correct ssh exit code. Since this function
	// results in an os.Exit, any error returned by this function won't be
	// received by the caller. Print the error to make the error message
	// visible and set the error code when an error is found.
	exitCode := 0
	defer func() {
		os.Exit(exitCode)
	}()

	// printErrAndSetExitCode is used to print an error message, set exit code
	// and return nil. This is needed because once the ssh connection is
	// established, to return the error code of the actual ssh session, instead
	// of returning an error, the runSSH function defers os.Exit with the ssh
	// exit code. For showing any error to the user, it needs to be printed.
	printErrAndSetExitCode := func(errMsg error, exitCode *int, code int) error {
		log.Errorf("%v\n", errMsg)
		*exitCode = code
		return nil
	}

	// Create an SSH client, and connect.
	config := newSSHConfig(signer, timeout)
	client, err := ssh.Dial(defaultSSHNetwork, net.JoinHostPort(ipAddrs[0].String(), defaultSSHPort), config)
	if err != nil {
		return printErrAndSetExitCode(fmt.Errorf("failed to dial: %v", err), &exitCode, 1)
	}
	defer util.DeferErr(&err, client.Close)

	// Create a session.
	session, err := client.NewSession()
	if err != nil {
		return printErrAndSetExitCode(fmt.Errorf("failed to create session: %v", err), &exitCode, 1)
	}
	defer util.DeferErr(&err, session.Close)

	// Configure tty if requested.
	if tty {
		// Get stdin file descriptor reference.
		fd := int(os.Stdin.Fd())

		// Store the raw state of the terminal.
		state, err := terminal.MakeRaw(fd)
		if err != nil {
			return printErrAndSetExitCode(fmt.Errorf("failed to make terminal raw: %v", err), &exitCode, 1)
		}
		defer util.DeferErr(&err, func() error { return terminal.Restore(fd, state) })

		// Get the terminal dimensions.
		w, h, err := terminal.GetSize(fd)
		if err != nil {
			return printErrAndSetExitCode(fmt.Errorf("failed to get terminal size: %v", err), &exitCode, 1)
		}

		// Set terminal modes.
		modes := ssh.TerminalModes{
			ssh.ECHO: 1,
		}

		// Read the TERM environment variable and use it to request the PTY.
		term := os.Getenv("TERM")
		if term == "" {
			term = defaultTerm
		}

		if err = session.RequestPty(term, h, w, modes); err != nil {
			return printErrAndSetExitCode(fmt.Errorf("request for pseudo terminal failed: %v", err), &exitCode, 1)
		}
	}

	// Connect input / output.
	// TODO: these should come from the cobra command instead of hardcoding
	// os.Stderr etc.
	session.Stderr = os.Stderr
	session.Stdout = os.Stdout
	session.Stdin = os.Stdin

	if len(command) == 0 {
		if err = session.Shell(); err != nil {
			return printErrAndSetExitCode(fmt.Errorf("failed to start shell: %v", err), &exitCode, 1)
		}

		if err = session.Wait(); err != nil {
			if e, ok := err.(*ssh.ExitError); ok {
				return printErrAndSetExitCode(err, &exitCode, e.ExitStatus())
			}
			return printErrAndSetExitCode(fmt.Errorf("failed waiting for session to exit: %v", err), &exitCode, 1)
		}
	} else {
		if err = session.Run(joinShellCommand(command)); err != nil {
			if e, ok := err.(*ssh.ExitError); ok {
				return printErrAndSetExitCode(err, &exitCode, e.ExitStatus())
			}
			return printErrAndSetExitCode(fmt.Errorf("failed to run shell command: %s", err), &exitCode, 1)
		}
	}
	return
}
func newSSHConfig(publicKey ssh.Signer, timeout uint32) *ssh.ClientConfig {
   return &ssh.ClientConfig{
      User: "root",
      Auth: []ssh.AuthMethod{
         ssh.PublicKeys(publicKey),
      },
      HostKeyCallback: ssh.InsecureIgnoreHostKey(), // TODO: use ssh.FixedPublicKey instead
      Timeout:         time.Second * time.Duration(timeout),
   }
}

exec vm

exec is implemented on top of ssh: it first uses waitForSSH to wait for the ssh service to come up, then logs in with runSSH:

func Exec(eo *ExecOptions) error {
	if err := waitForSSH(eo.vm, constants.SSH_DEFAULT_TIMEOUT_SECONDS, time.Duration(eo.Timeout)*time.Second); err != nil {
		return err
	}
	return runSSH(eo.vm, eo.IdentityFile, eo.command, eo.Tty, eo.Timeout)
}
func waitForSSH(vm *ignite.VM, dialSeconds int, sshTimeout time.Duration) error {
	if err := dialSuccess(vm, dialSeconds); err != nil { // Verify that the ssh service is reachable
		return err
	}

	certCheck := &ssh.CertChecker{
		IsHostAuthority: func(auth ssh.PublicKey, address string) bool {
			return true
		},
		IsRevoked: func(cert *ssh.Certificate) bool {
			return false
		},
		HostKeyFallback: func(hostname string, remote net.Addr, key ssh.PublicKey) error {
			return nil
		},
	}

	config := &ssh.ClientConfig{ //配置無認證方式登錄
		HostKeyCallback: certCheck.CheckHostKey,
		Timeout:         sshTimeout,
	}

	addr := vm.Status.Network.IPAddresses[0].String() + ":22"
	sshConn, err := ssh.Dial("tcp", addr, config) // If the ssh server returns an authentication error, the service is up and healthy
	if err != nil {
		if strings.Contains(err.Error(), "unable to authenticate") {
			// we connected to the ssh server and received the expected failure
			return nil
		}
		return err
	}

	defer sshConn.Close()
	return fmt.Errorf("waitForSSH: connected successfully with no authentication -- failure was expected")
}

rm image

  1. Find the image object in Storage by image ID, and fetch all vm objects at the same time
  2. If the --force flag is given, vms that use the image are deleted along with it
  3. Delete the image's directory /var/lib/firecracker/image/<UID>

If multiple images are specified, they are processed one by one.

rm kernel

rm image處理邏輯相同

See the official documentation for more CLI operations.

Ignited Daemon

The ignited daemon is ignite's long-running process. When a user creates a vm manifest under constants.MANIFEST_DIR (default /etc/firecracker/manifests), ignited automatically detects the file change and reads the images and metadata needed to create the vm from constants.DATA_DIR (default /var/lib/firecracker).

ignited uses a ManifestStorage to manage the constants.MANIFEST_DIR and constants.DATA_DIR directories, and saves the resulting manifestStorage into providers.Storage:

func SetManifestStorage() (err error) {
	log.Trace("Initializing the ManifestStorage provider...")
	ManifestStorage, err = manifest.NewTwoWayManifestStorage(constants.MANIFEST_DIR, constants.DATA_DIR, scheme.Serializer)
	if err != nil {
		return
	}

	providers.Storage = cache.NewCache(ManifestStorage)
	return
}

Because changes to the constants.MANIFEST_DIR directory must be watched, a GenericWatchStorage is used here; internally it relies on the rjeczalik/notify library for notifications of file changes (Create/Modify/Delete), as sketched below.
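
A minimal sketch of the notify library's usage, watching the manifest directory for the same three event classes (the real GenericWatchStorage layers object decoding and event dispatch on top):

package main

import (
	"log"

	"github.com/rjeczalik/notify"
)

func main() {
	// Buffered channel so the watcher doesn't drop events
	events := make(chan notify.EventInfo, 16)

	// Watch for file creations, modifications and deletions under the manifest directory
	if err := notify.Watch("/etc/firecracker/manifests", events,
		notify.Create, notify.Write, notify.Remove); err != nil {
		log.Fatal(err)
	}
	defer notify.Stop(events)

	for ei := range events {
		log.Printf("event: %v on %s", ei.Event(), ei.Path())
	}
}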

constants.DATA_DIR holds the image-related files, which only need to be read when a vm is created; no watching is required, so a plain GenericStorage manages that directory.

func NewTwoWayManifestStorage(manifestDir, dataDir string, ser serializer.Serializer) (*ManifestStorage, error) {
   ws, err := watch.NewGenericWatchStorage(storage.NewGenericStorage(storage.NewGenericMappedRawStorage(manifestDir), ser))
   if err != nil {
      return nil, err
   }

   ss := sync.NewSyncStorage(
      storage.NewGenericStorage(
         storage.NewGenericRawStorage(dataDir), ser),
      ws)

   return &ManifestStorage{
      Storage: ss,
   }, nil
}

The relationship between syncStorage and the main loop of the ignited daemon (ReconcileManifests) is shown below. A syncStorage can front multiple Storages and operate on them together, for example getting/setting/deleting a resource object across several Storages at once.

The watchStorage passes file events to the syncStorage over a channel named eventStream, and the syncStorage passes them on to the ignited daemon's main loop over a channel named updateStream; the main loop then reacts to each event (create/delete/modify and so on) based on its type and the object that produced it. Note that the main loop only cares about events for VM objects.

image

With the above in mind, the ignited daemon's main loop is straightforward: based on the event type, it performs the corresponding operation on the vm.

func ReconcileManifests(s *manifest.ManifestStorage) {
	startMetricsThread()

	// Wrap the Manifest Storage with a cache for better performance, and create a client
	c = client.NewClient(cache.NewCache(s))

	// Listen for events coming from the syncStorage
	for upd := range s.GetUpdateStream() {

		// Only events for VM resources are of interest
		if upd.APIType.GetKind() != api.KindVM {
			log.Tracef("GitOps: Ignoring kind %s", upd.APIType.GetKind())
			kindIgnored.Inc()
			continue
		}

		var vm *api.VM
		var err error
		// For a delete event the vm's manifest file has already been removed from the ManifestStorage, so the vm can no longer be fetched from it
		if upd.Event == update.ObjectEventDelete { 
			// As we know this VM was deleted, it wouldn't show up in a Get() call
			// Construct a temporary VM object for passing to the delete function
			vm = &api.VM{
				TypeMeta:   *upd.APIType.GetTypeMeta(),
				ObjectMeta: *upd.APIType.GetObjectMeta(),
				Status: api.VMStatus{
					Running: true, // TODO: Fix this in StopVM
				},
			}
		} else {
			// Get the real API object
			vm, err = c.VMs().Get(upd.APIType.GetUID())
			if err != nil {
				log.Errorf("Getting %s %q returned an error: %v", upd.APIType.GetKind(), upd.APIType.GetUID(), err)
				continue
			}

			// If the object was existent in the storage; validate it
			// Validate the VM object
			// TODO: Validate name uniqueness
			if err := validation.ValidateVM(vm).ToAggregate(); err != nil {
				log.Warnf("Skipping %s of %s %q, not valid: %v.", upd.Event, upd.APIType.GetKind(), upd.APIType.GetUID(), err)
				continue
			}
		}

		// TODO: Parallelization
		switch upd.Event {
		case update.ObjectEventCreate, update.ObjectEventModify: // Handle create and modify events
			runHandle(func() error {
				return handleChange(vm)
			})

		case update.ObjectEventDelete: // Handle delete events
			runHandle(func() error {
				// TODO: Temporary VM Object for removal
				return handleDelete(vm)
			})
		default:
			log.Infof("Unrecognized Git update type %s\n", upd.Event)
			continue
		}
	}
}

Below is what the event handlers actually do:

func handleChange(vm *api.VM) (err error) {
	// Only apply the new state if it
	// differs from the current state
	running := currentState(vm)
	if vm.Status.Running && !running { // 如果vm元數據中狀態是 running,而實際非running, 則啓動vm
		err = start(vm)
	} else if !vm.Status.Running && running { // The metadata says not running but the vm is: stop it
		err = stop(vm)
	}

	return
}
func handleDelete(vm *api.VM) error {
	return remove(vm)
}

func remove(vm *api.VM) error {
	log.Infof("Removing VM %q with name %q...", vm.GetUID(), vm.GetName())
	vmDeleted.Inc()
	// Object deletion is performed by the SyncStorage, so we just
	// need to clean up any remaining resources of the VM here
	return operations.CleanupVM(vm)
}

CNI

ignite uses CNI to configure host and container networking.

Default CNI

The default CNI plugin is bridge (source code in plugins); its main limitation is that vms cannot communicate across nodes. The structure is as follows:

The bridge CNI configuration /etc/cni/net.d/10-ignite.conflist looks like this:

{
	"cniVersion": "0.4.0",
	"name": "ignite-cni-bridge",
	"plugins": [
		{
			"type": "bridge",
			"bridge": "ignite0",
			"isGateway": true,
			"isDefaultGateway": true,
			"promiscMode": true,
			"ipMasq": true,
			"ipam": {
				"type": "host-local",
				"subnet": "10.61.0.0/16"
			}
		},
		{
			"type": "portmap",
			"capabilities": {
				"portMappings": true
			}
		},
		{
			"type": "firewall"
		}
	]
}

ignite uses go-cni to configure CNI; what it calls is just basic go-cni usage, i.e. the following interface:

New(config ...CNIOpt) (CNI, error)  // Initialize a CNI object
// Setup sets up the network for the namespace
Setup(ctx context.Context, id string, path string, opts ...NamespaceOpts) (*CNIResult, error)
// Remove tears down the network of the namespace.
Remove(ctx context.Context, id string, path string, opts ...NamespaceOpts) error
// Load loads the cni network config
Load(opts ...CNIOpt) error
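
A minimal sketch of that usage; the directories, container id, and netns path are placeholders, and option/result names may differ slightly across go-cni versions:

package main

import (
	"context"
	"log"

	gocni "github.com/containerd/go-cni"
)

func main() {
	// Initialize a CNI object with the standard plugin locations
	l, err := gocni.New(
		gocni.WithPluginConfDir("/etc/cni/net.d"),
		gocni.WithPluginDir([]string{"/opt/cni/bin"}),
	)
	if err != nil {
		log.Fatalf("failed to initialize go-cni: %v", err)
	}

	// Load the network config, e.g. the 10-ignite.conflist shown above
	if err := l.Load(gocni.WithLoNetwork, gocni.WithDefaultConf); err != nil {
		log.Fatalf("failed to load CNI config: %v", err)
	}

	ctx := context.Background()
	id := "ignite-ddf49307b5b27c34" // container id (placeholder)
	netns := "/proc/12345/ns/net"   // container netns path (placeholder)

	// Set up networking for the container's namespace
	if _, err := l.Setup(ctx, id, netns); err != nil {
		log.Fatalf("failed to set up network: %v", err)
	}

	// Tear it down again when the container goes away
	if err := l.Remove(ctx, id, netns); err != nil {
		log.Fatalf("failed to tear down network: %v", err)
	}
}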

Cross-host communication can alternatively be achieved with the Flannel plugin, but flannel requires etcd to maintain the network. See the official documentation for more.

Building and image creation

References
