Ceph PV Blocks Pod Migration in Kubernetes

Background

Ceph RBD currently only supports the RWO and ROX access modes, so the PV yaml is configured with ReadWriteOnce (RWO) so that the volume is at least writable. A PVC and the workload yaml are then prepared so the volume gets mounted at the database's read/write directory inside the container (a minimal PVC sketch follows the PV definition below).

apiVersion: v1
kind: PersistentVolume
metadata:
  name: user-pv
  namespace: sock-shop
  labels:
    pv: user-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  rbd:
    monitors:
      - 192.168.22.47:6789
      - 192.168.22.48:6789
      - 192.168.22.49:6789
    pool: sock_shop
    image: user-pv
    user: admin
    secretRef:
      name: ceph-secret
    fsType: ext4
    readOnly: false
  persistentVolumeReclaimPolicy: Recycle
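
For completeness, a minimal sketch of a matching PVC, assuming it lives in the sock-shop namespace and binds to the PV above via the pv: user-pv label (the name user-db-pvc is illustrative, not from the original manifests):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: user-db-pvc          # illustrative name
  namespace: sock-shop
spec:
  accessModes:
    - ReadWriteOnce          # must match an access mode declared on the PV
  resources:
    requests:
      storage: 1Gi
  selector:
    matchLabels:
      pv: user-pv            # selects the PV labeled above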

The Problem

During testing, the node hosting the Pod that had this PV mounted was shut down, and the Pod could not be migrated to another node. The errors were:

Multi-Attach error for volume "user-pv" Volume is already used by pod(s) user-db-79f7876cbc-dl8b8
Unable to mount volumes for pod "user-db-79f7876cbc-chddt_sock-shop(1dd2393e-ed51-11e8-95af-001a4ad9b270)": timeout expired waiting for volumes to attach or mount for pod "sock-shop"/"user-db-79f7876cbc-chddt". list of unmounted volumes=[data-volume]. list of unattached volumes=[tmp-volume data-volume default-token-l2g8x]

Problem Analysis

The attach/detach controller's reconciler first checks whether the volume forbids attachment to multiple Pods/nodes; if it does, and the volume is currently attached to at least one node, the volume cannot be attached for a newly started Pod, as shown below:

        if rc.isMultiAttachForbidden(volumeToAttach.VolumeSpec) {
            nodes := rc.actualStateOfWorld.GetNodesForVolume(volumeToAttach.VolumeName)
            if len(nodes) > 0 {
                if !volumeToAttach.MultiAttachErrorReported {
                    rc.reportMultiAttachError(volumeToAttach, nodes)
                    rc.desiredStateOfWorld.SetMultiAttachError(volumeToAttach.VolumeName, volumeToAttach.NodeName)
                }
                continue
            }
        }

The isMultiAttachForbidden check is based solely on the AccessModes. Ceph RBD's AccessModes do not include RWX, only RWO (leaving the read-only modes aside). After the node running the Pod that had the volume mounted went down, the cluster still considered that Pod Running, so its volume was never released, which is why the newly started Pod could not attach the volume. See below:

    if volumeSpec.PersistentVolume != nil {
        // Check for persistent volume types which do not fail when trying to multi-attach
        if volumeSpec.PersistentVolume.Spec.VsphereVolume != nil {
            return false
        }

        if len(volumeSpec.PersistentVolume.Spec.AccessModes) == 0 {
            // No access mode specified so we don't know for sure. Let the attacher fail if needed
            return false
        }

        // check if this volume is allowed to be attached to multiple PODs/nodes, if yes, return false
        for _, accessMode := range volumeSpec.PersistentVolume.Spec.AccessModes {
            if accessMode == v1.ReadWriteMany || accessMode == v1.ReadOnlyMany {
                return false
            }
        }
        return true
    }

This makes the problem clear: when the Pod is migrated, the old Pod is left in the Unknown state, so Kubernetes' internal RWO restriction ends up blocking the migration. Although Ceph RBD does not support RWX, nothing in the PV/PVC flow enforces this restriction; the PV/PVC layer will not reject a PV just because its backend is Ceph RBD or GlusterFS. Setting accessModes to RWX therefore resolves the problem.
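For reference, a minimal sketch of the workaround, assuming the PV above (and the matching PVC) simply switch their declared access mode:

spec:
  accessModes:
    - ReadWriteMany    # declared RWX so isMultiAttachForbidden returns false; RBD itself still enforces a single writer (see the watcher check below)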

But one question remains: Ceph RBD does not actually support RWX, so will forcing accessModes to RWX cause problems? In practice it will not, as the code below shows. The accessModes are checked first; there was originally a bug here, which I have since fixed and will cover in a later post, so the fixed version is shown. Unless the access modes consist solely of ReadOnlyMany, util.rbdStatus is always called to verify whether the RBD image is already in use.

func (util *RBDUtil) AttachDisk(b rbdMounter) (string, error) {
	// ...
		if b.accessModes != nil {
			// If accessModes only contain ReadOnlyMany, we don't need check rbd status of being used.
			if len(b.accessModes) == 1 && b.accessModes[0] == v1.ReadOnlyMany {
				needValidUsed = false
			}
		}
		err := wait.ExponentialBackoff(backoff, func() (bool, error) {
			used, rbdOutput, err := util.rbdStatus(&b)
			if err != nil {
				return false, fmt.Errorf("fail to check rbd image status with: (%v), rbd output: (%s)", err, rbdOutput)
			}
			return !needValidUsed || !used, nil
		})
	// ...
}

So how is the in-use check for an RBD PV implemented? It runs rbd status and looks for a watcher: once the image has been mapped, a watcher necessarily appears. In other words, whether the mode is RWX or RWO, the image can only be mapped once. Back to the original problem: even if the node goes down, even in a split-brain situation, as long as the PV is still mapped (i.e. still in use by a Pod), the new Pod cannot start. Put differently, the PV migration flow is: start the new Pod, then wait on the old one. Two cases follow. First, split brain: the old Pod still has the PV mapped, so even though the new Pod gets scheduled, its startup fails because a watcher already exists, and RBD never ends up with multiple read-write mappings. Second, no split brain: the old host is genuinely down, the new Pod finds no watcher, and it is rescheduled successfully.

// rbdStatus runs `rbd status` command to check if there is watcher on the image.
func (util *RBDUtil) rbdStatus(b *rbdMounter) (bool, string, error) {
	var err error
	var output string
	var cmd []byte

	// If we don't have admin id/secret (e.g. attaching), fallback to user id/secret.
	id := b.adminId
	secret := b.adminSecret
	if id == "" {
		id = b.Id
		secret = b.Secret
	}

	mon := util.kernelRBDMonitorsOpt(b.Mon)
	// cmd "rbd status" list the rbd client watch with the following output:
	//
	// # there is a watcher (exit=0)
	// Watchers:
	//   watcher=10.16.153.105:0/710245699 client.14163 cookie=1
	//
	// # there is no watcher (exit=0)
	// Watchers: none
	//
	// Otherwise, exit is non-zero, for example:
	//
	// # image does not exist (exit=2)
	// rbd: error opening image kubernetes-dynamic-pvc-<UUID>: (2) No such file or directory
	//
	glog.V(4).Infof("rbd: status %s using mon %s, pool %s id %s key %s", b.Image, mon, b.Pool, id, secret)
	cmd, err = b.exec.Run("rbd",
		"status", b.Image, "--pool", b.Pool, "-m", mon, "--id", id, "--key="+secret)
	output = string(cmd)

	if err, ok := err.(*exec.Error); ok {
		if err.Err == exec.ErrNotFound {
			glog.Errorf("rbd cmd not found")
			// fail fast if command not found
			return false, output, err
		}
	}

	// If command never succeed, returns its last error.
	if err != nil {
		return false, output, err
	}

	if strings.Contains(output, imageWatcherStr) {
		glog.V(4).Infof("rbd: watchers on %s: %s", b.Image, output)
		return true, output, nil
	} else {
		glog.Warningf("rbd: no watchers on %s", b.Image)
		return false, output, nil
	}
}