Fixing the hang when installing Windows 10 on an oVirt VM with FCP thin provisioning

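Problem description:
The problem only occurs in FCP thin-provisioning mode. In testing, the win10 installation hangs at "Getting files ready for installation (13%)", and virsh shows that the VM has entered the paused state.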

The following errors appear in both logs 1 and 2:

2018-04-18 18:37:21,556+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') CPU stopped: onSuspend (vm:5085)
2018-04-18 18:37:21,556+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') abnormal vm stop device scsi0-0-0-0 error (vm:4218)
2018-04-18 18:37:21,556+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') CPU stopped: onIOError (vm:5085)
2018-04-18 18:37:21,556+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') abnormal vm stop device scsi0-0-0-0 error enospc (vm:4218)
2018-04-18 18:37:21,556+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') CPU stopped: onIOError (vm:5085)
2018-04-18 18:37:21,558+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') No VM drives were extended (vm:4225)
2018-04-18 18:37:21,559+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') abnormal vm stop device scsi0-0-0-0 error enospc (vm:4218)
2018-04-18 18:37:21,559+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') CPU stopped: onIOError (vm:5085)
2018-04-18 18:37:21,561+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') No VM drives were extended (vm:4225)
2018-04-18 18:37:21,561+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') abnormal vm stop device scsi0-0-0-0 error enospc (vm:4218)
2018-04-18 18:37:21,561+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') CPU stopped: onIOError (vm:5085)
2018-04-18 18:37:21,563+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') No VM drives were extended (vm:4225)
2018-04-18 18:37:21,563+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') abnormal vm stop device scsi0-0-0-0 error enospc (vm:4218)
2018-04-18 18:37:21,563+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') CPU stopped: onIOError (vm:5085)
2018-04-18 18:37:21,565+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') No VM drives were extended (vm:4225)
2018-04-18 18:37:21,566+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') abnormal vm stop device scsi0-0-0-0 error enospc (vm:4218)
2018-04-18 18:37:21,566+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') CPU stopped: onIOError (vm:5085)
2018-04-18 18:37:21,568+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') No VM drives were extended (vm:4225)
2018-04-18 18:37:21,568+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') abnormal vm stop device scsi0-0-0-0 error enospc (vm:4218)
2018-04-18 18:37:21,568+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') CPU stopped: onIOError (vm:5085)
2018-04-18 18:37:21,570+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') No VM drives were extended (vm:4225)

In testing, the VM was suspended. On the vdsm side, the ENOSPC error reported by qemu is picked up and handled through the following flow:

extendDrivesIfNeeded -> extendDriveVolume

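The failure actually occurs in _shouldExtendVolume, at the check physical - alloc < drive.watermarkLimit. watermarkLimit is computed as self.VOLWM_FREE_PCT * self.volExtensionChunk / 100, where the chunk size comes from the volume_utilization_chunk_mb parameter in vdsm.conf (default 1024), so the default watermarkLimit is 512 MB. When the VM disk needs to grow by more than this limit, the extension fails. Raising the conf parameter did not help, so I removed the restriction on disk extension in the code. That change later turned out to break thin provisioning: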

Apr 21 14:34:07 Linx vdsmd: ---extend:[]
Apr 21 14:34:07 Linx vdsmd: ----out extend Drives
Apr 21 14:34:09 Linx vdsmd: ----in extend Drives
Apr 21 14:34:09 Linx vdsmd: blockInfo:[107374182400L, 0L, 10737418240L]
Apr 21 14:34:09 Linx vdsmd: ---ret:[(, 'ab0474bc-7b63-413a-8817-6f4fd4bdd871', 107374182400L, 0L, 10737418240L)]
Apr 21 14:34:09 Linx vdsmd: ---extend:[]
Apr 21 14:34:09 Linx vdsmd: ----out extend Drives
Apr 21 14:34:11 Linx vdsmd: ----in extend Drives
Apr 21 14:34:11 Linx vdsmd: blockInfo:[107374182400L, 0L, 10737418240L]
Apr 21 14:34:11 Linx vdsmd: ---ret:[(, 'ab0474bc-7b63-413a-8817-6f4fd4bdd871', 107374182400L, 0L, 10737418240L)]
Apr 21 14:34:11 Linx vdsmd: ---extend:[]
Apr 21 14:34:11 Linx vdsmd: ----out extend Drives
Apr 21 14:34:13 Linx vdsmd: ----in extend Drives
Apr 21 14:34:13 Linx vdsmd: blockInfo:[107374182400L, 0L, 10737418240L]
Apr 21 14:34:13 Linx vdsmd: ---ret:[(, 'ab0474bc-7b63-413a-8817-6f4fd4bdd871', 107374182400L, 0L, 10737418240L)]
Apr 21 14:34:13 Linx vdsmd: ---extend:[]
Apr 21 14:34:13 Linx vdsmd: ----out extend Drives
Apr 21 14:34:15 Linx vdsmd: ----in extend Drives
Apr 21 14:34:15 Linx vdsmd: blockInfo:[107374182400L, 0L, 10737418240L]
Apr 21 14:34:15 Linx vdsmd: ---ret:[(

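The main purpose of thin provisioning is to reduce how much disk space a VM occupies. Once thin provisioning stops working, the VM's disk usage doubles every time a snapshot is created. Debug prints showed that ENOSPC events fire continuously while alloc size = 0. The fix in the commit simply returns True, so the disk keeps being extended until it reaches its upper limit, and that is what defeats thin provisioning.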

Further investigation:

The log is as follows:

Apr 21 15:42:06 Linx vdsmd: blockInfo:[107374182400L, 0L, 536870912L]
Apr 21 15:42:06 Linx vdsmd: ---ret:[(, u'83cf08db-ea47-40be-9b6f-c39b8444d369', 107374182400L, 0L, 536870912L)]
Apr 21 15:42:06 Linx vdsmd: ---false3, physical:536870912; alloc:0; watermarkLimit:268435456;
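
The watermark check can be reproduced outside vdsm with the numbers from this log. The following is a minimal Python sketch, not vdsm code: the function names are hypothetical, and free_pct=50 is an assumption inferred from the 1024 MB chunk yielding a 512 MB default limit.

```python
MiB = 1024 * 1024


def watermark_limit(free_pct, chunk_mb):
    """VOLWM_FREE_PCT * volExtensionChunk / 100, expressed in bytes."""
    return free_pct * chunk_mb * MiB // 100


def should_extend(physical, alloc, limit):
    """The comparison made in _shouldExtendVolume."""
    return physical - alloc < limit


# Assumption: free_pct=50, which turns the 1024 MB chunk into a 512 MB limit.
default_limit = watermark_limit(50, 1024)

# Values taken from the log line above: alloc is reported as 0.
logged = should_extend(physical=536870912, alloc=0, limit=268435456)

# Hypothetical value: if 400 MiB had been reported as allocated instead,
# the same check would have asked for an extension.
nearly_full = should_extend(physical=536870912, alloc=400 * MiB, limit=268435456)
```

With alloc stuck at 0, the free space always looks equal to the full physical size, so the check stays false no matter how much the guest has actually written.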

On the vdsm side, disk extension is triggered through the following two call paths:

Flow 1:

DriveWatermarkMonitor._execute -> self._vm.extendDrivesIfNeeded()

Flow 2:

onIOError (callback) -> if reason == 'ENOSPC' -> self.extendDrivesIfNeeded()

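The two paths correspond to the following two cases:
Flow 1) A background monitor watches the disk watermark in real time and extends the disk when the threshold is reached.
Flow 2) If the disk reaches the threshold between two monitoring runs, qemu pauses the VM and raises an ENOSPC error; vdsm catches the error and extends the disk.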

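If only the case after qemu has already paused the VM is handled, the VM pauses intermittently during sustained disk writes, which shows up in use as constant stuttering. What really needs fixing is that, in _getExtendCandidates on the vdsm side, the libvirt call self._dom.blockInfo fails to return the disk's actual current size.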
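
The interaction between the two flows can be illustrated with a toy simulation. This is a rough Python sketch under stated assumptions, not vdsm code: Drive, extend_drives_if_needed, simulate and report_alloc are hypothetical names, and the sizes (512 MiB initial volume, 1024 MiB chunk, 512 MiB watermark) are chosen to match the figures quoted in this post.

```python
MiB = 1024 * 1024


class Drive:
    """Hypothetical stand-in for a thin-provisioned block volume."""

    def __init__(self, physical, chunk, watermark_limit):
        self.physical = physical
        self.chunk = chunk
        self.watermark_limit = watermark_limit


def extend_drives_if_needed(drive, alloc):
    """Shared by both flows: add one chunk when free space is below the limit."""
    if drive.physical - alloc < drive.watermark_limit:
        drive.physical += drive.chunk
        return True
    return False


def simulate(total_write, step, monitor_every, report_alloc):
    """Write total_write bytes in step-sized pieces; return (pauses, bytes written).

    report_alloc maps the real allocation to the value the monitor sees.
    """
    drive = Drive(physical=512 * MiB, chunk=1024 * MiB, watermark_limit=512 * MiB)
    alloc = 0
    pauses = 0
    tick = 0
    while alloc < total_write:
        tick += 1
        if alloc + step > drive.physical:
            # Flow 2: the write does not fit, the VM pauses with ENOSPC.
            pauses += 1
            if not extend_drives_if_needed(drive, report_alloc(alloc)):
                break  # nothing was extended, the VM stays paused
            continue
        alloc += step
        if tick % monitor_every == 0:
            # Flow 1: periodic watermark check.
            extend_drives_if_needed(drive, report_alloc(alloc))
    return pauses, alloc


healthy = simulate(2048 * MiB, 64 * MiB, 1, lambda a: a)
alloc_reported_as_zero = simulate(2048 * MiB, 64 * MiB, 1, lambda a: 0)
only_flow_2 = simulate(2048 * MiB, 64 * MiB, 10**9, lambda a: a)
```

In this model the healthy run finishes without a pause, the run where the allocation is always reported as 0 stops at the first ENOSPC with nothing extended, and the run that relies on flow 2 alone finishes but pauses each time the volume fills up.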

Inside libvirt, self._dom.blockInfo follows the same call path as domblkinfo. The actual test results are as follows:

virsh # domblkinfo --device sda --domain linx80-1
Capacity: 107374182400
Allocation: 0
Physical: 536870912

virsh # domstats --block --domain linx80-1
Domain: 'linx80-1'
block.count=2
block.0.name=hdc
block.0.rd.reqs=4
block.0.rd.bytes=152
block.0.rd.times=82981
block.0.wr.reqs=0
block.0.wr.bytes=0
block.0.wr.times=0
block.0.fl.reqs=0
block.0.fl.times=0
block.0.allocation=0
block.0.physical=0
block.1.name=sda
block.1.path=/rhev/data-center/a3ae667f-bd61-4b9e-903b-9f57b2e89080/572679d5-b080-425e-9e4f-f5d01988a6be/images/1796cd02-5b15-4d72-864d-4bc33ca7cc1c/8ad5f4b3-b3fd-43c5-8c47-8da9a24904e9
block.1.rd.reqs=13340
block.1.rd.bytes=404829696
block.1.rd.times=73699128640
block.1.wr.reqs=1182
block.1.wr.bytes=128020480
block.1.wr.times=128182811428
block.1.fl.reqs=157
block.1.fl.times=401870367
block.1.allocation=149946368
block.1.capacity=107374182400
block.1.physical=536870912

As shown above, domblkinfo fails to report the Allocation while domstats does report it, which places the problem on the libvirt side.
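
The same comparison can be scripted when checking several hosts. The sketch below is a small Python helper that parses the text printed by virsh domstats --block. The function name parse_domstats_blocks is hypothetical, and the sample is an abbreviated copy of the output above.

```python
def parse_domstats_blocks(text):
    """Group 'block.N.key=value' lines by index N and key them by device name."""
    blocks = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("block.") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        parts = key.split(".", 2)
        if len(parts) != 3 or not parts[1].isdigit():
            continue  # skips entries such as block.count
        blocks.setdefault(int(parts[1]), {})[parts[2]] = value
    return {b["name"]: b for b in blocks.values() if "name" in b}


sample = """\
block.count=2
block.0.name=hdc
block.0.allocation=0
block.0.physical=0
block.1.name=sda
block.1.allocation=149946368
block.1.capacity=107374182400
block.1.physical=536870912
"""

stats = parse_domstats_blocks(sample)
domblkinfo_allocation = 0  # the Allocation value printed by domblkinfo for sda
mismatch = int(stats["sda"]["allocation"]) != domblkinfo_allocation
```

A mismatch for the same device, as here, is the symptom described in this post.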

The domblkinfo call path in libvirt is as follows:

domblkinfo -> cmdDomblkinfo -> virDomainGetBlockInfo -> conn->driver->domainGetBlockInfo -> qemuDomainGetBlockInfo -> qemuMonitorGetAllBlockStatsInfo

The main problem lies in this part of qemuDomainGetBlockInfo:

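    if (entry->physical == 0 || info->allocation == 0 ||
        info->allocation == entry->physical) {
        /* Overwrites allocation with the monitor's physical size, which is
         * 0 when entry->physical == 0. */
        info->allocation = entry->physical;
        /* No effect after the assignment above. */
        if (info->allocation == 0)
            info->allocation = entry->physical;

        /* Refresh the physical size from disk->src. */
        if (qemuDomainStorageUpdatePhysical(driver, cfg, vm, disk->src) < 0)
            goto endjob;

        info->physical = disk->src->physical;
    } else {
        info->physical = entry->physical;
    }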

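When the current disk size obtained from the monitor is 0, libvirt forcibly sets the allocated size to 0 and then re-reads the disk size from disk->src. The intent is to guarantee that the allocated size never exceeds the actual size, but the result is that the allocated size stays 0 even when the size read from disk->src is non-zero. The actual change is in [1]: allocation is only set to 0 when the size read from disk->src is also 0.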
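
The difference between the two behaviours can be modelled in a few lines of Python. This is only a sketch of the logic as described in this post, not the libvirt patch itself: both function names are hypothetical, and the sample values assume the monitor reported a physical size of 0 for sda while domstats showed an allocation of 149946368.

```python
def block_info_before(entry_physical, allocation, src_physical):
    """Model of the quoted branch; returns (allocation, physical)."""
    if entry_physical == 0 or allocation == 0 or allocation == entry_physical:
        allocation = entry_physical  # becomes 0 when the monitor reports 0
        physical = src_physical      # refreshed from disk->src
    else:
        physical = entry_physical
    return allocation, physical


def block_info_after(entry_physical, allocation, src_physical):
    """Model of the described fix: zero the allocation only if disk->src is 0 too."""
    if entry_physical == 0 or allocation == 0 or allocation == entry_physical:
        physical = src_physical
        if physical == 0:
            allocation = 0
    else:
        physical = entry_physical
    return allocation, physical


before = block_info_before(0, 149946368, 536870912)
after = block_info_after(0, 149946368, 536870912)
```

Under these assumptions the first model reproduces the domblkinfo output shown earlier (Allocation 0, Physical 536870912), while the second keeps the allocation that domstats reported.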

Test:
win10 was reinstalled three times, and disk allocation was normal throughout each installation.
