NVMe Protocol Debugging Summary
1. NVMe Q&A
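What is NVMe?
Baidu Baike puts it this way:
NVMe (Non-Volatile Memory Express) is a protocol similar to AHCI, built on the M.2 interface and designed specifically for flash storage. (Strictly speaking, NVMe runs over PCIe; M.2 is only one of the connectors that can carry it.) Its specific advantages include:
① a several-fold improvement in performance;
② latency reduced by more than 50%;
③ NVMe PCIe SSDs can deliver ten times the IOPS of high-end enterprise SATA SSDs;
④ automatic power state switching and dynamic power management greatly reduce power consumption;
⑤ scalability to support the next ten years of technology development.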
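How should a programmer understand it?
It is a storage protocol. Since it is a storage protocol, does it need fast reads and writes?
A: Yes.
But PCIe is the fastest protocol, so why not just use PCIe?
A: PCIe is very complicated.
So we put a wrapper around PCIe and that does the job?
A: NVMe is exactly that, a wrapper around PCIe.
How does NVMe pull this off?
A: PCIe is an essay question and NVMe is fill-in-the-blank, yet the end result is the same.
How do we fill it in? What do we fill in?
A: Fill in the table below with whatever you want to send. It is 64 bytes in total; put 0 in any field you do not need.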
I/O command (fields listed from last to first):
appmask | apptag | reftag | dsmgmt | slba | addr | metadata | rsvd | nblocks | control | flags | opcode
Admin command (fields listed from last to first):
rsvd11 | numd | offset | lid | prp2 | prp1 | rsvd1 | command_id | flags | opcode
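To make the "64 bytes, unused fields set to 0" rule concrete, here is a minimal sketch of the generic 64-byte submission entry. The struct name, field names and the helper make_cmd are illustrative assumptions, not taken from the driver; only the overall layout (opcode, flags, command ID, NSID, metadata pointer, PRP1, PRP2, command dwords 10 to 15) follows the common command format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative layout of one 64-byte submission queue entry.
 * Names are hypothetical; offsets are noted next to each field. */
struct nvme_sqe_sketch {
    uint8_t  opcode;      /* byte 0 */
    uint8_t  flags;       /* byte 1 */
    uint16_t command_id;  /* bytes 2-3 */
    uint32_t nsid;        /* bytes 4-7 */
    uint32_t cdw2[2];     /* bytes 8-15 */
    uint64_t metadata;    /* bytes 16-23 */
    uint64_t prp1;        /* bytes 24-31 */
    uint64_t prp2;        /* bytes 32-39 */
    uint32_t cdw10[6];    /* bytes 40-63, command specific */
};

static_assert(sizeof(struct nvme_sqe_sketch) == 64, "entry must be 64 bytes");
static_assert(offsetof(struct nvme_sqe_sketch, prp1) == 24, "PRP1 at byte 24");

/* Zero the whole entry first, then fill only the fields that are needed. */
static struct nvme_sqe_sketch make_cmd(uint8_t opcode, uint32_t nsid,
                                       uint64_t prp1)
{
    struct nvme_sqe_sketch c;
    memset(&c, 0, sizeof(c));
    c.opcode = opcode;
    c.nsid = nsid;
    c.prp1 = prp1;
    return c;
}
```

Zeroing with memset before filling is the simplest way to guarantee that every unused field really is 0.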
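Where does NVMe sit?
NVMe is a protocol for communication between the Host and the SSD; it belongs to the upper layer of the protocol stack.
Which words go into which blanks of an NVMe command?
NVMe defines the commands exchanged between the Host and the SSD, and how those commands are executed.
NVMe has two kinds of commands. Admin Commands let the Host manage and control the SSD; I/O Commands transfer data between the Host and the SSD. The command lists supported by NVMe 1.2 are:
Admin Commands supported by NVMe: [table not included]
I/O Commands supported by NVMe: [table not included]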
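What if commands are sent faster than I can execute them?
Set up two buffers:
Submission buffer: Submission Queue (SQ).
Completion buffer: Completion Queue (CQ).
When I have finished processing, how do I tell you?
Just write this register: the Doorbell Register (DB).
What does the system architecture look like?
What on earth is a namespace?
A namespace is a range of logical blocks on the flash that the Host sees as one storage unit. Each one has an ID, called the namespace ID.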
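How does a command get from NVMe to the SSD?
For example, the Host needs to read nblocks = 2 of data from flash address 0x02000000, and PRP1 gives the memory address 0x10000000. What has to happen?
First we build the nvme_cmd packet. It is a read command that carries the read address (0x02000000), the length (nblocks = 2) and where to put the data (PRP). We then drop the packet into the SQ and write the doorbell to tell the controller that work has arrived. The controller fetches the command, converts it into TLPs, and uses PCIe Memory Write to place the data from 0x02000000 into 0x10000000. It then writes a completion entry at the tail of the CQ and notifies the Host; the Host processes the entry and writes the CQ doorbell to tell the controller that it is done.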
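1: The Host places the command in the SQ.
2: The Host writes the SQ Tail DB to tell the SSD to fetch the command.
3: The SSD receives the notification and fetches the command from the SQ on the Host side. Over PCIe this is done by sending a Memory Read TLP to the Host's SQ.
4: The SSD executes the read command: it reads the data from flash into its buffer and then transfers the data to the Host.
5: The SSD writes the status back to the Host's CQ.
6: The SSD raises an interrupt to tell the Host to process the CQ.
7: The Host processes the corresponding CQ entry.
8: The Host writes the CQ Head DB to tell the SSD that the completion has been handled.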
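The steps above can be sketched as a toy model of one queue pair. Everything here is an assumption made for illustration: the queue depth, the struct and function names, and the simplification that the host can read the controller's SQ head directly (a real host learns it from completion entries). The phase bit is what lets the host tell new completions from stale ones.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QDEPTH 4  /* illustrative queue depth */

struct sq_entry { uint8_t opcode; uint16_t cid; };
struct cq_entry { uint16_t cid; uint16_t status; uint8_t phase; };

struct queue_pair {
    struct sq_entry sq[QDEPTH];
    struct cq_entry cq[QDEPTH];
    uint16_t sq_tail, sq_tail_db;  /* host side, and the doorbell value */
    uint16_t sq_head;              /* controller's fetch position */
    uint16_t cq_tail;              /* controller's post position */
    uint16_t cq_head, cq_head_db;  /* host side, and the doorbell value */
    uint8_t  cq_phase, host_phase; /* phase written / phase expected */
};

static void qp_init(struct queue_pair *q)
{
    memset(q, 0, sizeof(*q));
    q->cq_phase = 1;
    q->host_phase = 1;
}

/* Steps 1-2: place the command in the SQ and ring the SQ tail doorbell. */
static int host_submit(struct queue_pair *q, uint8_t opcode, uint16_t cid)
{
    uint16_t next = (uint16_t)((q->sq_tail + 1) % QDEPTH);
    if (next == q->sq_head)
        return -1;                 /* queue full */
    q->sq[q->sq_tail].opcode = opcode;
    q->sq[q->sq_tail].cid = cid;
    q->sq_tail = next;
    q->sq_tail_db = q->sq_tail;
    return 0;
}

/* Steps 3-5: fetch each command, "execute" it, post a completion. */
static int controller_process(struct queue_pair *q)
{
    int done = 0;
    while (q->sq_head != q->sq_tail_db &&
           (q->cq_tail + 1) % QDEPTH != q->cq_head_db) {
        struct sq_entry cmd = q->sq[q->sq_head];
        q->sq_head = (uint16_t)((q->sq_head + 1) % QDEPTH);
        q->cq[q->cq_tail].cid = cmd.cid;
        q->cq[q->cq_tail].status = 0;
        q->cq[q->cq_tail].phase = q->cq_phase;
        q->cq_tail = (uint16_t)((q->cq_tail + 1) % QDEPTH);
        if (q->cq_tail == 0)
            q->cq_phase ^= 1;      /* phase flips on wrap-around */
        done++;
    }
    return done;
}

/* Steps 7-8: consume new completions, then ring the CQ head doorbell. */
static int host_reap(struct queue_pair *q, uint16_t *cids, int max)
{
    int n = 0;
    while (n < max && q->cq[q->cq_head].phase == q->host_phase) {
        cids[n++] = q->cq[q->cq_head].cid;
        q->cq_head = (uint16_t)((q->cq_head + 1) % QDEPTH);
        if (q->cq_head == 0)
            q->host_phase ^= 1;
    }
    q->cq_head_db = q->cq_head;
    return n;
}
```

The interrupt of step 6 is not modeled; the caller simply invokes host_reap after controller_process.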
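2. NVMe debugging preparation
This debugging session uses a third-party NVMe card, and the software environment is Linux kernel 3.11.10. After the card is inserted, device 1987:5007 appears in the PCI tree, as shown:
The NVMe card is now recognized as a PCI device, so the next step is porting the driver. Download linux 3.11.10 and unpack it, extract the three files nvme-core.c, nvme-scsi.c and nvme.h, then write a makefile as follows:
Then load the driver with # insmod nvme_driver.ko, after which the nvme devices show up:
Note: nvme0 is the device for which we registered file_operations; nvme0n1 corresponds to block_device_operations.
With both the device and the driver working, we can use ioctl to debug the command-issuing tool and the command parsing.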
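2. Getting namespace_ID and sg_version
Getting the namespace_id is the simplest ioctl operation, so no code is pasted here. The result is as follows:
3. SUBMIT_IO Cmd / Write and Read
Submit IO corresponds to reads and writes on the disk. Only issuing READ/WRITE commands is covered here: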
READ command:
appmask | apptag | reftag | dsmgmt | slba | addr | metadata | rsvd | nblocks | control | flags | opcode
        |        |        | 0xc1   |      | addr |          |      | n       |         |       | 0x02
Opcode: read command opcode, 0x02
Flags: set to 0
Control: set to 0
nblocks: number of blocks to read; must not exceed the maximum
metadata: not used for now
addr: address where the data is stored; it is best to allocate an array of at least 16 KB
dsmgmt: 0xc1 -> 11000001b: not compressible, sequential read, no latency information provided, typical number of reads and writes expected for this LBA range.
Reftag: This field is only used if the namespace is formatted to use end-to-end protection information.
Apptag: This field is only used if the namespace is formatted to use end-to-end protection information.
Appmask: This field is only used if the namespace is formatted to use end-to-end protection information.
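The field list above can be turned into a small helper that zeroes the structure and fills only what a READ (0x02) or WRITE (0x01) needs. This is a sketch under assumptions: user_io_sketch is a local mirror of the table (defined here so it builds without kernel headers), and build_rw and dsm_hint are hypothetical names. In the Linux driver of that era the nblocks field appears to be zero-based (0 means one block), which is worth verifying against your own tree before relying on it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the table, in memory order (opcode first). */
struct user_io_sketch {
    uint8_t  opcode;
    uint8_t  flags;
    uint16_t control;
    uint16_t nblocks;
    uint16_t rsvd;
    uint64_t metadata;
    uint64_t addr;
    uint64_t slba;
    uint32_t dsmgmt;
    uint32_t reftag;
    uint16_t apptag;
    uint16_t appmask;
};

static_assert(sizeof(struct user_io_sketch) == 48, "expected 48 bytes");

/* Compose the dsmgmt hint byte: bit 7 incompressible, bit 6 sequential,
 * bits 5:4 access latency, bits 3:0 access frequency. */
static uint32_t dsm_hint(unsigned incompressible, unsigned sequential,
                         unsigned latency, unsigned frequency)
{
    return ((incompressible & 1u) << 7) | ((sequential & 1u) << 6) |
           ((latency & 3u) << 4) | (frequency & 0xfu);
}

/* Build a READ (0x02) or WRITE (0x01) request; unused fields stay 0. */
static struct user_io_sketch build_rw(uint8_t opcode, uint64_t slba,
                                      uint16_t nblocks, void *buf)
{
    struct user_io_sketch io;
    memset(&io, 0, sizeof(io));
    io.opcode = opcode;
    io.slba = slba;
    io.nblocks = nblocks;               /* passed through unchanged */
    io.addr = (uint64_t)(uintptr_t)buf; /* user buffer for the data */
    io.dsmgmt = dsm_hint(1, 1, 0, 1);   /* 0xc1, as in the table */
    return io;
}
```

The resulting structure is what would be handed to the SUBMIT_IO ioctl on the nvme0n1 block device.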
WRITE command:
appmask | apptag | reftag | dsmgmt | slba | addr | metadata | rsvd | nblocks | control | flags | opcode
        |        |        | 0xc1   |      | addr |          |      | n       |         |       | 0x01
Opcode: write command opcode, 0x01
Flags: set to 0
Control: set to 0
nblocks: number of blocks to write; must not exceed the maximum
metadata: not used for now
addr: address of the data to be written; it is best to allocate an array of at least 16 KB
dsmgmt: 0xc1 -> 11000001b: not compressible, sequential access, no latency information provided, typical number of reads and writes expected for this LBA range.
Reftag: This field is only used if the namespace is formatted to use end-to-end protection information.
Apptag: This field is only used if the namespace is formatted to use end-to-end protection information.
Appmask: This field is only used if the namespace is formatted to use end-to-end protection information.
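One thing worth noting is the cmd argument of ioctl: the cmd used in user space is assembled from a magic number, a sequence number and the argument type, each shifted to its own bit offset, and only the combined value is what the driver layer sees as cmd.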
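As a rough illustration of that packing, the sketch below hand-encodes an ioctl number the way the generic Linux layout does (direction in bits 31:30, argument size in bits 29:16, magic in bits 15:8, sequence number in bits 7:0). The helper ioc_encode and the enum names are hypothetical; some architectures use a different bit layout, and the magic 'N' with numbers 0x40 to 0x42 and sizes 72 and 48 are what the mainline NVMe header is believed to use, so check them against your own nvme.h.

```c
#include <assert.h>
#include <stdint.h>

/* Direction codes in the generic layout (names are illustrative). */
enum { IOC_DIR_NONE = 0, IOC_DIR_WRITE = 1, IOC_DIR_READ = 2 };

/* Pack direction, magic, sequence number and argument size into one cmd. */
static uint32_t ioc_encode(uint32_t dir, uint32_t magic, uint32_t nr,
                           uint32_t size)
{
    return (dir << 30) | (size << 16) | (magic << 8) | nr;
}

/* Unpack helpers, useful when printing a cmd seen inside the driver. */
static uint32_t ioc_magic(uint32_t cmd) { return (cmd >> 8) & 0xffu; }
static uint32_t ioc_nr(uint32_t cmd)    { return cmd & 0xffu; }
static uint32_t ioc_size(uint32_t cmd)  { return (cmd >> 16) & 0x3fffu; }
```

For example, ioc_encode(IOC_DIR_WRITE, 'N', 0x42, 48) gives 0x40304E42, which is the kind of value that shows up in the driver when a user-space SUBMIT_IO request arrives.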
4. Admin Cmd send
In testing, a returned status and result that are both 0 mean the command succeeded; anything else means the command failed.
Get Log Page command:
rsvd11 | numd | offset | lid | prp2 | prp1 | rsvd1 | command_id | flags | opcode
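Opcode: nvme_admin_get_log_page
Flags: set to 0
Command_id: set to 0
Prp1: address where the data is stored; it is best to allocate an array of at least 16 KB
Prp2: datalength; pay attention to the length of datalength
Lid:
Offset: set to 0
Numd: set to 0
Note that no namespace_ID is defined here; it is best to set rsvd1[0] = ~0.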
Get Log Page: SMART / Health Information
Critical Warning: 00
Composite Temperature: (32 01) 0x0132 = 306 K
Available Spare: (64) 100%
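The three values above can be decoded with a few lines of C. This is a minimal sketch: the struct and function names are hypothetical, the byte positions (warning at byte 0, temperature at bytes 1 to 2, spare at byte 3) follow the order shown above, and the Celsius conversion uses the rounded offset 273.

```c
#include <assert.h>
#include <stdint.h>

/* Read a 16-bit little-endian value (low byte first). */
static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

struct smart_summary {
    uint8_t  critical_warning;
    uint16_t temperature_k;   /* composite temperature in kelvin */
    uint8_t  available_spare; /* percent */
};

/* Decode the first four bytes of the SMART / Health log page. */
static struct smart_summary parse_smart(const uint8_t *log)
{
    struct smart_summary s;
    s.critical_warning = log[0];
    s.temperature_k = le16(&log[1]);
    s.available_spare = log[3];
    return s;
}

/* Kelvin to whole degrees Celsius, using the rounded offset 273. */
static int kelvin_to_celsius(uint16_t k)
{
    return (int)k - 273;
}
```

With the bytes 00 32 01 64 this yields no warning, 306 K (about 33 degrees Celsius) and 100% spare.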
Identify command:
rsvd11 | cns | prp2 | prp1 | rsvd2 | nsid | command_id | flags | opcode
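Opcode: nvme_admin_identify
Flags: set to 0
Command_id: set to 0
Nsid: 0
Prp1: address where the data is stored; it is best to allocate an array of at least 16 KB
Prp2: datalength; pay attention to the length of datalength
Cns: 0x01
Identify Controller Data Structure: see the appendix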
Set Features command & Get Features command:
rsvd12 | dword11 | fid | prp2 | prp1 | rsvd2 | nsid | command_id | flags | opcode
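Opcode: nvme_admin_get_features & nvme_admin_set_features
Flags: set to 0
Command_id: set to 0
Nsid: 0
Prp1: address where the data is stored; it is best to allocate an array of at least 16 KB
Prp2: datalength; pay attention to the length of datalength
Fid: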
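5. How the driver processes a cmd
6. NVMe block device file operations interface
The file operations set of the NVMe block device is declared when the disk device is allocated. The code is as follows: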
disk->fops = &nvme_fops;
static const struct block_device_operations nvme_fops = {
	.owner        = THIS_MODULE,
	.ioctl        = nvme_ioctl,
	.compat_ioctl = nvme_ioctl,
};
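The owner member indicates that this fops belongs to the NVMe block device driver. ioctl and compat_ioctl are the two ways a user ioctl call can arrive; normally it is ioctl, while compat_ioctl serves 32-bit processes on a 64-bit kernel. Whichever way is used, both end up in nvme_ioctl.
Once inside nvme_ioctl(), the driver parses the cmd type and enters the matching branch. The ones to focus on here are NVME_IOCTL_ADMIN_CMD and NVME_IOCTL_SUBMIT_IO.
Note that both paths eventually call: nvme_submit_sync_cmd(nvmeq, &c, NULL, NVME_IO_TIMEOUT);
This function issues the command synchronously and handles the status that finally comes back. Because it sleeps, it must be called with preemption enabled; it may be preempted at any point and rescheduled later.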
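NVMe resources
The latest specification at the time of writing is NVMe 1.2.1, which can be downloaded from http://www.nvmexpress.org/specifications/; drivers are at http://www.nvmexpress.org/drivers/, which currently provides driver code for Microsoft, Linux, VMware, UEFI, FreeBSD, Solaris and other systems.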
Appendix:
Identify Controller Data Structure
Bytes are shown low byte first, high byte last (little-endian).
87 19: PCI Vendor ID (VID): the two bytes form one little-endian value, 0x1987, matching the 1987 seen in the PCI tree
87 19: PCI Subsystem Vendor ID (SSVID): 0x1987
36 37 43 45 30 37 36 36 31 30 31 37 30 30 30 30 30 31 38 33: Serial Number (SN): ASCII "67CE0766101700000183"
50 43 49 65 20 53 53 44 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20: Model Number (MN): ASCII "PCIe SSD", padded with spaces
45 37 46 4d 30 31 2e 31: Firmware Revision (FR): ASCII "E7FM01.1"
01: Recommended Arbitration Burst (RAB). Original note: one page of 2K? (The arbitration burst is normally expressed as 2^n commands, which would make this 2 commands.)
00 00 00: IEEE OUI Identifier (IEEE)
00: Controller Multi-Path I/O and Namespace Sharing Capabilities (CMIC): the NVM subsystem contains only a single PCI Express port.
09: Maximum Data Transfer Size (MDTS): the value is in units of the minimum memory page size (CAP.MPSMIN) and is reported as a power of two (2^n), so 2^9 = 512 pages.
00 00: Controller ID (CNTLID)
00 02 01 00: Version (VER): 0x00010200, Major Version Number 1, Minor Version Number 2
80 4f 12 00: RTD3 Resume Latency (RTD3R): 0x00124f80 = 1,200,000 microseconds
60 e3 16 00: RTD3 Entry Latency (RTD3E): 0x0016e360 = 1,500,000 microseconds
07 00: Optional Admin Command Support (OACS): the controller supports the Firmware Commit and Firmware Image Download commands; the controller supports the Format NVM command; the controller supports the Security Send and Security Receive commands.
03: Abort Command Limit (ACL): maximum number of concurrently outstanding Abort commands (0's based value)
03: Asynchronous Event Request Limit (AERL): maximum number of concurrently outstanding Asynchronous Event Request commands (0's based value)
02: Firmware Updates (FRMW): the controller requires a reset for firmware to be activated; the number of firmware slots the controller supports (valid range 1 to 7) is 1; the first firmware slot (slot 1) is read/write.
03: Log Page Attributes (LPA): the controller supports the Command Effects log page; the controller supports the SMART / Health information log page on a per namespace basis.
3f: Error Log Page Entries (ELPE): the maximum number of Error Information log entries that are stored by the controller (0's based value)
04: Number of Power States Support (NPSS): the number of NVM Express power states supported by the controller (0's based value, so PSD0 to PSD4)
01: Admin Vendor Specific Command Configuration (AVSCC): all Admin Vendor Specific Commands use the format defined in Figure 13.
01: Autonomous Power State Transition Attributes (APSTA): the controller supports autonomous power state transitions.
7f 01: Warning Composite Temperature Threshold (WCTEMP): warning temperature, 0x017f = 383 K
93 01: Critical Composite Temperature Threshold (CCTEMP): critical temperature, 0x0193 = 403 K
66: Submission Queue Entry Size (SQES): maximum Submission Queue entry size when using the NVM Command Set: 6 (2^6 = 64 bytes); required Submission Queue entry size when using the NVM Command Set: 6 (2^6 = 64 bytes)
44: Completion Queue Entry Size (CQES): maximum Completion Queue entry size when using the NVM Command Set: 4 (2^4 = 16 bytes); required Completion Queue entry size when using the NVM Command Set: 4 (2^4 = 16 bytes)
01 00 00 00: Number of Namespaces (NN): the number of valid namespaces present for the controller: 1
1e 00: Optional NVM Command Support (ONCS): the controller does not support the Compare command; the controller supports the Write Uncorrectable command; the controller supports the Dataset Management command; the controller supports the Write Zeroes command; the controller supports the Save field in the Set Features command and the Select field in the Get Features command.
00 00: Fused Operation Support (FUSES): the controller does not support the Compare and Write fused operation.
01: Format NVM Attributes (FNA): all namespaces shall be configured with the same attributes, and a format of any namespace results in a format of all namespaces.
01: Volatile Write Cache (VWC, byte 525): indicates that a volatile write cache is present.
ff 00: Atomic Write Unit Normal (AWUN): the maximum number of logical blocks for an atomic write
00 00: Atomic Write Unit Power Fail (AWUPF)
01: NVM Vendor Specific Command Configuration (NVSCC): all NVM Vendor Specific Commands use the format defined in Figure 13.
00 00: Atomic Compare & Write Unit (ACWU)
16 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 52 1c 40 00 16 03 81 00 00 00 00 00 00 00 00 00: Power State 0 Descriptor (PSD0)
16 03: the maximum power consumed by the NVM subsystem in this power state: 0x0316 = 790 units of 0.01 W = 7.9 W
00: reserved
00: the controller processes I/O commands in this power state; the scale of the Maximum Power field is 0.01 W.
00 00 00 00: the maximum entry latency in microseconds associated with entering this power state
00 00 00 00: the maximum exit latency in microseconds associated with exiting this power state
00: the relative read throughput associated with this power state
00: the relative read latency associated with this power state
00: the relative write throughput associated with this power state
52 1c: the typical power consumed by the NVM subsystem over 30 seconds in this power state when idle: 0x1c52 = 7250 units of 0.0001 W = 0.725 W
40: Idle Power Scale (0.0001 W)
00: reserved
81: Active Power Scale: 0.01 W; the workload used to calculate maximum power for this power state: 001b
f0 00: the largest average power consumed by the NVM subsystem over a 10 second period in this power state with the workload indicated in the Active Power Workload field
00 00 00 00 00 00 00 00 00: Power State 1 Descriptor (PSD1)
{
be 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 52 1c 40 00 be 00 81 00 00 00 00 00 00 00 00 00: Power State 2 Descriptor (PSD2)
4c 04 00 03 58 02 00 00 58 02 00 00 02 02 02 02 4c 04 40 00 4c 04 41 00 00 00 00 00 00 00 00 00: Power State 3 Descriptor (PSD3)
32 00 00 03 a0 86 01 00 00 71 02 00 03 03 03 03 32 00 40 00 32 00 41 00 00 00 00 00 00 00 00 00: Power State 4 Descriptor (PSD4)
}
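Several of the appendix values are encoded as powers of two or scaled integers, so a short decoding sketch may help when reading a raw dump. All function names below are hypothetical, and the 4096-byte minimum page size passed to mdts_bytes in the usage note is an assumption (the real value comes from CAP.MPSMIN, which is not shown in this dump). Power is returned in microwatts to avoid floating point.

```c
#include <assert.h>
#include <stdint.h>

/* Little-endian readers: low byte first, as in the dump above. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* MDTS: 2^n units of the minimum page size; 0 means no limit reported. */
static uint64_t mdts_bytes(uint8_t mdts, uint32_t min_page_size)
{
    return mdts ? ((uint64_t)min_page_size << mdts) : 0;
}

/* SQES / CQES: low nibble = required size, high nibble = maximum size,
 * both stored as the exponent of a power of two. */
static unsigned required_entry_size(uint8_t v) { return 1u << (v & 0x0f); }
static unsigned maximum_entry_size(uint8_t v)  { return 1u << (v >> 4); }

/* Power fields: scale code 1 means 0.0001 W units, 2 means 0.01 W units.
 * Returns microwatts, or 0 for an unreported or unknown scale. */
static uint32_t power_microwatts(uint16_t raw, unsigned scale_code)
{
    if (scale_code == 1)
        return (uint32_t)raw * 100u;
    if (scale_code == 2)
        return (uint32_t)raw * 10000u;
    return 0;
}
```

For instance, with an assumed 4096-byte minimum page size, mdts_bytes(0x09, 4096) is 2 MiB, and power_microwatts(rd16 of the bytes 52 1c, 1) is 725000, which is the 0.725 W idle figure above.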