What “Did we miss an interrupt?” means - nvme_timeout

https://lkml.org/lkml/2018/4/26/623
Question:
We are testing NVMe cards on an ARM64 platform; the cards use legacy interrupts.
Intermittently we hit the following case in nvme_timeout() in drivers/nvme/host/pci.c:
        /*
         * Did we miss an interrupt?
         */
        if (__nvme_poll(nvmeq, req->tag)) {
                dev_warn(dev->ctrl.device,
                         "I/O %d QID %d timeout, completion polled\n",
                         req->tag, nvmeq->qid);
                return BLK_EH_HANDLED;
        }

Can anyone tell me when nvme_timeout() gets invoked?
What does "Did we miss an interrupt?" mean? Does it mean the host missed servicing an interrupt raised by the EP card?

Answer:
I recall we did observe issues like this when legacy interrupts were used, which is why the driver tries to use MSI/MSI-X if possible.
nvme_timeout() is called from the block layer when the driver didn't provide a completion within the timeout (the default is 30 seconds for I/O, 60 seconds for admin commands).
The message you're seeing means the device did indeed post a completion queue entry for the timed-out command, but the driver believes it was never notified via an interrupt to check the completion queue.
This means one of two things happened:
1. the interrupt was raised prior to the completion queue entry being written; or
2. the interrupt was never raised in the first place.
It might be possible to determine which if you can read the values from /proc/irq/<irq#>/spurious and see whether "last_unhandled" aligns with the expected completion time.

It appears that legacy interrupts (via INTx) are unreliable, which is exactly why the nvme driver tries to register MSI-X/MSI interrupts first.

Patches for polling the CQ on timeout in case "we missed an interrupt":
commit 0b2a8a9f4b564c7d923597828d93cd1f69ce40e0
Author: Christoph Hellwig <[email protected]>
Date:   Sun Dec 2 17:46:20 2018 +0100

    nvme-pci: consolidate code for polling non-dedicated queues
    
    We have three places that can poll for I/O completions on a normal
    interrupt-enabled queue.  All of them are in slow path code, so
    consolidate them to a single helper that uses spin_lock_irqsave and
    removes the fast path cqe_pending check.
    
commit 7776db1ccc123d5944a8c170c9c45f7e91d49643
Author: Keith Busch <[email protected]>
Date:   Fri Feb 24 17:59:28 2017 -0500

    nvme/pci: Poll CQ on timeout
    
    If an IO timeout occurs, it's helpful to know if the controller did not
    post a completion or the driver missed an interrupt. While we never expect
    the latter, this patch will make it possible to tell the difference so
    we don't have to guess.

commit 3a7afd8ee42a68d4f24ab9c947a4ef82d4d52375
Author: Christoph Hellwig <[email protected]>
Date:   Sun Dec 2 17:46:23 2018 +0100

    nvme-pci: remove the CQ lock for interrupt driven queues
    
    Now that we can't poll regular, interrupt driven I/O queues there
    is almost nothing that can race with an interrupt.  The only
    possible other contexts polling a CQ are the error handler and
    queue shutdown, and both are so far off in the slow path that
    we can simply use the big hammer of disabling interrupts.
    
    With that we can stop taking the cq_lock for normal queues.

   
