What “Did we miss an interrupt?” means - nvme_timeout

https://lkml.org/lkml/2018/4/26/623
Question:
We are testing NVMe cards on an ARM64 platform; the cards use legacy interrupts.
Intermittently we hit the following case in nvme_timeout() in drivers/nvme/host/pci.c:
        /*
         * Did we miss an interrupt?
         */
        if (__nvme_poll(nvmeq, req->tag)) {
                dev_warn(dev->ctrl.device,
                         "I/O %d QID %d timeout, completion polled\n",
                         req->tag, nvmeq->qid);
                return BLK_EH_HANDLED;
        }

Can anyone tell me when nvme_timeout() gets invoked?
What does "Did we miss an interrupt?" mean? Does it mean the host missed servicing an interrupt raised by the EP card?

Answer:
I recall we did observe issues like this when legacy interrupts were used, which is why the driver tries to use MSI/MSI-X if possible.
nvme_timeout() is called from the block layer when the driver didn't provide a completion within the timeout (the default is 30 seconds for I/O, 60 seconds for admin commands).
The message you're seeing means the device did indeed post a completion queue entry for the timed-out command, but the driver believes it was never notified via an interrupt to check the completion queue.
This means one of two things happened:
1. the interrupt was raised prior to the completion queue entry being written; or
2. the interrupt was never raised in the first place.
It might be possible to determine which if you can read the values from /proc/irq/<irq#>/spurious and see whether "last_unhandled" aligns with the expected completion time.

It appears that legacy interrupts (via INTx) are unreliable, which is exactly why the nvme driver tries to register MSI-X/MSI interrupts first.

Patches for polling the CQ on timeout in case "we missed an interrupt":
commit 0b2a8a9f4b564c7d923597828d93cd1f69ce40e0
Author: Christoph Hellwig <[email protected]>
Date:   Sun Dec 2 17:46:20 2018 +0100

    nvme-pci: consolidate code for polling non-dedicated queues
    
    We have three places that can poll for I/O completions on a normal
    interrupt-enabled queue.  All of them are in slow path code, so
    consolidate them to a single helper that uses spin_lock_irqsave and
    removes the fast path cqe_pending check.
    
commit 7776db1ccc123d5944a8c170c9c45f7e91d49643
Author: Keith Busch <[email protected]>
Date:   Fri Feb 24 17:59:28 2017 -0500

    nvme/pci: Poll CQ on timeout
    
    If an IO timeout occurs, it's helpful to know if the controller did not
    post a completion or the driver missed an interrupt. While we never expect
    the latter, this patch will make it possible to tell the difference so
    we don't have to guess.

commit 3a7afd8ee42a68d4f24ab9c947a4ef82d4d52375
Author: Christoph Hellwig <[email protected]>
Date:   Sun Dec 2 17:46:23 2018 +0100

    nvme-pci: remove the CQ lock for interrupt driven queues
    
    Now that we can't poll regular, interrupt driven I/O queues there
    is almost nothing that can race with an interrupt.  The only
    possible other contexts polling a CQ are the error handler and
    queue shutdown, and both are so far off in the slow path that
    we can simply use the big hammer of disabling interrupts.
    
    With that we can stop taking the cq_lock for normal queues.

   
