Streams - 台部落

Streams

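By default, all CUDA operations (kernel execution and data transfers) run in the same stream, and operations within one stream execute serially. Creating additional streams by hand makes two kinds of overlap possible:
Overlap of host-device data transfers with device computation
Overlap of several different kernels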
```
cudaError_t cudaStreamCreate(cudaStream_t* pStream);
cudaError_t cudaStreamDestroy(cudaStream_t stream);
```

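Data transfers between device and host can be asynchronous, but the call must name the stream it belongs to, and the host-side buffer must be page-locked (pinned) memory.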

cudaError_t cudaMemcpyAsync(void* dst,
                            const void* src,
                            size_t count,
                            cudaMemcpyKind kind,
                            cudaStream_t stream = 0);

A kernel launch can name the stream it belongs to:

kernelFunc<<<gridDim, blockDim, sharedMemSize, stream>>>(args);

Query whether all asynchronous operations in a stream have completed:

cudaError_t cudaStreamQuery(cudaStream_t stream);

Block the host until the asynchronous operations in a stream have completed:

cudaError_t cudaStreamSynchronize(cudaStream_t stream);
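
The calls above can be put together as follows. This is only a minimal sketch: the kernel `addOne`, the helper `runSingleStreamDemo`, and the `CHECK` macro are hypothetical names introduced for illustration, and error paths skip cleanup to keep it short.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Bail out of the enclosing function with -1 on any CUDA error.
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Hypothetical kernel: adds 1.0f to every element.
__global__ void addOne(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Copy in, compute, and copy out in a single non-null stream.
// Returns the number of wrong elements, or -1 on a CUDA error.
int runSingleStreamDemo(int n) {
    assert(n > 0);
    const size_t bytes = n * sizeof(float);
    float* hostBuf = nullptr;
    float* devBuf = nullptr;
    cudaStream_t stream;

    CHECK(cudaMallocHost((void**)&hostBuf, bytes));  // page-locked host memory
    CHECK(cudaMalloc((void**)&devBuf, bytes));
    CHECK(cudaStreamCreate(&stream));
    for (int i = 0; i < n; ++i) hostBuf[i] = (float)i;

    // All three operations are queued in `stream` and run in issue order.
    CHECK(cudaMemcpyAsync(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice, stream));
    addOne<<<(n + 255) / 256, 256, 0, stream>>>(devBuf, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(hostBuf, devBuf, bytes, cudaMemcpyDeviceToHost, stream));

    // Non-blocking poll: cudaErrorNotReady means work is still pending.
    cudaError_t status = cudaStreamQuery(stream);
    if (status != cudaSuccess && status != cudaErrorNotReady) return -1;
    printf("stream %s\n", status == cudaSuccess ? "already idle" : "still busy");

    CHECK(cudaStreamSynchronize(stream));  // block the host until the stream drains

    int wrong = 0;
    for (int i = 0; i < n; ++i)
        if (hostBuf[i] != (float)i + 1.0f) ++wrong;

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(devBuf));
    CHECK(cudaFreeHost(hostBuf));
    return wrong;
}
```

Calling `runSingleStreamDemo(1 << 20)` from `main` should return 0 when every element came back incremented. Which message the poll prints depends on timing.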
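Streams are divided into the null stream (the default) and non-null streams, and non-null streams are further divided into blocking and non-blocking streams. Kernel execution in a blocking stream and in the null stream block each other, whereas blocking streams do not block one another. A non-blocking stream does not synchronize with the null stream.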
```
cudaError_t cudaStreamCreateWithFlags(cudaStream_t* pStream, unsigned int flags);
// Pass cudaStreamDefault to create a blocking stream
// Pass cudaStreamNonBlocking to create a non-blocking stream
```
Operations issued to a non-null stream must be asynchronous, for example cudaMemcpyAsync rather than cudaMemcpy.
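
As a small illustration of the two flag values, the sketch below creates a stream with a given flag and reads the flag back with `cudaStreamGetFlags`. The helper name `createAndReadFlags` is hypothetical.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Creates a stream with the given flag and returns the flags CUDA reports
// back for it, or -1 on a CUDA error.
int createAndReadFlags(unsigned int flags) {
    assert(flags == cudaStreamDefault || flags == cudaStreamNonBlocking);
    cudaStream_t stream;
    unsigned int reported = 0;
    if (cudaStreamCreateWithFlags(&stream, flags) != cudaSuccess) return -1;
    cudaError_t err = cudaStreamGetFlags(stream, &reported);
    cudaStreamDestroy(stream);
    return err == cudaSuccess ? (int)reported : -1;
}
```

A stream created with plain `cudaStreamCreate` behaves like one created with `cudaStreamDefault`, that is, it is a blocking stream.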

Event

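An event can be thought of as a marker in a stream. When device-side execution of the stream reaches the marker, the flag in the corresponding host-side event is updated.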

cudaError_t cudaEventCreate(cudaEvent_t* event);
cudaError_t cudaEventDestroy(cudaEvent_t event);

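// Insert a marker into the stream
cudaError_t cudaEventRecord(cudaEvent_t event, cudaStream_t stream = 0);

// Spin on the host until the stream has reached the marker
cudaError_t cudaEventSynchronize(cudaEvent_t event);

// Query whether the stream has reached the marker (does not block the host)
cudaError_t cudaEventQuery(cudaEvent_t event);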
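
The sketch below shows the marker idea in practice: an event is recorded right after a kernel, the host polls it with `cudaEventQuery`, and then calls `cudaEventSynchronize`. The kernel `spin`, the helper `pollsUntilEventFires`, and the `CHECK` macro are hypothetical names, and cleanup on error paths is skipped.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Bail out of the enclosing function with -1 on any CUDA error.
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Hypothetical kernel: busy-waits for roughly `cycles` GPU clock ticks.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Returns how many times cudaEventQuery reported "not ready" before the
// event fired, or -1 on a CUDA error.
int pollsUntilEventFires(long long spinCycles) {
    assert(spinCycles > 0);
    cudaStream_t stream;
    cudaEvent_t done;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaEventCreate(&done));

    spin<<<1, 1, 0, stream>>>(spinCycles);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(done, stream));  // marker placed right after the kernel

    int polls = 0;
    for (;;) {
        cudaError_t status = cudaEventQuery(done);  // never blocks the host
        if (status == cudaSuccess) break;
        if (status != cudaErrorNotReady) return -1;
        ++polls;
    }
    CHECK(cudaEventSynchronize(done));  // returns at once: the event already fired

    CHECK(cudaEventDestroy(done));
    CHECK(cudaStreamDestroy(stream));
    return polls;
}
```

The number of polls depends on the GPU and on timing, so only a non-negative result can be relied on.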

Events can also synchronize different streams: record the event in one stream, and make another stream wait until that event has fired.
```
cudaEventRecord(event,streamA);
cudaStreamWaitEvent(streamB,event,0);
```
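
A fuller sketch of the same pattern, assuming a producer kernel in streamA and a consumer kernel in streamB that work on the same buffer. The kernels `fillWith` and `doubleAll`, the helper `runProducerConsumer`, and the `CHECK` macro are hypothetical names, and the event is created with `cudaEventDisableTiming` because it is used only for ordering.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Bail out of the enclosing function with -1 on any CUDA error.
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Hypothetical producer: writes `value` into every element.
__global__ void fillWith(int* data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Hypothetical consumer: doubles every element.
__global__ void doubleAll(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

// Returns the number of wrong elements, or -1 on a CUDA error.
int runProducerConsumer(int n) {
    assert(n > 0);
    int* dev = nullptr;
    cudaStream_t streamA, streamB;
    cudaEvent_t ready;
    CHECK(cudaMalloc((void**)&dev, n * sizeof(int)));
    CHECK(cudaStreamCreate(&streamA));
    CHECK(cudaStreamCreate(&streamB));
    CHECK(cudaEventCreateWithFlags(&ready, cudaEventDisableTiming));

    const int blocks = (n + 255) / 256;
    fillWith<<<blocks, 256, 0, streamA>>>(dev, n, 21);  // producer in streamA
    CHECK(cudaEventRecord(ready, streamA));             // marker after the producer
    CHECK(cudaStreamWaitEvent(streamB, ready, 0));      // streamB waits; the host does not
    doubleAll<<<blocks, 256, 0, streamB>>>(dev, n);     // consumer in streamB
    CHECK(cudaGetLastError());
    CHECK(cudaStreamSynchronize(streamB));

    std::vector<int> host(n);
    CHECK(cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost));
    int wrong = 0;
    for (int v : host)
        if (v != 42) ++wrong;

    CHECK(cudaEventDestroy(ready));
    CHECK(cudaStreamDestroy(streamA));
    CHECK(cudaStreamDestroy(streamB));
    CHECK(cudaFree(dev));
    return wrong;
}
```

Without the `cudaStreamWaitEvent` call, the two kernels would be unordered and the consumer could read the buffer before the producer has written it.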

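Events are configurable:

cudaError_t cudaEventCreateWithFlags(cudaEvent_t* event, unsigned int flags);
// The following flags can be set
cudaEventDefault        // default behavior
cudaEventBlockingSync   // cudaEventSynchronize blocks the calling thread instead of spinning
cudaEventDisableTiming  // the event records no timing data, which lowers overhead
cudaEventInterprocess   // the event may be shared across processes; requires cudaEventDisableTiming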
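
For instance, an event meant only for waiting might be created as in the sketch below. The helper name `recordAndWait` is hypothetical, and the event is recorded in the null stream just to keep the example short.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Creates an event with the given flags, records it in the null stream,
// and waits for it. Returns 0 on success, -1 on a CUDA error.
int recordAndWait(unsigned int flags) {
    // cudaEventInterprocess is only valid together with cudaEventDisableTiming.
    assert(!(flags & cudaEventInterprocess) || (flags & cudaEventDisableTiming));
    cudaEvent_t event;
    if (cudaEventCreateWithFlags(&event, flags) != cudaSuccess) return -1;
    cudaError_t err = cudaEventRecord(event, 0);
    if (err == cudaSuccess) err = cudaEventSynchronize(event);
    cudaEventDestroy(event);
    return err == cudaSuccess ? 0 : -1;
}
```

Flags are combined with bitwise OR, for example `recordAndWait(cudaEventBlockingSync | cudaEventDisableTiming)`.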

False dependencies on the Fermi architecture

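On Fermi, operations from different streams are ultimately sent to a single work queue, and operations in the queue must be dequeued in first-in, first-out order. If two consecutive operations belong to different streams, the later one does not have to wait for the earlier one to finish before it is dequeued. Suppose there are streams A and B, where A contains operations A1 and A2 and B contains B1, and the host issues them in the order A1, A2, B1. The queue then looks like this (head of the queue on the right):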
```
B1      A2      A1
```
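Although B1 does not depend on any other operation, its position in the queue means it can only be dequeued after A2. A2 is in the same stream as A1 and therefore depends on it, so A2 cannot be dequeued until A1 has finished, and B1 is stuck behind it. This is a false dependency.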
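
One commonly suggested workaround, which the text above does not cover, is to issue work breadth-first: the first operation of every stream, then the second of every stream, and so on, so that neighbors in the queue belong to different streams. The sketch below assumes that approach. The kernel `incrementAll`, the helper `runBreadthFirst`, and the `CHECK` macro are hypothetical names.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Bail out of the enclosing function with -1 on any CUDA error.
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Hypothetical kernel: adds 1 to every element.
__global__ void incrementAll(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Issues `steps` dependent kernels into each of `numStreams` streams in
// breadth-first order. Returns the number of wrong elements, or -1 on error.
int runBreadthFirst(int numStreams, int steps, int n) {
    assert(numStreams > 0 && steps > 0 && n > 0);
    std::vector<cudaStream_t> streams(numStreams);
    std::vector<int*> dev(numStreams, nullptr);
    for (int s = 0; s < numStreams; ++s) {
        CHECK(cudaStreamCreate(&streams[s]));
        CHECK(cudaMalloc((void**)&dev[s], n * sizeof(int)));
        CHECK(cudaMemsetAsync(dev[s], 0, n * sizeof(int), streams[s]));
    }

    const int blocks = (n + 255) / 256;
    // Outer loop over steps, inner loop over streams: A1, B1, A2, B2, ...
    for (int k = 0; k < steps; ++k)
        for (int s = 0; s < numStreams; ++s)
            incrementAll<<<blocks, 256, 0, streams[s]>>>(dev[s], n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    std::vector<int> host(n);
    int wrong = 0;
    for (int s = 0; s < numStreams; ++s) {
        CHECK(cudaMemcpy(host.data(), dev[s], n * sizeof(int), cudaMemcpyDeviceToHost));
        for (int v : host)
            if (v != steps) ++wrong;
        CHECK(cudaStreamDestroy(streams[s]));
        CHECK(cudaFree(dev[s]));
    }
    return wrong;
}
```

Swapping the two loops gives the depth-first order A1, A2, B1, B2 from the example above. The results are identical either way; only the opportunity for overlap on a single-queue device differs.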
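Kepler's Grid Management Unit reorders and schedules the operations in the work queue, which effectively avoids false dependencies.
Kepler's Hyper-Q lets the GPU maintain up to 32 work queues (8 by default). As long as there are no more streams than available queues, CUDA gives each stream a queue of its own, which also avoids false dependencies.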
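
To my understanding, the number of queues is controlled by the `CUDA_DEVICE_MAX_CONNECTIONS` environment variable, which has to be set before the CUDA context is created. The sketch below assumes a POSIX system (it uses `setenv`), and the helper name `requestWorkQueues` is hypothetical.

```cuda
#include <cassert>
#include <cstdio>
#include <stdlib.h>  // setenv (POSIX)
#include <cuda_runtime.h>

// Asks for `count` work queues and then forces context creation.
// Only has an effect if no CUDA context exists yet in this process.
bool requestWorkQueues(int count) {
    assert(count >= 1 && count <= 32);
    char value[4];
    std::snprintf(value, sizeof(value), "%d", count);
    if (setenv("CUDA_DEVICE_MAX_CONNECTIONS", value, 1) != 0) return false;
    return cudaFree(0) == cudaSuccess;  // a harmless call that initializes the context
}
```

The same effect can be had without code by exporting the variable in the shell before launching the program.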

Using streams to overlap data transfers with device computation

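CUDA supports at most one stream performing an H2D transfer and one stream performing a D2H transfer at the same time, so issuing several transfers in the same direction from different streams does not make them run in parallel; they are serialized.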
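
A common way to get the overlap is to split the data into chunks and give each chunk its own stream, so that the copy of one chunk can run while another chunk is being computed. The following is a minimal sketch under that assumption. The kernel `scaleChunk`, the helper `runChunkedPipeline`, the `CHECK` macro, and the limit of 16 streams are all choices made for illustration.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Bail out of the enclosing function with -1 on any CUDA error.
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Hypothetical kernel: multiplies every element by `factor`.
__global__ void scaleChunk(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Processes n floats in numStreams chunks, one stream per chunk.
// Returns the number of wrong elements, or -1 on a CUDA error.
int runChunkedPipeline(int n, int numStreams) {
    assert(n > 0 && numStreams > 0 && numStreams <= 16 && n % numStreams == 0);
    const int chunk = n / numStreams;
    const size_t chunkBytes = chunk * sizeof(float);
    float* host = nullptr;
    float* dev = nullptr;
    cudaStream_t streams[16];

    CHECK(cudaMallocHost((void**)&host, n * sizeof(float)));  // pinned, needed for async copies
    CHECK(cudaMalloc((void**)&dev, n * sizeof(float)));
    for (int i = 0; i < n; ++i) host[i] = 1.0f;
    for (int s = 0; s < numStreams; ++s) CHECK(cudaStreamCreate(&streams[s]));

    for (int s = 0; s < numStreams; ++s) {
        const int offset = s * chunk;
        // Within a stream the three steps stay ordered; across streams they may overlap.
        CHECK(cudaMemcpyAsync(dev + offset, host + offset, chunkBytes,
                              cudaMemcpyHostToDevice, streams[s]));
        scaleChunk<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(dev + offset, chunk, 2.0f);
        CHECK(cudaMemcpyAsync(host + offset, dev + offset, chunkBytes,
                              cudaMemcpyDeviceToHost, streams[s]));
    }
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    int wrong = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] != 2.0f) ++wrong;

    for (int s = 0; s < numStreams; ++s) CHECK(cudaStreamDestroy(streams[s]));
    CHECK(cudaFree(dev));
    CHECK(cudaFreeHost(host));
    return wrong;
}
```

How much overlap actually occurs depends on the device and is best confirmed with a profiler; the sketch only checks that the results are correct.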
