Occupancy

Definitions

Active Warps

A warp is considered active from the moment its threads begin executing until all threads in the warp have finished executing.

Occupancy

occupancy = active warps per SM / maximum warps per SM
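As a minimal sketch of the ratio above, the following Python helper computes occupancy as a fraction. The function name and the figure of 64 maximum warps per SM are assumptions chosen for illustration; the real limit depends on the device's compute capability.

```python
def occupancy(active_warps_per_sm: float, max_warps_per_sm: int) -> float:
    """Return occupancy as a fraction between 0 and 1."""
    if max_warps_per_sm <= 0:
        raise ValueError("max_warps_per_sm must be positive")
    return active_warps_per_sm / max_warps_per_sm


# Hypothetical SM that can hold 64 warps, of which 32 are active.
print(occupancy(32, 64))  # 0.5
```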

Theoretical Occupancy

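Theoretical occupancy is the occupancy that can be reached in principle once the device's compute capability, the thread organization (execution configuration), and the kernel's resource usage are all known.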

Achieved Occupancy

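On every clock cycle, a counter in each warp scheduler records the number of active warps for that cycle. The accumulated total is divided by the number of cycles to obtain the average number of active warps, and the achieved occupancy is computed from that average.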
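The averaging step can be sketched as follows. This is only a model of the arithmetic, not of how the hardware counters are read; the per-cycle samples and the limit of 64 warps per SM are made-up values for illustration.

```python
def achieved_occupancy(active_warps_per_cycle, max_warps_per_sm):
    """Average the per-cycle active warp counts, then normalize by the SM limit."""
    cycles = len(active_warps_per_cycle)
    if cycles == 0:
        raise ValueError("need at least one sampled cycle")
    average_active_warps = sum(active_warps_per_cycle) / cycles
    return average_active_warps / max_warps_per_sm


# Hypothetical trace: the SM starts full and drains as warps finish.
samples = [64, 64, 48, 32, 16, 16, 8, 8]
print(achieved_occupancy(samples, 64))  # 0.5
```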

Warp Issue Efficiency

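Across all clock cycles, the proportion of cycles in which at least one warp is eligible (ready to be issued) compared with the proportion of cycles in which no warp is eligible.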
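A rough sketch of this metric, assuming we already have the number of eligible warps observed on each cycle (the trace below is invented for illustration, and the dictionary keys are hypothetical names):

```python
def warp_issue_efficiency(eligible_warps_per_cycle):
    """Split cycles into those with no eligible warp and those with one or more."""
    total = len(eligible_warps_per_cycle)
    if total == 0:
        raise ValueError("need at least one sampled cycle")
    no_eligible = sum(1 for n in eligible_warps_per_cycle if n == 0)
    return {
        "one_or_more_eligible": (total - no_eligible) / total,
        "no_eligible": no_eligible / total,
    }


# Hypothetical trace of eligible warps per cycle on one scheduler.
print(warp_issue_efficiency([0, 2, 1, 0, 0, 3, 1, 0]))
```

A large `no_eligible` share suggests the scheduler often has nothing to issue, which is the situation in which more active warps may help.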

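Performance Analysis

How occupancy affects performance

  • Occupancy that is too low leaves the scheduler without enough ready warps to choose from, so some instruction latency cannot be hidden.
  • When there are already enough ready warps, or instruction latency is already low, occupancy that is too high reduces the resources available to each thread (for example, registers spill and local memory is used instead), which lowers performance.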

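Possible causes of low theoretical occupancy

  • The number of active blocks per SM has an upper limit. When each block contains too few warps, the block limit is reached long before the warp limit:

    device limited active blocks * warps per block << device limited active warps
  • Threads that use too many registers or too much shared memory reduce the number of active warps that fit on an SM.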
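Both causes above can be illustrated with a simplified calculator. This is a sketch under stated assumptions: the default device limits (64 warps, 32 blocks, 65536 registers, and 98304 bytes of shared memory per SM) are hypothetical values, and the model ignores register and shared memory allocation granularity, so real tools may report different numbers.

```python
def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block, *,
                          warp_size=32, max_warps_per_sm=64, max_blocks_per_sm=32,
                          regs_per_sm=65536, smem_per_sm=98304):
    """Estimate theoretical occupancy from the tightest per-SM block limit."""
    if threads_per_block <= 0:
        raise ValueError("threads_per_block must be positive")
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division

    # Blocks per SM allowed by each resource (assumed limits, no granularity).
    limit_by_warps = max_warps_per_sm // warps_per_block
    regs_per_block = regs_per_thread * warps_per_block * warp_size
    limit_by_regs = regs_per_sm // regs_per_block if regs_per_block else max_blocks_per_sm
    limit_by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks_per_sm

    active_blocks = min(max_blocks_per_sm, limit_by_warps, limit_by_regs, limit_by_smem)
    return active_blocks * warps_per_block / max_warps_per_sm


print(theoretical_occupancy(32, 32, 0))    # 0.5: tiny blocks hit the block limit
print(theoretical_occupancy(256, 128, 0))  # 0.25: register usage limits blocks
print(theoretical_occupancy(256, 32, 0))   # 1.0: no limiter below the warp limit
```

In the first call each block holds a single warp, so 32 blocks supply only 32 of the 64 possible warps. In the second call each block needs 32768 registers, so only two blocks fit on the SM.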

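Possible causes of low achieved occupancy

  • Execution time is unbalanced across the warps of a block, so the number of active warps drops during the tail of the computation. Because resources are allocated per block, the resources of warps that have already finished cannot be reclaimed while other warps in the same block are still running, so no new block can be scheduled onto the SM.
  • Too few blocks are launched. According to the Nsight user manual, theoretical occupancy does not take the number of launched blocks into account, so a small grid can leave the achieved value well below the theoretical one.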
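The effect of launching too few blocks can be sketched with a simple upper bound. The model assumes that blocks are spread evenly across SMs and that all resident blocks run for the whole kernel, which real workloads rarely satisfy; the device with 80 SMs and 64 warps per SM is hypothetical.

```python
def launch_limited_occupancy(grid_blocks, num_sms, active_blocks_per_sm,
                             warps_per_block, max_warps_per_sm):
    """Upper bound on average occupancy given how many blocks the grid provides."""
    # The device cannot hold more blocks than num_sms * active_blocks_per_sm at once.
    resident_blocks = min(grid_blocks, num_sms * active_blocks_per_sm)
    average_blocks_per_sm = resident_blocks / num_sms
    return average_blocks_per_sm * warps_per_block / max_warps_per_sm


# Hypothetical device: 80 SMs, 8 resident blocks per SM, 8 warps per block.
print(launch_limited_occupancy(160, 80, 8, 8, 64))  # 0.25: only 2 blocks per SM
print(launch_limited_occupancy(640, 80, 8, 8, 64))  # 1.0: enough blocks to fill every SM
```

In both calls the theoretical occupancy is 100%, yet the smaller grid cannot exceed 25% on average.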

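Performance Optimization

Low occupancy does not necessarily mean low performance. Check Warp Issue Efficiency first: only when the proportion of cycles with no eligible warp is too high is it worth trying to raise occupancy. Specific approaches include: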

- If the theoretical occupancy is low, try to optimize the execution configuration of the kernel launch, using the Occupancy table to identify which factor(s) are limiting occupancy. If you are register limited, do not rule out experimenting with launch bounds to increase occupancy, even if this results in some register spilling.
- If the achieved occupancy is well below the theoretical occupancy, check the Instruction Statistics experiment for highly unbalanced workloads or tail effects. Potential strategies include splitting the kernel grid in a more fine-grained way, distributing work across the blocks in a more balanced way, and avoiding gathering the final result on a single block, warp, or thread.
- If the Pipe Utilization experiment shows that a particular pipeline is already fully utilized, increasing active warps is unlikely to result in more eligible warps, because all additional active warps will stall trying to access the oversubscribed pipeline. In this case, try to reduce the load on this pipeline, or investigate whether the expected peak performance for the target hardware has already been reached.
