[Repost] The Mali GPU: An Abstract Machine, Part 2 - Tile-based Rendering

Part two, which introduces tile-based rendering. As before, the English original is followed by a Chinese translation.

1. English original

In my previous blog I started defining an abstract machine which can be used to describe the application-visible behaviors of the Mali GPU and driver software. The purpose of this machine is to give developers a mental model of the interesting behaviors beneath the OpenGL ES API, which can in turn be used to explain issues which impact their application’s performance. I will use this model in future blogs in this series to explore some common performance potholes which developers encounter when developing graphics applications.

This blog continues the development of this abstract machine, looking at the tile-based rendering model of the Mali GPU family. I’ll assume you’ve read the first blog on pipelining; if you haven’t, I would suggest reading that first.

The “Traditional” Approach

In a traditional mains-powered desktop GPU architecture — commonly called an immediate mode architecture — the fragment shaders are executed on each primitive, in each draw call, in sequence. Each primitive is rendered to completion before starting the next one, with an algorithm which approximates to:

    foreach( primitive )
        foreach( fragment )
            render fragment

As any triangle in the stream may cover any part of the screen, the working set of data maintained by these renderers is large; typically at least a full-screen size color buffer, depth buffer, and possibly a stencil buffer too. A typical working set for a modern device will be 32 bits-per-pixel (bpp) color, and 32bpp packed depth/stencil. A 1080p display therefore has a working set of roughly 16MB, and a 4k2k TV has a working set of roughly 64MB. Due to their size these working buffers must be stored off-chip in DRAM.
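
The arithmetic behind those figures can be checked with a short sketch. The pixel dimensions used here (1920x1080 and 3840x2160) are assumptions, since the text only names the formats, and the helper `working_set_bytes` is purely illustrative.

```python
def working_set_bytes(width, height, color_bpp=32, depth_stencil_bpp=32):
    """Bytes needed for a full-screen colour buffer plus packed depth/stencil."""
    bits_per_pixel = color_bpp + depth_stencil_bpp
    return width * height * bits_per_pixel // 8


# Assumed pixel dimensions; the article only says "1080p" and "4k2k".
RESOLUTIONS = {"1080p": (1920, 1080), "4k2k": (3840, 2160)}

if __name__ == "__main__":
    for name, (w, h) in RESOLUTIONS.items():
        size = working_set_bytes(w, h)
        print(f"{name}: {size} bytes ({size / 2**20:.1f} MiB)")
```

Both results land slightly under the rounded 16MB and 64MB quoted above, which is close enough for a mental model.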

Every blending, depth testing, and stencil testing operation requires the current value of the data for the current fragment’s pixel coordinate to be fetched from this working set. All fragments shaded will typically touch this working set, so at high resolutions the bandwidth load placed on this memory can be exceptionally high, with multiple read-modify-write operations per fragment, although caching can mitigate this slightly. This need for high bandwidth access in turn drives the need for a wide memory interface with lots of pins, as well as specialized high-frequency memory, both of which result in external memory accesses which are particularly energy intensive.

The Mali Approach

The Mali GPU family takes a very different approach, commonly called tile-based rendering, designed to minimize the amount of power hungry external memory accesses which are needed during rendering. As described in the first blog in this series, Mali uses a distinct two-pass rendering algorithm for each render target. It first executes all of the geometry processing, and then executes all of the fragment processing. During the geometry processing stage, Mali GPUs break up the screen into small 16x16 pixel tiles and construct a list of which rendering primitives are present in each tile. When the GPU fragment shading step runs, each shader core processes one 16x16 pixel tile at a time, rendering it to completion before starting the next one. For tile-based architectures the algorithm equates to:

    foreach( tile )
        foreach( primitive in tile )
            foreach( fragment in primitive in tile )
                render fragment

As a 16x16 tile is only a small fraction of the total screen area it is possible to keep the entire working set (color, depth, and stencil) for a whole tile in a fast RAM which is tightly coupled with the GPU shader core.
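
To make the two passes more concrete, here is a minimal sketch of the binning step. It is a simplification under stated assumptions, not a description of the real hardware tiler: primitives are represented only by hypothetical pixel bounding boxes, and the names `bin_primitives` and `tile_working_set_bytes` are invented for illustration.

```python
from collections import defaultdict

TILE = 16  # tile edge in pixels, as described in the article


def bin_primitives(primitives, width, height):
    """Geometry pass: list, per tile, the primitives whose bounding box touches it.

    Each primitive is (name, x0, y0, x1, y1), an inclusive pixel bounding box.
    Returns (bins, total_tile_count); bins maps (tile_x, tile_y) -> [names].
    """
    tiles_x = (width + TILE - 1) // TILE
    tiles_y = (height + TILE - 1) // TILE
    bins = defaultdict(list)
    for name, x0, y0, x1, y1 in primitives:
        # Clamp the box to the screen; skip primitives that are fully off-screen.
        x0, y0 = max(x0, 0), max(y0, 0)
        x1, y1 = min(x1, width - 1), min(y1, height - 1)
        if x0 > x1 or y0 > y1:
            continue
        for ty in range(y0 // TILE, y1 // TILE + 1):
            for tx in range(x0 // TILE, x1 // TILE + 1):
                bins[(tx, ty)].append(name)
    return bins, tiles_x * tiles_y


def tile_working_set_bytes(bytes_per_pixel=8):
    """On-chip working set for one tile: colour plus packed depth/stencil."""
    return TILE * TILE * bytes_per_pixel


if __name__ == "__main__":
    prims = [("a", 0, 0, 15, 15), ("b", 10, 10, 20, 20)]
    bins, tile_count = bin_primitives(prims, 64, 32)
    # Fragment pass: each tile is rendered to completion from its own list.
    for tile in sorted(bins):
        print(tile, bins[tile])
    print(tile_count, "tiles,", tile_working_set_bytes(), "bytes per tile")
```

At 32bpp color plus 32bpp depth/stencil, a single tile needs only 2KB, which is why it can live in RAM next to the shader core.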

This tile-based approach has a number of advantages. They are mostly transparent to the developer but worth knowing about, in particular when trying to understand bandwidth costs of your content:

All accesses to the working set are local accesses, which is both fast and low power. The power consumed reading or writing to an external DRAM will vary with system design, but it can easily be around 120mW for each 1GByte/s of bandwidth provided. Internal memory accesses are approximately an order of magnitude less energy intensive than this, so you can see that this really does matter.
Blending is both fast and power-efficient, as the destination color data required for many blend equations is readily available.
A tile is sufficiently small that we can actually store enough samples locally in the tile memory to allow 4x, 8x and 16x multisample antialiasing¹. This provides high quality and very low overhead anti-aliasing. Due to the size of the working set involved (4, 8 or 16 times that of a normal single-sampled render target; a massive 1GB of working set data is needed for 16x MSAA for a 4k2k display panel) few immediate mode renderers even offer MSAA as a feature to developers, because the external memory footprint and bandwidth normally make it prohibitively expensive.
Mali only has to write the color data for a single tile back to memory at the end of the tile, at which point we know its final state. We can compare the block’s color with the current data in main memory via a CRC check (a process called Transaction Elimination), skipping the write completely if the tile contents are the same, saving SoC power. My colleague Tom Olson has written a great blog on this technology, complete with a real world example of Transaction Elimination (some game called Angry Birds; you might have heard of it). I’ll let Tom’s blog explain this technology in more detail, but here is a sneak peek of the technology in action (only the “extra pink” tiles were written by the GPU - all of the others were successfully discarded).

We can compress the color data for the tiles which survive Transaction Elimination using a fast, lossless, compression scheme — ARM Frame Buffer Compression (AFBC) — allowing us to lower the bandwidth and power consumed even further. This compression can be applied to offscreen FBO render targets, which can be read back as textures in subsequent rendering passes by the GPU, as well as the main window surface, provided there is an AFBC compatible display controller such as Mali-DP500 in the system.
Most content has a depth and stencil buffer, but doesn’t need to keep their contents once the frame rendering has finished. If developers tell the Mali drivers that depth and stencil buffers do not need to be preserved², ideally via a call to glDiscardFramebufferEXT (OpenGL ES 2.0) or glInvalidateFramebuffer (OpenGL ES 3.0), although it can be inferred by the drivers in some cases, then the depth and stencil content of the tile is never written back to main memory at all. Another big bandwidth and power saving!
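
The write-back decisions in the last few points can be modelled with a small sketch. This is only a model of the idea, with several assumptions: `zlib.crc32` stands in for whatever signature the hardware actually computes, `write_back_tile` is a hypothetical helper, and the byte counts ignore AFBC compression entirely.

```python
import zlib


def write_back_tile(color_bytes, previous_crc,
                    keep_depth_stencil=False, depth_stencil_bytes=b""):
    """Return (bytes written to main memory, new CRC) for one finished tile.

    Colour is written only when its CRC differs from the CRC recorded for the
    same tile in the previous frame (the Transaction Elimination idea).
    Depth/stencil is written only when the application needs it preserved.
    """
    crc = zlib.crc32(color_bytes)
    written = 0
    if crc != previous_crc:
        written += len(color_bytes)
    if keep_depth_stencil:
        written += len(depth_stencil_bytes)
    return written, crc


if __name__ == "__main__":
    tile = bytes(16 * 16 * 4)                 # one 16x16 tile of 32bpp colour
    first, crc = write_back_tile(tile, None)  # no previous frame: must write
    second, _ = write_back_tile(tile, crc)    # unchanged tile: write skipped
    print(first, second)
```

Discarding depth and stencil corresponds to leaving `keep_depth_stencil` at its default, so a static tile costs no external memory traffic at all in this model.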

It is clear from the list above that tile-based rendering carries a number of advantages, in particular giving very significant reductions in the bandwidth and power associated with framebuffer data, as well as being able to provide low-cost anti-aliasing. What is the downside?

The principal additional overhead of any tile-based rendering scheme is the point of hand-over from the vertex shader to the fragment shader. The output of the geometry processing stage, the per-vertex varyings and tiler intermediate state, must be written out to main memory and then re-read by the fragment processing stage. There is therefore a balance to be struck between costing extra bandwidth for the varying data and tiler state, and saving bandwidth for the framebuffer data.
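
A rough way to reason about that balance is to put numbers on both sides. The sketch below is back-of-envelope only: the 120mW per GByte/s figure comes from the text above, but the bytes per vertex, the overdraw factor, the frame rate, and the cache-free immediate mode model are all assumptions chosen for illustration, not measurements.

```python
MW_PER_GBPS = 120.0  # from the article: roughly 120mW per 1GByte/s of DRAM bandwidth


def bandwidth_gbps(bytes_per_frame, fps=60):
    """Convert a per-frame byte count into GByte/s at the given frame rate."""
    return bytes_per_frame * fps / 1e9


def dram_power_mw(bytes_per_frame, fps=60):
    """Estimated external memory power for the given per-frame traffic."""
    return bandwidth_gbps(bytes_per_frame, fps) * MW_PER_GBPS


def geometry_bytes(vertices, bytes_per_vertex=32):
    """Varyings written once by geometry processing, read once by fragment processing.

    bytes_per_vertex is an assumed figure; tiler state is ignored.
    """
    return 2 * vertices * bytes_per_vertex


def framebuffer_bytes_immediate(width, height, overdraw=2.0, bytes_per_pixel=8):
    """Assumed model: one read and one write of colour+depth/stencil per fragment, no caching."""
    return int(2 * width * height * overdraw * bytes_per_pixel)


def framebuffer_bytes_tiled(width, height, color_bytes_per_pixel=4):
    """Assumed model: one colour write per pixel when each tile completes."""
    return width * height * color_bytes_per_pixel


if __name__ == "__main__":
    w, h, verts = 1920, 1080, 300_000  # hypothetical scene
    immediate = framebuffer_bytes_immediate(w, h)
    tiled = framebuffer_bytes_tiled(w, h) + geometry_bytes(verts)
    print(f"immediate: {dram_power_mw(immediate):.0f} mW")
    print(f"tiled:     {dram_power_mw(tiled):.0f} mW")
```

Raising `verts` or `bytes_per_vertex` shifts the balance toward the geometry side, which is the trade-off described above.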

In consumer electronics today there is a significant shift towards higher resolution displays; 1080p is now normal for smartphones, tablets such as the Mali-T604 powered Google Nexus 10 are running at WQXGA (2560x1600), and 4k2k is becoming the new “must have” in the television market. Screen resolution, and hence framebuffer bandwidth, is growing fast. In this area Mali really shines, and does so in a manner which is mostly transparent to the application developer - you get all of these goodies for free with no application changes!

On the geometry side of things, Mali copes well with complexity. Many high-end benchmarks are approaching a million triangles a frame, which is an order of magnitude (or two) more complex than popular gaming applications on the Android app stores. However, as the intermediate geometry data does hit main memory there are some useful tips and tricks which can be applied to fine tune the GPU performance, and get the best out of the system. These are worth an entire blog by themselves, so we’ll cover these at a later point in this series.

Summary

In this blog I have compared and contrasted the desktop-style immediate mode renderer, and the tile-based approach used by Mali, looking in particular at the memory bandwidth implications of both.

Tune in next time and I’ll finish off the definition of the abstract machine, looking at a simple block model of the Mali shader core itself. Once we have that out of the way we can get on with the useful part of the series: putting this model to work and earning a living optimizing your applications running on Mali.

Note: The next blog in this series has now been published: The Mali GPU: An Abstract Machine, Part 3 - The Shader Core

As always, comments and questions are more than welcome.

Pete

Footnotes

¹ Exactly which multisampling options are available depends on the GPU. The recently announced Mali-T760 GPU includes support for up to 16x MSAA.
² The depth and stencil discard is automatic for EGL window surfaces, but for offscreen render targets they may be preserved and reused in a future rendering operation.

2. Chinese translation

在我上一篇博文中，我開始定義一臺抽象機器，用於描述 Mali GPU和驅動程序軟件對應用程序可見的行爲。此機器的用意是爲開發人員提供 OpenGL ES API 下有趣行爲的一個心智模型，而這反過來也可用於解釋影響其應用程序性能的問題。我在本系列後面幾篇博文中繼續使用這一模型，探討開發人員在開發圖形應用程序時常常遇到的一些性能缺口。

這篇博文將繼續開發這臺抽象機器，探討 Mali GPU系列基於區塊的渲染模型。你應該已經閱讀了關於管線化的第一篇博文；如果還沒有，建議你先讀一下。

“傳統”方式

在傳統的主線驅動型桌面 GPU 架構中 — 通常稱爲直接模式架構 — 片段着色器按照順序在每一繪製調用、每一原語上執行。每一原語渲染結束後再開始下一個，其利用類似於如下所示的算法：

    foreach( primitive )
        foreach( fragment )
            render fragment

由於流中的任何三角形可能會覆蓋屏幕的任何部分，由這些渲染器維護的數據工作集將會很大；通常至少包含全屏尺寸顏色緩衝、深度緩衝，還可能包含模板緩衝。現代設備的典型工作集是 32 位/像素 (bpp) 顏色，以及 32 bpp 封裝的深度/模板。因此，1080p 顯示屏擁有一個 16MB 工作集，而 4k2k 電視機則有一個64MB 工作集。由於其大小原因，這些工作緩衝必須存儲在芯片外的 DRAM 中。

每一次混合、深度測試和模板測試運算都需要從這一工作集中獲取當前片段像素座標的數據值。被着色的所有片段通常會接觸到這一工作集，因此在高清顯示中，置於這一內存上的帶寬負載可能會特別高，每一片段也都有多個讀-改-寫運算，儘管緩存可能會稍稍緩減這一問題。這一對高帶寬存取的需求反過來推動了對具備許多針腳的寬內存接口和專用高頻率內存的需求，這兩者都會造成能耗特別密集的外部內存訪問。

Mali 方式

Mali GPU 系列採用非常不同的方式，通常稱爲基於區塊的的渲染，其設計宗旨是竭力減少渲染期間所需的功耗巨大的外部內存訪問。如本系列第一篇博文中所述，Mali 對每一渲染目標使用獨特的兩步驟渲染算法。它首先執行全部的幾何處理，然後執行所有的片段處理。在幾何處理階段中，Mali GPU 將屏幕分割爲微小的 16x16 像素區塊，並對每個區塊中存在的渲染原語構建一份清單。GPU 片段着色步驟開始時，每一着色器核心一次處理一個 16x16 像素區塊，將它渲染完後再開始下一區塊。對於基於區塊的架構，其算法相當於：

    foreach( tile )
        foreach( primitive in tile )
            foreach( fragment in primitive in tile )
                render fragment

由於 16x16 區塊僅僅是總屏幕面積的一小部分，所以有可能將整個區塊的完整工作集（顏色、深度和模板）存放在和 GPU 着色器核心緊密耦合的快速 RAM中。

這種基於區塊的方式有諸多優勢。它們大體上對開發人員透明，但也值得了解，尤其是在嘗試瞭解你內容的帶寬成本時：

對工作集的所有訪問都屬於本地訪問，速度快、功耗低。讀取或寫入外部 DRAM 的功耗因系統設計而異，但對於提供的每 1GB/s 帶寬，它很容易達到大約120mW。與這相比，內部內存訪問的功耗要大約少一個數量級，所以你會發現這真的大有關係。
混合不僅速度快，而且功耗低，因爲許多混合方式需要的目標顏色數據都隨時可用。

區塊足夠小，我們實際上可以在區塊內存中本地存儲足夠數量的樣本，實現 4 倍、8 倍和 16 倍多采樣抗鋸齒¹。這可提供質量高、開銷很低的抗鋸齒。由於涉及的工作集大小（一般單一採樣渲染目標的 4、8 或 16 倍；4k2k 顯示面板的 16x MSAA需要巨大的 1GB 工作集數據），少數直接模式渲染器甚至將 MSAA作爲一項功能提供給開發人員，因爲外部內存大小和帶寬通常導致其成本過於高昂。
Mali 僅僅需要將單一區塊的顏色數據寫回到區塊末尾的內存，此時我們便能知道其最終狀態。我們可以通過 CRC 檢查將塊的顏色與主內存中的當前數據進行比較 — 這一過程叫做“事務消除”— 如果區塊內容相同，則可完全跳過寫出，從而節省了 SoC 功耗。我的同事 Tom Olson 針對這一技術寫了一篇優秀的博文，文中還提供了“事務消除”的一個現實世界示例（某個名叫“憤怒的小鳥”的遊戲；你或許聽說過）。有關這一技術的詳細信息還是由 Tom 的博文來介紹；不過，這兒也稍稍瞭解一下該技術的運用（僅“多出的粉色”區塊由 GPU 寫入 - 其他全被成功丟棄）。

我們可以採用快速的無損壓縮方案 — ARM 幀緩衝壓縮 (AFBC) — ，對逃過事務消除的區塊的顏色數據進行壓縮，從而進一步降低帶寬和功耗。這一壓縮可以應用到離屏 FBO 渲染目標，後者可在隨後的渲染步驟中由 GPU 作爲紋理讀回；也可以應用到主窗口表面，只要系統中存在兼容 AFBC 的顯示控制器，如Mali-DP500。
大多數內容擁有深度緩衝和模板緩衝，但幀渲染結束後就不必再保留其內容。如果開發人員告訴 Mali 驅動程序不需要保留深度緩衝和模板緩衝²— 理想方式是通過調用 glDiscardFramebufferEXT (OpenGL ES 2.0) 或 glInvalidateFramebuffer (OpenGLES 3.0)，雖然在某些情形中可由驅動程序推斷 — 那麼區塊的深度內容和模板內容也就徹底不用寫回到主內存中。我們又大幅節省了帶寬和功耗！

上表中可以清晰地看出，基於區塊的渲染具有諸多優勢，尤其是可以大幅降低與幀緩衝數據相關的帶寬和功耗，而且還能夠提供低成本的抗鋸齒功能。那麼，有些什麼劣勢呢？

任何基於區塊的渲染方案的主要額外開銷是從頂點着色器到片段着色器的交接點。幾何處理階段的輸出、各頂點可變數和區塊中間狀態必須寫出到主內存，再由片段處理階段重新讀取。因此，必須要在可變數據和區塊狀態消耗的額外帶寬與幀緩衝數據節省的帶寬之間取得平衡。

當今的現代消費類電子設備正大步向更高分辨率顯示屏邁進；1080p 現在已是智能手機的常態，配備 Mali-T604 的 Google Nexus 10 等平板電腦以 WQXGA (2560x1600) 分辨率運行，而 4k2k 正逐漸成爲電視機市場上新的“不二之選”。屏幕分辨率以及幀緩衝帶寬正快速發展。在這一方面，Mali 確實表現出衆，而且以對應用程序開發人員基本透明的方式實現 - 無需任何代價，就能獲得所有這些好處，而且還不用更改應用程序！

在幾何處理方面，Mali 也能處理好複雜度。許多高端基準測試正在接近每幀百萬個三角形，其複雜度比 Android 應用商店中的熱門遊戲應用程序高出一個（或兩個）數量級。然而，由於中間幾何數據的確到達主內存，所以可以應用一些有用的技巧和訣竅，來優化 GPU 性能並充分發揮系統能力。這些技巧值得通過一篇博文來細談，所以我們會在這一系列的後續博文中再予以介紹。

小結

在這篇博文中，我比較了桌面型直接模式渲染器與 Mali 所用的基於區塊方式的異同，尤其探討了兩種方式對內存帶寬的影響。
敬請期待下一篇博文。我將通過介紹 Mali 着色器核心本身的簡單塊模型，完成對這一抽象機器的定義。理解這部分內容後，我們就能繼續介紹系列博文的其他有用部分：將這一模型應用到實踐中，使其發揮實際作用，優化你在 Mali 上運行的應用程序。

注意：本系列的下一篇博文已經發布： Mali GPU: 抽象機器，第3部分 – 着色器核心

與往常一樣，歡迎提出任何意見和問題。

Pete

腳註

具體有哪些多采樣選項可用要視 GPU 而定。最近推出的 Mali-T760 GPU 最高支持 16 倍 MSAA。
對 EGL 窗口表面而言，深度丟棄與模板丟棄是自動執行的；但對於離屏渲染對象，它們可能會予以保留，供將來的渲染運算重新利用。
