A Brief Summary of the Texture Cache

The texture cache is a read-only cache that stores image (texture) data.

When a texture is read in normal UV order, the texture cache has a high hit rate.

The texture cache sits close to the shader processors, so it offers high throughput and low latency.

As the figure above shows, a GPU contains many shader cores.

https://computergraphics.stackexchange.com/questions/355/how-does-texture-cache-work-considering-multiple-shader-units

texture units may be grouped together with one texture unit per shader core, or one texture unit shared among two or three shader cores, depending on the GPU.

The whole chip shares a single L2 cache, but the different units will have individual L1 caches.

There is either one texture unit per shader core, or one texture unit shared among two or three shader cores.

The whole chip shares a single L2 cache, while each unit has its own L1 cache.

Texture units operate independently and asynchronously from shader cores. When a shader performs a texture read, it sends a request to the texture unit across a little bus between them; the shader can then continue executing if possible, or it may get suspended and allow other shader threads to run while it waits for the texture read to finish.

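Texture units operate independently of, and asynchronously from, the shader cores. When a shader performs a texture read, it sends a request to the texture unit over the bus between them. The shader may then either continue executing or be suspended while it waits for the result.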

The texture unit batches up a bunch of requests and performs the addressing math on them—selecting mip levels and anisotropy, converting UVs to texel coordinates, applying clamp/wrap modes, etc. Once it knows which texels it needs, it reads them through the cache hierarchy, the same way that memory reads work on a CPU (look in L1 first, if not there then L2, then DRAM). If many pending texture requests all want the same or nearby texels (as they often do), then you get a lot of efficiency here, as you can satisfy many pending requests with only a few memory transactions. All these operations are pipelined, so while the texture unit is waiting for memory on one batch it can be doing the addressing math for another batch of requests, and so on.

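The texture unit processes requests in batches and performs the addressing math (selecting mip levels and anisotropy, converting UVs to texel coordinates, applying clamp/wrap modes, and so on) to determine which texels it needs. It then reads them through the cache hierarchy: if the L1 cache misses, the request goes across the crossbar to the L2 cache, and if that also misses, it goes to DRAM.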
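
The UV-to-texel part of the addressing math can be illustrated with a minimal sketch. It assumes a single mip level, nearest-texel addressing, and only two addressing modes. The names wrap_coord and uv_to_texel are made up for this illustration and do not correspond to any real GPU API.

```python
import math

def wrap_coord(i, size, mode):
    """Apply an addressing mode to an integer texel index (illustrative only)."""
    if mode == "wrap":
        return i % size                     # repeat the texture
    if mode == "clamp":
        return max(0, min(size - 1, i))     # clamp to the edge texel
    raise ValueError(f"unknown mode: {mode}")

def uv_to_texel(u, v, width, height, mode="wrap"):
    """Convert normalized UVs to the integer texel containing the sample point."""
    x = math.floor(u * width)
    y = math.floor(v * height)
    return wrap_coord(x, width, mode), wrap_coord(y, height, mode)

print(uv_to_texel(0.5, 0.5, 256, 256))          # (128, 128)
print(uv_to_texel(1.25, -0.25, 8, 8, "wrap"))   # (2, 6)
print(uv_to_texel(1.25, -0.25, 8, 8, "clamp"))  # (7, 0)
```

Real hardware also selects a mip level and may fetch several texels per sample for filtering. This sketch covers only the coordinate conversion and the wrap/clamp step.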

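If many pending texture requests want the same or nearby texels, efficiency improves here, because a few memory transactions can satisfy many requests.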
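
To see why nearby texels are cheap, one can count how many distinct cache lines a batch of requests touches. The sketch below assumes a row-major texture, 4 bytes per texel, and 64-byte cache lines. These numbers and the function name cache_lines_touched are assumptions chosen for illustration, not values taken from any particular GPU.

```python
def cache_lines_touched(texels, width, bytes_per_texel=4, line_bytes=64):
    """Count distinct cache lines needed to serve a batch of (x, y) texel reads.

    Assumes a row-major layout; the sizes are illustrative defaults.
    """
    lines = set()
    for x, y in texels:
        addr = (y * width + x) * bytes_per_texel
        lines.add(addr // line_bytes)
    return len(lines)

WIDTH = 256
nearby = [(x, 0) for x in range(16)]          # 16 adjacent texels
scattered = [(0, y * 16) for y in range(16)]  # 16 texels far apart

print(cache_lines_touched(nearby, WIDTH))     # 1
print(cache_lines_touched(scattered, WIDTH))  # 16
```

Under these assumptions, sixteen adjacent requests are served by one memory transaction, while sixteen scattered requests need sixteen.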

Once the data comes back, the texture unit will decode compressed formats, do sRGB conversion and filtering as necessary, then return the results back to the shader core.

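Once the data comes back, the texture unit decodes compressed formats, performs sRGB conversion and filtering as needed, and returns the result to the shader core.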
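
As an illustration of the sRGB conversion step, the following sketch applies the standard sRGB-to-linear transfer function to one channel value in the range 0 to 1. The source does not say how any particular GPU implements this, so treat it as the mathematical definition rather than a description of the hardware.

```python
def srgb_to_linear(c):
    """Standard sRGB decoding for one channel value in [0, 1]."""
    if c <= 0.04045:
        return c / 12.92                    # linear segment near black
    return ((c + 0.055) / 1.055) ** 2.4     # power-law segment

for value in (0.0, 0.5, 1.0):
    print(value, round(srgb_to_linear(value), 4))
```

Filtering is meant to operate on linear values, which is why this conversion is done before the filtered result is returned to the shader.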

When a cache miss occurs and a new cache line must be read in from off-chip memory, a large latency is incurred.

On a cache miss, a new cache line must be read in from off-chip memory, which incurs a large latency.

Computing the values that go into a texture map using the GPU has become commonplace and is called "render to texture". All on-chip memory references to that texture must be invalidated and then the GPU can write to the texture. A simpler brute force method, which flushes all caches in the GPU, can also be used.

The texture cache is read-only, so when a render-to-texture operation occurs, all on-chip memory references to that texture must be invalidated, or else all caches are flushed.

the pixel grid is divided into groups of 2 × 2 pixels, called quads.

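Rasterization works in units of quads. If a triangle is so small that it covers only one pixel, the other three pixels of the quad are still processed even though they are empty, which is inefficient.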
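
The cost of small triangles can be put into numbers with a small sketch. It assumes that every 2 × 2 quad touched by a triangle is shaded in full, and the helper name quad_utilization is hypothetical.

```python
def quad_utilization(covered_pixels):
    """Fraction of shaded pixels that are actually covered by the triangle.

    covered_pixels is a set of (x, y) pixel coordinates; each touched
    2x2 quad is assumed to cost four pixel invocations.
    """
    if not covered_pixels:
        return 0.0
    quads = {(x // 2, y // 2) for x, y in covered_pixels}
    return len(covered_pixels) / (4 * len(quads))

print(quad_utilization({(5, 5)}))                          # 0.25
print(quad_utilization({(0, 0), (1, 0), (0, 1), (1, 1)}))  # 1.0
print(quad_utilization({(1, 1), (2, 2)}))                  # 0.25
```

A one-pixel triangle uses a quarter of the work it pays for, and two covered pixels that fall in different quads are no better.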

Traditional rasterization has two problems:

The order of texels accessed in a texture map can have any orientation with respect to the rasterization order. If the texels in a texture map are stored in memory in a simple linear row major ordering, and the texture is mapped horizontally, then cache lines will store long horizontal lines resulting in a high hit rate. But if the texture is mapped at a 90 degrees rotation, then every pixel will miss and require a new cacheline to be loaded.

https://computergraphics.stackexchange.com/questions/357/is-using-many-texture-maps-bad-for-caching

Although I suspect most textures these days will be stored in either a tiled or Morton-like (aka Twiddled/Swizzled) order (or even a combination of both), some textures might still be in scan-line order, which means that rotation of the texture is likely to lead to a significant number of cache misses/page breaks. Unfortunately, I don't really know how to spot if a particular format is arranged in such a way.

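1. The order in which texels are accessed is determined by the rasterization order. If the orientation in which the texture is stored (and therefore held in cache lines) differs from the rasterization direction, the hit rate drops and new cache lines must be loaded.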

For very large triangles a texture cache could fill up as pixels are generated across a horizontal span and on returning to the next line find that the texels from the previous line have been evicted from the cache.

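2. If a triangle is very large, the texture cache fills up while one horizontal span is being drawn, so by the time the next line is drawn the texels from the previous line have already been evicted from the cache.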

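Tiling addresses both problems: storing the texture in tiles removes the orientation dependency, and tiled rasterization keeps the texels in use within the cache.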

To avoid this orientation dependency, textures are stored in memory in a tiled fashion. Higher cache hit rates are achieved when the tile size is equal to the cache line size, for cache sizes of 128KB and 256KB and tile sizes of 8×8 and 16×16.
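
The effect of tiled storage on orientation dependency can be sketched by comparing two address functions. The example assumes 4-byte texels, 64-byte cache lines, and 4 × 4 texel tiles, so that one tile fills exactly one cache line. These sizes and all function names are assumptions for illustration and are smaller than the tile sizes quoted above.

```python
def row_major_addr(x, y, width, bpt=4):
    """Byte address of texel (x, y) in a simple row-major layout."""
    return (y * width + x) * bpt

def tiled_addr(x, y, width, tile=4, bpt=4):
    """Byte address of texel (x, y) when texels are stored tile by tile."""
    tiles_per_row = width // tile
    tile_index = (y // tile) * tiles_per_row + (x // tile)
    within_tile = (y % tile) * tile + (x % tile)
    return (tile_index * tile * tile + within_tile) * bpt

def lines_touched(coords, addr_fn, width, line_bytes=64):
    """Number of distinct cache lines touched by a walk over texels."""
    return len({addr_fn(x, y, width) // line_bytes for x, y in coords})

WIDTH = 256
horizontal = [(x, 0) for x in range(64)]  # walk along a row
vertical = [(0, y) for y in range(64)]    # same walk rotated 90 degrees

print(lines_touched(horizontal, row_major_addr, WIDTH))  # 4
print(lines_touched(vertical, row_major_addr, WIDTH))    # 64
print(lines_touched(horizontal, tiled_addr, WIDTH))      # 16
print(lines_touched(vertical, tiled_addr, WIDTH))        # 16
```

With row-major storage the cost of the walk depends heavily on its direction. With tiled storage both directions cost the same, and each fetched line also covers neighbouring rows that a two-dimensional access pattern is likely to need next.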

Tiled rasterization divides the screen into equally sized rectangles, typically powers of two, and completes the rasterization within each tile before moving to the next tile.

The screen is divided into tiles, and rasterization is completed one tile at a time.
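
The benefit for large triangles can be sketched with a small LRU cache simulation. It assumes a screen that maps 1:1 onto the texture, cache lines that each hold a 4 × 4 block of texels, and a deliberately tiny cache of 16 lines. All of these numbers, and the function names, are assumptions chosen to make the effect visible.

```python
from collections import OrderedDict

def count_misses(pixels, capacity, block=4):
    """Count cache misses for a pixel visiting order using an LRU cache.

    Each cache line is assumed to hold one block x block group of texels.
    """
    cache = OrderedDict()
    misses = 0
    for x, y in pixels:
        line = (x // block, y // block)
        if line in cache:
            cache.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return misses

def scanline_order(width, height):
    """Visit pixels row by row across the full width."""
    return [(x, y) for y in range(height) for x in range(width)]

def tiled_order(width, height, tile=16):
    """Visit pixels tile by tile, finishing each tile before the next."""
    return [(tx + x, ty + y)
            for ty in range(0, height, tile)
            for tx in range(0, width, tile)
            for y in range(tile)
            for x in range(tile)]

print(count_misses(scanline_order(256, 16), capacity=16))  # 1024
print(count_misses(tiled_order(256, 16), capacity=16))     # 256
```

In scanline order every row evicts the lines that the next row would have reused, so each line is fetched four times. In tiled order each line is fetched once, which is the minimum for this region.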

The texture unit requires 4 texels in parallel in order to perform one bilinear operation, so most texture caches are designed to feed these values in parallel in order to maintain full rate for bilinear filtering.

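Because the texture unit needs 4 texels for one bilinear filtering operation, most texture caches are designed to deliver these texels in parallel so that bilinear filtering runs at full rate.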
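
The four texel reads behind one bilinear sample can be seen in a minimal software sketch. It assumes a single-channel texture stored as a list of rows, texel centers at half-integer positions, and clamp-to-edge addressing. The function name bilinear_sample is made up for this example.

```python
import math

def bilinear_sample(texture, u, v):
    """Bilinearly filter a single-channel texture at normalized (u, v)."""
    height = len(texture)
    width = len(texture[0])
    # Texel centers sit at half-integer coordinates.
    x = u * width - 0.5
    y = v * height - 0.5
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0

    def fetch(xi, yi):
        # Clamp-to-edge addressing (an assumption of this sketch).
        xi = max(0, min(width - 1, xi))
        yi = max(0, min(height - 1, yi))
        return texture[yi][xi]

    # The four texels that the cache must deliver for one sample.
    t00, t10 = fetch(x0, y0), fetch(x0 + 1, y0)
    t01, t11 = fetch(x0, y0 + 1), fetch(x0 + 1, y0 + 1)

    top = t00 + (t10 - t00) * fx
    bottom = t01 + (t11 - t01) * fx
    return top + (bottom - top) * fy

tex = [[0.0, 1.0],
       [2.0, 3.0]]
print(bilinear_sample(tex, 0.5, 0.5))    # 1.5
print(bilinear_sample(tex, 0.25, 0.25))  # 0.0
```

Every sample needs all four fetches before the blend can run, which is why the cache has to supply them in the same cycle to keep the filter busy.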

Most GPUs have multiple texture filtering units running in parallel, and the texture cache must supply these with texels. To do this the texture cache uses multiple read ports, or else replicates the data in the cache.

Texture compression reduces texture bandwidth. Decompression happens between reading the compressed texture data and feeding the texture filter.

texture accesses that hit in the cache must wait in the FIFO behind others that miss. A FIFO is used because the graphics pipeline ensures that triangle submission order is maintained

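Because requests flow through a first-in, first-out queue, accesses that hit in the cache must still wait behind earlier accesses that missed.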
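
The head-of-line blocking this describes can be sketched with a toy timing model. It assumes one request is issued per cycle and that results must be returned in issue order. The latencies and the function name in_order_return_times are invented for illustration.

```python
def in_order_return_times(latencies):
    """Cycle at which each request's result is returned, in issue order.

    Request i is issued at cycle i and its data is ready at i + latency,
    but it cannot be returned before the request ahead of it.
    """
    times = []
    previous = 0
    for issue_cycle, latency in enumerate(latencies):
        ready = issue_cycle + latency
        previous = max(ready, previous)
        times.append(previous)
    return times

# One miss (100 cycles) followed by two hits (1 cycle each).
print(in_order_return_times([100, 1, 1]))  # [100, 100, 100]
# The same hits with no miss in front of them.
print(in_order_return_times([1, 1, 1]))    # [1, 2, 3]
```

The two hits are ready almost immediately, but in this model they are held until the miss in front of them completes.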

Optimizing for the Texture Cache

The easiest way to break the texture cache is to do lots of dependent texture reads and what I mean by dependent texture reads is to generate a texture coordinate in the fragment shader and then go fetch it instead of the interpolated texture coordinate.

This is going to start fetching texels from areas that are very different in your actual texture map, which is going to cause the cache to be thrashed often.

1. Avoid dependent texture reads, that is, reading a texture with UVs other than those interpolated from the vertex shader

Such reads defeat the texture cache and cause it to thrash: accesses rarely hit, and cache lines are constantly evicted and refilled.

2. Do not make textures too large

An oversized texture overloads the texture cache, so accesses fail to hit.

3. Use the smallest texture format that suffices

The reason is the same as above.

4. Tiled rasterization

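This helps accesses hit the cache during rasterization, and avoids the cache misses/page breaks caused by overly large triangles and by a mismatch between the rasterization direction and the texture's storage orientation.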

5. Use mipmaps

https://blogs.msdn.microsoft.com/shawnhar/2009/09/14/texture-filtering-mipmaps/

Remember our example of a tiled 256x256 texture, where each repeat is being scaled down to 1x1 (a common situation in things like terrain rendering). Without mipmaps, every destination pixel will sample a radically different location in the source texture, so the GPU must jump around fetching colors from different areas of memory. GPUs typically have very small texture caches, relying on the fact that textures tend to be accessed sequentially, so this access pattern will thrash the cache and can bring even a high end card to its knees. But with mipmaps, the GPU can simply load a small mip level which will easily fit in the cache, and can then render many destination pixels without having to go back to main memory.
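
The 256x256 example in the quote can be worked through with a simplified level-selection sketch. It assumes square power-of-two textures and picks the level by taking the floor of log2 of the number of texels per pixel along one axis. Real GPUs compute the level of detail from screen-space derivatives and may blend between levels, so the helper names and the rounding rule here are assumptions.

```python
import math

def mip_level(texels_per_pixel, num_levels):
    """Simplified mip selection from the minification ratio along one axis."""
    if texels_per_pixel <= 1.0:
        return 0                               # magnified or 1:1: base level
    level = int(math.floor(math.log2(texels_per_pixel)))
    return min(level, num_levels - 1)          # cannot go past the last level

def mip_size(base_size, level):
    """Edge length in texels of a given mip level of a square texture."""
    return max(1, base_size >> level)

BASE = 256
NUM_LEVELS = 9  # 256, 128, ..., 1

level = mip_level(256, NUM_LEVELS)       # each repeat scaled down to 1x1
print(level, mip_size(BASE, level))      # 8 1
print(BASE * BASE * 4)                   # 262144 bytes at level 0 (4 bytes per texel)
print(mip_size(BASE, level) ** 2 * 4)    # 4 bytes at level 8
```

Under these assumptions the level that is actually sampled shrinks from 256 KiB to a few bytes, which is why it fits in even a very small cache.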

6. Compress textures

This saves bandwidth.

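In addition, because the cache stores compressed texels, compression increases the number of texels the texture cache can hold: decompression happens in the texture unit, after the data leaves the texture cache.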

Specific compression formats are cheap to decompress on the hardware that supports them.

https://computergraphics.stackexchange.com/questions/357/is-using-many-texture-maps-bad-for-caching

Perhaps a bit of an obvious option, but if you can use texture compression (e.g. DXTn|ETC*|PVRTC*|etc) targeting from 8bpp(Bits Per Pixel) down to, say, 2bpp, you can greatly increase the effectiveness of the memory bandwidth/cache by factors of 4x through to 16x. Now I can't speak for all GPUs, but some texture compression schemes (e.g. those listed above) are so simple to decode in hardware, that the data could stay compressed throughout the entire cache hierarchy and only be decompressed in the texture unit, thus effectively multiplying the size of those caches.
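
The 4x to 16x figures follow from simple arithmetic when compared against an uncompressed 32 bits per texel. The sketch below assumes a hypothetical 16 KiB cache that holds data in its compressed form. The cache size is an assumption for illustration only.

```python
def texels_in_cache(cache_bytes, bits_per_texel):
    """How many texels fit in a cache that stores data at the given bit rate."""
    return cache_bytes * 8 // bits_per_texel

CACHE_BYTES = 16 * 1024  # hypothetical cache size

uncompressed = texels_in_cache(CACHE_BYTES, 32)
for bpp in (32, 8, 4, 2):
    texels = texels_in_cache(CACHE_BYTES, bpp)
    print(bpp, texels, texels // uncompressed)
# 32 4096 1
# 8 16384 4
# 4 32768 8
# 2 65536 16
```

The same ratios apply to memory bandwidth, since each transaction carries proportionally more texels.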

For details on compression formats:

https://docs.unity3d.com/550/Documentation/Manual/class-TextureImporterOverride.html

7. Use trilinear filtering and anisotropic filtering sparingly, since both read more texels per sample than bilinear filtering

----by wolf96 2019/2/22

References:

1. Texture Caches

http://fileadmin.cs.lth.se/cs/Personal/Michael_Doggett/pubs/doggett12-tc.pdf

2. Optimising the Graphics Pipeline

https://www.nvidia.com/docs/IO/10878/ChinaJoy2004_OptimizationAndTools.pdf

3. https://computergraphics.stackexchange.com/questions/355/how-does-texture-cache-work-considering-multiple-shader-units

4. https://blogs.msdn.microsoft.com/shawnhar/2009/09/14/texture-filtering-mipmaps/

5. https://computergraphics.stackexchange.com/questions/357/is-using-many-texture-maps-bad-for-caching