Performance Optimizations (Direct3D 9)

General Performance Tips

1. Clear only when you must.
2. Minimize state changes and group the remaining state changes.
3. Use smaller textures, if you can do so.
4. Draw objects in your scene from front to back.
5. Use triangle strips instead of lists and fans. For optimal vertex cache performance, arrange strips to reuse triangle vertices sooner, rather than later.
6. Gracefully degrade special effects that require a disproportionate share of system resources.
7. Constantly test your application's performance.
8. Minimize vertex buffer switches.
9. Use static vertex buffers where possible.
10. Use one large static vertex buffer per FVF for static objects, rather than one per object.
11. If your application needs random access into the vertex buffer in AGP memory, choose a vertex format size that is a multiple of 32 bytes. Otherwise, select the smallest appropriate format.
12. Draw using indexed primitives. This can allow for more efficient vertex caching within hardware.
13. If the depth buffer format contains a stencil channel, always clear the depth and stencil channels at the same time.
14. Combine the shader instruction and the data output where possible. For example:
// Rather than doing a multiply and add, and then output the data with
// two instructions:
mad r2, r1, v0, c0
mov oD0, r2
// Combine both in a single instruction, because this eliminates an
// additional register copy.
mad oD0, r1, v0, c0
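
Tip 11 can be checked at compile time. The following is a minimal sketch, assuming a hypothetical vertex layout (`Vertex32`) and helper (`PadStrideTo32`) that are not part of the original text, and assuming 4-byte floats.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical vertex layout: position, normal, one texture coordinate set.
// 3 + 3 + 2 floats = 8 floats = 32 bytes when float is 4 bytes.
struct Vertex32
{
    float x, y, z;    // position
    float nx, ny, nz; // normal
    float u, v;       // texture coordinates
};

// Fail the build if the layout drifts away from a multiple of 32 bytes.
static_assert(sizeof(Vertex32) % 32 == 0,
              "vertex stride should be a multiple of 32 bytes");

// Rounds a stride in bytes up to the next multiple of 32.
constexpr std::size_t PadStrideTo32(std::size_t stride)
{
    return (stride + 31) / 32 * 32;
}
```

For example, a 36-byte layout would be padded to 64 bytes. Whether that padding pays off depends on whether the buffer really is accessed randomly; otherwise the smaller format is usually the better choice, as the tip says.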

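General Tips
1 Clear only when necessary.
2 Minimize state changes, and group the state changes that remain.
3 Keep texture sizes as small as possible.
4 Render the objects in the scene from front to back, so that objects and pixels that do not need to be drawn are rejected as early as possible.
5 Use triangle strips instead of triangle lists and fans. To make better use of the vertex cache, arrange strips so that vertices are reused as soon as possible.
6 Scale back special effects gradually according to the system resources they consume.
7 Test the application's performance regularly. This makes it easier to find what caused a sudden change in performance.
8 Minimize vertex buffer switches.
9 Use static vertex buffers where possible.
10 For static objects, use one large static vertex buffer per FVF to hold the vertex data of many objects, rather than one vertex buffer per object. This also reduces vertex buffer switches.
11 If the application needs random access to a vertex buffer in AGP memory, the vertex format size should preferably be a multiple of 32 bytes. Otherwise, choose the smallest suitable format. 32 bytes is 8 floats, or two 4-component float vectors.
12 Render with indexed primitives, which makes better use of the vertex cache.
13 If the depth buffer format includes a stencil channel, always clear both together.
14 Combine the shader instruction that computes a result with the one that outputs it, as in the mad example above.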

Databases and Culling
Building a reliable database of the objects in your world is key to excellent performance in Direct3D. It is more important than improvements to rasterization or hardware.
You should maintain the lowest polygon count you can possibly manage. Design for a low polygon count by building low-polygon models from the start. Add polygons if you can do so without sacrificing performance later in the development process. Remember, the fastest polygons are the ones you don't draw.

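When building the scene's object data, start with the lowest-detail models and move to higher-detail models only while performance holds. Keep a close eye on the total number of triangles rendered.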

Batching Primitives
To get the best rendering performance during execution, try to work with primitives in batches and keep the number of render-state changes as low as possible. For example, if you have an object with two textures, group the triangles that use the first texture and follow them with the necessary render state to change the texture. Then group all the triangles that use the second texture. The simplest hardware support for Direct3D is called with batches of render states and batches of primitives through the hardware abstraction layer (HAL). The more effectively the instructions are batched, the fewer HAL calls are performed during execution.
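
The grouping described above can be sketched on the CPU side before any draw call is issued. This is a minimal sketch, not Direct3D code: `DrawItem`, `CountTextureChanges`, and `SortByTexture` are hypothetical names introduced for illustration, and texture ids are assumed to be non-negative integers.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical draw record: which texture a run of triangles uses.
struct DrawItem
{
    int textureId;
    int firstTriangle;
    int triangleCount;
};

// Counts how many times the texture would have to be set if the items
// were submitted in the given order.
int CountTextureChanges(const std::vector<DrawItem>& items)
{
    int changes = 0;
    int current = -1; // no texture bound yet
    for (const DrawItem& item : items)
    {
        assert(item.textureId >= 0); // assumption of this sketch
        if (item.textureId != current)
        {
            current = item.textureId;
            ++changes;
        }
    }
    return changes;
}

// Groups items by texture. The sort is stable, so the submission order
// within one texture is preserved.
void SortByTexture(std::vector<DrawItem>& items)
{
    std::stable_sort(items.begin(), items.end(),
        [](const DrawItem& a, const DrawItem& b)
        {
            return a.textureId < b.textureId;
        });
}
```

With items that alternate between two textures, sorting first reduces the number of texture changes from one per item to one per texture.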

Render in batches. For example, render objects that share a texture together.

Lighting Tips
Because lights add a per-vertex cost to each rendered frame, you can improve performance significantly by being careful about how you use them in your application. Most of the following tips derive from the maxim, "the fastest code is code that is never called."

1. Use as few light sources as possible. To increase the overall lighting level, for example, use the ambient light instead of adding a new light source.
2. Directional lights are more efficient than point lights or spotlights. For directional lights, the direction to the light is fixed and doesn't need to be calculated on a per-vertex basis.
3. Spotlights can be more efficient than point lights, because the area outside the cone of light is calculated quickly. Whether spotlights are more efficient or not depends on how much of your scene is lit by the spotlight.
4. Use the range parameter to limit your lights to only the parts of the scene you need to illuminate. All the light types exit fairly early when they are out of range.
5. Specular highlights almost double the cost of a light. Use them only when you must. Set the D3DRS_SPECULARENABLE render state to 0, the default value, whenever possible. When defining materials, you must set the specular power value to zero to turn off specular highlights for that material; just setting the specular color to 0,0,0 is not enough.
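
To illustrate why a range limit is cheap (tip 4), here is a minimal CPU-side sketch of a range early-out. It is not the Direct3D fixed-function lighting formula: the `PointLight` type, the `LightContribution` function, and the linear falloff are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Hypothetical point light with a hard range and linear falloff.
struct PointLight
{
    Vec3 position;
    float range;
    float intensity;
};

// Returns the light's contribution at a vertex. Vertices beyond the
// range return early, before the square root and falloff are computed.
float LightContribution(const PointLight& light, const Vec3& vertex)
{
    assert(light.range > 0.0f);
    const float dx = vertex.x - light.position.x;
    const float dy = vertex.y - light.position.y;
    const float dz = vertex.z - light.position.z;
    const float distSq = dx * dx + dy * dy + dz * dz;
    if (distSq > light.range * light.range)
        return 0.0f; // out of range: nothing else to compute

    const float dist = std::sqrt(distSq);
    return light.intensity * (1.0f - dist / light.range);
}
```

The smaller the range, the more vertices take the early return, which is the effect the tip relies on.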

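Because lights add a per-vertex cost to every rendered frame, using them carefully can improve performance significantly. Remember: "the fastest code is code that is never called."
1. Use as few light sources as possible. To raise the overall lighting level, for example, use ambient light instead of adding a new light source.
2. Directional lights are the most efficient.
3. Spotlights can be more efficient than point lights.
4. Use the range parameter so that only the parts of the scene within a light's range take part in the lighting calculation.
5. Specular highlights nearly double the cost of a light. Setting D3DRS_SPECULARENABLE to 0 turns specular highlights off. Setting only the material's specular color to 0 does not remove the cost of the specular calculation.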

Texture Size
Texture-mapping performance is heavily dependent on the speed of memory. There are a number of ways to maximize the cache performance of your application's textures.
1. Keep the textures small. The smaller the textures are, the better chance they have of being maintained in the main CPU's secondary cache.
2. Do not change the textures on a per-primitive basis. Try to keep polygons grouped in order of the textures they use.
3. Use square textures whenever possible. Textures whose dimensions are 256x256 are the fastest. If your application uses four 128x128 textures, for example, try to ensure that they use the same palette and place them all into one 256x256 texture. This technique also reduces the amount of texture swapping. Of course, you should not use 256x256 textures unless your application requires that much texturing because, as mentioned, textures should be kept as small as possible.
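
Packing four 128x128 textures into one 256x256 texture means remapping texture coordinates. The following is a minimal sketch, assuming a 2x2 row-major tile layout; the `UV` type and `AtlasUV` function are hypothetical names introduced for illustration.

```cpp
#include <cassert>

struct UV { float u, v; };

// Maps a coordinate in one 128x128 tile (0..1 in each axis) to the
// 256x256 atlas. Tiles are numbered 0..3 in row-major order:
//   0 1
//   2 3
UV AtlasUV(int tileIndex, UV local)
{
    assert(tileIndex >= 0 && tileIndex < 4);
    const float tileScale = 0.5f; // each tile covers half the atlas per axis
    const int col = tileIndex % 2;
    const int row = tileIndex / 2;
    return UV{ (col + local.u) * tileScale, (row + local.v) * tileScale };
}
```

For example, the center of tile 2 lands at (0.25, 0.75) in the atlas. In practice, texture filtering can sample across tile borders, so some padding between tiles may be needed.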
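Texture-mapping performance depends heavily on memory speed. There are several ways to maximize the cache performance of an application's textures.
1. Keep textures small. The smaller the texture, the more likely it is to be found in the CPU's secondary cache.
2. Render primitives grouped by texture wherever possible.
3. Use square textures wherever possible. Have small textures share a palette and merge them into one larger texture, which reduces texture switches.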

Matrix Transforms
Direct3D uses the world and view matrices that you set to configure several internal data structures. Each time you set a new world or view matrix, the system recalculates the associated internal structures. Setting these matrices frequently - for example, thousands of times per frame - is computationally time-consuming. You can minimize the number of required calculations by concatenating your world and view matrices into a world-view matrix that you set as the world matrix, and then setting the view matrix to the identity. Keep cached copies of individual world and view matrices so that you can modify, concatenate, and reset the world matrix as needed. For clarity in this documentation, Direct3D samples rarely employ this optimization.
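
The concatenation described above can be checked with plain matrix arithmetic. This is a minimal sketch that does not use the Direct3D matrix types: `Mat4`, `Vec4`, and the helper functions are hypothetical stand-ins, and only translations are used to keep the arithmetic exact.

```cpp
#include <cassert>

// Minimal 4x4 matrix and 4-component vector types for this sketch only.
struct Mat4 { float m[4][4]; };
struct Vec4 { float x, y, z, w; };

// Returns a * b.
Mat4 Multiply(const Mat4& a, const Mat4& b)
{
    Mat4 r = {}; // zero-initialized
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
    return r;
}

// Row vector times matrix (v' = v * M), the convention Direct3D uses.
Vec4 Transform(const Vec4& v, const Mat4& t)
{
    return Vec4{
        v.x * t.m[0][0] + v.y * t.m[1][0] + v.z * t.m[2][0] + v.w * t.m[3][0],
        v.x * t.m[0][1] + v.y * t.m[1][1] + v.z * t.m[2][1] + v.w * t.m[3][1],
        v.x * t.m[0][2] + v.y * t.m[1][2] + v.z * t.m[2][2] + v.w * t.m[3][2],
        v.x * t.m[0][3] + v.y * t.m[1][3] + v.z * t.m[2][3] + v.w * t.m[3][3] };
}

// Builds a translation matrix with the offsets in the fourth row.
Mat4 Translation(float tx, float ty, float tz)
{
    Mat4 r = {};
    r.m[0][0] = r.m[1][1] = r.m[2][2] = r.m[3][3] = 1.0f;
    r.m[3][0] = tx;
    r.m[3][1] = ty;
    r.m[3][2] = tz;
    return r;
}

// Usage sketch: transforming by world and then by view gives the same
// point as transforming once by the concatenated world-view matrix.
void CheckWorldViewConcatenation()
{
    const Mat4 world = Translation(1.0f, 2.0f, 3.0f);
    const Mat4 view = Translation(-4.0f, 0.0f, 5.0f);
    const Mat4 worldView = Multiply(world, view);

    const Vec4 p = { 1.0f, 1.0f, 1.0f, 1.0f };
    const Vec4 twoSteps = Transform(Transform(p, world), view);
    const Vec4 oneStep = Transform(p, worldView);
    assert(twoSteps.x == oneStep.x);
    assert(twoSteps.y == oneStep.y);
    assert(twoSteps.z == oneStep.z);
}
```

In an application, the concatenated matrix would be passed to IDirect3DDevice9::SetTransform as D3DTS_WORLD and an identity matrix as D3DTS_VIEW, with the cached world and view matrices kept separately as the paragraph describes.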

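Every time the world or view matrix is set, the associated internal data structures are recalculated, so setting them frequently is expensive. Concatenating the world and view matrices reduces this work, and a world matrix that has not changed should not be set again. (In practice, keeping FrameMove separate from Render already gets much of this benefit.)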

Using Dynamic Textures
To find out if the driver supports dynamic textures, check the D3DCAPS2_DYNAMICTEXTURES flag of the D3DCAPS9 structure.
Keep the following things in mind when working with dynamic textures.
1. They cannot be managed. For example, their pool cannot be D3DPOOL_MANAGED.
2. Dynamic textures can be locked, even if they are created in D3DPOOL_DEFAULT.
3. D3DLOCK_DISCARD is a valid lock flag for dynamic textures.
It is a good idea to create only one dynamic texture per format and possibly per size. Dynamic mipmaps, cubes, and volumes are not recommended because of the additional overhead in locking every level. For mipmaps, D3DLOCK_DISCARD is allowed only on the top level. All levels are discarded by locking just the top level. This behavior is the same for volumes and cubes. For cubes, the top level and face 0 are locked.
The following pseudocode shows an example of using a dynamic texture.
DrawProceduralTexture(pTex)
{
    // pTex should not be very small, because the overhead of
    // calling the driver on every D3DLOCK_DISCARD would outweigh
    // the performance gain. Experimentation is encouraged.
    pTex->Lock(D3DLOCK_DISCARD);
    <Overwrite *entire* texture>
    pTex->Unlock();
    pDev->SetTexture();
    pDev->DrawPrimitive();
}

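Keep the following in mind when working with dynamic textures.
1. They cannot be managed; that is, their pool cannot be D3DPOOL_MANAGED.
2. Dynamic textures can be locked even if they are created in D3DPOOL_DEFAULT.
3. D3DLOCK_DISCARD is a valid lock flag for dynamic textures.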

Using Dynamic Vertex and Index Buffers
Locking a static vertex buffer while the graphics processor is using the buffer can have a significant performance penalty. The lock call must wait until the graphics processor is finished reading vertex or index data from the buffer before it can return to the calling application, a significant delay. Locking and then rendering from a static buffer several times per frame also prevents the graphics processor from buffering rendering commands, since it must finish commands before returning the lock pointer. Without buffered commands, the graphics processor remains idle until after the application is finished filling the vertex buffer or index buffer and issues a rendering command.

Ideally the vertex or index data would never change; however, this is not always possible. There are many situations where the application needs to change vertex or index data every frame, perhaps even multiple times per frame. For these situations, the vertex or index buffer should be created with D3DUSAGE_DYNAMIC. This usage flag causes Direct3D to optimize for frequent lock operations. D3DUSAGE_DYNAMIC is only useful when the buffer is locked frequently; data that remains constant should be placed in a static vertex or index buffer.

To receive a performance improvement when using dynamic vertex buffers, the application must call IDirect3DVertexBuffer9::Lock or IDirect3DIndexBuffer9::Lock with the appropriate flags. D3DLOCK_DISCARD indicates that the application does not need to keep the old vertex or index data in the buffer. If the graphics processor is still using the buffer when Lock is called with D3DLOCK_DISCARD, a pointer to a new region of memory is returned instead of the old buffer data. This allows the graphics processor to continue using the old data while the application places data in the new buffer. No additional memory management is required in the application; the old buffer is reused or destroyed automatically when the graphics processor is finished with it. Note that locking a buffer with D3DLOCK_DISCARD always discards the entire buffer; specifying a nonzero offset or limited size field does not preserve information in unlocked areas of the buffer.

There are cases where the amount of data the application needs to store per lock is small, such as adding four vertices to render a sprite. D3DLOCK_NOOVERWRITE indicates that the application will not overwrite data already in use in the dynamic buffer. The lock call will return a pointer to the old data, allowing the application to add new data in unused regions of the vertex or index buffer. The application should not modify vertices or indices used in a draw operation as they might still be in use by the graphics processor. The application should then use D3DLOCK_DISCARD after the dynamic buffer is full to receive a new region of memory, discarding the old vertex or index data after the graphics processor is finished.

The asynchronous query mechanism is useful to determine if vertices are still in use by the graphics processor. Issue a query of type D3DQUERYTYPE_EVENT after the last DrawPrimitive call that uses the vertices. The vertices are no longer in use when IDirect3DQuery9::GetData returns S_OK. Locking a buffer with D3DLOCK_DISCARD or no flags will always guarantee the vertices are synchronized properly with the graphics processor; however, using Lock without flags will incur the performance penalty described earlier. Other API calls such as IDirect3DDevice9::BeginScene, IDirect3DDevice9::EndScene, and IDirect3DDevice9::Present do not guarantee the graphics processor is finished using vertices.

Below are ways to use dynamic buffers and the proper lock flags.

// USAGE STYLE 1
// Discard the entire vertex buffer and refill with thousands of vertices.
// Might contain multiple objects and/or require multiple DrawPrimitive
// calls separated by state changes, etc.

// Determine the size of data to be moved into the vertex buffer.
UINT nSizeOfData = nNumberOfVertices * m_nVertexStride;

// Discard and refill the used portion of the vertex buffer.
CONST DWORD dwLockFlags = D3DLOCK_DISCARD;

// Lock the vertex buffer.
BYTE* pBytes;
if( FAILED( m_pVertexBuffer->Lock( 0, 0, (VOID**)&pBytes, dwLockFlags ) ) )
    return false;

// Copy the vertices into the vertex buffer.
memcpy( pBytes, pVertices, nSizeOfData );
m_pVertexBuffer->Unlock();

// Render the primitives.
m_pDevice->DrawPrimitive( D3DPT_TRIANGLELIST, 0, nNumberOfVertices/3 );

// USAGE STYLE 2
// Reusing one vertex buffer for multiple objects

// Determine the size of data to be moved into the vertex buffer.
UINT nSizeOfData = nNumberOfVertices * m_nVertexStride;

// No overwrite will be used if the vertices can fit into
// the space remaining in the vertex buffer.
DWORD dwLockFlags = D3DLOCK_NOOVERWRITE;

// Check to see if the entire vertex buffer has been used up yet.
// Written as an addition so that, if these sizes are unsigned, the
// comparison cannot wrap around when nSizeOfData exceeds m_nSizeOfVB.
if( m_nNextVertexData + nSizeOfData > m_nSizeOfVB )
{
    // No space remains. Start over from the beginning
    // of the vertex buffer.
    dwLockFlags = D3DLOCK_DISCARD;
    m_nNextVertexData = 0;
}

// Lock the vertex buffer.
BYTE* pBytes;
if( FAILED( m_pVertexBuffer->Lock( (UINT)m_nNextVertexData, nSizeOfData,
                                   (VOID**)&pBytes, dwLockFlags ) ) )
    return false;

// Copy the vertices into the vertex buffer.
memcpy( pBytes, pVertices, nSizeOfData );
m_pVertexBuffer->Unlock();

// Render the primitives.
m_pDevice->DrawPrimitive( D3DPT_TRIANGLELIST,
                          m_nNextVertexData/m_nVertexStride, nNumberOfVertices/3 );

// Advance to the next position in the vertex buffer.
m_nNextVertexData += nSizeOfData;
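
The cursor logic of usage style 2 can be isolated and tested without a device. The following is a minimal sketch under stated assumptions: `LockMode` stands in for the real D3DLOCK_NOOVERWRITE and D3DLOCK_DISCARD flags so that the code compiles without d3d9.h, and `LockPlan` and `DynamicBufferCursor` are hypothetical names, not part of Direct3D.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins for the real D3DLOCK_* flags.
enum class LockMode { NoOverwrite, Discard };

struct LockPlan
{
    LockMode mode;
    std::size_t offset; // byte offset at which to lock
};

// Hypothetical helper that tracks the write position of one dynamic buffer.
class DynamicBufferCursor
{
public:
    explicit DynamicBufferCursor(std::size_t bufferSize)
        : m_size(bufferSize), m_next(0) {}

    // Decides where the next nBytes go and which lock mode to use,
    // then advances the cursor past the reserved region.
    LockPlan Reserve(std::size_t nBytes)
    {
        assert(nBytes <= m_size); // a single request must fit in the buffer
        LockPlan plan{ LockMode::NoOverwrite, m_next };
        if (m_next + nBytes > m_size)
        {
            // Not enough room left: discard and start over at offset 0.
            plan.mode = LockMode::Discard;
            plan.offset = 0;
        }
        m_next = plan.offset + nBytes;
        return plan;
    }

private:
    std::size_t m_size;
    std::size_t m_next;
};
```

With a 100-byte buffer, two 40-byte requests are appended with no-overwrite semantics, and the third triggers a discard and restarts at offset 0. The returned offset and mode would then be passed to the real Lock call.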

Z-Buffer Performance
Applications can increase performance when using z-buffering and texturing by ensuring that scenes are rendered from front to back. Textured z-buffered primitives are pretested against the z-buffer on a scan line basis. If a scan line is hidden by a previously rendered polygon, the system rejects it quickly and efficiently. Z-buffering can improve performance, but the technique is most useful when a scene draws the same pixels more than once. This is difficult to calculate exactly, but you can often make a close approximation. If the same pixels are drawn less than twice, you can achieve the best performance by turning z-buffering off and rendering the scene from back to front.
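
The effect of draw order on the depth test can be sketched with a one-pixel depth buffer. This is a simplified model, not Direct3D code: `SceneObject`, `SortFrontToBack`, and `CountDepthPasses` are hypothetical names, every object is assumed to cover the same pixel, and a smaller `viewDepth` means closer to the camera.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical scene record with a precomputed view-space depth.
struct SceneObject
{
    int id;
    float viewDepth; // distance from the camera, assumed non-negative
};

// Orders objects so that the nearest one is drawn first.
void SortFrontToBack(std::vector<SceneObject>& objects)
{
    std::sort(objects.begin(), objects.end(),
        [](const SceneObject& a, const SceneObject& b)
        {
            return a.viewDepth < b.viewDepth;
        });
}

// Simulates drawing the objects in order over a single pixel and returns
// how many of them pass a less-than depth test, that is, how many times
// the pixel is actually written.
int CountDepthPasses(const std::vector<SceneObject>& objects)
{
    float depthBuffer = 1.0e30f; // "far" clear value for this sketch
    int passes = 0;
    for (const SceneObject& object : objects)
    {
        assert(object.viewDepth >= 0.0f);
        if (object.viewDepth < depthBuffer)
        {
            depthBuffer = object.viewDepth;
            ++passes;
        }
    }
    return passes;
}
```

Drawn front to back, only the nearest object passes the test and the rest are rejected; drawn back to front, every object is written.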

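Render from front to back: when a pixel is occluded, the z-buffer rejects it, which improves efficiency. However, when each pixel is drawn fewer than two times on average, turning z-buffering off and rendering from back to front is faster.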

claien
