LogLuv Encoding for HDR

For HDR buffer formats I still think 16F or 7e3 is the better choice; I'm not a fan of RGBM, RGBE, or LogLuv. What is the point of those encodings? Reducing bandwidth and saving video memory. But on current graphics cards, even one like the NVIDIA 8600, saving that little bit of bandwidth has a negligible effect on performance. Perhaps that's because I tested with deferred shading: there the HDR buffer is only used once lighting is computed, while the G-Buffer pass writes to RGBA8 buffers, so the impact is truly tiny. For forward shading, where every object is drawn directly into the HDR buffer, the bandwidth savings might help somewhat. What follows is MJP's detailed LogLuv implementation, for reference if needed.


MJP's blog post is attached below.
--------------------------------------------------------------------------------

Designing an effective and performant HDR implementation for my game’s engine was a step that was complicated a bit by a few of the quirks of running XNA on the Xbox 360.  As a quick refresher for those who aren’t experts on the subject, HDR is most commonly implemented by rendering the scene to a floating-point buffer and then performing a tone-mapping pass to bring the colors back into the visible range. Floating-point formats (like A16B16G16R16F, AKA HalfVector4) are used because their added precision and floating-point nature allow them to comfortably store linear RGB values in ranges beyond the [0,1] typically used for shader output to the backbuffer, which is crucial as HDR requires data with a wide dynamic range. They’re also convenient, as they allow values to be stored in the same format the shaders manipulate them in. Newer GPUs also support full texture filtering and alpha-blending with fp surfaces, which removes the need for special-case handling of things like non-opaque geometry. However, as with most things, what’s convenient is not always the best option. During planning, I came up with the following list of pros and cons for various types of HDR implementations:

Standard HDR, fp16 buffer
+Very easy to integrate (no special work needed for the shaders)
+Good precision
+Support for blending on SM3.0+ PC GPUs
+Allows for HDR bloom effects
-Double the bandwidth and storage requirements of R8G8B8A8
-Weak support for multi-sampling on SM3.0 GPUs (Nvidia NV40 and G70/G71 can’t do it)
-Hardware filtering not available on ATI SM2.0 and SM3.0 GPUs
-No blending on the Xbox 360
-Requires double space in framebuffer on the 360, which increases the number of tiles needed
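
The “double the bandwidth and storage” point above is easy to quantify. Here is a quick back-of-the-envelope calculation in Python (the 1280x720 resolution is just an assumed example, not anything from MJP’s post):

```python
# Bytes per pixel: A16B16G16R16F is four 16-bit channels,
# R8G8B8A8 is four 8-bit channels.
BPP_FP16_RGBA = 8
BPP_RGBA8 = 4

def surface_bytes(width, height, bytes_per_pixel):
    # Storage for one color surface, ignoring MSAA and alignment padding.
    return width * height * bytes_per_pixel

# Example: a 1280x720 render target.
fp16_bytes = surface_bytes(1280, 720, BPP_FP16_RGBA)   # about 7.0 MiB
rgba8_bytes = surface_bytes(1280, 720, BPP_RGBA8)      # about 3.5 MiB
```

The 2x factor matters most on the 360, where the render target must fit in the 10 MB of EDRAM or be split into tiles.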

HDR with tone-mapping applied directly in the pixel shader (Valve-style)
+Doesn’t require output to an HDR format, no floating-point or encoding required
+Multi-sampling and blending is supported, even on old hardware
-Can’t do HDR bloom, since only an LDR image is available for post-processing
-Luminance can’t be calculated directly, need to use fancy techniques to estimate it
-Increases shader complexity and combinations
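
To make the second option concrete: “tone-mapping applied directly in the pixel shader” means each shader maps its HDR result into [0,1] before writing out, so no HDR surface ever exists. A minimal Python illustration using the simple Reinhard operator (the operator choice and exposure parameter are my own assumptions for illustration, not Valve’s exact pipeline):

```python
def reinhard_tonemap(hdr_rgb, exposure=1.0):
    # Simple Reinhard operator x / (1 + x), applied per channel:
    # compresses any non-negative HDR value into [0, 1) before the
    # LDR write-out.
    return tuple(exposure * c / (1.0 + exposure * c) for c in hdr_rgb)

# A bright HDR color becomes displayable after tone-mapping, but only
# the LDR result survives -- nothing is left to feed an HDR bloom pass.
ldr = reinhard_tonemap((4.0, 1.0, 0.25))
```

This is exactly why the “can’t do HDR bloom” drawback appears in the list: post-processing only ever sees the already-compressed values.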

HDR using an encoded format
+Allows for a standard tone-mapping chain
+Allows for HDR bloom effects
+Most formats offer a very wide dynamic range
+Same bandwidth and storage as LDR rendering
+Certain formats allow for multi-sampling and/or linear filtering with reasonable quality
-Alpha-blending usually isn’t an option, since the alpha-channel is used by most formats
-Linear filtering and multisampling usually aren’t mathematically correct, although often the results are “good enough”
-Additional shader math needed for format conversions
-Adds complexity to shaders

My early prototyping used a standard tone-mapping chain and I didn’t want to ditch that, nor did I want to move away from what I was comfortable with.  This pretty much eliminated the second option for me off the bat…although I was unlikely to choose it anyway due to its other drawbacks (having nice HDR bloom was something I felt was an important part of the look I wanted for my game, and in my opinion Valve’s method doesn’t do a great job of determining average luminance).  When I tried out the first method I found that it worked as well as it always did on the PC (I’ve used it before), but on the 360 it was another story.  I’m not sure why exactly, but for some reason it simply does not like the HalfVector4 format.  Performance was terrible, I couldn’t blend, I got all kinds of strange rendering artifacts (entire lines of pixels missing), and I’d get bizarre exceptions if I enabled multisampling. Loads of fun, let me tell you.

This left me with option #3.  I wasn’t a fan of this approach initially, as my original design plan called for things to be simple and straightforward whenever possible.  I didn’t really want to have two versions of my material shaders to support encoding, nor did I want to integrate decoding into the other parts of the pipeline that needed it.  But unfortunately, I wasn’t really left with any other options after I found there were no plans to bring support for the 360’s special fp10 backbuffer format to XNA (which would have conveniently solved my problems on the 360).  So, I started doing my research.  Naturally the first place I looked was to actual released commercial games.  Why?  Because usually when a technique is used in a shipped game, it means it’s gone through the paces and has been determined to actually be feasible and practical in a game environment.  Which of course naturally led me to consider NAO32.

NAO32 is a format that gained some fame in the dev community when ex-Ninja Theory programmer Marco Salvi shared some details on the technique over on the beyond3D forums.  Used in the game Heavenly Sword, it allowed for multisampling to be used in conjunction with HDR on a platform (PS3) whose GPU didn’t support multisampling of floating-point surfaces (the RSX is heavily based on Nvidia’s G70).  In this technique, color is stored in the LogLuv format using a standard R8G8B8A8 surface.  Two components are used to store X and Y at 8-bit precision, and the other two are used to store the log of luminance at 16-bit precision.  Having 16 bits for luminance allows for a wide dynamic range to be stored in this format, and storing the log of the luminance allows for linear filtering in multisampling or texture sampling.  Since he first explained it, other games have also used it, such as Naughty Dog’s Uncharted.  It’s likely that it’s been used in many other PS3 games, as well.
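
The point about storing the log of luminance is worth unpacking: when the hardware linearly filters the encoded value it is averaging logarithms, so the decoded result is the geometric mean of the samples rather than being dominated by the brightest one. A quick numeric illustration in Python (scale and bias omitted for clarity; the real format biases the log to fit 16 bits):

```python
import math

def encode_lum(y):
    # Store log2 of luminance, as LogLuv-style formats do.
    return math.log2(y)

def decode_lum(le):
    return 2.0 ** le

bright, dark = 1000.0, 0.001
# Hardware linear filtering averages the *encoded* values...
filtered = decode_lum((encode_lum(bright) + encode_lum(dark)) / 2.0)
# ...which decodes to the geometric mean sqrt(bright * dark) = 1.0,
# instead of the arithmetic mean of ~500 that filtering a linear
# fp16 buffer would produce.
```

That well-behaved midpoint is what makes MSAA resolve and bilinear sampling of the encoded surface look reasonable even though they are not mathematically exact.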

My actual shader implementation was helped along quite a bit by Christer Ericson’s blog post, which described how to derive optimized shader code for encoding RGB into the LogLuv format.  Using his code as a starting point, I came up with the following HLSL code for encoding and decoding:

// M matrix, for encoding RGB into a modified CIE XYZ (Xp, Y, XYZp)
const static float3x3 M = float3x3(
    0.2209, 0.3390, 0.4184,
    0.1138, 0.6780, 0.7319,
    0.0102, 0.1130, 0.2969);

// Inverse M matrix, for decoding
const static float3x3 InverseM = float3x3(
     6.0013, -2.7000, -1.7995,
    -1.3320,  3.1029, -5.7720,
     0.3007, -1.0880,  5.6268);

float4 LogLuvEncode(in float3 vRGB)
{
    float4 vResult;
    float3 Xp_Y_XYZp = mul(vRGB, M);
    // Clamp away from zero so the divide and log2 below are safe
    Xp_Y_XYZp = max(Xp_Y_XYZp, float3(1e-6, 1e-6, 1e-6));
    // x and y chromaticity, stored at 8-bit precision
    vResult.xy = Xp_Y_XYZp.xy / Xp_Y_XYZp.z;
    // Log-luminance, biased into [0, 255] and split across two
    // 8-bit channels (z = high byte, w = low byte)
    float Le = 2 * log2(Xp_Y_XYZp.y) + 127;
    vResult.w = frac(Le);
    vResult.z = (Le - (floor(vResult.w * 255.0f)) / 255.0f) / 255.0f;
    return vResult;
}

float3 LogLuvDecode(in float4 vLogLuv)
{
    // Reassemble log-luminance from the two 8-bit channels
    float Le = vLogLuv.z * 255 + vLogLuv.w;
    float3 Xp_Y_XYZp;
    Xp_Y_XYZp.y = exp2((Le - 127) / 2);
    // Recover Z from luminance and the y chromaticity, then X
    Xp_Y_XYZp.z = Xp_Y_XYZp.y / vLogLuv.y;
    Xp_Y_XYZp.x = vLogLuv.x * Xp_Y_XYZp.z;
    float3 vRGB = mul(Xp_Y_XYZp, InverseM);
    return max(vRGB, 0);
}
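
As a sanity check on the shaders above, here is a straight Python transcription of the same encode/decode math. This is my own port, so treat it as a sketch rather than MJP’s code; note that HLSL’s mul(vector, matrix) treats the vector as a row vector, which the helper below mimics:

```python
import math

M = [[0.2209, 0.3390, 0.4184],
     [0.1138, 0.6780, 0.7319],
     [0.0102, 0.1130, 0.2969]]

INVERSE_M = [[ 6.0013, -2.7000, -1.7995],
             [-1.3320,  3.1029, -5.7720],
             [ 0.3007, -1.0880,  5.6268]]

def row_mul(v, m):
    # HLSL mul(vector, matrix): treat v as a row vector.
    return [sum(v[i] * m[i][j] for i in range(3)) for j in range(3)]

def logluv_encode(rgb):
    xp, y, xyzp = (max(c, 1e-6) for c in row_mul(rgb, M))
    le = 2.0 * math.log2(y) + 127.0
    w = le - math.floor(le)                              # frac(Le)
    z = (le - math.floor(w * 255.0) / 255.0) / 255.0
    return [xp / xyzp, y / xyzp, z, w]

def logluv_decode(enc):
    le = enc[2] * 255.0 + enc[3]
    y = 2.0 ** ((le - 127.0) / 2.0)
    xyzp = y / enc[1]
    xp = enc[0] * xyzp
    return [max(c, 0.0) for c in row_mul([xp, y, xyzp], INVERSE_M)]

# Round-tripping a color recovers it to within the format's precision.
decoded = logluv_decode(logluv_encode([0.5, 0.25, 0.75]))
```

All four encoded components land in [0, 1], which is what lets the result be written to an ordinary R8G8B8A8 target.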

Once I had this implemented and worked through a few small glitches, results were much improved in the 360 version of my game. Performance was much, much better, I could multi-sample again, and the results looked great. So while things didn’t exactly work out in an ideal way, I’m pleased enough with the results.

If you’re interested in this, be sure to check out my sample.
