071126 / Optimization and More

Random Thoughts...

Better Deferred Rendering Anti-Aliasing

No, not for Atom, but for everyone else!

For those who have normals stored in their G-Buffer, here is an idea for getting really high quality anti-aliasing with an image space post process, without using hardware anti-aliasing. First take the per pixel normal, project it into screen space, and re-normalize. Then rotate this screen space normal 90 degrees (that is, compute the vector perpendicular to it). This perpendicular vector points along the strongest "edge" at the pixel. Now, as with screen space motion blur, sample the screen along this perpendicular vector, and intelligently take a weighted average of a few screen samples in both the positive and negative directions. I would have to work out the details, but the concept should work really well.
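A rough CPU-side sketch of the steps above, to make the geometry concrete. Everything named here is hypothetical: `edgeDirection`, `filterAlongEdge`, the `Image` type, the tent weights, and the tap count are illustrative choices, and the "projection" simply drops the z component of a view-space normal, which is an assumption. A real version would live in a pixel shader and would need smarter weighting.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec2 { double x, y; };

// Hypothetical helper: drop z from a view-space normal (assumed orthographic
// projection onto the screen plane), re-normalize, then rotate 90 degrees.
// Returns {0, 0} when the normal points straight at the camera, since there
// is then no usable screen-space direction.
Vec2 edgeDirection(double nx, double ny, double nz) {
    (void)nz;  // discarded by the projection
    const double len = std::sqrt(nx * nx + ny * ny);
    if (len < 1e-6) return {0.0, 0.0};
    const double sx = nx / len, sy = ny / len;
    return {-sy, sx};  // (x, y) rotated 90 degrees is (-y, x)
}

// Minimal single-channel image with clamp-to-edge addressing.
struct Image {
    int width, height;
    std::vector<double> pixels;  // row-major
    double at(int x, int y) const {
        x = std::min(std::max(x, 0), width - 1);
        y = std::min(std::max(y, 0), height - 1);
        return pixels[static_cast<std::size_t>(y) * width + x];
    }
};

// Weighted average of the center pixel and `taps` samples on each side of it
// along `dir`. The tent weights are an assumption, not a tuned filter.
double filterAlongEdge(const Image& img, int px, int py, Vec2 dir, int taps = 2) {
    assert(taps >= 0);
    double sum = img.at(px, py);
    double weightSum = 1.0;
    for (int i = 1; i <= taps; ++i) {
        const double w = 1.0 - static_cast<double>(i) / (taps + 1);
        for (int side = -1; side <= 1; side += 2) {
            const int sx = static_cast<int>(std::lround(px + side * i * dir.x));
            const int sy = static_cast<int>(std::lround(py + side * i * dir.y));
            sum += w * img.at(sx, sy);
            weightSum += w;
        }
    }
    return sum / weightSum;
}
```

With a normal pointing along +x, the edge direction comes out along y, so the filter averages up and down the edge rather than across it.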

Limitations of 32-bit Floats on Effective Map/Level Size

I was once tempted to move my double precision code to the GPU, so I decided to look at the effective maximum fully interactive world size for single precision. The average human walks at about 4 mph (70.4 inches/sec). In this context, from personal experience, about 1/32 inch of positional precision is needed in order to render objects close to the screen without problems. With this in mind, the 24 bits of precision of FP32 give 2^24/32 = 524288 inches, or a little over 8.2 miles (13.3 km), in one direction (the sign bit covers the opposite direction).
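The arithmetic can be checked directly. This is a small sketch with hypothetical helper names; it reproduces the 2^24/32 figure and uses `std::nextafter` to show that the spacing between adjacent floats is 1/32 inch just below 524288 and doubles to 1/16 inch just above it.

```cpp
#include <cmath>
#include <limits>

constexpr double kInchesPerMile = 63360.0;  // 5280 ft * 12 in

// Largest coordinate (in inches) at which a significand of `mantissaBits`
// bits can still resolve steps of 1/stepsPerInch inch.
double maxRangeInches(int mantissaBits, double stepsPerInch) {
    return std::ldexp(1.0, mantissaBits) / stepsPerInch;
}

double inchesToMiles(double inches) { return inches / kInchesPerMile; }

// Gap between x and the next representable float toward zero.
float spacingBelow(float x) { return x - std::nextafter(x, 0.0f); }

// Gap between x and the next representable float toward +infinity.
float spacingAbove(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}
```

For example, `maxRangeInches(24, 32.0)` gives 524288, and `inchesToMiles` of that is a little under 8.3.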

According to some forum posts, Crysis sports something like a 16 km viewing distance, which seems to support the idea that they are pushing the upper limits of map size for an FPS using single precision.

Atom Update

I've found an interesting, easy solution to the Z-sort order popup problem, which has been an unresolved issue in the Atom engine from the get-go. The answer is to sort incorrectly based on particle size; I didn't even need to go to a separate Z write pass. Yep, I'm doing an incorrect Z sort which ensures that the soft edge seams between particles always have another particle in front to mask the z-order change popup issues. It works amazingly well (since I have hierarchical particles) and only requires a simple pre-sort Z offset change based on particle radius.
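One way to picture a pre-sort Z offset based on radius is a biased sort key. The post does not give the sign or size of the offset, so the `radiusBias` parameter, the `Particle` layout, and the back-to-front ordering below are all assumptions made purely for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

struct Particle {
    double depth;   // distance from the camera, larger is farther
    double radius;
};

// Hypothetical sort key: true depth plus an offset proportional to radius.
// With radiusBias > 0, big particles sort as if they were farther away, so
// they are drawn earlier and smaller neighbours end up in front of them.
double sortKey(const Particle& p, double radiusBias) {
    return p.depth + radiusBias * p.radius;
}

// Returns particle indices in draw order, farthest key first (back to front).
std::vector<std::size_t> drawOrder(const std::vector<Particle>& particles,
                                   double radiusBias) {
    std::vector<std::size_t> order(particles.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return sortKey(particles[a], radiusBias) >
                                sortKey(particles[b], radiusBias);
                     });
    return order;
}
```

With `radiusBias` set to zero this is an ordinary depth sort; a non-zero bias deliberately makes the order "wrong" in a consistent, size-dependent way.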

This discovery, in combination with getting the alpha blended deferred shading working, has now ended my R&D tangent, and I'm back to meeting my impending December 31st deadline for finishing up the graphics/CFD engine optimization.

Deferred shading has also enabled me to bring back FP16 HDR (with tri-linear env lookups) at virtually no cost, and has moved the performance bottleneck to the ROP (fill rate limited during alpha blending to the G-Buffer). Drawing in reverse painter's order (front first) and using the stencil to cull fragments should enable me to control the fill rate (this works great in my non-deferred engine path). It seems the G80 can stencil clip at about 8x its fill rate, and this should be even better on the G84 and G92. Because of the duplicate alphas, I currently need 3 FP16 MRTs. I'm attempting to move to 3 INT8 MRTs in the cubemap path (my "normals" should need less precision).

SSE2 Optimization Using Double Precision

Atom uses an L-system where each particle in the world tree has its own coordinate system (parent relative rotation matrix, offset, and scale). There are basically 2 primary per frame parts of the engine which operate on this tree on the CPU using double precision. This is something which simply would never map well to the GPU because of its precision requirements and recursive tree structure (each child's results depend on the results of its full chain of parents).
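To illustrate why a child cannot be resolved before its parent, here is a minimal sketch of composing a parent's world-space frame with a child's parent-relative frame. The `Frame` struct, the `compose` function, and the order of operations (rotate, then scale, then offset) are assumptions for illustration, not Atom's actual code.

```cpp
// Hypothetical coordinate frame: rotation matrix, offset, uniform scale.
struct Frame {
    double rot[9];     // row-major 3x3 rotation
    double offset[3];
    double scale;
};

// World frame of a child, given its parent's world frame and the child's
// parent-relative frame. Assumed convention:
//   world_point = parent.offset + parent.scale * parent.rot * local_point
Frame compose(const Frame& parentWorld, const Frame& childLocal) {
    Frame w{};
    for (int r = 0; r < 3; ++r) {
        // Rotation: parent rotation times child rotation.
        for (int c = 0; c < 3; ++c) {
            double sum = 0.0;
            for (int k = 0; k < 3; ++k)
                sum += parentWorld.rot[r * 3 + k] * childLocal.rot[k * 3 + c];
            w.rot[r * 3 + c] = sum;
        }
        // Offset: child offset rotated and scaled into the parent's frame.
        double rotated = 0.0;
        for (int k = 0; k < 3; ++k)
            rotated += parentWorld.rot[r * 3 + k] * childLocal.offset[k];
        w.offset[r] = parentWorld.offset[r] + parentWorld.scale * rotated;
    }
    w.scale = parentWorld.scale * childLocal.scale;
    return w;
}
```

Every level of the tree feeds its result into the next call, which is also where small rounding errors would accumulate in single precision.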

I'm trying to get as near as possible to the 2 double precision flops per cycle (best case, cached, etc) which is possible on Intel's Core arch using SSE2, for all my core per particle algorithms which are CPU side.

Ultimately SSE2 double precision performance is limited by pack/unpack and swizzles. In going with SOA (structure of arrays) layout, I've now taken a page from the CPU raytracing packet optimization concept and am working with a fixed set of 8 child nodes per parent node at a time. This is requiring a complete re-write of nearly everything but in the end should also help vastly improve cache performance due to better locality of data (which probably will be more of an improvement overall than getting 2 DP ops/cycle best case).
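As a sketch of what a fixed block of 8 children in SOA form might look like, the layout below keeps each component in its own contiguous array so that two adjacent children fill one 128-bit register. The `ChildBlock` struct and the function names are hypothetical, and the loop is plain scalar code that a compiler is free to vectorize.

```cpp
constexpr int kChildrenPerBlock = 8;

struct Vec3 { double x, y, z; };  // conventional AOS element

// SOA block: one array per component, 8 children per parent.
struct ChildBlock {
    double x[kChildrenPerBlock];
    double y[kChildrenPerBlock];
    double z[kChildrenPerBlock];
};

// Repack 8 AOS vectors into one SOA block.
ChildBlock toSoA(const Vec3 (&aos)[kChildrenPerBlock]) {
    ChildBlock b{};
    for (int i = 0; i < kChildrenPerBlock; ++i) {
        b.x[i] = aos[i].x;
        b.y[i] = aos[i].y;
        b.z[i] = aos[i].z;
    }
    return b;
}

// Rotate all 8 children by the same row-major 3x3 parent matrix. Each output
// array is a sum of three scaled input arrays, with no shuffles required.
ChildBlock rotateBlock(const double m[9], const ChildBlock& in) {
    ChildBlock out{};
    for (int i = 0; i < kChildrenPerBlock; ++i) {
        out.x[i] = m[0] * in.x[i] + m[1] * in.y[i] + m[2] * in.z[i];
        out.y[i] = m[3] * in.x[i] + m[4] * in.y[i] + m[5] * in.z[i];
        out.z[i] = m[6] * in.x[i] + m[7] * in.y[i] + m[8] * in.z[i];
    }
    return out;
}
```

The point of the layout is that the inner loop touches three contiguous arrays and applies the same nine scalars to every child.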

There are a bunch of key SSE instructions for double precision which help in interfacing with non-SOA data structures. For example MOVDDUP (strictly an SSE3 addition rather than SSE2, but available on Core), which can load a 64-bit aligned double into both the hi and lo parts of an XMM register (i.e., great for doing a matrix multiply of the same parent rotation matrix times 8 child matrices or vectors in SOA form). Likewise, the SSE2 pair MOVHPD+MOVLPD also works with 64-bit aligned doubles for memory operands. In one case, SSE2 has enabled me to hide the cost of unpacking and converting GPU single precision readback results into double precision SOAs, which ended up being a little under a 2x performance improvement for this case.
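A small sketch of these operations using compiler intrinsics, assuming an x86 target with SSE2 enabled. The function names are hypothetical. `_mm_set1_pd` is used for the broadcast so the block stays within SSE2; the direct MOVDDUP load is exposed as `_mm_loaddup_pd` in `<pmmintrin.h>` and needs SSE3.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (pulls in the SSE header as well)

// Build one XMM register from two doubles that are not adjacent in memory,
// e.g. the same field of two AOS elements. Maps to MOVLPD + MOVHPD.
inline __m128d gatherPair(const double* lo, const double* hi) {
    __m128d v = _mm_setzero_pd();
    v = _mm_loadl_pd(v, lo);  // low 64 bits
    v = _mm_loadh_pd(v, hi);  // high 64 bits
    return v;
}

// Convert 4 packed floats (e.g. GPU readback data) to 4 doubles.
inline void floatsToDoubles4(const float* src, double* dst) {
    const __m128 f = _mm_loadu_ps(src);
    _mm_storeu_pd(dst, _mm_cvtps_pd(f));                         // floats 0, 1
    _mm_storeu_pd(dst + 2, _mm_cvtps_pd(_mm_movehl_ps(f, f)));   // floats 2, 3
}

// One row of a 3x3 matrix times 8 SOA vectors, two children per iteration:
// out[i] = row[0] * x[i] + row[1] * y[i] + row[2] * z[i].
inline void dotRow8(const double row[3], const double* x, const double* y,
                    const double* z, double* out) {
    const __m128d m0 = _mm_set1_pd(row[0]);  // same scalar in both lanes
    const __m128d m1 = _mm_set1_pd(row[1]);
    const __m128d m2 = _mm_set1_pd(row[2]);
    for (int i = 0; i < 8; i += 2) {
        __m128d acc = _mm_mul_pd(m0, _mm_loadu_pd(x + i));
        acc = _mm_add_pd(acc, _mm_mul_pd(m1, _mm_loadu_pd(y + i)));
        acc = _mm_add_pd(acc, _mm_mul_pd(m2, _mm_loadu_pd(z + i)));
        _mm_storeu_pd(out + i, acc);
    }
}
```

Unaligned loads and stores are used here only to keep the sketch safe for any caller; with 16-byte aligned SOA arrays the aligned variants would be the natural choice.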