090105 / NV GPU Programming Guide

More On Tessellation

Ignacio Castaño at NVidia has a great post on Watertight Texture Sampling. Posts like that go a long way towards helping us developers come to grips with what will be necessary to make use of future hardware.

IMO this next jump from DX9 for console devs to DX11 or beyond is going to be huge, and will require a large amount of re-engineering to really make use of the hardware. A majority of devs skipped DX10. DX10 ports of DX9 code without proper re-engineering for constant buffers and immutable state objects caused a decrease in performance, and often the best of DX10 was left unused because it wouldn't port to older console hardware. Now add tessellation, compute shaders, and proper threading to DX. If the time it took PS3 devs to make use of the SPUs is any indication, my prediction is a long lag before software really makes use of the technology in this next generation of consoles from both Sony and Microsoft.

Draw calls of current titles clock in over 4000/frame and over 2M triangles/frame on the upper end.
So what is next?

To put those numbers in perspective, 2M triangles is about the same as the number of pixels on the screen at 1080p resolution! At the upper end, 2M tri / 4000 draws = 500 tri/draw, which seems like a low average number of triangles per batch. Draws are divided by state like render target (pass, ie pre-z, shadow, motion vector, geometry, lighting, etc), or texture, or shader (material), or buffer, etc.
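The arithmetic behind that comparison, as a quick sketch (the only inputs are the upper-end figures quoted above and the 1920x1080 pixel count):

```python
# Back-of-the-envelope numbers for the paragraph above.
TRIANGLES_PER_FRAME = 2_000_000  # upper-end figure quoted in the post
DRAWS_PER_FRAME = 4_000          # upper-end figure quoted in the post

pixels_1080p = 1920 * 1080                             # pixels on a 1080p screen
tris_per_draw = TRIANGLES_PER_FRAME / DRAWS_PER_FRAME  # average batch size
tris_per_pixel = TRIANGLES_PER_FRAME / pixels_1080p    # just under one triangle per pixel

print(pixels_1080p)   # 2073600
print(tris_per_draw)  # 500.0
print(tris_per_pixel)
```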

Threading of DX11 will remove some of the CPU limitations of DX10 on max number of draw calls. Texture arrays of DX10 or some other method such as mega-texturing in combination with decoupling shader and material (like deferred shading) can remove many of the state changes which result in excessive draw calls. DX10 flexibility (vertex texture fetch, instancing, etc) can remove other limitations which would have otherwise split meshes into multiple draws (like GPU skinning with limited DX9 hardware constant buffer size). For all unique objects, in theory, one should be able to get to one draw per object per output framebuffer pass. With instancing and merging objects into single buffers, one should be able to get to many times less than one draw per object per output. This type of system could help solve the problem of view distances and LOD in terms of number of draw calls. DX11 tessellation requires splitting by topology (which in the worst case might be a draw call per topology), but this can be handled in other ways using the flexibility of DX11 to avoid the draw call explosion. So perhaps draw calls won't be a problem for DX11 with some engine design.
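To make the draw call counting concrete, here is a toy sketch. The scene, the pass list, and both function names are made up for illustration, and this is not how any particular engine batches. The idea is just that draws per pass equal the number of distinct state keys, and that taking the texture out of the key (texture arrays, mega-texturing) shrinks that number.

```python
# Hypothetical scene: one (mesh, shader, texture) tuple per object.
objects = [
    ("rock", "opaque", "rock_a"),
    ("rock", "opaque", "rock_b"),
    ("rock", "opaque", "rock_a"),
    ("tree", "foliage", "bark"),
    ("tree", "foliage", "bark"),
    ("crate", "opaque", "wood"),
]
passes = ["pre-z", "shadow", "lighting"]  # hypothetical output passes

def draws_one_per_object(objects, passes):
    """Baseline: every object is drawn once in every pass."""
    return len(objects) * len(passes)

def draws_instanced(objects, passes, texture_arrays=False):
    """Objects sharing a state key collapse into one instanced draw per pass.

    With texture_arrays=True the texture is assumed to be selected per
    instance inside the shader, so it no longer splits draws.
    """
    if texture_arrays:
        keys = {(mesh, shader) for mesh, shader, _texture in objects}
    else:
        keys = set(objects)
    return len(keys) * len(passes)

print(draws_one_per_object(objects, passes))                  # 18
print(draws_instanced(objects, passes))                       # 12
print(draws_instanced(objects, passes, texture_arrays=True))  # 9
```

Merging different meshes into shared buffers would shrink the key further, which is where "many times less than one draw per object" comes from.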

Patches as The New Primitive!

There are all sorts of pre-raster steps involved in triangle setup: from reading the index buffer, fetching vertex attributes, checking the post transform cache, to running the vertex shader on post transform cache misses. DX11 with tessellation is set to reduce the relative burden on this hardware, since the same number of verts will provide a much higher quality output. I'm guessing this pre-raster hardware for next generation simply scales with clock rate (as a serial process), but largely doesn't change and probably isn't parallelized. If this is the case, simply drawing triangles as we do now might easily get pre-raster setup limited (assuming the rest of the pipeline scales and resolution is capped at current HD).

Enter DX11 tessellation. Tessellation functions as a way to run-time decompress vertex data (help solve the memory bandwidth and memory utilization problem for high detail), parallelize the rest of the pre-raster triangle setup, and help solve the level of detail problem. Patches can be computed independently from each other in parallel. The hull shader computes how much to tessellate. Tessellation can be view dependent and likely can even be made occlusion dependent as well (hull shader could sample depth from the previous frame or other prior calculations). Patches can be killed in the hull shader with 0 for tessellation factors.
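As a rough sketch of the kind of per-patch decision a hull shader would make, here it is written as plain Python rather than HLSL. The linear distance falloff, the `near`/`far` defaults, and the `visible` flag are my assumptions for illustration. The DX11 maximum tessellation factor of 64 and the "factor 0 kills the patch" behavior are the only API facts used.

```python
MAX_TESS_FACTOR = 64.0  # D3D11 upper limit for a tessellation factor

def patch_tess_factor(distance, visible, near=1.0, far=100.0):
    """Pick a tessellation factor for one patch.

    distance: patch distance from the camera (view dependent LOD).
    visible:  result of some prior occlusion test, for example sampling
              last frame's depth (hypothetical input).
    """
    if not visible:
        return 0.0  # a factor of 0 culls the patch entirely
    # Assumed linear falloff: full detail at `near`, factor 1 at `far`.
    t = (distance - near) / (far - near)
    t = min(max(t, 0.0), 1.0)
    return MAX_TESS_FACTOR + t * (1.0 - MAX_TESS_FACTOR)

print(patch_tess_factor(1.0, True))    # 64.0
print(patch_tess_factor(100.0, True))  # 1.0
print(patch_tess_factor(10.0, False))  # 0.0
```

Each patch is evaluated independently, which is what makes this step easy to run in parallel.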

Not sure how vector SIMD friendly all the stages are. Patches are limited to 32 control points on input (so on current NVidia hardware, one warp). If a lower number of control points is used, can the hardware still maintain full SIMD efficiency? What about in the domain shader with a variable number of outputs? These are open questions which I don't expect to get an answer to until I can go out and buy the first DX11 graphics card and see for myself!

What About Post Tessellation?

Patches get tessellated into verts. Verts get rastered as triangles, lines, or points (seems to be possible in DX11). Raster output is as before, 2x2 pixel quads which get sent to the rest of the pipeline. Not sure if raster setup gets parallelized or not. Perhaps raster setup just gets double clocked? If not, tessellation could get triangle bound? Small triangles are likely not performance friendly anyway due to 2x2 pixel quads.
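A small sketch of why the 2x2 quad granularity hurts small triangles. The `quad_efficiency` helper is hypothetical, and it assumes every quad a triangle touches costs four shader lanes no matter how many of its pixels are actually covered.

```python
def quad_efficiency(covered_pixels):
    """Fraction of shaded lanes that land on covered pixels.

    covered_pixels: set of (x, y) integer pixel coordinates covered by one
    triangle. Quads are aligned to even pixel coordinates.
    """
    if not covered_pixels:
        return 0.0
    quads = {(x // 2, y // 2) for x, y in covered_pixels}
    return len(covered_pixels) / (4 * len(quads))

print(quad_efficiency({(0, 0)}))                          # 0.25, one-pixel triangle
print(quad_efficiency({(0, 0), (1, 0), (0, 1), (1, 1)}))  # 1.0, fills one quad
print(quad_efficiency({(1, 1), (2, 2)}))                  # 0.25, straddles two quads
```

Under that assumption, heavy tessellation down to pixel-sized triangles wastes up to three quarters of the pixel shading work.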

NVidia GPU Programming Guide Update

NVidia's GPU Programming Guide has been updated for 8 and 9 series GPUs. Here are a few highlights.

General
- Suggested to use halfs when possible for computation, even on 8/9 hardware?
- Use several thousand triangles per draw call.
- Interpolants increase post-[vertex-]transform cache pressure.
- Interpolants reduce vertices/clock, can cause attribute bottleneck.
- So factoring work from PS to VS can decrease performance.
- Suggested for NVidia DX10 cards, DON'T do manual vector packing for interpolants.
- Manual interpolant vector packing can reduce compiler's effectiveness?
- Saturate is no longer free.
- Still suggesting replacing complex functions with texture lookups.
- Double-speed Z/stencil rendering disabled if using texkill or depth replace.
- Allocate render targets first ordered by pitch (width*bpp) then frequency of use.
- The NULL, INTZ, and RAWZ codes can be used in DX9 for depth textures...

Z-Cull and Early-Z
- 8 series and later also has fine-grained Z and stencil culling.
- GPU can skip shading of occluded pixels.
- Guessing that fine-grain means that cull happens before packing 2x2 pixel quads into vectors for shading.
- And the following notes for 8/9/2xx hardware,

Coarse Z/Stencil Cull (Z-Cull) Can Not Cull When
- Clears are NOT used to clear depth/stencil buffer.
- Pixel shader writes depth.
- Direction of depth test is changed while writing depth in fragment shader.
- Stencil writes are enabled while doing stencil testing.
- On 8 series if DepthStencilView is a texture array.

Z-Cull Performance Reduced When
- Depth test direction changes (when not writing depth in fragment shader).
- Depth contains a lot of high frequency information.
- There are too many large depth buffers.
- D32_FLOAT is used.

Fine Grain (Early-Z) Cull is Disabled When
- If pixel shader outputs depth.
- On 8 series if pixel shader uses .z of input attribute with SV_Position semantic.
- If depth or stencil writes are enabled, or occlusion queries are enabled, and
- one of the following is true,
-- Alpha test is enabled.
-- Pixel shader kills pixels.
-- Alpha to coverage is enabled.
-- SampleMask is not 0xFFFFFFFF.
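The Early-Z rules above read more easily as a predicate. This is only my transcription of the list into code. The `PipelineState` fields are made-up names, not a real API, and the logic is exactly as complete (or incomplete) as the notes above.

```python
from dataclasses import dataclass

@dataclass
class PipelineState:
    # Hypothetical flags mirroring the bullet points above.
    is_8_series: bool = False
    ps_outputs_depth: bool = False
    ps_reads_sv_position_z: bool = False
    depth_write: bool = False
    stencil_write: bool = False
    occlusion_query: bool = False
    alpha_test: bool = False
    ps_kills_pixels: bool = False
    alpha_to_coverage: bool = False
    sample_mask: int = 0xFFFFFFFF

def early_z_disabled(s):
    """Return True if fine grain (Early-Z) cull is disabled for this state."""
    if s.ps_outputs_depth:
        return True
    if s.is_8_series and s.ps_reads_sv_position_z:
        return True
    # Something is written or counted per pixel...
    has_side_effect = s.depth_write or s.stencil_write or s.occlusion_query
    # ...and the pixel can still be dropped after the shader runs.
    can_drop_pixel = (
        s.alpha_test
        or s.ps_kills_pixels
        or s.alpha_to_coverage
        or s.sample_mask != 0xFFFFFFFF
    )
    return has_side_effect and can_drop_pixel

print(early_z_disabled(PipelineState(depth_write=True)))                          # False
print(early_z_disabled(PipelineState(stencil_write=True, ps_kills_pixels=True)))  # True
```

The second case is the combination to watch for: writing stencil or depth while also discarding in the pixel shader.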

SV_* Semantics
- Suggests using SV_InstanceID and Load() to reduce primitive assembly overhead.
- SV_* semantics add fixed attribute overhead of 8 scalar attributes to the shader.
- If attribute bound, SV_* semantics only help if over 8 attributes.
- Vertex attribute vector packing still important for 8/9 hardware.

Geometry Shader
- Geometry shader works on primitives, so shared verts get re-transformed (re-VS).
- Geometry shader performance is directly proportional to number of output attributes.
- The 8/9/2xx cards have limited GS output buffer sizes.
- On 8800 GTX 1-20 scalars run at peak performance, more results in duplicate passes.
- So next step for GS performance is 50% of peak.
- The maxvertexcount also directly affects the speed with which the GS runs.
- Variable number of GS outputs always runs at worst case.

DX10 Related
- Don't real-time create immutable state objects.
- Validation done at creation/destruction time instead of draw call time.
- Remember constant blocks are atomically updated as a whole block.
- Constants can be shared between vertex and pixel shaders.
- Constants are cached, so group constants by access patterns.
- Don't use global constants.
- Tbuffers have better random access performance (texture cache),
- but larger latencies than constant buffers.
- If latency can be masked then performance of tbuffers can be better for random access.

DX10 Texture Updates
- Use ring buffer of intermediate D3D10_USAGE_STAGING textures.
- Map with D3D10_MAP_WRITE and D3D10_MAP_FLAG_DO_NOT_WAIT.
- Use CopyResource() or CopySubresourceRegion() with D3D10_USAGE_DEFAULT.

DX10 Buffer Updates
- Unlike OpenGL3 the entire buffer gets copied over to GPU after update.
- For large buffers, use shared ring buffer.
- Don't use UpdateSubresource() for large buffers.
- Use Map() with D3D10_MAP_WRITE_DISCARD when ring wraps around.
- Use Map() with D3D10_MAP_WRITE_NO_OVERWRITE when updating next buffer in ring.
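A minimal sketch of the ring bookkeeping, in Python so it runs anywhere. The `RingAllocator` class is hypothetical and only decides the write offset and which map flag to pass. The actual Map() call is left out. I'm assuming one large dynamic buffer treated as a ring of regions, and the flag strings are the D3D10 enum names `D3D10_MAP_WRITE_DISCARD` and `D3D10_MAP_WRITE_NO_OVERWRITE`.

```python
MAP_WRITE_DISCARD = "D3D10_MAP_WRITE_DISCARD"
MAP_WRITE_NO_OVERWRITE = "D3D10_MAP_WRITE_NO_OVERWRITE"

class RingAllocator:
    """Tracks the write cursor of one dynamic buffer used as a ring."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.offset = 0

    def allocate(self, size):
        """Return (byte offset, map flag) for the next `size` bytes."""
        if size > self.capacity:
            raise ValueError("request larger than the ring")
        if self.offset + size > self.capacity:
            # Wrapped around: discard so the driver can hand back fresh
            # memory instead of stalling on data the GPU may still read.
            self.offset = 0
            flag = MAP_WRITE_DISCARD
        else:
            # Region not yet handed out since the last discard: promise
            # not to touch anything the GPU might be using.
            flag = MAP_WRITE_NO_OVERWRITE
        start = self.offset
        self.offset += size
        return start, flag

ring = RingAllocator(100)
print(ring.allocate(40))  # (0, 'D3D10_MAP_WRITE_NO_OVERWRITE')
print(ring.allocate(40))  # (40, 'D3D10_MAP_WRITE_NO_OVERWRITE')
print(ring.allocate(40))  # (0, 'D3D10_MAP_WRITE_DISCARD')
```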

NV 8+ Series And My Particle Overdraw Reduction Trick

Wow, my stencil overdraw trick (increase stencil on draw, and cull over fixed limit) was working on fine Early-Z only (ie no coarse stencil cull). Doc clearly shows why manual discard on low alpha actually made performance worse (ie was disabling Early-Z) for the stencil overdraw trick.
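For anyone who hasn't seen the trick, here is a CPU-side toy simulation of it. The screen-aligned rectangles and the `draw_particles` helper are made up for illustration. On the GPU this would be a stencil test against a fixed reference value with an increment on pass.

```python
def draw_particles(particles, width, height, limit):
    """Count shaded fragments when overdraw per pixel is capped at `limit`.

    particles: list of (x0, y0, x1, y1) rectangles, half-open, in pixels.
    """
    stencil = [[0] * width for _ in range(height)]
    shaded = 0
    for x0, y0, x1, y1 in particles:
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                if stencil[y][x] < limit:  # stencil test: under the cap?
                    stencil[y][x] += 1     # increment stencil on draw
                    shaded += 1            # fragment gets shaded
                # else: culled before shading, as long as Early-Z stays on
    return shaded

full_screen = [(0, 0, 4, 4)] * 10  # 10 particles stacked on a 4x4 target
print(draw_particles(full_screen, 4, 4, limit=10))  # 160, no effective cap
print(draw_particles(full_screen, 4, 4, limit=3))   # 48, capped at 3 layers
```

The savings only show up if the cull happens before the pixel shader runs, which is exactly what a manual discard in the shader breaks.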

Other Stuff

Open source AMD HD GPU drivers blog...