090105 / NV GPU Programming Guide

More On Tessellation

Ignacio Castaño at NVIDIA has a great post on Watertight Texture Sampling. Posts like that go a long way towards helping us developers come to grips with what will be necessary to make use of future hardware. IMO this next jump for console devs, from DX9 to DX11 or beyond, is going to be huge and will require a large amount of re-engineering to really make use of the hardware. A majority of devs skipped DX10. DX10 ports of DX9 code without proper re-engineering for constant buffers and immutable state objects caused a decrease in performance, and often the best of DX10 was left unused because it wouldn't port to older console hardware. Now add tessellation, compute shaders, and proper threading to DX. If the time it took PS3 devs to make use of the SPUs is any indication, my prediction is that software will lag well behind the technology for this next generation of consoles from both Sony and Microsoft.

Draw calls of current titles clock in at over 4000/frame, with over 2M triangles/frame on the upper end. To put those numbers in perspective, 2M triangles is about the same as the number of pixels on the screen at 1080p resolution (1920 x 1080 = 2,073,600)! At the upper end, 2M tri / 4000 draws = 500 tri/draw, which seems like a low average number of triangles per batch. Draws are divided by state such as render target (pass, ie pre-z, shadow, motion vector, geometry, lighting, etc), or texture, or shader (material), or buffer, etc.

Threading in DX11 will remove some of the CPU limitations DX10 places on the max number of draw calls. Texture arrays from DX10, or some other method such as mega-texturing, in combination with decoupling shader and material (like deferred shading), can remove many of the state changes which result in excessive draw calls. DX10 flexibility (vertex texture fetch, instancing, etc) can remove other limitations which would otherwise have split meshes into multiple draws (like GPU skinning with the limited DX9 hardware constant buffer size).
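To make the batching argument concrete, here is a rough sketch in Python (not engine code) that redoes the triangles-per-draw arithmetic and counts how many draws a tiny scene needs when draws are split by state. The scene, the object names, and the `count_draws` helper are all made up for illustration; the only numbers taken from the post are the 2M triangles and 4000 draws.

```python
from collections import defaultdict

# Upper-end figures quoted in the post.
TRIANGLES_PER_FRAME = 2_000_000
DRAWS_PER_FRAME = 4000


def triangles_per_draw(triangles, draws):
    """Average batch size in triangles."""
    return triangles / draws


def count_draws(objects, key_fields):
    """Count draws assuming objects with an identical state key can be merged."""
    batches = defaultdict(list)
    for obj in objects:
        key = tuple(obj[field] for field in key_fields)
        batches[key].append(obj["name"])
    return len(batches)


# Hypothetical scene: every name and state value below is invented.
SCENE = [
    {"name": "crate_a", "pass": "geometry", "shader": "lit", "texture": "wood"},
    {"name": "crate_b", "pass": "geometry", "shader": "lit", "texture": "metal"},
    {"name": "wall", "pass": "geometry", "shader": "lit", "texture": "brick"},
    {"name": "lamp", "pass": "geometry", "shader": "emissive", "texture": "metal"},
]

if __name__ == "__main__":
    print(triangles_per_draw(TRIANGLES_PER_FRAME, DRAWS_PER_FRAME))  # 500.0
    print(1920 * 1080)  # 2073600 pixels at 1080p
    # Split by pass, shader, and texture: one draw per distinct combination.
    print(count_draws(SCENE, ("pass", "shader", "texture")))  # 4
    # Texture no longer a state change (e.g. everything in one texture array).
    print(count_draws(SCENE, ("pass", "shader")))  # 2
    # Shader also decoupled from material (e.g. deferred shading).
    print(count_draws(SCENE, ("pass",)))  # 1
```

The sketch ignores everything that makes real merging hard (buffer layout, sort order, transparency). It only shows that each piece of state dropped from the key collapses draws.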
In theory, for all unique objects one should be able to get to one draw per object per output framebuffer pass. With instancing and merging objects into single buffers, one should be able to get to many times less than one draw per object per output. This type of system could help solve the problem of view distances and LOD in terms of number of draw calls. DX11 tessellation requires splitting by topology (which in the worst case might mean a draw call per topology), but this can be handled in other ways, using the flexibility of DX11, to avoid the draw call explosion. So perhaps draw calls won't be a problem for DX11 with some engine design.

Patches as The New Primitive!

There are all sorts of pre-raster steps involved in triangle setup: reading the index buffer, fetching vertex attributes, checking the post transform cache, and running the vertex shader on post transform cache misses. DX11 with tessellation is set to reduce the relative burden on this hardware, since the same number of verts will provide a much higher quality output. I'm guessing this pre-raster hardware for the next generation simply scales with clock rate (as a serial process), but largely doesn't change and probably isn't parallelized. If this is the case, simply drawing triangles as we do now might easily become pre-raster setup limited (assuming the rest of the pipeline scales and resolution is capped at current HD).

Enter DX11 tessellation. Tessellation functions as a way to decompress vertex data at run-time (helping solve the memory bandwidth and memory utilization problem for high detail), parallelize the rest of the pre-raster triangle setup, and help solve the level of detail problem. Patches can be computed independently from each other in parallel. The hull shader computes how much to tessellate. Tessellation can be view dependent, and likely can even be made occlusion dependent as well (the hull shader could sample depth from the previous frame or other prior calculations).
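As a rough illustration of view-dependent tessellation, the sketch below does on the CPU, in Python, the kind of math a hull shader could do per patch edge: estimate how many pixels the edge covers and ask for roughly one segment per few pixels. This is my own guess at a reasonable heuristic, not something from the DX11 docs. The function name, `pixels_at_unit_distance`, and `target_pixels` are assumptions; the one hard number is the D3D11 maximum tessellation factor of 64.

```python
import math

MAX_TESS_FACTOR = 64.0  # upper limit on tessellation factors in D3D11


def edge_tess_factor(p0, p1, eye, pixels_at_unit_distance=1000.0, target_pixels=8.0):
    """Pick a tessellation factor for one patch edge from its on-screen size.

    pixels_at_unit_distance is an assumed stand-in for FOV and resolution:
    the number of pixels a unit-length edge covers at distance 1 from the eye.
    target_pixels is the assumed desired length of each tessellated segment.
    """
    midpoint = tuple((a + b) * 0.5 for a, b in zip(p0, p1))
    distance = max(math.dist(midpoint, eye), 1e-6)  # avoid divide by zero
    edge_length = math.dist(p0, p1)
    projected_pixels = edge_length * pixels_at_unit_distance / distance
    # Clamp to the valid range: at least 1 (no subdivision), at most 64.
    return min(max(projected_pixels / target_pixels, 1.0), MAX_TESS_FACTOR)


if __name__ == "__main__":
    eye = (0.0, 0.0, 0.0)
    for z in (1.0, 10.0, 1000.0):
        factor = edge_tess_factor((-0.5, 0.0, z), (0.5, 0.0, z), eye)
        print(z, factor)
```

A unit-length edge right in front of the eye hits the 64 clamp, the same edge at distance 10 gets a mid-range factor, and far away it falls back to 1. An occlusion-dependent version would fold a depth test against last frame's depth into the same function.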
Patches can be killed in the hull shader by writing 0 for the tessellation factors. I'm not sure how vector SIMD friendly all the stages are. A patch is limited to 32 control points on input (so on current NVIDIA hardware, one warp). If a lower number of control points is used, can the hardware still maintain full SIMD efficiency? What about in the domain shader with a variable number of outputs? These are open questions which I don't expect to get answers to until I can go out and buy the first DX11 graphics card and see for myself!

What About Post Tessellation?

Patches get tessellated into verts. Verts get rastered as triangles, lines, or points (which seems to be possible in DX11). Raster output is as before: 2x2 pixel quads which get sent to the rest of the pipeline. I'm not sure whether raster setup gets parallelized or not. Perhaps raster setup just gets double clocked? If not, could tessellation become triangle bound? Small triangles are likely not performance friendly anyway, due to the 2x2 pixel quads.

NVIDIA GPU Programming Guide Update

NVIDIA's GPU Programming Guide has been updated for 8 and 9 series GPUs. Here are a few highlights.

General
Z-Cull and Early-Z
Coarse Z/Stencil Cull (Z-Cull) Can Not Cull When
Z-Cull Performance Reduced When
Fine Grain (Early-Z) Cull is Disabled When
SV_* Semantics
Geometry Shader
DX10 Related
DX10 Texture Updates
DX10 Buffer Updates

NV 8+ Series And My Particle Overdraw Reduction Trick

Wow, my stencil overdraw trick (increase stencil on draw, and cull over a fixed limit) was working on fine Early-Z only (ie no coarse stencil cull). The doc clearly shows why a manual discard on low alpha actually made performance worse for the stencil overdraw trick (ie it was disabling Early-Z).

©2009-2007 Timothy Farrar