080826 / Olick's Paper


Current and Next Generation Parallelism in Games

An awesome presentation. If you haven't read it yet, do so, and open up all the notes in the upper left corner. There are a bunch of interesting bits, many of which I am surprised went public.

Current Generation

1. Using 16KB groupings of joint data for SPU local store, 1/16 of 256KB local store.
2. Unit vector compression: 10 bits for each of the 2 smallest components, largest component recomputed, plus extra sign/W bits.
3. Noted the RSX four vert mini cache used in primitive assembly.
4. Index decompression 6.5:1, very easy.
5. 30% increase in RSX performance doing SPU side vertex skinning.
6. 70% increase in RSX performance for shadow map generation.
7. Per vertex progressive meshes.
8. Average 60-70% of triangles do not contribute to raster output.
9. Get 10-20% improvement of balanced scene with SPU triangle culling.
10. Tight sync SPU ring buffer control via command buffer semaphores.
11. About 48M triangles per sec on 1 SPU for full pipeline.
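
Item 2 above can be sketched roughly as follows. This is only my reading of the note, not the layout from the presentation: the 23-bit packing (2 bits for the index of the largest component, 1 sign bit, 10 bits for each of the other two) and the names pack_unit_vector / unpack_unit_vector are assumptions made for illustration.

```python
import math

# Assumed layout (not from the talk): [index:2][sign:1][a:10][b:10]
_RANGE = math.sqrt(0.5)   # the two smaller components lie in [-1/sqrt(2), 1/sqrt(2)]
_MAXQ = (1 << 10) - 1     # highest 10-bit code


def _quantize(c):
    # Map [-_RANGE, _RANGE] onto the integer codes 0.._MAXQ, clamping rounding spill.
    q = round((c / _RANGE * 0.5 + 0.5) * _MAXQ)
    return min(max(q, 0), _MAXQ)


def _dequantize(q):
    return (q / _MAXQ - 0.5) * 2.0 * _RANGE


def pack_unit_vector(v):
    # Drop the largest-magnitude component; keep its index and sign.
    largest = max(range(3), key=lambda i: abs(v[i]))
    sign = 1 if v[largest] < 0.0 else 0
    a, b = (v[i] for i in range(3) if i != largest)
    return (largest << 21) | (sign << 20) | (_quantize(a) << 10) | _quantize(b)


def unpack_unit_vector(word):
    largest = (word >> 21) & 0x3
    sign = (word >> 20) & 0x1
    a = _dequantize((word >> 10) & _MAXQ)
    b = _dequantize(word & _MAXQ)
    # Recompute the dropped component from the unit-length constraint.
    c = math.sqrt(max(0.0, 1.0 - a * a - b * b))
    if sign:
        c = -c
    out = [a, b]
    out.insert(largest, c)
    return tuple(out)
```

Dropping the largest component rather than a fixed one keeps the square root well conditioned, since the recomputed value is never smaller than 1/sqrt(3).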

If you think about this in terms of more modern GPU technology, the SPUs are functioning as DX10 vertex shaders outputting into a geometry shader (for stream compaction after culling) which writes verts out via stream out for single or multi-pass rendering. Most likely on current GPUs, using the hardware triangle setup for culling might be faster (ie skip the stream compaction, or at least use histo-pyramids instead), and obviously for single pass rendering, stream out would then be unnecessary.
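
A minimal sketch of the cull-then-compact step described above, in plain Python standing in for SPU or geometry shader code. The counter-clockwise-is-front convention, the screen-space inputs, and the name cull_and_compact are assumptions of mine; the real pipeline also does other tests this leaves out.

```python
def signed_area2(a, b, c):
    """Twice the signed area of a 2D triangle; positive when a, b, c wind
    counter-clockwise (with y pointing up)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])


def cull_and_compact(positions, indices, width, height):
    """Return a new, densely packed index list holding only the triangles
    that can contribute to raster output.

    positions: list of (x, y) screen-space vertices
    indices:   flat list, three entries per triangle
    """
    out = []
    for i in range(0, len(indices), 3):
        tri = indices[i:i + 3]
        a, b, c = (positions[j] for j in tri)
        # Back-facing or zero-area triangles never produce fragments.
        if signed_area2(a, b, c) <= 0:
            continue
        xs = (a[0], b[0], c[0])
        ys = (a[1], b[1], c[1])
        # Bounding box entirely outside the viewport.
        if max(xs) < 0 or min(xs) >= width or max(ys) < 0 or min(ys) >= height:
            continue
        out.extend(tri)  # compaction: survivors are written back to back
    return out
```

The compaction is what makes the output usable as an ordinary index buffer; skipping it and letting hardware triangle setup reject the same triangles is the alternative mentioned above.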

Next Generation

As John Carmack noted as the light bulb moment in his Quakecon speech, what GPUs provide for general purpose computation beyond SPUs is a fast vector gather. This functionality is tremendously important for many algorithms. Best case on SPUs for general gather is to work with a local store cache of 128 byte sized objects (for DMA performance), or 16 byte sized objects (if necessary, smallest for vector transpose), and then eat the overhead of the vector transpose to move from AOS to SOA for efficient computation. Personally I think SPU functionality can be replaced by the DX11 compute shader or OpenCL. I'd rather see one tremendous GPU attached to a small multi-core CPU. I wonder what Sony will do for PS4?
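
To make the gather plus AOS-to-SOA step concrete, here is a toy sketch. Python lists stand in for local store and vector registers, and gather_soa is a hypothetical name; on real hardware the transpose is a sequence of shuffle instructions, which is the overhead mentioned above.

```python
def gather_soa(records, indices):
    """Gather records by index (array of structures), then transpose them
    into one list per field (structure of arrays)."""
    gathered = [records[i] for i in indices]   # the gather
    # The transpose: one column per field, in field order.
    return tuple(list(column) for column in zip(*gathered))


# Usage: fetch three scattered xyz records and get x, y, z lanes back.
records = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 9.0), (10.0, 11.0, 12.0)]
xs, ys, zs = gather_soa(records, [3, 0, 2])
```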

Next Generation Game Code

1. Parallelizing game code, cross entity communication only through one frame delay.
2. Entity gathers from data in previous frame, writes to self.

Note that Jon's description of parallelizing game entities exactly parallels how current GPUs function (see what I wrote in parentheses). The majority of code only gathers data from the previous frame (texture fetch of results from previous frame), and writes to self (writing to multiple render targets). One sync point, massive easy parallelism. If needing dependent entities in a given frame, sort into groups of independent tasks which can run in parallel, with sync points separating groupings (groups are sequences of independent draw calls, ie no draw call does a texture fetch from a frame buffer written by a previous draw call). If you read between my lines, you might see that I am indeed saying that you can do all game entity code on the GPU. And I have a work-in-progress blog post which will detail exactly how this can be done (GPU side scripting).
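
A small sketch of both ideas, with a serial Python loop standing in for the parallel dispatch. The names step, chase_average and dependency_levels are made up for illustration, and the dependency grouping assumes the same-frame dependencies form no cycles.

```python
def step(prev, update, order=None):
    """Advance one frame. Every entity reads only `prev` (last frame) and
    writes only its own slot in the new state, so any update order works."""
    order = range(len(prev)) if order is None else order
    nxt = list(prev)
    for i in order:
        nxt[i] = update(i, prev)
    return nxt


def chase_average(i, prev):
    # Example entity logic: move halfway toward the mean of the two ring
    # neighbours as they were in the previous frame.
    n = len(prev)
    target = 0.5 * (prev[(i - 1) % n] + prev[(i + 1) % n])
    return prev[i] + 0.5 * (target - prev[i])


def dependency_levels(deps):
    """Group entities that need same-frame results into batches.
    deps[i] is the set of entities that i must wait for (assumed acyclic).
    Each returned group can run in parallel; a sync point separates groups."""
    level = {}

    def depth(i):
        if i not in level:
            level[i] = 1 + max((depth(j) for j in deps[i]), default=-1)
        return level[i]

    groups = {}
    for i in deps:
        groups.setdefault(depth(i), []).append(i)
    return [groups[k] for k in sorted(groups)]
```

Running step forward or in reverse entity order gives the same result, which is the property that lets the work spread across as many cores (or GPU threads) as are available.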

Sparse Voxel Octree

Next section of the talk is spilling the beans on the sparse voxel octree. The paper didn't go into much detail on a lot of the acceleration methods tried, but I was impressed with the voxel translation trick.

1. Rendering bounding hull with polygons to accelerate open space skipping. Ideal case is for the bounding hull to be generated from the coarse voxel geometry. Would require depth write in the fragment shader after searching to construct a proper Z buffer for hybrid rendering. Only found a 2x speed improvement for skipping most of the traversal process. Surprised that at this stage he didn't switch to a faster algorithm and better texture/data compression for the fine search. That should be all they need to reach a good FPS on current high end GPUs.

2. Using depth feedback from previous frame to accelerate open space skipping. This is only noted in the blurb in the corner. Not sure if he tried combining this with adaptive refinement. The idea being to raster by mipmap expansion. Start with the full search at a really small framebuffer (mipmap) size, search to minimum depth for region. Then repeatedly refine search to higher resolutions (in mipmap), but use reprojection from the previous frame to accelerate refinement. The up-resolution catches previously occluded (or newly visible) geometry. Wouldn't need to do full search each frame for each pixel, could hide artifacts of frame converging to proper result in motion or DOF blur.
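
One piece of the mipmap refinement idea in item 2 can be sketched on its own: a min-depth pyramid, where each coarse texel holds the nearest depth in its region and so gives every finer pixel underneath a conservative distance to start searching from. This is my own simplification, not something from the talk; reprojection from the previous frame is left out, and the square power-of-two buffer is an assumption.

```python
def min_depth_pyramid(depth):
    """depth: square 2D list with a power-of-two side length.
    Returns [full res, half res, ..., 1x1], each texel the min of its 2x2 parents."""
    levels = [depth]
    while len(levels[-1]) > 1:
        src = levels[-1]
        half = len(src) // 2
        levels.append([
            [min(src[2 * y][2 * x], src[2 * y][2 * x + 1],
                 src[2 * y + 1][2 * x], src[2 * y + 1][2 * x + 1])
             for x in range(half)]
            for y in range(half)
        ])
    return levels


def conservative_start(levels, level, x, y):
    """Depth at which the search for full-res pixel (x, y) may safely begin,
    read from the given coarser level. Never deeper than the true depth."""
    return levels[level][y >> level][x >> level]
```

Refinement would then walk from the 1x1 level toward full resolution, starting each finer search at the depth found one level up instead of at the camera.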

The section on infinite surface detail is interesting, but IMO not that useful. Largest problem is that the fractal would always be axis aligned (think about detail on a sphere for example). Sure it is great for boxes (see my 080709 post for an example of infinite detail with an axis aligned fractal ... in a fisheye projection however). The technology I'm using for infinite detail with Atom allows for each node of the tree to have a coordinate space rotation and scale, which is a necessary construct to compress infinite detail on curved surfaces well.

I am personally not that interested in SVO as it is described in the paper because of the complexities of dynamic objects. A direct parametric raycast (ie curved ray search) seems way too expensive, and the best acceleration methods depend on triangle rendering.

The Future of Rasterization

Interesting. The point being that GPU parallel drawing performance will eventually be limited by triangle setup, since GPU triangle setup is currently serial or requires sorting triangles into cores, each core taking one region of the screen (ie the Larrabee binning model).
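
For reference, sorting triangles into screen regions can be sketched as a bounding-box binning pass. This is a generic illustration, not Larrabee's actual implementation; bin_triangles and its parameters are hypothetical, and using the bounding box means a triangle may land in tiles it does not really touch.

```python
import math


def bin_triangles(triangles, width, height, tile):
    """Map each (tx, ty) screen tile to the list of triangle ids whose
    bounding box overlaps it. triangles: list of three (x, y) points each."""
    bins = {}
    last_tx = (width - 1) // tile
    last_ty = (height - 1) // tile
    for t, tri in enumerate(triangles):
        xs = [p[0] for p in tri]
        ys = [p[1] for p in tri]
        # Tile range covered by the bounding box, clamped to the screen.
        x0 = max(0, math.floor(min(xs)) // tile)
        x1 = min(last_tx, math.floor(max(xs)) // tile)
        y0 = max(0, math.floor(min(ys)) // tile)
        y1 = min(last_ty, math.floor(max(ys)) // tile)
        for ty in range(y0, y1 + 1):       # empty range if fully off screen
            for tx in range(x0, x1 + 1):
                bins.setdefault((tx, ty), []).append(t)
    return bins
```

Each bin can then be handed to one core, which rasterizes its tile without touching any other core's pixels.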

However, if we take the eventual model of fragment sized triangles and even more massively parallel machines, rasterization becomes reduced to a sort on 2D projection and Z (note this sort has great temporal locality, which hints that there is a very good parallel solution to this problem, both in terms of low cross-core communication if objects are stored on cores based on locality, and in terms of parallel computational efficiency). Alternatively, this future rasterization could be looked at as an atomic scatter operation (depth controlling an atomic swap), which, if the input stream of fragments had good 3D locality (which it should), would provide great cache performance.
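
Both views can be written down in a few lines to show they resolve to the same image. This is a serial Python sketch: the compare-and-replace on a (depth, color) pair stands in for a hardware atomic, and the function names and the fragment tuple layout (x, y, z, color) are assumptions for illustration. Depths are assumed finite and colors mutually comparable.

```python
def raster_by_scatter(fragments, width, height):
    """Scatter view: every fragment tries to swap itself into its pixel,
    and the swap succeeds only if its (z, color) pair is smaller."""
    far = (float("inf"), None)
    frame = [[far] * width for _ in range(height)]
    for x, y, z, color in fragments:
        candidate = (z, color)
        if candidate < frame[y][x]:   # stands in for an atomic min on the pair
            frame[y][x] = candidate
    return [[color for _, color in row] for row in frame]


def raster_by_sort(fragments, width, height):
    """Sort view: order fragments by pixel, then Z; the first fragment seen
    for each pixel is the visible one."""
    frame = [[None] * width for _ in range(height)]
    ordered = sorted(fragments, key=lambda f: (f[1], f[0], f[2], f[3]))
    for x, y, z, color in ordered:
        if frame[y][x] is None:
            frame[y][x] = color
    return frame
```

Because the scatter keeps the minimum of a totally ordered key, the result does not depend on the order fragments arrive in, which is what allows it to run as an unordered parallel operation.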

And raycasting (as described in Jon's talk) is simply scatter through a gather search. So they all end up similar. The real question is which scales better with the bandwidth required to communicate between nodes.