090320 / GDC 2009

Lots of stuff here, mostly my notes from GDC presentations... (wasn't at GDC this year!)

Larrabee Instructions

Prototype Instructions

1. 16-wide 32-bit or 8-wide 64-bit SIMD.
2. 32 512-bit vector registers.
3. 8 16-bit predicate registers.
4. Predicate register source operand per opcode.
5. Register = Register OP Swizzle/Broadcast/Convert(Register|Memory) opcode form.
6. Bit interleave 1:1 and 2:1 instructions!
7. Pack and store vector to unaligned memory.
8. Convert to float 11:11:10.
9. Convert to unorm 10:10:10:2.
10. Convert to sRGB 8-bit.
11. Vector gather/scatter from 16 addresses of [base + index*scale].
12. Vector prefetch to L1 for later gather/scatter.
13. Rotate and bitfield insertion or mask.
14. Unaligned loads.
15. Four element set vector multiply.
16. Scalar bitscan forward and reverse.
17. Evict/prefetch L1 or L2 cacheline.
18. Scalar bit count.
19. Four element set scalar quad mask.
20. And more...
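To make items 4 and 11 concrete, here is a minimal scalar sketch of a predicated 16-lane gather. The function name, the element-addressed `memory` list, and the choice that masked-off lanes keep their old destination value are all assumptions for illustration, not the actual instruction semantics.

```python
LANES = 16  # 16-wide 32-bit SIMD, as in item 1


def masked_gather(memory, base, indices, scale, mask, dest):
    """Emulate a gather of LANES elements from memory[base + index*scale].

    mask is a 16-bit predicate: lane i is loaded only when bit i is set.
    Lanes with a clear bit keep their previous value from dest
    (an assumption for this sketch).
    """
    out = list(dest)
    for lane in range(LANES):
        if (mask >> lane) & 1:
            out[lane] = memory[base + indices[lane] * scale]
    return out


if __name__ == "__main__":
    memory = list(range(100, 200))
    print(masked_gather(memory, 4, list(range(LANES)), 2, 0x0005, [0] * LANES))
```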

Also screenshots of some GDC slides.

1. Vector store (in scalar pipe) co-issue with vector ALU op.
2. 4-9 cycle latency for most vector ops.
3. Gather/scatter fetches/stores to/from only one L1 cache line per cycle.
4. Cache miss causes only offending hyperthread to sleep.
5. Branch misprediction only flushes instructions from the offending thread (1-2 cycles avg).
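Item 3 suggests a rough cost model for gather/scatter: count the distinct L1 lines the addresses land in. The sketch below does that. The 64-byte line size and the helper name are assumptions, and the result is only a lower bound on cycles, ignoring misses and elements that straddle a line.

```python
CACHE_LINE_BYTES = 64  # assumed L1 line size for this sketch


def gather_cache_lines(base, indices, scale, line_bytes=CACHE_LINE_BYTES):
    """Count distinct cache lines touched by addresses base + index*scale."""
    lines = {(base + index * scale) // line_bytes for index in indices}
    return len(lines)


if __name__ == "__main__":
    lanes = list(range(16))
    # Sixteen consecutive 4-byte elements share one line.
    print(gather_cache_lines(0, lanes, 4))
    # A 64-byte stride puts every lane in its own line.
    print(gather_cache_lines(0, lanes, 64))
```

So a gather over tightly packed data costs about the same as a vector load, while a fully scattered one costs on the order of one cycle per lane.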

GDC

Some highlights (click on the GDC 2009 bar for the presentations),

The PS3's SPUs in the Real World: A Killzone 2 Case Study - 2500 static light probes per level stored as 9x3 SH in a KD-tree. IBL on objects from 4 nearest probes with inverse distance weight rotated into view space and written out as 8x8 pixel sphere environment map. Around 3000 particles updated per frame, 250 particle systems, 150 systems drawn, with 200 Havok collision ray casts. RSX generates 1/4 resolution image in XDR. SPUs do bloom, DoF, and motion blur. Motion blur uses 1/16 resolution 2D motion vectors (8bit xy). Motion vectors are blurred for dilation, image is blurred along motion vectors with 8-tap filter. Depth of field uses 36 jittered disc point samples. Order of 50 animation jobs and 500 animations sampled. Waypoint cover maps with a depth cubemap per waypoint. Looks like depth cubemaps have 1/2 resolution on the top and bottom faces.
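The probe lookup is easy to sketch: take the 4 nearest probes and blend their 9x3 SH coefficients with normalized inverse distance weights. This is only my guess at the blend. It uses a brute-force nearest search instead of a KD-tree, skips the rotation into view space and the sphere map output, and the names and the `eps` guard are hypothetical.

```python
import math


def blend_probes(position, probes, count=4, eps=1e-6):
    """Blend SH coefficients of the nearest probes by inverse distance.

    probes is a list of (probe_position, coeffs) pairs, where coeffs is a
    flat list of 27 floats (9 SH coefficients x RGB).
    """
    nearest = sorted(probes, key=lambda p: math.dist(position, p[0]))[:count]
    # eps avoids a divide by zero when the object sits exactly on a probe.
    weights = [1.0 / (math.dist(position, p[0]) + eps) for p in nearest]
    total = sum(weights)
    size = len(nearest[0][1])
    return [
        sum(w * p[1][k] for w, p in zip(weights, nearest)) / total
        for k in range(size)
    ]


if __name__ == "__main__":
    probes = [((-1.0, 0.0, 0.0), [0.0] * 27), ((1.0, 0.0, 0.0), [2.0] * 27)]
    print(blend_probes((0.0, 0.0, 0.0), probes)[:3])
```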

Guerrilla Tactics: Killzone's Art Tools and Techniques - KZ2 built with 1500 level building blocks (BBs). Building blocks placed, scaled, and rotated in Maya to build levels. 90% of level geometry is instanced from repository. 10% of level geometry from hand-modelled BSP geometry. About 1000 BBs per lightmap. Both low and full-res drawing of particles. Low res particles are alpha blend and additive only. Low and full res particles would not sort correctly. Future is "dirtmaps" for BBs, to easily reskin a BB in different environments.

Some highlights from the GDC Tutorials Handouts,

The A-Z of DX10 Performance - Awesome presentation, a must read. Covers things like: it is good to use no more than 5 bound constant buffers in a shader stage, and ATI can point sample FP16 RGBA textures faster than it can filter them.

GSC Game World's S.T.A.L.K.E.R : Clear Sky - Seems like a mix of deferred rendering via drawing a g-buffer and accumulating lighting (as in light pre-pass rendering). Using split hi/lo 8bit output of HDR so that bloom pass can just use the hi RT. Interesting method for dynamic rain using a shadowmap to test for surfaces which get rain, followed by g-buffer adjustment of albedo, spec, and normal for rain surfaces.

Shader Model 5 and Compute Shader - Provides SV_Coverage bitmask in the pixel shader. Gather() now able to sample from different channels! Bit ops with firstbithigh(), firstbitlow(), and reversebits(). Raw resource views! Atomic operations. Buffer append/consume, provides global pointer probably accessed via atomic operations internally. DX11 CS5 provides indirect dispatch of work group dimensions for conditional execution, and provides 3D dispatch. DX11 CS4.x (for DX10.x hardware) only supports 2D dispatch, writing to own offset in TLS, no atomic operations, and writes only to the Buffer object type. Which seems odd given CUDA for DX10 GPUs supports more than this. Is DX11 only going to support the common minimum between DX10 ATI and NV GPUs? If so this is sad (very thankful for GL3.x + CL).

Insomniac Physics - Shows the transition from R1 to R2 to current work in using parallel SPU programming to solve game physics.

Insomniac Prelighting - Using RGB 8bit for view space normals. Drawing depth and normal with 2xMSAA, but doing the rest of the lighting without MSAA. So is this full 2x resolution then, or just final output resolution from the "specular is 2x super sampled" comment? Sun shadowing for dynamic objects is classic object based shadows with screen space accumulation. So gather objects, group close objects, then for each group, render shadowmap for group, stencil group bounding volume, then draw and sample from shadowmap. Seems like they are doing some kind of ambient adjustment via darkening for backface triangles, is this because dynamic objects don't self sun shadow? Light space shadow maps are pre-rendered before drawing lights in the light pass. They attempt to resolve the multiple shadowing problem (i.e. combining baked lightmaps with dynamic shadows) by checking how dark the lightmap is (against a global cutoff) and trimming out dynamic shadows under the cutoff. Perhaps they have a blend region to feather the cutoff (will have to go back and play R2 to see!). In R2 they resorted to extra forward rendering passes for materials which didn't fit in the light pre-pass framework. Presentation also covers important things like doing rounding when recovering depth from aliased RGBA8 texture.
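On that last point, the idea as I understand it: when a depth buffer is read back through an 8-bit-per-channel texture, each channel arrives as a float in [0, 1], so it should be scaled by 255 and rounded to the nearest integer before the bytes are recombined. A minimal sketch follows. The 24-bit depth, the high/mid/low channel order, and the function names are assumptions, not taken from the presentation.

```python
def pack_depth24(depth):
    """Split a 24-bit integer depth into three 8-bit channels (high, mid, low)."""
    return ((depth >> 16) & 0xFF, (depth >> 8) & 0xFF, depth & 0xFF)


def sample_as_unorm(channels):
    """What a shader sees: each 8-bit channel as a float in [0, 1]."""
    return tuple(c / 255.0 for c in channels)


def recover_depth24(sampled):
    """Round each channel to the nearest integer, then recombine the bytes."""
    hi, mid, lo = (int(c * 255.0 + 0.5) for c in sampled)
    return (hi << 16) | (mid << 8) | lo


if __name__ == "__main__":
    depth = 0x123456
    print(hex(recover_depth24(sample_as_unorm(pack_depth24(depth)))))
```

Without the rounding step, a channel that comes back a hair under its integer value would truncate to the next lower byte, which is a large error in the high channel.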

Insomniac SPU Wrangling - Rather important presentation. For example, covers lock-line waiting for SPU job control, avoiding the bus transfer of a spin-lock check by waiting for the PPU or GPU to write to a reserved address (block on rdch instead).

CryEngine 3

Looks like cascaded shadow maps only get one of the maps redrawn per frame. Should be quite a draw call saver, at the expense of sun shadow update frame rate being slower than actual frame rate (which can be seen in the video). Doesn't look like they are doing aggressive lower resolution particle blending, but darkening particles seem drawn after lightening particles, which creates some strange artifacts in the video.

Project Offset

OpenGL 3.1 and GLSL 1.40

Specs are here.

GL Updates - GL 3.1 cuts out deprecated features. Support for signed normalized formats. Full 16 texture unit support in the vertex shader. The following extensions have been added to the core, GL_ARB_draw_instanced, GL_EXT_copy_buffer, GL_NV_primitive_restart, GL_ARB_texture_buffer_object, GL_ARB_texture_rectangle, GL_ARB_uniform_buffer_object.

GLSL Updates - Added rectangular texture support in the form of *sampler2DRect and *sampler2DRectShadow. Added *samplerBuffer support for buffer textures. The above samplers with an i or u prefix support signed and unsigned integer textures. Uniform blocks support has been added, layout-qualifier uniform block-name { member-list } ;. The member-list is a bunch of uniforms. The layout-qualifier can be used alone to set defaults, for example I'll be using, layout(row_major) uniform;, instead of the default of column major matrix storage. See spec for complete details!

Volume Rendering Optimizations

Kyle Hayward has an interesting blog post on volume rendering optimizations. Highlights include keeping structure data in a swizzled ordering in memory (instead of linear) for better cache performance, and using an image space pass to compute normals from depth (instead of storing them).
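One common swizzled ordering is a 3D Morton (Z-order) index, where the bits of x, y, and z are interleaved so neighbouring voxels tend to sit close together in memory. The sketch below shows that ordering only as an example. I am not claiming it is the exact layout used in the post, and the function name and 10-bits-per-axis limit are assumptions.

```python
def morton3(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one Morton index.

    Bit i of x lands at bit 3*i, bit i of y at 3*i + 1, bit i of z at 3*i + 2.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


if __name__ == "__main__":
    # Every voxel of the first 2x2x2 block maps into indices 0..7.
    block = sorted(morton3(x, y, z) for z in range(2) for y in range(2) for x in range(2))
    print(block)
```

With a linear layout, stepping one voxel in z jumps a whole slice in memory. With this ordering the 8 voxels of each 2x2x2 block are contiguous, and the same holds recursively for 4x4x4 blocks and up.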

Random Interesting Papers

Implicit Visibility and Anti-Radiance

Octahedron Environment Maps

Data-Parallel Hierarchical Link Creation for Radiosity - They find CUDA atomics on GT2xx to be only 15% slower than non-atomic operations.

Other iPhone Tricks

Switched to non-perspective correct texturing for this 3D Atom game by modifying my perspective projection matrix. Sounds crazy but provides some very important advantages.

1. Framebuffer feedback works better. Effects are now consistent across field of view changes. I'm not actually texturing triangles with a texture, but rather texturing with an image space copy of part of the screen. Color interpolation is no longer correct, but that isn't a problem either in my case.

2. It should enable me (with other changes) to reduce my per vertex size to 14 bytes (short xyz[3], byte rgba[4], short uv[2]). On the iPhone, vertex input is always copied (even with static VBOs) and is often a bottleneck. Instead of a 4 byte float per (x,y,z), I can now do a short and still have huge view distances. The shorts that I am sending are (x,y) in projected screen space, and projected z (I'm doing the perspective divide in software, and all my verts already had software transforms into view space).
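As a quick check on the 14 byte figure, here is the layout written with Python's `struct` module. The little-endian packed format string and the helper names are assumptions for illustration. The real vertex data would of course be built in native code.

```python
import struct

# short xyz[3], byte rgba[4], short uv[2], little-endian with no padding.
VERTEX = struct.Struct("<3h4B2h")


def pack_vertex(xyz, rgba, uv):
    """Pack one vertex whose position is already projected to 16-bit integers."""
    return VERTEX.pack(*xyz, *rgba, *uv)


def unpack_vertex(data):
    """Inverse of pack_vertex, returning (xyz, rgba, uv) tuples."""
    values = VERTEX.unpack(data)
    return values[0:3], values[3:7], values[7:9]


if __name__ == "__main__":
    blob = pack_vertex((100, -200, 3000), (255, 128, 0, 255), (0, 4096))
    print(len(blob), unpack_vertex(blob))
```

Swapping only the position to three 4 byte floats (format `"<3f4B2h"`) gives 20 bytes per vertex, so the shorts cut 30% off the copy.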