080826 / Olick's Paper

Current and Next Generation Parallelism in Games

An awesome presentation. If you haven't read it yet, do so, and open up all the notes in the upper left corner. There are a bunch of interesting bits, many of which I am surprised went public.

Current Generation

1. Using 16KB groupings of joint data for SPU local store, 1/16 of 256KB local store.
2. Unit vector compression, 10 bits for each of the 2 smallest components, largest computed, and extra sign/W bits.
3. Noted the RSX four vert mini cache used in primitive assembly.
4. Index decompression 6.5:1, very easy.
5. 30% increase in RSX performance doing SPU side vertex skinning.
6. 70% increase in RSX performance for shadow map generation.
7. Per vertex progressive meshes.
8. Average 60-70% of triangles do not contribute to raster output.
9. Get 10-20% improvement of balanced scene with SPU triangle culling.
10. Tight sync SPU ring buffer control via command buffer semaphores.
11. About 48M triangles per sec on 1 SPU for full pipeline.
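
Item 2 could look something like the sketch below. The bit layout is an assumption (the talk's exact format is not given here): a 3-component unit vector, the two smallest-magnitude components quantized to signed 10 bits, a 2-bit index of the dropped component, and 1 bit for its sign. The names `pack_unit` and `unpack_unit` are hypothetical.

```python
import math

SCALE = 511                       # signed 10-bit range is [-511, 511]
INV_SQRT2 = 1.0 / math.sqrt(2.0)  # the two smaller components never exceed this

def pack_unit(v):
    """Pack a unit 3-vector into 23 bits (hypothetical layout)."""
    largest = max(range(3), key=lambda i: abs(v[i]))
    sign = 1 if v[largest] < 0.0 else 0
    small = [v[i] for i in range(3) if i != largest]
    # Rescale by sqrt(2) so the smaller components use the full 10-bit range.
    q = [int(round(c / INV_SQRT2 * SCALE)) & 0x3FF for c in small]
    return (largest << 21) | (sign << 20) | (q[0] << 10) | q[1]

def unpack_unit(p):
    """Inverse of pack_unit; the largest component is recomputed."""
    largest = (p >> 21) & 0x3
    sign = (p >> 20) & 0x1
    small = []
    for shift in (10, 0):
        q = (p >> shift) & 0x3FF
        if q >= 512:              # sign-extend the 10-bit value
            q -= 1024
        small.append(q / SCALE * INV_SQRT2)
    big = math.sqrt(max(0.0, 1.0 - small[0] ** 2 - small[1] ** 2))
    if sign:
        big = -big
    out = small[:]
    out.insert(largest, big)      # put the recomputed component back in place
    return tuple(out)
```

Dropping the largest component (rather than a fixed one) keeps the square root well conditioned, which is presumably why the talk computes the largest.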

If you think about this in terms of more modern GPU technology, the SPUs are functioning as DX10 vertex shaders outputting into a geometry shader (for stream compaction after culling) which writes verts out via stream out for single or multi-pass rendering. Most likely on current GPUs, using the hardware triangle setup for culling might be faster (ie skip the stream compaction, or at least use histo-pyramids instead), and obviously for single pass rendering, stream out would then be unnecessary.
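
A minimal sketch of the cull-then-compact step, assuming already projected 2D positions and counter-clockwise front faces (both assumptions, as are the function names). It only shows back-face and zero-area culling; the real pipeline tests more cases.

```python
def signed_area2(a, b, c):
    """Twice the signed area of triangle abc in screen space."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])

def cull_and_compact(positions, indices):
    """Return a compacted index list without back-facing or zero-area triangles."""
    out = []
    for i in range(0, len(indices) - 2, 3):
        tri = indices[i:i + 3]
        if signed_area2(*(positions[j] for j in tri)) > 0:
            out.extend(tri)       # stream compaction: survivors are packed densely
    return out
```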

Next Generation

As John Carmack noted as the light bulb moment in his Quakecon speech, what GPUs provide for general purpose computation beyond SPUs, is a fast vector gather. This functionality is tremendously important for many algorithms. Best case on SPUs for general gather is to work with a local store cache of 128 byte sized objects (for DMA performance), or 16 byte sized objects (if necessary, smallest for vector transpose), and then eat the overhead of the vector transpose to move from AOS to SOA for efficient computation. Personally I think SPU functionality can be replaced by the DX11 compute shader or OpenCL. I'd rather see one tremendous GPU attached to a small multi-core CPU. I wonder what Sony will do for PS4?
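
To make the gather-then-transpose overhead concrete, here is a rough sketch in plain Python. The function names are hypothetical, and lists stand in for vector registers.

```python
def gather(store, indices):
    """Gather whole objects (AOS) from a store by index."""
    return [store[i] for i in indices]

def aos_to_soa(objects):
    """Transpose [(x, y, z, w), ...] into ([x, ...], [y, ...], [z, ...], [w, ...])."""
    return tuple(list(lane) for lane in zip(*objects))

def squared_lengths(xs, ys, zs):
    """SOA math: each term maps onto one vector op across all lanes."""
    return [x * x + y * y + z * z for x, y, z in zip(xs, ys, zs)]
```

On an SPU the transpose is extra shuffle work per gathered batch; a hardware vector gather delivers the SOA layout directly.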

Next Generation Game Code

1. Parallelizing game code, cross entity communication only through one frame delay.
2. Entity gathers from data in previous frame, writes to self.

Note that Jon's description of parallelizing game entities exactly parallels current GPU function (see what I wrote in parentheses). The majority of code only gathers data from the previous frame (texture fetch of results from previous frame), and writes to self (writing to multiple render targets). One sync point, massive easy parallelism. If needing dependent entities in a given frame, sort into groups of independent tasks which can run in parallel, with sync points separating groupings (groups are sequences of independent draw calls, ie no draw call does a texture fetch from a frame buffer written by a previous draw call). If you read between my lines, you might see that I am indeed saying that you can do all game entity code on the GPU. And I have a work in progress future blog post which will detail exactly how this can be done (GPU side scripting).
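
The gather-from-previous-frame, write-to-self pattern can be sketched as a double buffered update. This is only an illustration: `step`, `leader`, and `follower` are hypothetical names, and the list comprehension stands in for what would be a parallel dispatch.

```python
def step(prev_state, update_fns):
    """Advance every entity one frame.

    Each update may read any entity in prev_state (gather) but returns only
    its own new state (write to self), so the calls are independent.
    """
    return [fn(i, prev_state) for i, fn in enumerate(update_fns)]

def leader(i, prev):
    return prev[i] + 1.0

def follower(i, prev):
    # Reads the leader's position from the previous frame (one frame delay).
    return prev[i] + 0.5 * (prev[0] - prev[i])
```

The single sync point is the swap between `prev_state` and the returned list.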

Sparse Voxel Octree

Next section of the talk is spilling the beans on the sparse voxel octree. The paper didn't go into much detail on a lot of the acceleration methods tried, but I was impressed with the voxel translation trick.

1. Rendering bounding hull with polygons to accelerate open space skipping. Ideal case is for the bounding hull to be generated from the coarse voxel geometry. Would require depth write in the fragment shader after searching to construct a proper Z buffer for hybrid rendering. Only found a 2x speed improvement for skipping most of the traversal process. Surprised at this stage why he didn't switch to a faster algorithm and better texture/data compression for the fine search. Should be all they need to reach a good FPS on current high end GPUs.

2. Using depth feedback from previous frame to accelerate open space skipping. This is only noted in the blurb in the corner. Not sure if he tried combining this with adaptive refinement. The idea being to raster by mipmap expansion. Start with the full search at a really small framebuffer (mipmap) size, search to minimum depth for region. Then repeatedly refine search to higher resolutions (in mipmap), but use reprojection from the previous frame to accelerate refinement. The up-resolution catches previously occluded (or newly visible) geometry. Wouldn't need to do full search each frame for each pixel, could hide artifacts of frame converging to proper result in motion or DOF blur.

The section on infinite surface detail is interesting, but IMO not that useful. Largest problem is that the fractal would always be axis aligned (think about detail on a sphere for example). Sure it is great for boxes (see my 080709 post for an example of infinite detail with an axis aligned fractal ... in a fisheye projection however). The technology I'm using for infinite detail with Atom allows for each node of the tree to have a coordinate space rotation and scale, which is a necessary construct to compress infinite detail on curved surfaces well.

I am personally not that interested in SVO as it is described in the paper because of the complexities of dynamic objects. A direct parametric raycast (ie curved ray search) seems way too expensive, and the best acceleration methods depend on triangle rendering.

The Future of Rasterization

Interesting. The point being that GPU parallel drawing performance will eventually be limited by triangle setup since GPU triangle setup is currently serial, or requires sort of triangles into cores, each core taking one region of the screen (ie the Larrabee binning model).

However, if we take the eventual model of fragment sized triangles and even more massively parallel machines, rasterization becomes reduced to a sort on 2D projection and Z (note this sort has great temporal locality, which hints that there is a very good parallel solution to this problem, both in terms of low cross core communication if objects are stored on cores based on locality, and in terms of parallel computational efficiency). Alternatively, this future rasterization could be looked at as an atomic scatter operation (depth controlling an atomic swap), which if the input stream of fragments had good 3D locality (which it should), would provide great cache performance.
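
A serial sketch of the scatter view, with hypothetical names. Each write is a depth controlled swap; on parallel hardware the compare and write would have to be one atomic operation, which this Python loop does not model.

```python
def scatter_fragments(fragments, width, height):
    """Resolve visibility by scattering (x, y, z, color) fragments."""
    depth = [[float("inf")] * width for _ in range(height)]
    color = [[None] * width for _ in range(height)]
    for x, y, z, c in fragments:
        # Keep the fragment only if it is nearer than what is stored.
        if 0 <= x < width and 0 <= y < height and z < depth[y][x]:
            depth[y][x] = z
            color[y][x] = c
    return color, depth
```

The result does not depend on fragment order (ignoring exact depth ties), which is what makes the scatter formulation attractive for parallel hardware.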

And raycasting (as described in Jon's talk) is simply scatter through a gather search. So they all end up similar. The real question is which scales better with the bandwidth required to communicate between nodes.




080814 / Otoy and Braid

OTOY

Below is a shot from the full Ruby demo, OTOY technology. The motion blur is very odd indeed. Also lack of any high frequency image content. Other leaked OTOY video below shows real-time content with what appears to be cubemap reflections which aren't updated in real-time. Notice video always blurs when lights get placed / updated...
Click here for the full size image.
Link to more shots and the full Ruby video.
Link to "real-time" video from OTOY.

Braid

Braid is an Awesome Game. If I had a 360 (other than my devkit at work) or Braid was on the PC or PS3, I'd have bought the game now, so I'll have to wait for the PC port. If you have a 360, buy it!




080813 / OpenGL 3.0 II

Awesome Papers from Siggraph 2008

March of the Froblins: Simulation and Rendering Massive Crowds of Intelligent and Detailed Creatures on GPU - Co-authored by Jeremy Shopf, Level of Detail Blog. Impressive: GPU side scene traversal, occlusion, LOD, path finding, and more. Anyone have a link to a video of this?

StarCraft II: Effects & Techniques - Very informative section on SSAO and DOF, especially in outlining how to use deferred depth and normal in the calculations. Also good to see someone else using the "compute at full res but sample at lower res" texture cache performance trick.

Other Highlights of What is in the GL3 Core Spec

Previous list is here on my 080811 post. Mixed-format framebuffer attachments are supported! Apparently mixed size framebuffer attachments are in the spec, but not currently supported by NVidia's beta drivers.

Features :: GL3 ARB Extensions / In DX10

Known as the extension pack, didn't make it into core because of time deadline for Siggraph 2008. Expected that ATI will support these in possible Q1 2009 driver release. I would expect these in core in GL3.1 because of direct match to DX10 hardware spec.

GEOMETRY SHADERS :: ARB_geometry_shader4 - Also supported in GL2 via extensions with Apple and NVidia through EXT_geometry_shader4. Apple even emulates this functionality via LLVM on hardware which doesn't have a geometry shader pipeline stage, check out the Apple GL Capabilities Matrix.

INSTANCING :: GL_ARB_draw_instanced, GL_ARB_instanced_arrays - Draw instanced provides gl_InstanceIDARB to shaders, and instanced arrays provides a frequency stream divider for vertex inputs.
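
The frequency divider can be modeled as a rule for which array element an attribute reads. This is a sketch of the semantics only, with a hypothetical function name: a divisor of 0 means per-vertex data, and a divisor of N means the attribute advances once every N instances.

```python
def attribute_index(vertex_id, instance_id, divisor):
    """Array element fetched for one vertex attribute."""
    if divisor == 0:
        return vertex_id              # ordinary per-vertex attribute
    return instance_id // divisor     # per-instance (or per-N-instances) attribute
```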

TEXTURE BUFFER OBJECTS :: ARB_texture_buffer_object - Supported in GL2 by NVidia via EXT_texture_buffer_object. Texture buffer object provides unfiltered texel lookups (integer index) into 1D linear buffer objects. Basically the glue which provides the ability to fetch from transform feedback / vertex buffers, CPU writes to mapped GPU memory, and framebuffer readbacks (PBOs).

Features :: GL Vendor Extensions / Not in DX

FENCE :: NV_fence, APPLE_fence - Nearly positive this isn't available in the PC versions of DX. Provides for finer granularity of synchronization with GL, can be useful measuring GPU latency for drawing commands. Also can be useful for multi-threading with GL.

Features :: GL Possible 3.1 Release Features / In DX10

PARAMETER BUFFER OBJECTS :: NV_parameter_buffer_object - Supported by NVidia only. Provides a 64KB constant store which is array accessible in shaders, and via buffer objects, and avoids the compile and link problems of uniforms. In DX10 constants are declared in constant buffers (just like parameter buffer objects), and this can be a major cause of low performance if used improperly. When constant buffers are updated, the entire buffer is uploaded to the GPU. Need to balance number of calls to update constant buffers, and amount of data to transfer per call.

One thing to remember about constants and uniforms (ie what CUDA refers to as Constant Memory), is that while constant memory is cached (8KB cache on NV 8 series cards I think), a miss costs memory read(s) from device memory. On a cache hit, reading a uniform is as fast as reading from a register, only if all threads read the same constant, otherwise it scales linearly with the number of different constants read (divergence). The CUDA docs don't say if divergent constant reads can be fully hidden by the hardware (I suspect not). They do say that texture fetches are not subject to the constraints on memory access patterns that global and constant memory reads must respect to get good performance. Also that the latency of addressing calculations is hidden better with texture fetches, and that packed data may be converted for free and broadcast to separate variables in a single operation with a texture fetch only (ie texture fetch a vec4, into registers, vs 4 separate constant fetches). Obviously for divergence, random access and larger working sets, texture fetch is going to be the faster path, but for non-divergent constant access parameter buffer objects would be the way to go.
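
As a toy version of that cost model (cache hits only, hypothetical function name): the relative cost of one constant fetch for a group of threads is the number of distinct addresses the group reads.

```python
def constant_read_cost(addresses):
    """Relative cost of one constant fetch across a group of threads.

    1 when every thread reads the same address (broadcast), otherwise one
    serialized read per distinct address.
    """
    return len(set(addresses))
```

So a group of 32 threads all indexing by the same loop counter costs 1, while 32 threads each indexing by their own thread id cost 32, which is where texture fetch starts to look better.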

Features :: GL Future Object Model Improvements / In DX10

IMMUTABLE STATE OBJECTS - This functionality has limited availability in DX9 in the form of StateBlock9, which could record a bunch of state changes and apply with one function. This is a core part of the eventual object based API rewrite for GL. Was also a core change for DX10 (organized pipeline into 5 immutable state objects: input layout, raster, sampler, depth/stencil, blend). The point is to push validation and processing overhead into object creation, so this is an optimization to improve batch performance. DX10 limited to 4096 objects for most types. DX10 didn't get threading right (this is being finally addressed in DX11), and GL still has the opportunity to get the object model right, so that command buffers (what the driver creates to issue commands to the GPU) can be created in parallel on separate threads.

Much of the unfinished object model improvements could get incrementally rolled into the spec prior to the eventual cleaner API. For example, Apple's Vertex Array Objects are now core in GL3, which provides a single object similar in function to DX10's input layout. Should be a good place for optimization for GL2 based apps. EXT_direct_state_access might also be a possible intermediate optimization for draw call bound applications depending on what drivers support this extension.

DECOUPLE TEXTURES AND TEXTURE FILTERING - Ability to sample from a texture with both filtered and non-filtered texture fetches. Requires object API change described in State Objects above.

TEXTURE FETCH FROM A MULTISAMPLED TEXTURE - DX10 provides ability to sample individual samples from the texture backing a multisampled render target. Even on DX10 with NVidia's drivers you can sample both depth and color.

Features :: Not in GL3 / In DX10.1

The following are DX10.1 features (ie only supported on ATI hardware to my knowledge). Would not expect these in the spec until supported on a broader hardware base.

CUBEMAP ARRAYS, INDIVIDUAL RENDER TARGET BLEND MODES, GATHER4

Features :: Not in GL3 / Not in DX10

LOCAL CACHED FULLY COMPILED SHADERS - DX byte code isn't the solution, it needs to be re-optimized and re-compiled by the drivers. For example ATI's shader compiler has to fight to undo the DX byte code optimization when it does its internal recompile. I'm sure the case is the same for NVidia as well. Not practical to pre-compile for all current and future hardware. Possible good compromise is to cache compiled shaders on the local machine, so only take the hit once.

PROPER THREADING SUPPORT - This is expected as a core feature in DX11, and most likely in future GL through object model improvements. Both DX10 and GL currently have really poor threading models. Arguably GL actually has better support for threading than DX10, in the way that GL can share objects between contexts. The future is the ability for the driver to build command buffers user side per thread.

SHADER SCATTER - This is expected as a core feature in DX11, but perhaps limited to compute and pixel shaders. Both current DX10 ATI and NVidia hardware support this, but no API to use it. Ability to do general writes to graphics resources from within a shader.

COMPUTE SHADER - Expected as core feature in DX11. Available now in CUDA mixing with DX and GL, but only on NVidia cards. Path for GL is with OpenCL. OpenGL/OpenCL to be able to share resources without a COPY! OpenGL/OpenCL to have very flexible scheduling. OpenGL/OpenCL is going to be awesome!

DYNAMIC SHADER LINKING - Expected feature in DX11.

TESSELLATION - Expected feature in DX11.

GL3 Driver Support

Until the vendors finish up their GL3 drivers, one can currently prototype GL3 work on NVidia's beta Windows driver, or using the pre-GL3 non-ARB vendor extensions on current released GL2 drivers by both NVidia and Apple.

NVIDIA - NVidia's Beta GL3 Drivers for Windows Only (XP and Vista)

According to NVidia, driver has full core GL3 functionality excluding,
- One-channel (RED) and two-channel (RG) textures
- Mixed size FBO attachments (mixed-format attachments are supported)
- The Clearbuffer API
- Windowless rendering support (We suggest you use GPU affinity instead)
- Forward-compatible context
- Debug Context

Includes the following new extensions,
- ARB_vertex_array_object
- ARB_framebuffer_object
- ARB_half_float_vertex
- WGL_create_context
- ARB_draw_instanced
- ARB_geometry_shader4
- ARB_texture_buffer_object

ATI - Possible Q1 2009 for complete 3.0 driver release, possible early beta, plans to implement the GL3 ARB extensions. Source, BOF slides.

APPLE - My personal guess is that we might see GL3 ARB extension support early on NVidia hardware since Apple has EXT_gpu_shader4 support currently, but that full GL3 support on ATI hardware happens at similar time to ATI's release of PC drivers.

INTEL - Expected on Larrabee and future platforms, but seems like not expected on current hardware. Source, BOF slides.

GL3 Hardware Fast Paths Reference

Don't have time yet to do this, but I do intend to compile this data on this page here in the future. Most importantly to describe what is known as to best methods to buffer data between CPU and GPU, but also to cover relative performance of things like transform feedback and geometry shader support. What I don't have time to do, I'll keep links to other references on.

Apple's OpenGL Optimization Page

My chicken scratches follow, don't expect any of it to make sense yet...

DYNAMIC UPDATE OF TEXTURES : DX10 - Have a pool of textures created with D3D10_USAGE_STAGING. Map() with D3D10_MAP_WRITE and D3D10_MAP_FLAG_DO_NOT_WAIT so that Map doesn't block. Write texture to the mapped memory and Unmap(). Then CopyResource() or CopySubresourceRegion() to do an asynchronous GPU to GPU copy. Any subresource that is bound to the pipeline must be unmapped before any render operation can execute, thus the reason for the staging texture.

DYNAMIC READBACK OF TEXTURES : DX10 - Only resources created with the D3D10_USAGE_STAGING flag can be read from the GPU. However D3D10_USAGE_STAGING resources cannot be written to by the GPU. So CopyResource() or CopySubresourceRegion() must be used to do an asynchronous copy to the D3D10_USAGE_STAGING resource. Wait at least 2 frames before using (ie mapping) the resource copied to.

DYNAMIC UPDATE OF CONSTANT BUFFERS : DX10 - Use Map() with D3D10_MAP_WRITE_DISCARD and then UpdateSubresource().

DYNAMIC UPDATE OF BUFFERS : DX10 - Update using Map() with D3D10_MAP_WRITE_NO_OVERWRITE (BTW, this is only valid for vertex/index buffers, not textures, too bad) and use the buffer like a ring buffer, only writing to empty portions. DX10 allows you to map the resource directly even though the GPU is using it. Wouldn't want to use UpdateSubresource() here because of frame delay.
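
A rough sketch of the ring-buffer bookkeeping, in Python rather than D3D calls. The class is hypothetical, the strings stand in for D3D10_MAP_WRITE_NO_OVERWRITE and D3D10_MAP_WRITE_DISCARD, and the wrap-with-discard step is the commonly used pattern rather than something stated in the notes above.

```python
class RingAllocator:
    """Hands out byte offsets inside one fixed-size dynamic buffer."""

    def __init__(self, size):
        self.size = size
        self.head = 0                 # first byte not yet written this pass

    def alloc(self, nbytes):
        """Return (offset, map_mode) for the next write of nbytes."""
        if nbytes > self.size:
            raise ValueError("allocation larger than buffer")
        if self.head + nbytes > self.size:
            # No untouched space left: discard the buffer and restart at 0.
            self.head = nbytes
            return 0, "WRITE_DISCARD"
        offset = self.head
        self.head += nbytes
        # Region has not been handed out since the last discard.
        return offset, "WRITE_NO_OVERWRITE"
```

The caller would Map() with the returned mode, write `nbytes` at `offset`, Unmap(), and draw from that offset.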




080811 / OpenGL 3.0

OpenGL 3.0 is here! GLSL spec | GL spec

Thanks ARB for getting GL3 up to date with current GPU hardware!

In short, GL3 is an incremental update to GL2 with added support for some of the most important current GPU hardware features, and a set path towards spec simplification (via deprecation of parts of the API). I am personally thankful for what they did manage to get into the spec, and cannot wait for updated drivers from the core vendors!

GL 3.0

1. Unified shader support!
2. Conditional rendering.
3. Fine buffer control.
4. Floating point color and depth support for textures and renderbuffers now core.
5. Framebuffer objects now core.
6. Half float support now core.
7. Multisample now core.
8. Integer support.
9. Texture arrays and texel fetch.
10. Packed depth/stencil textures and renderbuffers.
11. Per render target blend enables and color write masks.
12. Better support for vendor compressed texture formats.
13. Single and double channel support in textures and renderbuffers.
14. Transform feedback.
15. Vertex arrays.
16. sRGB framebuffer mode and texture support.
17. 8 render targets.
18. 16 vertex attributes.
19. 1024 uniforms.
20. 64 varying components.
21. 16 combined vertex and fragment texture units.

Other GL3 highlights,

1. Framebuffer object now supports "if the attachment sizes are not all identical, rendering will be limited to the largest area that can fit in all of the attachments (an intersection of rectangles having a lower left of (0; 0) and an upper right of (width; height) for each attachment)" as well as "If the attachment sizes are not all identical, the values of pixels outside the common intersection area after rendering are undefined".

2. Multisample support for renderbuffers also looks to be good, along with attachment support for cubemaps and texture arrays!

3. Rendering to texture when the texture is in use is allowed but marked as undefined under the proper conditions. Will have to check if reverse mipmap generation pattern now works...

4. We can thank Blizzard and Apple for this, MapBufferRange() supports access to part of a buffer with key access parameter options: MAP_INVALIDATE_RANGE_BIT, MAP_INVALIDATE_BUFFER_BIT, MAP_FLUSH_EXPLICIT_BIT, and MAP_UNSYNCHRONIZED_BIT. FlushMappedBufferRange() should also be quite useful.

5. Framebuffers now support attachments with different formats and bit depths.
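
The mixed-size rule quoted in item 1 boils down to a min over the attachment sizes, since every rectangle has its lower left at (0, 0). A small sketch with a hypothetical function name:

```python
def renderable_area(attachment_sizes):
    """Common (width, height) for attachments that all start at (0, 0)."""
    widths, heights = zip(*attachment_sizes)
    return min(widths), min(heights)
```

Pixels outside that area are undefined after rendering, so a larger attachment cannot be relied on to keep its old contents there.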

Deprecation highlights,

1. Most of the fixed function is going away.
2. Texture borders are going away for good.
3. Display lists. Wonder if anything is going to replace command buffer generation?
4. Now need to use Gen*() functions instead of application generated object names. Obvious move towards rolling in new object API model.

GLSL 1.30

The new GLSL language support is awesome.

1. Integer support in uniforms, variables, shader inputs and outputs, and textures.
2. Declarations with in/inout/out instead of attribute/varying.
3. Interpolation control (non-perspective).
4. Texture arrays, texel fetch (integer indexed), and offsetting.
5. Better built in function support (ie trunc(), round(), isnan(), etc).
6. User defined vertex inputs and fragment outputs.
7. Support for gl_VertexID.
8. Precision qualifiers (not yet functional).
9. Clean up of texture interface, projective cube maps textures and shadow.
10. Explicit gradient texture lookup.
11. Common blocks which can be backed by buffer in the API.
12. And more...




080806 / Random Stuff

After many requests, the blog now has an experimental atom feed. Decided to trade getting any sleep tonight, for this functionality on the blog. Not sure if it is 100% to spec, but looks to work from google and firefox live bookmarks.

DX11

More info on DX11 has been made public at the 2008 XNA Gamefest and at some point in the future, the slides and presentations should arrive here. DX11 provides a compute shader which looks very much like CUDA, but has cross vendor support. Invocation is via a regular (1D, 2D, or 3D) array of threads. This thread array is further broken down into a sub grouping where the threads of the sub group can access a new shared register class. Compute shader also has access to a thread id, as well as atomic operations on global memory or shared registers, and general purpose unordered read/write of global memory. Initial prototypes show 2x improvement over DX10 code, while much better performance should be possible in theory. Also HLSL 5.0 provides dynamic linkage, as well as unordered reads and writes of global memory for sure in compute and pixel shaders. Very exciting indeed.
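
To illustrate the thread array, sub groups, and shared registers, here is a serial Python model of a 1D dispatch that sums the input per group. It is not DX11 syntax, and all names are hypothetical.

```python
def group_sums(data, group_size):
    """Sum data per thread group, the way a group might via a shared register."""
    sums = []
    for group_start in range(0, len(data), group_size):
        shared = 0                    # stands in for a register shared by the group
        for local_id in range(min(group_size, len(data) - group_start)):
            global_id = group_start + local_id   # the thread id seen by the shader
            shared += data[global_id]            # an atomic add on real hardware
        sums.append(shared)
    return sums
```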

OpenGL 3.0

According to GPU Cafe Rumors GL3 is indeed here and we will have actual NVidia drivers in September. Perhaps ATI, Intel, and Apple will soon also have real GL3 support. Will have to wait for Siggraph to know for sure.

Larrabee

The Siggraph Larrabee paper is out and confirms the expected software rendering architecture for Larrabee. General vector scatter/gather to/from L1 is there as well, and texturing goes through L2. Still lots of unanswered questions, check out the speculation on Beyond3D.

Busy

Been too busy at the art shows to get any good programming work done at home ... here is our booth right before our last show opened.

Also just celebrated our 6th anniversary!

Next Time

Been porting my Forth-like scripting language (which I use to build this website among other things) to C for intended use to accelerate general development. Enables insane factoring of code in any language, as well as syntax, etc. More later...




080718 / NVPerfKit Linux

Woohoo, latest NVPerfKit 6.0 is now available for Linux!
Best web resource ever for graphics/GPU related papers.

E3 and Little Big Planet

Wasn't there, but was very impressed by Little Big Planet's use of CFD. In fact, LBP currently has my vote for most impressive graphics for any game. The whole graphics package from motion blur to global illumination is so very polished. Some hints as to their GI approximation, Fast Approximations for Lighting of Dynamic Scenes and Siggraph2006 course notes. See it in action, the image below links to video on GT.




080709 / Anti-Aliasing

Anti-Aliasing

Got anti-aliasing working in engine without hardware AA, using temporal jitter, sub-pixel positioning, and frame feedback. The 720P shots below (running at over 60fps on my NVidia 8600 GTS) were generated from only 256K pixels per frame. So I'm effectively only computing about 25% of the screen pixels a frame. Which is great, builtin "keeping it low res".

Click image for the full 1280x720 resolution shot, updated it actually works now!

I'm very satisfied with the results thus far, moving to GPGPU from CPU has enabled me to get a 64x speed up in the l-system traversal. There is still a lot of work to do (lighting, physics, content creation tools, etc), and I've still got a lot of possibilities for optimization, so this thing will be awesome when finished...

A Little Profiling

You can only get so far with GL_EXT_timer_query, and I don't expect NVidia to release the newest PerfKit for Linux any time soon, but I have some rough estimates. First, I previously said that somehow the 8600 GTS was point draw limited around 68 Mpoints/sec. I cannot remember where I got that number from, but it is dead wrong.

Early in the engine I was getting 287M points/sec when drawing 128-bits/point, texture fetching 128-bits/point, and doing a large amount of math (over 80 gpu_program4 shader instructions per point).

At this point I'm doing 256-bits/point ROP (and 256-bits/point TEX) at 170M points/sec (on 8600 GTS). Without porting over to Windows and using NPerf, guessing I might be hitting a bandwidth limit either from a combination of texture cache over fetch on cache miss, or ROP bandwidth limits from scattering. I should have something like 272 ALU ops/point best case performance from the hardware, so I'm probably not ALU bound yet.
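
The 272 ALU ops/point figure is consistent with the back-of-envelope arithmetic below. The hardware numbers are an assumption here, not something stated above: 32 scalar stream processors at a 1.45 GHz shader clock, each issuing one op per clock.

```python
def alu_ops_per_point(stream_processors, shader_clock_hz, points_per_sec):
    """Peak scalar ALU ops available per point at a given point rate."""
    return stream_processors * shader_clock_hz / points_per_sec

# Assumed 8600 GTS figures: 32 SPs at 1.45 GHz, at the measured 170M points/sec.
ops = alu_ops_per_point(32, 1.45e9, 170e6)
```

Under those assumptions `ops` comes out just under 273, so a shader using well under that many ops per point should leave ALU headroom.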

Good news is that I have broken through the G80 1/8 performance barrier for point drawing (based on SIMD packing in the fragment shader) in terms of ROP performance. I'm about 1/4 the way to the estimated peak ROP performance on this card (perhaps the G84 doesn't share the G80's limitations). Since I'm doing points instead of 2x2 pixel quads, this makes sense. Not sure yet if this limit can be broken (when ROP is probably designed for 2x2 pixel quads), but I have a few ideas to try out when I get to the windows port.

Part of the performance I've managed to get I'm sure has to do with a combination of doing all the math in the vertex shader, working with a FP32 RGBA framebuffer (which I'm guessing is much more ROP bandwidth efficient for scatter), and that I've designed my algorithm so its point scatter has both great destination locality, and good MP utilization (so it doesn't starve what I'm guessing are limits on output 2x2 pixel quad queues in the hardware).

Newest NVidia GL Drivers

Just switched to the latest NVidia drivers (after not updating for perhaps the last 6 months), and my test program ran significantly faster on the newest drivers. Seems like they might even have an optimization to not fetch vertex attributes when you don't use them... (I'm computing vertex attributes from gl_VertexID).

In any case, thanks all at NVidia for single handedly keeping OpenGL up to date!




080704 / Micro Polygons II

I'm out of state for the weekend, so I decided to get a bunch of ideas down for the next week, before I run out of memory...

Motion Blur in the Point Based Renderer

The idea I have for motion blur for the new point based engine is to effectively stretch the geometry in the direction of motion, then blur in the direction of motion in the post process clean up. The way I intend to stretch the geometry is by temporal jitter of the l-system tree traversal, in combination with the engine's ability to fork multiple times from a given node (add jitter to get different points in time), and independently traverse those nodes. This should result in the object filling out its motion path leaving some pixel gaps towards the edges which the post processing cleanup should blend into a nice semi-transparent directional blur. Will be testing this when I get back home after the weekend.

Post Processing

As was hinted above in the section on motion blur, my concept to deal with transparency is temporal jitter, and to reduce the pixel space filling for transparent objects. Basically a form of dithering. So post processing is then reduced to a noise reduction filter which has the advantage of knowing depth, pixel position and size at sub-pixel accuracy (for quality antialiasing), the previous frame (for temporal consistency), and a motion vector per pixel.

I am going to attempt to run post processing in the fisheye projection space only (instead of running a post processing pass on the entire octahedron map). This should have the advantage with the temporal jitter and sub-pixel position, that I can run with a reduced (perhaps half) resolution octahedron map. Also would enable me to really clean up the loss of resolution on the outer radii of the fisheye projection. Noise reduction is going to be a simple weighted average of a selection of pixels from the current and previous frames and perhaps mipmaps (for hole filling). To increase quality I am going to do something similar to the Crytek motion blur trick of doing multiple passes with only a small number of pixels per pass.

L-System Traversal

For now I have a fixed number of eight children per node, somewhat like an octree but without fixed child positions. The l-system transform gives each child a position, radius and direction (quaternion), relative to the parent.
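
A sketch of how a child's relative transform might compose with its parent's. The conventions are assumptions: quaternions are (w, x, y, z), child positions are in units of the parent radius, and child radius is a scale factor on the parent radius. All function names are hypothetical.

```python
def quat_mul(a, b):
    """Hamilton product of two (w, x, y, z) quaternions."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)

def quat_rotate(q, v):
    """Rotate 3-vector v by unit quaternion q (computes q * v * conj(q))."""
    w, x, y, z = q
    p = quat_mul(quat_mul(q, (0.0,) + tuple(v)), (w, -x, -y, -z))
    return p[1:]

def child_to_world(parent, child):
    """Compose (position, radius, quaternion) of a child with its parent."""
    ppos, prad, pq = parent
    cpos, crad, cq = child
    offset = quat_rotate(pq, cpos)    # child offset in the parent's orientation
    pos = tuple(p + prad * o for p, o in zip(ppos, offset))
    return pos, prad * crad, quat_mul(pq, cq)
```

Traversal applies `child_to_world` once per level, so each node only stores its small relative transform.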

The l-system can be thought of as a complex form of compression, a set of rules which enable the building of a complex system. There are some parallels here to cells and DNA. DNA is a set of rules which in combination with environmental forces shape some very complex things indeed. A large aspect of my core idea for the new engine is to have this duality between a simple set of rules (l-system) and environmental forces (massive number of players, whose actions and self made content, shape the world). More on this much later...

A side note, I'm also thinking of modeling the l-system on cells, and moving to a binary tree (from the current octree).