071130 / GPU Only

Domino effect ... solving one problem yields a solution to many.

This started with figuring out how to blend normals into a G-buffer, making it possible to reduce shading to one operation per pixel, even with semi-transparency. It looks to be ending with the ultimate form of optimization: extremely tight coupling of graphics with physics and CFD, all on the GPU only.

Infinite World Size With Single Precision

First stumbling block: how to represent an infinitely large world in single precision using an L-system. I've been trying to solve this for about a year in the back of my mind, and now believe I have the solution. Numerically, positional precision degrades when expanding down the L-system tree. Children of the same parent are positioned with very good precision relative to that parent. However, children which are adjacent in screen space may have been reached by completely separate paths (in the worst case sharing no ancestor other than the root node), and their relative positions suffer from precision problems that grow the deeper one traverses down the L-system tree.
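To make the precision issue concrete, here is a minimal sketch of how single precision drops small offsets once coordinates get large. The helper `f32` is a hypothetical name introduced only for this illustration; it rounds a Python float (a double) to the nearest single-precision value using the standard `struct` module.

```python
import struct

def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Far from the origin: at 2**24 the spacing between float32 values is 2.0,
# so a 1-unit offset is rounded away entirely.
far = f32(16777216.0)
print(f32(far + 1.0) == far)    # True

# Close to the origin (or to the eye, in eye-relative coordinates)
# the same offset is represented exactly.
near = f32(4.0)
print(f32(near + 1.0) == near)  # False
```

This is why positions expressed relative to a nearby parent (or to the eye) stay accurate while positions accumulated along long, unrelated paths do not.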

The solution is to somehow refine the locations of unrelated children. Of course, unrelated child nodes are all dynamic and thus ultimately have no fixed connectivity. However, in projected screen space, connectivity is semi-consistent over time, and this ultimately gives the solution: provide general positional constraints based on distances from any nearby node. As nodes get closer to the eye (in an eye-relative coordinate system), the constraints refine their relative positions correctly. This type of system fits perfectly into how I'm already doing physics calculations.
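The distance-constraint idea above can be sketched as a simple relaxation loop. This is only an illustration under assumptions: the function name, the `(i, j, rest_distance)` constraint layout, and the choice to split the correction evenly between both nodes are all hypothetical, not the engine's actual scheme.

```python
import math

def relax_distance_constraints(positions, constraints, iterations=10):
    """Nudge eye-relative positions so paired nodes approach their rest distances.

    positions:   list of [x, y, z] lists, modified in place (assumed layout)
    constraints: list of (i, j, rest_distance) tuples between nearby nodes
    """
    for _ in range(iterations):
        for i, j, rest in constraints:
            a, b = positions[i], positions[j]
            delta = [b[k] - a[k] for k in range(3)]
            dist = math.sqrt(sum(d * d for d in delta))
            if dist == 0.0:
                continue  # direction undefined; skip rather than guess
            # Move each endpoint half of the error along the connecting line.
            correction = 0.5 * (dist - rest) / dist
            for k in range(3):
                a[k] += delta[k] * correction
                b[k] -= delta[k] * correction
    return positions

# Two nodes that drifted 1.3 apart but should sit 1.0 apart.
nodes = [[0.0, 0.0, 0.0], [1.3, 0.0, 0.0]]
relax_distance_constraints(nodes, [(0, 1, 1.0)])
print(nodes[1][0] - nodes[0][0])
```

Because every correction is computed from small, local differences, the arithmetic stays well within single-precision range near the eye.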

If this works, it removes the need for double precision, enables this to be ported to the GPU, and removes the bottleneck of transferring data between the CPU and GPU.

Sorting, Occlusion, and Overdraw

The next stack of problems comes from porting code to the GPU. GPU sorting is simply a bad idea unless you use a non-graphics interface (like CUDA) where a fast radix sort could be done on the GPU. Besides being needed for drawing, sorting is also required by my current solution to hidden surface removal. So removing sorting requires a fundamental change in the engine, something which would probably be easier to solve by working the problem backwards and keeping an open mind for massive changes.

Without sorting, the answer has to use the Z-buffer, and transparency is either additive, multiplicative, or simply not done. Without overdraw control and culling, drawing triangles is out (too much overdraw). That leaves points (at 1024x1024 the midrange NVidia 8600 GTS can easily render 100M points/sec), and leaves a new problem: how to fill the area around the points.

Seems both stupid and impossible, right? After all, a majority of the screen would be empty, and all sorts of occluded points would be littering the screen. Talk about a nightmare of a problem. Also, all my cool diffuse effects (volumetric fog, etc.) are the result of drawing the larger parents of the child nodes, which compounds the problem by requiring some kind of multi-resolution scheme on top.

I got stuck at this point back at the original engine design a long time ago. About a month ago I started working on a possible solution, using a multi-resolution method which ultimately had performance issues. So perhaps it is best to defer the solution to this problem (solution below) and work on something else, which is exactly what I did.

Undoing Stack Based Tree Traversal

Stack-based tree traversal is also a no-no on the GPU, but it is required for how I do physics and rendering. The only way to solve this is to either traverse the tree for each and every node (at O(log n) cost), or find a way to remove the dependency between the current frame's child and parent. Or better yet, simply delay the dependency, so that the child uses the parent's results from the previous frame. Of course, this creates a one-frame delay (per level) as information propagates up and down the tree.
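A minimal sketch of the delayed dependency, assuming a flat node array with parent indices and 1D positions for brevity (the `step` function and its data layout are hypothetical). Each node reads only the previous frame's buffer, so no traversal order or stack is needed and every node could be processed independently, as one GPU thread each.

```python
def step(parent, local_offset, prev_world):
    """Compute this frame's world positions from last frame's parent results.

    parent[i]       : index of node i's parent, or -1 for the root
    local_offset[i] : node i's offset relative to its parent (1D for brevity)
    prev_world      : world positions computed on the previous frame
    """
    next_world = [0.0] * len(parent)
    for i in range(len(parent)):  # iterations are independent of each other
        if parent[i] < 0:
            next_world[i] = local_offset[i]
        else:
            # Read the parent's result from the PREVIOUS frame only.
            next_world[i] = prev_world[parent[i]] + local_offset[i]
    return next_world

parent = [-1, 0, 1]        # chain: root -> child -> grandchild
offset = [10.0, 1.0, 0.5]
world = [0.0, 0.0, 0.0]
for frame in range(3):
    world = step(parent, offset, world)
    print(frame, world)
# 0 [10.0, 1.0, 0.5]
# 1 [10.0, 11.0, 1.5]
# 2 [10.0, 11.0, 11.5]
```

The grandchild only settles on frame 2, which is the one-frame-per-level lag described above.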

Ultimately the fast path, delayed dependency, is the only solution with acceptable performance for the number of nodes I want to handle. With the frame delay, because of the time lag, the parent nodes can no longer be drawn on the screen. While this shouldn't be a problem for the physics code, it adds a new constraint, no drawing the parents, which simplifies the drawing problem but also creates a new one.

Drawing a Solution to the Problems of Rasterizing Implicit Volumes

The advantage of inexact implicit volumes (which is exactly what Atom renders) is that there is no need for a precise solution, and the solution can change over time (everything being dynamic is core to the game). However, what is absolutely needed is a consistent solution between frames (and relatively large-scale consistency between networked players).

Consistency between frames hints at a possible solution in the form of re-using the previous frame to render the next frame. This has the added benefit of built-in optimization through temporal coherence. Also, given the fisheye projection and the lighting requirement of needing to render all sides of a cubemap per frame, I don't have to worry about border conditions where large amounts of new geometry appear from the sides of the screen. Only a relatively small amount of new surface would be produced each frame.

Going back to the drawing-with-points constraint, for which I never had a good solution, and expanding upon the idea of temporal coherence: if all viewable surfaces are known from the previous frame, it becomes easy to remove occluded points on the new frame and, if necessary, to connect the visible points into a surface. The method for doing this is a depth-aware pyramid image processing algorithm.
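The post does not spell out the pyramid algorithm, so the following is only one possible reading: a pull-push pass over a sparse depth image. The pull step keeps the nearest depth per 2x2 block, points clearly behind a nearer point in their block are treated as occluded, and the push step fills empty pixels from the coarser level. The function names, the `tolerance` parameter, and the square power-of-two image are all assumptions made for the sketch.

```python
def pull(level):
    """Halve resolution: each coarse pixel keeps the nearest depth of its 2x2 block."""
    n = len(level) // 2
    coarse = [[None] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            block = [level[2 * y + dy][2 * x + dx] for dy in (0, 1) for dx in (0, 1)]
            hits = [d for d in block if d is not None]
            coarse[y][x] = min(hits) if hits else None
    return coarse

def fill_holes(depth, tolerance=0.5):
    """Pull-push over a square power-of-two depth image; None marks an empty pixel."""
    pyramid = [[row[:] for row in depth]]  # copy so the input is untouched
    while len(pyramid[-1]) > 1:
        pyramid.append(pull(pyramid[-1]))
    # Cull: drop a point when its 2x2 block holds a clearly nearer one.
    if len(pyramid) > 1:
        fine, coarse = pyramid[0], pyramid[1]
        for y in range(len(fine)):
            for x in range(len(fine)):
                d = fine[y][x]
                if d is not None and d > coarse[y // 2][x // 2] + tolerance:
                    fine[y][x] = None
    # Push: walk from coarse to fine, filling empty pixels from the parent.
    for lvl in range(len(pyramid) - 2, -1, -1):
        fine, coarse = pyramid[lvl], pyramid[lvl + 1]
        for y in range(len(fine)):
            for x in range(len(fine)):
                if fine[y][x] is None:
                    fine[y][x] = coarse[y // 2][x // 2]
    return pyramid[0]

points = [[None, None, None, None],
          [None, 2.0,  None, None],
          [None, None, 5.0,  1.0],
          [None, None, None, None]]
for row in fill_holes(points):
    print(row)
```

In this example the point at depth 5.0 sits behind the point at depth 1.0 in the same block, so it is dropped and its pixel is refilled with the nearer surface. A real implementation would more likely run as a chain of fragment shader passes over mipmap levels.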

However, why stop there?

The idea of recycling the results of the previous frame, or computation, exactly parallels that of an iterative grid-based CFD solver, where each pixel is a grid element. Hmm, CFD smoke and other implicit volumes are easy to create given a few emissive seeds, or points. And this is exactly the construct I believe is the solution. No overdraw, point-based, fully dynamic, and completely insane! Exactly what I am looking for, not to mention excellent texture cache performance, and perfect for the GPU.
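As a rough illustration of the parallel, here is a Jacobi-style relaxation pass in which each grid cell (pixel) is recomputed from the previous pass and a few seed cells keep emitting a fixed value. This is plain diffusion, not a full CFD solver, and the `relax` function, the `seeds` dictionary, and the clamped borders are assumptions made for the sketch.

```python
def relax(grid, seeds, iterations=1):
    """Run Jacobi-style smoothing passes; seed cells are re-imposed each pass.

    grid  : 2D list of floats (the previous frame's result)
    seeds : dict mapping (y, x) -> emitted value (hypothetical source terms)
    """
    h, w = len(grid), len(grid[0])
    for _ in range(iterations):
        nxt = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                if (y, x) in seeds:
                    nxt[y][x] = seeds[(y, x)]
                    continue
                # Average the four neighbours from the previous pass, clamping at edges.
                up = grid[max(y - 1, 0)][x]
                down = grid[min(y + 1, h - 1)][x]
                left = grid[y][max(x - 1, 0)]
                right = grid[y][min(x + 1, w - 1)]
                nxt[y][x] = 0.25 * (up + down + left + right)
        grid = nxt  # this pass becomes the "previous frame" for the next one
    return grid

field = [[0.0] * 3 for _ in range(3)]
field = relax(field, {(1, 1): 1.0}, iterations=2)
for row in field:
    print(row)
```

Each pass reads only the previous buffer and writes each cell exactly once, which is the no-overdraw, texture-cache-friendly access pattern described above.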

Fully Substituting Dynamics for Animation

I'm currently using animated L-systems as a factoring method for content creation. However, large amounts of animation take time and resources which I simply don't have as a single developer. Besides, I would like to have the option of enabling user-generated content, allowing users to build things in the Atom world. So animation has to go; it is simply too time-consuming.

But dynamic movement has to stay, just without playback of pre-computed animation. I think the solution is to design the static L-system so that it is dynamically unstable, and therefore can never fully settle back into a static state.

Motion Blur

Motion blur couldn't be done with my current method of stretching and alpha blending geometry. However, since I have the best estimate of both the location of each new pixel on the previous frame and its z, per-pixel depth-aware image-space motion blur should work fine here. It will also allow for more user control over frame rate and resolution.

Putting it all Together

This is my project for December: morphing the Atom Engine into its finished form based on these new ideas. Then in January I start content creation while I finish up the networking, audio, and server code.