080319 / Moving Beyond Rendering in a Vacuum

First, a link to an awesome GDC 2005 paper on implementing next-gen effects on the PS2, posted on c0de517e.

Parallel Computation ... Continued from Last Time

Interesting CUDA note: if you have to do general scatter or gather (ie non-coalesced) from global memory for an algorithm which is memory bound, and you can afford the registers, it is better to gather 128-bit values (ie 4 floats) per fetch per thread. That results in only a 50% memory bandwidth reduction (for compute 1.1 hardware), vs fetching one float per thread at only 12.5% of proper coalesced bandwidth. BTW, it looks like the CUDA 2.0 beta is out and provides stream support for running more than one computation kernel at the same time! No time for more now...

Moving Beyond Rendering in a Vacuum : Ideas from Photography

The Spaces Between post on meshula.net talks about one of my favorite very challenging real-time rendering frontiers: participating media. Light does indeed dance around, and is the single most important element in the real-life rendering of landscape photography. Without an overcast sky, fog, or some other atmospheric interest, the shoot ends after sunrise and begins an hour before sunset. It is not a filter placed on the camera which is important here, but rather the sky and atmosphere which filter the light reaching the camera. For example, long before dawn, while in the Earth's shadow, you can increase exposure times and capture scattered light which you cannot see. No shadows, no specular, minor diffuse, with the majority of the light reflected, ambient and scattered. Not simple distance-based light attenuation, but rather the color of light changing based on distance.
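One way to picture "the color of light changing based on distance" is per-channel exponential extinction, where each color channel fades toward a scattered-light color at its own rate. The sketch below is only an illustration under simple assumptions (a homogeneous medium, a constant fog color standing in for in-scattered light); the function name `fog_color` and the coefficients in `SIGMA` are hypothetical, not taken from any engine discussed here.

```python
import math

def fog_color(surface_rgb, fog_rgb, distance, sigma_rgb):
    """Blend a surface color toward a fog color with per-channel extinction."""
    out = []
    for surface, fog, sigma in zip(surface_rgb, fog_rgb, sigma_rgb):
        # Beer-Lambert style falloff: fraction of surface light that survives.
        transmittance = math.exp(-sigma * distance)
        # Whatever is lost is replaced by the (assumed constant) scattered light.
        out.append(surface * transmittance + fog * (1.0 - transmittance))
    return tuple(out)

# Hypothetical coefficients: blue is extinguished faster than red.
SIGMA = (0.02, 0.05, 0.12)
WARM_FOG = (0.9, 0.6, 0.4)

near = fog_color((1.0, 1.0, 1.0), WARM_FOG, 1.0, SIGMA)
far = fog_color((1.0, 1.0, 1.0), WARM_FOG, 50.0, SIGMA)
print(near, far)
```

With these made-up numbers a white surface stays nearly white up close, while at distance its blue channel has almost fully converged to the fog color and its red channel has not, so the hue shifts rather than just dimming.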
Or how about dawn on an overcast day with fog. Again no direct shadow, no specular, but with diffuse lighting from scattered light. This is something which should be easily renderable with a single "lighting" pass. The most complex part of this scene would be anti-aliasing.
Now after sunset, mountains back-lit and bathed in scattered light even with a near cloudless sky. This touches on a different way of thinking about high dynamic range, in that objects are lit by the scattered light of a very high intensity light source instead of simply adding an HDR glare post-processing effect.
Or simplified without color, fog and mist from the sea, and in this case with shadow during early morning. Tough to simulate, usually done with non-lit particle systems constrained by overdraw, and lacking in dynamic shadowing.
And in some cases the participating media is even solid, such as scattered light passing through back lit leaves in Autumn.
Application

So why does it take being on now relatively constrained hardware such as the PS2, with a fixed camera system, for someone to get this right in a video game (meaning God of War II)? And how can the boundary be pushed even further? In case it wasn't obvious, the real-time solution isn't raytracing.

Possible Evolution of the Atom Engine

This is an idea which has been keeping me awake at night. One common theme of effective programming of highly parallel SIMD machines (ie GPUs, Cell, and in the future Larrabee) is to remove searching and simply brute force a solution. Searching is divergent while brute force is parallel and localized. Algorithmic divide and conquer is now in the form of keeping large SIMD groupings of objects. If a SIMD grouping has elements which apply to two cases, you simply run both cases and predicate the results (as that is usually faster than regrouping). Regrouping should be limited and amortized over time.

Computationally my original engine concept worked out because, while I was compositing up to 16 layers per pixel, pixels were grouped into billboards (ie great SIMD), and the per pixel cost was small and TEX limited by two localized texture lookups. Then I diverged in the effort to increase per pixel detail, branching out into this concept of splatting say 64K control points into a framebuffer, followed by a large hole filling pass to generate a reprojection vector field, and lastly reprojecting the previous frame's contents and converging the results to the new frame. In my divergent engine, I broke one of my primary rules: never do any non-temporal-amortized per pixel searching! Even though my hole filling pass had logarithmically spatial amortized cost per pixel, it was still too expensive.
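The "run both cases and predicate the results" idea mentioned above can be sketched on the CPU with plain lists standing in for SIMD lanes. This is only a minimal illustration; the two cases (halving and squaring) and the names `shade_predicated` and `shade_branching` are made up for the example and have nothing to do with the Atom engine's actual shading.

```python
def shade_predicated(values, threshold):
    """Run both cases for every lane, then select per lane with a mask."""
    mask = [v >= threshold for v in values]   # per-lane predicate
    case_a = [v * 0.5 for v in values]        # case A, computed for all lanes
    case_b = [v * v for v in values]          # case B, computed for all lanes
    # Keep case A where the predicate holds, case B elsewhere.
    return [a if m else b for m, a, b in zip(mask, case_a, case_b)]

def shade_branching(values, threshold):
    """Reference version with a divergent per-element branch."""
    return [v * 0.5 if v >= threshold else v * v for v in values]

lanes = [0.1, 0.9, 0.4, 2.0]
print(shade_predicated(lanes, 0.5))
```

Both functions give the same answer. The predicated one does twice the arithmetic, but every lane follows the same instruction stream, which is the trade being described: wasted work in exchange for no divergence and no regrouping.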
Think of the hole filling pass in terms of the final pass of a bilateral upsample: you have to read at least the results from the nearest four parents in the next smaller mipmap level, which in my case resulted in 8 TEX lookups per pixel, which is already 50% of the cost of my previous engine.

Let's go back to the core idea from the divergent engine, taking a few obvious lessons from video compression: temporal (and to a lesser extent spatial) locality of the results of a computation is key to compression, and compression points to the pattern for reduction of work. Video compressors reproject blocks of the previous frame to the new frame, and then make mostly minor adjustments to the blocks to converge them to the new image.

The solution, I believe, is to again turn searching into brute force, returning to the original Atom engine, but doing temporally amortized rendering of a high detail impostor cache (ie color+alpha+normal) for the impostor textures which get composited to the screen. Impostors form an overlapped reprojection, and would get recomputed as needed based on visual priority. Updating the impostor cache would be GPU SIMD ideal, with 16x16 aligned blocks in a texture and something similar to raycasting with a fixed number of iterations (the key is that the raycasting would be into a small local area and not shoot through the entire scene; also, if you read into this, I am hinting at exactly why screen overlap of the impostors is a solved problem). The rendering pipeline would be the same, drawing impostors front first (reverse alpha blending), and lighting with something similar to one simple spherical harmonic per impostor. More next time...

Atom ©2008/2007 Timothy Farrar