Index / The Atom Project

Welcome to the Atom Development Blog! Atom started with the idea of going back to PC gaming's roots (low risk investment, experimenting with technology, fun timeless gameplay, taking a wild idea from concept to market), while taking advantage of the power of modern hardware. Atom in its final state will be a massively multi-player online 3D first person perspective game set in an atomic or microscopic cell-like world where the players are small micro-organisms battling in a huge living host world. Atom is about making the unbelievable playable, seamless, graphically brilliant, and fully interactive. It is about creating an experience which is infinitely enjoyable, by breaking the limitations of discovery and learning. There will always be new places to explore, as the Atom world is uniquely infinite in size and in character, and is alive, moving and changing over time. There will always be new ways to do things, as everything in Atom, both inside and out and at any scale, however large or small, can be physically interacted with.

080718 / NVPerfKit Linux

Woohoo, latest NVPerfKit 6.0 is now available for Linux! Best web resource ever for graphics/GPU related papers.

E3 and Little Big Planet

Wasn't there, but was very impressed by Little Big Planet's use of CFD. In fact, LBP currently has my vote for most impressive graphics for any game. The whole graphics package, from motion blur to global illumination, is so very polished. Some hints as to their GI approximation: Fast Approximations for Lighting of Dynamic Scenes and the Siggraph2006 course notes. See it in action, the image below links to video on GT.
080709 / Anti-Aliasing

Anti-Aliasing

Got anti-aliasing working in engine without hardware AA, using temporal jitter, sub-pixel positioning, and frame feedback. The 720P shots below (running at over 60fps on my NVidia 8600 GTS) were generated from only 256K pixels per frame. So I'm effectively only computing 25% of the screen pixels per frame. Which is great, builtin "keeping it low res". Click image for the full 1280x720 resolution shot (updated, it actually works now!).

I'm very satisfied with the results thus far. Moving to GPGPU from CPU has enabled me to get a 64x speed up in the l-system traversal. There is still a lot of work to do (lighting, physics, content creation tools, etc), and I've still got a lot of possibilities for optimization, so this thing will be awesome when finished...

A Little Profiling

You can only get so far with GL_EXT_timer_query, and I don't expect NVidia to release the newest PerfKit for Linux any time soon, but I have some rough estimates. First, I previously said that somehow the 8600 GTS was point draw limited around 68 Mpoints/sec. I cannot remember where I got that number from, but it is dead wrong. Early in the engine I was getting 287M points/sec when drawing 128-bits/point, texture fetching 128-bits/point, and doing a large amount of math (over 80 gpu_program4 shader instructions per point). At this point I'm doing 256-bits/point ROP (and 256-bits/point TEX) at 170M points/sec (on 8600 GTS). Without porting over to Windows and using NPerf, I'm guessing I might be hitting a bandwidth limit, either from texture cache over fetch on cache miss, or from ROP bandwidth limits due to scattering, or a combination of the two. I should have something like 272 ALU ops/point best case performance from the hardware, so I'm probably not ALU bound yet. Good news is that I have broken through the G80 1/8 performance barrier for point drawing (based on SIMD packing in the fragment shader) in terms of ROP performance.
I'm about 1/4 of the way to the estimated peak ROP performance on this card (perhaps the G84 doesn't share the G80's limitations). Since I'm doing points instead of 2x2 pixel quads, this makes sense. Not sure yet if this limit can be broken (when ROP is probably designed for 2x2 pixel quads), but I have a few ideas to try out when I get to the Windows port. Part of the performance I've managed to get, I'm sure, has to do with a combination of doing all the math in the vertex shader, working with a FP32 RGBA framebuffer (which I'm guessing is much more ROP bandwidth efficient for scatter), and having designed my algorithm so its point scatter has both great destination locality and good MP utilization (so it doesn't starve what I'm guessing are limits on output 2x2 pixel quad queues in the hardware).

Newest NVidia GL Drivers

Just switched to the latest NVidia drivers (after not updating for perhaps the last 6 months), and my test program ran significantly faster on the newest drivers. Seems like they might even have an optimization to not fetch vertex attributes when you don't use them... (I'm computing vertex attributes from gl_VertexID). In any case, thanks to all at NVidia for single-handedly keeping OpenGL up to date!
080704 / Micro Polygons II

I'm out of state for the weekend, so I decided to get a bunch of ideas down for the next week, before I run out of memory...

Motion Blur in the Point Based Renderer

The idea I have for motion blur for the new point based engine is to effectively stretch the geometry in the direction of motion, then blur in the direction of motion in the post process clean up. The way I intend to stretch the geometry is by temporal jitter of the l-system tree traversal, in combination with the engine's ability to fork multiple times from a given node (add jitter to get different points in time), and independently traverse those nodes. This should result in the object filling out its motion path, leaving some pixel gaps towards the edges which the post processing cleanup should blend into a nice semi-transparent directional blur. Will be testing this when I get back home after the weekend.

Post Processing

As was hinted above in the section on motion blur, my concept to deal with transparency is temporal jitter, and to reduce the pixel space filling for transparent objects. Basically a form of dithering. So post processing is then reduced to a noise reduction filter which has the advantage of knowing depth, pixel position and size at sub-pixel accuracy (for quality antialiasing), the previous frame (for temporal consistency), and a motion vector per pixel. I am going to attempt to run post processing in the fisheye projection space only (instead of running a post processing pass on the entire octahedron map). With the temporal jitter and sub-pixel position, this should have the advantage that I can run with a reduced (perhaps half) resolution octahedron map. It would also enable me to really clean up the loss of resolution on the outer radii of the fisheye projection. Noise reduction is going to be a simple weighted average of a selection of pixels from the current and previous frames and perhaps mipmaps (for hole filling).
To increase quality I am going to do something similar to the Crytek motion blur trick of doing multiple passes with only a small number of pixels per pass.

L-System Traversal

For now I have a fixed number of eight children per node, somewhat like an octree but without fixed child positions. The l-system transform gives each child a position, radius and direction (quaternion), relative to the parent. The l-system can be thought of as a complex form of compression, a set of rules which enable the building of a complex system. There are some parallels here to cells and DNA. DNA is a set of rules which in combination with environmental forces shape some very complex things indeed. A large aspect of my core idea for the new engine is to have this duality between a simple set of rules (l-system) and environmental forces (a massive number of players, whose actions and self-made content shape the world). More on this much later... A side note, I'm also thinking of modeling the l-system on cells, and moving to a binary tree (from the current octree).
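The traversal described above can be sketched on the CPU in a few lines of Python. This is only an illustration under stated assumptions: the names (`RULES`, `expand`, `traverse`), the cube-corner child offsets, the 0.5 radius scale, and the stop criterion are all hypothetical, and the real engine runs its traversal on the GPU. The only parts taken from the post are the eight children per node and the per-child position, radius and quaternion relative to the parent.

```python
import math

def quat_mul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw*bw - ax*bx - ay*by - az*bz,
            aw*bx + ax*bw + ay*bz - az*by,
            aw*by - ax*bz + ay*bw + az*bx,
            aw*bz + ax*by - ay*bx + az*bw)

def quat_rotate(q, v):
    """Rotate 3D vector v by unit quaternion q (q * v * conjugate(q))."""
    w, x, y, z = q
    p = quat_mul(quat_mul(q, (0.0, v[0], v[1], v[2])), (w, -x, -y, -z))
    return (p[1], p[2], p[3])

# Hypothetical rule table: eight children, each given as
# (offset, radius scale, rotation) relative to the parent.
# Cube corners are used purely for illustration; the post says
# child positions are NOT fixed like an octree's.
HALF = math.sqrt(0.5)
RULES = [((sx * 0.5, sy * 0.5, sz * 0.5), 0.5, (HALF, 0.0, 0.0, HALF))
         for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]

def expand(node):
    """Apply the rules to one node, yielding its eight children."""
    pos, radius, rot = node
    for offset, scale, child_rot in RULES:
        o = quat_rotate(rot, offset)  # offset lives in the parent's frame
        child_pos = tuple(p + radius * c for p, c in zip(pos, o))
        yield (child_pos, radius * scale, quat_mul(rot, child_rot))

def traverse(root, min_radius):
    """Breadth-first expansion until nodes shrink to min_radius or below."""
    frontier, leaves = [root], []
    while frontier:
        next_frontier = []
        for node in frontier:
            if node[1] <= min_radius:
                leaves.append(node)  # small enough: emit as a point
            else:
                next_frontier.extend(expand(node))
        frontier = next_frontier
    return leaves

root = ((0.0, 0.0, 0.0), 1.0, (1.0, 0.0, 0.0, 0.0))
points = traverse(root, min_radius=0.25)
print(len(points))  # two levels of eight children each: 64 points
```

A breadth-first frontier is used here because each level can be processed as one parallel batch, which is the shape a GPU version would likely want; a real traversal would also cull by visibility rather than only stopping on radius.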
080628 / Micro Polygons

But first...

There is something about seeing the awesome success of others which inspires one's competitive spirit to strive for excellence. For me it is the frontier of GPU parallel programming and graphics.

Voxels Again

Literally could not wait to see more of Carmack's Sparse Voxel Octree, so I and another decided to figure out exactly how to do it on current hardware, and do it nearly as fast as current triangle rendering. Have an answer too and that is all I am saying. Almost sure this method isn't what Carmack had in mind. The idea is in good hands, and I have moved on to even greener pastures (as the idea isn't as non-static geometry friendly as I would like).

CUDA 2.0

NVidia's newest hardware has some very important changes beyond double the registers, double precision, and faster geometry shader. Global memory fetch and store loses stalls on thread bank conflicts. Cost is now proportional to the minimum number of bus granularity accesses required to service the vector access. So free global memory swizzle. However, CUDA 2.0 is still missing PTX's .surf (surface cache). How about one tough open question to all CUDA programmers: what is the fastest way to simulate a Z buffered write in CUDA, especially given the case where multiple threads might or might not write to the same pixel with different Z values? If you have an idea, I would love to get an email...

Realtime Raytracing on ATI's Hardware?

Add in a few other buzz words like voxels, pixels with depth, re-lighting, wavelets, and you have a completely new form of confusion, and I really do like puzzles. Check out the talk on the ompf board. Just what are they using the tessellation hardware for in a DX9 based app? My guess is that we are simply looking at 4D (space and time) realtime re-lighting, with an awfully large amount of pre-computation. Here is a hint, not sure what it means yet.
On the Previous Post

I have no idea how many people actually read this and can get used to how I am constantly running in multiple directions all the time. So a quick update on the previous post: I did manage to quickly prototype my idea of doing all post-processing in one pass with the same samples, and a large amount of it actually works quite well. Given that post processing can be over 30% of the GPU cost of rendering a game's frame and is usually 100% texture bound, there are lots of free ALU cycles to use to figure out where to sample and what to do with the samples. Biggest surprise was how well the frame re-circulation anti-aliasing worked. Moved back to non-polygon rendering again, so this work sleeps for now.

A Very Important Semi-Missing Hardware Feature

I'm not currently interested in this, but anyone with a traditional triangle based renderer should be. Let's just fast forward to when your engine can afford to do cone step mapping (or some other fragment based raytracing). Perhaps this time is even now, because your engine is just that awesome in that it can do cone step mapping, handling the GPU cost by LODing shaders (reverting to less expensive parallax mapping, bump mapping, and simple texturing based on distance). Normally you would be limited to the triangle's hard edges. Perhaps you even went nuts and tried to extend with fins in the geometry shader. There is (on some hardware, and could be on other hardware) a better way. First, what you want is for the triangles to simply be a bounding hull over the surface. When your cone step mapping misses the surface, you tex kill the fragment. Of course the z buffer is still just a triangle surface, so anything which intersects this surface will do so wrongly. Unless of course you write depth, which would kill your frame rate. The problem is that writing depth causes the fast Z-Cull (NVidia) and Hi-Z (ATI) hardware to turn off.
The idea being that there is no way for the hardware to compute min/max Z during coarse rasterization if depth is written. However, if you could tell the hardware that you were only writing depth increasing away from the triangle surface, the coarse and fine rasterization hardware could easily compute minimum Z and do fast Z cull. It just could not do trivial accept (since maximum Z isn't known). Turns out this isn't possible on the NVidia 6 and 7 series hardware (which includes the PS3), but is possible on the 360. Not sure if later NVidia hardware or PC ATI/AMD hardware can do this. Could be an API problem even if the hardware had the feature. So for all the game developers who are reading this, ask your hardware vendor contacts for this feature right now, so it gets in all future hardware. You want this ability to work on all cards; it will enable you to drastically lower your poly count and still have fully detailed objects. You don't want to lose to the rest of us non-polygon rendering people as hardware gets faster.

Another Missing API/Hardware Feature

For those who don't read about this stuff elsewhere on the internet, first a quick refresher on GPU hardware. Fast texture reads come from compressed swizzled textures. This enables great texture cache performance. The swizzling provides 2D locality. You lose swizzle and compression when you read from a frame buffer or render target. Now texture cache misses read in full memory granularity lines (which can be large), so random accesses from a render target can really hurt, especially with 32-bit texels. What I would like to see is the ability to have more than a 4 component frame buffer. Sure you can have up to 8 render targets in DX10, but random fetches from those 8 render targets cause a loss of texture cache efficiency. What is needed is the ability to fetch from all render targets with one global memory read. So render targets would be interleaved in memory, and random access reads would be bandwidth friendly.
This would be hugely useful for GPGPU in that you could randomly load/store objects at full bandwidth efficiency and utilization as long as they matched the GPU memory granularity (which is often 2 FP32 vec4s). This can currently be done, somewhat, with CUDA, if the texture is a linear texture. I still have hope that DX11's compute shader will enable GP global memory read/write from shaders. If/when that is here, this feature could be less useful.

... and finally on to the topic of this post.

Micro Polygons / GPU Scatter Revisited

Take a look at the FragSniffer and refresh your mind about the G80 hardware. GPUs are designed to draw frame buffer aligned 2x2 pixel quads. The parallel hardware takes care of building SIMD vectors of verts for parallel execution, then triangle setup / rasterization packs SIMD vectors of pixel quads for parallel execution, and finally write combines the results in the ROP/OM (output merger). Small triangles lose efficiency really really fast, as many of the pixels in a quad are not actually in the triangle. So fragment shader efficiency tanks and GPU resources are wasted. The great irony of rendered graphics these days is that often, for any given frame, more verts are processed than there are pixels on the screen. If the GPU rasterization hardware can set up just one primitive per clock cycle, that is over 500M tri/s, against under 30M pix/s at 720P at 30Hz. Or in other words, over 16 tris per screen pixel per frame at 720P at 30Hz. You can probably see where this train of thought leads to, but before we start with that insanity, let's go through a few more details. Huge triangle setup ability is needed because triangles often need many passes in the rendering pipeline. For example in a typical non-deferred style engine, the same triangle might get drawn once in a pre-Z pass, another 1-4 times in shadow map generation, and perhaps another 1-4 times in lighting.
Then factor in all the triangles which get culled by the view and by occlusion. So the ability to set up 16 triangles per screen pixel per frame at 720P at 30Hz starts to make sense, even when most triangles are many times larger than a single pixel. Now say we want to evolve our rendering pipeline and draw pixel sized micro polygons. This would be slow on current hardware, right? After all, G80 can only fragment shade 4 pixels per 32-wide SIMD vector, so fragment shaders would run at 1/8 capacity. There is an answer on current DX10 level hardware: don't do work in the fragment shader. Draw points, and do all the fragment shading in the vertex shader using texture fetch, where the GPU is still packing at 100% efficiency for ALU SIMD computation. Set up all vertex shader outputs as GLSL SM4.0 "noperspective" so they don't get interpolated. And use the fragment shader only as a pass-through to get vertex shader outputs to the ROP stage. The ROP stage should be able to re-merge single pixel output well without a large bandwidth cost, so fragment shader and ROP should no longer be a bottleneck. Of course, the vertex shader would have to have enough work or texture fetches to hide the triangle setup bottleneck, which should be possible when shading points with really complex shaders.

Taking Points Seriously

Right, so if you can solve the following problems, a point sized triangle renderer could work.

1. How to anti-alias when there isn't enough triangle setup to fill a MSAA framebuffer?
2. How to handle LOD such that all primitives are point sized?
3. How to ensure minimal overdraw?
4. How to fill holes in the framebuffer when no point rasters to a given pixel?
5. How to only have one drawing pass per point primitive?
6. How to draw over points which fill holes with a pixel which should have been occluded?
7. How to manage occlusion issues?
8. And more...

Already started on a prototype. I have returned to my GPU only engine concept.
Full GPU side L-system tree traversal is working, with visibility and occlusion computations only on the GPU as well. Also moved beyond cubemaps and into an octahedron map which fits into a single texture. Key here is that it can be updated via points with one draw call, whereas a cubemap would require drawing each point to each face. So I'm doing the full 360 degree fisheye projection from the octahedron map from Atom's L-system based renderer. I'm not to the lighting, hole filling, or anti-aliasing yet, just got rough traversal working in the prototype, so don't expect anything awe-inspiring in the screen shots, at least not yet. The octahedron to fisheye projection mapping loses detail towards the edges of the lens, so I'm going to handle that via a lens blur when I get around to it. The noise and aliasing in the system is by design. These shots are simply a debug view of the results of the tree traversal. Actual lighting and rendering is going to be done via a single full screen pass using frame buffer re-circulation (no point drawing). Doing something like 4M points a frame currently (including overdraw). View distances are effectively infinite.
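For readers unfamiliar with octahedron maps, here is a small Python sketch of one common formulation: project a direction onto the octahedron |x|+|y|+|z| = 1, then fold the lower half outward so the whole sphere lands in one square. The post does not say which exact mapping the engine uses, so treat this as an assumption; `oct_encode` and `oct_decode` are hypothetical names.

```python
import math

def sign(v):
    """Sign that treats zero as positive, so folds stay well defined."""
    return 1.0 if v >= 0.0 else -1.0

def oct_encode(d):
    """Map a non-zero direction (x, y, z) to (u, v) in [-1, 1]^2."""
    x, y, z = d
    s = abs(x) + abs(y) + abs(z)
    u, v = x / s, y / s                 # project onto the octahedron
    if z < 0.0:                         # fold the lower pyramid outward
        u, v = (1.0 - abs(v)) * sign(u), (1.0 - abs(u)) * sign(v)
    return u, v

def oct_decode(u, v):
    """Inverse of oct_encode, returning a unit direction."""
    z = 1.0 - abs(u) - abs(v)
    if z < 0.0:                         # undo the fold
        u, v = (1.0 - abs(v)) * sign(u), (1.0 - abs(u)) * sign(v)
    n = math.sqrt(u * u + v * v + z * z)
    return (u / n, v / n, z / n)

def normalize(d):
    n = math.sqrt(sum(c * c for c in d))
    return tuple(c / n for c in d)

directions = [normalize(d) for d in
              [(0, 0, 1), (0, 0, -1), (1, 2, -3), (-1, 0.5, 0.25), (0.3, -0.9, -0.1)]]
round_trip = [oct_decode(*oct_encode(d)) for d in directions]
```

Because every direction maps to exactly one (u, v), a point needs only one write into one texture, which is the property the post relies on when it contrasts this with drawing each point to each cubemap face.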
080524 / Triangles

Started programming outside of work again. Back to prototyping and testing some of my wild rendering ideas.

Comment on Voxels

Old news, but it seems as if Carmack is returning to voxels. Some other thoughts on this on MollyRocket. Carmack does have a really good point there on fixing anti-aliasing by temporal jitter of pixel center position. Damn elegant solution IMO. As for the voxel stuff, it looks to only be good for static geometry and simplifies static content into a compression problem. Should solve the LOD and transparency problem, but at the cost of per pixel searching (raycast or raytrace into octree).

Triangles

In some ways rendering is to radix sort as raytracing/casting is to merge sort. One fundamental difference is searching for an answer to avoid doing work (raytracing), versus doing work to avoid searching for an answer (rendering). So yeah, I'm finally warming up to triangles, and perhaps I've got a few surprises on how to solve old problems given that I spend nearly 99% of my time thinking about non-triangle rendering...

SOLUTION TO LOD AND LARGE VIEW DISTANCE OCCLUSION : I don't like tessellation as it doesn't solve the problem of simplifying disjoint geometry. My solution: build surfaces up with layers of small triangles. Each layer is independent. A new LOD blends in a new fine layer and blends out a previous coarse layer, keeping enough middle layers to ensure a good effect. Coarse layers are mostly inner hulls; fine LOD layers mostly extend the surface, or split into disjoint shapes. Think of it somewhat as a painter paints a scene, laying out rough shapes and color (lower LOD layers), then refining (fine LOD layers). Occlusion culling is an easy and solved problem, with occlusion queries and a hierarchical layered world structure.

RETURNING TO AMORTIZING PIXEL COST IN THE VERTEX SHADER : With the LOD problem solved, one can control the overall size of triangles in a given region on the screen.
Small triangles enable more work to be pushed back to the vertex shader. Even soft shadows and diffuse transparency effects. Gets rid of the need for multiple passes for lighting. All lighting + dynamic global illumination is computed per vertex, interpolating spherical harmonics for single pass per pixel lighting. Keep in mind that at 720P at 30fps, the GPU is solving for only 30 Mpix/sec, but has the capacity of over 200 Mtri/sec in setup. So small triangles are not a problem (until reaching the size of micro triangles, with bad pixel quad utilization).

SINGLE OR DUAL PASS FOR EVERYTHING : If you have a readable Z buffer (DX10, etc), and an effective rough z sort of geometry, then perhaps only one pass is needed, otherwise add a pre-z pass.

POST PROCESSING BUILT INTO THIS SINGLE PASS : The pass involves only 2 RGBA 8-bit MRTs and no blending. First RT is color (perhaps LOGLUV), second RT is normal+extra. The previous frame's results are used to compute the new frame, and the fragment shader gets the current and previous frame pixel position. All post processing works with previous frame feedback, gathering samples from the 2 RTs of the previous frame, of which a full mipmap is generated (very important). Also uses the previous frame's mipmapped depth buffer (also very important). Mipmap generation cost for these 3 textures is only 3 texture lookups per screen pixel total. Post processing effects work by adjusting position and mipmap level of a set number of texture lookups from the previous frame, and adjusting the weighted average of those samples with the resulting computed color for the current fragment. Motion blur, depth of field, anti-aliasing, etc, simply control the distribution of those lookups and parameters.

FRAME FEEDBACK SO TRIANGLES CAN EXCEED THEIR BOUNDS : Similar to how the post processing works, fragments can gather color+normal+extra+depth information from neighboring triangles' fragments through previous frame lookups.
So surfaces can either converge faster than your eye can focus on them, or slowly grow and change dynamically. So texture lookup is no longer ambient+diffuse+spec+bump+etc, but rather parameters which control how a fragment changes based on neighboring fragments. Combine this with fragment kill and binary cutout of textures. Single fragments will be able to generate diffuse effects far beyond their bounds, so I've got no plans for a transparent geometry pass. Blend in/out (key to the LOD system) will be fragment kill based. Frame feedback will ensure this is smooth and seamless without transparency. Other far reaching transparent effects will go into adjusting per vertex lighting (remember small triangles).

Why This Design

So I've wavered between motion stretched billboards (current running engine), forward and deferred rendering (with current engine), and point based rendering (lots of little prototypes), with my last idea being to raycast an impostor cache (no time to fully try). Even with amortized searching in the impostor cache, I was still looking for a better way to use the hardware, something ideal, something with no searching.

Progress on the Prototype

I've spent a few evenings verifying the framebuffer feedback ideas, and have a working prototype with anti-aliasing, fake image space refraction mapping, and initial motion blur. Working on final motion blur and depth of field code currently. Lots of cool stuff here, screen shots next time.
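The anti-aliasing part of the frame feedback idea can be shown with a toy Python sketch: a one-dimensional row of pixels, a hard edge, one jittered sample per pixel per frame, and a history buffer that each new frame is blended into. Everything here is an assumption for illustration (the function names, the van der Corput jitter sequence, the running-average blend weight); the prototype itself works in a fragment shader with previous frame texture lookups.

```python
def van_der_corput(i, base=2):
    """Low-discrepancy sequence in [0, 1), used as the per-frame jitter."""
    result, f = 0.0, 1.0 / base
    while i > 0:
        result += (i % base) * f
        i //= base
        f /= base
    return result

def render_row(width, edge, jitter):
    """One sample per pixel: white left of the edge, black right of it."""
    return [1.0 if (x + jitter) < edge else 0.0 for x in range(width)]

def accumulate(width, edge, frames):
    """Blend each jittered frame into the history (frame feedback)."""
    history = [0.0] * width
    for n in range(1, frames + 1):
        current = render_row(width, edge, van_der_corput(n))
        # Running average. A real-time version would likely clamp this
        # weight to a minimum so that old frames fade out under motion.
        weight = 1.0 / n
        history = [h + (c - h) * weight for h, c in zip(history, current)]
    return history

row = accumulate(width=8, edge=3.3, frames=64)
print(row)
```

Pixels fully on either side of the edge stay at 1.0 or 0.0, while the pixel containing the edge settles near its true coverage of 0.3, even though each individual frame only ever took one binary sample there.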
080426 / Parallel Programming II

Memory and Caches

First let's only address memory bound algorithms; computationally bound algorithms are easy and boring. What sequential access provides over random access is typically better cache line utilization (and hardware prefetch on some platforms). If truly random access is needed, working with cache aligned and sized vectors removes this performance advantage. In this case of random access, performance will be limited by the size of the working space. Working space is bound to exceed L1, as L1 is tiny. I like to think about L1 as simply slower registers, or that L1 speeds up the very short term reuse of memory (for example a GPU's texture cache). The next stage, L2, provides perhaps 10 times less peak bandwidth. L2 is also usually relatively small (say 2MB), and might be shared between cores or hyperthreads. In the past, L2 caught longer term reuse. In the future, it looks as if the large SIMD registers in Larrabee will go directly to L2, as the mentioned 32KB L1 would only be enough to spill 512 registers. So in that case L2 can be thought of as simply slower SIMD registers. Then perhaps there is an L3 which could be another 10 times less peak bandwidth than L2. The final stage is access of main memory, which might only provide 1/400 the effective bandwidth of L1. Now add in TLB stalls and cache synchronization between multiple cores, and large working space random access might be 1/1500 of the effective bandwidth of L1. Note large working sets are not all that uncommon. GPUs are a great example, easily main memory bandwidth limited, with small L1 texture caches (backed by L2) which capture a majority of the texel reuse. Texture reads are effectively either a majority sequential or random, with minor very short term reuse in filtering (hence the need for only small texture caches).
Performance for large working set problems reduces to the problem of factoring out common memory accesses or limiting memory access. Let's put large working set memory reuse into perspective with a made up 2 GHz processor which has a 128 byte cache line, 2 clock cycle L1 throughput/cacheline, a 20 clock L2 throughput, a 200 clock main memory throughput, and 4 way float SIMD throughput of 1 clock per SIMD MADD operation. One obvious limiting factor here is that in no way can one stream out to memory the large solution to any problem faster than a poor 128 bytes * 2 GHz / 200 = 1.28 GB/second on this dummy processor. Now if the solution is compressed, a larger solution can be output per unit time, which hints at why framebuffer compression on GPUs is so important, and why run-time data compression/decompression could become very important in the future. Anyway, continuing our large working set problem whose source working set and output working set don't fit in the cache, say 50% of memory bandwidth is used to store the result and 50% of memory bandwidth is used to transfer the working set used to compute the solution. For the 0.64 GB/sec read (0.16 Gfloats/sec), in order to utilize the 2 GHz * 4 SIMD = 8 Gflops/sec ALU capacity of the processor, 8/0.16 = 50 operations have to be associated on average with each float read. Let's take an easy example of something which could make use of the ALU capacity: an input and output stream of 0.16 Gfloats/sec, and a 48 tap FIR filter. So in this case each output depends on 48 inputs. Sure this is thinking about the problem backwards, but it does point at a general method for solving problems optimally for performance given the constraints of current hardware...

On CPUs Factoring Is Not Just for Computation Anymore

It is almost unthinkable to optimize an algorithm without factoring out redundant computations. Now apply this same concept to your data flow, and you have the solution to your problem.
Factoring data flow and grouping by locality is the way to achieve peak CPU computational utilization on interesting problems with large memory space solutions. So go to a giant white board, list your inputs on the left and your outputs on the right. Then draw a tree or graph in which you factor out and share as many results of intermediate calculations as possible, but keep intermediate results in groups such that they don't exceed either your target cache size or program managed local store size.

Moving on to GPUs and Factoring

Let's open up a can of insanity and dive in with some numbers. According to specs the GeForce 8800 Ultra (G80) has a capacity of 576 GFlops/s and 103.7 GB/s. However it is 1512 MHz * 16 processors * 8 wide SIMD, so the maximum number of instructions is more like 193 Ginst/sec. Output is limited to 14.7 Gpix/sec (ROP) and input is limited to 39.2 Gtex/sec (TEX). Here is the G80 from a CPU perspective,

  16 cores
  32-wide SIMD effective (8 wide actual, 4 clocks per 32 wide vector)
  6 to 24 variable way hyperthreaded
  42 to 10 32-wide SIMD registers per hyperthread
  21 to 5 32-wide SIMD local store (shared memory, but effectively extra registers)
  writes not cached
  general reads not cached
  texture and constant reads cached

The hyperthreading is used to hide memory latency, texture fetch latency, and ALU latency. The 6 way hyperthread minimum is the minimum number of hyperthreads required to hide ALU write-after-read latency. With 6 way hyperthreading, 17-25 instructions are needed to hide the latency of an un-cached load from memory. With 24 way, only 5-7 instructions are needed to hide the latency. All numbers computed from the CUDA docs. Texture fetch latency on a cache hit is not published in the CUDA docs, but peak throughput in terms of cached texture reads looks to be about 0.81 floats/ALUcycle (fastest format). This was calculated by 612 MHz * 64 TEX units * 1 RGBAtexel/cycle * 4 components/texel divided by 1512 MHz * 16 cores * 8 wide SIMD.
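The arithmetic behind the G80 numbers above, written out as a short Python check. The clock rates and unit counts are the ones quoted in the text; the variable names are just for illustration.

```python
# Figures quoted in the post for the GeForce 8800 Ultra (G80).
shader_clock_hz = 1512e6
cores = 16
simd_width = 8            # actual lanes per core per clock
tex_clock_hz = 612e6
tex_units = 64
components_per_texel = 4  # one RGBA texel per unit per clock

# Peak scalar instruction issue rate: about 193 Ginst/sec.
inst_per_sec = shader_clock_hz * cores * simd_width

# Peak texel rate: about 39.2 Gtex/sec.
texels_per_sec = tex_clock_hz * tex_units

# Peak cached texture floats delivered per ALU instruction slot.
floats_per_alu_cycle = texels_per_sec * components_per_texel / inst_per_sec

print(round(inst_per_sec / 1e9, 1))     # 193.5
print(round(texels_per_sec / 1e9, 1))   # 39.2
print(round(floats_per_alu_cycle, 2))   # 0.81
```

In other words, even in the best case there is slightly less than one texture float available per ALU instruction, so a shader needs at least a little more than one instruction per fetched float to keep the ALUs busy.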
GPUs do not provide the opportunity to cache intermediate results on write, other than directly during a computation in registers. All output is uncached and takes away from bandwidth. In the case of the G80, without taking input into consideration, when approaching peak output bandwidth (writing to one render target), it looks like somewhere between 4 and 13 instructions (depending on output format) are available per result from a shader (1 to 4 components). Peak output is limited by the ROP, and only uses around 50% of the card's peak bandwidth (at fastest output format). The importance of data factoring on GPUs is in keeping a high texture cache hit rate, through good data locality. In the case of the G80, it looks as if memory granularity is 32 bytes from the CUDA docs. Which hints that the texture cache has 32 byte cache lines. Which would give the following texel granularity per cache line,

  8x4 texel rect for 8bit/texel compressed formats
  4x2 texel rect for 32bit/texel (8bit RGBA, 16bit RG, 32bit R)
  2x2 texel rect for 64bit/texel (16bit RGBA, 32bit RG)
  2x1 texel rect for 128bit/texel (32bit RGBA)

What should be obvious here is that if a GPU shader needs random access (always cache miss) in a large working space, the best case is FP32 RGBA texture fetch, which is going to waste about 50% of the available bandwidth (INT8 RGBA random fetch wastes 87.5% of bandwidth). Now say you wanted to use at most 64 GB/s (of the card's 103.7 GB/s bandwidth) for texture fetch and also try to meet the card's peak bilinear texture filter rate of 39 Gtex/s. The 64 GB/s provides for 2 Gcachelines/s fetched, which means you need 18.5 cache hits per line after a miss, or a 95% hit rate. Also note that this could only possibly be done reading compressed textures.
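The cache line figures above can be reproduced with a few lines of Python, assuming (as the text does) a 32 byte line. Function and variable names are hypothetical.

```python
CACHE_LINE_BYTES = 32  # memory granularity inferred from the CUDA docs

def texels_per_line(bits_per_texel):
    """How many texels of a given format fit in one cache line."""
    return CACHE_LINE_BYTES * 8 // bits_per_texel

def wasted_fraction(bytes_used):
    """Bandwidth wasted when only bytes_used of each fetched line is needed."""
    return 1.0 - bytes_used / CACHE_LINE_BYTES

# Texel granularity per line: 8 bit -> 32 (8x4), 32 bit -> 8 (4x2),
# 64 bit -> 4 (2x2), 128 bit -> 2 (2x1).
granularity = {bits: texels_per_line(bits) for bits in (8, 32, 64, 128)}

# Random (always miss) fetches: FP32 RGBA uses 16 bytes, INT8 RGBA uses 4.
waste_fp32_rgba = wasted_fraction(16)   # 0.5
waste_int8_rgba = wasted_fraction(4)    # 0.875

# Budget of 64 GB/s for texture fetch against a 39 Gtex/s filter rate.
lines_per_sec = 64e9 / CACHE_LINE_BYTES              # 2 Gcachelines/s
texels_needed_per_line = 39e9 / lines_per_sec        # 19.5 texels per line
hits_after_miss = texels_needed_per_line - 1.0       # 18.5
hit_rate = hits_after_miss / texels_needed_per_line  # about 0.95

# Only formats holding at least that many texels per line can get there.
feasible_formats = [bits for bits, n in granularity.items()
                    if n >= texels_needed_per_line]
print(granularity, round(hit_rate, 3), feasible_formats)
```

The last step spells out why only compressed textures qualify: each fetched line has to supply 19.5 texels on average, and of the formats listed only the 8bit/texel one packs more than 8 texels into a line.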
Exactly Why GPUs Are So Awesome

As working space increases well beyond cache size on a CPU, the CPU stalls and idles because of its lack of ability to hide memory fetch latency, while the GPU keeps on doing work because it can hide uncached memory fetch latency. Out of time, more next time...
080319 / Moving Beyond Rendering in a Vacuum

First, a link to an awesome GDC 2005 paper on implementing next gen effects on the PS2, posted on c0de517e. And another page on a Light Pre-Pass Renderer from Wolfgang Engel.

Parallel Computation ... Continued from Last Time

Interesting CUDA note: if you have to do general scatter or gather (ie non-coalesced) for an algorithm from global memory when you are memory bound, and you can afford the registers, it is better to gather 128-bit values (ie 4 floats) per fetch per thread, because that results in only a 50% memory bandwidth reduction (for compute 1.1 hardware), vs fetching one float at only 12.5% of proper coalesced bandwidth. BTW, it looks like the CUDA 2.0 beta is out and provides stream support for running more than one computation kernel at the same time! No time for more now...

Moving Beyond Rendering in a Vacuum : Ideas from Photography

The Spaces Between post on meshula.net talks about one of my favorite very challenging real-time rendering frontiers: participating media. Light does indeed dance around, and is the single most important element in the real-life rendering of landscape photography. Without an overcast sky, fog, or some other atmospheric interest, the shoot ends after sunrise and begins an hour before sunset. It is not a filter placed on a camera which is important here, but rather the sky and atmosphere which filter the light reaching the camera.

For example, long before dawn while in the Earth's shadow you can increase exposure times and capture scattered light which you cannot see. No shadows, no specular, minor diffuse, with a majority reflected, ambient and scattered. Not simple light distance based attenuation, but rather the color of light changing based on distance. 

Or how about dawn on an overcast day with fog. Again no direct shadow, no specular, but with diffuse lighting from scattered light.
This is something which should be easily renderable with a single "lighting" pass. The most complex part of this scene would be anti-aliasing. 

Now after sunset, mountains back-lit and bathed in scattered light even with a near cloudless sky. This touches on a different way of thinking about high dynamic range, in that objects are lit by the scattered light of a very high intensity light source instead of simply adding an HDR glare post-processing effect. 

Or simplified without color, fog and mist from the sea, and in this case with shadow during early morning. Tough to simulate, usually done with non-lit particle systems constrained by overdraw, and lacking in dynamic shadowing. 

And in some cases the participating media is even solid, such as scattered light passing through back-lit leaves in Autumn. 

Application

So why does it take being on now relatively constrained hardware such as the PS2, with a fixed camera system, for someone to get this right in a video game (meaning God of War II)? And how can the boundary be pushed even further? In case it wasn't obvious, the real-time solution isn't raytracing.

Possible Evolution of the Atom Engine

This is an idea which has been keeping me awake at night. One common theme of effective programming of highly parallel SIMD machines (ie GPUs, Cell, and in the future Larrabee) is to remove searching and simply brute force a solution. Searching is divergent while brute force is parallel and localized. Algorithmic divide and conquer is now in the form of keeping large SIMD groupings of objects. If a SIMD grouping has elements which apply to two cases, you simply run both cases and predicate the results (as that is usually faster than regrouping). Regrouping should be limited and amortized over time.
Computationally my original engine concept worked out because while I was compositing up to 16 layers per pixel, pixels were grouped into billboards (ie great SIMD), and the per pixel cost was small and TEX limited by two localized texture lookups. Then I diverged in the effort to increase per pixel detail, branching out into this concept of splatting say 64K control points into a framebuffer, followed by a large hole filling pass to generate a reprojection vector field, and lastly reprojecting the previous frame's contents and converging the results to the new frame. In my divergent engine, I broke one of my primary rules: never do any non-temporal-amortized per pixel searching! Even though my hole filling pass had a logarithmic, spatially amortized cost per pixel, it was still too expensive. Think of this in terms of the final pass of a bilateral upsample: you have to read at least the results from the nearest four parents in the next smaller mipmap level, which in my case resulted in 8 TEX lookups per pixel, which is already 50% of the cost of my previous engine.

Let's go back to the core idea from the divergent engine, taking a few obvious lessons from video compression: temporal (and to a lesser extent spatial) locality of the results of a computation is key to compression, and compression points to the pattern for reduction of work. Video compressors reproject blocks of the previous frame to the new frame, and then make mostly minor adjustments to the blocks to converge them to the new image. The solution I believe is to again turn searching into brute force, returning to the original Atom engine, but doing temporally amortized rendering of a high detail impostor cache (ie color+alpha+normal) for the impostor textures which get composited to the screen. Impostors form an overlapped reprojection, and would get recomputed as needed based on visual priority.
Updating the impostor cache would be GPU SIMD ideal, with 16x16 aligned blocks in a texture and something similar to raycasting with a fixed number of iterations (the key is that the raycasting would be into a small local area and not shoot through the entire scene; also, if you read into this, I am hinting at exactly why screen overlap of the impostors is a solved problem). The rendering pipeline would be the same, drawing impostors front first (reverse alpha blending), and lighting with something similar to one simple spherical harmonic per impostor. More next time...
080223 / Human Head + Parallel Programming

As can perhaps be guessed from the lack of blog posts, I've been taking a break from Atom for a while.

Working for Human Head Studios

In the meantime I've started working at Human Head Studios in Madison, Wisconsin. It is simply awesome working with like-minded people, especially the Head's render guru, Brian Karis. Of course I cannot say anything about what I am doing, but I will say that I really do enjoy programming for the consoles...

What About the Atom Blog?

Right now I am living in two cities (which are 160 miles apart). Between working during the week, lifting, getting back into MMA, and driving back to Chicago on the weekends, my time is spent. Once I finish my transition, the blog will get some regular updates...

Atom's Time will Still Come.

Atom is still alive (but now only worked on in free time), and will some day get finished, but not in 2008. This is actually a really good thing for Atom. Atom really needs OpenGL3 Mt Evans to be finished before driver support for AMD/ATI and Apple catches up to what I am doing.

Thoughts on Transitioning to the Near Future of Parallel Programming

This post is inspired by Mike Acton's great SPU shader talk. My personal programming interests are almost exclusively geared towards parallel programming. With the PS3, the unified shader model of DX10/SM4.0, and CUDA, parallel programming is just starting to get really interesting, and useful. With Larrabee on its way and some unknown answer from NVidia and AMD, the future looks very bright indeed. The age of hardware which auto-parallelizes a serial instruction stream, and relieves the programmer from thinking about parallel memory access patterns, is quickly coming to an end. The new parallel age has some new important factors:

Vectorization. Vastly underutilized in the general case because we humans mostly think in single serial tasks.
The concept here is to do one serial task to N independent things at the same time (a vector length of N). Where N is say 4 (SSE/Altivec) or perhaps 16 (Larrabee) or 32 (CUDA), the GPGPU age is bringing us larger vectors. The hardware fast path has no intra-vector communication between the N things (SOA style): fast aligned vector memory accesses (parallel). In some cases limited intra-vector communication can be found (ie splat, interleaving, swizzle, fully generalized permutation, etc) with performance dependent on the pattern of communication. Ultimately for portability to different N's, only very limited intra-vector communication should ever be used, and only between small (like 2 or 4) power of 2 sized and aligned sub groups within the vector. Practical example is the paired computation of gradients in GPU pixel/fragment shaders. Fully generalized communication, scatter and gather, is usually very expensive, with a few notable exceptions. First 1-to-N gather (splat, or all vectors reading the same value) can be somewhat fast. For example reading constants on GPU. Also useful for working in a very wide tree (a large fixed power of 2 number of children per node) to have all children do a computation with respect to the parent. Second, GPU texture units provide N-to-N gather. The important factor here is that gather runs in parallel with ALU, but is not fully pipelined and has a latency even on a cache hit (hence the factor of ALU to TEX ratio). Also GPU vertex input gathers (in the form of indexed vertexes). CPU side gather can be very expensive, not parallelized and thus taking ALU cycles and often requiring many round trips to memory. Take Altivec (2+3*N memory operations): store the vector with N indexes to memory, do something else (to hide write-read stall), read N indexes, fetch from memory pointed to by N indexes, store N fetched values into a new vector in memory, do something else (to hide write-read stall), fetch vector. 
Write-through/write-combining caches can increase the write-read stall.

Parallel Cores. Many aspects of importance here: number of cores, number of hardware threads feeding a single ALU unit in the core, and issues with latency hiding. Obviously, for an algorithm to be portable, it has to deal with scaling to a variable number of cores/threads. There are all sorts of different latencies in a core, and the solution to dealing with latency is to schedule independent instructions to hide it. One obvious choice to hide instruction latency is simply to extend the vector length in software (for example N=32 in CUDA when hardware N=8). Also, in architectures with large cache lines, this can help with better cache utilization. Having multiple hardware threads feeding a single vector ALU unit helps hide instruction latency, but can place more pressure on any shared caches. However, when all the threads run the same program, instruction cache utilization can be much better, and often better data locality can be found as well. Then there is the issue of how the instructions of the threads are multiplexed into the single ALU unit. Does the hardware multiplex instructions prior to verifying that the instructions won't stall (ie CUDA), or can the core re-order instructions (SSE/PC), or will a later stall either stall all threads or result in a flush of all the instructions from the offending thread from the multiplexed instruction stream, leaving bubbles in the pipeline?

Memory Access and Patterns. Most algorithms are memory bound, and thus data design, locality, and memory access patterns become key in performance. Two cases: first, you manage your own cache or local store (PS3, CUDA), or second, you trick an actual cache into doing what you want (prefetching, etc). There are a few important factors here: latency to start a memory transfer (DMA, or L2 prefetch), number of simultaneous queued transactions, size of the transfer (in the case of DMA) and throughput.
An algorithm designed for the software managed cache or local store case also ports to the case of a hardware cache, and ensures great cache locality.

Synchronization. Sync points require either a blocking/spin lock, a cooperative task switch (do something else), or a heavyweight task switch. In all cases sync points are expensive and should be minimized. Atomic operations on all platforms provide a relatively lightweight way to solve a majority of sync issues.

Next Time

That was just a somewhat random introduction, next time it is time to start on specifics including useful parallel coding patterns and algorithms...
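Going back to the CPU side gather cost under Vectorization, the 2+3*N count for the Altivec style sequence can be made concrete with a toy model. This is only a sketch in Python that counts memory operations; `emulated_gather` is a made-up name, and nothing here models real Altivec instructions or stalls.

```python
def emulated_gather(memory, indexes):
    """Emulate an N-wide gather on hardware with no gather instruction.

    Returns the gathered vector and the number of memory operations used."""
    ops = 0
    scratch = list(indexes)      # store the index vector to memory: 1 vector store
    ops += 1
    staged = []
    for i in range(len(scratch)):
        idx = scratch[i]         # scalar read of one index
        ops += 1
        value = memory[idx]      # scalar fetch from the indexed address
        ops += 1
        staged.append(value)     # scalar store into the staging vector in memory
        ops += 1
    result = tuple(staged)       # fetch the staged values back as one vector
    ops += 1
    return result, ops

if __name__ == "__main__":
    memory = list(range(100, 116))
    vector, ops = emulated_gather(memory, [3, 0, 7, 7])
    print(vector, ops)
```

For N=4 the count is 2+3*4 = 14 memory operations for what a GPU texture unit does as one N-to-N gather running in parallel with the ALU.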
080114 / XP Install for Unix People

Finally got XP installed, and I'm swapping cables for dual boot. So in the future I'll be posting speed differences between the Linux and Windows versions of Atom when profiling on the exact same machine.

MinGW Install the Easy Way!

For those who keep their Windows machines quarantined from the Internet (for good reason these days), the automated MinGW Installer will not work, as it grabs the source packages from the Internet. Turns out manually installing from the packages is even easier!

1. On your trusty MacOSX or Unix machine, go to the MinGW SourceForge download page.

2. MINGW: Download all the latest non-source files for: GCC Version 4, GNU Binutils, GNU Make, GNU Source-Level Debugger, MinGW API for MS-Windows, MinGW Runtime, and MinGW Utilities. Decompress all these files into a mingw directory.

cd downloads ... go to where you downloaded the files
mkdir mingw ... make new path
cd mingw ... go to path
tar xjf ../gdb-6.7.50.20071127-mingw.tar.bz2 ... untar bzip2 files, do this for all tar.bz2 files
tar xzf ../mingw-runtime-3.14.tar.gz ... untar, will need to do this for all the other tar.gz files as well
gunzip ../libgcc_sjlj_1.dll.gz ... make sure to uncompress everything

3. MSYS: Download all the latest non-source files for: MSYS Base System, and MSYS Supplementary Tools. Decompress all these files into a msys directory.

cd downloads ... go to where you downloaded the files
mkdir msys ... make new path
cd msys ... go to path
tar xjf ../bash-3.1-MSYS-1.0.11-1.tar.bz2 ... untar bzip2 files, do this for all tar.bz2 files

4. Transfer the mingw and msys directories to your insecure Windows machine's C: drive. I prefer to use a Flash card to do this.

5. On Windows use your favorite text editor (I use JEdit or NEdit, both excellent text and source editors) to save the following in C:\msys\etc\fstab (BTW, the lower case c and correct / are important).

#/etc/fstab
c:/mingw /mingw

6. That is it.
Now run c:\msys\msys.bat to get a terminal window, and you are all set with an easy Unix-like development environment. Note, due to the way they did their GCC 4.0 naming, gcc and g++ are run using gcc-sjlj and g++-sjlj, which is a little odd, but you can always manually go in and rename things if you want to...

Helpful Tip for Mac/Linux Users Installing XP

Vendors such as Intel often forget (or ignore) that people without Windows might need access to their drivers for a first time install (...press F6 when installing XP...), and only package their drivers in an EXE file. In some cases this EXE file is simply a self-extracting ZIP, so running unzip in a terminal can be used on MacOSX/Unix/BSD OSes to extract the files from the EXE. The unzip program will note that the file isn't a ZIP file, but will go ahead and search for a ZIP file header embedded in the file anyway, and extract the contents if found.

Also, in the case of providing an XP install floppy for disk drivers, often the EXE file is not a self-extracting ZIP file, but actually a program which writes a floppy disk image. If you are lucky, the floppy disk image will be contained in the EXE file as an embedded ZIP file, which can be extracted using unzip. Then using dd, this extracted image can be written to a floppy. Here is an example of writing the image on MacOSX to a USB floppy drive. First insert a floppy into the drive, then do the following in a terminal window.

su ... become the super user
diskutil ... used to locate the USB floppy device (which is /dev/disk3 on my Mac)
umount /dev/disk3 ... manual unmount required before dd
dd if=/path/to/floppy/image of=/dev/disk3 ... and done
080108 / First Post of 2008

As can probably be guessed from the lack of blog posts, I've been too busy with other stuff, like getting ready for the 2008 art show season, and getting out the sixth release of my Darkroom Tools for Photoshop. 

Back to Linux and Windows Development

As much as I would like to avoid it, it is about time I set up a Windows XP machine again for development purposes. Luckily this time I won't need Visual Studio or Cygwin, thanks to the great work of the MinGW team.

PS3 + Ratchet and Clank Future Tools of Destruction

Got a PS3 and Ratchet. Was impressed at what Insomniac managed to do with their 2nd PS3 title. A lot can be gathered about their engine from the Insomniac R&D Page. Probably the most striking graphical aspect of the game is how they "created" lighting simply by how the artists produced different environment maps on all the static geometry. This in combination with the static precalculated lightmaps and fragment shader displacement mapping (seems to be the case), is simply awesome and damn fast. With creativity limited purely by available time and deadlines, it will be interesting to see how they evolve the engine and make better use of the SPUs in future games. The use of a rigid body physics engine to handle moving of boxes and debris was cool. The unconventional use of always view-aligned billboarding for the tree leaves was also interesting. I'm guessing these are drawn using the game's particle engine, but still might have been used in the static geometry's occlusion database. One thing I'm wondering about is why mobyes couldn't use the same environment mapping as the static geometry. Perhaps this was on purpose so it would be easier to find all the dynamic interactive parts of the levels?
071207 / G84

Just some quick summary notes on the NVidia G84 for my own reference. Tested with an 8600 GTS clocked to 730MHz core / 1460MHz SPU / 2.26 DDR.

32 SPUs at 1460MHz
16 TEX units at 730MHz
8 ROP units at 730MHz
Texture Performance

Nearest filtering has no advantage, bilinear is free. Trilinear is roughly double the cost of bilinear. 64bit texels should have 50% of the performance of a 32bit texel. 128bit texels should have 25% of the performance of a 32bit texel. All formats were tested with a sequential access sum of 3x3 (9 total) texels, using a 2x2 texture, and a second batch of tests with a 2048x2048 source texture. Both tests output to a 2048x2048 FBO.

Bilinear results (percentages for 64bit and 128bit texels are relative to the expected 1/2 and 1/4 rates),

Max possible bilinear rate = 16 TEX units at 730MHz = 11.6 Gtex/sec.
<=32bit texels - L8,L16F,L32F,LA8,LA16F,RGBA8 : ~7.6-8.0 Gtex/sec, 65-69% of max.
64bit texels - LA32F,RGBA16F : ~5.8 Gtex/sec, 99% of max.
128bit texels - RGBA32F : ~2.7 Gtex/sec, 93% of max.

Random notes,

Strange reduction in performance for <=32bit texel types.
For bilinear filtering, 64bit and 128bit texel performance is ideal.
Trilinear textures with forced bilinear LOD levels are at bilinear speeds.
Trilinear performance is bandwidth limited? with the tested 9 texel sequential access.
Typical trilinear performance is 30-40% off expected performance of 2x2 texture.

ROP Performance

Assuming that memory read/write rates of 64bit and 128bit pixels are 1/2 and 1/4 of the 32bit pixel rates. Also assuming that blend costs of 16bit and 32bit are 2x and 4x the 8bit rate. Guessing L32F is going to be blend limited but not write limited, RGBA32F is going to be both blend and memory limited (guessing blend and memory latency is additive), but that RGBA16F will be a fast path with memory latency fully masked by the blend latency. Tested writing to a 2048x2048 FBO.

Max possible blend rate = 8 ROP units at 730MHz = 5.8 Gpix/sec.

Without blending,

L8,L16F,LA8,LA16F,RGBA8 : ~5.1 Gpix/sec, 88% of max (5.8 Gpix/sec).
L32F : ~3.3 Gpix/sec, 57% of max (5.8 Gpix/sec).
RGBA16F : ~2.7 Gpix/sec, 93% of max (2.9 Gpix/sec).
LA32F : ~2.1 Gpix/sec, 72% of max (2.9 Gpix/sec).
RGBA32F : ~1.2 Gpix/sec, 82% of max (1.45 Gpix/sec).

With glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA),

L8, RGBA8 : ~2.9 Gpix/sec, 50% of max (5.8 Gpix/sec).
L16F : ~2.7 Gpix/sec, 93% of max (2.9 Gpix/sec).
L32F : ~1.4 Gpix/sec, 97% of max (1.45 Gpix/sec).
RGBA16F : ~2.7 Gpix/sec, 94% of max (2.9 Gpix/sec).
RGBA32F : ~0.364 Gpix/sec, 99% of max (0.365 Gpix/sec, see comments above).

Random notes,

ROP blends must always use a 2 clock 16bit blend even with 8bit texels.
Write without blend rates are odd, especially L32F.

Other Random Notes

Using this as reference.

Apparently Z-Cull is much more effective on the G84 compared to the G80.
The G84 can do something like 510 Mtri/sec max.
The G84 can do something like 68 Mpoints/sec max at 1pix, 48 Mpts/s max at 4pix.
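The "max possible" figures in the texture and ROP sections are just unit count times clock, scaled down for wider formats. A small Python sketch of that arithmetic follows; the helper names are made up, and the halving per doubling of width past 32 bits is the assumption stated in these notes, not a measured fact.

```python
CORE_MHZ = 730   # core clock of the tested 8600 GTS
TEX_UNITS = 16
ROP_UNITS = 8

def peak_rate_g(units, mhz, bits=32):
    """Peak rate in G-elements/sec, assuming the rate halves for each
    doubling of element width past 32 bits (assumption from the notes)."""
    scale = max(1, bits // 32)
    return units * mhz / 1000.0 / scale

def percent_of_peak(measured_g, peak_g):
    """Measured rate as a percentage of the theoretical peak."""
    return 100.0 * measured_g / peak_g

if __name__ == "__main__":
    for bits, measured in ((64, 5.8), (128, 2.7)):
        peak = peak_rate_g(TEX_UNITS, CORE_MHZ, bits)
        print(f"TEX {bits}-bit: peak {peak:.2f} Gtex/s, measured {percent_of_peak(measured, peak):.0f}%")
    rop_peak = peak_rate_g(ROP_UNITS, CORE_MHZ)
    print(f"ROP 32-bit: peak {rop_peak:.2f} Gpix/s, measured {percent_of_peak(5.1, rop_peak):.0f}%")
```

The notes round the peaks down (11.6 and 5.8 rather than 11.68 and 5.84), so percentages computed this way can differ from the listed ones by about a point.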
071204 / GPU Only II

Got some very rough painting with CFD prototypes working...

Hybrid Adaptive Grid and Particle Based CFD

I'm still working out the details of the new GPU only engine through some rough prototyping which will converge into the finished engine. Initial tests of drawing with simple point emitters into a CFD grid have shown promising results. Looks like the story will end with some coupling of my previous particle based physics/CFD engine with a grid based CFD which will be used for drawing. One thing is for sure, I'm going to be using a view/detail adaptive grid for computation which isn't going to be using any of the hardware tri-linear filtering. I'm going to try doing a single adaptive grid for both particles and drawing, but I might have to resort to my previous separate particle engine with a "fake" 2.5D CFD for screen pixels only. Either way I'm going to use the particle connectivity constraints to handle semi-rigid bodies and the compressible aspects of fluid flow. Looks like the basic fast uniform grid based CFD is always texture limited and has a very bad ALU to TEX ratio. I'm hoping I can hide nearly all the extra math required for my non-uniform grid in texture fetch latency.

Also worked out a full design for frame remapping, a fully dynamic L-system tree, and occlusion culling (using L-system tree pruning), all on the GPU with no CPU work other than setting up drawing calls. Dynamic allocation was tough to solve. Ultimately I decided upon a method very similar to the histo-pyramids method by Gernot Ziegler. My concept uses two histograms and requires an extra mipmap expansion pass (mipmap generation in reverse). I had to accept one problem however: in the case where I have more new nodes than free nodes available in the free node cache, there is a random pruning of new nodes to fit the free nodes. I'm trying to see if I can adjust so the random pruning removes the largest Z nodes first... Lots more work to go...
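For readers who have not seen histo-pyramids, here is a minimal CPU sketch in Python of the basic idea applied to free node allocation. It is a plain 1D version of the base technique, not the two histogram variant with the expansion pass described above; the function names are made up, and when there are more new nodes than free slots it simply drops the overflow instead of pruning randomly.

```python
def build_pyramid(flags):
    """Sum pyramid over a power-of-two list of 0/1 flags.

    levels[0] is the flags, levels[-1] is a single element holding the total."""
    levels = [list(flags)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] + prev[i + 1] for i in range(0, len(prev), 2)])
    return levels

def kth_set_index(levels, k):
    """Index of the k-th (0-based) set flag, found by walking the pyramid top-down."""
    idx = 0
    for level in reversed(levels[:-1]):
        idx *= 2                  # move to the left child at the finer level
        left = level[idx]
        if k >= left:             # target lies in the right child
            k -= left
            idx += 1
    return idx

def assign_free_slots(free_flags, new_node_count):
    """Give each new node a free slot; returns (slots, number of nodes dropped)."""
    levels = build_pyramid(free_flags)
    total_free = levels[-1][0]
    placed = min(new_node_count, total_free)
    slots = [kth_set_index(levels, k) for k in range(placed)]
    return slots, new_node_count - placed

if __name__ == "__main__":
    free = [0, 1, 1, 0, 0, 0, 1, 0]   # 1 marks a free node slot
    print(assign_free_slots(free, 5))
```

Each lookup is independent of the others, which is what makes the scheme a good fit for running one lookup per new node in parallel on the GPU.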
071130 / GPU Only

Domino effect ... solving one problem yields a solution to many. This started with figuring out how to blend normals into a G-Buffer, and thus being able to reduce shading operations to one per pixel, even with semi-transparency. And this looks to be ending with the ultimate form of optimization, extremely tight coupling of graphics to combine physics+CFD on the GPU only.

Infinite World Size With Single Precision

First stumbling block: how to represent an infinitely large world in single precision using an L-system. I've been trying to solve this for about a year in the background of my mind, and now believe I have the solution. Numerically the precision of location diverges when expanding down the L-system tree. So children of the same parent are positioned with very good precision relative to the parent. However, children which may be adjacent in screen space can traverse completely separate paths (in the worst case having no common parents other than the root node), and this suffers from precision problems the deeper one traverses down the L-system tree. The solution is to somehow refine the locations of unrelated children. Of course unrelated children nodes are all dynamic and thus ultimately have no set connectivity. Well, in projected screen space temporal connectivity is semi-consistent, and this ultimately gives the solution: provide general positional constraints based on distances from any nearby node. So as nodes get closer to the eye (in an eye relative coordinate system), the constraints relatively correctly refine the positions. This type of system fits perfectly into how I'm already doing physics calculations. If this works, it removes the need for double precision, enables this to be ported to the GPU, and removes the bottleneck of transferring data between the CPU and GPU.

Sorting, Occlusion, and Overdraw

The next stack of problems in porting code to the GPU.
GPU sorting is simply a bad idea unless you use a non-graphics interface (like CUDA) where a fast radix sort could be done on the GPU. Sorting, besides being needed for drawing, is also required for my current solution to hidden surface removal. So removing sorting requires a fundamental change in the engine, something which would probably be easier to solve by working the problem backwards and keeping an open mind for massive changes. Without sorting, the answer has to use the Z-buffer, and transparency is either additive, multiplicative, or simply not done. Without overdraw control and culling, drawing triangles is out (too much overdraw). Which leaves points (at 1024x1024 the midrange NVidia 8600 GTS can easily render 100M points/sec), and leaves a new problem: how to fill the area around the points. Seems both stupid and impossible, right? After all, a majority of the screen would be empty, and all sorts of occluded points would be littering the screen, talk about a nightmare of a problem. Also, all my cool diffuse effects (volumetric fog, etc) are the result of drawing the larger parents of the child nodes. Which would compound the problem by somehow needing to combine in some kind of multi-resolution scheme. I got stuck at this point back at the original engine design a long time ago. About a month ago I started working on a possible solution, using a multi-resolution method which ultimately had performance issues. So perhaps it is best to defer the solution to this problem (solution below), and work on something else, which is exactly what I did.

Undoing Stack Based Tree Traversal

Stack based tree traversal is also a no-no on the GPU, but is required for how I do physics and rendering. The only way to solve this is to either traverse the tree for each and every node (at O(log n) cost), or find a way to remove the dependency between the current frame child and parent. Or better yet, simply delay the dependency, so that the child could use the parent's results from the previous frame.
Of course this creates a one frame delay (per level) as information propagates up and down the tree. Ultimately the fast path, delayed dependency, is the only acceptable solution performance-wise for the number of nodes I want to do. With the frame delay, because of the time lag, the parent nodes can no longer be drawn on the screen. While this shouldn't be a problem with the physics code, it provides a new constraint, no drawing the parents, which both simplifies the drawing problem and creates a new problem.

Drawing a Solution to the Problems of Rasterizing Implicit Volumes

The advantage of inexact implicit volumes (which is exactly what Atom renders) is that there is no need for a precise solution, and the solution can change over time (everything being dynamic is core to the game). However, what is absolutely needed is a consistent solution between frames (and relatively large-scale consistency between networked players). Consistency between frames hints at a possible solution in the form of re-using the previous frame to render the next frame. This has the added benefit of built-in optimization through temporal coherence. Also, given the fisheye projection and the lighting requirement of needing to render all sides of a cubemap per frame, I don't have to worry about border conditions of having large amounts of new geometry appearing from the sides of the screen. Only a relatively small amount of new surfaces would be produced each frame.

Going back to the drawing with points constraint for which I never had a good solution, and expanding upon the idea of temporal coherence: if all viewable surfaces are known from the previous frame, it becomes easy to handle removing occluded points on the new frame, and if necessary connecting the visible points into a surface. The method for doing this is a depth aware pyramid image processing algorithm. However, why stop there?
The idea of recycling the results of the previous frame, or computation, exactly parallels that of an iterative grid based CFD solver, where each pixel is a grid element. Hmm, CFD smoke and other implicit volumes are easy to create given a few emissive seeds, or points. And this is exactly the construct I believe is the solution. No overdraw, point based, fully dynamic, and completely insane! Exactly what I am looking for, not to mention excellent texture cache performance, and perfect for the GPU.

Fully Substituting Dynamics for Animation

I'm currently using animated L-systems as a factoring method for content creation. However, large amounts of animation take time and resources which I simply don't have as a single developer. Besides, I would like to have the option of enabling user generated content, allowing users to build things in the Atom world. So animation has to go, simply too time consuming. But dynamic movement has to stay, just without being done with playback of pre-computed animation. I think the solution to this is to design the static L-system in an unstable state dynamically, so it cannot ever fully return to being static.

Motion Blur

Motion blur couldn't be done with my current method of stretching and alpha blending geometry. However, since I have the best estimate of both the location of each new pixel on the previous frame, and the z, current per pixel depth aware image space motion blur should work fine here. It will also allow for more user frame-rate and resolution control.

Putting it all Together

This is my project for December, morphing the Atom Engine into its finished form based on these new ideas, then January to start content creation, while I finish up the networking, audio, and server code.
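The delayed dependency idea from Undoing Stack Based Tree Traversal can be sketched on the CPU in a few lines of Python. This is only an illustration with made-up names and 1D positions: every node reads its parent's result from the previous frame, so a frame is a flat parallel loop with no stack, and a change takes one frame per tree level to propagate.

```python
def step(parent, local_offset, prev_world):
    """One frame: each node uses only its parent's world position from the previous frame."""
    new_world = []
    for i in range(len(parent)):
        if parent[i] < 0:                      # root has no parent
            new_world.append(local_offset[i])
        else:                                  # child = parent's old result + own offset
            new_world.append(prev_world[parent[i]] + local_offset[i])
    return new_world

def run_frames(parent, local_offset, frames):
    """Run several frames with double buffering, starting from all zeros."""
    world = [0.0] * len(parent)
    for _ in range(frames):
        world = step(parent, local_offset, world)
    return world

if __name__ == "__main__":
    parent = [-1, 0, 1]          # a chain: root -> child -> grandchild
    local = [10.0, 1.0, 0.5]     # parent relative offsets (1D for simplicity)
    for frames in (1, 2, 3):
        print(frames, run_frames(parent, local, frames))
```

The loop body has no dependency between nodes within a frame, which is the property that maps onto the GPU; the price is that the grandchild only settles on the third frame.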
071126 / Optimization and More

Random Image... Random Thoughts...

Better Deferred Rendering Anti-Aliasing

No, not for Atom, but for everyone else! For those who have normals stored in their G-Buffer, here is an idea on how you can get really high quality anti-aliasing with an image space post process, without using hardware anti-aliasing. First take the per pixel normal, project it into screen space, and re-normalize. Then take this screen space normal, and rotate it 90 degrees (compute the vector perpendicular to the screen space normal). This perpendicular vector will be in the direction of the strongest "edge" at the pixel. Now, like screen space motion blur, sample from the screen in the direction of this perpendicular vector, and intelligently take a weighted average of a few screen samples along both the positive and negative directions on this vector. I would have to work out the details, but the concept should work really well.

Limitations of 32-bit Floats on Effective Map/Level Size

I was once tempted to move my double precision code to the GPU, so I decided to look at the effective maximum fully interactive world size for single precision. Now the average human walks at 4 mph (or 70.4 inches/sec). In this context, from personal experience, about 1/32 inch in positional precision is needed in order to render objects close to the screen without problems. With this in mind, the 24 bits of precision of FP32 gives 2^24/32 = 524288 inches, or a little over 8.2 miles (13.3 km) in one direction (there is also a sign bit). According to some forum posts, Crysis sports something like a 16 km viewing distance, which seems to support the idea that they are pushing the upper limits of map size for an FPS using single precision.

Atom Update

I've found an interesting easy solution to the Z-sort order popup problem which has been an unresolved issue in the Atom engine from the get-go.
The answer is in sorting incorrectly based on particle size, didn't even need to go to a separate Z write pass. Yep, I'm doing an incorrect Z sort which ensures that the soft edge seams between particles always have another particle in front to mask the z-order change popup issues. Works amazingly well (since I have hierarchical particles) and only requires a simple pre-sort Z offset change based on particle radius. This discovery, in combination with getting the alpha blended deferred shading working, has now ended my R&D tangent, and I'm back to meeting my impending December 31st deadline for finishing up the graphics/CFD engine optimization. Deferred shading has also enabled me to bring back FP16 HDR (with tri-linear env lookups) at virtually no cost, and has moved the performance bottleneck to the ROP (fill rate limited during alpha blending to the G-Buffer). Drawing in reverse painter's order (front first) and using the stencil to cull fragments should enable me to control the fill rate (which works great in my non-deferred engine path). Seems like the G80 can stencil clip about 8x its fill rate, and this should be even better on the G84 and G92. Because of the duplicate alphas, I need 3 FP16 MRTs currently. I'm attempting to move to 3 INT8 MRTs in the cubemap path (should need less precision for my "normals").

SSE2 Optimization Using Double Precision

Atom uses an L-system where each particle in the world tree has its own coordinate system (parent relative rotational matrix, offset, and scale). There are basically 2 primary per frame parts of the engine which operate on this tree on the CPU using double precision. This is something which simply would never map well to the GPU because of its precision requirements and tree recursive structure (children results depend on the results of the full parent tree).
I'm trying to get as near as possible to the 2 double precision flops per cycle (best case, cached, etc) which is possible on Intel's Core arch using SSE2, for all my core per particle algorithms which are CPU side. Ultimately SSE2 double precision performance is limited by pack/unpack and swizzles. In going with SOA (structure of arrays) layout, I've now taken a page from the CPU raytracing packet optimization concept and am working with a fixed set of 8 child nodes per parent node at a time. This is requiring a complete re-write of nearly everything, but in the end should also help vastly improve cache performance due to better locality of data (which probably will be more of an improvement overall than getting 2 DP ops/cycle best case). There are a bunch of key SSE instructions for double precision which help in interfacing with non-SOA data structures. For example MOVDDUP (strictly an SSE3 addition rather than SSE2, but available on Core), which can load a 64-bit aligned double into both the hi and lo parts of an XMM register (ie, great for doing a matrix multiply of the same parent rotation matrix times 8 child matrices or vectors in SOA form). Likewise MOVHPD+MOVLPD also work with 64-bit aligned doubles for memory operands. In one case, SSE2 has enabled me to hide the cost of unpacking and converting GPU single precision readback results into double precision SOAs, which ended up being a little under a 2x performance improvement for this case.
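The single precision world size estimate earlier in this post (24 bits of precision at 1/32 inch per step) is easy to sanity check. Below is a minimal Python sketch, not engine code: the helper names `to_float32` and `step_survives` are made up for illustration, and the `struct` round trip is just a convenient way to get FP32 rounding on the CPU.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest IEEE 754 single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

STEP = 1.0 / 32.0              # required positional precision, in inches
LIMIT_INCHES = 2 ** 24 / 32    # 24 significand bits at 1/32 inch per step
INCHES_PER_MILE = 63360.0
KM_PER_INCH = 2.54e-5

def step_survives(position_inches: float) -> bool:
    """True if moving by one 1/32 inch step still changes the FP32 value."""
    return to_float32(position_inches + STEP) != to_float32(position_inches)

print(LIMIT_INCHES)                        # 524288.0 inches
print(LIMIT_INCHES / INCHES_PER_MILE)      # a little over 8.2 miles
print(LIMIT_INCHES * KM_PER_INCH)          # about 13.3 km
print(step_survives(LIMIT_INCHES - 1.0))   # just inside the limit
print(step_survives(LIMIT_INCHES))         # at the limit the step is lost
```

Just inside 524288 inches a 1/32 inch step still lands on a distinct FP32 value; at the limit the spacing between representable values doubles to 1/16 inch and the step rounds away.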
071121 / Deferred Shading III previous | next More from the deferred shading prototype path, this time changing material properties. Pre-Thanksgiving Eye Candy Feast Deferred shading is proving to be quite a gem indeed. I'm still learning what is possible with the deferred shading in my rendering path. Above are some screen shots captured while playing with material settings. While each shot shows one uniform material property on the meta-volume surfaces, I can easily adjust the material property across the surface, and have the material properties change based on physics/CFD velocity. Faking It Obviously there is no way I could render this in real-time without faking everything. Yep, fake lighting, fog, sub-surface scattering, reflection, refraction, and more dynamically using the same single shader and just a few material properties. BTW, I only use a 1D texture when blending in the material properties of the meta-volumes. Of course there are trade-offs to be made in any real-time rendering pipeline. I'm using the term meta-volumes because basically I am limited to defining stuff with hierarchical 3D ellipse (2 radius ellipsoid) based sharp to fuzzy "meta-ball" like objects. So while I have perfect anti-aliasing, motion blur, and cool looking fake lighting, I cannot render what is possible using a traditional skinned polygon based pipeline. Progress So I'm switching to the deferred pipeline from my previous "forward" rendering pipeline, and I still haven't ported over this code path to the cubemap rendering path. Right now I really have to fake forward reflections, which is something the cubemap should fix. I also have about three optimization paths I am currently working on to speed up the deferred renderer. The current working path blends everything in a full size framebuffer, which is quite expensive and is using reverse painters algorithm sorted drawing. 
I have a problem with pop-up when motion card (my term for motion stretched billboard for polygon people, or motion stretched splat for point based rendering people) order changes, which is worse now than in my previous forward rendering path. So I have two options to fix the ordering problem: eat a second rendering pass cost to generate both blended Z and Z fade point (2 channel FP32 FBO), or somehow vary the alpha on the motion cards in a direction to correct the pop-up. The second Z pass should provide a more robust solution (like true soft particles). Also with Z, I can easily switch to a hierarchical soft particle rendering path, drawing only the smaller particles (which define "edges") in the high resolution FBO, and working on smaller FBOs for the larger softer particles. If necessary I could go with a bilateral interpolation for the upsample. I've toyed with some other really strange ideas. Like the idea of simply doing something like ray tracing (for G-Buffer data only) the projected pixel at the center of my 64K particles, and then smartly interpolating the rest of the G-Buffer contents using a special pyramid bilateral interpolation and space filling upsample filter, then finishing up with the same deferred rendering. I'd only be raytracing 1/8 of the screen pixels per frame. From testing, at most 16 particles ever need to affect a single pixel, so this problem becomes effectively dynamically computing which 16 particles should be looked at for each of the 64K particles. I was going to simply draw the motion path of the particles into a hierarchical reduced size framebuffer, do this in 4 passes with depth peeling (each channel of a pixel becomes the id of a different particle), and also divide the particles into different mipmap levels based on size. Ultimately I decided the batch, vertex load, and overhead would be too much to try this.
Also I would have to do about 96 sphere intersections (4 pixels * 4 channels * 6 mipmap levels) for each of the 64K pixels, to find the 16 particles to work with, which is just too expensive.
071116 / Deferred Shading II previous | next These two shots are from my deferred shading prototype path. Deferred Shading with Alpha Blending of Normals? Back in August I forked my development path, branched out and tried Deferred Shading. I had not got it working (as can be seen if you follow the previous link) because of the problems of alpha blending normals. Unlike the traditional deferred shading path, I'm doing sorted alpha blending of material properties (including something which is not quite a normal, but serves a similar purpose) into my G-buffer. Now I've found a trick method to blend my "normals" which modifies a shiny-to-diffuse material property (effectively changing the mipmap level in a future texture lookup). This knowledge is effectively the gateway towards a simplified shading pipeline, but I have a few more problems to solve to get there. Random Thoughts on DX10.1 Putting this in my archive of thoughts on DX10.1 level hardware, of which perhaps it is going to take 1-2 years before the Khronos group catches up with an OpenGL spec for ... and then another 6 months to a year for driver support! Anyway the hardware is here now in the form of AMD/ATI RV670 chipset. What excites me the most about the hardware is the GPU double precision support. Something like 2 clocks for an ADD and 4 for a MUL? Would be ideal for say Atom II to move over the CPU bound double precision math to the GPU. As for DX10.1, the ability to read and write from multi-sample buffers would be awesome! With this, it becomes easy to do order independent transparency in one geometry pass. Simply setup a multi-sample FBO with 4x4 samples per pixel. Render with AA off and stencil setup so each write to a pixel goes to another sample. Then read back the color/depth for the 16 samples, sort by depth and manually composite in the correct order. With any luck, by the time GL has support for this, the cards will have the bandwidth to do this at a good resolution. 
As for the rest of DX10.1, I don't see it as all that important. Cubemap Arrays would be useful if only they managed to get the geometry shader pipe fast. Separate blend modes per MRT is something I'm not sure I'd use. Haven't reached the limit of vertex shader inputs and outputs yet either...
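Going back to the order independent transparency idea above, the "sort by depth and manually composite" step could look something like the following. This is only a CPU side Python sketch of the per pixel resolve, under the assumption that each of the (up to 16) samples holds a depth, a color, and an alpha, and that larger depth means farther away; the names `Fragment` and `composite_sorted` are hypothetical.

```python
from typing import List, Tuple

Color = Tuple[float, float, float]
Fragment = Tuple[float, Color, float]   # (depth, rgb, alpha), larger depth = farther

def composite_sorted(fragments: List[Fragment], background: Color) -> Color:
    """Sort one pixel's fragments by depth, then blend back to front ('over')."""
    result = background
    # Farthest fragment first, so nearer fragments are blended on top of it.
    for _depth, rgb, alpha in sorted(fragments, key=lambda f: f[0], reverse=True):
        result = tuple(alpha * c + (1.0 - alpha) * r for c, r in zip(rgb, result))
    return result

# Fragments arrive in arbitrary (draw) order, as they would in the sample slots.
pixel = [
    (1.0, (1.0, 0.0, 0.0), 0.5),   # near, half transparent red
    (2.0, (0.0, 0.0, 1.0), 1.0),   # far, opaque blue
]
print(composite_sorted(pixel, (0.0, 0.0, 0.0)))
```

Because the sort happens per pixel at resolve time, the draw order of the geometry pass no longer matters, which is the whole point of the trick.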
071115 / Critic II previous | next Ok, only two blog posts... Genius in Mario Galaxy The true genius of the Mario Galaxy game is in how they have found a workaround for the under powered Wii graphics hardware. Simply reduce both the geometry and fragment fill by building sparse small worlds. Simply awesome! Image Processing In the Game Graphics Pipeline Ultimately games will end up with a film like image post processing and compositing engine. Sure the hardware is not fast enough yet, but it will be in my lifetime. Little Big Planet is a great example with its integrated motion blur and depth of field. Looking at the screen shot, they are doing a image space blur which has image space velocity and is semi-depth aware (see the color bleeding between some foreground and background elements). Interesting results on the rolling wheels... How about something even more OT... Thou Know Thou Ist In Hell When Dealing with Linux developers dynamic linking abuse. No matter how you write it, the story ends with tons of wasted time and lots of profanity. Why re-invent the wheel? Sure hear that a lot these days, and the results of this can been seen in the multi-page library dependency trees of most Linux applications. KISS, Keep it Simple Stupid! Yep, has been tossed out the window. The point is the same on Mac, Windows, and Linux: hardware gets faster, software gets slower! Net result, my wife is at the gym power lifting to take out the frustration of dealing with many second user response latency. No joke here, she is dead-lifting her own weight now, and I'm under 100 lbs from joining the 1000 lbs club (which is tough for a tall once skinny programmer). Busy with Photography 
All this dynamic linking mess is in regards to getting the libraries needed for an open source photo-stitching tool working on the Linux machine so that I can off load my over-burdened Mac which just cannot handle stitching of GB files. Unfortunately I have about a 2-3 week backlog of Photography work which has been and is going to take me away from Atom for a while ... hence the OT posts and lack of Atom updates!
071112 / Critic previous | next I decided to take a break from my normal routine of talking about boring GPU programming stuff (and the progress on my game) to become game industry critic for just one blog post. Great Art Direction from An Independent What a great example of art polish, the golden hour color tones, soft backgrounds, the lighting, all very well done. Check out their gallery for a few more awesome screen shots. Also here is a link to a previous gamedev.net image of the day post showing progression of their graphics engine. Non-Photo-Realistic-Rendering Very impressed with Valve's Team Fortress 2 illustrative rendering engine. Insomniac I loved the Ratchet series on the PS2 (excluding Deadlocked). Haven't purchased a PS3 yet, I'm waiting simply because it forces me to work more on my game ... however Tools of Destruction will be the next game I purchase. Why, because the previous games were simply my kind of fun. BTW, Insomniac has a great tech page. Impression of the Crysis Demo Today I got a chance to try the Crysis demo out on a 8800 with settings maxed out. First off, very impressive work, they have advanced quite far. However, I feel sorry that they are caught up in the DirectX 10 Vista BS (where Microsoft's marketing goal is to try and force people to upgrade by not releasing DirectX 10 for XP). Having the "High" setting disabled in XP was sure a sour move. IMO, the whole "no DirectX 10 for XP" sure screwed the game industry over, drastically slowing down the ability for developers to adapt DirectX 10 features. Forcing next-gen games to have a near identical DX9 fall back rendering path, which can clearly be seen in the DX9 hack to enable "High" on the Crysis demo. Not much difference there. Nothing shatters the illusion of a virtual world for me more than aliasing. I just don't understand this push for ultra-high resolution rendering, where everything ends up looking like polygons and pixels, even with anti-aliasing on. 
The alpha test aliasing artifacts in Crysis are just too distracting for me.

Moving Beyond Surfaces

I think the guy at meshula.net is right on, especially his section on Moving Beyond Surfaces talking about painting the air between surfaces instead of the surfaces themselves. This is something which current raster graphics hardware is simply not designed to do well, because of the problems of needing to draw depth sorted objects for proper alpha blending. This is exactly the thing I am trying to do with my compositing engine. For example, here is an Atom in-engine shot (meaning not a fake off-line rendered shot with in-game assets) showing each pixel doubled without filtering (nearest up-sample). No aliasing, partly solved by rendering the "atmosphere between objects", and secondly solved by having objects which are not built out of polygons (no hard edges). Compare this to the up-sampled image from Crysis. The illusion of a "real" virtual world is instantly killed by aliasing. Just got another great idea, an in-game sharpening filter. Something which couldn't be done with an aliased image. Here is a non-upsampled crop with sharpening applied. Enough being critical, time to get back to work...
071108 / GPU Assembly II

NVidia has released the beta of Cg 2.0. Which is great, Leopard support, profiles for NV_gpu_program4, etc. I just spent the last few days really getting to know NV_gpu_program4, testing assembly shaders, looking over NVidia's CUDA PTX docs (BTW, have to download the full install to get them), and trying to get as much of an understanding as I can of what composes the opcode level of the G8x GPU. I have learned a lot at the cost of effectively losing a week of productive work, and spending way too much time working on new ideas (hierarchical soft particles) and features, when what I really need is to have a feature freeze and just get the damn engine done! Also I should say thanks for all the answers I have received on the various message boards!

NV_gpu_program4 and PTX

According to multiple sources, PTX very closely resembles the GPU opcodes on the G8x. From what I can gather, beyond the obvious (like gpu_program4 being vector based assembly on a scalar processing unit), there are some important differences between gpu_program4 and PTX. First, gpu_program4 opcodes have a large number of modifiers, like the ability to swizzle, saturate (both -1 to 1, and 0 to 1), negate, take absolute value, multi-bit predicate, and set multi-bit predicate based on the output of the opcode. PTX shows only saturate 0 to 1, and 1-bit predicate (meaning skip execution of an instruction based on a 1-bit predicate register). So all the other gpu_program4 opcode modifiers have to be emulated by multiple instructions. Also PTX shows the need to run a "set if" compare opcode to write to a predicate register, but provides an opcode to do a register move from two registers conditionally based on another (based on sign I think). Seems as if the GLSL compiler generates NV_gpu_program4 code with a bunch of "set if" opcodes and single instructions skipped by branches when a simple predicated instruction would do.
However, since NV_gpu_program4 assembly still has to get compiled to the true GPU machine instruction set, this might simply be re-optimized in that step.

Assembly in Cg or GLSL?

The title sounds strange (and bogus), but given how simple the PTX instruction set is, it should be easy on the scalar G8x to produce ideal assembly like code in Cg or GLSL simply by knowing a few coding patterns which get compiled to the underlying GPU opcodes. For example, setting a bool to invoke a "set if" opcode, and "if(bool) d=saturate(a*b+c); " to invoke a predicated multiply add opcode with saturation.

What about Pack/Unpack

Seems as if pack/unpack is emulated via type conversion opcodes and multiple integer opcodes (shifts, masks, etc). Guessing 4 ops to pack two FP16s, and 10 ops to pack four bytes. So the advantage of using these (memory bandwidth reduction) comes at the cost of lots of shader cycles. So it is not as useful as I thought. Also in GLSL with the NVidia drivers you can use both Cg types and Cg standard library functions (if you don't add a #version in the GLSL code) like pack_2half(), pack_2ushort(), pack_4byte(), pack_4ubyte(), unpack_2half(), unpack_2ushort(), unpack_4byte(), unpack_4ubyte() to do the packing. Or use floatToRawIntBits() and intBitsToFloat() to roll your own pack and unpack in combination with integer shifts and masks.

Conclusions

I'm sticking with GLSL, but might convert to Cg later! Don't get me wrong though, NV_gpu_program4 output is still very important, just as an optimization tool to get an idea of register usage (excess register usage limits threading on the G8x) and what opcodes are generated for the code.
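To make the "roll your own pack and unpack with integer shifts and masks" idea from the Pack/Unpack section concrete, here is a small CPU side Python sketch. It is not the Cg library and makes no claim about the bit layout the Cg pack functions use: the function names and the choice of putting the first value in the low bits are assumptions for illustration only.

```python
import struct

def pack_two_halves(a: float, b: float) -> int:
    """Pack two floats as FP16 bit patterns into one 32-bit word (a in the low half)."""
    lo = struct.unpack("<H", struct.pack("<e", a))[0]   # float -> FP16 bits
    hi = struct.unpack("<H", struct.pack("<e", b))[0]
    return (hi << 16) | lo

def unpack_two_halves(word: int):
    """Inverse of pack_two_halves: mask and shift, then reinterpret as FP16."""
    lo = struct.unpack("<e", struct.pack("<H", word & 0xFFFF))[0]
    hi = struct.unpack("<e", struct.pack("<H", (word >> 16) & 0xFFFF))[0]
    return lo, hi

def pack_four_ubytes(r: float, g: float, b: float, a: float) -> int:
    """Pack four values in [0, 1] as unsigned bytes (r in the lowest byte)."""
    def to_byte(v: float) -> int:
        return int(round(min(max(v, 0.0), 1.0) * 255.0))   # clamp, scale, round
    return to_byte(r) | (to_byte(g) << 8) | (to_byte(b) << 16) | (to_byte(a) << 24)

def unpack_four_ubytes(word: int):
    """Inverse of pack_four_ubytes: one shift and one mask per channel."""
    return tuple(((word >> shift) & 0xFF) / 255.0 for shift in (0, 8, 16, 24))

print(hex(pack_two_halves(1.0, -0.5)))
print(unpack_two_halves(pack_two_halves(1.0, -0.5)))
print(unpack_four_ubytes(pack_four_ubytes(1.0, 0.0, 1.0, 0.0)))
```

Counting the clamp, scale, convert, shift, and combine steps per channel gives a feel for why packing four bytes costs noticeably more shader instructions than packing two halves.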
071104 / GPU Assembly

Looks like I am going to switch to using NV_gpu_program4 instead of GLSL because I need to have hardware ability to pack and unpack. Looks like I will have to again pull the Mac off my list of supported platforms, unless they magically roll out NV_gpu_program4 support in 2008!

Saved by Near Undocumented Features

Found a way to output the NV_gpu_program4 assembler code from my GLSL programs. Simply set the following environment variables before running your program (in Linux),

export __GL_WriteProgramObjectSource=1
export __GL_WriteProgramObjectAssembly=1
export __GL_WriteInfoLog=1
export __GL_ShaderPortabilityWarnings=1

This should be a tremendous tool for GLSL optimization as well. One interesting thing I have noticed is that the NVidia drivers are using an unreleased Cg 2.0 cgc compiler to compile GLSL in the driver. Also reading this output should give me a great idea of how to optimize for scalar instructions.

NV_gpu_program4 and NV_transform_feedback

Also turns out that it should be easy to use transform feedback with assembly shaders, simply use TransformFeedbackAttribsNV(). Also found this attribute mapping in NVidia's GLSL Enhancements doc,

0 = gl_Vertex
2 = gl_Normal
3 = gl_Color
4 = gl_SecondaryColor
5 = gl_FogCoord
8 = gl_MultiTexCoord0
9 = gl_MultiTexCoord1
10 = gl_MultiTexCoord2
11 = gl_MultiTexCoord3
12 = gl_MultiTexCoord4
13 = gl_MultiTexCoord5
14 = gl_MultiTexCoord6
15 = gl_MultiTexCoord7

This should be all the tools and info I need to get back up to speed with a GLSL to GPU assembly conversion.
071103 / Random Shots previous | next And one showing extreme motion blur (which is done in-engine with geometry stretching),  Optimization :: Transform Feedback, Vertex Shaders, and Pack/Unpack One of the primary paths I am trying to optimize logically expands a "particle" into a "motion card" for rendering. Now that I am trying to render this motion card to 6 faces of a cubemap basically I have to generate a few VBOs which will get read from 6 times per frame. A similar pipeline is to be used for the physics/CFD also. Having 6 passes, for the sake of memory bandwidth and vertex attribute cache performance, I am trying to keep the VBOs interleaved instead of separate. Transform feedback has the limitation of only allowing a maximum of 16 FP32 components as output per vertex. I am using the transform feedback as data expansion to write out 3 points per input vertex (or particle). So I can only output 5 values per point with transform feedback (3x5=15, 16 max). So I need to have multiple vertex passes to generate extra VBOs for the other per point outputs. Now with SM4.0 the assembly level fragment pack/unpack opcodes work in vertex shaders as well. However it appears that I would now have to convert all my GLSL to assembly to use this functionality. I am currently trying to find out if there is a way to do this in either GLSL or Cg. What I want to do is to pack multiple values into a FP32 value, then unpack in the vertex shader before I render my motion card triangles into the 6 cubemaps. If I have to move from GLSL to Cg or assembly, better to do it now early in the development instead of later...
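The transform feedback budget described above (16 FP32 components per input vertex, 3 output points per particle) can be written down as a tiny planning helper. This is only a Python sketch of the arithmetic; the function names are made up, and the example attribute count of 12 is a hypothetical figure, not a number from the engine.

```python
import math

MAX_COMPONENTS = 16       # FP32 components per input vertex, as stated in the post
POINTS_PER_PARTICLE = 3   # each particle expands into 3 points

def values_per_point(max_components: int = MAX_COMPONENTS,
                     points: int = POINTS_PER_PARTICLE) -> int:
    """How many FP32 values each output point can carry in one pass."""
    return max_components // points

def passes_needed(values_wanted_per_point: int) -> int:
    """Vertex passes (one extra VBO each) needed to emit all per point values."""
    return math.ceil(values_wanted_per_point / values_per_point())

print(values_per_point())    # 5, since 3 x 5 = 15 fits under the limit of 16
print(passes_needed(12))     # a hypothetical 12 values per point needs 3 passes
```

Packing two or more values into each FP32 component, as discussed in the post, would shrink `values_wanted_per_point` and with it the number of passes.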
071031 / Cubemap Mipmap Seams

The left shows a low level of the mipmap for a cubemap. Note the very sharp boundaries between the sides of the cubemap. The right shows a relatively fast and makeshift workaround by setting all the edge pixels to the average. See the optical illusion where it looks as if there is an extra dark and light band in the filtered area; this is caused by the change in rate of gradient due to the flat 2 pixel matching borders. I knew seams on my dynamic mipmapped cubemap would be a problem. I've finished a quick workaround, which renders the seams for all mipmap levels and all faces into one FBO, then does a series of glCopyTexSubImage2D() from the FBO to correct the texture. It is fast but far from perfect, a bandaid for something which truly doesn't have a good solution. Might have to improve the seams by feathering the average out over more than one pixel.

Updates on the Status of OpenGL SM4.0 Driver Support

I was quite surprised to find out that AMD/ATI's current Windows (and Linux) drivers for their new R600/HD2xxx cards have absolutely no SM4.0 support and are missing SM3.0 vertex texture fetch support! Somewhat like sending a Porsche to the customer without a transmission. So no AMD/ATI support for Atom until they bring their GL driver to the 21st century. Apple, on the other hand, is making progress towards working SM4.0 support in MacOSX! Geometry shader and transform feedback are in the latest drivers, and Macbook Pros are shipping with NVidia 8600 mobile GPUs. This just might guarantee a native MacOSX port of Atom for release (and perhaps for alpha testing even before the Windows port is finished)!
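The "set all the edge pixels to the average" workaround from the seam discussion above can be illustrated on the CPU. The Python sketch below is a deliberately simplified stand-in for the FBO plus glCopyTexSubImage2D() path: it handles only one shared edge between two faces stored as lists of rows, assumes both faces have the same orientation along that edge, and ignores corners (which are shared by three faces). The function name is hypothetical.

```python
def average_shared_edge(face_a, face_b):
    """Make the right column of face_a and the left column of face_b identical.

    Both faces are square lists of rows of floats, modified in place.
    """
    size = len(face_a)
    for row in range(size):
        avg = 0.5 * (face_a[row][size - 1] + face_b[row][0])
        face_a[row][size - 1] = avg   # last texel of the row in face A
        face_b[row][0] = avg          # first texel of the row in face B

# One mipmap level of two neighbouring faces: a dark face next to a bright one.
dark = [[0.0, 0.0], [0.0, 0.0]]
bright = [[1.0, 1.0], [1.0, 1.0]]
average_shared_edge(dark, bright)
print(dark, bright)
```

After the call both faces carry the same value along the seam, which removes the hard line but leaves the flat two texel band that causes the banding illusion mentioned in the post.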
071026 / Transform Feedback

Transform feedback + multiple drawing passes is proving to be an excellent solution to the problem of the Geometry Shader pipe being too slow to be useful! Transform feedback basically is a SM4.0 feature which allows the output of a vertex shader to be written into one or more VBOs (vertex buffer objects). On my GeForce 8600 GTS up to 4 VBOs can be written to in one transform feedback pass. Furthermore up to 16 FP32 values can be written to each of those four VBOs. So this enables a transform feedback pass to output up to 16 new vec4 points per point input (4 VBOs x 16 floats = 64 floats). Easy data expansion, and a very quick way to turn a single particle into an output triangle. So with transform feedback solving the geometry expansion issues, I tried separate drawing passes to each side of the cubemap (instead of using gl_Layer in a Geometry Shader). This new method is almost 20 times faster than using the Geometry Shader! Now to avoid re-calculating flat varyings (per primitive values, instead of per vertex) in the vertex shaders, and to keep these values well cached between vertexes of the same triangle, texture buffer objects should do just fine. So one early point to point VS pass to generate an interleaved VBO with per primitive values. Then map this VBO as a texture buffer object, and use gl_PrimitiveID to build an index into the texture buffer object in future vertex shaders. I believe this is the absolute fastest path on the GPU for what I am doing.

Other Optimization Progress

So I've managed to offload a large part of the CPU time by getting the motion card pass on the GPU. For the sake of getting this done, I'm archiving my ray tracing ideas for some future project. So the drawing pipeline is going to be very similar to what I already have working, except now I have extra environmental lighting from the cubemap.
I've also gone back through my stencil based reverse drawing code again, and found that turning off the alpha test and using the stencil test actually works quite well. Much better than turning off the stencil and turning on the alpha test. This almost makes me wonder if the GeForce 8 series has some block based stencil hardware to reject blocks of fragments (the AMD/ATI HD card has a hierarchical stencil buffer). Perhaps with the alpha test on, this hardware was disabled. One bonus of having the stencil test is that now I have another method for frame rate control, using the stencil to limit the number of times of overdraw per pixel. Tossing the HDR Code That is right, I'm no longer using it. First off the lower-end GeForce 8 cards take 2 times longer to fetch a bilinear filtered FP16 pixel, and 2 times longer to blend FP16 output in the ROP than the same operations with 8bit values. Not to mention the extra memory bandwidth and texture cache misses. With the amount of overdraw I use, this cost just wasn't worth it. Second, I don't like overblown overexposure. From a fine art photography perspective, HDR like extreme overexposure easily ruins an otherwise good photograph. Highlights should near clip or perhaps only clip a little at a point light source like say the sun. Otherwise highlights should have detail. Turns out with my mix of atmospheric lighting (I render atmospheric spaces in-between surfaces), it is just too easy to limit lighting and lighting feedback to values under the clipping point. I can still get near bloom with the added bonus of having detail there, and with colored lights, they still bleed to surrounding objects.
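The stencil based overdraw limit mentioned above can be modelled for a single pixel in a few lines. This Python sketch assumes fragments arrive pre-sorted front first and are blended "under" what is already there, with a counter standing in for the stencil value; it is a toy model of the idea, not the engine's GL state setup, and the function name is made up.

```python
def composite_front_first(fragments, max_overdraw):
    """Blend front first (value, alpha) fragments for one pixel.

    Returns (blended value, accumulated coverage, fragments actually shaded).
    """
    value = 0.0
    coverage = 0.0
    stencil = 0                       # counts writes to this pixel
    for frag_value, frag_alpha in fragments:
        if stencil >= max_overdraw:   # stencil test fails, fragment is culled
            continue
        visible = (1.0 - coverage) * frag_alpha   # what still shows through
        value += visible * frag_value
        coverage += visible
        stencil += 1
    return value, coverage, stencil

pixel = [(1.0, 0.5), (0.0, 0.5), (1.0, 1.0)]   # front to back
print(composite_front_first(pixel, max_overdraw=8))   # everything is blended
print(composite_front_first(pixel, max_overdraw=2))   # third layer is culled
```

Lowering `max_overdraw` trades a little coverage at the back of the stack for a hard bound on fill cost per pixel, which is what makes it usable as a frame rate control.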
071025 / Motion Cards

Here is a shot from today's work. It is of a 360 degree fisheye projection of a few thousand motion cards rotating around the camera really really fast.

Triangle Reduction, From 4 to 1

Today I rewrote the motion card engine again, this time with a single triangle per motion card instead of 4. The aim here is to increase the efficiency of the geometry side of the engine now that I am planning on rendering 6 sides of a cubemap per frame. One thing I am guessing on is that when the GPU rasterizes small triangles which only cover a few pixels, a chunk of SPUs are shading pixels which are not even in the polygon (since the GeForce 8 series has something like 8x4 fragment shader granularity). So changing from four tiny triangles per particle to one should provide somewhat of a performance boost in the fragment shading as well.

Ellipsoid Rendering

Anyone remember the Ecstatica game series from Psygnosis in 1997? Here is a screen shot,

The Atom engine is similar in concept, but taken to the next level. Instead of rendering with polygons, the basic primitive in the engine is an ellipsoid, limited to 2 radii instead of three. The motion card part of the engine composites the motion blurred ellipsoids into the framebuffer, and the rendering part of the engine turns each ellipsoid into a shaded impostor. Here are a few flat shaded ellipsoid primitives (fisheye projection curves them). Then a little motion. Then a lot of motion. And finally showing triangle primitives created by a large motion case. The engine uses an alpha falloff under motion to ensure a correct transparency for each moving particle.

What's Next

Still need to find a faster output path than the geometry shaders. I'm thinking of trying a trick to use transform feedback on points (one point per motion card) in a vertex shader and then output three interleaved attributes (for the triangle) in a VBO.
Then later read in as a non-interleaved array of vertexes for triangles for all my cubemap passes.
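The post does not say how the single triangle per motion card is fitted, so the following is purely an assumption for illustration: one common way to cover a rectangular card with a single triangle is to double the rectangle's extents from one corner, and let the fragment shader's alpha falloff zero out the extra area. A minimal Python sketch in card local 2D coordinates, with hypothetical function names:

```python
def covering_triangle(half_w: float, half_h: float):
    """One triangle that contains the rectangle [-half_w, half_w] x [-half_h, half_h]."""
    return [(-half_w, -half_h), (3.0 * half_w, -half_h), (-half_w, 3.0 * half_h)]

def inside(tri, p) -> bool:
    """Point in triangle test using the sign of three edge functions."""
    def edge(a, b, c):
        return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    signs = [edge(tri[i], tri[(i + 1) % 3], p) for i in range(3)]
    return all(s >= 0 for s in signs) or all(s <= 0 for s in signs)

tri = covering_triangle(2.0, 1.0)      # a card stretched 2:1 along its motion
corners = [(-2.0, -1.0), (2.0, -1.0), (2.0, 1.0), (-2.0, 1.0)]
print(all(inside(tri, c) for c in corners))   # every card corner is covered
print(inside(tri, (2.5, 1.5)))                # just outside the hypotenuse
```

The triangle has twice the area of the card, so this only pays off when the per triangle setup and small triangle shading waste described in the post cost more than the extra transparent fragments.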
071024 / Geometry Shader Woes

This is a fisheye projection composed of simple motion stretched billboards.

Motion Card Drawing to Cubemap Prototype

I decided to try porting my older raster based image compositing pipeline (sorted rendering of motion cards, basically motion stretched billboards), and get it working with cubemaps. And it works, but with a few problems. As I hinted before, Z aligned motion cards wouldn't work because of the seams. So I had to switch to actual 3D geometry for the cards. Getting eye perpendicular cards drawn wasn't too much of a problem, and this did fix the seams. Since I'm not using a Z buffer (pre-sorted), and I wanted to have infinite detail both near and far, I cooked my own projection for each motion card. This ensured that I would not have precision issues for clipping near and infinitely far. All my billboards are stretched in the direction of motion, and also can be non-square. Computing and generating the proper bounding geometry took me a few days to figure out. What I ended up with was a quick simple algorithm which outputs a quad divided into 4 triangles (the point at the center ensures proper interpolation for the fragment shader).

Problems

Unlike my working engine, I tried to switch my generation of motion card geometry to a geometry shader. Combined this with output to six sides of a cubemap at once, and the GPU slows to a crawl with 64K motion cards per cubemap face. Switching to rendering to one face and outputting 6x the number of motion cards actually performed much much better. So apparently the single pass render to cubemap idea doesn't work all that well. I'm going to need to go back to my previous methods. Geometry shader usage is way too slow.

Future

I have a feeling that drawing 6 passes, one to each cubemap face, is many times faster than trying to use a geometry shader. This might be good news for getting this ported to SM3.0.
Speaking of SM3.0 and creation of geometry, Gernot Ziegler's HistoPyramids looks like a really good alternative to geometry shader usage for a variety of situations...
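If the six pass route is taken, it helps to know which cubemap face a given direction lands on, for example to bin motion cards per pass on the CPU. The post does not describe such a binning step, so treat that use as an assumption; the face selection itself follows the standard OpenGL major axis rule. A minimal Python sketch with a hypothetical function name:

```python
def cube_face(direction):
    """Map a non-zero 3D direction to (face name, s, t) with s, t in [0, 1].

    Uses the major axis rule from the OpenGL cube map specification.
    """
    x, y, z = direction
    ax, ay, az = abs(x), abs(y), abs(z)
    if ax >= ay and ax >= az:                 # X is the major axis
        face, sc, tc, ma = ("+X", -z, -y, ax) if x > 0 else ("-X", z, -y, ax)
    elif ay >= az:                            # Y is the major axis
        face, sc, tc, ma = ("+Y", x, z, ay) if y > 0 else ("-Y", x, -z, ay)
    else:                                     # Z is the major axis
        face, sc, tc, ma = ("+Z", x, -y, az) if z > 0 else ("-Z", -x, -y, az)
    return face, 0.5 * (sc / ma + 1.0), 0.5 * (tc / ma + 1.0)

print(cube_face((1.0, 0.0, 0.0)))       # centre of the +X face
print(cube_face((0.5, -1.0, 0.25)))     # somewhere on the -Y face
```

A card whose footprint straddles a face boundary would still need to be drawn in more than one pass, so a real binning step would test the card's corners (or a bounding cone) rather than only its centre direction.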
071019 / Cubemap Woes

Cubemaps rule! Cubemap mipmapping sucks! The problem is that the hardware doesn't filter across the edges of each cube face. The 1 pix texture border option fixes this by allowing the clever programmer to provide extra texels so the filtering works out ok. Great, so canned static mipmapped cubemaps are golden. Dynamically generated mipmapped cubemaps are in a sorry state however. First, glGenerateMipmapEXT() won't do correct mipmap generation around the seams. Second, and the show stopper, even the latest NVidia cards don't support attaching a cubemap with borders to a framebuffer object. The GL_FRAMEBUFFER_UNSUPPORTED_EXT laid down the law, no way to draw into mipmapped cubemaps with borders!

What to do Now

The only option is to manually blend the edges of all the cube faces in the mipmaps so that they are the same. Basically trade a possible high contrast fine line on the seam for double width pixels. Rounds out the edges of the cubemap. Should work fine for the environment map, but sure breaks what I had in mind for ray tracing. Will have to rethink some of this...

A Random Thought I Don't Want to Forget

Was watching a cool show on ants a few days ago, specifically how their collective decision making works. Basically they leave chemical bread crumbs, other ants follow the trail, strongest chemical trail wins. Different chemicals pass different messages to other ants. Might be something which could make a really interesting game AI.
071018 / Cubemap Concepts

What Am I Looking At?

A fisheye projection (270 degree horizontal FOV I think) at 640x480 from a 224x224 cubemap (linear filtered) with a single pixel line pattern on each face. Basically a resolution test, areas of very strong moiré patterns are over-sampled compared to screen space, and areas of the flat line pattern are 1:1 or under-sampled. One more very important note, 640x480=300K and 224x224x6=294K.

A Fresh Idea

I've always been a fan of fisheye and wide angle projections, and the ability to see behind yourself in a game, especially an FPS, is just awesome. Like many good ideas, it has been done before, check out Fisheye Quake. But it isn't done often. There is no projection transform in OpenGL which can output a fisheye projection. Fisheye projections also remove the ability to do a flat screen image space motion blur (the projection curves straight lines). Bottom line is that fisheye projections are cool, but deemed impractical. However, with the ever faster GPU, rendering into a cubemap first, then projecting the fisheye is possible. And as can be seen above, for the equivalent number of pixels (single screen vs cube pixels), the output resolution is similar for ultra wide angles.

Pros and Cons

First obvious advantage of having a cubemap of the entire view surrounding a player is that now there is a free accurate environment map to use in lighting. And if the mipmap levels for the cubemap are generated, a LOD bias can be used in the environment map as a surface property for sharp or diffuse reflections. Another huge win with the cubemap in Atom is that I am planning on extending my physics to use cubemaps, so the geometry passes could be merged between the physics and drawing pass. There are some serious challenges with cubemaps however. First the cubemap must be seamless for visual rendering. This is a serious problem for Atom.
Atom's current rendering implementation uses many incorrect optimization hacks to basically composite motion blurred particles which reflect, bend, and emit light, all in screen space. And since these effects are not raytraced, rendering to the sides of the cubemaps would generate very bad seams where a particle crossed between two different image space planes.

Now For the Guts

I have an idea, which might work, and if it does it will be spectacular. The basic concept is to use ideas from the physics engine to make an O(1) time lookup for ray intersections in cubemap space (single cubemap lookup), then raytrace the first intersection between the eye ray and particles. Now for the even stranger part, what I am going to try compositing (sorted alpha blending) into the cubemap is going to be particle center position, radius, and other properties. Also I'm going to be splitting up the particles by projected size, larger particles in smaller mipmap layers in the cubemap (very important, more on this later). The alpha blending serves to blend particles into more of a meta-ball like surface, and if I draw lines into the cubemap along the motion of the particle, I will have free motion blur! Yep that's right, I'm alpha blending Z. Sure it is a no-no, but in my case not a problem (I've been doing this for a while now). My blending is front first (reverse painter's), thus I have the alpha coverage value for each pixel. So it is easy to take non-full coverage Z values and correct them.

Atom by its very structure has a hierarchical metaball like surface, where layers are blended in and out for LOD control. Previously I had some problems with Z ordering changing between parent and child particles in the hierarchy, causing visible popup (can be seen on one of the videos on this site). Now what I am going to do is a set of pyramid rendering passes, rendering the different meta surface hierarchies in different mipmap levels.
The less detailed particles are computed in smaller resolution mipmap levels. Should save a tremendous (4x) amount of computation time and remove my overdraw problems... The key to this hierarchical rendering is to only render certain properties at the reduced size (planning on ray intersection detection, normal generation, and alpha computation). Then do one full size (mipmap level 0) pass where I read from the eight smaller mipmap layers, Z sort, and do both emissive and environment reflective lighting for each layer, with proper Z ordered alpha blending between the results of each layer.

Another trick I'm already doing is re-circulating the environmental light from the previous frame into the current frame. In a way it is like fake radiosity. The motion blur tends to hide the fact that it takes a few frames to converge, based on the LOD factor in the environmental lighting pass. But the results are awesome, and no one has yet noticed how the lighting "converges" in a few milliseconds as the view changes. Will this major change work? Perhaps, will take a while to find out.

What About Physics

"I read that quote on your blog - Collisions via mipmapped cubemaps - and was intrigued as to what you're planning. I'm somehow not imagining an algorithm that would fit the description, but I am super curious about what you're thinking there. If you feel inspired, a few paragraphs in your blog would be cool :-)"

First off, ultimately for any physics/CFD interaction there is a problem of a large world space, and sparsely grouped items to interact with (the "broad phase" problem). Options are a spatial hash function, a uniform grid, or something else to figure out what items are close enough to interact. Subdivision algorithms suffer from an O(log n) run time. So might as well toss all of those out, with many objects composed of particles, O(1) is the only way to go. Uniform grid breaks down for large spaces (BTW, great example of uniform grid in Chapter 29 of GPU Gems 3).
So let's toss that out as well. What is left?

Something New

Subdivision for collision detection I've tossed out. However Atom already has all particles in a hierarchical tree structure, and each child particle is in the coordinate space of the parent, and thus each child is automatically affected by the movement of the larger parent particle. So in this way I have a free (from the physics engine's standpoint) form of subdivision which is not used for collision, but which has an O(log n) reduction in the complexity of the physics code.

Now one really important observation, as the number of particles or objects increases, a person's ability to find incorrect particle interactions decreases. In Atom's case, there are about 65536 particles active at a given time, and not all of those have to be 100% correctly interacting. What is important is a few key properties. First that what is hidden requires a much less accurate interaction than what is visible. Likewise higher energy interactions need to be much more accurate than lower energy interactions. Sounds like a great hash function to me!

Exactly. To implement this idea, the hash function is simply the same cubemap used for the view rendering, with larger (projected radius) particles in smaller mipmap levels, and with particles drawn in a very specific order. Ordering is front first (once a bin, or "texel", is filled, it is no longer written to), with an energy level override so that high energy particles can fluidly have a higher priority than the lower energy front most particles. Essentially the viewing projection itself is a major component of the spatial hash. To recap, first render the particles (or particle properties in my case) for each particle in the cubemap, then check for "collisions" by looking up the particles in the cubemap around your particle and check other mipmap layers for larger and smaller particle interactions.
Sure some particles get lost in this hashing, it is a contracting hash which is not reversible. If this was a problem the idea could be extended with a stencil select binning algorithm to allow multiple particles per bin, but I'm not going this route. I'm already blending my physics CFD particle properties much like what I am planning on doing with my raytracing idea above. Still Reading? I'm usually done when I don't see good screen shots. Well, I will be posting as I work this idea from concept to completion ... care to take a stab at the number of vertex, geometry, fragment shaders, and drawing passes that are going to be needed to get this new pipeline working? Might take a while!
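To make the cubemap-as-hash idea from "Something New" a bit more concrete, here is a small CPU-side sketch. It only illustrates the binning rule (front first, one particle per texel, with an energy override) on a single mipmap level. The face and texel selection follows the standard OpenGL cubemap convention, but the names `cube_texel` and `build_hash`, the tuple layout of a particle, and the multiplicative `override` threshold are assumptions for illustration; the post does not say how the override is actually weighted, and the per-mipmap split by projected radius is left out.

```python
import math

def cube_texel(direction, size):
    """Map a non-zero eye-space direction to (face, i, j) on a size x size
    cubemap face, using the OpenGL face order +X, -X, +Y, -Y, +Z, -Z."""
    x, y, z = direction
    ax, ay, az = abs(x), abs(y), abs(z)
    if ax >= ay and ax >= az:
        face, ma = (0 if x > 0 else 1), ax
        sc, tc = (-z if x > 0 else z), -y
    elif ay >= az:
        face, ma = (2 if y > 0 else 3), ay
        sc, tc = x, (z if y > 0 else -z)
    else:
        face, ma = (4 if z > 0 else 5), az
        sc, tc = (x if z > 0 else -x), -y
    u = 0.5 * (sc / ma + 1.0)
    v = 0.5 * (tc / ma + 1.0)
    # Clamp so u == 1.0 or v == 1.0 still lands in the last texel.
    return face, min(int(u * size), size - 1), min(int(v * size), size - 1)

def build_hash(particles, size, override=4.0):
    """Bin particles into cubemap texels, nearest to the eye first.

    particles: iterable of (name, (x, y, z), energy), positions relative
    to the eye. A filled bin is only overwritten when the newcomer has more
    than `override` times the occupant's energy (hypothetical rule).
    """
    def distance(p):
        return math.sqrt(sum(c * c for c in p[1]))

    bins = {}
    for name, pos, energy in sorted(particles, key=distance):
        key = cube_texel(pos, size)
        held = bins.get(key)
        if held is None or energy > override * held[1]:
            bins[key] = (name, energy)
    return bins

particles = [
    ("far", (0.0, 0.0, -5.0), 1.0),
    ("near", (0.0, 0.0, -1.0), 1.0),
    ("hot", (0.0, 0.0, -9.0), 10.0),
    ("side", (3.0, 0.0, 0.0), 1.0),
]
bins = build_hash(particles, size=4)
```

Here "near", "far", and "hot" all land in the same texel of the -Z face. "far" is dropped because the bin is already filled, while "hot" takes the bin over because of its energy. A neighbor query is then just a lookup of the texels around a particle's own texel.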
071015 / Drawing in Reverse II

Back from the photography trip. I finally got around to doing the reverse drawing with hierarchical Z buffer Z-Cull, and measuring the performance difference with a static VBO. Well the results are in and it is a draw. The benefit was about 10%. The extra cost of drawing the x/4 by y/4 32-bit scalar float framebuffer using stencil, then turning each of those pixels into a 4x4 pixel quad to render Z into the full size depth buffer, and finally doing the full size drawing pass, is too much extra work. The overhead is something like 25% of the original drawing pass, for only a 35% gain.

However

After more testing, the performance found previously by drawing front first was simply a side effect of turning on the alpha test and throwing out ROP fragments with alpha under 0.0625. Stencil test wasn't even needed for the performance increase. I need 16x overdraw at a minimum to draw the frame anyway, and had 32x max in some areas but these were minor. In the end I need front first drawing for the physics scatter passes, so the front first is here to stay, and as it turns out I got a 10% improvement simply using the alpha test.

What is Next

Still working on the optimization of the engine. I'm taking a 1-2 week gamble on a full rewrite to a new combined overdraw culling and physics algorithm on the GPU. Current CPU time is roughly 16% tree prune, 32% overdraw cull, 16% particle to motion blurred impostor, and 36% generation double precision geometry work (tree traversal, etc). With the new system I'm moving 50% to the GPU, and optimizing, through simplification, the rest of the CPU bound code. Also switching my version of the "broad phase" collision detection pass from texture arrays to a mipmapped cubemap (rendering to the various mipmaps separately). More later when I know if it works or fails!
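As a reference for the front first drawing and alpha test discussed in this post, here is a per-pixel sketch of front-to-back compositing with an alpha cutoff. The blend equations are the usual front-to-back ("under") form, which is an assumption here since the post does not spell out the exact equations used; the function names and the fragment layout are also made up for illustration. Only the 0.0625 cutoff comes from the post.

```python
def composite_front_first(fragments, alpha_cutoff=0.0625):
    """Composite one pixel's fragments, nearest first.

    fragments: list of ((r, g, b), alpha), sorted front to back.
    Returns (color, coverage), where coverage is the accumulated alpha.
    """
    color = [0.0, 0.0, 0.0]
    coverage = 0.0
    for rgb, alpha in fragments:
        # Alpha test: very faint fragments are discarded before blending.
        if alpha < alpha_cutoff:
            continue
        # Only the part of the pixel not yet covered can receive color.
        weight = (1.0 - coverage) * alpha
        for c in range(3):
            color[c] += weight * rgb[c]
        coverage += weight
    return tuple(color), coverage

def composite_back_first(fragments, alpha_cutoff=0.0625):
    """Classic painter's order "over" blending of the same fragments,
    used here only to check that both orders agree on the color."""
    color = [0.0, 0.0, 0.0]
    for rgb, alpha in reversed(fragments):
        if alpha < alpha_cutoff:
            continue
        for c in range(3):
            color[c] = alpha * rgb[c] + (1.0 - alpha) * color[c]
    return tuple(color)

pixel = [((1.0, 0.0, 0.0), 0.5), ((0.0, 1.0, 0.0), 0.03), ((0.0, 0.0, 1.0), 1.0)]
front_color, front_coverage = composite_front_first(pixel)
back_color = composite_back_first(pixel)
```

Both orders produce the same color. The practical difference is that the front first version carries the accumulated coverage along, so it knows when a pixel is effectively full and later fragments no longer matter.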
070930 / Porting to SM3.0

Atom is a technology gamble, will enough people with GeForce 8x00 cards be interested in the game within the first year of release (some time in 2008)? If so, this will be a success, if not, well, I'd rather not go there. BTW, this is an OpenGL based game, so there is NO need to go to Vista to run this (to get DirectX10), the NVidia drivers for Windows XP have SM4.0 support!

Porting to AMD/ATI HD

Getting Atom to run on the new AMD/ATI HD cards should be easy, only have to replace my TextureArray usage with a large flat 2D texture and multiple drawing passes with glScissor() to mask the regions. Is it worth the change? I will probably wait and see in 2008.

Porting to SM3.0

First, ATI never really had SM3.0 support due to the lack of a vertex texture fetch ability. So all older ATI cards are an instant no go. Vertex texture fetch is an absolute requirement, render to vertex array just isn't going to cut it for Atom. Looking at the Valve Hardware Survey Summary about 22% have a 7600 or better NVidia card, and only about 6% have a SM4.0 able card. So could Atom be ported to SM3.0 to get about 3.6 times the possible user base?

The GPU Gems 2 book has a chapter which outlines the 6800 card's hardware. Key points from this are, limited to 4 MRTs, only float FP32 and vec4 FP32 un-filtered texture lookups from the Vertex Shader, only un-filtered FP32 texture lookups from the Fragment Shader, no TextureArrays, and no Geometry Shaders. From what I can gather, the 7x00 hardware is the same in this respect, just a faster series of cards. Getting around the lack of a Geometry Shader is a serious problem, requiring either more work on the CPU or 4 times more work on the Vertex Shader, including a bunch of dependent texture reads (to generate my impostor billboards on the GPU). Either way performance would suffer, but it could be done. However the real deal breaker might be the lack of a FP32 filtered texture lookup.
A little secret, the physics gathering step uses linear filtered texture lookups to filter between particles ... without it, physics effects start grouping particles because a much smaller than screen size buffer (in the x,y dimension) is used to accumulate the velocity vector field (and other physics properties). Manually filtering is not an option. I'm doing 65K mega particles with multiple gathers per particle. The performance just isn't there to manually filter each gather.

Going to FP16 was a possible fix, and would work for a 2D game. Would probably also greatly improve my performance due to 2x better texture caching and filtered lookup performance. Just not enough precision for Z. The X,Y position could be translated into projected screen space, then un-projected back to FP32 knowing the Z. So the first thought was to just use impostor drawing order as Z, 65K values fit nicely into a FP16 value. Then could do a dependent texture lookup to find actual Z given the FP16 impostor drawing order. Might work, but one more problem, my physics scattering pass draws alpha blended motion billboards (of the physics properties per particle) into a layered 2D buffer. It is the alpha blending and actual drawing of the path of the particle which enables the ultra smooth CFD physics. So if I had my XY velocity vectors in screen space, and my Z position as drawing order, the blending would be wrong. Also storing a Z velocity vector would be a mess. Could save Z velocity divided by particle radius, but would have to save particle radius in projected screen space as well. Doing a non-blended extra FP32 Z only pass could be an option, but it would double the work at the Vertex Shader level, and removes the Z blending which is important.

Bottom line, it only just might be possible to port to SM3.0, but is it worth it? This is what I am going to ponder while I'm reading GPU Gems 3 on the plane and am off photographing this week in the Sierras.
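To put a rough number on "just not enough precision for Z", the sketch below rounds values through a 16-bit float and back. It assumes the GPU's FP16 format behaves like IEEE 754 half precision (1 sign bit, 5 exponent bits, 10 mantissa bits), and uses Python's `struct` module "e" format on the CPU as a stand-in. The helper names are made up for illustration.

```python
import math
import struct

def fp16_roundtrip(value):
    """Round a float to the nearest IEEE 754 half precision value and
    convert it back to a Python float."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def fp16_step(value):
    """Spacing between neighboring half precision values near `value`.
    Only valid for the normal range (roughly 6.1e-5 up to 65504)."""
    _, exponent = math.frexp(abs(value))
    # 11 significant bits: 10 stored mantissa bits plus the implicit one.
    return 2.0 ** (exponent - 11)

for z in (0.1, 1000.3, 2049.0):
    print(z, "->", fp16_roundtrip(z), "step", fp16_step(z))
```

Around a depth of 1000 the representable values are 0.5 apart, and above 2048 not even every whole number survives the round trip. That is fine for a coarse ordering, but blending or filtering such values loses a lot of depth detail.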
070926 / Drawing in Reverse

I have been getting ready for one of our big photography trips this year, this time to the Sierra Nevada area starting in the first week of October to catch the fall color, so I will have to take a break from Atom for about two weeks.

On the Topic of Alpha Blending

I had a theory that only about 8 times of overdraw per pixel would be necessary to render everything in Atom. Currently using something upwards of 32 times overdraw per pixel, so if I could skip 3/4 of the overdraw, this would be a tremendous performance win. So I switched the rendering from back to front, to front to back. I changed the alpha blending equation, and added a stencil test so only the first 8 front most impostors per pixel get drawn. The result worked mostly, with one problem. When the first 8 pixels are all low alpha, there is still some artifacting. Adding in an alpha test to clip out really low alpha pixels so they didn't get included in my 8 pixel limit helped, but didn't fix the problem. A more innovative solution was needed!

If you think about it, when a pixel is generated by the overlap of many low alpha sprites, it is usually representing some kind of fog or haze. And this fog or haze usually has a similar color to the surrounding pixels. So if the accumulated coverage of a pixel is very low after drawing 8 pixels, it is probably safe to assume the fog/haze case. Now I had a solution to the problem. The solution is to add one more pass, drawing a 1/2 down-sampled copy (using the GPU's automatic mipmap generation) of the previous frame as the last back-most overdraw pass. The down-sampling blurs the pixels slightly (fog/haze), and fills in the areas of low alpha accumulation. Given a good 30 fps, the convergence of the algorithm is invisible to the eye. And it worked, really really well!

Final Step to a Huge Performance Win

Already the stencil test helps quite a lot by skipping the fragment shader (and thus 2 texture reads, and 1 ROP blend).
But there is a faster way, by eliminating large groups of pixels way before the stencil check. After some research, it looks as if only the newest AMD/ATI GPUs have a hierarchical stencil buffer, enabling the stencil pass to clip out groups of pixels (say 16 or 32) at a time. So the next best option is to use the hierarchical z-cull hardware, which I believe is similar in function in all DX10 type cards. Filling the Z buffer is another subproblem. Looks like to use the z-cull, I'm going to have to draw polygons with alpha test off, and no fragment shader depth write. So my idea is to draw a mini framebuffer (x/4 by y/4) first using the stencil idea, but only drawing Z into a texture instead of color. So the last z drawn is for the 8th pixel drawn into the mini framebuffer. Then use a vertex shader to generate two triangles per pixel of the mini framebuffer, and do a depth only write of the resulting z values into the full size Z buffer. Then the z-cull hardware should be primed to quickly chop groups of