080628 / Micro Polygons

But first...

There is something about seeing the awesome success of others which inspires one's competitive spirit to strive for excellence. For me it is the frontier of GPU parallel programming and graphics.

Voxels Again

Literally could not wait to see more of Carmack's Sparse Voxel Octree, so I and another decided to figure out exactly how to do it on current hardware, and do it nearly as fast as current triangle rendering. Have an answer too, and that is all I am saying. Almost sure this method isn't what Carmack had in mind. The idea is in good hands, and I have moved on to even greener pastures (as the idea isn't as friendly to non-static geometry as I would like).

CUDA 2.0

NVidia's newest hardware has some very important changes beyond double the registers, double precision, and a faster geometry shader. Global memory fetch and store lose their stalls on thread bank conflicts. Cost is now proportional to the minimum number of bus granularity accesses required to service the vector access. So global memory swizzle is free. However, CUDA 2.0 is still missing PTX's .surf (surface cache).

How about one tough open question to all CUDA programmers: what is the fastest way to simulate a Z buffered write in CUDA, especially given the case where multiple threads might or might not write to the same pixel with different Z values? If you've got an idea, I would love to get an email...

Realtime Raytracing on ATI's Hardware?

Add in a few other buzz words like voxels, pixels with depth, re-lighting, wavelets, and you have a completely new form of confusion, and I really do like puzzles. Check out the talk on the ompf board. Just what are they using the tessellation hardware for in a DX9 based app? My guess is that we are simply looking at 4D (space and time) realtime re-lighting, with an awfully large amount of pre-computation. Here is a hint, not sure what it means yet.
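Going back to the Z buffered write question in the CUDA 2.0 section above, one candidate (a sketch under assumptions, not a measured answer and not necessarily the fastest) is to pack depth into the high bits of a 32-bit word and a small payload into the low bits, then keep the minimum word per pixel atomically. The version below is a CPU side C++ stand-in: the compare-exchange loop plays the role that a 32-bit atomic min such as CUDA's `atomicMin` would play on hardware that supports global memory atomics. The names `pack_depth_payload`, `z_write`, and `ZBuffer`, and the 24/8 bit split, are made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack a 24-bit depth into the high bits and an 8-bit payload (for example a
// palette or material index) into the low bits. Comparing two packed words
// then compares depth first. The 24/8 split is an assumption of this sketch.
inline std::uint32_t pack_depth_payload(std::uint32_t depth24, std::uint32_t payload8) {
    assert(depth24 < (1u << 24) && payload8 < (1u << 8));
    return (depth24 << 8) | payload8;
}

// Atomically keep the smaller (nearer) packed word in a pixel. The
// compare-exchange loop stands in for a hardware atomic min.
inline void z_write(std::atomic<std::uint32_t>& pixel, std::uint32_t packed) {
    std::uint32_t seen = pixel.load(std::memory_order_relaxed);
    while (packed < seen &&
           !pixel.compare_exchange_weak(seen, packed, std::memory_order_relaxed)) {
        // A failed exchange refreshes 'seen'; loop while we are still nearer.
    }
}

// A frame buffer of packed words, cleared to the farthest possible value.
struct ZBuffer {
    std::vector<std::atomic<std::uint32_t>> pixels;
    explicit ZBuffer(std::size_t count) : pixels(count) {
        for (auto& p : pixels) p.store(0xFFFFFFFFu, std::memory_order_relaxed);
    }
};
```

Any number of threads can call `z_write` on the same pixel in any order and the nearest write wins. The obvious cost is that only 8 bits survive next to the depth, so a real renderer would more likely store an index into a side buffer there and resolve color in a later pass.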
On the Previous Post

I have no idea how many people actually read this and can get used to how I am constantly running in multiple directions at once. So a quick update on the previous post: I did manage to quickly prototype my idea of doing all post-processing in one pass with the same samples, and a large amount of it actually works quite well. Given that post processing can be over 30% of the GPU cost of rendering a game's frame and is usually 100% texture bound, there are lots of free ALU cycles to use to figure out where to sample and what to do with the samples. Biggest surprise was how well the frame re-circulation anti-aliasing worked.

Moved back to non-polygon rendering again, so this work sleeps for now.

A Very Important Semi-Missing Hardware Feature

I'm not currently interested in this, but anyone with a traditional triangle based renderer should be. Let's just fast forward to when your engine can afford to do cone step mapping (or some other fragment based raytracing). Perhaps this time is even now, because your engine is just that awesome in that it can do cone step mapping while handling the GPU cost by LODing shaders (reverting to less expensive parallax mapping, bump mapping, and simple texturing based on distance). Normally you would be limited to the triangle's hard edges. Perhaps you even went nuts and tried to extend with fins in the geometry shader. There is (on some hardware, and could be on other hardware) a better way.

First, what you want is for the triangles to simply be a bounding hull over the surface. When your cone step mapping misses the surface, you tex kill the fragment. Of course the Z buffer still holds just the triangle surface, so anything which intersects this surface will do so wrongly. Unless of course you write depth, which would kill your frame rate. The problem is that writing depth causes the fast Z-Cull (NVidia) and Hi-Z (ATI) hardware to turn off.
The idea being that there is no way for the hardware to compute min/max Z during coarse rasterization if depth is written. However, if you could tell the hardware that you were only writing depth increasing away from the triangle surface, the coarse and fine rasterization hardware could easily compute minimum Z and do fast Z cull. It just could not do trivial accept (since maximum Z isn't known). Turns out this isn't possible on the NVidia 6 and 7 series hardware (which includes the PS3), but is possible on the 360. Not sure if later NVidia hardware or PC ATI/AMD hardware can do this. Could be an API problem even if the hardware had the feature.

So for all the game developers who are reading this, ask your hardware vendor contacts for this feature right now, so it gets into all future hardware. You want this ability to work on all cards; it will enable you to drastically lower your poly count and still have fully detailed objects. You don't want to lose to the rest of us non-polygon rendering people as hardware gets faster.

Another Missing API/Hardware Feature

For those who don't read about this stuff elsewhere on the internet, first a quick refresher on GPU hardware. Fast texture reads come from compressed swizzled textures. This enables great texture cache performance. The swizzling provides 2D locality. You lose swizzle and compression when you read from a frame buffer or render target. Now texture cache misses read in full memory granularity lines (which can be large), so random accesses from a render target can really hurt, especially with 32-bit texels.

What I would like to see is the ability to have more than a 4 component frame buffer. Sure you can have up to 8 render targets in DX10, but random fetches from those 8 render targets cause a loss of texture cache efficiency. What is needed is the ability to fetch from all render targets with one global memory read. So render targets would be interleaved in memory, and random access reads would be bandwidth friendly.
This would be hugely useful for GPGPU in that you could randomly load/store objects at full bandwidth efficiency and utilization as long as they matched the GPU memory granularity (which is often 2 FP32 vec4s). This can somewhat be done currently with CUDA, if the texture is a linear texture. I still have hope that DX11's compute shader will enable general purpose global memory read/write from shaders. If/when that is here, this feature could be less useful.

... and finally on to the topic of this post.

Micro Polygons / GPU Scatter Revisited

Take a look at the FragSniffer and refresh your mind about the G80 hardware. GPUs are designed to draw frame buffer aligned 2x2 pixel quads. The parallel hardware takes care of building SIMD vectors of verts for parallel execution, then triangle setup / rasterization packs SIMD vectors of pixel quads for parallel execution, and finally the ROP/OM (output merger) write combines the results. Small triangles lose efficiency really, really fast, as many of the pixels in a quad are not actually in the triangle. So fragment shader efficiency tanks and GPU resources are wasted.

The great irony of rendered graphics these days is that often, for any given frame, more verts are processed than there are pixels on the screen. If the GPU rasterization hardware can set up just one primitive per clock cycle, that is roughly 500M tri/s or more, against under 30M pix/s at 720P at 30Hz (1280 x 720 x 30 is about 27.6M). Or in other words, over 16 tris per screen pixel per frame at 720P at 30Hz. You can probably see where this train of thought leads, but before we start with that insanity, let's go through a few more details.

Huge triangle setup ability is needed because triangles often need many passes in the rendering pipeline. For example, in a typical non-deferred style engine, the same triangle might get drawn once in a pre-Z pass, another 1-4 times in shadow map generation, and perhaps another 1-4 times in lighting.
Then factor in all the triangles which get culled by the view and by occlusion. So the ability to set up 16 triangles per screen pixel per frame at 720P at 30Hz starts to make sense, even when most triangles are many times larger than a single pixel.

Now say we want to evolve our rendering pipeline and draw pixel sized micro polygons. This would be slow on current hardware, right? After all, G80 can only fragment shade 4 pixels per 32-wide SIMD vector, so fragment shaders would run at 1/8 capacity. There is an answer on current DX10 level hardware: don't do work in the fragment shader. Draw points, and do all the fragment shading in the vertex shader using texture fetch, where the GPU is still packing at 100% efficiency for ALU SIMD computation. Set up all vertex shader outputs as GLSL SM4.0 "noperspective" so they don't get interpolated. And use the fragment shader only as a pass-through to get vertex shader outputs to the ROP stage. The ROP stage should be able to re-merge single pixel output well without a large bandwidth cost, so fragment shader and ROP should no longer be a bottleneck. Of course, the vertex shader would have to have enough work or texture fetches to hide the triangle setup bottleneck, which should be possible when shading points with really complex shaders.

Taking Points Seriously

Right, so if you can solve the following problems, a point sized triangle render could work.

1. How to anti-alias when there isn't enough triangle setup to fill a MSAA framebuffer?

Already started on a prototype. I have returned to my GPU only engine concept. Full GPU side L-system tree traversal is working, with visibility and occlusion computations only on the GPU as well. Also moved beyond cubemaps and into an octahedron map which fits into a single texture. Key here is that it can be updated via points with one draw call, whereas a cubemap would require drawing each point to each face.
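To make the single texture point concrete, here is a minimal sketch of one common octahedron mapping: a direction is projected onto the octahedron |x| + |y| + |z| = 1, and the lower half is folded outward into the corners of the unit square. This is an assumed layout for illustration and may not match the one used in the prototype. The names `octahedron_encode`, `octahedron_decode`, and `sign_not_zero` are made up here.

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { float x, y; };
struct Vec3 { float x, y, z; };

// Sign that never returns zero, so the fold below stays well defined on axes.
inline float sign_not_zero(float v) { return v >= 0.0f ? 1.0f : -1.0f; }

// Map a non-zero direction to texture coordinates in [0,1] x [0,1].
inline Vec2 octahedron_encode(Vec3 d) {
    float l1 = std::fabs(d.x) + std::fabs(d.y) + std::fabs(d.z);
    assert(l1 > 0.0f);
    float x = d.x / l1;
    float y = d.y / l1;
    if (d.z < 0.0f) {
        // Fold the lower hemisphere over the diagonals into the corners.
        float fx = (1.0f - std::fabs(y)) * sign_not_zero(x);
        float fy = (1.0f - std::fabs(x)) * sign_not_zero(y);
        x = fx;
        y = fy;
    }
    return { x * 0.5f + 0.5f, y * 0.5f + 0.5f };
}

// Inverse mapping: texture coordinates back to a unit direction.
inline Vec3 octahedron_decode(Vec2 uv) {
    float x = uv.x * 2.0f - 1.0f;
    float y = uv.y * 2.0f - 1.0f;
    float z = 1.0f - std::fabs(x) - std::fabs(y);
    float t = z < 0.0f ? -z : 0.0f;   // non-zero only in the folded corners
    x += x >= 0.0f ? -t : t;          // undo the fold
    y += y >= 0.0f ? -t : t;
    float len = std::sqrt(x * x + y * y + z * z);
    return { x / len, y / len, z / len };
}
```

Every direction lands on exactly one texel of one 2D texture, which is why a point only needs to be drawn once. With a cubemap, which face a point belongs to is not known until the point has been projected.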
So I'm doing the full 360 degree fisheye projection from the octahedron map from Atom's L-system based renderer. I'm not to the lighting, hole filling, or anti-aliasing yet, just got rough traversal working in the prototype, so don't expect anything awe-inspiring in the screen shots, at least not yet.

The octahedron to fisheye projection mapping loses detail towards the edges of the lens, so I'm going to handle that via a lens blur when I get around to it.

The noise and aliasing in the system is by design. These shots are simply a debug view of the results of the tree traversal. Actual lighting and rendering is going to be done via a single full screen pass using frame buffer re-circulation (no point drawing).

Doing something like 4M points a frame currently (including overdraw).

View distances are effectively infinite.

Atom ©2008/2007 Timothy Farrar