081209 / OpenCL

OpenCL seems like a design which can enable high performance code across NVidia GPUs, AMD GPUs, and Larrabee. There are some clear GPU mapping hints in the spec. CL_DEVICE_GLOBAL_MEM_CACHE_TYPE has CL_READ_WRITE_CACHE for Larrabee. Not sure what CL_READ_ONLY_CACHE implies, perhaps just that constant buffer access is cached?

OpenCL supports writing to a framebuffer: CL_DEVICE_MAX_WRITE_IMAGE_ARGS has a minimum of 8 (the DX10 MRT minimum) if CL_DEVICE_IMAGE_SUPPORT is CL_TRUE. One very interesting note is that OpenCL's write_image*() functions take a direct integer coordinate. So support for "write anywhere in an image" is in the spec. It is an open question whether current hardware will support anything other than using the work-item coordinate in the work-dimension as this write_image*() coordinate. In theory, support for "write anywhere in an image" could be emulated with global memory writes. OpenCL will not allow both reading and writing the same image from a given kernel. So don't expect to use OpenCL to emulate programmable blending unless going directly to global memory.

Should this change my theory on future NVidia hardware? I had a hope that the CUDA PTX .surf memory space would be supported in DX11 level NVidia hardware, and get into OpenCL. This would have provided both read and write access to framebuffers (ie surfaces), in the form of a small coherent cache. Just as important, .surf would have provided the automatic address bit interleaving for 2D cache friendly addressing. Yet OpenCL has no support for anything like .surf, instead just write-only access to framebuffers. This hints that .surf was an early idea later abandoned by NVidia and that future NVidia hardware will go in a different direction. Given the 3 or so years from GPU design to production, the future is already known right now by the hardware vendors, and probably greatly shaped the design of OpenCL.

So is OpenCL the future?
My guess is that for now we are set with the post G80 G8x/G9x level hardware (ie lower performance vector gather/scatter, but compute 1.1 global integer atomics and asynchronous memory copies), with the G2xx feature set reaching low-end/mid-range/mainstream in late 2009 or early 2010 (much better vector gather/scatter, better atomics), and with G3xx/DX11 level on the high end only in late 2009 or early 2010 (full-on competition with Larrabee?). Perhaps OpenCL/DX11 hardware will be the basis of the next console generation for 2011/2012, which will lock in a feature set for a long time. So I'd place a bet that OpenCL is the future, but paired with a DX11 tessellator and other raster features.

If .surf isn't the future, what is? Perhaps the ROP/OM unit, including the ROP cache, goes general purpose. So normal vector gather/scatter (a majority of bus bandwidth) still goes direct to global memory (but with a G2xx vector gather/scatter friendly memory controller). However, global memory atomics go through a separate coherent data cache (which used to be the ROP cache). Global memory atomics are optimized such that atomic vector scatter/gather of 32bit values is fast enough (and bandwidth efficient enough because of the cache) to implement a software Z buffer. Then blending goes programmable (even though DX11 doesn't support this).

This is how I hope things will go, and it looks to be 100% API compatible with OpenCL (and Larrabee). It would provide the currently missing GPU hardware functionality required to implement fast parallel algorithms, which need parallel binning/queuing. In fact this would be about all I need for everything I'm personally working on.

If this is the case, plan on doing everything with untyped buffers of linear memory. Use atomic operations. Also plan on doing manual integer addressing, and manual format conversion (instead of "free" format conversion via textures or surfaces).
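To make the "untyped linear memory plus atomics" idea a bit more concrete, here is a minimal CPU-side sketch in Python, not real kernel code. It assumes a square power-of-two buffer, a 16 bit depth packed above a 16 bit payload in one 32bit word, and that the smallest depth wins. A plain `min` stands in for what would be an atomic min on a GPU. All the function names are made up for illustration.

```python
# CPU-side sketch only. On a GPU the min() in scatter() would be a
# global memory atomic min on a 32-bit word. All names are hypothetical.

EMPTY = 0xFFFFFFFF  # "nothing written yet": loses to any real depth


def part1by1(v):
    """Spread the low 16 bits of v so a zero bit sits between each bit."""
    v &= 0x0000FFFF
    v = (v | (v << 8)) & 0x00FF00FF
    v = (v | (v << 4)) & 0x0F0F0F0F
    v = (v | (v << 2)) & 0x33333333
    v = (v | (v << 1)) & 0x55555555
    return v


def morton2d(x, y):
    """Manual integer addressing: interleave x (even bits) and y (odd bits)."""
    return part1by1(x) | (part1by1(y) << 1)


def pack(depth, payload):
    """Manual format conversion: 16-bit depth in the high half, payload low."""
    return ((depth & 0xFFFF) << 16) | (payload & 0xFFFF)


def unpack(word):
    """Inverse of pack(): returns (depth, payload)."""
    return word >> 16, word & 0xFFFF


def make_buffer(size):
    """Linear buffer for a size x size image (size assumed a power of two)."""
    return [EMPTY] * (size * size)


def scatter(buffer, x, y, depth, payload):
    """Software Z test: keep the word with the smallest depth at (x, y)."""
    addr = morton2d(x, y)
    buffer[addr] = min(buffer[addr], pack(depth, payload))
```

Usage is just `buf = make_buffer(4)` followed by any number of `scatter(buf, x, y, depth, payload)` calls in any order. Because depth sits in the high bits, the surviving word at each address is the one with the lowest depth, which is the property a real atomic min would give under parallel writes.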
Design algorithms which adapt to CL_DEVICE_PREFERRED_VECTOR_WIDTH_*, and work-group dimensions such that the width is a function of CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE. I'm also betting that the era of single-device GPUs ends now (or very soon). All future-safe stuff should plan on being multi-device to scale in performance...

GPU Gems 3 Online

BTW, GPU Gems 3 is going online!

Atom Updates

Screenshots and new videos to follow when I get the time. I've been doing a lot of profiling with the Linux version of NVPerfSDK. I am very thankful that NVidia released an updated Linux version. Turns out Atom Engine v1 is very performance dependent on post PS3 GPU stencil feature support (using coarse stencil to limit blending overdraw, which works amazingly well on post G7x hardware). So v1 isn't a good match for PS3, and now I'm back to heavy development on v2. FYI, Atom Engine v2 is the all-GPU version, which has a DX10 hardware level requirement. Current plans/progress:

Point Scatter. For Atom v2 I'm still doing basically an insane particle system which draws into an octahedron. Visibility and display traversal are solved stochastically, so it is an inexact solution. Each frame, particles draw only their children (via L-system rules) into the next frame. Temporal jitter is used to vary the particle collisions each frame. Parents (meaning particles closer to the root) have a higher priority. Priority is written out as depth, so the Z buffer handles collisions. I like to call this the "visibility waterfall", because the particle tree which represents the scene expands by only one level per frame, and changes each frame. NVPerfSDK shows that on the 8600 GTS, I'm currently ALU bound on my point scatter, and not setup, ROP, or bandwidth bound.

Optimizations. I've divided the engine into 2 sets of point scatter. I use a lower resolution octahedron buffer for visibility. This buffer switches between a vertically and horizontally stretched aspect ratio each frame to increase detail given a set number of particles.
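As an aside on the octahedron buffer: the post doesn't spell out the mapping, so the following is only a sketch assuming a standard octahedral parameterization of the full sphere of directions onto a square. The function names and the `to_texel` helper (which accepts a non-square buffer, as a stretched aspect ratio would need) are hypothetical and not taken from the engine.

```python
import math


def sign_not_zero(v):
    """Sign of v, treating zero as positive so folds stay well defined."""
    return 1.0 if v >= 0.0 else -1.0


def octahedron_encode(x, y, z):
    """Map a non-zero direction (x, y, z) to (u, v) in [-1, 1] x [-1, 1]."""
    n = abs(x) + abs(y) + abs(z)  # project onto the octahedron |x|+|y|+|z|=1
    px, py = x / n, y / n
    if z < 0.0:
        # Fold the lower hemisphere out into the corners of the square.
        px, py = ((1.0 - abs(py)) * sign_not_zero(px),
                  (1.0 - abs(px)) * sign_not_zero(py))
    return px, py


def octahedron_decode(u, v):
    """Inverse of octahedron_encode(): returns a unit direction."""
    z = 1.0 - abs(u) - abs(v)
    x, y = u, v
    if z < 0.0:
        # Undo the corner fold for the lower hemisphere.
        x, y = ((1.0 - abs(v)) * sign_not_zero(u),
                (1.0 - abs(u)) * sign_not_zero(v))
    length = math.sqrt(x * x + y * y + z * z)
    return x / length, y / length, z / length


def to_texel(u, v, width, height):
    """Scale (u, v) to integer texel coordinates of a width x height buffer."""
    tx = int(min(max((u * 0.5 + 0.5) * width, 0.0), width - 1))
    ty = int(min(max((v * 0.5 + 0.5) * height, 0.0), height - 1))
    return tx, ty
```

With a mapping like this, swapping `width` and `height` on alternate frames would be one way to get the alternating stretched aspect ratio, though that is a guess at the mechanism rather than a description of it.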
I then do a second point scatter into the output fisheye projection. This second scatter draws all the child points of all the lower resolution particles. The particles in this 2nd scatter don't get recirculated for visibility. With these optimizations, I was running at 90 FPS on the 8600 GTS at 720P with the test engine.

Challenges. The particle output to the fisheye projection has holes (point scatter). Given my limit of one level of scene tree expansion per frame, there is a lag for detail draw-in of newly un-occluded geometry. Also, collisions in the visibility computation can cause bits of geometry to disappear once in a while. These are all really tough problems to solve!

Reprojection. I'm working on a reprojection based solution: reprojecting the results of the previous frame given motion compensation. The temporal jitter has a side effect of enabling perfect anti-aliasing and filtering, which works well even under subpel motion reprojection. As expected, transparent rendering is working quite well with the combination of temporal dithering and reprojection. I've also merged the reprojection code with a semi-CFD code (think motion compensated advection). So the scattered points can interact in all sorts of dynamic ways. When the screen is fully dynamic and running the CFD code, I can fully hide any artifacts.

Static Geometry? The thing I'm having problems with currently, in all irony, is drawing fast moving non-dynamic surfaces. In these cases I have the artifacts hidden in a noisy way, which I am not 100% satisfied with yet. So I'm going to attempt a dual reprojection. First, a reprojection/hole-filling pass for just the point drawing to the fisheye, designed for fastest convergence. Then a second reprojection, which uses the first reprojection to control the smooth primary view reprojection (which includes motion blur to hide artifacts).

Free HDR, Free DOF.
I'm using FP16 framebuffers for the reprojection, so I get free linear colorspace blending, and HDR is currently running at 60-90 FPS at 720P on the 8600 GTS. DOF I haven't tried yet, but I believe I will get it for free by adjusting the temporal jitter radius by a computed circle of confusion given camera parameters.

Global Illumination. I've got 50% of the frame budgeted for physics and lighting. My ideas for lighting and shadowing fall into the realm of insanity (like the rest of this project). The idea here is to amortize the illumination computation across both time and space. Particles in the "visibility waterfall" are going to incrementally calculate illumination. I've got a relatively small number of particles per screen pixel, so I can do a lot per particle. Each frame I am going to take the octahedron buffer and do a mipmap reduction which joins particles based on maximum intensity and maximum occlusion (for shadows). Then each particle is going to raycast (perhaps backwards) each frame into the mipmap reduction. The reduction represents importance sampling. Each frame the particle does a raycast in a new direction to factor into its illumination computation. In this way I'm going to gather illumination from many directions, as the particle splits into smaller and smaller children. Children thus share raycasts from parent nodes, intermediate results are used for the next frame's computation, and everything is nicely amortized.

Interaction, Physics. I'm expecting about 1024 root nodes in the 360 degree view at a time. These are going to interact via semi-rigid body code (really easy to do). Child particles are going to interact via a very strange hierarchical SPH-like code. And pixels interact using the semi-CFD/reprojection code described above. The challenge here is that non-root interaction has to remain consistent across frames given my strange "visibility waterfall" system.
Since nodes only exist for one frame, and then disappear only to be re-spawned by a new version of the parent a frame later, I have to make sure that the re-spawn gets affected by the same thing (just at a later time) that affected the particle two frames back in time. For acceleration I'm going to be doing a weighted reduction of the octahedron based on velocity*mass. This reduction also needs to include an interaction shadow so that the re-spawn works out correctly. Roots are going to be semi-coherent across network players, interaction shadows will also be semi-coherent, and everything else (view CFD, node SPH) is going to be completely different. It's going to be wild if it works... and I have a backup plan if it doesn't.

Dynamic L-System. I've decided to toss out the animated L-system, and instead use an L-system in which the rules change based on local surroundings and physical interaction.

Atom ©2009-2007 Timothy Farrar