080813 / OpenGL 3.0 II

20080919 - Updated with info on GL3 driver support from ATI.

GL3 is an incremental update to GL2 that adds support for some of the most important current GPU hardware features and sets a path towards spec simplification (via deprecation of parts of the API). I am maintaining this page on my blog with all future updates on GL3.

Features :: GL3 ARB Extensions / In DX10

Known as the extension pack; these didn't make it into core because of the deadline for Siggraph 2008. ATI is expected to support these in a possible Q1 2009 driver release. I would expect these in core in GL3.1 because they directly match the DX10 hardware spec.

GEOMETRY SHADERS :: ARB_geometry_shader4 - Also supported in GL2 via extensions by Apple and NVidia through EXT_geometry_shader4. Apple even emulates this functionality via LLVM on hardware which doesn't have a geometry shader pipeline stage; check out the Apple GL Capabilities Matrix.

INSTANCING :: GL_ARB_draw_instanced, GL_ARB_instanced_arrays - Draw instanced provides gl_InstanceIDARB to shaders, and instanced arrays provides a frequency stream divider for vertex inputs.

TEXTURE BUFFER OBJECTS :: ARB_texture_buffer_object - Supported in GL2 by NVidia via EXT_texture_buffer_object. A texture buffer object provides unfiltered texel lookups (integer index) into 1D linear buffer objects. Basically the glue which provides the ability to fetch from transform feedback / vertex buffers, CPU writes to mapped GPU memory, and framebuffer readbacks (PBOs).

Features :: GL Vendor Extensions / Not in DX

FENCE :: NV_fence, APPLE_fence - Nearly positive this isn't available in the PC versions of DX. Provides finer granularity of synchronization with GL, which can be useful for measuring the GPU latency of drawing commands. Also can be useful for multi-threading with GL.

Features :: GL via Extensions / In DX10

PARAMETER BUFFER OBJECTS, BINDABLE UNIFORMS :: NV_parameter_buffer_object, EXT_bindable_uniform - Supported on NVidia cards only?
EXT_bindable_uniform is supported on NVidia cards via Apple drivers. Provides a 64KB constant store which is array accessible in shaders and via buffer objects, and avoids the compile and link problems of uniforms. In DX10 constants are declared in constant buffers (just like parameter buffer objects, or bindable uniforms), and this can be a major cause of low performance if used improperly. When a constant buffer is updated, the entire buffer is uploaded to the GPU. Need to balance the number of calls that update constant buffers against the amount of data transferred per call.

One thing to remember about constants and uniforms (ie what CUDA refers to as Constant Memory) is that while constant memory is cached (8KB cache on NV 8 series cards I think), a miss costs memory read(s) from device memory. On a cache hit, reading a uniform is as fast as reading from a register only if all threads read the same constant; otherwise the cost scales linearly with the number of different constants read (divergence). The CUDA docs don't say if divergent constant reads can be fully hidden by the hardware (I suspect not). They do say that texture fetches are not subject to the constraints on memory access patterns that global and constant memory reads must respect to get good performance. Also that the latency of addressing calculations is hidden better with texture fetches, and that packed data may be converted for free and broadcast to separate variables in a single operation with a texture fetch only (ie texture fetch a vec4 into registers vs 4 separate constant fetches). Obviously for divergence, random access and larger working sets, texture fetch is going to be the faster path, but for non-divergent constant access parameter buffer objects would be the way to go.

DRAWAUTO CALL :: NV_transform_feedback2 - Draw geometry of an unknown size that was created by the geometry shader stage. Currently only supported in GL via the above NVidia extension.
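Going back to the constant read cost described under parameter buffer objects above, here is a toy CPU-side model of it in C++. It is only a sketch of the "scales linearly with the number of different constants read" rule, assuming every read hits the cache and that one read costs one unit; the function names and the group size of 32 threads used in the note below are my own choices for illustration, not anything taken from the CUDA docs.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy cost model (an assumption, not vendor-documented timing): on a constant
// cache hit, a group of threads pays one read per distinct constant address.
std::size_t constantReadCost(const std::vector<int>& addressPerThread) {
    assert(!addressPerThread.empty());
    std::set<int> distinct(addressPerThread.begin(), addressPerThread.end());
    return distinct.size();
}

// Every thread in the group reads the same constant (non-divergent access).
std::vector<int> uniformAccess(std::size_t threads, int address) {
    return std::vector<int>(threads, address);
}

// Every thread in the group reads its own constant (fully divergent access).
std::vector<int> divergentAccess(std::size_t threads) {
    std::vector<int> addresses(threads);
    for (std::size_t i = 0; i < threads; ++i) {
        addresses[i] = static_cast<int>(i);
    }
    return addresses;
}
```

Under this model a group of 32 threads that all index the same constant pays for a single read, while 32 threads that each index a different constant pay for 32. That is the case where a texture fetch is likely the better path.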
TEXTURE FETCH FROM A MULTISAMPLED TEXTURE :: NV_explicit_multisample - DX10 provides the ability to fetch individual samples from the texture backing a multisampled render target. Even on DX10 with NVidia's drivers you can sample both depth and color. Initial GL support for this can be found in NVidia's extension. The extension also enables explicit control of the multisample sample mask!

Features :: GL Future Object Model Improvements / In DX10

IMMUTABLE STATE OBJECTS - This functionality has limited availability in DX9 in the form of StateBlock9, which can record a bunch of state changes and apply them with one function call. This is a core part of the eventual object based API rewrite for GL. It was also a core change for DX10 (which organized the pipeline into 5 immutable state objects: input layout, raster, sampler, depth/stencil, blend). The point is to push validation and processing overhead into object creation, so this is an optimization to improve batch performance. DX10 is limited to 4096 objects for most types. DX10 didn't get threading right (this is finally being addressed in DX11), and GL still has the opportunity to get the object model right, so that command buffers (what the driver creates to issue commands to the GPU) can be created in parallel on separate threads.

Many of the unfinished object model improvements could get incrementally rolled into the spec prior to the eventual cleaner API. For example, Apple's Vertex Array Objects are now core in GL3, which provides a single object similar in function to DX10's input layout. Should be a good place for optimization for GL2 based apps. EXT_direct_state_access might also be a possible intermediate optimization for draw call bound applications, depending on which drivers support this extension.

DECOUPLE TEXTURES AND TEXTURE FILTERING - Ability to sample from a texture with both filtered and non-filtered texture fetches. Requires the object API change described in Immutable State Objects above.
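To make the immutable state object idea concrete, here is a minimal C++ sketch of the pattern: validate once at creation, then treat binding as a cheap pointer swap. This is not the DX10 or GL API. BlendDesc, BlendState, Device and the 0..3 factor codes are all hypothetical names and values invented for the illustration.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <utility>

// Hypothetical blend description; factor codes 0..3 are made up for this sketch.
struct BlendDesc {
    bool enabled;
    int srcFactor;
    int dstFactor;
};

// Immutable once created: all validation cost is paid in create().
class BlendState {
public:
    static std::shared_ptr<const BlendState> create(const BlendDesc& d) {
        if (d.srcFactor < 0 || d.srcFactor > 3 ||
            d.dstFactor < 0 || d.dstFactor > 3) {
            throw std::invalid_argument("bad blend factor");
        }
        return std::shared_ptr<const BlendState>(new BlendState(d));
    }
    const BlendDesc& desc() const { return desc_; }

private:
    explicit BlendState(const BlendDesc& d) : desc_(d) {}
    const BlendDesc desc_;
};

// Hypothetical device: binding does no validation, it only swaps a pointer.
class Device {
public:
    void setBlendState(std::shared_ptr<const BlendState> s) {
        blend_ = std::move(s);
    }
    const BlendState* blendState() const { return blend_.get(); }

private:
    std::shared_ptr<const BlendState> blend_;
};
```

Because the object can never change after creation, a driver built this way would not need to re-validate it per draw, and read-only objects are also easier to share between threads that build command buffers.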
ABILITY TO SET MAX ANISOTROPY :: EXT_texture_filter_anisotropic - DX10 supports this in the core through D3D10_SAMPLER_DESC.MaxAnisotropy, while GL supports it only through an extension.

Features :: Not in GL3 / In DX10.1

The following are DX10.1 features (ie only supported on ATI hardware to my knowledge). Would not expect these in the spec until supported on a broader hardware base.

CUBEMAP ARRAYS, INDIVIDUAL RENDER TARGET BLEND MODES, GATHER4

Features :: Not in GL3 / Not in DX10

LOCAL CACHED FULLY COMPILED SHADERS - DX byte code isn't the solution; it needs to be re-optimized and re-compiled by the drivers. For example ATI's shader compiler has to fight to undo the DX byte code optimization when it does its internal recompile. I'm sure the case is the same for NVidia as well. Not practical to pre-compile for all current and future hardware. A possible good compromise is to cache compiled shaders on the local machine, so you only take the hit once.

PROPER THREADING SUPPORT - This is expected as a core feature in DX11, and most likely in future GL through object model improvements. Both DX10 and GL currently have really poor threading models. Arguably GL actually has better support for threading than DX10, in that GL can share objects between contexts. The future is the ability for the driver to build command buffers user side per thread.

SHADER SCATTER - This is expected as a core feature in DX11, but perhaps limited to compute and pixel shaders. Both ATI and NVidia current DX10 hardware support this, but there is no API to use it. Ability to do general writes to graphics resources from within a shader.

COMPUTE SHADER - Expected as a core feature in DX11. Available now through CUDA, which can mix with DX and GL, but only on NVidia cards. The path for GL is OpenCL. OpenGL/OpenCL to be able to share resources without a COPY! OpenGL/OpenCL to have very flexible scheduling. OpenGL/OpenCL is going to be awesome!

DYNAMIC SHADER LINKING - Expected feature in DX11.
TESSELLATION - Expected feature in DX11.

GL3 Driver Support

Until the vendors finish up their GL3 drivers, one can currently prototype GL3 work on NVidia's nearly full featured GL3 driver, or use the pre-GL3 non-ARB vendor extensions on the currently released GL2 drivers from both NVidia and Apple.

NVIDIA - NVidia's GL3 Drivers for Windows, Linux, FreeBSD, and Solaris. According to NVidia, the driver has full core GL3 functionality excluding, Includes the following new extensions,

ATI - Possible Q1 2009 for complete 3.0 driver release, plans to implement the GL3 ARB extensions. Source, BOF slides. Early beta available now with Catalyst 8.9 drivers for Windows. Includes the following GL3 support,

APPLE - My personal guess is that we might see GL3 ARB extension support early on NVidia hardware, since Apple has EXT_gpu_shader4 support currently, but that full GL3 support on ATI hardware happens at a similar time to ATI's release of PC drivers.

INTEL - Expected on Larrabee and future platforms, but it seems it is not expected on current hardware. Source, BOF slides.

GL3 Hardware Fast Paths Reference

Don't have time yet to do this, but I do intend to compile this data on this page in the future. Most importantly to describe what is known about the best methods to buffer data between CPU and GPU, but also to cover the relative performance of things like transform feedback and geometry shader support. What I don't have time to do, I'll keep links to other references on. Apple's OpenGL Optimization Page

My chicken scratches follow, don't expect any of it to make sense yet...

DYNAMIC UPDATE OF TEXTURES : DX10 - Have a pool of textures created with D3D10_USAGE_STAGING. Map() with D3D10_MAP_WRITE and D3D10_MAP_FLAG_DO_NOT_WAIT so that Map doesn't block. Write the texture to the mapped memory and Unmap(). Then CopyResource() or CopySubresourceRegion() to do an asynchronous GPU to GPU copy.
Any subresource that is bound to the pipeline must be unmapped before any render operation can execute, thus the reason for the staging texture.

DYNAMIC READBACK OF TEXTURES : DX10 - Only resources created with the D3D10_USAGE_STAGING flag can be read back from the GPU. However D3D10_USAGE_STAGING resources cannot be written to by the GPU. So CopyResource() or CopySubresourceRegion() must be used to do an asynchronous copy to the D3D10_USAGE_STAGING resource. Wait at least 2 frames before using, ie mapping, the resource copied to.

DYNAMIC UPDATE OF CONSTANT BUFFERS : DX10 - Use Map() with D3D10_MAP_WRITE_DISCARD and then UpdateSubresource().

DYNAMIC UPDATE OF BUFFERS : DX10 - Update using Map() with D3D10_MAP_WRITE_NO_OVERWRITE (BTW, this is only valid for vertex/index buffers, not textures, too bad) and use it like a ring-buffer: only write to empty portions of the buffer. DX10 allows you to map the resource directly even though the GPU is using it. Wouldn't want to use UpdateSubresource() here because of the frame delay.

DX10 to GL Portability

This is a work in progress...

ID3D10Device::CreateInputLayout()
ID3D10Device::CreateRasterizerState()
ID3D10Device::CreateDepthStencilState()
ID3D10Device::CreateBlendState()
ID3D10Device::OMSetBlendState()
ID3D10Device::CreateSamplerState()
D3D10CompileShader()
ID3D10Device::CreateBuffer() with D3D10_BIND_CONSTANT_BUFFER

Awesome Papers from Siggraph 2008

March of the Froblins: Simulation and Rendering Massive Crowds of Intelligent and Detailed Creatures on GPU - Co-authored by Jeremy Shopf, Level of Detail Blog. Impressive: GPU side scene traversal, occlusion, LOD, path finding, and more. Anyone have a link to a video of this?

StarCraft II: Effects & Techniques - Very informative section on SSAO and DOF, especially in outlining how to use deferred depth and normal in the calculations. Also good to see someone else using the "compute at full res but sample at lower res" texture cache performance trick.
©2009-2007 Timothy Farrar