081008 / Future NVidia Hardware

This post was inspired by Marco Salvi's Peek into the Future of Interactive Computer Graphics.

I'm guessing that, given how long hardware development pipelines are, future hardware is already well into design right now. Does CUDA's PTX ISA 1.2 (you have to download the newest CUDA toolkit to get the PDF) perhaps hint at NVidia's future plans? They have extended the documentation on memory state spaces:

Name, Addressable, Initializable, Access, Sharing
.reg, No, No, R/W, per-thread
.sreg, No, No, RO, per-CTA
.const, Yes, Yes, RO, per-grid
.global, Yes, Yes, R/W, Context
.local, Yes, No, R/W, per-thread
.param, Yes, No, RO, per-grid
.shared, Yes, No, R/W, per-CTA
.surf, via surface instructions, Yes via driver, R/W, Context
.tex, via texture instructions, Yes via driver, RO, Context

However,

The surface (.surf) state space is unimplemented in the current release. I really wasn't expecting .surf to be shareable across a context. My bet was that it was going to be microprocessor shareable only (somewhat like current OM/ROP write-combined caching). So if one reads between the lines, .surf is effectively a high-latency, coherent, readable and writable cache, probably with format conversion, and perhaps blending. Effectively a programmable ROP. This could be how NVidia plans to take on Larrabee's programmability, opening up efficiency for all sorts of problem solving that requires coherent scatter of small scalar values (say a z buffer, or binning algorithms). This type of thing is simply too bandwidth inefficient to be useful currently in CUDA. Unfortunately, since DX11 doesn't have programmable blending or anything resembling this functionality, my guess is that .surf doesn't see hardware support for a while, perhaps until NVidia sees whether it is needed to compete against Larrabee. However, when CUDA gets .surf, my GL/DX days are over.
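To make "coherent scatter of small scalar values" concrete, here is a minimal host-side C++ sketch of a binning pass. It is not CUDA or PTX and says nothing about how .surf would actually work: CPU worker threads stand in for GPU threads, an array of atomic counters stands in for the coherent surface, and the name bin_counts and its parameters are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: scatter each key into one of `num_bins` counters.
// Worker threads stand in for GPU threads; the atomic counters stand in
// for a coherent read/write surface (an assumption for illustration).
std::vector<int> bin_counts(const std::vector<unsigned>& keys,
                            std::size_t num_bins,
                            std::size_t num_workers) {
    assert(num_bins > 0 && num_workers > 0);

    std::vector<std::atomic<int>> bins(num_bins);
    for (auto& bin : bins) bin.store(0);  // explicit init, needed before C++20

    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < num_workers; ++w) {
        workers.emplace_back([&, w] {
            // Each worker takes a strided slice of the input.
            for (std::size_t i = w; i < keys.size(); i += num_workers) {
                // One small scattered read-modify-write per key.
                bins[keys[i] % num_bins].fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& t : workers) t.join();

    // Copy the counters out into a plain vector for the caller.
    std::vector<int> result(num_bins);
    for (std::size_t b = 0; b < num_bins; ++b) result[b] = bins[b].load();
    return result;
}
```

Every key costs one tiny read-modify-write at an unpredictable address, which is the access pattern described above as too bandwidth inefficient without some kind of coherent cache in front of memory.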

Indirect branch through a register is unimplemented. Indirect call through a register is unimplemented. In the current ptx release, parameters are passed through statically allocated ptx registers; i.e., there is no support for recursive calls. This means that one can only branch to an immediate (i.e., no virtual functions). I'm not sure if any of this functionality is required for DX11's dynamic linking; my guess is no, and we will still get shader (code) patching from the driver. So perhaps these will remain unimplemented for a while.
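As a rough illustration of living within those limits, the following C++ sketch shows two common workarounds: dispatching on a type tag with a switch instead of a virtual call, and walking a tree with an explicit fixed-size stack instead of recursion. This is host-side C++, not PTX, and every name here (Shape, area, Node, tree_sum, kMaxPending) is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical shape record: a tag plus parameters, instead of a vtable.
enum ShapeKind { kCircle, kRect };

struct Shape {
    ShapeKind kind;
    float a;  // radius for circles, width for rectangles
    float b;  // unused for circles, height for rectangles
};

// All targets are known at compile time, so this can be lowered to
// compares and direct branches rather than a call through a register.
float area(const Shape& s) {
    switch (s.kind) {
        case kCircle: return 3.14159265f * s.a * s.a;
        case kRect:   return s.a * s.b;
    }
    return 0.0f;  // only reached for an invalid tag
}

// Hypothetical tree node stored in a flat array; -1 means "no child".
struct Node { int value; int left; int right; };

// Sums a tree without recursion by keeping pending nodes on a local stack.
int tree_sum(const Node* nodes, int root) {
    const int kMaxPending = 64;  // assumed bound on pending nodes
    int stack[kMaxPending];
    int top = 0;
    int sum = 0;
    if (root >= 0) stack[top++] = root;
    while (top > 0) {
        const Node& n = nodes[stack[--top]];
        sum += n.value;
        if (n.left >= 0)  { assert(top < kMaxPending); stack[top++] = n.left; }
        if (n.right >= 0) { assert(top < kMaxPending); stack[top++] = n.right; }
    }
    return sum;
}
```

The cost is that the set of "virtual" targets and the maximum stack depth both have to be fixed when the code is compiled.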

Floating-point atomic operations are unimplemented. So no floating-point atomic add, min, or max. I don't see a reason these would be in the spec unless future hardware were going to support the functionality. One might think that atomic float min/max could be useful for emulating a z buffer (no, I'm not suggesting z buffer hardware is going away any time soon).
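Without a native instruction, a float min can be built from a compare-and-swap loop. The sketch below shows the idea in host-side C++ with std::atomic<float>; it is an illustration, not PTX, and the names atomic_min_float and resolve_pixel, along with the 1.0 far-plane clear value, are assumptions. The same loop shape should carry over to CUDA's integer atomicCAS on the float's bit pattern.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: lower `cell` to `depth` if `depth` is nearer.
// Returns the value seen in the cell before any update by this call.
float atomic_min_float(std::atomic<float>& cell, float depth) {
    float seen = cell.load(std::memory_order_relaxed);
    // On failure compare_exchange_weak refreshes `seen`, so the loop
    // re-tests against whatever another thread just wrote.
    while (depth < seen &&
           !cell.compare_exchange_weak(seen, depth, std::memory_order_relaxed)) {
    }
    return seen;
}

// Hypothetical driver: several workers submit depths for a single pixel.
float resolve_pixel(const std::vector<float>& depths, std::size_t num_workers) {
    assert(num_workers > 0);
    std::atomic<float> pixel(1.0f);  // assumed far-plane clear value

    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < num_workers; ++w) {
        workers.emplace_back([&, w] {
            // Each worker submits a strided slice of the depth values.
            for (std::size_t i = w; i < depths.size(); i += num_workers) {
                atomic_min_float(pixel, depths[i]);
            }
        });
    }
    for (auto& t : workers) t.join();
    return pixel.load();
}
```

Whatever order the workers run in, the pixel ends up holding the nearest depth, which is the z buffer behaviour mentioned above. Under heavy contention the retry loop spins, which is one reason a hardware float min would be attractive.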

Random Image

Traffic jam in Yellowstone... and speaking of vacation photos, Meshula.net has some really cool stitched shots through windows in Bodie, Ghost Town.