071108 / GPU Assembly II

NVidia has released the beta of Cg 2.0, which is great: Leopard support, profiles for NV_gpu_program4, etc.

I just spent the last few days really getting to know NV_gpu_program4, testing assembly shaders, looking over NVidia's CUDA PTX docs (BTW, you have to download the full install to get them), and trying to get as good an understanding as I can of what makes up the opcode level of the G8x GPU. I have learned a lot at the cost of effectively losing a week of productive work, and spending way too much time working on new ideas (hierarchical soft particles) and features, when what I really need is to have a feature freeze and just get the damn engine done!

Also I should say thanks for all the answers I have received on the various message boards!

NV_gpu_program4 and PTX

According to multiple sources, PTX very closely resembles the GPU opcodes on the G8x. From what I can gather, beyond the obvious (like gpu_program4 being vector-based assembly on a scalar processing unit), there are some important differences between gpu_program4 and PTX.

First, gpu_program4 opcodes have a large number of modifiers, like the ability to swizzle, saturate (both -1 to 1, and 0 to 1), negate, take the absolute value, use a multi-bit predicate, and set a multi-bit predicate based on the output of the opcode. PTX shows only saturate 0 to 1 and a 1-bit predicate (meaning execution of an instruction is skipped based on a 1-bit predicate register). So all the other gpu_program4 opcode modifiers have to be emulated by multiple instructions.

Also, PTX shows the need to run a "set if" compare opcode to write to a predicate register, but it does provide an opcode that conditionally moves one of two registers based on a third (based on its sign, I think). It seems as if the GLSL compiler generates NV_gpu_program4 code with a bunch of "set if" opcodes and single instructions skipped by branches, when a simple predicated instruction would do. However, since NV_gpu_program4 assembly still has to get compiled to the true GPU machine instruction set, this might simply be re-optimized in that step.

Assembly in Cg or GLSL?

The title sounds strange (and bogus), but given how simple the PTX instruction set is, it should be easy on the scalar G8x to produce ideal assembly-like code in Cg or GLSL simply by knowing a few coding patterns which get compiled to the underlying GPU opcodes. For example, setting a bool to invoke a "set if" opcode, and "if(bool) d=saturate(a*b+c);" to invoke a predicated multiply-add opcode with saturation.
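To make the pattern concrete, here is a minimal CPU-side sketch in C of what the two steps compute. It is only a model of the semantics, not shader code and not actual compiler output: the function names (saturate, set_if_less, predicated_mad_sat) are made up for illustration, and whether a given driver really emits one "set if" plus one predicated multiply-add for this pattern is an assumption that would need checking against the generated assembly.

```c
#include <assert.h>   /* assert() is used in the usage example below */
#include <stdbool.h>

/* Clamp to [0, 1]; stands in for the 0-to-1 saturate modifier. */
static inline float saturate(float x)
{
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}

/* Step 1, the "set if" opcode: a compare that writes a 1-bit predicate. */
static inline bool set_if_less(float x, float limit)
{
    return x < limit;
}

/* Step 2, the predicated multiply-add with saturation:
   d is overwritten only when the predicate p is set,
   otherwise the old value of d passes through untouched. */
static inline float predicated_mad_sat(bool p, float d, float a, float b, float c)
{
    if (p) d = saturate(a * b + c);
    return d;
}
```

For example, with a=2, b=2, c=1 the unsaturated result would be 5, so `predicated_mad_sat(set_if_less(0.0f, 1.0f), 9.0f, 2.0f, 2.0f, 1.0f)` gives 1.0, while a false predicate leaves d at 9.0.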

What about Pack/Unpack?

It seems as if pack/unpack is emulated via type conversion opcodes and multiple integer opcodes (shifts, masks, etc). I'm guessing 4 ops to pack two FP16s, and 10 ops to pack four bytes. So the advantage of using these (memory bandwidth reduction) comes at the cost of lots of shader cycles, which makes them less useful than I thought.
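As a rough illustration of where a count like 10 could come from, here is a CPU-side C model of packing four [0, 1] floats into one 32-bit word. This is a sketch under assumptions, not the hardware sequence: the names (to_ubyte, pack_4ubyte_model, unpack_ubyte_model) are hypothetical, x going into the low byte is assumed, and counting each float-to-byte conversion as a single op is just one way to arrive at 4 conversions + 3 shifts + 3 ORs.

```c
#include <assert.h>   /* assert() is used in the usage example below */
#include <stdint.h>

/* Convert a float to an 8-bit value: clamp to [0, 1], scale, round.
   Counted as one "type conversion" op per component in this model. */
static inline uint32_t to_ubyte(float x)
{
    if (x < 0.0f) x = 0.0f;
    if (x > 1.0f) x = 1.0f;
    return (uint32_t)(x * 255.0f + 0.5f);
}

/* Pack four components into one word.
   Assumed layout: x in bits 0-7, y in 8-15, z in 16-23, w in 24-31.
   Model op count: 4 conversions + 3 shifts + 3 ORs. */
static inline uint32_t pack_4ubyte_model(float x, float y, float z, float w)
{
    return to_ubyte(x)
         | (to_ubyte(y) << 8)
         | (to_ubyte(z) << 16)
         | (to_ubyte(w) << 24);
}

/* Unpack one channel (0 = x ... 3 = w): shift, mask, convert back to [0, 1]. */
static inline float unpack_ubyte_model(uint32_t packed, unsigned channel)
{
    return (float)((packed >> (8u * channel)) & 0xFFu) / 255.0f;
}
```

In this model `pack_4ubyte_model(0.0f, 1.0f, 0.0f, 1.0f)` yields 0xFF00FF00, and unpacking channel 1 of that word gives back 1.0.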

Also, in GLSL with the NVidia drivers you can use both Cg types and Cg standard library functions (if you don't add a #version in the GLSL code) like pack_2half(), pack_2ushort(), pack_4byte(), pack_4ubyte(), unpack_2half(), unpack_2ushort(), unpack_4byte(), unpack_4ubyte() to do the packing. Or use floatToRawIntBits() and intBitsToFloat() to roll your own pack and unpack in combination with integer shifts and masks.
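The roll-your-own route can be modeled on the CPU like this. It is a sketch only: float_to_raw_bits() and raw_bits_to_float() are C stand-ins I wrote for the shader-side floatToRawIntBits() and intBitsToFloat(), and the pack/unpack helper names and the low-half/high-half layout are assumptions for illustration. The point is that the float is never converted numerically, its 32 bits are just reinterpreted, so any integer payload built with shifts and masks survives the trip.

```c
#include <assert.h>   /* assert() is used in the usage example below */
#include <stdint.h>
#include <string.h>

/* Stand-in for floatToRawIntBits(): reinterpret the bits, no numeric conversion. */
static inline uint32_t float_to_raw_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Stand-in for intBitsToFloat(): the inverse reinterpretation. */
static inline float raw_bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Hand-rolled pack of two 16-bit values: mask, shift, OR.
   Assumed layout: lo in bits 0-15, hi in bits 16-31. */
static inline uint32_t pack_2ushort_bits(uint32_t lo, uint32_t hi)
{
    return (lo & 0xFFFFu) | ((hi & 0xFFFFu) << 16);
}

/* Matching unpack helpers: mask for the low half, shift for the high half. */
static inline uint32_t unpack_lo_ushort(uint32_t packed) { return packed & 0xFFFFu; }
static inline uint32_t unpack_hi_ushort(uint32_t packed) { return packed >> 16; }
```

For example, `pack_2ushort_bits(0x1234, 0xABCD)` gives 0xABCD1234, and pushing that word through raw_bits_to_float() and back through float_to_raw_bits() returns the same bits.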

Conclusions

I'm sticking with GLSL, but might convert to Cg later! Don't get me wrong though: NV_gpu_program4 output is still very, very important, just as an optimization tool to get an idea of register usage (excessive register usage limits threading on the G8x) and of what opcodes are generated for the code.