071207 / G84


Just some quick summary notes on the NVidia G84 for my own reference.

Tested with an 8600 GTS clocked to 730MHz core / 1460MHz SPU / 2.26 DDR.

32 SPUs at 1460MHz
16 TEX units at 730MHz
8 ROP units at 730MHz

Texture Performance

Nearest filtering has no advantage, bilinear is free.
Trilinear is roughly double the cost of bilinear.
64bit texels should run at 50% of the 32bit texel rate.
128bit texels should run at 25% of the 32bit texel rate.

All formats were tested with a sequential access sum of 3x3 (9 total) texels, first using a 2x2 source texture and then, in a second batch of tests, a 2048x2048 source texture. Both tests output to a 2048x2048 FBO. Bilinear results,

Max possible bilinear rate = 16 TEX units at 730MHz = 11.68 Gtex/sec.
<=32bit texels - L8,L16F,L32F,LA8,LA16F,RGBA8 : ~7.6-8.0 Gtex/sec, 65-69% of max.
64bit texels - LA32F,RGBA16F : ~5.8 Gtex/sec, 99% of max.
128bit texels - RGBA32F : ~2.7 Gtex/sec, 93% of max.
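
The percentages above are simple arithmetic: the peak rate is TEX units times core clock, scaled down for wider texels. A minimal Python sketch of that calculation, assuming the 50% / 25% scaling rule from these notes holds exactly; the function names are mine, and the measured values are the approximate figures listed above.

```python
TEX_UNITS = 16
CORE_CLOCK_GHZ = 0.730  # core clock used for these tests


def peak_bilinear_gtex(texel_bits):
    """Theoretical bilinear rate in Gtex/sec for a given texel width.

    Assumption: texels up to 32 bits run at full rate, and wider
    texels scale as 32 / texel_bits (64bit = 50%, 128bit = 25%).
    """
    full_rate = TEX_UNITS * CORE_CLOCK_GHZ  # 16 * 0.730 = 11.68
    return full_rate * min(1.0, 32.0 / texel_bits)


def percent_of_peak(measured_gtex_per_sec, texel_bits):
    """Measured rate as a percentage of the theoretical peak."""
    return 100.0 * measured_gtex_per_sec / peak_bilinear_gtex(texel_bits)


def measured_gtex(width, height, taps, seconds):
    """Convert one timed full-target pass into Gtex/sec.

    taps is the number of texel fetches per output pixel (9 here).
    """
    return width * height * taps / seconds / 1e9


if __name__ == "__main__":
    results = [
        ("<=32bit (low)", 32, 7.6),
        ("<=32bit (high)", 32, 8.0),
        ("64bit", 64, 5.8),
        ("128bit", 128, 2.7),
    ]
    for label, bits, measured in results:
        pct = percent_of_peak(measured, bits)
        print(f"{label}: {measured} Gtex/sec, {pct:.1f}% of peak")
```

Because the measured rates are rounded, the computed percentages can differ from the listed ones by about a point.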

Random notes,

Strange reduction in performance for <=32bit texel types.
For bilinear filtering, 64bit and 128bit texel performance is ideal.
Trilinear textures with forced bilinear LOD levels are at bilinear speeds.
Trilinear performance may be bandwidth limited with the tested 9 texel sequential access.
Typical trilinear performance is 30-40% below the expected performance of the 2x2 texture.

ROP Performance

Assuming that memory read/write rates for 64bit and 128bit pixels are 1/2 and 1/4 of the 32bit pixel rate, and that blend costs for 16bit and 32bit are 2x and 4x the 8bit cost. Guessing that L32F is going to be blend limited but not write limited, that RGBA32F is going to be both blend and memory limited (guessing blend and memory latency are additive), and that RGBA16F will be a fast path with memory latency fully masked by the blend latency. Tested writing to a 2048x2048 FBO.

Max possible blend rate = 8 ROP units at 730MHz = 5.8 Gpix/sec.

Without blending - L8,L16F,LA8,LA16F,RGBA8 : ~5.1 Gpix/sec, 88% of max (5.8 Gpix/sec).
Without blending - L32F : ~3.3 Gpix/sec, 57% of max (5.8 Gpix/sec).
Without blending - RGBA16F : ~2.7 Gpix/sec, 93% of max (2.9 Gpix/sec).
Without blending - LA32F : ~2.1 Gpix/sec, 72% of max (2.9 Gpix/sec).
Without blending - RGBA32F : ~1.2 Gpix/sec, 82% of max (1.45 Gpix/sec).
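
The per-format maxima in the list above come from scaling the 8 ROP peak by pixel width. A small Python sketch of that model, assuming only the memory-rate rule stated earlier (64bit at 1/2, 128bit at 1/4 of the 32bit rate); the table and function names are illustrative, and the measured values are the approximate ones from the list.

```python
ROP_UNITS = 8
CORE_CLOCK_GHZ = 0.730

# Total bits per pixel for each tested format.
PIXEL_BITS = {
    "L8": 8, "L16F": 16, "LA8": 16, "LA16F": 32, "RGBA8": 32,
    "L32F": 32, "RGBA16F": 64, "LA32F": 64, "RGBA32F": 128,
}

# Approximate measured write rates without blending, in Gpix/sec.
MEASURED_NO_BLEND = {
    "RGBA8": 5.1, "L32F": 3.3, "RGBA16F": 2.7, "LA32F": 2.1, "RGBA32F": 1.2,
}


def peak_write_gpix(fmt):
    """Theoretical write rate in Gpix/sec under the assumed model:
    full rate up to 32 bits per pixel, then 32 / bits scaling."""
    full_rate = ROP_UNITS * CORE_CLOCK_GHZ  # 8 * 0.730 = 5.84
    return full_rate * min(1.0, 32.0 / PIXEL_BITS[fmt])


def write_efficiency(fmt):
    """Measured rate as a percentage of the model's peak."""
    return 100.0 * MEASURED_NO_BLEND[fmt] / peak_write_gpix(fmt)


if __name__ == "__main__":
    for fmt in MEASURED_NO_BLEND:
        print(f"{fmt}: peak {peak_write_gpix(fmt):.2f} Gpix/sec, "
              f"{write_efficiency(fmt):.0f}% achieved")
```

Under this model L32F has the same 5.84 Gpix/sec ceiling as RGBA8, which is why its measured rate stands out.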

With glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA),

L8, RGBA8 : ~2.9 Gpix/sec, 50% of max (5.8 Gpix/sec).
L16F : ~2.7 Gpix/sec, 93% of max (2.9 Gpix/sec).
L32F : ~1.4 Gpix/sec, 97% of max (1.45 Gpix/sec).
RGBA16F : ~2.7 Gpix/sec, 94% of max (2.9 Gpix/sec).
RGBA32F : ~0.364 Gpix/sec, 99% of max (0.365 Gpix/sec, see comments above).
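
One way to read the blended results is as ROP clocks spent per pixel: divide the 8 ROP peak by the measured rate. This is only a sketch of that conversion in Python, not a model of how the hardware actually schedules blends; the function name is mine and the inputs are the approximate measured rates listed above.

```python
ROP_UNITS = 8
CORE_CLOCK_GHZ = 0.730
PEAK_GPIX = ROP_UNITS * CORE_CLOCK_GHZ  # 5.84 Gpix/sec at 1 clock per pixel

# Approximate measured rates with alpha blending enabled, in Gpix/sec.
MEASURED_BLEND = {
    "L8": 2.9,
    "RGBA8": 2.9,
    "L16F": 2.7,
    "L32F": 1.4,
    "RGBA16F": 2.7,
    "RGBA32F": 0.364,
}


def implied_clocks_per_pixel(measured_gpix):
    """ROP clocks per blended pixel implied by a measured rate."""
    return PEAK_GPIX / measured_gpix


if __name__ == "__main__":
    for fmt, rate in MEASURED_BLEND.items():
        clocks = implied_clocks_per_pixel(rate)
        print(f"{fmt}: {rate} Gpix/sec is about {clocks:.2f} clocks per pixel")
```

Read this way, the 8bit and 16bit formats both land near 2 clocks per pixel, L32F near 4, and RGBA32F near 16 (the listed 0.365 Gpix/sec maximum equals the 5.84 Gpix/sec peak divided by 16).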

Random notes,

ROP blends must always use a 2 clock 16bit blend even with 8bit texels. Write without blend rates are odd, especially L32F.

Other Random Notes

Using this as reference.

Apparently Z-Cull is much more effective on the G84 compared to the G80.
The G84 can do something like 510 Mtri/sec max.
The G84 can do something like 68 Mpoints/sec max at 1pix and 48 Mpoints/sec max at 4pix.