090412 / Moving Blog!

Atom blog is now going to continue here at farrarfocus.blogspot.com.




090407 / DXT Tip

When using a compressed format like DXT1 or DXT5, which has only 5 bits each for the base red and blue levels and 6 bits for the base green level, normalizing the texture luminosity before compression and scaling it back in the fragment shader can improve quality quite a bit. I believe this is already common practice.

If using a proper linear lighting model, the 360's piecewise sRGB conversion effectively removes about a bit of precision in the darks for DXT* compressed textures. So skipping the 360's sRGB conversion and doing a gamma 2.0 conversion manually (at the cost of an extra MUL op) can give around 0.5 bits of precision back to the darks. The other option is to eat the cost of a manual gamma 2.2 conversion for no loss in precision (compared to the PC).
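For reference, the decode options can be written out as below. This is only a sketch in Python: the piecewise function is the standard sRGB curve, not the 360's hardware approximation of it, and the function names are made up for this example. In practice the conversion lives in the shader.

```python
def srgb_to_linear(c):
    """Standard piecewise sRGB decode for c in [0, 1] (not the 360's approximation)."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


def gamma20_to_linear(c):
    """Gamma 2.0 decode: one multiply."""
    return c * c


def gamma22_to_linear(c):
    """Gamma 2.2 decode: needs a pow, so it costs more than a single MUL."""
    return c ** 2.2


# Compare the three curves at a few 8-bit code values, including the darks.
for code in (8, 32, 128, 255):
    c = code / 255.0
    print(code, srgb_to_linear(c), gamma20_to_linear(c), gamma22_to_linear(c))
```

The texture would of course have to be authored or converted for whichever curve the shader decodes with.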

And The Tip

Go a step further and normalize each color channel individually before compression. For textures which are mostly gray, this will only provide a fractional bit of precision improvement. For textures which are highly tinted towards one color (such as foliage), it can provide an extra bit or two of precision in the non-dominant color channels.
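A rough way to see the effect, as a Python sketch. It only models the 5:6:5 precision of the DXT base colors by quantizing every texel directly; it ignores the per-block interpolation a real compressor does. All names are illustrative.

```python
def normalize_channels(pixels):
    """Scale each channel so its maximum becomes 1.0.

    pixels: list of (r, g, b) floats in [0, 1].
    Returns (normalized_pixels, scales); the shader multiplies by scales to undo.
    """
    scales = []
    for ch in range(3):
        peak = max(p[ch] for p in pixels)
        scales.append(peak if peak > 0.0 else 1.0)
    normalized = [tuple(p[ch] / scales[ch] for ch in range(3)) for p in pixels]
    return normalized, tuple(scales)


def quantize(value, bits):
    """Round a [0, 1] value to the nearest level representable in `bits` bits."""
    levels = (1 << bits) - 1
    return int(value * levels + 0.5) / levels


def roundtrip_565(pixels, scales=(1.0, 1.0, 1.0)):
    """Quantize to 5:6:5 precision, then rescale as the fragment shader would."""
    bits = (5, 6, 5)
    return [tuple(quantize(p[ch], bits[ch]) * scales[ch] for ch in range(3))
            for p in pixels]


def max_error(reference, result, ch):
    """Largest absolute error in channel `ch`."""
    return max(abs(a[ch] - b[ch]) for a, b in zip(reference, result))


# Foliage-like texels: green dominates, red and blue stay dark.
texels = [(0.02, 0.5, 0.01), (0.05, 0.9, 0.03), (0.08, 0.7, 0.06)]
normalized, scales = normalize_channels(texels)
print("red error, direct:    ", max_error(texels, roundtrip_565(texels), 0))
print("red error, normalized:", max_error(texels, roundtrip_565(normalized, scales), 0))
```

The per-channel scales would be handed to the fragment shader as a constant alongside the texture.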




090320 / GDC 2009

Lots of stuff here, mostly my notes from GDC presentations... (wasn't at GDC this year!)

Larrabee Instructions

Prototype Instructions

1. 16-wide 32-bit or 8-wide 64-bit SIMD.
2. 32 512-bit vector registers.
3. 8 16-bit predicate registers.
4. Predicate register source operand per opcode.
5. Register = Register OP Swizzle/Broadcast/Convert(Register|Memory) opcode form.
6. Bit interleave 1:1 and 2:1 instructions!
7. Pack and store vector to unaligned memory.
8. Convert to float 11:11:10.
9. Convert to unorm 10:10:10:2.
10. Convert to sRGB 8-bit.
11. Vector gather/scatter from 16 addresses of [base + index*scale].
12. Vector prefetch to L1 for later gather/scatter.
13. Rotate and bitfield insertion or mask.
14. Unaligned loads.
15. Four element set vector multiply.
16. Scalar bitscan forward and reverse.
17. Evict/prefetch L1 or L2 cacheline.
18. Scalar bit count.
19. Four element set scalar quad mask.
20. And more...

Also screen shots of some GDC slides.

1. Vector store (in scalar pipe) co-issue with vector ALU op.
2. 4-9 cycle latency for most vector ops.
3. Gather/scatter fetches/stores to/from only one L1 cache line per cycle.
4. Cache miss causes only offending hyperthread to sleep.
5. Branch misprediction only flushes instructions from the offending thread (1-2 cycles avg).

GDC

Some highlights (click on the GDC 2009 bar for the presentations):

The PS3's SPUs in the Real World: A Killzone 2 Case Study - 2500 static light probes per level stored as 9x3 SH in a KD-tree. IBL on objects from the 4 nearest probes with inverse distance weighting, rotated into view space and written out as an 8x8 pixel sphere environment map. Around 3000 particles updated per frame, 250 particle systems, 150 systems drawn, with 200 Havok collision ray casts. RSX generates a 1/4 resolution image in XDR. SPUs do bloom, DoF, and motion blur. Motion blur uses 1/16 resolution 2D motion vectors (8-bit xy). Motion vectors are blurred for dilation, then the image is blurred along the motion vectors with an 8-tap filter. Depth of field uses 36 jittered disc point samples. On the order of 50 animation jobs and 500 animations sampled. Waypoint cover maps with a depth cubemap per waypoint. Looks like the depth cubemaps have 1/2 resolution on the top and bottom faces.
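The probe blending step can be sketched as follows. This is a guess at the general shape only, in Python: it uses a brute-force nearest search instead of a KD-tree, skips the rotation into view space and the sphere map generation, and all names are hypothetical.

```python
import math


def nearest_probes(position, probes, count=4):
    """Pick the `count` closest probes (brute force; the talk uses a KD-tree)."""
    return sorted(probes, key=lambda p: math.dist(position, p[0]))[:count]


def blend_probes(position, probes, eps=1e-6):
    """Blend SH coefficients with normalized inverse distance weights.

    probes: list of (probe_position, coeffs), where coeffs is a flat list of
    floats (27 values for 9x3 SH).
    """
    weights = [1.0 / max(math.dist(position, pos), eps) for pos, _ in probes]
    total = sum(weights)
    blended = [0.0] * len(probes[0][1])
    for weight, (_, coeffs) in zip(weights, probes):
        for i, value in enumerate(coeffs):
            blended[i] += (weight / total) * value
    return blended
```

Usage would be along the lines of `blend_probes(obj_pos, nearest_probes(obj_pos, level_probes))`, once per object per frame.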

Guerrilla Tactics: Killzone's Art Tools and Techniques - KZ2 built with 1500 level building blocks (BBs). Building blocks placed, scaled, and rotated in Maya to build levels. 90% of level geometry is instanced from the repository. 10% of level geometry is hand-modelled BSP geometry. About 1000 BBs per lightmap. Both low and full-res drawing of particles. Low-res particles only support alpha blend and additive. Low and full-res particles would not sort correctly. The future is "dirtmaps" for BBs, to easily reskin a BB for different environments.

Some highlights from the GDC Tutorials Handouts,

The A-Z of DX10 Performance - Awesome presentation, a must read. Covers things like, good to use no more than 5 bound constant buffers in a shader stage, and ATI can point sample FP16 RGBA textures faster than filtered.

GSC Game World's S.T.A.L.K.E.R : Clear Sky - Seems like a mix of deferred rendering via drawing a g-buffer and accumulating lighting (as in light pre-pass rendering). Using split hi/lo 8bit output of HDR so that bloom pass can just use the hi RT. Interesting method for dynamic rain using a shadowmap to test for surfaces which get rain, followed by g-buffer adjustment of albedo, spec, and normal for rain surfaces.

Shader Model 5 and Compute Shader - Provides SV_Coverage bitmask in PS shader. Gather() now able to sample from different channels! Bit ops with firstbithigh(), firstbitlow(), and reversebits(). Raw resource views! Atomic operations. Buffer append/consume, provides global pointer probably accessed via atomic operations internally. DX11 CS5 provides indirect dispatch of work group dimensions for conditional execution, and provides 3D dispatch. DX11 CS4.x (for DX10.x hardware) only supports 2D dispatch, writing to own offset in TLS, no atomic operations, and only write to Buffer object type. Which seems odd given CUDA for DX10 GPUs supports more than this. Is DX11 only going to support the common minimum between DX10 ATI and NV GPUs? If so this is sad (very thankful for GL3.x + CL).

Insomniac Physics - Shows transition from R1 to R2 and current in using parallel SPU programming to solve for game physics.

Insomniac Prelighting - Using RGB 8bit for view space normals. Drawing depth and normal with 2xMSAA, but doing the rest of the lighting without MSAA. So is this full 2x resolution then, or just final output resolution from the "specular is 2x super sampled" comment? Sun shadowing for dynamic objects is classic object based shadows with screen space accumulation. So gather objects, group close objects, then for each group, render shadowmap for group, stencil group bounding volume, then draw and sample from shadowmap. Seems like they are doing some kind of ambient adjustment via darkening for backface triangles, is this because dynamic objects don't self sun shadow? Light space shadow maps are pre-rendered before drawing lights in the light pass. They attempt to resolve the multiple shadowing problem (ie combining baked lightmaps with dynamic shadows) by checking how dark the lightmap is (against a global cutoff) and trimming out dynamic shadows under the cutoff. Perhaps they have a blend region to feather the cutoff (will have to go back and play R2 to see!). In R2 they resorted to extra forward rendering passes for materials which didn't fit in the light pre-pass framework. Presentation also covers important things like doing rounding when recovering depth from an aliased RGBA8 texture.
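On the depth rounding point, a small Python sketch of the idea. The channel layout here (24-bit depth split high to low across three 8-bit channels) is an assumption for illustration, not taken from the presentation.

```python
def pack_depth24(depth):
    """Split a [0, 1) depth into three 8-bit channel values, high byte first."""
    fixed = min(int(depth * (1 << 24)), (1 << 24) - 1)
    return ((fixed >> 16) & 0xFF, (fixed >> 8) & 0xFF, fixed & 0xFF)


def sample_unorm8(byte_value):
    """What a shader sees when sampling an 8-bit unorm channel."""
    return byte_value / 255.0


def unpack_depth24(r, g, b):
    """Recover depth from sampled channels.

    Each channel is rounded (add 0.5, then truncate) back to its integer
    value before recombining, so a tiny float error in one channel cannot
    knock it down to the previous integer.
    """
    ri = int(r * 255.0 + 0.5)
    gi = int(g * 255.0 + 0.5)
    bi = int(b * 255.0 + 0.5)
    return ((ri << 16) | (gi << 8) | bi) / float(1 << 24)
```

A single misread in the high byte without the rounding would be a depth error of 1/256 of the full range, which is why this matters.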

Insomniac SPU Wrangling - Rather important presentation. For example, covers lock-line waiting for SPU job control, avoid bus transfer of a spin-lock check by waiting for PPU or GPU to write to reserved address (block on rdch instead).

CryEngine 3

Looks like cascade shadow maps only get one of the maps redrawn per frame. Should be quite a draw call saver, at the expense of sun shadow update frame rate being slower than actual frame rate (which can be seen in the video). Doesn't look like they are doing aggressive lower resolution particle blending, but darkening particles seem drawn after lightening particles, which creates some strange artifacts in the video.

Project Offset

OpenGL 3.1 and GLSL 1.40

Specs are here.

GL Updates - GL 3.1 cuts out deprecated features. Support for signed normalized formats. Full 16 texture unit support in the vertex shader. The following extensions have been added to the core, GL_ARB_draw_instanced, GL_EXT_copy_buffer, NV_primitive_restart, GL_ARB_texture_buffer_object, GL_ARB_texture_rectangle, GL_ARB_uniform_buffer_object.

GLSL Updates - Added rectangular texture support in the form of *sampler2DRect and *sampler2DRectShadow. Added *samplerBuffer support for buffer textures. The above samplers with an i or u prefix support signed and unsigned integer textures. Uniform blocks support has been added, layout-qualifier uniform block-name { member-list } ;. The member-list is a bunch of uniforms. The layout-qualifier can be used alone to set defaults, for example I'll be using, layout(row_major) uniform;, instead of the default of column major matrix storage. See spec for complete details!

Volume Rendering Optimizations

Kyle Hayward has an interesting blog post on volume rendering optimizations. Highlights include keeping structure data in a swizzled ordering in memory (instead of linear) for better cache performance, and using an image space pass to compute normals from depth (instead of storing them).
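One common swizzled layout is Morton (Z-order) indexing, sketched below in Python. Whether the blog post uses exactly this ordering is an assumption; the point is only that neighbors in x, y, and z end up near each other in memory.

```python
def morton3(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one Morton index."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def linear3(x, y, z, size):
    """Plain linear index for comparison: a step in z jumps size*size elements."""
    return x + size * (y + size * z)


# In a 256^3 volume a step in z moves 65536 elements linearly,
# but only 4 elements in Morton order when starting from the origin.
print(linear3(0, 0, 1, 256), morton3(0, 0, 1))
```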

Random Interesting Papers

Implicit Visibility and Anti-Radiance

Octahedron Environment Maps

Data-Parallel Hierarchical Link Creation for Radiosity - They find CUDA atomics on GT2xx to be only 15% slower than non-atomic operations.

Other iPhone Tricks

Switched to non-perspective correct texturing for this 3D Atom game by modifying my perspective projection matrix. Sounds crazy but provides some very important advantages.

1. Framebuffer feedback works better. Effects are now consistent across field of view changes. I'm not actually texturing triangles with a texture, but rather texturing with an image space copy of part of the screen. Color interpolation is no longer correct, but that isn't a problem either in my case.

2. It should enable me (with other changes) to reduce my per vertex size to 14 bytes (short xyz[3], byte rgba[4], short uv[2]). On the iPhone, vertex input is always copied (even with static VBOs) and is often a bottleneck. Instead of a 4 byte float per (x,y,z), I can now do a short and still have huge view distances. The shorts that I am sending are (x,y) in projected screen space, and projected z (I'm doing the perspective divide in software, and all my verts already had software transforms into view space).
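The 14 byte layout is easy to check with Python's struct module. This is a sketch of the packing only; the field order follows the text above, while the little-endian choice and the names are assumptions.

```python
import struct

# short xyz[3], byte rgba[4], short uv[2]; "<" means little-endian, no padding.
VERTEX_FORMAT = "<3h4B2h"
# Same layout but with a 4 byte float per x, y, z, for comparison.
FLOAT_POSITION_FORMAT = "<3f4B2h"


def pack_vertex(xyz, rgba, uv):
    """Pack one vertex into its 14 byte wire format."""
    return struct.pack(VERTEX_FORMAT, *xyz, *rgba, *uv)


def unpack_vertex(data):
    """Inverse of pack_vertex, returning (xyz, rgba, uv) tuples."""
    fields = struct.unpack(VERTEX_FORMAT, data)
    return fields[0:3], fields[3:7], fields[7:9]


print(struct.calcsize(VERTEX_FORMAT), struct.calcsize(FLOAT_POSITION_FORMAT))
```

With float positions the same vertex would be 20 bytes, so the short version cuts the copied vertex data by 30%.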




090318 / Reattachable Code

BTW, ShaderX7 is out.

Reattachable Code

For the new iPhone version of Atom I decided to go with reattachable code for development. The entire codebase can be recompiled at runtime and reattached to the running executable. This enables all sorts of quick development, such as testing and profiling changes on a live running engine.

For development the code is compiled as a library, a small loader program makes a copy of the library, and loads the library copy to run the program. The program allocates all data at startup, and uses one single pointer to reference any data. I use nothing global except read only data. While the program is running the original library can be recompiled. Then in one key press, the program returns to the loader and returns that single data pointer. The loader makes a copy of the new library, loads the copy, and passes in the data pointer to the new code. Then the engine continues where it left off.
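The loader pattern can be shown with a Python analog. This is not the actual implementation (which is a compiled library plus a small native loader); it is only a sketch of the same flow using module files, and the `run(state)` entry point and all names are hypothetical.

```python
import importlib.util
import os
import shutil
import tempfile


def load_copy(source_path, generation):
    """Copy the code file and load the copy, so the original can be rebuilt freely."""
    copy_dir = tempfile.mkdtemp(prefix="reattach_")
    name = "engine_%d" % generation
    copy_path = os.path.join(copy_dir, name + ".py")
    shutil.copyfile(source_path, copy_path)
    spec = importlib.util.spec_from_file_location(name, copy_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def run_loader(source_path, state, max_generations=100):
    """Load a copy, run it, and reattach new code to the same state on request.

    The loaded code keeps no globals of its own; everything lives in `state`,
    which plays the role of the single data pointer. engine.run(state) returns
    True to ask for a reattach (the "one key press") and False to quit.
    """
    for generation in range(max_generations):
        engine = load_copy(source_path, generation)
        if not engine.run(state):
            break
    return state
```

The important property is the same as in the native version: the code is disposable, and only the one state object survives a reattach.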

For release the single pointer references a static global, so related compiler optimizations are enabled.

GPGPU Course Notes

Jawed, a Beyond3D regular, posted a link to some rather good GPGPU course notes from ECE 498AL at UIUC. Too bad GPGPU didn't exist when I was there...

QuickMath

For those who don't have a copy of Mathematica, the www.quickmath.com site can once in a while be a great time saver.

GDC09

Second year at HH and like last year, I won't be with the other HHers going to GDC. Also probably won't get to all the presentations until the week after GDC, but greatly looking forward to it all...

OpenCL

More older OpenCL info.

More Atom iPhone Updates

Working on the hidden surface removal algorithm, attempting to solve the unsolvable problem. Only get 8K-16K triangles, want lots of graphics, oh and cannot use blending. No problem. Doing extreme LOD, and working on the "Grove of Trees" problem. Best way to solve the problem of too many things to draw is not to draw too many things, even if they are all visible. Problem ends up being more of an art than a science: how to ensure all LOD transitions are smooth and clean.

Fade in is now working quite well. LOD fade in of higher detail looks like the object is "growing". So fade in nodes start out with zero size, and increase to full size. Most of the time the new nodes start hidden inside the parent and the effect looks great. This is about the best which can be done without blending.

Fade out is a complex problem without blending. I have in for testing frame-dithered fade out (faking transparency). Just wanted to see what it would look like, didn't seem as bad as I thought it would be. Actually looks good in cases of a good amount of framebuffer feedback. Current implementation shrinks down size of fade out nodes, and uses a little frame-dithered transparency. Going to try fade out by shrinking node into most dominant child later this evening. Not sure what I am going to go with here.
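Both transitions fit in a few lines. A Python sketch, where the linear ramps and the particular dither rule are assumptions (the posts above do not say which pattern is used):

```python
def fade_in_scale(t):
    """Node size factor while fading in: grows from zero to full size, t in [0, 1]."""
    return min(1.0, max(0.0, t))


def fade_out_scale(t):
    """Node size factor while fading out: shrinks from full size to zero."""
    return 1.0 - min(1.0, max(0.0, t))


def draw_this_frame(frame, opacity):
    """Frame-dithered fake transparency: draw on a fraction `opacity` of frames.

    The drawn frames are spread evenly in time, since a frame is drawn exactly
    when the running total frame * opacity crosses an integer.
    """
    return int((frame + 1) * opacity) != int(frame * opacity)


# A node at 25% opacity is drawn on every fourth frame.
print([frame for frame in range(16) if draw_this_frame(frame, 0.25)])
```

With framebuffer feedback the skipped frames get partly filled in by the previous image, which would explain why it looks better than expected.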

I'm thinking about visuals now that I have a much better idea of what is possible on the platform. I'm thinking of trying my layered brush idea. Redid my performance estimates for the GPU side. Seems like everything should work out based on previous testing on the actual device. Even thinking I could combine my feedback with a 2nd texture. Not sure if this is a good idea however. I hear two conflicting things about the iPhone: that it has a unified CPU/GPU memory and that it has a dedicated 16MB of VRAM. Wouldn't want to starve the CPU bandwidth if it is a unified memory device. CPU side is a big unknown performance wise. I'm still hoping that I don't have to do lots of custom VFP code.

Hardest part of development is keeping out all the distracting ideas. Last week I took a weekend away just to work on the project, for some reason I don't get work done when my wife is around. I also need to stop reminding myself that a PS3 test station is only $1200, and that I could just massively scale up (>100x more triangles) this iPhone Atom engine for the PS3 (turns out my other blending Atom engine wouldn't scale to the PS3 without fine grain early stencil, a DX10 level hardware feature). At least I have a backup plan if all doesn't go as planned. Building for strangled low Watt hardware has been a great experience, for once I'm not building stuff for not yet mainstream hardware. Might even break what seems to be a theme in my life, that I only get stuff done at work. In fact I might be the last person at work here now this Thursday waiting on a compile to do a few final tests before a check-in of a bug fix which needed to be in by tomorrow.

OpenGL3 Idea I'm Saving for Later

Been wanting to build a geometry streamer in combination with a version of my Atom HSR/LOD code. The idea would be to stream in and out geometry just like textures. Vertex buffer would be chunked up into fixed size pages, geometry paged in and out at run-time. Index buffer would also be adjusted dynamically such that I could have one draw per material, grouping all geometry and all objects for that draw into this dynamic index buffer. Pre-Z pass would have similar large sets of draw calls with multiple objects grouped into big calls. Grouping would be linked to hierarchical level of detail and visibility system. Everything would be wrapped into a nice virtual texture system, and use vertex texture fetch to deal with skinning many objects in one call. Would have been great for the DX10 generation which never really took off.

Anyway with OpenGL3, building a dynamic vertex buffer this way is a possibility. One can glMapBufferRange() with GL_MAP_UNSYNCHRONIZED_BIT to enable writes to a vertex buffer when it is being used by the driver. The key is knowing when the driver has finished with previous draw calls. My idea was to place a dummy occlusion query test at the end of the frame so that I could asynchronously query for completion to know when that frame was completed and thus when it is safe to do an unsynchronized map buffer range for a dynamic update.




090311 / Atom Triangle Soup

Atom iPhone Update

Blend performance on the iPhone wasn't up to the level I needed. So I'm doing something different, and 3000 lines into a new engine (from "scratch") started this weekend using older Atom code as reference. The PowerVR MBX has one awesome feature by design: for a single opaque draw call, it can handle a huge amount of Z overdraw. In fact the first generation iPod Touch can do 29 fps with 64x overdraw with a single draw call of 64 fullscreen triangles of different z (note 64x Z overdraw is not obtainable in practice, using lots of triangles will limit this). In any case I've gone back to rendering triangles.

Development in-engine triangle soup test shots below; don't expect anything to look like a real game...

I'm sticking to the same framebuffer feedback as I had in previous Atom engines. For the iPhone engine, triangle output is color + color.alpha * texture. Texture is the previous rendered frame only. I'm rendering to a 256x256 sized render texture (for bandwidth, fill, MBX requirement for pow2 texture size), then copy to screen using a scanline pattern to help hide upsample artifacts.
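Per pixel, the triangle output described above works out to the following. A Python sketch; the clamp to [0, 1] is an assumption based on how fixed function texture stages saturate their results.

```python
def feedback_pixel(color, prev_texel):
    """Triangle output = color + color.alpha * texture.

    color: (r, g, b, a) interpolated vertex color.
    prev_texel: (r, g, b) sampled from the previous rendered frame.
    All values are in [0, 1]; the result is clamped to that range.
    """
    r, g, b, a = color
    return tuple(min(1.0, c + a * t) for c, t in zip((r, g, b), prev_texel))


# Alpha acts as the feedback amount: 0 gives plain color, 1 adds the full old frame.
print(feedback_pixel((0.1, 0.2, 0.3, 0.5), (0.4, 0.4, 0.4)))
```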

Texture recirculation coordinates are delayed by one frame so they can correspond to previous frame triangle positions. Effects are generated by adjusting texture coordinates and feedback amount.

Geometry shown here is just distorted octahedrons, in a distorted oct tree traversal, with random colors, with fractal framebuffer feedback cranked up. Feedback with scaling can have this "copy multiple reduction" fractal effect if contractive mappings are used.

The geometry for now has a static tree, because I haven't yet finished the dynamic traversal and hidden surface removal code for the new engine.

Using feedback to simulate the feel of refractive or reflective transparency. Guessing this is similar to how the ice was rendered in Super Mario Galaxy.

Per texture coordinate (not doing vertex) jitter can create a heat wave effect.

Or blur and a strange fluid visual echo.

Or without jitter for just the visual echo.

Crank up per texture coordinate jitter and the result is some wild stuff which looks in motion (and not on the static screen shot) like an old school fluid flame effect.

All these material effects can be mixed and matched on any geometry. Lots of stuff to try out still ... like using something other than random color. Next step is to get the HSR and view traversal working. After that adding in the hierarchical physics code. Then the scary task of optimizing the CPU work to ensure it runs on the iPhone...

Network Notes

Xtremelabs has posted some interesting average bandwidth and latency numbers for various wireless networks (useful for iPhone dev). Latency for EDGE is next to useless for anything interactive on multiplayer. IMO also looks like P2P would be required (to cut communication latency in half) for good interaction...

Network : Upload Kbits/s, Download Kbits/s, Latency ms
3G : 955.6, 152.6, 484.2
EDGE : 218.4, 37.3, 907.3
WIFI PUB : 2502.0, 773.9, 205.0
WIFI PRI : 2905.3, 738.8, 184.4

Voxels Continued

Efficient High Quality Rendering of Point Sampled Geometry
Spacerat's Voxel Blog with CUDA demo.

Other iPhone Notes

According to this,

128MB
101MB for OS
11MB for video
15MB for app

Unity3D Wiki has some good numbers on textured overdraw performance (using 512x512 texture).

10.8 ms for 32bit texture full screen quad
20.4 ms for dual 32bit texture
8.1 ms for 4bit texture
12.2 ms for dual 4bit texture
9.8 ms for dual 2bit texture
7 ms for color only

Other MBX Notes

MBX Hardware Details - not the MBX Lite, but very very useful indeed...




090305 / Voxels

Just some more voxel links...

Atomontage - Voxel rendering engine. Looks like it keeps a persistent CPU side voxel structure, and then uses GPU to render cubes (at least in older engine). Newer update, of which no screenshots have been posted yet, provides smooth voxel rendering.

The Amusement Machine - Another voxel rendering engine. Scary "coder art" website, but interesting technology. Looks like they are doing an image space hierarchical view traversal (lower resolution coarse traversal, then refine search for full resolution).

3D-Coat - Voxel sculpting.




090219 / R600 Driver

How to render a freaking triangle. - Awesome presentation by Matthias Hopf about the ongoing creation of open source drivers for the R600/700 hardware. Includes lots of detailed info on the hardware, such as vertex shaders fetching vertex inputs instead of being fed inputs, and details on ALU unit pitfalls.




090218 / ARM VFP

Got my iPod Touch, but apparently have a bunch of stuff to do Apple side before I can run any code on it.

ARM11 Vector Floating Point (VFP)

Finally got around to reading the ARM11 manuals. A lot has changed since the early ARM RISCOS days. The VFP is the ARM11's floating point coprocessor. Upon reading the words "Vector Floating Point" I had hopes for some kind of SIMD, but with the ARM11, VFP can only retire one single precision ALU operation per cycle (non-vector throughput, and a high result latency). The "Vector" term refers to the VFP unit's ability to execute the same instruction (ALU or even load/store) on 1 to 8 scalar values in series with a single instruction. Effectively a form of instruction compression. Unfortunately vector length and stride (between registers) are set by modifying the FPSCR control register (slow). However, ARM11 does have DSP extensions which provide SIMD within an integer register for 8-bit and 16-bit integer values.

Best case performance will be both ARM CPU and VFP coprocessor running in parallel. With VFP doing parallel load/store and ALU work. The "vector" instructions enable an instruction issue rate which is fast enough to make this happen, while also reducing instruction fetch bandwidth. From what I can tell, only one instruction (CPU or VFP) can be issued per cycle, so "vector" instructions are required to keep the pipelines busy. Unfortunately with this kind of ISA, assembly is probably necessary to extract great performance.

With VFP I really see only one good choice: 8 scalar registers and 6 4-wide "vector" registers. The 8 scalar regs are fixed by the ISA. Using larger than 4 wide vectors requires scorecard locking even in run-fast-mode. With smaller than 4 wide vectors, I think instruction issue will bottleneck. With the limited number of effective vector registers, L1 should end up serving as an "extended register file", with load/store in parallel with ALU. Load/store works in 64-bits per clock, so in theory this should be enough to keep the ALU going even when mostly working out of L1 and not getting a huge amount of register reuse... too bad all this vector assembly magic is at best going to get 1 ALU op per clock cycle.




090212 / iPhone Atom

Updates on the iPhone Atom Port

Once in a while an idea pops into your head which you just cannot ignore. At 256x128 I have enough fill rate in theory to port the nice looking older Atom engine to the iPhone. Nice looking as in this engine,

Yeah my old fragment shaders just sample from 2 textures, and should be easy to port to the fixed function GLES 1.1 texture stages. Last night I worked on the initial modifications, doing a 256x128 framebuffer with an upsample pass to draw to a 480x320 target overlaid with a scan line pattern to hide the upsample. Not bad, will show screen shots later. Had to do a lot of work to ensure I didn't overload either the polygon limits (I'm sticking to about 4K quads max a frame) or the overdraw blending limits (at 256x128 I'm attempting to average under 16x) which I have predicted for the iPhone for 30Hz (hoping for 24Hz). Had to remove the motion blur (too fill intensive), had to limit the drawing of large particles (too fill intensive), and had to LOD more aggressively (a little pop). Building in a dual screen view for development which shows simulated iPhone output along with an overdraw visualization and histogram reduction for an actual real-time fill rate number. This way I can tune the content to ensure a good FPS after I get initial tests finished on a real device.

I hope all works out on the actual device. In theory I've got a few things going for me here: no depth buffer (save on bandwidth), what should be awesome texture cache performance (50% samples from very coherent 256x128 framebuffer texture feedback, 50% samples from small texture atlas regions), and one draw call per frame. Assuming that the PowerVR MBX will bucket all overlapping draws to a 32x32 tile at a time, that is 32 tiles total, where each tile gets 50% of samples from about a 16KB region of framebuffer feedback (fractal framebuffer does a 2x reduction so 64x64 source region for tile).
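The tile numbers above check out as follows. A Python sketch of the arithmetic only; the 4 bytes per texel is an assumption (it is what makes a 64x64 region come to 16KB), and the 32x32 tile size is the guess stated in the text.

```python
def tile_feedback_budget(fb_w=256, fb_h=128, tile=32, reduction=2, bytes_per_texel=4):
    """Return (tile count, source region side, source region bytes) per tile."""
    tiles = (fb_w // tile) * (fb_h // tile)
    # A 2x reduction means each tile reads a region twice as wide and tall.
    source_side = tile * reduction
    source_bytes = source_side * source_side * bytes_per_texel
    return tiles, source_side, source_bytes


print(tile_feedback_budget())
```

That gives 32 tiles, each reading its feedback samples from a 64x64 region of 16384 bytes.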

Still working on recoding vectorized SSE code back into basic C to get something to test on the device (yes I will eventually re-vectorize for the iPhone, assuming the port isn't GPU bound!). More later, now my Flower download is 94% complete...




090208 / iPhone

Rough calculations of the iPhone's peak ability break down as follows, as far as I can tell (actuals will be a ways under these),

1. 480x320 screen.
2. 412 MHz (for older model).
3. or about 89 clock cycles per screen pixel at 30Hz.
4. 450-590 Ktri/sec (range of texture+color to smooth shaded).
5. or about 15 Ktri/frame at 30Hz.
6. 24 Mtex/sec (fill).
7. or about 5x overdraw at 30Hz.
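The per-frame numbers in the list follow directly from the per-second rates. A Python sketch of that arithmetic, with parameter names invented for this example:

```python
def iphone_budget(width=480, height=320, hz=30, cpu_hz=412e6,
                  tri_per_sec=(450e3, 590e3), texels_per_sec=24e6):
    """Turn peak per-second rates into per-pixel and per-frame budgets."""
    pixels_per_sec = width * height * hz
    return {
        "cycles_per_pixel": cpu_hz / pixels_per_sec,
        "tri_per_frame": tuple(rate / hz for rate in tri_per_sec),
        "overdraw": texels_per_sec / pixels_per_sec,
    }


budget = iphone_budget()
print(budget["cycles_per_pixel"], budget["tri_per_frame"], budget["overdraw"])
```

Passing width=256 and height=128 instead raises the overdraw figure to roughly 24x, since the same fill rate is spread over far fewer pixels.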

In order to better judge what the iPhone is able to do, I'm porting a very old version of the Atom engine to it. The engine still has the tree structured l-system, but since the iPhone doesn't have enough overdraw capacity, I'm changing all the primitives to opaque triangles (ie real geometry) to better work with the TBDR of the PowerVR MBX. Removes the need for sorting, but I'm not sure if the 16-bit Z-buffer is going to work out here. Seems as if the iPhone does software vertex work, so with any luck having 100% fully dynamic geometry won't be a problem. Got the backend (drawing junk) working today (tested on the simulator), everything gets done in a single draw call (ideal batch performance, which is rather important on the PowerVR MBX!).

So the plan is to attempt a simple iPhone flying game with an older Atom engine, and with gameplay like the PS3 Flower game. Difference from other iPhone apps is that I'm going to compromise everything to try to go for some real view distances (ie awesome LOD), instead of the puny stuff in most iPhone apps. Really not sure if this is going to work or not ... going to have to get an iPod Touch to test with!

ATI OpenGL 3.0 Drivers Out

Looks like I missed this news, but ATI now has OpenGL 3.0 Catalyst 9.1 Drivers for Windows/Mac/Linux. Awesome!

Atom PC/Mac/Linux Updates

Not much to say here right now, work has been busy and life has gotten in the way, so I haven't had the time.

I've been looking to better solve the GPU side display traversal problems. Problem is mostly that I can only expand and contract the view tree by one node per frame. This leaves holes when things move which I have to fill in during an image space post process. I tried doing visibility checks where I moved the position of nodes ahead in time by a function of their size. The idea was that if a node will take 5 frames to produce pixel sized children, I would move that node ahead 5 frames in time in estimated position. This worked a little, but not enough to solve the problem.

However I have a new untested idea for GPU side display traversal, which should both reduce my need for vector scatter (ie point drawing in OpenGL3/DX10) by 7x, and should solve the motion hole filling problem. The idea involves doing a reprojection of the previous nodes, then a mipmap style reduction to choose nodes which are in the most need of hole fill expansion, followed by a mipmap style expansion to expand the nodes most in need of hole filling. This should enable me to have a variable amount of tree expansion per frame with a fixed cost!

I've got one other really wacky idea which involves doing the above, but trading the vector scatter for a stable fluids advection style gather. I don't think it would work, but if it did, the algorithm would no longer need to draw points at all, and would be fully general (ie easy to do fast in CUDA).




090207 / KZ2 II

Killzone 2 Rendering Observations

Had a chance to take a good look at the KZ2 demo, to see the tech first hand. The overall level of polish is fantastic. A few observations below, the rest I'm saving for when I play the actual game...

Level of Detail. They are using stippled LOD transitions for world geometry, with a transition threshold set so that geometry never stays in a stippled form. Seems as if they have 4 LOD levels for characters, but just pop between levels.

Particles. Judging from the artifacts where particle billboards intersect opaque geometry, I'd guess that they are drawing into a 1/4 size (1/16 area) downsized buffer. Sparks and other details seem to be drawn directly into the full size G-Buffer instead.

Textures. Good resolution on the textures, with lots of decals, and detail maps to maintain good resolution. Didn't see any evidence of texture streaming, so if it is there, they are doing an excellent job ensuring textures are loaded before they are seen!

Motion Blur. The motion blur is subtle with a small maximum blur length. Just enough to give the sense of motion without inducing vomit (I prefer larger blur lengths, and we tried this at work and apparently it can make a few people feel sick). I have a feeling that the blur is done at a 1/2 size downsampled resolution, however the transition is so smooth it is hard to tell. Motion blur looks to not be depth aware, which can be seen as the motion blurred background pulls pixels from the relatively stationary gun. It can be expensive to correct this artifact, might not have been worth it for their engine. When looking through the gun sights and moving around, seems as if there is a fringe in the DOF regions where motion blur is not applied, suggesting that post processing selects between motion blur and DOF, but cannot apply DOF and motion blur at the same time. With a small maximum blur length this is a good optimization to make IMO.

Depth of Field. Looks like they transition between two filter kernels with a pop. Normal blur filter is smaller and touches the visible backend of the gun. At some point when player moves to gun sights, the filter kernel pops to a larger radius but what looks to be the same number of samples for the filter (a little banding). I have only seen near DOF thus far, and they have depth transition aware near DOF, as can be seen by the near out-of-focus bleeding into the in-focus background. A few observations about the KZ2 DOF convinced me to improve our DOF at work, and this is exactly why I inspect every graphics engine in detail, there is always something to learn from other people.

Film Grain. Applied to help hide any artifacts of limited 8-bit precision with 2x or so dynamic range. Really looks good IMO.

Shadows. Cascade shadows for the sun, dynamic shadows for muzzle flashes and what appears to be a few indoor lights. Shadow resolution appears to be rather low in some places (not as bad as Dead Space), however nicely filtered.

Decals. Lots of polish here: per material decals for bullet holes, foot prints where you walk, blood stains, and more.

iPhone

Installing the iPhone SDK on my iMac as I type this. Decided I wanted a distraction from all my long term work, with something simple, fast and fun...




090129 / GT3xx Speculation

A possible CUDA roadmap shows CUDA 2.2 and 2.3 in 2009 before the big CUDA 3.0 in Q4.

The GPU 2013 slides show C++, preemption, virtual pipeline, complete pointer support, adaptive workload partitioning, arbitrary data flow, general purpose programming model, special purpose hardware, hardware managed threading and pipeline.

Then Beyond3D Forum speculation in regards to GT3xx using a form of MIMD, with dynamic clusters, use of new buffers and crossbar, with change in power and memory management. Possible dynamic warp formation (DWF)?

Seems to me that NVidia is in the process of iteratively making general purpose scalar computation as efficient as possible with each new hardware generation. This started with improvements to address divergent global memory access in GT2xx. Seems as if GT3xx might improve performance under branch divergence, and perhaps even under bank conflicts (if we are lucky).

This might be too pie in the sky for GT3xx, but I'm personally hoping for MIMD with dynamic work-item grouping to handle bank conflicts through a banked cache which is only coherent under atomic operations, where the cache provides the functionality of shared registers, shared memory, constants, and access to global memory. Effectively the hardware would auto vectorize access to the cache to maintain bandwidth efficiency, using knowledge of work-groups to maintain coherency and as a basis for grouping.

Pixeljunk Monsters

Was looking for a new game to play co-op on the PS3 and decided to try Pixeljunk Monsters (downloadable on PSN). Turned out to be quite an enjoyable game!

SEAforth 40C18

Speaking of awesome hardware, there is an interesting Dr. Dobb's article on Extreme Forth which describes the state of the art in ultra low power embedded parallel processing, ie SEAforth 40C18, a 40 core chip with <9 mW/core at full speed (yes mW) which can do 25 billion 18-bit Forth operations/sec.
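As a quick sanity check on those figures, here is a small back-of-envelope calculation. It treats the quoted "<9 mW/core" as a 9 mW upper bound and assumes the 25 billion operations/sec figure is for the whole chip with all 40 cores busy, so the power and energy numbers are upper bounds rather than measurements.

```python
# Back-of-envelope numbers derived from the figures quoted above.
CORES = 40
MAX_MW_PER_CORE = 9.0   # quoted as "<9 mW/core"; used here as an upper bound
OPS_PER_SEC = 25e9      # assumed to be the whole-chip rate

max_chip_mw = CORES * MAX_MW_PER_CORE                       # upper bound on chip power
ops_per_core = OPS_PER_SEC / CORES                          # operations/sec per core
max_pj_per_op = (max_chip_mw * 1e-3) / OPS_PER_SEC * 1e12   # upper bound, picojoules per op

print(f"chip power < {max_chip_mw:.0f} mW")
print(f"per core   = {ops_per_core / 1e6:.0f} M ops/sec")
print(f"energy/op  < {max_pj_per_op:.1f} pJ")
```

Under those assumptions the whole chip stays under 360 mW, each core does 625 M ops/sec, and an operation costs under 14.4 pJ.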




090121 / Killzone 2 Bean Spilling

previous | next

Just to make sure my personal bias is clear: I've never been tempted to pre-order a game before, and now I'm actively waiting to pick up my pre-ordered KZ2. I'd pay my $60 just to visually dissect the tech, regardless of whether the game is good. From what I've seen thus far (which is little more than internet vids), KZ2 sets the bar for game technology.

Killzone 2 Tech

View the 40 minute behind the scenes making of KZ2 video at Game Kings!
Refresh with their previous deferred rendering presentation.

I tossed out a guess on the Beyond3D Forums that their post would be between 15-30% of the GPU time depending on frame rate, and it seems as if this video interview confirms that moving GPU work (post processing, perhaps more) to the SPUs has resulted in an amazing 20 to 40% performance gain.

Apparently tech was being polished late in development. The "4xMSAA trick" low resolution particle artifacts are clear on this set of August images (which appear to be direct framebuffer grabs). Look at the first muzzle flash image at 720P. Then check out the rocket image (2x2 pixel blocks). It also seems as if the small high detail particles are directly blending (perhaps into the G-Buffer) at full resolution. R2 does something similar with high detail particles at full resolution. Post August, I don't see any KZ2 shots with the 4xMSAA artifacts. Either the shots were doctored for public consumption, or they indeed changed the particle rendering. I see evidence in at least one shot that they might now be drawing to a downsampled buffer and later merging back. View this shot at 720P. Look where the explosion is behind the sandbags: it looks more like the Battlefield Bad Company particle upsizing. Perhaps the change was to provide downsized particle rendering which adapts to overdraw by changing resolution.

How About Some Numbers

Notes below are my best guess of the debug info gathered from many frames of the above linked video. It took a bunch of frames to make out the numbers, so don't expect sums to add up properly.

CPU TIME
--------
Unknown .......... 1.24%
SPU Sync ......... 0.06%
AI Manager ....... 0.47%
Game Logic ....... 9.52%
Script ........... 0.80%
Physics .......... 1.57%
Representation ... 10.46%
Draw ............. 20.18%
HUD .............. 2.19%
Sound ............ 0.65%
Profile HUD ...... 25.17%
GPU Sync ......... 37.99%
----------
Total Time ....... 36.85%

SPU TIME
----------------------------
AI.Cover ................... ........ 0.00%
AI.LineOfFire .............. ........ 0.00%
Anim.EdgeAnim .............. 33 ..... 2.01%
Anim.Skinning .............. 152 .... 30.68%
Gfx.DecalUpdate ............ 9 ...... 0.78%
Gfx.LightProbes ............ 396 .... 9.00%
Gfx.PB.DeferredSchedule .... 1 ...... 0.60%
Gfx.PB.Forward ............. 2 ...... 1.69%
Gfx.PB.Geometry ............ 1 ...... 18.67%
Gfx.PB.Lights .............. 1 ...... 0.66%
Gfx.PB.ShadowMap ........... 1 ...... 4.20%
Gfx.Particles.ManagerJob ... 1 ...... 3.14%
Gfx.Particles.UpdateJob .... 130 .... 12.33%
Gfx.Particles.VertexJob .... 70 ..... 20.64%
Gfx.Post.BloomCapture ...... 12 ..... 2.80%
Gfx.Post.BloomIntegrate .... 8 ...... 1.52%
Gfx.Post.DepthOfField ...... 64 ..... 12.12%
Gfx.Post.DepthToFuzzy ...... 8 ...... 0.67%
Gfx.Post.Downsample ........ 29 ..... 0.61%
Gfx.Post.GrainWeight ....... 1 ...... 0.51%
Gfx.Post.HBlur ............. 45 ..... 3.02%
Gfx.Post.ILR ............... 1 ...... 0.63%
Gfx.Post.Modulate .......... 27 ..... 1.3?%
Gfx.Post.MotionBlur ........ 46 ..... 11.31%
Gfx.Post.Unlock? ........... 1 ...... 0.01%
Gfx.Post.Upsample .......... 108 .... 9.47%
Gfx.Post.VBlur ............. 46 ..... 3.73%
Gfx.Post.Vg??lle ........... 1 ...... 1.18%
Gfx.Post.Zero .............. 16 ..... 0.64%
Gfx.Scene.Portals .......... 3 ...... 30.72%
Mesh.Decompression ......... ........ 0.00%
Physics.Collide ............ 4 ...... 2.48%
Physics.Integrate .......... 4 ...... 2.11%
Physics.KdTree ............. 8 ...... 20.50%
Physics.Raycast ............ ........ 0.00%
Snd.MP3.Stereo ............. 2 ...... 2.60%
Snd.MP3.Surround ........... 2 ...... 7.51%
Snd.?Synth ................. 35 ..... 3.23%
Snd.Reverb ................. 14 ..... 4.02%
----------------------------
Total Time ................. 1232 ... 227.46%

GRAPHICS
--------
FPS ................. 30
GPU Stall by CPU .... 0.123 ?s
CPU stall by GPU .... 12.231 ?s

GPU TIME
--------------------------
Unknown ....... 0.2?/ ... 3.43%
Geometry ....... 1.8?/ ... 43.37%
Lighting ....... 1.7?/ ... 14.??%
Effects ........ 8.5?/ ... 8.4?%
Post process ............. 18.31%
--------------------------
Total Time ............... 81.??%
GPU Stall ................ 0.??%

PRIMS / TRI
-----------
Totals ..... 1431/ ... 344,634
Prime? ..... 0/ ...... 0
Geometry ... 619/ .... 161,231
Shadow ..... 683/ .... 170,???
Effects .... 121/ .... 14,3??

MEMORY STATS
------------------------
Pushbuffer ???? ........ 0.15 MB
Pushbuffer High ........ 0.15 MB
VRAM Free .............. 23.43 MB
Host Free .............. 80.?? MB
Heap Free .............. 134.?? MB
Render Mem ???? ........ 0
Render Mem Used ........ 12.00 MB
Render Mem Watermark ... 12.00 MB

MAIN RAM ....... 101.00 MB
----------------
Physics ........ 5.30 MB
Collision ...... 3.72 MB
Sound .......... 16.25 MB
Mesh ........... 21.20 MB
Graphics ....... 6.53 MB
Animation ...... 34.45 MB
Texture ........ 0.56 MB
Shader ......... 1.46 MB
AI Data ........ 2.75 MB
Various ........ 3.32 MB
Waste .......... 5.27 MB
----------------
Total .......... 97.17 MB
Main RAM ??? ... 97 / 101

VIDEO RAM .. 190.04 MB
------------
Mesh ....... 15.99 MB
Texture .... 156.87 MB
Waste ...... 1.20 MB
Total ...... 174.08 MB

Interesting

A few bits which can be gathered from this: SPUs are getting used for all sorts of stuff, including AI, animation, skinning, push buffer, post processing, visibility, physics, and audio. SPUs are doing motion blur, depth of field, and more. The SPU list suggests that they have portal based visibility as well as a KDTree for physics and raycasts (very interesting). Primitive stats suggest shadows take around 50% of triangles, show a relatively low number of triangles per frame (thanks to SPU culling?), and that perhaps particles (and decals?) are upwards of 10-20K triangles. Streamed in texture budget looks to be about 160-180MB. Lighting appears to be 1/3 as expensive as G-Buffer creation.
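The two ratios quoted above can be checked against the numbers in the tables. This is only a sketch of the arithmetic: it uses just the digits that were readable, brackets each "?" digit with its lowest and highest possible value, and assumes the "Geometry" GPU row corresponds to G-Buffer creation.

```python
# Values copied from the PRIMS / TRI and GPU TIME tables above.
total_tris = 344_634
shadow_tris = (170_000, 170_999)   # read as "170,???": lowest and highest possible
geometry_gpu = 43.37               # percent of GPU time (assumed to be G-Buffer creation)
lighting_gpu = (14.00, 14.99)      # read as "14.??%": lowest and highest possible

shadow_share = tuple(s / total_tris for s in shadow_tris)
lighting_vs_geometry = tuple(l / geometry_gpu for l in lighting_gpu)

print("shadow share of triangles: %.1f%% to %.1f%%"
      % tuple(100.0 * s for s in shadow_share))
print("lighting / geometry cost:  %.2f to %.2f" % lighting_vs_geometry)
```

Both ranges are narrow: shadows come out just under half of all triangles, and lighting at roughly a third of the geometry cost.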




090110 / Hole Filling

previous | next

Raytracing Variants

Progressive Photon Mapping, Image Synthesis using Adjoint Photons, Energy Redistribution Path Tracing

A few great papers, the Adjoint Photon paper being one of my personal favorites (check out Dual Photography for related info).

Hole Filling

Atom update, still working on perfecting all the issues related to reprojection and hole filling. One interesting paper related to hole filling is Efficient Point-Based Rendering Using Image Reconstruction, which describes the "push pull interpolation" image space method.

My latest prototype still has some problems to fix, but is much closer to a robust solution. My previous approaches followed a common pattern,

1. Draw new data into cleared buffer.
2. Hole fill new data.
3. Merge hole filled data with reprojected results of last frame.
4. Repeat.

My new prototype follows a new pattern,

1. Draw new data into cleared buffer (point scatter).
2. Hole fill new motion vector data only.
3. Use hole filled motion vector field to reproject recirculated data.
4. Merge new data without hole filling with reprojected recirculated data.
5. Use hole filling for color on the reprojection data for display only.
6. Repeat.

For hole filling I'm using only one weight factor (FP16 precision stored in the alpha channel) to control interpolation in a simple image pyramid reduction and expansion. Keep in mind I'm not sampling depth. The reduction computes a weighted average of 4 pixels. The weighted average uses squared weights to bias towards higher weighted data. The expansion does not do a bilateral upsample, and instead does one bilinear fetch from the smaller mip level (better performance). The blend between the larger and smaller mip samples is done using a clamped ratio of squared weights again to bias towards higher weighted data.
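A minimal CPU-side sketch of one reduction step and one expansion blend, written in Python for readability rather than as shader code. Pixels are hypothetical (value, weight) pairs where a higher weight means better data and a weight of 0 marks a hole. How the weight itself is carried to the next level, and the exact form of the clamped ratio, are assumptions made for this sketch and not details given above.

```python
def reduce_2x2(pixels):
    """Reduce four (value, weight) pixels to one, averaging with squared weights."""
    squared = [w * w for _, w in pixels]
    total = sum(squared)
    if total == 0.0:
        return (0.0, 0.0)  # all four pixels are holes
    value = sum(v * s for (v, _), s in zip(pixels, squared)) / total
    # Assumption: the weight is reduced with the same squared-weight average.
    weight = sum(w * s for (_, w), s in zip(pixels, squared)) / total
    return (value, weight)


def expand_blend(fine, coarse):
    """Blend a larger-mip pixel with one sample fetched from the smaller mip."""
    fine_value, fine_weight = fine
    coarse_value, coarse_weight = coarse
    if coarse_weight == 0.0:
        return fine  # nothing useful to pull from the smaller mip
    # Assumption: "clamped ratio of squared weights" read as min(1, wf^2 / wc^2).
    t = min(1.0, (fine_weight * fine_weight) / (coarse_weight * coarse_weight))
    value = fine_value * t + coarse_value * (1.0 - t)
    weight = fine_weight * t + coarse_weight * (1.0 - t)
    return (value, weight)


hole = (0.0, 0.0)
coarse = reduce_2x2([(1.0, 1.0), hole, hole, hole])  # the single valid pixel wins
filled = expand_blend(hole, coarse)                  # the hole takes the coarse value
kept = expand_blend((5.0, 1.0), (1.0, 0.5))          # good fine data is left alone
```

The `coarse` argument of `expand_blend` stands in for the single bilinear fetch from the smaller mip level.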

This weight factor is an estimation of pixel quality which is computed in my visibility algorithm. Pixel quality is reduced each frame based on the motion vector length to ensure recirculated data loses out in future joins with new data. Lower quality data gets blended over in the hole filling process. Hole filling of sparse data works by a temporary "super" quality level which enables that data to expand outward. This super quality setting is reset in recirculation to a normal level.

I'm actually placing 1/(quality^2) into alpha. Using 1/(quality^2) with bilinear texture filtering in the image pyramid expansion enables hole filling expansion to contract around sparse data (which needs hole filling) and not overflow into non-sparse data. This is one case where incorrectly using linear interpolation is quite useful.
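To show just the numerical effect of filtering the stored value, here is a small sketch comparing a linear blend of 1/(quality^2) against a linear blend of quality itself. The two quality values are hypothetical, picked only to give a large gap, and the sketch does not model the rest of the pyramid.

```python
def lerp(a, b, t):
    """Plain linear interpolation, as a bilinear fetch does along one axis."""
    return a + (b - a) * t


def filtered_quality(q_a, q_b, t):
    """Quality recovered after linearly filtering the stored 1/(q^2) values."""
    alpha = lerp(1.0 / (q_a * q_a), 1.0 / (q_b * q_b), t)
    return alpha ** -0.5


# Hypothetical values: smaller numbers mean higher quality.
q_small, q_large = 0.1, 1.0
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    via_alpha = filtered_quality(q_small, q_large, t)
    direct = lerp(q_small, q_large, t)
    print(f"t={t:.2f}  filtered 1/q^2 -> q={via_alpha:.3f}   linear q={direct:.3f}")
```

The quality recovered from the filtered alpha is not the linear blend of the two qualities. It stays near the smaller value for most of the span and only moves to the larger one close to t = 1, so the transition between the two regions is pushed toward one side.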

Quality in some ways functions like projected point radius (and is in fact partly a function of projected point radius). Larger values represent parent points which should not be drawn (this is a side effect of my stochastic visibility algorithm). Smaller values are marked as higher quality. Super quality (ie expansion) is set with really small quality values. This is almost counter intuitive, in that points marked with the super quality setting actually have a large projected radius. Since my visibility algorithm can only expand/contract by one tree level per frame, fast motion results in sparse point data. I use the difference in projected point radius between the current and previous frames to detect sparse point data.

Motion blur was added during the expansion step of hole filling. Instead of taking one sample in the larger mip level, I take an average of 4 samples offset by the motion vector field. The result is quite nice even under huge amounts of motion because of the multi-resolutional filtering.
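A rough sketch of that 4-tap fetch, again in Python rather than shader code. The `sample` callable is a hypothetical stand-in for a filtered texture fetch, and spacing the taps evenly and centered on the pixel is an assumption, since the exact offsets are not stated above.

```python
def motion_blur_fetch(sample, x, y, mv_x, mv_y, taps=4):
    """Average `taps` samples spread along the motion vector around (x, y)."""
    total = 0.0
    for i in range(taps):
        # Assumption: taps are evenly spaced and centered, so t spans (-0.5, 0.5).
        t = (i + 0.5) / taps - 0.5
        total += sample(x + mv_x * t, y + mv_y * t)
    return total / taps


def edge(x, y):
    """Hypothetical image: a hard vertical edge at x = 10."""
    return 1.0 if x >= 10.0 else 0.0


still = motion_blur_fetch(edge, 10.0, 0.0, 0.0, 0.0)   # no motion, edge stays sharp
moving = motion_blur_fetch(edge, 10.0, 0.0, 8.0, 0.0)  # taps at x = 7, 9, 11, 13
```

With no motion the fetch collapses to a single sample position, and with an 8 pixel horizontal motion vector the hard edge is averaged to 0.5. In the pyramid, the same four taps at a smaller mip level cover a proportionally longer blur in full resolution pixels, which fits the observation that large motions still filter well.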

At this point the result has very high quality motion, very fast convergence, but is too sharp (ie not anti-aliased enough) and also not as good at hiding artifacts from missing geometry from my stochastic visibility. I believe both these problems can easily be fixed in the join step, and this is what I am working on next.

Honestly I'd be really surprised if you were still reading at this point. Great news if this was in some ways useful to you, for me it's just another brain dump so that someday I can remember a bit of what I once worked on...




090108 / Structure Synth

previous | next

Structure Synth

Structure Synth and Syntopia Blog

Almost missed this little gem off a blog post on the Real-Time Rendering Blog. Structure Synth generates digital art via grammar rules similar to an L-system.

CUDA 2.1 Volumetric Particle Example

Posted on Industrial Arithmetic Blog, CUDA volumetric particle example which renders in layers from half view/light angle direction. A great 2003 paper which details this technique applied to volume rendering is A Model for Volume Lighting and Modeling.
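A minimal sketch of picking the slice direction for this kind of half-angle layering, based on the usual description of the technique and not on the code of the CUDA example. It assumes `view` and `light` are unit vectors given in the same convention. The function names are made up for illustration.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def half_angle_axis(view, light):
    """Return (slice_axis, view_inverted) for unit vectors `view` and `light`.

    When the two directions are within 90 degrees of each other the axis is
    halfway between them. Otherwise the view direction is inverted first, so
    the layers can still be stepped through from the light's side.
    """
    d = sum(a * b for a, b in zip(view, light))
    if d > 0.0:
        return normalize(tuple(a + b for a, b in zip(view, light))), False
    return normalize(tuple(b - a for a, b in zip(view, light))), True


same_side, inverted_a = half_angle_axis((0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
opposite, inverted_b = half_angle_axis((0.0, 0.0, 1.0), (0.0, 0.0, -1.0))
```

The returned flag tells the renderer that the view was inverted, which in the usual formulation also flips the order in which the layers are composited for the eye.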