070822 / New Pipeline Progress

Work on the new combined graphics, physics, and CFD pipeline is progressing quite smoothly. All the current and previous screen shots and videos on this page (except the CFD stuff) are from my 100% working older fixed function pipeline and the later Shader Model 3.0 source path, in which nearly all of the pipeline is CPU side except for compositing. During this major rewrite and optimization effort, I am keeping the old development path alive to test new ideas and simplifications. As it turns out, months ago when I switched development from my on-motherboard Intel GMA3000 to an NVidia 8600 GTS card, I turned off most of my aggressive hidden surface removal (HSR) algorithm because the NVidia card provided such a performance jump (both in offloading triangle setup and in fragment processing), and I had since forgotten about it. Now with the aggressive HSR back on, I have found some substantial performance increases as I tweak and optimize the algorithm for the final pipeline.

CPU / GPU Mix

My new pipeline follows a very simple philosophy: move anything which can be parallelized and does not require double precision to the GPU; everything else stays CPU side. The rough pipeline design is as follows. Note this is in logical order; the actual order is different to hide the latency of CPU to GPU bus transfers and computation. Also, I am skipping the full audio algorithm integration (which also features a CPU/GPU mix) until I get this 100% working.

- CPU: Expand/contract the world tree structure based on interaction and view.
- CPU: Compute the next animation frame for the animated L-system rules of cells in use (requires double precision).
- CPU: Do world space to eye space transform for cells (requires double precision).
- CPU: Transfer single precision eye space coordinates and other data to GPU via VBO transfers.
- GPU: Project all cells into 2D screen space, compute all data necessary for CPU hidden surface removal (HSR) algorithm, and compute all factors for final display and physics properties. This runs as a geometry shader writing out to VBOs on the GPU.
- GPU: Transfer the HSR data and modified z sort coordinate back to the CPU.
- CPU: Sort nodes for HSR in front to back order (radix sort). Run the HSR algorithm, which is hopelessly linear, running through the cells in z order and filtering out cells which are decided to be hidden enough to prune from the final display list. HSR also computes the priority for the beginning expansion/contraction step.
- CPU: Send the index list of displayable cells in back to front order to the GPU.
- GPU: Composite all drawable cells in back to front order to a 16-bit framebuffer, running a shader which does emissive and fractal environmental reflective lighting.
- CPU: Prep special ordered index list of cells for all the physics/CFD scattering passes. Send these index lists to the GPU via VBOs. Send parent relative cell velocity and other per cell physics/CFD properties to GPU via VBOs.
- GPU: Run physics/CFD scatter pass followed by gather pass. The gather pass results in new physics properties for each cell.
- GPU: Transfer gathered new physics properties back to CPU.
- CPU: Transform new physics gathered velocities back into current frame world space parent relative coordinate spaces (these require double precision).
- CPU: Update current cell position and properties based on gathered physics/CFD info.
- LOOP!
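
The CPU-side sort and HSR steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the engine's actual algorithm: the cell fields (`id`, `z`, `tile`, `alpha`), the per-tile coverage test, and the 0.95 threshold are hypothetical stand-ins for whatever data the GPU pass hands back. It also assumes non-negative depths, so that the float32 bit pattern sorts in the same order as the value.

```python
import struct

def depth_key(z):
    # Reinterpret a non-negative float32 as an unsigned int; for such values
    # the integer order matches the float order, so it can be radix sorted.
    return struct.unpack("<I", struct.pack("<f", z))[0]

def radix_sort_front_to_back(cells):
    # LSD radix sort: four stable bucket passes of 8 bits each over the key.
    keyed = [(depth_key(c["z"]), c) for c in cells]
    for shift in (0, 8, 16, 24):
        buckets = [[] for _ in range(256)]
        for item in keyed:
            buckets[(item[0] >> shift) & 0xFF].append(item)
        keyed = [item for bucket in buckets for item in bucket]
    return [cell for _, cell in keyed]

def hsr_filter(sorted_cells, threshold=0.95):
    # Single linear pass in z order. Hypothetical occlusion test: each cell
    # lands in one coarse screen tile, and a cell is pruned once that tile is
    # already covered beyond `threshold` by nearer cells.
    coverage = {}
    visible = []
    for cell in sorted_cells:
        covered = coverage.get(cell["tile"], 0.0)
        if covered >= threshold:
            continue  # hidden enough: drop from the display list
        visible.append(cell)
        coverage[cell["tile"]] = covered + (1.0 - covered) * cell["alpha"]
    return visible

def display_order(cells):
    # Sort front to back for HSR, then reverse for back to front compositing.
    visible = hsr_filter(radix_sort_front_to_back(cells))
    return [cell["id"] for cell in reversed(visible)]

cells = [
    {"id": 0, "z": 5.0, "tile": 0, "alpha": 1.0},
    {"id": 1, "z": 1.0, "tile": 0, "alpha": 1.0},
    {"id": 2, "z": 3.0, "tile": 1, "alpha": 0.5},
]
print(display_order(cells))  # prints [2, 1]
```

Cell 0 sits behind the fully opaque cell 1 in the same tile, so it is pruned. The surviving cells come out in back to front order, ready to be sent to the GPU as an index list.
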

Rough Optimization Budgeting

When working on the engine pipeline design, I always keep track of rough estimated costs of key parts of the algorithm in terms of CPU cycles used, CPU memory bandwidth, PCIe bandwidth, latency for PCIe bus transfers, GPU SPU (scalar processing unit) cycles used, GPU texture loading/filtering bandwidth, GPU triangle setup bandwidth, and GPU ROP (raster operation processing) bandwidth. Once NVidia releases a GeForce 8 series profiling tool for Linux, I will be able to verify things like GPU cache performance and other critical factors, but for now it is just intelligent design with lots of testing.

Maximum performance budget (I try to stay well under these figures to ensure a good FPS):

- GPU: 1450 MHz of ops * 32 SPUs = 46400 MHz of ops
- GPU: 1 clk for basic ops
- GPU: 4 clk for complex ops
- GPU: 32 GB/s of on card bus
- GPU: 725 Mtri/s of setup
- GPU: 8 TAUs at 725 MHz = 5800 MHz texture loads
- GPU: 16-bit texture loads cost 2x
- GPU: 32-bit texture loads cost 4x
- GPU: 8 ROPs at 725 MHz = 5800 MHz blends
- GPU: 16-bit blends cost 2x
- GPU: 32-bit blends cost 4x
- CPU: 2 GHz
- CPU: PCIe bus 4 GB/sec
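
As a rough illustration of how the ceilings above turn into per-frame numbers, here is a small sketch. The 60 FPS target, the 1280x720 resolution, and the 4 layers of overdraw are hypothetical inputs chosen only for the example. The 1x cost entry is assumed to be an 8-bit format, and GB is taken as 1e9 bytes.

```python
# Per-second ceilings copied from the list above (GB taken as 1e9 bytes).
BUDGET_PER_SEC = {
    "spu_ops": 1_450_000_000 * 32,
    "texture_loads": 725_000_000 * 8,
    "rop_blends": 725_000_000 * 8,
    "card_bus_bytes": 32_000_000_000,
    "pcie_bytes": 4_000_000_000,
}

# Cost multipliers from the list: 16-bit costs 2x and 32-bit costs 4x.
# The 1x entry is assumed here to be an 8-bit format.
COST = {8: 1, 16: 2, 32: 4}

def per_frame(resource, fps):
    # Share of a per-second ceiling available to a single frame.
    return BUDGET_PER_SEC[resource] / fps

def blend_budget_fraction(width, height, layers, bits, fps):
    # Fraction of the ROP blend ceiling used when every pixel is blended
    # `layers` times per frame at the given format cost.
    blends_per_sec = width * height * layers * COST[bits] * fps
    return blends_per_sec / BUDGET_PER_SEC["rop_blends"]

print(per_frame("spu_ops", 60))
print(per_frame("pcie_bytes", 60))
print(blend_budget_fraction(1280, 720, 4, 16, 60))
```

With these made-up inputs, the blend pass comes out to roughly 7.6% of the ROP ceiling. The same kind of estimate can be repeated for texture loads and SPU ops.
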

My compositing engine is the most optimized and most expensive part of the pipeline, using about 20% of my texture fetch and filter budget and 13% of my ROP budget, but only 6% of my SPU budget and under 1% of my triangle setup budget. I also have room to move one of the two texture loads in the compositing fragment shader from a 2D to a 1D texture load, at the cost of some extra math (SPU time), which should help keep many more texture reads cached if texture thrashing becomes a problem.

Atom ©2009-2007 Timothy Farrar