081224 / Larrabee
Atom ©2009-2007 Timothy Farrar

Merry Christmas All!

From SIGGRAPH Asia 2008: Parallel Computing for Graphics: Beyond Programmable Shading, Larrabee

Intel, why have a CPU when you could have a Larrabee? Larrabee is finally set to have many of the things I personally have wanted in a CPU (yes, CPU) for some time. I'd rather have a Larrabee than a Core i7 as my CPU. Unfortunately, it seems as if Larrabee isn't going to be realized as a CPU. Kind of like Porsche never placing a good engine in the Boxster because it would upset the bread-and-butter 911, my wild guess is that the possibly low margins of Larrabee in competition with NVidia and AMD GPUs will ensure Larrabee never sees a CPU role. Perhaps I'm 100% off on this assumption.

Anyway, it will be interesting to see what ends up as the price/wattage/performance of Larrabee vs GPUs. Will Intel pull it off, and what is the response going to be from NVidia and AMD? Will programmers go Larrabee native and make use of the platform, or will a majority stick to DX11/OpenGL3/OpenCL? Non-native, will Larrabee have good small triangle performance? Will 4x4 fragment groupings be fixed, or will they software merge fragments to fill out SIMD vectors for small triangles? How will software binning do with upwards of 200-1000 bins and millions of triangles per frame? Will triangles even matter on Larrabee? Will I be tempted to do a Larrabee native app if market share is there (likely yes!)? Lots to see for 2009/2010...

General Info
- Coherent L1 and L2.
- 32KB L1 (not in SigAsia08 papers).
- 256KB L2 (not in SigAsia08 papers).
- In-order pipeline.
- No latency scalar ops.
- Low? latency on vector ops.
- Cheap? branch mispredict and cache miss.
- 4-way SMT designed to hide L1 miss.
- 16-wide float32|int32, 8-wide float64 (all math at 32|64-bit).
- SIMD one source operand from memory.
- SIMD fast predication on every instruction, 16-bit predicates.
- SIMD scatter/gather!
- Almost free? conversion to float16, int|norm-8|16.
- Cheap? conversion to other formats: sRGB, float11:11:10, norm10:10:10:2.
- Long-latency texture fetch.
- Support OpenGL and OpenCL!!!!
- Support Larrabee Native C/C++ applications!!!!!!!!!!!!!!!!!

Parallel Programming with Larrabee
- Work Item = One SIMD lane.
- Fiber = Software managed context, runs 16-64 work items.
- Fiber = logically assigned part of L1.
- Thread = Hardware managed context!????
- Thread = Switches between 2-10 fibers to cover latencies.
- Switch = Start asynchronous texture fetch, then user-space context switch.
- Core = Independent processor that runs 1-4 threads.
- SIMD divergent branch = execute both paths, using mask regs to predicate.

L1 As Extended Register File
- 32KB / (4 threads x 02 fibers x 16 work items) = 64 32-bit "L1 regs" per work item.
- 32KB / (4 threads x 02 fibers x 64 work items) = 16 32-bit "L1 regs" per work item.
- 32KB / (4 threads x 10 fibers x 16 work items) = 12 32-bit "L1 regs" per work item.
- 32KB / (4 threads x 10 fibers x 64 work items) = 03 32-bit "L1 regs" per work item.
- Requires a range of 8 to 40 asynchronous texture fetches in flight per core (4 threads x 2-10 fibers).
- Rough calculations ignoring other usage of L1...

Larrabee vs OpenCL
- Larrabee can submit work to itself!
- OpenCL cannot.

Larrabee on Graphics
- Color/depth/stencil buffers stay in L2.
- Binning, run in submit order, each pixel does shading+OM.
- 64x64 or 128x128 tiles, tile = bin.
- Tile per core.
- Add triangles to overlapping bins.
- JIT for shaders and pipeline stages done on Larrabee itself.
- In bin, dice triangle to leaf 4x4 blocks.
- At leaf test sample locations?
- Texture soft fault on page miss (easy mega-texturing, easy procedural texturing)!

L2 As Framebuffer Tile
- 256KB / (64x64) = 64 bytes per pixel = (32bit depth_stencil x 4 MSAA + 6 64bit MRTs).
- 256KB / (128x128) = 16 bytes per pixel = (32bit depth_stencil + 3 32bit MRTs).
- Rough calculations ignoring other usage of L2...

GPU Binning

AMD's binning paper shows results for the HD 4870 at only 1M bins in 0.02 seconds (50Mbin/sec) on a 64^3 grid. It uses the multipass DrawAuto trick with StreamOut recirculation of non-binned points. Hmm, the method seems a little too slow for me regardless of AMD or NVidia GPU. Too much overhead for the result. Say this was a GTX 280: at 141.7 GB/s of bandwidth, 50M points/s averages out to over 2KB of available peak bandwidth per point.

Larrabee or OpenCL or CUDA should be able to mop the field with this sort of computation. Basically: a vector gather of the queue position for 16 points, an increment of the queue position plus a vector scatter of the result, and a vector scatter of the point output based on the queue position. So 3 memory ops per point. Assuming a measly 128 GB/s of bandwidth (for a high-end GPU) and full cache misses for all memory transactions (i.e. 64 bytes/transaction per SIMD lane), this should be able to support upwards of 500M points/sec for binning. Assuming cache hits with Larrabee, the numbers change drastically towards even better performance.
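The "L1 As Extended Register File" numbers can be double-checked with a few lines of Python. This is only a sketch of the same rough arithmetic: the constant and function names are mine, the 32KB L1 and 4 threads come from the notes, and it assumes (like the notes) that nothing else lives in L1 and that each fiber keeps one texture fetch in flight.

```python
L1_BYTES = 32 * 1024       # 32KB L1 per core, from the notes
THREADS_PER_CORE = 4       # 4-way SMT
REG_BYTES = 4              # one 32-bit "L1 reg"


def l1_regs_per_work_item(fibers_per_thread, work_items_per_fiber):
    """Whole 32-bit L1 slots available to each work item (rounded down)."""
    work_items = THREADS_PER_CORE * fibers_per_thread * work_items_per_fiber
    return L1_BYTES // (work_items * REG_BYTES)


def fetches_in_flight(fibers_per_thread):
    """Assumes one asynchronous texture fetch outstanding per fiber."""
    return THREADS_PER_CORE * fibers_per_thread


if __name__ == "__main__":
    for fibers in (2, 10):
        for items in (16, 64):
            regs = l1_regs_per_work_item(fibers, items)
            print(f"{fibers:2d} fibers x {items:2d} work items -> {regs:2d} L1 regs")
        print(f"{fibers:2d} fibers -> {fetches_in_flight(fibers)} fetches per core")
```

The 10-fiber cases do not divide evenly (12.8 and 3.2 slots), which is why the list rounds them down to 12 and 3.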
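The "L2 As Framebuffer Tile" budget works the same way. A minimal sketch, with hypothetical helper names, assuming the whole 256KB L2 is given to one tile and (as in the notes) that depth/stencil is stored per MSAA sample while each MRT is counted once per pixel:

```python
L2_BYTES = 256 * 1024      # 256KB L2 per core, from the notes


def tile_bytes_per_pixel(tile_dim):
    """L2 bytes available per pixel for a square tile_dim x tile_dim tile."""
    return L2_BYTES // (tile_dim * tile_dim)


def layout_bytes_per_pixel(depth_stencil_bits, msaa_samples, mrt_count, mrt_bits):
    """Per-pixel footprint of a layout, mirroring the accounting in the notes."""
    total_bits = depth_stencil_bits * msaa_samples + mrt_count * mrt_bits
    return total_bits // 8


if __name__ == "__main__":
    # 64x64 tile: 32-bit depth_stencil x 4 MSAA + 6 64-bit MRTs
    print(tile_bytes_per_pixel(64), layout_bytes_per_pixel(32, 4, 6, 64))
    # 128x128 tile: 32-bit depth_stencil + 3 32-bit MRTs
    print(tile_bytes_per_pixel(128), layout_bytes_per_pixel(32, 1, 3, 32))
```

Both layouts land exactly on the per-pixel budget (64 and 16 bytes), so any other use of L2 would have to come out of the render targets.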
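The gather / increment / scatter binning idea from "GPU Binning" might look roughly like this. It is a scalar Python emulation of a 16-wide SIMD loop, not Larrabee code: `bin_points`, `bin_of`, and the fixed per-bin `capacity` are names and assumptions of mine. It also adds one step the description above glosses over: when several lanes of the same vector hit the same bin, each lane gets its own slot by adding its rank among the earlier lanes.

```python
SIMD_WIDTH = 16  # 16 float32/int32 lanes, as in the notes


def bin_points(points, bin_of, num_bins, capacity):
    """Append each point to the queue of its bin, SIMD_WIDTH points at a time.

    Assumes no bin receives more than `capacity` points.
    """
    queue_pos = [0] * num_bins                          # next free slot per bin
    out = [[None] * capacity for _ in range(num_bins)]  # per-bin output queues
    for base in range(0, len(points), SIMD_WIDTH):
        lanes = points[base:base + SIMD_WIDTH]
        bins = [bin_of(p) for p in lanes]
        # Memory op 1: "vector gather" of the queue position for each lane.
        pos = [queue_pos[b] for b in bins]
        # Conflict handling (my assumption): offset lanes that share a bin.
        hits = {}
        for lane, b in enumerate(bins):
            rank = hits.get(b, 0)
            pos[lane] += rank
            hits[b] = rank + 1
        # Memory op 2: "vector scatter" of the incremented queue positions.
        for b, count in hits.items():
            queue_pos[b] += count
        # Memory op 3: "vector scatter" of the points to their queue slots.
        for lane, b in enumerate(bins):
            out[b][pos[lane]] = lanes[lane]
    return queue_pos, out


if __name__ == "__main__":
    counts, queues = bin_points(list(range(40)), lambda p: p % 4, 4, 16)
    print(counts)          # points per bin
    print(queues[0][:10])  # bin 0 contents, in submit order
```

On real hardware the conflict step would need some vector-friendly equivalent; this sketch only shows that three memory operations per point are enough once conflicts are resolved.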
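The bandwidth figures in "GPU Binning" are simple divisions, sketched here with hypothetical function names and assuming 1 GB = 10^9 bytes:

```python
GB = 1e9  # assumes decimal gigabytes


def bytes_available_per_point(bandwidth_bytes_per_s, points_per_s):
    """Peak memory bandwidth divided across the points binned each second."""
    return bandwidth_bytes_per_s / points_per_s


def points_per_second(bandwidth_bytes_per_s, mem_ops_per_point, bytes_per_op):
    """Bandwidth-bound binning rate when every memory op misses the cache."""
    return bandwidth_bytes_per_s / (mem_ops_per_point * bytes_per_op)


if __name__ == "__main__":
    # 141.7 GB/s at 50M points/s: a bit over 2.8KB per point.
    print(bytes_available_per_point(141.7 * GB, 50e6))
    # 128 GB/s, 3 memory ops per point, 64 bytes per transaction.
    print(points_per_second(128 * GB, 3, 64) / 1e6, "M points/s")
```

The second number comes out near 667M points/s, consistent with the "upwards of 500M points/sec" estimate even before any cache hits are counted.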