080223 / Human Head + Parallel ProgrammingPerhaps can be guessed by the lack of blog posts, I've been taking a break from Atom for a while. Working for Human Head StudiosIn the mean time I've started working at Human Head Studios in Madison Wisconsin. It is simply awesome working with like minded people, especially the Head's render guru, Brian Karis. Of course I cannot say anything about what I am doing, but I will say that I really do enjoy programming for the consoles... What About the Atom Blog? Right now, I am currently living in two cities (which are 160 miles apart). Between working during the week, lifting, getting back into MMA, and driving back to Chicago on the weekends, my time is spent. Once I finish my transition, the blog will get some regular updates... Atom's Time will Still Come. Atom is still alive (but now only worked on in free time), and will some day get finished, but not in 2008. This is actually a really good thing for Atom. Atom really needs OpenGL3 Mt Evens to be finished before driver support for AMD/ATI and Apple catches up to what I am doing. Thoughts on Transitioning to the Near Future of Parallel Programming
This post is inspired by Mike Action's great SPU shader talk. My personal programming interests are almost exclusively geared towards parallel programming. With the PS3, unified shader model of DX10/SM4.0 and CUDA, parallel programming is just starting to get really interesting, and useful. With Larrabee on its way and some unknown answer from NVidia and AMD, the future looks very bright indeed. The age of hardware which auto parallelizes a serial instruction stream, and relieves the programmer from thinking about parallel memory access patterns is quickly coming to an end. The new parallel age has some new important factors: Vectorization. Vastly under utilized in the general case because we humans mostly think in single serial tasks. The concept here is to do one serial task to N independent things at the same time (a vector length of N). Where N is say 4 (SSE/Altivec) or perhaps 16 (Larrabee) or 32 (CUDA), the GPGPU age is bringing us larger vectors. The hardware fast path has no intra-vector communication between the N things (SOA style): fast aligned vector memory accesses (parallel). In some cases limited intra-vector communication can be found (ie splat, interleaving, swizzle, fully generalized permutation, etc) with performance dependent on the pattern of communication. Ultimately for portability to different N's, only very limited intra-vector communication should ever be used, and only between small (like 2 or 4) power of 2 sized and aligned sub groups within the vector. Practical example is the paired computation of gradients in GPU pixel/fragment shaders. Fully generalized communication, scatter and gather, is usually very expensive, with a few notable exceptions. First 1-to-N gather (splat, or all vectors reading the same value) can be somewhat fast. For example reading constants on GPU. Also useful for working in a very wide tree (a large fixed power of 2 number of children per node) to have all children do a computation with respect to the parent. Second, GPU texture units provide N-to-N gather. The important factor here is that gather runs in parallel with ALU, but is not fully pipelined and has a latency even on a cache hit (hence the factor of ALU to TEX ratio). Also GPU vertex input gathers (in the form of indexed vertexes). CPU side gather can be very expensive, not parallelized and thus taking ALU cycles and often requiring many round trips to memory. Take Altivec (2+3*N memory operations): store the vector with N indexes to memory, do something else (to hide write-read stall), read N indexes, fetch from memory pointed to by N indexes, store N fetched values into a new vector in memory, do something else (to hide write-read stall), fetch vector. Write-through/write-combining caches can increase the write-read stall. Parallel Cores. Many aspects of importance here, number of cores, number of hardware threads feeding a single ALU unit in the core, and issues with latency hiding. Obvious for an algorithm to be portable, it has to deal with scaling to a variable number of cores/threads. There are all sorts of different latencies in a core, and the solution to dealing with latency is to schedule independent instructions to hide latency. One obvious choice to hide instruction latency is simply to extend the vector length in software (for example N=32 in CUDA when hardware N=8). Also in architectures with large cache lines, this can help with better cache utilization. Having multiple hardware threads feeding a single vector ALU unit helps hide instruction latency, but can place more pressure on any shared caches. However when all the threads run the same program, instruction cache utilization can be much better, and often better data locality can be found as well. Then there is the issue of how the instructions of the threads are multiplexed into the single ALU unit. Does the hardware instruction multiplex prior to verifying that the instructions won't stall (ie CUDA), or can the core re-order instructions (SSE/PC), or will a later stall either stall all threads or result in the flush all the instructions from the offending thread from the multiplexed instruction stream leaving bubbles in the pipeline. Memory Access and Patterns. Most algorithms are memory bound, and thus data design, locality, and memory access patters become key in performance. Two cases, first you manage your own cache or local store (PS3, CUDA), or second you trick an actual cache into doing what you want (prefetching, etc). There are a few important factors here: latency to start a memory transfer (DMA, or L2 prefetch), number of simultaneous queued transactions, size of the transfer (in the case of DMA) and throughput. Designing an algorithm for the software managed cache or local store case, ports to the case of a hardware cache, and also insures great cache locality. Synchronization. Sync points require either blocking/spin lock, cooperative task switch (do something else), or heavy weight task switch. In all cases sync points are expensive and should be minimized. Atomic operations on all platforms provide a relatively lightweight way to solve a majority of sync issues. Next TimeThat was just a somewhat random introduction, next time it is time to start on specifics including useful parallel coding patterns and algorithms... | Atom©2008/2007 Timothy Farrar Latest Blog Entries080826 . olick paper Index000000 . index Graphics080709 . antialiasing Interaction071204 . GPU only 2 Networking070708 . breaking firewalls Sound070709 . 3D audio / KEMAR Language070921 . assembler in atom4th Elsewhereandrew selle All Blog Entries080826 . olick paper |