080426 / Parallel Programming IIMemory and CachesFirst lets only address memory bound algorithms, computationally bound algorithms are easy and boring. What sequential access provides over random access is typically better cache line utilization (and hardware prefetch on some platforms). If truely random access is needed, working with cache aligned and sized vectors removes this performance advantage. In this case of random access performance will be limited by the size of the working space. Working space is bound to exceed L1, as L1 is tiny. I like to think about L1 as simply slower registers, or that L1 speeds up the very short term reuse of memory (for example a GPU's texture cache). The next stage, L2 provides perhaps 10 times less peak bandwidth. L2 is also usually relatively small (say 2MB), and might be shared between cores or hyperthreads. In the past, L2 catches longer term reuse. In the future, looks as if the large SIMD registers in Larrabee will go directly to L2, as the mentioned 32KB L1 would only be enough to spill 512 registers. So in that case L2 can be thought of a simply slower SIMD registers. Then perhaps there is an L3 which could be another 10 times less peak bandwidth from L2. The final stage is access of main memory which might only provide 1/400 the effective bandwidth of L1. Now add in TLB stalls and cache synchronization between multiple cores, and large working space random access might be 1/1500 of the effective bandwidth of L1. Note large working sets are not all that uncommon. GPU's are a great example, easily main memory bandwidth limited, with small L1 texture caches (backed by L2) which capture a majority of the texel reuse. Texture reads are effectively either a majority sequential or random, with minor very short term reuse in filtering (hence the need for only small texture caches). Performance for large working set problems reduces to the problem of factoring out common memory accesses or limiting memory access. Lets put large working set memory reuse into perspective with a made up 2 GHz processor which has a 128 byte cache line, 2 clock cycle L1 throughput/cacheline, a 20 clock L2 throughput, a 200 clock main memory throughput, and 4 way float SIMD throughput of 1 clock per SIMD MADD operation. One obvious limiting factor here is that in no way can one stream out to memory the large solution to any problem faster than poor 128bytes*2Ghz/200=1.28 GB/second on this dummy processor. Now if the solution is compressed a larger solution can be output per unit time, which hints at why framebuffer compression on GPUs is so important, and why run-time data compression/decompression could become very important in the future. Anyway continuing our large working set problem who's source working set and output working set doesn't fit in the cache, say 50% of memory bandwidth is used to store the result and 50% of memory bandwidth is used to transfer the working set used to compute the solution. For the 0.64 GB/sec read (0.16 Gfloats/sec), in order to utilize the 2Ghz*4SIMD = 8 Gflops/sec ALU capacity of the processor, 8/0.16=50 operations have to be associated on average with each float read. Lets take an easy example of something which could make use of the ALU capacity: input and output stream of 0.16 Gfloats/sec, and a 48 tap FIR filter. So in this case each output depends on 48 inputs. Sure this is thinking about the problem backwards, but it does point at a general method to solving problems optimally for performance given the constrains of current hardware... On CPUs Factoring Is Not Just for Computation AnymoreIt is almost unthinkable to optimize an algorithm without factoring out redundant computations. Now apply this same concept to your data flow, and you have the solution to your problem. Factoring data flow and grouping by locality is the way to achieve peak CPU computational utilization on interesting problems with large memory space solutions. So go to a giant white board, list your inputs on the left and your outputs on the right. Then draw a tree or graph in which you factor out and share as many results of intermediate calculations as possible, but keep intermediate results in groups such that they don't exceed either your target cache size or program managed local store size. Moving on to GPUs and FactoringLets open up a can of insanity and dive in with some numbers. According to specs the GeForce 8800 Ultra (G80) has a capacity of 576 GFlops/s and 103.7 GB/s. However it is 1512 MHz * 16 processors * 8 wide SIMD, so the maximum number of instructions is more like 193 Ginst/sec. With output limited to 14.7 Gpix/sec (ROP) and input limited to 39.2 Gtex/sec (TEX). Here is the G80 from a CPU perspective, 16 cores The hyperthreading is used to hide both memory latency, texture fetch latency, and ALU latency. The 6 way hyperthread minimum is the minimum number of hyperthreads required to hide ALU write-after-read latency. With 6 way hyperthreading, 17-25 instructions are needed to hide the latency of an un-cached load from memory. With 24 way, only 5-7 instructions are need to hide the latency. All numbers computed from the CUDA docs. Texture fetch latency on a cache hit is not published in the CUDA docs, but peak throughput in terms of cached texture reads looks to be about 0.81 floats/ALUcycle (fastest format). This was calculated by 612 MHz * 64 TEX units * 1 RGBAtexel/cycle * 4 components/texel divided by 1512 MHz * 16 cores * 8 wide SIMD. GPUs do not provide opportunity to cache intermediate results on write, other than directly during a computation in registers. All output is uncached and takes away from bandwidth. In the case of the G80, without taking input into consideration, when approaching peak output bandwidth (writing to one render target), looks like somewhere between 4 and 13 instructions (depending on output format) are available per result from a shader (1 to 4 components). Peak output is limited by the ROP, and only uses around 50% of the cards peak bandwidth (at fastest output format). The importance of data factoring on GPUs is in keeping a high texture cache hit rate, through good data locality. In the case of the G80, it looks as if memory granularity is 32 bytes from the CUDA docs. Which hits that the texture cache has 32 byte cache lines. Which would give the following texel granularity per cache line, 8x4 texel rect for 8bit/texel compressed formats What should be obvious here is that if an GPU shader needs random access (always cache miss) in a large working space, best case is FP32 RGBA texture fetch which is going to waste about 50% of the available bandwidth (INT8 RGBA random fetch wastes 87.5% of bandwidth). Now say you wanted to use at most 64 GB/s (of the cards 103.7 GB/s bandwidth) for texture fetch and also try and meet the card's peak bilinear texture filter rate of 39 Gtex/s. The 64 GB/s provides for 2 Gcachelines/s fetched, which means you need 18.5 cache hits per line after a miss, or a 95% hit rate. Also note that this could only possibly be done reading compressed textures. Exactly Why GPUs Are So AwesomeAs working space increase well beyond cache size on a CPU, the CPU stalls and idles because of its lack of ability to hide memory fetch latency, while the GPU keeps on doing work because it can hide uncached memory fetch latency. Out of time, more next time... | Atom©2008/2007 Timothy Farrar Latest Blog Entries080826 . olick paper Index000000 . index Graphics080709 . antialiasing Interaction071204 . GPU only 2 Networking070708 . breaking firewalls Sound070709 . 3D audio / KEMAR Language070921 . assembler in atom4th Elsewhereandrew selle All Blog Entries080826 . olick paper |