090218 / ARM VFP

Got my iPod Touch, but apparently I have a bunch of stuff to do on the Apple side before I can run any code on it.

ARM11 Vector Floating Point (VFP)

Finally got around to reading the ARM11 manuals. A lot has changed since the early ARM RISC OS days. The VFP is the ARM11's floating point coprocessor. Upon reading the words "Vector Floating Point" I had hopes for some kind of SIMD, but on the ARM11 the VFP can retire only one single precision ALU operation per cycle (scalar throughput, with a high result latency). The "Vector" term refers to the VFP unit's ability to execute the same operation (ALU or even load/store) on 1 to 8 scalar values in series with a single instruction. Effectively it is a form of instruction compression. Unfortunately, vector length and stride (between registers) are set by modifying the FPSCR control register, which is slow. However, ARM11 does have DSP extensions which provide SIMD within an integer register for 8-bit and 16-bit integer values.
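To make the "instruction compression" idea concrete, here is a rough C model of what a single short-vector add does. This is only a sketch under my own assumptions, not something lifted from the manuals: the `VfpModel` struct and the `next_reg` and `vfp_fadds` names are made up for illustration, the register file is assumed to be 32 single precision registers in four banks of 8 with bank 0 acting as the scalar bank, and `len` and `stride` stand in for the FPSCR fields. It ignores timing, exceptions, and the corner cases of operand overlap.

```c
#include <assert.h>

/* Toy model of a VFP short-vector single precision add.
   Assumed layout: s0..s31 in four banks of 8; bank 0 (s0..s7) is scalar. */
enum { NUM_REGS = 32, BANK_SIZE = 8 };

typedef struct {
    float s[NUM_REGS]; /* s0..s31 */
    int len;           /* vector length, 1..8 (stand-in for FPSCR LEN) */
    int stride;        /* register stride, 1 or 2 (stand-in for FPSCR STRIDE) */
} VfpModel;

/* Step a register index by `stride`, wrapping around inside its own bank. */
int next_reg(int reg, int stride) {
    int bank_base = reg - reg % BANK_SIZE;
    return bank_base + (reg % BANK_SIZE + stride) % BANK_SIZE;
}

/* Model of one "add sd, sn, sm" instruction.
   Returns how many scalar ALU operations that one instruction expands into. */
int vfp_fadds(VfpModel *vfp, int sd, int sn, int sm) {
    assert(vfp->len >= 1 && vfp->len <= 8);
    assert(vfp->stride == 1 || vfp->stride == 2);

    int dest_is_scalar = sd < BANK_SIZE; /* destination in bank 0: plain scalar op */
    int sm_is_scalar = sm < BANK_SIZE;   /* second operand in bank 0: reused for every element */
    int count = dest_is_scalar ? 1 : vfp->len;

    /* The elements are processed one after another, not in parallel. */
    for (int i = 0; i < count; i++) {
        vfp->s[sd] = vfp->s[sn] + vfp->s[sm];
        sd = next_reg(sd, vfp->stride);
        sn = next_reg(sn, vfp->stride);
        if (!sm_is_scalar) {
            sm = next_reg(sm, vfp->stride);
        }
    }
    return count;
}
```

With `len = 4` and `stride = 1`, a call like `vfp_fadds(&vfp, 8, 16, 24)` writes s8..s11 from s16..s19 and s24..s27: one instruction fetched, four ALU operations performed in series.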

Best case performance will have both the ARM CPU and the VFP coprocessor running in parallel, with the VFP overlapping load/store and ALU work. The "vector" instructions enable an instruction issue rate which is fast enough to make this happen, while also reducing instruction fetch bandwidth. From what I can tell, only one instruction (CPU or VFP) can be issued per cycle, so "vector" instructions are required to keep the pipelines busy. Unfortunately, with this kind of ISA, assembly is probably necessary to extract great performance.

With VFP I really see only one good choice: 8 scalar registers and six 4-wide "vector" registers. The 8 scalar registers are fixed by the ISA. Using vectors wider than 4 requires scoreboard locking even in RunFast mode. With vectors narrower than 4, I think instruction issue will bottleneck. With the limited number of effective vector registers, L1 should end up serving as an "extended register file", with load/store in parallel with ALU. Load/store works at 64 bits per clock, so in theory this should be enough to keep the ALU going even when mostly working out of L1 and not getting a huge amount of register reuse... too bad all this vector assembly magic is at best going to get 1 ALU op per clock cycle.
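The arithmetic behind that register split and the load/store claim can be sketched in a few lines of C. Again this is a back-of-the-envelope model, not anything from the manuals: the function names are hypothetical, the vector registers are assumed to start at s8 right after the 8 scalars, and the only inputs are the figures quoted above (1 ALU op per cycle, 64 bits of load/store per cycle). Latency and cache misses are ignored.

```c
#include <assert.h>

enum {
    FIRST_VECTOR_REG = 8,          /* s0..s7 are the scalar registers */
    VECTOR_WIDTH = 4,              /* chosen vector length */
    NUM_VECTORS = 6,               /* (32 - 8) / 4 */
    LOADSTORE_BITS_PER_CYCLE = 64, /* figure quoted in the text */
    FLOAT_BITS = 32
};

/* First register of 4-wide vector v (0..5): s8, s12, s16, s20, s24, s28. */
int vector_base_reg(int v) {
    assert(v >= 0 && v < NUM_VECTORS);
    return FIRST_VECTOR_REG + v * VECTOR_WIDTH;
}

/* ALU cycles for one vector instruction at 1 scalar op per cycle. */
int alu_cycles(int width) {
    return width;
}

/* Cycles to load or store `width` floats at 64 bits per cycle (rounded up). */
int transfer_cycles(int width) {
    int bits = width * FLOAT_BITS;
    return (bits + LOADSTORE_BITS_PER_CYCLE - 1) / LOADSTORE_BITS_PER_CYCLE;
}
```

In this model a 4-wide ALU instruction occupies the ALU for `alu_cycles(4)` = 4 cycles while a 4-wide load or store needs `transfer_cycles(4)` = 2, so about two vector transfers fit under each vector ALU op. That is the sense in which L1 can act as an extended register file, at least on paper.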