Transcript VU-Assembly

Vector Unit Assembly


Overview

Architecture Review
VU0 Macro Mode
Instruction Set
Building a Vector Library

Review

The PlayStation 2 has two vector units that are similar but not identical.
VU0 is the CPU's alternate processing unit; VU1 is the GS's alternate processing unit.
Each unit has a direct pipeline to its respective processor.
Vector units are designed for 4D x 32-bit vectors.

Review

VU0 and VU1 each have access to 32 float registers and 16 integer registers.
Float registers are not like PC registers; they are 128 bits in size (a PC register is 32 bits).
128 bits can hold four float values at once (a 4D vector).
Integer registers are typically used as loop counters and for address calculation.
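The code later in this lecture operates on a Vec4T type. A plausible definition is sketched below; the aligned(16) attribute is an assumption added here because the quadword load/store instructions used later require 16-byte-aligned data.

// Sketch of the Vec4T used by the examples below: four packed floats,
// aligned so one value fills a single 128-bit quadword.
typedef struct
{
    float x, y, z, w;
} __attribute__((aligned(16))) Vec4T;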

Review

VU0 has two bus lines.
One bus is dedicated to the CPU; the other bus is used to communicate with all other devices.
VU0 has dedicated on-chip storage: 4KB of I$ and 4KB of D$.
[Diagram: CPU core connected to VU0 (4KB I$, 4KB D$) by a dedicated bus, with a shared bus out to system RAM]

Vector Unit Processing Speed

The graph shows timings for some vector-math-intensive function calls.
200K calls were made to each function.

[Chart: time in ms (0-70) for Add, Scale, and Cross, comparing VU0 against the EE core]
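A minimal sketch of how such a comparison might be driven, assuming the EEVectorAdd and VectorAdd routines defined later in this lecture and plain ANSI C clock() timing; the 200,000 iterations match the call count quoted above.

#include <stdio.h>
#include <time.h>

// Hypothetical benchmark driver: times 200K calls to one add routine,
// e.g. TimeAdds("EE", EEVectorAdd) versus TimeAdds("VU0", VectorAdd).
static void TimeAdds(const char *name, void (*add)(Vec4T *, Vec4T *, Vec4T *))
{
    Vec4T a = { 1.0f, 2.0f, 3.0f, 4.0f };
    Vec4T b = { 5.0f, 6.0f, 7.0f, 8.0f };
    Vec4T out;
    clock_t start = clock();
    int i;

    for (i = 0; i < 200000; i++)
        add(&a, &b, &out);

    printf("%s: %ld clock ticks\n", name, (long)(clock() - start));
}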

Macro and Micro Modes

Vector Unit Zero (VU0) has two modes.
Micro mode allows the vector processor to act as an independent CPU: a mini program is uploaded and executed in parallel with the main CPU.
Macro mode allows the CPU to directly offload heavy vector computation with low overhead.
Macro mode is, hands down, the most popular method.

Micro Mode

When uploaded, the micro program executes independently of the CPU.
This means we must time our execution so that the CPU fetches the result only after the Vector Unit has completed the program.
Micro mode can cause serious stalls and timing issues, since execution speed is nearly impossible to determine.

Macro Mode

Macro mode is a much easier way to execute fast math functionality.
Assembly can be written as inline instructions, telling the compiler to offload the math to VU0.
Notes:
Just because it is written in assembly does not mean it will be faster.
Switching CPU focus has its overheads.

Assembly Structure

There is typically a specific pattern to writing assembly routines:
Load the variable data/addresses into registers.
Apply vector computations to those registers.
Store the result back into a variable address.
The overhead of using assembly is in the load and store.
Make sure the computation stage will improve performance enough to offset the load/store overhead (a skeleton of this pattern is sketched below).
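A minimal sketch of that load/compute/store pattern in macro mode, using a vector subtract as the computation stage. The function name VectorSub is an assumption (it is not one of the routines shown in this lecture); the lqc2/sqc2 transfers and the vsub.xyzw instruction follow the same form as the add and cross-product routines that follow.

// Hypothetical VectorSub: load -> compute -> store on VU0.
// lqc2/sqc2 move one 128-bit quadword between memory and a VU0 float
// register, so v0, v1, and v2 must sit at 16-byte-aligned addresses.
void VectorSub(Vec4T *v0, Vec4T *v1, Vec4T *v2)
{
    asm __volatile__("
        lqc2      vf05, 0x0(%0)     # load:    vf05 = *v0
        lqc2      vf06, 0x0(%1)     #          vf06 = *v1
        vsub.xyzw vf07, vf05, vf06  # compute: vf07 = vf05 - vf06
        sqc2      vf07, 0x0(%2)     # store:   *v2  = vf07
        "
        :
        : "r"(v0), "r"(v1), "r"(v2));
}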

Vector Unit MIPS Instructions

Coprocessor transfer instructions: store / load
Coprocessor branch instructions
Macro (primitive) calculation instructions: add / subtract / multiply / divide / etc.
Micro subroutine execution instructions (VU Macro Instructions)

EEVectorAdd

Adding two vectors using the EE Core (CPU)

// Adds v0 and v1 component-wise on the EE core, storing the result in v2.
void EEVectorAdd(Vec4T *v0, Vec4T *v1, Vec4T *v2)
{
    v2->x = v0->x + v1->x;
    v2->y = v0->y + v1->y;
    v2->z = v0->z + v1->z;
    v2->w = v0->w + v1->w;
}

VectorAdd

Adding two vectors using the VU0

// Adds v0 and v1 on VU0 with a single vadd.xyzw, storing the result in v2.
void VectorAdd(Vec4T *v0, Vec4T *v1, Vec4T *v2)
{
    asm __volatile__("
        lqc2      vf05, 0x0(%0)     # vf05 = *v0
        lqc2      vf06, 0x0(%1)     # vf06 = *v1
        vadd.xyzw vf07, vf05, vf06  # vf07 = vf05 + vf06
        sqc2      vf07, 0x0(%2)     # *v2  = vf07
        "
        :
        : "r"(v0), "r"(v1), "r"(v2));
}
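A short usage sketch (not from the slides): the alignment of Vec4T matters here because lqc2 and sqc2 are quadword transfers and require 16-byte-aligned addresses.

// Usage sketch: all three operands must be 16-byte aligned.
Vec4T a   = { 1.0f, 2.0f, 3.0f, 4.0f };
Vec4T b   = { 5.0f, 6.0f, 7.0f, 8.0f };
Vec4T sum;

VectorAdd(&a, &b, &sum);   // sum = (6, 8, 10, 12)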

EECrossProduct

Notice that we must compute into a temporary: if cross aliases v1 or v2, writing results directly would corrupt components the cross product still needs to read.

// EE-core cross product; computes into a temporary, then copies to cross.
void EECrossProduct(Vec4T *v1, Vec4T *v2, Vec4T *cross)
{
    Vec4T temp;

    temp.x = v1->y * v2->z - v1->z * v2->y;
    temp.y = v1->z * v2->x - v1->x * v2->z;
    temp.z = v1->x * v2->y - v1->y * v2->x;

    VectorCopy(&temp, cross);
}
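A short illustration of why the temporary matters; the aliasing call below is an assumed usage, not from the slides.

// If cross aliases an input (here, the first one), writing results
// directly would clobber components that are still needed; the temp
// in EECrossProduct makes this call safe.
Vec4T v  = { 1.0f, 0.0f, 0.0f, 0.0f };
Vec4T up = { 0.0f, 1.0f, 0.0f, 0.0f };

EECrossProduct(&v, &up, &v);   // result overwrites the first input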

CrossProduct

// Cross product on VU0 using the outer-product instructions.
void CrossProduct(Vec4T *v1, Vec4T *v2, Vec4T *cross)
{
    asm __volatile__("
        lqc2        vf05, 0x0(%0)
        lqc2        vf06, 0x0(%1)
        vopmula.xyz ACC,  vf05, vf06   # first half of the outer product
        vopmsub.xyz vf06, vf06, vf05   # subtract the second half
        vsub.w      vf06, vf00, vf00   # w = 0
        sqc2        vf06, 0x0(%2)
        "
        : // No output
        : "r"(v1), "r"(v2), "r"(cross));
}

Vector Outer Product

The vopmula instruction performs the outer product's first multiply stage; its result is stored into the special-purpose ACC register.
vopmsub then completes the operation by subtracting the second multiply stage from ACC.

[Diagram: outer product dataflow from the X/Y/Z components of VF05 and VF06 into the X/Y/Z components of ACC]
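A plain-C model of what the three VU0 instructions in CrossProduct compute, written out for clarity. The function name CrossProductReference is hypothetical and not part of the lecture's library.

// Plain-C model of the cross product as executed on VU0.
// acc stands in for the ACC register; v1/v2 stand in for vf05/vf06.
void CrossProductReference(Vec4T *v1, Vec4T *v2, Vec4T *out)
{
    Vec4T acc;

    // vopmula.xyz ACC, vf05, vf06 -- first multiply stage
    acc.x = v1->y * v2->z;
    acc.y = v1->z * v2->x;
    acc.z = v1->x * v2->y;

    // vopmsub.xyz vf06, vf06, vf05 -- subtract the second stage
    out->x = acc.x - v2->y * v1->z;
    out->y = acc.y - v2->z * v1->x;
    out->z = acc.z - v2->x * v1->y;

    // vsub.w vf06, vf00, vf00 -- clear w (vf00.w holds the constant 1.0)
    out->w = 0.0f;
}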

For Next Time

Read Chapters 7.3.2 – 7.4.2

Read Chapter 9.3