Transcript Slide 1

General Purpose GPU
(GPGPU)
Aaron Smith
University of Texas at Austin
Spring 2003
Motivation
• Graphics processors are becoming more
programmable
– DirectX/OpenGL - Vertex and Pixel Shaders
• Explore the current state of the art
– How would a typical application run on a
GPU?
– What are the difficulties? Requirements?
MPEG Overview
• Format for storing compressed audio and video
• Uses prediction between frames to achieve compression (exploits temporal
locality)
– “I” or intra-frames
• simply a frame encoded as a still image (no history)
– “P” or predicted frames
• predicted from most recently reconstructed I or P frame
• can also be treated like I frames when no good match
– “B” or bi-directional frames
• predicted from closest two I or P frames, one in the past and one in the future
• no good match then intra code like I frame
• Typical sequence looks like:
– IBBPBBPBBPBBIBBPBBPB...
• Remember what a B frame is???
– decode the I frame, then the first P frame, then the first and second B frames
– 0xx312645
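The reordering above can be sketched in C. This is a minimal illustration, not code from the decoder: the function name decode_order and its interface are hypothetical, and it assumes each run of B frames is followed by the I or P anchor it depends on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: turn a display-order frame-type string such as
 * "IBBPBBP" into decode order. out[k] receives the display index of the
 * k-th frame to decode. Returns the number of entries written. */
static size_t decode_order(const char *types, size_t n, size_t *out)
{
    size_t k = 0;        /* next free slot in out[] */
    size_t first_b = 0;  /* display index of the first pending B frame */
    size_t i;

    assert(types != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        if (types[i] == 'B')
            continue;            /* B frames wait for the next anchor */
        out[k++] = i;            /* the I or P anchor is decoded first */
        while (first_b < i)      /* then the B frames displayed before it */
            out[k++] = first_b++;
        first_b = i + 1;
    }
    while (first_b < n)          /* trailing B frames with no later anchor */
        out[k++] = first_b++;
    return k;
}
```

For the display sequence IBBPBBP this yields the decode order 0, 3, 1, 2, 6, 4, 5.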
GPU Programming Model
• Streams Programming
• Pixel Shaders
– store data in texture memory
– use multiple passes to render and re-render to texture
memory
• Vertex Shaders???
– more powerful than pixel shaders from an instruction
standpoint
– but...not very useful because of restriction on
accessing texture memory
• What are the limitations?
– branching ?
MPEG and the GPU
• decoding is sequential
• data structures are regular
– typical video stream is 352x240
• basic result is pixel color data
NVIDIA Cg
• High Level Shading Language
• Vertex and Pixel Shaders
• OpenGL and DirectX Support
• Can be compiled at runtime!
Cg Profiles
1) Which profile do we choose? Will the model fit?
2) What about portability?
Can we move between architectures?

Profile  Platform     Shader  Instr Limit
vs_2_x   DirectX 9    Vertex  256
vs_2_0   DirectX 9    Vertex  256
ps_2_x   DirectX 9    Pixel   96 - 1024 Total
ps_2_0   DirectX 9    Pixel   32 Texture and 64 Arith
arbvp1   OpenGL ARB   Vertex  see vp20
arbfp1   OpenGL ARB   Pixel   >= 24 Texture and >= 48 Arith
vp30     OpenGL NV30  Vertex  ?
fp30     OpenGL NV30  Pixel   ?
vs_1_1   DirectX 8    Vertex  128
ps_1_1   DirectX 8    Pixel   4 Texture Addr and 8 Arith
ps_1_2   DirectX 8    Pixel   4 Texture Addr and 8 Arith
ps_1_3   DirectX 8    Pixel   4 Texture Addr and 8 Arith
vp20     OpenGL NV2X  Vertex  128
fp20     OpenGL NV2X  Pixel   4 Texture Shader and 8 Reg Comb

Profile  Registers         Loops        Array Indexing
vs_2_x   256 R, 12-32 R/W  Yes          Uniform Constant
vs_2_0   256 R, 12-32 R/W  Unroll Only  Uniform Constant
ps_2_x   32 R, 12-32 R/W   Unroll Only  No Variable Indexing
ps_2_0   32 R, 12-32 R/W   Unroll Only  No Variable Indexing

Remaining profiles (arbvp1 through fp20), values in slide order:
Registers: >= 24 Locals and >= 16 Temp; 256 R, 16 R/W; 64 R/W; 96 R and 12 R/W; 2 Input Color, 6 Const, 6 Temp (x3); ?
Loops: Unroll Only; Yes; Unroll Only (x7)
Array Indexing: No Variable Indexing (x2); Uniform Constant; Compile Time Constant (x4)
DirectX 9 – PS_2_0
Instruction Set
[Category columns on the slide: Setup, Arithmetic, Macro-ops, Texture; per-instruction marks not recoverable]

Name             Instruction slots  Description
abs              1                  Absolute value
add              1                  Add two vectors
cmp              1                  Compare source to 0
crs              2                  Cross product
dcl              0                  Map a vertex element type to an input vertex register
dcl_textureType  0                  Declare the texture coordinate dimension for a sampler register
def              0                  Define constants
dp2add           2                  2-D dot product and add
dp3              1                  3-D dot product
dp4              1                  4-D dot product
exp              1                  Full precision 2^x
frc              1                  Fractional component
log              1                  Full precision log2(x)
lrp              2                  Linear interpolate
m3x2             2                  3x2 multiply
m3x3             3                  3x3 multiply
m3x4             4                  3x4 multiply
m4x3             3                  4x3 multiply
m4x4             4                  4x4 multiply
mad              1                  Multiply and add

PS_2_0 Cont.

Name             Instruction slots  Description
mov              1                  Move
mul              1                  Multiply
nop              1                  No operation
nrm              3                  Normalize
pow              3
ps               0                  Version
rcp              1                  Reciprocal
rsq              1                  Reciprocal square root
sincos           8                  Sine and cosine
sub              1                  Subtract
texkill          1(tex)             Kill pixel render
texld            1 + 3CUBE          Sample a texture
texldb           1(tex)             Texture sampling with level of detail (LOD) bias from w-component
texldp           1(tex)             Texture sampling with projective divide by w-component
MPEG -> Cg Challenges
• Data Types
– float/int basic types on GPU
– unsigned char is the dominant type in MPEG
• Loops
– Most profiles do not support loops unless they can be completely unrolled
– e.g. loop.cg(49) : warning C7012: not unrolling loop that executes 352 times since maximum loop unroll count is 256
• No recursion
– Normally not a problem; we can change to iterative
– But on the GPU we have a problem with “Loops”
• Arrays
– Severe restrictions on index variables
– Some profiles assign each array element to a register
• i.e. float array[10] uses ten registers
• Pointers
– Not supported
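The data type mismatch can be illustrated on the CPU. The sketch below assumes that 8-bit samples stored in a fixed-point texture reach the shader as floats in [0, 1]; the helper names byte_to_unit and unit_to_byte are hypothetical and are not part of Cg or the decoder.

```c
#include <assert.h>

/* Hypothetical helper: the float value a shader would see for an 8-bit
 * sample, assuming the texture normalizes 0..255 to 0.0..1.0. */
static float byte_to_unit(unsigned char s)
{
    return (float)s / 255.0f;
}

/* Hypothetical helper: convert a shader result back to an 8-bit sample,
 * rounding to nearest and clamping out-of-range values. */
static unsigned char unit_to_byte(float f)
{
    assert(f == f);              /* NaN is not a valid sample */
    if (f <= 0.0f) return 0;
    if (f >= 1.0f) return 255;
    return (unsigned char)(f * 255.0f + 0.5f);
}
```

Under this assumption every 8-bit value survives the round trip, so constants such as the 128 chroma offset become roughly 0.5 in shader code.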
Implementation
• Only support 352x240 resolution
• Allocate fixed data structures to hold frame
– Y: 352x240 = 84480, U and V: 176x120 = 21120 each (yuv 4:2:0)
• Hold data in texture memory
• Use Cg pixel shaders
– vertex shaders cannot access texture memory
• Work backwards
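The fixed frame storage can be sized with a short calculation. This sketch assumes 4:2:0 sampling (each chroma plane is half the width and half the height of the luma plane) and one byte per sample; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the three plane sizes, in samples. */
typedef struct { size_t y, u, v; } plane_sizes;

/* Plane sizes for a 4:2:0 frame: full-resolution luma, quarter-size chroma. */
static plane_sizes yuv420_plane_sizes(size_t width, size_t height)
{
    plane_sizes p;
    assert(width % 2 == 0 && height % 2 == 0);  /* even dimensions assumed */
    p.y = width * height;
    p.u = (width / 2) * (height / 2);
    p.v = p.u;
    return p;
}
```

For 352x240 this gives 84480 luma samples and 21120 samples in each chroma plane.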
An Example: C -> Cg
• Convert MPEG decoder store() routine
into a Cg shader
– Simplify…simplify…simplify
– Factor
store_ppm_tga() - Original
static void store_ppm_tga(outname,src,offset,incr,height,tgaflag)
char *outname;
unsigned char *src[];
int offset, incr, height;
int tgaflag;
{
  int i, j;
  int y, u, v, r, g, b;
  int crv, cbu, cgu, cgv;
  unsigned char *py, *pu, *pv;
  static unsigned char tga24[14] = {0,0,2,0,0,0,0, 0,0,0,0,0,24,32};
  char header[FILENAME_LENGTH];
  static unsigned char *u422, *v422, *u444, *v444;

  if (chroma_format==CHROMA444) {
    u444 = src[1];
    v444 = src[2];
  } else {
    if (!u444) {
      /* allocate the chroma conversion buffers on first use */
      if (chroma_format==CHROMA420) {
        if (!(u422 = (unsigned char *)malloc((Coded_Picture_Width>>1)
                                             *Coded_Picture_Height)))
          Error("malloc failed");
        if (!(v422 = (unsigned char *)malloc((Coded_Picture_Width>>1)
                                             *Coded_Picture_Height)))
          Error("malloc failed");
      }
      if (!(u444 = (unsigned char *)malloc(Coded_Picture_Width
                                           *Coded_Picture_Height)))
        Error("malloc failed");
      if (!(v444 = (unsigned char *)malloc(Coded_Picture_Width
                                           *Coded_Picture_Height)))
        Error("malloc failed");
    }
    if (chroma_format==CHROMA420) {
      conv420to422(src[1],u422);
      conv420to422(src[2],v422);
      conv422to444(u422,u444);
      conv422to444(v422,v444);
    } else {
      conv422to444(src[1],u444);
      conv422to444(src[2],v444);
    }
  }

  strcat(outname,tgaflag ? ".tga" : ".ppm");
  if ((outfile =
       open(outname,O_CREAT|O_TRUNC|O_WRONLY|O_BINARY,0666))==-1)
  {
    sprintf(Error_Text,"Couldn't create %s\n",outname);
    Error(Error_Text);
  }
  optr = obfr;

  if (tgaflag) {
    /* TGA header */
    for (i=0; i<12; i++)
      putbyte(tga24[i]);
    putword(horizontal_size); putword(height);
    putbyte(tga24[12]); putbyte(tga24[13]);
  }

  /* matrix coefficients */
  crv = Inverse_Table_6_9[matrix_coefficients][0];
  cbu = Inverse_Table_6_9[matrix_coefficients][1];
  cgu = Inverse_Table_6_9[matrix_coefficients][2];
  cgv = Inverse_Table_6_9[matrix_coefficients][3];

  for (i=0; i<height; i++) {
    py = src[0] + offset + incr*i;
    pu = u444 + offset + incr*i;
    pv = v444 + offset + incr*i;
    for (j=0; j<horizontal_size; j++) {
      u = *pu++ - 128;
      v = *pv++ - 128;
      y = 76309 * (*py++ - 16); /* (255/219)*65536 */
      r = Clip[(y + crv*v + 32768)>>16];
      g = Clip[(y - cgu*u - cgv*v + 32768)>>16];
      b = Clip[(y + cbu*u + 32768)>>16];
      /* TGA stores BGR, PPM stores RGB */
      if (tgaflag) { putbyte(b); putbyte(g); putbyte(r); }
      else { putbyte(r); putbyte(g); putbyte(b); }
    }
  }

  if (optr!=obfr) write(outfile,obfr,optr-obfr);
  close(outfile);
}
Quick Analysis
• Pointers
– Remove
• Conditionals (if/else)
– Remove
• Dynamic Memory
– Remove
• File I/O
– Remove
• Table lookups
– Remove
• Constant array indexes
– OK!
• Constant loop invariants
– OK!
store_tga() - Simplified
#define CLIP(x) ( ((x)<0) ? 0 : (((x)>255) ? 255 : (x)) )

static void store_tga(unsigned char *src[])
{
  int i, j;
  int y, u, v, r, g, b;
  int crv, cbu, cgu, cgv;
  int incr = 352;
  int height = 240;
  int data_idx = 0; /* index into BitMap.data[] */
  static unsigned char u422[176*240];
  static unsigned char v422[176*240];
  static unsigned char u444[352*240];
  static unsigned char v444[352*240];

  /* chrominance conversion: 4:2:0 -> 4:2:2 -> 4:4:4 */
  conv420to422(src[1],u422); /* u422 = src[1] */
  conv420to422(src[2],v422); /* v422 = src[2] */
  conv422to444(u422,u444); /* u444 = u422 */
  conv422to444(v422,v444); /* v444 = v422 */

  /* matrix coefficients */
  crv = 104597;
  cbu = 132201;
  cgu = 25675;
  cgv = 53279;

  /* 352 x 240 x 3 frame */
  BitMap.channels = 3;
  BitMap.size_x = 352;
  BitMap.size_y = 240;

  /* convert YUV to RGB */
  for (i=0; i<height; i++)
  {
    for (j=0; j<horizontal_size; j++)
    {
      u = u444[incr*i+j] - 128;
      v = v444[incr*i+j] - 128;
      y = 76309 * (src[0][incr*i+j] - 16);
      r = CLIP((y + crv*v + 32768)>>16);
      g = CLIP((y - cgu*u - cgv*v + 32768)>>16);
      b = CLIP((y + cbu*u + 32768)>>16);
      BitMap.data[data_idx++] = r;
      BitMap.data[data_idx++] = g;
      BitMap.data[data_idx++] = b;
    }
  }
#ifdef _WIN32
  // output the frame
  DrawGLScene((tImageTGA *)&BitMap);
#endif
}
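The chrominance conversion calls expand the subsampled U and V planes to full resolution. The slides do not show conv420to422() or conv422to444(), and the decoder's own routines may interpolate rather than replicate samples. As a stand-in, this hypothetical helper does the whole 4:2:0 to 4:4:4 step by nearest-neighbour replication, which is enough to show the data movement involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two conversion calls: expand one
 * (width/2) x (height/2) chroma plane to width x height by copying each
 * source sample into a 2x2 block. No filtering is applied. */
static void upsample420_nearest(const unsigned char *src, unsigned char *dst,
                                size_t width, size_t height)
{
    size_t x, y;
    size_t cw = width / 2;   /* chroma plane width */

    assert(width % 2 == 0 && height % 2 == 0);
    for (y = 0; y < height; y++)
        for (x = 0; x < width; x++)
            dst[y * width + x] = src[(y / 2) * cw + (x / 2)];
}
```

Each output sample depends only on its own coordinates, which is the access pattern a per-pixel shader pass needs.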
Quick Analysis
• Removed
– If/else
– Pointers
– File i/o
– Table lookups
• What’s Left?
– Function calls (for chrominance conversion)
• conv420to422() and conv422to444()
– YUV to RGB loop
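The body of the remaining YUV to RGB loop can be isolated as a function of a single pixel, which is the shape a pixel shader needs. This sketch reuses the fixed-point constants from the simplified listing; the names rgb8, fix16_to_byte and yuv_to_rgb are hypothetical, and negative values are clamped before shifting so the result does not depend on how the compiler shifts negative integers.

```c
/* Hypothetical per-pixel result type. */
typedef struct { unsigned char r, g, b; } rgb8;

/* Drop the 16 fractional bits of a fixed-point value and clamp to 0..255. */
static unsigned char fix16_to_byte(int t)
{
    if (t < 0) return 0;         /* clamp first: no shift of a negative value */
    t >>= 16;
    return (unsigned char)(t > 255 ? 255 : t);
}

/* Convert one YUV sample triple to RGB using the constants from the
 * simplified store_tga() listing (16.16 fixed point, 32768 = 0.5 rounding). */
static rgb8 yuv_to_rgb(unsigned char ys, unsigned char us, unsigned char vs)
{
    const int crv = 104597, cbu = 132201, cgu = 25675, cgv = 53279;
    int u = us - 128;
    int v = vs - 128;
    int y = 76309 * (ys - 16);
    rgb8 out;

    out.r = fix16_to_byte(y + crv * v + 32768);
    out.g = fix16_to_byte(y - cgu * u - cgv * v + 32768);
    out.b = fix16_to_byte(y + cbu * u + 32768);
    return out;
}
```

With these constants, (16, 128, 128) maps to black and (235, 128, 128) maps to white.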
YUV -> RGB (cg)
float3 main(
    in float3 texcoords0 : TEXCOORD0,      /* texture coord */
    uniform sampler2D yImage : TEXUNIT0,   /* handle to texture with Y data */
    in float3 texcoords1 : TEXCOORD1,      /* texture coord */
    uniform sampler2D uImage : TEXUNIT1,   /* handle to texture with U data */
    in float3 texcoords2 : TEXCOORD2,      /* texture coord */
    uniform sampler2D vImage : TEXUNIT2    /* handle to texture with V data */
) : COLOR
{
    float3 yuvcolor; // f(xyz) -> yvu
    float3 rgbcolor;
    yuvcolor.x = tex2D(yImage, texcoords0).x;
    yuvcolor.z = tex2D(uImage, texcoords1).y-0.5;
    yuvcolor.y = tex2D(vImage, texcoords2).z-0.5;
    rgbcolor.r = 2*(yuvcolor.x/2 + 1.402/2 * yuvcolor.z);
    rgbcolor.g = 2*(yuvcolor.x/2 - 0.344136 *
                    yuvcolor.y/2 - 0.714136 * yuvcolor.z/2);
    rgbcolor.b = 2*(yuvcolor.x/2 + 1.773/2 * yuvcolor.y);
    return rgbcolor;
}
dcl_2d s0
dcl_2d s1
dcl_2d s2
def c0, 0.000000, 0.000000, 0.000000, 1.000000
def c1, 2.000000, 0.500000, 0.886500, 0.000000
def c2, 1.000000, -0.344000, -0.714000, 0.000000
def c3, 0.500000, 0.000000, 0.701000, 0.000000
dcl t0.xyz
dcl t1.xyz
dcl t2.xyz
texld r0, t1, s1
texld r1, t0, s0
add r0.x, r0.y, -c1.y
mov r1.z, r0.x
texld r0, t2, s2
add r0.x, r0.z, -c1.y
mov r1.y, r0.x
dp3 r0.x, r1, c3
mul r0.x, c1.x, r0.x
dp3 r0.w, r1, c2
mov r0.y, r0.w
dp3 r0.w, r1, c1.x
mul r0.w, c1.x, r0.w
mov r0.z, r0.w
mov r1.w, c0.w
mov r1.xyz, r0
mov oC0, r1
// 17 instructions, 2 R-regs.
Quick Analysis
• YUV -> RGB
– 17 instructions and 2 registers
– 352x240 = 84480 px * 17 = ~1.4M instr/frame
[Chart: Pixel Shader Instructions by Resolution (352x240, 640x480, 800x600, 1024x768), shown per frame and at 30fps; vertical axis 0 to 500,000,000]
Just for Fun
• What if we needed 1024 instructions??
– 352x240 = 84480 px * 1024 = 86,507,520 instr/frame
[Chart: Pixel Shader Instructions per frame by Resolution (352x240, 640x480, 800x600, 1024x768); vertical axis 0 to 1,000,000,000]
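The instruction counts on the last two slides follow from one multiplication. This sketch assumes the full shader program runs exactly once per output pixel; the function names are hypothetical.

```c
/* Pixel shader instructions executed for one frame, assuming the whole
 * program runs once per output pixel. */
static unsigned long long instr_per_frame(unsigned width, unsigned height,
                                          unsigned instr_per_pixel)
{
    return (unsigned long long)width * height * instr_per_pixel;
}

/* The same workload expressed per second at a given frame rate. */
static unsigned long long instr_per_second(unsigned width, unsigned height,
                                           unsigned instr_per_pixel,
                                           unsigned fps)
{
    return instr_per_frame(width, height, instr_per_pixel) * fps;
}
```

For example, instr_per_frame(352, 240, 17) gives 1,436,160 (the ~1.4M figure), instr_per_frame(352, 240, 1024) gives 86,507,520, and instr_per_second(1024, 768, 17, 30) gives 401,080,320.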