Status – Week 283


Status – Week 276
Victor Moya
Hardware Pipeline
Command Processor.
 Vertex Shader.
 Rasterization.
 Pixel Shader.
 Fragment Operations and Tests.

vertex data (16x4D):
1 pos
1 weight
1 normal
2 colors
1 fog coord
8 texture coords
Vertex Output (15x4D):
1 Homogeneous pos
4 colors
1 fog coord
1 point size
8 texture coord
Fragment Input (10x4D):
2 colors
8 texture coords
Fragment Output (2x4D):
1 color
1 depth coordinate
[Figure: hardware pipeline block diagram. CPU and Memory feed the Command Processor, followed by the Vertex Shader, Rasterization, the Pixel Shader, and Fragment Operations and Tests, which update the Framebuffer (Color Buffer, ZBuffer, Stencil Buffer). Side inputs: OGLState, Vertex Program and Constants, Fragment Program and Constants, Texture Memory. Fragment Coords are passed along the fragment stages.]
Command Processor
Receives commands from the CPU (driver, OpenGL/Direct3D).
 Fetches data from memory: vertex data (DMA).
 Updates and stores OpenGL/Direct3D render state.

Vertex Shader
Transforms and lights vertex streams.
 Vertex shader program (from GPU memory?).
 Vertex shader constants (from GPU memory?).
 Inputs: vertex data 16x4D
 Outputs: vertex data 15x4D

Rasterization
Includes:
 Clipping.
 Divide by w.
 Affine transform.
 Primitive assembly.
 Culling.
 Setup.
 Fragment generation.
Receives vertices and produces fragments.
Uses OpenGL/Direct3D render state.
Input: vertex (15x4D).
Output: fragments (10x4D).
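The divide by w and affine transform steps listed above can be sketched as follows. This is a minimal illustration, not the hardware's method: the function name `clip_to_window`, the viewport tuple layout and the [0, 1] depth range are assumptions, and the vertex is assumed to be already clipped so that w > 0.

```python
def clip_to_window(clip_pos, viewport):
    """Divide by w, then apply the viewport (affine) transform.

    clip_pos: homogeneous position (x, y, z, w) from the vertex shader.
    viewport: (x, y, width, height) in pixels (assumed layout).
    """
    x, y, z, w = clip_pos
    vx, vy, width, height = viewport
    # Divide by w: clip space -> normalized device coordinates in [-1, 1].
    ndc_x, ndc_y, ndc_z = x / w, y / w, z / w
    # Affine transform: NDC -> window coordinates, depth mapped to [0, 1].
    win_x = vx + (ndc_x + 1.0) * 0.5 * width
    win_y = vy + (ndc_y + 1.0) * 0.5 * height
    win_z = (ndc_z + 1.0) * 0.5
    return win_x, win_y, win_z


print(clip_to_window((1.0, -1.0, 0.0, 2.0), (0, 0, 640, 480)))  # (480.0, 120.0, 0.5)
```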
Pixel Shader
Shades fragments: calculates texture addresses, reads textures, performs color operations.
 Pixel shader program and constants (from GPU memory?).
 Texture read: TMU (texture sample, filter unit, texture cache, GPU memory).
 Optional:
  Modify depth coordinate (1 Z output).
  Render to texture (up to 4 color outputs).
 Input: fragment (12x4D).
 Output: color (2x4D).

Fragment Operations and Tests
Includes (OpenGL):
 Fog.
 Color Sum.
 Ownership Test.
 Scissor Test.
 Alpha Test.
 Stencil Test.
 Depth Test.
 Blend.
 Logic Operation.
Accesses framebuffer (GPU memory). Updates framebuffer.
Framebuffer: color, Z and stencil.
OpenGL/Direct3D render state defines operations.
Input: color.
Output: FB updated.
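A few of the tests listed above can be sketched in the order the slide gives them. This is an illustration under stated assumptions: it covers only the scissor test, alpha test, depth test and a fixed source-alpha blend; the dictionaries used for the fragment, framebuffer and render state are hypothetical; fog, color sum, ownership, stencil and logic operation are left out.

```python
import operator

# Comparison functions selectable by render state (names mirror OpenGL's).
COMPARE = {
    "NEVER": lambda a, b: False,
    "LESS": operator.lt,
    "LEQUAL": operator.le,
    "EQUAL": operator.eq,
    "GREATER": operator.gt,
    "GEQUAL": operator.ge,
    "NOTEQUAL": operator.ne,
    "ALWAYS": lambda a, b: True,
}


def process_fragment(frag, fb, state):
    """Run one fragment through scissor, alpha and depth tests, then blend.

    Returns the name of the test that discarded the fragment, or "written".
    """
    x, y = frag["x"], frag["y"]
    sx, sy, sw, sh = state["scissor"]
    if not (sx <= x < sx + sw and sy <= y < sy + sh):
        return "scissor"
    if not COMPARE[state["alpha_func"]](frag["color"][3], state["alpha_ref"]):
        return "alpha"
    if not COMPARE[state["depth_func"]](frag["z"], fb["depth"][y][x]):
        return "depth"
    # Blend: source weighted by its alpha, destination by (1 - alpha).
    src, dst = frag["color"], fb["color"][y][x]
    alpha = src[3]
    fb["color"][y][x] = tuple(s * alpha + d * (1.0 - alpha) for s, d in zip(src, dst))
    fb["depth"][y][x] = frag["z"]
    return "written"


fb = {"color": [[(0.0, 0.0, 1.0, 1.0)] * 2 for _ in range(2)],
      "depth": [[1.0] * 2 for _ in range(2)]}
state = {"scissor": (0, 0, 2, 2), "alpha_func": "GREATER",
         "alpha_ref": 0.0, "depth_func": "LESS"}
near = {"x": 1, "y": 0, "z": 0.25, "color": (1.0, 0.0, 0.0, 0.5)}
far = {"x": 1, "y": 0, "z": 0.75, "color": (0.0, 1.0, 0.0, 1.0)}

print(process_fragment(near, fb, state))  # written
print(process_fragment(far, fb, state))   # depth
print(fb["color"][0][1])                  # (0.5, 0.0, 0.5, 0.75)
```

The second fragment lies behind the first, so the depth test discards it and the framebuffer keeps the blended color.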
[Figure: detailed hardware pipeline. Stages in order, with the memory objects and OpenGL state attached to each:]
COMMANDS (AGP)
COMMAND PROCESSOR
VERTEX BUFFER: Vertex Array
VERTEX SETUP
VERTEX SHADER: Vertex Program, Vertex Constants
VERTEX CACHE
PRIMITIVE ASSEMBLY: Primitive List, TRIANGLES
TRIANGLE SETUP
FRAGMENT GENERATOR: FRAGMENT (color, position, Z, textures)
EARLY Z TEST: Z Buffer
PIXEL SHADER SETUP
PIXEL SHADER: Pixel Shader Program, Pixel Shader Constants, Textures
FOG & COLOR SUM: GL_FOG, glFog*(), GL_COLOR_SUM
OWNERSHIP & SCISSOR TESTS: GL_SCISSOR_TEST, glScissor()
ALPHA TEST: GL_ALPHA_TEST, glAlphaFunc()
STENCIL TEST: GL_STENCIL_TEST, glStencilFunc(), glStencilOp(), Stencil Buffer
DEPTH TEST: GL_DEPTH_TEST, glDepthFunc(), Z Buffer
BLEND: GL_BLEND, glBlendEquation(), glBlendFuncSeparate(), glBlendFunc(), glBlendColor(), Color Buffer
LOGIC OP: GL_COLOR_LOGIC_OP, glLogicOp()
PIXEL: FRAGMENT (color, position, Z)
Vertex Shader
The command processor sends a vertex stream to the vertex shaders.
 A vertex buffer stores data read through DMA.
 A vertex cache (~10 vertices) can be used to avoid executing the vertex shader twice for the same vertex.
 The vertex stream is grouped into primitives and sent to the rasterizer.
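The vertex cache idea above can be sketched as a small cache of shaded vertices keyed by vertex index. This is only an illustration: the class name `VertexCache`, the FIFO replacement policy and the stand-in `shade` callback are assumptions; the slide only gives the approximate size of about 10 vertices.

```python
from collections import OrderedDict


class VertexCache:
    """Cache of shaded vertices keyed by vertex index (hypothetical model)."""

    def __init__(self, shade, size=10):
        self.shade = shade            # stand-in for running the vertex shader
        self.size = size              # ~10 entries, as on the slide
        self.entries = OrderedDict()  # index -> shaded vertex, oldest first
        self.shader_runs = 0

    def fetch(self, index):
        if index in self.entries:
            return self.entries[index]        # hit: reuse the shaded vertex
        self.shader_runs += 1                 # miss: run the vertex shader
        result = self.shade(index)
        if len(self.entries) >= self.size:
            self.entries.popitem(last=False)  # evict the oldest entry (FIFO)
        self.entries[index] = result
        return result


cache = VertexCache(shade=lambda i: ("shaded", i))
indices = [0, 1, 2, 2, 1, 3]  # two triangles sharing the edge (1, 2)
triangles = [tuple(cache.fetch(i) for i in indices[k:k + 3])
             for k in range(0, len(indices), 3)]
print(cache.shader_runs)  # 4
```

Six indices reference four distinct vertices, so the shader runs four times instead of six.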

[Figure: Hardware Pipeline, vertex path. Blocks: AGP (commands), COMMAND PROCESSOR, MEMORY FETCH, VERTEX BUFFER, INDEX FIFO, VERTEX SHADER, VERTEX CACHE (hit/miss), PRIMITIVE ASSEMBLY, PRIMITIVE FIFO, PRIMITIVE BUFFER. Signals: vertex array address, vertex array, vertex data, index list, index, vertex data (T&L), offset, primitive (n vertices), primitive data (n vertices).]
Vertex Shader Architecture
SIMD architecture. Registers are 128 bits wide, with four 32-bit fields.
Instruction set: typical arithmetic instructions (vector MUL, ADD), some special instructions (ARL, DST), some complex mathematical instructions (EXP, COS), and support for branching, loops and procedures.
3 different sources of data:
 Input stream (~16 registers).
 Constants (~256 registers).
 Temporaries (~16 registers).
2 different destinations:
 Output stream (~15 registers).
 Temporaries (~16 registers).
Conditional registers (NV30) and boolean constants (R300, DX9) for conditional ‘execution’.
Vertex Shader Inputs and Outputs
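[Figure: vertex shader register files and datapath. VERTEX INPUT (16 x 128 bits), TEMPORARY (16 x 128 bits), CONSTANTS (256 x 128 bits) and ADDRESS (2 x 128 bits) supply the source registers (SREG) through a MUX/ABS/NEGATE/SWIZZLE stage; the operation (OP) runs in the ALU/MASK stage; destination registers (DREG) select TEMPORARY or VERTEX OUTPUT (15 x 128 bits). A STACK block is also shown.]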
Vertex Shader Architecture
[Figure: vertex shader block diagram. Blocks: PC, BRANCH, INSTRUCTIONS, IR, CONSTANTS, VERTEX INPUT, ADDRESS, TEMPORARIES, MUX, SWIZZLE, NEG/ABS, ALU, CCs, MASK, VERTEX OUTPUT.]
Vertex Shader: NV20
Exposes programmability of a small part of the geometry pipeline.
 Vertex load & store, format conversion, primitive assembly, clipping and triangle setup occur completely in parallel, in pipeline fashion.
 4-wide fine-grained SIMD FP provides the necessary performance, and multiple execution threads maintain efficiency and provide a very simple programming model.

NV20: Introduction
Independent vertices.
IEEE single precision FP.
4-component vectors (x, y, z, w).
Input registers can have their components arbitrarily rearranged/replicated (swizzled).
Any operation generating a scalar must generate that scalar replicated across all components, and output writes have a component write mask.
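The three register conventions above (swizzle, scalar replication, write mask) can be sketched on plain 4-tuples. The helper names `swizzle`, `dp3` and `write_masked` are illustrative, not part of any NV20 API.

```python
COMPONENT = {"x": 0, "y": 1, "z": 2, "w": 3}


def swizzle(reg, pattern):
    """Rearrange or replicate components, e.g. 'wzyx' or 'yyyy'."""
    return tuple(reg[COMPONENT[c]] for c in pattern)


def dp3(a, b):
    """3-component dot product; the scalar result fills all four components."""
    s = a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
    return (s, s, s, s)


def write_masked(dest, value, mask):
    """Write only the components named in mask; the others keep dest's value."""
    return tuple(v if c in mask else d for c, d, v in zip("xyzw", dest, value))


r0 = (1.0, 2.0, 3.0, 4.0)
r1 = (0.0, 0.0, 0.0, 1.0)
print(swizzle(r0, "wzyx"))                  # (4.0, 3.0, 2.0, 1.0)
print(swizzle(r0, "yyyy"))                  # (2.0, 2.0, 2.0, 2.0)
print(write_masked(r1, dp3(r0, r0), "xy"))  # (14.0, 14.0, 0.0, 1.0)
```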
NV20: Program Model
NV20: Input Attributes
Input Attributes:
 16 quad-float vertex source attribute registers.
 Position, normal, two colors, up to 8 texture coordinate sets, skin weights, fog and point size.
 Defaults: 0.0 for the second and third components, 1.0 for the fourth.
 Attributes are persistent.
 Only one vertex attribute may be read per program instruction.
Constant memory:
 96 quad floats.
 Can only be loaded before vertices are processed.
 Only one constant may be read per program instruction.
 The program may not write to constants.
NV20: Input Attributes
Integer address register:
 Loaded using ARL.
 Indexed constant reads, with out-of-range reads returning (0,0,0,0).
Read/Write register file:
 12 quad floats.
 Three reads and one write per instruction.
 Initialized to (0,0,0,0) per vertex.
 Any vector read may be sourced as multiple operands and individually swizzled/negated each time.
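The indexed constant read described above can be sketched as follows. The function names are hypothetical, and ARL is modelled with `math.floor` as an assumption; the slides describe ARL as truncating, and the two agree for the non-negative values used here.

```python
import math

ZERO = (0.0, 0.0, 0.0, 0.0)


def arl(value):
    """Load the integer address register from a float scalar (floor assumed)."""
    return math.floor(value)


def read_constant(constants, base, address):
    """Read c[address + base]; out-of-range reads return (0, 0, 0, 0)."""
    index = address + base
    if 0 <= index < len(constants):
        return constants[index]
    return ZERO


# 96 quad-float constants, as on the NV20; constant i holds (i, i, i, i).
constants = [(float(i),) * 4 for i in range(96)]

a0 = arl(3.7)
print(a0)                                           # 3
print(read_constant(constants, 90, a0))             # (93.0, 93.0, 93.0, 93.0)
print(read_constant(constants, 90, arl(10.2)))      # (0.0, 0.0, 0.0, 0.0)
```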
NV20: Output attributes
Standard mapping for the fixed function pipeline at the homogeneous clip space point.
Position for clipping.
Vertex color output clamped to the range 0.0 to 1.0.
Fog distance, point size.
8 texture coordinates.
All instruction writes have an optional 4-component write mask.
Initialized to (0.0, 0.0, 0.0, 1.0).
NV20: Instruction Set.
No branching.
Constant latency: any instruction can issue each clock, and all instructions execute with the same latency. All operands are immediately available, which limits the size of registers and memory banks.
NV20: Hardware Implementation

Two blocks: vertex attribute buffer
(VAB) and the floating point core.
NV20: VAB
The VAB is responsible for vertex attribute persistence.
16 input attributes.
When a write to an address is received, the entry defaults to (0.0, 0.0, 0.0, 1.0) and the valid data overwrites the components.
The VAB drains into a number of input buffers (IB) that are used to feed the FP core in a round robin fashion.
Dirty bits are maintained in the VAB so only changed attributes are updated when the same buffer is again the drain target.
The transfer of a vertex is triggered by a write to address 0 (vertex position).
To prevent bubbles during simultaneous loading and draining of the VAB, incoming writes may push out the contents of the target address, superseding a default drain sequence.
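The VAB behaviour described above (defaults, persistence, dirty bits, round-robin draining triggered by a write to address 0) can be modelled in a few lines. This is a behavioural sketch only: the class name, the per-buffer dirty flags, the default of three input buffers and the `copied` counter are assumptions made for illustration, and the push-out optimization is not modelled.

```python
DEFAULT = (0.0, 0.0, 0.0, 1.0)


class VertexAttributeBuffer:
    """Behavioural model of the VAB (hypothetical, not the real hardware)."""

    def __init__(self, num_attribs=16, num_input_buffers=3):
        self.attribs = [DEFAULT] * num_attribs
        self.input_buffers = [[DEFAULT] * num_attribs
                              for _ in range(num_input_buffers)]
        # One dirty flag per (input buffer, attribute) pair.
        self.dirty = [[False] * num_attribs for _ in range(num_input_buffers)]
        self.target = 0   # next drain target (round robin)
        self.copied = 0   # attribute transfers performed, for illustration

    def write(self, address, *components):
        """Write 1 to 4 components; missing ones take the defaults."""
        self.attribs[address] = tuple(components) + DEFAULT[len(components):]
        for flags in self.dirty:
            flags[address] = True
        # A write to address 0 (vertex position) triggers the vertex transfer.
        return self._drain() if address == 0 else None

    def _drain(self):
        b = self.target
        for a, is_dirty in enumerate(self.dirty[b]):
            if is_dirty:  # only attributes changed since this buffer's last drain
                self.input_buffers[b][a] = self.attribs[a]
                self.dirty[b][a] = False
                self.copied += 1
        self.target = (b + 1) % len(self.input_buffers)
        return list(self.input_buffers[b])


vab = VertexAttributeBuffer(num_input_buffers=2)
vab.write(3, 1.0, 0.0, 0.0)        # a color; the fourth component defaults to 1.0
v0 = vab.write(0, 0.5, 0.5)        # position; drains into input buffer 0
v1 = vab.write(0, 1.0, 2.0, 3.0)   # the color persists; drains into input buffer 1
v2 = vab.write(0, 2.0, 2.0)        # buffer 0 again: only the position is copied
print(v0[0], v1[3], vab.copied)
# (0.5, 0.5, 0.0, 1.0) (1.0, 0.0, 0.0, 1.0) 5
```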
NV20: VAB
NV20: Floating Point Core
Processes the instruction set.
Multithreaded vector processor operating on quad-float data.
Vertex data is read from input buffers and transformed into output buffers (OB).
Same latency for vector and special function units.
Multiple vertex threads are used to hide this latency.
SIMD VU: MOV, MUL, ADD, MAD, DP3, DP4, DST, MIN, MAX, SLT, SGE.
Special FU: RCP, RSQ, LOG, EXP, LIT.
VU is approximately IEEE (no denormalized numbers or exceptions, rounding always toward negative infinity).
1 instruction per clock and all input/output options have no performance penalty.
All input vectors are available with no latency.
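Part of the instruction set above can be sketched as a table of functions on 4-tuples. This is a functional illustration in ordinary Python floats, not a model of the hardware's latency, threading or rounding; DST, RSQ, LOG, EXP and LIT are omitted, and `transform_position` is a hypothetical helper showing the common use of four DP4 instructions for a matrix transform.

```python
def replicate(s):
    """Scalar results are replicated across all four components."""
    return (s, s, s, s)


VECTOR_UNIT = {
    "MOV": lambda a: a,
    "MUL": lambda a, b: tuple(x * y for x, y in zip(a, b)),
    "ADD": lambda a, b: tuple(x + y for x, y in zip(a, b)),
    "MAD": lambda a, b, c: tuple(x * y + z for x, y, z in zip(a, b, c)),
    "DP3": lambda a, b: replicate(sum(x * y for x, y in zip(a[:3], b[:3]))),
    "DP4": lambda a, b: replicate(sum(x * y for x, y in zip(a, b))),
    "MIN": lambda a, b: tuple(min(x, y) for x, y in zip(a, b)),
    "MAX": lambda a, b: tuple(max(x, y) for x, y in zip(a, b)),
    "SLT": lambda a, b: tuple(1.0 if x < y else 0.0 for x, y in zip(a, b)),
    "SGE": lambda a, b: tuple(1.0 if x >= y else 0.0 for x, y in zip(a, b)),
}

# RCP takes the first component of its (already swizzled) operand.
SPECIAL_UNIT = {"RCP": lambda a: replicate(1.0 / a[0])}


def transform_position(position, matrix_rows):
    """Four DP4 instructions, each writing one component of the output."""
    out = [0.0, 0.0, 0.0, 1.0]  # output registers start at (0, 0, 0, 1)
    for i, row in enumerate(matrix_rows):
        out[i] = VECTOR_UNIT["DP4"](row, position)[i]
    return tuple(out)


translate = [(1.0, 0.0, 0.0, 10.0),
             (0.0, 1.0, 0.0, 20.0),
             (0.0, 0.0, 1.0, 30.0),
             (0.0, 0.0, 0.0, 1.0)]
print(transform_position((1.0, 2.0, 3.0, 1.0), translate))  # (11.0, 22.0, 33.0, 1.0)
print(SPECIAL_UNIT["RCP"]((2.0, 2.0, 2.0, 2.0)))            # (0.5, 0.5, 0.5, 0.5)
```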
NV20: Floating Point Core
Vertex Shader: R300
4 vertex shader units.
1 scalar unit, 1 vector unit.
Registers:
 ALU Registers:
  Constants: 256 read only vectors.
  Temporary: 12 read/write vectors.
  Input: 16 read only vectors.
  Output: 15 write only vectors.
 Flow Control Registers:
  Integer Constant: 16 read only vectors.
  Address: 1 read/write vector.
  Loop Counter: 1 scalar.
  Boolean Constant: 16 read only bits.
R300: Instructions
Shaders up to 256 instructions long.
 Up to 64K executed instructions per vertex.
 ALU instructions: ADD, DP3, DP4, EXP, EXPP, EXPE, FRAC, LOG, LOGP, MAD, MADDX2, MAX, MIN, MOV, MUL, POW, RCP, RSQ, SGE, SLT.
 Control Flow instructions: CALL, LOOP, ENDLOOP, JUMP, JNZ, LABEL, REPEAT, ENDREPEAT, RETURN.
 Address Instructions: ARL, ARR.
 Graphic Instructions: DST, LIT.
 Instructions based on DX9 VS2.0.

NV30: Overview
Supports all VS1 instructions and features.
Beyond VS2?
Condition codes.
Branches and subroutines.
Modifiers: absolute.
User clip support (new output registers CLP0 to CLP5).
New instructions.
More registers.
NV30: Overview
Up to 256 instructions per program.
 Up to 64K executed instructions per
vertex.
 16 temporary registers.
 2 vector address registers.
 256 program parameters (constants).

NV30: Condition Codes
4 component register:
 LT: less than zero.
 EQ: equal to zero.
 GT: greater than zero.
 UN: unordered, for comparisons involving NaN.
Instructions optionally update condition code state:
 “C” suffix: DP4C, MOVC.
 “CC” pseudo register for updating condition codes.
Condition code used in:
 Branches and procedure call/return.
 Result masking.
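The condition code update and result masking described above can be sketched per component. The function names are illustrative, and treating LE and GE as "LT or EQ" and "GT or EQ" (so that UN fails both) is an assumption of this sketch.

```python
import math

# Which stored codes satisfy each test (LE/GE composition is assumed).
TESTS = {
    "LT": {"LT"},
    "EQ": {"EQ"},
    "GT": {"GT"},
    "LE": {"LT", "EQ"},
    "GE": {"GT", "EQ"},
}


def condition_code(value):
    """Classify one result component as LT, EQ, GT or UN (NaN)."""
    if math.isnan(value):
        return "UN"
    if value < 0.0:
        return "LT"
    if value > 0.0:
        return "GT"
    return "EQ"


def update_cc(result):
    """What a 'C'-suffixed instruction (e.g. MOVC) would store, per component."""
    return tuple(condition_code(v) for v in result)


def masked_write(dest, value, cc, test):
    """Write a component only where its condition code passes the test."""
    return tuple(v if c in TESTS[test] else d for d, v, c in zip(dest, value, cc))


cc = update_cc((-1.0, 0.0, 2.0, float("nan")))
print(cc)  # ('LT', 'EQ', 'GT', 'UN')
print(masked_write((9.0, 9.0, 9.0, 9.0), (1.0, 2.0, 3.0, 4.0), cc, "LE"))
# (1.0, 2.0, 9.0, 9.0)
```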

NV30: Modifiers

Source:
 Swizzle
 Negate
 Absolute
Target:
 Masking
 Conditional masking

NV30: Branching and subroutines
BRA:
 Unconditional.
 Conditional: BRA label (LE.xyww)
 Computed (indirect): BRA [A1.z] (GT.x)
Call & return for subroutines:
 CAL & RET.
 Same options as with branches.
 Four levels of subroutine execution.
 No parameter stack.
NV30: Clipping
New output registers: o[CLP0]..o[CLP5].
 Clip coordinate n is used when GL_CLIP_PLANEn is enabled.
 Clip coordinate n is interpolated across the primitive.
 Only the portion of the primitive where the clip coordinate is greater than zero is rasterized.
 Hardware performs fast trivial reject if all clip coordinates of a primitive are negative.
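The clip coordinate rule above can be sketched as follows. Two assumptions are made here: the vertex program computes each clip coordinate as a 4-component dot product of a plane equation with the vertex position (a common choice, not stated on the slide), and trivial reject is read as "for some enabled plane, every vertex of the primitive has a negative clip coordinate". The function names are illustrative.

```python
def clip_coordinate(plane, position):
    """Dot product of a plane equation (a, b, c, d) with a vertex position."""
    return sum(p * v for p, v in zip(plane, position))


def trivially_rejected(primitive, planes):
    """True if some plane has a negative clip coordinate at every vertex."""
    return any(all(clip_coordinate(plane, v) < 0.0 for v in primitive)
               for plane in planes)


planes = [(1.0, 0.0, 0.0, 0.0)]  # keeps the half-space x > 0
outside = [(-1.0, 0.0, 0.0, 1.0), (-2.0, 1.0, 0.0, 1.0), (-3.0, 0.0, 1.0, 1.0)]
crossing = [(-1.0, 0.0, 0.0, 1.0), (2.0, 1.0, 0.0, 1.0), (1.0, 0.0, 1.0, 1.0)]

print(trivially_rejected(outside, planes))   # True
print(trivially_rejected(crossing, planes))  # False
```

The crossing triangle is not rejected; only its portion with a positive interpolated clip coordinate would be rasterized.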

NV30: New Instructions
ARL: now supports loading the 4-component integer address registers A0 and A1.
 ARR: like ARL, except that it rounds rather than truncates before storing the integer result in an address register.
 BRA, CAL, RET: branching instructions.
 COS, SIN: high-precision trigonometric functions.
 FLR, FRC: floor and fraction of floating point values.
 EX2, LG2: high-precision exponentiation and logarithm functions.
 ARA: adds pairs of components of an address register, useful for looping and other operations.
 SEQ, SFL, SGT, SLE, SNE, STR: six added “set on” instructions similar to SLT and SGE.
 SSG: “set sign” operation generates a vector holding -1.0 for negative operand components, 0 for zero components, and +1.0 for positive components.
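The per-component behaviour of FLR, FRC, SSG and the "set on" family can be sketched as below. This is a functional illustration only; the helper names are hypothetical, the set-on instructions are assumed to return 1.0 where the comparison holds and 0.0 elsewhere, and NaN handling is not modelled.

```python
import math
import operator


def flr(v):
    """FLR: floor of each component."""
    return tuple(float(math.floor(x)) for x in v)


def frc(v):
    """FRC: fraction of each component, x - floor(x)."""
    return tuple(x - math.floor(x) for x in v)


def ssg(v):
    """SSG: -1.0 for negative, 0.0 for zero, +1.0 for positive components."""
    return tuple(-1.0 if x < 0.0 else (1.0 if x > 0.0 else 0.0) for x in v)


SET_ON = {
    "SEQ": operator.eq, "SNE": operator.ne,
    "SGT": operator.gt, "SLE": operator.le,
    "SLT": operator.lt, "SGE": operator.ge,
    "SFL": lambda a, b: False,  # set on false: always 0.0
    "STR": lambda a, b: True,   # set on true: always 1.0
}


def set_on(op, a, b):
    """1.0 where the comparison holds, 0.0 elsewhere."""
    return tuple(1.0 if SET_ON[op](x, y) else 0.0 for x, y in zip(a, b))


print(flr((1.25, -0.25, 3.0, 0.5)))   # (1.0, -1.0, 3.0, 0.0)
print(frc((1.25, -0.25, 3.0, 0.5)))   # (0.25, 0.75, 0.0, 0.5)
print(ssg((-2.0, 0.0, 5.0, -0.5)))    # (-1.0, 0.0, 1.0, -1.0)
print(set_on("SGT", (1.0, 2.0, 3.0, 4.0), (2.0, 2.0, 2.0, 2.0)))  # (0.0, 0.0, 1.0, 1.0)
```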

NV30: Instruction List
Add & multiply instructions: ADD, DP3, DP4, DPH,
MAD, MOV, SUB.
 Math functions: ABS, COS, EX2, FLR, FRC, LG2, LOG,
RCP, RSQ, SIN.
 Set on instructions: SEQ, SFL, SGE, SGT, SLE, SLT, SNE, STR.
 Branching instructions: BRA, CAL, RET.
 Address register instructions: ARL, ARA.
 Graphics-oriented instructions: DST, LIT, RCC, SSG.
 Minimum/maximum instructions: MAX, MIN

Others

Antialiasing
 Anisotropic Filtering (textures).
 Line Antialiasing.
 Edge Antialiasing
 Full Screen Antialiasing (FSAA):
  Supersampling.
  Multisampling.
TBDR: Tile Based Deferred Rendering (STMicro PowerVR).
HOS (High Order Surfaces): N-Patches, Bezier, Displacement Mapping, TruForm, Tessellation.