8085 Architecture & Its Assembly language programming


Dr A Sahu
Dept of Comp Sc & Engg.
IIT Guwahati
• Graphics System
• GPU Architecture
• Memory Model
– Vertex Buffer, Texture buffer
• GPU Programming Model
– DirectX, OpenGL, OpenCL
• GPGPU Programming
– Introduction to NVIDIA CUDA Programming
(Diagram: the graphics hardware pipeline. A 3D application issues 3D API commands through a 3D API, OpenGL or DirectX/3D; across the CPU-GPU boundary, the GPU command and data stream carries a vertex index stream. Pre-transformed vertices pass through the programmable vertex processor to become transformed vertices; primitive assembly produces assembled polygons, lines, and points. Rasterization and interpolation emit a pixel location stream of rasterized pre-transformed fragments; the programmable fragment processor produces transformed fragments, and raster operations apply the pixel updates to the frame buffer.)
(Diagram: simplified GPU dataflow. Vertices (x, y, z) from the memory system feed the vertex shader for vertex processing; the pixel shader performs pixel processing, reading texture memory, and writes pixel R, G, B values to the frame buffer.)
• Graphics pipeline stages:
Modeling Transformations → Illumination (Shading) → Viewing Transformation (Perspective / Orthographic) → Clipping → Projection (to Screen Space) → Scan Conversion (Rasterization) → Visibility / Display
• Primitives are processed in a series of stages
• Each stage forwards its result on to the next stage
• The pipeline can be drawn and implemented in different ways
• Some stages may be in hardware, others in software
• Optimizations & additional programmability are available at some stages
• Graphics pipeline (simplified)
(Diagram: input geometry in object space passes through the vertex shader into window space; the pixel shader reads textures and writes the output to the framebuffer.)
The computing capacity of graphics processing units (GPUs) has improved exponentially over the past decade. NVIDIA released the CUDA programming model for its GPUs. The CUDA programming environment applies the parallel processing capabilities of GPUs to medical image processing research.
• CUDA Cores: 480
– CUDA = Compute Unified Device Architecture
• Microsoft® DirectX® 11 Support
• 3D Vision™ Surround Ready
• Interactive Ray Tracing
• 3-way SLI® Technology
• PhysX® Technology
• CUDA™ Technology
• 32x Anti-aliasing Technology
• PureVideo® HD Technology
• PCI Express 2.0 Support
• Dual-link DVI Support, HDMI 1.4
• This generation is the first generation of fully-programmable graphics cards
• Different versions have different resource limits on fragment/vertex programs
(Diagram: vertex data arrives over AGP; vertex transforms run on the programmable vertex shader, followed by primitive assembly, rasterization and interpolation, the programmable fragment processor, and raster operations into the frame buffer.)
• Writing assembly is
– Painful
– Not portable
– Hard to optimize
• High-level shading languages solve these problems
– Cg, HLSL
• CPU and GPU Memory Hierarchy
– CPU side: disk, CPU main memory, CPU caches, CPU registers
– GPU side: GPU video memory, GPU caches, GPU constant registers, GPU temporary registers
• Much more restricted memory access
– Allocate/free memory only before computation
– Limited memory access during computation (kernel)
• Registers
– Read/write
• Local memory
– Does not exist
• Global memory
– Read-only during computation
– Write-only at end of computation (pre-computed address)
• Disk access
– Does not exist
• At any program point
– Allocate/free local or global memory
– Random memory access
• Registers
– Read/write
• Local memory
– Read/write to stack
• Global memory
– Read/write to heap
• Disk
– Read/write to disk
• Where is GPU Data Stored?
– Vertex buffer
– Frame buffer
– Texture
VS 3.0 GPUs
(Diagram: the vertex processor reads from the vertex buffer and textures; the rasterizer feeds the fragment processor, which reads textures and writes to the frame buffer(s).)
• Each GPU memory type supports subset of the
following operations
– CPU interface
– GPU interface
• CPU interface
– Allocate
– Free
– Copy CPU → GPU
– Copy GPU → CPU
– Copy GPU → GPU
– Bind for read-only vertex stream access
– Bind for read-only random access
– Bind for write-only framebuffer access
• GPU (shader/kernel) interface
– Random-access read
– Stream read
Vertex Buffers
• GPU memory for vertex data
• Vertex data required to initiate render pass
• Supported Operations
– CPU interface
• Allocate
• Free
• Copy CPU → GPU
• Copy GPU → GPU (render-to-vertex-array)
• Bind for read-only vertex stream access
– GPU interface
• Stream read (vertex program only)
• Limitations
– CPU
• No copy GPU → CPU
• No bind for read-only random access
• No bind for write-only framebuffer access
– GPU
• No random-access reads
• No access from fragment programs
• Random-access GPU memory
• Supported Operations
– CPU interface
• Allocate
• Free
• Copy CPU → GPU
• Copy GPU → CPU
• Copy GPU → GPU (render-to-texture)
• Bind for read-only random access (vertex or fragment)
• Bind for write-only framebuffer access
– GPU interface
• Random read
• Memory written by fragment processor
• Write-only GPU memory
• Fixed function pipeline
– Made early games look fairly similar
– Little freedom in rendering
– “One way to do things”
• glShadeModel(GL_SMOOTH);
• Different render methods
– Triangle rasterization proved very efficient to implement in hardware.
– Ray tracing and voxels produce nice results but are very slow and require large amounts of memory.
• DirectX before version 8 entirely fixed function
• OpenGL before version 2.0 entirely fixed
function
– Extensions were often added for different effects, but no real
programmability on the GPU.
• OpenGL is just a specification
– Vendors must implement the specification, but
on whatever platform they wish
• DirectX is a library, Windows only
– Direct3D is the graphics component
• Direct3D 8.0 (2000), OpenGL 2.0 (2004) added
support for assembly language programming of
vertex and fragment shaders.
– NVIDIA GeForce 3, ATI Radeon 8000
• Direct3D 9.0 (2002) added HLSL (High Level
Shader Language) for much easier programming
of GPUs.
– NVIDIA GeForce FX 5000, ATI Radeon 9000
• Minor increments on this for a long time, with
more capabilities being added to shaders.
• Vertex data sent in by graphics API
– Mostly OpenGL or DirectX
• Processed in vertex program – “vertex shader”
• Rasterized into pixels
• Processed in “fragment shader”
Vertex Data → Vertex Shader → Rasterize to Pixels → Fragment Shader → Output
• No longer need to write shaders in assembly
• GLSL, HLSL, and Cg offer C-style programming languages
• Write two main() functions, which are executed
on each vertex/pixel
• Declare auxiliary functions, local variables
• Output by setting position and color
• Prior to Direct3D 10/GeForce 8000/Radeon
2000, vertex and fragment shaders were
executed in separate hardware.
• Direct3D 10 (with Vista) brought shader
unification, and added Geometry Shaders.
– GPUs now use the same ‘cores’ to run geometry/vertex/fragment shader code.
• CUDA comes out alongside GeForce 8000 line,
allowing ‘cores’ to run general C code, rather
than being restricted to graphics APIs.
(Diagram: 3D geometric primitives enter the GPU, whose programmable unified processors run vertex, geometry, pixel, and compute programs over GPU memory (DRAM); rasterization and hidden surface removal produce the final image.)
• CUDA was the first to drop the graphics API and allow the GPU to be treated as a coprocessor to the CPU.
– Linear memory accesses (no more buffer objects)
– Run thousands of threads on separate scalar cores
(with limitations)
– High theoretical/achieved performance for data
parallel applications
• ATI has Stream SDK
– Closer to assembly language programming for
Stream
• Apple announces OpenCL initiative in 2008
– Officially owned by Khronos Group, the same that
controls OpenGL
– Released in 2009, with support from NVIDIA/ATI.
– Another specification for parallel programming, not
entirely specific to GPUs (support for CPU SSE
instructions, etc.).
• DirectX 11 (and a Direct3D 10 extension) adds DirectCompute shaders
– Similar idea to OpenCL, just tied in with Direct3D
CS101 GPU Programming
• DirectX11 also adds multithreaded rendering, and
tessellation stages to the pipeline
– Two new shader stages in the unified pipeline; Hull
and Domain shaders
– Allow high detail geometry to be created on the
GPU, rather than flooding the PCI-E bus with
geometry data.
– More programmable geometry
• OpenGL 4 (specification just released) is close to
feature parity with Direct3D11
– Namely also adds tessellation
• Newest GPUs have incredible compute power
– 1-3 TFlops, 100+ GB/s memory access bandwidth
• More parallel constructs
– High speed atomic operations, more control over
thread interaction/synchronization.
• Becoming easier to program
– NVIDIA’s ‘Fermi’ architecture has support for C++ code, 64-bit pointers, etc.
• GPU computing starting to go mainstream
– Photoshop 5, video encode/decode, physics/fluid simulation, etc.
• GPUs are fast…
– 3.0 GHz dual-core Pentium4: 24.6 GFLOPS
– NVIDIA GeForce 7800: 165 GFLOPS
– 1066 MHz FSB Pentium Extreme Edition : 8.5 GB/s
– ATI Radeon X850 XT Platinum Edition: 37.8 GB/s
• GPUs are getting faster, faster
– CPUs: 1.4× annual growth
– GPUs: 1.7× (pixels) to 2.3× (vertices) annual growth
• Modern GPUs are deeply programmable
– Programmable pixel, vertex, video engines
– Solidifying high-level language support
• Modern GPUs support high precision
– 32 bit floating point throughout the pipeline
– High enough for many (not all) applications
• GPUs designed for & driven by video games
– Programming model unusual
– Programming idioms tied to computer graphics
– Programming environment tightly constrained
• Underlying architectures are:
– Inherently parallel
– Rapidly evolving (even in basic feature set!)
– Largely secret
• Can’t simply “port” CPU code!
• Application specifies geometry → rasterized
• Each fragment is shaded w/
SIMD program
• Shading can use values
from texture memory
• Image can be used as
texture on future passes
• Draw a screen-sized quad → stream
• Run a SIMD kernel over
each fragment
• “Gather” is permitted
from texture memory
• Resulting buffer can be
treated as texture on next
pass
• Introduced November of 2006
• Converts GPU to general purpose CPU
• Required hardware changes
– Only available on G80 or later GPUs
• GeForce 8000 series or newer
• Implemented as extension to C/C++
– Results in lower learning curve
• 16 Streaming Multiprocessors (SM)
– Each one has 8 Streaming Processors (SP)
– Each SM can execute 32 threads simultaneously
– 512 threads execute per cycle
– SPs hide instruction latencies
• 768 MB DRAM
– 86.4 GB/s memory bandwidth to GPU cores
– 4 GB/s memory bandwidth with system memory
(Diagram, G80 architecture: the host feeds an input assembler and thread execution manager; an array of streaming multiprocessors, each with its parallel data cache and texture and load/store units, shares a global memory.)
CUDA Execution Model
• Execution starts with a kernel
• A kernel is a function called from the host that executes on the GPU
• Thread resources are abstracted into 3 levels
– Grid – highest level
– Block – collection of threads
– Thread – execution unit
(Diagram, courtesy NVIDIA: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device. Each grid is a 2-D array of blocks, Block (0,0) through Block (1,1); each block, e.g. Block (1,1), is a 3-D array of threads, Thread (0,0,0) through Thread (3,1,0).)
• 768 MB global memory
– Accessible to all threads globally
– 86.4 GB/s throughput
• 16 KB shared memory per SM
– Accessible to all threads within a block
– 384 GB/s throughput
• 32 KB register file per SM
– Allocated to threads at runtime (local variables)
– 384 GB/s throughput
– Threads can only see their own registers
(Diagram: a grid contains blocks, e.g. Block (0,0) and Block (1,0); each block has its own shared memory, each thread its own registers, and all threads plus the host access global memory.)
(From C/C++ function)
• Allocate memory on CUDA device
• Copy data to CUDA device
• Configure thread resources
– Grid Layout (max 65536x65536)
– Block Layout (3 dimensional, max of 512 threads)
• Execute kernel with thread resources
• Copy data out of CUDA device
• Free memory on CUDA device
• Multiply matrices M and N to form result R
• General algorithm
– For each row i in matrix R
• For each column j in matrix R
– Cell (i, j) = dot product of row i of M and column j of N
• Algorithm runs in O(length^3)
• Each thread represents cell (i, j)
• Calculate value for cell (i, j)
• Use single block
• Should run in O(length)
– Much better than O(length^3)
(Diagram: matrices M and N, each WIDTH × WIDTH, multiplied to produce P.)
• Max threads allowed per block is 512.
• Only supports max matrix size of 22x22
– 484 threads needed
• Split the result matrix into smaller blocks
• Utilizes more SMs than the single-block approach
• Better speed-up
(Diagram: the result matrix Pd is divided into TILE_WIDTH × TILE_WIDTH sub-blocks Pdsub; block indices (bx, by) select a tile of Pd, thread indices (tx, ty) select a cell within it, and Md and Nd are read in corresponding tiles across WIDTH.)
• Runs 10 times as fast as the serial approach
• Solution runs at 21.4 GFLOPS
– GPU is capable of 384 GFLOPS
– What gives?
• Each block is assigned to an SM
– 8 SPs per SM
• The SM executes a batch of 32 threads at a time
– A batch of 32 threads is called a warp
• The SM switches warps when a long-latency operation is found
– Works similarly to Intel’s Hyper-Threading
• Global memory bandwidth is 86.4 GB/s
• Shared memory bandwidth is 384 GB/s
• Register file bandwidth is 384 GB/s
• Key is to use shared memory and registers when possible
• Each SM has 16 KB shared memory
• Each SM has a 32 KB register file
• Local variables in a function take up registers
• The register file must support all threads in the SM
– If there are not enough registers, then fewer blocks are scheduled
– The program still executes, but less parallelism occurs.
• SM can only handle 768 threads
• SM can handle 8 blocks, 1 block for each SP
• Each block can have up to 96 threads
– Max out SM resources
• Intel’s new approach to a GPU
• Considered to be a hybrid between a multi-core CPU and a GPU
• Combines functions of a multi-core CPU with the functions of a GPU