Real-Time Parallel Radiosity

Matt Craighead
May 8, 2002
6.338J/18.337J, Course 6 AUP
What is Radiosity?
• A computer graphics technique for lighting
• Two types of lighting algorithms:
– Local: easy, fast, but not realistic
– Global: slow, difficult, but highest quality
• Radiosity is a global algorithm
• Global algorithms try to take into account
interreflections in scenes.
Radiosity in Real Time?
• Local algorithms can easily run in real time,
either in software or hardware.
• Computational demands grow with number
of surfaces and number of lights. $10^6$
surfaces is not unreasonable!
• Most global algorithms take quadratic time
in number of surfaces (all interactions!).
Local Doesn’t Mean Bad
Id Software’s Doom 3 (video capture)
…but Global is Better
State-of-the-art radiosity rendering from 1988 (5 hours render time!)
Local Lighting Math
• Consider a light source at some point in
three-dimensional space, and a surface at
some other location.
• How much does the light directly contribute
to the brightness of the surface?
• Note that global algorithms still need to do
local lighting as a first step.
Local Lighting Math
• Set up a 3-dimensional coordinate system
centered on the surface. Define unit vectors
L (light), N (normal), E (eye):
– L points towards the light.
– N is perpendicular to the surface.
– E points towards the viewer.
[Figure: unit vectors E, N, and L at the surface]
Local Lighting Math
• If an object sits between the light and
surface, the lighting contribution is zero.
• Otherwise…
• Surfaces generally reflect light in two ways:
“diffuse” (dull) and “specular” (shiny).
• We can see three colors: red, green, blue.
• So let $M_d$, $M_s$ be 3-vectors indicating how
much of [R,G,B] is reflected in each way.
Local Lighting Math
• Diffuse lighting is independent of view
angle. It is brightest when N and L are most
closely aligned, and falls off with the cosine
of the angle between them.
• All lighting also falls off with the square of
the distance d from the light source.
• So, the diffuse term is $d^{-2} M_d (N \cdot L)$.
Local Lighting Math
• Specular lighting is view-dependent. In one
simple formulation (Blinn shading), we
determine the vector H halfway between E
and L and evaluate $d^{-2} M_s (N \cdot H)^s$, where s
indicates the shininess of the surface.
[Figure: unit vectors E, H, N, and L at the surface]
Local Lighting Math
• The easiest way to think of H, the half-angle
vector, is that if H and N were aligned, and
the surface were a mirror, then the light
would reflect straight to the eye.
• $(N \cdot H)^s$ can be thought of as representing a
probability distribution of “microfacets,”
whose normals are clustered around N but
do vary. Smoother surfaces have higher s.
Local Lighting Math
• So, our full contribution from a light source
is $d^{-2} (M_d (N \cdot L) + M_s (N \cdot H)^s)$.
– This may also be multiplied by a light color.
• We may also add in an emissive term $M_e$ for
glowing objects.
• If the lack of interreflection makes things
too dark, we may add an “ambient” term
$L_a M_d$.
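The local lighting formula above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the project: the function name `local_lighting`, the tuple-based vector helpers, and the clamping of negative dot products to zero are assumptions added here. Shadow tests, light color, and the emissive and ambient terms are left out.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)

def local_lighting(point, normal, light_pos, eye_pos, m_d, m_s, shininess):
    """Return the [R, G, B] contribution of one unoccluded point light.

    Evaluates d^-2 * (M_d (N.L) + M_s (N.H)^s) per color channel.
    """
    to_light = tuple(l - p for l, p in zip(light_pos, point))
    d2 = dot(to_light, to_light)             # squared distance to the light
    n = normalize(normal)
    l = normalize(to_light)
    e = normalize(tuple(v - p for v, p in zip(eye_pos, point)))
    h = normalize(tuple(a + b for a, b in zip(e, l)))   # half-angle vector
    n_dot_l = max(dot(n, l), 0.0)            # assumption: clamp back-facing light
    n_dot_h = max(dot(n, h), 0.0)
    return [(md * n_dot_l + ms * n_dot_h ** shininess) / d2
            for md, ms in zip(m_d, m_s)]
```

With the light and the eye both straight above the surface, N, L, and H coincide, so the result reduces to (M_d + M_s) divided by the squared distance.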
Approximations
• We may compute this formula only at the
vertices of the object, rather than at every
pixel on the screen.
• The specular exponent may be evaluated
using a power approximation rather than a
real power function.
– The specular formula is already a cheesy
hack…
More Approximations
• Shadows are hard. At the cost of much
realism, they can be omitted or faked.
– Draw a little dark spot under an object
• Ambient is itself an approximation of real
global illumination.
• $N \cdot L$ and $(N \cdot H)^s$ are idealizations; in
reality, the reflectance can be an arbitrary
4-dimensional function called a “BRDF.”
Radiosity
• Radiosity is a global algorithm that handles
diffuse lighting only.
• The term “global illumination” refers to
global algorithms that handle specular
lighting also.
– Specular makes things much more difficult.
• I will only discuss plain radiosity.
Radiosity in a Nutshell
• Suppose there are n surfaces in the scene.
• Let $A_i$ be the area of surface i.
• Let $E_i$ be the amount of light energy emitted
from surface i per unit time and area.
• Let $B_i$ be the amount of light energy emitted
and/or reflected from surface i per unit time
and area.
• Let $\rho_i$ be the diffuse albedo of surface i.
Radiosity in a Nutshell
• Now, let $F_{ij}$ (called a “form factor”) be the
fraction of light from surface i that reaches
surface j.
• Then, for all i, we must have:
$A_i B_i = A_i E_i + \rho_i \sum_{j \in [1,n]} A_j B_j F_{ji}$
• This is just a linear system with n equations
and n unknowns. Solve and you get the B’s.
That Easy?
• Well, not quite that easy.
• Solving the system of equations is $O(n^3)$.
– So we’ll iterate instead.
• We still have to do local lighting.
– These become the $E_i$’s.
• We have to compute the $F_{ij}$ terms somehow.
– This turns out to be expensive.
– If the scene is static, precompute!
Computing Form Factors
• $F_{ij}$ turns out to be a big ugly integral over
the area of both i and j.
• Worse, one of the terms in the integral is
whether the two dA’s can see one another!
• So, no closed-form solution.
• Standard numerical integration is no good.
A raycast per sample takes too long.
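For reference, the integral in question is usually written as below. The symbols $\theta_i$, $\theta_j$, $r$, and $V$ are not named on the slides and are introduced here: the angles between each surface normal and the line joining the two area elements, the distance between the elements, and a visibility term that is 1 when the elements can see one another and 0 otherwise.

```latex
F_{ij} = \frac{1}{A_i} \int_{A_i} \int_{A_j}
         \frac{\cos\theta_i \, \cos\theta_j}{\pi r^2} \,
         V(x_i, x_j) \, dA_j \, dA_i
```

The visibility term $V$ is what rules out a closed form for general scenes.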
Computing Form Factors
• The usual solution is called the “hemicube
algorithm.”
– Render the scene from the point of view of the
surface, in all directions. In effect, you are
projecting onto a hemicube.
– Count up the number of times you can see each
surface (weighted appropriately).
– Takes advantage of 3D acceleration!
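The "weighted appropriately" step can be sketched as follows. The weights are the standard per-pixel hemicube delta form factors for a hemicube of half-width 1 and height 1; the function names and the item-buffer layout (one surface id per pixel, or -1 for nothing visible) are assumptions made for this sketch, and the rendering step that produces the item buffers is left out entirely.

```python
import math

def hemicube_delta_form_factors(res):
    """Return (top, side) weight tables for a unit hemicube; res must be even.

    `top` is res x res. `side` has res // 2 rows and res columns and is
    shared by all four side faces.
    """
    h = 2.0 / res                  # pixel width
    da = h * h                     # pixel area
    centers = [-1.0 + (k + 0.5) * h for k in range(res)]
    top = [[da / (math.pi * (x * x + y * y + 1.0) ** 2) for x in centers]
           for y in centers]
    heights = [(k + 0.5) * h for k in range(res // 2)]    # z in (0, 1)
    side = [[z * da / (math.pi * (y * y + z * z + 1.0) ** 2) for y in centers]
            for z in heights]
    return top, side

def form_factors_from_item_buffers(top_ids, side_ids_list, top, side, n):
    """Accumulate one row of F by summing the weight of every pixel per surface."""
    row = [0.0] * n
    for ids, weights in [(top_ids, top)] + [(s, side) for s in side_ids_list]:
        for id_row, w_row in zip(ids, weights):
            for sid, w in zip(id_row, w_row):
                if sid >= 0:
                    row[sid] += w
    return row
```

The weights over all five faces sum to approximately 1, so a surface that fills the whole field of view receives a form factor close to 1.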
Hemicube Algorithm in Action
Simplified Radiosity Equation
• It so happens that $F_{ji} A_j = F_{ij} A_i$.
– This is a simple property of the integral for F.
• So we can simplify the radiosity equation:
$A_i B_i = A_i E_i + \rho_i \sum_{j \in [1,n]} A_j B_j F_{ji}$
$A_i B_i = A_i E_i + \rho_i \sum_{j \in [1,n]} A_i B_j F_{ij}$
$B_i = E_i + \rho_i \sum_{j \in [1,n]} B_j F_{ij}$
$B = E + RB$ (where $R_{ij} = \rho_i F_{ij}$)
Solving the Radiosity Equation
• B = E + RB is just a matrix/vector equation.
• Direct solution: $B = (I - R)^{-1} E$
• Iterative solution:
– If E is a local lighting solution, then call it $B_0$.
– Now let $B_{i+1} = B_0 + R B_i$.
– Then, $B_i$ is simply the lighting solution after i
bounces! Since $\rho_j < 1$ for every surface j (conservation of
energy), the $B_i$’s converge to B.
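The iteration above is short enough to write out directly. The sketch below works on a single color channel with a dense matrix stored as nested lists; the function name `radiosity_iterate` and the two-patch example are made up for illustration.

```python
def radiosity_iterate(R, B0, bounces):
    """Apply B_{k+1} = B0 + R B_k for the given number of bounces."""
    n = len(B0)
    B = list(B0)
    for _ in range(bounces):
        B = [B0[i] + sum(R[i][j] * B[j] for j in range(n)) for i in range(n)]
    return B

# Two facing patches: F = [[0, 0.5], [0.5, 0]], albedo 0.5 for both,
# so R = [[0, 0.25], [0.25, 0]]. Only the first patch is lit directly.
R = [[0.0, 0.25], [0.25, 0.0]]
B0 = [1.0, 0.0]
one_bounce = radiosity_iterate(R, B0, 1)     # [1.0, 0.25]
converged = radiosity_iterate(R, B0, 20)
```

For this example the direct solution is B = [16/15, 4/15], and the iterate approaches it geometrically because each bounce scales the remaining error by at most 0.25.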
Iterative vs. Direct Radiosity
• If F is a dense matrix, then direct solution
takes time $O(n^3)$, while k steps of iteration
takes time $O(kn^2)$.
• Realistically, k is just a constant; say, 5.
• Iterative solution is practical with n ranging
up to the hundreds of thousands!
• Iteration time is proportional to the number
of nonzero entries of F.
Sparsity of Form Factors
• In a simple cube-shaped room, all form
factors will be nonzero except for pairs on
the same wall.
– So F is 5/6 nonzero. Not very encouraging…
– As more objects are added, F becomes sparser.
• As the scene expands beyond one room, F
becomes much sparser.
• So iterative radiosity scales extremely well!
Storage and Precision
• Radiosity hogs memory.
– If you grid a cube-shaped room 100x100 on
each wall, and store F as a dense matrix of
floats, that’s 14.4 GB. (!!!)
• Storing F as sparse helps a lot.
– Good for iteration speed too.
• Using smaller values than floats helps too.
– 16-bit fixed-point is good enough.
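The 14.4 GB figure follows from simple arithmetic, reproduced here as a sanity check. The function name and parameters are illustrative only; GB is taken to mean 10^9 bytes.

```python
def dense_form_factor_bytes(grid, walls=6, bytes_per_entry=4):
    """Bytes needed to store F densely for a cube room gridded grid x grid per wall."""
    n = walls * grid * grid          # number of surfaces
    return n * n * bytes_per_entry   # n x n matrix

as_floats = dense_form_factor_bytes(100)                      # 32-bit floats
as_fixed16 = dense_form_factor_bytes(100, bytes_per_entry=2)  # 16-bit fixed-point
```

A 100x100 grid on six walls gives 60,000 surfaces and 3.6 billion matrix entries, so switching from floats to 16-bit values halves the dense size to 7.2 GB before any sparsity is exploited.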
RLE Encoding of Form Factors
• F tends to have runs of zeros and nonzeros.
• Smart traversal order of grids makes the
runs longer.
[Figure: “bad” and “good” grid traversal orders]
RLE Encoding of Form Factors
• My disk storage format:
– 1 byte: run length, run type
• Length up to 84, type is zero, 255, or 65535
– Then, variable # bytes with run data
– Compression ratio for my scene: 5.97:1
• My memory storage format:
– 2 bytes: run length up to 65535
– 2 bytes: run type (zero or 65535)
– 2N bytes: run data
– Compression ratio for my scene: 2.49:1
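A run-length scheme in the spirit of the in-memory format can be sketched as below. The slides do not spell out what the run types mean, so this sketch assumes that type zero marks a run of zeros with no data words, and type 65535 marks a run of N explicit 16-bit values followed by those N words. The names `rle_encode` and `rle_decode` are hypothetical.

```python
ZERO_RUN = 0        # assumed: run of zeros, no data words follow
DATA_RUN = 65535    # assumed: run of explicit 16-bit values, N data words follow
MAX_RUN = 65535     # run length must fit in 2 bytes

def rle_encode(row):
    """Encode a list of 16-bit form factors as a flat list of 16-bit words."""
    words = []
    i = 0
    while i < len(row):
        is_zero = row[i] == 0
        j = i
        while j < len(row) and (row[j] == 0) == is_zero and j - i < MAX_RUN:
            j += 1
        words += [j - i, ZERO_RUN if is_zero else DATA_RUN]
        if not is_zero:
            words += row[i:j]        # literal values for a data run
        i = j
    return words

def rle_decode(words):
    """Invert rle_encode."""
    row = []
    k = 0
    while k < len(words):
        length, run_type = words[k], words[k + 1]
        k += 2
        if run_type == ZERO_RUN:
            row += [0] * length
        else:
            row += words[k:k + length]
            k += length
    return row
```

Under these assumptions every run costs 4 bytes of header, so the scheme only pays off when runs are long, which is why the traversal order matters.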
Parallelization
• Split up the surfaces among the CPUs.
– Each CPU owns those rows of the form factor
matrix.
– Each CPU computes its surfaces’ local lighting.
– Every iteration requires an all-to-all
communication, so that each CPU has the full B
vector.
• At present, my storage of F is unbalanced;
compression ranges from 4.3:1 to 1.6:1.
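The row partitioning can be illustrated with a serial simulation. This is not the MPI code from the project: the loop over ranges stands in for CPUs running concurrently, the final concatenation stands in for the all-to-all exchange, and the equal-row-count split is an assumption that ignores the storage imbalance noted above.

```python
def split_rows(n, num_cpus):
    """Return contiguous [start, end) row ranges, one per CPU."""
    base, extra = divmod(n, num_cpus)
    ranges, start = [], 0
    for rank in range(num_cpus):
        end = start + base + (1 if rank < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

def parallel_step(R, B0, B, ranges):
    """One iteration: each 'CPU' updates its own rows, then slices are gathered."""
    slices = []
    for start, end in ranges:                      # would run concurrently
        slices.append([B0[i] + sum(r * b for r, b in zip(R[i], B))
                       for i in range(start, end)])
    return [x for part in slices for x in part]    # stands in for the all-to-all
```

Each CPU needs the full B vector to update its rows but only its own rows of the matrix, which is what keeps the per-iteration communication at O(n) while the computation is O(n^2).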
Radiosity Iteration Kernel
• Hand-written MMX assembly code (only
main inner loop shown):
inner_loop:
prefetchnta [ebx+128]
prefetchnta [eax+128]
movq mm4, [eax]
pshufw mm7, mm4, 0xFF
pshufw mm6, mm4, 0xAA
pshufw mm5, mm4, 0x55
pshufw mm4, mm4, 0x00
pmulhuw mm4, [ebx]
pmulhuw mm5, [ebx+6]
pmulhuw mm6, [ebx+12]
pmulhuw mm7, [ebx+18]
paddw mm0, mm4
paddw mm1, mm5
paddw mm2, mm6
paddw mm3, mm7
add eax, 8
add ebx, 24
dec esi
jnz inner_loop
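For readers who do not read MMX, the arithmetic of the loop can be emulated in Python. `pmulhuw` keeps the high 16 bits of an unsigned 16x16-bit product and `paddw` is a wrapping 16-bit add; those semantics are fixed by the instruction set. The data layout is an inference from the strides (8 bytes per pass from eax, 24 bytes from ebx): it appears that eax walks four 16-bit form factors and ebx walks four 6-byte RGB radiosity values. The slide does not state this, so treat the sketch as one plausible reading.

```python
def pmulhuw(a, b):
    """High 16 bits of the unsigned 16x16-bit product (one PMULHUW lane)."""
    return (a * b) >> 16

def paddw(a, b):
    """Wrapping 16-bit add (one PADDW lane)."""
    return (a + b) & 0xFFFF

def row_times_b(form_factors, radiosity_rgb):
    """Fixed-point dot product of one row of F with the B vector, per channel."""
    acc = [0, 0, 0]
    for f, rgb in zip(form_factors, radiosity_rgb):
        for c in range(3):
            acc[c] = paddw(acc[c], pmulhuw(f, rgb[c]))
    return acc
```

The assembly handles four form factors per pass with four separate accumulators, and each 64-bit register holds four 16-bit lanes while an RGB value fills only three, which matches the 25% waste mentioned on the performance slide.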
Radiosity Kernel Performance
• Timed 8 CPUs doing 500 iterations.
– Portable C fixed-point kernel: 38.04 s
– Optimized MMX kernel: 21.64 s
• On most loaded CPU, works out to 528
million multiply-adds/s for the C version,
1.24 billion multiply-adds/s for MMX.
• But MMX code wastes 25% of them, so real
rate is 928 million.
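The relationship between these numbers can be checked with a few lines of arithmetic. The sketch assumes both kernels perform the same amount of useful work, so the useful rate scales with the inverse of the run time; the variable names are illustrative.

```python
c_time, mmx_time = 38.04, 21.64    # seconds, from the slide
c_rate = 528e6                     # multiply-adds/s for the C kernel

speedup = c_time / mmx_time                   # about 1.76x
useful_mmx_rate = c_rate * speedup            # same useful work in less time
nominal_mmx_rate = useful_mmx_rate / 0.75     # one lane in four is wasted
```

This gives roughly 928 million useful and 1.24 billion nominal multiply-adds per second, consistent with the figures quoted.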
Local Lighting Implementation
• One raycast is required for each quad, for
each light source! This can be expensive.
• To accelerate raycasts, I made a simplified
version of my scene that was virtually
indistinguishable for raycasting purposes.
– 13028 quads reduced to 120 polys, 110
cylinders
– Some cylinders used as geometry, some as
bounding volumes
Overall Performance
• Again, 8 CPUs on 500 iterations:
– Iteration only: 21.64 s
– Communication only: 5.96 s
– Iteration plus communication: 26.84 s
– All computation: 65.7 s
– All computation plus communication: 64.38 s
Remarks on Performance
• The communication overlaps very well with the
computation, to the point that it is actually a
speedup. (!)
– MPI_Isend, MPI_Irecv are essential to achieving this.
• The O(n) local lighting computation is actually
taking much longer than the $O(n^2)$ radiosity
computation.
– Local lighting is only pseudo-O(n), because of the
raycast cost, although for large scenes, raycast cost
should still be much less than $O(n^2)$, due to other
optimizations.
Radiosity Frontend
• Separate client application that runs on a Windows
PC with OpenGL acceleration.
– Radiosity solver running on cluster is server.
• Original plan was that the frontend would send the
scene to the server, and the server would use the
scene provided.
– Since the cluster has no OpenGL acceleration, I was
reluctantly forced to precompute form factors.
– All aspects of scene except form factors still sent to the
server by the client; form factors are read from disk.
Client/Server Architecture
• User precomputes form factors and FTPs them to
the Beowulf.
• Server listens on beowulf.lcs.mit.edu:5353.
– Client connects and sends scene information.
• Server reads form factors off of disk.
• Both open a network thread.
– Server streams radiosity to client via TCP/IP.
– Computation, rendering, and communication are
completely decoupled.
Frontend Features
• Per-vertex or per-pixel local lighting, with
local viewer. Optional specular.
• Shadows implemented using stencil buffer.
• Display radiosity input/output (E and B).
• Bilinear filtering of radiosity solutions on
grid-shaped surfaces.
• Ultra-high-quality mode where radiosity is
used for indirect lighting only.
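Bilinear filtering of a radiosity solution stored on a grid can be sketched as below. How the frontend actually performs the filtering is not stated on the slides (graphics hardware typically does it during texture lookup), so this CPU version with the hypothetical name `bilinear_sample` only illustrates the interpolation itself.

```python
def bilinear_sample(grid, u, v):
    """Sample a 2D grid of radiosity values at fractional coordinates (u, v).

    u indexes columns and v indexes rows, both in grid-cell units.
    Coordinates outside the grid are clamped to the edge.
    """
    rows, cols = len(grid), len(grid[0])
    u = min(max(u, 0.0), cols - 1.0)
    v = min(max(v, 0.0), rows - 1.0)
    x0, y0 = int(u), int(v)
    x1, y1 = min(x0 + 1, cols - 1), min(y0 + 1, rows - 1)
    fx, fy = u - x0, v - y0
    top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
    bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
    return top * (1 - fy) + bottom * fy
```

Sampling the center of a 2x2 grid returns the average of its four values, which smooths the per-quad radiosity into a continuous gradient across the surface.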
Demonstration of Full System
Work Yet to be Done
• More complex scene would be nice. This one is
13K quads, and I should be able to do 50K.
• More optimization work on raycasting.
• Better load balancing.
• Optimization of some modes on frontend, so they
run reasonably on my laptop’s GeForce2 Go, not
just on a GeForce4 Ti…
• Alleviation of ugly banding on certain lighting
modes caused by 8-bit-per-component precision.
Conclusions
• Real-time radiosity is feasible.
– Not tomorrow, but today.
• If today’s cluster is tomorrow’s desktop, real-time
radiosity could start showing up in real
applications not too many years from now.
• Biggest limitation may be the ability to compute
form factors efficiently.
– Faster graphics hardware will make this happen.