Transcript Document

Sort-First, Distributed Memory Parallel
Visualization and Rendering
Wes Bethel, R3vis Corporation and
Lawrence Berkeley National Laboratory
Parallel Visualization and Graphics Workshop
Sunday October 18, 2003,
Seattle, Washington
The Actual Presentation Title:

Why Distributed Memory Parallel Rendering is a
Challenge: Combining OpenRM Scene Graph and
Chromium for General Purpose Use on Distributed
Memory Clusters

Outline:
•
•
•
•
•
•
Problem Statement, Desired Outcomes
Sort-First Parallel Architectural Overview
The Many Ways I Broke Chromium
Scene Graph Considerations
Demo?
Conclusions
2
Motivation and Problem Statement

The Allure of COTS solutions:
•
•
•
•




Performance of COTS GPUs exceed that of custom silicon.
Attractive price/performance of COTS platforms (e.g., x86 PCs).
Gigabit Ethernet is cheap: $100/NIC, $500 8-port switch.
Can build a screamer cluster for about $2K/node.
We’re accustomed to “nice, friendly” software
infrastructure. E.g., hardware accelerated Xinerama.
Enter Chromium – the means to use a bunch of PCs to do
parallel rendering.
Parallel submission of graphics streams is a “custom
solution,” and presents challenges.
Want a flexible, resilient API to interface between parallel
visualization/rendering applications and Chromium.
3
4
Our Approach






Distributed memory parallel visualization application design: amortizes
expensive data I/O and visualization across many nodes.
The scene graph layer mediates interaction between the application
and the rendering subsystem: portability, “hide” the icky parallel
rendering details, provides an infrastructure for accelerating rendering.
Chromium provides routing of graphics commands to support
hardware accelerated rendering on a variety of platforms.
Focus on COTS solutions: all hardware and software we used is
cheap (PC cluster) or free (software).
Focus on simplicity: our sample applications are straightforward in
implementation, easily reproducible by others and highly portable.
Want an infrastructure that is suitable for use regardless of the type of
parallel programming model used by the application:
• No “parallel objects” in the Scene Graph!!!!!
5
Our Approach, ctd.
6
The Many Ways I Broke Chromium



Retained mode object “namespace” collisions in parallel
submission.
Broadcasting: how to burn up bandwidth without even
trying!
Scene Graph Issues: to be discussed in our PVG paper
presentation on Monday.
7
The “Collision” Problem


Want to use OpenGL retained-mode semantics and structures to
realize performance gains in DM environment
Problem:
• “Namespace collision” of retained-mode identifiers during parallel
submission of graphics commands.



Example:
Process A:
Process A:
GLuint n = glNewList(1);
GLuint n = glNewList(1);
printf(“ id==%d\n”);
printf(“ id==%d\n”,n);
// id==0
// id == 0
// build list, draw with list
// build list, draw with list
The problem exists for all OpenGL retained mode objects: display lists,
texture objects and programs.
The problem extends to all OpenGL retained mode objects: display
lists, texture object id’s, programs.
8
Manifestation of Collision Problem

Show image of four textured quads when the problem is
present.
9
Desired Result

Show image of four textured quads when the problem is
fixed.
10
Resolving the Collision Problem

New CR configuration file options:
• shared_textures, shared_display_lists, shared_programs
• When set to 1, beware of collisions in parallel submission.
• When set to 0, collisions resolved in parallel submission.

Using shared_* to zero will enforce “unique retained mode
identifiers” across all parallel submitters.

Thanks, Brian!
11
The Broadcast Problem

What’s the Problem?:
• Geometry and textures from N application PEs is replicated across
M crservers. Bumped into limits of memory and bandwidth.
• To Chromium, a display list is an opaque blob of stuff. Tilesort
doesn’t peek inside a display list to see where it should be sent.

Early performance testing showed two types of
broadcasting:
• Display lists being broadcast from one tilesort to all servers.
• Textures associated with textured geometry in display lists was
being broadcast from one tilesort to all servers.
12
Broadcast Workarounds (Short Term)


Help Chromium decide how to route textures with the
GL_OBJECT_BBOX_CR extension.
Don’t use display lists (for now).
• Immediate mode geometry isn’t broadcast. Sorting/routing is
accelerated using GL_OBJECT_BBOX_CR to provide hints to
tilesort; it doesn’t have to look at all vertices in a geometry blob to
perform routing decisions.
• For scientific visualization, which generates lots of geometry, this is
clearly a problem.
• Our volume rendering application uses 3D textures (N**3 data) and
textured geometry (N**2 data), so the “heavy payload” data isn’t
broadcast. The cost is immediate mode transmission of geometry
(approximately 36KB/frame of geometry as compared to
160MB/frame of texture data).
13
Broadcast Workarounds (Long Term)


Funding for Chromium developers to implement display list
caching and routing, similar to existing capabilities to
manage texture objects.
Lurking problems:
• “Aging” of display lists: the crserver is like a “roach motel”: display
lists check in, but they never check out.
• Adding retained mode object aging and management to
applications is an unreasonable burden (IMO).
• There exists no commonly accepted mechanism for LRU aging,
etc. in the graphics API. Such an aging mechanism will probably
show up as an extension.
• Better as an extension with tunable parameters than requiring
applications to “reach deeply” into graphics API implementations.
14
Scene Graph Issues and Performance
Analysis

Discussed in our 2003 PVG Paper (Monday afternoon).
• Our “parallel scene graph” implementation can be used by any
parallel application, regardless of the type of parallel programming
model used by the application developer.
• The “big issue” in sort-first is how much data is duplicated. What
we’ve seen so far is about 1.8x duplication was required for the
first frame (in a hardware accelerated volume rendering
application).
• While the scene graph supports any type of parallel operation,
certain types of synchronization are required to ensure correct
rendering. These can be achieved using only Chromium barriers –
no “parallel objects” in the scene graph are required.
15
Some Performance Graphs

Bandwidth vs. Traffic
Peak Bandwidth, LOD vs. Non-LOD, Six Displays
70
60
40
6 Displays LOD
30
6 Displays No LOD
20
Aggregate 3D Texture Traffic
10
0
400
13
Frame Number
15
17
4 Displays
300
6 Displays
200
9 Displays
100
16 Displays
0
20 Displays
16
11
13
9
10
7
7
5
4
3
1
1
MBytes
MBytes
50
24 Displays
Fram e Num ber
16
Parallel Scene Graph API Stuff

Collectives
• rmPipeSetCommSize(), rmPipeGetCommSize()
• rmPipeSetMyRank(), rmPipeGetMyRank()

Chromium-specific
• rmPipeBArrierCreateCR()
• Creates a Chromium barrier, number of participants is set by
value specified using rmPipeSetCommSize()
• rmPipeBarrierExecCR()
• Doesn’t block application code execution.
• Used to synchronize rendering execution from
rmPipeGetCommSize() streams of graphics commands.
17
Demo Applications: Parallel Isosurface
18
Demo Application: Parallel Volume
Rendering
19
Demo Application: Parallel Volume
Rendering with LOD Volumes
20
Conclusions

We met our objectives:
• General purpose infrastructure for doing parallel visualization and
hardware accelerated rendering on PC clusters.
• The infrastructure can be used by any parallel application,
regardless of the parallel programming model.



The architecture scaled well from one to 24 displays,
supporting extremely high-resolution output (e.g,
7860x4096).
We bumped into network bandwidth limits (not a big
surprise).
Display lists are still broadcast in Chromium. Please fund
them to add this much-needed capability, which is
fundamental for efficient sort-first operation of clusters.
21
Sources of Software



OpenRM Scene Graph: www.openrm.org
Source code for OpenRM+Chromium applications:
www.openrm.org.
Chromium: chromium.sourceforge.net.
22
Acknowledgement



This work was supported by the U. S. Department of
Energy, Office of Science, Office of Advanced Scientific
Computing Research under SBIR grant DE-FE0302ER83443.
The authors wish to thank Randall Frank of the
ASCI/VIEWS program at Lawrence Livermore National
Laboratory and the Scientific Computing and Imaging
Institute at the University of Utah for use of computing
facilities during the course of this research.
The Argon shock bubble dataset was provided courtesy of
John Bell and Vince Beckner at the Center for
Computational Sciences and Engineering, Lawrence
Berkeley National Laboratory.
23