International Networks and the US
Notes On the GAE
Harvey B. Newman
California Institute of Technology
Grid-enabled Analysis Environment Workshop
June 24, 2003
GAE Workshop Goals (1)
“Getting Our Arms Around” the Grid-Enabled
Analysis “Problem”
Review Existing Work Towards a GAE:
Components, Interfaces, System Concepts
Review Client Analysis Tools; Consider How to Integrate Them
User Interfaces: What Does the GAE Desktop Look Like?
(Different Flavors)
Look At Requirements, Ideas for a GAE Architecture
A Vision of the System’s Goals and Workings
Attention to Strategy and Policy
Develop (Continue) a Program of Simulations
of the System
For the Computing Model, and Defining the GAE
Essential for Developing a Feasible Vision; Developing
Strategies, Solving Problems and Optimizing the System
With a Complementary Program of Prototyping
GAE Collaboration Desktop
Example
Four-screen Analysis Desktop
4 Flat Panels: 5120 X 1024; RH9
Driven by a single server and
single graphics card
Allows simultaneous work on:
Traditional analysis tools
(e.g. ROOT)
Software development
Event displays (e.g. IGUANA)
MonALISA monitoring
displays; Other “Grid Views”
Job-progress Views
Persistent collaboration
(e.g. VRVS; shared windows)
Online event or detector
monitoring
Web browsing, email
GAE Workshop Goals (2)
Architectural Approaches: Choose A Feasible Direction
For example a Managed Services Architecture
Be Prepared to Learn by Doing;
Simulating and Prototyping
Where to Start, and the Development Strategy
Existing and Missing Parts of the System
[Layers; Concepts]
When to Adapt Existing Components,
Or to Re-Build Them “from Scratch”
Manpower Available to Meet the Goals; Shortfalls
Allocation of Tasks; Including Generating a Plan
Linkage Between Analysis and Grid-Enabled Production
Planning for Closer Relationship with LCG, Trillium,
and the Experiments’ starting Efforts in this area
HENP Grids: Services Architecture
Design for a Global System
Self Discovering, Cooperative
Registered Services, Lookup Services; self-describing
“Spaces” for Mobile Code and Parameters
Scalable and Robust
Multi-threaded: with a thread pool managing engine
Loosely Coupled: errors in a thread don’t stop the task
Stateful: System State as well as task state
Rich set of “problem” situations: implies Grid Views,
and User/System Dialogues on what to do
For Example: Raise Priority (Burn Quota); or Redirect Work
Eventually may be increasingly automated as
we scale up and gain experience
Managed; to deal with a Complex Execution Environment
Real time higher level supervisory services monitor,
track, optimize and Revive/Restart services as needed
Policy and strategy-driven; Self-Evaluating and Optimizing
Investable with increasing intelligence
Agent Based; Evolutionary Learning Algorithms
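A minimal sketch of the "Multi-threaded" and "Loosely Coupled" points above, assuming Python's standard thread pool as a stand-in for the thread-pool managing engine. The names process_chunk and run_task are hypothetical and not part of any GAE component; one chunk is made to fail on purpose to show that an error in one thread does not stop the task.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_chunk(chunk_id):
    """Hypothetical unit of work; chunk 2 fails to mimic a faulty thread."""
    if chunk_id == 2:
        raise RuntimeError(f"chunk {chunk_id} failed")
    return chunk_id * 10


def run_task(chunk_ids, max_workers=4):
    """Run every chunk in a thread pool; keep results and failures apart."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_chunk, c): c for c in chunk_ids}
        for fut in as_completed(futures):
            cid = futures[fut]
            try:
                results[cid] = fut.result()
            except Exception as exc:
                # The failure is recorded as task state; other chunks continue.
                failures[cid] = str(exc)
    return results, failures


results, failures = run_task(range(5))
```

The failures dictionary is the kind of state a supervisory service could inspect to decide whether to restart or redirect the failed piece of work.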
Getting Started Towards a Workable
GAE (1)
Work on Computing Model (Essential) in Parallel
Focus on a Few Scenarios for Doing Analysis
“Grid Enabled PROOF” [in CMS; in ATLAS]
Start with Existing Analysis Applications:
Can they be recast in GAE Form?
Make Some Starting Assumptions
Need some simple picture of persistency
Supplementary considerations:
Multiuser situation (e.g. with avatars; then Analysis Challenges)
Coming to a few Either/Or Decisions
List of rudimentary analysis tools, and way of working
“External” to the application considerations:
Job planning
Key role of query estimation (not only beforehand)
Transparency versus tracking
Getting Started Towards a Workable
GAE (2)
Session or Sessions on the Desktop
Three Modes of Working; All in the GAE
Immediate (within a few seconds)
In the background (seconds to a few minutes)
Spawn batch job or jobs (minutes to hours)
Decisions and tradeoffs
Lay out the strategies and consequences (time, quota etc)
Present Choices
Monitor progress or get “alarms” and be prepared
to re-strategize
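The three modes of working and the "Present Choices" step might be sketched as follows. This is an illustration only: the numeric thresholds (5 s and 300 s), the Strategy record, and the function names are assumptions chosen to match "a few seconds" and "a few minutes", not values from the slides.

```python
from dataclasses import dataclass

# Assumed boundaries between the three modes of working.
IMMEDIATE_MAX_S = 5.0      # "within a few seconds"
BACKGROUND_MAX_S = 300.0   # "seconds to a few minutes"


@dataclass
class Strategy:
    name: str
    est_seconds: float   # estimated time to completion
    quota_cost: float    # hypothetical quota units consumed


def working_mode(est_seconds):
    """Map an estimated duration onto one of the three modes."""
    if est_seconds <= IMMEDIATE_MAX_S:
        return "immediate"
    if est_seconds <= BACKGROUND_MAX_S:
        return "background"
    return "batch"


def present_choices(strategies):
    """List the strategies with their consequences, fastest first."""
    ordered = sorted(strategies, key=lambda s: s.est_seconds)
    return [(s.name, working_mode(s.est_seconds), s.est_seconds, s.quota_cost)
            for s in ordered]


choices = present_choices([
    Strategy("local cache", 3.0, 0.0),
    Strategy("remote site, raised priority", 120.0, 5.0),
    Strategy("full dataset", 7200.0, 1.0),
])
```

Each row carries the time and quota consequences, so the user (or later an automated agent) can pick a strategy and re-strategize if an alarm arrives.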
Getting Started Towards a Workable
GAE (3)
Smart Caching: Of Methods, of Data, or of Time-to-Process Info.
Intelligence in the system does not only mean problem
solving
Need to apply intelligence/experience to progressively improve
system performance
Time-to-completion estimation: process a small amount of
data to get a realistic first estimate.
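The time-to-completion idea can be sketched as: time a small sample, then scale the per-event cost to the events that remain. The function name, its parameters, and the injectable clock are assumptions made for illustration; a real estimate would also need to account for data access and queueing time.

```python
import time


def estimate_remaining_seconds(process_event, events, sample_size=100,
                               clock=time.perf_counter):
    """Process the first sample_size events and extrapolate to the rest."""
    sample = events[:sample_size]
    if not sample:
        return 0.0
    start = clock()
    for ev in sample:
        process_event(ev)
    elapsed = clock() - start
    per_event = elapsed / len(sample)
    remaining = len(events) - len(sample)
    # First estimate only; it can be refreshed as more events are processed.
    return per_event * remaining


# Usage with a fake clock so the result is reproducible:
ticks = iter([0.0, 2.0])           # the sample "takes" 2 seconds
estimate = estimate_remaining_seconds(
    lambda ev: None, list(range(1000)), sample_size=10,
    clock=lambda: next(ticks))
```

With 10 sampled events taking 2 seconds in total, the 990 remaining events are estimated at roughly 198 seconds.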
3 Slides About Building a Computing
Model & the GAE System
These Slides Focus on Simulation/Prototyping,
as an Integral part of designing and building distributed systems for
the GAE, and the Grid-Enabled Production Environment (GPE) as
well.
Building a Computing Model
and an Analysis Strategy (I)
Generate a Blueprint: A “Computing Model”
Tasks, Workload, Facilities, Priorities & GOALS
Persistency; Modes of Accessing Data (e.g. Object Collections)
What runs where; when to redirect
The User’s Working Environment
What is normal (managing expectations)?
Guidelines for dealing with problems:
based on which information?
Performance and problem reporting/tracking/handling?
Known Problems: Strategies to deal with those
Set up, code a Simulation of the Model
Develop mechanisms and sub-models as needed
Set up prototypes to measure the performance parameters
where not already known to sufficient precision
Building a Computing Model
and an Analysis Strategy (II)
Run simulations (avatars for “actors”; agents; tasks; mechanisms)
Analyze and evaluate performance
General performance (throughput; turnaround)
Ensure “all” work is done, and learn how to do this within a
reasonable time, compatible with the Collaboration’s guidelines
Vary Model to Improve Performance
Deal with bottlenecks and other problems
New strategies and/or mechanisms to manage workflow
Represent key features and behaviors, for example:
Responses to Link or Site failures
User input to redirect data or jobs
Monitoring information gathering
Monitoring and management agent actions and
behaviors in a variety of situations
Validate the Model
Using Dedicated setups
Using Data Challenges (measure, evaluate, compare; fix key items)
Learn of new factors and/or behaviors to take into account
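As a toy illustration of running a simulation, measuring throughput and turnaround, and representing a response to a site failure, the sketch below assigns jobs to whichever site frees up first and redirects work away from a site once it has failed. The model, the function name simulate, and all parameters are assumptions for illustration; it is far simpler than a real Computing Model simulation.

```python
import heapq


def simulate(job_lengths, site_speeds, failed_site=None, fail_time=0.0):
    """Toy model: all jobs are submitted at t=0 and run one per site at a time.

    A failed site accepts no new jobs from fail_time on (a job already
    started there is assumed to finish). Returns (makespan, mean_turnaround).
    """
    if not job_lengths:
        return 0.0, 0.0
    free_at = [(0.0, i) for i in range(len(site_speeds))]
    heapq.heapify(free_at)
    finish_times = []
    for length in job_lengths:
        while True:
            if not free_at:
                raise RuntimeError("no site available")
            t, site = heapq.heappop(free_at)
            if site == failed_site and t >= fail_time:
                continue  # site is down: drop it and redirect the job
            break
        done = t + length / site_speeds[site]
        finish_times.append(done)
        heapq.heappush(free_at, (done, site))
    makespan = max(finish_times)
    mean_turnaround = sum(finish_times) / len(finish_times)
    return makespan, mean_turnaround


baseline = simulate([4, 4, 4, 4], [1.0, 1.0])
with_failure = simulate([4, 4, 4, 4], [1.0, 1.0], failed_site=1, fail_time=2.0)
```

Comparing baseline with with_failure shows how a single site failure lengthens both the makespan and the mean turnaround, which is the kind of comparison used when varying the model to improve performance.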
Building a Computing Model
and an Analysis Strategy (III)
MAJOR Milestone: Obtain a first picture of a Model that
Seems to Work
This may or may not involve changes in the computing resource
requirements-estimates; or Collaboration policies and expectations
It is hard to estimate how long it will take to
reach this milestone
[most experiments until now have reached it
after the start of data taking]
Evolve the Model to
Distinguish what works and what does not
Incorporate evolving site hardware and network performance
Progressively incorporate new and “better” strategies, to
improve throughput and/or turnarounds, or fix critical problems
Take into account experience with the actual software-system
components as they develop
In parallel with the Model evolution keep developing the overall
data analysis + Grid + monitoring “system”; represent it in the simulation
And the associated strategies