International Networks and the US

Notes On the GAE
Harvey B. Newman
California Institute of Technology
Grid-enabled Analysis Environment Workshop
June 24, 2003
GAE Workshop Goals (1)
 “Getting Our Arms Around” the Grid-Enabled
Analysis “Problem”
 Review Existing Work Towards a GAE:
Components, Interfaces, System Concepts
 Review Client Analysis Tools; Consider How to Integrate Them
 User Interfaces: What Does the GAE Desktop Look Like?
(Different Flavors)

 Look At Requirements, Ideas for a GAE Architecture
 A Vision of the System’s Goals and Workings
 Attention to Strategy and Policy
 Develop (Continue) a Program of Simulations
of the System
 For the Computing Model, and Defining the GAE
 Essential for Developing a Feasible Vision; Developing
Strategies, Solving Problems and Optimizing the System
 With a Complementary Program of Prototyping
GAE Collaboration Desktop
Example
 Four-screen Analysis Desktop
4 Flat Panels: 5120 X 1024; RH9
 Driven by a single server and
single graphics card
 Allows simultaneous work on:
 Traditional analysis tools
(e.g. ROOT)
 Software development
 Event displays (e.g. IGUANA)
 MonALISA monitoring
displays; Other “Grid Views”
 Job-progress Views
 Persistent collaboration
(e.g. VRVS; shared windows)
 Online event or detector
monitoring
 Web browsing, email
GAE Workshop Goals (2)
 Architectural Approaches: Choose A Feasible Direction
For example, a Managed Services Architecture
Be Prepared to Learn by Doing;
Simulating and Prototyping
 Where to Start, and the Development Strategy
Existing and Missing Parts of the System
[Layers; Concepts]
When to Adapt Existing Components,
Or to Re-Build Them “from Scratch”
 Manpower Available to Meet the Goals; Shortfalls
 Allocation of Tasks; Including Generating a Plan
 Linkage Between Analysis and Grid-Enabled Production
 Planning for Closer Relationship with LCG, Trillium,
and the Experiments’ Early Efforts in this Area
HENP Grids: Services Architecture
Design for a Global System
 Self Discovering, Cooperative
 Registered Services, Lookup Services; self-describing
 “Spaces” for Mobile Code and Parameters
 Scalable and Robust
 Multi-threaded: with a thread pool managing engine
 Loosely Coupled: errors in a thread don’t stop the task
 Stateful: System State as well as task state
 Rich set of “problem” situations: implies Grid Views,
and User/System Dialogues on what to do
 For Example: Raise Priority (Burn Quota); or Redirect Work
 Eventually may be increasingly automated as
we scale up and gain experience
 Managed; to deal with a Complex Execution Environment
 Real time higher level supervisory services monitor,
track, optimize and Revive/Restart services as needed
 Policy and strategy-driven; Self-Evaluating and Optimizing
 Investable with increasing intelligence
 Agent Based; Evolutionary Learning Algorithms
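The "multi-threaded, loosely coupled, managed" properties listed above can be illustrated with a minimal Python sketch. It is not the GAE implementation: the function name run_loosely_coupled, the max_retries parameter and the retry policy are assumptions made for illustration. Subtasks run in a thread pool, a failing subtask is restarted by a simple supervisor loop, and a subtask that keeps failing is recorded without stopping the rest of the task.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_loosely_coupled(subtasks, max_workers=4, max_retries=1):
    """Run named callables in a thread pool.

    subtasks: dict mapping a name to a zero-argument callable (hypothetical
    interface). A failing subtask is resubmitted up to max_retries times;
    after that it is recorded in `failures` and the other subtasks carry on.
    """
    results, failures = {}, {}
    attempts = {name: 0 for name in subtasks}
    pending = dict(subtasks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            futures = {pool.submit(fn): name for name, fn in pending.items()}
            pending = {}
            for fut in as_completed(futures):
                name = futures[fut]
                attempts[name] += 1
                try:
                    results[name] = fut.result()
                except Exception as exc:
                    if attempts[name] <= max_retries:
                        # Supervisor role: revive/restart the failed subtask.
                        pending[name] = subtasks[name]
                    else:
                        # Give up on this subtask only; the task continues.
                        failures[name] = repr(exc)
    return results, failures
```

A real supervisory service would also track system state and apply policy (for example, redirecting work to another site) instead of retrying blindly.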
Getting Started Towards a Workable
GAE (1)
 Work on Computing Model (Essential) in Parallel
 Focus on a Few Scenarios for Doing Analysis
 “Grid Enabled PROOF” [in CMS; in ATLAS]
 Start with Existing Analysis Applications:
Can they be recast in GAE Form?
 Make Some Starting Assumptions
 Need some simple picture of persistency
 Supplementary considerations:
 Multiuser situation (e.g. with avatars; then Analysis Challenges)
 Coming to a few Either/Or Decisions
 List of rudimentary analysis tools, and way of working
 “External” to the application considerations:
 Job planning
 Key role of query estimation (not only beforehand)
 Transparency versus tracking
Getting Started Towards a Workable
GAE (2)
 Session or Sessions on the Desktop
 Three Modes of Working; All in the GAE
 Immediate (within a few seconds)
 In the background (seconds to a few minutes)
 Spawn batch job or jobs (minutes to hours)
 Decisions and tradeoffs
 Lay out the strategies and consequences (time, quota etc)
 Present Choices
 Monitor progress or get “alarms” and be prepared
to re-strategize
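As a rough illustration of the modes of working and the "present choices" step, the sketch below maps an estimated run time to a mode and lists the options a user can afford. The thresholds (5 s and 300 s), the Option fields and the function names are assumptions chosen for the example; the slides give only approximate time ranges.

```python
from dataclasses import dataclass

# Assumed boundaries between modes; the slides give only rough ranges.
IMMEDIATE_LIMIT_S = 5      # "within a few seconds"
BACKGROUND_LIMIT_S = 300   # "seconds to a few minutes"


def choose_mode(estimated_seconds):
    """Map an estimated run time to one of the three working modes."""
    if estimated_seconds <= IMMEDIATE_LIMIT_S:
        return "immediate"
    if estimated_seconds <= BACKGROUND_LIMIT_S:
        return "background"
    return "batch"


@dataclass
class Option:
    label: str                # hypothetical strategy name
    estimated_seconds: float  # estimated time to completion
    quota_cost: float         # quota the strategy would consume


def present_choices(options, quota_left):
    """Return affordable options, fastest first, each with its mode."""
    affordable = [o for o in options if o.quota_cost <= quota_left]
    affordable.sort(key=lambda o: o.estimated_seconds)
    return [(o.label, choose_mode(o.estimated_seconds), o.quota_cost)
            for o in affordable]
```

For example, a user with little quota left would see only the slower, cheaper strategies, each labelled with the mode it would run in.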
Getting Started Towards a Workable
GAE (3)
 Smart Caching: of Methods, of Data, or of Time-to-Process Info.
 Intelligence in the system does not only mean problem
solving
 Need to apply intelligence/experience to progressively improve
system performance
 Time-to-completion estimation: process a small amount of
data to get a realistic first estimate.
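The time-to-completion idea above can be sketched as follows: process a small sample, measure the per-event cost, and extrapolate linearly to the events that remain. The function name, the sample_size default and the linear-scaling assumption are illustrative and not taken from the slides; a real estimate would also account for data location and queue times.

```python
import time


def estimate_time_to_completion(process_event, events, sample_size=100,
                                clock=time.perf_counter):
    """Process the first `sample_size` events and extrapolate.

    Returns the estimated seconds needed for the remaining events,
    assuming the per-event cost of the sample is representative.
    """
    sample = events[:sample_size]
    if not sample:
        return 0.0
    start = clock()
    for event in sample:
        process_event(event)
    elapsed = clock() - start
    per_event = elapsed / len(sample)
    remaining = len(events) - len(sample)
    return per_event * remaining
```

The estimate can be refreshed while the job runs, so that it is not only made beforehand.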
3 Slides About Building a Computing
Model & the GAE System
 These Slides Focus on Simulation/Prototyping,
as an Integral part of designing and building distributed systems for
the GAE, and the Grid-Enabled Production Environment (GPE) as
well.
Building a Computing Model
and an Analysis Strategy (I)
 Generate a Blueprint: A “Computing Model”
 Tasks → Workload, Facilities, Priorities & GOALS
 Persistency; Modes of Accessing Data (e.g. Object Collections)
 What runs where; when to redirect
 The User’s Working Environment
 What is normal (managing expectations)?
 Guidelines for dealing with problems:
based on which information?
 Performance and problem reporting/tracking/handling
 Known Problems: Strategies to deal with those
 Set up, code a Simulation of the Model
 Develop mechanisms and sub-models as needed
 Set up prototypes to measure the performance parameters
where not already known to sufficient precision
Building a Computing Model
and an Analysis Strategy (II)
 Run simulations (avatars for “actors”; agents; tasks; mechanisms)
 Analyze and evaluate performance
 General performance (throughput; turnaround)
 Ensure “all” work is done (and learn how to do this) within a
reasonable time, compatible with the Collaboration’s guidelines
 Vary Model to Improve Performance
 Deal with bottlenecks and other problems
 New strategies and/or mechanisms to manage workflow
 Represent key features and behaviors, for example:
 Responses to Link or Site failures
 User input to redirect data or jobs
 Monitoring information gathering
 Monitoring and management agent actions and
behaviors in a variety of situations
 Validate the Model
 Using Dedicated setups
 Using Data Challenges (measure, evaluate, compare; fix key items)
 Learn of new factors and/or behaviors to take into account
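A toy version of the "run simulations, evaluate, vary the model" loop is sketched below. It is a deliberately small stand-in and not the simulation program referred to in the slides: jobs are served first-come, first-served on a pool of identical CPUs, and the function reports throughput and mean turnaround. The function name and the single-site, identical-CPU model are assumptions.

```python
import heapq


def simulate(jobs, cpus):
    """Simulate FIFO execution of jobs on `cpus` identical CPUs.

    jobs: list of (submit_time, duration) tuples, in seconds.
    Returns (throughput in jobs per second, mean turnaround in seconds).
    """
    free_at = [0.0] * cpus          # time at which each CPU becomes free
    heapq.heapify(free_at)
    turnarounds = []
    last_finish = 0.0
    for submit, duration in sorted(jobs):
        cpu_free = heapq.heappop(free_at)   # earliest available CPU
        start = max(submit, cpu_free)
        finish = start + duration
        heapq.heappush(free_at, finish)
        turnarounds.append(finish - submit)
        last_finish = max(last_finish, finish)
    if not turnarounds or last_finish == 0:
        return 0.0, 0.0
    return len(turnarounds) / last_finish, sum(turnarounds) / len(turnarounds)
```

Varying the model then amounts to rerunning with different parameters (here, the number of CPUs) and comparing the two figures. Link or site failures and user redirection would need additional event types.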
Building a Computing Model
and an Analysis Strategy (III)
MAJOR Milestone: Obtain a first picture of a Model that
Seems to Work
 This may or may not involve changes in the estimates of computing
resource requirements, or in Collaboration policies and expectations
 It is hard to estimate how long it will take to
reach this milestone
[most experiments until now have reached it
after the start of data taking]
Evolve the Model to
 Distinguish what works and what does not
 Incorporate evolving site hardware and network performance
 Progressively incorporate new and “better” strategies, to
improve throughput and/or turnarounds, or fix critical problems
 Take into account experience with the actual software-system
components as they develop
In parallel with the Model evolution keep developing the overall
data analysis + Grid + monitoring “system”; represent it in the simulation
 And the associated strategies