Distributed Systems - The University of Alabama in Huntsville

Distributed Systems
Early Examples
Projects
• NOW – a Network Of Workstations
University of California, Berkeley
Terminated about 1997 after demonstrating
the feasibility of their approach
• Condor – University of Wisconsin-Madison
Started about 1988
Is now an ongoing, world-wide system of
shared computing clusters.
NOW: a Network of Workstations
http://now.cs.berkeley.edu/
• Scale: A NOW system consists of a building-wide collection of machines providing
memory, disks, and processors.
• Basic Ideas
– Use idle CPU cycles for parallel processing on
clusters of workstations
– Use memories as disk cache to break the I/O
bottleneck (slow disk access times)
– Share the resources over fast LANs
NOW Opportunities: “Memory”
• Network RAM: fast networks, high bandwidth
make it reasonable to page across the network.
– Instead of paging out to slow disks, send over fast
networks to RAM in an idle machine
• Cooperative file caching: improve performance by using network RAM as a very large file cache (a sketch follows below)
– Shared files can be fetched from another client’s memory rather than from the server’s disk
– Active clients can extend their disk cache size by using the memory of idle clients.
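To make the lookup order concrete, here is a minimal Python sketch of the cooperative-caching idea: check local memory first, then an idle peer's RAM over the (assumed fast) network, and only then fall back to the server's disk. The class and function names are illustrative assumptions, not part of NOW or xFS, and the peer lookup is simulated with in-process dictionaries.

# Cooperative file caching sketch (illustrative only; not NOW/xFS code).
# Lookup order: local RAM cache -> an idle peer's RAM -> the server's disk.

class CooperativeCache:
    def __init__(self, peers, read_from_server):
        self.local = {}                           # this client's in-memory block cache
        self.peers = peers                        # stand-in for idle clients' RAM
        self.read_from_server = read_from_server  # slow path: server disk read

    def read_block(self, block_id):
        if block_id in self.local:                # 1. local memory: fastest
            return self.local[block_id]
        for peer in self.peers:                   # 2. another client's memory: one
            if block_id in peer:                  #    network hop, faster than disk
                data = peer[block_id]
                self.local[block_id] = data
                return data
        data = self.read_from_server(block_id)    # 3. server's disk: the slow path
        self.local[block_id] = data
        return data

if __name__ == "__main__":
    idle_peer = {"blk-7": b"cached on an idle workstation"}
    cache = CooperativeCache([idle_peer], lambda b: b"read from the server's disk")
    print(cache.read_block("blk-7"))   # served from peer RAM
    print(cache.read_block("blk-9"))   # falls back to the server's disk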
NOW Opportunities: “RAWD”
(Redundant Arrays of Workstation Disks)
• RAID systems provide fast performance by
connecting arrays of small disks. By reading/
writing data in parallel, throughput is
increased.
• Instead of a hardware RAID, build a software version by striping data across the workstations in the network (sketched below)
– Especially useful for parallel programs running on
separate machines in the network
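A toy Python sketch of the software-striping idea follows, with local directories standing in for disks on different workstations; the names and the 4 KB stripe unit are arbitrary assumptions, and a real system would issue the per-disk reads and writes in parallel rather than sequentially as here.

# RAID-0-style striping across "workstation disks" (here: local directories).
# Illustrative sketch only, not the NOW software-RAID implementation.
import os

BLOCK = 4096  # assumed stripe unit

def stripe_write(data, disks, name):
    """Write data round-robin across the given 'disks' (directories)."""
    for d in disks:
        os.makedirs(d, exist_ok=True)
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    for i, block in enumerate(blocks):
        disk = disks[i % len(disks)]              # round-robin placement
        with open(os.path.join(disk, f"{name}.{i}"), "wb") as f:
            f.write(block)

def stripe_read(disks, name):
    """Read the stripes back in order and reassemble the data."""
    out, i = [], 0
    while True:
        path = os.path.join(disks[i % len(disks)], f"{name}.{i}")
        if not os.path.exists(path):
            break
        with open(path, "rb") as f:
            out.append(f.read())
        i += 1
    return b"".join(out)

if __name__ == "__main__":
    disks = ["ws1_disk", "ws2_disk", "ws3_disk"]
    stripe_write(b"x" * 10000, disks, "demo")
    assert stripe_read(disks, "demo") == b"x" * 10000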
NOW Opportunities: “Parallel Computing”
• Harnessing the power of multiple idle workstations in a NOW can support high-performance parallel applications (see the sketch below).
• NOW principles:
– Avoid going to disk by using RAM on other network nodes (assumes the network is faster than disk)
– Further speedup may be achieved by parallelizing the computation and striping the data to multiple disks
– Allow user processes to access the network directly rather than going through the operating system
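A minimal sketch of the scatter/gather pattern these principles enable: split the input, process the chunks in parallel, and combine the partial results. Python's standard process pool is used here as a purely local stand-in for idle workstations; in an actual NOW the chunks would be shipped to remote nodes over the fast LAN.

# Scatter/gather sketch; a local process pool stands in for idle workstations.
from concurrent.futures import ProcessPoolExecutor

def process_chunk(chunk):
    # Placeholder for the real per-node computation.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, workers=4):
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_chunk, chunks))   # gather the partial results

if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(100_000))))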
Berkeley NOW Features
• GLUnix (Global Layer UNIX) is a layer on top of
UNIX OS’s running on the workstations
• Applications running on GLUnix have a
protected virtual operating system layer which
catches UNIX system calls and translates them
into GLUnix calls.
• Serverless Network File System – xFS
– Avoids central server bottleneck
– Cooperative file system (basically, peer-to-peer)
Summary
• Successful, in their opinion.
• Ran for several years in the late 90s on the
Berkeley CS system
• Key enabling technologies
– Scalable, high performance network
– Fast access to the network for user processes
– Global operating system layer to support system
resources as a true shared pool.
CONDOR
http://research.cs.wisc.edu/condor/
• Goal: “…to develop, implement, deploy, and evaluate
mechanisms and policies that support High Throughput
Computing (HTC) on large collections of distributively
owned computing resources”
– HTC computing – “… problems that require weeks or months
of computation to solve. …this type of research need a
computing environment that delivers large amounts of
computational power over a long period of time.”
– Compare to High Performance computing (HPC) which
“…delivers a tremendous amount of power over a short
period of time.”
Overview
• Condor can be used to manage computing
clusters. It is designed to take advantage of
idle machines
• Condor lets users submit many jobs at the same time, yielding tremendous amounts of computation with very little user intervention (see the submission sketch below).
• No need to rewrite code - just link to Condor
libraries
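As an illustration of submitting many jobs at once, here is a short sketch using the htcondor Python bindings (pip install htcondor). The executable, file names, and job count are arbitrary assumptions, and the exact submission call differs across HTCondor versions (older releases used a transaction-based queue() API instead of Schedd.submit with a count).

# Sketch: queue 100 instances of one job description with the htcondor bindings.
import htcondor

sub = htcondor.Submit({
    "executable": "/bin/sleep",          # stand-in workload
    "arguments": "60",
    "output": "sleep.$(Process).out",    # $(Process) expands to 0..99
    "error": "sleep.$(Process).err",
    "log": "sleep.log",
})

schedd = htcondor.Schedd()               # talk to the local condor_schedd
result = schedd.submit(sub, count=100)   # one submit call, 100 jobs
print("submitted cluster", result.cluster())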
Features
• Checkpoints: save the complete state of the process (a simple checkpointing sketch follows below)
– Critical for long-running programs: they can recover from crashes, vacate a machine whose owner has returned, or migrate to another machine for other reasons.
• Remote system calls: data resides on the
home machine and system calls are directed
there. Provides protection for host machines.
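Condor's checkpointing captures the full state of the process transparently; the Python sketch below only illustrates the idea at the application level, periodically saving and restoring its own state so a restarted run resumes where it left off. The file name and checkpoint interval are arbitrary assumptions.

# Application-level checkpointing sketch (Condor checkpoints the whole process
# image transparently, which is far more general than this).
import os
import pickle

CHECKPOINT = "state.ckpt"   # assumed checkpoint file name

def load_state():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)          # resume where we left off
    return {"i": 0, "total": 0}            # fresh start

def save_state(state):
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)            # atomic rename: never a torn checkpoint

if __name__ == "__main__":
    state = load_state()
    for i in range(state["i"], 1_000_000):
        state["total"] += i
        state["i"] = i + 1
        if i % 100_000 == 0:
            save_state(state)              # periodic checkpoint
    save_state(state)
    print(state["total"])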
Features
• Jobs can run anywhere in the cluster (which
can be a physical cluster, a virtual cluster, or
even a single machine)
• Different machines have different capabilities; when submitting a job, Condor users can specify the kind of machine they wish to run on.
• When sets of jobs are submitted, it’s possible to define dependencies, e.g., “don’t run Job 3 until Jobs 1 and 2 have completed” (see the sketch below).
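In Condor such dependencies are handed to its DAGMan tool as a directed acyclic graph of jobs; the Python sketch below only illustrates the ordering logic, running each "job" (a plain function here) once all of its parents have finished. The job names and structure are made up for the example.

# Dependency-ordered execution sketch (Condor's DAGMan does this for real jobs).
def run_dag(jobs, parents):
    """jobs: name -> callable; parents: name -> set of prerequisite job names."""
    done, remaining = set(), set(jobs)
    while remaining:
        ready = [j for j in remaining if parents.get(j, set()) <= done]
        if not ready:
            raise ValueError("cycle or missing dependency")
        for j in ready:                    # DAGMan runs independent jobs in parallel
            jobs[j]()
            done.add(j)
            remaining.remove(j)

if __name__ == "__main__":
    jobs = {name: (lambda n=name: print("running", n)) for name in ("job1", "job2", "job3")}
    parents = {"job3": {"job1", "job2"}}   # Job 3 waits for Jobs 1 and 2
    run_dag(jobs, parents)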
Slides
• From a talk by Myron Livny, “The Principles
and Power of Distributed Computing”,
International Winter School on Grid
Computing 2010.
http://research.cs.wisc.edu/condor/talks.html
• Livny is a professor at the University of Wisconsin-Madison, where he heads the Condor Project and other grid/distributed-computing projects and centers.
Condor Daemons
• Title unknown, by Hans Holbein the Younger,
from Historiarum Veteris Testamenti icones,
1543
Condor Daemons
[Diagram of the Condor daemons: negotiator, collector, master, shadow, schedd, procd, startd, starter, kbdd, exec]
Condor today
• http://research.cs.wisc.edu/condor/map/