Recent Progress on Scaleable Servers
Jim Gray, Microsoft Research
Substantial progress has been made towards the goal of building
supercomputers by composing arrays of commodity processors,
disks, and networks into a cluster that provides a single system
image. True, vector-supers still are 10x faster than commodity
processors on certain floating point computations, but they cost
disproportionately more. Indeed, the highest-performance
computations are now performed by processor arrays. In the
broader context of business and internet computing, processor
arrays long ago surpassed mainframe performance, and for a tiny
fraction of the cost. This talk first reviews this history and
describes the current landscape of scaleable servers in the
commercial, internet, and scientific segments. The talk then
discusses the Achilles heels of scaleable systems: programming
tools and system management. There has been relatively little
progress in either area. This suggests some important research
areas for computer systems research.
Outline
• Scaleability: MAPS
• Scaleup has limits, scaleout for really big jobs
• Two generic kinds of computing:
– many little & few big
• Many little has credible programming model
– tp, web, mail, fileserver,… all based on RPC
• Few big has marginal success (best is DSS)
• Rivers and objects
Scaleability: Scale Up and Scale Out
• Grow up with SMP: a 4xP6 SMP super server is now standard.
• Grow out with a cluster: the cluster has inexpensive parts.
  (personal system → departmental server → cluster of PCs)
Key Technologies
• Hardware
  – commodity processors
  – nUMA
  – Smart Storage
  – SAN/VIA
• Software
  – Directory Services
  – Security Domains
  – Process/Data migration
  – Load balancing
  – Fault tolerance
  – RPC/Objects
  – Streams/Rivers
MAPS - The Problems
• Manageability: N machines are N times harder to manage
• Availability: N machines fail N times more often
• Programmability: N machines are 2N times harder to program
• Scaleability: N machines cost N times more,
  but do little more work.
Manageability
• Goal: systems self-managing
• N systems as easy to manage as one system
• Some progress:
  – distributed name servers (give transparent naming)
  – distributed security
  – auto cooling of disks
  – auto scheduling and load balancing
  – global event log (reporting)
  – automate most routine tasks
• Still very hard and app-specific
Availability
• Redundancy allows failover/migration
(processes, disks, links)
• Good progress on technology (theory and practice)
• Migration also good for load balancing
• Transaction concept helps exception handling
Programmability & Scaleability
• That’s what the rest of this talk is about
• Success on embarrassingly parallel jobs
– file server, mail, transactions, web, crypto
• Limited success on “batch”
– relational DBMSs, PVM, …
Outline
• Scaleability: MAPS
• Scaleup has limits, scaleout for really big jobs
• Two generic kinds of computing:
– many little & few big
• Many little has credible programming model
– tp, web, mail, fileserver,… all based on RPC
• Few big has marginal success (best is DSS)
• Rivers and objects
Scaleup Has Limits
(chart courtesy of Catharine Van Ingen)
[Chart: Mflop/s/$K vs Mflop/s on log-log axes; systems plotted include
LANL Loki P6 Linux, the NAS Expanded Linux Cluster, Cray T3E, IBM SP,
SGI Origin 2000/195, Sun Ultra Enterprise 4000, and UCB NOW.]
• Vector supers ~ 10x supers
  – ~3 GFlops
  – bus/memory ~ 20 GBps
  – IO ~ 1 GBps
• Supers ~ 10x PCs
  – ~300 Mflops
  – bus/memory ~ 2 GBps
  – IO ~ 1 GBps
• PCs are slow
  – ~30 Mflops
  – and bus/memory ~ 200 MBps
  – and IO ~ 100 MBps
Loki: Pentium Clusters for Science
http://loki-www.lanl.gov/
  16 Pentium Pro processors
  x 5 Fast Ethernet interfaces
  + 2 GBytes RAM
  + 50 GBytes disk
  + 2 Fast Ethernet switches
  + Linux
  = 1.2 real Gflops for $63,000
  (but that is the 1996 price)
The Beowulf project is similar:
http://cesdis.gsfc.nasa.gov/pub/people/becker/beowulf.html
• Scientists want cheap mips.
Your Tax Dollars At Work
ASCI for Stockpile Stewardship
• Intel/Sandia: 9000x1-node PPro
• LLNL/IBM: 512x8 PowerPC (SP2)
• LANL/Cray: ?
• Maui Supercomputer Center
  – 512x1 SP2
TOP500 Systems by Vendor
(courtesy of Larry Smarr NCSA)
[Chart: number of TOP500 systems (0-500) by vendor, Jun-93 through Jun-98;
vendors include CRI, SGI, IBM, Convex/HP, Sun, TMC, Intel, DEC,
Japanese vector machines, and Other.]
TOP500 Reports: http://www.netlib.org/benchmark/top500.html
NCSA Super Cluster
http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html
• National Center for Supercomputing Applications
University of Illinois @ Urbana
• 512 Pentium II cpus, 2,096 disks, SAN
• Compaq + HP + Myricom + WindowsNT
• A supercomputer for $3M
• Classic Fortran/MPI programming
• DCOM programming model
A Variety of Discipline Codes: Single-Processor Performance, Origin vs. T3E
nUMA vs UMA
(courtesy of Larry Smarr NCSA)
[Chart: single-processor MFLOPS (0-160) on Origin and T3E for the codes
QMC, RIEMANN, Laplace, QCD, PPM, PIMC, and ZEUS.]
Basket of Applications: Average Performance as a
Percentage of Linpack Performance
(courtesy of Larry Smarr NCSA)
[Chart: Linpack vs. application-average MFLOPS for T90, C90, SPP2000,
SP2/160, Origin 195, and PCA; the application basket (CFD, biomolecular,
chemistry, materials, QCD) averages 14%-33% of Linpack performance.]
Observations
• Uniprocessor RAP << PAP
  – real application performance << peak advertised performance
• Growth has slowed (Bell Prize):
  – 1987: 0.5 GFLOPS
  – 1988: 1.0 GFLOPS (1 year)
  – 1990: 14 GFLOPS (2 years)
  – 1994: 140 GFLOPS (4 years)
  – 1998: 604 GFLOPS
  – xxx: 1 TFLOPS (5 years?)
• Time gap = 2^(N-1) or 2N-1 years, where N = log10(performance) - 9
“Commercial” Clusters
• 16-node cluster
  – 64 cpus
  – 2 TB of disk
  – decision support
• 45-node cluster
  – 140 cpus
  – 14 GB DRAM
  – 4 TB RAID disk
  – OLTP (Debit Credit): 1 B tpd (14 k tps)
• Oracle/NT
  – 27,383 tpmC
  – 71.50 $/tpmC
  – 24 cpus (4 x 6), 384 disks (= 2.7 TB)
Microsoft.com: ~150x4 nodes
[Site diagram: the web farm spans the main corporate data center, MOSWest,
and the European and Japanese data centers. Services include
www.microsoft.com, home, premium, register, search, support, msid.msn.com,
activex, cdm, and FTP download servers, fed by staging servers (Building 11,
DMZ, IDC), download replication, SQL consolidators, SQL reporting, and live
SQL Servers. A typical node is a 4xP5 or 4xP6 with 256 MB-1 GB RAM and
12-160 GB of disk, at an average cost of roughly $25K-$83K. The sites are
tied together by switched Ethernet, FDDI rings (MIS1-MIS4), and
primary/secondary gigaswitches, with 13 DS3 links (45 Mb/sec each) and
2 OC3 links to the Internet.]
The Microsoft TerraServer Hardware
• Compaq AlphaServer 8400
• 8 x 400 Mhz Alpha cpus
• 10 GB DRAM
• 324 x 9.2 GB StorageWorks disks
  – 3 TB raw, 2.4 TB of RAID5
• STK 9710 tape robot (4 TB)
• WindowsNT 4 EE, SQL Server 7.0
TerraServer: Example
Lots of Web Hits
               Total    Average   Peak
  Hits         913 m    10.3 m    29 m
  Queries      735 m     8.0 m    18 m
  Images       359 m     3.0 m     9 m
  Page Views   405 m     5.0 m     9 m
• 1 TB, the largest SQL DB on the Web
• 99.95% uptime since 1 July 1998
• No downtime in August
• No NT failures (ever)
• Most downtime is for SQL software upgrades
HotMail: ~400 Computers
Outline
• Scaleability: MAPS
• Scaleup has limits, scaleout for really big jobs
• Two generic kinds of computing:
  – many little & few big
• Many little has credible programming model
  – tp, web, mail, fileserver,… all based on RPC
• Few big has marginal success (best is DSS)
• Rivers and objects
Two Generic Kinds of Computing
• Many little
  – embarrassingly parallel
  – fit RPC model
  – fit partitioned data and computation model
  – random works OK
  – OLTP, file server, email, web, …
• Few big
  – sometimes not obviously parallel
  – do not fit RPC model (BIG rpcs)
  – scientific, simulation, data mining, ...
Many Little Programming Model
• many small requests
• route requests to data
• encapsulate data with procedures (objects)
• three-tier computing
• RPC is a convenient/appropriate model
• Transactions are a big help in error handling
• Auto partition (e.g. hash data and computation)
• Works fine. (A minimal sketch follows below.)
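As a concrete illustration of the "many little" model, here is a minimal sketch, not from the talk: requests are hash-partitioned by key and routed to the node that encapsulates that partition behind a procedure interface. A local virtual call stands in for the RPC (DCOM/CORBA/HTTP) that a real three-tier system would use; the account-server example and all names are made up.

  // Minimal sketch (not from the talk) of the "many little" model:
  // hash-partition the data, route each small request to the node
  // that owns the key, and keep the data encapsulated behind a
  // procedure interface. A local virtual call stands in for the RPC.
  #include <cstdio>
  #include <functional>
  #include <map>
  #include <memory>
  #include <string>
  #include <vector>

  // The encapsulated service: callers never touch the data directly.
  class AccountServer {
  public:
      virtual ~AccountServer() {}
      virtual void Credit(const std::string& account, double amount) {
          balances_[account] += amount;               // one tiny unit of work
      }
      virtual double Balance(const std::string& account) {
          return balances_[account];
      }
  private:
      std::map<std::string, double> balances_;        // this node's partition
  };

  // The router: hash the key, forward the request to the owning node.
  class Router {
  public:
      explicit Router(size_t nodes) {
          for (size_t i = 0; i < nodes; ++i)
              nodes_.push_back(std::make_unique<AccountServer>());
      }
      AccountServer& Route(const std::string& account) {
          size_t n = std::hash<std::string>{}(account) % nodes_.size();
          return *nodes_[n];                          // "RPC" to that node
      }
  private:
      std::vector<std::unique_ptr<AccountServer>> nodes_;
  };

  int main() {
      Router cluster(16);                             // a 16-node "cluster"
      cluster.Route("alice").Credit("alice", 100.0);  // independent requests
      cluster.Route("bob").Credit("bob", 250.0);      // land on different nodes
      std::printf("alice=%.2f bob=%.2f\n",
                  cluster.Route("alice").Balance("alice"),
                  cluster.Route("bob").Balance("bob"));
      return 0;
  }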
Software CyberBricks
Object-Oriented Programming:
Parallelism From Many Little Jobs
• Gives location transparency
• ORB/web/tpmon multiplexes clients to servers
• Enables distribution
• Exploits embarrassingly parallel apps (transactions)
• HTTP and RPC (dcom, corba, rmi, iiop, …) are the basis
(TP monitor / ORB / web server)
Few Big Programming Model
• Finding parallelism is hard
– Pipelines are short (3x …6x speedup)
• Spreading objects/data is easy,
but getting locality is HARD
• Mapping big job onto cluster is hard
• Scheduling is hard
– coarse grained (job) and fine grain (co-schedule)
• Fault tolerance is hard
Kinds of Parallel Execution
• Pipeline: one sequential program feeds the next;
  any sequential program can be a stage.
• Partition: outputs split N ways, inputs merge M ways;
  each partition runs an ordinary sequential program.
Why Parallel Access To Data?
• At 10 MB/s it takes 1.2 days to scan 1 Terabyte.
• 1,000-way parallel: a 100-second scan of 1 Terabyte.
Parallelism: divide a big problem into many smaller ones
to be solved in parallel.
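Spelled out, the scan-time arithmetic behind the slide is:
  1 TB / (10 MB/s) = 100,000 seconds ≈ 1.2 days
  1 TB / (1,000 disks x 10 MB/s each) = 100 seconds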
Why Are Relational Operators
Successful for Parallelism?
• The relational data model gives uniform operators on uniform data streams.
• Operators are closed under composition.
• Each operator consumes 1 or 2 input streams.
• Each stream is a uniform collection of data.
• Sequential data in and out: pure dataflow.
• Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...)
  requires innovation.
=> AUTOMATIC PARALLELISM
Database Systems
“Hide” Parallelism
• Automate system management via tools
– data placement
– data organization (indexing)
– periodic tasks (dump / recover / reorganize)
• Automatic fault tolerance
– duplex & failover
– transactions
• Automatic parallelism
– among transactions (locking)
– within a transaction (parallel execution)
SQL: a Non-Procedural
Programming Language
• SQL is a functional programming language:
  it describes the answer set.
• The optimizer picks the best execution plan:
  – picks the data flow web (pipeline),
  – the degree of parallelism (partitioning),
  – other execution parameters (process placement, memory, ...)
[Diagram: the GUI and schema feed the optimizer; the resulting plan is
handed to parallel executors and rivers, with execution planning and
monitoring around them.]
Partitioned Execution
Spreads computation and IO among processors.
[Diagram: a table partitioned into ranges A...E, F...J, K...N, O...S,
and T...Z, with a Count operator running on each partition.]
Partitioned data gives NATURAL parallelism. (A sketch follows.)
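A minimal sketch of partitioned execution, not from the talk: the same operator (here a count) runs independently on each partition, and a merge step combines the partial results. Threads stand in for cluster nodes, and the five-way partitioning mirrors the A...E through T...Z ranges on the slide.

  // Partitioned execution sketch: one Count operator per partition,
  // run in parallel, then a merge of the partial counts.
  #include <cstdio>
  #include <numeric>
  #include <thread>
  #include <vector>

  int main() {
      // Five partitions of a "table" (think A...E, F...J, K...N, O...S, T...Z).
      std::vector<std::vector<int>> partitions(5);
      for (int i = 0; i < 100000; ++i)
          partitions[i % 5].push_back(i);             // partitioned data

      // One Count operator per partition, each on its own thread.
      std::vector<long> partial(partitions.size(), 0);
      std::vector<std::thread> workers;
      for (size_t p = 0; p < partitions.size(); ++p)
          workers.emplace_back([&, p] { partial[p] = (long)partitions[p].size(); });
      for (auto& w : workers) w.join();

      // Merge: combine the partial counts into the final answer.
      long total = std::accumulate(partial.begin(), partial.end(), 0L);
      std::printf("count = %ld\n", total);
      return 0;
  }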
N x M way Parallelism
[Diagram: one Sort operator per partition (A...E, F...J, K...N, O...S,
T...Z) feeds, through a Merge stage, a set of Join operators.]
N inputs, M outputs, no bottlenecks:
partitioned data, and partitioned and pipelined data flows.
Automatic Parallel Object-Relational DB
  Select image
  from landsat
  where date between 1970 and 1990
    and overlaps(location, :Rockies)
    and snow_cover(image) > .7;
[Landsat table: date, loc, and image columns, with rows spanning 1/2/72
through 4/8/95 at locations such as 33N 120W and 34N 120W; the query
applies temporal, spatial, and image predicates to those columns.]
Assign one process per processor/disk:
  find the images with the right date & location;
  analyze each image, and if it is 70% snow, return it.
(date, location, & image tests → Answer: images)
Data Rivers: Split + Merge Streams
[Diagram: N producers and M consumers connected by a river carrying
N x M data streams.]
Producers add records to the river;
consumers consume records from the river.
Purely sequential programming:
the river does flow control and buffering,
and does the partition and merge of data records.
River = Split/Merge in Gamma = Exchange operator in Volcano / SQL Server.
(A minimal sketch follows.)
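A minimal single-process sketch of the split/merge idea, not Gamma or Volcano code: producers Put() records, the river partitions them by key to one of M output queues, and each consumer Get()s a purely sequential stream. All names (River, Record) are illustrative; a real river would add flow control, buffering, and the network.

  // Single-process river sketch: split on Put(), merge on Get().
  #include <cstdio>
  #include <deque>
  #include <functional>
  #include <string>
  #include <vector>

  struct Record { std::string key; std::string payload; };

  class River {
  public:
      explicit River(size_t consumers) : queues_(consumers) {}

      // Split: any producer may call Put(); the river picks the consumer.
      void Put(const Record& r) {
          size_t q = std::hash<std::string>{}(r.key) % queues_.size();
          queues_[q].push_back(r);
      }

      // Merge: consumer c sees one sequential stream of records.
      bool Get(size_t c, Record* out) {
          if (queues_[c].empty()) return false;
          *out = queues_[c].front();
          queues_[c].pop_front();
          return true;
      }

  private:
      std::vector<std::deque<Record>> queues_;   // one stream per consumer
  };

  int main() {
      River river(3);                            // M = 3 consumers
      for (int i = 0; i < 10; ++i)               // each of N producers would do this
          river.Put({"key" + std::to_string(i), "value"});

      Record r;
      for (size_t c = 0; c < 3; ++c)             // each consumer drains its stream
          while (river.Get(c, &r))
              std::printf("consumer %zu got %s\n", c, r.key.c_str());
      return 0;
  }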
Generalization: Object-Oriented Rivers
• Rivers transport a sub-class of record-set (= a stream of objects)
  – record type and partitioning are part of the subclass
• Node transformers are data pumps
  – an object with river inputs and outputs
  – does late binding to the record type
• Programming becomes data-flow programming
  – specify the pipelines
• The compiler/scheduler does data partitioning and
  "transformer" placement (a small sketch follows)
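A tiny hypothetical sketch of a node transformer as a data pump that late-binds to the record subclass flowing through it. Stream, TextRecord, and UpperCaseTransformer are made-up names for illustration, not part of any shipping system; an in-memory queue stands in for a river leg.

  // A transformer is an object with river inputs and outputs that
  // discovers the record subclass at run time (late binding).
  #include <cctype>
  #include <cstdio>
  #include <deque>
  #include <memory>
  #include <string>

  struct Object { virtual ~Object() {} };                 // generic record

  struct TextRecord : Object { std::string text; };       // one record subclass

  // A trivial in-memory "river leg" standing in for a network stream.
  struct Stream {
      std::deque<std::shared_ptr<Object>> q;
      void Put(std::shared_ptr<Object> r) { q.push_back(std::move(r)); }
      std::shared_ptr<Object> Get() {
          if (q.empty()) return nullptr;
          auto r = q.front(); q.pop_front(); return r;
      }
  };

  // The data pump: reads its input river, transforms, writes its output.
  struct UpperCaseTransformer {
      void Run(Stream& in, Stream& out) {
          while (auto rec = in.Get()) {
              // Late binding: discover the record subclass at run time.
              if (auto t = std::dynamic_pointer_cast<TextRecord>(rec)) {
                  for (auto& c : t->text) c = (char)toupper((unsigned char)c);
                  out.Put(t);
              }
          }
      }
  };

  int main() {
      Stream in, out;
      auto r = std::make_shared<TextRecord>();
      r->text = "hello rivers";
      in.Put(r);

      UpperCaseTransformer pump;                          // scheduler would place this
      pump.Run(in, out);

      auto o = std::dynamic_pointer_cast<TextRecord>(out.Get());
      std::printf("%s\n", o->text.c_str());
      return 0;
  }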
NT Cluster Sort as a Prototype
• Using data generation and sort as a prototypical app
• The "hello world" of distributed processing
• Goal: easy install & execute
PennySort
• Hardware
  – 266 Mhz Intel PPro
  – 64 MB SDRAM (10 ns)
  – dual Fujitsu DMA 3.2 GB EIDE
• Software
  – NT Workstation 4.3
  – NT 5 sort
• PennySort machine ($1,107)
  [Pie chart of cost: cpu 32%, disk 25%, other 22%, board 13%,
  network/video/floppy 9%, memory 8%, cabinet + assembly 7%, software 6%.]
• Performance
  – sorts 15 M 100-byte records (~1.5 GB)
  – disk to disk
  – elapsed time 820 sec
  – cpu time = 404 sec
Remote Install
• Add a Registry entry to each remote node (sketch below):
  RegConnectRegistry()
  RegCreateKeyEx()
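A minimal sketch of the two Win32 calls named on the slide: connect to a remote node's registry, then create a key there. The machine name (\\node17) and key path are placeholders for illustration, not the talk's actual values.

  // Remote-install sketch: add a Registry entry on a remote node.
  #include <windows.h>
  #include <stdio.h>

  int main() {
      HKEY hRemote = NULL, hKey = NULL;
      DWORD disposition = 0;

      // Open HKEY_LOCAL_MACHINE on the remote node.
      LONG rc = RegConnectRegistry(TEXT("\\\\node17"), HKEY_LOCAL_MACHINE, &hRemote);
      if (rc != ERROR_SUCCESS) { printf("connect failed: %ld\n", rc); return 1; }

      // Create (or open) the key the cluster-sort service would read.
      rc = RegCreateKeyEx(hRemote, TEXT("SOFTWARE\\Example\\ClusterSort"),
                          0, NULL, REG_OPTION_NON_VOLATILE, KEY_WRITE, NULL,
                          &hKey, &disposition);
      if (rc == ERROR_SUCCESS) RegCloseKey(hKey);

      RegCloseKey(hRemote);
      return rc == ERROR_SUCCESS ? 0 : 1;
  }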
Cluster Startup/Execution
• Setup:
  – MULTI_QI struct
  – COSERVERINFO struct
• CoCreateInstanceEx()
• Retrieve the remote object handle from the MULTI_QI struct
• Invoke methods (e.g. Sort()) as usual (sketch below)
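A sketch of the DCOM startup sequence the slide walks through: fill in COSERVERINFO (which node) and MULTI_QI (which interface), call CoCreateInstanceEx(), pull the interface pointer out of MULTI_QI, then invoke methods as usual. CLSID_ClusterSort, ISortNode, and the node name are placeholders; the talk does not show the real cluster-sort interfaces.

  // DCOM remote activation sketch (placeholder CLSID/interface).
  #include <windows.h>
  #include <objbase.h>

  // Placeholder COM interface for one sort node (illustrative only).
  struct ISortNode : public IUnknown {
      virtual HRESULT STDMETHODCALLTYPE Sort(void) = 0;
  };
  // Placeholder GUIDs; the real ones would come from the component's IDL.
  static const CLSID CLSID_ClusterSort = { 0, 0, 0, { 0,0,0,0,0,0,0,0 } };
  static const IID   IID_ISortNode     = { 0, 0, 0, { 0,0,0,0,0,0,0,0 } };

  HRESULT StartRemoteSort(const wchar_t* node)
  {
      COSERVERINFO server = {};
      server.pwszName = const_cast<wchar_t*>(node);   // which node to activate on

      MULTI_QI qi = {};
      qi.pIID = &IID_ISortNode;                       // which interface we want back

      HRESULT hr = CoCreateInstanceEx(CLSID_ClusterSort, NULL,
                                      CLSCTX_REMOTE_SERVER, &server, 1, &qi);
      if (FAILED(hr) || FAILED(qi.hr)) return FAILED(hr) ? hr : qi.hr;

      // Retrieve the remote object pointer and invoke methods as usual.
      ISortNode* sorter = static_cast<ISortNode*>(qi.pItf);
      hr = sorter->Sort();
      sorter->Release();
      return hr;
  }

  int main() {
      HRESULT hr = CoInitializeEx(NULL, COINIT_MULTITHREADED);
      if (SUCCEEDED(hr)) {
          StartRemoteSort(L"node17");                 // placeholder node name
          CoUninitialize();
      }
      return 0;
  }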
Cluster Sort: Conceptual Model
• Multiple data sources
• Multiple data destinations
• Multiple nodes
• Disks -> sockets -> disk -> disk
[Diagram: each of three nodes reads its local input, a mix of A, B, and C
records; each record is sent over a socket to the node that owns its key
range, so one node ends up with all the A's, another with all the B's,
and another with all the C's.]
Summary
• Clusters of hardware CyberBricks
  – all nodes are very intelligent
  – processing migrates to where the power is
    • disk, network, and display controllers have a full-blown OS
    • send them RPCs (SQL, Java, HTTP, DCOM, CORBA)
    • the computer is a federated distributed system
• Software CyberBricks
  – standard way to interconnect intelligent nodes
  – needs an execution model
    • partition & pipeline
    • RPC and rivers
  – needs parallelism
Recent Progress on Scaleable Servers
Jim Gray, Microsoft Research
end
What I’m Doing
• TerraServer: Photo of the planet on the web
– a database (not a file system)
– 1TB now, 15 PB in 10 years
– http://www.TerraServer.microsoft.com/
• Sloan Digital Sky Survey: picture of the universe
– just getting started, cyberbricks for astronomers
– http://www.sdss.org/
• Sorting:
– one node pennysort (http://research.microsoft.com/barc/SortBenchmark/)
– multinode: NT Cluster sort (shows off SAN and DCOM)
What I’m Doing
• NT Clusters:
– failover: Fault tolerance within a cluster
– NT Cluster Sort: balanced IO, cpu, network benchmark
– AlwaysUp: Geographical fault tolerance.
• RAGS: random testing of SQL systems
– a bug finder
• Telepresence
– Working with Gordon Bell on “the killer app”
– FileCast and PowerCast
– Cyberversity (international, on demand, free university)
Outline
• Scaleability: MAPS
• Scaleup has limits, scaleout for really big jobs
• Two generic kinds of computing:
– many little & few big
• Many little has credible programming model
– tp, web, fileserver, mail,… all based on RPC
• Few big has marginal success (best is DSS)
• Rivers and objects
The Bricks of Cyberspace
4 B PC's (1 Bips, .1 GB DRAM, 10 GB disk, 1 Gbps net; B=G)
• Cost: $1,000
• Come with
  – NT
  – DBMS
  – high-speed net
  – system management
  – GUI / OOUI
  – tools
• Compatible with everyone else
• CyberBricks
Super Server: 4T Machine
• An array of 1,000 4B machines:
  – 1 Bips processors
  – 1 BB DRAM
  – 10 BB disks
  – 1 Bbps comm lines
  – 1 TB tape robot
• A few megabucks
• Challenge: Manageability, Programmability, Security,
  Availability, Scaleability, Affordability
• As easy as a single system
[Diagram: a Cyber Brick (a 4B machine) with CPU, 5 GB RAM, and a 50 GB disc.]
• Future servers are CLUSTERS of processors and discs
• Distributed database techniques make clusters work
Cluster Vision
Buying Computers by the Slice
• Rack & Stack
– Mail-order components
– Plug them into the cluster
• Modular growth without limits
– Grow by adding small modules
• Fault tolerance:
– Spare modules mask failures
• Parallel execution & data search
– Use multiple processors and disks
• Clients and servers made from the same stuff
– Inexpensive: built with
commodity CyberBricks
Nostalgia: Behemoth in the Basement
• Today's PC is yesterday's supercomputer
• Can use LOTS of them
• Main apps changed:
  – scientific → commercial → web
  – web & transaction servers
  – data mining, web farming
Technology Drivers: Disks
[Graphic: the capacity prefix ladder from Kilo through Mega, Giga, Tera,
Peta, Exa, and Zetta to Yotta.]
• Disks are on track
• 100x in 10 years: a 2 TB 3.5" drive
• Shrink it to 1" and that is 200 GB
• Disk replaces tape?
• The disk is a supercomputer!