Scaleable Computing
Jim Gray
Microsoft Corporation
[email protected]
Thesis: Scaleable Servers
Scaleable Servers
Commodity hardware allows new applications
New applications need huge servers
Clients and servers are built of the same “stuff”
Commodity software and
Commodity hardware
Servers should be able to
Scale up (grow node by adding CPUs, disks, networks)
Scale out (grow by adding nodes)
Scale down (can start small)
Key software technologies
Objects, Transactions, Clusters, Parallelism
1987: 256 tps Benchmark
14 M$ computer (Tandem)
A dozen people
False floor, 2 rooms of machines
A 32 node processor array
A 40 GB disk array (80 drives)
Simulate 25,600 clients
Staff: manager, admin expert, hardware experts, network expert, performance expert, DB expert, OS expert, auditor
1988: DB2 + CICS Mainframe
65 tps
IBM 4391
Simulated network of 800 clients
2 M$ computer
Staff of 6 to do benchmark
2 x 3725 network controllers
Refrigerator-sized CPU
16 GB disk farm (4 x 8 x .5GB)
1997: 10 years later
1 Person and 1 box = 1250 tps
1 Breadbox ~ 5x 1987 machine room
23 GB is hand-held
One person does all the work
Cost/tps is 1,000x less
25 micro dollars per transaction
Roles covered by that one person: hardware expert, OS expert, net expert, DB expert, app expert
4 x 200 MHz CPU
1/2 GB DRAM
12 x 4GB disk
3 x 7 x 4GB disk arrays
What Happened?
Moore’s law:
Things get 4x better every 3 years
(applies to computers, storage, and networks)
New Economics: Commodity
class            price ($/mips)    software (k$/year)
mainframe        10,000            100
minicomputer     100               10
microcomputer    10                1
GUI: Human - computer tradeoff
optimize for people, not computers
What Happens Next
Last 10 years:
1000x improvement
Next 10 years:
????
Today:
text and image servers are free
25 m$/hit => advertising pays for them
Future:
video, audio, … servers are free
“You ain’t seen nothing yet!”
Kinds Of Information Processing
                 Point-to-point          Broadcast
Immediate        Conversation, Money     Lecture, Concert      (Network)
Timeshifted      Mail                    Book, Newspaper       (Database)
It’s ALL going electronic
Immediate is being stored for analysis (so ALL database)
Analysis and automatic processing are being added
Why Put Everything In Cyberspace?
Low rent: min $/byte
Shrinks time: now or later
Shrinks space: here or there
Automate processing: knowbots
Network: immediate OR time-delayed, point-to-point OR broadcast
Database: locate, process, analyze, summarize
Magnetic Storage
Cheaper Than Paper
File cabinet:
cabinet (four drawer) 250$
paper (24,000 sheets) 250$
space (2x3 @ 10$/ft²) 180$
total 700$
3¢/sheet
Disk:
disk (4 GB) 800$
ASCII: 2 mil pages, 0.04¢/sheet (80x cheaper)
Image: 200,000 pages, 0.4¢/sheet (8x cheaper)
Store everything on disk
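The per-sheet figures on this slide follow from simple division. The sketch below reruns that arithmetic using only the slide's numbers; the constant and function names are illustrative, not from the talk. The slide rounds paper to 3¢/sheet, which is why its ratios come out as roughly 80x and 8x.

```python
# Inputs are the slide's own figures; names are hypothetical.
CABINET_TOTAL_USD = 700.0   # cabinet + paper + floor space, as rounded on the slide
CABINET_SHEETS = 24_000
DISK_USD = 800.0            # one 4 GB disk
ASCII_PAGES = 2_000_000     # pages per disk when stored as ASCII
IMAGE_PAGES = 200_000       # pages per disk when stored as images


def cents_per_sheet(total_usd: float, sheets: int) -> float:
    """Cost of storing one sheet, in cents."""
    return 100.0 * total_usd / sheets


paper = cents_per_sheet(CABINET_TOTAL_USD, CABINET_SHEETS)
ascii_disk = cents_per_sheet(DISK_USD, ASCII_PAGES)
image_disk = cents_per_sheet(DISK_USD, IMAGE_PAGES)

print(f"paper:      {paper:.2f} cents/sheet")
print(f"ASCII disk: {ascii_disk:.2f} cents/sheet ({paper / ascii_disk:.0f}x cheaper)")
print(f"image disk: {image_disk:.2f} cents/sheet ({paper / image_disk:.1f}x cheaper)")
```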
Billions Of Clients
Every device will be “intelligent”
Doors, rooms, cars…
Computing will be ubiquitous
Billions Of Clients
Need Millions Of Servers
All clients networked to servers
May be nomadic or on-demand
Fast clients want faster servers
Servers provide
Shared Data
Control
Coordination
Communication
(Figure: mobile clients and fixed clients connect to servers, including a super server.)
Thesis
Many little beat few big
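(Figure: processor classes Mainframe, Mini, Micro, Nano, and Pico Processor against price points of $1 million, $100 K, and $10 K; disk form factors of 14", 9", 5.25", 3.5", 2.5", and 1.8".)
Pico processor: 1 mm³, with a memory hierarchy of
1 MB of 10 pico-second ram
100 MB of 10 nano-second ram
10 GB of 10 microsecond ram
1 TB of 10 millisecond disc
100 TB of 10 second tape archive
Smoking, hairy golf ball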
How to connect the many little parts?
How to program the many little parts?
Fault tolerance?
1 M SPECmarks, 1TFLOP
10^6 clocks to bulk ram
Event-horizon on chip
VM reincarnated
Multiprogram cache,
On-Chip SMP
Future Super Server:
4T Machine
Array of 1,000 4B machines
1 Bips processors
1 BB DRAM
10 BB disks
1 Bbps comm lines
1 TB tape robot
A few megabucks
Cyber Brick: a 4B machine (CPU, 5 GB RAM, 50 GB Disc)
Challenge:
Manageability
Programmability
Security
Availability
Scaleability
Affordability
As easy as a single system
Future servers are CLUSTERS
of processors, discs
Distributed database techniques
make clusters work
The Hardware Is In Place…
And then a miracle occurs
?
SNAP: scaleable network
and platforms
Commodity-distributed
OS built on:
Commodity platforms
Commodity network
interconnect
Enables parallel applications
Scaleable Servers
BOTH SMP And Cluster
Grow up with SMP; 4xP6 is now standard
Grow out with cluster
Cluster has inexpensive parts
(Figure: personal system, departmental server, and SMP super server on the grow-up path; cluster of PCs on the grow-out path.)
SMPs Have Advantages
Single system image: easier to manage, easier to program
threads in shared memory, disk, Net
4x SMP is commodity
Software capable of 16x
Problems:
>4 not commodity
Scale-down problem (starter systems expensive)
There is a BIGGEST one
(Figure: personal system, departmental server, SMP super server.)
Building the Largest Node
There is a biggest node (size grows over time)
Today, with NT, it is probably 1TB
We are building it (with help from DEC and SPIN2)
1 TB GeoSpatial SQL Server database
(1.4 TB of disks = 320 drives).
30K BTU, 8 KVA, 1.5 metric tons.
1-TB home page
Will put it on the Web as a demo app.
10 meter image of the ENTIRE PLANET.
www.SQL.1TB.com
2 meter image of interesting parts (2% of land)
One pixel per meter = 500 TB uncompressed.
Better resolution in US (courtesy of USGS).
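The "one pixel per meter = 500 TB" figure can be sanity-checked from the Earth's surface area. This is a rough sketch under two assumptions not stated on the slide: one byte per pixel, and coverage of the whole surface (land and ocean, about 510 million km²). The names are illustrative.

```python
EARTH_SURFACE_M2 = 5.1e14   # about 510 million km^2, land and ocean together
BYTES_PER_PIXEL = 1         # assumption: one byte per pixel, uncompressed


def uncompressed_terabytes(meters_per_pixel: float) -> float:
    """Size of a whole-planet image at the given ground resolution."""
    pixels = EARTH_SURFACE_M2 / meters_per_pixel ** 2
    return pixels * BYTES_PER_PIXEL / 1e12


print(f"1 m per pixel: about {uncompressed_terabytes(1):.0f} TB")
```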
1-TB SQL Server DB
Satellite and aerial
photos
Support
files
What’s a TeraByte?
1 Terabyte:
1,000,000,000 business letters      150 miles of book shelf
100,000,000 book pages              15 miles of book shelf
50,000,000 FAX images               7 miles of book shelf
10,000,000 TV pictures (mpeg)       10 days of video
4,000 LandSat images                16 earth images (100m)
100,000,000 web pages               10 copies of the web HTML
Library of Congress (in ASCII) is 25 TB
1980: $200 million of disc (10,000 discs), $5 million of tape silo (10,000 tapes)
1997: 200 k$ of magnetic disc (48 discs), 30 k$ of nearline tape (20 tapes)
Terror Byte!
TPC-C Web-Based Benchmarks
Client is a Web browser (7,500 of them!)
Submits Order, Invoice, Query to server via Web page interface
Web server translates to DB
SQL does DB work
Net: easy to implement, performance is GREAT!
(Figure: browser clients connect over HTTP to IIS, the Web server, which reaches SQL through ODBC.)
TPC-C Shows How Far SMPs have come
Performance is amazing:
Peak Performance: 30,390 tpmC @ $305/tpmC (Oracle/DEC)
Best Price/Perf: 6,712 tpmC @ $65/tpmC (MS SQL/DEC/Intel)
graphs show UNIX high price & diseconomy of scaleup
(Chart: tpmC and price performance, $/tpmC from 0 to 400 against tpmC from 0 to 20,000; only "best" data shown for each vendor: DB2, Informix, MS SQL Server, Oracle, Sybase.)
2,000 users is the min!
30,000 users on a 4x12 alpha cluster (Oracle)
TPC C SMP Performance
• SMPs do offer speedup
but 4x P6 is better than some 18x MIPSco
(Charts: tpmC vs CPUs, 0 to 20,000 tpmC over 0 to 20 CPUs. One panel shows SUN scaleability; the other shows SUN scaleability alongside SQL Server.)
The TPC-C Revolution
Shows How Far
NT and SQL Server have Come
MS SQL Server: Economy of Scale & Low Price
Economy of scale on Windows NT
Recent Microsoft SQL Server benchmarks are Web-based
(Chart: tpmC and $/tpmC. Price in $/tpmC, $0 to $250, against performance, 0 to 8,000 tpmC, for DB2, Informix, Microsoft, Oracle, and Sybase, with an arrow marking the "better" direction.)
What Happens To Prices?
No expensive UNIX front end (20$/tpmC)
No expensive TP monitor software (10$/tpmC)
=> 65$/tpmC
(Chart: TPC price/tpmC broken down into processor, disk, software, and net for Informix on SNI, Oracle on DEC Unix, Oracle on Compaq/NT, Sybase on Compaq/NT, Microsoft on Compaq with Visigenic, Microsoft on HP with Visigenic, Microsoft on Intergraph with IIS, and Microsoft on Compaq with IIS.)
Grow UP and OUT
Cluster:
•a collection of nodes
•as easy to program and manage as a single node
(Figure: growing up from personal system to departmental server to SMP super server, and out to a cluster; targets are a 1 Terabyte DB and 1 billion transactions per day.)
Clusters Have Advantages
Clients and servers made from the same stuff
Inexpensive: built with commodity components
Fault tolerance: spare modules mask failures
Modular growth: grow by adding small modules
Unlimited growth: no biggest one
Windows NT clusters
Key goals:
Easy: to install, manage, program
Reliable: better than a single node
Scaleable: added parts add power
Microsoft & 60 vendors
defining NT clusters
Almost all big hardware and
software vendors involved
No special hardware needed
but it may help
Enables
Commodity fault-tolerance
Commodity parallelism
(data mining, virtual reality…)
Also great for workgroups!
Initial: two-node failover
Beta testing since December 1996
SAP, Microsoft, Oracle giving
demos.
File, print, Internet, mail, DB, other
services
Easy to manage
Each node can be 4x (or more) SMP
Next (NT5) “Wolfpack” is modest
size cluster
About 16 nodes (so 64 to 128 CPUs)
No hard limit, algorithms designed
to go further
SQL Server Failover
Using “Wolfpack” Windows NT Clusters
Each server “owns” half the database
When one fails…
The other server takes over the shared disks
Recovers the database and serves it
(Figure: servers A and B, each with private disks, joined by shared SCSI disk strings and serving the clients.)
How Much Is 1 Billion
Transactions Per Day?
1 Btpd = 11,574 tps (transactions per second)
~ 700,000 tpm (transactions/minute)
AT&T: 185 million calls (peak day worldwide)
Visa ~20 M tpd
400 M customers
250,000 ATMs worldwide
7 billion transactions / year (card+cheque) in 1994
(Chart: millions of transactions per day (Mtpd) on a log scale from 0.1 to 1,000 for NYSE, BofA, Visa, AT&T, and the 1 Btpd target.)
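The rate conversions on this slide are plain unit arithmetic. A minimal sketch follows; the function names are illustrative, and the daily volumes are the slide's figures.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60


def per_second(per_day: float) -> float:
    """Average rate per second for a given daily volume."""
    return per_day / SECONDS_PER_DAY


def per_minute(per_day: float) -> float:
    """Average rate per minute for a given daily volume."""
    return per_day / MINUTES_PER_DAY


for label, daily in [("1 Btpd", 1e9), ("AT&T peak day", 185e6), ("Visa", 20e6)]:
    print(f"{label}: {per_second(daily):,.0f} tps, {per_minute(daily):,.0f} tpm")
```

These are averages over a whole day; peak-hour rates would be higher.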
Billion Transactions per Day
Project
Building a 20-node Windows NT
Cluster (with help from Intel)
> 800 disks
All commodity parts
Using SQL Server &
DTC distributed transactions
Each node has 1/20th of the DB
Each node does 1/20th of the work
15% of the transactions are
“distributed”
Parallelism
The OTHER aspect of clusters
Clusters of machines
allow two kinds
of parallelism
Many little jobs: online
transaction processing
TPC-A, B, C…
A few big jobs: data
search and analysis
TPC-D, DSS, OLAP
Both give
automatic parallelism
Kinds of Parallel Execution
Pipeline: any sequential program feeds any sequential program
Partition: copies of any sequential program run side by side
outputs split N ways
inputs merge M ways
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Data Rivers
Split + Merge Streams
N X M Data Streams
M Consumers
N producers
River
Producers add records to the river,
Consumers consume records from the river
Purely sequential programming.
River does flow control and buffering
does partition and merge of data records
River = Split/Merge in Gamma = Exchange operator in Volcano.
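A river can be sketched with bounded queues: producers and consumers stay purely sequential while the river partitions, buffers, and applies back-pressure. The code below is a minimal illustration, not the Gamma or Volcano implementation; `run_river`, its parameters, and the sentinel are hypothetical names introduced here.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of a consumer's stream


def run_river(producers, consumers, partition, buffer_size=16):
    """Route records from N producers to M consumers.

    producers: list of iterables yielding records
    consumers: list of functions, each taking an iterator of records
    partition: function mapping a record to an integer (reduced modulo M)
    """
    m = len(consumers)
    streams = [queue.Queue(maxsize=buffer_size) for _ in range(m)]
    results = [None] * m

    def produce(source):
        for record in source:
            # put() blocks when the buffer is full: this is the flow control.
            streams[partition(record) % m].put(record)

    def drain(q):
        while True:
            record = q.get()
            if record is _DONE:
                return
            yield record

    def consume(i):
        results[i] = consumers[i](drain(streams[i]))

    consumer_threads = [threading.Thread(target=consume, args=(i,)) for i in range(m)]
    producer_threads = [threading.Thread(target=produce, args=(p,)) for p in producers]
    for t in consumer_threads + producer_threads:
        t.start()
    for t in producer_threads:
        t.join()
    for q in streams:
        q.put(_DONE)  # all producers are finished, so close every stream
    for t in consumer_threads:
        t.join()
    return results


# Three producers, two consumers; records are split by parity.
print(run_river([range(0, 50), range(50, 100), range(100, 150)],
                [sum, sum], lambda r: r))
```

Each consumer here is the ordinary built-in `sum`, which knows nothing about parallelism; that is the point of the river abstraction.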
Partitioned Execution
Spreads computation and IO among processors
(Figure: a table partitioned by key range (A...E, F...J, K...N, O...S, T...Z), with a Count running on each partition and a final Count combining the results.)
Partitioned data gives
NATURAL parallelism
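The partitioned count can be sketched as follows: rows are split by the key ranges shown on the slide, each partition is counted independently, and the partial counts are summed. The thread pool stands in for separate processors; the function names and sample data are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Key ranges from the slide: A...E, F...J, K...N, O...S, T...Z
RANGES = [("A", "E"), ("F", "J"), ("K", "N"), ("O", "S"), ("T", "Z")]


def partition_of(name: str) -> int:
    """Index of the key range that holds this name."""
    first = name[0].upper()
    for index, (low, high) in enumerate(RANGES):
        if low <= first <= high:
            return index
    raise ValueError(f"no partition for {name!r}")


def partitioned_count(names):
    """Count rows per partition in parallel, then combine the partial counts."""
    partitions = [[] for _ in RANGES]
    for name in names:
        partitions[partition_of(name)].append(name)
    with ThreadPoolExecutor(max_workers=len(RANGES)) as pool:
        partial_counts = list(pool.map(len, partitions))
    return partial_counts, sum(partial_counts)


print(partitioned_count(["Alice", "Bob", "Frank", "Kim", "Oscar", "Tom", "Zed"]))
```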
N x M way Parallelism
(Figure: partitioned data (A...E, F...J, K...N, O...S, T...Z) flows through parallel Join operators into parallel Sort operators and then into Merge operators.)
N inputs, M outputs, no bottlenecks.
Partitioned and Pipelined Data Flows
The Parallel Law
Of Computing
Grosch's Law:
2x $ is 4x performance
1 MIPS costs 1$
1,000 MIPS costs 32$ (.03$/MIPS)
Parallel Law:
2x $ is 2x performance
1 MIPS costs 1$
1,000 MIPS costs 1,000$
Needs:
Linear speedup and linear scale-up
Not always possible
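The two laws differ only in the exponent relating cost to performance. Normalizing both to 1 MIPS for 1$ as the slide does, "2x $ is 4x performance" means performance grows as cost squared, so cost grows as the square root of performance. The function names below are illustrative.

```python
import math


def grosch_cost(mips: float) -> float:
    """Grosch's law: performance ~ cost^2, so cost = sqrt(performance)."""
    return math.sqrt(mips)


def parallel_cost(mips: float) -> float:
    """Parallel law: performance ~ cost, so cost scales linearly."""
    return float(mips)


for mips in (1, 1_000):
    print(f"{mips:>5} MIPS: Grosch {grosch_cost(mips):7.1f}$, "
          f"parallel {parallel_cost(mips):7.1f}$")
```

At 1,000 MIPS the Grosch cost is about 32$, or roughly .03$/MIPS, matching the slide.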
The BIG Picture
Components and transactions
Software modules are objects
Object Request Broker (a.k.a., Transaction
Processing Monitor) connects objects
(clients to servers)
Standard interfaces allow software plug-ins
Transaction ties execution of a “job” into an
atomic unit: all-or-nothing, durable, isolated
Object Request Broker
ActiveX and COM
COM is Microsoft model, engine inside OLE
ALL Microsoft software is based on COM (ActiveX)
CORBA + OpenDoc is equivalent
Heated debate over which is best
Both share same key goals:
Encapsulation: hide implementation
Polymorphism: generic operations
key to GUI and reuse
Versioning: allow upgrades
Transparency: local/remote
Security: invocation can be remote
Shrink-wrap: minimal inheritance
Automation: easy
COM now managed by the Open Group
Linking And Embedding
Objects are data modules;
transactions are execution modules
Link: pointer to object
somewhere else
Think URL in Internet
Embed: bytes
are here
Objects may be active;
can callback to subscribers
Objects Meet Databases
The basis for universal
data servers, access, & integration
object-oriented (COM oriented)
programming interface to data
Breaks DBMS into components
Anything can be a data source
Optimization/navigation “on top
of” other data sources
A way to componentize a DBMS
Makes an RDBMS an O-R DBMS (assumes optimizer understands objects)
(Figure: a DBMS engine drawing on a database, spreadsheet, photos, mail, map, and document as data sources.)
The Pattern:
Three Tier Computing
Clients do presentation, gather input
Clients do some workflow (Xscript)
Clients send high-level requests to ORB (Object Request Broker)
ORB dispatches workflows and business objects -- proxies for client, orchestrate flows & queues
Server-side workflow scripts call on distributed business objects to execute task
(Figure: the tiers from top to bottom: Presentation, workflow, Business Objects, Database.)
The Three Tiers
(Figure: a Web client (HTML, VB and Java plug-ins, VBScript, JavaScript, with a VB or Java script engine and virtual machine) connects over the Internet using HTTP+DCOM to the middleware tier (ORB, TP monitor, Web server, object server pool), which uses DCOM (OLE DB, ODBC, ...) to reach the object and data server and the gateways to IBM legacy systems.)
Why Did Everyone Go To
Three-Tier?
Manageability
Business rules must be with data
Middleware operations tools
Performance (scaleability)
Server resources are precious
ORB dispatches requests to server pools
Technology & Physics
Put UI processing near user
Put shared data processing near shared data
(Figure: Presentation, workflow, Business Objects, Database.)
What Middleware Does
ORB, TP Monitor, Workflow Mgr, Web Server
Registers transaction programs
workflow and business objects (DLLs)
Pre-allocates server pools
Provides server execution environment
Dynamically checks authority
(request-level security)
Does parameter binding
Dispatches requests to servers
parameter binding
load balancing
Provides Queues
Operator interface
Server Side Objects
Easy Server-Side Execution
Give simple execution environment
Object gets
start
invoke
shutdown
Everything else is automatic
Drag & Drop Business Objects
(Figure: a server supplying Network, Receiver, Queue, Connections, Context, Security, Thread Pool, Service logic, Synchronization, Shared Data, Configuration, and Management around the object.)
A new programming paradigm
Develop object on the desktop
Better yet: download them from the Net
Script work flows as method invocations
All on desktop
Then, move work flows and objects to server(s)
Gives
desktop
development
three-tier deployment
Software Cyberbricks
Transactions Coordinate
Components (ACID)
Transaction properties
Atomic: all or nothing
Consistent: old and new values
Isolated: automatic locking or versioning
Durable: once committed, effects survive
Transactions are built into modern OSs
MVS/TM, Tandem TMF, VMS DEC-DTM, NT-DTC
Transactions & Objects
Application requests transaction
identifier (XID)
XID flows with method invocations
Object Managers join (enlist)
in transaction
Distributed Transaction Manager
coordinates commit/abort
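The flow on this slide (get an XID, pass it along with each call, let object managers enlist, then let the transaction manager decide) can be sketched in a few lines. This is a toy, in-memory illustration that assumes a two-phase commit vote, which the slide does not name; the class and method names are hypothetical and are not the DTC API.

```python
import itertools


class ObjectManager:
    """Holds data and buffers writes per transaction until commit."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit   # lets the example force an abort
        self.state = {}
        self.pending = {}

    def write(self, xid, key, value, tm):
        tm.enlist(xid, self)           # join the transaction on first use
        self.pending.setdefault(xid, {})[key] = value

    def prepare(self, xid):
        return self.can_commit         # phase 1: vote

    def commit(self, xid):
        self.state.update(self.pending.pop(xid, {}))

    def abort(self, xid):
        self.pending.pop(xid, None)


class TransactionManager:
    """Hands out XIDs, tracks enlisted managers, coordinates commit/abort."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._enlisted = {}

    def begin(self):
        xid = next(self._ids)
        self._enlisted[xid] = []
        return xid

    def enlist(self, xid, manager):
        if manager not in self._enlisted[xid]:
            self._enlisted[xid].append(manager)

    def end(self, xid):
        managers = self._enlisted.pop(xid)
        if all(m.prepare(xid) for m in managers):
            for m in managers:
                m.commit(xid)          # phase 2: everyone commits
            return True
        for m in managers:
            m.abort(xid)               # any "no" vote aborts everyone
        return False


tm = TransactionManager()
accounts, orders = ObjectManager("accounts"), ObjectManager("orders")
xid = tm.begin()
accounts.write(xid, "alice", 90, tm)
orders.write(xid, "order-1", "shipped", tm)
print(tm.end(xid), accounts.state, orders.state)
```

A real coordinator also logs its decision durably so the outcome survives a crash; that part is omitted here.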
Distributed Transactions
Enable Huge Throughput
Each node capable of 7 KtpmC (7,000 active users!)
Can add nodes to cluster (to support 100,000 users)
Transactions coordinate nodes
ORB / TP monitor spreads work among nodes
Distributed Transactions
Enable Huge DBs
Distributed database technology
spreads data among nodes
Transaction processing technology
manages nodes
Thesis: Scaleable Servers
Scaleable Servers Built from Cyberbricks
Servers should be able to
Allow new applications
Scale up, out, down
Key software technologies
Clusters (ties the hardware together)
Parallelism: (uses the independent cpus, stores, wires)
Objects (software CyberBricks)
Transactions: masks errors.
Computer Industry Laws
(Rules of thumb)
Metcalfe’s law
Moore’s first law
Bell’s computer classes (7 price tiers)
Bell’s platform evolution
Bell’s platform economics
Bill’s law
Software economics
Grove’s law
Moore’s second law
Is info-demand infinite?
The death of Grosch’s law
Metcalfe’s Law
Network Utility = Users²
How many connections can it
make?
1 user: no utility
100,000 users: a few contacts
1 million users: many on Net
1 billion users: everyone on Net
That is why the Internet is so “hot”
Exponential benefit
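The question "how many connections can it make?" has a simple count behind it: n users can form n(n-1)/2 distinct pairs, which grows in step with the slide's Users² utility. A small sketch, with illustrative function names:

```python
def pairwise_connections(users: int) -> int:
    """Number of distinct user-to-user pairs among `users` users."""
    return users * (users - 1) // 2


def metcalfe_utility(users: int) -> int:
    """Utility as stated on the slide: Users squared."""
    return users ** 2


for users in (1, 100_000, 1_000_000, 1_000_000_000):
    print(f"{users:>13,} users: {pairwise_connections(users):,} possible pairs")
```

One user yields zero pairs, which matches "1 user: no utility".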
Moore’s First Law
XXX doubles every 18 months
60% increase per year
Micro processor speeds
Chip density
Magnetic disk density
Communications bandwidth
WAN bandwidth approaching LANs
Exponential growth:
The past does not matter
10x here, 10x there, soon you’re talking REAL change
PC costs decline faster than any other platform
Volume and learning curves
PCs will be the building bricks of all future
(Chart: 1 chip memory size (2 MB to 32 MB), 1970 to 2000; chip capacity in bits: 1K, 4K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M; memory sizes from 8KB to 1GB.)
Bumps In The Moore’s Law Road
DRAM:
1988: United States anti-dumping rules
1993-1995: ?price flat
Magnetic disk:
1965-1989: 10x/decade
1989-1996: 4x/3year! 100X/decade
(Charts: $/MB of DRAM, 1 to 1,000,000, and $/MB of DISK, .01 to 10,000, both on log scales for 1970 to 2000.)
Gordon Bell’s 1975 VAX Planning Model... He Didn’t Believe It!
System Price = 5 x 3 x .04 x memory size / 1.26^(t-1972)
5x: Memory is 20% of cost
3x: DEC markup
.04x: $ per byte
He didn’t believe: the projection $500 machine
He couldn’t comprehend the implications
(Chart: projected system price, 0.01 K$ to 100,000 K$ on a log scale, 1960 to 2000, with curves for memory sizes of 16 KB, 64 KB, 256 KB, 1 MB, and 8 MB.)
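The planning model is a one-line formula and is easy to evaluate. The sketch below assumes memory size is in bytes (the slide gives .04 as $ per byte) and that t is the calendar year with 1972 as the base; the function and parameter names are illustrative.

```python
def vax_system_price(memory_bytes: float, year: int,
                     memory_share_multiplier: float = 5.0,  # memory is 20% of cost
                     markup: float = 3.0,                   # DEC markup
                     dollars_per_byte_1972: float = 0.04,
                     annual_decline: float = 1.26) -> float:
    """System price in dollars under the slide's planning model."""
    base = memory_share_multiplier * markup * dollars_per_byte_1972 * memory_bytes
    return base / annual_decline ** (year - 1972)


# A 64 KB system in the base year, then the same memory size a decade later.
print(f"{vax_system_price(64 * 1024, 1972):,.0f}$")
print(f"{vax_system_price(64 * 1024, 1982):,.0f}$")
```

Since 1.26 per year compounds to about 10x per decade, the same memory size costs roughly a tenth as much ten years on.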
Gordon Bell’s Processing, Memories, And Comm 100 Years
(Chart: 1947 to 2047 on a log scale from 1.E+00 to 1.E+18, with curves for Processing, Pri. Mem, Sec. Mem., POTS (bps), and Backbone.)
Gordon Bell’s Seven Price Tiers
10$: wrist watch computers
100$: pocket/palm computers
1,000$: portable computers
10,000$: personal computers (desktop)
100,000$: departmental computers (closet)
1,000,000$: site computers (glass house)
10,000,000$: regional computers (glass castle)
Super server: costs more than $100,000
“Mainframe”: costs more than $1 million
Must be an array of processors, disks, tapes, comm ports
Bell’s Evolution Of
Computer Classes
Technology enables two evolutionary paths:
1. constant performance, decreasing cost
2. constant price, increasing performance
(Chart: log price against time for mainframes (central), minis (dep’t.), WSs, and PCs (personals), with ?? marking what comes next.)
1.26 = 2x/3 yrs -- 10x/decade; 1/1.26 = .8
1.6 = 4x/3 yrs --100x/decade; 1/1.6 = .62
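The annual factors quoted here are just roots of the multi-year improvement. A short sketch that recomputes them (the function name is illustrative):

```python
def annual_factor(multiple: float, years: float) -> float:
    """Yearly growth factor that compounds to `multiple` over `years`."""
    return multiple ** (1.0 / years)


for label, multiple, years in [("2x / 3 yrs", 2, 3),
                               ("4x / 3 yrs", 4, 3),
                               ("2x / 18 months", 2, 1.5)]:
    f = annual_factor(multiple, years)
    print(f"{label}: {f:.2f} per year, {f ** 10:.0f}x per decade, "
          f"price factor {1 / f:.2f}")
```

Doubling every 18 months and quadrupling every three years are the same rate, a little under 60% per year.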
Gordon Bell’s
Platform Economics
Traditional computers: custom or semi-custom,
high-tech and high-touch
New computers: high-tech and no-touch
(Chart: price (K$), volume (K), and application price on a log scale from 0.01 to 100,000, by computer type from mainframe through WS to browser.)
Software Economics
An engineer costs about $150,000/year
R&D gets [5%…15%] of budget
Need [$3 million…$1 million] revenue per engineer
Microsoft: $9 billion. Profit 24%, R&D 16%, SG&A 34%, Tax 13%, Product and Service 13%
Intel: $16 billion. Profit 22%, R&D 8%, SG&A 11%, Tax 12%, P&S 47%
IBM: $72 billion. Profit 6%, Tax 5%, R&D 8%, SG&A 22%, P&S 59%
Oracle: $3 billion. Profit 15%, Tax 7%, R&D 9%, SG&A 43%, P&S 26%
Software Economics: Bill’s Law
Price = Fixed_Cost / Units + Marginal_Cost
Bill Joy’s law (Sun):
don’t write software for less than 100,000 platforms
@$10 million engineering expense, $1,000 price
Bill Gates’s law:
don’t write software for less than 1,000,000 platforms
@$10 engineering expense, $100 price
Examples:
UNIX
versus Windows NT: $3,500 versus $500
Oracle versus SQL-Server: $100,000 versus $6,000
No spreadsheet or presentation pack on UNIX/VMS/...
Commoditization of base software and hardware
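The pricing formula above shows why volume matters: the fixed engineering cost is divided by the number of units. The sketch below applies it to the slide's Bill Joy figures. For the Bill Gates case it assumes the same $10 million engineering expense, which is an assumption; under it the per-unit share comes to $10. Function names are illustrative.

```python
def fixed_cost_per_unit(fixed_cost: float, units: int) -> float:
    """Share of the fixed engineering cost carried by each unit sold."""
    return fixed_cost / units


def break_even_price(fixed_cost: float, units: int, marginal_cost: float) -> float:
    """Price = Fixed_Cost / Units + Marginal_Cost."""
    return fixed_cost_per_unit(fixed_cost, units) + marginal_cost


ENGINEERING = 10_000_000  # $10 million; assumed for both cases

print(f"100,000 platforms:   {fixed_cost_per_unit(ENGINEERING, 100_000):,.0f}$ per unit")
print(f"1,000,000 platforms: {fixed_cost_per_unit(ENGINEERING, 1_000_000):,.0f}$ per unit")
```

In both cases the engineering share is a tenth of the quoted price ($1,000 and $100), so ten times the volume supports a tenfold lower price.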
Grove’s Law
The New Computer Industry
Horizontal integration is new structure
Each layer picks best from lower layer
Desktop (C/S) market: 1991: 50%, 1995: 75%
Function            Example
Operation           AT&T
Integration         EDS
Applications        SAP
Middleware          Oracle
Baseware            Microsoft
Systems             Compaq
Silicon & Oxide     Intel & Seagate
Moore’s Second Law
The cost of fab lines doubles every generation (three years)
Money limit hard to imagine:
$10-billion line
$20-billion line
$40-billion line
Physical limit
Quantum effects at 0.25 micron now
0.05 micron seems hard
12 years, three generations
Lithograph: need Xray below 0.13 micron
(Chart: $million per fab line, $1 to $10,000 on a log scale, 1960 to 2000.)
Constant Dollars Versus
Constant Work
Constant work:
One SuperServer can do
all the world’s computations
Constant dollars:
The world spends 10% on
information processing
Computers are moving from
5% penetration to 50%
$300 billion to $3 trillion
We have the patent
on the byte and algorithm
Crossing The Chasm
(Figure: a two-by-two grid of old or new market against old or new technology. Quadrant labels: “Boring, competitive, slow growth”; “Product finds customers”; “Customers find product”; “No product, no customers”. Two of the quadrants are marked “Hard”.)