Scaleable Computing
Jim Gray
Microsoft Corporation
[email protected]
Thesis: Scaleable Servers
Scaleable Servers
  Commodity hardware allows new applications
  New applications need huge servers
  Clients and servers are built of the same “stuff”
    Commodity software and
    Commodity hardware
Servers should be able to
  Scale up (grow node by adding CPUs, disks, networks)
  Scale out (grow by adding nodes)
  Scale down (can start small)
Key software technologies
  Objects, Transactions, Clusters, Parallelism

1987: 256 tps Benchmark
  14 M$ computer (Tandem)
  A dozen people
  False floor, 2 rooms of machines
  A 32 node processor array
  A 40 GB disk array (80 drives)
  Simulate 25,600 clients
  People: manager, admin expert, hardware experts, network expert, performance expert, DB expert, OS expert, auditor

1988: DB2 + CICS Mainframe, 65 tps
  IBM 4391
  Simulated network of 800 clients
  2 M$ computer
  Staff of 6 to do benchmark
  2 x 3725 network controllers
  Refrigerator-sized CPU
  16 GB disk farm (4 x 8 x .5 GB)

1997: 10 years later
1 Person and 1 box = 1250 tps
  1 Breadbox ~ 5x 1987 machine room
  23 GB is hand-held
  One person does all the work (hardware, OS, net, DB, and app expert)
  Cost/tps is 1,000x less
  25 micro dollars per transaction
  4 x 200 MHz CPU
  1/2 GB DRAM
  12 x 4 GB disk
  3 x 7 x 4 GB disk arrays

What Happened?

Moore’s law:
Things get 4x better every 3 years
(applies to computers, storage, and networks)
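The growth rate quoted above compounds quickly. The following is a minimal sketch of that arithmetic; the function name `moore_factor` and its parameters are illustrative, not from the slides.

```python
def moore_factor(years, gain=4.0, period_years=3.0):
    """Improvement factor after `years` if things get `gain`x better
    every `period_years` (the slide's "4x every 3 years")."""
    return gain ** (years / period_years)


# Compound the slide's rate over a few horizons.
for years in (3, 6, 10):
    print(f"{years:2d} years: {moore_factor(years):6.1f}x")
```

At this rate a decade gives roughly a 100x improvement, so the 1,000x drop in cost/tps reported for 1987 to 1997 needs more than this rate alone; the deck points to commodity economics as a further factor.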

New Economics: Commodity
  class           price ($/mips)   software (k$/year)
  mainframe       10,000           100
  minicomputer    100              10
  microcomputer   10               1
GUI: Human - computer tradeoff
  optimize for people, not computers

What Happens Next
  Last 10 years: 1000x improvement
  Next 10 years: ????
  (timeline: 1985, 1995, 2005)
  Today: text and image servers are free
    25 m$/hit => advertising pays for them
  Future: video, audio, … servers are free
  “You ain’t seen nothing yet!”

Kinds Of Information Processing
                Point-to-point         Broadcast
  Immediate     Conversation, Money    Lecture, Concert     (Network)
  Timeshifted   Mail                   Book, Newspaper      (Database)
It’s ALL going electronic
Immediate is being stored for analysis (so ALL database)
Analysis and automatic processing are being added

Why Put Everything In Cyberspace?
  Low rent: min $/byte
  Shrinks time: now or later
  Shrinks space: here or there
  Automate processing: knowbots
  (figure labels: Network: immediate OR time-delayed, point-to-point OR broadcast; Database: locate, process, analyze, summarize)

Magnetic Storage
Cheaper Than Paper

File cabinet:
  cabinet (four drawer)   250$
  paper (24,000 sheets)   250$
  space (2x3 @ 10$/ft2)   180$
  total                   700$
  3¢/sheet
Disk:
  disk (4 GB)             800$
  ASCII: 2 mil pages      0.04¢/sheet (80x cheaper)
  Image: 200,000 pages    0.4¢/sheet (8x cheaper)
Store everything on disk
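The per-sheet figures above follow from simple division. The sketch below recomputes them from the slide's own prices and page counts; the helper name `cents_per_sheet` is illustrative.

```python
def cents_per_sheet(total_dollars, sheets):
    """Storage cost per sheet, in cents."""
    return 100.0 * total_dollars / sheets


# Figures taken from the slide; the cabinet items sum to 680$,
# which the slide rounds to 700$.
paper = cents_per_sheet(700, 24_000)
disk_ascii = cents_per_sheet(800, 2_000_000)
disk_image = cents_per_sheet(800, 200_000)

print(f"paper:      {paper:.2f} cents/sheet")
print(f"disk ASCII: {disk_ascii:.2f} cents/sheet ({paper / disk_ascii:.0f}x cheaper)")
print(f"disk image: {disk_image:.2f} cents/sheet ({paper / disk_image:.1f}x cheaper)")
```

The computed ratios come out slightly under the slide's rounded 80x and 8x, because the slide rounds the paper cost up to 3¢/sheet and then rounds the ratios again.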
Databases
Information at Your Fingertips™
Information Network™
Knowledge Navigator™


All information will be in an online database (somewhere)
You might record everything you
  Read: 10MB/day, 400 GB/lifetime (eight tapes today)
  Hear: 400MB/day, 16 TB/lifetime (three tapes/year today)
  See: 1MB/s, 40GB/day, 1.6 PB/lifetime (maybe someday)

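The daily and lifetime volumes above are mutually consistent, which a short calculation makes visible. This is a sketch assuming decimal units (1 MB = 10^6 bytes); the dictionary layout is illustrative.

```python
MB, GB, TB = 10**6, 10**9, 10**12

# (bytes per day, bytes per lifetime), copied from the slide.
streams = {
    "read": (10 * MB, 400 * GB),
    "hear": (400 * MB, 16 * TB),
    "see": (40 * GB, 1600 * TB),  # 1.6 PB
}

# Lifetime length implied by each pair of figures.
implied_days = {name: life // day for name, (day, life) in streams.items()}
for name, days in implied_days.items():
    print(f"{name}: {days} days (~{days / 365.25:.0f} years)")

# Recording time per day implied by 1 MB/s and 40 GB/day.
seeing_seconds = (40 * GB) // (1 * MB)
print(f"see: {seeing_seconds / 3600:.1f} hours of video per day")
```

All three rows imply the same lifetime of 40,000 days, and the video figure implies about eleven hours of recording per day.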
Database Store ALL Data Types
The old world:
  Millions of objects
  100-byte objects
  People
    Name    Address
    David   NY
    Mike    Berk
    Won     Austin
The new world:
  Billions of objects
  Big objects (1 MB)
  Objects have behavior (methods)
  People
    Name    Address   Papers   Picture   Voice
    David   NY
    Mike    Berk
    Won     Austin
Paperless office
Library of Congress online
All information online
  Entertainment
  Publishing
  Business
WWW and Internet

Billions Of Clients
  Every device will be “intelligent”
  Doors, rooms, cars…
  Computing will be ubiquitous

Billions Of Clients Need Millions Of Servers
  All clients networked to servers
    May be nomadic or on-demand
  Fast clients want faster servers
  Servers provide
    Shared Data
    Control
    Coordination
    Communication
  (figure: Clients: mobile clients, fixed clients; Servers: server, super server)

Thesis
Many little beat few big
  (figure: price scale $1 million, $100 K, $10 K; Mainframe, Mini, Micro, Nano, Pico Processor; disk sizes 14", 9", 5.25", 3.5", 2.5", 1.8")
  Pico processor (1 mm^3) storage hierarchy:
    1 MB      10 pico-second ram
    100 MB    10 nano-second ram
    10 GB     10 microsecond ram
    1 TB      10 millisecond disc
    100 TB    10 second tape archive
  Smoking, hairy golf ball
  How to connect the many little parts?
  How to program the many little parts?
  Fault tolerance?
  1 M SPECmarks, 1 TFLOP
  10^6 clocks to bulk ram
  Event-horizon on chip
  VM reincarnated
  Multiprogram cache, On-Chip SMP

Future Super Server: 4T Machine
  Array of 1,000 4B machines
    1 bps processors
    1 BB DRAM
    10 BB disks
    1 Bbps comm lines
    1 TB tape robot
  A few megabucks
  Challenge:
    Manageability
    Programmability
    Security
    Availability
    Scaleability
    Affordability
  As easy as a single system
  Cyber Brick, a 4B machine: CPU, 5 GB RAM, 50 GB Disc
Future servers are CLUSTERS of processors, discs
Distributed database techniques make clusters work

The Hardware Is In Place…
And then a miracle occurs?
  SNAP: scaleable network and platforms
  Commodity-distributed OS built on:
    Commodity platforms
    Commodity network interconnect
  Enables parallel applications

Scaleable Servers
BOTH SMP And Cluster
  Grow up with SMP; 4xP6 is now standard
  Grow out with cluster
  Cluster has inexpensive parts
  (figure: personal system, departmental server, SMP super server; cluster of PCs)

SMPs Have Advantages
  Single system image: easier to manage, easier to program threads in shared memory, disk, Net
  4x SMP is commodity
  Software capable of 16x
  Problems:
    >4 not commodity
    Scale-down problem (starter systems expensive)
    There is a BIGGEST one
  (figure: personal system, departmental server, SMP super server)

Building the Largest Node
  There is a biggest node (size grows over time)
  Today, with NT, it is probably 1TB
  We are building it (with help from DEC and SPIN2)
    1 TB GeoSpatial SQL Server database (1.4 TB of disks = 320 drives).
    30K BTU, 8 KVA, 1.5 metric tons.
  Will put it on the Web as a demo app.
    10 meter image of the ENTIRE PLANET.
    2 meter image of interesting parts (2% of land)
    One pixel per meter = 500 TB uncompressed.
    Better resolution in US (courtesy of USGS).
  1-TB home page: www.SQL.1TB.com
  (figure: 1-TB SQL Server DB; satellite and aerial photos; support files)
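The "one pixel per meter = 500 TB" figure can be reproduced with a back-of-the-envelope calculation. This is a sketch under two assumptions that are not stated on the slide: an Earth surface area of about 510 million km², and one byte per uncompressed pixel.

```python
EARTH_SURFACE_KM2 = 510e6  # assumption: total surface area, land plus ocean
BYTES_PER_PIXEL = 1        # assumption: one byte per uncompressed pixel


def image_bytes(meters_per_pixel, fraction_of_surface=1.0):
    """Uncompressed size of an image covering part of the planet."""
    area_m2 = EARTH_SURFACE_KM2 * 1e6 * fraction_of_surface
    pixels = area_m2 / meters_per_pixel ** 2
    return pixels * BYTES_PER_PIXEL


for resolution in (1, 10):
    size_tb = image_bytes(resolution) / 1e12
    print(f"{resolution:2d} m/pixel: {size_tb:6.1f} TB uncompressed")
```

Under these assumptions the 1 m image is about 510 TB, in line with the slide's rounded 500 TB, and each 10x coarsening of resolution shrinks the image 100x.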
What’s a TeraByte?
1 Terabyte:
  1,000,000,000 business letters      150 miles of book shelf
  100,000,000 book pages              15 miles of book shelf
  50,000,000 FAX images               7 miles of book shelf
  10,000,000 TV pictures (mpeg)       10 days of video
  4,000 LandSat images                16 earth images (100m)
  100,000,000 web pages               10 copies of the web HTML
Library of Congress (in ASCII) is 25 TB
  1980: $200 million of disc (10,000 discs), $5 million of tape silo (10,000 tapes)
  1997: $200 k of magnetic disc (48 discs), $30 k of nearline tape (20 tapes)
Terror Byte!

TB DB User Interface
TPC-C Web-Based Benchmarks
  Client is a Web browser (7,500 of them!)
  Submits Order, Invoice, Query to server via Web page interface
  Web server translates to DB
  SQL does DB work
  Net:
    easy to implement
    performance is GREAT!
  (figure: HTTP, IIS = Web, ODBC, SQL)

Grow UP and OUT
  Grow up: 1 Terabyte DB (personal system, departmental server, SMP super server)
  Grow out: 1 billion transactions per day
  Cluster:
    a collection of nodes
    as easy to program and manage as a single node

Clusters Have Advantages
  Clients and servers made from the same stuff
  Inexpensive:
    Built with commodity components
  Fault tolerance:
    Spare modules mask failures
  Modular growth
    Grow by adding small modules
  Unlimited growth:
    no biggest one

Windows NT Clusters
  Microsoft & 60 vendors defining NT clusters
    Almost all big hardware and software vendors involved
  No special hardware needed - but it may help
  Fault-tolerant first, scaleable second
    Microsoft, Oracle, SAP giving demos today
  Enables
    Commodity fault-tolerance
    Commodity parallelism (data mining, virtual reality…)
    Also great for workgroups!

Billion Transactions per Day Project
  Building a 20-node Windows NT Cluster (with help from Intel)
  > 800 disks
  All commodity parts
  Using SQL Server & DTC distributed transactions
  Each node has 1/20th of the DB
  Each node does 1/20th of the work
  15% of the transactions are “distributed”
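The slide does not say how the database is divided among the 20 nodes. The sketch below assumes simple hash partitioning on a record key, purely for illustration; `node_for`, `is_distributed`, and the key format are hypothetical names, not part of SQL Server or DTC.

```python
import zlib

NODES = 20  # from the slide: each node holds 1/20th of the DB


def node_for(key: str, nodes: int = NODES) -> int:
    """Home node of a record key (hypothetical hash partitioning)."""
    return zlib.crc32(key.encode("utf-8")) % nodes


def is_distributed(keys) -> bool:
    """A transaction is distributed when its records span several nodes."""
    return len({node_for(key) for key in keys}) > 1


local_txn = ["account:42", "account:42"]
print(node_for("account:42"), is_distributed(local_txn))
```

A transaction that touches records on one node can commit locally; one that spans nodes needs a distributed commit, which is the role the slide assigns to DTC.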
How Much Is 1 Billion Transactions Per Day?
  1 Btpd = 11,574 tps (transactions per second)
         ~ 700,000 tpm (transactions/minute)
  AT&T: 185 million calls (peak day worldwide)
  Visa ~20 M tpd
    400 M customers
    250,000 ATMs worldwide
    7 billion transactions / year (card+cheque) in 1994
  (chart: millions of transactions per day, log scale from 0.1 to 1,000; NYSE, BofA, Visa, AT&T, 1 Btpd)
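The rate conversions above are straightforward to check. A minimal sketch, using only figures quoted on the slide:

```python
SECONDS_PER_DAY = 24 * 60 * 60

btpd = 1_000_000_000            # one billion transactions per day
tps = btpd / SECONDS_PER_DAY    # transactions per second
tpm = tps * 60                  # transactions per minute

# Visa: 7 billion transactions per year, expressed per day.
visa_tpd = 7_000_000_000 / 365

print(f"1 Btpd = {tps:,.0f} tps = {tpm:,.0f} tpm")
print(f"Visa  ~ {visa_tpd / 1e6:.1f} M tpd")
```

The per-minute figure is about 694,000, which the slide rounds to 700,000 tpm, and the Visa yearly volume works out to roughly 19 million per day, matching the slide's "~20 M tpd".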

Parallelism
The OTHER aspect of clusters
  Clusters of machines allow two kinds of parallelism
    Many little jobs: online transaction processing
      TPC-A, B, C…
    A few big jobs: data search and analysis
      TPC-D, DSS, OLAP
  Both give automatic parallelism

Kinds of Parallel Execution
  Pipeline: any sequential program feeds any sequential program
  Partition: outputs split N ways, inputs merge M ways; each partition runs any sequential program
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
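Pipeline execution chains ordinary sequential operators so that each consumes the previous one's output as it is produced. The sketch below shows that dataflow shape with Python generators; the operator names and sample rows are illustrative, and generators interleave the stages on one thread rather than running them on separate processors.

```python
def scan(rows):
    """Source operator: emit rows one at a time."""
    for row in rows:
        yield row


def select(rows, predicate):
    """Filter operator: pass along rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row


def project(rows, column):
    """Projection operator: keep a single column."""
    for row in rows:
        yield row[column]


people = [
    {"name": "David", "address": "NY"},
    {"name": "Mike", "address": "Berk"},
    {"name": "Won", "address": "Austin"},
]

# Each stage is a plain sequential program; composition forms the pipeline.
pipeline = project(select(scan(people), lambda r: r["address"] != "NY"), "name")
result = list(pipeline)
print(result)
```

Each operator is written without any knowledge of its neighbors, which is the sense in which any sequential program can be placed in a pipeline.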
Partitioned Execution
Spreads computation and IO among processors
  (figure: a table partitioned by key range A...E, F...J, K...N, O...S, T...Z; a Count on each partition feeds a final Count)
Partitioned data gives NATURAL parallelism
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
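The partitioned count in the figure can be sketched as follows. This is an illustration only: worker threads stand in for the processors, the key ranges are the ones shown on the slide, and the function names and sample names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Key ranges from the slide's figure.
RANGES = [("A", "E"), ("F", "J"), ("K", "N"), ("O", "S"), ("T", "Z")]


def partition(names):
    """Place each name in the partition whose range holds its first letter."""
    parts = [[] for _ in RANGES]
    for name in names:
        first = name[0].upper()
        for i, (low, high) in enumerate(RANGES):
            if low <= first <= high:
                parts[i].append(name)
                break
    return parts


def parallel_count(names):
    """Count each partition independently, then combine the partial counts."""
    parts = partition(names)
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partial = list(pool.map(len, parts))
    return partial, sum(partial)


partial, total = parallel_count(["David", "Mike", "Won", "Alice", "Zed", "Olga"])
print(partial, total)
```

The per-partition counts need no coordination with each other; only the final sum sees all of them, which is why partitioned data parallelizes naturally.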
N x M way Parallelism
  (figure: partitioned data A...E, F...J, K...N, O...S, T...Z; a Join and a Sort on each partition; three Merge operators)
N inputs, M outputs, no bottlenecks.
Partitioned and Pipelined Data Flows
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
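One way to read "N inputs, M outputs" is a sort in which each of N sorters splits its sorted run M ways by key range and each of M mergers combines the N pieces it receives. The sketch below runs the stages sequentially for clarity; the function names, boundary values, and sample data are assumptions.

```python
import heapq
from bisect import bisect_right


def split_sorted(run, boundaries):
    """Split one sorted run into len(boundaries) + 1 sorted pieces by key range."""
    pieces = [[] for _ in range(len(boundaries) + 1)]
    for item in run:
        pieces[bisect_right(boundaries, item)].append(item)
    return pieces


def n_by_m_sort(inputs, boundaries):
    """N sorters each split their output M ways; M mergers each merge N inputs."""
    split_runs = [split_sorted(sorted(part), boundaries) for part in inputs]
    outputs = []
    for m in range(len(boundaries) + 1):
        outputs.append(list(heapq.merge(*(runs[m] for runs in split_runs))))
    return outputs


inputs = [[9, 2, 7], [4, 8, 1], [6, 3, 5]]         # N = 3 input partitions
outputs = n_by_m_sort(inputs, boundaries=[3, 6])   # M = 3 output ranges
print(outputs)
```

Every sorter and every merger works on its own slice of the data, so no single operator has to see the whole input, which is the "no bottlenecks" property the slide claims.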
The Parallel Law Of Computing
Grosch's Law: 2x $ is 4x performance
  1 MIPS for 1$
  1,000 MIPS for 32$ (.03$/MIPS)
Parallel Law: 2x $ is 2x performance
  1 MIPS for 1$
  1,000 MIPS for 1,000$
  Needs: linear speedup and linear scale-up
  Not always possible
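The two laws give very different prices for the same 1,000 MIPS. The sketch below normalizes both to the slide's starting point of 1 MIPS for 1$; the function names are illustrative.

```python
import math


def grosch_cost(mips):
    """Grosch's Law: performance grows as the square of cost,
    so cost grows as the square root of performance."""
    return math.sqrt(mips)


def parallel_cost(mips):
    """Parallel Law: performance grows linearly with cost."""
    return float(mips)


for mips in (1, 1000):
    g, p = grosch_cost(mips), parallel_cost(mips)
    print(f"{mips:5d} MIPS: Grosch {g:7.2f}$ ({g / mips:.3f}$/MIPS), "
          f"parallel {p:7.2f}$ ({p / mips:.3f}$/MIPS)")
```

Under Grosch's Law 1,000 MIPS costs about 32$, matching the slide's .03$/MIPS; under the Parallel Law the price per MIPS stays flat, and only if speedup and scale-up are linear.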
The BIG Picture
Components and transactions
  Software modules are objects
  Object Request Broker (a.k.a., Transaction Processing Monitor) connects objects (clients to servers)
  Standard interfaces allow software plug-ins
  Transaction ties execution of a “job” into an atomic unit: all-or-nothing, durable, isolated
  (figure label: Object Request Broker)

Linking And Embedding
Objects are data modules; transactions are execution modules
  Link: pointer to object somewhere else
    Think URL in Internet
  Embed: bytes are here
  Objects may be active; can call back to subscribers

Objects Meet Databases
The basis for universal data servers, access, & integration
  Object-oriented (COM oriented) programming interface to data
  Breaks DBMS into components
  Anything can be a data source
  Optimization/navigation “on top of” other data sources
  A way to componentize a DBMS
  Makes an RDBMS an O-R DBMS (assumes optimizer understands objects)
  (figure: DBMS engine over data sources: Database, Spreadsheet, Photos, Mail, Map, Document)

The Three Tiers
  Web Client: HTML, VB, Java plug-ins, VBScript, JavaScript
  Internet: HTTP + DCOM
  Middleware: Web Server..., ORB, TP Monitor, object server pool, VB or Java Script Engine, VB or Java Virt Machine
  Object & Data server: DCOM (oleDB, ODBC,...), ORB
  IBM, Legacy Gateways

Server Side Objects
Easy Server-Side Execution
  Give simple execution environment
  Object gets
    start
    invoke
    shutdown
  Everything else is automatic
  Drag & Drop Business Objects
  (figure: A Server: Network, Receiver, Queue, Connections, Context, Security, Thread Pool, Service logic, Synchronization, Shared Data, Configuration, Management)

A new programming paradigm
  Develop object on the desktop
  Better yet: download them from the Net
  Script work flows as method invocations
  All on desktop
  Then, move work flows and objects to server(s)
  Gives desktop development, three-tier deployment
  Software Cyberbricks

Transactions Coordinate Components (ACID)
  Transaction properties
    Atomic: all or nothing
    Consistent: old and new values
    Isolated: automatic locking or versioning
    Durable: once committed, effects survive
  Transactions are built into modern OSs
    MVS/TM, Tandem TMF, VMS DEC-DTM, NT-DTC
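The atomic property listed above can be seen in a few lines. The sketch below uses Python's built-in sqlite3 module as a stand-in for the transaction managers named on the slide; the table, the names, and the `transfer` helper are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("David", 100), ("Mike", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money atomically: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account ORDER BY name"))


ok_small = transfer(conn, "David", "Mike", 30)   # succeeds
ok_large = transfer(conn, "David", "Mike", 500)  # debit violates the CHECK
print(ok_small, ok_large, balances(conn))
```

In the failed transfer the credit to Mike has already been applied when the debit is rejected; the rollback removes it, so no partial effect survives.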
Transactions & Objects
  Application requests transaction identifier (XID)
  XID flows with method invocations
  Object Managers join (enlist) in transaction
  Distributed Transaction Manager coordinates commit/abort
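The slide says only that the Distributed Transaction Manager coordinates commit/abort. The sketch below assumes the standard two-phase commit protocol for that step; the classes `ObjectManager` and `TransactionManager` and all of their methods are hypothetical and do not represent the DTC API.

```python
import itertools


class ObjectManager:
    """Hypothetical resource manager that can enlist in a transaction."""

    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.state = {}  # xid -> "enlisted" | "prepared" | "committed" | "aborted"

    def prepare(self, xid):
        self.state[xid] = "prepared" if self.will_vote_yes else "aborted"
        return self.will_vote_yes

    def commit(self, xid):
        self.state[xid] = "committed"

    def abort(self, xid):
        self.state[xid] = "aborted"


class TransactionManager:
    """Hypothetical coordinator using two-phase commit (an assumption)."""

    def __init__(self):
        self._next_xid = itertools.count(1)
        self._enlisted = {}

    def begin(self):
        xid = next(self._next_xid)
        self._enlisted[xid] = []
        return xid

    def enlist(self, xid, manager):
        if manager not in self._enlisted[xid]:
            self._enlisted[xid].append(manager)
            manager.state[xid] = "enlisted"

    def commit(self, xid):
        managers = self._enlisted.pop(xid)
        votes = [m.prepare(xid) for m in managers]  # phase 1: collect every vote
        for m in managers:                          # phase 2: same outcome for all
            m.commit(xid) if all(votes) else m.abort(xid)
        return "committed" if all(votes) else "aborted"


tm = TransactionManager()
db, queue = ObjectManager("db"), ObjectManager("queue", will_vote_yes=False)

xid1 = tm.begin()
tm.enlist(xid1, db)
outcome1 = tm.commit(xid1)

xid2 = tm.begin()
tm.enlist(xid2, db)
tm.enlist(xid2, queue)
outcome2 = tm.commit(xid2)
print(outcome1, outcome2, db.state)
```

The XID is the only thing the application passes around; each object manager that sees it enlists once, and a single "no" vote aborts the work on every enlisted manager.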
Distributed Transactions Enable Huge Throughput
  Each node capable of 7 KtpmC (7,000 active users!)
  Can add nodes to cluster (to support 100,000 users)
  Transactions coordinate nodes
  ORB / TP monitor spreads work among nodes

Distributed Transactions Enable Huge DBs
  Distributed database technology spreads data among nodes
  Transaction processing technology manages nodes

Thesis: Scaleable Servers
Scaleable Servers Built from Cyberbricks
  Allow new applications
Servers should be able to
  Scale up, out, down
Key software technologies
  Clusters (ties the hardware together)
  Parallelism (uses the independent cpus, stores, wires)
  Objects (software CyberBricks)
  Transactions (masks errors)