Windows NT Scalability
Jim Gray
Microsoft Research
Gray@Microsoft.com
http://www.research.Microsoft.com/~Gray/talks/
Outline
• Scalability: What & Why?
• Scale UP: NT SMP scalability
• Scale OUT: NT Cluster scalability
• Key Message:
– NT can do the most demanding apps today.
– Tomorrow will be even better.
What is Scalability?
• Grow without limits
– Capacity
– Throughput
• Do not add complexity
– design
– administer
– operate
– use
[Figure: the scalability spectrum: Handheld, TV, NetPC, Win Term, Portable, Workstation, PC, Server, Server Cluster, Super Server]
Scale UP & OUT: Focus Here
• Grow without limits
– SMP: 4, 8, 16, 32 CPUs
– 64-bit addressing
– Huge storage
• Cluster Requirements
– Auto manage
– High availability
– Transparency
– Programming tools & apps
[Figure: Server, Server Cluster, Super Server]
Scalability is Important
• Automation benefits growing
– ROI of 1 month....
• Slice price going to zero
– Cyberbrick costs 5k$
• Design, Implement & Manage cost going down
– DCOM & Viper make it easy!
– NT Clusters are easy!
• Billions of clients imply millions of HUGE servers.
• Thin clients imply huge servers.
Q: Why Does Microsoft Care?
A: Billions of clients need millions of servers
[Chart: Servers shipped per year, 1994 to 2001, axis 0 to 2,700, for Windows NT Server, NetWare, and Unix (97-01 are MS estimates)]
Expect Microsoft to work hard on Scaleable Windows NT and Scaleable BackOffice.
Key technique: INTEGRATION.
How Scaleable is NT??
The Single Node Story
• 64 bit file system in NT 1, 2, 3, 4, 5
• 8 node SMP in NT 4.E, 32 node OEM
• 64 bit addressing in NT 5
• 1 Terabyte SQL Databases (PetaByte capable)
• 10,000 users (TPC-C benchmark)
• 100 Million web hits per day (IIS)
• 50 GB Exchange mail store (next release designed for 16 TB)
• 50,000 POP3 users on Exchange (1.8 M messages/day)
• And, more coming…..
Windows NT Server Enterprise Edition
• Scalability
– 8x SMP support (32x in OEM kit)
– Larger process memory (3GB Intel)
– Unlimited Virtual Roots in IIS (web)
• Transactions
– DCOM transactions (Viper TP mon)
– Message Queuing (Falcon)
• Availability
– Clustering (WolfPack)
– Web, File, Print, DB … servers fail over.
What Happened?
• Moore’s law: Things get 4x better every 3 years (applies to computers, storage, and networks)
• New Economics: Commodity
class           price/mips ($/mips)   software (k$/year)
mainframe       10,000                100
minicomputer    100                   10
microcomputer   10                    1
• GUI: Human - computer tradeoff; optimize for people, not computers
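The "4x every 3 years" rule compounds quickly. Below is a minimal sketch of that arithmetic, assuming smooth exponential growth between the three-year steps. The function name `improvement` and its parameters are illustrative and do not come from the talk.

```python
def improvement(years: float, factor: float = 4.0, period: float = 3.0) -> float:
    """Cumulative gain after `years`, assuming a `factor`-fold
    improvement every `period` years (smooth exponential growth)."""
    return factor ** (years / period)


# Print the compounded gain for a few horizons.
for horizon in (1, 3, 6, 10):
    print(f"{horizon:>2} years: {improvement(horizon):6.1f}x")
```

Under this assumption a single decade gives roughly a hundredfold gain, which is the scale of change the later "What Happens in 10 Years?" slide describes.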
Billions Of Clients Need Millions Of Servers
• All clients networked to servers
– May be nomadic or on-demand
• Fast clients want faster servers
• Servers provide
– Shared Data
– Control
– Coordination
– Communication
[Figure: mobile clients and fixed clients connected to servers and a super server]
Thesis: Many little beat few big
[Figure: processor classes shrinking in price and size: Mainframe ($1 million), Mini ($100 K), Micro ($10 K), Nano, Pico Processor (1 MM^3); disk form factors 14", 9", 5.25", 3.5", 2.5", 1.8"]
[Figure: storage hierarchy: 1 MB of 10 pico-second ram; 100 MB of 10 nano-second ram; 10 GB of 10 microsecond ram; 1 TB of 10 millisecond disc; 100 TB of 10 second tape archive]
• Smoking, hairy golf ball
• How to connect the many little parts?
• How to program the many little parts?
• Fault tolerance?
• 1 M SPECmarks, 1TFLOP
• 10^6 clocks to bulk ram
• Event-horizon on chip
• VM reincarnated
• Multiprogram cache, On-Chip SMP
Future Super Server: 4T Machine
• Array of 1,000 4B machines
– 1 bps processors
– 1 BB DRAM
– 10 BB disks
– 1 Bbps comm lines
– 1 TB tape robot
• A few megabucks
• Challenge:
– Manageability
– Programmability
– Security
– Availability
– Scaleability
– Affordability
• As easy as a single system
[Figure: Cyber Brick, a 4B machine: CPU, 5 GB RAM, 50 GB Disc]
Future servers are CLUSTERS of processors, discs
Distributed database techniques make clusters work
The Hardware Is In Place… And then a miracle occurs?
• SNAP: scaleable network and platforms
• Commodity-distributed OS built on:
– Commodity platforms
– Commodity network interconnect
• Enables parallel applications
Thesis: Scaleable Servers
• Scaleable Servers
– Commodity hardware allows new applications
– New applications need huge servers
– Clients and servers are built of the same “stuff”
• Commodity software and
• Commodity hardware
• Servers should be able to
– Scale up (grow node by adding CPUs, disks, networks)
– Scale out (grow by adding nodes)
– Scale down (can start small)
• Key software technologies
– Objects, Transactions, Clusters, Parallelism
Scaleable Servers: BOTH SMP And Cluster
• Grow up with SMP; 4xP6 is now standard
• Grow out with cluster
• Cluster has inexpensive parts
[Figure: personal system, departmental server, and SMP super server (growing up); cluster of PCs (growing out)]
SMPs Have Advantages
• Single system image easier to manage, easier to program threads in shared memory, disk, Net
• 4x SMP is commodity
• Software capable of 16x
• Problems:
– >4 not commodity
– Scale-down problem (starter systems expensive)
• There is a BIGGEST one
[Figure: personal system, departmental server, SMP super server]
TPC-C Web-Based Benchmarks
• Client is a Web browser (9,200 of them!)
• Submits
– Order
– Invoice
– Query to server via Web page interface
• Web server translates to DB
• SQL does DB work
• Net:
– easy to implement
– performance is GREAT!
[Figure: browser clients connect over HTTP to IIS (the Web server), which uses ODBC to reach SQL]
What Happens in 10 Years?
1987: 256 tps
$14 million computer
A dozen people
Two rooms of machines
1997: 1,250 tps
50 k$ computer
One person
1 micro-dollar per transaction
(1,000x cheaper)
Ready for the next 10 years?
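The "1,000x cheaper" claim can be sanity-checked from the two systems quoted on this slide. The sketch below divides purchase price by throughput and nothing else: staffing, floor space, and amortization are deliberately ignored, and the helper name `dollars_per_tps` is illustrative.

```python
def dollars_per_tps(system_price: float, tps: float) -> float:
    """Purchase price divided by throughput (ignores staff, space, power)."""
    return system_price / tps


cost_1987 = dollars_per_tps(14_000_000, 256)   # $14 million computer, 256 tps
cost_1997 = dollars_per_tps(50_000, 1_250)     # 50 k$ computer, 1,250 tps
ratio = cost_1987 / cost_1997

print(f"1987: {cost_1987:,.0f} $/tps")
print(f"1997: {cost_1997:,.0f} $/tps")
print(f"improvement: {ratio:,.0f}x")
```

On hardware price alone the ratio comes out somewhat above 1,000x, which is consistent with the order of magnitude on the slide.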
1988: DB2 + CICS Mainframe, 65 tps
• IBM 4391
• Simulated network of 800 clients
• 2m$ computer
• Staff of 6 to do benchmark
[Figure: 2 x 3725 network controllers; refrigerator-sized CPU; 16 GB disk farm (4 x 8 x .5GB)]
NT vs UNIX SMPs
• NT traditionally ran on 1 to 4 cpus
– Scales near-linear on them
• UNIX boxes: 32-64 way SMPs
– They do 3x more tpmC
– They cost 10x more.
• 10 way NT machines are available
– They cost more
– They are faster
• My view (shared by many)
– Need clusters for availability
– Cluster commodity servers to make huge systems
– a la Tandem, Teradata, VMScluster, IBM Sysplex, IBM SP2
– Clusters reduce need for giant SMPs
[Chart: tpmC vs Time, Jan-95 to Jan-97, axis 0 to 35,000 tpmC, NT and Unix]
Transaction Throughput TPC-C
• On comparable hardware: NT scales better!
• SQL Server & NT Improving 250% per year
• NT has best Price Performance (2x cheaper)
[Chart: tpmC vs Intel CPUs, all NT results; 0 to 10 CPUs, 0 to 14,000 tpmC]
[Chart: tpmC vs Intel CPUs, NT best vs Unix best; 0 to 10 CPUs, 0 to 14,000 tpmC]
NT Scales Better Than Solaris
• Microsoft SQL on NT (Intel) scales to 6x
• Beats Sybase on Solaris (UltraSPARC) up to 11-way
[Chart: tpmC (0 to 20,000) vs cpus (0 to 20)]
New News: WOW!
HPUX-HPPA-Sybase
• Sybase on HP 16x SMP scales to 40 ktpmC!
• Price/Performance is flat (no diseconomy)
[Chart: Sybase & HP tpmC vs CPUs; 0 to 20 cpus, 0 to 45,000 tpmC]
[Chart: HP + Sybase $/tpmC vs tpmC; $0 to $140, 0 to 40,000 tpmC]
Low end More Competitive
• premium on CPUs, disks, & Oracle
[Chart: TPC Price/tpmC, broken down into processor, disk, software, net, and total/10, for four systems:]
– Sun-Oracle 52 ktpmC @ 134$/tpmC
– HP-Sybase 39 ktpmC @ 96$/tpmC
– SUN-Sybase 11.6 ktpmC @ 57$/tpmC
– Unisys-Microsoft 12 ktpmC @ 39$/tpmC
Only NT Has Economy of Scale
• NT is 2x less expensive: 40$/tpmC vs 110$/tpmC
• Only NT has economy of scale
• Unix has dis-economy of scale
[Chart: Transactions/k$ by vendor; tpmC/k$ (0.0 to 25.0) vs tpmC (0 to 40,000) for Microsoft/NT, Oracle/Unix, Sybase/Unix, Informix/Unix, DB2/Unix]
TPC-D Decision Support Benchmark
• NT has good performance and price/performance.
[Chart: TPC D 100 GB results; Price/Perf ($/QthD, 0 to 3,000) vs Performance (0 to 1,600), with arrows marked "More Throughput" and "Lower price" and three NT results labelled]
Scaleup To Big Databases?
• NT 4 and SQL Server 6.5
– DBs up to 1 Billion records,
– 100 GB
– Covers most (80%) data warehouses
• SQL Server 7.0
– Designed for Terabytes
• Hundreds of disks per server.
• SMP parallel search
– Data Mining and Multi-Media
• TerraServer is good MM example
[Figure: database sizes: Excel spreadsheet; Manhattan phone book (15MB); Human Genome (3GB); Dayton-Hudson Sales records (300GB); Satellite photos of Earth (1 TB)]
Database Scaleup: TerraServer™
• Demo NT and SQL Server scalability
• Stress test SQL Server 7.0
• Requirements
– 1 TB
– Unencumbered (put on www)
– Interesting to everyone everywhere
– And not offensive to anyone anywhere
• Loaded
– 1.1 M place names from Encarta World Atlas
– 1 M Sq Km from USGS (1 meter resolution)
– 2 M Sq Km from Russian Space agency (2 m)
• Will be on web (world’s largest atlas)
• Sell images with commerce server.
• USGS CRDA: 3 TB more coming.
TerraServer System
• DEC Alpha 4100 (4x smp) + 324 StorageWorks Drives (1.4 TB)
• RAID 5 Protected
• SQL Server 7.0
• USGS 1-meter data (30% of US)
• Russian Space data: two meter resolution SPIN-2 images (2 M km2, 2% of earth)
• Demo: http://msrlab/terraserver
Manageability
Windows NT 5.0 and Windows 98
• Active Directory tracks all objects in net
• Integration with IE 4.
– Web-centric user interface
• Management Console
– Component architecture
• Zero Admin Kit and Systems Management Server
• PlugNPlay, Instant On, Remote Boot,..
• Hydra and Intelli-Mirroring
Thin Client Support
• TSO comes to NT
• lower per-client costs
[Figure: Windows NT Server with “Hydra” Server serving a Net PC, an existing desktop PC, MS-DOS, UNIX, and Mac clients, and a dedicated Windows terminal]
Windows NT 5.0 IntelliMirror™
• Extends CMU Coda File System ideas
• Files and settings mirrored on client and server
• Great for disconnected users
• Facilitates roaming
• Easy to replace PCs
• Optimizes network performance
Best of PC and centralized computing advantages
Outline
• Scalability: What & Why?
• Scale UP: NT SMP scalability
• Scale OUT: NT Cluster scalability
• Key Message:
– NT can do the most demanding apps today.
– Tomorrow will be even better.
Scale OUT
Clusters Have Advantages
• Fault tolerance:
– Spare modules mask failures
• Modular growth without limits
– Grow by adding small modules
• Parallel data search
– Use multiple processors and disks
• Clients and servers made from the same stuff
– Inexpensive: built with commodity CyberBricks
How scaleable is NT??
The Cluster Story
• 16-node Tandem Cluster
– 64 cpus
– 2 TB of disk
– Decision support
• 45-node Compaq Cluster
– 140 cpus
– 14 GB DRAM
– 4 TB RAID disk
– OLTP (Debit Credit)
• 1 B tpd (14 k tps)
microsoft.com
• 90m hits/day
– 17m page views
– #4 site on Internet
• 900k visitors per day
• Not cheap
– Data Centers
– Bandwidth
– 27 people on content
– 22 people on systems
• Production
– Windows NT 4 and IIS 3
• 20 HTTP,
• 3 download,
• 3 FTP
• 5 SQL 6.5
• Index Server + 3 search
• Stagers
– Site Server for content
– DCOM Publishing wizard
• Network
– 6 DS3
– 4 TB/day download capacity
• Replicas in UK and Japan
Tandem 2 Ton
• 2 TB SQL database
• 1.2 TB user data
• 16 node cluster
• 64 cpus, 480 disks
• Decision support, parallel data-mining
• Will be Wolf Pack aware
• Demoed at DB Expo in
• ServerNet™ interconnect
Billion Transactions per Day Project
• Built a 45-node Windows NT Cluster (with help from Intel & Compaq)
• > 900 disks
• All off-the-shelf parts
• Using SQL Server & DTC distributed transactions
• DCOM & ODBC clients on 20 front-end nodes
• DebitCredit Transaction
• Each server node has 1/20th of the DB
• Each server node does 1/20th of the work
• 15% of the transactions are “distributed”
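The partitioning described above (20 server nodes, each with 1/20th of the data and work, 15% of transactions spanning two nodes) can be illustrated with a small simulation. This is a sketch under stated assumptions, not the project's code: the modulo placement rule in `home_node`, the account-id range, and the way the second node is picked are all hypothetical.

```python
import random
from collections import Counter

NODES = 20                    # server nodes, each holding 1/20th of the DB
DISTRIBUTED_FRACTION = 0.15   # share of transactions that touch a second node


def home_node(account_id: int) -> int:
    """Hypothetical placement rule: partition accounts by modulo."""
    return account_id % NODES


def simulate(n_txns: int, seed: int = 0):
    """Return (updates handled per node, number of distributed transactions)."""
    rng = random.Random(seed)
    work = Counter()
    distributed = 0
    for _ in range(n_txns):
        node = home_node(rng.randrange(1_000_000))
        work[node] += 1
        if rng.random() < DISTRIBUTED_FRACTION:
            # The second update lands on a different node, so a
            # transaction coordinator has to commit across both.
            other = (node + rng.randrange(1, NODES)) % NODES
            work[other] += 1
            distributed += 1
    return work, distributed


work, distributed = simulate(20_000)
print("distributed share:", distributed / 20_000)
print("busiest / idlest node:", max(work.values()), min(work.values()))
```

With uniform account selection every node ends up with close to 1/20th of the updates, which is the property that lets throughput grow with node count.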
Billion Transactions Per Day Hardware
• 45 nodes (Compaq Proliant)
• Clustered with 100 Mbps Switched Ethernet
• 140 cpu, 13 GB, 3 TB (RAID 1, 5).
Type                                  nodes                        CPUs    DRAM      ctlrs   disks                       RAID space
Workflow MTS                          20 Compaq Proliant 2500      20x 2   20x 128   20x 1   20x 1                       20x 2 GB
SQL Server                            20 Compaq Proliant 5000      20x 4   20x 512   20x 4   20x (36x4.2GB + 7x9.1GB)    20x 130 GB
Distributed Transaction Coordinator   5 Compaq Proliant 5000       5x 4    5x 256    5x 1    5x 3                        5x 8 GB
TOTAL                                 45                           140     13 GB     105     895                         3 TB
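The TOTAL row can be cross-checked from the per-node figures. The sketch below multiplies each row by its node count. The `NodeType` record and its field names are illustrative, and treating the per-node DRAM figures as megabytes is an assumption (it is the unit that reproduces the quoted 13 GB total).

```python
from dataclasses import dataclass


@dataclass
class NodeType:
    role: str
    nodes: int
    cpus: int       # per node
    dram_mb: int    # per node; MB is an assumption, see lead-in
    ctlrs: int      # per node
    disks: int      # per node
    raid_gb: int    # per node


CONFIG = [
    NodeType("Workflow (MTS)", 20, 2, 128, 1, 1, 2),
    NodeType("SQL Server", 20, 4, 512, 4, 36 + 7, 130),
    NodeType("Distributed Transaction Coordinator", 5, 4, 256, 1, 3, 8),
]


def total(field: str) -> int:
    """Sum a per-node quantity over all nodes of every type."""
    return sum(t.nodes * getattr(t, field) for t in CONFIG)


total_nodes = sum(t.nodes for t in CONFIG)
print("nodes:", total_nodes)
print("cpus:", total("cpus"))
print("DRAM (GB):", total("dram_mb") / 1024)
print("controllers:", total("ctlrs"))
print("disks:", total("disks"))
print("RAID space (GB):", total("raid_gb"))
```

The node, CPU, controller, and disk totals match the slide exactly; DRAM and RAID space match after the slide's rounding to 13 GB and 3 TB.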
Cluster Architecture
[Figure: Driver nodes (VIPDC42 to VIPDC51), Database nodes (VIPDC2 to VIPDC21), and DTC nodes (VIPDTC1 to VIPDTC5) connected through a Switch, with a Control node]
Local Debit Credit
[Figure: numbered message sequence (steps 1 to 14) among the Driver Thread, DebitCredit Driver, DebitCredit Component, and Database, covering the Init, Run, Loop, and DebitCredit calls]
Distributed Debit Credit Same DTC
[Figure: numbered message sequence (steps 11 to 29) among DebitCredit, UpdateAcct, Database1, Database2, and a single DTC]
Distributed Debit Credit Different DTC
[Figure: numbered message sequence (steps 11 to 35) among DebitCredit, UpdateAcct, Database1, Database2, DTC1, and DTC2]
1.2 B tpd
• 1 B tpd ran for 24 hrs.
• Out-of-the-box software
• Off-the-shelf hardware
• AMAZING!
• Sized for 30 days
• Linear growth
• 5 micro-dollars per transaction
How Much Is 1 Billion Tpd?
• 1 billion tpd = 11,574 tps ~ 700,000 tpm (transactions/minute)
• ATT
– 185 million calls per peak day (worldwide)
• Visa ~20 million tpd
– 400 million customers
– 250K ATMs worldwide
– 7 billion transactions (card+cheque) in 1994
• New York Stock Exchange
– 600,000 tpd
• Bank of America
– 20 million tpd checks cleared (more than any other bank)
– 1.4 million tpd ATM transactions
• Worldwide Airlines Reservations: 250 Mtpd
[Chart: Millions of Transactions Per Day (Mtpd), axis 0.1 to 1,000, comparing 1 Btpd with Visa, ATT, BofA, and NYSE]
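The rate conversions and comparisons on this slide are simple arithmetic. The sketch below reproduces them using only the workload figures quoted above; the helper names `tps` and `tpm` are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
MINUTES_PER_DAY = 24 * 60        # 1,440


def tps(tpd: float) -> float:
    """Transactions per second for a given transactions-per-day rate."""
    return tpd / SECONDS_PER_DAY


def tpm(tpd: float) -> float:
    """Transactions per minute for a given transactions-per-day rate."""
    return tpd / MINUTES_PER_DAY


# Daily volumes as quoted on the slide.
workloads_tpd = {
    "ATT calls (peak day)": 185e6,
    "Visa": 20e6,
    "Bank of America checks": 20e6,
    "Bank of America ATM": 1.4e6,
    "NYSE": 600e3,
}

print(f"1 B tpd = {tps(1e9):,.0f} tps = {tpm(1e9):,.0f} tpm")
print(f"1.2 B tpd = {tps(1.2e9):,.0f} tps")
for name, volume in workloads_tpd.items():
    print(f"1 B tpd is {1e9 / volume:,.1f}x {name}")
```

The 1.2 B tpd figure works out to just under 14,000 tps, which lines up with the "14 k tps" quoted for the 45-node cluster earlier in the talk.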
1 B tpd: So What?
• Shows what is possible, easy to build
– Grows without limits
• Shows scaleup of DTC, MTS, SQL…
• Shows (again) that shared-nothing clusters scale
• Next task: make it easy.
– auto partition data
– auto partition application
– auto manage & operate
Parallelism: The OTHER aspect of clusters
• Clusters of machines allow two kinds of parallelism
– Many little jobs: online transaction processing
• TPC-A, B, C…
– A few big jobs: data search and analysis
• TPC-D, DSS, OLAP
• Both give automatic parallelism
Kinds of Parallel Execution
Pipeline
[Figure: one "Any Sequential Program" stage feeding another]
Partition
outputs split N ways
inputs merge M ways
[Figure: parallel copies of "Any Sequential Program", one per partition]
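The two kinds of parallel execution can be sketched with ordinary sequential code. In the example below, generators stand in for pipeline stages and a list of buckets stands in for an N-way split. It models the dataflow only: everything runs in one thread, and the names `pipeline`, `split`, `double`, and `keep_multiples_of_four` are illustrative.

```python
from typing import Callable, Iterable, Iterator, List

Stage = Callable[[Iterator[int]], Iterator[int]]


def pipeline(source: Iterable[int], *stages: Stage) -> Iterator[int]:
    """Pipeline: each sequential stage consumes the previous stage's output."""
    stream: Iterator[int] = iter(source)
    for stage in stages:
        stream = stage(stream)
    return stream


def split(records: Iterable[int], n: int) -> List[List[int]]:
    """Partition: split one output stream N ways (by value modulo n here)."""
    parts: List[List[int]] = [[] for _ in range(n)]
    for record in records:
        parts[record % n].append(record)
    return parts


def double(xs: Iterator[int]) -> Iterator[int]:
    """A purely sequential stage: double every record."""
    for x in xs:
        yield 2 * x


def keep_multiples_of_four(xs: Iterator[int]) -> Iterator[int]:
    """A purely sequential stage: filter records."""
    for x in xs:
        if x % 4 == 0:
            yield x


piped = list(pipeline(range(6), double, keep_multiples_of_four))
parts = split(range(10), 3)
print(piped)
print(parts)
```

Each stage is written as if it were the only program running, which is the point of the "Any Sequential Program" boxes: parallelism comes from how the stages are wired together.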
Data Rivers: Split + Merge Streams
[Figure: N producers and M consumers connected by a River of N x M data streams]
Producers add records to the river,
Consumers consume records from the river
Purely sequential programming.
River does flow control and buffering
does partition and merge of data records
River = Split/Merge in Gamma = Exchange operator in Volcano.
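A river can be approximated with threads and bounded queues. The sketch below is an illustration of the idea, not the Gamma or Volcano implementation: N producer threads split records across M lanes by hash, each consumer merges whatever arrives on its lane, and the bounded queue supplies the flow control. `run_river` and its parameters are hypothetical names.

```python
import queue
import threading


def run_river(producer_inputs, m_consumers, consume):
    """Route records from N producers to M consumers.

    `consume` is an ordinary sequential function applied to the records
    that reach one consumer; the list of per-consumer results is returned.
    """
    n_producers = len(producer_inputs)
    # Bounded queues give simple flow control: a fast producer blocks
    # when a lane is full.
    lanes = [queue.Queue(maxsize=64) for _ in range(m_consumers)]
    results = [None] * m_consumers
    done = object()   # end-of-stream marker, one per producer per lane

    def producer(records):
        for rec in records:
            lanes[hash(rec) % m_consumers].put(rec)   # split by hash
        for lane in lanes:
            lane.put(done)

    def consumer(i):
        finished, seen = 0, []
        while finished < n_producers:                 # merge N streams
            rec = lanes[i].get()
            if rec is done:
                finished += 1
            else:
                seen.append(rec)
        results[i] = consume(seen)

    threads = [threading.Thread(target=producer, args=(p,)) for p in producer_inputs]
    threads += [threading.Thread(target=consumer, args=(i,)) for i in range(m_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Three producers, two consumers, each consumer sums what it receives.
totals = run_river([range(0, 50), range(50, 100), range(100, 150)], 2, sum)
print(totals, sum(totals))
```

Neither the producers nor the `consume` function know how many peers exist; only the river does, which is what keeps the programs themselves purely sequential.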
Partitioned Execution
Spreads computation and IO among processors
[Figure: A Table split into partitions A...E, F...J, K...N, O...S, T...Z, with one Count operator per partition feeding a final Count]
Partitioned data gives NATURAL parallelism
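The partitioned count in the figure can be sketched as one local count per key range plus a final sum. In this illustration worker threads stand in for the processors, the key ranges are the ones shown on the slide, and the names `partition_of` and `partitioned_count` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Key ranges as shown on the slide.
RANGES = [("A", "E"), ("F", "J"), ("K", "N"), ("O", "S"), ("T", "Z")]


def partition_of(name: str) -> int:
    """Index of the range partition that holds `name`."""
    first = name[0].upper()
    for i, (lo, hi) in enumerate(RANGES):
        if lo <= first <= hi:
            return i
    raise ValueError(f"no partition for {name!r}")


def partitioned_count(names):
    """Count rows per partition in parallel, then combine the local counts."""
    parts = [[] for _ in RANGES]
    for name in names:
        parts[partition_of(name)].append(name)
    with ThreadPoolExecutor(max_workers=len(RANGES)) as pool:
        local_counts = list(pool.map(len, parts))   # one Count per partition
    return local_counts, sum(local_counts)          # final Count


local_counts, grand_total = partitioned_count(
    ["Alice", "Betty", "Gray", "Zed", "Olga", "Kim"]
)
print(local_counts, grand_total)
```

Because the data is already split by key range, no coordination is needed until the final sum, which is why the slide calls this parallelism natural.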
N x M way Parallelism
[Figure: Partitioned and Pipelined Data Flows; partitioned data (A...E, F...J, K...N, O...S, T...Z) feeds five Join operators, then five Sort operators, then three Merge operators]
N inputs, M outputs, no bottlenecks.
Clusters (Plumbing)
• Single system image
– naming
– protection/security
– management/load balance
• Fault Tolerance
– Wolfpack
• Hot Pluggable hardware & Software
Windows NT clusters
• Key goals:
– Easy: to install, manage, program
– Reliable: better than a single node
– Scaleable: added parts add power
• Microsoft & 60 vendors defining NT clusters
– Almost all big hardware and software vendors involved
• No special hardware needed - but it may help
• Enables
– Commodity fault-tolerance
– Commodity parallelism (data mining, virtual reality…)
– Also great for workgroups!
• Initial: two-node failover
– Beta testing since December 96
– SAP, Microsoft, Oracle giving demos.
– File, print, Internet, mail, DB, other services
– Easy to manage
– Each node can be 4x (or more) SMP
• Next (NT5) “Wolfpack” is modest size cluster
– About 16 nodes (so 64 to 128 CPUs)
– No hard limit, algorithms designed to go further
So, What’s New?
• When slices cost 50k$, you buy 10 or 20.
• When slices cost 5k$ you buy 100 or 200.
• Manageability, programmability, usability become key issues (total cost of ownership).
• PCs are MUCH easier to use and program
MPP Vicious Cycle: No Customers!
[Figure: repeating cycle of New MPP & New OS, then New App, with no customers]
CP/Commodity Virtuous Cycle: Standards allow progress and investment protection
[Figure: cycle linking Standard platform, Apps, and Customers]
Thesis: Scaleable Servers
• Scaleable Servers
– Commodity hardware allows new applications
– New applications need huge servers
– Clients and servers are built of the same “stuff”
• Commodity software and
• Commodity hardware
• Servers should be able to
– Scale up (grow node by adding CPUs, disks, networks)
– Scale out (grow by adding nodes)
– Scale down (can start small)
• Key software technologies
– Objects, Transactions, Clusters, Parallelism
WolfPack Cluster
IIS & SQL Failover Demo
[Figure: a Browser connected to two cluster nodes, Alice and Betty; each can host the Web site and the Database, backed by the Web site files and Database files]
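The two-node failover in the demo can be modelled in a few lines. This is a toy model and does not use any WolfPack interface: the `Cluster` class is hypothetical, and the initial placement (Web site on Alice, Database on Betty) is an assumption made for illustration.

```python
class Cluster:
    """Toy model of service ownership in a failover cluster."""

    def __init__(self, nodes, placement):
        self.alive = {node: True for node in nodes}
        self.owner = dict(placement)   # service name -> owning node

    def fail(self, node):
        """Mark `node` as down and move its services to a surviving node."""
        self.alive[node] = False
        survivors = [n for n, up in self.alive.items() if up]
        if not survivors:
            raise RuntimeError("no surviving node to fail over to")
        for service, current in self.owner.items():
            if current == node:
                self.owner[service] = survivors[0]

    def route(self, service):
        """Node that a client request for `service` should reach."""
        return self.owner[service]


# Assumed starting placement; the slide does not say which node runs what.
cluster = Cluster(["Alice", "Betty"], {"Web site": "Alice", "Database": "Betty"})
cluster.fail("Alice")
print(cluster.route("Web site"), cluster.route("Database"))
```

After Alice fails, both services are served from Betty. The browser keeps using the same service names, which is the transparency the demo is meant to show.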
Summary
• SMP Scale UP: OK but limited
• Cluster Scale OUT: OK and unlimited
• Manageability:
– fault tolerance OK & easy!
– more needed
• CyberBricks work
• Manual Federation now
• Automatic in future
Scalability Research Problems
• Automatic everything
• Scaleable applications
– Parallel programming with clusters
– Harvesting cluster resources
• Data and process placement
– auto load balance
– dealing with scale (thousands of nodes)
• High-performance DCOM
– active messages meet ORBs?
• Process pairs, other FT concepts?
• Real time: instant failover
• Geographic (WAN) failover