Computers are Free, Now What?
Premise:
You're a Fortune 1,000 CIO
I’m a DB+OS guy selling CyberBricks
What can I say in an hour that you do not know?
How can I help you plan for CyberBricks?
Jim Gray
Microsoft Research
[email protected]
http://research.Microsoft.com/~Gray
415 778 8222
1
Outline
• Why cost per transaction dropped 100,000x in 10
years.
• How does that change things?
• What next (technology trends)
• Clusters of Hardware and Software CyberBricks
2
Systems 30 Years Ago
• MegaBuck per Mega Instruction Per Second (mips)
• MegaBuck per MegaByte
• Sys Admin & Data Admin per MegaBuck
3
Disks of 30 Years Ago
• 10 MB
• Failed every few weeks
4
1988: IBM DB2 + CICS Mainframe, 65 tps
• IBM 4391
• Simulated network of 800 clients
• 2 M$ computer
• Staff of 6 to do the benchmark
• 2 x 3725 network controllers
• Refrigerator-sized CPU
• 16 GB disk farm: 4 x 8 x .5 GB
5
1987: Tandem Mini @ 256 tps
• 14 M$ computer (Tandem)
• A dozen people (1.8 M$/y): admin expert, performance expert, hardware experts, network expert, DB expert, OS expert, auditor, manager
• False floor, 2 rooms of machines
• 32-node processor array
• 40 GB disk array (80 drives)
• Simulated 25,600 clients
6
1997: 9 years later
1 Person and 1 box = 1250 tps
• 1 breadbox ~ 5x the 1987 machine room
• 23 GB is hand-held
• One person does all the work (hardware, OS, net, DB, and app expert in one)
• Cost/tps is 100,000x less: 5 micro-dollars per transaction
• 4 x 200 MHz cpu, 1/2 GB DRAM
• 12 x 4 GB disk, 3 x 7 x 4 GB disk arrays
7
Cost Per Transaction
• Industry uses $/tps (or $/tpm):
the 5-year cost of hardware and software to get 1 tps.
• A system rarely runs at its rated peak, so 1 tps of capacity does on the order of 1 million transactions over its life.
• So, if $/tps is 1$, $/t is about 1 micro-dollar.
• 1988: mini: 50 K$/tps
mainframe: 150 K$/tps
– 5 cents to 15 cents per transaction
• 1998: micro: 30 $/tpmC
– ~5 micro-dollars per transaction (see the sketch below)
– note: it is really about 6x less than this, since a tpcC transaction is about 6 tpcA transactions
8
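A back-of-envelope check of the micro-dollar claim (my arithmetic, not from the slide): amortize the $/tpmC rating over five years of minutes, the life the TPC pricing rule assumes, then apply the slide's note that a tpcC transaction is roughly six tpcA-sized transactions.

```python
# Amortize a TPC-C price/performance rating over a 5-year system life.
MINUTES_IN_5_YEARS = 5 * 365 * 24 * 60          # ~2.6 million minutes

def micro_dollars_per_txn(price_per_tpmC):
    """Convert $/tpmC into micro-dollars per transaction."""
    return price_per_tpmC / MINUTES_IN_5_YEARS * 1e6

print(micro_dollars_per_txn(30))        # 1998 micro: ~11 micro-dollars
print(micro_dollars_per_txn(30) / 6)    # ~2 micro-dollars in tpcA-sized units
```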
UNIX vs WindowsNT
• Solaris on SPARC ranges from 11,559 tpmC @ 57 $/tpmC (Sybase)
to 51,871 tpmC @ 135 $/tpmC (Oracle)
• SQL Server on NT/Compaq ranges from 11,748 tpmC @ 27 $/tpmC
to 18,129 tpmC @ 27 $/tpmC
• NT price per transaction is 2x to 4x less;
peak performance per node is 3x less.
• The markup is in processor and software (disk and DRAM prices are OK).
[Chart: TPC Price/tpmC broken down by processor, disk, software, net, and total/10, comparing Sun Oracle (52 ktpmC @ 134 $/tpmC) with HP + NT4 + SQL Server (16.2 ktpmC @ 33 $/tpmC).]
• Note: current NT prices are 27 $/tpmC, not 33 $/tpmC, so 23% lower than shown.
• UNIX is 5x less than MVS, according to David Matthews, “Large Server TCO: The UNIX Advantage”, Unix Review Feb 1998 Reseller Supplement, pp 3-11.
9
What Happened?
Where did the 100,000x come from?
• Moore’s law: 100X (at most)
• Software improvements: 10X (at most)
• Commodity pricing: 100X (at least)
• Total: 100,000X
• 100x from commodity:
– DBMS was 100 K$ to start; now 1 K$ to start
– IBM 390 MIPS is 7.5 K$ today
– Intel MIPS is 10 $ today
– Commodity disk is 50 $/GB vs 1,500 $/GB
– ...
10
Outline
• Why cost per transaction has dropped 100,000x in
10 years.
• How does that change things?
• What next (technology trends)
• Clusters of Hardware and Software CyberBricks
11
What does 1 μ$/t Mean?
• Human attention is the precious resource.
• Content is the precious resource.
• Impressions (eyeballs) sell for 10,000 μ$ to 100,000 μ$.
• All the cost (and value) is in content and admin.
• Aside: this month the TerraServer got 400 M hits and 40 M impressions,
making it a 2 M$/month asset (for satellite photos); the sketch below checks the arithmetic.
• That’s why everyone is hot on portals.
12
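A one-line check of the TerraServer aside (my arithmetic, using the slide's impression prices): 40 M impressions at 10,000 to 100,000 μ$ each brackets the 2 M$/month figure.

```python
impressions = 40_000_000                      # TerraServer impressions this month
low, high = 10_000e-6, 100_000e-6             # price per impression, in dollars
print(impressions * low, impressions * high)  # 400,000 to 4,000,000 $/month
```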
Administration Costs
• Vendor rule of thumb (1970s mainframe):
– one systems programmer per MIPS
– one data admin per 10 GB
• Data center rule of thumb:
– hardware & facilities is 40%
– labor is 60%
– => 100 sys pgmrs and 1 data admin per laptop!
• 1995 Federal study of their data centers:
– 1 to 3 MIPS per admin! (http://research.microsoft.com/~gray/NC_Servers.doc)
• Thin client:
– moves admin to the server
– claim: saves admin costs
– reality: moves admin costs to expensive fixed staff
– Time will tell.
13
Content Costs
• For most web sites
– Most staff are doing content
– Admin is a small fraction of the content cost
• RULE OF THUMB:
– Hardware/software/facilities/admin is 10% of content
– Content is 90% of cost
– This seems to apply to
• microsoft.com, msn, WebTV, HotMail, Inktomi
• MAIN CONCLUSION
– Hardware, software, admin is in micro$/t range
– Unix and mainframes are 2x or 10x more micro$
– Who cares? Cost is in content
– Look for content creation/management tools
14
Legacy Latency: a personal tale
• 1970s helped company X convert to IMS/Fast Path
• 1980s helped company X experiment with Tandem
mini-computers
• 1990s visit and ask:
– Why are you still buying those mainframes?
• Answers:
1. They are up all the time (99.99% up).
2. 25 years ago ROI was 18 months, now it is 1 week.
3. A rewrite would cost more than it would ever save.
4. My career would not survive a rewrite.
5. The devil you know is better than the devil you don’t.
15
Put Another Way
• You are AT&T, or the airline industry, or...
You do 300 M transactions/day.
• The capital cost of these transactions is:
– 300 $/day on NT
– 1,000 $/day on Solaris
– 10,000 $/day on MVS
• Who cares?
Revenue and costs are 200,000,000 $/day,
so the transaction cost is between .005% and .0002%.
• But, if productivity is higher on Solaris or NT...
or if tools exist on them, then....
or if the cost of a 2nd or 3rd environment is huge (staff), then...
• New apps should not go on MVS!
• Investing in SNA? Investing in IMS? Investing in TPF?..
16
What Happens Next
• Last 10 years: 100,000x improvement
• Next 10 years: ????
• Today: text and image servers are free
(25 m$/hit => advertising pays for them)
• Future: video, audio, ... servers are free
“You ain’t seen nothing yet!”
17
And So...
• Traditional transaction processing is a zero-billion dollar industry.
• Growth is in new apps:

                 Immediate (Network)     Time-Shifted (Data Base)
Point-to-Point   conversation, money     mail
Broadcast        lecture, concert        book, newspaper

• It’s ALL going electronic.
• Immediate is being stored for analysis (so it is ALL database).
• Analysis & automatic processing are being added.
18
Why Put Everything in Cyberspace?
• Low rent: min $/byte
• Shrinks time: now or later (immediate OR time-delayed)
• Shrinks space: here or there (point-to-point OR broadcast)
• Automate processing: knowbots
• Network + database: locate, process, analyze, summarize
19
Some Tera-Byte Databases
• The Web: 1 TB of HTML
• TerraServer: 1 TB of images
• Many 1 TB (file) servers
• Hotmail: 7 TB of email
• Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked
• EOS/DIS (picture of the planet each week): 15 PB by 2007
• Federal clearing house (images of checks): 15 PB by 2006 (7-year history)
• Nuclear Stockpile Stewardship Program: 10 exabytes (???!!)
20
[Figure: the information scale, kilo to yotta:
a letter ~ kilobytes; a novel ~ megabytes; a movie ~ gigabytes;
Library of Congress (text) ~ terabytes; LoC (image) and LoC (sound + cinema) ~ petabytes;
all photos, all disks, all tapes ~ exabytes; all information! ~ zetta/yottabytes.]
22
Michael Lesk’s Points
www.lesk.com/mlesk/ksg97/ksg.html
• Soon everything can be recorded and kept
• Most data will never be seen by humans
• Precious Resource: Human attention
• Auto-summarization and auto-search
will be key enabling technologies.
23
Outline
• Why cost per transaction has dropped 100,000x in
10 years.
• How does that change things?
• What next (technology trends)
• Clusters of Hardware and Software CyberBricks
24
Technology (hardware)
NOW
• CPU: nearing 1 BIPS
– but CPI rising fast (2-10), so less than 100 mips
– 1 $/mips to 10 $/mips
• DRAM: 3 $/MB
• DISK: 30 $/GB
• TAPE: 20 GB/tape, 6 MBps
– 2 $/GB offline, 15 $/GB nearline
2003 Forecast (10x better)
• CPU: 1 BIPS real (SMP)
– 0.1 $ - 1 $/mips
• DRAM: 1 Gb chip
– 0.1 $/MB
• DISK: 10 GB smart cards, 500 GB RAID packs (NT inside)
– 3 $/GB
• TAPE: lags disk – ?
25
System On A Chip
• Integrate processing with memory on one chip
– chip is 75% memory now
– 1 MB cache >> 1960s supercomputers
– 256 Mb memory chip is 32 MB!
– IRAM, CRAM, PIM, ... projects abound
• Integrate networking with processing on one chip
– the system bus is a kind of network
– ATM, FiberChannel, Ethernet, ... logic on chip
– direct IO (no intermediate bus)
• Functionally specialized cards shrink to a chip.
26
Thesis
Many little beat few big
[Figure: the price/size pyramid from mainframe ($1 million) through mini ($100 K), micro ($10 K), to pico processor; the storage hierarchy from 1 MB nano-store with 10 ps RAM, 100 MB with 10 ns RAM, 10 GB with 10 μs RAM, 1 TB with 10 ms disc, to 100 TB with 10 s tape archive; and disk form factors shrinking 14" → 9" → 5.25" → 3.5" → 2.5" → 1.8".]
• The future processor: a “smoking, hairy golf ball”
– 1 M SPECmarks, 1 TFLOP
– 10^6 clocks to bulk RAM
– event-horizon on chip
– VM reincarnated
– multi-program cache, on-chip SMP
• How to connect the many little parts?
• How to program the many little parts?
• Fault tolerance?
27
Storage Latency:
How Far Away is the Data?
(in clock ticks, with one tick scaled to about a minute of human time)
• Registers (1): my head
• On-chip cache (2): this room, 1 min
• On-board cache (10): this campus, 10 min
• Memory (100): Sacramento, 1.5 hr
• Disk (10^6): Pluto, 2 years
• Tape/optical robot (10^9): Andromeda, 2,000 years
28
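The analogy works because the slide implicitly scales one clock tick to about a minute of human travel time; a quick check (my numbers, not on the slide) reproduces the figure's distances:

```python
# Scale: 1 clock tick ~ 1 minute of human time (the slide's implicit ratio).
levels = [("registers", 1), ("on-chip cache", 2), ("on-board cache", 10),
          ("memory", 100), ("disk", 10**6), ("tape robot", 10**9)]
for name, clocks in levels:
    years = clocks / (60 * 24 * 365)          # minutes -> years
    print(f"{name}: {clocks:,} min = {years:,.5f} years")
# disk: 1e6 min ~ 1.9 years ("Pluto"); tape: 1e9 min ~ 1,900 years ("Andromeda")
```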
Gilder’s Telecosm Law:
3x bandwidth/year for 25 more years
• Today:
– 10 Gbps per channel
– 4 channels per fiber: 40 Gbps
– 32 fibers/bundle = 1.2 Tbps/bundle
• In lab 3 Tbps/fiber (400 x WDM)
• In theory 25 Tbps per fiber
• 1 Tbps = USA 1996 WAN bisection bandwidth
1 fiber = 25 Tbps
29
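Checking the bundle arithmetic (mine, not on the slide):

```python
channel_gbps = 10          # per-channel rate
channels = 4               # WDM channels per fiber
fibers = 32                # fibers per bundle
fiber_gbps = channel_gbps * channels        # 40 Gbps per fiber
bundle_tbps = fiber_gbps * fibers / 1000    # 1.28 Tbps (slide rounds to 1.2)
print(fiber_gbps, bundle_tbps)
```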
Networking
BIG!! Changes coming!
• Technology
– 10 GBps bus “now”
– 1 Gbps links “now”
– 1 Tbps links in 10 years
– fast & cheap switches
• Standard interconnects
– processor-processor
– processor-device (device = processor)
– SAN/VIA
• Deregulation WILL work, someday
• CHALLENGE: reduce the software tax on messages (see the sketch below)
– today: 30 K instructions + 10 instructions/byte
– goal: 1 K instructions + .01 instructions/byte
• Best bet:
– smart NICs
– special protocols
– user-level net IO (like disk)
30
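What the software tax means for one typical message (my arithmetic, using the slide's per-message and per-byte instruction counts): for an 8 KB transfer the goal is roughly a 100x reduction.

```python
def instructions(msg_bytes, per_msg, per_byte):
    """Host instructions spent to send one message."""
    return per_msg + per_byte * msg_bytes

today = instructions(8192, 30_000, 10)    # ~112K instructions per 8 KB message
goal = instructions(8192, 1_000, 0.01)    # ~1.1K instructions
print(today, goal, today / goal)          # ~100x less software overhead
```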
What if
Networking Was as Cheap As Disk IO?
• TCP/IP: Unix/NT, 100% cpu @ 40 MBps
• Disk: Unix/NT, 8% cpu @ 40 MBps
Why the difference?
• For TCP/IP, the host does the packetizing, checksums, and flow control, with small buffers.
• For disks, the Host Bus Adapter does the SCSI packetizing, checksums, flow control, and DMA.
31
The Promise of SAN/VIA
10x better in 2 years
• Today:
– wires are 10 MBps (100 Mbps Ethernet)
– ~20 MBps of tcp/ip saturates 2 cpus
– round-trip latency is ~300 μs
• In two years:
– wires are 100 MBps (1 Gbps Ethernet, ServerNet, ...)
– tcp/ip at ~100 MBps using 10% of each processor
– round-trip latency is 20 μs
• Works in the lab today, using the Winsock2 API.
See http://www.viarch.org/
[Chart: bandwidth, latency, and overhead, now vs. soon.]
32
SAN:
Standard Interconnect
• LAN faster than the memory bus?
• 1 GBps links in the lab
• 100 $ port cost soon
• The port is a computer
• Gbps Ethernet: 110 MBps
• PCI: 70 MBps
• UW SCSI: 40 MBps
• FW SCSI: 20 MBps
• SCSI: 5 MBps
33
Data Gravity
Processing Moves to Transducers
• Move Processing to data sources
• Move to where the power (and sheet metal) is
• Processor in
– Modem
– Display
– Microphones (speech recognition)
& cameras (vision)
– Storage: Data storage and analysis
34
CyberBricks:
Functionally Specialized Cards
• Storage (ASIC)
• Network (ASIC)
• Display (ASIC)
Each has a P-mips processor and M MB of DRAM:
• today: P = 20 mips, M = 2 MB
• in a few years: P = 200 mips, M = 64 MB
35
With Tera-Byte Interconnect
and Super-Computer Adapters
• Processing is incidental to networking, storage, and UI.
• The disk controller/NIC is
– faster than the device
– close to the device
– can borrow the device’s package & power
• So use the idle capacity for computation.
• Run the app in the device.
[Figure: devices on a Tera-Byte backplane.]
36
All Device Controllers will be Cray 1’s
• TODAY
– the disk controller is a 10-mips risc engine with 2 MB DRAM
– the NIC is similar power
• SOON
– they become 100-mips systems with 100 MB DRAM
• They are nodes in a federation
(you could run Oracle on NT in the disk controller).
• Advantages:
– uniform programming model
– great tools
– security
– economics (CyberBricks)
– move computation to data (minimize traffic)
[Figure: central processor & memory and device controllers on a Tera-Byte backplane.]
37
It’s Already True of Printers
Peripheral = CyberBrick
• You buy a printer.
• You get:
– several network interfaces
– a PostScript engine: cpu, memory, software, a spooler (soon)
– and... a print engine.
38
Disk = Node
• has magnetic storage (100 GB?)
• has processor & DRAM
• has SAN attachment
• has an execution environment
Software stack: Applications; Services (DBMS, RPC, ...); File System; SAN driver; Disk driver; OS Kernel
49
Outline
• Why cost per transaction has dropped 100,000x in
10 years.
• How does that change things?
• What next (technology trends): CyberBricks
• Clusters of Hardware and Software CyberBricks
50
All God’s Children Have Clusters!
Buying Computing By the Slice
• People are buying computers by the dozens
– Computers only cost 1k$/slice!
• Clustering them together
51
A cluster is a cluster is a cluster
• It’s so natural, even mainframes cluster!
• Looking closer at usage patterns, a few models emerge.
• Looking closer at sites, you see hierarchies, bunches, and functional specialization.
52
“Commercial” NT Clusters
• 16-node Tandem cluster
– 64 cpus
– 2 TB of disk
– decision support
• 45-node Compaq cluster
– 140 cpus
– 14 GB DRAM
– 4 TB RAID disk
– OLTP (Debit Credit): 1 B tpd (14 k tps)
53
Tandem Oracle/NT
• 27,383 tpmC
• 71.50 $/tpmC
• 4 x 6 cpus
• 384 disks = 2.7 TB
54
Microsoft.com: ~150x4 nodes
[Site diagram, condensed: the microsoft.com site spans Building 11 (internal WWW, staging servers, log-processing and live SQL Servers, FTP download servers; all Building 11 servers are accessible from corpnet), a primary data center with FDDI rings (MIS1-MIS4) and primary/secondary gigaswitches feeding the Internet through routers over multiple DS3 (45 Mbps) and OC3 links, plus Japan and European data centers and the MOSWest admin site. Server groups include www.microsoft.com, home.microsoft.com, register.microsoft.com, premium.microsoft.com, activex.microsoft.com, search.microsoft.com, support.microsoft.com, msid.msn.com, cdm.microsoft.com, ftp.microsoft.com, plus SQL reporting and consolidator servers. A typical node is a 4-cpu P5/P6 box with 256 MB to 1 GB RAM and 12 to 180 GB of disk, costing $24 K to $128 K.]
55
The Microsoft TerraServer Hardware
• Compaq AlphaServer 8400
• 8 x 400 MHz Alpha cpus
• 10 GB DRAM
• 324 x 9.2 GB StorageWorks disks
– 3 TB raw, 2.4 TB of RAID5
• STK 9710 tape robot (4 TB)
• WindowsNT 4 EE, SQL Server 7.0
56
HotMail: ~400 Computers
57
Inktomi (HotBot), WebTV: > 200 nodes
• Inktomi: ~250 UltraSparcs
– crawl the web
– index the crawled web and save the index
– return search results on demand
– track ads and click-thrus
– ACID vs BASE (Basically Available, Soft-state, Eventually consistent)
• WebTV:
– ~200 UltraSparcs: render pages, provide email
– ~4 Network Appliance NFS file servers
– a large Oracle app tracking customers
58
Loki: Pentium Clusters for Science
http://loki-www.lanl.gov/
• 16 Pentium Pro processors
x 5 Fast Ethernet interfaces
+ 2 GBytes RAM
+ 50 GBytes disk
+ 2 Fast Ethernet switches
+ Linux ...
= 1.2 real Gflops for $63,000
(but that is the 1996 price)
• The Beowulf project is similar:
http://cesdis.gsfc.nasa.gov/pub/people/becker/beowulf.html
• Scientists want cheap mips.
59
Your Tax Dollars At Work
ASCI for Stockpile Stewardship
• Intel/Sandia: 9,000 x 1-node PPro
• LLNL/IBM: 512 x 8 PowerPC (SP2)
• LANL/Cray: ?
• Maui Supercomputer Center: 512 x 1 SP2
60
Berkeley NOW (Network Of Workstations) Project
http://now.cs.berkeley.edu/
• 105 nodes
– Sun UltraSparc 170, 128 MB, 2 x 2 GB disk
– Myrinet interconnect (2 x 160 MBps per node)
– SBus (30 MBps) limited
• GLUNIX layer above Solaris
• Inktomi (HotBot search)
• NAS parallel benchmarks
• Crypto cracker
• Sorts 9 GB per minute
61
Wisconsin COW
• 40 UltraSparcs
– 64 MB + 2 x 2 GB disk + Myrinet
• SunOS
• Used as a compute engine
62
Andrew Chien’s JBOB
http://www-csag.cs.uiuc.edu/individual/achien.html
• 48 nodes
• 36 HP Kayak boxes (2 x PII, 128 MB, 1 disk)
• 10 Compaq Workstation 6000s (2 x PII, 128 MB, 1 disk)
• 32 Myrinet-connected & 16 ServerNet-connected
• Operational, all running NT
63
NCSA Super Cluster
http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html
• National Center for Supercomputing Applications
University of Illinois @ Urbana
• 512 Pentium II cpus, 2,096 disks, SAN
• Compaq + HP +Myricom + WindowsNT
• A Super Computer for 3M$
• Classic Fortran/MPI programming
• DCOM programming model
64
1 Billion Transactions per Day
• 1.2 B tpd
• the 1 B tpd run lasted a full 24 hrs (see the check below)
• Out-of-the-box software
• Off-the-shelf hardware
• AMAZING!
• 20x smaller than the Microsoft Internet Data Center (amazing!)
• Sized for 30 days
• Linear growth
• 5 micro-dollars per transaction
65
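The tps figure follows from the day rate (my check, not on the slide):

```python
transactions_per_day = 1.2e9
print(transactions_per_day / 86_400)   # ~13,900 tps, the earlier "14 k tps"
```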
Scalability
• Scale UP: to large SMP nodes
• Scale OUT: to clusters of SMP nodes
[Figure callouts: 1 billion transactions, 100 million web hits, 4 terabytes of data, 1.8 million mail messages.]
66
The Bricks of Cyberspace
• 4 B PCs: 1 Bips, .1 GB DRAM, 10 GB disk, 1 Gbps net (B = G)
• Cost 1,000 $
• Come with:
– NT
– DBMS
– high-speed net
– system management
– GUI / OOUI
– tools
• Compatible with everyone else
• CyberBricks
67
Super Server: 4T Machine
• An array of 1,000 4B machines:
– 1 Bips processors
– 1 BB DRAM
– 10 BB disks
– 1 Bbps comm lines
– 1 TB tape robot
• A few megabucks
• Challenge:
– manageability
– programmability
– security
– availability
– scaleability
– affordability
• As easy as a single system
[Figure: a CyberBrick, a 4B machine: CPU, 5 GB RAM, 50 GB disc.]
Future servers are CLUSTERS of processors and discs.
Distributed database techniques make clusters work.
68
Cluster Vision
Buying Computers by the Slice
• Rack & Stack
– Mail-order components
– Plug them into the cluster
• Modular growth without limits
– Grow by adding small modules
• Fault tolerance:
– Spare modules mask failures
• Parallel execution & data search
– Use multiple processors and disks
• Clients and servers made from the same stuff
– Inexpensive: built with
commodity CyberBricks
69
Nostalgia: Behemoth in the Basement
• Today’s PC is yesterday’s supercomputer
• Can use LOTS of them
• Main apps changed:
– scientific → commercial → web
– web & transaction servers
– data mining, web farming
70
SMP -> nUMA: BIG FAT SERVERS
• Every vendor is building a HUGE SMP
– 256-way
– 3x slower remote memory
– 8-level memory hierarchy: L1, L2 cache; DRAM; remote DRAM (3x, 6x, 9x, ...); disk cache; disk; tape cache; tape
• Directory-based caching
• Needs
– 64-bit addressing, which lets you build large SMPs
– a nUMA-sensitive OS (not clear who will do it)
– or a hypervisor, like IBM LSF or Stanford Disco
(www-flash.stanford.edu/Hive/papers.html)
• You get an expensive cluster-in-a-box with a very fast network
71
Great Debate: Shared What?
• Shared Memory (SMP): easy to program, difficult to build, difficult to scale
(SGI, Sun, Sequent)
• Shared Disk
(VMScluster, Sysplex)
• Shared Nothing (network): hard to program, easy to build, easy to scale
(Tandem, Teradata, SP2, NT)
All three architectures serve CLIENTS.
NUMA blurs the distinction, but has its own problems.
72
Technology Drivers
Plug & Play Software
• RPC is standardizing: (DCOM, IIOP, HTTP)
– gives huge TOOL LEVERAGE
– solves the hard problems for you:
• naming,
• security,
• directory service,
• operations, ...
• Commoditized programming environments
– FreeBSD, Linux, Solaris, ... + tools
– NetWare + tools
– WinCE, WinNT, ... + tools
– JavaOS + tools
• Apps gravitate to data.
• A general-purpose OS on the controller runs the apps.
73
Restatement
• The huge clusters we saw are prototypes for CyberBrick systems:
– a federation of functionally specialized nodes
– each node shrinks to a “point” device with embedded processing
– each node / device is autonomous
– each talks a high-level protocol
74
Outline
• Clusters of Hardware CyberBricks
– all nodes are very intelligent
– processing migrates to where the power is
• disk, network, and display controllers have full-blown OSes
• send RPCs (SQL, Java, HTTP, DCOM, CORBA) to them
• the computer is a federated distributed system
• Software CyberBricks
– a standard way to interconnect intelligent nodes
– needs an execution model
– needs parallelism
75
Software CyberBricks: Objects!
• It’s a zoo
• Objects and 3-tier computing (transactions)
– give natural distribution & parallelism
– give remote management!
– TP & Web: dispatch RPCs to a pool of object servers
– components are a 1 B$ business today!
• Need a parallel & distributed computing model
76
The COMponent Promise
• Objects are Software CyberBricks
– productivity breakthrough (plug-ins)
– manageability breakthrough (modules)
• Microsoft: DCOM + ActiveX
• IBM/Sun/Oracle/Netscape: CORBA + Java Beans
• Both promise
– parallel distributed execution
– centralized management of distributed systems
• Both camps share key goals:
– encapsulation: hide implementation
– polymorphism: generic ops, key to GUI and reuse
– uniform naming
– discovery: finding a service
– fault handling: transactions
– versioning: allow upgrades
– transparency: local/remote
– security: who has authority
– shrink-wrap: minimal inheritance
– automation: easy
77
The OO Points So Far
• Objects are software Cyber Bricks
• Object interconnect standards are emerging
• Cyber Bricks become Federated Systems.
• Put processing close to data
• Next point:
– do parallel processing.
89
Kinds of Parallel Execution
• Pipeline: any sequential program feeds its output to the next sequential program.
• Partition: inputs merge M ways and outputs split N ways; many copies of the same sequential program each work on one partition (sketch below).
90
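A minimal sketch (mine, not from the talk) of the two shapes as Python generators: a pipeline chains sequential stages, while a partition runs clones of one stage over disjoint pieces of the input.

```python
# Pipeline: each stage is a sequential program consuming the previous one.
def parse(lines):
    for line in lines:
        yield line.split(",")

def total(rows):
    return sum(int(r[1]) for r in rows)

lines = ["a,1", "b,2", "c,3"]
print(total(parse(lines)))          # pipeline: parse -> total

# Partition: split the input N ways, run the same program on each piece.
partitions = [lines[:1], lines[1:]]
print(sum(total(parse(p)) for p in partitions))   # same answer, N-way
```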
Object Oriented Programming
Parallelism From Many Little Jobs
• Gives location transparency
• The ORB / web server / TP monitor multiplexes clients to servers (sketch below)
• Enables distribution
• Exploits embarrassingly parallel apps (transactions)
• HTTP and RPC (DCOM, CORBA, RMI, IIOP, ...) are the basis
91
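A sketch of the multiplexing idea (hypothetical names, not a real ORB): a pool of workers plays the TP-monitor/ORB role, and the parallelism comes from having many independent little jobs rather than from parallelizing any one of them.

```python
from concurrent.futures import ThreadPoolExecutor

def handle(request):
    """One embarrassingly parallel 'transaction'."""
    return request * 2

requests = range(1000)
with ThreadPoolExecutor(max_workers=8) as pool:   # the TP monitor / ORB role
    results = list(pool.map(handle, requests))
print(sum(results))                               # 999000
```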
Why Parallel Access To Data?
• At 10 MB/s it takes 1.2 days to scan 1 terabyte.
• With 1,000-way parallelism: a 100-second scan.
• Parallelism: divide a big problem into many smaller ones to be solved in parallel.
92
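The scan arithmetic, restated with the slide's numbers:

```python
terabyte = 1e12            # bytes
disk_rate = 10e6           # 10 MB/s from one disk
seconds = terabyte / disk_rate
print(seconds / 86_400)    # ~1.2 days for a serial scan
print(seconds / 1_000)     # 100 seconds with 1,000-way parallelism
```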
Partitioned Execution
Spreads computation and IO among processors.
• A table is partitioned into key ranges A...E, F...J, K...N, O...S, T...Z, with a Count operator on each partition (sketch below).
• Partitioned data gives NATURAL parallelism.
98
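A minimal sketch (mine) of a partitioned count: range-partition the table on its key, run the same Count operator on every partition in parallel, and merge by summing. A real system would instead ship each Count to the node that owns the partition; the table and ranges here are toy stand-ins.

```python
from multiprocessing import Pool

def count(partition):
    """The per-partition Count operator."""
    return len(partition)

table = ["alice", "frank", "karen", "oscar", "tom"]
ranges = [("A", "E"), ("F", "J"), ("K", "N"), ("O", "S"), ("T", "Z")]
parts = [[row for row in table if lo <= row[0].upper() <= hi]
         for lo, hi in ranges]

if __name__ == "__main__":
    with Pool(len(parts)) as pool:             # one worker per partition
        print(sum(pool.map(count, parts)))     # merge step: 1+1+1+1+1 = 5
```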
N x M way Parallelism
• Each of N partitions (A...E, F...J, K...N, O...S, T...Z) runs a Sort and a Join; the sorted streams flow into M Merge operators.
• N inputs, M outputs, no bottlenecks.
• Partitioned data; partitioned and pipelined data flows.
99
Summary
• Clusters of Hardware CyberBricks
– all nodes are very intelligent
– processing migrates to where the power is
• disk, network, and display controllers have full-blown OSes
• send RPCs (SQL, Java, HTTP, DCOM, CORBA) to them
• the computer is a federated distributed system
• Software CyberBricks
– a standard way to interconnect intelligent nodes
– needs an execution model
– needs parallelism
100
Summary
• Why cost per transaction has dropped
100,000x in 10 years.
• How does that change things?
• What next (technology trends)
• Hardware and Software CyberBricks
101
What I’m Doing
• TerraServer: Photo of the planet on the web
– a database (not a file system)
– 1TB now, 15 PB in 10 years
– http://www.TerraServer.microsoft.com/
• Sloan Digital Sky Survey: picture of the universe
– just getting started, cyberbricks for astronomers
– http://www.sdss.org/
• Sorting:
– one node pennysort (http://research.microsoft.com/barc/SortBenchmark/)
– multinode: NT Cluster sort (shows off SAN and DCOM)
102
What I’m Doing
• NT Clusters:
– failover: Fault tolerance within a cluster
– NT Cluster Sort: a balanced IO, cpu, and network benchmark
– AlwaysUp: Geographical fault tolerance.
• RAGS: random testing of SQL systems
– a bug finder
• Telepresence
– Working with Gordon Bell on “the killer app”
– FileCast and PowerCast
– Cyberversity (international, on demand, free university)
103