Scaleable Systems Research
at Microsoft
(really: what we do at BARC)
• Jim Gray
Microsoft Research
[email protected]
http://research.Microsoft.com/~Gray
Presented to DARPA WindowsNT
workshop 5 Aug 1998, Seattle WA.
1
Outline
• PowerCast, FileCast & Reliable Multicast
• RAGS: SQL Testing
• TerraServer (a big DB)
• Sloan Sky Survey (CyberBricks)
• Billion Transactions per day
• WolfPack Failover
• NTFS IO measurements
• NT-Cluster-Sort
• AlwaysUp
2
Telepresence
• The next killer app
• Space shifting:
» Reduce travel
• Time shifting:
» Retrospective
» Offer condensations
» Just in time meetings.
• Example: ACM 97
» NetShow and Web site.
» More web visitors than attendees
• People-to-People communication
3
Telepresence Prototypes
• PowerCast: multicast PowerPoint
» Streaming - pre-sends next anticipated slide
» Send slides and voice rather than talking head and voice
» Uses ECSRM for reliable multicast
» 1000’s of receivers can join and leave any time.
» No server needed; no pre-load of slides.
» Cooperating with NetShow
• FileCast: multicast file transfer.
» Erasure encodes all packets
» Receivers only need to receive as many bytes as the length of the file (see the sketch after this slide)
» Multicast IE to solve Midnight-Madness problem
• NT SRM: reliable IP multicast library for NT
• Spatialized Teleconference Station
» Texture map faces onto spheres
» Space map voices
4
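The erasure-coding idea behind FileCast can be illustrated with the simplest possible code: one XOR parity packet per group, which lets a receiver rebuild any single lost packet without a retransmission. This is only a sketch of the principle under that simplifying assumption; FileCast's actual encoding is stronger (any N received packets reconstruct an N-packet file), and the function names below are invented for illustration.

// Minimal illustration of erasure coding: k data packets plus one XOR
// parity packet.  Any single lost packet can be rebuilt from the rest.
// (FileCast's real code is stronger; this only shows the principle.)
#include <cstdint>
#include <vector>

using Packet = std::vector<uint8_t>;

// Build the parity packet: byte-wise XOR of all data packets (equal length).
Packet MakeParity(const std::vector<Packet>& data) {
    Packet parity(data[0].size(), 0);
    for (const Packet& p : data)
        for (size_t i = 0; i < p.size(); ++i)
            parity[i] ^= p[i];
    return parity;
}

// Rebuild one missing data packet from the parity packet and the survivors.
Packet RecoverMissing(const std::vector<Packet>& survivors, const Packet& parity) {
    Packet missing = parity;
    for (const Packet& p : survivors)
        for (size_t i = 0; i < p.size(); ++i)
            missing[i] ^= p[i];
    return missing;
}

FileCast must tolerate many losses and late joiners, which is why it uses a real erasure code rather than a single parity packet; the point here is only that redundancy in the encoding replaces retransmission.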
RAGS:
RAndom SQL test Generator
• Microsoft spends a LOT of money on testing.
(60% of development according to one source).
• Idea: test SQL by
» generating random correct queries (a toy generator is sketched after this slide)
» executing the queries against the database
» comparing the results with SQL 6.5, DB2, Oracle, and Sybase
• Being used in SQL 7.0 testing.
» 375 unique bugs found (since 2/97)
» Very productive test tool
5
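RAGS itself is internal to Microsoft, so the sketch below is only a guess at the general shape of such a tool: walk a tiny SQL grammar, make random but schema-consistent choices, and emit the statement text. The schema is the pubs sample database used in the slide; the real grammar is far richer.

// Toy grammar-driven generator in the spirit of RAGS: every statement it
// prints is legal against the pubs schema, but tables, columns, expressions,
// and nesting depth are chosen at random.
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

struct Table { std::string name; std::vector<std::string> columns; };

static const std::vector<Table> kSchema = {
    {"titles",   {"price", "advance", "royalty"}},
    {"roysched", {"royalty", "lorange", "hirange"}},
    {"sales",    {"qty"}},
};

// Random numeric expression over one table's columns, nested up to `depth`.
std::string Expr(const Table& t, int depth) {
    if (depth == 0 || rand() % 2 == 0) return t.columns[rand() % t.columns.size()];
    switch (rand() % 3) {
        case 0:  return "(" + Expr(t, depth - 1) + " + " + Expr(t, depth - 1) + ")";
        case 1:  return "ABS(" + Expr(t, depth - 1) + ")";
        default: return "(-(" + Expr(t, depth - 1) + "))";
    }
}

// Random SELECT, sometimes with a nested EXISTS subquery on another table.
std::string Select(int depth) {
    const Table& t = kSchema[rand() % kSchema.size()];
    std::string s = "SELECT " + Expr(t, depth) + " FROM " + t.name;
    if (depth > 0 && rand() % 2 == 0)
        s += " WHERE EXISTS (" + Select(depth - 1) + ")";
    return s;
}

int main() {
    srand(1998);                              // fixed seed so a failing statement can be replayed
    for (int i = 0; i < 5; ++i)               // run the same statements on several DBMSs
        std::cout << Select(3) << std::endl;  //   and compare the results / error codes
}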
Sample RAGS-Generated Statement
This statement yields an error:
SQLState=37000, Error=8623: Internal Query Processor Error:
Query processor could not produce a query plan.
SELECT TOP 3 T1.royalty , T0.price , "Apr 15 1996 10:23AM" , T0.notes
FROM titles T0, roysched T1
WHERE EXISTS (
SELECT DISTINCT TOP 9 $3.11 , "Apr 15 1996 10:23AM" , T0.advance , (
"<v3``VF;" +(( UPPER(((T2.ord_num +"22\}0G3" )+T2.ord_num ))+("{1FL6t15m" +
RTRIM( UPPER((T1.title_id +((("MlV=Cf1kA" +"GS?" )+T2.payterms )+T2.payterms
))))))+(T2.ord_num +RTRIM((LTRIM((T2.title_id +T2.stor_id ))+"2" ))))),
T0.advance , (((-(T2.qty ))/(1.0 ))+(((-(-(-1 )))+( DEGREES(T2.qty )))-(-((
-4 )-(-(T2.qty ))))))+(-(-1 ))
FROM sales T2
WHERE EXISTS (
SELECT "fQDs" , T2.ord_date , AVG ((-(7 ))/(1 )), MAX (DISTINCT
-1 ), LTRIM("0I=L601]H" ), ("jQ\" +((( MAX(T3.phone )+ MAX((RTRIM( UPPER(
T5.stor_name ))+((("<" +"9n0yN" )+ UPPER("c" ))+T3.zip ))))+T2.payterms )+
MAX("\?" )))
FROM authors T3, roysched T4, stores T5
WHERE EXISTS (
SELECT DISTINCT TOP 5 LTRIM(T6.state )
FROM stores T6
WHERE ( (-(-(5 )))>= T4.royalty ) AND (( (
( LOWER( UPPER((("9W8W>kOa" +
T6.stor_address )+"{P~" ))))!= ANY (
SELECT TOP 2 LOWER(( UPPER("B9{WIX" )+"J" ))
FROM roysched T7
WHERE ( EXISTS (
SELECT (T8.city +(T9.pub_id +((">" +T10.country )+
UPPER( LOWER(T10.city))))), T7.lorange ,
((T7.lorange )*((T7.lorange )%(-2 )))/((-5 )-(-2.0 ))
FROM publishers T8, pub_info T9, publishers T10
WHERE ( (-10 )<= POWER((T7.royalty )/(T7.lorange ),1))
AND (-1.0 BETWEEN (-9.0 ) AND (POWER(-9.0 ,0.0)) )
)
--EOQ
) AND (NOT (EXISTS (
SELECT MIN (T9.i3 )
FROM roysched T8, d2 T9, stores T10
WHERE ( (T10.city + LOWER(T10.stor_id )) BETWEEN (("QNu@WI" +T10.stor_id
)) AND ("DT" ) ) AND ("R|J|" BETWEEN ( LOWER(T10.zip )) AND (LTRIM(
UPPER(LTRIM( LOWER(("_\tk`d" +T8.title_id )))))) )
GROUP BY T9.i3, T8.royalty, T9.i3
HAVING -1.0 BETWEEN (SUM (-( SIGN(-(T8.royalty ))))) AND (COUNT(*))
)
--EOQ
) )
)
--EOQ
) AND (((("i|Uv=" +T6.stor_id )+T6.state )+T6.city ) BETWEEN ((((T6.zip +(
UPPER(("ec4L}rP^<" +((LTRIM(T6.stor_name )+"fax<" )+("5adWhS" +T6.zip ))))
+T6.city ))+"" )+"?>_0:Wi" )) AND (T6.zip ) ) ) AND (T4.lorange BETWEEN (
3 ) AND (-(8 )) ) )
)
--EOQ
GROUP BY ( LOWER(((T3.address +T5.stor_address )+REVERSE((T5.stor_id +LTRIM(
T5.stor_address )))))+ LOWER((((";z^~tO5I" +"" )+("X3FN=" +(REVERSE((RTRIM(
LTRIM((("kwU" +"wyn_S@y" )+(REVERSE(( UPPER(LTRIM("u2C[" ))+T4.title_id ))+(
RTRIM(("s" +"1X" ))+ UPPER((REVERSE(T3.address )+T5.stor_name )))))))+
"6CRtdD" ))+"j?]=k" )))+T3.phone ))), T5.city, T5.stor_address
)
--EOQ
ORDER BY 1, 6, 5
)
6
Automation
• Simpler statement with the same error (one way to automate such reduction is sketched after this slide):
SELECT roysched.royalty
FROM titles, roysched
WHERE EXISTS (
SELECT DISTINCT TOP 1 titles.advance
FROM sales
ORDER BY 1)
• Control statement attributes
» complexity, kind, depth, ...
• Multi-user stress tests
» tests concurrency, allocation, recovery
7
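The deck does not say how RAGS shrinks a failing statement into the simpler one above; a common way to do it, sketched here purely as an assumption (a delta-debugging-style greedy reduction, not the actual RAGS algorithm), is to repeatedly drop fragments of the statement and keep any deletion after which the same error still reproduces.

// Greedy test-case reduction sketch.  `reproduces` re-executes a candidate
// statement and reports whether it still fails with the same error
// (e.g. SQLState 37000, Error 8623).
#include <functional>
#include <string>
#include <vector>

std::string Reduce(std::vector<std::string> fragments,
                   const std::function<bool(const std::string&)>& reproduces) {
    auto join = [](const std::vector<std::string>& v) {
        std::string s;
        for (const std::string& f : v) s += f + " ";
        return s;
    };
    bool shrunk = true;
    while (shrunk) {
        shrunk = false;
        for (size_t i = 0; i < fragments.size(); ++i) {
            std::vector<std::string> trial = fragments;
            trial.erase(trial.begin() + i);        // tentatively drop fragment i
            if (reproduces(join(trial))) {         // same failure without it?
                fragments = trial;                 // keep the smaller statement
                shrunk = true;
                break;
            }
        }
    }
    return join(fragments);
}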
One 4-Vendor RAGS Test (3 of them vs. us)
• 60 K SELECTs on MSS, DB2, Oracle, Sybase
• 17 SQL Server Beta 2 suspects
• 1 suspect per 3,350 statements
• Examined 10 suspects, filed 4 bugs (one duplicate), so assume 3/10 are new
• Note: this is the SQL Server Beta 2 product; quality is rising fast (and RAGS sees that)
8
Outline
• FileCast & Reliable Multicast
• RAGS: SQL Testing
• TerraServer (a big DB)
• Sloan Sky Survey (CyberBricks)
• Billion Transactions per day
• Wolfpack Failover
• NTFS IO measurements
• NT-Cluster-Sort
9
Billions Of Clients
• Every device will be “intelligent”
• Doors, rooms, cars…
• Computing will be ubiquitous
Billions Of Clients Need Millions Of Servers
• All clients networked to servers
» May be nomadic or on-demand
• Fast clients want faster servers
• Servers provide
» Shared Data
» Control
» Coordination
» Communication
[diagram: mobile and fixed clients connected to servers and a super-server]
Thesis: Many little beat few big
[diagram: a $1 million mainframe, $100 K mini, $10 K micro, and a pico processor ("smoking, hairy golf ball"); 1 MB nano with 10 ps RAM, 100 MB with 10 ns RAM, 10 GB with 10 µs RAM, 1 TB with 10 ms disc, 100 TB with 10 s tape archive; disk form factors 14", 9", 5.25", 3.5", 2.5", 1.8"; the future part: 1 M SPECmarks, 1 TFLOP, 10^6 clocks to bulk RAM, event horizon on chip, VM reincarnated, multiprogram cache, on-chip SMP]
• How to connect the many little parts?
• How to program the many little parts?
• Fault tolerance?
Performance = Storage Accesses, not Instructions Executed
• In the "old days" we counted instructions and IOs
• Now we count memory references
• Processors wait most of the time
Where the time goes: clock ticks for AlphaSort components
[chart: disc wait, sort, OS, memory wait, B-cache data miss, I-cache miss, D-cache miss; 70 MIPS; "real" apps have worse I-cache misses so run at 60 MIPS if well tuned, 20 MIPS if not]
Scale Up and Scale Out
• Grow up with SMP: 4xP6 is now standard
• Grow out with cluster: cluster has inexpensive parts
[diagram: personal system -> departmental server -> SMP super server; cluster of PCs]
Microsoft TerraServer: Scaleup to Big Databases
• Build a 1 TB SQL Server database
• Data must be
» 1 TB
» Unencumbered
» Interesting to everyone everywhere
» And not offensive to anyone anywhere
• Loaded
» 1.5 M place names from Encarta World Atlas
» 3 M sq km from USGS (1-meter resolution)
» 1 M sq km from the Russian Space Agency (2 m)
• On the web (world's largest atlas)
• Sell images with commerce server
15
Microsoft TerraServer Background
• Earth is 500 tera-meters square
» USA is 10 tm2
» 100 tm2 of land between 70ºN and 70ºS
• We have pictures of 6% of it
» 3 tsm from USGS
» 2 tsm from Russian Space Agency
• Someday
» multi-spectral image
» of everywhere
» once a day / hour
• Compress 5:1 (JPEG) to 1.5 TB
• Slice into 10 KB chunks
• Store chunks in DB
• Navigate with
» Encarta™ Atlas (globe, gazetteer)
» StreetsPlus™ in the USA
[image pyramid: 1.8x1.2 km2 tile, 10x15 km2 thumbnail, 20x30 km2 browse image, 40x60 km2 jump image]
16
USGS Digital Ortho Quads (DOQ)
• US Geological Survey
• 4 terabytes
• Most data not yet published
• Based on a CRADA
» Microsoft TerraServer makes the data available
[map: USGS "DOQ" coverage, 1x1 meter, 4 TB, continental US, new data coming]
17
Russian Space Agency (SovInformSputnik)
SPIN-2 (Aerial Images is the worldwide distributor)
• 1.5-meter geo-rectified imagery of (almost) anywhere
• Almost equal-area projection
• De-classified satellite photos (from 200 km)
• More data coming (1 m)
• Selling imagery on the Internet
• Putting 2 tm2 onto Microsoft TerraServer
18
Demo
• Navigate by coverage map to the White House
• Download image
• Buy imagery from USGS
• Navigate by name to Venice
• Buy SPIN-2 image & Kodak photo
• Pop out to Expedia street map of Venice
• Mention that the DB will double in the next 18 months (2x USGS, 2x SPIN-2)
19
Hardware
[diagram: the Internet over a DS3 link to a 100 Mbps Ethernet switch; map site servers, SPIN-2 server, and web servers; AlphaServer 8400 (8 x 440 MHz Alpha CPUs, 10 GB DRAM) attached to a StorageWorks Enterprise Storage Array of 48-drive shelves of 9 GB disks and an STK 9710 DLT tape library]
1 TB database server:
• AlphaServer 8400 4x400, 10 GB RAM
• 324 StorageWorks disks
• 10-drive tape library (STK TimberWolf DLT7000)
20
The Microsoft TerraServer Hardware
• Compaq AlphaServer 8400
• 8x400Mhz Alpha cpus
• 10 GB DRAM
• 324 9.2 GB StorageWorks Disks
» 3 TB raw, 2.4 TB of RAID5
• STK 9710 tape robot (4 TB)
• WindowsNT 4 EE, SQL Server 7.0
21
Software
[architecture diagram: a web client (HTML browser or Java viewer) talks over the Internet to the TerraServer web site (Internet Information Server 4.0, Active Server Pages), which calls MTS and the TerraServer stored procedures in SQL Server 7 (the TerraServer DB); the Microsoft Automap ActiveX server / Automap server supplies maps; an image server running Internet Information Server 4.0, Microsoft Site Server EE, the image delivery application, and SQL Server 7 connects to the image provider site(s)]
22
System Management & Maintenance
• Backup and Recovery
» STK 9710 tape robot
» Legato NetWorker™
» SQL Server 7 Backup & Restore
» Clocked at 80 MBps (peak) (~200 GB/hr)
• SQL Server Enterprise Mgr
» DBA Maintenance
» SQL Performance Monitor
23
Microsoft TerraServer File Group Layout
• Convert 324 disks to 28 RAID5 sets plus 28 spare drives
• Make 4 WinNT volumes (RAID 50), 595 GB per volume
• Build 30 20-GB files on each volume
• DB is a file group of 120 files
[diagram: volumes E:, F:, G:, H:]
24
Image Delivery and Load
• Incremental load of 4 more TB in the next 18 months
[diagram: DLT tapes are read ("tar") by an AlphaServer 4100 running ImgCutter into \Drop'N'\Images; a second AlphaServer 4100 runs the LoadMgr and its DB (DoJob, Wait 4 Load, Backup); both connect over a 100 Mbit EtherSwitch (60 x 4.3 GB drives) to the AlphaServer 8400 with its Enterprise Storage Array (shelves of 108 x 9.1 GB drives) and STK DLT tape library]
Load pipeline steps:
10: ImgCutter
20: Partition
30: ThumbImg
40: BrowseImg
45: JumpImg
50: TileImg
55: Meta Data
60: Tile Meta
70: Img Meta
80: Update Place
25
Technical Challenge: Key Idea
• Problem: geo-spatial search without geo-spatial access methods (just standard SQL Server)
• Solution:
» Geo-spatial search key (see the sketch after this slide):
Divide the earth into rectangles of 1/48th degree longitude (X) by 1/96th degree latitude (Y)
Z-transform X & Y into a single Z value; build a B-tree on Z
Adjacent images are stored next to each other
» Search method:
Latitude and Longitude => X, Y, then Z
Select on matching Z value
26
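A minimal sketch of the Z-key idea above: grid the earth into 1/48-degree longitude by 1/96-degree latitude cells, then interleave the bits of the X and Y cell numbers into one Morton (Z-order) value that an ordinary B-tree can index. The exact cell numbering and offsets here are assumptions, not TerraServer's code.

// TerraServer-style Z (Morton) key sketch: grid cell in X and Y, then
// bit-interleave so that nearby cells get nearby key values.
#include <cstdint>
#include <cmath>

// Longitude cell: 1/48th of a degree; latitude cell: 1/96th of a degree.
uint32_t CellX(double lonDeg) { return (uint32_t)std::floor((lonDeg + 180.0) * 48.0); }
uint32_t CellY(double latDeg) { return (uint32_t)std::floor((latDeg +  90.0) * 96.0); }

// Interleave the bits of x and y (x in even positions, y in odd positions).
uint64_t ZKey(uint32_t x, uint32_t y) {
    uint64_t z = 0;
    for (int i = 0; i < 32; ++i) {
        z |= (uint64_t)((x >> i) & 1) << (2 * i);
        z |= (uint64_t)((y >> i) & 1) << (2 * i + 1);
    }
    return z;
}

// Lookup: latitude and longitude => X, Y => Z, then
// SELECT ... FROM Tiles WHERE ZValue = @z  (an ordinary B-tree seek).
uint64_t KeyForPlace(double latDeg, double lonDeg) {
    return ZKey(CellX(lonDeg), CellY(latDeg));
}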
Sloan Digital Sky Survey
• Digital Sky
» 30 TB raw
» 3TB cooked (1 billion 3KB objects)
» Want to scan it frequently
• Using cyberbricks
• Current status:
» 175 MBps per node
» 24 nodes => 4 GBps
» 5 minutes to scan whole archive
27
Some Tera-Byte Databases
[scale from kilo to yotta]
• The Web: 1 TB of HTML
• TerraServer: 1 TB of images
• Several other 1 TB (file) servers
• Hotmail: 7 TB of email
• Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked
• EOS/DIS (picture of the planet each week): 15 PB by 2007
• Federal clearing house (images of checks): 15 PB by 2006 (7-year history)
• Nuclear Stockpile Stewardship Program: 10 exabytes (???!!)
28
Info Capture
• You can record everything you see or hear or read
• What would you do with it?
• How would you organize & analyze it?
[chart, kilo-to-yotta scale: a letter (kilo), a novel (mega), a movie (giga), Library of Congress text (tera), LoC image (peta), all disks, all tapes; per-lifetime capture: 8 GB of words read or written, 30 TB of audio (10 KBps), 8 PB of video (10 GBph)]
See: http://www.lesk.com/mlesk/ksg97/ksg.html
29
Michael Lesk’s Points
www.lesk.com/mlesk/ksg97/ksg.html
• Soon everything can be recorded and kept
• Most data will never be seen by humans
• Precious Resource: Human attention
» Auto-Summarization and Auto-Search will be a key enabling technology
30
[chart, kilo-to-yotta scale: a letter, a novel, a movie, Library of Congress (text), LoC (image), LoC (sound + cinema), all photos, all disks, all tapes, all information!]
31
Outline
• FileCast & Reliable Multicast
• RAGS: SQL Testing
• TerraServer (a big DB)
• Sloan Sky Survey (CyberBricks)
• Billion Transactions per day
• Wolfpack Failover
• NTFS IO measurements
• NT-Cluster-Sort
32
Scalability
• Scale up: to large SMP nodes
• Scale out: to clusters of SMP nodes
[examples: 1 billion transactions, 100 million web hits, 4 terabytes of data, 1.8 million mail messages]
33
Billion Transactions per Day Project
• Built a 45-node Windows NT Cluster (with help from Intel & Compaq)
• > 900 disks
• All off-the-shelf parts
• Using SQL Server & DTC distributed transactions
• DebitCredit Transaction (sketched below)
• Each node has 1/20th of the DB
• Each node does 1/20th of the work
• 15% of the transactions are "distributed"
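DebitCredit is the classic TPC-A-style workload, so its shape can be sketched; the table names, the Exec() helper, and the node-picking logic below are placeholders for illustration, not the project's actual code.

// One DebitCredit transaction: update account, teller, and branch balances
// and append a history row.  About 15% of the time the account lives on
// another node, so DTC coordinates a two-phase commit across the two nodes.
#include <cstdlib>
#include <iostream>
#include <string>

// Placeholder: ship `sql` to the SQL Server node that owns the data.
void Exec(int node, const std::string& sql) {
    std::cout << "node " << node << ": " << sql << std::endl;
}

void DebitCredit(int homeNode, int branch, int teller, int account, int delta) {
    int accountNode = (rand() % 100 < 15) ? rand() % 20 : homeNode;   // ~15% remote

    Exec(homeNode,    "BEGIN DISTRIBUTED TRANSACTION");
    Exec(accountNode, "UPDATE accounts SET balance = balance + " + std::to_string(delta)
                      + " WHERE account_id = " + std::to_string(account));
    Exec(homeNode,    "UPDATE tellers SET balance = balance + " + std::to_string(delta)
                      + " WHERE teller_id = " + std::to_string(teller));
    Exec(homeNode,    "UPDATE branches SET balance = balance + " + std::to_string(delta)
                      + " WHERE branch_id = " + std::to_string(branch));
    Exec(homeNode,    "INSERT INTO history (account_id, teller_id, branch_id, delta) VALUES ("
                      + std::to_string(account) + ", " + std::to_string(teller) + ", "
                      + std::to_string(branch) + ", " + std::to_string(delta) + ")");
    Exec(homeNode,    "COMMIT TRANSACTION");   // DTC two-phase commits if two nodes took part
}

int main() {
    srand(42);
    DebitCredit(/*homeNode=*/3, /*branch=*/3, /*teller=*/42, /*account=*/12345, /*delta=*/100);
}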
Billion Transactions Per Day Hardware
• 45 nodes (Compaq Proliant)
• Clustered with 100 Mbps Switched Ethernet
• 140 CPUs, 13 GB DRAM, 3 TB of RAID space

Type                                 Nodes                       CPUs   DRAM       Ctlrs  Disks                    RAID space
Workflow (MTS)                       20 x Compaq Proliant 2500   20x2   20x128 MB  20x1   20x1                     20x2 GB
SQL Server                           20 x Compaq Proliant 5000   20x4   20x512 MB  20x4   20x(36x4.2GB + 7x9.1GB)  20x130 GB
Distributed Transaction Coordinator  5 x Compaq Proliant 5000    5x4    5x256 MB   5x1    5x3                      5x8 GB
TOTAL                                45                          140    13 GB      105    895                      3 TB
35
1.2 B tpd
• 1 B tpd ran for 24 hrs
• Out-of-the-box software
• Off-the-shelf hardware
• AMAZING!
• Sized for 30 days
• Linear growth
• 5 micro-dollars per transaction
36
How Much Is 1 Billion Tpd?
• 1 billion tpd = 11,574 tps ~ 700,000 tpm (transactions/minute)
• AT&T: 185 million calls per peak day (worldwide)
• Visa: ~20 million tpd
» 400 million customers
» 250 K ATMs worldwide
» 7 billion transactions (card + cheque) in 1994
• New York Stock Exchange: 600,000 tpd
• Bank of America
» 20 million tpd checks cleared (more than any other bank)
» 1.4 million tpd ATM transactions
• Worldwide airline reservations: 250 Mtpd
[bar chart, millions of transactions per day on a log scale from 0.1 to 1,000: 1 Btpd, Visa, ATT, BofA, NYSE]
37
NCSA Super Cluster
http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html
• National Center for Supercomputing Applications, University of Illinois @ Urbana
• 512 Pentium II CPUs, 2,096 disks, SAN
• Compaq + HP + Myricom + WindowsNT
• A super computer for 3 M$
• Classic Fortran/MPI programming
• DCOM programming model
38
Outline
• FileCast & Reliable Multicast
• RAGS: SQL Testing
• TerraServer (a big DB)
• Sloan Sky Survey (CyberBricks)
• Billion Transactions per day
• Wolfpack Failover
• NTFS IO measurements
• NT-Cluster-Sort
39
NT Clusters (Wolfpack)
• Scale DOWN to PDA: WindowsCE
• Scale UP an SMP: TerraServer
• Scale OUT with a cluster of machines
• Single-system image
» Naming
» Protection/security
» Management/load balance
• Fault tolerance
»“Wolfpack”
• Hot pluggable hardware & software
40
Symmetric Virtual Server Failover Example
[diagram: a browser connects to Server 1 and Server 2; each server hosts a web site and a database, with the web site files and database files shared so either node can take over the other's virtual server]
41
Clusters & BackOffice
• Research: Instant & Transparent failover
• Making BackOffice PlugNPlay on Wolfpack
» Automatic install & configure
• Virtual Server concept makes it easy
» simpler management concept
» simpler context/state migration
» transparent to applications
• SQL 6.5E & 7.0 Failover
• MSMQ (queues), MTS (transactions).
42
Next Steps in Availability
• Study the causes of outages
• Build AlwaysUp system:
» Two geographically remote sites
» Users have instant and transparent failover to the 2nd site
» Working with the WindowsNT and SQL Server groups on this
43
Outline
• FileCast & Reliable Multicast
• RAGS: SQL Testing
• TerraServer (a big DB)
• Sloan Sky Survey (CyberBricks)
• Billion Transactions per day
• Wolfpack Failover
• NTFS IO measurements
• NT-Cluster-Sort
44
Storage Latency: How Far Away is the Data?
(clock ticks to reach the data, with a human-scale analogy)
» Registers: 1 (my head, 1 min)
» On-chip cache: 2 (this room)
» On-board cache: 10 (this campus, 10 min)
» Memory: 100 (Sacramento, 1.5 hr)
» Disk: 10^6 (Pluto, 2 years)
» Tape / optical robot: 10^9 (Andromeda, 2,000 years)
45
The Memory Hierarchy
• Measuring & modeling sequential IO
• Where is the bottleneck?
• How does it scale with
» SMP, RAID, new interconnects
• Goals:
» balanced bottlenecks
» low overhead
» scale to many processors (10s)
» scale to many disks (100s)
[diagram: app address space and file cache in memory over the memory bus; PCI to the SCSI adapter and controller]
46
PAP (Peak Advertised Performance) vs RAP (Real Application Performance)
• Goal: RAP = PAP / 2 (the half-power point)
[diagram of the IO path: system bus 422 MBps, PCI 133 MBps, SCSI 40 MBps, disk 10-15 MBps; every stage, from disk through the file system buffers to application data, actually carries about 7.2 MB/s]
47
The Best Case: Temp File, NO IO
• Temp file read/write hits the file system cache
• Program uses a small (in-CPU-cache) buffer
• So write/read time is the bus move time (3x better than copy)
• Paradox: the fastest way to move data is to write it and then read it
[bar chart, temp file read/write, MBps: temp read 148, temp write 136, memcopy() 54; this hardware is limited to 150 MBps per processor]
48
Bottleneck Analysis
• Drawn to linear scale:
» Disk R/W: ~9 MBps
» MemCopy: ~50 MBps
» Memory read/write: ~150 MBps
» Theoretical bus bandwidth: 422 MBps = 66 MHz x 64 bits
49
3 Stripes and You're Out!
• 3 disks can saturate the adapter
• Similar story with UltraWide
• CPU time goes down with request size
• Ftdisk (striping is cheap)
[charts, "Read Throughput vs Stripes, 3 deep, Fast" and "Write Throughput vs Stripes, 3 deep, Fast": throughput (MB/s) vs request size (2 KB to 192 KB) for 1, 2, 3, and 4 disks, topping out near 20 MB/s; CPU cost (ms per MB) falls as request size grows]
50
Parallel SCSI Busses Help
• A second SCSI bus nearly doubles read and WCE throughput
• Write needs deeper buffers
• Experiment is unbuffered (3-deep + WCE)
[chart, one or two SCSI busses: read, write, and WCE throughput (MB/s) vs request size (2 KB to 192 KB), 1 bus vs 2 busses, roughly 2x]
51
File System Buffering & Stripes (UltraWide Drives)
• FS buffering helps small reads
• FS buffered writes peak at 12 MBps
• 3-deep async helps (the 3-deep unbuffered pattern is sketched after this slide)
• Read peaks at 30 MBps
• Write peaks at 20 MBps
[charts, three disks at 1 deep and 3 deep: FS read, read, FS write WCE, and write WCE throughput (MB/s) vs request size (2 KB to 192 KB)]
52
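The "unbuffered" and "3-deep" numbers come from keeping several overlapped requests outstanding against a file opened with FILE_FLAG_NO_BUFFERING, so the disks always have work queued. A minimal sketch of that pattern follows; error handling and cleanup are trimmed, and the file name and request size are arbitrary.

// Keep 3 unbuffered, overlapped 64 KB reads in flight at all times
// ("3-deep async").  Cleanup (CloseHandle / VirtualFree) omitted.
#include <windows.h>

int main() {
    const DWORD kReq   = 64 * 1024;             // request size (multiple of the sector size)
    const int   kDepth = 3;                     // requests in flight

    HANDLE f = CreateFileA("E:\\bigfile.dat", GENERIC_READ, 0, NULL, OPEN_EXISTING,
                           FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, NULL);

    OVERLAPPED ov[kDepth] = {};
    void*      buf[kDepth];
    LONGLONG   next = 0;                        // next file offset to issue

    for (int i = 0; i < kDepth; ++i) {          // prime the pipeline
        buf[i] = VirtualAlloc(NULL, kReq, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE); // sector-aligned
        ov[i].hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
        ov[i].Offset = (DWORD)next;  ov[i].OffsetHigh = (DWORD)(next >> 32);
        ReadFile(f, buf[i], kReq, NULL, &ov[i]);
        next += kReq;
    }

    for (;;) {                                  // as each read finishes, issue the next one
        for (int i = 0; i < kDepth; ++i) {
            DWORD got = 0;
            if (!GetOverlappedResult(f, &ov[i], &got, TRUE) || got == 0)
                return 0;                       // end of file (or error): stop
            ov[i].Offset = (DWORD)next;  ov[i].OffsetHigh = (DWORD)(next >> 32);
            ReadFile(f, buf[i], kReq, NULL, &ov[i]);
            next += kReq;
        }
    }
}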
PAP vs RAP
• Reads are easy, writes are hard
• Async write can match WCE
[diagram, advertised vs achieved: system bus 422 MBps vs 142 MBps; PCI 133 MBps vs 72 MBps; SCSI 40 MBps vs 31 MBps; disks 10-15 MBps vs 9 MBps; application data and the file system at the ends of the path]
53
Bottleneck Analysis
• NTFS read/write with 9 disks, 2 SCSI busses, 1 PCI:
» ~65 MBps unbuffered read
» ~43 MBps unbuffered write
» ~40 MBps buffered read
» ~35 MBps buffered write
[diagram: adapters ~30 MBps each, PCI read/write ~70 MBps, memory ~150 MBps]
54
Hypothetical Bottleneck Analysis
• NTFS read/write with 12 disks, 4 SCSI busses, 2 PCI
(not measured; we had only one PCI bus available, the 2nd one was "internal")
» ~120 MBps unbuffered read
» ~80 MBps unbuffered write
» ~40 MBps buffered read
» ~35 MBps buffered write
[diagram: adapters ~30 MBps each, two PCI busses ~70 MBps each, memory read/write ~150 MBps]
55
Year 2002 Disks
• Big disk
(10 $/GB)
» 3”
» 100 GB
» 150 kaps (k accesses per second)
» 20 MBps sequential
• Small disk (20 $/GB)
» 3”
» 4 GB
» 100 kaps
» 10 MBps sequential
• Both running Windows NT™ 7.0?
(see below for why)
56
How Do They Talk to Each Other?
• Each node has an OS
• Each node has local resources: a federation
• Each node does not completely trust the others
• Nodes use RPC to talk to each other
» CORBA? DCOM? IIOP? RMI?
» One or all of the above
• Huge leverage in high-level interfaces
• Same old distributed-system story
[diagram: applications on each node talk via RPC, streams, or datagrams over VIAL/VIPL and the wire(s)]
57
Outline
• FileCast & Reliable Multicast
• RAGS: SQL Testing
• TerraServer (a big DB)
• Sloan Sky Survey (CyberBricks)
• Billion Transactions per day
• Wolfpack Failover
• NTFS IO measurements
• NT-Cluster-Sort
58
Penny Sort Ground Rules
http://research.microsoft.com/barc/SortBenchmark
• How much can you sort for a penny?
» Hardware and software cost
» Depreciated over 3 years
» A 1 M$ system gets about 1 second; a 1 K$ system gets about 1,000 seconds
» Time (seconds) = 946,080 / SystemPrice ($) (worked out after this slide)
• Input and output are disk resident
• Input is
» 100-byte records (random data)
» key is the first 10 bytes
• Must create the output file and fill it with a sorted version of the input file
• Daytona (product) and Indy (special) categories
59
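A worked check of the penny rule (my arithmetic, assuming a 3-year, non-leap depreciation period):

\[ 3\ \text{years} = 3 \times 365 \times 24 \times 3600\ \text{s} = 94{,}608{,}000\ \text{s} \]
\[ \text{Time (s)} = \frac{0.01 \times 94{,}608{,}000}{\text{SystemPrice}\,(\$)} = \frac{946{,}080}{\text{SystemPrice}\,(\$)} \]

So a 1 M$ system gets about 0.95 s and a 1 K$ system about 946 s, matching the "about 1 second" and "about 1,000 seconds" figures above.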
PennySort
• Hardware
» 266 MHz Intel PPro
» 64 MB SDRAM (10 ns)
» Dual Fujitsu DMA 3.2 GB EIDE disks
• Software
» NT Workstation 4.3
» NT 5 sort
• Performance
» sort 15 M 100-byte records (~1.5 GB)
» disk to disk
» elapsed time 820 sec (checked against the penny budget after this slide)
» CPU time = 404 sec
[pie chart, PennySort machine cost (1107$): cpu 32%, disk 25%, board 13%, network/video/floppy 9%, memory 8%, cabinet + assembly 7%, software 6%, other 22%]
60
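Putting the ground rules and this machine together (my arithmetic, not a figure from the slides): at a price of 1107$ the penny budget is

\[ \frac{946{,}080}{1107} \approx 855\ \text{s}, \]

so the 820-second disk-to-disk sort of 15 M records fits within a penny.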
Cluster Sort: Conceptual Model
• Multiple data sources
• Multiple data destinations
• Multiple nodes
• Disks -> Sockets -> Disk -> Disk (a routing sketch follows this slide)
[diagram: nodes A, B, and C each scan a local input of mixed AAA/BBB/CCC records, send each record over a socket to the node that owns its key range, and each node stores and sorts only its own records]
61
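One way to realize the disks -> sockets -> disks model above is to route every record by the range its 10-byte key falls in. The routing function below is a hypothetical sketch (split points and node count are placeholders), not the NTClusterSort code.

// Route a 100-byte record (key = first 10 bytes) to the node that owns its
// key range: node i owns keys below splits[i]; the last node owns the tail.
// Real splitters would be chosen by sampling the input so ranges come out even.
#include <array>
#include <cstring>
#include <vector>

struct Record { char bytes[100]; };
using Key = std::array<char, 10>;

int DestinationNode(const Record& r, const std::vector<Key>& splits) {
    for (size_t i = 0; i < splits.size(); ++i)
        if (std::memcmp(r.bytes, splits[i].data(), 10) < 0)
            return (int)i;
    return (int)splits.size();
}

On node A of the picture, a record whose key falls in B's or C's range goes out over a socket to that node; records that arrive locally are sorted and written, so each destination disk ends up holding one sorted key range.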
Cluster Install & Execute
• If this is to be used by others, it must be:
» Easy to install
» Easy to execute
• Installations of distributed systems take time and can be tedious (AM2, GluGuard)
• Parallel remote execution is non-trivial (GLUnix, LSF)
• How do we keep this "simple" and "built-in" to NTClusterSort?
62
Remote Install
• Add a Registry entry to each remote node (sketched after this slide):
» RegConnectRegistry()
» RegCreateKeyEx()
63
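A sketch of the two calls named above, writing one value under a key on a remote node. The machine name, key path, and value name are placeholders for illustration, not the entries NTClusterSort actually creates.

// Create a registry entry on a remote node with RegConnectRegistry and
// RegCreateKeyEx.  Key path and value below are hypothetical.
#include <windows.h>
#include <cstring>

bool InstallOnNode(const char* machine)        // e.g. "\\\\node17"
{
    HKEY remoteRoot = NULL, key = NULL;
    if (RegConnectRegistryA(machine, HKEY_LOCAL_MACHINE, &remoteRoot) != ERROR_SUCCESS)
        return false;                          // node unreachable or access denied

    DWORD disposition = 0;
    if (RegCreateKeyExA(remoteRoot, "SOFTWARE\\NTClusterSort", 0, NULL,
                        REG_OPTION_NON_VOLATILE, KEY_WRITE, NULL,
                        &key, &disposition) != ERROR_SUCCESS) {
        RegCloseKey(remoteRoot);
        return false;
    }

    const char* path = "C:\\NTClusterSort\\sortnode.exe";   // hypothetical value
    RegSetValueExA(key, "ServerPath", 0, REG_SZ,
                   (const BYTE*)path, (DWORD)strlen(path) + 1);

    RegCloseKey(key);
    RegCloseKey(remoteRoot);
    return true;
}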
Cluster Execution
• Setup: MULTI_QI struct, COSERVERINFO struct
• CoCreateInstanceEx()
• Retrieve the remote object handle from the MULTI_QI struct
• Invoke methods as usual: Sort() (the whole sequence is sketched after this slide)
[diagram: one CoCreateInstanceEx call yields handles to Sort() objects on each remote node]
64
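A sketch of that sequence: fill COSERVERINFO with the remote node's name and a MULTI_QI with the interface wanted, call CoCreateInstanceEx(), pull the interface pointer out of the MULTI_QI, and call it like a local object. The CLSID, IID, and ISortNode interface here are hypothetical stand-ins for the sort component; CoInitializeEx() is assumed to have been called already.

// Create the sort object on a remote node via DCOM and invoke it as usual.
#include <windows.h>
#include <objbase.h>

interface ISortNode : public IUnknown {                  // hypothetical interface
    virtual HRESULT STDMETHODCALLTYPE Sort(const wchar_t* inFile,
                                           const wchar_t* outFile) = 0;
};
// Placeholder CLSID and IID; the real component would register its own.
static const CLSID CLSID_SortNode = {0x0,0x0,0x0,{0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x1}};
static const IID   IID_ISortNode  = {0x0,0x0,0x0,{0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x2}};

HRESULT SortOnNode(const wchar_t* node, const wchar_t* in, const wchar_t* out)
{
    COSERVERINFO server = {};                  // which machine to activate on
    server.pwszName = const_cast<wchar_t*>(node);

    MULTI_QI qi = {};                          // which interface(s) we want back
    qi.pIID = &IID_ISortNode;

    HRESULT hr = CoCreateInstanceEx(CLSID_SortNode, NULL, CLSCTX_REMOTE_SERVER,
                                    &server, 1, &qi);
    if (FAILED(hr) || FAILED(qi.hr)) return FAILED(hr) ? hr : qi.hr;

    ISortNode* sorter = static_cast<ISortNode*>(qi.pItf);  // retrieve the remote object handle
    hr = sorter->Sort(in, out);                            // invoke methods as usual
    sorter->Release();
    return hr;
}

Calling this once per node gives a handle on every node; from then on Sort() calls look like ordinary method calls, which is the point of the slide.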
SAN: Standard Interconnect
• LAN faster than memory bus?
• 1 GBps links in the lab
• 300$ port cost soon
• Port is the computer
[bandwidth ladder: Gbps Ethernet 110 MBps; PCI-32 70 MBps; UW SCSI 40 MBps; FW SCSI 20 MBps; SCSI 5 MBps]
65