Clustering Technology In Windows NT Server, Enterprise Edition
Jim Gray, Microsoft Research, [email protected]

Today's Agenda
- Windows NT® clustering
- MSCS (Microsoft Cluster Server) demo
- MSCS background: design goals, terminology
- Architectural details
- Setting up an MSCS cluster
- Hardware considerations
- Cluster application issues
- Q&A

Extra Credit (included in your presentation materials but not covered in this session)
- Reference materials
- SCSI primer (speaker's notes included)
- Hardware certification

MSCS In Action

High Availability Versus Fault Tolerance
- High availability: mask outages through service restoration
- Fault tolerance: mask local faults (RAID disks, uninterruptible power supplies, cluster failover)
- Disaster tolerance: mask site failures; protects against fire, flood, sabotage, etc. with a redundant system and service at a remote site

Windows NT Clusters
What is clustering to Microsoft? A group of independent systems that appear as, and are managed as, a single system:
- Common namespace; services are "cluster-wide"
- Ability to tolerate component failures
- Components can be added transparently to users
- Existing client connectivity is not affected by clustered applications

Microsoft Cluster Server
- Two-node version available 97Q3
- Commoditize fault tolerance (high availability): commodity hardware (no special hardware), easy to set up and manage
- Lots of applications work out of the box
- Multi-node scalability in the Windows NT 5 timeframe

MSCS Initial Goals
- Manageability: manage nodes as a single system; perform server maintenance without affecting users
- Availability: mask faults so repair is non-disruptive; restart failed applications and servers; unavailability ~ MTTR / MTBF, so repair quickly; detect failures and warn administrators
- Reliability: accommodate hardware and software failures; provide a redundant system without mandating a dedicated "standby" solution

MSCS Cluster (diagram): client PCs connect to Server A and Server B, which are joined by a heartbeat link and by cluster management, with shared access to disk cabinets A and B

Failover Example (diagram): a browser reaches Server 1 (Web site) and Server 2 (database); the Web site files and database files sit on shared storage so either server can host either service

Basic MSCS Terms
- Resource: the basic unit of failover
- Group: a collection of resources
- Node: a Windows NT® Server running cluster software
- Cluster: one or more closely coupled nodes, managed as a single entity

MSCS Namespace (cluster view): cluster name, node names, virtual server names

MSCS Namespace (outside world view)
- Cluster: IP address 1.1.1.1, network name WHECCLUS
- Node 1: IP address 1.1.1.2, network name WHECNode1
- Node 2: IP address 1.1.1.3, network name WHECNode2
- Virtual server 1 (Internet Information Server, SQL): IP address 1.1.1.4, network name WHEC-VS1
- Virtual server 2 (MTS, "Falcon"): IP address 1.1.1.5, network name WHEC-VS2
- Virtual server 3 (Microsoft Exchange): IP address 1.1.1.6, network name WHEC-VS3

Windows NT Clusters: Target Applications
- Application and database servers
- E-mail, groupware, and productivity application servers
- Transaction processing servers
- Internet Web servers
- File and print servers

MSCS Design Philosophy
- Shared nothing
- Remoteable tools
- Windows NT manageability enhancements
- Simplified hardware configuration
- Never take a "cluster" down: shell game, rolling upgrade
- Microsoft® BackOffice™ product support
- Provide clustering solutions for all levels of customer requirements
- Eliminate cost and complexity barriers

MSCS Design Philosophy
- Availability is the core of all releases
- Single server image for administration and client interaction
- Failover provided for unmodified server applications and unmodified clients (cluster-aware server applications get richer features)
- Failover for file and print is the default
- Scalability is the phase-two focus

Non-Features Of MSCS
- Not lock-step fault-tolerant
- Not able to "move" running applications; MSCS restarts applications that are failed over to other cluster members
- Not able to recover shared state between client and server (e.g., file position)
- All client/server transactions should be atomic; standard client/server development rules still apply; ACID always wins

Setting Up MSCS Applications

Attributes Of Cluster-Aware Applications
- A persistence model that supports orderly state transition
- Client application support; database example: ACID transactions, database log recovery
- IP clients only; how are retries supported?
- No name-service location dependencies
- A custom resource DLL is a good thing

MSCS Services For Application Support
- Name service mapper: GetComputerName resolves to the virtual server name
- Registry replication: a key and its underlying keys and values are replicated to the other node; atomic; logged to ensure partitions in time are handled

Application Deployment Planning
- System configuration is crucial; use an adequate hardware configuration (you can't run Microsoft BackOffice on a 32-MB, 75-MHz Pentium)
- Plan the preferred group owners
- A good understanding of single-server performance is critical; see the Windows NT Resource Kit performance planning section; understand working-set size
- What is acceptable performance to the business units?
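As a rough illustration of the name service mapper described above (GetComputerName resolving to the virtual server name rather than the node name), here is a hypothetical Python sketch. The mapping table, the application name "payroll-app", and the function itself are invented for illustration; they are not MSCS APIs.

```python
# Hypothetical sketch: an application running inside a virtual server
# should see the virtual server's network name, not the physical node's,
# so clients reconnect to the same name after a failover.
import socket

# Invented per-group mapping: application -> virtual server network name
VIRTUAL_SERVER_NAME = {"payroll-app": "WHEC-VS1"}

def get_computer_name(app: str) -> str:
    """Stand-in for the mapper: virtual server name if the app is clustered."""
    if app in VIRTUAL_SERVER_NAME:
        return VIRTUAL_SERVER_NAME[app]
    # Non-clustered fallback: the physical machine's own name.
    return socket.gethostname()

print(get_computer_name("payroll-app"))  # WHEC-VS1, on whichever node hosts the group
```

The point of the indirection is that the same answer comes back on either node, which is what lets a failed-over application keep advertising a stable name to its clients.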
Evolution Of Cluster-Aware Applications
- Active/passive: general out-of-the-box applications
- Active/active: applications that can run simultaneously on multiple nodes
- Highly scalable: extending active/active through I/O shipping, process groups, and other techniques

Application Evolution (diagram): with two nodes, Microsoft SQL Server, Microsoft Transaction Server (MTS), Internet Information Server (IIS), and Microsoft Exchange Server are divided between Node 1 and Node 2

Evolution Of Cluster-Aware Applications (diagram): with four nodes, IIS, Microsoft Exchange Server, Microsoft SQL Server, and MTS each run on their own node

Resources: What Are They?
- Resources are basic system components (physical disks, processes, databases, IP addresses, etc.) that provide a service to clients in a client/server environment
- They are online in only one place in the cluster at a time
- They can fail over from one system in the cluster to another

Resources
MSCS includes resource DLL support for:
- Physical and logical disk
- IP address and network name
- Generic service or application
- File share
- Print queue
- Internet Information Server virtual roots
- Distributed Transaction Coordinator (DTC)
- Microsoft Message Queue (MSMQ)
Resources support dependencies and are controlled via a well-defined interface; a group offers a "virtual server."

Cluster Service To Resource (diagram): the Windows NT cluster service initiates changes through a resource monitor, which receives resource events from the physical disk, IP address, generic application, and database resource DLLs; those DLLs in turn control the disk, network, application, and database

Cluster Abstractions
- Resource: a program or device managed by a cluster (e.g., file service, print service, database server); can depend on other resources (startup ordering); can be online, offline, paused, or failed
- Resource group: a collection of related resources; hosts resources; belongs to a cluster; the unit of co-location; involved in naming resources
- Cluster: a collection of nodes, resources, and
groups; cooperation for authentication, administration, and naming

Resources
Resources have:
- A type: what it does (file, DB, print, Web, …)
- An operational state (online/offline/failed)
- Current and possible nodes
- A containing resource group
- Dependencies on other resources
- Restart parameters (in case of resource failure)
A resource fails over (moves) from one machine to another: a logical disk, an IP address, a server application, a database. It may depend on another resource, and well-defined properties control its behavior.

Resource Dependencies
- A resource may depend on other resources
- A resource is brought online after any resources it depends on
- A resource is taken offline before any resources it depends on
- All dependent resources must fail over together

Dependency Example (diagram): a database resource DLL depends on an IP address resource DLL and a Drive E: resource DLL; a generic application resource DLL depends on the database resource DLL and a Drive F: resource DLL

Group Example (diagram): the payroll group contains the database, IP address, Drive E:, generic application, and Drive F: resource DLLs

MSCS Architecture (diagram): the cluster administrator and Cluster.Exe call the Cluster API DLL, which reaches the cluster service through the Cluster API stub. The cluster service comprises the Log Manager, Database Manager, Event Processor, Checkpoint Manager, Failover Manager, Global Update Manager, Object Manager, Resource Manager, Membership Manager, and Node Manager. Resource monitors host the physical, logical, and application resource DLLs through the Resource API, and nodes communicate over the network via the reliable cluster transport plus heartbeat.

MSCS Architecture
The cluster service is composed of the following objects:
- Failover Manager (FM)
- Resource Manager (RM)
- Node Manager (NM)
- Membership Manager (MM)
- Event Processor (EP)
- Database Manager (DM)
- Object Manager (OM)
- Global Update Manager (GUM)
- Checkpoint Manager (CM)
More about these in the next session.

Setting Up An MSCS Cluster

MSCS Key Components
- Two servers
- Shared SCSI bus: SCSI HBAs, SCSI RAID HBAs, hardware RAID boxes
- Interconnect
- Multiprocessor versus uniprocessor
- Heterogeneous servers: many
types can be supported
- Remember: two NICs per node, PCI for the cluster interconnect
- Use a complete MSCS HCL configuration

MSCS Setup
The most common problems:
- Duplicate SCSI IDs on adapters
- Incorrect SCSI cabling
- SCSI card order on the PCI bus
- Configuration of SCSI firmware
Let's walk through getting a cluster operational.

Test Before You Build
Bring each system up independently and verify:
- Network adapters (cluster interconnect and organization interconnect)
- SCSI and disk function
- NTFS volume(s)

Top Ten Setup "Concerns"
10. SCSI is not well known. Please use the MSCS and IHV setup documentation; consider the SCSI book reference for this session.
9. Build a support model that will support clustering requirements. For example, in clustering, components are paired exactly (e.g., SCSI BIOS revision levels); include this in your plans.
8. Build extra time into your deployment planning to accommodate cluster setup, both hardware and software. Hardware examples include SCSI setup; software issues include installation across cluster nodes.
7. Know the certification process and its support implications.
6. Applications will become more cluster-aware over time, including better setup, diagnostics, and documentation. In the meantime, plan and test accordingly.
5. Clustering will affect your server maintenance and upgrade methodologies. Plan accordingly.
4. Use multiple network adapters and hubs to eliminate single points of failure (everywhere possible).
3. Today's clustering solutions are more complex to install and configure than single servers. Plan your deployments accordingly.
2. Make sure that your cabinet solutions and peripherals both fit and function well. Consider the serviceability implications.
1. Cabling is a nightmare.
Color-coded, heavily documented, Y-cable-inclusive, maintenance-designed products are highly desirable.

Cluster Management Tools
- Cluster administrator: monitor and manage the cluster
- Cluster CLI/COM: command-line and COM interfaces
- Minor modifications to existing tools: Performance Monitor adds the ability to watch the entire cluster; Disk Administrator adds understanding of shared disks; the event logger broadcasts events to all nodes

MSCS Reference Materials
- In Search of Clusters: The Coming Battle in Lowly Parallel Computing, Gregory F. Pfister, ISBN 0-13-437625-0
- The Book of SCSI, Peter M. Ridge, ISBN 1-886411-02-6

The Basics Of SCSI
- Why SCSI?
- Types of interfaces
- Caching and performance
- RAID
- The future

Why SCSI?
- Faster than IDE: intelligent card/drive
- Uses less processor time
- Can transfer data at up to 100 MB/s
- More devices on a single chain: up to 15
- Wider variety of devices: DASD, scanners, CD-ROM writers and optical drives, tape drives

Types Of Interfaces
- SCSI and SCSI II: 50-pin, 8-bit; max transfer = 10 MB/s (early drives, 1.5 to 5 MB/s); internal transfer rate = 4 to 8 MB/s
- Wide SCSI: 68-pin, 16-bit; max transfer = 20 MB/s; internal transfer rate = 7 to 15.5 MB/s
- Ultra SCSI: 50-pin, 8-bit, higher transfer rate; max transfer = 20 MB/s; internal transfer rate = 7 to 15.5 MB/s
- Ultra Wide: 68-pin, 16-bit; max transfer rate = 40 MB/s; internal transfer rate = 7 to 30 MB/s

Performance Factors
- Cache on the drive or controller
- Caching in the OS
- Other variables: seek time, transfer rates

Redundant Array Of Inexpensive Disks (RAID)
- Developed from a paper published in 1987 at the University of California, Berkeley
- The idea is to combine multiple inexpensive drives, eliminating the SLED (single large expensive drive)
- Provides redundancy by storing parity information

RAID Types
- RAID 0 (striping): the fastest RAID; data is striped across multiple volumes with no redundancy
- RAID 1 (mirroring): a simple pair of drives with data replicated on both; writes are slower
- RAID 2: sector-stripes data across drives
with some drives storing ECC information (now done in hardware)
- RAID 3: sector striping, but with one drive dedicated to storing parity information for the set
- RAID 4: identical to RAID 3 but with large stripes
- RAID 5: best for multi-user environments; parity is spread across three or more drives

The Future For SCSI
- Faster interfaces: why?
- Fibre Channel: an optical standard proposed as part of SCSI III (not final); up to 100 MB/s transfer; still uses Ultra Wide SCSI inside enclosures; drives with optical interfaces are not yet available in quantity and cost more than SCSI

The Future Of SCSI
- Fibre Channel Arbitrated Loop: a ring instead of a bus architecture
- Can support up to 126 devices/hosts
- Hot-pluggable through the use of a port bypass circuit; no disruption of the loop as devices are added or removed
- Generally implemented using a backplane design

HCL List For MSCS
- Servers on the normal Windows NT HCL
- MSCS SCSI component HCL: self-test of multiprocessor machines soon; tested by WHQL; must pass the Windows NT HCT as well
- MSCS interconnect HCL: tested by WHQL; not required to pass 100% of the HCT (e.g., point-to-point adapters)

MSCS System Certification Process
- Windows NT 4.0+ SCSI HCL
- Windows NT 4.0+ Network HCL
- Windows NT 4.0+ MSCS SCSI HCL
- Complete MSCS configuration ready for self-test
- Windows NT 4.0+ Server HCL

Testing Phases
- Hardware compatibility (24 hours)
- One-node testing (24 hours): eight clients
- Two-node testing with failover (72 hours): SCSI and interconnect testing; eight clients with asynchronous failovers
- Stress testing (24 hours): dual-initiator I/O, split-brain problems, simultaneous reboots

Final MSCS HCL
- Only complete configurations are supported
- Self-test results are sent to Microsoft; logs are checked and the configuration is reviewed
- The HCL is updated on the Web and for the next major Windows NT release
- For more details, see the MSCS Certification document
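The parity idea behind RAID 3 through 5 in the SCSI primer above can be shown in a few lines of Python. This illustrates only the XOR arithmetic, not any real controller: parity is the byte-wise XOR of the data stripes, so any single lost stripe can be rebuilt from the survivors plus parity.

```python
# Illustrative sketch of RAID parity: parity = XOR of all data stripes,
# so one failed drive's stripe = XOR of the surviving stripes and parity.
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

stripes = [b"AAAA", b"BBBB", b"CCCC"]    # data stripes on three drives
parity = xor_blocks(stripes)             # written to the parity drive

# Drive 1 fails: rebuild its stripe from the other data drives plus parity.
rebuilt = xor_blocks([stripes[0], stripes[2], parity])
print(rebuilt == stripes[1])             # True
```

RAID 3 and 4 keep this parity on one dedicated drive; RAID 5 rotates it across all members so no single drive becomes the write bottleneck.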