Scalable Clusters
Jed Liu
11 April 2002

Overview

- Microsoft Cluster Service
  - Built on Windows NT
  - Provides high-availability services
  - Presents itself to clients as a single system
- Frangipani
  - A scalable distributed file system

Microsoft Cluster Service

- Design goals:
  - Cluster composed of COTS components
  - Scalability – able to add components without interrupting services
  - Transparency – clients see the cluster as a single machine
  - Reliability – when a node fails, its services can be restarted on a different node

Cluster Abstractions

- Nodes
- Resources
  - e.g., logical disk volumes, NetBIOS names, SMB shares, mail service, SQL service
- Quorum resource
  - Implements persistent storage for the cluster configuration database and change log
- Resource dependencies
  - Tracks dependencies between resources

Cluster Abstractions (cont'd)

- Resource groups
  - The unit of migration: resources in the same group are hosted on the same node
- Cluster database
  - Configuration data for starting the cluster is kept in a database, accessed through the Windows registry
  - The database is replicated at each node in the cluster

Node Failure

- Active members broadcast periodic heartbeat messages
- Failure suspicion occurs when a node misses two successive heartbeat messages from some other node (see the sketch below)
  - The regroup algorithm is initiated to determine the new membership
  - Resources that were online at the failed member are brought online at active nodes
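
A minimal sketch of the missed-heartbeat rule in Go, assuming a fixed heartbeat period; the Detector type and its names are illustrative, not MSCS APIs.

package cluster

import "time"

// heartbeatPeriod is an assumed interval; the slides do not give MSCS's actual value.
const heartbeatPeriod = time.Second

// Detector remembers the last heartbeat seen from each node and flags a node
// as suspected once two successive heartbeats have been missed.
type Detector struct {
	lastSeen map[string]time.Time
}

func NewDetector() *Detector {
	return &Detector{lastSeen: make(map[string]time.Time)}
}

// Observe records a heartbeat received from node at time t.
func (d *Detector) Observe(node string, t time.Time) { d.lastSeen[node] = t }

// Suspected reports whether node has missed two successive heartbeats by now,
// which is the condition that triggers the regroup algorithm.
func (d *Detector) Suspected(node string, now time.Time) bool {
	last, ok := d.lastSeen[node]
	return ok && now.Sub(last) > 2*heartbeatPeriod
}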

Member Regroup Algorithm

- Lockstep algorithm
- Activate. Each node waits for a clock tick, then starts sending and collecting status messages
- Closing. Determine whether partitions exist and whether the current node is in a partition that should survive
- Pruning. Prune the surviving group so that all nodes are fully connected

Regroup Algorithm (cont'd)

- Cleanup. Surviving nodes update local membership information as appropriate
- Stabilized. Done
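
The five phases can be read as a small lockstep state machine; the Go sketch below only encodes the phase ordering described on these slides (the actual message exchange and survival decision are elided).

package cluster

// RegroupPhase enumerates the lockstep phases of the member regroup algorithm.
type RegroupPhase int

const (
	Activate   RegroupPhase = iota // wait for a clock tick, exchange status messages
	Closing                        // detect partitions, decide whether this node's partition survives
	Pruning                        // shrink the surviving group until all members are fully connected
	Cleanup                        // surviving nodes update local membership information
	Stabilized                     // regroup finished
)

// Next returns the phase that follows p. Nodes advance in lockstep: no node
// enters the next phase until all surviving nodes have completed the current one.
func (p RegroupPhase) Next() RegroupPhase {
	if p == Stabilized {
		return Stabilized
	}
	return p + 1
}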

Joining a Cluster

- The sponsor authenticates the joining node
  - Denies access if the applicant isn't authorized to join
- The sponsor sends version info for the config database
  - Also sends updates as needed, if changes were made while the applicant was offline
- The sponsor atomically broadcasts information about the applicant to all other members
  - Active members update their local membership information
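
A hedged Go sketch of the sponsor's side of the join; the JoinRequest type and the authentication, update, and broadcast hooks are assumptions introduced for illustration, not the MSCS interfaces.

package cluster

import "fmt"

// JoinRequest carries what the applicant presents to its sponsor.
type JoinRequest struct {
	Node       string
	Credential string
	DBVersion  uint64 // version of the applicant's local cluster database
}

// Sponsor bundles the hooks the join procedure needs.
type Sponsor struct {
	auth      func(node, credential string) bool // assumed authentication check
	dbVersion uint64                             // sponsor's current database version
	updates   func(since uint64) []byte          // changes made while the applicant was offline
	broadcast func(event, node string) error     // atomic broadcast via the global update manager
}

// AdmitNode follows the slide's three steps: authenticate, bring the applicant's
// database up to date, then atomically announce the new member to everyone.
func (s *Sponsor) AdmitNode(req JoinRequest) ([]byte, error) {
	if !s.auth(req.Node, req.Credential) {
		return nil, fmt.Errorf("node %s is not authorized to join", req.Node)
	}
	var delta []byte
	if req.DBVersion < s.dbVersion {
		delta = s.updates(req.DBVersion) // send only what changed while it was offline
	}
	return delta, s.broadcast("member-join", req.Node)
}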

Forming a Cluster

- Use the local registry to find the address of the quorum resource
- Acquire ownership of the quorum resource
  - An arbitration protocol ensures that at most one node owns the quorum resource
- Synchronize the local cluster database with the master copy
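
The sequence above can be sketched in Go as follows; the QuorumResource interface and the syncLocalDB hook are assumptions, and the arbitration protocol itself is left abstract.

package cluster

import "errors"

// QuorumResource abstracts the quorum disk. Acquire is assumed to run the
// arbitration protocol that guarantees at most one owner at a time.
type QuorumResource interface {
	Acquire(node string) (owned bool, err error)
	ReadMasterDatabase() ([]byte, error)
}

// FormCluster follows the slide: the quorum resource's address is assumed to have
// been found in the local registry, ownership is arbitrated, and the local cluster
// database is then synchronized with the master copy kept on the quorum resource.
func FormCluster(node string, q QuorumResource, syncLocalDB func(master []byte) error) error {
	owned, err := q.Acquire(node)
	if err != nil {
		return err
	}
	if !owned {
		return errors.New("another node owns the quorum resource; join its cluster instead of forming one")
	}
	master, err := q.ReadMasterDatabase()
	if err != nil {
		return err
	}
	return syncLocalDB(master)
}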

Leaving a Cluster

- The member sends an exit message to all other cluster members and shuts down immediately
- Active members gossip about the exiting member and update their cluster databases

Node States

- Inactive nodes are offline
- Active members are either online or paused
  - All active nodes participate in cluster database updates, vote in the quorum algorithm, and maintain heartbeats
  - Only online nodes can take ownership of resource groups

Resource Management

- Achieved by invoking calls through a resource control library (implemented as a DLL)
- Through this library, MSCS can monitor the state of the resource
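
As a rough analogy for the entry points such a control library exposes, here is a Go interface; the method names are illustrative and are not the actual resource DLL API.

package cluster

// Resource is an illustrative stand-in for a resource control library:
// MSCS brings resources online or offline through it and polls it to
// monitor the resource's health.
type Resource interface {
	Online() error
	Offline() error
	IsAlive() bool
}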

Resource Migration

- Reasons for migration:
  - Node failure
  - Resource failure
  - The resource group prefers to execute at a different node
  - Operator-requested migration
- In the first case, the resource group is pulled to the new node
- In all other cases, the resource group is pushed

Pushing a Resource Group

- All resources on the old node are brought offline
- The old host node chooses a new host
- The local copy of MSCS at the new host brings up the resource group

Pulling a Resource Group

- Active nodes capable of hosting the group determine amongst themselves the new host for the group
  - The new host is chosen based on attributes stored in the cluster database
  - Since the database is replicated at all nodes, the decision can be made without any communication! (see the sketch below)
- The new host brings the resource group online
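
Because every active node evaluates the same deterministic rule over the same replicated database, they all reach the same answer with no extra messages. A Go sketch under assumed attributes (an ordered preference list per group):

package cluster

// GroupPlacement holds the placement attributes assumed to live in the
// replicated cluster database (the field names are illustrative).
type GroupPlacement struct {
	Group          string
	PreferredHosts []string // ordered preference list for this resource group
}

// ChooseHost picks the first preferred host that is currently online.
// Every survivor runs this over identical replicated data, so no
// coordination is needed to agree on the new host.
func ChooseHost(p GroupPlacement, online map[string]bool) (host string, ok bool) {
	for _, h := range p.PreferredHosts {
		if online[h] {
			return h, true
		}
	}
	return "", false
}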

Client Access to Resources

- Normally, clients access SMB resources using names of the form \\node\service
  - This presents a problem – as resources migrate between nodes, the resource name would change
- With MSCS, whenever a resource migrates, the resource's network name migrates with it as part of the resource group
  - Clients see only services and their network names – the cluster becomes a single virtual node

Membership Manager

- Maintains consensus among active nodes about which nodes are active and which are defined
  - A join mechanism admits new members into the cluster
  - A regroup mechanism determines the current membership on startup or on suspected failure

Global Update Manager

- Used to implement atomic broadcast
- A single node in the cluster is always designated as the locker
- The locker node takes over the atomic broadcast if the original sender fails mid-broadcast
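
A rough Go sketch of the locker idea: the sender delivers the update to the locker first, so if the sender dies partway through, the locker already has the update and can finish delivering it. The Update type and send hook are assumptions, and the locker's takeover path is not shown.

package cluster

// Update is one configuration change to be applied at every active member.
type Update struct {
	Seq  uint64
	Data []byte
}

// GlobalUpdate delivers u to the locker first and then to the remaining members.
// If the sender fails before reaching the locker, no one has the update; if it
// fails afterwards, the locker resends it to everyone, so the broadcast stays
// all-or-nothing.
func GlobalUpdate(u Update, locker string, members []string, send func(node string, u Update) error) error {
	if err := send(locker, u); err != nil {
		return err // the locker never saw the update, so nobody applies it
	}
	for _, m := range members {
		if m == locker {
			continue
		}
		if err := send(m, u); err != nil {
			return err // from here on the locker completes the broadcast (not shown)
		}
	}
	return nil
}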

Frangipani

- Design goals:
  - Provide users with coherent, shared access to files
  - Arbitrarily scalable, to provide more storage and higher performance
  - Highly available in spite of component failures
  - Minimal human administration
    - Full, consistent backups of the entire file system can be made without bringing it down
    - Complexity of administration stays constant despite the addition of components

Server Layering

[Layer diagram: user programs run on top of Frangipani file servers; the Frangipani servers sit on the distributed lock service and the Petal distributed virtual disk service, which in turn sit on the physical disks.]

Assumptions

- Frangipani servers trust:
  - One another
  - The Petal servers
  - The lock service
- Meant to run in a cluster of machines that are under a common administration and can communicate securely

System Structure

- Frangipani is implemented as a file system option in the OS kernel
- All file servers read and write the same file system data structures on the shared Petal disk
- Each file server keeps a redo log in Petal, so that when it fails, another server can access the log and recover

[Structure diagram: on each machine, user programs go through the file system switch to the Frangipani file server module, which issues I/O through the Petal device driver; across the network, the device drivers talk to a set of Petal servers and lock servers that together export the Petal virtual disk.]

Security Considerations

- Any Frangipani machine can access and modify any block of the Petal virtual disk
  - Frangipani must run only on machines with trusted OSes
  - Petal servers and lock servers should also run on trusted OSes
  - All three types of components should authenticate one another
- Network security is also important: eavesdropping should be prevented

Disk Layout

- 2^64 bytes of addressable disk space, partitioned into regions:
  - Shared configuration parameters
  - Logs – each server owns a part of this region to hold its private log
  - Allocation bitmaps – each server owns parts of this region for its exclusive use
  - Inodes, small data blocks, large data blocks
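
The ordering of regions can be pictured as offsets into Petal's 2^64-byte address space; the Go constants below are placeholders chosen for illustration, since the slide gives the regions but not their sizes.

package frangipani

// Illustrative offsets for the regions of the Petal virtual disk.
// Only the region order comes from the slide; the boundary values are made up.
const (
	ParamsStart  uint64 = 0       // shared configuration parameters
	LogsStart    uint64 = 1 << 30 // each server owns a fixed slice for its private log
	BitmapsStart uint64 = 1 << 32 // each server owns parts of this region exclusively
	InodesStart  uint64 = 1 << 34
	SmallStart   uint64 = 1 << 36 // small data blocks
	LargeStart   uint64 = 1 << 40 // large data blocks
)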

Logging and Recovery

- Only changes to metadata are logged – user data is not logged
  - Uses write-ahead redo logging
- Each log is implemented as a circular buffer
  - When the log fills, the oldest ¼ of the buffer is reclaimed
- Need to be able to find the end of the log
  - A monotonically increasing sequence number is added to each block of the log (see the sketch below)
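
The sequence numbers make the end of the circular log easy to find: scan until the numbers stop increasing. A Go sketch, assuming fixed-size log blocks that each carry their sequence number:

package frangipani

// LogBlock is one fixed-size block of a server's circular redo log.
type LogBlock struct {
	Seq     uint64 // monotonically increasing across writes
	Payload []byte
}

// EndOfLog returns the index just past the most recently written block.
// Because sequence numbers only increase, the end of the log is the first
// point where the sequence number drops – the wrap point of the circular buffer.
func EndOfLog(blocks []LogBlock) int {
	for i := 1; i < len(blocks); i++ {
		if blocks[i].Seq < blocks[i-1].Seq {
			return i
		}
	}
	return len(blocks)
}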

Concurrency Considerations

- Need to ensure that logging and recovery work in the presence of multiple logs
  - Updates requested to the same data by different servers are serialized
  - Recovery applies a change only if it was logged under a lock that was still active at the time of failure
    - To ensure this, never replay an update that has already been completed
    - Keep a version number on each metadata block (see the sketch below)
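
The version-number check that keeps replay idempotent can be sketched in Go as follows; the record layout and the block-access hooks are assumptions.

package frangipani

// LoggedUpdate is one redo record: the metadata block it touches, the version
// the block will have after the update, and the new contents.
type LoggedUpdate struct {
	Block   uint64
	Version uint64
	Data    []byte
}

// Replay applies an update only if the on-disk block is still older than the
// logged version, so an update that already completed is never applied twice.
func Replay(u LoggedUpdate, blockVersion func(block uint64) uint64,
	write func(block uint64, data []byte, version uint64)) {
	if blockVersion(u.Block) < u.Version {
		write(u.Block, u.Data, u.Version)
	}
}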

Concurrency Considerations (cont'd)

- Ensure that only one recovery daemon is replaying the log of a given server
  - Done by taking an exclusive lock on the log

Cache Coherence

- When the lock service detects conflicting lock requests, the current lock holder is asked to release or downgrade its lock
- The lock service uses read locks and write locks
  - When a read lock is released, the corresponding cache entry must be invalidated
  - When a write lock is downgraded, dirty data must be written to disk
  - Releasing a write lock = downgrade to a read lock, then release
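
A Go sketch of the callback a server might run when asked to give up a lock, following the two rules above; the cache and write-back hooks are assumptions, not Frangipani's actual interfaces.

package frangipani

// cacheEntry is the data a server caches under one lock.
type cacheEntry struct {
	data  []byte
	dirty bool
}

// onRevoke handles a conflicting-lock callback. Giving up a write lock forces
// dirty data to disk first; fully releasing the lock also invalidates the cached
// copy, since it may go stale once another server can write.
func onRevoke(e *cacheEntry, haveWriteLock, releaseFully bool, writeBack func([]byte) error) error {
	if haveWriteLock && e.dirty {
		if err := writeBack(e.data); err != nil { // flush before the lock weakens
			return err
		}
		e.dirty = false
	}
	if releaseFully {
		e.data = nil // the read lock is gone, so drop the cache entry
	}
	return nil
}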

Synchronization

- The division of on-disk data structures into lockable segments is designed to avoid lock contention
  - Each log is lockable
  - The bitmap space is divided into lockable units
  - An unallocated inode or data block is protected by the lock on the corresponding piece of the bitmap space
  - A single lock protects an inode and any file data it points to

Locking Service

- Locks are sticky – they're retained until someone else needs them
- Client failure is dealt with by using leases
- Network failures can prevent a Frangipani server from renewing its lease
  - The server discards all locks and all cached data
  - If there was dirty data in the cache, Frangipani throws errors until the file system is unmounted

Locking Service Hole

- If a Frangipani server's lease expires during a temporary network outage, it might still try to access Petal
  - The problem is essentially caused by the lack of clock synchronization
  - It can be fixed without synchronized clocks by including a lease identifier with every Petal request (see the sketch below)
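
The fix, sketched from Petal's side in Go: every request carries the lease under which it was issued, and Petal rejects requests whose lease is no longer current, so a server with an expired lease cannot write to the disk. The lease-lookup hook is an assumption.

package frangipani

import "errors"

var ErrStaleLease = errors.New("petal: request carries an expired lease")

// PetalWrite applies the write only if the lease identifier attached to the
// request is still the current lease for that Frangipani server.
func PetalWrite(server string, leaseID uint64, block uint64, data []byte,
	currentLease func(server string) uint64, apply func(block uint64, data []byte)) error {
	if currentLease(server) != leaseID {
		return ErrStaleLease
	}
	apply(block, data)
	return nil
}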

Adding and Removing Servers

- Adding a server is easy!
  - Just point it at a Petal virtual disk and a lock service, and it automagically gets integrated
- Removing a server is even easier!
  - Just take a sledgehammer to it
  - Alternatively, if you want to be nicer, you can flush dirty data before using the sledgehammer

Backups

- Just use the snapshot feature built into Petal to take backups
  - The resulting snapshot is crash-consistent: it reflects a state reachable if all Frangipani servers were to crash
  - This is good enough – if you restore the backup, the recovery mechanism handles the rest

Summary

- Microsoft Cluster Service
  - Aims to provide reliable services running on a cluster
  - Presents itself as a virtual node to its clients
- Frangipani
  - Aims to provide a reliable distributed file system
  - Uses metadata logging to recover from crashes
  - Clients see it as a regular shared disk
  - Adding and removing nodes is really easy