Scalable Clusters
Jed Liu
11 April 2002
Overview
Microsoft Cluster Service
Built on Windows NT
Provides high availability services
Presents itself to clients as a single system
Frangipani
A scalable distributed file system
Microsoft Cluster Service
Design goals:
Cluster composed of COTS components
Scalability – able to add components without
interrupting services
Transparency – clients see cluster as a single
machine
Reliability – when a node fails, can restart services
on a different node
Cluster Abstractions
Nodes
Resources
e.g., logical disk volumes, NetBIOS names, SMB
shares, mail service, SQL service
Quorum resource
Implements persistent storage for cluster
configuration database and change log
Resource dependencies
Tracks dependencies between resources
Cluster Abstractions (cont’d)
Resource groups
The unit of migration: resources in the same
group are hosted on the same node
Cluster database
Configuration data for starting the cluster is kept
in a database, accessed through the Windows
registry.
Database is replicated at each node in the cluster.
Node Failure
Active members broadcast periodic heartbeat
messages
Failure suspicion occurs when a node misses
two successive heartbeat messages from
some other node
Regroup algorithm gets initiated to determine new
membership information
Resources that were online at a failed member are
brought online at active nodes
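The two-missed-heartbeats rule can be sketched as a small monitor; the interval value and all names below are illustrative assumptions, not MSCS internals.

```python
# Sketch of the failure-suspicion rule: a node is suspected once two
# successive heartbeat periods pass without a message from it.
HEARTBEAT_INTERVAL = 1.0  # assumed seconds between heartbeats

class HeartbeatMonitor:
    def __init__(self):
        self.last_seen = {}  # node id -> time of last heartbeat

    def record_heartbeat(self, node, now):
        self.last_seen[node] = now

    def suspected(self, node, now):
        # Suspect the node after two missed heartbeat periods.
        last = self.last_seen.get(node)
        return last is None or now - last > 2 * HEARTBEAT_INTERVAL

mon = HeartbeatMonitor()
mon.record_heartbeat("node-b", now=0.0)
print(mon.suspected("node-b", now=1.5))  # False: one heartbeat missed
print(mon.suspected("node-b", now=2.5))  # True: two heartbeats missed
```

Suspicion would then trigger the regroup algorithm described next.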
Member Regroup Algorithm
Lockstep algorithm
Activate. Each node waits for a clock tick,
then starts sending and collecting status
messages
Closing. Determine whether partitions exist
and whether the current node is in a
partition that should survive
Pruning. Prune the surviving group so that
all nodes are fully-connected
Regroup Algorithm (cont’d)
Cleanup. Surviving nodes update local
membership information as appropriate
Stabilized. Done
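A toy version of the pruning step, under the assumption that the status messages gathered during Activate are summarized as a reachability map; all names here are illustrative.

```python
# Toy sketch of the Pruning phase: repeatedly drop the worst-connected
# node until every survivor can reach every other survivor.

def prune_fully_connected(nodes, reachable):
    """Shrink `nodes` to a fully-connected surviving group."""
    survivors = set(nodes)

    def degree(n):
        return len(reachable.get(n, set()) & (survivors - {n}))

    while survivors:
        need = len(survivors) - 1
        bad = [n for n in survivors if degree(n) < need]
        if not bad:
            break  # fully connected: done
        survivors.discard(min(bad, key=degree))
    return sorted(survivors)

# Node 4 can only reach node 1, so it is pruned from the group:
reach = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
print(prune_fully_connected([1, 2, 3, 4], reach))  # [1, 2, 3]
```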
Joining a Cluster
Sponsor authenticates the joining node
Sponsor sends version info of config database
Denies access if applicant isn’t authorized to join
Also sends updates as needed, if changes were
made while applicant was offline
Sponsor atomically broadcasts information
about applicant to all other members
Active members update local membership
information
Forming a Cluster
Use local registry to find address of quorum
resource
Acquire ownership of quorum resource
Arbitration protocol ensures that at most one node
owns quorum resource
Synchronize local cluster database with
master copy
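The arbitration guarantee (at most one owner of the quorum resource) can be modelled as an atomic test-and-set; this is a sketch of the invariant only, not the actual disk-based arbitration protocol.

```python
import threading

# Sketch: the quorum resource as a single atomic ownership slot, so at
# most one node can ever win it. Class and method names are assumptions.
class QuorumResource:
    def __init__(self):
        self._owner = None
        self._lock = threading.Lock()

    def try_acquire(self, node):
        with self._lock:  # atomic test-and-set on ownership
            if self._owner is None:
                self._owner = node
                return True
            return False

q = QuorumResource()
print(q.try_acquire("a"))  # True: "a" forms the cluster
print(q.try_acquire("b"))  # False: "b" must join instead of forming
```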
Leaving a Cluster
Member sends an exit message to all other
cluster members and shuts down immediately
Active members gossip about exiting member
and update their cluster databases
Node States
Inactive nodes are offline
Active members are either online or paused
All active nodes participate in cluster database
updates, vote in the quorum algorithm, maintain
heartbeats
Only online nodes can take ownership of resource
groups
Resource Management
Achieved by invoking calls through a
resource control library (implemented as a
DLL)
Through this library, MSCS can monitor the
state of the resource
Resource Migration
Reasons for migration:
Node failure
Resource failure
Resource group prefers to execute at a different
node
Operator-requested migration
In the first case, resource group is pulled to
new node
In all other cases, resource group is pushed
Pushing a Resource Group
All resources in the old node are brought
offline
Old host node chooses a new host
Local copy of MSCS at new host brings up the
resource group
Pulling a Resource Group
Active nodes capable of hosting the group
determine amongst themselves the new host
for the group
New host chosen based on attributes that are
stored in the cluster database
Since database is replicated at all nodes, decision
can be made without any communication!
New host brings online the resource group
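Because the database is replicated, host selection can be a pure, deterministic function of the shared data: every active node computes the same answer with no messages. The `possible_hosts` and `preference` fields below are hypothetical, for illustration only.

```python
# Sketch: choosing a new host for a resource group purely from the
# replicated cluster database, so no communication is needed.
def choose_new_host(group, cluster_db, active_nodes):
    candidates = [n for n in cluster_db[group]["possible_hosts"]
                  if n in active_nodes]
    # Deterministic tie-break: database preference order, then name,
    # so every node independently reaches the same decision.
    return min(candidates,
               key=lambda n: (cluster_db[group]["preference"].index(n), n))

db = {"sql": {"possible_hosts": ["a", "b", "c"],
              "preference": ["c", "a", "b"]}}
print(choose_new_host("sql", db, active_nodes={"a", "b"}))  # "a"
```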
Client Access to Resources
Normally, clients access SMB resources using
names of the form \\node\service
This presents a problem – as resources migrate
between nodes, the resource name will change
With MSCS, whenever a resource migrates,
resource’s network name also migrates as
part of resource group
Clients see only services and their network
names – cluster becomes a single virtual node
Membership Manager
Maintains consensus among active nodes
about who is active and who is defined
A join mechanism admits new members into the
cluster
A regroup mechanism determines current
membership on start up or suspected failure
Global Update Manager
Used to implement atomic broadcast
A single node in the cluster is always
designated as the locker
Locker node takes over atomic broadcast in
case original sender fails in mid-broadcast
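The locker idea can be sketched as follows: the sender hands the update to the locker before anyone else, so the locker can finish the broadcast if the sender crashes mid-way. All class and method names are assumptions, not the real GLUP protocol.

```python
# Sketch of locker-based atomic broadcast.
class GlobalUpdate:
    def __init__(self, members):
        self.members = list(members)
        self.locker = self.members[0]  # one node is always the locker
        self.delivered = {m: [] for m in self.members}
        self.pending = None

    def send(self, update, crash_after=None):
        # Step 1: the locker records the update first.
        self.pending = update
        self.delivered[self.locker].append(update)
        # Step 2: deliver to the remaining members in a fixed order.
        for i, m in enumerate(self.members[1:]):
            if crash_after is not None and i >= crash_after:
                self.recover()  # sender died: locker takes over
                return
            self.delivered[m].append(update)
        self.pending = None

    def recover(self):
        # Locker completes the broadcast for the failed sender.
        for m in self.members:
            if self.pending not in self.delivered[m]:
                self.delivered[m].append(self.pending)
        self.pending = None

g = GlobalUpdate(["locker", "n1", "n2"])
g.send("set-x=1", crash_after=1)  # sender dies after reaching n1
print(all("set-x=1" in log for log in g.delivered.values()))  # True
```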
Frangipani
Design goals:
Provide users with coherent, shared access to files
Arbitrarily scalable to provide more storage, higher
performance
Highly available in spite of component failures
Minimal human administration
Full and consistent backups can be made of the entire
file system without bringing it down
Complexity of administration stays constant
despite the addition of components
Server Layering
[Layering diagram: user programs run on top of
Frangipani file servers; the file servers use the
distributed lock service and the Petal distributed
virtual disk service, which sits on the physical disks]
Assumptions
Frangipani servers trust:
One another
Petal servers
Lock service
Meant to run in a cluster of machines that are
under a common administration and can
communicate securely
System Structure
Frangipani implemented as a file system
option in the OS kernel
All file servers read and write the same file
system data structures on the shared Petal
disk
Each file server keeps a redo log in Petal so
that when it fails, another server can access
log and recover
[Structure diagram: on each machine, user programs go
through the file system switch to the Frangipani file
server module, which uses the Petal device driver; over
the network, the Petal virtual disk is provided by
multiple machines, each running a Petal server and a
lock server]
Security Considerations
Any Frangipani machine can access and
modify any block of the Petal virtual disk
Must run only on machines with trusted OSes
Petal servers and lock servers should also run
on trusted OSes
All three types of components should
authenticate one another
Network security also important:
eavesdropping should be prevented
Disk Layout
2^64 bytes of addressable disk space,
partitioned into regions:
Shared configuration parameters
Logs – each server owns a part of this region to
hold its private log
Allocation bitmaps – each server owns parts of
this region for its exclusive use
Inodes, small data blocks, large data blocks
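The region scheme can be illustrated by mapping an address in the 2^64-byte Petal space to its region; the boundary offsets below are invented for illustration and are not Frangipani's real layout.

```python
# Sketch: region lookup in a partitioned virtual disk address space.
# Boundary offsets are illustrative assumptions only.
REGIONS = [            # (start offset, region name), sorted by start
    (0,       "config"),
    (2**30,   "logs"),
    (2**31,   "bitmaps"),
    (2**32,   "inodes"),
    (2**33,   "small blocks"),
    (2**34,   "large blocks"),
]

def region_of(addr):
    name = REGIONS[0][1]
    for start, r in REGIONS:
        if addr >= start:
            name = r   # last region whose start we have passed
    return name

print(region_of(5))          # config
print(region_of(2**30 + 1))  # logs
```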
Logging and Recovery
Only log changes to metadata – user data is
not logged
Log implemented as a circular buffer
Use write-ahead redo logging
When log fills, reclaim oldest ¼ of buffer
Need to be able to find end of log
Add monotonically increasing sequence numbers
to each block of the log
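Finding the end of the circular log with those sequence numbers might look like this sketch (the block layout is an assumption): the end is the point where the numbers stop increasing.

```python
# Sketch: locate the end of a circular redo log by scanning the
# per-block sequence numbers for the point where they wrap.
def find_log_end(seqnos):
    """Return the index just past the newest block in a circular log."""
    for i in range(1, len(seqnos)):
        if seqnos[i] < seqnos[i - 1]:
            return i   # sequence numbers dropped: newest block is i-1
    return 0           # log is exactly in order: end wraps to the start

# Log written in order 7..12, overwriting the oldest blocks:
print(find_log_end([11, 12, 7, 8, 9, 10]))  # 2: block 12 is the newest
```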
Concurrency Considerations
Need to ensure logging and recovery work in
the presence of multiple logs
Updates requested to same data by different
servers are serialized
Recovery applies a change only if it was logged
under an active lock at the time of failure
To ensure this, never replay an update that has already
been completed: keep a version number on each
metadata block
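Version-checked replay can be sketched like this; the record and block fields are illustrative assumptions.

```python
# Sketch: recovery applies a logged update only if the metadata block
# has not already reached that version.
def replay(log, blocks):
    for rec in log:
        blk = blocks[rec["block"]]
        if rec["version"] > blk["version"]:
            blk.update(data=rec["data"], version=rec["version"])

blocks = {"inode-5": {"data": "old", "version": 3}}
log = [
    {"block": "inode-5", "version": 3, "data": "dup"},  # already applied
    {"block": "inode-5", "version": 4, "data": "new"},  # must be replayed
]
replay(log, blocks)
print(blocks["inode-5"])  # {'data': 'new', 'version': 4}
```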
Concurrency Considerations
(cont’d)
Ensure that only one recovery daemon is replaying
the log of a given server
Do this through an exclusive lock on the log
Cache Coherence
When lock service detects conflicting lock
requests, current lock holder is asked to
release or downgrade lock
Lock service uses read locks and write locks
When a read lock is released, corresponding cache
entry must be invalidated
When a write lock is downgraded, dirty data must
be written to disk
Releasing a write lock = downgrade to read lock,
then release
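The lock-transition rules above can be sketched as follows; the classes are illustrative assumptions, not Frangipani's implementation.

```python
# Sketch: coherence actions tied to lock transitions -- flush dirty data
# on write-lock downgrade, invalidate the cache on read-lock release.
class CachedBlock:
    def __init__(self, data):
        self.data = data
        self.dirty = False
        self.lock = "write"

    def downgrade(self, disk, key):
        # write lock -> read lock: dirty data must reach the shared disk
        if self.dirty:
            disk[key] = self.data
            self.dirty = False
        self.lock = "read"

    def release(self, disk, key, cache):
        # Releasing a write lock = downgrade to read lock, then release.
        if self.lock == "write":
            self.downgrade(disk, key)
        del cache[key]  # read lock released: cached copy may go stale

disk, cache = {}, {}
cache["b1"] = CachedBlock("v2")
cache["b1"].dirty = True
cache["b1"].release(disk, "b1", cache)
print(disk["b1"], "b1" in cache)  # v2 False
```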
Synchronization
Division of on-disk data structures into
lockable segments is designed to avoid lock
contention
Each log is lockable
Bitmap space divided into lockable units
Unallocated inode or data block is protected by
lock on corresponding piece of the bitmap space
A single lock protects the inode and any file data
that it points to
Locking Service
Locks are sticky – they’re retained until
someone else needs them
Client failure dealt with by using leases
Network failures can prevent a Frangipani
server from renewing its lease
Server discards all locks and all cached data
If there was dirty data in the cache, Frangipani
throws errors until file system is unmounted
Locking Service Hole
If a Frangipani server’s lease expires due to
temporary network outage, it might still try to
access Petal
Problem basically caused by lack of clock
synchronization
Can be fixed without synchronized clocks by
including a lease identifier with every Petal
request
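The lease-identifier fix can be sketched as: Petal tracks the current lease of each server and rejects any request carrying a stale one, so no synchronized clocks are needed. All names are assumptions.

```python
# Sketch: Petal rejects requests whose lease identifier is no longer
# the server's current lease.
class Petal:
    def __init__(self):
        self.current_lease = {}  # server -> id of its live lease
        self.blocks = {}

    def grant_lease(self, server, lease_id):
        self.current_lease[server] = lease_id

    def write(self, server, lease_id, block, data):
        if self.current_lease.get(server) != lease_id:
            raise PermissionError("stale lease")  # lease has expired
        self.blocks[block] = data

petal = Petal()
petal.grant_lease("fs1", 7)
petal.write("fs1", 7, "b0", "ok")
petal.grant_lease("fs1", 8)  # lease 7 expired during a network outage
try:
    petal.write("fs1", 7, "b0", "late")
except PermissionError as e:
    print(e)  # stale lease
print(petal.blocks["b0"])  # ok
```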
Adding and Removing Servers
Adding a server is easy!
Just point it to a Petal virtual disk and a lock
service, and it automagically gets integrated
Removing a server is even easier!
Just take a sledgehammer to it
Alternatively, if you want to be nicer, you can flush
dirty data before using the sledgehammer
Backups
Just use the snapshot features that are built
into Petal to do backups
Resulting snapshot is crash-consistent: reflects
state reachable if all Frangipani servers were to
crash
This is good enough – if you restore the backup,
recovery mechanism can handle the rest
Summary
Microsoft Cluster Service
Aims to provide reliable services running on a
cluster
Presents itself as a virtual node to its clients
Frangipani
Aims to provide a reliable distributed file system
Uses metadata logging to recover from crashes
Clients see it as a regular shared disk
Adding and removing nodes is really easy