Function Placement for Data Intensive Cluster Computing

Khalil Amiri, David Petrou, Gregory R. Ganger, and Garth A. Gibson, "Dynamic Function Placement for Data-intensive Cluster Computing," Proceedings of the USENIX Annual Technical Conference, San Diego, CA, June 2000. (http://www.pdl.cs.cmu.edu/Publications/publications.html)
Function Placement for Data
Intensive Cluster Computing
• Data intensive applications that filter, mine, sort or
manipulate large data sets
– Spread data parallel computations across source/sink
servers
– Exploit servers’ computational resources
– Reduce network bandwidth
Programming model and runtime system
• Compose data intensive applications from explicitly migratable, functionally independent components
• Application and filesystem represented as a graph of communicating mobile objects
• Graph rooted at storage servers by non-migratable storage objects
• Anchored at clients by a non-migratable console object
• Mobile objects have explicit methods that checkpoint and restore state during migration
• Storage objects provide persistent storage
• Console object contains the part of the application that must remain at the node where the application is started
ABACUS runtime system
• Migration and location-transparent invocation component (binding manager)
– Responsible for creating location-transparent references to mobile objects
– Redirects method invocations in the face of object migrations
– Each machine’s binding manager notifies the local resource manager of each procedure call to and return from a mobile object
• Resource monitoring and management component (resource manager)
– Uses notifications to collect statistics about bytes moved between objects and resources used by objects
– Monitors load on the local processor and the costs associated with moving data to and from storage servers
– Server-side managers collect statistics from client-side resource managers
– Employs analytic models to estimate the performance advantages that might accrue from moving to alternate placements
Programming Model
• Mobile Objects
– C++
– Required to implement a few methods to enable the runtime system to create
instances and migrate them
– Medium granularity – performs self-contained processing step that is data
intensive, such as parity computation, caching, searching or aggregation
– Has private state not accessible to outside objects except via exported
interface
– Responsible for saving private state, including state of all embedded
objects when Checkpoint() method is called by ABACUS
– Responsible for restoring state, including creation and initialization of all
embedded objects, when the runtime system invokes the Restore() method after
migration to a new node
– Checkpoint and restore go to/from external file or memory
• See Figure 1
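The contract above can be pictured with a small C++ sketch. The class and method names other than Checkpoint()/Restore() are assumptions for illustration, not the ABACUS headers:

```cpp
// Sketch of the mobile-object contract described above (names beyond
// Checkpoint/Restore are assumptions, not the actual ABACUS interfaces).
#include <cstddef>
#include <string>
#include <vector>

// Opaque buffer the runtime hands to an object for saving/restoring state.
struct StateBuffer {
    std::vector<unsigned char> bytes;
};

class MobileObject {
public:
    virtual ~MobileObject() = default;

    // Called by the runtime before migration: serialize private state,
    // including the state of all embedded objects, into buf.
    virtual void Checkpoint(StateBuffer& buf) = 0;

    // Called on the new node after migration: recreate and initialize
    // embedded objects, then reload private state from buf.
    virtual void Restore(const StateBuffer& buf) = 0;
};

// Example medium-granularity object: a filter over variable-length records.
class FilterObject : public MobileObject {
public:
    explicit FilterObject(std::string pattern) : pattern_(std::move(pattern)) {}

    // Data-intensive processing step: keep only records containing pattern_.
    std::vector<std::string> Process(const std::vector<std::string>& records) {
        std::vector<std::string> out;
        for (const auto& r : records)
            if (r.find(pattern_) != std::string::npos) out.push_back(r);
        records_seen_ += records.size();
        return out;
    }

    void Checkpoint(StateBuffer& buf) override {
        // Only the pattern is serialized in this sketch; a real object would
        // also save its counters and any embedded objects.
        buf.bytes.assign(pattern_.begin(), pattern_.end());
    }

    void Restore(const StateBuffer& buf) override {
        pattern_.assign(buf.bytes.begin(), buf.bytes.end());
    }

private:
    std::string pattern_;
    std::size_t records_seen_ = 0;
};
```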
Storage Servers
• Provide local storage objects exporting a flat file interface
• Storage objects are accessible only at the server that hosts them and never migrate
• Migratable objects lie between the storage objects and the console object
– Applications can declare other objects to be non-migratable
– An object that implements write-ahead logging, for example, can be declared non-migratable by the filesystem
Iterative Processing Model
• Synchronous invocations start at the top-level console
object and propagate down the object graph
• Amount of data moved is an application-specific number
of records, rather than the entire file or data set
• ABACUS accumulates statistics on return from method
invocations to make object migration decisions
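A minimal sketch of this iteration pattern, assuming a hypothetical ReadRecords() method on the object below the console (not the actual ABACUS API):

```cpp
// Sketch of the iterative model: the console object pulls one buffer of
// records at a time from the object below it in the graph.
#include <cstddef>
#include <string>
#include <vector>

struct GraphObject {
    virtual ~GraphObject() = default;
    // Returns up to max_records records; an empty vector means end of data.
    virtual std::vector<std::string> ReadRecords(std::size_t max_records) = 0;
};

void ConsoleLoop(GraphObject& below, std::size_t batch_size) {
    for (;;) {
        // Each synchronous call propagates down the graph toward storage;
        // the runtime records bytes moved when the call returns.
        std::vector<std::string> batch = below.ReadRecords(batch_size);
        if (batch.empty()) break;   // data set exhausted
        // ... consume one application-specific batch of records ...
    }
}
```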
Object based distributed
filesystem
• Filesystems composed of explicitly migratable objects (Figure 2)
– RAID
– Caching
– Application specific functionality
• Coherent file and directory abstractions atop the flat file space exported by base storage objects
• File
– Stack of objects supporting the services bound to the file
– Files whose data cannot be lost include a RAID object
– When a file is opened, the top-most object is instantiated first, then the lower-level objects
– Supports inter-client file and directory sharing
– Allows both file data and directory data to be cached and manipulated at trusted clients
– AFS-style callbacks for cache coherence
– Timestamp ordering protocol ensures that updates performed at a client are consistent before being committed at the server
Virtual File System Interface
(VFS)
• The Virtual File System (VFS) interface hides implementation
dependent parts of the file system
• BSD implemented VFS for NFS; the aim was to dispatch to different filesystems
• Manages kernel level file abstractions in one format for all file
systems
• Receives system call requests from user level (e.g. write, open, stat,
link)
• Interacts with a specific file system based on mount point traversal
• Receives requests from other parts of the kernel, mostly from
memory management
• (http://bukharin.hiof.no/fag/iad22099/innhold/lab/lab3/nfsnis_slides/
text13.htm)
Virtual File System Interface
(VFS)
• Microsoft Windows has a VFS-type interface
• Functions of the VFS:
• Separate file system generic operations from their
implementation.
• Enables transparent access to a variety of different local
file systems.
• At the VFS interface, files are represented as vnodes, which are network-wide unique numerical designators for files
• A vnode contains a pointer to its parent file system and to the file system over which it is mounted
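A rough sketch of the dispatch idea, with illustrative structs rather than the real BSD definitions:

```cpp
// Rough sketch of how a VFS layer dispatches a generic call to the file
// system that owns a vnode. The structs below are illustrative only.
#include <cstddef>

struct vnode;  // kernel-level file abstraction, one format for all file systems

// Per-file-system operation vector; each file system fills this in at mount time.
struct vnode_ops {
    int (*open)(vnode* vp, int flags);
    int (*write)(vnode* vp, const void* buf, std::size_t len);
    int (*stat)(vnode* vp, void* statbuf);
};

struct vnode {
    const vnode_ops* ops;   // set by the file system that owns this vnode
    void* fs_private;       // file-system-specific data
};

// Generic entry point: the system call layer never needs to know which file
// system (ext2, NFS, NASD, ...) actually services the request.
inline int vfs_write(vnode* vp, const void* buf, std::size_t len) {
    return vp->ops->write(vp, buf, len);
}
```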
File Graph
• File’s graph provides:
– VFS interface to applications
– Cache object
• Keeps index of particular object’s blocks in the shared cache kept by the
ABACUS filesystem
– Optional RAID 5 object
• Stripes and maintains parity for individual files across set of storage servers
– One or more storage objects
– RAID isolation/atomicity object anchored at storage servers
• Intercepts reads and writes to base storage object and verifies the consistency
of updates before committing
– Linux ext2 filesystem or CMU’s NASD prototype can be used for backing
store
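A sketch of how such a per-file stack might be assembled at open time; the class names are placeholders, not the ABACUS filesystem's actual types:

```cpp
// Sketch of assembling a file's object stack top down: VFS interface over
// cache over (optional) RAID 5 over the isolation/atomicity object and base
// storage objects. Illustrative types only.
#include <memory>
#include <vector>

struct Layer {                              // one object in the file's graph
    virtual ~Layer() = default;
    std::vector<std::shared_ptr<Layer>> below;   // objects this layer calls into
};

struct StorageObject : Layer {};            // anchored at a storage server, never migrates
struct IsolationAtomicityObject : Layer {}; // verifies update consistency before commit
struct RAID5Object : Layer {};              // stripes data and maintains parity per file
struct CacheObject : Layer {};              // indexes this file's blocks in the shared cache
struct FileVFSObject : Layer {};            // presents the VFS interface to applications

// On open(), the top-most object is instantiated, then the lower layers.
std::shared_ptr<FileVFSObject> OpenFile(
        std::vector<std::shared_ptr<Layer>> storage, bool raid_protected) {
    auto iso = std::make_shared<IsolationAtomicityObject>();
    iso->below = std::move(storage);

    std::shared_ptr<Layer> mid = iso;
    if (raid_protected) {                   // RAID object only for files whose data must not be lost
        auto raid = std::make_shared<RAID5Object>();
        raid->below = {mid};
        mid = raid;
    }

    auto cache = std::make_shared<CacheObject>();
    cache->below = {mid};

    auto top = std::make_shared<FileVFSObject>();
    top->below = {cache};
    return top;
}
```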
Directory Graph
• Directory object
– Provides POSIX-like directory calls and caches directory entries
• Isolation/atomicity object
– Specialized to directory semantics for performance reasons
• Storage object
Accessing ABACUS filesystem
• Applications that include ABACUS objects directly append per-file
object subgraphs onto their application object graphs
• Can be mounted as a standard file system via VFS layer redirection
– Legacy applications can still benefit from filesystem objects that adaptively migrate below them
– Legacy applications themselves do not migrate
Object-based applications
• Data intensive applications decomposed into objects that search, aggregate or data mine
• Applications formulated to iterate over input data, operating on one buffer of data at a time
• Filtering component encapsulated into a C++ object with checkpoint and restore methods
• Applications instantiate mobile objects by making a request to the ABACUS runtime system
• ABACUS allocates and returns to the caller a network-wide unique run-time identifier (rid)
• The rid acts as a layer of indirection
• Per-node hash tables map each rid to a (node, object_reference_within_node) pair
• Data is passed by procedure call when objects are in the same address space, and by RPC when objects cross machines
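A sketch of the per-node indirection table implied above; types and names are assumptions:

```cpp
// Sketch of the per-node indirection table: a network-wide run-time
// identifier (rid) maps to the node currently hosting the object and a
// local reference valid only on that node.
#include <cstdint>
#include <unordered_map>

using Rid = std::uint64_t;          // network-wide unique run-time identifier
using NodeId = std::uint32_t;

struct Location {
    NodeId node;                     // node currently hosting the object
    void*  local_ref;                // object reference valid only on that node
};

class BindingTable {
public:
    void Bind(Rid rid, Location loc) { table_[rid] = loc; }

    // Invocation path: a local procedure call if the object is here,
    // otherwise an RPC to the node recorded in the table.
    bool IsLocal(Rid rid, NodeId self) const {
        auto it = table_.find(rid);
        return it != table_.end() && it->second.node == self;
    }

private:
    std::unordered_map<Rid, Location> table_;
};
```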
Object Migration
• Migrate from source to target
– Binding manager blocks new calls to the migrating object
– Binding manager waits until all active invocations to migrating object
have drained
– Object is locally checkpointed by invoking Checkpoint() method
– State is written into a memory buffer or an external file
– Location tables at source, target and home node are updated
– Invocations are unblocked and redirected to proper node via updated
location table
– Per-node hash tables may contain stale data; if an object cannot be located, up-to-date information can be found at the object’s home node
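The sequence above, written out as a hedged C++ sketch with stand-in helpers (not the ABACUS implementation):

```cpp
// Sketch of the migration sequence in code form. All helpers here are
// stand-ins (assumptions), present only to show the ordering of steps.
#include <cstdint>
#include <vector>

using Rid = std::uint64_t;
using NodeId = std::uint32_t;
struct StateBuffer { std::vector<unsigned char> bytes; };

struct MobileObject {
    virtual ~MobileObject() = default;
    virtual void Checkpoint(StateBuffer&) = 0;
    virtual void Restore(const StateBuffer&) = 0;
};

class BindingManager {
public:
    // Stub methods standing in for the real mechanisms.
    void BlockNewCalls(Rid) {}
    void DrainActiveCalls(Rid) {}                      // wait until active invocations return
    void UpdateLocation(NodeId /*where*/, Rid, NodeId /*new_host*/) {}
    void UnblockCalls(Rid) {}
    NodeId HomeNode(Rid) { return 0; }

    void Migrate(MobileObject& obj, Rid rid, NodeId source, NodeId target) {
        BlockNewCalls(rid);                  // 1. block new calls to the migrating object
        DrainActiveCalls(rid);               // 2. let in-flight invocations drain

        StateBuffer state;
        obj.Checkpoint(state);               // 3. checkpoint private state locally
        // 4. ship `state` to the target, where the runtime invokes Restore()

        UpdateLocation(source, rid, target); // 5. update location tables at source,
        UpdateLocation(target, rid, target); //    target and the object's home node
        UpdateLocation(HomeNode(rid), rid, target);

        UnblockCalls(rid);                   // 6. unblocked calls are redirected via the
    }                                        //    updated table; stale entries fall back
};                                           //    to the home node
```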
Resource Monitoring
• Memory consumption, instructions executed per byte, stall time
– Since runtime system redirects calls to objects, it is in a position to collect
all necessary statistics
– Monitoring code interposed between mobile object procedure call and
return
– Number of bytes transferred recorded in timed data flow graph
– Moving averages of bytes moved between every pair of communicating
objects in graph
– Runtime system tracks dynamic memory allocation using wrappers around
each memory allocation routine
– Linux interval timers or the Pentium cycle counter are used to count instructions executed by objects
– Track amount of time thread is blocked by having kernel update blocking
timers of any threads in the queue marked as blocked
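A sketch of the kind of interposed bookkeeping described above, using an exponential moving average per graph edge; the smoothing constant and names are assumptions:

```cpp
// Sketch of interposed monitoring: a wrapper around each procedure call into
// a mobile object records bytes moved and blocked time, and keeps a moving
// average per edge of the data-flow graph.
#include <chrono>
#include <cstddef>
#include <map>
#include <utility>

class FlowMonitor {
public:
    // Called on return from a method invocation from object `src` to `dst`.
    void RecordTransfer(int src, int dst, std::size_t bytes) {
        double& avg = bytes_avg_[{src, dst}];
        avg = kAlpha * static_cast<double>(bytes) + (1.0 - kAlpha) * avg;
    }

    // Called with the wall-clock time the invocation spent blocked (stall).
    void RecordStall(int obj, std::chrono::microseconds stall) {
        stall_us_[obj] += stall.count();
    }

    double AverageBytes(int src, int dst) const {
        auto it = bytes_avg_.find({src, dst});
        return it == bytes_avg_.end() ? 0.0 : it->second;
    }

private:
    static constexpr double kAlpha = 0.25;   // smoothing factor (assumed)
    std::map<std::pair<int, int>, double> bytes_avg_;
    std::map<int, long long> stall_us_;
};
```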
Dynamic Placement
• Net benefit
– Server side resource manager collects per-object measurements
– Receives statistics about client processor speed and current load
– Given data flow graph between objects, latency of client-server
link, model estimates changes in stall time if object changes
location
– Model also estimates change in execution time for other objects
executing at target node
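A back-of-the-envelope sketch of such a net-benefit estimate; this is an illustrative model, not the paper's exact formula:

```cpp
// Compare the network stall saved by moving an object next to its data
// against the change in compute time on the (possibly slower or more
// loaded) target node. Illustrative model only.
#include <algorithm>

struct PlacementEstimate {
    double bytes_over_link;     // bytes the object currently moves across the network
    double link_bandwidth;      // bytes per second of the client-server link
    double compute_time_here;   // seconds of CPU the object uses at its current node
    double speed_ratio;         // target node speed / current node speed, load-adjusted
};

// Positive return value => moving the object is estimated to save time.
double NetBenefitSeconds(const PlacementEstimate& e) {
    double stall_saved   = e.bytes_over_link / e.link_bandwidth;            // network stall avoided
    double compute_delta = e.compute_time_here / std::max(e.speed_ratio, 1e-9)
                         - e.compute_time_here;                             // extra (or saved) CPU time
    return stall_saved - compute_delta;
}
```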
Example – Placement of RAID
Object
• Figure 3:
– For Client A, the RAID object runs at the client; for Client B, the RAID object runs at the storage device
– If the LAN is slow, it is better if the RAID object runs at the storage device
• Figure 4:
– Two clients writing 4 MB files sequentially
– Stripe size is 5 (4 data + parity)
– Stripe unit is 32 KB
– In the LAN case, it is better to execute the RAID object on the server
– In the SAN case, running the RAID object locally at the client is 1.3X faster
– In the degraded-read case, client-based RAID wins (due to the computational cost of doing reconstruction)
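For reference, the parity work a RAID 5 object does for one full stripe in this configuration (4 data units plus parity, 32 KB stripe unit) is a byte-wise XOR, sketched below:

```cpp
// Parity for one full stripe: XOR the four 32 KB data units byte by byte.
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kStripeUnit = 32 * 1024;  // 32 KB stripe unit
constexpr std::size_t kDataUnits  = 4;          // 4 data + 1 parity

// Each data unit is expected to be kStripeUnit bytes long.
std::vector<unsigned char> ComputeParity(
        const std::array<std::vector<unsigned char>, kDataUnits>& data) {
    std::vector<unsigned char> parity(kStripeUnit, 0);
    for (const auto& unit : data)
        for (std::size_t i = 0; i < parity.size() && i < unit.size(); ++i)
            parity[i] ^= unit[i];               // byte-wise XOR across data units
    return parity;
}
```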
Placement of filter
• Vary the filter’s selectivity and CPU consumption
• Highly selective filters are better run on the server
• It is possible to arrange things so that the filter should run on the client if the filter is computationally expensive (see the sketch below)
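A sketch of that trade-off as a simple per-byte cost comparison; the fields and the decision rule are illustrative assumptions, not the paper's model:

```cpp
// Per input byte: running at the server ships only the filtered output over
// the link, while running at the client ships every input byte but may use
// a faster, less loaded CPU.
struct FilterProfile {
    double selectivity;       // output bytes / input bytes (small => highly selective)
    double instr_per_byte;    // CPU cost of the filter
    double server_ips;        // effective server speed (instructions/s), load-adjusted
    double client_ips;        // effective client speed (instructions/s), load-adjusted
    double link_bytes_per_s;  // client-server link bandwidth
};

// Returns true if, per input byte, running at the server is estimated faster.
bool PreferServer(const FilterProfile& p) {
    double server_time = p.instr_per_byte / p.server_ips
                       + p.selectivity / p.link_bytes_per_s;   // ship only filtered output
    double client_time = p.instr_per_byte / p.client_ips
                       + 1.0 / p.link_bytes_per_s;             // ship every input byte
    return server_time < client_time;
}
```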
David F. Nagle, Gregory R. Ganger, Jeff Butler, Garth Gibson, and Chris Sabol, "Network Support for Network-Attached Storage," Hot Interconnects 1999, August 18-20, 1999, Stanford University, Stanford, California.
Network Support for Network
Attached Storage
• Enable scalable storage systems in ways that minimize file manager
bottleneck
– Homogeneous network of trusted clients that issue unchecked commands
to shared storage
• Poor security and integrity (anybody can read or write to anything!)
– NetSCSI
• Minimal changes to hardware, software of SCSI disks
• NetSCSI disks send data directly to clients
• Cryptographic hashes and encryption, verified by the NetSCSI disks, provide integrity and privacy
• File manager still involved in each storage access
• Translates namespaces and sets up third party transfer on each request
Network Attached Secure Disks
(NASD)
• NASD architecture provides a command interface that reduces the number of client-storage interactions
• Data intensive operations go right to the disk; less common policy-making operations (e.g. namespace and access control) go to the file manager
• See Figure 1 for the scheme
• NASD drives map and authorize requests for disk sectors
• Time-limited capability provided by the file manager (see the sketch below)
– Allows access to a given file’s map and contents
– Storage mapping metadata is maintained by the drive
• Smart drives can exploit knowledge of their own resources to optimize data layout, readahead and cache management
– NASD drive exports a “namespace”
• Describes file objects, which can generalize to banks of striped files
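A sketch of the drive-side capability check implied above; the field names and the check are illustrative, and the real protocol relies on cryptographic keyed digests rather than this placeholder:

```cpp
// The file manager hands the client a time-limited capability; the drive
// checks it on each request without contacting the file manager.
#include <cstdint>
#include <ctime>

struct Capability {
    std::uint64_t object_id;     // NASD object the capability covers
    std::uint32_t rights;        // e.g. read/write bits granted by the file manager
    std::time_t   expires;       // time limit set by the file manager
    std::uint64_t mac;           // keyed digest over the fields above (placeholder)
};

constexpr std::uint32_t kRead = 0x1, kWrite = 0x2;

// Drive-side check performed for every client request.
bool DriveAuthorize(const Capability& cap, std::uint64_t object_id,
                    std::uint32_t requested_rights, std::time_t now,
                    bool mac_ok /* result of verifying cap.mac with the drive key */) {
    return mac_ok
        && cap.object_id == object_id
        && (cap.rights & requested_rights) == requested_rights
        && now < cap.expires;        // capabilities can also be revoked early
}
```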
NASD Implementation
• Ported AFS and NFS to use the NASD interface
• Implemented a striped version of NFS on top of the interface
• In the NASD/AFS and NASD/NFS filesystems, frequent data-moving operations occur directly between the client and the NASD drive
– NFS --- stateless server, weak cache consistency
– Based on the client’s RPC opcode, RPC destination addresses are modified to deliver requests to the NASD drive
• The AFS port had to maintain the sequential consistency guarantees of AFS
– Used the ability of NASD capabilities to be revoked based on expiration time or object attributes
Performance Comparisons
• Compare NASD/AFS and NASD/NFS vs. Server Attached Disk (SAD) AFS and NFS
– Computing frequent sets of 300MB sales transactions
– Maximum bandwidth with given number of disks or NASD drives and up to 10
clients
– Bandwidth of client reading from single NASD file striped across n drives (linear
scaling)
– NFS – performance of all clients reading from single file striped across n disks on
server (bottlenecks near 20 MB/s)
– NFS_parallel – each client reads from separate file on an independent disk through
the same server (bottlenecks near 22.5 MB/s)
Network Support
• Non-cached read/write can be serviced by modest hardware
• Requests that hit in cache need much higher bandwidth – lots of time
in network stack
• Need to deliver scalable bandwidth
– Deal efficiently with small messages
– Don’t spend too much time going between OS layers
– Don’t copy data too many times
Reducing Data Copies
• Integrate buffering/caching into OS
– Effective where caching plays central role
• Direct user level access to network
– High bandwidth applications
• Layered NASD on top of VI-Architecture (VIA) interface
– Integrates user-level Network Interface Controller (NIC) access with protection mechanisms – send/receive/remote DMA
– Commercial VIA implementations are available – full network bandwidth
while consuming less than 10% of CPU’s cycles
– Support from hardware, software vendors
Integrating VIA with NASD
• NASD software runs in kernel mode
• Drive must support the external VIA interface and semantics
• Can result in hundreds of connections and lots of RAM
• Interface
– Client preposts a set of buffers matching the read request size
– Issues the NASD read command
– Drive returns the data
– Writes are analogous, but bursts require a large amount of preposted buffer space
VIA’s remote DMA
• Clients send write command with pointer to data stored in client’s pinned
memory
• Drive uses VIA RDMA command to pull data out of client’s memory
• Drive would treat client’s RAM as extended buffer/cache
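A sketch of the two flows above with stand-in functions in place of the actual VIA verbs; only the ordering of steps is meant to be accurate:

```cpp
// Read (prepost model) and write (remote DMA model) flows. The helper
// functions below are stubs standing in for VIA operations; they are
// assumptions, not the VIA API.
#include <cstddef>
#include <cstdint>
#include <vector>

struct PinnedBuffer { void* addr; std::size_t len; };   // registered, pinned memory

// Stand-in stubs for illustration only.
std::vector<PinnedBuffer> PostReceiveBuffers(std::size_t length) {
    return {PinnedBuffer{nullptr, length}};
}
void SendNasdReadCommand(std::uint64_t, std::size_t) {}
void WaitForCompletion(const std::vector<PinnedBuffer>&) {}
void RdmaReadFromClient(const PinnedBuffer&) {}
void CommitToObject(std::uint64_t, std::size_t) {}

// --- Client-side read ---
void ClientRead(std::uint64_t object_id, std::size_t length) {
    auto bufs = PostReceiveBuffers(length);   // 1. prepost buffers matching the request size
    SendNasdReadCommand(object_id, length);   // 2. issue the NASD read command
    WaitForCompletion(bufs);                  // 3. drive sends data into the preposted buffers
}

// --- Drive-side write using remote DMA ---
void DriveHandleWrite(std::uint64_t object_id, PinnedBuffer client_data) {
    // The client's write command carried only a descriptor of its pinned
    // memory; the drive pulls the data when it has buffer space, in effect
    // treating client RAM as an extension of its own buffer/cache.
    RdmaReadFromClient(client_data);
    CommitToObject(object_id, client_data.len);
}
```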
Network striping and Incast
• File striping across multiple storage devices
– Networks provide poor support for incast (many-to-one traffic)
– Each client should receive equal bandwidth from each source
– Results in poor performance (Figure 4)