Lilliput meets Brobdingnagian: Data Center Systems


3rd International Workshop on Dependability of Clouds, Data Centers
and Virtual Machine Technology (DCDV)
Held in conjunction with Dependable Systems and Networks (DSN)
Budapest, Hungary June 18, 2013
Lilliput meets Brobdingnagian: Data Center
Systems Management through
Mobile Devices
Saurabh Bagchi,
Fahad Arshad
Jan Rellermeyer, Thomas Osiecki,
Michael Kistler, Ahmed Gheith
IBM Research
Slide 1/18
System Management Workflow
Something is wrong!
Slide 2/18
Systems Management: A Changed View
Patch
Filtering
Slide 3/18
So What Exactly Are the Changes?
1. Platform being used for doing the systems management
Server
1. Large screen
2. Resource rich
3. Within organization’s security perimeter
4. High dependability
Mobile devices
1. Small screen
2. Resource constrained
3. Outside organization’s security perimeter
4. Lower dependability
Slide 4/18
So What Exactly Are the Changes?
2. Layered systems management to flat hierarchy
Filtering
Slide 5/18
Case Study: IBM Research’s IBM Remote Project
IBM Blade Centers
Simple, Focused, Instantaneous, Always Connected
User Interface
• visualization of complex data
• relevance first
• drill-down UI
Communication
• direct connection to the managed machines
• refresh rate vs. power consumption
Slide 6/18
Case Study: IBM Remote Project
Slide 7/18
Research Challenges Due To The Changes
1. Platform being used for doing the systems management:
Server to Mobile Devices
I. How do we optimize the scarce resources of the systems
management platforms? Primarily, battery and
communication bandwidth.
II. How do we handle the fact that the platforms will be
insecure and fault-intolerant for parts of their operation?
III. How do we visualize the (hopefully) rare failure event in a
deluge of systems monitoring data?
Slide 8/18
Research Challenges Due To The Changes
2. Layered systems management to flat hierarchy
I. Can we avoid chaos due to the looser coordination?
II. Can we leverage overlap between interests to cut down
on traffic to individual mobile devices?
Slide 9/18
Solution Directions for Question 1
1. Platform being used for doing the systems management:
Server to Mobile Devices
I. How do we optimize the scarce resources of the systems
management platforms? Primarily, battery and
communication bandwidth.
• Minimize number of messages, while still receiving enough
to reliably detect failures
– Use publish-subscribe or other push mechanism, in preference to
pull mechanism
– BUT: Most hardware management modules do not support push
– Use an intermediate server for aggregation and filtering
• Apply principles of rare event detection
– Non-events occur with much higher frequency than events of interest
– BUT: Requires model of events: time distribution, correlation, etc.
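As a rough illustration of the intermediate-server idea above, the sketch below has an aggregator that ingests readings it has pulled from management modules and pushes to subscribed mobile clients only when a machine's state changes to or from a relevant severity. The class name, the three severity levels, and the callback interface are assumptions made for this example; they are not taken from IBM Remote.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumed severity scale for this sketch.
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}


@dataclass
class Aggregator:
    """Hypothetical intermediate server: pull from hardware, push to phones."""
    min_severity: str = "warning"
    subscribers: List[Callable[[str, str], None]] = field(default_factory=list)
    last_state: Dict[str, str] = field(default_factory=dict)

    def subscribe(self, callback: Callable[[str, str], None]) -> None:
        self.subscribers.append(callback)

    def ingest(self, machine: str, state: str) -> bool:
        """Record one polled reading; push only on a relevant state change."""
        previous = self.last_state.get(machine, "ok")
        self.last_state[machine] = state
        threshold = SEVERITY[self.min_severity]
        changed = state != previous
        # A recovery (e.g. critical -> ok) is also worth one message.
        relevant = SEVERITY[state] >= threshold or SEVERITY[previous] >= threshold
        if changed and relevant:
            for notify in self.subscribers:
                notify(machine, state)
            return True
        return False


if __name__ == "__main__":
    pushed = []
    agg = Aggregator()
    agg.subscribe(lambda machine, state: pushed.append((machine, state)))
    for machine, state in [("blade-1", "ok"), ("blade-2", "critical"),
                           ("blade-2", "critical"), ("blade-2", "ok")]:
        agg.ingest(machine, state)
    print(pushed)
```

Repeated identical readings, which dominate the polled stream, never reach the mobile device; only transitions do. A model of event timing and correlation, as the slide notes, would be needed to go beyond this simple change filter.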
Slide 10/18
Solution Directions for Question 1
1. Platform being used for doing the systems management:
Server to Mobile Devices
II. How do we handle mismatch in dependability
characteristics (between target platform and
management platform)?
– Mobile device can be physically compromised and OS-level
protection can be bypassed
– Mobile devices are often employee owned
• Application security and server-side security need to be
built in
– Periodic authentications, not one-time authentications
– Biometric-based authentication
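One way to read "periodic authentications" is a server-side session that goes stale after a fixed interval, so that a management command is refused until the user authenticates again. The sketch below assumes a 300-second period and hypothetical names (Session, authorize); the actual authentication step (password, biometric) is out of scope and is represented only by the reauthenticate call.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    """Hypothetical server-side record of a mobile client's login."""
    user: str
    authenticated_at: float
    max_age_seconds: float = 300.0  # assumed re-authentication period

    def is_fresh(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        return (now - self.authenticated_at) < self.max_age_seconds

    def reauthenticate(self, now: Optional[float] = None) -> None:
        # Called only after the user has passed a real authentication check.
        self.authenticated_at = time.monotonic() if now is None else now


def authorize(session: Session, command: str, now: Optional[float] = None) -> bool:
    """Server-side gate: refuse any management command on a stale session."""
    return session.is_fresh(now)


if __name__ == "__main__":
    session = Session(user="alice", authenticated_at=0.0)
    print(authorize(session, "power_cycle", now=100.0))  # within the period
    print(authorize(session, "power_cycle", now=301.0))  # stale, must log in again
```

Enforcing the check on the server rather than in the mobile application matters here, since the slide assumes the device itself may be compromised.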
Slide 11/18
Solution Directions for Question 1
1. Platform being used for doing the systems management:
Server to Mobile Devices
III. How do we visualize the needle in the haystack?
– Needle: Outages, failures, or behavior that is indicative of an
imminent failure
– Haystack: Deluge of monitored data about target platforms
– Screen real estate is limited
• First off, deliver only a small superset of relevant messages
– Push notification, for example through Google Cloud Messaging (GCM)
• Drill-down views, starting with summary alert view for all
machines in data center
– Followed up with root cause analysis techniques that run on servers
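The drill-down idea can be sketched as two views over the same event list: a top-level summary with one count per severity for the whole data center, and a second level listing the affected machines for a chosen severity. The event fields (machine, severity, message) and the severity names are assumptions for this example.

```python
from collections import Counter, defaultdict

# Assumed severity names, worst first.
SEVERITY_ORDER = ["critical", "warning", "info"]


def summarize(events):
    """Top-level view: number of events per severity across all machines."""
    counts = Counter(e["severity"] for e in events)
    return {sev: counts.get(sev, 0) for sev in SEVERITY_ORDER}


def drill_down(events, severity):
    """Second-level view: machines with events of one severity, busiest first."""
    per_machine = defaultdict(list)
    for e in events:
        if e["severity"] == severity:
            per_machine[e["machine"]].append(e["message"])
    return sorted(per_machine.items(), key=lambda kv: (-len(kv[1]), kv[0]))


if __name__ == "__main__":
    events = [
        {"machine": "blade-3", "severity": "warning", "message": "fan speed low"},
        {"machine": "blade-1", "severity": "critical", "message": "PSU failure"},
        {"machine": "blade-3", "severity": "warning", "message": "temperature high"},
        {"machine": "blade-2", "severity": "info", "message": "login"},
    ]
    print(summarize(events))
    print(drill_down(events, "warning"))
```

Only the summary needs to fit on the first screen; the per-machine lists are fetched when the administrator taps a severity.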
Slide 12/18
Solution Directions for Question 2
2. Layered systems management to flat hierarchy, OR
Crowdsourcing systems management
I. Tight vertical integration of different software layers
implies different domain experts will be concurrently
involved in problem troubleshooting
• Relevant features of social media will be used
– Example: At IBM, you can “friend” specific Blade Centers and have
“circles” of administrators
• Role-based Access Control (RBAC) can be used for
security control of different software layers
– Fine-grained roles can be assigned
– RBAC solutions exist for sophisticated management of these roles,
such as hierarchies, overlaps, and transience
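A minimal sketch of hierarchical RBAC, assuming three made-up roles in which each role inherits the permissions of its parents. The role and permission names are illustrative only and do not describe an existing IBM configuration.

```python
# Hypothetical roles and the permissions granted directly to each.
ROLE_PERMISSIONS = {
    "viewer": {"read_health"},
    "operator": {"power_cycle"},
    "firmware_admin": {"update_firmware"},
}

# Role hierarchy: each role inherits everything its parents may do.
ROLE_PARENTS = {
    "operator": ["viewer"],
    "firmware_admin": ["operator"],
}


def effective_permissions(role, seen=None):
    """Collect a role's own permissions plus those inherited from parents."""
    seen = set() if seen is None else seen
    if role in seen:  # guard against cycles in the hierarchy
        return set()
    seen.add(role)
    perms = set(ROLE_PERMISSIONS.get(role, ()))
    for parent in ROLE_PARENTS.get(role, ()):
        perms |= effective_permissions(parent, seen)
    return perms


def is_allowed(user_roles, permission):
    """A user may act if any of their (possibly overlapping) roles allows it."""
    return any(permission in effective_permissions(r) for r in user_roles)


if __name__ == "__main__":
    print(is_allowed(["operator"], "read_health"))   # inherited from viewer
    print(is_allowed(["viewer"], "power_cycle"))     # not granted
```

Transient roles, also mentioned on the slide, could be modeled by attaching an expiry time to each role assignment; that is left out here to keep the sketch short.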
Slide 13/18
Solution Directions for Question 2
2. Layered systems management to flat hierarchy, OR
Crowdsourcing systems management
II. Overlap between interests of multiple mobile devices and
their geographical proximity
• Commonalities of interest can be used to cut down on
cellular bandwidth usage
– Commonalities can exist due to proximal geographic location or
overlap among system administration responsibilities
– Distribute information to a subset of mobile devices and then use
local communication (Bluetooth, Wi-Fi) to disseminate information
among proximal devices
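The dissemination scheme above can be sketched as a delivery plan: among the devices interested in a topic, pick one per location to receive the update over the cellular link, and let it relay to its co-located peers over Bluetooth or Wi-Fi. The device fields and the rule for picking the relay (lowest id) are assumptions for this example; a real system might choose by battery level or signal strength instead.

```python
from collections import defaultdict


def plan_delivery(devices, topic):
    """Map each cellular recipient to the peers it should relay to locally."""
    groups = defaultdict(list)
    for d in devices:
        if topic in d["interests"]:
            groups[d["location"]].append(d["id"])
    plan = {}
    for ids in groups.values():
        ids.sort()
        relay, peers = ids[0], ids[1:]  # assumed rule: lowest id relays
        plan[relay] = peers
    return plan


if __name__ == "__main__":
    devices = [
        {"id": "a", "location": "lab-1", "interests": {"bc-7"}},
        {"id": "b", "location": "lab-1", "interests": {"bc-7", "bc-9"}},
        {"id": "c", "location": "home", "interests": {"bc-7"}},
        {"id": "d", "location": "lab-1", "interests": {"bc-9"}},
    ]
    plan = plan_delivery(devices, "bc-7")
    print(plan)
    print("cellular messages:", len(plan))
```

In this example three devices care about the topic, but only two cellular messages are sent, because the two devices in the same lab share one.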
Slide 14/18
Case Study: IBM Remote
• Health view (left) broken into critical, non-critical, and
system-level health messages
• Event log view (right) is filtered to show only warnings
and errors
Slide 15/18
Related Work
• Much work on managing mobile devices – the opposite
direction from what we are discussing in this paper
– Some work on mobile agents for managing servers [18 – NOMS02,
19 – Software07]
– Sophistication lies in designing a dynamic set of agents whose
monitoring policies can be changed on the fly
• Some commercial prototypes for monitoring and control of
target end points from mobile devices
– UCSand for Android devices [21] for Cisco Unified Systems
monitoring and control
– PCMonitor [22] from MMSoft Design Ltd.
– VMware vCenter Mobile Access [23] is a virtual appliance on the
server side for managing a datacenter from mobile devices
– Recent offering from HP [18]
Slide 16/18
Take-away Lessons
• A changed vision of systems management is happening –
mobile clients being used to manage large masses of
physical and virtual servers
• This opens up some technical challenges
1. Management to be done through resource-constrained
mobile devices which have lower dependability than
target devices
2. Crowd-sourcing of systems management, rather than
linear flow of control through hierarchies of sysadmins
• These challenges are being addressed in multiple projects
at commercial organizations, including in the IBM Remote
project at IBM Research
Slide 17/18
Presentation available at:
Dependable Computing Systems Lab (DCSL)
web site
engineering.purdue.edu/dcsl
Slide 18/18