Introduction - Northern Kentucky University

CIT 470: Advanced Network and System Administration
Upgrades and Maintenance
Topics
1. Upgrade Procedure
2. Maintenance Windows
3. Service Conversions
4. Centralization and de-centralization
Upgrades
1. Develop a service checklist.
2. Verify each software package will work with new OS or plan upgrade.
3. Develop test for each service.
4. Write a back-out plan.
5. Select a maintenance window.
6. Announce upgrade.
7. Lock out users.
8. Do upgrade.
9. Perform tests.
10. Communicate success or back out.
11. Let users back in.
Service Checklist
1. What services are provided by server?
2. Who are the customers of each service?
3. What package provides each service?
4. What other services depend on server?
Verify Software Compatibility
• Don’t trust the vendor.
• Test the software yourself.
• What if the software isn’t compatible?
– Upgrade to release supported by both OSes.
– Upgrade to release supported by new OS only.
– No upgrade path—don’t upgrade OS or migrate
service to a VM running old OS.
Verification Tests
Automate tests with script.
– Script compares actual and expected output.
– Prints OK or FAIL for each test.
– You’ll upgrade server or OS more than once.
Tests can be simple.
– “Hello world” program for compiler.
– Use netcat to send text message to server.
Some services come with tests.
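A minimal sketch of such a test script in Python. The test list is a hypothetical placeholder, not part of the course material: each entry runs a command, compares its actual output with the expected output, and prints OK or FAIL.

```python
import subprocess
import sys

def run_test(name, command, expected):
    """Run command, compare its stdout with expected, print OK or FAIL."""
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=30)
        actual = result.stdout.strip()
    except (OSError, subprocess.TimeoutExpired):
        actual = None  # command missing or hung: counts as a failure
    passed = (actual == expected)
    print(f"{'OK  ' if passed else 'FAIL'} {name}")
    return passed

# Hypothetical service checklist: (test name, command, expected output).
TESTS = [
    ("python interpreter",
     [sys.executable, "-c", "print('Hello world')"],
     "Hello world"),
]

def run_all(tests):
    """Run every test (no short-circuit) and report whether all passed."""
    results = [run_test(*test) for test in tests]
    return all(results)

all_passed = run_all(TESTS)
```

Because the same script is run before and after every upgrade, a check for another service is just one more entry in the list.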
Back-Out Plan
• Back-out plan must be quick enough to
perform within maintenance window.
• Upgrade strategies that support back-out
– Clone disks, perform upgrade on clones.
– Clone system disks, back up data disks.
– Use snapshot capability of virtual machines.
– Install upgrade on new server hardware.
Select Maintenance Window
When?
– Evening or weekend.
– Vendor support may be unavailable.
How long?
– t(upgrade) + t(testing) + t(debug) + t(back-out)
– Multiply the total by 2, because you probably underestimated.
What time will back-out plan be initiated?
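A worked example of the window arithmetic, as a Python sketch. The durations and start time are made-up illustration values, and the rule for the back-out start time (window end minus the doubled back-out time) is an assumption, not something the slides specify.

```python
from datetime import datetime, timedelta

def window_length(upgrade, testing, debug, back_out, safety_factor=2):
    """(t(upgrade) + t(testing) + t(debug) + t(back-out)) x safety factor."""
    return (upgrade + testing + debug + back_out) * safety_factor

def back_out_deadline(window_end, back_out, safety_factor=2):
    """Assumed rule: latest start that still leaves padded time to back out."""
    return window_end - back_out * safety_factor

# Hypothetical estimates for one upgrade.
upgrade = timedelta(minutes=60)
testing = timedelta(minutes=30)
debug = timedelta(minutes=45)
back_out = timedelta(minutes=30)

length = window_length(upgrade, testing, debug, back_out)
start = datetime(2024, 1, 6, 18, 0)   # hypothetical window start
end = start + length
deadline = back_out_deadline(end, back_out)

print("Window length:", length)
print("Window ends:", end)
print("Start back-out no later than:", deadline)
```

With these numbers the raw estimate is 165 minutes, so the window is 5 hours 30 minutes long.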
Announcements
• Brief, direct, always use the same format.
• What you need to communicate:
– Who is affected
– What will happen
– When
– Why
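One way to keep announcements in the same format every time is a fixed template. The sketch below is only an illustration in Python; the heading text and the example values are hypothetical.

```python
# Fixed layout so every announcement looks the same (layout is an assumption).
TEMPLATE = """\
MAINTENANCE NOTICE
Who is affected: {who}
What will happen: {what}
When: {when}
Why: {why}
"""

def format_announcement(who, what, when, why):
    """Fill the fixed template with the four required pieces of information."""
    return TEMPLATE.format(who=who, what=what, when=when, why=why)

announcement = format_announcement(
    who="All users of the mail server",
    what="Mail will be unavailable",
    when="Saturday 18:00 to 23:30",
    why="Operating system upgrade",
)
print(announcement)
```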
Test, Upgrade, Test
1. Perform verification tests before upgrade.
2. Perform the upgrade.
3. Repeat the tests.
4. Be sure a customer can access system too.
5. Back-out if the tests fail.
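The sequence above can be sketched as a small Python driver. The three callables are hypothetical stand-ins for site-specific scripts, and refusing to start when the pre-upgrade tests already fail is an assumption about how the first step would be used.

```python
def upgrade_with_tests(run_tests, do_upgrade, back_out):
    """Test, upgrade, test again; back out if the second test run fails.

    run_tests() returns True when every verification test passes and should
    include a check made from a customer's point of view.
    """
    if not run_tests():
        # Assumption: a failing baseline means the upgrade is not attempted.
        return "not started"
    do_upgrade()
    if run_tests():
        return "success"
    back_out()
    return "backed out"

# Usage with stand-in functions that only record what was called.
calls = []
outcome = upgrade_with_tests(
    run_tests=lambda: True,
    do_upgrade=lambda: calls.append("upgrade"),
    back_out=lambda: calls.append("back-out"),
)
print(outcome, calls)
```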
Success or Failure
• Communicate success or failure.
• Be short.
• Provide contact in case something is broken.
Disabling Services
• Follow the same procedures as upgrade.
• Be certain no one is still using service.
– Check lists.
– Use network sniffer to check for traffic.
• Disable service so that it’s easy to re-enable.
– Don’t delete software until grace period passed.
– Back up software before deletion.
Upgrade Tips
• Don’t make two changes at the same time, as
it makes debugging much more difficult.
• Practice the upgrade beforehand on a spare
machine or VM.
Maintenance Windows
Scheduled for time-consuming changes.
– Multiple sysadmins changing different systems.
– Large-scale data migration.
– Shutting down services with many dependents.
– Hardware changes: AC, re-wiring.
– Moving to another data center.
Evening, day, or weekend duration.
Scheduling
• Coordinate with rest of organization.
• Avoid end of month, quarter, year.
• Schedule far in advance.
• Plan upgrade beforehand.
Flight Director
• Single person responsible.
• Send out announcements.
• Approve and schedule work proposals.
– Ensure that workers don’t conflict with each other.
• Monitor progress during window.
– Ensure that testing is performed.
– Decide if and when back-out should be initiated.
• Communicate success or failure at end.
Change Proposals
1. What changes are going to be made?
2. What machines will be affected?
3. What are the premaintenance dependencies?
4. What needs to be up for change to happen?
5. Who is performing the work?
6. How long will the change take?
7. What are the test procedures?
8. What is the back-out procedure?
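The eight questions map naturally onto a record with one field per answer. The Python sketch below is an illustration only: the field names and the two helper functions are hypothetical. One helper lists unanswered questions; the other flags proposals that touch the same machine so the work can be sequenced.

```python
from dataclasses import dataclass, fields
from itertools import combinations

@dataclass
class ChangeProposal:
    changes: str              # 1. what changes are going to be made
    machines: frozenset       # 2. what machines will be affected
    pre_dependencies: str     # 3. premaintenance dependencies ("none" if none)
    must_be_up: str           # 4. what needs to be up for the change
    performed_by: str         # 5. who is performing the work
    duration_minutes: int     # 6. how long the change will take
    test_procedure: str       # 7. test procedures
    back_out_procedure: str   # 8. back-out procedure

def unanswered(proposal):
    """Names of fields left empty, so incomplete proposals can be returned."""
    return [f.name for f in fields(proposal) if not getattr(proposal, f.name)]

def shared_machines(proposals):
    """Pairs of proposals touching the same machine (possible conflicts)."""
    conflicts = []
    for a, b in combinations(proposals, 2):
        overlap = a.machines & b.machines
        if overlap:
            conflicts.append((a.changes, b.changes, sorted(overlap)))
    return conflicts

# Hypothetical proposals.
p1 = ChangeProposal("Upgrade OS", frozenset({"mail1"}), "none", "network",
                    "alice", 60, "run mail tests", "boot cloned disk")
p2 = ChangeProposal("Replace disk", frozenset({"mail1", "db1"}), "none",
                    "console servers", "bob", 45, "", "reinstall old disk")

print(unanswered(p2))
print(shared_machines([p1, p2]))
```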
Master Plan
• Takes into account
– Dependencies (people, services, hardware)
– Resources (people, time, hardware)
• Need slack in schedule for when things go
wrong.
Disabling Access
• Disable all access at start of window
– Place notices on doors.
– Disable remote access.
– Announce over PA system.
– Helpdesk voicemail message.
• Prevents people from using systems during
maintenance and causing inconsistencies or
accidental loss of data.
Shutdown/Boot Sequence
• Proper sequence to ensure that all systems
shut down or boot cleanly.
• Takes into account dependencies
– Network
– Console servers
– DNS
– Authentication
– License servers
– File services
– Database servers
– Web and other application servers
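A boot sequence that respects dependencies can be derived with a topological sort, and the shutdown sequence is its reverse. The Python sketch below (3.9 or later, for graphlib) uses a hypothetical dependency map; the real dependencies differ from site to site.

```python
from graphlib import TopologicalSorter

# Hypothetical map: each service lists what must be up before it starts.
DEPENDS_ON = {
    "network": [],
    "console servers": ["network"],
    "dns": ["network"],
    "authentication": ["dns"],
    "license servers": ["dns", "authentication"],
    "file services": ["authentication"],
    "database servers": ["file services"],
    "web and application servers": ["database servers", "license servers"],
}

def boot_order(depends_on):
    """Order in which every service starts after the services it needs."""
    return list(TopologicalSorter(depends_on).static_order())

def shutdown_order(depends_on):
    """Reverse of the boot order: dependents stop before what they rely on."""
    return list(reversed(boot_order(depends_on)))

print("Boot:", boot_order(DEPENDS_ON))
print("Shutdown:", shutdown_order(DEPENDS_ON))
```

Several valid orders usually exist; any order the sort produces starts each service after its dependencies.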
Deadlines
Each change must be completed by deadline.
– Back-out if change cannot be completed.
– Ensures that dependent tasks won’t get started if
they cannot be completed.
System Testing
• Verification tests for each upgrade.
• Whole system tests to ensure everything
works together before end of window.
• Shut down and restart all systems.
Completion
• Postmaintenance announcement.
– Write this before the window starts.
• Re-enable remote access.
• Be available early the next morning to ensure
that problems are detected and fixed quickly.
Postmortem
• Meeting after all problems fixed.
• Review maintenance window
– What went wrong?
– What went right?
– How can future windows go better?
• Data collection
– How long does it really take to upgrade?
– Track historical trends.
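A small Python sketch of how the collected data might be used: compare estimated and actual durations from past windows and scale the next estimate by the average overrun. The history records are invented example values, and scaling by the mean is just one possible approach.

```python
from statistics import mean

# Hypothetical records: (window label, estimated minutes, actual minutes).
HISTORY = [
    ("2023-Q1", 120, 200),
    ("2023-Q2", 90, 150),
    ("2023-Q3", 150, 180),
]

def overrun_factors(history):
    """Ratio of actual to estimated time for each past window."""
    return [actual / estimated for _, estimated, actual in history]

def adjusted_estimate(raw_estimate, history):
    """Scale a new estimate by the average historical overrun."""
    return raw_estimate * mean(overrun_factors(history))

print(f"Average overrun: {mean(overrun_factors(HISTORY)):.2f}x")
print(f"Adjusted estimate for a 100-minute job: "
      f"{adjusted_estimate(100, HISTORY):.0f} minutes")
```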
High Availability Sites
What’s high availability?
– 99.9% (9 hours per year downtime)
– 99.99% (1 hour per year)
– 99.999% (5 minutes per year)
– 99.9999% (<1 minute per year)
What’s different during maintenance?
– Redundant systems.
– No full shutdown/reboots.
– Availability must be closely monitored.
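The downtime figures follow directly from the availability percentage. A short Python sketch of the arithmetic, assuming a 365-day year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year

def downtime_minutes_per_year(availability_percent):
    """Allowed downtime per year for a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for pct in (99.9, 99.99, 99.999, 99.9999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% -> {minutes:.2f} minutes per year")
```

For 99.9% this gives about 525.6 minutes, or 8.76 hours, which the list above rounds to 9 hours.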
Service Conversions
• Replacing existing service with a new service.
• One, some, many procedure.
• Communicate change to customers.
• Minimize service downtime.
Layers vs. Pillars
Layers
– Perform one task for all customers at once.
– Then move onto next task.
– Better for non-intrusive tasks
Pillars
– Perform all tasks for each customer.
– Then move onto next customer.
– Better for intrusive tasks, as it reduces the number of intrusions.
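The difference between the two approaches is the nesting order of two loops. The Python sketch below uses hypothetical task and customer names, and counts an intrusion each time the work moves to a different customer, which is an assumed way of measuring it.

```python
def layers(tasks, customers):
    """One task for all customers, then the next task."""
    return [(task, customer) for task in tasks for customer in customers]

def pillars(tasks, customers):
    """All tasks for one customer, then the next customer."""
    return [(task, customer) for customer in customers for task in tasks]

def intrusions(schedule):
    """Count visits: a new intrusion starts whenever the customer changes."""
    count = 0
    previous = None
    for _, customer in schedule:
        if customer != previous:
            count += 1
            previous = customer
    return count

# Hypothetical conversion work.
tasks = ["install client", "migrate data", "remove old client"]
customers = ["alice", "bob"]

print("Layers intrusions:", intrusions(layers(tasks, customers)))
print("Pillars intrusions:", intrusions(pillars(tasks, customers)))
```

With three tasks and two customers, layers interrupts customers six times and pillars only twice.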
Avoid Flash Cuts
• Avoid converting everyone at once.
• Convert willing test subjects first.
• Make both services available simultaneously.
– Customers can try new service, get used to it.
– Return to old service if they experience problems.
• Sometimes a flash-cut is the only solution.
– Careful planning.
– Comprehensive testing.
– Back-out plan.
Centralization
• Single, central focus of control.
• Centralize distributed systems.
– Distributed systems can be complex.
– Multiple servers, one point of control.
• Centralize administration
– Single point of contact to get IT help.
– Consolidate expertise.
• Centralize infrastructure decisions
– Volume purchasing discounts.
– One PC model = easy to repair, keep spare parts.
De-Centralization
Fault Tolerance
– Systems work even when WAN is down.
– Distributed systems can solve this too.
Customization
– Some groups need customized software/hardware.
– One size never fits all customers.