Transcript Network Management Session 1 Network Basics
COMP3122 Network Management
Richard Henson April 2012
Week 11 – Troubleshooting & Optimisation
Learning Objectives: – Explain the principles of troubleshooting as a means of mitigating against failure – Use the various tools available on a named operating system to identify potential faults and problems – Take appropriate action to stop a fault becoming a failure
“A stitch in time saves nine”
Business - Worst Possible Scenario (1)
There is an interruption in the power supply – UPS is invoked – the interruption continues… – servers all have to be shut down Power supply restored… – but main domain controller doesn’t reboot – no other domain controllers therefore connect to it – the domain tree fails
Business - Worst Possible Scenario (2) Organisation cannot do business with the network down… – server can’t be persuaded to boot – new main domain controller has to be commissioned – whole directory tree has to be rebuilt!!!
– word spreads very rapidly… Business loses so much custom, trust, and credibility that even when it starts doing business again customers choose to go elsewhere – without a flourishing customer base… the business folds
Analysis: This scenario shouldn’t have occurred…
Unlikely that the server would fail to boot without prior warning… – warnings would have been presented… – but were clearly not acted upon!
Disaster recovery plan!?!
– not formulated? – not tested?
– not effective (in the event of a domain tree controller failure…)
But it does…
Actual example (15th Feb 2010): – root domain controller [on the network] had not been backed up for 10 months when it crashed (well… at least it had been backed up at some time…) – http://searchwindowsserver.techtarget.com/generic/0,295582,sid68_gci1381567,00.html
The consultant called in to fix it reported that: – “I had never seen a case where the forest root domain had to be recovered -- and I couldn't find anyone who had.”
Analysis: Who is to blame? (1)
In this example, the organisation said they were following Microsoft guidelines – they set up an empty root domain – the root domain controller had a RAID-5 disk configuration
This was true, to some extent – Microsoft did espouse this as best practice… in the year 2000!
– guidelines had changed since then…
Analysis: Who is to blame? (2)
The disaster that struck was: – two RAID drives failed on the same day!
– unlucky? possible to prepare for this?
The recovery process took about three weeks – most of the time was spent studying logs, doing the restore, etc. In this case, the tree was still able to function without a root domain – business was able to continue – customer base wasn’t compromised…
Fault Tolerance and Risk Assessment
General “common sense” principle: – always have a backup – ESPECIALLY for the most important computer on the network…
Q: – How can you tell what needs backing up?
A: – Risk Assessment and Risk Management
Why not Risk Management?
Time consuming!
However, without proper risk management… – how does the organisation know what processes are most important to its functioning?
– how can an organisation provide resources to protect aspects of its network?
Risk Management and Risk Assessment
Risk Assessment is an essential first step – requires putting a “value” on assets – more valuable… greater protection Do information assets have value?
– organisations still failing to acknowledge that they do… – categorisation of information assets therefore potentially problematic – need to look at the consequence to the organisation of losing that asset…
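One simple way to put a "value" on assets is to score each one by how likely its loss is and how severe the consequence would be, then rank by the product. The sketch below assumes a 1 to 5 scale for both; the asset names and scores are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    likelihood: int   # 1 (rare) to 5 (almost certain); assumed scale
    consequence: int  # 1 (minor) to 5 (business folds); assumed scale

    @property
    def risk(self) -> int:
        # Simple likelihood x consequence score
        return self.likelihood * self.consequence


def rank_by_risk(assets):
    """Return assets sorted from highest to lowest risk score."""
    return sorted(assets, key=lambda a: a.risk, reverse=True)


# Hypothetical assets and scores, for illustration only
ASSETS = [
    Asset("root domain controller", likelihood=2, consequence=5),
    Asset("staff workstation", likelihood=4, consequence=1),
    Asset("customer database", likelihood=3, consequence=5),
]

for asset in rank_by_risk(ASSETS):
    print(f"{asset.risk:>2}  {asset.name}")
```

Assets at the top of the list would be the first candidates for backup and other protection.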
How do you back up a Domain Controller?
The Windows “Backup” program works, and can easily be scheduled – but heavily criticised… – even the 2008 server version… Third Party products give more flexibility and protection e.g. : – Recovery Manager » http://www.quest.com/recovery-manager-for-active-directory – Backup Exec » http://www.symantec.com/business/products/family.jsp?familyid=backupexec
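Whichever backup product is used, it is worth checking that backups are actually being produced. A minimal sketch, assuming the backup job writes a file whose modification time reflects the last successful run (the one-day limit is an arbitrary choice for illustration):

```python
import os
import time


def backup_age_days(path, now=None):
    """Return how many days ago the file at `path` was last modified."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) / 86400


def is_stale(path, max_age_days=1.0, now=None):
    """True if the backup file is older than `max_age_days`."""
    return backup_age_days(path, now) > max_age_days
```

A scheduled call such as `is_stale(r"D:\Backups\dc01.bkf")` (a hypothetical path) could then raise a warning long before a backup is 10 months out of date.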
Prevention is Better than Cure
A server shouldn’t crash unexpectedly!
– should be kept cool (environmental unit mustn’t break down!) – monitoring should show that unexpected things are happening – action can then (usually) be taken to take care of the unexpected Many tools available to: – Check/monitor the system on a regular basis – Provide stats to administrators » could also be used for security purposes – Generate alerts if something is starting to go wrong…
Troubleshooting Tools for a Windows Server: Task Manager Applications tab: – shows which applications are running – enables changing of process priority » use view/update speed – can be used to » open new applications » shut rogue applications down
Task Manager (continued)
Processes tab: – all system processes – Memory usage of each – % CPU time for each – total CPU time since boot up – also used to close a process down » careful! (but you get a warning…)
Task Manager (continued)
Performance tab: – total no. of threads, processes, handles running – Graph: % CPU usage » User mode » Kernel mode (optional: view menu) » graph per CPU (optional: view menu) – physical memory available/usage – virtual (Page File) memory available/usage
Event Viewer
Events recorded into “event log” files – System log – Auditing log (customisable) – Application log – customisable - additional files New files recorded daily; old ones archived – time before archiving also customisable
Event Viewer
Three types of events recorded in log: – Information – Warning – Error More information on each event obtained by double-clicking – make note of event code – heed and take action if necessary
Using Event Viewer
Wise to check all event logs regularly – take time/trouble to find out what those messages really mean… Then take whatever action is needed – sort out potential problems now – make sure they don’t become real ones later…
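Regular checking is easier if the warnings and errors are pulled out of the log automatically. The sketch below assumes a log has been saved from Event Viewer as CSV; the column names (`Level`, `Source`, `Event ID`) and the sample rows are assumptions for illustration and may differ from a real export.

```python
import csv
import io

# Hypothetical sample in the shape of a CSV export; column names are assumed
SAMPLE = """Level,Date and Time,Source,Event ID
Information,01/03/2012 09:00:01,Service Control Manager,7036
Warning,01/03/2012 09:05:12,Disk,51
Error,01/03/2012 09:07:45,Disk,11
Warning,01/03/2012 10:15:00,Disk,51
"""


def events_needing_attention(csv_text, levels=("Warning", "Error")):
    """Return the rows whose Level is one of `levels`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["Level"] in levels]


def count_by_event(rows):
    """Count occurrences of each (Source, Event ID) pair."""
    counts = {}
    for row in rows:
        key = (row["Source"], row["Event ID"])
        counts[key] = counts.get(key, 0) + 1
    return counts
```

A repeated warning from the same source, as in the sample, is exactly the kind of prior warning that should be acted on before it becomes a failure.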
Auditing Further Events
Any “object” can be audited Objects to audit, and processes audited can be set through audit (group) policy – Using MMC & relevant snap-in Types of process audited: – access – attempt to access
Security auditing
Same principles as general auditing Refers to “restricted” objects Events appear in separate security log
Event Management software (SIEM)
Who’s going to look at all these log files?
– in practice, often no-one..
Solution – SIEM software to analyse and present information from: – network and security devices – identity & access management applications – vulnerability management/policy compliance tools – OS, database & application logs – external threat data http://www.focus.com/briefs/how-select-security-information-and-event-management-siem
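The core idea of pulling several logs into one view can be sketched in a few lines. This is not how any particular SIEM product works; it assumes each source has already been parsed into time-ordered tuples, and all of the events shown are invented.

```python
import heapq
from collections import Counter

# Hypothetical, already-parsed events: (ISO timestamp, source, severity, message)
FIREWALL = [
    ("2012-04-02T09:00:05", "firewall", "warning", "port scan from 10.0.0.9"),
    ("2012-04-02T09:03:10", "firewall", "error", "blocked login flood"),
]
OS_LOG = [
    ("2012-04-02T09:01:00", "os", "info", "service started"),
    ("2012-04-02T09:03:30", "os", "error", "account locked out"),
]


def merge_sources(*sources):
    """Merge per-source event lists (each sorted by time) into one timeline."""
    return list(heapq.merge(*sources, key=lambda event: event[0]))


def severity_summary(events):
    """Count events per severity across all sources."""
    return Counter(event[2] for event in events)
```

Seeing the firewall error and the account lockout side by side in one timeline is the kind of correlation that is easy to miss when each log is read separately.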
Other Troubleshooting Resources
NT Diagnostics (winmsd.exe)
– hardware & operating system data from registry
Performance Monitor
– Can monitor many aspects of system performance – Either display current data graphically, in real-time – or log data at regular intervals to get a longer term picture – Useful role in
system optimisation
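Logging at regular intervals can be illustrated outside Performance Monitor itself. The sketch below is a stand-in, not the Performance Monitor API: it samples any reading function at a fixed interval and returns the readings as CSV text. Disk usage is used as the example counter only because it is available from the Python standard library.

```python
import csv
import io
import shutil
import time


def disk_used_percent(path="."):
    """Percentage of the disk holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def log_samples(sampler, samples=3, interval_s=1.0):
    """Call `sampler` `samples` times, `interval_s` apart; return CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["timestamp", "value"])
    for i in range(samples):
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        writer.writerow([stamp, f"{sampler():.1f}"])
        if i < samples - 1:
            time.sleep(interval_s)  # wait before taking the next reading
    return buffer.getvalue()
```

For example, `log_samples(disk_used_percent, samples=60, interval_s=60)` would give an hour of readings, enough to start seeing a trend rather than a single snapshot.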
Other Troubleshooting Resources
Network Monitor
– captures, filters, or analyses frames or packets sent over the network
Alerts
– notify administrator when a particular threshold value has been reached
System Recovery
– if a fatal error occurs: » a dump of system memory is made, and can be used for identifying the cause of the problem » alerts are sent to users » system is restarted automatically
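The threshold idea behind Alerts can be sketched as follows. To avoid alerting on a single brief spike, this version assumes an alert should fire only when the threshold has been exceeded for several consecutive samples; that rule, and the readings, are assumptions for illustration.

```python
def find_alerts(samples, threshold, sustained=3):
    """Return the indexes at which `threshold` has just been exceeded
    for `sustained` consecutive samples."""
    alerts = []
    run = 0  # length of the current run of over-threshold samples
    for index, value in enumerate(samples):
        run = run + 1 if value > threshold else 0
        if run == sustained:
            alerts.append(index)
    return alerts


# Hypothetical % CPU readings taken at regular intervals
CPU = [35, 92, 95, 40, 91, 93, 97, 99, 50]
print(find_alerts(CPU, threshold=90))  # prints [6]
```

The first burst (two samples over 90%) is ignored; the second is reported once, at the third consecutive high reading.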
Performance Monitor
For Windows 2003 Server, but not available on disk. To obtain and download Performance Monitor Wizard (PerfWiz), visit the following Web site: –
http://www.microsoft.com/downloads/details.aspx?FamilyID=31fccd98-c3a1-4644-9622-faa046d69214&displaylang=en
What if the machine doesn’t boot…
Tools available: – The boot error itself » blue screen? driver software » constant reboot? motherboard – Last Known Good… » Gives machine a chance to go back to the previous (usually last but one) configuration
What if the machine doesn’t boot… (continued)
Safe Mode – includes VGA Mode or boot logging – Debugging mode also available » output difficult to decipher for non experts Recovery Console – “DOS-type prompt” for performing minor repairs
What if the machine doesn’t boot… (continued) System Configuration Utility (Msconfig.exe) – automates the routine troubleshooting steps relating to Windows configuration issues – can be used to modify the system configuration and troubleshoot the problem using a process-of-elimination method
What if the machine doesn’t boot… (continued)
Emergency Repair Disk (ERD) – reboot machine using different media » e.g. floppy disk – media should be generated BEFORE it needs to be used!
– option to create the ERD during the set up process…
What if the machine doesn’t boot… (continued)
Full restore – assumes a full backup has already been made – still have to: » reformat hard disk from scratch… » and then restore the backup files using backup/restore option….
– but better than losing all your data!
Network Troubleshooting Chart – 1
Identify the problematic network node
Is there a problem with one of the network protocols?
– Use commands such as PING & TraceRt
– Isolate the problem to a protocol layer and fix it
Is there a memory problem?
– Is there a memory leak?
» Fix or eliminate the software with the memory leak
– Is there sufficient memory?
» Add more memory
URL: http://teamapproach.ca/trouble
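The PING and TraceRt checks in this chart can be scripted so that they are run the same way every time. A minimal sketch: the helper names are hypothetical, a count of four echo requests is an arbitrary choice, and an exit status of 0 is treated as success, which is only an approximation of "the node is reachable".

```python
import platform
import subprocess


def build_command(tool, host, system=None):
    """Build the ping or trace-route command line for the given OS."""
    system = system or platform.system()
    windows = system == "Windows"
    if tool == "ping":
        # Windows uses -n for the echo count; Unix-like systems use -c
        return ["ping", "-n" if windows else "-c", "4", host]
    if tool == "trace":
        # The command is tracert on Windows, traceroute elsewhere
        return ["tracert" if windows else "traceroute", host]
    raise ValueError(f"unknown tool: {tool}")


def run(tool, host, timeout_s=60):
    """Run the command; return (exit status was 0, captured output)."""
    result = subprocess.run(
        build_command(tool, host),
        capture_output=True, text=True, timeout=timeout_s,
    )
    return result.returncode == 0, result.stdout
```

For example, `run("ping", "192.168.0.1")` (a hypothetical address) checks the node itself, and `run("trace", "192.168.0.1")` shows where along the route replies stop.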
Network Troubleshooting Chart – 2
Does the system freeze?
– Investigate priority and device driver problems
Is there high processor utilization?
– Is it caused by hardware or software?
» hardware: Can an upgraded device driver fix the problem?
– Provide adequate processor resources
– Upgrade your hardware to offload the processor
Network Troubleshooting Chart – 3
Is there a disk problem?
– Is there sufficient file cache?
» Add more memory to ensure sufficient cache
– Use NTFS and do regular maintenance
– Use RAID
Is there a boot record problem?
– Use FixBoot or FixMBR from the recovery console
Network Troubleshooting Chart – 4
Is there a network problem?
– Use Network Monitor to identify top broadcasters
» Eliminate unnecessary broadcasts
– Use Network Monitor to identify top talkers
» Eliminate unnecessary network traffic
– Correct poor configuration
– Reorganize & upgrade network for more capacity
Is there an address or name resolution problem?
– Examine ARP cache, WINS, DNS, and NBTStat
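A quick first test for a name resolution problem is to ask the operating system's resolver what a name maps to. The sketch below uses only the Python standard library; it does not examine the ARP cache or WINS, and the function name is hypothetical.

```python
import socket


def resolve(name):
    """Return the sorted, de-duplicated addresses that `name` resolves to,
    or an empty list if resolution fails."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # the address is the first element of sockaddr.
    return sorted({info[4][0] for info in infos})
```

If `resolve("server01")` (a hypothetical host name) returns an empty list while pinging the server's IP address succeeds, the fault is more likely to lie in name resolution than in connectivity.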
Optimisation…
All about improving the performance of system resources… A network manager should never have “nothing to do…”