Network Management Session 1 Network Basics

COMP3122 Network Management

Richard Henson April 2012

Week 11 – Troubleshooting & Optimisation

Learning Objectives:
– Explain the principles of troubleshooting as a means of mitigating against failure
– Use the various tools available on a named operating system to identify potential faults and problems
– Take appropriate action to stop a fault becoming a failure

“A stitch in time saves nine”

Business - Worst Possible Scenario (1)

There is an interruption in the power supply
– UPS is invoked
– the interruption continues…
– servers all have to be shut down

Power supply restored…
– but main domain controller doesn’t reboot
– no other domain controllers therefore connect to it
– the domain tree fails

Business - Worst Possible Scenario (2)

Organisation cannot do business with the network down…
– server can’t be persuaded to boot
– new main domain controller has to be commissioned
– whole directory tree has to be rebuilt!!!
– word spreads very rapidly…

Business loses so much custom, trust, and credibility that even when it starts doing business again customers choose to go elsewhere
– without a flourishing customer base…

the business folds

Analysis: This scenario shouldn’t have occurred…

Unlikely that the server would fail to boot without prior warning…
– warnings would have been presented…
– but were clearly not acted upon!

Disaster recovery plan!?!
– not formulated?
– not tested?
– not effective (in the event of a domain tree controller failure…)

But it does…

Actual example (15th Feb 2010):
– root domain controller [on the network] had not been backed up for 10 months when it crashed (well… at least it had been backed up at some time…)
– http://searchwindowsserver.techtarget.com/generic/0,295582,sid68_gci1381567,00.html

The consultant called in to fix it reported that:
– “I had never seen a case where the forest root domain had to be recovered -- and I couldn't find anyone who had.”

Analysis: Who is to blame? (1)

In this example, the organisation said they were following Microsoft guidelines
– they set up an empty root domain
– the root domain controller had a RAID-5 disk configuration

This was true, to some extent
– Microsoft did espouse this as best practice… in the year 2000!
– guidelines had changed since then…

Analysis: Who is to blame? (2)

The disaster that struck was:
– two RAID drives failed on the same day!
– unlucky? possible to prepare for this?

The recovery process took about three weeks
– most of the time was spent studying logs, doing the restore, etc.

In this case, the tree was still able to function without a root domain
– business was able to continue
– customer base wasn’t compromised…

Fault Tolerance and Risk Assessment

General “common sense” principle:
– always have a backup
– ESPECIALLY for the most important computer on the network…

Q: How can you tell what needs backing up?

A: Risk Assessment and Risk Management

Why not Risk Management?

 Time consuming!

 However, without proper risk management… – how does the organisation know what processes are most important to its functioning?

– how can an organisation provide resources to protect aspects of its network?

Risk Management and Risk Assessment

Risk Assessment is an essential first step
– requires putting a “value” on assets
– more valuable… greater protection

Do information assets have value?
– organisations still failing to acknowledge that they do…
– categorisation of information assets therefore potentially problematic
– need to look at the consequence to the organisation of losing that asset…
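The idea of putting a “value” on assets and giving the more valuable ones greater protection can be sketched in a few lines of Python. This is only an illustration: the 1 to 5 scales, the asset names, and the impact × likelihood scoring are assumptions made for the example, not a method taken from these slides.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    loss_impact: int  # assumed scale: 1 (minor) to 5 (business-critical)
    likelihood: int   # assumed scale: 1 (rare) to 5 (frequent)


def risk_score(asset: Asset) -> int:
    """Simple illustrative score: consequence of loss times likelihood."""
    return asset.loss_impact * asset.likelihood


def rank_by_risk(assets):
    """Return the assets ordered from highest to lowest risk score."""
    return sorted(assets, key=risk_score, reverse=True)


# Hypothetical assets, for illustration only
assets = [
    Asset("root domain controller", loss_impact=5, likelihood=2),
    Asset("print server", loss_impact=2, likelihood=3),
    Asset("customer database", loss_impact=5, likelihood=3),
]

for asset in rank_by_risk(assets):
    print(f"{asset.name}: {risk_score(asset)}")
```

The assets at the top of the ranking are the ones that most deserve backup and protection resources.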

How do you back up a Domain Controller?

The Windows “Backup” program works, and can easily be scheduled
– but heavily criticised…
– even the 2008 server version…

Third Party products give more flexibility and protection, e.g.:
– Recovery Manager
» http://www.quest.com/recovery-manager-for-active-directory
– Backup Exec
» http://www.symantec.com/business/products/family.jsp?familyid=backupexec
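Scheduling a backup is only half the job; someone (or something) also has to notice when backups stop happening, as in the 10-month example earlier. A minimal sketch of such a check, assuming the date of the last successful backup is already known (how it is obtained depends on the backup product) and assuming a 7-day limit chosen purely for illustration:

```python
from datetime import date, timedelta


def backup_is_stale(last_backup: date, today: date, max_age_days: int = 7) -> bool:
    """Return True if the last backup is older than the allowed age.

    max_age_days is an assumed policy value, not a recommendation.
    """
    return (today - last_backup) > timedelta(days=max_age_days)


# A backup taken 10 months before the check is flagged as stale
print(backup_is_stale(date(2009, 4, 15), today=date(2010, 2, 15)))
```

A check like this could be run daily and wired up to whatever alerting the organisation already uses.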

Prevention is Better than Cure

A server shouldn’t crash unexpectedly!
– should be kept cool (environmental unit mustn’t break down!)
– monitoring should show that unexpected things are happening
– action can then (usually) be taken to take care of the unexpected

Many tools available to:
– Check/monitor the system on a regular basis
– Provide stats to administrators
» could also be used for security purposes
– Generate alerts if something is starting to go wrong…

Troubleshooting Tools for a Windows Server: Task Manager

Applications tab:
– shows which applications are running
– enables changing of process priority
» use view/update speed
– can be used to
» open new applications
» shut rogue applications down

Task Manager (continued)

Processes tab:
– all system processes
– Memory usage of each
– % CPU time for each
– total CPU time since boot up
– also used to close a process down
» careful! (but you get a warning…)

Task Manager (continued)

Performance tab:
– total no. of threads, processes, handles running
– Graph: % CPU usage
» User mode
» Kernel mode (optional: view menu)
» graph per CPU (optional: view menu)
– physical (Page File) memory available/usage
– virtual memory available/usage

Event Viewer

Events recorded into “event log” files
– System log
– Auditing log (customisable)
– Application log
– customisable - additional files

New files recorded daily; old ones archived
– time before archiving also customisable

Event Viewer

Three types of events recorded in log:
– Information
– Warning
– Error

More information on each event obtained by double-clicking
– make note of event code
– heed and take action if necessary

Using Event Viewer

Wise to check all event logs regularly
– take time/trouble to find out what those messages really mean…

Then take whatever action is needed
– sort out potential problems now
– make sure they don’t become real ones later…
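Checking the logs regularly is easier if the routine part is scripted. The sketch below assumes the events have already been exported from Event Viewer into a list of dictionaries; the field names, event codes, and messages are made up for illustration and are not real Windows event IDs. It keeps only Warning and Error events and picks out event codes that keep recurring, since those are the ones most likely to be a fault on its way to becoming a failure.

```python
from collections import Counter

# Hypothetical exported events (codes and messages are invented)
events = [
    {"log": "System", "type": "Information", "code": 1001, "message": "service started"},
    {"log": "System", "type": "Warning", "code": 2002, "message": "disk nearly full"},
    {"log": "Application", "type": "Error", "code": 3003, "message": "application crashed"},
    {"log": "System", "type": "Warning", "code": 2002, "message": "disk nearly full"},
]


def needs_attention(events):
    """Keep only the Warning and Error events."""
    return [e for e in events if e["type"] in ("Warning", "Error")]


def recurring_codes(events, minimum=2):
    """Return {event code: count} for codes seen at least `minimum` times."""
    counts = Counter(e["code"] for e in needs_attention(events))
    return {code: n for code, n in counts.items() if n >= minimum}


for event in needs_attention(events):
    print(event["log"], event["type"], event["code"], event["message"])
print("Recurring:", recurring_codes(events))
```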

Auditing Further Events

Any “object” can be audited

Objects to audit, and processes audited, can be set through audit (group) policy
– Using MMC & relevant snap-in

Types of process audited:
– access
– attempt to access

Security auditing

Same principles as general auditing

Refers to “restricted” objects

Events appear in separate security log

Event Management software (SIEM)

Who’s going to look at all these log files?
– in practice, often no-one…

Solution – SIEM software to analyse and present information from:
– network and security devices
– identity & access management applications
– vulnerability management/policy compliance tools
– OS, database & application logs
– external threat data

http://www.focus.com/briefs/how-select-security-information-and-event-management-siem
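At its simplest, the SIEM idea is to pull entries from several sources into one timeline so that related events can be seen together. The toy sketch below shows only that merging step; the tuple layout, source names, and log entries are assumptions made for the example, and a real SIEM product does far more (normalisation, correlation rules, reporting).

```python
from collections import Counter


def merge_logs(*sources):
    """Merge (timestamp, source, severity, message) tuples into one timeline.

    Timestamps are assumed to be ISO 8601 strings, which sort correctly as text.
    """
    merged = [entry for source in sources for entry in source]
    return sorted(merged, key=lambda entry: entry[0])


def severity_summary(entries):
    """Count entries per (source, severity) pair."""
    return Counter((source, severity) for _, source, severity, _ in entries)


# Hypothetical entries from two different sources
firewall = [("2012-04-02T09:15:00", "firewall", "Warning", "port scan detected")]
os_log = [
    ("2012-04-02T09:14:30", "os", "Error", "disk write failure"),
    ("2012-04-02T09:16:10", "os", "Information", "service restarted"),
]

timeline = merge_logs(firewall, os_log)
for entry in timeline:
    print(*entry)
```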

Other Troubleshooting Resources

NT Diagnostics (winmsd.exe)
– hardware & operating system data from registry

Performance Monitor
– Can monitor many aspects of system performance
– Either display current data graphically, in real-time
– or log data at regular intervals to get a longer term picture
– Useful role in system optimisation

Other Troubleshooting Resources

System Monitor (perfmon.msc)
– captures, filters, or analyses frames or packets sent over the network

Alerts
– notify administrator when a particular threshold value has been reached

System Recovery
– if a fatal error occurs:
» a dump of system memory is made, and can be used for identifying the cause of the problem
» alerts are sent to users
» system is restarted automatically
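The Alerts idea above (notify the administrator when a threshold value is reached) can be illustrated with a small Python function. This is a sketch under assumptions: the samples are taken to be % CPU readings already collected at regular intervals, and requiring several consecutive readings over the threshold is a design choice made here to avoid alerting on a single brief spike.

```python
def threshold_alerts(samples, threshold, consecutive=3):
    """Return the sample indexes at which an alert should be raised.

    An alert fires once, at the moment the value has been at or above
    `threshold` for `consecutive` samples in a row.
    """
    alerts = []
    run = 0  # length of the current run of samples at or above the threshold
    for index, value in enumerate(samples):
        run = run + 1 if value >= threshold else 0
        if run == consecutive:
            alerts.append(index)
    return alerts


# Hypothetical % CPU readings
cpu_samples = [40, 85, 92, 95, 60, 91, 93, 97, 99]
print(threshold_alerts(cpu_samples, threshold=90))
```

With these readings the first burst (92, 95) is too short to trigger anything; the alert fires only once the second burst reaches three readings in a row.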

Performance Monitor

Windows 2003 Server, but not available on disk

To obtain and download Performance Monitor Wizard (PerfWiz), visit the following Web site:
– http://www.microsoft.com/downloads/details.aspx?FamilyID=31fccd98-c3a1-4644-9622-faa046d69214&displaylang=en

What if the machine doesn’t boot…

Tools available:
– The boot error itself
» blue screen? driver software
» constant reboot? motherboard
– Last Known Good…
» Gives machine a chance to go back to the previous (usually last but one) configuration

What if the machine doesn’t boot… (continued)

Safe Mode
– includes VGA Mode or boot logging
– Debugging mode also available
» output difficult to decipher for non-experts

Recovery Console
– “DOS-type prompt” for performing minor repairs

What if the machine doesn’t boot… (continued)

System Configuration Utility (Msconfig.exe)
– automates the routine troubleshooting steps relating to Windows configuration issues
– can be used to modify the system configuration and troubleshoot the problem using a process-of-elimination method

What if the machine doesn’t boot… (continued)

Emergency Repair Disk (ERD)
– reboot machine using different media
» e.g. floppy disk
– media should be generated BEFORE it needs to be used!
– option to create the ERD during the set up process…

What if the machine doesn’t boot… (continued)

Full restore
– assumes a full backup has already been made
– still have to:
» reformat hard disk from scratch…
» and then restore the backup files using backup/restore option…
– but better than losing all your data!

Network Troubleshooting Chart - 1

Identify the problematic network node

Is there a problem with one of the network protocols?
– Use commands such as PING & TraceRt
– Isolate the problem to a protocol layer and fix it

Is there a memory problem?
– Is there a memory leak?
» Fix or eliminate the software with the memory leak
– Is there sufficient memory?
» Add more memory

URL: http://teamapproach.ca/trouble
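The PING and TraceRt step in the chart above can be scripted so that the same checks are run the same way every time. The sketch below is one possible approach using Python's standard library; the function names are made up for this example, and it assumes the ping and tracert/traceroute commands are installed and on the path. The count flag differs by platform (-n on Windows, -c on Unix-like systems), which the helper accounts for.

```python
import platform
import subprocess


def ping_command(host, count=4, system=None):
    """Build the ping command line for the current (or named) platform."""
    system = system or platform.system()
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, str(count), host]


def traceroute_command(host, system=None):
    """Build the route-tracing command: tracert on Windows, traceroute elsewhere."""
    system = system or platform.system()
    return ["tracert", host] if system == "Windows" else ["traceroute", host]


def host_responds(host, count=4):
    """Run ping and report whether it exited successfully.

    The exit code is only a rough indicator; read the output as well
    before drawing conclusions.
    """
    result = subprocess.run(ping_command(host, count), capture_output=True, text=True)
    return result.returncode == 0


# Example use (not run here because it needs network access):
#   if not host_responds("192.0.2.1"):
#       subprocess.run(traceroute_command("192.0.2.1"))
```

If the ping fails, the trace shows how far along the route packets get, which helps isolate the problem to a particular node or layer.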

Network Troubleshooting Chart - 2

Does the system freeze?
– Investigate priority and device driver problems

Is there high processor utilization?
– Is it caused by hardware or software?
» if hardware: can an upgraded device driver fix the problem?
– Provide adequate processor resources
– Upgrade your hardware to offload the processor

Network Troubleshooting Chart – 3

Is there a disk problem?
– Is there sufficient file cache?
» Add more memory to ensure sufficient cache
– Use NTFS and do regular maintenance
– Use RAID
– Is there a boot record problem?
» Use FixBoot or FixMBR from the recovery console

Network Troubleshooting Chart – 4

Is there a network problem?
– Use Network Monitor to identify top broadcasters
– Eliminate unnecessary broadcasts
– Use Network Monitor to identify top talkers
– Eliminate unnecessary network traffic
– Correct poor configuration
– Reorganize & upgrade network for more capacity

Is there an address or name resolution problem?
– Examine ARP cache, WINS, DNS, and NBTStat
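For the DNS part of the name resolution question, a quick test is simply to ask the system resolver for a name and see what comes back. The sketch below uses Python's standard socket.getaddrinfo; the function name resolve and the hostname in the comment are placeholders for this example, and the check covers only ordinary name lookup, not the ARP cache, WINS, or NetBIOS name tables.

```python
import socket


def resolve(hostname, resolver=socket.getaddrinfo):
    """Return the sorted list of addresses a hostname resolves to.

    An empty list means the lookup failed. The resolver parameter exists
    only so the function can be exercised without a live network.
    """
    try:
        results = resolver(hostname, None)
    except socket.gaierror:
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # the address is the first element of sockaddr.
    return sorted({result[4][0] for result in results})


# Example use (placeholder hostname):
#   addresses = resolve("server1.example.test")
#   print(addresses or "name did not resolve")
```

If a name fails here but the host answers a ping by IP address, the fault is more likely in name resolution than in basic connectivity.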

Optimisation…

All about improving the performance of system resources…

A network manager should never have “nothing to do…”