Network Management Session 1 Network Basics

COMP3122 Network Management

Richard Henson April 2012

Week 11 – Troubleshooting & Optimisation

• Learning Objectives:
– Explain the principles of troubleshooting as a means of mitigating against failure
– Use the various tools available on a named operating system to identify potential faults and problems
– Take appropriate action to stop a fault becoming a failure

“A stitch in time saves nine”

Business - Worst Possible Scenario (1)

• There is an interruption in the power supply
– UPS is invoked
– the interruption continues…
– servers all have to be shut down
• Power supply restored…
– but main domain controller doesn’t reboot
– other domain controllers therefore cannot connect to it
– the domain tree fails

Business - Worst Possible Scenario (2)

• Organisation cannot do business with the network down…
– server can’t be persuaded to boot
– new main domain controller has to be commissioned
– whole directory tree has to be rebuilt!!!
– word spreads very rapidly…
• Business loses so much custom, trust, and credibility that even when it starts doing business again customers choose to go elsewhere
– without a flourishing customer base… the business folds

Analysis: This scenario shouldn’t have occurred…

• Unlikely that the server would fail to boot without prior warning…
– warnings would have been presented…
– but were clearly not acted upon!
• Disaster recovery plan!?!
– not formulated?
– not tested?
– not effective (in the event of a domain tree controller failure…)

But it does…

• Actual example (15th Feb 2010):
– root domain controller [on the network] had not been backed up for 10 months when it crashed (well… at least it had been backed up at some time…)
– http://searchwindowsserver.techtarget.com/generic/0,295582,sid68_gci1381567,00.html
• The consultant called in to fix it reported that:
– “I had never seen a case where the forest root domain had to be recovered -- and I couldn't find anyone who had.”

Analysis: Who is to blame? (1)

• In this example, the organisation said they were following Microsoft guidelines
– they set up an empty root domain
– the root domain controller had a RAID-5 disk configuration
• This was true, to some extent
– Microsoft did espouse this as best practice… in the year 2000!
– guidelines had changed since then…

Analysis: Who is to blame? (2)

• The disaster that struck was:
– two RAID drives failed on the same day! (RAID-5 survives the loss of only one drive)
– unlucky? possible to prepare for this?
• The recovery process took about three weeks
– most of the time was spent studying logs, doing the restore, etc.
• In this case, the tree was still able to function without a root domain
– business was able to continue
– customer base wasn’t compromised…

Fault Tolerance and Risk Assessment

• General “common sense” principle:
– always have a backup
– ESPECIALLY for the most important computer on the network…
• Q: How can you tell what needs backing up?
• A: Risk Assessment and Risk Management

Why not Risk Management?

• Time consuming!
• However, without proper risk management…
– how does the organisation know what processes are most important to its functioning?
– how can an organisation provide resources to protect aspects of its network?

Risk Management and Risk Assessment

• Risk Assessment is an essential first step
– requires putting a “value” on assets
– more valuable… greater protection
• Do information assets have value?
– organisations still failing to acknowledge that they do…
– categorisation of information assets therefore potentially problematic
– need to look at the consequence to the organisation of losing that asset…
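The idea of putting a "value" on assets and protecting the most valuable first can be sketched in Python. This is only an illustration: the 1 to 5 scales, the "consequence × likelihood" score (a common convention, not something stated in these slides), and the asset names are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    consequence: int  # assumed scale: cost of losing the asset, 1 (low) to 5 (severe)
    likelihood: int   # assumed scale: chance of loss, 1 (rare) to 5 (frequent)

    @property
    def risk(self) -> int:
        # Simple illustrative score: higher means "protect this first".
        return self.consequence * self.likelihood


def rank_by_risk(assets):
    """Return the assets ordered from highest to lowest risk score."""
    return sorted(assets, key=lambda a: a.risk, reverse=True)


# Hypothetical assets and scores, for illustration only.
assets = [
    Asset("root domain controller", consequence=5, likelihood=2),
    Asset("staff workstation", consequence=1, likelihood=4),
    Asset("customer database", consequence=5, likelihood=3),
]

for asset in rank_by_risk(assets):
    print(f"{asset.risk:>2}  {asset.name}")
```

The ranking is only as good as the scores fed into it, which is why the slide stresses looking at the consequence to the organisation of losing each asset.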

How do you back up a Domain Controller?

• The Windows “Backup” program works, and can easily be scheduled
– but heavily criticised…
– even the 2008 server version…
• Third-party products give more flexibility and protection, e.g.:
– Recovery Manager
» http://www.quest.com/recovery-manager-for-active-directory
– Backup Exec
» http://www.symantec.com/business/products/family.jsp?familyid=backupexec
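Whichever backup product is used, it helps to check that a recent backup actually exists (the root domain controller in the earlier example had gone 10 months without one). A minimal Python sketch, assuming a hypothetical record of each server's last successful backup time and an illustrative 7-day limit; neither the record format nor the limit comes from the slides.

```python
from datetime import datetime, timedelta


def stale_backups(last_backup, now, max_age=timedelta(days=7)):
    """Return the servers whose last successful backup is older than max_age.

    last_backup maps a server name to the datetime of its last good backup,
    or None if it has never been backed up (hypothetical record format).
    """
    stale = []
    for server, when in sorted(last_backup.items()):
        if when is None or now - when > max_age:
            stale.append(server)
    return stale


# Illustrative data only: a root domain controller last backed up 10 months ago.
now = datetime(2010, 2, 15)
last_backup = {
    "root-dc": datetime(2009, 4, 15),
    "file-server": datetime(2010, 2, 14),
    "new-server": None,
}

print(stale_backups(last_backup, now))
```

A check like this could run on a schedule and feed the alerting tools discussed later in the session.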

Prevention is Better than Cure

• A server shouldn’t crash unexpectedly!
– should be kept cool (environmental unit mustn’t break down!)
– monitoring should show that unexpected things are happening
– action can then (usually) be taken to take care of the unexpected
• Many tools available to:
– Check/monitor the system on a regular basis
– Provide stats to administrators
» could also be used for security purposes
– Generate alerts if something is starting to go wrong…

Troubleshooting Tools for a Windows Server: Task Manager

• Applications tab:
– shows which applications are running
– enables changing of process priority
» use view/update speed
– can be used to
» open new applications
» shut rogue applications down

Task Manager (continued)

• Processes tab:
– all system processes
– memory usage of each
– % CPU time for each
– total CPU time since boot up
– also used to close a process down
» careful! (but you get a warning…)
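Watching the memory usage of one process over time can hint at a memory leak. The Python sketch below assumes the readings have been noted down by hand or exported by some other tool (it does not read Task Manager itself), and the rule "memory rose in every sample" is a deliberately crude, assumed heuristic.

```python
def looks_like_leak(samples_kb, min_samples=5):
    """Return True if memory usage rose in every consecutive sample.

    samples_kb is a list of memory readings (in KB) for one process,
    oldest first. Too few samples gives no verdict (False).
    """
    if len(samples_kb) < min_samples:
        return False
    return all(later > earlier
               for earlier, later in zip(samples_kb, samples_kb[1:]))


# Made-up readings for two hypothetical processes.
steady_growth = [5000, 5200, 5450, 5900, 6400]
normal_variation = [5000, 5200, 5100, 5300, 5250]

print(looks_like_leak(steady_growth))
print(looks_like_leak(normal_variation))
```

A process that grows steadily and never gives memory back is a candidate for closer investigation, not proof of a leak.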

Task Manager (continued)

• Performance tab:
– total no. of threads, processes, handles running
– Graph: % CPU usage
» User mode
» Kernel mode (optional: view menu)
» graph per CPU (optional: view menu)
– physical memory available/usage
– virtual (Page File) memory available/usage

Event Viewer

• Events recorded into “event log” files
– System log
– Auditing log (customisable)
– Application log
– customisable - additional files
• New files recorded daily; old ones archived
– time before archiving also customisable

Event Viewer

• Three types of events recorded in log:
– Information
– Warning
– Error
• More information on each event obtained by double-clicking
– make note of event code
– heed and take action if necessary

Using Event Viewer

• Wise to check all event logs regularly
– take time/trouble to find out what those messages really mean…
• Take whatever action is needed
– sort out potential problems now
– make sure they don’t become real ones later…
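Regular checking is easier if the Information events are filtered out and repeated event codes are counted. A minimal Python sketch, assuming the log has already been exported into a list of records; the field names ("level", "code", "source") and the sample records are hypothetical.

```python
from collections import Counter

# The three event types named on the slide, in increasing order of severity.
SEVERITY = {"Information": 0, "Warning": 1, "Error": 2}


def events_needing_attention(events, minimum="Warning"):
    """Keep only the events at or above the given severity."""
    threshold = SEVERITY[minimum]
    return [e for e in events if SEVERITY[e["level"]] >= threshold]


def count_by_code(events):
    """Count how often each (level, event code) pair occurs."""
    return Counter((e["level"], e["code"]) for e in events)


# Illustrative sample records only.
events = [
    {"level": "Information", "code": 6005, "source": "EventLog"},
    {"level": "Warning", "code": 51, "source": "Disk"},
    {"level": "Warning", "code": 51, "source": "Disk"},
    {"level": "Error", "code": 7, "source": "Disk"},
]

flagged = events_needing_attention(events)
for (level, code), occurrences in count_by_code(flagged).most_common():
    print(level, code, occurrences)
```

A warning that keeps recurring with the same event code is exactly the kind of potential problem worth sorting out before it becomes a real one.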

Auditing Further Events

• Any “object” can be audited
• Objects to audit, and processes audited, can be set through audit (group) policy
– Using MMC & relevant snap-in
• Types of process audited:
– access
– attempt to access

Security auditing

• Same principles as general auditing
• Refers to “restricted” objects
• Events appear in separate security log

Event Management software (SIEM)

• Who’s going to look at all these log files?
– in practice, often no-one…
• Solution – SIEM software to analyse and present information from:
– network and security devices
– identity & access management applications
– vulnerability management/policy compliance tools
– OS, database & application logs
– external threat data
• http://www.focus.com/briefs/how-select-security-information-and-event-management-siem
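One small part of what SIEM software does, bringing events from several sources into a single time-ordered view, can be sketched in Python. This is not how any particular SIEM product works; the (timestamp, source, message) tuple layout and the sample events are assumptions for illustration.

```python
import heapq
from datetime import datetime


def merge_logs(*sources):
    """Merge several time-sorted event lists into one time-ordered list.

    Each event is a (timestamp, source, message) tuple, and each input
    list is assumed to be sorted by timestamp already.
    """
    return list(heapq.merge(*sources, key=lambda event: event[0]))


# Made-up events from two hypothetical sources.
firewall = [
    (datetime(2012, 4, 2, 9, 0, 5), "firewall", "blocked inbound connection"),
    (datetime(2012, 4, 2, 9, 2, 0), "firewall", "blocked inbound connection"),
]
os_log = [
    (datetime(2012, 4, 2, 9, 0, 7), "os", "failed logon"),
    (datetime(2012, 4, 2, 9, 1, 30), "os", "failed logon"),
]

for timestamp, source, message in merge_logs(firewall, os_log):
    print(timestamp.isoformat(), source, message)
```

Seeing a blocked connection and a failed logon seconds apart is the sort of correlation that is easy to miss when each log is read separately.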

Other Troubleshooting Resources



• NT Diagnostics (winmsd.exe)
– hardware & operating system data from registry
• Performance Monitor
– Can monitor many aspects of system performance
– Either display current data graphically, in real-time
– or log data at regular intervals to get a longer term picture
– Useful role in system optimisation
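Logging a counter at regular intervals and then summarising it gives the "longer term picture" mentioned above. The Python sketch below illustrates the idea only: it does not use Performance Monitor, and `read_value` is a hypothetical stand-in for whatever supplies the counter readings.

```python
import time


def log_samples(read_value, count, interval_seconds, sleep=time.sleep):
    """Collect `count` readings, pausing `interval_seconds` between them."""
    samples = []
    for i in range(count):
        samples.append(read_value())
        if i < count - 1:
            sleep(interval_seconds)
    return samples


def summarise(samples):
    """Reduce a list of readings to minimum, maximum and mean."""
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": sum(samples) / len(samples),
    }


# Made-up % CPU readings standing in for a real counter; interval 0 keeps the demo instant.
readings = iter([12.0, 35.0, 88.0, 41.0])
samples = log_samples(lambda: next(readings), count=4, interval_seconds=0)
print(summarise(samples))
```

Passing the sleep function as a parameter is a design choice that lets the sampling loop be exercised without waiting for real intervals.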

Other Troubleshooting Resources

• Network Monitor
– captures, filters, or analyses frames or packets sent over the network
• Alerts
– notify administrator when a particular threshold value has been reached
• System Recovery
– if a fatal error occurs:
» a dump of system memory is made, and can be used for identifying the cause of the problem
» alerts are sent to users
» system is restarted automatically
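The Alerts item above boils down to comparing readings against threshold values. A minimal Python sketch, in which the counter names, the limits and the message format are all hypothetical, and "notifying the administrator" is reduced to returning message strings.

```python
def check_thresholds(readings, thresholds):
    """Return an alert message for every counter at or above its threshold."""
    alerts = []
    for counter, limit in thresholds.items():
        value = readings.get(counter)
        if value is not None and value >= limit:
            alerts.append(f"ALERT: {counter} is {value} (threshold {limit})")
    return alerts


# Illustrative thresholds and readings only.
thresholds = {"cpu_percent": 90, "disk_used_percent": 85}
readings = {"cpu_percent": 97, "disk_used_percent": 60}

for message in check_thresholds(readings, thresholds):
    print(message)
```

Choosing sensible thresholds is the hard part: too low and administrators learn to ignore the alerts, too high and the warning arrives too late.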

Performance Monitor

• Windows 2003 Server, but not available on disk
• To obtain and download Performance Monitor Wizard (PerfWiz), visit the following Web site:
– http://www.microsoft.com/downloads/details.aspx?FamilyID=31fccd98-c3a1-4644-9622-faa046d69214&displaylang=en

What if the machine doesn’t boot…

• Tools available:
– The boot error itself
» blue screen? suspect driver software
» constant reboot? suspect motherboard
– Last Known Good…
» Gives machine a chance to go back to the previous (usually last but one) configuration

What if the machine doesn’t boot… (continued)

• Safe Mode
– includes VGA Mode or boot logging
– Debugging mode also available
» output difficult to decipher for non-experts
• Recovery Console
– “DOS-type prompt” for performing minor repairs

What if the machine doesn’t boot… (continued)

• System Configuration Utility (Msconfig.exe)
– automates the routine troubleshooting steps relating to Windows configuration issues
– can be used to modify the system configuration and troubleshoot the problem using a process-of-elimination method

What if the machine doesn’t boot… (continued)

• Emergency Repair Disk (ERD)
– reboot machine using different media
» e.g. floppy disk
– media should be generated BEFORE it needs to be used!
– option to create the ERD during the set up process…

What if the machine doesn’t boot… (continued)

• Full restore
– assumes a full backup has already been made
– still have to:
» reformat hard disk from scratch…
» and then restore the backup files using backup/restore option…
– but better than losing all your data!

Network Troubleshooting Chart -1

Identify the problematic network node
• Is there a problem with one of the network protocols?
– Use commands such as PING & TraceRt
– Isolate the problem to a protocol layer and fix it
• Is there a memory problem?
– Is there a memory leak?
» Fix or eliminate the software with the memory leak
– Is there sufficient memory?
» Add more memory
• URL: http://teamapproach.ca/trouble
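The PING step in the chart can be scripted so that several nodes are checked the same way. A minimal Python sketch: it relies on the standard `ping` command being on the path, uses `-n` for the count on Windows and `-c` elsewhere, and treats the exit code as only a rough indication of reachability. The function names and the example address are hypothetical.

```python
import platform
import subprocess


def build_ping_command(host, count=4, system=None):
    """Build the ping command line for the current (or a named) platform."""
    system = system or platform.system()
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, str(count), host]


def is_reachable(host, count=4):
    """Run ping and report whether it exited successfully.

    The exit code is a rough guide only; read the output before
    drawing conclusions about a node.
    """
    result = subprocess.run(build_ping_command(host, count),
                            capture_output=True, text=True)
    return result.returncode == 0


# Show the command that would be run, without touching the network.
print(build_ping_command("192.0.2.1", system="Windows"))
```

Working outwards from the local machine (loopback, own address, gateway, remote host) helps isolate the problem to a particular node or protocol layer.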

Network Troubleshooting Chart - 2

Does the system freeze?
• Investigate priority and device driver problems
• Is there high processor utilization?
– Is it caused by hardware or software?
» hardware: Can an upgraded device driver fix the problem?
– Provide adequate processor resources
– Upgrade your hardware to offload the processor

Network Troubleshooting Chart – 3

Is there a disk problem?
• Is there sufficient file cache?
– Add more memory to ensure sufficient cache
• Use NTFS and do regular maintenance
• Use RAID
• Is there a boot record problem?
– Use FixBoot or FixMBR from the recovery console

Network Troubleshooting Chart – 4

Is there a network problem?
• Use Network Monitor to identify top broadcasters
– Eliminate unnecessary broadcasts
• Use Network Monitor to identify top talkers
– Eliminate unnecessary network traffic
• Correct poor configuration
• Reorganize & upgrade network for more capacity
• Is there an address or name resolution problem?
– Examine ARP cache, WINS, DNS, and NBTStat
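Identifying "top talkers" amounts to totalling traffic per source address. The Python sketch below assumes the captured frames have already been exported as simple records; the field names ("src", "bytes") and the sample data are hypothetical and do not reflect Network Monitor's own file format.

```python
from collections import Counter


def top_talkers(frames, n=3):
    """Return the n source addresses that sent the most bytes."""
    totals = Counter()
    for frame in frames:
        totals[frame["src"]] += frame["bytes"]
    return totals.most_common(n)


# Made-up capture records for illustration.
frames = [
    {"src": "10.0.0.5", "bytes": 1500},
    {"src": "10.0.0.9", "bytes": 60},
    {"src": "10.0.0.5", "bytes": 1500},
    {"src": "10.0.0.7", "bytes": 400},
    {"src": "10.0.0.9", "bytes": 60},
]

for address, total in top_talkers(frames, n=2):
    print(address, total)
```

The same counting approach works for top broadcasters if the records are first filtered down to broadcast frames.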

Optimisation…

• All about improving the performance of system resources…
• A network manager should never have “nothing to do…”
