Network planning considering
reliability aspects
Takács György
Lecture 5
6 October 2010
Fundamentals of reliability issues in
network planning
• You must dimension networks to higher parameters than the exact values calculated from the demand parameters. Networks always need some spare capacity.
• Plan for unpredictable situations!
• Reliability and availability dimensioning are an important part of network dimensioning and planning.
Organizations are increasingly reliant on computer
networks for business or mission-critical applications.
The scope and size of these networks have expanded
so rapidly over the past two decades that considerable
effort and expense are now targeted at keeping
network resources available, sometimes 24 hours a
day, all year.
Traditionally this area of network design has been the
preserve of large mainframe sites and those sites
requiring high levels of protection (such as nuclear
power plants). However, the explosion of Web-based business methods means that many more organizations are now eager to maintain high availability in order to minimize service losses.
If the network is poorly designed, and insufficient
attention is paid to providing availability in core systems,
users can experience anything from slow response times
to complete loss of service (referred to as downtime) for
extended periods.
The technical issues in maintaining high availability are
both complex and subtle, and it is the network designer’s
job to balance loss probability against cost, providing
guidance to senior management on the likelihood of
failures and their impact on the business.
Networks are rarely static environments, and budgets are
finite. In practice network designers are required to make a
range of pragmatic and technical decisions that address,
accept, mitigate, or transfer the risks of failure—all within
the constraints of a budget.
The designer must also ensure that the solutions provided
are scalable, so that additional nodes, services, and
capacity can be added without major upheaval and without
adversely affecting existing users.
Downtime for truly business- and mission-critical systems
can equate to losses of millions of dollars per minute; these
organizations, therefore, demand high-availability (HA)
networks and are often prepared to go to extraordinary
lengths to achieve them.
Failure knows no boundaries in a network design, and the
smallest component failure can effectively bring down a
whole business without warning (e.g., a failed hard disk
controller on your core e-business server could stop all
transactions).
For practical reasons organizations are invariably broken
down into teams responsible for different aspects of IT
(desktop support, communications, applications, database,
cabling, etc.). When a problem occurs, it is all too common
for application staff to blame the network and vice versa.
To maintain HA networks, different disciplines must work
together, both at the design phase and subsequently. Good
diagnostic, monitoring, and management tools can also
help.
Planning for failure
When designing a reliable data network, network
designers are well advised to keep two quotations in mind
at all times:
Anything that can go wrong, will go wrong
—Murphy
Whatever can go wrong will go wrong at the worst possible time and in the worst possible way ...
Expect the unexpected.
—Douglas Adams, The Hitchhiker’s Guide to the Galaxy
Failure refers to a situation where the observed behavior
of a system differs from its specified behavior. A failure
occurs because of an error, caused by a fault. The time
lapse between the error occurring and the resulting failure
is called the error latency.
Faults can be
•hard (permanent) or
•soft (transient).
For example, a cable break is a hard fault, whereas intermittent noise on the line is a soft fault.
Single Point of Failure (SPOF) indicates that a system or
network can be rendered inoperable, or significantly
impaired in operation, by the failure of one single
component. For example, a single hard disk failure could
bring down a server; a single router failure could break
all connectivity for a network.
Multiple points of failure indicate that a system or
network can be rendered inoperable through a chain or
combination of failures (as few as two). For example,
failure of a single router, plus failure of a backup modem
link, could mean that all connectivity is lost for a network.
In general it is much more expensive to cope with multiple points of failure, and often financially impractical.
Fault tolerance indicates that every component in the
chain supporting the system has redundant features or is
duplicated. A fault-tolerant system will not fail because
any one component fails (i.e., it has no single point of
failure).
The system should also provide recovery from multiple
failures.
Components are often overengineered or purposely
underutilized to ensure that while performance may be
affected during an outage, the system will perform within
predictable, acceptable bounds.
Fault resilience implies that at least one of the modules or
components within a system is backed up with a spare
(e.g., a power supply).
This may be in hot standby, cold standby, or load-sharing
mode.
In contrast with fault-tolerant systems, not all modules or
components are necessarily redundant (i.e., there may be
several single points of failure).
For example, a fault-resilient router may have multiple
power supplies but only one routing processor.
By definition, one fault-resilient component does not make
the entire system fault tolerant.
Disaster recovery is the process of identifying all potential
failures, their impact on the system/network as a whole,
and planning the means to recover from such failures.
Calculating the true cost of downtime
Network designers are largely unfamiliar with financial
models. It is, however, imperative in designing reliable
networks that the designer gathers some basic financial data
in order to cost justify and direct suitable technical solutions.
The data may come from line managers or financial support
staff and may not be readily collated. Without these data the
scale of the problem is undefined, and it will be hard to
convince senior financial and operational management that
additional features are necessary.
To illustrate the point let us consider a hypothetical
consumer-oriented business (such as an airline, car rental,
vacation, or hotel reservation call center). The call center
is required to be online 24 hours a day, 7 days a week,
365 days a year. The business has 800 staff involved in
call handling (transactions), each with an average
burdened cost of $25 an hour (i.e., the cost of providing a
desk, heating, lighting, phone, data point, etc.). There is a
small profit made on each transaction, plus a large profit
on any actual sale that can be closed. We assume here
that there are on average three sales closed per hour.
Cost of Idle Staff is calculated as (Headcount × Burdened
Cost × Downtime).
Production Losses are calculated as (Headcount × Transactions per Hour × Profit per Transaction × Downtime).
Lost Sales are calculated as (Headcount × Sales per Hour
× Profit per Sale × Downtime).
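To put rough numbers on these formulas, here is a minimal Python sketch (not part of the lecture) applied to the hypothetical call center above. The headcount, burdened cost, and sales rate come from the example; the transactions per hour, profit per transaction, and profit per sale are assumed values chosen purely for illustration.

```python
# Hypothetical downtime-cost estimate for the call-center example above.
# headcount, burdened_cost and sales_per_hour are taken from the example;
# transactions_per_hour, profit_per_transaction and profit_per_sale are
# assumptions made only to show how the three formulas combine.

headcount = 800               # call-handling staff
burdened_cost = 25.0          # $/hour per member of staff
transactions_per_hour = 10    # assumed
profit_per_transaction = 2.0  # $, assumed
sales_per_hour = 3            # closed sales per hour (treated per agent)
profit_per_sale = 20.0        # $, assumed

def downtime_cost(downtime_hours: float) -> dict:
    """Apply the three slide formulas for a given outage length (hours)."""
    idle_staff = headcount * burdened_cost * downtime_hours
    production_losses = (headcount * transactions_per_hour
                         * profit_per_transaction * downtime_hours)
    lost_sales = headcount * sales_per_hour * profit_per_sale * downtime_hours
    return {
        "idle_staff": idle_staff,
        "production_losses": production_losses,
        "lost_sales": lost_sales,
        "total": idle_staff + production_losses + lost_sales,
    }

if __name__ == "__main__":
    for hours in (0.5, 1, 4):
        print(f"{hours} h outage -> ${downtime_cost(hours)['total']:,.0f}")
```

With these assumed figures a single hour of downtime already costs on the order of $84,000, which is the kind of number needed to justify additional availability features to management.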
Developing a disaster recovery plan
All networks are vulnerable to disruption. Sometimes
these disruptions may come from the most unlikely
sources. Natural events such as flooding, fire,
lightning strikes, earthquakes, tidal waves, and
hurricanes are all possible, as well as fuel shortages,
electricity strikes, viruses, hackers, system failures,
and software bugs. History shows us that these events do
happen regularly.
As recently as 1999 and 2000 we saw the seemingly
impossible: power shortages in California threatened to
cripple Silicon Valley, and a combination of fuel
shortages, train safety issues, and massive flooding ….
In fact, various studies indicate that the majority of
system failures can be attributed to a relatively small set
of events.
These include, in decreasing order of frequency, natural
disaster, power failure, systems failure,
sabotage/viruses, fire, and human error.
There is also a general consensus that companies that
take longer than a full business week to get back online
run a high risk of being forced out of business entirely
(some analysts state as high as 50 percent).
A general approach to the creation of a Disaster Recovery (DR) plan:
• Benchmark the current design—Perform a full risk
assessment for all key systems and the network as a
whole. Identify key threats to system and network
integrity. Analyze core business requirements and identify
core processes and their dependence upon the network.
Assign monetary values to the loss of services or systems.
• Define the requirements—Based on business needs,
determine an acceptable recovery window for each
system and the network as a whole. If practical, specify a
worst-case recovery window and a target recovery
window. Specify priorities for mission- or business-critical
systems.
• Define the technical solution—Determine the technical
response to these challenges by evaluating alternative
recovery models, and select solutions that best meet the
business requirements. Ensure that a full cost analysis of
each solution is provided, together with the recovery times
anticipated under catastrophic failure conditions and
lesser degrees of failure.
• Develop the recovery strategy—Formulate a crisis
management plan identifying the processes to be followed
and key personnel responses to failure scenarios. Describe
where automation and manual intervention are required.
Set priorities to clearly identify the order in which systems
should be brought back online.
• Develop an implementation strategy—Determine how
new/additional technology is to be deployed and over
what time period. Document changes to the existing
design. Identify how new/additional processes and
responsibilities are to be communicated.
• Develop a test program—Determine how business- and
mission-critical systems may be exercised and what the
expected results should be. Define procedures for
rectifying test failures. Run tests to see if the strategy
works; if not, make refinements until satisfied.
• Implement continuous monitoring and improvements—
Once the disaster recovery plan is established, hold
regular reviews to ensure that the plan stays
synchronized as the network grows or design features
are modified.
Disaster recovery models
Tape or CD site backup—Tape or CD-ROM backup and restore are the most widely used DR methods for sites.
Traditionally, key data repositories and configuration files are
backed up nightly or every other night.
Backup media are transported and securely stored at a
different location. This enables complete data recovery should
the main site systems be compromised. If the primary site
becomes inoperable, the plan is to ship the media back,
reboot, and resume normal operations.
Pros and Cons: This is a low-cost solution, but the recovery
window could range from a few hours to several days; this
may prove unacceptable for many businesses. Media
reliability may not be 100 percent and, depending upon the
backup frequency, valuable data may be lost.
Electronic vaulting—With remote electronic vaulting, data are
archived automatically to tape or CD over the network to a secure
remote site. Electronic vaulting ideally requires a dedicated
network connection to support large or frequent background data
transfers; otherwise, archiving must be performed during off-peak
periods or low-utilization periods (e.g., via a nightly backup).
Backup procedures can, however, be optimized by archiving only
incremental changes since the last archive, reducing both traffic
levels and network unavailability.
Pros and Cons: The operating costs for electronic vaulting can be
up to four times more expensive than simple tape or CD backup;
however, this approach can be entirely automated. Unlike simple
media backup there is no requirement to transport backup data
physically. Recovery still depends on the most recent backup
copy, but this is likely to be more recent due to automation.
Electronic vaulting is more reliable and significantly decreases the
recovery window (typically, just a few hours).
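To make the incremental-archiving idea described above concrete, here is a minimal Python sketch of an electronic-vaulting job that archives only files whose contents have changed since the last run. The directory paths, manifest format, and archive naming are illustrative assumptions rather than anything prescribed in the lecture.

```python
# Sketch of incremental electronic vaulting: hash every file under DATA_DIR,
# compare against the manifest from the previous run, and archive only the
# new or changed files to the (remotely mounted) vault.

import hashlib
import json
import tarfile
from datetime import datetime
from pathlib import Path

DATA_DIR = Path("/srv/production-data")   # data to protect (assumed path)
VAULT_DIR = Path("/mnt/remote-vault")     # remote vault mount (assumed path)
MANIFEST = VAULT_DIR / "manifest.json"    # hashes of files already vaulted

def file_hash(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def incremental_backup() -> None:
    previous = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    current, changed = {}, []
    for path in DATA_DIR.rglob("*"):
        if path.is_file():
            rel = str(path.relative_to(DATA_DIR))
            current[rel] = file_hash(path)
            if previous.get(rel) != current[rel]:   # new or modified file
                changed.append(path)
    if changed:
        archive = VAULT_DIR / f"vault-{datetime.now():%Y%m%d-%H%M%S}.tar.gz"
        with tarfile.open(archive, "w:gz") as tar:
            for path in changed:
                tar.add(str(path), arcname=str(path.relative_to(DATA_DIR)))
    MANIFEST.write_text(json.dumps(current, indent=2))

if __name__ == "__main__":
    incremental_backup()
```

A job like this is typically scheduled for off-peak periods (e.g., nightly), matching the description above; a real product would also handle deletions, encryption in transit, and retention of older archives.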
Data replication/disk mirroring—Remote disk mirroring provides
faster recovery and less data loss than remote electronic
vaulting. Since data are transferred to disk rather than tape,
performance impacts are minimized. With disk mirroring you can
maintain a complete replica file system image at the backup
site; all changes made to production data are tracked and
automatically backed up.
Data are typically synchronized in the background, and when
the recovery site is initialized or when a failed site comes back
online, all data are resynchronized from the replica to production
storage. Note that data may be available only in read-only mode
at the recovery site if the original site fails (to ensure at least one
copy is protected), so services will recover but applications that
are required to update data may be somewhat compromised
unless some form of local data cache is available until the
primary storage comes online. A disk mirroring solution should
ideally be able to use a variety of disks using industry-standard interfaces (e.g., SCSI, Fibre Channel, etc.).
Data replication/disk mirroring
Pros and Cons: Data replication is more expensive than the
previous two models, and for large sites considerable traffic
volumes can be generated. Ideally, a private storage network
should be deployed to separate storage traffic from user
traffic. Although preferable, this requires more maintenance than the earlier models.
Server mirroring and clustering—These techniques can
be used to significantly reduce the recovery time to
acceptable levels. Ideally, servers should be running live
and in parallel, distributing load between them but
located at different physical locations. If incremental
changes are frequently synchronized between servers,
then backup could be a matter of seconds, and only a
few transactions may be lost (assuming there isn’t a large-scale telecommunications or power disruption and staff
are well briefed on what to do and what not to do in
such circumstances). The increasing focus on electronic
commerce and large-scale applications such as ERP
means that this configuration is becoming increasingly
common.
Server mirroring and clustering
Pros and Cons: This approach is widely used at data centers for major financial and retail institutions but is often too expensive to justify for small businesses. Server mirroring requires more infrastructure to achieve (high-speed wide area links, more routers, more firewalls, and tight management and control systems).
Storage Area Networks (SANs) and Optical Storage
Network (OSNs)—There is increasing interest in moving
mission- and business- critical data off the main network
and offloading it onto a privately managed infrastructure
called a Storage Area Network (SAN).
Storage can be optically attached via standard high-speed
interfaces such as Fibre Channel and SCSI (with optical
extenders), providing a physical separation of storage
from 600 meters to 10 kilometers. Servers are directly
attached to this network (typically via Fibre Channel
or ESCON/FICON interfaces [5]) and are also attached to
the main user network. SANs may be further extended (to
thousands of kilometers) via technologies such as Dense
Wave Division Multiplexing (DWDM), forming optical
storage networks. This allows multiple sites to share
storage over reliable high-speed private links.
Storage Area Networks (SANs) and Optical Storage
Network (OSNs)
Pros and Cons: This approach is an excellent model for
disaster recovery and storage optimization. It does, however, significantly increase complexity and cost (though storage
consolidation may recover some of these costs), and it
is, therefore, appropriate only for major enterprises at
present. One big attraction for many large enterprises is
that the whole storage infrastructure can be outsourced
to a Storage Service Provider (SSP). This facilitates a
very reliable DR model (some providers are currently quoting four-nines (99.99 percent) availability).
Quantifying availability
• Availability A = Operational Time / Total Time (usually expressed as a percentage)
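The complement of availability is the downtime a design permits. The short Python sketch below (not from the slides) converts an availability target into the maximum downtime it allows over a standard 8,760-hour year; the sample targets are common industry figures, including the four-nines level quoted above for storage service providers.

```python
# Convert an availability target into the downtime it permits per year
# (using a non-leap, 8,760-hour year).

HOURS_PER_YEAR = 365 * 24  # 8,760 hours

def annual_downtime_minutes(availability_percent: float) -> float:
    """Maximum downtime per year, in minutes, for a given availability."""
    return (1 - availability_percent / 100.0) * HOURS_PER_YEAR * 60

if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target:7.3f}% -> {annual_downtime_minutes(target):9.1f} min/year")
```

For example, 99.99 percent availability leaves room for only about 53 minutes of downtime per year.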
Mean Time Between Service Outages (MTBSO) or Mean
Time Between Failure (MTBF) is the average time
(expressed in hours) that a system has been working
between service outages and is typically greater than
2,000 hours. Since modern network devices may have a
short working life (typically five years), MTBF is often a
predicted value, based on stress-testing systems and then
forecasting availability in the future. Devices with moving
mechanical parts such as disk drives often exhibit lower
MTBFs than systems that use fixed components
(e.g., flash memory).
Mean Time To Repair (MTTR) is the average time to repair
systems that have failed and is usually several orders of magnitude less than MTBF. MTTR values may vary markedly,
depending upon the type of system under repair and the
nature of the failure. Typical values range from 30 minutes
through to 3 or 4 hours. A typical MTTR for a complex system
with little inherent redundancy might be several hours.
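MTBF and MTTR together determine steady-state availability. A standard reliability relation, assuming the definitions above, links the two:

A = MTBF / (MTBF + MTTR)

For example, a device with an MTBF of 2,000 hours and an MTTR of 2 hours gives A = 2000 / 2002 ≈ 0.9990, i.e., roughly 99.9 percent availability.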
For a series system:
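A standard textbook formula applies here: in a series (chain) system of n independent elements, the system is available only when every element is available, so

A_series = A_1 × A_2 × … × A_n

For example, two links in series with A_1 = A_2 = 99.9 percent give 0.999 × 0.999 ≈ 99.8 percent; the complementary parallel (redundant) arrangement, A_parallel = 1 − (1 − A_1) × (1 − A_2), gives 1 − 0.001 × 0.001 = 99.9999 percent.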