Performance Management: Application-driven Evolution



Presented by:

Yaakov (J) Stein

Chief Scientist

OAM and QoS

Unique Access Solutions

SERVICE GUARANTEES

OAMQoS-YJS Slide 2

Why do we pay for services ?

Generally good (and frequently much better than toll quality) voice service is available free of charge (Skype, Fring, Nimbuzz, …)

So why does anyone pay for voice services ?

Similarly, one can get free
• (WiFi) Internet access
• email boxes
• file storage and sharing
• web hosting
• software services

So why pay ?


Paying for QoS

The simple answer is that one doesn’t pay for the service; one pays for Quality of Service guarantees

In our voice model:

[Figure: price versus QoS, with labels "toll quality", "with mobility", and "BE"]

But what does QoS mean and why are we willing to pay for it ?

To explain, we need to review some history


Father of the telephone

Everyone knows that the father of the telephone was Alexander Graham Bell (along with his assistant Mr. Watson)

But Bell did not invent the telephone network

Bell and Watson sold pairs of phones to customers

The father of the telephone network was Theodore Vail


Theodore Vail -

Theodore Who?

• Son of Alfred Vail (Morse’s coworker)
• Ex-General Superintendent of US Railway Mail Service
• First general manager of Bell Telephone
• Father of the PSTN

Why is he so important?

• Organized PSTN
• Established principle of reinvestment in R&D
• Established Bell Telephone’s IPR division
• Executed merger with Western Union to form AT&T
• Solved the main technological problems
  • use of copper wire
  • use of twisted pairs
• Organized telephony as a service (like the postal service!)

Vailism is the philosophy that public services should be run as closed centralized monopolies for the public good


What’s the difference ?

In the Bell-Watson model the customer pays once, but is responsible for
• installation
• wires
• wiring
• operations
• power
• fault repair
• performance (distortion and noise)
• infrastructure maintenance
while the Bell company is responsible only for providing functioning telephones

In the Vail model the customer pays a monthly fee, but the provider assumes responsibility for everything, including fault repair and performance maintenance

The telephone company owns the telephone sets and even the wires in the walls !


Service Level Agreements

In order to justify recurring payments, the provider agrees to a minimum level of service in an SLA

SLAs should capture Quality of user Experience (QoE), but this is often hard to quantify

So SLAs usually detail measurable network parameters that influence QoE, such as :
• availability (e.g., the famous five nines)
• time to repair (e.g., the famous 50 ms)
• information rate (throughput)
• information latency (delay)
• allowable defect densities (noise/distortion)

Availability (basic connectivity) always influences QoE

It is hard to predict the effect of the other parameters on QoE, even when there is only one application (e.g., voice)

When multiple applications are in use, it may be impossible
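
To make the availability figure concrete, the following minimal sketch converts an availability target into the downtime it permits. The function name and the choice of a 365-day year are assumptions made for illustration, not part of any SLA definition.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # non-leap year, an assumption for this example


def allowed_downtime_s(availability: float, period_s: float = SECONDS_PER_YEAR) -> float:
    """Return the downtime (in seconds) that an availability target permits over a period."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_s


five_nines = allowed_downtime_s(0.99999)
print(f"five nines allows about {five_nines / 60:.2f} minutes of downtime per year")
```

Under these assumptions five nines leaves a little over five minutes of outage per year, which is why fast repair times matter so much to the SP.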


Some Applications

System traffic: routing protocols, DNS, DHCP, time delivery, system update, OAM, tunneling and VPN setup

Business processes: database access, backup and data-center, B2B, ERP

Communications - interactive: voice, video conferencing, telepresence, instant messaging, remote desktop, application sharing

Communications - non-interactive: email, broadcast programming, music, video (progressive download, live streaming, interactive)

Information gathering: http(s), Web 2.0, file transfer

Recreational: gaming, p2p file transfer

Malicious: DoS, malware injection, illicit information retrieval


What do applications need ?

Some applications only require availability
Some also require minimum available throughput
Some require delay less than some end-end (or RT) delay
Some require packet loss ratio (PLR) less than some percentage
and these parameters are not necessarily independent

For example, TCP throughput drops with PLR

[Figure: TCP throughput versus PLR for 1000 B packets and 50 ms RTT]
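
The dependence of TCP throughput on PLR can be illustrated with the well-known Mathis et al. approximation, throughput ≈ C · MSS / (RTT · √PLR) with C = √(3/2). This is a sketch under assumptions: the slide does not say which model its plot used, and the 1000 B packet size is treated here as the segment size.

```python
import math


def tcp_throughput_bps(mss_bytes: int, rtt_s: float, plr: float) -> float:
    """Approximate steady-state TCP throughput (bits/s) using the Mathis et al. model."""
    if plr <= 0.0:
        raise ValueError("the approximation needs a nonzero packet loss ratio")
    c = math.sqrt(3.0 / 2.0)  # model constant for periodic loss
    return c * (mss_bytes * 8) / (rtt_s * math.sqrt(plr))


# 1000 B packets and 50 ms RTT, the values quoted on the slide
for plr in (1e-4, 1e-3, 1e-2):
    mbps = tcp_throughput_bps(1000, 0.050, plr) / 1e6
    print(f"PLR {plr:.0e}: about {mbps:.1f} Mbit/s")
```

In this model a hundredfold increase in PLR costs a factor of ten in throughput, so a loss objective in an SLA is indirectly a throughput objective for TCP-based applications.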

Some rules of thumb

Mission Critical (and life critical) applications require
• high availability
If there are any MC applications then system traffic requires high availability too

MC applications do not necessarily require strict throughput but always indirectly require
• a certain minimal average throughput
• bounded delay

If the MC application uses TCP then it requires
• low PLR

Real-time applications require
• sufficient throughput
but not necessarily low PLR (audio and video codecs have PLC)

Interactive applications require
• low RT delay

It may be more scalable for a SP to measure 1-way delays


OAM


Monitoring an SLA

The Service Provider’s justification for payment is the maintenance of an SLA

To ensure SLA compliance, the SP must :
• monitor the SLA parameters
• take action if a parameter is dropping below compliance levels

But how does the SP verify/ensure that the SLA is being met ?

Monitoring is carried out using Operations, Administration, Maintenance (OAM)

The customer too may use OAM to see that the SP is compliant !

Technical note:

OAM is a user-plane function but may influence control and management plane operations, for example
• OAM may trigger protection switching, but doesn’t switch
• OAM may detect provisioned links, but doesn’t provision them


Operations, Administration, Maintenance

Traditionally, one distinguishes between 2 OAM functionalities :

1. Fault Monitoring
• OAM runs continuously/periodically at required rate
• detection and reporting of anomalies, defects, and failures
• used to trigger mechanisms in the
  • control plane (e.g. protection switching) and
  • management plane (alarms)
• required for maintenance of basic connectivity (availability)

2. Performance Monitoring
• OAM run :
  • before enabling a service
  • on-demand
  • per schedule
• measurement of performance criteria (delay, PDV, etc.)
• required for maintenance of all other QoE attributes

Early OAM

Analog channels and 64 kbps digital channels did not have mechanisms to check signal validity and quality

Thus
• major faults could go undetected for long periods of time
• hard to characterize and localize faults when reported
• minor defects might be unnoticed indefinitely

As PDH networks evolved, more and more OAM was added on :
• monitoring for valid signal
• loopbacks
• defect reporting
• alarm indication/inhibition

The OAM overhead started to explode in size !

When SONET/SDH was designed, bounded overhead was reserved for OAM functions


OAM for Packet Switched Networks

OAM is more complex for Packet Switched Networks

In addition to the previous defects :
• loss of signal
• bit errors
we have new defect types
• packets may be lost
• packets may be delayed
• packets may be delivered to the wrong destination

The first PSN-like network to acquire OAM was ATM (I.610), although technically ATM is cell-based, not packet-based


Some FM OAM mechanisms (1)

How do we perform Continuity Check ?
• send OAM packets at a constant known rate
• if CC packets are not received for >3 intervals then declare a fault
see also LB / echo mode

How do we perform Connectivity Verification ?
• send OAM packets to a known destination
• if CV packets are received somewhere else then declare a fault

How do we indicate AIS (FDI) ?
• when forward traffic is not received, send AIS OAM packets
• if AIS packets received then declare a fault

How do we indicate RDI (BDI) ?
• when reverse traffic is not received, send RDI OAM packets
• if RDI packets received then declare a fault
Note: RDI is often a flag set on CC message
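
The Continuity Check rule above can be sketched as a small receiver-side monitor. The class and method names are hypothetical, the threshold of 3 intervals is taken from the slide, and the decision not to raise a fault before the first CC packet arrives is an assumption of this example.

```python
class ContinuityMonitor:
    """Declares a fault when CC packets stop arriving for more than N intervals."""

    def __init__(self, interval_s: float, max_missed_intervals: float = 3.0):
        self.interval_s = interval_s
        self.max_missed_intervals = max_missed_intervals
        self.last_rx_s = None  # time of the most recent CC packet, None until the first one

    def on_cc_packet(self, now_s: float) -> None:
        """Record the arrival time of a CC packet."""
        self.last_rx_s = now_s

    def fault(self, now_s: float) -> bool:
        """Return True if no CC packet was seen for more than the allowed number of intervals."""
        if self.last_rx_s is None:
            return False  # assumption: no fault before the first CC packet is seen
        silence_s = now_s - self.last_rx_s
        return silence_s > self.max_missed_intervals * self.interval_s


monitor = ContinuityMonitor(interval_s=1.0)
monitor.on_cc_packet(0.0)
print(monitor.fault(2.5), monitor.fault(3.5))
```

A real implementation would run this check from a timer and feed the result to protection switching or alarm handling; here the caller supplies the time so the logic stays easy to test.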


Some FM OAM mechanisms (2)

How do we use LoopBack ?
• non-intrusive (in-service) (echo mode)
  • send LB request OAM packet to remote site
  • remote site replies with LB reply
  • if LB reply not received then declare a fault
• intrusive (out-of-service)
  • put remote site into LB mode
  • remote site reflects (and does not forward) all traffic (note that it must monitor OAM traffic)
  • if packets sent are not received then declare a fault
  note: need to inform next hops of LB by locking

How do we use LinkTrace ?
• send LB request OAM packet to next hop
• send LB request to following hop
• etc.


Some PM OAM mechanisms (1)

How do we measure Packet Loss Ratio ?

• Traffic (counter) based
  maintain 2 counters:
  • number of packets transmitted to peer: Tx
  • number of packets received from peer: Rx
  then
  • send Tx counter to peer at time 1: Tx(1)
  • peer notes its Rx counter at time of reception: Rx(2)
    and its Tx counter at time of its reply: Tx(3)
  • originator notes its Rx counter when reply is received: Rx(4)
  calculate PLR in both directions
• Synthetic : do not maintain counters, use OAM packets
Note : synthetic loss is only a rough estimate

How do we measure Throughput ?

Primitive way (RFC 2544)
• send packets at maximum rate and observe packet loss
• reduce rate until no loss is observed
Note : there are more sophisticated mechanisms !
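
The counter-based loss measurement described above can be sketched as follows. The example assumes that two successive counter exchanges are compared, so that only packets sent between the two exchanges are counted; the `LossSample` type and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LossSample:
    """Counters captured during one exchange, named as on the slide."""
    tx1: int  # originator's Tx counter when the request is sent
    rx2: int  # peer's Rx counter when the request is received
    tx3: int  # peer's Tx counter when the reply is sent
    rx4: int  # originator's Rx counter when the reply is received


def packet_loss_ratios(prev: LossSample, cur: LossSample) -> tuple[float, float]:
    """Return (forward PLR, reverse PLR) for the interval between two exchanges."""
    fwd_sent = cur.tx1 - prev.tx1
    fwd_received = cur.rx2 - prev.rx2
    rev_sent = cur.tx3 - prev.tx3
    rev_received = cur.rx4 - prev.rx4
    # Guard against an idle interval, in which no loss ratio can be computed
    fwd_plr = (fwd_sent - fwd_received) / fwd_sent if fwd_sent else 0.0
    rev_plr = (rev_sent - rev_received) / rev_sent if rev_sent else 0.0
    return fwd_plr, rev_plr


previous = LossSample(tx1=1000, rx2=900, tx3=500, rx4=480)
current = LossSample(tx1=2000, rx2=1890, tx3=1500, rx4=1480)
print(packet_loss_ratios(previous, current))
```

Working with differences between exchanges means the absolute counter values (and any packets lost before monitoring began) do not affect the result. Counter wrap-around is ignored in this sketch.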


Some PM OAM mechanisms (2)

How do we measure 1-way Packet Delay (Latency) ?

• synchronize clocks at both OAM peers
• send timestamp T1 to peer
• peer timestamps receipt with T2
• calculate time difference T2 - T1

How do we measure 2-way Packet Delay (Latency) ?
• send timestamp T1 to peer
• peer timestamps receipt with T2
• peer replies at T3
• originator timestamps receipt of reply at T4
• calculate time difference (T4 - T1) - (T3 - T2)
• assuming symmetry, 1-way delay is half this amount
Note : do not need to synchronize clocks

How do we measure Packet Delay Variation ?
• send timestamps at a constant rate
• peer calculates timestamp differences and statistics thereof
Note : do not need to synchronize clocks
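
The timestamp arithmetic above can be sketched in a few lines. The function names are hypothetical and timestamps are treated as integers in arbitrary units (microseconds, say). The usage example deliberately gives the peer a clock offset of 1000 units to show that the 2-way delay and the delay variation do not depend on it.

```python
def one_way_delay(t1: int, t2: int) -> int:
    """1-way delay; only meaningful when both clocks are synchronized."""
    return t2 - t1


def two_way_delay(t1: int, t2: int, t3: int, t4: int) -> int:
    """Round-trip delay with the peer's processing time (T3 - T2) removed."""
    return (t4 - t1) - (t3 - t2)


def delay_variation(samples: list[tuple[int, int]]) -> list[int]:
    """Differences between successive (T2 - T1) values; a constant clock offset cancels."""
    raw = [t2 - t1 for t1, t2 in samples]
    return [later - earlier for earlier, later in zip(raw, raw[1:])]


# Peer clock runs 1000 units ahead of the originator's clock
print(two_way_delay(t1=100, t2=1130, t3=1135, t4=165))
print(delay_variation([(0, 1030), (10, 1042), (20, 1050)]))
```

The statistics computed from the variation samples (range, percentiles, and so on) depend on the performance objective being checked and are left out here.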


ETHERNET OAM


What about Ethernet ?

Carrier Ethernet has replaced ATM as the default layer-2

Ethernet is by far the most widespread network interface

Ethernet has some advantages as compared to ATM
• it has network-wide unique addresses
• it has a source address in every packet
but some aspects make Ethernet OAM more difficult
• ConnectionLess (CL)
• multipoint to multipoint
• overlapping layering: need OAM for operator, SPs, customer
• some specific problematic ETH behaviors (flooding, multicast, …)


What’s the problem with CL ?

OAM makes a lot of sense in Connection Oriented environments
• connections last a relatively long amount of time
• there is some SLA at the connection level

For CL networks, the network path is neither known nor pinned

So it doesn’t really make sense to talk about FM

what does continuity mean if, when a link goes down, the network automatically reroutes around the failure ?

The Ethernet CL problem is solved by overlaying CO functionality :
• flows or
• EVCs


Ethernet OAM

For many years there was no OAM for Ethernet (LANs don’t need OAM); now there are two incompatible ones!

• Link layer OAM: 802.3 clause 57 (EFM OAM, 802.3ah)
  • single link only
  • slow protocol, limited functionality
  • some management functions
• Service OAM: Y.1731, 802.1ag (CFM)
  • any network configuration
  • multilevel OAM functionality

In some cases one may need to run both, while in others only service OAM makes sense

Link layer OAM is only for a single link, which is necessarily CO

Service OAM is most frequently used for infrastructure networks, which are also CO


Layer 2 control protocols (L2CPs)

Do not be confused - L2CPs are NOT OAM !

Here are a few well-known L2CPs :

protocol              DA                                                       reference
STP/RSTP/MSTP         01-80-C2-00-00-00, 802.2 LLC                             802.1D §8,9; 802.1D §17; 802.1Q §13
PAUSE                 01-80-C2-00-00-01                                        802.3 §31B (802.3x)
LACP/LAMP             01-80-C2-00-00-02, EtherType 88-09, Subtype 01 and 02    802.3 §43 (ex 802.3ad)
Link OAM              01-80-C2-00-00-02, EtherType 88-09, Subtype 03           802.3 §57 (ex 802.3ah)
ESMC                  01-80-C2-00-00-02, EtherType 88-09, Subtype 10           G.8264
Port Authentication   01-80-C2-00-00-03                                        802.1X
E-LMI                 01-80-C2-00-00-07                                        MEF-16
Provider MSTP         01-80-C2-00-00-08                                        802.1D, 802.1ad
Provider MMRP         01-80-C2-00-00-0D                                        802.1ak
LLDP                  01-80-C2-00-00-0E, EtherType 88-CC                       802.1AB-2009
GARP (GMRP, GVRP)     Block 01-80-C2-00-00-20 through 01-80-C2-00-00-2F        802.1D §10, 11, 12

Note: IEEE disallows forwarding of L2CPs, MEF allows it under certain circumstances


Link Layer OAM (AKA EFM OAM)

Ethernet in the First Mile (Last Mile ?)

EFM networks are mostly p2p DSL links or p2mp PONs, thus a link layer OAM is sufficient for EFM applications

Since an EFM link is between customer and Service Provider, EFM OAM entities are either active (SP) or passive (customer)
• active entity can place passive one into LB mode but not the reverse

EFM OAMPDUs are slow protocol frames, never forwarded
• Ethertype = 88-09 and subtype 03
• messages multicast to slow protocol specific group address
• OAMPDUs must be sent once per second (heartbeat)
• messages are TLV-based

DA 01-80-C2-00-00-02 | SA | TYPE 8809 | SUBTYPE 03 | FLAGS (2B) | CODE (1B) | DATA | CRC
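
The OAMPDU layout above can be sketched with Python's `struct` module. This is an illustration under assumptions: the function name is hypothetical, the code value 0x00 used in the example is assumed to denote an Information OAMPDU, the CRC is left to the network interface, and the frame is padded to 60 bytes (the 64-byte Ethernet minimum less the 4-byte CRC).

```python
import struct

SLOW_PROTOCOLS_DA = bytes.fromhex("0180c2000002")  # slow protocols group address
SLOW_PROTOCOLS_ETHERTYPE = 0x8809
OAM_SUBTYPE = 0x03
MIN_FRAME_WITHOUT_CRC = 60


def build_oampdu(sa: bytes, flags: int, code: int, data: bytes = b"") -> bytes:
    """Assemble DA | SA | TYPE | SUBTYPE | FLAGS (2B) | CODE (1B) | DATA, without the CRC."""
    if len(sa) != 6:
        raise ValueError("source address must be 6 bytes")
    header = struct.pack("!HBHB", SLOW_PROTOCOLS_ETHERTYPE, OAM_SUBTYPE, flags, code)
    frame = SLOW_PROTOCOLS_DA + sa + header + data
    return frame.ljust(MIN_FRAME_WITHOUT_CRC, b"\x00")  # zero padding up to the minimum size


frame = build_oampdu(sa=bytes.fromhex("020000000001"), flags=0x0000, code=0x00)
print(frame[:18].hex("-"))
```

The TLVs that make up the DATA field are not modeled here; a heartbeat sender would call such a function once per second with the appropriate TLVs filled in.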


EFM OAM capabilities

6 codes are defined
• Information (autodiscovery, heartbeat, fault notification)
• Event notification (statistics reporting)
• Variable request (active entity query passive’s configuration) (mngt)
• Variable response (passive entity responds to query) (mngt)
• Loopback control (active entity enable/disable of intrusive LB mode)
• Organization specific (proprietary extensions)

and there are flags in every OAMPDU to expedite notification of critical events
• link fault (RDI)
• dying gasp
• unspecified

monitor slow degradations in performance


Service OAM (AKA CFM, Y.1731)

Many SPs need to monitor full networks, not just single links

Service layer OAM provides end-to-end integrity of the Ethernet service over arbitrary server layers

Because Ethernet is flat, with no true client-server layering (except MAC-in-MAC), service layer OAM is multilevel

Because SPs want to replace transport networks with Ethernet, service OAM must support all OAM features and must enable advanced transport capabilities (such as linear/ring protection switching)

a transport network is a network with :
1. High availability (Fault Management OAM and Automatic Protection Switching)
2. SLA support (Performance Management OAM and QoS mechanisms)
3. a Management plane (optionally a control plane) for configuration and provisioning
4. Efficiency and Scalability


Y.1731 messages

Y.1731 supports many OAM message types:
• Continuity Check: proactive heartbeat with 7 possible rates
• Synthetic Loss Measurement: on demand loss rate estimation
• LoopBack: unicast/multicast pings with optional patterns
• Link Trace: identify path taken to detect failures and loops
• AIS: periodically sent when CC fails
• RDI: flag set to indicate reverse defect
• Client Signal Fail: sent by MEP when client doesn’t support AIS
• LoCK signal: inform peer entity about diagnostic actions
• TeST signal: in-service/out-of-service tests for loss rate, etc.
• Automatic Protection Switching
• Maintenance Communications Channel: remote maintenance
• EXPerimental
• Vendor SPecific


Y.1731 frame format

after DA, SA and Ethertype (8902) Y.1731/802.1ag PDUs have the following header (may be VLAN tagged)

LEVEL (3b) VER (5b) OPCODE (1B) FLAGS (1B) TLV-OFF (1B)

if there are sequence numbers/timestamp(s) they immediately follow

then come TLVs, the “end TLV”, followed by the CRC

TLVs have 1B type and 2B length fields; there may or may not be a value field

the “end-TLV” has type = zero and no length or value fields
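
The header and TLV layout described above can be parsed with a few lines of Python. This is a sketch under one stated assumption: TLV-OFF is taken to count the bytes between the end of the 4-byte header and the first TLV (which is where sequence numbers or timestamps would sit). The function name and the returned dictionary keys are hypothetical.

```python
import struct


def parse_oam_pdu(pdu: bytes) -> dict:
    """Parse the common OAM header and TLVs that follow the 8902 Ethertype."""
    first, opcode, flags, tlv_offset = struct.unpack_from("!BBBB", pdu, 0)
    level = first >> 5        # upper 3 bits
    version = first & 0x1F    # lower 5 bits
    fixed_fields = pdu[4:4 + tlv_offset]  # sequence numbers / timestamps, if any

    tlvs = []
    pos = 4 + tlv_offset
    while pos < len(pdu):
        tlv_type = pdu[pos]
        if tlv_type == 0:     # end TLV: type zero, no length or value
            break
        (length,) = struct.unpack_from("!H", pdu, pos + 1)
        tlvs.append((tlv_type, pdu[pos + 3:pos + 3 + length]))
        pos += 3 + length

    return {"level": level, "version": version, "opcode": opcode,
            "flags": flags, "fixed_fields": fixed_fields, "tlvs": tlvs}


# Level 5, version 0, opcode 3, a 4-byte sequence number, one TLV, then the end TLV
example = bytes([0xA0, 3, 0, 4]) + (7).to_bytes(4, "big") + bytes([3, 0, 2, 0xAA, 0xBB, 0])
print(parse_oam_pdu(example))
```

The CRC is assumed to have been checked and stripped by the receiving interface before this function is called.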


Y.1731 PDU types

opcode          OAM Type      DA
1               CCM           M1 or U
3               LBM           M1 or U
2               LBR           U
5               LTM           M2
4               LTR           U
6-31            RES IEEE
32-63 unused    RES ITU-T
33              AIS           M1 or U
35              LCK           M1 or U
37              TST           M1 or U
39              Linear APS    M1 or U
40              Ring APS      M1 or U
41              MCC           M1 or U
43              LMM           M1 or U
42              LMR           U
45              1DM           M1 or U
47              DMM           M1 or U
46              DMR           U
49              EXM
48              EXR
51              VSM
50              VSR
52              CSF           M1 or U
55              SLM           U
54              SLR           U
64-255          RES IEEE

(M1 = class 1 multicast DA, M2 = class 2 multicast DA, U = unicast DA)


MEPs and MIPs

Maintenance Entity (ME): entity that requires maintenance

ME is a relationship between ME end points

because Ethernet is MP2MP, we need to define a ME Group (MEG)

MEGs can be nested, but not overlapped

MEG LEVEL takes a value 0 … 7
by default: 0,1,2 operator; 3,4 SP; 5,6,7 customer

MEP = MEG end point (MEG = ME group, ME = Maintenance Entity)
(in IEEE MEG is called MA = Maintenance Association)

unique MEG IDs specify to which MEG we send the OAM message

MEPs are responsible for OAM messages not leaking out, but transparently transfer OAM messages of higher level

MIPs = MEG Intermediate Points
• never originate OAM messages
• process some OAM messages
• transparently transfer others
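
The level rules above can be sketched as a small decision function. The function names and the returned strings are hypothetical, and dropping OAM messages of lower level at a MEP is this example's reading of "not leaking out".

```python
def default_level_owner(level: int) -> str:
    """Map a MEG level to its default owner, as listed on the slide."""
    if not 0 <= level <= 7:
        raise ValueError("MEG level must be between 0 and 7")
    if level <= 2:
        return "operator"
    if level <= 4:
        return "SP"
    return "customer"


def mep_action(frame_level: int, mep_level: int) -> str:
    """Decide what a MEP does with an OAM message, based only on MEG levels."""
    for level in (frame_level, mep_level):
        if not 0 <= level <= 7:
            raise ValueError("MEG level must be between 0 and 7")
    if frame_level > mep_level:
        return "forward"   # higher level: transferred transparently
    if frame_level == mep_level:
        return "process"   # the MEP's own level: terminated and handled here
    return "drop"          # lower level: must not leak out of its MEG


print(default_level_owner(3), mep_action(frame_level=5, mep_level=3))
```

A real MEP would also check the MEG ID before processing a message at its own level; that check is omitted here.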


MEPs and MIPs (cont.)


How is OAM used ?

MEF-30 Service OAM FM and MEF-xx Service OAM PM describe the use of OAM for Carrier Ethernet networks, such as
• which Y.1731/802.1 features/messages should be used
• where to put MEPs, what MA names and MEG levels should be used
• minimum number of EVCs that must be supported
• what should be reported and how

Y.1564 (ex Y.156sam) Ethernet Service Activation Test Methodology
• describes commissioning procedures (replaces RFC2544-like benchmarking)

Tests that desired performance level can be achieved, including
• CIR, EIR (and optionally CBS and EBS for bursting)
• traffic policing
• rate, loss, delay, delay variation, availability (measured simultaneously)

Testing in two steps :
• Service Configuration Test: each service separately
• Service Performance Test: all services together

Performance testing may be for :
• 15 minutes (new service on operational network)
• 2 hours (single operator network)
• 24 hours (multiple operator networks)


QOS ENFORCEMENT


QoS approaches

There are two approaches to QoS handling

IntServ (guaranteed QoS)
• define traffic flows (CO approach)
• guarantee QoS attributes for each flow
• reserve resources at each router along the flow
• signaling protocol (e.g., RSVP) needed

DiffServ (statistical QoS)
• retain CL paradigm
• no guaranteed QoS attributes
• mark packets (differentiated, e.g., gold, silver, bronze)
  • marking can be by VLAN, P-bits, IP-ToS/DSCP, or general “flow”
• offer special treatment (priority) relative to other packets
• no resource reservation

For Ethernet and IP DiffServ is the preferred approach


Some fields for marking

Example: For an IPv4 packet inside Q-in-Q Ethernet we have various choices for marking priority

DA (6B) | SA (6B) | ET=88A8 (2B) | P (3b) DEI (1b) SVID (12b) | ET=8100 (2B) | P (3b) CFI (1b) CVID (12b) | ET=0800 (2B) | Ver (4b) IHL (4b) ToS (1B) Len (2B) . . . Source IP Address (4B) Destination IP Address (4B)

802.1p
• user priority field AKA P-bits
• 0 … 7
• priority tagging (VLAN=0) if no VLAN
• P=0 means non-expedited traffic
• 802.1Q recommends mappings

IP ToS
RFC 2474 redefined ToS to contain
• 6 bit DSCP (see also RFC 4594)
• 2 bit ECN
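
The bit layouts of the marking fields above can be unpacked as follows. This is a minimal sketch with hypothetical function names; it operates on the 16-bit tag control field that follows a VLAN Ethertype and on the IPv4 ToS byte, not on whole frames.

```python
def split_vlan_tag(tci: int) -> tuple[int, int, int]:
    """Split a 16-bit tag control field into (P-bits, CFI/DEI bit, VLAN ID)."""
    if not 0 <= tci <= 0xFFFF:
        raise ValueError("tag control field must fit in 16 bits")
    p_bits = tci >> 13           # 3-bit priority
    cfi_or_dei = (tci >> 12) & 1  # 1 bit: CFI in a C-tag, DEI in an S-tag
    vlan_id = tci & 0x0FFF       # 12-bit CVID or SVID
    return p_bits, cfi_or_dei, vlan_id


def split_tos(tos: int) -> tuple[int, int]:
    """Split the IPv4 ToS byte into (6-bit DSCP, 2-bit ECN)."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("ToS must fit in one byte")
    return tos >> 2, tos & 0x03


print(split_vlan_tag(0xA064))  # priority 5, VLAN 100
print(split_tos(0xB8))         # DSCP 46, ECN 0
```

A classifier would typically try these fields in a configured order (S-tag priority, then C-tag priority, then DSCP) and map the result to an internal class.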

Queuing

Ethernet switches have queues (FIFO buffers) on each output port

If there were only one queue then traffic handling would be FIFO

To enable DiffServ prioritization multiple queues are used

Outgoing frames are inserted into queues according to priority marking

Many methods for emptying queues

The most popular are :
• Strict Priority: always take from nonempty queue of highest priority
• Weighted Fair Queuing: take from nonempty queues according to configured “weight”

[Figure: switch fabric feeding the output port queues]
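
The two queue-emptying disciplines can be sketched as follows. The class and method names are hypothetical. Strict Priority is implemented as described; the weighted variant shown is weighted round robin, a simpler relative of Weighted Fair Queuing that counts frames rather than bytes, used here only to illustrate the effect of weights.

```python
from collections import deque


class OutputPort:
    """An output port with one FIFO queue per priority (higher index = higher priority)."""

    def __init__(self, num_queues: int = 8):
        self.queues = [deque() for _ in range(num_queues)]

    def enqueue(self, frame, priority: int) -> None:
        self.queues[priority].append(frame)

    def dequeue_strict(self):
        """Strict Priority: take from the nonempty queue of highest priority."""
        for queue in reversed(self.queues):
            if queue:
                return queue.popleft()
        return None  # all queues are empty

    def drain_weighted(self, weights):
        """Weighted round robin: up to weights[i] frames from queue i in each round."""
        if len(weights) != len(self.queues) or any(w <= 0 for w in weights):
            raise ValueError("need one positive weight per queue")
        while any(self.queues):
            for priority in reversed(range(len(self.queues))):
                for _ in range(weights[priority]):
                    if not self.queues[priority]:
                        break
                    yield self.queues[priority].popleft()


port = OutputPort(num_queues=2)
for name in ("l1", "l2", "l3"):
    port.enqueue(name, 0)
for name in ("h1", "h2", "h3"):
    port.enqueue(name, 1)
print(list(port.drain_weighted([1, 2])))
```

Strict Priority can starve low-priority queues when high-priority traffic is heavy, which is the usual reason for choosing a weighted discipline instead.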


Traffic shaping

One of the most important parts of an SLA is the Committed Information Rate (bps)
This is the datarate (bandwidth) the SP guarantees will be forwarded

There may also be an Excess Information Rate (bps)
This is a datarate that the SP will forward if possible

Packet traffic is often bursty
A customer who did not send data for a while will expect to be able to send at a higher rate afterwards

This is accomplished via traffic shaping
• time integration is accomplished by leaky/token buckets
• the effect of shaping is marking drop eligibility (marking a packet on the line is only possible with S-tags!)

There is often also traffic policing
• policing simply discards packets to police a maximum rate !


MEF token bucket algorithm

Metro Ethernet Forum 10.x defines a bandwidth profile
• there are two byte buckets, C of size CBS and E of size EBS (in bytes)
• tokens are added to the buckets at rate CIR/8 and EIR/8
• when bucket overflows tokens are lost (use it or lose it)

if ingress frame length < number of tokens in C bucket
  frame is green and its length in tokens is debited from C bucket
else if ingress frame length < number of tokens in E bucket
  frame is yellow and its length in tokens is debited from E bucket
else
  frame is red

• green frames are delivered and service objectives apply
• yellow frames are delivered but service objectives don’t apply
• red frames are discarded

for simplicity we assume no coupling and no sharing !

[Figure: token buckets C (size CBS) and E (size EBS)]
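
The marking algorithm above can be sketched as a small class. The names are hypothetical, and three details are assumptions of this example rather than statements from the slide: both buckets start full, a frame that exactly fits the available tokens is treated as conforming (the slide writes a strict inequality), and time is supplied by the caller in seconds. As on the slide, there is no coupling and no sharing.

```python
class BandwidthProfile:
    """Two token buckets (C and E) that color ingress frames green, yellow, or red."""

    def __init__(self, cir_bps: float, cbs_bytes: int, eir_bps: float, ebs_bytes: int):
        self.cir_bytes_per_s = cir_bps / 8  # tokens are bytes, rates are given in bits/s
        self.eir_bytes_per_s = eir_bps / 8
        self.cbs = cbs_bytes
        self.ebs = ebs_bytes
        self.c_tokens = float(cbs_bytes)    # assumption: buckets start full
        self.e_tokens = float(ebs_bytes)
        self.last_s = 0.0

    def _refill(self, now_s: float) -> None:
        elapsed = now_s - self.last_s
        # Tokens beyond the bucket size overflow and are lost ("use it or lose it")
        self.c_tokens = min(self.cbs, self.c_tokens + elapsed * self.cir_bytes_per_s)
        self.e_tokens = min(self.ebs, self.e_tokens + elapsed * self.eir_bytes_per_s)
        self.last_s = now_s

    def color(self, frame_len: int, now_s: float) -> str:
        """Return the color of a frame of frame_len bytes arriving at time now_s."""
        self._refill(now_s)
        if frame_len <= self.c_tokens:
            self.c_tokens -= frame_len
            return "green"
        if frame_len <= self.e_tokens:
            self.e_tokens -= frame_len
            return "yellow"
        return "red"


profile = BandwidthProfile(cir_bps=8000, cbs_bytes=1500, eir_bps=8000, ebs_bytes=1500)
print([profile.color(1000, 0.0) for _ in range(3)], profile.color(1500, 1.0))
```

A policer would discard the red frames, while an ingress marker would set the drop eligibility bit on the yellow ones so that downstream switches can drop them first under congestion.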