Automation of Mobile Radio Network Performance and Fault Management


Automation of
Mobile Radio Network
Performance and Fault Management
(Matkapuhelinradioverkon suorituskyvyn- ja vianhallinnan automatisointi)
A thesis submitted in partial fulfilment
of the requirements for the degree of
Master of Science
Espoo 28.2.2007
Helsinki University of Technology
Department of Electrical and Communications Engineering
Author: Magnus Wallström
Supervisor: Timo Korhonen
Instructor: Mikko Lamberg, MSc (Tech), Nokia Networks
2007-02-28 / Magnus Wallström
Contents
• Introduction and background
• Literature review:
  • Architecture of Mobile Radio Access Networks
  • State of the art in management of mobile networks as defined by 3GPP
  • Performance management data
  • Functionality scenarios in UTRAN
• Methods of the practical study
• Results:
  • Results (I/IX): Current PM and FM organisation and process
  • Results (II/IX): Example on current PM and FM process (1/3)
  • Results (III/IX): Example on current PM and FM process (2/3)
  • Results (IV/IX): Example on current PM and FM process (3/3)
  • Results (V/IX): Analysis of the current organisation and process
  • Results (VI/IX): Problems in current PM and FM process, interrelationship and why-why diagrams of the process problems
  • Results (VII/IX): Summary of the analysis
  • Results (VIII/IX): Solution for automated investigation
  • Results (IX/IX): Implementation of the solution
• Conclusions of the thesis
• References
2007-02-14 / Magnus Wallström
Key concepts
• 3GPP = Project that aims to develop GSM and UMTS specifications in
cooperation with the vendors, operators and standardisation organisations. The
acronym 3GPP stands for Third Generation Partnership Project.
• Fault management = Functions that enable the detection and location of failures
in the network and scheduling of repairs. 3GPP specifies the requirements of the
concept.
• Mobile radio access network = A network that provides wireless access to
users through a radio interface and allows mobile users to move between
coverage areas without losing the connection, i.e. handover.
• Performance management = Functions that enable the performance
measurements of network services. 3GPP specifies the requirements of the
concept.
Introduction and background
How to enhance the productivity of the UTRAN performance management investigations?
• Research area: Mobile Radio Network Performance and Fault Management
• Research questions:
  • What is a mobile radio access network and how is it managed?
  • What is the current performance management set-up in the organisation under study?
  • What is the organisation and communication structure?
  • What is the process to tackle performance problems in UTRAN?
  • What are the problems of the current set-up?
  • What could be solutions to the root problems found in the current set-up?
• Scope:
  • Limited to the European 3G mobile radio network, i.e. UTRAN (UMTS Terrestrial Radio Access Network)
  • Other major mobile radio network technologies are GERAN (GSM radio network), WiMAX (WMAN) and WiFi (WLAN)
Architecture of Mobile Radio Access Networks
• General architecture
  • UE – User Equipment
    • Consists of Mobile Equipment (ME) and Subscriber Identity Module (SIM) for the end-user to access the mobile network
  • RAN – Radio Access Network
    • Currently the most popular mobile RANs are UTRAN and GERAN
    • Other radio access technologies are LTE, WiMAX and WiFi
  • CN – Core Network
    • All RANs are attached to a CN that provides switching and access to services in PSTN and any IP network
  • OSS – Operations Support System
    • All parts of the mobile network may be managed by a centralised system
  [Figure: general architecture of the mobile network. UE connects over Uu / Um / .. to the RAN (UTRAN, GERAN, X); the RAN connects over Iu / A / Gb / .. to the CN; the CN connects to PSTN and IP; the OSS connects over O&M.]
• UTRAN architecture
  • Network elements
    • RNC – Radio Network Controller
    • Node B aka. BTS – Base Transceiver Station
    • A – ATM transmission nodes
  • Interfaces
    • IuCS: RNC to Circuit Switched Core Network (voice and video calls)
    • IuPS: RNC to Packet Switched Core Network (data calls)
    • Iur: RNC to RNC
    • Iub: RNC to BTS
    • Uu: BTS to UE
    • O&M: OSS to any network element: RNC, BTS, ATM nodes and CN elements (MSC, HLR, SGSN, GGSN etc.)
  [Figure: UTRAN architecture. UE connects over Uu to the BTSs (B); BTSs connect over Iub, via ATM nodes (A), to the RNCs; RNCs are interconnected over Iur and connect over Iu (CS/PS) to the CN (CS, PS); the OSS connects over O&M.]
State of the art in management of mobile networks
as defined by 3GPP
• Network management areas relevant to RAN technical support
  • Performance Management (PM)
    • Keeps track of the network performance status and analyses the effects of configuration changes in the network [3GPP TS 32.101]
    • Based on measurements that are continuously recorded in the network elements
  • Fault Management (FM)
    • Consists of fault detection, fault localisation, fault reporting, fault correction and fault repair [3GPP TS 32.111-1]
    • Based mainly on alarms and system logs that the network elements produce
  • Software management (SWM)
    • Covers software request management, installation, customer feedback and software fault management, i.e. detection of software faults and finding resolutions to the problems. This duty is close to and overlaps with fault management (FM) [3GPP TS 32.101]
  • Configuration management (CM)
    • Controls the operational parameters of network elements [3GPP TS 32.600]
• Process of applying the network management:
  1. Performance monitored (PM)
  2. Faults localised (FM)
  Depending on the type of failure (parameter or software?):
  3a. Configuration changed (CM)
  or
  3b. Software defect(s) corrected (SWM)
  4. Monitor the performance (step 1)
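The management loop above can be read as a small state machine. The following is a minimal sketch under that reading; the names `FailureType` and `next_step` are illustrative assumptions and do not come from 3GPP or from the thesis.

```python
from enum import Enum
from typing import Optional


class FailureType(Enum):
    """Hypothetical classification of a localised fault."""
    PARAMETER = "parameter"
    SOFTWARE = "software"


def next_step(step: str, failure_type: Optional[FailureType] = None) -> str:
    """Return the management area that follows `step` in the loop.

    PM -> FM -> (CM or SWM, depending on the failure type) -> PM.
    """
    if step == "PM":
        return "FM"
    if step == "FM":
        if failure_type is None:
            raise ValueError("FM must classify the failure before a fix is chosen")
        # Parameter problems are fixed by configuration changes,
        # software defects by software corrections.
        return "CM" if failure_type is FailureType.PARAMETER else "SWM"
    if step in ("CM", "SWM"):
        # After any correction the performance is monitored again (step 4).
        return "PM"
    raise ValueError(f"unknown management area: {step}")
```

For example, `next_step("FM", FailureType.SOFTWARE)` returns `"SWM"`, after which `next_step("SWM")` returns to `"PM"`.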
Functionality scenarios in UTRAN
• Control plane, i.e. signalling, on RRC (Radio Resource Control) connection
  • Major purpose: set up and release a call
• User plane, i.e. the traffic, on RAB (Radio Access Bearer) connections:
  • Major purpose: define the QoS class of the call:
    • Conversational class, RT (Real Time), applications: CS voice and video calls
    • Streaming class, RT, applications: CS streaming video
    • Interactive class, NRT (Non RT), applications: PS (Packet Switched) web browsing
    • Background class, NRT, applications: emails, MMS (Multimedia Messaging Service)
• Signalling scenarios:
  • MTC (Mobile Terminated Call) scenario
    1. Paging: RNC sends an “RRC Paging Type 1” message to the Uu interface
    2. RRC connection setup: The paged UE responds by starting the radio control connection establishment procedure: (1) the UE sends an “RRC Connection Request” message to the RNC (“RRC Connection Setup Attempt” counter is updated); (2) the RNC tries to allocate radio resources (BTS) and, if successful, responds with an “RRC: Connection Setup” message (“RRC Connection Setup Complete” counter is updated); (3) finally the UE responds with an “RRC: Connection Setup Complete” message (“RRC Access Complete” counter is updated).
    3. Transaction reasoning: RNC and CN negotiate the transaction type
    4. Authentication and security procedure: UMTS subscriber and network authenticate each other, and other security mechanisms are activated
    5. RAB setup for transaction: Actual communication resources for the transaction are allocated
    6. Transaction: UE has an active user plane bearer connection across the whole UMTS network
    7. RAB release for transaction clearing: Network resources related to the transaction are released, i.e. all the active RAB connections of a UE are released
    8. RRC connection release: Radio control connection between the UE and the UTRAN is released
  • Mobility (handover scenario):
    1. Measurement: the UE sends a radio-link measurement report to the RNC
    2. Decision: the final decision to make a handover is made in the RNC by the RRM handover control algorithms. The decision is based on the handover criteria and algorithm parameters
    3. Execution: handover signalling between e.g. UE and RNC, and radio resource allocation e.g. in BTS
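Step 2 of the MTC scenario describes which counter is updated at each message of the RRC connection setup. The sketch below mirrors that description. The counter names are taken from the slide; the function name and its two boolean parameters are hypothetical simplifications of the real signalling.

```python
from collections import Counter


def rrc_connection_setup(counters: Counter,
                         resources_available: bool,
                         ue_completes: bool) -> bool:
    """Simulate one RRC connection setup and update the counters.

    Returns True when the connection is fully established.
    """
    # (1) The UE has sent "RRC Connection Request" to the RNC.
    counters["RRC Connection Setup Attempt"] += 1
    if not resources_available:
        return False  # RNC could not allocate radio resources
    # (2) The RNC has responded with "RRC: Connection Setup".
    counters["RRC Connection Setup Complete"] += 1
    if not ue_completes:
        return False  # UE never answered
    # (3) The UE has responded with "RRC: Connection Setup Complete".
    counters["RRC Access Complete"] += 1
    return True
```

Running the function once for a success, once for a resource allocation failure and once for a missing UE response leaves the three counters at 3, 2 and 1, which is the kind of gap between counters that the failure analysis later looks for.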
Performance management data
• Performance counters
  • UTRAN collects thousands of counters that count the occurrences of specific events
  • E.g. RRC Setup Attempts, RRC Setup Completes, RRC Setup Attempt Failure RNC, RRC Setup Attempt Failure BTS etc.
• KPI (Key Performance Indicator)
  • Most often calculated from performance counters as relative percentage values
  • Relative (%) KPIs are comparable between networks of different sizes; absolute values are not, because the amount of traffic varies
  • Form: KPI = (a formula of performance counters)
  • Examples:
    • RRC_Acc% = “RRC access complete ratio” = “RRC Access Completes” / “RRC Setup Attempts”
    • CSSR, Call Setup Success Rate (voice call) = RRC_Acc% * (RAB_voice_attempts - RAB_voice_failures) / RAB_voice_attempts
    • CCSR, Call Completion Success Rate (voice call) = RAB_active_voice_successful_completes / (RAB_active_voice_failures + RAB_active_voice_successful_completes)
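As a worked illustration of the KPI forms, here is a minimal sketch that computes the three example KPIs from counter values. It assumes that the CSSR denominator is the voice RAB attempts and that CCSR is the share of successfully completed active voice RABs, which is the reading consistent with a success rate. The function names and the input numbers are hypothetical.

```python
from typing import Optional


def ratio_pct(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator/denominator as a percentage, or None if undefined."""
    if denominator == 0:
        return None
    return 100.0 * numerator / denominator


def rrc_acc(access_completes: int, setup_attempts: int) -> Optional[float]:
    """RRC access complete ratio in percent."""
    return ratio_pct(access_completes, setup_attempts)


def cssr(rrc_acc_pct: float, rab_voice_attempts: int,
         rab_voice_failures: int) -> Optional[float]:
    """Call Setup Success Rate (voice) in percent."""
    if rab_voice_attempts == 0:
        return None
    rab_success = (rab_voice_attempts - rab_voice_failures) / rab_voice_attempts
    return rrc_acc_pct * rab_success


def ccsr(active_failures: int, successful_completes: int) -> Optional[float]:
    """Call Completion Success Rate (voice) in percent."""
    return ratio_pct(successful_completes,
                     active_failures + successful_completes)
```

With 980 access completes out of 1000 setup attempts, `rrc_acc` gives 98.0; combined with 10 failures out of 500 voice RAB attempts, `cssr` gives about 96.04.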
Methods of the practical study
• Based on UCD (User Centered Design) process and framework
• Chronologically the practical study had three phases:
  I. Study and define the current process and organisation
    a. Study: interview
    b. Study: focus group
    c. Study: contextual enquiry
  II. Analyse the current set-up
    a. Analysis: brainstorming
    b. Analysis: affinity diagram
    c. Analysis: double teams
    d. Analysis: interrelationship diagram
    e. Analysis: why-why-analysis
  III. Develop an enhanced process
    a. Solution: brainstorming
    b. Solution: SWOT analysis
    c. Solution: UML diagrams
Results (I/IX):
Current PM and FM organisation and process
• Organisation-wise, Technical support is the communicator between the local customer contact teams and the product line R&D organisation.
  1. Local teams communicate the performance status of the customer networks to the technical support.
  2. Technical support investigates and analyses the performance degradations and makes decisions to fix them with the co-operation of R&D.
  3. R&D’s responsibility is to develop corrections to the system, if no other solution is effective.
  [Figure: organisation. Customers A, B and C; Local teams A and B; Technical Support; R&D.]
• The process follows the three phases of the root-cause analysis methodology:
  • Investigation (maps to PM [3GPP])
  • Analysis (maps to FM [3GPP])
  • Decision (maps to SWM and CM [3GPP])
• Each phase of the process has deliverables that are utilised in the later phases.
  • Investigation:
    1. Get top level PM data
    2. Find performance dips
    3. Get detail level PM data
    4. Find problematic sites
    5. Gather logs, alarms etc.
  • Analysis:
    6. Analyse the logs and other detail level data that was gathered during the investigation
  • Decision:
    7. Generate solutions
Results (II/IX):
Example on current PM and FM process (1/3)
1. Get KPIs and failure counters for the required top object (i.e. RNC)
  • Achieved by using a reporting tool that collects the needed counters from the OSS measurement database and calculates the KPI values based on the counters. Manual post-processing of the data produces the graphical output.
2. Find measurement periods where there is a dip in performance:
  • Call setup performance: at 11 the CSSR KPI has had poor values. The phenomenon has partly continued during the next hour.
  • Retainability: high drop call ratio at 16. The counter diagram verifies that the drop in CCSR is due to a high number of RAB active failures.
[Figure: KPIs. CSSR and CCSR (KPI [%]) together with RRC Connection Setup Attempts, per hour from 2006/12/23 00 to 2006/12/23 23.]
[Figure: Failure counters. RRC setup failures, RRC access failures, RAB setup failures, RAB access failures, RAB active failures and RRC active failures together with RRC setup attempts, per hour from 2006/12/23 00 to 2006/12/23 23.]
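Step 2 is done by eye from the charts in the current process. A minimal sketch of how the same check could be automated is shown below, assuming a dip means "more than a fixed number of percentage points below the median of the day". That rule, the function name `find_dips` and the sample values are assumptions made for illustration, not the method used in the thesis.

```python
from statistics import median
from typing import Dict, List


def find_dips(kpi_by_period: Dict[str, float],
              max_drop: float = 5.0) -> List[str]:
    """Return the measurement periods whose KPI is a dip.

    A period counts as a dip when its KPI value lies more than
    `max_drop` percentage points below the median of all periods.
    """
    baseline = median(kpi_by_period.values())
    return [period for period, value in kpi_by_period.items()
            if value < baseline - max_drop]


# Hypothetical hourly CSSR values for one RNC.
hourly_cssr = {
    "2006/12/23 10": 98.1,
    "2006/12/23 11": 61.0,
    "2006/12/23 12": 88.5,
    "2006/12/23 13": 98.4,
    "2006/12/23 14": 98.0,
}
dips = find_dips(hourly_cssr)
```

The median is used as the baseline because a single very poor hour does not pull it down the way it would pull down a mean.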
Results (III/IX):
Example on current PM and FM process (2/3)
• 3. Get the KPIs and failure counters on BTS level.
  • This can be achieved using the same reporting tool as in the first phase. The output is an extensive list of all the BTSs under one RNC, with all measurement periods and counters for each BTS.
• 4. Find the network elements that are causing the performance dip.
  • After post-processing the data, the results are lists of BTSs that are the main contributors to the performance dips

Call setup dip at 11 (main contributing counter: RRC setup failures):

time           RNC id  BTS id  Cell id  CSSR   CCSR   RRC setup attempts  RRC setup failures
2006/12/23 11  1       104     1041     15.02  99.01  1345                1132
2006/12/23 11  1       104     1042     19.19  96.97  1032                820
2006/12/23 11  1       104     1043     22.26  97.76  602                 452
2006/12/23 11  1       69      691      56.76  99.05  185                 69
2006/12/23 11  1       69      693      59.60  96.61  99                  24
2006/12/23 11  1       201     2011     64.29  96.30  84                  19

Retainability dip at 16 (main contributing counter: RAB active failures):

time           RNC id  BTS id  Cell id  CSSR   CCSR   RRC setup attempts  RAB active failures
2006/12/23 16  1       123     1231     98.46  15.80  1234                1023
2006/12/23 16  1       123     1232     98.45  24.70  1032                765
2006/12/23 16  1       123     1233     97.51  45.32  602                 321
2006/12/23 16  1       69      691      96.62  91.40  385                 32
2006/12/23 16  1       69      693      97.66  91.62  342                 28

In both tables the remaining failure counters (RRC setup, RRC access, RAB setup, RAB access, RAB active and RRC active failures, where not listed above) are in single digits for every listed cell.
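The post-processing in step 4 amounts to ranking cells by their share of the failures. The sketch below shows one way to do that, using the cell ids and RRC setup failure counts listed on the slide for the dip at 11. The function name and the 90 % coverage rule are assumptions for illustration.

```python
from typing import Dict, List


def main_contributors(failures_by_cell: Dict[int, int],
                      coverage: float = 0.9) -> List[int]:
    """Return the cells that together cause `coverage` of all failures.

    Cells are taken in descending order of failure count until their
    cumulative share of the total reaches the coverage threshold.
    """
    total = sum(failures_by_cell.values())
    selected: List[int] = []
    cumulative = 0
    ranked = sorted(failures_by_cell.items(),
                    key=lambda item: item[1], reverse=True)
    for cell_id, failures in ranked:
        if total == 0 or cumulative >= coverage * total:
            break
        selected.append(cell_id)
        cumulative += failures
    return selected


# RRC setup failures per cell id at 2006/12/23 11, as listed on the slide.
rrc_setup_failures = {1041: 1132, 1042: 820, 1043: 452,
                      691: 69, 693: 24, 2011: 19}
worst_cells = main_contributors(rrc_setup_failures)
```

Here the three cells of BTS 104 account for more than 90 % of the listed failures, so they are the sites whose logs are worth gathering first.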
Results (IV/IX):
Example on current PM and FM process (3/3)
• 5. Gather the system logs for those network elements that are the main contributors to the RNC performance dip.
  • Achieved by connecting to the network element’s O&M unit, either by manual command procedures or by using a tool that automates the procedure. The log files are usually in binary format, so they need to be opened by a parser or converted to textual format before the analysis can take place.
• 6. Analyse the detailed data.
  • The format of the data is vendor specific, i.e. not defined in public specifications => no general guidance can be set for the analysis itself.
  • Highly dependent on individual system specialists who can handle the versatile analysis and produce reliable results. In this context the analysis can be treated as a black box, which has system data as input, i.e. logs, parameters, alarms, counters and KPIs, and a set of root causes for the performance problem as output.
• 7. Generate a solution to the root cause.
  • Needs the presence of a skilled system specialist. Depending on the type of solution, finding a working solution might need a trial-and-error approach.
  • Before applying the solution to a live network, it is tested in a test bed of the vendor. Some network operators also have test beds of their own, on which they verify the solutions, e.g. SW corrections, before they are installed in the live network.
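Step 5 mentions that binary logs must be converted to text before analysis. Since the real formats are vendor specific and not public, the sketch below uses an invented record layout (little-endian: 32-bit UNIX timestamp, 16-bit unit id, 16-bit event code) purely to show the shape of such a converter. Nothing here describes an actual network element log.

```python
import struct
from datetime import datetime, timezone
from typing import List

# Hypothetical fixed-size record: timestamp (uint32), unit id (uint16),
# event code (uint16), all little-endian. Real layouts are vendor specific.
RECORD = struct.Struct("<IHH")


def decode_log(raw: bytes) -> List[str]:
    """Convert a binary log made of RECORD entries into text lines."""
    lines = []
    # iter_unpack raises struct.error if len(raw) is not a multiple
    # of the record size, which flags a truncated file.
    for timestamp, unit_id, event_code in RECORD.iter_unpack(raw):
        when = datetime.fromtimestamp(timestamp, tz=timezone.utc)
        lines.append(
            f"{when:%Y-%m-%d %H:%M:%S} unit={unit_id} event=0x{event_code:04X}"
        )
    return lines
```

A file produced with `RECORD.pack(0, 3, 0x1A2B)` would decode to the single line `1970-01-01 00:00:00 unit=3 event=0x1A2B`.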
Results (V/IX):
Analysis of the current organisation and process
• Main problems:
• Problems in current organisation operation
• 7.2.1 High travel costs
• 7.2.2 Troubleshooting poorly controlled
• Problems in current PM and FM process
• 7.3.2 NE logs not available for performance dips
• 7.3.3 Alarms not mapped to performance dips
• 7.3.4 Configuration data not available for performance dips
• 7.3.5 Internal failures not distinguished from external causes
• 7.3.6 Investigation is time consuming
Results (VI/IX): Problems in current PM and FM process,
Interrelationship- and why-why-diagrams of the process
problems
[Figure: interrelationship diagram and why-why diagram of the process problems. Problems and causes shown:]
• Investigation is time-consuming
• Internal failures not distinguished from external
• Alarms not mapped
• CM data not available
• NE logs not available
• No reference points
• Manual work
• No consistent set of tools for performance management and troubleshooting available
• Alarms not collected
• NE logs are too large to be saved continuously for a long time
• Performance data is not available to base the decision to gather logs on
• Too much traffic per RNC
• RNC CPU/MEM capacity too small for continuous monitoring
• Lack of competence
Results (VII/IX):
Summary of the analysis
• The analysis set two general requirements for the solution:
  • Support fault management analysis conducted by system specialists.
    The solution should be able to collect relevant fault management (FM) data, i.e. NE logs, configuration data and alarms, for troubleshooting. The relevance of the FM data is evaluated on the basis of the performance measurement data, which may be collected either from the OSS or from the RNC.
  • Support general reporting of performance conducted by operator and vendor performance management bodies.
    The solution should produce scalable reports of the performance measurement data. Reports should represent the performance data both on the whole-network level and on the individual network element level, down to the level of a single cell. Other statistical requirements are aggregation over time and the ability to average the data.
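The reporting requirement implies rolling the same counters up to different levels. A minimal sketch of that roll-up is given below; the row fields, level names and numbers are hypothetical. It sums the counters first and only then forms the ratio, which is one reasonable way to average a ratio KPI without letting small cells weigh as much as large ones.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Optional, Tuple

# Which row fields identify an object at each reporting level.
LEVELS = {
    "network": (),
    "rnc": ("rnc",),
    "bts": ("rnc", "bts"),
    "cell": ("rnc", "bts", "cell"),
}


def aggregate(rows: Iterable[Mapping[str, int]],
              level: str) -> Dict[Tuple[int, ...], Optional[float]]:
    """Return completes/attempts in percent for each object at `level`."""
    keys = LEVELS[level]
    sums = defaultdict(lambda: [0, 0])  # key -> [attempts, completes]
    for row in rows:
        key = tuple(row[k] for k in keys)
        sums[key][0] += row["attempts"]
        sums[key][1] += row["completes"]
    return {key: (100.0 * done / tried if tried else None)
            for key, (tried, done) in sums.items()}


# Hypothetical counter rows, one per cell.
rows = [
    {"rnc": 1, "bts": 104, "cell": 1041, "attempts": 1000, "completes": 900},
    {"rnc": 1, "bts": 104, "cell": 1042, "attempts": 100, "completes": 50},
    {"rnc": 1, "bts": 69, "cell": 691, "attempts": 100, "completes": 100},
]
network_kpi = aggregate(rows, "network")[()]
```

In this example the network-level value is 87.5 %, whereas a plain mean of the three cell percentages (90, 50 and 100) would give 80 %.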
Results (VIII/IX):
Solution for automated investigation
INPUT:
• “Connection to a live network”. The developed solution requires a working remote or on-site connection to the network. This avoids restricting which parts of the network the data can be gathered from, i.e. the OSS, NEs or other databases in the network.
[Figure: solution overview. Input “Connection to a live network” feeds “the solution”: automated investigation of the network performance (Investigation, PM). Its outputs are system log files for the failures that have occurred in the live network, which feed Analysis (FM) and Decision (SWM, CM), and an overview report of the live network performance.]
OUTPUT:
• System log files and other detail data for the failures that have occurred in the live network. The root-cause analysis phase utilises this data to make decisions.
• Overview reporting of the network performance that can be utilised in reporting the status of the network to company management and to the customer, i.e. the network operator.
Results (IX/IX):
Implementation of the solution
• The distributed system consists of five separate applications:
• RNC monitor
• RNC static performance data fetcher
• OSS data fetcher
• Processor & Report (application)
• Report (server)
[Figure: placement of the applications in the network. Sol1: OSS data fetcher (at the OSS); Sol2: RNC monitor / data fetcher (at the RNC); Sol3: Log & alarm fetcher; Sol4: Processor; Sol5: Report. Shown against the UTRAN architecture: BTSs (B), ATM node (A), RNCs, Iub, Iur, Iu (CS/PS), CN (CS, PS), O&M and OSS.]
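To show how the separate applications could fit together, here is a minimal sketch of the data flow: performance data comes in, low-performance periods trigger log collection, and an overview report is produced. The function `investigate`, its parameters and the threshold rule are assumptions for illustration and do not describe the actual implementation.

```python
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, str]  # (network element id, measurement period)


def investigate(kpis: Dict[Sample, float],
                fetch_logs: Callable[[str, str], bytes],
                threshold: float) -> Tuple[List[str], Dict[Sample, bytes]]:
    """Build an overview report and collect logs for low-KPI periods.

    `kpis` stands in for the output of the OSS/RNC data fetchers and
    `fetch_logs` for the log & alarm fetcher; both are hypothetical.
    """
    overview: List[str] = []
    logs: Dict[Sample, bytes] = {}
    for (ne_id, period), value in sorted(kpis.items()):
        low = value < threshold
        overview.append(
            f"{period} {ne_id} KPI={value:.2f}{' LOW' if low else ''}"
        )
        if low:
            # Logs are gathered only where the performance data
            # indicates a problem, keeping the collected volume small.
            logs[(ne_id, period)] = fetch_logs(ne_id, period)
    return overview, logs
```

Passing the fetcher in as a callable keeps the processing logic independent of how a given network element is actually reached.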
Conclusions of the thesis
•
Summary of the thesis, Thesis studied practical problems of mobile radio network management:
•
Conclusion: UTRAN vendor technical support requires a distributed system of troubleshooting tools to
enhance its troubleshooting processes
•
•
Purpose of the troubleshooting tools is to enhance the performance investigation by automating gathering of the
performance and other relevant network behaviour data for the time periods where network suffers from low
performance
The reasoning of the solution bases on
•
Current troubleshooting set-up study:
•
•
•
The analysis of the current set-up:
•
•
Same principles are applicable to other radio network (e.g. GERAN) performance and fault management
•
Utilization of an OSS in data gathering makes the solution more portable to other radio network systems
•
Typically OSS uses relational SQL databases. Different radio networks have different performance indicators. Then the same tools may
be used after modifying SQL-queries, which is a straightforward process
Future work
•
•
19
currently the main problem is the inefficiency of the first, i.e. investigation, phase in the performance and fault management process.
Generalisation of the results
•
•
Organisation: vendor home base technical support that is a link between local teams, which are located by the operated networks, and
the vendor R&D. During special occasions, e.g. a new product release or emergency situation in network, the organisation may adjust
itself by transferring temporarily system specialist to work locally by the operated network.
Process: The practical performance and fault management process consists of three phases: investigation, analysis and decision.
Scope was limited to investigation. Also the complex analysis-phase has demanding development needs.
Technical support organization requires product-processes to manage the development and maintanance of
the troubleshooting tools.
2007-02-14 / Magnus Wallström
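The portability argument rests on KPI definitions living in SQL queries. The sketch below illustrates the idea with an in-memory SQLite database. The table `pm_counters`, its columns and the counter names are invented for this example and do not reflect any real OSS schema.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical schema: one row per network element, period and counter.
SCHEMA = """
CREATE TABLE pm_counters (
    ne_id   TEXT,
    period  TEXT,
    counter TEXT,
    value   INTEGER
)
"""

# The KPI is a ratio of two counters; only the counter names change
# when the query is reused for another radio network technology.
KPI_QUERY = """
SELECT ne_id, period,
       100.0 * SUM(CASE WHEN counter = :num THEN value ELSE 0 END)
             / NULLIF(SUM(CASE WHEN counter = :den THEN value ELSE 0 END), 0)
FROM pm_counters
GROUP BY ne_id, period
ORDER BY ne_id, period
"""


def kpi_per_period(conn: sqlite3.Connection, numerator: str,
                   denominator: str) -> List[Tuple[str, str, float]]:
    """Return (ne_id, period, KPI %) rows for the given counter pair."""
    params = {"num": numerator, "den": denominator}
    return conn.execute(KPI_QUERY, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO pm_counters VALUES (?, ?, ?, ?)",
    [("RNC1", "2006/12/23 11", "RRC_ACCESS_COMPLETES", 900),
     ("RNC1", "2006/12/23 11", "RRC_SETUP_ATTEMPTS", 1000)],
)
result = kpi_per_period(conn, "RRC_ACCESS_COMPLETES", "RRC_SETUP_ATTEMPTS")
```

Moving the same tool to another radio network would then mean calling `kpi_per_period` with that network's counter names, or swapping the query text, while the surrounding code stays unchanged.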
References
• Standards and Technical Specifications
• 3GPP: GSM, 3G and LTE
• IEEE: WiFi and WiMAX
• Commercial material
• Nokia Multiradio: http://www.nokia.com/NOKIA_COM_1/Microsites/NokiaWorld/Press/Multiradio_Press_Backgrounder.pdf
• Cisco WiMAX: http://www.cisco.com/en/US/netsol/ns616/networking_solutions_customer_profile0900aecd80334a23.html
• Previous theses
• Kujala, Kimmo (2006) Expert System for Mobile Network Troubleshooting. Thesis. Diplomityö,
TKK / Sähkö- ja tietoliikennetekniikan osasto, 2006. 72p.
• An attempt to build automated fault analysis tool system. The result in the thesis was that automated
analysis is still unreliable!
• Utriainen, Juha (2004) UTRAN Operation System Security. Thesis. Diplomityö, TKK / Sähkö- ja
tietoliikennetekniikan osasto, 2004. 64p.
• Gives a good overview on the UTRAN O&M (Operation and Maintenance)
• Handbooks
• Kaaranen, Heikki (2005) UMTS Networks – Architecture, Mobility and Services. Second Edition.
JOHN WILEY & SONS. ISBN: 0470011033
• Nielsen, Jakob (1993) Usability Engineering. Boston: Academic Press, 1993.