Exchange of ideas (Ideenaustausch) - Royal Institute of Technology
Data cleaning and outlier detection
Fredrik Strandberg, HypoVereinsbank
July 12 2001
IDD and EOD
• I. Intraday data (IDD) "cleaning"
by a fast, adaptive filter. A general model is assumed, with
special treatment of specific known error types.
• II. End-of-day data (EOD) outlier detection
Sensitive real-time and ex-post analysis by statistical
means
• (III. Backtesting the results from II, again using I but with
instrument-specific parameters)
• (IV. Error examples database and self-learning features)
I. IDD Cleaning
• Several special filters for specific error types needed
• Real-time => fast routines needed
• 1. Univariate comparisons
- to empirical data (pairwise)
- to a model (mean, median, trends, forecast residuals, ...)
• 2. Multivariate comparisons
Problems in IDD
• Problems in high-frequency (HF) data:
1. Non-homogeneity (irregular spacing) =>
2. (Multivariate case:) Non-synchronous data
3. Sparse series (low liquidity)
4. Strong intra-day seasonal patterns
5. GARCH and EWMA models not applicable
6. Computational effort
Specific error types
• Decimal errors
• Scaling problems due to quote unit conventions
• Test quotes (sent by contributors as a connection test)
Could be one bad quote in the morning or a linearly
changing series. Usually at illiquid times.
• Repeated old quotes. Can be harmful if there are too many.
• Quote copying. Some contributors copy and re-send
quotes of other contributors, just to show a strong
presence on the data feed. Sometimes modified by adding
small random disturbances.
Source: Olsen
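Two of the error types above lend themselves to very simple checks. The sketch below is only an illustration, not part of the Olsen filter: the function names, the power-of-ten test for scaling problems and the tolerance value are assumptions made for this example.

```python
import math

def is_scaled_quote(quote, reference, tolerance=0.02):
    """True if `quote` looks like `reference` multiplied by a power of ten (other than 1).

    `tolerance` is measured in log10 units and is a hypothetical choice.
    """
    if quote <= 0 or reference <= 0:
        return False
    log_ratio = math.log10(quote / reference)
    nearest_shift = round(log_ratio)
    return nearest_shift != 0 and abs(log_ratio - nearest_shift) < tolerance

def repeated_run_lengths(quotes):
    """For each quote, the number of consecutive identical quotes ending at it."""
    runs, current, previous = [], 0, None
    for q in quotes:
        current = current + 1 if q == previous else 1
        runs.append(current)
        previous = q
    return runs
```

A long run length at an illiquid time is only a hint that a stale quote is being repeated; whether it is harmful depends on the instrument.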
The Olsen Filter
• Hierarchical structure of special filters
• Complicated time scale transformation
• Central mechanism:
Weighted sum of credibilities from pairwise comparisons
with previous and past values, depending on quote origins
and time differences.
• General assumption:
Cred ~ P[ΔX > x] for "big" x, and f(x) ~ x^(-α). Olsen chose
α = 4
• 34 parameters
• Multivariate filtering is not yet implemented
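The central mechanism can be illustrated with a toy version. This is a sketch of the general idea only and not the Olsen filter itself: the credibility function 1 / (1 + z^4), the exponential time weighting, the constant `scale` and all names are assumptions made for this example. The real filter has many more parameters and a transformed time scale.

```python
def pairwise_credibility(change, scale, tail_exponent=4.0):
    """Credibility in (0, 1] that decays like a power law for large scaled changes."""
    z = abs(change) / scale
    return 1.0 / (1.0 + z ** tail_exponent)

def quote_credibility(new_time, new_value, history, scale, half_life=60.0):
    """Time-weighted average of pairwise credibilities against earlier quotes.

    history: list of (time_in_seconds, value) pairs observed before new_time.
    scale:   typical size of a price change (hypothetical, held constant here).
    """
    if not history:
        raise ValueError("need at least one earlier quote to compare with")
    weighted_sum, weight_total = 0.0, 0.0
    for old_time, old_value in history:
        # Older quotes carry less weight in the comparison.
        weight = 0.5 ** ((new_time - old_time) / half_life)
        weighted_sum += weight * pairwise_credibility(new_value - old_value, scale)
        weight_total += weight
    return weighted_sum / weight_total
```

With a history around 100.0 and a scale of 0.1, a new quote of 100.08 keeps a credibility close to 1, while a quote of 110.0 drops to almost 0.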
Multivariate filtering
• Multivariate filtering seems to be the final answer for
telling whether a jump was genuine or not.
• For sparse series, MF seems to be the only answer!
• Idea:
1. Use all strongly (anti)correlated dense series (or all series) and
the Expectation Maximization (EM) algorithm to fill the gaps,
as described in the RiskMetrics technical document.
2. Assign the estimated values an appropriate credibility
(lowered, of course, because of the asynchronous data and the
estimation).
3. Use the univariate filter as usual.
4. Remove the created quotes from the output.
• Other possibilities: prices implied by arbitrage conditions?
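Steps 1, 2 and 4 of the idea can be sketched as follows. Note the substitution: instead of the EM algorithm, this example fills gaps with a single linear regression on one dense series, which is a much cruder stand-in. The function names and the credibility value 0.5 for created quotes are assumptions.

```python
def fill_gaps_by_regression(sparse, dense, fill_credibility=0.5):
    """Fill None entries of `sparse` from a linear regression on `dense`.

    Returns a list of (value, credibility, is_created) triples.
    Assumes at least two observed points with distinct `dense` values.
    """
    pairs = [(d, s) for d, s in zip(dense, sparse) if s is not None]
    n = len(pairs)
    mean_d = sum(d for d, _ in pairs) / n
    mean_s = sum(s for _, s in pairs) / n
    var_d = sum((d - mean_d) ** 2 for d, _ in pairs)
    cov = sum((d - mean_d) * (s - mean_s) for d, s in pairs)
    slope = cov / var_d
    intercept = mean_s - slope * mean_d
    filled = []
    for d, s in zip(dense, sparse):
        if s is None:
            # Created quote: estimated value with lowered credibility.
            filled.append((intercept + slope * d, fill_credibility, True))
        else:
            filled.append((s, 1.0, False))
    return filled

def drop_created(filled):
    """Step 4: remove the created quotes again after filtering."""
    return [(value, cred) for value, cred, created in filled if not created]
```

The univariate filter (step 3) would run on the filled series between these two calls.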
Summary of the Olsen filter
• Object-oriented structure, easy to implement and modify.
• Adaptive: self-learning and self-calibrating to new
instruments
• 34 parameters
• Possible improvements:
1. Multivariate filtering
2. A better time scale (specified and recommended by
Olsen!)
3. Less general assumptions, such as tail indices etc.
A question of computational time.
II. Outlier Detection in EOD
• Goal: Detection of influential values
• Benefits:
- Homogeneous (and synchronized) data
- Time to estimate sophisticated models
• Methods for real-time usage
1. Conditional VaR (ESF)
2. GARCH residuals analysis
3. High-pass filtration
• Methods for historical usage
1. GARCH residuals analysis
2. Low-pass filtration (trend deviation)
3. High-pass filtration
(4. Wavelet transform multiresolution frequency decomposition)
The GARCH model
• Generalisation of the standard EWMA.
• Price: Troublesome non-linear optimization (ML)
• => Suitable goodness-of-fit measure needed (see Mikosch)
• Reliefs:
- Re-estimation is not needed very often
- Simplest possible: GARCH(1,1) with "targeted
variance" and zero mean - should be enough!
• Innovation distribution?
X(t) = σ(t)Z(t) + μ(t)
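A minimal sketch of the "simplest possible" model, a zero-mean GARCH(1,1) with variance targeting, is given below. The coefficients `a` and `b` are taken as given here; in practice they would come from the ML estimation. The function name and the choice of the sample variance as starting value are assumptions.

```python
def garch11_variance(returns, a, b):
    """Conditional variances of a zero-mean GARCH(1,1) with variance targeting.

    sigma2(t) = omega + a * x(t-1)^2 + b * sigma2(t-1),
    where omega = (1 - a - b) * (sample variance), so that the unconditional
    variance of the model equals the sample variance of the returns.
    """
    if not (a >= 0 and b >= 0 and a + b < 1):
        raise ValueError("need a >= 0, b >= 0 and a + b < 1")
    target = sum(x * x for x in returns) / len(returns)
    omega = (1.0 - a - b) * target
    sigma2 = [target]  # start the recursion at the targeted variance
    for x in returns[:-1]:
        sigma2.append(omega + a * x * x + b * sigma2[-1])
    return sigma2
```

With variance targeting only two parameters remain to be optimized. The boundary case a + b = 1 (the EWMA) is excluded by the check above because omega would be zero.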
Outliers in GARCH
• Problem: An outlier affects the model estimate
(Alternative: Robust estimation methods)
• Tests for the residuals: Z(t) = X(t,ω)/σ(t)
{Z(t)} is approximately WN (leptokurtically distributed)
- Very simple: Specify critical values for Z
(distribution function estimate or Monte Carlo)
- Sophisticated 1: LR test for Z
- Sophisticated 2: EVT for Z
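The "very simple" variant, critical values for Z by Monte Carlo, might look like the sketch below. The Student-t innovation distribution with 5 degrees of freedom is purely an assumption to get leptokurtic residuals, and all function names are hypothetical.

```python
import math
import random

def student_t_unit_variance(rng, df):
    """One draw from a Student-t with `df` > 2 degrees of freedom, rescaled to variance 1."""
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    t = rng.gauss(0.0, 1.0) / math.sqrt(chi2 / df)
    return t * math.sqrt((df - 2.0) / df)

def critical_value(n, level=0.95, df=5.0, simulations=2000, seed=0):
    """Monte Carlo quantile of max |Z| over a sample of n simulated innovations."""
    rng = random.Random(seed)
    maxima = sorted(
        max(abs(student_t_unit_variance(rng, df)) for _ in range(n))
        for _ in range(simulations)
    )
    return maxima[min(simulations - 1, int(level * simulations))]

def flag_outliers(returns, sigmas, threshold):
    """Indices t where the standardized residual |x(t) / sigma(t)| exceeds the threshold."""
    return [t for t, (x, s) in enumerate(zip(returns, sigmas)) if abs(x / s) > threshold]
```

Because an outlier also inflates the estimated volatility on the following days, such a threshold test tends to be conservative right after a large value.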
Method 2: LR-test in GARCH
Van Dijk and Franses (2000)
Single outlier detection:
1. Augmented GARCH-jump model
2. Transformation -> ARMA(1,1)
3. Modelling of the outlier effects
4. Derivation of a simple linear regression.
5. LS gives the conditional outlier effects
6. Obtain the t-statistic t with standard methods
7. Compare max(t) with a derived critical value.
Multiple outlier detection
8. Remove the found outlier
9. Iterate
The remaining (decaying) effects that an outlier implies in the model are observed
=>
The method performs better a few days back in time than in real time
Benefits and drawbacks
• + Simple formula for the critical values C(a, b)
• + Computationally easy
• + No assumptions on the innovation distribution
• + The authors: "Works remarkably well!"
• - Prerequisite: an estimated GARCH model.
• Usage: Real-time and historical
Outlier definitions
• How can and should an "outlier" be defined?
• Vaguely distinguish between technical outliers and market outliers:
Requirement: Significant effect - which are of interest to us?
- Too small: Affects the volatility by a factor 1/n, not dangerous
- Too big: Important, even though not technically a "true" outlier!
• Suggested definition: A market outlier is a value that is
1. Aberrant from the market situation
2. Affecting market statistics significantly
• Statistically, the task of determining whether an observation originates from
the same distribution is very hard, already in the IID case. With fat tails and
heteroscedasticity, it gets even worse. However, if defined as above, the
task gets much easier. Now other, intuitive methods can also be applied.
• Perfect case:
IDD cleaning focuses on technical outliers, EOD only on market outliers.
Method 1: Conditional VaR
• Frey, McNeil (2000).
Standard P-L distribution estimation methods:
1. Non-parametrical (Historical Simulation, HS)
2. Parametrical, such as GARCH, EWMA
3. EVT
Problems:
1. Bad estimates of the extremes
2. Distribution assumptions
3. Unconditional variance
Improved P-L estimation
• Idea:
X(t) = σ(t)Z(t)
1. HS for the central part of the distribution
2. GARCH for the conditional volatility σ(t)
3. EVT for the tail (using the ordered GARCH residuals
that exceed a threshold)
=> Improved P-L model => Improved VaR and ESF
estimates
• Outliers: Cred(X(t)) ~ S(t) = μ(t-1) + σ(t-1) E[Z | Z > z_q]
• Usage: Conditional on (t-1) => Real-time (daily)
• Drawback: The GPD must also be ML-estimated
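The shortfall level used for the outlier credibility can be sketched as below. Note the simplification: the tail expectation is taken empirically from the standardized residuals instead of from a fitted GPD, so this is not the Frey and McNeil procedure itself. Function names are assumptions, and the upper tail is used for both the quantile and the expectation.

```python
def expected_shortfall_level(residuals, mu, sigma, q=0.95):
    """mu + sigma * E[Z | Z >= z_q], with z_q the empirical q-quantile of the residuals.

    residuals: standardized residuals Z from an already fitted volatility model.
    mu, sigma: conditional mean and volatility known at time t-1.
    """
    ordered = sorted(residuals)
    k = min(len(ordered) - 1, int(q * len(ordered)))
    tail = ordered[k:]  # residuals at or above the empirical quantile
    return mu + sigma * sum(tail) / len(tail)

def is_suspect(x, shortfall_level):
    """A value beyond the conditional expected shortfall gets low credibility."""
    return x > shortfall_level
```

Since only information up to t-1 enters, the level is available before the new value arrives, which is what makes the method usable in daily real-time checks.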
Method 3. LP filter (Smoothing)
• Trivial but useful for the pseudo-outliers:
Technically correct quotes can still be "wrong" in the sense
of a market with the martingale property!
• 1. Trend estimation ("smoothing")
• 2. Trend deviation
• 3. Critical values
INTUITIVE!
Smoothing
• 1. Methods for smoothing (LP filtering):
- Two-sided weighted moving averages m(t) = Σ a(j) X(t+j)
- Fourier smoothing
- Wavelet transforms (not so good, but fast)
Trade-off: The smoother must describe the trend but still adapt
fast enough to market changes.
For example, a simple 5-point symmetrical EWMA seems to
work OK...
Fractal structure: ...for both EOD and IDD!
(For real-time use, i.e. a one-sided kernel, we get an AR(p) process,
which we shall of course NOT use, since this model is bad and we
are then anyway back in the familiar framework of the standard
RiskMetrics EWMA "IGARCH" or GARCH(1,1).)
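A two-sided weighted moving average of the kind mentioned above could be sketched like this. The decay value 0.5 and the renormalization of the weights at the ends of the series are assumptions made for the example.

```python
def symmetric_ewma(series, decay=0.5, half_width=2):
    """Two-sided weighted moving average with weights proportional to decay**|j|.

    half_width=2 gives the 5-point version. Near the ends of the series the
    window is truncated and the remaining weights are renormalized.
    """
    offsets = range(-half_width, half_width + 1)
    weights = [decay ** abs(j) for j in offsets]
    smoothed = []
    for t in range(len(series)):
        total, norm = 0.0, 0.0
        for j, w in zip(offsets, weights):
            if 0 <= t + j < len(series):
                total += w * series[t + j]
                norm += w
        smoothed.append(total / norm)
    return smoothed
```

Because the weights are symmetric, a linear trend passes through unchanged in the interior, while an isolated spike of size 10 is pulled down to 4 with these weights and so stands out clearly in the trend deviation.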
Critical Values
• 1. Subjective calibration
Given an estimated model for the trend (λ), it has to be
defined when a value is "wrong" (in a market sense).
• 2. Trend deviations
Consider the trend deviations and use D = max |x(t) - m(t)| / σ
Monte Carlo simulations => quantiles for D => C(α)
• 3. Improvements
1. C = C(α, RMSE, λ, σ) => no Monte Carlo needed
2. λ = λ(σ, RMSE) => no manual calibration needed
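The Monte Carlo route to the critical value can be sketched as follows. The Gaussian random walk as the null model, the plain centered moving average as the smoother, and all names and defaults are assumptions made for this illustration.

```python
import random

def moving_average(series, half_width=2):
    """Centered moving average; the window is truncated at the ends."""
    out = []
    for t in range(len(series)):
        window = series[max(0, t - half_width): t + half_width + 1]
        out.append(sum(window) / len(window))
    return out

def max_trend_deviation(series, sigma, half_width=2):
    """D = max |x(t) - m(t)| / sigma for the smoothed trend m."""
    trend = moving_average(series, half_width)
    return max(abs(x - m) for x, m in zip(series, trend)) / sigma

def simulate_critical_value(n, level=0.95, sigma=1.0, simulations=1000, seed=0):
    """Quantile of D over simulated random walks of length n with step volatility sigma."""
    rng = random.Random(seed)
    stats = []
    for _ in range(simulations):
        path, x = [], 0.0
        for _ in range(n):
            x += rng.gauss(0.0, sigma)
            path.append(x)
        stats.append(max_trend_deviation(path, sigma))
    stats.sort()
    return stats[min(simulations - 1, int(level * simulations))]
```

An observed series whose D exceeds the simulated quantile would be flagged. Under this null model D is scale-free, so the simulation does not need to be repeated for each volatility level.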
Method 4. HP filtration
• Idea:
An outlier, "defined" as "looking aberrant",
possesses a higher localized frequency.
Good for:
- Finding small outliers
- Finding clusters of outliers.
- IDD and EOD
Multiresolution Frequency Decomposition
• Idea:
Financial time series possess a fractal structure; they contain
different time scales, due to the market participants working on
different time horizons. Therefore, it seems a plausible idea to
put the time series under a magnifying glass and decompose
the different time scales.
• Tool:
Wavelet transforms create an MFD
• Benefits:
1. Computationally fast and easy!
2. We want the high-pass part anyway
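As an illustration of such a decomposition, the sketch below uses the Haar wavelet, the simplest possible choice. The slides do not say which wavelet was intended, so this choice and the function names are assumptions.

```python
import math

def haar_step(series):
    """One level of the Haar transform: (low-pass approximation, high-pass detail)."""
    if len(series) % 2:
        raise ValueError("series length must be even")
    root2 = math.sqrt(2.0)
    approx = [(series[i] + series[i + 1]) / root2 for i in range(0, len(series), 2)]
    detail = [(series[i] - series[i + 1]) / root2 for i in range(0, len(series), 2)]
    return approx, detail

def haar_decompose(series, levels):
    """Repeated Haar steps; returns the final approximation and the details per level.

    details[0] holds the finest (highest-frequency) scale.
    The series length must be divisible by 2**levels.
    """
    details, approx = [], list(series)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details
```

For the series [1, 1, 1, 9, 1, 1, 1, 1] the finest detail coefficients are zero everywhere except at the pair containing the 9, which is the localized high-frequency signature the high-pass method looks for.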
Conclusion of EOD analysis
• GARCH estimation is demanding, especially the
very first time, but then it can be used for
- The LR-test detection method
- Improved estimation of ESF (and VaR)
(- Volatility comparisons)
• The chosen definition of an outlier is crucial:
What exactly are we looking for?
• Deviations from an MA-smoothed curve seem trivial,
but could be a useful indicator!
III. Testing EOD suspects in IDD
• Idea:
"Zoom in" and use the IDD information for closer examination of suspects:
- New IDD filtering, with instrument-specific parameters (tail index etc.)
- Simpler: Smoothing + trend deviation analysis in IDD
Tail index estimation: P(X > x) ~ c x^(-α)
- Hill estimator (standard, unconditional)
- GARCH residual threshold exceedances (approx. Pareto => ML
estimation)
(conditional, as previously described)
In the special case of GARCH(1,1) and IGARCH(1,1) (RiskMetrics EWMA)
the tail index can actually be calculated explicitly!
α = α(a, b, DF(Z))
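The Hill estimator mentioned above is short enough to sketch directly. The function name is an assumption, and the number k of upper order statistics has to be chosen by the user; the estimate can be sensitive to that choice.

```python
import math

def hill_tail_index(data, k):
    """Hill estimate of the tail index from the k largest positive observations.

    The tail is assumed to behave like P(X > x) ~ c * x**(-alpha);
    the (k+1)-th largest observation serves as the threshold.
    """
    ordered = sorted((x for x in data if x > 0), reverse=True)
    if not 0 < k < len(ordered):
        raise ValueError("k must be between 1 and the number of positive observations - 1")
    threshold = ordered[k]
    mean_log_excess = sum(math.log(x / threshold) for x in ordered[:k]) / k
    return 1.0 / mean_log_excess
```

For losses or negative returns the same function can be applied to the absolute values of the relevant tail.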
Conclusion: Suggested plan
• 1. Implementation of the Olsen filter, as it is or in a
modified form, for the basic data cleaning. (C++?)
• 2. Extension to multivariate filtering
• 3. Filter calibration and testing
• 4. Implementation of EOD methods
• 5. Implementation of a verifying system EOD --> IDD
• (6. Improvements: Self-learning, error examples
database, …)
Selected references
• McNeil, A. J. and Frey, R. (2000)
Estimation of tail-related risk measures for heteroscedastic financial time series: an
extreme value approach. ETH Zürich.
• Franses, P. H. and van Dijk, D.
Outlier detection in GARCH models. Econometric Institute, Erasmus University
Rotterdam.
• Mikosch, T. Modeling financial time series. Copenhagen University.
• Tolvi, J. Outliers in time series: A review. University of Turku.
• Müller, U. The Olsen filter for financial data. Internal paper, Olsen & Associates.
• Greenblatt, S. A. Wavelets in econometrics: An application to outlier testing.
University of Reading.
• Numerical Recipes in C - On-line book, www.nr.com
• RiskMetrics technical document