Two Consolidation Projects: Towards an International MME: CFS+EUROSIP(UKMO,ECMWF,METF)

Two Consolidation Projects:
• Towards an International MME:
CFS+EUROSIP(UKMO,ECMWF,METF)
11 slides
• Towards a National MME: CFS and GFDL
18 slides
Does the NCEP CFS
add to the skill of
the European DEMETER-3
to produce a viable
International Multi Model Ensemble (IMME)?
Huug van den Dool
Climate Prediction Center, NCEP/NWS/NOAA
Suranjana Saha and Åke Johansson
Environmental Modeling Center, NCEP/NWS/NOAA
August 2007
DATA and DEFINITIONS USED
• DEMETER-3 (DEM3) = ECMWF + METFR + UKMO
• CFS
• IMME = DEM3 + CFS
• 1981 – 2001
• 4 Initial condition months : Feb, May, Aug and Nov
• Leads 1-5
• Monthly means
DATA/Definitions USED (cont)
• Deterministic : Anomaly Correlation
• Probabilistic : Brier Score (BS) and Rank Probability Score (RPS)
• Ensemble Mean and PDF
• T2m and Prate
• Europe and United States
“NO (fancy) consolidation, equal weights, NO Cross-validation”
DATA/Definitions USED (cont)
Verification Data :
• T2m : CPC Monthly Analysis of the CAMS + Global
Historical Climate Network (Fan and Van den Dool 2007)
• Prate : CMAP (Xie-Arkin 1997)
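The deterministic part of this setup can be sketched in a few lines. This is a minimal illustration on synthetic data, not the code used for the study: the function names, the array shapes, and the choice of an uncentered anomaly correlation pooled over all years and grid points are assumptions.

```python
import numpy as np

def anomaly_correlation(fcst_anom, obs_anom):
    """Uncentered anomaly correlation pooled over all years and grid points."""
    f = np.asarray(fcst_anom, dtype=float).ravel()
    o = np.asarray(obs_anom, dtype=float).ravel()
    return float(np.sum(f * o) / np.sqrt(np.sum(f * f) * np.sum(o * o)))

def equal_weight_mme(model_means):
    """Average the ensemble-mean anomalies of several models with equal weights."""
    return np.mean(np.stack(model_means, axis=0), axis=0)

# Synthetic anomalies: 21 years x 50 grid points (illustrative shapes only).
rng = np.random.default_rng(0)
obs = rng.standard_normal((21, 50))
# Four hypothetical models, each with a weak signal plus independent noise.
models = [0.3 * obs + rng.standard_normal((21, 50)) for _ in range(4)]

mme = equal_weight_mme(models)
ac_single = anomaly_correlation(models[0], obs)
ac_mme = anomaly_correlation(mme, obs)
print(f"AC of one model: {ac_single:.3f}, AC of equal-weight MME: {ac_mme:.3f}")
```

Equal weights need no training data, which is why no cross-validation is required for this part of the comparison.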
Number of times IMME improves upon DEM-3,
out of 20 cases (4 IC’s x 5 leads):

Region                 EUROPE   EUROPE   USA     USA
Variable               T2m      Prate    T2m     Prate
Anomaly Correlation    9        14       14      14
Brier Score            16       18.5     19      20
RPS                    14       15       19.5    20
“The bottom line”
Frequency of being the best model in 20 cases
in terms of
Anomaly Correlation of the Ensemble Mean

Variable   Region    CFS   ECMWF   METFR   UKMO
T2m        USA       4     5       5       6
T2m        EUROPE    3     5       6       5
Prate      USA       7     3       3       6
Prate      EUROPE    11    0       0       5
“Another bottom line”
Frequency of being the best model in 20 cases
in terms of
Brier Score of the PDF

Variable   Region    CFS   ECMWF   METFR   UKMO
T2m        USA       11    2       1       5
T2m        EUROPE    10    3       1       3
Prate      USA       17    2       0       1
Prate      EUROPE    18    0       1       1

“Another bottom line”
Frequency of being the best model in 20 cases
in terms of
Ranked Probability Score (RPS) of the PDF

Variable   Region    CFS   ECMWF   METFR   UKMO
T2m        USA       9     4       1       6
T2m        EUROPE    9     3       4       3
Prate      USA       19    0       0       1
Prate      EUROPE    18    0       0       1

“Another bottom line”
CONCLUSIONS
• Overall, NCEP CFS contributes to the skill of
IMME (relative to DEM3) for equal weights.
• This is especially so in terms of the
probabilistic Brier Score
and for Precipitation
CONCLUSIONS (Cont)
In comparison to ECMWF, METFR and UKMO,
the CFS as an individual model does:
• well in deterministic scoring (AC) for Prate and
• very well in probability scoring (BS) for Prate and T2m
over both the USA and EUROPE domains
CONCLUSIONS (Cont)
• The relative weakness of the CFS is in the deterministic
scoring (AC) for T2m (which is near the average of the other
models) over both EUROPE and USA
• Skill (if any) over EUROPE or USA is very modest for any
model, or any combination of models
• The Brier Score rarely shows improvement over
climatological probabilities in this study
• The AC for the ensemble mean gives a more
“positive” impression about skill than the Brier Score
Study of the performance of
GFDL seasonal forecasts in a
Multi Model Ensemble at NCEP
Huug van den Dool
Climate Prediction Center/NCEP/NWS/NOAA
Suranjana Saha
Environmental Modeling Center/NCEP/NWS/NOAA
Data Used
• 4 initial conditions: April 1, May 1, Oct 1 and Nov 1
• 10 member one-year forecasts (leads 0 thru 11)
• Period 1981-2005 (25 years)
• GFDL has a fully coupled model CM2.1 (IPCC version)
Verification Data Used
• Focus on monthly mean 2m-temperature and
precipitation over the continental US
• Verification of 2m-temperature against GHCN+CAMS
(land only)
• Verification of precipitation against CMAP (land and
ocean)
• Area: valid grid points (2.5x2.5) within 25N-50N,
125W-65W box over the US
Comparison to the NCEP Climate Forecast System
(CFS)
GFDL members start a few days before and on the first of the
month.
CFS members are clustered around the 11th and 21st of the previous
month and the 1st of the initial month.
In an NCEP operational setting, the GFDL model would be run
every day (similar to the CFS).
Therefore, the calibration of the operational forecast would be
obtained from an interpolation of two sets of forecasts, a month
apart (one of which would be a month old), thus resulting in a
possible degradation of skill.
VERIFICATION OF US PRATE
ANOMALY CORRELATION
CFS US PRATE ANOMALY CORRELATION
There are 32 ENTRIES: 8 leads for 4 initial months
NO CROSS VALIDATION

initial                          lead
month      8      7      6      5      4      3      2      1      0
apr      .126   .135  -.001   .098   .027   .049   .058   .142   .191
may      .143   .174   .071   .035   .087   .025   .061   .030   .244
oct      .059   .101   .081   .123   .227   .166   .119   .149   .189
nov      .083   .034   .112   .041   .168   .231   .220   .161   .277

worst     -.001
mean-sd    .043
mean       .104
mean+sd    .166
best       .231

Some skill in ENSO months
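The summary statistics of the CFS table above (no cross validation) can be reproduced from its entries. The check below is a sketch that assumes each initial month lists its values in the order lead 8 down to lead 0. It also infers from the numbers, not from anything the slide states, that the 32 entries are leads 1 to 8 (lead 0 is listed but not counted) and that "sd" is the population standard deviation.

```python
import numpy as np

# CFS US Prate anomaly correlations without cross validation, copied from the
# table above. Each row lists leads 8, 7, ..., 0 for one initial month.
ac = {
    "apr": [.126, .135, -.001, .098, .027, .049, .058, .142, .191],
    "may": [.143, .174, .071, .035, .087, .025, .061, .030, .244],
    "oct": [.059, .101, .081, .123, .227, .166, .119, .149, .189],
    "nov": [.083, .034, .112, .041, .168, .231, .220, .161, .277],
}

# Drop the last column (lead 0) so that 4 x 8 = 32 entries remain.
entries = np.array([row[:-1] for row in ac.values()])

mean = float(entries.mean())
sd = float(entries.std())          # population standard deviation (ddof=0)
worst = float(entries.min())
best = float(entries.max())

print(f"worst {worst:.3f}  mean-sd {mean - sd:.3f}  mean {mean:.3f}  "
      f"mean+sd {mean + sd:.3f}  best {best:.3f}")
```

With these assumptions the computed values round to the worst, mean-sd, mean, mean+sd and best figures quoted with the table.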
CFS US PRATE ANOMALY CORRELATION
There are 32 ENTRIES: 8 leads for 4 initial months
CROSS VALIDATION CV3RE

initial                          lead
month      8      7      6      5      4      3      2      1      0
apr      .062   .125  -.090   .038  -.033  -.023  -.028   .113   .139
may      .093   .107  -.039  -.056   .033  -.073  -.016  -.059   .241
oct      .007   .062   .039   .056   .192   .120   .065   .122   .116
nov      .024  -.017   .058  -.016   .086   .191   .182   .101   .248

worst     -.090
mean-sd   -.032
mean       .045
mean+sd    .121
best       .192

CV brings all numbers down
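The appendix describes the cross validation as withholding three years: the year under consideration plus two more chosen at random. The sketch below applies that idea to the climatology only, which is the simplest place to see the effect. The function name, the fixed seed, and the restriction to the climatology are illustrative assumptions, and the label CV3RE itself is not expanded anywhere on the slides.

```python
import numpy as np

def cv3_anomalies(series, seed=0):
    """Anomaly of each year relative to a climatology that withholds 3 years.

    The withheld years are the target year plus two others drawn at random.
    """
    x = np.asarray(series, dtype=float)
    n = x.size
    rng = np.random.default_rng(seed)
    anomalies = np.empty(n)
    for t in range(n):
        others = np.delete(np.arange(n), t)              # all years except t
        extra = rng.choice(others, size=2, replace=False)
        withheld = np.append(extra, t)
        keep = np.setdiff1d(np.arange(n), withheld)      # the n - 3 training years
        anomalies[t] = x[t] - x[keep].mean()
    return anomalies

# Example with a hypothetical 25-year series at one grid point.
rng = np.random.default_rng(42)
series = rng.standard_normal(25)
dependent = series - series.mean()      # climatology includes the target year
independent = cv3_anomalies(series)     # climatology excludes it
```

In a full procedure every fitted quantity (climatology, systematic error correction, consolidation weights) would be recomputed from the retained years only.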
GFDL US PRATE ANOMALY CORRELATION
There are 32 ENTRIES: 8 leads for 4 initial months
NO CROSS VALIDATION

initial                          lead
month      8      7      6      5      4      3      2      1      0
apr      .027   .122  -.040   .003   .061   .093   .059   .120   .184
may      .044  -.002   .138   .047  -.019   .024   .099   .074   .219
oct      .045   .032   .057   .113   .153   .071   .166   .085   .087
nov      .048   .091   .066   .016   .081   .154   .109   .217   .250

worst     -.040
mean-sd    .018
mean       .074
mean+sd    .130
best       .217

Weak skill in ENSO months
MME2 US PRATE ANOMALY CORRELATION
There are 32 ENTRIES: 8 leads for 4 initial months
NO CROSS VALIDATION

initial                          lead
month      8      7      6      5      4      3      2      1      0
apr      .104   .168  -.032   .062   .060   .093   .076   .165   .223
may      .135   .112   .140   .056   .043   .032   .105   .067   .277
oct      .065   .091   .092   .149   .241   .147   .183   .145   .170
nov      .082   .075   .115   .039   .159   .234   .199   .227   .311

worst     -.032
mean-sd    .051
mean       .113
mean+sd    .176
best       .241
US PRATE (AC)
BEST out of 32 cases (4 IC’s x 8 leads):
NO CV
MEAN AC
CV3RE
MEAN AC
CFS
GFDL
21
20
11
MME2
12
20
.104
12
.074
.113
CFS
GFDL
MME2
22
13
10
.045
19
3
.009
28
.035
US PRATE (summary)
• CFS alone is slightly better than GFDL alone
• MME2 is slightly better than CFS alone
• MME2 is better than GFDL alone
• Numerically, differences are minuscule,
and the existence of any skill is debatable
PRATE OVER NINO 3.4 AREA
(summary)

         NO CV    CV3RE
CFS      .532     .511
MME2     .520     .481
GFDL     .313     .252
• Adding GFDL to CFS for MME2 degrades scores
• GFDL has ENSOs, maybe even too strong in 1983 and 1998, but
the precipitation anomalies are weak at the equator and are pushed
away from the equator, mainly into the southern hemisphere.
VERIFICATION OF
US SURFACE TEMPERATURE
ANOMALY CORRELATION
US 2m TEMPERATURE (AC)
BEST out of 32 cases (4 IC’s x 8 leads):
NO CV
MEAN AC
CV3RE
MEAN AC
CFS
GFDL
15
13
17
MME2
18
18
.080
13
.099
.113
CFS
GFDL
MME2
13
13
19
.026
19
19
.029
13
.009
US 2m TEMPERATURE (summary)
• CFS alone is not better than GFDL alone
• MME2 is slightly better than CFS alone
• MME2 is not better than GFDL alone
• Numerically, differences are minuscule,
and the existence of any skill is debatable
TREND ANALYSIS OF US 2m TEMP
Effect of OCN (Optimal Climate Normals)
filtering on AC scores for all 32 cases
(NO-CV)
9 year running mean is removed
         RAW      OCN-filtered
GFDL     0.099    0.068
CFS      0.080    0.073
• GFDL loses its advantage over the CFS
when the trend is removed
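Removing a 9-year running mean can be sketched as follows. The slide does not say whether the running mean is centered or trailing, nor how the ends of the record are handled, so the centered window that shrinks near the edges is an assumption, as is the function name.

```python
import numpy as np

def remove_running_mean(series, window=9):
    """Subtract a centered running mean; the window shrinks near the ends."""
    x = np.asarray(series, dtype=float)
    half = window // 2
    out = np.empty_like(x)
    for t in range(x.size):
        lo = max(0, t - half)
        hi = min(x.size, t + half + 1)
        out[t] = x[t] - x[lo:hi].mean()
    return out

# Example: a hypothetical 25-year series with a linear warming trend plus noise.
rng = np.random.default_rng(7)
years = np.arange(1981, 2006)
t2m = 0.03 * (years - years[0]) + rng.standard_normal(years.size)
t2m_filtered = remove_running_mean(t2m, window=9)
```

Applying the same filter to forecasts and observations before computing the AC leaves only the interannual part of the signal, which is the part the RAW and OCN-filtered columns are contrasting.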
CONCLUSIONS (1)
• Skill of both CFS and GFDL is extremely low for both 2m
temperature (T2M) and precipitation (PRATE) over the US,
and this skill wilts further upon cross validation (CV3RE)
• GFDL makes no contribution to the skill of MME2 for PRATE
over the US
• GFDL makes no contribution to the skill of MME2 for PRATE
over the tropical Pacific (Nino 3.4 area)
• GFDL has a small edge over the CFS and contributes to
MME2 for T2M over the US
CONCLUSIONS (2)
• The inconsistency between performance in PRATE and T2M
is explained by the inclusion of historical CO2 etc., i.e. GFDL does a
better job on the decadal temperature trends. This is supported
by the drop in skill when the trend is removed.
• The empirical tool, OCN (Optimal Climate Normals), is
routinely used by CPC to incorporate decadal trends in the
consolidation of the official seasonal forecasts for US T2M. Its
performance is better than any of their dynamical tools.
From DelSole (2007)
• Surprisingly, none of the regression models
proposed here can consistently beat the
skill of a simple multi-model mean
• “Under suitable assumptions, both the
Bayesian estimate and the constrained
least squares solution reduce to standard
ridge regression”.
Kharin and Zwiers (2002):
• Several methods of combining individual forecasts
from a group of climate models to produce an
ensemble forecast are considered
• In the extratropics, the regression-improved
ensemble mean performs best.
• The “superensemble” forecast that is obtained by
optimally weighting the individual ensemble
members does not perform as well as either the
simple ensemble mean or the regression-improved
ensemble mean.
• The sample size evidently is too small to estimate
reliably the relatively large number of optimal
weights required for the superensemble approach.
Finally
Huug van den Dool, 2007
• There is essentially not enough hindcast data for
these fancy consolidation methods to work (21-25
years is nothing !!). ((There may be exceptions))
• There is no (or not enough) independent
information in model A versus Model B
• We have to be rigorous in CV procedures!
The rest is EXTRA
[Figure labels: Classic; +DelSole limit; +CPC limit]
Appendix: Consolidation Techniques
• A technique to linearly combine any set of models
Example: Con3 = a*A + b*B + c*C,
where A, B and C are forecasts and a, b, and c coefficients.
• The coefficients ideally depend on skill and co-linearity
among the models, as determined from many hindcasts
• Because of near instability of the matrix problem, NCEP
applies ‘ridging’ to the covariance matrix, and tries to pool as
much data as possible (areas, leads..).
• To arrive at a skill estimate, we perform a 3 year-out cross
validation (CV3), namely the year in consideration and two
more years chosen at random (to reduce CV pathological
problems)
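The linear combination and the ridging step can be sketched as below. This is a minimal illustration of ridge regression on synthetic data, not NCEP's implementation: the function name, the scaling of the ridge penalty by the mean diagonal of the covariance matrix, and the value 0.1 are assumptions.

```python
import numpy as np

def ridge_weights(forecasts, obs, ridge=0.1):
    """Solve (F'F + ridge * mean(diag(F'F)) * I) w = F'o for the weights w."""
    F = np.asarray(forecasts, dtype=float)   # shape (n_samples, n_models)
    o = np.asarray(obs, dtype=float)         # shape (n_samples,)
    cov = F.T @ F
    # Adding to the diagonal stabilizes the solve when models are co-linear.
    penalty = ridge * np.trace(cov) / cov.shape[0]
    return np.linalg.solve(cov + penalty * np.eye(cov.shape[0]), F.T @ o)

# Synthetic hindcasts: pooling areas and leads means stacking more rows here.
rng = np.random.default_rng(1)
truth = rng.standard_normal(200)
A = truth + rng.standard_normal(200)
B = truth + rng.standard_normal(200)
C = 0.9 * B + 0.1 * rng.standard_normal(200)   # nearly co-linear with B

F = np.column_stack([A, B, C])
a, b, c = ridge_weights(F, truth, ridge=0.1)
con3 = a * A + b * B + c * C                   # Con3 = a*A + b*B + c*C
print(f"a={a:.3f}, b={b:.3f}, c={c:.3f}")
```

With `ridge=0.0` the function reduces to ordinary least squares, where the near co-linearity of B and C makes the individual weights poorly determined. Increasing `ridge` shrinks the weights and trades a little fit on the training data for more stable coefficients.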
BRIER SCORE FOR 3-CLASS SYSTEM
1. Calculate tercile boundaries from observations 1981-2001 (1982-2002
for longer leads) at each gridpoint.
2. Assign departures from model’s own climatology (based on 21 years,
all members) to one of the three classes: Below (B), Normal (N) and
Above (A), and find the fraction of forecasts (F) among all participating
ensemble members for these classes denoted by FB, FN and FA
respectively, such that FB+ FN+FA=1 .
3. Denoting Observations as O, we calculate a Brier Score (BS) as :
BS={(FB-OB)**2 +(FN-ON)**2 + (FA-OA)**2}/3,
aggregated over all years and all grid points.
{{For example, when the observation is in the B class, we have (1,0,0)
for (OB, ON, OA) etc.}}
4. BS for random deterministic prediction: 0.444
BS for ‘always climatology’ (1/3rd,1/3rd,1/3rd) : 0.222
5. RPS: The same as the Brier Score, but for the cumulative distribution (no skill = 0.148)
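The scores defined in steps 2 to 5 can be written out directly, and the reference values 0.444, 0.222 and 0.148 follow from the definitions. The sketch below uses hypothetical function names and encodes the classes B, N, A as 0, 1, 2; the RPS divides by 3 in the same way as the Brier Score, which is what reproduces the quoted no-skill value.

```python
import numpy as np

def member_fractions(member_classes):
    """Fractions FB, FN, FA of ensemble members falling in classes 0, 1, 2."""
    counts = np.bincount(np.asarray(member_classes), minlength=3)
    return counts / counts.sum()

def brier_score_3class(fcst_probs, obs_class):
    """BS = {(FB-OB)**2 + (FN-ON)**2 + (FA-OA)**2} / 3, averaged over cases."""
    p = np.asarray(fcst_probs, dtype=float)       # shape (n_cases, 3)
    o = np.eye(3)[np.asarray(obs_class)]          # (1,0,0) for B, etc.
    return float(np.mean(np.sum((p - o) ** 2, axis=1) / 3.0))

def rps_3class(fcst_probs, obs_class):
    """Same as the Brier Score, but on cumulative probabilities."""
    p = np.cumsum(np.asarray(fcst_probs, dtype=float), axis=1)
    o = np.cumsum(np.eye(3)[np.asarray(obs_class)], axis=1)
    return float(np.mean(np.sum((p - o) ** 2, axis=1) / 3.0))

# Observations equally likely in each tercile class.
obs = np.array([0, 1, 2])

# 'Always climatology' forecast (1/3, 1/3, 1/3).
clim = np.full((3, 3), 1.0 / 3.0)
bs_clim = brier_score_3class(clim, obs)           # 2/9 = 0.222
rps_clim = rps_3class(clim, obs)                  # 4/27 = 0.148

# Random deterministic forecast: every forecast class paired with every
# observed class, each pairing equally likely.
fc_det = np.eye(3)[np.repeat([0, 1, 2], 3)]
ob_det = np.tile([0, 1, 2], 3)
bs_random = brier_score_3class(fc_det, ob_det)    # 4/9 = 0.444

# Example: a 10-member ensemble with 4 members below, 2 normal, 4 above.
fractions = member_fractions([0, 0, 1, 2, 2, 2, 2, 1, 0, 0])
```

A lower score is better in both cases, so a forecast only shows probabilistic skill when its Brier Score falls below 0.222 or its RPS below 0.148.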
[Figure: Daily Raw Z500 Scores. Anomaly Correlation (0 to 100) versus forecast day (0 to 14), 1981-2005, NH (IC=Oct30, 0Z). Curves: gfdl m06, cfs m11]
Anomaly correlation does not asymptote to 100 at fcst time=0
Interpolation of initial conditions from Reanalysis 2 may not be correct or accurate
[Figure: CROSS-VALIDATION. Bar chart, vertical axis 30 to 90, with bars for D1, D2, D3, D4, D5, D6, D7, CFS, CA, MM, COR, FRE, RID, RI2, RIM, RIW, UR]
Anomaly Pattern correlation over the tropical Pacific. Average for all leads and
initial months. Empty bar: Full (dependent), filled bar: 3-yr out cross-validated.
Peña and Van den Dool (2008)
Consolidation of Multi Method Forecasts by Ridge Regression:
Application to Pacific Sea Surface Temperature
• Strategies to increase the ratio of the effective sample size of the
training data to the number of coefficients to be fitted are proposed
and tested. These strategies include:
i) objective selection of a smaller subset of models, ii) pooling of
information from neighboring gridpoints, and iii) consolidating all
ensemble members rather than each model’s ensemble average.
• In all variations of the ridge regression consolidation methods
tested, increased effective sample size produces more stable
weights and more skillful predictions on independent data.
• In the western tropical Pacific, most consolidation methods
outperform the simple equal weight ensemble average; in other
regions they have similar skill as measured by both the anomaly
correlation and the relative operating curve.
• The main obstacle to progress is a short period of data and a lack of
independent information among models.
CV3RE