From: Ben Santer
To: John Lanzante , Thomas R Karl , carl mears , "David C. Bader" , "'Dian J. Seidel'" , "'Francis W. Zwiers'" , Frank Wentz , Karl Taylor , Leopold Haimberger , Melissa Free , "Michael C. MacCracken" , "'Philip D. Jones'" , Steven Sherwood , Steve Klein , 'Susan Solomon' , "Thorne, Peter" , Tim Osborn , Tom Wigley , Gavin Schmidt
Subject: More significance testing
Date: Thu, 27 Dec 2007 16:26:19 -0800
Reply-to: santer1@llnl.gov
Dear folks,
This email briefly summarizes the trend significance test results. As I
mentioned in yesterday's email, I've added a new case (referred to as
"TYPE3" below). I've also added results for tests with a stipulated 10%
significance level. Here is the explanation of the four different types
of trend test:
1. "OBS-vs-MODEL": Observed MSU trends in RSS and UAH are tested against
trends in synthetic MSU data in 49 realizations of the 20c3m experiment.
Results from RSS and UAH are pooled, yielding a total of 98 tests for T2
trends and 98 tests for T2LT trends.
2. "MODEL-vs-MODEL (TYPE1)": Involves model data only. Trend in
synthetic MSU data in each of 49 20c3m realizations is tested against
each trend in the remaining 48 realizations (i.e., no trend tests
involving identical data). Yields a total of 49 x 48 = 2352 tests. The
significance of trend differences is a function of BOTH inter-model
differences (in climate sensitivity, applied 20c3m forcings, and the
amplitude of variability) AND "within-model" effects (i.e., is related
to the different manifestations of natural internal variability
superimposed on the underlying forced response).
3. "MODEL-vs-MODEL (TYPE2)": Involves model data only. Limited to the M
models with multiple realizations of the 20c3m experiment. For each of
these M models, the number of unique combinations C of N 20c3m
realizations into trend pairs (R = 2 realizations per pair) is
determined. For example, in the case of N = 5,
C = N! / [ R!(N-R)! ] = 10. The significance of trend
differences is solely a function of "within-model" effects (i.e., is
related to the different manifestations of natural internal variability
superimposed on the underlying forced response). There are a total of 62
tests (not 124, as I erroneously reported yesterday!)
4. "MODEL-vs-MODEL (TYPE3)": Involves model data only. For each of the
19 models, only the first 20c3m realization is used. The trend in each
model's first 20c3m realization is tested against each trend in the
first 20c3m realization of the remaining 18 models. Yields a total of 19
x 18 = 342 tests. The significance of trend differences is solely a
function of inter-model differences (in climate sensitivity, applied
20c3m forcings, and the amplitude of variability).
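The test counts quoted in the four items above are simple counting results. The short Python sketch below reproduces them. The function names are illustrative only (they are not taken from any analysis code), and the single ensemble of size 5 is just the N = 5 example from item 3. The per-model ensemble sizes behind the TYPE2 total of 62 are not listed in this message, so they are not reconstructed here.

```python
from math import comb


def ordered_pair_tests(n_series: int) -> int:
    """Each series is tested against every other series, so order matters."""
    return n_series * (n_series - 1)


def within_model_tests(ensemble_sizes) -> int:
    """Unique pairs of realizations, summed over models (TYPE2-style counting)."""
    return sum(comb(n, 2) for n in ensemble_sizes)


obs_vs_model = 49 * 2                     # 49 realizations x 2 datasets (RSS, UAH)
type1 = ordered_pair_tests(49)            # 49 x 48
type3 = ordered_pair_tests(19)            # 19 x 18
type2_example = within_model_tests([5])   # hypothetical: one model with N = 5

print(obs_vs_model, type1, type3, type2_example)
```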
REJECTION RATES FOR STIPULATED 5% SIGNIFICANCE LEVEL
Test type                    No. of tests       T2 "Hits"      T2LT "Hits"
1. OBS-vs-MODEL              49 x 2  (98)         2 (2.04%)      1 (1.02%)
2. MODEL-vs-MODEL (TYPE1)    49 x 48 (2352)      58 (2.47%)     32 (1.36%)
3. MODEL-vs-MODEL (TYPE2)    ---     (62)         0 (0.00%)      0 (0.00%)
4. MODEL-vs-MODEL (TYPE3)    19 x 18 (342)       22 (6.43%)     14 (4.09%)
REJECTION RATES FOR STIPULATED 10% SIGNIFICANCE LEVEL
Test type                    No. of tests       T2 "Hits"      T2LT "Hits"
1. OBS-vs-MODEL              49 x 2  (98)         4 (4.08%)      2 (2.04%)
2. MODEL-vs-MODEL (TYPE1)    49 x 48 (2352)      80 (3.40%)     46 (1.96%)
3. MODEL-vs-MODEL (TYPE2)    ---     (62)         1 (1.61%)      0 (0.00%)
4. MODEL-vs-MODEL (TYPE3)    19 x 18 (342)       28 (8.19%)     20 (5.85%)
REJECTION RATES FOR STIPULATED 20% SIGNIFICANCE LEVEL
Test type                    No. of tests       T2 "Hits"      T2LT "Hits"
1. OBS-vs-MODEL              49 x 2  (98)         7 (7.14%)      5 (5.10%)
2. MODEL-vs-MODEL (TYPE1)    49 x 48 (2352)     176 (7.48%)    100 (4.25%)
3. MODEL-vs-MODEL (TYPE2)    ---     (62)         4 (6.45%)      3 (4.84%)
4. MODEL-vs-MODEL (TYPE3)    19 x 18 (342)       42 (12.28%)    28 (8.19%)
Features of interest:
A) As you might expect, for each of the three significance levels, TYPE3
tests yield the highest rejection rates of the null hypothesis of "No
significant difference in trend". TYPE2 tests generally yield the lowest
rejection rates (the one exception is T2LT at the 20% level, where the
TYPE1 rate of 4.25% is slightly below the TYPE2 rate of 4.84%). This is
simply telling us that the inter-model differences in trends tend to be
larger than the "between-realization" differences in trends in any
individual model.
B) Rejection rates for the model-versus-observed trend tests are
consistently LOWER than for the model-versus-model (TYPE3) tests. On
average, therefore, the tropospheric trend differences between the
observational datasets used here (RSS and UAH) and the synthetic MSU
temperatures calculated from 19 CMIP-3 models are actually LESS
SIGNIFICANT than the inter-model trend differences arising from
differences in sensitivity, 20c3m forcings, and levels of variability.
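As a quick arithmetic check, the Python sketch below recomputes the rejection rates from the hit counts in the three tables above. It then tests two of the observations: that TYPE3 gives the highest rate in every case, and that the OBS-vs-MODEL rate is always below the TYPE3 rate. The dictionary layout and the function name are illustrative choices. Only the counts come from the tables.

```python
# Number of tests per test type, as given in the tables.
N_TESTS = {"OBS": 98, "TYPE1": 2352, "TYPE2": 62, "TYPE3": 342}

# Hit counts per stipulated significance level (%): (T2 hits, T2LT hits).
HITS = {
    5:  {"OBS": (2, 1), "TYPE1": (58, 32),   "TYPE2": (0, 0), "TYPE3": (22, 14)},
    10: {"OBS": (4, 2), "TYPE1": (80, 46),   "TYPE2": (1, 0), "TYPE3": (28, 20)},
    20: {"OBS": (7, 5), "TYPE1": (176, 100), "TYPE2": (4, 3), "TYPE3": (42, 28)},
}
LAYERS = ("T2", "T2LT")


def rejection_rates(level: int, layer: int) -> dict:
    """Rejection rate in percent for each test type (layer 0 = T2, 1 = T2LT)."""
    return {t: 100.0 * HITS[level][t][layer] / N_TESTS[t] for t in N_TESTS}


type3_highest = 0      # cases in which TYPE3 has the largest rate
obs_below_type3 = 0    # cases in which OBS-vs-MODEL is below TYPE3
for level in HITS:
    for layer in (0, 1):
        rates = rejection_rates(level, layer)
        type3_highest += int(max(rates, key=rates.get) == "TYPE3")
        obs_below_type3 += int(rates["OBS"] < rates["TYPE3"])
        print(level, LAYERS[layer], {t: round(r, 2) for t, r in rates.items()})

print("TYPE3 highest in", type3_highest, "of 6 cases")
print("OBS below TYPE3 in", obs_below_type3, "of 6 cases")
```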
I also thought that it would be fun to use the model data to explore the
implications of Douglass et al.'s flawed statistical procedure. Recall
that Douglass et al. compare (in their Table III) the observed T2 and
T2LT trends in RSS and UAH with the overall means of the multi-model
distributions of T2 and T2LT trends. Their standard error, sigma{SE}, is
meant to represent an "estimate of the uncertainty of the mean" (i.e.,
the mean trend). sigma{SE} is given as:
sigma{SE} = sigma / sqrt{N - 1}
where sigma is the standard deviation of the model trends, and N is "the
number of independent models" (22 in their case). Douglass et al.
apparently estimate sigma using ensemble-mean trends for each model (if
20c3m ensembles are available).
So what happens if we apply this procedure using model data only? This
is rather easy to do. As above (in the TYPE1, TYPE2, and TYPE3 tests), I
simply used the synthetic MSU trends from the 19 CMIP-3 models employed
in our CCSP Report and in Santer et al. 2005 (so N = 19). For each
model, I calculated the ensemble-mean 20c3m trend over 1979 to 1999
(where multiple 20c3m realizations were available). Let's call these
mean trends b{j}, where j (the index over models) = 1, 2, ..., 19.
Further, let's regard b{1} as the surrogate observations, and then use
Douglass et al.'s approach to test whether b{1} is significantly
different from the overall mean of the remaining 18 members of b{j}.
Then repeat with b{2} as surrogate observations, etc. For each
layer-averaged temperature series, this yields 19 tests of the
significance of differences in mean trends.
To give you a feel for this stuff, I've reproduced below the results for
tests involving T2LT trends. The "OBS" column is the ensemble-mean T2LT
trend in the surrogate observations. "MODAVE" is the overall mean trend
in the 18 remaining members of the distribution, and "SIGMA" is the
1-sigma standard deviation of these trends. "SIGMA{SE}" is 1 x
SIGMA{SE} (note that Douglass et al. give 2 x SIGMA{SE} in their Table
III; multiplying our SIGMA{SE} results by two gives values similar to
theirs). "NORMD" is simply the normalized difference (OBS-MODAVE) /
SIGMA{SE}, tabulated as an absolute value, and "P-VALUE" is the
two-sided p-value for the normalized difference, assuming that this
difference is approximately normally distributed.
MODEL            "OBS"    MODAVE   SIGMA    SIGMA{SE}   NORMD     P-VALUE
CCSM3.0          0.1580   0.2179   0.0910   0.0215       2.7918   0.0052
GFDL2.0          0.2576   0.2124   0.0915   0.0216       2.0977   0.0359
GFDL2.1          0.3567   0.2069   0.0854   0.0201       7.4404   0.0000
GISS_EH          0.1477   0.2185   0.0906   0.0214       3.3153   0.0009
GISS_ER          0.1938   0.2159   0.0919   0.0217       1.0205   0.3075
MIROC3.2_T42     0.1285   0.2196   0.0897   0.0211       4.3094   0.0000
MIROC3.2_T106    0.2298   0.2139   0.0920   0.0217       0.7305   0.4651
MRI2.3.2a        0.2800   0.2111   0.0907   0.0214       3.2196   0.0013
PCM              0.1496   0.2184   0.0907   0.0214       3.2170   0.0013
HADCM3           0.1936   0.2159   0.0919   0.0217       1.0327   0.3018
HADGEM1          0.3099   0.2095   0.0891   0.0210       4.7784   0.0000
CCCMA3.1         0.4236   0.2032   0.0769   0.0181      12.1591   0.0000
CNRM3.0          0.2409   0.2133   0.0918   0.0216       1.2762   0.2019
CSIRO3.0         0.2780   0.2113   0.0908   0.0214       3.1195   0.0018
ECHAM5           0.1252   0.2197   0.0895   0.0211       4.4815   0.0000
IAP_FGOALS1.0    0.1834   0.2165   0.0917   0.0216       1.5314   0.1257
GISS_AOM         0.1788   0.2168   0.0916   0.0216       1.7579   0.0788
INMCM3.0         0.0197   0.2256   0.0790   0.0186      11.0541   0.0000
IPSL_CM4         0.2258   0.2142   0.0920   0.0217       0.5359   0.5920
T2LT: No. of p-values .le. 0.05: 12. Rejection rate: 63.16%
T2LT: No. of p-values .le. 0.10: 13. Rejection rate: 68.42%
T2LT: No. of p-values .le. 0.20: 14. Rejection rate: 73.68%
The corresponding rejection rates for the tests involving T2 data are:
T2: No. of p-values .le. 0.05: 12. Rejection rate: 63.16%
T2: No. of p-values .le. 0.10: 13. Rejection rate: 68.42%
T2: No. of p-values .le. 0.20: 15. Rejection rate: 78.95%
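For anyone who wants to retrace the T2LT numbers, here is a minimal Python sketch of the leave-one-out procedure described above. It uses only the 19 ensemble-mean trends from the "OBS" column. Two details are assumptions inferred from the tabulated values rather than stated in the text: SIGMA is taken as the population standard deviation of the 18 remaining trends, and SIGMA{SE} divides by sqrt(N - 1) with N = 19. Because the input trends are rounded to four decimals, the output agrees with the table only to about the third decimal place. The function and variable names are illustrative.

```python
from math import erfc, sqrt
from statistics import fmean, pstdev

# Ensemble-mean T2LT trends: the "OBS" column of the table above.
TRENDS = {
    "CCSM3.0": 0.1580, "GFDL2.0": 0.2576, "GFDL2.1": 0.3567,
    "GISS_EH": 0.1477, "GISS_ER": 0.1938, "MIROC3.2_T42": 0.1285,
    "MIROC3.2_T106": 0.2298, "MRI2.3.2a": 0.2800, "PCM": 0.1496,
    "HADCM3": 0.1936, "HADGEM1": 0.3099, "CCCMA3.1": 0.4236,
    "CNRM3.0": 0.2409, "CSIRO3.0": 0.2780, "ECHAM5": 0.1252,
    "IAP_FGOALS1.0": 0.1834, "GISS_AOM": 0.1788, "INMCM3.0": 0.0197,
    "IPSL_CM4": 0.2258,
}


def leave_one_out_test(trends: dict, name: str) -> tuple:
    """Treat one model as surrogate observations and test it against the rest."""
    obs = trends[name]
    rest = [v for k, v in trends.items() if k != name]
    modave = fmean(rest)                        # mean of the 18 remaining trends
    sigma = pstdev(rest)                        # assumption: population SD
    sigma_se = sigma / sqrt(len(trends) - 1)    # assumption: N = 19, so sqrt(18)
    normd = abs(obs - modave) / sigma_se        # magnitude of normalized difference
    p_value = erfc(normd / sqrt(2.0))           # two-sided normal p-value
    return modave, sigma, sigma_se, normd, p_value


results = {name: leave_one_out_test(TRENDS, name) for name in TRENDS}
counts = {}
for alpha in (0.05, 0.10, 0.20):
    counts[alpha] = sum(r[4] <= alpha for r in results.values())
    rate = 100.0 * counts[alpha] / len(results)
    print(f"T2LT: p <= {alpha:.2f}: {counts[alpha]} of {len(results)} ({rate:.2f}%)")
```

The rejection counts are insensitive to the rounding of the inputs, since no p-value lies close to any of the three thresholds.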
Bottom line: If we applied Douglass et al.'s ridiculous test of
difference in mean trends to model data only - in fact, to virtually the
same model data they used in their paper - we would conclude that
nearly two-thirds of the individual models had trends that were
significantly different from the multi-model mean trend! To follow
Douglass et al.'s flawed logic, this would mean that two-thirds of the
models really aren't models after all...
Happy New Year to all of you!
With best regards,
Ben
----------------------------------------------------------------------------
Benjamin D. Santer
Program for Climate Model Diagnosis and Intercomparison
Lawrence Livermore National Laboratory
P.O. Box 808, Mail Stop L-103
Livermore, CA 94550, U.S.A.
Tel: (925) 422-2486
FAX: (925) 422-7675
email: santer1@llnl.gov
----------------------------------------------------------------------------