1 Rationale

For predictive algorithms, assessing model performance is critical. The criteria used to measure model performance include the area under the curve (AUC), the average positive predictive value (AP), the Brier score (BS), and the scaled Brier score (sBS). However, most studies only report point estimates of these performance metrics. As mandatory reporting of confidence intervals becomes increasingly common in medical studies, it is crucial to construct confidence intervals for the performance estimates. There are three popular approaches for estimating these confidence intervals: split-sample averaged (SSA), cross-validation (CV) and bootstrapping (BS). This simulation study investigates the performance of these three methods using coverage probability and confidence interval (CI) width to evaluate the interval estimates, as well as bias to evaluate the point estimates.
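For concreteness, these four metrics can be computed directly from a vector of predicted probabilities and observed 0/1 outcomes. The base-R sketch below uses standard estimators (the Wilcoxon form of the AUC and the usual average-precision form of AP); the helper names are ours, and APtools provides its own implementations, which may differ in details:

```r
# Base-R sketch of the four metrics; y is a 0/1 outcome vector and
# p a vector of predicted probabilities. Helper names are ours.
auc_hat <- function(y, p) {
  # AUC via the Wilcoxon/Mann-Whitney statistic (ties count 1/2)
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

ap_hat <- function(y, p) {
  # Average PPV over the cases, i.e. the usual average-precision
  # estimator; APtools' estimator of AP may differ in details.
  o   <- order(p, decreasing = TRUE)
  y   <- y[o]
  ppv <- cumsum(y) / seq_along(y)  # PPV when thresholding at each rank
  sum(ppv * y) / sum(y)
}

bs_hat  <- function(y, p) mean((y - p)^2)                        # Brier score
sbs_hat <- function(y, p) 1 - bs_hat(y, p) / bs_hat(y, mean(y))  # scaled BS
```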

2 Data generation

The data were generated from a logistic model containing two explanatory variables, X1 and X2, and their interaction. X1 and X2 are drawn from a Gaussian distribution with mean zero and standard deviation one.

\[\log\left(\frac{p}{1-p}\right)=\alpha+\beta_1 X_1+\beta_2 X_2+\beta_3 X_1 X_2\]

The mis-specified model without the interaction term was fitted and used for prediction, i.e., \[\log\left(\frac{p}{1-p}\right)=\alpha+\beta_1 X_1+\beta_2 X_2\]
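A minimal sketch of generating one dataset under this design and fitting the mis-specified model; the sample size and coefficient values below are placeholders, not the values used in any particular setting:

```r
set.seed(1)
n     <- 1000                                  # placeholder sample size
alpha <- -3                                    # placeholder intercept
b1 <- 0.8; b2 <- 0.6; b3 <- 0.5                # placeholder effect sizes
x1 <- rnorm(n)                                 # X1 ~ N(0, 1)
x2 <- rnorm(n)                                 # X2 ~ N(0, 1)
p  <- plogis(alpha + b1 * x1 + b2 * x2 + b3 * x1 * x2)  # true model
y  <- rbinom(n, 1, p)
dat <- data.frame(y, x1, x2)

# Mis-specified working model: the interaction term is omitted
fit <- glm(y ~ x1 + x2, family = binomial, data = dat)
```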

3 Method

In internal validation, three approaches are popular for evaluating model performance: split-sample averaged (SSA), k-fold cross-validation (CV) and bootstrapping (BS). For bootstrapping, three variants are examined: the Regular bootstrap and the .632 and .632+ methods.

In the split-sample method, we first randomly divide the data into two parts according to a pre-specified split ratio, for example 2:1. The 2/3 portion of the data is used for model fitting, and predictions are obtained on the remaining 1/3 of the samples. We repeat this procedure 100 times so that every observation has a chance to appear in the test set and receives at least one predicted value. Because the training and validation sets are selected at random, some observations receive multiple predicted values; for those observations, we use the average of the predictions as the final predicted value. The predicted and observed values are then passed to the APtools package in R, which uses bootstrapping to obtain confidence intervals for the performance metrics. A sketch of this procedure appears below.
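Continuing from the data-generation sketch above, the SSA predictions might be computed as follows (the APtools bootstrapping step for the CI is omitted):

```r
n_rep    <- 100                      # number of random 2:1 splits
pred_sum <- numeric(nrow(dat))
pred_cnt <- numeric(nrow(dat))
for (r in seq_len(n_rep)) {
  train <- sample(nrow(dat), size = round(2 / 3 * nrow(dat)))
  test  <- setdiff(seq_len(nrow(dat)), train)
  fit_r <- glm(y ~ x1 + x2, family = binomial, data = dat[train, ])
  pr    <- predict(fit_r, newdata = dat[test, ], type = "response")
  pred_sum[test] <- pred_sum[test] + pr
  pred_cnt[test] <- pred_cnt[test] + 1
}
# Average over the splits in which each observation fell in the test set
# (with 100 splits, every observation lands in a test set with near
# certainty)
pred_ssa <- pred_sum / pred_cnt
```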

In the k-fold cross-validation method, the n labeled observations in the dataset are first randomly divided into k roughly equal-sized parts. We then leave out one part as the test set and use the remaining k-1 parts (9 parts when k = 10) as the training set; the training set is used for model fitting and the test set for prediction. This process is repeated until every observation has a predicted value. Finally, we combine all the predicted values and compare them to the observed values to assess model performance using APtools. A sketch follows.
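A corresponding sketch of k-fold cross-validated predictions, again reusing `dat` from the data-generation sketch and taking k = 10 as an example:

```r
k    <- 10
fold <- sample(rep(seq_len(k), length.out = nrow(dat)))  # random folds
pred_cv <- numeric(nrow(dat))
for (j in seq_len(k)) {
  fit_j <- glm(y ~ x1 + x2, family = binomial, data = dat[fold != j, ])
  pred_cv[fold == j] <- predict(fit_j, newdata = dat[fold == j, ],
                                type = "response")
}
# pred_cv and dat$y are then passed to APtools to estimate the metrics
```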

4 Simulation Scenarios

In this simulation, we considered several scenarios, varying the event rate and the combination of effect sizes. Specifically, two event rates of practical interest, 0.05 and 0.1, were investigated. For the main effects, we considered two situations: one where the two main effects differ substantially and one where they are similar. For the interaction term, we included four effect sizes: small positive, large positive, large negative and no interaction.
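The factors varied can be written compactly as a crossing; note that the full crossing has 16 cells while the study uses 14 settings, so this is only an illustration of the factors, not the actual design:

```r
grid <- expand.grid(
  event_rate   = c(0.05, 0.1),
  main_effects = c("similar", "very different"),
  interaction  = c("small positive", "large positive",
                   "large negative", "none")
)
nrow(grid)  # 16 cells in the full crossing; the study keeps 14 settings
```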

5 Summary of simulation setting

In total, we have 14 simulation settings for this study. The intercept alpha in the true model was chosen to yield reasonable AUC and AP values (0.6 < AUC < 0.8 and 0.3 < AP < 0.5). The table below summarizes the simulation settings:

6 Simulation evaluation

Bias, coverage probability, CI width and computation time are used to evaluate the three methods. Bias is defined as the difference between the estimated value and the true value of the performance metric, and evaluates the accuracy of the point estimates. Coverage probability and CI width evaluate the quality of the estimated confidence interval. Coverage probability is the proportion of simulation replications in which the calculated confidence interval contains the true value of the population parameter of interest (here, the true AUC/AP/BS/sBS); it measures the accuracy of the estimated confidence interval. CI width is defined as the difference between the upper and lower bounds of the confidence interval and reflects the power of the interval: when confidence intervals are used for statistical inference, a narrower interval at the same coverage yields higher power.
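Given per-replication point estimates and CI bounds for one metric under one setting, these criteria reduce to a few lines. In this sketch, `est`, `lo`, `hi` and `truth` are hypothetical placeholder values:

```r
# Hypothetical per-replication results for one metric in one setting
est   <- c(0.71, 0.69, 0.72)   # point estimates (placeholders)
lo    <- c(0.65, 0.63, 0.66)   # 95% CI lower bounds (placeholders)
hi    <- c(0.77, 0.75, 0.78)   # 95% CI upper bounds (placeholders)
truth <- 0.70                  # known true value under this setting

bias     <- mean(est) - truth                 # point-estimate accuracy
coverage <- mean(lo <= truth & truth <= hi)   # coverage probability
width    <- mean(hi - lo)                     # average CI width
```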

7 Load datasets

8 Dataset description

| | Dataset | Description |
|---|---|---|
| 1 | plot_cw | The confidence interval width for AUC, AP, BS and sBS |
| 2 | plot_bias | The bias for AUC, AP, BS and sBS |
| 3 | plot_cp | The coverage probability for AUC, AP, BS and sBS |

The datasets look like this:

The descriptive analysis of the results by method is:

| | Regular (N=14) | .632_method (N=14) | .632+_method (N=14) | CV (N=14) | SSA (N=14) | Overall (N=70) |
|---|---|---|---|---|---|---|
| **CW_auc** | | | | | | |
| Mean (SD) | 0.117 (0.0204) | 0.117 (0.0203) | 0.117 (0.0207) | 0.119 (0.0219) | 0.119 (0.0214) | 0.118 (0.0204) |
| Median [Min, Max] | 0.116 [0.0900, 0.145] | 0.116 [0.0890, 0.144] | 0.116 [0.0890, 0.146] | 0.118 [0.0900, 0.151] | 0.117 [0.0900, 0.149] | 0.116 [0.0890, 0.151] |
| **CW_ap** | | | | | | |
| Mean (SD) | 0.195 (0.0402) | 0.187 (0.0369) | 0.188 (0.0369) | 0.169 (0.0415) | 0.172 (0.0414) | 0.182 (0.0396) |
| Median [Min, Max] | 0.196 [0.140, 0.266] | 0.188 [0.137, 0.252] | 0.188 [0.137, 0.253] | 0.179 [0.114, 0.243] | 0.181 [0.118, 0.246] | 0.183 [0.114, 0.266] |
| **CW_bs** | | | | | | |
| Mean (SD) | 0.0241 (0.00238) | 0.0241 (0.00246) | 0.0244 (0.00231) | 0.0239 (0.00257) | 0.0236 (0.00247) | 0.0240 (0.00238) |
| Median [Min, Max] | 0.0240 [0.0210, 0.0270] | 0.0240 [0.0210, 0.0270] | 0.0250 [0.0210, 0.0270] | 0.0240 [0.0200, 0.0270] | 0.0240 [0.0200, 0.0270] | 0.0250 [0.0200, 0.0270] |
| **CW_Sbs** | | | | | | |
| Mean (SD) | 0.0241 (0.00238) | 0.0241 (0.00246) | 0.0244 (0.00231) | 0.0239 (0.00257) | 0.0236 (0.00247) | 0.0240 (0.00238) |
| Median [Min, Max] | 0.0240 [0.0210, 0.0270] | 0.0240 [0.0210, 0.0270] | 0.0250 [0.0210, 0.0270] | 0.0240 [0.0200, 0.0270] | 0.0240 [0.0200, 0.0270] | 0.0250 [0.0200, 0.0270] |
| | Regular (N=14) | .632_method (N=14) | .632+_method (N=14) | CV (N=14) | SSA (N=14) | Overall (N=70) |
|---|---|---|---|---|---|---|
| **bias_auc** | | | | | | |
| Mean (SD) | -0.00101 (0.00220) | -0.000907 (0.00247) | -0.00105 (0.00256) | -0.00950 (0.00484) | -0.00536 (0.00239) | -0.00356 (0.00454) |
| Median [Min, Max] | -0.000677 [-0.00561, 0.00146] | -0.000438 [-0.00601, 0.00178] | -0.000492 [-0.00643, 0.00173] | -0.00808 [-0.0191, -0.00350] | -0.00536 [-0.0106, -0.00179] | -0.00293 [-0.0191, 0.00178] |
| **bias_ap** | | | | | | |
| Mean (SD) | 0.0142 (0.00630) | 0.0304 (0.0102) | 0.0303 (0.0101) | -0.00134 (0.00496) | 0.00354 (0.00508) | 0.0154 (0.0152) |
| Median [Min, Max] | 0.0147 [0.00146, 0.0234] | 0.0306 [0.0105, 0.0444] | 0.0304 [0.0105, 0.0441] | -0.00125 [-0.00788, 0.00644] | 0.00363 [-0.00411, 0.0114] | 0.0123 [-0.00788, 0.0444] |
| **bias_bs** | | | | | | |
| Mean (SD) | -0.0000406 (0.000284) | -0.0000842 (0.000299) | 0.0000412 (0.000291) | -0.0622 (0.0192) | -0.0622 (0.0192) | -0.0249 (0.0329) |
| Median [Min, Max] | 0.0000947 [-0.000720, 0.000249] | 0.0000693 [-0.000810, 0.000232] | 0.000186 [-0.000697, 0.000287] | -0.0611 [-0.0849, -0.0411] | -0.0611 [-0.0849, -0.0411] | -0.000311 [-0.0849, 0.000287] |
| **bias_Sbs** | | | | | | |
| Mean (SD) | -0.00122 (0.00210) | -0.00411 (0.00257) | -0.00389 (0.00251) | 0.000568 (0.0161) | 0.000204 (0.0162) | -0.00140 (0.0112) |
| Median [Min, Max] | -0.000282 [-0.00440, 0.000843] | -0.00297 [-0.00812, -0.00190] | -0.00279 [-0.00779, -0.00168] | 0.00366 [-0.0248, 0.0244] | 0.00347 [-0.0255, 0.0233] | -0.00226 [-0.0255, 0.0244] |
| Missing | 4 (28.6%) | 4 (28.6%) | 4 (28.6%) | 0 (0%) | 0 (0%) | 12 (17.1%) |
| | Regular (N=14) | .632_method (N=14) | .632+_method (N=14) | CV (N=14) | SSA (N=14) | Overall (N=70) |
|---|---|---|---|---|---|---|
| **CP_auc** | | | | | | |
| Mean (SD) | 0.947 (0.0110) | 0.947 (0.0113) | 0.947 (0.0116) | 0.934 (0.0137) | 0.948 (0.00726) | 0.944 (0.0121) |
| Median [Min, Max] | 0.948 [0.929, 0.967] | 0.947 [0.929, 0.967] | 0.947 [0.930, 0.971] | 0.937 [0.903, 0.950] | 0.948 [0.938, 0.960] | 0.945 [0.903, 0.971] |
| **CP_ap** | | | | | | |
| Mean (SD) | 0.955 (0.0103) | 0.913 (0.0306) | 0.914 (0.0299) | 0.922 (0.0241) | 0.925 (0.0198) | 0.926 (0.0281) |
| Median [Min, Max] | 0.956 [0.940, 0.980] | 0.922 [0.840, 0.939] | 0.924 [0.845, 0.945] | 0.933 [0.859, 0.947] | 0.932 [0.879, 0.948] | 0.935 [0.840, 0.980] |
| **CP_bs** | | | | | | |
| Mean (SD) | 0.950 (0.00877) | 0.950 (0.00915) | 0.953 (0.00915) | 0.944 (0.00556) | 0.944 (0.00626) | 0.948 (0.00855) |
| Median [Min, Max] | 0.948 [0.937, 0.967] | 0.949 [0.936, 0.966] | 0.951 [0.940, 0.973] | 0.944 [0.935, 0.956] | 0.943 [0.933, 0.959] | 0.947 [0.933, 0.973] |
| **CP_Sbs** | | | | | | |
| Mean (SD) | 0.948 (0.0204) | 0.941 (0.0171) | 0.942 (0.0176) | 0.874 (0.0433) | 0.893 (0.0387) | 0.915 (0.0435) |
| Median [Min, Max] | 0.938 [0.927, 0.981] | 0.936 [0.922, 0.968] | 0.936 [0.921, 0.969] | 0.879 [0.767, 0.937] | 0.891 [0.827, 0.959] | 0.927 [0.767, 0.981] |
| Missing | 4 (28.6%) | 4 (28.6%) | 4 (28.6%) | 0 (0%) | 0 (0%) | 12 (17.1%) |

9 Simulation Results

9.1 CW

9.1.1 CW: AUC

The CI width plots for AUC showed that:
1. For bootstrapping, the Regular, .632 and .632+ methods had similar CI widths;
2. CV and SSA had similar CI widths, especially at the larger event rate (settings 1-5 use event rate = 0.05; settings 8-14 use event rate = 0.1);
3. CI widths were smaller at the larger event rate (0.1), indicating more precise estimates when the event rate is large;
4. Settings 5 and 12, which have no interaction, had the smallest CI widths at a given event rate;
5. The differences in CI width for AUC among the methods did not change with the event rate.

9.1.2 CW: AP

The CI width plots for AP showed that:
1. For bootstrapping, the Regular method had wider confidence intervals, while the .632 and .632+ methods had almost the same CI width;
2. CV and SSA had comparable CI widths, with CV slightly narrower;
3. The bootstrapping methods had wider CIs than CV and SSA; the Regular method had the widest CIs and CV the narrowest;
4. Plots D and E showed that the differences in CI width for AP among the methods did not change with the event rate.

9.1.3 CW: Scaled Brier Score

The CI width plots for the scaled Brier score showed that:
1. The CI widths for the scaled Brier score ranged from 0.020 to 0.027; the bootstrapping methods had similar CI widths, as did CV and SSA;
2. SSA had the smallest CI width for the scaled Brier score;
3. Unlike AUC and AP, the scaled Brier score had smaller CI widths when the event rate was small (event rate = 0.05).

9.2 Bias

9.2.1 Bias: AUC

The bias plots for AUC showed that:
1. CV and SSA always underestimated the AUC;
2. The Regular, .632 and .632+ methods had comparable bias;
3. Bootstrapping had the smallest bias; SSA had smaller bias than CV, which had the largest bias.

9.2.2 Bias: AP

The bias plots for AP showed that:
1. Among the bootstrapping methods, the Regular method had smaller bias than the .632 and .632+ methods, which had almost identical bias;
2. Bootstrapping always overestimated the AP;
3. Overall, CV and SSA had bias between -0.01 and 0.01, smaller than that of the bootstrapping methods.

9.2.3 Bias: Scaled Brier Score

The bias plots for the scaled Brier score showed that:
1. For bootstrapping, the Regular method had smaller bias than the .632 and .632+ methods, which had comparable bias;
2. The bootstrapping methods tended to underestimate the scaled Brier score;
3. CV and SSA had comparable bias;
4. Bootstrapping had the smallest bias in comparison to CV and SSA.

9.3 CP

9.3.1 CP: AUC

The coverage probability plots for AUC showed that:
1. Overall, the coverage probabilities from SSA were closest to the nominal level (0.95), falling between 0.94 and 0.96;
2. CV had coverage probabilities mostly below the nominal level, indicating that its CI estimates tend to be too narrow;
3. For bootstrapping, the Regular, .632 and .632+ methods had similar coverage probabilities, ranging between 0.93 and 0.97.

9.3.2 CP: AP

The coverage probability plots for AP showed that:
1. For bootstrapping, the .632 and .632+ methods had similar coverage probabilities, both below 0.95;
2. The Regular method had higher coverage probabilities than the .632 and .632+ methods;
3. CV and SSA had coverage probabilities below 0.95.

9.3.3 CP: Scaled Brier Score

The coverage probability plots for the scaled Brier score showed that:
1. For bootstrapping, the Regular method always had higher coverage probability than the .632 and .632+ methods, which had comparable coverage probabilities;
2. SSA had higher coverage probabilities than CV;
3. The bootstrapping methods had coverage probabilities closest to the nominal level (95%).

9.4 Compute time

In addition to the metrics mentioned above, computation time should also be taken into consideration. The running time[^2] on an Intel(R) Core(TM) i5-9300H CPU is summarized in the following table:

| | Method | Running time |
|---|---|---|
| 1 | Bootstrapping | 600 sec |
| 2 | CV | 6 sec |
| 3 | SSA | 7 sec |

The running times to obtain one confidence interval are comparable for CV and SSA. For bootstrapping, the Regular, .632 and .632+ methods together take an average of 600 seconds to obtain one set of confidence intervals.

10 Conclusion

The results above showed that all methods had comparable CI widths for AUC. The bootstrapping methods had the smallest bias for AUC, while CV and SSA always underestimated it; SSA had the most stable coverage probability for AUC.

CV had the smallest CI widths for AP and bootstrapping the largest. CV and SSA had comparable bias for AP, smaller than that of bootstrapping, which always overestimated AP.

SSA had the smallest CI widths for the scaled Brier score. Bootstrapping performed best in point estimation of the scaled Brier score, although it consistently underestimated it, and it also had the best coverage probability for the scaled Brier score.