Hacking Sklearn To Do The Optimism Corrected Bootstrap
Categories: Statistics, Machine Learning, Python, Scikit-Learn
Author: Demetri Pananos
Published: November 23, 2021
It's late, I can't sleep, so I'm writing a blog post about the optimism corrected bootstrap.
In case you don't know, epidemiology/biostatistics people working on prediction models like to validate their models in a slightly different way than your run-of-the-mill data scientist. It should be unsurprising that this has generated some discussion between ML people and epi/biostats people, but I'm going to ignore that for now. I'm going to assume you have a good reason for wanting to do the optimism corrected bootstrap in python, especially with sklearn, and if you don't and want to discuss the pros and cons of the method instead then lalalalalala I can't hear you. The procedure goes like this:
1. Construct a model in the original sample; determine the apparent performance on the data from the sample used to construct the model.
2. Draw a bootstrap sample (Sample*) with replacement from the original sample.
3. Construct a model (Model*) in Sample*, replaying every step that was done in the original sample, especially model specification steps such as selection of predictors. Determine the bootstrap performance as the apparent performance of Model* in Sample*.
4. Apply Model* to the original sample without any modification to determine the test performance.
5. Calculate the optimism as the difference between bootstrap performance and test performance.
6. Repeat steps 2–5 many times, at least 200, to obtain a stable mean estimate of the optimism.
7. Subtract the mean optimism estimate (step 6) from the apparent performance (step 1) to obtain the optimism-corrected performance estimate.
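Written out from scratch, the steps above might look something like this (a minimal sketch of my own, using a linear regression and MSE purely for illustration):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.utils import resample

X, y = load_diabetes(return_X_y=True)

# Step 1: apparent performance of the model fit to the original sample
model = LinearRegression().fit(X, y)
apparent = mean_squared_error(y, model.predict(X))

optimism = []
for _ in range(200):                                        # step 6: repeat many times
    ix = resample(np.arange(len(X)))                        # step 2: bootstrap sample of row indices
    boot = LinearRegression().fit(X[ix], y[ix])             # step 3: refit in the bootstrap sample
    boot_perf = mean_squared_error(y[ix], boot.predict(X[ix]))
    test_perf = mean_squared_error(y, boot.predict(X))      # step 4: apply to the original sample
    optimism.append(test_perf - boot_perf)                  # step 5 (sign flipped because MSE is a loss)

corrected = apparent + np.mean(optimism)                    # step 7
print(f'Optimism Corrected (from scratch): {corrected:.2f}')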
This procedure is very straightforward, and as the sketch above shows it could easily be coded up from scratch, but I want to use as much existing code as I can and put sklearn on my resume, so let's talk about what tools exist in sklearn to do cross validation and how we could use them to perform these steps.
Cross Validation in Sklearn
When you pass arguments like cv=5 to sklearn's many functions, what you're really doing is passing 5 to sklearn.model_selection.KFold. See sklearn.model_selection.cross_validate, which calls a helper named check_cv, to verify this. KFold.split returns a generator which, when passed to next, yields a pair of train and test indices. The inner workings of KFold might look something like this:
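# A simplified sketch of my own, not sklearn's actual implementation
import numpy as np

def kfold_split(X, n_splits=5):
    indices = np.arange(len(X))
    # each chunk of indices takes a turn as the test set; the remaining rows are the train set
    for test_ix in np.array_split(indices, n_splits):
        train_ix = np.setdiff1d(indices, test_ix)
        yield train_ix, test_ix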
Those indices are used to slice X and y to do the cross validation. So, if we are going to hack sklearn into doing the optimism corrected bootstrap for us, we really just need to write a generator that gives us a bunch of indices. According to steps 2 and 3 above, the train indices need to be with-replacement resamples of np.arange(len(X)) (ask yourself "why?"). According to step 4, the test indices need to be np.arange(len(X)) (again… "why?").
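Concretely, a single bootstrap "fold" of indices might look like this (n = 8 and the printed resample are just for illustration):

import numpy as np
from sklearn.utils import resample

n = 8                          # pretend the data has 8 rows
test_ix = np.arange(n)         # step 4: the fitted model is scored on every original row
train_ix = resample(test_ix)   # steps 2-3: a with-replacement resample of the row indices
print(train_ix)                # e.g. [6 2 2 7 0 4 4 1] -- some rows repeat, some never appear
print(test_ix)                 # [0 1 2 3 4 5 6 7]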
Once we have a generator to give us our indices, we can use sklearn.model_selection.cross_validate to fit models on the resampled data and predict on the original sample (step 4). If we pass return_train_score=True to cross_validate, we get the bootstrap performances as well as the test performances (step 5). All we need to do then is calculate the average difference between the two (step 6) and add this quantity to the apparent performance we got from step 1 (we add rather than subtract because MSE is a loss, so the optimism computed as test minus train performance comes out positive).
That all sounds very complex, but the code is deceptively simple.
The Code (I Know You Skipped Here, Don’t Lie)
import numpy as np
from sklearn.model_selection import cross_validate, RepeatedKFold
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_diabetes
from sklearn.utils import resample

# Need some data to predict with
data = load_diabetes()
X, y = data['data'], data['target']


class OptimismBootstrap():

    def __init__(self, n_bootstraps):
        self.n_bootstraps = n_bootstraps

    def split(self, X, y, *_):
        # Train on a with-replacement resample of the row indices,
        # test on every row of the original sample
        n = len(X)
        test_ix = np.arange(n)
        for _ in range(self.n_bootstraps):
            train_ix = resample(test_ix)
            yield (train_ix, test_ix)


# Optimism Corrected
model = LinearRegression()
model.fit(X, y)
apparent_performance = mean_squared_error(y, model.predict(X))

opt_cv = OptimismBootstrap(n_bootstraps=250)
mse = make_scorer(mean_squared_error)
cv = cross_validate(model, X, y, cv=opt_cv, scoring=mse, return_train_score=True)

optimism = cv['test_score'] - cv['train_score']
optimism_corrected = apparent_performance + optimism.mean()
print(f'Optimism Corrected: {optimism_corrected:.2f}')

# Compare against regular cv
cv = cross_validate(model, X, y, cv=10, scoring=mse)['test_score'].mean()
print(f'regular cv: {cv:.2f}')

# Compare against repeated cv
cv = cross_validate(model, X, y, cv=RepeatedKFold(n_splits=10, n_repeats=100), scoring=mse)['test_score'].mean()
print(f'repeated cv: {cv:.2f}')
Optimism Corrected: 2999.04
regular cv: 3000.38
repeated cv: 3009.02
The three estimates (optimism corrected, 10 fold, and repeated 10 fold) should be reasonably close together, but, uh, don't run this code multiple times. You might see that the optimism corrected estimate is quite noisy, meaning either I'm wrong or that Twitter thread I linked to might have some merit.
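If you do want to see that noise for yourself (against my advice), one quick way is to rerun the corrected estimate a handful of times and look at the spread. This is just a sketch that reuses model, X, y, mse, apparent_performance, and OptimismBootstrap from the code above:

# Monte Carlo noise check: rerun the optimism corrected estimate several times
# (assumes the objects defined in the main code block above are still in scope)
reruns = []
for _ in range(10):
    res = cross_validate(model, X, y, cv=OptimismBootstrap(n_bootstraps=250),
                         scoring=mse, return_train_score=True)
    reruns.append(apparent_performance + (res['test_score'] - res['train_score']).mean())
print(f'std across reruns: {np.std(reruns):.2f}')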