Hacking Sklearn To Do The Optimism Corrected Bootstrap
Categories: Statistics, Machine Learning, Python, Scikit-Learn
Author: Demetri Pananos
Published: November 23, 2021
It's late, I can't sleep, so I'm writing a blog post about the optimism corrected bootstrap.
In case you don't know, epidemiology/biostatistics people working on prediction models like to validate their models in a slightly different way than your run-of-the-mill data scientist. It should be unsurprising that this has generated some discussion between ML people and epi/biostats people, but I'm going to ignore that for now. I'm going to assume you have a good reason for wanting to do the optimism corrected bootstrap in python, especially with sklearn, and if you don't and want to discuss the pros and cons of the method instead then lalalalalala I can't hear you. The procedure goes like this:
1. Construct a model in the original sample; determine the apparent performance on the data from the sample used to construct the model.
2. Draw a bootstrap sample (Sample*) with replacement from the original sample.
3. Construct a model (Model*) in Sample*, replaying every step that was done in the original sample, especially model specification steps such as selection of predictors. Determine the bootstrap performance as the apparent performance of Model* in Sample*.
4. Apply Model* to the original sample without any modification to determine the test performance.
5. Calculate the optimism as the difference between bootstrap performance and test performance.
6. Repeat steps 2–5 many times, at least 200, to obtain a stable mean estimate of the optimism.
7. Subtract the mean optimism estimate (step 6) from the apparent performance (step 1) to obtain the optimism-corrected performance estimate.
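Written out from scratch, the steps above might look something like this (a minimal sketch of my own, using a linear regression and MSE purely for illustration):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.utils import resample

X, y = load_diabetes(return_X_y=True)

# Step 1: apparent performance of the model fit to the original sample
model = LinearRegression().fit(X, y)
apparent = mean_squared_error(y, model.predict(X))

optimism = []
for _ in range(200):                                        # step 6: repeat many times
    ix = resample(np.arange(len(X)))                        # step 2: bootstrap sample of row indices
    boot = LinearRegression().fit(X[ix], y[ix])             # step 3: refit in the bootstrap sample
    boot_perf = mean_squared_error(y[ix], boot.predict(X[ix]))
    test_perf = mean_squared_error(y, boot.predict(X))      # step 4: apply to the original sample
    optimism.append(test_perf - boot_perf)                  # step 5 (sign flipped because MSE is a loss)

corrected = apparent + np.mean(optimism)                    # step 7
print(f'Optimism Corrected (from scratch): {corrected:.2f}')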
This procedure is very straightforward, and as the sketch above shows it could easily be coded up from scratch, but I want to use as much existing code as I can and put sklearn on my resume, so let's talk about what tools exist in sklearn to do cross validation and how we could use them to perform these steps.
Cross Validation in Sklearn
When you pass arguments like cv=5 to sklearn's many functions, what you're really doing is passing 5 to sklearn.model_selection.KFold. See sklearn.model_selection.cross_validate, which calls a helper named check_cv, to verify this. KFold.split returns a generator which, when passed to next, yields a pair of train and test indices. The inner workings of KFold might look something like this:
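# A simplified sketch of my own, not sklearn's actual implementation
import numpy as np

def kfold_split(X, n_splits=5):
    indices = np.arange(len(X))
    # each chunk of indices takes a turn as the test set; the remaining rows are the train set
    for test_ix in np.array_split(indices, n_splits):
        train_ix = np.setdiff1d(indices, test_ix)
        yield train_ix, test_ix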
Those indices are used to slice X and y to do the cross validation. So, if we are going to hack sklearn into doing the optimism corrected bootstrap for us, we really just need to write a generator that gives us a bunch of indices. According to steps 2 and 3 above, the train indices need to be with-replacement resamples of np.arange(len(X)) (ask yourself "why?"). According to step 4, the test indices need to be np.arange(len(X)) (again… "why?").
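Concretely, a single bootstrap "fold" of indices might look like this (n = 8 and the printed resample are just for illustration):

import numpy as np
from sklearn.utils import resample

n = 8                          # pretend the data has 8 rows
test_ix = np.arange(n)         # step 4: the fitted model is scored on every original row
train_ix = resample(test_ix)   # steps 2-3: a with-replacement resample of the row indices
print(train_ix)                # e.g. [6 2 2 7 0 4 4 1] -- some rows repeat, some never appear
print(test_ix)                 # [0 1 2 3 4 5 6 7]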
Once we have a generator to give us our indices, we can use sklearn.model_selection.cross_validate to fit models on the resampled data and predict on the original sample (step 4). If we pass return_train_score=True to cross_validate, we get the bootstrap performances as well as the test performances (step 5). All we need to do then is calculate the average difference between the two (step 6) and add this quantity to the apparent performance we got from step 1 (we add rather than subtract because MSE is a loss, so the optimism computed as test minus train performance comes out positive).
That all sounds very complex, but the code is deceptively simple.
The Code (I Know You Skipped Here, Don’t Lie)
import numpy as np
from sklearn.model_selection import cross_validate, RepeatedKFold
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_diabetes
from sklearn.utils import resample

# Need some data to predict with
data = load_diabetes()
X, y = data['data'], data['target']


class OptimismBootstrap():

    def __init__(self, n_bootstraps):
        self.n_bootstraps = n_bootstraps

    def split(self, X, y, *_):
        # Train on a with-replacement resample of the row indices,
        # test on every row of the original sample
        n = len(X)
        test_ix = np.arange(n)
        for _ in range(self.n_bootstraps):
            train_ix = resample(test_ix)
            yield (train_ix, test_ix)


# Optimism Corrected
model = LinearRegression()
model.fit(X, y)
apparent_performance = mean_squared_error(y, model.predict(X))

opt_cv = OptimismBootstrap(n_bootstraps=250)
mse = make_scorer(mean_squared_error)
cv = cross_validate(model, X, y, cv=opt_cv, scoring=mse, return_train_score=True)

optimism = cv['test_score'] - cv['train_score']
optimism_corrected = apparent_performance + optimism.mean()
print(f'Optimism Corrected: {optimism_corrected:.2f}')

# Compare against regular cv
cv = cross_validate(model, X, y, cv=10, scoring=mse)['test_score'].mean()
print(f'regular cv: {cv:.2f}')

# Compare against repeated cv
cv = cross_validate(model, X, y, cv=RepeatedKFold(n_splits=10, n_repeats=100), scoring=mse)['test_score'].mean()
print(f'repeated cv: {cv:.2f}')
Optimism Corrected: 2999.04
regular cv: 3000.38
repeated cv: 3009.02
The three estimates (optimism corrected, 10 fold, and repeated 10 fold) should be reasonably close together, but, uh, don't run this code multiple times. You might see that the optimism corrected estimate is quite noisy, meaning either I'm wrong or that Twitter thread I linked to might have some merit.
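If you do want to see that noise for yourself (against my advice), one quick way is to rerun the corrected estimate a handful of times and look at the spread. This is just a sketch that reuses model, X, y, mse, apparent_performance, and OptimismBootstrap from the code above:

# Monte Carlo noise check: rerun the optimism corrected estimate several times
# (assumes the objects defined in the main code block above are still in scope)
reruns = []
for _ in range(10):
    res = cross_validate(model, X, y, cv=OptimismBootstrap(n_bootstraps=250),
                         scoring=mse, return_train_score=True)
    reruns.append(apparent_performance + (res['test_score'] - res['train_score']).mean())
print(f'std across reruns: {np.std(reruns):.2f}')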