Intuitive Formulae Are Not Always Right


In the data science class I help TA, we’re going over confidence intervals. Thanks to the central limit theorem, we can report confidence intervals for the out-of-sample generalization error. Let’s assume our loss function is mean squared error. A confidence interval would then be

\[\widehat{MSE} \pm z_{1-\alpha/2} \dfrac{\sigma_{MSE}}{\sqrt{n}} \>.\]

Here, $\sigma_{MSE}$ is the standard deviation of the squared residuals (squared because our loss is squared error). Well, if you’re familiar with $R^2$, you might realize that

\[R^2 = 1 - \dfrac{MSE}{TSS/n}\]

where $TSS = \sum_i (y_i - \bar{y})^2$ is the total sum of squares (so $TSS/n$ is just the sample variance of $y$), and be tempted to take your confidence interval for MSE, push both endpoints through that formula, and call the result a confidence interval for $R^2$.
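
To make that temptation concrete, here is a minimal sketch of the naive construction (naive_r2_interval is my own name and signature, not something from a library): build the CLT interval for MSE from the squared residuals, then transform both endpoints through $1 - MSE/(TSS/n)$.

import numpy as np
from scipy.stats import norm

def naive_r2_interval(y, y_pred, alpha_level=0.05):
    # CLT-based confidence interval for the MSE, built from the squared residuals
    sq_resid = (y - y_pred) ** 2
    n = len(y)
    mse = sq_resid.mean()
    se = sq_resid.std(ddof=1) / np.sqrt(n)
    z = norm.ppf(1 - alpha_level / 2)
    mse_lo, mse_hi = mse - z * se, mse + z * se
    # naively push the endpoints through R^2 = 1 - MSE / (TSS / n);
    # a larger MSE means a smaller R^2, so the endpoints swap
    var_y = np.mean((y - y.mean()) ** 2)
    return 1 - mse_hi / var_y, 1 - mse_lo / var_y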

This works… in a very narrow set of circumstances. For most problems, the coverage ends up above or far below the nominal level, and the interval also risks covering impossible values of $R^2$. Let’s make some very charitable assumptions and see just how well this works.

Charitable Assumptions

Let’s analyze this interval for a single-variable regression, and assume the predictor is normally distributed with mean $\mu$ and variance $\tau^2$.

Now, assume $y\vert x \sim \mathcal{N}(\alpha x + \beta, \sigma^2)$. By the law of total variance, the marginal distribution of $y$ is then $y \sim \mathcal{N}(\alpha \mu + \beta, \sigma^2 + \alpha^2\tau^2)$, meaning that the exact value of $R^2$ would be

\[R^2 = 1 - \dfrac{\sigma^2}{\sigma^2 + \alpha^2\tau^2} = 1 - \dfrac{1}{1 + \left(\dfrac{\alpha \tau}{\sigma}\right)^2} \>.\]

We can always standardize our predictor to have unit variance, so the quantity we really care about is $\alpha/\sigma$. Rearranging the formula for $R^2$ above (with $\tau = 1$) gives $\alpha = \sigma\sqrt{1/(1-R^2) - 1}$, and since we only care about the ratio of $\alpha$ to $\sigma$, I’ll fix $\sigma = 1$ and vary $\alpha$. Now I can generate a regression problem whose true $R^2$ is whatever I want it to be. Here is some code to do that.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from scipy.special import expit, logit
from itertools import product
import pandas as pd
import seaborn as sns

def make_regression_data(n, alpha, sigma):
    """Simulate a single-predictor regression with slope alpha and noise scale sigma."""
    x = np.random.normal(size=n)  # standard normal predictor, so tau = 1
    X = x.reshape(-1, 1)          # sklearn expects a 2D design matrix
    y = alpha * x + np.random.normal(0, sigma, size=n)
    return X, y


R2 = 0.8
sigma = 1
alpha = sigma * np.sqrt(1 / (1 - R2) - 1)  # invert the R^2 formula to get the slope

X, y = make_regression_data(1000, alpha, sigma)

# .score returns R^2.  This will print something close to 0.8; any difference is sampling error.
print(LinearRegression().fit(X, y).score(X, y))

Now all I have to do is:

  • Specify an R squared and a sample size
  • Compute the “confidence interval”
  • Determine if the true $R^2$ is in the interval
  • Repeat

The coverage of an interval estimator is the long-run relative frequency with which it contains the true estimand. A 95% confidence interval should have coverage of 0.95. Coverage lower than that means our interval is likely too narrow and is overstating the precision of our estimate; coverage higher than that means the interval is giving credence to values of $R^2$ it should not, understating the precision of the estimate. It is important that our interval estimator have the appropriate coverage. Ok, moving on.
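
Here is a minimal sketch of that loop, reusing make_regression_data, LinearRegression, and the naive_r2_interval sketch from above; the coverage function is my own stand-in for the real simulation code in the gist linked below.

def coverage(true_r2, n, n_sims=2000, sigma=1.0):
    # slope that makes the population R^2 equal to true_r2 (with predictor variance tau^2 = 1)
    alpha_coef = sigma * np.sqrt(1 / (1 - true_r2) - 1)
    hits = 0
    for _ in range(n_sims):
        X, y = make_regression_data(n, alpha_coef, sigma)
        y_pred = LinearRegression().fit(X, y).predict(X)
        lo, hi = naive_r2_interval(y, y_pred)
        hits += (lo <= true_r2 <= hi)  # did the interval contain the true R^2?
    return hits / n_sims

# e.g. coverage(0.8, 200) would come out close to 0.95 if the interval behaved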

Results

Code is included at the end of the blog post, but here is a plot of the coverage by $R^2$ and sample size. As you can see, small values of $R^2$ have coverage above the nominal value, so we are always overstating what kinds of values of $R^2$ are consistent with the data. And for large values of $R^2$, the coverage is way, way below the nominal. If this interval were used to test some hypothesis about $R^2$, we would make a lot of false positives. A lot. I’ve coded the color map so that white represents the nominal coverage. As you can see, there is not a ton of white, meaning the interval estimate doesn’t work nearly as well as one might be led to believe.
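
If you want to reproduce that kind of plot, here is a rough sketch of the idea, reusing the imports from the earlier snippet and the coverage function sketched above; the grid of $R^2$ values and sample sizes is just for illustration, not the exact grid from the gist.

# coverage for every combination of true R^2 and sample size
results = pd.DataFrame(
    [{"r2": r2, "n": n, "coverage": coverage(r2, n)}
     for r2, n in product(np.linspace(0.1, 0.9, 9), [25, 50, 100, 200, 400])]
)

# pivot into a grid and center a diverging colormap at the nominal 0.95,
# so that white corresponds to nominal coverage
grid = results.pivot(index="r2", columns="n", values="coverage")
sns.heatmap(grid, cmap="coolwarm_r", center=0.95, annot=True)
plt.show()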

Code

I’ve included the code in a gist here.

Conclusion

Don’t do this. And more importantly, demand some sort of proof when people claim that something is as easy as just doing some algebra. Math has a lot of intuitive results, but there are also a ton of unintuitive ones. Stats in particular is really unintuitive, but luckily we can use computers to check our intuition. Do make use of them.