On Interpretations of Confidence Intervals

Statistics
Author

Demetri Pananos

Published

April 3, 2021

The 95% in 95% confidence interval refers not to the probability that any one interval contains the estimand, but rather to the long-run relative frequency with which the estimator contains the estimand over an infinite sequence of replicated experiments under ideal conditions.

Now, if this were twitter I would get ratioed so hard I might have to take a break and walk it off. Luckily, this is my blog and not yours so I can say whatever I want with impunity. But, rather than shout my opinions and demand people listen, I thought it might be a good exercise to explain to you why I think this and perhaps why people might disagree. Let’s for a moment ignore the fact that the interpretation I use above is the de jure definition of a confidence interval and instead start where a good proportion of statistical learning starts; with a deck of shuffled cards.

I present to you a shuffled deck. It's a regular deck of cards, no funny business with the cards or the shuffling. What is the probability that the top card of this deck is an ace? I'd wager a good portion of people would say 4/52. If you, dear reader, said 4/52, then I believe you have made a benign mistake, but a mistake all the same. And I suspect the reason you've made this mistake is that you've swapped a hard question (the question about this deck) for an easier question (a question about the long-run relative frequency of coming to shuffled decks with no funny business and finding aces on top).

Swapping hard questions for easy questions is not a new observation. Daniel Kahneman writes about it in Thinking, Fast and Slow and provides numerous examples. To repeat one from the book: we might swap the question "How happy are you with your life these days?" for the much easier "What is my mood right now?"

The book Thinking, Fast and Slow explains why we do this, or better yet, why we have no control over it. I won't explain it here. But it is important to know that this is something we do, mostly unconsciously.

So back to the deck of cards. Questions about the deck in front of you are hard. It's either an ace or not, but you can't tell! The card is face down and there is no other information you could use to make the decision. So you answer an easier one using information that you do know: the number of aces in the deck, the number of cards in the deck, the fact that each card is equally likely to be on top given there is no funny business with the cards or the shuffling, and the basic rules of probability you might have learned in high school if not elsewhere. But the answer you give is to a fundamentally different question, namely "If I were to observe a long sequence of well-shuffled decks with no funny business, what fraction of them would have an ace on top?". Your answer is about that long sequence of shuffled decks. It isn't about any one particular deck, and certainly not the one in front of you.
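Here is a minimal Python sketch of that easier question, with random.shuffle standing in for "no funny business":

```python
# Simulate a long sequence of well-shuffled decks and record how often
# an ace sits on top. The long-run fraction approaches 4/52 ≈ 0.0769.
import random

def ace_on_top() -> bool:
    deck = ["ace"] * 4 + ["other"] * 48  # 4 aces among 52 cards
    random.shuffle(deck)                 # no funny business
    return deck[0] == "ace"

n_decks = 100_000
hits = sum(ace_on_top() for _ in range(n_decks))
print(f"Fraction of decks with an ace on top: {hits / n_decks:.4f}")
```

None of this says anything about the particular deck on the table; it only describes the long sequence.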

I think the same thing happens with confidence intervals. The estimator has the property that 95% of the time it is constructed (under ideal circumstances) it will contain the estimand. But any one interval either does or does not contain the estimand. And unlike the deck of cards, which can easily be examined, we can't ever know for certain whether the interval successfully captured the estimand. There is no moment where we get to verify the estimand is in the confidence interval, and so we are sort of left guessing, which prompts us to offer a probability that we are right.
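If you want to see that coverage property rather than take my word for it, here is a minimal simulation sketch; the normal data, true mean, and sample size are arbitrary choices on my part:

```python
# Simulate many replicated experiments, build a 95% t-interval for a known
# true mean each time, and count how often the interval contains it.
# The long-run coverage is about 0.95, but each individual interval
# either contains the true mean or it does not.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(2021)
true_mean, sigma, n, reps = 1.0, 2.0, 30, 10_000

covered = 0
for _ in range(reps):
    x = rng.normal(true_mean, sigma, size=n)
    xbar = x.mean()
    half_width = t.ppf(0.975, df=n - 1) * x.std(ddof=1) / np.sqrt(n)
    covered += (xbar - half_width <= true_mean <= xbar + half_width)

print(f"Coverage over {reps} replications: {covered / reps:.3f}")
```

The 0.95 is a statement about the procedure over the 10,000 replications, not about any one of the intervals it produced.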

The mistake is benign. It hurts no one to think about confidence intervals as having a 95% probability of containing the estimand. Your company will not lose money, your paper will (hopefully) not be rejected, and the world will not end. That being said, it is unfortunately incorrect, if not by appeal to the definition then for other reasons.

I'll start with an appeal to authority. Sander Greenland and coauthors (who include notable epidemiologist Ken Rothman and motherfucking Doug Altman) list interpreting a confidence interval as having 95% probability of containing the true effect as misconception 19 in this amazing paper. They note "It is possible to compute an interval that can be interpreted as having 95% probability of containing the true value" but go on to say that this results in us doing a Bayesian analysis and computing a credible interval. If these guys are wrong, I don't want to be right.

Additionally, when I say "the probability of a coin being flipped heads is 0.5", that references a long-run frequency. I could, in principle, demonstrate that frequency by flipping a coin a lot and computing the empirical frequency of heads, which, assuming the coin is fair and the number of flips is large enough, will be within an acceptable range of 0.5. To those people who say "this interval contains the estimand with 95% probability", I say "prove it". Demonstrate to me, via simulation or otherwise, this long-run relative frequency. I can't imagine how this could be demonstrated, because any fixed dataset will yield the same answer over and over. Perhaps what supporters of this perspective mean is something closer to the Bayesian interpretation of probability (where probability is akin to strength of belief). If so, the debate is settled, because probability in frequentism is not about belief strength but about frequencies. Additionally, what is the random component in this probability? The data from the experiment are fixed; to allow them to vary is to appeal to my interpretation of the interval. If the estimand is random, then we are in another realm altogether, as frequentism assumes fixed parameters and random data. Maybe they mean something else which I just can't think of. If there is something else, please let me know.
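To make the "fixed dataset" point concrete, here is a minimal sketch with a made-up dataset; recomputing the interval from the same data just returns the same endpoints every time, so there is no frequency left to demonstrate:

```python
# Once the data are fixed, nothing in the interval calculation is random.
# Recomputing the 95% t-interval from the same dataset gives identical
# endpoints every time.
import numpy as np
from scipy.stats import t

x = np.array([1.2, 0.7, 2.1, 1.5, 0.9, 1.8])  # a fixed, made-up dataset
n = len(x)
half_width = t.ppf(0.975, df=n - 1) * x.std(ddof=1) / np.sqrt(n)

for _ in range(3):
    print(x.mean() - half_width, x.mean() + half_width)  # same interval each time
```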

I’ve gotten flack about confidence intervals on twitter.

Flack 1: Framing It As A Bet

You present to me a shuffled deck with no funny business and offer me a bet in which I win X0,000 dollars if the card is an ace and lose X0 dollars if it is not. "Aha Demetri! If you think the probability of the card on top being an ace is 0 or 1, you are either a fool for not taking the bet or a fool for being so overconfident! Your position is indefensible!" one person on twitter said to me (ok, they didn't say it verbatim like this, but that was the intent).

Well, not so fast. Nothing about my interpretation precludes me from using the answer to a simpler question to make decisions (I would argue statistics is the practice of doing just that, but I digress). The top card is still an ace or not, but I can still think about an infinite sequence of shuffled decks anyway. In only about 4 of every 52 of those scenarios is the card on top an ace, but the potential win is a thousand times the potential loss, so over that long sequence the bet is favourable. Thus, I take the bet and hope the top card is an ace (much like I hope the confidence interval captures the true estimand, even though I know it either does or does not).
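For concreteness, take hypothetical stakes of 10,000 and 10 dollars in place of the X's. The expected value under the long-run frequency answer to the easier question is what justifies taking the bet:

```python
# Expected value of the bet under hypothetical stakes: win 10,000 if the top
# card is an ace (long-run frequency 4/52), lose 10 otherwise.
p_ace = 4 / 52
expected_value = p_ace * 10_000 - (1 - p_ace) * 10
print(expected_value)  # ≈ 760, comfortably positive, so take the bet
```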

Flack 2: My Next Interval Has 95% Probability

“But Demetri, if 95% refers to the frequency of intervals containing the estimand, then surely my next interval has 95% probability of capturing the estimand prior to seeing data. Hence, individual intervals do have 95% probability of containing the estimand”.

I get this sometimes, but I don't fully understand how it is supposed to be convincing. I see no problem with saying "the next interval has 95% probability", just like I have no problem with saying "If you shuffle those cards, the probability an ace is on top is 4/52" or "My next Roll Up The Rim cup has a 1 in 6 chance of winning". This is starting to get more philosophical than I care for it to, but those statements all reference non-existent things. Once they are brought into existence, it would be silly to think that they retain these properties. My cup is either a winner or a loser, even if I don't roll it.

Flack 3: But Schrödinger’s Cat…

No. Stop. This is not relevant in the least. I’m talking about cards and coins, not quarks or electrons. The Wikipedia article even says “Schrödinger did not wish to promote the idea of dead-and-live cats as a serious possibility; on the contrary, he intended the example to illustrate the absurdity of the existing view of quantum mechanics”. Cards can’t be and not-be aces until flipped. Get out of here.

Wrapping Up, Don’t @ Me

To be completely fair, I think the question about the cards I've presented to you is unfair. The question asks for a probability, and while 0 and 1 are valid probabilities, the question is phrased in a way that prompts you for a number strictly between 0 and 1. Likewise, the name "95% confidence interval" begs for the wrong interpretation. That is the problem we face when we use language, which is naturally imprecise and full of shortcuts and ambiguity, to talk about things as precise as mathematics. It is a seminal case study in what I like to call the precision-usefulness trade-off: precise statements are not useful. It is by interpreting them and communicating them in common language that they become useful, and that usefulness comes at the cost of precision (note, this explanation of the trade-off is itself susceptible to the trade-off). The important part is that we use confidence intervals to convey uncertainty in the estimate from which they are derived. It isn't important what you or I think about it, as the confidence interval is merely a means to an end.

As I noted, the mistake is benign, and these arguments are more a mental exercise than a fight against a method which may induce harm. Were it not for COVID-19, I would encourage us all to go out for a beer and have these conversations rather than do it over twitter. Anyway, you promise not to @ me about this anymore, and I promise not to tweet about it anymore.