Ph.Demetri: Applied Mathematician. Statistician. Data Scientist. By Demetri Pananos (dpananos@gmail.com)

In Response To Vaccine Hesitancy From Tommy Caldwell (2021-09-22)
https://dpananos.github.io/posts/2021/09/blog-post-32

<p>On September 21st, 2021 I received an email from Tommy Caldwell, author and owner of Hybrid Fitness in my hometown of London, Ontario, with the subject line “COVID-19, Kids, and Vaccines: A difficult decision for parents to navigate”. The email was sent from Hybrid Fitness’ email address (which I had subscribed to sometime in 2018, in the first year of my PhD). In that email, Mr. Caldwell acknowledges that vaccinating is a “no-brainer decision” and professes to be no expert in COVID19, science, or research, yet goes on to interpret statistics from <a href="https://www.cdc.gov/vaccines/acip/recs/grade/covid-19-pfizer-biontech-vaccine-12-15-years.html">a clinical trial</a> intended to evaluate the vaccine against COVID19 in persons aged 12-15 years.</p>
<p>Mr. Caldwell claims that 121 severe adverse events (SAAs) occurred in the intervention arm of the trial, constituting a 5:1 ratio when compared to the placebo arm. He goes on to say:</p>
<blockquote>
<p>“A direct quote from the trial also suggested that researchers were concerned that they were missing severe adverse events as well due to the study design. In their words, ‘There was serious concern of indirectness because the body of evidence does not provide certainty that rare serious adverse events were captured due to the short follow-up and sample size. There was also very serious concern for imprecision, due to the width of the confidence interval.’”</p>
</blockquote>
<p>I have included the entire email as a png below so that you can read his words in context.</p>
<p>Mr. Caldwell has, in my opinion, grossly misinterpreted these statistics and in doing so weaponizes them to support a narrative that vaccinating one’s children is a “difficult decision” when proper interpretation of the statistics should lead one to believe that the vaccine is at the very least consistent with no increase in risk of harm. This blog post is intended to rectify the mistakes Mr. Caldwell has made to his assumedly large readership.</p>
<p>In what follows, I present corrected claims found in Mr. Caldwell’s email as section titles with more details in the section body. If you question my ability to intelligently comment on statistics at this level (which you should), I encourage you to poke around this website for proof that I understand statistics sufficiently well. If you need additional proof, please see some of the many <a href="https://stats.stackexchange.com/users/111259/demetri-pananos?tab=answers&sort=votes">answers</a> I provide on statistical forums, or review some of the <a href="https://scholar.google.com/citations?hl=en&user=LN16PpgAAAAJ&view_op=list_works&sortby=pubdate">papers</a> I have published in medical and epidemiology journals.</p>
<h1 id="51-ratio-of-people-reporting-reactogenicity-not-severe-adverse-events">5:1 Ratio of People Reporting Reactogenicity, Not Severe Adverse Events</h1>
<p>Table 3d in the link above summarizes the number of participants who reported “Reactogenicity”. I assume these are the numbers Mr. Caldwell refers to in his email, as I can’t find any others which match the aforementioned ratio. The ratio of reactogenicity between the intervention and placebo arms is roughly 5 (121 / 22 is approximately 5.5). However, the definition of reactogenicity is (from the text below table 3d)</p>
<blockquote>
<p>Reactogenicity outcome includes local and systemic events, grade ≥3. Grade 3: prevents daily routine activity or requires use of a pain reliever.</p>
</blockquote>
<p>This means that 121 people may have, for example, needed an Advil, gotten a sore arm, or needed to take a nap. The study authors do not list the outcomes explicitly, but we can surmise from the definition used that they are fairly minor. On the other hand, a severe adverse event (to which Mr. Caldwell originally claimed the 5:1 ratio applied) means (see table 3c in the link above)</p>
<blockquote>
<p>Death, life-threatening event, hospitalization, incapacity to perform normal life functions, medically important event, or congenital anomaly/birth defect</p>
</blockquote>
<p>Note further that the counts of SAAs are much lower than the counts he mentions in the email. Mr. Caldwell has, I assume mistakenly, reported a less severe reaction as a more severe reaction. This can (and probably has) instilled fear in readers. I’m not a parent myself, but I wouldn’t want my dog experiencing an SAA, let alone my mother or future child.</p>
<h1 id="data-are-consistent-with-no-increase-of-risk-of-saas">Data Are Consistent with No Increase of Risk of SAAs.</h1>
<p>Table 3c in the link above examines SAAs. Sample sizes in each arm are comparable (1131 and 1129 in the intervention and placebo arms, respectively), so a direct comparison of the raw counts of SAAs is reasonable.</p>
<p>Five people in the intervention arm reported an SAA and 2 people in the placebo arm reported an SAA, a difference of 3 people. This difference is very small and is consistent with the assumption that the vaccine does not increase the risk of SAAs. A difference of 3 or more people is completely in line with sampling variability. For those comfortable programming in R, we can simulate this and determine the probability of seeing one of the two groups have 3 or more SAAs than the other.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># The risk under the null hypothesis, pooling both arms:</span><span class="w">
</span><span class="c1"># (5 + 2) / (1131 + 1129) = 0.003097345.</span><span class="w">
</span><span class="c1"># Simulate 1 million trials where there is no difference in </span><span class="w">
</span><span class="c1"># risk of SAA</span><span class="w">
</span><span class="c1"># Compute the probability one of the groups</span><span class="w">
</span><span class="c1"># has 3 more SAAs than the other</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="c1">#Set the random seed for reproducibility</span><span class="w">
</span><span class="n">risk</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.003097345</span><span class="w">
</span><span class="c1"># Generate 1 million similar scenarios where there is no real difference </span><span class="w">
</span><span class="c1"># in risk of an SAA</span><span class="w">
</span><span class="n">intervention</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rbinom</span><span class="p">(</span><span class="m">1e6</span><span class="p">,</span><span class="w"> </span><span class="m">1131</span><span class="p">,</span><span class="w"> </span><span class="n">risk</span><span class="p">)</span><span class="w">
</span><span class="n">placebo</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rbinom</span><span class="p">(</span><span class="m">1e6</span><span class="p">,</span><span class="w"> </span><span class="m">1129</span><span class="p">,</span><span class="w"> </span><span class="n">risk</span><span class="p">)</span><span class="w">
</span><span class="c1"># Here is the probability we find one of the two groups having a difference of 3 or more</span><span class="w">
</span><span class="n">mean</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">intervention</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">placebo</span><span class="p">)</span><span class="o">>=</span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">0.334566</span><span class="w">
</span></code></pre></div></div>
<p>So there is approximately a 33% chance that we would see a difference of 3 or more SAAs between the two groups when in truth no difference in risk exists between them.</p>
<p>The study authors report something called the <strong>relative risk</strong> (RR in the table). The relative risk is the risk of SAA in intervention arm divided by the risk of SAA in placebo arm. If the relative risk is greater than 1, then the risk of SAA is bigger in the intervention group than the placebo. The authors report a relative risk of 2.5 (meaning the risk of SAA in the intervention arm was estimated to be 2.5x greater than the placebo arm). However, that does not tell the whole story.</p>
<p>A 2.5x increase in risk is one thing, but the authors also report something called a <strong>confidence interval</strong>. A confidence interval gives values of the true relative risk which might have plausibly generated the data we have seen. Note that the associated confidence interval is 0.49 to 12.84, meaning the data are consistent with a true relative risk as small as 0.5 (again, meaning the risk of SAA would be <strong>smaller</strong> in the intervention arm than in the placebo arm).</p>
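<p>For the curious, the reported numbers can be reproduced with a standard Wald interval on the log relative risk. This is my own reconstruction in Python, not the study authors’ code, but the numbers line up with the table:</p>

```python
import math

# Counts from table 3c (as cited above): 5 SAAs of 1131 in the
# intervention arm, 2 SAAs of 1129 in the placebo arm.
a, n1 = 5, 1131
b, n2 = 2, 1129

rr = (a / n1) / (b / n2)                  # relative risk
se = math.sqrt(1/a - 1/n1 + 1/b - 1/n2)   # standard error of log(RR)
lo = math.exp(math.log(rr) - 1.96 * se)   # lower 95% limit
hi = math.exp(math.log(rr) + 1.96 * se)   # upper 95% limit
print(round(rr, 2), round(lo, 2), round(hi, 2))  # 2.5 0.49 12.84
```

<p>Note that with only 5 and 2 events, the $1/a$ and $1/b$ terms dominate the standard error. The enormous width of the interval is almost entirely a consequence of how rare SAAs were in both arms.</p>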
<p>A popular rebuttal would be “Demetri, the confidence interval is also greater than 1, meaning the data are consistent with relative risks as big as 13!”. Agreed, that is true, but if the data are consistent with reductions in risk <strong>and</strong> increases in risk, then we can’t draw a firm conclusion either way. The confidence interval is too wide (or, as statisticians would say, the estimate is too <strong>imprecise</strong>, a word used by the study authors and quoted by Mr. Caldwell). This does not mean we can conclude the vaccine does not increase the risk of SAA; it only means that the data are consistent with no change in the risk of SAA. To know with more certainty whether the risk of SAA differs between groups, we would need more data.</p>
<p>Which leads me to my final point…</p>
<h1 id="very-serious-concern-regarding-the-size-of-the-confidence-interval-is-about-inability-to-comment-on-change-in-risk-of-saa-not-about-how-the-study-was-conducted">Very Serious Concern Regarding The Size of the Confidence Interval is About Inability to Comment on Change in Risk of SAA, Not About How the Study Was Conducted</h1>
<p>As I mentioned before, the fact that the confidence interval spans numbers both less than and greater than 1 means we can’t comment on a change in risk of SAA. In his email, Mr. Caldwell appears to interpret the authors’ concern about “indirectness” as meaning they were missing SAAs, or that the concern was about the study design.</p>
<p>For the reasons I have commented on above, this is false. SAA is very well defined in this paper, and what I anticipate Mr. Caldwell is actually concerned about is the long-term effects of the vaccine. An understandable concern, but not one that could have been studied in sufficient depth while still allowing for an expedient release of a vaccine to combat COVID19.</p>
<h1 id="conclusion">Conclusion</h1>
<p>This letter in response is much longer than Mr. Caldwell’s initial email, a testament to <a href="https://en.wikipedia.org/wiki/Brandolini%27s_law">Brandolini’s Law</a>, better known as “The Bullshit Asymmetry Principle”:</p>
<blockquote>
<p>The amount of energy needed to refute bullshit is an order of magnitude larger than to produce it.</p>
</blockquote>
<p>And make no mistake: I believe Mr. Caldwell’s claims about these studies to be nothing short of bullshit. I have no personal vendetta against Mr. Caldwell, nor do I take issue with his personal choices or concerns about vaccinations, but I do take issue with such gross errors in the interpretation of statistics presented to lay audiences that trust his work on matters of health, broadly construed. I’ve contacted Mr. Caldwell and politely asked him to send another email to the same mailing list correcting some of his errors. He has not explicitly said “No”, but has yet to directly address the request at this time.</p>
<h1 id="email-from-mr-caldwell">Email From Mr. Caldwell</h1>
<div style="text-align:center"><img src="/images/blog/email.png" /></div>

Three Months Into Industry (2021-07-27)
https://dpananos.github.io/posts/2021/04/blog-post-31

<p>I “left” academia (I say “left” in quotation marks because I’m still working on my Ph.D part time) approximately 3 months ago. That is a bit of a milestone: a quarter of a year working at a senior-ish level as a data scientist at a national bank. I know a lot of PhDs, especially in quantitative disciplines, are thinking of making the jump from academia to industry. This sequence of blog posts is not advice on how to make that jump, but rather documentation of one perspective on what that change entails.</p>
<p>I’m not sure how best to document my experiences, but the most interesting comparisons to me are:</p>
<ul>
<li>
<p>Differences in Challenges: How does working on a PhD differ from working in industry with respect to the challenges you face? Are obstacles easier/harder to overcome? What are the differences in obstacles?</p>
</li>
<li>
<p>Differences in Work Satisfaction: How do I like working on a PhD versus working in industry? Is the work better/worse? More/less interesting?</p>
</li>
<li>
<p>Differences in Work Life Balance: This one is pretty self explanatory.</p>
</li>
</ul>
<p>Let’s begin.</p>
<h1 id="differences-in-challenges">Differences in Challenges</h1>
<p>One of the earliest challenges I had to face was to find stuff to do! I imagine this is not typical of most data science positions. You’re usually hired into a data science team, which is likely fostering a bunch of projects. Not me; I was hired into an app team composed almost entirely of developers in Java and kdb+ (or are they q developers? It doesn’t matter). The team is new; I joined less than a year after it was formed. Consequently, most of the work being done is development work. It was soon revealed to me that (I’m paraphrasing here) the team did not want a data scientist, they wanted another developer. But the bank works in mysterious bureaucratic ways, and so they were handed a data scientist to “inject AI into trade compliance”. However, because the team is so new and there is much development work to be done, machine learning and data science just aren’t on their radar right now. So my first challenge was to find something to do, or more precisely, to understand the processes of the business we support and find areas where I could add value with machine learning.</p>
<p>That’s tough. It requires a skill I did not properly hone in my time in academia: talking to people and empathizing. It would be easy for me to just create a model that gives analysts another number to look at (or ignore, as a manager has said to me), but the trick to being a good data scientist is creating solutions which address an actual need, not merely solving some sort of math problem. You can see where I (a person who has spent a lot of his life just solving math problems) could encounter some difficulty.</p>
<h1 id="differences-in-work-satisfaction">Differences in Work Satisfaction</h1>
<p>I said this to my manager fairly early on: “I don’t care much about financial markets”. I’m a boring investor. I put a set amount of money into an index fund every month automatically. I don’t care about Gamestop (save for the memes), I could give a shit about SPY and TSLA calls, and those are the interesting bits! I work in fixed income! If I find Tesla drama boring, imagine how I feel about government bonds!</p>
<p>Suffice it to say, my work at the hospital was orders of magnitude more interesting and fulfilling (though the pay is much better at the bank). I will note here that there is an interesting inverse relationship between how fulfilling a job is and how well it pays. More on that in <em>Bullshit Jobs</em>, an excellent book in which data scientists are called out by name. I digress. That being said, there are other ways I get work satisfaction.</p>
<p>Though I don’t (yet) have access to enormous compute I do have access to enormous data. Consequently, the problems become interesting when I can abstract away the boring financial details and focus on the underlying statistical or math problem. That is an enormous difference between grad school and industry. I finally have the data I need to (at least approximately) answer the questions I want to answer.</p>
<h1 id="differences-in-work-life-balance">Differences in Work Life Balance</h1>
<p>Unsurprisingly, work life balance is better in industry. In grad school, the guilt to work is a long, dull, pain. There is <em>always</em> work to be done, and if you aren’t doing it then (you tell yourself) you are lazy. In industry, the guilt to work is a sharp quick pain that comes in waves. There is always work to be done, but I know I’m going to come back to it tomorrow for the same amount of time so I don’t feel as bad about taking a 40 minute break, or working on something else.</p>
<p>That being said, there are times I’ve had to work after 5 pm. That’s fine with me, it’s pretty infrequent, but I have been witness to people <em>scheduling meetings at 8 pm</em> as if I didn’t have a life outside work. And even stranger, I’ve seen people decline those meetings <em>because they are already in meetings at 8pm</em>. What the fuck? What the actual fuck?</p>
<p>I’m not interested in imputing why people do this. I benefit from low amounts of responsibility (no mortgage, no kids, no family) and so perhaps I’m not as motivated by promotions, bonuses, or being fired as some people are. That will maybe change as I grow a bit older, but I’d sooner quit than schedule a meeting that late into the night. Famous last words.</p>
<p>I’ll revisit this story in another 3 months, which would be around November. In the meantime, if you have additional questions or would like to chat about my experience (or yours, for that matter; I’d much rather listen than talk), please reach out via twitter.</p>

Riddler Solutions (2021-05-11)
https://dpananos.github.io/posts/2021/04/blog-post-30

<p>I like <a href="https://fivethirtyeight.com/features/are-you-smarter-than-a-fourth-grader/">Riddler</a> from 538, mostly because you can solve the riddles with some fun math. If the riddle is interesting enough, <a href="https://dpananos.github.io/posts/2017/12/blog-post-2/">I will post solutions on my blog</a>. This is one such riddle.</p>
<h2 id="the-riddle">The Riddle</h2>
<blockquote>
<p>You and your infinitely many friends are sharing a cake, and you come up with two rather bizarre ways of splitting it.
For the first method, Friend 1 takes half of the cake, Friend 2 takes a third of what remains, Friend 3 takes a quarter of what remains after Friend 2, Friend 4 takes a fifth of what remains after Friend 3, and so on. After your infinitely many friends take their respective pieces, you get whatever is left.
For the second method, your friends decide to save you a little more of the cake. This time around, Friend 1 takes $1/2^2$ (or one-quarter) of the cake, Friend 2 takes $1/3^2$ (or one-ninth) of what remains, Friend 3 takes $1/4^2$ of what remains after Friend 2, and so on. Again, after your infinitely many friends take their respective pieces, you get whatever is left.</p>
<p>Question 1: How much of the cake do you get using the first method?</p>
<p>Question 2: How much of the cake do you get using the second method?</p>
</blockquote>
<h2 id="the-solution">The Solution</h2>
<p>It is very easy to argue that method 1 leaves us no cake. Let’s see why. First, let’s agree to model <em>how much cake is left</em> rather than how much the $k^{th}$ friend takes. The first friend takes half (leaving us half a cake). The second friend takes a third of what remains (meaning two thirds of one half is left). The third friend takes a fourth of what remains (meaning three quarters of two thirds of one half is left). See the pattern? Let $\pi_k$ be the proportion of pie (or, er, cake) left after friend $k$ takes their share. The sequence is</p>
\[\begin{align}
\pi_1 &= \dfrac{1}{2}\\
\pi_2 &= \dfrac{2}{3} \times \dfrac{1}{2} = \dfrac{1}{3}\\
\pi_3 &= \dfrac{3}{4} \times \dfrac{2}{3} \times \dfrac{1}{2} = \dfrac{1}{4}\\
& \vdots\\
\pi_k &= \dfrac{1}{k+1}
\end{align}\]
<p>Since $\lim_{k \to \infty} \pi_k = 0$ we get no cake.</p>
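<p>We can verify the pattern $\pi_k = 1/(k+1)$ numerically with a few lines of Python (a quick check, not a proof):</p>

```python
cake = 1.0
for k in range(1, 21):
    cake *= k / (k + 1)                      # friend k takes 1/(k+1) of what's left
    assert abs(cake - 1 / (k + 1)) < 1e-12   # matches the claimed pattern
print(cake)  # 1/21, and the cake keeps shrinking toward 0
```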
<p>Method 2 is more interesting. Remember, we are modelling how much is left. The sequence is</p>
\[\begin{align}
\pi_1 &= \dfrac{3}{4}\\
\pi_2 &= \dfrac{8}{9} \times \dfrac{3}{4} \\
\pi_3 &= \dfrac{15}{16} \times\dfrac{8}{9} \times \dfrac{3}{4}\\
& \vdots\\
\pi_k &= \prod_{m=1}^{k} \dfrac{(m+1)^2-1}{(m+1)^2}
\end{align}\]
<p>It is not obvious what $\pi_k$ approaches as $k \to \infty$. Let’s see what this quantity approaches empirically and see if we can get some intuition.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cake</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">for</span> <span class="n">friend</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">100090</span><span class="p">):</span>
<span class="n">cake</span> <span class="o">=</span> <span class="n">cake</span> <span class="o">*</span><span class="p">(</span><span class="n">friend</span><span class="o">**</span><span class="mi">2</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span><span class="o">/</span><span class="n">friend</span><span class="o">**</span><span class="mi">2</span>
<span class="k">print</span><span class="p">(</span><span class="n">cake</span><span class="p">)</span>
<span class="o">>>></span><span class="mf">0.5000049955540937</span>
</code></pre></div></div>
<p>Looks like it is approaching 1/2. Now that we know where we are going, let’s prove it. It is easier to work with the log of the product because that turns products into sums. The use of $(m+1)$ in the expression for $\pi_k$ is just to make the indices work. Let’s use $m$ directly because we’re mostly interested in the limit, not the indices.</p>
\[\log(\pi_k) = \sum_{m=2}^{m=k} \log(m^2-1) - \log(m^2) = \sum_{m=2}^{m=k} \log(m+1) + \log(m-1) - 2\log(m)\]
<p>Here, I’ve just factored the difference of squares and applied some log rules.</p>
<p>Writing out the first few terms of the summand</p>
\[\begin{align}
&\log(3) + \log(1) - 2\log(2)\\
&\log(4) + \log(2) - 2\log(3)\\
&\log(5) + \log(3) - 2\log(4)\\
&\log(6) + \log(4) - 2\log(5)\\
\end{align}\]
<p>and so on. Some nice cancellation occurs. The $\log(3)$ terms appearing in lines 1 and 3 are cancelled by the two factors of $-\log(3)$ in line 2. A similar argument applies to the $\log(4)$ terms in lines 2 and 4, which are cancelled by line 3. What is left uncancelled as $k \to \infty$ is $\log(1) - \log(2) = -\log(2)$, meaning the product approaches $1/2$. Hence, we get half a cake.</p>
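<p>In fact, writing out a few partial products suggests the closed form $\pi_k = (k+2)/(2(k+1))$, which makes the limit of $1/2$ transparent. A quick numerical check (again, a check and not a proof):</p>

```python
cake = 1.0
for k in range(1, 51):
    cake *= ((k + 1)**2 - 1) / (k + 1)**2            # friend k takes 1/(k+1)^2
    # partial products appear to telescope to (k + 2) / (2 * (k + 1))
    assert abs(cake - (k + 2) / (2 * (k + 1))) < 1e-12
print(cake)  # 52/102, already close to 1/2
```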
<h2 id="extra-credit">Extra Credit</h2>
<p>There is an extra credit portion to this question I can’t figure out. Suppose your friends take slices of cake that are reciprocals of squares of even numbers (the first takes $1/2^2$, the second takes $1/4^2$, the third $1/6^2$, and so on). The proportion of cake left is</p>
\[\begin{align}
\pi_1 &= \dfrac{3}{4}\\
\pi_2 &= \dfrac{15}{16}\times \dfrac{3}{4}\\
\pi_3 &= \dfrac{35}{36} \times \dfrac{15}{16}\times \dfrac{3}{4}\\
& \vdots\\
\pi_k &= \prod_{m=1}^{k} \dfrac{(2m)^2-1}{(2m)^2} = \dfrac{(1 \times 3)(3 \times 5)(5 \times 7) \cdots}{2^2 \times 4^2 \times 6^2 \times \cdots}
\end{align}\]
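<p>An empirical check, mirroring the one above, at least tells us where this sequence is headed:</p>

```python
cake = 1.0
for friend in range(1, 100000):
    cake *= ((2 * friend)**2 - 1) / (2 * friend)**2   # friend k takes 1/(2k)^2
print(cake)  # approximately 0.6366
```

<p>The value is suspiciously close to $2/\pi \approx 0.63662$, which smells like the Wallis product, though I haven’t chased the algebra.</p>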
<p>But I can’t seem to find a way to break this sequence’s back. Hints are welcome.</p>

Answering Easier Questions (2021-04-03)
https://dpananos.github.io/posts/2021/04/blog-post-29

<p>The 95% in a 95% confidence interval refers not to the probability that any one interval contains the estimand, but rather to the long term relative frequency of the estimator containing the estimand in an infinite sequence of replicated experiments under ideal conditions.</p>
<p>Now, if this were twitter I would get ratioed so hard I might have to take a break and walk it off. Luckily, this is my blog and not yours so I can say whatever I want with impunity. But, rather than shout my opinions and demand people listen, I thought it might be a good exercise to explain to you why I think this and perhaps why people might disagree. Let’s for a moment ignore the fact that the interpretation I use above is the <em>de jure</em> definition of a confidence interval and instead start where a good proportion of statistical learning starts; with a deck of shuffled cards.</p>
<p>I present to you a shuffled deck. It’s a regular deck of cards, no funny business with the cards or the shuffling. What is the probability that the top card of <em>this</em> deck is an ace? I’d wager a good portion of people would say 4/52. If you, dear reader, said 4/52, then I believe you have made a benign mistake, but a mistake all the same. And I suspect the reason you’ve made this mistake is that you’ve swapped a hard question (the question about <em>this</em> deck) for an easier question (a question about the long term relative frequency of coming to shuffled decks with no funny business and finding aces).</p>
<p>Swapping hard questions for easy questions is not a new observation. Daniel Kahneman writes about it in <em>Thinking Fast and Slow</em> and provides numerous examples. I’ll repeat some examples from the book here. We might swap the question:</p>
<ul>
<li>“How much would you contribute to save dolphins?” for “how much emotion do I feel when I think of dying dolphins?”</li>
<li>“How happy are you with your life?” for “What is my mood right now?”, and poignantly</li>
<li>“This woman is running for the primary. How far will she go in politics?” for “Does this woman look like a political winner?”</li>
</ul>
<p>The book <em>Thinking Fast and Slow</em> explains why we do this, or better yet why we have no control over this. I won’t explain it here. But it is important to know that this is something we do, mostly unconsciously.</p>
<p>So back to the deck of cards. Questions about the deck in front of you are hard. It’s either an ace or not, but you can’t tell! The card is face down and there is no other information you could use to make the decision. So, you answer an easier question using information that you do know: the number of aces in the deck, the number of cards in the deck, the fact that each card is equally likely to be on top given there is no funny business with the cards or the shuffling, and the basic rules of probability you might have learned in high school if not elsewhere. But the answer you give is for a fundamentally different question, namely “If I were to observe a long sequence of well shuffled decks with no funny business, what fraction of them have an ace on top?”. Your answer is about that long sequence of shuffled decks. It isn’t about any one particular deck, and certainly not the one in front of you.</p>
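<p>Notice that the frequency answer is demonstrable by simulation, which is exactly what the single deck in front of you is not. A minimal Python sketch:</p>

```python
import random

random.seed(0)
deck = ["ace"] * 4 + ["other"] * 48   # a standard deck: 4 aces, 48 other cards
trials = 100_000
hits = 0
for _ in range(trials):
    random.shuffle(deck)              # a fresh, fair shuffle each time
    hits += deck[0] == "ace"
print(hits / trials)                  # hovers around 4/52, about 0.077
```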
<p>I think the same thing happens with confidence intervals. The estimator has the property that 95% of the time it is constructed (under ideal circumstances), it will contain the estimand. But any one interval either does or does not contain the estimand. And unlike the deck of cards, which can easily be examined, we can’t ever know for certain if the interval successfully captured the estimand. There is no moment where we get to verify the estimand is in the confidence interval, and so we are sort of left guessing, which prompts us to offer a probability that we are right.</p>
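<p>And, like the deck, that long term frequency can be demonstrated. Here is a minimal simulation of the procedural property (the true mean of 10, the normal data, and the textbook z-interval are all arbitrary choices for illustration):</p>

```python
import math
import random
import statistics

random.seed(0)
mu, sigma, n, reps = 10, 2, 50, 10_000    # arbitrary "true" settings for the sketch
covered = 0
for _ in range(reps):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = statistics.mean(x)
    se = statistics.stdev(x) / math.sqrt(n)
    # the estimand mu is fixed; the interval is what varies from run to run
    if xbar - 1.96 * se <= mu <= xbar + 1.96 * se:
        covered += 1
print(covered / reps)  # long-run coverage near 0.95
```

<p>The 95% is a property of the interval-making procedure across replications, not of any single realized interval.</p>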
<p>The mistake is benign. It hurts no one to think about confidence intervals as having a 95% probability of containing the estimand. Your company will not lose money, your paper will (hopefully) not be rejected, and the world will not end. That being said, it is unfortunately incorrect if not by appealing to the definition, then perhaps for other reasons.</p>
<p>I’ll start with an appeal to authority. Sander Greenland and coauthors (who include notable epidemiologist Ken Rothman and motherfucking Doug Altman) list the interpretation of a confidence interval as having 95% probability of containing the true effect as misconception 19 in <a href="https://link.springer.com/content/pdf/10.1007/s10654-016-0149-3.pdf">this amazing paper</a>. They note “It is possible to compute an interval that can be interpreted as having 95% probability of containing the true value”, but go on to say that this requires a Bayesian analysis and yields a credible interval. If these guys are wrong, I don’t want to be right.</p>
<p>Additionally, when I say “the probability of a coin being flipped heads is 0.5”, that references a long term frequency. I could, in principle, demonstrate that frequency by flipping a coin a lot and computing the empirical frequency of heads, which, assuming the coin is fair and the number of flips large enough, will be within an acceptable range of 0.5. To those people who say “this interval contains the estimand with 95% probability”, I say “prove it”. Demonstrate to me, via simulation or otherwise, this long term relative frequency. I can’t imagine how this could be demonstrated, because any fixed dataset will yield the same answer over and over. Perhaps what supporters of this perspective mean is something closer to the Bayesian interpretation of probability (where probability is akin to strength of belief). If so, the debate is decidedly answered, because probability in frequentism is not about belief strength but about frequencies. Additionally, what is the random component in this probability? The data from the experiment are fixed; to allow these to vary is to appeal to my interpretation of the interval. If the estimand is random, then we are in another realm altogether, as frequentism assumes fixed parameters and random data. Maybe they mean something else which I just can’t think of. If there is something else, please let me know.</p>
<p>I’ve gotten flack about confidence intervals on twitter.</p>
<h2 id="flack-1-framing-it-as-a-bet">Flack 1: Framing It As A Bet</h2>
<p>You present to me a shuffled deck with no funny business and offer me a bet in which I win X0,000 dollars if the card is an ace and lose X0 dollars if it is not. “Aha Demetri! If you think the probability of the card on top being an ace is 0 or 1, you are either a fool for not taking the bet or a fool for being so overconfident! Your position is indefensible!” one person on twitter said to me (ok, they didn’t say it verbatim like this, but that was the intent).</p>
<p>Well, not so fast. Nothing about my interpretation precludes me from using the answer to a simpler question to make decisions (I would argue statistics is the practice of doing just that, but I digress). The top card is still an ace or not, but I can still think about an infinite sequence of shuffled decks anyway. In enough of those scenarios the card on top is an ace that, given the lopsided payoff, the bet is worth taking. Thus, I take the bet and hope the top card is an ace (much like I hope the confidence interval captures the true estimand, even though I know it either does or does not).</p>
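<p>The “infinite sequence of shuffled decks” is easy to make concrete. A quick sketch of mine (not part of the original twitter exchange):</p>

```python
import random

random.seed(0)
deck = ["ace"] * 4 + ["other"] * 48

trials = 100_000
hits = 0
for _ in range(trials):
    random.shuffle(deck)      # a fresh shuffled deck each time
    hits += deck[0] == "ace"  # is the top card an ace?

print(hits / trials)  # near 4/52, even though any single top card is an ace or it isn't
```
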
<h2 id="flack-2--my-next-interval-has-95-probability">Flack 2: My Next Interval Has 95% Probability</h2>
<p>“But Demetri, if 95% refers to the frequency of intervals containing the estimand, then surely my next interval has 95% probability of capturing the estimand prior to seeing data. Hence, individual intervals <em>do</em> have 95% probability of containing the estimand”.</p>
<p>I get this sometimes, but don’t fully understand how it is supposed to be convincing. I see no problem with saying “the next interval has 95% probability”, just like I have no problem with saying “If you shuffle those cards, the probability an ace is on top is 4/52” or “My next <a href="https://en.wikipedia.org/wiki/Tim_Hortons#Roll_Up_the_Rim_to_Win_campaign">Roll Up The Rim</a> cup has a 1 in 6 chance of winning”. This is starting to get more philosophical than I care for it to, but those statements all reference non-existent things. Once they are brought into existence, it would be silly to think that they retain these properties. My cup is either a winner or a loser, even if I don’t roll it.</p>
<h2 id="flack-3--but-schrödingers-cat">Flack 3: But Schrödinger’s Cat…</h2>
<p>No. Stop. This is not relevant in the least. I’m talking about cards and coins, not quarks or electrons. The Wikipedia article even says “Schrödinger did not wish to promote the idea of dead-and-live cats as a serious possibility; on the contrary, he intended the example to illustrate the absurdity of the existing view of quantum mechanics”. Cards can’t be and not-be aces until flipped. Get out of here.</p>
<h2 id="wrapping-up-dont--me">Wrapping Up, Don’t @ Me</h2>
<p>To be completely fair, I think the question about the cards I’ve presented to you is unfair. The question asks for a probability, and while 0 and 1 are valid probabilities, the question is phrased in a way that prompts you for a number between 0 and 1. Likewise, the name “95% confidence interval” begs for the wrong interpretation. That is the problem we face when we use language, which is naturally imprecise and full of shortcuts and ambiguity, to talk about things as precise as mathematics. It is a seminal case study in what I like to call the precision-usefulness trade off: precise statements are not useful. It is by interpreting them and communicating them in common language that they become useful, and that usefulness comes at the cost of precision (note, this explanation of the trade off is <em>itself</em> susceptible to the trade off). The important part is that we use confidence intervals to convey uncertainty in the estimate from which they are derived. It isn’t important what you or I think about it, as the confidence interval is merely a means to an end.</p>
<p>As I noted, the mistake is benign, and these arguments are more a mental exercise than a fight against a method which may induce harm. Were it not for COVID19, I would encourage us all to go out for a beer and have these conversations rather than do it over twitter. Anyway, I promise not to tweet about this anymore if you promise not to @ me about it anymore.</p>Demetri Pananosdpananos@gmail.comThe 95% in 95% confidence interval refers not to the probability that any one interval contains the estimand, but rather to the long term relative frequency of the estimator containing the estimand in an infinite sequence of replicated experiments under ideal conditions.Don’t Select Features, Engineer Them2020-12-07T00:00:00-08:002020-12-07T00:00:00-08:00https://dpananos.github.io/posts/2020/12/blog-post-28<p>Students in the class I TA love to do feature selection prior to modelling. Examining pairwise correlation and dropping seemingly uncorrelated features is one way they do this, but they also love to fit a LASSO model to their data and refit a model with the selected variables, or they might do stepwise selection if they are feeling in the mood to code it up in python.</p>
<p>Feature selection can be an important thing to do if you want to optimize the ratio of predictive performance to the cost of data acquisition. Why spend money getting customer shoe size if it doesn’t help you predict the propensity to buy? However, students don’t rationalize their feature selection approach this way. I suspect they do it because they feel it will help them achieve a lower loss/better predictive accuracy.</p>
<p>Let me start by saying that I think examining pairwise correlations is a bad way to select features. First, this approach is almost never validated correctly; second, it ignores confounding relationships; third, correlations only measure the strength of a <em>linear</em> relationship; and fourth, it is very easy to construct a dataset in which the correlations get the variable importance completely wrong. Let $x$ and $z$ be binary independent random variables and let $y$ be their XOR. Knowing $x$ and $z$ determines $y$, but the correlation of either predictor with $y$, as measured by Kendall’s $\tau$, is 0. A strict selection based on correlation would completely miss this relationship.</p>
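<p>The XOR example is easy to verify numerically. A quick sketch, assuming we measure association with Kendall’s $\tau$ from scipy:</p>

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n = 100_000
x = rng.integers(0, 2, size=n)
z = rng.integers(0, 2, size=n)
y = x ^ z  # XOR: y is fully determined by x and z together

tau_x, _ = kendalltau(x, y)
tau_z, _ = kendalltau(z, y)
print(tau_x, tau_z)  # both near 0: a correlation screen would drop both predictors
```
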
<p>In this blog post, I argue feature selection is at best not necessary when the goal is optimal out of sample predictive performance, and at worst increases test loss. If the goal is good predictions, then what students – and perhaps practitioners too – should be doing is engineering new features. The question then becomes “what features?”, and to that I answer <em>“splines!”</em>.</p>
<p>In what follows, I describe 3 experiments in which I examine 5 models. Those models are:</p>
<ul>
<li>Linear Regression. This serves as our baseline. We just throw everything in the model and get what we get.</li>
<li>Linear Regression + Feature Selection via Correlations. Any features with an absolute correlation of 0.2 or larger are included in the model.</li>
<li>Linear Regression + Feature Selection via Lasso. I use 10 fold cross validation to first fit a LASSO model, then take all the non-zero coefficients from that model and refit a linear regression.</li>
<li>LASSO fit via 10 fold cross validation to examine how shrinkage might compare to these models, and</li>
<li>A linear model where each feature is expanded into a restricted cubic spline with 3 degrees of freedom. That is one additional feature for every feature in the training data, doubling the number of training variables rather than reducing them.</li>
</ul>
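<p>For readers unfamiliar with restricted cubic splines, here is a hand-rolled sketch of the basis in Harrell’s parameterization. This is illustrative only, not the exact implementation used in the experiments: with $k$ knots you get the original column plus $k-2$ nonlinear terms, so 3 knots yields exactly one new feature per original feature.</p>

```python
import numpy as np

def rcs_basis(x, knots):
    """Restricted cubic spline basis in Harrell's parameterization:
    the original column plus (k - 2) nonlinear terms for k knots."""
    t = np.asarray(knots, dtype=float)
    k = len(t)
    d = (t[-1] - t[0]) ** 2          # normalization keeps terms on x's scale
    pos3 = lambda v: np.maximum(v, 0.0) ** 3
    cols = [x]
    for j in range(k - 2):
        term = (pos3(x - t[j])
                - pos3(x - t[k - 2]) * (t[-1] - t[j]) / (t[-1] - t[k - 2])
                + pos3(x - t[-1]) * (t[k - 2] - t[j]) / (t[-1] - t[k - 2])) / d
        cols.append(term)
    return np.column_stack(cols)

x = np.linspace(-2, 2, 200)
X = rcs_basis(x, knots=np.percentile(x, [10, 50, 90]))
print(X.shape)  # (200, 2): the original feature plus exactly one new column
```

<p>The construction forces the fit to be linear beyond the boundary knots, which is what keeps splines well behaved in the tails.</p>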
<p>The goal is to examine predictive performance of these models. I examine the performance of these models on three data sets:</p>
<ul>
<li>The diabetes dataset from sklearn</li>
<li>The Boston housing dataset from sklearn. I’ve removed binary indicators from this data because they screw with my spline implementation. No model sees the binary indicators, so it’s still fair.</li>
<li>A dataset of my own creation. I simulate 2,500 observations of 50 features from a standard Gaussian. The first 10 features have a non-linear effect that was created by expanding those features into a restricted cubic spline with 7 degrees of freedom. I use 7 and not 3 because I want to investigate model misspecification (which is always the case in practice) so as not to inflate the performance of the last model too much. The next 20 features have an effect of 0 (they are nuisance variables) and the remaining features have linear effects. I sample the coefficients for the data from a standard normal and add student t noise with 10 degrees of freedom (something a little fatter in the tails than a normal, for some potential outliers).</li>
</ul>
<h1 id="methods">Methods</h1>
<p>Each model is examined through 10 fold cross validation repeated 100 times (that’s a total of 1000 resamples and refits). I ensure each model sees the same 1000 random splits so results are comparable. I don’t have an explicit test set because the diabetes and Boston data are small and any single test set could yield noisy estimates of test loss. The repeats of 10 fold CV are meant to circumvent this. I choose mean squared error as my loss function because it heavily penalizes models which cannot make extreme predictions when necessary. I also monitor the variability in the MSE fold to fold (essentially computing the standard deviation of the 1000 MSE evaluations). Ostensibly, this should be a measure of how variable the model could be were we to get a different training/testing set from the same data generating process.</p>
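<p>For concreteness, the evaluation loop looks roughly like this sketch using scikit-learn, shown here for plain linear regression on the diabetes data (the actual experiment swaps in the other pipelines):</p>

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# Fixing random_state guarantees every model is scored on the same 1000 splits.
cv = RepeatedKFold(n_splits=10, n_repeats=100, random_state=0)
mse = -cross_val_score(LinearRegression(), X, y, cv=cv, scoring="neg_mean_squared_error")

print(mse.mean())  # expected cross validated MSE
print(mse.std())   # fold to fold variability
```
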
<p>I make all code available on <a href="https://github.com/Dpananos/FeatureSelectionSimulation">github</a> for you to extend. I would love to see more models applied to this data (perhaps selection from random forests, or feature selection through a technique described + random forest. Go nuts).</p>
<h1 id="quick-note--what-is-the-right-way-to-select-variables-if-youre-going-to-do-it">Quick Note: What Is The Right Way To Select Variables If You’re Going To Do It?</h1>
<p>The wrong way to select variables is to inspect correlations/perform lasso/do stepwise on the entire training set <em>and then</em> cross validate after choosing features. The selection procedure is part of the model. Had you gotten different training data, you might have selected different features! The right way to make selection part of the model is to make a transformer which does it for you and put it in a pipeline. Here is the LASSO selection transformer I wrote for the experiment:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">LassoSelector</span><span class="p">(</span><span class="n">BaseEstimator</span><span class="p">,</span> <span class="n">TransformerMixin</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">None</span>
<span class="k">def</span> <span class="nf">fit</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">mod_</span> <span class="o">=</span> <span class="n">LassoCV</span><span class="p">(</span><span class="n">cv</span><span class="o">=</span><span class="mi">10</span><span class="p">).</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">y</span><span class="p">)</span>
<span class="k">return</span> <span class="bp">self</span>
<span class="k">def</span> <span class="nf">transform</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">coef_</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">mod_</span><span class="p">.</span><span class="n">coef_</span>
<span class="k">return</span> <span class="n">X</span><span class="p">[:,</span> <span class="n">np</span><span class="p">.</span><span class="nb">abs</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">coef_</span><span class="p">)</span><span class="o">></span><span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>
<p>When we do cross validation now, the training data is sent to the <code class="language-plaintext highlighter-rouge">LassoCV</code> where a LASSO model is fit. Once the model is fit, we can determine which coefficients were selected by the model in the <code class="language-plaintext highlighter-rouge">transform</code> step. Now, when we pass the held out data in cross validation, the held out data’s variables are selected based only on the data the model was able to see during training. This properly accounts for the variability induced by the selection and keeps your models honest! A similar class can be written for selection with correlations:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">CorrelationSelector</span><span class="p">(</span><span class="n">BaseEstimator</span><span class="p">,</span> <span class="n">TransformerMixin</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">cutoff</span> <span class="o">=</span> <span class="mf">0.2</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">cutoff</span> <span class="o">=</span> <span class="n">cutoff</span>
<span class="k">return</span> <span class="bp">None</span>
<span class="k">def</span> <span class="nf">fit</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">correlations_</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">pearsonr</span><span class="p">(</span><span class="n">j</span><span class="p">,</span> <span class="n">y</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">X</span><span class="p">.</span><span class="n">T</span><span class="p">])</span>
<span class="k">return</span> <span class="bp">self</span>
<span class="k">def</span> <span class="nf">transform</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
<span class="n">selection</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">argwhere</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="nb">abs</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">correlations_</span><span class="p">)</span><span class="o">></span><span class="bp">self</span><span class="p">.</span><span class="n">cutoff</span><span class="p">).</span><span class="n">ravel</span><span class="p">()</span>
<span class="k">return</span> <span class="n">X</span><span class="p">[:,</span> <span class="n">selection</span><span class="p">]</span>
</code></pre></div></div>
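<p>As an aside, scikit-learn’s built-in <code class="language-plaintext highlighter-rouge">SelectFromModel</code> can play the same role as the custom transformers above. A hedged sketch (the tiny threshold is my choice, meant to mimic “keep all non-zero coefficients”):</p>

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_diabetes(return_X_y=True)

# Selection happens inside each training fold, never on the held out fold.
pipe = make_pipeline(
    SelectFromModel(LassoCV(cv=10), threshold=1e-12),  # keep ~non-zero coefficients
    LinearRegression(),
)
scores = -cross_val_score(pipe, X, y, cv=10, scoring="neg_mean_squared_error")
print(scores.mean())
```
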
<h1 id="results">Results</h1>
<p>Shown below are the results of the experiments. For each dataset, I plot the relative difference in expected cross validated MSE (relative to linear regression) against the relative difference in the fold to fold standard deviation (again, relative to linear regression). Linear regression is at the intersection of the gray lines for reference. Models in the bottom left hand corner have better MSE than linear regression and are less variable fold to fold.</p>
<p>In all three datasets, the splines model has superior MSE as compared to the other models. In fact, linear regression has superior MSE as compared to the selection models across all datasets (with the small exception of LASSO selection in the Boston dataset, though the difference is smaller than 1% and in my own opinion negligible).</p>
<p>Variability fold to fold seems to depend on the dataset. The spline model is less variable in the Boston and non-linear datasets, but more variable in the diabetes dataset.</p>
<div style="text-align:center"><img src="/images/blog/diabetes.png" /></div>
<div style="text-align:center"><img src="/images/blog/boston.png" /></div>
<div style="text-align:center"><img src="/images/blog/nonlinear.png" /></div>
<h1 id="discussion">Discussion</h1>
<p>From these experiments, we can conclude that adding additional features rather than selecting existing ones leads to smaller MSE (although the No Free Lunch Theorem prevents us from generalizing to other datasets). The results are consistent across two datasets derived from real world data generating processes and a third synthetic dataset of my own design.</p>
<p>What might explain these results? First, splines add additional features to the model, meaning we can estimate a much richer space of functions from the data. Considering we can estimate a broader space of functions, it is no surprise the splines come out on top, especially since the competing models are all high bias models only capable of modelling linear effects. I imagine were we to apply other non-linear modelling techniques (e.g. random forests or neural networks) that their MSE would also be lower than linear regression’s. However, splines offer a nice halfway point between the simplicity of linear regression and the black box of neural networks or random forests: we can add additional flexibility to the model, thereby decreasing loss, while remaining interpretable and easily maintainable.</p>
<p>Frank Harrell makes another good argument for splines, which I will summarize here. When considering using splines there are four distinct possibilities:</p>
<p>1) We don’t use splines and the effect is linear. In this case, a simple model will suffice and we benefit from lower variance fits.</p>
<p>2) We use splines and the effect is linear. In this case, we spend extra degrees of freedom unnecessarily, increasing the variance of our fits, but the effect of the variable is appropriately estimated.</p>
<p>3) We don’t use splines and the effect is non-linear. This has the potential to be catastrophic! Imagine that a variable has an effect on $y$ that looks like $y = x^2$. If $x$ is mean centered, then the linear fit could estimate the effect to be 0 or far too small.</p>
<p>4) We use splines and the effect is non-linear. This would be the best case scenario where we appropriately spend our degrees of freedom.</p>
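<p>Case 3 above is easy to demonstrate. A quick sketch of the $y = x^2$ example with a mean-centered predictor:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)  # mean centered predictor
y = x ** 2                   # purely non-linear effect, no noise at all

slope, intercept = np.polyfit(x, y, 1)
print(slope)  # near 0: the linear fit sees almost no effect despite y being a function of x
```
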
<p>Examining all four cases shows that using splines is nearly always a good idea. With enough data, the variability in the fit should shrink, nullifying the downside of point 2. The risk of point 3 likely explains the abysmal performance of correlation selection in the non-linear dataset. My intuition here is that some of the non-linear variables would have small correlations but strong non-linear effects (kind of like the $x^2$ example). Neglecting these variables is a huge mistake as they can reduce the MSE, explaining why correlation selection has an MSE nearly 3x that of OLS. I’ve not conducted a post mortem on these experiments, but I anticipate this to be the cause.</p>
<h1 id="conclusion">Conclusion</h1>
<p>I provide experimental motivation for the use of feature engineering (specifically through splines) over feature selection when optimizing for predictive accuracy. Five models were applied across 3 datasets in order to examine the relative improvement of selection procedures over linear regression. To measure performance, mean squared error was computed over 100 repeats of 10 fold cross validation. As a comparator, I included a spline model which adds features rather than removing them. The spline model outperformed all other models across all datasets, and linear regression had either superior or comparable performance to the selection procedures. From these experiments, coupled with knowledge of how additional variability can impact expected loss, I conclude that rather than select features for inclusion in the model, students should consider carefully balancing the additional variability of splines against the reduction in bias – and hence loss – they can impart.</p>
<p>The experiment has several limitations. In particular, the spline model I estimated was still quite biased. It would be important to simulate another dataset in which the effects are non-linear and the model we use has too many degrees of freedom. The variability in this approach may attenuate the improvement we see due to splines. Additionally, complexity is not explicitly accounted for and is only allowed to penalize models via poor predictive performance. A natural extension would be to measure AIC of all models since they are linear (except perhaps the lasso, where effective degrees of freedom may or may not be appropriate to use in AIC).</p>Demetri Pananosdpananos@gmail.comStudents in the class I TA love to do feature selection prior to modelling. Examining pairwise correlation and dropping seemingly uncorrelated features is one way they do this, but they also love to fit a LASSO model to their data and refit a model with the selected variables, or they might do stepwise selection if they are feeling in the mood to code it up in python.Intuitive Formulae Are Not Always Right2020-10-22T00:00:00-07:002020-10-22T00:00:00-07:00https://dpananos.github.io/posts/2020/10/blog-post-27<p>In the data science class I help TA, we’re going over confidence intervals. Thanks to the central limit theorem, we can report confidence intervals for the out of sample generalization error. Let’s assume our loss function is mean squared error. A confidence interval would then be</p>
\[\widehat{MSE} \pm z_{1-\alpha/2} \dfrac{\sigma_{MSE}}{\sqrt{n}} \>.\]
<p>Here, $\sigma_{MSE}$ is the standard deviation of the squared residuals (squared because our loss is squared error). Well, if you’re familiar with $R^2$ you might recognize that</p>
\[R^2 = 1- \dfrac{MSE}{SSE}\]
<p>and be tempted to take your confidence interval for MSE and just divide it by the SSE (which is $\sum_i (y_i - \bar{y})^2$) and call that a confidence interval for $R^2$.</p>
<p>This works…in a very narrow set of circumstances. For most problems, this results in coverage above or far below the nominal and also risks covering impossible $R^2$ values. Let’s make some very charitable assumptions and see just how well this works.</p>
<h2 id="charitable-assumpions">Charitable Assumptions</h2>
<p>Let’s analyze this interval for a single-variable regression. Let’s make the assumption that the predictor is normally distributed with mean $\mu$ and variance $\tau^2$.</p>
<p>Now, assume $y\vert x \sim \mathcal{N}(\alpha x + \beta, \sigma^2)$. The marginal distribution of $y$ is then $y \sim \mathcal{N}(\alpha \mu + \beta, \sigma^2 + \alpha^2\tau^2)$, meaning that the exact value for $R^2$ would be</p>
\[R^2 = 1 - \dfrac{\sigma^2}{\sigma^2 + \alpha^2\tau^2} = 1 - \dfrac{1}{1 + \left(\dfrac{\alpha \tau}{\sigma}\right)^2} \>.\]
<p>We can always standardize our predictor to have unit variance, so the quantity we really care about is $\alpha/\sigma$. I can rearrange the formula for $R^2$ to give me $\alpha = \alpha(\sigma)$ and since we really only care about the ratio of $\alpha$ and $\sigma$ I’ll assume $\sigma=1$ and modulate $\alpha$. Now, I can generate a regression problem which has true $R^2$ whatever I want it to be. Here is some code to do that.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LinearRegression</span>
<span class="kn">from</span> <span class="nn">scipy.special</span> <span class="kn">import</span> <span class="n">expit</span><span class="p">,</span> <span class="n">logit</span>
<span class="kn">from</span> <span class="nn">itertools</span> <span class="kn">import</span> <span class="n">product</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="k">def</span> <span class="nf">make_regression_data</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">alpha</span><span class="p">,</span> <span class="n">sigma</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="n">size</span> <span class="o">=</span> <span class="n">n</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">alpha</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">sigma</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="n">n</span><span class="p">)</span>
<span class="k">return</span> <span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">R2</span> <span class="o">=</span> <span class="mf">0.8</span>
<span class="n">sigma</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">alpha</span> <span class="o">=</span> <span class="n">sigma</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span> <span class="mi">1</span><span class="o">/</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">R2</span><span class="p">)</span> <span class="o">-</span><span class="mi">1</span> <span class="p">)</span>
<span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">make_regression_data</span><span class="p">(</span><span class="mi">1000</span><span class="p">,</span> <span class="n">alpha</span><span class="p">,</span> <span class="n">sigma</span><span class="p">)</span>
<span class="c1"># Scoring is r squared. Will print something close to 0.8. Any difference is sampling error.
</span><span class="n">LinearRegression</span><span class="p">().</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">y</span><span class="p">).</span><span class="n">score</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">y</span><span class="p">)</span>
</code></pre></div></div>
<p>Now all I have to do is:</p>
<ul>
<li>Specify an R squared and a sample size</li>
<li>Compute the “confidence interval”</li>
<li>Determine if the true $R^2$ is in the interval</li>
<li>Repeat</li>
</ul>
<p>The <em>coverage</em> of an interval estimator is the long term relative frequency with which it contains the true estimand. A 95% confidence interval should have a coverage of 0.95. A coverage lower than that means our interval is likely too narrow, overstating the precision of our estimate. A coverage higher than that means our interval gives credence to values of $R^2$ it should not, understating the precision of the estimate. It is important that our interval estimate have the appropriate coverage. Ok, moving on.</p>
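<p>The simulation loop is simple enough to sketch inline. This is my condensed version, not the exact gist code; note that I normalize both the MSE and the total sum of squares by $n$ so that their ratio matches the sample $R^2$:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

def coverage(true_r2, n, n_sims=2_000):
    """Long run frequency with which the naive interval
    1 - (MSE +/- half width)/SST captures the true R squared."""
    a = np.sqrt(1.0 / (1.0 - true_r2) - 1.0)   # alpha with sigma = 1
    hits = 0
    for _ in range(n_sims):
        x = rng.normal(size=n)
        y = a * x + rng.normal(size=n)
        b1, b0 = np.polyfit(x, y, 1)
        sq_resid = (y - (b1 * x + b0)) ** 2
        mse = sq_resid.mean()
        half = 1.96 * sq_resid.std(ddof=1) / np.sqrt(n)
        sst = ((y - y.mean()) ** 2).mean()     # same 1/n normalization as the MSE
        lo, hi = 1 - (mse + half) / sst, 1 - (mse - half) / sst
        hits += lo <= true_r2 <= hi
    return hits / n_sims

print(coverage(true_r2=0.8, n=100))
```
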
<h1 id="results">Results</h1>
<p>Code is included at the end of the blog post, but here is a plot of the coverage by R squared and sample size. As you can see, small values of $R^2$ have coverage over the nominal value, so we are always overstating what values of $R^2$ are consistent with the data. And for large values of $R^2$ we have coverage way, way below the nominal. If this interval were used as a means to test some hypothesis about $R^2$, then we would make a lot of false positives. A lot. I’ve coded the color map so that white represents the nominal coverage. As you can see, there is not a ton of white, meaning the interval estimate doesn’t work as well as one might be led to believe.</p>
<div style="text-align:center"><img src="/images/blog/coverage.png" /></div>
<h1 id="code">Code</h1>
<p>I’ve included the code in a gist <a href="https://gist.github.com/Dpananos/5f9c026d3b21ec53638f6ce067c20184">here</a>.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Don’t do this. And more importantly, demand some sort of proof when people claim that something is as easy as just doing some algebra. Math has a lot of intuitive results, but there are also a ton of unintuitive results. Stats in particular is really unintuitive, but luckily we can use computers to check our intuition. Do make use of them.</p>Demetri Pananosdpananos@gmail.comIn the data science class I help TA, we’re going over confidence intervals. Thanks to the central limit theorem, we can report confidence intervals for the out of sample generalization error. Let’s assume our loss function is mean squared error. A confidence interval would then be3 Rules For Giving a Sh!t2020-10-18T00:00:00-07:002020-10-18T00:00:00-07:00https://dpananos.github.io/posts/2020/10/blog-post-26<p>I spend a lot of time on cross validated (CV). CV is my statistical escape from statistics, and it is also a place where I like to prove to myself that I am good at what I do.</p>
<p>On CV, people pose questions for the community to answer. Members post solutions, and OPs can accept a solution if it answers their question. If a question has multiple solutions, then the community can upvote solutions based on whatever they think merits an upvote. Each action (answer accepting and upvotes) gives members points.</p>
<p>Ah, fake internet points. We know them too well. From reddit upvotes to likes on twitter, these digital tally counters are to many (including myself) a measure of how “right” you are – where right is measured by consensus. More points means more people agreed with you, and how could you be wrong when so many people agree with you? It’s a form of external validation for me; a way to prove that I am good at statistics because other people are marking my answers as good.</p>
<p>That becomes draining very quickly. Sometimes I catch myself wasting time explaining simple stuff because I know it is an easy 20-30 points (equivalent to 2 upvotes and an accepted answer, or 3 upvotes). I’m wasting my time on questions of little importance to get validation from people I don’t know (except I do know some of them because they follow me on Twitter. Hi, Tim). I need to reel it in a little while still engaging (because it is kind of fun sometimes, and some of the answers I give are genuinely interesting and have led to some blog posts). I’ve developed 3 rules I check to see if an answer is worth committing to ink, er, HTML.</p>
<h2 id="1-does-an-answer-exist-in-a-place-op-really-should-have-looked">1) Does an answer exist in a place OP really should have looked?</h2>
<p>“How do I interpret log odds?” – Next. “What sort of statistical test do I need?” – Next. “What do I do if my predictor isn’t normal?” – Next. I don’t need to waste my time answering something which exists in software documentation or in introductory stats books. If the question can be answered by reading, or by copying and pasting the appropriate link to canonical resources, I don’t waste my time. I might comment and say something like “A good place to look might be…”, but that is it.</p>
<h2 id="2-is-the-answer-complex-or-complicated">2) Is the answer complex or complicated?</h2>
<p>As in the Zen of Python, complex is preferable to complicated. Complex would mean that the answer is non-trivial, but the “juice is worth the squeeze” so to speak. If what we get out of it is an interesting insight, then I will commit my time. Complicated would mean that there is nothing interesting to come out of the answer, but getting the answer is tedious. If that sounds ambiguous, that is because I intended it to be, so that I could manipulate these rules at will. Remember, these rules serve me and not the other way around.</p>
<h2 id="3-do-i-give-a-shit">3) Do I give a shit?</h2>
<p>This last rule is actually a function of the other two. If the answer exists elsewhere, then I likely do not give a shit. If the answer is complicated, then I likely do not give a shit. If the answer is novel (or at least novel enough to me) but the answer is complicated, then I might give a shit. If the question has been answered before but I have the opportunity to give my own insight and opinion (i.e., the answer is complex), then I might give a shit. This rule really decides for 80% of questions if I take the time or not. If I give a shit about what you’re asking, I am much more likely to contribute even if one of the other two rules is violated.</p>
<p>Do I follow these all the time? No, but I do find myself using them more frequently. My score on CV has dipped a little, but hey that is the trade off. Now I find I’m not wasting my time explaining first year stats to some poor grad student who just wants a p value. That serves both me and them better in the end.</p>Demetri Pananosdpananos@gmail.comI spend a lot of time on cross validated (CV). CV is my statistical escape from statistics, and it is also a place where I like to prove to myself that I am good at what I do.Log Link vs. Log(y)2020-10-06T00:00:00-07:002020-10-06T00:00:00-07:00https://dpananos.github.io/posts/2020/10/blog-post-25<p>You wanna see a little gotcha in statistics? Take the following data</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">set.seed</span><span class="p">(</span><span class="m">0</span><span class="p">)</span><span class="w">
</span><span class="n">N</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rlnorm</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>and explain why <code class="language-plaintext highlighter-rouge">glm(y ~ 1, family = gaussian(link=log))</code> and <code class="language-plaintext highlighter-rouge">lm(log(y)~1)</code> produce different estimates of the coefficients. In case you don’t have an R terminal, here are the outputs</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">
</span><span class="n">log_lm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="nf">log</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">log_lm</span><span class="p">)</span><span class="w">
</span><span class="c1">#> </span><span class="w">
</span><span class="c1">#> Call:</span><span class="w">
</span><span class="c1">#> lm(formula = log(y) ~ 1)</span><span class="w">
</span><span class="c1">#> </span><span class="w">
</span><span class="c1">#> Residuals:</span><span class="w">
</span><span class="c1">#> Min 1Q Median 3Q Max </span><span class="w">
</span><span class="c1">#> -1.12328 -0.29604 -0.02781 0.30134 1.20935 </span><span class="w">
</span><span class="c1">#> </span><span class="w">
</span><span class="c1">#> Coefficients:</span><span class="w">
</span><span class="c1">#> Estimate Std. Error t value Pr(>|t|) </span><span class="w">
</span><span class="c1">#> (Intercept) 0.51133 0.04413 11.59 <2e-16 ***</span><span class="w">
</span><span class="c1">#> ---</span><span class="w">
</span><span class="c1">#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</span><span class="w">
</span><span class="c1">#> </span><span class="w">
</span><span class="c1">#> Residual standard error: 0.4413 on 99 degrees of freedom</span><span class="w">
</span><span class="n">glm_mod</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">glm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gaussian</span><span class="p">(</span><span class="n">link</span><span class="o">=</span><span class="n">log</span><span class="p">))</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">glm_mod</span><span class="p">)</span><span class="w">
</span><span class="c1">#> </span><span class="w">
</span><span class="c1">#> Call:</span><span class="w">
</span><span class="c1">#> glm(formula = y ~ 1, family = gaussian(link = log))</span><span class="w">
</span><span class="c1">#> </span><span class="w">
</span><span class="c1">#> Deviance Residuals: </span><span class="w">
</span><span class="c1">#> Min 1Q Median 3Q Max </span><span class="w">
</span><span class="c1">#> -1.2997 -0.6014 -0.2201 0.4120 3.7464 </span><span class="w">
</span><span class="c1">#> </span><span class="w">
</span><span class="c1">#> Coefficients:</span><span class="w">
</span><span class="c1">#> Estimate Std. Error t value Pr(>|t|) </span><span class="w">
</span><span class="c1">#> (Intercept) 0.61084 0.04851 12.59 <2e-16 ***</span><span class="w">
</span><span class="c1">#> ---</span><span class="w">
</span><span class="c1">#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</span><span class="w">
</span><span class="c1">#> </span><span class="w">
</span><span class="c1">#> (Dispersion parameter for gaussian family taken to be 0.7983696)</span><span class="w">
</span><span class="c1">#> </span><span class="w">
</span><span class="c1">#> Null deviance: 79.039 on 99 degrees of freedom</span><span class="w">
</span><span class="c1">#> Residual deviance: 79.039 on 99 degrees of freedom</span><span class="w">
</span><span class="c1">#> AIC: 264.26</span><span class="w">
</span><span class="c1">#> </span><span class="w">
</span><span class="c1">#> Number of Fisher Scoring iterations: 5</span><span class="w">
</span></code></pre></div></div>
<p>The answer comes down to the difference between $E(g(X))$ and $g(E(X))$, which are not in general equal. Let me explain.</p>
<p>First, let’s start with the lognormal random variable. $y \sim \operatorname{Lognormal}(\mu, \sigma)$ means $\log(y) \sim \operatorname{Normal}(\mu, \sigma)$. So $\mu, \sigma$ are the parameters of the underlying normal distribution. When we do <code class="language-plaintext highlighter-rouge">lm(log(y) ~ 1)</code>, we are modelling $E(\log(y)) = \beta_0$. So $\beta_0$ is an estimate of $\mu$ and $\exp(\mu)$ is an estimate of the median of the lognormal. That is an easy check</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">median</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="o">>>></span><span class="m">1.644421</span><span class="w">
</span><span class="nf">exp</span><span class="p">(</span><span class="n">coef</span><span class="p">(</span><span class="n">log_lm</span><span class="p">))</span><span class="w">
</span><span class="o">>>></span><span class="m">1.626632</span><span class="w">
</span><span class="c1">#Meh, close enough</span><span class="w">
</span></code></pre></div></div>
<p>If I wanted an estimate of the mean of the lognormal, I would need to add $\sigma^2/2$ to my estimate of $\mu$.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mean</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="o">>>></span><span class="m">1.833418</span><span class="w">
</span><span class="nf">exp</span><span class="p">(</span><span class="n">coef</span><span class="p">(</span><span class="n">log_lm</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">sigma</span><span class="p">(</span><span class="n">log_lm</span><span class="p">)</span><span class="o">^</span><span class="m">2</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="o">>>></span><span class="w"> </span><span class="m">1.836362</span><span class="w">
</span><span class="c1">#Meh, close enough</span><span class="w">
</span></code></pre></div></div>
<p>Ok, onto the glm now. When we use the glm, we model $\log(E(y)) = \beta_0$, so we model the mean of the lognormal directly. Case in point</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">exp</span><span class="p">(</span><span class="n">coef</span><span class="p">(</span><span class="n">glm_mod</span><span class="p">))</span><span class="w">
</span><span class="o">>>></span><span class="w"> </span><span class="m">1.833418</span><span class="w">
</span></code></pre></div></div>
<p>and if I wanted the median, I would need to consider the extra factor of $\exp(\sigma^2/2)$</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">exp</span><span class="p">(</span><span class="n">coef</span><span class="p">(</span><span class="n">glm_mod</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">sigma</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="o">>>></span><span class="m">1.612436</span><span class="w">
</span></code></pre></div></div>
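<p>The same bookkeeping is easy to verify by simulation. Here is a quick sketch in Python rather than R (numpy assumed available; the parameters mirror the example above, though the sample size is cranked up so the estimates settle down):</p>

```python
import numpy as np

# Same setup as above: y ~ Lognormal(mu = 0.5, sigma = 0.5)
rng = np.random.default_rng(0)
mu, sigma = 0.5, 0.5
y = rng.lognormal(mean=mu, sigma=sigma, size=100_000)

# lm(log(y) ~ 1) targets E[log y] = mu, so exp(.) estimates the MEDIAN
log_scale_mean = np.log(y).mean()
median_est = np.exp(log_scale_mean)

# A log-link glm targets log E[y], so exp(.) estimates the MEAN,
# which for a lognormal is exp(mu + sigma^2 / 2)
mean_est = np.exp(log_scale_mean + np.log(y).var() / 2)

print(median_est, np.median(y))  # both near exp(0.5)
print(mean_est, np.mean(y))      # both near exp(0.5 + 0.125)
```

With 100,000 draws both estimates sit right on top of their targets, and the gap between them is exactly the $\exp(\sigma^2/2)$ factor discussed above.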
<p>Log link vs. log outcome can be tricky. Just be sure to know what you’re modelling when you use either.</p>Demetri Pananosdpananos@gmail.comYou wanna see a little gotcha in statistics? Take the following dataWhat Do Monty Hall and A Steak Have In Common? I Marinate Both2020-10-01T00:00:00-07:002020-10-01T00:00:00-07:00https://dpananos.github.io/posts/2020/10/blog-post-24<p>It took me half a PhD to finally understand the Monty Hall Problem, but I think I get it now. I’ve been tutoring a very bright student in probability (which is so ironic given I failed it way back when) and I’ve surprised myself with how effective I’ve been at solving little homework problems I would have previously not been able to solve. It isn’t like my PhD has been full of these sorts of problems and by virtue of exposure to them I’ve gotten better at them. I’ve just been thinking about stats longer than I had in undergrad, which sort of leads to this unconscious thinking about homework problems. More on that at the end.</p>
<h1 id="set-up">Set Up</h1>
<p>Ok, quickly though because I’m sure you have heard of the problem before:</p>
<ul>
<li>
<p>Three doors: 1 car, 2 goats.</p>
</li>
<li>
<p>You pick a door</p>
</li>
<li>
<p>Monty opens a different door to show you a goat</p>
</li>
<li>
<p>Monty asks if you want to switch doors</p>
</li>
<li>
<p>What should you do?</p>
</li>
</ul>
<p>I’ll tell you what you shouldn’t do: don’t label the doors before you choose them. If you do, that leads to some funky conditioning and argumentation which I think is what initially confused me. If you label them before choosing them, then you have to say stuff like “what is the probability Monty opens door B given I chose door A <em>and</em> A contains the goat?”.</p>
<p>Instead, label them after choosing. Let door $A$ be whatever door you choose, let $B$ be the door Monty opens to reveal the goat, and let $C$ be the remaining closed door. This set up maintains all the generality for the solution and will be easier to think about.</p>
<p>The question is now “how does the probability change if I were to select door C over door A after seeing door B”? In particular, let’s talk about relative changes rather than absolute because it will lead to some simplification. Let $A_w$ be the event that $A$ is the winning door, let $B_o$ be the event that $B$ is opened, and let $C_w$ be the event that $C$ is the winning door. We want to find $P(A_w \vert B_o)$ and $P(C_w \vert B_o)$.</p>
<p>But because we are interested in relative changes, we don’t need to find these probabilities completely. One thing you will see Bayesians write down a lot is</p>
\[p(\theta \vert y ) \propto p(y \vert \theta) p(\theta)\]
<p>This ignores the normalizing constant of $p(y)$. So I can just calculate</p>
\[p(A_w \vert B_o ) \propto p(B_o \vert A_w) p(A_w)\]
<p>and</p>
\[p(C_w \vert B_o ) \propto p(B_o \vert C_w) p(C_w)\]
<p>and then take their ratio.</p>
<h2 id="the-math">The Math</h2>
<p>A priori, all three doors could possibly have the car. This means $P(A_w) = P(C_w) = \frac{1}{3}$.</p>
<p>If $A$ contained the car, that means $B$ and $C$ would have goats. Monty can’t open your door and he can’t open the door with the car. So that means he can open one of two doors. So $P(B_o \vert A_w) = \frac{1}{2}$.</p>
<p>If $C$ contained the car, that means $A$ and $B$ would have goats. Monty can’t open your door and he can’t open the door with the car. So that means he can open one door (namely, $B$). So $P(B_o \vert C_w) = 1$.</p>
<p>This means</p>
\[p(A_w \vert B_o ) \propto \dfrac{1}{2} \times \dfrac{1}{3}\]
\[p(C_w \vert B_o ) \propto 1 \times \dfrac{1}{3}\]
\[p(C_w \vert B_o ) = 2 p (A_w \vert B_o )\]
<p>and now it becomes very clear that you double your chances of winning by switching doors. The actual probabilities are $1/3$ and $2/3$, but that doesn’t matter since we’re interested in relative probabilities.</p>
<p>There. The hardest undergrad probability problem I ever encountered, done in like 5 lines.</p>
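<p>And if you’d rather trust brute force than Bayes, a quick simulation tells the same story. Here is a sketch in Python (labels assigned after the choice, just like the setup above):</p>

```python
import random

def play(switch, rng):
    """One round of Monty Hall; returns True if the final pick wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # Monty opens a door that is neither your pick nor the car
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # switch to the one remaining closed door
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(0)
n = 100_000
stay = sum(play(False, rng) for _ in range(n)) / n
swap = sum(play(True, rng) for _ in range(n)) / n
print(stay, swap)  # roughly 1/3 and 2/3: switching doubles your chances
```

The ratio of the two win rates hovers around 2, exactly the $p(C_w \vert B_o) = 2\,p(A_w \vert B_o)$ we derived.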
<h2 id="what-worked-this-time">What Worked This Time?</h2>
<p>I think what made the problem click so late in my life is not labelling the doors before choosing. When I did that, I would find myself getting lost in details and I’d ask “what is the probability door A has the car, given he opens door B? But I guess he doesn’t always open door B, sometimes he opens C? Wait, so then what if B has the car? UGH”.</p>
<p>The other thing that helped was not caring about $P(B_o)$. We could calculate it if we wanted; it involves the quantities we already computed, in addition to $P(B_o \vert B_w)$ (which is 0 because Monty can never open the door with the car). Not caring about that left some mental space to think about what is going on.</p>
<h2 id="lessons-learned">Lessons Learned</h2>
<p>This pattern of thinking, where I solve problems without actually thinking about them, has been happening a lot lately. I go on walks (yay pandemic activities) and I come back with solutions to coding problems, or I take a shower and suddenly realize the solution to something that has been bugging me. I’d call it “passive problem solving” but I don’t think that is what it is. I think it is just giving myself time and room to think about problems in different ways, even if I’m not aware I’m thinking about them. When I tell people I’m going to do this, I tell them I’m “going to let the problem marinate”.</p>
<p>And now that is what I am labelling the process. Problem solving through marination.</p>Demetri Pananosdpananos@gmail.comIt took me half a PhD to finally understand the Monty Hall Problem, but I think I get it now. I’ve been tutoring a very bright student in probability (which is so ironic given I failed it way back when) and I’ve surprised myself with how effective I’ve been at solving little homework problems I would have previously not been able to solve. It isn’t like my PhD has been full of these sorts of problems and by virtue to their exposure I’ve gotten better at them. I’ve just been thinking about stats longer than I was thinking about stats in undergrad which sort of leads to this unconcious thinking about homework problems. More on that in the end.Nothing is Normal So Don’t Worry About The T Test2020-02-11T00:00:00-08:002020-02-11T00:00:00-08:00https://dpananos.github.io/posts/2019/08/blog-post-23<p>I hate the objection “I can’t use the t-test, my data aren’t normal”. I see it all the time on Cross Validated when a data analyst is tasked with analyzing an experiment after it has been performed. They have piles of data, thousands of observations, and they have no idea what to do with it. They know of the t-test, but they erroneously believe (through no fault of their own) that the t-test is only valid if their data are normal.</p>
<p>Can I just say that nothing is normally distributed? Distributions themselves are convenient fictions; they do not exist. Coin flips are not binomial, heights are not normal, and failure times are not exponential. We <em>idealize</em> our data as coming from those distributions, but that does not mean that our data <em>are de facto</em> generated from that distribution. So to that end, objections like “well my data aren’t normally distributed” are already moot, but I’m going to put an end to this because it has been bugging me for a bit. I’m not going to have any math here, I don’t have anything new to add. I’m just going to be brash in hopes the next time you run an AB test you’ll think “Oh yea, that bearded guy got really upset when I said my data aren’t normal”. GAH!</p>
<p><strong>Here is the important part</strong>: <em>When you have lots of data, you can make use of the Central Limit Theorem (CLT) to justify your use of the t-test. Roughly, the CLT says that with enough data the sampling distribution of the sample mean is normal with mean equal to the population mean and standard deviation equal to the standard error. So the sample means can be idealized as two draws from two normal distributions, and the difference between two normals is also normal.</em></p>
<p>Boom, there is your need for normality satisfied. Binomial data? Yep, you can still use the t-test (that’s right, I said it). Highly skewed data? T-test that sucker. Categorical data even? Go simulate it. You would be surprised. The only reason I am being so brash is because in the case of AB tests, we have oodles and oodles of data. If we can’t claim we are in an asymptotic regime with that much data, then when can we? I would not be so brash if we were working with smaller sample sizes (in which case we are required to be a little more careful), but for Pete’s sake just simulate some of these problems and look at the false positive rate/power. The t-test is way more robust than you have been led to believe.</p>
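<p>Don’t take my word for it; here is what that simulation might look like, sketched in Python (scipy assumed available). Draw two arms from the <em>same</em> heavily skewed distribution, so the null is true by construction, t-test them over and over, and watch the false positive rate land near the nominal 5%:</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_per_arm, n_sims, alpha = 2_000, 2_000, 0.05

rejections = 0
for _ in range(n_sims):
    # Both arms from the SAME skewed (lognormal) distribution,
    # so any "significant" difference is a false positive
    a = rng.lognormal(0.5, 0.5, size=n_per_arm)
    b = rng.lognormal(0.5, 0.5, size=n_per_arm)
    _, p = stats.ttest_ind(a, b, equal_var=False)
    rejections += p < alpha

print(rejections / n_sims)  # hovers around 0.05, as the CLT promises
```

Swap in binomial draws or something even uglier and the story is the same at these sample sizes; with small samples, again, you need to be more careful.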
<p>Now go forth and use the t-test. I never want to hear of this objection again. And if you need something concrete to reference, check out <a href="https://stats.stackexchange.com/questions/434907/how-can-i-check-if-nominal-and-ordinal-data-is-normally-distributed-for-z-test/434921#434921">this</a> answer I gave to a question regarding the t-test and non-normal data.</p>Demetri Pananosdpananos@gmail.comI hate the objection “I can’t use the t-test, my data aren’t normal”. I see it all the time on Cross Validated when a data analyst is tasked with analyzing an experiment after it has been performed. They have piles of data, thousands of observations, and they have no idea what to do with it. They know of the t-test, but they erroneously believe (through no fault of their own) that the t-test is only valid if their data are normal.