Jekyll2022-01-02T18:43:16-08:00https://dpananos.github.io/feed.xmlPh.DemetriApplied Mathematician. Statistician. Data Scientist.Demetri Pananosdpananos@gmail.comNew Year, New Me (Who This?)2022-01-02T00:00:00-08:002022-01-02T00:00:00-08:00https://dpananos.github.io/posts/2022/01/blog-post-35<p>I’m not usually the one to make resolutions, but this year is a little different for a few reasons. I’m working full time with no intent to chicken out and go back to academia (despite my constant threats to no one in particular to open a café), which means I have money for the first time in a long time as well as some unstructured free time. Because I turn 30 this year, I thought it would be prudent to make some resolutions so that I do not coast through the next decade of my life (a complaint I continually made in my 20s).</p>
<p>It’s one thing to say “I resolve to $x$”; it is another thing to explain how one will achieve it. I welcome anyone’s input because, shit, I don’t know what I’m doing.</p>
<ol>
<li>
<p>Finish my Ph.D. By the time I am 30 (early September), I want to at the very least be making edits to the thesis after defending. My approach to achieving this is simple: be selfish. There is no requirement that I publish any parts of my thesis. Considering I do not plan to be an academic, wasting time and energy on preparing something for submission would only benefit my supervisor (which is fine, but won’t ultimately help me finish). My approach is to be vocal and polite, but ultimately selfish when it comes to the work. A good thesis is a finished thesis, so the sooner I can finish, the sooner it can be considered good. Besides, I can always leave some easy pubs for some future grad student.</p>
</li>
<li>
<p>Be more self-compassionate. My therapist says I lack self-compassion. One of the ways people can value themselves and be compassionate towards themselves – at least according to my therapist – is to treat themselves to things they enjoy. So I’m adopting the line from <strong>Twin Peaks</strong> as a mantra of sorts: <strong>Every day, once a day, give yourself a little present. Don’t plan it, don’t wait for it, just let it happen.</strong> A few pumps of flavored syrup in my coffee, Uber Eats because it’s raining, an extra trip to the barber because, hey, I like my cut fresh; if it makes me feel good, I will do it for myself, no matter how small or big. One of the ways I am going to be more compassionate is to throw myself an absolutely bougie 30th birthday.</p>
</li>
<li>
<p>Somewhat vainly, get shredded for my 30th. I don’t need to be huge (I tried powerlifting and liked it, but gyms aren’t the same right now and probably won’t be for a long while), just leaner. I want to fight against the bias in my mind that my 30s are somehow a slow decline into … being old. One of the ways I am going to do that is by eating healthier. I know, a trite goal if ever there was one. But let me explain. I usually have only a cup of coffee for breakfast, and then around 8 pm hunger strikes and I binge eat anything I have in stock, which is usually nothing. I find grocery shopping really tiring, and cooking even more tiring. In order to ease some of that pain, I’m going to have a standard list of stuff I’m going to order online on repeat, weekly or biweekly. Every week, the same low-effort but healthy meals, in moderate portions. I think removing some of the obstacles to eating and cooking (mainly the deciding) will go a long way to helping me keep healthy eating habits.</p>
</li>
</ol>Demetri Pananosdpananos@gmail.comI’m not usually the one to make resolutions, but this year is a little different for a few reasons. I’m working full time with no intent to chicken out and go back to academia (despite my constant threats to no one in particular to open a café), which means I have money for the first time in a long time as well as some unstructured free time. Because I turn 30 this year, I thought it would be prudent to make some resolutions so that I do not coast through the next decade of my life (a complaint I continually made in my 20s).Hacking Sklearn To Do The Optimism Corrected Bootstrap2021-11-23T00:00:00-08:002021-11-23T00:00:00-08:00https://dpananos.github.io/posts/2021/11/blog-post-34<p>It’s late, I can’t sleep, so I’m writing a blog post about the optimism corrected bootstrap.</p>
<p>In case you don’t know, epidemiology/biostatistics people working on prediction models like to validate their models in a slightly different way than your run-of-the-mill data scientist. Now, it should be unsurprising that <a href="https://twitter.com/GaelVaroquaux/status/1293818409197731840">this has generated some discussion</a> between ML people and epi/biostats people, but I’m going to ignore this for now. I’m going to assume you have good reason for wanting to do the optimism corrected bootstrap in python, especially with sklearn, and if you don’t and want to discuss the pros and cons of the method instead then lalalalalala I can’t hear you.</p>
<h2 id="the-optimism-corrected-bootstrap-in-7-steps">The Optimism Corrected Bootstrap in 7 Steps</h2>
<p>As a primer, you might want to read Alex Hayes’ <a href="https://www.alexpghayes.com/blog/predictive-performance-via-bootstrap-variants/">pretty good blog post about variants of the bootstrap</a> for predictive performance. It is more mathy than I care to be right now, and it is in R, should that be your thing.</p>
<p>To do the optimism corrected bootstrap, follow these 7 steps as found in <a href="https://link.springer.com/book/10.1007/978-0-387-77244-8">Ewout W. Steyerberg’s <em>Clinical Prediction Models</em></a>.</p>
<ol>
<li>
<p>Construct a model in the original sample; determine the apparent performance on the data from the sample used to construct the model.</p>
</li>
<li>
<p>Draw a bootstrap sample (Sample*) with replacement from the original sample.</p>
</li>
<li>
<p>Construct a model (Model*) in Sample*, replaying every step that was done in the original sample, especially model specification steps such as selection of predictors. Determine the bootstrap performance as the apparent performance of Model* in Sample*.</p>
</li>
<li>
<p>Apply Model* to the original sample without any modification to determine the test performance.</p>
</li>
<li>
<p>Calculate the optimism as the difference between bootstrap performance and test performance.</p>
</li>
<li>
<p>Repeat steps 2–5 many times, at least 200, to obtain a stable mean estimate of the optimism.</p>
</li>
<li>
<p>Subtract the mean optimism estimate (step 6) from the apparent performance (step 1) to obtain the optimism-corrected performance estimate.</p>
</li>
</ol>
<p>This procedure is very straightforward, and could easily be coded up from scratch, but I want to use as much existing code as I can and put sklearn on my resume, so let’s talk about what tools exist in sklearn to do cross validation and how we could use them to perform these steps.</p>
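<p>For reference, here is a minimal from-scratch sketch of the steps above, using sklearn’s diabetes data as a stand-in. Note that because MSE is an error (lower is better), the optimism is added to the apparent performance rather than subtracted:</p>

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X, y = load_diabetes(return_X_y=True)

# Step 1: apparent performance of the model fit to the original sample
model = LinearRegression().fit(X, y)
apparent = mean_squared_error(y, model.predict(X))

rng = np.random.default_rng(0)
optimisms = []
for _ in range(200):
    # Step 2: draw a bootstrap sample with replacement
    ix = rng.integers(0, len(X), size=len(X))
    # Step 3: refit, and record the apparent performance in the bootstrap sample
    boot_model = LinearRegression().fit(X[ix], y[ix])
    boot_perf = mean_squared_error(y[ix], boot_model.predict(X[ix]))
    # Step 4: test the bootstrap model on the original sample
    test_perf = mean_squared_error(y, boot_model.predict(X))
    # Step 5: the optimism for this replicate
    optimisms.append(test_perf - boot_perf)

# Steps 6 and 7: average the optimisms and correct the apparent performance.
# MSE is an error, so the correction inflates the apparent (too rosy) estimate.
corrected = apparent + np.mean(optimisms)
print(f"Apparent MSE: {apparent:.1f}, corrected MSE: {corrected:.1f}")
```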
<h2 id="cross-validation-in-sklearn">Cross Validation in Sklearn</h2>
<p>When you pass arguments like <code class="language-plaintext highlighter-rouge">cv=5</code> in sklearn’s many functions, what you’re really doing is passing <code class="language-plaintext highlighter-rouge">5</code> to <code class="language-plaintext highlighter-rouge">sklearn.model_selection.KFold</code>. See <a href="https://github.com/scikit-learn/scikit-learn/blob/0d378913b/sklearn/model_selection/_validation.py#L48"><code class="language-plaintext highlighter-rouge">sklearn.model_selection.cross_validate</code></a>, which calls a function called <a href="https://github.com/scikit-learn/scikit-learn/blob/0d378913be6d7e485b792ea36e9268be31ed52d0/sklearn/model_selection/_split.py#L2262"><code class="language-plaintext highlighter-rouge">check_cv</code></a> to verify this. <code class="language-plaintext highlighter-rouge">KFold.split</code> returns a generator which, when passed to <code class="language-plaintext highlighter-rouge">next</code>, yields a pair of train and test indices. The inner workings of <code class="language-plaintext highlighter-rouge">KFold</code> might look something like</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">number_folds</span><span class="p">):</span>
    <span class="n">train_ix</span> <span class="o">=</span> <span class="n">make_train_ix</span><span class="p">()</span>
    <span class="n">test_ix</span> <span class="o">=</span> <span class="n">make_test_ix</span><span class="p">()</span>
    <span class="k">yield</span> <span class="p">(</span><span class="n">train_ix</span><span class="p">,</span> <span class="n">test_ix</span><span class="p">)</span>
</code></pre></div></div>
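<p>You can check this yourself: <code class="language-plaintext highlighter-rouge">check_cv</code> is importable from <code class="language-plaintext highlighter-rouge">sklearn.model_selection</code>, so a quick sketch shows what <code class="language-plaintext highlighter-rouge">cv=5</code> becomes and what <code class="language-plaintext highlighter-rouge">split</code> yields:</p>

```python
import numpy as np
from sklearn.model_selection import check_cv

cv = check_cv(5)              # what sklearn does internally with cv=5
print(type(cv).__name__)      # KFold

X = np.arange(20).reshape(10, 2)
train_ix, test_ix = next(cv.split(X))   # first of the five folds
print(len(train_ix), len(test_ix))      # 8 2
```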
<p>Those indices are used to slice <code class="language-plaintext highlighter-rouge">X</code> and <code class="language-plaintext highlighter-rouge">y</code> to do the cross validation. So, if we are going to hack sklearn to do the optimism corrected bootstrap for us, we really just need to write a generator to give us a bunch of indices. According to steps 2 and 3 above, the train indices need to be resamples of <code class="language-plaintext highlighter-rouge">np.arange(len(X))</code> (ask yourself “why?”). According to step 4, the test indices need to be <code class="language-plaintext highlighter-rouge">np.arange(len(X))</code> (again… “why?”).</p>
<p>Once we have a generator to give us our indices, we can use <code class="language-plaintext highlighter-rouge">sklearn.model_selection.cross_validate</code> to fit models on the resampled data and predict on the original sample (step 4). If we pass <code class="language-plaintext highlighter-rouge">return_train_score=True</code> to <code class="language-plaintext highlighter-rouge">cross_validate</code>, we can get the bootstrap performances as well as the test performances (step 5). All we need to do then is calculate the average difference between the two (step 6) and then add this quantity to the apparent performance we got from step 1.</p>
<p>That all sounds very complex, but the code is deceptively simple.</p>
<h2 id="the-code-i-know-you-skipped-here-dont-lie">The Code (I Know You Skipped Here, Don’t Lie)</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">cross_validate</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">mean_squared_error</span><span class="p">,</span> <span class="n">make_scorer</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LinearRegression</span>
<span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">load_diabetes</span>
<span class="kn">from</span> <span class="nn">sklearn.utils</span> <span class="kn">import</span> <span class="n">resample</span>
<span class="c1"># Need some data to predict with
</span><span class="n">data</span> <span class="o">=</span> <span class="n">load_diabetes</span><span class="p">()</span>
<span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s">'data'</span><span class="p">],</span> <span class="n">data</span><span class="p">[</span><span class="s">'target'</span><span class="p">]</span>
<span class="k">class</span> <span class="nc">OptimismBootstrap</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n_bootstraps</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">n_bootstraps</span> <span class="o">=</span> <span class="n">n_bootstraps</span>

    <span class="k">def</span> <span class="nf">split</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="o">*</span><span class="n">_</span><span class="p">):</span>
        <span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
        <span class="n">test_ix</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">n_bootstraps</span><span class="p">):</span>
            <span class="n">train_ix</span> <span class="o">=</span> <span class="n">resample</span><span class="p">(</span><span class="n">test_ix</span><span class="p">)</span>
            <span class="k">yield</span> <span class="p">(</span><span class="n">train_ix</span><span class="p">,</span> <span class="n">test_ix</span><span class="p">)</span>

<span class="c1"># Optimism Corrected
</span><span class="n">model</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">()</span>
<span class="n">model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">apparent_performance</span> <span class="o">=</span> <span class="n">mean_squared_error</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">))</span>
<span class="n">opt_cv</span> <span class="o">=</span> <span class="n">OptimismBootstrap</span><span class="p">(</span><span class="n">n_bootstraps</span><span class="o">=</span><span class="mi">250</span><span class="p">)</span>
<span class="n">mse</span> <span class="o">=</span> <span class="n">make_scorer</span><span class="p">(</span><span class="n">mean_squared_error</span><span class="p">)</span>
<span class="n">cv</span> <span class="o">=</span> <span class="n">cross_validate</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="n">opt_cv</span><span class="p">,</span> <span class="n">scoring</span><span class="o">=</span><span class="n">mse</span><span class="p">,</span> <span class="n">return_train_score</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">optimism</span> <span class="o">=</span> <span class="n">cv</span><span class="p">[</span><span class="s">'test_score'</span><span class="p">]</span> <span class="o">-</span> <span class="n">cv</span><span class="p">[</span><span class="s">'train_score'</span><span class="p">]</span>
<span class="n">optimism_corrected</span> <span class="o">=</span> <span class="n">apparent_performance</span> <span class="o">+</span> <span class="n">optimism</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Optimism Corrected: </span><span class="si">{</span><span class="n">optimism_corrected</span><span class="p">:.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="c1"># Compare against regular cv
</span><span class="n">cv</span> <span class="o">=</span> <span class="n">cross_validate</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">cv</span> <span class="o">=</span> <span class="mi">10</span><span class="p">,</span> <span class="n">scoring</span><span class="o">=</span><span class="n">mse</span><span class="p">)[</span><span class="s">'test_score'</span><span class="p">].</span><span class="n">mean</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'regular cv: </span><span class="si">{</span><span class="n">cv</span><span class="p">:.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
</code></pre></div></div>
<p>The two estimates (optimism corrected and 10 fold) should be reasonably close together, but uh don’t run this code multiple times. You might see that the optimism corrected estimate is quite noisy meaning I’m either wrong or that twitter thread I linked to might have some merit.</p>Demetri Pananosdpananos@gmail.comIts late, I can’t sleep, so I’m writing a blog post about the optimism corrected bootstrap.Gym Data2021-10-15T00:00:00-07:002021-10-15T00:00:00-07:00https://dpananos.github.io/posts/2021/10/blog-post-33<p>Prior to the pandemic, my school’s gym <a href="https://twitter.com/WesternWeightRm">would tweet out how many people were in the weight room at a given time</a>. This was useful for an instantaneous understanding of how busy the gym was, but was not useful if you wanted to answer how busy the gym might be later. During my PhD, I managed to create a little python script to scrape some tweets, extract the number of people in the weight room, and model the data with a linear model. The model did OK, but I wanted to take it a step further and estimate the effects of days of weeks, months, special events, and even weather (because, as the gym’s manager told me when I met with her, the weather is a major factor in people going to the gym).</p>
<p>So long story short, I made <a href="https://github.com/Dpananos/GyMBRo">GyMBRo</a> (short for Gym Monitoring By Robot). GyMBRo is some python code to scrape tweets as far back as 2014, do some lite data engineering, and fit a boosted tree to the data. GyMBRo also tweeted out predictions on the hour and had an MAE of 9 people when testing on 2019 as a hold out year.</p>
<p>Alas, the pandemic has ended GyMBRo’s legacy, but the data remains. I’ve since stopped caring about point predictions because I feel like I can do that satisfactorily. What I’m more interested in now is using the data to learn more about some new modelling techniques and get some reasonable estimates of prediction uncertainty. For example <a href="https://i.imgur.com/THEimAv.png">I recently used the data to learn about <code class="language-plaintext highlighter-rouge">gamlss</code></a> with some modest success. Is that model perfect? No. Is it good enough? Absolutely.</p>
<p>But there is always more I can do, so I figure I would open the data up to other people. This blog post is intended to explore the data a little bit and do a bit of knowledge transfer from me to you. Let’s begin.</p>
<h1 id="a-little-about-the-gym">A Little About The Gym</h1>
<p>The gym is split into two areas: Weight Room (WR in the tweets) and the Cardio Mezzanine (CM). The WR is usually busier, and so the way I extract the WR count from the tweets is to find the largest number in the tweet. It is a heuristic, but it seems to work sufficiently well. The gym tries to tweet every half hour, but will be a few minutes early/late because humans are tweeting. The gym may sometimes go hours without a tweet. Very rarely do they not tweet at all, but there are a few days in the data with only a couple of tweets. Shown below is a plot of the weight room counts versus the time they were made. We can see some banding on the half hours. For example, we see 7 bands between 8 am and 11 am corresponding to 8 am, 8:30 am, and so on. There is some imprecision in the timing of the tweets, as I mentioned earlier, leading to fuzzy bands rather than straight lines.</p>
<div style="text-align:center"><img src="/images/blog/tweet_freq.png" /></div>
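<p>The largest-number heuristic described above can be sketched with a regex. The example tweet here is made up for illustration; the real tweets vary in format, which is exactly why a heuristic is needed:</p>

```python
import re
from typing import Optional

def extract_wr_count(tweet: str) -> Optional[int]:
    """Heuristic: take the largest number appearing in the tweet."""
    numbers = [int(n) for n in re.findall(r"\d+", tweet)]
    return max(numbers) if numbers else None

# A made-up tweet in the general style described above
print(extract_wr_count("WR: 95 CM: 20"))  # 95
```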
<p>The WR can hold maybe 200 people before it gets uncomfortably crowded. The busiest I’ve ever seen it is about 205. The busiest times are weekdays in January and September. I anticipate this is a “new (academic) year, new me” effect which quickly vanishes come midterm time (October and February respectively).</p>
<p>Because the gym is on campus, it primarily serves students. Hence, dates which affect students will also affect the gym. For example, the gym’s activity will slowly decrease as Thanksgiving and reading week approach. This is because students start to leave throughout the week, leaving fewer students on campus, and hence fewer patrons. A similar but opposite effect happens as we lead away from “special days” (e.g. the gym becomes busier in the days leading into the start of the term because more students are coming back to the city). Side note: including <code class="language-plaintext highlighter-rouge">time_to_holiday</code> and <code class="language-plaintext highlighter-rouge">time_since_holiday</code> reduced GyMBRo’s MAE from 11 to 9. Very proud of engineering that feature.</p>
<p>The summer months (May to August) are noticeably less busy. Shown below is a plot of the median WR count for each month in each year. You can see how midterms result in fewer people going to the gym, in order to study or maybe just relax. The starts of the academic and calendar years usually have the highest median counts.</p>
<div style="text-align:center"><img src="/images/blog/median_wr.png" /></div>
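<p>The <code class="language-plaintext highlighter-rouge">time_to_holiday</code> and <code class="language-plaintext highlighter-rouge">time_since_holiday</code> features mentioned above are straightforward to engineer with pandas. A minimal sketch with made-up dates (the real timestamps come from the tweets and the real “special days” from the academic calendar):</p>

```python
import pandas as pd

# Made-up timestamps and "special days"; the real lists are much longer
tweets = pd.to_datetime(["2018-12-20", "2019-01-02", "2019-01-10"])
holidays = pd.to_datetime(["2018-12-25", "2019-01-07"])

def days_to_next(ts, events):
    """Days until the next special day at or after ts (None if there is none)."""
    future = [e for e in events if e >= ts]
    return (min(future) - ts).days if future else None

def days_since_last(ts, events):
    """Days since the most recent special day at or before ts (None if there is none)."""
    past = [e for e in events if e <= ts]
    return (ts - max(past)).days if past else None

features = pd.DataFrame({
    "created_at": tweets,
    "time_to_holiday": [days_to_next(t, holidays) for t in tweets],
    "time_since_holiday": [days_since_last(t, holidays) for t in tweets],
})
print(features)
```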
<h1 id="the-data">The Data</h1>
<p>You can download the data <a href="https://dpananos.github.io/files/gym_data.csv">here</a>. The two most important columns are <code class="language-plaintext highlighter-rouge">created_at</code> which is the time stamp that the tweet was made, and <code class="language-plaintext highlighter-rouge">y</code> which is the associated WR count in the tweet. From the time stamp, you can extract things like year, month, time, etc. I’ve also included some weather data, but I don’t think the effects are going to be very big as compared to time/day/month effects.</p>Demetri Pananosdpananos@gmail.comPrior to the pandemic, my school’s gym would tweet out how many people were in the weight room at a given time. This was useful for an instantaneous understanding of how busy the gym was, but was not useful if you wanted to answer how busy the gym might be later. During my PhD, I managed to create a little python script to scrape some tweets, extract the number of people in the weight room, and model the data with a linear model. The model did OK, but I wanted to take it a step further and estimate the effects of days of weeks, months, special events, and even weather (because, as the gym’s manager told me when I met with her, the weather is a major factor in people going to the gym).In Response To Vaccine Hesitancy From Tommy Caldwell2021-09-22T00:00:00-07:002021-09-22T00:00:00-07:00https://dpananos.github.io/posts/2021/09/blog-post-32<p>On September 21st, 2021 I received an email from Tommy Caldwell, author and owner of Hybrid Fitness in my hometown of London Ontario, with the subject line “COVID-19, Kids, and Vaccines: A difficult decision for parents to navigate”. The email was sent from Hybrid Fitness’ email address (which I had subscribed to sometime in 2018, in the first year of my PhD). In that email, Mr. 
Caldwell acknowledges that vaccinating is a “no-brainer decision” and professes that he is no expert in COVID-19, science, or research, yet goes on to interpret statistics from <a href="https://www.cdc.gov/vaccines/acip/recs/grade/covid-19-pfizer-biontech-vaccine-12-15-years.html">a clinical trial</a> intended to evaluate the vaccine against COVID-19 in persons aged 12–15 years.</p>
<p>Mr. Caldwell claims that 121 severe adverse events (SAAs) occurred in the intervention arm of the trial, constituting a 5:1 ratio when compared to the placebo arm. He goes on to say</p>
<blockquote>
<p>“A direct quote from the trial also suggested that researchers were concerned that they were missing severe adverse events as well due to the study design. In their words, ‘There was serious concern of indirectness because the body of evidence does not provide certainty that rare serious adverse events were captured due to the short follow-up and sample size. There was also very serious concern for imprecision, due to the width of the confidence interval.’”</p>
</blockquote>
<p>I have included the entire email as a png below so that you can read his words in context.</p>
<p>Mr. Caldwell has, in my opinion, grossly misinterpreted these statistics, and in doing so weaponizes them to support a narrative that vaccinating one’s children is a “difficult decision”, when a proper interpretation of the statistics should lead one to believe that the data are at the very least consistent with no increase in the risk of harm. This blog post is intended to rectify the mistakes Mr. Caldwell has made for his presumably large readership.</p>
<p>In what follows, I present corrected claims found in Mr. Caldwell’s email as section titles with more details in the section body. If you question my ability to intelligently comment on statistics at this level (which you should), I encourage you to poke around this website for proof that I understand statistics sufficiently well. If you need additional proof, please see some of the many <a href="https://stats.stackexchange.com/users/111259/demetri-pananos?tab=answers&sort=votes">answers</a> I provide on statistical forums, or review some of the <a href="https://scholar.google.com/citations?hl=en&user=LN16PpgAAAAJ&view_op=list_works&sortby=pubdate">papers</a> I have published in medical and epidemiology journals.</p>
<h1 id="51-ratio-of-people-reporting-reactogenicity-not-severe-adverse-events">5:1 Ratio of People Reporting Reactogenicity, Not Severe Adverse Events</h1>
<p>Table 3d in the link above summarizes the number of participants who reported “Reactogenicity”. I assume these are the numbers Mr. Caldwell refers to in his email, as I can’t find any others which match the aforementioned ratio. The ratio of reactogenicity between the intervention and placebo arms is roughly 5 (121 / 22 is approximately 5.5). However, the definition of reactogenicity is (from the text below table 3d)</p>
<blockquote>
<p>Reactogenicity outcome includes local and systemic events, grade ≥3. Grade 3: prevents daily routine activity or requires use of a pain reliever.</p>
</blockquote>
<p>This means that 121 people may have, for example, needed an Advil, or got a sore arm, or may have needed to take a nap. The study authors do not list the outcomes explicitly, but we can surmise from the definition used that they are very minor. On the other hand, a severe adverse event (to which Mr. Caldwell originally claimed the 5:1 ratio applied) means (see table 3c in the link above)</p>
<blockquote>
<p>Death, life-threatening event, hospitalization, incapacity to perform normal life functions, medically important event, or congenital anomaly/birth defect</p>
</blockquote>
<p>Note further that the counts of SAAs are much lower than the counts he mentions in the email. Mr. Caldwell has, I assume mistakenly, reported a less severe reaction as a more severe reaction. This can garner (and probably has garnered) fear from readers. I’m not a parent myself, but I wouldn’t want my dog undergoing an SAA, let alone my mother or future child.</p>
<h1 id="data-are-consistent-with-no-increase-of-risk-of-saas">Data Are Consistent with No Increase of Risk of SAAs.</h1>
<p>Table 3c in the link above examines SAAs. Sample sizes in each arm are comparable (1131 and 1129 in the intervention and placebo arms respectively), and so an absolute comparison of the number of SAAs is allowable.</p>
<p>Five people in the intervention arm reported an SAA and 2 people in the placebo arm reported an SAA, a difference of 3 people. This difference is very small, and is consistent with the assumption the vaccine does not increase the risk of SAAs. A difference of 3 people or more is completely in line with sampling variability. For those comfortable in programming in R, we can simulate this and determine the probability of seeing one of the two groups have 3 or more people than the other.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># The risk under the null hypothesis would be</span><span class="w">
</span><span class="c1"># 0.003097345.</span><span class="w">
</span><span class="c1"># Simulate 1 million trials where there is no difference in </span><span class="w">
</span><span class="c1"># risk of SAA</span><span class="w">
</span><span class="c1"># Compute the probability one of the groups</span><span class="w">
</span><span class="c1"># has 3 more SAAs than the other</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="c1">#Set the random seed for reproducibility</span><span class="w">
</span><span class="n">risk</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.003097345</span><span class="w">
</span><span class="c1"># Generate 1 million similar scenarios where there is no real difference </span><span class="w">
</span><span class="c1"># in risk of an SAA</span><span class="w">
</span><span class="n">intervention</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rbinom</span><span class="p">(</span><span class="m">1e6</span><span class="p">,</span><span class="w"> </span><span class="m">1131</span><span class="p">,</span><span class="w"> </span><span class="n">risk</span><span class="p">)</span><span class="w">
</span><span class="n">placebo</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rbinom</span><span class="p">(</span><span class="m">1e6</span><span class="p">,</span><span class="w"> </span><span class="m">1129</span><span class="p">,</span><span class="w"> </span><span class="n">risk</span><span class="p">)</span><span class="w">
</span><span class="c1"># Here is the probability we find one of the two groups having a difference of 3 or more</span><span class="w">
</span><span class="n">mean</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">intervention</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">placebo</span><span class="p">)</span><span class="o">>=</span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">0.334566</span><span class="w">
</span></code></pre></div></div>
<p>So there is approximately a 33% chance we would see one of the two groups have 3 or more SAAs than the other when in truth no difference in risk exists between groups.</p>
<p>The study authors report something called the <strong>relative risk</strong> (RR in the table). The relative risk is the risk of SAA in intervention arm divided by the risk of SAA in placebo arm. If the relative risk is greater than 1, then the risk of SAA is bigger in the intervention group than the placebo. The authors report a relative risk of 2.5 (meaning the risk of SAA in the intervention arm was estimated to be 2.5x greater than the placebo arm). However, that does not tell the whole story.</p>
<p>A 2.5x increase in risk is one thing, but the authors also report something called a <strong>confidence interval</strong>. A confidence interval gives values of the true relative risk which might plausibly have generated the data we have seen. Note that the associated confidence interval is 0.49 to 12.84, meaning the data are consistent with a true relative risk as small as 0.5 (again, meaning the risk of SAA would be <strong>smaller</strong> in the intervention arm than in the placebo arm).</p>
<p>A popular rebuttal would be “Demetri, the confidence interval is also greater than 1, meaning the data are consistent with relative risks as big as 13!”. Agreed, that is true, but if the data are consistent with reductions in risk <strong>and</strong> increases in risk, then we should conclude that we can’t make any conclusions based on the conflicting evidence. The confidence interval is too wide (or as statisticians would say, the estimate is too <strong>imprecise</strong>, a word used by the study authors and quoted by Mr. Caldwell). This does not mean we can conclude the vaccine does not increase the risk of SAA, it only means that the data are consistent with no change in the risk of SAA. To know with more certainty if the risk of SAA changes between groups, we would need more data.</p>
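<p>For the curious, here is roughly where numbers like these come from. The event counts below (5 SAAs in the intervention arm, 2 in the placebo arm) are my own assumption rather than figures quoted from the table, but they happen to reproduce the reported relative risk and confidence interval almost exactly under the standard log-scale (Wald) interval:</p>

```python
import math

# Hypothetical event counts -- an assumption, chosen because they
# reproduce the reported RR of 2.5 and the 0.49 to 12.84 interval.
a, n1 = 5, 1131   # intervention: SAAs, sample size
b, n2 = 2, 1129   # placebo: SAAs, sample size

rr = (a / n1) / (b / n2)                   # relative risk
se = math.sqrt(1/a - 1/n1 + 1/b - 1/n2)    # standard error of log(RR)
lo = math.exp(math.log(rr) - 1.96 * se)    # lower 95% limit
hi = math.exp(math.log(rr) + 1.96 * se)    # upper 95% limit

print(round(rr, 2), round(lo, 2), round(hi, 2))  # 2.5 0.49 12.84
```

Notice how wide the interval is with so few events; the imprecision is a direct consequence of the tiny event counts, not of anything wrong with the study.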
<p>Which leads me to my final point…</p>
<h1 id="very-serious-concern-regarding-the-size-of-the-confidence-interval-is-about-inability-to-comment-on-change-in-risk-of-saa-not-about-how-the-study-was-conducted">Very Serious Concern Regarding The Size of the Confidence Interval is About Inability to Comment on Change in Risk of SAA, Not About How the Study Was Conducted</h1>
<p>As I mentioned before, that the confidence interval spans numbers less than 1 and greater than 1 means we can’t comment on a change in risk of SAA. In his email, Mr. Caldwell seems to interpret the authors’ concern about “indirectness” as meaning they were missing SAAs, or that the concern was about the study design.</p>
<p>For the reasons I have commented on above, this is false. SAA has been very well defined in this paper, and what I anticipate Mr. Caldwell is concerned about is long term effects of the vaccine. An understandable concern, but not one I can see being studied in sufficient depth to allow for an expedient release of a vaccine to combat COVID19.</p>
<h1 id="conclusion">Conclusion</h1>
<p>This letter in response is much larger than Mr. Caldwell’s initial email, a testament to <a href="https://en.wikipedia.org/wiki/Brandolini%27s_law">Brandolini’s Law</a>, better known as the “Bullshit Asymmetry Principle”:</p>
<blockquote>
<p>The amount of energy needed to refute bullshit is an order of magnitude larger than to produce it.</p>
</blockquote>
<p>And make no mistake that I believe Mr. Caldwell’s claims about these studies to be nothing short of bullshit. I have no personal vendetta against Mr. Caldwell, nor do I take issue with his personal choices or concerns about vaccinations, but I do take issue with making such gross errors in the interpretation of statistics to lay audiences who trust his work on matters of health, broadly construed. I’ve contacted Mr. Caldwell and politely and kindly asked him to send another email to the same mailing list correcting some of his errors. He has not explicitly said “No” but has yet to directly address the request at this time.</p>
<h1 id="email-from-mr-caldwell">Email From Mr. Caldwell</h1>
<div style="text-align:center"><img src="/images/blog/email.png" /></div>Demetri Pananosdpananos@gmail.comOn September 21st, 2021 I received an email from Tommy Caldwell, author and owner of Hybrid Fitness in my hometown of London Ontario, with the subject line “COVID-19, Kids, and Vaccines: A difficult decision for parents to navigate”. The email was sent from Hybrid Fitness’ email address (which I had subscribed to sometime in 2018, in the first year of my PhD). In that email, Mr. Caldwell acknowledges that vaccinating is a “no-brainer decision” and professes he is an inexpert in COVID19, science, and research, yet continues on to interpret statistics from a clinical trial intended to evaluate the vaccine against COVID19 in persons aged 12-15 years.Three Months Into Industry2021-07-27T00:00:00-07:002021-07-27T00:00:00-07:00https://dpananos.github.io/posts/2021/04/blog-post-31<p>I “left” (I say left in quotations because I’m still working on my Ph.D part time) approximately 3 months ago. That is a bit of a milestone. It is a quarter of a year working at a senior-ish level as a data scientist at a national bank. I know a lot of PhDs, especially in quantitative disciplines, are thinking of making the jump from academia to industry. This sequence of blog posts is not advice on how to make that jump, but rather to document one perspective on what that change entails.</p>
<p>I’m not sure how best to document my experiences, but the most interesting comparisons to me are:</p>
<ul>
<li>
<p>Differences in Challenges: How does working on a PhD differ from working in industry with respect to the challenges you face? Are obstacles easier/harder to overcome? What are the differences in obstacles?</p>
</li>
<li>
<p>Differences in Work Satisfaction: How do I like working on a PhD versus working in industry? Is the work better/worse? More/less interesting?</p>
</li>
<li>
<p>Differences in Work Life Balance: This one is pretty self explanatory.</p>
</li>
</ul>
<p>Let’s begin.</p>
<h1 id="differences-in-challenges">Differences in Challenges</h1>
<p>One of the earliest challenges I had to face was to find stuff to do! I imagine this is not typical of most data science positions. You’re usually hired into a data science team, which is likely fostering a bunch of projects. Not me; I was hired into an app team composed almost entirely of developers in Java and kdb+ (or are they q developers? It doesn’t matter). The team is new; I joined less than a year after the team was formed. Consequently, most of the work being done is development work. It was soon revealed to me that (I’m paraphrasing here) the team did not want a data scientist, they wanted another developer. But the bank works in mysterious bureaucratic ways, and so they were handed a data scientist to “inject AI into trade compliance”. However, because the team is so new and there is much development work to be done, machine learning and data science just isn’t on their radar right now. So my first challenge was to find something to do, or more precisely, to understand the processes of the business we support and find areas where I could add value with machine learning.</p>
<p>That’s tough. It requires a skill I did not properly hone in my time in academia: talking to people and empathizing. It would be easy for me to just create a model that gives analysts another number to look at (or ignore, as a manager has said to me), but the trick to being a good data scientist is creating solutions which address an actual need, not solving some sort of math problem. You can see where I (a person who has spent a lot of his life just solving math problems) could encounter some difficulty.</p>
<h1 id="differences-in-work-satisfaction">Differences in Work Satisfaction</h1>
<p>I said this to my manager fairly early on: “I don’t care much about financial markets”. I’m a boring investor. I put a set amount of money into an index fund every month automatically. I don’t care about Gamestop (save for the memes), I could give a shit about SPY and TSLA calls, and those are the interesting bits! I work in fixed income! If I find Tesla drama boring, imagine how I feel about government bonds!</p>
<p>Suffice it to say, my work at the hospital was orders of magnitude more interesting and fulfilling (though the pay is much better at the bank. I will note here that there is an interesting inverse relationship between how fulfilling a job is and the pay for those jobs. More on that in <em>Bullshit Jobs</em>, an excellent book in which data scientists are called out by name. I digress… ). That being said, there are other ways I get work satisfaction.</p>
<p>Though I don’t (yet) have access to enormous compute, I do have access to enormous data. Consequently, the problems become interesting when I can abstract away the boring financial details and focus on the underlying statistical or math problem. That is an enormous difference between grad school and industry. I finally have the data I need to (at least approximately) answer the questions I want to answer.</p>
<h1 id="differences-in-work-life-balance">Differences in Work Life Balance</h1>
<p>Unsurprisingly, work life balance is better in industry. In grad school, the guilt to work is a long, dull pain. There is <em>always</em> work to be done, and if you aren’t doing it then (you tell yourself) you are lazy. In industry, the guilt to work is a sharp, quick pain that comes in waves. There is always work to be done, but I know I’m going to come back to it tomorrow for the same amount of time, so I don’t feel as bad about taking a 40 minute break, or working on something else.</p>
<p>That being said, there are times I’ve had to work after 5 pm. That’s fine with me; it’s pretty infrequent, but I have been witness to people <em>scheduling meetings at 8 pm</em> as if I didn’t have a life outside work. And even stranger, I’ve seen people decline those meetings <em>because they are already in meetings at 8pm</em>. What the fuck? What the actual fuck?</p>
<p>I’m not interested in imputing why people do this. I benefit from low amounts of responsibility (no mortgage, no kids, no family) and so perhaps I’m not as motivated by promotions, bonuses, or being fired as some people are. That will maybe change as I grow a bit older, but I’d sooner quit than schedule a meeting that late into the night. Famous last words.</p>
<p>I’ll revisit this story in another 3 months, which would be around November. In the meantime, if you have additional questions or would like to chat about my experience (or yours for that matter, I’d much rather listen than talk), please reach out via twitter.</p>Demetri Pananosdpananos@gmail.comI “left” (I say left in quotations because I’m still working on my Ph.D part time) approximately 3 months ago. That is a bit of a milestone. It is a quarter of a year working at a senior-ish level as a data scientist at a national bank. I know a lot of PhDs, especially in quantitative disciplines, are thinking of making the jump from academia to industry. This sequence of blog posts is not advice on how to make that jump, but rather to document one perspective on what that change entails.Riddler Solutions2021-05-11T00:00:00-07:002021-05-11T00:00:00-07:00https://dpananos.github.io/posts/2021/04/blog-post-30<p>I like <a href="https://fivethirtyeight.com/features/are-you-smarter-than-a-fourth-grader/">Riddler</a> from 538 mostly because you can solve the riddles with some fun math. If the riddle is interesting enough, <a href="https://dpananos.github.io/posts/2017/12/blog-post-2/">I will post solutions on my blog</a>. This is one such riddle.</p>
<h2 id="the-riddle">The Riddle</h2>
<blockquote>
<p>You and your infinitely many friends are sharing a cake, and you come up with two rather bizarre ways of splitting it.
For the first method, Friend 1 takes half of the cake, Friend 2 takes a third of what remains, Friend 3 takes a quarter of what remains after Friend 2, Friend 4 takes a fifth of what remains after Friend 3, and so on. After your infinitely many friends take their respective pieces, you get whatever is left.
For the second method, your friends decide to save you a little more of the take. This time around, Friend 1 takes $1/2^2$ (or one-quarter) of the cake, Friend 2 takes $1/3^2$ (or one-ninth) of what remains, Friend 3 takes $1/4^2$ of what remains after Friend 2, and so on. Again, after your infinitely many friends take their respective pieces, you get whatever is left.</p>
<p>Question 1: How much of the cake do you get using the first method?</p>
<p>Question 2: How much of the cake do you get using the second method?</p>
</blockquote>
<h2 id="the-solution">The Solution</h2>
<p>It is very easy to argue that method 1 leaves us no cake left. Let’s see why. First, let’s agree to model <em>how much cake is left</em> rather than how much the $k^{th}$ friend takes. The first friend takes half (leaving us half a cake). The second friend takes a third of what remains (meaning there are two thirds of one half left). The third friend takes a fourth of what remains (meaning there is three quarters of two thirds of one half left). See the pattern? Let $\pi_k$ be the proportion of pie (or, er cake) left after friend $k$ takes their share. The sequence is</p>
\[\begin{align}
\pi_1 &= \dfrac{1}{2}\\
\pi_2 &= \dfrac{2}{3} \times \dfrac{1}{2} = \dfrac{1}{3}\\
\pi_3 &= \dfrac{3}{4} \times \dfrac{2}{3} \times \dfrac{1}{2} = \dfrac{1}{4}\\
& \vdots\\
\pi_k &= \dfrac{1}{k+1}
\end{align}\]
<p>Since $\lim_{k \to \infty} \pi_k = 0$ we get no cake.</p>
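<p>If you’d rather not trust the algebra, a quick numerical sketch (modelling the cake that remains after each friend takes their share) confirms $\pi_k = 1/(k+1)$:</p>

```python
# Friend k takes 1/(k+1) of whatever remains, so the remaining cake
# is multiplied by (1 - 1/(k+1)) for each friend in turn.
cake = 1.0
for k in range(1, 1000):
    cake *= 1 - 1 / (k + 1)

# After 999 friends, pi_k = 1/(k+1) predicts exactly 1/1000 left.
print(cake)  # approximately 0.001
```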
<p>Method 2 is more interesting. Remember, we are modelling how much is left. The sequence is</p>
\[\begin{align}
\pi_1 &= \dfrac{3}{4}\\
\pi_2 &= \dfrac{8}{9} \times \dfrac{3}{4} \\
\pi_3 &= \dfrac{15}{16} \times\dfrac{8}{9} \times \dfrac{3}{4}\\
& \vdots\\
\pi_k &= \prod_{m=1}^{m=k} \dfrac{(m+1)^2-1}{(m+1)^2}
\end{align}\]
<p>It is not obvious what $\pi_k$ approaches as $k \to \infty$. Let’s see what this quantity approaches empirically and see if we can get some intuition.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cake</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">for</span> <span class="n">friend</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">100090</span><span class="p">):</span>
<span class="n">cake</span> <span class="o">=</span> <span class="n">cake</span> <span class="o">*</span><span class="p">(</span><span class="n">friend</span><span class="o">**</span><span class="mi">2</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span><span class="o">/</span><span class="n">friend</span><span class="o">**</span><span class="mi">2</span>
<span class="k">print</span><span class="p">(</span><span class="n">cake</span><span class="p">)</span>
<span class="o">>>></span><span class="mf">0.5000049955540937</span>
</code></pre></div></div>
<p>Looks like it is approaching 1/2. Now that we know where we are going, let’s prove it. It is easier to work with the log of the product because that turns it into a sum. The use of $(m+1)$ in the expression for $\pi_k$ is to make the indices work. Let’s just use $m$ because we’re mostly interested in the limit, not the indices.</p>
\[\log(\pi_k) = \sum_{m=2}^{m=k} \log(m^2-1) - \log(m^2) = \sum_{m=2}^{m=k} \log(m+1) + \log(m-1) - 2\log(m)\]
<p>Here, I’ve just factored the difference of squares and applied some log rules.</p>
<p>Writing out the first few terms of the summand</p>
\[\begin{align}
&\log(3) + \log(1) - 2\log(2)\\
&\log(4) + \log(2) - 2\log(3)\\
&\log(5) + \log(3) - 2\log(4)\\
&\log(6) + \log(4) - 2\log(5)\\
\end{align}\]
<p>and so on. Some nice cancellation occurs. The $\log(3)$ terms in lines 1 and 3 are cancelled by the two factors of $\log(3)$ in line 2. Similarly, the $\log(4)$ terms in lines 2 and 4 are cancelled by the two factors in line 3, and so on down the sum. What is left uncancelled as $k \to \infty$ is $\log(1)$ and one factor of $-\log(2)$, meaning the product approaches $1/2$. Hence, we get half a cake.</p>
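<p>If the cancellation bookkeeping feels slippery, the partial sums of the telescoping series can be checked directly; here is a quick sketch:</p>

```python
import math

# Partial sum of log(m+1) + log(m-1) - 2*log(m) for m = 2..N,
# which telescopes to log((N+1)/(2N)) and so approaches log(1/2).
N = 100_000
s = sum(math.log(m + 1) + math.log(m - 1) - 2 * math.log(m)
        for m in range(2, N + 1))

print(s, math.log(0.5))  # the two values agree to several decimal places
```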
<h2 id="extra-credit">Extra Credit</h2>
<p>There is an extra credit portion to this question I can’t figure out. Suppose your friends start taking slices of cake which are reciprocals of squares of even numbers (the first takes $1/2^2$, the second takes $1/4^2$, the third $1/6^2$, and so on). The proportion of cake left is</p>
\[\begin{align}
\pi_1 &= \dfrac{3}{4}\\
\pi_2 &= \dfrac{15}{16}\times \dfrac{3}{4}\\
\pi_3 &= \dfrac{35}{36} \times \dfrac{15}{16}\times \dfrac{3}{4}\\
& \vdots\\
\pi_k &= \prod_{m=1}^{m=k} \dfrac{(2m)^2-1}{(2m)^2} = \dfrac{1 \times 3^2 \times 5^2 \times \cdots \times (2k-1)^2 \times (2k+1)}{2^2 \times 4^2 \times 6^2 \times \cdots \times (2k)^2}
\end{align}\]
<p>But I can’t seem to find a way to break this sequence’s back. Hints are welcome.</p>Demetri Pananosdpananos@gmail.comI like Riddler from 538 mostly because you can solve the riddles with some fun math. If the riddle is interesting enough, I will post solutions on my blog. This is one such riddle.Answering Easier Questions2021-04-03T00:00:00-07:002021-04-03T00:00:00-07:00https://dpananos.github.io/posts/2021/04/blog-post-29<p>The 95% in 95% confidence interval refers not to the probability that any one interval contains the estimand, but rather to the long term relative frequency of the estimator containing the estimand in an infinite sequence of replicated experiments under ideal conditions.</p>
<p>Now, if this were twitter I would get ratioed so hard I might have to take a break and walk it off. Luckily, this is my blog and not yours so I can say whatever I want with impunity. But, rather than shout my opinions and demand people listen, I thought it might be a good exercise to explain to you why I think this and perhaps why people might disagree. Let’s for a moment ignore the fact that the interpretation I use above is the <em>de jure</em> definition of a confidence interval and instead start where a good proportion of statistical learning starts; with a deck of shuffled cards.</p>
<p>I present to you a shuffled deck. It’s a regular deck of cards, no funny business with the cards or the shuffling. What is the probability the top card of <em>this</em> deck is an ace? I’d wager a good portion of people would say 4/52. If you, dear reader, said 4/52 then I believe you have made a benign mistake, but a mistake all the same. And I suspect the reason you’ve made this mistake is because you’ve swapped a hard question (the question about <em>this</em> deck) for an easier question (a question about the long term relative frequencies of coming to shuffled decks with no funny business and finding aces).</p>
<p>Swapping hard questions for easy questions is not a new observation. Daniel Kahneman writes about it in <em>Thinking Fast and Slow</em> and provides numerous examples. I’ll repeat some examples from the book here. We might swap the question:</p>
<ul>
<li>“How much would you contribute to save dolphins?” for “how much emotion do I feel when I think of dying dolphins?”</li>
<li>“How happy are you with your life?” for “What is my mood right now?”, and poignantly</li>
<li>“This woman is running for the primary. How far will she go in politics?” for “Does this woman look like a political winner?”</li>
</ul>
<p>The book <em>Thinking Fast and Slow</em> explains why we do this, or better yet why we have no control over this. I won’t explain it here. But it is important to know that this is something we do, mostly unconsciously.</p>
<p>So back to the deck of cards. Questions about the deck in front of you are hard. It’s either an ace or not, but you can’t tell! The card is face down and there is no other information you could use to make the decision. So, you answer an easier question using information that you do know, namely the number of aces in the deck, the number of cards in the deck, the fact that each card is equally likely to be on top given there is no funny business with the cards or the shuffling, and the basic rules of probability you might have learned in high school if not elsewhere. But the answer you give is for a fundamentally different question, namely “If I were to observe a long sequence of well shuffled decks with no funny business, what fraction of them have an ace on top?”. Your answer is about that long sequence of shuffled decks. It isn’t about any one particular deck, and certainly not the one in front of you.</p>
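<p>That long-run question is one we can actually answer by brute force. Here is a quick simulation sketch (the number of decks and the seed are arbitrary choices of mine):</p>

```python
import random

random.seed(0)

deck = ["ace"] * 4 + ["other"] * 48  # a standard 52 card deck
trials = 200_000

hits = 0
for _ in range(trials):
    random.shuffle(deck)        # a fresh, well shuffled deck each time
    hits += deck[0] == "ace"    # did an ace land on top this time?

print(hits / trials)  # hovers around 4/52, roughly 0.077
```

The 4/52 answer is about this long sequence of decks; any single deck in the sequence either has an ace on top or it doesn’t.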
<p>I think the same thing happens with confidence intervals. The estimator has the property that 95% of the time it is constructed (under ideal circumstances) it will contain the estimand. But any one interval does or does not contain the estimand. And unlike the deck of cards, which can easily be examined, we can’t ever know for certain if the interval successfully captured the estimand. There is no moment where we get to verify the estimand is in the confidence interval, and so we are sort of left guessing, thus prompting us to offer a probability that we are right.</p>
<p>The mistake is benign. It hurts no one to think about confidence intervals as having a 95% probability of containing the estimand. Your company will not lose money, your paper will (hopefully) not be rejected, and the world will not end. That being said, it is unfortunately incorrect if not by appealing to the definition, then perhaps for other reasons.</p>
<p>I’ll start with an appeal to authority. Sander Greenland and coauthors (who include notable epidemiologist Ken Rothman and motherfucking Doug Altman) include interpretation of a confidence interval as having 95% probability of containing the true effect as misconception 19 in <a href="https://link.springer.com/content/pdf/10.1007/s10654-016-0149-3.pdf">this amazing paper</a>. They note “ It is possible to compute an interval that can be interpreted as having 95% probability of containing the true value” but go on to say that this results in us doing a Bayesian analysis and computing a credible interval. If these guys are wrong, I don’t want to be right.</p>
<p>Additionally, when I say “The probability of a coin being flipped heads is 0.5” that references a long term frequency. I could, in principle, demonstrate that frequency by flipping a coin a lot and computing the empirical frequency of heads, which, assuming the coin is fair and the number of flips large enough, will be within an acceptable range of 0.5. To those people who say “This interval contains the estimand with 95% probability” I say “prove it”. Demonstrate to me via simulation or otherwise this long term relative frequency. I can’t imagine how this could be demonstrated because any fixed dataset will yield the same answer over and over. Perhaps what supporters of this perspective mean is something closer to the Bayesian interpretation of probability (where probability is akin to strength in a belief). If so, the debate is decidedly answered, because probability in frequentism is not about belief strength but about frequencies. Additionally, what is the random component in this probability? The data from the experiment are fixed; to allow these to vary is to appeal to my interpretation of the interval. If the estimand is random, then we are in another realm altogether, as frequentism assumes fixed parameters and random data. Maybe they mean something else which I just can’t think of. If there is something else, please let me know.</p>
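<p>To be concrete about what a demonstration of the frequency property <em>would</em> look like: simulate many experiments, build the usual 95% interval each time, and count how often the intervals capture the known estimand. The toy setup below (normal data and a z-based interval for the mean) is mine, chosen only for illustration:</p>

```python
import math
import random
import statistics

random.seed(1)

mu, n, reps = 0.0, 30, 4000  # true mean, sample size, replicated experiments
covered = 0
for _ in range(reps):
    x = [random.gauss(mu, 1) for _ in range(n)]
    xbar = statistics.mean(x)
    half = 1.96 * statistics.stdev(x) / math.sqrt(n)  # half-width of the CI
    covered += (xbar - half) <= mu <= (xbar + half)

print(covered / reps)  # close to 0.95 across replications
```

The coverage frequency lives at the level of the replications; each individual interval in the loop either captured $\mu$ or it didn’t.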
<p>I’ve gotten flack about confidence intervals on twitter.</p>
<h2 id="flack-1-framing-it-as-a-bet">Flack 1: Framing It As A Bet</h2>
<p>You present to me a shuffled deck with no funny business and offer me a bet in which I win X0,000 dollars if the card is an ace and lose X0 dollars if the card is not. “Aha Demetri! If you think the probability of the card on top being an ace is 0 or 1 you are either a fool for not taking the bet or are a fool for being so over confident! Your position is indefensible!” one person on twitter said to me (ok, they didn’t say it verbatim like this, but that was the intent).</p>
<p>Well, not so fast. Nothing about my interpretation precludes me from using the answer to a simpler question to make decisions (I would argue statistics is the practice of doing just that, but I digress). The top card is still an ace or not, but I can still think about an infinite sequence of shuffled decks anyway. The card on top is an ace in enough of those scenarios to make the expected payoff worth it. Thus, I take the bet and hope the top card is an ace (much like I hope the confidence interval captures the true estimand, even though I know it either does or does not).</p>
<h2 id="flack-2--my-next-interval-has-95-probability">Flack 2: My Next Interval Has 95% Probability</h2>
<p>“But Demetri, if 95% refers to the frequency of intervals containing the estimand, then surely my next interval has 95% probability of capturing the estimand prior to seeing data. Hence, individual intervals <em>do</em> have 95% probability of containing the estimand”.</p>
<p>I get this sometimes, but don’t fully understand how it is supposed to be convincing. I see no problem with saying “the next interval has 95% probability” just like I have no problem with saying “If you shuffle those cards, the probability an ace is on top is 4/52” or “My next <a href="https://en.wikipedia.org/wiki/Tim_Hortons#Roll_Up_the_Rim_to_Win_campaign">Roll Up The Rim</a> cup has a 1 in 6 chance of winning”. This is starting to get more philosophical than I care for it to, but those all reference non-existent things. Once they are brought into existence, it would be silly to think that they retain these properties. My cup is either a winner or a loser, even if I don’t roll it.</p>
<h2 id="flack-3--but-schrödingers-cat">Flack 3: But Schrödinger’s Cat…</h2>
<p>No. Stop. This is not relevant in the least. I’m talking about cards and coins, not quarks or electrons. The Wikipedia article even says “Schrödinger did not wish to promote the idea of dead-and-live cats as a serious possibility; on the contrary, he intended the example to illustrate the absurdity of the existing view of quantum mechanics”. Cards can’t be and not-be aces until flipped. Get out of here.</p>
<h2 id="wrapping-up-dont--me">Wrapping Up, Don’t @ Me</h2>
<p>To be completely fair, I think the question about the cards I’ve presented to you is unfair. The question asks for a probability, and while 0 and 1 are valid probabilities, the question is phrased in a way so that you are prompted for a number between 0 and 1. Likewise, the name “95% confidence interval” begs for the wrong interpretation. That is the problem we face when we use language, which is naturally imprecise and full of shortcuts and ambiguity, to talk about things as precise as mathematics. It is a seminal case study in what I like to call the precision-usefulness trade off; precise statements are not useful. It is by interpreting them and communicating them in common language that they become useful, and that usefulness comes at the cost of precision (note, this explanation of the trade off is <em>itself</em> susceptible to the trade off). The important part is that we use confidence intervals to convey uncertainty in the estimate from which they are derived. It isn’t important what you or I think about it, as the confidence interval is merely a means to an end.</p>
<p>As I noted, the mistake is benign, and these arguments are more a mental exercise than a fight against a method which may induce harm. Were it not for COVID19, I would encourage us all to go out for a beer and have these conversations rather than do it over twitter. Anyway, you promise not to @ me anymore about this and I promise not to tweet about it anymore.</p>Demetri Pananosdpananos@gmail.comThe 95% in 95% confidence interval refers not to the probability that any one interval contains the estimand, but rather to the long term relative frequency of the estimator containing the estimand in an infinite sequence of replicated experiments under ideal conditions.Don’t Select Features, Engineer Them2020-12-07T00:00:00-08:002020-12-07T00:00:00-08:00https://dpananos.github.io/posts/2020/12/blog-post-28<p>Students in the class I TA love to do feature selection prior to modelling. Examining pairwise correlation and dropping seemingly uncorrelated features is one way they do this, but they also love to fit a LASSO model to their data and refit a model with the selected variables, or they might do stepwise selection if they are feeling in the mood to code it up in python.</p>
<p>Feature selection can be an important thing to do if you want to optimize the ratio of predictive performance to the cost of data acquisition. Why spend money getting customer shoe size if it doesn’t help you predict the propensity to buy? However, students don’t rationalize their feature selection approach this way. I suspect they do it because they feel it will help them achieve a lower loss/better predictive accuracy.</p>
<p>Let me start by saying that I think examining pairwise correlations is a bad way to select features. First, this approach is almost never validated correctly; second, it ignores confounding relationships; third, correlations only measure the strength of a <em>linear</em> relationship; and fourth, it is very easy to construct a dataset in which the correlations completely get the variable importance wrong. Let $x$ and $z$ be binary independent random variables and let $y$ be their XOR. Knowing $x$ and $z$ determines $y$, but the correlation as measured by Kendall’s $\tau$ is 0. A strict selection based on correlation would completely miss this relationship.</p>
<p>In this blog post, I argue feature selection is at best not necessary when the goal is optimal out of sample predictive performance, and at worst increases test loss. If the goal is good predictions, then what students – and perhaps practitioners too – should be doing is engineering new features. The question then becomes “what features?” and to that I answer <em>“splines!”</em>.</p>
<p>In what follows, I describe 3 experiments in which I examine 5 models. Those models are:</p>
<ul>
<li>Linear Regression. This serves as our baseline. We just throw everything in the model and get what we get.</li>
<li>Linear Regression + Feature Selection via Correlations. Any features with an absolute correlation of 0.2 or larger are included in the model.</li>
<li>Linear Regression + Feature Selection via Lasso. I use 10 fold cross validation to first fit a LASSO model, then take all the non-zero coefficients from that model and refit a linear regression.</li>
<li>LASSO fit via 10 fold cross validation to examine how shrinkage might compare to these models, and</li>
<li>A linear model where each feature is expanded into a restricted cubic spline with 3 degrees of freedom. That is one additional feature for every feature in the training data, increasing the number of training variables by a factor of 2 as opposed to reducing them.</li>
</ul>
<p>The goal is to examine predictive performance of these models. I examine the performance of these models on three data sets:</p>
<ul>
<li>The diabetes dataset from sklearn</li>
<li>The Boston housing dataset from sklearn. I’ve removed binary indicators from this data because they screw with my spline implementation. No model sees the binary indicator, so its still fair.</li>
<li>A dataset of my own creation. I simulate 2,500 observations of 50 features from a standard Gaussian. The first 10 features have a non-linear effect that was created by expanding those features into a restricted cubic spline with 7 degrees of freedom. I use 7 and not 3 because I want to investigate model misspecification (which is always the case) so as not to inflate the performance of the last model too much. The next 20 features have an effect of 0 (they are nuisance variables) and the remaining features just have linear effects. I sample the coefficients for the data from a standard normal and add Student t noise with 10 degrees of freedom (something a little fatter in the tails than a normal, for some potential outliers).</li>
</ul>
<h1 id="methods">Methods</h1>
<p>Each model is examined through repeated 10 fold cross validation performed 100 times (that’s a total of 1000 resamples and refits). I ensure each model sees the same 1000 random splits so results are comparable. I don’t have an explicit test set because the diabetes and Boston data are small and any single test set could yield noisy estimates of test loss. The repeats of 10 fold CV are meant to circumvent this. I choose mean squared error as my loss function because it heavily penalizes models which cannot make extreme predictions when necessary. I also monitor the variability in the MSE from fold to fold (essentially computing the standard deviation of the 1000 MSE evaluations). Ostensibly, this should be a measure of how variable the model could be were we to get a different training/testing set from the same data generating process.</p>
<p>I make all code available on <a href="https://github.com/Dpananos/FeatureSelectionSimulation">github</a> for you to extend. I would love to see more models applied to this data (perhaps selection from random forests, or feature selection through one of the techniques described here followed by a random forest. Go nuts).</p>
<h1 id="quick-note--what-is-the-right-way-to-select-variables-if-youre-going-to-do-it">Quick Note: What Is The Right Way To Select Variables If You’re Going To Do It?</h1>
<p>The wrong way to select variables is to inspect correlations/perform lasso/do stepwise selection on the entire training set <em>and then</em> cross validate after choosing features. The selection procedure is part of the model. Had you gotten different training data, you might have selected different features! The right way to make selection part of the model is to write a transformer which does it for you and put it in a pipeline. Here is the LASSO selection transformer I wrote for the experiment:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">LassoSelector</span><span class="p">(</span><span class="n">BaseEstimator</span><span class="p">,</span> <span class="n">TransformerMixin</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">None</span>
    <span class="k">def</span> <span class="nf">fit</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">mod_</span> <span class="o">=</span> <span class="n">LassoCV</span><span class="p">(</span><span class="n">cv</span><span class="o">=</span><span class="mi">10</span><span class="p">).</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">y</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span>
    <span class="k">def</span> <span class="nf">transform</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">coef_</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">mod_</span><span class="p">.</span><span class="n">coef_</span>
        <span class="k">return</span> <span class="n">X</span><span class="p">[:,</span> <span class="n">np</span><span class="p">.</span><span class="nb">abs</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">coef_</span><span class="p">)</span><span class="o">></span><span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>
<p>When we do cross validation now, the training data is sent to the <code class="language-plaintext highlighter-rouge">LassoCV</code> where a LASSO model is fit. Once the model is fit, we can determine which coefficients were selected by the model in the <code class="language-plaintext highlighter-rouge">transform</code> step. Now, when we pass the held out data in cross validation, the held out data’s variables are selected based only on the data the model was able to see during training. This properly accounts for the variability induced by the selection and keeps your models honest! A similar class can be written for selection with correlations:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">CorrelationSelector</span><span class="p">(</span><span class="n">BaseEstimator</span><span class="p">,</span> <span class="n">TransformerMixin</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">cutoff</span> <span class="o">=</span> <span class="mf">0.2</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">cutoff</span> <span class="o">=</span> <span class="n">cutoff</span>
        <span class="k">return</span> <span class="bp">None</span>
    <span class="k">def</span> <span class="nf">fit</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">correlations_</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">pearsonr</span><span class="p">(</span><span class="n">j</span><span class="p">,</span> <span class="n">y</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">X</span><span class="p">.</span><span class="n">T</span><span class="p">])</span>
        <span class="k">return</span> <span class="bp">self</span>
    <span class="k">def</span> <span class="nf">transform</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="n">selection</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">argwhere</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="nb">abs</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">correlations_</span><span class="p">)</span><span class="o">></span><span class="bp">self</span><span class="p">.</span><span class="n">cutoff</span><span class="p">).</span><span class="n">ravel</span><span class="p">()</span>
        <span class="k">return</span> <span class="n">X</span><span class="p">[:,</span> <span class="n">selection</span><span class="p">]</span>
</code></pre></div></div>
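<p>Either selector then goes inside a pipeline, so that selection is refit on each training fold. Below is a hedged sketch of how this wiring could look; I repeat a compact version of the LASSO selector so the snippet is self-contained, and the step names and use of the diabetes data are my own illustrative choices, not necessarily what the repo does:</p>

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

class LassoSelector(BaseEstimator, TransformerMixin):
    """Keep only the columns LassoCV assigns a non-zero coefficient."""
    def fit(self, X, y):
        self.mod_ = LassoCV(cv=10).fit(X, y)
        return self
    def transform(self, X, y=None):
        return X[:, np.abs(self.mod_.coef_) > 0]

X, y = load_diabetes(return_X_y=True)

# Selection happens inside each training fold, so the held-out data never
# influences which features are kept.
pipe = Pipeline([("select", LassoSelector()), ("ols", LinearRegression())])
scores = cross_val_score(pipe, X, y, scoring="neg_mean_squared_error", cv=10)
print(-scores.mean())
```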
<h1 id="results">Results</h1>
<p>Shown below are the results of the experiments. For each dataset, I plot the relative difference in expected cross validated MSE (relative to linear regression) against the relative difference in the fold to fold standard deviation (again, relative to linear regression). Linear regression is at the intersection of the gray lines for reference. Models in the bottom left hand corner have better MSE as compared to linear regression and are less variable fold to fold.</p>
<p>In all three datasets, the splines model has superior MSE as compared to the other models. In fact, linear regression has superior MSE as compared to the selection models across all datasets (with the small exception of LASSO selection in the Boston dataset, though the difference is smaller than 1% and in my own opinion negligible).</p>
<p>Variability from fold to fold seems to depend on the dataset. The spline model is less variable in the Boston and non-linear datasets, but more variable in the diabetes dataset.</p>
<div style="text-align:center"><img src="/images/blog/diabetes.png" /></div>
<div style="text-align:center"><img src="/images/blog/boston.png" /></div>
<div style="text-align:center"><img src="/images/blog/nonlinear.png" /></div>
<h1 id="discussion">Discussion</h1>
<p>From these experiments, we can conclude that adding additional features rather than selecting existing ones leads to smaller MSE (although the No Free Lunch Theorem prevents us from generalizing to other datasets). The results are consistent across two datasets derived from real world data generating processes and a third synthetic dataset of my own design.</p>
<p>What might explain these results? First, splines add additional features to the model, meaning we can estimate a much richer space of functions from the data. Considering we can estimate a broader space of functions, it is no surprise the splines come out on top, especially considering the competing models are all high bias models only capable of modelling linear effects. I imagine that were we to apply other non-linear modelling techniques (e.g. random forests or neural networks), their MSE would also be lower than linear regression’s. However, splines offer a nice halfway point between the simplicity of linear regression and the black box of neural networks or random forests: we can add additional flexibility to the model, thereby decreasing loss, while remaining interpretable and easily maintainable.</p>
<p>Frank Harrell makes another good argument for splines, which I will summarize here. When considering using splines there are four distinct possibilities:</p>
<p>1) We don’t use splines and the effect is linear. In this case, a simple model will suffice and we benefit from lower variance fits.</p>
<p>2) We use splines and the effect is linear. In this case, we spend extra degrees of freedom unnecessarily, increasing the variance of our fits, but the effect of the variable is appropriately estimated.</p>
<p>3) We don’t use splines and the effect is non-linear. This has the potential to be catastrophic! Imagine that a variable has an effect on $y$ that looks like $y = x^2$. If $x$ is mean centered, then the linear fit could estimate the effect to be 0 or far too small.</p>
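<p>This hazard is easy to demonstrate numerically. For a centered standard normal $x$ and $y = x^2$, the population covariance between $x$ and $y$ is $E[x^3] = 0$, so the OLS slope is essentially zero despite a perfect deterministic relationship:</p>

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(5000)   # mean-centered by construction
y = x**2                        # purely non-linear effect, no noise at all

# OLS slope for a single centered predictor is cov(x, y) / var(x).
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
print(slope)  # close to 0: the linear fit sees almost no effect
```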
<p>4) We use splines and the effect is non-linear. This would be the best case scenario where we appropriately spend our degrees of freedom.</p>
<p>Examining all four cases shows that using splines is nearly always a good idea. With enough data, the variability in the fit should shrink, nullifying the downsides of point 2. The risk of point 3 is likely what explains the abysmal performance of correlation selection in the non-linear dataset. My intuition here is that some of the non-linear variables would have small correlations but a strong non-linear effect (kind of like the $x^2$ example). Neglecting these variables is a huge mistake as they can reduce the MSE, explaining why correlation selection has an MSE nearly 3x that of OLS. I’ve not conducted a post mortem on these experiments, but I anticipate this to be the cause.</p>
<h1 id="conclusion">Conclusion</h1>
<p>I provide experimental motivation for the use of feature engineering (specifically through splines) over feature selection when optimizing for predictive accuracy. Five models were applied across 3 datasets in order to examine the relative improvement of selection procedures over linear regression. To measure performance, mean squared error was computed over 100 repeats of 10 fold cross validation. As a comparator, I included a spline model which adds features rather than removing them. The spline model outperformed all other models across all datasets, and linear regression had either superior or comparable performance to the selection procedures. From these experiments, coupled with knowledge of how additional variability can impact expected loss, I conclude that, rather than selecting features for inclusion in the model, students should consider carefully balancing the additional variability of splines against the reduction in bias – and hence loss – they can impart.</p>
<p>The experiment has several limitations. In particular, the spline model I estimated was still quite biased. It would be important to simulate another dataset in which the effects are non-linear and the model we use has too many degrees of freedom. The variability in this approach may attenuate the improvement we see due to splines. Additionally, complexity is not explicitly accounted for and is only allowed to penalize models via poor predictive performance. A natural extension would be to measure AIC of all models since they are linear (except perhaps the lasso where effective degrees of freedom may or may not be appropriate to use in AIC).</p>Demetri Pananosdpananos@gmail.comStudents in the class I TA love to do feature selection prior to modelling. Examining pairwise correlation and dropping seemingly uncorrelated features is one way they do this, but they also love to fit a LASSO model to their data and refit a model with the selected variables, or they might do stepwise selection if they are feeling in the mood to code it up in python.Intuitive Formulae Are Not Always Right2020-10-22T00:00:00-07:002020-10-22T00:00:00-07:00https://dpananos.github.io/posts/2020/10/blog-post-27<p>In the data science class I help TA, we’re going over confidence intervals. Thanks to the central limit theorem, we can report confidence intervals for the out of sample generalization error. Let’s assume our loss function is mean squared error. A confidence interval would then be</p>
\[\widehat{MSE} \pm z_{1-\alpha/2} \dfrac{\sigma_{MSE}}{\sqrt{n}} \>.\]
<p>Here, $\sigma_{MSE}$ is the standard deviation of the squared residuals (squared because our loss is squared error). Well, if you’re familiar with $R^2$, you might recognize that</p>
\[R^2 = 1- \dfrac{MSE}{SSE}\]
<p>and be tempted to take your confidence interval for MSE and just divide it by the SSE (which is $\sum_i (y_i - \bar{y})^2$) and call that a confidence interval for $R^2$.</p>
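<p>To be concrete about the recipe: writing $\widehat{MSE}$ as the mean of the squared residuals, dividing by SSE requires rescaling by $n$ so that the point estimate matches the usual in-sample $R^2$. Here is a sketch of the naive interval construction; this is my reading of the recipe, and the variable names are mine:</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200
x = rng.standard_normal(n)
y = 2 * x + rng.standard_normal(n)

# Simple OLS fit and squared residuals.
slope, intercept = np.polyfit(x, y, 1)
sq_resid = (y - (slope * x + intercept)) ** 2

# CLT-based 95% interval for the expected squared error.
z = stats.norm.ppf(0.975)
mse_hat = sq_resid.mean()
se = sq_resid.std(ddof=1) / np.sqrt(n)
mse_lo, mse_hi = mse_hat - z * se, mse_hat + z * se

# The tempting step: map the MSE endpoints through R^2 = 1 - n*MSE/SSE,
# where SSE = sum((y - ybar)^2). High MSE maps to low R^2, so the
# endpoints swap.
sse = np.sum((y - y.mean()) ** 2)
r2_interval = (1 - n * mse_hi / sse, 1 - n * mse_lo / sse)
print(r2_interval)
```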
<p>This works…in a very narrow set of circumstances. For most problems, this results in coverage above or far below the nominal and also risks covering impossible $R^2$ values. Let’s make some very charitable assumptions and see just how well this works.</p>
<h2 id="charitable-assumpions">Charitable Assumptions</h2>
<p>Let’s analyze this interval for a single variable regression. Let’s make the assumption that the predictor is normally distributed with mean $\mu$ and variance $\tau^2$.</p>
<p>Now, assume $y\vert x \sim \mathcal{N}(\alpha x + \beta, \sigma^2)$. The marginal distribution of $y$ is then $y \sim \mathcal{N}(\alpha \mu + \beta, \sigma^2 + \alpha^2\tau^2)$, meaning that the exact value for $R^2$ would be</p>
\[R^2 = 1 - \dfrac{\sigma^2}{\sigma^2 + \alpha^2\tau^2} = 1 - \dfrac{1}{1 + \left(\dfrac{\alpha \tau}{\sigma}\right)^2} \>.\]
<p>We can always standardize our predictor to have unit variance, so the quantity we really care about is $\alpha/\sigma$. I can rearrange the formula for $R^2$ to give me $\alpha$ as a function of $\sigma$, and since we really only care about the ratio of $\alpha$ and $\sigma$, I’ll assume $\sigma=1$ and modulate $\alpha$. Now, I can generate a regression problem whose true $R^2$ is whatever I want it to be. Here is some code to do that.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LinearRegression</span>
<span class="kn">from</span> <span class="nn">scipy.special</span> <span class="kn">import</span> <span class="n">expit</span><span class="p">,</span> <span class="n">logit</span>
<span class="kn">from</span> <span class="nn">itertools</span> <span class="kn">import</span> <span class="n">product</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="k">def</span> <span class="nf">make_regression_data</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">alpha</span><span class="p">,</span> <span class="n">sigma</span><span class="p">):</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="n">size</span> <span class="o">=</span> <span class="n">n</span><span class="p">)</span>
    <span class="n">X</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">y</span> <span class="o">=</span> <span class="n">alpha</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">sigma</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="n">n</span><span class="p">)</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">R2</span> <span class="o">=</span> <span class="mf">0.8</span>
<span class="n">sigma</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">alpha</span> <span class="o">=</span> <span class="n">sigma</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span> <span class="mi">1</span><span class="o">/</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">R2</span><span class="p">)</span> <span class="o">-</span><span class="mi">1</span> <span class="p">)</span>
<span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">make_regression_data</span><span class="p">(</span><span class="mi">1000</span><span class="p">,</span> <span class="n">alpha</span><span class="p">,</span> <span class="n">sigma</span><span class="p">)</span>
<span class="c1"># Scoring is r squared. Will print something close to 0.8. Any difference is sampling error.
</span><span class="n">LinearRegression</span><span class="p">().</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">y</span><span class="p">).</span><span class="n">score</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">y</span><span class="p">)</span>
</code></pre></div></div>
<p>Now all I have to do is:</p>
<ul>
<li>Specify an R squared and a sample size</li>
<li>Compute the “confidence interval”</li>
<li>Determine if the true $R^2$ is in the interval</li>
<li>Repeat</li>
</ul>
<p>The <em>coverage</em> of an interval estimator is the long term relative frequency with which it contains the true estimand. A 95% confidence interval should have a coverage of 0.95. A coverage lower than that means our interval is likely too narrow and is overstating the precision of our estimate. A coverage too high means the interval is giving credence to values of $R^2$ it should not, understating the precision of the estimate. It is important that our interval estimate have the appropriate coverage. Ok, moving on.</p>
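<p>The simulation loop itself is short. Below is a hedged sketch of a single cell of the coverage grid (true $R^2 = 0.5$, $n = 100$); the helper name and simulation settings are my own, and the full grid lives in the gist linked at the end of the post:</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def naive_r2_interval(x, y):
    """The 'divide the MSE interval by SSE' recipe under test."""
    n = len(y)
    slope, intercept = np.polyfit(x, y, 1)
    sq_resid = (y - (slope * x + intercept)) ** 2
    z = stats.norm.ppf(0.975)
    mse_hat, se = sq_resid.mean(), sq_resid.std(ddof=1) / np.sqrt(n)
    sse = np.sum((y - y.mean()) ** 2)
    return 1 - n * (mse_hat + z * se) / sse, 1 - n * (mse_hat - z * se) / sse

# One cell of the coverage grid: true R^2 = 0.5, n = 100.
true_r2, n, sigma = 0.5, 100, 1.0
alpha = sigma * np.sqrt(1 / (1 - true_r2) - 1)

hits, n_sims = 0, 2000
for _ in range(n_sims):
    x = rng.standard_normal(n)
    y = alpha * x + rng.normal(0, sigma, n)
    lo, hi = naive_r2_interval(x, y)
    hits += lo <= true_r2 <= hi

print(hits / n_sims)  # estimated coverage; compare with the nominal 0.95
```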
<h1 id="results">Results</h1>
<p>Code is included at the end of the blog post, but here is a plot of the coverage by R squared and sample size. As you can see, small values of $R^2$ have coverage over the nominal value, so we are always overstating what kinds of values of $R^2$ are consistent with the data. And for large values of $R^2$, we have coverage way, way below the nominal. If this interval were used as a means to test some hypothesis about $R^2$, then we would make a lot of false positives. A lot. I’ve coded the color map so that the color white represents the nominal coverage. As you can see, there is not a ton of white, meaning the interval estimate doesn’t really work as well as one might be led to believe.</p>
<div style="text-align:center"><img src="/images/blog/coverage.png" /></div>
<h1 id="code">Code</h1>
<p>I’ve included the code in a gist <a href="https://gist.github.com/Dpananos/5f9c026d3b21ec53638f6ce067c20184">here</a>.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Don’t do this. And more importantly, demand some sort of proof when people claim that something is as easy as just doing some algebra. Math has a lot of intuitive results, but there are also a ton of unintuitive results. Stats in particular is really unintuitive, but luckily we can use computers to check our intuition. Do make use of them.</p>Demetri Pananosdpananos@gmail.comIn the data science class I help TA, we’re going over confidence intervals. Thanks to the central limit theorem, we can report confidence intervals for the out of sample generalization error. Let’s assume our loss function is mean squared error. A confidence interval would then be3 Rules For Giving a Sh!t2020-10-18T00:00:00-07:002020-10-18T00:00:00-07:00https://dpananos.github.io/posts/2020/10/blog-post-26<p>I spend a lot of time on Cross Validated (CV). CV is my statistical escape from statistics, and it is also a place where I like to prove to myself that I am good at what I do.</p>
<p>On CV, people pose questions for the community to answer. Members post solutions, and OPs can accept a solution if it answers their question. If a question has multiple solutions, then the community can upvote solutions based on whatever they think merits an upvote. Each action (answer accepting and upvotes) gives members points.</p>
<p>Ah, fake internet points. We know them too well. From reddit upvotes to likes on twitter, these digital tally counters are to many (including myself) a measure of how “right” you are – where right is measured by consensus. More points means more people agreed with you, and how could you be wrong when so many people agree with you? It’s a form of external validation for me; a way to prove that I am good at statistics because other people are marking my answers as good.</p>
<p>That becomes draining very quickly. Sometimes I catch myself wasting time explaining simple stuff because I know it is an easy 20-30 points (equivalent to 2 upvotes and an accepted answer, or 3 upvotes). I’m wasting my time on questions of little importance to get validation from people I don’t know (except I do know some of them because they follow me on Twitter. Hi, Tim). I need to reel it in a little while still engaging (because it is kind of fun sometimes, and some of the answers I give are genuinely interesting and have led to some blog posts). I’ve developed 3 rules I check to see if an answer is worth committing to ink, er, HTML.</p>
<h2 id="1-does-an-answer-exist-in-a-place-op-really-should-have-looked">1) Does an answer exist in a place OP really should have looked?</h2>
<p>“How do I interpret log odds?” – Next. “What sort of statistical test do I need?” – Next. “What do I do if my predictor isn’t normal?” – Next. I don’t need to waste my time answering something which exists in software documentation or in introductory stats books. If the question can be answered by reading, or by copying and pasting the appropriate link to canonical resources, I don’t waste my time. I might comment and say something like “A good place to look might be…”, but that is it.</p>
<h2 id="2-is-the-answer-complex-or-complicated">2) Is the answer complex or complicated?</h2>
<p>As in the zen of python, complex is preferable to complicated. Complex would mean that the answer is non-trivial, but the “juice is worth the squeeze”, so to speak. If what we get out of it is an interesting insight, then I will commit my time. Complicated would mean that there is nothing interesting to come out of the answer and getting the answer is tedious. If that sounds ambiguous, that is because I intended it to be, so that I could manipulate these rules at will. Remember, these rules serve me and not the other way around.</p>
<h2 id="3-do-i-give-a-shit">3) Do I give a shit?</h2>
<p>This last rule is actually a function of the other two. If the answer exists elsewhere, then I likely do not give a shit. If the answer is complicated, then I likely do not give a shit. If the answer is novel (or at least novel enough to me) but complicated, then I might give a shit. If the question has been answered before but I have the opportunity to give my own insight and opinion (i.e. the answer is complex), then I might give a shit. This rule really decides, for 80% of questions, whether I take the time or not. If I give a shit about what you’re asking, I am much more likely to contribute even if one of the other two rules is violated.</p>
<p>Do I follow these all the time? No, but I do find myself using them more frequently. My score on CV has dipped a little, but hey, that is the trade-off. Now I find I’m not wasting my time explaining first-year stats to some poor grad student who just wants a p value. That serves both me and them better in the end.</p>