Demetri Pananos Ph.D - Journeyman Statistics

A number of people have asked me how to learn statistics. I don’t have a good answer for them. I find all books are deficient in some way: some too theoretical, others not theoretical enough. Some too focused on pen and paper calculation, others provide code that was very likely written decades ago and do not use modern packages or tidy principles (e.g they begin their analysis with rm(list=ls() or use attach and detatch). I can’t recommend my own path to learning statistics either, mostly because I did a Ph.D in statistics and had the benefit of a Masters in Applied Mathematics.

I want to write a book for people who are not statisticians but need to make use of statistics anyway. The medical residents who need to do logistic regression for a research project, the social science grad student (save econometricians, I’d say we could learn something from them but we probably know all the same things by different names) who is more interested in their research than their statistical models, the business intelligence analyst who has to to analyze a (poorly planned) A/B test and would love nothing than to improve the experiment the next time around.

This sequence of blog posts is going to be a sort of first go at that book. Not even an alpha or a rough draft, but rather somewhere to put some thoughts that might eventually make it into the book. This first post is about the motivation for the book, which I am tentatively calling “Journeyman Statistics”.

What is Journeyman Statistics?

A journeyperson/woman/man is

“a worker, skilled in a given building trade or craft, who has successfully completed an official apprenticeship qualification. Journeymen are considered competent and authorized to work in that field as a fully qualified employee. They earn their license by education, supervised experience and examination. Although journeymen have completed a trade certificate and are allowed to work as employees, they may not yet work as self-employed master craftsmen.

What do I mean then by “journeyman statistics” and who are “journeyman statisticians”? Borrowing heavily from the definition above, journeyman statisticians are people who are trained in some field and are currently doing statistics in service of someone or something else. Journeyman statistics are then the statistical analyses performed by these people. I don’t think its tough to pick our journeyman statisticians; they are to a first approximation those who perform statistical analysis but are not statisticians first and foremost. They are biologists, medical residents, sociologists, analysts of several varietys, etc. To them, statistics is the means whereas statistics are the end to research (perhaps “pure”) statisticians.

Its important to further distinguish journeyman statisticians from applied statisticians. Applied statisticians can, as Tukey once said, “play in everyone’s backyard”. They possess the necessary mathematical maturity and statistical expertise to move from field to field. A journeyman statistician, though they may be well versed in statistics, would likely stay in their own backyard (to continue the metaphor) in order to tackle problems there.

Of course, I don’t mean to place people in boxes. You don’t need to subscribe to my taxonomy of statisticians (in fact, outside of this book I don’t think its a particularly useful taxonomy), and I think there are edge cases which threaten the taxonomy as a whole. The taxonomy is simply a model, and this model is useful for one thing: understanding motivations for learning statistics, and designing a path through statistical literature so as to serve those motivations.

Why Do We Need a Book on Journeyman Statistics?

The taxonomy allows us to understand who journeyman statisticians are, what their intentions are, what they may lack in terms of statistical understanding, what is enough to satiate their desire to learn statistics, and what details contribute “noise” rather than “signal”. As an example, I don’t think biology grad students need to know what \(\operatorname{plim}\) means, or any of the other topics adjacent to mathematical analysis in Casella and Berger’s Statistical Inference. They do need to go slightly beyond their sophmore classes which tell them they can use the t-test when \(N>30\). Likewise, medical residents need to go beyond Martin Bland’s An Introduction to Medical Statistics and need to be able to confidently say “we shouldn’t do that” when their supervisors or superiors insist on a clearly flawed mode of analysis. However, they may get bogged down by the integrals in Frank Harrell’s Regression Modelling Strategies (as well as the sea of references to methodological papers). There are few books for people like that. I find that books are typically for sophmore students learning statistics for the first time, or applied statisticians. Journeyman statisticians need something in the middle. I hope this book is that something.

What Will This Book Contain?

Statisticians like to joke “its all regression”. There is truth in that phrase, and so this book will take the perspective that regression is the primary means of estimation. We’ll cover all the typical analyses as regression methods. This includes estimation of the mean, its just a one parameter regression. I want to get to GLM’s as quickly as possible while not getting bogged down by mathematical details, like whatever “mild regularity conditions” means. GLMs are the workhorse of applied statistics, and I see no point in leaving GLMs to later chapters. One thing that will be absent from the book is p values. The book will take an estimation approach and report only confidence intervals.

The book will also contain code in both python and R, though I encourage readers to use R rather than python. Python’s statistical tools have a distinct econometrics flavor.

What Benefit is There to Reading This Book?

Before discussing why you should read this book, I want to discuss what I call “The Precision-Usefulness Tradeoff”, as I anticipate I will refer to this many times. In short, the tradeoff states

Perfectly precise statements are completely useless. In order to become useful, the statement must be made less precise. The less precise, the more useful.

Again, this is a model rather than a law. As an example, consider the definition for a 95% confidence interval.

A 95% confidence interval is an interval estimate which, upon repeated construction under identical and ideal conditions, contains the estimand 95% of the time.

I can make this statement more precise by writing it down mathematically

\[ P \left( \bar{x} - z_{0.975} \sigma/\sqrt{n} \leq \mu \leq \bar{x} + z_{0.975} \sigma/\sqrt{n} \right) = 0.95 \]

but it loses usefulness. This doesn’t really tell me what a confidence interval is, when to use one, or how to interpret one. However, the definition I initially presented perhaps permits some pathological cases. We can make the definition more useful by removing precision further

A 95% confidence interval contains parameters consistent with the data.

Now, we have a better idea how to interpret the interval, but the 95% is opaque to us.

The reason I bring up this trade off is because the book is intended to give journeyman statisticians the tools required to move in Precision-Usefulness space. My hope is that when the time comes, you will be able to artfully trade precision for usefulness (e.g. to be pedantic when it is necessary, or to break the rules precisely because you know how and when they can be broken safely). This is the primary benefit of the book. Obviously, I can’t tell people when or where to move within that space. I will have to leave that to their best judgement.

The second benefit is to solve the problem we began with; to answer “how to I go about learning statistics” in a satisfactory way.