Zachary Lipton recently tweeted that sklearn’s LogisticRegression uses a penalty by default. This resulted in some heated Twitter debates about the differences in attitudes between statistics and machine learning researchers and the responsibility of users to read the documentation, amongst other things.
My initial reaction was to jump to the defense of the sklearn developers. In essence, I thought (and to some extent still do think) that practitioners should take responsibility for the tools they use to draw their conclusions. I have since had a day or so to sit on that and let other facts come to light.
While I maintain data analysts/scientists/researchers need to be mindful of their tools, I have reluctantly also come to agree that the designers of these tools should be mindful of their use. Below, I outline some of my thinking, linking to interesting discussions and facts as they have been shown to me.
Arguments For A More Transparent LogisticRegression
The arguments for why LogisticRegression should have an unbiased implementation as the default are numerous. I won’t go through them all here (you can read @ryxcommar’s very thoughtful and well-argued article here should you need a refresher). In short, if the function’s name is LogisticRegression then we should expect that it performs logistic regression and not some penalized variant thereof. Few people read the docs (for reasons that escape me), and so it can be very easy to go through your data scientific life blissfully unaware that you are biasing your results.
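To make the default concrete, here is a minimal sketch comparing the out-of-the-box fit with a fit where the penalty is made negligible. The simulated data, the true slope of 2.0, and the variable names are all assumptions for illustration. A very large C is used as a version-agnostic way to approximate the unpenalized fit; recent releases also offer an explicit way to switch the penalty off, so check the docs for your installed version.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical simulated data: one feature, true slope 2.0, intercept 0.
rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 1))
prob = 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))
y = rng.binomial(1, prob)

# Out-of-the-box settings: an L2 penalty with C=1.0 is applied.
default_fit = LogisticRegression().fit(X, y)

# C is the inverse penalty strength, so a huge C makes the penalty
# negligible and the fit approximates plain maximum likelihood.
weak_penalty_fit = LogisticRegression(C=1e10, max_iter=1000).fit(X, y)

default_slope = default_fit.coef_[0, 0]
mle_slope = weak_penalty_fit.coef_[0, 0]

print(f"default (penalized) slope: {default_slope:.3f}")
print(f"near-unpenalized slope:    {mle_slope:.3f}")
```

The default slope should come out closer to zero than the near-unpenalized one. That shrinkage is the bias in question, and nothing in the function’s name hints at it.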
I thought it was apparent that sklearn was an ML library first and foremost, and thus preferred to accept some bias in exchange for lower variance. The devs have made that abundantly clear in several issues on GitHub. I was wrong on this.
Creating Tools For Science Is A Great Power, And With Great Power…
A lot of my initial reaction was to put blame primarily on the user. After all, no one forced anybody to use sklearn. If only the user had read the docs and understood what they were doing, then this all could have been avoided. When I make a mistake, I don’t blame someone else, I say “damn, I should have read the docs!”. Silly user, it’s your fault!
Another argument I had was that a lot of software has really crappy or unintuitive defaults that we have just come to live with. Stata drops missing data, SAS spits out superfluous statistics making it tough for people to find what they need, and R uses Welch’s t-test in t.test and Clopper-Pearson intervals in binom.test, in contrast to what even graduate students are taught in class. If users are, for instance, familiar with Wald intervals, then the use of Clopper-Pearson intervals may result in some confusion when they realize the interval is not symmetric about the estimated proportion. If these are sane defaults given the context of statistics (and they are, or at least the ones in R are; no one is saying they aren’t), then why shouldn’t automatic regularization be the standard in the context of ML, which, as I’ve said before, routinely accepts bias in exchange for lower variance? Context is important, silly user! Read the docs!
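The asymmetry of the Clopper-Pearson interval is easy to see numerically. Below is a rough standard-library sketch, not R’s implementation: the helper names and the example of 3 successes in 20 trials are assumptions for illustration. The exact limits are found by bisection on the binomial tail probabilities.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def bisect(f, lo, hi, tol=1e-10):
    """Root of a monotone function f that changes sign on [lo, hi]."""
    f_lo_positive = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact interval: each tail probability at the limit equals alpha / 2."""
    lower = 0.0 if k == 0 else bisect(
        lambda p: 1 - binom_cdf(k - 1, n, p) - alpha / 2, 0.0, 1.0)
    upper = 1.0 if k == n else bisect(
        lambda p: binom_cdf(k, n, p) - alpha / 2, 0.0, 1.0)
    return lower, upper

def wald(k, n, z=1.959963984540054):
    """Normal-approximation interval, symmetric about k / n by construction."""
    p_hat = k / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

k, n = 3, 20  # hypothetical data
p_hat = k / n
cp_lo, cp_hi = clopper_pearson(k, n)
w_lo, w_hi = wald(k, n)

print(f"estimate:        {p_hat:.3f}")
print(f"Clopper-Pearson: ({cp_lo:.4f}, {cp_hi:.4f})")
print(f"Wald:            ({w_lo:.4f}, {w_hi:.4f})")
```

For this small sample the exact interval reaches further above the estimate than below it, while the Wald interval is symmetric and its lower limit dips below zero, which is one reason a default other than the textbook interval can be defensible.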
I was quick to blame the user because they really are the last line of defense. If you design a study, analyze the data, and then send it for review, it is up to you to make sure that your paper accurately reflects what you said you did. No one else can do that for you. The reviewers act in this capacity in a small way, but if you said you did logistic regression, how are they to know any different? This brings up the issue of code review, but let’s leave that alone for now.
But, as I do, I forgot that we’re all human. We get tired, we skim, we look for shortcuts, and we fail to find the details we really need. Even the best of us do that. Even I do this, and I think I have been the most critical of people who do this. Now, I don’t find this a completely absolving argument. I mean, how hard is it to Google the docs or run ?LogisticRegression in the terminal? But, I digress.
And that is where the design of scientific tools can come to the rescue. Making tools easier and clearer to use can combat a lot of this human error. I’ve reluctantly come to the conclusion that sklearn should have implemented an unbiased logistic regression by default (or at least named LogisticRegression more appropriately).
Moreover, I think the devs may need to consider that their library is more popular than perhaps they initially thought it might become. That sort of popularity brings a lot of power, and with great power…well, you know the rest.
I was especially disappointed to read this comment by reddit user /u/shaggorama, which states that sklearn quietly deprecated their bootstrap validator after it came to light that the implemented method was more or less ad hoc and did not actually implement the bootstrap. I find this particularly egregious because it is arguably more deceptive than LogisticRegression. Not only that, but I was shocked to see Andreas Müller plainly ask why anyone would want an unbiased implementation of logistic regression.
If we should expect that practitioners act in good faith, then tool makers should create tools which allow practitioners to act in good faith easily. To do otherwise, I have come to conclude, is to obstruct good science.
While I maintain users have a responsibility to know what they are doing, I jumped too quickly to blame the user. After reading some of the links I have posted here, I’ve changed my mind. Users can only shoulder so much responsibility, and I think if toolmakers want to make good tools then they should consider the perspective of the user.