Let me ask you a question: Considering logistic regression can be performed without the use of a penalty parameter, why does sklearn include a penalty in their implementation of logistic regression? I think most people would reply with something about overfitting, which I suppose is a reasonable answer, but isn’t very satisfactory, especially since the documentation for
sklearn.linear_model.LogisticRegression() is awash with optimization terminology and never mentions overfitting.
I can’t say exactly why the authors of
sklearn.linear_model.LogisticRegression() included the
C parameter, but I am willing to bet it is to ensure the fitting algorithm converges in the case of complete or quasi complete separation. Let me explain what I mean through the use of a motivating example. Let’s create a univariate classification problem. Generate a sequence of gaussian random numbers and if the number is greater than 0, then let the outcome for that observation be 1, else 0. The set up for this problem is straight forward in python.
import numpy as np from sklearn.linear_model import LogisticRegression from scipy.optimize import minimize #Set a random seed for reproducibility np.random.seed(19920908) data = [[j,1] if j>0 else [j,0] for j in np.random.normal(size = 20) ] X,Y= np.array(data).T #Create a design matrix for the X variable lX = np.c_[np.ones(X.size),X]
A priori, we know the decision boundary is at $x=0$ and that negative observations will have $y=0$. This is called complete separation because there exists a line which completely separates the two classes. Let’s fit a logistic regression with sklearn and see what we get.
reg = LogisticRegression() reg.fit(lX,Y) reg.coef_
Sklearn returns coefficients $\beta_0 = 0.14$ and $\beta_1 = 1.91$. This means that for every one unit increase in $x$, the odds that $y=1$ increases by $\exp(1.91) \approx 6.75$. Let’s try fitting a logistic regression on the same data through MLE. We can use
scipy to optimize the log Likelihood
def LL(beta,X,Y): #Negative likelihood because we are going to minimize it return -Y@(X@beta) + np.sum( np.log( 1+np.exp(X@beta) ) ) beta_guess = np.array([0,0]) #Initial guess minimize(lambda B: LL(B,lX,Y), beta_guess) #perform the optimization
The optimization fails to converge. Try it again but this time require more precision. Still get the same thing? Hmm, weird, maybe I did something wrong. Try using
statsmodels.Logit instead. Did the function throw you an error stating
Maximum Likelihood optimization failed to converge? I thought it might.
What Is Going On?
Let’s fix the intercept of the model and just allow the coefficient for $x$ to be optimized. You can see that as $\beta_1$ increases, the log likelihood levels off. However, and here is the important part, the likelihood does not have a global maximum and so the algorithm will fail to converge.
OK, a natural question now is “why is there no maximum?”. The coefficients of a logistic regression represent the change in the log odds when the predictor is increased by one unit. In our problem, the odds increase infinitely after we cross 0. That means we are trying to estimate an infinitely large number (good luck with that). Since no maximum exists, the optimization fails to converge.
But sklearn’s implementation converged and gave us a finite $\beta_1$. What gives? Remember that penalized logistic regression is just a convex optimization of the log likelihood in disguise. The penalization parameter essentially forces the optimization to converge by maximizing over a subset of the parameter space. Go ahead and refit sklearn’s logistic regression, but now specify
C=10000.0. What pattern do you see in $\beta_1$ when you increase the penalty paramater? You should see $\beta_1$ get larger because the space over which the likelihood is being maximized is getting larger.
This is a totally fabricated example, but separation does happen. The inclusion of a penalty parameter totally avoids convergence problems at the cost of biased estimates of the log odds ratio (which really isn’t a problem because sklearn’s goal is prediction, not statistical inference).
A common question over at /r/datascience is “how much math do I need for data science?”. I’ve waffled a back and forth on the answer to this question, but decided recently on my answer. I think if you can read this blog post and understand the conclusion, then you understand enough math. I am sure people will disagree with me and say that you need far more/less math (or that if I can not explain this to a 5 year old then I am not a good data scientist), so let me acknowledge those right now and say that this is my own personal litmus test, not a classification algorithm. You don’t need to be an expert on statistics, you don’t need to understand esoteric theory, you just need to understand the big picture.