Don’t Select Features, Engineer Them


Students in the class I TA love to do feature selection prior to modelling. Examining pairwise correlations and dropping seemingly uncorrelated features is one way they do this, but they also love to fit a LASSO model to their data and refit a model with the selected variables, or they might do stepwise selection if they are in the mood to code it up in Python.

Feature selection can be an important thing to do if you want to optimize the ratio of predictive performance to the cost of data acquisition. Why spend money getting customer shoe size if it doesn’t help you predict the propensity to buy? However, students don’t rationalize their feature selection this way. I suspect they do it because they feel it will help them achieve a lower loss and better predictive accuracy.

Let me start by saying that I think examining pairwise correlations is a bad way to select features. First, this approach is almost never validated correctly; second, it ignores confounding relationships; third, correlations only measure the strength of a linear relationship; and fourth, it is very easy to construct a dataset in which the correlations get the variable importance completely wrong. Let $x$ and $z$ be independent binary random variables and let $y$ be their XOR. Knowing $x$ and $z$ determines $y$ exactly, yet the correlation between either variable and $y$, as measured by Kendall’s $\tau$, is 0. Strict selection based on correlation would completely miss this relationship.
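
Here is a quick sketch of that XOR example (the sample size and seed are just for illustration):

import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=10_000)
z = rng.integers(0, 2, size=10_000)
y = np.logical_xor(x, z).astype(int)  # y is fully determined by (x, z)

# Yet the marginal association of each feature with y is essentially zero
tau_x, _ = kendalltau(x, y)
tau_z, _ = kendalltau(z, y)
print(tau_x, tau_z)  # both approximately 0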

In this blog post, I argue that feature selection is at best unnecessary when the goal is optimal out-of-sample predictive performance, and at worst increases test loss. If the goal is good predictions, then what students – and perhaps practitioners too – should be doing is engineering new features. The question then becomes “what features?” and to that I answer “splines!”.

In what follows, I describe 3 experiments in which I examine 5 models. Those models are:

  • Linear Regression. This serves as our baseline. We just throw everything in the model and get what we get.
  • Linear Regression + Feature Selection via Correlations. Any features with an absolute correlation of 0.2 or larger are included in the model.
  • Linear Regression + Feature Selection via Lasso. I use 10 fold cross validation to first fit a LASSO model, then take all the non-zero coefficients from that model and refit a linear regression.
  • LASSO fit via 10 fold cross validation to examine how shrinkage might compare to these models, and
  • A linear model where each feature is expanded into a restricted cubic spline with 3 degrees of freedom. That is, one additional feature for every feature in the training data, increasing the number of training variables by a factor of 2 rather than reducing them (a sketch of this expansion follows the list).
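
Something like the transformer below sketches the spline expansion, using patsy’s natural cubic regression spline basis (cr) as a stand-in for my implementation. The class name and details are illustrative, and degrees-of-freedom conventions differ between implementations, so the exact column count may not match the factor-of-2 description above; see the repository for the code actually used.

import numpy as np
from patsy import dmatrix, build_design_matrices
from sklearn.base import BaseEstimator, TransformerMixin


class SplineExpander(BaseEstimator, TransformerMixin):
    """Expand each column into a natural (restricted) cubic spline basis."""

    def __init__(self, df=3):
        self.df = df

    def fit(self, X, y=None):
        # Knot locations are learned from the training fold only
        X = np.asarray(X)
        self.design_infos_ = [
            dmatrix(f"cr(x, df={self.df}) - 1", {"x": col}).design_info
            for col in X.T
        ]
        return self

    def transform(self, X, y=None):
        # Apply the training-derived basis to new data, column by column
        X = np.asarray(X)
        blocks = [
            np.asarray(build_design_matrices([info], {"x": col})[0])
            for info, col in zip(self.design_infos_, X.T)
        ]
        return np.hstack(blocks)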

The goal is to examine predictive performance of these models. I examine the performance of these models on three data sets:

  • The diabetes dataset from sklearn
  • The Boston housing dataset from sklearn. I’ve removed binary indicators from this data because they screw with my spline implementation. No model sees the binary indicator, so it’s still fair.
  • A dataset of my own creation. I simulate 2,500 observations of 50 features from a standard Gaussian. The first 10 features have a non-linear effect created by expanding those features into a restricted cubic spline with 7 degrees of freedom. I use 7 and not 3 because I want to investigate model misspecification (which is always the case in practice) and avoid inflating the performance of the last model too much. The next 20 features have an effect of 0 (they are nuisance variables) and the remaining features have linear effects. I sample the coefficients for the data from a standard normal and add Student’s t noise with 10 degrees of freedom (something a little fatter in the tails than a normal, for some potential outliers). A sketch of this simulation follows the list.
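
Roughly, the simulation looks like this (the seed, coefficient scales, and spline basis shown here are illustrative; the repository has the exact code):

import numpy as np
from patsy import dmatrix

rng = np.random.default_rng(42)
n, p = 2_500, 50

X = rng.standard_normal((n, p))

# First 10 features: non-linear effects via a 7-df restricted cubic spline basis
nonlinear_part = np.hstack([
    np.asarray(dmatrix("cr(x, df=7) - 1", {"x": X[:, j]})) for j in range(10)
])

# Features 10-29 are nuisance variables (zero effect); the last 20 enter linearly
linear_part = X[:, 30:]

design = np.hstack([nonlinear_part, linear_part])
beta = rng.standard_normal(design.shape[1])

# Student t noise with 10 degrees of freedom: slightly fatter tails than a normal
y = design @ beta + rng.standard_t(10, size=n)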

Methods

Each model is examined through repeated 10 fold cross validation performed 100 times (that’s a total of 1,000 resamples and refits). I ensure each model sees the same 1,000 random splits so results are comparable. I don’t have an explicit test set because the diabetes and Boston data are small and any single test set could yield noisy estimates of test loss; the repeats of 10 fold CV are meant to circumvent this. I choose mean squared error as my loss function because it heavily penalizes models which cannot make extreme predictions when necessary. I also monitor the variability in the MSE fold to fold (essentially computing the standard deviation of the 1,000 MSE evaluations). Ostensibly, this should be a measure of how variable the model could be were we to get a different training/testing set from the same data generating process.
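
In sklearn this setup is just a RepeatedKFold splitter with a fixed seed. A minimal sketch using the diabetes data and the plain linear regression baseline (swap in any of the pipelines above for model):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()  # baseline; any of the other pipelines works here too

# 10 folds repeated 100 times = 1,000 resamples; the fixed random_state
# means every model is evaluated on exactly the same splits
cv = RepeatedKFold(n_splits=10, n_repeats=100, random_state=0)

scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
mse_mean = -scores.mean()   # expected cross validated MSE
mse_sd = scores.std()       # fold to fold variability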

I make all code available on github for you to extend. I would love to see more models applied to this data (perhaps selection from random forests, or one of the selection techniques described here followed by a random forest. Go nuts).

Quick Note: What Is The Right Way To Select Variables If You’re Going To Do It?

The wrong way to select variables is to inspect correlations/perform LASSO/do stepwise on the entire training set and then cross validate after choosing features. The selection procedure is part of the model. Had you gotten different training data, you might have selected different features! The right way to make selection part of the model is to write a transformer which does it for you and put it in a pipeline. Here is the LASSO selection transformer I wrote for the experiment:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LassoCV


class LassoSelector(BaseEstimator, TransformerMixin):
    """Select the features given non-zero coefficients by a cross-validated LASSO."""

    def fit(self, X, y):
        # Fit the LASSO on the training fold only
        self.mod_ = LassoCV(cv=10).fit(X, y)
        return self

    def transform(self, X, y=None):
        # Keep only the columns the LASSO did not shrink to zero
        self.coef_ = self.mod_.coef_
        return X[:, np.abs(self.coef_) > 0]

When we do cross validation now, the training data is sent to the LassoCV where a LASSO model is fit. Once the model is fit, we can determine which coefficients were selected and keep only those columns in the transform step. When we pass the held out data during cross validation, its variables are selected based only on the data the model was able to see during training. This properly accounts for the variability induced by the selection and keeps your models honest! A similar class can be written for selection with correlations:

from scipy.stats import pearsonr


class CorrelationSelector(BaseEstimator, TransformerMixin):
    """Select features whose absolute Pearson correlation with y exceeds a cutoff."""

    def __init__(self, cutoff=0.2):
        self.cutoff = cutoff

    def fit(self, X, y):
        # Correlations are computed on the training fold only
        self.correlations_ = np.array([pearsonr(j, y)[0] for j in X.T])
        return self

    def transform(self, X, y=None):
        selection = np.argwhere(np.abs(self.correlations_) > self.cutoff).ravel()
        return X[:, selection]
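
Either selector then slots into a pipeline in front of the regression, so each cross validation training fold re-selects its own features. A sketch, reusing X, y, and the cv splitter from the Methods section:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

lasso_select_lr = make_pipeline(LassoSelector(), LinearRegression())
corr_select_lr = make_pipeline(CorrelationSelector(cutoff=0.2), LinearRegression())

# Selection is re-run inside every training fold, never on the held out data
scores = cross_val_score(lasso_select_lr, X, y, cv=cv,
                         scoring="neg_mean_squared_error")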

Results

Shown below are the results of the experiments. For each dataset, I plot the relative difference in expected cross validated MSE (relative to linear regression) against the relative difference in the fold to fold standard deviation (again, relative to linear regression). Linear regression is at the intersection of the gray lines for reference. Models in the bottom left hand corner have better MSE than linear regression and are less variable fold to fold.

In all three datasets, the splines model has superior MSE as compared to the other models. In fact, linear regression has superior MSE as compared to the selection models across all datasets (with the small exception of LASSO selection in the Boston dataset, though the difference is smaller than 1% and in my own opinion negligible).

Variability fold to fold seems to depend on the dataset. The spline model is less variable in the Boston and non-linear datasets, but more variable in the diabetes dataset.

Discussion

From these experiments, we can conclude that adding additional features rather than selecting existing ones leads to smaller MSE (although the No Free Lunch Theorem prevents us from generalizing to other datasets). The results are consistent across two datasets derived from real world data generating processes and a third synthetic dataset of my own design.

What might explain these results? First, splines add additional features to the model, meaning we can estimate a much richer space of functions from the data. Considering we can estimate a broader space of functions, it is no surprise the splines come out on top, especially since the competing models are all high bias models capable only of modelling linear effects. I imagine that were we to apply other non-linear modelling techniques (e.g. random forests or neural networks), their MSE would also be lower than linear regression’s. However, splines offer a nice halfway point between the simplicity of linear regression and the black box of neural networks or random forests: we can add additional flexibility to the model, thereby decreasing loss, while remaining interpretable and easily maintainable.

Frank Harrell makes another good argument for splines, which I will summarize here. When considering using splines there are four distinct possibilities:

1) We don’t use splines and the effect is linear. In this case, a simple model will suffice and we benefit from lower variance fits.

2) We use splines and the effect is linear. In this case, we spend extra degrees of freedom unnecessarily, increasing the variance of our fits, but the effect of the variable is appropriately estimated.

3) We don’t use splines and the effect is non-linear. This has the potential to be catastrophic! Imagine that a variable has an effect on $y$ that looks like $y = x^2$. If $x$ is mean centered, then the linear fit could estimate the effect to be 0 or far too small.

4) We use splines and the effect is non-linear. This would be the best case scenario where we appropriately spend our degrees of freedom.

Examining all four cases shows that using splines is nearly always a good idea. With enough data, the variability in the fit should shrink, nullifying the downside of point 2. The risk of point 3 is likely what explains the abysmal performance of correlation selection in the non-linear dataset. My intuition is that some of the non-linear variables have small correlations with the outcome but strong non-linear effects (much like the $x^2$ example). Neglecting these variables is a huge mistake since they can reduce the MSE, explaining why correlation selection has an MSE nearly 3x that of OLS. I’ve not conducted a post mortem on these experiments, but I anticipate this to be the cause.
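
That intuition is easy to check in isolation (a toy demonstration of the $y = x^2$ case, not a post mortem on the actual simulated data):

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
x = rng.standard_normal(100_000)  # mean centered by construction
y = x ** 2

r, _ = pearsonr(x, y)
print(r)  # approximately 0, even though x completely determines y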

Conclusion

I provide experimental motivation for the use of feature engineering (specifically through splines) over feature selection when optimizing for predictive accuracy. Five models were applied across three datasets in order to examine the relative improvement of selection procedures over linear regression. To measure performance, mean squared error was computed over 100 repeats of 10 fold cross validation. As a comparator, I included a spline model which adds features rather than removing them. The spline model outperformed all other models across all datasets, and linear regression had either superior or comparable performance to the selection procedures. From these experiments, coupled with knowledge of how additional variability can impact expected loss, I conclude that rather than select features for inclusion in the model, students should consider carefully balancing the additional variability of splines against the reduction in bias – and hence loss – they can impart.

The experiment has several limitations. In particular, the spline model I estimated was still quite biased. It would be important to simulate another dataset in which the effects are non-linear and the model we use has too many degrees of freedom. The variability in that approach may attenuate the improvement we see due to splines. Additionally, complexity is not explicitly accounted for and is only allowed to penalize models via poor predictive performance. A natural extension would be to measure the AIC of all models since they are linear (except perhaps the LASSO, where the effective degrees of freedom may or may not be appropriate to use in AIC).