Machine Learning Interview Questions – Q8 – Explain the difference between L1 and L2 regularization

Machine learning interview questions is a series I will periodically post on.  The idea was inspired by the post 41 Essential Machine Learning Interview Questions at Springboard.  I will take each question posted there and provide an answer in my own words.  Whether that expands upon their solution or is simply another perspective on how to phrase the solution, I hope you will come away with a better understanding of the topic at hand.

To see other posts in this series visit the Machine Learning Interview Questions category.

Q8- Explain the difference between L1 and L2 regularization

What is regularization?

Regularization helps solve the problem of over-fitting by favoring simpler models.  Recall the bias-variance trade-off: we want a model complex enough to capture the underlying pattern, but not so complex that it fails to generalize to unseen data.

Regularization is a mathematical way to discourage complex or extreme models.  It does this by adding a penalty term to the machine learning algorithm’s loss function.  The penalty makes large or numerous weights costly, so the algorithm will shrink the coefficients (weights) of many features in your model toward zero or very small values.

Regularized Loss Function = Original Loss Function + lambda*R(f)

That last term, lambda*R(f), is the regularization term.  Lambda is a parameter controlling how heavily the penalty counts toward the total loss, and R(f) is a function measuring the complexity of the model; the choice of R(f) is where the logic of the penalty lives.  For L1 regularization, R(f) is the sum of the absolute values of the weights; for L2 regularization, it is the sum of the squared weights.
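To make the formula concrete, here is a minimal sketch in Python, assuming mean squared error as the original loss; the function and argument names are illustrative rather than taken from any particular library:

```python
import numpy as np

def regularized_loss(y_true, y_pred, weights, lam, penalty="l2"):
    """Original loss (mean squared error here) plus lambda * R(f)."""
    original_loss = np.mean((y_true - y_pred) ** 2)
    if penalty == "l1":
        r = np.sum(np.abs(weights))   # L1: sum of absolute weights
    else:
        r = np.sum(weights ** 2)      # L2: sum of squared weights
    return original_loss + lam * r
```

Setting lam to 0 recovers the original, unregularized loss; larger values push the optimizer toward smaller weights.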


L1 regularization vs L2 regularization

When using L1 regularization (the penalty behind the lasso), the penalty is the sum of the absolute values of the weights, which tends to drive the weights of many features to exactly zero.  This effectively performs feature selection and is useful in sparse or high-dimensional feature spaces, since features with a zero coefficient drop out of the final model.  L1 can also save on computational cost because the zero-weighted features can be ignored; the trade-off is that some predictive accuracy may be lost when useful features are pruned away.  L1 is best used on high-dimensional or sparse data sets where a compact model is valuable.
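As a rough illustration, here is a sketch using scikit-learn’s Lasso, its L1-regularized linear regression, on a synthetic data set; the alpha parameter plays the role of lambda, and the specific numbers are arbitrary:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 50 features, only 5 of which actually drive the target
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10, random_state=0)

lasso = Lasso(alpha=1.0)  # alpha is the lambda in the formula above
lasso.fit(X, y)

# Many coefficients end up exactly zero, i.e. implicit feature selection
print("non-zero coefficients:", (lasso.coef_ != 0).sum(), "out of", X.shape[1])
```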

L2 regularization (the penalty behind ridge regression) instead penalizes the sum of the squared weights, shrinking all of them rather than zeroing any out.  Every feature keeps a weight, though many may end up very close to 0.  L2 often gives somewhat better predictive accuracy when most features carry at least some signal, but the final model retains every feature and so can be more expensive to store and evaluate.  It is best used when the feature space is not sparse and no feature selection is needed.
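For comparison, here is the same kind of sketch with scikit-learn’s Ridge, its L2-regularized linear regression; again alpha stands in for lambda and the data set is the same synthetic one:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# The same synthetic data as in the Lasso sketch above
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10, random_state=0)

ridge = Ridge(alpha=1.0)  # alpha again plays the role of lambda
ridge.fit(X, y)

# Every feature keeps a weight; many are shrunk close to zero,
# but typically none are exactly zero
print("non-zero coefficients:", (ridge.coef_ != 0).sum(), "out of", X.shape[1])
print("smallest coefficient magnitude:", abs(ridge.coef_).min())
```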


Summary

Regularization is a mathematical approach to preventing over-fitting.  It accomplishes this by penalizing more complex models through a regularization term added to the model’s loss function.

L1 regularization drives the weights of many features to exactly zero, which makes it good for reducing the number of features in a high-dimensional data set.  L2 regularization shrinks all of the weights without eliminating any feature outright and often produces somewhat more accurate final models when most features carry some signal.

To see other posts in this series visit the Machine Learning Interview Questions category.