Finding and Handling Outliers
An outlier is simply a data point that is far away from (much smaller or much bigger than) the majority of the other points. In this post, I will discuss ways to find the outliers in your data, what effect outliers can have on your results, and how to handle the outliers that you do find.
The easiest and probably most common way to find an outlier is to visualize your data. Doing a scatter plot or a box plot can make it really easy to see which, if any, data points are outliers.
The above box plot is an example of an extreme outlier. You can’t even see that the image is supposed to be a box plot. A point this far from the data deserves further investigation. DO NOT simply remove your outliers. In the case of the image above, you can set the y-limit on the graph to something like 75 so you can better see the rest of the data, but do not remove that outlier point from your data set just yet.
Interquartile Range Rule for Outliers
A more mathematical approach to defining outliers uses the interquartile range. The interquartile range (IQR) is the width of the middle 50% of the data, that is, the difference between the 75th percentile (Q3) and the 25th percentile (Q1).
To use the interquartile range (IQR) to find outliers you use the following formula:
Lower bound = Q1 – (1.5 * IQR)
Upper bound = Q3 + (1.5 * IQR)
An outlier is any point that is below the lower bound or above the upper bound. As an example, let’s assume you have a data set of adult heights in inches. Let’s say the 25th percentile is 65 inches and the 75th percentile is 75 inches. That means that 50% of adult heights are between 65 and 75 inches, with an IQR of 10 inches.
Using these values we can calculate the following:
Lower bound = 65 – (1.5 * 10)
Lower bound = 50
Upper bound = 75 + (1.5 * 10)
Upper bound = 90
Using the IQR outlier rule, we have determined that heights above 90 inches or below 50 inches can be considered outliers. Again, this does not mean you should remove those data points. It is simply a mathematical way to define what an outlier is.
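The rule above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names `iqr_bounds` and `find_outliers` and the sample heights are made up for this example, and it assumes Python's built-in `statistics.quantiles` with its default quartile method. Other tools compute quartiles slightly differently, so the exact bounds can vary a little between packages.

```python
import statistics


def iqr_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) bounds of the IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the values that fall outside the IQR bounds."""
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    lower, upper = iqr_bounds(q1, q3, k)
    return [v for v in values if v < lower or v > upper]


# The worked example from the text: Q1 = 65 inches, Q3 = 75 inches.
print(iqr_bounds(65, 75))  # (50.0, 90.0)

# Hypothetical heights in inches, invented for illustration.
heights = [48, 62, 64, 65, 66, 67, 68, 69, 70, 72, 74, 95]
print(find_outliers(heights))  # [48, 95]
```

Note that the function only reports the flagged points. Deciding what to do with them is a separate step.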
The Effect of Outliers
The average can be misleading. Mean and median are both measures of average, but if you are given only one of them you may not be seeing the whole picture. Sometimes only one is reported by design, to support a specific viewpoint. Good analysis will include both.
As an example, let’s assume there are 10 data scientists in a bar, each with an annual income of $100k. The median and the mean income of everyone in the bar is $100k. Now if Bill Gates walks in, who let’s assume has an annual income of $1 billion, the average income of everyone in the bar is now $91 million. Of course, the median is still $100k, but when I said the average was $91 million, I was using the mean and leaving out the median.
The point of this example is to show how easily the mean can be skewed by an outlier or outlier-like values. In cases of extreme outliers, the median often gives a better picture of the ‘average’ or ‘typical’ case. Also, I encourage you to question any average you are given. Is it the median or the mean? Which is more applicable to this particular case?
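The arithmetic in the bar example is easy to check with Python's `statistics` module. The variable names below are made up for this sketch, and the $1 billion figure is the assumption from the text, not a real income.

```python
import statistics

# Ten data scientists, each earning $100k a year.
incomes = [100_000] * 10
print(statistics.mean(incomes), statistics.median(incomes))

# Add one person with an assumed annual income of $1 billion.
incomes_with_gates = incomes + [1_000_000_000]
mean_income = statistics.mean(incomes_with_gates)
median_income = statistics.median(incomes_with_gates)

print(mean_income)    # 91000000, i.e. $91 million
print(median_income)  # 100000, unchanged
```

The mean jumps by a factor of 910 while the median does not move at all, which is exactly why the two should be reported together.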
While removing outliers might make your analysis look better or allow a nicer fit for your regression, you cannot remove data just because it makes life easier. The only time you should remove a data point is if that data point is clearly erroneous. If you are confident the data point is a measurement error or an incorrect entry, then it is safe to remove such an outlier.
If you cannot confidently determine that the outlier is due to some type of error, then you need to leave it in your dataset. These outliers are often worth further examination and may help you in designing your model.
When using a dataset with outliers, keep in mind that your data is likely skewed or heavy-tailed, and take caution when using tools that assume a normal distribution. Use statistical tools that are better suited to this kind of distribution: the median instead of the mean as a measure of average, and the IQR instead of the standard deviation as a measure of variation.
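To see why the median and IQR are the safer choices here, the following sketch compares all four measures on a small sample before and after one extreme value is added. The data is invented for illustration (the 700 stands in for something like a mistyped height), the helper name `summarize` is hypothetical, and the quartiles again come from `statistics.quantiles` with its default method.

```python
import statistics


def summarize(values):
    """Return mean, standard deviation, median, and IQR for a sample."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }


# Hypothetical heights in inches, then the same data with one extreme value.
clean = [64, 65, 66, 67, 68, 69, 70, 71, 72]
with_outlier = clean + [700]

before = summarize(clean)
after = summarize(with_outlier)

for name in ("mean", "stdev", "median", "iqr"):
    print(f"{name:>6}: {before[name]:8.2f} -> {after[name]:8.2f}")
```

In this sample the mean and standard deviation change drastically once the extreme value is included, while the median and IQR each shift by less than one inch.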
Finding an outlier is best done through visualizing your data. You could also mathematically define an outlier by using the IQR outlier rule.
Outliers can heavily skew your results, so be careful in selecting which statistical measures to use when you have outliers in your data. Also be careful whenever you see just the “average” of something without knowing whether it is the mean or the median.
Only remove an outlier from your dataset if it is a clearly erroneous data point. Otherwise, you need to leave the outliers in your data and just select the tools that are better able to handle a non-normal distribution.