Outliers
A Quick Explanation of Outliers and Why We Should Care About Them
Hola everyone, I decided to write my first post about outliers. Outliers are a very common concept when handling data, and they can often be problematic in statistical analyses. In this post I am going to explain what an outlier is and how outliers can affect our results. I will also briefly explain the idea behind finding outliers. Remember that I am trying to approach this concept with a little bit of comedy, so I hope you enjoy it:
What is an outlier?
Imagine a group of 10 friends, yourself included, who are all almost the same height (five of you are 5’5” and five of you are 5’7”). For your group, the average height is 5’6”, and nobody is more than 1 inch away from the average. In this distribution, no height behaves differently from the others.
Now, imagine that you meet someone called Bob while on vacation in Colombia, and it turns out that your new friend is from the same city as you. When you get back, you introduce Bob to your friends and they all like him - he is a cool guy. Now that you have a new friend, you want to check your group’s height distribution again. As you have already noticed, Bob is really tall, so you ask him and he tells you that he is 120’ (Bob doesn’t like it when you question his height). You then check your new distribution and realize that your group’s average height has changed to about 15’11” (and the largest distance from the average is now about 104’).
Amazing, isn’t it?
Well, in this example Bob’s height can be considered an outlier. According to Wikipedia, an outlier is a “data point that differs significantly from other observations”. So, given that you and your friends are all around 5’6”, Bob’s height differs significantly from the other observations.
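If you want to check the arithmetic of the example yourself, here is a minimal Python sketch. It assumes all heights are converted to inches, and the helper name to_feet_inches is made up for this illustration.

```python
from statistics import mean

def to_feet_inches(inches):
    """Format a height in inches as feet'inches", rounded to the nearest inch."""
    total = round(inches)
    return f"{total // 12}'{total % 12}\""

# Heights in inches: five people at 5'5" (65 in) and five at 5'7" (67 in).
friends = [65] * 5 + [67] * 5
bob = 120 * 12  # Bob's claimed 120 feet, in inches

before = mean(friends)          # average without Bob
after = mean(friends + [bob])   # average once Bob joins the group

# Largest distance between any single person and the new average, in feet.
max_distance_ft = max(abs(h - after) for h in friends + [bob]) / 12

print(to_feet_inches(before))     # 5'6"
print(to_feet_inches(after))      # 15'11"
print(round(max_distance_ft, 1))  # 104.1
```

One person out of eleven is enough to move the average by more than ten feet, which is exactly what makes Bob an outlier here.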
Why should we care about them?
Say, for example, that you want to guess one friend’s height based on the heights of your other friends. One safe bet is to guess the average height of the group. Based on your group before you met Bob, you would guess that she is 5’6”. This is a pretty close guess, because her actual height is either 5’5” or 5’7”. Using the average to predict your friend’s height works well in this scenario.
In the opposite case, you make your guess after you met Bob. In this scenario, your guess would be about 15’11”. Compared with her possible heights, your guess makes her way taller than she really is.
In statistics, what I tried to replicate with the example is called prediction. Technically, prediction is the process of estimating an unknown outcome, often at a future point in time, from the data you already have. This example, like many other analyses, relies on metrics such as the average value or the distance of each observation from it. And, if you are handling real-life situations, these results can affect your decision-making process.
Therefore, having outliers in your data can lead to misleading conclusions, such as telling your friend’s mom that her daughter might be 15’11” tall, or not letting your friend ride a roller coaster because you think she is just too tall (but she is not).
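The guessing game above can be sketched in a few lines of Python. This is only an illustration under the assumptions of the example (heights in inches, the friend is either 5’5” or 5’7”), and the function name worst_case_error is made up for this post.

```python
from statistics import mean

friends = [65] * 5 + [67] * 5    # heights in inches, before Bob
with_bob = friends + [120 * 12]  # the same group plus Bob's claimed 120 feet

def worst_case_error(guess, possible_heights):
    """Largest miss (in inches) if the true height is one of possible_heights."""
    return max(abs(guess - h) for h in possible_heights)

possible = [65, 67]  # the friend is either 5'5" or 5'7"

# Use the group average as the guess, without and with Bob.
error_before = worst_case_error(mean(friends), possible)
error_after = worst_case_error(mean(with_bob), possible)

print(f"{error_before:.1f}")  # 1.0
print(f"{error_after:.1f}")   # 125.9
```

Before Bob, the guess is never off by more than an inch. After Bob, the same guessing rule misses by more than ten feet.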
What makes handling outliers a problem?
So far, we have assumed that Bob is 120’ tall. That is, undoubtedly, a very tall person. But what happens if Bob is 6’5” instead?
Is he still an outlier?
Defining the limit for what is (or isn’t) an outlier can be a very complicated task. It depends mostly on the type of data you are handling and on who is using it. For some people, 6’5” is close enough to the rest of the distribution. For others, it is an outlier. Thus, identifying outliers is a fairly subjective exercise, and there are many approaches to it. Methods for identifying outliers are not covered in this post, so please wait for a future post on that.
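To get a feeling for why the 6’5” case is less clear-cut, you can compare how far each version of Bob moves the group average. The sketch below is not a method for detecting outliers. It only measures the shift, and the function name shift_in_mean is made up for this illustration.

```python
from statistics import mean

friends = [65] * 5 + [67] * 5  # heights in inches, before Bob

def shift_in_mean(new_height):
    """Inches the group average moves when one person of new_height joins."""
    return mean(friends + [new_height]) - mean(friends)

shift_small_bob = shift_in_mean(6 * 12 + 5)  # Bob at 6'5" (77 in)
shift_giant_bob = shift_in_mean(120 * 12)    # Bob at 120 ft (1440 in)

print(shift_small_bob)            # 1
print(round(shift_giant_bob, 1))  # 124.9
```

A one-inch shift might matter a lot for one analysis and not at all for another, which is why the limit is a judgment call.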
In conclusion
Outliers are a concept to consider when doing statistical analyses. Having an extremely high (or low) value can affect measures such as the mean or the variance of a distribution, which can lead to misleading conclusions in your results. So, next time you read me, I hope you have taken care of those outliers.