Outliers: The Excess Salt in your meal!
Box Plot, Outlier Detection & Removal Techniques

Outliers are extreme values that differ markedly from most other data points in a dataset. They can have a large impact on your statistical analyses and skew the results of hypothesis tests. It’s important to carefully identify potential outliers in your dataset and deal with them appropriately to get accurate results.

What are outliers?

Outliers are values at the extreme ends of a dataset. Some outliers represent true values from natural variation in the population. Other outliers may result from incorrect data entry, equipment malfunctions, or other measurement errors.

An outlier isn’t always a form of dirty or incorrect data, so you have to be careful with them in data cleaning. What you should do with an outlier depends on its most likely cause.

  • True outliers should always be retained in your dataset, because they represent natural variation in your sample.

  • Other outliers that don’t represent true values should be removed or corrected if possible, or treated as missing values if not.

Impact of Outliers on Statistical Models

Outliers can have a significant impact on statistical models, as they can skew the results and lead to inaccurate conclusions. Here are a few ways outliers can impact statistical models:

  1. Distorted measures of central tendency: Outliers can distort measures of central tendency, especially the mean. A single extreme value can shift the mean noticeably, making it a less representative summary of the overall data, while the median and mode are far more resistant (see the short sketch below).

  2. Increased variance: Outliers can increase the variance of a dataset, making it difficult to identify patterns or trends in the data. This can make it challenging to build accurate predictive models.

  3. Bias in regression models: Outliers in a regression model can lead to biased estimates of coefficients and reduce the predictive power of the model. They can also increase the variability of residuals, making it difficult to interpret the goodness-of-fit of the model.

  4. Decreased model accuracy: Outliers can lead to overfitting of models, where the model fits too closely to the data and fails to generalize well to new data. This can result in decreased accuracy and poor performance of the model.

To mitigate the impact of outliers on statistical models, it is important to identify and handle them appropriately.
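
As a quick illustration of the first point, here is a minimal sketch with made-up numbers showing how a single extreme value pulls the mean while the median barely moves:

import numpy as np

# a small, made-up sample of values clustered around 60
scores = np.array([52, 55, 58, 60, 61, 63, 65])
# the same sample with one extreme value added
scores_with_outlier = np.append(scores, 250)

print("Mean without outlier:  ", np.mean(scores))                # about 59.1
print("Median without outlier:", np.median(scores))              # 60.0
print("Mean with outlier:     ", np.mean(scores_with_outlier))   # about 83.0
print("Median with outlier:   ", np.median(scores_with_outlier)) # 60.5

The single extreme value drags the mean up by more than 20 points while the median moves by only 0.5, which is exactly the distortion described in point 1.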

How to find outliers?

Here are some general methods for detecting outliers for normal distributions, skewed distributions, and other distributions:

  1. Normal distributions: For normally distributed data, a common method for detecting outliers is to use the standard deviation. Typically, any data point that falls more than three standard deviations from the mean is considered an outlier. An equivalent way to express this is the z-score, which measures how many standard deviations a data point is from the mean: any point with a z-score greater than 3 or less than -3 is typically flagged as an outlier (see the sketch after this list).

  2. Skewed distributions: For skewed distributions, one approach is to use the interquartile range (IQR). The IQR is the difference between the third quartile (75th percentile) and the first quartile (25th percentile) of the data. Any data point that falls outside of 1.5 times the IQR from the first or third quartile is considered an outlier.

  3. Other distributions: For other distributions, it may be necessary to use domain-specific knowledge to determine what constitutes an outlier. For example, in a distribution of test scores, a score that is much higher or lower than the others may be considered an outlier. Similarly, in a distribution of medical measurements, a reading that is outside of a normal range may be considered an outlier.
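
To make the first two rules concrete, here is a minimal sketch; the values array is made up purely for illustration, and both thresholds are simply the conventional ones described above:

import numpy as np

# made-up measurements with one obvious outlier (55.0) at the end
values = np.array([13.1, 14.2, 13.8, 14.9, 13.5, 14.4, 15.0, 13.7,
                   14.1, 13.9, 14.6, 13.3, 14.8, 13.6, 14.0, 55.0])

# rule 1: z-score -- flag points more than 3 standard deviations from the mean
z_scores = (values - values.mean()) / values.std()
print("Z-score outliers:", values[np.abs(z_scores) > 3])

# rule 2: IQR -- flag points more than 1.5 * IQR below Q1 or above Q3
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
print("IQR outliers:", values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)])

For this particular sample, both rules flag only the value 55.0.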

There are many ways to identify outliers, but one of the most common and simple methods is using a box plot. A box plot is a graphical display of the distribution of data based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.

A box plot consists of a rectangular box that spans from Q1 to Q3, with a line inside indicating the median. The box represents the middle 50% of the data, also known as the interquartile range (IQR). The IQR measures the variability or spread of the data.

The box is extended by two lines called whiskers that reach out to the minimum and maximum values that are not outliers. Outliers are defined as data points that are more than 1.5 times the IQR above Q3 or below Q1. Outliers are usually shown as dots that are separate from the rest of the plot.

Here is an example of a box plot with outliers:

In this box plot, we can see low outliers (below the lower fence) and two high outliers (above the upper fence).
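
If you want to draw a similar figure yourself, here is a minimal matplotlib sketch (the data is made up; the points drawn individually beyond the whiskers are the outliers):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# mostly normal data with a few extreme values injected at both ends
data = np.concatenate([rng.normal(loc=50, scale=5, size=200),
                       [20, 22, 85, 90]])

plt.boxplot(data)  # points beyond the whiskers are plotted as separate markers
plt.ylabel("Value")
plt.title("Box plot with outliers")
plt.show()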

How to remove outliers?

Once we have identified the outliers in our dataset, we can decide how to deal with them. One option is to remove them from the dataset entirely. This can reduce the noise and improve the accuracy of our analyses, but it can also introduce bias and reduce the sample size.

Another option is to replace them with some other values, such as the mean, median, or mode of the data, or the nearest non-outlier value. This can preserve the sample size and reduce the distortion of the data, but it can also introduce error and uncertainty.

A third option is to treat them as missing values and use some imputation techniques to fill them in. This can account for the uncertainty and variability of the data, but it can also be complex and computationally intensive.

The best option depends on the nature and source of the outliers, the size and shape of the dataset, and the purpose and goal of the analysis.
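
As a rough sketch of the replacement and missing-value options, here is one way to handle a single numeric pandas Series called s (a hypothetical example, not the only reasonable approach); the fences use the same 1.5 × IQR rule from the box plot section:

import pandas as pd

# made-up series with two obvious outliers (120 and -40)
s = pd.Series([12, 14, 13, 15, 14, 13, 120, 14, 13, -40])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (s < lower) | (s > upper)

# option A: cap (winsorize) the outliers at the fences
s_capped = s.clip(lower=lower, upper=upper)

# option B: replace the outliers with the median of the non-outlier values
s_median_replaced = s.where(~is_outlier, s[~is_outlier].median())

# option C: mark the outliers as missing (NaN) so they can be imputed later
s_missing = s.mask(is_outlier)

Capping keeps every row, median replacement preserves the sample size while limiting distortion, and marking outliers as missing lets you hand them to whatever imputation technique you already use for other missing values.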

How to remove outliers in code?

We can use Python to perform some basic operations to remove outliers from our dataset. Assuming we have a pandas DataFrame called df that contains our data, we can use the following code snippets for the different options:

Option 1: Remove outliers from a skewed distribution (IQR method)

We can use the quantile method of pandas to calculate Q1 and Q3 and then use them to define the upper and lower fences. We can then use boolean indexing to filter out any rows that fall outside these fences (this assumes every column of df is numeric).

import pandas as pd
# calculate Q1 and Q3
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
# calculate IQR
IQR = Q3 - Q1
# calculate upper and lower fences
upper = Q3 + 1.5 * IQR
lower = Q1 - 1.5 * IQR
# keep only rows where every value lies within the fences
# (values outside the fences become NaN, and dropna then removes those rows)
df_no_outliers = df[(df >= lower) & (df <= upper)].dropna()

Option 2: Remove outliers from a Gaussian (normal) distribution

Here's an example of how to remove outliers from a normal (Gaussian) distribution in Python using the Z-score method:

import numpy as np

# Generate a normally distributed dataset with some outliers
data = np.random.normal(loc=50, scale=10, size=1000)
data[0] = 10
data[1] = 90

# Calculate the Z-score for each data point
z_scores = (data - np.mean(data)) / np.std(data)

# Define a threshold for outlier removal (typically 2 or 3)
threshold = 3

# Identify the indices of the outliers
outliers = np.where(np.abs(z_scores) > threshold)[0]

# Remove the outliers from the dataset
clean_data = np.delete(data, outliers)

print("Original data shape:", data.shape)
print("Clean data shape:", clean_data.shape)

Output (exact counts will vary from run to run because no random seed is set)

Original data shape: (1000,)

Clean data shape: (995,)

In this code, we first generate a normally distributed dataset with some outliers. We then calculate the Z-score for each data point, which measures how many standard deviations a data point is away from the mean. We define a threshold for outlier removal (typically 2 or 3) and identify the indices of the outliers using the absolute value of the Z-score. Finally, we remove the outliers from the dataset using NumPy's delete function.

That's all for today, see you in the next blog. Till then, have a nice day ;)