Introduction to Outlier Detection and Highlighting in Pandas
As data analysts, we often encounter datasets that contain outliers - values that are significantly different from the rest of the data. In this article, we will explore how to detect and highlight these outliers using z-scores in pandas.
Background on Z-Score
The z-score is a measure of how many standard deviations an element is from the mean. It’s used to determine whether a value is unusual or not. The formula for calculating the z-score is:
z = (X - μ) / σ
Where:
- X is the value we’re examining
- μ is the mean of the dataset
- σ is the standard deviation of the dataset
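In this article, a value whose absolute z-score exceeds 1.5 (that is, a value more than one and a half standard deviations from the mean in either direction) is treated as an outlier. This is a fairly loose cut-off; thresholds of 2 or 3 are more common conventions, so pick one that suits your data.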
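As a quick illustration of the formula, the sketch below computes z-scores by hand with pandas. The sample values are made up for this example, and the population standard deviation (ddof=0) is an assumption chosen here for simplicity.

```python
import pandas as pd

# Hypothetical sample values; 40 is deliberately far from the rest.
values = pd.Series([10, 12, 11, 13, 12, 11, 40], name="x")

mu = values.mean()            # mean of the dataset
sigma = values.std(ddof=0)    # population standard deviation (assumed here)
z = (values - mu) / sigma     # z = (X - mu) / sigma, computed element-wise

# Keep only the values whose absolute z-score exceeds the 1.5 threshold.
outliers = values[z.abs() > 1.5]
print(outliers)
```

Only the value 40 passes the threshold in this sample; the remaining values all sit well within one standard deviation of the mean.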
pandas DataFrames and Outlier Detection
Pandas has no dedicated outlier detector, but it combines well with libraries that do the arithmetic. One common approach is the zscore function from SciPy's stats module (scipy.stats.zscore), which calculates the z-score of every value and returns an array with the same shape as its input.
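One detail worth knowing when mixing the two libraries: scipy.stats.zscore divides by the population standard deviation (ddof=0) by default, whereas DataFrame.std() defaults to the sample standard deviation (ddof=1). The sketch below uses only pandas and NumPy, with a made-up DataFrame, to show how the two conventions relate.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame with two numeric columns.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 100.0],
                   "b": [5.0, 5.5, 6.0, 6.5, 8.0]})

# Population z-scores (ddof=0), the convention scipy.stats.zscore uses by default.
z_pop = (df - df.mean()) / df.std(ddof=0)

# Sample z-scores (ddof=1), which is what DataFrame.std() gives by default.
z_sample = (df - df.mean()) / df.std()

# The two differ by a constant factor of sqrt((n - 1) / n) in every column.
n = len(df)
print(np.allclose(z_sample, z_pop * np.sqrt((n - 1) / n)))
```

For small datasets the difference can be enough to move a borderline value across the threshold, so it helps to pick one convention and stick with it.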
In this article, we’ll explore how to use pandas’ built-in functions and libraries to highlight outliers in a dataset using z-scores.
The Problem
The question that prompted this article asks how to highlight values that are outliers based on their z-scores. It includes an example DataFrame and a Python function that calculates the z-score of each value, but the function mishandles NaN (Not a Number) values.
The Fix: Handling NaN Values
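Comparisons with NaN always evaluate to False, so a mask such as z < 1.5 treats a missing z-score as "not within the threshold", and the cell ends up highlighted. Filling missing absolute z-scores with fillna(-np.inf) before the comparison keeps them below any threshold, so NaN values never appear as outliers. Using .replace(0, -np.inf) is not equivalent: it overwrites legitimate zero values rather than missing ones. Note the spelling np.inf; the np.Inf alias was removed in NumPy 2.0.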
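The effect of NaN on a threshold mask is easy to see on a small example. The sketch below uses made-up absolute z-scores with one missing entry.

```python
import numpy as np
import pandas as pd

# Hypothetical absolute z-scores, with one missing entry.
z_abs = pd.Series([0.3, 2.1, np.nan, 0.9])

# Comparisons with NaN are always False, so the missing entry
# looks as if it failed the "within threshold" test.
naive_mask = z_abs < 1.5
print(naive_mask.tolist())   # [True, False, False, True]

# Filling NaN with -inf first keeps missing entries inside the threshold.
safe_mask = z_abs.fillna(-np.inf) < 1.5
print(safe_mask.tolist())    # [True, False, True, True]
```

In both masks, False marks a cell that would be highlighted; only the second mask leaves the missing entry alone.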
Using pandas’ Style Functionality
To highlight outlier values in a DataFrame, we can use pandas' style functionality. The df.style.apply method (Styler.apply) applies CSS styles to parts of the DataFrame based on conditions. With axis=None, the styling function receives the whole DataFrame and returns a DataFrame of CSS strings, one per cell. In this case, we use it to highlight individual cells whose absolute z-score is greater than 1.5.
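Before looking at the z-score version, here is a minimal sketch of the mechanism itself: a styling function that returns one CSS string per cell. The function name highlight_above, its limit parameter, and the sample data are all hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

def highlight_above(data, limit=50, color="orange"):
    """Return a same-shaped DataFrame of CSS strings ('' means no style)."""
    css = f"background-color: {color}"
    # np.where picks the CSS string where the condition holds, '' elsewhere.
    return pd.DataFrame(np.where(data > limit, css, ""),
                        index=data.index, columns=data.columns)

# Hypothetical data for illustration.
df = pd.DataFrame({"a": [10, 60, 30], "b": [55, 20, 70]})
css_frame = highlight_above(df)
print(css_frame)
```

Passing the function as df.style.apply(highlight_above, axis=None) hands it the whole DataFrame in one call; extra keyword arguments such as limit=60 are forwarded to the function. Note that the Styler needs the jinja2 package installed in order to render.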
Code Walkthrough: Corrected Function
import numpy as np
import pandas as pd
from scipy import stats

def highlight_outliers(x, threshold=1.5, color='orange'):
    # extract the numeric columns
    c = x.select_dtypes([np.number]).columns
    # absolute z-scores per column; nan_policy='omit' ignores NaN in the mean and std
    z = x[c].apply(stats.zscore, nan_policy='omit').abs()
    # True where a value is within the threshold; NaN is filled so it is never flagged
    mask = z.fillna(-np.inf) < threshold
    # blank style frame with the same labels as the input
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    # fill the cells that exceed the threshold with the highlight colour
    df1[c] = df1[c].where(mask, 'background-color:{}'.format(color))
    return df1

# df is the DataFrame to be styled
df_styled = df.style.apply(highlight_outliers, axis=None)
Conclusion
Detecting and highlighting outliers in a dataset is an important part of data analysis. By using pandas’ built-in functions and libraries, we can easily identify and visualize these unusual values.
In this article, we explored how to highlight outlier values based on their z-scores using pandas’ style functionality. We also discussed the importance of handling NaN values correctly when detecting outliers.
By applying this technique to your own datasets, you’ll be able to better understand and analyze your data, making it easier to spot trends and patterns that might otherwise go unnoticed.
Remember to adjust the threshold value according to your dataset’s characteristics for optimal results.
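One simple way to choose a threshold is to count how many cells each candidate value would flag. The sketch below does this on made-up data, again assuming population z-scores (ddof=0).

```python
import pandas as pd

# Hypothetical single-column data with two unusually large values.
df = pd.DataFrame({"x": [10, 11, 12, 11, 10, 12, 11, 25, 40]})

# Absolute z-scores using the population standard deviation (assumed here).
z_abs = ((df - df.mean()) / df.std(ddof=0)).abs()

# Count how many cells would be highlighted at several candidate thresholds.
counts = {}
for threshold in (1.5, 2.0, 3.0):
    counts[threshold] = int((z_abs > threshold).sum().sum())
    print(f"threshold={threshold}: {counts[threshold]} flagged")
```

Notice that the extreme value 40 inflates the standard deviation enough that 25 is not flagged even at 1.5, which is one reason to inspect the counts rather than rely on a fixed rule.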
Last modified on 2024-09-25