Find and Correct Typos in a DataFrame with Python Pandas

=============================================

In this article, we will explore how to find and correct typos in a DataFrame using Python pandas. We’ll take an example DataFrame where names, surnames, birthdays, and some random variables are stored, and learn how to identify and replace typos in the names and surnames columns.

Problem Statement


The problem is as follows: given a DataFrame with names, surnames, birthdays, and some other columns, we want to find typos in the names and surnames columns, using the birthdays to decide which records belong together. We’ll assume that two records sharing a birthday and a name refer to the same person, so their surnames should match; likewise, two records sharing a birthday and a surname should carry the same name. If such a group contains different spellings, we’ll treat the most common one as correct and replace the others when they are close enough to it.
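To make the problem concrete, here is a minimal sketch on a made-up DataFrame. The records are invented for illustration; only the column names (NAME, SURNAME, BIRTH) follow the article. It counts how many distinct surname spellings appear for each birthday and name, which is one simple way to spot groups worth checking.

```python
import pandas as pd

# Made-up records for illustration only; column names follow the article.
df = pd.DataFrame({
    "NAME": ["John", "John", "John", "Mary", "Mary"],
    "SURNAME": ["Smith", "Smith", "Smiht", "Jones", "Jones"],
    "BIRTH": ["1980-01-01"] * 3 + ["1975-05-05"] * 2,
})

# Count distinct surname spellings per (BIRTH, NAME) group.
spellings = df.groupby(["BIRTH", "NAME"])["SURNAME"].nunique()

# Groups with more than one spelling are candidates for a typo check.
suspects = spellings[spellings > 1]
print(suspects)
```

Here only the group for John born on 1980-01-01 is flagged, because it contains both "Smith" and "Smiht".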

Solution Overview


Our solution will involve the following steps:

  1. Grouping the birthdays with names and surnames
  2. Finding the most common name and surname for each group
  3. Calculating the Levenshtein distance between the most common spelling and every other spelling in the same group
  4. Replacing typos based on the calculated distances

Solution Implementation


Step 1: Importing Libraries and Loading Data

import pandas as pd
from collections import Counter 
import distance

# Load the DataFrame
df = pd.read_csv('your_data.csv')

Step 2: Defining a Function to Calculate Levenshtein Distance

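We’ll define a function dist that returns the Levenshtein distance between two strings. It wraps distance.levenshtein from the third-party distance package, which is not part of the standard library and has to be installed first (for example with pip install distance).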

def dist(str1, str2):
    return distance.levenshtein(str1, str2)
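If the distance package is not available, the same quantity can be computed in plain Python. The following is a minimal sketch of the standard dynamic-programming approach, not the package's own implementation; the function name levenshtein is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the already processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("Smith", "Smiht"))
```

Note that swapping two adjacent letters, as in "Smiht", counts as two edits under plain Levenshtein distance, which matters when choosing a threshold for what counts as a typo.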

Step 3: Grouping Birthdays with Names and Surnames

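We’ll group the DataFrame by ‘BIRTH’ and ‘NAME’ and collect the surnames of each group into a list. We then do the same for names, grouping by ‘BIRTH’ and ‘SURNAME’. Each collection of lists is passed to find_name together with a dictionary (dict_SURNAME or dict_NAME), which find_name fills with the corrections it finds.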

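# Typo -> correction mappings, filled in by find_name
# (find_name must be defined before these calls)
dict_SURNAME = dict()
dict_NAME = dict()

dfsurname = df.groupby(['BIRTH', 'NAME']).SURNAME.apply(list).reset_index()
find_name(dfsurname.SURNAME.tolist(), dict_SURNAME)

dfname = df.groupby(['BIRTH', 'SURNAME']).NAME.apply(list).reset_index()
find_name(dfname.NAME.tolist(), dict_NAME)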
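The code above calls find_name without showing its body. The following is one possible sketch of such a helper, written under several assumptions: the most frequent spelling in a group is taken as correct, ties are left untouched, and a spelling counts as a typo when its edit distance to the most frequent one is at most max_dist. The threshold max_dist and the helper edit_distance (a stand-in for distance.levenshtein so the example runs without extra packages) are both hypothetical.

```python
from collections import Counter


def edit_distance(a, b):
    """Plain Levenshtein distance (stand-in for distance.levenshtein)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def find_name(groups, mapping, max_dist=2):
    """Fill `mapping` with typo -> correction pairs found in `groups`.

    `groups` is a list of lists of spellings, one list per group.
    `max_dist` is an assumed threshold, not a value from the article.
    """
    for spellings in groups:
        counts = Counter(spellings)
        if len(counts) < 2:
            continue  # only one spelling, nothing to correct
        ranked = counts.most_common()
        best, best_count = ranked[0]
        for other, other_count in ranked[1:]:
            # Skip ties: with equal counts there is no clear majority spelling.
            if other_count == best_count:
                continue
            if edit_distance(best, other) <= max_dist:
                mapping[other] = best


dict_SURNAME = {}
find_name(
    [["Smith", "Smith", "Smiht"], ["Jones", "Jones"], ["Lee", "Brown"]],
    dict_SURNAME,
)
print(dict_SURNAME)
```

In this run only "Smiht" is mapped to "Smith". The group with "Lee" and "Brown" is skipped because both spellings occur once, so there is no majority to trust.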

Step 4: Replacing Typos

The two dictionaries, dict_SURNAME and dict_NAME, map each misspelled surname or name to its correct spelling. We pass them to DataFrame.replace as a nested dictionary keyed by column name, so that each mapping is applied only to its own column.

df2 = df.replace({'NAME': dict_NAME, 'SURNAME': dict_SURNAME})
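The nested-dictionary form of replace can be tried on a small made-up DataFrame. In this sketch the records and the two mappings are invented for illustration; the point is that each inner dictionary only affects the column it is keyed under.

```python
import pandas as pd

# Made-up records: "John" appears as a name and, in the last row, as a surname.
df = pd.DataFrame({
    "NAME": ["John", "Jhon", "Mary"],
    "SURNAME": ["Smiht", "Smith", "John"],
})

# Hypothetical typo -> correction mappings.
dict_NAME = {"Jhon": "John"}
dict_SURNAME = {"Smiht": "Smith"}

# Outer keys select the column, inner dictionaries map old values to new ones.
df2 = df.replace({"NAME": dict_NAME, "SURNAME": dict_SURNAME})
print(df2)
```

replace returns a new DataFrame and leaves df unchanged. Keep in mind that a mapping applies to every row of its column, not just to the group where the typo was detected, so a value that is a typo in one group but a valid spelling elsewhere would also be rewritten.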

Step 5: Printing the Results

Finally, we’ll print the dictionaries dict_SURNAME and dict_NAME, followed by the corrected DataFrame df2.

print(dict_SURNAME)
print(dict_NAME)

print(df2)

Full Code


Here’s the full code that you can use to find and correct typos in a DataFrame with Python pandas.

import pandas as pd
from collections import Counter 
import distance

def dist(str1, str2):
    return distance.levenshtein(str1, str2)

dict_SURNAME = dict()
dict_NAME = dict()

# Load the DataFrame
df = pd.read_csv('your_data.csv')

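# find_name(lists, mapping) must be defined before this point;
# it fills `mapping` with typo -> correction pairs
dfsurname = df.groupby(['BIRTH', 'NAME']).SURNAME.apply(list).reset_index()
find_name(dfsurname.SURNAME.tolist(), dict_SURNAME)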

dfname = df.groupby(['BIRTH', 'SURNAME']).NAME.apply(list).reset_index()
find_name(dfname.NAME.tolist(), dict_NAME)

# Replace typos
df2 = df.replace({'NAME': dict_NAME, 'SURNAME': dict_SURNAME})

# Print the results
print(dict_SURNAME)
print(dict_NAME)

print(df2)

Last modified on 2023-09-22