Extracting Specific Substrings from Strings in Python Using Pandas

Pandas: Efficient String Extraction with Filtering

Pandas is a powerful library in Python for data manipulation and analysis. One of its strengths is the ability to efficiently process and manipulate structured data, including strings. In this article, we will explore how to extract specific substrings from another string using Pandas.

Problem Statement

You have a column containing 8000 rows of random strings, and you need to create two new columns where the values are extracted from the existing column. The extraction should be done based on specific patterns or substrings within the original string.

For example, let’s consider the following dataset:

test_no
TP-ETU06-01-525-W-133
temp data
NaN
12345

We need to create two new columns: sys_no and package_no, where sys_no is extracted from the substring “ETU06” in the original string, and package_no is extracted from the last substring.

Existing Solution

The provided solution uses the apply() function along with a lambda function to achieve this:

df["sys_no"] = df.apply(lambda x:x["test_no"].split("-")[1] if (pd.notnull(x["test_no"]) and x["test_no"]!="" and len(x["test_no"].split("-"))>0) else None,axis=1)

df["package_no"] = df.apply(lambda x:x["test_no"].split("-")[-1] if (pd.notnull(x["test_no"]) and x["test_no"]!="" and len(x["test_no"].split("-"))>0) else None,axis=1)

This solution works fine, but it may not be the most efficient or scalable approach, especially for large datasets.

Improved Solution

A more efficient way to achieve this is by using Pandas’ vectorized operations, specifically Series.str.contains, Series.str.split, and indexing.

mask = df["test_no"].str.contains('-', na=False)
splitted = df["test_no"].str.split("-")
df.loc[mask, "sys_no"] = splitted[mask].str[1]
df.loc[mask, "package_no"] = splitted[mask].str[-1]

print(df)
                 test_no sys_no package_no
0              temp data    NaN        NaN
1                    NaN    NaN        NaN
2  TP-ETU06-01-525-W-133  ETU06        133

Explanation

Let’s break down the improved solution:

  • We first create a mask using Series.str.contains to identify rows where the string contains a hyphen (-). This mask will be used to filter out rows that do not contain the desired pattern.
  • Next, we split the strings in the test_no column into substrings using Series.str.split. This creates an array of arrays, where each inner array represents a substring from the original string.
  • We then use indexing to extract the second and last elements from the array of substrings. The second element is assigned to the sys_no column, while the last element is assigned to the package_no column.

Key Concepts

The key concepts involved in this solution are:

  • Vectorized operations: Pandas’ vectorized operations allow us to perform operations on entire arrays at once, rather than iterating over individual elements. This can significantly improve performance for large datasets.
  • String manipulation: The str.contains and str.split methods provide efficient ways to manipulate strings in Pandas dataframes.
  • Masking: Creating a mask allows us to filter out rows that do not meet certain conditions, reducing the number of iterations required.

Conclusion

In this article, we explored how to efficiently extract specific substrings from another string using Pandas. By leveraging vectorized operations and string manipulation methods, we can create more efficient and scalable solutions for data manipulation tasks.

Additional Examples

Here are some additional examples of string extraction using Pandas:

# Extract the first 5 characters from each string

df["substring"] = df["test_no"].str[:5]

print(df)
                 test_no substring
0              temp data      tempd
1                    NaN        NaN
2  TP-ETU06-01-525-W-133  ETU06...

# Extract the last 3 characters from each string

df["substring"] = df["test_no"].str[-3:]

print(df)
                 test_no substring
0              temp data      dat
1                    NaN        NaN
2  TP-ETU06-01-525-W-133  W...

# Replace all occurrences of a substring with another substring

df["new_column"] = df["test_no"].str.replace("old_substring", "new_substring")

print(df)
                 test_no new_column
0              temp data   temp data...
1                    NaN        NaN
2  TP-ETU06-01-525-W-133  TP-ETU06...

# Split a string into multiple columns based on a delimiter

df["column1"] = df["test_no"].str.split(",")[0]
df["column2"] = df["test_no"].str.split(",")[1]

print(df)
                 test_no column1 column2
0              temp data      temp  data
1                    NaN        NaN     NaN
2  TP-ETU06-01-525-W-133    ETU06   ...  ...

These examples demonstrate additional string manipulation techniques using Pandas, such as extracting substrings, replacing occurrences, and splitting strings into multiple columns.


Last modified on 2023-10-06