Conditional Filtering on Paragraph and List Columns in Pandas DataFrame
===========================================================
Introduction
In this article, we will explore how to perform conditional filtering on columns that contain both paragraphs of text and lists. We will use the popular Python library Pandas to achieve this task.
Problem Statement
We have a Pandas DataFrame dftest containing information about various jobs. The “Job Description” column is a paragraph of text, while the “Job Skills” column contains lists of skills separated by “\n\n”. We want to find all rows that match a given list of skills and print only those rows.
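The filtering approach in this article iterates over each cell as a Python list. If the raw “Job Skills” cells are instead single strings with skills separated by “\n\n”, they need to be split first. The following is a minimal sketch under that assumption; the `raw` DataFrame and its contents are hypothetical sample data, not the original dataset.

```python
import pandas as pd

# Hypothetical raw data: each cell is ONE string, skills separated by "\n\n".
raw = pd.DataFrame({
    "Job Posting": ["Data Scientist", "Systems Engineer"],
    "Job Skills": [
        "Algorithms\n\nData Analysis\n\nPython\n\nData Mining",
        "Configuration Management\n\nInformation Management\n\nIntegration Testing",
    ],
})

# Split each string on the blank-line separator and drop empty pieces,
# so every cell becomes a Python list of skill names.
raw["Job Skills"] = raw["Job Skills"].apply(
    lambda text: [s.strip() for s in text.split("\n\n") if s.strip()]
)

print(raw["Job Skills"].tolist())
```

After this step, each cell can be iterated skill by skill, which is what the lambda-based filter expects.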
Solution
We will use the `apply()` function along with a lambda function to achieve this task. For each row, the lambda function checks whether any skill in that row’s “Job Skills” list also appears in the target list `skills`.
out = dftest[dftest["Job Skills"].apply(lambda x: any(s in skills for s in x))]
print(out)
Explanation
- We first import the necessary library, `pandas`.
- We define a list of skills, `skills`, that we want to match against.
- We use the `apply()` function on the “Job Skills” column, passing a lambda function as an argument.
- Inside the lambda function, we check whether any skill in the row’s list is present in the `skills` list, using the `any()` function and the `in` operator.
- The resulting boolean Series is then used to index the original DataFrame `dftest`, selecting only the rows that share at least one skill with the given list.
- Finally, we print the resulting DataFrame.
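The steps above can be sketched with the boolean Series stored in its own variable, which makes it easier to inspect what `apply()` returns before indexing. The two-row `dftest` and the `skills` list here are small illustrative stand-ins, and the name `mask` is introduced only for this example.

```python
import pandas as pd

# Illustrative stand-in for dftest; each "Job Skills" cell is a Python list.
dftest = pd.DataFrame({
    "Job Posting": ["Data Scientist", "Cloud Engineer"],
    "Job Skills": [
        ["Algorithms", "Python"],
        ["Cloud Computing", "Application Development"],
    ],
})
skills = ["Python", "Data Mining"]

# Step 1: build a boolean Series with one True/False per row.
mask = dftest["Job Skills"].apply(
    lambda row_skills: any(s in skills for s in row_skills)
)

# Step 2: index the DataFrame with the Series to keep only the True rows.
out = dftest[mask]

print(mask.tolist())                # [True, False]
print(out["Job Posting"].tolist())  # ['Data Scientist']
```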
Example Use Case
Suppose we have a Pandas DataFrame dftest containing information about various jobs:
| Job Posting | Time Type | Job Location | Job Description | Job Skills |
|---|---|---|---|---|
| Data Scientist | Full Time | Colorado | asdfas fasdfsad sadfsdaf sdfsdaf | [Algorithms, Data Analysis, Python, Data Mining] |
| Cloud Engineer | Part Time | Maryland | asdfasd fasdfasd fwertqqw rtwergd fverty | [Application Development, Application Integrations, Architectural Modeling, Cloud Computing] |
| Systems Engineer | Full Time | Virginia | qwerq e5r45yb rtfgs dfaesgf reasdfs dafads | [Configuration Management, Information Management, Integration Testing] |
If we want to find all rows that share at least one skill with the list [Algorithms, Data Analysis, Python, Data Mining], we can use the following code:
skills = ["Algorithms", "Data Analysis", "Python", "Data Mining"]
out = dftest[dftest["Job Skills"].apply(lambda x: any(s in skills for s in x))]
print(out)
This will print only the row that matches the given list of skills:
| Job Posting | Time Type | Job Location | Job Description | Job Skills |
|---|---|---|---|---|
| Data Scientist | Full Time | Colorado | asdfas fasdfsad sadfsdaf sdfsdaf | [Algorithms, Data Analysis, Python, Data Mining] |
Note that filtering does not modify the data: the row is returned exactly as it appears in dftest. Because the check uses any(), a row is selected as soon as one of its skills appears in skills, and the text in “Job Description” plays no part in the match.
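Because the filter relies on `any()`, a single shared skill is enough for a row to be selected. If every requested skill must be present instead, a stricter check is needed. The sketch below contrasts the two using Python sets; it rebuilds `dftest` from the skills in the table above, and the names `wanted`, `any_match`, and `all_match` are introduced here purely for illustration.

```python
import pandas as pd

# dftest rebuilt from the example table (only the columns needed here).
dftest = pd.DataFrame({
    "Job Posting": ["Data Scientist", "Cloud Engineer", "Systems Engineer"],
    "Job Skills": [
        ["Algorithms", "Data Analysis", "Python", "Data Mining"],
        ["Application Development", "Application Integrations",
         "Architectural Modeling", "Cloud Computing"],
        ["Configuration Management", "Information Management",
         "Integration Testing"],
    ],
})

wanted = {"Python", "Cloud Computing"}

# At least one wanted skill appears in the row (equivalent to the any() filter).
any_match = dftest[dftest["Job Skills"].apply(lambda x: not wanted.isdisjoint(x))]

# Every wanted skill appears in the row (stricter than any()).
all_match = dftest[dftest["Job Skills"].apply(lambda x: wanted.issubset(x))]

print(any_match["Job Posting"].tolist())  # ['Data Scientist', 'Cloud Engineer']
print(all_match.empty)                    # True: no single job lists both skills
```

Using a set for the target skills also makes each membership test a constant-time lookup, which may matter when the list of skills is long.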
Last modified on 2024-07-18