Creating a New Column Based on Stages and Dates in R

Introduction

In this post, we will discuss how to create a new column in an existing dataframe based on certain conditions. Specifically, we want to create a “Project Status” column that reflects the stage of each project based on its dates.

Background

The problem arises when dealing with multiple stages and dates for a project. The goal is to create a column that shows the latest date for each project, which can be used to determine its current status. This requires conditional logic to handle different stages and dates.

Problem Description

The original question asked to create a new column on the existing dataframe as “Project Status”. This column should have records in it such as SIR_Received stage, SIR_Approved stage, TSS_FC_Stage, TSS_Actual_stage, EP_Received stage, EP_Approved stage, Lease_Approved Stage, RFC_FC_Stage, RFC_Actual_stage & WIP.

The stages are in order so latest date should be captured in the Project_Status column. For example, row number 5 has dates till the column “EP_APP” so in project_status column it shd be EP_Approved_Stage.

When there is date against “SIR_REC” column in the attached data, it means its SIR_Received_stage similarly for all the columns and if all the columns are blank then the project status is WIP.

The new column should appear like this in the dataframe highlighted in RED.

Solution

To create the new column based on stages and dates, we can use the dplyr package in R. The mutate function from dplyr allows us to create a new column with conditional logic.

Here’s an example code snippet that demonstrates how to create the “Project Status” column:

library(dplyr)
df %>% 
  mutate(
    Project_status = case_when(
      !is.na(RFC_ACT) ~ "RFC_Actual_stage",
      !is.na(RFC_FC) ~ "RFC_???",
      !is.na(LEASE_APP) ~ "???",
      !is.na(EP_APP) ~ "???",
      !is.na(EP_REC) ~ "EP_Received_stage",
      !is.na(TSS_ACT) ~ "TSS_Actual_stage",
      !is.na(TSS_FC) ~ "TSS_???_stage",
      !is.na(SIR_App) ~ "SIR_Approved stage",
      !is.na(SIR_REC) ~ "SIR_Received_Stage",
      TRUE ~ "WIP"
    )
  )

In this code, the mutate function creates a new column named Project_status. The case_when function is used to specify the conditions for each stage. If the condition is met, the corresponding status is assigned.

Understanding the Code

The case_when function takes a series of conditions as arguments and returns a value based on which condition is true. In this case, we use the !is.na() function to check if a column is not missing (i.e., it has values).

The first argument in the case_when function specifies the condition for each stage. For example, !is.na(RFC_ACT) checks if the “RFC_ACT” column is not missing. If this condition is true, the value "RFC_Actual_stage" is assigned to the new column.

Importance of Condition Order

The order of conditions in the case_when function is important. The conditions are evaluated from left to right, and the first condition that is met determines the return value.

In the example code snippet above, if all columns are missing (i.e., they have no values), the last condition (TRUE ~ "WIP") will be met, assigning the value "WIP" to the new column.

Conclusion

Creating a new column based on stages and dates requires conditional logic. The dplyr package in R provides an easy-to-use function called mutate, which allows us to create a new column with specified conditions.

By following this example code snippet, you can create a “Project Status” column that reflects the stage of each project based on its dates. Remember to adjust the conditions according to your specific dataset and requirements.

Last modified on 2024-12-20