Identifying Three Distinct Rows Based on the Same Date: A Step-by-Step Solution in R

Identifying Three Distinct Rows Based on the Same Date

In this article, we’ll explore a problem where we need to identify three distinct rows based on the same date. The problem involves cleaning and manipulating data using R’s lubridate and dplyr packages.

Sample Data

We are given a dataset with three columns: IDrow, date, and result. The IDrow column represents an identifier for each row, while the date column stores dates in string format, and the result column contains categorical data (NP1, NP2, or NP3).

df = read.table(header = T, text = '   IDrow     date result
1    ID1 01-01-09    NP1
2    ID1 01-01-10    NP1
3    ID1 01-01-10    NP2
4    ID1 01-01-10    NP3
5    ID1 03-03-15    NP1
6    ID1 03-03-15    NP2
7    ID1 03-03-15    NP3
8    ID2 01-05-10    NP1
9    ID2 01-05-10    NP2
10   ID2 01-05-10    NP3
11   ID3 02-08-11    NP1
12   ID3 02-08-11    NP2
13   ID3 02-08-11    NP3')

Solution

The solution involves several steps:

Step 1: Convert the `date` Column to Date Class

We use the lubridate package’s mdy() function to convert the date column from string format to date class.

library(lubridate)
df %>% mutate(date = mdy(date))

Step 2: Assign Scores to Each ID/Date Combination

We assign scores based on the presence of three distinct results for each ID/date combination:

Score 2 if all three results are present (n_distinct(result) == 3)
Score 1 if only NP1 and NP2 are present ("NP1" %in% result & "NP2" %in% result)
Score 0 otherwise

df %>% 
  group_by(IDrow, date) %>% 
  mutate(
    score = case_when(
      n_distinct(result) == 3 ~ 2,
      "NP1" %in% result & "NP2" %in% result ~ 1,
      TRUE ~ 0
    )
  ) %>% 
  filter(score > 0)

Step 3: Sort the Data Based on ID, Score, and Date

We sort the data by IDrow, score (descending), and date (ascending) within each group. We then keep only the first date for each IDrow.

df %>% 
  filter(score > 0) %>% 
  group_by(IDrow) %>% 
  arrange(desc(score), date, .by_group = TRUE) %>% 
  filter(date == first(date)) %>% 
  ungroup()

Result

After applying these steps to our sample data, we obtain the following output:

# A tibble: 9 × 4
#   IDrow date       result score
#   &lt;chr&gt; &lt;date&gt;     &lt;chr&gt;  &lt;dbl&gt;
#1 ID1   2010-01-01 NP1        2
#2 ID1   2010-01-01 NP2        2
#3 ID1   2010-01-01 NP3        2
#4 ID2   2010-01-05 NP1        2
#5 ID2   2010-01-05 NP2        2
#6 ID2   2010-01-05 NP3        2
#7 ID3   2011-02-08 NP1        2
#8 ID3   2011-02-08 NP2        2
#9 ID3   2011-02-08 NP3        2

This solution provides us with the required data, where each row represents a distinct combination of IDrow and date, along with the corresponding score.

Last modified on 2025-04-04