Optimizing Similarity Matching: A Step-by-Step Guide to Grouping Observations

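To solve this problem, we combine data manipulation with graph theory: each observation is matched to similar observations on the following day, the matches become the edges of a graph, and the connected components of that graph are the groups. Here’s the step-by-step solution: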

Step 1: Add row number to original data

dt <- dt %>% mutate(row = row_number())

This adds a new column row to the original data, which will help us keep track of each observation.
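The steps in this guide assume that dt has a Date column plus numeric x and y columns. As a minimal sketch, the made-up data below (hypothetical values, not taken from the original problem) shows what Step 1 produces:

```r
library(dplyr)

# Hypothetical toy data: two observations per day over three days
dt <- tibble(
  Date = as.Date("2024-01-01") + c(0, 0, 1, 1, 2, 2),
  x    = c(10, 100, 15, 200, 18, 205),
  y    = c(10, 100, 12, 200, 20, 195)
)

# row_number() with no arguments numbers the rows in their current order
dt <- dt %>% mutate(row = row_number())
```

The row column acts as a stable identifier, so it should be added before any filtering or reordering that happens later.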

Step 2: Create “next day” version of table

dt_next_day <- dt %>% 
  mutate(Date = Date - 1)

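This creates a new data frame dt_next_day in which every date is moved one day back, so an observation recorded on day d + 1 now carries the date d and lines up with the day-d observations when the two tables are joined on Date.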

Step 3: Perform left join with “next day” data

working_matches <- dt %>% 
  left_join(dt_next_day, by = "Date") %>% 
  filter(abs(x.y - x.x) <= 10 & abs(y.y - y.x) <= 10)

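This performs a left join between the original data and the “next day” data, keeping only the pairs where the differences in both x and y are less than or equal to 10. Because both tables share the column names x, y and row, the join adds the suffix .x to the original-day columns and .y to the next-day columns. Rows without a next-day partner have NA in the .y columns and are dropped by the filter, so the result is the same as with an inner join.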
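To make the suffixed column names concrete, here is a small sketch on hypothetical data (the values are invented for illustration). Only rows 1 and 3 are within 10 units of each other in both x and y, so only that pair should survive the filter:

```r
library(dplyr)

# Hypothetical data: rows 1-2 on day 1, rows 3-4 on day 2
dt <- tibble(
  Date = as.Date("2024-01-01") + c(0, 0, 1, 1),
  x    = c(10, 100, 15, 200),
  y    = c(10, 100, 12, 200)
) %>%
  mutate(row = row_number())

dt_next_day <- dt %>% mutate(Date = Date - 1)

# Every day-1 row is paired with every day-2 row, then filtered by distance
matches <- dt %>%
  left_join(dt_next_day, by = "Date") %>%
  filter(abs(x.y - x.x) <= 10 & abs(y.y - y.x) <= 10)

# Columns: Date, x.x, y.x, row.x (original day), x.y, y.y, row.y (next day)
names(matches)
```

With several observations per day the join is many-to-many, which dplyr 1.1.0 and later report with a warning; passing relationship = "many-to-many" to left_join() silences it on those versions.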

Step 4: Create graph from working matches

library(tidygraph)
working_matches %>% 
  select(row.x, row.y) %>% 
  as_tbl_graph(directed = FALSE) %>% 
  mutate(group = group_components())

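This creates an undirected graph from the working matches, where each row number is a node and each match is an edge connecting an observation to its “next day” counterpart. group_components() then labels every connected component, so a chain of matches that runs across several days ends up in a single group.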
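The chaining behaviour is easiest to see on a tiny edge list. In this sketch the pairs are hypothetical: row 1 matches row 3 and row 3 matches row 5, so rows 1, 3 and 5 should share one group even though rows 1 and 5 were never compared directly.

```r
library(dplyr)
library(tidygraph)

# Hypothetical matched pairs: 1-3 and 3-5 chain together, 2-4 stands alone
edges <- tibble(row.x = c(1, 3, 2), row.y = c(3, 5, 4))

groups <- edges %>%
  as_tbl_graph(directed = FALSE) %>%      # first two columns become from/to
  mutate(group = group_components()) %>%  # one id per connected component
  as_tibble()                             # node table: name, group

groups
```

Note that the node identifiers come back in a character column called name, which is why the next step converts them with as.integer().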

Step 5: Extract group information

working_matches <- working_matches %>% 
  select(row.x, row.y) %>% 
  as_tbl_graph(directed = FALSE) %>% 
  mutate(group = group_components()) %>% 
  activate(nodes) %>% 
  data.frame() %>% 
  mutate(row = as.integer(name)) %>% 
  right_join(dt, by = "row")

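This repeats the Step 4 pipeline, extracts the node table with its group labels, and joins it back to the original data. Node names are stored as character, so they are converted to integer before the join on row. The right join keeps every original observation; those that matched nothing have NA in the group column.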
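If unmatched observations should each form their own group instead of carrying NA, one possible approach is to hand out fresh ids after the join. This is a sketch on a hypothetical joined result, not part of the original solution, and it assumes at least one row already has a group:

```r
library(dplyr)

# Hypothetical joined result: rows 3 and 5 never matched, so group is NA
grouped <- tibble(row = 1:5, group = c(1L, 1L, NA, 2L, NA))

grouped <- grouped %>%
  mutate(
    group = if_else(
      is.na(group),
      # continue numbering after the largest existing group id
      max(group, na.rm = TRUE) + cumsum(is.na(group)),
      group
    )
  )
```

Each previously unmatched row now has a unique group id that does not collide with the ids assigned by group_components().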

Step 6: Plot results

library(ggplot2)
ggplot(working_matches, aes(x, y, label = group)) +
  geom_text() +
  facet_wrap(~Date)

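This plots each observation at its x and y position as a text label showing its group number, with one panel per date, so you can check visually that nearby observations on consecutive days share a group.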

The complete code is:

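library(dplyr)
library(tidygraph)
library(ggplot2)

# Step 1: tag each observation with its row number
dt <- dt %>% mutate(row = row_number())

# Step 2: shift dates back one day so day d + 1 lines up with day d
dt_next_day <- dt %>% mutate(Date = Date - 1)

# Step 3: pair each observation with next-day observations
# that are within 10 units in both x and y
working_matches <- dt %>% 
  left_join(dt_next_day, by = "Date") %>% 
  filter(abs(x.y - x.x) <= 10 & abs(y.y - y.x) <= 10)

# Steps 4-5: treat matches as graph edges, label connected components,
# and join the group labels back to the original data
working_matches <- working_matches %>% 
  select(row.x, row.y) %>% 
  as_tbl_graph(directed = FALSE) %>% 
  mutate(group = group_components()) %>% 
  activate(nodes) %>% 
  data.frame() %>% 
  mutate(row = as.integer(name)) %>% 
  right_join(dt, by = "row")

# Step 6: plot group labels at each observation's position, one panel per date
ggplot(working_matches, aes(x, y, label = group)) +
  geom_text() +
  facet_wrap(~Date)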

This code solves the problem of grouping observations based on their similarity with “next day” counterparts.


Last modified on 2025-03-20