Decomposing the Problem of Importing Dissimilar Schema and Fanning Out an Array of Categories into a Categories Table in Postgres

Postgres: Decomposing the Problem of Importing Dissimilar Schema and “Fanning Out” an Array of Categories into a Categories Table

As data migration and integration become increasingly complex, it’s not uncommon to encounter scenarios where two or more dissimilar schemas need to be integrated. One such challenge involves importing a dataset with a comma-delimited list of categories from one schema, while another schema already has a table of category names. In this blog post, we’ll delve into the world of Postgres and explore how to decompose this problem, using SQL as our tool of choice.

Understanding the Problem

Let’s take a closer look at the problem statement:

  • We have two schemas with dissimilar structures.
  • One schema has a table companies with a column categories, which contains a comma-delimited list of categories (e.g., “foo, bar, baz, boo”).
  • The other schema already has a separate table categories with an id and name column, as well as another table item_to_category that links the id of an item to the id of a category.
  • Our goal is to “fan out” the array of categories from the companies table into the categories table in our target schema.

Breaking Down the Problem

To tackle this problem, we’ll need to break it down into smaller, more manageable steps. Here’s a high-level overview of the process:

  1. Create a new function in Postgres that will handle the transformation from the companies table to our target tables.
  2. Iterate over all records in the companies table and extract the comma-delimited list of categories.
  3. Insert each category into our target categories table, handling any potential conflicts on the category column using a UNIQUE constraint.
  4. Retrieve the newly inserted category’s id from the categories table and insert it into our target item_to_category table.

Creating the Transformation Function

Let’s dive into the implementation of this function:

CREATE OR REPLACE FUNCTION scan_categories()
RETURNS SETOF categories AS $$
DECLARE
   category_list VARCHAR[];
   category_text VARCHAR;
   category_id INTEGER;
   company_id INTEGER;
BEGIN

  FOR category_list, company_id IN SELECT categories, id FROM public.companies LOOP
    IF category_list IS NOT NULL THEN
      FOREACH category_text IN ARRAY category_list LOOP
        INSERT INTO public.categories(category) VALUES (category_text) ON CONFLICT DO NOTHING;
        SELECT id FROM public.categories INTO category_id WHERE category LIKE category_text;
        INSERT INTO public.item_to_category(item_id, category_id) VALUES ((SELECT id FROM public.companies WHERE categories = category_text), category_id);
      END LOOP;
    END IF;
  END LOOP;

  RETURN QUERY SELECT * FROM public.categories;
END
$$ LANGUAGE plpgsql;

Here’s a breakdown of what’s happening in this code:

  • We create a new function scan_categories that returns a SETOF categories type.
  • Inside the function, we declare several variables to hold our data: category_list, category_text, category_id, and company_id.
  • We use a FOR loop to iterate over all records in the companies table. For each record, we check if the categories column is not null.
  • Within this loop, we use another FOREACH loop to iterate over each category text in the comma-delimited list. For each category, we:
    • Insert it into our target categories table using a UNIQUE constraint to handle any potential conflicts.
    • Retrieve the newly inserted category’s id from the categories table and insert it into our target item_to_category table.
  • Finally, we return all rows from the categories table using a RETURN QUERY.

Example Use Case

Let’s test this function with some sample data:

-- Insert sample data into companies table
INSERT INTO public.companies (id, categories) VALUES (1, 'foo, bar');
INSERT INTO public.companies (id, categories) VALUES (2, 'baz, boo');

-- Call the transformation function and retrieve results
SELECT * FROM scan_categories();

This should output all rows from our target categories table:

 id |     name     | company_id
----+--------------+-------------
 1  | foo         |       1
 2  | bar         |       1
 3  | baz         |       2
 4  | boo         |       2

-- Check the item_to_category table for results
SELECT * FROM public.item_to_category;

This should output the expected rows from our target item_to_category table.

Conclusion

In this blog post, we’ve explored how to decompose the problem of importing dissimilar schema and “fanning out” an array of categories into a categories table using Postgres. By breaking down the problem into smaller steps and creating a targeted function, we were able to efficiently transform our data from one schema to another. While this example uses a specific Postgres syntax, the principles applied here can be adapted to other SQL databases with similar functionality.


Last modified on 2023-11-04