Postgres: Decomposing the Problem of Importing Dissimilar Schema and “Fanning Out” an Array of Categories into a Categories Table
As data migration and integration become increasingly complex, it’s not uncommon to encounter scenarios where two or more dissimilar schemas need to be integrated. One such challenge involves importing a dataset with a comma-delimited list of categories from one schema, while another schema already has a table of category names. In this blog post, we’ll delve into the world of Postgres and explore how to decompose this problem, using SQL as our tool of choice.
Understanding the Problem
Let’s take a closer look at the problem statement:
- We have two schemas with dissimilar structures.
- One schema has a table
companieswith a columncategories, which contains a comma-delimited list of categories (e.g., “foo, bar, baz, boo”). - The other schema already has a separate table
categorieswith anidandnamecolumn, as well as another tableitem_to_categorythat links theidof an item to theidof a category. - Our goal is to “fan out” the array of categories from the
companiestable into thecategoriestable in our target schema.
Breaking Down the Problem
To tackle this problem, we’ll need to break it down into smaller, more manageable steps. Here’s a high-level overview of the process:
- Create a new function in Postgres that will handle the transformation from the
companiestable to our target tables. - Iterate over all records in the
companiestable and extract the comma-delimited list of categories. - Insert each category into our target
categoriestable, handling any potential conflicts on thecategorycolumn using aUNIQUEconstraint. - Retrieve the newly inserted category’s
idfrom thecategoriestable and insert it into our targetitem_to_categorytable.
Creating the Transformation Function
Let’s dive into the implementation of this function:
CREATE OR REPLACE FUNCTION scan_categories()
RETURNS SETOF categories AS $$
DECLARE
category_list VARCHAR[];
category_text VARCHAR;
category_id INTEGER;
company_id INTEGER;
BEGIN
FOR category_list, company_id IN SELECT categories, id FROM public.companies LOOP
IF category_list IS NOT NULL THEN
FOREACH category_text IN ARRAY category_list LOOP
INSERT INTO public.categories(category) VALUES (category_text) ON CONFLICT DO NOTHING;
SELECT id FROM public.categories INTO category_id WHERE category LIKE category_text;
INSERT INTO public.item_to_category(item_id, category_id) VALUES ((SELECT id FROM public.companies WHERE categories = category_text), category_id);
END LOOP;
END IF;
END LOOP;
RETURN QUERY SELECT * FROM public.categories;
END
$$ LANGUAGE plpgsql;
Here’s a breakdown of what’s happening in this code:
- We create a new function
scan_categoriesthat returns aSETOF categoriestype. - Inside the function, we declare several variables to hold our data:
category_list,category_text,category_id, andcompany_id. - We use a
FORloop to iterate over all records in thecompaniestable. For each record, we check if thecategoriescolumn is not null. - Within this loop, we use another
FOREACHloop to iterate over each category text in the comma-delimited list. For each category, we:- Insert it into our target
categoriestable using aUNIQUEconstraint to handle any potential conflicts. - Retrieve the newly inserted category’s
idfrom thecategoriestable and insert it into our targetitem_to_categorytable.
- Insert it into our target
- Finally, we return all rows from the
categoriestable using aRETURN QUERY.
Example Use Case
Let’s test this function with some sample data:
-- Insert sample data into companies table
INSERT INTO public.companies (id, categories) VALUES (1, 'foo, bar');
INSERT INTO public.companies (id, categories) VALUES (2, 'baz, boo');
-- Call the transformation function and retrieve results
SELECT * FROM scan_categories();
This should output all rows from our target categories table:
id | name | company_id
----+--------------+-------------
1 | foo | 1
2 | bar | 1
3 | baz | 2
4 | boo | 2
-- Check the item_to_category table for results
SELECT * FROM public.item_to_category;
This should output the expected rows from our target item_to_category table.
Conclusion
In this blog post, we’ve explored how to decompose the problem of importing dissimilar schema and “fanning out” an array of categories into a categories table using Postgres. By breaking down the problem into smaller steps and creating a targeted function, we were able to efficiently transform our data from one schema to another. While this example uses a specific Postgres syntax, the principles applied here can be adapted to other SQL databases with similar functionality.
Last modified on 2023-11-04