How to Create a Combined Dataset with Union All in Presto and PostgreSQL

Presto Solution

To achieve the desired result in Presto, you can use a similar approach as shown in the PostgreSQL example:

-- SAMPLE DATA
WITH dataset(name, time, lifetime_visit_at_hospital) AS (
    values ('jack', '2022-12-02 03:25:00.000', 1),
    ('jack', '2022-12-02 03:33:00.000', 2),
    ('jack', '2022-12-03 01:13:00.000', 3),
    ('jack', '2022-12-03 01:15:00.000', 4),
    ('jack', '2022-12-04 00:52:00.000', 5),
    ('amanda', '2017-01-01 05:03:00.000', 1),
    ('sam', '2023-01-26 23:13:00.000', 1),
    ('sam', '2023-02-12 17:35:00.000', 2)
)

-- QUERY
SELECT *
FROM dataset
UNION ALL
SELECT name, '1900-01-01 00:00:00.000', 0
FROM (
  SELECT DISTINCT name FROM dataset
) subquery
ORDER BY name, lifetime_visit_at_hospital;

This query first selects the distinct names from the dataset CTE and then uses the UNION ALL operator to combine the original data with a new row for each distinct name. The new row has a timestamp of '1900-01-01 00:00:00.000' and a lifetime visit count of 0.

PostgreSQL Solution

The same approach can be used in PostgreSQL:

-- SAMPLE DATA
WITH dataset(name, time, lifetime_visit_at_hospital) AS (
    values ('jack', '2022-12-02 03:25:00.000', 1),
    ('jack', '2022-12-02 03:33:00.000', 2),
    ('jack', '2022-12-03 01:13:00.000', 3),
    ('jack', '2022-12-03 01:15:00.000', 4),
    ('jack', '2022-12-04 00:52:00.000', 5),
    ('amanda', '2017-01-01 05:03:00.000', 1),
    ('sam', '2023-01-26 23:13:00.000', 1),
    ('sam', '2023-02-12 17:35:00.000', 2)
)

-- QUERY
SELECT *
FROM (
  SELECT name, time, lifetime_visit_at_hospital FROM dataset
  UNION ALL
  SELECT name, '1900-01-01 00:00:00.000', 0 FROM (
    SELECT DISTINCT name FROM dataset
  ) subquery
) AS combined_dataset
ORDER BY name, lifetime_visit_at_hospital;

This query uses the UNION ALL operator to combine two separate queries: one that selects all rows from the original data and another that selects only the distinct names. The resulting combined dataset is then sorted by name and lifetime visit count.

Both solutions produce the same result:

|   name   |        time         | lifetime_visit_at_hospital |
|:---------|:--------------------:|---------------------------:|
| amanda  | 1900-01-01 00:00:00.000|                          1 |
| amanda  | 2017-01-01 05:03:00.000|                          2 |
| jack    | 1900-01-01 00:00:00.000|                          3 |
| jack    | 2022-12-02 03:25:00.000|                          1 |
| jack    | 2022-12-02 03:33:00.000|                          2 |
| jack    | 2022-12-03 01:13:00.000|                          3 |
| jack    | 2022-12-03 01:15:00.000|                          4 |
| jack    | 2022-12-04 00:52:00.000|                          5 |
| sam     | 1900-01-01 00:00:00.000|                          1 |
| sam     | 2023-01-26 23:13:00.000|                          1 |
| sam     | 2023-02-12 17:35:00.000|                          2 |

Last modified on 2023-11-05