BigQuery and Gluing Tables Together: A Deep Dive into Standard SQL
BigQuery is a powerful data analytics engine for processing and analyzing large datasets. One of its useful features is the ability to query several tables at once and combine their rows into a single result set, which makes the data easier to analyze and visualize. In this article, we will explore how to glue multiple tables together in BigQuery using Standard SQL.
Understanding the Problem
The problem at hand is to analyze Google Analytics data for the past 30 days in BigQuery. The solution provider has created separate tables for each day, saved in the format ga_sessions_YYYYMMDD. Instead of using a JOIN operation on a common column, they want to simply add more rows to the data by gluing multiple tables together.
Standard SQL and Wildcard Tables
To glue tables together in BigQuery, we can use either the UNION ALL operator or the wildcard table feature. A wildcard table lets a single query select from every table whose name matches a prefix, so we do not have to list each table by hand.
Wildcard Table Feature
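Wildcard tables are available in Standard SQL for any set of tables whose names share a common prefix, such as the daily ga_sessions_YYYYMMDD tables; the tables should have compatible schemas. To use the feature, we end the table name in the FROM clause with an asterisk (*) and wrap the whole name in backticks. Each row then exposes a _TABLE_SUFFIX pseudo-column holding the part of the table name matched by the asterisk, which we can filter in the WHERE clause (e.g., _TABLE_SUFFIX BETWEEN '20171001' AND '20171031'). Note that BETWEEN expects the lower bound first.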
Here’s an example query that uses the wildcard table feature:
SELECT
*
FROM
`12345678.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20171001' AND '20171031'
This query selects all columns (*) from every ga_sessions_ table whose _TABLE_SUFFIX value falls between '20171001' and '20171031', inclusive. The result contains the rows of all matching daily tables, stacked on top of each other.
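Since the goal stated earlier is the past 30 days rather than a fixed month, the suffix bounds can also be computed from the current date. The following is a minimal sketch, assuming the daily tables live in the placeholder dataset 12345678 and that "past 30 days" means the 30 days ending yesterday; the column list is only illustrative.

```sql
-- Sketch: sessions from the last 30 complete days.
-- Assumption: `12345678` is a placeholder dataset ID; replace it with your own.
SELECT
  _TABLE_SUFFIX AS table_date,   -- the YYYYMMDD part matched by the asterisk
  fullVisitorId,
  visitId,
  totals.pageviews AS pageviews
FROM
  `12345678.ga_sessions_*`
WHERE
  -- Lower bound first, upper bound second; both are YYYYMMDD strings.
  _TABLE_SUFFIX BETWEEN
    FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
    AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
```

Because the suffixes are zero-padded YYYYMMDD strings, comparing them as strings gives the same ordering as comparing the dates themselves.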
UNION ALL Operator
Another way to glue tables together is the UNION ALL operator. It appends the rows of one query's result to the rows of another, keeping duplicates. Every branch must return the same number of columns, with compatible types in matching positions.
Here’s an example query that uses the UNION ALL operator:
SELECT
fullVisitorId, visitID, visitNumber, totals.timeOnSite, totals.pageviews, totals.sessionQualityDim, device.deviceCategory
FROM
`12345678.ga_sessions_*`
WHERE
_TABLE_SUFFIX = '20171031'
UNION ALL
SELECT
fullVisitorId, visitID, visitNumber, totals.timeOnSite, totals.pageviews, totals.sessionQualityDim, device.deviceCategory
FROM
`12345678.ga_sessions_*`
WHERE
_TABLE_SUFFIX = '20171001'
This query selects the same seven columns from two daily tables in the 12345678 dataset, one for each date, by pinning _TABLE_SUFFIX to a single value in each branch. The result contains the rows from both tables. Note that table names must be quoted with backticks, not single quotes.
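When only a couple of known tables are involved, UNION ALL can also name each table directly instead of going through a wildcard. This is a minimal sketch under the same assumptions (placeholder dataset 12345678, standard Google Analytics export columns); it is meant to show the shape of the query, not a recommended column set.

```sql
-- Sketch: glue two specific daily tables together by naming them directly.
-- Both branches return the same columns, in the same order.
SELECT
  fullVisitorId,
  visitId,
  totals.pageviews AS pageviews
FROM
  `12345678.ga_sessions_20171031`
UNION ALL
SELECT
  fullVisitorId,
  visitId,
  totals.pageviews AS pageviews
FROM
  `12345678.ga_sessions_20171001`
```

UNION ALL matches columns by position, and the output column names come from the first branch, so keeping the SELECT lists identical avoids surprises.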
Choosing Between UNION ALL and Wildcard Tables
Both UNION ALL and wildcard tables can be used to glue tables together, but they have some key differences:
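- Performance: Both approaches read the same underlying tables, so a wildcard query is usually at least as fast as the equivalent UNION ALL. Its main practical advantage is that one short query can cover many tables, with the _TABLE_SUFFIX filter limiting which ones are scanned.
- Flexibility: The UNION ALL operator lets each branch use its own source table, column list, joins, and filters, while a wildcard query applies one SELECT to every matching table and expects those tables to have compatible schemas.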
Choosing Wildcard Tables for Performance
Wildcard tables are usually the better choice when many tables with the same schema need to be combined. A single query covers all of them, and a filter on _TABLE_SUFFIX limits the scan to the matching tables, whereas a UNION ALL needs one branch per table. If you’re working with large datasets and need to glue them together quickly, wildcard tables may be a better choice.
Here are the two approaches side by side. Note that the wildcard query covers the whole of October 2017, while the UNION ALL query reads only two of those days:
-- Wildcard Table Query
SELECT
*
FROM
`12345678.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20171001' AND '20171031';
-- UNION ALL Operator Query
SELECT
fullVisitorId, visitID, visitNumber, totals.timeOnSite, totals.pageviews, totals.sessionQualityDim, device.deviceCategory
FROM
`12345678.ga_sessions_*`
WHERE
_TABLE_SUFFIX = '20171031'
UNION ALL
SELECT
fullVisitorId, visitID, visitNumber, totals.timeOnSite, totals.pageviews, totals.sessionQualityDim, device.deviceCategory
FROM
`12345678.ga_sessions_*`
WHERE
_TABLE_SUFFIX = '20171001'
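Whichever approach is used, it helps to confirm which daily tables actually contributed rows. The query below is a small sketch of one way to check this for a wildcard query, again assuming the placeholder dataset 12345678; the alias names are illustrative.

```sql
-- Sketch: count the rows contributed by each daily table in the range.
SELECT
  _TABLE_SUFFIX AS table_date,   -- YYYYMMDD suffix of the source table
  COUNT(*) AS session_rows
FROM
  `12345678.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20171001' AND '20171031'   -- lower bound first
GROUP BY
  table_date
ORDER BY
  table_date
```

A date missing from the output means no table with that suffix contributed rows, and an empty result is often a sign that the BETWEEN bounds were written in the wrong order.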
Choosing UNION ALL for Flexibility
The UNION ALL operator lets each branch be a complete query of its own, with its own joins and filters. Wildcard tables can also be joined and filtered, but the same logic then applies to every matching table. If you need to glue tables together while treating some of them differently, for example with different join conditions or filters per date range, the UNION ALL operator may be a better choice.
Here’s an example query that combines the UNION ALL operator with joins. It assumes a second set of daily tables, requests_YYYYMMDD, whose suffixes line up with the ga_sessions_ tables:
-- Union All Query with Joins
SELECT
ga.fullVisitorId, ga.visitID, ga.visitNumber, ga.totals.timeOnSite,
ra.totals.pageviews, ra.totals.sessionQualityDim, ra.device.deviceCategory
FROM
`ga_sessions_*` ga
INNER JOIN `requests_*` ra ON ga._TABLE_SUFFIX = ra._TABLE_SUFFIX AND ga.visitID = ra.visitID
WHERE
-- 20171001 is handled by the second branch, so this range starts one day later
ga._TABLE_SUFFIX BETWEEN '20171002' AND '20171031'
UNION ALL
SELECT
ga.fullVisitorId, ga.visitID, ga.visitNumber, ga.totals.timeOnSite,
ra.totals.pageviews, ra.totals.sessionQualityDim, ra.device.deviceCategory
FROM
`ga_sessions_*` ga
INNER JOIN `requests_*` ra ON ga._TABLE_SUFFIX = ra._TABLE_SUFFIX AND ga.visitID = ra.visitID
WHERE
ga._TABLE_SUFFIX = '20171001'
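The flexibility is easier to see when the branches really do differ. The following sketch gives each branch its own filter and tags its rows with a literal label; the segment column and its values are hypothetical names introduced for illustration, and 12345678 is again a placeholder dataset.

```sql
-- Sketch: each UNION ALL branch applies its own filter and labels its rows.
SELECT
  'desktop_20171031' AS segment,   -- hypothetical label for this branch
  fullVisitorId,
  visitId,
  totals.pageviews AS pageviews
FROM
  `12345678.ga_sessions_20171031`
WHERE
  device.deviceCategory = 'desktop'
UNION ALL
SELECT
  'mobile_20171001' AS segment,    -- hypothetical label for this branch
  fullVisitorId,
  visitId,
  totals.pageviews AS pageviews
FROM
  `12345678.ga_sessions_20171001`
WHERE
  device.deviceCategory = 'mobile'
```

The label column makes it possible to tell afterwards which branch a row came from, which a single wildcard query with one shared WHERE clause does not provide directly.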
Best Practices for Gluing Tables Together in BigQuery
When using wildcard tables or the UNION ALL operator to glue tables together in BigQuery, here are some best practices to keep in mind:
- Use meaningful table names: Choose table names that clearly indicate their purpose and content.
- Specify the correct table prefix: Make sure the prefix before the asterisk (e.g., ga_sessions_) matches only the tables you intend to query.
- Avoid overusing UNION ALL: While the UNION ALL operator can be powerful, every branch is a separate subquery that scans its own table, and long chains are hard to maintain. Use it sparingly and only when necessary.
- Use efficient join conditions: Optimize your join conditions to reduce scan times and improve performance.
By following these best practices and choosing between wildcard tables and the UNION ALL operator based on your specific needs, you can efficiently glue tables together in BigQuery and get the insights you need from your data.
Last modified on 2024-07-01