Improving SQL Query Performance: Understanding Materialization of Derived Tables vs Join-Based Optimization

Understanding SQL Performance Tuning: A Deep Dive into Two Queries

Introduction

As a beginner in SQL learning, one of the most common questions asked on Stack Overflow is about optimizing SQL queries for better performance. In this article, we will delve into two seemingly similar SQL queries and explore why they have different performance characteristics. We will examine the query optimization process, materialization of derived tables, and how to improve the performance of SQL queries.

The Two Queries

Query 1:

SELECT First_Name,
SUM(total_sales_amount) AS sub_total_sales_amount  
FROM 
(
select A.First_Name, C.product_quantity * D.Retail_Price AS t  
ototal_sales_amount From join_demo.customer as A
inner join join_demo.customer_order as B  on A.customer_id = B.customer_id
inner join join_demo.order_details C  on B.order_id = C.order_id
inner join join_demo.product as D  on C.product_id= D.product_id
) E
GROUP BY 1
ORDER BY sub_total_sales_amount DESC  
LIMIT 1;

Query 2:

SELECT A.First_Name, 
SUM(C.product_quantity * D.Retail_Price) AS sub_total_sales_amount
From join_demo.customer as A
inner join join_demo.customer_order as B  on A.customer_id = B.customer_id
inner join join_demo.order_details C  on B.order_id = C.order_id
inner join join_demo.product as D  on C.product_id= D.product_id  
GROUP BY 1
ORDER BY sub_total_sales_amount DESC  
LIMIT 1;

The Execution Plans

The execution plans for both queries are provided in the Stack Overflow question. We will analyze these plans to understand why Query 2 has better performance.

+----+----------+-------+--------+------+---------+-------+--------+------+
| id | select_type | table | partitions | type | possible_keys | key   | key_len | ref   |
+----+----------+-------+--------+------+---------+-------+--------+------+
| 1  | SIMPLE    | A     | NULL     | ALL | NULL       | NULL  | NULL    | NULL  |
| 1  | SIMPLE    | B     | NULL     | EQ_REF | customer_id| primary|       4 | A.customer_id|
| 1  | SIMPLE    | C     | NULL     | ALL  | NULL       | NULL  | NULL    | B.order_id|
| 1  | SIMPLE    | D     | NULL     | ALL  | NULL       | NULL  | NULL    | C.product_id|
+----+----------+-------+--------+------+---------+-------+--------+------+

Analysis of Execution Plans

From the execution plans, we can see that:

Query 1 uses the SIMPLE select type, which means it is a single table scan.
Query 2 also uses the SIMPLE select type, but with different tables.
The EQ_REF index is used in both queries to match the customer_id column.

However, there is a key difference between the two queries. In Query 1, the subquery is materialized before being aggregated. This means that MySQL has to scan the entire subquery and store the results in a temporary location.

In Query 2, the subquery is not materialized. Instead, MySQL uses a join-based approach to compute the total_sales_amount column. This approach is more efficient because it avoids the need for materialization.

Optimizer Improvement

As mentioned in the Stack Overflow answer, newer versions of MySQL (starting from version 5.7) are smarter about materialization and will generally merge a subquery with the outer query. This means that more recent versions of MySQL should produce the same execution plan as older versions.

However, optimizers can be confused, and the optimizer may decide to materialize the subquery, which would slow down the query under most circumstances.

Conclusion

In conclusion, Query 2 has better performance than Query 1 because it avoids materialization of derived tables. This approach is more efficient and takes advantage of the join-based optimization that MySQL provides.

To improve the performance of SQL queries, we should:

Use meaningful table aliases to make our queries easier to read and maintain.
Qualify column references to ensure clarity on which columns come from which tables.
Avoid materialization of derived tables whenever possible.
Take advantage of index-based optimization by using indexes on columns used in WHERE, JOIN, and ORDER BY clauses.

By following these best practices, we can optimize our SQL queries for better performance and scalability.

Last modified on 2025-01-12