Understanding SQL Performance Tuning: A Deep Dive into Two Queries
Introduction
As a beginner in SQL learning, one of the most common questions asked on Stack Overflow is about optimizing SQL queries for better performance. In this article, we will delve into two seemingly similar SQL queries and explore why they have different performance characteristics. We will examine the query optimization process, materialization of derived tables, and how to improve the performance of SQL queries.
The Two Queries
Query 1:
SELECT First_Name,
SUM(total_sales_amount) AS sub_total_sales_amount
FROM
(
select A.First_Name, C.product_quantity * D.Retail_Price AS t
ototal_sales_amount From join_demo.customer as A
inner join join_demo.customer_order as B on A.customer_id = B.customer_id
inner join join_demo.order_details C on B.order_id = C.order_id
inner join join_demo.product as D on C.product_id= D.product_id
) E
GROUP BY 1
ORDER BY sub_total_sales_amount DESC
LIMIT 1;
Query 2:
SELECT A.First_Name,
SUM(C.product_quantity * D.Retail_Price) AS sub_total_sales_amount
From join_demo.customer as A
inner join join_demo.customer_order as B on A.customer_id = B.customer_id
inner join join_demo.order_details C on B.order_id = C.order_id
inner join join_demo.product as D on C.product_id= D.product_id
GROUP BY 1
ORDER BY sub_total_sales_amount DESC
LIMIT 1;
The Execution Plans
The execution plans for both queries are provided in the Stack Overflow question. We will analyze these plans to understand why Query 2 has better performance.
+----+----------+-------+--------+------+---------+-------+--------+------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref |
+----+----------+-------+--------+------+---------+-------+--------+------+
| 1 | SIMPLE | A | NULL | ALL | NULL | NULL | NULL | NULL |
| 1 | SIMPLE | B | NULL | EQ_REF | customer_id| primary| 4 | A.customer_id|
| 1 | SIMPLE | C | NULL | ALL | NULL | NULL | NULL | B.order_id|
| 1 | SIMPLE | D | NULL | ALL | NULL | NULL | NULL | C.product_id|
+----+----------+-------+--------+------+---------+-------+--------+------+
Analysis of Execution Plans
From the execution plans, we can see that:
- Query 1 uses the
SIMPLEselect type, which means it is a single table scan. - Query 2 also uses the
SIMPLEselect type, but with different tables. - The
EQ_REFindex is used in both queries to match thecustomer_idcolumn.
However, there is a key difference between the two queries. In Query 1, the subquery is materialized before being aggregated. This means that MySQL has to scan the entire subquery and store the results in a temporary location.
In Query 2, the subquery is not materialized. Instead, MySQL uses a join-based approach to compute the total_sales_amount column. This approach is more efficient because it avoids the need for materialization.
Optimizer Improvement
As mentioned in the Stack Overflow answer, newer versions of MySQL (starting from version 5.7) are smarter about materialization and will generally merge a subquery with the outer query. This means that more recent versions of MySQL should produce the same execution plan as older versions.
However, optimizers can be confused, and the optimizer may decide to materialize the subquery, which would slow down the query under most circumstances.
Conclusion
In conclusion, Query 2 has better performance than Query 1 because it avoids materialization of derived tables. This approach is more efficient and takes advantage of the join-based optimization that MySQL provides.
To improve the performance of SQL queries, we should:
- Use meaningful table aliases to make our queries easier to read and maintain.
- Qualify column references to ensure clarity on which columns come from which tables.
- Avoid materialization of derived tables whenever possible.
- Take advantage of index-based optimization by using indexes on columns used in
WHERE,JOIN, andORDER BYclauses.
By following these best practices, we can optimize our SQL queries for better performance and scalability.
Last modified on 2025-01-12