AWS Athena SQL Query to Get Distinct Data
Introduction
AWS Athena is a serverless query service that allows you to analyze data stored in Amazon S3 using SQL. In this article, we will explore how to write an efficient SQL query to get distinct data from a table created in AWS Athena.
Background
The provided question contains a sample dataset in an Excel sheet, which is stored in an S3 bucket and updated continuously with DynamoDB streams data using a Lambda function. The goal is to query the AWS Athena database and retrieve the latest status for each roll number.
Understanding the Data
The given dataset has three columns: rollno, status, name, and place. The status column is changing from “pending” to “approved”, “progress”, and finally “completed”. We want to extract the data with the latest status for each roll number.
rollno | status | name | place
----------------------------------
1 | pending | x | london
1 | approved | x | london
1 | progress | x | london
1 | completed | x | london
2 | pending | y | delhi
2 | approved | y | delhi
2 | progress | y | delhi
3 | pending | z | newyork
3 | approved | z | newyork
4 | pending | a | seattle
Solution Overview
To solve this problem, we can use the GROUP BY clause in SQL to group the data by roll number and retrieve the latest record for each group. We will also use the MAX function to find the maximum updated_time value for each roll number.
select rolno,status,name,place from table_T where rollno||updated_time
in
(
select rollno||max(updated_time) from table_T group by rollno
)
How it Works
Let’s break down the query:
select rolno,status,name,place from table_T: This selects all columns (rolno,status,name, andplace) from thetable_Ttable.where rollno||updated_time in ...: This filters the data to only include rows where the combination ofrollnoandupdated_timeexists in the subquery.(select rollno||max(updated_time) from table_T group by rollno): This is a subquery that finds the maximumupdated_timevalue for each roll number.
Subquery Explanation
The subquery works as follows:
select rollno||max(updated_time): This selects the combination ofrollnoand the maximumupdated_timevalue.from table_T: This specifies the table to query (table_T).group by rollno: This groups the data byrollno, so that we can find the maximumupdated_timevalue for each roll number.
Example Use Case
Suppose we have an S3 bucket containing a CSV file with the following data:
rollno,status,name,place,updated_time
1,pending,x,london,2022-01-01
1,approved,x,london,2022-01-02
1,progress,x,london,2022-01-03
1,completed,x,london,2022-01-04
2,pending,y,delhi,2022-01-01
2,approved,y,delhi,2022-01-02
2,progress,y,delhi,2022-01-03
3,pending,z,newyork,2022-01-01
3,approved,z,newyork,2022-01-02
4,pending,a,seattle,2022-01-01
If we run the query:
select rolno,status,name,place from table_T where rollno||updated_time
in
(
select rollno||max(updated_time) from table_T group by rollno
)
We will get the following result:
rollno | status | name | place
----------------------------------
1 | completed | x | london
2 | progress | y | delhi
3 | approved | z | newyork
4 | pending | a | seattle
Conclusion
In this article, we have explored how to write an efficient SQL query to get distinct data from a table created in AWS Athena. By using the GROUP BY clause and the MAX function, we can retrieve the latest status for each roll number. This approach is useful when dealing with dynamic data that needs to be processed in real-time.
Last modified on 2024-08-07