AWS Athena SQL Query to Get Distinct Data Using GROUP BY and MAX Function

AWS Athena SQL Query to Get Distinct Data

Introduction

AWS Athena is a serverless query service that allows you to analyze data stored in Amazon S3 using SQL. In this article, we will explore how to write an efficient SQL query to get distinct data from a table created in AWS Athena.

Background

The provided question contains a sample dataset in an Excel sheet, which is stored in an S3 bucket and updated continuously with DynamoDB streams data using a Lambda function. The goal is to query the AWS Athena database and retrieve the latest status for each roll number.

Understanding the Data

The given dataset has three columns: rollno, status, name, and place. The status column is changing from “pending” to “approved”, “progress”, and finally “completed”. We want to extract the data with the latest status for each roll number.

rollno | status    |  name | place
----------------------------------
   1   | pending   |   x   | london 
   1   | approved  |   x   | london
   1   | progress  |   x   | london
   1   | completed |   x   | london
   2   | pending   |   y   | delhi
   2   | approved  |   y   | delhi
   2   | progress  |   y   | delhi
   3   | pending   |   z   | newyork
   3   | approved  |   z   | newyork 
   4   | pending   |   a   | seattle

Solution Overview

To solve this problem, we can use the GROUP BY clause in SQL to group the data by roll number and retrieve the latest record for each group. We will also use the MAX function to find the maximum updated_time value for each roll number.

select rolno,status,name,place from table_T where rollno||updated_time
in
(
select rollno||max(updated_time) from table_T group by rollno
)

How it Works

Let’s break down the query:

select rolno,status,name,place from table_T: This selects all columns (rolno, status, name, and place) from the table_T table.
where rollno||updated_time in ...: This filters the data to only include rows where the combination of rollno and updated_time exists in the subquery.
(select rollno||max(updated_time) from table_T group by rollno): This is a subquery that finds the maximum updated_time value for each roll number.

Subquery Explanation

The subquery works as follows:

select rollno||max(updated_time) : This selects the combination of rollno and the maximum updated_time value.
from table_T: This specifies the table to query (table_T).
group by rollno: This groups the data by rollno, so that we can find the maximum updated_time value for each roll number.

Example Use Case

Suppose we have an S3 bucket containing a CSV file with the following data:

rollno,status,name,place,updated_time
1,pending,x,london,2022-01-01
1,approved,x,london,2022-01-02
1,progress,x,london,2022-01-03
1,completed,x,london,2022-01-04
2,pending,y,delhi,2022-01-01
2,approved,y,delhi,2022-01-02
2,progress,y,delhi,2022-01-03
3,pending,z,newyork,2022-01-01
3,approved,z,newyork,2022-01-02
4,pending,a,seattle,2022-01-01

If we run the query:

select rolno,status,name,place from table_T where rollno||updated_time
in
(
select rollno||max(updated_time) from table_T group by rollno
)

We will get the following result:

rollno | status    |  name | place
----------------------------------
   1   | completed |   x   | london
   2   | progress  |   y   | delhi
   3   | approved  |   z   | newyork
   4   | pending   |   a   | seattle

Conclusion

In this article, we have explored how to write an efficient SQL query to get distinct data from a table created in AWS Athena. By using the GROUP BY clause and the MAX function, we can retrieve the latest status for each roll number. This approach is useful when dealing with dynamic data that needs to be processed in real-time.

Last modified on 2024-08-07