Introduction
A common industrial flow of data analysis consists of :
- data preparation
- data aggregation
- data join
- data output
In this guide I will show how to implement efficiently data preparation, aggregation and output.
Feel free to check on industrial Oracle SQL using WITH and AWS Athena SQL UNPIVOT : CROSS JOIN UNNEST, SQL WITH clause.
Problem definition
Let's define the following problem : as the input we have sale revenue per day, and we wish to have a json as the output with the sale revenue percentage per day of week and the sale revenue percentage per day of week within the group : weekdays and weekends. The problem definition is the same as for the Oracle SQL guide, therefore I will add an additional constraint : the output and the way of aggregation must be the same as for the Oracle SQL guide to make an easy cross check after each stage.
Parameters
In order to write professional SQL we should better define all the parameters within a separate block:
with
parameters as (
select
'SATURDAY' as weekend_v0,
'SUNDAY' as weekend_v1,
'WEEKEND' as weekend_label,
'WORKWEEK' as workweek_label
)
select * from parameters;
This is a simple example how to use with
. Let's extend the success and go further.
Test data
Now let's generate some test data : date and sale revenue for the date :
with
parameters as (
select
'SATURDAY' as weekend_v0,
'SUNDAY' as weekend_v1,
'WEEKEND' as weekend_label,
'WORKWEEK' as workweek_label
),
data_in as (
select
date_add('day', (t.v - 1), from_iso8601_date('2024-01-01')) as sale_date,
sqrt(t.v * 31) as sale_revenue
from unnest(sequence(1, 7)) t(v)
)
select * from data_in;
Great! Now let's prepare our data for aggregation.
Data preparation
At this stage we need to be sure that we have all the necessary information for the data aggregation and that all the data types are the ones that we need. Feel free to check the guide on decimal data types which is applicable to our use case as well.
with
parameters as (
select
'SATURDAY' as weekend_v0,
'SUNDAY' as weekend_v1,
'WEEKEND' as weekend_label,
'WORKWEEK' as workweek_label
),
data_in as (
select
date_add('day', (t.v - 1), from_iso8601_date('2024-01-01')) as sale_date,
sqrt(t.v * 31) as sale_revenue
from unnest(sequence(1, 7)) t(v)
),
data_prep as (
select
data_in_prep.sale_date,
data_in_prep.sale_revenue,
data_in_prep.sale_dow,
(case when (data_in_prep.sale_dow = parameters.weekend_v0 or data_in_prep.sale_dow = parameters.weekend_v1)
then weekend_label else workweek_label end) as sale_type
from (
select
sale_date,
cast(sale_revenue as decimal(31,2)) as sale_revenue,
trim( upper(date_format(sale_date, '%W'))) as sale_dow
from data_in
) data_in_prep cross join parameters
)
select * from data_prep;
Good! Now we are set to start data aggregation.
Data aggregation
With properly prepared data the data aggregation is straight forward :
with
parameters as (
select
'SATURDAY' as weekend_v0,
'SUNDAY' as weekend_v1,
'WEEKEND' as weekend_label,
'WORKWEEK' as workweek_label
),
data_in as (
select
date_add('day', (t.v - 1), from_iso8601_date('2024-01-01')) as sale_date,
sqrt(t.v * 31) as sale_revenue
from unnest(sequence(1, 7)) t(v)
),
data_prep as (
select
data_in_prep.sale_date,
data_in_prep.sale_revenue,
data_in_prep.sale_dow,
(case when (data_in_prep.sale_dow = parameters.weekend_v0 or data_in_prep.sale_dow = parameters.weekend_v1)
then weekend_label else workweek_label end) as sale_type
from (
select
sale_date,
cast(sale_revenue as decimal(31,2)) as sale_revenue,
trim( upper(date_format(sale_date, '%W'))) as sale_dow
from data_in
) data_in_prep cross join parameters
),
data_agg as (
select
sale_dow,
sale_revenue,
100 * sale_revenue / (sum(sale_revenue) over ()) as sale_revenue_perc,
sale_type,
100 * sale_revenue / (sum(sale_revenue) over (partition by sale_type)) as sale_type_revenue_perc
from data_prep
order by sale_revenue asc
)
select * from data_agg;
Now we are done with the aggregation and as we have a single data source we will skip join
and we may start with data output. Depending on the consumer we may have multiple data outputs by caching the results of data aggregation with a view creation.
Data output
AWS Athena has a native support for json output :
with
parameters as (
select
'SATURDAY' as weekend_v0,
'SUNDAY' as weekend_v1,
'WEEKEND' as weekend_label,
'WORKWEEK' as workweek_label
),
data_in as (
select
date_add('day', (t.v - 1), from_iso8601_date('2024-01-01')) as sale_date,
sqrt(t.v * 31) as sale_revenue
from unnest(sequence(1, 7)) t(v)
),
data_prep as (
select
data_in_prep.sale_date,
data_in_prep.sale_revenue,
data_in_prep.sale_dow,
(case when (data_in_prep.sale_dow = parameters.weekend_v0 or data_in_prep.sale_dow = parameters.weekend_v1)
then weekend_label else workweek_label end) as sale_type
from (
select
sale_date,
cast(sale_revenue as decimal(31,2)) as sale_revenue,
trim( upper(date_format(sale_date, '%W'))) as sale_dow
from data_in
) data_in_prep cross join parameters
),
data_agg as (
select
sale_dow,
sale_revenue,
100 * sale_revenue / (sum(sale_revenue) over ()) as sale_revenue_perc,
sale_type,
100 * sale_revenue / (sum(sale_revenue) over (partition by sale_type)) as sale_type_revenue_perc
from data_prep
order by sale_revenue asc
),
data_output as (
select
upper(json_object('sale_data' value '['||array_join(array_agg(sale_data_dow), ',') ||']' format json)) as json_output
from (
select
json_object(
'sale_dow' value sale_dow,
'sale_revenue' value sale_revenue,
'sale_revenue_perc' value sale_revenue_perc,
'sale_type' value sale_type,
'sale_type_revenue_perc' value sale_type_revenue_perc
) as sale_data_dow
from data_agg
)
)
select * from data_output;
and here is the result:
{"SALE_DATA":[{"SALE_TYPE_REVENUE_PERC":11.93,"SALE_REVENUE_PERC":7.42,"SALE_TYPE":"WORKWEEK","SALE_REVENUE":5.57,"SALE_DOW":"MONDAY"},{"SALE_TYPE_REVENUE_PERC":16.86,"SALE_REVENUE_PERC":10.49,"SALE_TYPE":"WORKWEEK","SALE_REVENUE":7.87,"SALE_DOW":"TUESDAY"},{"SALE_TYPE_REVENUE_PERC":20.66,"SALE_REVENUE_PERC":12.85,"SALE_TYPE":"WORKWEEK","SALE_REVENUE":9.64,"SALE_DOW":"WEDNESDAY"},{"SALE_TYPE_REVENUE_PERC":23.87,"SALE_REVENUE_PERC":14.85,"SALE_TYPE":"WORKWEEK","SALE_REVENUE":11.14,"SALE_DOW":"THURSDAY"},{"SALE_TYPE_REVENUE_PERC":26.68,"SALE_REVENUE_PERC":16.59,"SALE_TYPE":"WORKWEEK","SALE_REVENUE":12.45,"SALE_DOW":"FRIDAY"},{"SALE_TYPE_REVENUE_PERC":48.08,"SALE_REVENUE_PERC":18.18,"SALE_TYPE":"WEEKEND","SALE_REVENUE":13.64,"SALE_DOW":"SATURDAY"},{"SALE_TYPE_REVENUE_PERC":51.92,"SALE_REVENUE_PERC":19.63,"SALE_TYPE":"WEEKEND","SALE_REVENUE":14.73,"SALE_DOW":"SUNDAY"}]}
With Athena there are more efficient ways to use parameters and to aggregate the data. By the way, if you know how to use json_array
in this context, feel free to write in the comments. There are more efficient ways to output the data using unload. Stay tuned!
Top comments (0)