Introduction
A common industrial data analysis flow consists of:
- data preparation
- data aggregation
- data join
- data output
In this guide I will show how to efficiently implement data preparation, aggregation, and output.
Feel free to check the related guides on industrial Oracle SQL using WITH, and on GCP BigQuery SQL CROSS JOIN with UNPIVOT and UNNEST.
Problem definition
Let's define the following problem: as input we have sale revenue per day, and as output we want a JSON document with the sale revenue percentage per day of week, plus the sale revenue percentage per day of week within each group: weekdays and weekends. The problem definition is the same as in the Oracle SQL guide, so I will add one more constraint: the output and the aggregation logic must match the Oracle SQL guide, to make an easy cross-check after each stage.
Parameters
To write professional SQL, we should define all the parameters in a separate block:
with
parameters as (
select
'SATURDAY' as weekend_v0,
'SUNDAY' as weekend_v1,
'WEEKEND' as weekend_label,
'WORKWEEK' as workweek_label
)
select * from parameters;
This is a simple example of how to use WITH. Let's build on this success and go further.
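Since the parameters CTE has exactly one row, any downstream CTE can pick up its columns via a CROSS JOIN without multiplying rows. A minimal sketch (the days CTE here is a made-up example, not part of the guide's data):

with
parameters as (
  select 'WEEKEND' as weekend_label, 'WORKWEEK' as workweek_label
),
-- hypothetical three-row table, just to show the join behaviour
days as (
  select v as day_num from unnest(generate_array(1, 3)) v
)
select
  days.day_num,
  parameters.weekend_label  -- available on every row; still 3 rows total
from days cross join parameters;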
Test data
Now let's generate some test data: a date and the sale revenue for that date:
with
parameters as (
select
'SATURDAY' as weekend_v0,
'SUNDAY' as weekend_v1,
'WEEKEND' as weekend_label,
'WORKWEEK' as workweek_label
),
data_in as (
select
date_add(date('2024-01-01'), interval (v - 1) DAY) as sale_date,
sqrt(v * 31) as sale_revenue
from unnest(generate_array(1, 7)) v
)
select * from data_in;
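Since the whole exercise revolves around days of the week, it is worth a quick sanity check that 2024-01-01 is indeed a Monday, so the seven generated rows cover Monday through Sunday. A small sketch:

-- sanity check: confirm the generated dates start on a Monday
select
  date_add(date('2024-01-01'), interval (v - 1) DAY) as sale_date,
  format_date('%A', date_add(date('2024-01-01'), interval (v - 1) DAY)) as dow
from unnest(generate_array(1, 7)) v;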
Great! Now let's prepare our data for aggregation.
Data preparation
At this stage we need to make sure that we have all the information necessary for the data aggregation and that all the data types are the ones we need. Feel free to check the guide on decimal data types, which applies to our use case as well.
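For instance, the cast to NUMERIC below matters because sqrt() returns a FLOAT64 carrying binary rounding noise, while NUMERIC is an exact decimal type. A tiny sketch of the difference:

-- FLOAT64 vs NUMERIC: round after the cast for stable two-decimal values
select
  sqrt(31) as raw_float64,                         -- ~ 5.5677643628...
  round(cast(sqrt(31) as numeric), 2) as revenue;  -- 5.57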
with
parameters as (
select
'SATURDAY' as weekend_v0,
'SUNDAY' as weekend_v1,
'WEEKEND' as weekend_label,
'WORKWEEK' as workweek_label
),
data_in as (
select
date_add(date('2024-01-01'), interval (v - 1) DAY) as sale_date,
sqrt(v * 31) as sale_revenue
from unnest(generate_array(1, 7)) v
),
data_prep as (
select
data_in_prep.sale_date,
data_in_prep.sale_revenue,
data_in_prep.sale_dow,
(case when (data_in_prep.sale_dow = parameters.weekend_v0 or data_in_prep.sale_dow = parameters.weekend_v1)
then weekend_label else workweek_label end) as sale_type
from (
select
sale_date,
round(cast(sale_revenue as numeric),2) as sale_revenue,
trim(upper(format_date('%A',sale_date))) as sale_dow
from data_in
) data_in_prep cross join parameters
)
select * from data_prep;
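As a side note, a numeric alternative to comparing upper-cased day names is EXTRACT(DAYOFWEEK ...), which in BigQuery returns 1 (Sunday) through 7 (Saturday). A sketch, shown only as a cross-check idea, not as a replacement for the parameterized version above:

-- weekend check via EXTRACT: 1 = Sunday, 7 = Saturday in BigQuery
select
  date('2024-01-06') as sale_date,                        -- a Saturday
  extract(dayofweek from date('2024-01-06')) as dow_num,  -- 7
  if(extract(dayofweek from date('2024-01-06')) in (1, 7),
     'WEEKEND', 'WORKWEEK') as sale_type;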
Good! Now we are set to start data aggregation.
Data aggregation
With properly prepared data, the aggregation is straightforward:
with
parameters as (
select
'SATURDAY' as weekend_v0,
'SUNDAY' as weekend_v1,
'WEEKEND' as weekend_label,
'WORKWEEK' as workweek_label
),
data_in as (
select
date_add(date('2024-01-01'), interval (v - 1) DAY) as sale_date,
sqrt(v * 31) as sale_revenue
from unnest(generate_array(1, 7)) v
),
data_prep as (
select
data_in_prep.sale_date,
data_in_prep.sale_revenue,
data_in_prep.sale_dow,
(case when (data_in_prep.sale_dow = parameters.weekend_v0 or data_in_prep.sale_dow = parameters.weekend_v1)
then weekend_label else workweek_label end) as sale_type
from (
select
sale_date,
round(cast(sale_revenue as numeric),2) as sale_revenue,
trim(upper(format_date('%A',sale_date))) as sale_dow
from data_in
) data_in_prep cross join parameters
),
data_agg as (
select
sale_dow,
sale_revenue,
round(cast(100 * sale_revenue / (sum(sale_revenue) over ())
as numeric), 2) as sale_revenue_perc,
sale_type,
round(cast(100 * sale_revenue / (sum(sale_revenue) over (partition by sale_type))
as numeric), 2) as sale_type_revenue_perc
from data_prep
order by sale_revenue asc
)
select * from data_agg;
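The two window functions do all the work here: sum(...) over () totals the revenue across all rows, while sum(...) over (partition by sale_type) totals it within each group. A minimal sketch on toy numbers:

-- toy data: the total is 100, group A sums to 40, group B to 60
with t as (
  select 'A' as grp, 10 as x union all
  select 'A', 30 union all
  select 'B', 60
)
select
  grp,
  x,
  round(100 * x / sum(x) over (), 2) as pct_of_total,                 -- 10, 30, 60
  round(100 * x / sum(x) over (partition by grp), 2) as pct_of_group  -- 25, 75, 100
from t;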
We may also check the result on a chart and as JSON in the BigQuery console.
Now we are done with the aggregation. Since we have a single data source, we will skip the join stage and move on to data output. Depending on the consumer, we may serve multiple data outputs by caching the aggregation results in a view.
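A sketch of that caching idea; the dataset and view names are hypothetical placeholders, and the single stand-in CTE marks where the full parameters / data_in / data_prep / data_agg chain from above would go:

-- hypothetical dataset and view name; paste the real CTE chain
-- in place of the stand-in below
create or replace view my_dataset.sale_data_agg as
with data_agg as (
  select 'MONDAY' as sale_dow, 5.57 as sale_revenue  -- stand-in CTE
)
select * from data_agg;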
Data output
BigQuery has native support for JSON output:
with
parameters as (
select
'SATURDAY' as weekend_v0,
'SUNDAY' as weekend_v1,
'WEEKEND' as weekend_label,
'WORKWEEK' as workweek_label
),
data_in as (
select
date_add(date('2024-01-01'), interval (v - 1) DAY) as sale_date,
sqrt(v * 31) as sale_revenue
from unnest(generate_array(1, 7)) v
),
data_prep as (
select
data_in_prep.sale_date,
data_in_prep.sale_revenue,
data_in_prep.sale_dow,
(case when (data_in_prep.sale_dow = parameters.weekend_v0 or data_in_prep.sale_dow = parameters.weekend_v1)
then weekend_label else workweek_label end) as sale_type
from (
select
sale_date,
round(cast(sale_revenue as numeric),2) as sale_revenue,
trim(upper(format_date('%A',sale_date))) as sale_dow
from data_in
) data_in_prep cross join parameters
),
data_agg as (
select
sale_dow,
sale_revenue,
round(cast(100 * sale_revenue / (sum(sale_revenue) over ())
as numeric), 2) as sale_revenue_perc,
sale_type,
round(cast(100 * sale_revenue / (sum(sale_revenue) over (partition by sale_type))
as numeric), 2) as sale_type_revenue_perc
from data_prep
order by sale_revenue asc
),
data_output as (
select
upper(to_json_string(json_object('sale_data', sale_data_dow_array))) as json_output
from (
select array_agg(sale_data_dow) as sale_data_dow_array
from(
select
to_json(data_agg_output) as sale_data_dow
from (
select
data_agg.sale_dow,
trim(cast(data_agg.sale_revenue as string format '999999.99')) as sale_revenue,
trim(cast(data_agg.sale_revenue_perc as string format '999999.99')) as sale_revenue_perc,
data_agg.sale_type,
trim(cast(data_agg.sale_type_revenue_perc as string format '999999.99')) as sale_type_revenue_perc
from data_agg
) data_agg_output
)
)
)
select * from data_output;
and here is the result:
{"SALE_DATA":[{"SALE_DOW":"MONDAY","SALE_REVENUE":"5.57","SALE_REVENUE_PERC":"7.42","SALE_TYPE":"WORKWEEK","SALE_TYPE_REVENUE_PERC":"11.93"},{"SALE_DOW":"TUESDAY","SALE_REVENUE":"7.87","SALE_REVENUE_PERC":"10.49","SALE_TYPE":"WORKWEEK","SALE_TYPE_REVENUE_PERC":"16.86"},{"SALE_DOW":"WEDNESDAY","SALE_REVENUE":"9.64","SALE_REVENUE_PERC":"12.85","SALE_TYPE":"WORKWEEK","SALE_TYPE_REVENUE_PERC":"20.66"},{"SALE_DOW":"THURSDAY","SALE_REVENUE":"11.14","SALE_REVENUE_PERC":"14.85","SALE_TYPE":"WORKWEEK","SALE_TYPE_REVENUE_PERC":"23.87"},{"SALE_DOW":"FRIDAY","SALE_REVENUE":"12.45","SALE_REVENUE_PERC":"16.59","SALE_TYPE":"WORKWEEK","SALE_TYPE_REVENUE_PERC":"26.68"},{"SALE_DOW":"SATURDAY","SALE_REVENUE":"13.64","SALE_REVENUE_PERC":"18.18","SALE_TYPE":"WEEKEND","SALE_TYPE_REVENUE_PERC":"48.08"},{"SALE_DOW":"SUNDAY","SALE_REVENUE":"14.73","SALE_REVENUE_PERC":"19.63","SALE_TYPE":"WEEKEND","SALE_TYPE_REVENUE_PERC":"51.92"}]}
With BigQuery there are more efficient ways to use parameters and to aggregate the data; if you know them, feel free to post in the comments. Stay tuned!