hexfloor

SQL Pro Tips : industrial GCP BigQuery SQL using WITH

Introduction

A common industrial flow of data analysis consists of :

  • data preparation
  • data aggregation
  • data join
  • data output

In this guide I will show how to implement data preparation, aggregation and output efficiently.
Feel free to check the related guides on industrial Oracle SQL using WITH, on GCP BigQuery SQL CROSS JOIN with UNPIVOT and UNNEST, and on the SQL WITH clause.

Problem definition

Let's define the following problem : as the input we have the sale revenue per day, and as the output we want a json with the sale revenue percentage per day of week, both overall and within each group : weekdays and weekends. The problem definition is the same as in the Oracle SQL guide, so I will add an additional constraint : the output and the way of aggregation must match the Oracle SQL guide, to allow an easy cross check after each stage.
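Before writing any SQL, the two percentages can be pinned down with a tiny worked example. This is a plain Python sketch with made-up revenues for three days, only to fix the arithmetic we expect from the queries below :

```python
# Made-up daily revenues for three days (illustrative values only)
revenues = {"MONDAY": 10.0, "SATURDAY": 30.0, "SUNDAY": 60.0}
weekend = {"SATURDAY", "SUNDAY"}

total = sum(revenues.values())                                        # 100.0
weekend_total = sum(r for d, r in revenues.items() if d in weekend)   # 90.0

# percentage of the overall revenue
overall = {d: round(100 * r / total, 2) for d, r in revenues.items()}
# percentage within the day's group (here: the weekend group)
within_weekend = {d: round(100 * revenues[d] / weekend_total, 2) for d in weekend}

print(overall)          # MONDAY contributes 10% of the overall revenue
print(within_weekend)   # SATURDAY is a third of the weekend revenue
```

The same two ratios — share of the grand total and share of the group total — are exactly what the window functions compute later on.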

Parameters

In order to write professional SQL, we should define all the parameters in a separate block:

with 
parameters as (
    select 
        'SATURDAY' as weekend_v0,
        'SUNDAY' as weekend_v1,
        'WEEKEND' as weekend_label,
        'WORKWEEK' as workweek_label
)
select * from parameters;


This is a simple example of how to use WITH. Let's build on this success and go further.

Test data

Now let's generate some test data : a date and the sale revenue for that date :

with 
parameters as (
    select 
        'SATURDAY' as weekend_v0,
        'SUNDAY' as weekend_v1,
        'WEEKEND' as weekend_label,
        'WORKWEEK' as workweek_label
),
data_in as (
  select 
    date_add(date('2024-01-01'), interval (v - 1) DAY) as sale_date,
    sqrt(v * 31) as sale_revenue
  from unnest(generate_array(1, 7)) v
)
select * from data_in;
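The same series is easy to reproduce outside BigQuery for cross checking. A minimal Python sketch of the `sqrt(v * 31)` generator, rounded to 2 decimals as the later preparation stage does :

```python
import math
from datetime import date, timedelta

# Same generator as data_in: 7 days from 2024-01-01, revenue = sqrt(v * 31)
start = date(2024, 1, 1)
rows = [(start + timedelta(days=v - 1), round(math.sqrt(v * 31), 2))
        for v in range(1, 8)]

for sale_date, sale_revenue in rows:
    print(sale_date, sale_revenue)   # 2024-01-01 5.57, ..., 2024-01-07 14.73
```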


Great! Now let's prepare our data for aggregation.

Data preparation

At this stage we need to make sure that we have all the information necessary for the aggregation and that all the data types are the ones we need. Feel free to check the guide on decimal data types, which applies to our use case as well.

with 
parameters as (
    select 
        'SATURDAY' as weekend_v0,
        'SUNDAY' as weekend_v1,
        'WEEKEND' as weekend_label,
        'WORKWEEK' as workweek_label
),
data_in as (
  select 
    date_add(date('2024-01-01'), interval (v - 1) DAY) as sale_date,
    sqrt(v * 31) as sale_revenue
  from unnest(generate_array(1, 7)) v
),
data_prep as (
  select 
    data_in_prep.sale_date,
    data_in_prep.sale_revenue,
    data_in_prep.sale_dow,
    (case when (data_in_prep.sale_dow = parameters.weekend_v0 or data_in_prep.sale_dow = parameters.weekend_v1) 
            then parameters.weekend_label else parameters.workweek_label end) as sale_type
  from (
        select
            sale_date,
            round(cast(sale_revenue as numeric),2) as sale_revenue,
            trim(upper(format_date('%A',sale_date))) as sale_dow
        from data_in
  ) data_in_prep cross join parameters
)
select * from data_prep;
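The day-of-week labeling and the WEEKEND/WORKWEEK split can be cross checked in plain Python as well — `strftime('%A')` plays the role of `format_date('%A', ...)` here (a sketch, with the weekend set mirroring the `weekend_v0` / `weekend_v1` parameters) :

```python
from datetime import date, timedelta

WEEKEND = {"SATURDAY", "SUNDAY"}   # mirrors weekend_v0 / weekend_v1
start = date(2024, 1, 1)           # 2024-01-01 is a Monday

prep = []
for v in range(1, 8):
    d = start + timedelta(days=v - 1)
    dow = d.strftime("%A").upper()                 # same idea as format_date('%A', ...)
    sale_type = "WEEKEND" if dow in WEEKEND else "WORKWEEK"
    prep.append((d, dow, sale_type))

for row in prep:
    print(*row)
```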


Good! Now we are set to start data aggregation.

Data aggregation

With properly prepared data, the aggregation is straightforward :

with 
parameters as (
    select 
        'SATURDAY' as weekend_v0,
        'SUNDAY' as weekend_v1,
        'WEEKEND' as weekend_label,
        'WORKWEEK' as workweek_label
),
data_in as (
  select 
    date_add(date('2024-01-01'), interval (v - 1) DAY) as sale_date,
    sqrt(v * 31) as sale_revenue
  from unnest(generate_array(1, 7)) v
),
data_prep as (
  select 
    data_in_prep.sale_date,
    data_in_prep.sale_revenue,
    data_in_prep.sale_dow,
    (case when (data_in_prep.sale_dow = parameters.weekend_v0 or data_in_prep.sale_dow = parameters.weekend_v1) 
            then parameters.weekend_label else parameters.workweek_label end) as sale_type
  from (
        select
            sale_date,
            round(cast(sale_revenue as numeric),2) as sale_revenue,
            trim(upper(format_date('%A',sale_date))) as sale_dow
        from data_in
  ) data_in_prep cross join parameters
),
data_agg as (
    select
        sale_dow,
        sale_revenue,
        round(cast(100 * sale_revenue / (sum(sale_revenue) over ()) 
            as numeric), 2) as sale_revenue_perc,
        sale_type,
        round(cast(100 * sale_revenue / (sum(sale_revenue) over (partition by sale_type)) 
            as numeric), 2) as sale_type_revenue_perc
    from data_prep
    order by sale_revenue asc
)
select * from data_agg;
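As a sanity check outside BigQuery, both window percentages can be recomputed in plain Python. This is a sketch with the weekday/weekend split hard-coded for the 2024-01-01 (Monday) start date; it mirrors `sum(...) over ()` and `sum(...) over (partition by sale_type)` :

```python
import math

# Rounded daily revenues, Monday .. Sunday (same series as data_prep)
revenues = [round(math.sqrt(v * 31), 2) for v in range(1, 8)]
types = ["WORKWEEK"] * 5 + ["WEEKEND"] * 2

total = sum(revenues)                                    # sum(...) over ()
group_total = {t: sum(r for r, tt in zip(revenues, types) if tt == t)
               for t in set(types)}                      # ... over (partition by sale_type)

agg = []
for r, t in zip(revenues, types):
    sale_revenue_perc = round(100 * r / total, 2)
    sale_type_revenue_perc = round(100 * r / group_total[t], 2)
    agg.append((r, t, sale_revenue_perc, sale_type_revenue_perc))

for row in agg:
    print(*row)
```

The values should line up with the json shown in the output stage below, e.g. Monday at 7.42% overall and 11.93% of the workweek.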

In the BigQuery console we may also check the results as a chart and as a json preview.

Now we are done with the aggregation. Since we have a single data source, we can skip the join stage and move on to data output. Depending on the consumers, we may produce multiple data outputs by caching the aggregation results in a view.

Data output

BigQuery has native support for json output :

with 
parameters as (
    select 
        'SATURDAY' as weekend_v0,
        'SUNDAY' as weekend_v1,
        'WEEKEND' as weekend_label,
        'WORKWEEK' as workweek_label
),
data_in as (
  select 
    date_add(date('2024-01-01'), interval (v - 1) DAY) as sale_date,
    sqrt(v * 31) as sale_revenue
  from unnest(generate_array(1, 7)) v
),
data_prep as (
  select 
    data_in_prep.sale_date,
    data_in_prep.sale_revenue,
    data_in_prep.sale_dow,
    (case when (data_in_prep.sale_dow = parameters.weekend_v0 or data_in_prep.sale_dow = parameters.weekend_v1) 
            then parameters.weekend_label else parameters.workweek_label end) as sale_type
  from (
        select
            sale_date,
            round(cast(sale_revenue as numeric),2) as sale_revenue,
            trim(upper(format_date('%A',sale_date))) as sale_dow
        from data_in
  ) data_in_prep cross join parameters
),
data_agg as (
    select
        sale_dow,
        sale_revenue,
        round(cast(100 * sale_revenue / (sum(sale_revenue) over ()) 
            as numeric), 2) as sale_revenue_perc,
        sale_type,
        round(cast(100 * sale_revenue / (sum(sale_revenue) over (partition by sale_type)) 
            as numeric), 2) as sale_type_revenue_perc
    from data_prep
    order by sale_revenue asc
),
data_output as (
  select 
    upper(to_json_string(json_object('sale_data', sale_data_dow_array))) as json_output
  from (
    select array_agg(sale_data_dow) as sale_data_dow_array -- note : array order is not guaranteed without an ORDER BY inside array_agg
      from(
        select 
          to_json(data_agg_output) as sale_data_dow
        from (
          select 
            data_agg.sale_dow,
            trim(cast(data_agg.sale_revenue as string format '999999.99')) as sale_revenue,
            trim(cast(data_agg.sale_revenue_perc as string format '999999.99')) as sale_revenue_perc,
            data_agg.sale_type,
            trim(cast(data_agg.sale_type_revenue_perc as string format '999999.99')) as sale_type_revenue_perc
          from data_agg
        ) data_agg_output
    )
  )
)
select * from data_output;
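The shape of this output is easy to reproduce with the standard library, which is handy for writing an automated cross check against the query result. A sketch, with two hypothetical pre-aggregated rows matching the data_agg stage :

```python
import json

# Hypothetical pre-aggregated rows: (dow, revenue, perc, type, type_perc)
rows = [
    ("MONDAY", 5.57, 7.42, "WORKWEEK", 11.93),
    ("SATURDAY", 13.64, 18.18, "WEEKEND", 48.08),
]

# Uppercase keys and fixed 2-decimal strings, like upper(to_json_string(...))
sale_data = [
    {
        "SALE_DOW": dow,
        "SALE_REVENUE": f"{rev:.2f}",
        "SALE_REVENUE_PERC": f"{perc:.2f}",
        "SALE_TYPE": sale_type,
        "SALE_TYPE_REVENUE_PERC": f"{type_perc:.2f}",
    }
    for dow, rev, perc, sale_type, type_perc in rows
]

# separators=(",", ":") gives the same compact layout as to_json_string
json_output = json.dumps({"SALE_DATA": sale_data}, separators=(",", ":"))
print(json_output)
```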

and here is the result:

{"SALE_DATA":[{"SALE_DOW":"MONDAY","SALE_REVENUE":"5.57","SALE_REVENUE_PERC":"7.42","SALE_TYPE":"WORKWEEK","SALE_TYPE_REVENUE_PERC":"11.93"},{"SALE_DOW":"TUESDAY","SALE_REVENUE":"7.87","SALE_REVENUE_PERC":"10.49","SALE_TYPE":"WORKWEEK","SALE_TYPE_REVENUE_PERC":"16.86"},{"SALE_DOW":"WEDNESDAY","SALE_REVENUE":"9.64","SALE_REVENUE_PERC":"12.85","SALE_TYPE":"WORKWEEK","SALE_TYPE_REVENUE_PERC":"20.66"},{"SALE_DOW":"THURSDAY","SALE_REVENUE":"11.14","SALE_REVENUE_PERC":"14.85","SALE_TYPE":"WORKWEEK","SALE_TYPE_REVENUE_PERC":"23.87"},{"SALE_DOW":"FRIDAY","SALE_REVENUE":"12.45","SALE_REVENUE_PERC":"16.59","SALE_TYPE":"WORKWEEK","SALE_TYPE_REVENUE_PERC":"26.68"},{"SALE_DOW":"SATURDAY","SALE_REVENUE":"13.64","SALE_REVENUE_PERC":"18.18","SALE_TYPE":"WEEKEND","SALE_TYPE_REVENUE_PERC":"48.08"},{"SALE_DOW":"SUNDAY","SALE_REVENUE":"14.73","SALE_REVENUE_PERC":"19.63","SALE_TYPE":"WEEKEND","SALE_TYPE_REVENUE_PERC":"51.92"}]}

With BigQuery there are more efficient ways to use parameters and to aggregate the data; if you know them, feel free to post them in the comments. Stay tuned!
