DEV Community

loading...
Bornfight

Implementing months_between() function in BigQuery

Niko Draganić
Business Intelligence Developer, Data Science enthusiast
・6 min read

After 6 years of BI experience in Oracle DB, I came to Bornfight to develop their BI system using Google Cloud Platform. Having used very little of it, I was glad to see all the major features, such as aggregate and analytic functions, waiting for me. I am still adjusting to not having full procedural language (PL) capabilities, but that may be for the better, since building Agile BI can be a very long lasting process using only code.

My first assignment was to create a report for Employee Loyalty Bonuses, which are calculated based solely on the duration of one's employment; a 10 minute task using Oracle's months_between() function. So I immediately looked up if BigQuery had any similar functions and I found the built-in date_diff(), which calculates the difference between two dates in days, months, or years. So far so good. I tested it out on some examples and realized it give the literal difference in the year/month part of the date, ie. Jan 1st to Feb 28th returns 1, as does Jan 31st to Feb 1st. What I wanted was, of course, something very close to 2 in the first case and something very close to 0 in the second one. So I got to thinking.

(skip to the end for the final code)

The Logic

The function needs to take two arguments: DAT_TO and DAT_FROM (in that particular order because of legacy and intuition of subtracting the earlier/older date from the later/newer one).

So what are the rules?

1. If day_in_month is the same, the function returns an integer

The first one is obvious: Jan 1st to Feb 1st should be 1 month, as should Feb 2nd to Mar 2nd, because month is a relative term of variable duration. Same goes for every pair of dates with the same value in their day fields. This is also where date_diff() does the job correctly. However, not all the dates in my dataset will be on the same day of the month.

2. If DAT_TO.day_in_month is EOM and DAT_FROM.day_in_month is EOM, the function returns an integer.

Consider the following: Jan 31st to Mar 31st returns 2, as per rule #1. So what should the function return if DAT_FROM were Feb 28th (in a non-leap year)? It should be 1, because even though day_in_month values are not the same, they are both ends of their respective months.

3. Monotony

So, in respect to Feb 28 as the end-of-month, we have Mar 28th returning 1 (rule #1) and Mar 31st returning 1 (rule #2). What about Mar 29th and 30th? Well, it would be rather weird for this to equal 1 on the 28th, then be greater than 1 on the 29th, then be even greater on the 30th and then return to 1 on 31st of March. We need this function to be monotonically increasing, which means that going further in the future either increases or holds the value of the function, never decreases it. Combined with #2, it leaves us with this:
If DAT_TO.day_in_month < DAT_FROM.day_in_month and DAT_TO is EOM then return an integer.

4. Inverse of rules 2 and 3

The logic applied in #2 and #3 goes both ways. Without a detailed explanation, If DAT_FROM.day_in_month < DAT_TO.day_in_month and DAT_FROM is EOM then return an integer.

5. Non-integer differences

So, rules #1-#4 give us the cases when the result is an integer. What about everything else? The dates inside the same month are easy: Distance between Jan 1st and Jan 15th is fourteen days, over 31 days in total for January, which gives us 0.45 of a month. But what if the dates are not in the same month? Consider Feb 14th to Mar 15th. The value should be slightly over 1, because rule #1 gives us exactly 1, and this is one day longer. But how do we proceed? We have half of February ((28-14)/28) to go and slightly less than half of March (15/31) to get from Feb 14th to Mar 15th. But adding those two fractions does not give us a value over 1, in fact, it's less than 1. That's why we should always denominate by the total number of days in the month of DAT_FROM. This gives us (14+15)/28, which is > 1. And it works for all pairs of dates, because as we get to the end of the month, we are covered by rules #3 and #4.

6. Negative distance for reversed dates

For legacy reasons (to be aligned with Oracle function, that is), when DAT_FROM > DAT_TO, the returned value should be negative.

The Code

As with every SQL query, there are multiple ways to get this result. Here's how I did it:

1. Extract year, month and day digits

Simply have the day, month and year values as integers.

SELECT
  calculation_date AS calculation_date,
  CAST(FORMAT_DATETIME("%Y", DATETIME(calculation_date)) AS INT64) AS year_num,
  CAST(FORMAT_DATETIME("%m", DATETIME(calculation_date)) AS INT64) AS month_num,
  CAST(FORMAT_DATETIME("%d", DATETIME(calculation_date)) AS INT64) AS day_num,
  FORMAT_DATETIME("%Y%m", DATETIME(calculation_date)) AS ym
FROM (
  SELECT DAT_FROM AS calculation_date UNION ALL
  SELECT DAT_TO AS calculation_date
)
Enter fullscreen mode Exit fullscreen mode

2. Calculate total days in month and distance to BOM and EOM

Calculate percentages, or rather, percentiles of month passed and month left.

SELECT
  calculation_date,
  year_num,
  month_num,
  day_num,
  ym,
  days_in_month,
  round(distance_to_bom/days_in_month, 4) AS pct_month_passed,
  round(distance_to_eom/days_in_month, 4) AS pct_month_left
FROM (
  SELECT
    *,
    date_diff(last_day(calculation_date), date_trunc(calculation_date, MONTH), DAY) + 1 AS days_in_month,
    date_diff(calculation_date, date_trunc(calculation_date, MONTH), DAY) AS distance_to_bom,
    date_diff(last_day(calculation_date), calculation_date, DAY) AS distance_to_eom
  FROM (
    --#1
  )
)
Enter fullscreen mode Exit fullscreen mode

3. Pivot

Since we started by having two rows, we want all values in a single row, so we pivot by maximizing for both DAT_FROM and DAT_TO input values.

SELECT
      max(CASE WHEN calculation_date = DAT_TO THEN calculation_date END) AS calc_dat_to,
      max(CASE WHEN calculation_date = DAT_TO THEN year_num END) AS year_num_to,
      max(CASE WHEN calculation_date = DAT_TO THEN month_num END) AS month_num_to,
      max(CASE WHEN calculation_date = DAT_TO THEN day_num END) AS day_num_to,
      max(CASE WHEN calculation_date = DAT_TO THEN days_in_month END) AS days_in_month_to,
      max(CASE WHEN calculation_date = DAT_TO THEN pct_month_passed END) AS pct_month_passed_to,
      max(CASE WHEN calculation_date = DAT_TO THEN pct_month_left END) AS pct_month_left_to,
      max(CASE WHEN calculation_date = DAT_FROM THEN calculation_date END) AS calc_dat_from,
      max(CASE WHEN calculation_date = DAT_FROM THEN year_num END) AS year_num_from,
      max(CASE WHEN calculation_date = DAT_FROM THEN month_num END) AS month_num_from,
      max(CASE WHEN calculation_date = DAT_FROM THEN day_num END) AS day_num_from,
      max(CASE WHEN calculation_date = DAT_FROM THEN days_in_month END) AS days_in_month_from,
      max(CASE WHEN calculation_date = DAT_FROM THEN pct_month_passed END) AS pct_month_passed_from,
      max(CASE WHEN calculation_date = DAT_FROM THEN pct_month_left END) AS pct_month_left_from
FROM (
      --#2
)
Enter fullscreen mode Exit fullscreen mode


sql

4. Implement the logic

This just implements the logic stated above.

SELECT
    CASE
      WHEN calc_dat_from = calc_dat_to
      THEN 0
      WHEN calc_dat_from < calc_dat_to
      THEN CASE
        WHEN day_num_to = day_num_from
        THEN (year_num_to-year_num_from)*12 + month_num_to - month_num_from
        WHEN day_num_to < day_num_from AND pct_month_left_to = 0
        THEN (year_num_to-year_num_from)*12 + month_num_to - month_num_from
        WHEN day_num_to > day_num_from AND pct_month_left_from = 0
        THEN (year_num_to-year_num_from)*12 + month_num_to - month_num_from
        ELSE pct_month_left_from + (date_diff(calc_dat_to, calc_dat_from, MONTH) - 1) + (day_num_to/days_in_month_from)
      END
      WHEN calc_dat_from > calc_dat_to
      THEN CASE
        WHEN day_num_from = day_num_to
        THEN -1*( (year_num_from-year_num_to)*12 - month_num_from - month_num_to )
        WHEN day_num_from < day_num_to AND pct_month_left_from = 0
        THEN -1*( (year_num_from-year_num_to)*12 - month_num_from - month_num_to )
        WHEN day_num_from > day_num_to AND pct_month_left_to = 0
        THEN -1*( (year_num_from-year_num_to)*12 - month_num_from - month_num_to )
        ELSE -1*( pct_month_left_to + (date_diff(calc_dat_from, calc_dat_to, MONTH) - 1) + (day_num_from/days_in_month_to) )
      END
    END AS result
FROM (
    --#3
)
Enter fullscreen mode Exit fullscreen mode

Final code

Here's the entire code for the function for those who want results without the background story.

CREATE OR REPLACE FUNCTION sys.months_between(
  DAT_TO DATE,
  DAT_FROM DATE
) RETURN FLOAT64
AS (
  SELECT
    CASE
      WHEN calc_dat_from = calc_dat_to
      THEN 0
      WHEN calc_dat_from < calc_dat_to
      THEN CASE
        WHEN day_num_to = day_num_from
        THEN (year_num_to-year_num_from)*12 + month_num_to - month_num_from
        WHEN day_num_to < day_num_from AND pct_month_left_to = 0
        THEN (year_num_to-year_num_from)*12 + month_num_to - month_num_from
        WHEN day_num_to > day_num_from AND pct_month_left_from = 0
        THEN (year_num_to-year_num_from)*12 + month_num_to - month_num_from
        ELSE pct_month_left_from + (date_diff(calc_dat_to, calc_dat_from, MONTH) - 1) + (day_num_to/days_in_month_from)
      END
      WHEN calc_dat_from > calc_dat_to
      THEN CASE
        WHEN day_num_from = day_num_to
        THEN -1*( (year_num_from-year_num_to)*12 - month_num_from - month_num_to )
        WHEN day_num_from < day_num_to AND pct_month_left_from = 0
        THEN -1*( (year_num_from-year_num_to)*12 - month_num_from - month_num_to )
        WHEN day_num_from > day_num_to AND pct_month_left_to = 0
        THEN -1*( (year_num_from-year_num_to)*12 - month_num_from - month_num_to )
        ELSE -1*( pct_month_left_to + (date_diff(calc_dat_from, calc_dat_to, MONTH) - 1) + (day_num_from/days_in_month_to) )
      END
    END AS result
  FROM (
    SELECT
      max(CASE WHEN calculation_date = DAT_TO THEN calculation_date END) AS calc_dat_to,
      max(CASE WHEN calculation_date = DAT_TO THEN year_num END) AS year_num_to,
      max(CASE WHEN calculation_date = DAT_TO THEN month_num END) AS month_num_to,
      max(CASE WHEN calculation_date = DAT_TO THEN day_num END) AS day_num_to,
      max(CASE WHEN calculation_date = DAT_TO THEN days_in_month END) AS days_in_month_to,
      max(CASE WHEN calculation_date = DAT_TO THEN pct_month_passed END) AS pct_month_passed_to,
      max(CASE WHEN calculation_date = DAT_TO THEN pct_month_left END) AS pct_month_left_to,
      max(CASE WHEN calculation_date = DAT_FROM THEN calculation_date END) AS calc_dat_from,
      max(CASE WHEN calculation_date = DAT_FROM THEN year_num END) AS year_num_from,
      max(CASE WHEN calculation_date = DAT_FROM THEN month_num END) AS month_num_from,
      max(CASE WHEN calculation_date = DAT_FROM THEN day_num END) AS day_num_from,
      max(CASE WHEN calculation_date = DAT_FROM THEN days_in_month END) AS days_in_month_from,
      max(CASE WHEN calculation_date = DAT_FROM THEN pct_month_passed END) AS pct_month_passed_from,
      max(CASE WHEN calculation_date = DAT_FROM THEN pct_month_left END) AS pct_month_left_from
    FROM (
      SELECT
        calculation_date,
        year_num,
        month_num,
        day_num,
        ym,
        days_in_month,
        round(distance_to_bom/days_in_month, 4) AS pct_month_passed,
        round(distance_to_eom/days_in_month, 4) AS pct_month_left
      FROM (
        SELECT
          *,
          date_diff(last_day(calculation_date), date_trunc(calculation_date, MONTH), DAY) + 1 AS days_in_month,
          date_diff(calculation_date, date_trunc(calculation_date, MONTH), DAY) AS distance_to_bom,
          date_diff(last_day(calculation_date), calculation_date, DAY) AS distance_to_eom
        FROM (
          SELECT
            calculation_date AS calculation_date,
            CAST(FORMAT_DATETIME("%Y", DATETIME(calculation_date)) AS INT64) AS year_num,
            CAST(FORMAT_DATETIME("%m", DATETIME(calculation_date)) AS INT64) AS month_num,
            CAST(FORMAT_DATETIME("%d", DATETIME(calculation_date)) AS INT64) AS day_num,
            FORMAT_DATETIME("%Y%m", DATETIME(calculation_date)) AS ym
          FROM (
            SELECT DAT_FROM AS calculation_date UNION ALL
            SELECT DAT_TO AS calculation_date
          )
        )
      )
    )
  )
)
Enter fullscreen mode Exit fullscreen mode

Enjoy!

Discussion (0)