DEV Community

geraldew
geraldew

Posted on

Some Example SQL

Prologue

I thought it might be interesting to share an example of the sort of code I write in my day job.

Much of the time, what I write is so inherently arcane by being specific to my work environment, that it would be way too much trouble to anonymise it for sharing in public.

However I recently wrote something that felt quite shareable

I have translated the idiom of the code into something that feels like it would have a similar-enough mathematical pattern.

By chance, outside of my work I hade been reading some description of the dynamics of authors and publishers, and that struck me as almost suitable. Frankly, it doesn't quite work, but it will do until I find another paradigm.

Ideally I would like to find some publicly available example data that fits here. Suggestions are welcome.

Base Data

So before we get to the code I wrote, here is our example base data structure. I've imagined a dataset listing per-year connections between authors and publishers, where:

  • each author has only one publisher for a given year
  • publishers can have any number of authors each year
  • publishers might just not be active some years - if that sounds strange, just think of it as meaning that from one year to the next, a publisher can begin to operate, or cease to operate.

So our base table is just a triple of ID numbers for the author and publisher each, and a year. Each triple is unique. More about this in the commentary after the SQL script.

The Base Table

CREATE TABLE
  Dbs.Data_AuthorYearPublisher
(
  Pblshr_Num INTEGER ,
  Acnt_Yr INTEGER ,
  Athr_Id INTEGER
  )
UNIQUE PRIMARY INDEX( 
  Pblshr_Num ,
  Acnt_Yr ,
  Athr_Id
  )
;
Enter fullscreen mode Exit fullscreen mode

The Analytical SQL

Because there's a fair bit of SQL here, I have a choice of two ways to present it.

  • show each step as an explained piece and then put it all together
  • show the whole lot and then explain each part

What I will do is the second method, so I can post the whole lot, and then revisit this page to add explanations as I find time to do do.

Of course, in reality I wrote it by starting with the base table, writing a treatment of it, then writing a treatment of the treatment and so on, until I had something that gave me the kind of understanding of the data that I'd vaguely had in mind.

So while I will at this point just dump all the SQL in a code block, I can say that the general structure is:

  • a common table expression (CTE) named "Rollup_Publisher_Year" pulling from the base table
  • a CTE "Chunkify" from "Rollup_Publisher_Year"
  • a CTE "Rollup_Publisher" from Chunkify
  • a CTE "Patterns" from Rollup_Publisher
  • a CTE "RollingSums" from Patterns
  • a CTE "DoProportionCalcs" from RollingSums
  • a final SELECT from DoProportionCalcs that just gives a different sequence for the columns.

And generally, apart from being translated into the terminology of authors and publishers the identifier names are unchanged from original code.

Finally, to give an idea of timeliness to the writing of this SQL, it came from reaching a "next thing to do" late on a Thursday afternoon. I started tackling it at about 5pm and by 7pm had the SELECT of the "Patterns" step running. Overnight I thought of the idea for the "X0", "X1", "X2" encoding and after a morning diversion, spent the next afternoon pushing through to the code you see here.

-- Analyse Over-the-Year Patterns of Author Counts of Publishers
WITH
  Rollup_Publisher_Year AS
(
SELECT
  Pblshr_Num ,
  Acnt_Yr ,
  EXTRACT( YEAR FROM CURRENT_DATE ) AS Current_Yr ,
  ( Current_Yr - Acnt_Yr ) AS BackByYearsN ,
  COUNT( DISTINCT Athr_Id ) AS Athr_Ctd
FROM
  Dbs.Data_AuthorYearPublisher
GROUP BY
  Pblshr_Num ,
  Acnt_Yr ,
  Current_Yr ,
  BackByYearsN
) ,
  Chunkify AS
(
SELECT
  Pblshr_Num ,
  Acnt_Yr ,
  BackByYearsN ,
  Athr_Ctd ,
  CASE
    WHEN Athr_Ctd = 0 THEN '00'
    WHEN Athr_Ctd < 10 THEN 'X0'
    WHEN Athr_Ctd < 100 THEN 'X1'
    WHEN Athr_Ctd < 1000 THEN 'X2'
    WHEN Athr_Ctd < 10000 THEN 'X3'
    WHEN Athr_Ctd < 100000 THEN 'X4'
    WHEN Athr_Ctd >= 100000 THEN 'X5+'
    END AS AthrCnt_Cd
FROM
  Rollup_Publisher_Year
WHERE
  BackByYearsN > 1
) ,
  Rollup_Publisher AS
(
SELECT
  Pblshr_Num ,
  ( -- derive a backward sequence string, for now, just to illustrate their cliemt pattern
    --MAX( CASE WHEN BackByYearsN = 0 THEN CAST( Athr_Ctd AS VARCHAR( 10) ) ELSE '' END ) || '-<-' ||
    --MAX( CASE WHEN BackByYearsN = 1 THEN CAST( Athr_Ctd AS VARCHAR( 10) ) ELSE '' END ) || '-<-' ||
    MAX( CASE WHEN BackByYearsN = 2 THEN CAST( Athr_Ctd AS VARCHAR( 10) ) ELSE '' END ) || '-<-' ||
    MAX( CASE WHEN BackByYearsN = 3 THEN CAST( Athr_Ctd AS VARCHAR( 10) ) ELSE '' END ) || '-<-' ||
    MAX( CASE WHEN BackByYearsN = 4 THEN CAST( Athr_Ctd AS VARCHAR( 10) ) ELSE '' END ) || '-<-' ||
    MAX( CASE WHEN BackByYearsN = 5 THEN CAST( Athr_Ctd AS VARCHAR( 10) ) ELSE '' END ) || '-<-' ||
    MAX( CASE WHEN BackByYearsN = 6 THEN CAST( Athr_Ctd AS VARCHAR( 10) ) ELSE '' END ) || '-<-' ||
    MAX( CASE WHEN BackByYearsN = 7 THEN CAST( Athr_Ctd AS VARCHAR( 10) ) ELSE '' END ) || '-<-' ||
    MAX( CASE WHEN BackByYearsN = 8 THEN CAST( Athr_Ctd AS VARCHAR( 10) ) ELSE '' END ) || '-<-' ||
    MAX( CASE WHEN BackByYearsN = 9 THEN CAST( Athr_Ctd AS VARCHAR( 10) ) ELSE '' END ) || '...'
    ) AS BackYears_Athr_Ctd_Str ,
  ( -- derive a backward sequence string, for now, just to illustrate their cliemt pattern
    --MAX( CASE WHEN BackByYearsN = 0 THEN AthrCnt_Cd ELSE '' END ) || '-<-' ||
    --MAX( CASE WHEN BackByYearsN = 1 THEN AthrCnt_Cd ELSE '' END ) || '-<-' ||
    MAX( CASE WHEN BackByYearsN = 2 THEN AthrCnt_Cd ELSE '' END ) || '-<-' ||
    MAX( CASE WHEN BackByYearsN = 3 THEN AthrCnt_Cd ELSE '' END ) || '-<-' ||
    MAX( CASE WHEN BackByYearsN = 4 THEN AthrCnt_Cd ELSE '' END ) || '-<-' ||
    MAX( CASE WHEN BackByYearsN = 5 THEN AthrCnt_Cd ELSE '' END ) || '-<-' ||
    MAX( CASE WHEN BackByYearsN = 6 THEN AthrCnt_Cd ELSE '' END ) || '-<-' ||
    MAX( CASE WHEN BackByYearsN = 7 THEN AthrCnt_Cd ELSE '' END ) || '-<-' ||
    MAX( CASE WHEN BackByYearsN = 8 THEN AthrCnt_Cd ELSE '' END ) || '-<-' ||
    MAX( CASE WHEN BackByYearsN = 9 THEN AthrCnt_Cd ELSE '' END ) || '...'
    ) AS BackYears_AthrCntCd_Str ,
  --
  MIN( Acnt_Yr ) AS Min_Acnt_Yr ,
  MAX( Acnt_Yr ) AS Max_Acnt_Yr ,
  --
  MIN( BackByYearsN ) AS Min_BackByYearsN ,
  AVG( BackByYearsN ) AS Avg_BackByYearsN ,
  MAX( BackByYearsN ) AS Max_BackByYearsN ,
  --
  MIN( Athr_Ctd ) AS Min_Athr_Ctd ,
  AVG( Athr_Ctd ) AS Avg_Athr_Ctd ,
  MAX( Athr_Ctd ) AS Max_Athr_Ctd ,
  --
  MAX( CASE WHEN Acnt_Yr = 2024 THEN Athr_Ctd END ) AS Athrs_2024_Ctd ,
  MAX( CASE WHEN Acnt_Yr = 2023 THEN Athr_Ctd END ) AS Athrs_2023_Ctd
FROM
  -- Rollup_Publisher_Year
  Chunkify
GROUP BY
  Pblshr_Num
) ,
  Patterns AS
(
SELECT
  BackYears_AthrCntCd_Str ,
  COUNT( Pblshr_Num ) AS Publisher_Cnt ,
  --
  -- derive a value that can later be checked for being more than zero - the actual amount doesn't itself have a meaning
  -- the idea is that this value can only be zero if both 2014 and 2023 were zero
  MAX( ZEROIFNULL( Athrs_2024_Ctd ) + ZEROIFNULL( Athrs_2023_Ctd ) ) AS MaxAthrsCtd_24plus23 ,
  --
  MAX( Max_Acnt_Yr ) AS Max_Max_Acnt_Yr ,
  MAX( Min_Acnt_Yr ) AS Max_Min_Acnt_Yr ,
  MIN( Min_Acnt_Yr ) AS Min_Min_Acnt_Yr ,
  --
  SUM( Athrs_2024_Ctd ) AS Sum_Athrs_2024_Ctd ,
  SUM( Athrs_2023_Ctd ) AS Sum_Athrs_2023_Ctd
FROM
  Rollup_Publisher
GROUP BY
  BackYears_AthrCntCd_Str
-- ORDER BY
--  Publisher_Cnt DESC
) ,
  RollingSums AS
(
SELECT
  P.* ,
  CASE WHEN MaxAthrsCtd_24plus23 > 0 THEN '24 or 23' END AS RecentlyActive ,
  CASE WHEN Sum_Athrs_2024_Ctd > 0 THEN 'Acted_2024' END AS Active2024 ,
  FIRST_VALUE( Publisher_Cnt) OVER(    
    ORDER BY
      Publisher_Cnt DESC ,
      Sum_Athrs_2024_Ctd DESC ,
      Sum_Athrs_2023_Ctd DESC
    ) AS IgnoreFirstPattern_Cnt ,
  --
  SUM( Publisher_Cnt ) OVER (
    ORDER BY
      Publisher_Cnt DESC ,
      Sum_Athrs_2024_Ctd DESC ,
      Sum_Athrs_2023_Ctd DESC
    ROWS UNBOUNDED PRECEDING
    ) AS SumPblshrsSoFar ,
  SUM( Publisher_Cnt ) OVER () AS SumPblshrsOfAll ,
  --
  SUM( CASE WHEN MaxAthrsCtd_24plus23 > 0 THEN Publisher_Cnt END ) OVER (
    ORDER BY
      Publisher_Cnt DESC ,
      Sum_Athrs_2024_Ctd DESC ,
      Sum_Athrs_2023_Ctd DESC
    ROWS UNBOUNDED PRECEDING
    ) AS SumActvPblshrsSoFar ,
  SUM( CASE WHEN MaxAthrsCtd_24plus23 > 0 THEN Publisher_Cnt END ) OVER () AS SumActvPblshrsOfAll ,
  --
  SUM( CASE WHEN Sum_Athrs_2024_Ctd > 0 THEN Publisher_Cnt END ) OVER (
    ORDER BY
      Publisher_Cnt DESC ,
      Sum_Athrs_2024_Ctd DESC ,
      Sum_Athrs_2023_Ctd DESC
    ROWS UNBOUNDED PRECEDING
    ) AS SumActv24PblshrsSoFar ,
  SUM( CASE WHEN Sum_Athrs_2024_Ctd > 0 THEN Publisher_Cnt END ) OVER () AS SumActv24PblshrsOfAll ,
  --
  SUM( Sum_Athrs_2024_Ctd ) OVER (
    ORDER BY
      Publisher_Cnt DESC ,
      Sum_Athrs_2024_Ctd DESC ,
      Sum_Athrs_2023_Ctd DESC
    ROWS UNBOUNDED PRECEDING
    ) AS SumSoFar_Athrs_2024_Ctd ,
  SUM( Sum_Athrs_2024_Ctd ) OVER () AS SumofAll_Athrs_2024_Ctd
FROM
  Patterns AS P
) ,
  DoProportionCalcs AS
(
SELECT
  R_S.* ,
  ( CAST( SumActvPblshrsSoFar AS FLOAT) / CAST( SumActvPblshrsOfAll AS FLOAT) ) AS SoFar_ActvPblshrs_Ppn ,
  ( CAST( SumActv24PblshrsSoFar AS FLOAT) / CAST( SumActv24PblshrsOfAll AS FLOAT) ) AS SoFar_Actv24Pblshrs_Ppn ,
  ( CAST( SumPblshrsSoFar AS FLOAT) / CAST( SumPblshrsOfAll AS FLOAT) ) AS SoFar_Pblshrs_Ppn ,
  R_S.SumPblshrsSoFar - IgnoreFirstPattern_Cnt AS SoFar_WrkblPblshrs ,
  ( CAST( SoFar_WrkblPblshrs AS FLOAT) / CAST( SumPblshrsOfAll - IgnoreFirstPattern_Cnt AS FLOAT) ) AS SoFar_WrkblPblshrs_Ppn ,
  --
  ( CAST( SumSoFar_Athrs_2024_Ctd AS FLOAT) / CAST( SumofAll_Athrs_2024_Ctd AS FLOAT) ) AS SoFar_Athrs2024_Ppn
FROM
  RollingSums AS R_S
)
SELECT
  -- Grouped By
  BackYears_AthrCntCd_Str ,
  -- Agggregates from here on
  Publisher_Cnt ,
  -- using the idea of Active only about 2024
  Active2024 ,
  SumActv24PblshrsSoFar ,
  SoFar_Actv24Pblshrs_Ppn ,
  SumActv24PblshrsOfAll ,
  -- and their Authrs
  Sum_Athrs_2024_Ctd ,
  SumSoFar_Athrs_2024_Ctd ,
  SoFar_Athrs2024_Ppn ,
  SumofAll_Athrs_2024_Ctd ,
  -- some year spread indicators
  Max_Max_Acnt_Yr ,
  Max_Min_Acnt_Yr ,
  Min_Min_Acnt_Yr ,
  --
  -- All Publishers, no filtering
  SumPblshrsSoFar ,
  SoFar_Pblshrs_Ppn ,
  SumPblshrsOfAll ,
  -- Ignoring the long-time inactive Publishers
  IgnoreFirstPattern_Cnt ,
  SoFar_WrkblPblshrs ,
  SoFar_WrkblPblshrs_Ppn ,
  --
  -- Idea of Active in either 2024 or 2023
  MaxAthrsCtd_24plus23 ,
  Sum_Athrs_2023_Ctd ,
  RecentlyActive ,
  SumActvPblshrsSoFar ,
  SoFar_ActvPblshrs_Ppn ,
  SumActvPblshrsOfAll
FROM
  DoProportionCalcs
ORDER BY 
  Publisher_Cnt DESC ,
  Sum_Athrs_2024_Ctd DESC ,
  Sum_Athrs_2023_Ctd DESC
;
Enter fullscreen mode Exit fullscreen mode

Commentary

Note that I said this "doesn't quite work" - that's because in real life, authors probably use multiple publishers simultaneously. But here we'll imagine that doesn't happen - though we will allow authors to change publisher from one year to the next.

That's important for one important aspect of this analysis:

  • we cannot ever add the counts of authors across the years

To be continued...

Top comments (0)