Prologue
I thought it might be interesting to share an example of the sort of code I write in my day job.
Much of the time, what I write is so inherently arcane by being specific to my work environment, that it would be way too much trouble to anonymise it for sharing in public.
However I recently wrote something that felt quite shareable
I have translated the idiom of the code into something that feels like it would have a similar-enough mathematical pattern.
By chance, outside of my work I hade been reading some description of the dynamics of authors and publishers, and that struck me as almost suitable. Frankly, it doesn't quite work, but it will do until I find another paradigm.
Ideally I would like to find some publicly available example data that fits here. Suggestions are welcome.
Base Data
So before we get to the code I wrote, here is our example base data structure. I've imagined a dataset listing per-year connections between authors and publishers, where:
- each author has only one publisher for a given year
- publishers can have any number of authors each year
- publishers might just not be active some years - if that sounds strange, just think of it as meaning that from one year to the next, a publisher can begin to operate, or cease to operate.
So our base table is just a triple of ID numbers for the author and publisher each, and a year. Each triple is unique. More about this in the commentary after the SQL script.
The Base Table
CREATE TABLE
Dbs.Data_AuthorYearPublisher
(
Pblshr_Num INTEGER ,
Acnt_Yr INTEGER ,
Athr_Id INTEGER
)
UNIQUE PRIMARY INDEX(
Pblshr_Num ,
Acnt_Yr ,
Athr_Id
)
;
The Analytical SQL
Because there's a fair bit of SQL here, I have a choice of two ways to present it.
- show each step as an explained piece and then put it all together
- show the whole lot and then explain each part
What I will do is the second method, so I can post the whole lot, and then revisit this page to add explanations as I find time to do do.
Of course, in reality I wrote it by starting with the base table, writing a treatment of it, then writing a treatment of the treatment and so on, until I had something that gave me the kind of understanding of the data that I'd vaguely had in mind.
So while I will at this point just dump all the SQL in a code block, I can say that the general structure is:
- a common table expression (CTE) named "Rollup_Publisher_Year" pulling from the base table
- a CTE "Chunkify" from "Rollup_Publisher_Year"
- a CTE "Rollup_Publisher" from Chunkify
- a CTE "Patterns" from Rollup_Publisher
- a CTE "RollingSums" from Patterns
- a CTE "DoProportionCalcs" from RollingSums
- a final SELECT from DoProportionCalcs that just gives a different sequence for the columns.
And generally, apart from being translated into the terminology of authors and publishers the identifier names are unchanged from original code.
Finally, to give an idea of timeliness to the writing of this SQL, it came from reaching a "next thing to do" late on a Thursday afternoon. I started tackling it at about 5pm and by 7pm had the SELECT of the "Patterns" step running. Overnight I thought of the idea for the "X0", "X1", "X2" encoding and after a morning diversion, spent the next afternoon pushing through to the code you see here.
-- Analyse Over-the-Year Patterns of Author Counts of Publishers
WITH
Rollup_Publisher_Year AS
(
SELECT
Pblshr_Num ,
Acnt_Yr ,
EXTRACT( YEAR FROM CURRENT_DATE ) AS Current_Yr ,
( Current_Yr - Acnt_Yr ) AS BackByYearsN ,
COUNT( DISTINCT Athr_Id ) AS Athr_Ctd
FROM
Dbs.Data_AuthorYearPublisher
GROUP BY
Pblshr_Num ,
Acnt_Yr ,
Current_Yr ,
BackByYearsN
) ,
Chunkify AS
(
SELECT
Pblshr_Num ,
Acnt_Yr ,
BackByYearsN ,
Athr_Ctd ,
CASE
WHEN Athr_Ctd = 0 THEN '00'
WHEN Athr_Ctd < 10 THEN 'X0'
WHEN Athr_Ctd < 100 THEN 'X1'
WHEN Athr_Ctd < 1000 THEN 'X2'
WHEN Athr_Ctd < 10000 THEN 'X3'
WHEN Athr_Ctd < 100000 THEN 'X4'
WHEN Athr_Ctd >= 100000 THEN 'X5+'
END AS AthrCnt_Cd
FROM
Rollup_Publisher_Year
WHERE
BackByYearsN > 1
) ,
Rollup_Publisher AS
(
SELECT
Pblshr_Num ,
( -- derive a backward sequence string, for now, just to illustrate their cliemt pattern
--MAX( CASE WHEN BackByYearsN = 0 THEN CAST( Athr_Ctd AS VARCHAR( 10) ) ELSE '' END ) || '-<-' ||
--MAX( CASE WHEN BackByYearsN = 1 THEN CAST( Athr_Ctd AS VARCHAR( 10) ) ELSE '' END ) || '-<-' ||
MAX( CASE WHEN BackByYearsN = 2 THEN CAST( Athr_Ctd AS VARCHAR( 10) ) ELSE '' END ) || '-<-' ||
MAX( CASE WHEN BackByYearsN = 3 THEN CAST( Athr_Ctd AS VARCHAR( 10) ) ELSE '' END ) || '-<-' ||
MAX( CASE WHEN BackByYearsN = 4 THEN CAST( Athr_Ctd AS VARCHAR( 10) ) ELSE '' END ) || '-<-' ||
MAX( CASE WHEN BackByYearsN = 5 THEN CAST( Athr_Ctd AS VARCHAR( 10) ) ELSE '' END ) || '-<-' ||
MAX( CASE WHEN BackByYearsN = 6 THEN CAST( Athr_Ctd AS VARCHAR( 10) ) ELSE '' END ) || '-<-' ||
MAX( CASE WHEN BackByYearsN = 7 THEN CAST( Athr_Ctd AS VARCHAR( 10) ) ELSE '' END ) || '-<-' ||
MAX( CASE WHEN BackByYearsN = 8 THEN CAST( Athr_Ctd AS VARCHAR( 10) ) ELSE '' END ) || '-<-' ||
MAX( CASE WHEN BackByYearsN = 9 THEN CAST( Athr_Ctd AS VARCHAR( 10) ) ELSE '' END ) || '...'
) AS BackYears_Athr_Ctd_Str ,
( -- derive a backward sequence string, for now, just to illustrate their cliemt pattern
--MAX( CASE WHEN BackByYearsN = 0 THEN AthrCnt_Cd ELSE '' END ) || '-<-' ||
--MAX( CASE WHEN BackByYearsN = 1 THEN AthrCnt_Cd ELSE '' END ) || '-<-' ||
MAX( CASE WHEN BackByYearsN = 2 THEN AthrCnt_Cd ELSE '' END ) || '-<-' ||
MAX( CASE WHEN BackByYearsN = 3 THEN AthrCnt_Cd ELSE '' END ) || '-<-' ||
MAX( CASE WHEN BackByYearsN = 4 THEN AthrCnt_Cd ELSE '' END ) || '-<-' ||
MAX( CASE WHEN BackByYearsN = 5 THEN AthrCnt_Cd ELSE '' END ) || '-<-' ||
MAX( CASE WHEN BackByYearsN = 6 THEN AthrCnt_Cd ELSE '' END ) || '-<-' ||
MAX( CASE WHEN BackByYearsN = 7 THEN AthrCnt_Cd ELSE '' END ) || '-<-' ||
MAX( CASE WHEN BackByYearsN = 8 THEN AthrCnt_Cd ELSE '' END ) || '-<-' ||
MAX( CASE WHEN BackByYearsN = 9 THEN AthrCnt_Cd ELSE '' END ) || '...'
) AS BackYears_AthrCntCd_Str ,
--
MIN( Acnt_Yr ) AS Min_Acnt_Yr ,
MAX( Acnt_Yr ) AS Max_Acnt_Yr ,
--
MIN( BackByYearsN ) AS Min_BackByYearsN ,
AVG( BackByYearsN ) AS Avg_BackByYearsN ,
MAX( BackByYearsN ) AS Max_BackByYearsN ,
--
MIN( Athr_Ctd ) AS Min_Athr_Ctd ,
AVG( Athr_Ctd ) AS Avg_Athr_Ctd ,
MAX( Athr_Ctd ) AS Max_Athr_Ctd ,
--
MAX( CASE WHEN Acnt_Yr = 2024 THEN Athr_Ctd END ) AS Athrs_2024_Ctd ,
MAX( CASE WHEN Acnt_Yr = 2023 THEN Athr_Ctd END ) AS Athrs_2023_Ctd
FROM
-- Rollup_Publisher_Year
Chunkify
GROUP BY
Pblshr_Num
) ,
Patterns AS
(
SELECT
BackYears_AthrCntCd_Str ,
COUNT( Pblshr_Num ) AS Publisher_Cnt ,
--
-- derive a value that can later be checked for being more than zero - the actual amount doesn't itself have a meaning
-- the idea is that this value can only be zero if both 2014 and 2023 were zero
MAX( ZEROIFNULL( Athrs_2024_Ctd ) + ZEROIFNULL( Athrs_2023_Ctd ) ) AS MaxAthrsCtd_24plus23 ,
--
MAX( Max_Acnt_Yr ) AS Max_Max_Acnt_Yr ,
MAX( Min_Acnt_Yr ) AS Max_Min_Acnt_Yr ,
MIN( Min_Acnt_Yr ) AS Min_Min_Acnt_Yr ,
--
SUM( Athrs_2024_Ctd ) AS Sum_Athrs_2024_Ctd ,
SUM( Athrs_2023_Ctd ) AS Sum_Athrs_2023_Ctd
FROM
Rollup_Publisher
GROUP BY
BackYears_AthrCntCd_Str
-- ORDER BY
-- Publisher_Cnt DESC
) ,
RollingSums AS
(
SELECT
P.* ,
CASE WHEN MaxAthrsCtd_24plus23 > 0 THEN '24 or 23' END AS RecentlyActive ,
CASE WHEN Sum_Athrs_2024_Ctd > 0 THEN 'Acted_2024' END AS Active2024 ,
FIRST_VALUE( Publisher_Cnt) OVER(
ORDER BY
Publisher_Cnt DESC ,
Sum_Athrs_2024_Ctd DESC ,
Sum_Athrs_2023_Ctd DESC
) AS IgnoreFirstPattern_Cnt ,
--
SUM( Publisher_Cnt ) OVER (
ORDER BY
Publisher_Cnt DESC ,
Sum_Athrs_2024_Ctd DESC ,
Sum_Athrs_2023_Ctd DESC
ROWS UNBOUNDED PRECEDING
) AS SumPblshrsSoFar ,
SUM( Publisher_Cnt ) OVER () AS SumPblshrsOfAll ,
--
SUM( CASE WHEN MaxAthrsCtd_24plus23 > 0 THEN Publisher_Cnt END ) OVER (
ORDER BY
Publisher_Cnt DESC ,
Sum_Athrs_2024_Ctd DESC ,
Sum_Athrs_2023_Ctd DESC
ROWS UNBOUNDED PRECEDING
) AS SumActvPblshrsSoFar ,
SUM( CASE WHEN MaxAthrsCtd_24plus23 > 0 THEN Publisher_Cnt END ) OVER () AS SumActvPblshrsOfAll ,
--
SUM( CASE WHEN Sum_Athrs_2024_Ctd > 0 THEN Publisher_Cnt END ) OVER (
ORDER BY
Publisher_Cnt DESC ,
Sum_Athrs_2024_Ctd DESC ,
Sum_Athrs_2023_Ctd DESC
ROWS UNBOUNDED PRECEDING
) AS SumActv24PblshrsSoFar ,
SUM( CASE WHEN Sum_Athrs_2024_Ctd > 0 THEN Publisher_Cnt END ) OVER () AS SumActv24PblshrsOfAll ,
--
SUM( Sum_Athrs_2024_Ctd ) OVER (
ORDER BY
Publisher_Cnt DESC ,
Sum_Athrs_2024_Ctd DESC ,
Sum_Athrs_2023_Ctd DESC
ROWS UNBOUNDED PRECEDING
) AS SumSoFar_Athrs_2024_Ctd ,
SUM( Sum_Athrs_2024_Ctd ) OVER () AS SumofAll_Athrs_2024_Ctd
FROM
Patterns AS P
) ,
DoProportionCalcs AS
(
SELECT
R_S.* ,
( CAST( SumActvPblshrsSoFar AS FLOAT) / CAST( SumActvPblshrsOfAll AS FLOAT) ) AS SoFar_ActvPblshrs_Ppn ,
( CAST( SumActv24PblshrsSoFar AS FLOAT) / CAST( SumActv24PblshrsOfAll AS FLOAT) ) AS SoFar_Actv24Pblshrs_Ppn ,
( CAST( SumPblshrsSoFar AS FLOAT) / CAST( SumPblshrsOfAll AS FLOAT) ) AS SoFar_Pblshrs_Ppn ,
R_S.SumPblshrsSoFar - IgnoreFirstPattern_Cnt AS SoFar_WrkblPblshrs ,
( CAST( SoFar_WrkblPblshrs AS FLOAT) / CAST( SumPblshrsOfAll - IgnoreFirstPattern_Cnt AS FLOAT) ) AS SoFar_WrkblPblshrs_Ppn ,
--
( CAST( SumSoFar_Athrs_2024_Ctd AS FLOAT) / CAST( SumofAll_Athrs_2024_Ctd AS FLOAT) ) AS SoFar_Athrs2024_Ppn
FROM
RollingSums AS R_S
)
SELECT
-- Grouped By
BackYears_AthrCntCd_Str ,
-- Agggregates from here on
Publisher_Cnt ,
-- using the idea of Active only about 2024
Active2024 ,
SumActv24PblshrsSoFar ,
SoFar_Actv24Pblshrs_Ppn ,
SumActv24PblshrsOfAll ,
-- and their Authrs
Sum_Athrs_2024_Ctd ,
SumSoFar_Athrs_2024_Ctd ,
SoFar_Athrs2024_Ppn ,
SumofAll_Athrs_2024_Ctd ,
-- some year spread indicators
Max_Max_Acnt_Yr ,
Max_Min_Acnt_Yr ,
Min_Min_Acnt_Yr ,
--
-- All Publishers, no filtering
SumPblshrsSoFar ,
SoFar_Pblshrs_Ppn ,
SumPblshrsOfAll ,
-- Ignoring the long-time inactive Publishers
IgnoreFirstPattern_Cnt ,
SoFar_WrkblPblshrs ,
SoFar_WrkblPblshrs_Ppn ,
--
-- Idea of Active in either 2024 or 2023
MaxAthrsCtd_24plus23 ,
Sum_Athrs_2023_Ctd ,
RecentlyActive ,
SumActvPblshrsSoFar ,
SoFar_ActvPblshrs_Ppn ,
SumActvPblshrsOfAll
FROM
DoProportionCalcs
ORDER BY
Publisher_Cnt DESC ,
Sum_Athrs_2024_Ctd DESC ,
Sum_Athrs_2023_Ctd DESC
;
Commentary
Note that I said this "doesn't quite work" - that's because in real life, authors probably use multiple publishers simultaneously. But here we'll imagine that doesn't happen - though we will allow authors to change publisher from one year to the next.
That's important for one important aspect of this analysis:
- we cannot ever add the counts of authors across the years
Top comments (0)