DEV Community

Cover image for Exploring Advanced SQL Techniques: Aggregation Methods for Structured Data Retrieval
Emtiaj Hasan
Emtiaj Hasan

Posted on • Updated on

Exploring Advanced SQL Techniques: Aggregation Methods for Structured Data Retrieval

Assume we have two tables, Vocabulary and Definition to store any unknown vocabulary. It is common for a vocabulary to have multiple definitions. In SQL terms, this means that there is a one-to-many relationship between the Vocabulary and Definition tables.

Let’s assume we need to retrieve vocabularies along with all of the definitions. The payload structure should be like the one below.

{
   id: string;
   word: string;
   definitions: {
       id: string;
       meaning: string;
       examples:  string[];
   }[];
}[]
Enter fullscreen mode Exit fullscreen mode

We can quickly create tables after spinning up a Postgres container.

As both tables have been created we can insert rows into these two tables. A simple query can quickly insert 5 vocabularies.

If we see the above-mentioned snippet, except one (which is Cakewalk) all have at least one definition and the vocabulary Summon has two definitions. Let’s keep in mind.

Approach 1: External Processing

We can execute one of the simplest queries ever to retrieve all vocabularies.

select "Vocabulary".id as "vocabularyId",
      "Vocabulary".word,
      "Definition".id as "definitionId",
      "Definition".meaning,
      "Definition".examples
from "Vocabulary"
        left join "Definition" on "Vocabulary".id = "Definition"."vocabularyId"
order by "Vocabulary".word;
Enter fullscreen mode Exit fullscreen mode

The query responses can be found here.

If we notice, for each definition, there is a single entry. Therefore, for Summon, we have two rows. But our response payload indicates, we need to aggregate all definitions into an array. That means we need to somehow process the response.

Also, as the Cakewalk does not have a definition, it returns null for definitionId, meaning, and examples. In this scenario, we need to return an empty array for simplicity.

Process retrieved response

We can write a neat JavaScript function to tidy up the SQL response to group all the meanings under each word, just like how we want it.

function process(queryResponses) {
   const vocabulariesMap = new Map();


   queryResponses.forEach(currentQueryResponse => {
       const existingVocabulary = vocabulariesMap.get(currentQueryResponse.vocabularyId);
       if (!existingVocabulary) {
           vocabulariesMap.set(currentQueryResponse.vocabularyId, {
               id: currentQueryResponse.vocabularyId,
               word: currentQueryResponse.word,
               definitions: currentQueryResponse.definitionId ? [{
                   id: currentQueryResponse.definitionId,
                   meaning: currentQueryResponse.meaning,
                   examples: currentQueryResponse.examples
               }] : []
           });
       } else {
           existingVocabulary.definitions.push({
               id: currentQueryResponse.definitionId,
               meaning: currentQueryResponse.meaning,
               examples: currentQueryResponse.examples
           });
       }
   });


   return [...vocabulariesMap.values()];
}


const vocabularies = process(queryResponse);
Enter fullscreen mode Exit fullscreen mode

The method will return us the expected response and here it is.

[
 {
   "id": "d2daf563-ee43-439e-a8c0-6cfa325206dc",
   "word": "Cakewalk",
   "definitions": []
 },
 {
   "id": "7b33f8c7-1648-40e3-80df-3faec587f773",
   "word": "devil's advocate",
   "definitions": [
     {
       "id": "1a284263-c329-4876-bb66-14bb3b9e9a76",
       "meaning": "a person who expresses a contentious opinion in order to provoke debate or test the strength of the opposing arguments",
       "examples": [
         "The schoolmaster often played devil's advocate with his students so that they could have an interesting discussion and look at other points of views."
       ]
     }
   ]
 },
 {
   "id": "ac7a758d-bebe-4f6c-8642-11b30260e9df",
   "word": "Guerrilla marketing",
   "definitions": [
     {
       "id": "2bb85746-8915-4495-bff7-b63037a4ea09",
       "meaning": "creative, low cost, unconventional strategy to promote the products or services",
       "examples": [
         "A startup should use guerrilla marketing to promote their services."
       ]
     }
   ]
 },
 {
   "id": "ed6df6e5-2580-4c61-a05f-f390851974b3",
   "word": "Run round like a headless chicken",
   "definitions": [
     {
       "id": "c89966db-ca3a-4ea2-b638-fde91834f00a",
       "meaning": "to be very busy doing a lot of things, but in a way that is not very effective",
       "examples": [
         "Instead of running round like a headless chicken use your efforts in a more productive way."
       ]
     }
   ]
 },
 {
   "id": "bd5b5c91-5776-423c-8641-522110e98b29",
   "word": "Summon",
   "definitions": [
     {
       "id": "ce25ff39-c37f-426e-ba50-2311492184a1",
       "meaning": "order to be present",
       "examples": [
         "We were summoned to the headmaster's office."
       ]
     },
     {
       "id": "873cf807-3fe7-4322-8e61-465c87d66a40",
       "meaning": "to make an effort to produce a particular quality in yourself, especially when you find it difficult",
       "examples": [
         "It took me six months to summon (up) the courage to ask her out for a drink."
       ]
     }
   ]
 }
]
Enter fullscreen mode Exit fullscreen mode

Approach 2: SQL with Postgres Functions

But, isn’t there any way to retrieve data as our desired structure using just the query?

I was looking into it and found two awesome Postgres functions json_build_object and json_agg.

Let’s apply those two!

select "Vocabulary".id,
      "Vocabulary".word,
      json_agg(json_build_object(
              'id', "Definition".id,
              'meaning', "Definition".meaning,
              'examples', "Definition".examples)) as definitions
from "Vocabulary"
        left join "Definition" on "Vocabulary".id = "Definition"."vocabularyId"
group by "Vocabulary".id, "Vocabulary".word
order by "Vocabulary".word;
Enter fullscreen mode Exit fullscreen mode

Except for one minor issue, it returns a response as exactly as our structure.

If we notice the definitions for the Cakewalk, we will see the issue. It is like the below having a bunch of null.

{
 "id": "d2daf563-ee43-439e-a8c0-6cfa325206dc",
 "word": "Cakewalk",
 "definitions": [
   {
     "id": null,
     "meaning": null,
     "examples": null
   }
 ]
}
Enter fullscreen mode Exit fullscreen mode

Filter null

Postgres has filter clause to filter out the null, which can rescue us.

If we change the query a bit it will not return null for each definition column.

json_agg(json_build_object(
      'id', "Definition".id,
      'meaning', "Definition".meaning,
      'examples', "Definition".examples)) filter (where "Definition".id is not null) as definitions
Enter fullscreen mode Exit fullscreen mode

However, it looks like we still need to rely on the JS method to set an empty array as the query response for the Cakewalk is like the one below.

{
 "id": "d2daf563-ee43-439e-a8c0-6cfa325206dc",
 "word": "Cakewalk",
 "definitions": null
}
Enter fullscreen mode Exit fullscreen mode

Being a bit disappointed, I started looking into Postgres-specific ways and found another beautiful function! coalesce accepts any arguments as well as a default one and if it does not find any non-null values, it returns the default value. For our scenario, we can pass the default value as an empty array.

So, here is the modified select clause.

coalesce(json_agg(json_build_object(
      'id', "Definition".id,
      'meaning', "Definition".meaning,
      'examples', "Definition".examples)) filter (where "Definition".id is not null), '[]') as definitions
Enter fullscreen mode Exit fullscreen mode

And voila! We got our structured data straight from the database query, all spick and span!

So, which approach do you prefer? Classic SQL with a sprinkle of external methods or Super SQL straight from Postgres? Let me know, and we can dive deeper!


Cover image credit: postgresweekly

Top comments (1)

Collapse
 
nigel447 profile image
nigel447

For this simple case I would stick with pure SQL and in the case that I needed the aggregation code to not be in the source would just aggregate with a sql function, reason being the need to stay SQL compliant and to keep the possibility of a seamless migration to a new relational database if needed, I am a big fan of the PostgreSQL JSON extension(s) so this is a great write up.