Bhavya Thakkar

The Dashboard Dilemma

Introduction
Sweaty palms, jittery hands, typing away at midnight to ship a crucial feature to improve team productivity: we've all been there. Much of this can be avoided by building a reliable, easy-to-use dashboard that surfaces KPIs, makes management smoother, keeps strategic pivots data-driven, and speeds up iteration. The hard part is deciding what we want to track, how frequently, how stringently, and how to present it in a quick yet reliable fashion. I was stuck in exactly that conundrum.
During August last year, after the basic infrastructure and backend setup for our in-house ERP was done, I was put in charge of upcoming improvements and feature extensions.
Imagine this: you are in the spotlight at the financial year-end party, all thanks to one feature that quadrupled the sales team's efficiency. I was stoked (that's an understatement); I wanted every bit of it, even the possibility of failing miserably.

But little did I know that one of those features would be an organization-level dashboard involving complex KPIs like call efficiency and call times, all of it preferably real-time. So this blog is about how I went about it: the trade-offs I chose, what I could have done better, what I actually got right, and what I didn't.


Freshness SLA & Its Impact
So what exactly is a freshness SLA, and why does it matter so much? Let's take two examples. If you were working at an HFT (high-frequency trading) firm and the prices you execute your buy/sell orders on are even a second old, it could cost the company millions of dollars. On the other end of the spectrum, if you were working on a government project verifying a user's documents for vehicle registration, you would be fine even if the system took, say, an hour.
What changed between these two examples is the required frequency of data updates/syncs: that's the freshness SLA. Freshness is simply how old or new the data you possess is.
The thing about any business and any data-driven decision is that if you ask the business what it wants, the answer will almost certainly be something along the lines of
"The most recent accurate data, as quick as possible"
The problem with statements like this is that there is nothing to latch on to: nothing to map user expectations against and nothing to derive KPIs from. Luckily, I was smart enough to ask the right question:

Is it okay if the data is 15 mins stale?

The answer to that question almost certainly won't be a simple yes or no, and that's the beauty of it: it gives you insight into what you can get away with. It will be more like

15 mins is too old, we are willing to accept 5 mins stale data

or

15 mins is fine but no more delays than that

or a rigid real-time requirement, in which case the expectation for performance would be lowered. I was working with a maximum staleness of 15 minutes and a sub-3-second response time (any slower would be bad UX), with sub-1-second responses preferred.

figure 1: the underlying data model

This gives me quite a bit of wiggle room, even though the data shown in figure 1 is quite complex to begin with. Let's break it down: each user is assigned a designation and a team, both of which can change over time due to promotions and internal movements. We need to extract KPIs about customer state, derived from call_history, and whether the customer is closed, based on the invested flag in the customer table. All simple and straightforward so far, but we need to do it for each user in every team, over the specified time period, and aggregate under the correct hierarchy, which also includes contributions from past team members.
(user here refers to the stakeholders of our ERP system—members of the sales team)
The final data would look something like this:

{
  "key": 1,
  "child": [
    {
      "key": 2,
      "child": [
        {
          "key": 3,
          "child": [
            {
              "key": 4,
              "warm_calls": 2,
              "dialled_calls": 100,
              "call_efficiency": 5,
              "closures": 1,
              "sip_setup": 1,
              "designation": "RM"
            },
            {
              "key": 5,
              "warm_calls": 2,
              "dialled_calls": 100,
              "call_efficiency": 5,
              "closures": 1,
              "sip_setup": 1,
              "designation": "RM"
            }
          ],
          "warm_calls": 4,
          "dialled_calls": 200,
          "call_efficiency": 5,
          "closures": 2,
          "sip_setup": 2,
          "designation": "team-lead"
        }
      ],
      "warm_calls": 4,
      "dialled_calls": 200,
      "call_efficiency": 5,
      "closures": 2,
      "sip_setup": 2,
      "designation": "manager"
    }
  ],
  "warm_calls": 4,
  "dialled_calls": 200,
  "call_efficiency": 5,
  "closures": 2,
  "sip_setup": 2,
  "designation": "avp"
}

sample data for 1 team consisting of 2 sales reps
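To make that shape concrete, here is a minimal sketch of how such a per-node rollup could be computed with a recursive CTE. The table and column names (users.id, users.manager_id, call_history.user_id, call_history.call_type) are hypothetical, and it ignores team changes over time and most of the KPIs; it only illustrates the idea of aggregating each rep's numbers into every level above them.

-- Sketch only: assumes users(id, manager_id) and call_history(user_id, call_type).
WITH RECURSIVE reporting_chain AS (
    -- every user rolls up into their own node...
    SELECT id AS user_id, id AS rollup_id
    FROM users
  UNION ALL
    -- ...and into every manager above them in the hierarchy
    SELECT rc.user_id, u.manager_id
    FROM reporting_chain rc
    JOIN users u ON u.id = rc.rollup_id
    WHERE u.manager_id IS NOT NULL
)
SELECT rc.rollup_id AS node,
       COUNT(*) AS dialled_calls,
       COUNT(*) FILTER (WHERE ch.call_type = 'warm') AS warm_calls
FROM reporting_chain rc
JOIN call_history ch ON ch.user_id = rc.user_id
GROUP BY rc.rollup_id;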


We mainly considered three approaches. Like any lean startup running on a time crunch, our first idea was caching, but caching the whole payload (as shown in the sample data) would be expensive, largely due to the recursive hierarchy we need to maintain for team-based KPIs. We would also need to invalidate the cache, and on a cache miss the DB query is still inefficient: we have neither optimized the query nor reduced the data it processes, so there is no real improvement on the first hit. This didn't sit right with me; if I had done the easy thing, set the TTL to 15 minutes, and called it a day, it would have been a band-aid fix waiting to explode in my face. Caching would work for frequently accessed time frames like daily or weekly data, but the moment we switch to less frequently accessed ranges like quarterly, half-yearly, or yearly, we run into frequent cache misses.


The second approach was to use batch processing to precompute this data. Although much better than the quasi-improvement of caching, it had its own challenges: setting up complex ETL pipelines, separate infrastructure, and a lot of initial configuration, and it still isn't a perfect fit for the real-time requirement. Not to mention that if the setup is anything short of perfect it introduces data inconsistencies, and any minor schema change would force us through the whole process again. Not ideal for development velocity.


The third approach was to stream events from our mobile ERP app to directly populate the web dashboards: fast, quick, and reliable, but it requires initial infrastructure setup in the form of Apache Kafka. That can be circumvented with managed solutions like Confluent or Kinesis, which are closer to plug and play; we would only have to worry about producing and consuming events. The problem is that the data we required was too complex, mandating heavy aggregations and joins. While the data points sent by the mobile client could be used to calculate KPIs, processing them in an event-driven architecture would mean computing aggregates as each event arrives, making refreshes much slower than the required 3 seconds.

So what I ended up doing was combining these three to find the sweet spot: I wanted something as streamlined as Redis, with the cost and accuracy of batch processing and a setup as easy as streaming.


The Incremental Refresh Approach
So I lied, I didn't use all three; I held back caching. I just couldn't come to terms with a pseudo real-time dashboard. I used streaming, specifically a queue (Kafka, to be even more specific), to ingest real-time events into the DB. These events included metrics like call times, leads closed, and leads transitioning between funnel states, so my database has fresh data. What about the aggregation problem, you ask? That's where it helped to realize that all the data required for a real-time organizational dashboard mostly already resides in the backend. I say mostly because some transactional data arrives asynchronously, including but not limited to payment_status and fulfillment_timestamp. This doesn't affect the problem statement much, because such dashboards work with an unspoken assumption: whatever data is present in the system, you crunch the numbers and hand them to the business teams. This context was needed to appreciate a rather simple solution: materialized views. All your aggregations can be pre-computed and stored as a materialized view.
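As a minimal sketch (again with hypothetical names: call_history(user_id, call_type)), the per-user aggregation could be precomputed like this; the real view would also join users/teams and compute the rest of the KPIs, but the idea is the same: the dashboard reads the view instead of re-aggregating raw rows on every request.

-- Sketch: precompute per-user KPI aggregates once, instead of on every request.
CREATE MATERIALIZED VIEW user_kpis AS
SELECT user_id,
       COUNT(*) AS dialled_calls,
       COUNT(*) FILTER (WHERE call_type = 'warm') AS warm_calls
FROM call_history
GROUP BY user_id;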

The problem with materialized views is that they don't refresh when the underlying data updates. An intuitive solution would be to use hooks to refresh the view, but that only works if you serve all of the dashboard's data through views. Let's think about the problem from a broader perspective: does historical data change? In my case, the answer was no. So can I keep all the data up to yesterday in a view refreshed through a cron, and compute only today's slice live, irrespective of the selected time frame? Yes. The live aggregation then considers significantly less data, and the queries can be optimized to give sub-second results.
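A sketch of that split, continuing with the hypothetical names from above (call_history now assumed to have a called_at timestamp): the materialized view stops at midnight of the day it is refreshed, and the serving query unions it with a live aggregation over today's rows only.

-- Sketch: historical slice only; now() is evaluated at refresh time,
-- so each nightly refresh moves the cutoff forward to "yesterday".
CREATE MATERIALIZED VIEW user_kpis_history AS
SELECT user_id,
       COUNT(*) AS dialled_calls,
       COUNT(*) FILTER (WHERE call_type = 'warm') AS warm_calls
FROM call_history
WHERE called_at < date_trunc('day', now())
GROUP BY user_id;

-- Serving query: precomputed history + today's (small) live slice.
SELECT user_id,
       SUM(dialled_calls) AS dialled_calls,
       SUM(warm_calls)    AS warm_calls
FROM (
    SELECT user_id, dialled_calls, warm_calls
    FROM user_kpis_history
  UNION ALL
    SELECT user_id,
           COUNT(*) AS dialled_calls,
           COUNT(*) FILTER (WHERE call_type = 'warm') AS warm_calls
    FROM call_history
    WHERE called_at >= date_trunc('day', now())
    GROUP BY user_id
) AS combined
GROUP BY user_id;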
Now for another downside of materialized views: a plain refresh locks the view against reads while it runs, which is not good. So we want to run the refresh through a cron, during phases with lower load. I hear you: didn't I frown upon additional infra? That's where a nifty little PostgreSQL extension comes to save the day, pg_cron to be specific. The way it works is that you configure your usual cron, but inside your database (the extension lives in its own schema): you give the cron config the refresh-view command to execute, a schedule, and whether the job should be active.


This is a clean solution: no additional infra overhead, no compromise on the real-time SLA, minimal (if any) cost increase from the views, and, best of all, it's elegant and easy to manage. The next section goes over how to set up pg_cron for this use case. I refrained from going in-depth on how to create materialized views; this blog is already running long and setting them up is really straightforward. pg_cron does require some work if you don't self-host your servers and use a cloud provider instead. We use AWS RDS, so the following is written in that context, but I will attach references on how to achieve the same on the other two major cloud providers. Once pg_cron has been added as an extension in Postgres, the process after that is really straightforward.

pg_cron Setup on RDS
pg_cron is supported on RDS for PostgreSQL versions 12.5 and above.

  1. Modify the parameter group:
     - Go to the AWS RDS Console.
     - Navigate to Parameter Groups on the sidebar.
     - If you don’t have a custom parameter group, create one by copying the default group associated with your DB instance.
     - Edit the custom parameter group and find the parameter shared_preload_libraries.
     - Add pg_cron to this parameter’s value (append it if other libraries exist).
     - Save the changes.
  2. Apply the parameter group to your DB instance:
     - Modify your RDS instance to use the new/modified parameter group.
     - A database restart is required for the changes to take effect. Restart your instance.
  3. Create the pg_cron extension in the database:

Connect to your PostgreSQL instance as a user with rds_superuser privileges (default user postgres usually has these).

CREATE EXTENSION pg_cron;

This will enable the pg_cron background worker and create the necessary objects, by default in the postgres database.
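If CREATE EXTENSION fails here, it is usually because the parameter change hasn't taken effect yet; a quick way to verify from psql is:

SHOW shared_preload_libraries;  -- should include pg_cron after the restart
SELECT extname, extversion FROM pg_extension WHERE extname = 'pg_cron';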
Schedule Jobs

SELECT cron.schedule('job_name', 'cron_schedule', 'command_to_run');

For example:

SELECT cron.schedule('refresh_materialized_view', '*/15 * * * *', 'REFRESH MATERIALIZED VIEW my_mat_view');
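For the approach described above, the heavy historical view only needs to be rebuilt once the day rolls over, so a nightly schedule during low traffic fits better. Adding CONCURRENTLY keeps the view readable while the refresh runs, at the cost of requiring a unique index on the materialized view (the view name below is the illustrative one from the earlier sketch):

SELECT cron.schedule(
  'refresh_user_kpis_history',
  '30 0 * * *',  -- every day at 00:30
  'REFRESH MATERIALIZED VIEW CONCURRENTLY user_kpis_history'
);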

Managing Jobs
To unschedule jobs:

SELECT cron.unschedule(job_id);  
SELECT cron.unschedule('job_name');

Job details and run history can be viewed in the cron.job_run_details table, like so:

SELECT * FROM cron.job_run_details
WHERE jobid = your_job_id;
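Scheduled jobs themselves live in the cron.job table, and for quick debugging the run history can be narrowed to the most recent executions, for example:

-- Most recent runs with their status and any error message
SELECT jobid, status, return_message, start_time, end_time
FROM cron.job_run_details
ORDER BY start_time DESC
LIMIT 10;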


Thank you all for reading! This is my very first blog post, and I plan to make blogging a monthly habit. Your feedback is incredibly valuable to me—please don’t hesitate to share your honest thoughts, suggestions, or constructive criticism. The more input I get, the better I can make my future posts. Let’s grow together!