Riken Shah

Avoiding Beginner Mistakes That Keep You from Scaling Your Backend ⚡️

This blog covers how I unlocked performance that allowed me to scale my backend from 50K requests → 1M requests (~16K reqs/min) on minimal resources (2 GB RAM, 1 vCPU, and modest network bandwidth of 50-100 Mbps).


It will take you on a journey with my past self. It might be a long ride, so fasten your seatbelt and enjoy it! 🎢

It assumes that you are familiar with backends and writing APIs. It's also a plus if you know a bit of Go; if you don't, that's fine too, you'll still be able to follow along as I've provided resources to help you understand each topic. (If you don't know Go, here's a quick intro)

tl;dr,

First, we build an observability pipeline that helps us monitor every aspect of our backend. Then, we stress test the backend until it eventually breaks (breakpoint testing).

Connection Pooling to avoid hitting the max connection threshold

Enforcing resource constraints to avoid resource hogging from non-critical services

Adding Indexes

Disabling Implicit Transaction

Increasing the max file descriptor limit for Linux

Throttling Goroutines

Future Plans

Intro w/ backend 🤝

Let me give a brief intro to the backend,

  • It's a monolithic, RESTful API written in Go.
  • Built with the Gin framework, using GORM as the ORM.
  • Uses Aurora Postgres as our sole primary database, hosted on AWS RDS.
  • The backend is dockerized and runs on a t2.small instance on AWS: 2 GB of RAM, 50-100 Mbps of network bandwidth, and 1 vCPU.
  • The backend provides authentication, CRUD operations, push notifications, and live updates.
  • For live updates, we open a very lightweight WebSocket connection that notifies the device that entities have been updated.

Our app is mostly read-heavy with decent write activity; if I had to give it a ratio, it'd be 65% read / 35% write.

I could write a separate blog on why we chose a monolithic architecture, Go, and Postgres, but to give you the tl;dr: at MsquareLabs we believe in "keeping it simple, and architecting code that allows us to move at a ridiculously fast pace."

Data Data Data 🙊

Before doing any mock load generation, I first built observability into our backend: traces, metrics, profiling, and logs. This makes it easy to find issues and pinpoint exactly what is causing the pain. When you have such a strong monitoring hold on your backend, it's also easier to track down production issues quickly.

Before we move ahead, let me give a quick tl;dr of metrics, profiling, logs, and traces:

  • Logs: We all know what logs are; they're just loads of textual messages we create when an event occurs.

  • Traces: These are structured, highly visible logs that help us capture an event with the correct order and timing.

  • Metrics: All the numeric data, like CPU usage, active requests, and active goroutines.

  • Profiling: Gives us real-time metrics about our code and its impact on the machine, which helps us understand what's going on. (WIP, will talk about it in detail in the next blog)

To learn more about how I built observability into the backend, check out the next blog (WIP). I moved that section into a separate post because I wanted to avoid overwhelming the reader and keep the focus on one thing: OPTIMIZATION.

This is what the visualization of traces, logs, and metrics looks like:

[Screenshot: Grafana dashboard showing traces, logs, and metrics]

So now we have a strong monitoring pipeline + a decent dashboard to start with 🚀

Mocking Power User x 100,000 🤺

Now the real fun begins: we start mocking users who are in love with the app.

"Only when you put your love (backend) under extreme pressure do you find its true essence ✨" - someone great lol, idk

Grafana also provides a load testing tool, k6, so without overthinking it much I decided to go with that; it requires minimal setup, just a few lines of code, and you have a mocking service ready.

Instead of touching all the API routes I focused on the most crucial routes that were responsible for 90% of our traffic.


Quick tl;dr about k6: it's a load testing tool written in Go and scripted in JavaScript, where you quickly define the behavior you want to mock and it takes care of load testing it. Whatever you define in the script's default (main) function is called an iteration; k6 spins up multiple virtual users (VUs) which run this iteration until the given duration or iteration count is reached.

Each iteration consists of 4 requests: Create Task → Update Task → Fetch Task → Delete Task

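If you've never seen k6, here's roughly what a single iteration boils down to, sketched as a plain Go HTTP client purely for illustration; in the real test this lives in a k6 JavaScript script, and the endpoint paths, payload, and auth token below are made-up placeholders, not our actual API.

package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// runIteration performs the same 4 requests a single iteration makes:
// create a task, update it, fetch it, then delete it.
// baseURL, paths, and the payload shape are placeholders.
func runIteration(client *http.Client, baseURL, token string) error {
	do := func(method, path string, body []byte) error {
		req, err := http.NewRequest(method, baseURL+path, bytes.NewReader(body))
		if err != nil {
			return err
		}
		req.Header.Set("Content-Type", "application/json")
		req.Header.Set("Authorization", "Bearer "+token)
		resp, err := client.Do(req)
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode >= 400 {
			return fmt.Errorf("%s %s: unexpected status %d", method, path, resp.StatusCode)
		}
		return nil
	}

	// Create → Update → Fetch → Delete (task ID hardcoded for the sketch).
	if err := do(http.MethodPost, "/tasks", []byte(`{"title":"load-test task"}`)); err != nil {
		return err
	}
	if err := do(http.MethodPut, "/tasks/1", []byte(`{"title":"updated title"}`)); err != nil {
		return err
	}
	if err := do(http.MethodGet, "/tasks/1", nil); err != nil {
		return err
	}
	return do(http.MethodDelete, "/tasks/1", nil)
}

func main() {
	client := &http.Client{}
	if err := runIteration(client, "http://localhost:8080", "dev-token"); err != nil {
		fmt.Println("iteration failed:", err)
	}
}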

Starting slow, let's see how it goes for ~10K requests → 100 VUs with 30 iterations each → 3,000 iterations x 4 requests → 12K requests

That was a breeze: no sign of memory leaks, CPU overload, or any other kind of bottleneck. Hurray!

Here's the k6 summary: 13 MB of data sent, 89 MB received, averaging over 52 req/s, with a median latency of 278ms; not bad considering all of this is running on a single machine.

checks.........................: 100.00%  12001      0    
     data_received..................: 89 MB   193 kB/s
     data_sent......................: 13 MB   27 kB/s
     http_req_blocked...............: avg=6.38ms  min=0s       med=6µs      max=1.54s    p(90)=11µs   p(95)=14µs  
     http_req_connecting............: avg=2.99ms  min=0s       med=0s       max=536.44ms p(90)=0s     p(95)=0s    
    http_req_duration..............: avg=1.74s   min=201.48ms med=278.15ms max=16.76s   p(90)=9.05s  p(95)=13.76s
       { expected_response:true }...: avg=1.74s   min=201.48ms med=278.15ms max=16.76s   p(90)=9.05s  p(95)=13.76s
    http_req_failed................: 0.00%    0          24001
     http_req_receiving.............: avg=11.21ms min=10µs     med=94µs     max=2.18s    p(90)=294µs  p(95)=2.04ms
     http_req_sending...............: avg=43.3µs  min=3µs      med=32µs     max=13.16ms  p(90)=67µs   p(95)=78µs  
     http_req_tls_handshaking.......: avg=3.32ms  min=0s       med=0s       max=678.69ms p(90)=0s     p(95)=0s    
     http_req_waiting...............: avg=1.73s   min=201.36ms med=278.04ms max=15.74s   p(90)=8.99s  p(95)=13.7s 
     http_reqs......................: 24001   52.095672/s
     iteration_duration.............: avg=14.48s  min=1.77s    med=16.04s   max=21.39s   p(90)=17.31s p(95)=18.88s
     iterations.....................: 3000    6.511688/s
     vus............................: 1       min=0       max=100
     vus_max........................: 100     min=100     max=100

running (07m40.7s), 000/100 VUs, 3000 complete and 0 interrupted iterations
_10k_v_hits  [======================================] 100 VUs  07m38.9s/20m0s  3000/3000 iters, 30 per VU

Let's ramp up from 12K → 100K requests: 66 MB sent, 462 MB received, peak CPU usage of 60% and memory usage of 50%; it took 40 minutes to run (averaging 2,500 req/min).

Everything looked fine until I saw something weird in our logs: "gorm: Too many connections". Quickly checking the RDS metrics confirmed that open connections had reached 410, the limit for max open connections. Aurora Postgres sets this limit itself based on the available memory of the instance.

Here's how you can check,

select * from pg_settings where name='max_connections'; ⇒ 410

Postgres spawns a process for each connection, which is extremely costly, considering it opens a new connection for each new request while previous queries are still being executed. So Postgres enforces a limit on how many concurrent connections can be open. Once the limit is reached, it blocks any further attempts to connect to the DB to avoid crashing the instance (which could cause data loss).

Optimization 1: Connection Pooling ⚡️

Connection pooling is a technique for managing database connections: it reuses open connections and ensures the count never crosses the threshold. If a client asks for a connection while the max connection limit has been reached, it waits until a connection gets freed or rejects the request.

There are two options here: do client-side pooling, or use a separate service like PgBouncer (which acts as a proxy). PgBouncer is indeed the better option when you are at scale and have a distributed architecture connecting to the same DB. For the sake of simplicity, and in line with our core values, we chose to move ahead with client-side pooling.

Luckily, the ORM we are using, GORM, supports connection pooling; under the hood it uses database/sql (Go's standard package) to handle it.

There are pretty straightforward methods to handle this,

configSQLDriver, err := db.DB()
if err != nil {
    log.Fatal(err)
}
configSQLDriver.SetMaxIdleConns(300)
configSQLDriver.SetMaxOpenConns(380) // kept a few connections as buffer for developers
configSQLDriver.SetConnMaxIdleTime(30 * time.Minute)
configSQLDriver.SetConnMaxLifetime(time.Hour)
  • SetMaxIdleConns → maximum number of idle connections to keep in memory so we can reuse them (reduces the latency and cost of opening new connections)
  • SetConnMaxIdleTime → maximum amount of time an idle connection is kept in memory.
  • SetMaxOpenConns → maximum number of open connections to the DB; kept below the 410 limit as a buffer, since we run two environments on the same RDS instance.
  • SetConnMaxLifetime → maximum amount of time any connection stays open
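
To verify the pool is actually behaving the way you configured it, you can periodically log the driver's pool statistics. Here's a minimal sketch (the 30-second interval and log format are arbitrary choices, not part of our setup):

import (
	"database/sql"
	"log"
	"time"
)

// logPoolStats periodically dumps database/sql pool statistics so a
// misconfigured pool shows up in the logs before it shows up as
// "too many connections" in production.
func logPoolStats(db *sql.DB, every time.Duration) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for range ticker.C {
		s := db.Stats()
		log.Printf("db pool: open=%d inUse=%d idle=%d waitCount=%d waitDuration=%s",
			s.OpenConnections, s.InUse, s.Idle, s.WaitCount, s.WaitDuration)
	}
}

// after configuring the pool:
// go logPoolStats(configSQLDriver, 30*time.Second)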

Now going one step further to 500K requests (4,000 req/min); 20 minutes in, the server crashed 💥. Finally, let's investigate 🔎

Quickly looking through the metrics and bam! CPU and memory usage had spiked, and it was Alloy (the OpenTelemetry collector), not our API container, hogging all the CPU and memory.

Optimization 2: Unblocking Resources from Alloy (OpenTelemetry Collector)

We are running three containers inside our small t2 instance,

  • API Dev
  • API Staging
  • Alloy

As we dump huge loads onto our DEV server, it starts generating logs and traces at the same rate, sharply increasing CPU usage and network egress.

So it's important to ensure the Alloy container never crosses its resource limits and hampers the critical services.

As Alloy runs inside a Docker container, it was easy to enforce this constraint,

resources:
        limits:
            cpus: '0.5'
            memory: 400M

Also, this time the logs weren't empty; there were multiple "context canceled" errors. The reason: requests were timing out and the connections were being closed abruptly.

And then I checked latency, and it was crazy 😲 after a certain point the average latency was 30-40 seconds. Thanks to traces, I could pinpoint exactly what was causing such huge delays.

The query in our GET operation was extremely slow, so let's run EXPLAIN ANALYZE on it:

[Screenshot: EXPLAIN ANALYZE output for the slow GET query]

The LEFT JOIN took 14.6 seconds and the LIMIT took another 14.6 seconds. How can we optimize this? INDEXING.

Optimization 3: Adding Indexes 🏎️

Adding indexes to fields that are often used in WHERE or ORDER BY clauses can improve query performance five-fold. After adding indexes for the LEFT JOIN table and the ORDER BY fields, the same query took 50ms. Can you believe it, from 14.6 seconds ⇒ 50ms 🤯

(But beware of adding indexes blindly; they can slowly kill CREATE/UPDATE/DELETE performance.)

It also frees up connections faster and helps improve the overall capacity for handling huge concurrent loads.
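
As a rough illustration of what that looks like with GORM (the Task model and column names below are made up for the example, not our actual schema), you can declare indexes with struct tags and let AutoMigrate create them, or run the DDL yourself:

// Hypothetical model: index the column used in the LEFT JOIN / WHERE
// and the column used in ORDER BY.
type Task struct {
	ID        uint      `gorm:"primaryKey"`
	UserID    uint      `gorm:"index"`                      // joined/filtered column
	CreatedAt time.Time `gorm:"index:idx_tasks_created_at"` // ORDER BY column
	Title     string
}

// AutoMigrate creates the table and any missing indexes declared above.
if err := db.AutoMigrate(&Task{}); err != nil {
	log.Fatal(err)
}

// Or create an index explicitly with raw SQL.
if err := db.Exec(`CREATE INDEX IF NOT EXISTS idx_tasks_user_id ON tasks (user_id)`).Error; err != nil {
	log.Fatal(err)
}

After adding an index, it's worth re-running EXPLAIN ANALYZE to confirm Postgres actually uses it; it won't if the query's filter or sort doesn't match the indexed columns.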

Optimization 4: Ensure there are no blocking TRANSACTIONs while testing 🤡

Technically not an optimization but a fix you should keep in mind: make sure your code doesn't try to update/delete the same entity concurrently while you are stress testing.

While going over the code, I found a bug that caused an UPDATE to the user entity on every request, and as each UPDATE call executed inside a transaction, which creates a LOCK, almost all the UPDATE calls were blocked by previous ones.

This fix alone increased the throughput 2x.

Optimization 5: Skipping implicit TRANSACTION of GORM 🎭


By default, GORM executes each write query inside a transaction, which can slow down performance. Since we already have an extremely strong transaction mechanism of our own, the chance of missing a transaction in a critical area is close to zero (unless it's an intern 🤣).

We have a middleware that creates a transaction before hitting the model layer, and a centralized function to ensure the commit/rollback of that transaction in our controller layer.

By disabling GORM's implicit transaction we got a performance gain of at least ~30%.
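
Here's roughly what that looks like: disabling the implicit transaction when opening the connection, plus a simplified version of the kind of Gin middleware that manages the explicit transaction (the "tx" context key and the rollback conditions are simplified placeholders, not our exact implementation):

// Disable GORM's implicit per-write transaction; we manage transactions
// ourselves in middleware instead.
db, err := gorm.Open(postgres.Open(dsn), &gorm.Config{
	SkipDefaultTransaction: true,
})
if err != nil {
	log.Fatal(err)
}

// TxMiddleware opens a transaction per request, stores it in the Gin
// context, and commits or rolls back once the handlers are done.
// (Panic/recovery handling omitted for brevity.)
func TxMiddleware(db *gorm.DB) gin.HandlerFunc {
	return func(c *gin.Context) {
		tx := db.Begin()
		c.Set("tx", tx) // the model layer pulls the tx out of the context

		c.Next()

		if len(c.Errors) > 0 || c.Writer.Status() >= 500 {
			tx.Rollback()
			return
		}
		tx.Commit()
	}
}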

"The reason we were stuck at 4-5K requests a minute was this and I thought it was my laptop network bandwidth." - dumb me

All these optimizations led to a 5x throughput gain 💪; now my laptop alone can generate traffic of 12K-18K requests a minute.

Million Hits 🐉

Finally, a million hits at 10K-13K requests a minute. It took ~2 hours; it should have been faster, but whenever Alloy restarts (due to the resource restriction), all the metrics are lost with it.

To my surprise the maximum CPU utilization during that span was 60% and memory usage was just 150MB.

It's crazy how performant Go is and how beautifully it handles the load, with a minimal memory footprint. Just in love with Go 💖

Each query took 200-400ms to complete. The next step is to uncover why it takes that long; my guess is connection pooling and IO blocking are slowing down the queries.

The average latency came down to ~2 seconds, but there is still a lot of room for improvement.

Implicit Optimization 🕊️

Optimization 6: Increasing the max file descriptor limit 🔋

As we are running our backend on Linux, each network connection we open creates a file descriptor; by default, Linux limits this to 1024 per process, which hinders it from reaching peak performance.

As we open multiple WebSocket connections, with a lot of concurrent traffic we'd hit this limit easily.

Docker Compose provides a nice abstraction over it,

ulimits:
        nofile:   # max open file descriptors per process
          soft: 65536
          hard: 65536
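
If you'd rather not depend on Docker for this, the process can also raise its own soft limit up to the hard limit at startup. A minimal, Linux-specific sketch using the standard syscall package (an alternative approach, not what our compose setup does):

import (
	"log"
	"syscall"
)

// raiseFileDescriptorLimit bumps the soft open-file limit up to the hard
// limit so WebSocket-heavy traffic doesn't exhaust file descriptors.
func raiseFileDescriptorLimit() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		log.Printf("getrlimit failed: %v", err)
		return
	}
	rl.Cur = rl.Max // raise soft limit to the hard limit
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		log.Printf("setrlimit failed: %v", err)
		return
	}
	log.Printf("file descriptor limit raised to %d", rl.Cur)
}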

Optimization 7: Avoid overloading goroutines 🤹

As Go developers, we often take goroutines for granted and mindlessly run many non-critical tasks inside them: we add go before a function and then forget about its execution. But under extreme conditions this can become a bottleneck.

To ensure it never becomes a bottleneck for me, for the services that often run in goroutines I use an in-memory queue with n workers to execute the tasks.
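
Here's a minimal sketch of that pattern: a bounded in-memory queue (a buffered channel) drained by a fixed number of workers, so a traffic spike queues work instead of spawning an unbounded number of goroutines. The worker count, buffer size, and the sendPushNotification call are example placeholders:

// Queue is a bounded in-memory task queue drained by a fixed pool of workers.
type Queue struct {
	tasks chan func()
}

// NewQueue starts n workers that pull tasks off a buffered channel.
func NewQueue(workers, buffer int) *Queue {
	q := &Queue{tasks: make(chan func(), buffer)}
	for i := 0; i < workers; i++ {
		go func() {
			for task := range q.tasks {
				task()
			}
		}()
	}
	return q
}

// Enqueue adds a task; if the queue is full it reports that instead of
// blocking the request path (drop, log, or retry as appropriate).
func (q *Queue) Enqueue(task func()) bool {
	select {
	case q.tasks <- task:
		return true
	default:
		return false
	}
}

// Usage: instead of `go sendPushNotification(userID)` on every request,
// queue := NewQueue(10, 1000)
// queue.Enqueue(func() { sendPushNotification(userID) }) // hypothetical task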

Next Steps 🏃‍♀️

Improvements: Moving from t2 to t3 or t3a

t2 is an older generation of AWS general-purpose machines, while t3, t3a, and t4g are newer generations. They are burstable instances that provide far better network bandwidth and better performance for prolonged CPU usage than t2.

Understanding burstable instances,

AWS introduced burstable instance types mainly for workloads that don't require 100% of the CPU most of the time. These instances operate at a baseline performance level (20%-30%). They maintain a credit system: whenever your instance doesn't need the CPU, it accumulates credits, and when a CPU spike occurs, it spends those credits. This reduces your cost and AWS's wasted compute.

t3a would be a nice family to stick with, because its cost/performance ratio is among the best of the burstable instance families.

Here's a nice blog comparing t2 and t3.

Improvements: Query

There are many improvements we can make to the queries/schema to improve speed; some of them are:

  • Batching inserts on insert-heavy tables (see the sketch after this list).
  • Avoiding the LEFT JOIN via denormalization.
  • Adding a caching layer.
  • Sharding & partitioning, but that comes much later.
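
For the batching point, GORM supports it out of the box via CreateInBatches; a small sketch with a made-up Task model and an arbitrary batch size of 100:

// Write rows in chunks of 100 instead of one INSERT per row.
tasks := make([]Task, 0, 1000)
for i := 0; i < 1000; i++ {
	tasks = append(tasks, Task{Title: fmt.Sprintf("task-%d", i)})
}
if err := db.CreateInBatches(&tasks, 100).Error; err != nil {
	log.Fatal(err)
}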

Improvements: Profiling

The next step to unlock performance is to enable profiling and figure out what exactly is going on at runtime.

Improvements: Breakpoint Testing

To discover the limitations and capacity of my server, breakpoint testing is the next step forward.

End Note 👋

If you have read till the end, you are cracked, congrats 🍻

This is my first blog, so please let me know if something is unclear or if you want a more in-depth dive into any topic. In my next blog, I'll be taking a deep dive into profiling, so stay tuned.

You can follow me on X to stay updated :)

Top comments (16)

Kiran Naragund

Thanks for sharing this Riken Shah!
Really Helpful

Riken Shah

Thanks Kiran, Glad it was helpful :D

Idorenyin Obong

good article

Riken Shah

Thanks, Idorenyin ❤️

Marcos Silva

nicely done! keep going with this excellent content!

Riken Shah

thanks marcos, happy to see you liked it :D

Leo Antony

Excellent content, very informative. I'm gonna learn Grafana after this.

Riken Shah

Definitely, Grafana is too good.
Thanks, Leo for the kind words :)

Steven Brown

Thanks for this blog post! A lot of good information here, and I'm mostly curious about the database and data set.

How large was the initial data set you're working with, or was it an empty database? At first I was thinking most operations were on a single table (incredibly simple CRUD), but when you mentioned the joins, my curiosity was piqued about the DB schema.

How many tables are joined on the queries you were able to bring down to ~50ms initially?
Are those included in the ones that went back to 200-400ms?

I'm also curious about the database settings and query structure.

Do you have the fields being returned added to the index (in the correct order) to better utilize the machine's memory, or would that make the index too large?

Thanks again!

Achintya

Amazing blog, can you also share the code of the backend you made?
Also, another optimization I would have added is to use sqlc with pgx rather than GORM, as sqlc gives the performance of raw query execution with proper idiomatic Go models.

Petra Grunheidt

Total Banger 🔥🔥🔥🔥🔥
Thanks for sharing

Antonio | CEO at Litlyx.com

Really good article and value shared.

Nadeem Ahmad

Couldn't have worded it better!!

Tablepad

Good post, thanks!

Vinod Borole

can you share this git project so we can also contribute?

kalyan k

You mentioned you have strong transaction handling in your middleware. How is this implemented? Is it in Go? Great article btw
