For a long time, I believed that if my API sent back a 200 OK status code, my work as a backend developer was over. But I've come to realize that I shouldn't just use tools and libraries; I should understand how they work. I've been using Postgres because I've heard it's the best. But I never really knew why—until now.
Welcome to Entry #01 of my journey into advanced backend architectures.
The Life of a Query
I used to think that a query simply goes in and comes out with the data. But the life of a query is actually a disciplined, multi-stage process. It's like a relay race.
When my Node.js application sends a query, a connection is made (I now use a connection pool to keep things stable). For each connection, Postgres spawns a backend process. The planner then decides how to retrieve the data: an index scan if it can find the rows quickly, or a sequential scan if it has to look at everything. It all depends on the query.
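You can ask the planner directly which strategy it chose with EXPLAIN. A minimal sketch (the users table and its email index are hypothetical examples, not from a real schema):

```sql
-- EXPLAIN shows the chosen plan without actually running the query.
EXPLAIN SELECT * FROM users WHERE email = 'alex@example.com';

-- With an index on email, you'd expect a line like:
--   Index Scan using users_email_idx on users ...
-- Without one, the planner falls back to reading everything:
--   Seq Scan on users  Filter: (email = 'alex@example.com')
```

Running this regularly on your hot queries is the fastest way to catch a plan that silently changed.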
One of the biggest mind-blowers for me is that Postgres does not hit the disk every time. It has something called shared buffers, an in-memory cache of data pages. If the page is already in that cache, the read is fast. If it isn't, Postgres fetches it from disk and caches it for next time.
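You can actually watch this cache at work. EXPLAIN (ANALYZE, BUFFERS) reports how many pages came from shared buffers versus disk (the query below is an illustrative example):

```sql
-- ANALYZE runs the query for real; BUFFERS adds cache statistics.
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE id = 42;

-- In the output, look for lines like:
--   Buffers: shared hit=3          -- every page was already in memory
--   Buffers: shared hit=1 read=2   -- two pages had to come from disk
```

Run the same query twice and you'll often see the reads turn into hits, because the first run warmed the cache.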
What happens when things go wrong?
I wanted to examine a few scenarios that would normally keep developers up at night.
Scenario 1: The Morning Spike
The Problem: At PayWave, our /wallet/balance endpoint suddenly jumped from 50ms to 500ms during the morning rush.
My Coworker Alex: "Hey, the balance check is dragging. Is the server overloaded?"
The Solution: At a fintech firm like PayWave, a balance check that takes 500ms is a disaster. Instead of reaching for a server restart, I would think about dead tuples. If the table has accumulated a lot of dead tuples that haven't been vacuumed in a while, queries can slow down significantly. I would also check whether the query is actually using the index I created.
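Both suspicions can be checked from the built-in statistics views. A sketch of the diagnostics I'd run (the wallets table name is hypothetical):

```sql
-- How many dead tuples has each table accumulated, and when was it
-- last autovacuumed?
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

-- Is the index on the wallets table actually being used?
-- idx_scan stuck at 0 means the planner is ignoring it.
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'wallets';
```

A huge n_dead_tup with a stale last_autovacuum points at bloat; an idx_scan of zero points at a query that no longer matches the index.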
Scenario 2: The Growing Table
The Problem: At ShopSphere, our orders table grew from 2GB to 6GB in one week, but our sales didn't triple.
My Coworker Priya: "Why is the disk filling up so fast? Autovacuum is on, so it shouldn't be bloat, right?"
The Solution: If a table at ShopSphere has tripled in size in a week without a matching jump in sales, that's a strong sign of bloat. Even with autovacuum enabled, it may not be able to keep up with the volume of changes to the table. I would examine autovacuum_vacuum_cost_limit to see whether we are throttling the autovacuum workers too aggressively.
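In practice I'd inspect the throttle settings and, for a hot table, override them per table. A sketch (the orders table and the specific values are illustrative, not recommendations):

```sql
-- Check the global autovacuum throttle.
SHOW autovacuum_vacuum_cost_limit;  -- -1 means "fall back to vacuum_cost_limit"
SHOW autovacuum_vacuum_cost_delay;

-- Make autovacuum more aggressive for this one table:
-- trigger it at 2% dead tuples and let it do more work per cycle.
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_vacuum_cost_limit   = 1000
);
```

Per-table storage parameters like these let you tune the busiest tables without changing cluster-wide behavior.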
Scenario 3: The Mid-Transaction Crash
The Problem: During a quarterly DR test at CryptoBank, we simulate a primary node crash mid-transaction.
My Coworker Sam: "We killed the primary node. Can we actually restore this without losing the last 30 seconds of transactions?"
The Solution: This used to worry me. But after learning about Write-Ahead Logging (WAL), I don't worry about it anymore. Postgres flushes every change to the WAL before acknowledging the commit, so even if the primary node crashes mid-transaction, the data can be recovered. On restart, Postgres simply replays the logs and brings the system back to a consistent state.
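The durability guarantee hinges on a couple of settings, which you can inspect directly (a quick check I'd run before any DR test):

```sql
-- A commit is only acknowledged after its WAL record is flushed to disk.
SHOW wal_level;           -- 'replica' or higher is required for replication
SHOW synchronous_commit;  -- 'on' means no acknowledged commit can be lost

-- Current write position in the WAL stream.
SELECT pg_current_wal_lsn();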
My Key Takeaways
- Indexing is your friend when it comes to speed.
- Bloat is a silent killer; monitor your dead tuples.
- Using connection pools avoids the "Too Many Connections" crash because it manages backend processes.
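To make that last takeaway concrete, here is a minimal pooling sketch using the node-postgres ("pg") package. The connection details, database, and wallets table are placeholders from the scenarios above, and this needs a running Postgres server:

```javascript
// Minimal connection-pool sketch with the "pg" package.
const { Pool } = require('pg');

const pool = new Pool({
  host: 'localhost',
  database: 'paywave',      // hypothetical database from Scenario 1
  max: 10,                  // cap at 10 backend processes, not one per request
  idleTimeoutMillis: 30000, // close idle connections after 30s
});

async function getBalance(userId) {
  // pool.query checks a connection out, runs the query, and
  // returns the connection to the pool automatically.
  const { rows } = await pool.query(
    'SELECT balance FROM wallets WHERE user_id = $1',
    [userId]
  );
  return rows[0]?.balance;
}
```

With max set, a traffic spike queues inside the pool instead of spawning hundreds of backend processes and tripping Postgres's "too many connections" error.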
I'm shifting my focus from merely coding to learning about the environment that runs my code. If you have any questions/scenarios related to Postgres performance tuning, do share in the comments below.
Catch you in Entry #02.