<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Garvit Gupta</title>
    <description>The latest articles on DEV Community by Garvit Gupta (@garvit_gupta).</description>
    <link>https://dev.to/garvit_gupta</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1043181%2F50ecee66-d44d-405a-be9e-384c104c841a.jpeg</url>
      <title>DEV Community: Garvit Gupta</title>
      <link>https://dev.to/garvit_gupta</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/garvit_gupta"/>
    <language>en</language>
    <item>
      <title>The Big Power of Small Pull Requests</title>
      <dc:creator>Garvit Gupta</dc:creator>
      <pubDate>Mon, 04 Nov 2024 17:18:31 +0000</pubDate>
      <link>https://dev.to/garvit_gupta/the-big-power-of-small-pull-requests-3053</link>
      <guid>https://dev.to/garvit_gupta/the-big-power-of-small-pull-requests-3053</guid>
      <description>&lt;p&gt;I have been writing code professionally for more than five years now. For the first four years, I never cared about the size of my pull requests (PRs). However, in the last year, I transitioned from submitting massive PRs with thousands of lines of changes to breaking them down into smaller, more manageable ones. The benefits of this shift have been immense, and in this blog, I’ll share those advantages.&lt;/p&gt;

&lt;p&gt;According to &lt;a href="https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, a pull request is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;A pull request is a proposal to merge a set of changes from one branch into another. In a pull request, collaborators can review and discuss the proposed set of changes before they integrate the changes into the main codebase.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Essentially, a pull request is a way of collaborating, so we should do everything possible to enhance that collaboration. One effective method is to keep PRs small.&lt;/p&gt;

&lt;p&gt;There is no universal definition of what separates a small PR from a large one. Relying solely on the number of lines changed is insufficient, as auto-generated code and tests can inflate line counts. When I refer to small PRs in this article, I mean dividing a larger PR into multiple smaller, logically coherent PRs, each of which is standalone, mergeable, and deployable. I do not advocate artificial splitting, such as dividing a PR into one with all the code and another with just the tests, as that approach yields none of the benefits I share below.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Efficient Reviews:&lt;/strong&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Ask a programmer to review 10 lines of code, he’ll find 10 issues. Ask him to do 500 lines and he’ll say it looks good.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While humorous, this quote carries truth. Everyone is busy with their own work, and when you ask someone to review a PR, you are essentially requesting their time. Reviewing a PR requires the reviewer to switch context from their own work, and if the review takes a considerable amount of time, it can be hard for them to return to that work, which hurts their motivation and commitment to the review. Smaller PRs, which take about 20–30 minutes to review, are much easier to tackle than those that take 2–3 hours. Larger PRs also lead to oversights, because our attention spans can only handle so much, and jumping between many changes in a single PR is confusing. In my experience, smaller PRs get better feedback and lead to more meaningful design conversations.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Faster Reviews:&lt;/strong&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;At this point, I’m thinking of adding my PR to my will in case it’s still in review after I’m gone.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Larger PRs require a significant time commitment from reviewers, making them less likely to get attention, especially if they are not tied to high-impact features. Small PRs, on the other hand, are reviewed quickly, as they demand less of the reviewer’s time and are less intrusive. This review speed can be crucial for meeting project deadlines; I have seen projects get delayed because senior reviewers could not allocate time for massive PRs (this can happen with small PRs too, but the risk is inherently higher with large ones).&lt;/p&gt;

&lt;h3&gt;
  
  
&lt;strong&gt;Less Rework:&lt;/strong&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Reworking a big PR after design changes is like rearranging deck chairs on a ship you just finished building… and then sinking it.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We have all experienced situations where someone realizes during PR review that a different design would have been more maintainable and future-proof, and we then have to spend additional time reworking the PR around the new design (even if they had been fully on board with the design that was initially implemented :p). This is natural: some things only become clear once you see them written in code, and you start noticing aspects you missed during the design phase. With large PRs this is a significant problem because a lot of work must be redone, but with smaller PRs changes are easier to make. More importantly, there is a lower chance of rework in the first place, because reviewers are more likely to identify issues early and address them in the initial PRs, allowing subsequent PRs to build on the revised design.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Better Testing:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Smaller PRs also benefit you as the author of the PR. They help in testing smaller changes incrementally rather than testing the whole project in one go. Testing smaller changes results in more exhaustive testing of each component of the system, thus leading to fewer production bugs. This applies to both automated tests and manual testing performed by you or dedicated QA engineers.&lt;/p&gt;

&lt;p&gt;Moreover, smaller PRs reduce the likelihood of missed test cases, as you can focus on a limited scope rather than the entire system.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Writing tests? That sounds like Future Me’s problem.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I have seen developers (including myself in the past) hesitate to write automated tests due to the perceived time investment without immediate, “visible” value to the feature or product. Smaller PRs reduce this friction by limiting the number of tests required and the time spent writing them.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Easier Debugging:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;No matter how thorough your testing is, production bugs will happen! Being able to debug them quickly is crucial, since production bugs directly impact users, the business, or both. With large PRs, the surface area of change is large, making it time-consuming and difficult to find the root cause of issues. Smaller PRs contain less code and thus make debugging much faster.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Debugging small changes is like finding a typo; debugging big changes is like proofreading an encyclopedia.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Phased Feature Deployment:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Last but not least, smaller PRs also help your product manager and users. With small PRs, you can continuously push parts of the system to production, which lets you obtain early feedback from users and make early course corrections if needed.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Skipping early feedback is like cooking a five-course meal without tasting anything — you’re just hoping it’s not a disaster.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;The benefits of small PRs are numerous, and the points outlined above are among the most impactful I have experienced personally. If you have encountered other advantages of small PRs, or challenges with larger ones that I have not covered, please share them in the comments.&lt;/p&gt;

&lt;p&gt;I hope this article motivates you to embrace smaller PRs. If you’re already on board, I hope it reinforces the value of this practice.&lt;/p&gt;

&lt;p&gt;Thanks for reading. Until next time, keep coding and stay curious!&lt;/p&gt;

</description>
      <category>git</category>
      <category>softwaredevelopment</category>
      <category>collaboration</category>
    </item>
    <item>
      <title>Learnings from a 5-hour production downtime!</title>
      <dc:creator>Garvit Gupta</dc:creator>
      <pubDate>Sat, 02 Mar 2024 13:48:28 +0000</pubDate>
      <link>https://dev.to/garvit_gupta/learnings-from-a-5-hour-production-downtime-22am</link>
      <guid>https://dev.to/garvit_gupta/learnings-from-a-5-hour-production-downtime-22am</guid>
      <description>&lt;p&gt;As with all the incidents, it happened on a Friday evening!&lt;/p&gt;

&lt;p&gt;In this article, I’ll delve into the causes of a recent 5-hour downtime in one of our critical production services, and into why recovery took so long.&lt;/p&gt;

&lt;p&gt;The affected service, a Node.js application, manages data transactions with PostgreSQL, sustaining peak loads of 250K requests per minute. Our server infrastructure is orchestrated via Kubernetes, with &lt;a href="https://aws.amazon.com/free/database/"&gt;AWS RDS&lt;/a&gt; serving as the backend database.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Beginning (5 PM)
&lt;/h2&gt;

&lt;p&gt;The problem started around 5 PM, when we began receiving unusually high traffic on the servers, 3–4 times the normal volume. Under this load the database server started degrading, and within 15 minutes it had degraded so much that it was barely able to process any queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  First Response (5:20 PM)
&lt;/h2&gt;

&lt;p&gt;We investigated possible causes for the traffic surge, such as a marketing campaign, but found nothing conclusive and the traffic was still increasing. To manage the traffic and allow the database to recover, we implemented temporary rate-limiting rules on our firewall. This resulted in a decrease in traffic and signs of database recovery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Second Attack (5:45 PM)
&lt;/h2&gt;

&lt;p&gt;Just as we believed the incident had concluded, the RDS console flashed ‘Storage Full.’ The database had exhausted its storage capacity, rendering it unable to process any new requests. Knowing that AWS allows easy storage expansion, we promptly tried to increase the storage capacity. To our surprise, we got an error saying the storage could not be increased. After multiple unsuccessful attempts, we found that in AWS, the storage of an RDS server cannot be increased more than once in 6 hours (&lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PIOPS.StorageTypes.html#:~:text=Note-,Storage%20optimization%20can%20take%20several%20hours.%20You%20can%27t%20make%20further%20storage%20modifications%20for%20either%20six%20(6)%20hours%20or%20until%20storage%20optimization%20has%20completed%20on%20the%20instance%2C%20whichever%20is%20longer.,-You%20can%20view"&gt;AWS reference&lt;/a&gt;).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Storage optimization can take several hours. You can’t make further storage modifications for either six (6) hours or until storage optimization has completed on the instance, whichever is longer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But we recalled that we hadn’t increased the storage in the last 6 hours. Then who did?&lt;/p&gt;

&lt;h2&gt;
  
  
  Hidden Attack (5:30 PM)
&lt;/h2&gt;

&lt;p&gt;In AWS RDS, you can configure auto-scaling for storage, allowing an automatic increase in storage when it reaches near capacity. Our database had auto-scaling configured. By 5:30 PM, the surge in traffic had already pushed the database storage to its scale-up threshold, triggering an automatic scale-up. This meant that we would not be able to increase the storage for the next 6 hours!&lt;/p&gt;

&lt;h2&gt;
  
  
  Darkness
&lt;/h2&gt;

&lt;p&gt;We couldn’t afford to wait six hours to increase storage, because the 5–10 PM window sees the highest traffic. Given the critical nature of this service, any delay would severely impact user experience and business operations. We considered restoring a backup onto a new RDS server and decommissioning the current one. However, since the last backup had been taken 3 hours earlier, this solution would cost us 3 hours of data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ray of Hope (6:30 PM)
&lt;/h2&gt;

&lt;p&gt;After consulting with the service owners, we concluded that losing 3 hours of data was acceptable: the nature of the service was such that once it was back online, any lost data would be recreated. So we started preparing for a point-in-time recovery of the database. We provisioned a new RDS server mirroring the current configuration, but with expanded storage, and initiated the backup restoration process. Based on previous experience, we estimated the restoration would take approximately 20–30 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Darkness once again (7:15 PM)
&lt;/h2&gt;

&lt;p&gt;Even after 45 minutes, the restoration was not complete. We started investigating why it was taking so long (there is no progress bar during restoration, so we didn’t know whether it would be done in 10 minutes or take 10 more hours). We discovered that the server’s CPU usage was almost 100%, likely causing the slowdown. However, increasing CPU capacity wasn’t feasible, as it required changing the RDS instance type, which couldn’t be done while the restoration was in progress.&lt;/p&gt;

&lt;h2&gt;
  
  
  Back to square one (8:00 PM)
&lt;/h2&gt;

&lt;p&gt;After waiting 45 more minutes, we decided to increase the CPU. The only way to do so was to create a new server and initiate the backup restoration again. We kept the current server, on which the CPU had become a bottleneck for restoration, and simultaneously started restoration on a new server with 3 times the CPU, planning to use whichever finished first. On the new server, the CPU was no longer a bottleneck, stabilising at 50–60%.&lt;/p&gt;

&lt;h2&gt;
  
  
  Still Not Done (9:00 PM)
&lt;/h2&gt;

&lt;p&gt;Even after an hour, the backup process continued on both servers. Concerned about other potential bottlenecks, we began checking metrics on the new server. It turned out that this time IOPS was the bottleneck (IOPS measures the disk I/O operations that can be performed per second; I/O requests beyond the provisioned threshold are throttled). We paled at the thought of having to restart the recovery process from scratch, once again!&lt;/p&gt;

&lt;h2&gt;
  
  
  The Last Stand
&lt;/h2&gt;

&lt;p&gt;Fortunately, AWS allows increasing the IOPS during the backup restoration process, and doubling the IOPS resolved the bottleneck. By 10 PM we had successfully completed the restoration and updated the service configuration to connect to the new server; by 10:15 the service had stabilised and was handling traffic as usual. The next day, we scaled back all the resources we had over-provisioned during restoration.&lt;/p&gt;




&lt;h2&gt;
  
  
  Learnings
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Maintain adequate free storage on database servers. Before the incident, our database server was already at 90% storage utilisation; proactively increasing it would have avoided the storage bottleneck during the incident.&lt;/li&gt;
&lt;li&gt;While restoring a backup, database server resources should be over-provisioned to avoid bottlenecks.&lt;/li&gt;
&lt;li&gt;The rate limiting we implemented reactively during the incident should have been in place proactively, to absorb sudden traffic spikes before they impact the servers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thank you for reading. Until next time, happy reading!&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>aws</category>
      <category>rds</category>
    </item>
    <item>
      <title>Index-Only Scan in Postgresql is not always Index “Only”!</title>
      <dc:creator>Garvit Gupta</dc:creator>
      <pubDate>Tue, 22 Aug 2023 15:10:08 +0000</pubDate>
      <link>https://dev.to/garvit_gupta/index-only-scan-in-postgresql-is-not-always-index-only-ogo</link>
      <guid>https://dev.to/garvit_gupta/index-only-scan-in-postgresql-is-not-always-index-only-ogo</guid>
      <description>&lt;p&gt;An Index-only scan is supposed to return query results just by accessing the index but in Postgresql, an index-only scan can end up accessing table rows (heap memory) as well, which might result in the query taking more time (or other resources) than anticipated. In this blog, I will discuss how we discovered this behavior of Postgresql and how we solved this for our use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem:
&lt;/h2&gt;

&lt;p&gt;Some time back, we optimized a high IO-consuming read query (&lt;a href="https://dev.to/garvit_gupta/learnings-from-a-slow-query-analysis-in-postgresql-4b97"&gt;detailed blog&lt;/a&gt;). The optimization was to create appropriate indexes so that the query could be resolved using an index-only scan, removing the need to read table rows and thereby reducing the IOPS (input/output operations per second) consumed by the query.&lt;/p&gt;

&lt;p&gt;But a few weeks down the line, we again started observing a gradual increase in the IOPS consumed by the query. The query plan showed it was still using an index-only scan, yet the query was also doing a lot of disk access and touching heap memory. It was not intuitive to us why an index-only scan would access heap memory. On further debugging and going through the PostgreSQL docs, we discovered that there are cases where an index-only scan can end up accessing heap memory. The image below shows the gradual increase in read IOPS that we faced:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m6b2zQ83--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d6o84qrabb0wow9adv7h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m6b2zQ83--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d6o84qrabb0wow9adv7h.png" alt="Gradual Increase in Read IOPs" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reason:
&lt;/h2&gt;

&lt;p&gt;This happens because of how concurrency control works in PostgreSQL. To quote the official docs:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;There is an additional requirement for any table scan in PostgreSQL: it must verify that each retrieved row be “visible” to the query’s MVCC snapshot. Visibility information is not stored in index entries, only in heap entries; so at first glance it would seem that every row retrieval would require a heap access anyway. And this is indeed the case, if the table row has been modified recently.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In short, the query engine needs to access the table row to determine whether the row is visible to the current transaction. But does that mean an index-only scan always has to access heap memory? Definitely not; continuing from the official docs:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For seldom-changing data there is a way around this problem. PostgreSQL tracks, for each page in a table’s heap, whether all rows stored in that page are old enough to be visible to all current and future transactions. This information is stored in a bit in the table’s visibility map. An index-only scan, after finding a candidate index entry, checks the visibility map bit for the corresponding heap page. If it’s set, the row is known visible and so the data can be returned with no further work. If it’s not set, the heap entry must be visited to find out whether it’s visible, so no performance advantage is gained over a standard index scan. Even in the successful case, this approach trades visibility map accesses for heap accesses; but since the visibility map is four orders of magnitude smaller than the heap it describes, far less physical I/O is needed to access it. In most situations the visibility map remains cached in memory all the time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To describe a visibility map briefly from the docs:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Each heap relation has a Visibility Map (VM) to keep track of which pages contain only tuples that are known to be visible to all active transactions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So if the visibility map is up to date, the query engine need not access the table rows, and the query can be resolved using just the index and the visibility map.&lt;/p&gt;
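&lt;p&gt;The state of the visibility map can be inspected directly with the &lt;code&gt;pg_visibility&lt;/code&gt; contrib module. A minimal sketch, assuming the module is installed and using an illustrative table name &lt;code&gt;test&lt;/code&gt;:&lt;/p&gt;

```sql
-- pg_visibility is a contrib module shipped with PostgreSQL
CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- Count how many of the table's heap pages are all-visible;
-- the higher the fraction, the fewer heap accesses an index-only scan needs
SELECT count(*) FILTER (WHERE all_visible) AS all_visible_pages,
       count(*) AS total_pages
FROM pg_visibility_map('test');
```

If most pages are not all-visible, an "index-only" scan will fall back to heap lookups for the corresponding rows.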

&lt;p&gt;I will digress slightly here to talk about vacuuming in PostgreSQL (it is needed to understand the remainder of this blog). In PostgreSQL, an UPDATE or DELETE of a row does not immediately remove the old version of the row; a new version is simply created and stored. The old version keeps using disk space; these old versions of rows are called dead tuples. The vacuum process in PostgreSQL removes dead tuples and marks the space as available for future reuse. Vacuum does other things as well, but removing dead tuples is one of its most important tasks. Vacuum can be triggered manually, and it also runs automatically based on dead-tuple thresholds: whenever the number of dead tuples exceeds the threshold, a vacuum is triggered automatically.&lt;/p&gt;

&lt;p&gt;This much understanding of vacuum is enough for the remainder of this blog, but if you are curious you can find more details in the &lt;a href="https://www.postgresql.org/docs/current/sql-vacuum.html"&gt;official doc&lt;/a&gt;.&lt;/p&gt;
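&lt;p&gt;Dead-tuple buildup and vacuum activity are easy to observe from the statistics views. A minimal sketch (the table name &lt;code&gt;test&lt;/code&gt; is illustrative):&lt;/p&gt;

```sql
-- Check dead tuples and when (auto)vacuum last ran on a table
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'test';

-- Trigger a vacuum manually; this also refreshes the visibility map
VACUUM (VERBOSE) test;
```

A large `n_dead_tup` combined with an old `last_autovacuum` timestamp is exactly the situation that leaves the visibility map stale.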

&lt;p&gt;Coming back to our problem: we now know that an index-only scan uses the visibility map to avoid accessing heap memory, but how is the visibility map kept up to date? It is updated automatically whenever vacuum runs. So why did we still see an increase in IOPS? The reason is that the table in question was huge (~200 million records), and auto-vacuum was not running on it frequently enough, resulting in an outdated visibility map. In fact, close to 2 months passed between subsequent auto-vacuum runs, because we were using the default auto-vacuum thresholds, which trigger auto-vacuum only once at least 20% of the rows are dead tuples OR 20% of the rows are newly inserted since the last vacuum; on a 200-million-row table, that means roughly 40 million dead tuples.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution:
&lt;/h2&gt;

&lt;p&gt;We changed the thresholds so that auto-vacuum runs frequently and the visibility map stays updated: we set them to a constant value of 100k dead rows OR 100k newly inserted rows. Auto-vacuum now triggers daily, and the IOPS consumed by the query &lt;strong&gt;have come down by 6 times!&lt;/strong&gt;&lt;/p&gt;
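&lt;p&gt;Such thresholds can be set per table through storage parameters. A sketch of what this change could look like (table name illustrative; the exact parameters we used are not shown in this post, and the insert-based thresholds require PostgreSQL 13+):&lt;/p&gt;

```sql
-- Trigger autovacuum after a fixed row count instead of a fraction of the table.
-- Setting the scale factors to 0 makes the constant thresholds the only trigger.
ALTER TABLE test SET (
  autovacuum_vacuum_threshold = 100000,
  autovacuum_vacuum_scale_factor = 0,
  autovacuum_vacuum_insert_threshold = 100000,  -- PostgreSQL 13+
  autovacuum_vacuum_insert_scale_factor = 0
);
```

For very large tables, constant thresholds like these keep vacuum frequency stable as the table grows, instead of letting the trigger point scale with row count.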

&lt;h2&gt;
  
  
  Conclusion:
&lt;/h2&gt;

&lt;p&gt;When optimizing a query for an index-only scan in PostgreSQL, we also need to check that auto-vacuum runs on the table frequently enough to keep the visibility map updated. The visibility map is what determines whether table rows must be read during an index-only scan; if it is up to date, then in the majority of cases the query can be resolved just by reading the index.&lt;/p&gt;

&lt;h3&gt;
  
  
  References:
&lt;/h3&gt;

&lt;p&gt;Postgresql Concurrency Control: &lt;a href="https://www.postgresql.org/docs/current/mvcc.html"&gt;https://www.postgresql.org/docs/current/mvcc.html&lt;/a&gt;&lt;br&gt;
Visibility Map: &lt;a href="https://www.postgresql.org/docs/current/storage-vm.html"&gt;https://www.postgresql.org/docs/current/storage-vm.html&lt;/a&gt;&lt;br&gt;
How index-only scan uses visibility map: &lt;a href="https://www.postgresql.org/docs/current/indexes-index-only-scans.html#:%7E:text=But%20there%20is,all%20the%20time"&gt;https://www.postgresql.org/docs/current/indexes-index-only-scans.html#:~:text=But%20there%20is,all%20the%20time&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Learnings from a slow query analysis in PostgreSQL</title>
      <dc:creator>Garvit Gupta</dc:creator>
      <pubDate>Sat, 11 Mar 2023 14:04:09 +0000</pubDate>
      <link>https://dev.to/garvit_gupta/learnings-from-a-slow-query-analysis-in-postgresql-4b97</link>
      <guid>https://dev.to/garvit_gupta/learnings-from-a-slow-query-analysis-in-postgresql-4b97</guid>
      <description>&lt;p&gt;Recently, I encountered a slow PostgreSQL query in my work. The query was taking a lot of time and using a lot of I/O, even though it was relatively straightforward. The reason for its slowness was not immediately apparent. However, after some analysis, we were able to optimise the query and reduce its worst-case runtime by ~1000 times. In this article, I will share my learnings.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For this article, all queries were run on PostgreSQL 14 on my laptop. Query times may differ on different hardware, but the relative latency improvements from the optimisations below will be similar.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Query to be optimised:
&lt;/h3&gt;

&lt;p&gt;It is a simple query that has a &lt;code&gt;where&lt;/code&gt; clause on two columns and selects a single third column as output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;col_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;col_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;41&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Let’s seed some data first:
&lt;/h3&gt;

&lt;p&gt;The actual production table contained around 150 million records; for our analysis, 50 million rows will be sufficient:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
 &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;col_1&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;col_2&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Randomly insert 50 million rows&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col_2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50000000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Check data distribution &lt;/span&gt;
&lt;span class="c1"&gt;-- We will use col_1 = 13 and col_2 = 41 in our queries&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;col_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col_2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col_2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;Output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="n"&gt;col_1&lt;/span&gt;  &lt;span class="n"&gt;col_2&lt;/span&gt;   &lt;span class="n"&gt;cnt&lt;/span&gt;
&lt;span class="mi"&gt;13&lt;/span&gt;     &lt;span class="mi"&gt;41&lt;/span&gt;      &lt;span class="mi"&gt;20524&lt;/span&gt;
&lt;span class="mi"&gt;22&lt;/span&gt;     &lt;span class="mi"&gt;1&lt;/span&gt;       &lt;span class="mi"&gt;20508&lt;/span&gt;
&lt;span class="mi"&gt;25&lt;/span&gt;     &lt;span class="mi"&gt;14&lt;/span&gt;      &lt;span class="mi"&gt;20412&lt;/span&gt;
&lt;span class="mi"&gt;41&lt;/span&gt;     &lt;span class="mi"&gt;2&lt;/span&gt;       &lt;span class="mi"&gt;20384&lt;/span&gt;
&lt;span class="mi"&gt;34&lt;/span&gt;     &lt;span class="mi"&gt;45&lt;/span&gt;      &lt;span class="mi"&gt;20378&lt;/span&gt;
&lt;span class="mi"&gt;6&lt;/span&gt;      &lt;span class="mi"&gt;8&lt;/span&gt;       &lt;span class="mi"&gt;20377&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;All the query plans exclude the network time taken by data to travel to and from the database server.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Let’s Start Analysing and Optimising the query:
&lt;/h3&gt;

&lt;p&gt;Let’s first run the query without doing anything and see its query plan:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9CRtn09H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rvytcutp4n65ej7jwgtc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9CRtn09H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rvytcutp4n65ej7jwgtc.png" alt="SQL Query Plan" width="880" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This query plan contains the “actual steps” taken by the database engine while executing the query.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There is a lot of information in the above query plan, but we will focus on only a few things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The total query execution time is ~6 seconds, which is huge.&lt;/li&gt;
&lt;li&gt;The query returns 20,107 rows in total.&lt;/li&gt;
&lt;li&gt;To match the condition &lt;code&gt;((col_1 = 13) AND (col_2 = 41))&lt;/code&gt;, the database engine used a “Sequential Scan”.

&lt;ul&gt;
&lt;li&gt;In a sequential scan, the engine goes through every row of the table one by one and discards the rows that do not match the query condition. That means every row must be visited at least once.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;All 50 million rows were scanned, but more than 49 million of them were discarded. This is the major inefficiency visible in the plan.&lt;/li&gt;
&lt;/ol&gt;
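
&lt;p&gt;As a side note, a plan like the one above comes from &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt;. A minimal sketch, assuming the table and query used throughout this post:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- ANALYZE actually runs the query; BUFFERS adds buffer read/hit counts per plan node
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM test
WHERE col_1 = 13 AND col_2 = 41;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;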

&lt;p&gt;We need to reduce the number of rows scanned during query execution. The general solution is to create an index on the columns used in the &lt;code&gt;WHERE&lt;/code&gt; clause. So let’s create an index on &lt;code&gt;col_1&lt;/code&gt; and &lt;code&gt;col_2&lt;/code&gt; and re-analyse the query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;concurrently&lt;/span&gt; &lt;span class="n"&gt;test_col_1_col_2_idx&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col_2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
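
&lt;p&gt;Since &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt; can leave an invalid index behind if it fails midway, it is worth confirming the index was built successfully before re-running the query. One way to check, as a sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- An invalid index still appears in the catalog, so check indisvalid explicitly
SELECT c.relname, i.indisvalid
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE c.relname = 'test_col_1_col_2_idx';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;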



&lt;p&gt;Re-analyse query:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--p1DNvNfg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6eugl2wph88tag3dsmvd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--p1DNvNfg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6eugl2wph88tag3dsmvd.png" alt="SQL Query Plan" width="880" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Now the engine decided to do an “Index Scan” using the index that we created in the previous step.&lt;/li&gt;
&lt;li&gt;The query time reduced significantly, but it is still high (~1 sec) for the amount of data we are working with.&lt;/li&gt;
&lt;li&gt;After the index scan, a large number of reads (19,899) are happening, which results in the high query time. The reason for the high number of reads:

&lt;ul&gt;
&lt;li&gt;We created the index on &lt;code&gt;col_1&lt;/code&gt; and &lt;code&gt;col_2&lt;/code&gt; but we need &lt;code&gt;id&lt;/code&gt; in the query response.&lt;/li&gt;
&lt;li&gt;Using the index, the engine could find the location of the matching rows without scanning the whole table. But to actually return the &lt;code&gt;id&lt;/code&gt;, each row had to be fetched from the table using the address stored in the index (because the index only contains information about &lt;code&gt;col_1&lt;/code&gt; and &lt;code&gt;col_2&lt;/code&gt;, not about &lt;code&gt;id&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;That’s why the number of reads is almost the same as the number of rows returned by the query.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To avoid this large number of reads, we need to stop fetching every matching row from the table.&lt;/p&gt;

&lt;p&gt;Postgres allows additional columns to be stored with every entry of an index using the &lt;code&gt;INCLUDE&lt;/code&gt; clause. These columns are not part of the index key, so a &lt;code&gt;WHERE&lt;/code&gt; condition on them will not use the index.&lt;br&gt;
Reference Docs: &lt;a href="https://www.postgresql.org/docs/current/sql-createindex.html#:~:text=EXISTS%20is%20specified.-,INCLUDE,-The%20optional%20INCLUDE"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s try &lt;em&gt;including&lt;/em&gt; the &lt;code&gt;id&lt;/code&gt; column in the index and see if it improves the performance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Drop previously created Index&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;test_col_1_col_2_idx&lt;/span&gt;

&lt;span class="c1"&gt;-- Create new index that includes ID&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;test_col_1_col_2_idx_2&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col_2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;INCLUDE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rerun the query:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8w83RMVs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/plaj6e8sfgdq7fizrwqc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8w83RMVs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/plaj6e8sfgdq7fizrwqc.png" alt="SQL Query Plan" width="880" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;After including &lt;code&gt;id&lt;/code&gt; in the index, the query time reduced drastically to just ~7 ms.&lt;/li&gt;
&lt;li&gt;Number of reads also reduced from ~19K to just ~110.&lt;/li&gt;
&lt;li&gt;The engine switched from an “Index Scan” to an “Index Only Scan”, meaning that scanning the index alone was sufficient to execute the complete query; there was no need to access the table rows at all.&lt;/li&gt;
&lt;/ol&gt;
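
&lt;p&gt;One caveat: an index-only scan can skip the table only for pages that the visibility map marks as all-visible, and that map is maintained by vacuum. On a heavily written table, the plan may still show a non-zero “Heap Fetches” count until a vacuum runs. A sketch, assuming the same &lt;code&gt;test&lt;/code&gt; table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Refresh the visibility map so index-only scans can avoid heap fetches
VACUUM test;

-- Re-check the plan; the "Heap Fetches" count should now be at or near zero
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM test
WHERE col_1 = 13 AND col_2 = 41;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;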

&lt;h3&gt;
  
  
  A few points to note:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Avoid including too many extra columns in the index, as they will bloat the index and increase its size significantly.&lt;/li&gt;
&lt;li&gt;The columns added using the &lt;code&gt;INCLUDE&lt;/code&gt; clause are not indexed, so a &lt;code&gt;WHERE&lt;/code&gt; condition on these columns will not use the index.&lt;/li&gt;
&lt;li&gt;This approach helps only when the query returns a large number of rows. If the query returns just a few rows, say 1 or 2, there won’t be any significant performance gain.&lt;/li&gt;
&lt;/ul&gt;
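
&lt;p&gt;To keep an eye on the index bloat mentioned above, the on-disk size of an index can be checked directly. A sketch, using the index name created earlier in this post:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Reports the on-disk size of the covering index in human-readable form
SELECT pg_size_pretty(pg_relation_size('test_col_1_col_2_idx_2'));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;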

&lt;h3&gt;
  
  
  Conclusion:
&lt;/h3&gt;

&lt;p&gt;When creating an index for a &lt;code&gt;SELECT&lt;/code&gt; query with a &lt;code&gt;WHERE&lt;/code&gt; clause, consider not just the columns in the &lt;code&gt;WHERE&lt;/code&gt; clause but also the columns the query selects as its final output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Additional reads that helped me in the analysis:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Good explanation of how to read query plan: &lt;a href="https://www.postgresql.org/docs/current/using-explain.html#USING-EXPLAIN-ANALYZE"&gt;https://www.postgresql.org/docs/current/using-explain.html#USING-EXPLAIN-ANALYZE&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;How to read the buffers output: &lt;a href="https://www.postgresql.org/docs/current/sql-explain.html#:%7E:text=Include"&gt;link&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Actual time vs cost in the output of &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt;, and why they should not be compared: &lt;a href="https://www.postgresql.org/docs/current/using-explain.html#USING-EXPLAIN-ANALYZE:%7E:text=Note"&gt;link&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Data transmission costs are not included in the output of &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt;: &lt;a href="https://www.postgresql.org/docs/current/using-explain.html#USING-EXPLAIN-ANALYZE:%7E:text=First%2C"&gt;link&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Index-only scans: &lt;a href="https://www.postgresql.org/docs/current/indexes-index-only-scans.html"&gt;https://www.postgresql.org/docs/current/indexes-index-only-scans.html&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Docs for pg_buffercache extension (helps in seeing the contents of shared buffer cache): &lt;a href="https://www.postgresql.org/docs/current/pgbuffercache.html"&gt;https://www.postgresql.org/docs/current/pgbuffercache.html&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Vacuum: &lt;a href="https://www.postgresql.org/docs/current/sql-vacuum.html"&gt;https://www.postgresql.org/docs/current/sql-vacuum.html&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Auto-vacuum: &lt;a href="https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM"&gt;https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Visibility Map: &lt;a href="https://www.postgresql.org/docs/current/storage-vm.html"&gt;https://www.postgresql.org/docs/current/storage-vm.html&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;How an index-only scan uses the visibility map: &lt;a href="https://www.postgresql.org/docs/current/indexes-index-only-scans.html#:%7E:text=But"&gt;link&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>postgres</category>
      <category>sql</category>
      <category>database</category>
    </item>
  </channel>
</rss>
