<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: CoinGecko Engineering</title>
    <description>The latest articles on DEV Community by CoinGecko Engineering (@coingecko).</description>
    <link>https://dev.to/coingecko</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F6409%2F79beaf34-1b0d-4323-92be-90965afae54e.png</url>
      <title>DEV Community: CoinGecko Engineering</title>
      <link>https://dev.to/coingecko</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/coingecko"/>
    <language>en</language>
    <item>
      <title>Scaling PostgreSQL Performance with Table Partitioning</title>
      <dc:creator>Amree Zaid</dc:creator>
      <pubDate>Fri, 13 Jun 2025 01:25:06 +0000</pubDate>
      <link>https://dev.to/coingecko/scaling-postgresql-performance-with-table-partitioning-136o</link>
      <guid>https://dev.to/coingecko/scaling-postgresql-performance-with-table-partitioning-136o</guid>
      <description>&lt;p&gt;Table of contents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The background&lt;/li&gt;
&lt;li&gt;The investigation&lt;/li&gt;
&lt;li&gt;The execution&lt;/li&gt;
&lt;li&gt;The result&lt;/li&gt;
&lt;li&gt;What we would do differently&lt;/li&gt;
&lt;li&gt;Summary&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The background&lt;/h2&gt;

&lt;p&gt;At CoinGecko, we have multiple tables that store crypto prices for various purposes. After accumulating over 8 years of data, one of them, a table storing hourly data, grew past 1TB, to the point where queries took over 30 seconds on average.&lt;/p&gt;

&lt;p&gt;We started to see higher IOPS usage whenever more requests hit the price endpoints. Request queues grew and our Apdex score started to drop. As a short-term fix, we increased the provisioned IOPS to 24K. However, the limit kept being breached, causing alerts every day.&lt;/p&gt;

&lt;p&gt;To ensure this situation wouldn’t affect our SLO, and eventually our SLA, we started looking into what we could do to improve it.&lt;/p&gt;

&lt;p&gt;For context, we use PostgreSQL on Amazon RDS as our main database.&lt;/p&gt;

&lt;h2&gt;The investigation&lt;/h2&gt;

&lt;p&gt;Adding indexes was initially considered the quickest solution, but this approach was unsuccessful. The query utilizes a JSONB column with keys based on supported currencies, presenting an additional challenge. Indexing different keys for various applications was deemed excessive, as an added index might only benefit a single application.&lt;/p&gt;
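&lt;p&gt;For illustration, indexing a single currency key would require one expression index per key; the table and column names below are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical sketch: index only the "usd" key of the JSONB column.
-- Every currency an application filters on would need its own index
-- like this one, which is why we ruled the approach out.
CREATE INDEX idx_prices_usd ON prices (((price_data-&gt;&gt;'usd')::numeric));
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;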

&lt;p&gt;Ultimately, table partitioning was chosen as the solution most likely to yield the greatest returns, despite its complexity.&lt;/p&gt;

&lt;h3&gt;What is table partitioning?&lt;/h3&gt;

&lt;p&gt;Table partitioning involves dividing a large table into smaller, more manageable pieces called partitions. These partitions share the same logical structure as the original table but are physically stored as separate tables.&lt;/p&gt;

&lt;p&gt;This allows queries to operate on only the relevant partitions, improving performance by reducing the amount of data scanned.&lt;/p&gt;

&lt;p&gt;PostgreSQL supports three partitioning methods:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Range:&lt;/strong&gt; Partitions data based on a range of values (e.g., dates, numerical ranges).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;List:&lt;/strong&gt; Partitions data based on specific list values (e.g., countries, categories).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hash:&lt;/strong&gt; Partitions data by applying a hash function to a column's value, distributing data evenly across partitions.&lt;/li&gt;
&lt;/ol&gt;
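&lt;p&gt;As a rough sketch (table and column names are made up for illustration), the list and hash methods are declared like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- List: one partition per explicit set of values
CREATE TABLE events (id SERIAL, country TEXT) PARTITION BY LIST (country);
CREATE TABLE events_sea PARTITION OF events FOR VALUES IN ('SG', 'MY');

-- Hash: rows are spread evenly by hashing the key
CREATE TABLE visits (id SERIAL, user_id INTEGER) PARTITION BY HASH (user_id);
CREATE TABLE visits_p0 PARTITION OF visits
  FOR VALUES WITH (MODULUS 4, REMAINDER 0);
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;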

&lt;p&gt;The question is, which partitioning method should we use? As usual, the answer is: it depends. We need to analyze our query patterns to determine which method will provide the greatest benefit. The key is to select the method that minimizes the amount of data our queries need to read each time they run.&lt;/p&gt;

&lt;p&gt;In our case, range partitioning is the optimal choice, because almost all of our queries on this table include a timestamp range in the &lt;code&gt;WHERE&lt;/code&gt; clause. Moreover, we know that we generally only require data for a few months at a time, four at most. As a result, partitioning the table by month guarantees that our queries access at most four partitions (most of the time).&lt;/p&gt;

&lt;p&gt;If, for some reason, our queries were not limiting reads by timestamp, we might need to use the hash method instead, limiting reads by a foreign key. Again, it depends on the use case.&lt;/p&gt;

&lt;p&gt;What would the code look like?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;RANGE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Create partitions for different amount ranges&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders_small&lt;/span&gt;
   &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
   &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders_medium&lt;/span&gt;
   &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
   &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders_large&lt;/span&gt;
   &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
   &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders_extra_large&lt;/span&gt;
   &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
   &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MAXVALUE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Insert sample data&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;
   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-15'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;    &lt;span class="c1"&gt;-- Goes to orders_small&lt;/span&gt;
   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-15'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;   &lt;span class="c1"&gt;-- Goes to orders_medium&lt;/span&gt;
   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-15'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;   &lt;span class="c1"&gt;-- Goes to orders_large&lt;/span&gt;
   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-15'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1200&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;-- Goes to orders_extra_large&lt;/span&gt;
   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-15'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;    &lt;span class="c1"&gt;-- Goes to orders_small&lt;/span&gt;
   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-15'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;450&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;   &lt;span class="c1"&gt;-- Goes to orders_medium&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In my case, I simply name each partition using the format &lt;code&gt;table_YYYYMM&lt;/code&gt;.&lt;/p&gt;
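&lt;p&gt;A minimal sketch of that monthly scheme (the column names here are simplified assumptions, not our actual schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Partition by month on the timestamp used in our WHERE clauses
CREATE TABLE prices_partitioned (
   id BIGSERIAL,
   coin_id INTEGER,
   price_data JSONB,
   created_at TIMESTAMPTZ NOT NULL,
   PRIMARY KEY (id, created_at)  -- partition key must be part of the PK
) PARTITION BY RANGE (created_at);

-- One partition per month, named table_YYYYMM
CREATE TABLE prices_202401 PARTITION OF prices_partitioned
   FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE prices_202402 PARTITION OF prices_partitioned
   FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;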

&lt;h3&gt;What did I learn during the investigation?&lt;/h3&gt;

&lt;p&gt;If someone had to do this again, this is the information I would pass on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We can’t convert an unpartitioned table into a partitioned one in place. We need to create a new table and copy the data into it before making the switch.&lt;/li&gt;
&lt;li&gt;The partitioned table needs to have the partition key as part of the primary key. If &lt;code&gt;id&lt;/code&gt; was the original primary key, the new table needs a composite primary key.&lt;/li&gt;
&lt;li&gt;To use partitioned tables in Ruby on Rails, we need to change the schema format from &lt;code&gt;schema.rb&lt;/code&gt; to SQL (&lt;code&gt;structure.sql&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;We need to figure out how to keep both tables running at the same time if we can’t afford downtime.&lt;/li&gt;
&lt;li&gt;Since we will be creating a new table, we have to be careful about the cache. The new table starts with nothing in PostgreSQL’s buffer cache, so its performance will initially be very poor. We have to figure out how to “warm up” the new table.&lt;/li&gt;
&lt;li&gt;To warm up the table, learn about &lt;code&gt;pg_prewarm&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Copying as much as 1.2TB of data requires extra resources, such as IOPS. We need to take that into account.&lt;/li&gt;
&lt;/ol&gt;
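&lt;p&gt;For point 6, a minimal &lt;code&gt;pg_prewarm&lt;/code&gt; sketch; the partition and index names here are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Load a partition into the shared buffer cache; indexes must be
-- prewarmed separately by name.
SELECT pg_prewarm('prices_202401');
SELECT pg_prewarm('prices_202401_pkey');
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;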

&lt;p&gt;With all the information we had, we created a Release Plan document outlining what would happen and when. We used that document as the main reference point for everyone. It contains these sections:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Date:&lt;/strong&gt; When it is going to happen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prerequisites:&lt;/strong&gt; Tasks that must be completed before executing the todos, listed here to ensure we don’t start the todos without finishing them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risks:&lt;/strong&gt; For every risk, what could happen and the mitigation plan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Todos:&lt;/strong&gt; What needs to be done; once an item is done, we tick it off the list.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The execution&lt;/h2&gt;

&lt;h3&gt;Dry Run&lt;/h3&gt;

&lt;p&gt;Our customers are very particular about the uptime of our services, so we proactively conduct dry runs to safeguard our SLO/SLA commitments. Before starting a dry run, we list what we want to do and which statistics we want to collect.&lt;/p&gt;

&lt;p&gt;For this project, we spun up another database identical to our production one. Then, we ran all the commands and scripts that we would later run on the production instance. In our case, we were looking for this data:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Before and after query performance.&lt;/li&gt;
&lt;li&gt;How long will it actually take to copy the data?&lt;/li&gt;
&lt;li&gt;How long does it take to warm up the table partitions?&lt;/li&gt;
&lt;li&gt;What does the CPU and IOPS look like for every action that we did?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Avoid operating in the dark. Be prepared so you won’t miss your objectives. Remember, running a database similar to production costs a lot of money.&lt;/p&gt;

&lt;p&gt;What we found out during the dry run:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The original table was very slow at first. This was expected: without any cache, we could barely work with it.&lt;/li&gt;
&lt;li&gt;It took us 10 hours just to warm up the original table before we could start trying out our commands.&lt;/li&gt;
&lt;li&gt;It took us more than 3 days to finish copying the data.&lt;/li&gt;
&lt;li&gt;Total IOPS can spike up to 6,000 during this operation, even when running in isolation without any other database workload. To put this in perspective, 6,000 read IOPS is virtually identical to what our production database handles under normal operating conditions.&lt;/li&gt;
&lt;li&gt;We saw 6-8x better performance on the same queries once we switched to partitioned tables.&lt;/li&gt;
&lt;li&gt;Prewarming the partitioned table took only 3 hours, compared to 10 hours for the original table.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once we had the right statistics, we made the necessary arrangements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Announcing the day and time we would do this on the production database.&lt;/li&gt;
&lt;li&gt;Increasing the storage capacity to ensure our database could fit the new table and still have extra space left until we dropped the original table. We also needed to account for the storage required by everyday tasks.&lt;/li&gt;
&lt;li&gt;Increasing the IOPS so that the primary and the replicas could handle the load from the data copy process.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Go-Live Day&lt;/h3&gt;

&lt;p&gt;The work started at the beginning of the week to minimize weekend work and ensure our engineers had a peaceful weekend. We also had a backup engineer and SRE support.&lt;/p&gt;

&lt;p&gt;It’s quite normal for things not to go as planned, but the first challenge was something I didn’t expect at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge #1: Our original table was so bad that I couldn’t even complete copying one day's worth of data.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Based on the dry run, we expected to copy one day of data within 2-3 minutes. However, I didn’t expect production to be much worse. In theory, the production data should have had a sufficient cache, as it is actively being used. I didn’t spend too much time looking into it, but I knew I couldn’t warm up the table, as we wouldn’t have enough IOPS for it.&lt;/p&gt;

&lt;p&gt;We had 8 years of data to copy. So, imagine waiting over 15 minutes for one day of data. In fact, I don’t know how long it would have taken, as I killed the query after 15 minutes.&lt;/p&gt;

&lt;p&gt;What was the solution? We knew for a fact from our dry run that, with a warmed-up table, one day of data takes only 2-3 minutes to copy. But we couldn’t warm up our production table. So, what could we do?&lt;/p&gt;

&lt;p&gt;I remembered a Postgres feature called &lt;a href="https://www.postgresql.org/docs/current/ddl-foreign-data.html" rel="noopener noreferrer"&gt;foreign data wrappers&lt;/a&gt;. Basically, we would read from another host and write to the partitioned table on the production host. This way, we wouldn’t have to warm up the production table, and we wouldn’t use too much of its IOPS either. This seemed like a win to us.&lt;/p&gt;

&lt;p&gt;Based on that idea, we adjusted our plan a little:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Provision another production-grade database.&lt;/li&gt;
&lt;li&gt;Prewarm the table.&lt;/li&gt;
&lt;li&gt;Set up the foreign data wrapper.&lt;/li&gt;
&lt;li&gt;Update our copy script to read from the new host and write to the current production database.&lt;/li&gt;
&lt;/ol&gt;
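&lt;p&gt;The FDW setup looked roughly like this; the host, credentials, and table names are placeholders, not our actual configuration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER warm_copy_server FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'warm-copy.internal', dbname 'prices_copy');

CREATE USER MAPPING FOR CURRENT_USER SERVER warm_copy_server
  OPTIONS (user 'copier', password 'secret');

-- Expose the prewarmed source table locally...
CREATE FOREIGN TABLE prices_foreign (
  id BIGINT,
  coin_id INTEGER,
  price_data JSONB,
  created_at TIMESTAMPTZ
) SERVER warm_copy_server OPTIONS (table_name 'prices');

-- ...then copy it over one slice at a time
INSERT INTO prices_partitioned
SELECT * FROM prices_foreign
WHERE created_at &gt;= '2024-01-01' AND created_at &lt; '2024-01-02';
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;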

&lt;p&gt;This whole process set us back by two to three days, but it was something we could not avoid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge #2: Warming up all databases, including replicas, was necessary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We didn't realize we needed to warm up our replicas until the go-live day began; we had only been focusing on the primary database. This oversight added extra work to the process, but at least afterwards we no longer had to worry about the replicas’ partitions being insufficiently warmed.&lt;/p&gt;

&lt;h4&gt;The final query&lt;/h4&gt;

&lt;p&gt;Once we had gone through all of the tasks, we just flipped the switch by renaming the tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;BEGIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Remove existing trigger&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;...;&lt;/span&gt;

&lt;span class="c1"&gt;-- THE IMPORTANT BITS&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;prices&lt;/span&gt; &lt;span class="k"&gt;RENAME&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;prices_old&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;prices_partitioned&lt;/span&gt; &lt;span class="k"&gt;RENAME&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;prices&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create the trigger function&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;sync_prices_changes_v2&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
  &lt;span class="c1"&gt;-- ...&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="k"&gt;LANGUAGE&lt;/span&gt; &lt;span class="n"&gt;plpgsql&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Attach the trigger on the new table so that prices_old will get the changes&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="n"&gt;sync_to_partitioned_table&lt;/span&gt;
&lt;span class="k"&gt;AFTER&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;prices&lt;/span&gt;
&lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;EACH&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;
&lt;span class="k"&gt;EXECUTE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;sync_prices_changes_v2&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, we use a trigger to copy changes from the new table to the old one. We still need the old table. Remember, things could go wrong, and we need our Plans B, C, and so on.&lt;/p&gt;

&lt;h3&gt;What We Did Right&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Table warm-ups:&lt;/strong&gt; Based on past experience from the database upgrade, we made the right call to warm up the partitions. This ensured that query time didn't increase when we switched from the unpartitioned table to the partitioned table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scripted the tasks:&lt;/strong&gt; We prepared scripts for every task and developed a small Go app to manage data copying. The app included essential features like timestamps, the ability to specify the year for data copying, and the time taken to copy data. We also created an app for warming up the table, allowing us to carefully manage CPU and IOPS usage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Used CloudWatch Dashboard:&lt;/strong&gt; We decided to fully utilize CloudWatch Dashboard for this project, and it proved invaluable for monitoring IOPS, CPU, Replica Lag, and other metrics across multiple replicas. Learning to set up the vertical line feature was particularly helpful for visualizing before-and-after comparisons.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team Backup:&lt;/strong&gt; Having backup from another person or team was beneficial. They helped identify things we might have missed and provided a sounding board for ideas during planning and execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The result&lt;/h2&gt;

&lt;p&gt;Now let’s get to the fun part. Note, however, that there was also a regression after we switched to the partitioned table; details are in the “What we would do differently” section.&lt;/p&gt;

&lt;p&gt;When talking about the results, we should go back to why we did this in the first place. On the micro level, we wanted to reduce the IOPS for certain queries. On the macro level, we wanted our endpoints to be faster and more resilient to request spikes. Severely high IOPS can cause replica lag as well.&lt;/p&gt;

&lt;h3&gt;IOPS&lt;/h3&gt;

&lt;p&gt;IOPS was reduced by 20% right after this exercise. Since this table is used extensively across all of our applications, we could lower the provisioned maximum IOPS, allowing us to save costs further. To be clear, we run multiple replicas, so the cost savings are multiplied by the number of replicas we have.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F293xfzj2gueosw2lhnzb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F293xfzj2gueosw2lhnzb.png" alt="IOPS" width="800" height="302"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Response Time&lt;/h3&gt;

&lt;p&gt;This was quite significant: we managed to reduce the p99 from 4.13s to 578ms, about an 86% reduction in response time. You can see how flat the chart is right after we made the switch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmz7z3wez7v5upmnyatle.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmz7z3wez7v5upmnyatle.png" alt="Response Time" width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Replica Lag&lt;/h3&gt;

&lt;p&gt;Right before we made the table switch, increased usage of the affected endpoints drove IOPS higher, which in turn caused replica lag. It went away the moment we flipped to the partitioned table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fandcu2ufh770yo4qx3gd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fandcu2ufh770yo4qx3gd.png" alt="Replica Lag" width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;What we would do differently&lt;/h2&gt;

&lt;h3&gt;Block the deployments&lt;/h3&gt;

&lt;p&gt;The biggest mistake in our planning was not blocking deployments on the day of the switch. While the switch itself wouldn't disrupt others' work, we overlooked the fact that two deployments were related to the table we were optimizing. This caused confusion about the impact of the table partitioning.&lt;/p&gt;

&lt;p&gt;The higher CPU and IOPS utilization observed after the switch put the exercise at risk of rollback. We eventually identified that an earlier deployment caused the problem. However, pinpointing the cause required rolling back the changes and extensive discussion. This situation could have been avoided by blocking deployments for a day to clearly assess the impact of our changes.&lt;/p&gt;

&lt;h3&gt;Too focused on the replicas&lt;/h3&gt;

&lt;p&gt;Our focus on resolving the replica issue led us to overlook the impact on the primary database. We failed to identify which queries would be affected, and one query, in particular, performed worse after the switch. This query triggered scans across all of the table partitions, increasing IOPS and CPU usage. By modifying the query, we managed to resolve it.&lt;/p&gt;

&lt;p&gt;This experience highlighted that without the correct query, table partitioning can be detrimental. In this instance, the query lacked a lower limit for the date range, resulting in all partitions being scanned unnecessarily. Interestingly, the same query performed well on the unpartitioned table.&lt;/p&gt;
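&lt;p&gt;To illustrate the difference with made-up dates, a query bounded on both ends lets the planner prune partitions, while an open-ended one touches every partition (&lt;code&gt;EXPLAIN&lt;/code&gt; shows which partitions a plan scans):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- No lower bound: every monthly partition before the cutoff is scanned
EXPLAIN SELECT * FROM prices WHERE created_at &lt; '2025-01-01';

-- Bounded range: only the four partitions covering these months are scanned
EXPLAIN SELECT * FROM prices
WHERE created_at &gt;= '2024-09-01' AND created_at &lt; '2025-01-01';
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;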

&lt;h3&gt;Extra partitions&lt;/h3&gt;

&lt;p&gt;This was another instance where we noticed a regression on one endpoint that helped us realize the mistake.&lt;/p&gt;

&lt;p&gt;To reduce the future workload of creating partitions, we initially created all the partitions for 2025. We then discovered that one query was structured like &lt;code&gt;created_at &amp;gt; ?&lt;/code&gt; without an upper limit, causing it to scan future partitions that were still empty. Removing these partitions fixed the issue.&lt;/p&gt;

&lt;p&gt;Going forward, we need to determine a better strategy for when to create future partitions.&lt;/p&gt;

&lt;h3&gt;Incremental release&lt;/h3&gt;

&lt;p&gt;We shouldn’t have waited for the entire table to be migrated before starting to use it. Since most users only access data from the last four partitions (or months), we could have implemented a feature toggle to direct queries for data after a specific date to the already migrated partitioned tables.&lt;/p&gt;

&lt;p&gt;This approach would have allowed us to start using the partitioned tables sooner and reduced the risks and potential negative impact of any issues that might arise. As it stands, rolling back our changes would be costly in terms of IOPS, as we would need to prewarm the old table again to avoid production downtime due to cold cache.&lt;/p&gt;

&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;This is just the beginning. With this experience under our belt, we can start exploring the possibility of implementing table partitions on other tables as well. Of course, table partitioning isn't a one-size-fits-all solution. We need to diagnose the issue before proceeding. Sometimes, something as simple as adding an index can resolve the problem.&lt;/p&gt;

&lt;p&gt;In conclusion, while partitioning the 1TB+ "prices" table presented some challenges, especially during the go-live phase, the overall outcome was substantial performance improvements. This initiative aligns with the API Team's ongoing goal: providing the best possible experience to better serve our customers. Our API is now more stable and resilient against sudden spikes in requests during peak hours.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>rails</category>
    </item>
    <item>
      <title>Error Budgets in Practice: A Data-Driven Approach to Risk and Release Management</title>
      <dc:creator>Hakim Zulkhibri</dc:creator>
      <pubDate>Mon, 20 Jan 2025 11:29:29 +0000</pubDate>
      <link>https://dev.to/coingecko/error-budgets-in-practice-a-data-driven-approach-to-risk-and-release-management-2m9b</link>
      <guid>https://dev.to/coingecko/error-budgets-in-practice-a-data-driven-approach-to-risk-and-release-management-2m9b</guid>
      <description>&lt;h2&gt;
  
  
  Why Error Budgets?
&lt;/h2&gt;

&lt;p&gt;CoinGecko offers API services to our customers. We provide two types of APIs: the Public API and the &lt;a href="https://www.coingecko.com/en/api?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=eng-content&amp;amp;utm_content=error-budgets" rel="noopener noreferrer"&gt;Pro API&lt;/a&gt;. For the Pro API, we are bound by tight &lt;strong&gt;service-level agreements (SLAs)&lt;/strong&gt; with our customers. These SLAs are important for maintaining customer satisfaction and trust in our platform.&lt;/p&gt;

&lt;p&gt;We visualized our risk metrics and grouped risks into severity categories that may endanger our SLAs. Instead of settling for an arbitrary availability goal like 99.9% or 99.95%, we rely on tangible data to ensure that our goals remain realistic.&lt;/p&gt;

&lt;p&gt;In this article, we will discuss the process behind measuring and managing a reliable &lt;strong&gt;uptime&lt;/strong&gt; SLA. How do we track, analyse and understand risks before reaching a conclusion for our SLA?&lt;/p&gt;

&lt;p&gt;For ease of understanding, let’s first talk about SLAs and SLOs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SLA&lt;/strong&gt;&lt;br&gt;
Service Level Agreement – an agreement with our customers about the reliability of our services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SLO&lt;/strong&gt;&lt;br&gt;
Service Level Objective – an internal threshold that catches an issue before it breaches our SLA.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6i31opdsosv1lzi96uyu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6i31opdsosv1lzi96uyu.png" alt="SLA and SLO, where it stands - Courtesy of Google" width="800" height="426"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;SLA and SLO, where it stands - Courtesy of Google&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In other words, our SLO threshold is stricter than our SLA: we need to catch any issue before it reaches the customer. In terms of uptime, we internally allow a shorter duration of downtime, x, than the external threshold we commit to, y. As a formula: x &amp;lt; y.&lt;/p&gt;

&lt;p&gt;The SLA that we have for our Pro API is 99.9%. For the SLO, we therefore set a stricter threshold; e.g., 99.95% or 99.99%.&lt;/p&gt;
&lt;h3&gt;
  
  
  How much headroom do we have before we breach our SLO?
&lt;/h3&gt;

&lt;p&gt;An uptime SLA of 99.9% is equivalent to 43.2 minutes of downtime in a month. A corresponding SLO of 99.95% is equivalent to 21.6 minutes of downtime.&lt;/p&gt;

&lt;p&gt;The downtime we can still afford under our SLO is known as the Error Budget. This budget is what lets us do maintenance, deployments, and improvements to our application. With a 99.95% SLO, engineers have only 21.6 minutes a month to maneuver when they face problems that cause downtime.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Error Budget&lt;/strong&gt; is the inverse of the SLO. If our SLO is 99.9% availability, our Error Budget is the remaining amount of time (0.1% unavailability).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh3gxkyusumpp0ho04x1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh3gxkyusumpp0ho04x1.png" alt="Availability Table - Courtesy of Google" width="800" height="723"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Availability Table - Courtesy of Google&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Calculating unavailability in minutes per month:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Desired SLO
slo = 99.9

// Days in a month used in the formula
month = 30

// Total minutes in a month; 30 days * 24 hours * 60 minutes
mins_per_month = 43200 

// Calculate allowed downtime in minutes per month
unavailability_mins = (100 - slo) / 100 * mins_per_month
unavailability_mins = 0.1 / 100 * 43200
unavailability_mins = 43.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the table above, we now understand the unavailability, i.e., the Error Budget, in terms of the time we can afford.&lt;/p&gt;

&lt;p&gt;Let’s look at the two diagrams below to understand the burn rate of our Error Budget from Day 1 to Day 28. Both start with our Error Budget of 21.6 mins (expressed as 100%) at the beginning of the month.&lt;/p&gt;

&lt;p&gt;The first diagram shows a positive Error Budget by the 28th day of the month. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1gbp7ewsgn1ukrx0j3n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1gbp7ewsgn1ukrx0j3n.png" alt="Monthly error budget nearing the budget - Courtesy of Google" width="800" height="492"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Monthly error budget nearing the budget - Courtesy of Google&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Meanwhile, the following diagram shows a breached Error Budget with negative percentage remaining.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feucsyunz63q98lyk27sa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feucsyunz63q98lyk27sa.png" alt="Monthly error budget breaching the budget - Courtesy of Google" width="800" height="493"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Monthly error budget breaching the budget - Courtesy of Google&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The diagrams above provide a visual representation of the burn rate as a percentage, regardless of how many minutes of Error Budget we have chosen.&lt;/p&gt;

&lt;p&gt;Error budget burn rate can be monitored throughout the month to revise the frequency, priority and type of deployments scheduled.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Analyzing past incidents to categorize our failure points
&lt;/h2&gt;

&lt;p&gt;No application or system is perfect, especially in its early stages. The key is to learn from these experiences by recording, documenting, and categorizing each incident for future reference. By investigating these issues, we gain a deeper understanding of how to prioritize and address them, helping us craft realistic SLOs. Analyzing past incidents and anticipating future ones allows us to take proactive measures to prevent SLA breaches and ensure system reliability.&lt;/p&gt;

&lt;p&gt;First things first: we have to understand our failure points by categorizing each incident that has occurred or may occur in the future. Each of these is what we call a &lt;strong&gt;risk&lt;/strong&gt;. This gives us a high-level view of which categories cause us the most headache.&lt;/p&gt;

&lt;p&gt;Where do we obtain this risk information? Historical data, industry best practices, brainstorming sessions, and so on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For example purposes&lt;/strong&gt;, these are some of the categories that we identified that can cause downtime to our application.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Disaster recovery drill&lt;/li&gt;
&lt;li&gt;Updating major code version&lt;/li&gt;
&lt;li&gt;Code deployment misconfiguration&lt;/li&gt;
&lt;li&gt;Unoptimized database queries&lt;/li&gt;
&lt;li&gt;Software defects in the code&lt;/li&gt;
&lt;li&gt;Breakdown of caching service&lt;/li&gt;
&lt;li&gt;Outage in an Availability Zone&lt;/li&gt;
&lt;li&gt;Unintended data loss or corruption &lt;/li&gt;
&lt;li&gt;Malicious security breach/attack&lt;/li&gt;
&lt;li&gt;High volume of traffic&lt;/li&gt;
&lt;li&gt;Breakdown in the message queue system&lt;/li&gt;
&lt;li&gt;Disk failure&lt;/li&gt;
&lt;li&gt;Third-party dependency failure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next, from each of the incidents, we calculate:&lt;br&gt;
&lt;strong&gt;ETTD - Estimated Time To Detection&lt;/strong&gt; – how long it would take to detect and notify a human (or robot) that the incident has occurred; aka MTTD (Mean Time to Detect).&lt;br&gt;
&lt;strong&gt;ETTR - Estimated Time To Resolution&lt;/strong&gt; – how long it would take to fix the incident once the human (or robot) has been notified; aka MTTR (Mean Time to Repair).&lt;br&gt;
&lt;strong&gt;ETTF - Estimated Time To Failure&lt;/strong&gt; – estimated frequency between instances of this incident; aka MTBF (Mean Time Between Failure). &lt;br&gt;
&lt;strong&gt;% of Users Affected&lt;/strong&gt; – the percentage of users affected by the failure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffooymih0qg8ytgyctvk9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffooymih0qg8ytgyctvk9.png" alt="Above terms visualized - Courtesy of Google" width="800" height="334"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Above terms visualized - Courtesy of Google&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This helps us understand the frequency, time, and our swiftness in responding towards an incident.&lt;/p&gt;

&lt;p&gt;We wanted to understand how much downtime (bad minutes) per year is caused by a single category. From this valuable information, we open up a spreadsheet, fill in all of our data, and calculate our risk level for each category. This is what we call the &lt;strong&gt;Risk Catalog&lt;/strong&gt;.&lt;/p&gt;
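
&lt;p&gt;The grey-cell arithmetic can be sketched as follows (the parameter names mirror the terms above; the sample numbers are purely illustrative, not taken from our actual catalog):&lt;/p&gt;

```ruby
# Estimate a risk's incidents per year and "bad minutes" per year from its
# ETTD, ETTR, ETTF and percentage of users affected.
def bad_minutes_per_year(ettd_min:, ettr_min:, ettf_days:, users_affected_pct:)
  incidents_per_year = 365.0 / ettf_days
  downtime_per_incident = ettd_min + ettr_min
  incidents_per_year * downtime_per_incident * (users_affected_pct / 100.0)
end

# e.g. detected in 10 min, repaired in 60 min, recurs every 180 days, affects 50% of users
bad_minutes_per_year(ettd_min: 10, ettr_min: 60, ettf_days: 180, users_affected_pct: 50)
# roughly 71 bad minutes per year
```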

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facn09jkrtsl571jjwybh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facn09jkrtsl571jjwybh.png" alt="Example of a Risk Catalog" width="800" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We enter our list of risks in the blue cells together with the ETTD, ETTR, Percentage of impact towards users and ETTF. Based on our inputs, we are able to see the number of incidents per year and bad minutes per year generated by the spreadsheet formula in the grey cells.&lt;/p&gt;

&lt;h2&gt;
  
  
  Computed Stack Rank of Risks
&lt;/h2&gt;

&lt;p&gt;We took the information above and rearranged the risks by severity level in a new spreadsheet called the &lt;strong&gt;Risk Stack Rank&lt;/strong&gt;. This gives us data-driven context on where we stand today versus our currently defined SLO.&lt;/p&gt;

&lt;p&gt;Let’s have a look at the computed stack rank of risks below:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqklplzi504p3zch02tdw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqklplzi504p3zch02tdw.png" alt="Computed stack rank of risks" width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the sheet above, our risks are populated and sorted by bad minutes per year. The risk with the most bad minutes per year is considered the highest risk.&lt;/p&gt;

&lt;p&gt;The Risk Stack Rank has several components to look at:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Target Availability&lt;/strong&gt;&lt;br&gt;
The desired availability in percentage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget (m/yr)&lt;/strong&gt;&lt;br&gt;
The total error budget available, measured in minutes per year (m/yr), which represents the maximum allowable downtime while still meeting the target availability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accepted (m/yr)&lt;/strong&gt;&lt;br&gt;
The amount of downtime already allocated for various known risks in minutes per year.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unallocated Budget (m/yr)&lt;/strong&gt;&lt;br&gt;
The portion of the error budget that remains uncommitted after accounting for known and accepted risks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Threshold of unacceptability for an individual risk (% of error budget)&lt;/strong&gt;&lt;br&gt;
A limit that defines how much of the total error budget a single risk can consume.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Too Big Threshold (m/yr) – for a single risk&lt;/strong&gt;&lt;br&gt;
The absolute upper limit for the amount of downtime a single risk can be responsible for. If the expected impact of a risk exceeds this threshold, the risk is deemed "too big" and must be mitigated, as it could jeopardize the ability to meet the SLO.&lt;/p&gt;
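
&lt;p&gt;A minimal sketch of these header figures, assuming an average year of 365.25 days (525,960 minutes, which is where the 525.96-minute budget for 99.9% in the sheets below comes from):&lt;/p&gt;

```ruby
MINUTES_PER_YEAR = 525_960 # 365.25 days * 24 hours * 60 minutes

# Total error budget in minutes per year for a given availability target.
def annual_error_budget_minutes(target_availability_pct)
  (100 - target_availability_pct) / 100.0 * MINUTES_PER_YEAR
end

# "Too Big" threshold: the most downtime a single risk may cost before it
# must be mitigated.
def too_big_threshold(target_availability_pct, unacceptability_pct)
  annual_error_budget_minutes(target_availability_pct) * unacceptability_pct / 100.0
end

annual_error_budget_minutes(99.9) # about 525.96 minutes/year
too_big_threshold(99.9, 25)       # about 131.49 minutes/year
```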

&lt;p&gt;The colored cells are explained below:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frceeoxjq9a8awdwb8c50.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frceeoxjq9a8awdwb8c50.png" alt="Cell colors definition" width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Red – this risk is &lt;strong&gt;unacceptable&lt;/strong&gt;, as it falls above the acceptable error budget for a single risk.&lt;br&gt;
Amber – this risk &lt;strong&gt;should not be acceptable&lt;/strong&gt;, as it’s a major consumer of our error budget and therefore, needs to be addressed.&lt;br&gt;
Green – this is an &lt;strong&gt;acceptable&lt;/strong&gt; risk. It's not a major consumer of our error budget, and in aggregate, does not cause our application to exceed the error budget.&lt;br&gt;
Blue – this risk &lt;strong&gt;has been accepted&lt;/strong&gt; to fit within our error budget. Accepting a risk means planning not to fix it and taking the outage and corresponding hit on the error budget.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Risk Stack Rank in Practice
&lt;/h2&gt;

&lt;p&gt;Remember the risks we entered in the Risk Catalog together with their metrics? The Risk Stack Rank takes those risks and ranks them by bad mins/year.&lt;/p&gt;

&lt;p&gt;In this subsection, assume that &lt;strong&gt;we want a three-nines availability target (99.9%)&lt;/strong&gt;; we then have 2 red-shaded (unacceptable) risks, while the others are green-shaded (acceptable).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flp9v1uf19vcs2qf9shva.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flp9v1uf19vcs2qf9shva.png" alt=" " width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's see some scenarios below to see it in action.&lt;/p&gt;

&lt;h3&gt;
  
  
  Accepting a Red or Amber-shaded Risk
&lt;/h3&gt;

&lt;p&gt;Say that our threshold of unacceptability for an individual risk is 25% of the error budget. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshgvcfzxvqhklvm5ibv0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshgvcfzxvqhklvm5ibv0.png" alt=" " width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see from above that accepting “Third-party dependencies failure” turns some green-shaded risks into amber-shaded risks (should not be accepted). This happens because the accepted risk already consumes a sizeable share of the error budget, leaving the remaining risks to endanger what is left.&lt;/p&gt;

&lt;p&gt;Say we accept more of the risks that will consume our error budget.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9gvi7r0bsz2prrdqtzcx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9gvi7r0bsz2prrdqtzcx.png" alt=" " width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram above shows that more risks are now amber-shaded. This means we have to act on these risks to bring down their bad mins/year. We’ll discuss this in the Improving our Risk Stack Rank section.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ideal Situation
&lt;/h3&gt;

&lt;p&gt;We can start by accepting the green risks and see how they consume our error budget in this sheet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frc7epx1mr10d8nk9s9u2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frc7epx1mr10d8nk9s9u2.png" alt=" " width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the figure above we can see that we have accepted 519.62 out of 525.96 minutes from our error budget.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unaccepted Risks
&lt;/h3&gt;

&lt;p&gt;When we accepted risks (marked y), we agreed to take them on without requiring any mitigation actions. These risks are now expected to burn our error budget.&lt;/p&gt;

&lt;p&gt;But how about unaccepted risks that are in the red or amber-shaded? What do we do with them?&lt;/p&gt;

&lt;p&gt;If we do accept them, the sheet will show that we have breached our error budget. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqh54zn3k4fzbmnu4f6ui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqh54zn3k4fzbmnu4f6ui.png" alt=" " width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that our &lt;strong&gt;Unallocated Budget&lt;/strong&gt; section has reached a negative value.&lt;/p&gt;

&lt;p&gt;These red and amber-shaded risks &lt;strong&gt;require mitigation actions&lt;/strong&gt;; they must be acted upon so that they do not endanger our Error Budget.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to implement Error Budgets in practice
&lt;/h2&gt;

&lt;p&gt;Now that we have a Service Level Objective (SLO) and an Error Budget, let's enforce them! It is important that everyone in the organization is aware of this policy, especially the engineering and product teams.&lt;/p&gt;

&lt;p&gt;This serves as a baseline for determining whether we can release new features or deploy a hotfix when the Error Budget is nearing its limit or has been breached.&lt;/p&gt;

&lt;p&gt;To simplify things in this article, we present 3 severity tiers – Tier 1, Tier 2 and Tier 3 – each with a Call to Action (CTA).&lt;/p&gt;

&lt;p&gt;How do we know which is which? Again, quantifying this is crucial in understanding the criticality of an issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1&lt;/strong&gt;&lt;br&gt;
Description: The Error Budget is projected to deplete within X days (e.g. 14 days), but the percentage remaining is still at an acceptable level.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;CTA: Acknowledgement is required and the SRE team will notify the application team.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 2&lt;/strong&gt;&lt;br&gt;
Description: Y% of the Error Budget (e.g. 50%) has been consumed within 28 days and the Error Budget is in warning status.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;CTA: Halt all releases except P0 issues or security fixes until the SLO recovers; set up a dedicated team to investigate, AND the SRE team highlights the issue to the application team.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 3&lt;/strong&gt;&lt;br&gt;
Description: There is a major depletion of Error Budget within X days (e.g. 2 days), OR Z% or more of the Error Budget (e.g. 80%) has been consumed.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;CTA: All hands on deck to focus on resolving the service outage. Inform top management. Use a "silver bullet" (see next section on Silver Bullets) carefully, upon multiple approvals, at this stage. Prepare a PR statement if needed.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By categorizing incidents into these tiers, the team can respond proportionally to the severity of an issue. This ensures resources are allocated effectively while maintaining adherence to SLOs. &lt;/p&gt;
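
&lt;p&gt;Purely as an illustration, the tiering above could be encoded like this, using the example X/Y/Z values from the text (14 days, 50%, 80%); real thresholds would come from your own policy:&lt;/p&gt;

```ruby
# Map error-budget state to a severity tier. budget_consumed_pct is the share
# of the budget already burned; projected_depletion_days estimates when the
# rest runs out at the current burn rate.
def severity_tier(budget_consumed_pct:, projected_depletion_days:)
  if projected_depletion_days <= 2 || budget_consumed_pct >= 80
    3 # all hands on deck
  elsif budget_consumed_pct >= 50
    2 # halt non-critical releases until the SLO recovers
  elsif projected_depletion_days <= 14
    1 # acknowledge; SRE notifies the application team
  else
    0 # healthy
  end
end

severity_tier(budget_consumed_pct: 55, projected_depletion_days: 20) # Tier 2
```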

&lt;h2&gt;
  
  
  SLO Recovery
&lt;/h2&gt;

&lt;p&gt;When SLOs reach warning levels, two immediate actions are crucial – first, a dedicated task force comprising application developers and SRE must be assembled to address the situation. Second, all ongoing releases must be temporarily suspended during the investigation phase.&lt;/p&gt;

&lt;p&gt;An improved release planning strategy is fundamental to SLO improvement, particularly in environments practicing Continuous Deployment. The foundation of this strategy involves categorizing deployments based on their risk levels.&lt;/p&gt;

&lt;p&gt;The classification of deployments into risk categories requires careful consideration of various factors. While the specific criteria for risk assessment may vary by organization, they typically consider factors such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The scope of changes&lt;/li&gt;
&lt;li&gt;Potential impact on critical user paths&lt;/li&gt;
&lt;li&gt;Architectural modifications&lt;/li&gt;
&lt;li&gt;Integration points with external systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For deployments classified as high-risk, several key practices should be implemented. The implementation of a daily stagger system ensures that high-risk deployments are spread across different days, allowing for precise identification and swift rollback of problematic changes if necessary. All high-risk deployments must be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Documented in a shared engineering calendar&lt;/li&gt;
&lt;li&gt;Clearly communicated to all team members&lt;/li&gt;
&lt;li&gt;Monitored by relevant stakeholders during and after deployment&lt;/li&gt;
&lt;li&gt;Scheduled with consideration for key personnel availability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An often overlooked but crucial aspect of release management is ensuring the availability of critical stakeholders during deployment windows. This includes considering team members' leave schedules when planning significant releases.&lt;/p&gt;

&lt;p&gt;By implementing these controls, teams can effectively shield their remaining error budget from new incidents, allowing their SLOs to gradually recover as the measurement window advances and maintaining system stability during the recovery period.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We assess risks by understanding how much downtime or error the system can tolerate while still meeting SLOs. We use error budgets to track acceptable failure, balance reliability with deployments, and prioritize risks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rather than stating without evidence that we aim for 99.9%, 99.95%, or any other availability goal, we now have concrete data that shows whether or not our goal is feasible.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By analyzing historical data, potential failures, and user impact, we can determine if an Error Budget is realistic and adjust accordingly to ensure system stability without holding back progress. This approach ensures that every decision—whether to deploy a new feature or address a critical issue—is backed by measurable insights and aligned with the organization’s goals.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Sachto, A. (2022, May 5). How SREs analyze risks to evaluate SLOs. Google Cloud Blog. &lt;a href="https://cloud.google.com/blog/products/devops-sre/how-sres-analyze-risks-to-evaluate-slos" rel="noopener noreferrer"&gt;https://cloud.google.com/blog/products/devops-sre/how-sres-analyze-risks-to-evaluate-slos&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Brown, M. (2017, May 23). How to prioritize and communicate risks—CRE life lessons. Google Cloud Blog. &lt;a href="https://cloud.google.com/blog/products/gcp/know-thy-enemy-how-to-prioritize-and-communicate-risks-cre-life-lessons" rel="noopener noreferrer"&gt;https://cloud.google.com/blog/products/gcp/know-thy-enemy-how-to-prioritize-and-communicate-risks-cre-life-lessons&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Training, G. C. (n.d.). SLOs vs SLAs [Video]. Coursera. &lt;a href="https://www.coursera.org/learn/site-reliability-engineering-slos/lecture/KpI1q/slos-vs-slas" rel="noopener noreferrer"&gt;https://www.coursera.org/learn/site-reliability-engineering-slos/lecture/KpI1q/slos-vs-slas&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Training, G. C. (n.d.-a). Error budgets [Video]. Coursera. &lt;a href="https://www.coursera.org/learn/site-reliability-engineering-slos/lecture/N12XI/error-budgets" rel="noopener noreferrer"&gt;https://www.coursera.org/learn/site-reliability-engineering-slos/lecture/N12XI/error-budgets&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Warner, A., Bramley, A., &amp;amp; Hilton, A. (2018, June 28). SRE at Google: Good housekeeping for error budgets. Google Cloud Blog. &lt;a href="https://cloud.google.com/blog/products/devops-sre/good-housekeeping-error-budgetscre-life-lessons" rel="noopener noreferrer"&gt;https://cloud.google.com/blog/products/devops-sre/good-housekeeping-error-budgetscre-life-lessons&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>release</category>
      <category>sla</category>
    </item>
    <item>
      <title>How to create test cases using Equivalence Partitioning and Boundary Value Analysis test technique (Step by step)</title>
      <dc:creator>Alan Liew</dc:creator>
      <pubDate>Mon, 15 Jul 2024 06:13:30 +0000</pubDate>
      <link>https://dev.to/coingecko/how-to-create-test-cases-using-equivalence-partitioning-and-boundary-value-analysis-test-technique-step-by-step-1nkk</link>
      <guid>https://dev.to/coingecko/how-to-create-test-cases-using-equivalence-partitioning-and-boundary-value-analysis-test-technique-step-by-step-1nkk</guid>
      <description>&lt;p&gt;In the world of software testing, ensuring comprehensive coverage of test cases while maintaining efficiency is a critical challenge. Among the range of techniques available, Boundary Value Analysis (BVA) and Equivalence Partitioning (EP) stand out as fundamental methods that &lt;strong&gt;significantly enhance the effectiveness of test design.&lt;/strong&gt; These techniques help identify critical test cases and ensure that the testing process is both systematic and thorough. &lt;/p&gt;

&lt;p&gt;In this blog post, we will delve into the principles of BVA and EP, exploring how they can be applied to optimize testing efforts and improve software quality.&lt;/p&gt;

&lt;p&gt;Both are black box testing techniques. Let’s start with an explanation of the Equivalence Partitioning technique.&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Equivalence Partitioning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The steps should be as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Divide the test conditions into groups or sets of partitions. All the elements under the same set of partitions are considered the same, and the system should handle them equivalently.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;A partition containing valid values is called a valid partition&lt;/li&gt;
&lt;li&gt;A partition containing invalid values is called an invalid partition&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;A non-numeric example:&lt;/strong&gt;&lt;br&gt;
If the fruit = apple, then print green&lt;br&gt;
If the fruit = orange, then print orange&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqloek8dv0vkoxft1jzy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqloek8dv0vkoxft1jzy.png" alt="Non Numeric Example" width="800" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above example, Apple and Orange are &lt;em&gt;valid partitions&lt;/em&gt;; no invalid partition is specified, so we assume all other fruits form an &lt;em&gt;invalid partition&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A numeric example:&lt;/strong&gt;&lt;br&gt;
Users should be able to register when they are between ages 1 and 21.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs507j4wbk43r9pr3itvs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs507j4wbk43r9pr3itvs.png" alt="1 and 21 is a valid partition(able to register), while an age less than 1 and more than 21 is an invalid partition" width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above example, an age between &lt;em&gt;1 and 21 is a valid partition (able to register)&lt;/em&gt;, while an age &lt;em&gt;less than 1 or more than 21 is an invalid partition (unable to register)&lt;/em&gt;.&lt;/p&gt;
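&lt;p&gt;The age example above can be sketched in Python. The function name &lt;code&gt;can_register&lt;/code&gt; and the representative values are illustrative, not from an actual system:&lt;/p&gt;

```python
# Hypothetical registration rule: ages 1 to 21 (inclusive) may register.
def can_register(age):
    return age in range(1, 22)  # valid partition: 1..21

# Equivalence partitioning: one representative value per partition suffices.
print(can_register(10))  # True  -- valid partition (1 to 21)
print(can_register(0))   # False -- invalid partition (below 1)
print(can_register(30))  # False -- invalid partition (above 21)
```

&lt;p&gt;Any other value picked from the same partition (say, 5 instead of 10) would exercise the same behaviour, which is exactly why one representative per partition is enough.&lt;/p&gt;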

&lt;p&gt;&lt;strong&gt;2. When there are multiple sets of partitions or more than one input:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Valid partitions should be combined.&lt;/li&gt;
&lt;li&gt;Invalid partitions should be tested individually.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Let’s take this as an example; the requirement is to register a username when it meets the requirements below:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It should be 6–30 characters long&lt;/li&gt;
&lt;li&gt;Should be alphanumeric&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqks0cbjkob1m36syn100.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqks0cbjkob1m36syn100.png" alt="We should be able to come up with these valid and invalid partitions." width="254" height="146"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Following rule No. 2, we don't need to create 5 separate test cases for the &lt;em&gt;&lt;strong&gt;3 partitions belonging to (6–30 characters long)&lt;/strong&gt;&lt;/em&gt; and the &lt;em&gt;&lt;strong&gt;2 partitions belonging to (should be alphanumeric)&lt;/strong&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Instead, we combine the valid partitions and cover both with the same test value (input) of abc123.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fih2ii7gglc965byc1o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fih2ii7gglc965byc1o.png" alt="We can satisfy the valid partition with the value of “abc123”" width="257" height="153"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, each invalid partition is tested individually.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdv495yzdn0330cwdhlcn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdv495yzdn0330cwdhlcn.png" alt="4 test cases will be needed to cover all partitions." width="256" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example, 4 test cases are sufficient to cover all the partitions.&lt;/p&gt;
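&lt;p&gt;As a sketch, the four test cases above can be written as assertions against a hypothetical validator. &lt;code&gt;is_valid_username&lt;/code&gt; is an assumed name, not a real API:&lt;/p&gt;

```python
# Hypothetical validation rule: 6-30 characters long, alphanumeric only.
def is_valid_username(name):
    return len(name) in range(6, 31) and name.isalnum()

# One combined test case covers both valid partitions at once...
print(is_valid_username("abc123"))     # True  -- valid length, alphanumeric

# ...while each invalid partition gets its own test case.
print(is_valid_username("abc1"))       # False -- too short
print(is_valid_username("a" * 31))     # False -- too long
print(is_valid_username("abc123!@#"))  # False -- not alphanumeric
```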

&lt;p&gt;&lt;strong&gt;The question may arise: why don't we combine the invalid partition?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For example, if we test only with fewer than 6 characters, the error message is “Sorry, your username must be between 6 and 30 characters long”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdpbuj0r1itvfkme83udo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdpbuj0r1itvfkme83udo.png" alt=" " width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, when we combined the &lt;strong&gt;invalid partition of &amp;lt;6 characters&lt;/strong&gt; with the &lt;strong&gt;invalid partition of not being alphanumeric&lt;/strong&gt; in a single test, the error message was “Sorry, only letters (a-z), numbers (0–9), and periods (.) are allowed.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fus36y3keas23nkqsz6as.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fus36y3keas23nkqsz6as.png" alt=" " width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s clear that when we combine two invalid partitions into one test, it’s very easy to miss out on validating the invalid partition of &amp;lt;6 characters. In that case, the test cases are not well designed.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;✅ Boundary Value Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Boundary value analysis is closely related to equivalence partitioning: once we know the partitions, we can determine the boundary values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We will reuse the same numeric example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users should be able to register when they are between ages 1 and 21&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbojno1jamnstfka64vn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbojno1jamnstfka64vn.png" alt=" " width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And we know that&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;age between 1 and 21 is a valid partition (able to register),&lt;/li&gt;
&lt;li&gt;age less than 1 and more than 21 is an invalid partition (unable to register).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To determine the boundary values:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 1: The minimum and maximum values of a partition are called the partition's boundary values.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hence, the boundary values of (1 to 21) are &lt;strong&gt;1 and 21&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Next, let's find out what the 2-value BVA or 3-value BVA are.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For 2-value BVA: each boundary value and its closest neighbor in the adjacent partition are taken as boundary values&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8oaflilchjbo77tgvazl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8oaflilchjbo77tgvazl.png" alt=" " width="800" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Boundary values of (1 to 21) are the values of &lt;strong&gt;1 and 21.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;(Adjacent partitions) The closest neighbors of the boundary values 1 and 21 are the values of &lt;strong&gt;0 and 22.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hence, the 2-value BVA are the values of &lt;strong&gt;0, 1, 21, 22&lt;/strong&gt;.&lt;/p&gt;
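&lt;p&gt;A minimal helper can enumerate the 2-value BVA for an inclusive range (the function name is illustrative, not a standard API):&lt;/p&gt;

```python
# 2-value BVA for an inclusive range [low, high]: each boundary value
# plus its closest neighbor in the adjacent partition.
def two_value_bva(low, high):
    return [low - 1, low, high, high + 1]

print(two_value_bva(1, 21))  # [0, 1, 21, 22]
```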




&lt;p&gt;&lt;strong&gt;For 3-value BVA: each boundary value and both of its neighbors are taken as boundary values.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Boundary values of (1 to 21) are the values of &lt;strong&gt;1 and 21.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;(Adjacent partitions) The boundary values of the adjacent partitions are &lt;strong&gt;0 and 22&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5f4p7m2ky3jf0zf95jmv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5f4p7m2ky3jf0zf95jmv.png" alt=" " width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Boundary values and both their neighbors of (1 to 21) are the values of &lt;strong&gt;0, 1, 2, 20, 21, 22.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvxbjy4zd9v42dk8q438.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvxbjy4zd9v42dk8q438.png" alt=" " width="800" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Boundary values and both their neighbors of (0 and 22) are the values of &lt;strong&gt;-1, 0, 1, 21, 22, 23.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hence, the 3-value BVA are the &lt;strong&gt;values of -1, 0, 1, 2, 20, 21, 22, 23&lt;/strong&gt;.&lt;/p&gt;
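&lt;p&gt;The derivation above can be sketched as a small helper that takes all boundary values, including those of the adjacent partitions, and adds both neighbors of each (the helper name is illustrative):&lt;/p&gt;

```python
# 3-value BVA: each boundary value together with both of its neighbors.
def three_value_bva(boundaries):
    values = set()
    for b in boundaries:
        values.update([b - 1, b, b + 1])
    return sorted(values)

# Boundaries of the valid partition (1, 21) and of the adjacent partitions (0, 22):
print(three_value_bva([0, 1, 21, 22]))  # [-1, 0, 1, 2, 20, 21, 22, 23]
```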




&lt;p&gt;Now, you may wonder when we should use 2-value BVA or 3-value BVA.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let’s try another requirement:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users should be able to register when their age is less than or equal to 21&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now suppose a developer makes a mistake and implements the logic so that users can only register when &lt;strong&gt;age = 21.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We might overlook this bug because the 2-value BVA only tests 21 and 22; however, &lt;strong&gt;a 3-value BVA may be able to surface the issue because the value of 20 is also tested.&lt;/strong&gt;&lt;/p&gt;
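&lt;p&gt;A quick sketch shows why the extra value matters. The buggy implementation below is hypothetical:&lt;/p&gt;

```python
# Requirement: register when age is less than or equal to 21.
# Hypothetical buggy implementation: only age exactly 21 can register.
def can_register_buggy(age):
    return age == 21  # bug: should accept every age up to 21

# 2-value BVA around the boundary 21 tests only 21 and 22:
print(can_register_buggy(21))  # True  -- looks correct
print(can_register_buggy(22))  # False -- looks correct, so the bug is missed

# 3-value BVA also tests 20, which exposes the bug:
print(can_register_buggy(20))  # False, but the requirement says True
```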




&lt;p&gt;Boundary Value Analysis and Equivalence Partitioning are useful techniques for any software tester. By focusing on critical edge cases and grouping inputs into meaningful categories, these techniques &lt;strong&gt;avoid exhaustive testing, streamline the testing process, and uncover potential defects&lt;/strong&gt; that might otherwise go unnoticed. Embracing these methods will enhance your testing strategy and ensure that your applications meet the highest standards of quality and performance.&lt;/p&gt;

</description>
      <category>softwaretesting</category>
      <category>qa</category>
    </item>
  </channel>
</rss>
