<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Denis Stebunov</title>
    <description>The latest articles on DEV Community by Denis Stebunov (@stebunovd).</description>
    <link>https://dev.to/stebunovd</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F903994%2F030faa57-8ed8-4c8c-b554-d19018190af2.png</url>
      <title>DEV Community: Denis Stebunov</title>
      <link>https://dev.to/stebunovd</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/stebunovd"/>
    <language>en</language>
    <item>
      <title>The best testing strategy</title>
      <dc:creator>Denis Stebunov</dc:creator>
      <pubDate>Sun, 15 Jun 2025 16:07:10 +0000</pubDate>
      <link>https://dev.to/stebunovd/the-best-testing-strategy-10ci</link>
      <guid>https://dev.to/stebunovd/the-best-testing-strategy-10ci</guid>
      <description>&lt;p&gt;If you're interested in automated testing, you might have seen some contradictory advice. For example, there's a well-known &lt;a href="https://martinfowler.com/articles/practical-test-pyramid.html" rel="noopener noreferrer"&gt;Test Pyramid&lt;/a&gt;, suggesting that we should focus mostly on unit tests. And there's another approach called &lt;a href="https://kentcdodds.com/blog/write-tests" rel="noopener noreferrer"&gt;Testing Trophy&lt;/a&gt;, which suggests that we should mostly write integration tests.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flrj2q2comlp8chej8i04.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flrj2q2comlp8chej8i04.png" alt="Test Pyramid, Testing Trophy, or something else?" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some posts argue that &lt;a href="https://tyrrrz.me/blog/unit-testing-is-overrated" rel="noopener noreferrer"&gt;unit tests are overrated&lt;/a&gt;, and others - that &lt;a href="https://blog.ploeh.dk/2020/08/17/unit-testing-is-fine/" rel="noopener noreferrer"&gt;they're fine&lt;/a&gt;. Which advice should we follow? Choosing the right testing strategy isn't that simple, and the choice will have long-lasting consequences. Let's dive in.&lt;/p&gt;

&lt;h2&gt;Terminology&lt;/h2&gt;

&lt;p&gt;People most commonly refer to &lt;strong&gt;unit&lt;/strong&gt;, &lt;strong&gt;integration&lt;/strong&gt;, and &lt;strong&gt;end-to-end&lt;/strong&gt; tests—and those will be the focus of this post. Of course, this isn't the only way to categorize tests. You might also have come across terms like smoke, functional, system, or contract tests, each of which serves its own purpose. To keep things simple, I won't cover every type of test here, but the core ideas discussed below apply to them as well.&lt;/p&gt;

&lt;h2&gt;What is being tested?&lt;/h2&gt;

&lt;p&gt;Each test interacts with some interface, and the interface being tested defines the test type:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9w9zk2l489v8fa5o108n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9w9zk2l489v8fa5o108n.png" alt="Test types by interface being tested" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unit&lt;/strong&gt; tests verify the behavior of a small piece of code, such as a function or a class;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration&lt;/strong&gt; tests work with larger pieces of code. Often, this is the code behind an API;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End-to-end&lt;/strong&gt; tests work with the program from the end user's perspective by interacting with the UI.&lt;/li&gt;
&lt;/ul&gt;
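&lt;p&gt;As a minimal illustration of the first two levels, here is a sketch in Python. All names (&lt;code&gt;slugify&lt;/code&gt;, &lt;code&gt;create_article&lt;/code&gt;, &lt;code&gt;InMemoryStore&lt;/code&gt;) are invented for the example:&lt;/p&gt;

```python
# A minimal sketch of the unit vs. integration distinction.
# All names here are hypothetical, not from any particular framework.

def slugify(title):
    """Turn an article title into a URL slug."""
    words = "".join(c if c.isalnum() else " " for c in title.lower()).split()
    return "-".join(words)

class InMemoryStore:
    """Stand-in for a database, so the example stays self-contained."""
    def __init__(self):
        self.articles = {}

def create_article(store, title):
    """The 'code behind an API': builds a slug and persists the article."""
    slug = slugify(title)
    store.articles[slug] = {"title": title, "slug": slug}
    return store.articles[slug]

# Unit test: exercises one small function in isolation.
def test_slugify():
    assert slugify("The Best Testing Strategy!") == "the-best-testing-strategy"

# Integration test: exercises the handler together with its storage.
def test_create_article():
    store = InMemoryStore()
    article = create_article(store, "Hello, World")
    assert article["slug"] == "hello-world"
    assert "hello-world" in store.articles
```

&lt;p&gt;The unit test pins down one function's behavior; the integration test checks that the pieces cooperate, without going through a UI.&lt;/p&gt;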

&lt;h2&gt;Comparison: Unit - Integration - E2E&lt;/h2&gt;

&lt;p&gt;Now, let's see how these types of tests compare to each other. Let's start with the most important one, which is...&lt;/p&gt;

&lt;h3&gt;✅ Confidence&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;🟥 Unit&lt;/th&gt;
&lt;th&gt;🟨🟨 Integration&lt;/th&gt;
&lt;th&gt;🟩🟩🟩 E2E&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;High-level tests, like integration or end-to-end, provide the best confidence. You may have seen memes about this, like the one below: even if all the low-level tests pass, that doesn't guarantee the app works as a whole.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14y878xtjij4zmukgkb4.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14y878xtjij4zmukgkb4.gif" alt="Two unit tests, zero integration tests" width="640" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Confidence is the most important reason why we even bother with testing. If we spend time and effort on testing but the confidence level remains low, it doesn't seem like a good investment. This is why "Confidence" is listed first. However, it's not the only criterion.&lt;/p&gt;

&lt;h3&gt;🏎️ Speed&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;🟩🟩🟩 Unit&lt;/th&gt;
&lt;th&gt;🟨🟨 Integration&lt;/th&gt;
&lt;th&gt;🟥 E2E&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The more code under the test, the slower the test is—no surprises here. This is especially noticeable with end-to-end tests because we run not just one program but two—our app and another program through which we test. In the case of web apps, it's a browser managed by a framework like Selenium or Playwright. As a result, end-to-end tests are the slowest and require more resources like CPU and memory, and their setup is more complicated.&lt;/p&gt;

&lt;h3&gt;✨ Ease of use&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;🟩🟩🟩 Unit&lt;/th&gt;
&lt;th&gt;🟨🟨 Integration&lt;/th&gt;
&lt;th&gt;🟥 E2E&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Writing and debugging end-to-end tests can be more challenging. The biggest complaint you may hear is that they're flaky. You run the test once, and it works. You run it again, and it fails! Run it one more time, and it works again! This problem is so widespread that some people resort to automatically retrying failing tests, hoping they'll eventually pass.&lt;/p&gt;

&lt;p&gt;Why are end-to-end tests flaky? It comes down to how browsers work—handling network requests, user input, JavaScript, and rendering all at once. If any of these take a bit longer, it can cause unexpected timing issues or transient states that throw off the test. A well-written test accounts for this by checking intermediate states and waiting for elements to load, making it stable and reliable. But simpler tests often skip these steps. They usually work—but sometimes fail without clear reasons. That's why flaky tests are common: writing them properly is just more complex.&lt;/p&gt;
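&lt;p&gt;The "explicit waits" that frameworks like Selenium and Playwright provide boil down to polling a condition with a timeout. A rough sketch of the idea in plain Python (the helper name and defaults are invented for illustration):&lt;/p&gt;

```python
import time

def wait_until(predicate, timeout=5.0, interval=0.1):
    """Poll predicate() until it returns a truthy value or the timeout expires.

    This mirrors what an explicit wait does in Selenium or Playwright:
    instead of a single check (or a fixed sleep), the condition is retried,
    which absorbs timing jitter from network, rendering, and JavaScript.
    """
    attempts = max(1, int(timeout / interval))
    for _ in range(attempts):
        result = predicate()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("condition was not met within the timeout")

# A flaky test asserts immediately; a stable one waits for the state, e.g.:
#   wait_until(lambda: page_contains("Welcome"), timeout=10)
```

&lt;p&gt;Real frameworks build this polling into their locators and assertions, but the principle is the same: wait for the state you expect rather than assuming it's already there.&lt;/p&gt;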

&lt;h3&gt;📊 Coverage (one test)&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;🟥 Unit&lt;/th&gt;
&lt;th&gt;🟨🟨 Integration&lt;/th&gt;
&lt;th&gt;🟩🟩🟩 E2E&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Unit tests operate on small pieces of code, such as one function or a class, so one test can cover only so much. Higher-level tests cover larger pieces of code at a time, which may be helpful when starting with automated testing. If the codebase has no tests yet, just a few end-to-end or integration tests may quickly provide decent coverage.&lt;/p&gt;

&lt;p&gt;I don't think chasing a very high percentage of coverage makes sense, though. As coverage grows, it becomes less and less useful as a metric, and 100% coverage won't guarantee the absence of bugs. However, going from 0% to 50-70% would certainly make a difference.&lt;/p&gt;

&lt;h3&gt;🔀 Testing input combinations&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;🟩🟩🟩 Unit&lt;/th&gt;
&lt;th&gt;🟨🟨 Integration&lt;/th&gt;
&lt;th&gt;🟥 E2E&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In theory, we could use any test type to validate all desired input combinations. In practice, though, it can be painful with slow-running tests.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foy4esv4gyxapi4w6u18g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foy4esv4gyxapi4w6u18g.png" alt="Testing input combinations" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's say we have a sign-up form that accepts a user's name, email, and password, and we would like to test five input combinations for each field. It may include a very long name, names in different languages, an already existing email, a password that is too short or appears in a leaked password database, etc. Testing all 5 x 3 = 15 combinations with end-to-end tests will take significant time. And this time quickly adds up! If we test all input combinations in the app this way, end-to-end tests will run forever.&lt;/p&gt;

&lt;p&gt;Instead, we could use a hybrid approach. It would be enough to test just two combinations with an end-to-end test: a happy path and when an error occurs. Then, we could cover the rest of the combinations with unit or integration tests, which would work much faster, and with a couple of E2E tests covering our back, there will be no degradation in confidence.&lt;/p&gt;
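&lt;p&gt;The low-level half of that hybrid split is easy to express as a parametrized test. A sketch with a made-up &lt;code&gt;validate_signup&lt;/code&gt; function and made-up validation rules (in pytest, the loop at the bottom would be a &lt;code&gt;@pytest.mark.parametrize&lt;/code&gt;):&lt;/p&gt;

```python
# Hypothetical signup validation, checked across many input combinations
# with fast low-level tests. Only the happy path and one error path
# would be duplicated as end-to-end tests.

def validate_signup(name, email, password):
    """Return a list of error codes; an empty list means the input is valid."""
    errors = []
    if len(name.strip()) not in range(1, 101):   # 1..100 characters
        errors.append("name-invalid")
    if "@" not in email:
        errors.append("email-invalid")
    if len(password) not in range(8, 129):       # 8..128 characters
        errors.append("password-length")
    return errors

CASES = [
    (("Ada", "ada@example.com", "correct horse"), []),
    (("", "ada@example.com", "correct horse"), ["name-invalid"]),
    (("Ada", "not-an-email", "correct horse"), ["email-invalid"]),
    (("Ada", "ada@example.com", "short"), ["password-length"]),
]

for args, expected in CASES:
    assert validate_signup(*args) == expected
```

&lt;p&gt;Adding a fifth combination per field is one more tuple here, versus one more multi-second browser run in an end-to-end suite.&lt;/p&gt;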

&lt;h3&gt;🤝 TDD-friendly&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;✅ Unit&lt;/th&gt;
&lt;th&gt;✅ Integration&lt;/th&gt;
&lt;th&gt;❌ E2E&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;While it is technically possible to use TDD with any test type, I'd argue that it doesn't play well with end-to-end tests due to their slowness. When using TDD, people tend to run tests often, and if they are slow, it would be too painful to build the program this way.&lt;/p&gt;

&lt;p&gt;I'm not saying that we should always use test-driven development, though. As with any technique, it has limitations and works great in some situations and not so great in others. I'm just saying that it's much easier and more enjoyable to use TDD with fast tests.&lt;/p&gt;

&lt;h3&gt;♻️ Refactoring-friendly&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;❌ Unit&lt;/th&gt;
&lt;th&gt;✅ Integration&lt;/th&gt;
&lt;th&gt;✅ E2E&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;By definition, refactoring means restructuring the code without adding new features. Restructuring the code, in turn, means that the internal interfaces in the code might change—we might add or remove functions or classes or change their call signatures.&lt;/p&gt;

&lt;p&gt;This process slows down significantly if these functions or classes are heavily covered with unit tests, which is often the case in projects that meticulously follow the Test Pyramid.&lt;/p&gt;

&lt;p&gt;Conversely, high-level tests can significantly boost refactoring speed since they provide a quick way to verify that the program works as before from the end user's perspective.&lt;/p&gt;

&lt;h2&gt;Side-by-side&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdw47wj049fh4blrd1zu0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdw47wj049fh4blrd1zu0.png" alt="Test types comparison" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;So what should we use?&lt;/h2&gt;

&lt;p&gt;Looking at the table above, we see no clear winner. Integration tests may look more balanced (and I like them), but they're not a silver bullet and aren't the best for every situation. So the answer is good old "It depends." Yes, boring, but true.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkcmb58bhlem94vinseke.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkcmb58bhlem94vinseke.png" alt="Not Test Pyramid, and not Testing Trophy" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I think methodologies like Test Pyramid or Testing Trophy do more harm than good. Their original explanations are rational and nuanced, but their most memorable artifact is the simple picture that clearly shows the "default" test type—unit, in the case of Test Pyramid, or integration, in the case of Testing Trophy. Most people remember only this.&lt;/p&gt;

&lt;p&gt;When developers adopt one of these methodologies, they often just go with the "default" test type if they don't immediately see reasons not to. This is a mental shortcut: it requires less thinking, but it results in suboptimal test suites that are less robust, more complex, or slower than they could be.&lt;/p&gt;

&lt;h2&gt;A better target&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53ueaotlr6ttfrbwl79z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53ueaotlr6ttfrbwl79z.png" alt="The balance of confidence, speed, and effort" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead of focusing on specific test types, we're better off focusing on values: what do we want to achieve with testing?&lt;/p&gt;

&lt;p&gt;We certainly want &lt;strong&gt;Confidence&lt;/strong&gt;—it's the primary reason why we even bother with testing. We also want &lt;strong&gt;Speed&lt;/strong&gt;; the faster our tests are, the better. But it all comes at a price, and the price is &lt;strong&gt;Effort&lt;/strong&gt;. Our users don't care about our test suite; they care about features. Kent Beck, the author of TDD, summarized it perfectly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;My philosophy is to test as little as possible to reach a given level of confidence.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I think this diagram of balancing confidence, speed, and effort is a much more useful concept to remember than any methodology based on balancing different test types.&lt;/p&gt;

&lt;h2&gt;How to find the right balance?&lt;/h2&gt;

&lt;p&gt;We probably can't measure this balance precisely, but we can develop some intuition about what's good and bad on each axis.&lt;/p&gt;

&lt;h3&gt;✅ Confidence as a measure of bug-filtering efficiency&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn88vio646jzrloqva9cc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn88vio646jzrloqva9cc.png" alt="Test suite filters bugs" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our test suite works like a filter between our code and production, catching bugs. The efficiency of this filter is a measure of confidence. If our tests never fail, the filter is useless—it doesn't catch anything. By looking at bugs caught by tests and bugs sneaking into production, we can get an idea of how efficient the filter is and what level of confidence we get from tests.&lt;/p&gt;
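&lt;p&gt;This filter view can be quantified roughly: of the bugs that surfaced over a period, what share did the suite catch before production? (The function below just illustrates the idea; it's not an established metric, and bugs nobody ever noticed can't be counted.)&lt;/p&gt;

```python
def filter_efficiency(caught_by_tests, escaped_to_production):
    """Share of known bugs that the test suite caught before release.

    A rough signal, not an exact measure: bugs that were never
    detected anywhere don't appear in either count.
    """
    total = caught_by_tests + escaped_to_production
    if total == 0:
        return None  # no data yet, nothing to conclude
    return caught_by_tests / total

# Example: tests caught 42 bugs last quarter while 8 slipped into
# production, so the filter stopped 84% of the known bugs.
efficiency = filter_efficiency(42, 8)
```

&lt;p&gt;Tracking this ratio over a few release cycles gives a sense of whether the suite's confidence is improving or eroding.&lt;/p&gt;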

&lt;h3&gt;🏎️ Speed: What is fast and slow?&lt;/h3&gt;

&lt;p&gt;How long does it take to run the full test suite when pushing a change to production? If it's measured in minutes, excellent! If it takes hours, that's too bad, and there's a grayscale in between.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frgtwrpkzj4vgcufz3erf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frgtwrpkzj4vgcufz3erf.png" alt="Minutes—great, hours—terrible, and a grayscale in between" width="800" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A common trick to speed things up is to run tests in parallel. A multi-threaded test runner and a powerful server might be enough for a small project. For large projects, it may require a significant investment. For example, Stripe, in its &lt;a href="https://stripe.com/annual-updates/2022" rel="noopener noreferrer"&gt;annual letter&lt;/a&gt;, mentions its test infrastructure as its largest distributed system:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The biggest distributed system at Stripe is our testing system. Stripe now comprises more than 50 million lines of code. Each change is verified within 15 minutes by running a battery of tests that would take 50 days to run on a single CPU.&lt;/p&gt;
&lt;/blockquote&gt;
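&lt;p&gt;The effect of parallelism is easy to see even without CI infrastructure. A toy sketch: running independent simulated "tests" concurrently instead of one by one (in a real project, a runner such as pytest-xdist does this for you):&lt;/p&gt;

```python
import time
from concurrent.futures import ThreadPoolExecutor

def make_test(seconds):
    """Simulate an I/O-bound test that takes a fixed amount of time."""
    def run():
        time.sleep(seconds)
        return True
    return run

tests = [make_test(0.2) for _ in range(8)]

# Sequentially these take roughly 8 * 0.2 = 1.6s; with 8 workers,
# roughly 0.2s, since the simulated tests are independent.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda t: t(), tests))
```

&lt;p&gt;Real suites parallelize well only when tests are isolated—shared databases or global state are what usually forces them back into a serial queue.&lt;/p&gt;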

&lt;h3&gt;🧭 Guiding our effort&lt;/h3&gt;

&lt;p&gt;No matter how good our test suite is, it must never be the only way of achieving production stability. If we do automated testing, that's great, but it doesn't mean we can abandon manual exploratory testing. It doesn't mean we can get rid of code reviews or that there's no need for production monitoring. All these measures are essential and complement each other to give us the level of confidence we need.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frw2dgr5nyasizs78epgw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frw2dgr5nyasizs78epgw.png" alt="Automated and manual testing, code reviews, production monitoring" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When a production incident occurs, we should track down its root cause and ask ourselves: How could it have been prevented? Sometimes, the answer is "write a test," but not always. It could also be improving processes, monitoring, or something else. This will guide our efforts.&lt;/p&gt;

&lt;h2&gt;Bottom line&lt;/h2&gt;

&lt;p&gt;Don't chase a magical test proportion; it doesn't make sense. Instead, look at how you can test each particular feature with good confidence, minimal effort, and reasonable speed, and don't hesitate to use whatever kind of tests fits the job. Happy testing!&lt;/p&gt;

</description>
      <category>testing</category>
      <category>qa</category>
      <category>unittest</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>Why you shouldn't start with a mobile app</title>
      <dc:creator>Denis Stebunov</dc:creator>
      <pubDate>Fri, 21 Mar 2025 15:06:17 +0000</pubDate>
      <link>https://dev.to/stebunovd/why-you-shouldnt-start-with-a-mobile-app-57fj</link>
      <guid>https://dev.to/stebunovd/why-you-shouldnt-start-with-a-mobile-app-57fj</guid>
&lt;description&gt;&amp;lt;p&amp;gt;Many new project founders believe a mobile app is the best way to launch a product. It might seem like a good choice, since mobile usage dominates the digital market. However, a mobile-first approach often carries unnecessary risks at the early stage of assessing an idea, such as spending too much money on a concept that ends up not working.&amp;lt;/p&amp;gt;

&lt;p&gt;MVPs are all about testing hypotheses with a minimal investment. That’s why a web-based app is often a smarter, faster, and more cost-effective approach. Read on and discover why.&lt;/p&gt;

&lt;h2&gt;A website enables faster launch and iterations&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Web development offers the fastest way to develop a solution that works everywhere.&lt;/strong&gt; It's the most cross-platform technology, running on practically every device—smartphones, computers, and even IoT devices like TVs. That makes the development process more manageable—one codebase for all platforms; a single development team; and a strong focus on feature quality instead of having to check feature compatibility on different platforms.&lt;/p&gt;

&lt;p&gt;Once you’ve designed and developed a web app with a responsive interface, you’re good to go on any device. With nearly 68% of global web traffic coming from mobile devices (according to SimilarWeb), targeting a broad mobile audience is a realistic and effective goal that can be achieved through web development without the complexities of native app development.&lt;/p&gt;

&lt;p&gt;What are those complexities? Native mobile development (Swift/Kotlin) requires separate codebases for iOS and Android, leading to higher costs, increased coordination efforts, and more development time spent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8cfyejsmdxh1cx5sn6v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8cfyejsmdxh1cx5sn6v.png" alt="A website enables faster launch" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even with cross-platform frameworks like Flutter or React Native, developing a single project for mobile platforms doesn't automatically provide a web or desktop version.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A website allows instant updates.&lt;/strong&gt; A single codebase hosted on centralized servers can be updated quickly, ensuring that all users run the latest and, most importantly, the same version of the application. This simplifies user support, simplifies server-side maintenance, and eliminates the need to keep outdated APIs for users who haven't updated their apps.&lt;/p&gt;

&lt;p&gt;For native apps, updating is more complex. Separate builds must be created for each platform, multiplying development effort and costs. While cross-platform frameworks help mitigate this, minor platform-specific tweaks might still be required. Moreover, the update process is less controlled: there are delays caused by app store moderation. Also, users often skip or delay updates. As a result, development teams are frequently forced to maintain a cohort of legacy versions, adding complexity and technical debt over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Debugging is easier on the web.&lt;/strong&gt; The server side remains entirely under our control, allowing for thorough monitoring and logging. Browser incompatibilities were a pain in the past, but the situation has improved significantly; modern browsers do a decent job, and it's much easier to work with them now.&lt;/p&gt;

&lt;p&gt;In contrast, native applications must be tested across many devices, OS versions, and screen resolutions. Debugging becomes significantly more complex, as certain issues may only appear on specific devices. Often, the best data available are crash reports, which require deep investigation (which costs time and money) to identify the problem.&lt;/p&gt;

&lt;h2&gt;Websites can do more than you think&lt;/h2&gt;

&lt;p&gt;Modern web apps have evolved over time. Platforms like Figma and Miro are prime examples of how powerful browser-based solutions have become. While they can be downloaded as desktop apps, they are still web applications wrapped in a native shell, delivering a seamless, native-like experience.&lt;/p&gt;

&lt;p&gt;Here are a few examples that illustrate how powerful the web is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Geolocation and GPS.&lt;/strong&gt; Browsers can determine a user's location using GPS on mobile devices and laptops or via IP-based geolocation on other devices. Starbucks' website, for instance, detects a user's country to suggest the closest coffee place to visit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9khetophqz7ur42mwve.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9khetophqz7ur42mwve.png" alt="The Starbucks website asks for a geolocation" width="800" height="567"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Notifications.&lt;/strong&gt; Like native mobile apps, modern browsers support push notifications, allowing web apps to send updates, alerts, and promotional messages even when the website is not actively open. Mobile platforms also support notifications via APIs or with the help of notification providers like OneSignal.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4uftzvg7psq06oj0bxj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4uftzvg7psq06oj0bxj.png" alt="An example of a web notification" width="800" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hardware access.&lt;/strong&gt; Browser APIs enable direct connection with external devices. This allows a simple web page to interact with IoT gadgets, printers, controllers, and specialized equipment without requiring a native app. For example, the Pixel Buds Pro Web Companion uses the Web Serial API to interact with the earbuds, letting users adjust settings such as noise cancellation, the equalizer, in-ear detection, and firmware updates.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft13pctddqpbcdfs5n82k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft13pctddqpbcdfs5n82k.png" alt="The Pixel Buds Pro Web Companion" width="800" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local storage.&lt;/strong&gt; Tools like IndexedDB, File Handling API, and Cache API allow developers to store data locally. For example, the web version of VSCode saves changes in the browser's local storage while working on a remote repository until the next commit is pushed. Another example is the Photopea image editor, which lets users open files from their file explorer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpebxgj69kb60acy171ys.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpebxgj69kb60acy171ys.png" alt="VSCode stores data locally" width="800" height="563"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;WebGL and advanced rendering.&lt;/strong&gt; WebGL libraries like Three.js and Babylon.js deliver high-performance 3D graphics within the browser, enabling interactive visualization, gaming, and AR experiences. Retail giants like Target and Macy's have built web-based room planners that utilize WebGL to help customers visualize furniture layouts before purchasing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbghz8l5cfb7li7czccto.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbghz8l5cfb7li7czccto.png" alt="Target uses WebGL to render their planner" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Camera and microphone.&lt;/strong&gt; Video conferencing platforms like Google Meet and Jitsi Meet demonstrate that real-time video and audio communication can be handled entirely through the browser.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fje1rq6xk0zbwm234jnz5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fje1rq6xk0zbwm234jnz5.png" alt="Google Meet asks for a permission" width="800" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is a common misconception that native mobile apps have full, low-level access to all device capabilities and are therefore a better foundation for an MVP than a web app. However, the list above shows how a growing number of browser APIs is rapidly closing the gap between web-based and mobile apps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Business perspective
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A web-based solution improves unit economics from day one.&lt;/strong&gt; When integrating payments, you keep full control over your choice of payment provider and can negotiate the best terms, supporting both one-time purchases and subscriptions. In contrast, listing an app on the App Store or Google Play means giving up a significant share of revenue to platform fees: 15% or 30%, depending on revenue volume.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Web services make it easier for people to start using them.&lt;/strong&gt; With modern browsers preinstalled on nearly every device, a web app is one click away—whether from a smartphone, laptop, or even a smart TV. Mobile apps, on the other hand, introduce additional steps: navigating to the app store, downloading, and installing the app. This extra friction can discourage users, especially those hesitant to install yet another app for occasional use. Additionally, public app store reviews are highly visible on the app listing, meaning even a few negative experiences can scare potential customers away before they try the product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marketing for web-based services follows a well-established process.&lt;/strong&gt; SEO tools drive organic traffic, and instant updates ensure that new features, product experiments, and marketing campaigns can be rolled out without delay. In contrast, marketing a native mobile app is less predictable. App Store Optimization (ASO) depends heavily on store algorithms, which are largely undisclosed. Users who disable auto-updates may never receive new features, complicating retention efforts. App store review processes further slow down feature releases, and sometimes updates are rejected outright, hurting product metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  When developing a mobile app might make sense
&lt;/h2&gt;

&lt;p&gt;At first glance, a web app seems to outperform a native mobile solution in most cases. However, there are two scenarios where developing a native mobile app as an MVP might be justified:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Mobile-first market launch. In industries where competitors are mobile-first—such as ride-hailing, navigation, or messaging—a native mobile app is necessary to compete effectively. Still, launching a mobile-first product remains a high-risk investment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Expanding an established product. If a proven product or service is already in place, a native mobile app can enhance user experience. However, even in this scenario, the web often serves as a better starting point, allowing for faster iterations and validation before committing to full-scale native mobile development.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Numerous successful products have demonstrated that a strong web-based service can thrive—from simple QR-code restaurant menus to powerful industry-standard tools like Figma. These examples highlight the web's ability to support everything from lightweight services to full-fledged professional applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7iqw372xrd84eucni243.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7iqw372xrd84eucni243.png" alt="Web development vs Mobile development" width="800" height="568"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When developing an MVP, the main goal is to validate hypotheses as fast as possible and then iterate quickly based on users’ feedback. Web applications excel in these areas, enabling rapid development, instant updates, and a seamless cross-platform experience. If the concept proves successful, transitioning to a native mobile app remains an option—but without the upfront risks and constraints of mobile-first development.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>mobile</category>
      <category>startup</category>
    </item>
    <item>
      <title>Better than estimates</title>
      <dc:creator>Denis Stebunov</dc:creator>
      <pubDate>Thu, 13 Mar 2025 17:07:24 +0000</pubDate>
      <link>https://dev.to/stebunovd/better-than-estimates-2i0j</link>
      <guid>https://dev.to/stebunovd/better-than-estimates-2i0j</guid>
<description>&lt;p&gt;Estimates are ubiquitous in software development. People routinely use them for planning, prioritization, and managing expectations. All these activities are crucial for project management, but are estimates really the best tool for the job? Surprisingly, the answer is often "no." In this post, we'll explore how estimates slow us down and what alternative tools we could use instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Due dates, a.k.a. Deadlines if we need to fit into a certain timeline and/or budget;&lt;/li&gt;
&lt;li&gt;T-shirt sizing for task prioritization;&lt;/li&gt;
&lt;li&gt;Regular check-ins to identify stuck progress, minimize waste, and manage expectations better;&lt;/li&gt;
&lt;li&gt;Postponed estimates when you just need a classic estimate.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Estimates slow us down
&lt;/h2&gt;

&lt;p&gt;One of the most unpleasant effects of relying too much on estimates is that they slow the team down. Here's how it happens.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Initial estimates are very uncertain
&lt;/h3&gt;

&lt;p&gt;Managers are most interested in estimates before starting work on a task or a project, which is quite understandable. Unfortunately, this is also the time when estimates are the least reliable. This effect is known as the Cone of Uncertainty: the more we work on the project, the better we understand how long it might take:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5w2i8ckb7vmg76liz0r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5w2i8ckb7vmg76liz0r.png" alt="Image description" width="800" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There were studies on how big that initial estimation error might be. It turns out that in software projects, it can be up to 4x off in either direction. For example, when people say "a couple of weeks," it could be anywhere between 2-3 days and a couple of months! Such a huge range may sound surprising to some, but ask any seasoned developer, and they will recall numerous cases from their practice when a "quick one-day task" ends up taking many weeks.&lt;/p&gt;
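&lt;p&gt;To put that error factor into numbers, here's a tiny illustrative sketch (the 4x factor is the upper bound mentioned above, not a universal constant):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function estimateRange(estimateDays) {
  // Initial estimates can be up to 4x off in either direction
  const ERROR_FACTOR = 4;
  return {
    best: estimateDays / ERROR_FACTOR,
    worst: estimateDays * ERROR_FACTOR,
  };
}

// "A couple of weeks" (14 days) may in practice mean anything
// from about 3.5 days to about 56 days (two months):
console.log(estimateRange(14)); // { best: 3.5, worst: 56 }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;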

&lt;h3&gt;
  
  
  2. Developers quickly learn to overestimate
&lt;/h3&gt;

&lt;p&gt;Of course, no one wants to deal with estimates that range from a few days to a few months. Managers ask for estimates because they need something to rely on, and such a wide range isn't helpful in this regard. They expect estimates to be more precise and perceive them as a kind of promise. If a developer is lucky enough to finish the task faster than expected, that's great, and no one will complain. But if the work is delayed, people may become frustrated. So there's a natural incentive for developers to overestimate—it's just safer for them.&lt;/p&gt;

&lt;p&gt;At first, developers tend to be optimistic about timelines, but they quickly learn that their estimates often fail, which could be an unpleasant experience. They realize that it makes sense to add a safety margin, and estimates go up.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Work expands so as to fill the time available for its completion
&lt;/h3&gt;

&lt;p&gt;If developers overestimate, they should complete most of their work earlier than expected, right? Nope! This is where Parkinson's Law comes into play:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Work expands so as to fill the time available for its completion.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So this is how estimates slow us down - developers overestimate, and then that pessimistic estimate becomes the earliest date when the work could be finished. Unfortunately, there's no easy workaround for it. Pressuring developers to give lower estimates demotivates the team (why are you asking us if you already know how long it must take?) and will backfire because there are reasons why they overestimate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bonus: additional slowdown due to introduced misconceptions
&lt;/h3&gt;

&lt;p&gt;Estimates also introduce dangerous misconceptions that further slow down a process that heavily relies on them. As Frederick Brooks famously noted in his book The Mythical Man-Month,&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Men and months aren't interchangeable&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If one developer can complete a project within a month, it doesn't mean another developer can do the same. Maybe they can or maybe not; it depends on their skill, experience, productivity, and how well they know this project area. However, this important aspect is often neglected. For example, in SCRUM, people plan their work relying on universal task estimates and "team capacity," while in reality, performance may vary significantly depending on who is assigned to which task.&lt;/p&gt;

&lt;p&gt;This contributes to the list of reasons why people reserve extra time in project plans to handle various "unexpected situations." But for whom is this situation "unexpected"? Most managers would agree that, generally speaking, men and months aren't interchangeable. However, it's really hard to account for that in frameworks like SCRUM, so people consciously choose to ignore it and just add extra time to hide planning inefficiencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Predictability comes at a cost
&lt;/h2&gt;

&lt;p&gt;It is understandable that people want their software development process to be predictable. Many organizations go to great lengths to achieve this, for example, by estimating all of their work, calculating team "velocity," introducing burndown charts, and encouraging developers to give "more accurate" estimates. The truth is, it has its price, and the price is slowness.&lt;/p&gt;

&lt;p&gt;Software development has some unpredictability by nature because it's not a fully repeatable process. Of course, there are some routine tasks, but developers don't spend too much time on them due to code reuse and automation. Most of their time is spent on something new. It could be working with a new library or an API that they didn't use before or debugging a tricky bug. Developers simply don't know how long it might take, so their only way to achieve predictability is to overestimate.&lt;/p&gt;

&lt;p&gt;For some companies, sacrificing speed is a reasonable tradeoff for predictability. However, if you are interested in moving faster, it makes a lot of sense to accept that software development can't be fully predictable. We'd better allow some degree of variation and uncertainty. That doesn't mean that the project would become chaotic and unmanageable, though. In the chapters below, we'll explore techniques for managing projects without relying on estimates too much. We aren't going to throw away estimates completely. It can be a useful tool, but it should not be the only tool in the manager's toolbox.&lt;/p&gt;

&lt;h2&gt;
  
  
  Meet deadlines
&lt;/h2&gt;

&lt;p&gt;Let's talk about situations when we need to fit into a certain budget and/or timeline. For example, we've got a small "idea-stage" investment for our startup, and we have a few months to demonstrate its potential to get more funding. Another example - we're running an e-commerce store and would like to upgrade our discount and bonus program before Black Friday. These are examples of natural deadlines, when we need to get something out of the door before the deadline, or the world will move on without us. In such situations, we'd better operate in the "deadline mode" rather than "estimate mode."&lt;/p&gt;

&lt;h3&gt;
  
  
  Deadlines aren't bad
&lt;/h3&gt;

&lt;p&gt;Deadlines may sound scary. They're often associated with stress, burnout, cutting corners, or poor planning overall. Even the word itself sounds bad—we certainly don't want anyone to die! Such a bad reputation pushes people away from deadlines towards "safer" estimates, which I think is very unfortunate.&lt;/p&gt;

&lt;p&gt;Of course, there are many ways to abuse deadlines, like setting arbitrary deadlines or asking for estimates and then treating them as deadlines. However, as shown above, there are plenty of ways to abuse estimates, too. Estimates aren't "safer". Neither deadlines nor estimates are a bad concept. They are just tools, and how we use these tools makes all the difference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Estimates vs. deadlines
&lt;/h3&gt;

&lt;p&gt;On the surface, estimates and deadlines may look like close concepts. They both are about the project completion time, but they work very differently. When we ask for an estimate, we assume that the project scope is more or less fixed and the completion time is flexible. When we set a deadline, the time is fixed, but we must accept that the scope may vary:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnnlhwgg4u4ec4z65s0a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnnlhwgg4u4ec4z65s0a.png" alt="Image description" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It might be tempting to fix both scope and time, but experienced managers know this is a bad idea. Either we won't be able to complete everything, or there will be problems with quality, or the timeline must be so relaxed that the project won't look viable from the business perspective. So, when setting a deadline, allow flexibility in the scope.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deadlines facilitate prioritization
&lt;/h3&gt;

&lt;p&gt;When we might not have enough time to complete everything, we naturally ask, "Okay, what's the most important?" Great question! There are usually many ways of achieving the same goal when building software, and not all features and requirements are equally valuable. Time constraints force us to set priorities, and clearly defined priorities significantly reduce the project risks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Meeting deadlines reliably
&lt;/h3&gt;

&lt;p&gt;The approach that we use at ivelum when we need to meet a deadline is a combination of the Fix Time and Budget, Flex Scope technique, and Continuous Delivery:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write down all known requirements as separate features that we could ship independently;&lt;/li&gt;
&lt;li&gt;Prioritize the features from the most important to the least important;&lt;/li&gt;
&lt;li&gt;Establish conditions for Continuous Delivery - ensure that the production environment and CI pipelines are ready;&lt;/li&gt;
&lt;li&gt;Ship updates frequently, starting with the most important features, and keep the system fully functional at all times.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Continuous Delivery helps further de-risk the project by ensuring that the MVP lands in production as early as possible. After that, we continue adding features to it, trying to build as much as possible before the deadline and continuously testing the new system as a whole.&lt;/p&gt;

&lt;h3&gt;
  
  
  When deadlines work best
&lt;/h3&gt;

&lt;p&gt;Deadlines work best when there's a natural deadline that we can easily explain to the team, and explaining it is important. First, we need to get a commitment from the team, and people are reluctant to commit to arbitrary deadlines (which is quite understandable). Second, it guides our prioritization. Knowing why the deadline is set helps us understand what would be critical to ship by that date.&lt;/p&gt;

&lt;p&gt;However, deadlines shouldn't be ubiquitous, even with the best explanations and reasoning. Definitely don't put a deadline on every task 🙂&lt;/p&gt;

&lt;h2&gt;
  
  
  T-shirt sizing
&lt;/h2&gt;

&lt;p&gt;Another common reason people ask for estimates is prioritization. The logic is simple: if it's a quick task, do it now; if it takes significant time, maybe postpone it. While implementation time shouldn't be the only criterion, having this information can be helpful, especially for non-technical managers.&lt;/p&gt;

&lt;p&gt;T-shirt sizing provides a very rough estimate of how large the task could be - S (small), M (medium), or L (large). It allows managers to compare tasks to each other and helps make decisions on their priorities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuc828rspr229oqfq6opy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuc828rspr229oqfq6opy.png" alt="Image description" width="800" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the task looks large, try to explain why it looks large. Can it be simplified so that we can achieve the same goal with less effort? Quite often, this is possible.&lt;/p&gt;

&lt;p&gt;Please note that it's very important to avoid linking these sizes to the estimated completion time. As soon as people start to associate these sizes with concrete date ranges, we are back to classic estimates with all their shortcomings.&lt;/p&gt;

&lt;p&gt;Story points in SCRUM tried to combine the same idea with better predictability, and they utterly failed. Everyone quickly learned to convert these magic numbers into days, which is unsurprising, as they're used for planning over fixed periods (sprints). Even the inventor of story points regrets creating them. Don't repeat this mistake; don't tie t-shirt sizes to date ranges.&lt;/p&gt;

&lt;h2&gt;
  
  
  Regular check-ins
&lt;/h2&gt;

&lt;p&gt;Sometimes, managers treat an estimate as a checkpoint for when to communicate with the developer next:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Hey, how long it might take?
- Maybe a couple of weeks?
- Okay, I'll check back in a couple of weeks.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It may sound reasonable at first—we don't want to distract developers while they're working, so why not check back when they're supposed to finish? The problem is that it could be too late. There are many unknowns that developers unpack one by one as they progress through the task.&lt;/p&gt;

&lt;p&gt;Frequent discussions of progress help identify bottlenecks and allow us to adjust requirements on the fly if it turns out that the original task is more complicated than we initially thought. The optimal check-in schedule may vary from one team to another. Some use daily standups, while others prefer weekly or another schedule. Standups are usually associated with SCRUM, but you don't have to go full-SCRUM to do standups—just do standups! When done right, they minimize waste of time and keep the manager well-informed about the real situation on the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  What if we actually need an estimate?
&lt;/h2&gt;

&lt;p&gt;Even if we effectively use all of the tools above, there could be situations where we still need to get a classic estimate to communicate it to clients or stakeholders. In such cases, ask for it later, if possible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmol4zf76m7vpnqcz5nb9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmol4zf76m7vpnqcz5nb9.png" alt="Image description" width="800" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Remember the Cone of Uncertainty: the more we work on the project, the better we understand when it can be completed. Before developers start, they may have no idea. However, as they make progress, it becomes more evident what it'll take to complete the project, and we'll likely get a more reliable estimate that is safer to communicate to someone else (with a safety margin on top of it, of course 🙂).&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Estimates are overused in software engineering, which often leads to poor planning and underwhelming team efficiency. While estimates have their place, they shouldn't be the only tool in the manager's toolbox. There are other tools that work better in many situations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you have a timeline to meet - use deadlines! Being flexible with scope, rigorous prioritization, and continuous delivery is a recipe for success;&lt;/li&gt;
&lt;li&gt;If you need to decide what to work on next - consider T-shirt sizing estimates, and if the task looks large, try to figure out why;&lt;/li&gt;
&lt;li&gt;There's no need to ask for an estimate to decide when to check on the task next - just check it regularly, no matter what, and the work will progress more reliably;&lt;/li&gt;
&lt;li&gt;Finally, if you just need a classic estimate measured in days - see if you can give developers a few days to work on the task before asking for an estimate. It significantly improves the chances of getting a more accurate figure.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>management</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Update with no fear — achieving zero-downtime deployment</title>
      <dc:creator>Denis Stebunov</dc:creator>
      <pubDate>Tue, 18 Feb 2025 17:03:24 +0000</pubDate>
      <link>https://dev.to/stebunovd/update-with-no-fear-achieving-zero-downtime-deployment-5h7b</link>
      <guid>https://dev.to/stebunovd/update-with-no-fear-achieving-zero-downtime-deployment-5h7b</guid>
      <description>&lt;p&gt;During a website or web application update, there’s a risk of downtime — a potential trigger for a downward spiral of problems. Business stakeholders often worry that users will notice disruptions and leave, leading them to delay updates “for a more convenient time.” This slows development, replacing small, frequent releases with larger, riskier batches.&lt;/p&gt;

&lt;p&gt;The bigger the update, the more things can go wrong. It also demands more time for debugging and testing. Over time, this cycle fuels tension in communication between developers and business teams.&lt;/p&gt;

&lt;p&gt;The good news is that it’s pretty much avoidable. Zero-downtime deployment addresses these challenges head-on. It allows teams to return to a rhythm of small, frequent updates, keeping development agile and the business side happy.&lt;/p&gt;

&lt;p&gt;Today, we’ll dive into the core principles of zero-downtime deployment for web applications. We use these methods at ivelum extensively, and while there could be some edge cases not covered here, we believe this post covers the basics for most projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up automated monitoring to know where problems occur
&lt;/h2&gt;

&lt;p&gt;The first step to zero-downtime deployment is tracking downtime before users notice it and start complaining. Many things could be monitored. If you're unsure where to start - take a look at &lt;a href="https://ivelum.com/blog/lean-monitoring/" rel="noopener noreferrer"&gt;The Starter Pack&lt;/a&gt;, which covers the essential needs of a small-to-medium project.&lt;/p&gt;

&lt;p&gt;Once monitoring is in place, what exactly can go wrong during a deployment? A problem could occur in any main component: the backend, the frontend, or the database. Let’s take a closer look.&lt;/p&gt;

&lt;h2&gt;
  
  
  Handling DB migrations without downtime
&lt;/h2&gt;

&lt;p&gt;Some application changes require an upgraded database to work correctly. Database migration must be the first step in our CD pipeline, prior to updating the backend or doing anything else:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Update the database;&lt;/li&gt;
&lt;li&gt;Update the backend;&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;li&gt;PROFIT!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This sequence is important because new backend app instances won’t work correctly if the DB isn’t updated, so if we care about zero downtime, we must update the database first. How we launch the database migration script is also important. It might be tempting to run it as a part of the backend application startup like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;run-db-migrations&amp;gt; &amp;amp;&amp;amp; &amp;lt;launch-the-app&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Please &lt;strong&gt;DON'T DO THIS&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It won't cause any problems if we run just a single instance of the app, but it's a terrible idea in production environments with multiple app instances. We should launch the DB migration script exactly once, and if it fails, the whole deployment process must be stopped and rolled back. For example, if the DB migration fails due to a timeout on a heavy SQL operation, we shouldn't retry it again with the next app instance launch, or we can cause severe downtime by those multiple attempts. We should roll back immediately after the first failure, investigate, fix the migration script, and only then try again.&lt;/p&gt;
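&lt;p&gt;The required ordering can be sketched as a single orchestration step that runs once per deployment, outside of the app instances. The &lt;code&gt;migrate&lt;/code&gt;, &lt;code&gt;updateBackend&lt;/code&gt;, and &lt;code&gt;rollback&lt;/code&gt; functions below are placeholders for whatever your CD pipeline actually does:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Runs once per deployment, not per app instance
async function deploy(steps) {
  try {
    // Step 1: run DB migrations exactly once, before anything else
    await steps.migrate();
  } catch (err) {
    // On failure: roll back and stop the whole deployment.
    // Never retry the migration from individual app instances.
    await steps.rollback();
    throw err;
  }
  // Step 2: only after a successful migration, update the backend
  await steps.updateBackend();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;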

&lt;p&gt;Next, we must carefully plan what to include in the DB migration script. Downtime during a DB migration usually comes from one of two main causes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Breaking backward compatibility with the previous version of the app;&lt;/li&gt;
&lt;li&gt;Heavy DB operations that overload the server.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The good news is that both problems can be successfully mitigated. However, there are many nuances that are worth a separate post, and we actually covered them in one of our &lt;a href="https://ivelum.com/blog/zero-downtime-db-migrations/" rel="noopener noreferrer"&gt;articles&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying the backend
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficb1qfj9dkca9aryy5nz.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficb1qfj9dkca9aryy5nz.gif" alt="Image description" width="760" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When deploying backend code updates, an application server restart is typically required to load the new code into memory. During the restart, some requests may be lost, causing downtime. In most modern projects, it’s handled with container orchestration. The main principle is simple: launch new application instances in parallel with the old ones and then switch traffic on the load balancer. Strategies may vary. It can be a &lt;a href="https://kubernetes.io/docs/tutorials/kubernetes-basics/update/update-intro/" rel="noopener noreferrer"&gt;rolling update&lt;/a&gt; when we launch new instances one by one, or we could replace them all at once (so-called &lt;a href="https://docs.aws.amazon.com/whitepapers/latest/overview-deployment-options/bluegreen-deployments.html" rel="noopener noreferrer"&gt;blue-green deployment&lt;/a&gt;).&lt;/p&gt;
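&lt;p&gt;As an illustration, in Kubernetes the rolling update strategy is configured on the Deployment object (the name and the numbers below are made up):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0  # keep all old instances serving traffic
      maxSurge: 1        # launch one new instance at a time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;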

&lt;h2&gt;
  
  
  Updating the frontend
&lt;/h2&gt;

&lt;p&gt;Depending on how we built our frontend, we may have more or fewer items to care about. Let's start with the most obvious, which is relevant to almost everyone.&lt;/p&gt;

&lt;h3&gt;
  
  
  Updating caches—browser and CDN
&lt;/h3&gt;

&lt;p&gt;So-called "static resources," including CSS, JavaScript, and images, are usually cached in browsers and CDNs, which is a good thing. Static resources rarely change, so we can significantly speed up our web app if we don't load them again with every page request. However, sometimes they might change—when we update the app, and in this case, old cached versions may become a problem. If they're not updated, they can cause broken app layouts or functionality.&lt;/p&gt;

&lt;p&gt;Fortunately, the fix is relatively straightforward. For every static resource that might ever change, let's update its URL with every new app version. For example, we can do this by adding the app version (or CI build number) to the query string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;html&amp;gt;
  ...
  &amp;lt;!-- Note "7503" in the line below - it's a CI build number: --&amp;gt;
  &amp;lt;script src="https://app.example.com/bundle.js?7503"&amp;gt;&amp;lt;/script&amp;gt;
  ...
&amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the app update, the backend returns HTML with new links that neither the users' browsers nor our CDN has seen before, effectively forcing them to load the latest content.&lt;/p&gt;
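&lt;p&gt;The URL rewriting itself is usually handled by the bundler or template engine, but a hand-rolled sketch might look like this (assuming the URL has no query string yet):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// BUILD_NUMBER would typically come from the CI environment
const BUILD_NUMBER = '7503';

function withBuildNumber(url) {
  // Assumes the URL has no query string yet
  return url + '?' + BUILD_NUMBER;
}

console.log(withBuildNumber('https://app.example.com/bundle.js'));
// https://app.example.com/bundle.js?7503
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;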

&lt;h3&gt;
  
  
  Updating SPA
&lt;/h3&gt;

&lt;p&gt;But what if we built our frontend as a Single-Page Application (SPA)? By default, it doesn't do a full page reload when we navigate the app. If the app update happens in the middle of a user's session, they will still have the old version of the frontend app loaded in their browser, but the backend is already new, which may cause problems. Of course, if we did our homework and updated the caches as described above, these problems would disappear as soon as they refresh the page, but how do they know they need to refresh? Some users keep their browser tabs open for days or even weeks.&lt;/p&gt;

&lt;p&gt;Ideally, the frontend app should somehow receive a signal that a new version is available. After receiving such a signal, it can reload itself automatically so that users don't have to do anything and can continue to work normally. This idea can be implemented in various ways. Let's see a concrete example of how we did it in one of our projects, &lt;a href="https://teamplify.com/" rel="noopener noreferrer"&gt;Teamplify&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsx7czqd16xrbgg0srlvc.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsx7czqd16xrbgg0srlvc.gif" alt="Image description" width="520" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both the frontend and backend use the build number from the CI server to be aware of their version. The backend includes its version in the HTTP response headers for every API request. The frontend compares the version from the backend with its own. If they don’t match, it triggers an update.&lt;/p&gt;

&lt;p&gt;But the update is not just an immediate forced reload; we don’t want to frustrate our users or risk losing unsaved data they might have entered on the page. Instead, we use a deferred reload. It means we schedule the reload to trigger at the next convenient moment, which can be one of the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When the user clicks a link. Usually, this action performs without a full page reload, but if the frontend is outdated, we adjust the router so that the next link click triggers a full reload.&lt;/li&gt;
&lt;li&gt;When an application error has already occurred on the frontend. Since there is an error, and we know a new version is available, it makes perfect sense to reload the page—this might even resolve the issue.&lt;/li&gt;
&lt;/ul&gt;
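&lt;p&gt;The mechanism above fits in a few lines of JavaScript. The sketch below is illustrative and not Teamplify's actual code; the &lt;code&gt;X-App-Version&lt;/code&gt; header name and the version values are assumptions:&lt;/p&gt;

```javascript
// Sketch of the deferred-reload idea: remember that a newer backend
// version was seen, then reload at the next convenient moment.
const CURRENT_VERSION = '1042'; // injected at build time by the CI server
let reloadPending = false;

// Call this with the version header from every API response, e.g. with
// response.headers.get('X-App-Version').
function checkVersion(backendVersion) {
  if (!backendVersion) {
    return; // header missing, nothing to compare
  }
  if (backendVersion !== CURRENT_VERSION) {
    reloadPending = true; // defer the reload to a convenient moment
  }
}

// Called by the router on link clicks and by the global error handler;
// in a browser, the reload callback would be () => window.location.reload().
function maybeReload(reload) {
  if (reloadPending) {
    reload();
    return true;
  }
  return false;
}
```

&lt;p&gt;The router and the global error handler would call &lt;code&gt;maybeReload&lt;/code&gt; at their respective "convenient moments," passing a callback that performs the full page reload.&lt;/p&gt;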

&lt;h3&gt;
  
  
  Surviving rolling updates
&lt;/h3&gt;

&lt;p&gt;When we gradually update our backend instances, there might be a situation when an API request from the new frontend app hits an old backend instance. Here's the problematic sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We started updating the app using a rolling update, and it's in progress. Some of the app instances are already new, and some are still old;&lt;/li&gt;
&lt;li&gt;The user's first request hits a new backend instance, and it responds with a link to the new version of the frontend app. The new frontend app is loaded into the user's browser;&lt;/li&gt;
&lt;li&gt;The new frontend app sends an API request, and since some old backend instances are still running, it might hit one of them. If the new frontend is incompatible with the old backend, it might result in an error.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For the Teamplify project, we initially relied on so-called "sticky sessions" to mitigate this problem. The idea is that the load balancer "remembers" (via special cookies) from which backend instance a user's request was originally served and routes all subsequent requests from this user session to the same instance. While it worked well for preventing problems during the app update, we found that it resulted in sub-optimal workload distribution across backend instances. If one client generated too many requests, they all were directed to the same instance, creating a "hot spot" and causing app slowness.&lt;/p&gt;

&lt;p&gt;After researching possible solutions, we decided to avoid rolling updates altogether—we switched to the "blue-green" deployment strategy where backend instances are replaced all at once, and the problem disappeared. This solution might not be for everyone since it requires extra capacity on the app servers. Before switching to the new version, all new backend instances must be up and running in parallel with the old ones. Depending on your project infrastructure, you may or may not have such extra resource capacity. For Teamplify, we had such a capacity, so it was an easy solution.&lt;/p&gt;

&lt;p&gt;Alternative solutions may include some logic built into the app, for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Require a specific API version in all requests coming from the frontend app and route them accordingly at the load balancer;&lt;/li&gt;
&lt;li&gt;Implement retry logic in the frontend app. If it receives a response from an older backend instance, it should try again after a while, hoping that the request will be served from a new instance next time.&lt;/li&gt;
&lt;/ul&gt;
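&lt;p&gt;The retry idea can be sketched as follows (illustrative only; the header name, defaults, and error handling are assumptions):&lt;/p&gt;

```javascript
// Retry an API call while an old backend instance keeps answering,
// giving the rolling update time to finish.
const CURRENT_VERSION = 1042; // the frontend's own build number

async function fetchWithRetry(doRequest, retries = 3, delayMs = 2000) {
  let attemptsLeft = retries + 1;
  while (attemptsLeft > 0) {
    attemptsLeft -= 1;
    const response = await doRequest();
    const backendVersion = Number(response.headers['x-app-version']);
    if (backendVersion >= CURRENT_VERSION) {
      return response; // served by a new (or same-version) instance
    }
    // An old instance answered; wait before trying again.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error('Backend is still outdated after all retries');
}
```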

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this post, we covered the basic principles of zero-downtime deployment for all major web app components: the backend, the frontend, and the database. The above strategies apply to most web projects, enabling engineers to roll out updates frequently without downtime.&lt;/p&gt;

&lt;p&gt;However, this is not an exhaustive guide, and there's a chance you may run into a tricky situation that is not covered here. If so, please feel free to reach out to us via this website or on X, and we'll be happy to brainstorm with you.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>You might not need staging</title>
      <dc:creator>Denis Stebunov</dc:creator>
      <pubDate>Wed, 12 Feb 2025 12:23:07 +0000</pubDate>
      <link>https://dev.to/stebunovd/you-might-not-need-staging-8dl</link>
      <guid>https://dev.to/stebunovd/you-might-not-need-staging-8dl</guid>
      <description>&lt;p&gt;Many engineering teams test new features on staging before pushing them to production. There’s no doubt that testing is crucial, but is staging really the best place to test? Let’s dive in.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we call a "staging environment"
&lt;/h2&gt;

&lt;p&gt;Staging is a controversial topic among developers. Some argue that it's essential, others that it's unnecessary, and even the term "staging" is ambiguous—different people mean different things by it. In this post, we'll call "staging" an application environment that satisfies two criteria:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It's a separate environment, which is not a developer environment and not production;&lt;/li&gt;
&lt;li&gt;Developers use this environment to demonstrate their work to others to get approval to push it to production.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There could be situations when only one of these criteria is met but not the other. For example, if developers showcase their work on their machine or in production, we assume there's no staging since there's no separate environment.&lt;/p&gt;

&lt;p&gt;Another example is if a developer has deployed a separate environment to test something themselves and isn't waiting for anyone's approval. We also assume this is not a staging environment because "waiting for approval" is crucial to our definition. Bear with me; you'll see why we use these two criteria soon.&lt;/p&gt;

&lt;h2&gt;
  
  
  How staging affects code workflows
&lt;/h2&gt;

&lt;p&gt;Since staging is a separate environment, it makes sense to have a separate code branch for it. So we'll have at least two long-living branches, probably called "main" for production and "stage" for staging. Ideally, they both should have a CI pipeline configured for automated deployments to production and staging.&lt;br&gt;
Developers would start working on new features in separate branches and then merge them to staging for review:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxp30jq3otw7vtij3ras.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxp30jq3otw7vtij3ras.png" alt="Feature branches merged to staging for review" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So what happens next? The review might take some time—maybe a few days or even weeks. Of course, developers won't waste time simply waiting; they'll work on something else in the meantime. As a result, having multiple features deployed to staging is pretty common. Some are pending review, others are actively developed based on previous feedback, and some features are ready: we have a green light to deploy them to production.&lt;/p&gt;
&lt;h2&gt;
  
  
  Deploying to production... not exactly what we tested!
&lt;/h2&gt;

&lt;p&gt;When a feature is approved for production deployment, developers merge its feature branch back into the main branch, and it gets deployed to production:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyvf9ie4tlq2carmqxv8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyvf9ie4tlq2carmqxv8.png" alt="Feature branch merged back to main" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And here's the problem. We tested that feature on the staging environment &lt;strong&gt;together with other features&lt;/strong&gt; that were still on staging but not in production yet. It shouldn't be an issue if all these features were completely independent. However, in the real world, dependencies do happen.&lt;/p&gt;

&lt;p&gt;As a result, we're deploying a slightly different version of our app to production—not exactly what we've tested on staging. It does not sound very reliable, and with active development, we will run into a problem sooner or later. Could we do a better job?&lt;/p&gt;
&lt;h2&gt;
  
  
  What about dedicated feature environments?
&lt;/h2&gt;

&lt;p&gt;Some teams use so-called "feature environments" that usually come with automation. A special script creates a new clean staging environment for every pull request so that we can test it separately. The script automatically deploys all pull request updates to its dedicated environment and cleans up the environment after the pull request is closed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnsp8orpg8kamt7r85laq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnsp8orpg8kamt7r85laq.png" alt="Dedicated feature environments" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This setup is more complicated and requires more resources since we now run multiple staging environments simultaneously. However, it brings certain benefits for developers. First, there's no need to merge to the staging branch anymore, which is faster, and there are fewer code conflicts to resolve. Second, we can test each feature independently, mitigating possible dependencies between features.&lt;/p&gt;

&lt;p&gt;I'd happily report that it solves all problems, but unfortunately, it doesn't. The biggest issue persists—we might be testing a slightly different version of the app from the one that would go to production. Reviews aren't instant. Features may hang on review for a while, maybe for a few days or even weeks. Chances are, something else may land in production in the meantime, and the longer the feature is reviewed, the bigger the gap becomes between its code&lt;br&gt;
and the main branch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3tmox59w0penbyd9o5x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3tmox59w0penbyd9o5x.png" alt="A long-lived feature branch is merged to main" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, while this workflow offers some benefits over a single staging environment, it still has a major flaw—we aren't testing exactly the same code that goes into production.&lt;/p&gt;
&lt;h2&gt;
  
  
  Challenges with test data
&lt;/h2&gt;

&lt;p&gt;Besides the code, one more thing directly affects the quality of our testing—the data. If the data in our testing environment significantly deviates from production data, it might result in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bugs:&lt;/strong&gt; In production, our code might run into an unexpected value that was missing in the test data;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance issues:&lt;/strong&gt; The more data our code has to handle, the slower it will be. If we didn't test the code on real production datasets, there's an elevated risk of performance problems in production;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UX problems:&lt;/strong&gt; Even if our designers don't use "Lorem ipsum" and try to build realistic-looking designs, it might still be hard to catch all the corner cases. For best results, we should test user interfaces with real data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Naturally, the closer our test data are to actual production data, the better. But how do we ensure that?&lt;/p&gt;

&lt;p&gt;While restoring a production DB backup on staging may sound straightforward, it presents several challenges in practice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Taking care of DB migrations.&lt;/strong&gt; Some of the features currently deployed on staging may require changes in the database. They'll become broken if we simply overwrite the previous DB state with a new backup, so after restoring the backup, we need to apply the migrations again.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It may be harder to reproduce bugs.&lt;/strong&gt; When submitting bug reports, people often include "steps to reproduce," which may depend on a particular DB state. If we regularly overwrite the DB with a fresh backup, this state may be lost, making it harder for developers to investigate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How long would it take to restore the backup?&lt;/strong&gt; We're lucky if the production DB is small and its backup takes just a few minutes to restore. But what if the production DB size is measured in terabytes and takes many hours to restore?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The last item could be especially painful for&lt;br&gt;
feature environments. Does it mean we'll need to launch a powerful server for each pull request and wait many hours before we can even start testing a new feature?&lt;/p&gt;
&lt;h2&gt;
  
  
  Staging breaks Continuous Integration
&lt;/h2&gt;

&lt;p&gt;The main idea of Continuous Integration is that developers merge their work into the main code branch early and often, and it's all being tested together. Instead of testing multiple slightly different versions of the app, we focus all our efforts on testing a single main version, and as a result, we can test it better. This is why so many companies are chasing it.&lt;/p&gt;

&lt;p&gt;Staging forces developers to wait and not merge their work until it's "fully tested." The thing is, it's impossible to "fully test" it in isolation when multiple developers are working on the same codebase.&lt;/p&gt;
&lt;h2&gt;
  
  
  There must be a better way
&lt;/h2&gt;

&lt;p&gt;While we can work around some of the staging problems using better workflows, clever automation, and (maybe) a lot of computing resources, the main issue remains fundamentally unsolved. On staging, we're testing a slightly different version of the app in slightly different conditions compared to production.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0xcvy4pt9iu1daduf4e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0xcvy4pt9iu1daduf4e.png" alt="I don't always test my code, but when I do, I do it in production" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What if, instead of spending time and effort on improving our staging, we invest in safe testing in production? There will be other challenges, of course, but there will be some serious benefits as well:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We are testing exactly the same code that goes into production because it is in production;&lt;/li&gt;
&lt;li&gt;Our test data always matches the production data because it is the production data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Challenges:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We don't want users to see the unfinished work, which we don't consider as "done" yet;&lt;/li&gt;
&lt;li&gt;We don't want to disrupt the end-user experience with our testing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Despite the challenges, for many projects it's actually easier to overcome them than to keep fighting the inherent problems of staging.&lt;/p&gt;
&lt;h2&gt;
  
  
  Is anyone else doing it?
&lt;/h2&gt;

&lt;p&gt;Oh yes. Literally, everyone tests in production, even those who don't openly admit it. Modern software is very complex, and the complexity only increases with time. It's practically impossible to ship a piece of software that would be completely bug-free, even after thorough testing. That's why we see numerous bug fixes in subsequent releases, and that's why production monitoring and crash-reporting is a&lt;br&gt;
&lt;a href="https://sentry.io/about/press-releases/sentry-raises-90-million-in-series-e-funding-to-expand-and-drive-adoption-of-developer-first-application-monitoring/" rel="noopener noreferrer"&gt;multi-billion dollar business&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv5tfvu81answywqn1bda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv5tfvu81answywqn1bda.png" alt="Who tests in production: Google, Netflix, OpenAI, and everyone else" width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, let's face it - testing in production is unavoidable, but the big&lt;br&gt;
difference is how it's performed and controlled. Maybe you participated in an Early Access Program offered by many companies, e.g., OpenAI or JetBrains, to get first feedback from early adopters. Or, maybe you've heard of &lt;a href="https://cloud.google.com/deploy/docs/deployment-strategies/canary" rel="noopener noreferrer"&gt;canary deployments&lt;/a&gt;,&lt;br&gt;
used by Google and many other companies, or about &lt;a href="https://netflix.github.io/chaosmonkey/" rel="noopener noreferrer"&gt;Netflix's Chaos Monkey&lt;/a&gt;, which deliberately breaks things in production to ensure that the system is resilient and can tolerate that.&lt;/p&gt;

&lt;p&gt;Over the years, the best engineering teams have embraced testing in production and developed numerous techniques for doing it safely and efficiently. We'll cover some of these techniques below.&lt;/p&gt;
&lt;h2&gt;
  
  
  Meet Feature Flags
&lt;/h2&gt;

&lt;p&gt;A feature flag (or a feature toggle) is a very simple concept. It's just a condition in the code, an "if" statement that runs the new program behavior under a specific condition and the old behavior otherwise:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;javascript
if (condition) {
  newBehavior();
} else {
  oldBehavior();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Despite its simplicity, it's a powerful and flexible tool that allows us to test things in production in numerous different ways. We can keep the original program behavior for most users and, at the same time, test how the new behavior works. Let’s see some practical examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example 1: Emulating staging
&lt;/h2&gt;

&lt;p&gt;If our team is used to working with staging, what if we organize staging right in production? It can work as follows: we add a second domain that points to the production app, e.g., &lt;code&gt;staging.app.example.com&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ahsmmj8e1xa98d9pcyk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ahsmmj8e1xa98d9pcyk.png" alt="Staging in production" width="800" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And in the code, we use a feature flag called &lt;code&gt;STAGING&lt;/code&gt; that is enabled depending on the domain from which we open the app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;javascript
const STAGING = domain.startswith('staging.');

...

if (STAGING) {
  newBehavior();
} else {
  oldBehavior();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we can still use familiar phrases like "deploy to staging" or "look at staging," but staging is actually hosted in production! We can easily switch between new and old user experiences and test both.&lt;/p&gt;

&lt;p&gt;As a precaution, end-users should probably not have access to the staging domain, or they might get confused.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example 2: Early Access Program (EAP)
&lt;/h2&gt;

&lt;p&gt;Early Access Program is a common way to introduce new features to a subset of users who agree to test them and give feedback. At the very minimum, it could be just our teammates - developers, QA, and product management folks, or we may be lucky enough to recruit some end users as well.&lt;/p&gt;

&lt;p&gt;From the technical perspective, EAP is just a checkbox in the user's profile and a feature flag in the code that relies on this checkbox. We can make the checkbox public so that users can enable it for themselves or keep it private so that only our customer support team can control it.&lt;/p&gt;
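&lt;p&gt;In code, the whole mechanism can be this small (a sketch; the field and function names are made up for illustration):&lt;/p&gt;

```javascript
// EAP boils down to a boolean on the user profile plus a feature flag
// that reads it.
function eapEnabled(user) {
  if (!user) {
    return false; // anonymous users get the stable version
  }
  return user.earlyAccess === true; // the checkbox in the profile
}

function renderDashboard(user) {
  if (eapEnabled(user)) {
    return 'new dashboard'; // visible to EAP participants only
  }
  return 'old dashboard'; // what everyone else sees
}
```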

&lt;h2&gt;
  
  
  Example 3: Multiple experimental features
&lt;/h2&gt;

&lt;p&gt;Sometimes, we may want to have more granular control over which new features to enable for users - not just "all new stuff at once" but being able to select individual features. You may have seen it in your browser; for example, here's Chrome:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9bstkd2o8epor6ejy6vi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9bstkd2o8epor6ejy6vi.png" alt="Chrome feature flags" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every major browser has a long list of feature flags, allowing users to select the features they want to enable. When the developers think a new feature is already stable enough, they can turn it on by default. And even after that, there still could remain a checkbox for turning it off.&lt;/p&gt;

&lt;p&gt;This approach might be overkill for most projects, but it has its place and works well for huge software projects like browsers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example 4: Release flags
&lt;/h2&gt;

&lt;p&gt;Sometimes, we may use feature flags to enable a new feature for all users simultaneously. In this case, they're called "release flags." Release flags are a safety mechanism, providing instant rollback if something goes seriously wrong with a new feature after the launch. Instead of pushing a code update to turn the feature on or off, we flip a release flag somewhere in the internal administrative interface.&lt;/p&gt;

&lt;p&gt;Of course, not all features need a release flag, but they come to the rescue when the ability to roll back instantly is critical.&lt;/p&gt;
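&lt;p&gt;A minimal sketch of a release flag: the value is read at call time from shared storage (say, a settings table edited via the admin interface), so flipping it takes effect without a redeploy. The in-memory map below stands in for that storage:&lt;/p&gt;

```javascript
// A release flag read at call time, so changing it takes effect
// immediately without a code deployment. The Map stands in for shared
// storage managed from an admin interface (an assumption).
const flagStore = new Map([['new-checkout', true]]);

function isEnabled(flag) {
  return flagStore.get(flag) === true;
}

function checkout() {
  if (isEnabled('new-checkout')) {
    return 'new checkout flow';
  }
  return 'old checkout flow'; // the instant-rollback path
}
```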

&lt;h2&gt;
  
  
  Example 5: Canary deployments
&lt;/h2&gt;

&lt;p&gt;An even more cautious approach to releasing new features is called "canary deployment." In this case, we're not releasing the feature to all users at once; we do it gradually, starting with a small percentage of users:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ghp415wsmgyy7pfe3m5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ghp415wsmgyy7pfe3m5.png" alt="Canary deployment" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, we may roll out the new feature to 1% of users, then to 10%, and finally to everyone. After each step, we closely watch the production metrics. If we see some worrying anomalies, we roll back and investigate.&lt;/p&gt;
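&lt;p&gt;A common way to implement the gradual rollout is to hash a stable user identifier into a bucket from 0 to 99 and compare it with the current rollout percentage, so each user consistently sees the same variant. A sketch (the hash function is deliberately simplistic, for illustration only):&lt;/p&gt;

```javascript
// Deterministic percentage rollout: the same user always lands in the
// same bucket, so their experience doesn't flip between variants.
function bucket(userId) {
  let hash = 0;
  for (const ch of String(userId)) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple string hash
  }
  return hash % 100; // a stable bucket in the range 0-99
}

function inCanary(userId, rolloutPercent) {
  return rolloutPercent > bucket(userId);
}
```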

&lt;h2&gt;
  
  
  Example 6: Dark launch
&lt;/h2&gt;

&lt;p&gt;If we’d like to load-test the new version of a service before the launch, we could use a pattern called “dark launch.” Here’s how it works: we configure the load balancer to send a copy of all production traffic to the new service version:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbhwquony2y5dausrjyjp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbhwquony2y5dausrjyjp.png" alt="Dark launch" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Responses from the new service version are ignored; the old version still serves all user traffic. However, it allows us to test the new version's performance, see its metrics, and understand how it would perform under the full production load.&lt;/p&gt;
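&lt;p&gt;Mirroring is usually configured at the load balancer (nginx, for example, has a &lt;code&gt;mirror&lt;/code&gt; directive for this), but the idea can also be sketched in a few lines at the application level:&lt;/p&gt;

```javascript
// Dark launch at the application level: mirror each request to the new
// version fire-and-forget; only the old version serves the user.
async function handleRequest(request, oldService, newService) {
  // The dark copy: its result and any errors are ignored, so it can
  // never affect the user-facing response.
  Promise.resolve()
    .then(() => newService(request))
    .catch(() => {});
  // The user always gets the old version's response.
  return oldService(request);
}
```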

&lt;h2&gt;
  
  
  When staging is necessary
&lt;/h2&gt;

&lt;p&gt;As you can see, people invented numerous ways of testing things in production. There's a solution for almost every use case. However, one notable exception is when the development team doesn't have access to production. No access means there's no way to test anything there.&lt;/p&gt;

&lt;p&gt;First, not everything is in the cloud. People still build and use software that runs on-premises, in their private network, to which the developers have no access. Second, not all teams have embraced DevOps, even with cloud products. Some still have a separation for "dev" and "ops," and developers have no access to production.&lt;/p&gt;

&lt;p&gt;Without access to production, developers need another way to showcase and test their work, which naturally leads to staging.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Counterintuitively, testing in production isn't as reckless as it sounds. Quite the opposite—it gives us more confidence in what we build, speeds up the development process, and improves quality. The best teams in the world are doing it, and most likely, you can do it, too. If your team still uses staging, we collected some arguments above on why you shouldn't and proposed concrete practices for safe testing in production. Happy testing!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>testing</category>
      <category>devops</category>
    </item>
    <item>
      <title>Lean app monitoring—The Starter Pack</title>
      <dc:creator>Denis Stebunov</dc:creator>
      <pubDate>Tue, 21 Jan 2025 13:15:12 +0000</pubDate>
      <link>https://dev.to/stebunovd/lean-app-monitoring-the-starter-pack-283e</link>
      <guid>https://dev.to/stebunovd/lean-app-monitoring-the-starter-pack-283e</guid>
      <description>&lt;p&gt;Imagine an e-commerce solution that went down right before Black Friday! Downtime or failures could result in disaster, so they should be noticed and fixed ahead of time. But deciding what to monitor can be challenging when there are so many options. In this article, we’ll explain different metrics and share an easy, low-cost way to start your monitoring routine that covers the most basic needs.&lt;/p&gt;

&lt;p&gt;If you’re looking for a quick recipe for a new project, skip to &lt;strong&gt;The Starter Pack&lt;/strong&gt;. Or, bear with us and learn more about potential choices.&lt;/p&gt;

&lt;p&gt;Let’s say a site is unavailable. Perhaps the developers released a new update and took the site down during the rollout, or the cloud infrastructure provider experienced a major failure, leaving its servers offline. Either way, downtime must be detected and eliminated.&lt;/p&gt;

&lt;p&gt;But just because the site looks like it is working doesn’t mean it really is. Errors could go unnoticed, but still be there. Imagine how frustrating it would be to spend a long time finding something you want in an online store only to realize in the end that you can’t pay for it—because of a bug!&lt;/p&gt;

&lt;p&gt;One way to be notified about problems is to get feedback from users. Unfortunately, frustrated users might just leave without providing any feedback—and never come back.&lt;/p&gt;

&lt;p&gt;The good news is that there is a better way to make sure things will work as expected—automated app monitoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  So many things to be monitored!
&lt;/h2&gt;

&lt;p&gt;In large, mature projects, everything has to be monitored, and for a good reason—there are many things that could go wrong. So, how should we tackle this task? Here’s an overview of the main types of monitoring:&lt;/p&gt;

&lt;h3&gt;
  
  
  Infrastructure monitoring
&lt;/h3&gt;

&lt;p&gt;Infrastructure monitoring tracks servers’ uptime and resource allocation. These metrics show that the server is up and running and that the app uses resources efficiently without memory leaks, disk space shortages, or CPU overload.&lt;/p&gt;

&lt;p&gt;Modern cloud providers offer extensive infrastructure metrics, simplifying the monitoring process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7oony3zovu2n221vp2at.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7oony3zovu2n221vp2at.png" alt="Infrastructure metrics" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Application Performance Monitoring (APM)
&lt;/h3&gt;

&lt;p&gt;Application Performance Monitoring services track application-specific metrics, such as the number and types of requests it handles, response time, error rate, etc. These metrics provide an overview of the app's performance and possible optimization points.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftt1piteuarhy7wld277y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftt1piteuarhy7wld277y.png" alt="APM metrics" width="800" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Error monitoring
&lt;/h3&gt;

&lt;p&gt;This type of monitoring service specializes in collecting crash reports. Analyzing exceptions thrown by the application helps the team detect and fix bugs that have reached production.&lt;/p&gt;

&lt;p&gt;Besides bugs in the code, crash reports can indicate other problems, such as infrastructure issues, third-party service outages, or unexpected changes in external services' behavior.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhrhpntwzrx7t1iov9pml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhrhpntwzrx7t1iov9pml.png" alt="Error monitoring" width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Uptime monitoring
&lt;/h3&gt;

&lt;p&gt;Some problems may go unnoticed by infrastructure and application monitoring. For example, if our hosting provider experiences a major outage, we probably won’t get any alert from our internal monitoring, yet the app will be down. The same situation might happen if the app is deployed incorrectly.&lt;/p&gt;

&lt;p&gt;That’s why we need to check if the website is available from the user’s perspective in different locations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8knnb3rsn1g7vzpnjruo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8knnb3rsn1g7vzpnjruo.png" alt="Uptime metrics" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Business metrics
&lt;/h3&gt;

&lt;p&gt;Aside from purely technical things, we also need to keep an eye on what’s valuable to the business. For example, for media websites, page views matter; for an e-commerce store, it’s order volume; for a SaaS app, it’s active users, revenue, and customer churn.&lt;/p&gt;

&lt;p&gt;Business metrics are an additional monitoring layer that helps catch anomalies and hidden problems missed by other tools. For example, a sudden drop in orders in an online store should prompt the development team to investigate thoroughly, even if no other monitoring check has reported an error.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9d01qp63lbsg4cz6kzm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9d01qp63lbsg4cz6kzm.png" alt="Business metrics" width="800" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Starter Pack
&lt;/h2&gt;

&lt;p&gt;Once again, in mature applications everything has to be monitored. However, the teams behind these apps didn’t get there all at once – improving monitoring capabilities is an ongoing process. A new project still requires a first step.&lt;/p&gt;

&lt;p&gt;In our experience, most new projects can reach a reliable and stable production setup with the Starter Pack: error monitoring plus uptime monitoring.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffswgxhfc6zmkkt9ggcy3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffswgxhfc6zmkkt9ggcy3.png" alt="Uptime monitoring and Error monitoring" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Why these two? Because they are relatively easy to implement, and they provide decent coverage.&lt;/p&gt;

&lt;p&gt;And this is how it works: an error monitoring service catches exceptions and regressions as they occur and sends alerts to the team chat with detailed information about the problem. However, if the application fails to start, it won’t report anything because there’s no application running to throw errors. This is where an external monitoring tool comes into play: it checks whether specific web pages or API endpoints of the service are available.&lt;/p&gt;
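&lt;p&gt;To illustrate the uptime side, an external check boils down to requesting a few key endpoints from outside the app’s own infrastructure and alerting when any of them fails. A minimal sketch in Python (the endpoint URLs are hypothetical placeholders; hosted services such as Pingdom or UptimeRobot add multi-location probes on top of this idea):&lt;/p&gt;

```python
import urllib.request
import urllib.error

# Hypothetical endpoints to probe; replace with your own pages or API routes.
ENDPOINTS = [
    "https://example.com/",
    "https://example.com/api/health",
]

def check(url, timeout=10):
    """Return (url, ok) where ok is True for an HTTP 2xx/3xx response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return url, resp.status in range(200, 400)
    except (urllib.error.URLError, OSError):
        # Covers 4xx/5xx responses (HTTPError), DNS failures, and timeouts.
        return url, False

def run_checks(urls, probe=check):
    """Probe every endpoint and return the list of failing URLs."""
    return [url for url, ok in map(probe, urls) if not ok]
```

&lt;p&gt;Run it from a scheduler on a different network than the app itself, and page the team whenever &lt;code&gt;run_checks&lt;/code&gt; returns a non-empty list.&lt;/p&gt;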

&lt;p&gt;To summarize, all types of monitoring exist for good reasons, and as the project matures, you shouldn’t hesitate to invest in better production monitoring. However, uptime and error monitoring together give just enough reassurance to launch a new project, knowing it will be reasonably reliable and stable.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>devops</category>
      <category>startup</category>
      <category>programming</category>
    </item>
    <item>
      <title>Three important steps before jumping to the code</title>
      <dc:creator>Denis Stebunov</dc:creator>
      <pubDate>Sat, 13 Jul 2024 11:50:10 +0000</pubDate>
      <link>https://dev.to/stebunovd/three-important-steps-before-jumping-to-the-code-4j7i</link>
      <guid>https://dev.to/stebunovd/three-important-steps-before-jumping-to-the-code-4j7i</guid>
      <description>&lt;p&gt;Once you decide which feature you want to build, it’s time to decide how to actually build it. Over the years, I have participated in dozens of software development projects as a developer and engineering manager. I have built many features myself and have been lucky to collaborate with many talented people and watch how they work.&lt;/p&gt;

&lt;p&gt;Below is my go-to checklist for starting work on a new feature. It’s based on my experience and that of the teams I supervised – what worked well, and when problems arose, where they stemmed from. Of course, the approach I propose here might not fit every situation. However, so many problems in my practice fall into these three buckets that I thought it was worth treating them as a checklist:&lt;/p&gt;

&lt;h2&gt;
  
  
  #1: Understand “why”
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Who will be using this feature?&lt;/li&gt;
&lt;li&gt;What problem are we trying to solve for them?&lt;/li&gt;
&lt;li&gt;Why are we going to solve it this way?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sometimes, developers skip these fundamental questions and jump right into the code. It’s understandable. They’re eager to do what they love, and also, isn’t it the product manager’s job to think about such questions? Well, yes, it certainly is; but that doesn’t mean this information is useless for us developers.&lt;/p&gt;

&lt;p&gt;While working on a feature, we face many decisions on all levels, from “How do I name this variable?” to “We ran into a technical issue and need to find a workaround.” A deep understanding of the task context is crucial for making informed decisions. It’s also worth thinking a bit ahead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How may this functionality evolve in the future?&lt;/li&gt;
&lt;li&gt;How much data may we need to store?&lt;/li&gt;
&lt;li&gt;Which system failures will cause a bad user experience, and how will we handle that?&lt;/li&gt;
&lt;li&gt;… and so on.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The answers to these and similar questions make the difference between great and poorly designed software. And, once again, we need to understand the task context to get it right.&lt;/p&gt;

&lt;p&gt;In an ideal world, the answers to the most important questions will be found in or inferred from the task description itself: who the users of this feature are, what problem we will be solving, and why we are going to solve it this way. In practice, this is not always the case, and sometimes developers are shy about asking or think it’s none of their business. Please don’t skip this step. Not only will it help you solve the problem at hand better, but it’s also essential for your professional growth.&lt;/p&gt;

&lt;h2&gt;
  
  
  #2: UX design
&lt;/h2&gt;

&lt;p&gt;Regardless of the interface you’re building — a UI, API, or command line — consider the user interface carefully before jumping to writing code.&lt;/p&gt;

&lt;p&gt;If someone else has already prepared the designs, that’s awesome – study them thoroughly first. You might spot problems or inconsistencies and will be able to report them to the designer early on. Even if everything is clear and reasonable, it’s still time well spent because now you have a much better understanding of what you’re building.&lt;/p&gt;

&lt;p&gt;If there are no designs yet, and you’re supposed to come up with something, make sure to work on the designs before the code! Some developers, when tasked with building a UI, tend to postpone it because they’re not so confident about their design skills and prefer to start with something else – something they’re more familiar with. Don’t do that. Understanding how users will interact with the system should be your top priority. It’ll likely save you a lot of development time, and the result will be much better.&lt;/p&gt;

&lt;p&gt;You can still produce something useful even if you’re not a professional designer. For example, you can use a rapid wireframing tool like &lt;a href="https://balsamiq.com/" rel="noopener noreferrer"&gt;Balsamiq&lt;/a&gt; (my favorite) or &lt;a href="https://excalidraw.com/" rel="noopener noreferrer"&gt;Excalidraw&lt;/a&gt;. With such tools, you can sketch an idea quickly without spending time on minor visual details. Or, use a whiteboard or good old pencil and paper. Any sketch is better than nothing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnldpoay6vfcs58uoegjf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnldpoay6vfcs58uoegjf.png" alt="Low-fidelity Balsamiq wireframes" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Low-fidelity &lt;a href="https://balsamiq.com/" rel="noopener noreferrer"&gt;Balsamiq&lt;/a&gt; wireframes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And if you’re building an API or a command-line interface, your design would be the documentation and usage examples. It doesn’t have to be polished at this stage but should include at least the most important use cases.&lt;/p&gt;
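&lt;p&gt;For instance, for a small library API, the “design” can literally be usage examples written before any implementation exists. A sketch (all names here are invented for illustration):&lt;/p&gt;

```python
# Hypothetical usage examples written *before* the implementation –
# they serve as the design of the API:
#
#   report = summarize(visits=[120, 95, 143])
#   report.total    -> 358
#   report.average  -> 119.33 (rounded to two decimals)

from dataclasses import dataclass

@dataclass
class Report:
    total: int
    average: float

def summarize(visits):
    """Implementation written afterwards to satisfy the examples above."""
    total = sum(visits)
    return Report(total=total, average=round(total / len(visits), 2))
```

&lt;p&gt;Writing the examples first forces you to confront awkward names and missing parameters while they’re still cheap to change.&lt;/p&gt;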

&lt;h2&gt;
  
  
  #3: Data structures
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;“Bad programmers worry about the code. Good programmers worry about data structures and their relationships.” - Linus Torvalds.&lt;/p&gt;

&lt;p&gt;“Smart data structures and dumb code works a lot better than the other way around.” - Eric S. Raymond.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Yes, data structures are important. Depending on the task, we may optimize them for data consistency, speed, storage requirements, and developer experience; and how we organize data may have vast implications in all of these dimensions. However, the reason why they’re on my list is something else:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data structures are harder to change.&lt;/strong&gt; If we modify a data structure, we’ll have to update all the code that works with it and also migrate the existing data to the new structure, which can be quite a challenge depending on the project stage and size. As a result, data structures often live longer than the original code that was shipped with them. Also, building more features on top of existing data structures is quite common, so it makes sense to try to make them future-proof to some degree.&lt;/p&gt;

&lt;p&gt;This is why data structures are #3 on my list. After we study the task context and understand how users will interact with the system, this is the next important thing to tackle. We can certainly revise data structures later while working on a feature, maybe even multiple times, but given their importance and potentially problematic updates, we should start working on them as early as possible in the process.&lt;/p&gt;

&lt;p&gt;If the project is based on a relational database and the feature you’re working on uses multiple tables, it might be a good idea to visualize them with an ER diagram. ER diagrams are arguably the most useful of the diagrams associated with UML (strictly speaking, they predate it). You may skip everything else in UML, but if you’re working with relational databases, don’t skip ER diagrams :)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddrdpyvsawb2v3ud2082.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddrdpyvsawb2v3ud2082.png" alt="ER diagram example" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Credit: &lt;a href="https://dbdiagram.io/" rel="noopener noreferrer"&gt;dbdiagram.io&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
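&lt;p&gt;To make the diagram-to-schema connection concrete, here is a tiny hypothetical two-table schema (invented for illustration) expressed as SQL DDL and executed against an in-memory SQLite database; the foreign key from &lt;code&gt;orders&lt;/code&gt; to &lt;code&gt;users&lt;/code&gt; is exactly the kind of relationship an ER diagram makes visible at a glance:&lt;/p&gt;

```python
import sqlite3

# Hypothetical two-table schema illustrating a one-to-many relationship,
# the kind of structure an ER diagram captures visually.
SCHEMA = """
CREATE TABLE users (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE orders (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    total   REAL NOT NULL
);
"""

def build_schema():
    """Create the schema in an in-memory database and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite keeps FKs off by default
    conn.executescript(SCHEMA)
    return conn
```

&lt;p&gt;Changing a schema like this after data exists means migrations, which is precisely why it pays to think it through early.&lt;/p&gt;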

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;When you start working on a new feature:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Understand “why”;&lt;/li&gt;
&lt;li&gt;Review UX designs or create your own;&lt;/li&gt;
&lt;li&gt;Design data structures;&lt;/li&gt;
&lt;li&gt;Code.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In that order.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
      <category>development</category>
    </item>
    <item>
      <title>Why go full-stack in 2023?</title>
      <dc:creator>Denis Stebunov</dc:creator>
      <pubDate>Sat, 04 Feb 2023 18:49:30 +0000</pubDate>
      <link>https://dev.to/stebunovd/why-go-full-stack-in-2023-4381</link>
      <guid>https://dev.to/stebunovd/why-go-full-stack-in-2023-4381</guid>
      <description>&lt;p&gt;This post is not just another take on the never-ending “generalists vs. specialists” debate. We’ll be looking at one specific area – web development. We won’t be talking about mobile apps, machine learning, game development, and whatever else is on the horizon; this post is about web development only. So let’s start with a brief history of how it evolved in the last two decades.&lt;/p&gt;

&lt;p&gt;In the early 2000s, when the web was young, the distinction between frontend and backend developers barely existed. Browsers were not as powerful as today, and websites looked much simpler. Some people specialized in working with HTML/CSS, but that was a bit different – HTML is a markup language, not a programming language. The industry was already pretty active, though. PHP, Java, Ruby, JavaScript, CSS, MySQL, and Postgres were all introduced in 1995-1996, and we still use these technologies today. By 2000, developers had already built enough websites to &lt;a href="https://en.wikipedia.org/wiki/Dot-com_bubble" rel="noopener noreferrer"&gt;crash the stock market&lt;/a&gt;, even without jQuery! But that was going to change.&lt;/p&gt;

&lt;h2&gt;
  
  
  2010s: The rise of frontend frameworks
&lt;/h2&gt;

&lt;p&gt;The major shift happened somewhere in the 2010s, when frontend frameworks emerged. A remarkable demonstration of what they made possible is Trello, a popular project management app released in 2011. It was built on the Backbone.js framework, which was cutting-edge technology at the time. Since Trello used the Single Page Application (SPA) architecture, it didn’t require a full page reload to interact with the server. And it felt fast! Trello was very popular, thanks to its good design and a generous free plan, so many people tried it and are still using it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckyt87evgqlvmt83z3db.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckyt87evgqlvmt83z3db.png" alt="Frontend frameworks in 2010-s" width="800" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Trello wasn’t the very first SPA, of course. Another famous example is Gmail, which was introduced in 2004. However, Trello was built on a frontend framework, and frontend frameworks promised to do a significant amount of heavy lifting for you. It was around the early 2010s when frontend frameworks really started to take off: Backbone and Angular.js were released in 2010, Ember.js in 2011, React.js in 2013, and so on.&lt;/p&gt;

&lt;p&gt;I’m not sure if it was pure coincidence or a result of Trello’s popularity, but around that time the &lt;a href="https://todomvc.com/" rel="noopener noreferrer"&gt;TODO list&lt;/a&gt; app became the canonical example showcasing the capabilities of JS frameworks.&lt;/p&gt;

&lt;h2&gt;
  
  
  APIs
&lt;/h2&gt;

&lt;p&gt;Simultaneously, the 2010s saw explosive growth in APIs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wt1pdehfpvr82h160n7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wt1pdehfpvr82h160n7.png" alt="APIs growth since 2005" width="800" height="624"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://www.programmableweb.com/news/apis-show-faster-growth-rate-2019-previous-years/research/2019/07/17" rel="noopener noreferrer"&gt;Programmable Web&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Many organizations realized that building APIs could be a way to save development effort and move faster. The same API can be behind a web application and a mobile app, and when needed, it can be exposed externally and used by third-party integrations. Widespread adoption of the REST API concept further boosted API development.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frontend developers
&lt;/h2&gt;

&lt;p&gt;Frontend frameworks were hot in the 2010s. People were talking about them at conferences. At that time, the joke about “X days without a new JavaScript framework” wasn’t so much of a joke.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovd188d5pk4fv49bsvww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovd188d5pk4fv49bsvww.png" alt="Days without a new JavaScript framework" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some web developers were excited about these innovations, but others… not so much. Previously, CSS and JavaScript were already perceived by many as a mess. Browsers were not very compatible with each other, so numerous CSS and JavaScript hacks were required to get things working consistently in all browsers. Now, another layer of complexity had been added – frontend frameworks, JS bundlers, and package managers, and they were young and still in flux.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbv8euea4mro1wsm1u41d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbv8euea4mro1wsm1u41d.png" alt="How did your hackathon go?" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And so, some developers started to specialize in this area and call themselves frontend developers. It seemed reasonable because the industry was moving rapidly. All that additional complexity required additional time to study, and developers were always in demand. Many “traditional” web developers converted to backend developers and were glad they wouldn’t have to deal with that ever-changing frontend mess.&lt;/p&gt;

&lt;p&gt;APIs became a natural separation of responsibility between the backend and frontend. Backend developers would build an API, and frontend developers would build a web application on top of it.&lt;/p&gt;

&lt;p&gt;The separation of backend and frontend roles provides two major benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hiring becomes easier.&lt;/strong&gt; Since the scope of required knowledge for each role has been reduced, it would take less time to become a qualified developer. Besides, such a separation allows people to choose an area they like more. Some folks enjoy working with interfaces, and others prefer the backend world with databases and algorithms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It allows deeper specialization.&lt;/strong&gt; Again, because of the reduced scope, people become experts in their areas faster.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Some people would argue that there are additional benefits of faster and/or higher-quality development due to deeper specialization, but that’s not always the case. It could be true for some projects and false for others, for the reasons outlined in the next section.&lt;/p&gt;

&lt;h2&gt;
  
  
  What problems do full-stack developers solve?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqch9lv8hszmyovdwo3v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqch9lv8hszmyovdwo3v.png" alt="Full-stack" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Not everyone has hopped on that backend-frontend separation train, though. Simultaneously with frontend developers, a competing role emerged – full-stack developers. They were specifically required to work with both the backend and frontend, like in the good old days, but with modern technologies. Facebook is a notable example – there was a time when they proudly stated that they hired full-stack developers only. You can also find full-stack positions at Google, Netflix, and many other reputable organizations.&lt;/p&gt;

&lt;p&gt;Why hasn't everyone embraced that frontend-backend role separation? Well, because it isn’t free. Besides the benefits mentioned above, there are some issues, most notably:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extra dependencies to manage.&lt;/strong&gt; Backend and frontend separation has introduced a dependency. Frontend developers rely on an API but cannot build it themselves. As a result, they need to talk to backend developers first, agree on the API design, and then wait until backend developers implement it. If they don’t have other tasks to work on while waiting, they can build an API mock, work with it in the meantime, and then replace it with a real API. If later they discover that something is missing in the API or it doesn’t work as expected, they need to talk to backend developers again and then wait for an update or fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mediocre APIs&lt;/strong&gt; often emerge as a result of such separation. Sometimes, frontend developers can see that the API they’re using is not perfect, but at the same time, they can work around it with some hacks. Of course, ideally, they should report it to backend developers and suggest improvements, but that entails some friction, so issues often end up being neglected. For backend developers, it could be harder to see usability problems with their APIs because of a lack of “dogfooding” – they don’t use what they build themselves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workload distribution problems,&lt;/strong&gt; especially in smaller teams. It's much easier to prioritize work in a team of full-stack developers because all of them can work on any task. The percentage of the backend-vs-frontend work during the project's life may vary. With dedicated backend and frontend developers, their workload is often suboptimal. Frontend developers could be overloaded with high-priority tasks, and backend developers cannot help them, and vice versa.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some technologies cross boundaries.&lt;/strong&gt; Take Server-Side Rendering (SSR), for example. With SSR, the same code is executed on the server side and then in the browser. Technically, it is closer to the frontend since it builds on the frameworks that frontend developers typically use. However, frontend developers are no longer in their traditional territory – they are running their code on a server. Another example is Hotwire for Ruby on Rails, introduced in 2021, which takes an alternative approach, adding interactivity to web pages without traditional SPAs and APIs. WebAssembly is another interesting technology for high-performance code that can run on both server and client.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lack of product thinking.&lt;/strong&gt; Since full-stack developers work on all aspects of a task, they tend to have a better understanding of the big picture of what they're building. Frontend developers see how people will interact with the software, but they may have only a vague understanding of its internals. Therefore, it may be hard for them to decide what would be easy to build, what would be challenging, and which optimizations are possible. On the other hand, backend developers understand the internals very well, but they may not look at the product from the end user's perspective.&lt;/p&gt;

&lt;h2&gt;
  
  
  State of modern frontend
&lt;/h2&gt;

&lt;p&gt;What has changed in frontend development since the 2010s? Well, many things, but most importantly – it has matured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New frontend frameworks no longer appear every other day. Innovation in this area continues, but it’s no longer the crazy ride it was in the 2010s;&lt;/li&gt;
&lt;li&gt;Browsers have improved! We no longer need that ridiculous amount of CSS hacks in our codebase. Flexbox and Grid layouts are now safe to use for most apps, so we can finally center a div;&lt;/li&gt;
&lt;li&gt;JavaScript has improved, and it’s not just &lt;a href="https://qz.com/646467/how-one-programmer-broke-the-internet-by-deleting-a-tiny-piece-of-code" rel="noopener noreferrer"&gt;leftpad&lt;/a&gt; being &lt;a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/padStart" rel="noopener noreferrer"&gt;accepted as a standard&lt;/a&gt;. There have been numerous improvements in the language, tooling, frameworks, libraries, package managers, bundlers, linters, and testing tools. And it’s no longer considered a slow language – it’s faster than some popular languages used on the backend, such as Python or Ruby;&lt;/li&gt;
&lt;li&gt;TypeScript is now the de-facto standard for those who prefer stronger typing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not to say that all frontend problems are solved, but it’s much easier to work with today than ten years ago. A large portion of the complexity is still there, but it’s no longer such a mess, so backend developers have fewer reasons to hate it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Should you hire full-stack developers?
&lt;/h2&gt;

&lt;p&gt;It depends. Full-stack developers may bring little benefit to projects with little or no back-and-forth between the backend and frontend. Hiring dedicated backend or frontend developers is easier, and that matters.&lt;/p&gt;

&lt;p&gt;However, if interactions between the backend and frontend happen regularly, and most tasks require working on both parts – then consider hiring at least some full-stack developers. Hiring will be more difficult, but it’s certainly doable. I know it first-hand, since we've been hiring full-stack developers at &lt;a href="https://ivelum.com" rel="noopener noreferrer"&gt;ivelum&lt;/a&gt; for years. If you don’t take it from me, take it from Google, Meta, Netflix, and others. Highly qualified full-stack developers do exist, and it is possible to find them.&lt;/p&gt;

&lt;p&gt;You can also have a hybrid team – some frontend and backend developers and some full-stack folks to take the best from both worlds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Should you become a full-stack developer?
&lt;/h2&gt;

&lt;p&gt;By the time you read these lines, I bet you already know the answer. It's perfectly fine if you don't want to – there's nothing wrong with specialization. Most job offers nowadays are for frontend or backend developers, and it's very unlikely that this will change anytime soon.&lt;/p&gt;

&lt;p&gt;If you'd like to give it a try, then go ahead! Even if you are not going to apply for a full-stack position, a better understanding of what happens on the other side will help you to see the big picture and be a more productive collaborator. And if you really invest in it, you'll be able to apply for a broader range of jobs and build complete features from start to finish.&lt;/p&gt;

&lt;p&gt;Here are a couple of random thoughts on this topic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For backend developers, it might be easier to switch to full-stack development. The scope of additional knowledge required is significant but smaller than for the other side;&lt;/li&gt;
&lt;li&gt;For frontend developers, trying Node.js might be enticing. There are many things to learn on the backend, and with Node, you won't have to start by learning a new programming language.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that's all I have for today. Happy coding!&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>productivity</category>
      <category>motivation</category>
    </item>
    <item>
      <title>Effortless Time Tracking</title>
      <dc:creator>Denis Stebunov</dc:creator>
      <pubDate>Sun, 01 Jan 2023 21:21:27 +0000</pubDate>
      <link>https://dev.to/stebunovd/effortless-time-tracking-4o30</link>
      <guid>https://dev.to/stebunovd/effortless-time-tracking-4o30</guid>
      <description>&lt;h2&gt;
  
  
  Why use Time Tracking?
&lt;/h2&gt;

&lt;p&gt;Time tracking provides knowledge about work time spent on particular tasks and projects. It can be useful for various reasons, for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Better project planning.&lt;/strong&gt; When we see tasks take too long, it might be a good idea to check what's going on and maybe adjust our plans. Also, if we know how long projects took in the past, we can use that information to plan similar future projects;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time-based billing,&lt;/strong&gt; which is common for contractor work;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost attribution.&lt;/strong&gt; For example, time spent on software maintenance and bug fixes can be treated as OpEx (Operating Expense), and time spent on building new projects and features – as CapEx (Capital Expenditure). Financial management may be interested in these numbers for investment planning and tax accounting purposes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Traditional solutions
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftyhr1zxav35scoefe0zx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftyhr1zxav35scoefe0zx.png" alt="Jira – Log time on an issue" width="800" height="362"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://support.atlassian.com/jira-software-cloud/docs/log-time-on-an-issue/" rel="noopener noreferrer"&gt;Jira – Log time on an issue&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Time-tracking software typically relies on two main techniques or a combination of them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Manual entry&lt;/strong&gt; – people enter work hours manually for each task (or manually start and stop a timer);&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated&lt;/strong&gt; – users install tracking software that creates record entries based on their actions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As usual, there are pros and cons. Manual entry significantly increases overhead. People may forget to update their records, or do it late or inaccurately. Automated solutions can be perceived as an intrusion into privacy (and rightfully so).&lt;/p&gt;

&lt;p&gt;Besides, tracking based on user actions has inherent limitations: it can't capture all types of work. For example, thinking about a task or brainstorming ideas on paper doesn't require a computer and can't be tracked automatically.&lt;/p&gt;

&lt;p&gt;For many software development teams, the overhead caused by time tracking is not outweighed by its benefits. Developers aren't happy to enter work hours manually and would hate to install some "time-tracking spyware" on their machines. Therefore, many teams just give up or don't even try. But what if there was a fully automated, reasonably accurate, and privacy-friendly solution? If it takes minimal effort – wouldn't it be nice to know how long our tasks take?&lt;/p&gt;

&lt;h2&gt;
  
  
  💡Can't we get it from an issue tracker?
&lt;/h2&gt;

&lt;p&gt;Nearly all modern software development teams use some issue-tracking system. Such systems use the concept of "in progress" task status. In theory, if we analyze the change history of an issue and add up all the periods when it was "in progress," we'll get the total time spent on the task.&lt;/p&gt;
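&lt;p&gt;In theory, the computation is simple. Here's a minimal Python sketch that sums the "in progress" periods from an issue's status-change history (the event format here is hypothetical; real trackers such as Jira expose this history via their changelog APIs):&lt;/p&gt;

```python
from datetime import datetime, timedelta

def time_in_progress(events):
    """Sum the time an issue spent in the "in progress" status.

    `events` is a chronologically sorted list of (timestamp, new_status)
    tuples -- a simplified stand-in for a tracker's changelog.
    """
    total = timedelta()
    started = None
    for ts, status in events:
        if status == "in progress" and started is None:
            started = ts  # the issue entered "in progress"
        elif status != "in progress" and started is not None:
            total += ts - started  # the issue left "in progress"
            started = None
    return total

events = [
    (datetime(2023, 1, 2, 10), "in progress"),
    (datetime(2023, 1, 2, 18), "in review"),
    (datetime(2023, 1, 3, 11), "in progress"),
    (datetime(2023, 1, 3, 15), "done"),
]
print(time_in_progress(events))  # 12:00:00
```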

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fackti53onpk7gi5sguai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fackti53onpk7gi5sguai.png" alt="Moving a task in progress" width="800" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Well, not so fast. First, the information in issue trackers doesn't always match reality. Sometimes people forget to put a task in progress or just don't want to create a task if it's something minor. Or, they may have multiple tasks in progress; in this case, it's not clear what they're actually working on.&lt;/p&gt;

&lt;p&gt;Second, such a method may be highly inaccurate without information about the work schedule. Suppose someone puts a task "in progress" and, on the next day, leaves on vacation or becomes sick. The task status may stay "in progress" for many days before the work resumes. Even on regular work days, an issue tracker receives no input on the actual moment when people start and end working. Today, many teams are remote-first, can be spread across multiple countries and time zones, and enjoy flexible work schedules.&lt;/p&gt;
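&lt;p&gt;To illustrate why the schedule matters, here's a naive Python sketch that clips a task's "in progress" span to actual working days, excluding weekends and time off (an illustrative model only, not Teamplify's actual algorithm):&lt;/p&gt;

```python
from datetime import date, timedelta

def working_days(start, end, time_off):
    """Count working days in the span, skipping weekends and days off.

    A naive work-schedule model: Mon-Fri workweek, `time_off` is a set
    of dates when the person was on vacation or sick leave.
    """
    days = 0
    d = start
    while True:
        if d > end:
            break
        if d.weekday() not in (5, 6) and d not in time_off:
            days += 1
        d += timedelta(days=1)
    return days

# A task went "in progress" on Fri, Jan 6 and was done on Tue, Jan 10,
# with Mon, Jan 9 taken off: only 2 actual working days, not 5 calendar days.
print(working_days(date(2023, 1, 6), date(2023, 1, 10), {date(2023, 1, 9)}))  # 2
```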

&lt;h2&gt;
  
  
  Our approach
&lt;/h2&gt;

&lt;p&gt;Despite the challenges outlined above, the idea of using an issue tracker to get the information we need is appealing. Having accurate task statuses is valuable on its own since it helps teams to stay in sync and reduces the need for status meetings. Teams that keep their issue trackers in good shape can get time tracking easily and effortlessly.&lt;/p&gt;

&lt;p&gt;When we started working on the Effortless Time Tracking feature, we already had the main ingredients in Teamplify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://teamplify.com/integrations/" rel="noopener noreferrer"&gt;Integrations&lt;/a&gt; with the most popular issue trackers – Jira, Trello, YouTrack, Linear, and GitLab issues;&lt;/li&gt;
&lt;li&gt;The &lt;a href="https://teamplify.com/standup/" rel="noopener noreferrer"&gt;Smart Daily Standup&lt;/a&gt; bot, which, among other features, helps teams keep their issue trackers up to date. It analyzes the team's issue tracker and politely reminds people to update issue statuses when needed;&lt;/li&gt;
&lt;li&gt;Built-in &lt;a href="https://teamplify.com/time-off/" rel="noopener noreferrer"&gt;Time Off&lt;/a&gt; management system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we combined all of the above, added some magic, and voila – fully automated time tracking is here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7gt4zxoftm5fbxxhj6a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7gt4zxoftm5fbxxhj6a.png" alt="Our approach" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What is great about this method is that it requires no additional setup and is privacy-friendly. Issue statuses and work schedule information are shared with other team members, and we're not asking for anything beyond that.&lt;/p&gt;

&lt;p&gt;Of course, one size doesn't fit all, and this approach has limitations. We're making some assumptions about daily work schedules, and the accuracy is usually limited to a fraction of a day (and that's why we measure it in days and not hours). Therefore, we wouldn't recommend using it for time-based billing, which usually requires more accuracy.&lt;/p&gt;

&lt;p&gt;However, it can work pretty well for teams that would like a general understanding of how long tasks take in order to improve their project coordination and planning. And the silver lining of its slightly limited accuracy is that it minimizes the risk of the tool being perceived as micromanagement. Better project planning is important, of course, but being comfortable at work is important too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Give it a try
&lt;/h2&gt;

&lt;p&gt;Effortless Time Tracking is available on all &lt;a href="https://teamplify.com" rel="noopener noreferrer"&gt;Teamplify&lt;/a&gt; plans, including the Free plan. You can see how long tasks take in Team Analytics and also in Smart Daily Standup (for current tasks in progress). Give it a try – get started today!&lt;/p&gt;

</description>
      <category>welcome</category>
      <category>programming</category>
      <category>gamechallenge</category>
      <category>flutter</category>
    </item>
    <item>
      <title>Why public chats are better than direct messages</title>
      <dc:creator>Denis Stebunov</dc:creator>
      <pubDate>Fri, 09 Sep 2022 09:22:42 +0000</pubDate>
      <link>https://dev.to/stebunovd/why-public-chats-are-better-than-direct-messages-5fnh</link>
      <guid>https://dev.to/stebunovd/why-public-chats-are-better-than-direct-messages-5fnh</guid>
      <description>&lt;p&gt;Most problems in project management are, in fact, communication problems. How we communicate makes an enormous impact on our work. In this post, we talk about one of the best strategies for improving communication in a team: &lt;strong&gt;making it open.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is open communication?
&lt;/h2&gt;

&lt;p&gt;Over the years, we at &lt;a href="https://ivelum.com" rel="noopener noreferrer"&gt;ivelum&lt;/a&gt; have developed a work culture that relies on open communication, an approach borrowed from open-source communities. We prefer to publicly discuss most work questions so any team member can follow the discussion and participate if they'd like. It can happen in chat channels, a project management system, or open video meetings that anyone can join.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We discourage work discussions in private talks.&lt;/strong&gt; This is not to say that we avoid direct communication entirely; of course, there's no replacement for it for sensitive topics. However, there's nothing sensitive in most work talks, and we prefer to have them in the open.&lt;/p&gt;

&lt;p&gt;We're not the only company that practices open communication. However, it doesn't seem to be very common. Most companies follow a more traditional approach: if Alice has a question for Bob, she is likely to ask Bob directly rather than post it in a group team chat. At ivelum, we do the opposite. Many folks who join our teams notice that we communicate differently from what they were used to at their previous jobs. We explain why and try our best to ensure a smooth transition.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45x189stvoycnjok4sp1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45x189stvoycnjok4sp1.png" alt="Rocket" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Simple: &lt;strong&gt;it boosts team performance - a lot.&lt;/strong&gt; Of course, it takes some adaptation effort for new team members, but the effect is so significant that it soon pays off.&lt;/p&gt;

&lt;h2&gt;
  
  
  Minimize distractions
&lt;/h2&gt;

&lt;p&gt;At first, it may seem that when all work communication is public, it might cause a lot of distraction. In reality, the opposite is true because of how chats work. Slack, which we use, and other chat platforms (Discord, MS Teams, Mattermost, etc.) are configured by default like so:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Notifications ARE shown for direct messages;&lt;/li&gt;
&lt;li&gt;Notifications are NOT shown for messages in group channels.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a result, when you send a direct message, you distract your colleague, but when you post a message in a group chat, you don't distract them (unless you mention someone explicitly via @).&lt;/p&gt;

&lt;p&gt;From our experience, the vast majority of work questions are not so urgent that they require an immediate response. Most topics can be discussed asynchronously – you post a question when you have time, and your teammates will respond when they have time. It greatly helps to minimize distractions, which is especially important for developers. To be productive with code, they need prolonged periods of focused work.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is not "noise"
&lt;/h2&gt;

&lt;p&gt;Okay, reducing distractions is all well and good, but why would people in a team chat browse through lots of messages which are not addressed to them directly? Well, in practice this information is actually very useful:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Easy way to keep the team updated.&lt;/strong&gt; When people communicate openly, others can see what they're working on. It significantly reduces the need for status updates – meetings or written reports. When almost all communication is public, everyone knows what their teammates are working on and their progress.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://dev.to/stebunovd/how-to-get-unstuck-and-make-progress-3dg"&gt;Faster problem-solving&lt;/a&gt; and learning.&lt;/strong&gt; Some information can't be learned by asking questions and getting answers. We often don't know what we don't know, so we might have no idea what questions to ask in the first place. Watching how other people work can be an insightful learning experience. It's an opportunity not only to learn something new but also to share our knowledge with others.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Finally, proper organization of chat channels reduces the risk of information overload to a minimum. Each chat channel can be dedicated to a particular project or a team, and people follow only channels in which they're interested.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why isn't everyone doing it?
&lt;/h2&gt;

&lt;p&gt;Even when teams agree that they may benefit from open communication, they sometimes don't rush to implement it in practice.&lt;/p&gt;

&lt;p&gt;One obstacle is that people can be shy. Many are reluctant to talk in public because they're not used to it and don't feel comfortable doing it. They may think they're asking a trivial question and don't want to attract too much attention to it. The best thing we could do to help with that is to be friendly and welcoming. No sarcasm, and never treating any questions as "silly." A positive atmosphere is crucial for learning, and continuous learning is vital for team success.&lt;/p&gt;

&lt;p&gt;Another common problem is too much focus on secrecy. Open communication can be perceived as a risk of information leaks, and in some cases, it can be a risk indeed. However, by imposing communication barriers, companies not only mitigate these risks but also greatly harm their productivity. Even one of the most secretive companies – &lt;a href="https://www.fastcompany.com/90748492/apple-airpods-pro-creation" rel="noopener noreferrer"&gt;Apple – is now reconsidering its practices&lt;/a&gt; because the downsides are so significant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;p&gt;There are certain things that the most productive teams do differently. Of course, open communication is not a silver bullet that magically solves all potential team problems, but this is certainly a big one.&lt;/p&gt;

&lt;p&gt;Numerous studies show that, in one form or another, communication problems are behind most of the reasons why IT projects fail. Leading by example in frictionless, open communication is arguably the best thing a team leader can do, as the rest of the team will most likely follow suit and make it a habit.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>productivity</category>
      <category>management</category>
    </item>
    <item>
      <title>On Estimates</title>
      <dc:creator>Denis Stebunov</dc:creator>
      <pubDate>Sun, 21 Aug 2022 10:03:01 +0000</pubDate>
      <link>https://dev.to/stebunovd/on-estimates-56g</link>
      <guid>https://dev.to/stebunovd/on-estimates-56g</guid>
      <description>&lt;p&gt;Since we began developing software, we’ve looked for ways to reliably estimate our development time. Now, some 60+ years later, we've gotten no better at it. Maybe the problem is not in how we estimate, but that we're so concerned with estimates in the first place.&lt;/p&gt;

&lt;p&gt;Take the popular Scrum framework. At its core, Scrum is based on estimating future work and taking into the next sprint only as much as you can supposedly complete. At first glance, it sounds reasonable. In reality, more often than not, it means trading team performance for the illusion of planning. I’ll explain why.&lt;/p&gt;

&lt;h2&gt;
  
  
  Estimates slow us down
&lt;/h2&gt;

&lt;p&gt;People naturally want to be good at what they're doing. At first, developers tend to be optimistic, and accordingly tend to underestimate their task times. What inevitably happens next is they commit to the specific timeline that they themselves stated and fail. This makes them feel uncomfortable, even when no one blames them (and sometimes people do blame them). As the process repeats itself, they slowly learn to overestimate, because they don't want to fail.&lt;/p&gt;

&lt;p&gt;Counterintuitively, overestimating doesn't always help people to finish their work on time. This effect is known as &lt;a href="https://en.wikipedia.org/wiki/Parkinson%27s_law" rel="noopener noreferrer"&gt;Parkinson's Law&lt;/a&gt;, and here's where psychology plays against us. Let's say you think a task will take a couple of days, but, given the uncertainty, you might need up to a week. You estimate a week, and your manager says "fine." Now you think - alright, I have plenty of time! As a result, in the first half of the week, you either work in a relaxed mode or shift your priorities to other tasks, because you know that you have plenty of time. As the timeline approaches, you start to realize that the task is not as simple as you’d initially thought and remember why you overestimated in the first place. So you work hard, maybe stay late, and finally, barely fit the work into the timeline you set for yourself. Next week, the same story happens again.&lt;/p&gt;

&lt;p&gt;Ultimately, problems happen either way you estimate. If team members are under time pressure (often because they’ve underestimated), they may have to cut corners to meet the timeline. Surprisingly, if they overestimate, the same thing happens, just later. As a rule of thumb, you may think of deadlines as "the earliest dates when something can be delivered." So no matter whether they over- or underestimated, nothing ships until the deadline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fonet3m5nqzn8p8qr5rm4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fonet3m5nqzn8p8qr5rm4.png" alt="Peopleware - estimates"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://www.amazon.com/Peopleware-Productive-Projects-Tom-DeMarco/dp/0932633439" rel="noopener noreferrer"&gt;Peopleware by Tom DeMarco &amp;amp; Timothy Lister&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Estimates slow the team down, and the more frequently you ask developers for estimates, the worse the effect becomes. If a team estimates all the work it’s doing, then the amount of time and energy it must spend on the estimating process will be overwhelming. First, people spend time making the estimates. After that, they spend time discussing why those estimates failed and what they should do next (usually, the solution is to request more estimates.) Next, requirements change - and voila, now they have to figure out how the change affects their estimates. Finally, they deal with the technical debt accumulated in their numerous attempts to meet the estimated timelines, slowing the project down in the long term.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick estimates are just guessing
&lt;/h2&gt;

&lt;p&gt;The nature of software development entails a lot of complexity and uncertainty. Almost all of the work that developers do is something new. Even if they’ve done something similar in the past, they're now doing it in new conditions. There might be a new project, or new requirements, or new knowledge learned from the last experience. If it were absolutely the same and routinely repeated, it’d likely already be automated, or be implemented as an out-of-the-box product, or exist as a library or framework. Therefore, in most cases, developers have to provide estimates for tasks that have a high degree of uncertainty.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfp6cfgi0yfknszdv0g5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfp6cfgi0yfknszdv0g5.png" alt="Hill concept"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://basecamp.com/shapeup/3.4-chapter-12" rel="noopener noreferrer"&gt;Show Progress | Shape Up by BaseCamp&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It would be better if developers were given some time to work on the problem in advance; they would then understand it better and be more confident providing an estimate. Unfortunately, this rarely happens in practice. Developers usually have very little time to study a problem before someone asks them for an estimate. As a result, they pluck numbers out of thin air.&lt;/p&gt;

&lt;p&gt;Of course, all experienced managers know that estimates are just estimates and don't try to treat them as hard deadlines. However, they are perceived as some form of commitment anyway, by both managers and developers. If estimates mean absolutely nothing, why ask for them in the first place? So providing an estimate means being accountable for it. It's no surprise that developers often don't like them - basically, we ask them to be accountable for something that they don't fully understand.&lt;/p&gt;

&lt;h2&gt;
  
  
  Estimates create dangerous misconceptions
&lt;/h2&gt;

&lt;p&gt;Uncertainty is not the only problem with estimates. They can also create misconceptions about the workflow for the project team and stakeholders. As Fred Brooks famously said, "The bearing of a child takes nine months, no matter how many women are assigned." However, when people hear that a project will take nine man-months, they sometimes assume that three developers could probably do it in three months.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpx8gg6upwulp1p0ddn6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpx8gg6upwulp1p0ddn6.png" alt="Man-month"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://www.amazon.com/Mythical-Man-Month-Anniversary-Software-Engineering-ebook/dp/B00B8USS14/" rel="noopener noreferrer"&gt;The Mythical Man-Month by Frederick P. Brooks&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Men and months are not interchangeable, and not just because some work can't be done in parallel. Who is doing the work also matters. A senior developer who’s also familiar with the matter could probably do it much faster than estimated, and a junior developer with no knowledge of the subject may not just miss the deadline, but fail the task altogether.&lt;/p&gt;

&lt;h2&gt;
  
  
  Initial development is a minor cost
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figa337voj7zaadunrpe4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figa337voj7zaadunrpe4.png" alt="Initial development vs. maintenance"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Arguably the biggest problem with estimates is that people routinely ask for initial development estimates, but rarely ask for estimates about the subsequent maintenance. Maintenance is huge. It takes up to &lt;a href="https://www.google.com/search?q=software+initial+development+vs.+maintenance+costs" rel="noopener noreferrer"&gt;60-90%&lt;/a&gt; of the total project cost. Besides that, every feature that you add to the product increases complexity and makes further development slower. With this fact in mind, people should be very cautious about adding new features. Instead, prioritize like a maniac, and work on only the features that make the most sense. This rarely happens when estimates drive project planning. In such projects, people tend to push features that take less time and postpone the ones that will take longer. As a consequence, the most important work can be postponed and the product becomes bloated with less important features and quick hacks.&lt;/p&gt;

&lt;h2&gt;
  
  
  How is it possible to plan a project without estimates?
&lt;/h2&gt;

&lt;p&gt;Ok, so estimates take their toll. They can be misleading, they can slow a team down and make developers uncomfortable, but we need them for project planning anyway, right? Well, not necessarily. Sometimes estimates are unavoidable, but in most cases, they are not required for project decisions.&lt;/p&gt;

&lt;p&gt;First, the initial development estimates shouldn't be the driving factor that determines what to work on next. It would be best if you only worked on what is crucial, and not on what is fast to implement. If you're not absolutely sure that a feature is essential, it shouldn't be in the development queue. Use quick MVPs and product discovery to find out what your customers need.&lt;/p&gt;

&lt;p&gt;Second, when you have a timeline to meet, communicate it to developers instead of asking for their estimates. There's the excellent &lt;a href="https://basecamp.com/gettingreal/02.4-fix-time-and-budget-flex-scope" rel="noopener noreferrer"&gt;Fix Time and Budget, Flex Scope&lt;/a&gt; approach. Every task can be completed in thousands of ways. Prioritize the requirements carefully, work on the most critical items first, and be ready to ship at any time. In such conditions, timelines become manageable. Even if you can't ship everything, the most important features will be there, and only the least important ones, which quite often aren’t critical to launching, will be missing.&lt;/p&gt;

&lt;p&gt;Finally, if you badly need an estimate, give developers a few days to work on the task before asking. If you're lucky, the task will already have been completed, and no estimate will be needed. Otherwise, they will understand the task better, and you'll be able to have much more productive talk about the task challenges. Make sure that you take into account not only initial development estimates but also long-term maintenance, which is usually more important. And once again, if you're not sure that the task is essential, it shouldn't be in the development queue, even for estimates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Managing expectations
&lt;/h2&gt;

&lt;p&gt;This one is probably the most challenging. As a manager or a team lead, you will likely deal with stakeholders who will wonder how long a project will take. If you ask them why they need this information, they'll tell you it's necessary for planning or that they simply want to know what to expect. We discussed planning above, now let's turn to expectations. It’s human nature to be concerned if something you care about is uncertain. It may be your boss who’s asking, and it may be hard to say, "I don't know." Here's what you can try:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;If the work’s already in the queue or in progress, ask what plans depend on the completion of this work. Try to help them execute their plans. Sometimes, you may use this information as an input for work prioritization and communicate their expectations to the team. Other times, you may find workarounds that would help those people to achieve their goals while the task’s still in progress;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the work is not in the development queue yet, ask about its priority. Talk about the plans that the development team already has, try to figure out where this new work can fit. Sometimes you'll find there's no chance that you'll work on it anytime soon, and if so - there's not much sense in discussing estimates;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Focus on team productivity and work transparency. If your team ships to production every day and everyone can see the work and the queue, people are less inclined to ask for estimates;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;As a manager or a team lead, track the progress of your team regularly. Know what was done and what’s remaining, what the blockers are, etc. If you know what's going on, you can make your own estimates and communicate them to stakeholders when you think it's necessary. Remember, good managers work as a "shit umbrella" for the team; bad managers let the shit fall through. Passing all requests for estimates down to developers is not good management practice.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;To summarize, while estimates look appealing and straightforward on the surface, really they are misleading and obstructive to the workflow. It’s natural for humans to simplify complicated things, but estimates simplify the perception of the development process far beyond what’s reasonable. Scrum can take this over-simplification to the extreme with its planning poker, velocity, and burn-down charts.&lt;/p&gt;

&lt;p&gt;Most people would agree that, though estimates are imperfect, they are at least measurable. They can't imagine how to manage a project without having something measurable. Unfortunately, not everything that counts can be measured, and not everything that can be measured counts. In this post, I provided some practical advice from my experience of managing projects without estimates. Of course, I'm not saying that estimates are entirely useless; I'm just saying that most of the time, other more important factors should be at play. I’ve used this approach for years, and it’s worked well for our teams.&lt;/p&gt;

&lt;p&gt;I realize that this advice is not for everyone. I worked with different organizations, big and small, and most of them used estimates in one form or another. At the same time, most organizations are far from efficient. The way the best teams operate is often different from the way the majority does. Perhaps the emphasis placed on estimates is one place where these differences lie.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>productivity</category>
      <category>management</category>
    </item>
    <item>
      <title>Migrating a production database without any downtime</title>
      <dc:creator>Denis Stebunov</dc:creator>
      <pubDate>Sat, 13 Aug 2022 11:44:14 +0000</pubDate>
      <link>https://dev.to/stebunovd/migrating-a-production-database-without-any-downtime-2p2d</link>
      <guid>https://dev.to/stebunovd/migrating-a-production-database-without-any-downtime-2p2d</guid>
      <description>&lt;p&gt;In this episode, we'll cover the basic principles of zero-downtime database migrations and provide quick recipes for the most common scenarios.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does a deployment process work?
&lt;/h2&gt;

&lt;p&gt;Let's take a look at a simplified deployment process for a typical web application. Most applications these days rely on load balancing and container orchestration:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnf0vzprchlesd0groqny.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnf0vzprchlesd0groqny.png" alt="A typical web app" width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What happens when we need to deploy a new version of an application? The deployment process replaces app instances one by one. It excludes them from the cluster first, replaces them with newer versions, and then puts them back in the cluster:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhd7r9hx4ykcd0dy0qh9q.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhd7r9hx4ykcd0dy0qh9q.gif" alt="A typical deployment process" width="760" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As shown in the animation above, application version 2 gradually replaces the previous version 1 without any service interruption for end users.&lt;/p&gt;

&lt;p&gt;This is all well and good, but what if application v2 requires some changes in the database? Unlike application instances, the database is a shared stateful resource, so we cannot simply clone it and use the same technique that we used for application instances. The only viable solution for upgrading the database is to modify it in place. At what point in time should we modify the database?&lt;/p&gt;

&lt;p&gt;Since the application version 2 requires an upgraded DB version 2 to work correctly, the database must be upgraded before putting any instance of application v2 into production. Therefore, a deployment process that includes a database upgrade should look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsztjv20udianlegduavw.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsztjv20udianlegduavw.gif" alt="Deployment process with a DB upgrade" width="760" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;How to run the DB migration script&lt;/h2&gt;

&lt;p&gt;Please note that how you run a DB migration script matters. For example, it might be tempting to make it part of the application startup, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;run-db-migrations&amp;gt; &amp;amp;&amp;amp; &amp;lt;launch-the-app&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The idea behind such an arrangement is that the first application instance being launched will run the DB migrations, and for the other instances the script will be a no-op since the DB is already upgraded.&lt;/p&gt;

&lt;p&gt;Please &lt;strong&gt;DON'T DO THIS&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;First, since we're launching multiple application instances in parallel on our cluster, the DB migration script must handle parallel executions correctly. Depending on the DB migration framework you're using, and how the script is written, it may or may not do so. A migration has three main states — not started, in progress, finished. The script must detect that another migration is already in progress and wait. If it doesn't wait, it may cause crashes or even corrupted data.&lt;/p&gt;

&lt;p&gt;Second — even if DB migration scripts handle parallel executions correctly, there's an issue with retries. It is important that the DB migration script gets executed once and only once, and if it fails, the whole deployment process must be stopped and rolled back. For example, if the migration fails due to a timeout on a long heavy SQL operation, you don't want to retry it again and again automatically. The reasonable solution would be to roll back immediately after the first failure, investigate, fix the migration script, and only then try again.&lt;/p&gt;

&lt;p&gt;This is why the DB migration script is launched from a CI/CD server in the animation above. But it doesn't necessarily have to be done that way. For example, it could be a one-time Kubernetes task launched as part of the deployment process. Just remember the main principle: launch it only once, and if it fails, roll back immediately.&lt;/p&gt;
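&lt;p&gt;For illustration, the run-once bookkeeping could look roughly like this. This is a minimal sketch with hypothetical names, using SQLite; real migration frameworks such as Django migrations, Flyway, or Alembic handle this for you:&lt;/p&gt;

```python
import sqlite3

def run_migrations_once(conn, migrations):
    """Apply each named migration at most once; stop on the first failure."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
    for name, sql in migrations:
        if name in applied:
            continue  # already applied: a no-op, safe to call again
        with conn:  # one transaction per migration
            conn.execute(sql)
            conn.execute("INSERT INTO schema_migrations (name) VALUES (?)", (name,))
        # On failure, the exception propagates: no automatic retries.
        # Let the CI/CD pipeline stop and roll back the deployment.
```

&lt;p&gt;Calling it a second time is a no-op, and a failed migration raises instead of retrying, so the pipeline can stop and roll back.&lt;/p&gt;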

&lt;h2&gt;What can cause downtime during database migration?&lt;/h2&gt;

&lt;p&gt;There are two main reasons for this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Backward incompatibility&lt;/strong&gt;. As shown above, a deployment process isn't instant, and at some point, the database is already upgraded, but older application instances are still running in production. If the newer database version is incompatible with the previous application version, it may cause errors and crashes in production until older application instances are fully replaced with newer ones.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Heavy DB operations&lt;/strong&gt;. A database migration may include some heavy operations, which lead to increased DB server load or prolonged database locks. As a result, it may slow down the application or make it unresponsive during the DB upgrade.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now, let's look at the most common cases related to the backward incompatibility problem. We'll also briefly talk about heavy DB operations at the end of this post.&lt;/p&gt;

&lt;h2&gt;Example 1: downtime caused by a new column&lt;/h2&gt;

&lt;p&gt;Let's say we're building a new feature — user avatars. After registration, every user will get a randomly generated avatar with an option to upload their own. To implement this, we need to add the &lt;code&gt;avatar&lt;/code&gt; column to the &lt;code&gt;Users&lt;/code&gt; table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Farhd0dulrbcdbfo7x537.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Farhd0dulrbcdbfo7x537.png" alt="Adding a table column" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How can we upgrade the database from v1 to v2 in this case? Our database migration script should include the following operations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add the &lt;code&gt;avatar&lt;/code&gt; field to the &lt;code&gt;Users&lt;/code&gt; table, nullable;&lt;/li&gt;
&lt;li&gt;Update all existing records in the &lt;code&gt;Users&lt;/code&gt; table, generate random avatars;&lt;/li&gt;
&lt;li&gt;Make the &lt;code&gt;avatar&lt;/code&gt; field non-nullable.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And in the application code, we need to implement the following functionality:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate random avatars for new users;&lt;/li&gt;
&lt;li&gt;Show user avatars where appropriate;&lt;/li&gt;
&lt;li&gt;Provide an option for users to upload their custom avatars.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What will happen if we just naively push all of the above into production? As we mentioned before, the database migration script will run first. Then the application instances will be gradually replaced with their newer versions. At some point in time, the previous application version, which doesn't know anything about the &lt;code&gt;avatar&lt;/code&gt; field yet, will be running against the upgraded DB version, which already has that new field.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It will effectively lead to broken user registrations&lt;/strong&gt; since application v1 will try to insert new records into the &lt;code&gt;Users&lt;/code&gt; table without any value provided for the &lt;code&gt;avatar&lt;/code&gt; field, which is non-nullable. Of course, the problem will fix itself after some time when all v1 application instances are replaced with v2. However, since we're talking about "zero-downtime deployments," this is not good enough for us. Let's see how we can deploy this new feature without any downtime.&lt;/p&gt;
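&lt;p&gt;To see the failure concretely, here is a minimal sketch (SQLite used for illustration) of what app v1's insert runs into once the naive migration has already made &lt;code&gt;avatar&lt;/code&gt; non-nullable:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# State after the naive migration: `avatar` is already NOT NULL.
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, "
    "name TEXT NOT NULL, avatar TEXT NOT NULL)"
)

registration_failed = False
try:
    # App v1 knows nothing about `avatar`, so its INSERT omits it:
    conn.execute("INSERT INTO users (name) VALUES ('carol')")
except sqlite3.IntegrityError:
    registration_failed = True  # NOT NULL constraint violated
```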

&lt;h3&gt;Solution&lt;/h3&gt;

&lt;p&gt;The trick is to split the feature deployment into multiple phases and deploy them one by one, waiting for each phase to deploy completely before moving to the next one. In this particular case, we need two phases:&lt;/p&gt;

&lt;h4&gt;Phase 1&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;DB migration script:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add the &lt;code&gt;avatar&lt;/code&gt; field to the &lt;code&gt;Users&lt;/code&gt; table, nullable&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;New app features&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate random avatars for all new users under the hood&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The database migration script only adds a new nullable field and therefore doesn't cause any issues with new records inserted by the previous app version. Then the updated application version is deployed, and the new field becomes populated for all new records.&lt;/p&gt;

&lt;h4&gt;Phase 2&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;DB migration script:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate random avatars for all existing users with empty avatars;&lt;/li&gt;
&lt;li&gt;Make the &lt;code&gt;avatar&lt;/code&gt; field non-nullable&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;New app features&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All remaining "avatar" features — show user avatars where appropriate, provide an option to upload their own avatar, etc.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The database migration script updates existing user records by populating all empty &lt;code&gt;avatar&lt;/code&gt; values. After that, it makes the &lt;code&gt;avatar&lt;/code&gt; field non-nullable. Even if new registrations happen during the deployment, there will be no blank values in this field since the app version that we deployed on Phase 1 already generates avatars for all new users. Therefore, we can safely enforce the "non-null" constraint and deploy all the remaining avatar features.&lt;/p&gt;
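&lt;p&gt;The two phases can be sketched end to end as follows (SQLite used for illustration; the exact syntax for enforcing &lt;code&gt;NOT NULL&lt;/code&gt; varies by engine):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (name) VALUES ('alice')")

# Phase 1 migration: add the column as NULLABLE.
conn.execute("ALTER TABLE users ADD COLUMN avatar TEXT")

# App v1 is still live and omits `avatar`; its inserts keep working:
conn.execute("INSERT INTO users (name) VALUES ('bob')")

# Phase 2 migration: backfill existing rows, then enforce the constraint.
conn.execute(
    "UPDATE users SET avatar = 'random-' || id || '.png' WHERE avatar IS NULL"
)
# SQLite can't add NOT NULL in place; in Postgres this last step would be:
#   ALTER TABLE users ALTER COLUMN avatar SET NOT NULL;
```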

&lt;h2&gt;Example 2: downtime caused by a column removal&lt;/h2&gt;

&lt;p&gt;Let's say that the "User avatars" feature we described in the previous example didn't meet our expectations. It wasn't popular enough, so we decided to roll it back. There will be no more user avatars, so we want to remove all application functionality related to it and also remove the &lt;code&gt;avatar&lt;/code&gt; column from the DB:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu68hfh3j0rhvzai5ccef.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu68hfh3j0rhvzai5ccef.png" alt="Removing a table column" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As with the previous example, if we just naively push everything to production, including the DB migration script, it will cause downtime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The application will be severely broken during the deployment.&lt;/strong&gt; As we discussed above, the DB migration script will run first. Immediately after its execution, there will be a time when the previous application version is still running in production, but the &lt;code&gt;avatar&lt;/code&gt; column no longer exists, so all functionality related to it will be broken. How could we avoid this?&lt;/p&gt;

&lt;h3&gt;Solution&lt;/h3&gt;

&lt;p&gt;Use the same technique as with the previous example — split the deployment into two phases:&lt;/p&gt;

&lt;h4&gt;Phase 1&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;DB migration script:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make the &lt;code&gt;avatar&lt;/code&gt; column nullable&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;New app features&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove all of the functionality related to avatars. Remove all mentions of the &lt;code&gt;avatar&lt;/code&gt; column from the app code&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Making the column nullable doesn't break the older application version that is still running in production and relies on the &lt;code&gt;avatar&lt;/code&gt; column. At the same time, it allows us to deploy a new application version that won't use this column in any way.&lt;/p&gt;

&lt;h4&gt;Phase 2&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;DB migration script:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drop the &lt;code&gt;avatar&lt;/code&gt; column&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;New app features&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;None&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;After Phase 1 is deployed, there are no more mentions of the &lt;code&gt;avatar&lt;/code&gt; column anywhere in the application code, so we can safely drop it.&lt;/p&gt;

&lt;h2&gt;Example 3: renaming a column or changing its data type&lt;/h2&gt;

&lt;p&gt;Let's say we'd like to upgrade the avatars feature that we described in example 1. Instead of storing file names, we'd like to store full avatar URLs so that we can support avatars hosted on different domains. Full URLs are noticeably longer, so we need to extend the maximum length of the &lt;code&gt;avatar&lt;/code&gt; column. Also, we need to convert existing data to the new format by transforming file names into full URLs. Lastly, it is a good idea to rename the column from &lt;code&gt;avatar&lt;/code&gt; to &lt;code&gt;avatar_url&lt;/code&gt; to better reflect its new purpose:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffgqvn47i5k2nimp2xq4q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffgqvn47i5k2nimp2xq4q.png" alt="Replacing a table column" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this case, the data migration script includes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Changing the column data type, &lt;code&gt;varchar(100) -&amp;gt; varchar(2000)&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;Converting existing data from the old format (file names) to the new one — full URLs;&lt;/li&gt;
&lt;li&gt;Renaming the column, &lt;code&gt;avatar -&amp;gt; avatar_url&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And in the application code, we need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Save the data in the new format;&lt;/li&gt;
&lt;li&gt;Read the data in the new format and process it accordingly.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If we simply push such a change to production, it will cause downtime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The application will be severely broken during the deployment.&lt;/strong&gt; As we discussed above, the DB migration script will run first. During its execution, and for some time afterward, the previous application version will still be running in production. The first part of the script only extends the maximum length of the stored data. Such an operation alone wouldn't cause any downtime. However, the data transformation applied in the next step will cause incorrect avatar handling since the previous application version still relies on the old format. And the last step, which renames the column, will cause application crashes, including broken user registrations. The previous application version won't be able to read or write the data because the column name changed. The application will remain broken until its new version is fully deployed. How can we avoid this?&lt;/p&gt;

&lt;h3&gt;Solution&lt;/h3&gt;

&lt;p&gt;Use the same technique as with the previous examples — split the migration into phases. This case is more complicated — we need four phases:&lt;/p&gt;

&lt;h4&gt;Phase 1&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;DB migration script:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add the new &lt;code&gt;avatar_url&lt;/code&gt; column, nullable;&lt;/li&gt;
&lt;li&gt;Don't change the existing &lt;code&gt;avatar&lt;/code&gt; column yet.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;New app features&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start writing data to both the old &lt;code&gt;avatar&lt;/code&gt; column and the new &lt;code&gt;avatar_url&lt;/code&gt; column in their corresponding formats;&lt;/li&gt;
&lt;li&gt;Don't change the read logic yet — still read from the old &lt;code&gt;avatar&lt;/code&gt; column.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;First, the DB migration script only adds a new nullable field, so no downtime. The application starts writing data to both fields, preparing for the next phase.&lt;/p&gt;

&lt;h4&gt;Phase 2&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;DB migration script:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Populate all empty values in the &lt;code&gt;avatar_url&lt;/code&gt; column with values from the &lt;code&gt;avatar&lt;/code&gt; column, while converting the data into the new format (file names to full URLs);&lt;/li&gt;
&lt;li&gt;Make the &lt;code&gt;avatar_url&lt;/code&gt; column non-nullable;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;New app features&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Switch the reading logic to the new &lt;code&gt;avatar_url&lt;/code&gt; column;&lt;/li&gt;
&lt;li&gt;Continue writing data to both &lt;code&gt;avatar&lt;/code&gt; and &lt;code&gt;avatar_url&lt;/code&gt; columns for now.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The DB migration script populates the data in the new column by converting file names stored in the old column to full URLs stored in the new column. Then, it enforces the non-null constraint on the new field. It shouldn't cause any problems since the app was already populating &lt;code&gt;avatar_url&lt;/code&gt; for all new records. The application version deployed in this phase switches to reading the data from the new &lt;code&gt;avatar_url&lt;/code&gt; field. It still writes the data to both old and new fields, to ensure backward compatibility with the app version deployed on Phase 1.&lt;/p&gt;

&lt;h4&gt;Phase 3&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;DB migration script:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make the &lt;code&gt;avatar&lt;/code&gt; column nullable&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;New app features&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stop writing data for the &lt;code&gt;avatar&lt;/code&gt; column and remove all mentions of it from the code&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The DB migration script makes the &lt;code&gt;avatar&lt;/code&gt; field nullable, so that the application can stop writing data for it.&lt;/p&gt;

&lt;h4&gt;Phase 4&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;DB migration script:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drop the old &lt;code&gt;avatar&lt;/code&gt; column from the &lt;code&gt;Users&lt;/code&gt; table&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;New app features&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;None&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;After Phase 3 finishes deploying, we can remove the old column from the DB. No application changes are required in this phase since the application was already upgraded in the previous phases.&lt;/p&gt;

&lt;h2&gt;A generic approach to maintaining backward DB compatibility&lt;/h2&gt;

&lt;p&gt;As you can see, all three solutions above use the same technique — split the deployment of a new feature into two or more phases to avoid any downtime during deployment. Of course, these three examples don't cover all possible DB migration cases, but they should help you understand the main idea. If you get the idea and know how your deployment script works, you can develop solutions for other cases yourself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsztjv20udianlegduavw.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsztjv20udianlegduavw.gif" alt="Deployment process with a DB upgrade" width="760" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Quick recap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;During the deployment, the data migration script runs first;&lt;/li&gt;
&lt;li&gt;After that, there will be a brief period when older application instances are running against a newer, upgraded version of the DB. This is a potentially risky moment when downtime caused by broken backward compatibility might happen;&lt;/li&gt;
&lt;li&gt;You can avoid this downtime by splitting your deployment into multiple phases if necessary. Split it in such a way that the previous application version, which is still running in production, is always compatible with the newer database version that you're going to deploy;&lt;/li&gt;
&lt;li&gt;The examples above illustrate how exactly you can plan your deployment phases to avoid downtime caused by broken backward compatibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Heavy DB operations&lt;/h2&gt;

&lt;p&gt;Another common reason for downtime during the DB upgrade is that some modifications performed by the database migration script can cause a heavy load on the database or lead to prolonged locks on some tables, causing application slowness or downtime.&lt;/p&gt;

&lt;p&gt;As a rule of thumb, problems could emerge when modifying tables that store a lot of data. The creation of new tables is fast, deletion is also fast, and modifications of small tables usually don't cause any issues. But if you're going to modify a large table, e.g., add or remove columns or change their data type, create or modify indexes or constraints, it could be slow, sometimes painfully slow. What could we do?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip #1: Use a modern DB engine.&lt;/strong&gt; Databases are constantly evolving. For example, one of the most requested features in the MySQL community was the ability to do fast DDL operations that won't require a full table rewrite. &lt;a href="https://dev.mysql.com/blog-archive/mysql-8-0-innodb-now-supports-instant-add-column/" rel="noopener noreferrer"&gt;MySQL v8.0 introduced noticeable improvements&lt;/a&gt;, including instant adding of new columns if certain conditions are met. Another example — in Postgres versions 10 or earlier, adding new columns with a default value caused a full table rewrite, &lt;a href="https://www.2ndquadrant.com/en/blog/add-new-table-column-default-value-postgresql-11/" rel="noopener noreferrer"&gt;which was fixed in Postgres v11&lt;/a&gt;. It doesn't mean that the DDL performance is already a solved problem of course, but upgrading to a newer DB server version could potentially make your life easier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip #2: When modifying a large table, check what happens under the hood.&lt;/strong&gt; In many cases, you can reduce the risk of downtime by using a slightly different set of operations that will be easier for the database server to process. Here are some useful links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-online-ddl-operations.html" rel="noopener noreferrer"&gt;MySQL — Online DDL Operations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Postgres — check the &lt;a href="https://github.com/tbicr/django-pg-zero-downtime-migrations#how-it-works" rel="noopener noreferrer"&gt;django-pg-zero-downtime-migrations&lt;/a&gt; package, which provides a detailed explanation of how locks work in Postgres and which operations can be considered safe.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tip #3: Make upgrades when the service has the least amount of traffic.&lt;/strong&gt; If an upgrade touches a large table, consider doing it during a period of low activity. It could be beneficial in two ways. First, the DB server will be less loaded and therefore could potentially complete the upgrade faster. Second, even despite careful preparation, such upgrades can be risky. It could be hard to fully test how the upgrade will work under the production load. Therefore, it makes sense to reduce the blast radius if downtime happens. During periods of low activity, the impact will be lower due to fewer users being online.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip #4: Consider slow-running migrations.&lt;/strong&gt; Some tables can be so large that the traditional migration approach is simply not viable for them. In such cases, you can consider embedding the data migration code right into your application, or use a special utility like &lt;a href="https://github.com/github/gh-ost" rel="noopener noreferrer"&gt;GitHub's online schema migration for MySQL&lt;/a&gt;. A slow-running migration can work in production for days or even weeks. It gradually converts the data in small chunks, so you can carefully balance the load on the database while making sure that it doesn't cause slowness or downtime.&lt;/p&gt;
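&lt;p&gt;A chunked backfill can be sketched like this (a hypothetical illustration reusing the avatar example; the URL prefix and names are made up). Each chunk runs in its own short transaction, so locks stay brief, and the pause lets you throttle the load:&lt;/p&gt;

```python
import sqlite3
import time

def backfill_in_chunks(conn, chunk_size=1000, pause=0.0):
    """Gradually convert `avatar` file names into full `avatar_url` values."""
    while True:
        with conn:  # one short transaction per chunk
            cur = conn.execute(
                "UPDATE users SET avatar_url = 'https://cdn.example.com/' || avatar "
                "WHERE id IN (SELECT id FROM users WHERE avatar_url IS NULL LIMIT ?)",
                (chunk_size,),
            )
        if cur.rowcount == 0:
            break  # everything converted
        time.sleep(pause)  # throttle: keep the production DB responsive
```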

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;While zero-downtime database migration requires some effort, it's not &lt;em&gt;that&lt;/em&gt; complex. The two main reasons for downtime are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Failure to maintain backward compatibility;&lt;/li&gt;
&lt;li&gt;Heavy DB operations.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To solve the backward compatibility problem, you may need to deploy a new feature in multiple phases instead of pushing everything into production at once. This article covers the three most common scenarios in detail and provides generic guidelines on how to avoid downtime in other cases.&lt;/p&gt;

&lt;p&gt;The heavy DB operations section above briefly covers the second problem and provides some links for further reading.&lt;/p&gt;

&lt;p&gt;I hope this article helps you avoid some downtime in your projects. Less downtime enables more frequent deployments, and therefore makes development faster. Happy migrations!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>database</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
