DEV Community

SANKET PATIL

Production Systems Fail in Ways Tutorials Never Discuss

Most engineers think of bugs as something technical.

A broken query.
A race condition.
A missing index.

You fix it, deploy a hotfix, and move on.

But sometimes a “bug” reveals something much deeper - not just about the system, but about how software businesses actually work.

Recently, I faced one such situation that changed how I think about production systems, ownership, and responsibility as a senior engineer.


The Performance Issue That Didn’t Make Sense

A customer reported that the application had become very slow.

Naturally, the first place we checked was the application infrastructure.

The application was running on an Azure App Service plan P1v3, which is a reasonably capable tier. When we looked at the metrics:

  • CPU utilization was below 60%
  • Memory looked stable
  • No major spikes in requests
  • No unusual error rates

From the application side, everything looked normal.
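That first triage pass can be sketched as a simple health check. This is purely illustrative: the function name and thresholds are mine, loosely matching the numbers we saw, not values from any Azure API.

```python
# Illustrative triage helper mirroring the app-tier checks we ran.
# Thresholds are example values chosen for this sketch, not Azure defaults.

def app_tier_looks_healthy(cpu_percent, memory_percent, error_rate):
    """Return True when none of the app-tier metrics point at a problem."""
    return (
        cpu_percent < 60        # CPU utilization was below 60%
        and memory_percent < 80  # memory looked stable
        and error_rate == 0      # no errors being reported
    )

# Roughly the picture we saw in production: a "healthy" app tier.
print(app_tier_looks_healthy(cpu_percent=55, memory_percent=40, error_rate=0))
```

When a check like this passes but users still report slowness, the bottleneck is almost certainly downstream of the application servers.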

So the question became:

If the application server is healthy, why is the app slow?


The First Clue: Local Environment Was Fast

When we tested the application locally using a restored database backup, the performance was significantly faster.

That was the moment we realized:

This might not be an application problem.

This might be a database problem.


The Real Culprit: A $5 Database Plan

After investigating the database metrics, we finally found the issue.

The production database was running on a $5 basic plan, and it was consistently hitting 100% utilization.

The application itself was fine.

But the database simply did not have enough resources to handle the workload.

Once the database became the bottleneck, every request slowed down - giving the impression that the entire system was performing poorly.
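There is a simple way to see why a saturated database drags every request down. Using a textbook M/M/1 queueing approximation (an illustration, not a measurement from our system), mean response time is 1/(μ − λ) for service rate μ and arrival rate λ, so latency explodes as utilization approaches 100%:

```python
# Illustrative M/M/1 queueing model: mean response time W = 1 / (mu - lam),
# where mu is the rate the DB can serve queries and lam is the arrival rate.
# The numbers below are made up to show the shape of the curve.

def mean_response_time(mu, lam):
    """Mean time a request spends in an M/M/1 system (queueing + service)."""
    if lam >= mu:
        raise ValueError("system is overloaded: the queue grows without bound")
    return 1.0 / (mu - lam)

mu = 100.0  # suppose the database can serve 100 queries/sec
for utilization in (0.5, 0.9, 0.99):
    lam = utilization * mu  # offered load at this utilization
    w_ms = mean_response_time(mu, lam) * 1000
    print(f"{utilization:.0%} busy -> ~{w_ms:.0f} ms per query")
    # 50% busy -> ~20 ms, 90% busy -> ~100 ms, 99% busy -> ~1000 ms
```

The same database that felt fast at half load becomes fifty times slower near saturation, which is exactly why a $5 plan pinned at 100% makes the whole system feel broken.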


The Awkward Truth: We Had Recommended This Earlier

The uncomfortable part of this situation was that this wasn’t completely unexpected.

Earlier in the project we had recommended:

Start with the basic plan and upgrade the database as data grows.

This is a common and reasonable approach during early deployments.

But like many things in production systems, it was never revisited again.

The system kept growing.
The data increased.
Usage increased.

But the database plan stayed the same.

Eventually the performance issue surfaced - and the customer reported it as a bug.


The Engineering Instinct: Just Fix It

As engineers, our instinct is simple:

Fix the problem.

So the immediate solution was obvious:

  • Upgrade the database tier
  • Monitor performance
  • Verify application responsiveness

And technically, that would resolve the issue.

But this is where I learned something new - something that tutorials never teach.


The Business Reality of Fixed-Price Projects

This project had been delivered as a fixed-price engagement.

In such projects, development and delivery are scoped within an agreed cost.

But what happens months later when:

  • Infrastructure needs adjustment
  • External services change policies
  • Usage grows beyond initial assumptions
  • Performance tuning becomes necessary

Are these bugs?

Or are they support and maintenance work?

This is where things get complicated.

Startups and small teams often fall into a trap:

They keep fixing things for free because it feels like the “right thing to do.”

But over time, this creates a hidden cost for the company.

Engineering time becomes support time.

And support becomes unplanned work.


Another Case: When Third-Party Services Change

We encountered a similar situation with Auth0.

A change in policy meant that a callback configuration required updates, including adding localhost to the allowed callback URLs for certain flows.
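Concretely, the fix was a configuration update on the Auth0 application, not a code change. As a sketch (the `callbacks` field is Auth0's Management API name for the dashboard's "Allowed Callback URLs"; the URLs themselves are placeholders, not our real ones):

```json
{
  "callbacks": [
    "https://app.example.com/callback",
    "http://localhost:3000/callback"
  ]
}
```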

This change caused issues in the application, and it was reported as a bug.

But the root cause wasn’t in our application code.

It was a policy change from a third-party service.

So the question becomes:

Should the development team fix this under the original project scope?

Or should it be treated as ongoing platform maintenance?


The Bigger Question: What Is Actually a Bug?

This experience made me rethink something important.

Not every issue reported in production is actually a software bug.

Some issues fall into other categories:

1. Infrastructure limitations
Example: Under-provisioned database resources.

2. Usage growth
Systems behaving differently as data and traffic scale.

3. Third-party service changes
Authentication providers, APIs, SDKs updating their policies.

4. Operational configuration issues
DNS, environment variables, callback URLs, quotas.

Yet many organizations treat all of these as bugs that must be fixed immediately.

And engineers often jump in and solve them without questioning the long-term impact.


What Larger Organizations Typically Do

More mature software organizations usually separate work into clear categories:

1. Development Scope
Features agreed in the project contract.

2. Warranty Period
A limited period where genuine defects are fixed.

3. Support / Maintenance Plan
Ongoing system maintenance, monitoring, infrastructure tuning, and external dependency updates.

This ensures that engineering teams are not constantly pulled into unplanned work months after delivery.


Lessons I Learned as an Engineer

This experience taught me several lessons that go beyond debugging.

1. Production Systems Are Living Systems

They evolve as data, users, and dependencies change.

Infrastructure decisions made early in a project may not hold months later.


2. Not Every Issue Is a Bug

Sometimes the system is behaving exactly as expected.

It’s the environment that changed.


3. Infrastructure Is Part of the Product

Developers often think only about application code.

But performance is frequently determined by:

  • database tiers
  • network limits
  • cache strategies
  • third-party APIs

Ignoring infrastructure can lead to misleading conclusions about “bugs”.


4. Fixed-Price Projects Need Clear Boundaries

Without clear support agreements, teams risk entering a cycle of continuous unpaid maintenance.

A sustainable approach usually includes:

  • post-launch support plans
  • infrastructure monitoring
  • defined response scopes
  • upgrade recommendations

5. Engineers Should Think Beyond Code

One of the biggest mindset shifts for me was realizing that senior engineering roles require understanding not only:

  • systems
  • debugging
  • architecture

but also product sustainability and operational ownership.

The health of a system is not just technical.

It’s also organizational.


Final Thought

The original problem looked like a simple performance bug.

But the deeper lesson had nothing to do with code.

It was about understanding how software systems, infrastructure, business agreements, and third-party dependencies intersect.

As engineers, we often focus on fixing the problem in front of us.

But sometimes the real challenge is asking a different question:

Is this actually a bug - or is this part of running a real production system?

And the answer to that question can shape how a product - and a company - operates long term.
