Arpit Gupta

Posted on May 17

8 Engineering Lessons That Only Production Teaches You

#softwareengineering #career #webdev #programming

When I started writing software, I thought engineering growth was mostly about writing cleaner code, learning more frameworks, and designing better APIs.

Over time, I realized that real engineering maturity shows up somewhere else:

In production.

Production teaches you things that tutorials, side projects, and clean architecture diagrams usually do not.

It teaches you about failure modes, maintainability, observability, debugging, ownership, tradeoffs, and operational simplicity.

Here are 8 engineering lessons I learned over the years.

1. The best code you write might be the code you delete

Every line of code has a cost.

Not just the cost of writing it.

The cost of reading it.
The cost of testing it.
The cost of debugging it.
The cost of deploying it.
The cost of explaining it to the next engineer.
The cost of maintaining it when requirements change.

A feature might need 500 lines today, but if the same outcome can be achieved with 100 simpler lines, the smaller solution is often the better engineering decision.

This applies to:

unused abstractions
duplicate workflows
dead feature flags
stale configuration
old migration logic
unnecessary wrappers
over-engineered services
dependencies that solve tiny problems

Deleting code reduces surface area.

And a smaller surface area leaves fewer places for things to fail.

Good engineering is not always about adding more.

Sometimes it is about removing what the system no longer needs.
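
Dead feature flags are one of the easier items on that list to hunt down mechanically. Here is a minimal sketch, assuming flags are plain string names and the codebase can be read as text. The function name, the flag names, and the `is_enabled` call in the sample source are all hypothetical, made up for illustration.

```python
import re


def find_dead_flags(defined_flags, source_files):
    """Return flags that are defined but never referenced in any source text.

    defined_flags: iterable of flag names (e.g. loaded from a config file)
    source_files:  mapping of file path -> file contents
    """
    dead = []
    for flag in defined_flags:
        # Match the flag as a whole word so "new_ui" does not match "new_ui_v2".
        pattern = re.compile(r"\b" + re.escape(flag) + r"\b")
        if not any(pattern.search(text) for text in source_files.values()):
            dead.append(flag)
    return sorted(dead)


# Hypothetical flags and source, for illustration only.
flags = ["new_ui", "beta_checkout", "legacy_export"]
sources = {
    "app.py": "if is_enabled('new_ui_v2'): ...\nif is_enabled('beta_checkout'): ...",
}
dead_flags = find_dead_flags(flags, sources)
```

A text search like this can miss flags that are built dynamically, so treat the result as a list of candidates to review, not a list to delete blindly.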

2. Silent failures are worse than loud failures

A loud failure hurts immediately.

A service crashes.
An API returns 500.
A job fails.
An alert fires.
Someone investigates.

That is painful, but visible.

Silent failures are more dangerous.

A background job fails but does not alert.
An event is dropped but not retried.
A report is sent with missing data.
A queue consumer swallows exceptions.
A scheduled task stops running.
A data pipeline produces incorrect output.
A third-party API call fails, but the system marks the operation as successful.

No one knows immediately.

The system appears healthy while trust is slowly breaking.

This is why production systems need observability by design.

At minimum, important workflows should have:

structured logs
metrics
alerts
retries
dead-letter queues
idempotency
clear error states
dashboards for business-critical flows

A system should not only handle failure.

It should make failure visible.

If a failure can happen silently, it eventually will.
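
To make the "queue consumer swallows exceptions" case concrete, here is a rough sketch of the opposite behavior: every failure is logged as structured data, retried a bounded number of times, and then parked in a dead-letter list instead of disappearing. The function name, logger name, and message shape are assumptions for illustration, and a plain Python list stands in for a real dead-letter queue.

```python
import json
import logging

logger = logging.getLogger("orders.consumer")  # hypothetical logger name


def process_with_dead_letter(messages, handler, dead_letters, max_attempts=3):
    """Run handler on each message; retry, then dead-letter instead of dropping.

    Returns counters that a real system would export as metrics.
    """
    counters = {"succeeded": 0, "retried": 0, "dead_lettered": 0}
    for message in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(message)
                counters["succeeded"] += 1
                break
            except Exception as exc:
                # Structured log: one JSON object per failure, easy to search and alert on.
                logger.error(json.dumps({
                    "event": "message_failed",
                    "message_id": message.get("id"),
                    "attempt": attempt,
                    "error": repr(exc),
                }))
                if attempt == max_attempts:
                    # Out of retries: park the message where someone can see it.
                    dead_letters.append({"message": message, "error": repr(exc)})
                    counters["dead_lettered"] += 1
                else:
                    counters["retried"] += 1
    return counters
```

Retrying like this is only safe when the handler is idempotent. The `dead_lettered` counter is the number worth alerting on, because it is the one that means work was not done.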

3. Code should be written for the maintainer, not just the author

The person who writes the code and the person who maintains it are often not the same person.

Sometimes that person is another engineer.

Sometimes it is you, six months later, with no memory of the original context.

Maintainable code is not just code that works.

It is code that can be understood, changed, and debugged safely.

That means:

clear naming
smaller functions
explicit boundaries
predictable control flow
fewer hidden side effects
meaningful error messages
comments that explain the “why”, not the obvious “what”
tests around business-critical behavior

The goal is not to make code look clever.

The goal is to reduce the cognitive load for the next person.

Complex code is sometimes necessary.

Unclear code is not.

A good test of maintainability is simple:

Can another engineer understand the intent without needing a meeting?

If not, the code probably needs better structure, naming, or documentation.
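
A small illustration of two items from the list above, a comment that explains "why" and an error message that says what went wrong. The business rule, the threshold, and the names are invented for the example.

```python
from decimal import Decimal

# Hypothetical business rule, used only for illustration.
FREE_SHIPPING_THRESHOLD = Decimal("50.00")


def shipping_fee(order_total: Decimal, base_fee: Decimal = Decimal("4.99")) -> Decimal:
    """Return the shipping fee for an order total."""
    if order_total < 0:
        # Meaningful error: states the rule that was broken and the offending value.
        raise ValueError(f"order_total must be non-negative, got {order_total}")
    # Why >= and not >: an order of exactly the threshold should qualify,
    # otherwise a customer who adds items to reach 50.00 still pays shipping.
    if order_total >= FREE_SHIPPING_THRESHOLD:
        return Decimal("0.00")
    return base_fee
```

The comment on the comparison is the kind of context that would otherwise need a meeting to recover.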

4. “It works on my machine” is usually a system problem

When code works locally but fails in dev, staging, or production, the bug is not always in the code.

It can be a symptom of a weak engineering process.

Common causes include:

environment variable mismatch
missing secrets
dependency version drift
inconsistent database state
different feature flag values
different infrastructure permissions
manual setup steps
weak CI checks
poor parity between local and deployed environments
assumptions hidden inside the runtime environment

The real question is not only:

Why does it fail there?

The better question is:

Why are our environments allowed to behave so differently?

Good teams reduce environment surprises.

They invest in:

reproducible setup
automated tests
CI/CD validation
infrastructure as code
clear configuration management
database migration discipline
consistent deployment practices

“It works on my machine” should not be the end of debugging.

It should be the start of improving the system.
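
One cheap way to reduce environment surprises is to validate configuration at startup and fail fast with a clear message. A minimal sketch, assuming configuration comes from environment variables. The variable names and the `ConfigError` class are hypothetical.

```python
import os

# Hypothetical variable names, for illustration only.
REQUIRED_VARS = ("DATABASE_URL", "PAYMENTS_API_KEY", "FEATURE_FLAGS_URL")


class ConfigError(RuntimeError):
    """Raised when the process starts with incomplete configuration."""


def load_config(env=None, required=REQUIRED_VARS):
    """Return the required settings, or raise listing every missing one."""
    env = os.environ if env is None else env
    # Treat empty strings as missing: an empty secret is rarely intentional.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ConfigError(
            "missing required environment variables: " + ", ".join(missing)
        )
    return {name: env[name] for name in required}
```

Calling `load_config()` as the first step of startup turns a missing secret into an immediate, readable failure in every environment, instead of a confusing error deep inside a request.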

5. You cannot optimize what you cannot measure

Without measurement, optimization becomes opinion.

One person thinks the database is slow.
Another thinks the API is slow.
Another thinks the frontend is heavy.
Another thinks the issue is infrastructure.

Everyone may be partially right.

But without data, the team is guessing.

Useful engineering decisions need signals like:

p95 latency
error rate
throughput
CPU usage
memory usage
queue depth
database query time
cache hit ratio
retry count
external API latency
cost per request
impact by customer or tenant
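
As a rough sketch of how two of these signals are derived, here is p95 latency and error rate computed from raw request samples. It uses the nearest-rank percentile definition, which is one of several in common use, and the function names and the `(latency_ms, status_code)` sample format are assumptions. In practice a metrics system usually computes these for you.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    if not 0 < pct <= 100:
        raise ValueError(f"pct must be in (0, 100], got {pct}")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(requests):
    """requests: list of (latency_ms, status_code) tuples."""
    latencies = [latency for latency, _ in requests]
    # Count only server-side failures (5xx) as errors.
    errors = sum(1 for _, status in requests if status >= 500)
    return {
        "count": len(requests),
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "error_rate": errors / len(requests),
    }
```

The gap between p50 and p95 is often more informative than an average, because a few very slow requests can hide behind a healthy-looking mean.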

Metrics help separate symptoms from causes.

They also help prioritize.

A 20 ms optimization on an endpoint used once a day may not matter.

A 500 ms improvement on a high-volume workflow might matter a lot.

Measurement turns performance work from guesswork into engineering.
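
The comparison above becomes obvious once call volume is part of the arithmetic. In this sketch the 20 ms and 500 ms savings come from the examples above, while the 200,000 calls per day for the high-volume workflow is an assumed figure chosen only for illustration.

```python
def daily_time_saved_seconds(saving_ms, calls_per_day):
    """Total time saved per day, in seconds, for a given per-call saving."""
    return saving_ms * calls_per_day / 1000


# 20 ms saved on an endpoint called once a day.
rarely_used = daily_time_saved_seconds(saving_ms=20, calls_per_day=1)

# 500 ms saved on a workflow assumed to run 200,000 times a day.
high_volume = daily_time_saved_seconds(saving_ms=500, calls_per_day=200_000)
```

Under these assumptions the first change saves 0.02 seconds a day and the second saves 100,000 seconds, roughly 28 hours of cumulative waiting every day.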

Before optimizing, first ask:

What are we measuring?
Where is the bottleneck?
How often does it happen?
Who is affected?
What is the business impact?

Optimization without measurement is usually just expensive guessing.

6. The best architecture is the one your team can operate at 2 AM

Architecture is not only about design.

It is also about operations.

A system can look beautiful in a diagram and still be painful in production.

At 2 AM, when something breaks, the team needs practical answers:

Which service owns this workflow?
Where are the logs?
What changed recently?
Is there a rollback path?
Which dependency failed?
What is the blast radius?
Is the data safe?
Can we retry safely?
Who should be alerted?
What dashboard should we check first?

A highly elegant architecture that only one person understands is a risk.

A slightly boring architecture that the whole team can operate is often better.

Good architecture should optimize for:

clarity
ownership
debuggability
failure isolation
operational visibility
safe deployment
easy rollback
understandable tradeoffs

The real test of architecture is not how impressive it looks in a design review.

The real test is how quickly the team can understand and recover when it fails.
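
"Can we retry safely?" is much easier to answer at 2 AM if the operation was designed to be idempotent. Here is a minimal in-memory sketch of an idempotency key. The class, method, and key format are hypothetical, and a real system would keep the keys in a durable store shared by all instances.

```python
class PaymentProcessor:
    """In-memory sketch of idempotent charging, for illustration only."""

    def __init__(self):
        self._results = {}  # idempotency key -> stored result
        self.charges_made = 0

    def charge(self, idempotency_key, amount_cents):
        # A repeated key returns the stored result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": f"ch_{self.charges_made}", "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


processor = PaymentProcessor()
first = processor.charge("order-42", 1999)
retry = processor.charge("order-42", 1999)  # same key: no second charge
```

With this in place, the on-call engineer can re-run a failed batch without first working out which items already went through.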

7. Documentation is part of the feature

A feature is not complete just because the code is merged.

If people do not know how to use it, configure it, deploy it, debug it, or maintain it, the feature is incomplete.

Documentation does not need to be perfect.

But it should answer the questions future engineers will ask:

What does this feature do?
Why was it built this way?
What assumptions does it make?
How do I run it locally?
How do I deploy it?
How do I debug common failures?
What are the known edge cases?
What dashboards or logs should I check?
Who owns this workflow?
What should I not change without understanding the impact?

Good documentation reduces dependency on tribal knowledge.

It also protects the team from single-person ownership risk.

If only one engineer understands a system, that system carries an operational risk.

Documentation is not extra work after engineering.

Documentation is engineering.

8. Senior engineering requires saying “No” well

Saying yes is easy.

Saying no is harder.

But many engineering problems come from saying yes too quickly.

Yes to unclear requirements.
Yes to rushed deadlines.
Yes to shortcuts without ownership.
Yes to adding complexity without measuring impact.
Yes to features without operational planning.
Yes to changes that increase long-term maintenance cost.

Senior engineering is not about blocking work.

It is about protecting the system, the team, and the business from avoidable risk.

But “no” has to be said properly.

Not like this:

No, that is wrong.

Better:

Here is the risk. Here is the tradeoff. Here is what can go wrong. Here is a safer alternative.

A good “no” includes:

context
data
tradeoffs
alternatives
impact
a path forward

The goal is not to win an argument.

The goal is to help the team make a better decision.

Strong engineers do not just write code.

They improve decisions.

Final thought

Early in my career, I thought better engineering meant writing better code.

Now I think it means something broader.

Better engineering means building systems that are easier to understand, easier to operate, easier to debug, easier to change, and harder to break silently.

Code matters.

But production teaches you that code is only one part of the system.

The real work is building software that can survive real users, real failures, real deadlines, real data, and real operational pressure.

Good engineers solve problems.

Strong engineers reduce future problems.

I originally shared a shorter version of this on LinkedIn.

If this resonated with you, feel free to connect with me on LinkedIn. I share thoughts around software engineering, system design, production ownership, backend engineering, and lessons from building real-world systems.
