This week, I tried optimizing a production endpoint.
My first instinct was predictable: rewrite the query.
Because when performance drops, we instinctively blame the database. Performance issue equals database issue. Right?
Wrong.
The Query Wasn’t the Problem
I started by reviewing everything carefully: the CTE usage, the ordering logic, the count behavior, the execution plan, and multiple rewrite attempts to see if something subtle had been missed.
I tested variations. I compared execution times. I re-evaluated the logic to ensure nothing unnecessary was happening.
The result?
No consistent improvement.
And that’s when it became clear: the query wasn’t inefficient. It was already aligned with the business requirements. It was returning exactly what it was supposed to return — Jira boards, their sub-boards, and their sprints — in a structurally correct way.
Trying to squeeze more performance out of it wasn’t optimization anymore. It was forcing change where none was needed.
The problem was somewhere else.
Strategic Caching — Not Just @Cacheable
So we shifted focus to caching.
But caching is not just slapping @Cacheable on a method and calling it a day. It forces you to think about boundaries and trade-offs.
We had to answer three questions:
- What exactly are we caching?
- For how long?
- What risks does stale data introduce?
Initially, we considered a TTL of five hours, since the database sync only runs every twenty-four hours. On paper, that seemed reasonable.
But once we evaluated the risks more carefully, the picture changed.
Permissions can change.
Board structures can change.
Serving stale Jira board and sprint data could affect user experience and correctness.
So we reduced the TTL to thirty minutes.
It was a deliberate trade-off: freshness over aggressive caching.
Because performance means nothing if correctness degrades.
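A minimal sketch of that trade-off in a Spring Cache + Redis setup (the class and bean names here are mine, not from the original codebase):

```java
import java.time.Duration;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.cache.RedisCacheConfiguration;
import org.springframework.data.redis.cache.RedisCacheManager;
import org.springframework.data.redis.connection.RedisConnectionFactory;

@Configuration
public class CacheConfig {

    @Bean
    public RedisCacheManager cacheManager(RedisConnectionFactory connectionFactory) {
        // Entries expire after 30 minutes: freshness over aggressive caching.
        RedisCacheConfiguration defaults = RedisCacheConfiguration.defaultCacheConfig()
                .entryTtl(Duration.ofMinutes(30));
        return RedisCacheManager.builder(connectionFactory)
                .cacheDefaults(defaults)
                .build();
    }
}
```

The TTL lives in one place, so revisiting the freshness trade-off later is a one-line change.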
The Stale Cache Reality
After introducing caching, we later made changes to the DTO structure.
That’s when things started behaving inconsistently.
The endpoint wasn’t broken — it was unpredictable.
The reason was simple but painful: Redis was still serving old serialized payloads. The cache contained outdated object structures that no longer matched the current DTO shape.
We weren’t facing a logic bug. We were facing stale cache data.
The solution was explicit cache versioning.
We introduced versioned cache names like:
- jiraBoardKeysCacheV6
- jiraSubBoardsByOrgCacheV7
Instead of hoping stale entries would disappear, we forced hard invalidation.
It was a reminder that cache invalidation isn’t an academic concept. It’s a production concern.
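In Spring terms, the hard invalidation can be as simple as bumping the cache name on the annotated method. This is a sketch; the service class and loader method are hypothetical:

```java
import java.util.List;

import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class BoardCacheService {

    // Bumping the suffix (V6 -> V7) abandons the old Redis entries entirely,
    // so stale payloads serialized with the previous DTO shape are never
    // deserialized again.
    @Cacheable(cacheNames = "jiraSubBoardsByOrgCacheV7", key = "#orgId")
    public List<SubBoardDTO> findSubBoardsByOrg(String orgId) {
        return loadFromDatabase(orgId); // hypothetical loader
    }

    private List<SubBoardDTO> loadFromDatabase(String orgId) {
        // ... database query omitted in this sketch
        return List.of();
    }
}
```

The old `...V6` keys simply age out via their TTL; no manual flush is required.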
The Architectural Mistake We Didn’t See at First
During this process, we uncovered a deeper issue.
We were caching JPA entities directly.
That decision quietly introduced risk:
- Lazy-loading proxies leaking into serialization.
- Tight coupling between the persistence and API layers.
- Serialization inconsistencies.
- Difficulty evolving response structures.
The cache had effectively become an extension of the database layer.
So we introduced proper boundaries:
- BoardDTO
- SubBoardDTO
Each with a static fromEntity(...) mapping method.
DTOs became the contract between layers. They became the cache boundary.
Now the separation was clear:
Database layer ≠ API layer ≠ Cache layer
And that separation stabilized the system.
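A minimal sketch of that boundary (the field names are assumptions, not the actual schema). The DTO holds plain values only, so nothing lazy-loaded or persistence-specific ever reaches the cache:

```java
import java.util.Objects;

// Hypothetical JPA entity stand-in (persistence layer).
class BoardEntity {
    Long id;
    String name;

    BoardEntity(Long id, String name) {
        this.id = id;
        this.name = name;
    }
}

// Immutable DTO: the contract between layers and the cache boundary.
record BoardDTO(Long id, String name) {

    // The static factory is the single mapping point from entity to DTO.
    static BoardDTO fromEntity(BoardEntity entity) {
        Objects.requireNonNull(entity, "entity must not be null");
        return new BoardDTO(entity.id, entity.name);
    }
}
```

Because the record is a stable, plain-data shape, its serialized form in Redis no longer drifts with the persistence model.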
The Null Pointer Edge Case
Just when things seemed stable, a runtime error appeared:
“element cannot be mapped to a null key”
The root cause was subtle. We were grouping sub-boards by boardId, but some boardId values were null. The grouping operation failed because null keys were not handled.
The fix required defensive handling:
- Filtering out null keys before grouping.
- Adding null checks inside DTO mapping.
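The grouping fix can be sketched like this. `SubBoard` is a hypothetical stand-in; the point is that `Collectors.groupingBy` throws exactly that "element cannot be mapped to a null key" error when the classifier returns null, so null keys are filtered out first:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical sub-board shape; some rows arrive with a null boardId.
class SubBoard {
    Long boardId;
    String name;

    SubBoard(Long boardId, String name) {
        this.boardId = boardId;
        this.name = name;
    }
}

class SubBoardGrouping {

    // Drop null keys before grouping so the classifier never hands
    // Collectors.groupingBy a null key.
    static Map<Long, List<SubBoard>> groupByBoardId(List<SubBoard> subBoards) {
        return subBoards.stream()
                .filter(sb -> Objects.nonNull(sb.boardId))
                .collect(Collectors.groupingBy(sb -> sb.boardId));
    }
}
```

Whether to drop orphaned sub-boards or bucket them separately is a product decision; filtering is the simplest defensible default.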
After that, we eliminated:
- Null pointer risks.
- Null sub-board responses.
- Fragile response shapes.
It reinforced something simple: optimistic assumptions don’t survive production.
What This Was Really About
At the beginning, this looked like a query optimization problem.
It wasn’t.
It was:
- A data-shape problem.
- A cache-boundary problem.
- A stale-data problem.
- An edge-case handling problem.

The database wasn't slow. The architecture was fragile.
And once we strengthened the boundaries, defined cache strategy properly, and handled edge cases deliberately, the endpoint stabilized.
System design is rarely about dramatic rewrites.
Most of the time, it’s about understanding where your assumptions break — and fixing the seams between layers.
And more often than not, that’s where the real performance work lives.