🚨 The Symptom
I started noticing something strange in my observability stack:
- Integration tokens were being minted repeatedly
- My token endpoint showed activity even when no user interaction was happening
- Metrics suggested constant "traffic" to an otherwise idle system
At first glance, it looked like:
- A security issue
- A rogue client
- Or a broken API consumer
It was none of those.
🔍 The Root Cause
The issue came down to a subtle but critical architectural mistake:
I was using a non-shared cache in a multi-worker environment.
Stack involved:
- PHP-FPM (2 workers)
- APCu (in-memory cache)
- Token-based integration between services
⚙️ What Went Wrong
APCu is process-local, not shared.
That means:
Worker A cache ≠ Worker B cache
Each PHP-FPM worker had its own isolated memory.
💥 The Cascade Effect
My token logic was straightforward:
```
if token not in cache:
    mint_new_token()
```
But in reality, the system behaved like this:
- Request hits Worker A → token exists → OK
- Next request hits Worker B → cache miss → mint new token
- Repeat across workers → continuous token regeneration
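The cascade above is easy to reproduce in miniature. In this Python sketch, two plain dicts stand in for the private APCu store of each PHP-FPM worker (the names and token format are illustrative, not from the original system):

```python
import itertools

# Hypothetical stand-in: each dict plays the role of one PHP-FPM
# worker's process-local APCu store.
worker_caches = [{}, {}]
mint_count = itertools.count(1)

def handle_request(worker_id: int) -> str:
    cache = worker_caches[worker_id]
    if "token" not in cache:                          # miss in THIS worker only
        cache["token"] = f"token-{next(mint_count)}"  # mint a fresh token
    return cache["token"]

# Alternating requests across workers: each worker mints its own token,
# even though logically only one token should ever exist.
print(handle_request(0))  # token-1  (Worker A: miss -> mint)
print(handle_request(1))  # token-2  (Worker B: its cache is empty too)
print(handle_request(0))  # token-1  (Worker A: hit)
```

With two workers the waste is bounded, but every added worker multiplies the minting, which is exactly the "constant traffic" the metrics showed.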
📊 Why Observability Looked "Wrong"
From the outside, it looked like traffic was hitting the token endpoint.
But in reality:
The system was generating its own traffic due to cache inconsistency.
This is a key lesson:
- Not all traffic is external
- Some is emergent behavior from system design
✅ The Fix
I switched from APCu to:
- Redis (shared cache)
Now:
All workers → same cache → consistent token state
Result:
- Tokens minted once
- Reused across all workers
- Metrics stabilized instantly
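The shared-store version of the same logic is the whole fix. A minimal sketch, with one shared dict standing in for Redis (in the real system this check-then-set should be atomic, e.g. Redis's SET key value NX EX ttl):

```python
import itertools

shared_cache = {}            # stand-in for Redis: one store for all workers
mint_count = itertools.count(1)

def handle_request(worker_id: int) -> str:
    # Every worker consults the SAME store, so the first miss mints
    # the token and every later request, from any worker, reuses it.
    if "token" not in shared_cache:
        shared_cache["token"] = f"token-{next(mint_count)}"
    return shared_cache["token"]

# Requests from different workers now agree on one token.
assert handle_request(0) == handle_request(1) == "token-1"
```

Note the code is unchanged from the broken version; only the locality of the cache changed. That is why the bug was architectural rather than logical.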
🔒 Production Hardening (What I Added Next)
Fixing the cache wasn't enough, so I hardened the system further.
1. Distributed Locking
To prevent race conditions:
```
if token exists:
    return token
acquire lock
re-check cache
mint token if still missing
release lock
```
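These steps are the classic double-checked locking pattern. Here is a runnable Python sketch, with threading.Lock standing in for a distributed lock (in production the lock itself would live in Redis, e.g. via SET lock_key NX EX; all names here are illustrative):

```python
import threading

cache = {}                    # stand-in for the shared token cache
mint_lock = threading.Lock()  # stand-in for a distributed lock
mint_calls = 0

def mint_token() -> str:
    global mint_calls
    mint_calls += 1
    return f"token-{mint_calls}"

def get_token() -> str:
    token = cache.get("token")
    if token is not None:           # fast path: token exists, no lock needed
        return token
    with mint_lock:                 # acquire lock
        token = cache.get("token")  # re-check: another worker may have won
        if token is None:
            token = mint_token()    # mint only if still missing
            cache["token"] = token
        return token                # lock released on exiting the block

# Many concurrent callers, exactly one mint.
threads = [threading.Thread(target=get_token) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The re-check inside the lock is the important part: without it, every worker that raced past the fast path would still mint its own token once it acquired the lock.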
2. TTL Buffering
Avoid edge expiration issues:
```
cache_ttl = token_expiry - safety_margin
```
3. Observability Metrics
I added:
- token_cache_hits
- token_cache_misses
- token_mint_count
Now anomalies show up immediately.
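Instrumenting the token path takes only a few counters. A sketch with a collections.Counter standing in for a real metrics client (StatsD, Prometheus, or similar):

```python
from collections import Counter

metrics = Counter()  # stand-in for a real metrics client
cache = {}

def get_token() -> str:
    token = cache.get("token")
    if token is not None:
        metrics["token_cache_hits"] += 1
        return token
    metrics["token_cache_misses"] += 1
    metrics["token_mint_count"] += 1  # every miss here triggers a mint
    cache["token"] = "token-1"        # illustrative mint
    return cache["token"]

get_token()  # first call: miss + mint
get_token()  # second call: hit
# Healthy signal: hits dominate. A miss rate that tracks the request
# rate is exactly the multi-worker symptom described above.
```

With these three counters, the original bug would have been visible at a glance: token_mint_count climbing in lockstep with request volume instead of staying near zero.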
🧠 Key Takeaway
This wasnβt just a bug.
It was a distributed systems failure mode:
Cache locality + multi-worker architecture → inconsistent state → emergent traffic
⚡ Final Insight
If your system:
- Runs multiple workers
- Uses in-memory caching
- Relies on shared state
Then this rule applies:
If your cache isn't shared, your state isn't real.
🏁 Closing
This issue reinforced something critical in my engineering journey:
You don't debug systems by staring at code;
you debug them by understanding how state flows across boundaries.
If you're building distributed APIs, token systems, or high-concurrency services,
this is one edge case worth designing for early.