Why I made Shadow Mode the default for my FastAPI incident recovery tool

#api #devops #python #showdev

I didn't plan to build Shadow Mode.
I built AlertEngine to solve a specific problem: when a production API fails at 2am, most monitoring tools tell you what broke. None of them tell you who authorised the fix, or leave a record an auditor can replay.
That's the gap AlertEngine fills. AI diagnoses the incident. A human taps approve on WhatsApp. Every stage is logged to an immutable audit trail. Nothing executes without explicit authorisation.
The architecture works. The tests pass. The audit trail is real.
But when I started reaching out to potential customers in African fintech — payment platforms, cross-border rails, compliance-sensitive APIs — I kept hitting the same wall.
"How do we trust this around production?"
That question stopped me.
Because they were right. No regulated team should hand production recovery authority to a tool they've known for five minutes. That's not caution. That's governance.
So I asked a different question.
What if they didn't have to trust it yet?

What Shadow Mode does
Shadow Mode is the default evaluation state for all new AlertEngine tenants.
When Shadow Mode is active:

- Health polling runs every 5 seconds
- Incident detection runs via deterministic policy gates
- AI diagnosis runs: Diagnostic Council (dual-model) or single model
- Full pipeline state transitions: DETECTED → PROPOSED → VALIDATED
- Complete audit trail written with actor attribution

What doesn't run:

- WhatsApp and Telegram notifications
- Recovery token generation
- Webhook execution
- Voice escalation

Every suppressed action is logged to the audit trail with actor: "shadow_mode" so the tenant can see exactly what would have happened.

The implementation
The change was surgical. pipeline.py needed zero modifications — the state machine runs normally in Shadow Mode. All the gates are in loop.py.
I added a shadow_mode flag to the tenant schema, read it at the top of _process_tenant(), and passed it through every _execute_actions() call:
```python
shadow_mode = bool(tenant.get("shadow_mode", False))
```
In _execute_actions(), every external call checks the flag first:
```python
if action_type == "SEND_NOTIFICATION":
    if shadow_mode:
        # Record what would have been sent, then skip the real send.
        append_event(
            incident_id=incident_id,
            stage=stage,
            decision="shadow",
            reason=f"[SHADOW] Would have sent {action.get('payload', {}).get('type')} notification",
            confidence=0.0,
            actor="shadow_mode",
            tenant_id=tenant_id,
            metadata={"shadow_mode": True, "suppressed_action": action},
        )
        continue
    # ... normal notification flow
```
The audit trail gets fully populated. The state machine advances normally. Nothing external fires.
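
The same gate generalises to the other suppressed actions. Here is a minimal, self-contained sketch of that idea, not the code in loop.py: the action-type names other than SEND_NOTIFICATION, the `AuditLog` class, the `handlers` mapping, and the `execute_actions` signature are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical action-type names; only SEND_NOTIFICATION appears in the post.
EXTERNAL_ACTIONS = {
    "SEND_NOTIFICATION",
    "GENERATE_RECOVERY_TOKEN",
    "EXECUTE_WEBHOOK",
    "VOICE_ESCALATION",
}


@dataclass
class AuditLog:
    """In-memory stand-in for the real audit trail (assumed interface)."""

    events: list[dict[str, Any]] = field(default_factory=list)

    def append_event(self, **event: Any) -> None:
        self.events.append(event)


def execute_actions(
    actions: list[dict[str, Any]],
    *,
    shadow_mode: bool,
    audit: AuditLog,
    handlers: dict[str, Callable[[dict[str, Any]], Any]],
    incident_id: str,
    tenant_id: str,
) -> list[str]:
    """Run each action, or log it as suppressed when shadow_mode is on.

    Returns the action types that actually executed.
    """
    executed = []
    for action in actions:
        action_type = action["type"]
        if shadow_mode and action_type in EXTERNAL_ACTIONS:
            # Log the suppressed action instead of calling its handler.
            audit.append_event(
                incident_id=incident_id,
                tenant_id=tenant_id,
                decision="shadow",
                actor="shadow_mode",
                reason=f"[SHADOW] Would have run {action_type}",
                metadata={"shadow_mode": True, "suppressed_action": action},
            )
            continue
        handlers[action_type](action)
        executed.append(action_type)
    return executed
```

Keeping the check in one loop means a new external action type only needs an entry in the set, rather than its own copy of the shadow branch.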

The Shadow Mode API
Four endpoints manage the evaluation lifecycle:

```bash
# Enable shadow mode (default for new tenants)
POST /tenant/{tenant_id}/shadow

# Check current status
GET /tenant/{tenant_id}/shadow

# Get governance report
GET /tenant/{tenant_id}/shadow/report

# Go live
DELETE /tenant/{tenant_id}/shadow
```
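
Behind those routes, the lifecycle comes down to toggling one flag per tenant. A rough sketch using plain functions that a FastAPI route could wrap; the in-memory `TENANTS` store, the function names, and the response shape are assumptions, not AlertEngine's actual handlers.

```python
from typing import Any

# Hypothetical in-memory tenant store; a real deployment would persist this.
TENANTS: dict[str, dict[str, Any]] = {}


def create_tenant(tenant_id: str) -> dict[str, Any]:
    """New tenants start in Shadow Mode, matching the default described above."""
    TENANTS[tenant_id] = {"tenant_id": tenant_id, "shadow_mode": True}
    return TENANTS[tenant_id]


def shadow_status(tenant_id: str) -> dict[str, Any]:
    """Would back GET /tenant/{tenant_id}/shadow."""
    tenant = TENANTS[tenant_id]
    return {
        "tenant_id": tenant_id,
        "shadow_mode": bool(tenant.get("shadow_mode", False)),
    }


def enable_shadow(tenant_id: str) -> dict[str, Any]:
    """Would back POST /tenant/{tenant_id}/shadow."""
    TENANTS[tenant_id]["shadow_mode"] = True
    return shadow_status(tenant_id)


def go_live(tenant_id: str) -> dict[str, Any]:
    """Would back DELETE /tenant/{tenant_id}/shadow."""
    TENANTS[tenant_id]["shadow_mode"] = False
    return shadow_status(tenant_id)
```

Modelling "go live" as a DELETE on the shadow resource keeps the default state explicit: a tenant is in observation until someone deliberately removes it.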
The governance report is the sales tool. After 30 days of observation it returns a summary along these lines:

"23 incidents observed, 23 notifications suppressed, 23 recovery tokens suppressed — all logged to the immutable audit trail."

That's what you show a risk committee before going live.
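
A report like that can be derived from the audit trail alone. The sketch below assumes each audit event is a dict with the `incident_id`, `actor`, and `metadata` fields shown in the `append_event` call earlier, and that the suppressed action carries a `type` key; the `governance_report` function and its output keys are hypothetical.

```python
from collections import Counter
from typing import Any, Iterable


def governance_report(events: Iterable[dict[str, Any]]) -> dict[str, Any]:
    """Summarise what shadow mode suppressed, from audit events only."""
    events = list(events)
    # Every incident that appears in the trail counts as observed.
    incidents = {e["incident_id"] for e in events}
    # Only events written by the shadow gate count as suppressed actions.
    suppressed = Counter(
        e.get("metadata", {}).get("suppressed_action", {}).get("type", "UNKNOWN")
        for e in events
        if e.get("actor") == "shadow_mode"
    )
    return {
        "incidents_observed": len(incidents),
        "suppressed_by_type": dict(suppressed),
    }


sample_events = [
    {"incident_id": "inc-1", "actor": "shadow_mode",
     "metadata": {"suppressed_action": {"type": "SEND_NOTIFICATION"}}},
    {"incident_id": "inc-1", "actor": "shadow_mode",
     "metadata": {"suppressed_action": {"type": "GENERATE_RECOVERY_TOKEN"}}},
    {"incident_id": "inc-2", "actor": "policy_gate", "metadata": {}},
    {"incident_id": "inc-2", "actor": "shadow_mode",
     "metadata": {"suppressed_action": {"type": "SEND_NOTIFICATION"}}},
]
report = governance_report(sample_events)
```

Because the report reads only from the audit trail, the numbers shown to a risk committee can be checked against the same records an auditor would replay.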

What changed strategically
Before Shadow Mode the sales conversation was:

"Install AlertEngine and trust it."

After Shadow Mode it became:

"Run AlertEngine in observation mode. Here's the governance report of everything it would have done. Now decide."

That's a completely different risk profile for a regulated buyer.
Shadow Mode shipped on Thursday. It wasn't on the roadmap on Tuesday.
Sometimes the best features come from asking "what's the real objection?" rather than "what's the next feature?"
