Erez Rozenbaum

Posted on May 18

Building a Unified Operational Timeline for Multi-Tenant OpenStack Environments

#devops #infrastructure #monitoring #sre

One of the most difficult operational problems I encountered while working with multi-tenant OpenStack environments wasn’t provisioning.

It was correlation.

At scale, infrastructure failures rarely happen in isolation.

A single incident may involve:

monitoring alerts
provisioning events
restore workflows
migration activity
authentication changes
backup operations
tenant lifecycle events
support tickets

And operationally, the hardest question often isn’t:
“What failed?”

It’s:
“What ELSE changed around the same time?”

That realization became the foundation for building the Unified Operational Timeline inside pf9-mngt.
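
To make that question concrete, here is a minimal sketch of a time-window lookup over already-collected events. The function name `changes_around`, the tuple layout, and the 15-minute default window are assumptions made for illustration; they are not taken from pf9-mngt.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Tuple

Event = Tuple[datetime, str]  # (timestamp, description); hypothetical layout


def changes_around(
    events: Iterable[Event],
    incident_time: datetime,
    window: timedelta = timedelta(minutes=15),  # assumed default
) -> List[Event]:
    """Return events within +/- window of the incident, oldest first."""
    nearby = [
        (ts, desc) for ts, desc in events
        if abs(ts - incident_time) <= window
    ]
    return sorted(nearby, key=lambda e: e[0])


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    sample = [
        (t0 + timedelta(minutes=5), "migration started"),
        (t0 - timedelta(hours=2), "backup finished"),
        (t0 - timedelta(minutes=10), "password reset"),
    ]
    for ts, desc in changes_around(sample, t0):
        print(ts.isoformat(), desc)
```

The lookup is only useful if events from every source already share one clock and one shape, which is where most of the work described below goes.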

The “Blast Radius” Problem

In real MSP environments, responders frequently jump between multiple systems during incidents:

monitoring dashboards
log aggregators
backup systems
ticketing tools
provisioning systems
infrastructure APIs

Each platform exposes only part of the operational story.

The result is what I started calling:
Correlation Hell.

You know something failed.
But reconstructing the operational blast radius becomes slow, fragmented, and dependent on tribal knowledge.

Reducing MTTR becomes less about monitoring itself and more about reconstructing operational context efficiently.

The Initial Design Mistake

Initially, I assumed the hard problem would be ingestion scale.

It wasn’t.

The real complexity was identity resolution.

Many operational events simply do not arrive with reliable tenant ownership metadata.

For example:

provisioning logs may reference projects indirectly
restore events may lack clean domain attribution
monitoring alerts may only reference resource IDs
auth events may map to users but not tenants
migration workflows may reference topology objects indirectly

Operationally, this becomes dangerous in multi-tenant environments because visibility itself must remain tenant-isolated.

The system needed to determine:
“Which tenant operationally owns this event?”

In real time.
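
One way to picture ownership resolution is as a fallback chain: use the most direct identifier the event carries, then fall back to indirect ones. The sketch below is illustrative only. `RawEvent`, the three lookup tables, and the assumption that each user has a single default project are all hypothetical, not the pf9-mngt data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RawEvent:
    source: str
    project_id: Optional[str] = None
    resource_id: Optional[str] = None
    user_id: Optional[str] = None


# Hypothetical lookup tables. A real system would likely keep these
# as caches refreshed from the platform's inventory.
PROJECT_TO_TENANT = {"proj-a": "tenant-1"}
RESOURCE_TO_PROJECT = {"vm-42": "proj-a"}
USER_TO_PROJECT = {"user-7": "proj-a"}  # assumes one default project per user


def resolve_tenant(event: RawEvent) -> Optional[str]:
    """Resolve the owning tenant, or None if ownership is unknown."""
    project_id = (
        event.project_id
        or RESOURCE_TO_PROJECT.get(event.resource_id or "")
        or USER_TO_PROJECT.get(event.user_id or "")
    )
    if project_id is None:
        return None
    return PROJECT_TO_TENANT.get(project_id)


if __name__ == "__main__":
    alert = RawEvent(source="monitoring", resource_id="vm-42")
    print(resolve_tenant(alert))
```

Returning `None` rather than guessing matters here: an event with unknown ownership can stay visible to the provider, but it should never appear in any tenant's view.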

Building the Harvester

The solution evolved into a centralized operational harvester.

Instead of relying entirely on external observability stacks, the platform consolidates operational events into a persistent operational timeline.

The architecture now:

harvests events from 10+ sources
normalizes operational metadata
dynamically resolves tenant ownership
maintains resumable harvesting cursors
exposes tenant-scoped operational history
correlates restore, migration, and infrastructure activity
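
Normalization is the step that makes the rest possible: each source gets a small adapter that maps its payload onto one shared record. The following is a sketch under assumptions. The `TimelineEvent` fields and the shape of the raw alert payload (`id`, `ts`, `resource`, `message`) are invented for illustration and do not describe the actual pf9-mngt schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class TimelineEvent:
    source: str                 # which system produced the event
    source_event_id: str        # stable ID within that source
    occurred_at: datetime       # always stored in UTC
    kind: str                   # e.g. "alert", "restore", "migration"
    resource_id: Optional[str]
    summary: str


def normalize_alert(raw: Mapping[str, Any]) -> TimelineEvent:
    """Map a hypothetical monitoring payload onto the shared record."""
    return TimelineEvent(
        source="monitoring",
        source_event_id=str(raw["id"]),
        occurred_at=datetime.fromtimestamp(raw["ts"], tz=timezone.utc),
        kind="alert",
        resource_id=raw.get("resource"),
        summary=raw.get("message", ""),
    )


if __name__ == "__main__":
    raw = {"id": 7, "ts": 0, "resource": "vm-42", "message": "cpu high"}
    print(normalize_alert(raw))
```

The pair (`source`, `source_event_id`) doubles as a natural deduplication key, which becomes important once workers start restarting.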

The interesting engineering challenge wasn’t storage.

It was operational consistency.

Why Idempotency Became Critical

Operational workers restart.
Infrastructure APIs fail intermittently.
Long-running restore workflows complete only partially.

If event harvesting duplicated or skipped operational events, the timeline would lose trust immediately.

The harvester, therefore, became:

resumable
cursor-driven
incrementally processed
idempotent

That design turned out to be far more important than ingestion performance itself.

Tenant Visibility Without Breaking Isolation

Another major challenge was exposing operational context safely to tenants.

MSP environments require:

provider-level operational visibility
tenant-level operational isolation

Those goals often conflict.

The final model exposed:

tenant-scoped event history
correlated infrastructure activity
restore operations
provisioning visibility
timeline filtering

Without exposing cross-tenant operational data.
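
A simple way to reason about this is that the tenant filter is applied before any user-supplied filter, and that a caller without a tenant sees nothing unless it is explicitly a provider. The sketch below assumes hypothetical `StoredEvent` and `Caller` types and an in-memory list; a real implementation would more likely enforce the same rule in the database query.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class StoredEvent:
    tenant_id: Optional[str]  # None means ownership could not be resolved
    kind: str
    summary: str


@dataclass(frozen=True)
class Caller:
    tenant_id: Optional[str]
    is_provider: bool = False


def visible_events(caller: Caller, events: Iterable[StoredEvent],
                   kind: Optional[str] = None) -> List[StoredEvent]:
    """Apply tenant isolation first, then optional timeline filtering."""
    visible = []
    for ev in events:
        if not caller.is_provider:
            # Deny by default: no tenant on the caller, or a mismatch
            # (including unresolved events), hides the event.
            if caller.tenant_id is None or ev.tenant_id != caller.tenant_id:
                continue
        if kind is not None and ev.kind != kind:
            continue
        visible.append(ev)
    return visible


if __name__ == "__main__":
    timeline = [
        StoredEvent("tenant-1", "restore", "restore of vm-42 completed"),
        StoredEvent("tenant-2", "restore", "restore of vm-99 failed"),
        StoredEvent(None, "alert", "unattributed host alert"),
    ]
    for ev in visible_events(Caller("tenant-1"), timeline):
        print(ev.summary)
```

The explicit `caller.tenant_id is None` check prevents an unscoped caller from accidentally matching events whose ownership is also `None`.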

Operationally, this significantly reduced “What happened?” support escalations.

The Shift From Monitoring to Operational Context

What became increasingly clear throughout the project is that:
monitoring alone is no longer enough.

Most observability systems are very good at:

metrics
logs
alerts

But infrastructure operations increasingly require:

operational correlation
workflow reconstruction
topology awareness
tenant ownership context
infrastructure chronology

The operational layer around infrastructure is becoming just as important as infrastructure provisioning itself.

Final Thought

The hardest operational problems in modern infrastructure are no longer isolated technical failures.

They are coordination problems.

And solving coordination problems requires systems that understand operational context, not just infrastructure state.

🔗 GitHub:
https://github.com/erezrozenbaum/pf9-mngt

DEV Community