DEV Community

Erez Rozenbaum


Building a Unified Operational Timeline for Multi-Tenant OpenStack Environments

One of the most difficult operational problems I encountered while working with multi-tenant OpenStack environments wasn’t provisioning.

It was correlation.

At scale, infrastructure failures rarely happen in isolation.

A single incident may involve:

  • monitoring alerts
  • provisioning events
  • restore workflows
  • migration activity
  • authentication changes
  • backup operations
  • tenant lifecycle events
  • support tickets

And operationally, the hardest question often isn’t:
“What failed?”

It’s:
“What ELSE changed around the same time?”

That realization became the foundation for building the Unified Operational Timeline inside pf9-mngt.
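
To make that question concrete, here is a minimal sketch of a time-window lookup over a merged event list. It is an illustration only, not the pf9-mngt implementation: the `TimelineEvent` fields, the `changed_around` helper, and the 15-minute window are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class TimelineEvent:
    # Hypothetical event shape; field names are illustrative only.
    source: str
    kind: str
    occurred_at: datetime


def changed_around(events, anchor, window=timedelta(minutes=15)):
    """Return events within +/- window of the anchor event, oldest first."""
    lo, hi = anchor.occurred_at - window, anchor.occurred_at + window
    nearby = [e for e in events if e is not anchor and lo <= e.occurred_at <= hi]
    return sorted(nearby, key=lambda e: e.occurred_at)


t0 = datetime(2024, 1, 1, 12, 0)
events = [
    TimelineEvent("monitoring", "alert_fired", t0),
    TimelineEvent("backup", "restore_started", t0 - timedelta(minutes=5)),
    TimelineEvent("auth", "role_changed", t0 + timedelta(minutes=10)),
    TimelineEvent("provisioning", "vm_created", t0 - timedelta(hours=3)),
]

# Everything that changed near the alert, regardless of which system reported it.
related = changed_around(events, events[0])
```

The lookup itself is trivial once events from every source share one timeline; the hard part is getting them there.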

The “Blast Radius” Problem

In real MSP environments, responders frequently jump between multiple systems during incidents:

  • monitoring dashboards
  • log aggregators
  • backup systems
  • ticketing tools
  • provisioning systems
  • infrastructure APIs

Each platform exposes only part of the operational story.

The result is what I started calling:
Correlation Hell.

You know something failed.
But reconstructing the operational blast radius becomes slow, fragmented, and dependent on tribal knowledge.

Reducing MTTR becomes less about monitoring itself and more about reconstructing operational context efficiently.

The Initial Design Mistake

Initially, I assumed the hard problem would be ingestion scale.

It wasn’t.

The real complexity was identity resolution.

Many operational events simply do not arrive with reliable tenant ownership metadata.

For example:

  • provisioning logs may reference projects indirectly
  • restore events may lack clean domain attribution
  • monitoring alerts may only reference resource IDs
  • auth events may map to users but not tenants
  • migration workflows may reference topology objects indirectly

Operationally, this becomes dangerous in multi-tenant environments because visibility itself must remain tenant-isolated.

The system needed to determine:
“Which tenant operationally owns this event?”

In real time.
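
One way to picture the resolution step is a fallback chain: use the most direct reference the event carries, and fall back to indirect ones. The sketch below assumes three hypothetical lookup tables (`PROJECT_TO_TENANT`, `RESOURCE_TO_PROJECT`, `USER_TO_PROJECT`) and hypothetical event keys; the real mappings in any given deployment will look different.

```python
from typing import Optional

# Hypothetical lookup tables; in practice these would come from inventory data.
PROJECT_TO_TENANT = {"proj-1": "tenant-a"}
RESOURCE_TO_PROJECT = {"vm-42": "proj-1"}
USER_TO_PROJECT = {"alice": "proj-1"}


def resolve_tenant(event: dict) -> Optional[str]:
    """Resolve the owning tenant from whichever reference the event carries."""
    project = event.get("project_id")
    if project is None and "resource_id" in event:
        project = RESOURCE_TO_PROJECT.get(event["resource_id"])
    if project is None and "user_id" in event:
        project = USER_TO_PROJECT.get(event["user_id"])
    # None means "unresolved": keep the event out of tenant views rather than guess.
    return PROJECT_TO_TENANT.get(project)


owner = resolve_tenant({"resource_id": "vm-42"})
```

Returning `None` for an unresolved event is a deliberate choice in this sketch: in a tenant-isolated system, attributing an event to the wrong tenant is worse than not attributing it at all.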

Building the Harvester

The solution evolved into a centralized operational harvester.

Instead of relying entirely on external observability stacks, the platform consolidates operational events into a persistent operational timeline.

The architecture now:

  • harvests events from 10+ sources
  • normalizes operational metadata
  • dynamically resolves tenant ownership
  • maintains resumable harvesting cursors
  • exposes tenant-scoped operational history
  • correlates restore, migration, and infrastructure activity
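
The normalization step in the list above can be sketched as a function that maps each source's payload onto one common record. The source names, payload keys, and output fields here are assumptions made up for the example, not the schema the project uses.

```python
from datetime import datetime, timezone


def normalize(source: str, raw: dict) -> dict:
    """Map a source-specific payload onto one common timeline record."""
    if source == "monitoring":
        # Assumed payload: epoch-seconds timestamp, host name, free-text message.
        return {
            "source": source,
            "kind": "alert",
            "resource_id": raw["host"],
            "occurred_at": datetime.fromtimestamp(raw["ts"], tz=timezone.utc),
            "summary": raw["message"],
        }
    if source == "restore":
        # Assumed payload: ISO 8601 start time, VM identifier, workflow state.
        return {
            "source": source,
            "kind": "restore_" + raw["state"],
            "resource_id": raw["vm_id"],
            "occurred_at": datetime.fromisoformat(raw["started_at"]),
            "summary": "restore " + raw["state"],
        }
    raise ValueError(f"no normalizer registered for source {source!r}")


alert = normalize("monitoring", {"host": "vm-42", "ts": 0, "message": "cpu high"})
restore = normalize(
    "restore",
    {"vm_id": "vm-42", "state": "started", "started_at": "2024-01-01T12:00:00+00:00"},
)
```

Once both records share the same keys and timezone-aware timestamps, they can be sorted and filtered together without any source-specific logic downstream.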

The interesting engineering challenge wasn’t storage.

It was operational consistency.

Why Idempotency Became Critical

Operational workers restart.
Infrastructure APIs fail intermittently.
Long-running restore workflows complete only partially.

If event harvesting duplicated or skipped operational events, the timeline would lose trust immediately.

The harvester, therefore, became:

  • resumable
  • cursor-driven
  • incrementally processed
  • idempotent

That design turned out to be far more important than ingestion performance itself.
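
A minimal sketch of those four properties together, using an in-memory SQLite database. The table layout, the `harvest` function, and the `fetch_since` callback are assumptions for illustration; the point is the pattern, not the schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE timeline (source TEXT, event_id TEXT, payload TEXT,
                       PRIMARY KEY (source, event_id));
CREATE TABLE cursors (source TEXT PRIMARY KEY, position INTEGER);
""")


def harvest(source, fetch_since):
    """Pull events after the saved cursor; safe to re-run after a crash."""
    row = db.execute(
        "SELECT position FROM cursors WHERE source = ?", (source,)
    ).fetchone()
    position = row[0] if row else 0
    batch = fetch_since(position)
    with db:  # events and cursor commit together, or not at all
        for seq, event_id, payload in batch:
            # The primary key makes a replayed event a no-op, not a duplicate.
            db.execute(
                "INSERT OR IGNORE INTO timeline VALUES (?, ?, ?)",
                (source, event_id, payload),
            )
            position = max(position, seq)
        db.execute(
            "INSERT OR REPLACE INTO cursors VALUES (?, ?)", (source, position)
        )
    return len(batch)


# Hypothetical upstream feed: (sequence number, event id, payload).
FEED = [(1, "evt-1", "vm created"), (2, "evt-2", "restore started")]


def fetch_since(position):
    return [e for e in FEED if e[0] > position]


harvest("provisioning", fetch_since)
harvest("provisioning", fetch_since)  # second run finds nothing new
```

Two mechanisms do the work here. Committing the cursor in the same transaction as the events means a crash cannot advance the cursor past events that were never stored. The uniqueness constraint means that even if a cursor is lost and a range is replayed, the timeline does not grow duplicates.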

Tenant Visibility Without Breaking Isolation

Another major challenge was exposing operational context safely to tenants.

MSP environments require:

  • provider-level operational visibility
  • tenant-level operational isolation

Those goals often conflict.

The final model exposed:

  • tenant-scoped event history
  • correlated infrastructure activity
  • restore operations
  • provisioning visibility
  • timeline filtering

Without exposing cross-tenant operational data.
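
One way to keep both goals intact is to apply the tenant scope before any user-supplied filter, so that filtering can narrow a view but never widen it. The following is a sketch under assumed names (`TIMELINE`, `visible_events`, the `is_provider` flag), not the access model the project actually ships.

```python
from typing import Optional

# Hypothetical, already-attributed timeline records.
TIMELINE = [
    {"tenant": "tenant-a", "kind": "restore_started"},
    {"tenant": "tenant-b", "kind": "vm_created"},
    {"tenant": "tenant-a", "kind": "alert"},
]


def visible_events(caller_tenant: Optional[str], is_provider: bool = False,
                   kind: Optional[str] = None) -> list:
    """Return the events a caller may see; tenant callers are always scoped."""
    if is_provider:
        rows = TIMELINE
    elif caller_tenant is None:
        raise PermissionError("tenant callers must have a tenant scope")
    else:
        rows = [e for e in TIMELINE if e["tenant"] == caller_tenant]
    # Filtering happens after scoping, so a filter can never widen visibility.
    return [e for e in rows if kind is None or e["kind"] == kind]


tenant_view = visible_events("tenant-a")
provider_view = visible_events(None, is_provider=True)
```

A tenant caller with no scope is rejected outright instead of silently falling back to the full timeline, which is the failure mode that matters most in a multi-tenant system.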

Operationally, this significantly reduced “What happened?” support escalations.

The Shift From Monitoring to Operational Context

What became increasingly clear throughout the project is that:
monitoring alone is no longer enough.

Most observability systems are very good at:

  • metrics
  • logs
  • alerts

But infrastructure operations increasingly require:

  • operational correlation
  • workflow reconstruction
  • topology awareness
  • tenant ownership context
  • infrastructure chronology

The operational layer around infrastructure is becoming just as important as infrastructure provisioning itself.

Final Thought

The hardest operational problems in modern infrastructure are no longer isolated technical failures.

They are coordination problems.

And solving coordination problems requires systems that understand operational context, not just infrastructure state.

🔗 GitHub:
https://github.com/erezrozenbaum/pf9-mngt
