<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Erez Rozenbaum</title>
    <description>The latest articles on DEV Community by Erez Rozenbaum (@erez_rozenbaum).</description>
    <link>https://dev.to/erez_rozenbaum</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3937363%2F19021f52-1c1a-4ae8-820c-1a2a15b2f77f.jpg</url>
      <title>DEV Community: Erez Rozenbaum</title>
      <link>https://dev.to/erez_rozenbaum</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/erez_rozenbaum"/>
    <language>en</language>
    <item>
      <title>Building a Unified Operational Timeline for Multi-Tenant OpenStack Environments</title>
      <dc:creator>Erez Rozenbaum</dc:creator>
      <pubDate>Mon, 18 May 2026 06:41:00 +0000</pubDate>
      <link>https://dev.to/erez_rozenbaum/building-a-unified-operational-timeline-for-multi-tenant-openstack-environments-3hm7</link>
      <guid>https://dev.to/erez_rozenbaum/building-a-unified-operational-timeline-for-multi-tenant-openstack-environments-3hm7</guid>
      <description>&lt;p&gt;One of the most difficult operational problems I encountered while working with multi-tenant OpenStack environments wasn’t provisioning.&lt;/p&gt;

&lt;p&gt;It was correlation.&lt;/p&gt;

&lt;p&gt;At scale, infrastructure failures rarely happen in isolation.&lt;/p&gt;

&lt;p&gt;A single incident may involve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;monitoring alerts&lt;/li&gt;
&lt;li&gt;provisioning events&lt;/li&gt;
&lt;li&gt;restore workflows&lt;/li&gt;
&lt;li&gt;migration activity&lt;/li&gt;
&lt;li&gt;authentication changes&lt;/li&gt;
&lt;li&gt;backup operations&lt;/li&gt;
&lt;li&gt;tenant lifecycle events&lt;/li&gt;
&lt;li&gt;support tickets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And operationally, the hardest question often isn’t:&lt;br&gt;
“What failed?”&lt;/p&gt;

&lt;p&gt;It’s:&lt;br&gt;
“What ELSE changed around the same time?”&lt;/p&gt;

&lt;p&gt;That realization became the foundation for building the Unified Operational Timeline inside pf9-mngt.&lt;/p&gt;
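
&lt;p&gt;As a rough illustration, "what else changed around the same time?" can be treated as a time-window lookup over a sorted event list. This is a minimal sketch, not the pf9-mngt implementation: the &lt;code&gt;Event&lt;/code&gt; shape, the source labels, and the 15-minute window are all assumptions made for the example.&lt;/p&gt;

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    ts: datetime
    source: str    # illustrative labels such as "monitoring" or "auth"
    summary: str


def events_around(events, incident_ts, window=timedelta(minutes=15)):
    """Return events within +/- window of incident_ts.

    Assumes `events` is already sorted by timestamp.
    """
    timestamps = [e.ts for e in events]
    lo = bisect_left(timestamps, incident_ts - window)
    hi = bisect_right(timestamps, incident_ts + window)
    return events[lo:hi]


t0 = datetime(2026, 1, 1, 12, 0)
timeline = [
    Event(t0 - timedelta(hours=2), "backup", "nightly backup finished"),
    Event(t0 - timedelta(minutes=10), "auth", "role assignment changed"),
    Event(t0 - timedelta(minutes=3), "migration", "live migration started"),
    Event(t0, "monitoring", "instance unreachable alert"),
    Event(t0 + timedelta(hours=1), "ticket", "support ticket opened"),
]

nearby = events_around(timeline, t0)
print([e.source for e in nearby])  # ['auth', 'migration', 'monitoring']
```

&lt;p&gt;The alert alone says an instance is unreachable. The window adds that a role assignment changed and a live migration started minutes earlier, which is the context a responder would otherwise collect by hand.&lt;/p&gt;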

&lt;h2&gt;The “Blast Radius” Problem&lt;/h2&gt;

&lt;p&gt;In real MSP environments, responders frequently jump between multiple systems during incidents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;monitoring dashboards&lt;/li&gt;
&lt;li&gt;log aggregators&lt;/li&gt;
&lt;li&gt;backup systems&lt;/li&gt;
&lt;li&gt;ticketing tools&lt;/li&gt;
&lt;li&gt;provisioning systems&lt;/li&gt;
&lt;li&gt;infrastructure APIs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each platform exposes only part of the operational story.&lt;/p&gt;

&lt;p&gt;The result is what I started calling:&lt;br&gt;
Correlation Hell.&lt;/p&gt;

&lt;p&gt;You know something failed.&lt;br&gt;
But reconstructing the operational blast radius becomes slow, fragmented, and dependent on tribal knowledge.&lt;/p&gt;

&lt;p&gt;Reducing MTTR becomes less about monitoring itself and more about reconstructing operational context efficiently.&lt;/p&gt;

&lt;h2&gt;The Initial Design Mistake&lt;/h2&gt;

&lt;p&gt;Initially, I assumed the hard problem would be ingestion scale.&lt;/p&gt;

&lt;p&gt;It wasn’t.&lt;/p&gt;

&lt;p&gt;The real complexity was identity resolution.&lt;/p&gt;

&lt;p&gt;Many operational events simply do not arrive with reliable tenant ownership metadata.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;provisioning logs may reference projects indirectly&lt;/li&gt;
&lt;li&gt;restore events may lack clean domain attribution&lt;/li&gt;
&lt;li&gt;monitoring alerts may only reference resource IDs&lt;/li&gt;
&lt;li&gt;auth events may map to users but not tenants&lt;/li&gt;
&lt;li&gt;migration workflows may reference topology objects indirectly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Operationally, this becomes dangerous in multi-tenant environments because visibility itself must remain tenant-isolated.&lt;/p&gt;

&lt;p&gt;The system needed to determine:&lt;br&gt;
“Which tenant operationally owns this event?”&lt;/p&gt;

&lt;p&gt;In real time.&lt;/p&gt;
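
&lt;p&gt;One way to picture ownership resolution is a fallback chain that tries the strongest evidence first. The sketch below is hypothetical: the lookup tables, field names, and ordering are assumptions made for illustration, and a real deployment would populate the mappings from inventory data rather than hard-code them.&lt;/p&gt;

```python
from typing import Optional

# Hypothetical lookup tables, hard-coded only to keep the example runnable.
PROJECT_TO_TENANT = {"proj-a": "tenant-1", "proj-b": "tenant-2"}
RESOURCE_TO_PROJECT = {"vm-100": "proj-a", "vol-200": "proj-b"}
USER_TO_TENANT = {"alice": "tenant-1"}


def resolve_tenant(event: dict) -> Optional[str]:
    """Return the owning tenant, or None when ownership cannot be established."""
    # 1. Use explicit attribution when the source provides it.
    if event.get("tenant_id"):
        return event["tenant_id"]
    # 2. Use a project reference, direct or via the resource it belongs to.
    project = event.get("project_id") or RESOURCE_TO_PROJECT.get(event.get("resource_id"))
    if project in PROJECT_TO_TENANT:
        return PROJECT_TO_TENANT[project]
    # 3. Fall back to the acting user; unknown users resolve to None.
    return USER_TO_TENANT.get(event.get("user_id"))


print(resolve_tenant({"resource_id": "vm-100"}))   # tenant-1
print(resolve_tenant({"user_id": "alice"}))        # tenant-1
print(resolve_tenant({"resource_id": "unknown"}))  # None
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; instead of guessing matters here: an event with unproven ownership can stay visible to the provider only, so a wrong guess never leaks it into another tenant's view.&lt;/p&gt;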

&lt;h2&gt;Building the Harvester&lt;/h2&gt;

&lt;p&gt;The solution evolved into a centralized operational harvester.&lt;/p&gt;

&lt;p&gt;Instead of relying entirely on external observability stacks, the platform consolidates operational events into a persistent operational timeline.&lt;/p&gt;

&lt;p&gt;The architecture now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;harvests events from 10+ sources&lt;/li&gt;
&lt;li&gt;normalizes operational metadata&lt;/li&gt;
&lt;li&gt;dynamically resolves tenant ownership&lt;/li&gt;
&lt;li&gt;maintains resumable harvesting cursors&lt;/li&gt;
&lt;li&gt;exposes tenant-scoped operational history&lt;/li&gt;
&lt;li&gt;correlates restore, migration, and infrastructure activity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The interesting engineering challenge wasn’t storage.&lt;/p&gt;

&lt;p&gt;It was operational consistency.&lt;/p&gt;
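
&lt;p&gt;Part of that consistency is getting differently shaped source events into one record format. The following is a sketch under assumptions: the &lt;code&gt;TimelineEvent&lt;/code&gt; fields and both payload layouts are invented for the example and are not taken from pf9-mngt.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class TimelineEvent:
    source: str
    source_event_id: str
    occurred_at: datetime          # always stored as UTC
    kind: str
    resource_id: Optional[str]


def from_alert(raw: dict) -> TimelineEvent:
    # Hypothetical alert payload carrying an epoch timestamp.
    return TimelineEvent(
        source="monitoring",
        source_event_id=str(raw["alert_id"]),
        occurred_at=datetime.fromtimestamp(raw["fired_at"], tz=timezone.utc),
        kind="alert",
        resource_id=raw.get("resource"),
    )


def from_restore_job(raw: dict) -> TimelineEvent:
    # Hypothetical restore-job payload carrying an ISO 8601 timestamp.
    return TimelineEvent(
        source="restore",
        source_event_id=raw["job_id"],
        occurred_at=datetime.fromisoformat(raw["started"]).astimezone(timezone.utc),
        kind="restore_started",
        resource_id=raw.get("vm_id"),
    )


alert = from_alert({"alert_id": 7, "fired_at": 0, "resource": "vm-100"})
job = from_restore_job(
    {"job_id": "r-1", "started": "1970-01-01T02:00:00+02:00", "vm_id": "vm-100"}
)
print(alert.occurred_at == job.occurred_at)  # True
```

&lt;p&gt;Once both events carry UTC timestamps and the same resource field, ordering and correlating them no longer depends on which system produced them.&lt;/p&gt;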

&lt;h2&gt;Why Idempotency Became Critical&lt;/h2&gt;

&lt;p&gt;Operational workers restart.&lt;br&gt;
Infrastructure APIs fail intermittently.&lt;br&gt;
Long-running restore workflows complete only partially.&lt;/p&gt;

&lt;p&gt;If event harvesting duplicated or skipped operational events, the timeline would lose trust immediately.&lt;/p&gt;

&lt;p&gt;The harvester therefore became:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;resumable&lt;/li&gt;
&lt;li&gt;cursor-driven&lt;/li&gt;
&lt;li&gt;incrementally processed&lt;/li&gt;
&lt;li&gt;idempotent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That design turned out to be far more important than ingestion performance itself.&lt;/p&gt;

&lt;h2&gt;Tenant Visibility Without Breaking Isolation&lt;/h2&gt;

&lt;p&gt;Another major challenge was exposing operational context safely to tenants.&lt;/p&gt;

&lt;p&gt;MSP environments require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;provider-level operational visibility&lt;/li&gt;
&lt;li&gt;tenant-level operational isolation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those goals often conflict.&lt;/p&gt;

&lt;p&gt;The final model exposed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tenant-scoped event history&lt;/li&gt;
&lt;li&gt;correlated infrastructure activity&lt;/li&gt;
&lt;li&gt;restore operations&lt;/li&gt;
&lt;li&gt;provisioning visibility&lt;/li&gt;
&lt;li&gt;timeline filtering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without exposing cross-tenant operational data.&lt;/p&gt;

&lt;p&gt;Operationally, this significantly reduced “What happened?” support escalations.&lt;/p&gt;
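
&lt;p&gt;The isolation rule can be sketched as a default-deny filter applied before any timeline data leaves the platform. The event shape and the &lt;code&gt;is_provider&lt;/code&gt; flag below are assumptions for the example; a real system would derive the viewer's scope from its authentication layer.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimelineEvent:
    tenant_id: Optional[str]   # None means ownership was never resolved
    kind: str
    summary: str


def visible_events(events, viewer_tenant: Optional[str], is_provider: bool = False):
    """Provider staff see everything; a tenant sees only events it owns."""
    if is_provider:
        return list(events)
    if viewer_tenant is None:
        return []
    # Unresolved events (tenant_id None) never match, so they stay hidden.
    return [e for e in events if e.tenant_id == viewer_tenant]


events = [
    TimelineEvent("tenant-1", "restore", "restore of vm-100 completed"),
    TimelineEvent("tenant-2", "provisioning", "volume vol-200 created"),
    TimelineEvent(None, "alert", "hypervisor alert, owner unresolved"),
]

print(len(visible_events(events, "tenant-1")))               # 1
print(len(visible_events(events, None, is_provider=True)))   # 3
```

&lt;p&gt;Filtering on an exact tenant match means an event with missing or unresolved ownership is hidden from every tenant by default, which keeps attribution mistakes from becoming data leaks.&lt;/p&gt;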

&lt;h2&gt;The Shift From Monitoring to Operational Context&lt;/h2&gt;

&lt;p&gt;What became increasingly clear throughout the project was that:&lt;br&gt;
monitoring alone is no longer enough.&lt;/p&gt;

&lt;p&gt;Most observability systems are very good at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;metrics&lt;/li&gt;
&lt;li&gt;logs&lt;/li&gt;
&lt;li&gt;alerts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But infrastructure operations increasingly require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;operational correlation&lt;/li&gt;
&lt;li&gt;workflow reconstruction&lt;/li&gt;
&lt;li&gt;topology awareness&lt;/li&gt;
&lt;li&gt;tenant ownership context&lt;/li&gt;
&lt;li&gt;infrastructure chronology&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The operational layer around infrastructure is becoming just as important as infrastructure provisioning itself.&lt;/p&gt;

&lt;h2&gt;Final Thought&lt;/h2&gt;

&lt;p&gt;The hardest operational problems in modern infrastructure are no longer isolated technical failures.&lt;/p&gt;

&lt;p&gt;They are coordination problems.&lt;/p&gt;

&lt;p&gt;And solving coordination problems requires systems that understand operational context, not just infrastructure state.&lt;/p&gt;

&lt;p&gt;🔗 GitHub:&lt;br&gt;
&lt;a href="https://github.com/erezrozenbaum/pf9-mngt" rel="noopener noreferrer"&gt;https://github.com/erezrozenbaum/pf9-mngt&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjis0co8xg9mypvgys6vz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjis0co8xg9mypvgys6vz.jpg" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>infrastructure</category>
      <category>monitoring</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
