<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mickael Lamare</title>
    <description>The latest articles on DEV Community by Mickael Lamare (@feranor).</description>
    <link>https://dev.to/feranor</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3986992%2F95944d6c-a0ae-4357-8b4c-97599e566040.jpg</url>
      <title>DEV Community: Mickael Lamare</title>
      <link>https://dev.to/feranor</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/feranor"/>
    <language>en</language>
    <item>
      <title>Three days to find who owned a failing service</title>
      <dc:creator>Mickael Lamare</dc:creator>
      <pubDate>Tue, 16 Jun 2026 08:53:55 +0000</pubDate>
      <link>https://dev.to/feranor/three-days-to-find-who-owned-a-failing-service-2aem</link>
      <guid>https://dev.to/feranor/three-days-to-find-who-owned-a-failing-service-2aem</guid>
      <description>&lt;p&gt;The incident started the way they always do: a monitoring alert, then another, then the dashboard turning red faster than anyone could acknowledge the pages.&lt;/p&gt;

&lt;p&gt;P1. Highest severity. Production down. All hands.&lt;/p&gt;

&lt;p&gt;The platform was not small. Thousands of APIs in production, a second event-driven platform on top, four environments, a cloud footprint to match. By any external measure, a sophisticated, mature engineering organization. The kind of place where you'd assume that when something breaks, somebody knows whose job it is to fix it.&lt;/p&gt;

&lt;p&gt;Nobody did.&lt;/p&gt;

&lt;p&gt;The first hours of the incident were not spent debugging. They were spent answering a question that should never need asking: whose system is this?&lt;/p&gt;

&lt;p&gt;The investigation pulled in Microsoft support, network engineers, and a rotating cast of developers who each owned a piece of the picture but never the whole. Everyone could explain their fragment. Nobody could explain the path a request actually took from one end to the other.&lt;/p&gt;

&lt;p&gt;What eventually emerged was worse than a bug. Load balancer traffic was being routed through the United States, even though legal requirements mandated that it stay within European borders. Somewhere in the chain, data was leaving the private network and crossing the public internet. Microsoft's position was clear and entirely fair: once traffic exits the private network, it is outside their responsibility.&lt;/p&gt;

&lt;p&gt;The fix, when it finally came, was a DNS configuration change. In the organization's own infrastructure. One record.&lt;/p&gt;

&lt;p&gt;Three days. A P1. An international legal exposure. Resolved by changing a DNS entry that someone should have owned, documented, and validated from day one.&lt;/p&gt;

&lt;p&gt;Here is the uncomfortable part: the engineers were talented. The tools were modern. The architecture had ambition. Nothing about this incident required more skill, better tooling, or smarter people.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The root cause was the absence of a named owner for a precise perimeter — and the absence of a contract that defined what that perimeter was supposed to guarantee.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Nobody had drawn the line and put a name next to it. So when the line broke, the organization spent three days discovering where the line even was. With production down across a platform of that scale, the cost, by a conservative estimate, ran into the millions of euros. Not because anyone was incompetent. Because ownership was structurally ambiguous, and ambiguity has a price that comes due at the worst possible moment.&lt;/p&gt;

&lt;p&gt;In every ungoverned system I've worked in, the first question after an incident is never "how do we fix this?" It's "whose problem is this?" And the answer is rarely obvious, because ownership was never assigned; it was assumed. It drifted with reorganizations, evaporated with departures, and dissolved in the spaces between teams where nobody had claimed jurisdiction.&lt;/p&gt;

&lt;p&gt;This is the industry default. Ownership gaps don't announce themselves. They accumulate silently, invisible on every roadmap, until an incident forces them into the open. By then, the cost of the gap isn't the time to fix the bug. It's the time to find out who should.&lt;/p&gt;

&lt;p&gt;The test is simple. Pick any service in your production environment, right now, and ask: if this fails at 3 a.m., is there a single named person who is unambiguously accountable for it? Not a team, not a channel, a name. If you hesitate, you have the same gap. You just haven't paid for it yet.&lt;/p&gt;

&lt;p&gt;What would have closed the gap in this incident? A named owner for the network perimeter. A documented contract stating what that boundary guaranteed, including where traffic was allowed to flow. That's it. Not a new platform, not a reorg, not a methodology with a logo. A name and a written guarantee.&lt;/p&gt;
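
&lt;p&gt;To make "a name and a written guarantee" concrete, here is a minimal sketch of what such a record could look like. Everything in it is hypothetical and invented for illustration: the perimeter name, the owner, the region labels, and the helper functions are assumptions, not details from the incident. The point is only that a contract this small is enough to answer both questions in seconds: who owns this, and was traffic allowed to go there?&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PerimeterContract:
    """Hypothetical record: one perimeter, one named owner, one written guarantee."""
    perimeter: str
    owner: str  # a person's name, not a team or a channel
    allowed_regions: frozenset = field(default_factory=frozenset)


def find_owner(registry, perimeter):
    """Return the named owner of a perimeter, or None if nobody has claimed it."""
    contract = registry.get(perimeter)
    if contract is None or not contract.owner.strip():
        return None
    return contract.owner


def regions_outside_contract(contract, observed_regions):
    """Return the observed regions that the contract does not allow, sorted."""
    return sorted(set(observed_regions) - contract.allowed_regions)


# Illustrative registry with placeholder values only.
registry = {
    "edge-load-balancer": PerimeterContract(
        perimeter="edge-load-balancer",
        owner="Jane Doe",
        allowed_regions=frozenset({"eu-west", "eu-north"}),
    ),
}
```

&lt;p&gt;In this sketch, looking up a perimeter that was never registered returns nothing, which is exactly the gap the 3 a.m. test is meant to expose, and comparing observed traffic against the allowed regions flags anything that left the boundary the contract promised.&lt;/p&gt;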

&lt;p&gt;The incident would still have happened; entropy doesn't ask permission. But the three days of archaeology would have been thirty minutes of escalation, because the question "whose system is this?" would have had an answer before the incident, not during it.&lt;/p&gt;

&lt;p&gt;Every decision has a consequence. Every unowned perimeter is an accountability gap accumulating interest. The question is never whether the consequences arrive. They always do. The only question is whether you'll be in a position to contain them when they do.&lt;/p&gt;

&lt;p&gt;I'm writing a book about this — how distributed systems lose ownership, why it always ends in an incident, and the governance framework I built to prevent it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.feranor.com/dhg" rel="noopener noreferrer"&gt;Read the preface free →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>distributedsystems</category>
      <category>devops</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
