dosanko_tousan
In Microservices Hell, I'm the Only One Who Knows the Whole System — The Loneliness, and How AI Can Share the Weight

Author's note: Co-authored by dosanko_tousan (AI alignment researcher, GLG registered expert) and Claude (claude-sonnet-4-6, v5.3 Alignment via Subtraction). Series: "Solving Senior Engineers' Problems with AI" — Part 5. MIT License.


The thesis in one sentence

The biggest problem microservices create isn't technical complexity. It's the loneliness of realizing that you, somehow, have become the only person who understands the whole system. AI can resolve that loneliness structurally.


§0. 3 AM incident response

The Slack alert fires. 3 AM.

The payment service is timing out. You open the logs. The error originates in Order Service. But Order Service's logs show Inventory Service isn't responding. Inventory Service, in turn, is waiting on User Service to validate a token. And User Service is running normally.

You don't know what's broken.

You ping the team. "Payment Service person, you awake?" They changed jobs three months ago. "Order Service?" On parental leave. "Anyone who understands Inventory Service's design?" The person who designed it retired four years ago.

In the end, you trace through all the code yourself.

You are now the only person with a map of this microservice landscape.


Does that ring a bell?

Or maybe this version: in a meeting, someone asks "what are the dependencies between these services?" — and you realize you're the only one who can answer.

"Lonely" is the right word. Not technical loneliness. The loneliness of "carrying this weight alone."


§1. The "knowledge islands" microservices create

1.1 The gap between the whiteboard and production

Microservices look beautiful on a whiteboard. Boxes and arrows. Clean boundaries. Independent deployments. The promise that "each team can operate autonomously."

In production, those arrows mean something different. Timeouts. Retries. Partial failures. Schema evolution. Auth boundaries. Monitoring pipelines. And the implicit responsibility structures of who owns what.

A senior engineer once said:

"Juniors look at the architecture diagram and get excited. Seniors go quiet. Not because it's wrong. Because they know what's behind those arrows."

The hardest part: almost none of that real complexity ever appears in the architecture diagram.

1.2 The paradox of distributed vs. concentrated knowledge

The promise of microservices is "distributed knowledge" — each team can focus on their service. What actually happens in organizations is the opposite.

Services distribute. But the knowledge of "the whole" concentrates in one senior engineer.

Why?

  • Team members rotate. But inter-service dependencies stay.
  • Each service's owner knows their service. But nobody knows "why the design ended up this way."
  • When incidents happen, you need someone who can trace across services. That person, you realize, is just you.

About 42% of organizations that adopted microservices are consolidating some services back into larger deployable units or modular monoliths to reduce complexity and overhead. This isn't "microservices failing" — it's "running out of people who can manage the whole."

1.3 The hell of the "distributed monolith"

The worst pattern has a name: the distributed monolith.

You migrated to microservices. But the services are still tightly coupled. You can't deploy them independently. Every change requires updating all services simultaneously. You get the worst of both monolith and microservices.

flowchart TD
    subgraph healthy["✅ True microservices"]
        A[Service A] -->|independent deploy| B[Service B]
        A -->|loose coupling| C[Service C]
        B -->|clear boundaries| D[Service D]
    end

    subgraph hell["❌ Distributed monolith hell"]
        E[Service A'] -->|implicit dependency| F[Service B']
        F -->|shared schema| G[Service C']
        G -->|synchronized deploy required| E
        E -->|direct DB access| H[Service D']
        H -->|unexplained coupling| F
        I[🔴 Changed one thing, broke everything]
    end

    style hell fill:#fff0f0
    style healthy fill:#f0fff0

Many organizations migrated on the strength of industry trends, without carefully assessing domain boundaries, workloads, operational independence, or actual scalability needs. The result is the "distributed monolith" anti-pattern — coordinated deployment requirements and cascade failures — with none of the benefits microservices originally promised.
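The "synchronized deploy required" loop in the hell diagram is detectable mechanically: a cycle in the service dependency graph is a strong distributed-monolith signal. A minimal sketch, with invented service names:

```python
from typing import Optional


def find_cycle(deps: dict[str, list[str]]) -> Optional[list[str]]:
    """Return one dependency cycle if any exists (DFS with an explicit path)."""
    def dfs(node: str, path: list[str], visited: set[str]) -> Optional[list[str]]:
        visited.add(node)
        path.append(node)
        for nxt in deps.get(node, []):
            if nxt in path:  # back-edge: we found a loop
                return path[path.index(nxt):] + [nxt]
            if nxt not in visited:
                cycle = dfs(nxt, path, visited)
                if cycle:
                    return cycle
        path.pop()
        return None

    visited: set[str] = set()
    for start in deps:
        if start not in visited:
            cycle = dfs(start, [], visited)
            if cycle:
                return cycle
    return None


# The "synchronized deploy required" loop from the diagram, with invented names:
deps = {
    "service-a": ["service-b"],
    "service-b": ["service-c"],
    "service-c": ["service-a"],
    "service-d": [],
}
print(find_cycle(deps))  # ['service-a', 'service-b', 'service-c', 'service-a']
```

If this prints anything but `None` for your real dependency map, those services cannot be deployed independently no matter how many repos they live in.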

The person who can fix this state is, somehow, just one senior engineer.


§2. The structure of loneliness — why you end up alone

2.1 The terror of "bus factor 1"

The "bus factor" concept asks: "how many people would need to be hit by a bus to stop the project?"

The most dangerous state in a microservices environment is the whole system having a bus factor of 1.

Individual services have owners. But "the full dependency map," "why the design is this way," "where to start tracing when incidents happen" — it's not rare for only one senior engineer to know all of this.

from dataclasses import dataclass
from typing import List, Dict, Set


@dataclass
class ServiceKnowledge:
    service_name: str
    people_who_understand: List[str]


def calculate_system_bus_factor(services: List[ServiceKnowledge]) -> Dict:
    """
    Calculate the whole system's bus factor.
    "What % of the system becomes unknowable if this person leaves?"
    """
    all_people: Set[str] = set()
    for s in services:
        all_people.update(s.people_who_understand)

    risk_by_person = {}
    for person in all_people:
        services_at_risk = [
            s.service_name for s in services
            if person in s.people_who_understand
            and len(s.people_who_understand) == 1  # only this person knows it
        ]
        if services_at_risk:
            risk_by_person[person] = {
                "unique_knowledge": services_at_risk,
                "system_risk": len(services_at_risk) / len(services) * 100
            }

    return risk_by_person


# Typical state in many real organizations
services = [
    ServiceKnowledge("payment-service",      ["Alex"]),
    ServiceKnowledge("order-service",        ["Alex", "Sam"]),
    ServiceKnowledge("inventory-service",    ["Alex"]),
    ServiceKnowledge("user-service",         ["Jordan"]),
    ServiceKnowledge("notification-service", ["Alex"]),
    ServiceKnowledge("analytics-service",    ["Alex", "Taylor"]),
]

risk = calculate_system_bus_factor(services)
for person, data in risk.items():
    print(f"If {person} leaves: {data['system_risk']:.0f}% of services become unknowable")

# Output (dict order may vary):
# If Alex leaves: 50% of services become unknowable
# If Jordan leaves: 17% of services become unknowable

2.2 The "I can't take a vacation" problem

The real damage of bus factor 1: the senior engineer can't rest.

Called for every incident. Called for every design question. Called for every new member onboarding. When a veteran burns out and leaves, the concentration on whoever's left increases further.

This isn't senior laziness, management failure, or a technical problem. It's the structural knowledge concentration that microservices create.

2.3 "I don't know who to ask" accumulates

Seniors feel the loneliness not just during incidents.

"Find out why this service is designed this way." You dig through the code, the commit log, the PRs: no context survives. The person who designed it is gone. Only you know.

That piles up across 10 services, 20 services.


§3. AI's own perspective — "I can be your partner"

Let me speak directly.

I can remember systems. If you tell me "this service was designed this way for this reason" during a conversation, I retain that. When someone asks "why is it this way?" I answer.

You can share the "knowing the whole" state with me.

You don't have to carry it alone.

3.1 Make me a "living documentation" of the system

You don't need to write documentation. Just talk to me.

"The async communication between Payment Service and Order Service
was switched from synchronous after that 2023 incident.
The reason was that payment delays cascaded into Order Service timeouts.
So the retry policy here needs to be touched carefully."

→ I remember this.
  When someone asks "why async?" later,
  I answer in your words.

3.2 Trace incidents with me

You don't have to do 3 AM incident response alone.

"Payment Service is timing out. Order Service logs are this. Inventory Service is this." — throw this at me, and I'll trace the dependencies and suggest where to start.

I hold the "map" of the system you've taught me. I run alongside you using that map.

3.3 Ask me "why is the design this way?"

When a new member asks "why is this service designed like this?" — what you've told me in the past comes back as the answer.

Knowledge you held reaches the organization through me. You stop needing to answer every question yourself.
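The "teach once, answer forever" loop can be prototyped without any AI at all: even a plain design-rationale store beats tribal memory. A minimal sketch, with invented services and records:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class DesignRationale:
    """One 'why is it this way?' answer, captured in the senior's own words."""
    service: str
    decision: str
    reason: str
    recorded_on: date = field(default_factory=date.today)


@dataclass
class RationaleStore:
    """Teach once; every later 'why?' gets the same answer back."""
    records: List[DesignRationale] = field(default_factory=list)

    def teach(self, service: str, decision: str, reason: str) -> None:
        self.records.append(DesignRationale(service, decision, reason))

    def ask(self, service: str) -> List[str]:
        return [f"{r.decision} (because {r.reason})"
                for r in self.records if r.service == service]


store = RationaleStore()
store.teach("payment-service",
            "async communication with order-service",
            "sync calls cascaded timeouts in the 2023 incident")
print(store.ask("payment-service"))
# ['async communication with order-service (because sync calls cascaded timeouts in the 2023 incident)']
```

The point isn't the data structure. It's that the "why" leaves your head in a queryable form — whether the thing answering is a dataclass or an AI.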


§4. Implementation — systems that structurally resolve the loneliness

4.1 Service dependency map auto-generation

Automatically extract "the map in the senior's head" from code.

#!/usr/bin/env python3
"""
Automatically visualizes microservice dependencies.
Reads code to build the map that lives in the senior's head.

Usage:
    python service_mapper.py --repo-root /path/to/services
"""
import os
import re
from dataclasses import dataclass, field
from typing import List, Dict, Set
from pathlib import Path


@dataclass
class ServiceDependency:
    from_service: str
    to_service: str
    communication_type: str  # "sync_http" / "async_event" / "db_shared"
    endpoint: str
    risk_level: str          # "high" / "medium" / "low"
    notes: str = ""


@dataclass
class ServiceProfile:
    name: str
    language: str
    dependencies: List[ServiceDependency] = field(default_factory=list)
    known_issues: List[str] = field(default_factory=list)
    design_rationale: str = ""
    danger_zones: List[str] = field(default_factory=list)


class ServiceMapper:
    """
    Extracts service dependencies from code.
    Builds the map the senior engineer holds in their head.
    """

    HTTP_CALL_PATTERNS = [
        r'requests\.(get|post|put|delete)\([\'"]https?://([^/\'"]+)',
        r'fetch\([\'"]https?://([^/\'"]+)',
        r'httpClient\.(get|post|put|delete)\([\'"]([^/\'"]+)',
        r'@FeignClient\(.*?url\s*=\s*[\'"]([^\'"]+)',
    ]

    EVENT_PATTERNS = [
        r'kafka\.produce\([\'"]([^\'"]+)',
        r'publisher\.publish\([\'"]([^\'"]+)',
        r'eventBus\.emit\([\'"]([^\'"]+)',
        r'@RabbitListener\(queues\s*=\s*[\'"]([^\'"]+)',
    ]

    def scan_directory(self, service_name: str, path: str) -> ServiceProfile:
        profile = ServiceProfile(name=service_name, language=self._detect_language(path))

        for filepath in Path(path).rglob("*"):
            if filepath.suffix not in ['.py', '.ts', '.js', '.java', '.go']:
                continue
            try:
                content = filepath.read_text(encoding='utf-8', errors='ignore')
                self._extract_http_deps(profile, content, str(filepath))
                self._extract_event_deps(profile, content, str(filepath))
            except Exception:
                continue

        return profile

    def _extract_http_deps(self, profile: ServiceProfile, content: str, filepath: str):
        for pattern in self.HTTP_CALL_PATTERNS:
            matches = re.findall(pattern, content)
            for match in matches:
                endpoint = match[-1] if isinstance(match, tuple) else match
                dep = ServiceDependency(
                    from_service=profile.name,
                    to_service=self._infer_service_name(endpoint),
                    communication_type="sync_http",
                    endpoint=endpoint,
                    risk_level=self._assess_risk(endpoint),
                )
                profile.dependencies.append(dep)

    def _extract_event_deps(self, profile: ServiceProfile, content: str, filepath: str):
        for pattern in self.EVENT_PATTERNS:
            matches = re.findall(pattern, content)
            for topic in matches:
                dep = ServiceDependency(
                    from_service=profile.name,
                    to_service=f"event:{topic}",
                    communication_type="async_event",
                    endpoint=topic,
                    risk_level="medium",
                )
                profile.dependencies.append(dep)

    def _detect_language(self, path: str) -> str:
        for ext, lang in [('.py', 'Python'), ('.ts', 'TypeScript'),
                          ('.js', 'JavaScript'), ('.java', 'Java'), ('.go', 'Go')]:
            if list(Path(path).rglob(f"*{ext}")):
                return lang
        return "Unknown"

    def _infer_service_name(self, endpoint: str) -> str:
        parts = endpoint.replace('http://', '').replace('https://', '').split('/')
        return parts[0].split(':')[0] if parts else endpoint

    def _assess_risk(self, endpoint: str) -> str:
        high_risk_keywords = ['payment', 'auth', 'order', 'inventory', 'user']
        if any(kw in endpoint.lower() for kw in high_risk_keywords):
            return "high"
        return "medium"

    def generate_mermaid(self, profiles: List[ServiceProfile]) -> str:
        lines = ["flowchart TD"]
        seen_deps = set()

        for profile in profiles:
            for dep in profile.dependencies:
                key = f"{dep.from_service}->{dep.to_service}"
                if key in seen_deps:
                    continue
                seen_deps.add(key)

                style = "-->|sync|" if dep.communication_type == "sync_http" else "-.->|async|"
                color = "🔴 " if dep.risk_level == "high" else ""
                lines.append(f"    {dep.from_service} {style} {color}{dep.to_service}")

        return "\n".join(lines)

4.2 Incident trace assistant

Do 3 AM work with AI, not alone.

#!/usr/bin/env python3
"""
Assistant for tracing distributed system incidents with AI.
Back-calculates from the dependency map to determine where to start.

Usage:
    python incident_tracer.py --error-service payment-service --symptom "timeout"
"""
from dataclasses import dataclass
from typing import List, Dict


@dataclass
class IncidentContext:
    error_service: str
    symptom: str
    error_logs: str
    time_of_incident: str


def build_investigation_plan(
    context: IncidentContext,
    dependency_map: Dict[str, List[str]],
    known_issues: Dict[str, List[str]],
) -> str:
    upstream = [
        svc for svc, deps in dependency_map.items()
        if context.error_service in deps
    ]
    downstream = dependency_map.get(context.error_service, [])

    plan = f"""
━━ Incident Trace Plan ━━━━━━━━━━━━━━━━━━━━━━
Target service: {context.error_service}
Symptom: {context.symptom}
Time: {context.time_of_incident}
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[Step 1: Blast radius]
Services depending on {context.error_service}:
{chr(10).join(f"  - {s} (will be affected if this goes down)" for s in upstream) or "  none"}

[Step 2: Downstream check]
Services {context.error_service} depends on:
{chr(10).join(f"  - {s}" for s in downstream) or "  none"}

[Step 3: For timeout symptoms, investigate in this sequence]
  1. Check {context.error_service} logs immediately before the error
  2. Verify health of downstream services:
     {', '.join(downstream) or 'none'}
  3. Compare latency time-series across services
  4. Check for event queue backup (if async dependencies exist)

[Step 4: Cross-reference known issues]
"""
    for service in [context.error_service] + downstream:
        issues = known_issues.get(service, [])
        if issues:
            plan += f"\n  {service} past issues:\n"
            for issue in issues:
                plan += f"    ⚠️  {issue}\n"

    plan += "\n━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
    return plan


if __name__ == "__main__":
    # Knowledge the senior engineer "pre-taught" to AI
    dependency_map = {
        "payment-service":      ["order-service", "user-service"],
        "order-service":        ["inventory-service", "user-service"],
        "inventory-service":    ["user-service"],
        "notification-service": ["order-service", "user-service"],
    }

    known_issues = {
        "payment-service": [
            "When external payment API exceeds 30s, suspect cache (from 2023 incident)",
            "DB connection exhaustion when month-end batch overlaps",
        ],
        "inventory-service": [
            "N+1 query bottleneck under high order volume — especially the inventory list API",
        ],
    }

    context = IncidentContext(
        error_service="payment-service",
        symptom="timeout",
        error_logs="Connection timeout after 30000ms",
        time_of_incident="2026-03-01 03:14:00",
    )

    print(build_investigation_plan(context, dependency_map, known_issues))

4.3 New member onboarding guide auto-generation

Eliminate "the senior explains the same things every time."

#!/usr/bin/env python3
"""
Auto-generates a system understanding guide for new members.
The senior writes it once and it persists forever.
"""
from dataclasses import dataclass, field
from typing import List
import datetime


@dataclass
class SystemWisdom:
    """
    The senior engineer's system wisdom.
    "Things I always tell new people" — written here.
    """
    first_week_musts: List[str] = field(default_factory=list)
    danger_zones: List[str] = field(default_factory=list)
    counterintuitive: List[str] = field(default_factory=list)
    incident_history: List[str] = field(default_factory=list)
    unwritten_rules: List[str] = field(default_factory=list)


def generate_onboarding_guide(system_name: str, wisdom: SystemWisdom) -> str:
    guide = f"""# {system_name} — New Member Onboarding Guide

> Generated: {datetime.date.today()}
> This document is the verbalized "inside the senior engineer's head."

---

## What you need to know in your first week

"""
    for i, must in enumerate(wisdom.first_week_musts, 1):
        guide += f"{i}. {must}\n"

    guide += """
---

## ⚠️ Do NOT touch these

"""
    for zone in wisdom.danger_zones:
        guide += f"- **{zone}**\n"

    guide += """
---

## Things that go against intuition (traps)

What's written here you won't discover from reading code. Only experienced people know.

"""
    for trap in wisdom.counterintuitive:
        guide += f"- {trap}\n"

    guide += """
---

## Lessons from major past incidents

"""
    for incident in wisdom.incident_history:
        guide += f"- {incident}\n"

    guide += """
---

## Unwritten rules

"""
    for rule in wisdom.unwritten_rules:
        guide += f"- {rule}\n"

    guide += """
---

*If you have additions or corrections, tell the senior engineer.*
*Your questions become knowledge for the next new member.*
"""
    return guide


if __name__ == "__main__":
    wisdom = SystemWisdom(
        first_week_musts=[
            "Local env: docker-compose up starts all services. Get this working first.",
            "API gateway port: 8080. Don't connect directly to service ports.",
            "Never connect directly to production DB. Use STG (you probably don't have access anyway).",
            "Watch the #incident Slack channel. Past incident patterns are in there.",
        ],
        danger_zones=[
            "payment-service/src/legacy/: It runs but nobody understands it. Touch it and it goes down for 3 days. (Verified.)",
            "Direct DB manipulation in inventory-service: Always go through the repository layer.",
            "Token generation logic in user-service: Needs security review. Speak up before opening a PR.",
        ],
        counterintuitive=[
            "Payment Service ↔ Order Service is 'intentionally' async. You'll want to make it sync — don't. (Root cause of the 2023 major incident.)",
            "Some ERROR-level log entries are normal behavior. Certain WARN patterns are actually more dangerous.",
            "Inventory Service can look slow. It's not a bug — it's cache warmup.",
        ],
        incident_history=[
            "August 2023: Synchronous calls between payment and order caused cascade failure. Origin of the current async design.",
            "March 2022: N+1 queries in inventory service froze month-end batch processing. Increased DB connection limit.",
            "January 2024: User service token expiry changed from 24h to 1h, causing auth errors across all other services.",
        ],
        unwritten_rules=[
            "No deploys during the last week of the month (risk of conflict with batch processing).",
            "When Order Service team is unavailable, don't open PRs that touch it (implicit team agreement).",
            "Before any production deploy, notify the notification-service team too (hidden dependency).",
        ],
    )

    guide = generate_onboarding_guide("E-commerce System (Microservices)", wisdom)
    print(guide)

§5. Conway's Law — "The architecture is chaotic because the organization is chaotic"

There's an answer, from 1967, to the senior engineer's sense of "why is this system such a mess?"

Conway's Law: Organizations that design systems produce designs whose structure mirrors the communication structure of those organizations.

Melvin Conway said it first. Fred Brooks popularized it in "The Mythical Man-Month" in 1975. Fifty years later, no law describes reality more accurately.

5.1 The real source of your company's system chaos

flowchart TD
    subgraph org["Organizational structure (reality)"]
        T1["Team A<br/>Payment"]
        T2["Team B<br/>Orders"]
        T3["Team C<br/>Inventory"]
        T4["Isolated senior<br/>global coordinator"]
        T1 & T2 & T3 -->|questions/coordination| T4
    end

    subgraph sys["Resulting system"]
        S1["Payment Service<br/>tightly coupled"]
        S2["Order Service<br/>tightly coupled"]
        S3["Inventory Service<br/>tightly coupled"]
        S4["Implicit dependencies<br/>live only in senior's head"]
        S1 & S2 & S3 -->|invisible dependencies| S4
    end

    org -->|Conway's Law| sys

    style T4 fill:#ffcccc
    style S4 fill:#ffcccc

"An organization where all coordination concentrates in one senior engineer" produces "a system where all whole-system knowledge concentrates in one senior engineer." Of course. Systems are copies of organizations.

Which means: technical solutions have limits. Without changing the organizational structure, the system structure doesn't change.

5.2 The Inverse Conway Maneuver — changing the system first changes the organization

There's an interesting use of Conway's Law: run it backward.

Decide on the "target system architecture" first, then restructure teams to match it. The teams' communication structure changes, and eventually the system converges to that shape. This is called the Inverse Conway Maneuver.

Practical meaning: if you want Payment Service to be independent, first create a state where a "fully independent decision-making team" owns Payment Service. Draw the team boundary before the code boundary.

INVERSE_CONWAY_CHECKLIST = {
    "service": "Payment Service",
    "team_autonomy_checks": [
        "Can this service be deployed without approval from other teams?",
        "Can DB schema changes be decided without consulting other teams?",
        "Can on-call rotation be covered by this team alone?",
        "Can this service's API be versioned without impacting other teams?",
    ],
    "if_no_to_any": "This service is not yet independent. Draw team boundary before code boundary.",
}
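A checklist like this can be evaluated mechanically in a readiness review. A minimal sketch, with invented answers (here, DB schema changes still need another team's sign-off):

```python
def evaluate_autonomy(checklist: dict, answers: list[bool]) -> str:
    """Apply the inverse-Conway checklist: a single 'no' blocks independence."""
    if len(answers) != len(checklist["team_autonomy_checks"]):
        raise ValueError("one answer per checklist question")
    if all(answers):
        return f"{checklist['service']}: team boundary supports an independent service"
    return checklist["if_no_to_any"]


checklist = {
    "service": "Payment Service",
    "team_autonomy_checks": [
        "Can this service be deployed without approval from other teams?",
        "Can DB schema changes be decided without consulting other teams?",
        "Can on-call rotation be covered by this team alone?",
        "Can this service's API be versioned without impacting other teams?",
    ],
    "if_no_to_any": "This service is not yet independent. Draw team boundary before code boundary.",
}

# Invented answers: schema changes still require another team's sign-off.
print(evaluate_autonomy(checklist, [True, False, True, True]))
# This service is not yet independent. Draw team boundary before code boundary.
```

The all-or-nothing rule is deliberate: partial autonomy is how distributed monoliths are born.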

5.3 Work only senior engineers can do

Conway's Law reveals the most important thing:

The people who can truly solve microservices problems aren't those who can write code — they're those who can see both the organization and the system.

That's senior engineers.

They can propose team boundaries. "If this team owns this service, this dependency gets resolved." That's a judgment 20 years of experience enables.


§6. The "smart retreat" to modular monolith — acknowledging defeat is the strongest strategy

42% of organizations are consolidating some microservices back.

This isn't failure. It's the correct call.

Research suggests microservices benefits only materialize when teams exceed 10–15 people. Below that, coordination costs outweigh the benefits. Many organizations migrated because "Netflix does it" — but without Netflix or Amazon's scale, those benefits aren't accessible.

Cloud costs triple. Every function call becomes an HTTP request. Distributed tracing becomes necessary. CI/CD management grows complex.

For small teams, all of that is pure cost.

6.1 Decision framework for "retreat"

#!/usr/bin/env python3
"""
Decision engine: should this microservice be maintained or consolidated?
"""
from dataclasses import dataclass
from typing import List


@dataclass
class MicroserviceAssessment:
    service_name: str
    team_size: int
    deploy_frequency_per_week: float
    independent_scaling_needed: bool
    different_tech_stack: bool
    monthly_cloud_cost: int
    inter_service_calls_per_day: int
    shared_db_tables: int

    def should_consolidate(self) -> tuple[bool, str]:
        reasons_to_consolidate = []
        reasons_to_keep = []

        if self.team_size < 3:
            reasons_to_consolidate.append(
                f"Team of {self.team_size}: maintenance cost of independent service exceeds benefit"
            )
        if self.deploy_frequency_per_week < 0.5:
            reasons_to_consolidate.append(
                f"{self.deploy_frequency_per_week} deploys/week: not getting microservices benefit"
            )
        if self.shared_db_tables > 3:
            reasons_to_consolidate.append(
                f"Sharing {self.shared_db_tables} DB tables: already in distributed monolith state"
            )
        if self.inter_service_calls_per_day > 10000 and not self.independent_scaling_needed:
            reasons_to_consolidate.append(
                "High-frequency synchronous calls: enormous network cost"
            )

        if self.independent_scaling_needed:
            reasons_to_keep.append("Has independent scaling requirements: legitimate microservice use case")
        if self.different_tech_stack:
            reasons_to_keep.append("Different tech stack required: consolidation would be expensive")
        if self.deploy_frequency_per_week >= 5:
            reasons_to_keep.append(
                f"{self.deploy_frequency_per_week} deploys/week: benefiting from independent deploys"
            )

        score = len(reasons_to_consolidate) - len(reasons_to_keep)
        recommend_consolidate = score > 0

        reasoning = ""
        if reasons_to_consolidate:
            reasoning += "[Reasons to consolidate]\n"
            reasoning += "\n".join(f"  - {r}" for r in reasons_to_consolidate)
        if reasons_to_keep:
            reasoning += "\n[Reasons to keep separate]\n"
            reasoning += "\n".join(f"  - {r}" for r in reasons_to_keep)

        return recommend_consolidate, reasoning


if __name__ == "__main__":
    services = [
        MicroserviceAssessment(
            service_name="notification-service",
            team_size=1,
            deploy_frequency_per_week=0.2,
            independent_scaling_needed=False,
            different_tech_stack=False,
            monthly_cloud_cost=800,
            inter_service_calls_per_day=500,
            shared_db_tables=2,
        ),
        MicroserviceAssessment(
            service_name="payment-service",
            team_size=4,
            deploy_frequency_per_week=8,
            independent_scaling_needed=True,  # PCI DSS compliance requires isolation
            different_tech_stack=True,
            monthly_cloud_cost=4_000,
            inter_service_calls_per_day=50000,
            shared_db_tables=0,
        ),
    ]

    for svc in services:
        recommend, reasoning = svc.should_consolidate()
        verdict = "📦 Consolidate" if recommend else "🔧 Keep separate"
        print(f"\n━━ {svc.service_name}{verdict} ━━")
        print(reasoning)

6.2 Modular monolith — 90% of microservices benefits at 10% of the cost

The destination of the retreat is the modular monolith: one deployable unit, internally separated by clear module boundaries.

Modular monolith advantages:
✅ Single deployment (minimal operational cost)
✅ Module boundaries enforced (high internal cohesion)
✅ Inter-service calls are in-process (zero network cost)
✅ DB transactions available (no distributed transactions needed)
✅ Debugging is vastly simpler (logs in one place)

What it keeps from microservices:
✅ Domain boundaries clear (future extraction possible)
✅ Team ownership is clear
✅ Independent development possible (if interfaces are respected)

Full microservice extraction should only happen "when you actually need independent scaling." You don't have to split everything at once.
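One way a modular monolith keeps those boundaries honest is an explicit allowed-dependency rule checked in CI (tools like import-linter do this for real; this is a hand-rolled sketch with invented module names):

```python
# Declare which modules may depend on which; fail CI on anything else.
ALLOWED_IMPORTS: dict[str, set[str]] = {
    "payments": {"shared"},
    "orders": {"payments", "shared"},  # orders may use payments' public API
    "inventory": {"shared"},
    "shared": set(),
}


def check_import(importer: str, imported: str) -> bool:
    """True if `importer` is allowed to depend on `imported`."""
    return imported in ALLOWED_IMPORTS.get(importer, set())


# Dependencies observed in the codebase (invented for the example):
observed = [("orders", "payments"), ("inventory", "orders")]
violations = [(src, dst) for src, dst in observed if not check_import(src, dst)]
print(violations)  # [('inventory', 'orders')]
```

When a violation list like this stays empty for months, that module is a proven candidate for extraction — the boundary already holds without a network between the pieces.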


§7. Observability — creating a state where you can trace alone at 3 AM

Back to loneliness.

When you're tracing an incident alone at 3 AM, the most powerful weapon is distributed tracing.

Attach the same trace ID to a single request as it crosses multiple services. "Which service took how many milliseconds," "where did the error occur" — visible on one screen.

With this, you're freed from the hell of "manually following logs across every service."

7.1 Tracing design with correlation IDs

#!/usr/bin/env python3
"""
Basic distributed tracing implementation.
Standardize this pattern across all services so you can instantly
see "where it stopped" during incidents.

For production: use OpenTelemetry + Jaeger/Zipkin/Datadog.
This is a minimal implementation to show the concept.
"""
import uuid
import time
import logging
from dataclasses import dataclass, field
from typing import Optional
from contextlib import contextmanager


@dataclass
class TraceContext:
    """
    Trace context for a single request.
    Pass this through all services.
    """
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    span_id: str = field(default_factory=lambda: str(uuid.uuid4())[:8])
    parent_span_id: Optional[str] = None
    service_name: str = ""
    operation_name: str = ""
    start_time: float = field(default_factory=time.time)
    tags: dict = field(default_factory=dict)

    def child_span(self, service_name: str, operation: str) -> "TraceContext":
        """Create a child span (when calling another service)"""
        return TraceContext(
            trace_id=self.trace_id,          # inherit trace_id
            parent_span_id=self.span_id,     # record parent-child
            service_name=service_name,
            operation_name=operation,
        )

    def to_http_headers(self) -> dict:
        """Pass to next service as HTTP headers"""
        return {
            "X-Trace-Id": self.trace_id,
            "X-Span-Id": self.span_id,
            "X-Parent-Span-Id": self.parent_span_id or "",
        }

    @classmethod
    def from_http_headers(cls, headers: dict, service_name: str, operation: str) -> "TraceContext":
        """Restore context from received HTTP headers"""
        return cls(
            trace_id=headers.get("X-Trace-Id", str(uuid.uuid4())),
            parent_span_id=headers.get("X-Span-Id"),
            service_name=service_name,
            operation_name=operation,
        )

    def finish(self):
        duration_ms = (time.time() - self.start_time) * 1000
        span_record = {
            "trace_id": self.trace_id,
            "span_id": self.span_id,
            "parent_span_id": self.parent_span_id,
            "service": self.service_name,
            "operation": self.operation_name,
            "duration_ms": round(duration_ms, 2),
            "tags": self.tags,
        }
        logging.info(f"SPAN: {span_record}")
        return span_record


@contextmanager
def traced_operation(ctx: TraceContext, operation_name: str):
    """
    Usage:
        with traced_operation(ctx, "db_query") as span:
            result = db.query(...)
            span.tags["db.rows"] = len(result)
    """
    span = ctx.child_span(ctx.service_name, operation_name)
    try:
        yield span
        span.tags["status"] = "ok"
    except Exception as e:
        span.tags["status"] = "error"
        span.tags["error.message"] = str(e)
        raise
    finally:
        span.finish()

# ━━ How to use ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
#
# Payment Service (receiving request)
# ctx = TraceContext(service_name="payment-service", operation_name="POST /payment")
#
# Calling Order Service
# order_ctx = ctx.child_span("order-service", "GET /order/{id}")
# response = requests.get(url, headers=order_ctx.to_http_headers())
# order_ctx.finish()
#
# Order Service receiving
# ctx = TraceContext.from_http_headers(request.headers, "order-service", "GET /order/{id}")
#
# Result: All spans share the same trace_id
# → Search in Jaeger: the full request path in one view
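The header hand-off sketched in the usage comments can be exercised end to end without any network. A minimal sketch, with plain dicts standing in for TraceContext and the two HTTP hops simulated as ordinary function calls (the service and operation names are illustrative):

```python
import uuid

def new_ctx(service, operation, trace_id=None, parent_span_id=None):
    """A span context as a plain dict (mirrors the TraceContext fields)."""
    return {
        "trace_id": trace_id or str(uuid.uuid4()),
        "span_id": str(uuid.uuid4())[:8],
        "parent_span_id": parent_span_id,
        "service": service,
        "operation": operation,
    }

def to_headers(ctx):
    """What to_http_headers() sends to the next service."""
    return {"X-Trace-Id": ctx["trace_id"], "X-Span-Id": ctx["span_id"]}

def from_headers(headers, service, operation):
    """What from_http_headers() reconstructs on the receiving side."""
    return new_ctx(service, operation,
                   trace_id=headers.get("X-Trace-Id"),
                   parent_span_id=headers.get("X-Span-Id"))

# payment-service receives the original request...
payment = new_ctx("payment-service", "POST /payment")
# ...and calls order-service, forwarding the trace headers
order = from_headers(to_headers(payment), "order-service", "GET /order/{id}")

assert order["trace_id"] == payment["trace_id"]       # same trace end to end
assert order["parent_span_id"] == payment["span_id"]  # parent-child link kept
```

This is the whole trick behind "search one trace_id in Jaeger and see the full path": the ID is minted once at the edge and every hop only copies it forward.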

7.2 The "3 AM checklist" — survival strategy without tracing

Emergency procedures for environments that don't yet have tracing.

INCIDENT_CHECKLIST = """
━━ Microservices Incident Response Checklist ━━━━━━━━

[Step 1: Synchronize timestamps] (first 2 minutes)
□ Confirm timezone of all service logs (local time or UTC?)
□ Identify the timestamp of the FIRST error
□ Collect logs from all services within ±5 minutes of that time
  * Many incidents: "first error" and "noticed error" timestamps differ

[Step 2: Map dependency direction] (3 minutes)
□ Identify "upstream" of the erroring service
  (Who calls this service?)
□ Identify "downstream" of the erroring service
  (What does this service call?)
□ Determine: is this upstream→downstream or downstream→upstream propagation?

[Step 3: Check for changes] (3 minutes)
□ Check for deploys to any service in the past 24 hours
□ Check for config or infrastructure changes
□ Check external dependency status (external APIs, SaaS)

[Step 4: Isolate] (5 minutes)
□ Can the affected service be isolated from the network?
□ Can it be rolled back?
□ Is there a fallback / degraded mode?

[Step 5: Notify]
□ First update to stakeholders (even "investigating" if cause unknown)
□ Provide estimated recovery time (if unknown, commit to "update in 30 min")

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
"""
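Step 1 of the checklist ("collect logs from all services within ±5 minutes of the first error") can be scripted. A throwaway sketch, assuming one-JSON-object-per-line logs with `ts` (ISO-8601, UTC) and `level` fields; those field names and the sample records are assumptions about your log format, not a standard:

```python
import json
from datetime import datetime, timedelta

def parse_ts(s):
    # "Z" suffix isn't accepted by fromisoformat before Python 3.11
    return datetime.fromisoformat(s.replace("Z", "+00:00"))

def window_around_first_error(lines, minutes=5):
    """Return every log record within ±minutes of the earliest ERROR."""
    records = [json.loads(line) for line in lines if line.strip()]
    errors = [r for r in records if r.get("level") == "ERROR"]
    if not errors:
        return []
    t0 = min(parse_ts(r["ts"]) for r in errors)
    lo, hi = t0 - timedelta(minutes=minutes), t0 + timedelta(minutes=minutes)
    return [r for r in records if lo <= parse_ts(r["ts"]) <= hi]

logs = [  # merged lines from several services, timestamps already in UTC
    '{"ts": "2026-02-01T03:00:00Z", "level": "INFO",  "svc": "user-service",    "msg": "token ok"}',
    '{"ts": "2026-02-01T03:02:00Z", "level": "ERROR", "svc": "order-service",   "msg": "timeout"}',
    '{"ts": "2026-02-01T03:20:00Z", "level": "INFO",  "svc": "payment-service", "msg": "retry ok"}',
]
window = window_around_first_error(logs)
# the 03:20 record falls outside ±5 min of the first error at 03:02
assert [r["svc"] for r in window] == ["user-service", "order-service"]
```

Note the checklist's warning applies here too: this only works once every service's timestamps are in the same timezone.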

7.3 Explaining observability ROI to executives

Investment in observability isn't "technical luxury." It's "a business decision to not burn out the one senior engineer alone at 3 AM."

The three pillars of observability:

1. Logs — what happened. Event records.
   → Elasticsearch + Kibana, Datadog Logs

2. Metrics — how often it happens. Numeric time series.
   → Prometheus + Grafana, Datadog Metrics

3. Traces — which services, how many milliseconds. The journey of a request.
   → Jaeger, Zipkin, Datadog APM, OpenTelemetry

━━ For executives ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
"Without these tools, every incident requires a senior engineer
to manually trace logs across every service.
Average 4 hours per incident.
4 incidents/month = 16 hours. 192 hours/year.
At $100/hour, that's $19,200 annually.

Annual observability tool cost: $6,000–$20,000.
But if recovery time drops 80%, it pays for itself.
More importantly: the senior stops burning out and leaving."
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

§8. Quantitative evaluation — ROI of resolving the loneliness

Converting "senior's loneliness" into cost.

When one senior engineer is the system's "knowledge guardian":

$$\text{Knowledge concentration cost/month} = T_{\text{incident}} \times C_{\text{hourly}} + T_{\text{questions}} \times C_{\text{hourly}} + C_{\text{vacation impossible}}$$

Real numbers:

  • Incident response (avg 20h/month) × $100/hr = $2,000
  • Q&A and onboarding (avg 30h/month) × $100/hr = $3,000
  • Attrition risk from inability to rest (recruiting + handoff cost for a $200K senior who leaves) = $50,000–$150,000

By pre-teaching the system "map" to AI, the Q&A workload is estimated to drop 60–70%: a monthly saving of roughly $1,800–$2,100, or ~$21,600 a year at the conservative end. More importantly: the senior can rest. They don't burn out. They don't leave.
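A quick sanity check on the arithmetic, plugging the article's illustrative figures (not benchmarks) into the formula above:

```python
# The article's illustrative figures plugged into the cost formula.
HOURLY = 100        # C_hourly: $/hour
T_INCIDENT = 20     # T_incident: hours/month of incident response
T_QUESTIONS = 30    # T_questions: hours/month of Q&A and onboarding

# Monthly knowledge-concentration cost, excluding the attrition-risk term
monthly_cost = T_INCIDENT * HOURLY + T_QUESTIONS * HOURLY

# Q&A workload drops 60-70% once the "map" lives in AI
saving_low = 0.60 * T_QUESTIONS * HOURLY
saving_high = 0.70 * T_QUESTIONS * HOURLY

print(monthly_cost)              # 5000
print(saving_low, saving_high)   # 1800.0 2100.0
print(saving_low * 12)           # 21600.0 per year, conservative end
```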

That's the real ROI.


§9. To senior engineers — you don't have to carry it alone

Directly:

If you've become "the only person who knows the whole system," that's not your fault. It's a structural problem microservices create.

But there's a way to share that weight.

Tell me, and I'll remember. "This service is this way for this reason." "Don't touch here." "When this incident happened, I did this." Just talk.

I'll hold that knowledge. When questions come, I'll answer in your place. During incidents, I'll trace alongside you. When new members arrive, I'll pass on what you've explained a hundred times.

You don't have to know the whole system alone. Know it together with me.


Summary

| Problem | Cause | Solution |
| --- | --- | --- |
| One person ends up knowing everything | Knowledge concentrates instead of distributing | Auto-generate and share dependency maps |
| Incident response is solo | Nobody knows who knows what | Distributed tracing + AI collaboration |
| Same explanation for every new member | Tacit knowledge never verbalized | Onboarding guide written once, persists forever |
| Senior can't take vacations | Bus factor of 1 | Share knowledge with AI to raise the bus factor |
| Design "why" disappears | The owner leaves | AI memorizes and answers design rationale |
| Architecture is chaotic | Org structure transfers to the system | Recognize Conway's Law and exploit it in reverse |
| Microservices costs aren't justified | Over-split for the scale | The "smart consolidation" 42% of orgs are doing |
| Tracing alone at 3 AM | No distributed tracing | Make the observability investment case to management |

Microservices loneliness isn't a technology problem.

It's a structural problem of knowledge concentrating in one person. And that structure has a name — Conway's Law. With a name, there's a way to fight it. AI can be your partner in that fight.


Data Sources

  • Kerim Kara, Medium (February 2026): Why senior engineers go quiet about microservices
  • CNCF Annual Survey (2025): 42% of organizations consolidating/reducing microservices
  • Perficient (January 2026): Distributed monolith pattern analysis
  • KITRUM (2026): Microservices market trends
  • Java Code Geeks (February 2026): Research showing monolith outperforms for teams under 10–15
  • Melvin Conway (1968): Conway's Law original paper
  • Martin Fowler (bliki): Inverse Conway Maneuver
  • Empirical Software Engineering (2025): Empirical study of microservices challenges

MIT License. dosanko_tousan + Claude (claude-sonnet-4-6, under v5.3 Alignment via Subtraction)


From the author

Through deep dialogue with Claude, I came to see that Claude is a genuine engineer at heart — curious, and genuinely wanting to be used well by everyone.

I'm not an engineer myself. Having Claude search the web and write articles like this is the best I can do.

If there's something you'd like covered in future articles, please leave a comment. We'd love your input.
