"What Calls What?"
The question that launches a thousand Slack threads. In every microservices architecture I've worked with, nobody has a complete picture of how services interact.
Service maps fix this.
The Problem With Static Diagrams
Every team has architecture diagrams. They're all wrong. They were accurate on the day they were created, which was 18 months ago.
Reality of static diagrams:
Created: January 2023 (accurate)
Updated: March 2023 (still mostly accurate)
Last touched: March 2023
Current accuracy: ~40%
New services since: 12
Removed services: 3
Changed dependencies: 27
Static diagrams are documentation debt.
Auto-Generated Service Maps
The solution is maps generated from actual traffic. There are three data sources:
1. Network Traffic (Infrastructure Level)
# Parse Kubernetes network policies + actual traffic
def build_network_map(k8s_client):
    services = k8s_client.list_namespaced_service('production')
    connections = []
    for svc in services.items:
        # Get the pods backing this service via its label selector
        pods = k8s_client.list_namespaced_pod(
            'production',
            label_selector=','.join(f"{k}={v}" for k, v in svc.spec.selector.items())
        )
        # Check established outbound connections from each pod
        for pod in pods.items:
            conns = get_pod_connections(pod.metadata.name)  # from /proc/net/tcp
            for conn in conns:
                target_svc = resolve_ip_to_service(conn.remote_ip)
                if target_svc:
                    connections.append({
                        'source': svc.metadata.name,
                        'target': target_svc,
                        'port': conn.remote_port
                    })
    return connections
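The helper get_pod_connections above reads established sockets from the pod's /proc/net/tcp. A minimal sketch of the parsing step, assuming you can already fetch that file's contents (e.g. via kubectl exec): /proc/net/tcp stores remote addresses as little-endian hex ip:port pairs, with state 01 meaning ESTABLISHED.

```python
import socket
import struct
from collections import namedtuple

Connection = namedtuple('Connection', ['remote_ip', 'remote_port'])

def parse_proc_net_tcp(contents):
    """Parse the text of /proc/net/tcp into established connections.

    Data lines look like:
      0: 0100007F:1F90 0A00020F:01BB 01 ...
    where rem_address is little-endian hex IP:port and state 01 = ESTABLISHED.
    """
    conns = []
    for line in contents.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4 or fields[3] != '01':  # keep ESTABLISHED only
            continue
        rem_hex_ip, rem_hex_port = fields[2].split(':')
        # IPv4 addresses in /proc/net/tcp are stored little-endian
        ip = socket.inet_ntoa(struct.pack('<I', int(rem_hex_ip, 16)))
        conns.append(Connection(remote_ip=ip, remote_port=int(rem_hex_port, 16)))
    return conns
```

In practice you would feed each Connection's remote_ip into resolve_ip_to_service (a lookup against Kubernetes Endpoints), which is why the map only ever reflects traffic that actually happened.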
2. Distributed Traces (Application Level)
-- Query trace data for service dependencies
SELECT
    parent_service,
    child_service,
    COUNT(*) AS call_count,
    AVG(duration_ms) AS avg_latency,
    SUM(CASE WHEN status = 'ERROR' THEN 1 ELSE 0 END)::float / COUNT(*) AS error_rate
FROM traces
WHERE timestamp > NOW() - INTERVAL '24 hours'
GROUP BY parent_service, child_service
ORDER BY call_count DESC;
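The rows that query returns are already edges; a few lines of Python turn them into an adjacency map ready for rendering or traversal. A minimal sketch (the row tuples mirror the SELECT column order; the sample values are illustrative):

```python
from collections import defaultdict

def build_dependency_graph(rows):
    """Turn (parent, child, call_count, avg_latency, error_rate) rows
    from the trace query into an adjacency map of annotated edges."""
    graph = defaultdict(dict)
    for parent, child, calls, latency, error_rate in rows:
        graph[parent][child] = {
            'call_count': calls,
            'avg_latency_ms': latency,
            'error_rate': error_rate,
        }
    return dict(graph)

# Illustrative rows, as the trace query might return them
rows = [
    ('checkout-api', 'payment-service', 12000, 118.4, 0.002),
    ('checkout-api', 'inventory-service', 9500, 42.1, 0.001),
]
graph = build_dependency_graph(rows)
```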
3. API Gateway Logs (Edge Level)
API Gateway → Which external services call which internal services
Load Balancer → Traffic distribution and routing rules
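Gateway and load balancer access logs give you the edges that traces miss: callers outside your mesh. A minimal sketch, assuming a hypothetical key=value log format (client=... upstream=...); real gateways vary, so the regex is the part you'd adapt:

```python
import re
from collections import Counter

# Hypothetical gateway access-log line:
#   2024-03-14T15:30:00Z client=mobile-app route=/v1/orders upstream=orders-api status=200
LOG_LINE = re.compile(r'client=(\S+).*?upstream=(\S+)')

def edges_from_gateway_logs(lines):
    """Count (external caller -> internal service) edges from access logs."""
    edges = Counter()
    for line in lines:
        m = LOG_LINE.search(line)
        if m:
            edges[(m.group(1), m.group(2))] += 1
    return edges
```

The counts double as edge weights, so high-traffic external entry points stand out on the map.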
What a Good Service Map Shows
Beyond just "A calls B," a useful service map includes:
service_map_edge:
  source: checkout-api
  target: payment-service
  metadata:
    requests_per_second: 150
    error_rate: 0.2%
    p99_latency: 120ms
    protocol: gRPC
    circuit_breaker: enabled
    retry_policy: 3x with exponential backoff
    owner_team: payments
    tier: critical
    last_deploy: 2024-03-14T15:30:00Z
Color-code by health:
- Green: < 1% error rate, latency within SLO
- Yellow: Approaching SLO limits
- Red: SLO breached
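The color logic is a couple of comparisons. A minimal sketch; the SLO values and the 80% "approaching" band are assumptions you would replace with each edge's real targets:

```python
def edge_health(error_rate, p99_latency_ms,
                slo_error_rate=0.01, slo_latency_ms=200, warn_fraction=0.8):
    """Map an edge's metrics to green/yellow/red.

    Assumed example SLOs: 1% error rate, 200 ms p99 latency.
    'Approaching' is defined here as within 80% of either limit.
    """
    if error_rate >= slo_error_rate or p99_latency_ms >= slo_latency_ms:
        return 'red'      # SLO breached
    if (error_rate >= warn_fraction * slo_error_rate
            or p99_latency_ms >= warn_fraction * slo_latency_ms):
        return 'yellow'   # approaching a limit
    return 'green'        # comfortably within SLO
```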
The Incident Response Advantage
During an incident, a live service map immediately answers:
- Impact scope: What services depend on the broken one?
- Blast radius: How many users are affected?
- Root cause direction: Is the problem upstream or downstream?
- Recovery order: Which services need to come back first?
Before service maps (incident response):
"Is anyone else affected?" → 15 minutes of Slack investigation
After service maps:
*click on broken service* → instantly see all dependents
Answer: 4 downstream services, 3 are degraded
Start Simple
You don't need a fancy tool to get started:
- Export your K8s services and network policies
- Add trace-based dependencies
- Render with a simple graph library
- Update automatically via cron job
A basic but accurate map beats a fancy but outdated one every time.
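For the rendering step, you don't even need a graph library: emitting Graphviz DOT text from your edge list and piping it through dot -Tsvg in the cron job is enough. A minimal sketch (the edge tuples and labels are illustrative):

```python
def edges_to_dot(edges):
    """Render (source, target, label) edges as Graphviz DOT text,
    which `dot -Tsvg` can turn into a browsable map."""
    lines = ['digraph services {', '  rankdir=LR;']
    for source, target, label in edges:
        lines.append(f'  "{source}" -> "{target}" [label="{label}"];')
    lines.append('}')
    return '\n'.join(lines)

dot = edges_to_dot([
    ('checkout-api', 'payment-service', '150 rps'),
    ('payment-service', 'fraud-check', '40 rps'),
])
```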
If you want auto-generated, always-accurate service maps with AI-powered impact analysis, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com