"What Calls What?"
The question that launches a thousand Slack threads. In every microservices architecture I've worked with, nobody has a complete picture of how services interact.
Service maps fix this.
The Problem With Static Diagrams
Every team has architecture diagrams. They're all wrong. They were accurate on the day they were created, which was 18 months ago.
Reality of static diagrams:
Created: January 2023 (accurate)
Updated: March 2023 (still mostly accurate)
Last touched: March 2023
Current accuracy: ~40%
New services since: 12
Removed services: 3
Changed dependencies: 27
Static diagrams are documentation debt.
Auto-Generated Service Maps
The solution is maps generated from actual traffic. There are three data sources:
1. Network Traffic (Infrastructure Level)
# Parse Kubernetes network policies + actual traffic
def build_network_map(k8s_client):
    services = k8s_client.list_namespaced_service('production')
    connections = []
    for svc in services.items:
        # Get the pods backing this service via its label selector
        pods = k8s_client.list_namespaced_pod(
            'production',
            label_selector=','.join(f"{k}={v}" for k, v in svc.spec.selector.items())
        )
        # Check established outbound connections from each pod
        for pod in pods.items:
            conns = get_pod_connections(pod.metadata.name)  # from /proc/net/tcp
            for conn in conns:
                target_svc = resolve_ip_to_service(conn.remote_ip)
                if target_svc:
                    connections.append({
                        'source': svc.metadata.name,
                        'target': target_svc,
                        'port': conn.remote_port
                    })
    return connections
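The helper get_pod_connections above reads established sockets from the pod's /proc/net/tcp. A minimal sketch of the parsing step, assuming you can already fetch that file's contents (e.g. via kubectl exec): /proc/net/tcp stores remote addresses as little-endian hex ip:port pairs, with state 01 meaning ESTABLISHED.

```python
import socket
import struct
from collections import namedtuple

Connection = namedtuple('Connection', ['remote_ip', 'remote_port'])

def parse_proc_net_tcp(contents):
    """Parse the text of /proc/net/tcp into established connections.

    Data lines look like:
      0: 0100007F:1F90 0A00020F:01BB 01 ...
    where rem_address is little-endian hex IP:port and state 01 = ESTABLISHED.
    """
    conns = []
    for line in contents.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4 or fields[3] != '01':  # keep ESTABLISHED only
            continue
        rem_hex_ip, rem_hex_port = fields[2].split(':')
        # IPv4 addresses in /proc/net/tcp are stored little-endian
        ip = socket.inet_ntoa(struct.pack('<I', int(rem_hex_ip, 16)))
        conns.append(Connection(remote_ip=ip, remote_port=int(rem_hex_port, 16)))
    return conns
```

In practice you would feed each Connection's remote_ip into resolve_ip_to_service (a lookup against Kubernetes Endpoints), which is why the map only ever reflects traffic that actually happened.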
2. Distributed Traces (Application Level)
-- Query trace data for service dependencies
SELECT
    parent_service,
    child_service,
    COUNT(*) AS call_count,
    AVG(duration_ms) AS avg_latency,
    SUM(CASE WHEN status = 'ERROR' THEN 1 ELSE 0 END)::float / COUNT(*) AS error_rate
FROM traces
WHERE timestamp > NOW() - INTERVAL '24 hours'
GROUP BY parent_service, child_service
ORDER BY call_count DESC;
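The rows that query returns are already edges; a few lines of Python turn them into an adjacency map ready for rendering or traversal. A minimal sketch (the row tuples mirror the SELECT column order; the sample values are illustrative):

```python
from collections import defaultdict

def build_dependency_graph(rows):
    """Turn (parent, child, call_count, avg_latency, error_rate) rows
    from the trace query into an adjacency map of annotated edges."""
    graph = defaultdict(dict)
    for parent, child, calls, latency, error_rate in rows:
        graph[parent][child] = {
            'call_count': calls,
            'avg_latency_ms': latency,
            'error_rate': error_rate,
        }
    return dict(graph)

# Illustrative rows, as the trace query might return them
rows = [
    ('checkout-api', 'payment-service', 12000, 118.4, 0.002),
    ('checkout-api', 'inventory-service', 9500, 42.1, 0.001),
]
graph = build_dependency_graph(rows)
```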
3. API Gateway Logs (Edge Level)
API Gateway → Which external services call which internal services
Load Balancer → Traffic distribution and routing rules
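Gateway and load balancer access logs give you the edges that traces miss: callers outside your mesh. A minimal sketch, assuming a hypothetical key=value log format (client=... upstream=...); real gateways vary, so the regex is the part you'd adapt:

```python
import re
from collections import Counter

# Hypothetical gateway access-log line:
#   2024-03-14T15:30:00Z client=mobile-app route=/v1/orders upstream=orders-api status=200
LOG_LINE = re.compile(r'client=(\S+).*?upstream=(\S+)')

def edges_from_gateway_logs(lines):
    """Count (external caller -> internal service) edges from access logs."""
    edges = Counter()
    for line in lines:
        m = LOG_LINE.search(line)
        if m:
            edges[(m.group(1), m.group(2))] += 1
    return edges
```

The counts double as edge weights, so high-traffic external entry points stand out on the map.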
What a Good Service Map Shows
Beyond just "A calls B," a useful service map includes:
service_map_edge:
  source: checkout-api
  target: payment-service
  metadata:
    requests_per_second: 150
    error_rate: 0.2%
    p99_latency: 120ms
    protocol: gRPC
    circuit_breaker: enabled
    retry_policy: 3x with exponential backoff
    owner_team: payments
    tier: critical
    last_deploy: 2024-03-14T15:30:00Z
Color-code by health:
- Green: < 1% error rate, latency within SLO
- Yellow: Approaching SLO limits
- Red: SLO breached
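The color logic is a couple of comparisons. A minimal sketch; the SLO values and the 80% "approaching" band are assumptions you would replace with each edge's real targets:

```python
def edge_health(error_rate, p99_latency_ms,
                slo_error_rate=0.01, slo_latency_ms=200, warn_fraction=0.8):
    """Map an edge's metrics to green/yellow/red.

    Assumed example SLOs: 1% error rate, 200 ms p99 latency.
    'Approaching' is defined here as within 80% of either limit.
    """
    if error_rate >= slo_error_rate or p99_latency_ms >= slo_latency_ms:
        return 'red'      # SLO breached
    if (error_rate >= warn_fraction * slo_error_rate
            or p99_latency_ms >= warn_fraction * slo_latency_ms):
        return 'yellow'   # approaching a limit
    return 'green'        # comfortably within SLO
```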
The Incident Response Advantage
During an incident, a live service map immediately answers:
- Impact scope: What services depend on the broken one?
- Blast radius: How many users are affected?
- Root cause direction: Is the problem upstream or downstream?
- Recovery order: Which services need to come back first?
Before service maps (incident response):
"Is anyone else affected?" → 15 minutes of Slack investigation
After service maps:
*click on broken service* → instantly see all dependents
Answer: 4 downstream services, 3 are degraded
Start Simple
You don't need a fancy tool to get started:
- Export your K8s services and network policies
- Add trace-based dependencies
- Render with a simple graph library
- Update automatically via cron job
A basic but accurate map beats a fancy but outdated one every time.
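For the rendering step, you don't even need a graph library: emitting Graphviz DOT text from your edge list and piping it through dot -Tsvg in the cron job is enough. A minimal sketch (the edge tuples and labels are illustrative):

```python
def edges_to_dot(edges):
    """Render (source, target, label) edges as Graphviz DOT text,
    which `dot -Tsvg` can turn into a browsable map."""
    lines = ['digraph services {', '  rankdir=LR;']
    for source, target, label in edges:
        lines.append(f'  "{source}" -> "{target}" [label="{label}"];')
    lines.append('}')
    return '\n'.join(lines)

dot = edges_to_dot([
    ('checkout-api', 'payment-service', '150 rps'),
    ('payment-service', 'fraud-check', '40 rps'),
])
```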
If you want auto-generated, always-accurate service maps with AI-powered impact analysis, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com