DEV Community

Samson Tanimawo
Kubernetes Network Policies: Lessons from Production Incidents

Why Default Kubernetes Networking Is Wrong

Fresh Kubernetes cluster:

  • Every pod can talk to every other pod
  • Across namespaces, across services, across environments
  • No egress restrictions
  • No ingress restrictions

This is a lateral movement attack waiting to happen. One compromised pod = entire cluster.

Network Policies fix this. Most teams ignore them until the first security audit.

A Simple Rule That Breaks Things

Start with: "deny all traffic by default, explicitly allow what you need."

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

Apply this to a running namespace, and everything breaks:

  • Pods can't reach DNS (kube-dns is in kube-system)
  • Pods can't reach the API server
  • Metrics scraping fails
  • Service mesh control plane loses connectivity

This is not a bug. This is the point. You have to explicitly allow everything.
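As a first companion to default-deny, a sketch of a namespace-wide DNS allowance (names and the `name: kube-system` label are assumptions; many clusters need `kubernetes.io/metadata.name: kube-system` instead):

```yaml
# Sketch: restore DNS for every pod in the namespace right after default-deny.
# Assumes kube-system carries the label name=kube-system.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}          # applies to all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```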

The Allow-List Pattern

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-server
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
  egress:
    # DNS
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
    # Database
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432

Every allowed flow has to be declared. No exceptions.

Production Incidents We've Debugged

Incident 1: The invisible DNS failure

Symptoms: pods running but inter-service calls failing intermittently. Logs showed DNS resolution timeouts.

Root cause: a recently applied Network Policy forgot to allow UDP port 53 to kube-dns. About 50% of DNS queries were failing.

Fix: always include DNS egress in the base template.

egress:
  - to:
      - namespaceSelector:
          matchLabels:
            name: kube-system
        podSelector:
          matchLabels:
            k8s-app: kube-dns
    ports:
      - protocol: UDP
        port: 53
      - protocol: TCP
        port: 53  # Some clients fall back to TCP

Incident 2: Metrics gap after namespace migration

Symptoms: Prometheus stopped scraping metrics for services in the payments namespace after a security audit applied strict policies.

Root cause: the scraper pod in the monitoring namespace couldn't reach the metrics port on payments pods. Network Policy blocked it.

Fix:

ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            name: monitoring
        podSelector:
          matchLabels:
            app: prometheus
    ports:
      - protocol: TCP
        port: 9090

Incident 3: External API calls blocked

Symptoms: a service that integrates with Stripe started returning 503s after a deploy.

Root cause: the new Network Policy allowed egress to internal services but didn't allow egress to external IPs. Stripe calls failed.

Fix:

egress:
  # Allow all external IPs on HTTPS
  - to:
      - ipBlock:
          cidr: 0.0.0.0/0
          except:
            - 10.0.0.0/8      # Block private ranges
            - 172.16.0.0/12
            - 192.168.0.0/16
    ports:
      - protocol: TCP
        port: 443

Namespace-Level vs Pod-Level

Namespace-level policies are easier but coarse:

# Allow all pods in the "api" namespace to reach the "db" namespace
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: db

Pod-level policies are harder but more secure:

# Only the "orders" service can reach the "orders-db" pods
spec:
  podSelector:
    matchLabels:
      app: orders
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: orders-db

Start namespace-level. Tighten to pod-level for sensitive services (payments, auth, PII stores).

The CNI Matters

Not all CNIs support all Network Policy features:

  • Cilium: full support, plus advanced L7 policies (HTTP method filtering)
  • Calico: full support for the v1 spec
  • Weave Net: basic support
  • Flannel: none; pair it with Calico for policy enforcement

If you're on Flannel without Calico, your Network Policies are being silently ignored. Check your CNI.

The Rollout Strategy

Don't apply strict policies to production on day one. You will cause an outage.

Phase 1 (week 1): Deploy in dev cluster only, break things, fix things
Phase 2 (week 2): Apply "log only" mode in staging (Cilium supports this)
Phase 3 (week 3-4): Apply in staging as enforced, watch for issues
Phase 4 (week 5+): Apply in production during a quiet window, have rollback ready

Total time: 4-6 weeks for a full rollout. Faster rollouts cause faster outages.

Testing Network Policies

Before deploying:

# Test from inside the cluster
kubectl run test-pod --image=busybox --rm -it -- sh
# Then, inside the pod's shell:
wget -qO- http://api-service:8080

# Verify denied flows fail
wget -qO- http://admin-service:8080  # Should time out

CI idea: spin up a cluster, apply your policies, run a test suite that asserts which pods can reach which others. Fails on policy regressions.
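A minimal sketch of what that CI suite's core assertion could look like. The real suite would populate `observed` by running `kubectl exec ... wget` probes in the policy-loaded cluster; service names like `frontend` and `admin-service` here are illustrative, not from a real cluster:

```python
# Expected reachability matrix: (source, destination) -> allowed?
# The real suite would verify each pair with an in-cluster probe.
EXPECTED_REACHABLE = {
    ("frontend", "api-server"): True,      # explicitly allowed on 8080
    ("frontend", "postgres"): False,       # no ingress rule for frontend
    ("api-server", "postgres"): True,      # allowed egress to app=postgres:5432
    ("frontend", "admin-service"): False,  # default deny
}

def check_matrix(observed: dict) -> list:
    """Return (src, dst) pairs whose observed reachability differs
    from the expected matrix -- i.e. policy regressions."""
    return [pair for pair, expected in EXPECTED_REACHABLE.items()
            if observed.get(pair) != expected]

# Simulated probe results, as the real suite would collect them:
observed = {
    ("frontend", "api-server"): True,
    ("frontend", "postgres"): False,
    ("api-server", "postgres"): True,
    ("frontend", "admin-service"): True,  # regression: should be denied
}
print(check_matrix(observed))  # flags the admin-service pair
```

Fail the CI job whenever `check_matrix` returns a non-empty list, and policy regressions never reach production silently.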

Observability Into Policies

Cilium provides Hubble UI for real-time network flows:

  • See which pods are talking
  • See which flows are denied
  • Visualize policy coverage

Without something like Hubble, debugging Network Policy issues is archaeology. Invest in it.

The Minimum Viable Policy

If you're starting from zero, here's the minimum:

# 1. Default deny ingress across all namespaces
# 2. Allow DNS egress (kube-dns)
# 3. Allow same-namespace pod-to-pod
# 4. Explicit cross-namespace allows for specific services
# 5. Deny egress to private IP ranges (except approved)
# 6. Allow egress to 0.0.0.0/0 on 443 for external APIs

This covers 80% of the security benefit for 20% of the complexity. Tighten from there over time.
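Items 1 and 3 of that list can be sketched as a single baseline manifest (names are illustrative, and it must be repeated per namespace):

```yaml
# Sketch: deny all ingress by default, but allow same-namespace pod-to-pod.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: baseline
  namespace: production   # apply one copy per namespace
spec:
  podSelector: {}         # all pods in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {} # any pod in the same namespace
```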

Common Mistakes

  1. Skipping DNS: everything breaks
  2. Forgetting metrics scrape paths: Prometheus goes blind
  3. Not testing in staging first: production outage on day one
  4. Using a CNI that doesn't support policies: silent failure
  5. Applying policies before installing a CNI that enforces them: chicken and egg
  6. No rollback plan: panic mode when things break

Network Policies are one of the highest-leverage security controls in Kubernetes. They're also one of the easiest ways to cause a self-inflicted outage. Treat them with respect.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
