DEV Community

Samson Tanimawo
Kubernetes Network Policies: Lessons from Production Incidents

Why Default Kubernetes Networking Is Wrong

Fresh Kubernetes cluster:

  • Every pod can talk to every other pod
  • Across namespaces, across services, across environments
  • No egress restrictions
  • No ingress restrictions

This is a lateral movement attack waiting to happen. One compromised pod = entire cluster.

Network Policies fix this. Most teams ignore them until the first security audit.

A Simple Rule That Breaks Things

Start with: "deny all traffic by default, explicitly allow what you need."

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

Apply this to a running namespace, and everything breaks:

  • Pods can't reach DNS (kube-dns is in kube-system)
  • Pods can't reach the API server
  • Metrics scraping fails
  • Service mesh control plane loses connectivity

This is not a bug. This is the point. You have to explicitly allow everything.
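As a first companion to default-deny, a sketch of a namespace-wide DNS allowance (names and the `name: kube-system` label are assumptions; many clusters need `kubernetes.io/metadata.name: kube-system` instead):

```yaml
# Sketch: restore DNS for every pod in the namespace right after default-deny.
# Assumes kube-system carries the label name=kube-system.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}          # applies to all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```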

The Allow-List Pattern

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-server
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
  egress:
    # DNS
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
    # Database
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432

Every allowed flow has to be declared. No exceptions.

Production Incidents We've Debugged

Incident 1: The invisible DNS failure

Symptoms: pods running but inter-service calls failing intermittently. Logs showed DNS resolution timeouts.

Root cause: a recently applied Network Policy forgot to allow UDP port 53 to kube-dns. About 50% of DNS queries were failing.

Fix: always include DNS egress in the base template.

egress:
  - to:
      - namespaceSelector:
          matchLabels:
            name: kube-system
        podSelector:
          matchLabels:
            k8s-app: kube-dns
    ports:
      - protocol: UDP
        port: 53
      - protocol: TCP
        port: 53  # Some clients fall back to TCP

Incident 2: Metrics gap after namespace migration

Symptoms: Prometheus stopped scraping metrics for services in the payments namespace after a security audit applied strict policies.

Root cause: the scraper pod in the monitoring namespace couldn't reach the metrics port on payments pods. Network Policy blocked it.

Fix:

ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            name: monitoring
        podSelector:
          matchLabels:
            app: prometheus
    ports:
      - protocol: TCP
        port: 9090

Incident 3: External API calls blocked

Symptoms: a service that integrates with Stripe started returning 503s after a deploy.

Root cause: the new Network Policy allowed egress to internal services but didn't allow egress to external IPs. Stripe calls failed.

Fix:

egress:
  # Allow all external IPs on HTTPS
  - to:
      - ipBlock:
          cidr: 0.0.0.0/0
          except:
            - 10.0.0.0/8      # Block private ranges
            - 172.16.0.0/12
            - 192.168.0.0/16
    ports:
      - protocol: TCP
        port: 443

Namespace-Level vs Pod-Level

Namespace-level policies are easier but coarse:

# Allow all pods in the "api" namespace to reach the "db" namespace
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: db

Pod-level policies are harder but more secure:

# Only the "orders" service can reach the "orders-db" pods
spec:
  podSelector:
    matchLabels:
      app: orders
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: orders-db

Start namespace-level. Tighten to pod-level for sensitive services (payments, auth, PII stores).

The CNI Matters

Not all CNIs support all Network Policy features:

  • Cilium: full support, plus advanced L7 policies (HTTP method filtering)
  • Calico: full support for the v1 spec
  • Weave Net: basic support
  • Flannel: none; pair it with Calico for policy enforcement

If you're on Flannel without Calico, your Network Policies are being silently ignored. Check your CNI.

The Rollout Strategy

Don't apply strict policies to production on day one. You will cause an outage.

Phase 1 (week 1): Deploy in dev cluster only, break things, fix things
Phase 2 (week 2): Apply "log only" mode in staging (Cilium supports this)
Phase 3 (week 3-4): Apply in staging as enforced, watch for issues
Phase 4 (week 5+): Apply in production during a quiet window, have rollback ready

Total time: 4-6 weeks for a full rollout. Faster rollouts cause faster outages.

Testing Network Policies

Before deploying:

# Test from inside the cluster
kubectl run test-pod --image=busybox --rm -it -- sh
# Then, inside the pod's shell:
wget -qO- http://api-service:8080

# Verify denied flows fail
wget -qO- http://admin-service:8080  # Should time out

CI idea: spin up a cluster, apply your policies, run a test suite that asserts which pods can reach which others. Fails on policy regressions.
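A minimal sketch of what that CI suite's core assertion could look like. The real suite would populate `observed` by running `kubectl exec ... wget` probes in the policy-loaded cluster; service names like `frontend` and `admin-service` here are illustrative, not from a real cluster:

```python
# Expected reachability matrix: (source, destination) -> allowed?
# The real suite would verify each pair with an in-cluster probe.
EXPECTED_REACHABLE = {
    ("frontend", "api-server"): True,      # explicitly allowed on 8080
    ("frontend", "postgres"): False,       # no ingress rule for frontend
    ("api-server", "postgres"): True,      # allowed egress to app=postgres:5432
    ("frontend", "admin-service"): False,  # default deny
}

def check_matrix(observed: dict) -> list:
    """Return (src, dst) pairs whose observed reachability differs
    from the expected matrix -- i.e. policy regressions."""
    return [pair for pair, expected in EXPECTED_REACHABLE.items()
            if observed.get(pair) != expected]

# Simulated probe results, as the real suite would collect them:
observed = {
    ("frontend", "api-server"): True,
    ("frontend", "postgres"): False,
    ("api-server", "postgres"): True,
    ("frontend", "admin-service"): True,  # regression: should be denied
}
print(check_matrix(observed))  # flags the admin-service pair
```

Fail the CI job whenever `check_matrix` returns a non-empty list, and policy regressions never reach production silently.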

Observability Into Policies

Cilium provides Hubble UI for real-time network flows:

  • See which pods are talking
  • See which flows are denied
  • Visualize policy coverage

Without something like Hubble, debugging Network Policy issues is archaeology. Invest in it.

The Minimum Viable Policy

If you're starting from zero, here's the minimum:

# 1. Default deny ingress across all namespaces
# 2. Allow DNS egress (kube-dns)
# 3. Allow same-namespace pod-to-pod
# 4. Explicit cross-namespace allows for specific services
# 5. Deny egress to private IP ranges (except approved)
# 6. Allow egress to 0.0.0.0/0 on 443 for external APIs

This covers 80% of the security benefit for 20% of the complexity. Tighten from there over time.
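Items 1 and 3 of that list can be sketched as a single baseline manifest (names are illustrative, and it must be repeated per namespace):

```yaml
# Sketch: deny all ingress by default, but allow same-namespace pod-to-pod.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: baseline
  namespace: production   # apply one copy per namespace
spec:
  podSelector: {}         # all pods in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {} # any pod in the same namespace
```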

Common Mistakes

  1. Skipping DNS: everything breaks
  2. Forgetting metrics scrape paths: Prometheus goes blind
  3. Not testing in staging first: production outage on day one
  4. Using a CNI that doesn't support policies: silent failure
  5. Applying policies before installing a CNI that enforces them: chicken and egg
  6. No rollback plan: panic mode when things break

Network Policies are one of the highest-leverage security controls in Kubernetes. They're also one of the easiest ways to cause a self-inflicted outage. Treat them with respect.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
