Why Default Kubernetes Networking Is Wrong
Fresh Kubernetes cluster:
- Every pod can talk to every other pod
- Across namespaces, across services, across environments
- No egress restrictions
- No ingress restrictions
This is a lateral movement attack waiting to happen. One compromised pod = entire cluster.
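You can see this for yourself: spin up a throwaway pod and hit a service in another namespace (the service and namespace names below are hypothetical):

```bash
# Demonstrate the flat default network; service/namespace names are placeholders
kubectl run probe --image=busybox --restart=Never --rm -it -- \
  wget -qO- http://some-service.other-namespace.svc.cluster.local:8080
```

No policy objects, no firewall: the call just works.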
Network Policies fix this. Most teams ignore them until the first security audit.
A Simple Rule That Breaks Things
Start with: "deny all traffic by default, explicitly allow what you need."
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```
Apply this to a running namespace, and everything breaks:
- Pods can't reach DNS (kube-dns is in kube-system)
- Pods can't reach the API server
- Metrics scraping fails
- Service mesh control plane loses connectivity
This is not a bug. This is the point. You have to explicitly allow everything.
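For example, a pod that talks to the Kubernetes API (a controller or operator) needs an explicit egress rule to the API server's endpoint. A minimal sketch, assuming the API server answers at 10.0.0.1:443; find yours with `kubectl get endpoints kubernetes`:

```yaml
egress:
  - to:
      - ipBlock:
          cidr: 10.0.0.1/32 # hypothetical API server endpoint; check your cluster
    ports:
      - protocol: TCP
        port: 443 # some clusters serve the API on 6443
```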
The Allow-List Pattern
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-server
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
  egress:
    # DNS
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
    # Database
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432
```
Every allowed flow has to be declared. No exceptions.
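One caveat: matching kube-system by a `name` label assumes someone labeled the namespace that way by hand. On Kubernetes 1.21+, every namespace automatically carries the `kubernetes.io/metadata.name` label, so this selector needs no manual labeling:

```yaml
# Uses the label Kubernetes sets on every namespace automatically (v1.21+)
- namespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: kube-system
```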
Production Incidents We've Debugged
Incident 1: The invisible DNS failure
Symptoms: pods running but inter-service calls failing intermittently. Logs showed DNS resolution timeouts.
Root cause: a recently applied Network Policy was missing an egress rule allowing UDP port 53 to kube-dns. About 50% of DNS queries were failing.
Fix: always include DNS egress in the base template.
```yaml
egress:
  - to:
      - namespaceSelector:
          matchLabels:
            name: kube-system
        podSelector:
          matchLabels:
            k8s-app: kube-dns
    ports:
      - protocol: UDP
        port: 53
      - protocol: TCP
        port: 53 # Some clients fall back to TCP
```
Incident 2: Metrics gap after namespace migration
Symptoms: Prometheus stopped scraping metrics for services in the payments namespace after a security audit applied strict policies.
Root cause: the scraper pod in the monitoring namespace couldn't reach the metrics port on payments pods. Network Policy blocked it.
Fix:
```yaml
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            name: monitoring
        podSelector:
          matchLabels:
            app: prometheus
    ports:
      - protocol: TCP
        port: 9090
```
Incident 3: External API calls blocked
Symptoms: a service that integrates with Stripe started returning 503s after a deploy.
Root cause: the new Network Policy allowed egress to internal services but didn't allow egress to external IPs. Stripe calls failed.
Fix:
```yaml
egress:
  # Allow egress to all external IPs on HTTPS
  - to:
      - ipBlock:
          cidr: 0.0.0.0/0
          except:
            - 10.0.0.0/8 # Block private ranges
            - 172.16.0.0/12
            - 192.168.0.0/16
    ports:
      - protocol: TCP
        port: 443
```
Namespace-Level vs Pod-Level
Namespace-level policies are easier but coarse:
```yaml
# Allow all pods in the "api" namespace to reach the "db" namespace
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: db
```
Pod-level policies are harder but more secure:
# Only the "orders" service can reach the "orders_db"
spec:
podSelector:
matchLabels:
app: orders
egress:
- to:
- podSelector:
matchLabels:
app: orders-db
Start namespace-level. Tighten to pod-level for sensitive services (payments, auth, PII stores).
The CNI Matters
Not all CNIs support all Network Policy features:
- Cilium: full support, plus advanced L7 filtering (e.g., by HTTP method)
- Calico: full support for the v1 spec
- Weave Net: basic support
- Flannel: none; you need to pair it with Calico
If you're on Flannel without Calico, your Network Policies are being silently ignored. Check your CNI.
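A rough way to check from the command line (pod names vary by install method):

```bash
# CNI pods usually live in kube-system; their names reveal which CNI is installed
kubectl get pods -n kube-system -o wide | grep -Ei 'calico|cilium|flannel|weave'
```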
The Rollout Strategy
Don't apply strict policies to production on day one. You will cause an outage.
Phase 1 (week 1): Deploy in dev cluster only, break things, fix things
Phase 2 (week 2): Apply "log only" mode in staging (Cilium supports this)
Phase 3 (week 3-4): Apply in staging as enforced, watch for issues
Phase 4 (week 5+): Apply in production during a quiet window, have rollback ready
Total time: 4-6 weeks for a full rollout. Faster rollouts cause faster outages.
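On Cilium, "log only" is its policy audit mode. A sketch of the toggle via Helm; verify the exact value name against the docs for your Cilium version:

```bash
# Enable policy audit mode cluster-wide (value name may differ across versions)
helm upgrade cilium cilium/cilium -n kube-system --reuse-values --set policyAuditMode=true
```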
Testing Network Policies
Before deploying:
```bash
# Test from inside the cluster
kubectl run test-pod --image=busybox --rm -it -- sh

# From inside the pod's shell: allowed flow should succeed
wget -qO- http://api-service:8080

# Verify denied flows fail
wget -qO- http://admin-service:8080 # Should time out
```
CI idea: spin up an ephemeral cluster, apply your policies, and run a test suite that asserts which pods can reach which others. It fails the build on policy regressions.
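A rough sketch of such a suite in shell, assuming hypothetical pod and service names and busybox-style tooling (`nc`) inside the pods:

```bash
#!/usr/bin/env bash
# Hypothetical policy assertions, run against an ephemeral test cluster.
set -euo pipefail

# Assert a TCP connection from a pod to host:port succeeds
assert_reachable() {
  kubectl exec "$1" -n "$2" -- nc -z -w 3 "$3" "$4" \
    || { echo "FAIL: $1 should reach $3:$4"; exit 1; }
}

# Assert a TCP connection from a pod to host:port is blocked
assert_blocked() {
  if kubectl exec "$1" -n "$2" -- nc -z -w 3 "$3" "$4" 2>/dev/null; then
    echo "FAIL: $1 should NOT reach $3:$4"; exit 1
  fi
}

assert_reachable frontend-0 production api-server 8080
assert_blocked frontend-0 production postgres 5432
echo "All policy assertions passed"
```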
Observability Into Policies
Cilium provides Hubble UI for real-time network flows:
- See which pods are talking
- See which flows are denied
- Visualize policy coverage
Without something like Hubble, debugging Network Policy issues is archaeology. Invest in it.
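With Hubble enabled, tailing denied flows is a single command:

```bash
# Show the 20 most recent dropped flows (requires Cilium with Hubble enabled)
hubble observe --verdict DROPPED --last 20
```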
The Minimum Viable Policy
If you're starting from zero, here's the minimum:
1. Default deny ingress across all namespaces
2. Allow DNS egress (kube-dns)
3. Allow same-namespace pod-to-pod
4. Explicit cross-namespace allows for specific services
5. Deny egress to private IP ranges (except approved)
6. Allow egress to 0.0.0.0/0 on 443 for external APIs
This covers 80% of the security benefit for 20% of the complexity. Tighten from there over time.
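A sketch of items 1-3 as a single per-namespace policy; the namespace name is a placeholder, and the DNS selector assumes Kubernetes 1.21+:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: baseline
  namespace: my-namespace # placeholder
spec:
  podSelector: {} # applies to every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector: {} # 3. allow same-namespace pod-to-pod
  egress:
    - to:
        - podSelector: {} # same-namespace egress
    - to: # 2. DNS egress to kube-dns
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```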
Common Mistakes
- Skipping DNS: everything breaks
- Forgetting metrics scrape paths: Prometheus goes blind
- Not testing in staging first: production outage on day one
- Using a CNI that doesn't support policies: silent failure
- Applying policies before a CNI that enforces them is installed: chicken and egg
- No rollback plan: panic mode when things break
Network Policies are one of the highest-leverage security controls in Kubernetes. They're also one of the easiest ways to cause a self-inflicted outage. Treat them with respect.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com