
Diganto Paul


How to configure, secure, and operate a production-ready Kubernetes cluster

Introduction

Kubernetes has become the de facto standard for container orchestration, but standing up a cluster is only the beginning. The real work of administration lies in configuring it correctly, securing it, and keeping it healthy over time. This guide walks through the core building blocks of Kubernetes cluster administration — from initial setup to day-two operations — so you can run a cluster with confidence.


1. Choosing Your Cluster Architecture

Before writing a single YAML file, decide how your cluster will be built and who will manage the control plane.

| Approach | Best For | Trade-offs |
|---|---|---|
| Managed (EKS, GKE, AKS) | Teams that want to focus on workloads, not infrastructure | Less control over control-plane internals |
| Self-managed (kubeadm) | On-prem, air-gapped, or highly customized environments | Full responsibility for upgrades, HA, and patching |
| Lightweight (k3s, kind, minikube) | Edge, dev/test, or resource-constrained setups | Not typically suited for large-scale production |

A common starting point for self-managed clusters is kubeadm, which automates control-plane bootstrapping while leaving room for customization.

# Initialize the control plane node
sudo kubeadm init --pod-network-cidr=10.244.0.0/16

# Set up local kubeconfig access
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

2. Networking: Choosing a CNI Plugin

Kubernetes doesn't ship with built-in pod networking — you need a Container Network Interface (CNI) plugin. Popular choices include:

  • Calico — strong network policy support, widely used in production
  • Cilium — eBPF-based, excellent observability and security features
  • Flannel — simple, minimal overhead, good for smaller clusters

Installing Calico, for example, is typically a single manifest away:

kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/calico.yaml

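Tip: Choose your CNI before running kubeadm init, not just before joining worker nodes. Some plugins expect a specific pod CIDR (Flannel's default manifest assumes 10.244.0.0/16), and that value is set with --pod-network-cidr at init time.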


3. Node Configuration and Joining Workers

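Once the control plane is up, worker nodes join with a bootstrap token. kubeadm init prints a join command when it finishes; because tokens expire after 24 hours by default, you can generate a fresh one on the control plane at any time: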

kubeadm token create --print-join-command

Run the resulting command on each worker node. After joining, verify cluster health:

kubectl get nodes -o wide
kubectl get pods -n kube-system

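All nodes should report Ready, and system pods (CoreDNS, kube-proxy, CNI agents) should be Running. If nodes stay NotReady and CoreDNS stays Pending, check that the CNI plugin is installed and healthy first.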
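
If you want to script that check instead of eyeballing it, here is a minimal sketch. The function name wait_for_cluster_ready is hypothetical; kubectl wait and the status.phase field selector are standard kubectl features. It only looks at pod phase, so a crash-looping pod that is still in the Running phase would slip through.

```shell
#!/bin/sh
# Sketch: block until every node reports Ready, then list any kube-system
# pod that is not in the Running phase. The function name is hypothetical.
wait_for_cluster_ready() {
  timeout="${1:-300s}"

  # kubectl wait exits non-zero if the condition is not met within the timeout
  kubectl wait --for=condition=Ready nodes --all --timeout="$timeout" || return 1

  # With --no-headers, empty output means every system pod is in the Running phase
  not_running=$(kubectl get pods -n kube-system --no-headers \
    --field-selector=status.phase!=Running 2>/dev/null)
  if [ -n "$not_running" ]; then
    echo "kube-system pods not Running:"
    echo "$not_running"
    return 1
  fi
  echo "cluster healthy"
}
```

Calling wait_for_cluster_ready 120s right after the last worker joins gives you a single pass/fail signal you can drop into a provisioning script.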


4. Role-Based Access Control (RBAC)

Security starts with least-privilege access. RBAC governs who can do what within the cluster.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: development
  name: pod-reader
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: development
subjects:
  - kind: User
    name: jane.doe
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

Best practices:

  • Avoid binding to cluster-admin except for break-glass accounts.
  • Use Groups over individual Users for easier management at scale.
  • Regularly audit bindings with kubectl get rolebindings,clusterrolebindings -A.
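
Auditing is easier when you can ask the API server directly what a subject is allowed to do. The sketch below wraps kubectl auth can-i with impersonation; check_access is a hypothetical helper name, and it assumes your own account has permission to impersonate users.

```shell
#!/bin/sh
# Sketch: confirm a binding grants what you expect and nothing more.
# check_access is a hypothetical helper; kubectl auth can-i and --as are standard.
check_access() {
  user="$1"; namespace="$2"; verb="$3"; resource="$4"

  # can-i prints "yes" or "no" and exits non-zero for "no", hence the || true
  answer=$(kubectl auth can-i "$verb" "$resource" --as "$user" -n "$namespace") || true
  echo "$user $verb $resource in $namespace: $answer"
}
```

With only the pod-reader Role and read-pods RoleBinding above in place, check_access jane.doe development list pods should report yes, while check_access jane.doe development delete pods should report no.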

5. Resource Management: Quotas and Limits

Left unchecked, workloads can consume an entire cluster's resources. Use ResourceQuota and LimitRange to keep things fair.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: development
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"

Pair this with per-container defaults:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: development
spec:
  limits:
    - default:
        cpu: 500m
        memory: 256Mi
      defaultRequest:
        cpu: 250m
        memory: 128Mi
      type: Container
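
It is worth checking how these two objects interact. The sketch below works through the arithmetic for the numbers above, assuming single-container pods that set no requests or limits of their own and therefore pick up the LimitRange defaults. The variable and function names are made up for illustration.

```shell
#!/bin/sh
# Values copied from dev-quota and default-limits, in millicores (CPU) and MiB (memory).
quota_requests_cpu=4000;     default_request_cpu=250
quota_requests_memory=8192;  default_request_memory=128
quota_limits_cpu=8000;       default_limit_cpu=500
quota_limits_memory=16384;   default_limit_memory=256
quota_pods=20

# Print how many single-container pods using only the defaults fit in the quota.
max_default_pods() {
  max=$quota_pods
  for fit in \
    $(( quota_requests_cpu / default_request_cpu )) \
    $(( quota_requests_memory / default_request_memory )) \
    $(( quota_limits_cpu / default_limit_cpu )) \
    $(( quota_limits_memory / default_limit_memory ))
  do
    # Keep the tightest constraint seen so far
    if [ "$fit" -lt "$max" ]; then max=$fit; fi
  done
  echo "$max"
}

max_default_pods   # prints 16
```

CPU is the binding constraint here: 4 CPU of requests divided by 250m per container allows 16 pods, so the namespace runs out of CPU quota before it reaches the 20-pod cap.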

6. Storage Configuration

Persistent workloads need reliable storage. Define a StorageClass so PersistentVolumeClaims can be dynamically provisioned:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com  # AWS EBS CSI driver (must be installed); replaces the deprecated in-tree kubernetes.io/aws-ebs
parameters:
  type: gp3
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer

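With reclaimPolicy: Retain, deleting a PVC leaves the underlying volume and its data in place. The PersistentVolume moves to the Released phase and has to be cleaned up or re-bound manually, which is a small price for avoiding accidental data loss.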
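
Because retained volumes are never cleaned up automatically, it helps to look for leftovers now and then. A minimal sketch, with list_released_pvs as a hypothetical helper name and custom-columns used so the output is easy to filter:

```shell
#!/bin/sh
# Sketch: list PersistentVolumes left in the Released phase after their PVC was deleted.
# list_released_pvs is a hypothetical helper name.
list_released_pvs() {
  kubectl get pv --no-headers \
    -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,CLAIM:.spec.claimRef.name' |
    awk '$2 == "Released" { print $1 " (was bound to " $3 ")" }'
}
```

Each line it prints is a volume that still holds data and still costs money, so decide per volume whether to archive, re-bind, or delete it.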


7. Monitoring and Observability

You can't administer what you can't see. A standard, battle-tested stack includes:

  • Prometheus — metrics collection and alerting
  • Grafana — dashboards and visualization
  • Loki or EFK stack — centralized logging
  • kube-state-metrics — cluster object state exposed as metrics

A minimal install via Helm; the kube-prometheus-stack chart bundles Prometheus, Alertmanager, Grafana, and kube-state-metrics:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace

Set up alerts for the essentials early: node disk pressure, pod crash loops, and API server latency.


8. Upgrades and Maintenance

Kubernetes releases new minor versions roughly every four months, and only the latest three minor versions are supported upstream. A safe upgrade path:

  1. Back up etcd before any upgrade.
  2. Upgrade the control plane first, one minor version at a time.
  3. Drain and upgrade nodes individually:
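# Run from a machine with kubectl access: evict workloads from the node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Run on the node itself, after upgrading its kubeadm package
sudo kubeadm upgrade node
# Upgrade the kubelet package and restart kubelet, then readmit the node
kubectl uncordon <node-name>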
  4. Validate workloads after each stage before proceeding.
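
The "one minor version at a time" rule is easy to enforce in a pre-flight script. The sketch below is one way to do it; is_single_minor_step is a hypothetical name, and it only compares version strings, it does not talk to the cluster.

```shell
#!/bin/sh
# Sketch: refuse upgrade plans that skip a minor version or go backwards.
# Accepts versions like 1.29.3 or v1.29.3; the function name is hypothetical.
is_single_minor_step() {
  current="${1#v}"; target="${2#v}"
  cur_major=$(echo "$current" | cut -d. -f1)
  cur_minor=$(echo "$current" | cut -d. -f2)
  tgt_major=$(echo "$target" | cut -d. -f1)
  tgt_minor=$(echo "$target" | cut -d. -f2)

  # Same major version, and the minor moves forward by at most one
  [ "$cur_major" -eq "$tgt_major" ] || return 1
  step=$(( tgt_minor - cur_minor ))
  [ "$step" -ge 0 ] && [ "$step" -le 1 ]
}
```

For example, is_single_minor_step 1.29.3 1.30.1 succeeds, while is_single_minor_step 1.28.5 1.30.1 fails because it would skip 1.29.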

9. Backup and Disaster Recovery

At minimum, back up:

  • etcd (the source of truth for cluster state)
  • Persistent volumes (application data)
  • Cluster manifests (via GitOps, so they're already version-controlled)
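
For the etcd item, a snapshot on a kubeadm control-plane node could look like the sketch below. It assumes etcdctl (v3 API) is installed on the node, that the certificates are in kubeadm's default location under /etc/kubernetes/pki/etcd, and that you run it as root so the key file is readable. The backup_etcd name and the /var/backups/etcd directory are illustrative choices.

```shell
#!/bin/sh
# Sketch: snapshot etcd on a kubeadm control-plane node.
# Assumes etcdctl is installed and certificates are in kubeadm's default paths;
# backup_etcd and the default target directory are hypothetical.
backup_etcd() {
  backup_dir="${1:-/var/backups/etcd}"
  snapshot="$backup_dir/etcd-$(date +%Y%m%d-%H%M%S).db"
  mkdir -p "$backup_dir" || return 1

  etcdctl snapshot save "$snapshot" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key || return 1

  # Print the snapshot path so callers can copy it off the node
  echo "$snapshot"
}
```

Copy the resulting file off the control-plane node; a snapshot that lives only on the machine it protects will not survive losing that machine.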

Tools like Velero simplify this significantly:

velero backup create daily-backup --include-namespaces production

Test restores periodically — a backup you've never restored isn't a real backup.
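
One way to make both halves routine is a scheduled backup plus a restore drill into a scratch namespace. This is a sketch under assumptions: the function names, the 02:00 cron expression, the 7-day TTL, and the restore-test namespace are all illustrative, while the velero subcommands and flags are part of the standard CLI.

```shell
#!/bin/sh
# Sketch: a recurring backup plus a restore drill into a scratch namespace.
# Function names, schedule, TTL, and the restore-test namespace are assumptions.
schedule_daily_backup() {
  # Cron format: 02:00 every day; keep each backup for 7 days
  velero schedule create daily-backup \
    --schedule="0 2 * * *" \
    --include-namespaces production \
    --ttl 168h0m0s
}

restore_drill() {
  backup_name="$1"
  # Restore into a separate namespace so the drill never touches production
  velero restore create --from-backup "$backup_name" \
    --namespace-mappings production:restore-test
}
```

Run velero backup get to find the name of a recent backup, pass it to restore_drill, and then check that the workloads in restore-test actually start.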


Closing Thoughts

Kubernetes administration is less about a single "correct" configuration and more about establishing sane defaults, guardrails, and repeatable processes. Get networking, RBAC, and resource limits right early, invest in observability, and treat backups and upgrades as routine — not emergencies. With that foundation, your cluster will scale with your team rather than against it.
