Andrey

Posted on • Originally published at idatamax.com

Kubernetes Best Practices - Deployment and Troubleshooting

Kubernetes best practices for deployment and troubleshooting ensure reliable application rollouts and rapid issue resolution. This article provides detailed strategies for managing updates with tools like Helm, implementing scalable deployment patterns, and diagnosing issues using modern monitoring and logging techniques. These practices equip professionals to maintain high availability and performance in production clusters.

Optimizing Kubernetes Rollouts

Kubernetes rollouts ensure seamless application updates with minimal downtime. Declarative configurations, automated package management, and advanced deployment strategies enable scalable and reliable deployments in production clusters. This section covers fine-tuned Deployment configurations, Helm-driven automation, and modern patterns like Canary and GitOps.

Configuring Deployments

Deployments orchestrate ReplicaSets to manage stateless applications, ensuring the desired number of Pods run consistently. Configuration options in a Deployment's specification control scaling, updates, and rollback behavior, while labels streamline resource management.

Deployment Configuration

A Deployment's specification includes two key sections: one for ReplicaSet settings and another for Pod configuration. Key parameters include:

  • replicas: Specifies the number of Pods.
  • progressDeadlineSeconds: Sets a timeout for update completion.
  • revisionHistoryLimit: Limits retained ReplicaSet versions for rollbacks.
  • strategy: Defines update behavior (e.g., RollingUpdate).

Example configuration:

apiVersion: apps/v1  
kind: Deployment  
metadata:  
  name: dev-web  
spec:  
  replicas: 1  
  progressDeadlineSeconds: 600  
  revisionHistoryLimit: 10  
  selector:  
    matchLabels:  
      app: dev-web  
  strategy:  
    type: RollingUpdate  
    rollingUpdate:  
      maxSurge: 25%  
      maxUnavailable: 25%  
  template:  
    metadata:  
      labels:  
        app: dev-web  
    spec:  
      containers:  
      - name: web  
        image: nginx:1.14

Scale the Deployment:

kubectl scale deployment/dev-web --replicas=4

The RollingUpdate strategy replaces Pods gradually: maxSurge allows up to 25% extra Pods to be created during the update, while maxUnavailable keeps no more than 25% of the desired replicas unavailable at any time.

Managing Updates and Rollbacks

Updates to a Deployment (e.g., changing the container image) create a new ReplicaSet that gradually replaces the old Pods. Modify the configuration with kubectl apply, kubectl edit, or kubectl set image:

kubectl set image deployment/dev-web web=nginx:1.15

Monitor rollout status:

kubectl rollout status deployment/dev-web

Roll back to a previous version if an update fails:

kubectl rollout undo deployment/dev-web

Retained ReplicaSet versions (revisionHistoryLimit) enable reliable rollbacks, critical for production stability.
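For finer control, list the recorded revisions and roll back to a specific one. A quick sketch (the revision number here is illustrative and must exist in the retained history):

kubectl rollout history deployment/dev-web
kubectl rollout undo deployment/dev-web --to-revision=2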

Labels for Administration

Labels, stored in metadata as key-value pairs, allow querying and managing resources without referencing individual names or UIDs. For example:

metadata:  
  labels:  
    app: dev-web  
    environment: prod

Select resources:

kubectl get pods -l app=dev-web,environment=prod

Labels facilitate flexible operations, such as scaling or routing traffic, enhancing administrative efficiency.
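For instance, labels can be added to running Pods and then used to act on whole groups at once. A small sketch (the environment value and replica count are illustrative):

kubectl label pods -l app=dev-web environment=prod --overwrite
kubectl scale deployment -l app=dev-web --replicas=3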

Helm for Application Management

Helm simplifies complex application deployments through packaged charts. As a package manager, Helm bundles Kubernetes manifests (Deployments, Services, ConfigMaps, Secrets) into charts, enabling single-command deployments and versioned releases.

Chart Structure

A Helm chart is an archived set of manifests with a defined structure:

  • Chart.yaml: Metadata (name, version, keywords).
  • values.yaml: Configurable values for templates.
  • templates/: Manifest YAMLs with Go templating.

Example Chart.yaml for PostgreSQL:

apiVersion: v2  
name: postgresql  
version: 1.0.0

Example values.yaml:

postgresqlPassword: pgpass123

Template example (secrets.yaml):

apiVersion: v1  
kind: Secret  
metadata:  
  name: {{ template "fullname" . }}  
  labels:  
    app: {{ template "fullname" . }}  
    chart: "{{ .Chart.Name }}-{{ .Chart.Version }}"  
type: Opaque  
data:  
  postgresql-password: {{ .Values.postgresqlPassword | b64enc | quote }}

Install a chart:

helm install my-release ./postgresql

Helm replaces template variables with values.yaml data, generating manifests for deployment.
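To inspect the rendered output before applying anything, render the chart locally or do a dry-run install. A short sketch using the same chart directory:

helm template my-release ./postgresql
helm install my-release ./postgresql --dry-run --debug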

Managing Releases

Upgrade a release:

helm upgrade my-release ./postgresql --set postgresqlPassword=newpass

Roll back a release:

helm rollback my-release 1

Helm's versioned releases and repositories streamline updates, making it ideal for CI/CD pipelines.
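Each upgrade records a new numbered revision, which you can inspect before deciding where to roll back:

helm history my-release
helm status my-release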

Advanced Deployment Strategies

Modern deployment patterns enhance reliability and minimize downtime. Beyond RollingUpdate, Kubernetes supports advanced strategies, complementing Helm and Deployment configurations.

Blue-Green Deployments

Blue-Green deployments maintain two environments (blue and green), switching traffic to a new version after validation. Implement using Service selectors:

apiVersion: v1  
kind: Service  
metadata:  
  name: my-service  
spec:  
  selector:  
    app: my-app  
    version: green  
  ports:  
  - port: 80

Deploy a new version (version: blue), test, then update the Service selector to version: blue. This ensures zero downtime but requires double resources during transitions.
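The selector switch itself can be a single patch. A sketch assuming the Service and labels shown above:

kubectl patch service my-service -p '{"spec":{"selector":{"app":"my-app","version":"blue"}}}'

Rolling back is the same patch pointing the selector back to version: green.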

Canary Deployments

Canary deployments route a small percentage of traffic to a new version. Use Argo Rollouts for fine-grained control:

apiVersion: argoproj.io/v1alpha1  
kind: Rollout  
metadata:  
  name: my-app  
spec:  
  replicas: 4  
  strategy:  
    canary:  
      steps:  
      - setWeight: 20  
      - pause: {duration: 10m}  
      - setWeight: 100  
  selector:  
    matchLabels:  
      app: my-app  
  template:  
    metadata:  
      labels:  
        app: my-app  
    spec:  
      containers:  
      - name: app  
        image: myapp:2.0

This routes 20% traffic to version 2.0 for 10 minutes before full rollout.

GitOps with ArgoCD

GitOps uses Git as the source of truth for cluster state. ArgoCD synchronizes manifests from a Git repository:

argocd app create my-app --repo https://github.com/myrepo --path manifests --dest-server https://kubernetes.default.svc  
argocd app sync my-app

ArgoCD ensures declarative deployments, enabling automated rollbacks if Git changes are reverted.
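The same application can also be defined declaratively as an Application resource and stored in Git itself. A minimal sketch (repository URL, path, and namespaces are placeholders):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myrepo
    path: manifests
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

With automated sync and prune enabled, ArgoCD keeps the cluster converged on whatever the repository declares.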

CI/CD Integration

CI/CD pipelines automate Helm deployments. Example with GitHub Actions:

name: Deploy to Kubernetes  
on:  
  push:  
    branches: [ main ]  
jobs:  
  deploy:  
    runs-on: ubuntu-latest  
    steps:  
    - uses: actions/checkout@v3  
    - name: Install Helm  
      run: curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash  
    - name: Deploy Helm Chart  
      run: helm upgrade --install my-release ./chart --set image.tag=${{ github.sha }}

This pipeline deploys a Helm chart on each push, ensuring consistent updates.


Troubleshooting

Effective troubleshooting in Kubernetes pinpoints issues to keep clusters running smoothly. When a Pod fails, a Service is unreachable, or performance dips, quick diagnosis is critical to maintaining production-grade reliability. This section walks through monitoring, logging, and diagnostic techniques, enriched with modern tools and clear explanations to help you resolve issues efficiently.

Monitoring and Logging

Monitoring and logging give you a window into your cluster's health and application behavior. By collecting metrics and logs, you can spot problems early, from resource bottlenecks to application errors, and dive into root causes with confidence. Let's explore the key tools and how they fit into real-world scenarios.

Metrics Server: Getting Started with Resource Metrics

Metrics Server is a lightweight tool that gathers CPU and memory usage for nodes and Pods, making it a go-to for basic performance checks. It exposes these metrics through the Kubernetes API (/apis/metrics.k8s.io/), which is handy for quick diagnostics or feeding data to autoscalers like Horizontal Pod Autoscaling (HPA).

To see resource usage, run:

kubectl top node

This command lists nodes with their CPU and memory consumption, helping you identify if a node is overloaded - say, a node maxed out at 90% CPU might explain why Pods are slow to schedule. Similarly, check Pod usage:

kubectl top pod

If a Pod is hogging resources, it could be causing contention, like a memory leak in an app eating up node capacity.
To install Metrics Server, apply its manifests:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Once running, Metrics Server pulls data from kubelet on each node, so ensure kubelet is accessible. If kubectl top fails, check for network issues or misconfigured RBAC for Metrics Server's ServiceAccount. This tool is great for quick checks but limited for deep diagnostics, so we'll layer on more advanced options next.
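Since Metrics Server backs the resource metrics API that Horizontal Pod Autoscaling relies on, the same data can drive scaling decisions. A minimal HPA sketch targeting the dev-web Deployment from earlier (the thresholds and replica bounds are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: dev-web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: dev-web
  minReplicas: 2
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Scaling activity can then be checked with kubectl get hpa.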

Fluentd: Centralizing Logs for Clarity

Fluentd, a CNCF project, acts as a unified logging layer, collecting logs from all nodes and containers, then routing them to storage like Elasticsearch or Loki. Running Fluentd as a DaemonSet ensures every node has a logging agent, capturing logs from /var/log and container outputs.
Here's a simplified DaemonSet for Fluentd:

apiVersion: apps/v1  
kind: DaemonSet  
metadata:  
  name: fluentd  
  namespace: kube-system  
spec:  
  selector:  
    matchLabels:  
      name: fluentd  
  template:  
    metadata:  
      labels:  
        name: fluentd  
    spec:  
      containers:  
      - name: fluentd  
        image: fluent/fluentd:v1.16  
        volumeMounts:  
        - name: varlog  
          mountPath: /var/log  
      volumes:  
      - name: varlog  
        hostPath:  
          path: /var/log

Deploy this, and Fluentd starts aggregating logs. Imagine a scenario where an app crashes intermittently - Fluentd collects container logs, letting you query for errors like NullPointerException across all Pods. Pair it with a backend like Loki, and you can search logs with a query like {app="myapp"} |="ERROR". This beats manually checking each Pod's logs, especially in a 100-node cluster.

Prometheus and Grafana: Deep Visibility

Prometheus, a CNCF time-series database, excels at collecting and querying metrics, while Grafana visualizes them in dashboards. Together, they're a powerhouse for spotting trends, like a gradual memory leak or a spike in API latency.
Deploy Prometheus with a basic configuration (the Prometheus custom resource below assumes the Prometheus Operator is installed):

apiVersion: monitoring.coreos.com/v1  
kind: Prometheus  
metadata:  
  name: prometheus  
  namespace: monitoring  
spec:  
  replicas: 2  
  serviceAccountName: prometheus  
  resources:  
    requests:  
      memory: "400Mi"

Prometheus scrapes metrics from Pods, nodes, and custom endpoints. For example, if your app exposes metrics at /metrics, Prometheus can track request latency or error rates. Access Grafana to visualize these metrics:

kubectl port-forward svc/grafana -n monitoring 3000:80

Open http://localhost:3000 in your browser, and you'll see dashboards showing CPU usage, Pod restarts, or custom metrics. In a real-world case, a dashboard might reveal a Pod restarting every 10 minutes due to OOM (Out of Memory) errors, pointing you to a misconfigured memory limit.
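When Prometheus runs via the Prometheus Operator, scraping an app's /metrics endpoint is typically declared with a ServiceMonitor. A minimal sketch (the label selector and port name are placeholders for your own Service):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: http
    path: /metrics
    interval: 30s

Whether Prometheus actually picks it up also depends on the serviceMonitorSelector configured on the Prometheus resource.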

Kube-State-Metrics: Object-Level Insights

Kube-State-Metrics complements Prometheus by exposing metrics about Kubernetes objects, like Pods, Deployments, or Services. For instance, kube_pod_status_phase tracks whether Pods are Running, Pending, or Failed, helping you spot stuck workloads.
Install it from the standard manifests in the project repository:

git clone https://github.com/kubernetes/kube-state-metrics.git
kubectl apply -f kube-state-metrics/examples/standard

Query Prometheus for stalled Pods:

kube_pod_status_phase{phase="Pending"} > 0

If a Pod is Pending, it might be stuck due to insufficient CPU or a missing PVC. This metric saved me once when a misconfigured StorageClass left Pods hanging - Kube-State-Metrics flagged the issue in seconds.

Diagnostic Techniques

Diagnostic techniques uncover the root causes of cluster issues, from application crashes to network failures. Kubernetes offers built-in commands to inspect resources, supplemented by advanced tools for deeper analysis. Here's how to approach common problems with practical steps.

Checking Pod Logs

Logs are your first stop for application issues. To view a Pod's container logs:

kubectl logs pod-name

If a Pod has multiple containers, specify one:

kubectl logs pod-name -c container-name

For example, if a web app returns 500 errors, logs might show a database connection failure like Connection refused: db-service. If logs are empty, the app might not be logging to stdout/stderr - consider adding a sidecar like Fluentd to capture output. Tail logs for real-time debugging:

kubectl logs -f pod-name
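If a container has crashed and restarted, the current stream may be empty; the previous instance's logs usually hold the actual error:

kubectl logs pod-name --previous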

Inspecting Resources

When logs aren't enough, kubectl describe reveals detailed object states:

kubectl describe pod pod-name

This shows events, like ImagePullBackOff (failed image pull) or FailedAttachVolume (storage issue). For instance, ImagePullBackOff might mean a typo in the image tag or a private registry needing a Secret. Check node events for broader context:

kubectl describe node node-name

A node marked NotReady could indicate a kubelet crash or resource exhaustion.
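When it isn't obvious which object is failing, scanning recent events across namespaces in chronological order often points the way:

kubectl get events -A --sort-by=.metadata.creationTimestamp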

Diagnosing Network and DNS Issues

Network problems, like Pods failing to reach Services, are common culprits. Test DNS resolution inside a Pod:

kubectl exec -ti pod-name -- nslookup svc-name

If it fails, CoreDNS might be misconfigured. Check its logs:

kubectl logs -n kube-system -l k8s-app=kube-dns

For connectivity issues, ping a node or another Pod's IP from inside the Pod (Service ClusterIPs are virtual and usually don't answer ICMP ping, so probe a Service's port with a TCP client instead):

kubectl exec -ti pod-name -- ping pod-ip

A timeout might point to a CNI plugin issue (e.g., Calico misconfiguration). In one case, a missing NetworkPolicy blocked traffic to a Service - logs and kubectl describe helped trace it.
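If NetworkPolicies are in use, an explicit allow rule for the calling Pods usually resolves such blocks. A minimal sketch (labels and port are placeholders):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080

Remember that once any policy selects a Pod for ingress, all inbound traffic not explicitly allowed is dropped.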

Verifying RBAC and Security

RBAC misconfigurations can prevent actions, like a user unable to create Pods. Test permissions:

kubectl auth can-i create pods --as=user-name

A no response means a missing Role or RoleBinding. For security-related errors, note that SELinux and AppArmor denials land in the node's kernel and audit logs rather than the container's output, so inspect them on the affected node, for example:

dmesg | grep -i apparmor

If AppArmor blocks a file access, adjust the Pod's security profile.
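For the RBAC case above, granting the missing permission typically means adding a namespaced Role and RoleBinding. A minimal sketch (names, namespace, and user are placeholders):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-creator
  namespace: dev
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["create", "get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-creator-binding
  namespace: dev
subjects:
- kind: User
  name: user-name
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-creator
  apiGroup: rbac.authorization.k8s.io

Re-running the kubectl auth can-i check should then return yes.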

Auditing API Calls

API auditing tracks actions for forensic analysis. Configure an audit policy:

apiVersion: audit.k8s.io/v1  
kind: Policy  
rules:  
- level: Metadata  
  resources:  
  - group: ""  
    resources: ["pods"]

Apply it by passing --audit-policy-file=/etc/kubernetes/audit-policy.yaml to kube-apiserver, along with --audit-log-path=/var/log/kubernetes/audit.log to write the log. The audit log reveals failed API calls, like a Pod creation denied by RBAC. Check logs:

grep -i forbid /var/log/kubernetes/audit.log

This once helped me debug a misconfigured ServiceAccount that was blocking a CI pipeline.

Lens: Visual Troubleshooting

Lens, an open-source Kubernetes IDE, offers a graphical interface for cluster inspection. After installing locally and connecting via ~/.kube/config, Lens displays Pods, Services, and metrics in a dashboard. For example, a Pending Pod might show a SchedulingFailed event due to insufficient CPU - Lens highlights this instantly, saving time over kubectl describe. It's like having a control tower for your cluster, especially in multi-namespace setups.

Structured Logging for Advanced Analysis

Structured logging (e.g., JSON format) makes logs machine-readable, boosting analysis with tools like Loki. Configure an app to emit JSON logs, here via an app-specific LOG_FORMAT environment variable the application is assumed to honor:

apiVersion: v1  
kind: Pod  
metadata:  
  name: app  
spec:  
  containers:  
  - name: app  
    image: myapp  
    env:  
    - name: LOG_FORMAT  
      value: json

Query errors with Loki's LogCLI:

logcli query '{app="myapp"} | json | level="error"'

This filters error logs, revealing patterns like timeout errors from a misconfigured database. Structured logging transforms raw logs into actionable data, speeding up diagnostics.

Kubernetes deployments and troubleshooting practices empower seamless application updates and rapid issue resolution. Tools like Helm and Prometheus, paired with modern strategies, optimize cluster performance. These techniques ensure robust, scalable systems in production environments.
