Srinivasaraju Tangella

How DevOps Engineers Troubleshoot Applications in Production: Tools, Tips & Examples

Troubleshooting applications as a DevOps engineer requires a systematic and multi-layered approach across infrastructure, application, network, and CI/CD layers. Here's a step-by-step guide to help you troubleshoot effectively in real-world production environments.

✅ 1. Understand the Application Architecture

Identify whether it’s a monolith or microservices.

Know the tech stack: Java, Node.js, Python, .NET, etc.

Understand dependencies: databases, caches, message queues, APIs.

✅ 2. Gather Information First

Ask these questions:

What is the issue? (Slow, crashed, not responding)

When did it start?

Is it affecting one instance/user or all?

Was there any recent deployment or infra change?

✅ 3. Check Monitoring Tools

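Use a monitoring tool such as:

CloudWatch, Prometheus, Grafana, Datadog, New Relic
→ Look for CPU spikes, steadily growing memory (a common sign of a leak), disk usage, response time, and error rates

Example:

With Prometheus + Grafana, check for values beyond thresholds such as these (illustrative numbers; tune them to your service):

  • CPU > 90%
  • Memory > 80%
  • Response time > 2s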
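The thresholds above can be turned into a quick pass/fail check. This is only a minimal sketch: the function name `check_threshold` and the sample readings are hypothetical, and in a real setup the values would come from your monitoring system's queries or alert rules rather than being typed by hand.

```shell
#!/bin/sh
# check_threshold NAME VALUE LIMIT  (hypothetical helper, not a standard tool)
# Prints ALERT if VALUE is strictly greater than LIMIT, otherwise OK.
# awk does the comparison so fractional values (e.g. 2.4 seconds) work.
check_threshold() {
  awk -v name="$1" -v value="$2" -v limit="$3" 'BEGIN {
    if (value + 0 > limit + 0)
      printf "ALERT: %s=%s exceeds %s\n", name, value, limit
    else
      printf "OK: %s=%s within %s\n", name, value, limit
  }'
}

# Made-up readings checked against the example thresholds above.
check_threshold cpu_percent 95 90
check_threshold memory_percent 62 80
check_threshold response_seconds 2.4 2
```

A single reading over the limit is rarely enough to act on; in practice you would look at how long the value has stayed above the threshold.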

✅ 4. Check Application Logs

Use ELK, EFK, or direct file access:
/var/log/app.log, /var/log/syslog, or container logs.

For a systemd-managed service:

journalctl -u myapp.service

Docker logs:

docker logs <container>

Kubernetes logs:

kubectl logs <pod-name> -n <namespace>

Look for:

Stack traces

500, 502, 503 errors

Database timeout or connection errors
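When a log file is too long to read by eye, counting these three signals gives a quick first impression. The sketch below assumes a plain-text log where stack frames start with indented `at ` (Java style) or a Python `Traceback` line, and where HTTP status codes appear surrounded by spaces, as in common access-log formats. The function name `summarize_log` and all three patterns are assumptions to adapt to your own log format.

```shell
#!/bin/sh
# summarize_log FILE  (hypothetical helper)
# Counts the lines in FILE that match each signal listed above.
# grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
summarize_log() {
  file=$1
  # Indented "at ..." frames (Java style) or the start of a Python traceback.
  traces=$(grep -cE '^[[:space:]]+at |Traceback \(most recent call last\)' "$file") || true
  # A 500/502/503 status code with a space on both sides.
  http5xx=$(grep -cE ' (500|502|503) ' "$file") || true
  # Case-insensitive match on common timeout and connection messages.
  connerrs=$(grep -ciE 'connection (refused|timed out|reset)|too many connections|timeout' "$file") || true
  printf 'stack_trace_lines=%s http_5xx=%s timeout_or_conn_errors=%s\n' \
    "$traces" "$http5xx" "$connerrs"
}

# Usage:
#   summarize_log /var/log/app.log
```

For container logs, the output of `docker logs` or `kubectl logs` can be redirected to a file first and passed to the same function.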

✅ 5. Check Server Health

top
htop
free -m
df -h
uptime

Look for:

High CPU

Out-of-memory

Disk full (especially /var, /tmp, /logs)
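To spot a nearly full filesystem without scanning the whole `df` table, the output can be filtered by usage percentage. This is a sketch under a few assumptions: it reads the POSIX-format output of `df -P` (one line per filesystem, usage in the fifth column), the function name `flag_full_disks` is made up for illustration, and mount points containing spaces are not handled.

```shell
#!/bin/sh
# flag_full_disks LIMIT  (hypothetical helper)
# Reads `df -P` output on stdin and prints every filesystem whose
# usage percentage is at or above LIMIT.
flag_full_disks() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)            # "91%" -> "91"
    if (use + 0 >= limit + 0)
      printf "%s is %s%% full (mounted on %s)\n", $1, use, $6
  }'
}

# Typical use on a live host:
#   df -P | flag_full_disks 90
```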

✅ 6. Check Network/Port Issues

Is the app listening?

ss -tuln | grep 8080

Is the port reachable from another server?

telnet <host> 8080
curl http://<host>:8080/health
ping <host>
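One caveat with `ss -tuln | grep 8080`: it also matches port 18080, or 8080 appearing in the peer-address column. A stricter check compares only the port at the end of the local address. The sketch below assumes the column layout that `ss -tuln` prints (local address in the fifth column); the function name `is_listening` is hypothetical.

```shell
#!/bin/sh
# is_listening PORT  (hypothetical helper)
# Reads `ss -tuln` output on stdin and succeeds only if some socket's
# local address ends in ":PORT".
is_listening() {
  awk -v port="$1" '
    NR > 1 {
      n = split($5, parts, ":")    # handles 0.0.0.0:8080 and [::]:8080
      if (parts[n] == port) found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}

# Typical use on a live host:
#   ss -tuln | is_listening 8080 && echo "listening" || echo "not listening"
```

If this reports the port as listening but `curl` from another server still fails, the problem is more likely a firewall, security group, or bind address (for example 127.0.0.1 instead of 0.0.0.0) than the application itself.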

✅ 7. Verify Configuration

Check .env, YAML, JSON, or config maps.

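For Kubernetes, inspect ConfigMaps and Secrets (values in a Secret's YAML output are base64-encoded):

kubectl describe configmap <configmap-name> -n <namespace>
kubectl get secret <secret-name> -n <namespace> -o yaml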
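A missing or empty variable is one of the most common configuration faults, and it is easy to check mechanically. The following is a minimal sketch for `.env`-style files. The function name `missing_env_keys` and the key names in the usage line are hypothetical, and keys are assumed to be plain identifiers such as `DB_HOST`.

```shell
#!/bin/sh
# missing_env_keys FILE KEY...  (hypothetical helper)
# Prints each KEY that has no "KEY=value" line in FILE; an empty value
# or a commented-out line counts as missing.
# Returns non-zero if anything is missing.
missing_env_keys() {
  file=$1
  shift
  rc=0
  for key in "$@"; do
    # Optional "export " prefix, then KEY= followed by at least one character.
    if ! grep -Eq "^(export[[:space:]]+)?${key}=.+" "$file"; then
      echo "$key"
      rc=1
    fi
  done
  return $rc
}

# Usage (key names are examples only):
#   missing_env_keys .env DB_HOST DB_PASSWORD API_URL
```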

✅ 8. Rollback or Restart

If the issue started after a deployment, roll back:

kubectl rollout undo deployment myapp -n production

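Restart the app or pod (a pod managed by a Deployment is recreated automatically after it is deleted):

systemctl restart myapp
docker restart <container>
kubectl delete pod <pod-name> -n production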
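A rollback is only finished once the previous revision is actually running, so it helps to follow `kubectl rollout undo` with `kubectl rollout status`. The sketch below wraps both steps. The helper names `run` and `rollback_and_verify`, the `DRY_RUN` switch, and the 120-second timeout are illustrative choices rather than anything prescribed by kubectl.

```shell
#!/bin/sh
# run CMD...  (hypothetical helper)
# Executes the command, or only prints it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# rollback_and_verify DEPLOYMENT NAMESPACE  (hypothetical helper)
# Undoes the latest rollout, then waits until the restored revision is
# fully rolled out or the timeout expires.
rollback_and_verify() {
  run kubectl rollout undo "deployment/$1" -n "$2" &&
    run kubectl rollout status "deployment/$1" -n "$2" --timeout=120s
}

# Preview the commands without touching the cluster:
#   DRY_RUN=1 rollback_and_verify myapp production
```

Running with `DRY_RUN=1` first is a cheap way to confirm the deployment name and namespace before changing anything in production.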

✅ 9. Validate CI/CD Pipelines

Check recent pipeline logs for build/test/deploy failures.

Was an image or tag updated?

Have any secrets or tokens expired?
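To answer the image/tag question, one option is to compare the tag the cluster is running with the tag the pipeline reports having built. The sketch below only handles the string comparison. The function names `image_tag` and `same_image_tag` and the image references are hypothetical, and it assumes the usual `registry/path/name:tag` or `name@digest` forms.

```shell
#!/bin/sh
# image_tag IMAGE_REF  (hypothetical helper)
# Prints the tag (or digest) part of a container image reference.
image_tag() {
  ref=${1##*/}                   # drop registry and path, so a registry
                                 # port (host:5000/app) is not read as a tag
  case "$ref" in
    *@*) echo "${ref#*@}" ;;     # pinned by digest: app@sha256:...
    *:*) echo "${ref##*:}" ;;    # explicit tag: app:1.4.2
    *)   echo "latest" ;;        # no tag defaults to "latest"
  esac
}

# same_image_tag RUNNING EXPECTED  (hypothetical helper)
# Succeeds when both references carry the same tag.
same_image_tag() {
  [ "$(image_tag "$1")" = "$(image_tag "$2")" ]
}

# The running image can be read from the cluster, for example:
#   kubectl get deployment myapp -n production \
#     -o jsonpath='{.spec.template.spec.containers[0].image}'
```

A matching tag does not prove the content is the same if tags such as `latest` are reused; comparing digests is the stricter check.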

✅ 10. Review Recent Code Commits

Ask developers to review recent commits.

git log -p -n 3

Check for:

Unhandled exceptions

Misconfigurations

Incompatible dependency upgrades

✅ 11. Common Real Issues to Check

| Issue Type | Symptoms | Quick Fix |
|---|---|---|
| Memory Leak | App crashes periodically | Restart service, analyze heap |
| Disk Full | App not logging | Clear logs or increase disk |
| DB Connection Failure | Timeout, too many connections | Check DB, increase pool size |
| Wrong ENV Vars | App behaves unexpectedly | Fix config, restart |
| Load Balancer Issue | 504, 502 errors | Check backend health |
| DNS Misconfigured | Name resolution failed | Fix DNS/host entries |

✅ 12. Post-Mortem and Prevention

Write an RCA (Root Cause Analysis)

Add monitoring/alerts (if missing)

Fix pipeline/unit test gaps

Add health checks and circuit breakers
