Troubleshooting applications as a DevOps engineer requires a systematic and multi-layered approach across infrastructure, application, network, and CI/CD layers. Here's a step-by-step guide to help you troubleshoot effectively in real-world production environments.
✅ 1. Understand the Application Architecture
Identify whether it’s a monolith or microservices.
Know the tech stack: Java, Node.js, Python, .NET, etc.
Understand dependencies: databases, caches, message queues, APIs.
✅ 2. Gather Information First
Ask these questions:
What is the issue? (Slow, crashed, not responding)
When did it start?
Is it affecting one instance/user or all?
Was there any recent deployment or infra change?
✅ 3. Check Monitoring Tools
Use:
CloudWatch, Prometheus, Grafana, Datadog, New Relic
→ Look for CPU spikes, memory leaks, disk usage, response time, and error rates
Example: if you use Prometheus + Grafana, check for:
- CPU > 90%
- Memory > 80%
- Response time > 2s
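The thresholds above can be turned into a small shell check. This is a minimal sketch, not a replacement for alert rules: `check_threshold` is a hypothetical helper, and the readings are assumed to be values you have already pulled from your monitoring system.

```shell
# Hypothetical helper: compare one metric reading against a limit.
# Usage: check_threshold <name> <value> <limit>
check_threshold() {
  metric_name=$1
  metric_value=$2
  metric_limit=$3
  # awk does the comparison so decimal readings (e.g. 2.5 seconds) work too
  if awk -v v="$metric_value" -v l="$metric_limit" 'BEGIN { exit !(v > l) }'; then
    echo "ALERT: $metric_name=$metric_value exceeds $metric_limit"
    return 1
  fi
  echo "OK: $metric_name=$metric_value within $metric_limit"
}
```

For example, `check_threshold cpu_percent 95 90` prints an ALERT line and returns a non-zero status, so it can gate a script step.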
✅ 4. Check Application Logs
Use ELK, EFK, or direct file access:
/var/log/app.log, /var/log/syslog, or container logs.
For a systemd-managed service:
journalctl -u myapp.service
Docker logs:
docker logs <container-name>
Kubernetes logs:
kubectl logs <pod-name> -n <namespace>
Look for:
Stack traces
500, 502, 503 errors
Database timeout or connection errors
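A quick way to look for these signals in a plain log file is to count matching lines with grep. The sketch below assumes a text log where HTTP status codes appear as standalone numbers; `scan_log` and the patterns are illustrative and will need adjusting to your log format.

```shell
# Hypothetical helper: summarize common failure signals in a log file.
# Usage: scan_log <path-to-log>
scan_log() {
  log_file=$1
  # grep -c prints the number of matching lines; "|| true" keeps a zero
  # count from being treated as a failure
  http_5xx=$(grep -cE '(^|[^0-9])50[023]([^0-9]|$)' "$log_file" || true)
  traces=$(grep -cE 'Traceback|Exception|^[[:space:]]+at ' "$log_file" || true)
  db_errors=$(grep -ciE 'timeout|connection (refused|reset)|too many connections' "$log_file" || true)
  echo "5xx=$http_5xx stack_traces=$traces db_errors=$db_errors"
}
```

Run it as `scan_log /var/log/app.log`; container logs can be saved to a file first and scanned the same way.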
✅ 5. Check Server Health
top
htop
free -m
df -h
uptime
Look for:
High CPU
Out-of-memory
Disk full (especially /var, /tmp, /logs)
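The disk check is easy to script. The sketch below assumes the POSIX `df -P` column layout (usage in column 5, mount point in column 6) and mount points without spaces; `disks_over` is a hypothetical name.

```shell
# Hypothetical helper: list filesystems at or above a usage threshold.
# Reads `df -P` output on stdin, so it also works on saved output.
# Usage: df -P | disks_over 85
disks_over() {
  usage_threshold=$1
  awk -v t="$usage_threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)                 # "91%" -> 91
    if (use + 0 >= t) print $6 " " use "%"
  }'
}
```

Reading from stdin is a deliberate choice: the same function can be pointed at `df -P` output captured from another host during an incident.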
✅ 6. Check Network/Port Issues
Is the app listening?
ss -tuln | grep 8080
Is the port reachable from another server?
telnet <host> 8080
curl http://<host>:8080/health
ping <host>
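The curl check can be wrapped so that it returns a clear pass or fail. This is a sketch under assumptions: `check_health` is a hypothetical helper, the app is assumed to expose a `/health` endpoint that answers with HTTP 2xx when healthy, and the 5-second timeout is arbitrary.

```shell
# Hypothetical helper: succeed only if the URL answers with HTTP 2xx.
# Usage: check_health http://<host>:8080/health
check_health() {
  health_url=$1
  # -s silent, -o discard the body, -w print only the status code,
  # --max-time bounds the total wait
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$health_url") || code=000
  case $code in
    2??) echo "healthy ($code)"; return 0 ;;
    000) echo "unreachable"; return 1 ;;
    *)   echo "unhealthy ($code)"; return 1 ;;
  esac
}
```

An "unreachable" result points at the network or the listener, while "unhealthy" means something answered but reported a problem.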
✅ 7. Verify Configuration
Check .env, YAML, and JSON configuration files.
For Kubernetes, check ConfigMaps and Secrets:
kubectl describe configmap <configmap-name> -n <namespace>
kubectl get secret <secret-name> -n <namespace> -o yaml
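Values under `data:` in a Secret's YAML are base64-encoded, so they need decoding before you can compare them with what the app expects. A minimal sketch, assuming a `base64` that accepts `-d` (GNU coreutils does); `decode_secret_value` is a hypothetical helper.

```shell
# Hypothetical helper: decode one base64 value copied from a Secret's data section.
# Usage: decode_secret_value <base64-string>
decode_secret_value() {
  printf '%s' "$1" | base64 -d
}
```

For instance, `decode_secret_value 'cGFzc3dvcmQ='` prints `password`. Avoid running this where terminal output is recorded or shared.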
✅ 8. Rollback or Restart
If the issue started after a deployment, roll back:
kubectl rollout undo deployment myapp -n production
Restart the app or pod (a pod managed by a Deployment is recreated after deletion):
systemctl restart myapp
docker restart <container-name>
kubectl delete pod <pod-name> -n production
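A rollback is safer when you look at the revision history first and wait for the rollout to settle afterwards. The sketch below prints that sequence and only executes it when `RUN=1` is set; the `rollback` wrapper and the 120-second timeout are assumptions, while the `kubectl rollout` subcommands are standard.

```shell
# Hypothetical wrapper: print the rollback sequence, or run it when RUN=1.
# Usage: rollback <deployment> <namespace>
rollback() {
  deployment=$1
  namespace=$2
  for cmd in \
    "kubectl rollout history deployment/$deployment -n $namespace" \
    "kubectl rollout undo deployment/$deployment -n $namespace" \
    "kubectl rollout status deployment/$deployment -n $namespace --timeout=120s"
  do
    echo "+ $cmd"
    # Word splitting on $cmd is intentional: it turns the string into a command
    if [ "${RUN:-0}" = "1" ]; then $cmd || return 1; fi
  done
}
```

Calling `rollback myapp production` without `RUN=1` acts as a dry run, which is useful for pasting the plan into an incident channel before acting.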
✅ 9. Validate CI/CD Pipelines
Check recent pipeline logs for build/test/deploy failures.
Was an image/tag updated?
Has any secret or token expired?
✅ 10. Review Recent Code Commits
Ask developers to review recent commits.
git log -p -n 3
Check for:
Unhandled exceptions
Misconfigurations
Incompatible dependency upgrades
✅ 11. Common Real Issues to Check
Issue Type | Symptoms | Quick Fix
Memory leak | App crashes periodically | Restart the service; analyze the heap
Disk full | App not logging | Clear logs or increase disk
DB connection failure | Timeouts; too many connections | Check the DB; increase pool size
Wrong env vars | App behaves unexpectedly | Fix config; restart
Load balancer issue | 504, 502 errors | Check backend health
DNS misconfigured | Name resolution failed | Fix DNS/host entries
✅ 12. Post-Mortem and Prevention
Write an RCA (Root Cause Analysis)
Add monitoring/alerts (if missing)
Fix pipeline/unit test gaps
Add health checks and circuit breakers