DevOps Do's and Don'ts: What I Wish Someone Told Me 10 Years Ago
It's 2 AM. Phone's ringing. Production's down. Users are mad. CEO's asking questions. And somewhere in your git history is a deployment that passed every test but somehow broke everything.
Yeah. Been there. More times than I'd like to admit.
I've been doing DevOps for about 10 years now, mostly as a freelancer. That means I've seen a LOT of companies, a LOT of different setups, and honestly... a LOT of the same mistakes over and over again.
This isn't some theoretical best practices guide. This is stuff I learned by screwing up, fixing it at 3 AM, and promising myself I'd never do it again. (Spoiler: I did it again. We all do.)
Let's Be Honest About What DevOps Actually Is
Before we get into it - DevOps isn't a job title. I know, I know, my LinkedIn says "DevOps Engineer" too. But technically it's supposed to be a culture where devs and ops work together instead of hating each other.
In reality though? Companies hire us to:
- Keep servers running
- Build CI/CD pipelines
- Deal with infrastructure
- Be the person who gets called at 2 AM
So let's talk about doing that without losing your mind.
DO: Automate Stuff (But Don't Go Crazy)
Here's where everyone gets it wrong. They read about DevOps, get excited, and try to automate EVERYTHING in week one.
I watched a startup spend three weeks building this insane auto-scaling Kubernetes setup with service mesh, observability, the whole nine yards. For an app with like... 50 users. Meanwhile their devs were still SSH-ing into servers to deploy code manually.
That's backwards.
What you should automate first:
Deployments - seriously, if you're still deploying manually, stop reading this and go set up GitHub Actions or something. Even a basic pipeline will save you hours every week.
Environment setup - if it takes half a day to spin up a dev environment, you're wasting time. Automate that.
Backups - this should honestly be #1. Automate your backups. Test your backups. I've seen too many "oh shit we need to restore" moments where the backups were broken.
Monitoring alerts - can't fix what you don't know is broken.
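The backup point above deserves a sketch. The key idea: a backup you haven't verified is not a backup. Paths and names here are placeholders, adapt them to whatever you actually back up.

```shell
# A minimal backup-and-verify sketch. The source and destination
# paths are hypothetical - swap in your own.
backup_and_verify() {
  local src="$1" out="$2"

  # create the archive
  tar czf "$out" -C "$(dirname "$src")" "$(basename "$src")" || return 1

  # verify it: if we can't even list the archive, it's corrupt
  tar tzf "$out" > /dev/null || { echo "backup at $out is corrupt" >&2; return 1; }

  echo "backup ok: $out"
}
```

Drop something like `backup_and_verify /var/lib/myapp /backups/myapp-$(date +%F).tar.gz` into a cron job and you've covered the most common failure mode: backups that silently stopped being readable.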
What can wait:
- That complex orchestration that runs once a month
- Edge cases that happen twice a year
- Stuff that keeps changing (wait till it stabilizes)
Real Talk
I helped that startup I mentioned earlier. We ripped out the fancy Kubernetes setup and went with:
- Docker Compose (yes, really)
- GitHub Actions for CI/CD
- Basic health checks
Deploy time went from 30 minutes to 3 minutes. Setup took 2 days instead of 3 weeks. They're doing fine. When they actually need to scale, we'll scale. But you don't need Kubernetes for 50 users.
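For reference, that "boring" setup is roughly a Compose file like this. The image name, port, and health endpoint are hypothetical, but the shape is the point: one file, readable by anyone on the team.

```yaml
# A sketch of the kind of setup that replaced Kubernetes here.
# Image, port, and /health endpoint are placeholders.
services:
  app:
    image: ghcr.io/example/app:latest
    ports:
      - "3000:3000"
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
```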
DON'T: Skip Monitoring Because "We'll Add It Later"
This is probably the mistake I see most often. "We'll add monitoring once we're live."
Then you go live. Everything SEEMS fine. Three days later you realize your database has been maxed out for 48 hours and you lost customer data.
Yeah.
Minimum viable monitoring:
Infrastructure stuff:
- CPU, RAM, disk space
- Network latency
- Is the service even up
Application stuff:
- Error rates
- How fast things respond
- Database connection pools
- Queue lengths if you use background jobs
Business stuff:
- User signups
- Payment success rate
- Whatever matters to your business
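If you're on Prometheus, the infrastructure basics above translate into a couple of alert rules. This is a sketch assuming node_exporter metric names; thresholds are examples, not gospel.

```yaml
# Minimum-viable alerting sketch (Prometheus rule file).
# Assumes node_exporter is scraped; tune thresholds to taste.
groups:
  - name: minimum-viable
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
      - alert: DiskAlmostFull
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} has less than 10% disk left"
```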
The Cron Job Blind Spot
Here's one that gets everyone: cron jobs.
People monitor their web servers, their databases, their APIs. But scheduled jobs? Those just run in the background, silent and unmonitored.
Until they don't run. And nobody notices for a week.
I've seen:
- Invoice jobs that stopped for a week (customers were NOT happy)
- Database backups failing for 3 days (found out when we needed to restore)
- Data sync breaking silently
The fix is stupid simple. Make your cron job ping a URL when it's done. If it doesn't ping within the expected time, get an alert.
That's literally it. One curl at the end of your script. I built CronGuard because I was tired of this exact problem. But there are other tools too. Just... monitor your cron jobs.
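Here's what that looks like in practice: a tiny wrapper that runs your job and only pings the heartbeat URL on success. The URL is whatever your monitor (CronGuard, or any heartbeat service) gives you.

```shell
# Run a job, ping a heartbeat URL only if it succeeded.
# The URL is whatever your cron monitor hands you.
run_with_ping() {
  local url="$1"; shift
  if "$@"; then
    curl -fsS --retry 3 "$url" > /dev/null
    echo "job ok, pinged"
  else
    echo "job failed, not pinging" >&2
    return 1
  fi
}
```

Then your crontab calls a script that does `run_with_ping "https://ping.example.com/abc123" /usr/local/bin/send-invoices`. No ping within the expected window means the monitor alerts you, whether the job failed or never ran at all.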
DO: Put Everything in Git
If it's not in git, it doesn't exist.
I mean it. Everything should be reproducible from your repo.
Infrastructure as code:
# Don't click buttons in AWS console
# Write Terraform instead
resource "aws_instance" "app" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.micro"

  tags = {
    Name = "production-app"
  }
}
Configuration management:
# Don't SSH in and edit nginx.conf
# Use Ansible
- name: Configure Nginx
  template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
  notify: restart nginx
Database migrations:
# Don't run random SQL scripts
# Use proper migrations
npx prisma migrate deploy
Why This Actually Matters
Client of mine had a production server die last year. Hardware failure. Completely dead.
They had backups. They had monitoring. What they DIDN'T have was any record of how that server was configured.
Two years of manual tweaks. Undocumented changes. Some guy who left 6 months ago had set up "some stuff" but nobody knew what.
Took 3 days to get back online instead of 3 hours.
If everything had been in git - Terraform configs, Ansible playbooks, whatever - they could've rebuilt that server in an hour.
DON'T: Treat Production Like a Playground
"I'll just quickly fix this in prod..."
Famous last words.
I've said them. You've probably said them. We've all said them. Sometimes it works out. Sometimes you take down production for 4 hours.
When something breaks in prod:
- Is it on fire RIGHT NOW? If yes, do what you gotta do. If no, keep reading.
- Can you reproduce it in staging?
- Fix it in a branch
- Test it
- Code review
- Deploy through your normal pipeline
- Verify it worked
- Write down what happened so it doesn't happen again
For actual emergencies:
Yeah okay sometimes you DO need to hotfix prod. When you do:
- Write down EXACTLY what you changed
- Tell your team what you did
- Create a ticket to do it properly later
- Add a test so it can't happen again
Feature Flags Are Your Friend
Want to deploy without the stress? Feature flags.
if (featureFlags.newCheckout) {
  return <NewCheckoutFlow />
} else {
  return <OldCheckoutFlow />
}
Deploy the new code. Turn it on for yourself. Then a few internal users. Then 10% of traffic. Then everyone.
Something breaks? Flip the switch back. No emergency deploys. No 3 AM rollbacks. Just... flip it off.
DO: Have a Rollback Plan (And Actually Test It)
Every deployment should be reversible. Every single one.
Database migrations need down migrations:
-- Going up
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;

-- Going back down
ALTER TABLE users DROP COLUMN email_verified;
Keep your old Docker images. Tag releases properly. Document how to roll back.
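A rollback can be as boring as "start the previous image." Here's a sketch that assumes deploys append the image tag to a log file; the log path, container name, and registry are all placeholders.

```shell
# Find the release to roll back to: the tag before the current one
# in a log that deploys append to (newest tag on the last line).
previous_release() {
  tail -n 2 "$1" | head -n 1
}

rollback() {
  local log="${1:-/var/lib/myapp/releases}" prev
  prev=$(previous_release "$log")
  [ -n "$prev" ] || { echo "no previous release found" >&2; return 1; }
  echo "rolling back to ${prev}"
  # hypothetical redeploy - swap in your own deploy mechanism
  docker stop myapp && docker rm myapp
  docker run -d --name myapp "registry.example.com/myapp:${prev}"
}
```

The exact commands matter less than the fact that they're written down, versioned, and fast to run at 3 AM.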
And here's the thing - TEST YOUR ROLLBACKS.
Don't wait till production is on fire to find out your rollback procedure doesn't work. Do it in staging. Time how long it takes. Fix the slow parts.
We do a quarterly drill where we deploy something, intentionally break it, and practice rolling back. Sounds paranoid but it's saved us multiple times.
DON'T: Skip Security Because It's Annoying
"We'll fix the security issues after launch."
No you won't.
Security isn't a feature you add later. It's a requirement.
Don't do this:
# Secrets in git
DATABASE_URL=postgresql://admin:password123@prod-db.com/myapp
Do this:
# Environment variables
DATABASE_URL=${DATABASE_URL}
Use AWS Secrets Manager, HashiCorp Vault, or at minimum just environment variables. But get secrets out of your code.
Don't give everything admin access:
// Bad
{
  "Effect": "Allow",
  "Action": "*",
  "Resource": "*"
}

// Good
{
  "Effect": "Allow",
  "Action": ["s3:GetObject", "s3:PutObject"],
  "Resource": "arn:aws:s3:::my-bucket/*"
}
Least privilege. Always.
Use HTTPS everywhere. Even in dev. mkcert makes it easy to get trusted local certificates. In production, Let's Encrypt is free.
And monitor your SSL certificates so they don't expire. (Yeah I built CertGuard for this too. Expired certificates are embarrassing.)
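If you'd rather roll your own expiry check, openssl can do it in one call. This sketch checks a certificate file on disk; the path is a placeholder, and the alerting hook is up to you.

```shell
# Returns 0 (true) if the certificate expires within N days.
# openssl's -checkend takes seconds and exits 0 when the cert is
# still valid past that window, so we invert it.
cert_expiring_soon() {
  local pem="$1" days="${2:-14}"
  ! openssl x509 -in "$pem" -noout -checkend $(( days * 86400 )) > /dev/null
}
```

Wire it into cron: `cert_expiring_soon /etc/ssl/certs/myapp.pem 14 && send_alert "cert expiring soon"` (where `send_alert` is whatever notification you already use).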
Rate limit your APIs. Even internal ones. You'll thank me later.
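In nginx, basic rate limiting is a few lines. This is a sketch: 10 requests per second per client IP with a small burst allowance. The zone name and upstream are placeholders.

```nginx
# 10 req/s per client IP, with a burst buffer of 20 requests.
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

server {
    location /api/ {
        limit_req zone=api_limit burst=20 nodelay;
        proxy_pass http://app_backend;
    }
}
```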
Don't run containers as root. Just don't.
USER node
DO: Write Things Down
"The code is self-documenting" is a lie we tell ourselves.
Future you (or the poor person who replaces you) needs to understand how this works.
Document:
- How services talk to each other
- Database schemas
- External dependencies
- How to deploy
- How to rollback
- How to debug common issues
- Where the logs are
Keep it close to the code. README files work great. Architecture Decision Records are good too. Just... write it down.
And keep it updated. Outdated docs are worse than no docs.
DON'T: Build For Scale You Don't Have
"We need Kubernetes from day one."
No you don't.
Most apps don't need Kubernetes. Most databases don't need sharding. Most APIs don't need Redis caching.
Scale when you have data showing it's actually a problem.
Not when you THINK it might be a problem someday. When your monitoring shows:
- Response times above your SLA
- Database queries timing out
- Servers hitting resource limits
Start simple:
- One server handles 10k users easily
- SQLite handles millions of rows
- You don't need a CDN till you have global users
I've watched startups spend 6 months building distributed systems for apps that never got past 100 users. That time could've been spent building features people actually wanted.
DO: Break Things (In Staging)
Staging should be where things break safely.
Kill services randomly. Simulate slow databases. Fill up the disk. Cut network connections.
See what breaks. Fix it. Repeat.
You don't need Netflix's fancy Chaos Monkey. Just manually break stuff and see what happens.
DON'T: Ignore Technical Debt
"We'll refactor this later."
You won't.
That hacky fix becomes critical infrastructure. That "temporary" workaround is still there 2 years later.
Managing tech debt:
- Create tickets for known issues
- Estimate the actual cost (time + risk)
- Dedicate 20% of your sprint to fixing it
- Don't let it pile up
And explain to stakeholders WHY you're "not building features." Show them the cost of NOT fixing the debt.
What It All Comes Down To
Strip away all the fancy tools and methodologies and it's really just:
- Automate the painful stuff
- Monitor what matters
- Make it reproducible
- Plan for failure
- Secure by default
- Write down the why
- Scale when needed (not before)
- Pay down your debt
Final Thoughts
DevOps isn't about having the coolest tools or the most complex setup. It's about:
- Shipping faster
- Breaking less
- Recovering quicker when you do break
- Actually sleeping at night
Start small. Automate ONE thing this week. Add ONE monitor. Write ONE runbook. Ship ONE improvement.
Do that every week and in a year you'll be way ahead of teams still "planning their DevOps transformation."
Trust me. I've been doing this for 10 years and I'm still learning. We all are. That's kind of the point.
About me: I'm Jean-Pierre, a freelance DevOps engineer in the Netherlands. I help teams build infrastructure that doesn't suck. More stuff on my blog at FreelyIT.nl.
Got DevOps war stories? Drop them in the comments. I want to hear what you've learned the hard way too.