DevOps Do's and Don'ts: What I Wish Someone Told Me 10 Years Ago
It's 2 AM. Phone's ringing. Production's down. Users are mad. CEO's asking questions. And somewhere in your git history is a deployment that passed every test but somehow broke everything.
Yeah. Been there. More times than I'd like to admit.
I've been doing DevOps for about 10 years now, mostly as a freelancer. That means I've seen a LOT of companies, a LOT of different setups, and honestly... a LOT of the same mistakes over and over again.
This isn't some theoretical best practices guide. This is stuff I learned by screwing up, fixing it at 3 AM, and promising myself I'd never do it again. (Spoiler: I did it again. We all do.)
Let's Be Honest About What DevOps Actually Is
Before we get into it - DevOps isn't a job title. I know, I know, my LinkedIn says "DevOps Engineer" too. But technically it's supposed to be a culture where devs and ops work together instead of hating each other.
In reality though? Companies hire us to:
- Keep servers running
- Build CI/CD pipelines
- Deal with infrastructure
- Be the person who gets called at 2 AM
So let's talk about doing that without losing your mind.
DO: Automate Stuff (But Don't Go Crazy)
Here's where everyone gets it wrong. They read about DevOps, get excited, and try to automate EVERYTHING in week one.
I watched a startup spend three weeks building this insane auto-scaling Kubernetes setup with service mesh, observability, the whole nine yards. For an app with like... 50 users. Meanwhile their devs were still SSH-ing into servers to deploy code manually.
That's backwards.
What you should automate first:
Deployments - seriously, if you're still deploying manually, stop reading this and go set up GitHub Actions or something. Even a basic pipeline will save you hours every week.
Environment setup - if it takes half a day to spin up a dev environment, you're wasting time. Automate that.
Backups - this should honestly be #1. Automate your backups. Test your backups. I've seen too many "oh shit we need to restore" moments where the backups were broken.
Monitoring alerts - can't fix what you don't know is broken.
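The backup point above deserves a sketch. The key idea: a backup you haven't verified is not a backup. Paths and names here are placeholders, adapt them to whatever you actually back up.

```shell
# A minimal backup-and-verify sketch. The source and destination
# paths are hypothetical - swap in your own.
backup_and_verify() {
  local src="$1" out="$2"

  # create the archive
  tar czf "$out" -C "$(dirname "$src")" "$(basename "$src")" || return 1

  # verify it: if we can't even list the archive, it's corrupt
  tar tzf "$out" > /dev/null || { echo "backup at $out is corrupt" >&2; return 1; }

  echo "backup ok: $out"
}
```

Drop something like `backup_and_verify /var/lib/myapp /backups/myapp-$(date +%F).tar.gz` into a cron job and you've covered the most common failure mode: backups that silently stopped being readable.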
What can wait:
- That complex orchestration that runs once a month
- Edge cases that happen twice a year
- Stuff that keeps changing (wait till it stabilizes)
Real Talk
I helped that startup I mentioned earlier. We ripped out the fancy Kubernetes setup and went with:
- Docker Compose (yes, really)
- GitHub Actions for CI/CD
- Basic health checks
Deploy time went from 30 minutes to 3 minutes. Setup took 2 days instead of 3 weeks. They're doing fine. When they actually need to scale, we'll scale. But you don't need Kubernetes for 50 users.
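For reference, that "boring" setup is roughly a Compose file like this. The image name, port, and health endpoint are hypothetical, but the shape is the point: one file, readable by anyone on the team.

```yaml
# A sketch of the kind of setup that replaced Kubernetes here.
# Image, port, and /health endpoint are placeholders.
services:
  app:
    image: ghcr.io/example/app:latest
    ports:
      - "3000:3000"
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
```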
DON'T: Skip Monitoring Because "We'll Add It Later"
This is probably the mistake I see most often. "We'll add monitoring once we're live."
Then you go live. Everything SEEMS fine. Three days later you realize your database has been maxed out for 48 hours and you lost customer data.
Yeah.
Minimum viable monitoring:
Infrastructure stuff:
- CPU, RAM, disk space
- Network latency
- Is the service even up
Application stuff:
- Error rates
- How fast things respond
- Database connection pools
- Queue lengths if you use background jobs
Business stuff:
- User signups
- Payment success rate
- Whatever matters to your business
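If you're on Prometheus, the infrastructure basics above translate into a couple of alert rules. This is a sketch assuming node_exporter metric names; thresholds are examples, not gospel.

```yaml
# Minimum-viable alerting sketch (Prometheus rule file).
# Assumes node_exporter is scraped; tune thresholds to taste.
groups:
  - name: minimum-viable
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
      - alert: DiskAlmostFull
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} has less than 10% disk left"
```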
The Cron Job Blind Spot
Here's one that gets everyone: cron jobs.
People monitor their web servers, their databases, their APIs. But scheduled jobs? Those just run in the background, silent and unmonitored.
Until they don't run. And nobody notices for a week.
I've seen:
- Invoice jobs that stopped for a week (customers were NOT happy)
- Database backups failing for 3 days (found out when we needed to restore)
- Data sync breaking silently
The fix is stupid simple. Make your cron job ping a URL when it's done. If it doesn't ping within the expected time, get an alert.
That's literally it. One curl at the end of your script. I built CronGuard because I was tired of this exact problem. But there are other tools too. Just... monitor your cron jobs.
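Here's what that looks like in practice: a tiny wrapper that runs your job and only pings the heartbeat URL on success. The URL is whatever your monitor (CronGuard, or any heartbeat service) gives you.

```shell
# Run a job, ping a heartbeat URL only if it succeeded.
# The URL is whatever your cron monitor hands you.
run_with_ping() {
  local url="$1"; shift
  if "$@"; then
    curl -fsS --retry 3 "$url" > /dev/null
    echo "job ok, pinged"
  else
    echo "job failed, not pinging" >&2
    return 1
  fi
}
```

Then your crontab calls a script that does `run_with_ping "https://ping.example.com/abc123" /usr/local/bin/send-invoices`. No ping within the expected window means the monitor alerts you, whether the job failed or never ran at all.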
DO: Put Everything in Git
If it's not in git, it doesn't exist.
I mean it. Everything should be reproducible from your repo.
Infrastructure as code:
# Don't click buttons in AWS console
# Write Terraform instead
resource "aws_instance" "app" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.micro"

  tags = {
    Name = "production-app"
  }
}
Configuration management:
# Don't SSH in and edit nginx.conf
# Use Ansible
- name: Configure Nginx
  template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
  notify: restart nginx
Database migrations:
# Don't run random SQL scripts
# Use proper migrations
npx prisma migrate deploy
Why This Actually Matters
Client of mine had a production server die last year. Hardware failure. Completely dead.
They had backups. They had monitoring. What they DIDN'T have was any record of how that server was configured.
Two years of manual tweaks. Undocumented changes. Some guy who left 6 months ago had set up "some stuff" but nobody knew what.
Took 3 days to get back online instead of 3 hours.
If everything had been in git - Terraform configs, Ansible playbooks, whatever - they could've rebuilt that server in an hour.
DON'T: Treat Production Like a Playground
"I'll just quickly fix this in prod..."
Famous last words.
I've said them. You've probably said them. We've all said them. Sometimes it works out. Sometimes you take down production for 4 hours.
When something breaks in prod:
- Is it on fire RIGHT NOW? If yes, do what you gotta do. If no, keep reading.
- Can you reproduce it in staging?
- Fix it in a branch
- Test it
- Code review
- Deploy through your normal pipeline
- Verify it worked
- Write down what happened so it doesn't happen again
For actual emergencies:
Yeah okay sometimes you DO need to hotfix prod. When you do:
- Write down EXACTLY what you changed
- Tell your team what you did
- Create a ticket to do it properly later
- Add a test so it can't happen again
Feature Flags Are Your Friend
Want to deploy without the stress? Feature flags.
if (featureFlags.newCheckout) {
  return <NewCheckoutFlow />
} else {
  return <OldCheckoutFlow />
}
Deploy the new code. Turn it on for yourself. Then a few internal users. Then 10% of traffic. Then everyone.
Something breaks? Flip the switch back. No emergency deploys. No 3 AM rollbacks. Just... flip it off.
DO: Have a Rollback Plan (And Actually Test It)
Every deployment should be reversible. Every single one.
Database migrations need down migrations:
-- Going up
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;

-- Going back down
ALTER TABLE users DROP COLUMN email_verified;
Keep your old Docker images. Tag releases properly. Document how to roll back.
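A rollback can be as boring as "start the previous image." Here's a sketch that assumes deploys append the image tag to a log file; the log path, container name, and registry are all placeholders.

```shell
# Find the release to roll back to: the tag before the current one
# in a log that deploys append to (newest tag on the last line).
previous_release() {
  tail -n 2 "$1" | head -n 1
}

rollback() {
  local log="${1:-/var/lib/myapp/releases}" prev
  prev=$(previous_release "$log")
  [ -n "$prev" ] || { echo "no previous release found" >&2; return 1; }
  echo "rolling back to ${prev}"
  # hypothetical redeploy - swap in your own deploy mechanism
  docker stop myapp && docker rm myapp
  docker run -d --name myapp "registry.example.com/myapp:${prev}"
}
```

The exact commands matter less than the fact that they're written down, versioned, and fast to run at 3 AM.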
And here's the thing - TEST YOUR ROLLBACKS.
Don't wait till production is on fire to find out your rollback procedure doesn't work. Do it in staging. Time how long it takes. Fix the slow parts.
We do a quarterly drill where we deploy something, intentionally break it, and practice rolling back. Sounds paranoid but it's saved us multiple times.
DON'T: Skip Security Because It's Annoying
"We'll fix the security issues after launch."
No you won't.
Security isn't a feature you add later. It's a requirement.
Don't do this:
# Secrets in git
DATABASE_URL=postgresql://admin:password123@prod-db.com/myapp
Do this:
# Environment variables
DATABASE_URL=${DATABASE_URL}
Use AWS Secrets Manager, HashiCorp Vault, or at minimum just environment variables. But get secrets out of your code.
Don't give everything admin access:
// Bad
{
  "Effect": "Allow",
  "Action": "*",
  "Resource": "*"
}

// Good
{
  "Effect": "Allow",
  "Action": ["s3:GetObject", "s3:PutObject"],
  "Resource": "arn:aws:s3:::my-bucket/*"
}
Least privilege. Always.
Use HTTPS everywhere. Even in dev. mkcert makes it easy to get trusted local certificates. In production, Let's Encrypt is free.
And monitor your SSL certificates so they don't expire. (Yeah I built CertGuard for this too. Expired certificates are embarrassing.)
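If you'd rather roll your own expiry check, openssl can do it in one call. This sketch checks a certificate file on disk; the path is a placeholder, and the alerting hook is up to you.

```shell
# Returns 0 (true) if the certificate expires within N days.
# openssl's -checkend takes seconds and exits 0 when the cert is
# still valid past that window, so we invert it.
cert_expiring_soon() {
  local pem="$1" days="${2:-14}"
  ! openssl x509 -in "$pem" -noout -checkend $(( days * 86400 )) > /dev/null
}
```

Wire it into cron: `cert_expiring_soon /etc/ssl/certs/myapp.pem 14 && send_alert "cert expiring soon"` (where `send_alert` is whatever notification you already use).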
Rate limit your APIs. Even internal ones. You'll thank me later.
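In nginx, basic rate limiting is a few lines. This is a sketch: 10 requests per second per client IP with a small burst allowance. The zone name and upstream are placeholders.

```nginx
# 10 req/s per client IP, with a burst buffer of 20 requests.
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

server {
    location /api/ {
        limit_req zone=api_limit burst=20 nodelay;
        proxy_pass http://app_backend;
    }
}
```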
Don't run containers as root. Just don't.
USER node
DO: Write Things Down
"The code is self-documenting" is a lie we tell ourselves.
Future you (or the poor person who replaces you) needs to understand how this works.
Document:
- How services talk to each other
- Database schemas
- External dependencies
- How to deploy
- How to rollback
- How to debug common issues
- Where the logs are
Keep it close to the code. README files work great. Architecture Decision Records are good too. Just... write it down.
And keep it updated. Outdated docs are worse than no docs.
DON'T: Build For Scale You Don't Have
"We need Kubernetes from day one."
No you don't.
Most apps don't need Kubernetes. Most databases don't need sharding. Most APIs don't need Redis caching.
Scale when you have data showing it's actually a problem.
Not when you THINK it might be a problem someday. When your monitoring shows:
- Response times above your SLA
- Database queries timing out
- Servers hitting resource limits
Start simple:
- One server handles 10k users easily
- SQLite handles millions of rows
- You don't need a CDN till you have global users
I've watched startups spend 6 months building distributed systems for apps that never got past 100 users. That time could've been spent building features people actually wanted.
DO: Break Things (In Staging)
Staging should be where things break safely.
Kill services randomly. Simulate slow databases. Fill up the disk. Cut network connections.
See what breaks. Fix it. Repeat.
You don't need Netflix's fancy Chaos Monkey. Just manually break stuff and see what happens.
DON'T: Ignore Technical Debt
"We'll refactor this later."
You won't.
That hacky fix becomes critical infrastructure. That "temporary" workaround is still there 2 years later.
Managing tech debt:
- Create tickets for known issues
- Estimate the actual cost (time + risk)
- Dedicate 20% of your sprint to fixing it
- Don't let it pile up
And explain to stakeholders WHY you're "not building features." Show them the cost of NOT fixing the debt.
What It All Comes Down To
Strip away all the fancy tools and methodologies and it's really just:
- Automate the painful stuff
- Monitor what matters
- Make it reproducible
- Plan for failure
- Secure by default
- Write down the why
- Scale when needed (not before)
- Pay down your debt
Final Thoughts
DevOps isn't about having the coolest tools or the most complex setup. It's about:
- Shipping faster
- Breaking less
- Recovering quicker when you do break
- Actually sleeping at night
Start small. Automate ONE thing this week. Add ONE monitor. Write ONE runbook. Ship ONE improvement.
Do that every week and in a year you'll be way ahead of teams still "planning their DevOps transformation."
Trust me. I've been doing this for 10 years and I'm still learning. We all are. That's kind of the point.
About me: I'm Jean-Pierre, a freelance DevOps engineer in the Netherlands. I help teams build infrastructure that doesn't suck. More stuff on my blog at FreelyIT.nl.
Got DevOps war stories? Drop them in the comments. I want to hear what you've learned the hard way too.