Starting a new career is kinda like walking into a workshop full of tools you've never seen before. You know they're useful, maybe even powerful, but honestly... you're not totally sure what they do or whether you'll break something the moment you touch them. That's me right now. Everything feels new. And confusing. And fun. Sometimes all at once.
This article is part of a series where I'm basically thinking out loud about the stuff I'm learning. I want my future self to read these later and go like: "Oh wow, that's where I started." Or maybe he'll laugh at me. Either way, it's worth documenting.
Along the way I ran into a bunch of questions. Some I answered. Some I couldn't. Some probably don't even have a real answer. I'm gonna put them here anyway because they're part of the journey.
Step 1: Identify the Problem (aka "Why am I doing this manually??")
One of the things I'm struggling with is: how do I share operations with my future self? Suppose I want to set up a log-shipping client on a machine so it sends logs to external servers. How should I do it? How do I remember what I've done before? By digging through documentation or shell history? Seriously. And what about urgent situations: what happens if we lose a VM? How do I know what's even on these machines?
💰 The CEO/Manager ROI Corner
If you are a manager reading this, here is the math: Manual ops is a hidden tax. Every hour I spend SSH-ing into a box to fix a config is an hour I am not building features. Plus, manual changes mean human error, which means downtime risk. Automation isn't just "cool", it is risk management.
Coming back to the ops world after doing dev work, my first thought when I saw these GCP instances was: how do I get rid of the manual work?
Throughout my entire life as a software engineer, I've never been able to persuade myself to do repetitive work manually. I just... I can't. I hate it. If I have to run the same clicks and commands more than twice, I literally start to feel dizzy. It just feels wrong.
This is basically the DRY Principle (Don't Repeat Yourself) applied to infrastructure.
Every piece of knowledge must have a single, unambiguous, authoritative representation within a system
To save time in these kinds of situations the obvious answer is docs. Notion, Confluence, Google Docs, whatever. But... why? Just to tell other people (technical or not) what's installed? And how I set it up? Especially since not everyone can just ssh in and check.
That just doesn't work for me. We're software developers. We're supposed to like code, right? We live in it. And for me, running a command from a script is always safer and faster than typing it manually. The fact that someone else does this work manually isn't a reason I should do it too. So, in between my regular tasks, I started exploring my options.
Step 2: Exploring Options
The real answer, for me, has to involve Git. A repo.
My mind immediately went to Ansible. But I did pause for a second to think about the other players in the room.
Ansible is open-source, command-line IT automation software written in Python. It can configure systems, deploy software, and orchestrate advanced workflows to support application deployment, system updates, and more.
That's exactly what I need.
Here’s a quick mental matrix I went through:
| Tool | The Vibe | Why I picked (or didn't pick) it |
|---|---|---|
| Terraform | The "Industry Standard" for infra. Great for creating VMs, less great for configuring inside them. | Overkill for right now. I have the VMs; I need to configure the OS. The learning curve is steep when I just need to install Docker. |
| Pulumi | "Infrastructure as Code" but actual code (TS/Python). | Super cool, but adds complexity I don't need yet. I want simple config files, not a compile step. |
| Ansible | The "Old Reliable". Agentless, runs over SSH. | Winner. It uses YAML, handles server configuration perfectly, and I already know the basics. |
To be honest, it's just what I know. I used it years ago, so I'm familiar with it. My focus right now isn't on learning a brand-new provisioning tool; that just isn't aligned with my career goals at the moment.
Note: When you are a solo dev or a small team, Velocity > Perfection. Always.
So, good or not, I chose Ansible. It's my baseline. It'll tell me what I set up, and it'll help me install services in minutes instead of hours (like webhooks or configuring logs for Loki. Should that take hours? No. But the context-switching kills you). And we can place the configuration in code and commit it to a Git repo.
Implementation: OK, let's do this
So, I chose Ansible. Now. How do I use it to provision and configure machines in GCP?
This is where it got weird. The GCP infra is really different from what I'm used to. The network access is super abstract and hidden. The OS runs differently (it has agents! that configure things from the web console!).
And everything is about regions and zones. You've seen the docs:
Regions are independent geographic areas that consist of zones. A zone is a deployment area within a region.
Yada yada. Why do I care? Cost.
This stuff isn't free. If you're self-hosting, you just stack your servers in one datacenter. In GCP, if you have a machine in Europe and one in North America, you're paying for that data transfer (like $0.05 per GB).
Even external IPs cost money! A static IP in Belgium is $3.65/month. It's not much, but for 10 VMs? That adds up. And even if a machine has an external IP for connectivity, that address can be ephemeral: if the machine stops and starts (not just restarts), GCP assigns it a new external IP. Wanting an external IP doesn't mean I should reserve a static one.
This leads to a simple conclusion:
If you don't need an external IP, DON'T ASSIGN ONE. It's just wasted money.
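A note for future me: the flag that matters here is --no-address. Something like this (the instance name and zone are the same placeholders I use later in this post; machine type and image are whatever gcloud defaults to) creates a VM with no external IP at all:

```bash
# Hypothetical example: create a VM with no external IP assigned.
# "demo" and "my-project" are placeholders, not a real setup.
gcloud compute instances create demo \
  --zone=europe-north1-a \
  --project=my-project \
  --no-address
```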
Step 3: The Actual Problem
Okay, so if I follow my own advice and don't assign an external IP, how the hell does Ansible connect to the machine? Ansible needs an address in its inventory file. Where do I find one? Was I wrong to choose Ansible as my provisioning tool? I have Ansible; now how do I actually configure machines with it?
This is where the gcloud command comes in. GCP gives you this tool to manage everything. You can connect to a machine with it even without a public IP:
gcloud compute ssh demo --zone=europe-north1-a --project=my-project
It's basically just magic. It tunnels through or something. I don't care how right now, just that it works.
So, if gcloud can ssh, maybe Ansible can use gcloud instead of plain ssh?
The Solution: A Wrapper Script
The idea: we tell Ansible:
"When you want to ssh, run this script instead."
ansible.cfg:
[ssh_connection]
ssh_executable = gcp-ssh-wrapper.sh
Now we need to build gcp-ssh-wrapper.sh.
The script receives all the arguments Ansible passes to SSH and instead passes them to gcloud compute ssh.
The core looks like:
exec gcloud compute ssh "${opts[@]}" "${host}" \
--tunnel-through-iap \
--no-user-output-enabled \
--zone="${zone}" \
--project="${project}" \
-- -C "${cmd}"
To get the host and command:
host="${@: -2:1}"
cmd="${@: -1:1}"
To collect SSH options:
declare -a opts
for ssh_arg in "${@:1:$#-3}"; do
if [[ "${ssh_arg}" == --* ]]; then
opts+=("${ssh_arg}")
fi
done
But... wait. gcloud needs a zone and a project, and the wrapper doesn't get those from the arguments Ansible passes to ssh. So the script has to ask Ansible for them:
zone=$(ansible-inventory -i "${inventory_path}" --host "${host}" |
jq -r '.zone')
project=$(ansible-inventory -i "${inventory_path}" --host "${host}" |
jq -r '.project')
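Putting those pieces together, the whole script is roughly this. It's a sketch: the inventory path is hardcoded here as an assumption, and the polished version lives in the Gist below.

```bash
#!/usr/bin/env bash
# gcp-ssh-wrapper.sh -- rough sketch of how the snippets above fit together.
# Assumption: the inventory path is hardcoded; adjust it to your repo layout.
inventory_path="inventory/inventory.ini"

# Ansible invokes this script the way it would invoke ssh:
# the second-to-last argument is the host, the last one is the command.
host="${@: -2:1}"
cmd="${@: -1:1}"

# Keep only the long options Ansible passes; plain ssh flags mean nothing to gcloud.
declare -a opts
for ssh_arg in "${@:1:$#-3}"; do
  if [[ "${ssh_arg}" == --* ]]; then
    opts+=("${ssh_arg}")
  fi
done

# Ask Ansible for the zone and project of this host (the lookups shown above).
zone=$(ansible-inventory -i "${inventory_path}" --host "${host}" | jq -r '.zone')
project=$(ansible-inventory -i "${inventory_path}" --host "${host}" | jq -r '.project')

# Hand everything to gcloud, tunnelling through IAP since there is no external IP.
exec gcloud compute ssh "${opts[@]}" "${host}" \
  --tunnel-through-iap \
  --no-user-output-enabled \
  --zone="${zone}" \
  --project="${project}" \
  -- -C "${cmd}"
```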
💾 Grab the Code
Instead of copy-pasting the snippets above and hoping for the best, I've uploaded the final, working gcp-ssh-wrapper.sh script to this GitHub Gist. Download it, chmod +x it, and you are good to go.
The Inventory
# inventory/inventory.ini
[project1]
demo zone=europe-north1-a
[project1:vars]
project=my-project
Now it works.
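Well, almost: before running a real playbook, two quick sanity checks are worth it (assuming the inventory above and a logged-in gcloud):

```bash
# 1) Does Ansible resolve the host vars the wrapper will look up with jq?
ansible-inventory -i inventory/ --host demo
# Expect JSON containing "zone": "europe-north1-a" and "project": "my-project".

# 2) Can Ansible actually reach the machine through the wrapper?
ansible demo -m ping -i inventory/
```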
Let's test it
Playbook:
---
- name: Configure Docker logging
  hosts: all
  become: true
  gather_facts: true

  tasks:
    - name: Template Docker daemon configuration
      template:
        src: templates/daemon.json.j2
        dest: /etc/docker/daemon.json
        owner: root
        group: root
        mode: '0644'
      notify: Restart Docker
      tags: docker

  handlers:
    - name: Restart Docker
      service:
        name: docker
        state: restarted
Template:
{
{% if loki_enabled %}
  "log-driver": "loki",
  "log-opts": {
    "loki-url": "{{ loki_url }}",
    "loki-batch-size": "400"
  }
{% else %}
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "300m",
    "max-file": "2"
  }
{% endif %}
}
Vars:
---
loki_enabled: true
loki_url: "http://10.166.0.20:3100/loki/api/v1/push"
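With loki_enabled set to true, the rendered /etc/docker/daemon.json on the target machine should come out looking roughly like this:

```json
{
  "log-driver": "loki",
  "log-opts": {
    "loki-url": "http://10.166.0.20:3100/loki/api/v1/push",
    "loki-batch-size": "400"
  }
}
```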
Run:
ansible-playbook -i inventory/ --diff -l demo playbook_docker.yml
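And to double-check that the handler actually restarted Docker with the new default driver, I can ask Docker directly, using the same gcloud ssh trick as before (this uses the real --command flag of gcloud compute ssh; "demo" and "my-project" are still placeholders):

```bash
# Verify the active default logging driver on the target machine.
gcloud compute ssh demo \
  --zone=europe-north1-a \
  --project=my-project \
  --tunnel-through-iap \
  --command='docker info --format "{{ .LoggingDriver }}"'
# Should print: loki
```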
Important Note on Refinement
Before you copy-paste this into production and yell at me: I know this is a Proof of Concept.
I'm aware that querying ansible-inventory twice per connection is a performance bottleneck. I also know a true "Enterprise" setup would use the official google.cloud.gcp_compute plugin for dynamic inventory and leverage Service Accounts for CI/CD.
However, this approach trades 'Best Practice' complexity for immediate velocity. When you are a solo dev just trying to ship features, that is a trade worth making.
That said, here is my Technical Debt list to tackle later:
- Speed: Caching the inventory lookups to stop spawning processes (see the sketch after this list).
- Scale: Handling multi-region/project setups without hardcoded zones.
- Resilience: Better error handling for when gcloud hiccups.
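For the "Speed" item, the fix is probably as simple as calling ansible-inventory once per connection and reusing the JSON. A sketch (this is not what's in the Gist yet):

```bash
# Query ansible-inventory once, then pull both fields from the cached JSON.
host_vars=$(ansible-inventory -i "${inventory_path}" --host "${host}")
zone=$(jq -r '.zone' <<< "${host_vars}")
project=$(jq -r '.project' <<< "${host_vars}")
```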
Final Thoughts
So, the job is... done? Kinda.
I'm left with a few more questions to explore in the next part of this series:
- The Magic Tunnel: How exactly does IAP (Identity-Aware Proxy) work under the hood?
- Name Collisions: What if two machines in different projects share the same name?
- The "Cloudflare" Layer: There are some blackbox areas regarding how this interacts with Cloudflare that I need to document.
Anyway, that's where I am. It's a hack, but it's my hack, and it works.
This article took nearly a month to write, test, and clean up. I'm really glad I started using Ansible: it's solving problems I didn't even know I had. If you've struggled with similar manual-ops dizziness, let me know in the comments.