DEV Community: Hossein Zolfi

The Hidden Tax You Are Paying Every Day: A Dev's Journey into Ops Automation

Hossein Zolfi — Sun, 23 Nov 2025 05:09:02 +0000

Starting a new career is kinda like walking into a workshop full of tools you've never seen before. You know they're useful, maybe even powerful, but honestly... you're not totally sure what they do or whether you'll break something the moment you touch them. That's me right now. Everything feels new. And confusing. And fun. Sometimes all at once.

This article is part of a series where I'm basically thinking out loud about the stuff I'm learning. I want my future self to read these later and go like: "Oh wow, that's where I started." Or maybe he'll laugh at me. Either way, it's worth documenting.

Along the way I ran into a bunch of questions. Some I answered. Some I couldn't. Some probably don't even have a real answer. I'm gonna put them here anyway because they're part of the journey.

Step 1: Identify the Problem (aka "Why am I doing this manually??")

One of the things I'm struggling with is: how do I share operations with my future self? Suppose I want to set up a client on a machine that wants to ship logs to external servers, how should I do it? How should I remember what I've done before? Reading documentation or command history to find out what I did before? Seriously. What about urgent situations: What happens if we lose a VM? How do I know what's even on these machines?

💰 The CEO/Manager ROI Corner

If you are a manager reading this, here is the math: Manual ops is a hidden tax. Every hour I spend SSH-ing into a box to fix a config is an hour I am not building features. Plus, manual changes mean human error, which means downtime risk. Automation isn't just "cool", it is risk management.

Coming back to the ops world after doing dev work, my first thought when I saw these GCP instances was: how do I get rid of the manual work?

Throughout my entire life as a software engineer, I've never been able to persuade myself to do repetitive work manually. I just... I can't. I hate it. If I have to do the same clicks and commands more than twice, I literally start to feel dizzy. It just feels wrong.

This is basically the DRY Principle (Don't Repeat Yourself) applied to infrastructure.

Every piece of knowledge must have a single, unambiguous, authoritative representation within a system

To save time in these kinds of situations the obvious answer is docs. Notion, Confluence, Google Docs, whatever. But... why? Just to tell other people (technical or not) what's installed? And how I set it up? Especially since not everyone can just ssh in and check.

That just doesn't work for me. We're software developers. We're supposed to like code, right? We live in it. And for me, running a command from a script is always safer and faster than typing it manually. Just because someone else does this work by hand doesn't mean I should too. So, in between my regular tasks, I started exploring my options.

Step 2: Exploring Options

The real answer, for me, has to involve Git. A repo.

My mind immediately went to Ansible. But I did pause for a second to think about the other players in the room.

Ansible is open-source, command-line IT automation software written in Python. It can configure systems, deploy software, and orchestrate advanced workflows to support application deployment, system updates, and more.

That's exactly what I need.

Here’s a quick mental matrix I went through:

Tool	The Vibe	Why I picked (or didn't pick) it
Terraform	The "Industry Standard" for infra. Great for creating VMs, less great for configuring inside them.	Overkill for right now. I have the VMs; I need to configure the OS. The learning curve is steep when I just need to install Docker.
Pulumi	"Infrastructure as Code" but actual code (TS/Python).	Super cool, but adds complexity I don't need yet. I want simple config files, not a compile step.
Ansible	The "Old Reliable". Agentless, runs over SSH.	Winner. It uses YAML, handles server configuration perfectly, and I already know the basics.

Decision Point: Why Ansible?

To be honest, it's just what I know. I used it years ago, so I'm familiar with it. My focus right now isn't to learn a brand-new provisioning tool. That just isn't aligned with my career goals at this moment.

Note: When you are a solo dev or a small team, Velocity > Perfection. Always.

So, good or not, I chose Ansible. It's my baseline. It'll tell me what I set up, and it'll help me install services in minutes instead of hours (like webhooks or configuring logs for Loki — should that take hours? No. But the context-switching kills you). And we can place the configuration in code and commit it in a Git repo.

Implementation: OK, let's do this

So, I chose Ansible. Now. How do I use it to provision and configure machines in GCP?

This is where it got weird. The GCP infra is really different from what I'm used to. The network access is super abstract and hidden. The OS runs differently (it has agents! that configure things from the web console!).

And everything is about regions and zones. You've seen the docs:

Regions are independent geographic areas that consist of zones. A zone is a deployment area within a region.

Yada yada. Why do I care? Cost.

This stuff isn't free. If you're self-hosting, you just stack your servers in one datacenter. In GCP, if you have a machine in Europe and one in North America, you're paying for that data transfer (like $0.05 per GB).

Even external IPs cost money! A static IP in Belgium is $3.65/month. It's not much, but for 10 VMs? That adds up. And even when a machine does have an external IP, that address can be ephemeral: if the machine is stopped and started again (not just restarted), GCP may assign it a new external IP. So wanting an external IP doesn't automatically mean I should reserve a static one.

This leads to a simple conclusion:

If you don't need an external IP, DON'T ASSIGN ONE. It's just wasted money.
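
To put a number on "that adds up", here is a quick back-of-the-envelope sketch in bash. It only uses the $3.65/month figure and the 10 VMs mentioned above; real pricing varies by region and changes over time, so treat the constants as assumptions. Bash has no floating point, so the math is done in cents.

```shell
#!/usr/bin/env bash
# Assumption: $3.65/month per reserved static IP (the Belgium figure above).
static_ip_cents=365
vm_count=10

monthly_cents=$((static_ip_cents * vm_count))
yearly_cents=$((monthly_cents * 12))

# Convert cents back to dollars for display.
monthly=$(printf '%d.%02d' $((monthly_cents / 100)) $((monthly_cents % 100)))
yearly=$(printf '%d.%02d' $((yearly_cents / 100)) $((yearly_cents % 100)))

echo "Static IPs for ${vm_count} VMs: \$${monthly}/month, \$${yearly}/year"
```

That comes to $36.50 a month, or $438.00 a year, for addresses that most of these machines never need.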

Step 3: The Actual Problem

Okay, so if I follow my own advice and don't assign an external IP, how the hell does Ansible connect to the machine? Ansible needs an address in its inventory file. Where would I even find one? Was I wrong to choose Ansible as my provisioning tool? And if not, how do I configure these machines with it?

This is where the gcloud command comes in. GCP gives you this tool to manage everything. You can connect to a machine with it even without a public IP:

gcloud compute ssh demo --zone=europe-north1-a --project=my-project

It's basically just magic. It tunnels through or something. I don't care how right now, just that it works.

So, if gcloud can ssh, maybe Ansible can use gcloud instead of ssh?

The Solution: A Wrapper Script

The idea: we tell Ansible:

"When you want to ssh, run this script instead."

ansible.cfg:

[ssh_connection]
ssh_executable = gcp-ssh-wrapper.sh

Now we need to build gcp-ssh-wrapper.sh.

The script receives all the arguments Ansible passes to SSH and instead passes them to gcloud compute ssh.

The core looks like:

exec gcloud compute ssh "${opts[@]}" "${host}" \
  --tunnel-through-iap \
  --no-user-output-enabled \
  --zone="${zone}" \
  --project="${project}" \
  -- -C "${cmd}"

To get the host and command:

host="${@: -2:1}"
cmd="${@: -1:1}"

To collect SSH options:

declare -a opts
for ssh_arg in "${@:1:$#-3}"; do
  if [[ "${ssh_arg}" == --* ]]; then
    opts+=("${ssh_arg}")
  fi
done
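
If the slicing syntax looks cryptic, here is a small sketch that runs the two snippets above against a made-up argument list. The argument list is only an assumption about the general shape (ssh-style options, then the host, then the command); running ansible-playbook with -vvv shows what your Ansible version really sends. parse_args is a hypothetical helper name for illustration, not something taken from the gist.

```shell
#!/usr/bin/env bash
# parse_args is a hypothetical helper that wraps the two snippets above.
parse_args() {
  # Note the space before the minus sign: "${@: -2:1}" means "start two
  # from the end, take one". Without the space, bash parses ":-" as the
  # "use default value" operator instead.
  host="${@: -2:1}"
  cmd="${@: -1:1}"

  # Keep only arguments that start with "--"; single-dash ssh options
  # are dropped.
  opts=()
  local ssh_arg
  for ssh_arg in "${@:1:$#-3}"; do
    if [[ "${ssh_arg}" == --* ]]; then
      opts+=("${ssh_arg}")
    fi
  done
}

# Made-up argument list, for illustration only.
parse_args -C -o ControlMaster=auto --strict-host-key-checking=no \
  -o ConnectTimeout=10 -tt demo "/bin/sh -c 'echo hello'"

echo "host: ${host}"
echo "cmd:  ${cmd}"
echo "opts: ${opts[*]}"
```

With this input, host ends up as demo, cmd as the quoted shell command, and opts holds only the one double-dash flag.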

But... wait. gcloud needs zone and project. Ansible doesn't know about them. So the script has to ask Ansible:

zone=$(ansible-inventory -i "${inventory_path}" --host "${host}" |
    jq -r '.zone')
project=$(ansible-inventory -i "${inventory_path}" --host "${host}" |
    jq -r '.project')

💾 Grab the Code
Instead of copy-pasting the snippets above and hoping for the best, I've uploaded the final, working gcp-ssh-wrapper.sh script to this GitHub Gist. Download it, chmod +x it, and you are good to go.

The Inventory

# inventory/inventory.ini
[project1]
demo zone=europe-north1-a

[project1:vars]
project=my-project

Now it works.

Let's test it

Playbook:

---
- name: Configure Docker logging
  hosts: all
  become: true
  gather_facts: true

  tasks:
    - name: Template Docker daemon configuration
      template:
        src: templates/daemon.json.j2
        dest: /etc/docker/daemon.json
        owner: root
        group: root
        mode: '0644'
      notify: Restart Docker
      tags: docker

  handlers:
    - name: Restart Docker
      service:
        name: docker
        state: restarted

Template:

{
{% if loki_enabled %}
  "log-driver": "loki",
  "log-opts": {
    "loki-url": "{{ loki_url }}",
    "loki-batch-size": "400"
  }
{% else %}
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "300m",
    "max-file": "2"
  }
{% endif %}
}

Vars:

---
loki_enabled: true
loki_url: "http://10.166.0.20:3100/loki/api/v1/push"

Run:

ansible-playbook -i inventory/ --diff -l demo playbook_docker.yml

Important Note on Refinement

Before you copy-paste this into production and yell at me: I know this is a Proof of Concept.

I'm aware that querying ansible-inventory twice per connection is a performance bottleneck. I also know a true "Enterprise" setup would use the official google.cloud.gcp_compute plugin for dynamic inventory and leverage Service Accounts for CI/CD.

However, this approach trades 'Best Practice' complexity for immediate velocity. When you are a solo dev just trying to ship features, that is a trade worth making.

That said, here is my Technical Debt list to tackle later:

Speed: Caching the inventory lookups to stop spawning processes.
Scale: Handling multi-region/project setups without hardcoded zones.
Resilience: Better error handling for when gcloud hiccups.
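
For the Speed item, one possible direction is an on-disk cache keyed by host name, so repeated connections reuse the first lookup. The sketch below is only an illustration of that idea: lookup_host_vars is a stub standing in for the ansible-inventory --host call (so it runs without Ansible installed), cached_host_vars is a hypothetical helper name, and a real wrapper would need a stable cache path plus some way to invalidate stale entries.

```shell
#!/usr/bin/env bash
# Throwaway cache directory for the demo; a real wrapper would need a
# stable location, because every connection starts a fresh process.
cache_dir="$(mktemp -d)"

# Stub standing in for:
#   ansible-inventory -i "${inventory_path}" --host "$1"
# It logs each call so we can see how often the slow path runs.
lookup_host_vars() {
  echo "$1" >> "${cache_dir}/calls.log"
  printf '{"zone": "europe-north1-a", "project": "my-project"}\n'
}

# Hypothetical helper: return the host vars JSON, hitting the slow
# lookup only when there is no cached copy for this host yet.
cached_host_vars() {
  local host="$1"
  local cache_file="${cache_dir}/${host}.json"
  if [[ ! -s "${cache_file}" ]]; then
    lookup_host_vars "${host}" > "${cache_file}"
  fi
  cat "${cache_file}"
}

first="$(cached_host_vars demo)"
second="$(cached_host_vars demo)"
lookup_count="$(wc -l < "${cache_dir}/calls.log" | tr -d ' ')"

echo "lookups performed: ${lookup_count}"
```

Since zone and project come from the same JSON document, one cached lookup could feed both jq calls instead of spawning ansible-inventory twice.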

Final Thoughts

So, the job is... done? Kinda.

I'm left with a few more questions to explore in the next part of this series:

The Magic Tunnel: How exactly does IAP (Identity-Aware Proxy) work under the hood?
Name Collisions: What if two machines in different projects share the same name?
The "Cloudflare" Layer: There are some blackbox areas regarding how this interacts with Cloudflare that I need to document.

Anyway, that's where I am. It's a hack, but it's my hack, and it works.

This article took nearly a month to write, test, and clean up. I'm really glad I started using Ansible. It's solving problems I didn't even know I had. If you've struggled with similar manual-ops dizziness, let me know in the comments.

Too Much Talent: Why Building a Great Team Isn't Just About Seniority

Hossein Zolfi — Tue, 13 May 2025 11:35:01 +0000

From Stars to Team

Before I became an engineering manager, when I was just reading software engineering books, I believed the surest path to success was hiring as many top-tier engineers as possible. My logic was simple: great software design comes from great engineers. A colleague once told me, "Hire stars; they'll build game-changing features!" And often, they did. These engineers introduced powerful designs, elegant abstractions, and impressive capabilities that elevated our products.

But over time, I saw the limits of that approach. Debates dragged on. Unpopular tasks went unowned. Engineers left because they felt underutilized. After reflecting on what I saw in my own teams—and what I learned from others—I realized:

Focusing solely on individual brilliance—even when it delivers great features—often sacrifices teamwork, consistency, and long-term growth.

The real magic happens when the team works as a cohesive unit. The goal isn't to collect stars—it's to build a constellation.

When building teams—whether in sports or software—the instinct is to stack the roster with as much talent as possible. More senior engineers, more superstars, more wins, right? Not always. Research and real-world experience suggest a more nuanced truth: too much talent can actually hinder performance. Even when individual engineers introduce brilliant features, the lack of balance and collaboration can slow the team down. Here's why—and how to build a team that thrives.

The Research: Talent Has Diminishing Returns

A 2014 study by Swaab, Schaerer, Anicich, Ronay, and Galinsky analyzed highly interdependent team sports like soccer and basketball. They found that adding talent improves performance, but only up to a point. Beyond that, performance suffers. Too many stars can lead to coordination breakdowns, ego clashes, and poor collective results.

This challenges the assumption that talent and output scale linearly. In interdependent teams—like most software engineering groups—excessive star power can actually disrupt the very collaboration needed to succeed.

Key takeaway: Talent drives performance, but only if it doesn't undermine teamwork.

Why This Matters for Software Teams

Software engineering is a team sport. Engineers must align on architecture, review each other's work, and ship features together. A team made up solely of senior engineers or high-performers can run into serious problems:

Decision Gridlock: Strong opinions may clash over tech choices or system design, delaying progress.
Neglected Execution: Critical but unglamorous tasks—like CI/CD setup, writing documentation, or fixing edge-case bugs—may get ignored.
Leadership Friction: Too many leaders without clear roles often compete instead of collaborate.
Blocked Growth: Without mid-level or junior engineers, there's no natural path for mentorship or long-term team development.

Even when talented engineers ship great features, a lack of cohesion or clarity in roles can erode team efficiency and morale.

Real-World Examples: Talent Alone Isn't Enough

Google's Early Struggles

In the early 2010s, Google's engineering teams were packed with elite talent. Yet, some teams struggled to ship:

Endless architecture debates delayed progress.
Glue work like documentation and infrastructure lagged behind.
Junior engineers felt sidelined, which hurt morale and retention.

Google eventually recognized that technical brilliance wasn't enough. They began emphasizing soft skills—collaboration, humility, team orientation—in hiring and promotions. They also formalized the role of the Tech Lead to create alignment and speed up decisions.

Node.js and Open Source Governance

Node.js, in its early days, suffered from too many strong voices pushing in different directions. Conflicts and lack of governance stalled progress. The project rebounded only after implementing clearer roles and decision-making processes. This empowered contributors at all levels and helped the project scale.

These examples show that raw talent is not a silver bullet. Without structure and shared purpose, even the best engineers can struggle to work effectively together.

The Role of Tech Leads in Balancing Talent

A well-balanced team still needs guidance. That's where Tech Leads come in. The Tech Lead isn't the smartest person in the room—they're the one who keeps the room aligned.

Before I became a team lead, I didn't fully understand this role. In my past companies, the Tech Lead was treated as the boss—someone who went to meetings, made decisions alone, and became disconnected from the team. Often, these leads lacked strong software instincts and sometimes introduced friction with senior engineers. This was far from the kind of leadership that fosters a healthy engineering culture.

Now I realize:

A Tech Lead is not a boss. A good Tech Lead is a coordinator and a listener.

The role is about creating alignment—not control. It's about ensuring that everyone contributes, conflicts are resolved constructively, and decisions move forward without dragging the team down.

An effective Tech Lead:

Guides Without Dominating: They set direction while making space for team input.
Resolves Conflict: They facilitate decisions so the team doesn't stall.
Leverages All Levels: They ensure seniors mentor, mid-levels execute, and juniors grow.
Focuses on Outcomes: They prioritize what's best for the product—not their own technical legacy.

Spotify's Squad Leads operate in this mold, helping cross-functional teams stay aligned while maintaining speed and cohesion.

Without a Tech Lead—or with one driven by ego—teams of superstars often descend into dysfunction.

How to Build a Balanced Team: Practical Tips

If talent alone isn't the answer, what is? Here are some actionable ways to build a high-performing team:

Hire T-Shaped Engineers: Seek people with deep expertise in one area, but enough breadth to collaborate across functions.
Assess Gaps Before Hiring: Need innovation? Hire a senior designer or architect. Need delivery? Mid-levels are crucial. Need culture? Add a mentor.
Prioritize Soft Skills: Interview for humility, communication, and collaboration—not just algorithms and system design.
Foster Mentorship: Pair seniors with juniors. This builds culture and spreads knowledge.
Empower a Tech Lead: Choose someone who values team success over individual heroics. Support their growth with training in facilitation and leadership.

A Fun Metaphor: The LeBron James Problem

You might call this the LeBron James Problem:

You'll wait forever trying to hire three LeBron Jameses—they're rare and expensive.
Even if you succeed, they may clash instead of cooperate.

Software teams aren't basketball fantasy drafts. Obsessing over rockstar hires wastes time and breeds dysfunction. A team of role players with clear leadership often outperforms a group of misaligned all-stars.

Final Thoughts: Build a Constellation, Not a Cluster of Stars

Talent absolutely matters. Superstar engineers can introduce groundbreaking features. But once you reach a certain threshold, adding more talent without balancing the team dynamic can backfire.

To build a truly great team:

Seek balance: Mix senior, mid-level, and junior engineers.
Prioritize teamwork: Hire for compatibility as well as capability.
Support leadership: Use Tech Leads to align the team and resolve friction.
Value the whole: Celebrate team outcomes over individual accomplishments.

Don't build a cluster of lone stars. Build a constellation—where everyone shines together.

Remote debugging Go App

Hossein Zolfi — Sat, 12 Oct 2024 13:59:06 +0000

For the longest time, I wasn’t a fan of debuggers. Coming from a background in Spring Framework, Java, Python, and PHP (Symfony/Laravel), I always found logs and traces more reliable for debugging. I had even dabbled with GDB in the early days, but it didn’t stick. Instead, I relied on logging to figure out what my applications were doing.

However, my perspective changed when I started working with Go. While developing in Kubernetes and working on microservices, I faced a situation where the complexity of the service made logging and tests insufficient. This was a large service with multiple dependencies, serving both users and operators, making bug fixes particularly challenging.

Let me share a recent experience where using a debugger saved me a significant amount of time.

Debugging in Kubernetes: A Real-World Example

A few weeks ago, I was developing a feature that affected multiple parts of a microservice running in Kubernetes. This service was being called by several other components (like the app and admin panels), and I needed to trace what was happening inside the service, step by step. Tests couldn’t cover everything, especially when I needed to ensure that certain flags were set correctly to prevent sending multiple SMS notifications to users.

I could have used logs to trace the service, but every time I missed an entry, I would have had to add a new one, push the code, and wait for the CI pipeline to complete—wasting 3-5 minutes each time just to deploy to staging.

Instead, I decided to try using a debugger. Here’s how I set it up.

Setting Up Delve Debugger in a Kubernetes Pod

After a lot of trial and error, I found that adding the following lines to the Dockerfile allowed me to run Delve (a debugger for Go) inside the Kubernetes Pod:

ENTRYPOINT ["/bin/sh", "-c", "/go/bin/dlv --listen=127.0.0.1:8001 --headless=true --api-version=2 --only-same-user=false exec /path/to/exec"]

To install Delve, I added this line to the Dockerfile:

RUN go install github.com/go-delve/delve/cmd/dlv@v1.23.0

Next, I compiled the Go application with specific flags to disable optimizations and inlining:

go build -gcflags "all=-N -l"

Connecting to the Debugger

Once the service was deployed, I used port forwarding to connect to the debugger. I added the following port configuration to the Kubernetes manifest:

ports:
  - name: dlv
    containerPort: 8001

Then, I forwarded the port to my local machine:

oc port-forward $(oc get pods -l app=LABEL -o name | head -n 1) 8001:8001

Now I could connect to the debugger from my local machine using:

dlv connect :8001

Setting Breakpoints and Configuring Paths

With the debugger connected, I set breakpoints, such as:

(dlv) break main.main

Because the service was built in a different environment (GitLab’s pipeline), the paths didn’t match my local setup. To solve this, I used substitute-path to map the paths:

(dlv) config substitute-path /usr/local/go/ /Users/hossein/sdk/go1.23.1/

Running these commands every time was tedious, so I created a script (dlv.init) to automate the setup:

dlv connect :8001 --init dlv.init
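
The contents of dlv.init aren't shown here, but a Delve init file is just a list of debugger commands, one per line, that run when the client connects. A minimal sketch, assuming the file holds only the two commands from this section (the real file in the repository may differ):

```shell
#!/usr/bin/env bash
# Write a minimal Delve init file containing the commands shown above.
# Assumption: the actual dlv.init in the author's repository may differ.
init_file="dlv.init"

cat > "${init_file}" <<'EOF'
config substitute-path /usr/local/go/ /Users/hossein/sdk/go1.23.1/
break main.main
EOF

echo "wrote ${init_file}; connect with: dlv connect :8001 --init ${init_file}"
```

Keeping the path mapping and breakpoints in a file means a new debugging session starts from the same state every time.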

Explanation of Key Commands

Delve Debugger Command:

dlv --listen=127.0.0.1:8001 --headless=true --api-version=2 --only-same-user=false exec /path/to/exec

--listen: Binds the debugger to 127.0.0.1:8001 inside the Pod; port forwarding is what makes it reachable from the local machine.
--headless=true: Runs Delve without an interactive UI, suitable for remote debugging.
--api-version=2: Uses API version 2 for better tool compatibility.
--only-same-user=false: Allows users other than the process owner to connect.
exec /path/to/exec: Starts the precompiled Go executable under the debugger.

This setup allows remote debugging of a Go service in a Kubernetes environment.

Go Build Command:

go build -gcflags "all=-N -l"

-N: Disables optimizations.
-l: Prevents inlining of functions.

These flags ensure the code stays closer to the source, making it easier to debug by allowing you to step through every function.

Example Usage

To demonstrate how to debug a remote service, I use a local Docker container for simplicity. However, in practice, this service would be deployed on Kubernetes.

To build the Docker image, run the following command:

make build

This command will build a Docker image. Once built, you can start the service by running:

make start

Output:

$ make start
docker run -it -p 8001:8001 -p 8080:8080 --rm my-app
API server listening at: [::]:8001
2024-10-12T13:48:31Z warning layer=rpc Listening for remote connections (connections are not authenticated nor encrypted)

The output shows that the debugger is running and listening. To connect to the remote service, run:

make debug

Output:

$ make debug
bash -c "dlv connect :8001 --init <(sed 's|PWD|'`pwd`'|g; s|HOME|'$HOME'|g' dlv.init)"
Type 'help' for list of commands.
Breakpoint 1 set at 0x2188cc for main.main.func1() ./main.go:9
(dlv) c

Once attached, send the c command to continue execution. Then, send a request to the service using the following:

make send-request

The debugger will show output like this:

> [Breakpoint 1] main.main.func1() ./main.go:9 (hits goroutine(17):1 total:1) (PC: 0x2188cc)
Warning: debugging optimized function
Warning: listing may not match stale executable
     4:     "fmt"
     5:     "net/http"
     6: )
     7:
     8: func main() {
=>   9:     http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
    10:         fmt.Fprintf(w, "Hello, World!")
    11:     })
    12:
    13:     http.ListenAndServe(":8080", nil)
    14: }

At this point, you can manually debug the service. For example, to inspect the HTTP method, use:

(dlv) p r.Method
"GET"

Once troubleshooting is complete, use the c command to continue, and the client will receive the response.

Conclusion

Using a debugger in this case allowed me to step through the code and understand the service's behavior without constantly pushing new logging code. It was a game-changer, especially when working with complex microservices in Kubernetes. Debugging Go in production environments can be daunting, but with the right setup, it can save you a lot of time and frustration.

You can check out the full example of this setup on my GitHub.