I'm a frontend engineer who had to build an AI backend, and later debug a collapsed GCP environment. Here is what those two weekends taught me about the context gap between code and infrastructure.
During a hackathon, our data engineer was near the summit of Mount Toubkal in Morocco, and I was left alone to build an end-to-end AI backend. A few months later, when our CTO was away for the weekend, our staging environment collapsed.
So naturally, I decided to spin up a scalable GCP backend and fix our infrastructure real quick.
Okay, I didn’t build a distributed backend from scratch. But over those two weekends, I did build a working LLM agent from scratch, debug a cascade of GCP failures I’d never encountered before, untangle IAM permissions, and learn how to run production database migrations. A year ago, any one of those things would have taken me weeks, but look at me now.
The Hackathon Project
We’d just shipped the MVP of Grafos.ai as an infrastructure visualiser: a clean, interactive graph of your Terraform. Looking at it, I had a thought that felt obvious: we had a beautiful way to see infrastructure. How hard could it be to add a chat interface that let users change it?
The barrier to building something like this isn’t the code. It’s the time it takes to acquire the context. I’d never touched the Gemini API or written an LLM agent. I barely knew how our own FastAPI backend was wired up: which endpoints existed, how authentication worked, or how the Terraform data was stored and accessed. Under normal circumstances, that’s a week of reading documentation before writing a single line that does anything useful.
So before I opened my editor, I opened Gemini and asked for a crash course. Not a general “explain LLMs to me”, I needed a specific, dense, 60-minute conversation on the Gemini API, context management, intent classification, and retry logic. The kind of briefing you’d get from a senior engineer who had 45 minutes before a flight. By the time I opened Cursor, I had enough of a mental model to know what to build.
Then Cursor did something I still find uncanny. I pointed it at our FastAPI backend and started describing what I needed: an endpoint that takes a user message, loads the right Terraform context, classifies the intent, and returns a response. Cursor already knew how our authentication middleware worked, where our Terraform data lived, and how our existing endpoints were structured. I’d describe the logic, the conventions to follow, the existing code to draw from, and it’d write the implementation. I spent my two days on the design decisions, not fighting the boilerplate.
I gave the UI to Lovable. Ten minutes for a chat interface I’d have spent an hour on myself. With two days on the clock, an hour was too expensive to spend on something I already knew how to build. Not worth the pride.
By the end of the second day, a user could ask a question about their Terraform, get a sensible answer, and request a PR. The agent had no memory between messages, and the JSON parsing was brittle enough that a single malformed response could break the whole flow. When Peran came back from the mountain and looked at the code, he was not delighted. His post on making it production-ready is well worth reading if you want to know how he improved this.
But it worked, and it took two days instead of two weeks. That gap comes down to how fast I was able to filter through research and documentation for the specific problem in front of me, without having to wade through everything that wasn’t relevant.
The Infrastructure Outage
A few months later, our CTO was away for the weekend and our staging environment collapsed. Fresh off the hackathon, I felt up to it: Cursor had made me believe I could squash any bug. So I pulled up GCP and started debugging.
It was painful in a way the hackathon hadn’t been. Not because the problems were harder in isolation, but because I had no map. Every LLM was happy to help but none of them knew what our infrastructure actually looked like. Cursor had our codebase indexed. It didn’t have our GCP project, our IAM roles, our Terraform state, or any concept of what would break if I changed an Ingress setting.
What followed was about six hours of whack-a-mole, where fixing one problem would immediately surface the next one hiding behind it.
The first problem was that I couldn’t see any problems. Our FastAPI service was returning 500 errors, but GCP Logs showed nothing. Just a clean HTTP request log with a status code and silence. It took a conversation with Gemini to understand why: a try/except block in our middleware was catching every unexpected exception and converting it into a tidy HTTPException. FastAPI received a neat error response, decided the developer had handled it, and logged nothing. GCP’s Error Reporting was completely blind. The fix was one line, but finding that it was even needed took an hour.
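The failure mode is easy to reproduce outside FastAPI. A plain-Python sketch of what the middleware was effectively doing, and the one-line fix (names here are illustrative, not our real code):

```python
import logging

logger = logging.getLogger("app")

def swallowing_handler(fn):
    # Before: every unexpected exception becomes a tidy 500 response.
    # Nothing is logged, so Cloud Logging and Error Reporting see a
    # "handled" request and stay silent.
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            return {"status": 500, "detail": "Internal error"}
    return wrapper

def logging_handler(fn):
    # After: log the traceback before converting the exception. The
    # response is identical, but the ERROR-severity log line (with a
    # full stack trace) now reaches GCP.
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            logger.exception("unhandled error in %s", fn.__name__)
            return {"status": 500, "detail": "Internal error"}
    return wrapper
```

In FastAPI terms the fix was the same shape: one `logger.exception(...)` call inside the middleware’s `except` block before building the HTTPException.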
Once I could see errors, I hit the second problem: one of our services wasn’t responding. The traceback pointed to a ConnectionRefusedError trying to reach localhost:80. After working through the networking layer with Gemini, I eventually found it: the GitHub Action that built and deployed that service was running docker build from the root of the monorepo instead of inside the service folder. It had been deploying the wrong Docker image to the wrong service, probably for longer than I’d realised.
Fixing the Docker context fixed the build. But now our main app was getting a 404 back from the service, and the GCP logs were empty. Not a single entry. Google’s load balancer was silently swallowing requests before they ever touched the container, because the service’s Cloud Run Ingress was set to Internal. From the outside it looked like a dead service; from the inside, nothing had happened. No logs, no errors, no trace of any request ever arriving. I only figured it out by noticing a column in the Cloud Run dashboard that said “Internal” next to the service name, comparing it to our production service, which was set to “All”, and changing it.
Now I was getting a 403 instead of a 404. This was actually progress: the request was reaching the service, but being rejected because our main app had no authentication token. I had to write Python logic to fetch a Google Identity Token and grant the Cloud Run Invoker role to our service account. That got the services talking.
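The token dance itself is small. Here’s a sketch using only the standard library and the Cloud Run metadata server (the endpoint is Google’s documented one; the function names are my own):

```python
import urllib.request

# On Cloud Run, the metadata server mints identity tokens for the attached
# service account; the audience must be the URL of the service being called.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/identity?audience={audience}"
)

def identity_token_for(audience: str) -> str:
    req = urllib.request.Request(
        METADATA_URL.format(audience=audience),
        headers={"Metadata-Flavor": "Google"},  # required, or the server refuses
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()

def authorized_request(url: str, token: str) -> urllib.request.Request:
    # The receiving service validates this bearer token; the calling
    # service account also needs roles/run.invoker on the receiver.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
```

The IAM half is a single `gcloud run services add-iam-policy-binding` with `--role=roles/run.invoker` for the caller’s service account.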
Then the app crashed because a database column didn’t exist. Running the migrations job appeared to succeed, green checkmark and all, but the logs showed it had connected to localhost:5432 the entire time. The job’s environment variables had never been configured to point at the actual SQL instance. I ran migrations against the right database, and after a full day of debugging, the staging environment was back up.
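A cheap guard would have turned that green checkmark red. A hypothetical sanity check for a migrations entrypoint (the `DATABASE_URL` env var name is an assumption, not our actual config):

```python
from urllib.parse import urlparse

def assert_real_database(database_url: str) -> str:
    # Fail loudly if the migrations job is about to run against a default
    # localhost connection instead of the actual Cloud SQL instance.
    host = urlparse(database_url).hostname
    if host in (None, "localhost", "127.0.0.1"):
        raise RuntimeError(
            f"Refusing to run migrations against host {host!r}; "
            "is DATABASE_URL configured for this job?"
        )
    return database_url

# e.g. at the top of the migrations entrypoint:
# assert_real_database(os.environ["DATABASE_URL"])
```

A job that crashes with a clear message beats one that “succeeds” against the wrong database.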
What struck me was how differently the two weekends had felt. The hackathon was hard but navigable. Every time I hit a wall, I could show Cursor the exact file it needed to understand the problem. The infrastructure outage was disorienting in a different way. I wouldn’t have been able to do it without LLMs, but every conversation started from scratch. I had to describe our setup, copy-paste logs, explain what IAM roles existed and what they were supposed to do. None of them could see the blast radius of changing that Ingress rule. None of them knew which Docker image was deployed where, or that our migrations job had never been pointed at the right database. I was the only source of context, and also the one who didn’t understand the system.
Six months earlier, a frontend engineer attempting to debug a Cloud Run IAM lockout would have just waited for the CTO to come back from his weekend. The reason I could do it at all was Gemini explaining GCP’s logging architecture step by step and Cursor helping me write Python I’d never written. The context gap made it painful and slow in a way that the hackathon wasn’t.
What this means for Grafos
The two weekends are basically the same story told twice. In the first one, the tools worked because the context existed. Cursor had the codebase, and Gemini could give me a crash course on a well-documented API. In the second one, the same tools were harder to use because the context was split. Gemini knew GCP inside out, but it didn’t know our specific setup. It couldn’t see our IAM configuration, our Terraform state, or which services depended on which. Every time I hit a new failure, I had to reconstruct the picture from scratch before Gemini could help me reason about it. The general knowledge was there, but the specific context wasn’t.
This is what Grafos is built around. The whole product is a bet that infrastructure has a context problem, and that if you can make that context legible, you can give any developer the same footing with their cloud environment that Cursor gives them with their codebase. Turning thousands of lines of Terraform into an interactive graph, with an AI assistant that actually understands your state, is just solving the same problem that made the hackathon work and made the outage so much harder than it needed to be.
I didn’t come out of that weekend a cloud architect but I did come out of it more certain that the era of “I can’t touch that, I’m a frontend dev” is ending.
We originally published this on our Substack. I'm part of a team of 4 engineers building Grafos AI - check it out if you're tired of debugging Terraform infrastructure in the dark without context.