
Frank Noorloos

Posted on • Originally published at frankiey.github.io

Building an Azure VM Sizer for LLMs — with Codex Doing 90% of the Work

Introduction

Lately I’ve been looking into hosting open-source models on dedicated Azure virtual machines and thought: how hard can it be to pick the right VM, and how much can you actually save by choosing a smaller model? Of course, Microsoft’s serverless options are cheaper and much easier to deploy. But I like to know how things work and to manage my own compute, partly out of technical curiosity, partly for compliance and privacy. Running on dedicated VMs gives you more control on both fronts.

I couldn’t find a clear, practical guide for sizing a VM based on the model you choose, so I built one: a quick single-page web app. Try it here → Live Azure LLM Sizer (code: GitHub).
To move fast (and avoid writing more code than necessary) I used Codex for about 90% of the work. In this post I’ll cover how I built the site with Codex, where it’s useful and where it hits its limits, and I’ll share what I learned along the way, plus what’s next.

How I built it with Codex

I started with a simple problem definition, a rough goal for the app, and a few broad requirements. I deliberately kept things open-ended to see which frameworks it would pick. I fed the description into ChatGPT with a “deep research” prompt, and out came a functional spec (trimmed here for brevity):

Azure-LLM-Sizer — Functional Spec (≤ 512 words)

Purpose
Browser-only SPA that tells users the smallest Azure VM (GPU SKU) capable of serving / fine-tuning a chosen open-source large-language model under given precision, context length, batch size, and optional multi-GPU constraints.

⸻

Core Workflow
    1.  Select inputs
    • Model (type-ahead over HF IDs)
    • Precision (FP32 / FP16 / BF16 / INT8 / INT4)
    • Context length slider (256 – 128 k)
    • Batch size slider (1 – 64)
    • Advanced: “Training mode” toggle adds optimizer-state multiplier; “World-size” sets required GPUs.
    2.  Estimator (runs client-side, ≤ 100 ms)
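Under the hood, that estimator step reduces to a simple VRAM model: weight memory plus KV cache plus some headroom, checked against a table of Azure GPU SKUs. Here’s a minimal TypeScript sketch of the idea; the function names, the 10% overhead factor, and the SKU table are illustrative assumptions, not the app’s actual code:

```typescript
// Illustrative client-side estimator: VRAM ≈ weights + KV cache + overhead.

type Precision = "FP32" | "FP16" | "BF16" | "INT8" | "INT4";

interface ModelConfig {
  params: number;    // total parameter count, e.g. 7e9
  layers: number;    // number of transformer layers
  hiddenDim: number; // n_heads × head_dim
}

const BYTES_PER_PARAM: Record<Precision, number> = {
  FP32: 4, FP16: 2, BF16: 2, INT8: 1, INT4: 0.5,
};

function estimateVramGB(
  model: ModelConfig,
  precision: Precision,
  contextLen: number,
  batchSize: number,
): number {
  const weights = model.params * BYTES_PER_PARAM[precision];
  // KV cache: a K and a V tensor per layer per token, typically kept in
  // 16-bit (2 bytes) even when the weights themselves are quantized.
  const kvCache =
    2 * model.layers * model.hiddenDim * contextLen * batchSize * 2;
  const overhead = 1.1; // ~10% for activations, CUDA context, fragmentation
  return ((weights + kvCache) * overhead) / 1e9;
}

// Pick the smallest Azure GPU SKU that fits. A real table would carry
// pricing and region availability; this one is a stand-in.
interface Sku { name: string; vramGB: number; gpus: number }

const SKUS: Sku[] = [
  { name: "NC24ads_A100_v4", vramGB: 80,  gpus: 1 },
  { name: "NC48ads_A100_v4", vramGB: 160, gpus: 2 },
  { name: "NC96ads_A100_v4", vramGB: 320, gpus: 4 },
];

function smallestSku(requiredGB: number): Sku | undefined {
  return [...SKUS]
    .sort((a, b) => a.vramGB - b.vramGB)
    .find((sku) => sku.vramGB >= requiredGB);
}
```

Real serving stacks complicate the picture (grouped-query attention shrinks the KV cache, and inference frameworks add their own buffers), which is exactly the kind of simplifying assumption an estimator like this has to bake in.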

It even suggested the tech stack, some non-functional requirements, and the reasoning behind certain choices. I quickly reviewed it, dropped it into Codex, and off it went 🚀. The very first PR included a complete repo setup and a working version of the app. The frontend wasn’t pretty, but it did the job.

From there, the process was a loop of iterating and adding features. If something didn’t work or I didn’t like it, I’d just describe the change to Codex. It would spin up its own environment, make the changes, and open a PR. My job was to review, test locally, and merge to main.

My workflow looked like this:
Define problem or feature → Create Codex task → Review PR → Test locally → Approve/merge or request tweaks → Repeat.

When Codex hit a wall, either struggling with complex issues or producing subpar results, I’d switch tools. ChatGPT and Perplexity handled research-heavy questions. For the frontend, I brought in Claude from Anthropic to rework the UI in Tailwind, which massively improved the look. Once Claude produced the HTML, I pasted it back into Codex to integrate with the existing codebase, and it slotted in seamlessly.

Challenges & Insights

Where Codex shines

  • Turns a vague problem into a working repo fast. The first PR had scaffolding, routing, basic state, and CI/CD to GitHub Pages—without me touching YAML.
  • Excellent at wiring and repetitive tasks: project setup, glue code, small refactors, and “add this option to the form + estimator.”
  • PR-first flow works well: I describe the change, it opens a PR, I review/test, and merge.

Where it struggled

  • Visual design: getting a polished UI took multiple passes. I handed the layout to Claude (Tailwind) and then had Codex integrate it.
  • Cross-cutting changes: when work spanned multiple files or data-flow edges, it sometimes lost the thread and anchored on the existing approach.
  • Code critique & alternatives: it tended to justify the current implementation instead of proposing new designs.

What worked

  • Keep tasks tight and atomic with explicit acceptance criteria.
  • Minimize context: link to 1–2 files and quote exact function names or components.
  • Use specialists: ChatGPT/Perplexity for research, Claude for UI drafts, Codex for integration and wiring.

Takeaways

  • Treat codegen like a sharp junior engineer on rails: great at well-specified tasks; less great at architecture and product taste.
  • Always keep PR review and a local run in the loop.

Results & Impact

[Screenshot: the resulting live website]

One of the biggest wins was how quickly I could go from a blank page to a fully working, deployed application. With Codex handling most of the scaffolding and setup, the initial project came together in hours, not days. The automatic build and deployment pipeline to GitHub Pages was in place right after the first PR merge. Again, all generated by Codex without me touching a single YAML file.

Iteration speed was another game changer. Adding new features or fixing bugs became a matter of describing the change, letting Codex implement it, reviewing the PR, and pushing it live. User feedback could be integrated almost instantly. If someone reported that something looked off or didn’t work as expected, I could have a fix deployed within minutes. The only caveat was the UI: while Codex could wire up the functionality quickly, the styling usually needed a few rounds of refinement (or a complete handoff to Claude) to make it fit the look and feel I wanted.

Overall, the development loop felt incredibly fast and satisfying, giving me the ability to turn ideas into live, functional features almost as quickly as I could think of them.

Next Steps

This is just the first iteration of a wild idea. I want to expand it so I can use it for any open-source model on any hardware; for example, I still need to add the latest gpt-oss models from OpenAI, and many others. For frontend and functional changes I’ll keep doing as much as possible with Codex or other AI systems, while the data mining and sourcing will mostly be done by hand or with other approaches, since Codex does not excel at that part. The calculation itself can also be improved: it makes some simplifying assumptions and only covers loading the whole model with a given KV cache. That yields a decent estimate, but it does not represent real-world scenarios, where you might run a multi-cluster setup or want to know how fast a system is and how many requests it can handle. The website currently focuses solely on inference and is limited to Azure; expanding into training capabilities and support for other cloud platforms would be a natural next step.
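To make that scope concrete, here is what the current style of estimate says for a typical 7B model served at FP16, using the illustrative estimateVramGB() sketched earlier (the layer count and hidden size are assumed from a Llama-2-7B-like architecture):

```typescript
// Worked example with the illustrative estimator from above.
const required = estimateVramGB(
  { params: 7e9, layers: 32, hiddenDim: 4096 }, // Llama-2-7B-like shape
  "FP16",
  8192, // context length
  1,    // batch size
);
// Weights: 7e9 × 2 B = 14 GB.
// KV cache: 2 × 32 × 4096 × 8192 × 2 B ≈ 4.3 GB.
// With ~10% headroom: ≈ 20 GB → a single 80 GB A100 SKU fits easily,
// but the number says nothing about tokens/sec or request throughput.
console.log(`~${required.toFixed(1)} GB VRAM`);
```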

Final Thoughts

Codex really surprised me. I started with a broad, almost vague prompt, and it spun up a complete repository with everything I needed to get going. Adding new features was just as smooth; sometimes I’d have five changes in progress at once, each handled in parallel. As a full-stack engineer, I’ve never been fond of wrangling CSS or Tailwind, so being able to delegate the design side and focus on the technical logic felt like a huge productivity win.

That said, Codex isn’t perfect. It can be stubborn, especially when dealing with bigger patterns or more abstract feedback. Frontend polish often took a few extra passes, and longer conversations sometimes left it stuck in a loop. The way I’ve come to see it: Codex is like a sharp junior engineer. It’s fast, confident, and great with well-defined tasks, but it struggles when the problem gets fuzzy or spans multiple layers.

Even with those limitations, the experience of working this way was exciting. It shifted my role from typing out every line to directing, reviewing, and refining. That’s a glimpse of where software development is headed: AI-assisted tools embedded in our everyday workflows, helping us move faster while we focus on the higher level decisions. I’m curious to see how quickly these tools mature and how they’ll change the way we build software in the enterprise.

