Beyond Prompting: Building a 4-Stage LLM Compiler with Surgical Self-Repair

#ai #docker #fastapi #llm

A single prompt often yields inconsistent, unvalidated AI output. To fix this, I built Compyl, a multi-stage LLM compiler that takes a plain-English request and converts it into a directly usable JSON blueprint.

Compyl converts plain English into a complete, validated, machine-readable JSON blueprint (UI schema, API schema, DB schema, and authentication rules) directly usable to power a working application.

The Input: "Build a CRM with login, contacts, dashboard, role-based access, and payments. Admins can see analytics."
The Output: A fully synchronized JSON blueprint spanning all four layers.

Why Multi-Stage?

I wanted to break the workflow into modules, so that each step is easier to understand and to check for errors:

Stage	Name	Purpose
Stage 1	Lexer	Parses raw input into structured tokens (entities, roles, features).
Stage 2	Parser	Builds the application architecture from those tokens.
Stage 3	Code Gen	Emits four synchronized schemas.
Stage 4	Linter/Repair	Catches and repairs cross-layer inconsistencies.

Each stage is a separate Groq API call with its own system prompt, Pydantic output schema, and retry logic.
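
To make the "system prompt + output schema + retry logic" pattern concrete, here is a minimal sketch of a single stage. It is not Compyl's actual code: `LexerTokens`, `parse_tokens`, and `run_stage` are hypothetical names, a plain dataclass stands in for the Pydantic model, and `call_llm` is a placeholder for the real Groq call so the example runs without an API key.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class LexerTokens:
    """Stand-in for a Pydantic output model (hypothetical field names)."""
    entities: list[str]
    roles: list[str]
    features: list[str]


def parse_tokens(raw: str) -> LexerTokens:
    """Parse and validate raw model text; raises ValueError on any violation."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    fields = {}
    for key in ("entities", "roles", "features"):
        value = data.get(key)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"'{key}' must be a list of strings")
        fields[key] = value
    return LexerTokens(**fields)


def run_stage(call_llm: Callable[[str, str], str], system_prompt: str,
              user_input: str, parse: Callable[[str], LexerTokens],
              max_attempts: int = 3) -> LexerTokens:
    """Call the model, validate the reply, and retry with the error as feedback."""
    feedback = ""
    for _ in range(max_attempts):
        raw = call_llm(system_prompt, user_input + feedback)
        try:
            return parse(raw)
        except ValueError as exc:
            feedback = f"\n\nYour previous reply was rejected: {exc}. Return valid JSON only."
    raise RuntimeError(f"stage failed after {max_attempts} attempts")


# Fake model for demonstration: the first reply is malformed, the second is valid.
_replies = iter([
    "Sure! Here are the tokens...",
    '{"entities": ["contact"], "roles": ["admin"], "features": ["login"]}',
])
tokens = run_stage(lambda system, user: next(_replies),
                   "Extract entities, roles, and features as JSON.",
                   "Build a CRM with login and contacts.", parse_tokens)
print(tokens.roles)
```

Feeding the validation error back into the next attempt is one possible retry strategy; a plain re-ask without feedback would fit the same loop.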

The Secret Sauce: Surgical Validation + Repair

This is what separates Compyl from a glorified wrapper. After generation, four strict cross-layer dependency checks run automatically:

UI ↔️ API: Every UI component must map to a real API endpoint.
API ↔️ DB: Every API endpoint must have a matching DB table.
UI ↔️ Auth: Every route used in auth rules must exist in the UI schema.
Auth ↔️ System: Every role used in pages must exist in the auth schema.
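
The four checks above can be sketched as plain set lookups over the blueprint. The blueprint shape used here (`pages`, `components`, `endpoints`, `tables`, `rules`) is an assumption made for illustration, as is the choice of which layer each error is attributed to; Compyl's real schemas may look different.

```python
def check_blueprint(bp: dict) -> list[dict]:
    """Run the four cross-layer checks and return one record per violation."""
    errors: list[dict] = []
    endpoints = bp["api"]["endpoints"]
    endpoint_paths = {e["path"] for e in endpoints}
    table_names = {t["name"] for t in bp["db"]["tables"]}
    pages = bp["ui"]["pages"]
    page_routes = {p["route"] for p in pages}
    auth_roles = set(bp["auth"]["roles"])

    def report(check: str, layer: str, message: str) -> None:
        errors.append({"check": check, "layer": layer, "message": message})

    for page in pages:
        # UI <-> API: every component must call an endpoint that exists.
        for comp in page["components"]:
            if comp["endpoint"] not in endpoint_paths:
                report("ui_api", "ui",
                       f"component '{comp['name']}' calls unknown endpoint '{comp['endpoint']}'")
        # Auth <-> System: every role used on a page must be declared.
        for role in page["roles"]:
            if role not in auth_roles:
                report("auth_system", "auth",
                       f"page '{page['route']}' uses undeclared role '{role}'")
    # API <-> DB: every endpoint must be backed by a table.
    for ep in endpoints:
        if ep["table"] not in table_names:
            report("api_db", "db", f"endpoint '{ep['path']}' needs missing table '{ep['table']}'")
    # UI <-> Auth: every route named in an auth rule must be a real page.
    for rule in bp["auth"]["rules"]:
        if rule["route"] not in page_routes:
            report("ui_auth", "auth", f"auth rule targets unknown route '{rule['route']}'")
    return errors


blueprint = {
    "ui": {"pages": [
        {"route": "/contacts", "roles": ["admin", "sales"],
         "components": [{"name": "ContactTable", "endpoint": "/api/contacts"}]},
    ]},
    "api": {"endpoints": [{"path": "/api/contacts", "table": "contacts"}]},
    "db": {"tables": [{"name": "contact"}]},  # singular name: drifted from the API layer
    "auth": {"roles": ["admin"], "rules": [{"route": "/dashboard", "role": "admin"}]},
}
found = check_blueprint(blueprint)
for err in found:
    print(err["check"], "->", err["message"])
```

Because each record names a layer, a later step can decide which part of the blueprint to send back for repair.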

The "Surgical" Fix

If a check fails, Compyl doesn't throw away the whole output and retry (which is slow and expensive). Instead, it performs a targeted repair—sending only the broken layer back to the LLM with the error logs for a precise fix.
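
A minimal sketch of that targeted repair loop, under stated assumptions: errors are grouped by layer, only the affected layer and its error messages are handed to the repair function, and the checks are re-run afterwards. `surgical_repair`, `check_roles`, and `fake_repair` are hypothetical names; `fake_repair` is a deterministic stand-in for the LLM call.

```python
import copy
from typing import Callable


def surgical_repair(blueprint: dict,
                    check: Callable[[dict], list[dict]],
                    repair_layer: Callable[[str, dict, list[str]], dict],
                    max_rounds: int = 2) -> tuple[dict, list[dict]]:
    """Repair only the layers that failed a check; return (blueprint, remaining errors)."""
    bp = copy.deepcopy(blueprint)  # leave the caller's draft untouched
    for _ in range(max_rounds):
        errors = check(bp)
        if not errors:
            break
        # Group error messages by the layer they belong to.
        by_layer: dict[str, list[str]] = {}
        for err in errors:
            by_layer.setdefault(err["layer"], []).append(err["message"])
        # Send each broken layer (and nothing else) out for a fix.
        for layer, messages in by_layer.items():
            bp[layer] = repair_layer(layer, bp[layer], messages)
    return bp, check(bp)


def check_roles(bp: dict) -> list[dict]:
    """Toy check: every role used on a page must be declared in the auth layer."""
    declared = set(bp["auth"]["roles"])
    return [{"layer": "auth", "message": f"undeclared role '{role}'"}
            for page in bp["ui"]["pages"] for role in page["roles"]
            if role not in declared]


calls: list[str] = []


def fake_repair(layer: str, fragment: dict, messages: list[str]) -> dict:
    """Stand-in for the LLM: adds each role named in the error messages."""
    calls.append(layer)
    missing = [m.split("'")[1] for m in messages]
    return {**fragment, "roles": fragment["roles"] + missing}


draft = {"ui": {"pages": [{"route": "/analytics", "roles": ["admin", "analyst"]}]},
         "auth": {"roles": ["admin"]}}
fixed, remaining = surgical_repair(draft, check_roles, fake_repair)
print(fixed["auth"]["roles"], remaining, calls)
```

In this run only the `auth` layer is ever passed to the repair function; the `ui` layer is never resent, which is where the savings over a full regeneration would come from.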

LLM vs SLM — A Deliberate Tradeoff

Stages 1–3 (Llama 3.3 70B via Groq): These stages need creative interpretation and have to make design decisions when the user's prompt is not very specific.
Stage 4 (Rule-Based / Future SLM): Since this stage is purely mechanical linting, I decided to swap it for a 3B–7B Small Language Model, and as expected, it reduced costs by ~30% and latency by ~40%.

Evaluation & Edge Cases

Tested against 10 real product prompts and 6 extreme edge cases:

Success Rate: 100% (after surgical retry).
Avg. Latency: 11.4 seconds end-to-end.
Repair Success Rate: 95.6% (out of 46 detected cross-layer errors).

How it handled edge cases:

"Build something cool" → Inferred context and built a fully functional entertainment app.
"App with login but also no auth needed" → Intelligently resolved the conflict by creating distinct Guest and Registered User roles.
"Make it like Uber + Amazon + Instagram" → Coherently merged them into 6 DB tables and 3 roles without breaking.

There were more cases I wanted to try, but Groq offers 100k tokens and I hit that limit before I could run them all. I did log the errors it ran into, the latency, and all the other metrics in detail in my GitHub README as well as in the Google Doc, for more specificity.

Google Doc: compyl log chec (docs.google.com)

The Tech Stack

LLM Engine: Llama 3.3 70B via Groq (Insanely fast inference)
Validation: Pydantic v2 (Strict data contracts)
Backend: FastAPI + Uvicorn
Runtime Proof: Generates raw SQL CREATE statements + Flask route skeletons from the final JSON.
Hosting: Docker on HuggingFace Spaces (Zero cold starts).
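
As an illustration of the "Runtime Proof" item, here is a minimal sketch of turning one table from the DB schema into a raw SQL CREATE statement (the Flask route skeletons are not covered). The table shape (`name`, `columns`, `type`, `primary_key`, `nullable`) and the type whitelist are assumptions, not Compyl's actual schema; SQLite is used only to show that the emitted statement executes.

```python
import re
import sqlite3

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
_ALLOWED_TYPES = {"INTEGER", "TEXT", "REAL", "BOOLEAN", "TIMESTAMP"}


def create_table_sql(table: dict) -> str:
    """Build a CREATE TABLE statement from one (hypothetical) blueprint table."""
    if not _IDENTIFIER.match(table["name"]):
        raise ValueError(f"unsafe table name: {table['name']!r}")
    column_defs = []
    for col in table["columns"]:
        col_type = col["type"].upper()
        # Names and types come from model output, so reject anything unexpected.
        if not _IDENTIFIER.match(col["name"]) or col_type not in _ALLOWED_TYPES:
            raise ValueError(f"unsafe column definition: {col!r}")
        parts = [col["name"], col_type]
        if col.get("primary_key"):
            parts.append("PRIMARY KEY")
        elif not col.get("nullable", True):
            parts.append("NOT NULL")
        column_defs.append(" ".join(parts))
    return f"CREATE TABLE {table['name']} (\n  " + ",\n  ".join(column_defs) + "\n);"


contacts = {"name": "contacts", "columns": [
    {"name": "id", "type": "integer", "primary_key": True},
    {"name": "email", "type": "text", "nullable": False},
    {"name": "notes", "type": "text"},
]}
ddl = create_table_sql(contacts)
print(ddl)

# Execute the statement against an in-memory SQLite database as a smoke test.
conn = sqlite3.connect(":memory:")
conn.execute(ddl)
columns = [row[1] for row in conn.execute("PRAGMA table_info(contacts)")]
conn.close()
print(columns)
```

Validating identifiers before interpolating them matters here because the names originate from LLM output and cannot be passed as bound parameters in DDL.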

Key Takeaways

The hardest part wasn't the LLM prompts; it was defining exactly what valid output looked like using Pydantic before writing code.
Initially, I used a temperature of 0.3 for the creative/design stages, but dropped it to 0.1 for the repair stage to make code correction more deterministic.
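
The temperature split can be kept in one small per-stage table. This is only a sketch: the stage keys are hypothetical, and treating Stages 1–3 as the "creative/design" stages is an assumption based on the stage table earlier in the post.

```python
# Per-stage sampling temperature (stage keys are hypothetical names).
STAGE_TEMPERATURE = {
    "lexer": 0.3,    # assumed to count as a creative/design stage
    "parser": 0.3,   # assumed to count as a creative/design stage
    "codegen": 0.3,  # assumed to count as a creative/design stage
    "repair": 0.1,   # lower temperature for more repeatable fixes
}


def sampling_params(stage: str) -> dict:
    """Return the sampling parameters to pass along with a stage's LLM request."""
    if stage not in STAGE_TEMPERATURE:
        raise KeyError(f"unknown stage: {stage}")
    return {"temperature": STAGE_TEMPERATURE[stage]}


print(sampling_params("repair"))
```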

Links & Feedback

🚀 Live Demo: HuggingFace Spaces
💻 Source Code: GitHub Repository

I would love to hear your thoughts on this project :)

Top comments (1)

Harjot Singh • May 31

Calling it a compiler is the right mental model and it's the move most people miss. A single prompt asking for a whole app blueprint is like asking a compiler to go from source to optimized binary in one undifferentiated step, you get something that looks right and breaks at the seams. Splitting into stages (intent to schema, schema to schema, validate, repair) means each stage has a narrow, checkable contract, and the synchronization across UI/API/DB/auth is exactly where single-shot generation falls apart, the model emits a UI field that the DB schema never declared and nobody notices until runtime. The surgical self-repair is the part I'd dig into: repairing the specific failing constraint instead of regenerating the whole blueprint is what keeps it from oscillating (fix one thing, break another). That's the difference between a compiler with error recovery and a slot machine. This staged, validated, repair-in-place architecture is almost exactly how I think about spec-to-app generation in Moonshift, structure plus verification beats one big confident output. How do you keep the four schemas synchronized, is there a single source-of-truth stage they all derive from, or a cross-validation pass that catches drift between layers?