DEV Community

Quratulain Nayeem


Beyond Prompting: Building a 4-Stage LLM Compiler with Surgical Self-Repair

A single prompt often yields inconsistent, unvalidated AI output. To fix this, I built Compyl, a multi-stage LLM compiler that takes plain-English input and converts it into a directly usable JSON blueprint.

Compyl converts plain English into a complete, validated, machine-readable JSON blueprint (UI schema, API schema, DB schema, and authentication rules) that can directly power a working application.

The Input: "Build a CRM with login, contacts, dashboard, role-based access, and payments. Admins can see analytics."
The Output: A fully synchronized JSON blueprint spanning all four layers.


Why Multi-Stage?

I wanted to break the workflow into modules, which makes each step easier to understand and to check for errors:

| Stage | Name | Purpose |
|---|---|---|
| Stage 1 | Lexer | Parses raw input into structured tokens (entities, roles, features). |
| Stage 2 | Parser | Builds the application architecture from those tokens. |
| Stage 3 | Code Gen | Emits four synchronized schemas. |
| Stage 4 | Linter/Repair | Catches and repairs cross-layer inconsistencies. |

Each stage is a separate Groq API call with its own system prompt, Pydantic output schema, and retry logic.
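To make the "one call, one schema, one retry loop" idea concrete, here is a minimal sketch of a stage wrapper. It is not Compyl's actual code: `run_stage`, `LexerTokens`, and the `call_llm` callable are hypothetical names, and the real Groq client call is deliberately left behind that callable so the sketch stays independent of any specific SDK.

```python
from typing import Callable, Type, TypeVar

from pydantic import BaseModel, ValidationError

T = TypeVar("T", bound=BaseModel)


class LexerTokens(BaseModel):
    """Hypothetical Stage 1 output: the fields are assumed from the stage table."""
    entities: list[str]
    roles: list[str]
    features: list[str]


def run_stage(
    call_llm: Callable[[str, str], str],  # (system_prompt, user_prompt) -> raw text
    system_prompt: str,
    user_input: str,
    schema: Type[T],
    max_retries: int = 3,
) -> T:
    """Call the model, validate the reply against `schema`, retry on failure."""
    prompt = user_input
    last_error = None
    for _ in range(max_retries):
        raw = call_llm(system_prompt, prompt)
        try:
            # Pydantic v2: parses the JSON string and validates it in one step.
            return schema.model_validate_json(raw)
        except ValidationError as exc:
            last_error = exc
            # Feed the validation errors back so the next attempt can correct them.
            prompt = (
                f"{user_input}\n\n"
                f"Your previous reply failed validation:\n{exc}\n"
                "Return only valid JSON."
            )
    raise RuntimeError(f"stage failed after {max_retries} attempts") from last_error
```

Passing the model call in as a plain function also makes each stage easy to test with a stub instead of a live API.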


[Image: the UI]

The Secret Sauce: Surgical Validation + Repair

This is what separates Compyl from a glorified wrapper. After generation, four strict cross-layer dependency checks run automatically:

  • UI ↔️ API: Every UI component must map to a real API endpoint.
  • API ↔️ DB: Every API endpoint must have a matching DB table.
  • UI ↔️ Auth: Every route used in auth rules must exist in the UI schema.
  • Auth ↔️ System: Every role used in pages must exist in the auth schema.
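The four checks above can be sketched as plain set lookups. This is an illustration under assumptions, not Compyl's implementation: the blueprint field names (`pages`, `components`, `endpoint`, `table`, `rules`) and the choice of which layer gets blamed for each failure are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    check: str    # which cross-layer check failed
    layer: str    # layer assumed to need repair (a design choice in this sketch)
    message: str


def validate_blueprint(bp: dict) -> list[Issue]:
    """Run the four cross-layer checks on a blueprint with an assumed shape."""
    endpoints = {e["path"] for e in bp["api"]["endpoints"]}
    tables = {t["name"] for t in bp["db"]["tables"]}
    routes = {p["route"] for p in bp["ui"]["pages"]}
    roles = set(bp["auth"]["roles"])
    issues = []

    for page in bp["ui"]["pages"]:
        # UI <-> API: every component must point at a real endpoint.
        for comp in page["components"]:
            if comp["endpoint"] not in endpoints:
                issues.append(Issue("ui-api", "api",
                                    f"no endpoint {comp['endpoint']} for {comp['name']}"))
        # Auth <-> System: every role used on a page must be declared.
        for role in page["roles"]:
            if role not in roles:
                issues.append(Issue("auth-system", "auth",
                                    f"role {role} on {page['route']} is not declared"))

    # API <-> DB: every endpoint must be backed by a table.
    for ep in bp["api"]["endpoints"]:
        if ep["table"] not in tables:
            issues.append(Issue("api-db", "db",
                                f"no table {ep['table']} for {ep['path']}"))

    # UI <-> Auth: every route named in an auth rule must exist in the UI.
    for rule in bp["auth"]["rules"]:
        if rule["route"] not in routes:
            issues.append(Issue("ui-auth", "ui",
                                f"auth rule references unknown route {rule['route']}"))

    return issues
```

Because each issue carries the layer it is attributed to, the output can be grouped by layer before any repair step runs.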

The "Surgical" Fix

If a check fails, Compyl doesn't throw away the whole output and retry (which is slow and expensive). Instead, it performs a targeted repair: it sends only the broken layer back to the LLM, together with the error logs, for a precise fix.
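A minimal sketch of that targeted repair loop is shown below. The names `surgical_repair`, `validate`, and `fix_layer` are hypothetical, and `fix_layer` stands in for the LLM call that receives one layer plus its error messages; the round limit is also an assumption.

```python
from typing import Callable

# validate: blueprint -> {layer name: [error messages]}; empty dict means consistent.
Validator = Callable[[dict], dict[str, list[str]]]
# fix_layer: (layer name, that layer's JSON, its errors) -> repaired layer JSON.
LayerFixer = Callable[[str, dict, list[str]], dict]


def surgical_repair(
    blueprint: dict,
    validate: Validator,
    fix_layer: LayerFixer,
    max_rounds: int = 2,
) -> dict:
    """Repair only the layers that fail validation; leave the others untouched."""
    current = dict(blueprint)  # shallow copy: untouched layers are reused as-is
    for _ in range(max_rounds):
        errors_by_layer = validate(current)
        if not errors_by_layer:
            return current
        for layer, errors in errors_by_layer.items():
            # Only the broken layer and its own errors are sent for repair.
            current[layer] = fix_layer(layer, current[layer], errors)
    if validate(current):
        raise ValueError("blueprint is still inconsistent after repair")
    return current
```

The point of the design is cost: a fix that touches one layer only pays for that layer's tokens, and the layers that already passed cannot regress.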


LLM vs SLM — A Deliberate Tradeoff

  • Stages 1–3 (Llama 3.3 70B via Groq): Chosen primarily because these stages need creative interpretation and have to make decisions when the user's prompt is not very specific.
  • Stage 4 (Rule-Based / Future SLM): Since this stage is purely mechanical linting, I decided to swap it for a 3B–7B Small Language Model, and as expected, it reduced costs by ~30% and latency by ~40%.

Evaluation & Edge Cases

Tested against 10 real product prompts and 6 extreme edge cases:

  • Success Rate: 100% (after surgical retry).
  • Avg. Latency: 11.4 seconds end-to-end.
  • Repair Success Rate: 95.6% (out of 46 detected cross-layer errors).

How it handled edge cases:

  • "Build something cool" → Inferred context and built a fully functional entertainment app.
  • "App with login but also no auth needed" → Intelligently resolved the conflict by creating distinct Guest and Registered User roles.
  • "Make it like Uber + Amazon + Instagram" → Coherently merged them into 6 DB tables and 3 roles without breaking.

There were more cases I wanted to try. However, Groq offers 100k tokens, and I eventually hit that limit while running all the cases. I did log the errors it ran into, the latency, and all the other metrics in detail in my GitHub README as well as in a Google Doc, for more specificity.

The Tech Stack

  • LLM Engine: Llama 3.3 70B via Groq (Insanely fast inference)
  • Validation: Pydantic v2 (Strict data contracts)
  • Backend: FastAPI + Uvicorn
  • Runtime Proof: Generates raw SQL CREATE statements + Flask route skeletons from the final JSON.
  • Hosting: Docker on HuggingFace Spaces (Zero cold starts).
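As an illustration of the "Runtime Proof" step, here is a small sketch that turns one table entry of a DB schema into a `CREATE TABLE` statement. The schema shape (`name`, `columns`, `type`, `primary_key`, `nullable`), the allowed type list, and the function name are assumptions for the example, not Compyl's actual format.

```python
import re

# Model-generated names end up inside SQL, so accept only plain identifiers.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
# Assumed allowlist of column types for this sketch.
_ALLOWED_TYPES = {"INTEGER", "TEXT", "REAL", "BOOLEAN", "TIMESTAMP"}


def _ident(name: str) -> str:
    if not _IDENT.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def create_table_sql(table: dict) -> str:
    """Render one table of the (assumed) DB schema as a CREATE TABLE statement."""
    lines = []
    for col in table["columns"]:
        col_type = col["type"].upper()
        if col_type not in _ALLOWED_TYPES:
            raise ValueError(f"unsupported column type: {col['type']!r}")
        parts = [_ident(col["name"]), col_type]
        if col.get("primary_key"):
            parts.append("PRIMARY KEY")
        elif not col.get("nullable", True):
            parts.append("NOT NULL")
        lines.append("  " + " ".join(parts))
    return f"CREATE TABLE {_ident(table['name'])} (\n" + ",\n".join(lines) + "\n);"
```

Running the generated statement against an in-memory SQLite database is a cheap way to confirm that the blueprint produces SQL that actually parses.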

Key Takeaways

  1. The hardest part wasn't the LLM prompts; it was defining exactly what valid output looked like using Pydantic before writing code.
  2. Temperature matters per stage. I used a temperature of 0.3 for the creative/design stages, but dropped it to 0.1 for the repair stage to push the model toward near-deterministic code correction.

Links & Feedback

I would love to hear your thoughts on this project :)
