A single prompt often yields inconsistent, unvalidated AI output. To fix this, I built Compyl, a multi-stage LLM compiler that takes plain English and converts it into a directly usable JSON blueprint.
Compyl converts plain English into a complete, validated, machine-readable JSON blueprint (UI schema, API schema, DB schema, and authentication rules) directly usable to power a working application.
The Input: "Build a CRM with login, contacts, dashboard, role-based access, and payments. Admins can see analytics."
The Output: A fully synchronized JSON blueprint spanning all four layers.
Why Multi-Stage?
I wanted to break the workflow into modules, for easier understanding and error checking:
| Stage | Name | Purpose |
|---|---|---|
| Stage 1 | Lexer | Parses raw input into structured tokens (entities, roles, features). |
| Stage 2 | Parser | Builds the application architecture from those tokens. |
| Stage 3 | Code Gen | Emits four synchronized schemas. |
| Stage 4 | Linter/Repair | Catches and repairs cross-layer inconsistencies. |
Each stage is a separate Groq API call with its own system prompt, Pydantic output schema, and retry logic.
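As a rough illustration of that per-stage pattern (system prompt, output schema, retry), here is a minimal sketch using only the standard library. Everything named here is an assumption for illustration: `run_stage`, `call_llm`, `Tokens`, and `parse_tokens` are hypothetical, the LLM call is injected as a plain function rather than a real Groq client, and a hand-written validator stands in for a Pydantic schema.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Tokens:
    """Hypothetical Stage 1 output shape: entities, roles, features."""
    entities: list[str]
    roles: list[str]
    features: list[str]


def parse_tokens(raw: str) -> Tokens:
    """Stand-in for a schema check: raises ValueError if the reply does not fit."""
    data = json.loads(raw)  # invalid JSON raises JSONDecodeError, a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    for key in ("entities", "roles", "features"):
        value = data.get(key)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"'{key}' must be a list of strings")
    return Tokens(data["entities"], data["roles"], data["features"])


def run_stage(
    call_llm: Callable[[str, str], str],  # (system_prompt, user_prompt) -> raw reply
    system_prompt: str,
    user_input: str,
    parse: Callable[[str], T],
    max_retries: int = 3,
) -> T:
    """Call the model, validate the reply, and retry with the error on failure."""
    last_error: Optional[Exception] = None
    prompt = user_input
    for _ in range(max_retries):
        raw = call_llm(system_prompt, prompt)
        try:
            return parse(raw)
        except ValueError as exc:
            last_error = exc
            # Feed the validation error back so the next attempt can correct it.
            prompt = f"{user_input}\n\nPrevious reply was rejected: {exc}"
    raise RuntimeError(f"stage failed after {max_retries} attempts") from last_error
```

In a real setup, `call_llm` would wrap the actual API client and a Pydantic model's validation would presumably take the place of `parse_tokens`; the retry loop itself would stay the same.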
The Secret Sauce: Surgical Validation + Repair
This is what separates Compyl from a glorified wrapper. After generation, four strict cross-layer dependency checks run automatically:
- UI ↔️ API: Every UI component must map to a real API endpoint.
- API ↔️ DB: Every API endpoint must have a matching DB table.
- UI ↔️ Auth: Every route used in auth rules must exist in the UI schema.
- Auth ↔️ System: Every role used in pages must exist in the auth schema.
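The four checks above can be sketched as plain set-membership tests. This is only an illustration under an assumed blueprint shape: the keys used here (`pages`, `components`, `endpoint`, `table`, `rules`, and so on) and the function name `check_blueprint` are hypothetical, not Compyl's actual schema.

```python
def check_blueprint(bp: dict) -> list[tuple[str, str]]:
    """Return (check_name, message) pairs; an empty list means all checks pass."""
    issues: list[tuple[str, str]] = []

    pages = bp["ui"]["pages"]
    api_paths = {e["path"] for e in bp["api"]["endpoints"]}
    table_names = {t["name"] for t in bp["db"]["tables"]}
    ui_routes = {p["route"] for p in pages}
    auth_roles = set(bp["auth"]["roles"])

    for page in pages:
        # UI <-> API: every component must point at a declared endpoint.
        for comp in page.get("components", []):
            if comp["endpoint"] not in api_paths:
                issues.append(
                    ("ui-api", f"component {comp['name']} calls unknown endpoint {comp['endpoint']}")
                )
        # Auth <-> System: every role used on a page must be declared in auth.
        for role in page.get("roles", []):
            if role not in auth_roles:
                issues.append(
                    ("auth-system", f"page {page['route']} uses undefined role {role}")
                )

    # API <-> DB: every endpoint must be backed by a declared table.
    for endpoint in bp["api"]["endpoints"]:
        if endpoint["table"] not in table_names:
            issues.append(
                ("api-db", f"endpoint {endpoint['path']} reads unknown table {endpoint['table']}")
            )

    # UI <-> Auth: every route named in an auth rule must exist as a page.
    for rule in bp["auth"]["rules"]:
        if rule["route"] not in ui_routes:
            issues.append(("ui-auth", f"auth rule targets unknown route {rule['route']}"))

    return issues
```

Because the checks are deterministic lookups, they can run after every generation without any extra model call.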
The "Surgical" Fix
If a check fails, Compyl doesn't throw away the whole output and retry (which is slow and expensive). Instead, it performs a targeted repair—sending only the broken layer back to the LLM with the error logs for a precise fix.
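A minimal sketch of that targeted-repair idea, assuming the validator reports which layer each error belongs to. The names `surgical_repair`, `validate`, and `repair_layer` are hypothetical, and the model call is injected as a function so the control flow can be shown without a real API client.

```python
import copy
from typing import Callable

Issue = tuple[str, str]  # (layer name, error message); an assumed shape


def surgical_repair(
    blueprint: dict,
    validate: Callable[[dict], list[Issue]],
    repair_layer: Callable[[str, dict, list[str]], dict],
    max_rounds: int = 2,
) -> tuple[dict, list[Issue]]:
    """Re-send only the failing layers, leaving every other layer untouched."""
    bp = copy.deepcopy(blueprint)  # never mutate the caller's blueprint
    for _ in range(max_rounds):
        issues = validate(bp)
        if not issues:
            return bp, []
        # Group error messages by the layer they belong to.
        by_layer: dict[str, list[str]] = {}
        for layer, message in issues:
            by_layer.setdefault(layer, []).append(message)
        # Only the broken layers, plus their error logs, go back to the model.
        for layer, errors in by_layer.items():
            bp[layer] = repair_layer(layer, bp[layer], errors)
    return bp, validate(bp)  # any issues still left after the last round
```

The cost saving comes from the payload: a repair request carries one layer and a short error list instead of the whole prompt and all four schemas.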
LLM vs SLM — A Deliberate Tradeoff
- Stages 1–3 (Llama 3.3 70B via Groq): Chosen primarily because these stages involve creative input and decision-making when the user is not very specific in the prompt.
- Stage 4 (Rule-Based / Future SLM): Since this stage is purely mechanical linting, I decided to swap it for a 3B–7B Small Language Model, and as expected, it reduced costs by ~30% and latency by ~40%.
Evaluation & Edge Cases
Tested against 10 real product prompts and 6 extreme edge cases:
- Success Rate: 100% (after surgical retry).
- Avg. Latency: 11.4 seconds end-to-end.
- Repair Success Rate: 95.6% (out of 46 detected cross-layer errors).
How it handled edge cases:
- "Build something cool" → Inferred context and built a fully functional entertainment app.
- "App with login but also no auth needed" → Intelligently resolved the conflict by creating distinct `Guest` and `Registered User` roles.
- "Make it like Uber + Amazon + Instagram" → Coherently merged them into 6 DB tables and 3 roles without breaking.
There were more cases I wanted to try, but Groq offers 100k tokens and I hit that limit before I could get through all of them. I did log the errors it ran into, the latency, and all the other metrics in detail in my GitHub README as well as the Google Doc, for more specifics.
The Tech Stack
- LLM Engine: Llama 3.3 70B via Groq (Insanely fast inference)
- Validation: Pydantic v2 (Strict data contracts)
- Backend: FastAPI + Uvicorn
- Runtime Proof: Generates raw SQL `CREATE` statements + Flask route skeletons from the final JSON.
- Hosting: Docker on HuggingFace Spaces (Zero cold starts).
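To give a feel for the runtime-proof step, here is a small sketch of turning a DB schema into `CREATE TABLE` statements. The schema shape (`tables`, `columns`, `primary_key`, `nullable`) and the function name `create_statements` are assumptions for illustration, not Compyl's actual format, and column types are limited to bare names such as `INTEGER` or `TEXT`.

```python
import re

# Plain identifiers only, so LLM-generated names cannot inject SQL.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _ident(name: str) -> str:
    """Return the name unchanged if it is a safe identifier, else raise."""
    if not _IDENT.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


def create_statements(db_schema: dict) -> list[str]:
    """Build one CREATE TABLE statement per table in the (assumed) schema."""
    statements = []
    for table in db_schema["tables"]:
        columns = []
        for col in table["columns"]:
            parts = [_ident(col["name"]), _ident(col["type"])]
            if col.get("primary_key"):
                parts.append("PRIMARY KEY")
            elif not col.get("nullable", True):
                parts.append("NOT NULL")
            columns.append(" ".join(parts))
        statements.append(
            f"CREATE TABLE {_ident(table['name'])} ({', '.join(columns)});"
        )
    return statements


schema = {
    "tables": [
        {
            "name": "contacts",
            "columns": [
                {"name": "id", "type": "INTEGER", "primary_key": True},
                {"name": "email", "type": "TEXT", "nullable": False},
            ],
        }
    ]
}
print(create_statements(schema)[0])
# CREATE TABLE contacts (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
```

Validating identifiers before interpolation matters here because every table and column name ultimately comes from model output.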
Key Takeaways
- The hardest part wasn't the LLM prompts; it was defining exactly what valid output looked like using Pydantic before writing code.
- Initially, I used a temperature of `0.3` for the creative/design stages, but dropped it to `0.1` for the repair stage to force deterministic code correction.
Links & Feedback
- 🚀 Live Demo: HuggingFace Spaces
- 💻 Source Code: GitHub Repository
I would love to hear your thoughts on this project :)
