Beyond Prompting: Building a 4-Stage LLM Compiler with Surgical Self-Repair

Quratulain Nayeem — Tue, 26 May 2026 16:46:06 +0000

A single prompt often yields inconsistent, unvalidated AI output. To fix this, I built Compyl, a multi-stage LLM compiler that takes plain English as input and converts it into a directly usable JSON blueprint.

Compyl converts plain English into a complete, validated, machine-readable JSON blueprint (UI schema, API schema, DB schema, and authentication rules) that can directly power a working application.

The Input: "Build a CRM with login, contacts, dashboard, role-based access, and payments. Admins can see analytics."
The Output: A fully synchronized JSON blueprint spanning all four layers.

Why Multi-Stage?

I wanted to break the workflow into modules, so that each step is easier to understand and to check for errors:

Stage	Name	Purpose
Stage 1	Lexer	Parses raw input into structured tokens (entities, roles, features).
Stage 2	Parser	Builds the application architecture from those tokens.
Stage 3	Code Gen	Emits four synchronized schemas.
Stage 4	Linter/Repair	Catches and repairs cross-layer inconsistencies.

Each stage is a separate Groq API call with its own system prompt, Pydantic output schema, and retry logic.
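
To make the "one call, one schema, one retry loop" idea concrete, here is a minimal sketch of a stage wrapper. It is not the actual Compyl code. The names `run_stage`, `validate_tokens`, and `call_llm` are hypothetical, the LLM call is passed in as a plain function instead of a real Groq client, and a small hand-written validator stands in for the Pydantic model so the sketch runs with the standard library only.

```python
import json

# Hypothetical Stage 1 fields, taken from the table above.
REQUIRED_KEYS = ("entities", "roles", "features")


def validate_tokens(raw: str) -> dict:
    """Stand-in for a Pydantic model: parse JSON and check the expected shape."""
    data = json.loads(raw)  # malformed JSON raises JSONDecodeError, a ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key in REQUIRED_KEYS:
        value = data.get(key)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"'{key}' must be a list of strings")
    return data


def run_stage(call_llm, system_prompt, user_input, validate, max_retries=3):
    """Call the model, validate its output, and retry with the error attached."""
    prompt = user_input
    last_error = None
    for _ in range(max_retries):
        raw = call_llm(system_prompt, prompt)
        try:
            return validate(raw)
        except ValueError as exc:
            last_error = exc
            # Feed the validation error back so the next attempt can correct it.
            prompt = f"{user_input}\n\nYour previous output was rejected: {exc}"
    raise RuntimeError(f"stage failed after {max_retries} attempts: {last_error}")
```

With a real client, `call_llm` would wrap the API request, and each stage would pass its own system prompt and validator.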

The Secret Sauce: Surgical Validation + Repair

This is what separates Compyl from a glorified wrapper. After generation, four strict cross-layer dependency checks run automatically:

UI ↔️ API: Every UI component must map to a real API endpoint.
API ↔️ DB: Every API endpoint must have a matching DB table.
UI ↔️ Auth: Every route used in auth rules must exist in the UI schema.
Auth ↔️ System: Every role used in pages must exist in the auth schema.
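
The four checks above can be sketched as plain set lookups over the blueprint. This is an illustration under assumptions, not Compyl's real validator: the blueprint layout (`ui.pages`, `api.endpoints`, `db.tables`, `auth.roles`, `auth.rules`) and the names `make_example` and `check_blueprint` are hypothetical.

```python
def make_example() -> dict:
    """A tiny, hypothetical blueprint that passes all four checks."""
    return {
        "ui": {"pages": [{
            "route": "/contacts",
            "roles": ["admin"],
            "components": [{"name": "ContactList", "endpoint": "/api/contacts"}],
        }]},
        "api": {"endpoints": [{"path": "/api/contacts", "table": "contacts"}]},
        "db": {"tables": [{"name": "contacts"}]},
        "auth": {"roles": ["admin"], "rules": [{"route": "/contacts", "allow": ["admin"]}]},
    }


def check_blueprint(bp: dict) -> list:
    """Return a list of (check_name, message) tuples; empty means consistent."""
    errors = []
    api_paths = {e["path"] for e in bp["api"]["endpoints"]}
    tables = {t["name"] for t in bp["db"]["tables"]}
    routes = {p["route"] for p in bp["ui"]["pages"]}
    roles = set(bp["auth"]["roles"])

    for page in bp["ui"]["pages"]:
        # UI <-> API: every component must call a known endpoint.
        for comp in page["components"]:
            if comp["endpoint"] not in api_paths:
                errors.append(("ui-api", f"{comp['name']} calls unknown endpoint {comp['endpoint']}"))
        # Auth <-> System: every role used on a page must be declared.
        for role in page["roles"]:
            if role not in roles:
                errors.append(("auth-system", f"page {page['route']} uses undeclared role {role}"))

    # API <-> DB: every endpoint must be backed by a table.
    for ep in bp["api"]["endpoints"]:
        if ep["table"] not in tables:
            errors.append(("api-db", f"endpoint {ep['path']} needs missing table {ep['table']}"))

    # UI <-> Auth: every route in an auth rule must exist in the UI.
    for rule in bp["auth"]["rules"]:
        if rule["route"] not in routes:
            errors.append(("ui-auth", f"auth rule targets unknown route {rule['route']}"))

    return errors
```

Because every check returns a message instead of raising, all inconsistencies can be collected in one pass and handed to the repair step together.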

The "Surgical" Fix

If a check fails, Compyl doesn't throw away the whole output and retry (which is slow and expensive). Instead, it performs a targeted repair—sending only the broken layer back to the LLM with the error logs for a precise fix.
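
A rough sketch of what such a targeted repair loop could look like is shown here. It is an assumption-heavy illustration, not the real implementation: `surgical_repair`, `validate`, and `call_llm` are hypothetical names, `validate` is assumed to return `(layer, message)` pairs, and the model is assumed to reply with the corrected layer as JSON.

```python
import json


def surgical_repair(blueprint, validate, call_llm, max_rounds=2):
    """Repair only the layers that failed validation, leaving the rest untouched."""
    for _ in range(max_rounds):
        errors = validate(blueprint)
        if not errors:
            return blueprint

        # Group error messages by the layer they belong to.
        by_layer = {}
        for layer, message in errors:
            by_layer.setdefault(layer, []).append(message)

        for layer, messages in by_layer.items():
            # The prompt carries just the broken layer and its errors,
            # not the whole blueprint.
            prompt = json.dumps({
                "layer": layer,
                "current": blueprint[layer],
                "errors": messages,
            })
            fixed_layer = json.loads(call_llm(prompt))
            # Build a new dict so the caller's original blueprint is not mutated.
            blueprint = {**blueprint, layer: fixed_layer}

    remaining = validate(blueprint)
    if remaining:
        raise RuntimeError(f"unrepaired errors after {max_rounds} rounds: {remaining}")
    return blueprint
```

The token savings come from the prompt size: a single layer plus a few error lines is much smaller than regenerating all four schemas.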

LLM vs SLM — A Deliberate Tradeoff

Stages 1–3 (Llama 3.3 70B via Groq): These stages use a large model because they involve creative input and have to make decisions when the user's prompt is not very specific.
Stage 4 (Rule-Based / Future SLM): Since this stage is purely mechanical linting, I decided to swap it for a 3B–7B Small Language Model and, as expected, it reduced costs by ~30% and latency by ~40%.

Evaluation & Edge Cases

Tested against 10 real product prompts and 6 extreme edge cases:

Success Rate: 100% (after surgical retry).
Avg. Latency: 11.4 seconds end-to-end.
Repair Success Rate: 95.6% (out of 46 detected cross-layer errors).

How it handled edge cases:

"Build something cool" → Inferred context and built a fully functional entertainment app.
"App with login but also no auth needed" → Intelligently resolved the conflict by creating distinct Guest and Registered User roles.
"Make it like Uber + Amazon + Instagram" → Coherently merged them into 6 DB tables and 3 roles without breaking.

There were more cases I wanted to try, but Groq offers 100k tokens and I eventually hit that limit while working through the cases. I did log the errors it ran into, the latency, and all the other metrics in detail in my GitHub README as well as in the Google Doc, for more specificity.

compyl log chec - Google Docs

docs.google.com

The Tech Stack

LLM Engine: Llama 3.3 70B via Groq (Insanely fast inference)
Validation: Pydantic v2 (Strict data contracts)
Backend: FastAPI + Uvicorn
Runtime Proof: Generates raw SQL CREATE statements + Flask route skeletons from the final JSON.
Hosting: Docker on HuggingFace Spaces (Zero cold starts).
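
As an illustration of the "Runtime Proof" step, the sketch below turns one table entry from a DB schema into a `CREATE TABLE` statement. The table format (`name`, `columns`, `primary_key`, `nullable`), the allowed type list, and the function name `create_table_sql` are assumptions for this example, not Compyl's actual schema.

```python
import re

# Only plain identifiers and a small set of column types are accepted,
# so model-generated names cannot smuggle extra SQL into the statement.
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
ALLOWED_TYPES = {"INTEGER", "TEXT", "REAL", "BOOLEAN", "TIMESTAMP"}


def _identifier(name: str) -> str:
    if not IDENTIFIER.match(name):
        raise ValueError(f"invalid identifier: {name!r}")
    return name


def create_table_sql(table: dict) -> str:
    """Render one table definition as a CREATE TABLE statement."""
    columns = []
    for col in table["columns"]:
        col_type = col["type"].upper()
        if col_type not in ALLOWED_TYPES:
            raise ValueError(f"unsupported column type: {col['type']!r}")
        parts = [_identifier(col["name"]), col_type]
        if col.get("primary_key"):
            parts.append("PRIMARY KEY")
        if not col.get("nullable", True):
            parts.append("NOT NULL")
        columns.append(" ".join(parts))
    body = ",\n  ".join(columns)
    return f"CREATE TABLE {_identifier(table['name'])} (\n  {body}\n);"
```

Running the generated statement against an in-memory SQLite database is a cheap way to confirm that the emitted SQL actually parses.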

Key Takeaways

The hardest part wasn't the LLM prompts; it was defining exactly what valid output looked like using Pydantic before writing code.
I used a temperature of 0.3 for the creative/design stages, but dropped it to 0.1 for the repair stage to make code correction as deterministic as possible.
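
One simple way to keep these per-stage settings in one place is a small lookup table. This is only a sketch: the stage keys are hypothetical, and it assumes that "creative/design stages" means Stages 1 to 3.

```python
# Sampling temperature per stage.
# Assumption: Stages 1-3 are the creative/design stages; Stage 4 is repair.
STAGE_TEMPERATURE = {
    "lexer": 0.3,
    "parser": 0.3,
    "codegen": 0.3,
    "repair": 0.1,
}


def temperature_for(stage: str) -> float:
    """Look up the temperature for a stage, failing loudly on unknown names."""
    try:
        return STAGE_TEMPERATURE[stage]
    except KeyError:
        raise ValueError(f"unknown stage: {stage!r}") from None
```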

Links & Feedback

🚀 Live Demo: HuggingFace Spaces
💻 Source Code: GitHub Repository

I would love to hear your thoughts on this project :)

90% of the internships I applied to weren’t real. So I’m building a way to expose them.

Quratulain Nayeem — Sat, 25 Apr 2026 19:56:06 +0000

If you're a student or dev in India right now, you know the grind. I've spent months obsessing over my resume, hunting for keywords, and sending out countless applications for AI/ML roles. I did everything "right." The cold emails, the startup hunting, the works.

And what do you get in return? Either complete silence, or worse, a website asking you to PAY for an internship. ₹1,499 for a "1-Month AI Internship." As if the job market wasn't already stacked against us.

But after months of this, I realized the problem wasn't my resume. The problem was that I was applying to "ghosts".
I started digging and realized that a massive chunk of job listings (some estimate up to 90% in certain sectors) are ghost listings. Companies post them just to collect our data, gauge salary expectations, or build a "talent pipeline" for a role that doesn't actually exist. It’s a waste of our time and a hit to our confidence.
I’m an AI student. Why am I not using my skills to fix this?
I decided to stop just "applying" and start "building."
I’m currently developing a Chrome extension designed to give job seekers a real-time "Credibility Score" for every listing they see on LinkedIn.

How it works (The Tech Behind the Logic):

I didn't want to build just another "AI wrapper." I’m building a system that aggregates five distinct signal categories to calculate a legitimacy score:

Company Legitimacy: Cross-referencing domains and registration data.
Recruiter Credibility: Analyzing the profile behind the post.
Posting Behavior: Tracking how long a post stays up vs. actual hiring signals.
On-Device ML: I’m planning to use ONNX Runtime Web to run text classification locally in the browser to keep it fast and privacy-first.
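
To show how several signals could roll up into one number, here is a minimal scoring sketch. The extension itself is still being designed, so everything here is an assumption: the signal keys, the weights, the 0 to 1 input range, and the 0 to 100 output scale are placeholders chosen for illustration.

```python
# Hypothetical weights for the signal categories listed above.
WEIGHTS = {
    "company_legitimacy": 0.30,
    "recruiter_credibility": 0.25,
    "posting_behavior": 0.25,
    "text_classifier": 0.20,
}


def credibility_score(signals: dict) -> float:
    """Weighted average of the available signals, scaled to 0-100.

    Each signal is expected in [0, 1]. Missing signals are skipped and the
    remaining weights are renormalized, so a listing with partial data still
    gets a score.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for name, weight in WEIGHTS.items():
        value = signals.get(name)
        if value is None:
            continue
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"signal {name!r} must be in [0, 1], got {value}")
        weighted_sum += weight * value
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no known signals provided")
    return round(100 * weighted_sum / total_weight, 1)
```

Renormalizing over the available signals is one design choice among several. An alternative would be to treat missing data as a penalty, which is stricter on listings where little can be verified.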

Why an extension and not a native app?

LinkedIn’s official APIs are locked behind enterprise gates. To help people now, we need to be where the jobs are. By using a Manifest V3 extension, I can provide an instant "Truth Layer" directly on the page without waiting for a formal partnership that may never come.

The Goal:

I’ve previously built production AI systems that analyzed over 568k reviews and engineered RAG pipelines from scratch. Now, I’m applying that same "production-first" mindset to help students like me avoid the ghost-job trap.

I’m looking for collaborators!

This is still a work in progress. I’m currently mapping out the Vercel Edge functions and refining the ML text classifier.
If you're tired of ghost listings and want to help build this, connect with me on LinkedIn or drop a comment below.

Let’s stop applying into thin air :)

DEV Community: Quratulain Nayeem