Nrk Raju Guthikonda

I've Shipped 113 Local-AI Projects in 18 Months. Here Are the Five Architecture Patterns That Actually Survived

Tags: ai, llm, architecture, opensource

A weird thing happens around project number forty.

You stop being excited about model picking. You stop arguing about which framework to use. You stop thinking the hard part is the LLM. The boring patterns — the ones you reach for whether the project is a healthcare summarizer, a legal-clause analyzer, or an AI that calls a restaurant and orders biryani — start to crystallize into a small set of choices that you make almost without thinking. Everything else is decoration.

I've shipped 113 original applied-AI projects under the GitHub identity kennedyraju55 in the last 18 months — every one runs on local LLM inference (Gemma 3 4B + Ollama), no cloud, no API keys. The portfolio spans healthcare, legal, education, finance, security, dev tools, voice agents, and creative-writing assistants. Across all of them, the same five architecture decisions have either saved or sunk the project. This post is the retrospective I wish someone had handed me at project number one.

Pattern 1: The LLM is a function call, not a framework

The first ten projects all imported some flavor of "agent framework." LangChain, LlamaIndex, AutoGPT-likes, CrewAI. Every one of them got ripped out by the time the project shipped.

The reason was the same every time. The framework wanted to manage the LLM — chains, agents, memory, tool dispatch — and what I actually needed was the opposite. I wanted the LLM to be a small, deterministic function I called from inside my control flow. Input goes in, structured JSON comes out, my code decides what happens next. That's the whole shape.

The pattern that survived is almost embarrassingly small:

import json

import ollama


def ask_llm(system, user, schema=None):
    """One LLM call: prompt in, parsed JSON out. Control flow stays outside."""
    resp = ollama.chat(
        model="gemma3:4b",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
        options={"temperature": 0, "num_ctx": 8192},
        format=schema,  # ollama's structured-output mode (a JSON schema dict)
    )
    return json.loads(resp["message"]["content"])

That's the entire interface. Every one of the 113 projects has some variant of this function and very little else from any "framework." The control flow — when to call it, what to do with the response, how to recover from a malformed output — lives in plain Python. The agent pattern, when I needed one, became a while loop with a step counter. That's not a downgrade. That's the point.
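
When a project did need the agent shape, it looked roughly like the sketch below. This is a hedged illustration, not code from any specific repo: the step budget, tool table, and schema are placeholders, and ask_llm is the function from Pattern 1.

MAX_STEPS = 8  # hypothetical step budget

TOOLS = {
    # hypothetical tool table; real projects register real functions here
    "search_notes": lambda q: f"(results for {q!r})",
}

STEP_SCHEMA = {
    "type": "object",
    "properties": {
        "action": {"type": "string", "enum": ["call_tool", "finish"]},
        "tool": {"type": "string"},
        "tool_input": {"type": "string"},
        "answer": {"type": "string"},
    },
    "required": ["action"],
}

def run_agent(task):
    history = [f"Task: {task}"]
    for _ in range(MAX_STEPS):  # the step counter
        step = ask_llm(
            system="Decide the next step. Respond ONLY with JSON matching the schema.",
            user="\n".join(history),
            schema=STEP_SCHEMA,
        )
        if step["action"] == "finish":
            return step.get("answer", "")
        tool = TOOLS.get(step.get("tool", ""))
        if tool is None:
            history.append(f"Unknown tool: {step.get('tool')}")
            continue
        history.append(f"{step['tool']} returned: {tool(step.get('tool_input', ''))}")
    return "(step budget exhausted)"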

Pattern 2: Compute everything you can before you generate

The single biggest quality difference between projects 1–20 and projects 80–113 is this: never ask the LLM to do something a deterministic function can do better.

The flaky-test triager I described in the last post is the cleanest example. The LLM does not compute the failure rate. The LLM does not count the recent outcomes. The LLM does not parse the timestamp. Python does all of that, hands the LLM the numbers, and asks the LLM exactly one question: given these numbers, classify this failure into one of five categories.

Same shape in the healthcare projects: I never ask the LLM to extract drug-name + dosage + frequency from the note. I extract those with a deterministic NER pipeline first, hand the LLM the structured triple, and ask the LLM to assess the clinical intent. Same in the legal projects: the contract date, the parties, the governing law — those come from structured extraction. The LLM is asked the interpretive question on top of the structured ground truth.

The rule, sharpened over 113 projects: make the LLM's job a reading-comprehension question with the facts already labeled. Models, especially small open ones, are excellent reading-comprehension engines. They are mediocre extraction engines and worse arithmetic engines. Treat them as their best self.
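Concretely, the triager's shape is something like this. I'm hedging: the field names, categories, and fact format below are illustrative rather than the exact code from that project, and ask_llm is the Pattern 1 function.

from collections import Counter

TRIAGE_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": [
            "flaky_intermittent", "recently_broken", "env_dependent",
            "always_failing", "unknown"]},
        "confidence": {"type": "number"},
        "reasoning": {"type": "string"},
    },
    "required": ["category", "confidence", "reasoning"],
}

def triage(test_name, outcomes):  # outcomes: list of "pass"/"fail", newest last
    counts = Counter(outcomes)                     # Python counts the outcomes
    failure_rate = counts["fail"] / len(outcomes)  # Python does the arithmetic
    facts = (f"Test: {test_name}\n"
             f"Runs: {len(outcomes)}, failure rate: {failure_rate:.2f}\n"
             f"Last 10 outcomes: {', '.join(outcomes[-10:])}")
    # The LLM gets exactly one reading-comprehension question over labeled facts.
    return ask_llm(
        system="Given these precomputed facts, classify the failure into one of "
               "the five categories. Respond ONLY with JSON matching the schema.",
        user=facts,
        schema=TRIAGE_SCHEMA,
    )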

Pattern 3: Constrain the output, every single time

For my first dozen projects I spent almost as much time parsing LLM outputs as building the rest of the system. Free-form generation produces beautiful prose and unparseable garbage in roughly equal measure.

The patch is one line per project: every prompt ends with "Respond ONLY with a JSON object matching this schema:" plus the schema. Then the runtime enforces it. Ollama supports this natively (the format field above). Llama.cpp does too via grammar-constrained sampling. There is essentially no reason in 2026 to be parsing an unconstrained LLM output by hand.

Two corollaries that took me longer to learn:

  • Validate before you return. Even with constrained output, models occasionally produce JSON that parses but violates the semantic schema (an enum value not in the list, a number out of range). Validate every output against a Pydantic model. If validation fails, retry once at temperature 0. If the second attempt fails, return a typed error — not a guess. (A sketch of this loop follows this list.)
  • Schema design is product design. The shape of the JSON you ask for is, in effect, the API of your LLM-powered feature. Spend time on it. The schema that says {"category": "flaky_intermittent" | "recently_broken" | ..., "confidence": 0..1, "reasoning": str} is a different and better product than the schema that says {"answer": str}.
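
A minimal sketch of that validate-and-retry loop, assuming Pydantic and reusing the triage categories from Pattern 2 (the model class and error type are my illustration, not a fixed API):

from typing import Literal

from pydantic import BaseModel, Field, ValidationError

class Triage(BaseModel):  # the semantic schema, enforced after parsing
    category: Literal["flaky_intermittent", "recently_broken", "env_dependent",
                      "always_failing", "unknown"]
    confidence: float = Field(ge=0, le=1)
    reasoning: str

class LLMValidationError(Exception):
    pass

def ask_validated(system, user):
    schema = Triage.model_json_schema()
    for _ in range(2):  # one retry; ask_llm already runs at temperature 0
        raw = ask_llm(system, user, schema=schema)
        try:
            return Triage.model_validate(raw)  # enum values, ranges, types
        except ValidationError as e:
            last_error = e
    raise LLMValidationError(str(last_error))  # typed error, not a guess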

Pattern 4: Retrieval is two stages, not one

Every project that does anything more than text-in / text-out hits a retrieval problem. The healthcare assistant needs to ground in a clinical guideline. The legal-clause analyzer needs precedent paragraphs. The CallPilot voice agent needs the restaurant menu. The personal-knowledge-base "second brain" needs a relevant note.

The pattern that survived all 113 projects is two-stage retrieval, never one.

Stage 1: a fast, cheap, recall-oriented retriever. Usually BM25 over the corpus, sometimes a small bi-encoder embedding (bge-small-en or nomic-embed-text — both fit on a laptop). Pull the top 50 candidates. Throw away anything below a sane similarity floor.

Stage 2: a precision-oriented reranker. A small cross-encoder (bge-reranker-base is great) or, for anything where domain matters, the LLM itself prompted with "rank these 50 candidates 1–50 for relevance to this query." Take the top 5.

One stage is never enough. A pure embedding-similarity retriever produces 50 candidates of which 30 are vaguely on-topic and 5 are exactly right; the LLM doesn't know which of the 30 is which, and it hallucinates around the missing precision. A pure BM25 retriever produces 50 candidates with high lexical overlap but is blind to paraphrase. Two stages — recall first, then precision — eat both failure modes.
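
A hedged sketch of the two-stage shape, assuming rank_bm25 for stage 1 and a sentence-transformers cross-encoder for stage 2 (the whitespace tokenization and score floor here are deliberately naive):

from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")  # stage-2 model, loaded once

def retrieve(query, corpus, k1=50, k2=5):
    # Stage 1: BM25 pulls the top-k1 candidates (recall-oriented).
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    scores = bm25.get_scores(query.split())
    ranked = sorted(zip(scores, corpus), reverse=True)[:k1]
    candidates = [doc for score, doc in ranked if score > 0]  # sane floor

    # Stage 2: the cross-encoder reranks the survivors (precision-oriented).
    pair_scores = reranker.predict([(query, doc) for doc in candidates])
    reranked = sorted(zip(pair_scores, candidates), reverse=True)
    return [doc for _, doc in reranked[:k2]]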

Pattern 5: The system prompt is your contract; the user prompt is the call

I waited too long to internalize this one. In every successful project, the system prompt does all of the following, and the user prompt does none of them:

  • Defines the role ("You are a clinical-note summarizer for a primary-care physician")
  • Defines the audience ("The reader is a busy clinician who has 30 seconds")
  • Defines the constraints ("Never invent a medication. Never assert a diagnosis the source did not assert.")
  • Defines the output schema ("Respond ONLY with the JSON schema below")
  • Defines the rules for edge cases ("If the input is missing the chief complaint, return {"error": "missing_chief_complaint"}")

The user prompt is then almost embarrassingly minimal:

Source note:
<<<
{the actual note}
>>>

That's it. The LLM is being called, not negotiated with. The system prompt is the function signature; the user prompt is the argument. Treating them this way makes prompts versionable (I keep them in plain .md files with hashes in the filenames), diffable, and testable. It also collapses a whole class of "the model isn't following instructions" bugs — usually because the instructions were tangled into the user message and got down-weighted.
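
The versioning itself is nothing fancy. A hedged sketch of the convention, where the directory layout and hash length are my illustration of the idea rather than a prescribed format:

import hashlib
from pathlib import Path

PROMPT_DIR = Path("prompts")  # hypothetical layout

def save_prompt(name, text):
    # A content hash in the filename makes every prompt version diffable.
    digest = hashlib.sha256(text.encode()).hexdigest()[:8]
    path = PROMPT_DIR / f"{name}.{digest}.md"
    path.write_text(text)
    return path

def load_prompt(name, digest):
    # Loading by hash pins the exact prompt version a project shipped with.
    return (PROMPT_DIR / f"{name}.{digest}.md").read_text()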

What I got wrong, in order, so you don't have to

A short list, because the failures were the lessons:

  1. I kept switching models for too long. Projects 1–30 alternated between five different local models. The variance in output across models obscured the variance in prompt quality. Pick a model. Stay there for at least 20 projects. Then switch with a controlled comparison.

  2. I built UIs before I built CLIs. Streamlit and Gradio are wonderful, and I shipped a lot of them — but the projects where I built the CLI first and only added the UI when the CLI was correct shipped twice as fast and were twice as easy to debug.

  3. I optimized for novelty in early projects. Projects 5, 7, 11 tried to be clever — multi-agent routing, self-reflection loops, dynamic prompt rewriting. Every one of them got removed before shipping. The boring single-call shape kept winning.

  4. I underestimated voice. When I finally built CallPilot — a multi-provider voice agent that actually places phone calls — it was both the hardest project in the portfolio and the one that produced the loudest "wait, this is real?" reaction. If you can ship a voice agent, ship a voice agent.

  5. I overdocumented at first and underdocumented in the middle. The first ten projects had elaborate READMEs that nobody read. The middle thirty had thin READMEs and forgotten launch commands. The last few dozen converged on a four-section README — what it is, why it exists, how to run it, what to break next — that's enough to hand the project to a stranger.

The portfolio-level architecture

There's a meta-pattern I didn't appreciate until the last twenty projects. The portfolio itself behaves like a system, not a collection. Every project I build now uses the same Python virtualenv pattern, the same Makefile targets, the same scripts/run.sh entry point, the same env-var convention, the same logging format. I can ssh into any of the 113 repos, run make demo, and watch the thing work.

That consistency is what made the portfolio possible at this volume. If every new project required relearning the harness, I would have stopped at twenty. The harness is not what I publish — readers see the LLM logic, the prompts, the architecture diagrams. But the reason there's LLM logic, prompts, and architecture diagrams to look at across 113 projects is that the parts of the system that aren't the LLM are uniform enough that I never have to think about them again.

What I'd build differently if I started over today

A handful of things, none of them about the model:

  • Start with structured output from project one. The day I made every prompt return JSON was the day my projects stopped breaking on Friday afternoons.
  • Two-stage retrieval from project one. Single-stage retrieval was the cause of half my early hallucination bugs.
  • A shared eval harness from project one. I built one at project sixty. It should have been project two. Even ten labeled examples per project, run before every prompt change, would have caught regressions I shipped instead. (A sketch of a minimal harness follows this list.)
  • A canonical, opinionated CLI shape from project one. Click + a run/eval/demo triad. Saves a working week per project, compounded.
  • A small library of "prompts that worked" with a hash and a benchmark. I have one now. I should have had one at project five.
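
For scale, the eval harness I'm describing can be tiny. A hedged sketch, reusing the triage names from earlier; the case-file format and pass criterion are illustrative:

import json
from pathlib import Path

def run_evals(system_prompt, cases_path="eval_cases.json"):
    # cases (illustrative format): [{"input": ..., "expected_category": ...}]
    cases = json.loads(Path(cases_path).read_text())
    failures = []
    for case in cases:
        out = ask_llm(system_prompt, case["input"], schema=TRIAGE_SCHEMA)
        if out["category"] != case["expected_category"]:
            failures.append((case["input"][:40], out["category"]))
    print(f"{len(cases) - len(failures)}/{len(cases)} passed")
    return failures

Wire something like this into a make eval target and run it before every prompt change; that's the whole discipline.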

Closing

If you take one thing from this: the best local-AI projects are the ones where the LLM is the smallest part of the system. The retrieval is solid. The structured output is enforced. The deterministic computations are done in code. The system prompt is a contract. The user prompt is a call. The harness is invisible because it's the same every time.

You don't need a framework. You need a JSON schema, a two-stage retriever, a Pydantic validator, and a system-prompt file in version control. Stack a few of those together and you've got the spine of every project that survived in my portfolio.

The other 109 ideas? Decoration.


I'm a Senior Software Engineer at Microsoft and the author of 113 open-source local-AI projects and 22 technical articles on Dev.to. If you've shipped a few local-AI projects yourself and your "patterns that survived" list is different from mine, I genuinely want to read it — leave a comment or DM.
