Don’t Burn Claude Tokens: A Free, Local, Secure Way to Explore Your Code First

#agents #ai #opensource #python

Every time you point Claude or ChatGPT at an unfamiliar codebase and ask “where is the auth logic?”, you’re spending real money and context just to get oriented.

The model has to read files it’s never seen, guess at structure, and you’re paying API tokens (or burning your message limit) for what is essentially a Google-Maps-level question, not a hard reasoning question.

talk-to-your-code is a neat fix for this. It's a small local app that lets you index a codebase on your own machine and ask it plain-English questions, using a small local model through Ollama instead of a cloud API.

The idea isn't to replace Claude, but to do the cheap, repetitive "let me look around first" work locally and for free, so that by the time you do talk to Claude, you already know which files matter and can ask a sharp, narrow question.

The big picture: what actually happens

At a high level, the app does four things in sequence:

It walks your repo and breaks files into chunks,
It stores those chunks in a local database,
It lets an LLM figure out which chunks are relevant to your question. (Context building)
It lets the LLM answer using only those chunks.

> Ingests any repository

This happens once, when you click “Ingest.” After that, everything else like the actual conversation, runs against this local database, not against the raw files.

That matters: it means each question is fast and cheap, because the app isn’t re-reading your whole repo every time you ask something.

> What happens when you actually ask a question

Here’s what each box is doing in practice:

Build a query plan. Your raw question gets turned into something a retrieval system can actually use: keywords to search for, files it suspects are relevant, and what kind of question this is (explain vs debug vs locate).
Hybrid retrieval. “Hybrid” here means combining different ways of finding relevant chunks: keyword, symbol matching (good for “find the function named authenticate") and embedding similarity search (good for "find code that does something like this," even without exact wording). Using both catches more relevant code than either alone.
Context builder with a budget. This is the unglamorous but critical step. You can’t just hand the LLM every chunk that matched. The context windows are finite, and stuffing in irrelevant code wastes tokens and confuses the model. So the builder ranks chunks by relevance and packs in as many as fit under a character/token limit you control (that’s the “context length slider” in the UI).
Generate a structured answer. The final LLM call gets the user’s question plus the packed context, and is asked to generate answer with a certain schema. [Example]

class StructuredAnswer(BaseModel):
    summary: str
    relevant_files: list[str]
    confidence: str

That’s it, two structured LLM calls, with a plain retrieval step sandwiched in between. No agents looping indefinitely, no unbounded tool calls. It’s a deliberately small, predictable pipeline.

Important detail:

When you type a question like “where is authentication handled?”, the app doesn’t just dump your code into a prompt and hope for the best.

It runs the LLM twice, each time forcing a specific output shape using structured generation (also called constrained decoding), so you get a guaranteed JSON object back instead of free text you’d have to parse and hope is valid.

In Python, that schema is typically just a Pydantic model, something like this: [Example]

from pydantic import BaseModel
class QueryPlan(BaseModel):
    keywords: list[str]
    target_files: list[str]
    intent: str  # e.g. "explain", "debug", "locate"

You hand this structured-data to the LLM call and the API guarantees the response matches it. No more “please respond in JSON” prompts that occasionally come back with extra commentary attached.

Summary:

A few things this repo demonstrates cleanly that show up everywhere once you start building with LLMs:

Structured generation beats prompting for JSON. Defining a Pydantic schema and binding it to the call is more reliable than asking nicely in the prompt and parsing the response with regex.
Separate “finding the right context” from “answering the question.” Bundling retrieval and generation into a single giant prompt is how you get hallucinated answers about code that isn’t relevant.

Splitting it into a plan → retrieve → answer pipeline keeps each step’s job small and checkable — you can literally inspect the query plan in the UI before the final answer is generated.

Context budgets aren’t optional. Whether you’re calling a 7B local model or Claude through the API, “how much do I actually send” is a real engineering decision, not an afterthought.
Local-first has a real use case. It’s not about avoiding Claude, it’s about not exposing a private codebase to an external API for the boring first-pass questions, and not paying API costs for what a free local model can already answer.

Once this tool/application helps you find the right files and understand them, you can ask Claude about the actual fix instead of spending tokens on repo exploration.

DEV Community

Don’t Burn Claude Tokens: A Free, Local, Secure Way to Explore Your Code First

The big picture: what actually happens

Summary:

Top comments (0)