Shakti Wadekar

Posted on Jun 26

Stop Letting LLMs Hallucinate Your Codebase: A Graph-First Way to Summarize Repos

#ai #llm #agents #python

1. The problem we’re actually trying to solve

Ask any LLM to “summarize this repository” and it will happily oblige, and it will also happily make things up. It will mention a test suite that doesn’t exist. It will describe an API endpoint it inferred from the folder name api/. It will confidently tell you about a “data flow” it never actually traced :(

Reason:

LLMs are pattern completers, not code analyzers. When you dump a pile of files into a context window and ask for a summary, the model is guessing based on naming conventions and statistical priors from millions of other repos it has seen, not from understanding this code.

Solution:

code-graph-ai-summarizer is a small Python project that takes a different approach: don't let the LLM look at raw code at all.

Instead, build a precise, structured, graph-derived set of facts about the repo first, using real static analysis, and only then hand the LLM a curated fact-sheet and ask it to write the summary.

The LLM's job shrinks from “understand this codebase” to “narrate these facts I already verified,” which is a job LLMs are very good at.

That one design decision is the whole story of this repo, and it’s a pattern worth learning regardless of whether you ever run this exact tool.

2. The core idea, in one picture

Five stages. Each stage only passes forward what the next stage needs and never the raw source code itself.
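As a rough orientation, the five stages can be sketched as plain functions chained together. Every function name and all of the toy data below are hypothetical stand-ins rather than the repo's actual API; the only point is that each stage receives the previous stage's output and nothing else.

```python
# Hypothetical stand-ins for the five stages; the real project calls Joern and an LLM here.
def import_repo(repo_path: str) -> dict:
    """Stage 1: parse the repo into a graph handle (faked here)."""
    return {"project": repo_path.rstrip("/").split("/")[-1]}

def collect_facts(graph: dict) -> dict:
    """Stage 2: run queries against the graph and return raw fact lists."""
    return {"project": graph["project"], "files": ["app.py", "db.py"], "methods": ["main", "save"]}

def analyze(facts: dict) -> dict:
    """Stage 3: rank and label the raw facts."""
    return {"project": facts["project"], "central_files": facts["files"][:1]}

def build_fact_sheet(signals: dict) -> dict:
    """Stage 4: compact the signals into one small dictionary."""
    return {"repo_name": signals["project"], "central_files": signals["central_files"]}

def narrate(fact_sheet: dict) -> str:
    """Stage 5: the LLM call would go here; this stub only formats the facts."""
    return f"# {fact_sheet['repo_name']}\nCentral files: {', '.join(fact_sheet['central_files'])}"

def run_pipeline(repo_path: str) -> str:
    value = repo_path
    # Each stage sees only what the previous stage returned, never the raw source.
    for stage in (import_repo, collect_facts, analyze, build_fact_sheet, narrate):
        value = stage(value)
    return value

summary = run_pipeline("/tmp/demo-repo")
```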

Let’s walk through each one, building up from the basics.

3. Stage 1 : Turning code into a graph (Joern + CPG)

Before we can reason about a codebase, we need a representation of it that a program can query.

Reading characters in a .py file tells you nothing about which function calls which other function. You need structure. This is where Joern comes in.

Joern is an open-source static analysis platform that parses source code into a Code Property Graph (CPG): a single graph data structure that fuses together several representations:

The Abstract Syntax Tree (what the code literally says)
The Control Flow Graph (what order things execute in)
The Data Dependence Graph (which values flow into which)
Call graph edges (what calls what)

Once your repo is imported into Joern, all of that becomes queryable through CPGQL, a Scala-based query language that treats the whole codebase as one big graph you can filter, map, and traverse.

In the code-graph-ai-summarizer repo, that connection lives in joern/client.py:

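from pathlib import Path

from cpgqls_client import CPGQLSClient, import_code_query

class JoernRunner:
    def __init__(self, server: str) -> None:
        # server is the host:port of a running Joern server, e.g. "localhost:8080"
        self.client = CPGQLSClient(server)

    def import_repo(self, repo_path: Path, project_name: str) -> None:
        # Ask Joern to parse the repo at repo_path into a CPG named project_name
        result = self.client.execute(import_code_query(str(repo_path), project_name))
        ...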

JoernRunner is just a thin wrapper around cpgqls_client, which talks to a Joern server running locally (joern --server, listening on localhost:8080 by default).

You point it at a local folder, it imports the repo, and from then on you can fire CPGQL queries at it. (code-graph-ai-summarizer does this for you, so you don’t have to.)

Why a graph and not just an AST per file?

Because the interesting questions about a codebase are inherently cross-file: “what calls this function,” “where does this value end up,” “which file is the most central.”

Those are graph-traversal questions, not single-file parsing questions.

The CPG gives you one graph spanning the entire repo, so those questions become tractable.
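To make "cross-file" concrete, here is a minimal sketch of the "what calls this function" question on a toy call graph. The edge list and the `file:function` naming are invented for illustration; Joern's real CPG is far richer, but the question has the same shape.

```python
from collections import defaultdict

# Toy call edges as (caller, callee) pairs of "file:function" names (hypothetical data).
call_edges = [
    ("cli.py:main", "service.py:summarize"),
    ("service.py:summarize", "db.py:save"),
    ("worker.py:run", "db.py:save"),
]

# Build a reverse index: for each function, who calls it?
callers_of = defaultdict(set)
for caller, callee in call_edges:
    callers_of[callee].add(caller)

def files_calling(function: str) -> set[str]:
    """Return the files that contain at least one caller of `function`."""
    return {caller.split(":")[0] for caller in callers_of.get(function, set())}

db_callers = files_calling("db.py:save")
```

No single-file parse of db.py could answer this, because the callers live in service.py and worker.py.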

4. Stage 2 : Asking the graph the right questions

A CPG by itself is just a big graph sitting in memory.

The value comes from the specific queries you run against it.

joern/queries.py defines seven of them (six core queries plus one for source/sink calls), and reading through them is basically a mini-lesson in “what does a useful static analysis tool actually need to know about a codebase.”

Query: What it extracts

files: Every source file in the repo
methods: Every function/method, with file, full name, signature, line
types: Every class/type declared
call_edges: For each method, which other methods it calls (internal + external)
calls: Every individual call site, with its code text
entry_candidates: Methods that look like entry points
source_sink_calls: Calls that look like data sources or data sinks

entry_candidates

The entry_candidates query is critical. There’s no universal CPGQL way to say “find the main function” across Python, JS, Go, etc.

So the repo uses a name/filename heuristic instead:

val entryRe = "(?i).*(main|run|start|serve|handler|handle|route|controller|command|execute|process|consume|worker|app).*"

cpg.method
  .filterNot(_.isExternal)
  .filter(m => m.name.matches(entryRe) || m.filename.matches(entryRe))
  .take(maxItems)
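A heuristic this broad will produce false positives, and it helps to see one. The sketch below ports the same alternation to Python's `re` module; the sample method names and paths are made up for illustration.

```python
import re

# Same alternation as the Scala heuristic above, compiled case-insensitively.
ENTRY_RE = re.compile(
    r".*(main|run|start|serve|handler|handle|route|controller|command"
    r"|execute|process|consume|worker|app).*",
    re.IGNORECASE,
)

def looks_like_entry(name: str, filename: str) -> bool:
    """True when the method name or its file path matches the heuristic."""
    # fullmatch mirrors Java/Scala String.matches, which must match the whole string.
    return bool(ENTRY_RE.fullmatch(name) or ENTRY_RE.fullmatch(filename))

# Hypothetical (name, filename) pairs.
samples = [
    ("main", "cli/main.py"),          # genuine entry point
    ("parse_args", "utils/text.py"),  # no keyword in name or path
    ("prune_cache", "utils/gc.py"),   # false positive: "prune" contains "run"
]
results = [looks_like_entry(name, filename) for name, filename in samples]
```

That is why the query's output is named entry candidates: later stages rank these hits rather than trusting them outright.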

One more detail worth noticing:

joern/client.py wraps every single query in a try/except:

for name, query in joern_queries(max_items).items():
    try:
        facts[name] = self.run_json_query(name, query)
    except Exception as exc:
        print(f"[warn] Joern query failed: {name}: {exc}")
        facts[name] = []

If one query fails (say, the data-flow query isn’t supported for a given language overlay), the pipeline doesn’t crash; it just records an empty result and moves on.
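The same fail-soft loop can be tried without a Joern server by injecting the query runner. In this sketch `collect_facts` takes the runner as a parameter and `fake_runner` is a hypothetical stand-in that fails on purpose; neither signature is taken from the repo.

```python
from typing import Callable

def collect_facts(queries: dict[str, str], run_query: Callable[[str, str], list]) -> dict[str, list]:
    """Run each named query; a failing query yields an empty list instead of aborting."""
    facts: dict[str, list] = {}
    for name, query in queries.items():
        try:
            facts[name] = run_query(name, query)
        except Exception as exc:  # broad on purpose: any single failure is non-fatal
            print(f"[warn] query failed: {name}: {exc}")
            facts[name] = []
    return facts

def fake_runner(name: str, query: str) -> list:
    """Hypothetical runner standing in for a Joern call; fails for one query."""
    if name == "source_sink_calls":
        raise RuntimeError("not supported for this language")
    return [f"{name}-row"]

# The query strings are placeholders, not real CPGQL.
facts = collect_facts(
    {"files": "<files query>", "source_sink_calls": "<source/sink query>"},
    fake_runner,
)
```

Every key is always present in the result, so downstream code can call `facts.get(...)` without caring which queries succeeded.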

5. Stage 3 : From raw facts to ranked signal (the Python analysis layer)

Joern hands back raw lists: every file, every method, every call. That’s hundreds or thousands of items, too much, too unstructured, and too noisy to throw straight at an LLM.

This is where the repo’s analysis/ package comes in.

Its whole job is compression with judgment: turning a flood of graph facts into a small set of ranked, labeled signals.

5.1 Classifying what code “is”, without a single import statement

analysis/patterns.py defines keyword buckets for common categories: api_web, cli, storage_db, filesystem, llm, network, auth, queue_worker.

A snippet:

CATEGORY_PATTERNS = {
    "storage_db": [
        "sqlite", "postgres", "mysql", "mongodb", "redis", "sqlalchemy",
        "save", "insert", "update", "delete", "select", "execute", "commit", "query",
    ],
    "llm": [
        "openai", "ollama", "anthropic", "gemini", "groq", "cerebras",
        "completion", "chat.completions", "llm", "model", "generate",
    ],
    ...
}

analysis/classify.py then just checks whether any of these substrings show up in a call's name/code/target text.

This is deliberately simple: no embeddings, no ML model, just substring matching. It's fast, debuggable, and language-agnostic, and it's “good enough” because its output isn't the final answer; it's a signal that downstream ranking and the LLM will further interpret.

Don't reach for a heavyweight model when a keyword list solves 90% of the problem at near-zero cost.
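A minimal sketch of that substring check, using a trimmed copy of the two buckets shown above. The function name `classify_text` and the choice to lowercase the input first are assumptions for illustration, not necessarily what analysis/classify.py does.

```python
# Trimmed copy of two buckets from the snippet above; the real dict has more categories.
CATEGORY_PATTERNS = {
    "storage_db": ["sqlite", "postgres", "insert", "commit", "query"],
    "llm": ["openai", "anthropic", "chat.completions", "generate"],
}

def classify_text(text: str) -> list[str]:
    """Return every category whose keyword appears as a substring of `text`."""
    lowered = text.lower()  # assumption: matching is case-insensitive
    return sorted(
        category
        for category, keywords in CATEGORY_PATTERNS.items()
        if any(keyword in lowered for keyword in keywords)
    )

db_call = classify_text("cursor.execute('INSERT INTO runs VALUES (?)')")
llm_call = classify_text("client.chat.completions.create(model=m)")
plain_call = classify_text("os.path.join(a, b)")
```

Short keywords such as "model" or "update" will also match unrelated identifiers, which is acceptable here only because the result feeds a ranking step rather than a final verdict.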

5.2 Finding the repo’s “important” files

analysis/architecture.py turns the call-edge facts into a per-file importance score.

The logic, simplified:

A file gets points for every call it makes, internal or external (it’s doing things)
A file gets more points when other files call into it (it’s depended on)
A file gets points whenever its calls match one of the category patterns above (it touches storage, network, LLMs, etc.)

file_scores[caller_file] += len(internal_callees) + len(external_callees)
...
file_edge_counts[(caller_file, callee_file)] += 1
file_scores[callee_file] += 2

Sort by score, and you get a ranked list of “central files”, a cheap but effective proxy for architectural importance.
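Here is a small worked version of the first two scoring rules on invented data. The row shape (`caller_file`, `internal_callee_files`, `external_callees`) and the same-file check are assumptions for the sketch, and the category bonus from the third rule is left out to keep it short.

```python
from collections import Counter

# Hypothetical call-edge facts: each row names a caller file and what it calls.
call_edges = [
    {"caller_file": "cli.py", "internal_callee_files": ["service.py"], "external_callees": ["print"]},
    {"caller_file": "service.py", "internal_callee_files": ["db.py", "db.py"], "external_callees": []},
    {"caller_file": "worker.py", "internal_callee_files": ["db.py"], "external_callees": ["sleep"]},
]

file_scores = Counter()
for edge in call_edges:
    caller = edge["caller_file"]
    # Outgoing calls: one point each, internal or external.
    file_scores[caller] += len(edge["internal_callee_files"]) + len(edge["external_callees"])
    # Incoming calls from another file: two points each for the callee's file.
    for callee_file in edge["internal_callee_files"]:
        if callee_file != caller:
            file_scores[callee_file] += 2

# Highest score first; ties keep insertion order.
central_files = [name for name, _ in file_scores.most_common()]
```

db.py makes no calls of its own in this data, yet it ranks first because three calls from two other files land in it.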

5.3 Finding runtime flows: graph traversal, not guesswork

This is the most conceptually interesting part of the repo.

analysis/flows.py runs a breadth-first search (BFS) starting from each entry-point candidate, walking forward through the call graph, and scoring every path it finds:

entry_point
   -> calls method A
        -> calls method B  (touches "storage_db")
             -> calls method C  (touches "llm")

queue = deque([[entry]])
while queue and seen_paths < 80:
    path = queue.popleft()
    if len(path) >= 3:
        score, signals = path_score(path, method_to_file)
        if signals:
            candidates.append(runtime_candidate(...))
    if len(path) >= 5:
        continue
    for next_method in graph.get(path[-1], [])[:12]:
        if next_method not in path:
            queue.append(path + [next_method])

Each path’s score (analysis/graph.py) rewards length and, much more heavily, rewards touching “important” categories like api_web, storage_db, or llm:

score = len(path) + 4 * len(set(categories))
important = {"api_web", "storage_db", "filesystem", "llm", "network", "auth", "queue_worker"}
score += 5 * len(important.intersection(categories))

In plain English:

A path that goes from an entry point all the way to a database call or an LLM call is more “interesting” than a path that just bounces between two utility functions.

That’s a simple but effective heuristic.
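To see the search and the scoring work together, here is a self-contained sketch on a toy call graph. The graph, the category tags, and the names `find_flows` and `categories_of` are invented; the weights and the depth and fan-out limits follow the snippets above, while the global cap on explored paths is omitted for brevity.

```python
from collections import deque

# Hypothetical call graph (method -> callees) and per-method category tags.
graph = {
    "main": ["load_config", "handle_request"],
    "handle_request": ["fetch_rows", "format_output"],
    "fetch_rows": ["ask_model"],
}
categories_of = {"fetch_rows": ["storage_db"], "ask_model": ["llm"]}
IMPORTANT = {"api_web", "storage_db", "filesystem", "llm", "network", "auth", "queue_worker"}

def path_score(path: list[str]) -> int:
    """Length, plus 4 per distinct category, plus 5 per important category."""
    cats = {c for method in path for c in categories_of.get(method, [])}
    return len(path) + 4 * len(cats) + 5 * len(IMPORTANT & cats)

def find_flows(entry: str, max_depth: int = 5, fanout: int = 12) -> list[tuple[int, list[str]]]:
    """Breadth-first walk from `entry`; keep paths of length >= 3 that touch a category."""
    found = []
    queue = deque([[entry]])
    while queue:
        path = queue.popleft()
        if len(path) >= 3 and any(m in categories_of for m in path):
            found.append((path_score(path), path))
        if len(path) >= max_depth:
            continue
        for nxt in graph.get(path[-1], [])[:fanout]:
            if nxt not in path:  # skip cycles
                queue.append(path + [nxt])
    return sorted(found, key=lambda item: item[0], reverse=True)

flows = find_flows("main")
```

The four-step path that reaches both the database and the model scores 22, well ahead of the three-step path that stops at the database (12), while the path ending in `format_output` is dropped because it touches no category.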

find_data_flows is the mirror image: instead of starting from entry points, it starts from calls that look like data sources (request, input, argv, env, ...) and BFS-searches forward until it reaches calls that look like data sinks (write, save, insert, chat, post, ...).

source: read user input
    |
    v
  [ some processing methods ]
    |
    v
sink: save to DB / send to LLM

Important nuance the README states explicitly and the code backs up: these are graph-derived candidates, not proven runtime traces.

Joern is doing static analysis; it never executes the code.

A BFS path through the call graph is a plausible flow, not a guaranteed one.
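A source-to-sink search has the same skeleton as the runtime-flow search, with a different stopping rule. The sketch below is an assumption-laden illustration: the graph, the source and sink sets, and the function name are hypothetical, and each result is only a candidate path through the call graph.

```python
from collections import deque

# Hypothetical call graph plus the methods whose calls look like sources or sinks.
graph = {
    "read_request": ["validate"],
    "validate": ["normalize", "log_debug"],
    "normalize": ["save_record"],
}
sources = {"read_request"}
sinks = {"save_record"}

def find_data_flow_candidates(max_depth: int = 5) -> list[list[str]]:
    """Return call-graph paths that start at a source and end at a sink."""
    candidates = []
    queue = deque([[source] for source in sorted(sources)])
    while queue:
        path = queue.popleft()
        if path[-1] in sinks:
            candidates.append(path)  # reached a sink; stop extending this path
            continue
        if len(path) >= max_depth:
            continue
        for nxt in graph.get(path[-1], []):
            if nxt not in path:  # skip cycles
                queue.append(path + [nxt])
    return candidates

data_flows = find_data_flow_candidates()
```

Nothing here proves that the value read in `read_request` is the value written in `save_record`; the path only shows that a call chain connects the two.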

6. Stage 4 : Compacting everything into one fact sheet

All of the analysis above gets assembled in summarization/facts_builder.py into a single summary_facts dictionary:

def build_summary_facts(repo_path: Path, facts: dict) -> dict:
    repo_map = build_repo_map(facts.get("files", []))
    architecture = derive_architecture(facts)

    return {
        "repo_name": repo_path.name,
        "repo_map": repo_map,
        "architecture_signals": architecture,
        "entry_points": facts.get("entry_candidates", [])[:40],
        "critical_runtime_flow_candidates": find_runtime_flows(facts),
        "critical_data_flow_candidates": find_data_flows(facts),
        "important_symbols": important_symbols(facts, architecture),
        "limits": {"note": "This is static analysis. Runtime/data flows are graph-derived candidates, not guaranteed actual production traces."},
    }

Notice what’s not in here:

raw source code
full call lists
every method in the repo

important_symbols is deliberately filtered down to only the methods/types that live in the already-identified “central files”, another compression step that keeps the eventual LLM prompt small and focused.

This dictionary, not the repo itself, is what the LLM will actually see.
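The "central files only" filter is easy to picture in code. This is a sketch under assumptions: `filter_to_central_files`, its signature, and the row shape are hypothetical and do not mirror the repo's `important_symbols(facts, architecture)`.

```python
def filter_to_central_files(methods: list[dict], central_files: list[str], limit: int = 40) -> list[dict]:
    """Keep only methods defined in a central file, capped at `limit` entries."""
    central = set(central_files)
    kept = [method for method in methods if method.get("filename") in central]
    return kept[:limit]

# Hypothetical method facts shaped like rows from the `methods` query.
methods = [
    {"name": "main", "filename": "cli.py"},
    {"name": "save", "filename": "db.py"},
    {"name": "slugify", "filename": "utils/text.py"},
]
symbols = filter_to_central_files(methods, central_files=["db.py", "cli.py"])
```

The utility function in utils/text.py never reaches the prompt, which is the point: the cut happens in code, before the LLM has a chance to over-interpret it.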

7. Stage 5 : Writing the prompt like a contract, not a suggestion

summarization/prompts.py

This file builds the final prompt, and it’s worth reading closely because it shows how to constrain an LLM rather than just hope it behaves:

return f"""
You are generating a repository summary using Joern Code Property Graph facts.

Use only the supplied graph facts.
Do not invent files, folders, APIs, tests, classes, functions, runtime flows, or data flows.
Separate detected facts from inferred conclusions.
For runtime flows and data flows, include only the critical ones, not every path.
If something is weakly supported, say "likely".
If something is not supported, say "not detected".

Return Markdown with exactly these sections:
# Repository Summary
## 1. Repository Purpose
## 2. Repository Map
## 3. Architecture
## 4. Critical Runtime Flows
## 5. Critical Data Flows
## 6. Important Files
## 7. Important Symbols
## 8. Not Detected / Unknown
...
"""

A few important details:

Closed-world instruction: “use only the supplied facts” + “do not invent X, Y, Z” is among the highest-leverage anti-hallucination instructions you can give a model. It can’t promise zero hallucination, but combined with a small, accurate fact-sheet, it dramatically narrows the model’s room to wander.
Calibrated language is mandated, not optional: forcing the model to say “likely” or “not detected” instead of asserting things flatly turns confidence into a visible, checkable signal for the reader.
A fixed output schema: by naming the exact eight sections, the output is predictable and easy to parse, render, or diff across repos. Useful if you ever want to compare summaries over time or build a UI on top of this.
An explicit “Not Detected / Unknown” section: most prompts ask a model what it knows; this one also asks it to state what it doesn’t know. That’s a small change with an outsized effect on trustworthiness.
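A fixed schema is only useful if something checks it. As a sketch of what that check could look like (this validator is an illustration and is not described as part of the repo), the headings quoted in the prompt above can be verified against the model's Markdown:

```python
# Headings copied from the prompt shown above.
REQUIRED_SECTIONS = [
    "# Repository Summary",
    "## 1. Repository Purpose",
    "## 2. Repository Map",
    "## 3. Architecture",
    "## 4. Critical Runtime Flows",
    "## 5. Critical Data Flows",
    "## 6. Important Files",
    "## 7. Important Symbols",
    "## 8. Not Detected / Unknown",
]

def missing_sections(markdown: str) -> list[str]:
    """Return required headings that do not appear as their own line in `markdown`."""
    lines = {line.strip() for line in markdown.splitlines()}
    return [heading for heading in REQUIRED_SECTIONS if heading not in lines]

# Hypothetical, deliberately incomplete model output.
sample = (
    "# Repository Summary\n"
    "## 1. Repository Purpose\nA CLI tool.\n"
    "## 8. Not Detected / Unknown\nTests: not detected."
)
gaps = missing_sections(sample)
```

A non-empty result is a cheap signal to retry the call or flag the summary before anyone reads it.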

llm/client.py

llm/client.py then does the boring-but-important part. It’s a thin OpenAI-SDK wrapper that works with any OpenAI-compatible endpoint (Groq, OpenRouter, Gemini’s OpenAI-compatible endpoint, or Cerebras), controlled purely through .env config:

LLM_PROVIDER=groq
LLM_API_KEY=your_api_key_here
LLM_MODEL=llama-3.3-70b-versatile

def generate_repo_summary(summary_facts: dict, config: LLMConfig) -> str:
    client = make_client(config)
    response = client.chat.completions.create(
        model=config.model,
        temperature=config.temperature,
        max_tokens=config.max_tokens,
        messages=[
            {"role": "system", "content": "You are a precise static-analysis repo summarizer. You must not hallucinate unsupported repo facts."},
            {"role": "user", "content": build_summary_prompt(summary_facts)},
        ],
    )
    return response.choices[0].message.content or ""
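What makes one wrapper work across providers is that only the base URL and key change. The sketch below shows one way the .env values could be resolved; `LLMSettings`, `load_settings`, and the `BASE_URLS` table are hypothetical, and the URLs are my assumptions that should be checked against each provider's current documentation.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Assumed base URLs; confirm each against the provider's documentation.
BASE_URLS = {
    "groq": "https://api.groq.com/openai/v1",
    "openrouter": "https://openrouter.ai/api/v1",
    "cerebras": "https://api.cerebras.ai/v1",
}

@dataclass
class LLMSettings:  # hypothetical; the repo's own class is LLMConfig
    provider: str
    api_key: str
    model: str
    base_url: str

def load_settings(env: Mapping[str, str] = os.environ) -> LLMSettings:
    """Resolve provider, key, model, and base URL from environment-style config."""
    provider = env.get("LLM_PROVIDER", "groq").lower()
    if provider not in BASE_URLS:
        raise ValueError(f"unknown LLM_PROVIDER: {provider}")
    return LLMSettings(
        provider=provider,
        api_key=env.get("LLM_API_KEY", ""),
        model=env.get("LLM_MODEL", ""),
        base_url=BASE_URLS[provider],
    )

settings = load_settings(
    {"LLM_PROVIDER": "groq", "LLM_API_KEY": "test-key", "LLM_MODEL": "llama-3.3-70b-versatile"}
)
# With the openai package installed, a client could then be built as:
# client = OpenAI(api_key=settings.api_key, base_url=settings.base_url)
```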

8. Putting it all together: what actually happens when you run it

uv run code-graph-ai-summarizer /path/to/local/repo

Walking through the run() function in cli/main.py end to end:

Make the output directory outputs/<repo-name>/.
JoernRunner.import_repo(...): Joern parses your repo into a CPG.
joern.collect_facts(max_items): the six CPGQL queries (+ the optional data-flow query) run, each wrapped in its own try/except.
build_summary_facts(...): repo map, architecture signals, runtime/data flow candidates, important symbols all get derived and compacted.
generate_repo_summary(...): the compact fact JSON goes into the locked-down prompt, and the LLM writes the final Markdown.
Three files land in outputs/<repo-name>/: joern_facts.json, the raw graph extraction (for debugging / inspection); summary_facts.json, the compacted, ranked fact-sheet (what the LLM actually saw); and repo_summary.md, the final human-readable summary.

outputs/<repo-name>/
├── joern_facts.json     <- raw, large, exact
├── summary_facts.json   <- compact, ranked, curated
└── repo_summary.md      <- narrated, by the LLM
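Persisting all three artifacts takes only a few lines. The helper below is a sketch with a hypothetical name and signature, writing into a temporary directory so it can be run anywhere.

```python
import json
import tempfile
from pathlib import Path

def write_outputs(out_dir: Path, joern_facts: dict, summary_facts: dict, summary_md: str) -> list[str]:
    """Write the three artifacts and return their file names in sorted order."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "joern_facts.json").write_text(json.dumps(joern_facts, indent=2), encoding="utf-8")
    (out_dir / "summary_facts.json").write_text(json.dumps(summary_facts, indent=2), encoding="utf-8")
    (out_dir / "repo_summary.md").write_text(summary_md, encoding="utf-8")
    return sorted(path.name for path in out_dir.iterdir())

# Toy inputs; a real run would pass the Joern facts, the fact sheet, and the LLM output.
with tempfile.TemporaryDirectory() as tmp:
    written = write_outputs(
        Path(tmp) / "outputs" / "demo-repo",
        {"files": []},
        {"repo_name": "demo-repo"},
        "# Repository Summary\n",
    )
```

Keeping summary_facts.json next to repo_summary.md means any claim in the summary can be checked against exactly what the model was given.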

9. Why this design generalizes beyond “summarizing a repo”

The specific use case here is repo summarization, but the underlying pattern is broadly applicable to anyone building tools on top of LLMs:

Don’t hand an LLM a haystack and ask it to find the needle. Use a purpose-built tool (a parser, a graph engine, a database, a search index) to find candidate needles first.
Rank and compress before you prompt. The smaller and more relevant the context, the less room there is for hallucination, and the cheaper/faster the call.
Keep every intermediate artifact. joern_facts.json and summary_facts.json aren’t just debug exhaust, they’re what let you trust, or distrust, the final output with evidence rather than vibes.
Fail softly at each stage. One bad query or one weird file shouldn’t take down the whole pipeline.

10. Trying it yourself

git clone <your-repo-url>
cd code-graph-ai-summarizer
uv sync
cp .env.example .env
# edit .env: set LLM_PROVIDER, LLM_API_KEY, LLM_MODEL

# in a separate terminal
joern --server

# in a separate terminal (if using ollama)
ollama serve

# back in your main terminal
uv run code-graph-ai-summarizer /path/to/any/local/repo

Point it at a small repo first, then open outputs/<repo-name>/repo_summary.md.

11. A closing note:

Full credit where it’s due: I didn’t write this by hand, line by line, heroically, at 2 AM, fueled by coffee.

I acted as the orchestrator, pointing ChatGPT and Claude at the problem, arguing with them when they hallucinated a function that didn’t exist, and stitching their outputs into something that actually runs and is a useful application. They wrote the code; I supplied the opinions, the rejections, and the “no, that’s not what I meant” loop until it converged.

So consider this repo a small case study in human + AI pair programming, minus the part where the AI gets annoyed at my code review comments.

DEV Community