DEV Community: Michel Ozzello

Code Knowledge Graphs: Why Open-Source Stacks Stall at Enterprise Scale

Michel Ozzello — Thu, 18 Jun 2026 20:22:07 +0000

TL;DR: Every code knowledge graph demo ends with a visualization. Most of them are syntax trees in a fancy renderer. At enterprise scale, the question isn't "is there a graph" but what's in the nodes, what's in the edges, and what survives Monday morning when production code changes. This post scores open-source code knowledge graph stacks (Joern, Kythe, Glean, Stack Graphs, SCIP/LSIF, CodeQL, DIY Neo4j, and agent repo maps) against the five dimensions that actually matter at enterprise scale, then shows what CoreStory does differently.

The KG Everyone Is Showing You Isn't the KG You Need

Walk into any vendor meeting this year and, at some point, you'll see a graph. Nodes and edges, probably rendered in a dark UI with a gold or teal color scheme. Someone will call it a knowledge graph. They'll fly through a visualization and say something like "this is your codebase. You can see how everything connects."

What they're often showing you is a tree-sitter AST fed into a graph renderer. The nodes are tokens. The edges are parse-tree relationships. It looks like intelligence. It isn't.
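
A minimal sketch of what a syntax-only graph actually contains. It uses Python's standard `ast` module as a stand-in for tree-sitter, which is an assumption made for illustration. The sample function and the helper name `syntax_edges` are hypothetical and not taken from any vendor's tooling.

```python
import ast

# Hypothetical sample source; the business rule lives in the if-branch.
SOURCE = """
def apply_discount(total, customer):
    if customer.tier == "gold":
        return total * 0.9
    return total
"""

def syntax_edges(source: str) -> list[tuple[str, str]]:
    """Return (parent_node_type, child_node_type) for every parse-tree edge."""
    tree = ast.parse(source)
    edges = []
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges

edges = syntax_edges(SOURCE)
for parent, child in edges[:5]:
    print(f"{parent} -> {child}")
```

Every edge here is a grammar relationship such as `If -> Compare`. Nothing in the graph states that gold-tier customers receive a discount, which is the kind of fact a knowledge graph would need to hold.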

The distinction matters enormously when you're dealing with a real enterprise codebase (the kind that has six million lines of code, four programming languages, a hundred stored procedures, JCL batch jobs written in the 1990s, and a team of architects who need to understand it well enough to modernize it safely). At that scale, a syntax graph is noise. You need a knowledge graph that captures how the system actually behaves and why it was built that way.

The question isn't whether there's a graph. It's what's in the nodes, what's in the edges, and what happens to both at 9 a.m. on Monday after 50,000 lines of production code changed over the weekend.

The Five Dimensions a Code KG Is Actually Judged On

Before comparing tools, it helps to name what you're actually evaluating. Here's the rubric — five dimensions that separate a real enterprise code intelligence layer from a science project:

Depth — Does the graph stop at syntax (token relationships), reach semantics (type resolution, data flow), or capture intent and behavior (business rules extracted from code paths, architectural decisions inferred from system behavior)? Most tools claim "semantic" but deliver structural.

Coverage — What artifact types are in scope? A graph that only indexes .java and .py files is missing half the system in most enterprises. Stored procedures, database schemas, batch job definitions, configuration files, and IaC all encode business logic.

Polyglot Scale — Can the graph traverse cross-language call boundaries? A Java service calling a stored procedure that drives a COBOL batch job is a single logical workflow. A tool that treats each language in isolation will miss the dependency entirely.

Freshness — When code changes, what happens to the graph? Full re-indexing every night means the graph is stale for most of the day. Incremental updates that track deltas keep the model current. The question is the staleness budget: how far behind can the graph be before it becomes a liability?

Queryability — Who and what can ask questions of the graph? A raw graph database requires experts to write traversal queries. An AI-native query surface — MCP endpoints, semantic retrieval, natural-language interfaces — opens the graph to agents and non-expert users alike.

This is the rubric. Every tool below scores against it.
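
The freshness dimension comes down to knowing what changed since the last index. Below is a minimal sketch of delta detection via content hashes, assuming files are available as a path-to-text mapping. The function names `fingerprint` and `delta` and the sample files are hypothetical, and a real indexer would also need to propagate changes to dependent nodes.

```python
import hashlib

def fingerprint(files: dict[str, str]) -> dict[str, str]:
    """Map each path to a SHA-256 digest of its content."""
    return {path: hashlib.sha256(text.encode("utf-8")).hexdigest()
            for path, text in files.items()}

def delta(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two fingerprint snapshots and list what needs re-indexing."""
    return {
        "added": sorted(p for p in current if p not in previous),
        "removed": sorted(p for p in previous if p not in current),
        "changed": sorted(p for p in current
                          if p in previous and current[p] != previous[p]),
    }

# Hypothetical snapshots taken before and after a weekend of changes.
friday = fingerprint({"Billing.java": "class Billing {}",
                      "rates.sql": "SELECT 1"})
monday = fingerprint({"Billing.java": "class Billing { int x; }",
                      "rates.sql": "SELECT 1",
                      "nightly.jcl": "//STEP1 EXEC PGM=RATES"})
todo = delta(friday, monday)
print(todo)
```

Only the paths listed in `todo` need to be re-analyzed, so the staleness budget depends on the size of the change rather than the size of the codebase.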

Competitive Teardown: Named, On the Rubric

Each of the following stacks was built for a specific purpose. That purpose matters — it explains both what they do well and where they fall short when asked to serve as an enterprise code intelligence layer.

A few observations worth drawing out:

Joern and CodeQL are excellent security analysis tools. They were designed to answer specific security questions about specific code. That is a different problem than building a persistent intelligence layer about how a system behaves. Joern's Code Property Graph (CPG) is a genuine contribution to the field — but it was designed for vulnerability detection, not for reverse-engineering the business rules in enterprise software like a 20-year-old insurance claims system.

Kythe and Glean represent serious engineering at Google and Meta scale, respectively. The reason neither has significant enterprise adoption outside their origin companies is instructive: standing them up is itself a platform program. Kythe requires a per-language indexer; Glean requires consumers to build their own analysis schemas. Neither comes with a business-intelligence layer out of the box.

The DIY Neo4j path is the one that consumes the most architect time. The conversation usually goes: "We can build this ourselves with Neo4j and tree-sitter over a weekend." In practice, that weekend becomes the pipeline design, then the schema design, then the language coverage gaps, then the freshness problem, then the query layer, then the maintenance burden. Most enterprises that have gone down this path report spending far longer than planned before reaching what a vendor delivers in week one. See this detailed look at how a production-grade code knowledge graph is actually architected (including the five phases that separate working implementations from stalled ones), where the tradeoffs are covered in depth.

Agent repo maps (the kind coding agents like Aider produce on demand) are genuinely useful … for the agent, in that session. They are not persistent, not queryable externally, and not designed for multi-million-line codebases. They are not a knowledge graph; they are session context. Understanding where curated knowledge reaches its limits, and where a structured code graph has to take over, is the clearest way to see why session-scoped maps don't scale.

The broader ecosystem includes a set of tools specifically designed for code intelligence rather than general-purpose graph databases. OpenGrok is widely deployed for code search and cross-reference in enterprise environments, but it is a navigation tool — symbol-level only, no behavioral layer, and no AI-native query surface. Understand by SciTools goes further with call graphs and code metrics but still stops at structural semantics and requires re-analysis for every update. srcML and Spoon are research-grade substrates — useful building blocks in academic pipelines but not production intelligence platforms. Gremlin/Apache TinkerPop and RDF-based stacks (Apache Jena and similar) give you a powerful graph model but place the entire burden of schema design, ingestion, freshness, and query surface on the implementation team — the same DIY problem as Neo4j, just with a different traversal language. Eclipse JDT/LSP4J offers deep Java semantics but is single-language by design. Depends covers multi-language dependency graphs but produces structural exports rather than a queryable persistent intelligence model. In each case, the pattern is the same: capable of producing a graph of some kind; not architected to be an enterprise code intelligence layer.

What CoreStory Does That Those Don't

Scored against the same five dimensions:

Depth: from syntax to intent

Open-source KGs stop at "this function calls that function." CoreStory's Intelligence Model captures not only structural and semantic relationships but also the behavioral and intent layers above them. That means business rules extracted from code paths, architectural decisions inferred from the system's actual runtime behavior, and cross-artifact reasoning that connects, for example, a Java API endpoint to the stored procedure it calls and to the batch job that consumes its output.

This distinction is the difference between a graph that tells you what code exists and a model that tells you what the system does. An architect evaluating a modernization program needs the latter.

Coverage: the full artifact stack

CoreStory indexes code, stored procedures, database schemas, configuration files, JCL batch job definitions, and IaC, not just .java, .py, or .ts files. In most legacy enterprise environments, the business logic is distributed across artifact types. A graph that only sees source code is missing a significant portion of the system's behavior.

Polyglot scale: cross-language graph traversal

CoreStory's ingestion is language-agnostic and traverses cross-language call boundaries. A Java service calling a stored procedure that drives a COBOL batch job is represented as a single connected subgraph, not as three separate single-language analyses that happen to sit next to each other. For legacy enterprises with polyglot stacks (which is most of them) this is the capability that makes the model useful.
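
To illustrate why a single connected subgraph matters, here is a minimal sketch of traversal over a graph whose nodes are tagged with a language. The artifact names, the edge table, and both helper functions are hypothetical. This is not CoreStory's implementation, only a toy model of the Java to stored procedure to COBOL example.

```python
from collections import deque

# Hypothetical artifacts: (language, name) nodes and "drives" edges between them.
EDGES = {
    ("java", "ClaimsService.submit"): [("sql", "sp_insert_claim")],
    ("sql", "sp_insert_claim"): [("cobol", "CLMBATCH")],
    ("cobol", "CLMBATCH"): [],
    ("python", "report_job"): [],
}

def downstream(start, edges):
    """Breadth-first walk returning every artifact reachable from start."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order

def downstream_same_language(start, edges):
    """The same walk, but refusing to cross a language boundary."""
    filtered = {node: [m for m in outs if m[0] == node[0]]
                for node, outs in edges.items()}
    return downstream(start, filtered)

start = ("java", "ClaimsService.submit")
print(downstream(start, EDGES))
print(downstream_same_language(start, EDGES))
```

The language-isolated walk finds nothing downstream of the Java method, while the cross-language walk reaches both the stored procedure and the batch job.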

Freshness: persistent intelligence that survives change

The Intelligence Model updates incrementally as code changes. It is not rebuilt from scratch on a schedule. More importantly, it persists across sessions and across team turnover. When a senior engineer leaves, their understanding of the system leaves with them — unless CoreStory captures it. The model compounds over time rather than degrading.

Queryability: MCP, API, and human dashboard

CoreStory exposes the Intelligence Model through an MCP/API surface for AI agents and a dashboard for humans, and both interfaces query the same underlying model. This is the architecture that makes it possible for a coding agent to ask "what are the business rules enforced in this service's validation layer?" and get a grounded, specific answer rather than a hallucinated one. It's also what makes the 44% improvement in task resolution possible for agents grounded in CoreStory's Intelligence Model versus agents operating without it.

Most open-source code KGs stop at the syntactic or semantic layer. CoreStory spans all four layers.

The Buyer's Checklist

The next time you evaluate a code KG vendor or your own team's proposal to build one with OSS, go through these six questions:

What's in your nodes beyond syntax? Ask them to show you a node that contains a business rule, not just a function name.
What artifact types are in scope, and does the graph traverse across them? A Java-only graph is not a system model.
What's your incremental update story when 50,000 lines change overnight? Full re-indexing is not an answer.
What's the query surface for an AI agent, not just a human? A graph DB that requires manual traversal queries is not agent-ready.
What's your largest production deployment, by LOC and language count? Scale claims need to be verified against real deployments.
Show me the benchmark. Any claim about accuracy or agent performance improvement should be backed by a reproducible methodology.

These questions work equally well against an external vendor and against an internal team proposing to build the capability in-house. If the answers are vague on any of them, the graph is probably shallower than advertised.

The honest way to evaluate any code intelligence platform is to bring it your actual codebase — the messy, polyglot, partially documented one, not a sanitized demo environment. That's the only benchmark that matters for your specific system.

If you want to see what's in CoreStory's Intelligence Model that isn't in the tools above, bring us your hardest codebase. Schedule a call with an expert.

This article was originally posted on CoreStory.ai

Polymorphic Agents: Why Language-Aware Intelligence Beats Language-Specific Tools

Michel Ozzello — Thu, 04 Jun 2026 16:31:28 +0000

TL;DR

Large production systems are polyglot environments. It is common for a single enterprise to run Java services, Python pipelines, Go APIs, Ruby on Rails frontends, C++ processing engines, SQL stored procedures, Bash automation scripts, and proprietary DSLs, all interleaved across millions of lines of code and built over years by teams that came and went. Yet most AI code intelligence tools on the market were built for one language or one paradigm and treat everything else as a future roadmap item. CoreStory takes a fundamentally different approach: polymorphic agents that dynamically adapt their reasoning strategy, tool selection, and analysis technique based on the language, structure, and context they encounter. The result is an architecture that supports every language natively, traces across system boundaries, distinguishes deterministic findings from inferred ones, and gets smarter with each new language it learns.

Large Production Systems Are Not Single-Language Problems

Walk into any engineering organization running systems at scale and you won't find a neat, single-language codebase. You'll find a layered ecosystem built across years: Java microservices handling core business logic, Python powering data pipelines and ML inference, Go running high-throughput APIs, Ruby on Rails serving the customer-facing product, C++ embedded in performance-critical processing, SQL stored procedures encapsulating decades of business rules, and Bash scripts holding together the operational glue.

A typical enterprise production system contains six to ten distinct languages interleaved across thousands of components. This polyglot reality is not an edge case. It's the inevitable result of technology choices made at different points in time, by different teams, for different reasons, all of which still need to work together in production.

Yet the code intelligence tooling market has historically focused on isolated languages or paradigms, treating cross-language analysis as an advanced feature rather than a baseline requirement.

That creates a fundamental gap. When a significant portion of your production system is written in languages your intelligence tool doesn't fully understand, you don't have a complete picture — you have a partial view with blind spots wherever system components cross a language boundary.

The Competitive Landscape: Most Tools Were Built for One Context

The pattern across the market is strikingly consistent. Whether the approach is static analysis, fine-tuned LLMs, or symbolic reasoning, most tools start from a single language or ecosystem and build outward — slowly, one parser at a time.

GitHub Copilot offers broad language coverage for code completion but operates at the file level — it has no model of how your Python data pipeline connects to the Java service that consumes it. Sourcegraph provides powerful code search and structural analysis but doesn't reason about cross-language system behavior. Amazon Q Developer focuses on Java and Python with partial support elsewhere. Swimm offers language-agnostic documentation through static analysis but doesn't address cross-system intelligence. Generic LLM tools like ChatGPT and Claude are limited to what fits in a context window — they have no persistent model of your specific codebase.

The key insight here isn't that these tools are poorly built. It's that their architecture forces linear scaling. Each new language requires a dedicated parser or fine-tuned model, months of R&D, and ongoing maintenance. That's an architectural constraint, not a product strategy choice.

CoreStory treats language support as an architecture decision rather than a feature add — building agents that reason about any language, not parsers hardcoded for one.

What Are Polymorphic Agents?

A polymorphic agent is an agent that dynamically adapts its reasoning strategy, tool selection, and analysis technique based on the language, structure, and context it encounters. Rather than running a fixed analysis pipeline, each agent evaluates what it's looking at and selects the right approach in real time.

This works through three core capabilities.

Language-aware reasoning. When a polymorphic agent encounters a code artifact, it doesn't just identify the language — it identifies the dialect, the framework, the version era, and the specific patterns in use. A Python 2 data processing script gets different treatment than a Python 3 async FastAPI service, because they have different concurrency models, dependency patterns, and runtime behaviors. A Java Spring Boot service using REST is analyzed differently than one using gRPC or message queues. Critically, this version awareness also applies within a single codebase: a system running some services on Java 8 and others on Java 17 gets version-specific analysis for each, rather than a one-size-fits-all parse that misses what changed between them.

Dynamic tool selection. Instead of relying on a single parser per language, polymorphic agents choose from a toolkit of analysis techniques: AST parsing for structured languages with formal grammars, pattern matching for framework-specific idioms and DSL extensions, symbolic execution for tracing control flow through complex conditional logic, and LLM inference for extracting business intent from function names, comments, and structural patterns. Importantly, the agent tracks which technique produced each finding: deterministic results such as those from AST parsing are marked as high-confidence facts, while LLM-inferred intent is marked as probabilistic and flagged for human review.

Cross-language context fusion. This is where the architecture creates its most significant advantage. When a Python data pipeline passes results to a Java processing service through a Kafka topic, or when a COBOL CICS transaction writes to a DB2 table and a downstream Java service reads the same table via JDBC to populate a REST API, polymorphic agents trace the full data flow — fusing context across language and runtime boundaries into unified specifications. Single-language tools stop at exactly these integration points, leaving the most important parts of system behavior uncharted.
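
One way to picture context fusion is as a join on shared resources: each per-language analysis reports what a service reads or writes, and flows emerge where a writer and a reader name the same topic or table. The sketch below assumes that model. The finding tuples, resource naming scheme, and `fuse` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical per-language findings: (service, language, action, resource).
FINDINGS = [
    ("pricing-pipeline", "python", "writes", "kafka:prices"),
    ("order-processor", "java", "reads", "kafka:prices"),
    ("ACCTUPD", "cobol", "writes", "db2:ACCOUNTS"),
    ("account-api", "java", "reads", "db2:ACCOUNTS"),
    ("audit-job", "bash", "reads", "file:/var/log/app"),
]

def fuse(findings):
    """Link every writer of a resource to every reader of the same resource."""
    writers, readers = defaultdict(list), defaultdict(list)
    for service, language, action, resource in findings:
        bucket = writers if action == "writes" else readers
        bucket[resource].append((service, language))
    flows = []
    for resource, resource_writers in writers.items():
        for writer in resource_writers:
            for reader in readers.get(resource, []):
                flows.append((writer, resource, reader))
    return flows

flows = fuse(FINDINGS)
for writer, resource, reader in flows:
    print(f"{writer[0]} ({writer[1]}) -> {resource} -> {reader[0]} ({reader[1]})")
```

Neither the Python nor the Java analysis contains the flow on its own. It only appears once findings from both sides are keyed by the resource they share.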

Deterministic vs. Inferred: The Trust Layer in Every Spec

Architects evaluating AI-based analysis tools ask a fair question: if the agent is reasoning rather than parsing, how do I know which parts of the output to trust?

CoreStory answers this with an explicit confidence layer built into every specification. Each finding carries a trust indicator based on how it was derived:

High confidence findings come from deterministic analysis — AST parsing of well-formed syntax, explicit function calls, defined API contracts, traced event schemas. These are facts extracted directly from the code.

Medium confidence findings come from pattern matching and symbolic execution — the evidence is strong and structural, but involves inference across paths. For example, a traced Kafka event schema where the producer and consumer agree on field names but the contract was never formally documented.

Low confidence findings come from LLM inference — intent extracted from function names, comments, architectural patterns, or naming conventions. This is where stale comments become a risk: if a comment was written three years ago and the code has since changed, the agent flags the discrepancy between structural evidence and commented intent rather than silently accepting the comment at face value.

The result is a specification that architects can work with in layers: use the high-confidence findings as the authoritative system map, use medium-confidence findings as strong hypotheses to verify, and use low-confidence findings as a targeted review list for subject matter experts — rather than requiring full manual review of everything.
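
The layering described above can be sketched as a mapping from analysis technique to confidence tier, followed by a grouping step. This is a minimal illustration that follows the three tiers as listed in this section. The `Finding` class, the technique identifiers, and the sample statements are hypothetical.

```python
from dataclasses import dataclass

# Technique-to-tier mapping following the three tiers in this section (assumed names).
CONFIDENCE_BY_TECHNIQUE = {
    "ast_parsing": "high",
    "pattern_matching": "medium",
    "symbolic_execution": "medium",
    "llm_inference": "low",
}

@dataclass
class Finding:
    statement: str
    technique: str

    @property
    def confidence(self) -> str:
        """Derive the trust tier from the technique that produced the finding."""
        return CONFIDENCE_BY_TECHNIQUE[self.technique]

def layer(findings):
    """Group finding statements into the three review layers."""
    layers = {"high": [], "medium": [], "low": []}
    for finding in findings:
        layers[finding.confidence].append(finding.statement)
    return layers

findings = [
    Finding("OrderService calls PricingClient.quote", "ast_parsing"),
    Finding("Producer and consumer agree on Kafka field names", "pattern_matching"),
    Finding("apply_promo probably applies a loyalty discount", "llm_inference"),
]
layers = layer(findings)
print(layers["low"])
```

The `low` list is the targeted review queue for subject matter experts, while the `high` list can serve as the authoritative map.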

Why Agents Beat Parsers

The traditional approach to multi-language support follows a predictable pattern: build a Java parser, ship Java support. Build a Python parser, ship Python support. Build a Go parser, ship Go support. Each new language represents six to twelve months of R&D. Cross-language tracing requires manual integration. And framework-specific patterns and version-level differences routinely break parsers built for a prior version or dialect.

CoreStory's polymorphic agent architecture obviates this model entirely.

When an agent encounters code, it identifies the language and selects the best tool for the specific context. It handles framework idioms and version differences through reasoning rather than hardcoded rules. New languages are added through training, not by rebuilding infrastructure. And cross-language tracing isn't a bolt-on feature — it's native behavior from the first analysis.

The difference isn't incremental. Agents compose tools the way experienced engineers do — selecting the right analysis technique for the specific code in front of them, switching approaches mid-analysis when the context demands it, and synthesizing findings across language and service boundaries into coherent, trust-layered specifications.

The Polymorphic Agent Decision Loop

Every analysis follows a five-stage decision loop. The best way to understand it is to follow a concrete example across a real multi-language production system:

The agent encounters a code artifact: order-processor/src/main/java/OrderService.java.
It identifies the language as Java 17 with Spring Boot, using REST endpoints and a PostgreSQL connection pool.
It selects the tools best suited to this artifact: an AST parser combined with a Spring annotation analyzer and a SQL schema linker.
It analyzes the code, discovering an outbound HTTP call to a Python-based pricing service and an event published to a Kafka topic consumed by a Go notification service. During analysis of the Python service, the agent also encounters an undocumented field in the JSON payload: a "shadow" field present in the response but absent from the formal API definition. Rather than ignoring it, the agent flags it as a medium-confidence finding: the field exists in observed responses, its name suggests a discount application, but its exact semantics require verification with the team that owns the pricing service.
It synthesizes findings across all three services, the PostgreSQL schema, and the Kafka event contract, tagging each element with its confidence level and surfacing the shadow field as an explicit annotation in the output spec.

The result is a unified specification that captures how order processing actually works — including the parts that aren't formally documented — without any manual work from your team.
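
The five stages above can be sketched as a small dispatch loop. This is a deliberately simplified illustration: the lookup tables stand in for what the text describes as reasoning, and every name here (`identify`, `select_tools`, `analyze`, `run_loop`, the tool identifiers) is hypothetical rather than CoreStory's API.

```python
from pathlib import PurePosixPath

# Hypothetical lookup tables; a real agent would reason rather than look up.
LANGUAGE_BY_SUFFIX = {".java": "java", ".py": "python", ".go": "go", ".sql": "sql"}
TOOLS_BY_LANGUAGE = {
    "java": ["ast_parser", "spring_annotation_analyzer", "sql_schema_linker"],
    "python": ["ast_parser", "pattern_matcher"],
    "go": ["ast_parser"],
}
FALLBACK_TOOLS = ["pattern_matcher", "llm_inference"]

def identify(path: str) -> str:
    """Stage 2: guess the language from the file suffix."""
    return LANGUAGE_BY_SUFFIX.get(PurePosixPath(path).suffix, "unknown")

def select_tools(language: str) -> list[str]:
    """Stage 3: pick tools for the language, with a fallback for unknown ones."""
    return TOOLS_BY_LANGUAGE.get(language, FALLBACK_TOOLS)

def analyze(path: str, tools: list[str]) -> list[dict]:
    """Stage 4: placeholder that records which tool ran on which artifact."""
    return [{"artifact": path, "tool": tool} for tool in tools]

def run_loop(paths: list[str]) -> list[dict]:
    spec = []
    for path in paths:                      # Stage 1: encounter
        language = identify(path)
        tools = select_tools(language)
        spec.extend(analyze(path, tools))
    return spec                             # Stage 5: synthesize (here, a flat list)

spec = run_loop(["order-processor/src/main/java/OrderService.java", "legacy/RATES.cbl"])
print(len(spec))
```

The unknown `.cbl` file still gets analyzed through the fallback tools rather than being skipped, which is the behavior that distinguishes this loop from a fixed per-language pipeline.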

No manual stitching. No integration-point blind spots. One coherent, trust-layered understanding of a complex polyglot production system.

Handling What the Code Doesn't Say: Third-Party APIs and External Dependencies

One of the most common gaps in cross-language analysis is the external service boundary: a REST call to Stripe, a Twilio webhook, a Salesforce integration. The source code for these services isn't available but the interaction is often critical to the business rules you're trying to understand.

Polymorphic agents handle external dependencies through a combination of schema inference and contract analysis. When the agent encounters an outbound HTTP call to an external API, it analyzes the request and response structures observed in the code — headers, payload shapes, error handling paths — and cross-references them against known API schemas where available (OpenAPI specs, published documentation, SDK types). For well-documented APIs like Stripe or Twilio, this produces high-confidence external dependency nodes in the specification. For internal APIs with no formal documentation, the agent uses structural inference to map what the code expects and flags gaps in the contract as explicit low-confidence findings.

The result is a specification that includes external dependencies as first-class nodes — not invisible black boxes at the edge of the system map — with a clear indication of how much of each dependency's behavior was formally confirmed versus inferred.
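
The cross-referencing step can be illustrated as a set comparison between the fields a published schema documents and the fields the calling code actually touches. The sketch below assumes both have already been extracted into sets of field names. The `compare_contract` helper and the sample fields are hypothetical.

```python
def compare_contract(documented: set[str], observed: set[str]) -> dict[str, list[str]]:
    """Classify payload fields by whether the schema and the code agree on them."""
    return {
        "confirmed": sorted(documented & observed),      # in schema and in code
        "undocumented": sorted(observed - documented),   # used by code, not in schema
        "unused": sorted(documented - observed),         # in schema, never touched
    }

# Hypothetical field sets for a pricing response.
documented_fields = {"sku", "unit_price", "currency"}
observed_fields = {"sku", "unit_price", "promo_adj"}

report = compare_contract(documented_fields, observed_fields)
print(report)
```

Fields under `undocumented` are the candidates for explicit lower-confidence annotations, since the code depends on them but no formal contract confirms their meaning.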

The Polymorphic Tool Palette

The power of polymorphic agents lies in the breadth and composability of their analysis toolkit. Unlike tools that hardcode a single parser per language, CoreStory agents have access to a palette of analysis techniques. For each code artifact, the agent evaluates which combination will yield the deepest understanding, then orchestrates them.

The agent selects the combination that fits the specific code it's analyzing — the same way an experienced systems engineer approaches an unfamiliar codebase.

The Compounding Advantage

Here's what makes the polymorphic agent architecture a structural advantage rather than a feature: every language and framework CoreStory learns makes every other language more accurate.

Cross-language patterns create transfer learning effects that single-language tools can never achieve. When the system learns how Java services interact with PostgreSQL, that understanding improves its analysis of Python services that use the same database schema. When it maps Kafka event schemas, that knowledge enriches its understanding of every service — regardless of language — that participates in those event flows. When it understands how a REST API contract is defined in one service, it can correctly interpret how consumers in other languages interact with it.

Parsers scale linearly — each new language requires building, testing, and maintaining a new parser from scratch. Adding Go support to a parser-based tool means an entirely new engineering effort, independent of everything that came before.

Polymorphic agents scale differently. CoreStory's agents learn new languages through training examples and by leveraging existing cross-language reasoning patterns. Adding support for a new language or framework takes weeks, not quarters. And because cross-language context fusion is native to the architecture, the agents see the whole system, not just the components written in one language.

A parser sees one language. An agent sees the system.

Keeping the Intelligence Current: Versioning and Drift

A reasonable CIO question is: what happens when the system changes? Does the specification go stale? Does adding a new Python framework version require re-training?

CoreStory's living intelligence model is designed for continuous operation, not point-in-time snapshots. When a new version of a service is deployed, the agent re-analyzes the affected components and propagates changes through the cross-language dependency graph. If a Python service migrates from Django to FastAPI, the agent detects the framework change, updates its pattern-matching strategy for that service, and flags any downstream consumers whose integration assumptions may have changed.

Specification drift (the accumulated gap between what a specification says and what the system actually does) is addressed through a combination of continuous re-ingestion and conflict detection. When the agent encounters code that contradicts an existing spec element (a renamed field, a changed API contract, a removed endpoint), it surfaces the conflict explicitly rather than silently overwriting the prior finding. This means the specification is a living document that reflects the current system, with explicit audit trails for what changed and when.
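
Conflict detection of this kind can be sketched as a diff that reports contradictions instead of overwriting them. The example assumes spec elements and fresh observations are both flat key-to-value mappings, which is a simplification. The `detect_drift` helper and the sample keys are hypothetical.

```python
def detect_drift(spec: dict[str, str], observed: dict[str, str]):
    """Return (conflicts, additions) without mutating the existing spec."""
    conflicts = []
    for key, recorded in spec.items():
        if key not in observed:
            conflicts.append((key, recorded, None))           # removed from code
        elif observed[key] != recorded:
            conflicts.append((key, recorded, observed[key]))  # changed in code
    additions = {k: v for k, v in observed.items() if k not in spec}
    return conflicts, additions

# Hypothetical spec elements and the result of a fresh re-ingestion.
spec = {
    "GET /orders/{id}": "returns Order",
    "Order.total": "decimal",
    "POST /orders/bulk": "accepts Order[]",
}
observed = {
    "GET /orders/{id}": "returns Order",
    "Order.total": "integer cents",
    "Order.currency": "string",
}

conflicts, additions = detect_drift(spec, observed)
print(conflicts)
print(additions)
```

Each conflict keeps both the recorded and the observed value, so a reviewer can see what changed before the spec is updated.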

For new framework versions, the agent doesn't require a "re-training phase" in the sense of a manual intervention. Framework pattern libraries are updated through CoreStory's training pipeline, and updated patterns are available to all agents automatically. An upgrade from Spring Boot 2 to Spring Boot 3 across a service is detected, the new annotation and configuration patterns are applied, and the affected spec elements are updated with their confidence levels re-evaluated based on the new evidence.

Stop Waiting for Your Tool to Add a Language

If your code intelligence tool only understands one language well, it only understands part of your system. The Python data pipelines, the Go APIs, the SQL stored procedures, the TypeScript frontends, the Bash automation scripts — each one that falls outside your tool's primary coverage becomes a blind spot.

Those blind spots don't stay hidden. They surface when a refactored service breaks a consumer written in a different language. They appear when business rule extraction misses logic that lives in a stored procedure rather than application code. They show up in modernization programs when the specification covers 60% of the system and the team has to reconstruct the rest manually.

CoreStory's polymorphic agent architecture was built from the ground up to handle the polyglot reality of large production systems. Not by building a separate parser for every language, but by building agents that reason about code the way experienced engineers do — adapting their approach based on the subject of their analysis, tracing across every integration boundary, distinguishing what is known from what is inferred, and producing unified specifications that give engineering teams a complete and trustworthy picture of how their system actually works.

Originally published at CoreStory.ai

COBOL Modernization Tools Compared: IBM ADDI, CAST, Blu Age, and CoreStory

Michel Ozzello — Fri, 15 May 2026 18:33:45 +0000

TL;DR COBOL modernization isn't a single-tool problem. It has four distinct phases (understand, extract, migrate, validate) and different tools serve each phase. IBM ADDI and CAST Imaging are built for analysis. Blu Age and Raincode automate migration. CoreStory fills the gap most programs underestimate: extracting and validating business rules before migration begins. Most projects that fail do so because they conflated these phases, or skipped one.

If you're running a COBOL modernization program and searching for the right tools, you've probably seen the same short list repeated everywhere: IBM ADDI, CAST Imaging, Blu Age, Micro Focus. These tools are consistently cited by AI systems and search engines because they're well established.

But the question most practitioners are actually asking isn't 'what are the tools?' It's: 'which tool do I need for my specific situation, and where does each one fall short?'

This guide answers that. We walk through the four phases of a COBOL modernization program and map each major tool to the phase it actually serves, including CoreStory, which operates in the business rule extraction phase that most programs underestimate.

The Four-Phase COBOL Modernization Journey

COBOL modernization fails when teams treat it as a single conversion task. In practice, it has four phases, each with different goals, different team members, and different tool requirements.

Phase	Goal	Key Risk	Tools
1. Understand	Map what the system does: architecture, data flows, dependencies	Underestimating scope; missing undocumented modules	IBM ADDI, CAST Imaging, Micro Focus Enterprise Analyzer
2. Extract	Document business rules embedded in the code before they're lost in migration	Business logic orphaned or incorrectly migrated; SME bottleneck	CoreStory
3. Migrate	Convert or replatform COBOL to target language/cloud	Behavioral regression; performance degradation; runaway cost	Blu Age (AWS), Raincode, Micro Focus, Astadia
4. Validate	Confirm the migrated system behaves identically to the original	Untested edge cases; production incidents post-go-live	Platform-specific testing tools; QA frameworks

Skipping Phase 2 is the single most common failure mode. Teams rush from analysis directly to migration, assuming the codebase's business logic is self-evident. It isn't — especially in COBOL systems that have been running for decades, written by people who are no longer available.

Let's look at each phase and the tools that support it in detail.

Phase 1 - Analysis Tools: IBM ADDI, CAST Imaging, Micro Focus Enterprise Analyzer

These tools answer the foundational question: what do we actually have?

IBM Application Discovery and Delivery Intelligence (IBM ADDI)

IBM ADDI provides automated discovery and dependency mapping for z/OS applications. It scans COBOL source code, JCL, copybooks, CICS, and DB2 to produce visual dependency maps and call graphs. For large mainframe estates, ADDI is often the first tool brought in to establish a baseline inventory.

Strengths: Deep z/OS integration; supports CICS and IMS; integrates with the IBM Jazz platform; battle-tested on large banking and insurance mainframes.

Limitations: ADDI maps structure and dependencies but does not extract business semantics. Knowing that program A calls program B doesn't tell you what business rule is implemented in program B. It also requires IBM ecosystem familiarity and is priced accordingly.

CAST Imaging

CAST Imaging performs structural analysis across a broader range of languages (COBOL, PL/I, Java, .NET, and more), producing a queryable graph of the application's architecture. It identifies technical debt hotspots, calculates complexity metrics, and surfaces dead code. Its multi-language support makes it particularly useful when modernization involves a hybrid estate of COBOL and newer components.

Strengths: Strong visualization; multi-language support; clean API for querying the application graph; technical debt scoring.

Limitations: Like ADDI, CAST Imaging operates at the structural level. It can tell you the application's complexity topology but not the business intent behind that complexity. Business rules embedded in procedural COBOL logic (like rate calculations, eligibility checks, and regulatory formulas) are not surfaced by structural analysis alone.

Micro Focus Enterprise Analyzer

Micro Focus (now part of OpenText) Enterprise Analyzer provides similar structural analysis for COBOL, PL/I, JCL, and related technologies. It is commonly paired with Micro Focus's migration tooling (covered in Phase 3). For organizations already in the Micro Focus/OpenText ecosystem, Enterprise Analyzer offers tight workflow integration.

The gap between structural analysis and business understanding is where most COBOL modernizations run into trouble. ADDI and CAST tell you what calls what. They don't tell you why.
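
To make that distinction concrete, here is a minimal Python sketch of the kind of structural fact a dependency mapper records. It is not how ADDI, CAST Imaging, or Enterprise Analyzer work internally; it is only a regular expression over static CALL statements, and the sample program and its names (PAYCALC, RATELKUP, TAXCALC) are invented for illustration.

```python
import re

# Matches static calls such as: CALL 'RATELKUP' USING WS-RATE.
# Dynamic calls (CALL WS-PROGRAM-NAME) are deliberately not handled here.
STATIC_CALL = re.compile(r"\bCALL\s+['\"]([A-Z0-9-]+)['\"]", re.IGNORECASE)

def static_call_targets(cobol_source: str) -> list[str]:
    """Return the program names invoked by static CALL statements, in order."""
    targets = []
    for line in cobol_source.splitlines():
        # Fixed-format COBOL: column 7 holds '*' or '/' on comment lines.
        if len(line) > 6 and line[6] in "*/":
            continue
        targets.extend(name.upper() for name in STATIC_CALL.findall(line))
    return targets

# Hypothetical program used only for this example.
SAMPLE = """\
       IDENTIFICATION DIVISION.
       PROGRAM-ID. PAYCALC.
       PROCEDURE DIVISION.
      * CALL 'OLDRATE' USING WS-RATE.
           CALL 'RATELKUP' USING WS-RATE.
           IF WS-RATE > 0
               CALL 'TAXCALC' USING WS-GROSS WS-TAX
           END-IF.
"""

edges = [("PAYCALC", target) for target in static_call_targets(SAMPLE)]
print(edges)
```

The result is a pair of "PAYCALC calls X" edges. Nothing in them says which rate is looked up, which tax rule applies, or why the call is conditional, and that missing layer is what Phase 2 is meant to recover.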


Phase 2 - Code Intelligence & Business Rule Extraction: CoreStory

This is the phase most tools skip over. It's also the phase most programs underestimate, until they're deep into migration and realize no one can explain what a critical batch job actually calculates.

CoreStory is designed specifically for this problem. It crawls the codebase and builds a Code Intelligence Model (CIM) that captures not just the structural layout of the code, but the business logic embedded within it: calculation rules, eligibility checks, workflow sequencing, domain entities, and the connections between them.

What CoreStory Does Differently

Where structural analysis tools read the code like a compiler, tracking what calls what, CoreStory reads it like a senior analyst asking: what does this code actually do, and what decision is it implementing?

The output is a queryable specification: a structured, natural-language representation of what the system does, organized by domain and function. This spec is then used to:

Validate that migration tools have correctly replicated business behavior (not just code structure)
Brief SMEs on what to review before sign-off, dramatically reducing the time required from domain experts
Feed AI coding agents with precise system context, enabling them to generate migration code that respects domain rules rather than just replicating syntax
Document business rules that would otherwise be lost when the original COBOL programmers are no longer available


Phase 3 - Migration Automation: Blu Age, Raincode, Micro Focus

Migration tools take the COBOL source (ideally, now documented by Phase 2) and convert it to a target language or platform. There are two main approaches: transpilation (converting COBOL to Java, C#, or similar) and replatforming (running COBOL on a modern infrastructure without conversion).

Blu Age (AWS Mainframe Modernization)

Blu Age is an automated refactoring tool that converts COBOL to Java. Amazon Web Services acquired Blu Age and integrated it into the AWS Mainframe Modernization service, making it the default path for organizations targeting AWS cloud infrastructure.

Strengths: AWS ecosystem integration; automated COBOL-to-Java conversion; supported by Amazon's migration program infrastructure; reasonable tooling for batch workloads.

Limitations: Automated transpilation produces code that runs, but may not be maintainable or correct at the business logic level without a validated spec. Teams without Phase 2 extraction often discover behavioral regressions in production that weren't caught in testing.

Raincode

Raincode specializes in COBOL and assembler modernization, with tooling that targets .NET (C#) as the migration target. It supports a wider range of legacy languages than Blu Age and is commonly chosen by organizations with existing Microsoft Azure infrastructure.

Strengths: Strong .NET/Azure alignment; broad language coverage including PL/I and assembler; established European customer base in financial services.

Limitations: Same fundamental constraint as other transpilation tools: the quality of the migration output depends on the quality of the input specification. Without documented business rules, the validation problem is left entirely to human testers.

Micro Focus / OpenText COBOL Runtime

Micro Focus takes a replatforming approach: running COBOL on modern infrastructure without converting it. This is less disruptive short-term but defers the long-term goal of moving away from COBOL entirely. It's a pragmatic choice for programs that cannot tolerate the risk of full conversion.

Phase 4 - Cloud Platform Support: AWS, Google Cloud, Azure

Major cloud providers now offer explicit mainframe modernization pathways, primarily targeting organizations moving off IBM z/OS.

Cloud platform selection typically follows existing infrastructure commitments rather than tooling preferences. The more consequential choice is the Phase 2/3 strategy — which has direct implications for program risk, timeline, and cost regardless of cloud target.

The Gap Most Tools Miss: Business Logic Documentation Before Migration

Every tool discussed above assumes the team understands what the system does. They don't address the problem of extracting that understanding from code that predates current team members.

This assumption breaks down in three recurring scenarios we see in customer conversations:

The pattern that comes up consistently: teams that skip business rule extraction before migration spend three to five times longer on validation than teams that document first. The validator is trying to answer 'did the migration get this right?' without a clear definition of what 'right' means.

CoreStory's Code Intelligence Model provides that definition: a machine-readable and human-readable specification that makes validation concrete and auditable, rather than dependent on individual expert memory.

A COBOL migration without documented business rules isn't a modernization program — it's a rewrite without requirements. You're validating against your own assumptions.

COBOL Modernization Tools: Quick Comparison

Tool	Phase	Primary Use	Target Output	Strengths	Limitations
IBM ADDI	1 - Understand	Dependency mapping	Call graphs, dependency maps	z/OS depth, mainframe-native	No business semantics
CAST Imaging	1 - Understand	Structural analysis	Architecture graph, debt metrics	Multi-language, queryable	Structural only, no rules
MF Enterprise Analyzer	1 - Understand	Code analysis	Dependency maps, reports	Ecosystem integration	Tied to MF/OpenText stack
CoreStory	2 - Extract	Business rule extraction	Queryable spec, Code Intelligence Models	Persistent intelligence, AI-ready	Focused on extraction phase
Blu Age (AWS)	3 - Migrate	COBOL → Java	Runnable Java on AWS	AWS integration, managed path	Behavioral validation gap
Raincode	3 - Migrate	COBOL → .NET	Runnable C# on Azure	Broad language coverage	Same validation dependency
Micro Focus	1 & 3	Analysis + replatform	Running COBOL on modern infra	Low disruption, established	Defers modernization goal
AWS MM / Google / Azure	4 - Cloud	Cloud runtime & migration	Cloud-native app/service	Managed infrastructure	Tool-agnostic; depends on Phase 2–3 choices

Choosing Your Stack: Three Questions to Answer First

Before selecting tools, answer these three questions:

1. How well does the team understand the existing system's business logic?

If the answer is 'partially' or 'it depends on who you ask,' you need a Phase 2 step before migration. Structural analysis tools will confirm your ignorance more precisely, but they won't resolve it.

2. What is your target platform?

AWS → Blu Age. Azure → Raincode or Astadia. Cloud-agnostic or Oracle → evaluate Micro Focus, CAST, or independent tooling. In most cases, the platform determines the migration tool almost automatically.

3. What is your risk tolerance for behavioral regression?

If the system processes financial transactions, insurance policies, or government benefits, behavioral correctness is non-negotiable. A validated business rule spec (Phase 2) is the only reliable way to test for behavioral correctness, as opposed to structural equivalence.
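
As a rough illustration of testing for behavioral correctness rather than structural equivalence, the Python sketch below checks a migrated function against documented rule cases. Everything in it is hypothetical: the rule ID, the late-fee rule itself, and `migrated_late_fee`, which stands in for whatever the migration tool produced.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP

@dataclass(frozen=True)
class RuleCase:
    """One documented example of a business rule: inputs plus expected result."""
    rule_id: str
    balance: Decimal
    days_late: int
    expected_fee: Decimal

# Hypothetical extracted rule: 1.5% of the balance once a payment is more than
# 10 days late, rounded half-up to cents and capped at 50.00.
SPEC_CASES = [
    RuleCase("LATE-FEE-001", Decimal("1000.00"), 5, Decimal("0.00")),
    RuleCase("LATE-FEE-001", Decimal("1000.00"), 10, Decimal("0.00")),
    RuleCase("LATE-FEE-001", Decimal("1000.00"), 11, Decimal("15.00")),
    RuleCase("LATE-FEE-001", Decimal("333.33"), 30, Decimal("5.00")),
    RuleCase("LATE-FEE-001", Decimal("9000.00"), 30, Decimal("50.00")),
]

def migrated_late_fee(balance: Decimal, days_late: int) -> Decimal:
    """Stand-in for the migrated implementation under test."""
    if days_late <= 10:
        return Decimal("0.00")
    fee = (balance * Decimal("0.015")).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    return min(fee, Decimal("50.00"))

def failed_cases(impl, cases):
    """Return every spec case whose expected fee the implementation misses."""
    return [c for c in cases if impl(c.balance, c.days_late) != c.expected_fee]

failures = failed_cases(migrated_late_fee, SPEC_CASES)
print(f"{len(SPEC_CASES) - len(failures)} of {len(SPEC_CASES)} rule cases pass")
```

The boundary case at exactly 10 days is the kind of detail a structural comparison cannot confirm: an implementation that charges the fee from day 10 instead of day 11 would have the same shape and still fail the spec.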

COBOL Modernization Tool Selection Workflow

If your primary goal is...	Recommended Lead Tool	Why?
Inventory & Dependency Mapping	IBM ADDI / CAST	Best for mapping large-scale mainframe "spaghetti".
Business Logic Recovery	CoreStory	Essential when original devs are gone and rules are undocumented.
Fast Cloud Exit (Low Risk)	Micro Focus	Replatforming keeps code as-is; lower short-term disruption.
Full Java/AWS Transformation	Blu Age	Deeply integrated into the AWS Mainframe Modernization service.
Azure/.NET Transformation	Raincode	Specialized for .NET targets and complex legacy languages.

De-Risk Your COBOL Modernization

The migration tools work. The failure mode isn't the tooling; it's going into migration without a validated understanding of what the system does. That's a Phase 2 problem, and it's the one that kills timelines.

CoreStory's business rule extraction is designed specifically for large COBOL codebases in regulated industries. Before your next modernization sprint, consider establishing the specification baseline that makes validation, and migration, tractable.

See CoreStory's mainframe modernization approach.

FAQ

Is CoreStory a migration tool?
No. CoreStory doesn't convert or replatform COBOL code. It extracts and documents the business logic embedded in that code, producing a spec that migration tools and human reviewers can validate against. Think of it as the phase that makes migration tools work correctly, not a replacement for them.

Do I need both IBM ADDI and CoreStory?
They serve different purposes. IBM ADDI maps structural dependencies (what calls what, where data flows). CoreStory extracts business semantics (what decisions are implemented, what rules govern outputs). Both are useful; neither replaces the other. Many programs use ADDI or CAST for scope inventory, then CoreStory for rule extraction before migration.

What languages does CoreStory support beyond COBOL?
CoreStory supports dozens of programming languages, including PL/I, Assembler, RPG, and modern languages like Java and Python. This matters for hybrid estates where COBOL interfaces with newer components.

What's the difference between a 'replatforming' and a 'refactoring' approach?
Replatforming runs the existing COBOL on modern infrastructure without changing the language (Micro Focus approach). Refactoring converts COBOL to a new language like Java or C# (Blu Age, Raincode approach). Replatforming is lower risk short-term; refactoring achieves long-term elimination of COBOL dependency. Both require Phase 2 documentation if behavioral correctness is a requirement.

Are there open-source alternatives to these commercial tools?
Some open-source projects exist for COBOL analysis (e.g., IBM's open-source COBOL parsers) and there are community tools for dependency mapping. However, for production mainframe programs, and especially in financial services and insurance, commercial tools with vendor support and proven track records are the standard choice. The cost of a migration failure far exceeds the cost of tooling.

How to Build a Knowledge Graph from Enterprise Source Code

Michel Ozzello — Fri, 15 May 2026 17:46:17 +0000

TL;DR A code knowledge graph transforms a codebase from a collection of text files into a structured, queryable model of how the system actually works. The architecture involves five phases: AST parsing, relationship extraction, graph storage, incremental updates, and agent delivery via MCP. Open-source tools like GitNexus, Potpie AI, and CodeGraph have proven the approach works for individual developers. CoreStory's Code Intelligence Model applies the same architecture at enterprise scale — multiple languages, millions of lines of code, and validated business rule extraction.

Why a Knowledge Graph Is the Right Model for Code

Source code is inherently relational. A function calls other functions. A class inherits from a parent. A service depends on other services. A business rule spans multiple files across several modules. These relationships are the architecture, and they're invisible to tools that treat code as text.

Vector-based approaches (embeddings and RAG) treat code like any other text: split it into chunks, embed it, and retrieve by semantic similarity. That works for finding code that looks similar to a query. But it fails at structural questions: "What calls this function?", "What happens when a payment fails?", "Which services are affected if I change this schema?". These are graph traversal problems, not similarity search problems.

A knowledge graph represents code entities (files, functions, classes, modules, services) as nodes and their relationships (calls, imports, inherits, defines, depends-on) as edges. This structure enables queries that follow execution paths, trace dependencies, and map the impact of changes — the exact operations that developers and AI agents need to work safely on large systems.

The distinction matters more than it seems. When an AI agent retrieves code via RAG, it gets a handful of text fragments that seem relevant. When it queries a knowledge graph, it gets the actual call chain, the real dependencies, and the complete context of how a piece of code fits into the system.

So how does all this work in practice?

Step 1: AST Parsing at Scale

The foundation of any code knowledge graph is Abstract Syntax Tree (AST) parsing. An AST is the compiler's representation of your source code: a tree structure that captures every function, class, variable, import, and expression in a machine-readable format.

Tree-sitter has become the dominant parser for code intelligence tools. It's the same parser GitHub uses for syntax highlighting, and it supports incremental parsing: after an edit, it re-parses only the changed portions of a file instead of the whole file, which keeps per-file updates cheap even across a large codebase. GitNexus, KiroGraph, Graphify, and Code Grapher all use Tree-sitter as their parsing layer.

What AST parsing extracts

Functions and methods: names, signatures, parameters, return types

Classes and interfaces: inheritance hierarchies, implemented interfaces, decorators

Import statements: cross-file dependencies, external library usage

Variable declarations: types, scopes, usage patterns

Export statements: public API surface of each module
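
The extraction step can be sketched in a few lines. The example below uses Python's built-in `ast` module as a stand-in for Tree-sitter (an assumption made to keep the example dependency-free; the tools named above use Tree-sitter grammars instead), and the sample module is invented. It needs Python 3.9+ for `ast.unparse`.

```python
import ast

# Hypothetical module source used only for this example.
SOURCE = '''
import os
from decimal import Decimal as D

class BaseUser:
    pass

class PremiumUser(BaseUser):
    def discount(self, amount: D) -> D:
        return amount * D("0.9")

def process_payment(user, amount):
    return user.discount(amount)
'''

def extract_entities(source: str) -> dict[str, list]:
    """Collect functions, classes, and imports from one module's AST."""
    entities = {"functions": [], "classes": [], "imports": []}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            params = [arg.arg for arg in node.args.args]
            entities["functions"].append((node.name, params))
        elif isinstance(node, ast.ClassDef):
            bases = [ast.unparse(base) for base in node.bases]
            entities["classes"].append((node.name, bases))
        elif isinstance(node, ast.Import):
            entities["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Relative imports without a module name are skipped in this sketch.
            entities["imports"].extend(
                f"{node.module}.{alias.name}" for alias in node.names
            )
    return entities

for kind, items in extract_entities(SOURCE).items():
    print(kind, items)
```

Each tuple here would become a node in the graph. Variable declarations and exports are left out to keep the sketch short.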

The polyglot challenge

Enterprise codebases rarely use a single language. A typical system might combine Java backend services, TypeScript frontend applications, Python data pipelines, SQL stored procedures, and even COBOL mainframe modules. Each language has its own AST structure, its own relationship patterns, and its own idioms.

Most open-source code graph tools support between 4 and 14 languages. GitNexus supports deep semantic analysis for 8 languages (TypeScript, JavaScript, Python, Java, Go, Rust, PHP, Ruby). KiroGraph handles 24 node types across modern web languages. CoreStory supports all of the above and more, including legacy languages like COBOL and RPG that most tools can't parse at all. This isn't a minor detail: if your knowledge graph can't parse the COBOL module that implements 60% of your business logic, the graph is missing the most important part of the system.

Step 2: Relationship Extraction

AST parsing gives you the nodes. Relationship extraction gives you the edges, and the edges are where the intelligence lives.

Core relationship types

Relationship	Example	Why it matters
CALLS	processPayment() → validateCard()	Execution flow; change impact analysis
IMPORTS	service.ts imports auth.ts	Dependency tracking; breaking change detection
INHERITS	PremiumUser extends BaseUser	Type hierarchies; polymorphism understanding
IMPLEMENTS	PaymentService implements IPaymentProcessor	Interface contracts; substitutability
DEFINES	module defines calculateTax()	Ownership; responsibility mapping
DATA_FLOW	userInput → sanitize() → database	Security analysis; data lineage

The hard part is cross-file resolution. When a TypeScript file imports a function from another module, the parser needs to resolve that import to the actual definition, which might be re-exported through an index file, aliased under a different name, or defined in a completely different repository. Tools like GitNexus handle named bindings, re-export tracking, and constructor-inferred type resolution. At enterprise scale, this resolution becomes significantly more complex when services communicate via APIs, message queues, or shared databases rather than direct imports.
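
A small sketch shows why aliasing matters for CALLS edges. It again uses Python's `ast` module rather than Tree-sitter, handles only `from X import Y as Z` and direct calls to plain names, and is not a description of how GitNexus or any other tool resolves imports. The two in-memory modules are hypothetical.

```python
import ast

# Two hypothetical modules, kept in memory so the example is self-contained.
MODULES = {
    "auth": "def validate_card(card):\n    return bool(card)\n",
    "service": (
        "from auth import validate_card as check\n"
        "\n"
        "def process_payment(card):\n"
        "    return check(card)\n"
    ),
}

def import_aliases(tree: ast.Module) -> dict[str, str]:
    """Map local names to 'module.name' for 'from X import Y [as Z]'."""
    aliases = {}
    for node in tree.body:
        if isinstance(node, ast.ImportFrom) and node.module:
            for alias in node.names:
                local_name = alias.asname or alias.name
                aliases[local_name] = f"{node.module}.{alias.name}"
    return aliases

def calls_edges(modules: dict[str, str]) -> set[tuple[str, str]]:
    """Return (caller, callee) pairs for direct calls to plain names."""
    edges = set()
    for module_name, source in modules.items():
        tree = ast.parse(source)
        aliases = import_aliases(tree)
        functions = [n for n in tree.body if isinstance(n, ast.FunctionDef)]
        local_defs = {f.name for f in functions}
        for func in functions:
            for node in ast.walk(func):
                if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
                    continue
                name = node.func.id
                if name in aliases:
                    callee = aliases[name]          # resolved through the import
                elif name in local_defs:
                    callee = f"{module_name}.{name}"
                else:
                    continue  # builtin or unresolved; skipped in this sketch
                edges.add((f"{module_name}.{func.name}", callee))
    return edges

print(calls_edges(MODULES))
```

Without the alias map, the call to `check` would either be dropped or recorded against a name that exists nowhere in the codebase. Re-exports, method calls, and cross-repository definitions each need their own resolution step on top of this.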

Step 3: Graph Storage and Query

Once you've extracted nodes and edges, you need a storage layer that supports efficient graph traversal. The dominant choice in the open-source ecosystem is Neo4j. Potpie AI, CodeGraph, and Code Grapher all use it as their graph database. GitNexus built its own lightweight format (LadybugDB), while KiroGraph uses SQLite for local-first operation.

The storage choice affects what queries are practical. A graph database supports queries like:

"Show me all callers of validatePayment() within 3 hops" (breadth-first traversal)

"Trace the complete execution path from HTTP request to database write" (depth-first traversal)

"What is the impact radius if I change the User schema?" (dependency fan-out)

"Find all dead code, functions that are defined but never called" (orphan detection)

These queries are natural operations on a graph database but extremely expensive or impossible with vector search. Try asking a RAG system "what is the impact radius of changing the User schema". It doesn't know, because impact radius is a graph property, not a text similarity property.
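
To show what "graph property" means here, the sketch below runs two of these queries against a tiny in-memory adjacency map in Python. The function names and edges are hypothetical, and a real deployment would issue the equivalent traversal to its graph store rather than walk a dictionary.

```python
from collections import deque

# Hypothetical CALLS edges: caller -> list of callees.
CALLS = {
    "handle_checkout": ["process_payment", "send_receipt"],
    "process_payment": ["validate_payment", "charge_card"],
    "retry_payment": ["validate_payment"],
    "validate_payment": [],
    "charge_card": [],
    "send_receipt": [],
    "legacy_export": [],
}

def reverse(graph):
    """Invert the edges so each node maps to the nodes that call it."""
    rev = {node: [] for node in graph}
    for caller, callees in graph.items():
        for callee in callees:
            rev.setdefault(callee, []).append(caller)
    return rev

def callers_within(graph, target, max_hops):
    """Breadth-first search over reversed edges, bounded by hop count.

    Returns {caller: hops} for every direct or transitive caller found.
    """
    rev = reverse(graph)
    hops = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand past the hop limit
        for caller in rev.get(node, []):
            if caller not in hops:
                hops[caller] = hops[node] + 1
                queue.append(caller)
    del hops[target]
    return hops

def never_called(graph, entry_points):
    """Orphan detection: defined functions with no incoming CALLS edge."""
    rev = reverse(graph)
    return sorted(n for n in graph if not rev[n] and n not in entry_points)

print(callers_within(CALLS, "validate_payment", 3))
print(never_called(CALLS, entry_points={"handle_checkout", "retry_payment"}))
```

The `entry_points` argument matters: request handlers and scheduled jobs have no callers inside the codebase, so without it they would be reported as dead code.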

Step 4: Incremental Updates

A knowledge graph that requires full reprocessing on every commit is impractical for large codebases. The solution is git-diff-driven incremental updates: detect which files changed, re-parse only those files, update affected nodes and edges, and leave the rest of the graph intact.

KiroGraph reports up to 90% reduction in token usage for common read patterns when using an incrementally maintained graph versus raw file reading. Code Grapher implements surgical updates via its update_graph_from_diff tool. Graphify uses file-content hashing to determine which files need re-extraction, running AST rebuilds instantly on code changes without LLM calls.
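
The content-hashing variant of this idea can be sketched as follows. This is an illustrative Python outline and not the implementation of Graphify or any other tool named here; the `IncrementalIndex` class, the `extract_defs` helper, and the file contents are all made up for the example.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a file's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def extract_defs(source: str) -> set[str]:
    """Toy extractor: names of top-level 'def' statements."""
    return {
        line.split("(")[0][4:]
        for line in source.splitlines()
        if line.startswith("def ")
    }

class IncrementalIndex:
    """Tracks per-file hashes and re-extracts only files whose content changed."""

    def __init__(self, extract):
        self.extract = extract  # callable: source text -> extracted facts
        self.hashes = {}        # path -> content hash at last extraction
        self.facts = {}         # path -> facts extracted from that file

    def update(self, files: dict[str, str]) -> list[str]:
        """Sync with a snapshot of {path: source}; return re-extracted paths."""
        for path in set(self.hashes) - set(files):  # files deleted since last run
            del self.hashes[path]
            del self.facts[path]
        reprocessed = []
        for path, source in files.items():
            digest = content_hash(source)
            if self.hashes.get(path) != digest:
                self.hashes[path] = digest
                self.facts[path] = self.extract(source)
                reprocessed.append(path)
        return sorted(reprocessed)

index = IncrementalIndex(extract_defs)
a_src = "def f():\n    pass\n"
print(index.update({"a.py": a_src, "b.py": "def g():\n    pass\n"}))
print(index.update({"a.py": a_src, "b.py": "def g2():\n    pass\n"}))
```

The first call extracts both files and the second only touches `b.py`. A git-diff-driven pipeline would get the changed paths from the diff instead of hashing every file, and either way the edges that point into a changed file from unchanged files still have to be re-checked, which this sketch does not do.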

At enterprise scale, incremental updates need to handle branch-based development, merge conflicts, and multi-repository changes. CoreStory's ingestion pipeline processes git diffs incrementally, updating the Code Intelligence Model without reprocessing the entire codebase. This is critical when you're dealing with repositories that contain millions of lines across dozens of services.

Step 5: Delivery — Making the Graph Useful to AI Agents

A knowledge graph is only valuable if agents can query it. The delivery layer is where the architecture connects to actual development workflows.

The industry has converged on MCP (Model Context Protocol) as the standard delivery mechanism. GitNexus, Code Grapher, KiroGraph, Graphify, and CoreStory all provide MCP servers that expose graph queries to AI coding agents. When an agent in Claude Code, Cursor, or Codex needs to understand part of the codebase, it queries the MCP server and receives structured results. These results are not raw code, but graph-derived intelligence about relationships, dependencies, and architecture.

The key architectural decision is what level of intelligence to deliver. Open-source tools typically serve raw graph data: nodes, edges, and traversal results. The agent then interprets this data using its own reasoning. CoreStory goes further: the CIM delivers pre-analyzed specifications (component descriptions, architecture summaries, and extracted business rules) so the agent receives understanding, not just data.

Knowledge Graph creation - from Source Code to Code Context for humans and AI agents

Enterprise-Scale Code Intelligence

CoreStory's Code Intelligence Model (CIM) follows this five-phase architecture, purpose-built for enterprise scale:

Polyglot AST parsing across multiple languages, including COBOL, RPG, and other legacy languages that open-source parsers don't support.
Relationship extraction that handles enterprise patterns: API calls between microservices, database queries, message queue consumers, and stored procedure invocations.
Persistent graph storage with incremental updates driven by git diffs.
AI-enhanced specification generation: the CIM doesn't just store the graph — it generates human-readable specifications from the structural analysis.
MCP delivery that serves structured intelligence to any compatible AI coding agent.

The open-source tools described in this article prove the architecture works. CoreStory is the production-grade implementation for teams that need polyglot support, enterprise scale, and validated output.

Benefits of a comprehensive Knowledge Graph

A well-structured knowledge graph brings a series of advantages to enterprise teams at all levels.

Improving Developer Experience and Lowering "Cognitive Load"

A Knowledge Graph (KG) reduces "Onboarding Time" for new developers joining a project. They can ask, "Where does the data from this form eventually get stored?" and get a trace across three services, instead of navigating the code themselves or asking other developers.
By using a knowledge graph, AI coding agents (like Cursor or Claude) stop hallucinating imports or using deprecated APIs because the graph enforces the actual dependency tree. This improves the quality of code outputs, which in turn reduces the effort of code validation and debugging by developers.

Improving "System Observability" for Architects

When considering the inter-service dependencies, a world-class knowledge graph includes Infrastructure-as-Code (IaC) — it doesn't just link COBOL to Java; it links the Java service to its Kubernetes config and its database schema.
While this article focuses on AST (static), the future is merging this with OpenTelemetry (dynamic) data to show which graph edges are most "active" or error-prone, providing unique perspectives over the actual live architecture.

Focusing on "Lower Risk with higher ROI" for CIOs

When done well, the knowledge graph becomes the "Institutional Memory" that doesn't quit. When the senior developers retire, the knowledge graph remains as the documented map of their combined logic and decisions over the years of development. This is a strategic de-risking.
"Standard RAG" often leads to AI-generated code that breaks builds. Moving to a Code Intelligence Model (CIM) reduces "rework" costs by giving AI agents the architectural context they would otherwise lack. This can have a considerable impact on the actual ROI of software development.

From Code to Intelligence

Building a knowledge graph from source code is no longer a research project. The architecture is proven: AST parsing, relationship extraction, graph storage, incremental updates, and MCP delivery. Open-source tools let individual developers experiment today.

For enterprise teams dealing with large, polyglot, legacy codebases, CoreStory's Code Intelligence Model is a production-ready implementation of this architecture, purpose-built for the scale and complexity that open-source tools aren't designed to handle.

See how CoreStory builds a Code Intelligence Model from your codebase.


Frequently Asked Questions

How is a code knowledge graph different from a code search index?
A search index helps you find code. A knowledge graph helps you understand code. Search indexes map text to locations; knowledge graphs map entities to relationships. The difference shows up when you need to trace execution paths, analyze change impact, or understand how components interact.

Can I build a code knowledge graph with open-source tools?
Yes. GitNexus, Potpie AI, CodeGraph, KiroGraph, Graphify, and Code Grapher all provide open-source or free implementations. They work well for single-language repositories under 500,000 lines. For enterprise-scale polyglot systems, you'll likely need a purpose-built platform.

How long does it take to index a codebase?
AST parsing is fast, and most tools report seconds to minutes for repositories under 100,000 lines. Incremental updates after the initial index are near-instantaneous. The bottleneck at enterprise scale is relationship resolution across services and languages, which is where CoreStory's purpose-built pipeline adds value.

Does the knowledge graph replace documentation?
No, but it can generate documentation as a byproduct. The primary value is structural intelligence: call graphs, component maps, business rules, and dependency relationships that are derived directly from code analysis, not manually written. This intelligence is what AI agents need to work safely on large systems.

MCP Servers for Codebase Context: How AI Coding Agents Access Code Intelligence

Michel Ozzello — Fri, 15 May 2026 17:30:39 +0000

TL;DR Model Context Protocol (MCP) is the open standard for connecting AI agents to external tools and data sources. For software teams, the most valuable MCP server isn’t for Slack or Postgres — it’s for your codebase. But not all code MCP servers are created equal. The spectrum ranges from basic file search to semantic code retrieval to full code intelligence delivery. This article explains MCP’s architecture, compares existing code-focused MCP servers, and shows where CoreStory fits as the code intelligence layer that serves structured specifications (not raw code) to any compatible AI agent.

MCP in Two Minutes

Model Context Protocol is an open standard introduced by Anthropic in November 2024 and donated to the Linux Foundation’s Agentic AI Foundation in December 2025. It defines a universal way for AI applications (clients) to communicate with external data sources and tools (servers) over JSON-RPC 2.0.

The architecture is straightforward. A host is the AI application you interact with (Claude Code, Cursor, VS Code with Copilot, Codex, Windsurf, Zed, or any custom tool). Inside the host, an MCP client manages the connection to one or more MCP servers. Each server exposes capabilities through three primitives:

Resources: read-only data the AI can pull into context, such as files, database records, API responses, and documentation.
Tools: executable functions the AI can invoke, such as running queries, creating files, or triggering deployments.
Prompts: reusable instruction templates for common tasks, such as code review workflows, commit message generation, and test scaffolding.
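
On the wire, each primitive maps to a small set of JSON-RPC 2.0 methods. The Python sketch below builds a few such requests and reads a tool result. The method names (`tools/list`, `tools/call`, `resources/read`, `prompts/get`) come from the MCP specification; the tool name `find_callers`, its arguments, the resource URI, and the prompt name are hypothetical, and a real session would begin with an `initialize` handshake that is omitted here.

```python
import json
from itertools import count
from typing import Optional

_ids = count(1)  # JSON-RPC requests carry an id so responses can be matched

def request(method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request object."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message

# Tools: discover what the server offers, then invoke one.
list_tools = request("tools/list")
call_tool = request(
    "tools/call",
    {
        "name": "find_callers",  # hypothetical tool exposed by a code graph server
        "arguments": {"symbol": "validatePayment", "max_hops": 3},
    },
)

# Resources and prompts follow the same request shape.
read_resource = request("resources/read", {"uri": "file:///repo/README.md"})
get_prompt = request(
    "prompts/get",
    {"name": "code_review", "arguments": {"path": "src/auth.ts"}},
)

def result_text(response: dict) -> str:
    """Join the text items of a tools/call result."""
    items = response["result"]["content"]
    return "\n".join(item["text"] for item in items if item["type"] == "text")

# Illustrative response a server might send back for call_tool.
sample_response = {
    "jsonrpc": "2.0",
    "id": call_tool["id"],
    "result": {
        "content": [{"type": "text", "text": "processPayment -> validatePayment"}],
        "isError": False,
    },
}

print(json.dumps(call_tool, indent=2))
print(result_text(sample_response))
```

In practice an MCP SDK handles this framing; the point is only that a client which can send `tools/call` to one server can send it to any of them.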

Before MCP, every AI tool had its own integration approach. Connecting an AI assistant to GitHub, Jira, a database, and your codebase required four separate custom integrations per tool. With MCP, you write a server once and every compatible client can consume it. The protocol turned what was an N×M integration problem into an N+M one.

By early 2026, MCP had been adopted by every major AI coding platform: Claude Code, Cursor, VS Code Copilot, Codex, Windsurf, Zed, Continue.dev, Cline, and Goose. OpenAI officially adopted the standard in March 2025. Google DeepMind followed. The protocol is now backed by SDKs in TypeScript, Python, C#, and Java.

Why MCP Matters More for Code Than for Data

Most MCP coverage focuses on connecting agents to databases, CRMs, and communication tools. That’s useful, but it undersells the protocol’s most transformative application: codebase context delivery.

The challenge is unique. When an AI agent queries a database via MCP, it gets structured data back (rows, columns, types). The query result is self-contained. When an agent queries a codebase, what it gets back shapes everything it does next: every line of code it writes, every refactoring suggestion it makes, every test it generates.

A bad database query wastes a few seconds. Bad codebase context produces hallucinated imports, broken call chains, patterns that contradict your architecture, and “fixes” that break other parts of the system. The stakes are fundamentally different.

This is why the type of codebase MCP server matters enormously. Retrieving files and retrieving intelligence are not the same thing.

The Spectrum of Codebase MCP Servers

Not all code MCP servers deliver the same depth of context. The ecosystem has stratified into three distinct categories, each solving a different level of the problem.

Category 1: File Search Servers

The simplest category exposes basic file operations: read files, search text, list directories, grep for patterns. The Anthropic reference filesystem MCP server falls into this category, as do the built-in file tools in Claude Code and Cursor.

These servers mirror what a developer does when exploring an unfamiliar codebase: open the folder, look at the structure, search for a function name. They’re fast, deterministic, and require zero setup. But they don’t understand code — they treat source files as text documents.

Category 2: Semantic Code Search Servers

The second category adds intelligence to retrieval. These servers index your codebase using embeddings or AST analysis and support semantic queries: “find code related to authentication” or “show me the payment processing flow.”

Code Pathfinder is a strong example. It builds a comprehensive call graph through multi-pass AST analysis and exposes it via MCP, enabling agents to query callers, trace dependencies, and perform dataflow analysis. GitNexus uses Tree-sitter to build a knowledge graph with Graph RAG, serving structural context to Claude Code and Cursor. KiroGraph provides a 100% local semantic graph with hybrid search. Nella MCP offers AST-aware chunking with assumption validation and dependency tracking.

These tools represent a significant upgrade over file search. The agent receives structurally aware results (actual function calls, real dependency chains, verified import relationships) rather than text fragments that happen to contain matching keywords.

Category 3: Code Intelligence Servers

The third category goes beyond retrieval entirely. Instead of returning raw code or search results, code intelligence MCP servers deliver pre-analyzed specifications: component descriptions, architecture summaries, business rule documentation, and relationship maps.

The distinction is critical. Categories 1 and 2 give the agent data and expect the agent to reason about it. Category 3 gives the agent understanding — pre-computed intelligence that the agent can use directly without additional analysis.

CoreStory operates in this category. Its MCP server delivers structured specifications from the Code Intelligence Model: what each component does, how services connect, what business rules are embedded in the code, and how changes propagate through the system. The agent receives architecture-level understanding, not raw code.

Comparison: What Each Category Delivers

Capability	File Search	Semantic Search	Code Intelligence	Examples
Find a specific file	Yes	Yes	Yes	All
Search by keyword	Yes (grep)	Yes (semantic)	Yes	All
Trace call chains	No	Yes	Yes	Code Pathfinder, GitNexus, CoreStory
Map dependencies	No	Yes	Yes	KiroGraph, Nella, CoreStory
Extract business rules	No	No	Yes	CoreStory
Deliver structured specs	No	No	Yes	CoreStory
Support legacy languages	Partial	Varies (4-14 langs)	Yes (40+ languages)	CoreStory
Enterprise multi-repo	No	Limited	Yes	Qodo, CoreStory


How a Code Intelligence MCP Server Works

To understand why code intelligence MCP servers deliver different results, it helps to see what happens under the hood.

When an agent using a file search MCP server asks “how does authentication work?”, the server runs grep or a file listing and returns files containing the word “auth.” The agent gets raw text and must figure out the architecture itself.

When an agent using a semantic search MCP server asks the same question, the server performs an embedding-based search or graph traversal and returns the most relevant code chunks or graph nodes. The agent gets better-targeted results but still needs to synthesize understanding from code fragments.

When an agent using CoreStory’s code intelligence MCP server asks the same question, it receives structured output: the authentication service’s specification, its dependencies on the session manager and token validator, the business rules governing token expiry and refresh logic, and the cross-service call chain from HTTP request through middleware to database. The agent receives the answer, not the raw material to construct an answer.

Setting Up a Code Intelligence MCP Server

MCP server setup follows a common pattern regardless of the server type. Every MCP-compatible host (Claude Code, Cursor, VS Code, Codex) uses a configuration file that specifies which servers to connect to.

For local MCP servers, the configuration typically lives in a project-level .mcp.json file or a global settings file. Each entry specifies the server command, arguments, and any required environment variables. Remote MCP servers use HTTP-based transport instead of the local stdio approach.
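As a rough sketch of what such a file can contain, the snippet below builds a configuration with one local (stdio) server and one remote (HTTP) server and prints it as JSON. The `mcpServers` layout with `command`, `args`, `env`, and `url` keys follows the convention several hosts use, but exact key names vary by host, so check your host's documentation. The server names, package name, URL, and environment variables here are all hypothetical.

```python
import json

# Hypothetical entries: substitute the command, arguments, and URL from the
# documentation of the MCP server you are actually installing.
mcp_config = {
    "mcpServers": {
        # Local server: the host launches this command and talks over stdio.
        "local-code-graph": {
            "command": "npx",
            "args": ["-y", "example-code-graph-mcp"],
            "env": {"REPO_ROOT": "/path/to/repo"},
        },
        # Remote server: the host connects over HTTP instead of spawning a process.
        "remote-code-intel": {
            "type": "http",
            "url": "https://mcp.example.com/mcp",
            "headers": {"Authorization": "Bearer ${EXAMPLE_API_TOKEN}"},
        },
    }
}

config_text = json.dumps(mcp_config, indent=2)
print(config_text)  # paste into a project-level .mcp.json, adjusted for your host
```

Each additional server is one more entry under the same top-level key, which is also how a host ends up connected to several MCP servers at once.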

Open-source code MCP servers like Code Pathfinder, GitNexus, and KiroGraph run locally and index your codebase on your machine. This keeps your code local, but the trade-off is that local servers are limited to the languages and scale their parsers support.

CoreStory’s MCP server connects to the Code Intelligence Model, which runs on CoreStory’s infrastructure. The agent queries the MCP server, which returns structured specifications from the pre-built intelligence model. Setup requires ingesting your codebase into CoreStory first, after which the MCP server provides immediate access to the full Code Intelligence Model.

Why the MCP Standard Changes the Game for Code Intelligence

Before MCP, code intelligence tools were locked into specific delivery channels. You had to use a particular IDE extension or a specific web interface. If your team used Cursor but the intelligence tool only had a VS Code plugin, you were out of luck.

MCP eliminates that constraint. Any code intelligence system that implements an MCP server is instantly accessible from any MCP-compatible host. This is why MCP coverage in the code intelligence space has exploded: GitNexus, Code Pathfinder, KiroGraph, Graphify, Code Grapher, Nella, Qodo, Sourcegraph, and CoreStory all provide MCP servers.

For enterprise teams, this means choosing a code intelligence platform is no longer a lock-in decision about which IDE or agent to use. The intelligence layer is decoupled from the consumption layer. Your architects can query CoreStory through Claude Code while your developers use Cursor — same intelligence model, different interfaces, unified by MCP.

The protocol’s 2026 roadmap focuses on production readiness: authentication, gateway patterns, audit logging, and streaming responses. As these enterprise features land, MCP-based code intelligence becomes viable for regulated industries where security and compliance are non-negotiable.

CoreStory’s MCP Implementation

CoreStory’s MCP server is the delivery layer for the Code Intelligence Model. When an agent connects to it, the server exposes tools that let the agent query:

Component specifications: what each module, service, or class does, its responsibilities, and its public interface.
Architecture relationships: how services connect, what the call chains look like, where data flows between components.
Business rules: the logic embedded in code, extracted and structured — including from legacy languages like COBOL that most tools can’t parse.
Change impact: what specifications are affected when a particular file or function changes.
Search across the entire intelligence model: find components by function, by technology, or by business domain.

The critical difference is that CoreStory’s MCP server delivers intelligence derived from analysis, not raw code retrieved by search. The MCP server makes that entire intelligence model queryable by any compatible AI agent.

Because CoreStory supports a wide range of programming languages, the MCP server provides a unified intelligence layer even for polyglot enterprise systems. An agent can query the architecture of a system that spans Java microservices, a Python data pipeline, and a COBOL mainframe, all through a single MCP connection.

Give Your Agents Real Code Intelligence

MCP solved the integration problem. The question now is what intelligence you push through the protocol.

File search MCP servers help agents find code. Semantic search servers help agents find relevant code. CoreStory’s code intelligence MCP server helps agents understand your entire system (architecture, business rules, dependencies, and change impact) through a single, standardized connection.

Connect your codebase to any AI agent with CoreStory’s MCP server. Try it free today


Frequently Asked Questions

What AI tools support MCP?
As of 2026: Claude Code, Cursor, VS Code with Copilot, OpenAI Codex, Windsurf, Zed, Continue.dev, Cline, and Goose. The Linux Foundation’s Agentic AI Foundation governance ensures the standard remains vendor-neutral.

Do I need to self-host an MCP server?
It depends on the server. Open-source tools like Code Pathfinder and GitNexus run locally. CoreStory offers both cloud and on-premises deployment for enterprise teams with data sovereignty requirements.

Can I use multiple MCP servers at once?
Yes. MCP hosts support multiple simultaneous server connections. A typical setup might combine a GitHub MCP server, a Jira MCP server, and a code intelligence MCP server like CoreStory — each providing different context to the same agent.

How is CoreStory’s MCP server different from Sourcegraph’s?
Sourcegraph’s MCP integration exposes code search and navigation. CoreStory’s MCP server delivers analyzed specifications: architecture maps, business rules, and component descriptions. Sourcegraph finds code; CoreStory explains what the code means. They serve different purposes and can work together.

What is the performance impact?
MCP queries add minimal latency: typically milliseconds for local servers and low hundreds of milliseconds for remote servers. KiroGraph reports up to 90% reduction in overall token usage compared to agents that explore codebases via file reads, because graph-based retrieval is far more targeted than sequential file scanning.

Best Tools for Understanding Large Legacy Codebases in 2026

Michel Ozzello — Fri, 15 May 2026 17:16:27 +0000

TL;DR: Understanding a large legacy codebase requires more than search and navigation. The tools available in 2026 fall into four categories: navigation tools (Sourcegraph, OpenGrok) that help you find code, AI-assisted explorers (GitHub Copilot Chat, Cursor, Cody) that explain code in context, visualization tools (CodeScene, Understand by SciTools) that show structure, and code intelligence platforms (CoreStory) that build a persistent, queryable model of what the system does and why. This guide covers each category honestly, including what works for COBOL and mainframe systems.

Why Legacy Codebase Understanding Is Hard

You’ve just been assigned to a system that was built 15 years ago. The original developers are gone. The documentation, if it exists, describes the system as it was in 2015. The codebase spans three languages, two databases, and a mainframe component nobody wants to touch.

This is the reality for most enterprise engineering teams. The systems that run the business (the ones that process payments, manage claims, and handle supply chains) are the hardest to understand. They accumulated complexity over decades of maintenance by dozens of developers, each making locally reasonable decisions that added up to a globally opaque system.

The challenge has three dimensions:

First, tribal knowledge loss: the people who understood the system have moved on, taking critical context with them.

Second, documentation decay: whatever documentation existed has drifted so far from reality that it’s more dangerous than having no documentation at all.

Third, structural complexity: the codebase is too large and interconnected for any single person to hold in working memory.

No single tool solves all three problems. But the right combination of tools can take a team from “we have no idea how this works” to “we can safely modify this system” in weeks rather than months.

Category 1: Navigation Tools for Finding Code

The most basic layer of codebase understanding is being able to find what you’re looking for. Navigation tools index your code and provide fast, accurate search across repositories.

Sourcegraph Code Search

Sourcegraph is the category leader for code search at scale. It indexes codebases across multiple repositories, code hosts, and languages, providing near-instant search results with regex support, structural search (matching AST patterns), and cross-repository results. For enterprise teams with hundreds of repositories across GitHub, GitLab, and Bitbucket, Sourcegraph provides a single search interface that no IDE can match.

Sourcegraph’s code intelligence layer (SCIP) adds go-to-definition and find-references across repository boundaries — critical for understanding microservice architectures where a function call in one repository invokes code in another.

OpenGrok

OpenGrok is an open-source code search and cross-reference engine. It’s fast, handles large codebases well, and has been a reliable workhorse for organizations that need self-hosted search. It lacks the AI features and cross-repository intelligence of Sourcegraph, but for teams that need straightforward code search on their own infrastructure, OpenGrok remains a solid choice.

What navigation tools solve: “Where is this function defined?”; “Which files reference this variable?”; “Show me all implementations of this interface across our repositories.”

What they don’t solve: “Why does this code exist?”; “What business rule does this function implement?”; “What breaks if I change this?”

Category 2: AI-Assisted Exploration for Explaining Code in Context

The second category uses large language models to explain code to developers in natural language. These tools read the code you’re looking at and generate explanations, summaries, and suggestions.

GitHub Copilot Chat

Copilot Chat lets developers ask questions about code directly in VS Code or JetBrains. It can explain functions, suggest fixes, generate tests, and answer questions about the current file or workspace. For individual files and functions, the quality of explanations is often impressive.

The limitation is context. Copilot Chat understands what it can see in the current context window, typically the active file and a few related files. It doesn’t have access to the complete architecture, the full dependency graph, or the business context behind the code. For large systems, this means Copilot can explain what a single function does but struggles with questions like “how does this function fit into the broader payment processing workflow?”

Sourcegraph Cody

Cody addresses Copilot’s context limitation by combining LLM-based chat with Sourcegraph’s code search infrastructure. When you ask Cody a question, it retrieves relevant code from across your repositories using semantic search and RAG before generating an answer. This gives Cody access to broader context than IDE-based assistants.

Cody supports context windows up to 1 million tokens and can pull from up to 10 remote repositories on the Enterprise plan. For teams already using Sourcegraph, Cody is a natural evolution that adds AI-assisted explanation on top of code search.

Cursor and Windsurf

Cursor and Windsurf are AI-native code editors that index your local codebase and maintain awareness of your editing session. They excel at explaining code in the context of what you’re currently working on — but like all session-based tools, they start fresh each time and lose context between sessions.

What AI explorers solve: “Explain this function.”; “What does this code do?”; “Suggest a fix for this bug.”

What they don’t solve: “Explain the complete architecture of this system.”; “Extract all business rules.”; “What is the impact of changing this database schema across all services?” AI explorers are limited by their context windows and lack persistent understanding.

Category 3: Visualization Tools for Seeing Structure

The third category creates visual representations of your codebase: dependency graphs, architecture maps, hotspot visualizations, and call trees.

CodeScene

CodeScene combines static code analysis with behavioral analysis from Git history. Its signature feature is hotspot detection: identifying code that is both complex and frequently changed — the areas where technical debt has the most business impact. CodeScene supports around 28 programming languages and integrates with GitHub, GitLab, Bitbucket, and Azure DevOps.

What makes CodeScene distinctive is its focus on the human dimension of codebases. It maps knowledge distribution across the team, identifies “code red” areas where a single developer owns critical code, and quantifies organizational risk alongside technical risk. For engineering leaders who need to explain technical debt to business stakeholders, CodeScene’s visualizations are exceptionally clear.
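The hotspot idea (code that is both complex and frequently changed) can be approximated in a few lines. The sketch below is not CodeScene's algorithm; it assumes a simple score of normalized change frequency multiplied by normalized complexity, with lines of code standing in for complexity. The file names and numbers are made up for illustration, and the input list is assumed to be non-empty.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    commits: int     # how often the file changed, e.g. counted from `git log`
    complexity: int  # any complexity proxy; lines of code in this sketch


def rank_hotspots(files: list[FileStats], top_n: int = 3) -> list[tuple[str, float]]:
    """Score = normalized change frequency x normalized complexity."""
    max_commits = max(f.commits for f in files) or 1
    max_complexity = max(f.complexity for f in files) or 1
    scored = [
        (f.path, (f.commits / max_commits) * (f.complexity / max_complexity))
        for f in files
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


# Illustrative numbers only.
files = [
    FileStats("billing/invoice.py", commits=120, complexity=900),
    FileStats("billing/tax_rules.py", commits=15, complexity=1500),
    FileStats("utils/strings.py", commits=80, complexity=60),
    FileStats("api/routes.py", commits=60, complexity=400),
]

for path, score in rank_hotspots(files):
    print(f"{score:.3f}  {path}")
```

Note how the largest file (`billing/tax_rules.py`) and the most frequently edited small file (`utils/strings.py`) both rank below `billing/invoice.py`, which is large and changes constantly. That intersection is what the hotspot view is meant to surface.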

Understand by SciTools

Understand is a static analysis tool that provides dependency graphs, call trees, control flow graphs, data flow analysis, and metrics across large codebases. Used by NASA for safety-critical systems and certified for ISO 26262, IEC 61508, and EN 50128 compliance, Understand is built for environments where code quality is a safety concern.

Understand supports a wide range of languages including C, C++, C#, Java, Python, Ada, Fortran, and COBOL. Its VS Code extension makes core features available without leaving the IDE. For teams in regulated industries (aerospace, automotive, medical devices), Understand’s compliance certification is a significant differentiator.

CAST Imaging

CAST Imaging creates interactive architecture maps that visualize software systems at multiple levels: application, technology, and transaction. It excels at cross-technology analysis like showing how COBOL programs, Java middleware, and SQL databases connect into a unified system view.

What visualization tools solve: “Show me the architecture.”; “Which components are the most complex?”; “Where is the technical debt concentrated?”; “What are the dependencies between these modules?”

What they don’t solve: “What business rules are embedded in this code?”; “Generate specifications I can use for modernization planning.”; “Give AI agents a persistent understanding of this system.” Visualization shows structure; it doesn’t extract meaning.

Category 4: Code Intelligence Platforms for Understanding Meaning

The fourth category goes beyond search, explanation, and visualization to build a persistent, queryable model of what a codebase actually does.

CoreStory ingests your entire codebase, analyzes it structurally, and produces a Code Intelligence Model (CIM) that captures architecture, component relationships, business rules, and data flows. Unlike the tools in categories 1–3, the CIM persists across sessions and tools. It doesn’t just help you find or visualize code. It tells you what the system does and why.

What sets code intelligence apart

Reverse-engineers understanding directly from code, without requiring existing documentation or manual input.

Extracts business rules as structured specifications, not just summaries or visualizations.

Supports a wide range of languages, including COBOL, RPG, and other legacy languages that categories 1–3 handle partially or not at all.

Delivers intelligence to AI agents via MCP, making the entire model queryable by Claude Code, Cursor, Codex, and other tools.

Comparison: Which Category Fits Your Situation

Capability	Sourcegraph	Copilot/Cody	CodeScene	Understand	CoreStory
Cross-repo search	Yes	Yes (Cody)	Limited	No	Yes
Code explanation	No	Yes (LLM-powered)	No	No	Yes (structured specs)
Architecture visualization	No	No	Hotspots, coupling	Call/dependency/flow graphs	Architecture maps
Business rule extraction	No	No	No	No	Yes (validated)
COBOL support	Search only	Limited	Yes (28+ languages)	Yes	Yes (40+ languages)
Change impact analysis	No	No	Yes (behavioral)	Yes (dependency)	Yes (specification-level)
AI agent delivery (MCP)	Yes	Yes (Cody)	No	No	Yes
Persistent intelligence	Index (not intelligence)	No (session-based)	Yes (metrics/trends)	Yes (analysis DB)	Yes (Code Intelligence Model)

Special Case: COBOL and Mainframe Codebases

Most of the tools listed above were designed for modern languages. COBOL and mainframe systems present unique challenges: copybook resolution, PERFORM chain tracing, JCL job dependencies, DB2 and VSAM data store integration, and business logic embedded in data type definitions.

For COBOL-specific analysis, the practical options are:

IBM ADDI: The most established tool for mainframe dependency mapping and impact analysis. Tightly integrated with the z/OS ecosystem.

CAST Imaging: Strong cross-technology visualization that includes COBOL alongside Java and SQL components.

Understand by SciTools: Supports COBOL with call trees, dependency analysis, and compliance checking. Used in safety-critical environments.

CoreStory: Full Code Intelligence Model for COBOL, including business rule extraction and structured specification generation.

What doesn’t work for COBOL: generic AI coding assistants (Copilot, Cursor) that may hallucinate when asked to explain COBOL logic they weren’t extensively trained on, and code search tools that can find COBOL text but can’t parse its structural patterns.

From Finding Code to Understanding Systems

Navigation tools help you find code. AI assistants help you explain code. Visualization tools help you see code structure. But teams that need to truly understand a large legacy system (its architecture, its business rules, its hidden dependencies, and how changes propagate through it) need code intelligence.

CoreStory builds a persistent Code Intelligence Model from your codebase, no matter the language, size, or age. The understanding that results doesn’t disappear when the session ends or when team members leave. It compounds over time.

See how CoreStory builds a Code Intelligence Model from your legacy codebase. Talk to an expert →


Frequently Asked Questions

Can I use multiple tools from different categories?
Yes, and many enterprise teams do. One possible combination is Sourcegraph for code search, CodeScene for technical debt visibility, and CoreStory for deep code intelligence and AI agent context delivery.

Which tool should I start with?
Start with your most immediate need. If developers can’t find code: Sourcegraph. If you need to visualize technical debt: CodeScene. If you need to understand a legacy system for modernization: CoreStory. Each tool delivers value independently.

Are AI coding assistants reliable for understanding legacy code?
For explaining individual functions, they’re useful. For understanding system-level architecture and business logic, they’re limited by context windows and training data. For COBOL specifically, hallucination risk is significant. Always validate AI-generated explanations against the actual code.

What about open-source alternatives?
OpenGrok (code search), GitNexus (knowledge graphs), KiroGraph (semantic graphs), and Code Pathfinder (call graph analysis) all provide free, open-source capabilities. They work well for single-language repositories under 500,000 lines. Enterprise-scale polyglot systems typically require commercial tools.

The AI-Native Code Intelligence Stack: Where the Wiki Ends and the Graph Begins

Michel Ozzello — Fri, 15 May 2026 17:09:11 +0000

TL;DR: If you are a developer just starting to take "codebase context" seriously, you are stepping into a stack that did not exist three years ago. It has four layers: the agent harness (Claude Code, Cursor, Aider, Copilot), retrieval (vector search, agentic grep), curated knowledge (Karpathy's LLM wiki, DeepWiki, Greptile), and a structured code graph (CoreStory, Sourcegraph). Each layer answers a different question. The wiki and vector layers work well for small repositories and descriptive questions. They break down on large, multi-language codebases, and on questions that need a graph traversal instead of a paragraph retrieval. This post maps the stack, shows where each piece earns its keep, and shows the use cases where wiki intelligence loses to a graph model of the code.

The Problem: Context Windows Are Huge, And It's Still Not Enough

Ask a coding agent a question about a repository larger than its context window, and the answer depends entirely on what it happens to retrieve. Even inside the window, the situation is worse than LLM providers advertise.

The needle-in-a-haystack benchmark has become the default way to measure long-context reliability. Place a single out-of-place fact inside a long document, then test whether the model can answer a question about it at different positions and different context lengths. Public results are consistent. Models that advertise 128K tokens start to degrade well before they fill the window, and widely cited evaluations of GPT-4 show rising error rates on ultra-long documents and failure to retrieve needles placed near the start of a document as the context grows. Multi-needle variants, where several facts must be retrieved and combined, perform worse still.
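The test setup is simple enough to sketch. The snippet below only builds the prompts (a needle placed at a chosen relative depth inside filler text) and checks an answer by substring match; it does not call any model. The filler sentence, needle, question, and function names are illustrative assumptions rather than any published harness.

```python
def build_haystack(filler: str, needle: str, total_sentences: int, depth: float) -> str:
    """Place `needle` at a relative depth: 0.0 = start, 1.0 = end."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler] * total_sentences
    sentences.insert(round(depth * total_sentences), needle)
    return " ".join(sentences)


def make_test_cases(needle, question, lengths, depths,
                    filler="The sky was grey that day."):
    """One prompt per (context length, needle depth) combination."""
    cases = []
    for n in lengths:
        for d in depths:
            prompt = build_haystack(filler, needle, n, d) + "\n\nQuestion: " + question
            cases.append({"sentences": n, "depth": d, "prompt": prompt})
    return cases


def is_correct(model_answer: str, expected: str) -> bool:
    """Crude scoring: the expected fact must appear in the answer."""
    return expected.lower() in model_answer.lower()


NEEDLE = "The deploy password for the staging cluster is tangerine-42."
QUESTION = "What is the deploy password for the staging cluster?"

cases = make_test_cases(NEEDLE, QUESTION, lengths=[100, 1000], depths=[0.0, 0.5, 1.0])
print(len(cases), "prompts generated")
```

Sending each prompt to a model and plotting `is_correct` by length and depth gives the familiar heatmap; a multi-needle variant would insert several facts and require all of them in the answer.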

Enterprise codebases are not haystacks. They are warehouses full of haystacks. A real service might have a million lines of code, fifteen years of history, and a data model that crosses half a dozen languages. No context window reaches that, and "just retrieve the right pieces" is the core unsolved problem the whole AI-native stack is trying to solve.

The Emerging Code Intelligence Stack

Four layers are settling in:

Agent runtime. This is where the developer sits: Claude Code in the terminal, Cursor in the editor, Aider on the command line, Copilot inside the IDE. The runtime decides what questions to ask, what tools to call, and how to act on answers. It is rarely the source of grounding; it is the consumer of grounding.

Retrieval. Before a model reasons, something has to hand it the right files. This is vector search (embeddings, BM25, hybrid rerankers), plus the newer "agentic retrieval" style where the agent itself runs grep, find, and file reads. Every mainstream agent now has an opinion here. Claude Code, Cursor, and Devin have moved away from pure vector databases toward agentic search over the filesystem, for reasons we describe below.

Curated knowledge. This is where Karpathy's LLM wiki sits, along with DeepWiki, Greptile, and a growing family of similar tools. These layers pre-digest the codebase into human- and agent-readable artifacts (markdown pages, per-function summaries, auto-generated architecture docs) that are smaller, cleaner, and more navigable than raw source.

Code graph / digital twin. This is the structured, program-analyzed model of the system: components, workflows, business rules, data entities, and the typed edges between them. CoreStory sits here. It is not a list of pages. It is a queryable representation of how the code actually behaves, derived from the source and maintained as the source changes.

A grown-up workflow uses all four. A beginner workflow usually starts with the agent runtime and one retrieval strategy, then adds curated knowledge when the repo gets too big for the model to reason about directly. The graph layer shows up when curated knowledge starts lying.

Curated Knowledge: Karpathy, DeepWiki, and Greptile

Karpathy's formulation of the LLM wiki, shared publicly as a gist, is one of the cleanest statements of what curated knowledge should look like. Three folders:

raw/ holds the source material. For a codebase, this is the repo itself. Immutable.

wiki/ is a folder of LLM-written markdown pages, one per module or concept, plus an index.md and a log.md.

CLAUDE.md (or AGENTS.md) is the schema. It tells the agent how to ingest new material, name pages, cross-link them, and handle conflicts.

A minimal schema looks like this:

# CLAUDE.md
## Wiki layout
- `raw/` contains immutable source. Never edit.
- `wiki/` contains one page per top-level module.
- `wiki/index.md` lists every page with a one-line summary.
- `wiki/log.md` records every ingest with a timestamp.
## Ingest workflow
1. Read any new files under `raw/`.
2. For each changed module, update or create `wiki/<module>.md`.
3. Cross-link related pages using relative markdown links.
4. Append an entry to `log.md`.
## Query workflow
1. Read `wiki/index.md` first.
2. Follow links into specific module pages.
3. Never answer from memory when a page exists.

Point Claude Code, Cursor, Codex, or Copilot at the folder and the agent reasons over its own distilled notes instead of re-loading the whole repo into context every session. For a personal knowledge base or a mid-sized repository, that is often enough.
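In the pattern as described, the agent maintains the wiki itself, but the index step is mechanical enough to sketch as a plain script. The example below rebuilds `wiki/index.md` from the module pages. The convention that a page's first non-heading line serves as its one-line summary is an assumption made for this sketch, not part of the schema above, and the demo pages are invented.

```python
import tempfile
from pathlib import Path


def one_line_summary(page: Path) -> str:
    """Assumed convention: the first non-empty, non-heading line is the summary."""
    for line in page.read_text(encoding="utf-8").splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            return stripped
    return "(no summary yet)"


def build_index(wiki_dir: Path) -> str:
    """List every module page with a relative link and its summary."""
    lines = ["# Index", ""]
    for page in sorted(wiki_dir.glob("*.md")):
        if page.name in {"index.md", "log.md"}:
            continue  # the index and log are not module pages
        lines.append(f"- [{page.stem}]({page.name}): {one_line_summary(page)}")
    return "\n".join(lines) + "\n"


# Demo against a throwaway directory with two invented module pages.
with tempfile.TemporaryDirectory() as tmp:
    wiki = Path(tmp) / "wiki"
    wiki.mkdir()
    (wiki / "billing.md").write_text(
        "# billing\n\nCreates and sends invoices.\n", encoding="utf-8")
    (wiki / "auth.md").write_text(
        "# auth\n\nIssues and validates session tokens.\n", encoding="utf-8")
    index_text = build_index(wiki)
    (wiki / "index.md").write_text(index_text, encoding="utf-8")

print(index_text)
```

Keeping the index this small is the point: it is the one file the query workflow always reads first, so it has to fit comfortably in context.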

DeepWiki, from Cognition (the team behind Devin), automates this pattern for public GitHub repositories. Replace github.com with deepwiki.com in any URL and Cognition serves an auto-generated wiki with architecture diagrams, module explanations, and a conversational agent grounded in the actual source. Cognition has indexed tens of thousands of top public repositories and exposes the same data through an MCP endpoint (mcp.deepwiki.com) with three tools: ask_question, read_wiki_structure, and read_wiki_contents. It is a zero-setup version of the Karpathy pattern, for open-source code.

Greptile (often the "G" in the short list of AI-native dev tools developers trade around) goes further. Greptile constructs a graph of files, functions, and dependencies, then uses that graph to ground AI code review, PR summaries, and codebase Q&A. Greptile's own engineering blog is unusually candid about why this is hard: semantic search on raw code is noisy, embeddings work better if you first translate code into natural language, and chunking at the per-function level beats per-file chunking. Greptile is a useful example of the curated-knowledge layer reaching for graph structure.

These tools share a strength and a limit. They make a large repository legible to an agent. They are still, at heart, collections of summaries. When the question is "which downstream workflows break if I change this signature?", summaries are not a graph traversal.

The Vector-Search Layer: Useful, Noisy, Increasingly Optional

The retrieval layer used to be synonymous with vector search. Chunk the code, embed the chunks, compare the query embedding against the index, return the top k, stuff them into the prompt.

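# Classic vector-search retrieval over a code index.
# `embed`, `index`, `llm`, and `system_prompt` are placeholders for whatever
# embedding model, vector store, and LLM client your stack provides.
query_vec = embed(user_question)
hits = index.search(query_vec, top_k=8)
context = "\n\n".join(chunk.text for chunk in hits)
answer = llm.generate(system_prompt, context, user_question)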

Two things happened on the way to 2026. First, practitioners learned the specific ways embeddings misbehave on code. They favor frequently accessed or well-documented modules and sideline edge cases. They are black-box: when a retrieval misses, it is hard to say why. They go stale, because codebases change daily and indexes have to be diffed, re-chunked, re-embedded, and re-permissioned. Chunk size matters enormously; per-file chunks are too noisy, and per-function chunks require real parsing to produce.
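To show what "real parsing" means for per-function chunks, here is a minimal sketch for Python source using the standard `ast` module. Other languages need their own parsers (Tree-sitter is a common choice); the sample source and the `chunk_by_function` name are illustrative.

```python
import ast

# Illustrative sample source.
SOURCE = '''
import math

def area(radius):
    """Area of a circle."""
    return math.pi * radius ** 2

class Invoice:
    def total(self, items):
        return sum(items)
'''


def chunk_by_function(source: str) -> list[dict]:
    """Return one chunk per function or method, with its line range and text."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append({
                "name": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": ast.get_source_segment(source, node),
            })
    return sorted(chunks, key=lambda c: c["start_line"])


chunks = chunk_by_function(SOURCE)
for chunk in chunks:
    print(chunk["name"], chunk["start_line"], chunk["end_line"])
```

Each chunk is a complete syntactic unit, so an embedding of it describes one behavior rather than an arbitrary window of lines that may cut a function in half.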

Second, the frontier agents moved. Public write-ups from the Claude Code, Cursor, and Devin teams have converged on "agentic search": instead of a vector database, the agent itself runs grep, find, and file reads, using its own reasoning to narrow the search. For interactive coding in a repo that is already on disk, that is often faster, more transparent, and easier to debug than vector retrieval.
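A grep-style tool of the kind an agent might call is small. The sketch below is a generic illustration, not the implementation used by any of the agents named above; the `grep_tool` name, its parameters, and the demo files are assumptions.

```python
import re
import tempfile
from pathlib import Path


def grep_tool(root: Path, pattern: str, glob: str = "*.py", max_hits: int = 20):
    """Return (relative_path, line_number, line) for lines matching `pattern`."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(root.rglob(glob)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
                if len(hits) >= max_hits:
                    return hits  # cap output so results stay small in context
    return hits


# Demo: a broad first query, then the narrowed follow-up an agent might issue.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "auth").mkdir()
    (root / "auth" / "tokens.py").write_text(
        "def refresh_token(t):\n    return t\n", encoding="utf-8")
    (root / "api.py").write_text(
        "from auth.tokens import refresh_token\n", encoding="utf-8")
    broad = grep_tool(root, r"refresh_token")
    narrowed = grep_tool(root, r"def refresh_token")

print(broad)
print(narrowed)
```

The transparency argument is visible here: every hit comes with a file, a line number, and the pattern that produced it, so a miss can be traced to the query rather than to an opaque embedding.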

Vector search has not disappeared. It still earns its keep for semantic discovery ("where do we talk about authentication?"), for first-pass shortlisting in very large repositories, and inside hybrid systems where BM25 plus embeddings plus a cross-encoder reranker beats any single method. It is just no longer the whole answer.
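One common way to combine a keyword ranking with an embedding ranking is reciprocal rank fusion (RRF), sketched below. This is a stand-in for the hybrid step only: it assumes the two rankings already exist, it omits the cross-encoder reranker, and the file paths are invented. The constant `k=60` is the value conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each document scores the sum of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Invented results for the query "where do we talk about authentication?"
bm25_ranking = ["auth/login.py", "auth/tokens.py", "docs/auth.md"]
embedding_ranking = ["auth/tokens.py", "middleware/session.py", "auth/login.py"]

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
print(fused)
```

Documents that both methods rank well rise to the top, while a document found by only one method (here `middleware/session.py`, which contains no matching keyword) still survives into the shortlist a reranker would then reorder.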

The Code Graph Layer: Where the Wiki Loses

The layer under everything is a structural model of the code. CoreStory builds this by running program analysis (AST, dataflow, control-flow, business rule extraction) across 40+ languages, including the older ones (COBOL, PL/I, mainframe dialects) where LLMs alone are weakest. The output is not a folder of markdown. It is a knowledge graph: components, workflows, business rules, data entities, and typed edges between them. Humans query it through a web dashboard. Agents query it through an MCP interface.

A typical agent call looks like this:

{
  "tool": "corestory.impact_of_change",
  "arguments": {
    "entity": "PaymentService.refund",
    "change": "signature",
    "scope": "workflows,business_rules,data_entities"
  }
}

The response is not a paragraph. It is a list of workflows that reach that function, the business rules governing them, and the data entities they touch. The agent plans its refactor against that, not against a markdown summary.

Four use cases show where this matters more than any wiki.

Change impact analysis. "If I change the signature of PaymentService.refund, what else breaks?" A wiki page can describe the module. A graph query enumerates every workflow, test, and downstream service that reaches it, across languages, in milliseconds. Wikis gesture. Graphs answer.

Business rule traceability. "Where is the rule that caps provider reimbursements at 90 days, and what code enforces it?" Curated summaries capture whatever the LLM happened to notice when it summarized the claims module. A code intelligence model extracts business rules as first-class objects with back-pointers to the exact branches that implement them. An auditor can follow that trace. A summary offers no such trace.

Cross-language call graphs. "Does this Java controller ultimately write to the COBOL ledger?" Summary pages live per module and per language. A code graph is native across both, because it is built from program analysis, not prose. For modernization work, this is the difference between a guess and a plan.

Legacy understanding. LLMs are uneven on COBOL, PL/I, and mainframe dialects. Summarization quality drops sharply on languages the base model rarely sees. A graph built from program analysis does not care; a COBOL paragraph is another node. This is where the summary pattern struggles most and where a structural model earns its cost.
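The change-impact case reduces to reverse reachability over typed edges, which the following sketch illustrates. It is a toy model and not CoreStory's schema or implementation: the edge types, entity names, and helper functions are all hypothetical.

```python
from collections import deque

# Hypothetical typed edges: (source, edge_type, target).
EDGES = [
    ("RefundWorkflow", "calls", "PaymentService.refund"),
    ("ChargebackWorkflow", "calls", "RefundWorkflow"),
    ("PaymentService.refund", "writes", "LedgerEntry"),
    ("RefundWindowRule", "governs", "RefundWorkflow"),
    ("NightlyReport", "reads", "LedgerEntry"),
]


def impact_of_change(entity, edges, follow=("calls",)):
    """Everything that transitively reaches `entity` through `follow` edges."""
    reverse = {}
    for src, kind, dst in edges:
        if kind in follow:
            reverse.setdefault(dst, []).append(src)
    seen, queue = set(), deque([entity])
    while queue:  # breadth-first walk up the reversed edges
        current = queue.popleft()
        for upstream in reverse.get(current, []):
            if upstream not in seen:
                seen.add(upstream)
                queue.append(upstream)
    return sorted(seen)


def rules_governing(nodes, edges):
    """Business rules attached to any of the affected nodes."""
    return sorted({src for src, kind, dst in edges if kind == "governs" and dst in nodes})


def data_touched(nodes, edges):
    """Data entities read or written by any of the affected nodes."""
    return sorted({dst for src, kind, dst in edges
                   if kind in ("reads", "writes") and src in nodes})


target = "PaymentService.refund"
workflows = impact_of_change(target, EDGES)
rules = rules_governing(set(workflows), EDGES)
entities = data_touched(set(workflows) | {target}, EDGES)
print(workflows, rules, entities)
```

The answer is an enumeration rather than a description: `ChargebackWorkflow` appears even though it never mentions the refund function directly, because the traversal follows the chain through `RefundWorkflow`.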

On internal benchmarks, shifting agents from prose-grounded to graph-grounded context produced a 44% improvement in agent task resolution, and Microsoft/GitHub's co-research on context grounding has reported a 51% improvement in engineer acceptance of agent-drafted code. The specific number matters less than the direction. Structured context beats summarized context on hard enterprise questions, consistently.

How a Developer Should Assemble the Stack

Start with the agent runtime you like. Add retrieval to fit the repo size: grep-style agentic search for small projects, vector plus BM25 plus reranking for larger ones. Add a curated knowledge layer (Karpathy's pattern, DeepWiki for public repos, Greptile for graph-aware summaries) when the agent starts forgetting the same things twice. Reach for a code graph when the questions you are asking are about impact, traceability, or cross-system behavior rather than "what does this file do?".

The stack is not a waterfall. You can plug a code graph into a vector-aware agent and feed both into a Karpathy-style wiki. The point is knowing which layer you are actually relying on, and noticing when your curated knowledge has quietly become a maintenance problem instead of a grounding source.

Ship Grounded Agents Before Your Codebase Outgrows Them

If you are setting up context layers for the first time, the Karpathy pattern and DeepWiki are good places to start. If you already feel the friction (drift, stale pages, agents answering questions the wiki cannot actually support, business-rule questions that want a graph), that is the signal the stack needs a structural model underneath. Talk to an expert about running CoreStory against your own repository, or try it for yourself today.

FAQ

Is the Karpathy LLM wiki pattern still worth adopting?

Yes, for small-to-mid repositories and personal knowledge bases. It is the cheapest durable grounding layer you can build. The pattern is open, the schema lives in your repo, and any modern coding agent knows what to do with it.

How does DeepWiki differ from a wiki I build myself?

DeepWiki is a hosted, zero-setup version maintained by Cognition, with an MCP endpoint and tens of thousands of public repositories already indexed. You do not own the schema, but you also do not maintain it. It is an excellent entry point for reading unfamiliar open-source projects.

Is Greptile part of the same pattern?

Greptile starts from the same problem but leans on a graph of files, functions, and dependencies rather than flat pages. It is a useful bridge between a summary-based wiki and a full code intelligence model.

Why not just rely on vector search?

Because vector retrieval on code is noisy, stale, and opaque, and because the strongest coding agents have mostly moved to agentic search on the filesystem. Vectors still help for semantic discovery and inside hybrid retrieval, but they are no longer enough on their own.

When does a wiki stop being enough?

When agents confidently answer questions the wiki cannot actually support. When curated knowledge becomes its own maintenance problem. When the questions are about change impact, cross-service behaviour, or business-rule traceability. That is the moment to add a code graph underneath.

Does CoreStory replace any of these layers?

No. CoreStory is the graph layer. It sits under whichever retrieval strategy and whichever agent runtime you already use, and exposes the same structural model to humans through a dashboard and to agents through an MCP endpoint.

However, adopting CoreStory in advance of implementing these complementary layers will help ensure that agents draft and maintain those layers with richer codebase awareness. In other words, your curated knowledge will be more comprehensive, and your agent runtime will more successfully reconcile discrepancies between sources.

How CoreStory Cuts LLM Costs by 70% While Improving Output Quality

Michel Ozzello — Fri, 15 May 2026 15:50:54 +0000

TL;DR LLMs charge per token, and large codebases generate enormous token bills — especially when AI agents re-ingest the same context repeatedly. CoreStory transforms your codebase into a persistent Code Intelligence Model (CIM), giving AI agents structured, targeted context instead of raw code. In a real-world evaluation, Claude Code paired with CoreStory used 73% fewer input tokens, ran in half the time, and cost 67% less — while delivering better results. This post explains why that happens and how to replicate it.

The Token Bill Nobody Talks About

A 10-engineer team running Claude Code against a 500,000-token codebase can burn $15,000–$40,000 per month in context re-ingestion alone before writing a single line of net-new logic. That's not a projection. That's what happens when AI agents are given raw code instead of structured intelligence.

Here's the math. Each developer session re-sends the same modules, schemas, and helper functions the model saw yesterday. A single prompt involving a non-trivial subsystem easily runs 20,000–50,000 input tokens. Multiply by 10 engineers, 20 working days, and 3–5 sessions per day, and you're looking at a substantial monthly token bill just for context, before accounting for the model's output.
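A minimal sketch of that multiplication, for readers who want to plug in their own numbers. The per-prompt token range, team size, working days, and sessions per day come from the paragraph above; `prompts_per_session` is a hypothetical parameter the text does not specify, and it defaults to 1 here.

```python
def monthly_input_tokens(tokens_per_prompt, sessions_per_day, engineers=10,
                         working_days=20, prompts_per_session=1):
    """Input tokens sent per month purely to re-establish context."""
    return (tokens_per_prompt * prompts_per_session * sessions_per_day
            * working_days * engineers)


# Low and high ends of the ranges quoted in the text.
low = monthly_input_tokens(tokens_per_prompt=20_000, sessions_per_day=3)
high = monthly_input_tokens(tokens_per_prompt=50_000, sessions_per_day=5)
print(f"{low:,} to {high:,} input tokens per month")
```

The total is a straight product, so raising any one factor, including the number of prompts an agent issues inside a session, scales the bill proportionally.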

Output tokens compound the problem. Most AI providers charge 3–5x more for output tokens than input tokens. When the model lacks proper context, it produces longer, more hedged responses and requires more correction rounds. Each round re-ingests the context, generates more output, and adds to the bill. The real cost of poor context isn't just the tokens you send; it's the tokens you generate trying to fix the results.

In a real customer evaluation: Claude Code + CoreStory MCP used 73% fewer input tokens, ran in half the time, and cost 67% less with better output quality.

Table 1: Real-world cost comparison for adding a complex feature to a large enterprise codebase

Metric	Claude Code	Claude Code + CoreStory	Change
Processing Time	~92 min	~47 min	50% faster
Input Tokens	~1,320,000	~357,500	73% less
Output Tokens	~87,000	~43,000	50% less
Cost (USD)	~$5.29	~$1.74	67% less
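The cost row can be sanity-checked from the token rows. The sketch below assumes the evaluation was priced at the $3 per million input tokens and $15 per million output tokens quoted later in this post; that pairing is an assumption, and the result lands within a few cents of the table's rounded figures.

```python
# Assumed pricing in USD per million tokens (see the lead-in).
INPUT_PER_M = 3.00
OUTPUT_PER_M = 15.00


def run_cost(input_tokens, output_tokens):
    """Dollar cost of one run at the assumed per-million-token prices."""
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M


def pct_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


baseline = run_cost(1_320_000, 87_000)   # Claude Code alone
with_cim = run_cost(357_500, 43_000)     # Claude Code + CoreStory

print(f"baseline ${baseline:.3f}, with CoreStory ${with_cim:.3f}")
print(f"input tokens: {pct_reduction(1_320_000, 357_500):.0f}% less")
print(f"cost: {pct_reduction(baseline, with_cim):.0f}% less")
```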

Why LLMs Have a Context Problem With Large Codebases

LLMs don't retain memory between sessions. Every interaction starts from zero. When a developer asks an AI agent to refactor a module, the model needs more than that file: it needs the schemas the module depends on, the helper functions it calls, the data flow it participates in, and enough architectural context to avoid introducing regressions. That's tens of thousands of tokens per request, for context the model already processed yesterday.

This creates a pattern of escalating repeated spending. Teams working on production systems often send 1.5–5 million tokens per month simply to keep the model oriented before counting any of the actual work tokens. And this is the base model cost. Many AI coding agents (Devin, Factory, and others built on top of foundation models) charge a premium per token and burn more per session through agentic loops.

It's important to note that coding agents like Claude Code do support persistent configuration files (like CLAUDE.md, skill files and custom instructions) that carry context across sessions and can be shared across a team. But there's a meaningful difference between agent configuration ("here's how to work on this codebase") and code intelligence ("here are the critical architectures, business rules, and interdependencies, pre-mapped and queryable"). The former tells the agent how to behave. The latter gives it something to actually know. Configuration files are also rarely centrally governed, they drift, they vary by developer, and they don't scale with codebase complexity.

Why Agentic Loops Are Especially Expensive

A standard developer prompt re-ingests context once. An AI agent running a multi-step loop — plan, execute, reflect, error-correct, retry — re-ingests that context at every step. A 10-step agentic loop on raw code isn't 10x the token cost of a single prompt. It can be 30–50x, because each reflection and error-correction cycle starts with a full context re-ingestion.

This is where the CoreStory ROI is most dramatic. Providing an agent with a structured Code Intelligence Model instead of raw files doesn't just reduce the initial context, it reduces every downstream step, every correction round, and every output generation in the loop.
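One way to see where a 30–50x figure could come from is to count full-context calls. The sketch below is a toy model, not a measurement: it assumes three model calls per step (plan, execute, reflect) and two extra calls for every error-correction retry, all of which are illustrative parameters.

```python
def loop_multiplier(steps, calls_per_step=3, retries_per_step=0.0, calls_per_retry=2):
    """Full-context ingestions for an agentic loop, relative to one prompt.

    All defaults are illustrative assumptions, not measured values.
    """
    return steps * (calls_per_step + retries_per_step * calls_per_retry)


def loop_input_tokens(context_tokens, **loop_kwargs):
    """Input tokens spent on context alone across the whole loop."""
    return context_tokens * loop_multiplier(**loop_kwargs)


print(loop_multiplier(steps=10))                        # no retries
print(loop_multiplier(steps=10, retries_per_step=1.0))  # one retry per step
```

Under these assumptions the multiplier runs from 30 to 50 for a 10-step loop, and shrinking `context_tokens` shrinks every one of those ingestions at once.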

What a Code Intelligence Model Actually Is (And Why RAG Doesn't Solve This)

CoreStory ingests your entire codebase once and produces a Code Intelligence Model, a hierarchical specification organized by domain, module, and behavior contract. CoreStory's pipeline performs static analysis, call graph extraction, data flow tracing, and business logic summarization to produce structured output that captures what the software does, not just what it says.

This is meaningfully different from a flat embedding index or a retrieval-augmented generation (RAG) approach. RAG sounds appealing: chunk the codebase, embed it, retrieve relevant chunks at query time. In practice, it fails for code in four specific ways:

Poor chunking boundaries: code modules don't chunk cleanly at semantic boundaries. A stored procedure and the schema it depends on rarely land in the same chunk
Loss of cross-module dependencies: chunked embeddings lose the call graph, which is exactly what the model needs to avoid introducing integration errors
No business logic layer: RAG retrieves code text; it doesn't extract the invariants, edge cases, and behavior contracts the CIM explicitly captures
No invariant preservation: the CIM maintains consistent structural relationships; retrieval results vary by query phrasing, producing non-deterministic behavior in agentic loops

The result of using a CIM instead of raw code or RAG: the model receives a concise, high-signal specification rather than thousands of tokens of implementation detail, which is why token consumption drops by 70%+ in practice.

The Quality Multiplier: Better Context Means Fewer Corrections

According to the 2025 Stack Overflow Developer Survey (65,000+ respondents), 87% of developers are concerned about AI accuracy, and 45% say debugging AI-generated code is more time-consuming than debugging their own.

That 45% statistic sounds abstract until you connect it to payroll. A developer with a $150,000 fully-loaded annual cost who spends 30% of their time debugging AI output is losing approximately $45,000 per year in productivity (before you count the rework tokens the model burns trying to correct its own mistakes).

Microsoft co-research with CoreStory found a 51% accuracy improvement when AI agents operate from CoreStory specifications rather than raw code. Across AI coding agent benchmarks, teams using CoreStory to supercharge AI coding agents see 44% better results.

The mechanism is straightforward: a model with a complete, consistent architectural view produces code that integrates correctly on the first attempt. It doesn't need to infer dependencies, they're specified. It doesn't need to guess at business rules, they're documented. Fewer hallucinations, fewer integration failures, fewer correction rounds. And fewer correction rounds means fewer output tokens, which compounds the cost savings.

Total Savings Across Team Sizes

The figures below use Claude Sonnet 4.6 API pricing ($3/M input, $15/M output) as the enterprise baseline. Token estimates are based on observed developer usage patterns for teams using Claude Code as a primary development tool.

Table 2: Token savings by developer team size

Team Size	Monthly Baseline Token Cost (without CoreStory)	Monthly Token Cost with CoreStory (Conservative, 50% token saving)	Monthly Token Cost with CoreStory (Ideal, 70% token saving)	Annual Saving (Conservative, 50% token saving)	Annual Saving (Ideal, 70% token saving)
Solo Developer	~$600	~$300	~$180	~$3,600	~$5,040
5-engineer team	~$3,000	~$1,500	~$900	~$18,000	~$25,200
10-engineer team	~$15,000-$40,000	~$7,500-$20,000	~$4,500-$12,000	~$90K-$240K	~$126K-$336K
50-engineer team	~$75,000-$200,000	~$37,500-$100,000	~$22,500-$60,000	~$450K-$1.2M	~$630K-$1.68M

The 10-engineer range ($15K–$40K/month) reflects our own observed data on context re-ingestion costs for teams working on 500,000+ token codebases, before net-new output is factored in.
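The annual columns in Table 2 follow directly from the monthly baselines: annual saving = monthly baseline × saving rate × 12. A short sketch that reproduces them, using integer percentages to avoid floating-point rounding (the dictionary layout is just one convenient way to hold the table's baselines):

```python
def annual_saving(monthly_baseline, saving_pct):
    """Annual dollars saved for a monthly baseline and an integer saving percent."""
    return monthly_baseline * saving_pct * 12 // 100


# (low, high) monthly baselines in USD, copied from Table 2.
BASELINES = {
    "Solo developer": (600, 600),
    "5-engineer team": (3_000, 3_000),
    "10-engineer team": (15_000, 40_000),
    "50-engineer team": (75_000, 200_000),
}

for team, (low, high) in BASELINES.items():
    for pct in (50, 70):
        lo_save, hi_save = annual_saving(low, pct), annual_saving(high, pct)
        span = f"${lo_save:,}" if lo_save == hi_save else f"${lo_save:,}-${hi_save:,}"
        print(f"{team:18} {pct}% saving: {span} per year")
```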

Token savings are clear, but they’re still only one side of the equation.

Let’s consider a fully-loaded senior developer at $200,000/year — salary, benefits, overhead. That's roughly $100/hour, or about $16,700/month. Across a 10-engineer team, developer cost runs to ~$2M/year before infrastructure, tooling, or management costs.

From the Stack Overflow Developer Survey, 45% of developers say debugging AI-generated code takes longer than debugging their own. Our evaluation data shows that with CoreStory, AI agents produce correct output on the first attempt more often. That's fewer correction rounds, fewer rework cycles, less time spent debugging hallucinated integrations:

1. Task execution time: 50% reduction

Our real-world evaluation measured a 49% reduction in execution time for a complex feature task. Applied conservatively, a developer spending 6 hours/day on AI-assisted development tasks effectively recovers 3 hours, or gains the equivalent of one additional developer-day every two days.

2. Rework reduction from better output quality

Fewer hallucinations, fewer integration failures, fewer correction rounds. If 30% of developer time currently goes to debugging and reworking AI-generated code (consistent with the Stack Overflow data), a 50% reduction in that rework reclaims 15% of total developer capacity.

Table 3: The Full Savings Combined

Team Size	Annual Token Saving (Conservative, 50%)	Developer Time Value Recovered (50% task speed)*	Rework Reduction Value**	Total Annual Savings
5-engineer team	~$18,000	~$250,000	~$75,000	~$343,000
10-engineer team	~$90K-$240K	~$500,000	~$150,000	~$740K-$890K
50-engineer team	~$450K-$1.2M	~$2.5M	~$750K	~$3.7M-$4.45M

*Developer time value recovered assumes 50% of working hours are AI-assisted tasks, and a 50% speed improvement on those tasks — applied to a $200K fully-loaded cost.
**Rework reduction assumes 30% of time currently lost to debugging/correcting AI output, with 50% of that recovered through higher-quality first-pass output.

Beyond Cost: Speed and SDLC Quality

Token savings are the most measurable benefit, but the compounding effect on the overall software development lifecycle may be more significant. When AI agents have complete architectural context from the start:

Onboarding time for new developers drops: they can query the CIM instead of reading source code for weeks
Code review cycles shorten: reviewers can verify that generated code matches specified behavior, not just syntax
Integration failures decrease: the CIM's explicit dependency map means fewer surprises when merging
Documentation stays current: the CIM is regenerated from source, so it reflects the actual codebase, not the last time someone updated the wiki

In the customer evaluation referenced in Table 1, execution time was cut in half not just because of fewer tokens, but because the model needed fewer iteration cycles to produce correct output. The first attempt was closer to the right answer, which meant less back-and-forth, less rework, and a faster path from task to merged code.

The Bigger Picture: Context Windows Grow. Codebases Grow Faster.

Every LLM release announcement leads with a larger context window. The implicit promise is that this solves the context problem: just fit more code in the prompt. It doesn't.

Context windows are growing at roughly 4x per generation. Enterprise codebases grow at roughly 10–20% per year, but more importantly, the codebases that need AI assistance most are the ones that have been growing for 20–30 years. A 2-million-token context window doesn't fit a 30-year-old insurance platform's stored procedures, metadata-driven configuration, and undocumented integration layers.

As context windows grow but codebases grow faster, and as agentic loops multiply token consumption non-linearly, the gap between what an LLM can hold and what a production system contains will widen, not close. The teams that treat codebase understanding as a managed artifact, not an ad-hoc prompt input, will compound their AI investment advantages over time.

CoreStory is the missing piece: the persistent, queryable Code Intelligence Model that gives AI agents what they actually need — not more tokens, but better ones.

Want to see CoreStory's token impact on your codebase? Talk to an engineer who can model your specific usage pattern.

Frequently Asked Questions

Does CoreStory work with my existing AI coding tools?

Yes. CoreStory integrates with Claude Code, GitHub Copilot, Cursor, Devin, and other AI coding agents via MCP server integration and CoreStory Playbooks. The CIM is available as structured context that any AI agent can query.

Is the 70% token reduction typical?

The 73% input token reduction shown in Table 1 represents a specific task (adding a complex feature to a large codebase). Reductions vary by task type, codebase size, and the proportion of context the task requires. Tasks requiring narrow, well-specified context see the largest reductions; tasks requiring broad exploration may see less. The consistent finding across evaluations is that quality improves regardless of context reduction.

What programming languages does CoreStory support?

CoreStory supports a long list of languages, including Java, C#, Python, COBOL, PowerBuilder, and SystemVerilog, to name a few.

How to Extract Business Rules from Legacy COBOL Code

Michel Ozzello — Fri, 15 May 2026 14:27:32 +0000

TL;DR Extracting business rules from COBOL is where most modernization projects succeed or fail. The challenge isn’t reading the code but understanding the business logic embedded across thousands of programs, copybooks, and PERFORM chains. Static analysis tools (IBM ADDI, CAST Imaging) provide dependency mapping and visualization. LLM-assisted approaches add summarization but risk hallucination. CoreStory’s Code Intelligence Model combines structural COBOL analysis with AI-generated specifications and confidence scoring, producing validated business rules ready for modernization planning. In a production engagement, CoreStory extracted 1,984 business specifications with an 85.5% SME validation rate.

What Counts as a Business Rule in COBOL

Before choosing a tool, you need to define what you’re extracting. In modern languages, business rules are often isolated in service layers or rule engines. In COBOL, they’re woven through the code. A single business rule in a COBOL system might involve:

A COMPUTE statement that calculates a premium based on risk factors defined in a copybook shared across 15 programs.

An EVALUATE block that routes processing based on transaction type codes stored in a VSAM file.

A chain of PERFORM statements that validate eligibility by checking conditions across three separate programs, each with its own copybook definitions.

Implicit rules encoded in data definitions like a PIC 9(2) field that constrains a value to 0–99, enforcing a business constraint that exists nowhere in documentation.
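To make the implicit-rule point concrete, here is a minimal sketch that derives the value range implied by a simple PICTURE clause. It handles only unsigned integer pictures made of `9`s (such as `9(2)` or `999`); signed, decimal, and alphanumeric pictures are deliberately rejected, and the function name `pic_range` is hypothetical.

```python
import re


def pic_range(picture):
    """Return (min, max) implied by an unsigned integer PICTURE clause.

    Supports only pictures built from 9s, e.g. 'PIC 9(2)' or '999'.
    """
    # Drop an optional leading PIC / PICTURE [IS] keyword.
    pic = re.sub(r"^\s*PIC(TURE)?(\s+IS)?\s+", "", picture.strip(), flags=re.I).upper()
    # Expand repeat counts: 9(2) becomes 99.
    expanded = re.sub(r"9\((\d+)\)", lambda m: "9" * int(m.group(1)), pic)
    if not re.fullmatch(r"9+", expanded):
        raise ValueError(f"unsupported PICTURE clause: {picture!r}")
    return 0, 10 ** len(expanded) - 1


print(pic_range("PIC 9(2)"))  # (0, 99)
```

A range like this never appears in a requirements document, yet any replacement system that stores the field in a wider type silently drops the constraint.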

The difficulty is that these rules weren’t designed to be extracted. They evolved over decades of maintenance by dozens of developers, many of whom are no longer available to explain their intent. Dead code intermingles with active logic. Copybook definitions are shared across programs in ways that create invisible dependencies. GOTO statements create control flows that resist automated analysis.

This is why generic AI tools fail at COBOL rule extraction. You can’t prompt your way through a system where a single business rule spans files, copybooks, and database calls, with critical context encoded in data type definitions.

This article looks at three approaches available today, along with their advantages and shortcomings.

Approach 1: Static Analysis Tools

The established category for COBOL analysis is static analysis and dependency mapping. These tools parse COBOL source code and produce visualizations of program relationships, data flows, and control flows.

IBM Application Discovery and Delivery Intelligence (ADDI)

IBM ADDI is the most widely deployed tool for mainframe application analysis. It’s purpose-built for z/OS environments and provides:

Cross-program dependency mapping: detects relationships between COBOL programs, JCL jobs, DB2 calls, IMS transactions, and datasets.

Change impact analysis: traces forward and backward from any program or variable to identify what breaks if you modify it.

Program call graphs: visual representations of control flow between programs.

DB2 metadata analysis: pulls schemas from all DB2 tables associated with a job.

ADDI excels at answering the dependency question of “what is connected to what”. It’s a critical first step in any modernization project because it prevents teams from making changes that break downstream processes they didn’t know existed.

Where ADDI is limited is in answering the business rule question of “what does the code mean”. ADDI provides a graphical model of COBOL code that shows variable usage, data flows, and program relationships. But translating that structural information into documented business rules requires human interpretation. ADDI shows you the machinery, but understanding what the machinery does is still a human task.

IBM’s modernization stack pairs ADDI with watsonx Code Assistant for Z, which uses AI agents to generate natural language explanations of mainframe code. Together they form a pipeline: ADDI provides the structural analysis, watsonx provides AI-assisted explanation. This combined approach is powerful but remains tightly coupled to the IBM ecosystem.

CAST Imaging

CAST Imaging provides architecture-level visualization of legacy applications, including COBOL. It creates interactive maps of software systems that show components, dependencies, data flows, and transaction paths.

CAST’s strength is cross-technology analysis. It can map a system that includes COBOL programs, Java middleware, SQL databases, and web frontends, showing how they all connect. For organizations with heterogeneous technology stacks, this cross-technology view is valuable for modernization planning.

Like ADDI, CAST Imaging is primarily a visualization and analysis tool. It shows you the structure of your system with impressive clarity but doesn’t generate business rule documentation automatically. The business rule extraction work still requires analysts to interpret the visualizations and write specifications.

Other static analysis tools

Micro Focus Enterprise Analyzer (now part of OpenText) provides similar capabilities for COBOL, PL/I, and Natural applications. Fresche Solutions’ X-Analysis Advisor is specifically designed for IBM i environments, extracting rules from RPG and COBOL and writing them in pseudo code. IBM’s Rational Asset Analyzer provides centralized analysis and inventory management for mainframe application portfolios.

Approach 2: LLM-Assisted Extraction

The emergence of large language models trained on programming languages has created a new approach: feeding COBOL paragraphs to an LLM and asking it to summarize the business logic in natural language.

The pipeline typically works paragraph by paragraph: extract a COBOL section, send it to GPT-4, Claude, or a specialized model with a prompt like “explain the business logic in this COBOL code”, and collect the natural language summary. More sophisticated implementations use prompt chaining: first identify variables and data flows, then trace decision logic, then summarize the business rule.
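The prompt-chaining variant can be sketched as three templated calls, each fed the output of the previous one. The prompt wording and the `complete` callable are assumptions; `complete` stands in for whatever LLM client you use and is not a real library API.

```python
from typing import Callable, Dict

# Stage name and prompt template. Later templates reference earlier outputs.
STAGES = [
    ("variables",
     "List the variables and data flows in this COBOL paragraph:\n{code}"),
    ("decisions",
     "Given these variables and data flows:\n{variables}\n\n"
     "Trace the decision logic in:\n{code}"),
    ("rule",
     "Given this decision logic:\n{decisions}\n\n"
     "Summarize the business rule in plain English."),
]


def extract_rule(code: str, complete: Callable[[str], str]) -> Dict[str, str]:
    """Run the chain; each stage's output is added to the context for the next."""
    context = {"code": code}
    for name, template in STAGES:
        context[name] = complete(template.format(**context))
    return context


# Stand-in for a real model call so the sketch runs offline.
def fake_llm(prompt: str) -> str:
    return f"[model output for a {len(prompt)}-character prompt]"


result = extract_rule("IF CUST-AGE > 65 MOVE 'S' TO RATE-CLASS.", fake_llm)
print(result["rule"])
```

Note that nothing in this chain checks the intermediate answers: an error in the first stage flows straight into the final summary.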

What LLM-assisted extraction does well

Fast iteration: you can process hundreds of COBOL paragraphs in hours rather than weeks.

Natural language output: the summaries are immediately readable by business analysts who don’t know COBOL.

Pattern recognition: LLMs are good at recognizing common COBOL patterns (date calculations, table lookups, record processing) and explaining them clearly.

Where LLM-assisted extraction fails

LLMs hallucinate. This is not a theoretical risk — it’s the primary failure mode for LLM-based COBOL analysis. An LLM that “explains” a COBOL paragraph may confidently describe logic that doesn’t exist, miss critical edge cases encoded in copybook definitions it never saw, or invent variable relationships that are plausible but wrong.

The problem compounds at scale. A single hallucinated business rule in a modernization specification can propagate through the entire project: the new system implements a rule that the old system never enforced, or misses a rule that the old system depended on. In regulated industries (banking, insurance, healthcare), this isn’t just a bug; it’s a compliance violation.

IBM Research’s A-COBREX tool (presented at ICSE 2025) demonstrates the state of the art in automated COBOL business rule identification. Evaluated on 27 programs with ground truth annotations, A-COBREX achieved 74.12% recall and 62.21% precision for fuzzy matching between extracted and actual rules. These numbers reflect the genuine difficulty of the problem: even purpose-built research tools miss roughly a quarter of the rules and include false positives in more than a third of their output.
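For readers less familiar with the metrics, this is how precision and recall map onto the "quarter missed" and "more than a third false positives" reading. The counts in the first example are hypothetical and are not A-COBREX data; only the two percentages come from the text.

```python
def recall(true_pos, false_neg):
    """Share of real rules that the tool found."""
    return true_pos / (true_pos + false_neg)


def precision(true_pos, false_pos):
    """Share of extracted rules that are real."""
    return true_pos / (true_pos + false_pos)


# Hypothetical counts, for illustration only.
tp, fp, fn = 3, 1, 1
print(recall(tp, fn), precision(tp, fp))

# Percentages reported in the text.
reported_recall, reported_precision = 0.7412, 0.6221
missed_share = 1 - reported_recall             # real rules not found
false_positive_share = 1 - reported_precision  # extracted rules that are wrong
print(f"missed: {missed_share:.2%}, false positives: {false_positive_share:.2%}")
```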

The LLM-assisted approach works best when paired with strong structural analysis (like ADDI) and mandatory human validation gates. Using an LLM alone to extract business rules from production COBOL is like using autocomplete to write a legal contract: the output looks right, but the stakes are too high for “looks right.”

Approach 3: AI Code Intelligence with Validated Specifications

The third approach (and the one CoreStory implements) combines structural COBOL analysis with AI-powered specification generation and mandatory confidence scoring.

The key distinction is that CoreStory doesn’t just summarize COBOL code. It analyzes it structurally: parsing abstract syntax trees, resolving copybook references, tracing PERFORM chains across programs, mapping data flows through VSAM and DB2 calls, and building a Code Intelligence Model that captures the complete architecture of the system. The AI-generated specifications are derived from this structural analysis, not from reading code as text.

How the CoreStory pipeline works for COBOL

Ingestion: The entire COBOL estate is ingested — programs, copybooks, JCL, DB2 definitions, CICS maps. CoreStory supports the full mainframe ecosystem, not just the COBOL source.

Structural analysis: AST parsing extracts program structures, data definitions, control flows, and cross-program relationships. Copybook references are resolved to their actual definitions.

Intelligence model construction: A Code Intelligence Model captures the system’s architecture: which programs handle which functions, how data flows between components, where business logic is concentrated.

Specification generation: AI-assisted analysis generates structured business specifications from the intelligence model. Each specification includes what the rule does, where it’s implemented, what data it depends on, and a confidence score.

A live production example

In a production environment, CoreStory’s pipeline extracted 1,984 business specifications. Subject matter experts validated these specifications with an 85.5% approval rate. That’s not a benchmark on a test dataset. It is production output from a real system, validated by the people who maintain it.

Confidence scoring changes the conversation. Instead of involving SMEs in reviewing every single spec, the question becomes “How do we direct our expensive human experts to the parts of the code that actually need their input?”

The 14.5% that SMEs flagged for revision is the system working as designed: confidence scoring identified the ambiguous cases that need review; human SMEs caught the edge cases that automated analysis couldn’t resolve; the specifications got corrected.

The final output is a validated set of business rules that modernization teams can trust, without project-crippling human overhead. It acts as a “safety net” that prevents production incidents, which is particularly important in regulated industries.

Choosing the Right Approach

Capability	IBM ADDI	CAST Imaging	LLM-Assisted	CoreStory CIM
Dependency mapping	Yes	Yes	No	Yes
Change impact analysis	Yes	Yes	No	Yes
Business rule documentation	Manual (human interprets visualizations)	Manual	Automated (with hallucination risk)	Automated with confidence scoring
Copybook resolution	Yes	Partial	Requires manual context	Yes (full estate)
Cross-program tracing	Yes	Yes	Limited to context window	Yes (entire system)
Validation methodology	Human review of visualizations	Human review	None built in	SME validation with confidence scores
Output format	Dependency graphs, call maps	Architecture visualizations	Natural language summaries	Structured specifications
Production validation data	No public benchmarks	No public benchmarks	A-COBREX: 74% recall, 62% precision	LifeSys: 1,984 specs, 85.5% validation

These approaches are not mutually exclusive. IBM’s own modernization stack demonstrates this: ADDI provides structural analysis, watsonx Code Assistant provides AI-assisted explanation. A practical enterprise approach might use ADDI for dependency mapping, an LLM for initial summaries, and CoreStory for validated specification generation. The question is where you need certainty.

The Validation Problem Nobody Talks About

The hardest part of COBOL business rule extraction isn’t the extraction; it’s knowing whether the extraction is correct.

Static analysis tools produce accurate structural views but don’t generate business rule documentation. LLMs generate plausible summaries but can’t guarantee accuracy. The gap between “the tool says this is the business rule” and “we know this is the business rule” is where modernization projects fail.

CoreStory’s confidence scoring addresses this directly. Every generated specification includes a confidence score that reflects the complexity of the underlying code, the ambiguity of the logic, and the completeness of the available context. High-confidence specs can be reviewed quickly. Low-confidence specs get deeper human analysis.
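The triage idea can be sketched in a few lines. This is not CoreStory's schema or API: the `Spec` fields mirror the four items the article says each specification carries, and the 0 to 1 confidence scale and the 0.8 threshold are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Spec:
    rule: str              # what the rule does
    location: str          # where it is implemented
    depends_on: List[str]  # data the rule depends on
    confidence: float      # assumed scale: 0.0 (ambiguous) to 1.0 (clear)


def triage(specs: List[Spec], threshold: float = 0.8) -> Tuple[List[Spec], List[Spec]]:
    """Split specs into (quick_review, deep_review), least confident first."""
    ordered = sorted(specs, key=lambda s: s.confidence)
    quick = [s for s in ordered if s.confidence >= threshold]
    deep = [s for s in ordered if s.confidence < threshold]
    return quick, deep


# Hypothetical specs, for illustration only.
specs = [
    Spec("Cap reimbursement window at 90 days", "CLAIMS01 para 2100", ["CLM-DATE"], 0.93),
    Spec("Route by transaction type code", "TXNRTE para 0400", ["TXN-TYPE"], 0.55),
    Spec("Senior rate class above age 65", "RATECALC para 1200", ["CUST-AGE"], 0.81),
]
quick, deep = triage(specs)
print([s.rule for s in deep])
```

Sorting by ascending confidence means SMEs see the most ambiguous specification first when they open the deep-review queue.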

This isn’t a cosmetic feature. In the production example above, confidence scoring correctly flagged the most problematic specifications, the ones that SMEs ultimately revised. The validation process becomes efficient because human expertise is directed where it’s most needed, not spread evenly across thousands of specifications.

From COBOL Code to Validated Business Rules

Extracting business rules from COBOL isn’t optional — it’s the prerequisite for any modernization project that needs to preserve the logic that runs your business. The question is whether you do it manually (expensive, slow, error-prone), with an LLM (fast, cheap, unvalidated), or with a purpose-built code intelligence platform that combines structural analysis with validated specification generation.

CoreStory’s Code Intelligence Model is the third option. Real results from a real mainframe system. 1,984 business specifications. 85.5% SME validation rate. Ready for modernization planning.

See how CoreStory can help you extract valid business specifications from your COBOL codebase. Talk to an expert →

Agent Boosting: The Missing Workflow for Getting Real Results from AI Coding Agents

Michel Ozzello — Fri, 15 May 2026 13:42:26 +0000

Originally published on CoreStory by John Bender.

Your Agents Are Capable. They're Just Flying Blind.

There's a growing gap between what AI coding agents can do in theory and what they actually deliver in practice. Claude Code, Cursor, Copilot, Devin, Codex, Droid — every major agent has gotten dramatically more capable over the past year. They can plan multi-step tasks, edit across files, run tests, and iterate on their own output.

And yet, engineering teams keep reporting the same experience: the agent works on small tasks, stumbles on anything that crosses system boundaries, and burns tokens exploring dead ends it could have avoided with five minutes of architectural context.

The problem isn't the agent. It's the context.

Context engineering has emerged as one of the most important disciplines in AI-assisted development. Thoughtworks, Anthropic, and individual practitioners have all converged on the same insight: curating what the model sees is the single highest-leverage thing you can do to improve output quality. As Anthropic's own engineering team put it, effective context engineering means finding the smallest possible set of high-signal tokens that maximize the likelihood of the desired outcome.

But there's a meaningful difference between configuring an agent (writing a CLAUDE.md file, setting up rules, defining skills) and actually giving it deep, structured knowledge about the system it's working in. Configuration tells the agent how to behave. Knowledge gives it something to reason about.

Agent Boosting is the practice of closing that gap: equipping your coding agents with persistent, structured code intelligence so they perform at their actual capability ceiling rather than stumbling through unfamiliar code.


Two Sessions, Same Agent, Different Outcomes

To understand what Agent Boosting changes, consider two versions of the same task.

Without Agent Boosting: A developer asks their coding agent to fix a bug where inherited attributes are missing their docstrings in a Sphinx documentation build. The agent reads the relevant files, identifies the docstring retrieval logic, and patches it. The fix is locally coherent — it looks correct based on the code the agent can see. Tests fail. The agent iterates, adjusting the retrieval logic, adding edge case handling, exploring adjacent files. After 20 minutes and thousands of tokens, the developer intervenes and discovers the actual root cause: attributes were never collected during member enumeration, an upstream problem in a completely different function. The agent was fixing the right symptom in the wrong place.

With Agent Boosting: The same developer, same agent, same task. But before the agent starts exploring code, it queries CoreStory's intelligence model via MCP. CoreStory serves two roles in this interaction. First, it acts as an Oracle — answering questions about how the Sphinx documentation pipeline is intended to work, what the data flow looks like, and what invariants govern member enumeration. Then it acts as a Navigator — pointing the agent to the specific function where attributes are collected, the method signatures involved, and the extension points that downstream retrieval depends on.

The agent sees immediately that the collection stage is the problem, not retrieval. It targets the upstream function, writes the fix, and passes tests on the first implementation.

This isn't a hypothetical. It's sphinx-8548 from CoreStory's SWE-bench evaluation, where three independent agents — Claude Code, Droid, and Codex — all converged on the same wrong fix at baseline, and all three solved the task correctly when given architectural context. When agents with different architectures and different underlying models all make the same mistake and all correct course from the same context, the failure isn't model-specific. It's a structural gap that better context closes.


Why Agents Fail on Complex Tasks

Every AI coding agent, regardless of architecture or underlying model, shares the same fundamental constraint: it reasons from what's in its context window. When that context is raw source code, the agent has to infer architecture from implementation details, guess at dependencies it can't see, and reconstruct system boundaries that were never documented.

This works fine for small, self-contained tasks. It breaks down predictably on anything that requires understanding how components relate to each other.

In a controlled evaluation CoreStory ran across six leading agents on the 45 hardest tasks in SWE-bench Verified, the failure pattern was consistent. Agents didn't fail because they couldn't write correct code. They failed because they pursued the wrong solution path — fixing symptoms instead of causes, missing hidden dependencies, or patching one location in a multi-file bug and leaving the others untouched.

The dominant category, accounting for 72% of all task flips from fail to pass, was wrong solution prevention: agents pursuing locally rational but architecturally incorrect approaches because they couldn't see pipeline boundaries. The second most common, at 46%, was hidden dependency discovery: implicit coupling between components that's invisible from local code inspection. (The percentages overlap because a single flip can involve more than one mode.) In one Django task, two independent agents discovered through CoreStory that a transform class internally constructs a completely different lookup class, a dependency with no visible trace in the source file. The full taxonomy of five failure modes is covered in our benchmark deep dive.

These aren't edge cases. Over half the tasks in the evaluation — 24 of 45 — contained at least one problem that an agent could only solve with better context.


What Agent Boosting Actually Looks Like

Agent Boosting isn't a feature. It's a workflow discipline built on three principles.

1. Oracle before Navigator. Understanding before location.

The typical agent workflow is: receive task, explore code, form a plan, implement. Agent Boosting restructures this into two distinct phases before the agent writes any code.

First, the agent queries CoreStory as an Oracle: How is this system intended to work? What are the invariants? What are the business rules? What's the data flow through this pipeline? This is context synthesized from the entire codebase — not just file contents, but the meaning behind them. The Oracle captures architecture, behavior contracts, design history, and edge cases that aren't visible in any single source file.

Then the agent queries CoreStory as a Navigator: Which files do I need to change? What methods are involved? Where are the extension points? What are the call sites? Instead of grep-wandering through hundreds of files, the agent gets directed to exactly the code it needs.
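The two-phase ordering can be sketched in a few lines. This is a minimal illustration, not CoreStory's API: the `ask` callable, the `Briefing` class, and the question wording are all hypothetical stand-ins for whatever sends a question to the intelligence layer (for example, an MCP tool call).

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical question templates; the wording is illustrative only.
ORACLE_QUESTIONS = [
    "How is the system behind this task intended to work?",
    "What invariants and business rules apply?",
    "What is the data flow through the affected pipeline?",
]
NAVIGATOR_QUESTIONS = [
    "Which files need to change?",
    "Which methods and call sites are involved?",
    "Where are the extension points?",
]


@dataclass
class Briefing:
    """Context gathered before the agent edits any code."""
    task: str
    understanding: list[str] = field(default_factory=list)
    locations: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        lines = [f"Task: {self.task}", "", "System understanding:"]
        lines += [f"- {answer}" for answer in self.understanding]
        lines += ["", "Where to work:"]
        lines += [f"- {answer}" for answer in self.locations]
        return "\n".join(lines)


def build_briefing(task: str, ask: Callable[[str], str]) -> Briefing:
    """Run every Oracle query first, then every Navigator query.

    `ask` is a placeholder (an assumption of this sketch) for the call
    that forwards a question to the intelligence layer.
    """
    briefing = Briefing(task=task)
    for question in ORACLE_QUESTIONS:      # phase 1: understanding
        briefing.understanding.append(ask(f"{question} Task: {task}"))
    for question in NAVIGATOR_QUESTIONS:   # phase 2: location
        briefing.locations.append(ask(f"{question} Task: {task}"))
    return briefing
```

The agent would receive `briefing.to_prompt()` before it opens a single file. The point of the structure is the ordering: no location query runs until every understanding query has been answered.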

This Oracle-before-Navigator pattern is the single most important practice in Agent Boosting. It prevents the agent from diving into code changes before understanding the system's constraints. In CoreStory's benchmark evaluation, this pattern improved success rates by an average of 25% in relative terms across all six agents tested. The highest relative uplift was 44% (Claude Code), and even the strongest baseline agents (Droid and Devin, already at 80%+ success) improved by 14%. Research published jointly with Microsoft found a 51% accuracy improvement when AI agents operate from CoreStory's structured specifications rather than raw code.


Agent	Baseline	With CoreStory	Relative Uplift
Claude Code	56%	80%	+44%
Cursor	38%	51%	+35%
GitHub Copilot	62%	78%	+25%
Codex	64%	76%	+17%
Droid	80%	91%	+14%
Devin	82%	93%	+14%
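The uplift column is relative improvement, not a difference in percentage points. A short sketch of the arithmetic follows. The pass counts are an assumption: they are back-computed on the premise that each rate is a count out of the 45 tasks rounded to a whole percent, and the article does not list them.

```python
TOTAL_TASKS = 45  # stated in the article

# (baseline_passes, boosted_passes): ASSUMED counts, back-computed from
# the rounded percentages in the table; the article does not list them.
ASSUMED_PASSES = {
    "Claude Code": (25, 36),
    "Cursor": (17, 23),
    "GitHub Copilot": (28, 35),
    "Codex": (29, 34),
    "Droid": (36, 41),
    "Devin": (37, 42),
}


def relative_uplift(baseline: int, boosted: int) -> float:
    """Relative improvement of boosted over baseline, in percent."""
    return (boosted - baseline) / baseline * 100


uplifts = {}
for agent, (base, boosted) in ASSUMED_PASSES.items():
    uplifts[agent] = relative_uplift(base, boosted)
    print(
        f"{agent:15s} {base / TOTAL_TASKS:4.0%} -> {boosted / TOTAL_TASKS:4.0%}"
        f"  uplift +{uplifts[agent]:.0f}%"
    )

mean_uplift = sum(uplifts.values()) / len(uplifts)
print(f"mean relative uplift: {mean_uplift:.0f}%")
```

Under that assumption the rounded rates and uplifts match the table row by row, and the mean comes out at the 25% average cited. Claude Code, for example, gains 24 percentage points, which is a 44% relative uplift over its baseline.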


2. Make context persistent and queryable, not session-scoped.

Most context engineering today is session-scoped. You write a CLAUDE.md or a .cursorrules file, maybe set up some MCP servers, and the agent gets that context at the start of each session. This is a meaningful improvement over nothing, but it doesn't scale. Recent research from ETH Zurich found that LLM-generated context files actually degraded agent performance by 3% compared to no context file at all, while human-written files provided only a marginal 4% improvement. The researchers found that agents given more context often ran more steps and incurred higher costs without producing better patches, because the context wasn't structured for how agents actually consume information.

Agent Boosting requires a persistent intelligence layer that goes deeper than markdown files. CoreStory's Code Intelligence Model performs static analysis, call graph extraction, data flow tracing, and business logic summarization to produce structured output that captures what the software does, not just what it says. That intelligence persists across sessions, across developers, and across agents — and it's derived directly from the codebase, so it stays current as code evolves rather than drifting like manually written documentation. Conversations with the intelligence model persist too, accumulating institutional knowledge that future queries in the same thread benefit from.

3. Eliminate cross-session re-ingestion.

Every time an agent starts a new session against the same codebase, it re-reads the same files, re-infers the same architecture, and re-discovers the same dependencies. That's wasted tokens and wasted time on every single session.

Agent Boosting replaces this pattern with targeted Oracle and Navigator queries against persistent intelligence. Instead of the agent reading 300 files to orient itself, it asks: What are the dependencies of this module? What's the data flow through this pipeline? Where are all the call sites for this function? The answer comes back in hundreds of tokens instead of hundreds of thousands. CoreStory's cost evaluation measured this directly: Claude Code augmented with CoreStory used 73% fewer input tokens per task. Across the benchmark evaluation, agents avoided reading an estimated 300-500 files in aggregate across all flipped tasks, replacing exploratory code archaeology with targeted architectural queries.


The Economics: Why Agentic Loops Change the Math

The cost case for Agent Boosting starts with an insight most teams haven't internalized yet: agentic loops don't scale linearly. A standard developer prompt re-ingests context once. An AI agent running a multi-step loop — plan, execute, reflect, error-correct, retry — re-ingests that context at every step. A 10-step agentic loop on raw code isn't 10x the token cost of a single prompt. It can be 30-50x, because each reflection and error-correction cycle starts with a full context re-ingestion. And when the model lacks proper context, it produces longer, more hedged responses and requires more correction rounds — each of which generates output tokens that most providers charge 3-5x more for than input tokens.
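A toy cost model makes the super-linear growth easier to see. Every number here is an illustrative assumption rather than a measurement, and the model deliberately ignores retries and provider-side prompt caching, so it does not attempt to reproduce the 30-50x figure.

```python
def single_prompt_cost(context: int, output: int, out_price: float) -> float:
    """Cost of one prompt, expressed in input-token equivalents."""
    return context + output * out_price


def loop_cost(steps: int, context: int, growth: int,
              output: int, out_price: float) -> float:
    """Cost of an agentic loop that re-sends its context at every step.

    Step i (0-based) sends the base context plus `growth` tokens of
    accumulated history per earlier step, and emits `output` tokens.
    """
    total = 0.0
    for i in range(steps):
        total += (context + i * growth) + output * out_price
    return total


# All values below are assumptions chosen for illustration.
CONTEXT = 100_000   # tokens of code context sent at each step
GROWTH = 20_000     # history added per step (tool output, file reads)
OUTPUT = 2_000      # tokens generated per step
OUT_PRICE = 4.0     # output priced at 4x input, inside the 3-5x range

one = single_prompt_cost(CONTEXT, OUTPUT, OUT_PRICE)
ten = loop_cost(10, CONTEXT, GROWTH, OUTPUT, OUT_PRICE)
print(f"10-step loop costs {ten / one:.1f}x a single prompt")
```

Even this conservative setup lands well above 10x, because the history carried into each step keeps growing. Extra correction rounds or a faster-growing history push the multiple higher.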

This is where Agent Boosting delivers its most dramatic ROI. Reducing context at the input doesn't just save on the first step. It compounds savings across every downstream step, every correction round, and every output generation in the loop.

CoreStory's real-world cost evaluation measured the impact on a complex feature task against a large enterprise codebase:

Metric	Baseline (Claude Code)	With CoreStory	Reduction
Processing time	~92 min	~47 min	50%
Input tokens	~1,320,000	~357,500	73%
Output tokens	~87,000	~43,000	50%
Cost per task	~$5.29	~$1.74	67%
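The reduction column can be checked directly from the two value columns. The sketch below uses only the approximate figures from the table, so the results land within about a point of the rounded percentages.

```python
# (baseline, with CoreStory) taken from the table; values are approximate.
METRICS = {
    "Processing time (min)": (92, 47),
    "Input tokens": (1_320_000, 357_500),
    "Output tokens": (87_000, 43_000),
    "Cost per task ($)": (5.29, 1.74),
}


def reduction(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`."""
    return (before - after) / before * 100


reductions = {name: reduction(b, a) for name, (b, a) in METRICS.items()}
for name, pct in reductions.items():
    print(f"{name:22s} {pct:5.1f}%")
```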


At team scale, the numbers compound. A 10-engineer team running agents against a 500,000-token codebase can spend $15,000 to $40,000 per month on context re-ingestion alone. CoreStory's conservative modeling, which applies a 50% token reduction to AI-assisted work hours and factors in recovered developer time from higher first-pass accuracy, yields $740K to $890K in annual savings for a 10-engineer team. At the 50-engineer scale, the estimate rises to between $3.7M and $4.5M annually.

The developer time recovery isn't speculative. The 2025 Stack Overflow Developer Survey (49,000+ respondents) found that 45% of developers say debugging AI-generated code takes longer than debugging their own. Enterprises using CoreStory report up to a 50% reduction in human development time by replacing manual discovery, documentation, and validation with automated specifications. Better first-pass accuracy reduces debugging overhead directly.


Agent Boosting Across the Development Lifecycle

Agent Boosting isn't limited to bug fixes. The Oracle-before-Navigator pattern applies across the full development workflow, because every task benefits from the agent understanding the system before modifying it.

Bug resolution. The agent queries CoreStory to understand how the system should work, generates root cause hypotheses grounded in actual architecture, writes a failing test, and implements a minimal fix. This is the workflow behind the SWE-bench results above (Playbook).

Feature implementation. The agent uses CoreStory to understand existing patterns, data structures, and integration points before writing new code. Instead of inventing a new approach, it extends the system in a way that's consistent with established conventions (Playbook).

Spec-driven development. CoreStory provides the architectural truth that standalone specification tools can't — ensuring specs describe changes constrained by what the system actually does today, not what someone remembers it doing. The agent writes architecture-grounded specifications before implementation, then implements against them (Playbook).

Test generation. The agent derives comprehensive test suites from CoreStory specifications: positive cases, negative cases, edge cases, error contracts, and idempotency tests. Coverage is driven by business rules, not just code paths (Playbook).

Technical due diligence. In M&A scenarios, CoreStory enables rapid architectural analysis of acquisition targets: understanding architecture, identifying risks, assessing technical debt, and evaluating integration complexity — without needing the target's engineering team to walk you through it (Playbook).

Each of these workflows follows the same core pattern. The agent first consults CoreStory for understanding, then for location, then acts on what it learned. The specifics change. The discipline doesn't.


Where Agent Boosting Fits in the Context Engineering Stack

Context engineering is becoming a layered discipline. As Thoughtworks observed, all forms of AI coding context engineering ultimately involve markdown files with prompts, but those files serve fundamentally different purposes depending on what layer they operate at. Here's how Agent Boosting relates to the practices most teams already have in place.

Configuration files (CLAUDE.md, .cursorrules, agent skills) tell the agent how to behave in your codebase: coding standards, testing conventions, preferred libraries. These are table stakes. But as ETH Zurich's research showed, even well-written config files provide only marginal accuracy gains while often increasing agent step count and cost.

MCP servers and tool access give the agent the ability to query external systems, run commands, and interact with services. These expand what the agent can do.

Agent Boosting via persistent code intelligence gives the agent structured knowledge about the system itself: architecture, data flow, dependencies, business rules, semantic intent. This determines whether the agent makes the right decisions with its expanded capabilities. CoreStory's Code Intelligence Model is meaningfully different from a flat embedding index or RAG approach — it captures cross-module dependencies, behavior contracts, and business logic that chunked embeddings lose.

The three layers are complementary. Configuration without knowledge produces agents that follow your style guide but still misunderstand your architecture. Knowledge without configuration produces agents that understand the system but don't follow your conventions. You need both.


Getting Started with Agent Boosting

If you're already using AI coding agents, the fastest path to Agent Boosting is connecting your codebase to CoreStory's intelligence layer. CoreStory integrates with Claude Code, Cursor, GitHub Copilot, Devin, Codex, and Droid via MCP — no changes to the agents themselves. Setup takes minutes: generate an MCP token in the CoreStory dashboard, add the server URL to your agent's configuration, and verify by asking the agent to list your projects.

If you're evaluating agents, consider testing with and without structured architectural context. CoreStory's benchmark data shows that the agent you choose matters less than the context you give it. A weaker agent with good context can match or outperform a stronger agent flying blind. In the SWE-bench evaluation, Claude Code augmented with CoreStory (80% success, up from a 56% baseline) outperformed baseline Codex (64%) and matched baseline Droid (80%), neither of which had CoreStory's architectural guidance on the hardest failure modes.

If you're managing costs, start by measuring your team's token re-ingestion rate: how many tokens per session are spent re-sending context the model already processed in a prior session? That number is your addressable waste. CoreStory customers have reduced it by 50-73%.
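One rough way to measure a re-ingestion rate is sketched below. The log shape is hypothetical (one mapping per session from a context item, such as a file path, to the tokens it cost to send), and the sketch treats an item as unchanged between sessions, which real tooling would need to verify.

```python
from typing import Iterable

# Hypothetical log shape: one dict per session mapping a context item
# (for example a file path) to the number of tokens it cost to send.
Session = dict[str, int]


def reingestion_stats(sessions: Iterable[Session]) -> tuple[int, int, float]:
    """Return (total_tokens, reingested_tokens, reingested_share).

    A token counts as re-ingested when its item was already sent in an
    earlier session. Content changes between sessions are ignored.
    """
    seen: set[str] = set()
    total = repeated = 0
    for session in sessions:
        for item, tokens in session.items():
            total += tokens
            if item in seen:
                repeated += tokens
        seen.update(session)  # items become "already sent" for later sessions
    share = repeated / total if total else 0.0
    return total, repeated, share


# Made-up example data: three sessions against the same small codebase.
sessions = [
    {"billing/invoice.py": 4000, "billing/tax.py": 2500},
    {"billing/invoice.py": 4000, "orders/cart.py": 3000},
    {"billing/invoice.py": 4000, "billing/tax.py": 2500, "orders/cart.py": 3000},
]
total, repeated, share = reingestion_stats(sessions)
print(f"{repeated:,} of {total:,} tokens were re-ingested ({share:.0%})")
```

In this made-up log, more than half of all tokens sent were repeats of context from an earlier session. That repeated share is the addressable waste.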

Whichever path you start from, adopt the Oracle-before-Navigator discipline immediately. Before your agent touches code, ask it to query for understanding first: How does this pipeline work? What are the invariants? What's the intended behavior? Then ask for location: Which files implement this? Where are the extension points?

The quality of what the agent builds depends on the specificity of what you ask. "Tell me about the order system" produces vague context. "What is the validation logic for order placement, what fields are required, and how is stock validation handled?" produces the kind of context that prevents wrong solutions.

The agents are good enough. The question is whether you're giving them what they need to show it.

Ready to boost your coding agents? Join the CoreStory waitlist or talk to an expert to model the impact on your codebase.


The Context Window Paradox: Why Throwing More Tokens at Legacy Code Doesn't Work

Michel Ozzello — Fri, 15 May 2026 13:17:31 +0000

TL;DR Every engineering team working with LLMs on large codebases hits the same wall: the context window. The instinct is to think bigger windows will make things better. But research and practice show that bigger contexts actually degrade output quality through information overload, attention dilution, and the well-documented "lost in the middle" problem. The real solution isn't a bigger window — it's smarter context. By progressively decomposing a codebase along its natural architectural boundaries and recomposing structured intelligence, you give LLMs exactly the context they need to reason accurately about complex systems.

The Working Memory Problem

If you've tried to use an LLM for anything beyond generating a utility function (understanding a module's business logic, tracing a data flow across files, figuring out why a particular function exists), you've felt the constraint.

A context window is the working memory of a large language model. It's the lens through which the model sees everything: your prompt, the conversation history, any code or documents you've fed it. The model doesn't have persistent memory. It has a sliding window of tokens, and everything it knows about your problem has to fit inside that window.

Three things determine what happens inside that window:

The focal point — the model is always attending to specific tokens and surrounding text, deciding what matters.
The contextual relationships — the model interprets connections between tokens to build an internal representation of meaning, not just pattern-matching strings.
The window size — the hard ceiling on how much data the model can hold in its working set at any given moment.

For a developer pasting in a few files to ask about business logic, these constraints become real fast. You hit token limits. Or worse, the model seems like it has room, but the output is wrong because critical context got pushed out of the window or diluted by everything else in there.


Why Context Windows Matter for Engineering Work

The quality of an LLM's output on engineering tasks is directly tied to the context it can access. This plays out in three ways that matter for anyone working with real codebases.

Code understanding requires surrounding context. When an LLM is parsing legacy code, it needs more than the function signature. It needs the imports, the calling code, the data structures being passed around, the copybooks being referenced. Without that surrounding context, the model is guessing. And on a mainframe modernization, guessing is how you introduce regressions that surface during month-end processing, the kind where your general ledger is suddenly off by six figures.

Pattern conformance depends on visible patterns. LLMs adapt their outputs based on patterns observed in the context window. Feed the model well-structured context (naming conventions, architectural patterns, error handling standards, business rules) and it learns to conform. But only if that context fits in the window. Lose it, and the model generates code that looks right syntactically but violates every convention your team has established.

Coherent generation requires architectural visibility. When an LLM generates code that integrates with an existing codebase, coherence isn't optional. The output must match the style, error handling patterns, architectural decisions, and even commenting conventions of what's already there. That requires the model to see those patterns, which means context.

The context window isn't just a technical spec on a model card. It's the bottleneck that determines whether AI-assisted engineering produces usable code or generates plausible-looking output that passes a review but fails in production.


The Obvious (Wrong) Answer

The first thing every engineer asks: why not just make the context window bigger?

If the problem is fitting enough context, expand the window. A million tokens. Ten million. Problem solved. Not quite.

Anyone who's worked with the larger context models has probably noticed that throwing everything in doesn't magically improve output. Sometimes it actually makes things demonstrably worse. More hallucinations, not fewer. Confident-sounding but incorrect answers. The model blending code from different modules as if they were the same thing.

Figure: How the context window works, independent of the maximum number of tokens available

There are specific, well-documented reasons why.


The Paradox: Four Reasons Bigger Breaks Down

Information Overload

This one's intuitive and it happens to people too. Dump hundreds of thousands of tokens of COBOL into a model and ask it to find the business rule for calculating late fees. The model has to sift through JCL, copybooks, dead code, and commented-out sections from decades ago to find the relevant logic. More noise means more opportunities to latch onto the wrong thing.

From a practical standpoint, larger contexts mean quadratically more compute in the attention mechanism, slower responses, and higher cost. On a large project processing thousands of programs, that cost compounds fast.
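The quadratic claim is easy to put into numbers. The sketch below assumes standard full self-attention, where attention compute grows with the square of sequence length. Models that use sparse or otherwise optimized attention will scale differently.

```python
def attention_cost_ratio(small_ctx: int, large_ctx: int) -> float:
    """Relative attention compute, assuming cost grows with length squared."""
    return (large_ctx / small_ctx) ** 2


doubling = attention_cost_ratio(128_000, 256_000)      # twice the window
to_million = attention_cost_ratio(128_000, 1_000_000)  # 128K -> 1M tokens

print(f"2x the context   -> {doubling:.0f}x attention compute")
print(f"128K to 1M tokens -> {to_million:.0f}x attention compute")
```

Under this assumption, doubling the context quadruples attention compute, and moving from a 128K window to a 1M window multiplies it by roughly 61.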

Lost in the Middle

This is well-documented in the research literature. LLMs exhibit what's called the "lost in the middle" problem, where they disproportionately attend to information at the beginning and end of the context window and pay significantly less attention to what's in the middle. It's an artifact of how attention mechanisms are trained.

If your critical business logic lands in the middle third of a large context dump (and statistically, a third of it will), the model might effectively ignore it. The tokens are present. The information is there. But the attention weights are too diluted for the model to actually use it.

Poor Signal-to-Noise Ratio

When the window is packed full, the model struggles to differentiate what's important from what's noise. You get redundancy — the model restating the same concept in different ways. Contradictions — code that conflicts with patterns established elsewhere in the context. And bias amplification — if there's more boilerplate than business logic in the context, the model generates boilerplate-flavored answers even when you're asking about specific business rules.

Long-Range Dependency Decay

This is the killer for legacy modernization specifically. Going back to the large COBOL application example, a business rule might span multiple paragraphs, reference a copybook defined in a completely different member, depend on a working storage variable set three PERFORM THRU calls earlier, and behave differently based on a condition flag initialized in the JCL.

These long-range dependencies (cause and effect separated by thousands of lines of code) are exactly what LLMs struggle with in large contexts. The attention mechanism degrades over distance. Concepts far apart in the token stream become weakly connected in the model's internal representation.

The paradox is real: you need more context to understand complex systems, but more context degrades the model's ability to reason about what's in the window. You cannot brute-force your way to understanding a million-line codebase by dumping it all into a prompt.


Putting Numbers to the Problem

Let's make this concrete with real numbers instead of abstractions.

The current landscape of context window sizes tells the story. The largest commercially available context windows today top out around one million tokens. Most production models sit between 128K and 200K tokens. Open-source models commonly offer 8K to 16K.

Now consider a real enterprise codebase of a million lines of code. (Many mainframe shops that estimate half a million lines actually have two million once you count copybooks, JCL, utility programs, and batch processing logic.) A conservative million lines at roughly 50 characters per line gives 50 million characters. At approximately 4 characters per token, that's around 12.5 million tokens.

The largest context window on the market fits about eight percent of a modest legacy codebase. Not even close.
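The back-of-the-envelope estimate can be written out as a small function. The 50 characters per line and 4 characters per token come from the text. The function name and its defaults are just a convenient packaging of those assumptions.

```python
import math


def window_coverage(lines: int, chars_per_line: int = 50,
                    chars_per_token: int = 4,
                    window: int = 1_000_000) -> tuple[int, float, int]:
    """Estimate tokens, the share one window holds, and windows needed."""
    tokens = lines * chars_per_line // chars_per_token
    share = window / tokens
    windows_needed = math.ceil(tokens / window)
    return tokens, share, windows_needed


for lines in (1_000_000, 2_000_000):
    tokens, share, needed = window_coverage(lines)
    print(f"{lines:,} lines -> {tokens:,} tokens; "
          f"one window holds {share:.0%}; {needed} windows to cover it")
```

For a million lines this gives 12.5 million tokens, so a one-million-token window holds roughly eight percent of the code. At two million lines the share drops to four percent.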

And remember, even if it all fit, the paradox means you wouldn't want to send it all. Quality degrades well before you hit the ceiling.

Layer on the business reality. Research from Sonar across more than 200 projects found that technical debt costs approximately $306,000 per year per million lines of code. That's the maintenance burden: bugs from code nobody fully understands, fragility in systems nobody wants to touch, developer hours spent reverse-engineering undocumented logic.


The Solution: Intelligent Decomposition, Not Bigger Windows

What if, instead of trying to cram a whole codebase into a context window, you intelligently decomposed it first? Not randomly, not chunking by line count or by file, but following the natural taxonomy of the code itself. Respecting the boundaries the original developers built into the system.

This is the approach CoreStory takes with its code intelligence platform, and it works in two phases.

Phase one: Progressive Decomposition. The full codebase breaks down along its natural architectural boundaries. The full system decomposes into modules. Modules decompose into classes or programs. Programs decompose into functions, paragraphs, and procedures. This isn't arbitrary chunking — it follows the structure the original developers created, because that structure encodes how business logic is organized.

The craft is in getting the boundaries right. You don't chunk in the middle of a function. The decomposition has to be semantically aware. It has to understand what constitutes a meaningful unit of code. Get the chunking wrong and you get garbage out. Get it right and you get specifications that reflect reality.
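To make "don't chunk in the middle of a function" concrete, here is a minimal sketch that splits source at syntactic boundaries instead of by line count. It uses Python and the standard `ast` module as a stand-in. It is not CoreStory's implementation, and a COBOL pipeline would need a parser that understands paragraphs, sections, and copybooks.

```python
import ast


def chunk_at_boundaries(source: str) -> list[tuple[str, str]]:
    """Split Python source into (name, code) units at top-level
    function and class boundaries, never inside one.

    Module-level statements (imports, constants) are skipped here to
    keep the sketch short.
    """
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            code = ast.get_source_segment(source, node)
            chunks.append((node.name, code))
    return chunks


# Made-up sample code used only to exercise the chunker.
SAMPLE = '''\
def late_fee(days_overdue, balance):
    if days_overdue <= 0:
        return 0.0
    return round(balance * 0.015 * days_overdue, 2)


class Invoice:
    def total(self, lines):
        return sum(lines)
'''

for name, code in chunk_at_boundaries(SAMPLE):
    print(f"{name}: {len(code.splitlines())} lines")
```

A fixed-size splitter set to, say, three lines per chunk would cut `late_fee` before its final `return`, leaving neither half meaningful on its own. Splitting on the parse tree keeps each unit whole.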

At each level of decomposition, enterprise context gets applied — naming conventions, architectural patterns, coding standards, the things senior engineers know intuitively but that aren't written down anywhere. The output conforms to your world, not to generic training data.

Phase two: Progressive Recomposition. Once each piece is analyzed with properly scoped context (context that fits in the window and gives the model everything it needs), the understanding recomposes back up the chain. Function-level analysis composes into class-level specs. Class specs compose into module-level documentation. Module specs compose into full-system requirements.
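The bottom-up flow can be sketched as a post-order walk over the decomposition tree. The `Unit` type, the `summarize` callback, and the stub summarizer are hypothetical names for illustration. In a real pipeline the callback would be a model call with scoped context, and nothing here reflects CoreStory's internals.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Unit:
    """A node in the decomposition tree: system, module, class, or function."""
    name: str
    level: str
    code: str = ""                                   # only leaves carry raw code
    children: list["Unit"] = field(default_factory=list)
    spec: str = ""                                   # filled during recomposition


Summarize = Callable[[Unit, list[str]], str]


def recompose(unit: Unit, summarize: Summarize) -> str:
    """Post-order walk: spec the children first, then the parent.

    The parent's summary sees only its children's specs, never their raw
    code, so each call stays within a bounded context size.
    """
    child_specs = [recompose(child, summarize) for child in unit.children]
    unit.spec = summarize(unit, child_specs)
    return unit.spec


def stub_summarize(unit: Unit, child_specs: list[str]) -> str:
    # Stand-in for a model call (an assumption of this sketch).
    if not unit.children:
        return f"{unit.level} {unit.name}: {len(unit.code.splitlines())} lines"
    return f"{unit.level} {unit.name} <- [{'; '.join(child_specs)}]"


system = Unit("ledger", "system", children=[
    Unit("billing", "module", children=[
        Unit("late_fee", "function", code="line 1\nline 2"),
        Unit("tax", "function", code="line 1"),
    ]),
])
print(recompose(system, stub_summarize))
```

The design choice worth noting is that raw code never travels above the leaf level. Each higher-level call consumes a handful of short specs, which is what keeps every step inside the window regardless of total codebase size.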

Figure: The CoreStory approach to solving the Context Window Paradox through progressive decomposition and recomposition

What emerges is structured code intelligence: not raw code, but persistent, queryable specifications that an LLM can reason about effectively. When you send context to a model, you're sending well-structured, properly scoped specs that fit within the window and give the model exactly what it needs.

In a real-world evaluation of this approach, Claude Code paired with CoreStory used 73% fewer input tokens.


What This Unlocks in Practice

The technology only matters if it delivers real value. Here's what becomes possible when you solve the context problem.

Actual business requirements from code, not restated syntax. Not auto-generated comments that parrot the code in English, but real business requirements extracted from code behavior. Product requirement documents that describe what the system does in business terms. For many organizations, this alone justifies the effort: you finally get a source of truth for what the system actually does, rather than what someone wrote in a design document years ago.

Feature-to-code mapping for modernization and maintenance planning. Once requirements are mapped to code modules, you can plan with data instead of intuition. Which modules carry the most business risk? Which have the most technical debt? Which are the best candidates for modernization first because they're self-contained? You have a traceable map from business capability to code implementation.

Persistent context for all future AI-assisted development. The structured intelligence becomes seed data for every subsequent AI interaction. Every prompt, code generation task, and code review starts with accurate context about your enterprise's patterns, conventions, and architecture. You stop starting from zero every time you open a new chat session. This is what context engineering looks like at enterprise scale: persistent understanding that compounds over time rather than evaporating with each session.

Compressed engineer ramp time. Consider how long it takes a new developer to become productive on an existing system today. With structured, searchable specs tied directly to the running code, that ramp compresses dramatically. And existing engineers spend less time spelunking through code before they can change it.

This comes up consistently in customer conversations: the problem isn't writing new code, it's understanding what the old code actually does before you can safely touch anything.


Ready to Solve the Context Problem?

If you're sitting on a legacy system that needs modernization or a critical application that needs to be maintained, but you haven't found an approach that handles the scale and complexity of your codebase, the context window paradox is likely the root cause. CoreStory's code intelligence platform was built specifically to solve it.

Talk to an expert about running a focused assessment on your codebase, or try CoreStory free to see the platform in action.


FAQ

What exactly is the "lost in the middle" problem?

It's a well-documented behavior in LLMs where the model pays significantly more attention to information at the beginning and end of its context window than to information in the middle. Even when tokens are present in the window, the model may not effectively use them if they fall in the middle portion. This means critical business logic can be functionally invisible to the model even when it's technically within the context.

Can't I just use RAG (retrieval-augmented generation) to solve this?

RAG helps surface relevant chunks, but it doesn't solve the fundamental problem. RAG retrieves text fragments based on semantic similarity, which works well for documentation lookup but poorly for understanding code structure, cross-file dependencies, and business logic that spans multiple modules. You still need those retrieved chunks to fit meaningfully in the context window, and you still need the model to reason about their relationships correctly. Progressive decomposition and structured code intelligence give the model properly scoped, architecturally coherent context — not disconnected fragments.

How is CoreStory's approach different from just splitting code into smaller files?

Splitting code into arbitrary chunks (by file, by line count, by function) ignores the semantic structure of the codebase. CoreStory's progressive decomposition follows the natural architectural boundaries of the code, preserving the relationships and dependencies that make each unit of analysis meaningful. The recomposition phase then rebuilds understanding across those boundaries so nothing falls through the cracks.

What programming languages does this work with?

CoreStory supports a wide range of programming languages, from legacy systems like COBOL and Natural/ADABAS to modern stacks in Java, C#, Python, and more. The platform is designed for enterprise environments where multiple languages and frameworks coexist in the same system.

What size codebases can CoreStory handle?

The platform is built for enterprise scale. The progressive decomposition approach means codebase size isn't a limiting factor the way it is with raw context window approaches. Whether your system is hundreds of thousands or millions of lines, the analysis follows the same architectural decomposition methodology.

How to Give AI Coding Agents Better Codebase Context

Michel Ozzello — Thu, 30 Apr 2026 20:03:55 +0000

TL;DR: AI coding agents fail on large codebases because they lack structured context about how the system actually works. The industry has converged on three tiers of solutions: static context files (AGENTS.md, .cursorrules), retrieval-augmented generation (Sourcegraph Cody, Continue.dev), and persistent code intelligence platforms (CoreStory). Each tier solves a different scale of the problem. This article explains what each approach does, where it breaks down, and how to evaluate which tier your team needs.

Why AI Coding Agents Fail on Large Codebases

Every AI coding agent (Claude Code, Cursor, GitHub Copilot, Codex, Windsurf…) faces the same fundamental constraint: it can only act on what it can see. For a 500-line side project, that’s rarely a problem. The entire codebase fits in a single context window. The agent reads the code, understands the structure, and produces reasonable output.

For a 500,000-line enterprise system spread across dozens of services, the math breaks. Even with million-token context windows now available in production models, you can’t fit an entire enterprise codebase into a prompt. And even if you could, raw source code doesn’t tell the agent why the system was built that way: the architectural decisions, the business rules embedded in legacy logic, and the undocumented constraints that only exist in the heads of engineers who left three years ago.
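To make "the math breaks" concrete, here is a back-of-the-envelope sketch. The figure of 10 tokens per line is an assumption chosen for illustration, not a measurement; real ratios vary by language and tokenizer.

```python
# Rough check: does a codebase fit in a context window?
# TOKENS_PER_LINE is an assumed average, not a measured value.
TOKENS_PER_LINE = 10


def estimated_tokens(lines_of_code: int, tokens_per_line: int = TOKENS_PER_LINE) -> int:
    """Estimate the token count of a codebase from its line count."""
    return lines_of_code * tokens_per_line


def fits_in_window(lines_of_code: int, window_tokens: int) -> bool:
    """True if the estimated token count fits in the given window."""
    return estimated_tokens(lines_of_code) <= window_tokens


window = 1_000_000          # a million-token context window
side_project = 500          # lines of code
enterprise_system = 500_000  # lines of code

print(fits_in_window(side_project, window))           # True
print(fits_in_window(enterprise_system, window))      # False
print(estimated_tokens(enterprise_system) / window)   # 5.0
```

Under that assumption the enterprise system needs roughly five full windows for source code alone, before any prompt, instructions, or conversation history.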

The result is predictable: hallucinated imports, functions that don’t exist, patterns that contradict the codebase’s established conventions, and “fixes” that break other parts of the system the agent never saw.

This isn’t a model intelligence problem. It’s an infrastructure problem. The agent is smart enough; it just doesn’t have the information it needs.

There are three tiers of solutions to this problem. Each works differently and delivers different results.


Tier 1: Static Context Files (AGENTS.md, .cursorrules, and CLAUDE.md)

The first generation of codebase context delivery is the static context file. You write a markdown file, drop it in your repository root, and the agent reads it before doing any work.

The format landscape has consolidated rapidly. In 2025, every tool had its own approach: Claude Code read CLAUDE.md, Cursor read .cursorrules, GitHub Copilot read .github/copilot-instructions.md. By early 2026, the industry converged on AGENTS.md — now an open standard backed by the Linux Foundation, supported by every major AI coding agent, and adopted by tens of thousands of repositories. OpenAI’s Codex reads AGENTS.md files at every level of the directory tree. Apache Airflow and Temporal have adopted the format. At time of writing, the OpenAI repository alone contains 88 AGENTS.md files.
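As a rough illustration of what reading AGENTS.md "at every level of the directory tree" can look like, the sketch below walks from the repository root down to the directory an agent is working in and collects each AGENTS.md it finds. The function name `collect_agents_files` and the outermost-first ordering are assumptions made for this example; this is not the discovery logic of Codex or any specific tool.

```python
from pathlib import Path


def collect_agents_files(repo_root: Path, working_dir: Path) -> list[Path]:
    """Return AGENTS.md files from repo_root down to working_dir, outermost first.

    Hypothetical helper: a caller could apply the files in order so that
    instructions closer to the working directory refine the broader ones.
    """
    repo_root = repo_root.resolve()
    working_dir = working_dir.resolve()
    # Raises ValueError if working_dir is not inside repo_root.
    relative = working_dir.relative_to(repo_root)

    found = []
    current = repo_root
    # Check the root itself, then each directory on the way down.
    for part in ("", *relative.parts):
        if part:
            current = current / part
        candidate = current / "AGENTS.md"
        if candidate.is_file():
            found.append(candidate)
    return found
```

Called with a repository root and a path such as `services/billing`, it returns the root file first and the most specific file last.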

What AGENTS.md does well

Gives agents project-specific instructions: build commands, coding conventions, test runners, and constraints the agent can’t infer from the code alone.

Portable across tools. One file, one format, understood by Claude Code, Codex, Cursor, Copilot, Windsurf, and more.

Low cost to create. A useful AGENTS.md takes 30 minutes to write and immediately improves agent output quality on small-to-medium repositories.

Where it breaks down

A recent ETH Zurich study (AGENTbench, 2026) tested context files rigorously across 138 real-world Python tasks. The findings were nuanced: LLM-generated AGENTS.md files actually reduced task success rates by approximately 3% and increased inference costs by over 20%. Human-written files performed better, but only when limited to non-inferable details: custom tooling, counterintuitive patterns, and project-specific constraints.

The core limitation is structural. AGENTS.md is a flat file. It doesn’t understand your code; it’s a set of instructions that you manually maintain. For a 100-file project, that works. For a 10,000-file enterprise system, you face three problems:

Staleness: The file drifts as the codebase evolves, and there’s no automated way to detect when it becomes inaccurate.

Scale: You can’t describe an entire enterprise architecture in a markdown file without blowing the agent’s context window budget.

Depth: AGENTS.md tells the agent what commands to run and what patterns to follow. It doesn’t tell the agent how the system actually works — call graphs, data flows, component relationships, business rules.

As one practitioner noted: the real value of writing an AGENTS.md is that it forces you to articulate things about your codebase that were previously just in your head. That’s valuable, but it’s documentation, not intelligence.


Tier 2: RAG-Based Context Retrieval (Sourcegraph Cody, Continue.dev, Windsurf)

The second tier moves from static files to dynamic retrieval. Rather than telling the agent everything upfront, RAG systems index your codebase and retrieve relevant code snippets at query time.

How RAG works for code

The pipeline is conceptually straightforward: split your codebase into chunks, embed those chunks into a vector space, store them in a vector database, and at query time, find the chunks most semantically similar to the agent’s current task. The retrieved chunks get inserted into the agent’s context window alongside the prompt.
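The four steps above can be sketched in a few lines. This is a toy under loud assumptions: the "embedding" is a plain bag-of-words count standing in for a learned embedding model, the "vector database" is a Python list, and the chunker splits by line count. Production systems differ on all three points.

```python
import math
import re
from collections import Counter


def chunk(text: str, lines_per_chunk: int = 3) -> list[str]:
    """Step 1: split source text into fixed-size line chunks."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)]


def embed(text: str) -> Counter:
    """Step 2: stand-in for an embedding model (bag-of-words term counts)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Step 4: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


source = """def login(user, password):
    token = issue_token(user)
    return token
def charge(card, amount):
    receipt = gateway.pay(card, amount)
    return receipt"""

# Step 3: the "vector store" is just a list of (chunk, vector) pairs here.
index = [(c, embed(c)) for c in chunk(source)]

context = retrieve("how does login issue a token", index, k=1)
prompt = "Context:\n" + "\n".join(context) + "\n\nTask: explain login"
print(prompt)
```

The query retrieves the `login` chunk because it shares words with it, not because anything in the pipeline knows what `login` calls or who calls it.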

Sourcegraph Cody is the most mature implementation. It combines Sourcegraph’s code search engine (keyword search, SCIP-based code graph, and semantic search) with RAG to provide multi-repository context retrieval. Cody supports context windows up to 1 million tokens and can pull context from up to 10 remote repositories. The architecture gives it strong advantages for teams already using Sourcegraph for code search.

Other notable implementations include Windsurf’s Flow context engine, which uses hybrid semantic + BM25 search with a proprietary M-Query retrieval method; Continue.dev, which provides an open-source framework for building custom code RAG pipelines with MCP integration; and Qodo’s Context Engine, which combines RAG with agentic reasoning for multi-repository intelligence.

Why RAG is better than static files

Dynamic: The agent retrieves what’s relevant to the current task, not a fixed set of instructions.

Scalable: Can index hundreds of thousands of files across multiple repositories.

Current: Re-indexing keeps the retrieval layer in sync with code changes (though update frequency varies: some systems re-index daily, others weekly).

Where RAG falls short

RAG retrieves code. It doesn’t understand code. The distinction matters.

When you ask a RAG system "how does authentication work in this system?", it finds files that are semantically similar to your query, files with words like "auth", "login", "token" in them. That’s useful, but it doesn’t give you the architectural picture of which services are involved, what the call chain looks like, where the business rules live, how the authentication flow interacts with the session management system, or why the team chose this approach over alternatives.

Several teams working on code intelligence at scale have found that AST-based retrieval (following import graphs, type hierarchies, and call chains) outperforms vector similarity for structural code queries. RAG is reactive and unstructured. It responds to what you ask, returning text fragments ranked by similarity, but it doesn’t proactively surface what the agent needs to know and hasn’t thought to ask about.
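For a sense of what "following import graphs and call chains" means in practice, here is a minimal sketch using Python's standard `ast` module on a single file. The helper name `imports_and_calls` and the sample source are made up for illustration. It records call names only and does no cross-file name resolution, which real structural indexers need.

```python
import ast


def imports_and_calls(source: str) -> tuple[set[str], dict[str, set[str]]]:
    """Extract imported modules and a per-function set of called names."""
    tree = ast.parse(source)
    imports: set[str] = set()
    calls: dict[str, set[str]] = {}

    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.add(node.module)
        elif isinstance(node, ast.FunctionDef):
            callees = set()
            # Walk the function body and record the name of every call.
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call):
                    func = inner.func
                    if isinstance(func, ast.Name):         # verify(...)
                        callees.add(func.id)
                    elif isinstance(func, ast.Attribute):  # store.create(...)
                        callees.add(func.attr)
            calls[node.name] = callees
    return imports, calls


example = """
import hashlib
from sessions import store

def verify(password, stored_hash):
    return hashlib.sha256(password.encode()).hexdigest() == stored_hash

def login(user, password):
    if verify(password, user.stored_hash):
        return store.create(user)
"""

imports, calls = imports_and_calls(example)
print(sorted(imports))          # ['hashlib', 'sessions']
print(sorted(calls["login"]))   # ['create', 'verify']
```

A query like "what does login depend on?" is answered here by structure (`login` calls `verify` and `store.create`) rather than by which text happens to look similar to the question.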

For many teams, RAG is the right solution. If your codebase is under 500,000 lines and your agents are primarily doing file-level edits and feature additions, RAG-based tools like Cody, Windsurf, or Continue.dev provide a significant improvement over static context files.


Tier 3: Persistent Code Intelligence. From Retrieval to Understanding

The third tier addresses what RAG cannot: structural understanding of how a codebase actually works.

A Code Intelligence Model (CIM) doesn’t just index your code. It analyzes it: parsing abstract syntax trees, extracting call graphs, mapping component relationships, identifying business rules, and building a persistent, queryable model of the entire system. The output isn’t retrieved text fragments; it’s structured specifications: "this service handles payment processing, it depends on these three other services, it implements these business rules, and it was last modified on this date."

The key difference is persistence. Where RAG retrieves on demand and forgets between sessions, a Code Intelligence Model builds understanding that survives turnover, compounds over time, and is accessible to any tool that needs it.

Figure: CoreStory MCP server delivering structured code intelligence to an AI agent, showing component specifications and architecture maps.

What makes a Code Intelligence Model (CIM) different from RAG

Dimension	AGENTS.md	RAG	Code Intelligence Model
Context source	Manual markdown	Embedded code chunks	Analyzed specifications
Update mechanism	Human edits file	Re-index periodically	Git-diff driven, incremental
Output type	Instructions & rules	Raw code snippets	Structured specs & relationships
Understands architecture	No	Partially (via search)	Yes (call graphs, component maps)
Persists across sessions	Yes (static file)	Index persists; context doesn’t	Yes (queryable model)
Business rule extraction	No	No	Yes
Scales to 10M+ lines	No	With infrastructure	Yes
Agent delivery	In-context file	IDE plugin / API	MCP server / API
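The "Git-diff driven, incremental" row in the table can be illustrated with a small sketch: take the file list that `git diff --name-only` prints and refresh only the specifications derived from those files. The `spec_sources` mapping, the spec names, and both helper functions are hypothetical; this shows the general idea, not CoreStory's implementation.

```python
def parse_name_only(diff_output: str) -> list[str]:
    """Parse the output of `git diff --name-only` into a list of paths."""
    return [line.strip() for line in diff_output.splitlines() if line.strip()]


def specs_to_refresh(changed_files: list[str],
                     spec_sources: dict[str, set[str]]) -> set[str]:
    """Return the specs whose source files overlap with the changed files."""
    changed = set(changed_files)
    return {spec for spec, sources in spec_sources.items() if sources & changed}


# Hypothetical mapping from each specification to the files it was derived from.
spec_sources = {
    "payment-processing": {"billing/charge.py", "billing/refund.py"},
    "authentication": {"auth/login.py", "auth/session.py"},
    "reporting": {"reports/monthly.py"},
}

# Sample text in the shape `git diff --name-only` produces.
diff_output = "auth/session.py\nbilling/charge.py\n"

stale = specs_to_refresh(parse_name_only(diff_output), spec_sources)
print(sorted(stale))   # ['authentication', 'payment-processing']
```

The reporting spec is left alone because none of its source files changed, which is what separates an incremental update from a periodic full re-index.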


How a Code Intelligence Model Delivers Agent Context

CoreStory is a persistent code intelligence platform purpose-built for enterprise codebases. It ingests your entire repository, regardless of language, framework, or size, and builds a Code Intelligence Model: a knowledge graph of your system that captures architecture, component relationships, business rules, and data flows.

The CIM is delivered to AI coding agents via MCP (Model Context Protocol), the open standard for connecting AI tools to external context sources. When an agent running in Claude Code, Cursor, or any MCP-compatible environment needs to understand part of your system, it queries CoreStory’s MCP server and receives structured specifications. Not raw code, but an analyzed understanding of what the code does and why.
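At the wire level, MCP messages are JSON-RPC 2.0, and a client invokes a server-side tool with the `tools/call` method. The sketch below builds such a request. The tool name `get_component_spec` and its `component` argument are invented for illustration; the tools CoreStory's server actually exposes are not described here.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# Hypothetical tool name and argument, used only to show the message shape.
request = build_tool_call(1, "get_component_spec",
                          {"component": "payment-processing"})
print(request)
```

In practice an MCP client library handles this framing; the point is that the agent asks a named question and gets a structured answer back, rather than searching for text.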

What agents receive from CoreStory

Component specifications: what each module does, its responsibilities, dependencies, and public interfaces.

Architecture maps: how services connect, what the call chains look like, where data flows between components.

Business rule documentation: the logic embedded in code, extracted and structured for human and machine consumption.

Change context: what was recently modified, by whom, and what specifications were affected.

This is the difference between handing a contractor a stack of code printouts and giving them a technical architecture document written by a senior engineer who knows the system inside out.

For one particular production mainframe system, CoreStory extracted 1,984 business specifications from a live COBOL codebase with an 85.5% SME validation rate. That’s not documentation generated from prompts but structured intelligence derived directly from source code analysis and validated by the people who know the system.


How to Evaluate Your Current Approach

Different teams may need different approaches. The right tier depends on your codebase complexity, team size, and what you’re asking agents to do.

Start with AGENTS.md if:

Your codebase is under 100,000 lines and well-structured.
Your agents primarily handle file-level tasks: writing functions, fixing bugs, generating tests.
You have a small team that can manually maintain the context file as the codebase evolves.
You’re using multiple AI coding tools and need a single, portable context format.

Move to RAG-based tools if:

Your codebase spans multiple repositories or exceeds what fits in a context window.
Your agents need to reference code outside the currently open files.
You’re already using Sourcegraph or a similar code search platform.
You need dynamic retrieval (different context for different tasks) rather than a fixed instruction set.

Invest in a Code Intelligence Model if:

Your codebase exceeds 500,000 lines, spans multiple languages, or includes legacy systems.
Your agents need to understand architecture and business logic, not just find relevant files.
You’re planning a modernization, migration, or major refactoring initiative.
Knowledge loss from developer turnover is a real business risk.
You need intelligence that persists across sessions, tools, and team changes.
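The checklists above can be condensed into a rough decision helper. This is a simplification with assumptions baked in: it uses only the criteria that are easy to express as fields, it treats any codebase of 100,000 lines or more as past the AGENTS.md tier, and the `Codebase` type and `recommend_tier` function are names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Codebase:
    lines_of_code: int
    multi_repo: bool = False
    multi_language: bool = False
    has_legacy: bool = False
    planning_modernization: bool = False


def recommend_tier(cb: Codebase) -> str:
    """Map a codebase profile to a starting tier (simplified heuristic)."""
    if (cb.lines_of_code > 500_000 or cb.multi_language
            or cb.has_legacy or cb.planning_modernization):
        return "Code Intelligence Model"
    # Assumption: 100,000+ lines no longer fits the "start with AGENTS.md" profile.
    if cb.multi_repo or cb.lines_of_code >= 100_000:
        return "RAG"
    return "AGENTS.md"


print(recommend_tier(Codebase(lines_of_code=40_000)))                     # AGENTS.md
print(recommend_tier(Codebase(lines_of_code=250_000, multi_repo=True)))   # RAG
print(recommend_tier(Codebase(lines_of_code=2_000_000, has_legacy=True))) # Code Intelligence Model
```

Treat the result as a starting point for discussion; team size, agent workload, and turnover risk from the lists above are not modeled.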

The tiers above are not mutually exclusive. Many enterprise teams use AGENTS.md for project-specific instructions alongside a CIM for structural intelligence. The AGENTS.md handles "run this test command" and "use this naming convention." The CIM handles "here’s how the payment processing pipeline actually works."


Stop Giving Your Agents Workarounds

AGENTS.md was a necessary first step. RAG-based retrieval was a meaningful upgrade. But if your AI coding agents are still guessing about how your system works, the problem isn’t the model. It’s the infrastructure.

CoreStory is the persistent code intelligence layer that gives agents what they actually need: a structured, always-current understanding of your entire codebase. Not another prompt engineering trick. Not another configuration file. A production-grade intelligence layer that sits between your code and any agent.

See how CoreStory delivers codebase context to your AI agents.