DEV Community

Daniel Westgaard
Originally published at riftmap.dev

Declared, inferred, registered: the three ways a tool knows a cross-repo dependency exists

Three lines were open in three tabs on my screen last week, and all three declared a dependency that crosses a repository boundary.

The first was a Helm chart. In argoproj/argo-helm, the argo-cd chart's Chart.yaml carries a dependencies: block:

dependencies:
  - name: redis-ha
    version: 4.38.0
    repository: https://dandydeveloper.github.io/charts/

The second was Terraform. In cloudposse/terraform-aws-vpc, the root main.tf has a module block whose source points at another repo entirely:

module "label" {
  source  = "cloudposse/label/null"
  version = "0.25.0"
}

The third was a Dockerfile. In cilium/cilium, images/cilium/Dockerfile builds its release stage from a base image passed in as a build argument:

FROM ${CILIUM_RUNTIME_IMAGE} AS release

Run grep across the org for any of these and you get a partial answer. It finds the string redis-ha, but not that the chart resolves against a Chart.lock you would have to read separately. It finds cloudposse/label/null, but has no idea that this registry short-address maps to the cloudposse/terraform-null-label repo. It finds ${CILIUM_RUNTIME_IMAGE} and stops, because the real image name is bound somewhere else. Point a symbol graph at the same three files and it finds nothing at all. None of these is a programming-language symbol. No compiler and no SCIP indexer parses a Chart.yaml, an HCL module block, or a Dockerfile instruction as source code.

Here is the claim I want to plant before we go further. Before you merge and run a change, a cross-repo dependency can be known to a tool in three ways: declared in a manifest the machine already executes, inferred from statistical signal, or registered in a catalog a human maintains. (A fourth mode, observing the edge at runtime, needs the change already running, which is exactly what you do not have before merge. More on that below.) Those three regimes are not three qualities of the same thing. They are three different answers to the question how did the tool come to know this edge exists at all, and each one buys a different, structural failure mode.

"Parsed, not inferred" stages a two-horse race and quietly drops a third runner

The slogan I and half the industry reach for is "parsed, not inferred." It is a good slogan and it is doing less work than it sounds like. It stages a two-horse race: on one side the tool that reads what a manifest says, on the other the tool that guesses from embeddings and model output. That framing is real, but it hides the regime that quietly runs a large share of platform teams, which is neither parsed nor inferred. It is registered: an edge some human typed into a catalog, that no machine executes and no model produced.

So the honest split is three-way. Declared, inferred, registered. This is a different axis from the one I drew in an earlier post in this series, where the taxonomy was symbol / live-state / artifact. That split is about which layer of the stack an edge lives on. This one is about how a tool knows the edge is there. They compose. An artifact-layer edge can be declared, inferred, or registered, and the same Terraform module source can show up in all three tools by three different routes. The rest of this post is about that second axis, because it is where the word "parsed" is quietly carrying an argument it never actually made.
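
To make "they compose" concrete, here is a minimal sketch of an edge record that carries both axes at once. The class and field names are my own illustration, not any tool's schema; the only things taken from this post are the three layers, the three regimes, and the two repo names.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    """Which layer of the stack the edge lives on (the earlier post's axis)."""
    SYMBOL = "symbol"
    LIVE_STATE = "live-state"
    ARTIFACT = "artifact"


class Provenance(Enum):
    """How a tool came to know the edge exists (this post's axis)."""
    DECLARED = "declared"
    INFERRED = "inferred"
    REGISTERED = "registered"


@dataclass(frozen=True)
class Edge:
    # Hypothetical record shape, for illustration only.
    source_repo: str
    target_repo: str
    layer: Layer
    provenance: Provenance


# One artifact-layer edge, as three different tools might report it.
same_edge_three_routes = [
    Edge(
        "cloudposse/terraform-aws-vpc",
        "cloudposse/terraform-null-label",
        Layer.ARTIFACT,
        provenance,
    )
    for provenance in Provenance
]
```

The layer stays fixed while the provenance varies, which is the sense in which the two axes are independent.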

Declared: the edge the machine already executes

A declared dependency is one written into a manifest that the machine already reads and executes to do its job. Nobody adds a FROM line to document a dependency. They add it because the build will not produce an image without it. The dependency edge is a side effect of a file that has to be correct for the system to run at all, which is what makes it deterministic. terraform init resolves the module source or the plan fails. helm dependency update pulls the chart named in dependencies: or the release is incomplete. The edge is not a description of the system. It is part of the system.

What counts as declared

The declared regime is wide, and it is precise. Each ecosystem has its own construct, and the point is to name the construct rather than wave at "config files":

  • Helm Chart.yaml dependencies: entries (name, version, repository), which Riftmap reads as helm_dependency edges.
  • Terraform module { source = ... } blocks, read as terraform_module edges. Registry short-addresses and git URLs count as cross-repo; a bare ./ local path does not, nor do full registry URLs like registry.terraform.io/ or app.terraform.io/.
  • Dockerfile FROM lines, read as docker_base_image edges.
  • go.mod require directives, and the replace directives that quietly redirect them.
  • GitHub Actions uses: values, whether they point at owner/repo@ref or a reusable .github/workflows/x.yml@ref.
  • GitLab CI include: in its several forms (project:, remote:, component:).
  • Kustomize remote resources: and bases:, read as kustomize_resource edges.
  • npm package.json dependencies, including the npm: alias and git+ forms where the imported name and the actual package differ.
  • Ansible, where the precision matters. A role's meta/main.yml dependencies: list emits a role-to-role edge (ansible_role) regardless of how the string is dotted. A task in a playbook that calls a three-segment FQCN like polaris.infrastructure.deploy emits a collection edge (ansible_collection). Two different files, two different edge types. An FQCN in meta/main.yml still resolves as a role dependency and not a collection reference, because that file is what Ansible reads when it loads a role's dependencies.

That precision is the whole personality of the declared regime. The edge is not "there is a dependency somewhere in this YAML." It is a named construct with a known grammar and a known resolution step.
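
The Ansible rule above is a good example of "named construct, known grammar", so here is a minimal sketch of it. The function name, the path check, and the choice to return None for anything else are assumptions of mine; only the two edge types and the file-decides-the-type rule come from the description above.

```python
from pathlib import PurePosixPath
from typing import Optional


def ansible_edge_type(file_path: str, reference: str) -> Optional[str]:
    """Classify an Ansible reference by the file it appears in.

    Hypothetical helper: the file decides the edge type, not the
    number of dots in the reference.
    """
    parts = PurePosixPath(file_path).parts
    # A role's meta/main.yml dependencies: list is always role-to-role,
    # even when the entry is written as a dotted FQCN.
    if parts[-2:] == ("meta", "main.yml"):
        return "ansible_role"
    # Elsewhere (for example a playbook task), a three-segment FQCN
    # names content inside a collection.
    segments = reference.split(".")
    if len(segments) == 3 and all(segments):
        return "ansible_collection"
    # Assumption: anything else is out of scope for this sketch.
    return None


role_edge = ansible_edge_type("roles/web/meta/main.yml", "polaris.infrastructure.deploy")
collection_edge = ansible_edge_type("site.yml", "polaris.infrastructure.deploy")
```

The same string produces two different edge types depending on where it was read, which is exactly what a plain text search cannot tell you.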

Why grep only half-sees it

Grep finds the string and misses the meaning, because in every one of these constructs the literal text is not the resolvable target. A Helm version: can be a semver range rather than a pinned version, and the version actually resolved is recorded in Chart.lock. A Terraform registry short-address like cloudposse/label/null has to be resolved through the registry's naming convention before you know which repo backs it. A Dockerfile FROM ${VAR} names a variable, not an image. A GitLab CI include: has five distinct shapes and an unqualified shorthand that silently resolves to local-or-remote depending on the string. An npm dependency can be declared under an alias, so the name in the code and the package actually installed are different strings. Grep sees text. The declared edge is text plus a resolution rule, and grep does not run the rule.
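
As one illustration of "text plus a resolution rule", here is a minimal sketch for the Dockerfile case. It assumes the only rule in play is that an ARG declared before the first FROM may supply a default for a variable used in FROM. The sample Dockerfile, the image name, and the function names are hypothetical, and the sketch deliberately ignores quoted defaults and the ${VAR:-default} form.

```python
import re
from typing import Dict, List, Optional, Tuple

ARG_RE = re.compile(r"^ARG\s+(\w+)(?:=(\S+))?\s*$", re.IGNORECASE)
FROM_RE = re.compile(r"^FROM\s+(?:--platform=\S+\s+)?(\S+)", re.IGNORECASE)
VAR_RE = re.compile(r"\$\{(\w+)\}|\$(\w+)")


def substitute(text: str, defaults: Dict[str, Optional[str]]) -> Optional[str]:
    """Expand $VAR / ${VAR}; return None if anything stays unbound."""
    unresolved = False

    def repl(match: "re.Match[str]") -> str:
        nonlocal unresolved
        value = defaults.get(match.group(1) or match.group(2))
        if value is None:
            unresolved = True
            return match.group(0)
        return value

    expanded = VAR_RE.sub(repl, text)
    # Any leftover "$" is a form this sketch does not understand.
    return None if unresolved or "$" in expanded else expanded


def base_images(dockerfile: str) -> List[Tuple[str, Optional[str]]]:
    """Return (literal FROM text, statically resolved image or None)."""
    defaults: Dict[str, Optional[str]] = {}
    seen_from = False
    found: List[Tuple[str, Optional[str]]] = []
    for raw in dockerfile.splitlines():
        line = raw.strip()
        arg = ARG_RE.match(line)
        if arg and not seen_from:  # only pre-FROM ARGs are visible to FROM
            defaults[arg.group(1)] = arg.group(2)
            continue
        frm = FROM_RE.match(line)
        if frm:
            seen_from = True
            found.append((frm.group(1), substitute(frm.group(1), defaults)))
    return found


SAMPLE = """\
ARG BASE_IMAGE=quay.io/example/runtime:1.0
ARG RUNTIME_IMAGE
FROM ${BASE_IMAGE} AS builder
FROM ${RUNTIME_IMAGE} AS release
FROM scratch
"""
edges = base_images(SAMPLE)
```

The first element of each pair is what grep would show you. The second is what the rule yields, and it is None when the value is only bound at build time, which is the case where a tool has to look elsewhere for the binding.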

Why a symbol graph misses it

For most of these constructs, a symbol graph does not miss the edge so much as never look at it, because a symbol graph indexes programming-language symbols and none of these are symbols. Helm, Terraform, Docker, GitLab CI, GitHub Actions, Ansible, Kustomize. A compiler-accurate indexer like Sourcegraph's SCIP has nothing to say about any of them, because they are not code it compiles. This is not a knock on Sourcegraph. Symbol graphs and artifact graphs are different categories, and Sourcegraph is genuinely excellent at the category it is in.

I want to be fair about the two exceptions. For go.mod and package.json, the import path is itself a language-level symbol. Sourcegraph's own writeup on cross-repository code navigation describes how SCIP's external symbols carry cross-repository dependency information across the languages it indexes, without calling out any ecosystem by name. A Go import path and an npm package name are exactly that kind of symbol, so I read those as edges a symbol graph can resolve cross-repo when cross-repo indexing is configured. That is my inference from how Go and npm name their imports, not a claim on Sourcegraph's page about those two ecosystems. It is a real capability, and a heavier lift most installs skip. The manifest parser reads the literal require or dependencies value regardless of indexing, and it still catches the cases a symbol resolver handles less cleanly: the npm: alias, the renamed module path, the non-registry git source. Different mechanisms, overlapping coverage, and I would rather concede the overlap than pretend it away.

The scale is the part that does not fit in a code review. A full Riftmap scan of the cloudposse GitHub org (242 repos, completed 2026-07-02) found 147 repos declaring a dependency on cloudposse/terraform-null-label via a Terraform module { source = "cloudposse/label/null" } block. 138 were on the current 0.25.0. 9 were pinned behind. Nearly every one of those references sits at the same place, context.tf line 24, the line cloudposse's own module template generates. That is one declared edge, in one construct, in one org, repeated across 147 repos on a single templated line. It is exactly the kind of signal that is trivial to parse and impossible to hold in your head across 242 repositories.
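
Once the edges are parsed, the "current versus pinned behind" split is a small aggregation. The sketch below is not the scan itself: the repo names and versions are made up for illustration, and it assumes plain numeric x.y.z pins with no ranges or pre-release tags.

```python
from collections import Counter
from typing import Dict, List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '0.25.0' into (0, 25, 0) so comparison is numeric, not textual."""
    return tuple(int(part) for part in version.split("."))


def summarize_pins(pins: Dict[str, str], current: str) -> Tuple[Counter, List[str]]:
    """Count repos per pinned version and list the repos behind `current`."""
    current_key = parse_version(current)
    by_version = Counter(pins.values())
    behind = sorted(
        repo for repo, version in pins.items()
        if parse_version(version) < current_key
    )
    return by_version, behind


# Hypothetical data: repo -> version pinned in its module block.
pins = {
    "org/repo-a": "0.25.0",
    "org/repo-b": "0.25.0",
    "org/repo-c": "0.24.1",
    "org/repo-d": "0.9.0",
}
by_version, behind = summarize_pins(pins, current="0.25.0")
```

The tuple comparison matters: compared as strings, "0.9.0" sorts after "0.25.0", so a naive text comparison would miss the oldest pin.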

The honest failure mode of the declared regime is coverage. A parser is software. It only sees the ecosystems someone wrote a parser for, and it has the blind spots any parser has. If a team declares a dependency in a format nobody has written a parser for, the edge is real and the tool does not see it. That is a genuine limit, and it is a different kind of limit from the two that follow.

Inferred: the edge guessed from statistical signal

An inferred dependency is one a tool produces from statistical signal rather than reading it from a declaration. Embedding proximity. Name similarity. Model output. Co-change history. This is the regime that reaches for a coupling nobody wrote down in any artifact at all: two files that always move together in the commit history, a service whose vocabulary sits close to another's in embedding space, a natural-language question about the codebase answered from summaries rather than a parse. When there is no manifest entry and no catalog record to read, inference is the only thing left that can even suggest the edge exists. That is a real place on the map, and declared parsing does not stand on it.

The failure mode is that inference has no ground truth. It produces a probability that an edge exists, and probabilities are wrong at a rate. This is measured, not folklore. When Richardeau et al. asked a range of LLMs to reproduce Zachary's Karate Club graph, every model got it wrong. The benchmark has 34 nodes and 78 known edges. The best model still added two edges that are not in the graph. Edge-count outputs across models ranged from 8 to 153 against a ground truth of 78 (arXiv:2409.00159). In a code-specific setting it is sharper. On a 15-question architecture-discovery suite against the Shopizer repo, an AST-derived dependency graph scored 15 out of 15. An LLM-extracted knowledge graph scored 13. A vector-only baseline scored 6 (arXiv:2601.08773). The same study documents a coverage failure distinct from being wrong: the LLM extraction pass skipped 377 files outright, so the graph it built was missing large parts of the dependency surface, not just occasionally mistaken about the parts it covered.

The confidence score is the tell. An inferred edge comes with a number that means "how likely we think this edge is real," and that number is doing load-bearing work, because without it you cannot separate the edges the tool is sure about from the ones it guessed. Turn the threshold up and you drop real edges. Turn it down and you admit false ones. There is no setting that gives you both, because the underlying quantity is a belief, not a fact. Inference is the right tool when nothing is written down anywhere. It is strictly worse than reading the file when the edge is already declared, because guessing at an edge that is sitting in plain text can only add error to something you could have simply read.
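
The threshold trade-off is easy to see on a toy example. The five scored edges below are invented purely for illustration, with a made-up ground-truth flag on each; nothing here comes from the cited papers.

```python
from typing import List, Tuple

# (edge label, inferred score, is the edge actually real?) -- toy data.
ScoredEdge = Tuple[str, float, bool]

EDGES: List[ScoredEdge] = [
    ("a->b", 0.95, True),
    ("a->c", 0.80, False),
    ("b->c", 0.70, True),
    ("c->d", 0.40, False),
    ("d->e", 0.30, True),
]


def outcome_at(edges: List[ScoredEdge], threshold: float) -> Tuple[int, int, int]:
    """Return (real edges kept, false edges admitted, real edges dropped)."""
    real_kept = sum(1 for _, score, real in edges if real and score >= threshold)
    false_kept = sum(1 for _, score, real in edges if not real and score >= threshold)
    real_dropped = sum(1 for _, score, real in edges if real and score < threshold)
    return real_kept, false_kept, real_dropped


strict = outcome_at(EDGES, 0.9)   # few false edges, many real ones lost
loose = outcome_at(EDGES, 0.2)    # every real edge kept, false ones too
```

Because a false edge scores above a real one in this toy set, no threshold returns all the real edges and none of the false ones. That overlap is the situation the paragraph above describes.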

Registered: the edge a human wrote in a catalog

A registered dependency is one a human wrote into a catalog that no machine executes. Backstage represents it as spec.dependsOn in a catalog-info.yaml, which the catalog processor turns into a directional relation at ingestion. Port represents it as a relation between blueprints, single or many, which its docs frame as the software catalog as a dynamic graph database. And I want to concede the real thing first, because it is real: catalogs model relationships that parsing simply cannot see. Ownership. On-call. Which team you page. The tier of a service. There is no manifest the build executes that declares who owns a repo, and a good catalog is the right home for that.

The failure mode is drift, because a registered edge is true only as of the last human edit, and nothing executes it to force a correction. This is not a competitor's insinuation. It is Backstage's own documented behaviour: issue #20030 describes how unregistering an entity leaves related entities carrying stale relationships until a later processing pass, and a Group page will show a live warning about relationships to entities that no longer exist. Port's CTO makes the maintenance case directly, though as an interested party. As he puts it on Port's blog, "YAMLs require maintenance when code changes occur. This results in outdated information that can affect operations and decision-making…" And from the adoption side, Roadie, a Backstage-ecosystem vendor and not a rival, reports two customers reaching 88% and 90% catalog completeness over roughly four months of active effort. That last number is the one I keep coming back to. Even funded, deliberate catalog work plateaus below 100%, because the catalog is a second job that competes with shipping, and the parts nobody remembered to update are silently indistinguishable from the parts that are current. I have written more elsewhere on why teams quietly abandon the catalog.
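
The one kind of drift a catalog can check for mechanically is a relation whose target is no longer registered. Here is a minimal sketch of that check, assuming the catalog has already been loaded into a mapping from entity reference to its dependsOn list. The entity names are hypothetical and this is not Backstage's processing code.

```python
from typing import Dict, List, Tuple


def dangling_relations(catalog: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (entity, target) pairs where the target is not in the catalog."""
    return [
        (entity, target)
        for entity, depends_on in sorted(catalog.items())
        for target in depends_on
        if target not in catalog
    ]


# Hypothetical catalog: entity ref -> entity refs listed under dependsOn.
catalog = {
    "component:default/checkout": [
        "component:default/payments",
        "resource:default/orders-db",  # was unregistered at some point
    ],
    "component:default/payments": [],
}
stale = dangling_relations(catalog)
```

This only catches references to entities that are gone. A dependsOn entry that points at an entity which still exists, but which the code stopped using, passes this check untouched, and that is the quiet kind of drift.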

A fourth mode: discovered at runtime, and why it is unavailable before merge

There is a fourth way to know a cross-repo edge exists, and it is neither declared, inferred, nor registered: you can observe it at runtime. Because it lives on a different axis from the other three, I want to name it and set it aside cleanly rather than fold it into inference. A service mesh, DNS, live traffic, a database connection resolved in production. That edge is discovered by watching the system run, and it is genuinely powerful, because it is the only thing that sees the undeclared HTTP calls a service makes to three others through environment variables injected at runtime, calls no manifest declares. This is the live-state layer, the subject of the first post in this series. SixDegree calls it "discovered", and their tie-break rule, prefer discovered over declared when the two conflict, is correct for the question it answers. Runtime observation is righter than any manifest about what is talking to what right now. What it cannot tell you is anything about a change that has not been merged yet, because you cannot observe the traffic of a base-image bump that does not exist in production. The thing you want the blast radius of has not run. That is why this series is about blast radius before merge, and before merge the edges you can actually know are the declared, inferred, and registered ones.

Does a parsed dependency edge need a confidence score?

Riftmap parses deterministically and still puts a confidence score on every edge, and those two facts only sound contradictory until you see what the score measures. It is not inference confidence. There is no model, so the number can never mean "we think this edge is real." It is resolution confidence: how cleanly the declared reference matched a known target in your org. The resolver's own dataclass documents it in one line: """Resolution confidence. 1.0 = exact match; lower = heuristic.""" Every value below 1.0 comes from a string or path comparison, an if, not a probability.

The external precedent for this distinction is, again, Sourcegraph. Their precise vs search-based code navigation split does the same thing one layer up. Precise navigation is compiler-accurate when a SCIP index exists. Search-based navigation is what Sourcegraph falls back to, in their own words, "when precise navigation is not available." Neither mode is doubt about whether a symbol is real. The distinction is match quality on how the reference was resolved, precise index versus heuristic search. A declared-edge resolution score is the same shape of thing, one layer down at the artifact level.

This is where I need to reconcile something honestly, because an earlier post in this series set a merge gate at min_confidence=0.8, and it would be easy to read that as "declared edges are always at least 0.8." They are not. The 0.8 floor excludes a separate regex-heuristic layer that scans files no formal parser owns, whose findings sit at 0.4 to 0.7 by design, plus a few declared edges that resolved fuzzily. The score moves for two deterministic reasons: ambiguous declaration syntax, or an imperfect string match to a known target. Neither reason is doubt about existence. The live cloudposse graph proves it: the terraform-null-label edge I pulled from the production API resolves at 0.9, not because anyone is unsure the edge exists, but because turning cloudposse/label/null into the cloudposse/terraform-null-label repo took a documented naming-convention rule rather than an exact string match. A ${var}-templated Terraform source lands at 0.5. The four ${VAR}-templated FROM lines in that cilium Dockerfile land at 0.7, each of them a real, declared base-image edge whose confidence is lower only because the image name is bound through a build argument. The number answers "how cleanly did this resolve." It never answers "do we think this is real," because nothing in the pipeline is guessing.
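
To show what "a string comparison, not a probability" can look like, here is a minimal sketch of a Terraform source resolver. It is not Riftmap's resolver: the function, its branch order, and the github.com/ exact-match rule are assumptions for illustration. The confidence values 1.0, 0.9 and 0.5 are the ones quoted above, and the registry rule is Terraform's public naming convention of terraform-PROVIDER-NAME for module repos.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Resolution:
    target: Optional[str]  # "owner/repo", or None when the name is not bound
    confidence: float      # 1.0 = exact match; lower = heuristic


def resolve_terraform_source(source: str, known_repos: Set[str]) -> Optional[Resolution]:
    """Map a module source string to a known repo (illustrative only)."""
    if "${" in source:
        # Declared, but the target is templated: the edge is real,
        # the match is not clean.
        return Resolution(None, 0.5)
    if source.startswith(("./", "../")):
        return None  # local path, not a cross-repo edge
    if source.startswith("github.com/"):
        # Assumption: ignores ?ref= and //subdir suffixes for brevity.
        repo = source[len("github.com/"):]
        if repo in known_repos:
            return Resolution(repo, 1.0)
    parts = source.split("/")
    if len(parts) == 3 and all(parts):
        namespace, name, provider = parts
        candidate = f"{namespace}/terraform-{provider}-{name}"
        if candidate in known_repos:
            # Matched through the naming convention, not literally.
            return Resolution(candidate, 0.9)
    return None


known = {"cloudposse/terraform-null-label"}
by_convention = resolve_terraform_source("cloudposse/label/null", known)
templated = resolve_terraform_source("${var.module_source}", known)
```

Every branch is a deterministic comparison, so the same input always yields the same score. The number records which branch matched, not how likely the edge is to exist.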

The three regimes differ in what keeps each edge honest

The three regimes differ in the thing that decides whether they stay true: what keeps each edge honest.

A declared edge is kept honest by the machine that executes it. Get a FROM line wrong and the build breaks. Get a module source wrong and terraform init fails. The manifest is not honest because humans are diligent about it. It is honest because it is load-bearing, and the same machine that consumes the edge re-reads it on every run. An inferred edge is kept honest by nothing. There is no build that fails when the model guesses wrong; you re-roll the dice and get a different graph. A registered edge is kept honest by human diligence alone. Nothing executes a catalog, so it rots at exactly the rate that attention wanders, which is quickly.

I have to concede a point the research made me sharpen, because a reader who knows the build-dependency-error literature will catch it otherwise. Declared is not infallible. It is true to the manifest, not true to the world. A Helm dependencies: entry nobody pruned, a Terraform module block whose source still points at code no longer wired into any resource. These are declared-but-dead edges, the same failure shape a catalog has. Declared and registered are both things somebody wrote down, and both can be stale while still parsing cleanly. The difference is not that declared never goes stale. It is that a declared edge lives in the file the machine runs, so it is cheaper to keep current than a catalog is: nobody has to remember to edit it, because the machine re-reads it every time it runs, and a wrong one tends to announce itself by breaking something. A catalog entry that goes wrong just sits there, wrong and quiet.

That is why the declared regime is the substrate I would want under an agent making a small cross-repo infra change. You cannot ask an agent to maintain a catalog, and you cannot trust it to guess. What you can do is hand it the edges the org already declared, kept current by the same machines that already depend on them being correct, including the base-image or shared-module dependency that used to live only in the head of the engineer who just left.

Where Riftmap sits

Riftmap lives in the declared regime, on purpose. It reads the edges your manifests already declare. Terraform module source, Dockerfile FROM, Helm dependencies:, GitLab CI include:, and the rest. It reads them deterministically, with no model anywhere in the parsing path, across an entire GitHub or GitLab organisation from one read-only token. No catalog YAML to maintain, because the edges are parsed straight from the files that already exist and re-read on every scan. It is not trying to be the inference tool for undeclared runtime calls, and it is not a catalog. It is the substrate: the cross-repo artifact graph the org already declared but never had assembled in one place. If you want to see what your own org declares that no single repo's clone can show you, run a scan against a read-only token and look at the graph before you bump the next base image.

About Riftmap

Riftmap maps cross-repo dependencies across your entire GitLab or GitHub organisation: Terraform, Docker, CI templates, Helm, and more. One read-only token. No YAML to maintain.

Common questions

What's the difference between declared, inferred, and registered dependencies?

A declared dependency is written into a manifest the machine already executes to do its job, like a Dockerfile FROM line or a Terraform module source, so it is deterministic and re-read on every run. An inferred dependency is guessed from statistical signal such as embeddings or LLM output, so it comes with a probability and no ground truth. A registered dependency is one a human typed into a catalog like Backstage or Port, which no machine executes, so it is accurate only as of the last edit. The three differ in what keeps the edge honest: the machine that runs it, nothing, or human diligence alone.

How do dependency-mapping tools actually detect dependencies, and are the edges parsed or inferred?

It depends on the tool, and "parsed vs inferred" hides a third option. Some tools parse the edge from a manifest declaration deterministically, some infer it from statistical signal like embeddings or model output, and some read it from a human-maintained catalog. Parsing gives you an edge that is true to the manifest and self-correcting because the machine re-reads it; inference gives you probabilistic coverage of couplings nothing declares; a catalog gives you relationships like ownership that neither can see, at the cost of drift. Observing an edge at runtime is a fourth mode, but it needs the change already running, so it cannot tell you the blast radius of something not yet merged.

Why do grep and symbol graphs miss infrastructure dependencies?

Grep finds the literal string but not the resolution rule behind it: a Terraform registry short-address, a Helm semver range, or a Dockerfile FROM ${VAR} is text plus a rule that maps it to an actual repo, and grep does not run the rule. Symbol graphs index programming-language symbols, and Helm, Terraform, Dockerfile, GitLab CI, GitHub Actions, Ansible, and Kustomize constructs are not symbols any compiler parses. For go.mod and npm the import path is a language symbol, so a symbol graph can resolve those cross-repo when cross-repo indexing is configured, which most installs skip.

Does a parsed dependency edge need a confidence score?

Not to say whether the edge exists. A parsed edge is read from a declaration, not guessed, so there is no probability that it is real. A confidence score on a parsed edge measures resolution quality instead: how cleanly the declared reference matched a known target, where 1.0 is an exact match and lower means a documented heuristic like a naming convention was needed. That is the same distinction Sourcegraph draws between precise and search-based navigation, and it is a different quantity from the existence-probability an inference tool attaches to a guessed edge.
