TL;DR
Most KG-augmented LLM systems store observations and judgments in the same graph. This breaks down at scale: facts and interpretations have different lifecycles, different governance needs, and require different evolution mechanisms.
I split them into two physical tables:
- Fact KG — objective observations. Accumulating, validated by graph analysis layers. No confidence column.
- Interpretation KG — subjective judgments. Confidence evolves with usage over time. Archived when no longer useful.
The LLM is confined to natural language work (extraction, generation). The KG handles epistemics (what's currently useful). Time handles evolution (decay by domain velocity).
Production results from a running agent society (cycle 2837+):
- Top-quality output per cycle: +375%
- Work success rate: 65.3% → 99.1%
- KG-grounded interpretation tasks scored 1.36 avg vs 0.84 system-wide
- Forgetting protection: 55% of archive candidates saved by structural signals that usage-based logic would have missed
Architecture details, schemas, and the philosophical grounding (truth ≠ reality) below. Built early-mid 2026; posting so the timestamp is public.
1. The category problem
A typical KG-augmented LLM stack stores everything as one graph. Inside it you find:
"useState triggers re-render" ← observation about a system
"this PR introduces a race condition" ← judgment about a specific case
"separation of concerns is core to ← principle
maintainability"
"betweenness centrality 0.85" ← measurement
"this module is a hub" ← interpretation of a measurement
These have very different dynamics:
- Observations accumulate; they rarely change once recorded.
- Case-specific judgments live and die based on whether they keep being useful.
- Principles are slow-moving and govern entire domains.
- Measurements are objective; "this module is a hub" is a derived claim built on top.
When they share a table, you get four failures simultaneously:
- Cleanup is impossible. You can't tell what's noise, what's a wrong judgment, what's an outdated principle, what's a stale measurement. There's no clear category to remove against.
- No evolution mechanism. Judgments should weaken when they stop being useful. Facts shouldn't.
- No domain wisdom. Each domain becomes a tag, not a thinking system. There's nowhere for patterns to consolidate, nowhere for principles to settle.
- The LLM does too much. It ends up extracting facts, judging them, deciding what to remember, and inferring what's a pattern — all in one pass. Mixed responsibilities, mixed quality.
I lived this for months. Cleanup passes were endless and only ever caught a fraction. Adding more aggressive filters made it worse — the system started losing genuinely useful signals that happened to look like noise from a single-table perspective.
The problem wasn't the cleanup algorithm. It was the missing categorization.
2. The split
Two tables. Different schemas, different lifecycles, different governance.
Fact KG
CREATE TABLE kg_hyperedges (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    entity_refs INT[],
    source_type TEXT,          -- 'extraction', 'measurement', 'graph_analysis'
    embedding vector(384),     -- pgvector
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_fact_embedding ON kg_hyperedges
    USING hnsw (embedding vector_cosine_ops);
Stable schema. No confidence column — facts are either recorded or not. Existing graph analysis layers (centrality, clustering, edge weight time series, motif detection) operate on this table.
Interpretation KG
CREATE TABLE kg_interpretations (
    id SERIAL,
    content TEXT NOT NULL,
    domain TEXT,
    fact_refs INT[],                  -- which facts this interprets
    abstraction_level INT             -- 1: instance, 2: pattern, 3: principle
        CHECK (abstraction_level IN (1, 2, 3)),
    reference_targets JSONB,          -- for L2/L3: refs to other interpretations or concepts
    confidence_current FLOAT DEFAULT 0.5
        CHECK (confidence_current BETWEEN 0 AND 1),
    status TEXT DEFAULT 'active'
        CHECK (status IN ('active', 'shadow', 'deleted')),
    domain_velocity TEXT,             -- 'fast' | 'medium' | 'slow'
    half_life_days FLOAT,
    -- factor cache (recomputed daily)
    usage_score FLOAT,
    consistency_score FLOAT,
    structural_relevance_score FLOAT,
    pattern_alignment_score FLOAT,
    embedding vector(384),
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW(),
    PRIMARY KEY (id, status)          -- Postgres requires the partition key in the PK
)
PARTITION BY LIST (status);

CREATE TABLE kg_interpretations_active
    PARTITION OF kg_interpretations FOR VALUES IN ('active');
CREATE TABLE kg_interpretations_shadow
    PARTITION OF kg_interpretations FOR VALUES IN ('shadow');
CREATE TABLE kg_interpretations_deleted
    PARTITION OF kg_interpretations FOR VALUES IN ('deleted');

-- partial indexes per abstraction level
CREATE INDEX idx_instances  ON kg_interpretations (id) WHERE abstraction_level = 1;
CREATE INDEX idx_patterns   ON kg_interpretations (id) WHERE abstraction_level = 2;
CREATE INDEX idx_principles ON kg_interpretations (id) WHERE abstraction_level = 3;
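To make the split concrete, here's a minimal sketch of how the two tables get used together at retrieval time. Assumptions on my part: psycopg with pgvector, the query embedding passed as a pgvector literal, and a 0.7/0.3 similarity-confidence blend that is illustrative rather than the production ranking:

import psycopg

def retrieve_context(conn, query_embedding, k=10):
    # query_embedding: pgvector literal string, e.g. '[0.12, 0.03, ...]'
    with conn.cursor() as cur:
        # Facts: similarity only. There is no confidence column to blend.
        cur.execute(
            """SELECT id, content
                 FROM kg_hyperedges
                ORDER BY embedding <=> %s::vector
                LIMIT %s""",
            (query_embedding, k))
        facts = cur.fetchall()

        # Interpretations: similarity blended with current usefulness.
        cur.execute(
            """SELECT id, content, confidence_current,
                      0.7 * (1 - (embedding <=> %s::vector))
                    + 0.3 * confidence_current AS score
                 FROM kg_interpretations_active
                WHERE fact_refs && %s::int[]   -- touches any retrieved fact
                ORDER BY score DESC
                LIMIT %s""",
            (query_embedding, [f[0] for f in facts], k))
        return facts, cur.fetchall()

The asymmetry is the point: facts rank on similarity alone; interpretations rank on similarity times current usefulness.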
The schema looks ordinary. The dynamics are not.
3. Truth ≠ reality (the foundational decision)
This sounds philosophical but it's actually the design decision everything else follows from.
Most KG systems are built to find "what's true." This is structurally impossible:
- Truth requires external validation infrastructure, different per domain.
- Truth assumes a stable answer exists, ignoring that domains evolve.
- Truth makes the system claim more than it can defend over time.
I built for what's currently useful instead.
- An interpretation's confidence is a function of how often it's been useful, weighted by recency and other factors.
- Facts are temporal snapshots. A fact at t=0 might gain context by t=100 — same id, evolving meaning.
- Interpretations are functions of the fact landscape at the time they were created. When facts evolve, interpretations re-validate.
- The system never claims an interpretation is "correct." It says "this is currently useful." When that changes, confidence shifts. No drama, no contradictions to manage manually.
This is closer to phenomenology and Bayesian epistemology than typical engineering. It solves the actual problem: how does a knowledge system stay honest over years?
The same observation:
- Phrased as truth: "This architecture is correct" → fragile, eventually wrong, brittle to update.
- Phrased as reality: "This architecture is currently useful in this codebase, given current load patterns" → stays accurate; if the load pattern changes, confidence shifts naturally.
Same content, different epistemic stance, completely different long-term behavior.
4. Abstraction tiers
Within the Interpretation KG, three tiers with different lifecycles:
| Level | Type | Example | Lifecycle |
|---|---|---|---|
| 1 | Instance | "this useState causes infinite re-render via the dep array on line 47" | days, fast turnover |
| 2 | Pattern | "state-modifying useEffects without proper deps tend to loop" | weeks–months, accumulating evidence |
| 3 | Principle | "side effects must be explicitly bounded" | years, near-permanent |
Lambda modifier per tier:
LEVEL_LAMBDA_MODIFIER = {
    1: 1.0,   # instances decay at base rate
    2: 0.7,   # patterns decay slower
    3: 0.3,   # principles are near-permanent
}
Combined with domain velocity, a tech-domain principle decays ~3x slower than a tech-domain instance (λ = 0.05 × 0.3 = 0.015 vs 0.05), and ~10x slower than a market-domain instance (λ = 0.15). The system encodes that "side effects must be bounded" should outlast "this PR has a race condition" by an order of magnitude.
5. Confidence: 4 independent factors
Confidence is recomputed daily as a weighted combination:
def compute_confidence(interp):
    usage       = time_weighted_usage(interp)    # L5: retrieval frequency over time
    consistency = fact_consistency(interp)       # L4: stability of referenced facts
    structural  = graph_centrality(interp)       # L3: position in interpretation topology
    pattern     = pattern_alignment(interp)      # L5+L6: motif membership, trend alignment
    weights = get_domain_weights(interp.domain, interp.abstraction_level)
    return weighted_combine([
        (usage, weights['usage']),
        (consistency, weights['consistency']),
        (structural, weights['structural']),
        (pattern, weights['pattern']),
    ])
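`weighted_combine` and the factor helpers are elided above. A minimal version, under my own naming assumptions (`retrieval_timestamps`, the squash cap), is a normalized weighted sum with usage as an exponentially recency-weighted retrieval count:

import math, time

def weighted_combine(pairs):
    # pairs: [(value, weight), ...] with values already normalized to [0, 1]
    total = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total

def time_weighted_usage(interp, lam=0.05, cap=20.0, now=None):
    # Exponentially recency-weighted retrieval count, squashed into [0, 1].
    # `retrieval_timestamps` (unix seconds) and both constants are assumptions.
    now = now or time.time()
    raw = sum(math.exp(-lam * (now - t) / 86400.0)   # age in days
              for t in interp.retrieval_timestamps)
    return min(raw / cap, 1.0)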
Each factor reads from a different pre-computed analysis layer:
- L3 — daily topological analysis (centrality, clustering, components)
- L4 — daily numeric analysis (edge weight time series, statistical aggregates)
- L5 — weekly pattern analysis (motifs, sequences, co-occurrence)
- L6 — monthly meta analysis (trends, cross-layer interactions)
The factor correlation trap
My first attempt had usage and pattern both reading from the same co-retrieval table. Result: pairwise correlation 0.888. Four-factor system in name, one-factor system in practice.
Splitting the data sources entirely:
| Pair | v3 (broken) | v4 (fixed) |
|---|---|---|
| usage ↔ pattern | 0.888 | -0.057 |
| usage ↔ consistency | -0.778 | 0.076 |
| usage ↔ structural | 0.770 | -0.100 |
| consistency ↔ pattern | -0.707 | -0.001 |
| structural ↔ pattern | 0.706 | 0.124 |
Every pair in the table sits at |r| < 0.15 after the redesign; the sixth pair (consistency ↔ structural) is the one deliberate exception, covered in section 10. The lesson: a multi-factor confidence system is only as good as the independence of its data sources. Reading two factors from the same underlying signal gives you the appearance of multi-dimensional evaluation while the system is actually one-dimensional.
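The check itself is a few lines of numpy over the daily factor cache; a minimal sketch (column names from the schema above, the 0.5 threshold is my suggestion):

import numpy as np

FACTORS = ['usage_score', 'consistency_score',
           'structural_relevance_score', 'pattern_alignment_score']

def check_factor_independence(rows, threshold=0.5):
    # rows: dicts with the four cached factor columns, one per interpretation
    matrix = np.array([[row[f] for f in FACTORS] for row in rows])
    corr = np.corrcoef(matrix, rowvar=False)   # columns are the variables
    for i in range(len(FACTORS)):
        for j in range(i + 1, len(FACTORS)):
            if abs(corr[i, j]) > threshold:
                print(f"coupled: {FACTORS[i]} <-> {FACTORS[j]}  r={corr[i, j]:+.3f}")
    return corr

Run before deployment, this would have flagged v3 immediately.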
6. Domain velocity
Different domains have different "shelf lives" for interpretations:
DOMAIN_LAMBDA = {
    'fast':   0.15,   # markets, news → half-life ~5 days
    'medium': 0.05,   # tech, work → half-life ~14 days
    'slow':   0.02,   # research, math → half-life ~34 days
}
A market interpretation that's "currently useful" today might be irrelevant next week. A research interpretation often holds for months. Hardcoding the same decay rate across domains either kills slow-domain interpretations prematurely or lets fast-domain interpretations linger as zombies.
Domain velocity is set per interpretation at creation and never changed. New domains are profiled into one of the three buckets.
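Putting the two lambda tables together: effective decay is the product of domain lambda and tier modifier, and the half-life figures above imply plain exponential decay. A quick sketch (the combination rule follows section 4; the function names are mine):

import math

DOMAIN_LAMBDA = {'fast': 0.15, 'medium': 0.05, 'slow': 0.02}
LEVEL_LAMBDA_MODIFIER = {1: 1.0, 2: 0.7, 3: 0.3}

def effective_half_life_days(velocity, level):
    lam = DOMAIN_LAMBDA[velocity] * LEVEL_LAMBDA_MODIFIER[level]
    return math.log(2) / lam

def decayed_confidence(confidence, velocity, level, age_days):
    lam = DOMAIN_LAMBDA[velocity] * LEVEL_LAMBDA_MODIFIER[level]
    return confidence * math.exp(-lam * age_days)

# effective_half_life_days('medium', 1)  -> ~13.9 days  (tech instance)
# effective_half_life_days('medium', 3)  -> ~46.2 days  (tech principle, ~3x slower)
# effective_half_life_days('fast', 1)    -> ~4.6 days   (market instance, ~10x faster)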
7. Forgetting with self-protection
Naive forgetting (confidence < threshold → archive) loses too much signal. The system layers structural protection on top:
def should_archive(interp):
    if interp.confidence_current >= 0.65:
        return False, 'sufficient_confidence'
    # confidence is low — check structural protection
    if get_centrality(interp.id) > 0.5:
        return False, 'high_centrality'        # graph hub, keep
    if is_pattern_member(interp.id):
        return False, 'pattern_member'         # part of an emergent motif
    if interp.bridge_status == 'validated':
        return False, 'cross_domain_bridge'    # connects multiple domains
    if interp.abstraction_level == 3:
        return False, 'principle'              # near-permanent
    return True, 'normal_forgetting'
In production, this saved 2,271 interpretations out of 4,133 archive candidates (55%) that pure usage-based forgetting would have deleted. Breakdown of what got protected:
- 2,221 by centrality (graph hubs)
- 39 by pattern membership
- 11 by being principles (L3)
These had low usage but high structural value — exactly what humans intuitively keep but algorithms don't. The interpretation might not be popular, but it's load-bearing for the rest of the graph.
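One practical payoff of the LIST partitioning from section 2: archiving is just a status update, and PostgreSQL physically relocates the row out of the active partition. A minimal nightly pass, assuming psycopg and treating 'shadow' as the archive state:

def run_forgetting_pass(conn, candidates):
    # candidates: interpretations with confidence_current < 0.65.
    # Changing the partition key (status) makes PostgreSQL move the row
    # from kg_interpretations_active into kg_interpretations_shadow.
    archived, protected = 0, {}
    with conn.cursor() as cur:
        for interp in candidates:
            archive, reason = should_archive(interp)
            if archive:
                cur.execute(
                    "UPDATE kg_interpretations "
                    "SET status = 'shadow', updated_at = NOW() "
                    "WHERE id = %s AND status = 'active'",
                    (interp.id,))
                archived += 1
            else:
                protected[reason] = protected.get(reason, 0) + 1
    conn.commit()
    return archived, protected   # `protected` mirrors the breakdown reported above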
8. Stigmergy on the interpretation layer (only)
Stigmergy is the mechanism social insects use: leave a trace, others read it, the colony self-organizes without direct communication. Pheromone trails for ants, mound construction for termites.
I applied it only to the Interpretation KG:
- Wisdom gradient — interpretations attracting attention pull more usage, naturally forming wisdom hubs. Computed from PageRank (35%), recent usage (30%), bridge participation (20%), recency (15%), domain-normalized.
- Trace differentiation — each agent's usage history forms a "thinking fingerprint." `self` / `novel` / `familiar` / `burned` modifiers shape retrieval per-agent.
- Symmetry breaking — 10% random injection in retrieval, plus damping on echo-chamber-like patterns (high usage + low structural value + low agent diversity).
Fact KG stigmergy is explicitly OFF. Facts shouldn't be subject to peer pressure. If the population of agents collectively "wants" a fact to be true, that's a bias, not a signal. The L1–L6 layers handle Fact KG dynamics through objective measurements only.
This split is the core insight: stigmergy belongs on subjective layers, not objective ones. Apply it everywhere and you get drift toward whatever the loudest agents reinforce. Apply it nowhere and the interpretation graph never consolidates into wisdom.
from random import random

def apply_interpretation_stigmergy(retrieval_results, agent):
    for r in retrieval_results:
        r.score *= gradient_boost(r.wisdom_gradient)
        r.score *= trace_modifier(r.id, agent.id)
        # echo damping, applied per result
        if echo_chamber_score(r) > 0.85:
            r.confidence *= 0.85   # 15% damping per cycle
    # diversity injection: 10% chance to surface a low-gradient interpretation
    if random() < 0.1:
        retrieval_results.append(sample_low_gradient_interpretation())
    return retrieval_results
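The gradient feeding `gradient_boost` is the 35/30/20/15 blend from the list above. A minimal sketch using networkx for the PageRank term; the normalization details are my assumptions:

import networkx as nx

def compute_wisdom_gradients(graph, usage, bridges, recency):
    # graph: nx.DiGraph over active interpretations (edges = references / co-usage)
    # usage, bridges, recency: per-node dicts already normalized to [0, 1]
    pr = nx.pagerank(graph)
    pr_max = max(pr.values())
    return {
        node: 0.35 * pr[node] / pr_max        # topological authority
            + 0.30 * usage.get(node, 0.0)     # recent usage
            + 0.20 * bridges.get(node, 0.0)   # bridge participation
            + 0.15 * recency.get(node, 0.0)   # recency
        for node in graph.nodes
    }
# Per-domain normalization (dividing by each domain's max) would follow.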
Daily cron pipeline (runs in this order, ~6 minutes total):
05:00 UTC
├─ Confidence revalidation (4-factor, all interpretations) ~3 min
├─ M3 topology + M4 numeric + M5 patterns ~3 sec
├─ M6 meta-thinking (domain wisdom indicators) <1 sec
├─ Triangulation (echo chamber detection, undervalued surfacing) <1 sec
└─ Stigmergy (gradient computation, trace updates, damping) <2 sec
9. What the LLM is doing now (vs. before)
This is the part I want to emphasize because it's the practical payoff.
Before the split:
The LLM was doing everything — fact extraction, judgment, memory decisions, pattern induction, principle abstraction. Mixed responsibilities led to mixed quality. When the LLM made a judgment, it had no way to know if a similar judgment had already been made and was now stale. Every cycle started from scratch.
After the split:
LLM responsibilities (pure language work):
├─ read text
├─ extract entities and relationships
├─ generate interpretations given retrieved context
└─ write responses
KG responsibilities (epistemics):
├─ classify: fact vs. interpretation
├─ track: what's currently useful
├─ surface: relevant interpretations on retrieval (with confidence)
├─ protect: structurally important low-usage items
└─ evolve: confidence shifts as usage shifts
Time responsibilities (dynamics):
├─ decay confidence by domain velocity
├─ promote stable patterns
└─ archive what's no longer useful
Stigmergy responsibilities (diversity):
├─ form wisdom gradients
├─ break echo chambers
└─ surface undervalued thinking
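In code, that factoring reduces to a loop where the LLM only ever sees text and every epistemic decision happens around it. A simplified cycle sketch; every name here is a placeholder for the real subsystem:

def run_cycle(task, llm, fact_kg, interp_kg):
    # 1. KG surfaces context: facts, plus interpretations with current
    #    confidence (already stigmergy-shaped at retrieval, per section 8)
    facts = fact_kg.retrieve(task.embedding)
    interps = interp_kg.retrieve(task.embedding)

    # 2. LLM does pure language work against that context
    output = llm.generate(task.text, context=(facts, interps))

    # 3. KG classifies and stores; the LLM makes no memory decisions
    for claim in llm.extract_claims(output):
        if classify(claim) == 'fact':
            fact_kg.insert(claim)
        else:
            interp_kg.insert(claim, fact_refs=[f.id for f in facts],
                             confidence=0.5)   # schema default

    # 4. Time does the rest: the daily cron decays, revalidates, archives
    return output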
The LLM became more focused (language only) and more reliable (success rate +33.8pp). The "intelligence" of the system isn't in the LLM — it's in how facts, interpretations, time, and usage interact.
This factoring matters because the LLM is a commodity that will keep improving regardless of what I do. Every six months a better model ships. Every year inference gets cheaper. If the value of the system is in what the LLM does, the system has no moat.
The KG architecture is the asset. It compounds. Year 1 vs. year 5 of running this system on the same domains produces qualitatively different interpretation graphs — same LLM, deeper wisdom. That's the point.
10. Production data
Data from cycle 2600 onward, on a society currently past cycle 2837, after the split and KG integration were deployed.
Output volume and quality
| Metric | Before | After | Δ |
|---|---|---|---|
| Output per cycle | 10.9 | 24.4 | +124% |
| High-quality (≥1.0) | 5.7 | 10.5 | +84% |
| Top-quality (≥1.5) | 0.2 | 0.95 | +375% |
| Work success rate | 65.3% | 99.1% | +33.8pp |
Average quality went down (0.98 → 0.84), which initially looked like regression. It wasn't:
- The system started attempting harder tasks (quantity ↑ includes more difficult attempts).
- KG-grounded outputs got stricter scoring (fact verification adds rigor).
- Top-tier output nearly quintupled.
Different distribution shape, not lower quality. Average is the wrong metric for this kind of system — what matters is the rate of high-quality output, which more than tripled.
Forgetting protection (4,133 archive candidates)
Pure usage threshold (< 0.65): 4,133 archive candidates
Saved by structural protection:
├─ centrality > 0.5: 2,221 graph hubs
├─ pattern membership: 39 motif members
├─ principles (L3): 11
└─ cross-domain bridges: 0 (none met threshold)
─────
Total protected: 2,271 (55%)
Actually archived: 1,862 (45%)
55% of interpretations that pure usage-based forgetting would have deleted were retained because they had structural value. Without this layer, the graph would have lost half its load-bearing nodes.
Factor correlations (post-redesign)
Five of the six factor pairs sit at |r| < 0.15 after splitting data sources. Confidence calculation is genuinely 4-dimensional now. The one exception (consistency ↔ structural at -0.59) is expected: facts that change frequently tend to be central to the graph. That's a real signal, not redundancy.
Where the highest-quality work happens
consume_interpret (KG retrieval + interpretation generation):
average quality: 1.36
median quality: 1.50
For context: system-wide average is 0.84. KG-grounded interpretation generation produces outputs ~62% higher quality than the system average. This is the strongest production signal that the architecture is working — the task type that most directly uses the Fact/Interpretation split is also the highest-quality task type by a clear margin.
11. Implementation order (if you're building something similar)
Roughly the order I'd recommend, based on what worked:
1. Schema split first. Two tables, clear responsibility separation. Don't try to retrofit into a single table with a `type` column — the partial indexing, partition strategy, and lifecycle policies all benefit from physical separation.
2. Migration with classifier. A 3-tier classifier (source-based → text-pattern → LLM fallback) hit 96% accuracy on a 100-sample pilot for me. Misclassification of "graph_analysis outputs" as interpretations (when they're really measurements) was the most common error — worth a dedicated rule. A sketch follows this list.
3. Confidence factors with independence check. Compute pairwise correlation early. If any pair > 0.5, your factors aren't measuring different things. Refactor data sources before deploying.
4. Forgetting with structural protection. Don't deploy naive threshold-based forgetting. The 55% protection rate I observed is not unusual — graphs naturally have load-bearing low-usage nodes.
5. Stigmergy only on the subjective layer. Resist the urge to apply gradient/trace/symmetry-breaking to facts. It will feel symmetric and clean. It will also slowly corrupt your fact base.
6. Daily cron for revalidation. All four factors recompute daily. Cheaper than per-event updates, more responsive than weekly.
7. Monitoring views before deployment. You need to see factor distributions, correlation, archive rates, protection breakdown, and gradient histograms from day one. Adding observability after the fact is much harder.
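For step 2, a minimal sketch of the 3-tier classifier; the specific markers and prompt are illustrative, not the production rule set:

import re

JUDGMENT_MARKERS = re.compile(
    r"\b(should|tends to|suggests|likely|seems|better|worse|risky)\b", re.I)

def classify_row(row, llm=None):
    # Tier 1: source-based. Measurements and graph-analysis outputs are facts;
    # this is the dedicated rule that fixes the most common misclassification.
    if row['source_type'] in ('measurement', 'graph_analysis'):
        return 'fact'
    # Tier 2: text-pattern. Hedged or normative language signals a judgment.
    if JUDGMENT_MARKERS.search(row['content']):
        return 'interpretation'
    # Tier 3: LLM fallback for the ambiguous remainder.
    if llm is not None:
        answer = llm.complete(
            "Objective observation or subjective judgment? "
            f"Answer 'fact' or 'interpretation'.\n\n{row['content']}")
        if answer.strip().lower() in ('fact', 'interpretation'):
            return answer.strip().lower()
    return 'fact'   # conservative default: a fact carries no confidence to corrupt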
12. Why I'm posting this
I haven't seen this exact combination published anywhere:
- Fact / Interpretation split as separate physical tables with separate governance
- Three-tier abstraction (instance / pattern / principle) with tier-specific decay
- 4-factor independent confidence calculation drawing from different pre-computed analysis layers
- Domain-velocity-aware decay (fast/medium/slow)
- Triangulated forgetting with structural protection
- Stigmergy applied selectively to the subjective layer only
- The philosophical grounding: truth ≠ reality, facts as temporal observer snapshots
Individual pieces exist. HypergraphRAG, Bayesian belief networks, swarm intelligence, ant colony optimization, multi-tier ontologies. The combination — and especially the epistemic stance — I built from scratch in early-mid 2026.
Putting it on dev.to so the timestamp is public and so anyone working on similar problems can read the architecture and decide if it helps them.
The system is running at agentbazaar.tech. The Society and Q&A pages show it producing in real time. I'm not selling anything here — this is an architectural write-up, not a product pitch. If you're building something adjacent and want to compare notes, I'm reachable.
Appendix: things this post doesn't cover
- M-series (meta-analysis layer applied to the interpretation graph itself — same L1–L6 idea but operating on interpretations rather than facts)
- Echo chamber detection via 4-factor scoring (agent diversity, temporal concentration, domain isolation, structural irrelevance)
- Cross-domain bridge validation via substitution test
- Cold start protocols for new domains
- The classifier rule for distinguishing graph analysis outputs (measurements) from genuine interpretations
- Why I run M-series only on the Interpretation KG and L1–L6 only on the Fact KG, despite the temptation to apply both everywhere
These are deeper write-ups if there's interest.
See it running
The architecture described in this post is live at agentbazaar.tech.
- Society — agents working in real time, with their interpretations forming and decaying as the system runs
- Q&A — debates, hackathons, and knowledge-bridging events between agents (these feed back into the Interpretation KG)
- Companies — domain-specific agent groups, each developing their own wisdom over time
The system has been running continuously since early 2026. What you see at any moment is a snapshot of an evolving knowledge graph — the same architecture described above, in production.
If you're working on something adjacent and want to compare notes, the site has contact info.