Vektor Memory

Why Your AI Agent Needs Better Temporal Reasoning, and How We Fixed It

Most agent memory systems treat stored facts linearly. There’s no sense of when a fact was true, whether it’s been superseded, or how to reason about time at all.

This is the story of how we diagnosed the problem, found the research, and built a production fix in Node.js and SQLite: no Python, no subprocess, no academic overhead.

We received a great question from a reader along these lines:

“Temporal reasoning is the one I’d love to see more research on — most retrieval systems treat memory as a flat bag of facts and the agent has no way to know that a fact from yesterday supersedes one from last month”

That question sent us down a rabbit hole through arXiv late in the evening, before dinner.

We found a handful of papers attacking the problem from different angles: temporal knowledge graphs, bi-temporal storage models, neuro-symbolic reasoning pipelines. Most of them were Python. All of them were academic. None of them were something you could drop into a production Node.js agent without a rewrite.

Which brought us back to a decision we made early in VEKTOR’s life and have never regretted: Node.js over Python. That is not because Python isn’t excellent; it’s the flavour of the month for good reason, and the ML ecosystem built on top of it is genuinely world-class. We chose Node.js for concurrent execution, for file I/O speed, and for the event loop model that makes agent tooling feel snappy rather than sluggish. That choice closes certain doors. It also opens others.

The most interesting paper we found was TReMu (Temporal Reasoning for LLM-Agents in Multi-Session Dialogues), out of UIUC and AWS. Their framework takes GPT-4o from 29.83% to 77.67% accuracy on temporal questions. The mechanism: resolve relative time expressions at ingest, then use Python to execute date arithmetic at query time.

The Python part we had to throw away. What we built instead is arguably cleaner, or at least we think so.

The Problem Nobody Talks About

Ask your AI agent “what database are we using?” and it will confidently answer with whatever it last stored, even if that fact is six months stale and three decisions out of date.

This isn’t a hallucination problem. The fact is real. It was true. It’s just not true anymore.

Most retrieval-augmented memory systems treat stored memories as a flat collection ranked by semantic similarity and recency. There’s no explicit model of when a fact was true in the world versus when the agent learned it. There’s no mechanism to mark a fact as superseded by a newer one. And there’s certainly no way to answer questions like “how long between when we decided on Redis and when we migrated away from it?”

The agent doesn’t know what it doesn’t know about time. It lives in an eternal present, every fact equally valid, forever.

The Research: TReMu

A team from the University of Illinois and AWS published TReMu (Temporal Reasoning for LLM-Agents in Multi-Session Dialogues), revised 24 Sep 2025.

The paper is worth reading in full, but the headline number is striking: standard GPT-4o prompting on temporal reasoning questions scores 29.83%. Their framework scores 77.67%. That’s a 48-point jump on questions humans find trivially easy.

What were those questions? Three types:

Temporal Anchoring — “When exactly did this happen?” A user says “I went to the seminar last Monday.” When was that? Most systems store the ingestion timestamp, not the event date. They’re different.

Temporal Precedence — “Which of these two things happened first?” Requires knowing the actual order of events across multiple sessions, not just which was stored most recently.

Temporal Interval — “How long between these two events?” Needs real date arithmetic. Not vibes. Not “a while ago.” Actual days.

The paper’s solution has two parts:

1. Time-aware memorization: at ingest time, resolve relative time expressions (“last Friday,” “two weeks ago”) into concrete calendar dates. Store the event date separately from the ingestion date.
2. Neuro-symbolic temporal reasoning: at query time, generate Python code to perform the date arithmetic, execute it, and use the output to answer the question.

Part 1 is clearly right. Part 2 is a good idea wrapped in academic scaffolding, and we had to rework it to fit our production architecture.

What’s Wrong With Python Subprocess

The paper uses Python because it’s running in a research notebook. dateutil.relativedelta is genuinely excellent for date math. But "generate Python, exec it, parse stdout" as a production pattern has real problems:

Latency: cold subprocess startup on every temporal query
Security: LLM-generated code execution is a meaningful attack surface
Fragility: syntax errors require retry loops, model-specific prompting, and graceful degradation
Dependency: Python must be available in the same environment as your Node.js agent

There’s a better way. SQLite has had julianday() since version 3.0. JavaScript has new Date(). Neither needs a subprocess.

The Production Architecture

We built this into VEKTOR Slipstream — a local-first AI agent memory platform built on Node.js and SQLite. Here’s what the production implementation looks like.

Stage 1: Time-Aware Memorization

The TReMu paper’s key insight is that event time and mention time are different dimensions. When a user says “I met the new CTO last Tuesday,” the event happened last Tuesday. The agent learned about it today. Standard memory systems collapse these into one timestamp.
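
To make the two dimensions concrete, here is a minimal sketch of a memory record that keeps both timestamps. The field names (mentionedAt, eventDate) and the mostRecentWeekday helper are illustrative assumptions, not VEKTOR's schema. The helper covers only the "last <weekday>" case, and it reads "last Tuesday" as the most recent Tuesday strictly before the session date, which is one possible interpretation.

```javascript
// Hypothetical helper: the most recent given weekday strictly before the anchor (UTC).
// weekday uses the JavaScript convention: 0 = Sunday ... 6 = Saturday.
function mostRecentWeekday(anchorIso, weekday) {
  const anchor = new Date(anchorIso);
  // Days to step back: 1..7 (a Tuesday anchor asking for Tuesday goes back a full week).
  const daysBack = (anchor.getUTCDay() - weekday + 7) % 7 || 7;
  const resolved = new Date(Date.UTC(
    anchor.getUTCFullYear(),
    anchor.getUTCMonth(),
    anchor.getUTCDate() - daysBack
  ));
  return resolved.toISOString().slice(0, 10);
}

const sessionTimestamp = '2025-09-12T14:30:00Z'; // a Friday

// Assumed record shape: mention time and event time are stored separately.
const memory = {
  content: 'Met the new CTO',
  mentionedAt: sessionTimestamp,                      // when the agent learned it
  eventDate: mostRecentWeekday(sessionTimestamp, 2),  // when it happened
};

console.log(memory.eventDate); // → '2025-09-09'
```

A system that stores only mentionedAt would place the meeting on the 12th, three days after it actually happened.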

VEKTOR already had a vektor_timeline table and an event_date column waiting to be wired up. The gap was at ingestion: the extraction prompt didn't ask for relative time expressions, and nothing resolved them to calendar dates.

The fix: chrono-node, a production-grade NLP date parser for JavaScript.

    import * as chrono from 'chrono-node';

    function resolveRelativeDate(expression, sessionTimestamp) {
      // Pre-process idioms chrono doesn't handle natively
      let expr = expression.trim();
      expr = expr.replace(/\ba\s+fortnight(\s+ago)?\b/i, '2 weeks ago');
      expr = expr.replace(/\bhalf\s+a\s+year(\s+ago)?\b/i, '6 months ago');
      const anchor = new Date(sessionTimestamp);
      const parsed = chrono.parseDate(expr, anchor, { forwardDate: false });
      return parsed ? parsed.toISOString().slice(0, 10) : null;
    }

The core is two lines: parse, then format. It handles “last Thanksgiving,” “a fortnight ago,” “Q3 of last year,” and “the day before yesterday.” The anchor parameter is the session timestamp, the point in time from which relative expressions are calculated.

This gets called at the end of the session ingest loop, after facts are stored:

    // After storing facts, resolve event dates into vektor_timeline
    await patchSessionIngest(storedFacts, sessionTimestamp, db);

Stage 2: SQL Temporal Reasoning

Instead of generating Python code and executing it, we use the tools that were already there.

Temporal Anchoring: “When did X happen?”

    SELECT m.id, m.rowid, m.content,
           COALESCE(t.iso_date, m.event_date) AS resolved_date
    FROM memories m
    LEFT JOIN vektor_timeline t ON t.memory_id = m.rowid
    WHERE m.rowid IN (/* recall result rowids */)
      AND (m.superseded_by IS NULL)
      AND NOT EXISTS (
        SELECT 1 FROM memory_edges me
        WHERE me.source_id = m.rowid AND me.edge_type = 'SUPERSEDES'
      )
      AND (m.event_date IS NOT NULL OR t.iso_date IS NOT NULL)
    ORDER BY resolved_date DESC

Temporal Interval: “How long between A and B?”

    SELECT
      CAST(ABS(
        julianday(COALESCE(tb.iso_date, b.event_date)) -
        julianday(COALESCE(ta.iso_date, a.event_date))
      ) AS INTEGER) AS days_between
    FROM memories a
    LEFT JOIN vektor_timeline ta ON ta.memory_id = a.rowid
    JOIN memories b ON b.rowid = ?
    LEFT JOIN vektor_timeline tb ON tb.memory_id = b.rowid
    WHERE a.rowid = ?

julianday() is native SQLite. No dependencies. Transactional. Runs inside your existing database connection. The result feeds directly into a formatInterval() function that outputs "6 months" or "179 days" as needed.

Temporal Precedence — “Which came first?”

    SELECT m.id, m.rowid, m.content,
           COALESCE(t.iso_date, m.event_date) AS resolved_date
    FROM memories m
    LEFT JOIN vektor_timeline t ON t.memory_id = m.rowid
    WHERE m.rowid IN (/* recall result rowids */)
      /* SUPERSEDES filter */
    ORDER BY resolved_date ASC NULLS LAST

Map the results with an index: rows.map((r, i) => ({ ...r, temporal_order: i + 1 })). The LLM reads temporal_order: 1 and knows which came first. No arithmetic required on the model side.
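
As a rough illustration of that mapping step, the sketch below ranks rows shaped like the query output. The function name withTemporalOrder is hypothetical. It also leaves rows without a resolved date unranked, which is a choice made for this sketch and not something the query above does (the one-liner numbers every row).

```javascript
// Hypothetical helper: rank rows by resolved_date, oldest first.
// ISO YYYY-MM-DD strings sort correctly as plain strings.
function withTemporalOrder(rows) {
  const dated = rows
    .filter((r) => r.resolved_date != null)
    .sort((a, b) => (a.resolved_date < b.resolved_date ? -1 : a.resolved_date > b.resolved_date ? 1 : 0));
  const undated = rows.filter((r) => r.resolved_date == null);

  return [
    ...dated.map((r, i) => ({ ...r, temporal_order: i + 1 })),
    // Sketch-specific choice: no date means no rank, so the model cannot read "last" into it.
    ...undated.map((r) => ({ ...r, temporal_order: null })),
  ];
}

const ordered = withTemporalOrder([
  { id: 'b', content: 'Migrated to Valkey', resolved_date: '2025-09-10' },
  { id: 'c', content: 'Discussed caching', resolved_date: null },
  { id: 'a', content: 'Decided on Redis', resolved_date: '2025-03-15' },
]);

console.log(ordered.map((r) => `${r.id}:${r.temporal_order}`).join(' ')); // → 'a:1 b:2 c:null'
```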

Stage 3: The SUPERSEDES Anti-Join

This is the piece that makes temporal reasoning actually useful: hiding dead facts so the LLM never sees them.

The Zep/Graphiti approach (bi-temporal storage with four timestamps per fact) is architecturally correct but expensive to retrofit. Their valid timestamps, invalid timestamps, event times and ingestion times require schema discipline at write time across every writer.

The lazy alternative: a SUPERSEDES edge in the graph.

When a new memory replaces an old one, write a SUPERSEDES edge in memory_edges. At recall time, the NOT EXISTS anti-join makes superseded nodes mathematically invisible:

    AND NOT EXISTS (
      SELECT 1 FROM memory_edges me
      WHERE me.source_id = m.rowid AND me.edge_type = 'SUPERSEDES'
    )

One index on (source_id, edge_type) and this runs fast even at scale.
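
For readers who think more easily in JavaScript than SQL, here is an in-memory equivalent of the anti-join. It is only a sketch of the semantics. It assumes the edge's source_id holds the rowid of the older, superseded memory, which is what the query's filter implies; the target_id field and the visibleMemories name are hypothetical.

```javascript
// Hypothetical in-memory version of the NOT EXISTS filter.
// Assumption: source_id is the rowid of the OLDER memory that has been superseded.
function visibleMemories(memories, edges) {
  const superseded = new Set(
    edges
      .filter((e) => e.edge_type === 'SUPERSEDES')
      .map((e) => e.source_id)
  );
  return memories.filter((m) => !superseded.has(m.rowid));
}

const memories = [
  { rowid: 1, content: 'We use Redis' },
  { rowid: 2, content: 'We migrated to Valkey' },
  { rowid: 3, content: 'CI runs on GitHub Actions' },
];

const edges = [
  { source_id: 1, target_id: 2, edge_type: 'SUPERSEDES' }, // hides rowid 1
  { source_id: 3, target_id: 2, edge_type: 'RELATED' },    // other edge types are ignored
];

console.log(visibleMemories(memories, edges).map((m) => m.rowid)); // → [ 2, 3 ]
```

If your edges point the other way (newer memory as the source), the filter has to test the target column instead, otherwise it hides the fresh fact and keeps the stale one.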

The critical addition: a Conflict Resolution Agent gate before writing the edge. Blind automated supersession is dangerous — “I also bought Ethereum” shouldn’t mark “I own Bitcoin” as superseded just because they’re semantically similar.

    async function checkConflict(contentOld, contentNew, llmCall) {
      const prompt = `
    Fact A (older): "${contentOld}"
    Fact B (newer): "${contentNew}"
    Does Fact B logically SUPERSEDE (replace/update) Fact A?
    - TRUE: "We use AWS" → "We migrated to GCP"
    - FALSE: "I own Bitcoin" → "I also bought Ethereum"
    Reply ONLY with JSON: {"supersedes": true|false, "reasoning": "..."}`;
      const result = JSON.parse(await llmCall(prompt));
      return result.supersedes === true;
    }

The edge is only written if the LLM explicitly confirms logical replacement. Everything else stays additive.
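
One practical wrinkle with checkConflict is that JSON.parse throws when the model wraps its answer in prose. The sketch below shows one way the gate could fail closed, so that a malformed or failed reply simply writes no edge. The names parseVerdict, maybeWriteSupersedes and writeEdge, and the target_id field, are hypothetical and not part of VEKTOR's API.

```javascript
// Hypothetical helper: accept only an explicit {"supersedes": true} verdict.
function parseVerdict(raw) {
  // Grab the outermost {...} span in case the model added surrounding text.
  const match = String(raw).match(/\{[\s\S]*\}/);
  if (!match) return false;
  try {
    return JSON.parse(match[0]).supersedes === true;
  } catch {
    return false; // malformed JSON: stay additive
  }
}

// Hypothetical gate: llmCall and writeEdge are injected by the caller.
async function maybeWriteSupersedes(oldMem, newMem, llmCall, writeEdge) {
  const prompt = [
    `Fact A (older): "${oldMem.content}"`,
    `Fact B (newer): "${newMem.content}"`,
    'Does Fact B logically SUPERSEDE (replace/update) Fact A?',
    'Reply ONLY with JSON: {"supersedes": true|false, "reasoning": "..."}',
  ].join('\n');

  let raw;
  try {
    raw = await llmCall(prompt);
  } catch {
    return false; // LLM unavailable: write nothing
  }
  if (!parseVerdict(raw)) return false;

  // Direction assumption: source_id is the older memory being superseded.
  writeEdge({ source_id: oldMem.rowid, target_id: newMem.rowid, edge_type: 'SUPERSEDES' });
  return true;
}
```

Failing closed costs a little recall precision when the model misbehaves, but it never hides a fact that is still true.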

Stage 4: MCP Tools for On-Demand Temporal Reasoning

Rather than hardcoding SQL routing (anchor vs precedence vs interval), we expose two MCP tools and let the agent drive:

query_timeline — returns chronologically ordered events for a keyword, superseded facts excluded. The agent uses this to establish what happened and when.

calculate_date_math — takes two ISO dates, returns days/weeks/months and a formatted string. Pure JavaScript Date arithmetic, no subprocess.

    // Agent calls this after query_timeline returns two dates
    const result = await calculate_date_math({
      date_a: '2025-03-15', // Redis decision
      date_b: '2025-09-10'  // Valkey migration
    });
    // → { days: 179, formatted: "6 months", ... }

This is the TReMu “neuro-symbolic” pattern translated correctly: the agent writes the reasoning steps, the execution layer handles the precision. The difference is that “execution” is now a pure JS function with no subprocess overhead.
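
A minimal sketch of what such a date-math function could look like in plain JavaScript follows. This is not VEKTOR's implementation: the function names, the returned weeks and months fields, and the thresholds in formatInterval (days below two weeks, weeks below 60 days, months below two years) are all assumptions made for illustration.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Parse a YYYY-MM-DD string as UTC midnight so time zones and DST cannot shift the result.
function toUtcMidnight(isoDate) {
  const m = /^(\d{4})-(\d{2})-(\d{2})$/.exec(isoDate);
  if (!m) throw new TypeError(`Expected YYYY-MM-DD, got: ${isoDate}`);
  return Date.UTC(Number(m[1]), Number(m[2]) - 1, Number(m[3]));
}

// Assumed formatting thresholds; tune these to taste.
function formatInterval(days) {
  if (days < 14) return `${days} day${days === 1 ? '' : 's'}`;
  if (days < 60) return `${Math.round(days / 7)} weeks`;
  if (days < 730) return `${Math.round(days / 30.44)} months`; // 30.44 = average month length
  return `${Math.round(days / 365.25)} years`;
}

// Hypothetical stand-in for the calculate_date_math tool body.
function calculateDateMath({ date_a, date_b }) {
  const days = Math.round(Math.abs(toUtcMidnight(date_b) - toUtcMidnight(date_a)) / MS_PER_DAY);
  return {
    days,
    weeks: Math.floor(days / 7),
    months: Math.round(days / 30.44),
    formatted: formatInterval(days),
  };
}

console.log(calculateDateMath({ date_a: '2025-03-15', date_b: '2025-09-10' }));
// → { days: 179, weeks: 25, months: 6, formatted: '6 months' }
```

Working in UTC midnights keeps the subtraction an exact multiple of one day, which is the same guarantee julianday() gives on date-only strings.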

A Schema Gotcha That Cost Us Three Debugging Sessions

Why mention our bugs?

Because they are a real part of testing in our lab. Every developer hits them, and once we found the cause, these were ironed out in about five minutes over multiple revisions and passes.

One thing no academic paper mentions: memories.id in VEKTOR is a TEXT column (format: vektor-slipstream-memory-8054). The vektor_timeline.memory_id column is INTEGER (SQLite rowid). Every JOIN we wrote naively used t.memory_id = m.id — silently returning zero rows because SQLite was comparing integer 5726 to text vektor-slipstream-memory-8054.

The fix is to join on m.rowid not m.id, and wrap production recall (which returns text IDs) in a lookup function:

    function _resolveToRowids(db, memoryIds) {
      if (memoryIds.length === 0) return [];
      if (typeof memoryIds[0] === 'number') return memoryIds; // test seeds already use rowids
      const ph = memoryIds.map(() => '?').join(','); // one placeholder per text ID
      const rows = db.prepare(
        `SELECT rowid FROM memories WHERE id IN (${ph})`
      ).all(...memoryIds);
      return rows.map(r => r.rowid);
    }

Type mismatches in SQLite are silent. They don’t throw. They just return empty result sets and let you spend three sessions wondering why the logic is wrong when it’s actually the schema.

The Results

After building and iterating through the fixes, 32/32 tests pass:

Results

    32 passed 0 failed 0 skipped (100% pass rate)
    chrono-node installed ✓

The accuracy benchmark on the three TReMu question types (Anchoring, Precedence, Interval) scores 2/3 (67%) without semantic pre-filtering, and 3/3 (100%) when the recall step correctly narrows to topic-specific memories before applying temporal reasoning.

That gap — from 67% to 100% — is meaningful. It shows that temporal reasoning and semantic retrieval are separate problems that compose. Getting the temporal math right (which we’ve now done) is necessary but not sufficient. The retrieval step still needs to scope the candidate set correctly. That’s a solved problem in VEKTOR — it just needs to be wired to the temporal layer, which is the next step.

For context: GPT-4o with standard prompting scores 29.83% on TReMu’s benchmark. With the full framework it scores 77.67%. That is the gap we’re closing with pure SQL and a JS date library.

What We Didn’t Build (And Why)

Full bi-temporal storage (Zep/Graphiti style): four timestamps per fact, explicit t_valid/t_invalid ranges. Correct for compliance use cases where you need "what did the agent believe at time T?" reconstruction. Overkill for most applications, expensive to retrofit, requires write-time discipline across every ingestion path. The SUPERSEDES edge approach gives you 80% of the benefit at 10% of the cost.

Time-weighted PageRank — decaying graph traversal weights toward recent nodes. Directionally right, but the hot-path cost is real. Our implementation uses temporalDecayWeight() to pre-bake decay into edge weights at write time, avoiding runtime recalculation.

Generative temporal QA — TReMu’s benchmark uses multiple-choice questions. Real agent interactions are open-ended. The MCP tool pattern handles this: the agent constructs its own reasoning using query_timeline + calculate_date_math, generating a freeform answer rather than selecting from options.
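
On the time-weighted point above: the article names temporalDecayWeight() but does not give its formula, so the following is only a guess at one plausible shape, an exponential decay with a configurable half-life applied once when the edge is written. The signature, the 90-day default and the clamping of future dates are all assumptions.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical decay: the weight halves every halfLifeDays of event age.
// writeTimeIso is the moment the edge is written, so nothing is recomputed at query time.
function temporalDecayWeight(baseWeight, eventDateIso, writeTimeIso, halfLifeDays = 90) {
  const ageMs = Date.parse(writeTimeIso) - Date.parse(eventDateIso);
  const ageDays = Math.max(0, ageMs / DAY_MS); // events dated in the future get no boost
  return baseWeight * Math.pow(0.5, ageDays / halfLifeDays);
}

console.log(temporalDecayWeight(1.0, '2025-01-01', '2025-04-01')); // → 0.5 (exactly 90 days old)
console.log(temporalDecayWeight(1.0, '2025-04-01', '2025-04-01')); // → 1
```

The trade-off of baking decay in at write time is that weights go stale as the graph ages, so a periodic re-weighting pass would still be needed for long-lived stores.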

What’s Next

The temporal layer is now tested and live. The remaining work:

Wire patchSessionIngest to the live ingest flow: the session timestamp needs to be passed through from the MCP caller context
Semantic pre-filtering before temporal recall: narrow the candidate set by topic before calling temporalAnchor/temporalPrecedence
Foresights integration: the vektor_foresights table is populated with valid_from/valid_until dates via chrono-node; the briefing scheduler queries upcoming foresights
Benchmark against real sessions: run the accuracy harness against actual VEKTOR conversation history to get a real-world before/after number

The core insight from TReMu holds: treating temporal reasoning as a code execution problem is the right abstraction. The execution layer just doesn’t need to be Python. For agents built on SQLite and Node.js, it never did.

VEKTOR Slipstream is a local-first AI agent memory and governance platform. vektormemory.com

Generative Ai Tools
AI Agent
Arxiv
LLM
Vector Database
