Oleksander

Posted on Jul 2

Series: "Can You Build an Alternative to LLMs? 8 Months of Experiments, 200 Failures, and One Wall" 1

#ai #llm #machinelearning #evaluation

I tested a simple hypothesis: can long LLM sessions be made cheaper by replacing the full transcript with a compact memory state, without losing answer correctness?

On my own synthetic eval, the system passed 30/30. On an external benchmark for long-term conversational memory, the score collapsed to 0.13. The failure was not caused by one bad prompt. It came from the type of memory being preserved: the local eval tested exact facts, while LoCoMo tested episodic memory.

After several failed approaches, one narrow architecture survived: append-only memory plus deterministic guards for exact fields, especially dates. It is not general-purpose memory, but it produced an honest result: 94% retention with 60% per-query saving on 200 QA.

The main lesson: synthetic evals are useful as regression tests, but dangerous as evidence. If an eval is written around the mechanism being tested, the system can look strong exactly where it proves the least.

1. The Problem

Long LLM sessions are expensive. If every new request receives the entire previous transcript, cost grows with conversation length. The obvious engineering response is to compress the history into a compact state and give the model that state plus the most recent turns.

The hypothesis was:

Compact session state can reduce prompt/context usage without losing answer correctness.

This is attractive. It promises smaller prompts, longer sessions, cheaper agents, and less irrelevant context. But it hides a trap: if the state compresses the wrong type of information, it does not optimize memory. It deletes data.

2. The Local Eval Looked Good

I started with my own 30-case corpus. It covered:

exact facts;
support notes;
CRM-style notes;
coding sessions;
RAG-like context;
mixed-language facts;
preferences and decisions;
negative short-context cases.

Example:

{
  "id": "exact_authorization_code",
  "question": "What is the authorization code? Answer only the code.",
  "expected_fragments": ["RX-4471"]
}

On this corpus, the system looked good. But the first pretty numbers were dry-runs. A dry-run does not call the model: the answer equals the expected fragments by construction. That mode is useful for checking the pipeline, but it is not evidence of quality.
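The difference between a dry-run and a real run can be sketched in a few lines. This is a minimal illustration, not the actual harness: `check_fragments`, `run_case`, and the `model` callable are hypothetical names, and the case-insensitive substring match is an assumption about how the fragments are compared.

```python
from typing import Callable, List, Optional


def check_fragments(answer: str, expected_fragments: List[str]) -> bool:
    """Strict check: every expected fragment must appear in the answer (case-insensitive)."""
    lowered = answer.lower()
    return all(fragment.lower() in lowered for fragment in expected_fragments)


def run_case(case: dict, model: Optional[Callable[[str], str]]) -> bool:
    """Run one eval case. With model=None this is a dry-run."""
    if model is None:
        # Dry-run: the "answer" is built from the expected fragments,
        # so the check passes by construction and says nothing about quality.
        answer = " ".join(case["expected_fragments"])
    else:
        answer = model(case["question"])
    return check_fragments(answer, case["expected_fragments"])


case = {
    "id": "exact_authorization_code",
    "question": "What is the authorization code? Answer only the code.",
    "expected_fragments": ["RX-4471"],
}

dry_run_passed = run_case(case, model=None)
# A stand-in "model" that does not know the answer fails the same case.
wrong_model_passed = run_case(case, model=lambda question: "I do not know.")
```

The dry-run exercises the loader, the checker, and the reporting path, which is why it is useful for pipeline checks and useless as a quality number.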

The first full real-model run, after fixing checker artifacts and a mixed-language gap, gave:

corpus:       30 cases
noise:        +40 turns
accuracy:     1.000  (30/30)
effectiveness_rate: 1.000
context_window_saved_pct: 82.86%
false savings: 0
GATE: PASS

This was stronger than a dry-run. But the central problem remained: the corpus was mine. It was written around what the mechanism was already good at preserving.

3. Why Synthetic 30/30 Was Not Enough

The local eval mostly tested exact facts and durable decisions:

codes;
dates in strict formats;
IDs;
file paths;
explicit preferences;
short rules;
"do not forget" facts.

That is important for regression testing. But it is not the same as long-term conversational memory.

Real conversational memory often asks:

when something happened;
what "yesterday" meant relative to a specific session;
how events across several dialogues are connected;
who said what;
what two people have in common;
which fact is needed for an answer even though it is not named in the question.

My local eval tested the type of memory I already knew how to compress. The external benchmark tested what I had not shaped around my own mechanism.

4. LoCoMo Broke the Claim

For an external check, I used LoCoMo, a benchmark for very long-term conversational memory. LoCoMo dialogues average around 300 turns and 9K tokens, spread over up to 35 sessions. It tests long-term memory through QA, event summarization, and multi-session dialogue understanding.

The first result was harsh:

raw              : 8/15 = 0.53
projected_facts  : 2/15 = 0.13
projected_hybrid : 2/15 = 0.13

The optimized state retained only a quarter of the answers that raw context got right (2 of 8). Meanwhile, the context saving looked excellent: 94-99%.

That was not optimization. It was deletion of needed information.
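The relationship between the two numbers can be written down directly. This is a small sketch: `retained_vs_raw` mirrors the metric name used in the run output, while `is_real_saving` and its `min_retention` threshold of 0.9 are hypothetical and only illustrate the idea of gating savings on retention.

```python
def retained_vs_raw(raw_correct: int, compressed_correct: int) -> float:
    """Share of raw-context correct answers that survive compression."""
    if raw_correct <= 0:
        raise ValueError("raw baseline has no correct answers to retain")
    return compressed_correct / raw_correct


def is_real_saving(context_saved_pct: float, retention: float,
                   min_retention: float = 0.9) -> bool:
    """A saving only counts if enough correct behavior survived (threshold is an assumption)."""
    return context_saved_pct > 0 and retention >= min_retention


# Numbers from the first LoCoMo run: raw 8/15 correct, projected 2/15 correct.
retention = retained_vs_raw(raw_correct=8, compressed_correct=2)

# Lower end of the reported 94-99% context saving.
verdict = is_real_saving(context_saved_pct=94.0, retention=retention)
```

With retention at 0.25, the gate rejects the run no matter how large the saving is.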

After adding session timestamps to dialogue lines, the raw baseline improved:

raw              : 0.80
projected_hybrid : 0.20
retained_vs_raw  : 0.25

So the issue did not disappear with a better baseline. It became clearer that the projected state was not preserving the required memory type.

There is an important methodological caveat: the early 0.13 used a strict substring/token checker. Such a checker can miss semantically correct date answers. For example, the gold answer may be "the sunday before 25 May 2023", while the model answers "20 May 2023".

But the gap was too large to dismiss as a checker artifact: raw had 8 correct answers, projected had 2.
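To illustrate how a strict checker undercounts date answers, here is a hedged sketch that compares a substring match with a match on parsed dates. The function names, the list of accepted formats, and the example strings are all assumptions made for illustration; it only handles answers that consist of a single date, and it relies on English month names in the default locale.

```python
from datetime import date, datetime
from typing import Optional

# Formats this sketch understands; a real checker would need more.
DATE_FORMATS = ("%d %B %Y", "%B %d, %Y", "%Y-%m-%d")


def parse_date(text: str) -> Optional[date]:
    """Try each known format; return None if the text is not a plain date."""
    cleaned = text.strip().rstrip(".")
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None


def strict_match(gold: str, answer: str) -> bool:
    """The strict checker: gold must appear verbatim inside the answer."""
    return gold.lower() in answer.lower()


def date_equivalent(gold: str, answer: str) -> bool:
    """Looser check: both strings parse to the same calendar date."""
    gold_date, answer_date = parse_date(gold), parse_date(answer)
    return gold_date is not None and gold_date == answer_date


gold = "25 May 2023"            # hypothetical gold answer
answer = "May 25, 2023"         # hypothetical model answer, same day
strict_ok = strict_match(gold, answer)
equivalent_ok = date_equivalent(gold, answer)
```

Here the strict check fails while the date-aware check passes, which is the kind of false negative the caveat describes. Relative gold answers need an extra resolution step that this sketch does not attempt.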

5. What Failed

After LoCoMo, three failures became visible.

5.1 Static Facts Are Not Episodic Memory

The initial state preserved exact facts well:

codes;
paths;
IDs;
strict dates;
explicit user decisions;
preferences;
constraints.

But it poorly preserved:

event sequence;
relative dates;
"yesterday" relative to a session date;
shared interests;
multi-hop links;
evidence-neighbor context.

LoCoMo asked for episodic memory: who did what, when, where, and how it connects across sessions.

5.2 Lexical Retrieval Was Not Enough

I also tested a retrieval-style approach: instead of compressing everything into one state, select relevant chunks for each question.

Result:

append_full      : 32/60 = 0.533, query saving 45.89%
append_retrieved : 25/60 = 0.417, query saving 90.73%

append_retrieved looked better economically, but quality dropped. The reason is simple: lexical overlap fails when the question and the evidence do not share words.

Typical failures:

"What did Caroline research?" did not retrieve adoption agencies;
a shared-destress question did not retrieve dance;
a martial-arts question did not retrieve Kickboxing, Taekwondo;
temporal and multi-hop questions broke more often.
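The first failure in the list can be reproduced with a toy scorer. This sketch uses plain Jaccard token overlap as a stand-in for the lexical retriever (the quantity a MinHash-style method approximates); the stopword list and both chunk texts are hypothetical and were written only to show the effect.

```python
import re
from typing import Set

# Tiny stopword list, assumed for this example only.
STOPWORDS = {"what", "did", "the", "a", "an", "to", "of", "and", "in", "on", "into"}


def tokens(text: str) -> Set[str]:
    """Lowercased word set without stopwords."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def jaccard(left: str, right: str) -> float:
    """Token-set overlap: |intersection| / |union|."""
    a, b = tokens(left), tokens(right)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


question = "What did Caroline research?"
chunks = [
    # The actual evidence: it answers the question but never says "research".
    "Caroline looked into adoption agencies last week.",
    # A distractor that happens to share the word "research".
    "Caroline said the research paper was due on Friday.",
]

ranked = sorted(chunks, key=lambda chunk: jaccard(question, chunk), reverse=True)
top_chunk = ranked[0]
```

The distractor outranks the evidence because it shares two tokens with the question while the evidence shares one. With a small retrieval budget, the evidence never reaches the prompt.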

5.3 Dates Were a Separate Failure Class

At 200 QA, the main gap localized to the temporal category:

raw         : 126/200 = 0.63
append-only : 111/200 = 0.56
retention   : 88%

by category (raw correct / append-only correct):
multi-hop   : 46 / 45   (~98%)
temporal    : 58 / 46   (~79%)
open-domain : 18 / 16   (~89%)
single-hop  : 4 / 4     (100%)

This distinction mattered. The memory was not uniformly bad. Temporal anchors were bad.
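The category breakdown can be recomputed from the counts above, reading each pair as raw correct / append-only correct (an interpretation that matches the 126 and 111 totals). The variable names are illustrative.

```python
# (raw correct, append-only correct) per category, from the 200-QA run.
results = {
    "multi-hop": (46, 45),
    "temporal": (58, 46),
    "open-domain": (18, 16),
    "single-hop": (4, 4),
}

# Retention per category: how many raw-correct answers append-only kept.
per_category = {name: kept / raw for name, (raw, kept) in results.items()}

raw_total = sum(raw for raw, _ in results.values())
append_total = sum(kept for _, kept in results.values())
overall_retention = append_total / raw_total

# The category that loses the most is where a targeted fix should go.
worst_category = min(per_category, key=per_category.get)

for name, value in sorted(per_category.items(), key=lambda item: item[1]):
    print(f"{name:12s} {value:.0%}")
```

Sorting by retention puts temporal first, well below the other categories, which is what justified a date-specific fix rather than a general one.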

When the diagnosis is precise, the fix can be precise.

6. Ladder of Attempts

There was no direct jump from the first failure to the final result. Several approaches died for specific reasons.

| Approach | Result | Why it failed or narrowed |
| --- | --- | --- |
| text compression / projected state | `2/15 = 0.13` | kept exact facts, lost episodic memory |
| MinHash-style lexical retrieval | `3/15 = 0.20` | lexical overlap missed paraphrase; evidence hit about `42%` |
| evidence oracle | `0.30` under strict checker | even exact evidence lines did not guarantee date-equivalent substring match |
| recode-to-notation smoke | `3/3 = 1.00` | small smoke was too optimistic |
| recode-to-notation larger slice | unstable: `0.35` in one slice, `0.70` vs raw `0.75` under LLM judge in another | interesting signal, not stable enough |
| append-only without date-guard | `111/200 = 0.56`, retention `88%` | most loss concentrated in temporal questions |
| append-only + date-guard | `94% retention`, `60% per-query saving` | first narrow result that survived scale better |

This table is the real research story. The useful mechanism was not guessed. It survived because prettier mechanisms died first.

7. What Survived

The surviving design had two constraints.

First: do not re-summarize the whole state.

Second: protect exact fields deterministically.

The append-only rule is simple:

Compress only the new exchange.
Freeze the compressed chunk.
Append it to memory.
Never re-compress old chunks.

Why this matters: if old facts repeatedly pass through a compressor, small losses accumulate. If each exchange is compressed once and then frozen, loss cannot compound in the same way.
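The four rules above can be sketched as a small class. This is an illustration under assumptions, not the implementation used in the experiments: `AppendOnlyMemory`, `add_exchange`, and `truncate_compressor` are hypothetical names, and the truncating compressor is a toy stand-in for whatever LLM-based compressor is plugged in.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class AppendOnlyMemory:
    """Memory where each exchange is compressed exactly once and then frozen."""
    compress: Callable[[str], str]
    _chunks: List[str] = field(default_factory=list)

    def add_exchange(self, user_turn: str, assistant_turn: str) -> str:
        # Compress only the new exchange; older chunks are never passed in.
        chunk = self.compress(f"User: {user_turn}\nAssistant: {assistant_turn}")
        self._chunks.append(chunk)
        return chunk

    def chunks(self) -> Tuple[str, ...]:
        # Return an immutable view so callers cannot rewrite frozen chunks.
        return tuple(self._chunks)

    def render(self) -> str:
        # The state sent to the model: frozen chunks in original order.
        return "\n".join(self._chunks)


def truncate_compressor(text: str, limit: int = 60) -> str:
    """Toy compressor (assumption): collapse whitespace and cut long text."""
    flat = " ".join(text.split())
    return flat if len(flat) <= limit else flat[: limit - 3] + "..."


calls: List[str] = []


def counting_compressor(text: str) -> str:
    calls.append(text)  # record every compressor invocation
    return truncate_compressor(text)


memory = AppendOnlyMemory(compress=counting_compressor)
first_chunk = memory.add_exchange("My code is RX-4471.", "Noted.")
memory.add_exchange("Let's meet on Friday.", "OK.")
```

After two exchanges the compressor has run twice, and the first chunk is byte-for-byte what it was when it was written. A re-summarizing design would have passed the first chunk through the compressor again on the second turn.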

Early append-only results looked even better than raw: 103% retention on 6 conversations. That was a small-sample artifact. At 200 QA, retention fell to 88%. This was useful: it showed that the architecture helped, but the temporal gap was still real.

8. Date-Guard

The fix was deterministic date protection.

The idea:

extract absolute time expressions;
extract relative time expressions;
attach session date to relative expressions;
append these time anchors to the compressed state.

This does not ask the LLM to be careful with dates. It removes the choice. The compressor can shorten prose, but date anchors survive as explicit fields.
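A minimal sketch of that idea, assuming regular expressions for extraction. The patterns cover only a few forms ("14 June 2023", ISO dates, and yesterday/today/tomorrow), and the function names, anchor format, and example texts are hypothetical; the actual guard may differ.

```python
import re
from datetime import date, timedelta
from typing import List

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# Absolute dates such as "14 June 2023" or "2023-06-14".
ABSOLUTE = re.compile(rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b|\b\d{{4}}-\d{{2}}-\d{{2}}\b")
# A deliberately small set of relative expressions.
RELATIVE_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}
RELATIVE = re.compile(r"\b(yesterday|today|tomorrow)\b", re.IGNORECASE)


def extract_time_anchors(text: str, session_date: date) -> List[str]:
    """Pull out time expressions; tie relative ones to the session date."""
    anchors = [f"date: {match}" for match in ABSOLUTE.findall(text)]
    for match in RELATIVE.finditer(text):
        word = match.group(1).lower()
        resolved = session_date + timedelta(days=RELATIVE_OFFSETS[word])
        anchors.append(
            f"{word} (session {session_date.isoformat()}) = {resolved.isoformat()}"
        )
    return anchors


def guard_dates(original: str, compressed: str, session_date: date) -> str:
    """Append anchors from the ORIGINAL text, whatever the compressor kept."""
    anchors = extract_time_anchors(original, session_date)
    if not anchors:
        return compressed
    return compressed + "\n[time anchors] " + "; ".join(anchors)


original = "I signed the lease yesterday and the move is on 14 June 2023."
compressed = "User signed a lease; move planned."  # dates lost by the compressor
guarded = guard_dates(original, compressed, session_date=date(2023, 6, 2))
```

The anchors are taken from the uncompressed exchange, so the compressor's output cannot affect whether they survive.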

Result on the 200 QA setup:

without guard:
retention        88%
temporal cat     79%
per-query saving 45%

with date-guard:
retention        94%
temporal cat     96%
per-query saving 60%

Saving improved because the prose compressor could become more aggressive once dates were protected separately.

Final honest number:

append-only + date-guard: 94% retention, 60% per-query saving

This is not solved memory. It is not a universal alternative to LLM context. It is a narrow result: append-only compression plus deterministic protection for exact fields.

9. Cost and Scope

This mechanism is not useful for short chats.

On short sessions, fixed overhead can exceed savings:

0 added noise exchanges: projected is worse than raw
1 added noise exchange : projected is still usually worse

Another important distinction: context-window saving and API-cost saving are not the same metric.

One real-provider smoke test showed:

context_window_saved_pct_vs_raw_estimate: 14.61
provider_total_saved_pct_vs_raw: -2.83

The final prompt was smaller, but total provider cost was worse because preparing the compact state added model calls of its own.

For append-only full on 60 QA:

query_saved_pct_vs_raw   : 45.89%
product_saved_pct_vs_raw : 23.92%
break_even_queries       : 28.72
net_saved_pct_at_200     : 39.30%

So the mechanism is only interesting for long sessions where setup cost can be amortized.
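One way to read these four numbers is a linear amortization model: a one-time setup cost, expressed in raw-query equivalents, spread over the number of queries. That model is an assumption here, not a formula taken from the measurement code, but it reproduces the reported break-even and 200-query figures from the first two rows.

```python
def net_saved_fraction(per_query_saving: float, setup_cost: float,
                       num_queries: int) -> float:
    """Net saving vs raw after spreading a one-time setup cost over the queries."""
    return per_query_saving - setup_cost / num_queries


def break_even_queries(per_query_saving: float, setup_cost: float) -> float:
    """Number of queries at which the net saving crosses zero."""
    return setup_cost / per_query_saving


per_query_saving = 0.4589   # query_saved_pct_vs_raw
saving_at_60 = 0.2392       # product_saved_pct_vs_raw, measured on 60 QA

# Solve the model for the setup cost implied by the 60-query measurement.
setup_cost = (per_query_saving - saving_at_60) * 60

break_even = break_even_queries(per_query_saving, setup_cost)
net_at_200 = net_saved_fraction(per_query_saving, setup_cost, 200)

print(f"implied setup cost : {setup_cost:.2f} raw-query equivalents")
print(f"break-even queries : {break_even:.2f}")
print(f"net saving at 200  : {net_at_200:.2%}")
```

Under this reading, the setup costs about 13 raw queries' worth of tokens, and any session shorter than roughly 29 queries loses money.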

10. Lessons

Lesson 1: Synthetic eval is not evidence by itself

Synthetic eval is useful for regression. It is weak evidence for generalization.

If the author writes the eval around their own mechanism, the system can pass by matching the author's blind spots.

Lesson 2: Compression ratio is a vanity metric without retention

99% context saving is meaningless if answer retention collapses.

The key metric is not "how much did we delete?" but "how much correct behavior survived?"

Lesson 3: Memory is not one thing

Exact facts, preferences, episodic events, temporal anchors, multi-hop relations and source evidence are different memory types.

A compressor can preserve one and destroy another.

Lesson 4: Deterministic guards matter

Some fields should not be entrusted to a generative summary:

dates;
amounts;
IDs;
codes;
names;
statuses;
paths;
constraints.

If losing a field breaks correctness, extract it deterministically and preserve it explicitly.
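The same principle extends beyond dates. The sketch below extracts a few field types with regular expressions and re-attaches any that the summary dropped. The patterns, the `[guarded fields]` format, and the example texts are assumptions for illustration; real codes, amounts, and paths would need patterns fitted to the actual data.

```python
import re
from typing import Set, Tuple

# Illustrative patterns only (assumptions).
GUARDED_PATTERNS = {
    "code": re.compile(r"\b[A-Z]{2,}-\d+\b"),        # e.g. RX-4471
    "amount": re.compile(r"\$\d+(?:\.\d{2})?"),      # e.g. $49.99
    "path": re.compile(r"(?:/[\w.-]+){2,}"),         # e.g. /var/log/app/x.log
}


def extract_fields(text: str) -> Set[Tuple[str, str]]:
    """Return (kind, value) pairs for every guarded field found in the text."""
    return {
        (kind, value)
        for kind, pattern in GUARDED_PATTERNS.items()
        for value in pattern.findall(text)
    }


def protect_fields(original: str, summary: str) -> str:
    """Re-attach any guarded field that the summary lost."""
    missing = sorted(extract_fields(original) - extract_fields(summary))
    if not missing:
        return summary
    listed = "; ".join(f"{kind}={value}" for kind, value in missing)
    return summary + "\n[guarded fields] " + listed


original = ("Refund of $49.99 was approved under code RX-4471 and "
            "logs are in /var/log/app/refunds.log for review.")
summary = "A refund was approved."   # generative summary dropped every exact field
guarded = protect_fields(original, summary)
```

Because the comparison is between extracted sets, a summary that already kept a field is left alone, and only the lost fields add tokens.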

Lesson 5: Small samples lie

The 103% retention result on 6 conversations looked exciting. At 200 QA it became 88%. The useful signal was not the optimistic number, but the category breakdown showing where the loss happened.

11. Limitations

This is not an academic benchmark paper.

Limitations:

one main external conversational-memory benchmark;
small and medium QA slices before the 200-QA run;
no statistical significance analysis;
some early measurements used strict substring checking;
LLM-judge checks reduce false negatives from strict string matching but introduce judge variability of their own;
final 94%/60% should be published with a compact appendix table before being treated as a stable claim.

The result is best read as an engineering research note: a failed broad claim, a localized diagnosis, and a narrower mechanism that survived better tests.

12. Conclusion

The local eval said:

30/30

The external benchmark said:

0.13

The final surviving mechanism said:

94% retention / 60% per-query saving

The important result is not that this "solves memory". It does not.

The important result is that an external benchmark forced the system to stop lying through its own eval. The useful architecture appeared only after the original success story failed.

References

LoCoMo: "Evaluating Very Long-Term Conversational Memory of LLM Agents" — https://arxiv.org/abs/2402.17753
Lost in the Middle: "How Language Models Use Long Contexts" — https://arxiv.org/abs/2307.03172
"Investigating Data Contamination in Modern Benchmarks for Large Language Models" — https://arxiv.org/abs/2311.09783
"Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models" — https://arxiv.org/abs/2311.06233

Top comments (2)

Dipankar Sarkar • Jul 2

The 30/30 to 0.13 gap is the whole lesson, and you named the mechanism exactly: the eval and the compressor share a shape. Both key on exact fields, so the compressor literally cannot lose what the eval checks. LoCoMo breaks that because episodic recall lives in the ordering and the turns you compressed away, not in any single field.

The append-only plus deterministic date guards fix is the honest move. You stopped pretending the state was general memory and scoped it to the fields you can actually guarantee. 94% retention at 60% saving reads as real precisely because it is narrow.

One thing I would add from running context compression under an LLM proxy: the bias is not just 'wrong type of memory,' it is that compression is lossy in a direction you pick at design time, and every eval you write inherits that same blind spot. The only evals that caught our regressions were the ones authored by someone who did not know what we were dropping.

Oleksander • Jul 3

This is a sharper framing than mine, and it corrects my conclusion in the right direction. I said "wrong type of memory" — but you're right that the root isn't the type, it's that the direction of loss is chosen at design time, and after that every eval I write is blind in the same direction. The author of the compressor can't write an eval that catches what he unconsciously decided to drop — because if he knew it mattered, he wouldn't have dropped it.

That turns "external benchmarks are useful" into something more precise: the independence of an oracle isn't a property of the data, it's a property of its author. LoCoMo caught me not because it's harder, but because it was written by people who didn't know about my date-guard. Your line — "the only evals that caught our regressions were authored by someone who did not know what we were dropping" — is the shortest statement of this I've seen.
Question back: how did you operationalize that independence in practice, inside a team where everyone already knows the architecture?