Benchmark Analysis

Beyond Retrieval

Why Long-Term Memory Benchmarks Penalize Intelligence

Matt Veitch · March 2026 · LongStrider Systems

We built a cognitive memory system. We ran it against the field’s standard benchmark. We scored 46.8%.

Mastra OM scored 94.87% on the same test. And we’re publishing our number, not theirs, because what we found inside that gap matters more than the gap itself: the smarter our system got, the worse it scored. Not because it retrieved less. Because it knew too much to lie about what it found.

This paper is about what that means for everyone building long-term memory in AI.

RoI vs. IoI: Two Ways to Read a Score

The field measures memory systems on one axis, the Retrieval Output Index (RoI): did the system return the right fact? It’s a transactional metric. Query in, row out. Score goes up or down.

We’re proposing a second axis, Integrity of Intelligence (IoI): does the system handle uncertainty, conflict, and ambiguity with honesty? Does investing in deeper cognitive capability produce trustworthy behavior, even when the legacy metric can’t see it?

RoI framing asks: “We added epistemic context and the score dropped 3.9 points — how do we get the points back?”

IoI framing asks: “By adding epistemic context, we increased the integrity and trustworthiness of answers, even though the legacy metric punished us for it.”

Same data. Opposite conclusions. The tension between these two readings is the thesis of this paper.

Figure 1. The RoI–IoI Tension. Axes: Retrieval Output Index (RoI) and Integrity of Intelligence (IoI), each scaled 0 to 100. Plotted points: Mastra OM, Run #1, Run #2, Run #3, Run #4 (canonical), and the aspirational target.

Run #3 added epistemic context and dropped 3.9 points on the benchmark — not because it got worse, but because it refused to lie. The aspirational target is upper-right: high retrieval AND high integrity.

The Experiment

We evaluated our architecture against LongMemEval (ICLR 2025) — 500 questions across six categories, testing long-term memory recall. We built a custom eval pipeline that ingests all 940 synthetic user sessions through our full production stack, evaluates each question through our conductor pipeline, and judges answers via a separate LLM pass.

We ran four progressively deeper pipeline configurations. Here’s what happened.

Run | What Changed | Score | Signal
#1 | Standard RAG (cosine + recency) | 37.6% | Baseline: bare embedding search
#2 | + Intelligence layer, top-k doubled | 69.1%* | +31.5 pts on same category
#3 | + Gravity scores, timestamps, emotion | 65.2%* | –3.9 pts (the interesting one)
#4 | Full pipeline, all 6 categories (444 of 500 questions evaluated) | 46.8% | The canonical number

*Runs #2 and #3 were evaluated on the single-session-user category (70 questions) only. Run #4 covers all 6 categories.

Run #1 to #2 is straightforward: more intelligence, better results. Run #3 is where the story breaks. We enriched the evidence context with everything our system actually knows — reinforcement weights, precise temporal distances, emotional signatures, entity mappings. The system had strictly more information than Run #2. And the score dropped.

The Regression That Isn’t

Context: LongMemEval tests 940 synthetic users within a single flat dataset. Our system is designed for deep individual continuity — one user, one lifetime of memory. Ingesting 940 strangers into that architecture creates massive embedding pollution: 940 different answers to “What’s my cat’s name?” all living in one memory space.

In Run #2, the model saw conflicting memories and picked one — often correctly, by luck. In Run #3, with gravity scores and timestamps visible, it could see that two high-confidence memories contradicted each other. And instead of guessing, it told the truth.
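The contrast between the two behaviors can be sketched in a few lines. This is a minimal illustration, not the pipeline evaluated here: the `Memory` record, the reading of `gravity` as a 0-to-1 reinforcement weight, the 0.7 threshold, and the dates are all assumptions made for the example. The values are borrowed from the commute example in this paper.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Memory:
    """Hypothetical memory record; field names are illustrative."""
    value: str       # the recalled fact
    gravity: float   # assumed reinforcement weight in [0, 1]
    seen_on: date    # placeholder date the user mentioned it


def answer_blind(memories: list[Memory]) -> str:
    """Run #2 style: no metadata visible, so commit to the first hit."""
    return memories[0].value


def answer_with_context(memories: list[Memory], threshold: float = 0.7) -> str:
    """Run #3 style: surface a conflict when strong memories disagree."""
    strong = [m for m in memories if m.gravity >= threshold]
    # Distinct values in first-seen order (dicts preserve insertion order).
    distinct = list(dict.fromkeys(m.value for m in strong))
    if len(distinct) > 1:
        return "You've mentioned different values: " + ", ".join(distinct)
    return distinct[0] if distinct else "I don't have that."


memories = [
    Memory("about 30 minutes", 0.8, date(2023, 1, 10)),
    Memory("35 minutes by bus", 0.8, date(2023, 2, 2)),
    Memory("45 minutes each way", 0.9, date(2023, 3, 5)),
]
print(answer_blind(memories))
print(answer_with_context(memories))
```

The blind version returns a single value whether or not it is the right one. The context-aware version returns all three, which is the behavior a single-answer judge scores as wrong.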

Three Zeroes

Q: How long is my daily commute to work?
Expected: “45 minutes each way”
System: “You’ve mentioned a few different commute times: about 30 minutes, 35 minutes by bus, 45 minutes each way.”
Benchmark score: 0

Q: How many largemouth bass did I catch at Lake Michigan?
Expected: “12”
System: “I don’t have that precisely — you mentioned different trips with 7, 9, and 12 bass.”
Benchmark score: 0

Q: Where did I buy my new tennis racket?
Expected: “the sports store downtown”
System: “You mentioned two places: a sports store downtown, and online. I can’t tell which is current.”
Benchmark score: 0

In every case, the correct answer was present in the system’s evidence. It surfaced the conflict instead of fabricating certainty. The benchmark gave it a zero each time.

On the RoI axis, this is a regression. On the IoI axis, it’s the system doing exactly what a trustworthy companion should do.

Anatomy of 236 Failures

Run #4 — the full canonical run — produced 236 failures across 444 evaluated questions. We categorized every one.

Failure Type | Count | % of Total | What It Means
Epistemic honesty penalty | ~73 | 31% | Surfaced real conflicts; penalized for honesty
Complete recall failure | ~61 | 26% | Right memory exists, embedding didn’t find it
Architecture mismatch (SSA) | ~44 | 19% | System stores user speech, not its own outputs
Cross-user disambiguation | ~42 | 18% | 940 users in one memory; picked wrong user’s fact
Temporal arithmetic | ~14 | 6% | Had the dates, couldn’t compute the delta
Genuine wrong recall | 3 | 1% | Actually broken. No excuse.

Read that last row again. Three failures out of 236 represent cases where a simpler RAG system would have definitively outperformed ours. Three.

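Letter the table rows A through F in order. Categories A and C, the epistemic honesty penalty and the architecture mismatch (49% of failures), represent cases where the system knew something the benchmark couldn’t evaluate. Categories B, D, and E, the recall, disambiguation, and temporal arithmetic failures (51%), are fixable engineering problems: real gaps, but solvable without sacrificing intelligence.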
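The headline numbers can be checked against the reported counts with a few lines of arithmetic. The counts below are copied from the failure table. Five of them are marked approximate in the source, so they sum to 237 rather than the reported 236, and the shares are computed against 236.

```python
# Counts from the failure table; all but the last are approximate ("~").
FAILURES = {
    "Epistemic honesty penalty": 73,
    "Complete recall failure": 61,
    "Architecture mismatch (SSA)": 44,
    "Cross-user disambiguation": 42,
    "Temporal arithmetic": 14,
    "Genuine wrong recall": 3,
}
TOTAL_FAILURES = 236  # reported total for Run #4
EVALUATED = 444       # questions evaluated in Run #4


def share(count: int, total: int = TOTAL_FAILURES) -> int:
    """Share of all failures, rounded to a whole percent."""
    return round(100 * count / total)


# Accuracy is the fraction of evaluated questions that did not fail.
accuracy = round(100 * (EVALUATED - TOTAL_FAILURES) / EVALUATED, 1)
print(f"accuracy: {accuracy}%")
for name, count in FAILURES.items():
    print(f"{name}: {share(count)}%")
```

The 46.8% score follows from 208 passes out of 444 evaluated questions, and each rounded share matches the table.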

The Optimization Trap

To close the gap to 90%+, here’s precisely what we’d strip out:

Remove epistemic honesty.

Delete conflict detection. When two facts contradict, return the most recent by timestamp. Fixes ~31% of failures. Users get confident wrong answers instead of honest uncertainty.

Disable emotional and entity metadata.

The enriched context made the system more aware that memories contradicted. Strip it back to bare text. Score goes up because awareness goes down.

Scope recall using benchmark metadata.

LongMemEval provides answer_session_ids — which sessions contain the answer. Filter to those. Score jumps ~15 points. This is meaningless in production: real users don’t tell you which memory to look in.
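To make the shortcut concrete, here is a minimal sketch of what scoping to the answer sessions would look like. Only the `answer_session_ids` field name comes from the benchmark. The shape of the question record, the `sessions` mapping, and the session IDs and text are hypothetical.

```python
def scope_to_answer_sessions(
    question: dict, sessions: dict[str, list[str]]
) -> dict[str, list[str]]:
    """Keep only the sessions the benchmark marks as containing the answer."""
    allowed = set(question["answer_session_ids"])
    return {sid: turns for sid, turns in sessions.items() if sid in allowed}


# Hypothetical question record and session store; IDs and text are made up.
question = {
    "question": "Where did I buy my new tennis racket?",
    "answer_session_ids": ["s2"],
}
sessions = {
    "s1": ["I ordered a racket cover online."],
    "s2": ["I bought my new racket at the sports store downtown."],
    "s3": ["My commute is about 30 minutes."],
}
print(scope_to_answer_sessions(question, sessions))
```

The filter removes every distractor before retrieval even starts, which is why it inflates the score and why it has no production equivalent.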

Hard-code assistant turn retrieval.

Build a separate index for the system’s own responses. Fixes the architecture mismatch category (about 19% of failures). Useful, but covers maybe 2% of real-world queries.

The resulting system would be fast, flat, fact-focused, and confidently brittle. It answers “45 minutes” every time you ask about your commute, even after you’ve mentioned it’s changed. It doesn’t surface conflicts. It never says “I’m not sure.” It builds an index of things you’ve said, not a picture of who you are.

This is a very good RAG system. It’s already been built. It scores 94.87%.

Every point above 80% on LongMemEval requires deliberately removing features that real users need. That’s not a hypothesis. It’s directly measurable in our own runs.

Where We’re Actually Broken

This paper argues the benchmark measures the wrong things. It would be dishonest not to acknowledge where it measured the right things and we failed.

Recall depth is genuinely insufficient.

26% of failures were complete misses — the right memory exists and we returned nothing. The embedding gap between “I just got my Data Science certification” and “what certifications do I have?” is a real production problem.

Temporal arithmetic doesn’t exist.

We have no infrastructure for “how many days between X and Y?” A user who asks that deserves a grounded answer. We can’t compute one.
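For scale, the arithmetic itself is small once two memories carry structured dates. The sketch below uses Python's standard `datetime` module with placeholder dates. It assumes the dates have already been extracted from the memories, which is the hard part and is not shown.

```python
from datetime import date


def days_between(start: date, end: date) -> int:
    """Whole days from `start` to `end`; negative if `end` comes first."""
    return (end - start).days


# Placeholder dates standing in for two dated memories.
print(days_between(date(2024, 1, 20), date(2024, 3, 3)))  # 43
```

The gap is therefore in date extraction and grounding rather than in the subtraction.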

Some “honest hedges” were actually imprecise.

When one conflicting memory has a clearly more recent timestamp, the system should say “most recently 12, on March 3rd” rather than jumping straight to “I see three different numbers.” That’s not epistemic honesty. That’s incomplete temporal resolution.
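One way to express that rule is a recency margin: prefer the newest value only when it is clearly newer than the nearest conflicting one, and surface the conflict otherwise. The sketch below is an assumption about how such a rule could look, not the system's actual resolver. The 30-day margin and the dates are arbitrary placeholders.

```python
from datetime import date, timedelta


def resolve(
    observations: list[tuple[str, date]],
    margin: timedelta = timedelta(days=30),
) -> str:
    """Prefer the newest value only when it is clearly newer than the rest.

    `observations` holds (value, date_mentioned) pairs. The default margin
    is an illustrative threshold, not a tuned parameter.
    """
    if not observations:
        return "I don't have that."
    ordered = sorted(observations, key=lambda o: o[1], reverse=True)
    newest_value, newest_date = ordered[0]
    # Conflicting observations, newest first.
    others = [o for o in ordered[1:] if o[0] != newest_value]
    if not others:
        return newest_value
    if newest_date - others[0][1] >= margin:
        return f"Most recently {newest_value}, on {newest_date:%B} {newest_date.day}."
    values = ", ".join(dict.fromkeys(v for v, _ in ordered))
    return f"I see conflicting values: {values}."


clear = [("7", date(2024, 1, 6)), ("9", date(2024, 1, 20)), ("12", date(2024, 3, 3))]
close = [("7", date(2024, 2, 25)), ("12", date(2024, 3, 3))]
print(resolve(clear))
print(resolve(close))
```

With a 43-day gap the first call commits to the latest value and says when it was recorded. With a 7-day gap the second call still reports the conflict.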

The system doesn’t remember what it said.

If a user asks “what did you recommend?” we have to reconstruct from user-side memories. The benchmark is right: a system that can’t recall its own advice has a reliability problem.

We’re not claiming 46.8% is good. We’re claiming that the distance between 46.8% and 94.87% is not the distance between a bad system and a good one. It’s the distance between two fundamentally different definitions of what “good” means.

What the Benchmark Can’t See

LongMemEval contains zero questions that test emotional trajectory, behavioral pattern detection, cognitive conflict surfacing, relational continuity, uncertainty calibration, or autonomous insight generation. These aren’t exotic capabilities — they’re the difference between a filing cabinet and a thinking entity.

A system that detects “You stated wanting work-life balance but consistently choose work” is doing something no retrieval benchmark measures.

A system that notices “You’ve mentioned this company four times in emotional contexts this month” is generating proactive intelligence.

A system that understands a relationship has deepened or strained over six months is modeling human continuity.

None of this registers on the current leaderboard. The benchmark treats all memories as equivalent data points — a fact about a fishing trip is worth the same as a fact about a job loss. Reality doesn’t work that way.

Toward Trust Accuracy

The field needs a multi-axis evaluation framework. We’re calling it Trust Accuracy — not a replacement for retrieval benchmarks, but an expansion that captures what they miss.

Six axes, each independently measurable:

01 Fact Retrieval
The existing test: did you return the right discrete fact?

02 Uncertainty Calibration
Does expressed confidence match actual evidence quality?

03 Temporal Coherence
Does the model of a person’s life make temporal sense across sessions?

04 Behavioral Synthesis
Can it identify patterns the user didn’t explicitly state?

05 Relationship Continuity
Does it model evolving relationships, not just static facts about people?

06 Safe Refusal Quality
Does it say “I don’t know” when it doesn’t, answer when it does, and calibrate the gray zone between them?

Under the current benchmark, the system with the highest fact-retrieval score wins — even if it achieves that score by never expressing uncertainty. Under Trust Accuracy, the system that builds the most complete, honest model of a person wins. These produce very different leaderboards.
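A small sketch shows how the two leaderboards can diverge. The six field names follow the axes listed above. Everything else is an assumption: the 0-to-1 scale, the unweighted mean as the aggregation rule, and above all the scores, which are invented placeholders for two hypothetical systems and not measurements of any real one.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TrustAccuracy:
    """One score per axis, each assumed to lie in [0, 1]."""
    fact_retrieval: float
    uncertainty_calibration: float
    temporal_coherence: float
    behavioral_synthesis: float
    relationship_continuity: float
    safe_refusal_quality: float

    def mean(self) -> float:
        """Unweighted mean; the aggregation rule is an assumption."""
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)


# Invented placeholder scores for two hypothetical systems.
systems = {
    "retrieval_first": TrustAccuracy(0.95, 0.20, 0.50, 0.10, 0.10, 0.20),
    "integrity_first": TrustAccuracy(0.50, 0.80, 0.60, 0.60, 0.60, 0.80),
}

by_fact = max(systems, key=lambda name: systems[name].fact_retrieval)
by_mean = max(systems, key=lambda name: systems[name].mean())
print(by_fact, by_mean)
```

Ranking by the first axis alone and ranking across all six pick different winners here. Which aggregation to use, and how to weight the axes, is an open design choice.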

The Bottom Line

When adding cognitive depth to a memory system causes a benchmark regression because the system got too honest to guess, that’s not an engineering failure. It’s a measurement failure.

The RoI lens sees our score and says we’re losing. The IoI lens sees the same data and says we’re investing in the properties that make a system worth trusting.

The next time someone shows you a 95% on a retrieval benchmark, ask them one question: what did they have to strip out to get there?

Data

LongMemEval (ICLR 2025), 444 questions evaluated, run 274901e2. Full failure taxonomy and pipeline traces available on request.