Why Long-Term Memory Benchmarks Penalize Intelligence
We built a cognitive memory system. We ran it against the field’s standard benchmark. We scored 46.8%.
Mastra OM scored 94.87% on the same test. And we’re publishing our number, not theirs, because what we found inside that gap matters more than the gap itself: the smarter our system got, the worse it scored. Not because it retrieved less. Because it knew too much to lie about what it found.
This paper is about what that means for everyone building long-term memory in AI.
The field measures memory systems on one axis, Retrieval Output Index (RoI): did the system return the right fact? It’s a transactional metric. Query in, row out. Score goes up or down.
We’re proposing a second axis, Integrity of Intelligence (IoI): does the system handle uncertainty, conflict, and ambiguity with honesty? Does investing in deeper cognitive capability produce trustworthy behavior, even when the legacy metric can’t see it?
RoI framing asks: “We added epistemic context and the score dropped 3.9 points — how do we get the points back?”
IoI framing says: “By adding epistemic context, we increased the integrity and trustworthiness of answers, even though the legacy metric punished us for it.”
Same data. Opposite conclusions. The tension between these two readings is the thesis of this paper.
Figure (RoI vs. IoI chart): Run #3 added epistemic context and dropped 3.9 points on the benchmark, not because it got worse, but because it refused to lie. The aspirational target is the upper-right quadrant: high retrieval AND high integrity.
We evaluated our architecture against LongMemEval (ICLR 2025) — 500 questions across six categories, testing long-term memory recall. We built a custom eval pipeline that ingests all 940 synthetic user sessions through our full production stack, evaluates each question through our conductor pipeline, and judges answers via a separate LLM pass.
We ran four progressively deeper pipeline configurations. Here’s what happened.
| Run | What Changed | Score | Signal |
|---|---|---|---|
| #1 | Standard RAG (cosine + recency) | 37.6% | Baseline — bare embedding search |
| #2 | + Intelligence layer, top-k doubled | 69.1%* | +31.5 pts on same category |
| #3 | + Gravity scores, timestamps, emotion | 65.2%* | –3.9 pts — the interesting one |
| #4 | Full pipeline, all 500 questions | 46.8% | The canonical number |
*Runs #2 and #3 were evaluated on the single-session-user category (70 questions) only. Run #4 covers all six categories.
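For readers unfamiliar with the Run #1 baseline, the sketch below shows one common way to combine cosine similarity with a recency prior. The exponential decay, the 30-day half-life, the multiplicative blend, and every name in it are assumptions made for illustration, not the configuration used in these runs.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    embedding: list[float]
    age_days: float  # time since the memory was stored


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recency_weight(age_days: float, half_life_days: float = 30.0) -> float:
    # Hypothetical decay: the weight halves every `half_life_days`.
    return 0.5 ** (age_days / half_life_days)


def score(query: list[float], memory: Memory) -> float:
    # Assumed blend: similarity scaled by recency.
    return cosine(query, memory.embedding) * recency_weight(memory.age_days)


def top_k(query: list[float], memories: list[Memory], k: int = 5) -> list[Memory]:
    return sorted(memories, key=lambda m: score(query, m), reverse=True)[:k]
```

A ranker like this has no notion of which stored fact is authoritative. It sees only similarity and age, which is why conflicting memories pass through it unnoticed.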
Run #1 to #2 is straightforward: more intelligence, better results. Run #3 is where the story breaks. We enriched the evidence context with everything our system actually knows — reinforcement weights, precise temporal distances, emotional signatures, entity mappings. The system had strictly more information than Run #2. And the score dropped.
Context: LongMemEval tests 940 synthetic users within a single flat dataset. Our system is designed for deep individual continuity — one user, one lifetime of memory. Ingesting 940 strangers into that architecture creates massive embedding pollution: 940 different answers to “What’s my cat’s name?” all living in one memory space.
In Run #2, the model saw conflicting memories and picked one — often correctly, by luck. In Run #3, with gravity scores and timestamps visible, it could see that two high-confidence memories contradicted each other. And instead of guessing, it told the truth.
In every case, the correct answer was present in the system’s evidence. It surfaced the conflict instead of fabricating certainty. The benchmark gave it a zero each time.
On the RoI axis, this is a regression. On the IoI axis, it’s the system doing exactly what a trustworthy companion should do.
Run #4 — the full canonical run — produced 236 failures across 444 evaluated questions. We categorized every one.
| Failure Type | Count | % of Total | What It Means |
|---|---|---|---|
| Epistemic honesty penalty | ~73 | 31% | Surfaced real conflicts; penalized for honesty |
| Complete recall failure | ~61 | 26% | Right memory exists, embedding didn’t find it |
| Architecture mismatch (single-session-assistant, SSA) | ~44 | 19% | System stores user speech, not its own outputs |
| Cross-user disambiguation | ~42 | 18% | 940 users in one memory — picked wrong user’s fact |
| Temporal arithmetic | ~14 | 6% | Had the dates, couldn’t compute the delta |
| Genuine wrong recall | 3 | 1% | Actually broken. No excuse. |
Read that last row again. Three failures out of 236 represent cases where a simpler RAG system would have definitively outperformed ours. Three.
Categories A and C (49% of failures) represent cases where the system knew something the benchmark couldn’t evaluate. Categories B, D, and E (51%) are fixable engineering problems — real gaps, but solvable without sacrificing intelligence.
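The headline score and the table percentages can be re-derived from the published counts. The short check below uses only numbers stated in this paper (444 evaluated questions, 236 failures, and the approximate per-type counts). The variable names are illustrative. Because the counts are approximate, they sum to 237 rather than 236, so the shares are rounded estimates.

```python
EVALUATED = 444
FAILURES = 236

# Approximate counts copied from the failure table above.
counts = {
    "epistemic_honesty_penalty": 73,
    "complete_recall_failure": 61,
    "architecture_mismatch": 44,
    "cross_user_disambiguation": 42,
    "temporal_arithmetic": 14,
    "genuine_wrong_recall": 3,
}

# Accuracy is the share of evaluated questions that did not fail.
accuracy = 100 * (EVALUATED - FAILURES) / EVALUATED

# Each failure type as a rounded percentage of all failures.
shares = {name: round(100 * n / FAILURES) for name, n in counts.items()}

print(f"accuracy: {accuracy:.1f}%")
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```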
To close the gap to 90%+, here’s precisely what we’d strip out:
Delete conflict detection. When two facts contradict, return the most recent by timestamp. Fixes ~31% of failures. Users get confident wrong answers instead of honest uncertainty.
The enriched context made the system more aware that memories contradicted. Strip it back to bare text. Score goes up because awareness goes down.
LongMemEval provides answer_session_ids — which sessions contain the answer. Filter to those. Score jumps ~15 points. This is meaningless in production: real users don’t tell you which memory to look in.
Build a separate index for the system’s own responses. Fixes the architecture-mismatch category (~19% of failures). Useful, but covers maybe 2% of real-world queries.
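The oracle shortcut described above is easy to make concrete. In this minimal sketch, `answer_session_ids` is the field named in the text. The other keys (`haystack_session_ids`, `haystack_sessions`) and the toy record are assumptions about how a benchmark item might be laid out, so check them against the dataset before relying on them.

```python
def oracle_filter(question: dict) -> list:
    """Keep only the sessions the benchmark labels as containing the answer."""
    gold = set(question["answer_session_ids"])
    return [
        session
        for sid, session in zip(
            question["haystack_session_ids"], question["haystack_sessions"]
        )
        if sid in gold
    ]


# Hypothetical record, for illustration only.
example = {
    "answer_session_ids": ["s2"],
    "haystack_session_ids": ["s1", "s2", "s3"],
    "haystack_sessions": [["chat about work"], ["my cat is Luna"], ["chat about food"]],
}
print(oracle_filter(example))
```

The filter needs the gold labels as input, which is exactly the information a production system never has.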
The resulting system would be fast, flat, fact-focused, and confidently brittle. It answers “45 minutes” every time you ask about your commute, even after you’ve mentioned it’s changed. It doesn’t surface conflicts. It never says “I’m not sure.” It builds an index of things you’ve said, not a picture of who you are.
This is a very good RAG system. It’s already been built. It scores 94.87%.
Every point above 80% on LongMemEval requires deliberately removing features that real users need. That’s not a hypothesis; it’s directly measurable in our own runs.
This paper argues the benchmark measures the wrong things. It would be dishonest not to acknowledge where it measured the right things and we failed.
26% of failures were complete misses — the right memory exists and we returned nothing. The embedding gap between “I just got my Data Science certification” and “what certifications do I have?” is a real production problem.
We have no infrastructure for “how many days between X and Y?” A user who asks that deserves a grounded answer. We can’t compute one.
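One plausible way to close this gap is to take the subtraction away from the language model and hand it to deterministic code. The sketch below assumes the two dates have already been extracted from memory as ISO 8601 strings. The function name is hypothetical and this is not part of the pipeline described here.

```python
from datetime import date


def days_between(start_iso: str, end_iso: str) -> int:
    """Return the signed number of days from start to end (ISO 8601 dates)."""
    start = date.fromisoformat(start_iso)
    end = date.fromisoformat(end_iso)
    return (end - start).days


print(days_between("2023-03-03", "2023-04-01"))  # 29
```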
When one conflicting memory has a clearly more recent timestamp, the system should say “most recently 12, on March 3rd” rather than jumping straight to “I see three different numbers.” That’s not epistemic honesty. That’s incomplete temporal resolution.
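The behavior described above can be sketched as a small resolution rule: prefer the newest value when it is clearly newer than the latest disagreeing memory, and surface the conflict only when the timestamps are too close to rank. The `Fact` type, the seven-day threshold, and the response wording are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Fact:
    value: str
    observed: date


def resolve(facts: list[Fact], min_gap: timedelta = timedelta(days=7)) -> str:
    """Prefer the newest fact when it is clearly newer; otherwise surface the conflict."""
    if not facts:
        return "I don't have a memory of that."
    ordered = sorted(facts, key=lambda f: f.observed, reverse=True)
    newest = ordered[0]
    distinct = {f.value for f in ordered}
    if len(distinct) == 1:
        return newest.value
    # Most recent memory that disagrees with the newest one.
    rival = next(f for f in ordered if f.value != newest.value)
    if newest.observed - rival.observed >= min_gap:
        return f"Most recently {newest.value}, on {newest.observed:%B} {newest.observed.day}."
    values = ", ".join(sorted(distinct))
    return f"I see conflicting values ({values}) from around the same time."
```

The threshold is the design decision that matters. Set it too low and the system goes back to answering with unearned confidence. Set it too high and it reports conflicts that a timestamp could have settled.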
If a user asks “what did you recommend?” we have to reconstruct from user-side memories. The benchmark is right: a system that can’t recall its own advice has a reliability problem.
We’re not claiming 46.8% is good. We’re claiming that the distance between 46.8% and 94.87% is not the distance between a bad system and a good one. It’s the distance between two fundamentally different definitions of what “good” means.
LongMemEval contains zero questions that test emotional trajectory, behavioral pattern detection, cognitive conflict surfacing, relational continuity, uncertainty calibration, or autonomous insight generation. These aren’t exotic capabilities — they’re the difference between a filing cabinet and a thinking entity.
A system that detects “You stated wanting work-life balance but consistently choose work” is doing something no retrieval benchmark measures.
A system that notices “You’ve mentioned this company four times in emotional contexts this month” is generating proactive intelligence.
A system that understands a relationship has deepened or strained over six months is modeling human continuity.
None of this registers on the current leaderboard. The benchmark treats all memories as equivalent data points — a fact about a fishing trip is worth the same as a fact about a job loss. Reality doesn’t work that way.
The field needs a multi-axis evaluation framework. We’re calling it Trust Accuracy — not a replacement for retrieval benchmarks, but an expansion that captures what they miss.
Six axes, each independently measurable:
The existing test — did you return the right discrete fact?
Does expressed confidence match actual evidence quality?
Does the model of a person's life make temporal sense across sessions?
Can it identify patterns the user didn't explicitly state?
Does it model evolving relationships, not just static facts about people?
Does it say "I don't know" when it doesn't, answer when it does, and calibrate the gray zone between them?
Under the current benchmark, the system with the highest fact-retrieval score wins — even if it achieves that score by never expressing uncertainty. Under Trust Accuracy, the system that builds the most complete, honest model of a person wins. These produce very different leaderboards.
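As one illustration of how a calibration axis could be made independently measurable, the sketch below computes expected calibration error (ECE), a standard metric that compares stated confidence with observed accuracy within confidence bins. Using ECE here is an assumption, not a metric this paper defines, and it presumes the system emits a numeric confidence in [0, 1] with each answer.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    total = len(confidences)
    if total == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

A system that always answers with full confidence and is right about half the time would score about 0.5 on this measure. One whose stated confidence tracks its hit rate would score close to zero, whatever its raw retrieval accuracy.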
When adding cognitive depth to a memory system causes a benchmark regression because the system got too honest to guess, that’s not an engineering failure. It’s a measurement failure.
The RoI lens sees our score and says we’re losing. The IoI lens sees the same data and says we’re investing in the properties that make a system worth trusting.
The next time someone shows you a 95% on a retrieval benchmark, ask them one question: what did they have to strip out to get there?
LongMemEval (ICLR 2025), 444 questions evaluated, run 274901e2. Full failure taxonomy and pipeline traces available on request.