DEV Community

Dan Shalev for FalkorDB


NoLiMA: GPT-4o achieves 99.3% accuracy in short contexts (<1K tokens), but performance degrades to 69.7% at 32K tokens

Recent advances in large language models (LLMs) have pushed context window limits to 128K–1M tokens, yet benchmarks such as NoLiMA (Long-Context Evaluation Beyond Literal Matching) reveal critical gaps in associative reasoning over extended sequences.

NoLiMA demonstrates that while models like GPT-4o achieve 99.3% accuracy in short contexts (<1K tokens), performance degrades to 69.7% at 32K tokens. The benchmark’s two-hop associative tasks (e.g., linking “Saxony” to “Semper Opera House” to “Yuki”) reveal that models fail to preserve transitive relationships across 16K+ token windows.

The NoLiMA benchmark highlights a fundamental truth: scaling context windows alone cannot overcome attention mechanisms' inability to model latent relationships. Property graphs provide the missing structural layer, offering explicit relationship encoding and metadata-aware retrieval.
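
To make "explicit relationship encoding" concrete, here is a minimal sketch of a two-hop lookup over a tiny in-memory property graph, using the Saxony example from above. It is an illustration only, not FalkorDB's API or the NoLiMA evaluation code. The `PropertyGraph` class, the relation names `LOCATED_IN` and `LIVES_NEXT_TO`, and the `source_doc` property are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    """Toy property graph: nodes and edges both carry key/value properties."""
    nodes: dict = field(default_factory=dict)  # name -> properties
    edges: list = field(default_factory=list)  # (source, relation, target, properties)

    def add_node(self, name, **props):
        self.nodes[name] = props

    def add_edge(self, source, relation, target, **props):
        self.edges.append((source, relation, target, props))

    def sources(self, relation, target):
        """Return every node that points at `target` through `relation`."""
        return [s for s, r, t, _ in self.edges if r == relation and t == target]


def people_linked_to_region(graph, region):
    """Two-hop traversal: region <- landmark <- person."""
    people = []
    for landmark in graph.sources("LOCATED_IN", region):          # hop 1
        people.extend(graph.sources("LIVES_NEXT_TO", landmark))   # hop 2
    return people


graph = PropertyGraph()
graph.add_node("Saxony", kind="region")
graph.add_node("Semper Opera House", kind="landmark")
graph.add_node("Yuki", kind="person")

# Each fact is stored as an explicit, typed edge (relation names are assumptions).
graph.add_edge("Semper Opera House", "LOCATED_IN", "Saxony")
graph.add_edge("Yuki", "LIVES_NEXT_TO", "Semper Opera House", source_doc="needle")

result = people_linked_to_region(graph, "Saxony")
print(result)  # ['Yuki']
```

Because each hop is a stored edge, the cost of the lookup depends on the number of edges rather than on how many tokens of surrounding text separate the two facts. In a real system, a graph database query would stand in for the list scans used here.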

For AI architects, integrating graph-native storage with LLMs is not optional; it is essential for building systems capable of robust multi-hop reasoning at scale.

Top comments (1)

Dan Shalev

Models like GPT-4o may still have decent base scores, but their effective context length remains limited when dealing with associative reasoning without literal cues.
