
Building LLM Tools That Don't Drown in Context: Lazy Traversal for Unknown Structures

This article is for anyone who's hit the context window problem with structured data.

The Problem: Dumping entire DOMs into LLM context costs 50k+ tokens and causes models to hallucinate elements that don't exist.

The Solution: Lazy traversal with bounded queries reduced token costs from 50k+ to <3k while improving selector quality.

While building an open source MCP server for AI-native test authoring, I discovered that lazy traversal solved the context window problem for DOM exploration. Instead of dumping entire DOMs into context—which consumed excessive tokens and caused LLMs to miss information buried in the middle—I built tools that reveal structure incrementally through bounded queries.

A note on scope: This pattern works for trees—structures where each node has exactly one parent. DOMs, file systems, ASTs, and org charts are trees. For DAGs (where nodes can have multiple parents), you'd need traversal rules to impose tree-like structure. For dense graphs, this pattern doesn't apply—you'd need sampling, summarization, or indexing instead.

This article documents what worked for DOM, why I think it worked, and the implementation principles I learned. I've only built this for DOM exploration, but if you're working with hierarchical structures—file systems, ASTs, database schemas, organizational hierarchies—this case study might help you think through your own approach.

Article Roadmap

Part 1: The Problem (2 min read)

  • Why dumping full structures fails
  • The attention problem in long contexts

Part 2: The Solution Pattern (5 min read)

  • Four-level lazy traversal workflow
  • Complete DOM example with code

Part 3: Why It Works (5 min read)

  • Connection to established CS patterns
  • Why bounded queries preserve reasoning quality
  • Structural characteristics that enable this pattern
  • Design principles for building primitives

Part 4: When to Use This (3 min read)

  • Decision framework
  • Alternative approaches

Part 1: The Problem

The Context Window Paradox

You're building an LLM tool that needs to understand structured data. Maybe it's exploring a file system, navigating a codebase's AST, or understanding a database schema. The obvious approach: serialize everything and hand it to the model.

Token cost is the obvious problem. Complex structures blow through your budget fast—a production web page's DOM serializes to 50k+ tokens. But that's not the real issue.

Recent research shows LLM performance degrades significantly when relevant information is positioned in the middle of long contexts. The "Lost in the Middle" paper demonstrated a U-shaped performance curve: highest performance at the beginning or end of input, with information in the middle getting overlooked even when explicitly present—and this holds even for models explicitly designed for long contexts.

When I was building Verdex, I saw this constantly with full DOM dumps. Strong models would hallucinate elements that didn't exist. They'd generate selectors referencing completely unrelated page sections. The model would anchor on early DOM sections and miss better containers that appeared later.

I needed a different approach: don't dump everything at once. Let the LLM traverse the structure lazily, requesting only what it needs for the current decision through bounded queries.


Part 2: The Solution Pattern

The Core Idea: LLM-Guided Lazy Traversal

Don't give the model everything at once. Give it what it needs for the current decision, then let it request more through bounded queries. The LLM decides which branches to expand based on task context.

For DOM exploration, this meant: instead of dumping the full HTML, start with a compact semantic overview, then let the LLM request structural details through targeted queries. "What contains this button?" returns an ancestor chain with IDs and classes. "Does this pattern repeat?" returns sibling elements at this depth. "What's unique about this instance?" returns text content and attributes within the container.
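
As a rough sketch, those three questions map onto three bounded query shapes. The interfaces below are illustrative only, not the actual Verdex API (the real tools, shown in Part 2, return formatted text rather than structured objects):

// Illustrative TypeScript shapes for the three bounded queries described above.
// Names and return types are assumptions for this sketch, not the real tool signatures.

interface AncestorInfo {
  tag: string;                     // e.g. "div"
  attrs: Record<string, string>;   // e.g. { "data-testid": "product-card" }
  level: number;                   // how many levels above the target element
}

interface SiblingInfo {
  tag: string;
  attrs: Record<string, string>;
  textPreview: string;             // just enough text to tell instances apart
}

// "What contains this button?" -> a bounded ancestor chain
declare function whatContains(ref: string): Promise<AncestorInfo[]>;

// "Does this pattern repeat?" -> siblings at a chosen ancestor level, capped
declare function doesItRepeat(ref: string, ancestorLevel: number): Promise<SiblingInfo[]>;

// "What's unique about this instance?" -> text and attributes inside the container
declare function whatIsUnique(ref: string, ancestorLevel: number): Promise<{ text: string; attrs: Record<string, string> }[]>;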

This worked surprisingly well for DOM exploration. Token costs dropped by 90%+ compared to full dumps. Selector quality improved. Models stopped hallucinating elements. The pattern was reliable across diverse page structures.

The Workflow: Container → Pattern → Anchor

The workflow has four levels. I discovered this through iteration with DOM exploration:

Level 1 is your semantic overview. This shows meaningful entities without implementation details. For DOM, it's the accessibility tree with roles and names.

Level 2 is container resolution. When the LLM identifies something interesting from the overview, it asks: what contains this element? This query returns the containment hierarchy and stable anchors—the attributes that make containers identifiable.

Level 3 is pattern recognition. Once you know the container, you ask: does this repeat? Are there siblings at this depth with similar structure? This reveals lists, grids, repeated components.

Level 4 is anchor extraction. Now you need to distinguish one instance from another. This is a deeper scan within the chosen container for unique identifiers.

The workflow works because each step builds on previous understanding. Token cost is bounded per step. The LLM decides what to explore next based on findings. Exploration terminates when sufficient context is gathered. And iteration is practical because each query is cheap.

A Concrete Example: DOM Exploration with Code

Here's how this actually worked in practice.

Scenario: Generate a Playwright selector for the "Add to Cart" button that belongs specifically to the iPhone 15 Pro card in a product grid. The challenge: the page has multiple "Add to Cart" buttons.

Full code example: [Link to Verdex examples on GitHub]

Before: Full DOM Dump Approach

// ❌ Full DOM dump - 127k tokens
const fullDOM = await page.content();

// Common results with full dumps:
// - Fragile positional selector: .product-card:nth-child(3) button
// - Hallucinated attributes that don't exist
// - Anchoring on wrong section of page
// Token cost: ~127,000
// Quality: Breaks when cards reorder, misses data-testid attributes

After: Lazy Traversal Approach

// ✅ Lazy traversal - 2.9k tokens

// Step 1: Semantic overview (800 tokens)
await browser_snapshot();

Returns:

navigation "Main menu"
heading "Featured Products" [level=1]
generic
  heading "MacBook Air" [level=3]
  button "Add to Cart" [ref=e1]
generic
  heading "iPhone 15 Pro" [level=3]
  button "Add to Cart" [ref=e3]

LLM's reasoning: "Multiple 'Add to Cart' buttons exist. Need to understand what differentiates them. Let me check what contains the target button [ref=e3]."

// Step 2: Container resolution (600 tokens)
await resolve_container({ ref: 'e3' });

Returns:

Ancestry analysis for element e3:

🎯 Target Element:
   Tag: button
   Text: "Add to Cart"

📋 Ancestors (3 levels up):

Level 1 (div):
   Children: 3
   Attributes: {"data-testid":"product-card"}
   Contains refs: e3, e4, e5

Level 2 (div):
   Children: 12
   Attributes: {"data-testid":"product-grid"}
   Contains refs: e1, e2, e3, e4, e6, e7...

Level 3 (section):
   Children: 2
   Attributes: {"class":"products"}
   Contains refs: none

LLM's reasoning: "The button lives inside nested divs with identifiable test IDs. But I don't know if this is one card or many. Are there other product cards? Need to check siblings at the grid level."

// Step 3: Pattern recognition (800 tokens)
await inspect_pattern({ ref: 'e3', ancestorLevel: 2 });

Returns:

Sibling analysis for element e3 at ancestor level 2:

👥 Found 12 siblings at ancestor level 2:

Sibling 1 (div):
   Attributes: {"data-testid":"product-card"}
   Contains refs: e1, e2
   Contains text: MacBook Air, $1,299, Add to Cart

Sibling 2 (div):
   Attributes: {"data-testid":"product-card"}
   Contains refs: e3, e4
   Contains text: iPhone 15 Pro, $999, Add to Cart

Sibling 3 (div):
   Attributes: {"data-testid":"product-card"}
   Contains refs: e6, e7
   Contains text: iPad Pro, $799, Add to Cart

... (9 more siblings)

LLM's reasoning: "Pattern confirmed: 12 product cards with identical structure. Each has a heading plus button. Now I need a unique identifier for the iPhone card."

// Step 4: Anchor extraction (700 tokens)
await extract_anchors({ ref: 'e3', ancestorLevel: 1 });

Returns:

Descendant analysis for element e3 within ancestor level 1:

🏗️ Analyzing within ancestor: div {"data-testid":"product-card"}

🔍 Found 3 direct children within ancestor:

Child 1 (depth 1):
   Attributes: {}
   h3 "iPhone 15 Pro" (2 children)
      span "iPhone 15 Pro"

Child 2 (depth 1):
   Attributes: {"class":"price"}
   span "$999"

Child 3 (depth 1):
   Ref: e3 (button) "Add to Cart"
   Direct Text: "Add to Cart"

LLM's reasoning: "Perfect. I have everything needed: a stable container (data-testid='product-card'), a unique identifier (heading text 'iPhone 15 Pro'), and the target action (button with role 'Add to Cart')."

Total tokens used: 2,900 across four queries (98% reduction from 127k)

LLM generates:

const selector = page
  .getByTestId("product-card")
  .filter({ hasText: "iPhone 15 Pro" })
  .getByRole("button", { name: "Add to Cart" });

Note: The LLM generates Playwright code based on the textual analysis. While the MCP tools return formatted text, the LLM interprets this structure to produce the appropriate Playwright selector syntax. The exploration happens through MCP tool calls in the coding agent, but the output is standard Playwright code.

Why this selector is better:

  • ✅ Container-scoped (uses data-testid for stability)
  • ✅ Uniquely identified (filters by product name)
  • ✅ Role-first for accessibility
  • ✅ Resilient to structural changes (survives card reordering)
  • ✅ No positional dependencies (no :nth-child)
  • ✅ No hallucinated attributes

Key Differences Summary

| Approach | Tokens | Quality issues | Selector type |
| --- | --- | --- | --- |
| Full DOM dump | ~127,000 | Hallucinations, missed attributes, anchoring bias | Positional: .product-card:nth-child(3) button |
| Lazy traversal | ~2,900 | None | Semantic: getByTestId("product-card").filter({hasText: "iPhone 15 Pro"}).getByRole("button") |

What made this work:

  • The semantic overview existed (accessibility tree)
  • Containers had stable IDs (test IDs on divs)
  • Patterns were identifiable (sibling inspection showed repeated structure)
  • Anchors were unique (heading text differentiated cards)
  • Each query was bounded

Try it yourself: https://github.com/verdexhq/verdex-mcp


Part 3: Why It Works

What This Pattern Actually Is

This is lazy tree traversal where the LLM decides which branches to expand.

The traversal pattern itself isn't new. Lazy evaluation defers computation until values are needed. Iterators yield items on demand rather than loading entire collections. The visitor pattern traverses structure and applies operations at each node. These are well-established patterns for handling large structures efficiently.

What's different here is the heuristic for branch selection. Traditional lazy traversal uses predetermined rules: depth-first, breadth-first, or application-specific logic. Here, the LLM acts as the heuristic. It looks at the task ("find a stable selector for this button"), examines the current node's context, and decides which branch to expand next. The traversal order emerges from task understanding rather than fixed rules.

This matters because the "right" traversal depends on what you're trying to accomplish. For selector authoring, you need ancestors (for containment) and siblings (for uniqueness). For accessibility auditing, you might need descendants (to check nested elements). For scraping, you might need entirely different branches. The LLM adapts the traversal to the task without requiring separate traversal implementations for each use case.
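
As a sketch of the control flow, the loop looks something like this. The `askModel` callback is a stand-in for a real LLM call and is an assumption of this example, not part of any actual API:

// Sketch of LLM-guided lazy traversal: fixed, bounded primitives, with branch
// selection delegated to the model. `askModel` stands in for a real LLM call.

type Query =
  | { kind: "ancestors"; ref: string }
  | { kind: "siblings"; ref: string; ancestorLevel: number }
  | { kind: "anchors"; ref: string; ancestorLevel: number }
  | { kind: "done"; answer: string };

async function exploreForTask(
  task: string,
  overview: string,
  runQuery: (q: Query) => Promise<string>,
  askModel: (prompt: string) => Promise<Query>
): Promise<string> {
  let context = `Task: ${task}\nOverview:\n${overview}`;
  for (let step = 0; step < 8; step++) {            // hard cap on exploration steps
    const next = await askModel(context);           // the LLM is the traversal heuristic
    if (next.kind === "done") return next.answer;   // the model decides when it has enough
    const result = await runQuery(next);            // each query returns a bounded chunk
    context += `\n\nQuery ${JSON.stringify(next)} returned:\n${result}`;
  }
  throw new Error("exploration budget exhausted");
}

The primitives stay identical across tasks; only the prompt given to `askModel` changes, which is how the same tools serve selector authoring, auditing, or scraping.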

Why Bounded Queries Preserve Reasoning Quality

The token savings are obvious—sending less data costs fewer tokens. The interesting question is why the LLM reasons better with 2.9k tokens than with 127k.

The Attention Trough Problem

The U-shaped attention curve isn't just a curiosity; it fundamentally changes how LLMs process long contexts. The landmark "Lost in the Middle" paper demonstrated this rigorously: when researchers tested retrieval tasks across different context positions, performance dropped significantly for information placed in the middle third of the context window. The model doesn't "forget" the middle content (it's still there in the context), but attention weights concentrate on the beginning and end. The effect holds even for models explicitly designed for long contexts.

Research on context engineering from Anthropic builds on this finding, demonstrating that progressive disclosure prevents cognitive overload—letting agents navigate and retrieve data autonomously keeps them focused on what matters. Work on progressive context enrichment goes further, showing that giving LLMs more context often makes them perform worse.

This creates a perverse dynamic with full structure dumps. The element you care about might appear at position 40,000 in a 127,000 token context. The model sees it, technically, but attention has already anchored on earlier content. When I watched Claude generate selectors from full DOM dumps, it would consistently reference elements from the first few thousand tokens even when better options appeared later.

Most Structural Reasoning Is Local

Here's the key insight: to generate a stable selector for a button, you don't need to know about the footer, the navigation menu, or the sidebar. You need to know:

  1. What contains this button (ancestors)
  2. Whether similar buttons exist (siblings)
  3. What makes this instance unique (local attributes and text)

All of this information exists within a small neighborhood of the target element. The other 95% of the DOM is noise for this specific task.

This is why bounded queries don't lose decision-relevant information. They're not sampling randomly—they're following the containment and sibling relationships that actually matter for the task. You're trading global context (which the LLM can't effectively use anyway due to attention limitations) for local context (which is what the reasoning actually requires).

Information Density Per Token

A full DOM dump has terrible signal-to-noise ratio for any specific task. Of those 127,000 tokens, maybe 3,000 are relevant to finding a stable selector for one button. The LLM has to locate the needle in the haystack while fighting attention dynamics that work against it.

Bounded queries invert this ratio. Each response contains only information relevant to the current decision. The 2,900 tokens across four queries are almost entirely signal. The LLM isn't reasoning with less information—it's reasoning with higher-density information.

Bounded Responses Stay in High-Attention Regions

Each query response is small enough (600-800 tokens) that there's no "middle" to overlook. The entire response falls within the high-attention regions at the beginning and end of the context. When the LLM processes resolve_container's output, every ancestor in that response gets appropriate attention weight.

This converts the attention problem from "find the relevant content in 127k tokens" to "process these 700 tokens of relevant content." The latter is a task LLMs handle well.

The Net Effect

Lazy traversal doesn't just reduce costs—it restructures the problem to play to LLM strengths. Instead of global search through a long context (hard), you get local reasoning over focused responses (easy). Instead of fighting attention dynamics (lossy), you work with them (each response is fully attended). Instead of low signal-to-noise (wasteful), you get high density (efficient).

The LLM reasons better with 2.9k tokens because those tokens are the right tokens, presented in a way that attention mechanisms can actually process.

Why This Worked for DOM

Lazy traversal worked for DOM exploration because of how DOMs are structured. Understanding these characteristics might help you evaluate whether similar patterns could work for your data structure.

Clear Containment

Every element has exactly one parent. When you ask "what contains this button?", there's an unambiguous answer: this div, which is inside that section, which is inside body, which is inside html. You can walk up the tree and always know where you are.

This seems fundamental. In structures where containment is ambiguous—like dense graphs where node A connects to nodes B, C, and D with equal weight—the question "what contains X?" doesn't have a clear answer. You'd need to define traversal rules that create deterministic paths.

Bounded Exploration

Traversing from any node to the root terminates predictably. A tree of depth 20 requires max 20 steps. Even deeply nested pages have predictable bounds.
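
Both properties, one parent per node and a predictable walk to the root, show up directly in browser code. A minimal sketch:

// Every DOM element has exactly one parent, so "what contains X?" is an
// unambiguous, terminating walk up the tree, bounded by tree height.
function ancestorChain(el: Element, maxLevels = 20): Element[] {
  const chain: Element[] = [];
  let current = el.parentElement;
  while (current && chain.length < maxLevels) {
    chain.push(current);
    current = current.parentElement;
  }
  return chain; // e.g. [div.product-card, div.product-grid, section.products, body, html]
}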

Without guaranteed termination, lazy traversal might return unbounded results, losing the token efficiency that makes the pattern valuable. Cyclic structures would need explicit cycle detection and depth limits.

Self-Describing Nodes

The identifying properties of DOM elements—data-testid, class, id, role—are attached directly to nodes. You can inspect an element and immediately see what makes it unique without traversing relationships.
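
In code, that self-description is a single property read on the node itself; a tiny sketch:

// Identity lives on the node: no relationship traversal is needed to see what makes it unique.
function nodeAnchors(el: Element): Record<string, string> {
  return Object.fromEntries(
    Array.from(el.attributes, a => [a.name, a.value] as [string, string])
  ); // e.g. { "data-testid": "product-card", "class": "card" }
}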

If critical metadata lives in edge properties or requires graph traversal to gather, each query becomes more expensive. You might need to fetch relationships just to understand what you're looking at.

Semantic Overview Exists

The accessibility tree provides a clean, high-level view showing roles and names without implementation details. This is essential for Level 1—where does exploration start?

Without a natural overview layer, you'd need to create one through sampling, aggregation, or summarization. If creating an overview requires complex analysis, lazy traversal might not provide enough value.

Independent Subtrees

Inspecting one component doesn't require knowing about unrelated components. I can explore a product card without understanding the navigation menu. Subtrees are relatively self-contained.

In densely connected structures where understanding any node requires understanding most other nodes, you can't isolate exploration. You end up needing global context anyway.

Meaningful Depth

"Ancestor at level 3" or "siblings at this depth" are coherent concepts in DOM. Depth is well-defined and stable. Different paths from leaf to root don't exist.

In structures where depth from one node to another isn't unique—multiple paths of different lengths—operations like "give me siblings at this level" become ambiguous.

Summary: If your structure shares these characteristics—file systems, organizational hierarchies, taxonomy trees—the pattern might transfer well. If it looks more like a dense graph where nodes are heavily interconnected, you'll probably need a different approach like sampling, summarization, or indexing.

I don't know where the exact boundary is. I've only built this for DOM. But these structural characteristics seemed essential to making it work.

Design Principles for Building Query Primitives

Building real tools taught me a few key principles for implementing lazy traversal primitives.

Return Facts, Not Interpretations

This is maybe the most important lesson. Separating facts from interpretation isn't just good practice; it preserves optionality across different tasks.

Early on, I tried to make primitives "helpful" by adding interpretation. For example, one tool returned { type: "product-card", confidence: 0.85, role: "list-item" }: it was guessing semantic meaning and attaching confidence scores.

This failed completely. What if it's not actually a product card? The tool imposed an interpretation that's hard for the LLM to override. Different tasks need different interpretations of the same structure—authoring a selector versus debugging why one broke versus refactoring patterns. And when the heuristics broke, there was no way for the LLM to work around them because the decision had already been made.

The key insight: structure traversal is task-independent, but interpretation is task-dependent. A div with data-testid="product-card" could be a stable container (selector authoring), a UI component to screenshot (visual testing), product data to extract (scraping), or an element missing ARIA labels (accessibility audit). Same structural facts, four different interpretations. If the tool chooses one, the other three tasks suffer.

Research on LLM modularity demonstrates this principle at the architectural level—the "separability" of capabilities enables effective modularization. When different parts of a system are separately responsible for different tasks, you can evolve them independently.

The solution: return raw structural facts. Just the tag, the attributes, the depth. Let the LLM interpret based on the user's query, the task context, and the surrounding structure.

Good primitive design looks like:

{ "tag": "div", "attrs": {"data-testid": "product-card"}, "depth": 2 }

That's it. No guessing, no confidence scores, no strategic recommendations.

Why this works better:

  • The LLM has full context from the user query to interpret meaning
  • Raw structural facts compose across different tasks
  • There are no hidden heuristics that break on edge cases
  • The LLM's reasoning is auditable because it's based on explicit facts

The key principle: tools provide capability (access to structure). LLMs provide agency (interpretation and decision-making).
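
As a type-level sketch of that boundary (both shapes are hypothetical, for illustration only):

// ✅ Facts: task-independent, cheap to verify, reusable across selector authoring,
// scraping, visual testing, and accessibility audits
interface NodeFacts {
  tag: string;
  attrs: Record<string, string>;
  depth: number;
}

// ❌ Interpretation: the tool has already decided what the element means,
// how sure it is, and what you should do about it
interface NodeGuess {
  type: "product-card" | "nav-item" | "unknown";
  confidence: number;
  recommendedSelector: string;
}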

Guaranteed Bounded Output

Every primitive must promise it will terminate in finite time and return bounded tokens.

For DOM:

  • resolve_container is bounded by tree height (typically max 20 levels)
  • inspect_pattern is bounded by explicit limits (return first 20 siblings only)
  • extract_anchors is bounded by depth constraints (traverse max 3 levels deep)

Bad primitives would be:

  • find_all_matching(selector) could return the entire DOM
  • explore_connected(node) could loop infinitely in a graph with cycles

Why bounds matter:

  • They prevent infinite loops in graph-like structures
  • They guarantee predictable token costs so you can budget accurately
  • They enable LLMs to compose multiple calls without explosion
  • They make lazy traversal actually lazy rather than "eventually everything"
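
As a sketch of what an explicit bound looks like inside a sibling query (the limit values here are illustrative, not the ones Verdex uses):

// Sketch of a bounded sibling query: the caps are part of the contract, so token
// cost stays predictable no matter how large the page is. Limits are illustrative.
const SIBLING_LIMIT = 20;
const TEXT_PREVIEW_LIMIT = 80;

function boundedSiblings(el: Element, ancestorLevel: number) {
  let container: Element | null = el;
  for (let i = 0; i < ancestorLevel && container; i++) {
    container = container.parentElement;           // bounded walk up
  }
  if (!container) return { siblings: [], truncated: false };

  const all = Array.from(container.children);
  const siblings = all.slice(0, SIBLING_LIMIT).map(child => ({
    tag: child.tagName.toLowerCase(),
    attrs: Object.fromEntries(
      Array.from(child.attributes, a => [a.name, a.value] as [string, string])
    ),
    textPreview: (child.textContent ?? "").trim().slice(0, TEXT_PREVIEW_LIMIT),
  }));
  return { siblings, truncated: all.length > SIBLING_LIMIT };
}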

Enable Composition

Primitives should compose naturally: the output of one call should contain the metadata needed for the next.

The DOM workflow shows this:

  • browser_snapshot() returns refs for elements
  • Those refs are inputs to resolve_container(ref) which returns level information
  • That level becomes input to inspect_pattern(ref, level) and extract_anchors(ref, level)
  • Each step builds on the previous

Requirements for composability:

  • Outputs include metadata for next calls (refs, levels, paths)
  • No hidden state between calls that the LLM needs to track
  • Order independence where possible—pattern inspection and anchor extraction can run in parallel

The DOM workflow shows linear composition: each output deterministically points to the next input. But some domains require branching composition where outputs need interpretation before choosing the next query. Linear composition works with tool descriptions alone; branching composition requires Skills to teach decision logic. If your primitives compose linearly 80%+ of the time, lazy traversal will feel natural to LLMs.

Research on in-context learning explains why this matters: in-context learning works as implicit Bayesian inference—models "locate" latent concepts they acquired during pretraining. Few-shot examples in tool descriptions show individual tool use, but the latent concept for tool composition may not exist in the model's prior at all. That's why knowledge layers matter.

When primitives compose well, the LLM can discover multi-step workflows through experimentation rather than requiring hardcoded sequences.
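
From the agent's side, the composed workflow from Part 2 looks roughly like this. The declarations stand in for the real MCP tools, and, per the note in Part 2, returns are treated as formatted text; only the flow of refs and levels matters here:

// Linear composition sketch: each output carries the refs/levels the next call needs.
// These declarations are stand-ins for the MCP tools shown in Part 2.
declare function browser_snapshot(): Promise<string>;
declare function resolve_container(args: { ref: string }): Promise<string>;
declare function inspect_pattern(args: { ref: string; ancestorLevel: number }): Promise<string>;
declare function extract_anchors(args: { ref: string; ancestorLevel: number }): Promise<string>;

async function exploreForSelector() {
  const overview = await browser_snapshot();          // Level 1: yields refs like "e3"
  const ref = "e3";                                    // chosen by the LLM from the overview
  const ancestry = await resolve_container({ ref });   // Level 2: yields candidate ancestor levels
  // Levels 3 and 4 don't depend on each other once a level is chosen:
  const [pattern, anchors] = await Promise.all([
    inspect_pattern({ ref, ancestorLevel: 2 }),        // does the container repeat?
    extract_anchors({ ref, ancestorLevel: 1 }),        // what's unique inside it?
  ]);
  return { overview, ancestry, pattern, anchors };
}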

Part 4: When to Use This

Decision Framework

Question 0: Discovery or Access?

Are you exploring structure you don't control, or accessing content you organized yourself?

If you control the structure (like Anthropic's Skills directories), use good organization + generic file operations. You don't need custom query primitives—you can just read files at known paths.

If you don't control the structure, continue to the remaining questions.

Question 1: Is it a tree?

This pattern works for trees—structures where each node has exactly one parent. "What contains this?" must have an unambiguous answer.

  • Trees (pattern applies): DOM, file systems, ASTs, org charts, taxonomy hierarchies
  • DAGs (pattern needs adaptation): dependency graphs, citation networks—you'd need traversal rules to pick one parent
  • Dense graphs (pattern doesn't apply): social networks, knowledge graphs—use sampling or indexing instead

Question 2: Token economics?

Does getting everything upfront cost more than 10x what you actually need? If full dumps fit comfortably in context with good results, skip the complexity.

Question 3: Attention problems?

Is there evidence that the LLM misses information buried in context with your current approach? If it handles full context fine, you don't need this.

Question 4: Clear entry point?

Do you know what specific thing you're investigating? "This button" works. "Any anomalies in this dataset" doesn't—you need a starting node for lazy traversal.

Question 5: Repeated exploration?

Lazy traversal has upfront cost. Worth it for 100+ investigations per month. Not worth it for one-off analysis.

When Not to Use This

Structure is too simple. Everything fits under 5k tokens.

Tasks require global analysis. "Count all instances" needs complete context anyway.

Round-trip latency dominates. High-latency environments favor full dumps despite token cost.

Structure changes rapidly. Real-time updates invalidate refs between queries.

Alternative Approaches

Sampling for statistical analysis—random samples of structure.

Summarization for high-level understanding—LLM-generated structure summaries.

Indexing for repeated access patterns—pre-computed views for common queries.

Hybrid for complex tasks—explore to find relevant subset, then dump just that subset.


Closing Thoughts

Lazy traversal with bounded queries solved a real problem for DOM exploration: long contexts degrade LLM performance, but minimal context leaves models without necessary understanding. Revealing structure incrementally through bounded queries worked better than any alternative I tried.

The pattern is essentially lazy tree traversal where the LLM decides which branches to expand. The traversal mechanics aren't new—but using task context to guide branch selection lets the same primitives serve different use cases without modification.

Research on modular agent architectures validates this approach: separation of concerns enables clear input/output boundaries, improves controllability, and limits the blast radius of failures. The same principle applies here—separating traversal capability from task-specific interpretation makes both layers more robust.

The pattern worked because DOM is a tree with clear containment, bounded exploration, and self-describing nodes. If your structure shares these properties—file systems, ASTs, org charts—the pattern should transfer well.

The design principles matter most:

  • Return facts, not interpretations — let the LLM decide what things mean
  • Guarantee bounded output — every query terminates predictably
  • Enable composition — outputs contain metadata for next calls
  • Separate capability from strategy — tools provide access, skills provide knowledge

If you try this for file systems, ASTs, or other tree-structured data, I'd love to hear what you learn.
