<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Haystack</title>
    <description>The latest articles on DEV Community by Haystack (@haystack).</description>
    <link>https://dev.to/haystack</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F12683%2F05bdc0a8-0335-4904-836c-2a695b837fcb.png</url>
      <title>DEV Community: Haystack</title>
      <link>https://dev.to/haystack</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/haystack"/>
    <language>en</language>
    <item>
      <title>Context Engineering for Agentic Systems: What Goes Into Your Agent's Mind</title>
      <dc:creator>Kacper Łukawski</dc:creator>
      <pubDate>Mon, 20 Apr 2026 10:56:42 +0000</pubDate>
      <link>https://dev.to/haystack/context-engineering-for-agentic-systems-what-goes-into-your-agents-mind-3kk4</link>
      <guid>https://dev.to/haystack/context-engineering-for-agentic-systems-what-goes-into-your-agents-mind-3kk4</guid>
      <description>&lt;p&gt;Every new generation of Large Language Models arrives with a bigger context window - and the temptation to use it fully. If the model can read a million tokens, why not feed it everything? In practice, more context doesn't reliably mean better answers: it often means higher costs, slower responses, and a model that loses track of what actually matters. &lt;strong&gt;Context engineering&lt;/strong&gt; is the discipline of deciding not just &lt;em&gt;what&lt;/em&gt; to put in the context window, but &lt;em&gt;how much&lt;/em&gt;, &lt;em&gt;in what form&lt;/em&gt;, and &lt;em&gt;when to leave things out&lt;/em&gt; - and it's quickly becoming one of the most important skills in building reliable agentic systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why context is so important for agentic systems
&lt;/h2&gt;

&lt;p&gt;An LLM has exactly two sources of information when generating a response:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Internal state ("knowledge")&lt;/strong&gt; - what was baked in during training. It is static, potentially stale, and opaque to the developer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context ("prompt")&lt;/strong&gt; - everything provided at inference time. That's the only thing we can actively control.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Training knowledge is fixed. We can't update it without retraining, and we can't know exactly what the model does or doesn't know - though most providers publish a &lt;strong&gt;knowledge cutoff date&lt;/strong&gt; in their model cards or documentation, which tells you the point beyond which the model has no awareness of world events. Context is the lever we actually have. Everything a model knows about the current task, the current user, the tools available to it, and the world right now has to come through the context window.&lt;/p&gt;
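
&lt;p&gt;To make this concrete, here is a minimal sketch using Haystack's &lt;code&gt;ChatMessage&lt;/code&gt;: the weights cannot contain today's date, so if the task needs it, it has to travel through the context window like everything else (the messages themselves are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import date

from haystack.dataclasses import ChatMessage

# Anything not baked into the weights - like today's date - must be
# injected into the context explicitly.
messages = [
    ChatMessage.from_system(f"Today's date is {date.today().isoformat()}."),
    ChatMessage.from_user("How many days are left in the current quarter?"),
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;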

&lt;p&gt;Today's leading models offer context windows that would have seemed impossibly large just a few years ago - millions of tokens, enough to fit entire codebases, legal contracts, or a stack of research papers in a single prompt. Yet in practice, agentic systems burn through these limits surprisingly fast. A system prompt, a set of tool definitions, all tool calls and results, a few retrieved documents, and a handful of conversation turns can easily consume tens of thousands of tokens before the agent has done anything meaningful. And even when the hard limit isn't reached, performance often degrades long before it is - the model starts losing track of earlier instructions, repeating itself, or missing relevant details buried under layers of accumulated context.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2tex4bpk7kql007vmkb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2tex4bpk7kql007vmkb.png" alt="Context window growth over time" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At step 1, the context holds little more than the user's task. By step N, it has grown to include every tool call, every result, every model response, and any retrieved documents - all concatenated and re-sent from scratch on every iteration.&lt;/p&gt;

&lt;p&gt;The difference from one-shot prompting is stark. A single prompt is small, hand-crafted, and fully under control. An agentic system operates in a loop - reasoning, calling tools, receiving results, and repeating, potentially dozens of times. Because LLMs are stateless, every iteration re-sends the entire accumulated history from scratch. The context isn't a fixed input, but more of a growing log, and context engineering is about managing that growth.&lt;/p&gt;
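
&lt;p&gt;A minimal sketch of that loop - illustrative stand-ins, not Haystack's actual implementation - makes the growth pattern obvious:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass, field

@dataclass
class Reply:
    text: str
    tool_calls: list = field(default_factory=list)

def call_llm(messages: list[str]) -&amp;gt; Reply:
    # Stand-in for a real chat generator. A real call would re-send
    # the entire accumulated `messages` list to the stateless model.
    return Reply(text=f"(answer based on {len(messages)} context messages)")

messages = ["system: You are a helpful assistant.", "user: Solve the task."]
for _ in range(10):
    reply = call_llm(messages)  # the full history goes in on every iteration
    messages.append(f"assistant: {reply.text}")
    if not reply.tool_calls:    # done once the model stops calling tools
        break
    for call in reply.tool_calls:
        # each result stays in the log for the rest of the run
        messages.append(f"tool result for {call}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;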

&lt;h3&gt;
  
  
  When less is more
&lt;/h3&gt;

&lt;p&gt;The Transformer architecture behind LLMs works by letting every token attend to every other token in the context. This is what makes these models so powerful at integrating information - but it also means the model's capacity is spread across all tokens simultaneously. Think of it as an &lt;strong&gt;attention budget&lt;/strong&gt;: every new token you introduce depletes it by some amount, regardless of whether that token is useful or not.&lt;/p&gt;

&lt;p&gt;The practical consequence is that irrelevant or redundant content doesn't just waste space - it actively competes with the information that actually matters. A critical instruction buried under pages of tool outputs may receive less attention than if it had been sent alone. &lt;a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents" rel="noopener noreferrer"&gt;Research from Anthropic&lt;/a&gt; confirms this: models remain capable at longer contexts but show reduced precision for information retrieval and long-range reasoning compared to shorter ones. A million-token context window is not a free pass to include everything - it's a budget, and every token you add is a trade-off.&lt;/p&gt;

&lt;h3&gt;
  
  
  The cost dimension
&lt;/h3&gt;

&lt;p&gt;Most hosted LLMs charge per input token, which means every byte of context has a price tag. A single call with a 50,000-token context costs roughly 50× more than one with 1,000 tokens - and in an agentic loop that runs dozens of iterations, that multiplier compounds with every step. Context management is therefore not just a quality concern but a cost concern: a bloated context window can turn a cheap pipeline into an expensive one without producing any better answers.&lt;/p&gt;
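
&lt;p&gt;A quick back-of-the-envelope calculation shows how the multiplier compounds. The price and growth rate below are illustrative assumptions, not any provider's actual rates:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Assumptions: $3 per million input tokens, context grows by ~2,000 tokens per step.
price_per_token = 3 / 1_000_000
context_tokens = 1_000  # size of the context at step 1
total_cost = 0.0
for step in range(20):
    total_cost += context_tokens * price_per_token  # the whole context is billed again
    context_tokens += 2_000                         # history keeps accumulating
print(f"after 20 steps: {context_tokens:,} tokens in context, "
      f"${total_cost:.2f} spent on input tokens alone")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;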

&lt;h2&gt;
  
  
  What fills the context window in an agentic system
&lt;/h2&gt;

&lt;p&gt;We've already touched on some of the components that fill an agent's context window - system prompts, tool definitions, retrieved documents. Let's map out the full picture, because the list is longer than many developers expect.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System prompt&lt;/strong&gt; - standing instructions, persona, constraints, output format. Usually fixed but can be large.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conversation history&lt;/strong&gt; - the full back-and-forth between user and agent across the current session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt; - retrieved facts from past sessions or external knowledge stores. See also: &lt;a href="https://haystack.deepset.ai/cookbook/memory_store_mem0" rel="noopener noreferrer"&gt;Using Mem0 Memory Store with Haystack Agents&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval output&lt;/strong&gt; - documents or chunks fetched proactively by a RAG pipeline, before the model acts. This data arrives in context as part of the input to the model, not as a consequence of something the model decided to do.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool definitions&lt;/strong&gt; - every tool the model &lt;em&gt;could&lt;/em&gt; call must be described in the context (name, description, parameter schema). With MCP toolsets, this can easily balloon into hundreds of tool descriptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool call results&lt;/strong&gt; - the output of tools the model itself chose to invoke. Unlike retrieval output, these arrive mid-session as a consequence of the model's actions. They can be surprisingly large: a file-read operation returning a 500-line source file, a web search returning multiple scraped pages, or a database query returning hundreds of rows - and each result stays in context for the remainder of the session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Few-shot examples&lt;/strong&gt; - demonstration input/output pairs used to guide model behaviour.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The iceberg effect.&lt;/strong&gt; A user sees a single answer. Behind the scenes, the model may have received 50,000 tokens or more on that one turn - a system prompt (perhaps 10k tokens), tool definitions (5k), retrieved documents (20k), and accumulated conversation history (15k). The answer is the tip, while the context is everything below the surface.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  What the context actually looks like
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9f8r15rjbci9dicuhi4c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9f8r15rjbci9dicuhi4c.png" alt="Claude Code's /context command breaks down where tokens are being spent - system prompt, tools, conversation history, and files." width="491" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The screenshot above shows Claude Code's &lt;code&gt;/context&lt;/code&gt; command, which breaks down exactly where tokens are being spent: system prompt, tool definitions, conversation history, open files, and so on. This makes it possible to identify which component is responsible for a bloated context and whether that cost is justified - without that visibility, optimisation is guesswork.&lt;/p&gt;
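
&lt;p&gt;You can do a rough version of the same accounting yourself. The sketch below uses &lt;code&gt;tiktoken&lt;/code&gt; as a proxy tokeniser - actual counts vary by model and provider, and the component texts are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import tiktoken  # rough proxy; every provider tokenises slightly differently

enc = tiktoken.get_encoding("cl100k_base")

# Placeholder strings standing in for the real context components.
components = {
    "system prompt": "You are a helpful assistant. Follow these rules: ...",
    "tool definitions": '{"name": "get_weather", "description": "Get weather..."}',
    "conversation history": "user: Hi\nassistant: Hello! How can I help you today?",
}
for name, text in components.items():
    print(f"{name}: {len(enc.encode(text))} tokens")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;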




&lt;h2&gt;
  
  
  Building a Haystack agent
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.components.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack_integrations.components.generators.anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnthropicChatGenerator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatMessage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Get the current weather for a city.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;It&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s sunny and 22°C in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chat_generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;AnthropicChatGenerator&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the weather in Paris?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you create an agent in Haystack, much of the context is assembled automatically. Tool descriptions are serialised and injected into the prompt under the hood - you define a tool once, and the framework ensures the model receives everything it needs to call it: the name, description, and parameter schema. The same applies to conversation history, which is maintained across turns without any manual concatenation. The context you see in your code is just the surface: the model receives considerably more on every call.&lt;/p&gt;
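
&lt;p&gt;You can see what gets injected by inspecting the decorated function - &lt;code&gt;@tool&lt;/code&gt; turned &lt;code&gt;get_weather&lt;/code&gt; from the example above into a &lt;code&gt;Tool&lt;/code&gt; object, and its attributes (names below per Haystack's &lt;code&gt;Tool&lt;/code&gt; dataclass) are exactly what the model receives:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Continuing from the agent example above.
print(get_weather.name)         # "get_weather"
print(get_weather.description)  # "Get the current weather for a city."
print(get_weather.parameters)   # JSON schema derived from the type hints
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;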




&lt;h2&gt;
  
  
  Strategies for managing context growth
&lt;/h2&gt;

&lt;p&gt;Context explosion is not inevitable. Once you understand what's filling the window, you can start making choices about what actually needs to be there. There are several proven techniques for keeping context short without sacrificing quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Delegation to subagents
&lt;/h3&gt;

&lt;p&gt;One way to keep the context small is to never let it grow large in the first place. Instead of one agent accumulating the full history of a complex task, you can split the work across specialised subagents - each one receiving only the slice of context relevant to its job. The orchestrator maintains a thin, high-level context, while the worker agents get focused, task-specific contexts. The total token count across the system may be similar, but no single model call is burdened with everything at once. For a practical example of this pattern in Haystack, see &lt;a href="https://haystack.deepset.ai/blog/swarm-of-agents/" rel="noopener noreferrer"&gt;Building a Swarm of Agents&lt;/a&gt; or the &lt;a href="https://haystack.deepset.ai/tutorials/45_creating_a_multi_agent_system" rel="noopener noreferrer"&gt;Creating a Multi-Agent System with Haystack&lt;/a&gt; tutorial.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.components.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack_integrations.components.generators.anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnthropicChatGenerator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatMessage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_web&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Search the web for up-to-date information on a topic.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search results for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Worker agent: only receives context relevant to its task
&lt;/span&gt;&lt;span class="n"&gt;researcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chat_generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;AnthropicChatGenerator&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a research assistant. Answer questions concisely.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;search_web&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ComponentTool&lt;/span&gt;

&lt;span class="n"&gt;delegate_research&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ComponentTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;component&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;researcher&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delegate_research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Delegate a research question to a specialised agent.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;outputs_to_string&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Orchestrator: only sees compact summaries from worker agents
&lt;/span&gt;&lt;span class="n"&gt;orchestrator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chat_generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;AnthropicChatGenerator&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Break down tasks and delegate them to specialised agents.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;delegate_research&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orchestrator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Compare quantum and classical computing.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Improving retrieval quality
&lt;/h3&gt;

&lt;p&gt;In RAG pipelines, retrieval quality directly determines how many tokens land in the context. Poor retrieval returns irrelevant chunks that add noise without adding value - each one consuming part of the attention budget. Better precision means fewer chunks are needed, which means a smaller, cleaner context.&lt;/p&gt;

&lt;p&gt;A related problem is redundancy: when retrieved passages are near-duplicates, the model sees the same information repeated multiple times without gaining anything new. This is why &lt;strong&gt;diversity&lt;/strong&gt; matters as much as relevance - a set of chunks that each cover a different facet of the question is far more efficient than a set of very similar top matches. Techniques like &lt;a href="https://haystack.deepset.ai/blog/hybrid-retrieval/" rel="noopener noreferrer"&gt;hybrid retrieval&lt;/a&gt;, &lt;a href="https://haystack.deepset.ai/blog/optimizing-retrieval-with-hyde/" rel="noopener noreferrer"&gt;HyDE&lt;/a&gt;, &lt;a href="https://haystack.deepset.ai/blog/query-decomposition/" rel="noopener noreferrer"&gt;query decomposition&lt;/a&gt;, and &lt;a href="https://haystack.deepset.ai/blog/improve-retrieval-with-auto-merging/" rel="noopener noreferrer"&gt;auto-merging retrieval&lt;/a&gt; all help surface results that are both more relevant and more varied.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.components.embedders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformersTextEmbedder&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.components.rankers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TransformersSimilarityRanker&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.components.retrievers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemoryEmbeddingRetriever&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.document_stores.in_memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemoryDocumentStore&lt;/span&gt;

&lt;span class="n"&gt;document_store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemoryDocumentStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;rag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;rag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_component&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformersTextEmbedder&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="c1"&gt;# Retrieve 10 candidates, then rerank to the 3 most relevant
&lt;/span&gt;&lt;span class="n"&gt;rag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_component&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retriever&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;InMemoryEmbeddingRetriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document_store&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;rag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_component&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ranker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;TransformersSimilarityRanker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;rag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedder.embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retriever.query_embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retriever.documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ranker.documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;climate change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ranker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;climate change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;# result["ranker"]["documents"] now contains at most 3 highly relevant chunks
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Coming up in the series:&lt;/strong&gt; Retrieval quality deserves a post of its own. The next article will go deep on techniques for surfacing more relevant, more diverse results - so your RAG pipelines put the tokens that matter most in front of the model.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Summarisation and compaction
&lt;/h3&gt;

&lt;p&gt;As a conversation grows, the raw message history becomes the biggest consumer of context. Compaction addresses this by periodically replacing the accumulated history with a condensed summary - retaining the essential facts and decisions while discarding the verbatim back-and-forth. The agent continues with a much shorter context, and the summary is updated with each new turn.&lt;/p&gt;

&lt;p&gt;This pattern is well-established in practice: popular coding agents implement context compaction in exactly this way. When the context approaches its limit, they summarise the conversation so far and continue from the summary rather than truncating or failing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.core.component&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;component&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.components.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.components.builders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatPromptBuilder&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack_integrations.components.generators.anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnthropicChatGenerator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatMessage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack_experimental.chat_message_stores.in_memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemoryChatMessageStore&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack_experimental.components.retrievers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatMessageRetriever&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack_experimental.components.writers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatMessageWriter&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_current_date&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return today&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s date.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;today&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@component&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HistoryCompactor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compactor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatPromptBuilder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarise the key facts from the conversation below in &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3-5 bullet points.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{ history }}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;required_variables&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summariser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AnthropicChatGenerator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-haiku-4-5-20251001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@component.output_types&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;history_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compactor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;template_variables&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;history_text&lt;/span&gt;&lt;span class="p"&gt;})[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summariser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
        &lt;span class="c1"&gt;# The output message has to be a user message, as our chat 
&lt;/span&gt;        &lt;span class="c1"&gt;# generator cannot work with just system/assistant messages
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Conversation so far (summary):&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# skip_system_messages=False so the compacted summary (a system message) is persisted
&lt;/span&gt;&lt;span class="n"&gt;message_store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemoryChatMessageStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;skip_system_messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_component&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message_retriever&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ChatMessageRetriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message_store&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_component&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compactor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;HistoryCompactor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_component&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;chat_generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;AnthropicChatGenerator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-haiku-4-5-20251001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;get_current_date&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_component&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message_writer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ChatMessageWriter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message_store&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message_retriever.messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compactor.messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compactor.messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent.messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent.messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message_writer.messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;chat_history_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# First turn
&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message_retriever&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;current_messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What day is it today?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chat_history_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message_writer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chat_history_id&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Second turn - history is retrieved, compacted if needed, and stored back automatically
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message_retriever&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;current_messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What month are we in?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chat_history_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message_writer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chat_history_id&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Adding only relevant tools to the context
&lt;/h3&gt;

&lt;p&gt;Tool definitions can be a surprisingly large slice of the context window, especially when connecting to MCP servers that expose dozens or hundreds of tools. Listing every tool upfront means the model receives all those descriptions on every single call, regardless of which tool is actually needed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.haystack.deepset.ai/docs/searchabletoolset" rel="noopener noreferrer"&gt;&lt;code&gt;SearchableToolset&lt;/code&gt;&lt;/a&gt;, introduced in Haystack 2.25, inverts this approach. Instead of exposing the full catalog, the agent starts with a single &lt;code&gt;search_tools&lt;/code&gt; function and uses it to dynamically discover relevant tools via BM25 keyword search. Only the tools it actually needs are loaded into the context for that turn.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.components.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack_integrations.components.generators.anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnthropicChatGenerator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatMessage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SearchableToolset&lt;/span&gt;

&lt;span class="c1"&gt;# Create a catalog of tools
&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Get weather for a city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...),&lt;/span&gt;
    &lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_web&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search the web&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...),&lt;/span&gt;
    &lt;span class="c1"&gt;# ... 100s more tools
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;toolset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SearchableToolset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chat_generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;AnthropicChatGenerator&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;toolset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# The agent is initially provided only with the search_tools tool
# and will use it to find relevant tools on demand.
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the weather in Milan?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Offloading notes (scratchpad / working memory)
&lt;/h3&gt;

&lt;p&gt;An agent's intermediate reasoning - the chain of thought it builds up while working through a multi-step task - does not have to live inside the context window. A simple alternative is to give the agent two dedicated tools: one to write a note to an external store, and one to read notes back. Instead of accumulating its internal monologue in the prompt, the agent can offload conclusions, partial results, and reminders to storage and retrieve only what it needs at each step.&lt;/p&gt;

&lt;p&gt;This keeps the context lean: rather than carrying the full trace of every intermediate thought, the agent holds a minimal working state and queries its own notes on demand. The pattern is especially useful for long-horizon tasks where the reasoning chain would otherwise grow without bound, and it has the side effect of making the agent's thinking inspectable and debuggable from outside the model.&lt;/p&gt;
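
&lt;p&gt;A minimal sketch of the pattern, using Haystack's &lt;code&gt;tool&lt;/code&gt; decorator and - purely for illustration - a plain in-memory dict as the note store; a real system would back this with a database or file storage:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from haystack.components.agents import Agent
from haystack.tools import tool
from haystack_integrations.components.generators.anthropic import AnthropicChatGenerator

# Hypothetical note store: a dict here, but any external storage works.
notes = {}

@tool
def write_note(key: str, content: str):
    """Save a note under a key so it no longer has to live in the context."""
    notes[key] = content
    return f"Saved note {key!r}."

@tool
def read_note(key: str):
    """Read back a previously saved note."""
    return notes.get(key, f"No note found under {key!r}.")

agent = Agent(chat_generator=AnthropicChatGenerator(), tools=[write_note, read_note])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because the notes live outside the model, you can also inspect the store directly while the agent runs - which is exactly what makes the reasoning debuggable from outside.&lt;/p&gt;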

&lt;h2&gt;
  
  
  What's coming next in this series
&lt;/h2&gt;

&lt;p&gt;This article is the foundation of a series on context engineering. Future posts will go deeper on specific topics - measuring whether your context actually helps the model, keeping context manageable in long-running agent loops, diversifying retrieval results, tracking token usage across pipelines, and more. If there is a particular area you would like us to cover first, let us know.&lt;/p&gt;

&lt;p&gt;To stay up to date with the series and everything else happening in Haystack, star the &lt;a href="https://github.com/deepset-ai/haystack" rel="noopener noreferrer"&gt;Haystack GitHub repository&lt;/a&gt; and join the conversation on &lt;a href="https://discord.gg/Dr63fr9NDS" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
    </item>
    <item>
      <title>Multimodal Search with Gemini Embedding 2 in Haystack</title>
      <dc:creator>Bilge Yücel</dc:creator>
      <pubDate>Thu, 12 Mar 2026 12:23:42 +0000</pubDate>
      <link>https://dev.to/haystack/multimodal-search-with-gemini-embedding-2-in-haystack-2j4h</link>
      <guid>https://dev.to/haystack/multimodal-search-with-gemini-embedding-2-in-haystack-2j4h</guid>
      <description>&lt;p&gt;&lt;em&gt;This post is co-authored with &lt;a href="https://dev.to/anakin87"&gt;Stefano Fiorucci&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Embeddings are the backbone of modern AI applications, from semantic search and recommendation systems to Retrieval-Augmented Generation (RAG). However, most embedding models operate in a single modality, typically focusing only on textual data.&lt;/p&gt;

&lt;p&gt;Google has introduced &lt;strong&gt;Gemini Embedding 2&lt;/strong&gt;, a &lt;strong&gt;fully multimodal embedding model&lt;/strong&gt; that maps &lt;strong&gt;text, images, video, audio, and PDFs into a shared vector space&lt;/strong&gt;. This means you can search across different types of data using a &lt;strong&gt;single embedding model&lt;/strong&gt;: &lt;code&gt;gemini-embedding-2-preview&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Even better, &lt;strong&gt;Haystack supports Gemini Embedding 2 from Day 0&lt;/strong&gt;. Through the &lt;a href="https://haystack.deepset.ai/integrations/google-genai" rel="noopener noreferrer"&gt;Google GenAI x Haystack integration&lt;/a&gt;, you can immediately start using the model in your Haystack applications for both &lt;strong&gt;text and multimodal embeddings&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let’s take a closer look.&lt;/p&gt;

&lt;h2&gt;
  
  
  Meet Gemini Embedding 2
&lt;/h2&gt;

&lt;p&gt;Gemini Embedding 2 is Google’s &lt;strong&gt;first fully multimodal embedding model&lt;/strong&gt;, built on the Gemini architecture. It can map &lt;strong&gt;text, images, video, audio, and PDFs into a single unified vector space&lt;/strong&gt;, enabling cross-modal comparison and retrieval using a shared semantic representation.&lt;/p&gt;

&lt;p&gt;For example, a &lt;strong&gt;text query can retrieve relevant images&lt;/strong&gt;, an &lt;strong&gt;audio clip can match a document&lt;/strong&gt;, or a &lt;strong&gt;video segment can be retrieved using text search&lt;/strong&gt;. This unified representation makes it easier to build multimodal applications like image search, recommendation systems, and RAG.&lt;/p&gt;

&lt;p&gt;The model supports &lt;strong&gt;100+ languages&lt;/strong&gt; and allows developers to choose flexible embedding sizes using &lt;strong&gt;Matryoshka Representation Learning (MRL)&lt;/strong&gt;. Depending on the trade-off between storage and accuracy, you can select embedding dimensions up to &lt;strong&gt;3072&lt;/strong&gt;, with commonly recommended sizes being &lt;strong&gt;768&lt;/strong&gt;, &lt;strong&gt;1536&lt;/strong&gt;, or &lt;strong&gt;3072 (default)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Gemini Embedding 2 also supports &lt;strong&gt;large inputs up to 8192 tokens&lt;/strong&gt;, making it suitable for embedding longer documents and complex multimodal inputs.&lt;/p&gt;

&lt;p&gt;Early benchmarks indicate strong performance across modalities, including a &lt;strong&gt;top-5 ranking on the MTEB Multilingual leaderboard for text&lt;/strong&gt; and &lt;strong&gt;state-of-the-art results among proprietary models&lt;/strong&gt;, with &lt;strong&gt;document retrieval performance comparable to Voyage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Check out the &lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-embedding-2/" rel="noopener noreferrer"&gt;official Google documentation&lt;/a&gt; for more details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fcy77urqxxw5bdgiawc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fcy77urqxxw5bdgiawc.png" alt="Generated by gemini-3.1-flash-image-preview aka. Nano Banana 2" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Gemini Embeddings in Haystack
&lt;/h2&gt;

&lt;p&gt;Haystack provides built-in components for generating Gemini embeddings through the Gemini API and Vertex AI.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;text data&lt;/strong&gt;, you can use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.haystack.deepset.ai/docs/googlegenaitextembedder" rel="noopener noreferrer"&gt;&lt;code&gt;GoogleGenAITextEmbedder&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.haystack.deepset.ai/docs/googlegenaidocumentembedder" rel="noopener noreferrer"&gt;&lt;code&gt;GoogleGenAIDocumentEmbedder&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;GoogleGenAIDocumentEmbedder&lt;/code&gt; is typically used during &lt;strong&gt;indexing&lt;/strong&gt; to embed documents before storing them in a vector database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pip install haystack-ai google-genai-haystack
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Document&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.document_stores.in_memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemoryDocumentStore&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack_integrations.components.embedders.google_genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;GoogleGenAIDocumentEmbedder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GoogleGenAITextEmbedder&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;document_store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemoryDocumentStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding_similarity_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cosine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The capybara is the largest rodent in the world and is native to South America, where it lives near rivers, lakes, and wetlands. It is highly social and often seen relaxing in groups, spending much of its time swimming or soaking in water. Capybaras communicate through whistles, barks, and purr-like sounds.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dogs are domesticated mammals known for their loyalty, intelligence, and strong bond with humans. They have been bred for thousands of years for roles such as companionship, hunting, guarding, and assisting people with various tasks. Different breeds vary widely in size, temperament, and abilities.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The tiger is the largest species of big cat and is recognized by its distinctive orange coat with black stripes. It is a powerful solitary predator that inhabits forests, grasslands, and wetlands across parts of Asia. Tigers are excellent swimmers and rely on stealth and strength to hunt prey.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The giraffe is the tallest land animal on Earth, easily identified by its long neck and distinctive spotted coat. It uses its height to reach leaves high in acacia trees and roams the savannas and open woodlands of Africa. Despite its long neck, a giraffe has the same number of neck vertebrae as most mammals.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Elephants are the largest land animals and are known for their intelligence, strong family bonds, and remarkable memory. They use their trunks for breathing, grasping objects, and communication. Elephants live in complex social groups led by a matriarch.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Penguins are flightless birds that live primarily in the Southern Hemisphere, especially in Antarctica. They are excellent swimmers, using their flipper-like wings to move through the water while hunting fish, squid, and krill.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dolphins are highly intelligent marine mammals known for their playful behavior and complex communication. They live in social groups called pods and use echolocation to navigate and locate prey in the ocean.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Owls are nocturnal birds of prey with excellent night vision and silent flight. They hunt small mammals, insects, and other birds, relying on their sharp talons and keen hearing to detect prey in darkness.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Red pandas are small mammals native to the eastern Himalayas and southwestern China. They have reddish-brown fur, bushy tails, and spend most of their time in trees. Their diet mainly consists of bamboo, though they may also eat fruits and insects.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Kangaroos are large marsupials native to Australia and are famous for their powerful hind legs, large feet, and strong tails that help them balance while hopping. Female kangaroos carry and nurture their young, called joeys, in a pouch. They typically live in open grasslands and forests and often move in groups called mobs.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;doc_embedder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GoogleGenAIDocumentEmbedder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-embedding-2-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RETRIEVAL_DOCUMENT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_dimensionality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;768&lt;/span&gt; &lt;span class="c1"&gt;# flexible embedding sizes using MRL
&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;docs_with_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;doc_embedder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;document_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs_with_embeddings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once documents are embedded and stored, you can embed queries using &lt;code&gt;GoogleGenAITextEmbedder&lt;/code&gt; and retrieve relevant documents.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.components.retrievers.in_memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemoryEmbeddingRetriever&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack_integrations.components.embedders.google_genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GoogleGenAITextEmbedder&lt;/span&gt;

&lt;span class="n"&gt;embedding_retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemoryEmbeddingRetriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document_store&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;document_store&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;animal that communicates with whistles and barks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;text_embedder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GoogleGenAITextEmbedder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-embedding-2-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RETRIEVAL_QUERY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_dimensionality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;768&lt;/span&gt; &lt;span class="c1"&gt;# flexible embedding sizes using MRL
&lt;/span&gt;          &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text_embedder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedding_retriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
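

&lt;p&gt;The same two steps can also be wired into a Haystack &lt;code&gt;Pipeline&lt;/code&gt;, so the query embedding flows straight into the retriever. A minimal sketch, reusing the &lt;code&gt;text_embedder&lt;/code&gt; and &lt;code&gt;embedding_retriever&lt;/code&gt; defined above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from haystack import Pipeline

# Connect the embedder's output to the retriever's query input.
query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", text_embedder)
query_pipeline.add_component("retriever", embedding_retriever)
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

result = query_pipeline.run({"text_embedder": {"text": query}})
for doc in result["retriever"]["documents"]:
    print(doc.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;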



&lt;p&gt;In production applications, embeddings can be stored in vector databases and search engines such as &lt;a href="https://haystack.deepset.ai/integrations/qdrant-document-store" rel="noopener noreferrer"&gt;Qdrant&lt;/a&gt;, &lt;a href="https://haystack.deepset.ai/integrations/elasticsearch-document-store" rel="noopener noreferrer"&gt;Elasticsearch&lt;/a&gt;, or &lt;a href="https://haystack.deepset.ai/integrations/opensearch-document-store" rel="noopener noreferrer"&gt;OpenSearch&lt;/a&gt;.&lt;/p&gt;
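
&lt;p&gt;For example, swapping the in-memory store for Qdrant is mostly a change at setup time. A sketch, assuming the &lt;code&gt;qdrant-haystack&lt;/code&gt; package and a locally running Qdrant instance (the &lt;code&gt;animals&lt;/code&gt; collection name is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# pip install qdrant-haystack
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

# embedding_dim must match the output_dimensionality chosen at indexing time.
document_store = QdrantDocumentStore(
    url="http://localhost:6333",
    index="animals",
    embedding_dim=768,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On the query side, the matching &lt;code&gt;QdrantEmbeddingRetriever&lt;/code&gt; then takes the place of &lt;code&gt;InMemoryEmbeddingRetriever&lt;/code&gt;.&lt;/p&gt;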

&lt;h2&gt;
  
  
  Multimodal Embeddings
&lt;/h2&gt;

&lt;p&gt;Haystack also provides &lt;strong&gt;Day-0 support for multimodal embeddings&lt;/strong&gt; with the new &lt;a href="https://docs.haystack.deepset.ai/docs/googlegenaimultimodaldocumentembedder" rel="noopener noreferrer"&gt;&lt;code&gt;GoogleGenAIMultimodalDocumentEmbedder&lt;/code&gt;&lt;/a&gt; component.&lt;/p&gt;

&lt;p&gt;This component enables embedding &lt;strong&gt;images, audio, video, and PDFs directly inside Haystack pipelines&lt;/strong&gt;, making it straightforward to build applications that search across multiple media types.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack_integrations.components.embedders.google_genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GoogleGenAIMultimodalDocumentEmbedder&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Document&lt;/span&gt;

&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kangaroo.mp4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tiger.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kangaroo.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cat.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;doc_multimodal_embedder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GoogleGenAIMultimodalDocumentEmbedder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-embedding-2-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RETRIEVAL_DOCUMENT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_dimensionality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;768&lt;/span&gt; &lt;span class="c1"&gt;# flexible embedding sizes using MRL
&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;docs_with_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;doc_multimodal_embedder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;document_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs_with_embeddings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Cross-modal Retrieval
&lt;/h2&gt;

&lt;p&gt;One powerful capability enabled by multimodal embeddings is &lt;strong&gt;cross-modal retrieval&lt;/strong&gt;: retrieving one type of content using another modality as the query.&lt;/p&gt;

&lt;p&gt;For example, you can perform &lt;strong&gt;image-to-text search&lt;/strong&gt;, where an image is used as a query to retrieve relevant text documents. If a user uploads an image of a dog, the system could retrieve documents describing dog breeds, training tips, or animal behavior.&lt;/p&gt;

&lt;p&gt;To perform image-to-text search, you first embed your &lt;strong&gt;text documents during indexing&lt;/strong&gt; (as shown earlier using &lt;code&gt;GoogleGenAIDocumentEmbedder&lt;/code&gt;). Then you embed the &lt;strong&gt;image as the query&lt;/strong&gt; using the multimodal embedder.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Because all modalities share the same vector space, you can use this approach to support cross-modal retrieval in any direction, for example &lt;strong&gt;text-to-image, image-to-text, audio-to-video,&lt;/strong&gt; or &lt;strong&gt;video-to-document&lt;/strong&gt; search.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;image_doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;another_kangaroo.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;image_embedder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GoogleGenAIMultimodalDocumentEmbedder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-embedding-2-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RETRIEVAL_QUERY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_dimensionality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;768&lt;/span&gt; &lt;span class="c1"&gt;# flexible embedding sizes using MRL
&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create the embedding for the image
&lt;/span&gt;&lt;span class="n"&gt;image_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;image_embedder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;image_doc&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;

&lt;span class="c1"&gt;# Find the most semantically similar texts, images, audio, video and PDFs in the vector database
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedding_retriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With multimodal embeddings, you can build applications such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Multimodal search&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Example: search a product catalog with a text query like &lt;em&gt;“red running shoes with white soles”&lt;/em&gt; and retrieve both product images and descriptions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cross-modal retrieval&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Search one type of media using another modality. For example, upload a photo of a monument to retrieve relevant articles describing it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Multimodal RAG systems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Combine documents, diagrams, screenshots, and videos as sources of knowledge.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Media recommendation systems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recommend videos or images based on textual descriptions.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To learn more about working with multimodal data in Haystack, check out the tutorial on &lt;a href="https://haystack.deepset.ai/tutorials/46_multimodal_rag" rel="noopener noreferrer"&gt;Creating Vision + Text RAG Pipelines&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;Gemini Embedding 2 makes it easier to build &lt;strong&gt;multimodal-aware retrieval systems&lt;/strong&gt; without stitching together multiple embedding models.&lt;/p&gt;

&lt;p&gt;With Haystack, you can already use both text and multimodal embeddings with Gemini Embedding 2. We’re excited to see what multimodal applications you build next.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>haystack</category>
      <category>embedding</category>
      <category>gemini</category>
    </item>
  </channel>
</rss>
