DEV Community: Harshvardhan Singh

Building Production AI Agents on the JVM: A Practical Guide to Java + Spring AI in 2026

Harshvardhan Singh — Mon, 20 Jul 2026 08:24:51 +0000

Let's skip the philosophical debate. This isn't about whether Java is better than Python for AI; Python owns model research and that's not changing. This is about what happens after you have a model endpoint and need to build a reliable system around it: retrieval, tool calling, multi-agent orchestration, observability, and enough throughput to not fall over on launch day.

This guide walks through building a production-grade RAG + agent service in Java 21, using Spring Boot 3.x, Spring AI, virtual threads, and structured concurrency — with real code, real tradeoffs, and a couple of diagrams so you're not just taking my word for it.

Why Backend Engineers Are Ending Up on AI Teams

AI products in 2026 aren't model calls. They're distributed systems: retrieval pipelines, tool-calling agents, multi-model routing for cost and reliability, semantic caching, and observability stacks that need to trace a request across five different services. That's backend engineering with an LLM bolted onto one node of the graph.

flowchart LR
    U[User Request] --> GW[AI Gateway]
    GW --> RT[Model Router]
    RT --> M1[GPT / Claude / Gemini]
    RT --> M2[Local vLLM Cluster]
    GW --> RAG[RAG Pipeline]
    RAG --> VDB[(Vector DB: pgvector / Qdrant)]
    GW --> TOOLS[MCP Tool Layer]
    TOOLS --> EXT[External APIs]
    GW --> CACHE[(Redis Semantic Cache)]
    GW --> OTEL[OpenTelemetry Collector]

Every box in that diagram except "GPT / Claude / Gemini" is a job for backend infrastructure — and it's where Java's ecosystem (Spring Boot, resilience libraries, mature observability tooling) genuinely earns its keep.

Setting Up the Project

curl https://start.spring.io/starter.zip \
  -d dependencies=web,spring-ai-openai,spring-ai-vectorstore-pgvector,actuator \
  -d javaVersion=21 \
  -d bootVersion=3.4.0 \
  -o ai-service.zip

build.gradle.kts dependencies:

dependencies {
    implementation("org.springframework.boot:spring-boot-starter-web")
    implementation("org.springframework.ai:spring-ai-starter-model-openai")
    implementation("org.springframework.ai:spring-ai-starter-vector-store-pgvector")
    implementation("org.springframework.boot:spring-boot-starter-actuator")
    implementation("io.micrometer:micrometer-registry-prometheus")
    runtimeOnly("org.postgresql:postgresql")
}

Enable virtual threads in application.yml — this one property changes your entire concurrency model:

spring:
  threads:
    virtual:
      enabled: true
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-4.1

Why Virtual Threads Matter Here Specifically

AI request handling is almost entirely I/O wait: call the model, call the vector store, call a tool, call the model again. Platform threads made this expensive — each blocked thread held an OS thread hostage, so you'd tune thread pools and hope. Virtual threads decouple the two. You write normal, blocking, readable code, and the JVM schedules thousands of them onto a small number of carrier threads.

@RestController
@RequestMapping("/api/chat")
public class ChatController {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;

    public ChatController(ChatClient.Builder builder, VectorStore vectorStore) {
        this.chatClient = builder.build();
        this.vectorStore = vectorStore;
    }

    @PostMapping(produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<String> chat(@RequestBody ChatRequest request) {
        List<Document> relevant = vectorStore.similaritySearch(
            SearchRequest.builder().query(request.message()).topK(5).build()
        );

        String context = relevant.stream()
            .map(Document::getText)
            .collect(Collectors.joining("\n---\n"));

        return chatClient.prompt()
            .system("Answer using only this context:\n" + context)
            .user(request.message())
            .stream()
            .content();
    }
}

No manual thread pool tuning. No reactive boilerplate required to get high concurrency — though WebFlux remains a valid choice when you need true backpressure control over streaming pipelines, not just high thread counts. Virtual threads and reactive streams solve overlapping but distinct problems: virtual threads give you cheap concurrency with imperative code; WebFlux gives you backpressure and non-blocking composition. Pick reactive when you're streaming large volumes with flow control requirements; pick virtual threads when you want blocking-style simplicity at scale. Many production systems use both.

Structured Concurrency for Multi-Tool Agent Calls

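Agents frequently need to fan out to multiple tools or retrieval sources in parallel and combine results. Java's structured concurrency gives you a clean way to do this without leaking threads on partial failure. Note that in JDK 21 it is still a preview API (JEP 453), so the code below needs --enable-preview at compile time and at run time, and the API has been revised in later previews, so check the JEP that matches your JDK: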

public AgentResult gatherContext(String query) throws Exception {
    try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
        var vectorResults = scope.fork(() -> vectorStore.similaritySearch(query));
        var toolResults = scope.fork(() -> mcpToolClient.callTool("search_invoices", query));
        var cacheCheck = scope.fork(() -> semanticCache.lookup(query));

        scope.join();
        scope.throwIfFailed();

        return new AgentResult(
            vectorResults.get(),
            toolResults.get(),
            cacheCheck.get()
        );
    }
}

If any subtask fails, the scope cancels the others automatically. No orphaned threads, no manual CompletableFuture cleanup, no silent leaks — a real problem in long-running agent services where a forgotten future eventually shows up as a memory profile nobody wants to explain.

RAG Architecture: pgvector vs. Qdrant vs. Milvus

A quick, honest comparison for teams choosing a vector store for a Java-based RAG pipeline:

Store	Best fit	Tradeoff
PostgreSQL + pgvector	You already run Postgres; want transactional consistency between metadata and vectors	Slower at very large scale (10M+ vectors) without careful indexing (HNSW tuning)
Qdrant	Dedicated vector workloads, filtering-heavy queries, strong Rust-based performance	One more system to operate and monitor
Milvus	Very large scale (100M+ vectors), distributed deployments	Higher operational complexity, heavier resource footprint

For most mid-scale production systems, pgvector wins on simplicity: one less database to run, one less failure domain, and Spring AI's PgVectorStore integrates directly with Spring's transaction management, which matters when you need vector writes to stay consistent with the rest of your data model.

@Bean
public VectorStore vectorStore(JdbcTemplate jdbcTemplate, EmbeddingModel embeddingModel) {
    return PgVectorStore.builder(jdbcTemplate, embeddingModel)
        .dimensions(1536)
        .distanceType(PgDistanceType.COSINE_DISTANCE)
        .indexType(PgIndexType.HNSW)
        .build();
}

Multi-Model Routing and Cost Control

Production systems rarely call a single model for every request. Cheap, fast models handle simple queries; larger models handle complex reasoning; local models via Ollama or vLLM handle anything sensitive enough to keep off third-party infrastructure. LiteLLM or a custom Spring AI ChatModel router can sit in front of all of them:

@Component
public class ModelRouter {

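    private final Map<String, ChatModel> models;

    // Spring injects every ChatModel bean into this map, keyed by bean name,
    // so the keys used below must match your bean names.
    public ModelRouter(Map<String, ChatModel> models) {
        this.models = models;
    }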

    public ChatResponse route(String query, ComplexityScore score) {
        ChatModel selected = switch (score) {
            case SIMPLE -> models.get("gpt-4o-mini");
            case MODERATE -> models.get("claude-sonnet");
            case COMPLEX -> models.get("gpt-4.1");
            case SENSITIVE -> models.get("local-vllm-llama3");
        };
        return selected.call(new Prompt(query));
    }
}

This is where the "backend engineering" framing pays off directly: routing logic, cost attribution per customer, and failover between providers are ordinary service-layer concerns, not AI-specific magic. You're applying the same patterns you'd use for any multi-vendor integration — circuit breakers, retries with backoff, health checks — just pointed at model APIs instead of payment processors.

Semantic Caching with Redis

Re-answering identical or near-identical questions is pure waste. A semantic cache checks embedding similarity before hitting the model:

@Service
public class SemanticCacheService {

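    private final RedisTemplate<String, String> redis;
    private final EmbeddingModel embeddingModel;
    private static final double SIMILARITY_THRESHOLD = 0.95;

    public SemanticCacheService(RedisTemplate<String, String> redis, EmbeddingModel embeddingModel) {
        this.redis = redis;
        this.embeddingModel = embeddingModel;
    }

    public Optional<String> lookup(String query) {
        float[] queryEmbedding = embeddingModel.embed(query);
        // hashKeyIfCloseEnough is a placeholder (not shown) for the ANN lookup
        // against a vector-capable Redis index (RediSearch). It is assumed to
        // return Optional.empty() when nothing clears the threshold.
        return hashKeyIfCloseEnough(queryEmbedding, SIMILARITY_THRESHOLD)
            .map(key -> redis.opsForValue().get(key));
    }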
}

In practice this cuts model spend meaningfully on high-traffic, low-variance query patterns — support bots, FAQ-style assistants, internal tooling — where the same question shows up in a dozen phrasings.

Agent Orchestration and MCP

Model Context Protocol standardizes how agents discover and invoke tools, instead of every team hand-rolling a bespoke function-calling schema. Spring AI's MCP support lets you register a Java service as an MCP server that any compliant agent — regardless of what language or framework built it — can call:

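// Server name comes from config: spring.ai.mcp.server.name=invoice-tools
@Component
public class InvoiceToolServer {

    private final InvoiceRepository invoiceRepository;

    public InvoiceToolServer(InvoiceRepository invoiceRepository) {
        this.invoiceRepository = invoiceRepository;
    }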

    @McpTool(description = "Fetch the last N invoices for a customer")
    public List<Invoice> getRecentInvoices(String customerId, int count) {
        return invoiceRepository.findRecentByCustomer(customerId, count);
    }
}

For multi-step agent workflows with branching logic, teams often pair Spring AI's tool-calling primitives with graph-based orchestration patterns similar to LangGraph — modeling the agent as an explicit state machine rather than an implicit chain, which makes debugging failed runs dramatically easier:

stateDiagram-v2
    [*] --> Retrieve
    Retrieve --> Decide
    Decide --> ToolCall: needs external data
    Decide --> Synthesize: has enough context
    ToolCall --> Decide
    Synthesize --> [*]

Structured Outputs

Downstream systems need structured data, not prose. Spring AI's structured output converters derive a JSON schema from a Java record, include it in the prompt as format instructions, and deserialize the model's reply into that record:

public record InvoiceSummary(
    String customerId,
    double totalAmount,
    List<String> flaggedItems
) {}

InvoiceSummary summary = chatClient.prompt()
    .user("Summarize these invoices: " + invoiceData)
    .call()
    .entity(InvoiceSummary.class);

This eliminates an entire category of bugs — hand-parsed JSON from free-text model output — that plagues early-stage AI integrations.

Observability: You Can't Fix What You Can't See

AI systems fail in unusual ways: silent hallucination, slow degradation under load, cost spikes from a prompt template change nobody reviewed. Standard observability tooling still applies, and Java's ecosystem here is mature:

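@Bean
public ObservationRegistry observationRegistry(MeterRegistry meterRegistry) {
    // With the actuator starter, Spring Boot already auto-configures an
    // ObservationRegistry; define your own only if you need custom wiring.
    ObservationRegistry registry = ObservationRegistry.create();
    registry.observationConfig()
        .observationHandler(new DefaultMeterObservationHandler(meterRegistry));
    return registry;
}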

Spring AI auto-instruments chat calls, embedding calls, and vector store operations through Micrometer, which flows into Prometheus and OpenTelemetry with no extra glue code. In production, track at minimum: token usage per request, latency percentiles per model, cache hit rate, and tool-call failure rate. That last one catches problems weeks before users notice.

Deployment: Containers and Startup Time

A common objection to Java in latency-sensitive or serverless-adjacent AI deployments is JVM startup time. Two mitigations worth knowing:

CDS (Class Data Sharing) / AppCDS: reduces cold start meaningfully for containerized deployments by caching class metadata.
GraalVM native image: for services where cold start truly matters (scale-to-zero Kubernetes deployments), compiling to a native binary can bring startup from seconds to milliseconds, at the cost of longer build times and some reflection-heavy library incompatibilities (check Spring AI's native support status for your specific starters before committing).

FROM eclipse-temurin:21-jre-alpine
# Gradle writes the boot jar to build/libs; adjust the file name to match your build.
COPY build/libs/ai-service.jar app.jar
ENTRYPOINT ["java", "-XX:+UseZGC", "-jar", "app.jar"]

ZGC (Z Garbage Collector) is worth defaulting to for AI services specifically: sub-millisecond pause times matter when you're streaming tokens over SSE and a GC pause shows up as a visible stutter in the response.

What This Adds Up To

None of this replaces Python for model development, evaluation, or research; that's not the goal. What it gives you is a production-grade layer for everything a model needs around it: retrieval, tool orchestration, multi-model routing, caching, structured outputs, and observability, built on a runtime with decades of tuning for exactly this class of problem: high-concurrency, long-running, I/O-bound services that need to not fall over.

If you're currently running your AI orchestration layer as a Python script that grew into a service by accident, the migration path here doesn't require throwing anything away. Keep your model layer in Python. Put a Java service in front of it for routing, caching, and orchestration. Measure the difference in your p99 latency and your on-call load after a month.

Context Engineering Is Replacing Prompt Engineering: Building Production Context Pipelines for LLM Apps

Harshvardhan Singh — Sat, 11 Jul 2026 11:02:56 +0000

If you've spent time tweaking prompts to fix a flaky RAG app or a wandering agent, you've probably noticed the same thing most of us do eventually: rewording the prompt fixes the demo case and breaks two others. The actual bug usually isn't the prompt. It's what the model was handed before the prompt even ran — retrieved documents that contradict each other, conversation history crowding out the facts that matter, or the one correct chunk buried in the middle of a context blob where positional bias guarantees the model will underweight it.

This is a full implementation walkthrough of a context pipeline — the retrieval, ranking, assembly, memory, and reuse logic that decides what a model actually sees. Every stage below includes working code you can adapt directly, plus the specific failure modes each stage exists to prevent.

Architecture Overview

raw query
   │
   ▼
[1] Query Rewriting        — fix ambiguous/short queries before retrieval
   │
   ▼
[2] Hybrid Retrieval        — dense + sparse search, merged
   │
   ▼
[3] Reranking                — precision pass over retrieved candidates
   │
   ▼
[4] Deduplication            — collapse near-identical/contradictory chunks
   │
   ▼
[5] Context Assembly         — token-budgeted, position-aware final input
   │
   ▼
[6] Reuse Check               — skip stages if a near-identical request exists
   │
   ▼
[7] Generation                — the actual model call
   │
   ▼
[8] Evaluation Logging       — per-stage metrics for regression testing

Prompt engineering lives entirely in box 7. Everything else is context engineering, and it's usually where production bugs actually hide. Let's build each stage.

1. Query Rewriting

Users type short, underspecified queries. Retrievers do better with explicit, expanded ones.

import re

PRONOUNS = {"it", "that", "this", "those"}


def rewrite_query(user_query: str, conversation_history: list[str]) -> str:
    # Cheap, deterministic expansion first: cheaper than an LLM call,
    # and it resolves a surprising fraction of ambiguous short queries.
    is_short = len(user_query.split()) < 4
    # Match whole words only; a substring test would flag "with" or "write"
    # because they contain "it".
    words = re.findall(r"[a-z']+", user_query.lower())
    has_pronoun = any(w in PRONOUNS for w in words)

    if (is_short or has_pronoun) and conversation_history:
        return f"{conversation_history[-1]} {user_query}"
    return user_query

In production, this step often escalates to a small, fast LLM call for genuinely ambiguous queries — but gate that escalation behind the heuristic above, since most queries don't need it and every extra model call adds latency you don't want to pay by default.

def rewrite_query_with_llm_fallback(user_query, history, needs_escalation_fn, llm_rewrite_fn):
    rewritten = rewrite_query(user_query, history)
    if needs_escalation_fn(rewritten):
        return llm_rewrite_fn(rewritten, history)
    return rewritten

2. Hybrid Retrieval

Pure vector search misses exact matches — IDs, product codes, proper nouns — that embeddings tend to smear together in semantic space. Combine dense and sparse retrieval, and merge with reciprocal rank fusion rather than naive score averaging (scores from different retrieval methods aren't on comparable scales):

def reciprocal_rank_fusion(result_lists, k=60):
    scores = {}
    for results in result_lists:
        for rank, doc in enumerate(results):
            scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank + 1)
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    doc_lookup = {doc.id: doc for results in result_lists for doc in results}
    return [doc_lookup[i] for i in ranked_ids]


def hybrid_retrieve(query: str, vector_store, keyword_index, k: int = 20):
    dense_results = vector_store.similarity_search(query, k=k)
    sparse_results = keyword_index.search(query, k=k)
    return reciprocal_rank_fusion([dense_results, sparse_results])[:k]

Retrieve generously here (k=20 or more). Over-fetching is cheap, and the next two stages exist specifically to cut the candidate set down intelligently rather than relying on the retriever to be precise on the first pass.
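To make the fusion step concrete, here is a small worked sketch. The `Doc` dataclass and the two toy rankings are assumptions for illustration only, and the fusion function is restated so the block runs on its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    # Hypothetical minimal document type; real retrievers return richer objects.
    id: str
    text: str


def reciprocal_rank_fusion(result_lists, k=60):
    # Same scoring rule as above: each list contributes 1 / (k + rank + 1).
    scores = {}
    for results in result_lists:
        for rank, doc in enumerate(results):
            scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (k + rank + 1)
    lookup = {doc.id: doc for results in result_lists for doc in results}
    return [lookup[i] for i in sorted(scores, key=scores.get, reverse=True)]


a = Doc("a", "refund policy overview")
b = Doc("b", "order SKU-4417 refund record")
c = Doc("c", "shipping times")

dense = [c, a, b]   # hypothetical vector-search ranking
sparse = [b, a]     # hypothetical keyword ranking: exact SKU match first

fused_ids = [d.id for d in reciprocal_rank_fusion([dense, sparse])]
print(fused_ids)
```

Document `c` tops the dense list but appears nowhere in the sparse list, so it ends up last: agreement between retrievers counts for more than a single first place.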

3. Reranking

This is the step most RAG tutorials skip, and it's usually the highest-leverage fix available for citation and relevance bugs.

def rerank(query: str, candidates: list, reranker_model, top_n: int = 6):
    scored = reranker_model.score(query, [c.text for c in candidates])
    ranked = sorted(zip(candidates, scored), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_n]]

A cross-encoder reranker — or an LLM-based relevance judge for lower-volume use cases — trades a small amount of latency for a substantial jump in precision. Retrieval optimizes for recall across the whole corpus; reranking optimizes for precision over the candidates you already pulled. Skipping this step and taking top-k directly from a vector store is one of the most common causes of noisy, contradictory context reaching generation.
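The `rerank` function above assumes a `reranker_model` object exposing `score(query, texts)`. As a sketch of that interface only, here is a toy stand-in that scores by token overlap. It is not a cross-encoder and won't match one for quality, but it can be handy for wiring up tests before a real model is plugged in. The class name and scoring rule are assumptions.

```python
import re


class OverlapReranker:
    """Toy stand-in for a cross-encoder: scores by Jaccard token overlap."""

    @staticmethod
    def _tokens(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def score(self, query, texts):
        # Returns one score per text, in the same order as the input.
        q = self._tokens(query)
        scores = []
        for text in texts:
            t = self._tokens(text)
            union = q | t
            scores.append(len(q & t) / len(union) if union else 0.0)
        return scores


scores = OverlapReranker().score(
    "reset password",
    ["How to reset your password", "Billing and invoices"],
)
print(scores)
```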

4. Deduplication

Retrieved candidates frequently include near-duplicates — different versions of the same document, or restated facts across multiple sources — that consume token budget without adding information, and can actively confuse a model when they disagree with each other.

def deduplicate(candidates: list, similarity_fn, threshold: float = 0.92,
                 recency_key=None):
    kept = []
    for doc in candidates:
        duplicate_of = None
        for existing in kept:
            if similarity_fn(doc.text, existing.text) > threshold:
                duplicate_of = existing
                break

        if duplicate_of is None:
            kept.append(doc)
        elif recency_key and recency_key(doc) > recency_key(duplicate_of):
            # Keep the more recent version when duplicates represent
            # different versions of the same underlying document. Swap in
            # place so the newer version inherits the original's rank position.
            kept[kept.index(duplicate_of)] = doc

    return kept

If your source data has any notion of versioning or supersession — updated policies, revised documentation — this stage needs to know about it explicitly. Embedding similarity alone can't tell a model which of two near-identical documents is current; that's a metadata problem, not a generation problem, and it needs to be solved before the model ever sees the candidates.
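When the source system does carry version metadata, supersession can be resolved from that metadata directly, with no similarity threshold involved. A minimal sketch, assuming each candidate has a stable `doc_key` shared across versions and an integer `version` (both field names are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class VersionedDoc:
    doc_key: str   # hypothetical stable identifier shared by all versions
    version: int   # hypothetical monotonically increasing version number
    text: str


def drop_superseded(candidates):
    """Keep only the highest version per doc_key, in first-seen rank order."""
    latest = {}
    for doc in candidates:
        current = latest.get(doc.doc_key)
        if current is None or doc.version > current.version:
            # Overwriting a value keeps the key's original position in the dict.
            latest[doc.doc_key] = doc
    return list(latest.values())


candidates = [
    VersionedDoc("policy", 1, "old refund policy"),
    VersionedDoc("faq", 3, "faq"),
    VersionedDoc("policy", 2, "new refund policy"),
]
current_docs = drop_superseded(candidates)
```

Running this before the similarity-based pass means that pass only has to deal with restated facts across different sources, not stale versions of the same one.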

5. Context Assembly — With an Explicit Token Budget

This is the part most implementations get wrong by default: concatenate everything relevant and let the model sort it out. Do it deliberately instead.

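def assemble_context(system_prompt, retrieved_docs, history, user_query,
                      max_tokens=8000, token_counter=len):
    # Note: the default token_counter=len counts characters, not tokens.
    # Pass your model's tokenizer-based counter for a real budget.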
    budget = {
        "system": int(max_tokens * 0.10),
        "history": int(max_tokens * 0.25),
        "retrieved": int(max_tokens * 0.55),
        "query": int(max_tokens * 0.10),
    }

    def trim_to_budget(text_blocks, limit):
        kept, used = [], 0
        for block in text_blocks:
            t = token_counter(block)
            if used + t > limit:
                break
            kept.append(block)
            used += t
        return kept

    # Keep the most recent history turns, trimmed from the oldest end
    trimmed_history = trim_to_budget(history[::-1], budget["history"])[::-1]

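    # Highest-ranked retrieved docs first, and positioned first in the
    # final context, since models under-attend to mid-context information.
    # retrieved_docs is assumed to arrive best-first from the reranker, so
    # keep that order rather than re-sorting by a retriever score that the
    # rerank stage has already superseded.
    trimmed_docs = trim_to_budget(
        [d.text for d in retrieved_docs],
        budget["retrieved"],
    )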

    return {
        "system": system_prompt,
        "context": trimmed_docs,        # placed early in the final prompt
        "history": trimmed_history,      # placed after context
        "query": user_query,             # placed last
    }

Two things matter here more than anything else in this pipeline: placement and budget. Put the most decision-critical retrieved facts near the start or end of the assembled context, never buried in the middle — long-context evaluations consistently show models under-attend to mid-context information, an effect commonly referred to as "lost in the middle." And give every section a hard token budget instead of letting whichever section happens to be largest (usually conversation history, in long-running sessions) silently crowd out the rest.
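One possible way to act on the placement advice is to reorder a best-first list so the strongest documents sit at both edges and the weakest land in the middle. The sketch below is an assumption about how you might do that, not something the pipeline above requires. It also includes a rough whitespace-based counter you could pass as `token_counter` until a real tokenizer is wired in.

```python
def edge_order(ranked_docs):
    """Reorder best-first docs so the strongest sit at the start and end.

    Input is assumed to be sorted best-first; the weakest end up mid-list.
    """
    front, back = [], []
    for i, doc in enumerate(ranked_docs):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]


def whitespace_token_count(text):
    # Rough proxy only; a real tokenizer for your model will count differently.
    return len(text.split())


ordered = edge_order(["d1", "d2", "d3", "d4", "d5"])
print(ordered)
```

With five documents ranked `d1` (best) to `d5` (worst), the best lands first, the second-best lands last, and `d5` sits in the middle.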

6. Reuse Layer — Don't Repeat Work You've Already Done

Once retrieval, reranking, and assembly are dialed in, the next optimization most teams miss entirely is reuse. If two queries are semantically close, there's often no need to run the full pipeline twice.

def get_or_run_pipeline(query, cache, run_full_pipeline_fn, generate_fn,
                          full_reuse_threshold=0.98, partial_reuse_threshold=0.92):
    match = cache.find_similar(query, threshold=partial_reuse_threshold)

    if match is None:
        result = run_full_pipeline_fn(query)
        cache.store(query, result)
        return result

    if match.similarity > full_reuse_threshold:
        return match.cached_response                       # full reuse
    else:
        # Reuse retrieval + ranking work, regenerate only the final answer
        return generate_fn(query, context=match.cached_context)

This is a simplified version of the pattern used by semantic caching layers and by newer, purpose-built projects like Remem, an open-source work-reuse engine designed specifically for RAG and agent pipelines. Instead of a binary cache hit/miss, it evaluates similarity and decides per-request whether to fully reuse a prior response, reuse retrieved context while regenerating the answer, or run the pipeline fresh. For high-traffic apps with repetitive query patterns — support bots, internal knowledge assistants, coding copilots — this kind of tiered reuse is frequently a bigger cost and latency win than swapping to a cheaper model, and unlike a model swap, it doesn't trade off against output quality on the requests that do need a fresh run.
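The `cache` object in `get_or_run_pipeline` is assumed to expose `find_similar` and `store`. Here is a minimal in-memory sketch of that interface, useful for local testing. The bag-of-words "embedding" and the linear scan are stand-ins for a real embedding model and an ANN index, and every name below is hypothetical. This is not how Remem or any particular library implements it.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class CacheMatch:
    similarity: float
    cached_response: str
    cached_context: list


def bow_embed(text):
    # Toy bag-of-words "embedding"; swap in a real embedding model in practice.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryReuseCache:
    def __init__(self, embed_fn=bow_embed):
        self.embed_fn = embed_fn
        self.entries = []  # list of (embedding, response, context)

    def store(self, query, response, context=None):
        self.entries.append((self.embed_fn(query), response, context or []))

    def find_similar(self, query, threshold):
        # Linear scan: fine for a sketch, too slow for a large cache.
        q = self.embed_fn(query)
        best = None
        for emb, response, context in self.entries:
            sim = cosine(q, emb)
            if sim >= threshold and (best is None or sim > best.similarity):
                best = CacheMatch(sim, response, context)
        return best


cache = InMemoryReuseCache()
cache.store("how do i reset my password", "Use the reset link.", ["doc-1"])
hit = cache.find_similar("how do i reset my password", threshold=0.92)
miss = cache.find_similar("shipping times", threshold=0.92)
```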

7. Memory for Multi-Turn Sessions

Full conversation replay doesn't scale — it resends the same tokens on every turn, growing latency and cost linearly with conversation length, and it's exactly the mechanism that let history quietly crowd out retrieved context in the assembly stage above. Summarize older turns once a session passes a length threshold:

def manage_conversation_memory(history, summarizer_fn, max_raw_turns=6):
    if len(history) <= max_raw_turns:
        return history

    older_turns = history[:-max_raw_turns]
    recent_turns = history[-max_raw_turns:]
    summary = summarizer_fn(older_turns)

    return [f"[Earlier conversation summary]: {summary}"] + recent_turns

For assistants that need precise recall of specific earlier facts rather than just the gist — a coding assistant remembering an exact variable name from twenty turns ago, for instance — summarization alone will lose detail. In that case, extract discrete facts into a separate structured store instead of relying on summarized prose, and retrieve relevant facts back into context the same way you'd retrieve external documents.
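A minimal sketch of such a structured fact store follows. The class, its key naming convention, and the keyword-matching recall are all assumptions chosen to keep the example short. A production version would more likely retrieve facts by embedding similarity, the same way documents are retrieved.

```python
import re


class FactStore:
    """Minimal key-value fact memory; recall is naive keyword matching."""

    def __init__(self):
        self.facts = {}  # key -> (value, turn_index)

    def remember(self, key, value, turn_index):
        # Later turns overwrite earlier values for the same key.
        self.facts[key] = (value, turn_index)

    def recall(self, query):
        # A fact is returned when any word of its snake_case key is in the query.
        words = set(re.findall(r"[a-z0-9_]+", query.lower()))
        hits = []
        for key, (value, _turn) in self.facts.items():
            if words & set(key.lower().split("_")):
                hits.append(f"{key} = {value}")
        return hits


store = FactStore()
store.remember("retry_counter_variable", "max_retry_attempts", turn_index=3)
store.remember("db_engine", "postgres", turn_index=5)
recalled = store.recall("what was the retry variable called?")
print(recalled)
```

The exact identifier survives verbatim, which is the property a prose summary can't guarantee.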

8. Evaluation Harness

A minimal harness for catching regressions before they hit users — run this against a fixed test set on every change to retrieval, ranking, or assembly logic, the same way you'd run unit tests against any other service:

def evaluate_pipeline(test_cases, pipeline_fn, groundedness_checker):
    results = []
    for case in test_cases:
        output = pipeline_fn(case["query"])
        retrieved_ids = {d.id for d in output["retrieved_docs"]}
        expected_ids = set(case["expected_ids"])

        results.append({
            "query": case["query"],
            "retrieval_recall": len(retrieved_ids & expected_ids) / max(len(expected_ids), 1),
            "groundedness": groundedness_checker(output["answer"], output["context"]),
            "latency_ms": output["latency_ms"],
        })
    return results

Prompt changes are cheap to iterate on in a chat window. Pipeline regressions are not, and they're much harder to spot by eyeballing a handful of outputs — a retrieval regression can look, three stages downstream, like a confusing but plausible-sounding wrong answer with no obvious connection to its actual cause.
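To turn the per-case results into something a CI job can gate on, one option is to aggregate them and compare against a stored baseline. The function names, the baseline format, and the 0.02 tolerance below are assumptions for illustration.

```python
from statistics import mean


def summarize_results(results):
    # Collapse per-case metrics from evaluate_pipeline into run-level numbers.
    return {
        "mean_recall": mean(r["retrieval_recall"] for r in results),
        "mean_groundedness": mean(r["groundedness"] for r in results),
        "max_latency_ms": max(r["latency_ms"] for r in results),
    }


def check_regression(summary, baseline, tolerance=0.02):
    """Return the metric names that dropped more than `tolerance` below baseline."""
    failures = []
    for metric in ("mean_recall", "mean_groundedness"):
        if summary[metric] < baseline[metric] - tolerance:
            failures.append(metric)
    return failures


results = [
    {"retrieval_recall": 1.0, "groundedness": 0.9, "latency_ms": 120},
    {"retrieval_recall": 0.5, "groundedness": 0.7, "latency_ms": 300},
]
summary = summarize_results(results)
failed = check_regression(summary, {"mean_recall": 0.9, "mean_groundedness": 0.8})
print(failed)
```

Here recall fell from a baseline of 0.9 to 0.75, so the run is flagged on `mean_recall` while groundedness passes.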

Putting It Together

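def handle_query(user_query, history, vector_store, keyword_index,
                  reranker, cache, system_prompt):
    # cosine_sim, summarize, and generate are assumed to be defined elsewhere
    # (text similarity, history summarizer, and the model call respectively).
    rewritten = rewrite_query(user_query, history)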

    cached = cache.find_similar(rewritten, threshold=0.92)
    if cached and cached.similarity > 0.98:
        return cached.cached_response

    candidates = hybrid_retrieve(rewritten, vector_store, keyword_index)
    reranked = rerank(rewritten, candidates, reranker)
    deduped = deduplicate(reranked, similarity_fn=cosine_sim, recency_key=lambda d: d.updated_at)

    managed_history = manage_conversation_memory(history, summarizer_fn=summarize)
    context = assemble_context(system_prompt, deduped, managed_history, rewritten)

    result = generate(context)
    cache.store(rewritten, result)
    return result

Every box in this pipeline is a place where a bug can hide — and, importantly, every box is testable independently of the others. That's the real advantage of building a context pipeline instead of a single prompt string: you get a system you can debug stage by stage, instead of a black box you can only tune by trial and error and hope the regression doesn't come back somewhere else.

Common Pitfalls Checklist

Run through this list before assuming a bug is a prompt problem:

☐ Is retrieval returning near-duplicate or superseded documents with no recency signal distinguishing them?
☐ Is there a reranking step, or is the system taking raw top-k from vector search?
☐ Does every context section have an explicit token budget, or can one section (usually history) silently crowd out the rest?
☐ Is the most decision-critical retrieved content positioned near the start/end of context, or is it buried in the middle?
☐ Is conversation history growing unbounded, or is it summarized/extracted past a length threshold?
☐ Is the pipeline re-running in full for near-duplicate queries that could partially or fully reuse prior work?
☐ Are retrieval and groundedness evaluated independently, or only judged by eyeballing final output quality?

Takeaways

Query rewriting, hybrid retrieval, and reranking fix more RAG problems than prompt tuning does — start there when debugging.
Deduplicate semantically, not just exactly, and resolve document recency explicitly rather than relying on embedding similarity to imply it.
Give context sections explicit token budgets, and place the most important information near the start or end of the assembled context, never the middle.
Add a reuse layer once your pipeline is stable — repeated or near-duplicate queries are common in most production apps, and re-running the full pipeline every time is wasted latency and cost.
Evaluate retrieval, ranking, and groundedness as separate, independently-testable metrics, not just final output quality.

Prompt engineering is still part of the job. It's just the last five percent of it — and treating it as the whole job is exactly how six weeks of debugging effort ends up pointed at the wrong layer of the system.

I Kept Rebuilding the Same AI Context From Scratch. Here's What I Learned About "Why AI Forgets".

Harshvardhan Singh — Wed, 08 Jul 2026 10:55:28 +0000

Okay, real talk: how many of you have pasted the same "here's my project context" paragraph into a fresh ChatGPT/Claude chat for the third time this week?

Yeah. Me too. Constantly.

I finally sat down and actually dug into why this happens, instead of just being annoyed by it, and it led me down a genuinely interesting rabbit hole about how LLMs handle (or don't handle) memory. Sharing what I found, plus some code, because I think a lot of us building with these APIs are hitting the same wall without fully naming it.

The core fact that explains everything

LLMs are stateless. Full stop. There is no session on the model's side. Every single API call is self-contained.

That "conversation" you're having? It's an illusion built by the client. Here's basically what your chat app is doing behind the scenes every time you hit send:

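import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment by default.
const anthropic = new Anthropic();

const messages = [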
  { role: "user", content: "I'm building a Next.js app with Postgres" },
  { role: "assistant", content: "Got it, what's the question?" },
  { role: "user", content: "What ORM should I use?" } // <- your new message
];

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1000,
  messages,
});

Every. Single. Time. The whole history gets resent. Nothing is remembered server-side (with some caching exceptions I'll mention later). Close the tab, and that array is gone unless you saved it somewhere.

Once this clicked for me, a bunch of AI product behavior suddenly made sense.

"But it has a huge context window now, doesn't it just remember more?"

This is the trap I fell into too. Bigger context window ≠ better memory. Two separate problems:

It resets between sessions anyway. A 1M token context window is still empty the second you open a new chat.
Stuff in the middle gets "lost." There's actual research on this — models are measurably worse at pulling out info buried in the middle of a long context vs. info near the start or end. It's called the "lost in the middle" effect. So just cramming your whole chat history in isn't a free win even within one session.

Plus, more tokens = more $$$ and slower responses. Every extra token you send gets billed and adds latency. So nobody's shipping "just include everything," even if they wanted to.
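Some quick arithmetic shows how fast this adds up. The sketch below assumes every turn adds the same number of tokens and that each call resends the full history. It ignores output tokens and any prompt caching discount, so treat the numbers as illustrative rather than a billing estimate.

```python
def cumulative_input_tokens(turns, tokens_per_turn):
    """Total input tokens sent across a session when every call resends
    the whole history. Assumes each turn adds a fixed number of tokens."""
    # Call n resends n turns' worth of tokens, so the total is the sum 1..turns.
    return tokens_per_turn * turns * (turns + 1) // 2


short_session = cumulative_input_tokens(turns=10, tokens_per_turn=200)
long_session = cumulative_input_tokens(turns=100, tokens_per_turn=200)
print(short_session, long_session)
```

A session that is 10x longer sends roughly 92x the input tokens in total (11,000 vs. 1,010,000 here), because the per-call cost grows with every turn.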

So how do apps like ChatGPT "remember" me?

They're layering stuff on top. None of this is the model itself remembering — it's engineering:

Sliding window — just keep the last N messages, drop the rest. Simple, cheap, dumb.

def get_recent_context(history, max_turns=10):
    return history[-max_turns:]

Summarization — compress old messages into a summary blob so you keep the gist without the token cost.

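def summarize_old_turns(old_messages, model):
    # model.generate() is a placeholder for whatever LLM call you use.
    # The returned summary replaces old_messages in the next prompt.
    return model.generate(
        f"Summarize key facts/decisions from: {old_messages}"
    )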

RAG (Retrieval-Augmented Generation) — store stuff externally (vector DB), pull back only what's relevant to the current message, inject it fresh each time.

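def rag_context(query, vector_db, k=5):
    # embed() and vector_db.search() stand in for your embedding model
    # and vector store client; swap in the real calls.
    query_vec = embed(query)
    docs = vector_db.search(query_vec, top_k=k)
    # Only the top-k relevant chunks go into the prompt, not the whole history.
    return "\n".join(d.text for d in docs)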

Structured memory — extract actual discrete facts ("prefers TypeScript", "works in IST") into something like a little user profile, and inject those instead of raw transcript.

{
  "user_id": "abc123",
  "facts": {
    "preferred_stack": "Next.js + Postgres",
    "experience_level": "senior"
  }
}

This last one is basically what ChatGPT's memory feature and Claude's project context are doing: a curated fact layer bolted onto a fundamentally stateless model. Not magic, just good plumbing.
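As a rough sketch of what "inject those instead of raw transcript" can look like, the profile gets rendered into the system prompt on every request. The function name and the prompt wording are my own assumptions, not how any particular product does it.

```python
import json


def build_system_prompt(base_prompt: str, profile: dict) -> str:
    """Append stored user facts to the system prompt (hypothetical helper)."""
    facts = profile.get("facts", {})
    if not facts:
        return base_prompt
    # Render each fact as a short bullet, e.g. "- preferred stack: Next.js + Postgres".
    lines = [f"- {key.replace('_', ' ')}: {value}" for key, value in sorted(facts.items())]
    return base_prompt + "\n\nKnown facts about the user:\n" + "\n".join(lines)


if __name__ == "__main__":
    profile = json.loads(
        '{"user_id": "abc123", "facts": {"preferred_stack": "Next.js + Postgres",'
        ' "experience_level": "senior"}}'
    )
    print(build_system_prompt("You are a helpful coding assistant.", profile))
```

The model still starts from zero on each call. It just gets handed a few lines of curated context instead of the full transcript.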

The problem nobody talks about: redoing the same work over and over

Here's the thing that actually surprised me the most while digging into this. Memory isn't just "does the AI know facts about me" — it's also "does the system realize it already did this work."

Say you've got a RAG-powered support bot. User A asks:

"How do I reset my password?"

User B asks, twenty minutes later:

"I forgot my password, what do I do?"

Same question. Totally different string. In a naive setup, both trigger the entire pipeline from scratch — embed query, hit the vector DB, build the prompt, call the LLM, generate response. Twice. For the same thing. That's wasted latency and wasted money at scale, and it happens constantly in production systems because most caching is exact-match only (same hash, same key) and totally misses this.

This is actually a distinct problem from conversational memory, and while poking around I found an open-source project called Remem that's built specifically around this — a "work reuse engine" for RAG pipelines and AI agents. Instead of exact-match caching, it checks semantic similarity between requests. Close enough match → reuse the whole response. Partial match → reuse the retrieval step but let the model regenerate the actual answer. No match → run the full pipeline like normal.

I think it's a genuinely useful mental model even if you never use the tool itself: caching solves "have I seen this exact string," semantic reuse solves "have I basically already answered this." Most systems only handle the first one.
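Here's a minimal sketch of that three-way decision, to make the mental model concrete. This is not Remem's API or implementation. The thresholds, the cache layout, and the function names are all made up for illustration, and a real system would use an approximate nearest-neighbor index instead of a linear scan.

```python
import math

FULL_REUSE = 0.95     # hypothetical threshold: reuse the whole response
PARTIAL_REUSE = 0.80  # hypothetical threshold: reuse only the retrieved docs


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def reuse_decision(query_vec, cache):
    """cache is a list of (embedding, entry) pairs from earlier requests.

    Returns (decision, entry), where decision is one of
    "reuse_response", "reuse_retrieval", or "full_pipeline".
    """
    best_score, best_entry = 0.0, None
    for vec, entry in cache:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_entry = score, entry
    if best_score >= FULL_REUSE:
        return "reuse_response", best_entry
    if best_score >= PARTIAL_REUSE:
        return "reuse_retrieval", best_entry
    return "full_pipeline", None
```

The thresholds are the hard part in practice: set them too loose and users get someone else's answer to a subtly different question.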

Quick comparison table because I like tables

Approach	Solves	Doesn't solve
Sliding window	Recent context	Anything older / cross-session
Summarization	Long conversation compression	Precision (it's lossy)
RAG	Grounding in external/past knowledge	Redundant compute on similar queries
Structured facts	Durable user preferences	Free-form conversational nuance
Semantic work-reuse (e.g. Remem)	Redundant pipeline execution	Long-term factual memory itself

None of these is "the" answer. They're complementary layers, and most real production systems end up using two or three of them together.

Things I'm now doing differently

Stopped assuming a bigger context window fixes memory problems — it doesn't, it just delays when they show up.
Started separating "what should persist forever" (user prefs) from "what's just this session's scratch context" instead of treating my whole history as one blob.
Added basic staleness handling — facts I store now have a rough "last confirmed" timestamp, because stale memory can genuinely make an app feel less trustworthy than no memory at all.
Started thinking about work-reuse as its own optimization, separate from RAG itself. Before this I was treating retrieval + generation as one atomic thing to cache or not cache.
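
For the staleness point, this is roughly the shape of what I mean, as a minimal sketch. The 90-day cutoff and the field names are arbitrary assumptions. Pick whatever fits your product.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=90)  # hypothetical cutoff for trusting a stored fact


def fresh_facts(facts: dict, now: datetime, max_age: timedelta = MAX_AGE) -> dict:
    """Return only the facts confirmed recently enough to inject into a prompt.

    Each stored fact is assumed to look like
    {"value": ..., "last_confirmed": <timezone-aware datetime>}.
    """
    return {
        name: fact["value"]
        for name, fact in facts.items()
        if now - fact["last_confirmed"] <= max_age
    }


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    stored = {
        "preferred_stack": {"value": "Next.js + Postgres", "last_confirmed": now - timedelta(days=10)},
        "timezone": {"value": "IST", "last_confirmed": now - timedelta(days=200)},
    }
    print(fresh_facts(stored, now))
```

Facts that fall out of the window don't have to be deleted. They can be re-confirmed with the user the next time they come up.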

Discussion

Curious how others here are handling this in production:

Are you doing anything smarter than sliding-window + RAG for memory?
Has anyone measured actual cost savings from semantic caching/reuse vs. plain RAG?
Any horror stories from stale memory making an assistant worse, not better?

Drop your setup in the comments, always down to compare notes on this stuff. 👇

The Hidden Cost of Every LLM API Call

Harshvardhan Singh — Sun, 05 Jul 2026 16:00:51 +0000

What actually happens after your app sends a prompt to an LLM?


You call client.messages.create(...). A few hundred ms later, tokens start streaming back.

Feels simple. Isn't. Here's the full path, broken into fast, skimmable sections.

1. Your SDK does work before anything leaves your laptop

Serializes your messages to JSON
Attaches headers (API key, content-type)
Decides HTTP/1.1 vs HTTP/2
Sets up retry/backoff logic
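
The retry/backoff item is the one that's easiest to picture in code. Below is a minimal sketch of exponential backoff with full jitter. It isn't how any specific SDK implements it: real SDKs typically retry on particular HTTP status codes and honor server hints, while this sketch just retries on ConnectionError. The function and parameter names are my own.

```python
import random
import time


def call_with_backoff(send, max_attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call send() and retry on ConnectionError with exponential backoff.

    The wait before retry k is a random value in [0, min(max_delay, base_delay * 2**k)].
    The sleep parameter is injectable so the logic can be tested without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return send()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))  # jitter avoids synchronized retries
```

The jitter matters more than it looks: without it, every client that failed at the same moment retries at the same moment too.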

💡 Common Mistake: Making a new client instance per request. You lose connection pooling and pay full TCP + TLS setup cost every time. Reuse the client.

2. DNS: finding the server

api.anthropic.com → Resolver → 203.0.113.42

Cold lookup: 20–120ms. Cached: basically free. This is why connection reuse (skip re-resolving DNS on every call) is a real win at scale.

3. TLS: locking the channel

Client → TCP handshake → TLS handshake → Encrypted request

TLS 1.3 trimmed the TLS handshake to one round trip, down from two in TLS 1.2. That still sits on top of the TCP handshake, so it's not free, especially on mobile networks with higher latency.

4. Load balancer: you're not hitting one server

Request → [Load Balancer] → Server A / B / C

Health checks, geographic routing, traffic spike absorption. This is why one dead server never becomes your problem.

5. API Gateway: airport security for your request

Auth — is this API key valid, whose account is it?
Rate limiting — protects shared infra from noisy neighbors
Validation — malformed JSON or bad params get rejected here, before wasting GPU time downstream

💡 Engineering Insight: Rate limits aren't there to annoy you — they keep one client from degrading service for everyone sharing that hardware.
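
A token bucket is one common way to implement this kind of rate limiting. I'm not claiming any provider uses exactly this. It's a minimal sketch, with time passed in explicitly so the behavior is easy to follow, and the class and parameter names are mine.

```python
class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity: float, refill_per_second: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity  # start with a full bucket
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Top up the bucket for the time elapsed since the last check.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # the caller would answer with HTTP 429 here


if __name__ == "__main__":
    bucket = TokenBucket(capacity=2, refill_per_second=1.0)
    for t in (0.0, 0.0, 0.0, 1.0):
        print(f"t={t}: allowed={bucket.allow(t)}")
```

One bucket per API key is what keeps a noisy client from eating everyone else's capacity.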

6. Logging (async, non-blocking)

Request IDs, token counts, per-stage latency — feeds debugging, abuse detection, and your invoice. Doesn't block your request.

7. Tokenization: words become numbers

"Explain quantum entanglement" → [16350, 14294, 4776, 385, 1997]

Two things this affects directly:

💰 Cost — billed per token, not per character
📏 Context limit — "200K context" = token budget, not word count

💡 Real Example: Code and non-English text often burn more tokens than plain English for the same "amount" of meaning — the tokenizer saw those patterns less during training.

💡 Performance Tip: Trim repeated boilerplate/system prompts. Every token costs money and context space.

8. Model routing

A routing layer picks which model + cluster serves your request based on capacity and region. Provider-specific, mostly undocumented in detail — but this general shape is common everywhere.

9. GPU scheduling: the real bottleneck

[User A][User B][User C][User D] → batched onto one GPU

GPU capacity can't be spun up instantly the way web servers can, so providers batch multiple requests onto the same GPU to keep it efficient. Continuous batching (slotting new requests into an in-flight batch) is why modern serving is so much faster than naive one-at-a-time processing.

💡 This is also why your latency varies call to call — you're sharing hardware.
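
To see why continuous batching helps, here's a toy simulation. It counts decode steps only, assumes each active request produces one token per step, and ignores prompt processing, memory limits, and everything else a real scheduler deals with. The function names are my own.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Fixed batches: each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue as soon as a request finishes."""
    queue = deque(lengths)  # output lengths (in tokens, all > 0) in arrival order
    active = []             # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # every active request generates one token
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


if __name__ == "__main__":
    lengths = [2, 10, 2, 2]
    print("static:", static_batching_steps(lengths, 2))        # 12
    print("continuous:", continuous_batching_steps(lengths, 2))  # 10
```

With one long request and three short ones, the static version makes the short requests in the second batch wait for the long one to finish, while the continuous version slots them into the free seat as it opens up.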

10. KV Cache: the trick behind fast generation

Token 1 → compute + cache
Token 2 → reuse cache + compute new token only

Without this, every new token would mean reprocessing the whole conversation. With it, generation stays fast — but the cache grows with context length, eating GPU memory the whole time your request is active.

This is also the mechanism behind prompt caching — reusing cached state for a shared prefix (like a system prompt) across calls, cutting cost + latency.
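
A quick way to feel the difference is to count how many token positions get pushed through the model with and without the cache. This is a deliberately crude count, not a FLOP estimate: with a KV cache, attention still reads the cached keys and values at every step, so per-token cost isn't truly constant. The function names are mine.

```python
def positions_processed_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Each step re-runs the model over the prompt plus everything generated so far."""
    return sum(prompt_len + i for i in range(new_tokens))


def positions_processed_with_cache(prompt_len: int, new_tokens: int) -> int:
    """The prompt is processed once; each later step feeds only the newest token."""
    if new_tokens <= 0:
        return 0
    return prompt_len + (new_tokens - 1)


if __name__ == "__main__":
    prompt_len, new_tokens = 1000, 500
    print("without cache:", positions_processed_without_cache(prompt_len, new_tokens))  # 624750
    print("with cache:   ", positions_processed_with_cache(prompt_len, new_tokens))     # 1499
```

That gap is what the cache buys, and the GPU memory holding those cached keys and values is what it costs.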

11. Transformer inference (the part everyone pictures)

Per token: embed → run through N transformer layers (self-attention + feed-forward) → probability distribution over vocabulary → sample next token.

💡 Common Mistake: Higher temperature ≠ smarter model. It just changes sampling randomness.
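
Here's a minimal sketch of what temperature does in that last "sample next token" step. The logits are made up, and real decoders usually layer top-p or top-k filtering on top, which this leaves out.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature rescales them first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Pick a token index at random, weighted by the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]  # made-up scores for a 3-token vocabulary
    for t in (0.5, 1.0, 2.0):
        print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

The ranking of tokens never changes with temperature. Low values pile probability onto the top choice, high values flatten the distribution so unlikely tokens get picked more often.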

12. Streaming: why it feels like typing

Prompt → [t1] → [t1,t2] → [t1,t2,t3] → ...

Tokens generate one at a time (autoregressive) and get streamed to you as each one is ready — usually via Server-Sent Events.

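💡 Performance Tip: Always stream user-facing responses longer than a sentence. Total generation time is about the same, but perceived latency drops massively: the first token shows up within a few hundred ms instead of the user staring at a blank screen until the whole response is done.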
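
On the wire, Server-Sent Events are just text lines: "data:" fields, with a blank line marking the end of each event. Here's a minimal parser sketch for that framing. It only handles the data field and skips everything else, and the payloads in the example are made up rather than any provider's real event format.

```python
def iter_sse_data(lines):
    """Yield the data payload of each event from an iterable of SSE text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the payload
            buffer.append(value)
        # Other fields (event:, id:, retry:) and ":" comment lines are ignored here.


if __name__ == "__main__":
    stream = ["event: delta", "data: Hel", "", "data: lo", "", ": keep-alive", "data: !", ""]
    for chunk in iter_sse_data(stream):
        print(chunk, end="", flush=True)
    print()
```

In a real client you'd feed this from the HTTP response body as it arrives and render each chunk immediately, which is where the "typing" effect comes from.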

13. Billing, running in parallel

Input tokens + output tokens metered (cached tokens often cheaper). Feeds your invoice and sometimes real-time quota checks back into the rate limiter from step 5.

💡 A long repeated system prompt quietly becomes a big line item unless the provider discounts the repeated prefix via caching.
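
To see how much a cached prefix can matter, here's a rough cost sketch. Every price and the cache discount below are hypothetical placeholders, so check your provider's pricing page for real numbers. Some providers also charge extra to write to the cache, which this ignores.

```python
def estimate_cost(input_tokens, output_tokens, cached_tokens=0, *,
                  input_price, output_price, cached_multiplier):
    """Estimate the cost of one request in dollars.

    Prices are per million tokens. cached_tokens is the part of input_tokens
    served from a prompt cache, billed at input_price * cached_multiplier.
    """
    uncached = input_tokens - cached_tokens
    total = (
        uncached * input_price
        + cached_tokens * input_price * cached_multiplier
        + output_tokens * output_price
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical prices: $3 / $15 per million input / output tokens, cached input at 10%.
    prices = dict(input_price=3.0, output_price=15.0, cached_multiplier=0.1)
    no_cache = estimate_cost(10_000, 500, **prices)
    with_cache = estimate_cost(10_000, 500, cached_tokens=8_000, **prices)
    print(f"no cache:   ${no_cache:.4f}")
    print(f"with cache: ${with_cache:.4f}")
```

With these made-up numbers, caching an 8,000-token prefix takes a request from about $0.0375 to about $0.0159. Multiply that by your request volume and the "big line item" stops being abstract.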

The whole pipeline, one diagram

Your Code → SDK → DNS → TLS → Load Balancer → Gateway (auth/limit/validate)
   → Logging → Tokenization → Routing → GPU Scheduling → KV Cache
   → Inference → Generation → Streaming → (Billing, parallel) → Your Code

~15 systems, different teams, different hardware, all cooperating to get the first token back to you, often in under a second.

TL;DR

This is a distributed systems problem first, ML problem second
Reuse connections — DNS + TLS cost adds up
Tokens = cost + context budget, treat them as a resource
Latency variance = GPU batching, not "harder thinking"
KV cache = why long chats cost more server-side
Streaming = better perceived speed, not better actual speed

Discussion: As agent chains stack tool calls on tool calls, how much of this overhead gets duplicated at every hop — and what should get collapsed into one shared layer instead?

Provider-specific details (routing, scheduling, caching) vary — the patterns above are common across large-scale LLM serving systems, not any one provider's exact internals.
