<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Henry Li</title>
    <description>The latest articles on DEV Community by Henry Li (@henry9527).</description>
    <link>https://dev.to/henry9527</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3887743%2F90852950-095f-47a8-948e-2e3169072723.ico</url>
      <title>DEV Community: Henry Li</title>
      <link>https://dev.to/henry9527</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/henry9527"/>
    <language>en</language>
    <item>
      <title>Stop Letting Your LLM Bill Spiral: Building a Multi-Tenant Gateway in Spring Boot</title>
      <dc:creator>Henry Li</dc:creator>
      <pubDate>Mon, 04 May 2026 18:08:28 +0000</pubDate>
      <link>https://dev.to/henry9527/stop-letting-your-llm-bill-spiral-building-a-multi-tenant-gateway-in-spring-boot-1599</link>
      <guid>https://dev.to/henry9527/stop-letting-your-llm-bill-spiral-building-a-multi-tenant-gateway-in-spring-boot-1599</guid>
      <description>&lt;p&gt;A team I worked with shipped their first LLM feature in two weeks. Six weeks later, they got a $47,000 OpenAI bill — for a free tier product.&lt;/p&gt;

&lt;p&gt;The post-mortem found three things: one tenant ran a script that retried failed requests indefinitely, another had a buggy prompt that asked the model to "respond in ten thousand tokens," and a third was just abusive — they had discovered the API key was effectively unlimited and were running batch jobs through it.&lt;/p&gt;

&lt;p&gt;There was no rate limit. No per-tenant budget. No cost ceiling. No audit trail. Just direct SDK calls from the application code straight to OpenAI.&lt;/p&gt;

&lt;p&gt;If your team is shipping LLM features the same way, this post is for you. We will walk through a runnable Spring Boot LLM Gateway that sits between your clients and the provider, enforcing API keys, rate limits, token budgets, caching, and audit logging — the controls you need before going to production, not after.&lt;/p&gt;

&lt;p&gt;Full source code, Docker Compose stack, and 9 execution screenshots are at &lt;a href="https://exesolution.com/solutions/spring-boot-llm-gateway-multitenant-quotas" rel="noopener noreferrer"&gt;exesolution.com&lt;/a&gt;. This post covers the architecture and the key design decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Direct SDK Usage Doesn't Give You
&lt;/h2&gt;

&lt;p&gt;When your application code calls OpenAI directly, every request looks the same to the provider. They see one API key, one source, one bill. You can't:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scope keys per tenant.&lt;/strong&gt; A single shared key means one bad tenant takes down the whole product. Rotation is impossible without a coordinated multi-deploy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cap spend per tenant.&lt;/strong&gt; Without a gateway, you find out you have blown the monthly budget when the invoice arrives. You can't throttle in real time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Block runaway responses.&lt;/strong&gt; A buggy prompt asking for 10,000 tokens executes happily. The provider does not know it is wrong; you only know after the fact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache deterministic calls.&lt;/strong&gt; Identical requests with temperature=0 are paid for every time. There is no shared cache layer because there is no shared layer at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit anything.&lt;/strong&gt; When a customer complains "your AI gave me wrong information," you cannot reconstruct what was sent, what came back, or what model was used. The data is in OpenAI's logs, which you cannot query.&lt;/p&gt;

&lt;p&gt;A gateway is the standard fix. The question is what controls it actually enforces.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gateway Architecture
&lt;/h2&gt;

&lt;p&gt;The request pipeline has eight stages, each enforcing one specific concern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client
  POST /v1/chat/completions
  Authorization: Bearer &amp;lt;tenant_api_key&amp;gt;

Stage 1: Authentication       -&amp;gt; hashed key lookup, tenant resolution
Stage 2: Input normalization  -&amp;gt; canonicalize model/params, count bytes
Stage 3: Policy decision      -&amp;gt; ALLOW / DEGRADE / BLOCK
Stage 4: Quota enforcement    -&amp;gt; rate limit + budget check (Redis)
Stage 5: Cache lookup         -&amp;gt; only if temperature=0 and policy allows
Stage 6: Provider call        -&amp;gt; bounded timeout, circuit breaker
Stage 7: Response filtering   -&amp;gt; strip provider metadata, redact PII
Stage 8: Audit + rollup       -&amp;gt; write to PostgreSQL, increment counters

Client receives response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The architecture has three storage components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt; holds the durable state: tenants, hashed API keys, policies, audit logs, daily usage rollups. Everything that survives a restart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redis&lt;/strong&gt; holds the hot path: per-tenant rate limit counters, in-flight request semaphores, optional response cache. Optional but strongly recommended for any meaningful QPS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stateless gateway instances&lt;/strong&gt; sit behind a load balancer. All state lives in PostgreSQL and Redis, so you can scale horizontally without coordination.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Enforcement Modes
&lt;/h2&gt;

&lt;p&gt;This is the design decision that makes or breaks the gateway. Most teams default to either "block everything that exceeds limits" or "log everything but never block." Both are wrong in different ways.&lt;/p&gt;

&lt;p&gt;The gateway supports three modes, configured per tenant per policy:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HARD&lt;/strong&gt; — Reject the request when the limit is hit. Returns 429 (rate limit) or 402 (budget exhausted) with a reason code. Use for tenants on metered plans where overage isn't allowed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SOFT&lt;/strong&gt; — Degrade the request instead of rejecting it. The gateway rewrites the request: switches to a cheaper model, lowers max_tokens, tightens parameters. The user gets a response — just not the premium-quality one. Use during traffic spikes where degraded service is better than a 4xx.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OBSERVE&lt;/strong&gt; — Allow the request but flag it in the audit log. Critical for rolling out a new policy: you see exactly which tenants would have been blocked or degraded, without actually impacting them. Validate the policy with real traffic before flipping to HARD.&lt;/p&gt;

&lt;p&gt;The OBSERVE mode is the practical one. You are never going to get policy thresholds right on the first try. Setting them, running in OBSERVE for two weeks, reviewing the would-have-blocked traffic, then switching to HARD or SOFT is the only safe rollout path.&lt;/p&gt;
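
&lt;p&gt;To make the modes concrete, here is a minimal sketch of the decision step. The names (&lt;code&gt;PolicyEngine&lt;/code&gt;, &lt;code&gt;PolicyDecision&lt;/code&gt;) are illustrative, not the solution's actual classes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;enum EnforcementMode { HARD, SOFT, OBSERVE }

record PolicyDecision(String action, String reasonCode) {}

class PolicyEngine {
    // Maps a limit check onto the three modes described above.
    PolicyDecision decide(EnforcementMode mode, boolean limitExceeded, String reason) {
        if (!limitExceeded) return new PolicyDecision("ALLOW", null);
        return switch (mode) {
            case HARD    -&amp;gt; new PolicyDecision("BLOCK", reason);   // client sees 429 or 402
            case SOFT    -&amp;gt; new PolicyDecision("DEGRADE", reason); // rewritten to a cheaper model
            case OBSERVE -&amp;gt; new PolicyDecision("ALLOW", reason);   // served, but flagged in the audit log
        };
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;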

&lt;h2&gt;
  
  
  Data Model
&lt;/h2&gt;

&lt;p&gt;Five tables cover the durable state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;tenants&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;&lt;span class="k"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;status&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ACTIVE&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="k"&gt;SUSPENDED&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;created&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;at&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;api_keys&lt;/strong&gt; — keys are never stored in plaintext&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;&lt;span class="k"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;tenant&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;scopes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;created&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;last&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;used&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;rotated&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;at&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;policies&lt;/strong&gt; — one row per tenant&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;&lt;span class="k"&gt;tenant&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;allowed&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;models&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;json&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;prompt&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;input&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;output&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;rate&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;limit&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;rps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;inflight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;daily&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;budget&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;usd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;monthly&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;budget&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;usd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;daily&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;token&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;cap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;monthly&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;token&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;cap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;enforcement&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;mode&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;HARD&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="k"&gt;SOFT&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="k"&gt;OBSERVE&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="k"&gt;redact&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;mode&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NONE&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="k"&gt;BASIC&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="k"&gt;STRICT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;usage_rollup_daily&lt;/strong&gt; — append-only counters, fast to aggregate&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;&lt;span class="k"&gt;tenant&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;tokens&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;tokens&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;cost&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;usd&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;est&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;blocked&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;requests&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;audit_log&lt;/strong&gt; — one row per request&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;&lt;span class="k"&gt;request&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;tenant&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;request&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;latency&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;tokens&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;tokens&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;cost&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;usd&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;est&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;decision&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ALLOW&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="k"&gt;BLOCK&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="k"&gt;DEGRADE&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;reason&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;trace&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;prompt&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;redacted&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;response&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;redacted&lt;/span&gt;    &lt;span class="err"&gt;--&lt;/span&gt; &lt;span class="k"&gt;nullable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;policy&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;driven&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The split between &lt;code&gt;usage_rollup_daily&lt;/code&gt; and &lt;code&gt;audit_log&lt;/code&gt; matters. The rollup is queried in the hot path on every request to check budget; it is small and indexed by &lt;code&gt;(tenant_id, date)&lt;/code&gt;. The audit log is much larger but only queried during incident investigation. Don't merge them.&lt;/p&gt;
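
&lt;p&gt;As a sketch of that hot-path read, assuming Spring Data JPA (the entity and repository names below are illustrative; the solution's Flyway schema is canonical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.Optional;
import org.springframework.data.jpa.repository.JpaRepository;

// Illustrative mapping of usage_rollup_daily.
@Entity
class UsageRollupDaily {
    @Id Long id;
    String tenantId;
    LocalDate date;
    long requests;
    long tokensIn;
    long tokensOut;
    BigDecimal costUsdEst;
    long blockedRequests;
}

interface UsageRollupRepository extends JpaRepository&amp;lt;UsageRollupDaily, Long&amp;gt; {
    // One indexed read per request, backed by the (tenant_id, date) index.
    Optional&amp;lt;UsageRollupDaily&amp;gt; findByTenantIdAndDate(String tenantId, LocalDate date);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;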

&lt;h2&gt;
  
  
  API Key Handling
&lt;/h2&gt;

&lt;p&gt;Three rules, no exceptions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keys are hashed at rest.&lt;/strong&gt; SHA-256 with a server-side salt shared by all gateway instances (the hashes live in shared PostgreSQL, so every instance must compute the same digest). Constant-time comparison on lookup. The raw key is shown to the user once, at creation time, and then never again. If they lose it, they rotate it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Authorization header is never logged.&lt;/strong&gt; Every audit entry references &lt;code&gt;key_id&lt;/code&gt; (the database primary key), not the actual key value. Logs that capture HTTP requests have an explicit filter for the Authorization header.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rotation is graceful.&lt;/strong&gt; When a tenant rotates a key, the new key becomes active immediately. The old key continues working for a configurable grace period (default 24 hours) so deployments can roll out without downtime, then is automatically revoked.&lt;/p&gt;

&lt;p&gt;This is straightforward Spring Security with a custom &lt;code&gt;AuthenticationProvider&lt;/code&gt;. Nothing fancy — just disciplined.&lt;/p&gt;
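
&lt;p&gt;For reference, a minimal sketch of the hashing and comparison rules, assuming the salt is injected from secure configuration (the constant below is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

final class ApiKeyHasher {
    // Placeholder: in the real service this comes from configuration, never source code.
    private static final byte[] SALT = "replace-with-configured-salt".getBytes(StandardCharsets.UTF_8);

    static String hash(String rawKey) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(SALT);
            return HexFormat.of().formatHex(md.digest(rawKey.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always present on the JVM
        }
    }

    // MessageDigest.isEqual compares without short-circuiting, so timing
    // cannot leak how many leading characters of a guessed key are correct.
    static boolean matches(String rawKey, String storedHash) {
        return MessageDigest.isEqual(
                hash(rawKey).getBytes(StandardCharsets.UTF_8),
                storedHash.getBytes(StandardCharsets.UTF_8));
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;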

&lt;h2&gt;
  
  
  Rate Limiting and Budget Enforcement
&lt;/h2&gt;

&lt;p&gt;Both run in Redis and follow the same pattern: a per-tenant counter updated on each request and checked against the policy threshold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limiting&lt;/strong&gt; is per-tenant requests-per-second using a token bucket algorithm. The bucket size and refill rate come from the tenant's policy. A semaphore counter enforces &lt;code&gt;max_inflight&lt;/code&gt; to prevent a tenant from queueing thousands of concurrent requests.&lt;/p&gt;
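
&lt;p&gt;A sketch of the Redis side, assuming Spring Data Redis. The key layout (&lt;code&gt;rl:&amp;lt;tenant&amp;gt;&lt;/code&gt;) and hash fields are illustrative, not the solution's actual schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.time.Instant;
import java.util.List;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.data.redis.core.script.RedisScript;

class TokenBucketLimiter {
    // Refill-on-read token bucket, executed atomically inside Redis.
    private static final RedisScript&amp;lt;Long&amp;gt; SCRIPT = RedisScript.of("""
            local cap    = tonumber(ARGV[1])
            local rate   = tonumber(ARGV[2])
            local now    = tonumber(ARGV[3])
            local tokens = tonumber(redis.call('HGET', KEYS[1], 'tokens') or cap)
            local last   = tonumber(redis.call('HGET', KEYS[1], 'ts') or now)
            tokens = math.min(cap, tokens + (now - last) * rate)
            if tokens &amp;lt; 1 then return 0 end
            redis.call('HSET', KEYS[1], 'tokens', tokens - 1, 'ts', now)
            redis.call('EXPIRE', KEYS[1], 120)
            return 1
            """, Long.class);

    private final StringRedisTemplate redis;

    TokenBucketLimiter(StringRedisTemplate redis) { this.redis = redis; }

    boolean tryAcquire(String tenantId, int capacity, int refillPerSec) {
        Long allowed = redis.execute(SCRIPT, List.of("rl:" + tenantId),
                String.valueOf(capacity), String.valueOf(refillPerSec),
                String.valueOf(Instant.now().getEpochSecond()));
        return Long.valueOf(1L).equals(allowed);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;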

&lt;p&gt;&lt;strong&gt;Budget enforcement&lt;/strong&gt; is more interesting because the cost is not known until the response comes back. The flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Before the call: estimate the cost using the request's &lt;code&gt;max_tokens&lt;/code&gt; parameter and the configured price-per-token table. Check the estimate against the remaining budget.&lt;/li&gt;
&lt;li&gt;If the estimate would exceed the budget: apply HARD/SOFT/OBSERVE per the enforcement mode.&lt;/li&gt;
&lt;li&gt;After the call: parse the actual &lt;code&gt;usage&lt;/code&gt; object from the provider response, compute the actual cost, and update &lt;code&gt;usage_rollup_daily&lt;/code&gt; with the real number.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The pre-call estimate prevents a single 10,000-token request from blowing the monthly budget. The post-call true-up keeps the rollup accurate. The two-step approach is the only way to get both safety and accuracy.&lt;/p&gt;
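
&lt;p&gt;In code, the two steps reduce to two small computations. &lt;code&gt;PriceTable&lt;/code&gt; is an illustrative stand-in for the solution's configurable, versioned price table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.math.BigDecimal;

interface PriceTable {
    BigDecimal inputPricePerToken(String model);
    BigDecimal outputPricePerToken(String model);
}

class BudgetGuard {
    // Step 1, before the call: worst case, as if the model emits all of max_tokens.
    BigDecimal estimate(PriceTable p, String model, int promptTokens, int maxTokens) {
        return p.inputPricePerToken(model).multiply(BigDecimal.valueOf(promptTokens))
                .add(p.outputPricePerToken(model).multiply(BigDecimal.valueOf(maxTokens)));
    }

    // Step 3, after the call: recompute from the provider's actual usage object.
    BigDecimal trueUp(PriceTable p, String model, int tokensIn, int tokensOut) {
        return p.inputPricePerToken(model).multiply(BigDecimal.valueOf(tokensIn))
                .add(p.outputPricePerToken(model).multiply(BigDecimal.valueOf(tokensOut)));
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;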

&lt;h2&gt;
  
  
  Caching
&lt;/h2&gt;

&lt;p&gt;Caching LLM responses is dangerous if you are not careful. Two requests that look identical can have different intended outputs because of upstream context the gateway cannot see. So the cache only activates when:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The policy explicitly allows caching for this route, AND&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;temperature=0&lt;/code&gt; (deterministic output), AND&lt;/li&gt;
&lt;li&gt;The cache key includes &lt;code&gt;tenant_id + model + canonicalized prompt + relevant params&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;tenant_id&lt;/code&gt; in the cache key prevents cross-tenant leakage even if two tenants happen to send identical prompts. TTL is configured per route — short for personalized routes, long for generic prompts.&lt;/p&gt;
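
&lt;p&gt;A sketch of the key construction; canonicalizing the prompt and parameters (stage 2 of the pipeline) is assumed to have happened already:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

final class CacheKeys {
    // The tenant also appears as a plain prefix, so cross-tenant hits are
    // structurally impossible even in the event of a hash collision.
    static String cacheKey(String tenantId, String model,
                           String canonicalPrompt, String canonicalParams) throws Exception {
        String material = String.join("|", tenantId, model, canonicalPrompt, canonicalParams);
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(material.getBytes(StandardCharsets.UTF_8));
        return "llmcache:" + tenantId + ":" + HexFormat.of().formatHex(digest);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;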

&lt;p&gt;Every cache hit is recorded in the audit log with &lt;code&gt;cache_hit=true&lt;/code&gt;. This matters for billing: cached responses do not incur provider cost, so the rollup correctly shows zero cost for those requests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure Modes
&lt;/h2&gt;

&lt;p&gt;This is the section most gateway tutorials skip, and it is the section that determines whether the gateway is actually production-ready.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provider outage (5xx, timeout)&lt;/strong&gt; — Bounded retry (1-2 attempts) on transient errors only. Circuit breaker (Resilience4j) sheds load when the provider is consistently failing. Optional fallback: degrade to a cheaper alternative model.&lt;/p&gt;
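
&lt;p&gt;A sketch of the provider-call wrapping with Resilience4j. The thresholds are illustrative, and &lt;code&gt;callProvider&lt;/code&gt; stands in for the actual HTTP call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.time.Duration;
import java.util.function.Supplier;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

class GuardedProviderCall {
    Supplier&amp;lt;String&amp;gt; guard(Supplier&amp;lt;String&amp;gt; callProvider) {
        CircuitBreaker breaker = CircuitBreaker.of("openai", CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                        // open after 50% failures...
                .slidingWindowSize(20)                           // ...over the last 20 calls
                .waitDurationInOpenState(Duration.ofSeconds(30))
                .build());
        Retry retry = Retry.of("openai", RetryConfig.custom()
                .maxAttempts(2)                                  // bounded: original call + 1 retry
                .build());
        // Retry wraps the breaker, so every attempt counts toward opening it.
        return Retry.decorateSupplier(retry,
                CircuitBreaker.decorateSupplier(breaker, callProvider));
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;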

&lt;p&gt;&lt;strong&gt;Redis unavailable&lt;/strong&gt; — Configurable behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HARD-FAIL: block all requests until Redis recovers (strict, but predictable)&lt;/li&gt;
&lt;li&gt;SOFT-FAIL: allow requests but log &lt;code&gt;quota_unavailable&lt;/code&gt; (risky — tenants can exceed budgets undetected)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The default is HARD-FAIL. SOFT-FAIL is only appropriate when paired with strict per-instance rate limiting as a fallback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget calculation drift&lt;/strong&gt; — The pre-call estimate uses an approximate token count. The post-call true-up uses the provider's actual &lt;code&gt;usage&lt;/code&gt; field. Daily rollups reconcile based on actuals. The price table is versioned, so historical audit records remain accurate even after pricing changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key leakage&lt;/strong&gt; — Hashed keys at rest, fast rotation, per-key rate limits as a circuit breaker if anomalous traffic is detected on a single key.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This brings up: Spring Boot gateway, PostgreSQL, Redis, and a mock provider for testing without burning real OpenAI tokens.&lt;/p&gt;

&lt;p&gt;Bootstrap a tenant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/admin/tenants &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name":"team-a"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Issue an API key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/admin/tenants/&amp;lt;tenant-id&amp;gt;/keys
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response shows the raw key once. Save it. You won't see it again.&lt;/p&gt;

&lt;p&gt;Set a policy with a low budget for testing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; PUT http://localhost:8080/admin/tenants/&amp;lt;tenant-id&amp;gt;/policy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "allowedModels": ["gpt-4o-mini"],
    "maxOutputTokens": 200,
    "rateLimitRps": 5,
    "dailyBudgetUsd": 0.10,
    "enforcementMode": "HARD"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Call the gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer sk-tenant-XXXXX"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "gpt-4o-mini",
    "messages": [{"role":"user","content":"Hello"}]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Trigger a budget block by running the call in a loop until the daily limit is hit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..50&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer sk-tenant-XXXXX"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model":"gpt-4o-mini","messages":[{"role":"user","content":"test"}]}'&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Eventually you will see &lt;code&gt;BUDGET_EXCEEDED&lt;/code&gt; in the response. Then inspect the audit log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; admin:admin-secret &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"http://localhost:8080/admin/audit?tenantId=&amp;lt;tenant-id&amp;gt;&amp;amp;limit=10"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each entry shows tokens, cost, decision (ALLOW/BLOCK), and a reason code.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's in the Full Solution
&lt;/h2&gt;

&lt;p&gt;The verified solution at exesolution.com contains everything to run this from scratch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete Spring Boot project: gateway controller, policy engine, rate limiter, audit writer, admin endpoints&lt;/li&gt;
&lt;li&gt;PostgreSQL schema with Flyway migrations for all 5 tables, including indexes for the hot-path queries&lt;/li&gt;
&lt;li&gt;Redis-backed token bucket implementation and in-flight semaphore&lt;/li&gt;
&lt;li&gt;Spring Security configuration: API key authentication for tenant routes, HTTP Basic for admin routes&lt;/li&gt;
&lt;li&gt;Docker Compose stack: gateway + PostgreSQL + Redis + mock provider&lt;/li&gt;
&lt;li&gt;Configurable price table for cost estimation across multiple models&lt;/li&gt;
&lt;li&gt;9 evidence screenshots: build, startup, health, create tenant, issue API key, update policy, tenant call, admin visibility, usage dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://exesolution.com/solutions/spring-boot-llm-gateway-multitenant-quotas" rel="noopener noreferrer"&gt;Full solution + runnable code + evidence at exesolution.com&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Free registration required to access the code bundle and evidence images.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where This Pays Off
&lt;/h2&gt;

&lt;p&gt;The gateway pattern adds development time upfront, no question. The case for it gets clearer as you scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first time a tenant burns through their monthly budget in a day and you can throttle them in real time without redeploying — instead of finding out from the invoice.&lt;/li&gt;
&lt;li&gt;The first time a customer reports "your AI gave me wrong information" and you can reconstruct the exact request from the audit log in 30 seconds.&lt;/li&gt;
&lt;li&gt;The first time you rotate a leaked key without coordinating a multi-service deploy.&lt;/li&gt;
&lt;li&gt;The first time OBSERVE mode tells you a new policy would have blocked 12% of legitimate traffic, before you flip it to HARD.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are shipping LLM features in Spring Boot, the gateway is not a nice-to-have. It is the layer that lets you sleep at night.&lt;/p&gt;




&lt;p&gt;Have questions about a specific part of the pipeline — rate limiting algorithm, audit log schema, key rotation flow? Drop a comment below.&lt;/p&gt;

</description>
      <category>springboot</category>
      <category>java</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>You're Flying Blind: Adding LLM Observability to Spring AI with OpenTelemetry and Self-Hosted Langfuse</title>
      <dc:creator>Henry Li</dc:creator>
      <pubDate>Sat, 25 Apr 2026 14:04:08 +0000</pubDate>
      <link>https://dev.to/henry9527/youre-flying-blind-adding-llm-observability-to-spring-ai-with-opentelemetry-and-self-hosted-5gj4</link>
      <guid>https://dev.to/henry9527/youre-flying-blind-adding-llm-observability-to-spring-ai-with-opentelemetry-and-self-hosted-5gj4</guid>
      <description>&lt;p&gt;Your Spring Boot service returns 200 OK. Latency looks fine in Datadog. Users are complaining the answers are wrong and slow.&lt;/p&gt;

&lt;p&gt;You open the logs. Nothing useful. You check your APM traces. HTTP span: 1.2 seconds. Business logic: 40ms. That leaves 1.16 seconds completely unaccounted for — inside the LLM call, where your standard tooling sees nothing.&lt;/p&gt;

&lt;p&gt;This is the observability gap in every LLM-enabled Java service. Standard APM tools were not built to capture what actually matters: which prompt triggered which model, how many tokens it consumed, what it cost, whether the tool call chain stalled on the third retry, or which span in a multi-step RAG pipeline blew the latency budget.&lt;/p&gt;

&lt;p&gt;This post walks through a runnable setup that closes that gap: Spring AI plus OpenTelemetry plus self-hosted Langfuse, fully containerized, with no data leaving your infrastructure.&lt;/p&gt;

&lt;p&gt;The full solution with source code, Docker Compose, and 11 execution screenshots is at &lt;a href="https://exesolution.com/solutions/spring-ai-opentelemetry-langfuse-observability" rel="noopener noreferrer"&gt;exesolution.com&lt;/a&gt;. This post covers the core problem, the trace architecture, and the key configuration decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Can't See Without LLM-Specific Tracing
&lt;/h2&gt;

&lt;p&gt;Before getting into the setup, it is worth being specific about what is missing. Most teams discover these gaps the hard way:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency attribution.&lt;/strong&gt; A request takes 3 seconds. Your APM shows the HTTP span. It does not show whether the latency came from the embedding call, the LLM completion, a tool invocation, or a retry on a transient 429. You cannot fix what you cannot locate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token and cost accumulation.&lt;/strong&gt; In a chain with retrieval, reranking, a summarization step, and a final completion, tokens accumulate across multiple model calls. Without per-span token metadata, your cost reports are aggregates that tell you you are spending money but not where.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt correlation.&lt;/strong&gt; When a user reports a bad answer, you need to know the exact prompt that produced it, the model version, and the full context window. Without trace-level prompt capture, incident investigation is manual and slow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-service correlation.&lt;/strong&gt; An upstream HTTP request triggers an async enrichment job that calls an LLM. Without W3C &lt;code&gt;traceparent&lt;/code&gt; propagation through the LLM span, these two halves of the trace appear in separate, unrelated records.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sensitive data control.&lt;/strong&gt; You need observability, but you cannot send prompt content to a third-party SaaS. Self-hosted tracing is the only viable path in regulated environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Trace Architecture
&lt;/h2&gt;

&lt;p&gt;The setup has four components in a single Docker Compose stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Spring Boot Application
    -&amp;gt; OpenTelemetry Java SDK (in-process)
        -&amp;gt; OTLP Exporter (HTTP/protobuf)
            -&amp;gt; Langfuse ingestion endpoint (port 4318)
                -&amp;gt; PostgreSQL (trace storage)
                -&amp;gt; Langfuse UI (trace inspection)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Spring AI generates the spans.&lt;/strong&gt; When you call &lt;code&gt;ChatClient&lt;/code&gt;, Spring AI wraps the model invocation in an OpenTelemetry span automatically. Tool calls, embedding calls, and retries each get child spans. You do not write instrumentation code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The OTel SDK handles propagation and export.&lt;/strong&gt; W3C trace context flows from inbound HTTP requests through business logic spans into LLM spans — all linked in one trace. The SDK batches spans and exports them via OTLP without blocking the application thread.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Langfuse receives and stores everything.&lt;/strong&gt; It is the same Langfuse you may know from the Python world, but self-hosted: PostgreSQL for persistence, its own ingestion API on port 4318, and a UI for trace inspection, with filtering by model, cost, and latency, plus prompt review.&lt;/p&gt;

&lt;p&gt;The key architectural decision: Langfuse runs on your infrastructure. Prompts, responses, and token metadata never leave your network. This matters for compliance and is non-negotiable in many enterprise contexts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Each Span Carries
&lt;/h2&gt;

&lt;p&gt;Once running, every &lt;code&gt;ChatClient&lt;/code&gt; call produces a span with these attributes visible in the Langfuse UI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="err"&gt;llm.model&lt;/span&gt;              &lt;span class="err"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;"gpt-4o-mini"&lt;/span&gt;
&lt;span class="err"&gt;llm.prompt_tokens&lt;/span&gt;      &lt;span class="err"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;312&lt;/span&gt;
&lt;span class="err"&gt;llm.completion_tokens&lt;/span&gt;  &lt;span class="err"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;87&lt;/span&gt;
&lt;span class="err"&gt;llm.total_tokens&lt;/span&gt;       &lt;span class="err"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;399&lt;/span&gt;
&lt;span class="err"&gt;llm.latency_ms&lt;/span&gt;         &lt;span class="err"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;1143&lt;/span&gt;
&lt;span class="err"&gt;error.type&lt;/span&gt;             &lt;span class="err"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;(present&lt;/span&gt; &lt;span class="err"&gt;only&lt;/span&gt; &lt;span class="err"&gt;on&lt;/span&gt; &lt;span class="err"&gt;failure)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nested under the LLM span: tool call spans (if your &lt;code&gt;ChatClient&lt;/code&gt; uses tools), each with their own latency and result status. Nested under those: any downstream spans from calls the tool makes.&lt;/p&gt;

&lt;p&gt;The Langfuse UI groups these into a flame graph per trace. You can filter by model, sort by token count, drill into a specific prompt, or search for traces where &lt;code&gt;error.type&lt;/code&gt; is set.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Three environment variable blocks wire the stack together.&lt;/p&gt;

&lt;p&gt;Spring Boot application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;SPRING_PROFILES_ACTIVE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;otel
&lt;span class="nv"&gt;SPRING_AI_OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OpenTelemetry Java SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;OTEL_SERVICE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;spring-ai-llm-service
&lt;span class="nv"&gt;OTEL_TRACES_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;otlp
&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://langfuse:4318
&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_PROTOCOL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http/protobuf

&lt;span class="nv"&gt;OTEL_TRACES_SAMPLER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;parentbased_traceidratio
&lt;span class="nv"&gt;OTEL_TRACES_SAMPLER_ARG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.2

&lt;span class="nv"&gt;OTEL_RESOURCE_ATTRIBUTES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;deployment.environment&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;local&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sampling configuration deserves a note. &lt;code&gt;parentbased_traceidratio&lt;/code&gt; at 0.2 means 20 percent of traces are sampled: enough for operational visibility without the storage overhead of 100 percent capture. Error spans bypass sampling and are always recorded. For debugging sessions, bump the ratio to 1.0 and restart the container; it is a configuration change, not a code change.&lt;/p&gt;

&lt;p&gt;Langfuse (self-hosted):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;LANGFUSE_PUBLIC_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lf_pk_xxxx
&lt;span class="nv"&gt;LANGFUSE_SECRET_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lf_sk_xxxx
&lt;span class="nv"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgresql://langfuse:langfuse@postgres:5432/langfuse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Running It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose pull
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Startup order is managed by Compose health checks: PostgreSQL first, then Langfuse services, then the Spring Boot application. No manual sequencing needed.&lt;/p&gt;

&lt;p&gt;Verify the Spring Boot application is up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:8080/actuator/health
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The expected response shows &lt;code&gt;status: UP&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Verify the Langfuse UI is up by opening &lt;code&gt;http://localhost:3000&lt;/code&gt; in a browser. Log in with the credentials from your &lt;code&gt;.env&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;Trigger a trace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/api/chat &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"message": "Summarize the quarterly report"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What to look for in the Langfuse UI: open the Traces view. You should see an entry for &lt;code&gt;spring-ai-llm-service&lt;/code&gt;. Expand it — you will see an HTTP span at the root, a business logic span below it, and an LLM invocation span as a child of that. Click the LLM span: model name, token counts, and latency are in the attributes panel on the right.&lt;/p&gt;

&lt;p&gt;If you called any tools, each tool call appears as a child span of the LLM span, with its own duration and result status.&lt;/p&gt;
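
&lt;p&gt;The actual controller ships with the solution; a minimal version of the endpoint the curl hits, assuming the auto-configured &lt;code&gt;ChatClient.Builder&lt;/code&gt;, looks roughly like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.util.Map;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
class ChatController {
    private final ChatClient chatClient;

    ChatController(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    @PostMapping("/api/chat")
    Map&amp;lt;String, String&amp;gt; chat(@RequestBody Map&amp;lt;String, String&amp;gt; body) {
        // Spring AI wraps this call in the LLM span: model name, token
        // counts, and latency land in its attributes automatically.
        String answer = chatClient.prompt()
                .user(body.get("message"))
                .call()
                .content();
        return Map.of("answer", answer);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;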

&lt;h2&gt;
  
  
  Prompt and Response Redaction
&lt;/h2&gt;

&lt;p&gt;By default, prompt and response content is captured in span attributes. For environments where this is not acceptable, two options:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metadata-only mode.&lt;/strong&gt; Disable payload capture entirely. Token counts and latency are retained; prompt and response content are not recorded. One configuration flag, no code change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partial redaction.&lt;/strong&gt; Apply regex-based masking in the OTEL instrumentation layer before spans are exported. PII patterns (emails, phone numbers, account numbers) are replaced with &lt;code&gt;[REDACTED]&lt;/code&gt; in the span attributes. The LLM still receives the full content; only the observability record is masked.&lt;/p&gt;
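
&lt;p&gt;As a sketch, the masking step might look like the following. The patterns are illustrative; the solution drives the rule set from configuration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.util.regex.Pattern;

final class PiiMasker {
    // Illustrative patterns; real deployments tune these per data class.
    private static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+\\.[\\w.]+");
    private static final Pattern PHONE = Pattern.compile("\\+?\\d[\\d\\s()-]{7,}\\d");

    // Applied to span attribute values before export; the LLM payload is untouched.
    static String mask(String attributeValue) {
        String masked = EMAIL.matcher(attributeValue).replaceAll("[REDACTED]");
        return PHONE.matcher(masked).replaceAll("[REDACTED]");
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;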

&lt;p&gt;Both modes are configured in &lt;code&gt;application.yml&lt;/code&gt; with the &lt;code&gt;otel&lt;/code&gt; Spring profile. The full configuration is in the solution at exesolution.com.&lt;/p&gt;

&lt;h2&gt;
  
  
  Operational Notes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;If Langfuse goes down.&lt;/strong&gt; The OTEL batch processor drops spans after the queue saturates. Application traffic is completely unaffected — tracing degrades gracefully. No circuit breaker needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disabling tracing without a redeploy.&lt;/strong&gt; Set &lt;code&gt;OTEL_TRACES_EXPORTER=none&lt;/code&gt; and restart the application container. Tracing stops; everything else continues normally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-OpenAI providers.&lt;/strong&gt; The instrumentation is provider-agnostic. It works with any Spring AI &lt;code&gt;ChatModel&lt;/code&gt; implementation — Anthropic, Azure OpenAI, Ollama, Mistral. The span attributes are populated by Spring AI's abstraction layer, not by provider-specific code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes.&lt;/strong&gt; The same OTEL and Langfuse configuration applies. Docker Compose is provided for local and CI reproducibility; the Kubernetes equivalent is straightforward — deploy Langfuse as a Helm chart and point &lt;code&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/code&gt; at the service.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's in the Full Solution
&lt;/h2&gt;

&lt;p&gt;The verified solution at exesolution.com includes everything to run this from scratch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete Spring Boot project with &lt;code&gt;otel&lt;/code&gt; profile, OTel dependencies, and &lt;code&gt;ChatClient&lt;/code&gt; wiring&lt;/li&gt;
&lt;li&gt;Full Docker Compose stack: Spring Boot app + Langfuse (web + worker) + PostgreSQL&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;application.yml&lt;/code&gt; with sampling, batching, and redaction configuration&lt;/li&gt;
&lt;li&gt;11 evidence screenshots: Docker Compose build, running containers, chat API test, Langfuse dashboard, and five trace detail views showing nested spans with token and latency data&lt;/li&gt;
&lt;li&gt;Verification checklist: services running, traces visible, sampling confirmed, redaction verified&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://exesolution.com/solutions/spring-ai-opentelemetry-langfuse-observability" rel="noopener noreferrer"&gt;Full solution + runnable code + evidence at exesolution.com&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Free registration required to access the code bundle and evidence images.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Practical Case for Self-Hosted
&lt;/h2&gt;

&lt;p&gt;The cloud-hosted Langfuse option is fine for many projects. But if you are in financial services, healthcare, or any context with data residency requirements, sending prompt content to a third-party SaaS is a non-starter. Self-hosted Langfuse on Docker Compose or Kubernetes gives you the same UI and the same trace schema — the only difference is the data never leaves your network.&lt;/p&gt;

&lt;p&gt;The setup in this solution takes about 15 minutes from &lt;code&gt;git clone&lt;/code&gt; to first trace in the UI. That is a reasonable investment for closing the observability gap that every LLM service eventually hits.&lt;/p&gt;




&lt;p&gt;Questions about the OTel configuration or the Langfuse setup? Leave a comment below.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>springboot</category>
      <category>ai</category>
      <category>java</category>
    </item>
    <item>
      <title>MCP Server &amp; Client in Spring AI: Stop Coupling Tools to Your AI Host</title>
      <dc:creator>Henry Li</dc:creator>
      <pubDate>Sun, 19 Apr 2026 19:27:43 +0000</pubDate>
      <link>https://dev.to/henry9527/mcp-server-client-in-spring-ai-stop-coupling-tools-to-your-ai-host-2l21</link>
      <guid>https://dev.to/henry9527/mcp-server-client-in-spring-ai-stop-coupling-tools-to-your-ai-host-2l21</guid>
      <description>&lt;p&gt;If you've built an LLM feature in Spring Boot, you've probably done something like this: created a &lt;code&gt;@Bean&lt;/code&gt; with &lt;code&gt;@Tool&lt;/code&gt;-annotated methods, wired it into your &lt;code&gt;ChatClient&lt;/code&gt;, and shipped it. That works fine — until your tool set grows, multiple AI applications want to reuse the same tools, or you need to update a tool without redeploying the entire AI service.&lt;/p&gt;

&lt;p&gt;That's the problem MCP (Model Context Protocol) solves. This post walks through a two-service setup I built and verified: a standalone MCP Tool Server and an AI Chat Service that discovers tools dynamically over Streamable HTTP — &lt;strong&gt;no restart required when tools change&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The full solution with runnable code, Docker Compose, and execution evidence is at &lt;a href="https://exesolution.com/solutions/MCP-Server-Client-in-Spring-AI-Dynamic-Tool-Discovery" rel="noopener noreferrer"&gt;exesolution.com&lt;/a&gt;. This post covers the core problem and how to get it running locally.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem with In-Process Tool Registration
&lt;/h2&gt;

&lt;p&gt;When you register tools inside the same Spring Boot app that handles LLM interactions, you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deployment coupling&lt;/strong&gt; — every new tool means a new deployment of the AI service, even though the AI logic didn't change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No sharing&lt;/strong&gt; — if three different AI applications need the same "get order status" tool, you copy-paste the implementation into each.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No trust boundary&lt;/strong&gt; — a bug in a tool method can crash the process that's serving your users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Static inventory&lt;/strong&gt; — tools are fixed at startup. Adding one at runtime? Not without a restart.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero visibility&lt;/strong&gt; — tool invocations vanish inside the &lt;code&gt;ChatClient&lt;/code&gt; execution loop with no structured logs or traces.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The naive fix is "just put everything in one service." But once you have 20 tools across 5 domains, that service becomes the new monolith.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solution: Two Services, One Protocol
&lt;/h2&gt;

&lt;p&gt;The setup has two independently deployable Spring Boot apps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User
  └─→ AI Chat Service (:8081)
          └─→ ChatClient (Spring AI)
                  └─→ LLM (gpt-4o-mini)
                  └─→ MCP Client
                          └─→ MCP Tool Server (:8080)  ← POST /mcp
                                  └─→ @Tool-annotated service methods
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;MCP Tool Server&lt;/strong&gt; — owns tool implementations. Exposes them over Streamable HTTP via Spring AI's &lt;code&gt;@Tool&lt;/code&gt; annotation. Deployed and versioned independently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Chat Service&lt;/strong&gt; — user-facing REST API. Knows nothing about specific tools. Uses &lt;code&gt;SyncMcpToolCallbackProvider&lt;/code&gt; to auto-discover whatever tools the server exposes, on every request.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;code&gt;ToolCallbackProvider&lt;/code&gt; re-fetches the tool list from the server on each &lt;code&gt;getToolCallbacks()&lt;/code&gt; call. Add a new &lt;code&gt;@Tool&lt;/code&gt; method on the server, hit the refresh endpoint, and the next conversation picks it up — no restart of either service.&lt;/p&gt;
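
&lt;p&gt;You can watch this happen by dumping the provider's current view of the server. This debug endpoint is hypothetical (it is not part of the solution), but it makes the dynamic discovery visible:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.util.Arrays;
import java.util.List;
import org.springframework.ai.mcp.SyncMcpToolCallbackProvider;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
class ToolInventoryController {
    private final SyncMcpToolCallbackProvider provider;

    ToolInventoryController(SyncMcpToolCallbackProvider provider) {
        this.provider = provider;
    }

    // Each call re-fetches the tool list from the MCP server, so a tool
    // added on the server shows up here without restarting anything.
    @GetMapping("/debug/tools")
    List&amp;lt;String&amp;gt; tools() {
        return Arrays.stream(provider.getToolCallbacks())
                .map(cb -&amp;gt; cb.getToolDefinition().name())
                .toList();
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;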




&lt;h2&gt;
  
  
  Defining a Tool: One Annotation
&lt;/h2&gt;

&lt;p&gt;On the server side, any Spring bean method can become an MCP tool with &lt;code&gt;@Tool&lt;/code&gt; (Spring AI's annotation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Service&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OrderTool&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="nd"&gt;@Tool&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Get the current status and details of an order by its ID"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Object&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getOrderStatus&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="nd"&gt;@ToolParam&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"The unique order identifier, e.g. ORD-12345"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;orderId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;orderRepository&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findById&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orderId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                        &lt;span class="s"&gt;"orderId"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;           &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getId&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
                        &lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;            &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getStatus&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
                        &lt;span class="s"&gt;"estimatedDelivery"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getEstimatedDelivery&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
                        &lt;span class="s"&gt;"items"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;             &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getItems&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                &lt;span class="o"&gt;))&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;orElseThrow&lt;/span&gt;&lt;span class="o"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;
                        &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;IllegalArgumentException&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Order not found: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;orderId&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Spring AI reads the annotation at startup and generates a JSON Schema for the parameters automatically. The LLM receives this schema and knows exactly how to call the tool.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wiring the Client: One Line
&lt;/h2&gt;

&lt;p&gt;On the AI Host side, wiring all server tools into &lt;code&gt;ChatClient&lt;/code&gt; takes one method call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Configuration&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ChatConfig&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="nd"&gt;@Bean&lt;/span&gt;
    &lt;span class="nc"&gt;ChatClient&lt;/span&gt; &lt;span class="nf"&gt;chatClient&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ChatModel&lt;/span&gt; &lt;span class="n"&gt;chatModel&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
                          &lt;span class="nc"&gt;SyncMcpToolCallbackProvider&lt;/span&gt; &lt;span class="n"&gt;toolCallbackProvider&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ChatClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chatModel&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;defaultTools&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;toolCallbackProvider&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// ← entire server tool registry&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From here, when a user asks "What's the status of order ORD-12345?", the LLM decides to call &lt;code&gt;getOrderStatus&lt;/code&gt;, Spring AI dispatches it over MCP, the tool runs on the server, the result comes back, and the LLM incorporates it into the reply — entirely transparent to the controller layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;MCP Tool Server&lt;/strong&gt; (&lt;code&gt;application.properties&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;spring.ai.mcp.server.name&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;tool-server&lt;/span&gt;
&lt;span class="py"&gt;spring.ai.mcp.server.version&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;1.0.0&lt;/span&gt;
&lt;span class="py"&gt;spring.ai.mcp.server.protocol&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;STREAMABLE&lt;/span&gt;
&lt;span class="py"&gt;server.port&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;AI Chat Service&lt;/strong&gt; (&lt;code&gt;application.properties&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;spring.ai.mcp.client.toolcallback.enabled&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;spring.ai.mcp.client.connections.tool-server.url&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;${MCP_SERVER_URL}/mcp&lt;/span&gt;
&lt;span class="py"&gt;spring.ai.mcp.client.connections.tool-server.transport&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;STREAMABLE_HTTP&lt;/span&gt;
&lt;span class="py"&gt;spring.ai.openai.api-key&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}&lt;/span&gt;
&lt;span class="py"&gt;spring.ai.openai.chat.options.model&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;
&lt;span class="py"&gt;server.port&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;8081&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dependencies&lt;/strong&gt; — MCP Server (&lt;code&gt;pom.xml&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.springframework.ai&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;spring-ai-starter-mcp-server-webmvc&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dependencies&lt;/strong&gt; — AI Host (&lt;code&gt;pom.xml&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.springframework.ai&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;spring-ai-starter-mcp-client&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.springframework.ai&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;spring-ai-starter-model-openai&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Running It Locally
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt; Docker Desktop, JDK 17, an OpenAI-compatible API key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Clone and configure&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.template .env
&lt;span class="c"&gt;# add OPENAI_API_KEY=sk-...&lt;/span&gt;

&lt;span class="c"&gt;# 2. Start both services&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verify both services are up:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:8080/actuator/health | jq .status
&lt;span class="c"&gt;# → "UP"&lt;/span&gt;

curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:8081/actuator/health | jq .status
&lt;span class="c"&gt;# → "UP"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Confirm the tool registry (admin endpoint):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:8080/admin/tools | jq &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="c"&gt;# → list of @McpTool-annotated methods with name, description, inputSchema&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Trigger a tool call through the chat API:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8081/api/chat &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &amp;lt;TOKEN&amp;gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"sessionId":"sess-001","message":"What is the status of order ORD-12345?"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | jq &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="c"&gt;# → {"reply":"Order ORD-12345 is currently SHIPPED...","toolsUsed":["getOrderStatus"]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verify the tool call hit the server:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose logs mcp-tool-server | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"tools/call"&lt;/span&gt;
&lt;span class="c"&gt;# → log lines showing getOrderStatus invoked with orderId=ORD-12345&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dynamic tool discovery — no restart needed:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add a new tool bean to the server, then:&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/admin/tools/refresh &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &amp;lt;ADMIN_TOKEN&amp;gt;"&lt;/span&gt;
&lt;span class="c"&gt;# → {"registered":["getOrderStatus","searchProducts",...]}&lt;/span&gt;

&lt;span class="c"&gt;# Next chat request immediately picks up the new tool&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8081/api/chat &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &amp;lt;TOKEN&amp;gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"sessionId":"sess-001","message":"Search for electronics products"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | jq .reply
&lt;span class="c"&gt;# → uses the newly registered searchProducts tool&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What the Stateless Transport Mode Gives You
&lt;/h2&gt;

&lt;p&gt;By default the server runs in stateful &lt;code&gt;STREAMABLE&lt;/code&gt; mode (sessions via &lt;code&gt;Mcp-Session-Id&lt;/code&gt; headers). For horizontally scaled deployments behind a load balancer, switch to stateless:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# on mcp-tool-server
&lt;/span&gt;&lt;span class="py"&gt;spring.ai.mcp.server.protocol&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;STATELESS&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In stateless mode the server answers each request with a plain &lt;code&gt;application/json&lt;/code&gt; response and tracks no session, so no load-balancer affinity is required. Chat requests behave exactly as before; the difference is confined to the transport layer.&lt;/p&gt;
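
&lt;p&gt;You can see the difference from the command line. A sketch, assuming the stateless server accepts a bare &lt;code&gt;tools/list&lt;/code&gt; call; the exact headers depend on the MCP protocol revision your Spring AI version implements:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -si -X POST http://localhost:8080/mcp \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}'
# → a plain application/json body, with no Mcp-Session-Id response header
# (the default STREAMABLE mode would typically reject this call until an
#  initialize request has established a session)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;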




&lt;h2&gt;
  
  
  What's in the Full Solution
&lt;/h2&gt;

&lt;p&gt;This post covers the core problem and the minimal working setup. The complete verified solution at exesolution.com includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full source code for both Spring Boot modules (pom.xml, all Java classes, Docker Compose)&lt;/li&gt;
&lt;li&gt;Three &lt;code&gt;@McpTool&lt;/code&gt; implementations: &lt;code&gt;OrderTool&lt;/code&gt;, &lt;code&gt;ProductTool&lt;/code&gt;, and &lt;code&gt;WeatherTool&lt;/code&gt; (the last one calls &lt;code&gt;open-meteo.com&lt;/code&gt; in real time — verifiable live data)&lt;/li&gt;
&lt;li&gt;Security configuration: &lt;code&gt;/mcp&lt;/code&gt; endpoint internal-only, &lt;code&gt;/api/chat&lt;/code&gt; JWT-protected, &lt;code&gt;/admin/**&lt;/code&gt; role-gated&lt;/li&gt;
&lt;li&gt;Architecture diagram and request flow diagram&lt;/li&gt;
&lt;li&gt;Evidence Pack: 10 verification screenshots from actual execution — health checks, tool registry, chat responses, server-side logs, dynamic refresh&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://exesolution.com/solutions/MCP-Server-Client-in-Spring-AI-Dynamic-Tool-Discovery" rel="noopener noreferrer"&gt;Full solution + runnable code + evidence at exesolution.com&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Free registration required to access the code bundle and evidence images.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;The pattern here — separate MCP server, auto-discovering client — pays off when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple AI applications need the same tools (deploy once, use everywhere)&lt;/li&gt;
&lt;li&gt;Tool implementations need independent scaling or deployment cadence&lt;/li&gt;
&lt;li&gt;You want a trust boundary between the LLM execution context and the actual side-effecting code&lt;/li&gt;
&lt;li&gt;You're connecting to Claude Desktop, VS Code Copilot, or any other MCP-compatible client — the same server JAR works for all of them without code changes (see the config sketch after this list)&lt;/li&gt;
&lt;/ul&gt;
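
&lt;p&gt;On that last point: one way to attach Claude Desktop to this server is the &lt;code&gt;mcp-remote&lt;/code&gt; stdio-to-HTTP bridge, since Claude Desktop launches local processes from its config file. A sketch; the entry name is arbitrary and the details vary by client version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "mcpServers": {
    "tool-server": {
      "command": "npx",
      "args": ["-y", "mcp-remote", "http://localhost:8080/mcp"]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;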

&lt;p&gt;If you're already using Spring AI for chat and RAG, adding an MCP server is one dependency and a few annotations. The split into two services pays for itself the first time you update a tool without touching the AI host.&lt;/p&gt;
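
&lt;p&gt;Concretely, "a few annotations" means something on this order. A throwaway sketch reusing the &lt;code&gt;@McpTool&lt;/code&gt; annotation shown earlier, not code from the repo:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;@Service
public class PingTool {

    // Smallest possible tool: the input schema is generated from the
    // method signature, and the annotation is picked up at startup as
    // described above. "ping" is a made-up example tool.
    @McpTool(name = "ping", description = "Connectivity check; always returns pong")
    public String ping() {
        return "pong";
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;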




&lt;p&gt;&lt;em&gt;Have questions about the setup or ran into something unexpected? Drop a comment below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>springboot</category>
      <category>java</category>
      <category>ai</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
