<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Henry Li</title>
    <description>The latest articles on DEV Community by Henry Li (@henry9527).</description>
    <link>https://dev.to/henry9527</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3887743%2F90852950-095f-47a8-948e-2e3169072723.ico</url>
      <title>DEV Community: Henry Li</title>
      <link>https://dev.to/henry9527</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/henry9527"/>
    <language>en</language>
    <item>
      <title>Stop Letting Your LLM Bill Spiral: Building a Multi-Tenant Gateway in Spring Boot</title>
      <dc:creator>Henry Li</dc:creator>
      <pubDate>Mon, 04 May 2026 18:08:28 +0000</pubDate>
      <link>https://dev.to/henry9527/stop-letting-your-llm-bill-spiral-building-a-multi-tenant-gateway-in-spring-boot-1599</link>
      <guid>https://dev.to/henry9527/stop-letting-your-llm-bill-spiral-building-a-multi-tenant-gateway-in-spring-boot-1599</guid>
      <description>&lt;p&gt;A team I worked with shipped their first LLM feature in two weeks. Six weeks later, they got a $47,000 OpenAI bill — for a free tier product.&lt;/p&gt;

&lt;p&gt;The post-mortem found three things: one tenant ran a script that retried failed requests indefinitely, another had a buggy prompt that asked the model to "respond in ten thousand tokens," and a third was just abusive — they had discovered the API key was effectively unlimited and were running batch jobs through it.&lt;/p&gt;

&lt;p&gt;There was no rate limit. No per-tenant budget. No cost ceiling. No audit trail. Just direct SDK calls from the application code straight to OpenAI.&lt;/p&gt;

&lt;p&gt;If your team is shipping LLM features the same way, this post is for you. We will walk through a runnable Spring Boot LLM Gateway that sits between your clients and the provider, enforcing API keys, rate limits, token budgets, caching, and audit logging — the controls you need before going to production, not after.&lt;/p&gt;

&lt;p&gt;Full source code, Docker Compose stack, and 9 execution screenshots are at &lt;a href="https://exesolution.com/solutions/spring-boot-llm-gateway-multitenant-quotas" rel="noopener noreferrer"&gt;exesolution.com&lt;/a&gt;. This post covers the architecture and the key design decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Direct SDK Usage Doesn't Give You
&lt;/h2&gt;

&lt;p&gt;When your application code calls OpenAI directly, every request looks the same to the provider. They see one API key, one source, one bill. You can't:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scope keys per tenant.&lt;/strong&gt; A single shared key means one bad tenant takes down the whole product. Rotation is impossible without a coordinated multi-deploy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cap spend per tenant.&lt;/strong&gt; Without a gateway, you find out you have blown the monthly budget when the invoice arrives. You can't throttle in real time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Block runaway responses.&lt;/strong&gt; A buggy prompt asking for 10,000 tokens executes happily. The provider does not know it is wrong; you only know after the fact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache deterministic calls.&lt;/strong&gt; Identical requests with temperature=0 are paid for every time. There is no shared cache layer because there is no shared layer at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit anything.&lt;/strong&gt; When a customer complains "your AI gave me wrong information," you cannot reconstruct what was sent, what came back, or what model was used. The data is in OpenAI's logs, which you cannot query.&lt;/p&gt;

&lt;p&gt;A gateway is the standard fix. The question is what controls it actually enforces.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gateway Architecture
&lt;/h2&gt;

&lt;p&gt;The request pipeline has eight stages, each enforcing one specific concern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client
  POST /v1/chat/completions
  Authorization: Bearer &amp;lt;tenant_api_key&amp;gt;

Stage 1: Authentication       -&amp;gt; hashed key lookup, tenant resolution
Stage 2: Input normalization  -&amp;gt; canonicalize model/params, count bytes
Stage 3: Policy decision      -&amp;gt; ALLOW / DEGRADE / BLOCK
Stage 4: Quota enforcement    -&amp;gt; rate limit + budget check (Redis)
Stage 5: Cache lookup         -&amp;gt; only if temperature=0 and policy allows
Stage 6: Provider call        -&amp;gt; bounded timeout, circuit breaker
Stage 7: Response filtering   -&amp;gt; strip provider metadata, redact PII
Stage 8: Audit + rollup       -&amp;gt; write to PostgreSQL, increment counters

Client receives response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The architecture has three storage components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt; holds the durable state: tenants, hashed API keys, policies, audit logs, daily usage rollups. Everything that survives a restart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redis&lt;/strong&gt; holds the hot path: per-tenant rate limit counters, in-flight request semaphores, optional response cache. Optional but strongly recommended for any meaningful QPS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stateless gateway instances&lt;/strong&gt; sit behind a load balancer. All state lives in PostgreSQL and Redis, so you can scale horizontally without coordination.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Enforcement Modes
&lt;/h2&gt;

&lt;p&gt;This is the design decision that makes or breaks the gateway. Most teams default to either "block everything that exceeds limits" or "log everything but never block." Both are wrong in different ways.&lt;/p&gt;

&lt;p&gt;The gateway supports three modes, configured per tenant per policy:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HARD&lt;/strong&gt; — Reject the request when the limit is hit. Returns 429 (rate limit) or 402 (budget exhausted) with a reason code. Use for tenants on metered plans where overage isn't allowed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SOFT&lt;/strong&gt; — Degrade the request instead of rejecting it. The gateway rewrites the request: switches to a cheaper model, lowers max_tokens, tightens parameters. The user gets a response — just not the premium-quality one. Use during traffic spikes where degraded service is better than a 4xx.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OBSERVE&lt;/strong&gt; — Allow the request but flag it in the audit log. Critical for rolling out a new policy: you see exactly which tenants would have been blocked or degraded, without actually impacting them. Validate the policy with real traffic before flipping to HARD.&lt;/p&gt;

&lt;p&gt;The OBSERVE mode is the practical one. You are never going to get policy thresholds right on the first try. Setting them, running in OBSERVE for two weeks, reviewing the would-have-blocked traffic, then switching to HARD or SOFT is the only safe rollout path.&lt;/p&gt;
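
&lt;p&gt;To make the modes concrete, here is a minimal sketch of the decision step. The names (&lt;code&gt;PolicyEngine&lt;/code&gt;, &lt;code&gt;PolicyDecision&lt;/code&gt;) are illustrative, not the solution's actual classes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;enum EnforcementMode { HARD, SOFT, OBSERVE }

record PolicyDecision(String action, String reasonCode) {}

class PolicyEngine {
    // Maps a limit check onto the three modes described above.
    PolicyDecision decide(EnforcementMode mode, boolean limitExceeded, String reason) {
        if (!limitExceeded) return new PolicyDecision("ALLOW", null);
        return switch (mode) {
            case HARD    -&amp;gt; new PolicyDecision("BLOCK", reason);   // client sees 429 or 402
            case SOFT    -&amp;gt; new PolicyDecision("DEGRADE", reason); // rewritten to a cheaper model
            case OBSERVE -&amp;gt; new PolicyDecision("ALLOW", reason);   // served, but flagged in the audit log
        };
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;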

&lt;h2&gt;
  
  
  Data Model
&lt;/h2&gt;

&lt;p&gt;Five tables cover the durable state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;tenants&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;&lt;span class="k"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;status&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ACTIVE&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="k"&gt;SUSPENDED&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;created&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;at&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;api_keys&lt;/strong&gt; — keys are never stored in plaintext&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;&lt;span class="k"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;tenant&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;scopes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;created&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;last&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;used&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;rotated&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;at&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;policies&lt;/strong&gt; — one row per tenant&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;&lt;span class="k"&gt;tenant&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;allowed&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;models&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;json&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;prompt&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;input&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;output&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;rate&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;limit&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;rps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;inflight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;daily&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;budget&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;usd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;monthly&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;budget&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;usd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;daily&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;token&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;cap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;monthly&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;token&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;cap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;enforcement&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;mode&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;HARD&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="k"&gt;SOFT&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="k"&gt;OBSERVE&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="k"&gt;redact&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;mode&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NONE&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="k"&gt;BASIC&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="k"&gt;STRICT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;usage_rollup_daily&lt;/strong&gt; — append-only counters, fast to aggregate&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;&lt;span class="k"&gt;tenant&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;tokens&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;tokens&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;cost&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;usd&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;est&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;blocked&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;requests&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;audit_log&lt;/strong&gt; — one row per request&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;&lt;span class="k"&gt;request&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;tenant&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;request&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;latency&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;tokens&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;tokens&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;cost&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;usd&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;est&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;decision&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ALLOW&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="k"&gt;BLOCK&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="k"&gt;DEGRADE&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;reason&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;trace&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;prompt&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;redacted&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;response&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;redacted&lt;/span&gt;    &lt;span class="err"&gt;--&lt;/span&gt; &lt;span class="k"&gt;nullable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;policy&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;driven&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The split between &lt;code&gt;usage_rollup_daily&lt;/code&gt; and &lt;code&gt;audit_log&lt;/code&gt; matters. The rollup is queried in the hot path on every request to check budget; it is small and indexed by &lt;code&gt;(tenant_id, date)&lt;/code&gt;. The audit log is much larger but only queried during incident investigation. Don't merge them.&lt;/p&gt;
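
&lt;p&gt;As a sketch of that hot-path read, assuming Spring Data JPA (the entity and repository names below are illustrative; the solution's Flyway schema is canonical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.Optional;
import org.springframework.data.jpa.repository.JpaRepository;

// Illustrative mapping of usage_rollup_daily.
@Entity
class UsageRollupDaily {
    @Id Long id;
    String tenantId;
    LocalDate date;
    long requests;
    long tokensIn;
    long tokensOut;
    BigDecimal costUsdEst;
    long blockedRequests;
}

interface UsageRollupRepository extends JpaRepository&amp;lt;UsageRollupDaily, Long&amp;gt; {
    // One indexed read per request, backed by the (tenant_id, date) index.
    Optional&amp;lt;UsageRollupDaily&amp;gt; findByTenantIdAndDate(String tenantId, LocalDate date);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;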

&lt;h2&gt;
  
  
  API Key Handling
&lt;/h2&gt;

&lt;p&gt;Three rules, no exceptions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keys are hashed at rest.&lt;/strong&gt; SHA-256 with a server-side salt shared by all gateway instances (the hashes live in shared PostgreSQL, so every instance must compute the same digest). Constant-time comparison on lookup. The raw key is shown to the user once, at creation time, and then never again. If they lose it, they rotate it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Authorization header is never logged.&lt;/strong&gt; Every audit entry references &lt;code&gt;key_id&lt;/code&gt; (the database primary key), not the actual key value. Logs that capture HTTP requests have an explicit filter for the Authorization header.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rotation is graceful.&lt;/strong&gt; When a tenant rotates a key, the new key becomes active immediately. The old key continues working for a configurable grace period (default 24 hours) so deployments can roll out without downtime, then is automatically revoked.&lt;/p&gt;

&lt;p&gt;This is straightforward Spring Security with a custom &lt;code&gt;AuthenticationProvider&lt;/code&gt;. Nothing fancy — just disciplined.&lt;/p&gt;
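
&lt;p&gt;For reference, a minimal sketch of the hashing and comparison rules, assuming the salt is injected from secure configuration (the constant below is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

final class ApiKeyHasher {
    // Placeholder: in the real service this comes from configuration, never source code.
    private static final byte[] SALT = "replace-with-configured-salt".getBytes(StandardCharsets.UTF_8);

    static String hash(String rawKey) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(SALT);
            return HexFormat.of().formatHex(md.digest(rawKey.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always present on the JVM
        }
    }

    // MessageDigest.isEqual compares without short-circuiting, so timing
    // cannot leak how many leading characters of a guessed key are correct.
    static boolean matches(String rawKey, String storedHash) {
        return MessageDigest.isEqual(
                hash(rawKey).getBytes(StandardCharsets.UTF_8),
                storedHash.getBytes(StandardCharsets.UTF_8));
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;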

&lt;h2&gt;
  
  
  Rate Limiting and Budget Enforcement
&lt;/h2&gt;

&lt;p&gt;Both run in Redis and follow the same pattern: a per-tenant counter updated on each request and checked against the policy threshold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limiting&lt;/strong&gt; is per-tenant requests-per-second using a token bucket algorithm. The bucket size and refill rate come from the tenant's policy. A semaphore counter enforces &lt;code&gt;max_inflight&lt;/code&gt; to prevent a tenant from queueing thousands of concurrent requests.&lt;/p&gt;
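
&lt;p&gt;A sketch of the Redis side, assuming Spring Data Redis. The key layout (&lt;code&gt;rl:&amp;lt;tenant&amp;gt;&lt;/code&gt;) and hash fields are illustrative, not the solution's actual schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.time.Instant;
import java.util.List;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.data.redis.core.script.RedisScript;

class TokenBucketLimiter {
    // Refill-on-read token bucket, executed atomically inside Redis.
    private static final RedisScript&amp;lt;Long&amp;gt; SCRIPT = RedisScript.of("""
            local cap    = tonumber(ARGV[1])
            local rate   = tonumber(ARGV[2])
            local now    = tonumber(ARGV[3])
            local tokens = tonumber(redis.call('HGET', KEYS[1], 'tokens') or cap)
            local last   = tonumber(redis.call('HGET', KEYS[1], 'ts') or now)
            tokens = math.min(cap, tokens + (now - last) * rate)
            if tokens &amp;lt; 1 then return 0 end
            redis.call('HSET', KEYS[1], 'tokens', tokens - 1, 'ts', now)
            redis.call('EXPIRE', KEYS[1], 120)
            return 1
            """, Long.class);

    private final StringRedisTemplate redis;

    TokenBucketLimiter(StringRedisTemplate redis) { this.redis = redis; }

    boolean tryAcquire(String tenantId, int capacity, int refillPerSec) {
        Long allowed = redis.execute(SCRIPT, List.of("rl:" + tenantId),
                String.valueOf(capacity), String.valueOf(refillPerSec),
                String.valueOf(Instant.now().getEpochSecond()));
        return Long.valueOf(1L).equals(allowed);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;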

&lt;p&gt;&lt;strong&gt;Budget enforcement&lt;/strong&gt; is more interesting because the cost is not known until the response comes back. The flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Before the call: estimate the cost using the request's &lt;code&gt;max_tokens&lt;/code&gt; parameter and the configured price-per-token table. Check the estimate against the remaining budget.&lt;/li&gt;
&lt;li&gt;If the estimate would exceed the budget: apply HARD/SOFT/OBSERVE per the enforcement mode.&lt;/li&gt;
&lt;li&gt;After the call: parse the actual &lt;code&gt;usage&lt;/code&gt; object from the provider response, compute the actual cost, and update &lt;code&gt;usage_rollup_daily&lt;/code&gt; with the real number.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The pre-call estimate prevents a single 10,000-token request from blowing the monthly budget. The post-call true-up keeps the rollup accurate. The two-step approach is the only way to get both safety and accuracy.&lt;/p&gt;
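
&lt;p&gt;In code, the two steps reduce to two small computations. &lt;code&gt;PriceTable&lt;/code&gt; is an illustrative stand-in for the solution's configurable, versioned price table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.math.BigDecimal;

interface PriceTable {
    BigDecimal inputPricePerToken(String model);
    BigDecimal outputPricePerToken(String model);
}

class BudgetGuard {
    // Step 1, before the call: worst case, as if the model emits all of max_tokens.
    BigDecimal estimate(PriceTable p, String model, int promptTokens, int maxTokens) {
        return p.inputPricePerToken(model).multiply(BigDecimal.valueOf(promptTokens))
                .add(p.outputPricePerToken(model).multiply(BigDecimal.valueOf(maxTokens)));
    }

    // Step 3, after the call: recompute from the provider's actual usage object.
    BigDecimal trueUp(PriceTable p, String model, int tokensIn, int tokensOut) {
        return p.inputPricePerToken(model).multiply(BigDecimal.valueOf(tokensIn))
                .add(p.outputPricePerToken(model).multiply(BigDecimal.valueOf(tokensOut)));
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;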

&lt;h2&gt;
  
  
  Caching
&lt;/h2&gt;

&lt;p&gt;Caching LLM responses is dangerous if you are not careful. Two requests that look identical can have different intended outputs because of upstream context the gateway cannot see. So the cache only activates when:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The policy explicitly allows caching for this route, AND&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;temperature=0&lt;/code&gt; (deterministic output), AND&lt;/li&gt;
&lt;li&gt;The cache key includes &lt;code&gt;tenant_id + model + canonicalized prompt + relevant params&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;tenant_id&lt;/code&gt; in the cache key prevents cross-tenant leakage even if two tenants happen to send identical prompts. TTL is configured per route — short for personalized routes, long for generic prompts.&lt;/p&gt;
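
&lt;p&gt;A sketch of the key construction; canonicalizing the prompt and parameters (stage 2 of the pipeline) is assumed to have happened already:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

final class CacheKeys {
    // The tenant also appears as a plain prefix, so cross-tenant hits are
    // structurally impossible even in the event of a hash collision.
    static String cacheKey(String tenantId, String model,
                           String canonicalPrompt, String canonicalParams) throws Exception {
        String material = String.join("|", tenantId, model, canonicalPrompt, canonicalParams);
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(material.getBytes(StandardCharsets.UTF_8));
        return "llmcache:" + tenantId + ":" + HexFormat.of().formatHex(digest);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;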

&lt;p&gt;Every cache hit is recorded in the audit log with &lt;code&gt;cache_hit=true&lt;/code&gt;. This matters for billing: cached responses do not incur provider cost, so the rollup correctly shows zero cost for those requests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure Modes
&lt;/h2&gt;

&lt;p&gt;This is the section most gateway tutorials skip, and it is the section that determines whether the gateway is actually production-ready.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provider outage (5xx, timeout)&lt;/strong&gt; — Bounded retry (1-2 attempts) on transient errors only. Circuit breaker (Resilience4j) sheds load when the provider is consistently failing. Optional fallback: degrade to a cheaper alternative model.&lt;/p&gt;
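
&lt;p&gt;A sketch of the provider-call wrapping with Resilience4j. The thresholds are illustrative, and &lt;code&gt;callProvider&lt;/code&gt; stands in for the actual HTTP call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.time.Duration;
import java.util.function.Supplier;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

class GuardedProviderCall {
    Supplier&amp;lt;String&amp;gt; guard(Supplier&amp;lt;String&amp;gt; callProvider) {
        CircuitBreaker breaker = CircuitBreaker.of("openai", CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                        // open after 50% failures...
                .slidingWindowSize(20)                           // ...over the last 20 calls
                .waitDurationInOpenState(Duration.ofSeconds(30))
                .build());
        Retry retry = Retry.of("openai", RetryConfig.custom()
                .maxAttempts(2)                                  // bounded: original call + 1 retry
                .build());
        // Retry wraps the breaker, so every attempt counts toward opening it.
        return Retry.decorateSupplier(retry,
                CircuitBreaker.decorateSupplier(breaker, callProvider));
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;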

&lt;p&gt;&lt;strong&gt;Redis unavailable&lt;/strong&gt; — Configurable behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HARD-FAIL: block all requests until Redis recovers (strict, but predictable)&lt;/li&gt;
&lt;li&gt;SOFT-FAIL: allow requests but log &lt;code&gt;quota_unavailable&lt;/code&gt; (risky — tenants can exceed budgets undetected)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The default is HARD-FAIL. SOFT-FAIL is only appropriate when paired with strict per-instance rate limiting as a fallback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget calculation drift&lt;/strong&gt; — The pre-call estimate uses an approximate token count. The post-call true-up uses the provider's actual &lt;code&gt;usage&lt;/code&gt; field. Daily rollups reconcile based on actuals. The price table is versioned, so historical audit records remain accurate even after pricing changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key leakage&lt;/strong&gt; — Hashed keys at rest, fast rotation, per-key rate limits as a circuit breaker if anomalous traffic is detected on a single key.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This brings up: Spring Boot gateway, PostgreSQL, Redis, and a mock provider for testing without burning real OpenAI tokens.&lt;/p&gt;

&lt;p&gt;Bootstrap a tenant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/admin/tenants &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name":"team-a"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Issue an API key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/admin/tenants/&amp;lt;tenant-id&amp;gt;/keys
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response shows the raw key once. Save it. You won't see it again.&lt;/p&gt;

&lt;p&gt;Set a policy with a low budget for testing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; PUT http://localhost:8080/admin/tenants/&amp;lt;tenant-id&amp;gt;/policy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "allowedModels": ["gpt-4o-mini"],
    "maxOutputTokens": 200,
    "rateLimitRps": 5,
    "dailyBudgetUsd": 0.10,
    "enforcementMode": "HARD"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Call the gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer sk-tenant-XXXXX"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "gpt-4o-mini",
    "messages": [{"role":"user","content":"Hello"}]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Trigger a budget block by running the call in a loop until the daily limit is hit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..50&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer sk-tenant-XXXXX"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model":"gpt-4o-mini","messages":[{"role":"user","content":"test"}]}'&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Eventually you will see &lt;code&gt;BUDGET_EXCEEDED&lt;/code&gt; in the response. Then inspect the audit log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; admin:admin-secret &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"http://localhost:8080/admin/audit?tenantId=&amp;lt;tenant-id&amp;gt;&amp;amp;limit=10"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each entry shows tokens, cost, decision (ALLOW/BLOCK), and a reason code.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's in the Full Solution
&lt;/h2&gt;

&lt;p&gt;The verified solution at exesolution.com contains everything to run this from scratch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete Spring Boot project: gateway controller, policy engine, rate limiter, audit writer, admin endpoints&lt;/li&gt;
&lt;li&gt;PostgreSQL schema with Flyway migrations for all 5 tables, including indexes for the hot-path queries&lt;/li&gt;
&lt;li&gt;Redis-backed token bucket implementation and in-flight semaphore&lt;/li&gt;
&lt;li&gt;Spring Security configuration: API key authentication for tenant routes, HTTP Basic for admin routes&lt;/li&gt;
&lt;li&gt;Docker Compose stack: gateway + PostgreSQL + Redis + mock provider&lt;/li&gt;
&lt;li&gt;Configurable price table for cost estimation across multiple models&lt;/li&gt;
&lt;li&gt;9 evidence screenshots: build, startup, health, create tenant, issue API key, update policy, tenant call, admin visibility, usage dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://exesolution.com/solutions/spring-boot-llm-gateway-multitenant-quotas" rel="noopener noreferrer"&gt;Full solution + runnable code + evidence at exesolution.com&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Free registration required to access the code bundle and evidence images.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where This Pays Off
&lt;/h2&gt;

&lt;p&gt;The gateway pattern adds development time upfront, no question. The case for it gets clearer as you scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first time a tenant burns through their monthly budget in a day and you can throttle them in real time without redeploying — instead of finding out from the invoice.&lt;/li&gt;
&lt;li&gt;The first time a customer reports "your AI gave me wrong information" and you can reconstruct the exact request from the audit log in 30 seconds.&lt;/li&gt;
&lt;li&gt;The first time you rotate a leaked key without coordinating a multi-service deploy.&lt;/li&gt;
&lt;li&gt;The first time OBSERVE mode tells you a new policy would have blocked 12% of legitimate traffic, before you flip it to HARD.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are shipping LLM features in Spring Boot, the gateway is not a nice-to-have. It is the layer that lets you sleep at night.&lt;/p&gt;




&lt;p&gt;Have questions about a specific part of the pipeline — rate limiting algorithm, audit log schema, key rotation flow? Drop a comment below.&lt;/p&gt;

</description>
      <category>springboot</category>
      <category>java</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>You're Flying Blind: Adding LLM Observability to Spring AI with OpenTelemetry and Self-Hosted Langfuse</title>
      <dc:creator>Henry Li</dc:creator>
      <pubDate>Sat, 25 Apr 2026 14:04:08 +0000</pubDate>
      <link>https://dev.to/henry9527/youre-flying-blind-adding-llm-observability-to-spring-ai-with-opentelemetry-and-self-hosted-5gj4</link>
      <guid>https://dev.to/henry9527/youre-flying-blind-adding-llm-observability-to-spring-ai-with-opentelemetry-and-self-hosted-5gj4</guid>
      <description>&lt;p&gt;Your Spring Boot service returns 200 OK. Latency looks fine in Datadog. Users are complaining the answers are wrong and slow.&lt;/p&gt;

&lt;p&gt;You open the logs. Nothing useful. You check your APM traces. HTTP span: 1.2 seconds. Business logic: 40ms. That leaves 1.16 seconds completely unaccounted for — inside the LLM call, where your standard tooling sees nothing.&lt;/p&gt;

&lt;p&gt;This is the observability gap in every LLM-enabled Java service. Standard APM tools were not built to capture what actually matters: which prompt triggered which model, how many tokens it consumed, what it cost, whether the tool call chain stalled on the third retry, or which span in a multi-step RAG pipeline blew the latency budget.&lt;/p&gt;

&lt;p&gt;This post walks through a runnable setup that closes that gap: Spring AI plus OpenTelemetry plus self-hosted Langfuse, fully containerized, with no data leaving your infrastructure.&lt;/p&gt;

&lt;p&gt;The full solution with source code, Docker Compose, and 11 execution screenshots is at &lt;a href="https://exesolution.com/solutions/spring-ai-opentelemetry-langfuse-observability" rel="noopener noreferrer"&gt;exesolution.com&lt;/a&gt;. This post covers the core problem, the trace architecture, and the key configuration decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Can't See Without LLM-Specific Tracing
&lt;/h2&gt;

&lt;p&gt;Before getting into the setup, it is worth being specific about what is missing. Most teams discover these gaps the hard way:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency attribution.&lt;/strong&gt; A request takes 3 seconds. Your APM shows the HTTP span. It does not show whether the latency came from the embedding call, the LLM completion, a tool invocation, or a retry on a transient 429. You cannot fix what you cannot locate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token and cost accumulation.&lt;/strong&gt; In a chain with retrieval, reranking, a summarization step, and a final completion, tokens accumulate across multiple model calls. Without per-span token metadata, your cost reports are aggregates that tell you you are spending money but not where.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt correlation.&lt;/strong&gt; When a user reports a bad answer, you need to know the exact prompt that produced it, the model version, and the full context window. Without trace-level prompt capture, incident investigation is manual and slow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-service correlation.&lt;/strong&gt; An upstream HTTP request triggers an async enrichment job that calls an LLM. Without W3C &lt;code&gt;traceparent&lt;/code&gt; propagation through the LLM span, these two halves of the trace appear in separate, unrelated records.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sensitive data control.&lt;/strong&gt; You need observability, but you cannot send prompt content to a third-party SaaS. Self-hosted tracing is the only viable path in regulated environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Trace Architecture
&lt;/h2&gt;

&lt;p&gt;The setup has four components in a single Docker Compose stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Spring Boot Application
    -&amp;gt; OpenTelemetry Java SDK (in-process)
        -&amp;gt; OTLP Exporter (HTTP/protobuf)
            -&amp;gt; Langfuse ingestion endpoint (port 4318)
                -&amp;gt; PostgreSQL (trace storage)
                -&amp;gt; Langfuse UI (trace inspection)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Spring AI generates the spans.&lt;/strong&gt; When you call &lt;code&gt;ChatClient&lt;/code&gt;, Spring AI wraps the model invocation in an OpenTelemetry span automatically. Tool calls, embedding calls, and retries each get child spans. You do not write instrumentation code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The OTel SDK handles propagation and export.&lt;/strong&gt; W3C trace context flows from inbound HTTP requests through business logic spans into LLM spans — all linked in one trace. The SDK batches spans and exports them via OTLP without blocking the application thread.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Langfuse receives and stores everything.&lt;/strong&gt; It is the same Langfuse you may know from the Python world, but self-hosted: PostgreSQL for persistence, its own ingestion API on port 4318, and a UI for trace inspection, with filtering by model, cost, and latency, plus prompt review.&lt;/p&gt;

&lt;p&gt;The key architectural decision: Langfuse runs on your infrastructure. Prompts, responses, and token metadata never leave your network. This matters for compliance and is non-negotiable in many enterprise contexts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Each Span Carries
&lt;/h2&gt;

&lt;p&gt;Once running, every &lt;code&gt;ChatClient&lt;/code&gt; call produces a span with these attributes visible in the Langfuse UI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="err"&gt;llm.model&lt;/span&gt;              &lt;span class="err"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;"gpt-4o-mini"&lt;/span&gt;
&lt;span class="err"&gt;llm.prompt_tokens&lt;/span&gt;      &lt;span class="err"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;312&lt;/span&gt;
&lt;span class="err"&gt;llm.completion_tokens&lt;/span&gt;  &lt;span class="err"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;87&lt;/span&gt;
&lt;span class="err"&gt;llm.total_tokens&lt;/span&gt;       &lt;span class="err"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;399&lt;/span&gt;
&lt;span class="err"&gt;llm.latency_ms&lt;/span&gt;         &lt;span class="err"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;1143&lt;/span&gt;
&lt;span class="err"&gt;error.type&lt;/span&gt;             &lt;span class="err"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;(present&lt;/span&gt; &lt;span class="err"&gt;only&lt;/span&gt; &lt;span class="err"&gt;on&lt;/span&gt; &lt;span class="err"&gt;failure)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nested under the LLM span: tool call spans (if your &lt;code&gt;ChatClient&lt;/code&gt; uses tools), each with their own latency and result status. Nested under those: any downstream spans from calls the tool makes.&lt;/p&gt;

&lt;p&gt;The Langfuse UI groups these into a flame graph per trace. You can filter by model, sort by token count, drill into a specific prompt, or search for traces where &lt;code&gt;error.type&lt;/code&gt; is set.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Three environment variable blocks wire the stack together.&lt;/p&gt;

&lt;p&gt;Spring Boot application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;SPRING_PROFILES_ACTIVE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;otel
&lt;span class="nv"&gt;SPRING_AI_OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OpenTelemetry Java SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;OTEL_SERVICE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;spring-ai-llm-service
&lt;span class="nv"&gt;OTEL_TRACES_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;otlp
&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://langfuse:4318
&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_PROTOCOL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http/protobuf

&lt;span class="nv"&gt;OTEL_TRACES_SAMPLER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;parentbased_traceidratio
&lt;span class="nv"&gt;OTEL_TRACES_SAMPLER_ARG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.2

&lt;span class="nv"&gt;OTEL_RESOURCE_ATTRIBUTES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;deployment.environment&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;local&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sampling configuration deserves a note. &lt;code&gt;parentbased_traceidratio&lt;/code&gt; at 0.2 means 20 percent of traces are sampled: enough for operational visibility without the storage overhead of 100 percent capture. Error spans bypass sampling and are always recorded. For debugging sessions, bump the ratio to 1.0 and restart the container; it is a configuration change, not a code change.&lt;/p&gt;

&lt;p&gt;Langfuse (self-hosted):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;LANGFUSE_PUBLIC_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lf_pk_xxxx
&lt;span class="nv"&gt;LANGFUSE_SECRET_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lf_sk_xxxx
&lt;span class="nv"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgresql://langfuse:langfuse@postgres:5432/langfuse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Running It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose pull
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Startup order is managed by Compose health checks: PostgreSQL first, then Langfuse services, then the Spring Boot application. No manual sequencing needed.&lt;/p&gt;

&lt;p&gt;Verify the Spring Boot application is up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:8080/actuator/health
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The expected response shows &lt;code&gt;status: UP&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Verify the Langfuse UI is up by opening &lt;code&gt;http://localhost:3000&lt;/code&gt; in a browser. Log in with the credentials from your &lt;code&gt;.env&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;Trigger a trace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/api/chat &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"message": "Summarize the quarterly report"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What to look for in the Langfuse UI: open the Traces view. You should see an entry for &lt;code&gt;spring-ai-llm-service&lt;/code&gt;. Expand it — you will see an HTTP span at the root, a business logic span below it, and an LLM invocation span as a child of that. Click the LLM span: model name, token counts, and latency are in the attributes panel on the right.&lt;/p&gt;

&lt;p&gt;If you called any tools, each tool call appears as a child span of the LLM span, with its own duration and result status.&lt;/p&gt;
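
&lt;p&gt;The actual controller ships with the solution; a minimal version of the endpoint the curl hits, assuming the auto-configured &lt;code&gt;ChatClient.Builder&lt;/code&gt;, looks roughly like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.util.Map;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
class ChatController {
    private final ChatClient chatClient;

    ChatController(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    @PostMapping("/api/chat")
    Map&amp;lt;String, String&amp;gt; chat(@RequestBody Map&amp;lt;String, String&amp;gt; body) {
        // Spring AI wraps this call in the LLM span: model name, token
        // counts, and latency land in its attributes automatically.
        String answer = chatClient.prompt()
                .user(body.get("message"))
                .call()
                .content();
        return Map.of("answer", answer);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;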

&lt;h2&gt;
  
  
  Prompt and Response Redaction
&lt;/h2&gt;

&lt;p&gt;By default, prompt and response content is captured in span attributes. For environments where this is not acceptable, two options:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metadata-only mode.&lt;/strong&gt; Disable payload capture entirely. Token counts and latency are retained; prompt and response content are not recorded. One configuration flag, no code change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partial redaction.&lt;/strong&gt; Apply regex-based masking in the OTEL instrumentation layer before spans are exported. PII patterns (emails, phone numbers, account numbers) are replaced with &lt;code&gt;[REDACTED]&lt;/code&gt; in the span attributes. The LLM still receives the full content; only the observability record is masked.&lt;/p&gt;
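
&lt;p&gt;As a sketch, the masking step might look like the following. The patterns are illustrative; the solution drives the rule set from configuration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.util.regex.Pattern;

final class PiiMasker {
    // Illustrative patterns; real deployments tune these per data class.
    private static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+\\.[\\w.]+");
    private static final Pattern PHONE = Pattern.compile("\\+?\\d[\\d\\s()-]{7,}\\d");

    // Applied to span attribute values before export; the LLM payload is untouched.
    static String mask(String attributeValue) {
        String masked = EMAIL.matcher(attributeValue).replaceAll("[REDACTED]");
        return PHONE.matcher(masked).replaceAll("[REDACTED]");
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;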

&lt;p&gt;Both modes are configured in &lt;code&gt;application.yml&lt;/code&gt; with the &lt;code&gt;otel&lt;/code&gt; Spring profile. The full configuration is in the solution at exesolution.com.&lt;/p&gt;

&lt;h2&gt;
  
  
  Operational Notes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;If Langfuse goes down.&lt;/strong&gt; The OTEL batch processor drops spans after the queue saturates. Application traffic is completely unaffected — tracing degrades gracefully. No circuit breaker needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disabling tracing without a redeploy.&lt;/strong&gt; Set &lt;code&gt;OTEL_TRACES_EXPORTER=none&lt;/code&gt; and restart the application container. Tracing stops; everything else continues normally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-OpenAI providers.&lt;/strong&gt; The instrumentation is provider-agnostic. It works with any Spring AI &lt;code&gt;ChatModel&lt;/code&gt; implementation — Anthropic, Azure OpenAI, Ollama, Mistral. The span attributes are populated by Spring AI's abstraction layer, not by provider-specific code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes.&lt;/strong&gt; The same OTEL and Langfuse configuration applies. Docker Compose is provided for local and CI reproducibility; the Kubernetes equivalent is straightforward — deploy Langfuse as a Helm chart and point &lt;code&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/code&gt; at the service.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's in the Full Solution
&lt;/h2&gt;

&lt;p&gt;The verified solution at exesolution.com includes everything to run this from scratch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete Spring Boot project with &lt;code&gt;otel&lt;/code&gt; profile, OTel dependencies, and &lt;code&gt;ChatClient&lt;/code&gt; wiring&lt;/li&gt;
&lt;li&gt;Full Docker Compose stack: Spring Boot app + Langfuse (web + worker) + PostgreSQL&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;application.yml&lt;/code&gt; with sampling, batching, and redaction configuration&lt;/li&gt;
&lt;li&gt;11 evidence screenshots: Docker Compose build, running containers, chat API test, Langfuse dashboard, and five trace detail views showing nested spans with token and latency data&lt;/li&gt;
&lt;li&gt;Verification checklist: services running, traces visible, sampling confirmed, redaction verified&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://exesolution.com/solutions/spring-ai-opentelemetry-langfuse-observability" rel="noopener noreferrer"&gt;Full solution + runnable code + evidence at exesolution.com&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Free registration required to access the code bundle and evidence images.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Practical Case for Self-Hosted
&lt;/h2&gt;

&lt;p&gt;The cloud-hosted Langfuse option is fine for many projects. But if you are in financial services, healthcare, or any context with data residency requirements, sending prompt content to a third-party SaaS is a non-starter. Self-hosted Langfuse on Docker Compose or Kubernetes gives you the same UI and the same trace schema — the only difference is the data never leaves your network.&lt;/p&gt;

&lt;p&gt;The setup in this solution takes about 15 minutes from &lt;code&gt;git clone&lt;/code&gt; to first trace in the UI. That is a reasonable investment for closing the observability gap that every LLM service eventually hits.&lt;/p&gt;




&lt;p&gt;Questions about the OTel configuration or the Langfuse setup? Leave a comment below.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>springboot</category>
      <category>ai</category>
      <category>java</category>
    </item>
    <item>
      <title>MCP Server &amp; Client in Spring AI: Stop Coupling Tools to Your AI Host</title>
      <dc:creator>Henry Li</dc:creator>
      <pubDate>Sun, 19 Apr 2026 19:27:43 +0000</pubDate>
      <link>https://dev.to/henry9527/mcp-server-client-in-spring-ai-stop-coupling-tools-to-your-ai-host-2l21</link>
      <guid>https://dev.to/henry9527/mcp-server-client-in-spring-ai-stop-coupling-tools-to-your-ai-host-2l21</guid>
      <description>&lt;p&gt;If you've built an LLM feature in Spring Boot, you've probably done something like this: created a &lt;code&gt;@Bean&lt;/code&gt; with &lt;code&gt;@Tool&lt;/code&gt;-annotated methods, wired it into your &lt;code&gt;ChatClient&lt;/code&gt;, and shipped it. That works fine — until your tool set grows, multiple AI applications want to reuse the same tools, or you need to update a tool without redeploying the entire AI service.&lt;/p&gt;

&lt;p&gt;That's the problem MCP (Model Context Protocol) solves. This post walks through a two-service setup I built and verified: a standalone MCP Tool Server and an AI Chat Service that discovers tools dynamically over Streamable HTTP — &lt;strong&gt;no restart required when tools change&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The full solution with runnable code, Docker Compose, and execution evidence is at &lt;a href="https://exesolution.com/solutions/MCP-Server-Client-in-Spring-AI-Dynamic-Tool-Discovery" rel="noopener noreferrer"&gt;exesolution.com&lt;/a&gt;. This post covers the core problem and how to get it running locally.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem with In-Process Tool Registration
&lt;/h2&gt;

&lt;p&gt;When you register tools inside the same Spring Boot app that handles LLM interactions, you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deployment coupling&lt;/strong&gt; — every new tool means a new deployment of the AI service, even though the AI logic didn't change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No sharing&lt;/strong&gt; — if three different AI applications need the same "get order status" tool, you copy-paste the implementation into each.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No trust boundary&lt;/strong&gt; — a bug in a tool method can crash the process that's serving your users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Static inventory&lt;/strong&gt; — tools are fixed at startup. Adding one at runtime? Not without a restart.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero visibility&lt;/strong&gt; — tool invocations vanish inside the &lt;code&gt;ChatClient&lt;/code&gt; execution loop with no structured logs or traces.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The naive fix is "just put everything in one service." But once you have 20 tools across 5 domains, that service becomes the new monolith.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solution: Two Services, One Protocol
&lt;/h2&gt;

&lt;p&gt;The setup has two independently deployable Spring Boot apps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User
  └─→ AI Chat Service (:8081)
          └─→ ChatClient (Spring AI)
                  └─→ LLM (gpt-4o-mini)
                  └─→ MCP Client
                          └─→ MCP Tool Server (:8080)  ← POST /mcp
                                  └─→ @Tool-annotated service methods
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;MCP Tool Server&lt;/strong&gt; — owns tool implementations. Exposes them over Streamable HTTP via Spring AI's &lt;code&gt;@Tool&lt;/code&gt; annotation. Deployed and versioned independently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Chat Service&lt;/strong&gt; — user-facing REST API. Knows nothing about specific tools. Uses &lt;code&gt;SyncMcpToolCallbackProvider&lt;/code&gt; to auto-discover whatever tools the server exposes, on every request.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;code&gt;ToolCallbackProvider&lt;/code&gt; re-fetches the tool list from the server on each &lt;code&gt;getToolCallbacks()&lt;/code&gt; call. Add a new &lt;code&gt;@Tool&lt;/code&gt; method on the server, hit the refresh endpoint, and the next conversation picks it up — no restart of either service.&lt;/p&gt;
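
&lt;p&gt;You can watch this happen by dumping the provider's current view of the server. This debug endpoint is hypothetical (it is not part of the solution), but it makes the dynamic discovery visible:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.util.Arrays;
import java.util.List;
import org.springframework.ai.mcp.SyncMcpToolCallbackProvider;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
class ToolInventoryController {
    private final SyncMcpToolCallbackProvider provider;

    ToolInventoryController(SyncMcpToolCallbackProvider provider) {
        this.provider = provider;
    }

    // Each call re-fetches the tool list from the MCP server, so a tool
    // added on the server shows up here without restarting anything.
    @GetMapping("/debug/tools")
    List&amp;lt;String&amp;gt; tools() {
        return Arrays.stream(provider.getToolCallbacks())
                .map(cb -&amp;gt; cb.getToolDefinition().name())
                .toList();
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;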




&lt;h2&gt;
  
  
  Defining a Tool: One Annotation
&lt;/h2&gt;

&lt;p&gt;On the server side, any Spring bean method can become an MCP tool with &lt;code&gt;@Tool&lt;/code&gt; (Spring AI's annotation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Service&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OrderTool&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="nd"&gt;@Tool&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Get the current status and details of an order by its ID"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Object&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getOrderStatus&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="nd"&gt;@ToolParam&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"The unique order identifier, e.g. ORD-12345"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;orderId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;orderRepository&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findById&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orderId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                        &lt;span class="s"&gt;"orderId"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;           &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getId&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
                        &lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;            &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getStatus&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
                        &lt;span class="s"&gt;"estimatedDelivery"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getEstimatedDelivery&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
                        &lt;span class="s"&gt;"items"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;             &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getItems&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                &lt;span class="o"&gt;))&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;orElseThrow&lt;/span&gt;&lt;span class="o"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;
                        &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;IllegalArgumentException&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Order not found: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;orderId&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Spring AI reads the annotation at startup and generates a JSON Schema for the parameters automatically. The LLM receives this schema and knows exactly how to call the tool.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wiring the Client: One Line
&lt;/h2&gt;

&lt;p&gt;On the AI Host side, wiring all server tools into &lt;code&gt;ChatClient&lt;/code&gt; takes one method call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Configuration&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ChatConfig&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="nd"&gt;@Bean&lt;/span&gt;
    &lt;span class="nc"&gt;ChatClient&lt;/span&gt; &lt;span class="nf"&gt;chatClient&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ChatModel&lt;/span&gt; &lt;span class="n"&gt;chatModel&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
                          &lt;span class="nc"&gt;SyncMcpToolCallbackProvider&lt;/span&gt; &lt;span class="n"&gt;toolCallbackProvider&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ChatClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chatModel&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;defaultTools&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;toolCallbackProvider&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// ← entire server tool registry&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From here, when a user asks "What's the status of order ORD-12345?", the LLM decides to call &lt;code&gt;getOrderStatus&lt;/code&gt;, Spring AI dispatches it over MCP, the tool runs on the server, the result comes back, and the LLM incorporates it into the reply — entirely transparent to the controller layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;MCP Tool Server&lt;/strong&gt; (&lt;code&gt;application.properties&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;spring.ai.mcp.server.name&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;tool-server&lt;/span&gt;
&lt;span class="py"&gt;spring.ai.mcp.server.version&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;1.0.0&lt;/span&gt;
&lt;span class="py"&gt;spring.ai.mcp.server.protocol&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;STREAMABLE&lt;/span&gt;
&lt;span class="py"&gt;server.port&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;AI Chat Service&lt;/strong&gt; (&lt;code&gt;application.properties&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;spring.ai.mcp.client.toolcallback.enabled&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;spring.ai.mcp.client.connections.tool-server.url&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;${MCP_SERVER_URL}/mcp&lt;/span&gt;
&lt;span class="py"&gt;spring.ai.mcp.client.connections.tool-server.transport&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;STREAMABLE_HTTP&lt;/span&gt;
&lt;span class="py"&gt;spring.ai.openai.api-key&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}&lt;/span&gt;
&lt;span class="py"&gt;spring.ai.openai.chat.options.model&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;
&lt;span class="py"&gt;server.port&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;8081&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dependencies&lt;/strong&gt; — MCP Server (&lt;code&gt;pom.xml&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.springframework.ai&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;spring-ai-starter-mcp-server-webmvc&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dependencies&lt;/strong&gt; — AI Host (&lt;code&gt;pom.xml&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.springframework.ai&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;spring-ai-starter-mcp-client&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.springframework.ai&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;spring-ai-starter-model-openai&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Running It Locally
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt; Docker Desktop, JDK 17, an OpenAI-compatible API key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Clone and configure&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.template .env
&lt;span class="c"&gt;# add OPENAI_API_KEY=sk-...&lt;/span&gt;

&lt;span class="c"&gt;# 2. Start both services&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verify both services are up:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:8080/actuator/health | jq .status
&lt;span class="c"&gt;# → "UP"&lt;/span&gt;

curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:8081/actuator/health | jq .status
&lt;span class="c"&gt;# → "UP"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Confirm the tool registry (admin endpoint):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:8080/admin/tools | jq &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="c"&gt;# → list of @McpTool-annotated methods with name, description, inputSchema&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Trigger a tool call through the chat API:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8081/api/chat &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &amp;lt;TOKEN&amp;gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"sessionId":"sess-001","message":"What is the status of order ORD-12345?"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | jq &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="c"&gt;# → {"reply":"Order ORD-12345 is currently SHIPPED...","toolsUsed":["getOrderStatus"]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verify the tool call hit the server:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose logs mcp-tool-server | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"tools/call"&lt;/span&gt;
&lt;span class="c"&gt;# → log lines showing getOrderStatus invoked with orderId=ORD-12345&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dynamic tool discovery — no restart needed:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add a new tool bean to the server, then:&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/admin/tools/refresh &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &amp;lt;ADMIN_TOKEN&amp;gt;"&lt;/span&gt;
&lt;span class="c"&gt;# → {"registered":["getOrderStatus","searchProducts",...]}&lt;/span&gt;

&lt;span class="c"&gt;# Next chat request immediately picks up the new tool&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8081/api/chat &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &amp;lt;TOKEN&amp;gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"sessionId":"sess-001","message":"Search for electronics products"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | jq .reply
&lt;span class="c"&gt;# → uses the newly registered searchProducts tool&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What the Stateless Transport Mode Gives You
&lt;/h2&gt;

&lt;p&gt;By default the server runs in stateful &lt;code&gt;STREAMABLE&lt;/code&gt; mode (sessions via &lt;code&gt;Mcp-Session-Id&lt;/code&gt; headers). For horizontally scaled deployments behind a load balancer, switch to stateless:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# on mcp-tool-server
&lt;/span&gt;&lt;span class="py"&gt;spring.ai.mcp.server.protocol&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;STATELESS&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In stateless mode the server answers each request with a plain &lt;code&gt;application/json&lt;/code&gt; response and tracks no session, so no load-balancer affinity is required. Chat requests behave exactly as before; the difference is confined to the transport layer.&lt;/p&gt;
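
&lt;p&gt;You can see the difference from the command line. A sketch, assuming the stateless server accepts a bare &lt;code&gt;tools/list&lt;/code&gt; call; the exact headers depend on the MCP protocol revision your Spring AI version implements:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -si -X POST http://localhost:8080/mcp \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}'
# → a plain application/json body, with no Mcp-Session-Id response header
# (the default STREAMABLE mode would typically reject this call until an
#  initialize request has established a session)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;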




&lt;h2&gt;
  
  
  What's in the Full Solution
&lt;/h2&gt;

&lt;p&gt;This post covers the core problem and the minimal working setup. The complete verified solution at exesolution.com includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full source code for both Spring Boot modules (pom.xml, all Java classes, Docker Compose)&lt;/li&gt;
&lt;li&gt;Three &lt;code&gt;@McpTool&lt;/code&gt; implementations: &lt;code&gt;OrderTool&lt;/code&gt;, &lt;code&gt;ProductTool&lt;/code&gt;, and &lt;code&gt;WeatherTool&lt;/code&gt; (the last one calls &lt;code&gt;open-meteo.com&lt;/code&gt; in real time — verifiable live data)&lt;/li&gt;
&lt;li&gt;Security configuration: &lt;code&gt;/mcp&lt;/code&gt; endpoint internal-only, &lt;code&gt;/api/chat&lt;/code&gt; JWT-protected, &lt;code&gt;/admin/**&lt;/code&gt; role-gated&lt;/li&gt;
&lt;li&gt;Architecture diagram and request flow diagram&lt;/li&gt;
&lt;li&gt;Evidence Pack: 10 verification screenshots from actual execution — health checks, tool registry, chat responses, server-side logs, dynamic refresh&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://exesolution.com/solutions/MCP-Server-Client-in-Spring-AI-Dynamic-Tool-Discovery" rel="noopener noreferrer"&gt;Full solution + runnable code + evidence at exesolution.com&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Free registration required to access the code bundle and evidence images.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;The pattern here — separate MCP server, auto-discovering client — pays off when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple AI applications need the same tools (deploy once, use everywhere)&lt;/li&gt;
&lt;li&gt;Tool implementations need independent scaling or deployment cadence&lt;/li&gt;
&lt;li&gt;You want a trust boundary between the LLM execution context and the actual side-effecting code&lt;/li&gt;
&lt;li&gt;You're connecting to Claude Desktop, VS Code Copilot, or any other MCP-compatible client — the same server JAR works for all of them without code changes (see the config sketch after this list)&lt;/li&gt;
&lt;/ul&gt;
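
&lt;p&gt;On that last point: one way to attach Claude Desktop to this server is the &lt;code&gt;mcp-remote&lt;/code&gt; stdio-to-HTTP bridge, since Claude Desktop launches local processes from its config file. A sketch; the entry name is arbitrary and the details vary by client version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "mcpServers": {
    "tool-server": {
      "command": "npx",
      "args": ["-y", "mcp-remote", "http://localhost:8080/mcp"]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;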

&lt;p&gt;If you're already using Spring AI for chat and RAG, adding an MCP server is one dependency and a few annotations. The split into two services pays for itself the first time you update a tool without touching the AI host.&lt;/p&gt;
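
&lt;p&gt;Concretely, "a few annotations" means something on this order. A throwaway sketch reusing the &lt;code&gt;@McpTool&lt;/code&gt; annotation shown earlier, not code from the repo:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;@Service
public class PingTool {

    // Smallest possible tool: the input schema is generated from the
    // method signature, and the annotation is picked up at startup as
    // described above. "ping" is a made-up example tool.
    @McpTool(name = "ping", description = "Connectivity check; always returns pong")
    public String ping() {
        return "pong";
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;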




&lt;p&gt;&lt;em&gt;Have questions about the setup or ran into something unexpected? Drop a comment below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>springboot</category>
      <category>java</category>
      <category>ai</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
