DEV Community: srinivas reddy gouru

I Built Scrivio: An AI Pipeline That Researches, Fact-Checks, and Polishes Technical Articles

srinivas reddy gouru — Sun, 14 Jun 2026 04:50:03 +0000

Why I stopped asking a chatbot to “write me an article” and instead built a multi-stage pipeline.

github.com/srinivas-reddy-gouru/scrivio

I like writing about the things I build and learn. The hard part was never the writing itself; it was everything that comes before the writing: digging up sources, checking that what I remembered about a framework was still true, drawing the diagram, and then rewriting the whole thing twice because the first draft sounded like a product brochure.

At some point, I did the obvious thing and asked an AI to write a draft for me. And it did, in about four seconds. But the draft had a familiar problem. It was confident, fluent, and quietly wrong in a few places. It cited nothing. It pulled “facts” from training data that may or may not have matched the current version of the library I was writing about. The diagrams didn’t exist. And it had that unmistakable AI texture where every paragraph is the same temperature.

So I kept the idea but threw away the approach. Instead of one model doing one pass, I wanted a pipeline: research first, draft against real evidence, verify every claim, then edit and polish as an actual editorial team would. That turned into Scrivio.

This is the story of what it does, how it works, and the honest answer to the question everyone asks first: why not just use ChatGPT or Claude directly?

What Scrivio actually does

You type a topic. That’s the whole input. Something like “How does Kafka handle backpressure?” or “Spring Security 6 filter chain”. From there, Scrivio runs the full chain on its own:

Researches the live web for real, recent sources instead of leaning on training data
Plans a section-by-section structure tuned to your audience and depth level
Drafts each section with inline citations tied to the evidence it actually fetched
Generates Mermaid diagrams for architecture, flows, and sequence interactions
Verifies every factual claim against the source it came from
Edits for thesis alignment, voice, and structure (it’ll add a comparison table when the content calls for one)
Polishes the voice so it reads like a colleague explaining something, not a generated document
Compiles the final article with numbered references resolved at the end

The output is a clean Markdown file, ready to paste into a blog, Notion, or docs site that renders Markdown and Mermaid.

One thing to get straight up front: Scrivio isn’t locked to a single provider. It auto-selects based on the API key you give it. Drop in an Anthropic key, and Claude runs the pipeline. Drop in an OpenAI key, and the whole pipeline runs on GPT-4o and GPT-4o-mini instead. Set both, and it defaults to Anthropic unless you tell it otherwise with an LLM_PROVIDER setting. There's no code change involved: a single adapter presents the OpenAI client to every worker as if it were the Anthropic one, so the workers don't even know which provider they're talking to.

Which provider runs the pipeline depends on the keys you set:

Anthropic only — Pipeline runs on Claude (Sonnet + Haiku) · Verifier: GPT-4o-mini (needs the OpenAI key)
OpenAI only — Pipeline runs on GPT-4o + GPT-4o-mini · Verifier: GPT-4o-mini
Both — Anthropic by default, or OpenAI via LLM_PROVIDER=openai · Verifier: GPT-4o-mini

The three quality presets let you trade speed for polish, depending on whether you’re prototyping or shipping. The model names below are the Anthropic defaults; in OpenAI mode, each one maps to its nearest GPT equivalent (Sonnet → GPT-4o, Haiku → GPT-4o-mini), so the preset slider still does its job either way:

Fast — Routing: Haiku / GPT-4o-mini · Writing: Sonnet / GPT-4o · Best for: quick drafts and prototyping
Balanced (default) — Routing: Haiku / GPT-4o-mini · Writing: Sonnet / GPT-4o · Best for: the best cost/quality ratio
Best — Routing: Sonnet / GPT-4o · Writing: Sonnet / GPT-4o · Best for: production-quality articles

The slider in the UI doesn’t just change one model. Every stage in the pipeline asks a central config which model it should use for the current preset, so the whole writing chain moves together when you flip it.

Under the hood: it’s a pipeline, not a prompt

The core mental model is simple. Scrivio is a FastAPI server that takes a request, runs the job in the background, and streams progress to the browser stage by stage so you can watch it think. The UI is a single static page.

The interesting part is what sits behind that server. Rather than one big “write an article” call, the work is split across a chain of specialised workers, each with one job and its own prompt:

topic
  → brief (thesis, angle, hook)
  → relevance check (does the brief match what you asked?)
  → web search + extraction (fetch real pages, chunk them, score relevance)
  → planning (section-by-section outline mapped to evidence)
  → gap-fill search (find more sources for thin sections)
  → drafting (every section in parallel, each claim cited)
  → diagrams (Mermaid spec → rendered SVG)
  → verification (claim-by-claim fact check)
  → editor review + targeted revision
  → critic (final quality gate)
  → humanizer (voice polish, AI-tell scrub)
  → compile (resolve citations into numbered references)

A few of these stages are the ones I’m quietly proud of, because they’re the parts a single chatbot call doesn’t do.

The relevance check runs a cheap Haiku pass right after the brief is written, purely to catch drift. If you ask, “What is X?” and the brief quietly decides to write a dramatic war story instead, it gets caught before the expensive stages spend tokens on the wrong article.

The editor runs two passes. First, it reviews the assembled draft and flags up to a few sections that need rewriting, plus structural hints (for example, “these three things are being compared across two dimensions, turn this into a table”). Then it re-drafts only the flagged sections, so a single weak paragraph doesn’t trigger a full regeneration.

The critic is the final gate. It checks title patterns, opening framing, citation completeness, diagram accuracy, internal consistency, and code-block fidelity, and grades issues as blocking, moderate, or minor. The article only ships if nothing blocking remains.

“But Claude or ChatGPT can already write an article…”

This is a fair question, and I want to answer it honestly rather than pretend Scrivio invented writing.

Yes. If you open Claude or ChatGPT and ask for a technical article, you’ll get something readable in seconds. For a casual post, that’s often enough, and Scrivio is overkill. I’m not going to tell you to spin up a pipeline to write a 300-word “5 tips” listicle.

The gap shows up the moment you care about a technical article being correct, sourced, and not obviously machine-written. A single chat call gives you one pass from a single model working off training data. Scrivio is built around the things that one pass skips:

A single Claude or ChatGPT call gives you a fast draft. Here’s where Scrivio differs, point by point:

Up-to-date facts — Single call: from training data, no live sources. Scrivio: live web search, fetched per run.
Citations — Single call: rarely, and often invented. Scrivio: every claim tied to a fetched source, numbered at the end.
Fact-checking — Single call: the model checks itself, if at all. Scrivio: a separate verifier model checks every claim (a different family when you run in Claude mode).
Diagrams — Single call: described in text, not drawn. Scrivio: real rendered Mermaid SVGs.
Editing — Single call: one pass, one voice. Scrivio: separate editor, critic, and humaniser passes.
Reproducibility — Single call: new chat means starting over. Scrivio: same topic, same pipeline, repeatable settings.

The verification step is the one I’d point to first. Whoever does the writing, a separate verifier model goes through the draft claim by claim and downgrades anything the cited source doesn’t actually support. A model grading its own homework is a known weak spot, and this catches a class of confident-but-wrong statements that self-review tends to wave through. It’s strongest in the default Claude setup, where the writer is Claude and the verifier is GPT-4o-mini — a genuinely different model family doing the checking. If you run the whole thing on OpenAI, the verifier is still an independent, smaller-model pass, just from the same family, so treat it as a second opinion rather than a fully independent one.

So the honest framing is this: a chatbot is a brilliant writer. Scrivio is closer to a small newsroom (researcher, drafter, fact-checker, editor, and copy-editor) that happens to be made of models. Where it shines is exactly where a fast single draft falls short: sourced, verifiable, diagram-supported technical writing you’d actually put your name on.

Using it

There are three ways in, depending on how much setup you want.

Web UI (the nice one). Start the server, open localhost:8899, type a topic, pick depth and audience, choose a preset, and hit generate. The progress panel streams each stage live, which honestly became my favourite part to watch. You can toggle live web search on or off, and flip diagrams on or off, right there in the form.

Command line. If you’d rather script it:

# Fastest: no web search, no diagrams
python main.py --topic "pytest best practices" --level basic --no-web --no-diagrams

# Full run with search and diagrams
python main.py --topic "How does Kafka handle backpressure?" --level intermediate

Output gets written to a timestamped folder with the finished Markdown and a meta.json containing the request params and verification reports, in case you want to see why a claim was kept or dropped.

Claude Code skill (no setup at all). There’s a standalone generate-article.skill in the repo. Drag it into a Claude Code chat (or install it from Settings → Skills), ask it to "write an article about how Kafka handles backpressure," and it runs the workflow with no server, no clone, and no API keys of your own. It's a single self-contained SKILL.md that walks Claude Code through the same core stages (brief, research, plan, draft, editorial review, critique, and file output) using Claude Code's built-in web search and file tools. This is the lowest-friction way to try the idea before touching the full project.

It is worth being precise about one difference: because the skill runs entirely inside Claude Code on a single model, it skips the separate cross-model verification step that the full server pipeline does. Its review and critique passes are Claude checking its own draft, not an independent verifier from another family. So the skill is the fastest way to feel out the workflow, and the full pipeline is the one to reach for when you want that extra layer of fact-checking.

What’s still rough (and what I’m working on)

I’d rather be upfront about the edges than oversell this. It works well for what it does, but it’s early, and there’s a list I’m actively chipping away at.

Job storage is in-memory right now. Jobs live in the server process, so a restart loses in-flight work. Moving to a persistent store (and a proper queue) is near the top of the list, especially before anyone tries to host this rather than run it locally.

Key setup could still be simpler. You can already run the whole pipeline on a single provider key — just Anthropic, or just OpenAI — and web search is optional on top of that. What I still want to smooth out is the verifier: right now, it always reaches for OpenAI, so the cleanest “one key only” experience is OpenAI mode. Making the verifier fully provider-agnostic (so an Anthropic-only setup needs nothing else) is on the list.

There’s no auth on the server. That’s fine for localhost, but a hosted deployment needs it, so it's a prerequisite for the "run it as a small service" direction I'd like to take.

Output is Markdown only. The pipeline already reasons about basic/intermediate/advanced depth levels internally, but the surfaced output is a single file.

Cost visibility. A full “Best” run touches many models. I want the UI to show a token and cost estimate up front so a run never surprises you.

None of these are blockers for the core use case, but they’re the difference between “a tool I built for myself” and “a tool other people can comfortably depend on,” and that’s the direction I’m steering it.

Try it

If sourced, fact-checked technical writing is a problem you also have, the repo is here, and it’s MIT-licensed:

github.com/srinivas-reddy-gouru/scrivio

There are five real example articles in the examples/ folder if you want to judge the output before installing anything. And if you try it, I'd genuinely like to hear where it breaks for you — what topics it nails, where the verification is too strict or too loose, and which export format you'd reach for first. That feedback determines what I build next.

Thanks for reading.

A few quick FAQs

Does this replace writing the article yourself?

No, and it’s not trying to. It removes the blank page, the source-hunting, and the first three drafts. The judgment, the opinions, and the final read-through are still yours.

Do I need both Anthropic and OpenAI?

No. Scrivio runs on whichever you give it — an Anthropic key runs the pipeline on Claude, an OpenAI key runs it on GPT-4o. The one nuance is verification, which currently uses GPT-4o-mini, so an OpenAI key gives you the most self-contained single-key setup today. In the default Claude setup, the writer and the verifier come from different model families on purpose, since a model checking its own work is exactly what I wanted to avoid.

Is it free to run?

The code is MIT and free. The API calls are not — you bring your own keys, and a full “Best” run with web search costs real tokens. The “Fast” and “Balanced” presets exist precisely to keep that reasonable.

What’s the minimum to get started?

One LLM key. Just Anthropic or just OpenAI is enough to generate articles; web search is optional and only needs its own key if you want live sources instead of model knowledge. So the smallest setup is a single key, no search.

Do I need to clone the repo to try it?

No. The generate-article.skill file installs straight into Claude Code (drag it in, or Settings → Skills), and from there you just ask it to write an article. It uses Claude Code's own tools, so there's nothing else to set up. The full repo is for when you want the server UI, the cross-model verification, and full control over the pipeline.

How Kafka Handles Backpressure: Producer Buffers, Broker Quotas, and Consumer Flow Control

srinivas reddy gouru — Mon, 08 Jun 2026 02:33:05 +0000

Backpressure is the pressure that builds in a pipeline when the slow end cannot absorb what the fast end produces. In a reactive stream, the solution is explicit: the subscriber signals its demand upstream, and the publisher sends only as many items as it requested. Kafka takes a different approach. There is no protocol-level demand signal flowing from consumers back to producers. Instead, Kafka distributes the problem across three separate layers, each with its own configuration surface, and leaves the integration of those layers to you.

Understanding which layer to reach for, and when, is the practical skill this article develops.

The Producer Side: A Bounded Buffer That Blocks

When your application calls producer.send(), the record does not go directly to the broker. It lands in the RecordAccumulator, an in-memory buffer the producer client maintains internally. The accumulator organizes records into per-partition deques of ProducerBatch objects. A background Sender thread drains ready batches to the appropriate brokers over the network.

This separation is where producer-side backpressure lives. The accumulator has a finite size, controlled by buffer.memory (default: 32 MB). As long as the Sender drains batches faster than your application produces records, everything stays well under that limit. When it does not, because the broker is slow, the network is saturated, or you are simply writing more than the brokers can accept, the buffer fills up.

At that point, send() blocks. The calling thread waits inside BufferPool.allocate() until the Sender frees enough memory or max.block.ms (default: 60 seconds) expires. If the timeout fires first, the client throws a BufferExhaustedException.

The blocking behavior is the implicit backpressure signal on the producer side. Your application slows down because send() will not return until there is room. The 60-second default means a slow broker can cause your producers to stall quietly for up to a minute before surfacing an error. This is worth knowing when diagnosing latency spikes that seem to appear without warning.

In practice, lowering max.block.ms to 5-10 seconds and handling the resulting exception explicitly gives you faster feedback and a clear place to shed load, rather than waiting for Kafka to silently unblock.
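A minimal sketch of those settings, written with plain java.util.Properties so it runs without the Kafka client on the classpath. The config keys are real producer settings; the class name, method name, and the 5-second value are illustrative assumptions.

```java
import java.util.Properties;

/** Illustrative only: producer settings that surface a full buffer quickly. */
final class FastFailProducerConfig {

    /** Returns producer properties with a shortened max.block.ms. */
    static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Memory available for buffering unsent records (32 MB matches the default).
        props.setProperty("buffer.memory", String.valueOf(32L * 1024 * 1024));
        // Wait at most 5 s for buffer space instead of the 60 s default (assumed value).
        props.setProperty("max.block.ms", "5000");
        return props;
    }

    public static void main(String[] args) {
        // In a real application these would be passed, together with serializer
        // settings, to the KafkaProducer constructor.
        build("localhost:9092").forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

With a shorter block time, the failure arrives quickly enough that the calling code can decide whether to retry, drop, or divert the record.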

How Batching Softens Burst Pressure

The RecordAccumulator does not forward records to the Sender one at a time. It accumulates them into batches, flushing a batch when either it reaches batch.size bytes or linger.ms milliseconds have passed since the first record arrived in that batch, whichever comes first.
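The flush rule can be modelled in a few lines. This is a simplified illustration of the "size or time, whichever comes first" condition, not the Kafka client's actual implementation; the class and method names are hypothetical.

```java
/** Simplified model of the batch flush rule; not the Kafka client's real code. */
final class BatchFlushRule {

    /** A batch is ready when it is full or has lingered long enough. */
    static boolean isReady(int batchBytes, long msSinceFirstRecord,
                           int batchSizeBytes, long lingerMs) {
        boolean full = batchBytes >= batchSizeBytes;
        boolean lingeredLongEnough = msSinceFirstRecord >= lingerMs;
        return full || lingeredLongEnough;
    }

    public static void main(String[] args) {
        int batchSize = 16 * 1024; // batch.size
        long lingerMs = 5;         // linger.ms
        System.out.println(isReady(4_096, 2, batchSize, lingerMs));  // false: neither condition met
        System.out.println(isReady(16_384, 1, batchSize, lingerMs)); // true: batch is full
        System.out.println(isReady(4_096, 5, batchSize, lingerMs));  // true: linger time elapsed
    }
}
```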

batch.size defaults to 16 KB and linger.ms defaults to 5 ms as of Kafka 4.0 (it was 0 in earlier versions, which caused the producer to send records immediately). A linger.ms of 0 is right for low-latency use cases but leaves throughput on the table for anything bulk-oriented. A linger.ms of 5-50 ms allows the accumulator to fill batches more completely, reducing the number of network round trips the Sender has to make and easing the per-request load on brokers during traffic spikes.

Compression works at the batch level, so the fuller the batch, the better the ratio. Enabling compression.type (lz4 or zstd being the practical choices) on top of sensible batching settings can meaningfully reduce the bytes-per-second your producers push to the network, leaving more headroom before the buffer fills.
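As a rough starting point, the batching and compression settings might be combined like this. The keys are real producer configs; the specific values (64 KB batches, 20 ms linger, lz4) are assumptions to tune against your own traffic, and the class name is hypothetical.

```java
import java.util.Properties;

/** Illustrative throughput-oriented producer settings; values are assumptions. */
final class ThroughputProducerConfig {

    static Properties build() {
        Properties props = new Properties();
        // Larger batches than the 16 KB default give compression more to work with.
        props.setProperty("batch.size", String.valueOf(64 * 1024));
        // Let batches fill for up to 20 ms before sending.
        props.setProperty("linger.ms", "20");
        // Compression is applied per batch.
        props.setProperty("compression.type", "lz4");
        return props;
    }

    public static void main(String[] args) {
        build().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```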

Broker Quota Throttling

Producer-side buffering protects individual producers from overwhelming themselves. Broker quotas protect the cluster from being overwhelmed by any single client.

Kafka supports three quota types, each applied per user or client ID on a per-broker basis:

Quota type	What it limits	Available since
Produce quota	Bytes per second written	Kafka 0.9
Fetch quota	Bytes per second read	Kafka 0.9
Request quota	Percentage of broker request-handler CPU time	Kafka 0.11

When a client exceeds its quota, the broker does not reject the request. Instead, it computes a throttle delay from how far the client's measured rate overshot its limit, where the rate is averaged over a sliding window of one-second samples (the window is set by quota.window.num and quota.window.size.seconds). For a produce quota of 10 MB/s, a client measured at 15 MB/s over a one-second window receives a throttle delay of roughly 500 ms.
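The 500 ms figure follows from a simple relation: if O is the observed rate, T the quota, and W the measurement window, the delay X that brings the average back to T is X = (O - T) / T × W. The sketch below works through that arithmetic; it assumes a one-second window for the example, and the class and method names are hypothetical.

```java
/** Worked arithmetic for the throttle delay; a sketch, not broker code. */
final class ThrottleDelay {

    /** Returns the delay in ms needed to bring the observed rate back to the quota. */
    static long delayMs(double observedBytesPerSec, double quotaBytesPerSec, long windowMs) {
        if (observedBytesPerSec <= quotaBytesPerSec) {
            return 0L; // within quota: no throttling
        }
        double overshootRatio = (observedBytesPerSec - quotaBytesPerSec) / quotaBytesPerSec;
        return Math.round(overshootRatio * windowMs);
    }

    public static void main(String[] args) {
        double mb = 1024 * 1024;
        // 15 MB/s observed against a 10 MB/s quota over a 1 s window.
        System.out.println(delayMs(15 * mb, 10 * mb, 1_000)); // 500
    }
}
```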

For produce requests, the broker returns the response immediately but includes a non-zero ThrottleTimeMs field. The client SDK reads this value and pauses outgoing requests for that duration. For fetch requests, the response contains no data and the client backs off before its next poll. The broker also mutes the client's network channel during the delay window, so clients that ignore ThrottleTimeMs still cannot push more requests through.

You configure quotas at runtime with no broker restart required:

# Set a produce quota of 10 MB/s for a specific client ID
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter \
  --add-config 'producer_byte_rate=10485760' \
  --entity-type clients \
  --entity-name my-producer

One thing to account for in quota math: quotas are per broker, not per cluster. A quota of 50 MB/s on a 6-broker cluster allows up to 300 MB/s of aggregate throughput for that client if its writes are spread evenly across brokers.

Consumer Flow Control: Pause, Resume, and Poll Sizing

On the consumption side, Kafka's pull-based model is itself a form of flow control. Your consumer calls poll() at its own pace; the broker sends back available records up to fetch.max.bytes (default: 50 MB) and max.partition.fetch.bytes (default: 1 MB per partition). You control throughput entirely by controlling when and how often you call poll().

Two configs bound how much work each poll cycle returns:

max.poll.records (default: 500): the maximum number of records returned per poll() call
max.poll.interval.ms (default: 5 minutes): the maximum time allowed between consecutive poll() calls before the consumer is considered failed, leaves the group, and triggers a rebalance

If your downstream processing is slow, max.poll.records is the first lever to reach for. Dropping it from 500 to 50 means each poll cycle hands you less work, you return to calling poll() faster, and you stay well inside the max.poll.interval.ms window. This alone prevents a large class of rebalance storms that masquerade as "Kafka being slow."
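One way to pick a value is to work backwards from measured processing time. The helper below is a back-of-the-envelope sketch: the safety fraction (how much of max.poll.interval.ms you are willing to spend on one batch) is an assumption, and the names are hypothetical.

```java
/** Back-of-the-envelope sizing for max.poll.records; names are hypothetical. */
final class PollSizing {

    /**
     * @param perRecordMs       measured average processing time per record
     * @param maxPollIntervalMs the consumer's max.poll.interval.ms
     * @param safetyFraction    share of the interval one batch may consume, e.g. 0.5
     */
    static int maxPollRecords(long perRecordMs, long maxPollIntervalMs, double safetyFraction) {
        if (perRecordMs <= 0) {
            throw new IllegalArgumentException("perRecordMs must be positive");
        }
        long budgetMs = (long) (maxPollIntervalMs * safetyFraction);
        long records = budgetMs / perRecordMs;
        // Always allow at least one record per poll.
        return (int) Math.max(1L, Math.min(records, Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        // 200 ms per record, 5-minute interval, use at most half of it.
        System.out.println(maxPollRecords(200, 300_000, 0.5)); // 750
    }
}
```

Averages hide outliers, so it is worth measuring a high percentile of processing time rather than the mean before trusting the result.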

For more dynamic control, KafkaConsumer exposes pause(Collection<TopicPartition>) and resume(Collection<TopicPartition>). When you pause a partition, subsequent poll() calls return no records from that partition. The consumer remains in the group and continues sending heartbeats; it just stops fetching. This is the right tool when your processing is async (thread pools, HTTP calls, database writes) and you need to prevent a downstream queue from growing without bound.

A practical pattern:

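// Assumes processingQueue is a bounded BlockingQueue drained by worker threads,
// with capacity comfortably above HIGH_WATERMARK so put() rarely blocks.
while (true) {
    // Keep polling while paused: paused partitions return no records,
    // but each poll() call still resets the max.poll.interval.ms timer.
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));

    for (ConsumerRecord<String, String> record : records) {
        processingQueue.put(record); // blocks only if the queue is completely full
    }

    if (processingQueue.size() > HIGH_WATERMARK) {
        consumer.pause(consumer.assignment()); // stop fetching, stay in the group
    } else if (processingQueue.size() < LOW_WATERMARK) {
        consumer.resume(consumer.paused());    // resume whatever is currently paused
    }
}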

The one thing to watch: pausing does not exempt you from polling. The loop has to keep calling poll() while partitions are paused, and if the gap between two poll() calls exceeds max.poll.interval.ms (for example, because put() blocks on a full queue), the consumer is removed from the group. Size the queue and set HIGH_WATERMARK so the poll loop never stalls for anything close to that window.
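The two-watermark decision is easy to get subtly wrong, so it can help to isolate it from the consumer. Below is a small, Kafka-free sketch of the hysteresis logic; the class name and the example thresholds are hypothetical.

```java
/** Hysteresis gate for pause/resume decisions; independent of the Kafka client. */
final class WatermarkGate {
    private final int lowWatermark;
    private final int highWatermark;
    private boolean paused;

    WatermarkGate(int lowWatermark, int highWatermark) {
        if (lowWatermark >= highWatermark) {
            throw new IllegalArgumentException("lowWatermark must be below highWatermark");
        }
        this.lowWatermark = lowWatermark;
        this.highWatermark = highWatermark;
    }

    /** Returns true when fetching should be paused for the given queue depth. */
    boolean update(int queueDepth) {
        if (queueDepth > highWatermark) {
            paused = true;
        } else if (queueDepth < lowWatermark) {
            paused = false;
        }
        // Between the watermarks the previous state is kept, which avoids flapping.
        return paused;
    }

    public static void main(String[] args) {
        WatermarkGate gate = new WatermarkGate(100, 500);
        System.out.println(gate.update(600)); // true: above the high watermark
        System.out.println(gate.update(300)); // true: still draining
        System.out.println(gate.update(50));  // false: below the low watermark
    }
}
```

In the poll loop, the returned flag would decide whether to call pause() or resume() on the consumer.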

Kafka Streams: Backpressure for Free

Kafka Streams applications are simultaneously consumers and producers: they read from input topics, transform records, and write results to output topics. Because they use the standard consumer client under the hood, backpressure emerges naturally from the pull-based poll rhythm.

If processing slows, the Streams thread takes longer per poll cycle. This means fewer records flow through per unit of time, without any explicit pause/resume logic on your part. Streams keeps only a bounded buffer of fetched records per partition and stops fetching for a partition while its buffer is full, so slow processing does not translate into unbounded memory growth.

You still care about max.poll.records and commit.interval.ms. A slow processing stage increases the uncommitted offset window; if a Streams application restarts, it replays all uncommitted records. Shorter commit intervals reduce that replay window, at the cost of more offset commits to the __consumer_offsets topic.

Techniques at a Glance

Mechanism	Where it lives	What it protects	Key config
Producer buffer blocking	Producer client	Slows the caller when broker is slow	`buffer.memory`, `max.block.ms`
Batch accumulation	Producer client	Smooths throughput spikes	`batch.size`, `linger.ms`
Compression	Producer client	Reduces wire bytes, extends buffer headroom	`compression.type`
Broker quotas	Broker	Prevents noisy clients from monopolizing resources	`producer_byte_rate`, `consumer_byte_rate`
Poll sizing	Consumer client	Bounds records per processing cycle	`max.poll.records`
Pause/Resume	Consumer client	Dynamic per-partition flow control	`KafkaConsumer.pause()`
Kafka Streams pull model	Streams runtime	Implicit throttling via poll cadence	`max.poll.records`, `commit.interval.ms`

Where to Start

Begin on the consumer side. Measure your per-message processing time, then set max.poll.records so one poll cycle's work fits inside max.poll.interval.ms with room to spare. This one change resolves most consumer-lag issues before you need to touch anything else.

Add pause()/resume() if your processing is async and you need to protect a downstream queue from growing without bound. Keep your watermarks far enough apart that the processing thread can drain between poll cycles.

On the producer side, decide how you want failures to surface. The 60-second max.block.ms default means a stalled producer is invisible for up to a minute. Lowering it to 5-10 seconds and treating BufferExhaustedException as a load signal gives you faster, actionable feedback. Pair this with appropriate linger.ms and a compression codec to increase throughput before the buffer becomes the constraint.

Broker quotas earn their configuration cost in multi-tenant clusters or anywhere you have a mix of batch and interactive workloads sharing the same brokers. A runaway batch ingest job should not be able to saturate the cluster and introduce latency spikes for latency-sensitive consumers.

Consumer lag, visible through kafka-consumer-groups.sh --describe or the JMX metric records-lag-max, is the single health signal that ties all these mechanisms together. Sustained, growing lag tells you the consumers are falling behind before the system enters a failure mode. The root cause might be anywhere in this stack, but the lag metric is where you start the investigation.
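Lag itself is just a subtraction per partition: the log end offset minus the group's committed offset. The sketch below shows that arithmetic on plain maps; the partition names and offsets are made up, and treating a partition with no committed offset as starting from 0 is a simplifying assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Per-partition lag arithmetic on plain maps; data and names are illustrative. */
final class ConsumerLag {

    /** lag = log end offset - committed offset, floored at zero. */
    static Map<String, Long> perPartition(Map<String, Long> logEndOffsets,
                                          Map<String, Long> committedOffsets) {
        Map<String, Long> lag = new HashMap<>();
        for (Map.Entry<String, Long> entry : logEndOffsets.entrySet()) {
            // Assumption: a partition with no committed offset is counted from 0.
            long committed = committedOffsets.getOrDefault(entry.getKey(), 0L);
            lag.put(entry.getKey(), Math.max(0L, entry.getValue() - committed));
        }
        return lag;
    }

    /** Largest lag across partitions, or 0 when there are none. */
    static long max(Map<String, Long> lag) {
        return lag.values().stream().mapToLong(Long::longValue).max().orElse(0L);
    }

    public static void main(String[] args) {
        Map<String, Long> end = Map.of("orders-0", 1_000L, "orders-1", 500L);
        Map<String, Long> committed = Map.of("orders-0", 900L, "orders-1", 500L);
        System.out.println(max(perPartition(end, committed))); // 100
    }
}
```

A single reading says little; what matters is whether the number keeps growing across successive samples.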

Five Spring Boot Patterns That Prevent Production Failures

srinivas reddy gouru — Sat, 06 Jun 2026 00:33:18 +0000

The Gap Between Tutorial Code and Production Reality

Spring Boot is an opinionated framework for building Java services quickly. It auto-configures your web server, data source, and application context so you can focus on writing business logic rather than plumbing. This article is for engineers who have a Spring Boot service running and want to know which additional patterns separate a service that survives real traffic from one that quietly misbehaves under it.

Consider a concrete example. A downstream API starts responding slowly. Tomcat threads pile up waiting for a response that never comes, because Spring Boot sets no default read timeout. [1] With 60 threads hung on the slow dependency, only 140 of the default 200 remain to serve real traffic, and throughput drops accordingly. [1] In a busier environment, the pool exhausts entirely, and the service stops accepting new requests without throwing a single exception.
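For illustration, here is what bounding such a call looks like with the JDK's built-in HTTP client (Java 11+). The article's service might use a different client entirely, so treat the client choice, the URL, and the 2-second and 3-second values as assumptions; the point is only that both a connect timeout and a response timeout are set explicitly.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

/** Sketch: a downstream call that cannot hold a request thread indefinitely. */
final class BoundedDownstreamCall {

    /** Client that gives up on connecting after 2 seconds (assumed value). */
    static HttpClient client() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
    }

    /** Request that fails with HttpTimeoutException if no response arrives within 3 seconds. */
    static HttpRequest request(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(3))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = request("http://localhost:8080/inventory"); // hypothetical endpoint
        System.out.println(req.timeout().orElseThrow());
    }
}
```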

This failure mode does not appear in any beginner Spring Boot tutorial, and that's not an accident of curriculum design. Tutorials optimise for getting something running. Production optimises for keeping it running when the environment misbehaves: when a downstream service hangs, when a deployment interrupts in-flight requests, when configuration differs between staging and prod, when a dependency starts failing intermittently overnight.

Five patterns address five distinct categories of that production reality. Externalised configuration with profiles keeps environment-specific values out of your artefact, making changes auditable and safe. Graceful shutdown with lifecycle hooks ensures that a rolling deployment doesn't drop mid-flight requests when the process receives a termination signal. [2] Structured async processing with @Async and explicit thread pool configuration moves slow work off the request thread, so a sluggish job can't starve the rest of the application. Circuit breaking with Resilience4j gives the service a way to stop calling a failing dependency, rather than letting failures cascade. Health and readiness probes with Actuator tell your orchestration layer whether the service is actually ready to receive traffic, not just whether the process started.

Each of these patterns is already available in the standard Spring Boot dependency tree. None requires a major architectural change. What they require is knowing they exist, understanding what failure they prevent, and wiring them in before traffic is real enough to expose the gap. The rest of this article covers each pattern in turn: what it does, how to configure it, and the specific failure scenario it prevents.

Pattern 1, Externalised Configuration with Profiles: Eliminating Environment Mismatch

A quieter but common failure happens before the service ever receives real traffic: the application runs fine locally, ships to production, and immediately misbehaves because it's still pointed at the dev database or using in-memory H2 instead of Postgres. The fix isn't discipline alone; it's understanding how Spring Boot resolves configuration values and building your config structure around that resolution order.

Both .properties and .yml formats are supported for Spring Boot configuration files, and the two are functionally equivalent. The examples below use .properties throughout; if your project uses .yml, the same keys and values apply with YAML's indentation syntax instead.

Spring Boot 2 evaluates property sources in a well-defined order of priority. Command-line arguments sit at the top, overriding everything below them. OS environment variables come next. Below those sit the profile-specific property files, and at the bottom sits the base application.properties bundled inside your jar. This layering is the whole point: the jar is immutable, and every environment customises it by injecting values above the jar's baseline rather than by modifying what's baked in.
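The layering can be pictured as a "first source that defines the key wins" lookup. The sketch below models that idea with plain maps; it is a conceptual illustration, not Spring's implementation, and it uses already-normalised keys rather than the SERVER_PORT style that real environment variables would use.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Conceptual model of layered property resolution; not Spring's actual code. */
final class LayeredConfig {

    /** Sources are ordered highest priority first; the first one defining the key wins. */
    static Optional<String> resolve(String key, List<Map<String, String>> sourcesByPriority) {
        for (Map<String, String> source : sourcesByPriority) {
            String value = source.get(key);
            if (value != null) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> commandLine = Map.of();
        Map<String, String> environment = Map.of("server.port", "9090");
        Map<String, String> profileFile = Map.of("server.port", "8081");
        Map<String, String> baseFile = Map.of("server.port", "8080",
                                              "spring.application.name", "order-service");
        List<Map<String, String>> sources =
                List.of(commandLine, environment, profileFile, baseFile);

        System.out.println(resolve("server.port", sources).orElse("unset"));             // 9090
        System.out.println(resolve("spring.application.name", sources).orElse("unset")); // order-service
    }
}
```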

Profile-specific files named application-{profile}.properties placed outside the jar take precedence over their counterparts inside the jar. A Kubernetes deployment can mount an environment-specific application-prod.properties as a ConfigMap volume, and Spring Boot will pick it up without any code change or rebuild. This is what allows a single deployable artefact to run correctly across dev, staging, and production.

A typical split looks like this. The base application.properties holds values that genuinely don't change across environments:

# application.properties
spring.application.name=order-service
spring.jpa.open-in-view=false
management.endpoints.web.exposure.include=health,info,prometheus

The dev profile keeps things frictionless locally:

# application-dev.properties
spring.datasource.url=jdbc:h2:mem:orderdb
spring.datasource.driver-class-name=org.h2.Driver
spring.jpa.show-sql=true
logging.level.com.example=DEBUG

Production tightens everything down:

# application-prod.properties
spring.datasource.url=jdbc:postgresql://${DB_HOST}:5432/orderdb
spring.datasource.username=${DB_USER}
spring.datasource.password=${DB_PASSWORD}
spring.jpa.show-sql=false
logging.level.com.example=WARN

The prod file deliberately delegates secrets to environment variables using ${...} placeholders. If DB_PASSWORD isn't set at startup, Spring Boot fails fast with a clear error rather than starting with a null password that only fails on the first query.
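To make the fail-fast behaviour concrete, here is a toy resolver that substitutes ${...} placeholders from a map and refuses to continue when one is missing. It illustrates the idea only; Spring's own resolution is more capable (default values, nested placeholders), and the class name and error message here are hypothetical.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy ${...} resolver that fails fast on a missing value; not Spring's implementation. */
final class PlaceholderResolver {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    static String resolve(String value, Map<String, String> environment) {
        Matcher matcher = PLACEHOLDER.matcher(value);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String name = matcher.group(1);
            String replacement = environment.get(name);
            if (replacement == null) {
                // Refuse to start rather than run with a half-resolved value.
                throw new IllegalArgumentException("Could not resolve placeholder '" + name + "'");
            }
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String url = resolve("jdbc:postgresql://${DB_HOST}:5432/orderdb",
                             Map.of("DB_HOST", "db.internal"));
        System.out.println(url); // jdbc:postgresql://db.internal:5432/orderdb
    }
}
```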

Profile activation is straightforward. Locally, you set spring.profiles.active=dev in your IDE run configuration, or SPRING_PROFILES_ACTIVE=dev in a .env file if your tooling loads one into the process environment. In a container, you set the environment variable SPRING_PROFILES_ACTIVE=prod. A command-line override, --spring.profiles.active=staging, works for one-off deployments without touching any file.

One thing worth being explicit about: the profile mechanism is not a secret store. Credentials should come through environment variables or a secrets manager like Vault, with the profile file holding only the placeholder reference. Teams that skip this step often end up with database passwords in version control, which is a much worse problem than a misconfigured datasource URL.

With the environment mismatch handled, the next failure category is what happens when a pod is killed mid-request, and that requires a different approach entirely.

Pattern 2, Graceful Shutdown with Lifecycle Hooks: Preventing Dirty Mid-Request Kills

When Kubernetes rolls out a new version of your service, it sends SIGTERM to the old pod. In Spring Boot's immediate shutdown mode (the default before Spring Boot 3.4), the embedded server stops at once on that signal without waiting for active requests. Any request that was mid-flight (a database write half-committed, a payment being authorised, a file being streamed) gets cut off at the socket level. The client receives a connection-reset error and has no way to know whether the operation succeeded.

Spring Boot 2.3 introduced a built-in graceful shutdown to address exactly this. [2] Enabling it takes two lines in application.properties:

server.shutdown=graceful
spring.lifecycle.timeout-per-shutdown-phase=30s

With server.shutdown=graceful, the embedded server (Tomcat, Jetty, or Undertow) stops accepting new connections the moment SIGTERM arrives, then waits for active requests to complete before allowing the JVM to exit. [2] The timeout-per-shutdown-phase property sets the ceiling: if in-flight requests haven't finished within 30 seconds, Spring proceeds with the shutdown anyway rather than hanging indefinitely. Tune that timeout to match your slowest plausible request, not your average one.
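The "stop intake, then wait up to a ceiling" behaviour has a close analogue in the JDK's own ExecutorService, which makes it easy to experiment with outside a server. The sketch below is only that analogue, not how the embedded server is implemented; DrainDemo, drain, and runOnce are hypothetical names.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class DrainDemo {

    /**
     * Stops intake, then waits up to timeoutMillis for in-flight tasks.
     * Returns true if everything finished inside the budget.
     */
    static boolean drain(ExecutorService pool, long timeoutMillis) {
        pool.shutdown(); // new submissions are rejected; running tasks continue
        try {
            boolean finished = pool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
            if (!finished) {
                pool.shutdownNow(); // budget exhausted: interrupt whatever is still running
            }
            return finished;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
            return false;
        }
    }

    /** Starts one simulated request lasting workMillis, then drains with the given budget. */
    static boolean runOnce(long workMillis, long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(() -> {
            try {
                Thread.sleep(workMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        return drain(pool, timeoutMillis);
    }
}
```

Calling runOnce with a budget larger than the work returns true; a budget shorter than the work returns false, which is the case the timeout exists to bound.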

Graceful shutdown handles the HTTP layer, but your application likely has other resources that need orderly cleanup: database connection pools, scheduled jobs mid-execution, Kafka consumer loops, open file handles. This is where @PreDestroy and ContextClosedEvent earn their place. [3] Annotate a method with @PreDestroy, and Spring calls it during the bean's destruction phase, after the server has drained. For cross-cutting shutdown logic, such as flushing a metrics buffer or signalling worker threads to stop, implementing an ApplicationListener<ContextClosedEvent> gives you a single, centralised place to coordinate cleanup.

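import org.springframework.context.ApplicationListener;
import org.springframework.context.event.ContextClosedEvent;
import org.springframework.stereotype.Component;

@Component
public class MetricsFlushListener implements ApplicationListener<ContextClosedEvent> {

    // MetricsBuffer is an application class, not a Spring type
    private final MetricsBuffer buffer;

    public MetricsFlushListener(MetricsBuffer buffer) {
        this.buffer = buffer;
    }

    @Override
    public void onApplicationEvent(ContextClosedEvent event) {
        // Runs once while the application context is closing
        buffer.flush();
    }
}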

There is one gap that Spring's built-in graceful shutdown cannot close on its own. Kubernetes removes a pod from the service endpoints list asynchronously after sending SIGTERM. There is a brief window, typically a few seconds, during which the pod is already refusing new connections, but the load balancer is still routing traffic to it. Requests that arrive in that window hit a closed socket.

The standard mitigation is a Kubernetes preStop hook that introduces a deliberate delay before the JVM begins shutting down, typically a short sleep. In distroless images where sleep is unavailable, an Actuator-backed endpoint solves the same problem: a GET /actuator/preStopHook/{delayInMillis} endpoint holds the preStop hook open for the specified duration, giving the control plane time to drain the pod from all endpoint slices before SIGTERM actually reaches the application. [4] This endpoint is not part of stock Actuator, so it has to come from a library or from your own code. A delay of 10 to 15 seconds covers most Kubernetes environments; the right value depends on your cluster's endpoint propagation latency.

Once your service shuts down cleanly, the next failure category is what happens during normal operation when a slow downstream dependency starts consuming all your threads.

Pattern 3, Structured Async Processing with @Async and Thread Pool Tuning: Defeating Thread Starvation

The thread-exhaustion scenario from the introduction had a specific cause: every slow downstream call was blocking a Tomcat worker thread, and the application had no way to offload that work elsewhere. Spring's @Async mechanism exists precisely for this situation, but enabling it naively introduces a different failure mode that is just as damaging.

When you add @EnableAsync to a configuration class and annotate a service method with @Async, Spring wraps the bean in a proxy. Calls to that method from outside the bean are intercepted and submitted to an executor, freeing the calling thread immediately. [5] The Tomcat request thread returns to the pool and can accept the next incoming request while the async work runs separately.

The danger is in which executor handles the work by default. In plain Spring Framework, with no executor bean available, @Async falls back to SimpleAsyncTaskExecutor, which spawns a brand-new thread for every invocation and imposes no upper bound on how many can exist simultaneously. [6] Under load, this trades Tomcat thread exhaustion for JVM thread exhaustion; you get OutOfMemoryError: unable to create new native thread instead of a timeout, which is harder to diagnose and just as fatal. [6] Spring Boot's auto-configured executor avoids that particular failure, but its queue is unbounded by default, so a slow dependency shows up as ever-growing queued work rather than as back-pressure.

The fix is a named ThreadPoolTaskExecutor bean with explicit bounds:

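import java.util.concurrent.Executor;
import java.util.concurrent.ThreadPoolExecutor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
@EnableAsync
public class AsyncConfig {

    @Bean(name = "taskExecutor")
    public Executor taskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(10);     // threads kept alive even when idle
        executor.setMaxPoolSize(25);      // ceiling, reached only once the queue is full
        executor.setQueueCapacity(200);   // bounded buffer for excess work
        executor.setThreadNamePrefix("async-worker-");
        executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
        executor.initialize();
        return executor;
    }
}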

The three sizing parameters interact in a way that surprises many engineers. [7] The pool starts with corePoolSize threads and keeps them alive even when idle. New tasks go to the queue once all core threads are busy. Only after the queue fills up does the pool grow toward maxPoolSize. Setting queueCapacity to a generous but finite value, 200 in the example above, means excess work is buffered rather than immediately spawning new threads. The CallerRunsPolicy rejection handler applies natural back-pressure by running rejected tasks on the submitting thread instead of throwing an exception, so when the pool and queue are both full, the caller slows down rather than failing outright. [8]
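The core, queue, max ordering is easier to believe once you watch it happen. ThreadPoolTaskExecutor delegates to the JDK's ThreadPoolExecutor, so the sketch below uses the JDK class directly with deliberately tiny bounds (core 1, max 2, queue 1) to make each transition visible. PoolGrowthDemo and Observation are hypothetical names for this illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PoolGrowthDemo {

    /** Snapshot of the pool taken after each submission. */
    record Observation(int threadsAfterFirst, int threadsAfterSecond, int queuedAfterSecond,
                       int threadsAfterThird, boolean fourthRanOnCaller) {}

    static Observation observe() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 2, 60, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(1),
                new ThreadPoolExecutor.CallerRunsPolicy());
        CountDownLatch release = new CountDownLatch(1);
        Runnable blocked = () -> {
            try {
                release.await(); // simulates a slow downstream call
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        pool.execute(blocked);                  // 1st task: starts the core thread
        int afterFirst = pool.getPoolSize();
        pool.execute(blocked);                  // 2nd task: core is busy, so it is queued
        int afterSecond = pool.getPoolSize();
        int queued = pool.getQueue().size();
        pool.execute(blocked);                  // 3rd task: queue is full, pool grows to max
        int afterThird = pool.getPoolSize();

        Thread caller = Thread.currentThread();
        boolean[] ranOnCaller = new boolean[1];
        // 4th task: pool and queue are both full, so CallerRunsPolicy runs it on this thread.
        pool.execute(() -> ranOnCaller[0] = Thread.currentThread() == caller);

        release.countDown(); // let the blocked tasks finish
        pool.shutdown();
        return new Observation(afterFirst, afterSecond, queued, afterThird, ranOnCaller[0]);
    }
}
```

The second task does not start a second thread even though maxPoolSize allows one; growth only happens on the third submission, after the queue has filled.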

One subtlety worth understanding: @Async works via Spring's proxy mechanism, which means the proxy is only engaged when the call crosses a bean boundary. If a method inside NotificationService calls another @Async method on the same NotificationService instance via this.sendEmail(...), the proxy is bypassed entirely, and the method runs synchronously on the caller's thread. To get async dispatch from within the same class, inject a reference to self via @Autowired and call through that reference, or move the async method to a separate bean.
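The self-invocation gap is a property of proxies in general, not of @Async specifically, so it can be reproduced with the JDK's own dynamic proxy. This is a sketch of the principle only; Spring's proxies are created differently, and Notifier, PlainNotifier, and ProxyDemo are hypothetical names.

```java
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

interface Notifier {
    void sendEmail(String to);
    void sendAll(List<String> recipients);
}

final class PlainNotifier implements Notifier {
    @Override
    public void sendEmail(String to) {
        // real work would happen here
    }

    @Override
    public void sendAll(List<String> recipients) {
        for (String r : recipients) {
            this.sendEmail(r); // direct call on the target: the proxy never sees it
        }
    }
}

final class ProxyDemo {

    /** Returns the names of the method calls that the proxy actually intercepted. */
    static List<String> intercepted(List<String> recipients) {
        List<String> log = new ArrayList<>();
        Notifier target = new PlainNotifier();
        Notifier proxy = (Notifier) Proxy.newProxyInstance(
                Notifier.class.getClassLoader(),
                new Class<?>[] {Notifier.class},
                (p, method, args) -> {
                    log.add(method.getName()); // stands in for "submit to executor"
                    return method.invoke(target, args);
                });

        proxy.sendAll(recipients);              // intercepted once, inner calls are not
        proxy.sendEmail("direct@example.com");  // intercepted
        return log;
    }
}
```

However many recipients are passed in, the log contains only sendAll and the one direct sendEmail call, which is why the inner calls would run synchronously in the @Async case.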

With a bounded executor in place, your async service methods look straightforward:

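@Service
public class NotificationService {

    // EmailClient is a placeholder for whatever mail-sending dependency you use
    private final EmailClient emailClient;

    public NotificationService(EmailClient emailClient) {
        this.emailClient = emailClient;
    }

    @Async("taskExecutor")
    public CompletableFuture<Void> sendEmail(String recipient, String body) {
        // I/O-bound work happens here, off the Tomcat thread
        emailClient.send(recipient, body);
        return CompletableFuture.completedFuture(null);
    }
}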

Naming the executor explicitly in the annotation ("taskExecutor") avoids any ambiguity if you later define multiple executors for different workloads, such as a separate pool for CPU-intensive tasks versus I/O-bound ones.

Thread pool sizing is not a one-size-fits-all choice. For I/O-bound async work, a larger corePoolSize relative to your CPU count makes sense because threads spend most of their time waiting on network or disk. For CPU-bound work, setting corePoolSize to roughly the number of available processors prevents context-switching overhead from eating the gains. Exposing corePoolSize and maxPoolSize through application.properties lets you tune these values per environment without redeploying.

Bounded async processing resolves the thread starvation problem, but it does not help when the downstream service you're calling starts returning errors rather than slow responses. That failure mode, a dependency that is up but broken, is where circuit breaking becomes essential.

Pattern 4, Circuit Breaking with Resilience4j: Stopping Cascading Failures at the Boundary

The previous section showed how unbounded thread pools can cause thread starvation when a downstream call is slow. Circuit breaking is what stops that slow call from becoming everyone's problem. Instead of letting threads pile up waiting for a timeout that may never come, a circuit breaker tracks the error rate of outbound calls and, once failures exceed a threshold, stops making the call at all and returns a fast failure immediately.

Resilience4j models this with three states. In the CLOSED state, calls pass through normally and outcomes are recorded in a sliding window. Once the failure rate inside that window crosses the configured threshold, the breaker transitions to OPEN, where every call is rejected immediately with a CallNotPermittedException: no network round-trip, no waiting. After waitDurationInOpenState elapses, the breaker moves to HALF_OPEN and admits a small number of probe requests. If those succeed, the circuit closes again. If they fail, it reopens immediately, and the timer resets. [9]
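To make the three states concrete, here is a minimal count-based state machine in plain Java. It is a teaching sketch under simplifying assumptions, not Resilience4j's implementation: TinyCircuitBreaker, tryAcquire, and onResult are hypothetical names, any failed probe reopens the circuit, and the clock is injected so the wait period can be simulated.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

final class TinyCircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;
    private final int minimumCalls;
    private final double failureRateThreshold; // percent, e.g. 50.0
    private final long waitMillis;
    private final int halfOpenProbes;
    private final LongSupplier clock;          // injected so tests can control time

    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failure
    private State state = State.CLOSED;
    private long openedAt;
    private int probeSuccesses;

    TinyCircuitBreaker(int windowSize, int minimumCalls, double failureRateThreshold,
                       long waitMillis, int halfOpenProbes, LongSupplier clock) {
        this.windowSize = windowSize;
        this.minimumCalls = minimumCalls;
        this.failureRateThreshold = failureRateThreshold;
        this.waitMillis = waitMillis;
        this.halfOpenProbes = halfOpenProbes;
        this.clock = clock;
    }

    /** Returns false when the call must be rejected without touching the dependency. */
    boolean tryAcquire() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= waitMillis) {
            state = State.HALF_OPEN; // wait elapsed: let probe calls through
            probeSuccesses = 0;
        }
        return state != State.OPEN;
    }

    /** Records the outcome of a call that was allowed through. */
    void onResult(boolean failure) {
        if (state == State.OPEN) {
            return;
        }
        if (state == State.HALF_OPEN) {
            if (failure) {
                open(); // simplification: a single failed probe reopens the circuit
            } else if (++probeSuccesses >= halfOpenProbes) {
                state = State.CLOSED;
                window.clear();
            }
            return;
        }
        window.addLast(failure);
        if (window.size() > windowSize) {
            window.removeFirst(); // keep only the most recent windowSize outcomes
        }
        long failures = window.stream().filter(Boolean::booleanValue).count();
        if (window.size() >= minimumCalls
                && 100.0 * failures / window.size() >= failureRateThreshold) {
            open();
        }
    }

    private void open() {
        state = State.OPEN;
        openedAt = clock.getAsLong();
        window.clear();
    }

    State state() {
        return state;
    }
}
```

With a window of 10, a minimum of 5 calls, and a 50 percent threshold, three failures followed by two successes opens the circuit on the fifth call, because only then is the sample large enough to be evaluated.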

To add this to a Spring Boot 3 service, you need three dependencies: the Resilience4j starter, AOP support for the annotations to work, and Actuator for observability. [10]

<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-spring-boot3</artifactId>
    <version>2.2.0</version>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-aop</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

Configure the breaker's behavior in application.yml. The two parameters that matter most in production are slidingWindowSize (how many recent calls are evaluated) and waitDurationInOpenState (how long the breaker stays open before probing for recovery). [10]

resilience4j:
  circuitbreaker:
    instances:
      paymentService:
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        failureRateThreshold: 50
        waitDurationInOpenState: 30s
        permittedNumberOfCallsInHalfOpenState: 3
        automaticTransitionFromOpenToHalfOpenEnabled: true

On the Java side, annotate the method that crosses the service boundary (the connector or repository layer), not the controller. [11] Pair it with a fallback method that returns a degraded-but-valid response rather than propagating an exception to the caller.

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
public class PaymentConnector {

    private final RestTemplate restTemplate;

    public PaymentConnector(RestTemplate restTemplate) {
        this.restTemplate = restTemplate;
    }

    // The name must match the instance key configured in application.yml
    @CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
    public PaymentResponse charge(PaymentRequest request) {
        return restTemplate.postForObject(
                "https://fraud-api.example.com/check", request, PaymentResponse.class);
    }

    // Same parameters as charge(), plus the exception that triggered the fallback
    private PaymentResponse paymentFallback(PaymentRequest request, Exception ex) {
        return PaymentResponse.builder()
                .orderId(request.getOrderId())
                .status("PENDING_REVIEW")
                .message("Payment queued for manual review")
                .build();
    }
}

The fallback method's signature must match the protected method exactly, with an Exception or Throwable parameter appended. Get this wrong and Resilience4j will silently fail to wire the fallback, propagating the original exception instead. [11]

One configuration detail worth getting right early: minimumNumberOfCalls. If it is set too low, a single failed request at startup can trip a breaker that has only seen one call. Setting it to at least 5 ensures the failure rate is calculated from a meaningful sample before the circuit opens. [12] Similarly, slowCallDurationThreshold, paired with slowCallRateThreshold, lets you count calls that hang for more than, say, two seconds towards opening the circuit, catching a degraded dependency before it starts actively returning errors. [12]

Resilience4j publishes the circuit breaker state via the Actuator automatically when you enable it. In application.yml, expose the circuitbreakers and health endpoints and set management.health.circuitbreakers.enabled: true. A GET to /actuator/circuitbreakers will show the current state, failure rate, and buffered call count for each named instance, useful for confirming that a circuit has opened in production before you start chasing logs. [10]

Circuit breaking solves the cascading-failure problem, but it only helps if you can see when it fires. The next section covers how Actuator health and readiness probes make that degradation visible to both your orchestration layer and your on-call team.

Pattern 5, Health and Readiness Probes with Actuator: Making Degradation Visible

A Spring Boot service can be simultaneously "running" and completely broken. The JVM is up, the process responds to pings, Kubernetes sees a healthy pod, and every request returns a 500 because the database connection pool was exhausted twenty minutes ago. Without health probes, this failure is invisible to the platform, and all it can do is keep routing traffic into the problem.

Spring Boot Actuator addresses this by exposing /actuator/health, /actuator/health/liveness, and /actuator/health/readiness as structured endpoints that aggregate the status of every registered health indicator (database, message broker, disk space, Redis, and more) into a single machine-readable response. The overall status follows a logical AND: if any one component reports DOWN, the aggregate reports DOWN and returns HTTP 503.

The liveness and readiness distinction matters in practice. A liveness probe should be narrow, checking only that the JVM is responsive, not that downstream dependencies are healthy. If your liveness probe checks the database and the database has a thirty-second hiccup, Kubernetes will restart the pod, which does nothing to fix the database and adds unnecessary churn. [13] The readiness probe is where you add dependency checks, because a NOT READY pod is simply removed from the service's endpoint list rather than restarted. Traffic stops arriving; the pod stays alive and recovers when the dependency comes back. [13]
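The practical consequence of keeping the liveness group narrow shows up when a dependency fails. The sketch below models two probe groups over the same set of indicators in plain Java. ProbeGroups and its methods are hypothetical names, and treating an unknown indicator as DOWN is an assumption of this sketch rather than documented Actuator behaviour.

```java
import java.util.List;
import java.util.Map;

final class ProbeGroups {

    enum Status { UP, DOWN }

    /** A group is UP only if every indicator it includes is UP (logical AND). */
    static Status aggregate(List<String> included, Map<String, Status> indicators) {
        for (String name : included) {
            // Assumption for this sketch: an indicator we cannot find counts as DOWN.
            if (indicators.getOrDefault(name, Status.DOWN) == Status.DOWN) {
                return Status.DOWN;
            }
        }
        return Status.UP;
    }

    /** Maps a group status to the HTTP code a probe would see. */
    static int httpStatus(Status status) {
        return status == Status.UP ? 200 : 503;
    }

    public static void main(String[] args) {
        // The database is having a hiccup; the JVM itself is fine.
        Map<String, Status> indicators = Map.of(
                "livenessState", Status.UP,
                "readinessState", Status.UP,
                "db", Status.DOWN);

        int liveness = httpStatus(aggregate(List.of("livenessState"), indicators));
        int readiness = httpStatus(aggregate(List.of("readinessState", "db"), indicators));
        System.out.println("liveness=" + liveness + " readiness=" + readiness);
        // prints "liveness=200 readiness=503"
    }
}
```

Readiness fails, so traffic stops arriving, while liveness keeps passing, so the pod is not restarted for a problem that a restart cannot fix.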

Enable both probe endpoints explicitly in application.yml:

management:
  endpoint:
    health:
      show-details: always
      probes:
        enabled: true
      group:
        liveness:
          include: livenessState
        readiness:
          include: readinessState,db,diskSpace
  health:
    livenessstate:
      enabled: true
    readinessstate:
      enabled: true

Then point your Kubernetes deployment at the right paths:

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 30

A startup probe is worth treating as required for Spring Boot on Kubernetes. Java applications routinely need 30-60 seconds to initialise, and without one, Kubernetes may declare the pod broken before the application context has even finished loading. The startup probe above waits 10 seconds, then polls every 5 seconds and tolerates up to 30 failures, which gives the application roughly 150 seconds (30 × 5) of polling to finish initialising; after it passes, the liveness probe takes over.

Actuator also lets you expose your own health logic. Implement HealthIndicator, annotate it with @Component, and Spring will automatically include it in the aggregated response. The bean name minus the "HealthIndicator" suffix becomes the key in the JSON output, so DatabaseHealthIndicator appears as "database" in the response.

Go-Live Checklist: Which Patterns to Add First

The order depends on what kind of service you are shipping.

Synchronous HTTP service (REST API, BFF, gateway)

Externalised configuration with profiles: environment parity prevents the entire class of "works locally, broken in prod" incidents.
Graceful shutdown prevents in-flight requests from being killed on every deploy.
Circuit breaking with Resilience4j protects your thread pool from slow downstream APIs.
Actuator health and readiness probes make degradation visible to load balancers and on-call engineers.

Async worker or event-driven service (Kafka consumer, job processor)

Externalised configuration with profiles.
@Async with a bounded ThreadPoolTaskExecutor: an unbounded executor will exhaust memory under sustained load.
Actuator health probes: at minimum, expose readiness so the pod can signal when it has fallen behind or lost its broker connection.
Graceful shutdown: give in-flight messages time to complete before the consumer stops.

Any service deploying to Kubernetes

All five patterns apply, but start with readiness probes and graceful shutdown together. A pod that starts receiving traffic before its connection pool is ready, or that gets killed mid-request on every rolling deploy, will produce incidents that are genuinely hard to debug, because the failure window is short and the logs look normal.

The fastest path forward: add spring-boot-starter-actuator to your dependencies today, enable the probe endpoints, and wire the Kubernetes manifest to /actuator/health/liveness and /actuator/health/readiness. That single change makes your service's internal state legible to the platform, and legibility is the prerequisite for everything else.

How Spring AI Works: LLMs, Embeddings, and RAG for Java Engineers

srinivas reddy gouru — Fri, 05 Jun 2026 16:39:37 +0000

The Java Engineer's AI Problem, and Why Spring AI Exists

Spring AI is a Spring Boot framework that connects Java applications to LLMs and vector stores using the same dependency-injection patterns Spring developers already know. Your team picks up a ticket to add an AI-powered feature to an existing Spring Boot service, maybe a document Q&A tool, or a chat assistant backed by company-internal data. You open a browser and start searching for tutorials. Nearly every result assumes you are working in Python, reaching for LangChain, and standing up a virtualenv. The code samples use FastAPI. The LLM client libraries are pip packages. The entire mental model is Python-first, and the message, whether intentional or not, is that AI development happens somewhere other than where you work.

This is the friction Spring AI was built to remove. Java is a powerful, deeply established language running enormous amounts of production workloads, and there is no good reason Java engineers should have to detour through a different language just to call a language model. The question "how do I build an AI solution in Java?" deserves a real answer, not "port the Python tutorial yourself."

Spring AI is that answer. It is an application framework for AI engineering whose explicit goal is to carry Spring ecosystem design principles (portability, modular design, and POJO-centric construction) into the AI domain. At its core, it solves a concrete integration problem: connecting your enterprise data and APIs with AI models, using the same patterns you already reach for when connecting to a database or a message broker.

The project drew inspiration from Python's LangChain and LlamaIndex, but it is not a port of either. The underlying conviction is that generative AI applications will not stay confined to one language. Spring AI was founded on the belief that the next wave of AI-powered software will be ubiquitous across programming languages, and that Java engineers deserve first-class tooling rather than a translation layer bolted on afterwards.

What makes that practical is the mapping layer Spring AI provides between the AI world and the Spring world. Concepts that feel foreign in isolation (chat models, embeddings, vector stores, prompt templates, tool calls) each get a Spring-native abstraction: an auto-configured bean, a declarative client, a repository interface. You wire them with dependency injection the same way you wire a JdbcTemplate or a RestClient. The rest of this article walks through how each of those mappings works, how a request travels from your application code through Spring AI to an LLM and back, and how the framework connects to vector databases for retrieval-augmented generation. By the end, you should be able to read a Spring AI codebase, understand the event flow it describes, and know exactly which dependency to add for each piece of the puzzle.

Spring AI's Core Architecture: Familiar Abstractions, New Capabilities

Spring AI is built on three interlocking design principles, and understanding how they stack together explains why the framework feels so natural to a Spring developer. The principles are not independent: each one depends on the one before it, and together they produce something more useful than any of them would alone.

The foundation is portability, expressed through a set of provider-agnostic interfaces: ChatModel, EmbeddingModel, VectorStore, ImageModel, and so on. Every supported AI provider (OpenAI, Anthropic, Google, Amazon Bedrock, Azure, Ollama) implements the same interfaces. Your application code refers only to these abstractions, never to a provider-specific class. Switching from Claude to GPT-4 means updating a dependency and two lines in application.properties; the ChatModel you injected into your service doesn't change. That portability only becomes practical, though, if something wires those implementations for you automatically.

That something is auto-configuration. Spring AI ships a Spring Boot starter for each supported provider, and adding one to your build is enough for the framework to register a fully wired AI client as a bean. You set spring.ai.openai.api-key in application.properties, and Spring Boot's auto-configuration creates a ChatModel, an EmbeddingModel, and a ChatClient.Builder, all ready for injection, no manual SDK initialisation required. The same mechanism configures vector stores: add spring-ai-pgvector-store-spring-boot-starter and the auto-configured EmbeddingModel is wired into a VectorStore bean automatically. Without the portable interfaces beneath it, this auto-configuration would produce a tangle of provider-specific beans. Because the interfaces are stable contracts, auto-configuration can produce beans that the rest of your application consumes without caring which provider sits behind them.

Those two layers together create the conditions for the third principle: POJO-centric design. An LLM response is free-form text, but your service needs typed data. Spring AI's Structured Outputs feature handles the conversion: the ChatClient adds format instructions to the prompt asking the model to respond in JSON that matches the target type, then deserialises the result directly into whatever Java type you specify. The call reads exactly as it would if you were deserialising an HTTP response with RestClient:

record City(String name, String zipcode) {}

City capital = chatClient.prompt()
        .user("What is the capital of Germany?")
        .call()
        .entity(City.class);

Without portability, you couldn't reuse this pattern across providers. Without auto-configuration, you'd need to manually construct the converter pipeline. The three principles stack, and the stack pays off in code that treats an LLM as just another external data source, one that returns typed objects, gets injected like any other bean, and can be swapped out through configuration.

This layered architecture is what every subsequent Spring AI concept (ChatClient prompts, Advisors, RAG pipelines, and tool calling) is built on top of. The diagram below shows how these layers relate at runtime, from your Spring beans down through the abstraction interfaces to the provider implementations and external services.

Talking to LLMs: ChatClient, Models, and Prompt Templates

The first thing most Spring developers want to do when they pick up Spring AI is send a message to an LLM and get a response back. The entry point for that is ChatClient, a fluent API that feels instantly familiar if you've ever used WebClient or RestClient.

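Getting there requires three things: the right starter on the classpath, a key in application.properties, and an injected ChatClient. For OpenAI, that looks like this (the artifact shown is the pre-1.0 name; from Spring AI 1.0 the starter is called spring-ai-starter-model-openai):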

<!-- pom.xml -->
<dependency>
 <groupId>org.springframework.ai</groupId>
 <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>

# application.properties
spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.chat.options.model=gpt-4o

@Service
public class SupportService {

    private final ChatClient chatClient;

    public SupportService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    public String answer(String question) {
        return chatClient.prompt()
                .user(question)
                .call()
                .content();
    }
}

Auto-configuration registers the ChatClient.Builder bean based on whatever provider starter is on the classpath. You inject the builder, call .build(), and you're done. Swapping from OpenAI to Azure OpenAI or Mistral is a matter of changing the starter dependency and the corresponding properties; the ChatClient call site above stays unchanged.

That portability comes from the ChatModel interface sitting underneath ChatClient. Every provider adapter implements ChatModel, so the same call(Prompt) method signature works regardless of which backend you've configured. You can also inject ChatModel directly if you need lower-level access, but for most application code, ChatClient is the better choice because it layers on convenience methods for streaming, system messages, and chat memory.

Building structured prompts with PromptTemplate

Simple string questions work fine for exploration, but real features need structure. A customer support bot might need to inject the user's account tier and their most recent order number into every prompt. Hardcoding that with string concatenation gets messy fast. PromptTemplate solves this the same way Thymeleaf solves HTML templating: you define a string with named placeholders using {} syntax, then fill them at runtime with a Map.

PromptTemplate template = new PromptTemplate(
 "You are helping a {tier} customer. Their last order was {orderId}. Answer: {question}"
);

Prompt prompt = template.create(Map.of(
 "tier", "premium",
 "orderId", "ORD-9182",
 "question", userInput
));

Generation result = chatModel.call(prompt).getResult();

Under the hood, Spring AI uses the StringTemplate engine by Terence Parr to perform the substitution. If curly braces clash with your template content, you can configure alternate delimiters via StTemplateRenderer. You can also store prompt text in classpath resources and inject it with @Value("classpath:/prompts/support.st"), which keeps long instructions out of your Java source files entirely.
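If the substitution step feels opaque, it helps to see how little is involved in the basic case. The following is a plain-Java sketch of named-placeholder rendering using a regular expression. It is not how StringTemplate or PromptTemplate is implemented; TinyPromptTemplate is a hypothetical name, and failing on a missing variable is a design choice of this sketch.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TinyPromptTemplate {

    // Matches {name} where name is letters, digits, or underscores.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    /** Replaces every {name} with its value; fails loudly if a value is missing. */
    static String render(String template, Map<String, String> variables) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String name = matcher.group(1);
            String value = variables.get(name);
            if (value == null) {
                throw new IllegalArgumentException("No value for placeholder: " + name);
            }
            // quoteReplacement keeps '$' and '\' in user input from being treated specially
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String prompt = render(
                "You are helping a {tier} customer. Their last order was {orderId}.",
                Map.of("tier", "premium", "orderId", "ORD-9182"));
        System.out.println(prompt);
        // prints "You are helping a premium customer. Their last order was ORD-9182."
    }
}
```

Failing on a missing variable is worth copying into real code: a prompt that silently ships with a literal {orderId} in it is hard to spot from the model's answer alone.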

A Prompt can carry multiple Message objects, each tagged with a role. The system role gives the model its persona and constraints before the conversation begins; the user role carries the actual question; the assistant role records prior model responses for multi-turn exchanges. Spring AI provides SystemPromptTemplate and UserMessage as typed helpers for each role, so you rarely work with the raw enum directly. For stateful conversations where the model needs to remember earlier turns, ChatClient supports chat memory out of the box; prior messages are automatically included in subsequent calls without manual prompt concatenation.

Once you can send structured prompts to an LLM, the natural next question is how to give it access to your own data. That's where embeddings and vector stores come in.

Embeddings and Vector Stores: Giving the LLM Long-Term Memory

An embedding is a fixed-length array of floating-point numbers that represents the meaning of a piece of text. When you ask an embedding model to process the sentence "What is the refund policy?", it returns something like a 1536-element vector of decimal values. That vector encodes the semantic content of the sentence, not its exact words. Two sentences that mean the same thing will produce vectors that sit close together in that high-dimensional space, even if they share no vocabulary. That proximity is what makes similarity search possible.

You do not need to understand the mathematics behind this to use it in a Spring application. What matters is the operational picture: you have text, you convert it to a vector, and you store or compare that vector.
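For readers who do want to see the comparison step, "close together" usually means a high cosine similarity, which is a few lines of arithmetic. The sketch below is plain Java with toy two-dimensional vectors; Cosine is a hypothetical helper, and real embeddings simply have many more elements.

```java
final class Cosine {

    /** Cosine similarity: 1.0 for the same direction, 0.0 for orthogonal vectors. */
    static double similarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        // Note: an all-zero vector has no direction, so the result is NaN in that case.
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Because the measure depends only on direction, a vector and a scaled copy of it score as identical, which is why it works well for comparing meaning rather than magnitude.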

EmbeddingModel: The Spring Abstraction

Spring AI wraps embedding providers behind the EmbeddingModel interface, which follows the same design philosophy as ChatModel. Auto-configuration wires a provider-specific implementation into your ApplicationContext, and your code depends only on the interface. Calling embeddingModel.embed("some text") returns a float[] at whatever dimensionality the underlying model uses. For batch workloads, embedForResponse(List<String> texts) returns all vectors in one round trip, which matters when you are processing thousands of documents at startup.

@Service
public class DocumentIndexer {

    private final EmbeddingModel embeddingModel;

    public DocumentIndexer(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    public float[] vectorize(String text) {
        return embeddingModel.embed(text);
    }
}

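Swapping from OpenAI's text-embedding-ada-002 to a locally hosted Ollama model means changing the starter dependency and a few lines in application.properties; the calling code stays the same. The dimensionality of the output will differ, so vectors already stored from the old model need to be re-embedded, but the interface stays identical.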

VectorStore: One API, Many Databases

Storing and searching vectors requires a database that understands cosine similarity or dot-product distance. Spring AI introduces VectorStore, a single portable interface over a wide fleet of supported backends: PostgreSQL via PGVector, Redis, MongoDB Atlas, Chroma, Milvus, Pinecone, Weaviate, Neo4j, Oracle, Qdrant, Apache Cassandra, and Azure Vector Search.

The API has two primary operations. vectorStore.add(List<Document> documents) takes Spring AI Document objects, embeds them automatically, and persists the resulting vectors alongside the original text and any metadata you attach. vectorStore.similaritySearch(SearchRequest.query("refund policy").withTopK(5)) retrieves the five documents whose vectors sit closest to the query vector, returning plain Document objects ready to inject into a prompt. (On Spring AI 1.0 and later, the same request is written with the builder: SearchRequest.builder().query("refund policy").topK(5).build().)

Beyond basic search, Spring AI adds a SQL-like metadata filter syntax that works portably across every supported provider. You can write SearchRequest.query("refund policy").withFilterExpression("category == 'billing' && year >= 2024"), and the library translates that expression into whatever native filter the underlying store supports, whether that is a Postgres WHERE clause or a Pinecone metadata filter object. That portability means you can develop against a local Chroma instance and deploy against PGVector in production without changing application code.
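Conceptually, a filtered similarity search is "filter on metadata, rank by similarity, keep the top K". The in-memory sketch below shows that flow in plain Java with precomputed toy vectors. TinyVectorStore and Entry are hypothetical names, the filter is an ordinary Java predicate rather than Spring AI's expression syntax, and a real store would use an index instead of scanning every entry.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class TinyVectorStore {

    /** Hypothetical stored entry: text, its precomputed embedding, and metadata. */
    record Entry(String text, float[] vector, Map<String, Object> metadata) {}

    private final List<Entry> entries = new ArrayList<>();

    void add(Entry entry) {
        entries.add(entry);
    }

    /** Returns the texts of the topK entries that pass the filter, most similar first. */
    List<String> similaritySearch(float[] query, int topK,
                                  Predicate<Map<String, Object>> filter) {
        return entries.stream()
                .filter(e -> filter.test(e.metadata()))
                .sorted(Comparator
                        .comparingDouble((Entry e) -> cosine(query, e.vector()))
                        .reversed())
                .limit(topK)
                .map(Entry::text)
                .toList();
    }

    private static double cosine(float[] a, float[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Applying the metadata filter before ranking means the top K slots are never wasted on documents the caller was not allowed or not interested to see.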

Why These Two Primitives Matter

EmbeddingModel converts meaning into numbers; VectorStore stores those numbers and finds the closest ones on demand. Together, they give an LLM access to knowledge that was never in its training data: your documentation, your customer records, and your internal policies. In the next section, those two primitives become the first two steps in a Retrieval-Augmented Generation pipeline that feeds retrieved context directly into the model's prompt.

RAG Pipelines: Wiring Documents into LLM Answers

Retrieval-Augmented Generation (RAG) is the pattern that closes the biggest gap in stock LLM behaviour: a model trained on public data knows nothing about your internal documents, product catalogue, or support tickets. RAG solves this by retrieving the relevant text at request time and handing it to the model as context, so the answer is grounded in your data rather than in whatever the model happened to learn during pre-training. Spring AI makes this pattern a first-class citizen, and it maps naturally onto the embedding and vector-store primitives covered in the previous section.

The pipeline splits into two distinct phases. In the ingestion phase, you load raw documents, split them into chunks, embed each chunk, and write the resulting vectors into a vector store. In the query phase, you embed the user's question, run a similarity search against those vectors to retrieve the top-K most relevant chunks, inject those chunks into the prompt as additional context, and send the augmented prompt to the LLM. Everything between "user typed a question" and "LLM produced a grounded answer" runs inside Spring-managed beans, which means you configure it exactly the way you'd configure a data source or a REST client.

Ingestion: Loading and Splitting Documents

Spring AI ships a DocumentReader abstraction with built-in implementations for PDF, plain text, Markdown, HTML, and JSON. You load documents, pipe them through a TokenTextSplitter (or a custom splitter), and call vectorStore.add(documents). The VectorStore implementation calls your configured EmbeddingModel automatically; you don't wire the embedding call by hand.

@Bean
ApplicationRunner ingest(DocumentReader reader,
                         TokenTextSplitter splitter,
                         VectorStore vectorStore) {
    return args -> {
        List<Document> docs = reader.get();
        List<Document> chunks = splitter.apply(docs);
        vectorStore.add(chunks);
        System.out.println("Ingested " + chunks.size() + " chunks.");
    };
}

Run this once at startup (or on a schedule), and the vector store is ready to serve queries.
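To make the splitting step concrete, here is a minimal plain-Java sketch of a fixed-size chunker. It is not Spring AI's TokenTextSplitter (which counts tokens rather than words); WordChunker and its maxWords parameter are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative word-based chunker; a stand-in for a real token-aware splitter. */
final class WordChunker {

    /** Splits text on whitespace and groups the words into chunks of at most maxWords. */
    static List<String> chunk(String text, int maxWords) {
        if (maxWords <= 0) {
            throw new IllegalArgumentException("maxWords must be positive");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks; // nothing to ingest
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start += maxWords) {
            int end = Math.min(start + maxWords, words.length);
            chunks.add(String.join(" ", List.of(words).subList(start, end)));
        }
        return chunks;
    }
}
```

Each returned string would become one Document to embed. Real splitters usually also try to respect sentence or paragraph boundaries, which this sketch ignores.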

Query: Retrieval and Prompt Augmentation

On the query side, the QuestionAnswerAdvisor is the highest-level abstraction Spring AI provides. You attach it to a ChatClient, pass in the VectorStore, and it handles the retrieve-then-augment flow automatically. The advisor embeds the incoming question, queries the vector store for similar chunks, and rewrites the prompt to include those chunks before the call reaches the model.

@Bean
ChatClient ragClient(ChatClient.Builder builder, VectorStore vectorStore) {
    return builder
            .defaultAdvisors(new QuestionAnswerAdvisor(vectorStore))
            .build();
}

// At request time:
String answer = ragClient.prompt()
        .user(userQuestion)
        .call()
        .content();

By default, the advisor retrieves the top four chunks and adds them to the user message as context. You can tune the top-K value, apply metadata filters (useful when you've segmented documents by tenant or product line), or replace the default prompt template entirely by supplying your own PromptTemplate when you construct the advisor.

What the Embeddings Are Actually Doing

When the advisor receives a question, it calls vectorStore.similaritySearch(...) with a SearchRequest that carries the question text and a top-K of 4 (SearchRequest.builder().query(question).topK(4).build() in the Spring AI 1.0 builder API). The vector store embeds the question through its configured EmbeddingModel, computes cosine similarity (or an equivalent metric, depending on the backend) between the query vector and the stored chunk vectors, and returns the closest matches. Because the embeddings encode semantic meaning rather than literal keyword overlap, a question about "cancellation policy" will match a chunk that uses the phrase "how to end your subscription", something a keyword search would miss entirely.
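The similarity step can be sketched in a few lines of plain Java. This is a brute-force illustration, not how any particular vector store is implemented (production stores typically use approximate indexes); CosineSearch and the in-memory Map are assumptions made for the example.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class CosineSearch {

    /** Cosine similarity: dot product divided by the product of the vector norms. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // a zero vector has no direction to compare
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the ids of the k stored vectors most similar to the query, best first. */
    static List<String> topK(double[] query, Map<String, double[]> store, int k) {
        return store.entrySet().stream()
                .sorted(Comparator.comparingDouble(
                        (Map.Entry<String, double[]> e) -> cosine(query, e.getValue())).reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

With toy two-dimensional vectors, a query pointing mostly along the first axis ranks chunks on that axis ahead of orthogonal ones, which is the same effect that lets "cancellation policy" find "how to end your subscription".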

The result is a system where your LLM's answers are constrained by evidence drawn from your own documents, which reduces hallucination and keeps the model from confidently citing things it simply doesn't know. For most production Java applications, this is the pattern that makes the difference between a demo and something you'd trust at scale.

Tool Calling: Letting the LLM Invoke Your Spring Beans

RAG pipelines give the LLM access to stored documents, but documents are static snapshots. When a user asks "what's the current balance on account 4821?" or "book a meeting for tomorrow at 2 pm," the model needs to reach into live systems. Tool calling is how that happens.

The core idea is straightforward: instead of answering a question itself, the LLM can pause mid-response and say, "I need to call a function to answer this." Spring AI intercepts that signal, executes the corresponding Java method, and sends the result back to the model so it can complete its response. From the user's perspective, the answer just arrives. From the developer's perspective, the model is calling a Spring bean.

How the Round-Trip Works

The sequence goes like this: your application sends a prompt along with a list of available tools, each described by name, purpose, and parameter schema. The model evaluates whether any tool is relevant to the request. If it decides to use one, it returns a structured tool-call request rather than a natural-language answer. Spring AI catches that response, looks up the registered Java method by name, deserialises the arguments, invokes the method, and then sends the result back to the model as a follow-up message. The model reads the result and continues generating the final answer. The whole loop is invisible to the caller.
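The round-trip above can be sketched without any framework. The code below is a simplified stand-in, not Spring AI's implementation: ToolLoopSketch, ModelReply, and the Model interface are hypothetical, tool arguments are reduced to a single string, and a round limit is added as a safety assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class ToolLoopSketch {

    /** What the stand-in model returns: either a tool request or a final answer. */
    record ModelReply(String toolName, String toolArgument, String finalAnswer) {
        boolean wantsTool() {
            return toolName != null;
        }
    }

    /** Stand-in for a chat model; toolResult is null on the first call. */
    interface Model {
        ModelReply reply(String userPrompt, String toolResult);
    }

    private final Map<String, Function<String, String>> tools = new HashMap<>();

    void register(String name, Function<String, String> tool) {
        tools.put(name, tool);
    }

    /** Keeps executing requested tools and feeding results back until the model answers. */
    String run(Model model, String userPrompt) {
        ModelReply reply = model.reply(userPrompt, null);
        int rounds = 0;
        while (reply.wantsTool()) {
            if (++rounds > 5) {
                throw new IllegalStateException("too many tool rounds");
            }
            Function<String, String> tool = tools.get(reply.toolName());
            if (tool == null) {
                throw new IllegalArgumentException("Unknown tool: " + reply.toolName());
            }
            String result = tool.apply(reply.toolArgument());
            reply = model.reply(userPrompt, result); // follow-up message with the tool result
        }
        return reply.finalAnswer();
    }
}
```

The caller only sees run(...) return a string, which mirrors the point that the loop is invisible from the outside.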

Registering a Tool

Spring AI lets you register tools in two ways: as annotated beans or as inline lambdas passed directly to the ChatClient. The annotation approach maps cleanly to existing service classes:

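@Service
public class AccountTools {

    // AccountRepository and Account are the application's own types (not shown here).
    private final AccountRepository accountRepository;

    public AccountTools(AccountRepository accountRepository) {
        this.accountRepository = accountRepository;
    }

    @Tool(description = "Returns the current balance for a given account ID")
    public BigDecimal getAccountBalance(String accountId) {
        return accountRepository.findById(accountId)
                .map(Account::getBalance)
                .orElseThrow(() -> new IllegalArgumentException("Unknown account: " + accountId));
    }
}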

You then attach it when building the client call:

ChatResponse response = chatClient.prompt()
        .user("What is the balance on account 4821?")
        .tools(accountTools)
        .call()
        .chatResponse();

Spring AI reads the @Tool annotation to generate the JSON schema that the model receives, so the model knows exactly what parameters to pass and what the function does. You do not write any schema by hand.

Guardrails Around Tool Calls

Because the model is now driving real method invocations, safety controls matter. Spring AI supports input and output guardrails at the model-provider level, which means you can validate what the model is asking to call before the method executes, and you can filter or transform the result before it goes back to the model. In practice, this looks like middleware: you intercept the tool-call request, check it against your business rules, and either allow it, reject it with a safe fallback, or modify the parameters. That keeps a misbehaving model from passing malformed inputs straight into your database layer.
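One way to picture the middleware idea is a decorator that checks the model-supplied argument before the real method runs. This is a plain-Java sketch under stated assumptions, not a Spring AI API: GuardedTool, the predicate, and the fallback string are all hypothetical.

```java
import java.util.function.Function;
import java.util.function.Predicate;

final class GuardedTool {

    /** Wraps a tool so its argument is validated before the underlying method is invoked. */
    static Function<String, String> guard(Function<String, String> tool,
                                          Predicate<String> argumentIsAllowed,
                                          String safeFallback) {
        return argument -> {
            if (argument == null || !argumentIsAllowed.test(argument)) {
                return safeFallback; // reject without touching the real system
            }
            return tool.apply(argument);
        };
    }
}
```

For an account lookup, the predicate might accept only short numeric IDs, so a malformed or injected argument never reaches the repository.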

Tool calling is also where Spring AI's model portability pays off most noticeably. The same @Tool-annotated bean works against OpenAI, Anthropic, Google Gemini, or a local Ollama model. The framework translates the tool schema into whatever wire format each provider expects, so you are not locked into one vendor's function-calling API. Once your tools are registered, switching models is a configuration change, not a rewrite.

Agentic Patterns: Workflows vs. Autonomous Agents

Tool calling, covered in the previous section, is the mechanism that enables agents. The next question is how much control you hand to the LLM once those tools are available.

Spring AI draws a clear line between two types of agentic systems. In a workflow, LLMs and tools are orchestrated through predefined code paths. Your Java code decides when to call the model, which tools are eligible, and what happens with the result. In an agent, the LLM itself dynamically decides which tools to invoke and in what sequence, with your code acting more as an execution host than a director.

The distinction matters practically. A workflow is deterministic by construction: given the same inputs, you follow the same execution path, which makes it easy to test, audit, and explain to stakeholders. An autonomous agent trades that predictability for flexibility. You describe a goal and let the model reason about how to reach it, which is powerful but introduces non-determinism that can be hard to manage in a production system with SLA requirements.

Spring AI's approach here is deliberately aligned with Anthropic's research on building effective agents, which emphasises simplicity and composability over complex orchestration frameworks. The practical takeaway from that research is that a well-structured workflow better serves most production use cases than a fully autonomous agent, even when autonomous agents feel like the more exciting option. Spring AI's agentic-patterns reference project implements both styles, giving you concrete starting points for each.

For a backend Java engineer, this framing maps onto patterns you already know. A workflow resembles a service method with a clear call graph: call the LLM to classify intent, branch on the result, call a tool, and return a structured response. An agent resembles an event loop: the LLM produces a tool call, you execute it, you feed the result back to the model, and you repeat until the model signals it is done. Spring AI handles that loop through its ChatClient and tool registration infrastructure, the same APIs you use for single-shot requests, just called iteratively inside a loop that the model controls.
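The contrast can be sketched with a stubbed model. Everything here is illustrative: AgenticStyles and Llm are hypothetical names, the "model" is just a function from prompt to text, and the step budget in the agent loop is an added assumption to keep the loop bounded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class AgenticStyles {

    /** Stand-in for an LLM call: prompt in, text out. */
    interface Llm extends Function<String, String> {}

    /** Workflow: Java code fixes the path; the model only classifies the request. */
    static String workflow(Llm llm, String request) {
        String intent = llm.apply("classify: " + request);
        switch (intent) {
            case "BALANCE": return "route -> balance service";
            case "REFUND":  return "route -> refund service";
            default:        return "route -> human agent";
        }
    }

    /** Agent: the model chooses each next action until it says DONE or the budget runs out. */
    static List<String> agent(Llm llm, String goal, int maxSteps) {
        List<String> trace = new ArrayList<>();
        String observation = goal;
        for (int step = 0; step < maxSteps; step++) {
            String action = llm.apply(observation);
            if ("DONE".equals(action)) {
                break;
            }
            trace.add(action);                       // record what the model decided to do
            observation = "result of " + action;     // feed the outcome back to the model
        }
        return trace;
    }
}
```

In the workflow, every possible path is visible in the switch. In the agent, the path only exists in the trace after the run, which is why a step budget and logging tend to matter more there.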

In practice, most teams start with workflow patterns because the control flow is visible in code and failures are easy to reproduce. The agentic loop becomes worthwhile when the task genuinely requires open-ended reasoning: a research assistant that decides which APIs to query based on partial results, for example, or a code-review agent that chooses which static-analysis tools to run based on the languages it detects in a PR. For everything else, the workflow keeps the model in a supporting role and your Java code in the driver's seat, which is usually where you want it in an enterprise context.

Multi-Model Configuration and Provider Portability

Real enterprise applications rarely settle on a single LLM. You might use a fast, inexpensive model for short classification tasks and a more capable model for complex summarisation, or you might want a fallback path so that when one provider's API is unavailable, the application degrades gracefully rather than returning an error. Spring AI handles both scenarios through the same auto-configuration and dependency injection patterns you already use everywhere else in your Spring Boot application.
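The fallback idea is independent of any framework, so it can be sketched with plain functions. FallbackChat is a hypothetical helper, and each provider is modelled as a Function from prompt to answer; in a Spring application those functions would wrap separately configured chat model beans.

```java
import java.util.List;
import java.util.function.Function;

final class FallbackChat {

    /** Tries each provider in order and returns the first successful answer. */
    static String ask(List<Function<String, String>> providers, String prompt) {
        RuntimeException last = null;
        for (Function<String, String> provider : providers) {
            try {
                return provider.apply(prompt);
            } catch (RuntimeException e) {
                last = e; // remember the failure and try the next provider
            }
        }
        throw new IllegalStateException("All providers failed", last);
    }
}
```

A production version would usually add timeouts and only fall back on errors that are actually transient, rather than on every exception.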

Switching Providers With Configuration Alone

The most direct form of portability is the ability to change providers without touching your application code. Because ChatClient and ChatModel are interfaces, the concrete implementation is determined at startup by whichever starter is on the classpath and how application.properties is set. Swapping from OpenAI to Anthropic's Claude is primarily a two-part change: replace the starter dependency in your build file, and update your properties.

# OpenAI configuration (active)
spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.chat.options.model=gpt-4o

# Anthropic Claude configuration (alternative, comment out the OpenAI block above
# and uncomment these lines to switch providers; do not use both at once)
# spring.ai.anthropic.api-key=${ANTHROPIC_API_KEY}
# spring.ai.anthropic.chat.options.model=claude-3-5-sonnet-20241022

Your @Service classes that inject ChatModel or ChatClient remain untouched. This is the same bet Spring has always made with its abstraction layers: write to the interface, configure the implementation externally.

How Spring Security works with JWT, OAuth2, and SSO

srinivas reddy gouru — Tue, 26 May 2026 17:43:05 +0000

The 401 You Never Wrote: What Spring Security Actually Does Out of the Box

Spring Security is a framework that intercepts every HTTP request entering your application and enforces authentication and authorisation rules before that request ever reaches a controller. You do not write the interception logic yourself; adding the spring-boot-starter-security dependency to a Spring Boot project is enough to activate a default configuration that locks down every endpoint and returns a 401 (or redirects to /login) for any unauthenticated caller. No annotations, no filters, no @PreAuthorize: just a single dependency, and suddenly your previously open API is asking for credentials.

That out-of-the-box behaviour surprises a lot of engineers the first time they see it, but it reflects a deliberate design choice: security should be the default, not something you opt into. The alternative, checking credentials inside each controller method, means one forgotten check equals one unprotected route. Spring Security removes that category of mistake entirely by applying rules uniformly at the HTTP layer, before a single line of your business logic runs.

What makes this possible is the security filter chain. Spring Security is not a single interceptor sitting in front of your application; it is a carefully ordered sequence of servlet filters, each responsible for one concern. One filter handles CSRF token validation. Another manages session creation and lookup. Another reads the Authorization header and tries to resolve a credential. Another checks whether the resolved identity actually has permission to reach the requested URL. These filters run in a fixed order on every incoming request, and each one can either pass the request downstream or short-circuit the chain with an error response.

At the centre of this pipeline sits the SecurityContext. By the time the authorisation filters run, the chain expects someone to have placed an Authentication object into the context: a wrapper that holds the verified identity (the principal) and the roles or authorities it carries. If the context is empty when the authorisation filter runs, the chain treats the caller as anonymous. If the requested URL requires an authenticated user, the request is rejected before your controller is ever invoked.

This is the mental model that makes all of Spring Security coherent: a pipeline of filters, a shared context object, and one job delegated to whichever filter is responsible for your chosen credential mechanism. Every token mechanism you'll configure (JWT, OAuth2, SSO) is simply a different strategy for putting that principal into the context. Get that strategy wrong, or register it at the wrong point in the chain, and you'll keep seeing 401s that your controllers never threw.

The FilterChain: Spring Security's Request Processing Pipeline

Every HTTP request that reaches a Spring application passes through a structure called the SecurityFilterChain before it ever touches a controller. Understanding this chain is what separates confident Spring Security configuration from trial-and-error annotation hunting.

Spring Security enters the picture through a DelegatingFilterProxy, a thin Servlet filter that the framework registers with the container at startup. [1] Its only job is to hand the request off to Spring's application context, where the real work happens inside a SecurityFilterChain: an ordered list of Spring-managed filters, each responsible for one specific concern. [1] [2] Think of it as a pipeline where each stage either passes the request forward, modifies it, or short-circuits the whole chain by writing a response directly (returning a 401, for example) before the request ever reaches your business logic.

The order of filters in that chain is not arbitrary. Spring Security ships with defaults that place CSRF protection early, session management in the middle, and the authorisation check (AuthorizationFilter) near the end. Authentication filters, wherever you insert them, must run before that authorisation check, because authorisation reads from something the authentication filter writes: the SecurityContextHolder.

The SecurityContextHolder is a thread-local container that holds the SecurityContext, and the SecurityContext holds the Authentication object for the current request. [1] When an authentication filter validates a credential and calls SecurityContextHolder.getContext().setAuthentication(...), it is essentially leaving a note for every filter that comes after it: "this request belongs to a principal with these roles." The authorisation filter at the end of the chain reads that note and decides whether the principal's roles satisfy the rules you configured for that endpoint. If no authentication filter populates the context, the authorisation filter finds an anonymous token and rejects the request accordingly.
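The write-then-read contract between filters can be shown with a toy pipeline. This is a deliberately tiny model, not Spring Security's code: MiniFilterChain, its Filter interface, and the string principal are hypothetical, and a status of 200 simply stands for "the controller was reached".

```java
import java.util.List;

final class MiniFilterChain {

    /** Thread-local holder, loosely analogous to SecurityContextHolder. */
    static final ThreadLocal<String> PRINCIPAL = new ThreadLocal<>();

    /** A filter returns a status to short-circuit, or null to pass the request downstream. */
    interface Filter {
        Integer apply(String authorizationHeader);
    }

    /** Authentication step: writes the principal when a credential is recognised. */
    static final Filter AUTHENTICATE = header -> {
        if (header != null && header.startsWith("Bearer ")) {
            PRINCIPAL.set(header.substring(7)); // toy "validation": trust the token text
        }
        return null; // never rejects on its own; authorisation decides later
    };

    /** Authorisation step: reads what an earlier filter wrote. */
    static final Filter AUTHORISE = header -> PRINCIPAL.get() == null ? 401 : null;

    /** Runs the filters in order and clears the per-request state afterwards. */
    static int handle(List<Filter> filters, String authorizationHeader) {
        PRINCIPAL.remove();
        try {
            for (Filter filter : filters) {
                Integer status = filter.apply(authorizationHeader);
                if (status != null) {
                    return status; // short-circuit: later filters never run
                }
            }
            return 200;
        } finally {
            PRINCIPAL.remove();
        }
    }
}
```

Running the same two filters in the opposite order rejects even a request with a valid header, because the authorisation step reads the context before anything has written to it.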

This design has a practical consequence for configuration. When you call http.addFilterBefore(myFilter, UsernamePasswordAuthenticationFilter.class) or addFilterAfter, you are specifying exactly where in this ordered pipeline your logic runs. [2] Getting that position wrong is one of the most common sources of mysterious 403 responses, because a filter that runs after the authorisation check can no longer influence the outcome of that check.

Each of the three token mechanisms covered in the next sections plugs into this chain at a different point and populates the SecurityContextHolder in a different way. A JWT filter reads the Authorization header, validates the token cryptographically, and sets the authentication directly, all within a single filter. OAuth2's resource server support does something similar but delegates signature verification to a remote or locally cached JWK set. SSO flows are different again: they redirect the browser entirely, establish a session on return, and then rely on a session-reading filter to restore the SecurityContext on subsequent requests. Same chain, same SecurityContextHolder contract, three different plug-in points. That consistency is what makes Spring Security composable once you internalise the pipeline model.

JWT Authentication: Stateless Token Validation in the Filter Chain

With the filter chain's structure in mind, it's worth seeing exactly how a real token mechanism plugs into it. JWT is the most common starting point, partly because it requires no session storage and no round-trip to an authorisation server; the token itself carries everything the application needs to make an access decision.

When a client sends a request to a secured endpoint, it includes a bearer token in the Authorization header. Spring Security doesn't know what to do with that header on its own; nothing in the default chain extracts a JWT. That gap is your extension point. You create a filter that extends OncePerRequestFilter, extract the token from the header, validate its signature and expiry, and then construct an Authentication object that you write into the SecurityContextHolder. [1] From that moment forward in the chain, the request looks just like any other authenticated request; downstream filters and your controllers see a populated security context and never need to know a JWT was involved.

Position matters here. Your custom filter must be registered before UsernamePasswordAuthenticationFilter in the chain, which you do with addFilterBefore in your SecurityFilterChain configuration. If it runs after the authorisation filters have already evaluated the request, the SecurityContext will be empty at the moment it counts, and the request will be rejected as unauthenticated regardless of whether the token was valid.

@Component
@RequiredArgsConstructor
public class JwtAuthenticationFilter extends OncePerRequestFilter {

    private final JwtUtility jwtUtility;
    private final UserDetailsService userDetailsService;

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain filterChain)
            throws ServletException, IOException {

        String authHeader = request.getHeader(HttpHeaders.AUTHORIZATION);
        if (authHeader == null || !authHeader.startsWith("Bearer ")) {
            filterChain.doFilter(request, response);
            return;
        }

        String token = authHeader.substring(7);
        String username = jwtUtility.extractUsername(token);

        if (username != null && SecurityContextHolder.getContext().getAuthentication() == null) {
            UserDetails user = userDetailsService.loadUserByUsername(username);
            if (jwtUtility.isTokenValid(token, user)) {
                UsernamePasswordAuthenticationToken auth =
                        new UsernamePasswordAuthenticationToken(user, null, user.getAuthorities());
                auth.setDetails(new WebAuthenticationDetailsSource().buildDetails(request));
                SecurityContextHolder.getContext().setAuthentication(auth);
            }
        }
        filterChain.doFilter(request, response);
    }
}

Because the server stores no session state, every request must be validated independently. The filter re-verifies the signature and the expiry claim on each call. This is what makes JWT genuinely stateless: the token is self-contained, and the server-side footprint is just the public key or shared secret used for verification.
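To show what "re-verifies the signature and the expiry claim" involves, here is a stdlib-only sketch for HS256 tokens. It is an illustration under assumptions, not a replacement for a JWT library: Hs256Check is a hypothetical class, the exp claim is pulled out with a regular expression instead of a JSON parser, and the header's alg field is not inspected (a real verifier must pin the expected algorithm).

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class Hs256Check {

    private static final Pattern EXP = Pattern.compile("\"exp\"\\s*:\\s*(\\d+)");

    /** Computes the base64url HMAC-SHA256 signature over "header.payload". */
    static String sign(String headerAndPayload, byte[] secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            byte[] sig = mac.doFinal(headerAndPayload.getBytes(StandardCharsets.US_ASCII));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(sig);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC-SHA256 unavailable or key rejected", e);
        }
    }

    /** True only if the signature matches and the exp claim lies in the future. */
    static boolean isValid(String token, byte[] secret, long nowEpochSeconds) {
        String[] parts = token.split("\\.");
        if (parts.length != 3) {
            return false;
        }
        String expected = sign(parts[0] + "." + parts[1], secret);
        // Constant-time comparison avoids leaking how many characters matched.
        boolean signatureOk = MessageDigest.isEqual(
                expected.getBytes(StandardCharsets.US_ASCII),
                parts[2].getBytes(StandardCharsets.US_ASCII));
        if (!signatureOk) {
            return false;
        }
        String payload = new String(Base64.getUrlDecoder().decode(parts[1]), StandardCharsets.UTF_8);
        Matcher m = EXP.matcher(payload);
        return m.find() && Long.parseLong(m.group(1)) > nowEpochSeconds;
    }
}
```

Nothing here touches a database or a session: the only server-side state is the secret, which is the sense in which the check is stateless.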

The filter is also the right place to handle more advanced concerns. Token revocation, for example, can be layered in by checking a blocklist inside the same doFilterInternal method before the Authentication object is written to the context. Compromised password detection follows the same pattern. The filter becomes the single choke point for every token lifecycle decision, which keeps your authorisation logic clean and your business controllers unaware of authentication mechanics.

OAuth2 Login & Resource Server: Delegating Trust to an Authorisation Server

Where JWT authentication is a two-party handshake between your application and the token, OAuth2 introduces a third party: an authorisation server that your application has to coordinate with in real time. Spring Security handles all of that coordination, but you need to tell it which role your application is playing, because the configuration and the filter chain behaviour are meaningfully different depending on the answer.

Spring Security's OAuth2 support draws a clean line between two roles. oauth2Login() configures your application as an OAuth2 client: it performs the redirect to the authorisation server, receives the callback, exchanges the authorisation code for tokens, and logs the user in. oauth2ResourceServer() configures your application as an API backend that accepts and validates bearer tokens issued by an external authorisation server. These two modes are not interchangeable. A browser-facing web app typically uses oauth2Login(); a REST API consumed by other services typically uses oauth2ResourceServer(). Some architectures need both in separate security filter chains.

In the Authorisation Code flow, OAuth2LoginAuthenticationFilter does the heavy lifting on the client side. When the authorisation server redirects the browser back to your app with a ?code=... parameter, this filter intercepts the request, calls the authorisation server's token endpoint to exchange the code for an access token and ID token, loads the user's identity from the user-info endpoint or from the ID token directly, and populates the SecurityContext. You do not write any of that exchange logic yourself. The minimal configuration to enable it looks like this:

@Configuration
@EnableWebSecurity
public class SecurityConfig {

    @Bean
    public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
        http
            .authorizeHttpRequests(auth -> auth
                .anyRequest().authenticated()
            )
            .oauth2Login(Customizer.withDefaults());
        return http.build();
    }
}

Spring Boot auto-configuration picks up the client registration details (client ID, client secret, authorisation URI, and token URI) from application.yml, so the Java config stays short.

On the resource server side, oauth2ResourceServer(oauth2 -> oauth2.jwt(Customizer.withDefaults())) swaps in a BearerTokenAuthenticationFilter that reads the Authorization: Bearer ... header, validates the JWT's signature against the authorisation server's published JWKS endpoint, and populates the SecurityContext with the decoded claims. (Older examples use oauth2ResourceServer(OAuth2ResourceServerConfigurer::jwt), which is deprecated in recent Spring Security 6 releases.) Introspection (calling the authorisation server's introspection endpoint on every request instead of validating locally) is also supported, and is the right choice when you need real-time token revocation checks rather than relying on a short expiry window.

Token lifecycle is another area where Spring Security saves work. When a stored access token has expired, the OAuth2 client support automatically attempts a refresh token grant before forwarding the request. Your application layer never sees the expired token.

One configuration pitfall is worth calling out explicitly. If you want to run custom logic after the token endpoint responds (setting a cookie from the token, for example), that logic must be placed inside the SecurityFilterChain at the correct position using addFilterAfter(), not in a plain servlet filter sitting outside Spring Security. Filters registered outside the chain execute at a different point in the request lifecycle and cannot reliably observe the outcome of Spring Security's token processing.

SSO with Spring Security: Session-Backed Identity Federation

The sharpest difference between SSO and the JWT path covered earlier is not in the configuration; it's in what survives a request boundary. JWT authentication is stateless: every request carries its own proof of identity in the Authorization header, and the filter validates that proof from scratch each time. SSO via oauth2Login() works the opposite way. Once a user completes the redirect dance with the identity provider, Spring Security stores the resulting Authentication object in the HTTP session, and every subsequent request is resolved against that session rather than against a fresh token.

That session-backed model is what makes SSO feel seamless to the end user: they authenticate once with Google or GitHub, get redirected back to your application, and from that point on the browser's session cookie is the credential. The identity provider is not consulted again until the session expires or the user explicitly signs out.

The application.yml registration for a Google-backed SSO looks like this:

spring:
  security:
    oauth2:
      client:
        registration:
          google:
            client-id: YOUR_CLIENT_ID
            client-secret: YOUR_CLIENT_SECRET
            scope: openid, profile, email
        provider:
          google:
            authorization-uri: https://accounts.google.com/o/oauth2/v2/auth
            token-uri: https://oauth2.googleapis.com/token
            user-info-uri: https://www.googleapis.com/oauth2/v3/userinfo

The Java security configuration that activates this is identical to the OAuth2 Login example in the previous section: oauth2Login(Customizer.withDefaults()) is all that's needed, with no additional filter registration. What changes is what happens inside the filter chain after the callback completes.

In the JWT path, the bearer-token filter rebuilds the SecurityContext on every request by validating the token, and nothing is persisted between requests. In the SSO path, SecurityContextHolderFilter restores the SecurityContext from the HttpSession instead; no token validation occurs because the session already holds a fully populated OAuth2AuthenticationToken. That is the meaningful delta: the persistence strategy changes from token-per-request to session-per-user, with HttpSessionSecurityContextRepository doing the loading rather than the stateless setup used in the JWT configuration.

The older @EnableOAuth2Sso annotation from the Spring Security OAuth2 legacy project achieved the same redirect-and-session behaviour, but required you to manually wire an OAuth2ClientContext and build an OAuth2ClientAuthenticationProcessingFilter yourself. Spring Security 5's native oauth2Login() DSL absorbed all of that plumbing, which is why a single YAML block and one DSL method call are now sufficient for a fully working SSO integration against Google, GitHub, or any standards-compliant identity provider.

One practical implication worth naming: because the session holds the Authentication, horizontal scaling requires either sticky sessions or a shared session store like Redis. This is the trade-off you accept relative to the stateless JWT model, where any node can validate a request independently.

Comparing the Three Mechanisms: Which Hook, Which Flow, Which Config

The three token mechanisms are not strict alternatives to one another; they solve slightly different problems, and they plug into the filter chain at different points. Knowing which slot each one occupies is the fastest way to choose between them and to diagnose failures when they occur.

JWT authentication is the right choice for stateless APIs and microservices where no session should be kept server-side. Your application is the sole verifier of the token: it checks the signature, extracts claims, and populates the SecurityContext, all within a single OncePerRequestFilter placed before UsernamePasswordAuthenticationFilter. Nothing about the user's identity is stored between requests.

OAuth2 resource server configuration is appropriate when a separate authorisation server issues tokens, and your application should validate them against a well-known JWKS endpoint. Spring's built-in BearerTokenAuthenticationFilter handles the extraction and delegates to the JwtDecoder you configure, so you write almost no filter code yourself. The trust anchor is the authorisation server's public key, not anything your application generates.

SSO via oauth2Login applies when a browser-based user needs to authenticate once and have that identity recognised across multiple applications. It reuses the same OAuth2LoginAuthenticationFilter that handles a standard OAuth2 login callback, but the session becomes the persistence mechanism rather than a token the client presents on each request. Your application's own session timeout decides when that local session ends, while the identity provider's session decides whether the user has to enter credentials again on the next redirect.

When a 401 appears, and you need to find out why, the filter chain position is almost always the first thing worth examining. In a JWT setup, a silent 401 with no accompanying error body usually means one of two things: the custom filter ran but did not write to the SecurityContext, or the filter was registered at a position that places it after the authorisation check rather than before it. In an OAuth2 resource server setup, the failure is more often a misconfigured issuer-uri or an expired token that the decoder rejects. In an SSO setup, a 302 redirect rather than a 401 is the more common symptom; the session wasn't found, so the user is sent back to the identity provider.

The filter chain is the right mental model to carry into all three of these situations. Once you see JWT validation, OAuth2 token introspection, and SSO session lookup as separate plug-ins wired into the same ordered pipeline, the framework stops behaving like an unpredictable wall and starts behaving like a predictable sequence you can reason about step by step. Every 401 is answerable by asking: which filter was responsible for this request, did it run, and if it ran, did it write to the SecurityContext? You can make that question concrete immediately by enabling TRACE logging on org.springframework.security: the output prints each filter in the chain as it executes, which turns a confusing status code into a visible gap in the sequence. That visibility is a direct consequence of the model: when you understand the pipeline, you know exactly where to look.

Sources

How Spring Data JPA, JPA, and Hibernate work together

srinivas reddy gouru — Tue, 26 May 2026 17:24:05 +0000

The Magic Line That Raises the Right Question

Spring Data JPA is a library that lets you query a relational database by writing a Java interface method and nothing else: no SQL, no ResultSet parsing, no PreparedStatement boilerplate. You declare what you want, and the framework figures out how to fetch it.

Here is the moment that stops most backend engineers in their tracks:

public interface UserRepository extends JpaRepository<User, Long> {
 Optional<User> findByEmail(String email);
}

That is the entire implementation. You inject UserRepository into a service, call userRepository.findByEmail("alice@example.com"), and get back a populated User object from the database. The method works on the first try, which is satisfying until something breaks, or until you need to understand why a query is slow, why a LazyInitializationException keeps appearing, or why your transaction did not roll back the way you expected.

At that point, the magic stops feeling helpful and starts feeling like a wall.

The wall exists because findByEmail is not a single thing. It is the visible surface of three separate layers, each with its own responsibilities, failure modes, and configuration surface. Spring Data JPA translated your method name into a query. JPA, the Java Persistence API, provided the standard contract that queries had to follow. Hibernate actually executed it against the database. Without knowing which layer owns which job, you cannot know which layer to look at when something goes wrong.

To answer why that one line works, you need a clear picture of all three layers. That is what the rest of this article builds.

Three Layers, Three Jobs: The Stack in One Mental Model

When you call findByEmail("alice@example.com") on a repository interface you never implemented, what actually runs? The answer involves three distinct layers, each with a single clear job, and conflating any two of them is the source of most of the confusion engineers run into.

At the bottom sits Hibernate. Hibernate is a full ORM framework. It owns a SessionFactory, manages an in-memory representation of your entities, generates SQL, and fires that SQL over JDBC to the database. When things go wrong at the database level (a bad join, an unexpected number of queries, or a lazy-loading exception), Hibernate is the layer to examine.

In the middle sits JPA (Java Persistence API). JPA is not a library you can download and run; it is a specification, a set of interfaces, annotations, and contracts that any compliant ORM must honour. It defines what @Entity, @OneToMany, and EntityManager mean, but it ships no implementation of its own. Hibernate is Spring Boot's default JPA provider, meaning Hibernate is the concrete code that fulfils those contracts. This distinction matters because your annotations belong to JPA, while the SQL they produce belongs to Hibernate.

At the top sits Spring Data JPA. Its sole job is to eliminate boilerplate in the data-access layer. You write a repository interface; Spring generates a proxy implementation at startup that translates method names and annotations into JPA operations. Spring Data JPA does not talk to your database directly, and it is not itself an ORM. Every call it receives flows downward through JPA's EntityManager and on into Hibernate, which finally issues the JDBC call.

Laid out as a delegation chain, the flow looks like this:

Your code
 → Spring Data JPA (repository proxy)
 → JPA EntityManager (standard contract)
 → Hibernate (SQL generation + JDBC)
 → Database

Each arrow in that chain crosses a responsibility boundary. Spring Data JPA knows about Spring conventions and method-name patterns. JPA knows about entities and the persistence context lifecycle. Hibernate knows about SQL generation and database dialects. None of the three layers was designed to do another's job, which is exactly why each one is replaceable in theory. You could swap Hibernate for EclipseLink and keep all your JPA annotations untouched, or swap Spring Data JPA for plain EntityManager calls and keep Hibernate doing exactly what it already does.

Keeping this map in your head pays off the moment something goes wrong. A LazyInitializationException is a Hibernate story. A @Transactional annotation that seems to do nothing is a Spring story. A repository method that generates a query you didn't expect is a Spring Data JPA story, though Hibernate ultimately executes it. The next three sections go bottom-up through the stack, starting with the layer that actually touches your database.

Hibernate: The Engine That Actually Talks to the Database

When your Spring application saves an entity and an SQL INSERT appears in the logs, Hibernate wrote that statement. It is an Object-Relational Mapping framework whose single job is bridging the gap between Java objects and relational tables, translating method calls and object state into the SQL your database actually understands, so you don't have to hand-write every query for routine CRUD operations.

At the centre of Hibernate's runtime is the SessionFactory, a heavyweight, thread-safe object built once at application startup. From that factory, Hibernate creates per-request Session instances (or, in the JPA vocabulary you'll see more often in Spring code, EntityManager instances). Those per-request objects are not thread-safe, so Spring handles their lifecycle carefully. The EntityManager you inject with @PersistenceContext is actually a thread-bound proxy that delegates to whichever transactional EntityManager is currently active for that request, falling back to a freshly created one if no transaction is in progress. That indirection is invisible in normal usage but matters when you start debugging concurrency issues.

Several behaviours that surface as runtime surprises are owned entirely by Hibernate, not by the layers above it.

Dirty checking is one of the most useful and least understood. Inside an active transaction, Hibernate holds a snapshot of every entity it has loaded. When the persistence context is flushed, which by default happens at commit, it compares current field values against those snapshots and issues UPDATE statements automatically for anything that changed, even if you never called save(). This is a feature, but it confuses engineers who expect no SQL to run unless they explicitly persist something.

First-level cache (the session cache) means that within a single EntityManager lifetime, calling find(User.class, 1L) twice returns the same Java object from memory on the second call, with no second database round-trip. This cache is scoped to the session, not the application, so it disappears when the session closes.

Lazy loading lets Hibernate defer fetching associated collections until your code actually accesses them. A User with a @OneToMany list of Orders won't load those orders until you call user.getOrders(). The downside is the N+1 problem: if you load 50 users in a list and then touch .getOrders() for each one inside a loop, Hibernate fires one query to fetch the users and then 50 separate queries to fetch each user's orders. The fix usually involves a JOIN FETCH in your query or switching the fetch type, but the first step is recognising that Hibernate is what's generating those 50 extra statements.

SQL dialect translation means Hibernate handles the differences between database engines. In a Spring Boot application the dialect is normally detected from the JDBC connection; you can also set spring.jpa.database-platform to a dialect class explicitly. Either way, Hibernate adapts its generated SQL to match the syntax of PostgreSQL, MySQL, Oracle, or whatever you're running.
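Two of the behaviours above, the first-level cache and dirty checking, are easier to reason about with a toy model in front of you. The sketch below is plain Java and is not how Hibernate is implemented; ToyUser, ToyPersistenceContext, and the simulated SQL strings are all hypothetical names invented for illustration. It only shows the shape of the idea: one object per id for the life of the context, and a snapshot comparison at flush time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical classes, not Hibernate's internals.
class ToyUser {
    final long id;
    String email;

    ToyUser(long id, String email) {
        this.id = id;
        this.email = email;
    }
}

class ToyPersistenceContext {
    // Stand-in for a database table: id -> email.
    private final Map<Long, String> table;
    // First-level cache: one Java object per id for the life of this context.
    private final Map<Long, ToyUser> loaded = new HashMap<>();
    // State of each entity at load time, used for dirty checking.
    private final Map<Long, String> snapshots = new HashMap<>();
    // Every simulated SQL statement is recorded here so it can be inspected.
    final List<String> sqlLog = new ArrayList<>();

    ToyPersistenceContext(Map<Long, String> table) {
        this.table = table;
    }

    ToyUser find(long id) {
        ToyUser cached = loaded.get(id);
        if (cached != null) {
            return cached; // cache hit: no simulated SQL is issued
        }
        sqlLog.add("SELECT id, email FROM users WHERE id = " + id);
        String email = table.get(id);
        if (email == null) {
            return null;
        }
        ToyUser user = new ToyUser(id, email);
        loaded.put(id, user);
        snapshots.put(id, email);
        return user;
    }

    // Compare current state with the snapshot and write only what changed.
    void flush() {
        for (ToyUser user : loaded.values()) {
            if (!user.email.equals(snapshots.get(user.id))) {
                sqlLog.add("UPDATE users SET email = '" + user.email
                        + "' WHERE id = " + user.id);
                table.put(user.id, user.email);
                snapshots.put(user.id, user.email);
            }
        }
    }
}

class ToyPersistenceContextDemo {
    static List<String> run() {
        Map<Long, String> table = new HashMap<>();
        table.put(1L, "alice@example.com");
        ToyPersistenceContext context = new ToyPersistenceContext(table);

        ToyUser first = context.find(1L);
        ToyUser second = context.find(1L); // served from the cache
        if (first != second) {
            throw new IllegalStateException("expected the cached instance");
        }
        first.email = "alice@example.org"; // no save() call anywhere
        context.flush();                   // the change is written anyway
        return context.sqlLog;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Running the demo records one SELECT and one UPDATE: the second find() never reaches the simulated table, and the UPDATE appears even though nothing was explicitly saved.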

All of these behaviours (dirty checking, caching, lazy loading, dialect generation) live at the Hibernate layer. When something breaks, the explanation usually lives there too. The layer above it, JPA, doesn't implement any of this; it just defines the contract that Hibernate must honour.

JPA: The Standard Contract Every ORM Must Honour

The previous section covered what Hibernate actually does at runtime. But if you look at a typical Spring Boot entity class, you'll notice the imports don't say org.hibernate.*. They say jakarta.persistence.* (or javax.persistence.* on older versions). That's JPA, and understanding why it's there clarifies the whole stack.

JPA, the Jakarta Persistence API (called the Java Persistence API before the move to the jakarta namespace), is a specification, not a library you run. It defines a set of interfaces, annotations, and rules that any compliant ORM must implement. The annotations you use every day, such as @Entity, @Id, @OneToMany, and @Column, are all defined by JPA. The central API for interacting with the persistence context, EntityManager, is also defined by JPA, with methods like persist, find, merge, and remove. Hibernate is simply the most popular implementation of that contract.

This separation matters practically. Because your application code targets JPA interfaces rather than Hibernate-specific classes, you could swap Hibernate for EclipseLink or OpenJPA and recompile without touching your entity or service code. In practice, most teams never make that swap, but portability isn't the only benefit. The bigger win is that JPA gives the whole Java ecosystem a shared vocabulary. Documentation, Stack Overflow answers, and framework libraries all speak JPA, even when each is backed by a different provider.

JPA also introduced JPQL, the Java Persistence Query Language, which lets you query against entity class names and field names rather than table and column names. A JPQL query looks like this:

TypedQuery<Order> query = entityManager.createQuery(
    "SELECT o FROM Order o WHERE o.customer.email = :email", Order.class
);
query.setParameter("email", "alice@example.com");
List<Order> orders = query.getResultList();

Compare that to raw JDBC, where you would write a SQL string against orders.customer_email, manually open a ResultSet, and map each column back to a field by hand. JPQL keeps queries at the object level, so renaming a Java field and its corresponding column mapping stays in one place rather than scattered across SQL strings throughout the codebase.
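For contrast, here is roughly what the raw JDBC version of that lookup might look like. This is a sketch under assumptions: the orders table, its id and customer_email columns, and the OrderRow class are hypothetical, chosen only to match the column mentioned above. The java.sql calls themselves are standard JDK API.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical value class for this comparison only.
class OrderRow {
    final long id;
    final String customerEmail;

    OrderRow(long id, String customerEmail) {
        this.id = id;
        this.customerEmail = customerEmail;
    }
}

class JdbcOrderLookup {
    // Assumed table and column names: the query is tied to the physical schema.
    static final String SQL =
            "SELECT id, customer_email FROM orders WHERE customer_email = ?";

    static List<OrderRow> findByCustomerEmail(Connection connection, String email)
            throws SQLException {
        List<OrderRow> orders = new ArrayList<>();
        try (PreparedStatement statement = connection.prepareStatement(SQL)) {
            statement.setString(1, email); // JDBC parameters are 1-indexed
            try (ResultSet rows = statement.executeQuery()) {
                while (rows.next()) {
                    // Each column is mapped back to a field by hand.
                    orders.add(new OrderRow(
                            rows.getLong("id"),
                            rows.getString("customer_email")));
                }
            }
        }
        return orders;
    }
}
```

Every column name appears twice, once in the SQL and once in the mapping code, which is the duplication the JPQL version avoids.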

The honest limitation of plain JPA is that it still requires you to wire up an EntityManagerFactory, manage EntityManager lifecycles, write those createQuery calls, and handle transactions explicitly. For a small application that's manageable. For a service with thirty entity types and hundreds of queries, it becomes a lot of repeated structural code. That gap is exactly what Spring Data JPA was designed to close.

Spring Data JPA: The Boilerplate Eliminator on Top

With the JPA specification and Hibernate's implementation both accounted for, the question that opened this article is still hanging: where exactly does findByEmail() come from? You never wrote a SQL query, never touched an EntityManager, never defined a concrete class. The answer lives entirely in Spring Data JPA's repository abstraction.

When your application starts up, Spring Data JPA scans for interfaces that extend JpaRepository (or one of its parent interfaces) and generates concrete implementations on the fly. No DAO class needs to be written, because Spring creates a proxy object that wires in the actual behaviour at startup, before any request arrives. That proxy is what gets injected into your service layer when you use @Autowired or constructor injection.
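The idea of an interface that gains an implementation at runtime can be demonstrated with nothing but the JDK. The sketch below uses java.lang.reflect.Proxy to build an object that implements a repository-style interface without any hand-written class. It is an illustration of the general technique, not Spring Data JPA's actual code; ToyUserRepository, ToyRepositoryFactory, and the in-memory Map standing in for the database are all hypothetical.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical repository interface: no class implements it anywhere.
interface ToyUserRepository {
    Optional<String> findNameByEmail(String email);
}

class ToyRepositoryFactory {
    // Builds an implementation of the interface at runtime.
    // The Map stands in for the database.
    static ToyUserRepository create(Map<String, String> namesByEmail) {
        InvocationHandler handler = (proxy, method, args) -> {
            // Every call on the proxy lands here; dispatch on the method name.
            if (method.getName().equals("findNameByEmail")) {
                return Optional.ofNullable(namesByEmail.get((String) args[0]));
            }
            // This toy handles one method only (not even toString or equals).
            throw new UnsupportedOperationException(method.getName());
        };
        return (ToyUserRepository) Proxy.newProxyInstance(
                ToyUserRepository.class.getClassLoader(),
                new Class<?>[] {ToyUserRepository.class},
                handler);
    }

    static String demo() {
        Map<String, String> data = new HashMap<>();
        data.put("alice@example.com", "Alice");
        ToyUserRepository repository = create(data);
        return repository.findNameByEmail("alice@example.com").orElse("not found");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The caller only ever sees ToyUserRepository. Whatever sits behind the interface is decided when the proxy is created, which is the same separation that lets a framework inject a generated repository into your service.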

The method-name translation is the part that feels most like magic. Spring Data JPA parses the method signature using reflection, breaks it down by the keywords it recognises (findBy, And, Or, OrderBy, Between, and so on), and constructs a JPQL query to match. A method named findByLastNameAndAgeGreaterThan becomes something like SELECT u FROM User u WHERE u.lastName = :lastName AND u.age > :age, assembled at startup rather than at call time. If you make a typo in the method name that produces an unresolvable expression, the application will fail to start, not fail at 2 AM when that code path is first hit in production.

public interface UserRepository extends JpaRepository<User, Long> {
 List<User> findByLastName(String lastName);
 Optional<User> findByEmail(String email);
 List<User> findByAgeGreaterThanOrderByLastNameAsc(int age);
}

That is the entire file. Spring Data JPA does the rest.
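To make the derivation step less mysterious, here is a deliberately tiny parser that turns a method name into a JPQL-like string. It is a toy under stated assumptions: it understands only the findBy prefix, the And connector, and the GreaterThan suffix, and ToyQueryDeriver is a hypothetical name. Spring Data JPA's real parser supports far more keywords and works differently inside. The one behaviour it does imitate is failing early when a property name does not exist.

```java
import java.util.Set;

// Toy derivation: supports only "findBy", "And", and the "GreaterThan" suffix.
class ToyQueryDeriver {
    static String derive(String entityName, Set<String> knownProperties, String methodName) {
        if (!methodName.startsWith("findBy")) {
            throw new IllegalArgumentException("Unsupported method name: " + methodName);
        }
        // Split on "And" only when an upper-case letter follows,
        // so a property such as "Android" is not cut in half.
        String[] parts = methodName.substring("findBy".length()).split("And(?=[A-Z])");

        StringBuilder jpql = new StringBuilder("SELECT e FROM " + entityName + " e WHERE ");
        for (int i = 0; i < parts.length; i++) {
            String part = parts[i];
            String operator = "=";
            if (part.endsWith("GreaterThan")) {
                operator = ">";
                part = part.substring(0, part.length() - "GreaterThan".length());
            }
            if (part.isEmpty()) {
                throw new IllegalArgumentException("Missing property in: " + methodName);
            }
            // "LastName" -> "lastName"
            String property = Character.toLowerCase(part.charAt(0)) + part.substring(1);
            if (!knownProperties.contains(property)) {
                // Mirrors the fail-fast idea: reject unknown properties up front.
                throw new IllegalArgumentException(
                        "No property '" + property + "' on " + entityName);
            }
            if (i > 0) {
                jpql.append(" AND ");
            }
            jpql.append("e.").append(property)
                .append(' ').append(operator)
                .append(" :").append(property);
        }
        return jpql.toString();
    }

    public static void main(String[] args) {
        Set<String> userProperties = Set.of("lastName", "age", "email");
        System.out.println(derive("User", userProperties, "findByLastNameAndAgeGreaterThan"));
        System.out.println(derive("User", userProperties, "findByEmail"));
    }
}
```

Calling derive with a misspelt name such as findByLastNmae throws immediately, which is the toy equivalent of the application refusing to start.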

When one of these derived methods is called at runtime, the proxy intercepts the call and runs the query it prepared at startup through a JPA EntityManager. (The inherited CRUD methods such as save and findById are served by the SimpleJpaRepository class instead, which also works through the EntityManager.) That EntityManager call flows down to Hibernate, which generates the SQL and sends it to the database. The call chain traverses all three layers: your repository interface, the JPA EntityManager, and Hibernate's SQL engine.

Spring Data JPA is part of the broader Spring Data umbrella project, which applies the same repository abstraction pattern to MongoDB, Redis, Cassandra, and other stores. Switching from a relational database to MongoDB doesn't require learning an entirely different programming model. You extend a MongoDB-specific repository interface, and query derivation works in much the same way. That consistency is the deeper goal of Spring Data's design.

Derived query methods cover a large surface area, but they are not the only option. For anything complex, @Query lets you write JPQL (or even native SQL with nativeQuery = true) directly on the method. The proxy mechanism stays in place; only the query source changes. This means you can stay in the repository interface without writing any implementation code, even for queries that method-name derivation cannot express cleanly.

Knowing which layer owns which behaviour is immediately useful when things go wrong.

When the Magic Breaks: Debugging Across the Three Layers

Each failure mode has a home layer, and pointing your debugger at the wrong one wastes time.

LazyInitializationException: Hibernate's session is closed

This exception means Hibernate tried to load a lazy association, looked for an open persistence context, and found none. It is a Hibernate-layer error about session lifecycle, not a Spring Data JPA bug. The usual cause is accessing a @OneToMany collection after the repository method has returned and the EntityManager has closed. Fix it at the layer that owns sessions: either annotate the calling service method with @Transactional so the persistence context stays open while you traverse the association, or switch the fetch type to eager for relationships you always need. A third option is a JPQL JOIN FETCH or @EntityGraph on the repository method itself, which tells Hibernate to load the association in the initial query rather than deferring it.
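The mechanics behind that exception fit in a few lines of plain Java. The sketch below is a toy model, not Hibernate code: ToySession and ToyLazyCollection are hypothetical, and the toy throws IllegalStateException where Hibernate would throw LazyInitializationException. It shows why both families of fix work: either the session is still open at first access, or the data was already loaded before the session closed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Toy model only: hypothetical classes, not Hibernate types.
class ToySession {
    private boolean open = true;

    boolean isOpen() {
        return open;
    }

    void close() {
        open = false;
    }
}

class ToyLazyCollection<T> {
    private final ToySession session;
    private final Supplier<List<T>> loader; // stands in for the deferred SELECT
    private List<T> items;                  // null until the first access

    ToyLazyCollection(ToySession session, Supplier<List<T>> loader) {
        this.session = session;
        this.loader = loader;
    }

    List<T> get() {
        if (items == null) {
            if (!session.isOpen()) {
                // Hibernate's real exception type is LazyInitializationException.
                throw new IllegalStateException("cannot load lazily: session is closed");
            }
            items = loader.get();
        }
        return items;
    }
}

class ToyLazyLoadingDemo {
    // First access happens after close(): the load cannot run.
    static boolean failsWhenFirstTouchedAfterClose() {
        ToySession session = new ToySession();
        ToyLazyCollection<String> orders =
                new ToyLazyCollection<>(session, () -> Arrays.asList("order-1", "order-2"));
        session.close();
        try {
            orders.get();
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    // First access happens while the session is open: data stays usable later.
    static int sizeWhenLoadedBeforeClose() {
        ToySession session = new ToySession();
        ToyLazyCollection<String> orders =
                new ToyLazyCollection<>(session, () -> Arrays.asList("order-1", "order-2"));
        orders.get();
        session.close();
        return orders.get().size();
    }

    public static void main(String[] args) {
        System.out.println(failsWhenFirstTouchedAfterClose());
        System.out.println(sizeWhenLoadedBeforeClose());
    }
}
```

Widening the transaction corresponds to moving close() later; JOIN FETCH and @EntityGraph correspond to calling get() before close() ever happens.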

Unexpected query count: Hibernate's lazy-loading strategy

If your application issues far more SQL statements than you expect, the N+1 problem is the likely culprit. The JPA default for @OneToMany is FetchType.LAZY, and Hibernate honours it, so fetching ten authors and iterating their posts triggers one query for the author list and ten more for the post collections. You won't see this in your repository interface; you have to look at what Hibernate is generating. Enable SQL logging in application.properties with either of the lines below (the first prints statements to standard output, the second routes them through the logging framework):

spring.jpa.show-sql=true
logging.level.org.hibernate.SQL=DEBUG

Once you can see the queries, the fix is at the Hibernate or JPA layer, not the Spring Data layer. A JOIN FETCH in a @Query annotation collapses the N+1 into a single join. @EntityGraph on the repository method achieves the same result declaratively. For cases where you prefer to keep lazy loading but still want fewer round trips, setting spring.jpa.properties.hibernate.default_batch_fetch_size=10 tells Hibernate to load related collections in batches with an IN clause rather than one query per row.
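To compare the three strategies side by side, the sketch below generates the statements each one would roughly produce for a list of authors. It is a simulation under assumptions: ToyFetchPlans is hypothetical, the author and post table names are invented, and the SQL text is illustrative rather than Hibernate's actual output. The batch grouping is also simplified, since real batch sizes can be split differently depending on the Hibernate version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Simulation only: the SQL strings are illustrative, not real Hibernate output.
class ToyFetchPlans {
    // Helper that builds the ids 1..count.
    static List<Long> ids(int count) {
        List<Long> result = new ArrayList<>();
        for (long id = 1; id <= count; id++) {
            result.add(id);
        }
        return result;
    }

    // Lazy loading touched in a loop: one query for parents, one per parent.
    static List<String> lazyOneByOne(List<Long> authorIds) {
        List<String> sql = new ArrayList<>();
        sql.add("SELECT * FROM author");
        for (Long id : authorIds) {
            sql.add("SELECT * FROM post WHERE author_id = " + id);
        }
        return sql;
    }

    // Batch fetching: children are loaded in groups with an IN clause.
    static List<String> batched(List<Long> authorIds, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        List<String> sql = new ArrayList<>();
        sql.add("SELECT * FROM author");
        for (int start = 0; start < authorIds.size(); start += batchSize) {
            int end = Math.min(start + batchSize, authorIds.size());
            String idList = authorIds.subList(start, end).stream()
                    .map(id -> String.valueOf(id))
                    .collect(Collectors.joining(", "));
            sql.add("SELECT * FROM post WHERE author_id IN (" + idList + ")");
        }
        return sql;
    }

    // JOIN FETCH: parents and children arrive in a single statement.
    static List<String> joinFetch() {
        List<String> sql = new ArrayList<>();
        sql.add("SELECT a.*, p.* FROM author a LEFT JOIN post p ON p.author_id = a.id");
        return sql;
    }

    public static void main(String[] args) {
        List<Long> authors = ids(10);
        System.out.println("lazy:       " + lazyOneByOne(authors).size() + " statements");
        System.out.println("batch of 10: " + batched(authors, 10).size() + " statements");
        System.out.println("join fetch: " + joinFetch().size() + " statement");
    }
}
```

For ten authors the simulation yields eleven statements for lazy loading, two for a batch size of ten, and one for the join, which is the trade-off described above expressed as counts.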

Transaction rollback surprises: Spring's @Transactional infrastructure

When a transaction does not behave the way you expect (changes not persisting, rollbacks firing on the wrong exception, or operations on a second data source not participating in the transaction), the problem lives at the Spring layer. @Transactional is managed entirely by Spring's AOP proxy, and its default behaviour is to roll back only on unchecked exceptions (RuntimeException and Error); a checked exception lets the transaction commit unless you configure rollbackFor. If your application has multiple data sources, Spring applies @Transactional to the primary one by default; the secondary data source participates only if you configure a JtaTransactionManager or explicitly name the transaction manager in the annotation. The fix is a Spring configuration change, not a Hibernate tuning exercise.
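The default rollback rule is small enough to model directly. The sketch below is a plain-Java imitation of that rule, not Spring's implementation: ToyTransactionRunner is hypothetical, and unlike Spring it swallows the exception instead of rethrowing it so the outcome can be returned as a string.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Toy imitation of the default rule; not Spring's transaction infrastructure.
class ToyTransactionRunner {
    // Unchecked exceptions and Errors roll back; checked exceptions do not.
    static boolean rollsBackByDefault(Throwable failure) {
        return failure instanceof RuntimeException || failure instanceof Error;
    }

    // Reports what a default-configured transaction would do with this work.
    // Spring would rethrow the exception; the toy swallows it for clarity.
    static String outcomeOf(Callable<?> work) {
        try {
            work.call();
            return "COMMIT";
        } catch (Throwable failure) {
            return rollsBackByDefault(failure) ? "ROLLBACK" : "COMMIT";
        }
    }

    static String uncheckedOutcome() {
        return outcomeOf(() -> {
            throw new IllegalStateException("unchecked failure");
        });
    }

    static String checkedOutcome() {
        return outcomeOf(() -> {
            throw new IOException("checked failure");
        });
    }

    public static void main(String[] args) {
        System.out.println("unchecked: " + uncheckedOutcome());
        System.out.println("checked:   " + checkedOutcome());
    }
}
```

The checked case is the one that surprises people: the method fails, yet the work done before the exception is committed. In Spring itself, the usual remedy is the rollbackFor attribute on @Transactional, for example rollbackFor = Exception.class.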

The mental model from earlier sections makes all three of these straightforward to triage: session lifecycle is Hibernate, query shape is Hibernate, transaction boundaries are Spring. When you see a stack trace or unexpected behaviour, ask which layer owns that concept, then inspect or configure exactly that layer. Turning on SQL logging in development is a good habit regardless; the raw queries Hibernate sends are the single most informative signal available, and most performance surprises reveal themselves within a few minutes of reading them.