DEV Community: Ozioma Ochin

An Incident Isn't an Event. It's a Definition.

Ozioma Ochin — Wed, 06 May 2026 13:18:59 +0000

Two evenings to build ingestion. Three weeks to decide what counts as an incident.

TL;DR: When you build a log analysis tool, ingestion takes two evenings. Deciding what counts as an incident takes weeks. This post walks through three rules (a fingerprint, a threshold, and a reopen window) and the trade-offs each one forces.

In this post:

What makes two logs the same incident?
How many occurrences make an incident?
When is an incident closed?
Where the rules stop
Why detection is the hard part

When you're asked to build a log analysis tool, your first instinct is to build the ingestion path.

You add a database table. You wire up a controller. You hit the endpoint with Postman or curl, see the row appear in Postgres, and feel like you've made progress.

Two evenings of work, and ingestion feels done.

I know this because that's what I did when I started TraceRoot.

Two evenings on ingestion. The next three weeks went into the part I hadn't planned for.

Logs are events. Incidents are stateful objects with lifecycles. Going from a stream of POST /logs calls to something that becomes a useful incident is the work that doesn't get written about.

It's also the work that determines whether your tool produces something engineers want to look at, or just a slow events table with extra steps.

This post walks through the decisions TraceRoot makes about what counts as an incident. None of them is technically hard. All of them are opinions encoded as rules. Each one has a wrong default that systems quietly inherit, and a trade-off the right choice forces you to accept.

These are mine. They're not the only valid answers.
They're the ones the system enforces.

What makes two logs the same incident?

Once logs are flowing into Postgres, the next question comes fast: which of these are the same problem?

Two NullPointerException logs from inventory-service are obviously related.

But what about a NullPointerException and a TimeoutException from the same endpoint? Same underlying bug or different ones?

What about the same exception type from two different services? What about logs that share a trace ID but happen ten seconds apart?

Every one of these has a defensible answer. The problem is the system has to pick one and apply it consistently to millions of logs.

TraceRoot uses four fields:

public String buildPatternKey(LogRecord record) {

    String exceptionType = normalizeExceptionType(
            record.getExceptionType()
    );

    String endpoint = normalizeEndpoint(
            record.getEndpoint()
    );

    return record.getServiceName() + "|" +
           record.getLevel() + "|" +
           exceptionType + "|" +
           endpoint;
}

That's the fingerprint. Same four fields, same incident. Change one, and it's a different incident.

The reason these four specific fields, and not others, comes down to what each one captures that the others can't.

ServiceName: Two services failing similarly are different incidents. A NullPointerException in payment-service and the same exception in inventory-service might share a root cause, but they have different on-call paths, different rollback decisions, different blast radii.
Level: ERROR belongs in a fingerprint. WARN is informational. INFO is noise. Mixing them produces meaningless groupings.
ExceptionType: This is where TimeoutException and NullPointerException separate even at the same endpoint. Different bugs, different fixes.
Endpoint: Two endpoints in the same service throwing the same exception type might be a shared library bug, or two unrelated bugs. Splitting by endpoint preserves the distinction.

The fields left out of the fingerprint matter just as much as the ones included. Three are tempting to add and would each break the model in a different way:

Message text: Messages drift on every occurrence. "User abc-123 timed out after 4827ms" and "User def-456 timed out after 5102ms" are the same incident. Including message would create thousands of one-offs that should be one.
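Timestamp: Every log has a different one, so a fingerprint that includes it groups nothing. Collapsing time-stamped events into incidents is exactly the move we're trying to make, and including the timestamp would defeat it.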
Trace ID: A trace is one request. An incident might span thousands. Including traceId would mean every retry storm produces dozens of "incidents."

The four fields (serviceName, level, exceptionType, endpoint) are the smallest set that captures real differences without splitting on noise. That's the whole rule.
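The two normalize helpers called in buildPatternKey aren't shown in this post. As a minimal sketch of what such helpers might do, assuming endpoints arrive as bare URL paths and exception types as class names, the version below collapses numeric and UUID path segments into a placeholder and keeps only the simple class name. The class name FingerprintNormalizers, the {id} placeholder, and the "unknown" fallback are illustrative assumptions, not TraceRoot's actual code.

```java
import java.util.regex.Pattern;

// Hypothetical helpers: one possible way to keep per-request noise out of the fingerprint.
final class FingerprintNormalizers {
    private static final Pattern UUID = Pattern.compile(
            "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}");
    private static final Pattern NUMERIC = Pattern.compile("\\d+");

    // Drops the query string and replaces ID-like path segments with "{id}".
    static String normalizeEndpoint(String endpoint) {
        if (endpoint == null || endpoint.isBlank()) {
            return "unknown";
        }
        String path = endpoint.trim();
        int query = path.indexOf('?');
        if (query >= 0) {
            path = path.substring(0, query);
        }
        StringBuilder out = new StringBuilder();
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip the empty piece before the leading slash
            }
            boolean looksLikeId = NUMERIC.matcher(segment).matches()
                    || UUID.matcher(segment).matches();
            out.append('/').append(looksLikeId ? "{id}" : segment);
        }
        return out.length() == 0 ? "/" : out.toString();
    }

    // Keeps only the simple class name, so package prefixes don't split a group.
    static String normalizeExceptionType(String type) {
        if (type == null || type.isBlank()) {
            return "unknown";
        }
        String trimmed = type.trim();
        return trimmed.substring(trimmed.lastIndexOf('.') + 1);
    }

    public static void main(String[] args) {
        System.out.println(normalizeEndpoint("/orders/42/items?page=2"));
        System.out.println(normalizeExceptionType("java.lang.NullPointerException"));
    }
}
```

Without some step like this, /orders/42 and /orders/43 would land in different fingerprints for the same reason message text would: the value drifts on every occurrence.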

How many occurrences make an incident?

Fingerprinting groups logs that belong together. The next question is when a group of logs becomes worth showing to an engineer.

Before settling on three errors in five minutes, I considered the obvious alternatives. Each fails in a specific way.

Threshold of one: Every error becomes an incident. The result is 200 alerts a day, 195 of which are noise. On-call engineers stop reading by week two.
Threshold of one hundred in an hour: Catches sustained problems but misses fast-burn incidents. A payment provider goes down for 90 seconds, throws 50 timeouts, recovers. Important. Never gets surfaced.
No threshold, alert on rate change instead: Smarter, but requires baseline data the system doesn't have on day one. Useful as a layer on top of threshold-based detection. Not a replacement.

TraceRoot's choice is three matching errors within five minutes:

public static final int INCIDENT_THRESHOLD = 3;

// inside createLog(...), after active and resolved checks:
LocalDateTime windowStart = LocalDateTime.now().minusMinutes(5);

List<LogRecord> matchList =
        logRepository.findByServiceNameAndLevelAndExceptionTypeAndEndpointAndTimestampAfter(
                record.getServiceName(),
                record.getLevel(),
                record.getExceptionType(),
                record.getEndpoint(),
                windowStart
        );

if (matchList.size() >= INCIDENT_THRESHOLD) {
    incidentService.createIncident(
            fingerPrint,
            matchList.size(),
            request.getTimestamp()
    );
}

Three is enough to distinguish a transient blip from a pattern. A one-off NullPointerException from a malformed request isn't an incident; the same exception three times in five minutes is. Five minutes is short enough that detection fires before an engineer would notice on their own. It's also long enough that legitimate retries within a single user flow don't trip the threshold.

The numbers are not universal. The point is not the specific values. The point is that there is a threshold, and it is explicit, and it lives in one place where it can be changed.
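To make "explicit and in one place" concrete, here is a minimal in-memory sketch of the same rule with the threshold and window passed in as constructor arguments. It is not TraceRoot's implementation, which queries Postgres as shown above. The class name SlidingWindowThreshold is hypothetical, and the sketch assumes timestamps arrive in order for each pattern key.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory variant of "N matching errors within a window".
final class SlidingWindowThreshold {
    private final int threshold;
    private final Duration window;
    private final Map<String, Deque<Instant>> seen = new HashMap<>();

    SlidingWindowThreshold(int threshold, Duration window) {
        this.threshold = threshold;
        this.window = window;
    }

    // Records one occurrence and reports whether the pattern is at or over the threshold.
    // Callers still need their own "is there already an active incident?" check.
    boolean record(String patternKey, Instant at) {
        Deque<Instant> times = seen.computeIfAbsent(patternKey, k -> new ArrayDeque<>());
        times.addLast(at);
        Instant cutoff = at.minus(window);
        // Keep only occurrences strictly after the cutoff, mirroring "TimestampAfter".
        while (!times.isEmpty() && !times.peekFirst().isAfter(cutoff)) {
            times.removeFirst();
        }
        return times.size() >= threshold;
    }

    public static void main(String[] args) {
        SlidingWindowThreshold detector = new SlidingWindowThreshold(3, Duration.ofMinutes(5));
        Instant t0 = Instant.parse("2026-05-06T12:00:00Z");
        String key = "inventory-service|ERROR|NullPointerException|/items/{id}";
        System.out.println(detector.record(key, t0));                  // false
        System.out.println(detector.record(key, t0.plusSeconds(60)));  // false
        System.out.println(detector.record(key, t0.plusSeconds(120))); // true
    }
}
```

Changing the rule to five errors in ten minutes is then a one-line change at construction time, which is the property the paragraph above argues for.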

What this misses, on purpose, is a single critical error that should fire without waiting. A DataCorruptionException happening once is more important than three timeouts in five minutes. Severity-based override paths solve that. TraceRoot doesn't have one yet.
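For illustration only, since TraceRoot has no such path yet, a severity override could be as small as a set of exception types that bypass the count. The class name DetectionPolicy and the contents of CRITICAL_TYPES are assumptions made for this sketch.

```java
import java.util.Set;

// Hypothetical fast path: known-critical exception types open an incident on first sight.
final class DetectionPolicy {
    static final int INCIDENT_THRESHOLD = 3;

    // Assumed list; a real one would come from configuration.
    static final Set<String> CRITICAL_TYPES = Set.of("DataCorruptionException");

    static boolean shouldOpenIncident(String exceptionType, int matchesInWindow) {
        if (CRITICAL_TYPES.contains(exceptionType)) {
            return matchesInWindow >= 1; // bypass the count
        }
        return matchesInWindow >= INCIDENT_THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(shouldOpenIncident("DataCorruptionException", 1)); // true
        System.out.println(shouldOpenIncident("TimeoutException", 1));        // false
    }
}
```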

When is an incident closed?

This is the decision most observability tools either skip or get wrong.

The naive design says incidents close when an engineer marks them resolved. New occurrences create new incidents. It's clean. It's easy to implement. It's also the wrong model for this problem.

I learned this on a previous team. We had a database query that timed out for one user, every Wednesday afternoon, for about three months. Same query, same error, same fingerprint. The incident tool created a fresh incident every time. Each one got triaged from scratch. Each one got resolved as "intermittent, can't reproduce." Each one got closed.

It wasn't different bugs. It was one bug, showing up about a dozen times. The tool couldn't tell us that because it didn't model continuity.

TraceRoot models continuity through a reopen window. When a fingerprint matches a resolved incident within 24 hours, the existing incident reopens.

LocalDateTime resolvedAt = incident.getResolvedAt();

if (resolvedAt == null) {
    return false;
}

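// Time elapsed since the incident was resolved.
Duration sinceResolved = Duration.between(
        resolvedAt,
        LocalDateTime.now()
);

// Compare whole Durations: toHours() truncates, so "toHours() > 24"
// would still reopen an incident resolved 24h59m ago.
if (sinceResolved.compareTo(Duration.ofHours(24)) > 0) {
    return false;
}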

incident.setIncidentStatus(IncidentStatus.ACTIVE);
incident.setEventCount(incident.getEventCount() + 1);
incident.setSummaryStale(true);
incident.setResolvedAt(null);

What the engineer sees is one incident accumulating events, with the recovery and recurrence visible in the metadata. Not duplicates of the same problem.

Why 24 hours specifically? Most "the bug came back" cases happen within a working day. After that, code has shipped. A new occurrence is more likely to be a new incident. Twenty-four hours catches the worst-case "fix didn't actually work" window without dragging stale context forward forever.

The trade-off is real: a long-running intermittent bug that recurs every 30 hours never reopens. A longer window creates its own problems by carrying old incidents into new code. Twenty-four hours is the line that catches most cases without making the incident table a graveyard.
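The reopen rule is easier to reason about, and to test at its edges, when the clock is passed in rather than read inside the method. The sketch below is a hypothetical restatement of the rule as a pure function. It uses Instant where the excerpt above uses LocalDateTime, and the name ReopenPolicy is an assumption.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical pure-function form of the reopen window.
final class ReopenPolicy {
    static final Duration REOPEN_WINDOW = Duration.ofHours(24);

    // True when the incident was resolved no more than `window` before `now`.
    static boolean shouldReopen(Instant resolvedAt, Instant now, Duration window) {
        if (resolvedAt == null) {
            return false; // never resolved, so there is nothing to reopen
        }
        Duration sinceResolved = Duration.between(resolvedAt, now);
        return !sinceResolved.isNegative() && sinceResolved.compareTo(window) <= 0;
    }

    public static void main(String[] args) {
        Instant resolved = Instant.parse("2026-05-06T09:00:00Z");
        // Recurrence five hours after the fix: same incident.
        System.out.println(shouldReopen(resolved, resolved.plus(Duration.ofHours(5)), REOPEN_WINDOW));  // true
        // Recurrence thirty hours later: treated as a new incident.
        System.out.println(shouldReopen(resolved, resolved.plus(Duration.ofHours(30)), REOPEN_WINDOW)); // false
    }
}
```

The 30-hour case is the trade-off described above: the function returns false, and the recurrence starts a fresh incident.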

Where the rules stop

These decisions get you a working incident model. They don't get you a complete one. Three real gaps:

Single critical errors: A DataCorruptionException happening once matters more than three timeouts in five minutes. Threshold-based detection delays the first alert by definition. The fix is a severity-based fast path that bypasses the count for known-critical exception types. TraceRoot doesn't have one yet.
Cross-service correlation: A payment-service timeout often causes an order-service NullPointerException two seconds later. The fingerprint logic treats them as separate incidents. They are related, and the system has no way to know it. Span-level correlation in a tracing system solves this. The incident model alone can't.
Rate changes against baseline: A service that used to throw zero errors per hour and now throws fifty isn't reliably caught by a threshold of three. Spread across several fingerprints, or arriving fewer than three at a time in any five-minute window, those errors never trip it. The slope matters, not just the floor. This is a different detection algorithm (historical baselines, statistical confidence) running alongside fingerprinting, not replacing it.
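To show what "a different detection algorithm" means for the third gap, here is a deliberately simple baseline check: flag the current hour when its error count exceeds the historical mean by a chosen number of standard deviations. This is a generic stand-in, not something TraceRoot implements, and the name BaselineCheck and its parameters are assumptions. Note that it returns false with no history, which is the day-one limitation mentioned earlier.

```java
// Hypothetical baseline check: mean plus k standard deviations over past hourly counts.
final class BaselineCheck {
    static boolean isAboveBaseline(long[] hourlyHistory, long currentHour,
                                   double sigmas, long minCount) {
        // No history means no baseline; tiny counts are ignored to avoid flagging one-offs.
        if (hourlyHistory.length == 0 || currentHour < minCount) {
            return false;
        }
        double mean = 0;
        for (long count : hourlyHistory) {
            mean += count;
        }
        mean /= hourlyHistory.length;

        double variance = 0;
        for (long count : hourlyHistory) {
            variance += (count - mean) * (count - mean);
        }
        variance /= hourlyHistory.length;

        return currentHour > mean + sigmas * Math.sqrt(variance);
    }

    public static void main(String[] args) {
        // Zero errors per hour historically, fifty now.
        System.out.println(isAboveBaseline(new long[] {0, 0, 0, 0}, 50, 3.0, 10)); // true
    }
}
```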

Each is a legitimate detection problem with its own algorithms and its own trade-offs. Threshold-based fingerprinting is the foundation other approaches build on, not a replacement for them.

The reason to be explicit about scope is that most tooling isn't. Dashboards imply more capability than they deliver. Knowing where the rules stop is what makes the rules trustworthy, and it matters when you're choosing what to trust.

Why detection is the hard part

For many teams, ingestion, storage, and search are mostly solved problems. You can wire up a competent pipeline in a weekend, and Postgres or OpenSearch handle the rest.

Detection isn't fully solved. Not because the algorithms are hard. They aren't. The threshold check in this article is six lines of code. The fingerprint is a method that joins four strings. The reopen logic is a date comparison.

Detection is hard because every rule embeds a worldview about what counts as one thing. Get the worldview wrong and the incident table becomes noisy and unreliable. Too many incidents, too few, or the wrong ones grouped together. Get it right, and an on-call engineer at 11 p.m. sees 3 incidents instead of 847 events. The work the system did to decide what counts as one thing is what made that list useful.

The point isn't that TraceRoot's worldview is the right one. It’s that somebody has to encode a worldview, and that decision determines everything downstream. The summary, the dashboard, the alert, and the postmortem all inherit that decision.

If you've built incident detection — thresholds, fingerprints, ML, anything else — I want to hear which worldview you encoded and what broke as a result. The specific decisions matter more than the algorithms, and there isn't a settled answer.

buenas / traceroot

AI-powered incident detection and root cause analysis platform built with Spring Boot and PostgreSQL

TraceRoot — AI-Powered Reliability Platform

TraceRoot is a backend reliability platform that ingests application logs, detects recurring production failures, groups them into lifecycle-managed incidents, and generates AI-powered summaries to accelerate root cause analysis.

It consists of two systems:

TraceRoot Platform — the core reliability platform (log ingestion, incident detection, lifecycle management, AI summarization, metrics).
Failure Lab — a distributed microservices sandbox that generates realistic failure patterns (cascading failures, retry storms, timeouts, null pointers) to stress-test the platform with real distributed traffic.

The project is intentionally structured to resemble systems like Datadog, Sentry, and New Relic at the architectural level, while remaining fully implementable and readable at the application layer.

Problem

Production backends generate large volumes of logs, but modern observability tooling still leaves engineers doing the hard work:

Logs are noisy and unstructured.
Recurring failures are buried in volume.
Engineers manually correlate errors across services and time windows.
There is…


Engineers Don't Want to Search Logs. They Want to Know What Broke.

Ozioma Ochin — Tue, 21 Apr 2026 13:11:31 +0000

A working engineer's argument that the observability stack has been solving the wrong problem for fifteen years.

It's 7 pm on a Friday and production is on fire. Your weekend is off to a great start. Obviously.

Your payment service is throwing errors and your company is losing revenue. You open your log tool, filter by service and error level, and the interface returns 847 matching log lines from the last ten minutes.

Now what?

You scroll. Some of the errors are timeouts. Some are null pointers. The log tool has done its job — it found the logs, filtered them down, rendered them quickly. The part it hasn't done, the part still sitting on your shoulders while your friends are chilling, is everything that happens after you find the logs.

Which of these 847 errors are the same issue in different clothes? Which ones caused which? Is it the previous deployment from six hours ago, or the provider that's been flaky all week?

This is the real work of debugging a production system. It's not search. It's reasoning over what search returned.

I've been thinking about this gap while building a log analysis platform of my own, and I've come to a conclusion that sounds obvious once you say it out loud: engineers don't actually want to search logs. They want to know what broke, why, and what to check next.

Log tools have spent years getting really good at the first part and barely touching the second.

Logs Used to Be Hard to Find

I started my career grepping through log files on production servers. For most of the history of software, the hardest part of working with logs was finding them.

You had SSH access to a server, a tail -f command, and a folder full of rotating log files. If production broke, you started typing grep commands and hoped the right service was the one you were currently looking at.

Then came Splunk, Elasticsearch, Datadog and a dozen others. Each of them solved the same problem at scale: get all your logs in one place, index them, and give engineers a fast way to query across them.

If you've ever had to debug a production incident by SSHing into three servers in sequence, you understand why developers have come to love these tools.

This was the right optimization for its time. When the bottleneck is "I physically cannot find the relevant logs in reasonable time," the answer is a better index and a better query interface. Search was the product.

But the bottleneck has moved. Modern backend systems have structured logs, trace IDs, correlation tokens, and distributed tracing. You don't spend 40 minutes finding the relevant logs anymore — you spend 40 seconds. The tooling works. The search layer is solved well enough that it's no longer the thing standing between you and understanding what happened.

The pain point is everything after you have found the search results.

What Debugging Actually Looks Like

Here's what debugging a production incident actually looks like, once search has done its part.

You have your 847 log lines on the screen. The first thing your brain does — after mourning the Friday party you're missing — is group them.

You notice that 600 of the errors say TimeoutException and come from the same endpoint. You mentally file those as one incident. Another 150 are null pointers from a different service. You file those as a second incident. The remaining 97 are a mix you can't immediately classify, so you set them aside.

You've just done pattern detection. No tool helped you. You did it in your head, in about fifteen seconds, using muscle memory you've built up over years of reading logs. It doesn't always go that fast. I remember staring at a log for more than 30 minutes with no idea where to start.

Next, you start asking about causality. The timeouts started at 6:42pm. The null pointers started at 6:44pm. You already suspect the timeouts caused the null pointers — some downstream service couldn't get data it needed, and its null check was sloppy. You don't know this yet, but you're operating on that assumption. You're building a mental graph of "this caused that" using timestamps and your own prior knowledge of how the services connect.

Then you start asking about state. How many times has this happened in the last hour? Is it still happening right now, or did it stop? Has this error pattern shown up before? If it has, what did you do about it last time?

Then you start asking about context. Was there a deploy today? (This used to be my go-to question.) Is the database healthy? Is the upstream provider having issues? You open three other tabs — your deploy log, your database dashboard, your provider's status page — and start cross-referencing.

Eventually, you form a hypothesis. Something like: the upstream payment provider started failing at 6:42pm, which caused our payment service to time out, which caused the order service to receive null responses and crash. You don't know if you're right. You start checking.

None of this is search. Search gave you the 847 log lines. Everything above — the grouping, the causality reasoning, the state tracking, the context gathering, the hypothesis forming — is work your log tool didn't do and was never designed to do.

This is the real pain point. Not "find the logs," but "reason over the logs after you find them." And this is the part automation has never reached, because it requires judgment.

The reason this matters now is that judgment-shaped problems are exactly what the current generation of language models is usable for. Not perfect at. Usable for. Grouping similar errors, reconstructing a rough timeline, drafting a plausible cause, suggesting what to check next — these are tasks that were speculative three years ago and are now tractable. Imperfectly, but it works.

The observability stack hasn't absorbed this yet. That's the gap.

Why This Didn't Get Fixed Earlier

Fair question at this point: if the gap is so obvious, why hasn't it been filled already?

It has been attempted. Three times, mostly.

The first direction was anomaly detection. Starting around 2015, several tools attempted to train models on historical logs to automatically flag unusual patterns. If a model has seen a million logs from your system behaving normally, it should be able to tell you when something looks off. However, one company's authentication logs don't look like another's. A model trained on Company A's logs will generalize badly to Company B's, and there's no clean way to fix that without labeled training data ("this is an incident, this is not"), which nobody had.

The second direction was rules-based alerting. Define thresholds, define patterns, get a notification when they're exceeded. This works, up to a point — and it’s how most production systems are still monitored today.

The problem is that rules are fragile. Your system evolves. Your traffic shifts. Your dependencies change. And suddenly your alerts are firing on noise or missing real incidents. The rules require constant maintenance, and the maintenance work is high-skill but low-status — and it doesn't get done. Most alerting systems I've worked with end up muted. On one team, we'd silenced four out of every five Slack alerts because they'd stopped meaning anything.

The third direction was generative summarization, which is where you might expect this to have been solved years ago. Language models that summarize text have existed in various forms for a long time. The honest answer is the models weren't good enough. GPT-2 could summarize a news article. It could not coherently reason across two hundred structured log entries and produce something an engineer would trust. The summaries were confident, vague, and frequently wrong — which is the worst possible combination for a debugging tool. I tried pasting logs into GPT-3.5 back when it launched. The summaries read like they knew what they were talking about. They didn't.

That changed. The current generation of models, given structured input and a targeted prompt, can produce summaries that are at least useful as a first draft of an engineer's reasoning. They're not replacing the engineer. They're saving the first twenty minutes of the engineer's work. That's enough of a shift to make the category viable.

None of the earlier approaches were wrong. Anomaly detection and rules-based alerting remain the backbone of every observability stack. What's new is that the summarization layer on top of them is finally workable.

What the New Stack Looks Like

So what does a log tool that takes summarization seriously actually look like?

Four layers, in order of execution.

First, structured log ingestion and search. Still essential. You need the raw logs indexed somewhere you can query them fast, because you always need to fall back to primary evidence. OpenSearch, Elasticsearch, ClickHouse, or similar — the specific tool matters less than the fact that the layer exists. The reasoning built on top of this is only as good as the search underneath it.

Second, deterministic pattern detection. Before any LLM runs, you group related errors using rules. Same service, same level, same exception type, same endpoint — that's one pattern. Three matches within five minutes is an incident. These are rules, not AI.

They're boring, they're predictable, and they're the foundation that lets the AI layer be useful. If you skip this step and let an LLM group errors directly, you get garbage, because the LLM has no stable way of deciding what "similar" means. In TraceRoot, this grouping is deterministic and repeatable — not probabilistic — so the same logs always produce the same incident.
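As a minimal sketch of what "deterministic and repeatable" grouping means here, the example below groups log lines by a key built from the same four fields and counts each group. The LogLine record and the PatternGrouping class are hypothetical stand-ins, not TraceRoot's types.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical rule-based grouping: no model involved, same input always gives same groups.
final class PatternGrouping {
    record LogLine(String service, String level, String exceptionType, String endpoint) {
        String patternKey() {
            return service + "|" + level + "|" + exceptionType + "|" + endpoint;
        }
    }

    // TreeMap keeps the keys sorted, so the output order is stable as well.
    static Map<String, Long> countByPattern(List<LogLine> logs) {
        return logs.stream().collect(
                Collectors.groupingBy(LogLine::patternKey, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<LogLine> logs = List.of(
                new LogLine("payment-service", "ERROR", "TimeoutException", "/charge"),
                new LogLine("order-service", "ERROR", "NullPointerException", "/orders"),
                new LogLine("payment-service", "ERROR", "TimeoutException", "/charge"));
        countByPattern(logs).forEach((key, count) -> System.out.println(count + "  " + key));
    }
}
```

Running the same list through countByPattern twice gives the same map, which is the guarantee an LLM-based grouping can't offer.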

Third, incident lifecycle. An incident is not an event. It's a stateful object with a first-seen timestamp, a last-seen timestamp, an event count, and a status (active or resolved). When a resolved pattern reappears, it reopens instead of creating a new incident. This sounds obvious, but most log tools don't model it — they just show you the events and leave the state in your head. You have to figure it out yourself.

On a Friday evening, this is how weekends die. I’ve watched a resolved bug come back five hours after the fix shipped, while my team was still celebrating. In TraceRoot, incident summaries are cached and regenerated only when the underlying event count or last-seen timestamp changes.
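The lifecycle described above fits in a small class. The sketch below is a hypothetical model, not TraceRoot's entity: it tracks first-seen, last-seen, event count, and status, reopens on a new event, and marks the cached summary stale whenever the count or last-seen timestamp changes. The reopen window is left out to keep the example short.

```java
import java.time.Instant;

// Hypothetical incident model: a stateful object, not an event.
final class IncidentState {
    enum Status { ACTIVE, RESOLVED }

    private final Instant firstSeen;
    private Instant lastSeen;
    private long eventCount = 1;
    private Status status = Status.ACTIVE;
    private String cachedSummary;        // null until a summary has been generated
    private boolean summaryStale = true;

    IncidentState(Instant firstSeen) {
        this.firstSeen = firstSeen;
        this.lastSeen = firstSeen;
    }

    // A matching log arrived: update counters, reopen if resolved, invalidate the summary.
    void recordEvent(Instant at) {
        eventCount++;
        if (at.isAfter(lastSeen)) {
            lastSeen = at;
        }
        status = Status.ACTIVE;
        summaryStale = true;
    }

    void resolve() {
        status = Status.RESOLVED;
    }

    void cacheSummary(String summary) {
        cachedSummary = summary;
        summaryStale = false;
    }

    boolean needsSummary() { return summaryStale; }
    Status status() { return status; }
    long eventCount() { return eventCount; }
    Instant firstSeen() { return firstSeen; }
    Instant lastSeen() { return lastSeen; }
    String cachedSummary() { return cachedSummary; }

    public static void main(String[] args) {
        IncidentState incident = new IncidentState(Instant.parse("2026-04-17T18:42:00Z"));
        incident.cacheSummary("Upstream provider looks degraded.");
        incident.resolve();
        incident.recordEvent(Instant.parse("2026-04-17T23:40:00Z"));
        System.out.println(incident.status() + " " + incident.eventCount()
                + " needsSummary=" + incident.needsSummary());
    }
}
```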

Fourth, the reasoning layer. The LLM reads the grouped, structured incident context — metadata plus a sample of the matching logs — and produces a summary, a probable cause, and recommended checks. It is not reading raw logs. It is reasoning over structured input that has already been filtered, grouped, and annotated by the three layers beneath it. This is the difference between "pasting logs into ChatGPT" (which produces garbage) and "asking a model to reason over a specific incident pattern" (which works).
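What "structured incident context" could look like in practice: a small renderer that turns incident metadata plus a capped sample of logs into the text a model would read. The field names, the layout, and the IncidentContext class are assumptions for this sketch, and no model call is shown.

```java
import java.util.List;

// Hypothetical context builder: metadata first, then a bounded sample of matching logs.
final class IncidentContext {
    static String render(String patternKey, long eventCount, String firstSeen,
                         String lastSeen, List<String> matchingLogs, int maxSamples) {
        StringBuilder text = new StringBuilder();
        text.append("pattern: ").append(patternKey).append('\n');
        text.append("event_count: ").append(eventCount).append('\n');
        text.append("first_seen: ").append(firstSeen).append('\n');
        text.append("last_seen: ").append(lastSeen).append('\n');

        // Cap the sample so the input stays small no matter how large the incident grows.
        int shown = Math.min(maxSamples, matchingLogs.size());
        text.append("sample_logs (").append(shown).append(" of ")
            .append(matchingLogs.size()).append("):\n");
        for (int i = 0; i < shown; i++) {
            text.append("- ").append(matchingLogs.get(i)).append('\n');
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.print(render(
                "payment-service|ERROR|TimeoutException|/charge",
                847, "18:42", "18:52",
                List.of("timed out after 4827ms", "timed out after 5102ms", "timed out after 4990ms"),
                2));
    }
}
```

The model never sees all 847 lines. It sees the shape the three layers beneath it have already imposed.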

The order matters. Determinism before intelligence. The LLM sits at the top because it works dramatically better when the input has been shaped. Reverse the order and you get 2023-era demos that break on real data.

This is the architecture I'm building in TraceRoot — an incident-first log reasoning system. It's not the only way to do it.

What This Post Is Not Arguing

It's not arguing that search is obsolete. You still need fast, faceted search over raw logs — for ad hoc investigation, for debugging patterns the system hasn't seen before, for verifying what the AI layer tells you. The reasoning layer sits on top of search. It doesn't replace it. If the search layer is broken or slow, the reasoning layer is useless.

It's not arguing that LLMs should make decisions. The summary, probable cause, and recommended checks are a first draft of an engineer's reasoning — a starting point to accept, reject, or refine. The engineer is still the one paging the on-call, reverting the deploy, or opening the incident bridge. The LLM is triage assistance, not triage replacement.

And it's not arguing that this architecture is finished. The summarization layer works, but "works" is a low bar. The evaluations are still young, the prompt engineering is still fragile, and there are failure modes I haven't seen yet. I'll write about those as I run into them.

Back to Friday, 7pm

The payment service is throwing errors. Your weekend is in the balance.

Pattern detection in your head. Causality in your head. State tracking across browser tabs. Hypothesis from experience. Twenty minutes before you know what you're actually dealing with.

In the new stack, you open the incident — not the logs, the incident — and you see it already grouped. Same service, same exception type, same endpoint. Event count: 847. First seen at 6:42pm.

The summary says the upstream payment provider looks degraded, the recommended checks point you at the provider's status page and the last deploy. You don't take the summary on faith. You verify it against the raw logs, which are one click away. The twenty minutes of reconstruction becomes one minute of reading and two of validation.

The engineer is still doing the work. The stack is doing less of the wrong kind of work, so the engineer can do more of the right kind.

That's the shift. Not "AI replaces debugging." Not "observability gets automated." Just: understanding gets faster, and the weekend starts on time.

I'm building TraceRoot to explore this. Over the next few months I'll be writing about the specific design decisions — how to define an incident, why LLM summarization over raw logs fails until you fix the input shape, what storage architecture supports this workload, what the failure modes look like.

This post was the frame. The rest is the work.

If you've tried to solve the "reasoning over logs" problem — with LLMs, with rules, with anything — I'd genuinely like to hear what worked and what didn't. This is an area where everyone's still figuring it out, and I don't think there's a settled answer yet.

Semantic Search Is an Architecture Problem

Ozioma Ochin — Sat, 11 Apr 2026 23:32:19 +0000

Most semantic search systems don’t fail because of embeddings. They fail because of how the system is designed around them.

When I started building a semantic search API with Spring Boot and pgvector, I expected the hard parts to be vector math and database configuration. Generating embeddings and computing similarity felt like the core of the problem.

They weren’t.

The system worked. Documents were stored, embeddings were generated, and search returned results that looked reasonable.

But they weren’t reliable. Some queries felt off. Others looked correct but weren’t useful. Small changes in input produced disproportionately different results.

Nothing was obviously broken — but the system wasn’t behaving in a way I could trust.

Semantic search isn’t defined by how you generate embeddings. It’s defined by how your system structures data before embedding, enforces consistency during ingestion, and filters and ranks results after retrieval.

In other words, semantic search is an architecture problem.

This article breaks down five decisions that proved harder than expected while building a production-ready semantic search API in Java — and why they had more impact on search quality than the embedding model.

The full source code is on GitHub. Each lesson is grounded in that implementation.

Lesson 1 — Embeddings Don’t Solve Retrieval. They Define Its Boundaries.

The most common mistake in semantic search is treating embeddings as the solution to retrieval.

Embeddings don’t solve retrieval; they define the space in which retrieval happens. That space is shaped by decisions made before the embedding call runs.

In my implementation, embeddings are generated by combining the document’s title and content:

float[] embedding = embeddingClient.embed(title + "\n\n" + content);

At a glance, this looks like a simple implementation detail. It isn’t — it’s an architectural decision.

That single line determines how every document enters the vector space.

It defines what context is preserved, what signals are amplified, and what noise is introduced.

Once the embedding is created, those decisions are fixed — every query, similarity calculation, and ranking outcome depends on them.

This is where the gap between similarity and usefulness begins.

A document titled “Payment Failure Handling Policy” produces a stronger, more useful embedding when the title is included — it anchors the content with meaningful context. But a document titled “Notes” or “Draft” does the opposite. The model encodes both the generic label and the actual content, pulling the vector in competing directions.

The result is a system that returns technically similar results that aren’t useful.

The inconsistency comes from the input itself. Embeddings are the output of your ingestion architecture. If that architecture is inconsistent, your search results will be too.

What I'd do differently:

I would define the embedding input as a system-level contract before writing any embedding code.

Not a guideline. A contract.

Every document should enter the embedding pipeline in a consistent, explicitly defined format, regardless of who created it or when it was ingested. The structure — which fields are included, how they are ordered, and how they are separated — must be fixed and enforced at the ingestion boundary.

In practice, this means formalizing a canonical input shape and rejecting or transforming anything that doesn’t conform. Titles like “Notes” or “Draft” should not be treated as meaningful context; they should be excluded or normalized before embedding so they don’t distort the representation.

The benefit isn’t just cleaner vectors. It’s determinism.

When search quality degrades, the first question is always: what did the model actually see? Without a defined input contract, answering that requires tracing through every ingestion path and reconstructing the input post hoc. With a contract, the input is predictable, and the failure surface is narrow.
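A contract like this can be enforced by a single function that every ingestion path calls before embedding. The sketch below is one hypothetical shape for it: the class name EmbeddingInput and the list of generic titles are assumptions, and the separator matches the title + "\n\n" + content line shown earlier.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical canonical input builder: one place that decides what the model sees.
final class EmbeddingInput {
    // Assumed list of titles that carry no meaning; a real list would be curated.
    private static final Set<String> GENERIC_TITLES = Set.of("notes", "draft", "untitled");

    static String canonical(String title, String content) {
        if (content == null || content.isBlank()) {
            throw new IllegalArgumentException("content is required");
        }
        String body = content.trim();
        String cleanTitle = title == null ? "" : title.trim().replaceAll("\\s+", " ");

        // Generic or missing titles are dropped so they don't pull the vector off-topic.
        if (cleanTitle.isEmpty()
                || GENERIC_TITLES.contains(cleanTitle.toLowerCase(Locale.ROOT))) {
            return body;
        }
        return cleanTitle + "\n\n" + body;
    }

    public static void main(String[] args) {
        System.out.println(canonical("Draft", "Refunds are retried three times."));
        System.out.println(canonical("Payment Failure Handling Policy",
                "Refunds are retried three times."));
    }
}
```

With this in place, "what did the model actually see?" has one answer: whatever canonical returned for that document.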

Embedding quality is constrained by the structure of the input. Once vectors are generated, those decisions are fixed. Retrieval can only reflect what was decided upstream.

Takeaway: Embedding quality is bounded by the structure of the input, so consistency has to be enforced before the embedding call ever runs.

Lesson 2 — Search Quality Is Determined at Ingestion, Not at Query

Most efforts to improve semantic search focus on the query layer — thresholds, ranking logic, or retrieval strategies.

Search quality is determined long before a query is executed — by how data is structured, validated, and normalized at ingestion.

In the initial version of this system, the schema was intentionally flexible:

CREATE TABLE documents ( 
    id BIGSERIAL PRIMARY KEY, 
    title TEXT NOT NULL, 
    content TEXT NOT NULL, 
    metadata JSONB, 
    embedding VECTOR(1536) 
);

This made the system easy to use. It also allowed documents of different shapes to be embedded and retrieved using the same logic.

Short notes and long-form documents were encoded into the same vector space without regard for scale or structure. Metadata varied between objects, strings, and primitives.

The search layer operated correctly, but the data it operated on did not follow a consistent contract.

Similarity scores were valid, but not reliable.

This is where semantic search stops being a retrieval problem and becomes a data modeling problem. The query can only work with what the system has already allowed into the index.

What I'd do differently:

I would treat ingestion as a constrained interface, not a flexible one.

Every field should be validated against an explicit structural contract before it is stored or embedded. Metadata, in particular, should be restricted to a consistent shape — a JSON object with defined keys — rather than accepting any valid JSON representation.

That constraint belongs at the API boundary:

@JsonValidator 
private JsonNode metadata;
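The validator's implementation is not shown here. As a rough sketch of the kind of check it implies, the following uses plain Java types in place of JsonNode: a parsed JSON object is modelled as a Map, and anything else is rejected. The class name, method name, and allowed-key set are hypothetical.

```java
import java.util.Map;
import java.util.Set;

final class MetadataContract {
    private MetadataContract() {}

    /**
     * Accepts metadata only if it is a JSON object (modelled here as a Map)
     * whose keys are all in the allowed set. Null means "no metadata".
     */
    static void requireObjectWithKnownKeys(Object parsedMetadata, Set<String> allowedKeys) {
        if (parsedMetadata == null) {
            return;
        }
        if (!(parsedMetadata instanceof Map)) {
            throw new IllegalArgumentException(
                    "metadata must be a JSON object, got: "
                            + parsedMetadata.getClass().getSimpleName());
        }
        for (Object key : ((Map<?, ?>) parsedMetadata).keySet()) {
            if (!(key instanceof String) || !allowedKeys.contains(key)) {
                throw new IllegalArgumentException("unexpected metadata key: " + key);
            }
        }
    }
}
```

A bare string such as "category=billing" fails the first check, which is exactly the case the migration below had to repair.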

In this system, that validation arrived late. Before it existed, metadata was accepted as any valid JSON — including strings and primitives. The consequence was a V3 Flyway migration to repair data already in the database:

UPDATE documents
SET metadata = jsonb_build_object('raw', metadata)
WHERE metadata IS NOT NULL
  AND jsonb_typeof(metadata) = 'string';

A string stored as "category=billing" instead of {"category": "billing"} passes every JSON validation check and fails silently at query time — invisible to every filter that expects key-based access. That migration should never have been necessary. The validator at the API boundary was the fix — the migration was the cost of not having it earlier.

The boundary for enforcing correctness is the point of entry. By the time data reaches storage or embedding, it should already conform to the system’s expectations. Without that constraint, the system does not degrade gracefully — it drifts.

Takeaway: Retrieval quality is set at ingestion, not rescued at query time.

Lesson 3 — Systems Break at Boundaries, Not Within Components

Each component in a semantic search system can behave correctly in isolation. The database stores documents, the embedding layer generates vectors, and the search query returns results.

Failures appear at the boundaries when those components operate as a single system.

The initial version of the write path was straightforward:

generate the embedding → persist the document

If the embedding call failed — due to a timeout, rate limit, or unexpected response — the document was never written.

The system did not crash. It lost information.

The current implementation reverses that flow:

@Override
@Transactional
public CreateDocumentResponse create(CreateDocumentRequest request) {

    Document saved = saveAsPending(request);

    embedAndPersist(
            saved.getId(),
            saved.getTitle(),
            saved.getContent()
    );

    return new CreateDocumentResponse(
            saved.getId(),
            DocumentStatus.READY
    );
}

The document is written first, with an explicit lifecycle state. Embedding becomes a second step, not a prerequisite for persistence.

If embedding fails, the system records that failure:

UPDATE documents
SET status = 'FAILED',
    embedding_error = ?,
    embedding_updated_at = now()
WHERE id = ?;

Failures become part of the system’s state. This is the difference between a pipeline that appears to work and one that can be understood under failure.

What I'd do differently:

I would design the document lifecycle as a first-class system model before implementing any write path. A document should move through explicit, enforceable states, not implicit transitions tied to method execution.

In this system, those states are minimal but sufficient:

PENDING → READY (embedding succeeded)

PENDING → FAILED (embedding failed, error stored)

FAILED → PENDING (manual retry or background job)
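One way to make those transitions enforceable rather than implied is to encode them in the status type itself. This is a sketch, not the project's actual DocumentStatus; the enum name and methods are hypothetical, and it encodes only the three transitions listed above.

```java
import java.util.EnumSet;
import java.util.Set;

enum LifecycleState {
    PENDING, READY, FAILED;

    /** States this state may move to, per the three transitions listed above. */
    Set<LifecycleState> allowedNext() {
        switch (this) {
            case PENDING:
                return EnumSet.of(READY, FAILED);
            case FAILED:
                return EnumSet.of(PENDING);
            default:
                return EnumSet.noneOf(LifecycleState.class);
        }
    }

    /** Returns the next state, or throws if the transition is not allowed. */
    LifecycleState transitionTo(LifecycleState next) {
        if (!allowedNext().contains(next)) {
            throw new IllegalStateException("Illegal transition: " + this + " -> " + next);
        }
        return next;
    }
}
```

A system that also re-embeds on update would need READY → PENDING added to the table; the point is that the rule lives in one place.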

What matters is not the number of states, but what they guarantee.

A document should not be considered searchable unless its embedding has been written successfully. That constraint must be enforced at the query layer, not assumed at write time. Without it, partially processed data leaks into retrieval and produces inconsistent results or runtime failures.

WHERE status = 'READY'
  AND embedding IS NOT NULL

Failure must be inspectable. Recording the error alongside the document makes it possible to answer what failed and why.

The key mistake in the initial design was treating embedding as part of persistence. In practice, it is a separate stage with its own failure modes, latency characteristics, and operational risks. Collapsing those concerns into a single step removes the ability to reason about them.

A system that models its intermediate states explicitly can tolerate failure without losing visibility. One that doesn’t will appear correct until it isn’t — and offer no explanation when it fails.

Takeaway: Systems fail at stage boundaries, so lifecycle state has to be modeled explicitly rather than inferred from method flow.

Lesson 4 — Retrieval Is Easy. Ranking Is Where Systems Fail.

Once embeddings are generated and queries return results, it’s tempting to consider the system complete.

The retrieval layer works. A query is converted into a vector and similar documents are returned.

Vector similarity measures proximity in a vector space. It does not measure whether a result is useful to the person searching. A system can return results that are semantically close and still fail the only test that matters: relevance.

In this implementation, ranking does not exist as a separate component. It is embedded directly in the result-mapping path:

private SearchResultItem mapToSearchResultItem(
        ResultSet rs,
        int rowNum
) throws SQLException {

    double dist   = rs.getDouble("cosine_distance");
    double cosSim = 1.0 - dist;
    double score  = (cosSim + 1.0) / 2.0;

    return new SearchResultItem(
            rs.getLong("id"),
            rs.getString("title"),
            rs.getString("content"),
            readMetadataAsJsonNode(rs.getObject("metadata")),
            dist,
            cosSim,
            score
    );
}

The raw cosine_distance returned by pgvector is transformed into cosine_similarity and then normalized into a bounded score before the result leaves the system.

This makes the scoring model explicit and traceable. Every result carries the values that determined its position.
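To make the mapping concrete, here is the same arithmetic as two small functions, plus its inverse. The class and method names are hypothetical; the formulas are the ones used in mapToSearchResultItem.

```java
final class ScoreMath {
    private ScoreMath() {}

    /** Maps pgvector cosine distance (0..2) to a score in 0..1. */
    static double scoreFromDistance(double cosineDistance) {
        double cosineSimilarity = 1.0 - cosineDistance;
        return (cosineSimilarity + 1.0) / 2.0;
    }

    /** Inverse: the largest cosine distance that still satisfies score >= minScore. */
    static double maxDistanceFor(double minScore) {
        return 2.0 - 2.0 * minScore;
    }
}
```

A distance of 0 maps to a score of 1.0, a distance of 1 (orthogonal vectors) to 0.5, and a distance of 2 to 0. Read the other way, a minimum score of 0.80 admits cosine distances up to about 0.4, which is a cosine similarity of only 0.6.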

But these values do not define relevance. They describe closeness in vector space, not whether a result is worth showing. That judgment is left to a single request parameter, the minimum score:

@DecimalMin("0.0")
@DecimalMax("1.0")
private Double minScore;

Set it too low, and results are technically related but not useful. Set it too high, and relevant results are excluded.

At this point, semantic search stops being a retrieval problem and becomes a decision problem. The system is no longer asking what is similar — it is deciding what deserves to be shown.

What I'd do differently:

I would define a relevance benchmark before implementing any ranking logic.

Without a benchmark, ranking becomes reactive. Thresholds are adjusted based on observed outputs, but there is no stable way to determine whether those adjustments improve the system or simply change it.

A benchmark makes that distinction measurable. Each change to the ranking logic can be evaluated against the same queries, making it possible to tune the system deliberately rather than iteratively guessing.

A threshold of 0.65 might return all twenty expected results but include fifteen irrelevant ones. A threshold of 0.80 might eliminate the noise but miss six expected results.

Without a benchmark, you cannot measure that tradeoff — only observe that results feel better or worse.
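A benchmark does not need much machinery. The sketch below assumes each benchmark query comes with a hand-labelled set of expected document ids; the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class RelevanceBenchmark {
    private RelevanceBenchmark() {}

    /** Fraction of returned ids that were expected. */
    static double precision(Set<Long> returned, Set<Long> expected) {
        if (returned.isEmpty()) return 0.0;
        return (double) hits(returned, expected) / returned.size();
    }

    /** Fraction of expected ids that were returned. */
    static double recall(Set<Long> returned, Set<Long> expected) {
        if (expected.isEmpty()) return 0.0;
        return (double) hits(returned, expected) / expected.size();
    }

    private static int hits(Set<Long> returned, Set<Long> expected) {
        Set<Long> overlap = new HashSet<>(returned);
        overlap.retainAll(expected);
        return overlap.size();
    }
}
```

Applied to the hypothetical numbers above: the 0.65 threshold gives recall 1.0 and precision 20/35, roughly 0.57. The 0.80 threshold gives precision 1.0 and recall 14/20, or 0.70. Now the tradeoff is two numbers per threshold instead of a feeling.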

Takeaway: Similarity scores describe closeness; a benchmark is what makes ranking measurable.

Lesson 5 — The ORM Stops Helping Earlier Than You Expect in Vector Search.

The early version of the search layer expressed everything through Spring Data JPA. That worked for standard CRUD operations, and it felt natural to extend the same abstraction to search.

JPA is built for entity lifecycle operations and relational query patterns that map cleanly to objects. Vector search does not fit that model — it depends on database-specific operators, explicit casting, and query shapes that change at runtime.

In this system, JPA still has a clear role:

public interface DocumentRepository extends JpaRepository<Document, Long> {}

The search path is different. It uses raw SQL because the query depends on pgvector:

private static final String SQL_SEARCH_INNER = """
        SELECT
            id,
            title,
            content,
            metadata,
            (embedding <=> ?::vector) AS cosine_distance
        FROM documents
        WHERE status = 'READY'
          AND embedding IS NOT NULL
        """;

The query changes shape depending on filters, score thresholds, and pagination. That logic is handled directly in the service layer:

void applyFilters(Map<String, String> filters) {
    if (filters == null || filters.isEmpty()) return;

    for (Map.Entry<String, String> entry : filters.entrySet()) {
        String key = entry.getKey();

        if (key == null || !key.matches("^[a-zA-Z0-9_-]{1,64}$")) {
            throw new IllegalArgumentException(
                "Invalid metadata filter key: " + key
            );
        }

        sql.append("  AND (metadata->>'")
           .append(key)
           .append("') = ?\n");

        params.add(entry.getValue());
    }
}

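The filter key is appended directly into the SQL string. Identifiers such as column names cannot be bound as JDBC parameters. The key operand of ->> is an ordinary text value in PostgreSQL and could in principle be bound (for example metadata ->> ?::text), but this implementation inlines it, so the regex is the only thing standing between user input and the database.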

The subquery structure is necessary — not stylistic. PostgreSQL cannot reference a SELECT alias in the WHERE clause of the same query level. cosine_distance has to be resolved in a subquery before the score threshold can filter on it:

SELECT *
FROM (
    SELECT
        id,
        title,
        content,
        metadata,
        (embedding <=> ?::vector) AS cosine_distance
    FROM documents
    WHERE status = 'READY'
      AND embedding IS NOT NULL
) AS sub
WHERE (((1.0 - cosine_distance) + 1.0) / 2.0) >= ?
ORDER BY cosine_distance ASC
LIMIT ? OFFSET ?;

What I'd do differently:

I would define the persistence boundary earlier.

The temptation is to keep everything inside one abstraction for consistency.

In practice, that usually produces a worse result: repository interfaces with native queries, partial use of the ORM, and search logic split awkwardly across layers that were never designed to express it.

The common workaround — @NativeQuery annotations in the repository — produces the worst of both approaches: SQL strings embedded in JPA annotations, losing JPQL readability without gaining the flexibility of JdbcTemplate.

The better approach is to make the boundary explicit from the start. In this system, JPA owns entity lifecycle operations. The moment a query depends on vector operators, JSONB path access, runtime query construction, or database-specific casting, it moves into JDBC-backed SQL.

This is not a compromise — it is a cleaner design.

The ORM does not fail here. It reaches the edge of what it was built to model.

Takeaway: ORMs are useful until query behavior becomes database-specific; after that, forcing consistency across abstractions usually makes the design worse.

What This System Actually Is

At the beginning, I thought I was building a semantic search API.

As the system took shape, it became clear that result quality had little to do with how embeddings were generated. It depended on how data was structured before embedding and how results were filtered and ranked afterward. The system was deciding what counted as a useful result.

What I built wasn’t a feature behind an endpoint. It was a pipeline — one that transformed input, enforced constraints, and produced outcomes shaped by decisions at each stage.

How documents were structured at ingestion determined what the embedding layer could represent. How failures were captured determined what could be diagnosed. The lifecycle model defined what was eligible for retrieval. The ranking logic determined what users actually saw.

Search was the surface. Beneath it was a set of interdependent decisions that had to remain consistent.

That’s what made it hard. Not the vectors. Not the SQL. The design decisions that had to hold together.

The Service Layer: Where Separate Components Become a System

Ozioma Ochin — Sun, 05 Apr 2026 20:24:51 +0000

This is Part 4 of a series building a production-ready semantic search API with Java, Spring Boot, and pgvector.

Part 1 covered the architecture.

Part 2 defined the schema.

Part 3 handled the embeddings — how text becomes vectors.

Each piece worked in isolation.

But systems don't fail in isolation — they fail at the boundaries.

If you've ever built a feature that worked perfectly on its own but broke the moment you connected it to everything else — this article is about preventing that.

At this point, we have a schema that can store documents and an embedding layer that can generate vectors.

But nothing connects them. A document has nowhere to go. A query has no pipeline.

This is where the service layer comes in.

This is a production-style implementation — not a demo. The full project structure, tests, and configuration are available on GitHub.

What Does the Service Layer Actually Do?

The database stores state, but it doesn't understand it.

PENDING, READY, and FAILED only become meaningful once the service layer defines when those transitions happen and what triggers them.

When a document arrives, the service decides the order of operations — save first, embed second, update on success, record failure explicitly if something goes wrong.

Search follows the same pattern. A query doesn't go straight to the database. It's first converted into an embedding, then passed through a query that applies lifecycle constraints, metadata filters, and scoring thresholds.

The service layer controls that entire pipeline.

The service layer owns one thing: the rules that make the system predictable.

Without it, the system is just a collection of correct but disconnected components.

HTTP Request
     │
     ▼
Controller Layer       ← validates input, delegates to service
     │
     ▼
Service Layer          ← all decisions happen here
     │                    │
     ▼                    ▼
Repository Layer      Embedding Layer
(JPA + JdbcTemplate)  (EmbeddingClient interface)
     │                    │
     ▼                    ▼
PostgreSQL + pgvector  OpenAI API

The Interface That Keeps Everything Clean

The service layer exposes one interface to the rest of the application:

public interface DocumentService {
    CreateDocumentResponse create(CreateDocumentRequest request);
    DocumentResponse getById(Long id);
    SearchResponse search(SearchRequest request);
}

Controllers depend on the interface, not the implementation.

Defining the contract as an interface and hiding the implementation behind it is what makes the system testable and changeable without cascading updates across the codebase.

The more important detail is what does not cross this boundary.

The Document entity never crosses this boundary — by design. Controllers receive DTOs, not persistence objects.

That separation means the database schema and the API contract can evolve independently. The schema can change without breaking clients. The API can change without rewriting persistence logic.

Why this matters to you: If you've ever had a database change break your API — or an API change force a database rewrite — this boundary is what prevents that. Define it early and hold it firmly.

What Happens When Embedding Fails?

From the outside, creating a document looks simple. Send a document, get an ID back.

Inside the service, everything is built around one assumption: the second step might fail.

@Override
@Transactional
public CreateDocumentResponse create(CreateDocumentRequest request) {

    Document saved = saveAsPending(request);

    embedAndPersist(
            saved.getId(),
            saved.getTitle(),
            saved.getContent()
    );

    return new CreateDocumentResponse(
            saved.getId(),
            DocumentStatus.READY
    );
}

Two lines, two distinct operations.

The first saves the document immediately with a status of PENDING.

The document exists in the database before any embedding call is made.

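If the application crashes at this point, the document is already there in a recoverable state, provided the PENDING insert has actually been committed. With @Transactional on create as shown, the writes inside the method share one transaction by default, so the PENDING insert and any later failure record need to commit independently of it for that guarantee to hold.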

The second calls the OpenAI API, generates the embedding, and updates the document to READY.

If this step fails, the document moves to FAILED instead, and the error is stored directly in the database.

POST /documents
      │
      ▼
saveAsPending()
status = PENDING ← document is safe in the database
      │
      ▼
embedAndPersist()
      │
   ┌──┴──────────────┐
   │                 │
   ▼                 ▼
status = READY   status = FAILED
searchable       error stored in DB
                 excluded from search

There's an alternative that looks simpler — embed first, then save.

It removes a step but removes visibility. If embedding fails in that model, the document never exists. There's no record, no state, nothing to debug.

By saving first, every attempt leaves a trace.

Failures don't disappear.

They become data.

This pattern — save first, embed second — is the difference between a failure you can debug and one that just disappears.
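Stripped of Spring and the database, the pattern can be sketched in memory. Everything here is a hypothetical stand-in: the Embedder interface, the Row class, and the map that plays the role of the documents table. Unlike the real method, this sketch swallows the exception instead of re-throwing it, to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

final class SaveFirstSketch {

    /** Hypothetical stand-in for the embedding call; may throw on timeout or rate limit. */
    interface Embedder {
        float[] embed(String text) throws Exception;
    }

    /** In-memory stand-in for a documents row. */
    static final class Row {
        String status = "PENDING";
        String embeddingError;
        float[] embedding;
    }

    private final Map<Long, Row> table = new HashMap<>();
    private long nextId = 1;

    /** Step 1 always happens; step 2 decides between READY and FAILED. */
    long create(String title, String content, Embedder embedder) {
        long id = nextId++;
        Row row = new Row();
        table.put(id, row); // saved as PENDING before any embedding call
        try {
            row.embedding = embedder.embed(title + "\n\n" + content);
            row.status = "READY";
        } catch (Exception e) {
            row.status = "FAILED"; // the failure becomes data
            row.embeddingError = e.getMessage();
        }
        return id;
    }

    Row find(long id) {
        return table.get(id);
    }
}
```

Whether the embedder succeeds or throws, find(id) returns a row. That is the whole point of the ordering.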

Here's how the failure handling actually works:

private void embedAndPersist(Long documentId, String title, String content) {
    try {
        float[] embedding = embeddingClient.embed(title + "\n\n" + content);
        int updated = jdbcTemplate.update(SQL_UPDATE_EMBEDDING,
                toPgVectorLiteral(embedding), documentId);
        if (updated != 1) {
            throw new IllegalStateException(
                    "Unexpected row count updating embedding for document id=" + documentId);
        }
    } catch (IllegalStateException e) {
        throw e;
    } catch (Exception e) {
        markFailed(documentId, e.getMessage());
        throw new RuntimeException("Embedding failed for document id=" + documentId, e);
    }
}

Three decisions here worth understanding:

Title and content are concatenated for embedding. title + "\n\n" + content gives the model full context. A document titled "Payment Failure Handling Policy" with content about retry logic produces a richer embedding than the content alone.
IllegalStateException is re-thrown unchanged. If the update affects zero or more than one row, something is wrong with the database state — not the embedding call. That error should propagate as-is rather than being wrapped as an embedding failure.
Everything else triggers markFailed. Network timeouts, rate limits, malformed responses — any exception that isn't an IllegalStateException records the failure and re-throws. The caller sees the failure. The database gets a record of what went wrong.

Most API integration failures are silent. This makes them loud.

Search — The Pipeline That Ties Everything Together

Search is the most complex operation in the service. It touches the embedding layer, the repository, and the database — and it has to coordinate all three correctly.

What makes it manageable is not reducing that complexity, but containing it deliberately.

The orchestration method is deliberately small:

@Override
public SearchResponse search(SearchRequest request) {

    String qVector = embedQuery(request.getQuery());

    List<SearchResultItem> items = fetchResults(
            request,
            qVector
    );

    int total = countResults(
            qVector,
            request.getFilters(),
            request.getMinScore()
    );

    return new SearchResponse(
            request.getPage(),
            request.getSize(),
            total,
            items
    );
}

Four lines. Each delegates to a private method with a clear name.

The method reads like a description of the search process — embed the query, fetch the results, count the total, return the response.

The how is pushed down into methods that can be reasoned about in isolation.

private String embedQuery(String query) {
    return toPgVectorLiteral(embeddingClient.embed(query));
}

The query goes through the same embedding client used for documents.

That symmetry matters — the query and the stored documents exist in the same vector space. Without it, similarity search would be meaningless.
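The toPgVectorLiteral helper is not shown in this article. One plausible shape, assuming the embedding arrives as a float[] and is bound as text with a ?::vector cast, is a function that renders pgvector's bracketed text form. The class and method names below are hypothetical.

```java
final class PgVectorLiteral {
    private PgVectorLiteral() {}

    /** Renders a float array in pgvector's text form, e.g. [0.5,-1.0,2.0]. */
    static String of(float[] values) {
        if (values == null || values.length == 0) {
            throw new IllegalArgumentException("embedding must not be empty");
        }
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(values[i]);
        }
        return sb.append(']').toString();
    }
}
```

Using the same function for documents and queries is one more place where the symmetry is enforced rather than assumed.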

The SQL is constructed in two layers: the inner query selects candidates and computes similarity, while the outer query applies score thresholds and pagination.

The split isn't stylistic. PostgreSQL cannot reference a SELECT alias in a WHERE clause at the same query level — which is why cosine_distance must be resolved in a subquery before the score threshold can filter on it.

SELECT * FROM (
    SELECT id, title, content, metadata,
           (embedding <=> ?::vector) AS cosine_distance
    FROM documents
    WHERE status = 'READY'
      AND embedding IS NOT NULL
      AND (metadata->>'category') = ?
) AS sub
WHERE (((1.0 - cosine_distance) + 1.0) / 2.0) >= ?
ORDER BY cosine_distance ASC
LIMIT ? OFFSET ?;

If you've ever wondered why your JPA queries feel limiting for complex use cases — this is where you cross that line deliberately.

Why JPA Isn’t Enough for Vector Search

The search query isn't static.

Metadata filters, score thresholds, and pagination all change the SQL at runtime.

At that point the abstraction provided by JPA starts to break down — you're no longer mapping objects, you're constructing a query.

That's where QueryBuilder comes in:

private static class QueryBuilder {

   private final StringBuilder sql;
   private final List<Object> params = new ArrayList<>();

   QueryBuilder(String baseSql, String firstParam) {
       this.sql = new StringBuilder(baseSql);
       this.params.add(firstParam);
   }

   QueryBuilder(String baseSql, QueryBuilder source) {
       this.sql = new StringBuilder(baseSql);
       this.params.addAll(source.params);
   }
}

The two constructors mirror the structure of the query – inner and outer.

The first builds the inner query.

The second builds the outer query, inheriting parameters from the inner one without tracking them manually.

Where injection risk actually lives:

void applyFilters(Map<String, String> filters) {
   if (filters == null || filters.isEmpty()) return;

   for (Map.Entry<String, String> entry : filters.entrySet()) {
       String key = entry.getKey();

       if (key == null || !key.matches("^[a-zA-Z0-9_-]{1,64}$")) {
           throw new IllegalArgumentException("Invalid metadata filter key: " + key);
       }

       sql.append("  AND (metadata->>'").append(key).append("') = ?\n");
       params.add(entry.getValue());
   }
}

The filter key is appended directly into the SQL string. Identifiers such as column names cannot be bound as placeholders, and although PostgreSQL would accept the ->> key as a bound text value, this implementation inlines it, which means this is where injection risk enters the system.

The regex is not a convenience. It is the only control point between user input and the database.

^[a-zA-Z0-9_-]{1,64}$ — only alphanumeric characters, underscores, and hyphens.

Anything else is rejected before it reaches the database. Filter values, on the other hand, always go through JDBC parameters and are safe regardless of input.

This split — validated keys, parameterised values — is what makes the query both flexible and secure.

This is one of those cases where the 'boring' regex is doing serious security work. Don't skip it.
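For comparison, here is a sketch of a variant that binds the key as well. It relies on the fact that the right-hand operand of PostgreSQL's ->> is a text value rather than an identifier, so it can be passed as a parameter. This is not what the project does; the class name is hypothetical, and the regex is kept only as a check on key shape.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

final class BoundKeyFilterSketch {
    private static final Pattern KEY = Pattern.compile("^[a-zA-Z0-9_-]{1,64}$");

    final StringBuilder sql = new StringBuilder();
    final List<Object> params = new ArrayList<>();

    void applyFilters(Map<String, String> filters) {
        if (filters == null || filters.isEmpty()) return;

        for (Map.Entry<String, String> entry : filters.entrySet()) {
            String key = entry.getKey();

            // Still validated, but no longer the sole barrier against injection.
            if (key == null || !KEY.matcher(key).matches()) {
                throw new IllegalArgumentException("Invalid metadata filter key: " + key);
            }

            // Both the key and the value travel as bind parameters.
            sql.append("  AND (metadata ->> ?::text) = ?\n");
            params.add(key);
            params.add(entry.getValue());
        }
    }
}
```

The trade-off is that each filter now contributes two parameters instead of one, so parameter order has to be tracked with the same care the inner and outer queries already require.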

Key validation handles injection risk. The other challenge in query construction is where to apply the score threshold.

Score filtering is applied on the outer query — not the inner one. cosine_distance is defined in the inner query's SELECT clause.

PostgreSQL cannot reference that alias in a WHERE clause at the same level. Wrapping it as a subquery makes it a real column in the outer scope — which is what allows minScore to work at all.

This is the point where you stop “using an ORM” and start designing queries deliberately.

Updating a Document Means Updating Its Embedding Too

Updating a document is not the same as updating a database row.

When content changes, the stored embedding becomes stale. A document about "payment retry logic" gets updated to "refund processing."

But the embedding still points toward payment retries. Searches for "refund policy" would miss it. Searches for "payment retries" would still find it — incorrectly.

The update operation handles this explicitly:

private void applyUpdates(Document doc, UpdateDocumentRequest request) {
    doc.setTitle(request.getTitle());
    doc.setContent(request.getContent());
    doc.setMetadata(request.getMetadata());
    doc.setStatus(DocumentStatus.PENDING);
    doc.setEmbeddingError(null);
    documentRepository.save(doc);
}

The moment content changes, the embedding becomes invalid.

The system makes that explicit by resetting the document to PENDING, removing it from search until a new embedding is generated.

This trades availability for correctness — a document disappearing briefly is preferable to returning incorrect results.

In the surrounding update method, which is not shown here, findOrThrow is called again after embedAndPersist so the response reflects the document's final state (including the updated status and embeddingUpdatedAt timestamp), not the state before the embedding ran.

This is easy to miss when you first build it. If a document update doesn't trigger a re-embed, your search results will silently drift out of sync with your content.

One Place for All Your Errors

Errors in this system fall into two categories — errors the caller caused and errors the system encountered.

Those two cases should not look the same.

A missing document returns a 404. Invalid input returns a 400. An embedding failure returns a 500.

What matters more than the distinction is consistency — every error, regardless of where it originates, returns the same shape:

{
  "code": "NOT_FOUND",
  "message": "Document not found: 42"
}

That consistency is enforced in one place — GlobalExceptionHandler.

@RestControllerAdvice
public class GlobalExceptionHandler {

    @ExceptionHandler(ResourceNotFoundException.class)
    public ResponseEntity<ErrorResponse> handleNotFound(
            ResourceNotFoundException ex
    ) {
        return ResponseEntity.status(404)
                .body(new ErrorResponse(
                        "NOT_FOUND",
                        ex.getMessage()
                ));
    }

    @ExceptionHandler(MethodArgumentNotValidException.class)
    public ResponseEntity<ErrorResponse> handleValidation(
            MethodArgumentNotValidException ex
    ) {
        String message = ex.getBindingResult()
                .getFieldErrors()
                .stream()
                .map(e -> e.getField() + ": " + e.getDefaultMessage())
                .collect(Collectors.joining(", "));

        return ResponseEntity.status(400)
                .body(new ErrorResponse(
                        "VALIDATION_ERROR",
                        message
                ));
    }

    @ExceptionHandler(Exception.class)
    public ResponseEntity<ErrorResponse> handleGeneral(
            Exception ex
    ) {
        return ResponseEntity.status(500)
                .body(new ErrorResponse(
                        "INTERNAL_ERROR",
                        "An unexpected error occurred"
                ));
    }
}

The @RestControllerAdvice annotation makes it active across all controllers without being wired into any of them.

The service layer throws exceptions. The handler translates them. The controllers never see error handling code.

A client that always receives code and message can handle all errors with one piece of logic.

A client that receives different shapes from different endpoints has to handle each one separately.

One handler, consistent responses everywhere — your frontend team will thank you.

How the Lifecycle Keeps Bad Data Out of Search

The document lifecycle isn't just about tracking failures. It's what keeps invalid data out of search results entirely.

Every search query filters on two conditions before any similarity calculation runs:


WHERE status = 'READY'
  AND embedding IS NOT NULL

A PENDING document is excluded. A FAILED document is excluded.

This is where the schema design from Part 2 pays off — the composite index on (status, created_at DESC) exists specifically to support this filtering pattern.

Without it, every search has to scan the full table and discard non-ready documents. With it, PostgreSQL can go straight to the READY rows whenever the planner judges the index selective enough to be worth using.

PENDING ──────────────────────────────┐
   │                                  │
   ▼                                  │
embedAndPersist()                     │
   │                                  │
┌──┴──────────────┐                   │
│                 │                   │
▼                 ▼                   ▼
READY          FAILED            not searchable
searchable     error in DB
               not searchable

The lifecycle isn't just about correctness. It's a performance optimization.

If you've ever had stale or incomplete data show up in search results with no explanation — a lifecycle model like this is what prevents it.

The System Now Works

With the service layer in place, the system finally behaves like a system.

A document arrives at POST /documents. The controller validates the request and delegates to the service.

The service saves the document as PENDING, calls the embedding client, and updates the status to READY.

The document is now stored with a valid embedding and visible to search.

A search query arrives at POST /search.

The service embeds the query, builds the SQL dynamically through QueryBuilder, applies filters and score thresholds, and returns ranked results with three score fields — cosineDistance, cosineSimilarity, and score.

Every layer has exactly one job. Every failure is visible. Every response has a consistent shape.

The system that started as a schema and an embedding client in Part 1 is now a complete, working API.

What's Next

The service layer completes the system. Everything now works end to end.

But working systems still have flaws.

In the next article, I’ll step back from the implementation and break down what this system gets right, what it gets wrong, and what I would change if I were to build it again.

See you there.

Why Most Developers Reach for a Vector Database Too Soon.

Ozioma Ochin — Sat, 28 Mar 2026 23:41:26 +0000

Most semantic search tutorials start the same way: add a vector database.

The feature request sounded simple: type a question, get the right internal doc back.

A few hundred documents. Support notes and wiki pages.

Nothing exotic. The kind of thing that should take a week, maybe less.

They did what most of us would do today.

They watched a couple of LangChain tutorials, skimmed the OpenAI docs, and followed the same architecture every example seemed to use.

Documents were chunked, embeddings generated, and everything went into a hosted vector database.

An ingestion pipeline kept the index in sync.

Queries hit the vector store first, then the app database.

It looked like the modern, correct way to build search.

Three weeks later, the feature worked — technically.

But updating a single document meant re-running the embedding pipeline.

The vector index and the app database could drift out of sync silently.

Even running the thing locally required API keys.

Every deployment waited on background indexing to finish before results were reliable.

The system was fragile in ways that would keep compounding.

A Postgres full-text search would have solved the original problem in an afternoon.

The vector database wasn't wrong. It was just answering a question nobody had asked yet.

This article is about how to ask the right question before you start building — and what the answer looks like in practice.

What a Vector Database Is Actually For

Most developers working with embeddings already know what a vector database does.

Fewer stop to ask whether that specific capability is what their problem actually requires.

Before arguing when not to reach for one, it's worth being precise about what the tool is actually built for.

When you generate embeddings for text, images, or other data, you end up with arrays of floating-point numbers.

Finding the most similar item means comparing one vector against many others.

For small datasets, you can do this with a simple scan.

As the number of vectors grows, brute-force comparison becomes too slow, and you need specialized indexes designed for approximate nearest neighbor search.

That’s the problem vector databases are optimized to solve.
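For a sense of what the "simple scan" looks like in practice, here is a minimal Java sketch of brute-force nearest-neighbour search. The class and method names are made up for illustration, and it assumes non-zero vectors of equal dimension.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Brute-force nearest-neighbour search: compare the query against every stored vector. */
class BruteForceSearch {

    /** Cosine distance = 1 - cosine similarity; lower means more similar. */
    static double cosineDistance(float[] a, float[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the indexes of the k stored vectors closest to the query, nearest first. */
    static int[] topK(float[][] stored, float[] query, int k) {
        Integer[] order = new Integer[stored.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Every stored vector is compared against the query: cheap for a few
        // hundred rows, and the cost that ANN indexes exist to avoid at millions.
        Arrays.sort(order, Comparator.comparingDouble(i -> cosineDistance(stored[i], query)));

        int[] result = new int[Math.min(k, order.length)];
        for (int i = 0; i < result.length; i++) {
            result[i] = order[i];
        }
        return result;
    }
}
```

The work grows linearly with the number of stored vectors, which is why this approach is fine for small datasets and becomes the bottleneck for large ones.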

Under the hood, most of them rely on approximate nearest neighbor algorithms.

HNSW for graph-based search.

IVF for cluster-based partitioning.

They trade a small amount of recall accuracy for dramatically faster queries.

For semantic search, that trade-off is almost always acceptable — you don't need the single most similar document, you need several good ones, fast.

pgvector exposes this same choice directly in SQL — the query is identical with or without the index:

-- Without IVFFlat - PostgreSQL scans every row
SELECT content,
       embedding <=> query_vector AS distance
FROM documents
ORDER BY distance
LIMIT 5;

-- with IVFFlat — PostgreSQL searches only relevant clusters
-- same query, dramatically different performance at scale
SELECT content,
       embedding <=> query_vector AS distance
FROM documents
ORDER BY distance
LIMIT 5;

The query looks identical — the difference is entirely in the index. This is the performance decision pgvector hands back to you.

Everything else people associate with vector databases — metadata filtering, hybrid search, multi-tenant indexes, reranking — sits on top of that core capability.

Here's where the confusion starts.

The term "vector database" bundles several distinct concerns — storing embeddings, searching them, filtering results, and running the infrastructure — into what looks like a single decision.

The tooling reinforces it.

When every tutorial wires all four together in the same five lines of code, it stops looking like a choice and starts looking like a requirement.

As soon as a project involves embeddings, it can seem like a dedicated vector database is the only correct design.

It isn’t.

Embeddings are just data.

They can live in Postgres, SQLite, or even memory.

A vector database becomes the right tool when approximate nearest neighbor search is the bottleneck — not when embeddings first appear in the architecture.

Until that point, it’s often extra complexity you don’t need.

Why Developers Reach for It Too Early

1. Tutorial monoculture

Most examples of semantic search, RAG, or LLM-powered features follow the same pattern: chunk documents, generate embeddings, store them in a vector database, query with similarity search.

LangChain demos do it. LlamaIndex demos do it. OpenAI examples do it.

2. The scalability trap

Once embeddings enter the design, it’s easy to assume the system will eventually need fast similarity search at scale, so the vector database gets added early to avoid rewriting things later.

This is the same instinct that leads teams to introduce Kafka for a service that sends ten emails a day. The future problem might be real.

But solving it before it exists adds complexity immediately, with no corresponding benefit.

3. Tooling and marketing

Modern vector databases have excellent documentation, polished SDKs, and tutorials that get you from zero to similarity search in under an hour.

That ease of setup is genuinely impressive — and it's also exactly what makes the tool feel mandatory before you've decided whether you need it.

Great onboarding has a way of skipping the step where you ask whether you should be onboarding at all.

Developers don't reach for vector databases too early because they don't understand the technology.

They do it because the ecosystem makes it look like the obvious first step.

At some point, the vector database became the new Redis of the AI stack - added by default, before anyone confirmed it was actually needed.

The result isn't broken systems. It's systems that are harder to run, slower to change, and more expensive to maintain than the problem ever required. The complexity arrives on day one. The scale that would justify it may never come.

The Simpler Stack You’re Ignoring

Two tools get overlooked almost every time: pgvector, which runs inside the Postgres instance you already have, and plain keyword search, which still solves more problems than people want to admit.

1. pgvector — When Your Database Is Already Postgres

If your application already runs on PostgreSQL, as many do, adding pgvector gives you similarity search without introducing a new service.

No new deployment. No additional failure mode.

pgvector adds a VECTOR column type and similarity operators directly to PostgreSQL.

Embeddings live alongside the rest of your data, queryable with SQL, inside the same transactional system your application already depends on.

No new monitoring, no separate backups, no second system to explain to the next engineer on the team.

The setup starts with enabling the extension and creating the table:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id                   BIGSERIAL PRIMARY KEY,
    title                TEXT NOT NULL,
    content              TEXT NOT NULL,
    metadata             JSONB,
    embedding            VECTOR(1536),
    status               TEXT NOT NULL DEFAULT 'READY',
    embedding_error      TEXT,
    embedding_updated_at TIMESTAMPTZ,
    created_at           TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at           TIMESTAMPTZ NOT NULL DEFAULT now()
);

The status column is worth noting.

Because embedding is an external API call that can fail, documents move through a lifecycle — PENDING when first saved, READY once the embedding succeeds, FAILED if the API returns an error.

This means a failed embedding never silently corrupts search results. The status is always visible in the database.
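The lifecycle can be sketched in a few lines of Java. This is an illustrative in-memory version, not the repository's service code: the Document class and the embedAndUpdate method are hypothetical names, and the embedder is passed in as a plain function so the sketch runs without any HTTP call.

```java
import java.util.function.Function;

class DocumentLifecycle {

    enum Status { PENDING, READY, FAILED }

    /** Hypothetical in-memory stand-in for a row in the documents table. */
    static class Document {
        final String content;
        Status status = Status.PENDING; // saved as PENDING before the API call
        float[] embedding;
        String embeddingError;

        Document(String content) {
            this.content = content;
        }
    }

    /** Runs the embedding step and records the outcome on the document. */
    static Document embedAndUpdate(Document doc, Function<String, float[]> embedder) {
        try {
            doc.embedding = embedder.apply(doc.content);
            doc.embeddingError = null;
            doc.status = Status.READY;
        } catch (RuntimeException e) {
            // The failure is recorded rather than swallowed, so a search that
            // filters on READY never sees a half-embedded document.
            doc.embedding = null;
            doc.embeddingError = e.getMessage();
            doc.status = Status.FAILED;
        }
        return doc;
    }
}
```

The point of the design is that every outcome of the external call leaves a trace in the row itself: a status, and an error message when there is one.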

For similarity search to scale beyond a few thousand documents, pgvector needs an index. The IVFFlat index is what makes this production-ready:

CREATE INDEX documents_embedding_ivfflat_idx 
    ON documents USING ivfflat (embedding vector_cosine_ops) 
    WITH (lists = 100);

With the index in place, a single SQL query can combine vector similarity search with lifecycle filtering and metadata filtering: order by embedding <=> the query vector, filter on status = 'READY', and add a condition on the metadata column.

With a separate vector store, combining those concerns with relational data often means application-level joins or multiple round trips.

Here, everything runs inside one query, in one database, with full ACID guarantees.

The <=> operator is pgvector's cosine distance operator. It returns a value between 0 and 2 — lower means more similar. Results are ordered ascending so the closest matches come first.
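To make the 0 to 2 range concrete, here is a small Java sketch of the same formula, cosine distance = 1 - cosine similarity. It is a plain reimplementation for illustration with made-up two-dimensional vectors, not pgvector's own code.

```java
class CosineDistanceRange {

    /** 1 - (a . b) / (|a| * |b|), assuming non-zero vectors of equal dimension. */
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] v = {1.0, 0.0};
        System.out.println(cosineDistance(v, new double[] {2.0, 0.0}));  // same direction
        System.out.println(cosineDistance(v, new double[] {0.0, 1.0}));  // orthogonal
        System.out.println(cosineDistance(v, new double[] {-3.0, 0.0})); // opposite direction
    }
}
```

Vectors pointing the same way sit at distance 0, orthogonal vectors at 1, and opposite vectors at 2. Subtracting the distance from 1 gives the cosine similarity back.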

For most products, this goes further than people expect. With a suitable index, pgvector can typically serve millions of vectors without meaningful performance degradation for common query patterns.

If you're building an internal tool, a document search API, or a RAG feature for a product that isn't at serious scale yet, you're almost certainly in that range.

There are real limits. pgvector won't give you distributed indexing, automatic sharding, or sub-10ms latency under very high query volume.

If you're storing tens of millions of vectors and serving high-QPS queries, a dedicated vector database will outperform it.

But by the time you reach that point, you'll know exactly why you're making the switch — because you'll have measured the problem, not imagined it.

The full implementation — including all three Flyway migrations, the IVFFlat index configuration, lifecycle tracking, and metadata filtering — is available on GitHub.

You don't need a new database to add semantic search. You need pgvector and a migration file.

2. BM25 and Keyword Search — The Tool Nobody Wants to Admit Still Works

Before you generate a single embedding, it's worth asking whether your users actually need semantic search — or whether they just need search that works.

A lot of features labeled “AI search” are really just keyword lookup with better marketing.

If your users know the words they’re looking for, traditional full-text search is often faster, simpler, and more predictable than embeddings.

BM25-based search — the ranking algorithm used by most full-text engines — is extremely good at matching short, precise queries.

-- standard PostgreSQL full-text search — no new infrastructure needed
SELECT title, content,
       ts_rank(to_tsvector('english', content),
               plainto_tsquery('english', 'reset password')) AS rank
FROM documents
WHERE to_tsvector('english', content)
      @@ plainto_tsquery('english', 'reset password')
ORDER BY rank DESC
LIMIT 5;

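This runs inside the same PostgreSQL instance as your pgvector queries, with no new service and no new failure mode. Note that ts_rank is PostgreSQL's built-in ranking function, based on term frequency rather than BM25, but for short, precise queries it plays the same role.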

Searches like “reset password”, “invoice template”, or a specific error message often perform better with keyword scoring than with vector similarity.

In domains with strict terminology — legal references, product codes, medical terms — exact matches matter more than semantic closeness.

Embeddings shine when meaning matters more than wording. If users are asking “show me something like this” or “what document explains this idea”, vector search makes sense.

If they’re typing the name of the thing they want, it usually doesn’t.

You also don’t have to choose one or the other. Postgres supports full-text search, pgvector supports similarity search, and combining the two often gives better results than either alone.

A hybrid query looks like this — no new infrastructure, no new service:

SELECT id, title, content, 
        embedding <=> ?::vector AS cosine_distance, 
        ts_rank( 
            to_tsvector('english', content), 
            plainto_tsquery('english', ?) 
        ) AS text_rank 
FROM documents 
WHERE status = 'READY' 
    AND embedding IS NOT NULL 
ORDER BY
    (embedding <=> ?::vector) * 0.7
    - ts_rank(
        to_tsvector('english', content),
        plainto_tsquery('english', ?)
    ) * 0.3
ASC
LIMIT 10;

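This query blends the two signals into one score: cosine distance weighted at 0.7, minus keyword rank weighted at 0.3, sorted ascending so that semantically closer and better-matching documents come first. The weights are tuning knobs, not fixed values, and none of this adds new infrastructure.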
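The ORDER BY expression in the query above is easier to reason about with numbers plugged in. The Java sketch below reproduces the same arithmetic on made-up candidates; the Candidate class and the sample values are hypothetical, and the 0.7 and 0.3 weights are simply the ones from the query.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class HybridRanking {

    /** Hypothetical candidate row: both signals already computed for one document. */
    static class Candidate {
        final String title;
        final double cosineDistance; // lower is better
        final double textRank;       // higher is better

        Candidate(String title, double cosineDistance, double textRank) {
            this.title = title;
            this.cosineDistance = cosineDistance;
            this.textRank = textRank;
        }

        /** Same blend as the ORDER BY expression: lower is better. */
        double blendedScore() {
            return cosineDistance * 0.7 - textRank * 0.3;
        }
    }

    /** Returns a new list sorted ascending by blended score, best match first. */
    static List<Candidate> rank(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble(Candidate::blendedScore));
        return sorted;
    }
}
```

A document that is slightly further away semantically can still rank first if it matches the keywords well. In a real system the two signals live on different scales, so the weights would likely need tuning against actual queries.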

Before adding a vector database, answer a simpler question first: can keyword search solve 80% of this?

If the answer is yes, start there. You can always add embeddings later. You can't easily remove infrastructure you didn't need.

When You Actually Need a Vector Database

Vector databases aren't the villain here. They're the right tool when similarity search becomes a real, measured performance problem — not a projected one. The question is how to recognize that moment before you've already over-built.

These are the thresholds where teams consistently start feeling the limits of a general-purpose setup:

Signal                    Typical threshold
Vector count              Millions of embeddings
Query latency             Sub-50 ms p99
Filtering complexity      Multi-tenant filters
Query volume              High QPS
Infrastructure maturity   Dedicated team
Use case                  Recommendation, RAG, personalization

The numbers aren’t rules. They’re patterns. The real signal is when the simple approach stops being simple.

If your search feature is core to the product and needs predictable latency under load, the tradeoffs of a dedicated vector database start to make sense.

You know you need a vector database when brute-force similarity becomes your actual bottleneck — not your imagined one.

A Simple Decision Flowchart

Most projects don't need a new datastore. They need a clear decision process.

If you're starting a new feature and asking whether a vector database belongs in the design, the flowchart below maps the decision from 'I need search' to the right tool for your current scale.

[Flowchart: a simple way to think about the decision]

The flowchart won't cover every edge case — no decision tool does.

But if your design jumps straight to a managed vector database before working through these questions, you're probably solving a scaling problem you don't have yet.

The cost of that mistake shows up slowly, in complexity that compounds before the scale ever arrives.
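One possible reading of the questions raised in this article, expressed as code. This is a sketch assembled from the prose, not a transcription of the flowchart; the method name, the enum, and the two boolean inputs are assumptions made for illustration.

```java
class SearchToolDecision {

    enum Tool { FULL_TEXT_SEARCH, PGVECTOR, DEDICATED_VECTOR_DATABASE }

    /**
     * @param keywordSearchCoversMostQueries       users mostly type the words they are looking for
     * @param similaritySearchIsMeasuredBottleneck similarity search has been measured, not guessed, to be the limit
     */
    static Tool choose(boolean keywordSearchCoversMostQueries,
                       boolean similaritySearchIsMeasuredBottleneck) {
        if (keywordSearchCoversMostQueries) {
            return Tool.FULL_TEXT_SEARCH; // start simple; embeddings can be added later
        }
        if (!similaritySearchIsMeasuredBottleneck) {
            return Tool.PGVECTOR; // semantic search inside the existing database
        }
        return Tool.DEDICATED_VECTOR_DATABASE; // only once the limit is real
    }
}
```

The order of the checks carries the argument: the dedicated datastore is the last branch, reached only after the simpler options have been ruled out by evidence.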

Architecture Should Follow the Problem

The best architecture isn't the most modern one.

It's the one that matches the problem you actually have, at the scale you're actually at — maintained by the team you actually have, not the one you might hire later.

Vector databases are powerful tools, but they come with real operational cost — another service to run, another datastore to keep in sync, another place where performance and correctness can drift apart.

That cost only makes sense when the problem demands it. Before that point, simpler designs are usually easier to build, easier to reason about, and easier to change when requirements shift.

Starting with pgvector or full-text search doesn’t lock you in. If you outgrow it, the path to a dedicated vector database is well understood. The reverse is harder.

Removing infrastructure you didn’t need is almost always more work than adding it later.

The full pgvector implementation, including schemas, index configuration, and the search query shown above, is available on GitHub.

Most systems don’t fail because they chose the wrong tool. They fail because they chose the right tool too early.

The real skill isn’t knowing how to use a vector database. It’s knowing when not to.

Building a Semantic Search API with Spring Boot and pgvector - Part 3: The Embedding Layer.

Ozioma Ochin — Fri, 20 Mar 2026 19:24:28 +0000

Most semantic search tutorials treat embeddings as a single line of code — call the API, get a vector, store it.

In practice, this is the part of the system where the most subtle bugs live. Not the kind that throw exceptions, but the kind that silently produce wrong similarity scores, wrong rankings, and search results that look correct but feel off.

When I first built this service, I expected the difficult parts to be the database schema and the search query. Instead, most of the time went into the embedding layer. Small mistakes here don’t crash the application. They just make search behave strangely.

Three things make this layer trickier than it looks.

First, the API call is external. It can fail because of network issues, rate limits, or invalid requests, and the failure is not always obvious from the client side.

Second, the response parsing has silent failure modes. A wrong field name, a missing element, or a partially parsed response can still produce a vector — just not the right one.

Third, the normalization step is easy to get wrong, skip entirely, or apply twice. When that happens, similarity scores change even though the text hasn’t.

In Part 2, the schema was designed to store embeddings safely, track their lifecycle, and support retries when something goes wrong. Now we need to generate those embeddings correctly.

That responsibility lives entirely inside the embedding layer.

What the embedding layer is responsible for

Before looking at any implementation, it helps to define what the embedding layer is supposed to do — and just as importantly, what it is not supposed to do.

At a high level, the layer has one job: convert text into a vector that can be stored and compared in the database.

That sounds simple, but several steps are involved: sending the text to the API, validating the response, parsing the JSON, converting to a float array, and normalizing the vector before returning it.

Everything else belongs somewhere else.

The embedding layer does not know about the database.

It does not know about documents, metadata, or search queries.

Its only responsibility is converting text into a vector and returning it to whoever asked.

That boundary is what makes this layer testable, replaceable, and easy to reason about in isolation.

The service layer can call it without knowing what happens inside. The tests can mock it without spinning up an HTTP client. A different provider can be swapped in without touching anything outside this layer.

That boundary is captured by a single interface.

The EmbeddingClient interface

Before looking at the OpenAI implementation, the most important design decision in this layer is the interface.

public interface EmbeddingClient {
    float[] embed(String text);
}

This interface is intentionally small, but it defines the boundary for the entire embedding layer.

The service layer depends on this contract, not on any specific provider. As far as the rest of the application is concerned, embedding is simply a function that takes text and returns a vector.

How that vector is produced is an implementation detail.

One method, one responsibility.

The embedding layer should not expose HTTP details, JSON parsing, or model configuration.

All of that stays behind the implementation.

The return type is also a deliberate choice. The method returns a float[], not a List<Float> and not a custom wrapper type.

The database layer ultimately writes this value into a VECTOR column, and pgvector expects a primitive float array. Returning anything else would only introduce unnecessary conversion code between layers.

Depending on the interface rather than the implementation means the provider is swappable.

The class that implements this interface today is called OpenAiEmbeddingClient, but nothing in the service layer depends on that fact.

The same interface could later be backed by a local model, a different provider, or even a mock implementation for tests.

Wiring the client with Spring

The client is registered as a Spring component and configured through constructor injection.

public OpenAiEmbeddingClient(
        ObjectMapper mapper,
        @Value("${openai.apiKey}") String apiKey,
        @Value("${openai.embeddingModel}") String model) {
    this.mapper = mapper;
    this.apiKey = apiKey;
    this.model = model;
}

The values for the API key and model come from application configuration.

openai.apiKey=${OPENAI_API_KEY}
openai.embeddingModel=${OPENAI_EMBEDDING_MODEL:text-embedding-3-small}

Reading the API key from an environment variable is not just a convention; it is a requirement for any service that runs outside a local machine.

Hardcoding credentials in source code makes rotation difficult and leaks secrets into version control. Using ${OPENAI_API_KEY} allows the same code to run locally, in CI, and in production without changes.

The model name is also injected rather than hardcoded, but with a default value. The syntax ${OPENAI_EMBEDDING_MODEL:text-embedding-3-small} means the property is optional.

If no environment variable is provided, the client falls back to text-embedding-3-small. This makes local setup easier while still allowing the model to be changed without recompiling the application.

Constructor injection is used instead of field injection for a reason. All dependencies are provided when the object is created, and the fields can remain final.

This makes the class easier to test and prevents partially constructed instances. It also keeps the configuration visible at the entry point of the class instead of scattered across annotations.

At this point the embedding layer has a clear boundary and a concrete implementation. The remaining work is inside the client itself: building the HTTP request, validating the response, and turning the result into a normalized vector.

The full source code — including OpenAiEmbeddingClient, EmbeddingUtils, and all three Flyway migrations — is available on GitHub.

The embed() orchestration method

@Override
public float[] embed(String text) {
    try {
        HttpResponse<String> response = sendRequest(text != null ? text : "");
        validateResponse(response);
        return parseEmbedding(response.body());
    } catch (RuntimeException e) {
        throw e;
    } catch (Exception e) {
        throw new RuntimeException("Failed to get embedding from OpenAI", e);
    }
}

This method is intentionally small. It does not contain the implementation details of the HTTP call, response parsing, or normalization. Instead, it orchestrates the process by delegating each step to a private method.

Keeping the public method short makes the flow easy to read. The code describes what happens without showing how it happens: send the request, validate the response, parse the embedding.

The null guard at the entry point is intentional:

text != null ? text : ""

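The embedding call should never fail with a NullPointerException because the caller passed a null value. Converting null to an empty string ensures the private methods always receive a real string. Note that the OpenAI API reference says input cannot be an empty string, so a missing input is still likely to be rejected; it just surfaces as an HTTP error from validateResponse instead of a null pointer inside the client.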

Handling this at the boundary keeps the rest of the code simpler because the private methods never need to check for null.

The exception handling follows the same idea of keeping the boundary clean. Runtime exceptions are rethrown unchanged, while checked exceptions are wrapped in a RuntimeException.

The caller never has to deal with checked exceptions coming from the embedding layer, and the service layer can treat embedding failures like any other runtime error.

Building the HTTP request

private HttpResponse<String> sendRequest(String text) throws Exception {
    String body = mapper.writeValueAsString(
            Map.of("model", model, "input", text)
    );
    HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(OPENAI_EMBEDDINGS_URL))
            .header("Authorization", "Bearer " + apiKey)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
            .build();

    return httpClient.send(request,
            HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8)
    );
}

Several small decisions in this method prevent bugs that are difficult to trace later.

The API URL is stored in a constant at the top of the class instead of being written inline.

private static final String OPENAI_EMBEDDINGS_URL =
        "https://api.openai.com/v1/embeddings";

Defining the URL once makes it visible and easy to verify. A single missing character (embedding instead of embeddings) produces a 404 whose response body does not obviously point to a mistyped URL.

The request body is built using Jackson instead of concatenating strings.

mapper.writeValueAsString(
    Map.of("model", model, "input", text)
);

Manually building JSON is fragile. A missing quote, an extra comma, or an unescaped character in the input text can produce a request that looks correct in code but fails at runtime.

Using the ObjectMapper guarantees that the JSON is valid every time.

The request explicitly uses UTF-8 when writing the body.

HttpRequest.BodyPublishers.ofString(
    body,
    StandardCharsets.UTF_8
)

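In java.net.http, the single-argument BodyPublishers.ofString(String) is documented to use UTF-8 already, so this does not change behaviour. It makes the encoding visible at the call site instead of leaving it implicit.

Stating UTF-8 explicitly also guards against a later refactor to an API that does fall back to the platform default charset, which can differ between local development and production.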

The method returns the raw HTTP response instead of parsing it immediately. This keeps responsibilities separate. The request method only sends the request. Validation and parsing happen in the next steps.

Validating and parsing the response

Not every API response is a success. Before parsing anything, the response status needs to be verified and the parsing itself has subtle failure modes worth understanding.

A typical response from the embeddings API looks like this:

{
  "data": [
    {
      "embedding": [0.023, -0.181, 0.442, ...],
      "index": 0
    }
  ],
  "model": "text-embedding-3-small",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}

The first step is to verify that the request actually succeeded.

private void validateResponse(HttpResponse<String> response) {
    if (response.statusCode() / 100 != 2) {
        throw new RuntimeException(
                "OpenAI embeddings failed: HTTP "
                        + response.statusCode()
                        + " body=" + response.body()
        );
    }
}

Instead of checking for a single status code, the method verifies that the response is in the 2xx range.

response.statusCode() / 100 != 2

Integer division keeps only the hundreds digit, so this condition catches any non-2xx response with one comparison.

This includes rate limits, server errors, and invalid requests, all of which should stop the embedding process immediately.
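The integer-division trick is easy to check in isolation. The sketch below pulls the condition into a hypothetical helper named isSuccess and runs it against a few status codes.

```java
class StatusClassCheck {

    /** True for any 2xx status: integer division drops the last two digits. */
    static boolean isSuccess(int statusCode) {
        return statusCode / 100 == 2;
    }

    public static void main(String[] args) {
        for (int code : new int[] {200, 204, 299, 301, 400, 429, 500}) {
            System.out.println(code + " -> " + (isSuccess(code) ? "ok" : "reject"));
        }
    }
}
```

Here 200, 204, and 299 all divide down to 2, while 301, 400, 429, and 500 do not, so a single comparison covers redirects, client errors, rate limits, and server errors.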

Once the response is known to be valid, the next step is to extract the vector.

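private float[] parseEmbedding(String responseBody) throws Exception {
    // path(0) instead of get(0): a missing "data" array yields a MissingNode, not null
    JsonNode embedding = mapper.readTree(responseBody)
            .path("data")
            .path(0)
            .path("embedding");

    // Fail loudly instead of returning an empty vector when the field is absent
    if (!embedding.isArray() || embedding.size() == 0) {
        throw new IllegalStateException("OpenAI response contains no embedding array");
    }

    float[] out = new float[embedding.size()];

    for (int i = 0; i < embedding.size(); i++) {
        out[i] = (float) embedding.get(i).asDouble();
    }

    return EmbeddingUtils.l2Normalized(out);
}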

The parsing code uses path() instead of get() for most lookups — and the difference matters.

path() returns a MissingNode if the field does not exist, while get() would return null.

This avoids null pointer exceptions and makes the parsing code more predictable when the response structure changes.

The values are read as doubles and then cast to float.

(float) embedding.get(i).asDouble()

Jackson parses JSON numbers as double by default. Converting through asDouble() preserves precision correctly before the cast to float, which matches the type expected by pgvector.

The vector is not returned directly after parsing — it passes through one more step first.

L2 normalization: what it is and why it matters

Normalisation is the final step before the vector is returned.

public static float[] l2Normalized(float[] v) {
    double sumOfSquares = 0.0;
    for (float f : v) sumOfSquares += (double) f * f;

    double norm = Math.sqrt(sumOfSquares);
    if (norm == 0.0) return v.clone();

    float[] out = new float[v.length];
    for (int i = 0; i < v.length; i++) {
        out[i] = (float) (v[i] / norm);
    }
    return out;
}

In geometric terms, this moves every vector onto the surface of a unit sphere.

After normalization, every vector has magnitude 1, so comparisons depend only on its direction in the embedding space.

This matters because similarity search should compare direction, not magnitude.

Cosine similarity compares the angle between two vectors, not their length, and pgvector's <=> operator divides by the norms itself. Inner product and Euclidean distance do not: if vectors are not normalized, vectors with larger magnitude can produce larger dot products even when the meaning is not closer.

Without normalization, two embeddings pointing in the same direction but with different magnitudes can score differently against the same query under those metrics. Not because one is more relevant, but because one is larger.

Normalization removes this length bias and makes similarity depend only on semantic direction.
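A small numeric example makes the effect visible. The Java sketch below uses made-up two-dimensional vectors rather than real embeddings, and repeats the normalization routine so it compiles on its own; NormalizationDemo and dot are illustrative names.

```java
class NormalizationDemo {

    static double dot(float[] a, float[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (double) a[i] * b[i];
        }
        return sum;
    }

    /** Same algorithm as l2Normalized above, repeated so the sketch is self-contained. */
    static float[] l2Normalized(float[] v) {
        double sumOfSquares = 0.0;
        for (float f : v) sumOfSquares += (double) f * f;
        double norm = Math.sqrt(sumOfSquares);
        if (norm == 0.0) return v.clone();
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = (float) (v[i] / norm);
        }
        return out;
    }

    public static void main(String[] args) {
        float[] query = {1f, 0f};
        float[] small = {1f, 1f};   // made-up vector
        float[] large = {10f, 10f}; // same direction, ten times the magnitude

        // Raw dot products differ even though both vectors point the same way.
        System.out.println(dot(query, small) + " vs " + dot(query, large));

        // After normalization the two scores agree.
        System.out.println(dot(l2Normalized(query), l2Normalized(small))
                + " vs " + dot(l2Normalized(query), l2Normalized(large)));
    }
}
```

Before normalization the larger vector scores ten times higher against the query. After normalization both produce the same score, because only the direction is left.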

The method also handles the edge case where the norm is zero (an all-zero vector), where dividing would otherwise produce NaN values.

if (norm == 0.0) {
    return v.clone();
}

Returning a clone instead of the original array prevents the caller from accidentally mutating the input.

OpenAI's recent embedding models, including text-embedding-3-small, already return vectors normalized to length 1. The explicit normalization here is defensive.

It guarantees correct behaviour even if the model changes later, and it documents the assumption directly in code instead of relying on external behaviour.

Why the embedding layer is behind an interface

When the interface was introduced in an earlier section, the implementation behind it was simple. Now that the full implementation is visible (HTTP requests, response validation, parsing, normalisation), the value of keeping all of that behind a single method becomes clearer.

A mock implementation can return a fixed vector without making an HTTP call, which allows the service layer to be tested without depending on the external API.
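A minimal sketch of such a test double is shown below. The interface is redeclared inside the example so it compiles on its own; FixedEmbeddingClient and dimensionOf are hypothetical names, with dimensionOf standing in for whatever service-layer code calls the client.

```java
class MockEmbeddingExample {

    /** Same shape as the EmbeddingClient interface, redeclared to keep the sketch self-contained. */
    interface EmbeddingClient {
        float[] embed(String text);
    }

    /** Test double: returns a fixed vector and counts calls; no HTTP involved. */
    static class FixedEmbeddingClient implements EmbeddingClient {
        private final float[] vector;
        int calls = 0;

        FixedEmbeddingClient(float[] vector) {
            this.vector = vector.clone();
        }

        @Override
        public float[] embed(String text) {
            calls++;
            return vector.clone(); // defensive copy so callers cannot alter the fixture
        }
    }

    /** Hypothetical caller standing in for the service layer. */
    static int dimensionOf(EmbeddingClient client, String text) {
        return client.embed(text).length;
    }
}
```

Because the caller only sees the interface, the same code path can be exercised with the fixed vector in a unit test and with the real client in production.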

This separation may look unnecessary when the system is small, but it becomes important as soon as the embedding logic grows.

The client now handles HTTP requests, response validation, parsing, and normalization. Keeping all of that behind a single method prevents those details from leaking into the rest of the application.

What's Next

Part 4 moves up one level to the service layer — where everything built so far is orchestrated into a complete API.

See you in Part 4!

Building a Semantic Search API with Spring Boot and pgvector - Part 2: Designing the PostgreSQL Schema

Ozioma Ochin — Sun, 15 Mar 2026 19:41:25 +0000

Why the database layer matters

In a semantic search system, the database schema isn’t just storage.

It defines how embeddings are stored, indexed, and queried.

Many tutorials treat the database as a detail - create a table, add a vector column, and move on.

But when search quality depends on how vectors are stored and compared, the schema becomes a core architectural decision.

The schema determines what the system can do and what it cannot.

A missing index means slow queries at scale.
A missing status column means no visibility into embedding failures.
A poorly typed metadata column means filters that silently break.

Every column and every index in this schema exists because a specific part of the system depends on it.

Running pgvector locally

Before any migrations run, the database needs to support vector operations. That means PostgreSQL with the pgvector extension installed.

Using pgvector lets us keep embeddings in the same database as the documents.

This avoids the complexity of running a separate vector store.

For this project, the goal is simplicity and consistency, not maximum scale.

Keeping everything in PostgreSQL makes the system easier to reason about.

There’s no Pinecone account to manage, no separate service to keep in sync, and no eventual consistency between documents and embeddings.

Everything lives in one place and can be written in a single transaction.

The local setup uses Docker with the official pgvector image:

services:
  postgres:
    image: pgvector/pgvector:pg16
    container_name: semantic_search_postgres
    environment:
      POSTGRES_DB: semantic_search
      POSTGRES_USER: semantic
      POSTGRES_PASSWORD: semantic
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data

# Named volumes must be declared at the top level
volumes:
  pgdata:

The important line is pgvector/pgvector:pg16 instead of the standard postgres:16.

This image ships with the pgvector extension pre-installed.

No manual compilation, no OS-specific setup step.

Pull the image and the extension is ready.

The project includes two compose files.

`docker-compose_dev.yml` runs only the database — useful when running the Spring Boot app from IntelliJ

`docker-compose.yml` runs the full stack

The full source code including all the migrations is available on GitHub.

How Flyway works in this project

The schema in this project didn't arrive fully formed.

It evolved over time, and the migrations show exactly how.

Instead of writing the final schema all at once, the project builds it step by step through Flyway migrations, just like a real system would.

V1 creates the foundation.
V2 adds document lifecycle tracking.
V3 fixes a data quality problem that V1 didn't anticipate.

Each migration represents a decision made at a specific point in the project's history.

This approach matters for two reasons.

First, reproducibility. Any developer cloning the repository gets the exact same schema by running the application. Flyway applies the migrations in order, tracks which ones have already run in its flyway_schema_history table, and skips anything that's already been applied.

Second, safety. Because spring.jpa.hibernate.ddl-auto is set to validate, Hibernate will refuse to start if the schema doesn't match the entity definitions.

Flyway owns the schema.

Hibernate only validates it.

spring.flyway.enabled=true
spring.flyway.locations=classpath:db/migration
spring.jpa.hibernate.ddl-auto=validate

The naming convention for migrations also matters.

Every file follows the pattern:

V{number}__{description}.sql

There are two underscores between the version number and the description.

Flyway uses the version number to determine execution order, and a checksum of each file to detect changes.

If a migration file is modified after it has already been applied, Flyway refuses to start.

That constraint is intentional.

It forces schema changes to go through new migrations instead of editing old ones.
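To make the naming rule concrete, here is a minimal Java sketch of how a file name maps to an execution order. It is a simplified illustration, not Flyway's actual parser (Flyway also accepts dotted versions and repeatable migrations). The class name, method names, and example file names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MigrationNames {
    // "V", an integer version, two underscores, a description, then ".sql".
    private static final Pattern NAME = Pattern.compile("^V(\\d+)__(.+)\\.sql$");

    /** Returns the integer version, or throws if the name does not fit the pattern. */
    static int versionOf(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a versioned migration: " + fileName);
        }
        return Integer.parseInt(m.group(1));
    }

    /** Sorts file names by numeric version, so V10 runs after V2. */
    static List<String> inExecutionOrder(List<String> fileNames) {
        List<String> sorted = new ArrayList<>(fileNames);
        sorted.sort(Comparator.comparingInt(MigrationNames::versionOf));
        return sorted;
    }
}
```

A name with a single underscore, such as V1_init.sql, does not match the pattern and is rejected by this sketch, which is why the double underscore is worth calling out.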

This project ends up with three migrations.

The first migration builds the entire foundation - table, indexes, and trigger - in a single SQL file.

V1: Building the foundation

The schema is designed around how the search queries will run, not just how the data is stored.

Every column in V1 exists because a specific part of the system depends on it.

Rather than showing the full migration at once, each part is broken down and explained in the order it appears in the file: extension first, table second, indexes third, and trigger last.

CREATE EXTENSION IF NOT EXISTS vector;

This line has to come first. PostgreSQL does not support the VECTOR type by default, so the rest of the migration would fail without it.

The IF NOT EXISTS clause also makes the migration safer. If the extension is already installed in a local environment, CI database, or shared dev database, Flyway can still run the migration without error.

The table

CREATE TABLE IF NOT EXISTS documents (
    id          BIGSERIAL PRIMARY KEY,
    title       TEXT NOT NULL,
    content     TEXT NOT NULL,
    metadata    JSONB,
    embedding   VECTOR(1536),
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

Each column exists for a reason:

id BIGSERIAL PRIMARY KEY
A standard auto-incrementing identifier. BIGSERIAL is used instead of SERIAL to avoid running out of IDs in larger datasets.

title TEXT NOT NULL and content TEXT NOT NULL
These are the fields that get embedded. TEXT is used instead of VARCHAR because PostgreSQL handles TEXT efficiently, and a hard length limit would be artificial here.

metadata JSONB
Optional metadata for filtering. JSONB is used instead of JSON because it is faster to query and supports GIN indexing.

embedding VECTOR(1536)
The vector representation of the document. 1536 matches the output size of OpenAI’s text-embedding-3-small model. If the model changes, this column definition would also need to change.

created_at TIMESTAMPTZ NOT NULL DEFAULT now()
Stores when the row was created. The database sets it automatically.

updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
Stores when the row was last updated. This value is maintained by a trigger so it stays correct no matter how the row is modified.

The indexes

V1 adds three indexes, each supporting a different access pattern.

created_at index

CREATE INDEX IF NOT EXISTS idx_documents_created_at
    ON documents (created_at DESC);

This is a standard B-tree index for queries that sort documents by creation time.

That is useful for admin pages, auditing, and any endpoint that lists recently created documents.

The DESC ordering matches the most common query pattern, so PostgreSQL does not need to sort the results separately.

Metadata GIN index

CREATE INDEX IF NOT EXISTS idx_documents_metadata_gin
    ON documents USING gin (metadata);

This index supports metadata filtering.

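Because metadata is stored as JSONB, PostgreSQL can use a GIN index to search inside the JSON structure efficiently. With the default operator class, that covers containment (@>) and key-existence (?) filters; an expression such as metadata ->> 'category' = 'billing' is not accelerated by this index. Without an index the planner can use, metadata filters would require a full table scan.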

embedding IVFFlat index

CREATE INDEX IF NOT EXISTS documents_embedding_ivfflat_idx
    ON documents USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

This is the most important index in the migration.

Without it, vector similarity search would require comparing the query embedding against every stored embedding in the table.

ivfflat is pgvector’s approximate nearest-neighbour index. It improves speed by grouping vectors into clusters and searching only the clusters closest to the query vector.

That comes with a tradeoff: slightly lower recall in exchange for much faster queries.

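The lists = 100 setting controls how many clusters are created. For a small dataset, 100 is a reasonable starting point. IVFFlat computes its cluster centres from the rows present when the index is built, so an index created on an empty table, as in V1, should be rebuilt once representative data has been loaded, and the lists value revisited as the dataset grows.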

The vector_cosine_ops operator class is also important. It tells PostgreSQL to optimize the index for cosine distance, which must match the operator used in the query.
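When revisiting the lists setting, the pgvector README suggests a rule of thumb: rows / 1000 for up to one million rows, and the square root of the row count above that. The helper below is a small sketch of that guidance. The class and method names are hypothetical, and the floor of 1 is this sketch's own guard rather than part of the guidance.

```java
final class IvfflatSizing {
    /**
     * Starting point for the ivfflat "lists" parameter:
     * rows / 1000 up to one million rows, sqrt(rows) above that.
     * The result is clamped to at least 1 (an assumption of this sketch).
     */
    static int suggestedLists(long rowCount) {
        if (rowCount < 0) {
            throw new IllegalArgumentException("rowCount must be >= 0");
        }
        long lists = rowCount <= 1_000_000L
                ? rowCount / 1000
                : Math.round(Math.sqrt((double) rowCount));
        return (int) Math.max(1, lists);
    }
}
```

By that rule, lists = 100 corresponds to a table of roughly 100,000 rows, so it is worth recomputing once the real document count is known.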

Search Request
     ↓
status = 'READY'
     ↓
metadata filters
     ↓
embedding <=> query_vector
     ↓
ranked results

The schema and indexes are designed around the path a search query will take through the table.

The trigger

CREATE OR REPLACE FUNCTION set_updated_at()
RETURNS TRIGGER AS $$
BEGIN
    NEW.updated_at = now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_documents_updated_at
BEFORE UPDATE ON documents
FOR EACH ROW EXECUTE FUNCTION set_updated_at();

This ensures that updated_at is refreshed automatically every time a document row is updated.

Handling this in the database is more reliable than doing it in the service layer. Even if someone updates the row through raw SQL, the timestamp remains correct.

That matters because a change to a document's content may mean the stored embedding is now stale. An accurate updated_at records when the row last changed, which is the reference point any staleness check needs.

V1 is the foundation.

Everything in V2 and V3 builds on top of it.

V2: Adding the document lifecycle

Once embeddings are generated, the database needs to track the state of each document.

Embedding is not instant. It depends on an external API call, which can fail, time out, or hit rate limits.

Without a way to track embedding state, a document could exist in the database but never appear in search results, with no clear explanation why.

V2 introduces a simple lifecycle model so the system always knows whether a document is searchable.

This migration adds three columns and one index.

ALTER TABLE documents
    ADD COLUMN IF NOT EXISTS status TEXT NOT NULL DEFAULT 'READY',
    ADD COLUMN IF NOT EXISTS embedding_error TEXT,
    ADD COLUMN IF NOT EXISTS embedding_updated_at TIMESTAMPTZ;

CREATE INDEX IF NOT EXISTS idx_documents_status_created
    ON documents (status, created_at DESC);

Why lifecycle tracking is needed

Embedding is performed after the document is stored, not at the same time.

That means a document can exist in several states:

saved but not embedded yet.
successfully embedded.
failed to embed.

Without a status column, the system cannot tell these cases apart.

A failed embedding would simply result in a document that never shows up in search, which makes debugging difficult.

V2 makes document state explicit in the schema.

The new columns

status

status TEXT NOT NULL DEFAULT 'READY'

This column tracks where a document is in its lifecycle.

The system uses three values:

PENDING — document saved, embedding not generated yet.
READY — embedding generated successfully.
FAILED — embedding request failed.

The default value is READY.

This might look strange at first, but it keeps the migration safe for existing rows. When the migration runs on a database that already has documents, those rows need a valid status value.

Using READY assumes existing data already has embeddings, which is the safest assumption.

New documents created after this migration are explicitly set to PENDING by the application before embedding runs.

    PENDING
       ↓
     READY

    PENDING
      ↓
    FAILED

Once embedding becomes a separate step, document state must become part of the schema.
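On the application side, the two paths in the diagram above can be expressed as a small state machine. This is a minimal sketch, not the project's actual enum: DocumentStatus is the name used in the series, but the methods here are hypothetical, and only the two transitions drawn above are allowed. A retry path (FAILED back to PENDING) would need an extra rule.

```java
import java.util.EnumSet;
import java.util.Set;

enum DocumentStatus {
    PENDING, READY, FAILED;

    /** Transitions drawn in the diagram: PENDING -> READY and PENDING -> FAILED. */
    Set<DocumentStatus> allowedNext() {
        return this == PENDING
                ? EnumSet.of(READY, FAILED)
                : EnumSet.noneOf(DocumentStatus.class);
    }

    boolean canTransitionTo(DocumentStatus next) {
        return allowedNext().contains(next);
    }

    /** Only READY documents are eligible for search. */
    boolean isSearchable() {
        return this == READY;
    }
}
```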

embedding_error

embedding_error TEXT stores the error message when embedding fails.

Most documents will never use this column, so it is nullable.

When a document is in FAILED state, this field makes debugging much easier. Instead of searching through logs, the failure reason is visible directly in the database.

The error stored here might be a network timeout, a rate-limited response, or an unexpected payload: whatever the API returned at the time of failure.

embedding_updated_at

embedding_updated_at TIMESTAMPTZ stores the last time the embedding was generated.

This is different from updated_at, which tracks when the document row changes.

This column makes it possible to implement retry logic later.

For example, a background job could look for documents where:

status = 'FAILED'
AND embedding_updated_at < now() - interval '1 hour'

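and retry the embedding only for older failures. This assumes the application also stamps embedding_updated_at when an attempt fails; a row where the column is still NULL never satisfies the comparison.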

This avoids retrying the same document repeatedly in a tight loop.
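The same condition can be mirrored in Java, for example to unit-test the retry rule without a database. This is a sketch under assumptions: the class and method names are hypothetical, the status is passed as a plain string, and the cooldown is a parameter rather than a fixed hour.

```java
import java.time.Duration;
import java.time.Instant;

final class RetryPolicy {
    /**
     * Mirrors: status = 'FAILED' AND embedding_updated_at < now() - cooldown.
     * A null timestamp returns false, matching SQL, where a comparison
     * against NULL is never true.
     */
    static boolean isRetryDue(String status, Instant embeddingUpdatedAt,
                              Instant now, Duration cooldown) {
        if (!"FAILED".equals(status) || embeddingUpdatedAt == null) {
            return false;
        }
        return embeddingUpdatedAt.isBefore(now.minus(cooldown));
    }
}
```

Passing the current time in as an argument keeps the check deterministic, which makes the "tight loop" case easy to cover in a test.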

The composite index

CREATE INDEX IF NOT EXISTS idx_documents_status_created
    ON documents (status, created_at DESC);

This index supports two important query patterns.

First, search queries filter on:

status = 'READY'

before performing vector similarity search.

Without an index on status, PostgreSQL would have to scan many rows before the vector index can do its job.

Second, admin queries often need to list documents by status, ordered by creation time.

For example:

newest failed documents.
newest pending documents.
newest ready documents.

The column order in the index matters.

status comes first because it is used for filtering.

created_at DESC comes second because it is used for sorting.

With this order, PostgreSQL can use the same index for both filtering and ordering.

Why V2 changes the schema

Once embedding becomes a separate step, document state becomes part of the database model.

This is a good example of why schemas evolve.

The first version only needed to store documents.

The second version needs to describe their lifecycle.

And that change belongs in the database, not just in the application.

V3: Fixing bad data

Real systems rarely get the schema right the first time.

V3 is different from previous migrations. It doesn't add columns or create indexes.

Instead, it fixes data that was stored incorrectly in an earlier version of the service.

Before V3:

metadata = "category=billing"

After V3:

metadata = { "raw": "category=billing" }

This migration exists because the application originally allowed metadata to be saved as a JSON string instead of a JSON object.

That turned out to be a problem later when filtering was added.

This is a data migration, not a schema migration.

UPDATE documents
SET metadata = jsonb_build_object('raw', metadata)
WHERE metadata IS NOT NULL
  AND jsonb_typeof(metadata) = 'string';

What went wrong

The metadata column is defined as JSONB.

That means PostgreSQL will accept any valid JSON value: an object, array, string, number, boolean, or null.

The application’s filter logic assumes metadata is always a JSON object, so it can query values using operators like:

metadata ->> 'category'

But earlier versions of the service allowed values like this:

"metadata": "category=billing"

This is valid JSON, but it is not an object.

Once filters were added, these rows stopped working correctly.

On a JSON string, metadata ->> 'category' returns NULL rather than raising an error, so these rows silently dropped out of filtered results.

Later versions of the API added validation to prevent this, but by then some bad data already existed in the database.

What the migration does

The migration finds rows where metadata is stored as a string:

jsonb_typeof(metadata) = 'string'

and converts them into a valid JSON object:

jsonb_build_object('raw', metadata)

So this:

"category=billing"

becomes:

{"raw": "category=billing"}

This approach keeps the original value instead of deleting it.

The document stays intact, and the metadata becomes a JSON object that the filter system can handle.

If needed, a developer can still inspect the original value inside the “raw” field.
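The same wrapping rule can be sketched in Java, for instance as a guard at the API boundary. This is an illustration only. It assumes the metadata has already been parsed into plain Java objects (a String for a JSON string, a Map for a JSON object), and the class and method names are hypothetical.

```java
import java.util.Map;

final class MetadataNormalizer {
    /**
     * Applies the V3 rule to an already-parsed JSON value:
     * a bare string is wrapped as {"raw": <string>};
     * objects, other value types, and null pass through unchanged,
     * just as the migration only touches rows whose type is 'string'.
     */
    static Object normalize(Object parsedJson) {
        if (parsedJson instanceof String) {
            return Map.of("raw", parsedJson);
        }
        return parsedJson;
    }
}
```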

Why this migration matters

This migration highlights something that happens in almost every real system:

Migrations do not only add features, they also repair data.

Nothing new was added to the schema, but the data became consistent again.

This kind of migration is common in production systems. The important part is not avoiding mistakes, it is fixing them safely without losing information.

V3 is a record of a real problem that existed, and the decision made to correct it.

The Complete Schema

After all three migrations, the schema is now complete.

Here is the final table definition after V1, V2, and V3 have all been applied.

CREATE TABLE documents (
    id                   BIGSERIAL PRIMARY KEY,
    title                TEXT NOT NULL,
    content              TEXT NOT NULL,
    metadata             JSONB,
    embedding            VECTOR(1536),
    status               TEXT NOT NULL DEFAULT 'READY',
    embedding_error      TEXT,
    embedding_updated_at TIMESTAMPTZ,
    created_at           TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at           TIMESTAMPTZ NOT NULL DEFAULT now()
);

Nothing in this table is accidental.

Every column exists because a specific part of the system needs it.

The schema now describes not just the data, but the behavior of the system.

The Indexes

Each index exists for a specific query pattern — remove any one of them and something in the system either breaks or slows down significantly at scale.

Index                            Type                Purpose
idx_documents_created_at         B-tree              Ordering by creation time
idx_documents_metadata_gin       GIN                 Metadata filtering
documents_embedding_ivfflat_idx  IVFFlat             Vector similarity search
idx_documents_status_created     B-tree (composite)  Status filtering + ordering

What's Coming Next

The schema can now store embeddings, but generating them is where things get interesting.

Part 3 covers the full implementation of the embedding client using Java's built-in HttpClient, with no third-party SDK.

It also covers the bugs that are hardest to catch: the ones that don't throw exceptions but silently produce wrong similarity scores.

See you in Part 3.

Building a Semantic Search API with Spring Boot and pgvector - Part 1: Architecture

Ozioma Ochin — Sun, 08 Mar 2026 17:52:49 +0000

The problem with Keyword Search

Keyword search breaks more often than most engineers realize.

A few months ago, I was building an internal document management tool. Users could upload policy documents, product guides, and support articles — and search through them.

I implemented a simple keyword search, deployed it, and assumed I was done.

Then the complaints started.

One support engineer searched for "billing retries" and got zero results. The document absolutely existed. It was titled "Payment Failure Handling Policy" and covered exactly what they were looking for.

The problem wasn’t the content.

The problem was the search engine.

It was doing exactly what keyword search is designed to do: scanning documents for the exact words “billing” and “retries.”

Those words weren't in the document. So the system concluded there was no match.

Query: "billing retries"
Document: "Payment Failure Handling Policy"
Keyword search: ❌ No match (strings don't overlap)
Semantic search: ✅ Strong match (meaning is the same)

This is the fundamental limitation of keyword search: it compares strings, not meaning.

It treats "car" and "automobile" as completely unrelated.

It sees "help me fix this bug" and "debugging assistance" as different queries. But that’s not how people search in the real world.

People search using intent, and they rarely phrase a query the same way a document is written.

Semantic search approaches the problem differently.

Instead of matching text directly, it attempts to capture the meaning behind the words. To do that, it converts text into numerical representations called embeddings. Before we build the search system itself, we first need to understand what embeddings are and why they work.

[Figure: Keyword Search vs Semantic Search]

What Are Embeddings?

Embeddings are the core idea behind semantic search. At a high level, an embedding is a numerical representation of text.

Instead of storing meaning as words, machine learning models convert text into vectors - lists of numbers that capture semantic relationships between pieces of text.

For example, a sentence like:

"How do I retry a failed payment?" => [0.023, -0.181, 0.442, ..., 0.091]
                                        1,536 dimensions

These numbers by themselves don’t mean much to us.

What matters is how close two vectors are in this space.

If two pieces of text express similar ideas, their vectors will appear close together in this space.

If they describe completely different concepts, their vectors will be far apart.

For example, the embeddings for "car", "automobile", and "vehicle" will sit very close together.

Meanwhile, something unrelated like "banana" will be far away from them.

This is what allows semantic search to work.

Instead of asking:

Do these documents contain the same words?

The system asks:

Are these documents about the same idea?

That small shift fundamentally changes how search works.

It allows search engines to retrieve relevant documents even when the wording is completely different.

[Figure: Embedding Space]

In practice, modern embedding models produce vectors with hundreds or thousands of dimensions.

The model used in this project generates vectors with 1536 dimensions, which means every piece of text becomes a point in a 1536-dimensional space.

While we can't visualize that space directly, distance between vectors can still be measured mathematically.

That measurement is what allows us to rank documents by semantic similarity.

Measuring Semantic Similarity

Once both the query and documents are converted into embeddings, the next question becomes:

How do we compare them?

This is where vector similarity comes in.

A semantic search system measures how close two vectors are to each other in the embedding space.

If two vectors point in nearly the same direction, the underlying text likely expresses the same idea.

If the vectors point in very different directions, the concepts are probably unrelated.

One of the most common ways to measure this similarity is cosine similarity.

Cosine similarity is the cosine of the angle between two vectors.

Vectors pointing in the same direction have a similarity close to 1.

Unrelated (orthogonal) vectors score close to 0, and vectors pointing in opposite directions approach -1.

In practice, this allows the search system to rank documents by semantic relevance.

Instead of returning documents that simply contain the same words, the system returns documents whose meaning is closest to the user’s query.

This is what makes semantic search so powerful.

Even if the wording is different, the search engine can still retrieve the right documents.
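As a concrete reference, cosine similarity is the dot product of the two vectors divided by the product of their lengths. The sketch below computes it in plain Java. In the actual system the comparison runs inside PostgreSQL, so this class (a hypothetical name) is only here to make the formula tangible.

```java
final class Cosine {
    /** cos(theta) = (a . b) / (|a| * |b|) */
    static double similarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            // The angle is undefined for a zero-length vector.
            throw new IllegalArgumentException("zero vector");
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Note that only direction matters: scaling a vector by a positive constant leaves its similarity to any other vector unchanged.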

[Figure: Cosine Similarity]

Now let's look at how this works in the system we're building.

What We're Building

The service exposes six endpoints:

POST /documents — store a document and compute its embedding

GET /documents/{id} — retrieve a document by ID

PUT /documents/{id} — update a document and re-compute its embedding

DELETE /documents/{id} — remove a document

POST /search — semantic search with filters and pagination

GET /ping — health check

When a search request arrives, the system does five things in sequence.

The client sends a query to the API.

The API converts that query into an embedding using the same model that was used to embed the stored documents.

PostgreSQL then performs a vector similarity search, comparing the query vector against the stored document embeddings through pgvector's index.

The database returns the documents whose vectors are closest to the query.

The API ranks them by similarity score and returns the results.

The key detail is step two. The query and the documents are embedded using the same model, which means they live in the same vector space.

That shared space is what makes comparison possible. A query about "billing retries" and a document about "payment failure handling" end up close together in that space, and pgvector finds that closeness in milliseconds even across thousands of documents.

[Figure: Search Execution Flow]

Because pgvector runs inside PostgreSQL, the similarity search can be combined with standard database features — filtering by metadata, pagination, and indexing — all inside a single query.

No separate vector database is required.

Here's what a search request and response look like:

{
  "query": "billing retries",
  "page": 0,
  "size": 10,
  "minScore": 0.6,
  "filters": { "category": "billing" }
}

And the response:

{
  "page": 0,
  "size": 10,
  "totalElements": 3,
  "items": [
    {
      "id": 1,
      "title": "Payment Failure Handling Policy",
      "cosineDistance": 0.12,
      "cosineSimilarity": 0.88,
      "score": 0.94
    }
  ]
}

Three score fields appear in every result.

cosineDistance is the raw output from pgvector: lower means more similar.

cosineSimilarity is 1 - cosineDistance, so higher means more similar.

score normalises the result to a clean [0, 1] range and is the value your application should actually use.

Set minScore: 0.7 in the request and only results with a score of 0.7 or above come back.
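The relationship between the three fields can be sketched in a few lines of Java. The first conversion follows from how cosine distance is defined. The second is an assumption: mapping similarity from [-1, 1] onto [0, 1] with (similarity + 1) / 2 reproduces the numbers in the sample response (0.12, 0.88, 0.94), but the project's real formula may differ. All names here are hypothetical.

```java
final class Scores {
    /** Cosine distance is 1 - cosine similarity, so this inverts it. */
    static double similarityFromDistance(double cosineDistance) {
        return 1.0 - cosineDistance;
    }

    /**
     * ASSUMPTION: maps similarity from [-1, 1] onto [0, 1].
     * Inferred from the sample response, not taken from the project source.
     */
    static double normalisedScore(double cosineSimilarity) {
        return (cosineSimilarity + 1.0) / 2.0;
    }

    /** A result is kept when its score is at or above the requested minimum. */
    static boolean passes(double score, double minScore) {
        return score >= minScore;
    }
}
```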

The filters field narrows results to documents whose metadata matches specific values. In the example above, only documents tagged category: billing are searched. The filter keys are validated at the API boundary — malformed keys are rejected before they reach the database.
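A boundary check for filter keys might look like the sketch below. The article does not spell out the actual rule, so the allowed pattern (letters, digits, and underscores, up to 64 characters) is an assumption, as are the class and method names.

```java
import java.util.Map;
import java.util.regex.Pattern;

final class FilterKeys {
    // ASSUMPTION: letters, digits, and underscores only, 1 to 64 characters.
    private static final Pattern ALLOWED = Pattern.compile("[A-Za-z0-9_]{1,64}");

    static boolean isValidKey(String key) {
        return key != null && ALLOWED.matcher(key).matches();
    }

    /** Rejects the whole request if any key falls outside the allowed pattern. */
    static void requireValid(Map<String, String> filters) {
        for (String key : filters.keySet()) {
            if (!isValidKey(key)) {
                throw new IllegalArgumentException("Invalid filter key: " + key);
            }
        }
    }
}
```

Restricting keys to a narrow pattern keeps anything that looks like SQL or JSON path syntax out of the query builder, independent of how the values are bound.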

The full source code is on GitHub — linked at the end of this article.

[Figure: System Architecture]

The Tech Stack and Why

The goal of this project wasn’t just to build semantic search — it was to build it using tools that many backend engineers already use in production.

Instead of introducing a completely new ecosystem, the idea was to see how far we could push a familiar stack.

Here’s what that stack looks like.

Spring Boot

Spring Boot handles the infrastructure (dependency injection, validation, exception handling, configuration management), leaving the focus on business logic. Spring Boot 3.2 on Java 21+ also supports virtual threads via Project Loom, which is relevant for a service making frequent I/O calls to OpenAI.

The honest reason for this choice over Quarkus or Micronaut is that Spring Boot is widely used in enterprise Java, and this service needs to be readable and maintainable by other Java developers. Familiarity is a legitimate engineering consideration.

PostgreSQL

PostgreSQL stores the documents, metadata and timestamps. The vector storage is handled by the pgvector extension, covered next.

pgvector

The question worth addressing directly is why not a dedicated vector database like Pinecone, Weaviate, or Qdrant?
For most production use cases, you don't need one.

pgvector is a PostgreSQL extension that adds a VECTOR column type and a cosine distance operator <=>.

It stores embeddings directly alongside relational data, in the same database, with the same ACID guarantees.

A document and its embedding are written in a single transaction — no synchronisation between two systems, no eventual consistency to reason about.

CREATE TABLE documents (
    id                   BIGSERIAL PRIMARY KEY,
    title                TEXT NOT NULL,
    content              TEXT NOT NULL,
    metadata             JSONB,
    embedding            VECTOR(1536),
    status               TEXT NOT NULL,
    embedding_error      TEXT,
    embedding_updated_at TIMESTAMPTZ
);

One honest caveat: pgvector works well at moderate scale, on the order of millions of documents. For billions of vectors with sub-millisecond latency requirements, a dedicated vector database makes more sense. But for the vast majority of production use cases, pgvector is the right starting point.

Reach for a specialist tool only when you've proven you need it.

OpenAI Embeddings

text-embedding-3-small converts text to 1,536-dimensional vectors. It was chosen over ada-002 and text-embedding-3-large for the balance of quality, speed, and cost. It produces embeddings that are more than good enough for document search at a fraction of the cost of the larger model.

More importantly, the OpenAI client is never imported directly into the service layer.

It sits behind an EmbeddingClient interface, so the provider can be swapped without touching the service layer. More on this in Part 6.

Flyway

Finally, Flyway is used to manage database migrations. As the schema evolves, for example when introducing document status fields or metadata changes, Flyway ensures that database changes are applied consistently across environments.

Using migrations also makes it easier for readers of this series to reproduce the database setup.

High-Level Architecture

The service is organised into four layers. Each layer has one job and communicates only with the layer directly below it.

Controller — the HTTP boundary.

Receives requests, validates them with @Valid, delegates to the service, and returns the correct status code. No business logic lives here. A GlobalExceptionHandler sits across all controllers and ensures every error response, whether a 400, 404, or 500, returns the same structured JSON shape.

Service — where all decisions happen.

DocumentServiceImpl orchestrates the repository and the embedding client. It controls the document lifecycle: every document is saved immediately with a PENDING status, then moves to READY once the embedding succeeds, or FAILED if OpenAI returns an error.

A failed embedding is never silent — the error message is stored in the database and the document is excluded from all search results until it's resolved.

public CreateDocumentResponse create(CreateDocumentRequest request) {
    Document saved = saveAsPending(request);     // status = PENDING
    embedAndPersist(saved.getId(), ...);          // status → READY or FAILED
    return new CreateDocumentResponse(saved.getId(), DocumentStatus.READY);
}

Repository — Spring Data JPA handles standard CRUD.

JdbcTemplate handles vector operations. pgvector's <=> cosine distance operator and ::vector casting don't map to JPQL, so those queries are written in SQL directly. Two tools, two clearly defined responsibilities.

SELECT id, title,
       (embedding <=> ?::vector) AS cosine_distance
FROM documents
WHERE status = 'READY'
ORDER BY cosine_distance ASC
LIMIT ?;
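The ?::vector placeholder expects the embedding in pgvector's text form, a bracketed, comma-separated list such as [0.1,0.2,0.3]. One way to produce it from a float[] is sketched below. The class and method names are hypothetical, and the project may bind the parameter differently.

```java
final class VectorLiterals {
    /** Formats a float[] as pgvector's text representation, e.g. [1.0,2.5,-3.0]. */
    static String toVectorLiteral(float[] embedding) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < embedding.length; i++) {
            if (!Float.isFinite(embedding[i])) {
                // NaN and infinity are not valid vector elements.
                throw new IllegalArgumentException("non-finite value at index " + i);
            }
            if (i > 0) {
                sb.append(',');
            }
            sb.append(embedding[i]);
        }
        return sb.append(']').toString();
    }
}
```

The resulting string is passed as an ordinary bind parameter, and the ::vector cast in the SQL turns it into a vector on the database side, so the embedding is never concatenated into the query text.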

Embedding — OpenAiEmbeddingClient sits behind an EmbeddingClient interface.

Nothing else in the application imports the implementation directly. Swapping OpenAI for a local model means writing one new class — the service layer is untouched.

public interface EmbeddingClient {
    float[] embed(String text);
}
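To illustrate what "one new class" can look like, here is a hypothetical stand-in implementation that needs no network or API key. It is a sketch for tests only: the vectors are deterministic but carry no semantic meaning. The interface is repeated so the example compiles on its own, and FakeEmbeddingClient is not part of the project.

```java
interface EmbeddingClient {
    float[] embed(String text);
}

/** Hypothetical test double: deterministic, offline, not semantically meaningful. */
final class FakeEmbeddingClient implements EmbeddingClient {
    private final int dimensions;

    FakeEmbeddingClient(int dimensions) {
        if (dimensions <= 0) {
            throw new IllegalArgumentException("dimensions must be positive");
        }
        this.dimensions = dimensions;
    }

    @Override
    public float[] embed(String text) {
        float[] v = new float[dimensions];
        // Spread character codes across the vector so different texts differ.
        for (int i = 0; i < text.length(); i++) {
            v[i % dimensions] += text.charAt(i);
        }
        // L2-normalise to unit length (left as all zeros for empty input).
        double norm = 0;
        for (float x : v) {
            norm += (double) x * x;
        }
        norm = Math.sqrt(norm);
        if (norm > 0) {
            for (int i = 0; i < v.length; i++) {
                v[i] /= (float) norm;
            }
        }
        return v;
    }
}
```

Because the service depends only on the interface, a class like this can be injected in tests while the OpenAI-backed implementation is used in production.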

[Figure: Architecture]

The full source code, including all migrations and tests, is available on GitHub.

Series Roadmap

This article covered the foundation — the problem semantic search solves, how embeddings work, and how the system is structured. The rest of the series builds out each layer in full.

Part 2 — The Database Layer : All three Flyway migrations in detail. The documents table structure, the IVFFlat index configuration, the JSONB metadata design, and how the schema supports the document lifecycle from day one.

Part 3 — Calling the OpenAI Embeddings API in Java Without an SDK: Building the HTTP request with plain java.net.http.HttpClient, parsing the response with Jackson, L2 normalisation, and the bugs worth knowing about before you write a single line.

Part 4 — The Full CRUD API and Service Layer: The complete DocumentServiceImpl — create, read, update, delete, and search. The QueryBuilder inner class for safe dynamic SQL. The GlobalExceptionHandler for consistent error responses across the entire API.

Part 5 — Testing Without a Real Database or API Key: Mockito, MockMvc, H2 test profiles, and the specific JdbcTemplate varargs trap that catches most developers the first time — with the exact fix.

Part 6 — Lessons Learned: 6 Bugs Found Before the Service Could Run: Six real bugs from this codebase — wrong API URL, missing annotation bracket, trailing comma in a JSON string, broken SQL subquery, silent double normalisation, and a RuntimeException returning 500 instead of 404. What each one taught me and how to avoid them.

If you found this useful, the full source code is available on GitHub and the next article in the series dives into the database layer and pgvector indexing.

See you in Part 2.
