## Why the database layer matters
In a semantic search system, the database schema isn’t just storage.
It defines how embeddings are stored, indexed, and queried.
Many tutorials treat the database as a detail: create a table, add a vector column, and move on.
But when search quality depends on how vectors are stored and compared, the schema becomes a core architectural decision.
The schema determines what the system can do and what it cannot.
- A missing index means slow queries at scale.
- A missing status column means no visibility into embedding failures.
- A poorly typed metadata column means filters that silently break.
Every column and every index in this schema exists because a specific part of the system depends on it.
## Running pgvector locally
Before any migrations run, the database needs to support vector operations. That means PostgreSQL with the pgvector extension installed.
Using pgvector lets us keep embeddings in the same database as the documents.
This avoids the complexity of running a separate vector store.
For this project, the goal is simplicity and consistency, not maximum scale.
Keeping everything in PostgreSQL makes the system easier to reason about.
There’s no Pinecone account to manage, no separate service to keep in sync, and no eventual consistency between documents and embeddings.
Everything lives in one place and can be written in a single transaction.
The local setup uses Docker with the official pgvector image:

```yaml
services:
  postgres:
    image: pgvector/pgvector:pg16
    container_name: semantic_search_postgres
    environment:
      POSTGRES_DB: semantic_search
      POSTGRES_USER: semantic
      POSTGRES_PASSWORD: semantic
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data

# The named volume must be declared at the top level,
# otherwise Compose rejects the file as referring to an undefined volume.
volumes:
  pgdata:
```
The important line is `pgvector/pgvector:pg16` instead of the standard `postgres:16`.
This image ships with the pgvector extension pre-installed.
No manual compilation, no OS-specific setup step.
Pull the image and the extension is ready.
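One quick way to confirm this is to query the catalog once the container is up; `pg_available_extensions` lists what the image ships, even before any migration has run:

```sql
-- Connect first, e.g. with psql inside the container,
-- then check that pgvector is available for installation.
SELECT name, default_version
FROM pg_available_extensions
WHERE name = 'vector';
```

If the row comes back, the image is correct and the migrations can install the extension normally.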
The project includes two compose files:

- `docker-compose_dev.yml` runs only the database, which is useful when running the Spring Boot app from IntelliJ.
- `docker-compose.yml` runs the full stack.
The full source code including all the migrations is available on GitHub.
## How Flyway works in this project
The schema in this project didn't arrive fully formed.
It evolved over time, and the migrations show exactly how.
Instead of writing the final schema all at once, the project builds it step by step through Flyway migrations, just like a real system would.
- V1 creates the foundation.
- V2 adds document lifecycle tracking.
- V3 fixes a data quality problem that V1 didn't anticipate.
Each migration represents a decision made at a specific point in the project's history.
This approach matters for two reasons.
First, reproducibility. Any developer cloning the repository gets the exact same schema by running the application. Flyway applies the migrations in order, tracks which ones have already run in its `flyway_schema_history` table, and skips anything that's already been applied.
Second, safety. Because `spring.jpa.hibernate.ddl-auto` is set to `validate`, Hibernate will refuse to start if the schema doesn't match the entity definitions.
Flyway owns the schema.
Hibernate only validates it.

```properties
spring.flyway.enabled=true
spring.flyway.locations=classpath:db/migration
spring.jpa.hibernate.ddl-auto=validate
```
The naming convention for migrations also matters.
Every file follows the pattern `V{number}__{description}.sql`, with two underscores between the version number and the description.
Flyway uses the version number to determine execution order, and a checksum of each file to detect changes.
If a migration file is modified after it has already been applied, Flyway refuses to start.
That constraint is intentional.
It forces schema changes to go through new migrations instead of editing old ones.
This project ends up with three migrations.
The first migration builds the entire foundation in a single SQL file: table, indexes, and trigger.
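On disk, that history is just three files under Flyway's default location. The descriptive parts of these file names are illustrative; check the repository for the exact names:

```
src/main/resources/db/migration/
├── V1__create_documents_table.sql
├── V2__add_document_lifecycle.sql
└── V3__fix_metadata_type.sql
```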
## V1: Building the foundation

The schema is designed around how the search queries will run, not just how the data is stored.
Every column in V1 exists because a specific part of the system depends on it.
Rather than showing the full migration at once, each part is broken down and explained in the order it appears in the file: extension first, table second, indexes third, and trigger last.

### The extension

```sql
CREATE EXTENSION IF NOT EXISTS vector;
```

This line has to come first. PostgreSQL does not support the `VECTOR` type by default, so the rest of the migration would fail without it.
The `IF NOT EXISTS` clause also makes the migration safer. If the extension is already installed in a local environment, CI database, or shared dev database, Flyway can still run the migration without error.
### The table

```sql
CREATE TABLE IF NOT EXISTS documents (
    id          BIGSERIAL PRIMARY KEY,
    title       TEXT NOT NULL,
    content     TEXT NOT NULL,
    metadata    JSONB,
    embedding   VECTOR(1536),
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);
```
Each column exists for a reason:

**`id BIGSERIAL PRIMARY KEY`**
A standard auto-incrementing identifier. `BIGSERIAL` is used instead of `SERIAL` to avoid running out of IDs in larger datasets.

**`title TEXT NOT NULL` and `content TEXT NOT NULL`**
These are the fields that get embedded. `TEXT` is used instead of `VARCHAR` because PostgreSQL handles `TEXT` efficiently, and a hard length limit would be artificial here.

**`metadata JSONB`**
Optional metadata for filtering. `JSONB` is used instead of `JSON` because it is faster to query and supports GIN indexing.

**`embedding VECTOR(1536)`**
The vector representation of the document. 1536 matches the output size of OpenAI's `text-embedding-3-small` model. If the model changes, this column definition would also need to change.

**`created_at TIMESTAMPTZ NOT NULL DEFAULT now()`**
Stores when the row was created. The database sets it automatically.

**`updated_at TIMESTAMPTZ NOT NULL DEFAULT now()`**
Stores when the row was last updated. This value is maintained by a trigger so it stays correct no matter how the row is modified.
### The indexes

V1 adds three indexes, each supporting a different access pattern.

#### The `created_at` index

```sql
CREATE INDEX IF NOT EXISTS idx_documents_created_at
    ON documents (created_at DESC);
```

This is a standard B-tree index for queries that sort documents by creation time.
That is useful for admin pages, auditing, and any endpoint that lists recently created documents.
The `DESC` ordering matches the most common query pattern, so PostgreSQL does not need to sort the results separately.
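A typical query this index serves, sketched here as a plain listing endpoint might issue it:

```sql
-- Served by idx_documents_created_at: the index is already ordered
-- by created_at DESC, so no separate sort step is needed.
SELECT id, title, created_at
FROM documents
ORDER BY created_at DESC
LIMIT 20;
```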
#### The metadata GIN index

```sql
CREATE INDEX IF NOT EXISTS idx_documents_metadata_gin
    ON documents USING gin (metadata);
```

This index supports metadata filtering.
Because metadata is stored as `JSONB`, PostgreSQL can use a GIN index to search inside the JSON structure efficiently. Without this index, metadata filters would require a full table scan.
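For example, a containment filter like the one below can use the GIN index; the `category` key is illustrative and depends on what the application actually stores in `metadata`:

```sql
-- The @> containment operator is one of the JSONB operators
-- that a GIN index on a jsonb column can accelerate.
SELECT id, title
FROM documents
WHERE metadata @> '{"category": "billing"}';
```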
#### The embedding IVFFlat index

```sql
CREATE INDEX IF NOT EXISTS documents_embedding_ivfflat_idx
    ON documents USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
```

This is the most important index in the migration.
Without it, vector similarity search would require comparing the query embedding against every stored embedding in the table.
`ivfflat` is pgvector's approximate nearest-neighbour index. It improves speed by grouping vectors into clusters and searching only the clusters closest to the query vector.
That comes with a tradeoff: slightly lower recall in exchange for much faster queries.
The `lists = 100` setting controls how many clusters are created. For a small dataset, 100 is a reasonable starting point. As the dataset grows, this value should be revisited and the index rebuilt if needed.
The `vector_cosine_ops` operator class is also important. It tells PostgreSQL to optimize the index for cosine distance, which must match the operator used in the query.
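A similarity query that matches this index might look like the sketch below, where `:query_embedding` stands for a 1536-dimension bind parameter produced by the embedding model:

```sql
-- <=> is pgvector's cosine distance operator; it pairs with the
-- vector_cosine_ops operator class, so the index can serve this ORDER BY.
SELECT id, title,
       embedding <=> :query_embedding AS cosine_distance
FROM documents
ORDER BY embedding <=> :query_embedding
LIMIT 10;
```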
The schema and indexes are designed around the path a search query will take through the table:

```
Search Request
      ↓
status = 'READY'
      ↓
metadata filters
      ↓
embedding <=> query_vector
      ↓
ranked results
```
### The trigger

```sql
CREATE OR REPLACE FUNCTION set_updated_at()
RETURNS TRIGGER AS $$
BEGIN
    NEW.updated_at = now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_documents_updated_at
    BEFORE UPDATE ON documents
    FOR EACH ROW EXECUTE FUNCTION set_updated_at();
```
This ensures that `updated_at` is refreshed automatically every time a document row is updated.
Handling this in the database is more reliable than doing it in the service layer. Even if someone updates the row through raw SQL, the timestamp remains correct.
That matters because a change to a document's content may mean the stored embedding is now stale. An accurate timestamp is the only way to tell whether the row changed after the embedding was last computed.
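As a sketch, any ordinary update refreshes the timestamp without the application doing anything (the row id here is arbitrary):

```sql
-- The BEFORE UPDATE trigger rewrites updated_at on every update,
-- whether the statement came from the application or from raw SQL.
UPDATE documents
SET content = 'revised content'
WHERE id = 1;
```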
V1 is the foundation.
Everything in V2 and V3 builds on top of it.
## V2: Adding the document lifecycle
Once embeddings are generated, the database needs to track the state of each document.
Embedding is not instant. It depends on an external API call, which can fail, time out, or hit rate limits.
Without a way to track embedding state, a document could exist in the database but never appear in search results, with no clear explanation why.
V2 introduces a simple lifecycle model so the system always knows whether a document is searchable.
This migration adds three columns and one index:

```sql
ALTER TABLE documents
    ADD COLUMN IF NOT EXISTS status TEXT NOT NULL DEFAULT 'READY',
    ADD COLUMN IF NOT EXISTS embedding_error TEXT,
    ADD COLUMN IF NOT EXISTS embedding_updated_at TIMESTAMPTZ;

CREATE INDEX IF NOT EXISTS idx_documents_status_created
    ON documents (status, created_at DESC);
```
### Why lifecycle tracking is needed
Embedding is performed after the document is stored, not at the same time.
That means a document can exist in several states:
- saved but not embedded yet.
- successfully embedded.
- failed to embed.
Without a status column, the system cannot tell these cases apart.
A failed embedding would simply result in a document that never shows up in search, which makes debugging difficult.
V2 makes document state explicit in the schema.
### The new columns

#### status

```sql
status TEXT NOT NULL DEFAULT 'READY'
```
This column tracks where a document is in its lifecycle.
The system uses three values:

- `PENDING`: document saved, embedding not generated yet.
- `READY`: embedding generated successfully.
- `FAILED`: embedding request failed.
The default value is `READY`.
This might look strange at first, but it keeps the migration safe for existing rows. When the migration runs on a database that already has documents, those rows need a valid status value.
Using `READY` assumes existing data already has embeddings, which is the safest assumption.
New documents created after this migration are explicitly set to `PENDING` by the application before embedding runs.
```
PENDING → READY     (embedding succeeded)
PENDING → FAILED    (embedding failed)
```
Once embedding becomes a separate step, document state must become part of the schema.
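As a sketch of how the application drives this lifecycle (`:embedding`, `:error_message`, and `:id` are bind parameters supplied by the service layer; the inserted values are illustrative):

```sql
-- 1. The document is stored first, explicitly marked PENDING.
INSERT INTO documents (title, content, status)
VALUES ('Refund policy', 'Refunds are processed within 14 days.', 'PENDING');

-- 2. After the embedding API call succeeds:
UPDATE documents
SET embedding = :embedding,
    status = 'READY',
    embedding_updated_at = now()
WHERE id = :id;

-- 3. Or, if the call fails:
UPDATE documents
SET status = 'FAILED',
    embedding_error = :error_message
WHERE id = :id;
```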
#### embedding_error

`embedding_error TEXT` stores the error message when embedding fails.
Most documents will never use this column, so it is nullable.
When a document is in `FAILED` state, this field makes debugging much easier. Instead of searching through logs, the failure reason is visible directly in the database.
The error stored here might be a network timeout, a rate-limited response, or an unexpected payload: whatever the API returned at the time of failure.
#### embedding_updated_at

`embedding_updated_at TIMESTAMPTZ` stores the last time the embedding was generated.
This is different from `updated_at`, which tracks when the document row changes.
This column makes it possible to implement retry logic later.
For example, a background job could look for documents where:

```sql
status = 'FAILED'
AND embedding_updated_at < now() - interval '1 hour'
```

and retry the embedding only for older failures.
This avoids retrying the same document repeatedly in a tight loop.
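Put together, such a retry job might select its batch like this (batch size and interval are illustrative):

```sql
-- Oldest failures first, capped per run to avoid hammering the API.
SELECT id
FROM documents
WHERE status = 'FAILED'
  AND embedding_updated_at < now() - interval '1 hour'
ORDER BY embedding_updated_at
LIMIT 50;
```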
### The composite index

```sql
CREATE INDEX IF NOT EXISTS idx_documents_status_created
    ON documents (status, created_at DESC);
```
This index supports two important query patterns.
First, search queries filter on `status = 'READY'` before performing vector similarity search.
Without an index on status, PostgreSQL would have to scan many rows before the vector index can do its job.
Second, admin queries often need to list documents by status, ordered by creation time.
For example:
- newest failed documents.
- newest pending documents.
- newest ready documents.
The column order in the index matters:

- `status` comes first because it is used for filtering.
- `created_at DESC` comes second because it is used for sorting.

With this order, PostgreSQL can use the same index for both filtering and ordering.
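Both patterns map onto the same index. An admin query for recent failures, for instance, can filter and sort without an extra sort step:

```sql
-- Uses idx_documents_status_created: equality on status (first column),
-- then rows come back already ordered by created_at DESC (second column).
SELECT id, title, created_at
FROM documents
WHERE status = 'FAILED'
ORDER BY created_at DESC
LIMIT 20;
```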
### Why V2 changes the schema
Once embedding becomes a separate step, document state becomes part of the database model.
This is a good example of why schemas evolve.
The first version only needed to store documents.
The second version needs to describe their lifecycle.
And that change belongs in the database, not just in the application.
## V3: Fixing bad data
Real systems rarely get the schema right the first time.
V3 is different from previous migrations. It doesn't add columns or create indexes.
Instead, it fixes data that was stored incorrectly in an earlier version of the service.
```
Before V3:  metadata = "category=billing"
After V3:   metadata = { "raw": "category=billing" }
```
This migration exists because the application originally allowed metadata to be saved as a JSON string instead of a JSON object.
That turned out to be a problem later when filtering was added.
This is a data migration, not a schema migration:

```sql
UPDATE documents
SET metadata = jsonb_build_object('raw', metadata)
WHERE metadata IS NOT NULL
  AND jsonb_typeof(metadata) = 'string';
```
### What went wrong

The metadata column is defined as `JSONB`.
That means PostgreSQL will accept any valid JSON value: an object, an array, a string, a number, a boolean, or null.
The application's filter logic assumes metadata is always a JSON object, so it can query values using operators like `metadata ->> 'category'`.
But earlier versions of the service allowed values like `"metadata": "category=billing"`.
This is valid JSON, but it is not an object.
Once filters were added, these rows stopped working correctly.
Queries expecting key-value pairs would fail or return incorrect results.
Later versions of the API added validation to prevent this, but by then some bad data already existed in the database.
### What the migration does

The migration finds rows where metadata is stored as a string (`jsonb_typeof(metadata) = 'string'`) and wraps each one in a valid JSON object with `jsonb_build_object('raw', metadata)`.

So this:

`"category=billing"`

becomes:

`{"raw": "category=billing"}`
This approach keeps the original value instead of deleting it.
The document stays intact, and the metadata becomes valid JSON that the filter system can handle.
If needed, a developer can still inspect the original value inside the `raw` field.
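For instance, the repaired rows remain easy to find and read back (a sketch using standard JSONB operators):

```sql
-- The ? operator tests for key existence; ->> extracts the value as text.
SELECT id, metadata ->> 'raw' AS original_metadata
FROM documents
WHERE metadata ? 'raw';
```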
### Why this migration matters
This migration highlights something that happens in almost every real system:
Migrations do not only add features; they also repair data.
Nothing new was added to the schema, but the data became consistent again.
This kind of migration is common in production systems. The important part is not avoiding mistakes, it is fixing them safely without losing information.
V3 is a record of a real problem that existed, and the decision made to correct it.
## The Complete Schema

After all three migrations, the schema is now complete.
Here is the final table definition after V1, V2, and V3 have all been applied.

```sql
CREATE TABLE documents (
    id                   BIGSERIAL PRIMARY KEY,
    title                TEXT NOT NULL,
    content              TEXT NOT NULL,
    metadata             JSONB,
    embedding            VECTOR(1536),
    status               TEXT NOT NULL DEFAULT 'READY',
    embedding_error      TEXT,
    embedding_updated_at TIMESTAMPTZ,
    created_at           TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at           TIMESTAMPTZ NOT NULL DEFAULT now()
);
```
Nothing in this table is accidental.
Every column exists because a specific part of the system needs it.
The schema now describes not just the data, but the behavior of the system.
### The Indexes

Each index exists for a specific query pattern; remove any one of them and something in the system either breaks or slows down significantly at scale.

| Index | Type | Purpose |
|---|---|---|
| `idx_documents_created_at` | B-tree | Ordering by creation time |
| `idx_documents_metadata_gin` | GIN | Metadata filtering |
| `documents_embedding_ivfflat_idx` | IVFFlat | Vector similarity search |
| `idx_documents_status_created` | B-tree (composite) | Status filtering + ordering |
## What's Coming Next
The schema can now store embeddings, but generating them is where things get interesting.
Part 3 covers the full implementation of the embedding client using Java's built-in `HttpClient`, with no third-party SDK.
It also covers the bugs that are hardest to catch: the ones that don't throw exceptions but silently produce wrong similarity scores.
See you in Part 3.