Allan Roberto

Posted on Mar 8 • Edited on Apr 7

Indexing Knowledge Base Content with Spring Boot and pgvector

#ai #java #springboot #programming

In the previous article, we configured PostgreSQL as a vector database using pgvector.

But a vector database alone is not enough.

Before we can query embeddings, we must index our data.

In a real AI knowledge base, indexing usually follows a pipeline like this:

Document saved
→ event published
→ content chunked
→ embeddings generated
→ chunk vectors stored in PostgreSQL

This design separates the document persistence layer from the AI indexing process, making the system easier to scale and maintain.
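The decoupling can be pictured with a plain-Java sketch that does not depend on Spring. The names `MiniEventBus`, `subscribe`, and `publish` are hypothetical and exist only for illustration; the implementation later in this article uses Spring's `ApplicationEventPublisher` instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical, minimal event bus: the code that saves a document
// only publishes an id and knows nothing about chunking or embeddings.
final class MiniEventBus {

    private final List<Consumer<Long>> listeners = new ArrayList<>();

    // Register a listener that receives the id of each saved document.
    void subscribe(Consumer<Long> listener) {
        listeners.add(listener);
    }

    // Notify all listeners synchronously, in registration order.
    void publish(Long documentId) {
        for (Consumer<Long> listener : listeners) {
            listener.accept(documentId);
        }
    }
}
```

The persistence side would call `publish(savedId)`, while the indexing side registers itself with `subscribe(...)`. Either side can then change without touching the other.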

In this article we will implement this indexing pipeline using Spring Boot.

Project Goal

Our goal is to support this workflow:

Save a knowledge document through a REST API
Automatically split the document into smaller chunks
Generate embeddings for each chunk
Store those embeddings in PostgreSQL using pgvector

Once indexed, the knowledge base will be ready for semantic search.

Maven Dependencies

Add the following dependencies to your pom.xml.

Dependencies

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
</dependency>
<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
    <optional>true</optional>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-validation</artifactId>
</dependency>

Database Configuration

Example application.yml:

spring:
  datasource:
    url: jdbc:postgresql://localhost:5432/vectordb
    username: admin
    password: admin

  jpa:
    hibernate:
      ddl-auto: update
    show-sql: true

Database Model

We will use two tables.

knowledge_document - Stores the original document.
knowledge_document_chunk - Stores each chunk's text and its embedding.

This separation is important because one document may generate many chunks.

Index Status Enum

This enum tracks the lifecycle of indexing.

Enum: IndexStatus

package com.example.knowledgebase.domain;

public enum IndexStatus {
    PENDING,
    INDEXING,
    INDEXED,
    FAILED
}
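One way to make the lifecycle explicit is a small transition check. The sketch below is an assumption about which transitions make sense (for example, that a FAILED or INDEXED document may go back to INDEXING for a retry or re-index); the article itself does not enforce any of this. `IndexStatusTransitions` and `canTransition` are hypothetical names.

```java
// Hypothetical helper that encodes an assumed indexing lifecycle.
final class IndexStatusTransitions {

    // Local copy of the article's enum so the sketch compiles on its own.
    enum IndexStatus { PENDING, INDEXING, INDEXED, FAILED }

    // Assumed rules:
    //   PENDING  -> INDEXING
    //   INDEXING -> INDEXED or FAILED
    //   INDEXED  -> INDEXING (re-index)
    //   FAILED   -> INDEXING (retry)
    static boolean canTransition(IndexStatus from, IndexStatus to) {
        return switch (from) {
            case PENDING -> to == IndexStatus.INDEXING;
            case INDEXING -> to == IndexStatus.INDEXED || to == IndexStatus.FAILED;
            case INDEXED, FAILED -> to == IndexStatus.INDEXING;
        };
    }
}
```

A setter or service method could call `canTransition` before changing the status and reject updates that skip a step.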

Entity: KnowledgeDocument

package com.example.knowledgebase.domain;

import jakarta.persistence.*;
import lombok.*;

import java.time.OffsetDateTime;

@Entity
@Table(name = "knowledge_document")
@Getter
@Setter
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class KnowledgeDocument {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    private String title;

    @Column(columnDefinition = "TEXT")
    private String content;

    @Enumerated(EnumType.STRING)
    private IndexStatus indexStatus;

    private OffsetDateTime createdAt;

    private OffsetDateTime updatedAt;

    @PrePersist
    public void prePersist() {
        createdAt = OffsetDateTime.now();
        updatedAt = OffsetDateTime.now();

        if (indexStatus == null) {
            indexStatus = IndexStatus.PENDING;
        }
    }

    @PreUpdate
    public void preUpdate() {
        updatedAt = OffsetDateTime.now();
    }
}

Entity: KnowledgeDocumentChunk

Each chunk of text gets its own embedding.

package com.example.knowledgebase.domain;

import jakarta.persistence.*;
import lombok.*;

@Entity
@Table(name = "knowledge_document_chunk")
@Getter
@Setter
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class KnowledgeDocumentChunk {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    private Long documentId;

    private Integer chunkIndex;

    @Column(columnDefinition = "TEXT")
    private String chunkText;

    @Column(columnDefinition = "vector(1536)")
    private float[] embedding;

    private String embeddingModel;
}

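The vector(1536) column type is provided by pgvector. The number is the embedding dimension, so every float[] stored here must have exactly 1536 elements. Depending on your Hibernate version, you may also need an explicit vector type mapping (for example, the hibernate-vector module) for float[] to be read and written correctly.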
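For debugging, or for hand-written SQL, it helps to know that pgvector accepts a vector as a bracketed, comma-separated text literal such as `[1,2,3]`. The helper below is a minimal sketch of that conversion plus a dimension check; `VectorLiterals`, `toLiteral`, and `requireDimensions` are hypothetical names, not part of pgvector or Spring.

```java
// Hypothetical helper for working with pgvector's text representation.
final class VectorLiterals {

    // Matches vector(1536) in the entity above.
    static final int DIMENSIONS = 1536;

    // Formats a float[] as a pgvector text literal, e.g. [1.0,0.5,-2.0].
    static String toLiteral(float[] embedding) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < embedding.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(embedding[i]);
        }
        return sb.append(']').toString();
    }

    // Fails fast when a vector does not match the column's declared dimension.
    static void requireDimensions(float[] embedding, int expected) {
        if (embedding.length != expected) {
            throw new IllegalArgumentException(
                    "expected " + expected + " dimensions but got " + embedding.length);
        }
    }
}
```

Calling `requireDimensions(embedding, VectorLiterals.DIMENSIONS)` before saving a chunk surfaces a model mismatch as a clear error rather than a database failure.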

Repository: KnowledgeDocumentRepository

package com.example.knowledgebase.repository;

import com.example.knowledgebase.domain.KnowledgeDocument;
import org.springframework.data.jpa.repository.JpaRepository;

public interface KnowledgeDocumentRepository
        extends JpaRepository<KnowledgeDocument, Long> {
}

Repository: KnowledgeDocumentChunkRepository

package com.example.knowledgebase.repository;

import com.example.knowledgebase.domain.KnowledgeDocumentChunk;
import org.springframework.data.jpa.repository.JpaRepository;

import java.util.List;

public interface KnowledgeDocumentChunkRepository
        extends JpaRepository<KnowledgeDocumentChunk, Long> {

    List<KnowledgeDocumentChunk> findByDocumentIdOrderByChunkIndexAsc(Long documentId);

    void deleteByDocumentId(Long documentId);
}

The deleteByDocumentId method will be useful for re-indexing documents.

Request DTO

package com.example.knowledgebase.api;

import jakarta.validation.constraints.NotBlank;

public record CreateKnowledgeDocumentRequest(
        @NotBlank String title,
        @NotBlank String content
) {}

Domain Event

After a document is saved, we publish an event.

package com.example.knowledgebase.event;

public record KnowledgeDocumentCreatedEvent(Long documentId) {}

Chunking Service

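Large documents should be split into smaller pieces. The chunker below splits the text into paragraphs at blank lines and hard-cuts any paragraph longer than 500 characters.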

package com.example.knowledgebase.service;

import java.util.List;
import java.util.ArrayList;

import org.springframework.stereotype.Component;

@Component
public class TextChunker {

    private static final int MAX_CHARS = 500;

    public List<String> chunk(String text) {

        if (text == null || text.isBlank()) {
            return List.of();
        }

        String[] paragraphs = text.split("\\R\\s*\\R");

        List<String> chunks = new ArrayList<>();

        for (String paragraph : paragraphs) {
            appendParagraph(chunks, paragraph.strip());
        }

        return chunks;
    }

    private void appendParagraph(List<String> chunks, String paragraph) {

        if (paragraph.isEmpty()) {
            return;
        }

        if (paragraph.length() > MAX_CHARS) {
            splitLongParagraph(chunks, paragraph);
            return;
        }

        chunks.add(paragraph);
    }

    private void splitLongParagraph(List<String> chunks, String paragraph) {

        int start = 0;
        while (start < paragraph.length()) {
            int end = Math.min(start + MAX_CHARS, paragraph.length());
            chunks.add(paragraph.substring(start, end));
            start = end;
        }
    }
}

Chunking improves retrieval precision: a later search can match the specific passage that answers a question instead of the whole document.
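A common variation, not used by the `TextChunker` above, is to let consecutive chunks overlap so that a sentence cut at a boundary still appears whole in one of them. The sketch below shows that sliding-window idea on raw characters; `OverlappingChunker` and its parameters are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sliding-window chunker with character overlap.
final class OverlappingChunker {

    // Splits text into windows of at most maxChars characters.
    // Each window starts (maxChars - overlap) characters after the previous one,
    // so neighbouring windows share `overlap` characters.
    static List<String> chunk(String text, int maxChars, int overlap) {
        if (maxChars <= 0 || overlap < 0 || overlap >= maxChars) {
            throw new IllegalArgumentException("require 0 <= overlap < maxChars");
        }

        List<String> chunks = new ArrayList<>();
        if (text == null || text.isBlank()) {
            return chunks;
        }

        int step = maxChars - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + maxChars, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last window reached the end of the text
            }
        }
        return chunks;
    }
}
```

For example, `OverlappingChunker.chunk("abcdefghij", 4, 2)` yields the windows `abcd`, `cdef`, `efgh`, and `ghij`. The trade-off is more chunks, and therefore more embeddings to generate and store.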

Embedding Service

We define an abstraction so the embedding provider can change later.

package com.example.knowledgebase.service;

public interface EmbeddingService {

    float[] generateEmbedding(String text);

    String modelName();
}

Fake Embedding Service

For the purpose of this tutorial, we simulate embeddings.

package com.example.knowledgebase.service;

import org.springframework.stereotype.Service;

import java.util.Random;

@Service
public class FakeEmbeddingService implements EmbeddingService {

    private static final int DIMENSIONS = 1536;
    private final Random random = new Random();

    @Override
    public float[] generateEmbedding(String text) {

        float[] vector = new float[DIMENSIONS];

        for (int i = 0; i < DIMENSIONS; i++) {
            vector[i] = random.nextFloat();
        }

        return vector;
    }

    @Override
    public String modelName() {
        return "fake-embedding-model";
    }
}

In a real system this service would call an embedding API.
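Because the fake service above returns different random numbers on every call, indexing the same text twice gives different vectors. If repeatable output is useful for tests, one option is to seed the generator from the text. This is only a sketch: the vectors still carry no semantic meaning, and `DeterministicFakeEmbedding` is a hypothetical name.

```java
import java.util.Random;

// Hypothetical test helper: same text in, same vector out.
final class DeterministicFakeEmbedding {

    static final int DIMENSIONS = 1536;

    // Seeds the generator with the text's hash code, which the Java
    // specification defines deterministically for String.
    static float[] embed(String text) {
        Random random = new Random(text.hashCode());
        float[] vector = new float[DIMENSIONS];
        for (int i = 0; i < DIMENSIONS; i++) {
            vector[i] = random.nextFloat();
        }
        return vector;
    }
}
```

The same idea could be wrapped in a class implementing `EmbeddingService` and used in place of `FakeEmbeddingService` in tests.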

KnowledgeDocumentService

This service saves documents and triggers the indexing pipeline.

package com.example.knowledgebase.service;

import com.example.knowledgebase.api.CreateKnowledgeDocumentRequest;
import com.example.knowledgebase.domain.IndexStatus;
import com.example.knowledgebase.domain.KnowledgeDocument;
import com.example.knowledgebase.event.KnowledgeDocumentCreatedEvent;
import com.example.knowledgebase.repository.KnowledgeDocumentRepository;
import lombok.RequiredArgsConstructor;
import org.springframework.context.ApplicationEventPublisher;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
@RequiredArgsConstructor
public class KnowledgeDocumentService {

    private final KnowledgeDocumentRepository repository;
    private final ApplicationEventPublisher eventPublisher;

    @Transactional
    public Long create(CreateKnowledgeDocumentRequest request) {

        KnowledgeDocument document = KnowledgeDocument.builder()
                .title(request.title())
                .content(request.content())
                .indexStatus(IndexStatus.PENDING)
                .build();

        KnowledgeDocument saved = repository.save(document);

        eventPublisher.publishEvent(
                new KnowledgeDocumentCreatedEvent(saved.getId())
        );

        return saved.getId();
    }
}

Event Listener

This component performs the indexing process.

package com.example.knowledgebase.event;

import com.example.knowledgebase.domain.*;
import com.example.knowledgebase.repository.*;
import com.example.knowledgebase.service.*;
import lombok.RequiredArgsConstructor;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

import java.util.List;

@Component
@RequiredArgsConstructor
public class KnowledgeDocumentCreatedListener {

    private final KnowledgeDocumentRepository documentRepository;
    private final KnowledgeDocumentChunkRepository chunkRepository;
    private final TextChunker chunker;
    private final EmbeddingService embeddingService;

    @Transactional
    @EventListener
    public void handle(KnowledgeDocumentCreatedEvent event) {

        KnowledgeDocument document =
                documentRepository.findById(event.documentId())
                        .orElseThrow();

        document.setIndexStatus(IndexStatus.INDEXING);

        List<String> chunks = chunker.chunk(document.getContent());

        int index = 0;

        for (String chunkText : chunks) {

            float[] embedding =
                    embeddingService.generateEmbedding(chunkText);

            KnowledgeDocumentChunk chunk =
                    KnowledgeDocumentChunk.builder()
                            .documentId(document.getId())
                            .chunkIndex(index++)
                            .chunkText(chunkText)
                            .embedding(embedding)
                            .embeddingModel(embeddingService.modelName())
                            .build();

            chunkRepository.save(chunk);
        }

        document.setIndexStatus(IndexStatus.INDEXED);
    }
}

REST Controller

package com.example.knowledgebase.api;

import com.example.knowledgebase.service.KnowledgeDocumentService;
import jakarta.validation.Valid;
import lombok.RequiredArgsConstructor;
import org.springframework.http.HttpStatus;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/documents")
@RequiredArgsConstructor
public class KnowledgeDocumentController {

    private final KnowledgeDocumentService service;

    @PostMapping
    @ResponseStatus(HttpStatus.CREATED)
    public CreateKnowledgeDocumentResponse create(
            // @Valid triggers the @NotBlank checks declared on the request record
            @Valid @RequestBody CreateKnowledgeDocumentRequest request
    ) {

        Long id = service.create(request);

        return new CreateKnowledgeDocumentResponse(id);
    }
}

Response DTO

package com.example.knowledgebase.api;

public record CreateKnowledgeDocumentResponse(Long id) {}

What Happens After the Request?

When a document is created:

POST /documents

The system performs:

1. Save the document
2. Publish the event
3. Split the content into chunks
4. Generate embeddings
5. Store the chunk vectors

The knowledge base is now indexed and ready for semantic retrieval.

Production Considerations

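Spring events are synchronous by default. In this implementation the listener runs on the request thread and inside the same transaction as the save, so POST /documents does not return until every chunk has been embedded and stored, and an exception during indexing also rolls back the saved document.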

In production systems you might evolve this pipeline using:

@Async event listeners
message queues (Kafka, RabbitMQ)
background workers
retry mechanisms

Moving indexing off the request thread keeps slow or failing embedding calls from affecting API response times.
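The background-worker and retry ideas can be sketched in plain Java with `CompletableFuture`. This is an illustration under assumptions, not the article's implementation: `BackgroundIndexer`, `indexAsync`, and the local `Status` enum are hypothetical, and a real system would add logging and a delay between attempts.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

// Hypothetical sketch: run indexing on another thread and retry on failure.
final class BackgroundIndexer {

    // Final outcome of an indexing run, mirroring two IndexStatus values.
    enum Status { INDEXED, FAILED }

    // Submits the task to the executor and tries it up to maxAttempts times.
    // The caller gets a future immediately and is not blocked by indexing.
    static CompletableFuture<Status> indexAsync(
            Runnable indexingTask, int maxAttempts, Executor executor) {

        return CompletableFuture.supplyAsync(() -> {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    indexingTask.run();
                    return Status.INDEXED;
                } catch (RuntimeException e) {
                    // Swallow and retry; the loop ends after maxAttempts.
                }
            }
            return Status.FAILED;
        }, executor);
    }
}
```

The returned status is what would be written back to the document, which also gives the FAILED value of `IndexStatus` a purpose.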

Conclusion

In this article we implemented the indexing side of an AI knowledge base.

The system now:

saves documents
splits content into chunks
generates embeddings
stores vectors in PostgreSQL

This is the foundation required for semantic search and Retrieval-Augmented Generation (RAG).
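To make "semantic search" a little more concrete: stored vectors are usually compared with a similarity measure such as cosine similarity, and pgvector's `<=>` operator returns the corresponding cosine distance (1 minus the similarity). The sketch below computes the similarity in plain Java; `CosineSimilarity` is a hypothetical helper, and returning 0 for a zero-length vector is a choice made here for simplicity.

```java
// Hypothetical helper showing how two embeddings can be compared.
final class CosineSimilarity {

    // Returns a value in [-1, 1]; higher means the vectors point in
    // more similar directions.
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }

        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // similarity with a zero vector is treated as 0 here
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

With the random vectors from `FakeEmbeddingService` these scores are meaningless; they only become useful once a real embedding model is plugged in.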

In the next article we will:

convert user questions into embeddings
search the vector database
retrieve relevant chunks
build the AI prompt

Bonus — How to Test Article 3

POST /documents
Content-Type: application/json

{
  "title": "Spring Boot and pgvector",
  "content": "PostgreSQL can be used as a vector database with the pgvector extension.\n\nThe extension is enabled with CREATE EXTENSION vector.\n\nSpring Boot can index documents by chunking content and generating embeddings.\n\nChunking improves retrieval quality because specific sections can be retrieved instead of the whole document."
}

POST /documents
Content-Type: application/json

{
  "title": "Java Virtual Threads",
  "content": "Java 21 introduced virtual threads as part of Project Loom.\n\nVirtual threads are lightweight and managed by the JVM.\n\nThey help applications handle many concurrent tasks with a simpler programming model.\n\nThey are especially useful for workloads with blocking I/O."
}

POST /documents
Content-Type: application/json

{
  "title": "Semantic Search",
  "content": "Semantic search looks for meaning instead of exact words.\n\nEmbeddings convert text into numeric vectors.\n\nVector databases can compare embeddings using similarity search.\n\nThis approach is commonly used in Retrieval-Augmented Generation systems."
}

SQL to Verify Indexed Documents

Check the main documents table:

SELECT id, title, index_status
FROM knowledge_document
ORDER BY id;

The result should look like this:

id	title	index_status
1	Spring Boot and pgvector	INDEXED
2	Java Virtual Threads	INDEXED
3	Semantic Search	INDEXED

Sequence

Meaning: How Data Vectorization Powers AI
Turning PostgreSQL Into a Vector Database with Docker
Indexing Knowledge Base Content with Spring Boot and pgvector
Building Semantic Search with Spring Boot, PostgreSQL, and pgvector (RAG Retrieval)
How I Added LangChain4j Without Letting It Take Over My Spring Boot App

Project Here
