
Allan Roberto


Indexing Knowledge Base Content with Spring Boot and pgvector

In the previous article, we configured PostgreSQL as a vector database using pgvector.

But a vector database alone is not enough.

Before we can query embeddings, we must index our data.

In a real AI knowledge base, indexing usually follows a pipeline like this:

Document saved
→ event published
→ content chunked
→ embeddings generated
→ chunk vectors stored in PostgreSQL

This design separates the document persistence layer from the AI indexing process, making the system easier to scale and maintain.

In this article we will implement this indexing pipeline using Spring Boot.


Project Goal

Our goal is to support this workflow:

  1. Save a knowledge document through a REST API
  2. Automatically split the document into smaller chunks
  3. Generate embeddings for each chunk
  4. Store those embeddings in PostgreSQL using pgvector

Once indexed, the knowledge base will be ready for semantic search.


Maven Dependencies

Add the following dependencies to your pom.xml.

Dependencies

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
</dependency>
<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
    <optional>true</optional>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-validation</artifactId>
</dependency>

Database Configuration

Example application.yml:

spring:
  datasource:
    url: jdbc:postgresql://localhost:5432/vectordb
    username: admin
    password: admin

  jpa:
    hibernate:
      ddl-auto: update
    show-sql: true

Database Model

We will use two tables.

  • knowledge_document: stores the original document
  • knowledge_document_chunk: stores chunk text and embeddings

This separation is important because one document may generate many chunks.


Index Status Enum

This enum tracks the lifecycle of indexing.

Enum: IndexStatus

package com.example.knowledgebase.domain;

public enum IndexStatus {
    PENDING,
    INDEXING,
    INDEXED,
    FAILED
}
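
To make the lifecycle concrete, here is a minimal sketch of which status changes the pipeline might allow. The IndexStatusTransitions helper and its transition table are assumptions for illustration, not part of the project code; in particular, allowing INDEXED and FAILED documents to go back to INDEXING assumes that re-indexing is supported.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: encodes the status changes we assume the pipeline allows.
final class IndexStatusTransitions {

    // Local copy of the article's enum so the sketch compiles on its own.
    enum IndexStatus { PENDING, INDEXING, INDEXED, FAILED }

    private static final Map<IndexStatus, Set<IndexStatus>> ALLOWED = Map.of(
            IndexStatus.PENDING, EnumSet.of(IndexStatus.INDEXING),
            IndexStatus.INDEXING, EnumSet.of(IndexStatus.INDEXED, IndexStatus.FAILED),
            // Assumption: re-indexing restarts from INDEXED or FAILED.
            IndexStatus.INDEXED, EnumSet.of(IndexStatus.INDEXING),
            IndexStatus.FAILED, EnumSet.of(IndexStatus.INDEXING)
    );

    // Returns true when moving from one status to the other is allowed.
    static boolean canTransition(IndexStatus from, IndexStatus to) {
        return ALLOWED.getOrDefault(from, Set.of()).contains(to);
    }
}
```

A guard like this could be called before every status update so that, for example, a document never jumps from PENDING straight to INDEXED.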

Entity: KnowledgeDocument

package com.example.knowledgebase.domain;

import jakarta.persistence.*;
import lombok.*;

import java.time.OffsetDateTime;

@Entity
@Table(name = "knowledge_document")
@Getter
@Setter
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class KnowledgeDocument {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    private String title;

    @Column(columnDefinition = "TEXT")
    private String content;

    @Enumerated(EnumType.STRING)
    private IndexStatus indexStatus;

    private OffsetDateTime createdAt;

    private OffsetDateTime updatedAt;

    @PrePersist
    public void prePersist() {
        createdAt = OffsetDateTime.now();
        updatedAt = OffsetDateTime.now();

        if (indexStatus == null) {
            indexStatus = IndexStatus.PENDING;
        }
    }

    @PreUpdate
    public void preUpdate() {
        updatedAt = OffsetDateTime.now();
    }
}

Entity: KnowledgeDocumentChunk
Each chunk of text gets its own embedding.

package com.example.knowledgebase.domain;

import jakarta.persistence.*;
import lombok.*;

@Entity
@Table(name = "knowledge_document_chunk")
@Getter
@Setter
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class KnowledgeDocumentChunk {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    private Long documentId;

    private Integer chunkIndex;

    @Column(columnDefinition = "TEXT")
    private String chunkText;

    @Column(columnDefinition = "vector(1536)")
    private float[] embedding;

    private String embeddingModel;
}

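The vector(1536) column type is provided by pgvector, and 1536 must match the dimension of the embeddings you store. Depending on your Hibernate version, a plain float[] field may not bind to a vector column without extra mapping; Hibernate 6.4 and later offer this through the hibernate-vector module with @JdbcTypeCode(SqlTypes.VECTOR).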
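
pgvector accepts vectors in a text form such as [0.1,0.2,0.3] and rejects values whose dimension does not match the column. The sketch below shows one way to build that literal from a float[] and check the dimension first. PgVectorLiteral is a hypothetical helper name; it is only relevant if you write vectors through plain SQL instead of an ORM mapping.

```java
import java.util.StringJoiner;

// Hypothetical helper: formats a float[] as pgvector's text input form.
final class PgVectorLiteral {

    static String toLiteral(float[] vector, int expectedDimensions) {
        if (vector.length != expectedDimensions) {
            throw new IllegalArgumentException(
                    "expected " + expectedDimensions + " dimensions, got " + vector.length);
        }
        StringJoiner joiner = new StringJoiner(",", "[", "]");
        for (float value : vector) {
            // pgvector does not store NaN or infinite elements, so fail early.
            if (!Float.isFinite(value)) {
                throw new IllegalArgumentException("vector elements must be finite");
            }
            joiner.add(Float.toString(value));
        }
        return joiner.toString();
    }
}
```

For a three-dimensional column, toLiteral(new float[] {0.5f, 1f, -2f}, 3) produces [0.5,1.0,-2.0].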

Repository: KnowledgeDocumentRepository

package com.example.knowledgebase.repository;

import com.example.knowledgebase.domain.KnowledgeDocument;
import org.springframework.data.jpa.repository.JpaRepository;

public interface KnowledgeDocumentRepository
        extends JpaRepository<KnowledgeDocument, Long> {
}

Repository: KnowledgeDocumentChunkRepository

package com.example.knowledgebase.repository;

import com.example.knowledgebase.domain.KnowledgeDocumentChunk;
import org.springframework.data.jpa.repository.JpaRepository;

import java.util.List;

public interface KnowledgeDocumentChunkRepository
        extends JpaRepository<KnowledgeDocumentChunk, Long> {

    List<KnowledgeDocumentChunk> findByDocumentIdOrderByChunkIndexAsc(Long documentId);

    void deleteByDocumentId(Long documentId);
}

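The deleteByDocumentId method will be useful for re-indexing documents: the old chunks are removed before the new ones are written. Derived delete methods must run inside a transaction, so call it from a @Transactional method.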
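
The delete-then-insert idea behind re-indexing can be sketched without a database. InMemoryChunkStore and its Chunk record are hypothetical stand-ins for the repository and entity above; the point is only that running the re-index twice leaves exactly one set of chunks per document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for KnowledgeDocumentChunkRepository.
final class InMemoryChunkStore {

    record Chunk(long documentId, int chunkIndex, String text) {}

    private final List<Chunk> chunks = new ArrayList<>();

    void deleteByDocumentId(long documentId) {
        chunks.removeIf(chunk -> chunk.documentId() == documentId);
    }

    // Re-indexing: drop the old chunks first, then write the new ones in order.
    void reindex(long documentId, List<String> newTexts) {
        deleteByDocumentId(documentId);
        for (int i = 0; i < newTexts.size(); i++) {
            chunks.add(new Chunk(documentId, i, newTexts.get(i)));
        }
    }

    List<Chunk> findByDocumentId(long documentId) {
        return chunks.stream()
                .filter(chunk -> chunk.documentId() == documentId)
                .toList();
    }
}
```

With the real repository, both steps would sit in one transaction so that readers never see a document with its chunks half replaced.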

Request DTO

package com.example.knowledgebase.api;

import jakarta.validation.constraints.NotBlank;

public record CreateKnowledgeDocumentRequest(
        @NotBlank String title,
        @NotBlank String content
) {}

Domain Event

After a document is saved, we publish an event.

package com.example.knowledgebase.event;

public record KnowledgeDocumentCreatedEvent(Long documentId) {}

Chunking Service

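Large documents should be split into smaller pieces so that each embedding represents one focused passage. The chunker below splits on blank lines (paragraphs) and hard-splits any paragraph longer than 500 characters.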

package com.example.knowledgebase.service;

import java.util.List;
import java.util.ArrayList;

import org.springframework.stereotype.Component;

@Component
public class TextChunker {

    private static final int MAX_CHARS = 500;

    public List<String> chunk(String text) {

        if (text == null || text.isBlank()) {
            return List.of();
        }

        String[] paragraphs = text.split("\\R\\s*\\R");

        List<String> chunks = new ArrayList<>();

        for (String paragraph : paragraphs) {
            appendParagraph(chunks, paragraph.strip());
        }

        return chunks;
    }

    private void appendParagraph(List<String> chunks, String paragraph) {

        if (paragraph.isEmpty()) {
            return;
        }

        if (paragraph.length() > MAX_CHARS) {
            splitLongParagraph(chunks, paragraph);
            return;
        }

        chunks.add(paragraph);
    }

    private void splitLongParagraph(List<String> chunks, String paragraph) {

        int start = 0;
        while (start < paragraph.length()) {
            int end = Math.min(start + MAX_CHARS, paragraph.length());
            chunks.add(paragraph.substring(start, end));
            start = end;
        }
    }
}


Chunking improves retrieval precision: a search can return the specific passage that matches instead of the whole document.
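
One limitation of a fixed-size split is that it can cut a long paragraph in the middle of a word. As a possible refinement, the sketch below backs up to the last whitespace inside each window when it can. WordBoundarySplitter is a hypothetical name and this is a variation on the chunker above, not the code the project uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical variation: split long text at whitespace where possible.
final class WordBoundarySplitter {

    static List<String> split(String paragraph, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        int length = paragraph.length();
        int start = 0;
        while (start < length) {
            int end = Math.min(start + maxChars, length);
            // If the cut would land inside a word, back up to the last whitespace.
            if (end < length && !Character.isWhitespace(paragraph.charAt(end))) {
                int lastSpace = lastWhitespace(paragraph, start, end);
                if (lastSpace > start) {
                    end = lastSpace;
                }
            }
            String piece = paragraph.substring(start, end).strip();
            if (!piece.isEmpty()) {
                chunks.add(piece);
            }
            start = end;
        }
        return chunks;
    }

    // Index of the last whitespace strictly after 'from' and before 'to', or -1.
    private static int lastWhitespace(String text, int from, int to) {
        for (int i = to - 1; i > from; i--) {
            if (Character.isWhitespace(text.charAt(i))) {
                return i;
            }
        }
        return -1;
    }
}
```

A word longer than maxChars is still cut at the character limit, so every chunk stays within the size bound.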


Embedding Service

We define an abstraction so the embedding provider can change later.

package com.example.knowledgebase.service;

public interface EmbeddingService {

    float[] generateEmbedding(String text);

    String modelName();
}

Fake Embedding Service

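For the purpose of this tutorial, we simulate embeddings with random values. These vectors carry no meaning, so they exercise the storage pipeline but cannot produce useful search results.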

package com.example.knowledgebase.service;

import org.springframework.stereotype.Service;

import java.util.Random;

@Service
public class FakeEmbeddingService implements EmbeddingService {

    private static final int DIMENSIONS = 1536;
    private final Random random = new Random();

    @Override
    public float[] generateEmbedding(String text) {

        float[] vector = new float[DIMENSIONS];

        for (int i = 0; i < DIMENSIONS; i++) {
            vector[i] = random.nextFloat();
        }

        return vector;
    }

    @Override
    public String modelName() {
        return "fake-embedding-model";
    }
}

In a real system this service would call an embedding API.
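
Until a real provider is wired in, a deterministic fake can make tests repeatable, because the same text always yields the same vector. The sketch below seeds the generator from the text and normalizes the result to unit length. DeterministicFakeEmbedding is a hypothetical name, and the vectors are still not semantically meaningful.

```java
import java.util.Random;

// Hypothetical test double: same text in, same vector out.
final class DeterministicFakeEmbedding {

    static final int DIMENSIONS = 1536;

    static float[] embed(String text) {
        // String.hashCode() is specified by the JDK, so the seed is stable across runs.
        Random random = new Random(text.hashCode());
        float[] vector = new float[DIMENSIONS];
        double sumOfSquares = 0.0;
        for (int i = 0; i < DIMENSIONS; i++) {
            vector[i] = random.nextFloat() * 2f - 1f; // values in [-1, 1)
            sumOfSquares += vector[i] * vector[i];
        }
        float norm = (float) Math.sqrt(sumOfSquares);
        if (norm == 0f) {
            return vector; // degenerate case, nothing to normalize
        }
        for (int i = 0; i < DIMENSIONS; i++) {
            vector[i] /= norm; // scale to unit length
        }
        return vector;
    }
}
```

Wrapped in a class that implements EmbeddingService, this could replace FakeEmbeddingService in tests that compare stored vectors between runs.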


KnowledgeDocumentService

This service saves documents and triggers the indexing pipeline.

package com.example.knowledgebase.service;

import com.example.knowledgebase.api.CreateKnowledgeDocumentRequest;
import com.example.knowledgebase.domain.IndexStatus;
import com.example.knowledgebase.domain.KnowledgeDocument;
import com.example.knowledgebase.event.KnowledgeDocumentCreatedEvent;
import com.example.knowledgebase.repository.KnowledgeDocumentRepository;
import lombok.RequiredArgsConstructor;
import org.springframework.context.ApplicationEventPublisher;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
@RequiredArgsConstructor
public class KnowledgeDocumentService {

    private final KnowledgeDocumentRepository repository;
    private final ApplicationEventPublisher eventPublisher;

    @Transactional
    public Long create(CreateKnowledgeDocumentRequest request) {

        KnowledgeDocument document = KnowledgeDocument.builder()
                .title(request.title())
                .content(request.content())
                .indexStatus(IndexStatus.PENDING)
                .build();

        KnowledgeDocument saved = repository.save(document);

        eventPublisher.publishEvent(
                new KnowledgeDocumentCreatedEvent(saved.getId())
        );

        return saved.getId();
    }
}

Event Listener

This component performs the indexing process.

package com.example.knowledgebase.event;

import com.example.knowledgebase.domain.*;
import com.example.knowledgebase.repository.*;
import com.example.knowledgebase.service.*;
import lombok.RequiredArgsConstructor;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

import java.util.List;

@Component
@RequiredArgsConstructor
public class KnowledgeDocumentCreatedListener {

    private final KnowledgeDocumentRepository documentRepository;
    private final KnowledgeDocumentChunkRepository chunkRepository;
    private final TextChunker chunker;
    private final EmbeddingService embeddingService;

    @Transactional
    @EventListener
    public void handle(KnowledgeDocumentCreatedEvent event) {

        KnowledgeDocument document =
                documentRepository.findById(event.documentId())
                        .orElseThrow();

        document.setIndexStatus(IndexStatus.INDEXING);

        List<String> chunks = chunker.chunk(document.getContent());

        int index = 0;

        for (String chunkText : chunks) {

            float[] embedding =
                    embeddingService.generateEmbedding(chunkText);

            KnowledgeDocumentChunk chunk =
                    KnowledgeDocumentChunk.builder()
                            .documentId(document.getId())
                            .chunkIndex(index++)
                            .chunkText(chunkText)
                            .embedding(embedding)
                            .embeddingModel(embeddingService.modelName())
                            .build();

            chunkRepository.save(chunk);
        }

        document.setIndexStatus(IndexStatus.INDEXED);
    }
}
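
The listener above never sets the FAILED status. Because it runs in the same transaction as the save, an exception during embedding rolls everything back, including the document itself. The plain-Java sketch below shows the status handling we might want instead; IndexingRunner, Doc and Status are hypothetical stand-ins for the listener, the entity and the enum.

```java
import java.util.function.Consumer;

// Hypothetical sketch: record FAILED instead of losing the document on error.
final class IndexingRunner {

    enum Status { PENDING, INDEXING, INDEXED, FAILED }

    // Minimal stand-in for the KnowledgeDocument entity.
    static final class Doc {
        Status status = Status.PENDING;
        final String content;

        Doc(String content) {
            this.content = content;
        }
    }

    // Runs the indexer and maps success or failure onto the document status.
    static void index(Doc doc, Consumer<String> indexer) {
        doc.status = Status.INDEXING;
        try {
            indexer.accept(doc.content);
            doc.status = Status.INDEXED;
        } catch (RuntimeException e) {
            // Assumption: a real system would also log the error here.
            doc.status = Status.FAILED;
        }
    }
}
```

In the Spring version, persisting FAILED would require the status update to commit separately from the chunk inserts, for example by running the indexing after the document transaction has committed.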

REST Controller

package com.example.knowledgebase.api;

import com.example.knowledgebase.service.KnowledgeDocumentService;
import jakarta.validation.Valid;
import lombok.RequiredArgsConstructor;
import org.springframework.http.HttpStatus;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/documents")
@RequiredArgsConstructor
public class KnowledgeDocumentController {

    private final KnowledgeDocumentService service;

    @PostMapping
    @ResponseStatus(HttpStatus.CREATED)
    public CreateKnowledgeDocumentResponse create(
            // @Valid triggers the @NotBlank checks declared on the request record
            @Valid @RequestBody CreateKnowledgeDocumentRequest request
    ) {

        Long id = service.create(request);

        return new CreateKnowledgeDocumentResponse(id);
    }
}

Response DTO

package com.example.knowledgebase.api;

public record CreateKnowledgeDocumentResponse(Long id) {}

What Happens After the Request?

When a document is created:

POST /documents

The system performs:

  1. Save the document
  2. Publish the event
  3. Split the content into chunks
  4. Generate embeddings
  5. Store the chunk vectors

The knowledge base is now indexed and ready for semantic retrieval.


Production Considerations

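Spring events are synchronous by default: the listener runs in the publishing thread and inside the same transaction as the save, so POST /documents does not return until every chunk has been embedded and stored.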

In production systems you might evolve this pipeline using:

  • @Async event listeners
  • message queues (Kafka, RabbitMQ)
  • background workers
  • retry mechanisms

Moving indexing off the request thread means slow embedding calls no longer add latency to the API.
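
Of the options above, a retry mechanism is the easiest to sketch in plain Java. The helper below retries a task with exponential backoff, which is one common way to absorb transient failures from an embedding API. Retry and withRetry are hypothetical names, and the delay values are placeholders.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: retry a task, doubling the wait after each failure.
final class Retry {

    static <T> T withRetry(Callable<T> task, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (InterruptedException e) {
                // Do not retry when the thread was asked to stop.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts, surface the last error
                }
                Thread.sleep(delay);
                delay *= 2;
            }
        }
    }
}
```

A call such as Retry.withRetry(() -> embeddingService.generateEmbedding(chunkText), 3, 200L) would try up to three times, waiting 200 ms and then 400 ms between attempts.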


Conclusion

In this article we implemented the indexing side of an AI knowledge base.

The system now:

  • saves documents
  • splits content into chunks
  • generates embeddings
  • stores vectors in PostgreSQL

This is the foundation required for semantic search and Retrieval-Augmented Generation (RAG).

In the next article we will:

  • convert user questions into embeddings
  • search the vector database
  • retrieve relevant chunks
  • build the AI prompt

Bonus: How to Test Article 3

POST /documents
Content-Type: application/json

{
  "title": "Spring Boot and pgvector",
  "content": "PostgreSQL can be used as a vector database with the pgvector extension.\n\nThe extension is enabled with CREATE EXTENSION vector.\n\nSpring Boot can index documents by chunking content and generating embeddings.\n\nChunking improves retrieval quality because specific sections can be retrieved instead of the whole document."
}
POST /documents
Content-Type: application/json

{
  "title": "Java Virtual Threads",
  "content": "Java 21 introduced virtual threads as part of Project Loom.\n\nVirtual threads are lightweight and managed by the JVM.\n\nThey help applications handle many concurrent tasks with a simpler programming model.\n\nThey are especially useful for workloads with blocking I/O."
}
POST /documents
Content-Type: application/json

{
  "title": "Semantic Search",
  "content": "Semantic search looks for meaning instead of exact words.\n\nEmbeddings convert text into numeric vectors.\n\nVector databases can compare embeddings using similarity search.\n\nThis approach is commonly used in Retrieval-Augmented Generation systems."
}

SQL to Verify Indexed Documents

Check the main documents table:

SELECT id, title, index_status
FROM knowledge_document
ORDER BY id;

The result should look like this:

 id | title                    | index_status
----+--------------------------+--------------
  1 | Spring Boot and pgvector | INDEXED
  2 | Java Virtual Threads     | INDEXED
  3 | Semantic Search          | INDEXED

Sequence

  1. Meaning: How Data Vectorization Powers AI
  2. Turning PostgreSQL Into a Vector Database with Docker
  3. Indexing Knowledge Base Content with Spring Boot and pgvector
  4. Building Semantic Search with Spring Boot, PostgreSQL, and pgvector (RAG Retrieval)

Project Here
