Allan Roberto

Indexing Knowledge Base Content with Spring Boot and pgvector

In the previous article, we configured PostgreSQL as a vector database using pgvector.

But a vector database alone is not enough.

Before we can query embeddings, we must index our data.

In a real AI knowledge base, indexing usually follows a pipeline like this:

Document saved
→ event published
→ content chunked
→ embeddings generated
→ chunk vectors stored in PostgreSQL

This design separates the document persistence layer from the AI indexing process, making the system easier to scale and maintain.

In this article we will implement this indexing pipeline using Spring Boot.


Project Goal

Our goal is to support this workflow:

  1. Save a knowledge document through a REST API
  2. Automatically split the document into smaller chunks
  3. Generate embeddings for each chunk
  4. Store those embeddings in PostgreSQL using pgvector

Once indexed, the knowledge base will be ready for semantic search.


Maven Dependencies

Add the following dependencies to your pom.xml.

Dependencies

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
</dependency>
<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
    <optional>true</optional>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-validation</artifactId>
</dependency>

Database Configuration

Example application.yml:

spring:
  datasource:
    url: jdbc:postgresql://localhost:5432/vectordb
    username: admin
    password: admin

  jpa:
    hibernate:
      ddl-auto: update
    show-sql: true

Database Model

We will use two tables.

knowledge_document - Stores the original document.
knowledge_document_chunk - Stores the chunk text and its embedding.

This separation is important because one document may generate many chunks.


Index Status Enum

This enum tracks the lifecycle of indexing.

Enum: IndexStatus

package com.example.knowledgebase.domain;

public enum IndexStatus {
    PENDING,
    INDEXING,
    INDEXED,
    FAILED
}

Entity: KnowledgeDocument

package com.example.knowledgebase.domain;

import jakarta.persistence.*;
import lombok.*;

import java.time.OffsetDateTime;

@Entity
@Table(name = "knowledge_document")
@Getter
@Setter
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class KnowledgeDocument {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    private String title;

    @Lob
    private String content;

    @Enumerated(EnumType.STRING)
    private IndexStatus indexStatus;

    private OffsetDateTime createdAt;

    private OffsetDateTime updatedAt;

    @PrePersist
    public void prePersist() {
        createdAt = OffsetDateTime.now();
        updatedAt = OffsetDateTime.now();

        if (indexStatus == null) {
            indexStatus = IndexStatus.PENDING;
        }
    }

    @PreUpdate
    public void preUpdate() {
        updatedAt = OffsetDateTime.now();
    }
}

Entity: KnowledgeDocumentChunk
Each chunk of text gets its own embedding.

package com.example.knowledgebase.domain;

import jakarta.persistence.*;
import lombok.*;

@Entity
@Table(name = "knowledge_document_chunk")
@Getter
@Setter
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class KnowledgeDocumentChunk {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    private Long documentId;

    private Integer chunkIndex;

    @Lob
    private String chunkText;

    @Column(columnDefinition = "vector(1536)")
    private float[] embedding;

    private String embeddingModel;
}

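The vector(1536) column type is provided by the pgvector extension. The number in parentheses is the vector's dimension count, so it must match the length of the arrays the embedding service returns (1536 in this article).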
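
To make the column type more concrete: pgvector accepts vectors as text literals of the form [v1,v2,...]. The helper below is a minimal sketch that formats a float[] that way and rejects arrays with the wrong number of dimensions. The class name VectorLiterals and its method are hypothetical and are not part of the article's project.

```java
import java.util.StringJoiner;

// Hypothetical helper for illustration only; not part of the article's project.
final class VectorLiterals {

    private VectorLiterals() {}

    // Formats a float[] as a pgvector text literal, e.g. [1.0,2.5,-3.0].
    static String toLiteral(float[] vector, int expectedDimensions) {
        if (vector.length != expectedDimensions) {
            throw new IllegalArgumentException(
                    "Expected " + expectedDimensions + " dimensions but got " + vector.length);
        }
        StringJoiner joiner = new StringJoiner(",", "[", "]");
        for (float value : vector) {
            // Non-finite components are rejected here because they are not
            // meaningful embedding values.
            if (Float.isNaN(value) || Float.isInfinite(value)) {
                throw new IllegalArgumentException("Vector components must be finite");
            }
            joiner.add(Float.toString(value));
        }
        return joiner.toString();
    }
}
```

A check like this can catch a mismatch between the embedding model and the column definition before the insert reaches the database.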

Repository: KnowledgeDocumentRepository

package com.example.knowledgebase.repository;

import com.example.knowledgebase.domain.KnowledgeDocument;
import org.springframework.data.jpa.repository.JpaRepository;

public interface KnowledgeDocumentRepository
        extends JpaRepository<KnowledgeDocument, Long> {
}

Repository: KnowledgeDocumentChunkRepository

package com.example.knowledgebase.repository;

import com.example.knowledgebase.domain.KnowledgeDocumentChunk;
import org.springframework.data.jpa.repository.JpaRepository;

import java.util.List;

public interface KnowledgeDocumentChunkRepository
        extends JpaRepository<KnowledgeDocumentChunk, Long> {

    List<KnowledgeDocumentChunk> findByDocumentIdOrderByChunkIndexAsc(Long documentId);

    void deleteByDocumentId(Long documentId);
}

The deleteByDocumentId method will be useful for re-indexing documents.
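
Re-indexing comes down to an order of operations: delete the document's old chunks, then store the freshly generated ones. The sketch below shows that order with a hypothetical in-memory stand-in for the chunk table (InMemoryChunkStore and StoredChunk are illustrative names, not part of the article's project), so it runs without a database.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for the chunk table; illustration only.
final class InMemoryChunkStore {

    record StoredChunk(long documentId, int chunkIndex, String text) {}

    private final List<StoredChunk> rows = new ArrayList<>();

    void deleteByDocumentId(long documentId) {
        rows.removeIf(row -> row.documentId() == documentId);
    }

    void save(StoredChunk chunk) {
        rows.add(chunk);
    }

    List<StoredChunk> findByDocumentId(long documentId) {
        return rows.stream()
                .filter(row -> row.documentId() == documentId)
                .toList();
    }

    // Re-index: remove stale chunks first so old and new chunks never mix.
    void reindex(long documentId, List<String> newChunks) {
        deleteByDocumentId(documentId);
        for (int i = 0; i < newChunks.size(); i++) {
            save(new StoredChunk(documentId, i, newChunks.get(i)));
        }
    }
}
```

In the Spring version, the delete and the new saves would typically run inside a single transaction, so a failure halfway through does not leave the document without chunks.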

Request DTO

package com.example.knowledgebase.api;

import jakarta.validation.constraints.NotBlank;

public record CreateKnowledgeDocumentRequest(
        @NotBlank String title,
        @NotBlank String content
) {}

Domain Event

After a document is saved, we publish an event.

package com.example.knowledgebase.event;

public record KnowledgeDocumentCreatedEvent(Long documentId) {}

Chunking Service

Large documents should be split into smaller pieces.

package com.example.knowledgebase.service;

import org.springframework.stereotype.Component;

import java.util.ArrayList;
import java.util.List;

@Component
public class TextChunker {

    private static final int MAX_CHARS = 500;

    public List<String> chunk(String text) {

        String[] paragraphs = text.split("\\n\\s*\\n");

        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        for (String paragraph : paragraphs) {

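            // Flush only when there is buffered text; otherwise an oversized
            // first paragraph would produce an empty chunk.
            if (!current.isEmpty()
                    && current.length() + paragraph.length() > MAX_CHARS) {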
                chunks.add(current.toString());
                current = new StringBuilder();
            }

            current.append(paragraph).append("\n\n");
        }

        if (!current.isEmpty()) {
            chunks.add(current.toString());
        }

        return chunks;
    }
}

Chunking improves retrieval precision when searching later, because each embedding then represents one focused passage rather than an entire document.
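
One limitation of the chunker above is that it keeps every paragraph whole, so a single paragraph longer than MAX_CHARS still becomes one oversized chunk. The following is a minimal sketch of a variant that hard-splits such paragraphs. BoundedChunker is a hypothetical class written without Spring so it can run standalone, and the configurable limit is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical variant of the chunker; illustration only.
final class BoundedChunker {

    private final int maxChars;

    BoundedChunker(int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        this.maxChars = maxChars;
    }

    List<String> chunk(String text) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        for (String paragraph : text.split("\\n\\s*\\n")) {
            String trimmed = paragraph.strip();
            if (trimmed.isEmpty()) {
                continue;
            }

            // A paragraph that cannot fit in any chunk is cut into fixed-size pieces.
            if (trimmed.length() > maxChars) {
                flush(current, chunks);
                for (int start = 0; start < trimmed.length(); start += maxChars) {
                    int end = Math.min(start + maxChars, trimmed.length());
                    chunks.add(trimmed.substring(start, end));
                }
                continue;
            }

            // Count the blank-line separator when text is already buffered.
            int separator = current.isEmpty() ? 0 : 2;
            if (current.length() + separator + trimmed.length() > maxChars) {
                flush(current, chunks);
            }
            if (!current.isEmpty()) {
                current.append("\n\n");
            }
            current.append(trimmed);
        }

        flush(current, chunks);
        return chunks;
    }

    private static void flush(StringBuilder current, List<String> chunks) {
        if (!current.isEmpty()) {
            chunks.add(current.toString());
            current.setLength(0);
        }
    }
}
```

The hard split cuts at a character offset and can land in the middle of a word. A more careful version would split on sentence boundaries, but the sketch is enough to guarantee that no chunk exceeds the limit.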


Embedding Service

We define an abstraction so the embedding provider can change later.

package com.example.knowledgebase.service;

public interface EmbeddingService {

    float[] generateEmbedding(String text);

    String modelName();
}

Fake Embedding Service

For the purpose of this tutorial, we simulate embeddings.

package com.example.knowledgebase.service;

import org.springframework.stereotype.Service;

import java.util.Random;

@Service
public class FakeEmbeddingService implements EmbeddingService {

    private static final int DIMENSIONS = 1536;
    private final Random random = new Random();

    @Override
    public float[] generateEmbedding(String text) {

        float[] vector = new float[DIMENSIONS];

        for (int i = 0; i < DIMENSIONS; i++) {
            vector[i] = random.nextFloat();
        }

        return vector;
    }

    @Override
    public String modelName() {
        return "fake-embedding-model";
    }
}

In a real system this service would call an embedding API.
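
Because the fake service draws fresh random numbers on every call, the same text gets a different vector each time, which makes tests hard to repeat. One possible alternative, sketched below, seeds the generator from the text so that equal inputs give equal vectors. DeterministicFakeEmbeddings is a hypothetical class and not part of the article's project, and its vectors still carry no semantic meaning.

```java
import java.util.Random;

// Hypothetical alternative to the random fake; illustration only.
final class DeterministicFakeEmbeddings {

    static final int DIMENSIONS = 1536;

    private DeterministicFakeEmbeddings() {}

    static float[] embed(String text) {
        // Seeding from the text's hash code makes the output repeatable.
        Random random = new Random(text.hashCode());

        float[] vector = new float[DIMENSIONS];
        double sumOfSquares = 0.0;
        for (int i = 0; i < DIMENSIONS; i++) {
            vector[i] = random.nextFloat() * 2f - 1f; // range [-1, 1)
            sumOfSquares += vector[i] * vector[i];
        }

        // Scale to unit length so every vector has the same magnitude.
        float norm = (float) Math.sqrt(sumOfSquares);
        if (norm > 0f) {
            for (int i = 0; i < DIMENSIONS; i++) {
                vector[i] /= norm;
            }
        }
        return vector;
    }
}
```

Repeatable vectors make it possible to assert on stored rows in integration tests, while the EmbeddingService interface stays the place where a real provider is plugged in later.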


KnowledgeDocumentService

This service saves documents and triggers the indexing pipeline.

package com.example.knowledgebase.service;

import com.example.knowledgebase.api.CreateKnowledgeDocumentRequest;
import com.example.knowledgebase.domain.IndexStatus;
import com.example.knowledgebase.domain.KnowledgeDocument;
import com.example.knowledgebase.event.KnowledgeDocumentCreatedEvent;
import com.example.knowledgebase.repository.KnowledgeDocumentRepository;
import lombok.RequiredArgsConstructor;
import org.springframework.context.ApplicationEventPublisher;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
@RequiredArgsConstructor
public class KnowledgeDocumentService {

    private final KnowledgeDocumentRepository repository;
    private final ApplicationEventPublisher eventPublisher;

    @Transactional
    public Long create(CreateKnowledgeDocumentRequest request) {

        KnowledgeDocument document = KnowledgeDocument.builder()
                .title(request.title())
                .content(request.content())
                .indexStatus(IndexStatus.PENDING)
                .build();

        KnowledgeDocument saved = repository.save(document);

        eventPublisher.publishEvent(
                new KnowledgeDocumentCreatedEvent(saved.getId())
        );

        return saved.getId();
    }
}

Event Listener

This component performs the indexing process.

package com.example.knowledgebase.event;

import com.example.knowledgebase.domain.*;
import com.example.knowledgebase.repository.*;
import com.example.knowledgebase.service.*;
import lombok.RequiredArgsConstructor;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

import java.util.List;

@Component
@RequiredArgsConstructor
public class KnowledgeDocumentCreatedListener {

    private final KnowledgeDocumentRepository documentRepository;
    private final KnowledgeDocumentChunkRepository chunkRepository;
    private final TextChunker chunker;
    private final EmbeddingService embeddingService;

    @Transactional
    @EventListener
    public void handle(KnowledgeDocumentCreatedEvent event) {

        KnowledgeDocument document =
                documentRepository.findById(event.documentId())
                        .orElseThrow();

        document.setIndexStatus(IndexStatus.INDEXING);

        List<String> chunks = chunker.chunk(document.getContent());

        int index = 0;

        for (String chunkText : chunks) {

            float[] embedding =
                    embeddingService.generateEmbedding(chunkText);

            KnowledgeDocumentChunk chunk =
                    KnowledgeDocumentChunk.builder()
                            .documentId(document.getId())
                            .chunkIndex(index++)
                            .chunkText(chunkText)
                            .embedding(embedding)
                            .embeddingModel(embeddingService.modelName())
                            .build();

            chunkRepository.save(chunk);
        }

        document.setIndexStatus(IndexStatus.INDEXED);
    }
}

REST Controller

package com.example.knowledgebase.api;

import com.example.knowledgebase.service.KnowledgeDocumentService;
import jakarta.validation.Valid;
import lombok.RequiredArgsConstructor;
import org.springframework.http.HttpStatus;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/documents")
@RequiredArgsConstructor
public class KnowledgeDocumentController {

    private final KnowledgeDocumentService service;

    @PostMapping
    @ResponseStatus(HttpStatus.CREATED)
    public CreateKnowledgeDocumentResponse create(
            // @Valid triggers the @NotBlank checks declared on the request record
            @Valid @RequestBody CreateKnowledgeDocumentRequest request
    ) {

        Long id = service.create(request);

        return new CreateKnowledgeDocumentResponse(id);
    }
}

Response DTO

package com.example.knowledgebase.api;

public record CreateKnowledgeDocumentResponse(Long id) {}

What Happens After the Request?

When a document is created:

POST /documents

The system performs:

1. save document
2. publish event
3. split content into chunks
4. generate embeddings
5. store chunk vectors

The knowledge base is now indexed and ready for semantic retrieval.
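
To try the endpoint from Java, a request can be built with the JDK's built-in HTTP client. This is a minimal sketch: the class CreateDocumentClient is hypothetical, the base URL http://localhost:8080 assumes Spring Boot's default port, and the JSON escaping is deliberately simplistic.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client for illustration only; not part of the article's project.
final class CreateDocumentClient {

    private CreateDocumentClient() {}

    // Builds the POST /documents request with a JSON body.
    static HttpRequest buildRequest(String baseUrl, String title, String content) {
        String json = "{\"title\":\"" + escape(title)
                + "\",\"content\":\"" + escape(content) + "\"}";
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/documents"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    // Minimal escaping for backslashes, quotes, and newlines only;
    // a real client would use a JSON library instead.
    static String escape(String value) {
        return value.replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n");
    }

    // Sends the request and returns the raw response body.
    static String send(HttpRequest request) throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode() + " " + response.body();
    }
}
```

With the application running, passing buildRequest("http://localhost:8080", "Refund policy", "First paragraph.\n\nSecond paragraph.") to send should come back with status 201 and a JSON body containing the new document's id, given the controller shown earlier.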


Production Considerations

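Spring application events are delivered synchronously by default, so in this implementation the POST request does not return until chunking, embedding, and storage have all finished.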

In production systems you might evolve this pipeline using:

  • @Async event listeners
  • message queues (Kafka, RabbitMQ)
  • background workers
  • retry mechanisms

Moving indexing off the request thread keeps slow embedding calls from adding latency to the API.
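
The background-worker and retry ideas can be combined in plain Java. The sketch below runs an indexing task on an executor and retries it a bounded number of times. RetryingIndexer is a hypothetical class; it deliberately leaves out backoff delays and is not a replacement for a message queue.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

// Hypothetical helper for illustration only; not part of the article's project.
final class RetryingIndexer {

    private RetryingIndexer() {}

    // Runs the task on the given executor and retries on failure.
    // The future completes with the number of attempts that were needed,
    // or completes exceptionally once every attempt has failed.
    static CompletableFuture<Integer> submit(
            Runnable indexingTask, int maxAttempts, Executor executor) {

        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }

        return CompletableFuture.supplyAsync(() -> {
            RuntimeException lastFailure = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    indexingTask.run();
                    return attempt;
                } catch (RuntimeException e) {
                    lastFailure = e; // remember the cause and try again
                }
            }
            throw new IllegalStateException(
                    "Indexing failed after " + maxAttempts + " attempts", lastFailure);
        }, executor);
    }
}
```

In this article's model, a caller could set the document's status to IndexStatus.FAILED when the returned future completes exceptionally, and to IndexStatus.INDEXED otherwise.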


Conclusion

In this article we implemented the indexing side of an AI knowledge base.

The system now:

  • saves documents
  • splits content into chunks
  • generates embeddings
  • stores vectors in PostgreSQL

This is the foundation required for semantic search and Retrieval-Augmented Generation (RAG).

In the next article we will:

  • convert user questions into embeddings
  • search the vector database
  • retrieve relevant chunks
  • build the AI prompt

Project Here
