In the previous article, we configured PostgreSQL as a vector database using pgvector.
But a vector database alone is not enough.
Before we can query embeddings, we must index our data.
In a real AI knowledge base, indexing usually follows a pipeline like this:
Document saved
→ event published
→ content chunked
→ embeddings generated
→ chunk vectors stored in PostgreSQL
This design separates the document persistence layer from the AI indexing process, making the system easier to scale and maintain.
In this article we will implement this indexing pipeline using Spring Boot.
Project Goal
Our goal is to support this workflow:
- Save a knowledge document through a REST API
- Automatically split the document into smaller chunks
- Generate embeddings for each chunk
- Store those embeddings in PostgreSQL using pgvector
Once indexed, the knowledge base will be ready for semantic search.
Maven Dependencies
Add the following dependencies to your pom.xml.
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
</dependency>
<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
    <optional>true</optional>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-validation</artifactId>
</dependency>
Database Configuration
Example application.yml:
spring:
  datasource:
    url: jdbc:postgresql://localhost:5432/vectordb
    username: admin
    password: admin
  jpa:
    hibernate:
      ddl-auto: update
    show-sql: true
Database Model
We will use two tables.
- knowledge_document: stores the original document
- knowledge_document_chunk: stores chunk text and embeddings
This separation is important because one document may generate many chunks.
Index Status Enum
This enum tracks the lifecycle of indexing.
Enum: IndexStatus
package com.example.knowledgebase.domain;
public enum IndexStatus {
PENDING,
INDEXING,
INDEXED,
FAILED
}
Entity: KnowledgeDocument
package com.example.knowledgebase.domain;
import jakarta.persistence.*;
import lombok.*;
import java.time.OffsetDateTime;
@Entity
@Table(name = "knowledge_document")
@Getter
@Setter
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class KnowledgeDocument {
@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
private Long id;
private String title;
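@Column(columnDefinition = "text") // plain text column; @Lob may map a String to a PostgreSQL large object (oid)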
private String content;
@Enumerated(EnumType.STRING)
private IndexStatus indexStatus;
private OffsetDateTime createdAt;
private OffsetDateTime updatedAt;
@PrePersist
public void prePersist() {
createdAt = OffsetDateTime.now();
updatedAt = OffsetDateTime.now();
if (indexStatus == null) {
indexStatus = IndexStatus.PENDING;
}
}
@PreUpdate
public void preUpdate() {
updatedAt = OffsetDateTime.now();
}
}
Entity: KnowledgeDocumentChunk
Each chunk of text gets its own embedding.
package com.example.knowledgebase.domain;
import jakarta.persistence.*;
import lombok.*;
@Entity
@Table(name = "knowledge_document_chunk")
@Getter
@Setter
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class KnowledgeDocumentChunk {
@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
private Long id;
private Long documentId;
private Integer chunkIndex;
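@Column(columnDefinition = "text") // plain text column; @Lob may map a String to a PostgreSQL large object (oid)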
private String chunkText;
@Column(columnDefinition = "vector(1536)")
private float[] embedding;
private String embeddingModel;
}
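The vector(1536) column type is provided by pgvector. The number is the embedding dimension, so every stored array must contain exactly 1536 values.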
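pgvector accepts vectors in a bracketed text form such as [0.1,0.2,0.3]. If you ever need to pass an embedding through a native SQL query instead of the entity mapping, a small helper like the one below could build that literal and check the dimension first. This is only a sketch: VectorLiterals and its two methods are hypothetical names, not part of the project above.

```java
import java.util.StringJoiner;

// Hypothetical helper (not part of the article's code): formats a float[]
// in pgvector's bracketed text form, e.g. [0.1,0.2,0.3].
final class VectorLiterals {

    private VectorLiterals() {
    }

    // Joins the values with commas and wraps them in square brackets.
    static String toVectorLiteral(float[] embedding) {
        StringJoiner joiner = new StringJoiner(",", "[", "]");
        for (float value : embedding) {
            joiner.add(Float.toString(value));
        }
        return joiner.toString();
    }

    // Fails fast when the array length does not match the column dimension.
    static void requireDimensions(float[] embedding, int expected) {
        if (embedding.length != expected) {
            throw new IllegalArgumentException(
                    "Expected " + expected + " dimensions but got " + embedding.length);
        }
    }
}
```

Checking the length in Java gives a clearer error than waiting for the database to reject a vector of the wrong size.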
Repository: KnowledgeDocumentRepository
package com.example.knowledgebase.repository;
import com.example.knowledgebase.domain.KnowledgeDocument;
import org.springframework.data.jpa.repository.JpaRepository;
public interface KnowledgeDocumentRepository
extends JpaRepository<KnowledgeDocument, Long> {
}
Repository: KnowledgeDocumentChunkRepository
package com.example.knowledgebase.repository;
import com.example.knowledgebase.domain.KnowledgeDocumentChunk;
import org.springframework.data.jpa.repository.JpaRepository;
import java.util.List;
public interface KnowledgeDocumentChunkRepository
extends JpaRepository<KnowledgeDocumentChunk, Long> {
List<KnowledgeDocumentChunk> findByDocumentIdOrderByChunkIndexAsc(Long documentId);
void deleteByDocumentId(Long documentId);
}
The deleteByDocumentId method will be useful for re-indexing documents. Derived delete methods must run inside a transaction, so call it from a @Transactional method.
Request DTO
package com.example.knowledgebase.api;
import jakarta.validation.constraints.NotBlank;
public record CreateKnowledgeDocumentRequest(
@NotBlank String title,
@NotBlank String content
) {}
Domain Event
After a document is saved, we publish an event.
package com.example.knowledgebase.event;
public record KnowledgeDocumentCreatedEvent(Long documentId) {}
Chunking Service
Large documents should be split into smaller pieces.
package com.example.knowledgebase.service;

import org.springframework.stereotype.Component;

import java.util.ArrayList;
import java.util.List;

@Component
public class TextChunker {

    // Soft limit: a single paragraph longer than this still becomes one chunk.
    private static final int MAX_CHARS = 500;

    public List<String> chunk(String text) {
        // Paragraphs are separated by one or more blank lines.
        String[] paragraphs = text.split("\\n\\s*\\n");
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String paragraph : paragraphs) {
            if (paragraph.isBlank()) {
                continue;
            }
            // Flush only a non-empty buffer, so an oversized first paragraph
            // does not produce an empty chunk.
            if (!current.isEmpty() && current.length() + paragraph.length() > MAX_CHARS) {
                chunks.add(current.toString());
                current = new StringBuilder();
            }
            current.append(paragraph).append("\n\n");
        }
        if (!current.isEmpty()) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
Chunking improves retrieval precision, because each embedding then represents one focused passage instead of an entire document.
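The chunker above works at paragraph level, so a single very long paragraph still ends up as one large chunk. One way to handle that case, sketched here as an assumption rather than as part of the project, is to cut oversized text into fixed-size windows that overlap slightly so that context is not lost at the boundaries. SlidingWindowSplitter and its parameters are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits text into windows of at most maxChars characters,
// where each window repeats the last `overlap` characters of the previous one.
final class SlidingWindowSplitter {

    private SlidingWindowSplitter() {
    }

    static List<String> split(String text, int maxChars, int overlap) {
        if (maxChars <= 0 || overlap < 0 || overlap >= maxChars) {
            throw new IllegalArgumentException("Require maxChars > 0 and 0 <= overlap < maxChars");
        }
        List<String> windows = new ArrayList<>();
        int step = maxChars - overlap; // how far the window start advances each time
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + maxChars, text.length());
            windows.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last window already reached the end of the text
            }
        }
        return windows;
    }
}
```

For example, SlidingWindowSplitter.split("abcdefghij", 4, 1) yields the windows "abcd", "defg" and "ghij". The split is purely character based, so it can cut through the middle of a word; a production chunker would usually prefer sentence or token boundaries.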
Embedding Service
We define an abstraction so the embedding provider can change later.
package com.example.knowledgebase.service;
public interface EmbeddingService {
float[] generateEmbedding(String text);
String modelName();
}
Fake Embedding Service
For the purpose of this tutorial, we simulate embeddings.
package com.example.knowledgebase.service;
import org.springframework.stereotype.Service;
import java.util.Random;
@Service
public class FakeEmbeddingService implements EmbeddingService {
private static final int DIMENSIONS = 1536;
private final Random random = new Random();
@Override
public float[] generateEmbedding(String text) {
float[] vector = new float[DIMENSIONS];
for (int i = 0; i < DIMENSIONS; i++) {
vector[i] = random.nextFloat();
}
return vector;
}
@Override
public String modelName() {
return "fake-embedding-model";
}
}
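In a real system this service would call an embedding model or API. The random vectors used here carry no meaning, so similarity search over them would return arbitrary results.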
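One limitation of purely random vectors is that the same text gets a different embedding on every call, which makes tests hard to reproduce. A possible variation, assuming we only need stable placeholder vectors, seeds the generator from the text and normalizes the result to unit length. DeterministicFakeEmbedding is a hypothetical class shown in plain Java; it is still a fake and produces no semantic similarity.

```java
import java.util.Random;

// Hypothetical variation of the fake embedding: the same text always
// produces the same unit-length vector.
final class DeterministicFakeEmbedding {

    static final int DIMENSIONS = 1536;

    private DeterministicFakeEmbedding() {
    }

    static float[] embed(String text) {
        // Seeding with the text's hash makes the output repeatable per text.
        Random random = new Random(text.hashCode());
        float[] vector = new float[DIMENSIONS];
        double sumOfSquares = 0.0;
        for (int i = 0; i < DIMENSIONS; i++) {
            vector[i] = random.nextFloat() * 2f - 1f; // value in [-1, 1)
            sumOfSquares += (double) vector[i] * vector[i];
        }
        // Scale to unit length, as many real embedding models do.
        float norm = (float) Math.sqrt(sumOfSquares);
        if (norm > 0f) {
            for (int i = 0; i < DIMENSIONS; i++) {
                vector[i] /= norm;
            }
        }
        return vector;
    }
}
```

Wrapping this logic in an EmbeddingService implementation would keep the rest of the pipeline unchanged.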
KnowledgeDocumentService
This service saves documents and triggers the indexing pipeline.
package com.example.knowledgebase.service;
import com.example.knowledgebase.api.CreateKnowledgeDocumentRequest;
import com.example.knowledgebase.domain.IndexStatus;
import com.example.knowledgebase.domain.KnowledgeDocument;
import com.example.knowledgebase.event.KnowledgeDocumentCreatedEvent;
import com.example.knowledgebase.repository.KnowledgeDocumentRepository;
import lombok.RequiredArgsConstructor;
import org.springframework.context.ApplicationEventPublisher;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;
@Service
@RequiredArgsConstructor
public class KnowledgeDocumentService {
private final KnowledgeDocumentRepository repository;
private final ApplicationEventPublisher eventPublisher;
@Transactional
public Long create(CreateKnowledgeDocumentRequest request) {
KnowledgeDocument document = KnowledgeDocument.builder()
.title(request.title())
.content(request.content())
.indexStatus(IndexStatus.PENDING)
.build();
KnowledgeDocument saved = repository.save(document);
eventPublisher.publishEvent(
new KnowledgeDocumentCreatedEvent(saved.getId())
);
return saved.getId();
}
}
Event Listener
This component performs the indexing process.
package com.example.knowledgebase.event;
import com.example.knowledgebase.domain.*;
import com.example.knowledgebase.repository.*;
import com.example.knowledgebase.service.*;
import lombok.RequiredArgsConstructor;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;
import java.util.List;
@Component
@RequiredArgsConstructor
public class KnowledgeDocumentCreatedListener {
private final KnowledgeDocumentRepository documentRepository;
private final KnowledgeDocumentChunkRepository chunkRepository;
private final TextChunker chunker;
private final EmbeddingService embeddingService;
@Transactional
@EventListener
public void handle(KnowledgeDocumentCreatedEvent event) {
KnowledgeDocument document =
documentRepository.findById(event.documentId())
.orElseThrow();
document.setIndexStatus(IndexStatus.INDEXING);
List<String> chunks = chunker.chunk(document.getContent());
int index = 0;
for (String chunkText : chunks) {
float[] embedding =
embeddingService.generateEmbedding(chunkText);
KnowledgeDocumentChunk chunk =
KnowledgeDocumentChunk.builder()
.documentId(document.getId())
.chunkIndex(index++)
.chunkText(chunkText)
.embedding(embedding)
.embeddingModel(embeddingService.modelName())
.build();
chunkRepository.save(chunk);
}
document.setIndexStatus(IndexStatus.INDEXED);
}
}
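Note that the listener never sets the FAILED status. Because the listener runs synchronously inside the transaction started by create(), an exception thrown while embedding rolls back the whole transaction, including the saved document. The intended status lifecycle can be sketched in plain Java as follows. IndexingStatusTracker is a hypothetical class used only to illustrate the transitions; persisting FAILED in the real application would additionally require the status update to be committed in its own transaction.

```java
// Hypothetical illustration of the status lifecycle:
// PENDING -> INDEXING -> INDEXED, or FAILED when the indexing step throws.
final class IndexingStatusTracker {

    enum IndexStatus { PENDING, INDEXING, INDEXED, FAILED }

    private IndexStatus status = IndexStatus.PENDING;

    IndexStatus status() {
        return status;
    }

    // Runs the indexing step and records the outcome instead of letting
    // the exception escape without a status change.
    void run(Runnable indexingStep) {
        status = IndexStatus.INDEXING;
        try {
            indexingStep.run();
            status = IndexStatus.INDEXED;
        } catch (RuntimeException e) {
            status = IndexStatus.FAILED;
        }
    }
}
```

With a recorded FAILED status, a later job could find the affected documents and re-index them.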
REST Controller
package com.example.knowledgebase.api;

import com.example.knowledgebase.service.KnowledgeDocumentService;
import jakarta.validation.Valid;
import lombok.RequiredArgsConstructor;
import org.springframework.http.HttpStatus;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/documents")
@RequiredArgsConstructor
public class KnowledgeDocumentController {

    private final KnowledgeDocumentService service;

    @PostMapping
    @ResponseStatus(HttpStatus.CREATED)
    public CreateKnowledgeDocumentResponse create(
            // @Valid enforces the @NotBlank constraints declared on the request record
            @Valid @RequestBody CreateKnowledgeDocumentRequest request
    ) {
        Long id = service.create(request);
        return new CreateKnowledgeDocumentResponse(id);
    }
}
Response DTO
package com.example.knowledgebase.api;
public record CreateKnowledgeDocumentResponse(Long id) {}
What Happens After the Request?
When a document is created:
POST /documents
The system performs:
1. Save the document
2. Publish the event
3. Split the content into chunks
4. Generate embeddings
5. Store the chunk vectors
The knowledge base is now indexed and ready for semantic retrieval.
Production Considerations
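Spring application events are synchronous by default, so in the code above the POST /documents request does not return until every chunk has been embedded and stored.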
In production systems you might evolve this pipeline using:
- @Async event listeners
- message queues (Kafka, RabbitMQ)
- background workers
- retry mechanisms
Moving indexing off the request thread keeps slow embedding calls from adding latency to the API.
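As an illustration of the retry idea, here is a minimal plain-Java sketch of retrying a failing step with exponential backoff. Retry.withRetry and its parameters are hypothetical names chosen for this example; in a Spring application you might reach for an existing retry library instead of writing your own.

```java
import java.util.function.Supplier;

// Hypothetical helper: retries a task with exponential backoff.
final class Retry {

    private Retry() {
    }

    static <T> T withRetry(Supplier<T> task, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again if attempts remain
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting to retry", ie);
                }
                delay *= 2; // double the wait before the next attempt
            }
        }
        throw last; // every attempt failed: surface the last error
    }
}
```

A call such as Retry.withRetry(() -> embeddingService.generateEmbedding(chunkText), 3, 200L) would try the embedding call up to three times, waiting 200 ms and then 400 ms between attempts.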
Conclusion
In this article we implemented the indexing side of an AI knowledge base.
The system now:
- saves documents
- splits content into chunks
- generates embeddings
- stores vectors in PostgreSQL
This is the foundation required for semantic search and Retrieval-Augmented Generation (RAG).
In the next article we will:
- convert user questions into embeddings
- search the vector database
- retrieve relevant chunks
- build the AI prompt