<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Raphael De Lio</title>
    <description>The latest articles on DEV Community by Raphael De Lio (@raphaeldelio).</description>
    <link>https://dev.to/raphaeldelio</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1009851%2Fd33dfef0-1a4e-49e6-bce1-17198b7238cb.jpeg</url>
      <title>DEV Community: Raphael De Lio</title>
      <link>https://dev.to/raphaeldelio</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/raphaeldelio"/>
    <language>en</language>
    <item>
      <title>Semantic Caching with Spring AI &amp; Redis</title>
      <dc:creator>Raphael De Lio</dc:creator>
      <pubDate>Thu, 31 Jul 2025 09:37:38 +0000</pubDate>
      <link>https://dev.to/redis/semantic-caching-with-spring-ai-redis-2aa4</link>
      <guid>https://dev.to/redis/semantic-caching-with-spring-ai-redis-2aa4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; You’re building a semantic caching system using Spring AI and Redis to improve LLM application performance.&lt;/p&gt;

&lt;p&gt;Unlike traditional caching that requires exact query matches, semantic caching understands the meaning behind queries and can return cached responses for semantically similar questions.&lt;/p&gt;

&lt;p&gt;It works by storing query-response pairs as vector embeddings in Redis, allowing your application to retrieve cached answers for similar questions without calling the expensive LLM, reducing both latency and costs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstkb1wkl2hb1arkilbff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstkb1wkl2hb1arkilbff.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The Problem with Traditional LLM Applications
&lt;/h1&gt;

&lt;p&gt;LLMs are powerful but expensive. Every API call costs money and takes time. When users ask similar questions like “What beer goes with grilled meat?” and “Which beer pairs well with barbecue?”, traditional systems would make separate LLM calls even though these queries are essentially asking the same thing.&lt;/p&gt;

&lt;p&gt;Traditional exact-match caching only works if users ask the identical question word-for-word. But in real applications, users phrase questions differently while seeking the same information.&lt;/p&gt;

&lt;h1&gt;
  
  
  How Semantic Caching Works
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Video: &lt;a href="https://www.youtube.com/watch?v=AtVTT_s8AGc&amp;amp;t=1s" rel="noopener noreferrer"&gt;What is a semantic cache?&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Semantic caching solves this by understanding the &lt;strong&gt;&lt;em&gt;meaning&lt;/em&gt;&lt;/strong&gt; behind queries rather than matching exact text. When a user asks a question:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; The system converts the query into a vector embedding&lt;/li&gt;
&lt;li&gt; It searches for semantically similar cached queries using vector similarity&lt;/li&gt;
&lt;li&gt; If a similar query exists above a certain threshold, it returns the cached response&lt;/li&gt;
&lt;li&gt; If not, it calls the LLM, gets a response, and caches both the query and response for future use&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Behind the scenes, this works thanks to vector similarity search. It turns text into vectors (embeddings) — lists of numbers — stores them in a vector database, and then finds the ones closest to your query when checking for cached responses.&lt;/p&gt;
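&lt;p&gt;As a toy illustration of that similarity check (the three-dimensional vectors below are made up; real embedding models produce hundreds of dimensions), cosine similarity between two vectors can be computed like this:&lt;/p&gt;

```kotlin
import kotlin.math.sqrt

// Cosine similarity between two embedding vectors.
fun cosineSimilarity(a: DoubleArray, b: DoubleArray): Double {
    require(a.size == b.size) { "Vectors must have the same dimension" }
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}

fun main() {
    // Made-up 3-dimensional "embeddings"; purely illustrative values.
    val grilledMeat = doubleArrayOf(0.90, 0.10, 0.40) // "What beer goes with grilled meat?"
    val barbecue = doubleArrayOf(0.85, 0.15, 0.45)    // "Which beer pairs well with barbecue?"
    val weather = doubleArrayOf(0.10, 0.90, 0.20)     // an unrelated query

    println(cosineSimilarity(grilledMeat, barbecue))  // high similarity: likely a cache hit
    println(cosineSimilarity(grilledMeat, weather))   // low similarity: cache miss
}
```

&lt;p&gt;Two phrasings of the same question land close together in vector space, while an unrelated query scores much lower; that gap is what the similarity threshold separates.&lt;/p&gt;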

&lt;p&gt;Today, we’re going to build a semantic caching system for a beer recommendation assistant. It will remember previous responses to similar questions, dramatically improving response times and reducing API costs.&lt;/p&gt;

&lt;p&gt;To do that, we’ll build a Spring Boot app from scratch and use Redis as our semantic cache store. It’ll handle vector embeddings for similarity matching, enabling our application to provide lightning-fast responses for semantically similar queries.&lt;/p&gt;

&lt;h1&gt;
  
  
  Redis as a Semantic Cache for AI Applications
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Video: &lt;a href="https://www.youtube.com/watch?v=Yhv19le0sBw&amp;amp;t=1s" rel="noopener noreferrer"&gt;What's a vector database&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Redis Open Source 8 not only turns the community version of Redis into a Vector Database, but also makes it the fastest and most scalable database in the market today. Redis 8 allows you to scale to one billion vectors without penalizing latency.&lt;/p&gt;

&lt;p&gt;For semantic caching, Redis serves as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  A vector store using Redis JSON and the Redis Query Engine for storing query embeddings&lt;/li&gt;
&lt;li&gt;  A metadata store for cached responses and additional context&lt;/li&gt;
&lt;li&gt;  A high-performance search engine for finding semantically similar queries&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Spring AI and Redis
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Video: &lt;a href="https://www.youtube.com/watch?v=0U1S0WSsPuE&amp;amp;t=1s" rel="noopener noreferrer"&gt;What’s an embedding model?&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Spring AI provides a unified API for working with various AI models and vector stores. Combined with Redis, it allows developers to easily build semantic caching systems that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Store and retrieve vector embeddings for semantic search&lt;/li&gt;
&lt;li&gt;  Cache LLM responses with semantic similarity matching&lt;/li&gt;
&lt;li&gt;  Reduce API costs by avoiding redundant LLM calls&lt;/li&gt;
&lt;li&gt;  Improve response times for similar queries&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Building the Application
&lt;/h1&gt;

&lt;p&gt;Our application will be built using Spring Boot with Spring AI and Redis. It will implement a beer recommendation assistant that caches responses semantically, providing fast answers to similar questions about beer pairings.&lt;/p&gt;

&lt;h2&gt;
  
  
  0. GitHub Repository
&lt;/h2&gt;

&lt;p&gt;The full application can be found on GitHub:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/redis-developer/redis-springboot-resources/tree/main/artificial-intelligence/semantic-caching-with-spring-ai" rel="noopener noreferrer"&gt;https://github.com/redis-developer/redis-springboot-resources/tree/main/artificial-intelligence/semantic-caching-with-spring-ai&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Add the required dependencies
&lt;/h2&gt;

&lt;p&gt;From a Spring Boot application, add the following dependencies to your Maven or Gradle file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;implementation("org.springframework.ai:spring-ai-transformers:1.0.0")
implementation("org.springframework.ai:spring-ai-starter-vector-store-redis")
implementation("org.springframework.ai:spring-ai-starter-model-openai")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Configure the Semantic Cache Vector Store
&lt;/h2&gt;

&lt;p&gt;We’ll use Spring AI’s &lt;code&gt;RedisVectorStore&lt;/code&gt; to store and search vector embeddings of cached queries and responses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Configuration
class SemanticCacheConfig {
    @Bean
    fun semanticCachingVectorStore(
        embeddingModel: TransformersEmbeddingModel,
        jedisPooled: JedisPooled
    ): RedisVectorStore {
        return RedisVectorStore.builder(jedisPooled, embeddingModel)
            .indexName("semanticCachingIdx")
            .contentFieldName("content")
            .embeddingFieldName("embedding")
            .metadataFields(
                RedisVectorStore.MetadataField("answer", Schema.FieldType.TEXT)
            )
            .prefix("semantic-caching:")
            .initializeSchema(true)
            .vectorAlgorithm(RedisVectorStore.Algorithm.HSNW)
            .build()
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s break this down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Index Name&lt;/strong&gt;: &lt;code&gt;semanticCachingIdx&lt;/code&gt; — Redis will create an index with this name for searching cached responses&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Content Field&lt;/strong&gt;: &lt;code&gt;content&lt;/code&gt; — The raw prompt that will be embedded&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Embedding Field&lt;/strong&gt;: &lt;code&gt;embedding&lt;/code&gt; — The field that will store the resulting vector embedding&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Metadata Fields&lt;/strong&gt;: &lt;code&gt;answer&lt;/code&gt;, a TEXT field for storing the LLM's response&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Prefix&lt;/strong&gt;: &lt;code&gt;semantic-caching:&lt;/code&gt; — All keys in Redis will be prefixed with this to organize the data&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Vector Algorithm&lt;/strong&gt;: the &lt;code&gt;HSNW&lt;/code&gt; constant selects HNSW, the Hierarchical Navigable Small World algorithm, for efficient approximate nearest neighbor search&lt;/li&gt;
&lt;/ul&gt;
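
&lt;p&gt;For illustration, a cached entry stored under this configuration would look roughly like the following JSON document, kept under a key starting with &lt;code&gt;semantic-caching:&lt;/code&gt; followed by a generated ID (all field values here are made up):&lt;/p&gt;

```json
{
  "content": "What beer goes with grilled meat?",
  "embedding": [0.012, -0.341, 0.087, "..."],
  "answer": "A smoky porter or a malty amber ale pairs well with grilled meat."
}
```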

&lt;h2&gt;
  
  
  3. Implement the Semantic Caching Service
&lt;/h2&gt;

&lt;p&gt;The SemanticCachingService handles storing and retrieving cached responses from Redis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Service
class SemanticCachingService(
    private val semanticCachingVectorStore: RedisVectorStore
) {
    private val logger = LoggerFactory.getLogger(SemanticCachingService::class.java)
    fun storeInCache(prompt: String, answer: String) {
        // Create a document for the vector store
        val document = Document(
            prompt,
            mapOf("answer" to answer)
        )
        // Store the document in the vector store
        semanticCachingVectorStore.add(listOf(document))

        logger.info("Stored response in semantic cache for prompt: ${prompt.take(50)}...")
    }
    fun getFromCache(prompt: String, similarityThreshold: Double = 0.8): String? {
        // Execute similarity search
        val results = semanticCachingVectorStore.similaritySearch(
            SearchRequest.builder()
                .query(prompt)
                .topK(1)
                .build()
        )
        // Check if we found a semantically similar query above threshold
        if (results?.isNotEmpty() == true) {
            val score = results[0].score ?: 0.0
            if (similarityThreshold &amp;lt; score) {
                logger.info("Cache hit! Similarity score: $score")
                return results[0].metadata["answer"] as String
            } else {
                logger.info("Similar query found but below threshold. Score: $score")
            }
        }
        logger.info("No cached response found for prompt")
        return null
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key features of the semantic caching service:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Stores query-response pairs as vector embeddings in Redis&lt;/li&gt;
&lt;li&gt;  Retrieves cached responses using vector similarity search&lt;/li&gt;
&lt;li&gt;  Configurable similarity threshold for cache hits&lt;/li&gt;
&lt;li&gt;  Comprehensive logging for debugging and monitoring&lt;/li&gt;
&lt;/ul&gt;
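
&lt;p&gt;As a hypothetical usage sketch (assuming the service is injected from the Spring context, and with &lt;code&gt;callLlm&lt;/code&gt; standing in for the RAG pipeline shown in the next step), the cache behaves like this:&lt;/p&gt;

```kotlin
// Hypothetical usage of SemanticCachingService; callLlm is a placeholder.
val question = "What beer goes with grilled meat?"
val cached = semanticCachingService.getFromCache(question)
if (cached == null) {
    // Cache miss: pay for one LLM call, then store the answer for next time
    val answer = callLlm(question)
    semanticCachingService.storeInCache(question, answer)
}

// Later, a semantically similar question can be served from the cache
// without another LLM call, as long as its similarity score exceeds 0.8:
val hit = semanticCachingService.getFromCache("Which beer pairs well with barbecue?")
```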

&lt;h2&gt;
  
  
  4. Integrate with the RAG Service
&lt;/h2&gt;

&lt;p&gt;The RagService orchestrates the semantic caching with the standard RAG pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Service
class RagService(
    private val chatModel: ChatModel,
    private val vectorStore: RedisVectorStore,
    private val semanticCachingService: SemanticCachingService
) {
    private val logger = LoggerFactory.getLogger(RagService::class.java)
    fun retrieve(message: String): RagResult {
        // Check semantic cache first
        val startCachingTime = System.currentTimeMillis()
        val cachedAnswer = semanticCachingService.getFromCache(message, 0.8)
        val cachingTimeMs = System.currentTimeMillis() - startCachingTime
        if (cachedAnswer != null) {
            logger.info("Returning cached response")
            return RagResult(
                generation = Generation(AssistantMessage(cachedAnswer)),
                metrics = RagMetrics(
                    embeddingTimeMs = 0,
                    searchTimeMs = 0,
                    llmTimeMs = 0,
                    cachingTimeMs = cachingTimeMs,
                    fromCache = true
                )
            )
        }
        // Standard RAG process if no cache hit
        logger.info("No cache hit, proceeding with RAG pipeline")

        // Retrieve relevant documents
        val startEmbeddingTime = System.currentTimeMillis()
        val searchResults = vectorStore.similaritySearch(
            SearchRequest.builder()
                .query(message)
                .topK(5)
                .build()
        )
        val embeddingTimeMs = System.currentTimeMillis() - startEmbeddingTime
        // Create context from retrieved documents
        val context = searchResults.joinToString("\n") { it.text }

        // Generate response using LLM
        val startLlmTime = System.currentTimeMillis()
        val prompt = createPromptWithContext(message, context)
        val response = chatModel.call(prompt)
        val llmTimeMs = System.currentTimeMillis() - startLlmTime
        // Store the response in semantic cache for future use
        val responseText = response.result.output.text ?: ""
        semanticCachingService.storeInCache(message, responseText)
        return RagResult(
            generation = response.result,
            metrics = RagMetrics(
                embeddingTimeMs = embeddingTimeMs,
                searchTimeMs = 0, // Combined with embedding time
                llmTimeMs = llmTimeMs,
                cachingTimeMs = 0,
                fromCache = false
            )
        )
    }
    private fun createPromptWithContext(query: String, context: String): Prompt {
        val systemMessage = SystemMessage("""
            You are a beer recommendation assistant. Use the provided context to answer 
            questions about beer pairings, styles, and recommendations.

            Context: $context
        """.trimIndent())

        val userMessage = UserMessage(query)

        return Prompt(listOf(systemMessage, userMessage))
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key features of the integrated RAG service:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Checks semantic cache before expensive LLM calls&lt;/li&gt;
&lt;li&gt;  Falls back to standard RAG pipeline for cache misses&lt;/li&gt;
&lt;li&gt;  Automatically caches new responses for future use&lt;/li&gt;
&lt;li&gt;  Provides detailed performance metrics including cache hit indicators&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Running the Demo
&lt;/h1&gt;

&lt;p&gt;The easiest way to run the demo is with Docker Compose, which sets up all required services in one command.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Clone the repository
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/redis-developer/redis-springboot-resources.git
cd redis-springboot-resources/artificial-intelligence/semantic-caching-with-spring-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Configure your environment
&lt;/h2&gt;

&lt;p&gt;Create a &lt;code&gt;.env&lt;/code&gt; file with your OpenAI API key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OPENAI_API_KEY=sk-your-api-key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: Start the services
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker compose up --build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will start:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;redis&lt;/strong&gt;: for storing both vector embeddings and cached responses&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;redis-insight&lt;/strong&gt;: a UI to explore the Redis data&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;semantic-caching-app&lt;/strong&gt;: the Spring Boot app that implements the semantic caching system&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 4: Use the application
&lt;/h2&gt;

&lt;p&gt;When all services are running, go to &lt;code&gt;localhost:8080&lt;/code&gt; to access the demo. You'll see a beer recommendation interface:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5lv3axkphkkkqa213woq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5lv3axkphkkkqa213woq.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you click on &lt;code&gt;Start Chat&lt;/code&gt;, the embeddings may still be being created, and you'll see a message asking you to wait for this operation to complete. This is the step where the documents we'll search through are turned into vectors and stored in the database. It runs only the first time the app starts up and is required regardless of the vector database you use.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgq0n7l8a9n6qdeze522b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgq0n7l8a9n6qdeze522b.png" width="554" height="235"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once all the embeddings have been created, you can start asking your chatbot questions. It will semantically search through the documents we have stored, try to find the best answer for your questions, and cache the responses semantically in Redis:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfcg5ebqkdyhq9rdu13f.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfcg5ebqkdyhq9rdu13f.gif" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you ask something similar to a question that has already been asked, your chatbot will retrieve the answer from the cache instead of sending the query to the LLM, returning a response much faster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34zqfpbqosu9bl779g02.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34zqfpbqosu9bl779g02.gif" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Exploring the Data in Redis Insight
&lt;/h1&gt;

&lt;p&gt;Redis Insight provides a visual interface for exploring the cached data in Redis. Access it at &lt;code&gt;localhost:5540&lt;/code&gt; to see:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Semantic Cache Entries&lt;/strong&gt;: Stored as JSON documents with vector embeddings&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Vector Index Schema&lt;/strong&gt;: The schema used for similarity search&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Performance Metrics&lt;/strong&gt;: Monitor cache hit rates and response times&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fno3i763jso2a7nrfhb22.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fno3i763jso2a7nrfhb22.png" alt="captionless image" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you run the &lt;code&gt;FT.INFO semanticCachingIdx&lt;/code&gt; command in the Redis Insight workbench, you'll see the details of the vector index schema that enables efficient semantic matching.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fix5rqlugz9lv1uoxecat.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fix5rqlugz9lv1uoxecat.png" alt="captionless image" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Wrapping up
&lt;/h1&gt;

&lt;p&gt;And that’s it — you now have a working semantic caching system using Spring Boot and Redis.&lt;/p&gt;

&lt;p&gt;Instead of making expensive LLM calls for every similar question, your application can now intelligently cache and retrieve responses based on semantic meaning. Redis handles the vector storage and similarity search with the performance and scalability it is known for.&lt;/p&gt;

&lt;p&gt;With Spring AI and Redis, you get an easy way to integrate semantic caching into your Java applications. The combination of vector similarity search for semantic matching and efficient caching gives you a powerful foundation for building cost-effective, high-performance AI applications.&lt;/p&gt;

&lt;p&gt;Whether you’re building chatbots, recommendation engines, or question-answering systems, this semantic caching architecture gives you the tools to dramatically reduce costs while maintaining response quality and improving user experience.&lt;/p&gt;

&lt;p&gt;Try it out, experiment with different similarity thresholds, explore other embedding models, and see how much you can save on LLM costs while delivering faster responses!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Stay Curious!&lt;/strong&gt;
&lt;/h2&gt;

</description>
      <category>redis</category>
      <category>systemdesign</category>
      <category>springboot</category>
      <category>ai</category>
    </item>
    <item>
      <title>Agent Long-term Memory with Spring AI &amp; Redis</title>
      <dc:creator>Raphael De Lio</dc:creator>
      <pubDate>Wed, 16 Jul 2025 19:57:23 +0000</pubDate>
      <link>https://dev.to/redis/agent-memory-with-spring-ai-redis-58g5</link>
      <guid>https://dev.to/redis/agent-memory-with-spring-ai-redis-58g5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;TL;DR:&lt;br&gt;
You're building an AI agent with memory using Spring AI and Redis.&lt;/p&gt;

&lt;p&gt;Unlike traditional chatbots that forget previous interactions, memory-enabled agents can recall past conversations and facts.&lt;/p&gt;

&lt;p&gt;It works by storing two types of memory in Redis: short-term (conversation history) and long-term (facts and experiences as vectors), allowing agents to provide personalized, context-aware responses.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;LLMs respond to each message in isolation, treating every interaction as if it's the first time they've spoken with a user. They lack the ability to remember previous conversations, preferences, or important facts.&lt;/p&gt;

&lt;p&gt;Memory-enabled AI agents, on the other hand, can maintain context across multiple interactions. They remember who you are, what you've told them before, and can use that information to provide more personalized, relevant responses.&lt;/p&gt;

&lt;p&gt;In a travel assistant scenario, for example, if a user mentions "I'm allergic to shellfish" in one conversation, and later asks for restaurant recommendations in Boston, a memory-enabled agent would recall the allergy information and filter out inappropriate suggestions, creating a much more helpful and personalized experience.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Video: &lt;a href="https://youtu.be/0U1S0WSsPuE" rel="noopener noreferrer"&gt;What is an embedding model?&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Behind the scenes, this works thanks to vector similarity search. It turns text into vectors (embeddings) — lists of numbers — stores them in a vector database, and then finds the ones closest to your query when relevant information needs to be recalled.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Video: &lt;a href="https://youtu.be/o3XN4dImESE" rel="noopener noreferrer"&gt;What is semantic search?&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Today, we're going to build a memory-enabled AI agent that helps users plan travel. It will remember user preferences, past trips, and important details across multiple conversations — even if the user leaves and comes back later.&lt;/p&gt;

&lt;p&gt;To do that, we'll build a Spring Boot app from scratch and use Redis as our memory store. It'll handle both short-term memory (conversation history) and long-term memory (facts and preferences as vector embeddings), enabling our agent to provide truly personalized assistance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Redis as a Memory Store for AI Agents
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Video: &lt;a href="https://youtu.be/Yhv19le0sBw" rel="noopener noreferrer"&gt;What is a vector database?&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Over the last 15 years, Redis has become the foundational infrastructure for real-time applications. Today, with Redis Open Source 8, it's committed to becoming the foundational infrastructure for AI applications as well.&lt;/p&gt;

&lt;p&gt;Redis Open Source 8 not only turns the community version of Redis into a Vector Database, but also makes it the fastest and most scalable database in the market today. Redis 8 allows you to scale to one billion vectors without penalizing latency.&lt;/p&gt;

&lt;p&gt;Learn more: &lt;a href="https://redis.io/blog/searching-1-billion-vectors-with-redis-8/" rel="noopener noreferrer"&gt;https://redis.io/blog/searching-1-billion-vectors-with-redis-8/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For AI agents, Redis serves as both:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A short-term memory store using Redis Lists to maintain conversation history&lt;/li&gt;
&lt;li&gt;A long-term memory store using Redis JSON and the Redis Query Engine that enables vector search to store and retrieve facts and experiences&lt;/li&gt;
&lt;/ol&gt;
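
&lt;p&gt;The short-term side can be sketched with plain Redis Lists. This is a minimal example assuming a local Redis instance and an illustrative key-naming scheme (not the application's actual keys):&lt;/p&gt;

```kotlin
import redis.clients.jedis.JedisPooled

fun main() {
    val jedis = JedisPooled("localhost", 6379)
    val key = "chat:history:user-123" // illustrative key name

    // Append each conversation turn to the history
    jedis.rpush(key, "USER: I'm allergic to shellfish")
    jedis.rpush(key, "ASSISTANT: Noted, I'll avoid shellfish in my suggestions")

    // Keep only the most recent 20 turns to bound the prompt size
    jedis.ltrim(key, -20, -1)

    // Reload the history when the conversation resumes
    val history = jedis.lrange(key, 0, -1)
    history.forEach { println(it) }
}
```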

&lt;h2&gt;
  
  
  Spring AI and Redis
&lt;/h2&gt;

&lt;p&gt;Spring AI provides a unified API for working with various AI models and vector stores. Combined with Redis, it allows developers to easily build memory-enabled AI agents that can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Store and retrieve vector embeddings for semantic search&lt;/li&gt;
&lt;li&gt;Maintain conversation context across sessions&lt;/li&gt;
&lt;li&gt;Extract and deduplicate memories from conversations&lt;/li&gt;
&lt;li&gt;Summarize long conversations to prevent context window overflow&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Building the Application
&lt;/h2&gt;

&lt;p&gt;Our application will be built using Spring Boot with Spring AI and Redis. It will implement a travel assistant that remembers user preferences and past trips, providing personalized recommendations based on this memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  0. GitHub Repository
&lt;/h3&gt;

&lt;p&gt;The full application can be found on GitHub: &lt;a href="https://github.com/redis-developer/redis-springboot-resources/tree/main/artificial-intelligence/agent-long-term-memory-with-spring-ai" rel="noopener noreferrer"&gt;https://github.com/redis-developer/redis-springboot-resources/tree/main/artificial-intelligence/agent-long-term-memory-with-spring-ai&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Add the required dependencies
&lt;/h3&gt;

&lt;p&gt;From a Spring Boot application, add the following dependencies to your Maven or Gradle file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;    &lt;span class="nf"&gt;implementation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"org.springframework.ai:spring-ai-transformers:1.0.0"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;implementation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"org.springframework.ai:spring-ai-starter-vector-store-redis"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;implementation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"org.springframework.ai:spring-ai-starter-model-openai"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;implementation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"com.redis.om:redis-om-spring:1.0.0-RC3"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Define the Memory model
&lt;/h3&gt;

&lt;p&gt;The core of our implementation is the Memory class that represents items stored in long-term memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;data class&lt;/span&gt; &lt;span class="nc"&gt;Memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;memoryType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;MemoryType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;LocalDateTime&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LocalDateTime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MemoryType&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;EPISODIC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Personal experiences and preferences&lt;/span&gt;
    &lt;span class="nc"&gt;SEMANTIC&lt;/span&gt;   &lt;span class="c1"&gt;// General knowledge and facts&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Configure the Vector Store
&lt;/h3&gt;

&lt;p&gt;We'll use Spring AI's &lt;code&gt;RedisVectorStore&lt;/code&gt; to store and search vector embeddings of memories:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Configuration&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MemoryVectorStoreConfig&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="nd"&gt;@Bean&lt;/span&gt;
    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;memoryVectorStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;embeddingModel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;EmbeddingModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;jedisPooled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;JedisPooled&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;RedisVectorStore&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;RedisVectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jedisPooled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embeddingModel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;indexName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"longTermMemoryIdx"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contentFieldName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embeddingFieldName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"embedding"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;metadataFields&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="nc"&gt;RedisVectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MetadataField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"memoryType"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FieldType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TAG&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="nc"&gt;RedisVectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MetadataField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"metadata"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FieldType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="nc"&gt;RedisVectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MetadataField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FieldType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TAG&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="nc"&gt;RedisVectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MetadataField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"createdAt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FieldType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"long-term-memory:"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initializeSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vectorAlgorithm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;RedisVectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Algorithm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HSNW&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's break this down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Index Name&lt;/strong&gt;: &lt;code&gt;longTermMemoryIdx&lt;/code&gt; - Redis will create an index with this name for searching memories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content Field&lt;/strong&gt;: &lt;code&gt;content&lt;/code&gt; - The raw memory content that will be embedded&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding Field&lt;/strong&gt;: &lt;code&gt;embedding&lt;/code&gt; - The field that will store the resulting vector embedding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata Fields&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;memoryType&lt;/code&gt;: TAG field for filtering by memory type (EPISODIC or SEMANTIC)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;metadata&lt;/code&gt;: TEXT field for storing additional context about the memory&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;userId&lt;/code&gt;: TAG field for filtering by user ID&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;createdAt&lt;/code&gt;: TEXT field for storing the creation timestamp&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
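&lt;p&gt;Note that this configuration assumes a &lt;code&gt;JedisPooled&lt;/code&gt; bean is available in the application context. A minimal sketch of such a bean (the host and port here are assumptions; point them at your own Redis instance):&lt;/p&gt;

```kotlin
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration
import redis.clients.jedis.JedisPooled

@Configuration
class RedisConfig {

    // Hypothetical connection settings; adjust to your environment
    @Bean
    fun jedisPooled(): JedisPooled = JedisPooled("localhost", 6379)
}
```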

&lt;h3&gt;
  
  
  4. Implement the Memory Service
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;MemoryService&lt;/code&gt; handles storing and retrieving memories from Redis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Service&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MemoryService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;memoryVectorStore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;RedisVectorStore&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;systemUserId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"system"&lt;/span&gt;

    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;storeMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;memoryType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;MemoryType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"{}"&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;StoredMemory&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Check if a similar memory already exists to avoid duplicates&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;similarMemoryExists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memoryType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;StoredMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="nc"&gt;Memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;memoryType&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memoryType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;userId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;userId&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="n"&gt;systemUserId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;createdAt&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LocalDateTime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Create a document for the vector store&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;document&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nf"&gt;mapOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="s"&gt;"memoryType"&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;memoryType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s"&gt;"metadata"&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s"&gt;"userId"&lt;/span&gt; &lt;span class="nf"&gt;to&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userId&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="n"&gt;systemUserId&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="s"&gt;"createdAt"&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="nc"&gt;LocalDateTime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;// Store the document in the vector store&lt;/span&gt;
        &lt;span class="n"&gt;memoryVectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;listOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;StoredMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nc"&gt;Memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;memoryType&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memoryType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;userId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;userId&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="n"&gt;systemUserId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;createdAt&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LocalDateTime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;retrieveMemories&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;memoryType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;MemoryType&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Int&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;distanceThreshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Float&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.9f&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;StoredMemory&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Build filter expression&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;b&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FilterExpressionBuilder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;filterList&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mutableListOf&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;FilterExpressionBuilder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Op&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;// Add user filter&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;effectiveUserId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;userId&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="n"&gt;systemUserId&lt;/span&gt;
        &lt;span class="n"&gt;filterList&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;or&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;eq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;effectiveUserId&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;eq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;systemUserId&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

        &lt;span class="c1"&gt;// Add memory type filter if specified&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memoryType&lt;/span&gt; &lt;span class="p"&gt;!=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;filterList&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;eq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"memoryType"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memoryType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Combine filters&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;filterExpression&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filterList&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;
            &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;filterList&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;filterList&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reduce&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expr&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;and&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;// Execute search&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;searchResults&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memoryVectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;similaritySearch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nc"&gt;SearchRequest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;topK&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filterExpression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filterExpression&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;// Transform results to StoredMemory objects&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;searchResults&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mapNotNull&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distanceThreshold&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;metadata&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;
                &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;memoryObj&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;memoryType&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MemoryType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;valueOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"memoryType"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="nc"&gt;MemoryType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SEMANTIC&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"metadata"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="s"&gt;"{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;userId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="n"&gt;systemUserId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;createdAt&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="nc"&gt;LocalDateTime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"createdAt"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;?)&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="nc"&gt;LocalDateTime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nc"&gt;StoredMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memoryObj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;null&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key features of the memory service:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stores memories as vector embeddings in Redis&lt;/li&gt;
&lt;li&gt;Retrieves memories using vector similarity search&lt;/li&gt;
&lt;li&gt;Filters memories by user ID and memory type&lt;/li&gt;
&lt;li&gt;Prevents duplicate memories through similarity checking&lt;/li&gt;
&lt;/ul&gt;
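&lt;p&gt;One detail worth calling out: although the parameter is named &lt;code&gt;distanceThreshold&lt;/code&gt;, it is compared against the similarity &lt;em&gt;score&lt;/em&gt; on each search result, so only results scoring &lt;em&gt;above&lt;/em&gt; the threshold are kept, and a missing score defaults to 1.0 and passes. A standalone sketch of that filtering logic (the &lt;code&gt;Scored&lt;/code&gt; type and the sample values are hypothetical):&lt;/p&gt;

```kotlin
// Hypothetical stand-in for a vector store search result
data class Scored(val content: String, val score: Double?)

// Mirrors the check in retrieveMemories: keep results whose
// similarity score exceeds the threshold; a null score defaults to 1.0
fun filterByThreshold(results: List<Scored>, threshold: Double = 0.9): List<Scored> =
    results.filter { threshold < (it.score ?: 1.0) }

fun main() {
    val kept = filterByThreshold(
        listOf(
            Scored("user enjoys hiking", 0.95),   // above threshold: kept
            Scored("loosely related note", 0.42), // below threshold: dropped
            Scored("no score reported", null)     // defaults to 1.0: kept
        )
    )
    println(kept.map { it.content }) // [user enjoys hiking, no score reported]
}
```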

&lt;h3&gt;
  
  
  5. Implement Spring AI Advisors
&lt;/h3&gt;

&lt;p&gt;We’re going to rely on the Spring AI Advisors API. Advisors are a way to intercept, modify, and enhance AI-driven interactions.&lt;br&gt;
We will implement two advisors: one for retrieving memories and another for recording them. These advisors will be plugged into our &lt;code&gt;ChatClient&lt;/code&gt; and will intercept every interaction with the LLM.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.1 Advisor for Long-term memory retrieval
&lt;/h3&gt;

&lt;p&gt;The retrieval advisor runs before LLM calls. It takes the user’s current message, performs a vector similarity search over Redis, and injects the most relevant memories into the system portion of the prompt so the model can ground its answer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Component&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LongTermMemoryRetrievalAdvisor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;memoryService&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;MemoryService&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;CallAdvisor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Ordered&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

  &lt;span class="k"&gt;companion&lt;/span&gt; &lt;span class="k"&gt;object&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;USER_ID&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ltm_user_id"&lt;/span&gt;   
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;TOP_K&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ltm_top_k"&lt;/span&gt;      
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;getOrder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Ordered&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HIGHEST_PRECEDENCE&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;
  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;getName&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"LongTermMemoryRetrievalAdvisor"&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;adviseCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ChatClientRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;CallAdvisorChain&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;ChatClientResponse&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;userId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;context&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="nc"&gt;USER_ID&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="s"&gt;"system"&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;k&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;context&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="nc"&gt;TOP_K&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nc"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;

    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;query&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;userMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;memories&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memoryService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieveRelevantMemories&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;userId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;memoryBlock&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;buildString&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nf"&gt;appendLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Use the MEMORY below if relevant. Keep answers factual and concise."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="nf"&gt;appendLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"----- MEMORY -----"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;memories&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEachIndexed&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;appendLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"${i+1}. ${m.memory.content}"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="nf"&gt;appendLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"------------------"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;enrichedPrompt&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;augmentSystemMessage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt;
      &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;existing&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
      &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="nf"&gt;buildString&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nf"&gt;appendLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memoryBlock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isNotBlank&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
              &lt;span class="nf"&gt;appendLine&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
              &lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;enrichedReq&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enrichedPrompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nextCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enrichedReq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5.2 Advisor for Long-Term Memory Recording
&lt;/h3&gt;

&lt;p&gt;The recorder advisor runs after the assistant responds. It looks at the last user message and the assistant’s reply, asks the model to extract atomic, useful facts (episodic or semantic), deduplicates them, and stores them in Redis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Component&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LongTermMemoryRecorderAdvisor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;memoryService&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;MemoryService&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;chatModel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ChatModel&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;CallAdvisor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Ordered&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

  &lt;span class="kd"&gt;data class&lt;/span&gt; &lt;span class="nc"&gt;MemoryCandidate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;MemoryType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;?)&lt;/span&gt;
  &lt;span class="kd"&gt;data class&lt;/span&gt; &lt;span class="nc"&gt;ExtractionResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;memories&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;MemoryCandidate&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;emptyList&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

  &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;extractorConverter&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeanOutputConverter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ExtractionResult&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;java&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;getOrder&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nc"&gt;Int&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Ordered&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HIGHEST_PRECEDENCE&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;getName&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"LongTermMemoryRecorderAdvisor"&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;adviseCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ChatClientRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;CallAdvisorChain&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;ChatClientResponse&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// 1) Proceed with the normal call (other advisors may have enriched the prompt)&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;res&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nextCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;// 2) Build extraction prompt (user + assistant text of *this* turn)&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;userText&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;userMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;assistantText&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chatResponse&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

    &lt;span class="c1"&gt;// 3) Ask the model to extract long-term memories as structured JSON&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;schemaHint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;extractorConverter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jsonSchema&lt;/span&gt; &lt;span class="c1"&gt;// JSON schema string for the POJO&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;extractSystem&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
            You extract LONG-TERM MEMORIES from a dialogue turn.

            A memory is either:

            1. EPISODIC MEMORIES: Personal experiences and user-specific preferences
               Examples: "User prefers Delta airlines", "User visited Paris last year"

            2. SEMANTIC MEMORIES: General domain knowledge and facts
               Examples: "Singapore requires passport", "Tokyo has excellent public transit"

            Only extract clear, factual information. Do not make assumptions or infer information that isn't explicitly stated.
            If no memories can be extracted, return an empty array.

            The instance must conform to this JSON Schema (for validation, do not output it):
              $schemaHint

            Do not include code fences, schema, or properties. Output a single-line JSON object.
        """&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trimIndent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;extractUser&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
            USER SAID:
            $userText

            ASSISTANT REPLIED:
            $assistantText

            Extract up to 5 memories with correct type; set userId if present/known.
        """&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trimIndent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;options&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ChatOptions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAiChatOptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;responseFormat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ResponseFormat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ResponseFormat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;JSON_OBJECT&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;extraction&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chatModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nc"&gt;Prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;listOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extractSystem&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
          &lt;span class="nc"&gt;UserMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extractUser&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;options&lt;/span&gt;
      &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;parsed&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;extractorConverter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extraction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="nc"&gt;ExtractionResult&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;// 4) Persist memories (MemoryService handles dedupe/thresholding)&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;userId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"ltm_user_id"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// optional per-call param&lt;/span&gt;
    &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memories&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt;
      &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;owner&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;userId&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="n"&gt;userId&lt;/span&gt;
      &lt;span class="n"&gt;memoryService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;storeMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;memoryType&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;userId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt;
      &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
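&lt;p&gt;The deduplication mentioned above happens inside MemoryService rather than in the advisor itself. The core idea can be sketched without any Spring AI types: embed the candidate fact, compare it against already-stored embeddings, and skip the write when a near-duplicate exists. The names below (InMemoryMemoryStore, storeIfNovel) and the 0.95 threshold are illustrative only; the article’s MemoryService does the equivalent with a Redis vector search.&lt;br&gt;
&lt;/p&gt;

```kotlin
// Illustrative similarity-based dedupe. The real MemoryService replaces the
// in-memory list with a Redis vector index; everything here is a sketch.
fun cosine(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var na = 0f; var nb = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]
    }
    return dot / (kotlin.math.sqrt(na) * kotlin.math.sqrt(nb))
}

class InMemoryMemoryStore(private val threshold: Float = 0.95f) {
    private val stored = mutableListOf<Pair<String, FloatArray>>()

    /** Stores the memory unless an existing embedding is within the threshold. */
    fun storeIfNovel(content: String, embedding: FloatArray): Boolean {
        if (stored.any { (_, e) -> cosine(e, embedding) >= threshold }) return false
        stored += content to embedding
        return true
    }
}
```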



&lt;h3&gt;
  
  
  6. Plugging the Advisors into Our ChatClient
&lt;/h3&gt;

&lt;p&gt;In our ChatConfig class, we configure the ChatClient as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;    &lt;span class="nd"&gt;@Bean&lt;/span&gt;
    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;chatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;chatModel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ChatModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;// chatMemory: ChatMemory, (Necessary for short-term memory)&lt;/span&gt;
        &lt;span class="n"&gt;longTermRecorder&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;LongTermMemoryRecorderAdvisor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;longTermMemoryRetrieval&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;LongTermMemoryRetrievalAdvisor&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;ChatClient&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ChatClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chatModel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;defaultAdvisors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="c1"&gt;// MessageChatMemoryAdvisor.builder(chatMemory).build(),&lt;/span&gt;
                &lt;span class="n"&gt;longTermRecorder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;longTermMemoryRetrieval&lt;/span&gt;
            &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7. Implement the Chat Service
&lt;/h3&gt;

&lt;p&gt;Since the advisors are plugged into the ChatClient itself, we don’t need to manage memory ourselves when interacting with the LLM. The only thing we need to ensure is that every interaction sends the expected parameters, namely the session or user ID, so that the advisors know which history to look at.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Service&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ChatService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;chatClient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ChatClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;shortTermMemoryRepository&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ShortTermMemoryRepository&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;travelAgentSystemPrompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;chatMemoryRepository&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ChatMemoryRepository&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;log&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoggerFactory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ChatService&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;java&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;sendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;ChatResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Use userId as the key for conversation history and long-term memory&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Processing message from user $userId: $message"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;response&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chatClient&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="nc"&gt;Prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;travelAgentSystemPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="nc"&gt;UserMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;advisors&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ChatMemory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CONVERSATION_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ltm_user_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ChatResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chatResponse&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;!!&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;


    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;getConversationHistory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chatMemoryRepository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findByConversationId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;clearConversationHistory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;shortTermMemoryRepository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;deleteById&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Cleared conversation history for user $userId from Redis"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
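&lt;p&gt;Once the REST controller from section 9 is running, exercising the agent is a single HTTP call. The host and port below assume Spring Boot’s local defaults; the message and userId fields match the ChatRequest the controller consumes.&lt;br&gt;
&lt;/p&gt;

```shell
# Send a chat message; userId keys both short-term history and long-term memory.
curl -s -X POST http://localhost:8080/api/chat \
  -H 'Content-Type: application/json' \
  -d '{"message": "I prefer window seats on long flights", "userId": "raphael"}'
```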



&lt;h3&gt;
  
  
  8. Configure the Agent System Prompt
&lt;/h3&gt;

&lt;p&gt;The agent is configured with a system prompt that explains its capabilities and access to different types of memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Bean&lt;/span&gt;
&lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;travelAgentSystemPrompt&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nc"&gt;Message&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;promptText&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
        You are a travel assistant helping users plan their trips. You remember user preferences
        and provide personalized recommendations based on past interactions.

        You have access to the following types of memory:
        1. Short-term memory: The current conversation thread
        2. Long-term memory:
           - Episodic: User preferences and past trip experiences (e.g., "User prefers window seats")
           - Semantic: General knowledge about travel destinations and requirements

        Always be helpful, personal, and context-aware in your responses.

        Always answer in text format. No markdown or special formatting.
    """&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trimIndent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;promptText&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  9. Create the REST Controller
&lt;/h3&gt;

&lt;p&gt;The REST controller exposes endpoints for chat and memory management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="nd"&gt;@RestController&lt;/span&gt;
&lt;span class="nd"&gt;@RequestMapping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/api"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ChatController&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;chatService&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ChatService&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="nd"&gt;@PostMapping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/chat"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;@RequestBody&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ChatRequest&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;ChatResponse&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;result&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chatService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ChatResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@GetMapping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/history/{userId}"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;getHistory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;@PathVariable&lt;/span&gt; &lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;MessageDto&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chatService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getConversationHistory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt;
            &lt;span class="nc"&gt;MessageDto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;role&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="nc"&gt;SystemMessage&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"system"&lt;/span&gt;
                    &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="nc"&gt;UserMessage&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"user"&lt;/span&gt;
                    &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="nc"&gt;AssistantMessage&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"assistant"&lt;/span&gt;
                    &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"unknown"&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="nc"&gt;SystemMessage&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
                    &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="nc"&gt;UserMessage&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
                    &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="nc"&gt;AssistantMessage&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
                    &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@DeleteMapping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/history/{userId}"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;clearHistory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;@PathVariable&lt;/span&gt; &lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;chatService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clearConversationHistory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
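&lt;p&gt;The &lt;code&gt;ChatRequest&lt;/code&gt;, &lt;code&gt;ChatResponse&lt;/code&gt;, and &lt;code&gt;MessageDto&lt;/code&gt; classes referenced by the controller aren't shown above. Here's a minimal sketch of what they might look like; the field names are inferred from the controller code, and the metrics type is an assumption:&lt;/p&gt;

```kotlin
// Hypothetical DTOs matching the controller above.
// Field names are inferred from the controller; the metrics type is an assumption.
data class ChatRequest(
    val message: String,  // the user's chat message
    val userId: String    // identifies whose memories and history to use
)

data class ChatResponse(
    val message: String,                        // the assistant's reply text
    val metrics: Map<String, Any> = emptyMap()  // per-message performance metrics
)

data class MessageDto(
    val role: String,     // "system", "user", "assistant", or "unknown"
    val content: String   // the message text
)
```

&lt;p&gt;Kotlin data classes give you &lt;code&gt;equals&lt;/code&gt;/&lt;code&gt;hashCode&lt;/code&gt; and JSON-friendly serialization out of the box, which is all these DTOs need.&lt;/p&gt;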



&lt;h2&gt;
  
  
  Running the Demo
&lt;/h2&gt;

&lt;p&gt;The easiest way to run the demo is with Docker Compose, which sets up all required services in one command.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Clone the repository
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/redis/redis-springboot-recipes.git
&lt;span class="nb"&gt;cd &lt;/span&gt;redis-springboot-recipes/artificial-intelligence/agent-long-term-memory-with-spring-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Configure your environment
&lt;/h3&gt;

&lt;p&gt;Create a &lt;code&gt;.env&lt;/code&gt; file with your OpenAI API key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OPENAI_API_KEY=sk-your-api-key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Start the services
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will start:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;redis: for storing both vector embeddings and chat history&lt;/li&gt;
&lt;li&gt;redis-insight: a UI to explore the Redis data&lt;/li&gt;
&lt;li&gt;agent-memory-app: the Spring Boot app that implements the memory-aware AI agent&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 4: Use the application
&lt;/h3&gt;

&lt;p&gt;When all services are running, go to &lt;code&gt;localhost:8080&lt;/code&gt; to access the demo. You'll see a travel assistant interface with a chat panel and a memory management sidebar:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftjtn2ks4srwwtmu3mme8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftjtn2ks4srwwtmu3mme8.png" alt="Screenshot of the Redis Agent Memory demo web interface. The interface is titled “Travel Agent with Redis Memory” and features two main panels: a “Memory Management” section on the left with tabs for Episodic and Semantic memories (currently showing “No episodic memories yet”), and a “Travel Assistant” chat on the right displaying a welcome message. At the top right, there’s a field to enter a user ID and buttons to start or clear the chat. The interface is clean and styled with Redis branding." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enter a user ID and click "Start Chat":&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzc9982peiet2f9hsb0py.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzc9982peiet2f9hsb0py.png" alt="Close-up screenshot of the user ID input and chat controls. The label “User ID:” appears on the left with a text input field containing the value “raphael”. To the right are two red buttons labeled “Start Chat” and “Clear Chat”." width="363" height="42"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Send a message like: "Hi, my name's Raphael. I went to Paris back in 2009 with my wife for our honeymoon and we had a lovely time. For our 10-year anniversary we're planning to go back. Help us plan the trip!"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbavzfshheyvac8o3zs3x.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbavzfshheyvac8o3zs3x.gif" alt="Animated screen recording of a user sending a message in the Redis Agent Memory demo. The user, identified as “raphael”, types a detailed message into the chat input box: “Hi, my name’s Raphael. I went to Paris back in 2009 with my wife for our honeymoon and we had a lovely time. For our 10-year anniversary we’re planning to go back. Help us plan the trip!” The cursor then clicks the red “Send” button, initiating the interaction with the AI travel assistant." width="760" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The system will reply to your message and, if it identifies potential memories worth keeping, store them as either semantic or episodic memories. You can see the stored memories in the "Memory Management" sidebar.&lt;/p&gt;

&lt;p&gt;On top of that, with each message, the system will also return performance metrics.&lt;/p&gt;
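&lt;p&gt;You can also exercise the same endpoints directly from the command line. The paths and field names below come from the REST controller shown earlier (&lt;code&gt;message&lt;/code&gt; and &lt;code&gt;userId&lt;/code&gt; from &lt;code&gt;ChatRequest&lt;/code&gt;):&lt;/p&gt;

```shell
# Send a chat message for a given user
curl -X POST http://localhost:8080/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Help me plan a trip to Paris", "userId": "raphael"}'

# Fetch the stored conversation history for that user
curl http://localhost:8080/api/history/raphael

# Clear the short-term conversation history
curl -X DELETE http://localhost:8080/api/history/raphael
```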

&lt;p&gt;If you refresh the page, you will see that all memories and the chat history disappear from the interface.&lt;/p&gt;

&lt;p&gt;If you re-enter the same user ID, the long-term memories will be reloaded in the sidebar and the short-term memory (the chat history) will be restored as well:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsi3t9hf3563r6tqnedjl.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsi3t9hf3563r6tqnedjl.gif" alt="Animated screen recording of the Redis Agent Memory demo after sending a message. The sidebar under “Episodic Memories” now shows two stored entries: one noting that the user went to Paris in 2009 for their honeymoon, and another about planning a return for their 10-year anniversary. The chat assistant responds with a personalized message suggesting activities and asking follow-up questions. The browser page is then refreshed, clearing both the chat history and memory display. After re-entering the same user ID, the agent reloads the long-term memories in the sidebar and restores the conversation history, demonstrating persistent memory retrieval." width="760" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If you clear the chat, the short-term conversation history is reset, but the long-term memories remain in the sidebar and can still inform the agent's responses:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6eided1rhbxxoes2eaht.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6eided1rhbxxoes2eaht.gif" alt="Animated screen recording of a cleared chat session in the Redis Agent Memory demo. The “Episodic Memories” panel still shows two past memories about a trip to Paris. In the chat panel, the message “Conversation cleared. How can I assist you today?” appears, indicating that the short-term memory has been reset. The user is about to start a new conversation. This demonstrates that although the short-term context is gone, the agent retains access to long-term memories, allowing it to respond with relevant information from past interactions." width="760" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Exploring the Data in Redis Insight
&lt;/h2&gt;

&lt;p&gt;RedisInsight provides a visual interface for exploring the data stored in Redis. Access it at &lt;code&gt;localhost:5540&lt;/code&gt; to see:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Short-term memory (conversation history) stored in Redis Lists&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzq3d8c7jub35362acujt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzq3d8c7jub35362acujt.png" alt="Screenshot of RedisInsight displaying the contents of the conversation:raphael key. The selected key is a Redis list representing a conversation history. On the right panel, the list shows four indexed elements: system prompts defining the assistant’s role and memory access, a user message asking “Where did I go back in 2009?”, and the assistant’s reply recalling a previous trip to Paris. Below this, several memory entries stored as JSON keys are also visible. This illustrates how short-term chat history is preserved in Redis and replayed per user session." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Long-term memory (facts and experiences) stored as JSON documents with vector embeddings&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5e9x8q1d0pb2mizac9k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5e9x8q1d0pb2mizac9k.png" alt="Screenshot of RedisInsight showing a semantic memory stored in Redis. The selected key is a JSON object with the name memory:04d04.... The right panel displays the memory’s fields: createdAt timestamp, empty metadata, memoryType set to “SEMANTIC”, an embedding vector (collapsed), userId set to “system”, and the memory content: “Paris is a beautiful city known for celebrating love”. This illustrates how general knowledge is stored as semantic memory in the AI agent." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The vector index schema used for similarity search&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you run the &lt;code&gt;FT.INFO longTermMemoryIdx&lt;/code&gt; command in the RedisInsight workbench, you'll see the details of the vector index schema that enables efficient memory retrieval.&lt;/p&gt;
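&lt;p&gt;With that schema in place, you can also query the index yourself from the workbench. Below is a sketch of the kind of hybrid query such an application might run: a tag filter on &lt;code&gt;userId&lt;/code&gt; combined with a KNN vector search. The query vector would normally be supplied as a 384-dimension FLOAT32 binary blob by the client library, so the &lt;code&gt;$vec&lt;/code&gt; parameter below is a placeholder:&lt;/p&gt;

```
FT.SEARCH longTermMemoryIdx
  "(@userId:{raphael})=>[KNN 5 @embedding $vec AS score]"
  PARAMS 2 vec "<384-dim FLOAT32 blob>"
  SORTBY score
  RETURN 3 content memoryType score
  DIALECT 2
```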

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hwu6hquvpfik5lhrs31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hwu6hquvpfik5lhrs31.png" alt="Screenshot of RedisInsight Workbench showing the schema details of the longTermMemoryIdx vector index. The result of the FT.INFO longTermMemoryIdx command displays an index on JSON documents prefixed with memory:. The schema includes: •    $.content as a TEXT field named content  •    $.embedding as a VECTOR field using HNSW with 384-dimension FLOAT32 vectors and COSINE distance  •    $.memoryType and $.userId as TAG fields  •    $.metadata and $.createdAt as TEXT fields  This shows how memory data is structured and searchable in Redis using RediSearch vector similarity." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;And that's it — you now have a working AI agent with memory using Spring Boot and Redis.&lt;/p&gt;

&lt;p&gt;Instead of forgetting everything between conversations, your agent can now remember user preferences, past experiences, and important facts. Redis handles both short-term memory (conversation history) and long-term memory (vector embeddings) — all with the performance and scalability Redis is known for.&lt;/p&gt;

&lt;p&gt;With Spring AI and Redis, you get an easy way to integrate this into your Java applications. The combination of vector similarity search for semantic retrieval and traditional data structures for conversation history gives you a powerful foundation for building truly intelligent agents.&lt;/p&gt;

&lt;p&gt;Whether you're building customer service bots, personal assistants, or domain-specific experts, this memory architecture gives you the tools to create more helpful, personalized, and context-aware AI experiences.&lt;/p&gt;

&lt;p&gt;Try it out, experiment with different memory types, explore other embedding models, and see how far you can push the boundaries of AI agent capabilities!&lt;/p&gt;

&lt;p&gt;Stay Curious!&lt;/p&gt;

</description>
      <category>springboot</category>
      <category>ai</category>
      <category>redis</category>
      <category>vectordatabase</category>
    </item>
    <item>
      <title>How I Improved Zero-Shot Classification in Deep Java Library (DJL) OSS</title>
      <dc:creator>Raphael De Lio</dc:creator>
      <pubDate>Sun, 15 Jun 2025 17:15:42 +0000</pubDate>
      <link>https://dev.to/raphaeldelio/how-i-improved-zero-shot-classification-in-deep-java-library-djl-oss-1ni0</link>
      <guid>https://dev.to/raphaeldelio/how-i-improved-zero-shot-classification-in-deep-java-library-djl-oss-1ni0</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Did you know the Deep Java Library (DJL) powers Spring AI and Redis OM Spring? DJL helps you run machine learning models right inside your Java applications. &lt;/p&gt;

&lt;p&gt;Check them out:&lt;br&gt;
Spring AI with DJL: &lt;a href="https://docs.spring.io/spring-ai/reference/api/embeddings/onnx.html" rel="noopener noreferrer"&gt;https://docs.spring.io/spring-ai/reference/api/embeddings/onnx.html&lt;/a&gt;&lt;br&gt;
Semantic Search with SpringBoot &amp;amp; Redis: &lt;a href="https://foojay.io/today/semantic-search-with-spring-boot-redis/" rel="noopener noreferrer"&gt;https://foojay.io/today/semantic-search-with-spring-boot-redis/&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;TL;DR:&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;You’re doing zero-shot classification in a Java app using DJL.&lt;/li&gt;
&lt;li&gt;DJL didn’t handle some models well — like DeBERTa. It missed support for token_type_ids, assumed wrong label positions, and oversimplified the softmax implementation.&lt;/li&gt;
&lt;li&gt;It was fixed by reading the model config files and adjusting DJL's translator logic.&lt;/li&gt;
&lt;li&gt;Now DJL gives correct results across different models — just like the Transformers library does in Python.&lt;/li&gt;
&lt;li&gt;The fix is merged and will probably be released with version 0.34.0.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  📚 Index
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Introduction: What is Zero-Shot Classification?&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Integrating a Zero-Shot Classification Model with the Deep Java Library&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Problem #1: No support for token_type_ids&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Problem #2: Hard coded logit positions and wrong softmax implementation&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Contributing to the Deep Java Library&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Final Words&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What’s Zero-Shot Classification (and Why It Matters)
&lt;/h2&gt;

&lt;p&gt;Zero-shot classification is a machine learning technique that allows models to classify text into categories they haven’t explicitly seen during training. Unlike traditional classification models that can only predict classes they were trained on, zero-shot classifiers can generalize to new, unseen categories.&lt;/p&gt;

&lt;p&gt;One example of a zero-shot classification model is &lt;code&gt;MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli&lt;/code&gt;. Like many other models for this task, it works by comparing a sentence (the premise) to different hypotheses (the labels) and scoring how likely each one is to be true.&lt;/p&gt;

&lt;p&gt;For example, we can compare “Java is a great programming language”, the premise, to “Software Engineering, Software Programming, and Politics”, the hypotheses. In this case, the model will return:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Software Programming: 0.984
Software Engineering: 0.015
Politics: 0.001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This means “Software Programming” is the hypothesis that best classifies the premise.&lt;/p&gt;

&lt;p&gt;In this example, we’re comparing the premise to all hypotheses at once, but we could also score each hypothesis independently by enabling the “multi_label” option. In this case, it will return:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Software Programming: 0.998
Software Engineering: 0.668
Politics: 0.000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a higher score for “Software Engineering” and an even lower score for “Politics.”&lt;/p&gt;
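&lt;p&gt;The difference between the two modes comes down to how the softmax is applied: in single-label mode, the entailment logits of all hypotheses are normalized together in one softmax (so the scores sum to 1), while in multi-label mode each hypothesis's entailment and contradiction logits are softmaxed independently (so the scores don't have to sum to 1). Here's a small numeric sketch, with made-up logit values:&lt;/p&gt;

```kotlin
import kotlin.math.exp

// Softmax over an arbitrary list of logits.
fun softmax(logits: List<Double>): List<Double> {
    val exps = logits.map { exp(it) }
    val sum = exps.sum()
    return exps.map { it / sum }
}

fun main() {
    // Made-up entailment logits for three hypotheses (not real model outputs).
    val entailment = mapOf(
        "Software Programming" to 5.1,
        "Software Engineering" to 0.9,
        "Politics" to -1.8
    )

    // Single-label mode: one softmax across all hypotheses; scores sum to 1.
    val single = softmax(entailment.values.toList())
    entailment.keys.zip(single).forEach { (label, score) ->
        println("single  %-22s %.3f".format(label, score))
    }

    // Multi-label mode: per hypothesis, softmax over its own
    // (contradiction, entailment) pair; made-up contradiction logits.
    val contradiction = mapOf(
        "Software Programming" to -4.0,
        "Software Engineering" to 0.2,
        "Politics" to 6.0
    )
    entailment.forEach { (label, ent) ->
        val score = softmax(listOf(contradiction.getValue(label), ent))[1]
        println("multi   %-22s %.3f".format(label, score))
    }
}
```

&lt;p&gt;This mirrors how NLI-based zero-shot pipelines behave: multi-label mode answers "is this label entailed?" for each label on its own, which is why "Software Engineering" can still score high even when "Software Programming" scores higher.&lt;/p&gt;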

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0eh1wxby225hn7xsjoso.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0eh1wxby225hn7xsjoso.gif" width="760" height="646"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can easily try it out at: &lt;a href="https://huggingface.co/MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli?candidate_labels=Software+Engineering%2C+Software+Programming%2C+Politics&amp;amp;multi_class=true&amp;amp;text=Java+is+a+great+programming+language" rel="noopener noreferrer"&gt;https://huggingface.co/MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrating a Zero-Shot Classification Model with the Deep Java Library
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6x56bnythsqpuyrlde5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6x56bnythsqpuyrlde5.png" width="800" height="192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Deep Java Library (DJL) is an open-source library that makes it easier to work with machine learning models in Java.&lt;/strong&gt; It lets you run models locally, in-process, inside your Java application. It supports many engines (like PyTorch and TensorFlow), and it can load models directly from Hugging Face or from disk.&lt;/p&gt;

&lt;p&gt;A cool thing about this library is that &lt;strong&gt;it hosts a collection of pre-trained models in its model zoo.&lt;/strong&gt; Those models are ready to use for common tasks like image classification, object detection, text classification, and more. They are curated and maintained by the DJL team to ensure they work out of the box with DJL’s APIs and that developers can load these models easily using a simple criteria-based API.&lt;/p&gt;

&lt;p&gt;One example is this zero-shot classification model developed by Facebook: &lt;code&gt;facebook/bart-large-mnli&lt;/code&gt;. This model is hosted by DJL in its model zoo and can easily be reached at the following URI: &lt;code&gt;djl://ai.djl.huggingface.pytorch/facebook/bart-large-mnli&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let’s see how we can easily load it into our Java application and use it to classify text.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dependencies
&lt;/h3&gt;

&lt;p&gt;The dependencies we’re going to use are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="nf"&gt;implementation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ai.djl.huggingface:tokenizers:0.32.0"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;implementation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ai.djl.pytorch:pytorch-engine:0.32.0"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;implementation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ai.djl:model-zoo:0.32.0"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Criteria Class
&lt;/h3&gt;

&lt;p&gt;The Criteria class in DJL is a builder-style utility that tells DJL &lt;strong&gt;how to load and use a model&lt;/strong&gt;. It defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Input and output types&lt;/strong&gt; (e.g., ZeroShotClassificationInput, ZeroShotClassificationOutput)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Where to get the model from&lt;/strong&gt; (like a URL or model zoo ID)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Which engine to use&lt;/strong&gt; (e.g., PyTorch, TensorFlow, ONNX)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Extra arguments&lt;/strong&gt; (like tokenizer ID, batch size, device)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Custom logic&lt;/strong&gt;, like a translator to convert between raw inputs/outputs and tensors&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;modelUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"djl://ai.djl.huggingface.pytorch/facebook/bart-large-mnli"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="nc"&gt;Criteria&lt;/span&gt; &lt;span class="n"&gt;criteria&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Criteria&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;optModelUrls&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modelUrl&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;optEngine&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"PyTorch"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setTypes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ZeroShotClassificationInput&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ZeroShotClassificationOutput&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;optTranslatorFactory&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ZeroShotClassificationTranslatorFactory&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When building a Criteria in DJL, we need to pick an engine that matches what the model was trained with. Most Hugging Face models use PyTorch. We also have to define the input and output types the model expects. &lt;strong&gt;For zero-shot classification, DJL gives us ready-to-use classes:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ZeroShotClassificationInput&lt;/strong&gt;: lets us set the text (premise), candidate labels (hypotheses), whether it’s multi-label, and a hypothesis template;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ZeroShotClassificationOutput&lt;/strong&gt;: returns the labels with their confidence scores.&lt;/p&gt;

&lt;p&gt;Under the hood, machine learning models work with tensors that are basically arrays of numbers. &lt;strong&gt;To go from readable input to tensors and then back from model output to readable results, DJL uses a Translator.&lt;/strong&gt; The &lt;strong&gt;ZeroShotClassificationTranslatorFactory&lt;/strong&gt; creates a translator that knows how to tokenize the input text and how to turn raw model outputs (logits) into useful scores.&lt;/p&gt;

&lt;h3&gt;
  
  
  Loading and using the model
&lt;/h3&gt;

&lt;p&gt;Loading the model is easy — you just call ModelZoo.loadModel(criteria). The criteria tells DJL what kind of model you’re looking for, like the engine (PyTorch), input/output types, and where to find it. Once the model is loaded, we get a Predictor from it. That’s what we use to actually run the predictions.&lt;/p&gt;

&lt;p&gt;Next, we prepare the input. In this example, we’re checking how related the sentence &lt;em&gt;“Java is the best programming language”&lt;/em&gt; is to a few labels like &lt;em&gt;“Software Engineering”&lt;/em&gt;, &lt;em&gt;“Software Programming”&lt;/em&gt;, and &lt;em&gt;“Politics”&lt;/em&gt;. Since a sentence can relate to more than one label, we set multiLabel to true.&lt;/p&gt;

&lt;p&gt;Then, we run the prediction and check the result that contains the labels and their scores. Basically, how likely it is that the input belongs to each category.&lt;/p&gt;

&lt;p&gt;Finally, we loop over the results and print each label with its score. Once we’re done, we clean up by closing the predictor and model, which is always a good practice to free up resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Load the model&lt;/span&gt;
&lt;span class="nc"&gt;Model&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ModelZoo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;loadModel&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;Predictor&lt;/span&gt; &lt;span class="n"&gt;predictor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newPredictor&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Create the input&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;inputText&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Java is the best programming language"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;candidateLabels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"Software Engineering"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Software Programming"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Politics"&lt;/span&gt;&lt;span class="o"&gt;};&lt;/span&gt;
&lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="n"&gt;multiLabel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="nc"&gt;ZeroShotClassificationInput&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ZeroShotClassificationInput&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputText&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;candidateLabels&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;multiLabel&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Perform the prediction&lt;/span&gt;
&lt;span class="nc"&gt;ZeroShotClassificationOutput&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;predictor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;predict&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Print results&lt;/span&gt;
&lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"\nClassification results:"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getLabels&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getScores&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;length&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;": "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;]);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Clean up resources&lt;/span&gt;
&lt;span class="n"&gt;predictor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;close&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;close&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By running the code above, we should see the following output:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Classification results:
Software Programming: 0.82975172996521
Software Engineering: 0.15263372659683228
Politics: 0.017614541575312614
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This has been easy so far. But what if you want to use a different model?&lt;/p&gt;

&lt;h2&gt;
  
  
  Using different models
&lt;/h2&gt;

&lt;p&gt;If you want to use a different model, you have two options: pick one that’s hosted by DJL or load one directly from Hugging Face. To see all the models DJL hosts, just run the code below; it will list all the available models.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Create an empty criteria to fetch all available models&lt;/span&gt;
&lt;span class="nc"&gt;Criteria&lt;/span&gt; &lt;span class="n"&gt;criteria&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Criteria&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// List available model names&lt;/span&gt;
&lt;span class="nc"&gt;Set&lt;/span&gt; &lt;span class="n"&gt;modelNames&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ModelZoo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;listModels&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Available models from DJL:"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;modelNames&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"- "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will output multiple models along with their respective URIs, which you can simply swap into the criteria we implemented earlier in this tutorial. It should just work.&lt;/p&gt;

&lt;p&gt;However, if you want to use a model that is not available in the Model Zoo, you will have to not only download it from Hugging Face, but also convert it to a format that is compatible with DJL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using a model that is not available in the Model Zoo
&lt;/h2&gt;

&lt;p&gt;The model I want to use is the one I introduced in the beginning of this article: MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli. It’s not available in the Model Zoo, so we will need to perform a few extra steps to make it compatible with DJL.&lt;/p&gt;

&lt;p&gt;Hugging Face models are made for Python. So, we need to convert them before using them with DJL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To bridge this gap, DJL provides a tool called djl-convert that transforms these models into a format that works in Java&lt;/strong&gt;, removing Python-specific dependencies to make them ready for efficient inference with DJL.&lt;/p&gt;

&lt;p&gt;To install djl-convert, you can run the following commands in your terminal: (All details here)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    &lt;span class="c"&gt;# install release version of djl-converter&lt;/span&gt;
    pip &lt;span class="nb"&gt;install &lt;/span&gt;https://publish.djl.ai/djl_converter/djl_converter-0.30.0-py3-none-any.whl
    &lt;span class="c"&gt;# install from djl master branch&lt;/span&gt;
    pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"git+https://github.com/deepjavalibrary/djl.git#subdirectory=extensions/tokenizers/src/main/python"&lt;/span&gt;
    &lt;span class="c"&gt;# install djl-convert from local djl repo&lt;/span&gt;
    git clone https://github.com/deepjavalibrary/djl.git
    &lt;span class="nb"&gt;cd &lt;/span&gt;djl/extensions/tokenizers/src/main/python
    python3 &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
    &lt;span class="c"&gt;# Add djl-convert to PATH (if installed locally or not globally available)&lt;/span&gt;
    &lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.local/bin:&lt;/span&gt;&lt;span class="nv"&gt;$PATH&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="c"&gt;# install optimum if you want to convert to OnnxRuntime&lt;/span&gt;
    pip &lt;span class="nb"&gt;install &lt;/span&gt;optimum
    &lt;span class="c"&gt;# convert a single model to TorchScript, Onnxruntime or Rust&lt;/span&gt;
    djl-convert &lt;span class="nt"&gt;--help&lt;/span&gt;
    &lt;span class="c"&gt;# import models as DJL Model Zoo&lt;/span&gt;
    djl-import &lt;span class="nt"&gt;--help&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, you can run the following command to convert the model to a format DJL can understand:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;djl-convert -m MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This will store the converted model under the folder &lt;code&gt;model/DeBERTa-v3-large-mnli-fever-anli-ling-wanli&lt;/code&gt; in the working directory.&lt;/p&gt;

&lt;p&gt;Now we’re ready to go back to our Java application.&lt;/p&gt;

&lt;h3&gt;
  
  
  Loading a local model with DJL
&lt;/h3&gt;

&lt;p&gt;Loading a local model is also straightforward. Instead of loading it from the DJL URL, you’re going to load it from the directory that was created during the conversion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Criteria&lt;/span&gt; &lt;span class="n"&gt;criteria&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Criteria&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;optModelPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Paths&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"model/DeBERTa-v3-large-mnli-fever-anli-ling-wanli"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;optEngine&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"PyTorch"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setTypes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ZeroShotClassificationInput&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ZeroShotClassificationOutput&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;optTranslatorFactory&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ZeroShotClassificationTranslatorFactory&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running it should be as straightforward as before:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Load the model&lt;/span&gt;
&lt;span class="nc"&gt;Model&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ModelZoo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;loadModel&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;Predictor&lt;/span&gt; &lt;span class="n"&gt;predictor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newPredictor&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Create the input&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;inputText&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Java is the best programming language"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;candidateLabels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"Software Engineering"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Software Programming"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Politics"&lt;/span&gt;&lt;span class="o"&gt;};&lt;/span&gt;
&lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="n"&gt;multiLabel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="nc"&gt;ZeroShotClassificationInput&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ZeroShotClassificationInput&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputText&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;candidateLabels&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;multiLabel&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Perform the prediction&lt;/span&gt;
&lt;span class="nc"&gt;ZeroShotClassificationOutput&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;predictor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;predict&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Print results&lt;/span&gt;
&lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"\nClassification results:"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getLabels&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getScores&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;  &lt;span class="nf"&gt;Dict&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Tensor&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Problem #1: No support for token_type_ids
&lt;/h2&gt;

&lt;p&gt;Not every Zero-Shot Classification Model is the same, and one thing that sets them apart is whether they use &lt;strong&gt;token type IDs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Token type IDs are just extra markers that tell the model where one part of the input ends and the other begins, like separating the main sentence from the label it’s being compared to.&lt;/p&gt;

&lt;p&gt;Some models, like BERT or DeBERTa, were trained to expect these markers, so they need them to work properly. Others, like RoBERTa or BART, were trained without them and just ignore that input.&lt;/p&gt;
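&lt;p&gt;To make this concrete, here’s a small illustration (my own sketch, not DJL’s actual tokenizer output) of how token type IDs mark which segment each token belongs to in a premise/hypothesis pair:&lt;/p&gt;

```java
// Illustration only (not DJL's tokenizer output): token type IDs mark which
// segment each token belongs to: 0 for the premise, 1 for the hypothesis.
public class TokenTypeIdsDemo {

    // Layout assumed here: [CLS] premise [SEP] hypothesis [SEP]
    static int[] tokenTypeIds(int premiseTokens, int hypothesisTokens) {
        int[] ids = new int[premiseTokens + hypothesisTokens + 3];
        // Everything after the premise's [SEP] belongs to segment 1
        for (int i = premiseTokens + 2; i < ids.length; i++) {
            ids[i] = 1;
        }
        return ids;
    }

    public static void main(String[] args) {
        // premise = 3 tokens, hypothesis = 5 tokens
        System.out.println(java.util.Arrays.toString(tokenTypeIds(3, 5)));
        // prints [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    }
}
```

&lt;p&gt;Pair-trained models like BERT and DeBERTa rely on this segment array; models like RoBERTa and BART simply don’t take it as input.&lt;/p&gt;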

&lt;p&gt;And well, DJL’s ZeroShotClassificationTranslator had been implemented and tested with a BART model, which didn’t require token_type_ids to work properly.&lt;/p&gt;

&lt;p&gt;By digging into the implementation of ZeroShotClassificationTranslator, I was able to see that token_type_ids were actually supported by DJL; the flag was simply hardcoded in the Translator, so we couldn’t set it even if we initialized the Translator with its Builder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Line 85 of ZeroShotClassificationTranslator: https://github.com/deepjavalibrary/djl/blob/fe8103c7498f23e209adc435410d9f3731f8dd65/extensions/tokenizers/src/main/java/ai/djl/huggingface/translator/ZeroShotClassificationTranslator.java&lt;/span&gt;
&lt;span class="c1"&gt;// Token Type Ids is hardcoded to false&lt;/span&gt;
&lt;span class="nc"&gt;NDList&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toNDList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;manager&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;int32&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I fixed this by adding a method to the Translator’s Builder that sets the tokenTypeId property during initialization, and refactored the class to honor it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;ZeroShotClassificationTranslator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Builder&lt;/span&gt; &lt;span class="nf"&gt;optTokenTypeId&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="n"&gt;withTokenType&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;tokenTypeId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;withTokenType&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And even though the change worked, I was surprised to find that the scores it output were way off from the ones I expected.&lt;/p&gt;

&lt;p&gt;While Python’s Transformers library produced the following correct results:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Software Programming: 0.9982864856719971
Software Engineering: 0.7510316371917725
Politics: 0.00020543287973850965
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Deep Java Library was outputting completely wrong scores:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Politics: 0.9988358616828918
Software Engineering: 0.0009450475918129086
Software Programming: 0.00021904722962062806
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can see that the scores were so wrong that it actually output that Politics was the label that best fit our premise: “Java is the best programming language.”&lt;/p&gt;

&lt;p&gt;What’s going on here?&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem #2: Hardcoded logit positions and oversimplified softmax implementation
&lt;/h2&gt;

&lt;p&gt;To understand what’s going on, we also need to understand how Zero-Shot Classification models work. These models aren’t trained to classify things directly. Instead, they take two sentences, the input and the label as a hypothesis, and decide how they relate.&lt;/p&gt;

&lt;p&gt;They return logits: raw scores for each label like “entailment”, “contradiction”, or “neutral”. These logits are just numbers. To make them readable, we apply softmax, which turns them into probabilities between 0 and 1.&lt;/p&gt;
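&lt;p&gt;To make the logits-to-probabilities step concrete, here’s a minimal softmax in plain Java (my own illustration, not DJL’s implementation):&lt;/p&gt;

```java
// Minimal softmax sketch (not DJL's implementation): turns raw logits into
// probabilities that are all positive and sum to 1.
public class SoftmaxDemo {

    static double[] softmax(double[] logits) {
        // Subtract the max logit first for numerical stability
        double max = Double.NEGATIVE_INFINITY;
        for (double l : logits) max = Math.max(max, l);

        double sum = 0.0;
        double[] out = new double[logits.length];
        for (int i = 0; i < logits.length; i++) {
            out[i] = Math.exp(logits[i] - max);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) out[i] /= sum;
        return out;
    }

    public static void main(String[] args) {
        // e.g. raw NLI logits for [contradiction, neutral, entailment]
        double[] probs = softmax(new double[]{-2.1, 0.3, 3.8});
        System.out.println(java.util.Arrays.toString(probs));
    }
}
```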

&lt;p&gt;DJL’s original implementation didn’t handle this properly. It grabbed the last logit from each label’s output, assuming it was the “entailment” score. Then, it normalized those scores across all labels.&lt;/p&gt;

&lt;p&gt;This approach ignored the fact that each label is its own comparison: each one is a separate classification task, so softmax must be applied within each label, not across all labels.&lt;/p&gt;

&lt;p&gt;Also, not all models use the same order for their logits. We can’t assume “entailment” is always the last. To know the correct position, we should read the model’s config.json and check the label2id field.&lt;/p&gt;

&lt;p&gt;This mapping shows which index belongs to each class. Using it, we can apply softmax to the correct pair, usually “entailment” and “contradiction,” for each label.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Check an example of a config.json file here: &lt;a href="https://huggingface.co/MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli/blob/main/config.json" rel="noopener noreferrer"&gt;https://huggingface.co/MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli/blob/main/config.json&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Therefore, I not only had to fix the way softmax was applied, but also make sure we were using the correct index for the entailment score — based on what the model actually defines in its config. That meant reading the label2id mapping from config.json, identifying which index corresponds to “&lt;em&gt;entailment&lt;/em&gt;” and “&lt;em&gt;contradiction&lt;/em&gt;”, and then applying softmax to just those two values for each label.&lt;/p&gt;
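&lt;p&gt;As an illustration of the corrected logic (the label2id indices and logit values below are hypothetical; a real implementation reads the indices from the model’s config.json), the per-label score is a softmax over just the entailment/contradiction pair:&lt;/p&gt;

```java
// Sketch of per-label scoring. The indices are assumptions for illustration;
// a real implementation reads label2id from the model's config.json. For each
// candidate label the model produces one NLI logit vector, and the label's
// score is the softmax of its (entailment, contradiction) pair, taken within
// that label only, never across labels.
public class EntailmentScoreDemo {

    static double entailmentScore(double[] logits, int entailmentIdx, int contradictionIdx) {
        double e = logits[entailmentIdx];
        double c = logits[contradictionIdx];
        double max = Math.max(e, c);             // stabilize before exponentiating
        double expE = Math.exp(e - max);
        double expC = Math.exp(c - max);
        return expE / (expE + expC);
    }

    public static void main(String[] args) {
        // Hypothetical label2id: {"entailment": 0, "neutral": 1, "contradiction": 2}
        int entailmentIdx = 0, contradictionIdx = 2;
        double[] programmingLogits = {4.2, 0.1, -3.5};  // one logit vector per candidate label
        double[] politicsLogits = {-3.9, 0.2, 4.0};
        System.out.printf("Software Programming: %.4f%n",
                entailmentScore(programmingLogits, entailmentIdx, contradictionIdx));
        System.out.printf("Politics: %.4f%n",
                entailmentScore(politicsLogits, entailmentIdx, contradictionIdx));
    }
}
```

&lt;p&gt;Because each label is scored independently, the results can legitimately sum to more than 1 in multi-label mode.&lt;/p&gt;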

&lt;p&gt;After refactoring the softmax logic, the translator started outputting the expected results. To test it with different types of models, I created a GitHub repository comparing the expected results from Python’s Transformers Library with the refactored ZeroShotClassificationTranslator. &lt;/p&gt;

&lt;p&gt;You can check it out at: &lt;a href="https://github.com/raphaeldelio/deep-java-library-zero-shot-classification-comparison-to-python/" rel="noopener noreferrer"&gt;https://github.com/raphaeldelio/deep-java-library-zero-shot-classification-comparison-to-python/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Contributing to the Deep Java Library
&lt;/h2&gt;

&lt;p&gt;After I had tested and made sure the translator was working as expected, it was time to contribute back to the library. I opened a pull request to the DJL repository with the changes I had made. The maintainer was super responsive and helped me refactor my changes to follow the guidelines of the project, and after a few tweaks, the changes were approved and merged.&lt;/p&gt;

&lt;p&gt;You can find the PR here: &lt;a href="https://github.com/deepjavalibrary/djl/pull/3712" rel="noopener noreferrer"&gt;https://github.com/deepjavalibrary/djl/pull/3712&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Words
&lt;/h2&gt;

&lt;p&gt;If you’re a Java developer working with AI, I really encourage you to check out the &lt;a href="https://github.com/deepjavalibrary/djl" rel="noopener noreferrer"&gt;Deep Java Library&lt;/a&gt;, the &lt;a href="https://docs.spring.io/spring-ai/" rel="noopener noreferrer"&gt;Spring AI&lt;/a&gt;, and the &lt;a href="https://github.com/redis/redis-om-spring" rel="noopener noreferrer"&gt;Redis OM Spring&lt;/a&gt; projects, which build on top of it.&lt;/p&gt;

&lt;p&gt;Thank you for following along! &lt;/p&gt;

&lt;h3&gt;
  
  
  Stay Curious
&lt;/h3&gt;

</description>
      <category>java</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>huggingfacetransformers</category>
    </item>
    <item>
      <title>How to send prompts in bulk with Spring AI and Java Virtual Threads</title>
      <dc:creator>Raphael De Lio</dc:creator>
      <pubDate>Tue, 13 May 2025 08:40:00 +0000</pubDate>
      <link>https://dev.to/raphaeldelio/how-to-send-prompts-in-bulk-with-spring-ai-and-virtual-threads-30f7</link>
      <guid>https://dev.to/raphaeldelio/how-to-send-prompts-in-bulk-with-spring-ai-and-virtual-threads-30f7</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftlib40kdyzhyjkr0f4r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftlib40kdyzhyjkr0f4r.png" alt=" " width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;TL;DR: You’re building an AI-powered app that needs to send lots of prompts to OpenAI.&lt;br&gt;
 Instead of sending them one by one, you want to do it in bulk — efficiently and safely.&lt;br&gt;
 This is how you can use Spring AI with Java Virtual Threads to process hundreds of prompts in parallel.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When calling LLM APIs like OpenAI, you’re dealing with a high-latency, network-bound task. Normally, doing that in a loop slows you down and blocks threads. But with Spring AI and Java 21 Virtual Threads, you can fire off hundreds of requests in parallel without killing your app.&lt;/p&gt;

&lt;p&gt;This is particularly useful when you want the LLM to perform actions such as summarizing or extracting information from lots of documents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Here’s the flow:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Get your list of text inputs.&lt;/li&gt;
&lt;li&gt;Filter the ones that need processing.&lt;/li&gt;
&lt;li&gt;Split them into batches.&lt;/li&gt;
&lt;li&gt;For each batch:
— Use Virtual Threads to make OpenAI calls in parallel
— Wait for all calls to finish (using CompletableFuture)
— Save the results&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Virtual Threads for Massive Parallelism
&lt;/h3&gt;

&lt;p&gt;Java Virtual Threads are perfect for this. They’re lightweight, run on the JVM, and don’t block OS threads. Ideal for I/O-heavy operations like talking to APIs.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ExecutorService executorService = Executors.newVirtualThreadPerTaskExecutor()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each OpenAI request runs in its own thread, but without the overhead of platform (OS) threads.&lt;/p&gt;
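
&lt;p&gt;To see the effect in isolation, here’s a tiny standalone sketch (not from the article’s codebase; assumes Java 21+). It starts 200 virtual threads that each sleep 100 ms to simulate a network call, and the whole run still finishes in roughly 100 ms because the sleeps overlap:&lt;/p&gt;

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

public class VirtualThreadsDemo {

    // Launch n virtual threads, each simulating a slow I/O call, and wait for all of them.
    static int runAll(int n) {
        AtomicInteger completed = new AtomicInteger();
        Thread[] workers = IntStream.range(0, n)
                .mapToObj(i -> Thread.ofVirtual().start(() -> {
                    try {
                        Thread.sleep(100); // stand-in for a network call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    completed.incrementAndGet();
                }))
                .toArray(Thread[]::new);
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return completed.get();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int done = runAll(200);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(done + " tasks finished in about " + elapsedMs + " ms");
    }
}
```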

&lt;h3&gt;
  
  
  Spring AI Prompt Call
&lt;/h3&gt;

&lt;p&gt;You create a Prompt, then send it to the model:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ChatResponse response = chatModel.call(
  new Prompt(List.of(
    new SystemMessage("You are a helpful assistant…"),
    new UserMessage(userInput)
  ))
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You get back a structured response. From there, you just extract the output:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;String summary = response.getResult().getOutput().getText();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Processing in Batches
&lt;/h3&gt;

&lt;p&gt;Sending all prompts at once isn’t a good idea (rate limits, reliability, memory). Instead, chunk them into smaller batches (e.g., 300 items):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int batchSize = 300;
int totalBatches = (inputs.size() + batchSize - 1) / batchSize;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
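
&lt;p&gt;That expression is just integer ceiling division: for example, 750 inputs with a batch size of 300 yield 3 batches (two full ones and one of 150). A quick standalone check, with hypothetical input counts:&lt;/p&gt;

```java
public class BatchMath {

    // Ceiling division: the number of batches needed to cover n items.
    static int totalBatches(int n, int batchSize) {
        return (n + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        System.out.println(totalBatches(750, 300)); // 3
        System.out.println(totalBatches(600, 300)); // 2
        System.out.println(totalBatches(1, 300));   // 1
    }
}
```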

&lt;p&gt;For each batch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Launch a CompletableFuture for every input&lt;/li&gt;
&lt;li&gt;Wait for all with CompletableFuture.allOf(…).join()&lt;/li&gt;
&lt;li&gt;Collect the results&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Handling Errors Gracefully
&lt;/h3&gt;

&lt;p&gt;Each task is wrapped in a try/catch block. So if one OpenAI call fails, it doesn’t crash the batch. You just skip that result.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.map(input -&amp;gt; CompletableFuture.supplyAsync(() -&amp;gt; {
  try {
    ChatResponse r = chatModel.call(…);
    return r.getResult().getOutput().getText();
  } catch (Exception e) {
    return null;
  }
}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Process Results in Bulk
&lt;/h3&gt;

&lt;p&gt;After processing each batch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filter out the failed ones&lt;/li&gt;
&lt;li&gt;Process the valid results&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;List&amp;lt;TextData&amp;gt; processed = futures.stream()
    .map(CompletableFuture::join)
    .filter(Objects::nonNull)
    .toList();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Full Implementation
&lt;/h3&gt;

&lt;p&gt;In this example, we take a list of texts and send them to OpenAI in batches to get summaries. We do that in parallel, which makes the process much faster. After getting the summaries, we save the results. Everything runs in a way that handles errors and avoids overloading the system.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Service
public class BulkSummarizationService {

    private static final Logger logger = LoggerFactory.getLogger(BulkSummarizationService.class);
    private final ChatClient chatClient;
    private final TextRepository textRepository;

    public BulkSummarizationService(ChatClient chatClient, TextRepository textRepository) {
        this.chatClient = chatClient;
        this.textRepository = textRepository;
    }

    public void summarizeTexts(boolean overwrite) {
        logger.info("Starting bulk summarization");
        List&amp;lt;TextData&amp;gt; textsToSummarize = textRepository.findAll();
        logger.info("Found {} texts to summarize", textsToSummarize.size());

        if (textsToSummarize.isEmpty()) return;

        int batchSize = 300;
        int totalBatches = (textsToSummarize.size() + batchSize - 1) / batchSize;

        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i &amp;lt; totalBatches; i++) {
                int start = i * batchSize;
                int end = Math.min(start + batchSize, textsToSummarize.size());
                List&amp;lt;TextData&amp;gt; batch = textsToSummarize.subList(start, end);

                logger.info("Processing batch {} of {} ({} items)", i + 1, totalBatches, batch.size());

                List&amp;lt;CompletableFuture&amp;lt;TextData&amp;gt;&amp;gt; futures = batch.stream()
                        .map(text -&amp;gt; CompletableFuture.supplyAsync(() -&amp;gt; {
                            try {
                                ChatResponse response = chatClient.call(
                                        new Prompt(List.of(
                                                new SystemMessage("""
                                                    You are a helpful assistant that summarizes long pieces of text.
                                                    Focus on keeping the summary dense and informative.
                                                    Limit to 512 words.
                                                """),
                                                new UserMessage(text.getContent())
                                        ))
                                );
                                text.setSummary(response.getResult().getOutput().getText());
                                return text;
                            } catch (Exception e) {
                                logger.error("Failed to summarize text with ID: {}", text.getId(), e);
                                return null;
                            }
                        }, executor))
                        .toList();

                CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();

                List&amp;lt;TextData&amp;gt; summarized = futures.stream()
                        .map(CompletableFuture::join)
                        .filter(Objects::nonNull)
                        .toList();

                if (!summarized.isEmpty()) {
                    textRepository.saveAll(summarized);
                    logger.info("Saved {} summaries", summarized.size());
                }
            }
        }

        logger.info("Bulk summarization complete");
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And that’s it! You now have a fully async, high-throughput pipeline that can send hundreds of prompts to OpenAI — safely and efficiently — using nothing but Spring AI, Java Virtual Threads, and good batching.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stay curious!
&lt;/h3&gt;

</description>
      <category>springboot</category>
      <category>java</category>
      <category>ai</category>
      <category>jvm</category>
    </item>
    <item>
      <title>Semantic Search with Spring Boot &amp; Redis</title>
      <dc:creator>Raphael De Lio</dc:creator>
      <pubDate>Tue, 29 Apr 2025 08:48:59 +0000</pubDate>
      <link>https://dev.to/redis/semantic-search-with-spring-boot-redis-48l0</link>
      <guid>https://dev.to/redis/semantic-search-with-spring-boot-redis-48l0</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt;&lt;br&gt;
You’re building a &lt;strong&gt;semantic search app&lt;/strong&gt; using &lt;strong&gt;Spring Boot&lt;/strong&gt; and &lt;strong&gt;Redis&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of matching exact words, semantic search finds &lt;strong&gt;meaning&lt;/strong&gt; using &lt;strong&gt;Vector Similarity Search (VSS)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It works by turning movie synopses into &lt;strong&gt;vectors&lt;/strong&gt; with &lt;strong&gt;embedding models&lt;/strong&gt;, storing them in &lt;strong&gt;Redis&lt;/strong&gt; (as a vector database), and finding the closest matches to user queries.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9xqm8mh0qh137eo9uab.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9xqm8mh0qh137eo9uab.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Video: &lt;a href="https://www.youtube.com/watch?v=o3XN4dImESE" rel="noopener noreferrer"&gt;What is semantic search?&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A traditional search system works by matching the words a user types with the words stored in a database or document collection. It usually looks for exact or partial matches without understanding the meaning behind the words.&lt;/p&gt;

&lt;p&gt;Semantic searching, on the other hand, tries to understand the meaning behind what the user is asking. &lt;strong&gt;It focuses on the concepts, not just the keywords,&lt;/strong&gt; making it much easier for users to find what they really want.&lt;/p&gt;

&lt;p&gt;In a movie streaming service, for example, if a movie’s synopsis is stored in a database as &lt;strong&gt;“A cowboy doll feels threatened when a new space toy becomes his owner’s favorite,”&lt;/strong&gt; but the user searches for &lt;strong&gt;“jealous toy struggles with new rival,”&lt;/strong&gt; a traditional search system might not find the movie because the exact words don’t line up. &lt;/p&gt;

&lt;p&gt;But a semantic search system can still connect the two ideas and bring up the right movie. It understands the &lt;em&gt;meaning&lt;/em&gt; behind your query — not just the exact words.&lt;/p&gt;

&lt;p&gt;Behind the scenes, this works thanks to &lt;strong&gt;vector similarity search&lt;/strong&gt;. It turns text (or images, or audio) into vectors (lists of numbers), stores them in a vector database, and then finds the ones closest to your query. &lt;/p&gt;
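
&lt;p&gt;To make “closest” concrete: this article’s demo ranks vectors by cosine distance, which measures the angle between two vectors. Here is a minimal plain-Java sketch of that math, with made-up 2-D vectors standing in for real embeddings (illustrative only; in the app, Redis computes this server-side over the stored embeddings):&lt;/p&gt;

```java
import java.util.stream.IntStream;

public class CosineDistance {

    static double dot(double[] a, double[] b) {
        return IntStream.range(0, a.length).mapToDouble(i -> a[i] * b[i]).sum();
    }

    // Cosine distance = 1 - cos(angle between a and b); lower means more similar.
    static double distance(double[] a, double[] b) {
        return 1.0 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    public static void main(String[] args) {
        double[] query     = {1.0, 0.0};  // hypothetical 2-D embeddings
        double[] similar   = {0.9, 0.1};
        double[] unrelated = {0.0, 1.0};
        System.out.println(distance(query, similar));   // close to 0 (similar meaning)
        System.out.println(distance(query, unrelated)); // 1.0 (orthogonal, unrelated)
    }
}
```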

&lt;p&gt;Today, &lt;strong&gt;we’re gonna build a vector similarity search app that lets users find movies based on the &lt;em&gt;meaning&lt;/em&gt; of their synopsis — not just exact keyword matches&lt;/strong&gt;. So that even if they don’t know the title, they can still get the right movie based on a generic description of the synopsis.&lt;/p&gt;

&lt;p&gt;To do that, we’ll build a Spring Boot app from scratch and plug in &lt;strong&gt;Redis OM Spring&lt;/strong&gt;. It’ll handle turning our data into vectors, storing them in Redis, and running fast vector searches when users send a query.&lt;/p&gt;

&lt;h2&gt;
  
  
  Redis as a Vector Database
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Video: &lt;a href="https://www.youtube.com/watch?v=Yhv19le0sBw" rel="noopener noreferrer"&gt;What is a vector database?&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Over the last 15 years, Redis has become the foundational infrastructure for real-time applications. Today, with Redis 8, it’s committed to becoming the foundational infrastructure for AI applications as well. &lt;/p&gt;

&lt;p&gt;Redis 8 not only turns the community version of Redis into a Vector Database, but also makes it the fastest and most scalable database in the market today. &lt;strong&gt;Redis 8 allows you to scale to one billion vectors without penalizing latency.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Learn more: &lt;a href="https://redis.io/blog/searching-1-billion-vectors-with-redis-8/" rel="noopener noreferrer"&gt;https://redis.io/blog/searching-1-billion-vectors-with-redis-8/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Redis OM Spring&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To allow our users and customers to take full advantage of everything Redis can do — with the speed Redis is known for — we decided to implement &lt;strong&gt;Redis OM Spring&lt;/strong&gt;, a library built on top of Spring Data Redis. &lt;/p&gt;

&lt;p&gt;Redis OM Spring allows our users to easily communicate with Redis, model their entities as &lt;strong&gt;JSONs&lt;/strong&gt; or Hashes, efficiently query them by leveraging the &lt;strong&gt;Redis Query Engine&lt;/strong&gt;, and even take advantage of probabilistic data structures such as &lt;strong&gt;Count-min Sketch, Bloom Filters, Cuckoo Filters&lt;/strong&gt;, and more. &lt;/p&gt;

&lt;p&gt;Redis OM Spring on GitHub: &lt;a href="https://github.com/redis/redis-om-spring" rel="noopener noreferrer"&gt;https://github.com/redis/redis-om-spring&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dataset
&lt;/h2&gt;

&lt;p&gt;The dataset we’ll be looking at is a catalog of thousands of movies. Each of these movies has metadata such as its title, cast, genre, year, and synopsis. The JSON file representing this dataset can be found in the repository that accompanies this article.&lt;/p&gt;

&lt;p&gt;Sample:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "title": "Toy Story",
  "year": 1995,
  "cast": [
   "Tim Allen",
   "Tom Hanks",
   "Don Rickles"
  ],
  "genres": [
   "Animated",
   "Comedy"
  ],
  "href": "Toy_Story",
  "extract": "Toy Story is a 1995 American computer-animated comedy film directed by John Lasseter, produced by Pixar Animation Studios and released by Walt Disney Pictures. The first installment in the  Toy Story franchise, it was the first entirely computer-animated feature film, as well as the first feature film from Pixar. It was written by Joss Whedon, Andrew Stanton, Joel Cohen, and Alec Sokolow from a story by Lasseter, Stanton, Pete Docter, and Joe Ranft. The film features music by Randy Newman, was produced by Bonnie Arnold and Ralph Guggenheim, and was executive-produced by Steve Jobs and Edwin Catmull. The film features the voices of Tom Hanks, Tim Allen, Don Rickles, Jim Varney, Wallace Shawn, John Ratzenberger, Annie Potts, R. Lee Ermey, John Morris, Laurie Metcalf, and Erik von Detten.",
  "thumbnail": "https://upload.wikimedia.org/wikipedia/en/1/13/Toy_Story.jpg",
  "thumbnail_width": 250,
  "thumbnail_height": 373
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Building the Application
&lt;/h2&gt;

&lt;p&gt;Our application will be built using Spring Boot with Redis OM Spring. &lt;strong&gt;It will allow movies to be searched by their synopsis based on semantic search rather than keyword matching.&lt;/strong&gt; Besides that, our application will also allow its users to perform &lt;strong&gt;hybrid search&lt;/strong&gt;, &lt;strong&gt;a technique that combines vector similarity with traditional filtering and sorting.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  0. GitHub Repository
&lt;/h3&gt;

&lt;p&gt;The full application can be found on GitHub: &lt;a href="https://github.com/redis/redis-om-spring/tree/main/demos/roms-vss-movies/src" rel="noopener noreferrer"&gt;https://github.com/redis/redis-om-spring/tree/main/demos/roms-vss-movies/&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Add the required dependencies
&lt;/h3&gt;

&lt;p&gt;From a Spring Boot application, add the following dependencies to your Maven or Gradle file: &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!-- Redis OM Spring for Redis object mapping and vector search --&amp;gt;
&amp;lt;dependency&amp;gt;
    &amp;lt;groupId&amp;gt;com.redis.om.spring&amp;lt;/groupId&amp;gt;
    &amp;lt;artifactId&amp;gt;redis-om-spring&amp;lt;/artifactId&amp;gt;
    &amp;lt;version&amp;gt;0.9.11&amp;lt;/version&amp;gt;
&amp;lt;/dependency&amp;gt;

&amp;lt;!-- Redis OM Spring uses Spring AI for creating embeddings (vectors) --&amp;gt;
&amp;lt;dependency&amp;gt;
    &amp;lt;groupId&amp;gt;org.springframework.ai&amp;lt;/groupId&amp;gt;
    &amp;lt;artifactId&amp;gt;spring-ai-openai&amp;lt;/artifactId&amp;gt;
    &amp;lt;version&amp;gt;1.0.0-M6&amp;lt;/version&amp;gt;
&amp;lt;/dependency&amp;gt;
&amp;lt;dependency&amp;gt;
    &amp;lt;groupId&amp;gt;org.springframework.ai&amp;lt;/groupId&amp;gt;
    &amp;lt;artifactId&amp;gt;spring-ai-transformers&amp;lt;/artifactId&amp;gt;
    &amp;lt;version&amp;gt;1.0.0-M6&amp;lt;/version&amp;gt;
&amp;lt;/dependency&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  2. Define the Movie entity
&lt;/h3&gt;

&lt;p&gt;Redis OM Spring provides two annotations that make it easy to vectorize data and perform vector similarity search from within Spring Boot.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;@Vectorize: Automatically generates vector embeddings from the text field&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;@Indexed: Enables vector indexing on a field for efficient search&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The core of the implementation is the Movie class with Redis vector indexing annotations:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@RedisHash // This annotation is used by Redis OM Spring to store the entity as a hash in Redis
public class Movie {

    @Id // The movie title serves as the entity ID (Redis OM Spring would otherwise generate a ULID)
    private String title;

    @Indexed(sortable = true) // This annotation enables indexing on the field for filtering and sorting
    private int year;

    @Indexed
    private List&amp;lt;String&amp;gt; cast;

    @Indexed
    private List&amp;lt;String&amp;gt; genres;

    private String href;

    // This annotation automatically generates vector embeddings from the text
    @Vectorize(
            destination = "embeddedExtract", // The field where the embedding will be stored
            embeddingType = EmbeddingType.SENTENCE, // Type of embedding to generate (SENTENCE, IMAGE, FACE, or WORD)
            provider = EmbeddingProvider.OPENAI, // The provider for generating embeddings (OpenAI, Transformers, VertexAI, etc.)
            openAiEmbeddingModel = OpenAiApi.EmbeddingModel.TEXT_EMBEDDING_3_LARGE // The specific OpenAI model to use for embeddings
    )
    private String extract;

    // This defines the vector field that will store the embeddings
    // The indexed annotation enables vector search on this field
    @Indexed(
            schemaFieldType = SchemaFieldType.VECTOR, // Defines the field type as a vector
            algorithm = VectorField.VectorAlgorithm.FLAT, // The algorithm used for vector search (FLAT or HNSW)
            type = VectorType.FLOAT32,
            dimension = 3072, // The dimension of the vector (must match the embedding model)
            distanceMetric = DistanceMetric.COSINE, // The distance metric used for similarity search (Cosine or Euclidean)
            initialCapacity = 10
    )
    private byte[] embeddedExtract;

    private String thumbnail;
    private int thumbnailWidth;
    private int thumbnailHeight;

    // Getters and setters...
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this example we're using OpenAI's embedding model, which requires an OpenAI API Key to be set in the &lt;code&gt;application.properties&lt;/code&gt; file of your application:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;redis.om.spring.ai.open-ai.api-key=${OPEN_AI_KEY}&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If an embedding provider is not specified, Redis OM Spring will use Hugging Face’s Transformers model (all-MiniLM-L6-v2) by default. In this case, make sure you set the dimension in the @Indexed annotation to 384, which is the number of dimensions produced by the default embedding model.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Repository Interface
&lt;/h3&gt;

&lt;p&gt;A simple repository interface that extends RedisEnhancedRepository. This will be used to load the data into Redis using the saveAll() method:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public interface MovieRepository extends RedisEnhancedRepository&amp;lt;Movie, String&amp;gt; {}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This provides basic CRUD operations for Movie entities, with the first generic parameter being the entity type and the second being the ID type.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Search Service&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The search service uses two beans provided by Redis OM Spring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;EntityStream: For creating a stream of entities to perform searches. The EntityStream should not be confused with the Java Streams API: it generates a Redis command that is sent to Redis, so that Redis can perform the searching, filtering, and sorting efficiently on its side.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Embedder: Used for generating the embedding for the query sent by the user. The embedding is generated following the configuration of the @Vectorize annotation defined in the Movie class.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The search functionality is implemented in the SearchService:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Service
public class SearchService {

    private static final Logger logger = LoggerFactory.getLogger(SearchService.class);
    private final EntityStream entityStream;
    private final Embedder embedder;

    public SearchService(EntityStream entityStream, Embedder embedder) {
        this.entityStream = entityStream;
        this.embedder = embedder;
    }

    public List&amp;lt;Pair&amp;lt;Movie, Double&amp;gt;&amp;gt; search(
            String query,
            Integer yearMin,
            Integer yearMax,
            List&amp;lt;String&amp;gt; cast,
            List&amp;lt;String&amp;gt; genres,
            Integer numberOfNearestNeighbors) {
        logger.info("Received text: {}", query);
        logger.info("Received yearMin: {} yearMax: {}", yearMin, yearMax);
        logger.info("Received cast: {}", cast);
        logger.info("Received genres: {}", genres);

        if (numberOfNearestNeighbors == null) numberOfNearestNeighbors = 3;
        if (yearMin == null) yearMin = 1900;
        if (yearMax == null) yearMax = 2100;

        // Convert query text to vector embedding
        byte[] embeddedQuery = embedder.getTextEmbeddingsAsBytes(List.of(query), Movie$.EXTRACT).getFirst();

        // Perform vector search with additional filters
        SearchStream&amp;lt;Movie&amp;gt; stream = entityStream.of(Movie.class);
        return stream
                // KNN search for nearest vectors
                .filter(Movie$.EMBEDDED_EXTRACT.knn(numberOfNearestNeighbors, embeddedQuery))
                // Additional metadata filters (hybrid search)
                .filter(Movie$.YEAR.between(yearMin, yearMax))
                .filter(Movie$.CAST.eq(cast))
                .filter(Movie$.GENRES.eq(genres))
                // Sort by similarity score
                .sorted(Movie$._EMBEDDED_EXTRACT_SCORE)
                // Return both the movie and its similarity score
                .map(Fields.of(Movie$._THIS, Movie$._EMBEDDED_EXTRACT_SCORE))
                .collect(Collectors.toList());
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Key features of the search service:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Uses EntityStream to create a search stream for Movie entities&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Converts the text query into a vector embedding&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Uses K-nearest neighbors (KNN) search to find similar vectors&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Applies additional filters for hybrid search (combining vector and traditional search)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Returns pairs of movies and their similarity scores&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Movie Service for Data Loading
&lt;/h3&gt;

&lt;p&gt;The MovieService handles loading movie data into Redis. It reads a JSON file containing movie data and saves the movies into Redis. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It may take a minute or two to load the thousands of movies in the file because the embeddings are generated as part of the save: the @Vectorize annotation generates the embedding for the extract field before each movie is saved into Redis.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Service
public class MovieService {

    private static final Logger log = LoggerFactory.getLogger(MovieService.class);
    private final ObjectMapper objectMapper;
    private final ResourceLoader resourceLoader;
    private final MovieRepository movieRepository;

    public MovieService(ObjectMapper objectMapper, ResourceLoader resourceLoader, MovieRepository movieRepository) {
        this.objectMapper = objectMapper;
        this.resourceLoader = resourceLoader;
        this.movieRepository = movieRepository;
    }

    public void loadAndSaveMovies(String filePath) throws Exception {
        Resource resource = resourceLoader.getResource("classpath:" + filePath);
        try (InputStream is = resource.getInputStream()) {
            List&amp;lt;Movie&amp;gt; movies = objectMapper.readValue(is, new TypeReference&amp;lt;&amp;gt;() {});
            List&amp;lt;Movie&amp;gt; unprocessedMovies = movies.stream()
                    .filter(movie -&amp;gt; !movieRepository.existsById(movie.getTitle()) &amp;amp;&amp;amp;
                            movie.getYear() &amp;gt; 1980
                    ).toList();
            long systemMillis = System.currentTimeMillis();
            movieRepository.saveAll(unprocessedMovies);
            long elapsedMillis = System.currentTimeMillis() - systemMillis;
            log.info("Saved {} movies in {} ms", unprocessedMovies.size(), elapsedMillis);
        }
    }

    public boolean isDataLoaded() {
        return movieRepository.count() &amp;gt; 0;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  6. Search Controller
&lt;/h3&gt;

&lt;p&gt;The REST controller exposes the search endpoint:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@RestController
public class SearchController {

    private final SearchService searchService;

    public SearchController(SearchService searchService) {
        this.searchService = searchService;
    }

    @GetMapping("/search")
    public Map&amp;lt;String, Object&amp;gt; search(
            @RequestParam(required = false) String text,
            @RequestParam(required = false) Integer yearMin,
            @RequestParam(required = false) Integer yearMax,
            @RequestParam(required = false) List&amp;lt;String&amp;gt; cast,
            @RequestParam(required = false) List&amp;lt;String&amp;gt; genres,
            @RequestParam(required = false) Integer numberOfNearestNeighbors
    ) {
        List&amp;lt;Pair&amp;lt;Movie, Double&amp;gt;&amp;gt; matchedMovies = searchService.search(
                text,
                yearMin,
                yearMax,
                cast,
                genres,
                numberOfNearestNeighbors
        );
        return Map.of(
                "matchedMovies", matchedMovies,
                "count", matchedMovies.size()
        );
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  7. Application Bootstrap
&lt;/h3&gt;

&lt;p&gt;The main application class initializes Redis OM Spring and loads data. The @EnableRedisEnhancedRepositories annotation activates Redis OM Spring's repository support:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@SpringBootApplication
@EnableRedisEnhancedRepositories(basePackages = {"dev.raphaeldelio.redis8demo*"})
public class Redis8DemoVectorSimilaritySearchApplication {

    public static void main(String[] args) {
        SpringApplication.run(Redis8DemoVectorSimilaritySearchApplication.class, args);
    }

    @Bean
    CommandLineRunner loadData(MovieService movieService) {
        return args -&amp;gt; {
            if (movieService.isDataLoaded()) {
                System.out.println("Data already loaded. Skipping data load.");
                return;
            }
            movieService.loadAndSaveMovies("movies.json");
        };
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  8. Sample Requests
&lt;/h3&gt;

&lt;p&gt;You can make requests to the search endpoint:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET http://localhost:8082/search?text=A movie about a young boy who goes to a wizardry school

GET http://localhost:8082/search?numberOfNearestNeighbors=1&amp;amp;yearMin=1970&amp;amp;yearMax=1990&amp;amp;text=A movie about a kid and a scientist who go back in time

GET http://localhost:8082/search?cast=Dee Wallace,Henry Thomas&amp;amp;text=A boy who becomes friend with an alien
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Sample request:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET http://localhost:8082/search?numberOfNearestNeighbors=1&amp;amp;yearMin=1970&amp;amp;yearMax=1990&amp;amp;text=A movie about a kid and a scientist who go back in time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Sample response:&lt;/strong&gt;&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{&lt;br&gt;
  "count": 1,&lt;br&gt;
  "matchedMovies": [&lt;br&gt;
    {&lt;br&gt;
      "first": { // matched movie&lt;br&gt;
        "title": "Back to the Future",&lt;br&gt;
        "year": 1985,&lt;br&gt;
        "cast": [&lt;br&gt;
          "Michael J. Fox",&lt;br&gt;
          "Christopher Lloyd"&lt;br&gt;
        ],&lt;br&gt;
        "genres": [&lt;br&gt;
          "Science Fiction"&lt;br&gt;
        ],&lt;br&gt;
        "extract": "Back to the Future is a 1985 American science fiction film directed by Robert Zemeckis and written by Zemeckis, and Bob Gale. It stars Michael J. Fox, Christopher Lloyd, Lea Thompson, Crispin Glover, and Thomas F. Wilson. Set in 1985, it follows Marty McFly (Fox), a teenager accidentally sent back to 1955 in a time-traveling DeLorean automobile built by his eccentric scientist friend Emmett \"Doc\" Brown (Lloyd), where he inadvertently prevents his future parents from falling in love – threatening his own existence – and is forced to reconcile them and somehow get back to the future.",&lt;br&gt;
        "thumbnail": "&lt;a href="https://upload.wikimedia.org/wikipedia/en/d/d2/Back_to_the_Future.jpg" rel="noopener noreferrer"&gt;https://upload.wikimedia.org/wikipedia/en/d/d2/Back_to_the_Future.jpg&lt;/a&gt;"&lt;br&gt;
      },&lt;br&gt;
      "second": 0.463297247887 // similarity score (the lowest the closest)&lt;br&gt;
    }&lt;br&gt;
  ]&lt;br&gt;
}&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;And that’s it — you now have a working semantic search app using Spring Boot and Redis. &lt;/p&gt;

&lt;p&gt;Instead of relying on exact keyword matches, your app understands the meaning behind the query. Redis handles the heavy part: embedding storage, similarity search, and even traditional filters — all at lightning speed.&lt;/p&gt;

&lt;p&gt;With Redis OM Spring, you get an easy way to integrate this into your Java apps. You only need two annotations, @Vectorize and @Indexed, and two beans, EntityStream and Embedder. &lt;/p&gt;

&lt;p&gt;Whether you’re building search, recommendations, or AI-powered assistants, this setup gives you a solid and scalable foundation.&lt;/p&gt;

&lt;p&gt;Try it out, tweak the filters, explore other models, and see how far you can go!&lt;/p&gt;

&lt;h3&gt;
  
  
  More AI Resources
&lt;/h3&gt;

&lt;p&gt;The best way to stay on the path of learning AI is by following the recipes available on the Redis AI Resources GitHub repository. There you can find dozens of recipes that will help you start building AI apps, fast!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/redis-developer/redis-ai-resources/tree/main" rel="noopener noreferrer"&gt;&lt;strong&gt;GitHub - redis-developer/redis-ai-resources: ✨ A curated list of awesome community resources, integrations, and examples of Redis in the AI ecosystem.&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Stay Curious!
&lt;/h3&gt;

</description>
      <category>java</category>
      <category>vectordatabase</category>
      <category>springboot</category>
      <category>redis</category>
    </item>
    <item>
      <title>Token Bucket Rate Limiter (Redis &amp; Java)</title>
      <dc:creator>Raphael De Lio</dc:creator>
      <pubDate>Mon, 13 Jan 2025 14:15:29 +0000</pubDate>
      <link>https://dev.to/redis/token-bucket-rate-limiter-redis-java-4pi3</link>
      <guid>https://dev.to/redis/token-bucket-rate-limiter-redis-java-4pi3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/cfF6nXIpDwE" rel="noopener noreferrer"&gt;This article is also available on YouTube!&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft58edaghnxrbmjdyhkf7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft58edaghnxrbmjdyhkf7.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Token Bucket&lt;/strong&gt; algorithm is a flexible and efficient rate-limiting mechanism. It works by filling a bucket with tokens at a fixed rate (e.g., one token per second). Each request consumes a token, and if no tokens are available, the request is rejected. The bucket has a maximum capacity, so it can handle bursts of traffic as long as the burst doesn’t exceed the number of tokens in the bucket.&lt;/p&gt;
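&lt;p&gt;The idea is easier to see without any infrastructure. Here is a minimal, single-process sketch of the algorithm in plain Java, for illustration only; the Redis-backed version we build below applies the same logic, but shares the bucket state across processes:&lt;/p&gt;

```java
// Minimal in-memory token bucket (single process, illustration only).
class SimpleTokenBucket {
    private final int capacity;      // maximum tokens the bucket can hold
    private final double refillRate; // tokens added per second
    private double tokens;           // current token count
    private long lastRefillMs;       // timestamp of the last refill

    SimpleTokenBucket(int capacity, double refillRate) {
        this.capacity = capacity;
        this.refillRate = refillRate;
        this.tokens = capacity;      // start with a full bucket
        this.lastRefillMs = System.currentTimeMillis();
    }

    synchronized boolean isAllowed() {
        long now = System.currentTimeMillis();
        // Refill proportionally to the elapsed time, capped at capacity
        tokens = Math.min(capacity, tokens + (now - lastRefillMs) / 1000.0 * refillRate);
        lastRefillMs = now;
        if (tokens >= 1) {
            tokens--;                // consume one token for this request
            return true;
        }
        return false;                // bucket empty: reject the request
    }
}
```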

&lt;blockquote&gt;
&lt;p&gt;Looking for a different rate limiter algorithm? &lt;a href="https://raphaeldelio.medium.com/rate-limiting-with-redis-an-essential-guide-df798b1c63db?source=user_profile_page---------1-------------17e03c232bd9---------------" rel="noopener noreferrer"&gt;Check the essential guide.&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Index
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;How the Token Bucket Rate Limiter Works&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Implementation with Redis and Java&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Testing with TestContainers and AssertJ&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Conclusion (GitHub Repo)&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2160%2F1%2A7cDKq5yh5RD0ygvb3mVwfQ.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2160%2F1%2A7cDKq5yh5RD0ygvb3mVwfQ.gif" width="1080" height="608"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Define a Token Refill Rate&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Set a rate at which tokens are added to the bucket, such as 1 token per second or 10 tokens per minute.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Track Token Consumption&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For each incoming request, deduct one token from the bucket.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Refill Tokens&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Continuously refill the bucket at the defined rate, up to its maximum capacity, ensuring unused tokens can accumulate for future bursts.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Rate Limit Check&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before processing a request, check if there are enough tokens in the bucket. If the bucket is empty, reject the request until tokens are replenished.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Implement It with Redis and Java
&lt;/h2&gt;

&lt;p&gt;For the &lt;strong&gt;Token Bucket Rate Limiter&lt;/strong&gt;, Redis provides an efficient way to track tokens and implement the algorithm. Here’s how to do it:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Retrieve current token count and last refill time
&lt;/h3&gt;

&lt;p&gt;First, retrieve the current token count and the last refill time:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET rate_limit:&amp;lt;clientId&amp;gt;:count  
GET rate_limit:&amp;lt;clientId&amp;gt;:lastRefill  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If these keys don’t exist, initialize the token count to the bucket’s maximum capacity and set the current time as the last refill time using SET.&lt;/p&gt;
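&lt;p&gt;For example, for a bucket with a capacity of 10 tokens, the initialization could look like this (placeholder values shown in angle brackets):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SET rate_limit:&amp;lt;clientId&amp;gt;:count 10  
SET rate_limit:&amp;lt;clientId&amp;gt;:lastRefill &amp;lt;current_time_millis&amp;gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;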

&lt;h3&gt;
  
  
  2. Refill tokens if necessary and update the bucket
&lt;/h3&gt;

&lt;p&gt;Update the token count and last refill date time after processing each request:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SET rate_limit:&amp;lt;clientId&amp;gt;:count &amp;lt;new_token_count&amp;gt;  
SET rate_limit:&amp;lt;clientId&amp;gt;:lastRefill &amp;lt;current_time&amp;gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  3. Allow or reject the request
&lt;/h3&gt;

&lt;p&gt;If tokens are available, allow the request and decrement the count by one using:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DECR rate_limit:&amp;lt;clientId&amp;gt;:count
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Implementing it with Jedis
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Jedis&lt;/strong&gt; is a popular Java library for interacting with &lt;strong&gt;Redis&lt;/strong&gt;. We’ll use it to implement our rate limiter because it provides a simple and intuitive API for executing Redis commands from JVM applications.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Add Jedis to Your Maven File&lt;/strong&gt;:
&lt;/h3&gt;

&lt;p&gt;Check the latest version &lt;a href="https://redis.io/docs/latest/develop/clients/jedis/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;dependency&amp;gt;
    &amp;lt;groupId&amp;gt;redis.clients&amp;lt;/groupId&amp;gt;
    &amp;lt;artifactId&amp;gt;jedis&amp;lt;/artifactId&amp;gt;
    &amp;lt;version&amp;gt;5.2.0&amp;lt;/version&amp;gt;
&amp;lt;/dependency&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Create a &lt;strong&gt;TokenBucketRateLimiter&lt;/strong&gt; class:
&lt;/h3&gt;

&lt;p&gt;The class will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Accept a Jedis instance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Define the maximum capacity of the token bucket.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Specify the token refill rate (tokens per second).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    package io.redis;

    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.Transaction;

    public class TokenBucketRateLimiter {
        private final Jedis jedis;
        private final int bucketCapacity; // Maximum tokens the bucket can hold
        private final double refillRate; // Tokens refilled per second

        public TokenBucketRateLimiter(Jedis jedis, int bucketCapacity, double refillRate) {
            this.jedis = jedis;
            this.bucketCapacity = bucketCapacity;
            this.refillRate = refillRate;
        }
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Validate the Requests
&lt;/h3&gt;

&lt;p&gt;The main task of this rate limiter is to determine whether a client has sufficient tokens to process their request. If yes, the request is allowed, and tokens are deducted. If not, the request is blocked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Generate the keys&lt;/strong&gt;&lt;br&gt;
We’ll store each client’s token count and last refill time in Redis using unique keys. The keys will look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public boolean isAllowed(String clientId) {
    String keyCount = "rate_limit:" + clientId + ":count";
    String keyLastRefill = "rate_limit:" + clientId + ":lastRefill";
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For example, if the client ID is user123, their keys would be rate_limit:user123:count and rate_limit:user123:lastRefill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Fetch Current State&lt;/strong&gt;&lt;br&gt;
We use Redis’s GET command to retrieve the current token count and the last refill time. If the keys don’t exist, we assume the bucket is full, and the last refill time is the current timestamp.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public boolean isAllowed(String clientId) {
    String keyCount = "rate_limit:" + clientId + ":count";
    String keyLastRefill = "rate_limit:" + clientId + ":lastRefill";

    Transaction transaction = jedis.multi();
    transaction.get(keyLastRefill);
    transaction.get(keyCount);
    var results = transaction.exec();

    long currentTime = System.currentTimeMillis();
    long lastRefillTime = results.get(0) != null ? Long.parseLong((String) results.get(0)) : currentTime;
    int tokenCount = results.get(1) != null ? Integer.parseInt((String) results.get(1)) : bucketCapacity;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Refill Tokens&lt;/strong&gt;&lt;br&gt;
Calculate how many tokens should be added based on the time elapsed since the last refill. Ensure the bucket doesn’t exceed its maximum capacity.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;long elapsedTimeMs = currentTime - lastRefillTime;
double elapsedTimeSecs = elapsedTimeMs / 1000.0;
int tokensToAdd = (int) (elapsedTimeSecs * refillRate);

tokenCount = Math.min(bucketCapacity, tokenCount + tokensToAdd);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
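&lt;p&gt;As a quick sanity check of this arithmetic with concrete numbers (plain Java, no Redis involved): after 2.5 seconds at 2 tokens per second, the cast to int truncates 5.0 down to 5 tokens, and Math.min keeps the bucket from overflowing its capacity.&lt;/p&gt;

```java
// Worked example of the refill arithmetic (illustrative values)
long elapsedTimeMs = 2500;       // 2.5 seconds since the last refill
double refillRate = 2.0;         // tokens per second
int bucketCapacity = 4;
int tokenCount = 1;              // tokens left in the bucket

double elapsedTimeSecs = elapsedTimeMs / 1000.0;
int tokensToAdd = (int) (elapsedTimeSecs * refillRate); // truncates toward zero
tokenCount = Math.min(bucketCapacity, tokenCount + tokensToAdd);

System.out.println(tokensToAdd); // 5
System.out.println(tokenCount);  // 4 (1 + 5 = 6, capped at the capacity of 4)
```

Note that the cast discards partial tokens, so a fraction of a token earned between requests is lost rather than carried over.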

&lt;p&gt;&lt;strong&gt;Step 4: Check Token Availability&lt;/strong&gt;&lt;br&gt;
Compare the current token count to determine if the request can be allowed. &lt;strong&gt;If tokens are available, deduct one token; otherwise, block the request.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;boolean isAllowed = tokenCount &amp;gt; 0;

if (isAllowed) {
    tokenCount--;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Update Redis&lt;/strong&gt;&lt;br&gt;
We update the token count and last refill time in Redis. Use a transaction to ensure atomic updates:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Transaction transaction = jedis.multi();
transaction.set(keyLastRefill, String.valueOf(currentTime)); // Update last refill time
transaction.set(keyCount, String.valueOf(tokenCount));       // Update token count
transaction.exec();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Complete Implementation
&lt;/h3&gt;

&lt;p&gt;Here’s the full code for the TokenBucketRateLimiter class:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package io.redis;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Transaction;

public class TokenBucketRateLimiter {
    private final Jedis jedis;
    private final int bucketCapacity; // Maximum tokens the bucket can hold
    private final double refillRate; // Tokens refilled per second

    public TokenBucketRateLimiter(Jedis jedis, int bucketCapacity, double refillRate) {
        this.jedis = jedis;
        this.bucketCapacity = bucketCapacity;
        this.refillRate = refillRate;
    }

    public boolean isAllowed(String clientId) {
        String keyCount = "rate_limit:" + clientId + ":count";
        String keyLastRefill = "rate_limit:" + clientId + ":lastRefill";

        long currentTime = System.currentTimeMillis();

        // Fetch current state
        Transaction transaction = jedis.multi();
        transaction.get(keyLastRefill);
        transaction.get(keyCount);
        var results = transaction.exec();

        long lastRefillTime = results.get(0) != null ? Long.parseLong((String) results.get(0)) : currentTime;
        int tokenCount = results.get(1) != null ? Integer.parseInt((String) results.get(1)) : bucketCapacity;

        // Refill tokens
        long elapsedTimeMs = currentTime - lastRefillTime;
        double elapsedTimeSecs = elapsedTimeMs / 1000.0;
        int tokensToAdd = (int) (elapsedTimeSecs * refillRate);
        tokenCount = Math.min(bucketCapacity, tokenCount + tokensToAdd);

        // Check if the request is allowed
        boolean isAllowed = tokenCount &amp;gt; 0;

        if (isAllowed) {
            tokenCount--; // Consume one token
        }

        // Update Redis state
        transaction = jedis.multi();
        transaction.set(keyLastRefill, String.valueOf(currentTime));
        transaction.set(keyCount, String.valueOf(tokenCount));
        transaction.exec();

        return isAllowed;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And we’re ready to start testing its behavior!&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing our Rate Limiter
&lt;/h2&gt;

&lt;p&gt;To ensure our Token Bucket Rate Limiter behaves as expected, we’ll write tests for various scenarios. For this, we’ll use three tools:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Redis TestContainers&lt;/strong&gt;: This library spins up an isolated Redis container for testing. This means we don’t need to rely on an external Redis server during our tests. Once the tests are done, the container is stopped, leaving no leftover data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;JUnit 5&lt;/strong&gt;: Our main testing framework, which helps us define and structure tests with lifecycle methods like @BeforeEach and @AfterEach.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AssertJ&lt;/strong&gt;: A library that makes assertions readable and expressive, like assertThat(result).isTrue().&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s begin by adding the necessary dependencies to our pom.xml.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adding Dependencies
&lt;/h3&gt;

&lt;p&gt;Here’s what you’ll need in your Maven pom.xml file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;dependency&amp;gt;
    &amp;lt;groupId&amp;gt;org.junit.jupiter&amp;lt;/groupId&amp;gt;
    &amp;lt;artifactId&amp;gt;junit-jupiter-engine&amp;lt;/artifactId&amp;gt;
    &amp;lt;version&amp;gt;5.10.0&amp;lt;/version&amp;gt;
    &amp;lt;scope&amp;gt;test&amp;lt;/scope&amp;gt;
&amp;lt;/dependency&amp;gt;
&amp;lt;dependency&amp;gt;
    &amp;lt;groupId&amp;gt;com.redis&amp;lt;/groupId&amp;gt;
    &amp;lt;artifactId&amp;gt;testcontainers-redis&amp;lt;/artifactId&amp;gt;
    &amp;lt;version&amp;gt;2.2.2&amp;lt;/version&amp;gt;
    &amp;lt;scope&amp;gt;test&amp;lt;/scope&amp;gt;
&amp;lt;/dependency&amp;gt;
&amp;lt;dependency&amp;gt;
    &amp;lt;groupId&amp;gt;org.assertj&amp;lt;/groupId&amp;gt;
    &amp;lt;artifactId&amp;gt;assertj-core&amp;lt;/artifactId&amp;gt;
    &amp;lt;version&amp;gt;3.11.1&amp;lt;/version&amp;gt;
    &amp;lt;scope&amp;gt;test&amp;lt;/scope&amp;gt;
&amp;lt;/dependency&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once you’ve added these dependencies, you’re ready to start writing your test class.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting Up the Test Class
&lt;/h3&gt;

&lt;p&gt;The first step is to create a test class named TokenBucketRateLimiterTest. Inside, we’ll define three main components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Redis Test Container&lt;/strong&gt;: This launches a Redis instance in a Docker container.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Jedis Instance&lt;/strong&gt;: This connects to the Redis container for sending commands.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rate Limiter&lt;/strong&gt;: The actual TokenBucketRateLimiter instance we’re testing.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here’s how the skeleton of our test class looks:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public class TokenBucketRateLimiterTest {

    private static RedisContainer redisContainer;
    private Jedis jedis;
    private TokenBucketRateLimiter rateLimiter;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Preparing the Environment Before Each Test
&lt;/h3&gt;

&lt;p&gt;Before running any test, we need to ensure a clean Redis environment. Here’s what we’ll do:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Connect to Redis&lt;/strong&gt;: Use a Jedis instance to connect to the Redis container.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flush Data&lt;/strong&gt;: Clear any leftover data in Redis to ensure consistent results for each test.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We’ll start the Redis container once in a method annotated with @BeforeAll, and set up the Jedis connection in a method annotated with @BeforeEach, which runs before every test case.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@BeforeAll
static void startContainer() {
    redisContainer = new RedisContainer("redis:latest");
    redisContainer.withExposedPorts(6379).start();
}

@BeforeEach
void setup() {
    jedis = new Jedis(redisContainer.getHost(), redisContainer.getFirstMappedPort());
    jedis.flushAll();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;FLUSHALL is an actual Redis command that deletes all the keys of all the existing databases. &lt;a href="https://redis.io/docs/latest/commands/flushall/" rel="noopener noreferrer"&gt;Read more about it in the official documentation&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Cleaning Up After Each Test
&lt;/h3&gt;

&lt;p&gt;After each test, we need to close the Jedis connection to free up resources. This ensures no lingering connections interfere with subsequent tests.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@AfterEach
void tearDown() {
    jedis.close();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Full Setup
&lt;/h3&gt;

&lt;p&gt;Here’s how the complete test class looks with everything in place:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public class TokenBucketRateLimiterTest {

    private static RedisContainer redisContainer;
    private Jedis jedis;
    private TokenBucketRateLimiter rateLimiter;

    @BeforeAll
    static void startContainer() {
        redisContainer = new RedisContainer("redis:latest");
        redisContainer.withExposedPorts(6379).start();
    }

    @AfterAll
    static void stopContainer() {
        redisContainer.stop();
    }

    @BeforeEach
    void setup() {
        jedis = new Jedis(redisContainer.getHost(), redisContainer.getFirstMappedPort());
        jedis.flushAll();
    }

    @AfterEach
    void tearDown() {
        jedis.close();
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Verifying Requests Within the Bucket Capacity
&lt;/h3&gt;

&lt;p&gt;This test ensures the rate limiter allows requests within the defined bucket capacity.&lt;/p&gt;

&lt;p&gt;We configure it with a &lt;strong&gt;capacity of&lt;/strong&gt; &lt;strong&gt;5 tokens&lt;/strong&gt; and a &lt;strong&gt;refill rate of one token per second&lt;/strong&gt;, then call isAllowed(“client-1”) &lt;strong&gt;5 times&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Each call should return true, confirming the rate limiter correctly tracks and permits requests within the capacity.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Test
void shouldAllowRequestsWithinBucketCapacity() {
    rateLimiter = new TokenBucketRateLimiter(jedis, 5, 1.0);
    for (int i = 1; i &amp;lt;= 5; i++) {
        assertThat(rateLimiter.isAllowed("client-1"))
            .withFailMessage("Request %d should be allowed within bucket capacity", i)
            .isTrue();
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Verifying Requests Are Denied When Bucket is Empty
&lt;/h3&gt;

&lt;p&gt;This test ensures the rate limiter correctly denies requests once the bucket is empty.&lt;/p&gt;

&lt;p&gt;Configured with a &lt;strong&gt;capacity of&lt;/strong&gt; &lt;strong&gt;5 tokens&lt;/strong&gt; and a &lt;strong&gt;refill rate of one token per second&lt;/strong&gt;, we call isAllowed(“client-1”) &lt;strong&gt;5 times&lt;/strong&gt; and expect all to return true.&lt;/p&gt;

&lt;p&gt;On the 6th call, it should return false, verifying the rate limiter blocks requests once the bucket is empty.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Test
void shouldDenyRequestsOnceBucketIsEmpty() {
    rateLimiter = new TokenBucketRateLimiter(jedis, 5, 1.0);
    for (int i = 1; i &amp;lt;= 5; i++) {
        assertThat(rateLimiter.isAllowed("client-1"))
            .withFailMessage("Request %d should be allowed within bucket capacity", i)
            .isTrue();
    }
    assertThat(rateLimiter.isAllowed("client-1"))
        .withFailMessage("Request beyond bucket capacity should be denied")
        .isFalse();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Verifying Bucket is Gradually Refilled
&lt;/h3&gt;

&lt;p&gt;This test ensures the rate limiter refills the bucket correctly after every second.&lt;/p&gt;

&lt;p&gt;Configured with a &lt;strong&gt;capacity of&lt;/strong&gt; &lt;strong&gt;5 tokens&lt;/strong&gt; and a &lt;strong&gt;refill rate of one token per second&lt;/strong&gt;, the first 5 requests (isAllowed(“client-1”)) return true, while the 6th request is denied (false).&lt;/p&gt;

&lt;p&gt;After waiting for two seconds, the next two requests are allowed and the third one is denied, confirming the refill behavior works as expected.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    @Test
    void shouldRefillTokensGraduallyAndAllowRequestsOverTime() throws InterruptedException {
        rateLimiter = new TokenBucketRateLimiter(jedis, 5, 1.0);
        String clientId = "client-1";

        for (int i = 1; i &amp;lt;= 5; i++) {
            assertThat(rateLimiter.isAllowed(clientId))
                .withFailMessage("Request %d should be allowed within bucket capacity", i)
                .isTrue();
        }
        assertThat(rateLimiter.isAllowed(clientId))
            .withFailMessage("Request beyond bucket capacity should be denied")
            .isFalse();

        TimeUnit.SECONDS.sleep(2);

        assertThat(rateLimiter.isAllowed(clientId))
            .withFailMessage("Request after partial refill should be allowed")
            .isTrue();
        assertThat(rateLimiter.isAllowed(clientId))
            .withFailMessage("Second request after partial refill should be allowed")
            .isTrue();
        assertThat(rateLimiter.isAllowed(clientId))
            .withFailMessage("Request beyond available tokens should be denied")
            .isFalse();
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Verifying Independent Handling of Multiple Clients
&lt;/h3&gt;

&lt;p&gt;This test ensures the rate limiter handles multiple clients independently.&lt;/p&gt;

&lt;p&gt;Configured with a &lt;strong&gt;capacity of&lt;/strong&gt; &lt;strong&gt;5 tokens&lt;/strong&gt; and a &lt;strong&gt;refill rate of one token per second&lt;/strong&gt;, the first 5 requests (isAllowed(“client-1”)) return true, while the 6th request is denied (false).&lt;/p&gt;

&lt;p&gt;Simultaneously, all 5 requests from &lt;strong&gt;client-2&lt;/strong&gt; are allowed (true), confirming the rate limiter maintains separate counters for each client.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Test
void shouldHandleMultipleClientsIndependently() {
    rateLimiter = new TokenBucketRateLimiter(jedis, 5, 1.0);

    String clientId1 = "client-1";
    String clientId2 = "client-2";

    for (int i = 1; i &amp;lt;= 5; i++) {
        assertThat(rateLimiter.isAllowed(clientId1))
            .withFailMessage("Client 1 request %d should be allowed", i)
            .isTrue();
    }
    assertThat(rateLimiter.isAllowed(clientId1))
        .withFailMessage("Client 1 request beyond bucket capacity should be denied")
        .isFalse();

    for (int i = 1; i &amp;lt;= 5; i++) {
        assertThat(rateLimiter.isAllowed(clientId2))
            .withFailMessage("Client 2 request %d should be allowed", i)
            .isTrue();
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Verifying Token Refill Does Not Exceed Bucket Capacity
&lt;/h3&gt;

&lt;p&gt;This test verifies that the token bucket rate limiter correctly refills tokens up to the defined capacity without exceeding it.&lt;/p&gt;

&lt;p&gt;Configured with a &lt;strong&gt;capacity of 3 tokens&lt;/strong&gt; and a &lt;strong&gt;refill rate of 2 tokens per second&lt;/strong&gt;, the first 3 requests (isAllowed(“client-1”)) return true, while the 4th request is denied (false), indicating the bucket is empty.&lt;/p&gt;

&lt;p&gt;After waiting 3 seconds (enough to refill 6 tokens), the bucket refills only up to its maximum capacity of 3 tokens. The next 3 requests are allowed (true), but any additional request is denied (false), confirming that the rate limiter maintains the specified capacity limit regardless of refill surplus.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Test
void shouldRefillTokensUpToCapacityWithoutExceedingIt() throws InterruptedException {
    int capacity = 3;
    double refillRate = 2.0;
    String clientId = "client-1";
    rateLimiter = new TokenBucketRateLimiter(jedis, capacity, refillRate);

    for (int i = 1; i &amp;lt;= capacity; i++) {
        assertThat(rateLimiter.isAllowed(clientId))
            .withFailMessage("Request %d should be allowed within initial bucket capacity", i)
            .isTrue();
    }
    assertThat(rateLimiter.isAllowed(clientId))
        .withFailMessage("Request beyond bucket capacity should be denied")
        .isFalse();

    TimeUnit.SECONDS.sleep(3);

    for (int i = 1; i &amp;lt;= capacity; i++) {
        assertThat(rateLimiter.isAllowed(clientId))
            .withFailMessage("Request %d should be allowed as bucket refills up to capacity", i)
            .isTrue();
    }
    assertThat(rateLimiter.isAllowed(clientId))
        .withFailMessage("Request beyond bucket capacity should be denied")
        .isFalse();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Verifying Denied Requests Do Not Affect Token Count
&lt;/h3&gt;

&lt;p&gt;This test ensures that the token bucket rate limiter does not count denied requests when updating the token count.&lt;/p&gt;

&lt;p&gt;Configured with a &lt;strong&gt;capacity of 3 tokens&lt;/strong&gt; and a &lt;strong&gt;refill rate of 0.5 tokens per second&lt;/strong&gt;, the first 3 requests (isAllowed(“client-1”)) are allowed (true), depleting the bucket. The 4th request is denied (false), confirming the bucket is empty.&lt;/p&gt;

&lt;p&gt;The Redis token count (rate_limit:client-1:count) is then verified to ensure it accurately reflects the remaining tokens (0 in this case) and does not include denied requests. This confirms that the rate limiter updates the token count only when requests are successfully processed.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Test
void testRateLimitDeniedRequestsAreNotCounted() {
    int capacity = 3;
    double refillRate = 0.5;
    String clientId = "client-1";
    rateLimiter = new TokenBucketRateLimiter(jedis, capacity, refillRate);

    for (int i = 1; i &amp;lt;= capacity; i++) {
        assertThat(rateLimiter.isAllowed(clientId))
            .withFailMessage("Request %d should be allowed", i)
            .isTrue();
    }
    assertThat(rateLimiter.isAllowed(clientId))
        .withFailMessage("This request should be denied")
        .isFalse();

    String key = "rate_limit:" + clientId + ":count";
    int requestCount = Integer.parseInt(jedis.get(key));
    assertThat(requestCount)
        .withFailMessage("The count should match remaining tokens and not include denied requests")
        .isEqualTo(0);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Is there any other behavior we should verify? Let me know in the comments!&lt;/p&gt;

&lt;p&gt;The Token Bucket Rate Limiter is a flexible and efficient way to manage request rates, and &lt;strong&gt;Redis&lt;/strong&gt; makes it incredibly fast and reliable.&lt;/p&gt;

&lt;p&gt;By leveraging commands like GET, SET, and MULTI/EXEC, we implemented a solution that tracks token counts, refills tokens dynamically based on time elapsed, and ensures the bucket never exceeds its defined capacity.&lt;/p&gt;

&lt;p&gt;Using &lt;strong&gt;Jedis&lt;/strong&gt;, we built a clear and intuitive &lt;strong&gt;Java&lt;/strong&gt; implementation, and with thorough testing using Redis TestContainers, JUnit 5, and AssertJ, we can confidently verify that it works as expected.&lt;/p&gt;

&lt;p&gt;This approach offers a robust foundation for managing request limits while allowing for burst handling and gradual refill, making it adaptable for more advanced rate-limiting scenarios when needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  GitHub Repo
&lt;/h3&gt;

&lt;p&gt;You can find this implementation in &lt;strong&gt;Java&lt;/strong&gt; and &lt;strong&gt;Kotlin&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Java (&lt;a href="https://github.com/raphaeldelio/redis-rate-limiter-java-example/tree/main/src/main/java/io/redis" rel="noopener noreferrer"&gt;Implementation&lt;/a&gt;, &lt;a href="https://github.com/raphaeldelio/redis-rate-limiter-java-example/blob/main/src/test/java/io/redis/TokenBucketRateLimiterTest.java" rel="noopener noreferrer"&gt;Test&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kotlin (&lt;a href="https://github.com/raphaeldelio/redis-rate-limiter-kotlin-example/blob/main/src/main/kotlin/org/example/TokenBucketRateLimiter.kt" rel="noopener noreferrer"&gt;Implementation&lt;/a&gt;, &lt;a href="https://github.com/raphaeldelio/redis-rate-limiter-kotlin-example/blob/main/src/test/kotlin/org/example/TokenBucketRateLimiterTest.kt" rel="noopener noreferrer"&gt;Test&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Stay Curious!
&lt;/h3&gt;

</description>
      <category>redis</category>
      <category>java</category>
      <category>systemdesign</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Fixed Window Counter Rate Limiter (Redis &amp; Java)</title>
      <dc:creator>Raphael De Lio</dc:creator>
      <pubDate>Mon, 30 Dec 2024 13:30:24 +0000</pubDate>
      <link>https://dev.to/redis/fixed-window-counter-rate-limiter-redis-java-dik</link>
      <guid>https://dev.to/redis/fixed-window-counter-rate-limiter-redis-java-dik</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://youtu.be/Ki3WKSNpdRU" rel="noopener noreferrer"&gt;This article is also available on YouTube!&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2j5ljyqccpn0v4aa2kkv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2j5ljyqccpn0v4aa2kkv.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Fixed Window Counter&lt;/strong&gt; is the simplest and most straightforward rate-limiting algorithm. It divides time into fixed intervals (e.g., seconds, minutes, or hours) and counts the number of requests within each interval. If the count exceeds a predefined threshold, the requests are rejected until the next interval begins.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Looking for a more precise algorithm? Take a look at the Sliding Window Log implementation. (Coming soon)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Index
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;How the Fixed Window Counter Rate Limiter Works&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Implementation with Redis and Java&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Testing with TestContainers and AssertJ&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Conclusion (GitHub Repo)&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How It Works&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5vldnjqp6aos1afq9et.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5vldnjqp6aos1afq9et.gif" width="1080" height="608"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Define a Window Interval&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Choose a time interval, such as 1 second, 1 minute, or 1 hour.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Track Requests&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Use a counter to track the number of requests made during the current window.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Reset Counter:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;At the end of the time window, reset the counter to zero and start counting again for the new window.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Rate Limit Check:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Compare the counter against the allowed limit. If it exceeds the limit, reject further requests until the next window.&lt;/p&gt;
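&lt;p&gt;The four steps above can be sketched as a minimal, single-process Java version before bringing Redis in. This is an illustration only (the class name InMemoryFixedWindow is made up for this sketch), and its state lives in one JVM, so it can’t coordinate limits across multiple application instances — which is exactly what moving the counter into Redis solves.&lt;/p&gt;

```java
// Minimal in-memory sketch of a fixed window counter (illustrative only).
class InMemoryFixedWindow {
    private final long windowMillis; // 1. the window interval
    private final int limit;         // the allowed number of requests
    private long windowStart = -1;   // start of the current window; -1 = no window yet
    private int count = 0;           // 2. requests seen in the current window

    InMemoryFixedWindow(long windowMillis, int limit) {
        this.windowMillis = windowMillis;
        this.limit = limit;
    }

    boolean isAllowed(long nowMillis) {
        // 3. reset the counter once the window has elapsed
        if (windowStart == -1 || nowMillis - windowStart >= windowMillis) {
            windowStart = nowMillis;
            count = 0;
        }
        // 4. rate limit check: reject if the counter already hit the limit
        if (count >= limit) {
            return false;
        }
        count++;
        return true;
    }
}
```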

&lt;h2&gt;
  
  
  How to Implement It with Redis and Java
&lt;/h2&gt;

&lt;p&gt;There is more than one way to implement the Fixed Window Rate Limiter with Redis. The simplest is:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Use the INCR command to increment the counter in Redis each time a request is allowed
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INCR my_counter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If no counter is set yet, the INCR command will create one with a value of zero and then increment it to one.&lt;/p&gt;

&lt;p&gt;If the counter is already set, the INCR command will simply increment it by one.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Set the key to expire in one minute if it’s newly created
&lt;/h3&gt;

&lt;p&gt;If the counter doesn’t exist, we need to set a time-to-live to ensure the time window lasts only for the specified period. &lt;strong&gt;But we should only set an expiration if it doesn’t already exist&lt;/strong&gt;. Otherwise, Redis would reset the expiration, and older requests could be counted beyond the allowed time.&lt;/p&gt;

&lt;p&gt;We’ll use the EXPIRE command with the NX flag on the key. &lt;strong&gt;The NX flag ensures the expiration is only set if the key doesn’t already have one.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This approach is smart because the counter will only track requests during the key’s lifespan. &lt;strong&gt;Once the key expires and is removed, the counter resets, ensuring we only account for requests within the intended time window.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXPIRE my_counter 60 NX
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  3. Check the counter for each new request
&lt;/h3&gt;

&lt;p&gt;When a new request comes in, check the counter to see how many requests have been made. If it’s below the threshold, allow the request and increment the counter. If not, block it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If the key doesn’t exist, assume the counter starts at 0.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET my_counter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Cool! Now that we understand the basics of our implementation, let’s implement it in Java with Jedis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing it with Jedis
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Jedis&lt;/strong&gt; is a popular Java library for interacting with &lt;strong&gt;Redis&lt;/strong&gt;. We’ll use it to implement our rate limiter because it provides a simple and intuitive API for executing Redis commands from JVM applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Start by adding the Jedis library to your Maven file:
&lt;/h3&gt;

&lt;p&gt;Check the latest version &lt;a href="https://redis.io/docs/latest/develop/clients/jedis/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    &amp;lt;dependency&amp;gt;
        &amp;lt;groupId&amp;gt;redis.clients&amp;lt;/groupId&amp;gt;
        &amp;lt;artifactId&amp;gt;jedis&amp;lt;/artifactId&amp;gt;
        &amp;lt;version&amp;gt;5.2.0&amp;lt;/version&amp;gt;
    &amp;lt;/dependency&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Create a FixedWindowRateLimiter class:
&lt;/h3&gt;

&lt;p&gt;The class will take:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;A Jedis instance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A time window size (e.g., 60 seconds).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The maximum number of allowed requests.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    package io.redis;

    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.Transaction;
    import redis.clients.jedis.args.ExpiryOption;

    public class FixedWindowRateLimiter {

        private final Jedis jedis;
        private final int windowSize;
        private final int limit;

        public FixedWindowRateLimiter(Jedis jedis, int windowSize, int limit) {
            this.jedis = jedis;
            this.limit = limit;
            this.windowSize = windowSize;
        }
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Validate the Requests
&lt;/h3&gt;

&lt;p&gt;The main job of this rate limiter is to check if a client is within their allowed request limit. If yes, the request is allowed, and the counter is updated. If not, the request is blocked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Generate a key&lt;/strong&gt;&lt;br&gt;
We’ll store each client’s request count as a Redis key. To make keys unique for each client, we’ll format them like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    public boolean isAllowed(String clientId) {
        String key = "rate_limit:" + clientId;
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, if the client ID is user123, their key would be rate_limit:user123.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Fetch the Current Counter&lt;/strong&gt;&lt;br&gt;
We’ll use Redis’s GET command to check how many requests the client has made so far. If the key doesn’t exist, we assume the client hasn’t made any requests, so the counter is 0.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    public boolean isAllowed(String clientId) {
        String key = "rate_limit:" + clientId;
        String currentCountStr = jedis.get(key);
        int currentCount = currentCountStr != null ? Integer.parseInt(currentCountStr) : 0;
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Check the Request Limit&lt;/strong&gt;&lt;br&gt;
Next, we compare the current count to the allowed limit. If the counter is less than the limit, the request is allowed. Otherwise, it’s blocked.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    public boolean isAllowed(String clientId) {
        String key = "rate_limit:" + clientId;
        String currentCountStr = jedis.get(key);
        int currentCount = currentCountStr != null ? Integer.parseInt(currentCountStr) : 0;

        boolean isAllowed = currentCount &amp;lt; limit;
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4: Increment the Counter and Set Expiration&lt;/strong&gt;&lt;br&gt;
If the request is allowed, we need to do two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Increment the Counter&lt;/strong&gt;: Use the Redis INCR command to increase the request count by 1.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set an Expiration&lt;/strong&gt;: Use the EXPIRE command to ensure the counter resets at the end of the time window. To make sure the expiration won’t reset every time we increment the counter, we also need to set the NX flag.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We’ll do this in a &lt;strong&gt;transaction&lt;/strong&gt; to ensure that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Both INCR and EXPIRE happen together, avoiding race conditions.&lt;/li&gt;
&lt;li&gt;Both INCR and EXPIRE are pipelined (sent in a batch to Redis) to reduce the number of network trips, improving performance.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    if (isAllowed) {
        Transaction transaction = jedis.multi();
        transaction.incr(key); // Increment the counter
        transaction.expire(key, windowSize, ExpiryOption.NX); // Set expiration only if not already set
        transaction.exec(); // Execute both commands atomically
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;The first request marks the start of the time window. Any subsequent requests during this window’s lifespan will increment the counter.&lt;br&gt;
 Once the window expires, the key is automatically removed from Redis. The next request after that will define the start of a new window.&lt;br&gt;
 If we didn’t set the NX flag, the expiration would be reset every time the counter is incremented, extending the lifespan of the window.&lt;/p&gt;
&lt;/blockquote&gt;
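&lt;p&gt;The difference the NX flag makes can be shown with a little arithmetic (timestamps in seconds; WindowExpirySketch is a made-up helper for this illustration, not part of the implementation):&lt;/p&gt;

```java
// Illustrates how re-applying EXPIRE on every increment would slide the
// window forward, while the NX flag keeps it anchored to the first request.
class WindowExpirySketch {
    // With EXPIRE ... NX: the expiry is set once, when the key is created.
    static long expiryWithNx(long firstRequestAt, long windowSize) {
        return firstRequestAt + windowSize;
    }

    // Without NX: every increment would reset the TTL, so the key only
    // expires windowSize seconds after the latest request.
    static long expiryWithoutNx(long latestRequestAt, long windowSize) {
        return latestRequestAt + windowSize;
    }
}
```

&lt;p&gt;With a 60-second window and requests at t=0 and t=45, the NX version expires the key at t=60, while the non-NX version would push expiry to t=105, silently lengthening the window.&lt;/p&gt;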
&lt;h3&gt;
  
  
  &lt;strong&gt;Complete Implementation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here’s the full code for the FixedWindowRateLimiter class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package io.redis;

    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.Transaction;
    import redis.clients.jedis.args.ExpiryOption;

    public class FixedWindowRateLimiter {

        private final Jedis jedis;
        private final int windowSize;
        private final int limit;

        public FixedWindowRateLimiter(Jedis jedis, int windowSize, int limit) {
            this.jedis = jedis;
            this.limit = limit;
            this.windowSize = windowSize;
        }

        public boolean isAllowed(String clientId) {
            String key = "rate_limit:" + clientId;
            String currentCountStr = jedis.get(key);
            int currentCount = currentCountStr != null ? Integer.parseInt(currentCountStr) : 0;

            boolean isAllowed = currentCount &amp;lt; limit;

            if (isAllowed) {
                Transaction transaction = jedis.multi();
                transaction.incr(key);
                transaction.expire(key, windowSize, ExpiryOption.NX); // Set expire only if not set
                transaction.exec();
            }

            return isAllowed;
        }
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And we’re ready to start testing its behavior!&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing our Rate Limiter
&lt;/h2&gt;

&lt;p&gt;To ensure our Fixed Window Rate Limiter behaves as expected, we’ll write tests for various scenarios. For this, we’ll use three tools:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Redis TestContainers&lt;/strong&gt;: This library spins up an isolated Redis container for testing. This means we don’t need to rely on an external Redis server during our tests. Once the tests are done, the container is stopped, leaving no leftover data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;JUnit 5&lt;/strong&gt;: Our main testing framework, which helps us define and structure tests with lifecycle methods like @BeforeEach and @AfterEach.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AssertJ&lt;/strong&gt;: A library that makes assertions readable and expressive, like assertThat(result).isTrue().&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s begin by adding the necessary dependencies to our pom.xml.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adding Dependencies
&lt;/h3&gt;

&lt;p&gt;Here’s what you’ll need in your Maven pom.xml file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;dependency&amp;gt;
        &amp;lt;groupId&amp;gt;org.junit.jupiter&amp;lt;/groupId&amp;gt;
        &amp;lt;artifactId&amp;gt;junit-jupiter-engine&amp;lt;/artifactId&amp;gt;
        &amp;lt;version&amp;gt;5.10.0&amp;lt;/version&amp;gt;
        &amp;lt;scope&amp;gt;test&amp;lt;/scope&amp;gt;
    &amp;lt;/dependency&amp;gt;

    &amp;lt;dependency&amp;gt;
        &amp;lt;groupId&amp;gt;com.redis&amp;lt;/groupId&amp;gt;
        &amp;lt;artifactId&amp;gt;testcontainers-redis&amp;lt;/artifactId&amp;gt;
        &amp;lt;version&amp;gt;2.2.2&amp;lt;/version&amp;gt;
        &amp;lt;scope&amp;gt;test&amp;lt;/scope&amp;gt;
    &amp;lt;/dependency&amp;gt;

    &amp;lt;dependency&amp;gt;
        &amp;lt;groupId&amp;gt;org.assertj&amp;lt;/groupId&amp;gt;
        &amp;lt;artifactId&amp;gt;assertj-core&amp;lt;/artifactId&amp;gt;
        &amp;lt;version&amp;gt;3.11.1&amp;lt;/version&amp;gt;
        &amp;lt;scope&amp;gt;test&amp;lt;/scope&amp;gt;
    &amp;lt;/dependency&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you’ve added these dependencies, you’re ready to start writing your test class.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting Up the Test Class
&lt;/h3&gt;

&lt;p&gt;The first step is to create a test class named FixedWindowRateLimiterTest. Inside, we’ll define three main components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Redis Test Container&lt;/strong&gt;: This launches a Redis instance in a Docker container.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Jedis Instance&lt;/strong&gt;: This connects to the Redis container for sending commands.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rate Limiter&lt;/strong&gt;: The actual FixedWindowRateLimiter instance we’re testing.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here’s how the skeleton of our test class looks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public class FixedWindowRateLimiterTest {

        private static final RedisContainer redisContainer = new RedisContainer("redis:latest")
                .withExposedPorts(6379);

        private Jedis jedis;
        private FixedWindowRateLimiter rateLimiter;

        // Start Redis container once before any tests run
        static {
            redisContainer.start();
        }
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Preparing the Environment Before Each Test&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before running any test, we need to ensure a clean Redis environment. Here’s what we’ll do:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Connect to Redis&lt;/strong&gt;: Use a Jedis instance to connect to the Redis container.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flush Data&lt;/strong&gt;: Clear any leftover data in Redis to ensure consistent results for each test.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We’ll set this up in a method annotated with @BeforeEach, which runs before every test case.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    @BeforeEach
    public void setup() {
        jedis = new Jedis(redisContainer.getHost(), redisContainer.getFirstMappedPort());
        jedis.flushAll();
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;FLUSHALL is an actual Redis command that deletes all the keys of all the existing databases. &lt;a href="https://redis.io/docs/latest/commands/flushall/" rel="noopener noreferrer"&gt;Read more about it in the official documentation&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cleaning Up After Each Test&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After each test, we need to close the Jedis connection to free up resources. This ensures no lingering connections interfere with subsequent tests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    @AfterEach
    public void tearDown() {
        jedis.close();
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Full Setup&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here’s how the complete test class looks with everything in place:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    public class FixedWindowRateLimiterTest {
        private static final RedisContainer redisContainer = new RedisContainer("redis:latest")
                .withExposedPorts(6379);

        private Jedis jedis;
        private FixedWindowRateLimiter rateLimiter;

        static {
            redisContainer.start();
        }

        @BeforeEach
        public void setup() {
            jedis = new Jedis(redisContainer.getHost(), redisContainer.getFirstMappedPort());
            jedis.flushAll();
        }

        @AfterEach
        public void tearDown() {
            jedis.close();
        }
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Verifying Requests Within the Limit&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This test ensures the rate limiter allows requests within the defined limit.&lt;/p&gt;

&lt;p&gt;We configure it with a &lt;strong&gt;limit of&lt;/strong&gt; &lt;strong&gt;5 requests&lt;/strong&gt; and a &lt;strong&gt;10-second window&lt;/strong&gt;, then call isAllowed(“client-1”) &lt;strong&gt;5 times&lt;/strong&gt;. Each call should return true, confirming the rate limiter correctly tracks and permits requests under the limit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    @Test
    public void shouldAllowRequestsWithinLimit() {
        rateLimiter = new FixedWindowRateLimiter(jedis, 10, 5);
        for (int i = 1; i &amp;lt;= 5; i++) {
            assertThat(rateLimiter.isAllowed("client-1"))
                    .withFailMessage("Request " + i + " should be allowed")
                    .isTrue();
        }
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Verifying &lt;strong&gt;Requests Beyond the Limit&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This test ensures the rate limiter correctly denies requests once the defined limit is exceeded.&lt;/p&gt;

&lt;p&gt;Configured with a &lt;strong&gt;limit of&lt;/strong&gt; &lt;strong&gt;5 requests&lt;/strong&gt; in a &lt;strong&gt;60-second window&lt;/strong&gt;, we call isAllowed(“client-1”) &lt;strong&gt;5 times&lt;/strong&gt; and expect all to return true. On the 6th call, it should return false, verifying the rate limiter blocks requests beyond the allowed limit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    @Test
    public void shouldDenyRequestsOnceLimitIsExceeded() {
        rateLimiter = new FixedWindowRateLimiter(jedis, 60, 5);
        for (int i = 1; i &amp;lt;= 5; i++) {
            assertThat(rateLimiter.isAllowed("client-1"))
                    .withFailMessage("Request " + i + " should be allowed")
                    .isTrue();
        }

        assertThat(rateLimiter.isAllowed("client-1"))
                .withFailMessage("Request beyond limit should be denied")
                .isFalse();
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Verifying Requests After Window Reset&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This test ensures the rate limiter resets correctly after the fixed window expires.&lt;/p&gt;

&lt;p&gt;Configured with a &lt;strong&gt;limit of 5 requests&lt;/strong&gt; and a &lt;strong&gt;1-second window&lt;/strong&gt;, the first 5 requests (isAllowed(“client-1”)) return true, while the 6th request is denied (false).&lt;/p&gt;

&lt;p&gt;After waiting for the window to expire, the next request is allowed (true), confirming the reset behavior works as expected.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    @Test
    public void shouldAllowRequestsAgainAfterFixedWindowResets() throws InterruptedException {
        int limit = 5;
        String clientId = "client-1";
        int windowSize = 1;
        rateLimiter = new FixedWindowRateLimiter(jedis, windowSize, limit);

        for (int i = 1; i &amp;lt;= limit; i++) {
            assertThat(rateLimiter.isAllowed(clientId))
                    .withFailMessage("Request " + i + " should be allowed")
                    .isTrue();
        }

        assertThat(rateLimiter.isAllowed(clientId))
                .withFailMessage("Request beyond limit should be denied")
                .isFalse();

        Thread.sleep((windowSize + 1) * 1000);

        assertThat(rateLimiter.isAllowed(clientId))
                .withFailMessage("Request after window reset should be allowed")
                .isTrue();
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Verifying Independent Handling of Multiple Clients&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This test ensures the rate limiter handles multiple clients independently.&lt;/p&gt;

&lt;p&gt;Configured with a &lt;strong&gt;limit of 5 requests&lt;/strong&gt; and a &lt;strong&gt;10-second window&lt;/strong&gt;, the first 5 requests from &lt;strong&gt;client-1&lt;/strong&gt; are allowed (true), while the 6th is denied (false).&lt;/p&gt;

&lt;p&gt;Simultaneously, all 5 requests from &lt;strong&gt;client-2&lt;/strong&gt; are allowed (true), confirming the rate limiter maintains separate counters for each client.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    @Test
    public void shouldHandleMultipleClientsIndependently() {
        int limit = 5;
        String clientId1 = "client-1";
        String clientId2 = "client-2";
        int windowSize = 10;
        rateLimiter = new FixedWindowRateLimiter(jedis, windowSize, limit);

        for (int i = 1; i &amp;lt;= limit; i++) {
            assertThat(rateLimiter.isAllowed(clientId1))
                    .withFailMessage("Client 1 request " + i + " should be allowed")
                    .isTrue();
        }

        assertThat(rateLimiter.isAllowed(clientId1))
                .withFailMessage("Client 1 request beyond limit should be denied")
                .isFalse();

        for (int i = 1; i &amp;lt;= limit; i++) {
            assertThat(rateLimiter.isAllowed(clientId2))
                    .withFailMessage("Client 2 request " + i + " should be allowed")
                    .isTrue();
        }
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Verifying Requests Are Denied Until Fixed Window Resets&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This test ensures the rate limiter denies additional requests until the fixed window expires.&lt;/p&gt;

&lt;p&gt;Configured with a &lt;strong&gt;limit of 3 requests&lt;/strong&gt; and a &lt;strong&gt;5-second window&lt;/strong&gt;, the first 3 requests (isAllowed(“client-1”)) are allowed (true), while the 4th is denied (false).&lt;/p&gt;

&lt;p&gt;After waiting for half the window duration (2.5 seconds), requests are still denied (false).&lt;/p&gt;

&lt;p&gt;Once the window fully resets (after another 2.5 seconds), the next request is allowed (true), confirming proper behavior during and after the fixed window.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    @Test
    public void shouldDenyAdditionalRequestsUntilFixedWindowResets() throws InterruptedException {
        int limit = 3;
        int windowSize = 5;
        String clientId = "client-1";
        rateLimiter = new FixedWindowRateLimiter(jedis, windowSize, limit);

        for (int i = 1; i &amp;lt;= limit; i++) {
            assertThat(rateLimiter.isAllowed(clientId))
                    .withFailMessage("Request " + i + " should be allowed within limit")
                    .isTrue();
        }

        assertThat(rateLimiter.isAllowed(clientId))
                .withFailMessage("Request beyond limit should be denied")
                .isFalse();

        Thread.sleep(2500);

        assertThat(rateLimiter.isAllowed(clientId))
                .withFailMessage("Request should still be denied within the same fixed window")
                .isFalse();

        Thread.sleep(2500);

        assertThat(rateLimiter.isAllowed(clientId))
                .withFailMessage("Request should be allowed after fixed window reset")
                .isTrue();
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Verifying Denied Requests Are Not Counted&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This test ensures that requests denied by the rate limiter are not included in the request count.&lt;/p&gt;

&lt;p&gt;Configured with a limit of 3 requests and a 5-second window, the first 3 requests (isAllowed(“client-1”)) are allowed (true), while the 4th is denied (false).&lt;/p&gt;

&lt;p&gt;Afterward, the Redis key for the client is checked to confirm the stored count equals the limit (3), ensuring denied requests do not increase the counter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    @Test
    public void testRateLimitDeniedRequestsAreNotCounted() {
        int limit = 3;
        int windowSize = 5;
        String clientId = "client-1";
        rateLimiter = new FixedWindowRateLimiter(jedis, windowSize, limit);

        for (int i = 1; i &amp;lt;= limit; i++) {
            assertThat(rateLimiter.isAllowed(clientId))
                    .withFailMessage("Request " + i + " should be allowed")
                    .isTrue();
        }

        assertThat(rateLimiter.isAllowed(clientId))
                .withFailMessage("This request should be denied")
                .isFalse();

        String key = "rate_limit:" + clientId;
        int requestCount = Integer.parseInt(jedis.get(key));
        assertThat(requestCount)
                .withFailMessage("The count (" + requestCount + ") should be equal to the limit (" + limit + "), not counting the denied request")
                .isEqualTo(limit);
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Is there any other behavior we should verify? Let me know in the comments!&lt;/p&gt;

&lt;p&gt;The Fixed Window Rate Limiter is a simple yet effective way to manage request rates, and &lt;strong&gt;Redis&lt;/strong&gt; makes it incredibly fast and reliable.&lt;/p&gt;

&lt;p&gt;By using commands like INCR and EXPIRE, we created a solution that tracks and limits requests while automatically resetting counters when the time window expires.&lt;/p&gt;

&lt;p&gt;With &lt;strong&gt;Jedis&lt;/strong&gt;, we built an easy-to-understand Java implementation, and thanks to thorough testing with Redis TestContainers, JUnit 5, and AssertJ, we can trust it works as expected.&lt;/p&gt;

&lt;p&gt;This approach is a great starting point for handling request limits and can easily be adapted for more complex scenarios if needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  GitHub Repo
&lt;/h3&gt;

&lt;p&gt;You can find this implementation in &lt;strong&gt;Java&lt;/strong&gt; and &lt;strong&gt;Kotlin&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Java (&lt;a href="https://github.com/raphaeldelio/redis-rate-limiter-java-example/blob/main/src/main/java/io/redis/FixedWindowRateLimiter.java" rel="noopener noreferrer"&gt;Implementation&lt;/a&gt;, &lt;a href="https://github.com/raphaeldelio/redis-rate-limiter-java-example/blob/main/src/test/java/io/redis/FixedWindowRateLimiterTest.java" rel="noopener noreferrer"&gt;Test&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kotlin (&lt;a href="https://github.com/raphaeldelio/redis-rate-limiter-kotlin-example/blob/main/src/main/kotlin/org/example/FixedWindowRateLimiter.kt" rel="noopener noreferrer"&gt;Implementation&lt;/a&gt;, &lt;a href="https://github.com/raphaeldelio/redis-rate-limiter-kotlin-example/blob/main/src/test/kotlin/org/example/FixedWindowRateLimiterTest.kt" rel="noopener noreferrer"&gt;Test&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Stay Curious!
&lt;/h3&gt;

</description>
      <category>java</category>
      <category>redis</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Rate limiting with Redis: An essential guide</title>
      <dc:creator>Raphael De Lio</dc:creator>
      <pubDate>Mon, 23 Dec 2024 15:28:17 +0000</pubDate>
      <link>https://dev.to/redis/rate-limiting-with-redis-an-essential-guide-4jll</link>
      <guid>https://dev.to/redis/rate-limiting-with-redis-an-essential-guide-4jll</guid>
<description>&lt;p&gt;&lt;a href="https://bsky.app/profile/raphaeldelio.dev" rel="noopener noreferrer"&gt;Bluesky&lt;/a&gt; | &lt;a href="https://twitter.com/raphaeldelio" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; | &lt;a href="https://www.linkedin.com/in/raphaeldelio/?originalSubdomain=nl" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://www.youtube.com/@raphaeldelio" rel="noopener noreferrer"&gt;YouTube&lt;/a&gt; | &lt;a href="https://www.instagram.com/raphaeldelio/" rel="noopener noreferrer"&gt;Instagram&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=YV4ePyW3DO8" rel="noopener noreferrer"&gt;This article is also available on YouTube!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rate limiting — it’s something you’ve likely encountered, even if you haven’t directly implemented one. For example, have you ever been greeted by a “429 Too Many Requests” error? That’s a rate limiter in action, protecting a resource from overload. Or maybe you’ve used a service with explicit request quotas based on your payment tier — same concept, just more transparent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxvzr7zxnbw9wpydi4v9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxvzr7zxnbw9wpydi4v9.png" alt="ChatGPT warning user that they have reached the limit of messages they can send in 24 hours." width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rate limiting isn’t just about setting limits; it serves a variety of purposes. Take Figma, for instance. Their rate limiter, built with Redis, saved them from a spam attack where bad actors sent massive document invitations to random email addresses. Without it, Figma could have faced skyrocketing email delivery costs and damaged reputation. Or look at Stripe: as their platform grew, they realized they couldn’t just throw more infrastructure at the problem. They needed a smarter solution to prevent resource monopolization by misconfigured scripts or bad actors.&lt;/p&gt;

&lt;p&gt;These stories show just how versatile rate limiting is. It prevents abuse, ensures fair access, manages load, cuts costs, and even protects against downtime. But here’s the kicker: the hard part isn’t knowing &lt;em&gt;why&lt;/em&gt; you need a rate limiter. The real challenge is building one that’s both efficient and tailored to your needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Redis for Rate Limiting?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Redis has become a go-to tool for implementing rate limiters, and for good reason. It’s fast, reliable, and packed with features like atomic operations, data persistence, and Lua scripting. Just ask GitHub. When they migrated to a Redis-backed solution with client-side sharding, they solved tough challenges like replication, consistency, and scalability while ensuring reliable behavior across their infrastructure.&lt;/p&gt;

&lt;p&gt;So, why Redis? Its speed, versatility, and built-in capabilities make it perfect for handling distributed traffic patterns. But what’s even more important is &lt;em&gt;how&lt;/em&gt; you use it. Let’s break down the most common rate-limiting patterns you can implement with Redis and what each one brings to the table.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Popular Rate-Limiting Patterns&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Choosing the right rate-limiting algorithm can be challenging. Here’s a breakdown of the most popular options, when to use them, and their trade-offs, with practical examples to help you decide:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Leaky Bucket&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;How It Works&lt;/strong&gt;: Imagine a bucket with a small hole at the bottom. Requests (water) are added to the bucket and processed at a steady “drip” rate, preventing sudden floods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Fraphaeldelio.com%2Fwp-content%2Fuploads%2F2024%2F12%2FLeaky-Bucket.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Fraphaeldelio.com%2Fwp-content%2Fuploads%2F2024%2F12%2FLeaky-Bucket.gif" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt; Ideal for smoothing traffic flow, such as in streaming services or payment processing, where a predictable output is critical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A video streaming platform regulates API calls to its content delivery network, ensuring consistent playback quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drawback:&lt;/strong&gt; Not suitable for handling sudden bursts, like flash sales or promotional campaigns.&lt;/p&gt;
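&lt;p&gt;To make the “drip” concrete, here is a minimal in-memory sketch of the leaky bucket logic (the class and method names are illustrative, not from any library). A Redis-backed version typically models the bucket as a LIST: a request is enqueued with RPUSH after checking LLEN against the capacity, while a worker pops entries at the fixed drip rate.&lt;/p&gt;

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative in-memory leaky bucket. In Redis, the Deque below would be
// a LIST per client: LLEN to check capacity, RPUSH to enqueue, LPOP to drip.
class LeakyBucketRateLimiter {
    private final int capacity;
    private final Deque<String> bucket = new ArrayDeque<>();

    LeakyBucketRateLimiter(int capacity) {
        this.capacity = capacity;
    }

    // Called on each incoming request; false means the bucket overflowed.
    synchronized boolean tryEnqueue(String requestId) {
        if (bucket.size() >= capacity) {
            return false; // bucket is full: reject the request
        }
        bucket.addLast(requestId);
        return true;
    }

    // Called by a scheduler at the steady drip rate.
    synchronized String drip() {
        return bucket.pollFirst(); // process one request, or null if empty
    }
}
```

&lt;p&gt;The steady output rate comes entirely from how often &lt;code&gt;drip()&lt;/code&gt; is scheduled, which is exactly why this pattern struggles with legitimate bursts.&lt;/p&gt;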

&lt;h3&gt;
  
  
  &lt;strong&gt;Token Bucket&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;How It Works:&lt;/strong&gt; Tokens are generated at a fixed rate and stored in a bucket. Each request consumes a token, allowing for short bursts as long as tokens are available.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Fraphaeldelio.com%2Fwp-content%2Fuploads%2F2024%2F12%2Ftoken-bucket.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Fraphaeldelio.com%2Fwp-content%2Fuploads%2F2024%2F12%2Ftoken-bucket.gif" width="1080" height="608"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt; Perfect for APIs that need to handle occasional traffic spikes while enforcing overall limits, such as login attempts or search queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; An e-commerce site allows bursts of up to 20 requests per second during checkout but limits the overall rate to 100 requests per minute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drawback:&lt;/strong&gt; Requires periodic token replenishment, which can introduce minor overhead in distributed systems.&lt;/p&gt;
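&lt;p&gt;Here is a minimal in-memory sketch of the token bucket logic (names are illustrative). In a Redis deployment, the token count and last-refill timestamp are usually stored together and updated atomically, often with a Lua script, so concurrent clients can’t double-spend tokens.&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative in-memory token bucket. A Redis version would keep
// {tokens, lastRefillMs} per client in a hash and update both atomically.
class TokenBucketRateLimiter {
    private final double capacity;    // maximum burst size
    private final double refillPerMs; // tokens added per millisecond
    private final Map<String, double[]> state = new HashMap<>(); // {tokens, lastRefillMs}

    TokenBucketRateLimiter(double capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerMs = refillPerSecond / 1000.0;
    }

    synchronized boolean isAllowed(String clientId, long nowMs) {
        double[] s = state.computeIfAbsent(clientId, k -> new double[] {capacity, nowMs});
        // Refill tokens for the time elapsed since the last request, capped at capacity
        s[0] = Math.min(capacity, s[0] + (nowMs - s[1]) * refillPerMs);
        s[1] = nowMs;
        if (s[0] >= 1.0) {
            s[0] -= 1.0; // consume one token
            return true;
        }
        return false; // bucket empty: reject
    }
}
```

&lt;p&gt;Bursts are allowed up to &lt;code&gt;capacity&lt;/code&gt;, while the refill rate enforces the long-term limit.&lt;/p&gt;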

&lt;h3&gt;
  
  
  Fixed Window Counter
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;How It Works:&lt;/strong&gt; Tracks the number of requests in fixed intervals (e.g., 1 minute). Once the limit is reached, all subsequent requests in that window are denied.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbib74ec8hvwfgiqnctyb.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbib74ec8hvwfgiqnctyb.gif" width="1080" height="608"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt; Simple APIs with predictable traffic and low precision needs, like throttling a hobbyist developer’s free-tier usage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A public weather API allows 100 requests per user per minute, with any extra requests returning a “429 Too Many Requests” response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drawback:&lt;/strong&gt; Users can game the system by stacking requests at the boundary of two time windows (e.g., 100 at 59 seconds and 100 at 1 second of the next window).&lt;/p&gt;
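&lt;p&gt;Here is a minimal in-memory sketch of the fixed window counter (names are illustrative). The Redis equivalent is a single INCR on a key derived from the current window (e.g., &lt;code&gt;rate:{user}:{windowStart}&lt;/code&gt;), with an EXPIRE set on the first increment so counters clean themselves up.&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative in-memory fixed window counter. In Redis this is INCR on a
// window-scoped key plus EXPIRE, so no manual reset logic is needed.
class FixedWindowRateLimiter {
    private final int limit;
    private final long windowMillis;
    private final Map<String, Integer> counters = new HashMap<>();
    private final Map<String, Long> windowStarts = new HashMap<>();

    FixedWindowRateLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    synchronized boolean isAllowed(String clientId, long nowMs) {
        long windowStart = nowMs - (nowMs % windowMillis); // fixed window boundary
        Long prev = windowStarts.get(clientId);
        if (prev == null || prev != windowStart) {
            windowStarts.put(clientId, windowStart); // new window: reset the counter
            counters.put(clientId, 0);
        }
        int count = counters.merge(clientId, 1, Integer::sum); // Redis: INCR
        return count <= limit;
    }
}
```

&lt;p&gt;Note how the counter resets abruptly at every boundary — that hard reset is the source of the boundary-stacking drawback above.&lt;/p&gt;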

&lt;h3&gt;
  
  
  &lt;strong&gt;Sliding Window Log&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;How It Works:&lt;/strong&gt; Maintains a log of timestamps for each request and calculates limits based on a rolling time window.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijey90dcz0ijt3cxlqjw.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijey90dcz0ijt3cxlqjw.gif" width="720" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt; Critical systems requiring high accuracy, such as financial transaction APIs or fraud detection mechanisms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A banking API limits withdrawals to 10 per hour, with each new request evaluated against the timestamps of the last 10 requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drawback:&lt;/strong&gt; High memory usage and computational cost when scaling to millions of users or frequent requests.&lt;/p&gt;
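&lt;p&gt;Here is a minimal in-memory sketch of the sliding window log (names are illustrative). In Redis the log is typically a sorted set per client: ZADD the request timestamp, ZREMRANGEBYSCORE to evict entries older than the window, then ZCARD to count what remains.&lt;/p&gt;

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Illustrative in-memory sliding window log. The Deque plays the role of a
// Redis sorted set keyed by timestamp (ZADD / ZREMRANGEBYSCORE / ZCARD).
class SlidingWindowLogRateLimiter {
    private final int limit;
    private final long windowMs;
    private final Map<String, Deque<Long>> logs = new HashMap<>();

    SlidingWindowLogRateLimiter(int limit, long windowMs) {
        this.limit = limit;
        this.windowMs = windowMs;
    }

    synchronized boolean isAllowed(String clientId, long nowMs) {
        Deque<Long> log = logs.computeIfAbsent(clientId, k -> new ArrayDeque<>());
        // Evict timestamps that slid out of the window (ZREMRANGEBYSCORE)
        while (!log.isEmpty() && log.peekFirst() <= nowMs - windowMs) {
            log.pollFirst();
        }
        if (log.size() >= limit) { // ZCARD check
            return false;
        }
        log.addLast(nowMs); // ZADD the new request's timestamp
        return true;
    }
}
```

&lt;p&gt;Because every request is stored individually, enforcement is exact — and that is also why memory grows with traffic.&lt;/p&gt;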

&lt;h3&gt;
  
  
  &lt;strong&gt;Sliding Window Counter&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;How It Works:&lt;/strong&gt; Divides the time window into smaller intervals (e.g., 10-second buckets) and aggregates request counts to approximate a rolling window.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Fraphaeldelio.com%2Fwp-content%2Fuploads%2F2024%2F12%2FSliding-Window-Counter.gif%3Fresize%3D1080%252C608%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Fraphaeldelio.com%2Fwp-content%2Fuploads%2F2024%2F12%2FSliding-Window-Counter.gif%3Fresize%3D1080%252C608%26ssl%3D1" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt; APIs that need a balance between accuracy and efficiency, like chat systems or lightweight rate-limiting for microservices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A messaging app limits users to 30 messages per minute but divides the minute into 6 buckets, allowing more flexibility in traffic patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drawback:&lt;/strong&gt; Small inaccuracies can occur, especially during highly bursty traffic patterns.&lt;/p&gt;
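&lt;p&gt;Here is a minimal in-memory sketch of the sliding window counter (names are illustrative). A Redis version keeps one counter per sub-bucket — for example, HINCRBY on fields keyed by bucket index, or INCR on per-bucket keys with a TTL slightly longer than the window — and sums the live buckets on each check.&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative in-memory sliding window counter. Each sub-bucket maps to
// one Redis counter (e.g., HINCRBY field per bucket index); expired buckets
// simply age out via TTL instead of the removeIf below.
class SlidingWindowCounterRateLimiter {
    private final int limit;
    private final long bucketMs;
    private final int bucketsPerWindow;
    private final Map<String, Map<Long, Integer>> counters = new HashMap<>();

    SlidingWindowCounterRateLimiter(int limit, long windowMs, int bucketsPerWindow) {
        this.limit = limit;
        this.bucketMs = windowMs / bucketsPerWindow;
        this.bucketsPerWindow = bucketsPerWindow;
    }

    synchronized boolean isAllowed(String clientId, long nowMs) {
        long currentBucket = nowMs / bucketMs;
        Map<Long, Integer> buckets = counters.computeIfAbsent(clientId, k -> new HashMap<>());
        // Drop sub-buckets that have slid entirely out of the window
        buckets.keySet().removeIf(b -> b <= currentBucket - bucketsPerWindow);
        int total = buckets.values().stream().mapToInt(Integer::intValue).sum();
        if (total >= limit) {
            return false;
        }
        buckets.merge(currentBucket, 1, Integer::sum); // HINCRBY equivalent
        return true;
    }
}
```

&lt;p&gt;The approximation comes from evicting whole sub-buckets at once: the finer the buckets, the closer this gets to the exact sliding window log at a fraction of the memory.&lt;/p&gt;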

&lt;h2&gt;
  
  
  &lt;strong&gt;Choosing the Right Tool for the Job&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Selecting a rate-limiting strategy isn’t just about matching patterns to scenarios; it’s about understanding the trade-offs and the specific needs of your application. Here’s how to make a more informed choice:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Understand Your Traffic Patterns&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Predictable Traffic&lt;/strong&gt;: If your API serves consistent request rates (e.g., hourly status checks or regular polling), &lt;strong&gt;Leaky Bucket&lt;/strong&gt; is excellent for maintaining a steady flow.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Burst Traffic&lt;/strong&gt;: If you expect short bursts of traffic, such as during product launches or login spikes, &lt;strong&gt;Token Bucket&lt;/strong&gt; allows controlled bursts while enforcing limits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mixed Traffic&lt;/strong&gt;: APIs with unpredictable traffic may benefit from &lt;strong&gt;Sliding Window Counter&lt;/strong&gt;, which balances accuracy and resource usage.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Assess the Level of Precision Needed&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High Precision&lt;/strong&gt;: If exact limits are critical (e.g., financial transactions or fraud detection), &lt;strong&gt;Sliding Window Log&lt;/strong&gt; provides the most accurate enforcement by logging every request.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Approximation is Okay&lt;/strong&gt;: For most APIs, &lt;strong&gt;Sliding Window Counter&lt;/strong&gt; strikes a balance between precision and efficiency, as it uses aggregated data instead of tracking every request.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Consider Resource Constraints&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory and CPU Overhead&lt;/strong&gt;: Algorithms like &lt;strong&gt;Sliding Window Log&lt;/strong&gt; can become resource-intensive at scale, especially with millions of users. For a lightweight alternative, &lt;strong&gt;Fixed Window Counter&lt;/strong&gt; is simple but effective for low-traffic APIs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Redis makes scaling rate limiting easier with atomic operations, Lua scripting, and replication features, but your choice of algorithm still affects performance. For instance, &lt;strong&gt;Token Bucket&lt;/strong&gt; is computationally cheaper than &lt;strong&gt;Sliding Window Log&lt;/strong&gt; in most distributed systems.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Account for User Experience
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;User Tolerance for Errors&lt;/strong&gt;: Fixed-window approaches like &lt;strong&gt;Fixed Window Counter&lt;/strong&gt; may frustrate users due to rigid resets. Sliding-window methods smooth out these boundaries, leading to a better user experience.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Handling Edge Cases&lt;/strong&gt;: Algorithms like &lt;strong&gt;Token Bucket&lt;/strong&gt; allow some flexibility for bursts, which can help avoid unnecessary rate-limit errors during legitimate usage spikes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the end, rate limiting is about more than just enforcing boundaries — it’s about designing systems that are efficient, fair, and user-friendly. By carefully matching the algorithm to your use case, you’re not just managing traffic — you’re shaping a better experience for everyone involved.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stay curious!
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiob68ouzayite9w4zs56.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiob68ouzayite9w4zs56.png" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>redis</category>
      <category>systemdesign</category>
      <category>java</category>
      <category>architecture</category>
    </item>
    <item>
      <title>What do 200 electrocuted monks have to do with Redis 8, the fastest Redis ever?</title>
      <dc:creator>Raphael De Lio</dc:creator>
      <pubDate>Tue, 19 Nov 2024 10:49:43 +0000</pubDate>
      <link>https://dev.to/raphaeldelio/what-do-200-electrocuted-monks-have-to-do-with-redis-8-the-fastest-redis-ever-3kca</link>
      <guid>https://dev.to/raphaeldelio/what-do-200-electrocuted-monks-have-to-do-with-redis-8-the-fastest-redis-ever-3kca</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://bsky.app/profile/raphaeldelio.dev" rel="noopener noreferrer"&gt;Bluesky&lt;/a&gt; | &lt;a href="https://twitter.com/raphaeldelio" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; | &lt;a href="https://www.linkedin.com/in/raphaeldelio/?originalSubdomain=nl" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://www.youtube.com/@raphaeldelio" rel="noopener noreferrer"&gt;YouTube&lt;/a&gt; | &lt;a href="https://www.instagram.com/raphaeldelio/" rel="noopener noreferrer"&gt;Instagram&lt;/a&gt;&lt;br&gt;
&lt;a href="https://youtu.be/ok2mSw-z1Q0" rel="noopener noreferrer"&gt;This article is also available on YouTube!&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8m6owqefila8qermn1e7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8m6owqefila8qermn1e7.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Have you ever heard of Jean-Antoine Nollet? Back in the 18th century, Nollet carried out an experiment where he lined up 200 monks, each connected hand-to-hand with iron wires, forming a continuous chain over a mile (1.6 km) long. &lt;strong&gt;Once everything was set up, he connected a primitive electrical battery to the line, delivering a powerful electric shock to all of them simultaneously.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now, &lt;strong&gt;Nollet wasn’t just zapping monks for kicks.&lt;/strong&gt; His experiment had a serious purpose: to study the properties of electricity and see how far and how fast it could travel along a wire. &lt;strong&gt;This was groundbreaking at a time when sending a message 100 miles took nearly a day by horseback&lt;/strong&gt;. Nollet’s work hinted at something revolutionary — &lt;em&gt;the potential for electricity to be used for communication&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Fast forward to the 19th century, and the telegraph brought this idea to life. &lt;strong&gt;Suddenly, messages that used to take days could travel in minutes.&lt;/strong&gt; Samuel Morse and other inventors transformed Nollet’s findings into a world-changing technology. &lt;strong&gt;The telegraph became the 19th-century equivalent of the internet, connecting people in ways no one had imagined before.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As Tom Standage describes in &lt;em&gt;The Victorian Internet&lt;/em&gt;, &lt;em&gt;the telegraph was so fast it scared some people&lt;/em&gt;. Critics even argued it was “too fast for the truth.” It sounds funny now, doesn’t it?&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Today, we see the internet as almost instantaneous, but back then, the telegraph felt like a leap into hyperspeed.&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;That said, even with our modern tech, &lt;strong&gt;we sometimes still think the internet is slow&lt;/strong&gt;. To Nollet, the speed we’ve reached would have been incomprehensible, but we know there are limits. For example, &lt;strong&gt;even if data could travel at the speed of light in a vacuum, it would still take about 56.7 milliseconds to get from London to Sydney.&lt;/strong&gt; That’s just physics — it can’t get any faster.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;But speed isn’t just about how fast data travels; it’s also about how quickly it gets processed.&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With applications like real-time gaming, video streaming, and AI-powered services, every millisecond matters. &lt;strong&gt;That’s where Redis comes in.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redis&lt;/strong&gt; is an in-memory database designed for speed. Unlike traditional databases that rely on disks, &lt;strong&gt;Redis keeps everything in RAM, giving you access times measured in microseconds.&lt;/strong&gt; This makes it ideal for real-time analytics, online gaming, and AI workloads where responsiveness is critical.&lt;/p&gt;

&lt;h3&gt;
  
  
  And guess what? Redis just got even faster with Redis 8.
&lt;/h3&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Redis 8: Faster Than Ever&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The latest milestone, Redis 8.0 M02, brings significant latency reductions across widely-used commands, such as up to a 36% reduction in latency for ZADD, 28% for SMEMBERS, and 10% for HGETALL compared to Redis 7.2.5. Over 70% of Redis users will experience noticeably faster responses with these improvements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redis, like the telegraph once was, is revolutionizing our expectations of speed.&lt;/strong&gt; It ensures that not only does data reach its destination quickly, but that it’s immediately available for processing and analysis. &lt;strong&gt;In a world where even a 100-millisecond delay can impact user experience, Redis plays a crucial role in minimizing the lag.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Scaling Like Never Before&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Redis 8 isn’t just faster, it’s more scalable too&lt;/strong&gt;. It brings features that were previously only available in Redis Cloud and Redis Software, like horizontal and vertical scaling for the Redis Query Engine.&lt;/p&gt;

&lt;p&gt;With horizontal scaling, you can handle much larger datasets by clustering databases, which boosts read and write throughput. &lt;strong&gt;Vertical scaling adds processing power, delivering up to 16x more throughput.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Benchmarking Redis 8: Breaking Records&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To showcase its improvements, Redis partnered with Intel to test its performance with one billion 768-dimensional vector embeddings. The results? &lt;strong&gt;Redis handled up to 66,000 vector insertions per second with indexing for 95% precision and up to 160,000 insertions per second for lower precision indexing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Even with high-precision queries, Redis delivered a median latency of 200 milliseconds for a 90% precision rate when searching the top 100 nearest neighbors.&lt;/strong&gt; And by tweaking HNSW (Hierarchical Navigable Small World) parameters, you can fine-tune Redis to balance speed and accuracy for your specific use case.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://redis.io/blog/redis-8-0-m02-the-fastest-redis-ever/" rel="noopener noreferrer"&gt;See more of the benchmarks in the official Redis Blog.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Try Redis 8 Today&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Redis 8.0 M02 is available now, and you can experience its speed and scalability for yourself. &lt;strong&gt;Whether you’re looking for better latency, scalable query engines, or support for billion-scale vector search workloads, Redis 8 is ready to deliver.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start experimenting today by downloading an Alpine or Debian Docker image from &lt;a href="https://hub.docker.com/_/redis" rel="noopener noreferrer"&gt;Redis Docker Hub&lt;/a&gt;. See what Redis 8 can do for your real-time applications!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faehfczgj0dqxzsgqs4p9.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faehfczgj0dqxzsgqs4p9.gif" width="1080" height="608"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>redis</category>
      <category>database</category>
      <category>opensource</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>Don't forget to flush! — Ensuring Data Integrity in Spring Data JPA</title>
      <dc:creator>Raphael De Lio</dc:creator>
      <pubDate>Sun, 29 Sep 2024 12:05:14 +0000</pubDate>
      <link>https://dev.to/raphaeldelio/dont-forget-to-flush-ensuring-data-integrity-in-spring-data-jpa-aab</link>
      <guid>https://dev.to/raphaeldelio/dont-forget-to-flush-ensuring-data-integrity-in-spring-data-jpa-aab</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://twitter.com/raphaeldelio" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; | &lt;a href="https://www.linkedin.com/in/raphaeldelio/?originalSubdomain=nl" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://www.youtube.com/@raphaeldelio" rel="noopener noreferrer"&gt;YouTube&lt;/a&gt; | &lt;a href="https://www.instagram.com/raphaeldelio/" rel="noopener noreferrer"&gt;Instagram&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl0qi8sl9n468xyfiazyx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl0qi8sl9n468xyfiazyx.png" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Just like you wouldn’t leave the bathroom without flushing, you shouldn’t navigate through Spring Data JPA without understanding the importance of flushing. Flushing, in the context of JPA (Java Persistence API), is like telling your application, “Hey, let’s make sure all our pending changes to the database are actually sent and stored properly!”. It ensures that your in-memory changes are synchronized with the database.&lt;/p&gt;

&lt;p&gt;Imagine you’re editing a document; flushing is like hitting the ‘save’ button to ensure all your changes are permanently stored. In the context of JPA, this means ensuring that any modifications made to your entities are actually reflected in the database. It’s a process that can happen automatically, like a sensor-flush in modern toilets, or manually, where you decide the right moment to sync, similar to the traditional toilet flush lever.&lt;/p&gt;

&lt;p&gt;Grasping the flushing mechanism is vital. Without proper flushing, you might end up with data discrepancies, where changes in your application’s memory don’t match what’s in the database. It’s like assuming your toilet will flush on its own, only to find out it doesn’t, leading to an unpleasant situation. Proper flushing ensures that your data integrity is maintained and your application’s interaction with the database is smooth and error-free.&lt;/p&gt;

&lt;p&gt;Let’s take a look at an example:&lt;/p&gt;

&lt;h3&gt;
  
  
  The Deduplication Strategy with Flushing in Spring Boot JPA
&lt;/h3&gt;

&lt;p&gt;Imagine you’re working with a function in Spring Boot that should run only once for a unique set of parameters. To ensure this uniqueness, you use a deduplication strategy involving a database table.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Transactional
public void processIdempotent(
        String eventId,
        String data
) {
    deduplicate(eventId);
    updateDatabase(data);
    sendMessage(data);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  The Deduplication Table:
&lt;/h3&gt;

&lt;p&gt;You create a special table in your database. This table’s job is to store each unique set of parameters your function uses. It’s designed so that if you try to insert a set of parameters that’s already in the table, the database will throw a constraint violation exception.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Entity(name="processed_events")
public class ProcessedEvent implements Serializable, Persistable&amp;lt;String&amp;gt; {

    @Id
    @Column(name="eventid")
    private String eventId;

    public ProcessedEvent(){}

    public ProcessedEvent(final String eventId) {
        this.eventId = eventId;
    }

    /**
     * Ensures Hibernate always does an INSERT operation when save() is called.
     */
    @Transient
    @Override
    public boolean isNew() {
        return true;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Transactional Integrity and the Challenge of Parallel Execution:
&lt;/h3&gt;

&lt;p&gt;In Spring Boot JPA, database interactions are often wrapped in transactions. This means all operations, including the insertion into your deduplication table, are only finalized when the transaction commits. If any part of the transaction fails, everything is rolled back.&lt;/p&gt;

&lt;p&gt;However, imagine two instances of your function running at the same time, each within its own transaction. They both check the deduplication table and, finding no existing entries for their parameters, proceed.&lt;/p&gt;

&lt;p&gt;Even though one of the transactions will fail by the time it tries to commit, this may still cause inconsistencies, especially when your function interacts with external systems, such as a message broker or a REST API, operations that won't be rolled back with the database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixz4qnwxd3cj9oup6zgv.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixz4qnwxd3cj9oup6zgv.gif" width="760" height="491"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The Flushing Solution:
&lt;/h3&gt;

&lt;p&gt;To prevent this issue, you can use flushing right after inserting into the deduplication table. Flushing forces JPA to immediately synchronize the current state of the session with the database. So, if two instances of the function run in parallel, as soon as one tries to flush its insertion into the deduplication table, it’ll either succeed or fail immediately if the other has already inserted the same parameters.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;private void deduplicate(UUID eventId) throws DuplicateEventException {
    try {
        processedEventRepository.saveAndFlush(new
ProcessedEvent(eventId));
        log.debug("Event persisted with Id: {}", eventId);
    } catch (DataIntegrityViolationException | PessimisticLockingFailureException e) {
        log.warn("Event already processed: {}", eventId);
        throw new DuplicateEventException(eventId);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This immediate feedback is crucial. It prevents the function from fully executing if another instance has already run with the same parameters, ensuring that each unique set of parameters triggers the function only once. Flushing here acts as an early alert system, maintaining the integrity of your deduplication logic and preventing potential inconsistencies, especially when your function interacts with other systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6lspx4pp3fvv5ytwzbd1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6lspx4pp3fvv5ytwzbd1.gif" width="760" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;As a developer, knowing when to flush in JPA is key to ensuring your data changes are properly saved and reflected in the database. It’s one of those fundamental skills that can save you from a lot of headaches down the road. So, remember to flush wisely and keep your data in sync — it’s as crucial in JPA as it is in real life after using the restroom!&lt;/p&gt;

&lt;h3&gt;
  
  
  Stay curious!
&lt;/h3&gt;

&lt;h2&gt;
  
  
  Contribute
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Writing takes time and effort.&lt;/em&gt;&lt;/strong&gt; I love writing and sharing knowledge, but I also have bills to pay. If you like my work, please, &lt;strong&gt;consider donating through Buy Me a Coffee: &lt;a href="https://www.buymeacoffee.com/RaphaelDeLio" rel="noopener noreferrer"&gt;https://www.buymeacoffee.com/RaphaelDeLio&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Or by sending me BitCoin: 1HjG7pmghg3Z8RATH4aiUWr156BGafJ6Zw&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Follow Me on Social Media
&lt;/h2&gt;

&lt;p&gt;Stay connected and dive deeper into the world of Spring with me! Follow my journey across all major social platforms for exclusive content, tips, and discussions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/raphaeldelio" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; | &lt;a href="https://www.linkedin.com/in/raphaeldelio/?originalSubdomain=nl" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://www.youtube.com/@raphaeldelio" rel="noopener noreferrer"&gt;YouTube&lt;/a&gt; | &lt;a href="https://www.instagram.com/raphaeldelio/" rel="noopener noreferrer"&gt;Instagram&lt;/a&gt;&lt;/p&gt;

</description>
      <category>java</category>
      <category>springboot</category>
      <category>database</category>
      <category>spring</category>
    </item>
    <item>
      <title>The 6 Principles of Microservices Architecture</title>
      <dc:creator>Raphael De Lio</dc:creator>
      <pubDate>Sat, 28 Sep 2024 10:36:15 +0000</pubDate>
      <link>https://dev.to/raphaeldelio/the-6-principles-of-microservices-architecture-17ng</link>
      <guid>https://dev.to/raphaeldelio/the-6-principles-of-microservices-architecture-17ng</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://twitter.com/raphaeldelio" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; | &lt;a href="https://www.linkedin.com/in/raphaeldelio/?originalSubdomain=nl" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://www.youtube.com/@raphaeldelio" rel="noopener noreferrer"&gt;YouTube&lt;/a&gt; | &lt;a href="https://www.instagram.com/raphaeldelio/" rel="noopener noreferrer"&gt;Instagram&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I recently attended &lt;a href="https://www.linkedin.com/in/urs-peter-70a2882/?originalSubdomain=nl" rel="noopener noreferrer"&gt;Urs Peter&lt;/a&gt;’s course on Event-Driven Architecture, and one of the cool things we dived into right at the start was the six key principles of Microservices architecture.&lt;/p&gt;

&lt;p&gt;It’s important to remember that microservices aren’t a magic fix: they won’t solve every issue, and if implemented incorrectly, they can introduce significant new challenges. So, today, I’m excited to share these six principles of Microservices architecture, which you need to get right for the architecture to work well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ya5oozdz4lgwjzobyvo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ya5oozdz4lgwjzobyvo.png" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Isolation
&lt;/h2&gt;

&lt;p&gt;In a microservices architecture, isolation ensures each microservice functions independently with its own codebase, data storage, and runtime environment, preventing process and resource sharing with other services.&lt;/p&gt;

&lt;p&gt;One of the key advantages of isolation is that it contains failures within a single service. If one microservice fails, it doesn’t necessarily bring down the entire system, as other services continue to operate independently.&lt;/p&gt;

&lt;p&gt;Following this principle means that each microservice owns its data and data model, and that no database is shared directly between services. Data sharing, if necessary, is done through well-defined interfaces (APIs).&lt;/p&gt;

&lt;p&gt;Isolated services also typically run in isolated environments, such as containers, ensuring that issues in one service (like a memory leak) do not affect other services.&lt;/p&gt;

&lt;p&gt;However, isolating our microservices makes the overall architecture more complex, with multiple isolated services interacting with each other.&lt;/p&gt;

&lt;p&gt;Moreover, managing communication between services, especially in an asynchronous environment, may also be challenging. Multiple databases add complexity to ensuring consistency across different services, especially when handling distributed transactions.&lt;/p&gt;

&lt;p&gt;And, naturally, more services mean more deployments, more monitoring, and potentially more points of failure, increasing the overall operational complexity of your system.&lt;/p&gt;

&lt;p&gt;In microservices architecture, isolation is all about finding the perfect balance between letting each service do its own thing and making sure they all work well together. The goal is to create a system where each service can stand on its own, handle problems without causing a domino effect, and easily grow as needed. But at the same time, all these services need to work together smoothly as part of a bigger picture. Getting this balance right isn’t just about the technical aspects; it also involves how teams work together and how the whole operation is run.&lt;/p&gt;

&lt;h2&gt;
  
  
  Autonomy
&lt;/h2&gt;

&lt;p&gt;Each microservice should be autonomous, meaning it makes decisions based on its context without depending on other services. This includes how it processes data, handles business logic, and responds to requests. An autonomous service encapsulates a specific business functionality. It’s responsible for all aspects of that function, from data processing to business rules.&lt;/p&gt;

&lt;p&gt;Teams should be able to develop and test their services independently, using the tools and languages best suited to each service's functionality. They should own their data and define their data schema. This data is exposed only through APIs, so each service retains control over how its data is accessed and used.&lt;/p&gt;

&lt;p&gt;Moreover, services should be deployable independently. This means a service can be updated, fixed, or scaled without needing to redeploy the entire application.&lt;/p&gt;

&lt;p&gt;These benefits come with added complexity. While services are independent, they often need to communicate, and managing these communication patterns without creating tight coupling is a challenge.&lt;/p&gt;

&lt;p&gt;Besides that, autonomous services can lead to duplication of effort or infrastructure, as each service may require its own support mechanisms like databases, caching, and logging.&lt;/p&gt;

&lt;p&gt;Autonomy in microservices is about empowering individual services to operate independently while still contributing effectively to the overall system. It brings significant benefits regarding flexibility, resilience, and development speed. However, it also introduces challenges related to communication, consistency, and potential overhead. Careful design, clear service contracts, and a focus on well-defined boundaries are key to harnessing the full potential of autonomous microservices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Single Responsibility
&lt;/h2&gt;

&lt;p&gt;The Single Responsibility Principle (SRP) is a guiding concept dictating that each service should be responsible for a single piece of functionality or a single aspect of a system’s business logic.&lt;/p&gt;

&lt;p&gt;A microservice following SRP should have one, and only one, reason to change. This means it should focus on a single business capability or function. The service’s responsibilities are well-defined, and it does not overlap with or bleed into the functionalities of other services.&lt;/p&gt;

&lt;p&gt;Properly defining what each service should and should not do is crucial. This often involves identifying domain boundaries, which can be guided by practices like Domain-Driven Design (DDD).&lt;/p&gt;

&lt;p&gt;However, it's important to note that while services should be focused, overly granular services can lead to unnecessary complexity. Finding the right balance between service size and responsibility is key.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exclusive State
&lt;/h2&gt;

&lt;p&gt;Exclusive State emphasizes the importance of each microservice managing its own data independently.&lt;/p&gt;

&lt;p&gt;With exclusive state, each microservice owns and controls its own database or state: no other microservice has direct access to this data. Each service manages its own data schema and storage mechanisms, which may differ from those of other services. Data sharing or synchronization between microservices, if necessary, is achieved through API calls, event streaming, or message brokers, maintaining data encapsulation.&lt;/p&gt;
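&lt;p&gt;To make the idea concrete, here is a toy sketch in plain Java. It is an analogy only: real services would sit behind HTTP or messaging APIs, and the service names below are made up for illustration. Each "service" keeps its store private and exposes data only through its own interface.&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of exclusive state: each "service" keeps its data store private
// and exposes it only through a well-defined interface, which stands in for
// an HTTP or messaging API in a real system.
public class ExclusiveStateDemo {

    static class CustomerService {
        // Private store: no other service can reach this map directly.
        private final Map db = new HashMap();

        void create(String id, String name) { db.put(id, name); }

        // The only way other services see customer data.
        String getName(String id) { return (String) db.get(id); }
    }

    static class InvoiceService {
        private final CustomerService customers; // depends on the API, not the DB

        InvoiceService(CustomerService customers) { this.customers = customers; }

        String invoiceHeader(String customerId) {
            return "Invoice for " + customers.getName(customerId);
        }
    }

    public static void main(String[] args) {
        CustomerService customers = new CustomerService();
        customers.create("c1", "Ada");
        InvoiceService invoices = new InvoiceService(customers);
        System.out.println(invoices.invoiceHeader("c1")); // prints "Invoice for Ada"
    }
}
```

&lt;p&gt;Swapping &lt;code&gt;CustomerService&lt;/code&gt;'s map for a different storage engine would not affect &lt;code&gt;InvoiceService&lt;/code&gt; at all, which is exactly the point of exclusive state.&lt;/p&gt;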

&lt;p&gt;By owning its data, each service ensures the integrity and consistency of the data it manages. Besides that, different services can scale their data storage and processing capabilities independently based on their specific requirements. Moreover, with exclusive state, the failure of one service’s data store does not directly impact other services, enhancing the system’s overall resilience.&lt;/p&gt;

&lt;p&gt;However, it comes with a price. Transactions and operations that span multiple services become more complex, as they require coordination across independent data stores. Also, managing separate databases or state stores for each service can increase infrastructure complexity and costs.&lt;/p&gt;

&lt;p&gt;In a nutshell, Exclusive State ensures that each service is self-contained in terms of its data, contributing to the overall robustness and scalability of the system. However, it introduces challenges in terms of data management, particularly when dealing with operations that span multiple services. Effective implementation of this principle requires thoughtful system design and a clear understanding of the trade-offs involved in managing data within a distributed environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Async Message Passing
&lt;/h2&gt;

&lt;p&gt;In asynchronous message passing, a microservice sends a message (a request, data, notification) to another service without waiting for an immediate response. The sending service continues its operation and can handle the response at a later point in time. This often involves an event-driven architecture where services react to events and communicate changes through messages.&lt;/p&gt;

&lt;p&gt;Systems implement asynchronous communication using technologies like message queues (e.g., RabbitMQ, Kafka), which store messages until they are processed by the receiving service. Services then notify other parts of the system about changes or updates through events rather than direct calls or requests.&lt;/p&gt;
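&lt;p&gt;As a minimal sketch of this pattern, the example below uses an in-memory &lt;code&gt;BlockingQueue&lt;/code&gt; to stand in for a broker like RabbitMQ or Kafka: the producer enqueues events and moves on, while a consumer thread handles them later. The names and events are hypothetical.&lt;/p&gt;

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;

// Toy illustration of async message passing: an in-memory queue stands in for
// a broker. The producer enqueues events and continues immediately; a consumer
// thread picks them up and processes them later.
public class AsyncMessagingDemo {
    static final BlockingQueue queue = new ArrayBlockingQueue(16);
    static final List processed = new CopyOnWriteArrayList();

    // "Publishing" returns as soon as the event is on the queue.
    static void publish(String event) throws InterruptedException {
        queue.put(event);
    }

    public static void main(String[] args) throws Exception {
        Thread consumer = new Thread(() -> {
            try {
                // A real consumer loops forever; here we handle two events.
                processed.add("handled:" + queue.take());
                processed.add("handled:" + queue.take());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        publish("order-created"); // producer does not wait for handling
        publish("order-paid");
        consumer.join();

        System.out.println(processed); // prints [handled:order-created, handled:order-paid]
    }
}
```

&lt;p&gt;Note that &lt;code&gt;publish&lt;/code&gt; returns as soon as the event is enqueued; the producer never blocks on the consumer’s processing, which is what decouples the two services.&lt;/p&gt;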

&lt;p&gt;Benefits include services that are not tightly coupled to each other’s processes, leading to a more resilient system architecture. As services don’t wait for responses, they can handle more requests and scale better under load. Also, temporary failures in one service don’t immediately impact others, as messages can be retried or delayed.&lt;/p&gt;

&lt;p&gt;However, ensuring reliable delivery and processing of messages can be complex, especially in a distributed system. While decoupling services, asynchronous communication can introduce delays in processing, which might not be suitable for time-sensitive operations. Moreover, tracing a request’s path and debugging issues can be more challenging in an asynchronous setup.&lt;/p&gt;

&lt;p&gt;To summarize, Async Message Passing enables microservices to communicate in a decoupled, efficient, and resilient manner, which is particularly beneficial in distributed and scalable systems. However, it introduces complexities in managing and monitoring message flows and requires careful design to ensure consistency and reliability. Embracing this principle often involves a shift towards an event-driven architecture, which brings its own considerations in system design and operation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Location Transparency
&lt;/h2&gt;

&lt;p&gt;In a system with location transparency, microservices are designed and operated without other services needing to know their specific physical location (e.g., an IP address). Services communicate with each other based on logical identifiers rather than physical network addresses. This often involves mechanisms for dynamic service discovery, where services can find and communicate with each other through a registry or directory service, regardless of where they are deployed.&lt;/p&gt;

&lt;p&gt;Tools like Kubernetes or service meshes provide a dynamic registry where services register themselves. Other services use this registry to discover and communicate with them. Location transparency allows for intelligent load balancing and rerouting of requests in case of service failures, enhancing the system’s fault tolerance.&lt;/p&gt;
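&lt;p&gt;A toy registry can illustrate the idea: callers ask for a service by its logical name, and the registry picks one of the currently registered instances. This is a simplified sketch with made-up names; in practice Kubernetes DNS, a service mesh, or a dedicated registry like Consul or Eureka does this work.&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy service registry: callers address a service by its logical name, and
// the registry resolves that name to one of the registered instances using
// naive round-robin load balancing.
public class RegistryDemo {
    static final Map instances = new HashMap();
    static int counter = 0;

    static void register(String service, String address) {
        instances.computeIfAbsent(service, k -> new ArrayList());
        ((List) instances.get(service)).add(address);
    }

    static String resolve(String service) {
        List addrs = (List) instances.get(service);
        if (addrs == null || addrs.isEmpty()) {
            throw new IllegalStateException("no instance of " + service);
        }
        // Rotate through the registered instances.
        String addr = (String) addrs.get(counter % addrs.size());
        counter++;
        return addr;
    }

    public static void main(String[] args) {
        register("orders", "10.0.0.1:8080");
        register("orders", "10.0.0.2:8080");
        // The caller only ever knows the logical name "orders".
        System.out.println(resolve("orders")); // prints 10.0.0.1:8080
        System.out.println(resolve("orders")); // prints 10.0.0.2:8080
    }
}
```

&lt;p&gt;Because the caller depends only on the logical name, instances can be moved, scaled, or replaced without touching any calling code.&lt;/p&gt;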

&lt;p&gt;This approach allows services to be easily scaled up or down, moved, or replicated across different servers or clusters without impacting the system’s operation. Besides that, services can be deployed on various platforms (on-premises, cloud, hybrid) without affecting their interaction with other services. And the system can automatically handle service failures by rerouting requests to other instances or locations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Event-Driven Microservices Training
&lt;/h2&gt;

&lt;p&gt;The “Event-Driven Microservices Training” by Urs Peter is open to all interested in enhancing their knowledge of Event-Driven Architecture, and I highly recommend it.&lt;/p&gt;

&lt;p&gt;This two-day, in-person course is conducted in the Netherlands. For upcoming dates and pricing details, visit &lt;a href="https://xebia.com/academy/nl/training/event-driven-microservices-training/" rel="noopener noreferrer"&gt;Event-Driven Microservices Training&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stay Curious!
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Contribute
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Writing takes time and effort.&lt;/em&gt;&lt;/strong&gt; I love writing and sharing knowledge, but I also have bills to pay. If you like my work, please, &lt;strong&gt;consider donating through Buy Me a Coffee: &lt;a href="https://www.buymeacoffee.com/RaphaelDeLio" rel="noopener noreferrer"&gt;https://www.buymeacoffee.com/RaphaelDeLio&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Or by sending me BitCoin: 1HjG7pmghg3Z8RATH4aiUWr156BGafJ6Zw&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Follow Me on Social Media
&lt;/h2&gt;

&lt;p&gt;Stay connected and dive deeper into the world of Software Architecture with me! Follow my journey across all major social platforms for exclusive content, tips, and discussions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/raphaeldelio" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; | &lt;a href="https://www.linkedin.com/in/raphaeldelio/?originalSubdomain=nl" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://www.youtube.com/@raphaeldelio" rel="noopener noreferrer"&gt;YouTube&lt;/a&gt; | &lt;a href="https://www.instagram.com/raphaeldelio/" rel="noopener noreferrer"&gt;Instagram&lt;/a&gt;&lt;/p&gt;

</description>
      <category>microservices</category>
      <category>architecture</category>
      <category>softwaredevelopment</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>What’s the Connection Between Leonardo Da Vinci, a Cup of Coffee in Lisbon, and the Nature of Software Development?</title>
      <dc:creator>Raphael De Lio</dc:creator>
      <pubDate>Fri, 27 Sep 2024 13:12:03 +0000</pubDate>
      <link>https://dev.to/raphaeldelio/whats-the-connection-between-leonardo-da-vinci-a-cup-of-coffee-in-lisbon-and-the-nature-of-software-development-4jlo</link>
      <guid>https://dev.to/raphaeldelio/whats-the-connection-between-leonardo-da-vinci-a-cup-of-coffee-in-lisbon-and-the-nature-of-software-development-4jlo</guid>
      <description>&lt;p&gt;In Walter Isaacson’s biography of Leonardo Da Vinci, he writes about an incident that occurred while Leonardo was painting one of his most famous works, “The Last Supper.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3vjq5ws18lsp3o9uqz1.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3vjq5ws18lsp3o9uqz1.jpeg" alt="The Last Supper (Leonardo)" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Isaacson describes how the Prior of the church that had commissioned the work became irritated with Leonardo’s procrastination and complained to Ludovico Sforza, the then Duke of Milan. He wanted Leonardo never to put down the brush, as if he were an employee working in his garden.&lt;/p&gt;

&lt;p&gt;When the artist was summoned by the Duke, the two ended up discussing how creativity manifests. Leonardo explained that sometimes you need to go slow, take breaks, and even procrastinate. This allows ideas to mature and intuition to be stimulated. Men of high intellect, he said to the duke, sometimes make their greatest advances when they work less, as their minds are occupied with their ideas and the refinement of concepts that will later take shape.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi558kviuz1z7j4pub7yc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi558kviuz1z7j4pub7yc.png" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This passage reminded me of a moment I experienced in a café in Lisbon in 2022. In one of our conversations, my colleague &lt;a href="https://www.linkedin.com/in/celomaluf/" rel="noopener noreferrer"&gt;Marcelo Maluf Teixeira&lt;/a&gt;, holding a cup in his hand, compared it to the ever-evolving nature of software.&lt;/p&gt;

&lt;p&gt;"A coffee cup is a finished product; once it is molded, baked, and painted, it is complete. There are no updates or revisions needed. In contrast, software is a dynamic entity, constantly in a state of development and improvement."&lt;/p&gt;

&lt;p&gt;As programmers, we often encounter the “Prior of the Church” mentality in our workplaces, represented by managers or executives who expect us to always be on standby, tirelessly typing to deliver software. They often see programming as a continuous production line, where the work is simply completing tasks one after the other. However, the reality of programming is that it is a creative and iterative process, where ‘active procrastination’ plays a crucial role.&lt;/p&gt;

&lt;p&gt;Procrastination, when understood as a period of reflection and incubation of ideas, is essential in the world of programming. It’s not about avoiding work, but recognizing that conscious breaks and periods of reflection are vital for innovation and creative problem-solving. In these moments, instead of incessantly writing code, we allow ourselves to absorb and contemplate the problem as a whole, often finding more effective and innovative solutions.&lt;/p&gt;

&lt;p&gt;Leonardo often took years to finish a painting, and in some cases, like the famous “Mona Lisa,” he continued to work and make changes until the end of his life. He was always experimenting with new techniques, like sfumato, a shading technique that creates a smooth transition between colors, giving an almost ethereal quality to his paintings.&lt;/p&gt;

&lt;p&gt;Leonardo also left us a tip on how to deal with that stubborn manager. He told the Prior that he still had two heads to paint, Christ’s and Judas’, and claimed he was having trouble finding a model for Judas and would use the Prior’s image if he continued to pester him. The Duke burst out laughing, saying Leonardo had thousands of reasons to do so. And the poor Prior was embarrassed and went back to taking care of his garden, leaving Leonardo in peace.&lt;/p&gt;

&lt;p&gt;Stay curious!&lt;/p&gt;

</description>
      <category>softwaredevelopment</category>
      <category>programming</category>
      <category>productivity</category>
      <category>mindfulness</category>
    </item>
  </channel>
</rss>
