DEV Community: Daniel Cordeiro

Building an AI Billing Assistant: Integrating LangChain ReAct Agents with Spring Boot Microservices

Daniel Cordeiro — Mon, 08 Jun 2026 13:16:51 +0000

Introduction

A significant portion of telecom customer support calls follow the same pattern: What is my current bill?, Why did it go up?, I want to dispute this charge. These are structured, predictable requests with clear resolution paths — exactly the kind of interaction that a well-designed AI agent can handle reliably, without a human in the loop.

This post presents the design and development of the Smart Billing Assistant, an AI-powered telecom customer support agent that puts this idea into practice. A Python FastAPI service hosts a LangChain ReAct agent that gives customers a natural language interface to five self-service flows: viewing invoices, understanding bill changes, filing disputes, requesting plan changes, and checking payment status. Under the hood, the agent orchestrates two independent Java Spring Boot microservices — a reactive billing service (Spring WebFlux + R2DBC) and a transactional provisioning service (Spring MVC + JPA) — backed by PostgreSQL. The development lifecycle was shaped by spec-driven requirements, test-driven implementation and Claude Code as an AI pair programmer.

1. The Project Overview

Note: The v1 term mentionated through this post is a demo version of the Smart Billing Assistant project, built for exploring purposes. Several design decisions throughout this document are explicitly simplified to keep scope manageable.

Why This Domain

Business Support Systems (BSS) are the software backbone of a telecom company: billing, invoicing, customer accounts, payments. They are high-volume, data-intensive, and historically painful for customer support teams. A large portion of inbound support calls are about bill questions and simple service changes — exactly the kind of structured, predictable request that an AI agent handles well.

What the Agent Can Do

Six user stories were implemented:

User Story	What the customer says	What happens
US-01	What is my current bill?	The agent retrieves the customer's current invoice, including the billing summary and the individual line items that make up the total.
US-02	Why is my bill so high?	The agent compares the current billing cycle against the prior one, identifies overages and one-time charges, and explains what drove the increase.
US-03	I want to dispute this charge	The agent opens a dispute ticket for the specified charge, returns a unique reference number, and informs the customer that resolution takes up to 5 business days.
US-04	Can I switch to a cheaper plan?	The agent verifies whether the customer's line is eligible for the requested plan and either applies the change immediately for upgrades or schedules it for the next billing cycle for downgrades.
US-05	Did you receive my payment?	The agent returns the status of the customer's most recent payment — whether it was received, pending, or failed — together with the timestamp of the last update.
US-06	Why was it so high? (follow-up)	The agent resolves the reference to it from the active session context and answers without asking the customer to re-identify themselves or repeat previous information.

The Architecture

The system was decomposed into three independently deployable services:

agent-service (Python/FastAPI + LangChain): The sole customer entry point. Owns JWT validation, conversational session state, LangChain ReAct orchestration, and escalation logic. Calls the Java services over synchronous REST.
billing-service (Java/Spring WebFlux): The source of truth for invoices, line items, payments, and disputes. Uses reactive R2DBC for non-blocking database access.
provisioning-service (Java/Spring MVC): The source of truth for plan catalogues, eligibility rules, and customer line configuration. Uses JPA/Hibernate with blocking I/O — plan changes are infrequent transactional writes.

In summary, this is a holistic view of the project idea and the architecture it produced, and the following sections narrow the process in detail.

2. The Development Process

2.1 Using Claude Code as a Pair Programmer

The entire project was built using Claude Code [1] — Anthropic's CLI-based AI coding agent — as an interactive pair programmer. Rather than treating it as a one-shot code generator, it was used as a persistent collaborator across 13 implementation sessions, each one driving a task from TDD flow (Red → Green → Refactor).

CLAUDE.md: The Project Instruction File

The key to making Claude Code useful across sessions is the CLAUDE.md file. This file lives in the project root and is automatically loaded by Claude Code at the start of every session. It acts as the project contract: what the system should do, what design principles to follow, what quality gates to enforce, and exactly what steps to execute after each task is completed.

Depending on the project, a CLAUDE.md might include:

Design principles: KISS, YAGNI, AHA, SOLID — as concrete enforcement rules (e.g. Don't add features beyond what was asked; No Redis in v1 — YAGNI).
Quality gates: ≥80% line coverage, SonarQube Maintainability Rating A, Cognitive Complexity ≤15 per method (≤10 for Python production code).
Task Completion Protocol: An 8-step automated loop — implement → run tests → local validation (for feat: tasks) → commit → open PR → monitor CI → fix failures → report green.
Git conventions: Conventional Commits required; Semantic Release handles versioning; no manual version bumps in pom.xml or pyproject.toml.

Why Start Simple and Evolve

A comprehensive CLAUDE.md was not written upfront. The project started with a minimal version — basic rules about TDD and commit conventions — and expanded it as new needs emerged during actual development. So, each CLAUDE.md addition was prompted by a real friction point, evolving the process incrementally, driven by actual need.

Why Document Artifacts Instead of Prompting

One of the highest-leverage decisions was treating PROJECT_IDEA.md, REQUIREMENTS.md, DESIGN.md, and TASKS.md as first-class project artifacts — files Claude Code reads directly rather than content repeated in every chat prompt.

Benefits:

Token efficiency: The context is loaded once from stable files, not repeated in every session.
Consistency: The same scope, terminology, and decisions are visible in every session.
Guardrails: Claude Code stays bounded by what is documented; speculative features don't creep in.
Memory: Session history notes in the docs capture every design decision — when returning to a task, the rationale is already there.

2.2 Defining the Requirements

Spec-Driven Development

Before writing a single line of code, REQUIREMENTS.md was produced: user stories with explicit acceptance scenarios for every state the system needs to handle. This approach is inspired by spec-driven development tools like Kiro [2] — requirements come first, and the code is tested against them.

To ground the requirements in real domain knowledge, Claude Code was asked to adopt the role of a Billing Manager stakeholder with expertise in BSS telecom. This simulated discovery session drove a structured discussion: what do customers actually call about? What are the edge cases? What is in scope and what crosses a line the support agent should not cross? The conversation surfaced business rules (the 90-day dispute window, the delinquency threshold, the OSS/BSS boundary) that would otherwise have been discovered late — during implementation or testing.

The format: each user story has an As a / I want / So that header, followed by numbered acceptance scenarios (S1, S2, S3...) that map directly to test cases and eventually to BDD Gherkin feature files.

Example (US-03 — Dispute a Charge):

US-03: As a customer, I want to dispute a charge on my invoice,
       so that I can get incorrect charges reviewed.

S1 — Valid dispute filed: Customer provides invoice ID and line item ID.
     System creates a dispute record and returns a reference number
     with a 5-business-day resolution SLA.

S2 — Duplicate dispute blocked: An open dispute already exists for
     that line item. System returns the existing reference number
     instead of creating a second dispute.

S3 — Outside 90-day window: The charge is more than 90 days old.
     System rejects the request and explains the eligibility window.

These scenarios drove every test: unit tests mocked the repository layer and asserted each branch; integration tests against Testcontainers PostgreSQL confirmed end-to-end behavior; BDD feature files in Cucumber (Java) and pytest-bdd (Python) verified full conversation flows.

GLOSSARY.md: Capturing the Domain

One output of the requirements session was GLOSSARY.md — a glossary of BSS telecom terms. Proration, CDR, dunning, delinquency, cold handoff — these terms appear in the code, in tests, and in the agent's responses. Having a shared glossary ensures that when the code says OUTSTANDING or the agent mentions 5-business-day SLA, it means the same thing to everyone reading it.

2.3 Defining the Design

LangChain, LangGraph, and the ReAct Pattern

LangChain [3] is a Python framework for building applications powered by LLMs. Its core concept is the tool: a Python function the LLM can call to take actions or retrieve information. The LLM decides which tool to call, what arguments to pass, and what to do with the result.

Each tool is a plain Python function decorated with @tool. The LLM reads the function's docstring to understand when and how to use it — no routing tables, no decision trees. LangChain handles the mechanics of formatting the tool call, parsing the LLM's response, and invoking the function.

ReAct [4] (Reason + Act) is the agent pattern used here. The LLM alternates between:

Reasoning: The customer asked about their bill. I should call get_current_invoice.
Acting: Call the tool, get the result.
Observing: The invoice shows a $45 data overage. The customer needs an explanation.
Reasoning again: Do I need more information, or can I answer now?

This loop runs entirely inside the agent-service. From the customer's perspective, they send a message and receive a reply. Inside, the LLM may have called two or three tools, observed intermediate results, and reasoned about each before generating the final response.

LangGraph adds state management to this loop. It is a graph-based runtime where nodes are functions (like call the LLM or execute a tool call) and edges control flow. Crucially, it provides a MemorySaver checkpointer that persists conversation history between turns — keyed by a thread_id. This is what gives the agent multi-turn memory.

The thread_id is the session UUID. On every POST /chat request, the agent loads the conversation history for that thread, runs the ReAct loop, and saves the updated state — including the new user message, any tool calls, and the final AI response. The customer's next message picks up exactly where the last one left off.

LangSmith traces every ReAct loop: which tools were called, what the LLM reasoned at each step, how long each operation took. Invaluable for debugging agent behavior.

Spring WebFlux + R2DBC and Spring MVC + JPA

Spring WebFlux + R2DBC powers the billing-service. WebFlux is Spring's reactive web framework [5]: instead of one blocking thread per HTTP request, it uses a small, fixed thread pool with event-loop-based I/O. Requests are represented as non-blocking streams (Mono<T> for a single value, Flux<T> for a sequence). R2DBC (Reactive Relational Database Connectivity) is the reactive counterpart to JDBC — database queries return Mono or Flux publishers rather than blocking the calling thread. This stack is a good fit for the billing-service because invoice lookups, comparison queries, and payment status checks are all read-heavy operations that hit the database frequently and concurrently.

Spring MVC + JPA powers the provisioning-service. Spring MVC is the classic, blocking web framework [6]: one thread per request, synchronous database calls through JPA/Hibernate. This is the choice for the provisioning-service because plan changes are low-frequency, write-heavy transactional operations — the simplicity of blocking code outweighs any throughput benefit from reactive streams. JPA's entity mapping and transaction management make the write path straightforward to reason about and test.

Prometheus + Grafana Observability: Both Java services expose metrics through the Micrometer instrumentation library, which is included with Spring Boot Actuator. Prometheus scrapes the /actuator/prometheus endpoint on both services and stores the metrics as time-series data. Grafana connects to Prometheus as a datasource and visualises the data in dashboards.

Key Design Decisions

Decision	Chosen	Alternative	Reason
Language split	Python + Java	Monolith (either)	LangChain is Python-first; Spring WebFlux is battle-tested for high-volume BSS data. Each language serves the layer where it is strongest, and that domain alignment justifies the operational complexity of running two runtimes.
Agent pattern	LangChain ReAct	Structured routing	A structured router requires every customer intent to be hardcoded upfront. ReAct lets the LLM reason dynamically across multi-step billing queries.
Session state	LangGraph in-process	Redis	Redis adds a new infrastructure dependency, serialisation, and failure handling for no v1 benefit. Session loss on restart is accepted at this scale. Redis is the natural v2 step when horizontal scaling is needed.
Inter-service	REST (sync)	Message queues	The customer waits in real time — async messaging would require correlation IDs and timeout handling for an inherently synchronous interaction. REST gives predictable latency and simple error propagation.
Disputes	Flag-only	Auto-reversal	Auto-reversal requires a Revenue Assurance approval workflow that is outside the agent's authority and out of v1 scope. The agent captures the claim and issues a reference number; the reversal decision stays with a human reviewer.
JWT validation	agent-service boundary	API Gateway	JWT validation was placed at the FastAPI boundary rather than in a dedicated API Gateway because the agent-service is the only external-facing service in v1. Single external-facing service makes a dedicated gateway YAGNI for v1.
Database type	PostgreSQL for billing and provisioning services	MongoDB for provisioning	Eligibility checks depend on structured SQL queries across `plans` and `customer_lines`. The array columns and event-log pattern in provisioning create minor relational friction but do not outweigh the operational cost of running a second database technology at v1 scale. MongoDB is the natural revisit if the plan catalogue grows to support dozens of configurable attributes per tier.

2.4 Executing the Tasks

TASKS.md: From Design to Executable Work

The final design output was TASKS.md: 13 tasks, each decomposed into sub-tasks, each sub-task referencing specific user story scenarios. TDD within each task followed the same rhythm: write a failing unit test, implement just enough to pass, write a failing integration test, implement until it passes, then refactor.

The Three Services and Their Roles: Putting It All Together

agent-service is the sole customer-facing entry point. It validates JWT tokens, creates and manages conversational sessions, and runs the LangChain ReAct agent loop. When a customer sends a message, the agent reasons about intent, calls whichever billing or provisioning tool is needed, observes the result, and formulates a natural language response.
billing-service is the source of truth for all financial data. It owns invoices and their line items, payment records, and dispute tickets. It is the only service that knows whether a customer's account is active or suspended, and whether their balance is overdue. Its reactive stack (Project Reactor + R2DBC) handles high read volumes — invoice queries, comparison lookups, payment status checks — without blocking server threads.
provisioning-service owns the plan catalogue, eligibility rules, and each customer's current line configuration. It decides which plans a customer can switch to (based on network capability flags and regional availability) and applies or schedules the resulting plan change.

3. Conclusion

Key Takeaways

Each language in its own domain. Python is the natural home for LangChain. Java is proven for high-volume transactional services. Combining them means the agent layer and the data layer each run in the ecosystem where they are best supported.
Spec-driven development pays off at test time. Producing REQUIREMENTS.md with explicit acceptance scenarios before touching the code makes test-writing focused. Every scenario has a name, a precondition, and an expected outcome.
ReAct agents are powerful but need guardrails. The ReAct loop gives the LLM significant autonomy — it decides which tool to call, in what order, and when to stop. For billing queries, this flexibility is valuable: a customer asking why did my bill change? may require the agent to call two tools and reason about both results before answering — a flow that would be brittle to hardcode. But autonomy introduces risk: the LLM could call a write tool (like file_dispute) when the customer only asked a question. The tool closure pattern (capturing customer_id in the closure, returning status-keyed dicts instead of raising exceptions) and the clear separation of read and write tools are the guardrails that keep the agent predictable. LangSmith traces make any misbehavior visible and debuggable.
Load the customer's data once at session start. Fetching the customer's invoice and eligible plans at session creation makes every follow-up question in the conversation instantaneous and natural. The customer does not need to repeat their account details on every follow-up message, because the agent already has the relevant data loaded from the moment the session opened.
CLAUDE.md as a living document. The instruction file that guides an AI pair programmer grows alongside the project. Each friction point or new decision is an opportunity to add a rule that prevents the same issue from recurring.

Trade-offs and Limitations

Session loss on restart: LangGraph's MemorySaver stores all conversation state in the agent-service process memory. When the process restarts — during a deployment, a crash, or a container restart — every active session is immediately lost. A customer mid-conversation would receive a session not found error and have to start over from scratch. For v1, where there is a single process and restarts are infrequent, this is acceptable. In a multi-instance production setup, it becomes a hard blocker: a session created on instance A would not be visible to instance B, making load balancing impossible without sticky sessions. Redis would solve this by persisting each conversation turn to a shared external store, making sessions portable across instances and restart-safe.
JWT validation at the application boundary: In a proper production microservices architecture, JWT validation is the responsibility of an API Gateway (Kong, AWS API Gateway, nginx with an auth module, etc.). The gateway validates the token, extracts verified claims, and forwards them as trusted headers (e.g., X-Customer-Id) to services behind it. In v1, this responsibility was placed directly in the FastAPI boundary because the agent-service is the only external-facing service — it effectively acts as its own gateway, making a dedicated one YAGNI. The problem surfaces when the system grows: a second external-facing service would need to duplicate the same JWT logic, and rotating the signing key or changing the token format would require updating every service that validates tokens.
Delinquency as synchronous cross-service call: When a customer requests a plan change, the provisioning-service makes a blocking HTTP call to the billing-service to check whether the account has an overdue balance. This creates a direct runtime dependency between the two services: if the billing-service is slow or temporarily unavailable, the provisioning plan change endpoint is also degraded — even though plan logic has nothing to do with invoice processing. For v1 with a small user base, this coupling is manageable. At higher volume, the right approach is an event-driven model: the billing-service publishes an account status event when delinquency is detected, and the provisioning-service maintains a local read-model of account statuses updated from those events. Plan change requests then query the local cache — no synchronous cross-service call needed at runtime.
Single PostgreSQL instance: The billing and provisioning schemas both live inside one PostgreSQL container in Docker Compose. While the application code enforces strict schema separation (no cross-schema queries, no shared tables), they share the same database process, disk I/O, connection pool, and resource limits. A slow billing query can starve provisioning reads. In production, each service should own its own PostgreSQL instance — separate containers, separate data volumes, potentially separate machines — so that they can be tuned, backed up, scaled, and failed over independently.
Relational database for both services: The billing-service is unambiguous — financial records, ACID guarantees, complex aggregation queries across billing cycles map naturally to relational. The provisioning-service is less clear-cut: the plans table uses array columns (regions[], network_flags) and plan_changes is effectively an event log — both patterns that a document store handles more naturally. For v1, eligibility checks benefit from structured SQL queries, KISS applies, and the operational cost of introducing a second database technology outweighs the schema flexibility it would bring at this scale. If the plan catalogue grows to support dozens of configurable attributes per tier, MongoDB would be the natural migration path for the provisioning-service's plan data.

Source code: github.com/dancodingbr/smart-billing-assistant

References

[1] Claude Code — Anthropic's CLI-based AI coding agent. Available at: https://github.com/anthropics/claude-code

[2] Kiro — AWS AI-powered IDE built around spec-driven development. Available at: https://kiro.dev

[3] LangChain — Framework for building LLM-powered applications. Available at: https://python.langchain.com

[4] ReAct: Synergizing Reasoning and Acting in Language Models — Yao et al., 2022. Available at: https://arxiv.org/abs/2210.03629

[5] Spring WebFlux — Reactive web framework built on Project Reactor. Available at: https://docs.spring.io/spring-framework/reference/web/webflux.html

[6] Spring Boot — Convention-over-configuration framework for Java microservices. Available at: https://spring.io/projects/spring-boot

How Terraform and Helm Split Responsibilities in a Kubernetes CI/CD Pipeline

Daniel Cordeiro — Tue, 02 Jun 2026 14:36:19 +0000

Introduction

Anyone working with Kubernetes for a while will likely face a version of the same question: should Kubernetes resources be managed through Terraform, or through something else?

In addition to provisioning the cluster, Terraform has a mature kubernetes provider that exposes namespaces, Deployments, StatefulSets, and ConfigMaps as first-class resources. Everything can live in the same state file, the same repository, and the same workflow. For teams that already operate Terraform for infrastructure, the case for extending that coverage to application workloads is genuinely strong.

The problem is that infrastructure and application workloads have fundamentally different change rates. A Kubernetes namespace or a monitoring stack might remain untouched for months. A container image changes with every commit. This is the gap Helm was designed to fill: a dedicated tool for packaging, versioning, and deploying application workloads into Kubernetes, with parameterized overrides, release history, and rollback built in.

To make this concrete, this article documents a hands-on project named Personal Blog that explores both scenarios, and how it evolved from the monolithic Stage 2 — where Terraform managed both infrastructure provisioning and application deployment — to the decoupled Stage 3 - a clean separation of concerns where Terraform owns the platform layer and Helm owns the application layer, orchestrated by a GitLab CI/CD pipeline.

The Temptation of "Terraform Everything"

In Stage 2 of the personal blog CI/CD pipeline, a decision was made that seemed reasonable at the time: use Terraform for everything. Not just the cluster — everything. The Kubernetes namespace, MongoDB StatefulSet, backend and frontend Deployments, Services, ConfigMaps for Prometheus and Grafana, the entire monitoring stack. One tool, one state file, one workflow.

Here's what the terraform apply workflow covered in Stage 2:

Provisioning a Kind (Kubernetes in Docker) cluster with the tehcyx/kind provider;
Generating kubeconfig and loading local Docker images into the cluster registry using a Provisioner;
Creating Kubernetes namespace with labels;
Deploying MongoDB as a kubernetes_stateful_set with a PersistentVolumeClaim;
Deploying Spring Boot backend: kubernetes_deployment with 2 replicas, resource limits (500m CPU / 512Mi), NodePort service;
Deploying Angular frontend: kubernetes_deployment with 2 replicas, resource limits (300m CPU / 256Mi), NodePort service;
Deploying Prometheus, Loki, Promtail DaemonSet, and Grafana with pre-configured datasources and dashboards via kubernetes_config_map;

All of this in a single local_kubernetes.tf file — over 1,000 lines.

Look closely at what those 1,000 lines actually contain: Kubernetes Deployments, StatefulSets, Services, ConfigMaps, and DaemonSets — written in HCL syntax instead of YAML. In other words, the file is a collection of implicit Kubernetes manifests embedded inside Terraform. Every resource that would normally be a few lines of YAML becomes a deeply nested HCL block. Each one is hardcoded — image tags, replica counts, resource limits, service ports, environment variables all written inline, duplicated wherever they appear, and entirely specific to one environment.

This is precisely a gap that Helm can fit to close. Without it, you end up with exactly what Stage 2 produced: large, unmaintainable manifests with no reusability across environments, no parameterization strategy, and no mechanism to version or roll back the application as a deployable unit.

It ran. But every time the application changed — a new Docker image, a backend configuration update — the only way to deploy was terraform apply. Which means every code change ran through a tool designed for infrastructure provisioning, not application delivery.

Why Terraform Can Manage Kubernetes Resources at All

Terraform is built around a provider plugin model [1]. A provider is a plugin that translates Terraform resource definitions into API calls for a specific platform. HashiCorp publishes official providers for AWS, Azure, GCP, and Kubernetes. Third parties publish providers for everything else — in this project, the tehcyx/kind provider was used to provision the Kind cluster itself.

The hashicorp/kubernetes provider [2] is what makes managing Kubernetes resources declaratively with Terraform possible. It exposes Kubernetes API objects — Namespaces, Deployments, Services, ConfigMaps, StatefulSets, Secrets, RBAC resources — as first-class Terraform resources [3]. Under the hood, it authenticates against the Kubernetes API server using the configured kubeconfig and issues the equivalent of kubectl apply calls, but managed through Terraform's state model.

terraform {
  required_providers {
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "2.38.0"
    }
  }
}

provider "kubernetes" {
  config_path = pathexpand(var.kubeconfig_path)
}

From that point on, any Kubernetes object can be declared as a Terraform resource:

resource "kubernetes_deployment" "backend" {
  metadata {
    name      = "personal-blog-backend"
    namespace = kubernetes_namespace.app.metadata[0].name
  }
  spec {
    replicas = 2
    selector {
      match_labels = { app = "personal-blog-backend" }
    }
    template {
      metadata {
        labels = { app = "personal-blog-backend" }
      }
      spec {
        container {
          name  = "backend"
          image = "dancodingbr/personal-blog-backend:latest"
          resources {
            limits   = { cpu = "500m", memory = "512Mi" }
            requests = { cpu = "250m", memory = "256Mi" }
          }
        }
      }
    }
  }
}

This means that Terraform can manage the entire lifecycle of a Kubernetes cluster and the workloads running inside it, from a single configuration and state file.

The Core Problem: Two Different Change Rates

Terraform is designed around the assumption that infrastructure changes infrequently. You create a VPC, a database cluster, a Kubernetes namespace — and those things persist for months or years. Terraform's state model, plan/apply cycle, and provider ecosystem are all optimized for this cadence.

Application deployments have a completely different change rate. A development team might deploy multiple times per day. The image tag changes with every commit. Replica counts get tuned. Environment variables get updated. These are not infrastructure events — they are application lifecycle events.

When you run terraform apply to update an image tag, you're running the full plan/apply cycle — provider initialization, state refresh, dependency graph evaluation — just to change one line in a Deployment spec. It's slow, it carries the risk of unintentional side-effects on other resources in the state file, and it conflates two fundamentally different concerns.

The Solution: Separation of Concerns

Stage 3 of the project refactors this cleanly. Terraform keeps exactly two responsibilities:

Kubernetes namespace creation.
Helm releases [4] for platform-level services: MongoDB and the monitoring stack (Prometheus, Loki, Promtail, Grafana).

# terraform/helm/kubernetes.tf
resource "kubernetes_namespace" "personal_blog_namespace" {
  metadata {
    name   = var.app_namespace
    labels = var.app_labels
  }
}

# terraform/helm/helm.tf
resource "helm_release" "mongodb" {
  name      = "mongodb-release"
  namespace = var.app_namespace
  chart     = "${path.module}/../../charts/mongodb"
  depends_on = [kubernetes_namespace.personal_blog_namespace]
}

resource "helm_release" "monitoring" {
  name      = "monitoring-release"
  namespace = var.app_namespace
  chart     = "${path.module}/../../charts/monitoring"
  depends_on = [kubernetes_namespace.personal_blog_namespace]
}

The frontend and backend are removed from Terraform entirely. They become Helm chart releases managed by the GitLab CI pipeline — deployed with helm upgrade --install and a dynamic image tag override:

helm upgrade --install personal-blog-backend-release \
  ./charts/personal-blog-backend \
  --set image.repository="dancodingbr/personal-blog-backend" \
  --set image.tag="$CI_COMMIT_SHORT_SHA" \
  --namespace personal-blog-app-dev \
  --wait --timeout 90s

In this way, Terraform runs infrequently to provision the platform. Helm runs on every deploy to update the application.

The Result: A Clean Responsibility Matrix

Concern	Tool	Change frequency
Kubernetes namespace	Terraform	Rarely
MongoDB deployment	Terraform + Helm	Rarely
Monitoring stack	Terraform + Helm	Rarely
Backend deployment	Helm (via GitLab CI)	Every commit
Frontend deployment	Helm (via GitLab CI)	Every commit

What Helm Gives You For App Deployments

Kubernetes manifests can become large and repetitive, hard to parameterize across environments, and difficult to version and reuse. A simple application — a Deployment, a Service, a ConfigMap — might span dozens of YAML files, each needing different values for dev, staging, and production. Helm's answer was Charts: reusable, templated application packages with a clean variable model.

The 1,000-line local_kubernetes.tf from Stage 2 is a perfect illustration of what happens when that problem goes unsolved. Every kubernetes_deployment, kubernetes_service, and kubernetes_config_map block in that file is a hardcoded, single-environment, non-reusable Terraform approximation of a Kubernetes manifest. Helm would have expressed the same application as a handful of templated files and a single values.yaml.

The capabilities Helm [5] introduces to solve this are exactly the ones Terraform's Kubernetes provider lacks:

1. Chart packaging and environment parameterization

A Helm chart separates what to deploy (the templates) from how to configure it (the values). The same chart deploys to dev, staging, and production by swapping the values file — no duplication, no environment-specific HCL blocks.

2. Parameterized overrides at deploy time

Helm's values.yaml + --set overrides are designed for the pattern of "here's a base config, override what changes per deploy." In CI/CD, this is how image tags are injected at runtime without modifying source files.

3. Release management

helm list gives you a clean view of what version of each chart is deployed, in which namespace, with which revision. Rolling back is helm rollback <release> <revision>.

4. Immutable, traceable deployments

Using --set image.tag="$CI_COMMIT_SHORT_SHA" means every deployment is tied to a specific Git commit. Combined with helm history, you have a full audit trail.

Key Takeaways

One goal of DevOps is not to minimize the number of tools. It is to give each concern the tool that fits it best. So, when designing a Kubernetes-based deployment pipeline from scratch, the following separation of concerns is suggested:

Terraform for: namespaces, persistent infrastructure services (databases, message queues), RBAC, cluster-level resources.
Helm for: application workloads — anything that has a Deployment, scales independently, and is updated frequently.
CI/CD tool for: sequencing them correctly — run Terraform first (idempotently), then run Helm for the changed application.

Source code: gitlab.com/dancodingbr/personal-blog.

References

[1] HashiCorp. Providers — Terraform Language. HashiCorp Developer.
https://developer.hashicorp.com/terraform/language/providers

[2] HashiCorp. Kubernetes Provider — Terraform Registry.
https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs

[3] HashiCorp. Manage Kubernetes resources with Terraform. HashiCorp Developer.
https://developer.hashicorp.com/terraform/tutorials/kubernetes/kubernetes-provider

[4] HashiCorp. hashicorp/helm — Terraform Registry.
https://registry.terraform.io/providers/hashicorp/helm/latest

[5] Helm Project. Using Helm.
https://helm.sh/docs/intro/using_helm

Polyglot Persistence in Microservices: Let the Domain Choose the Database

Daniel Cordeiro — Sat, 23 May 2026 15:01:34 +0000

Introduction

One of the most consequential decisions in microservices architecture is data storage. Monolithic systems traditionally rely on a single relational database to service all needs — a model that worked well for decades but creates tight coupling, limits scalability, and forces every domain to conform to the same persistence paradigm regardless of whether it is the right fit.

Modern distributed systems have embraced a concept known as polyglot persistence — the practice of using different data storage technologies within the same system, each chosen to match the access patterns and characteristics of the domain it serves. A MVP e-commerce project examined in this document demonstrates this pattern in a concrete way: three different databases, each serving a distinct microservice, each chosen deliberately.

The Three-Database Architecture

The platform studied here organizes data across three specialized stores:

Service	Database	Type	Rationale
Order Service	PostgreSQL	Relational (ACID)	Transactional consistency, financial data
Product Service	MongoDB	Document (NoSQL)	Flexible schemas, rich catalog data
Cart Service	Redis	In-memory K/V	Sub-millisecond speed, ephemeral state

This is the Database per Service pattern [1]. Each service owns its database exclusively — no service reads directly from another's store. This boundary enforces loose coupling and allows each team to evolve the schema independently without risk of cross-service breakage.

PostgreSQL for the Order Service: ACID as a Requirement

A relational database organizes data into tables — structured grids where every row is a record and every column is a typed, constrained attribute. Relationships between tables are expressed through foreign keys: a column in one table that references the primary key of another, letting the engine enforce referential integrity automatically. This rigid schema is not a limitation but a deliberate guarantee: every row must conform to the same structure, and the engine validates constraints at write time. The payoff is ACID — the ability to group multiple writes into a single all-or-nothing transaction that either commits fully or rolls back completely, leaving the database in a consistent state regardless of failures.

The Order Service persists financial records. An order is not just data — it is a legal artifact, a commitment. This makes ACID guarantees non-negotiable.

The service uses Spring Data JPA with Flyway for schema migrations. The schema reflects classical relational design: parent orders table with a child order_items table linked by a foreign key with ON DELETE CASCADE.

CREATE TABLE orders (
    id BIGSERIAL PRIMARY KEY,
    user_id VARCHAR(255) NOT NULL,
    total DECIMAL(19, 2) NOT NULL,
    status VARCHAR(50) NOT NULL,
    order_date TIMESTAMP NOT NULL
);

CREATE TABLE order_items (
    id BIGSERIAL PRIMARY KEY,
    order_id BIGINT NOT NULL,
    product_id VARCHAR(255) NOT NULL,
    price DECIMAL(19, 2) NOT NULL,
    quantity INTEGER NOT NULL,
    CONSTRAINT fk_order FOREIGN KEY (order_id) REFERENCES orders (id) ON DELETE CASCADE
);

The OrderService.placeOrder() method is annotated with @Transactional. This ensures that if any step in the checkout flow fails — building the item list, calculating the total, persisting the record — the database rolls back to a consistent state. The JPA cascade configuration ensures that saving the parent Order entity also persists all child OrderItem entities in a single atomic operation.

Flyway provides versioned, reproducible migration scripts. On startup the service validates that the running schema matches the expected baseline, preventing "works on my machine" drift between environments [2].

MongoDB for the Product Service: Schema Flexibility at Catalog Scale

A document database stores data as self-describing records — typically JSON or BSON objects — where each document can carry a different set of fields. There is no enforced column list; a document simply contains whatever the application writes into it. Documents that represent the same concept live in a collection, but the engine does not require them to be structurally identical. This makes document databases well-suited to domains where the data model is heterogeneous.

Products have heterogeneous attributes: a laptop has RAM and storage, a t-shirt has size and color, a book has an ISBN and author. Fitting all of these into rigid relational columns requires either complex EAV (Entity-Attribute-Value) schemes or sparse nullable columns — both are maintenance burdens.

MongoDB's document model stores each product as a self-describing JSON document. When the catalog team needs to add a new attribute category, no schema migration is required. The application code simply begins writing the new field, and existing documents remain valid.

The Product Service uses Spring Data MongoDB with repository abstraction:

@Document(collection = "products")
public class Product {
    @Id
    private String id;
    private String name;
    private String description;
    private BigDecimal price;
    private Integer stockQuantity;
    private String skuCode;
    private String category;
}

The @Document annotation maps the Java class to a MongoDB collection. Spring Data's MongoRepository provides CRUD operations and dynamic query derivation without boilerplate SQL.

Redis for the Cart Service: Ephemeral State at Memory Speed

A key-value store is the simplest of all database models: every entry is a pair of a unique key and an associated value, with no enforced structure beyond that. There is no schema, no query language, and no relational machinery — retrieval is always by key, and the engine does nothing more than store and fetch the associated value as fast as possible. That simplicity is what makes key-value stores fast: without the overhead of parsing queries, enforcing constraints, or managing transaction logs, the engine can serve reads and writes at memory speed.

A shopping cart is session-like: it changes frequently, needs sub-millisecond read/write response times, and is inherently transient — if a cart is lost, the customer can simply re-add items. These characteristics make a relational database an inappropriate choice (too much transactional overhead for short-lived state) and a document database acceptable but not optimal.

Redis was designed precisely for this use case. As an in-memory data structure store, it delivers microsecond latency for key-value operations [3]. The Cart Service models cart data as a Redis Hash where the top-level key is cart:{userId} and the value is a JSON-serialized Cart object.

The custom RedisConfig configures a RedisTemplate with explicit serializers:

@Bean
public RedisTemplate<String, Object> redisTemplate(RedisConnectionFactory connectionFactory) {
    RedisTemplate<String, Object> template = new RedisTemplate<>();
    template.setConnectionFactory(connectionFactory);

    ObjectMapper mapper = new ObjectMapper();
    JacksonJsonRedisSerializer<Object> serializer = new JacksonJsonRedisSerializer<>(mapper, Object.class);

    template.setKeySerializer(new StringRedisSerializer());
    template.setValueSerializer(serializer);
    template.setHashKeySerializer(new StringRedisSerializer());
    template.setHashValueSerializer(serializer);

    return template;
}

This configuration ensures keys are stored as human-readable strings (cart:user123) while values are stored as JSON, which is both inspectable via Redis CLI and portable across service restarts.

The Trade-offs: What This Architecture Costs

Polyglot persistence is not free. The benefits in autonomy and performance come with real operational costs. For example:

No cross-service joins. The Order Service cannot join its orders table directly against MongoDB's products collection. In a monolith, a JOIN happens inside the database engine in microseconds with full transactional isolation. Across services, the equivalent operation is an HTTP round trip, which introduces variable latency and a dependency on network availability.
Eventual consistency. In the OrderService.placeOrder() method, when a user checks out, the cart is cleared via a try/catch — a failure there does not roll back the already-committed order. True cross-service transactions require the Saga pattern [4].
Operational overhead. Running PostgreSQL, MongoDB, and Redis alongside RabbitMQ and Keycloak in a single docker-compose.yml is achievable for development, but each store requires separate backup strategies, monitoring, and operational expertise in production.

Conclusion

The demo e-commerce platform described here demonstrates that polyglot persistence, when applied with intention, produces a system where each component operates at its natural best. PostgreSQL provides the ACID guarantees that financial records demand. MongoDB provides the flexibility that a diverse product catalog requires. Redis provides the speed that shopping cart interactions need.

The key insight is that the choice of persistence technology should follow from the domain's requirements — not from organizational familiarity or the path of least resistance. In practice, this means asking a different set of questions before reaching for a database. Is the data relational, or is it a self-describing document with variable structure? Is consistency a hard requirement, or can the system tolerate brief divergence in exchange for availability and speed? Is the data long-lived and auditable, or ephemeral by nature? Each of these questions points toward a different storage paradigm, and no single engine answers all of them optimally.

Source code: github.com/dancodingbr/ecommerce

References

[1] Microservices.io. Pattern: Database per service. Available at: https://microservices.io/patterns/data/database-per-service.html

[2] Flyway. Why database migrations. Available at: https://documentation.red-gate.com/fd/why-database-migrations-184127574.html

[3] Redis. Get Started. Available at: https://redis.io/docs/latest/get-started/

[4] Microservices.io. Pattern: Saga. Available at: https://microservices.io/patterns/data/saga.html