DEV Community: Tuấn Anh

Architecting 21-Service E-commerce with Golang & DDD

Tuấn Anh — Thu, 09 Jul 2026 13:35:30 +0000

Answer-first: Migrating an e-commerce monolith to 21+ distributed microservices using Golang & DDD. Explore Kratos architecture, Saga patterns, and race conditions.

Scaling an e-commerce platform past 10,000 orders per day, each containing multiple SKUs spread across dynamic warehouses, is where naive architecture breaks down. Adding hardware stops being a magic bullet once distributed transactions, race conditions, and eventual consistency are involved.

In this deep tech dive, we will tear apart the "Hello World" abstraction of Microservices. We will look at exactly how our 21-service distributed ecosystem interacts under the hood. I will share the exact Golang architectural patterns (Kratos), the choreography-based Saga for distributed checkout, and how we handle race conditions under severe load.

1. The Distributed Landscape

Microservices without bounded contexts degenerate into a latency-heavy "Distributed Monolith". We drew our service boundaries around five core domains and enforced strict database-per-service isolation. (If you are just starting out, this is exactly why you might want to start with a Modular Monolith Architecture before jumping to distributed extraction.) The checkout path through those services looks like this:

graph TD
    API[API Gateway]
    API --> Checkout[Checkout Service]
    API --> Cart[Cart Service]

    subgraph "Dapr Event Mesh (Pub/Sub)"
        Checkout -- checkout.requested event --> Dapr[Redis / Dapr]
        Dapr --> Order[Order Service]
        Dapr --> Warehouse[Warehouse Service]
        Dapr --> Pricing[Pricing Service]
    end

    Warehouse -- inventory.reserved event --> Dapr
    Pricing -- pricing.validated event --> Dapr
    Order -- checkout.failed (Rollback) --> Dapr

The diagram above captures the most volatile flow: the Checkout Saga. When a user checks out, we can no longer open a single 4-table SQL transaction. Checkout must coordinate asynchronously with Pricing (to validate totals), Warehouse (to lock inventory), and Order (to generate the final aggregate).

2. Enforcing Clean Architecture with Kratos

To manage 21 separate codebases, consistency among the engineering team is mandatory. We utilized Kratos (v2) to strictly enforce Clean Architecture in Golang. (You can explore the full stack we use in our Microservices Tech Radar). By physically separating boundaries, we prevent database logic from bleeding into HTTP or gRPC handlers.

Here is what a standard Kratos blueprint looks like in our ecosystem:

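// internal/biz/order.go (Business Logic Layer)

// OrderRepo is the persistence port; the data layer implements it.
type OrderRepo interface {
    Save(ctx context.Context, o *Order) error
}

type OrderUsecase struct {
    repo OrderRepo
    log  *log.Helper
}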

func (uc *OrderUsecase) CreateOrder(ctx context.Context, o *Order) error {
    if o.TotalAmount <= 0 {
        return v1.ErrorInvalidAmount("order amount must be positive")
    }
    // Biz layer knows NOTHING about PostgreSQL or GORM
    return uc.repo.Save(ctx, o)
}

// internal/data/order.go (Data Persistence Layer)
type orderRepo struct {
    data *Data
    log  *log.Helper
}

// Implement the Biz interface
func (r *orderRepo) Save(ctx context.Context, o *biz.Order) error {
    // Database transactions safely isolated here
    return r.data.db.WithContext(ctx).Create(o).Error
}

We wire these layers together with Google Wire, which generates the dependency-injection code at compile time. This lets developers write unit tests against mocked repositories and keeps the business core insulated from transport protocols.
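To make the testing claim concrete, here is a minimal sketch of a use case exercised against an in-memory fake. It is deliberately simplified: there is no Kratos logger, no Wire, and `ErrInvalidAmount` stands in for the generated `v1.ErrorInvalidAmount`. The names `fakeRepo` and `createWithFake` are hypothetical and exist only for this illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Order is a simplified stand-in for the biz-layer aggregate.
type Order struct {
	ID          string
	TotalAmount int64 // minor units (cents); an assumption for this sketch
}

// OrderRepo is the port the biz layer depends on.
type OrderRepo interface {
	Save(ctx context.Context, o *Order) error
}

// ErrInvalidAmount stands in for the generated v1.ErrorInvalidAmount.
var ErrInvalidAmount = errors.New("order amount must be positive")

type OrderUsecase struct{ repo OrderRepo }

func NewOrderUsecase(repo OrderRepo) *OrderUsecase { return &OrderUsecase{repo: repo} }

func (uc *OrderUsecase) CreateOrder(ctx context.Context, o *Order) error {
	if o.TotalAmount <= 0 {
		return ErrInvalidAmount
	}
	return uc.repo.Save(ctx, o)
}

// fakeRepo records saved orders in memory instead of touching a database.
type fakeRepo struct{ saved []*Order }

func (f *fakeRepo) Save(_ context.Context, o *Order) error {
	f.saved = append(f.saved, o)
	return nil
}

// createWithFake runs one CreateOrder call against a fresh fake and reports
// how many orders reached the repository, plus the returned error.
func createWithFake(amount int64) (int, error) {
	repo := &fakeRepo{}
	uc := NewOrderUsecase(repo)
	err := uc.CreateOrder(context.Background(), &Order{ID: "o-1", TotalAmount: amount})
	return len(repo.saved), err
}

func main() {
	n, err := createWithFake(0)
	fmt.Println("saved:", n, "err:", err)
	n, err = createWithFake(1999)
	fmt.Println("saved:", n, "err:", err)
}
```

Because the use case only sees the `OrderRepo` interface, the same test would keep passing if the data layer moved from GORM to another driver.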

3. The Real Beast: Distributed Transactions (Saga Pattern)

The most generic advice in microservices is "Use Pub/Sub". But how do you handle failure when Service A succeeds but Service B fails?

In our ecosystem, we implemented an Event-Choreography Saga Pattern using Dapr. Let's trace the complex ConfirmCheckout flow:

Checkout Service receives the HTTP request, validates the cart, and publishes a checkout.requested event to Dapr.
Warehouse Service and Pricing Service listen to this event and act independently on the payloads.

Handling Race Conditions in Warehouse

Inventory race conditions happen when two requests arriving milliseconds apart try to buy the last iPhone.

If Warehouse Service reads the stock with SELECT stock FROM items WHERE id = ? and then writes the decrement in a separate statement, both concurrent requests will see stock = 1, and both will decrement it, leading to -1 stock.

Instead, our Warehouse service utilizes Optimistic Concurrency Control (OCC) at the database layer:

// Optimistic Locking to prevent overselling
result := db.Exec(`
    UPDATE inventory 
    SET reserved_stock = reserved_stock + ?, version = version + 1 
    WHERE sku_id = ? 
      AND (total_stock - reserved_stock) >= ? 
      AND version = ?`, 
    qty, skuID, qty, currentVersion)

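// A driver or SQL error is not the same as losing the race; surface it first
if result.Error != nil {
    return result.Error
}
if result.RowsAffected == 0 {
    return ErrStockInsufficientOrRaceCondition
}

If the update matches no rows, either because stock ran out or because another request bumped the version first, Warehouse publishes an inventory.reservation.failed event.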
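The behaviour of that conditional UPDATE can be sketched without a database. In the snippet below an in-memory map stands in for the inventory table and a mutex plays the role of the row lock taken while one UPDATE is applied. `memInventory` and `raceForLastUnit` are hypothetical names; this is an illustration of the predicate, not our production code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrStockInsufficientOrRaceCondition = errors.New("insufficient stock or version mismatch")

// inventoryRow mirrors the columns used by the UPDATE statement.
type inventoryRow struct {
	totalStock, reservedStock, version int
}

// memInventory is an in-memory stand-in for the inventory table.
type memInventory struct {
	mu   sync.Mutex
	rows map[string]*inventoryRow
}

// Version returns the row version a caller reads before trying to reserve.
func (m *memInventory) Version(sku string) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.rows[sku].version
}

// Reserve applies the same predicate as the SQL: enough free stock AND an
// unchanged version. An error corresponds to "zero rows affected".
func (m *memInventory) Reserve(sku string, qty, expectedVersion int) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	r, ok := m.rows[sku]
	if !ok || r.totalStock-r.reservedStock < qty || r.version != expectedVersion {
		return ErrStockInsufficientOrRaceCondition
	}
	r.reservedStock += qty
	r.version++
	return nil
}

// raceForLastUnit lets n buyers compete for a single unit and counts winners.
func raceForLastUnit(n int) (winners, reserved int) {
	inv := &memInventory{rows: map[string]*inventoryRow{"iphone": {totalStock: 1}}}
	var wg sync.WaitGroup
	var mu sync.Mutex
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			v := inv.Version("iphone")
			if inv.Reserve("iphone", 1, v) == nil {
				mu.Lock()
				winners++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return winners, inv.rows["iphone"].reservedStock
}

func main() {
	w, r := raceForLastUnit(50)
	fmt.Printf("winners=%d reserved=%d\n", w, r)
}
```

However the goroutines interleave, only the first `Reserve` call can satisfy both conditions, so reserved stock never exceeds total stock. A caller that loses on the version check while stock remains could re-read and retry; that policy is left out of this sketch.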

The Rollback (Compensation)

Because state is distributed, if Warehouse successfully locks the stock but Pricing reports that the applied voucher is invalid, the entire Saga must abort.

Order Service often acts as the sink for these outcomes. If it sees inventory.reservation.failed or pricing.validation.failed, it publishes a single compensation event: checkout.failed.

Background workers (consumers) in Warehouse Service catch this event and immediately trigger Compensation Logic:

// Background Worker un-reserving stock
func (w *WarehouseWorker) HandleCheckoutFailed(ctx context.Context, event CheckoutFailed) error {
    // Rollback the reserved stock using the original transaction ID
    return w.inventoryRepo.ReleaseReservedStock(ctx, event.TransactionID)
}
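ReleaseReservedStock itself is not shown above. As a rough sketch of the property it needs, the version below is keyed by transaction ID and frees nothing when the transaction is unknown or already released, so a redelivered checkout.failed event is harmless. The `reservationStore` type is hypothetical and keeps everything in memory.

```go
package main

import "fmt"

// reservationStore is an in-memory stand-in for the Warehouse tables.
type reservationStore struct {
	reservedByTx  map[string]int // transaction ID -> quantity currently held
	reservedStock int
}

func newReservationStore() *reservationStore {
	return &reservationStore{reservedByTx: make(map[string]int)}
}

// Reserve records a hold for one checkout transaction.
func (s *reservationStore) Reserve(txID string, qty int) {
	s.reservedByTx[txID] += qty
	s.reservedStock += qty
}

// ReleaseReservedStock undoes the hold for txID and returns how much it freed.
// Unknown or already-released transactions free nothing.
func (s *reservationStore) ReleaseReservedStock(txID string) int {
	qty, ok := s.reservedByTx[txID]
	if !ok {
		return 0
	}
	delete(s.reservedByTx, txID)
	s.reservedStock -= qty
	return qty
}

func main() {
	s := newReservationStore()
	s.Reserve("tx-1", 2)
	fmt.Println(s.ReleaseReservedStock("tx-1"), s.reservedStock)
	fmt.Println(s.ReleaseReservedStock("tx-1"), s.reservedStock) // duplicate event
	fmt.Println(s.ReleaseReservedStock("tx-unknown"), s.reservedStock)
}
```

The "unknown transaction" branch matters in a choreographed Saga: Warehouse can receive checkout.failed for a checkout where its own reservation never succeeded.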

4. Taming Eventual Consistency with Idempotency

When you rely on network events, network retries will happen. Dapr guarantees "At-Least-Once" delivery, meaning Warehouse Service might receive the same checkout.requested event twice if a timeout occurs.

To prevent reserving stock twice, every database in our ecosystem that takes part in these transactions records an idempotency key for each event it has processed.

CREATE TABLE processed_events (
    event_id VARCHAR(255) PRIMARY KEY,
    status VARCHAR(50),
    created_at TIMESTAMP
);

Before processing an incoming Dapr message, the service opens a database transaction and attempts to insert the event_id. If it throws a constraint violation, the event was already processed, and the system safely acks and drops the duplicate message.
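The insert-then-process flow can be sketched as follows. A Go map stands in for the processed_events table and `errDuplicateEvent` stands in for the primary-key violation; both are assumptions for illustration, as is the `WarehouseConsumer` type.

```go
package main

import (
	"errors"
	"fmt"
)

// errDuplicateEvent stands in for the primary-key violation on processed_events.
var errDuplicateEvent = errors.New("duplicate event_id")

// processedEvents mimics the processed_events table: a second insert of the
// same event_id fails, as the PRIMARY KEY constraint would.
type processedEvents map[string]struct{}

func (p processedEvents) insert(eventID string) error {
	if _, seen := p[eventID]; seen {
		return errDuplicateEvent
	}
	p[eventID] = struct{}{}
	return nil
}

// WarehouseConsumer reserves stock at most once per event ID.
type WarehouseConsumer struct {
	events   processedEvents
	Reserved int
}

func NewWarehouseConsumer() *WarehouseConsumer {
	return &WarehouseConsumer{events: processedEvents{}}
}

// Handle returns nil for first deliveries and for duplicates, so the broker
// receives an ack either way; only the first delivery changes state.
// In a real service the insert and the reservation share one DB transaction.
func (c *WarehouseConsumer) Handle(eventID string, qty int) error {
	if err := c.events.insert(eventID); err != nil {
		if errors.Is(err, errDuplicateEvent) {
			return nil // already processed: ack and drop
		}
		return err
	}
	c.Reserved += qty
	return nil
}

func main() {
	c := NewWarehouseConsumer()
	_ = c.Handle("evt-1", 2)
	_ = c.Handle("evt-1", 2) // redelivery after a timeout
	fmt.Println("reserved:", c.Reserved)
}
```

Keeping the event insert and the business write in the same transaction is what makes this safe: if the reservation fails and rolls back, the event_id row disappears with it and a retry can be processed normally.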

Conclusion

Migrating an e-commerce Monolith to a 21-service ecosystem is not about setting up an API Gateway and calling it a day. The real engineering begins when you hit the edges: gracefully rolling back partial checkouts, avoiding lock contention under highly concurrent load, and forcing strict domain boundaries so codebases remain readable.

By mapping contexts meticulously, enforcing strict separation via Kratos, and utilizing Idempotent Saga patterns over Dapr, we engineered a system that can absorb massive Black Friday traffic spikes without dropping a single order. The initial complexities of distributed state are painful, but the resulting scalability and developer isolation are profoundly worth the investment.

Continue Reading:

Go Microservices Architecture: Production Guide — the comprehensive architecture manual for the entire stack.
Deconstructing the Ecosystem: Service Details by Domain — a full breakdown of all 21 services across 6 business domains.
Mastering Event-Driven Architecture with Dapr Pub/Sub — deep dive into the Saga, DLQ, and idempotency patterns powering this ecosystem.
GitOps at Scale: Kubernetes & ArgoCD for Microservices — how we deploy all 21 services with zero manual kubectl commands.

Hi, I'm Lê Tuấn Anh (vesviet) 👋
I am a Senior Go Backend Architect & Distributed Systems Engineer with 17+ years of experience building high-traffic platforms (25M+ requests/month).
If you enjoyed this deep-dive, let's connect on LinkedIn or explore my consulting services at tanhdev.com/hire.

FAQ

Q: What is Golang?
A:
Golang is the common name for Go, the open-source programming language created at Google. In this guide it is the language used to build the 21 microservices, with Kratos providing the service framework.

Q: How does Golang compare to traditional alternatives?
A:
Go compiles to a single static binary and has goroutines and channels built into the language, which suits small, independently deployed services. This article does not benchmark languages; it covers the architectural tradeoffs of running such services in production.

Prompt Engineering vs Fine-Tuning SLM: Production Cost & Latency Benchmarks

Tuấn Anh — Tue, 07 Jul 2026 12:13:42 +0000

When moving LLMs (Large Language Models) or SLMs (Small Language Models) into production, the debate between Prompt Engineering and Fine-Tuning isn't just about model intelligence. It is fundamentally a battle of Cost and Latency.

Based on real-world data from our AI engineering team, here are the benchmarks and tipping points that dictate when you must abandon complex prompts and start fine-tuning.

The Tipping Point: Context Window Bloat

When does Prompt Engineering become too expensive?

The Benchmark: The tipping point hits around 50,000 requests per day for tasks requiring Structured Output (e.g., forcing the LLM to return strict JSON).

To guarantee a strict JSON schema using a Cloud API (like GPT-4o), you often have to inject heavy Few-Shot examples and extensive system instructions. This causes Context Window Bloat, easily inflating a single prompt to 8,000 - 10,000 input tokens.

Under 1,000 requests/day: Paying for Cloud API input tokens is still cheaper and requires zero infrastructure maintenance.
Over 50,000 requests/day: The variable cost of input tokens skyrockets. At this scale, transitioning to a fixed-cost model—renting GPUs to host a Fine-Tuned SLM (like Llama-3 8B using LoRA)—becomes significantly cheaper. The fine-tuned model understands the JSON schema inherently, completely eliminating the need for a 10k-token few-shot prompt.
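The arithmetic behind that comparison is simple enough to sketch. Every number in `main` below is a hypothetical placeholder (not a real provider price and not our actual costs), and the model only counts input tokens, so treat it as a template to fill in with your own figures rather than as evidence for any particular threshold.

```go
package main

import (
	"fmt"
	"math"
)

// apiCostPerDay returns the daily input-token spend for a pay-per-token API.
func apiCostPerDay(requests, tokensPerRequest int, usdPerMillionTokens float64) float64 {
	return float64(requests) * float64(tokensPerRequest) / 1_000_000 * usdPerMillionTokens
}

// breakEvenRequests returns the daily request volume at which the API spend
// equals a fixed daily hosting cost.
func breakEvenRequests(fixedUSDPerDay float64, tokensPerRequest int, usdPerMillionTokens float64) int {
	perRequest := float64(tokensPerRequest) / 1_000_000 * usdPerMillionTokens
	return int(math.Round(fixedUSDPerDay / perRequest))
}

func main() {
	// HYPOTHETICAL inputs, chosen only to make the arithmetic easy to follow.
	const (
		tokensPerRequest    = 10_000 // prompt size mentioned in the text
		usdPerMillionTokens = 2.50   // assumed API input price
		fixedUSDPerDay      = 1250.0 // assumed GPU rental plus operations
	)
	for _, n := range []int{1_000, 50_000, 200_000} {
		fmt.Printf("%7d req/day: API $%.2f vs fixed $%.2f\n",
			n, apiCostPerDay(n, tokensPerRequest, usdPerMillionTokens), fixedUSDPerDay)
	}
	fmt.Println("break-even:",
		breakEvenRequests(fixedUSDPerDay, tokensPerRequest, usdPerMillionTokens), "req/day")
}
```

The useful part is the shape of the curve: API spend grows linearly with traffic while the hosted model is flat, so the break-even point moves directly with prompt size and token price.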

Latency Benchmarks: 150ms vs 800ms TTFT

Cost aside, User Experience (UX) is dictated by latency, specifically TTFT (Time To First Token). For real-time applications like Chatbots or inline coding assistants, latency is make-or-break.

Here is our production TTFT benchmark:

Cloud API (10k Token Prompt): TTFT averages 800ms to 1.2 seconds. The cloud model spends much of that time in the "Prefill Phase" processing the large context window, plus network overhead.
Fine-Tuned SLM (7B, INT4 Quantized, Edge/Local Hosting): TTFT drops to 150ms - 250ms. Because the behavior and formatting rules are baked into the model weights, the input prompt is incredibly short. The UX shifts from "painful waiting" to "instantaneous response."

Managing Tech Debt: PromptOps vs MLOps

A surprising finding from our production deployment is that PromptOps generates more silent technical debt than MLOps.

We call it Semantic Drift. When you tweak a massive system prompt to fix a bug for "Edge Case A," you almost inevitably break the structured output for "Edge Case B." Traditional CI/CD pipelines cannot catch semantic drift because it is not a syntax error.

Fine-tuning (MLOps) requires a much heavier initial lift (Data Pipelines, Evaluation frameworks, GPU provisioning). However, once established, model checkpoints provide absolute, reproducible version control. Warning: If your organization does not have a culture of clean data, Fine-Tuning will result in a "Garbage In, Garbage Out" disaster.

The Golden Rule: RAG vs Fine-Tuning

RAG (Retrieval-Augmented Generation) and Fine-Tuning are not mutually exclusive; they solve completely different problems.

The Golden Rule:

Use RAG to inject Knowledge (Facts, Documentation, Real-time data).
Use Fine-Tuning to teach Behavior and Format (Tone, JSON schemas, Reasoning style).

Our Failure Case Study: We once attempted to fine-tune a model by feeding it all our proprietary engineering documentation, hoping it would "memorize" the knowledge. It failed miserably. The model hallucinated wildly when asked cross-domain questions.
Do not fine-tune a model just to cram data into it. Use RAG to fetch the data, and Fine-Tune the model so it knows exactly how to format the answer.

Are you currently relying entirely on Prompt Engineering for your production AI features? Have you hit the latency wall yet? Let me know your benchmarks in the comments!

Exporting Magento 2 Data: Flatten EAV with SQL & Node

Tuấn Anh — Mon, 06 Jul 2026 10:57:31 +0000

Answer-first: Production-grade guide to extracting data from Magento 2's EAV model. Includes direct SQL queries and a resilient Node.js streaming pipeline.

When migrating off Magento 2, the first obstacle is always the database schema. Magento does not store data in clean flat rows — it uses an Entity-Attribute-Value (EAV) model that spreads data across dozens of tables with store-scope inheritance. Understanding this before writing SQL will save you days.

This guide covers two extraction problems: order export (the simpler case) and product catalog export (the genuinely hard case), followed by a production-grade Node.js pipeline to ingest that data into your new service databases.

Part 1: Exporting Orders

Order data lives across sales_order, sales_order_address, sales_order_payment, and sales_order_item. Unlike the product catalog, this is standard foreign-key joins — not full EAV pivoting.

Full Order + Payment + Shipping Export

SELECT
    so.entity_id            AS order_id,
    so.increment_id         AS magento_order_number,
    so.status               AS order_status,
    so.grand_total          AS total_amount,
    so.order_currency_code  AS currency,  -- grand_total is stored in the order currency
    so.created_at           AS order_created_at,
    so.customer_email,
    so.customer_firstname   AS customer_first,
    so.customer_lastname    AS customer_last,

    -- Shipping address (denormalized)
    soa.street              AS ship_street,
    soa.city                AS ship_city,
    soa.region              AS ship_region,
    soa.postcode            AS ship_postcode,
    soa.country_id          AS ship_country,
    soa.telephone           AS ship_phone,

    -- Payment method
    sop.method              AS payment_method,
    sop.last_trans_id       AS payment_transaction_id,

    -- Shipment (NULL if not yet fulfilled; an order with several shipments yields one row per shipment)
    sos.entity_id           AS shipment_id,
    sos.created_at          AS shipped_at

FROM sales_order so
LEFT JOIN sales_order_address soa
    ON soa.parent_id = so.entity_id AND soa.address_type = 'shipping'
LEFT JOIN sales_order_payment sop
    ON sop.parent_id = so.entity_id
LEFT JOIN sales_shipment sos
    ON sos.order_id = so.entity_id

WHERE so.status NOT IN ('canceled', 'fraud')
  AND so.created_at >= '2022-01-01 00:00:00'
ORDER BY so.created_at ASC;

Order Line Items (Second Pass)

SELECT
    soi.order_id,
    soi.sku,
    soi.name                AS product_name,
    soi.qty_ordered,
    soi.qty_shipped,
    soi.qty_refunded,
    soi.price               AS unit_price,
    soi.row_total,
    soi.product_type,
    soi.parent_item_id      -- non-null for configurable child rows

FROM sales_order_item soi

WHERE soi.parent_item_id IS NULL  -- skip phantom child rows for configurables
ORDER BY soi.order_id ASC, soi.item_id ASC;

Join on order_id in your ingestion script to reconstruct the full order object.
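As a minimal sketch of that join, the snippet below groups line items by order_id and attaches them to their header. The struct fields are a trimmed, hypothetical subset of the exported columns, and everything is held in memory, which only suits exports that fit in RAM.

```go
package main

import "fmt"

// OrderHeader holds a few columns from the first export query.
type OrderHeader struct {
	OrderID int
	Number  string
}

// LineItem holds a few columns from the second-pass query.
type LineItem struct {
	OrderID int
	SKU     string
	Qty     float64
}

// Order is the reconstructed object: one header plus its items.
type Order struct {
	OrderHeader
	Items []LineItem
}

// assembleOrders attaches items to headers by order_id. Repeated headers
// (for example one row per shipment) are collapsed to the first occurrence.
func assembleOrders(headers []OrderHeader, items []LineItem) []Order {
	itemsByOrder := make(map[int][]LineItem)
	for _, it := range items {
		itemsByOrder[it.OrderID] = append(itemsByOrder[it.OrderID], it)
	}
	seen := make(map[int]bool)
	orders := make([]Order, 0, len(headers))
	for _, h := range headers {
		if seen[h.OrderID] {
			continue
		}
		seen[h.OrderID] = true
		orders = append(orders, Order{OrderHeader: h, Items: itemsByOrder[h.OrderID]})
	}
	return orders
}

func main() {
	headers := []OrderHeader{{1, "000000001"}, {1, "000000001"}, {2, "000000002"}}
	items := []LineItem{{1, "SKU-A", 1}, {1, "SKU-B", 2}, {2, "SKU-C", 1}}
	for _, o := range assembleOrders(headers, items) {
		fmt.Printf("order %s has %d item(s)\n", o.Number, len(o.Items))
	}
}
```

For exports too large for memory, sorting both result sets by order_id and merging them as two streams would avoid the map, at the cost of changing the ORDER BY of the first query.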

Part 2: Exporting the Product Catalog (The Hard Part)

This is where most migration engineers underestimate the effort. The product catalog uses full EAV with store scope inheritance: a value at store_id = 0 (Admin/Global) is the default; a value at a specific store_id overrides it for that store view. A naive SELECT * will return corrupted or incomplete data.

The correct approach is a two-step process.

Step 1: Materialize Attribute IDs

The attribute_id values are environment-specific — they differ between Magento installations. Run this once and use the result to populate your export query:

SELECT attribute_id, attribute_code, backend_type
FROM eav_attribute
WHERE entity_type_id = (
    SELECT entity_type_id FROM eav_entity_type
    WHERE entity_type_code = 'catalog_product'
)
AND attribute_code IN (
    'name', 'url_key', 'description', 'short_description',
    'price', 'special_price', 'status', 'visibility', 'weight'
);

Step 2: Flattened Product Export with Store-Scope Fallback

This query exports products for the store view with store_id = 1. For each attribute, it prefers the store-specific value and falls back to the global default (store_id = 0). Replace the attribute_id values with results from Step 1:

SELECT
    e.entity_id,
    e.sku,
    e.type_id                                           AS product_type,
    e.created_at,

    -- Name (varchar): prefer store-specific, fallback to global
    COALESCE(v_name_s.value, v_name_g.value)            AS name,
    COALESCE(v_url_s.value, v_url_g.value)              AS url_key,

    -- Status: 1=Enabled, 2=Disabled (int)
    COALESCE(i_status_s.value, i_status_g.value)        AS status,
    -- Visibility: 1=Not visible, 4=Catalog+Search (int)
    COALESCE(i_vis_s.value, i_vis_g.value)              AS visibility,

    -- Price (decimal): global scope by default; add the store fallback if Catalog Price Scope is set to Website
    d_price.value                                       AS price,
    d_special.value                                     AS special_price,
    d_weight.value                                      AS weight

FROM catalog_product_entity e

-- === VARCHAR: name ===
LEFT JOIN catalog_product_entity_varchar v_name_s
    ON v_name_s.entity_id = e.entity_id AND v_name_s.attribute_id = 73 AND v_name_s.store_id = 1
LEFT JOIN catalog_product_entity_varchar v_name_g
    ON v_name_g.entity_id = e.entity_id AND v_name_g.attribute_id = 73 AND v_name_g.store_id = 0

-- === VARCHAR: url_key ===
LEFT JOIN catalog_product_entity_varchar v_url_s
    ON v_url_s.entity_id = e.entity_id AND v_url_s.attribute_id = 120 AND v_url_s.store_id = 1
LEFT JOIN catalog_product_entity_varchar v_url_g
    ON v_url_g.entity_id = e.entity_id AND v_url_g.attribute_id = 120 AND v_url_g.store_id = 0

-- === INT: status ===
LEFT JOIN catalog_product_entity_int i_status_s
    ON i_status_s.entity_id = e.entity_id AND i_status_s.attribute_id = 96 AND i_status_s.store_id = 1
LEFT JOIN catalog_product_entity_int i_status_g
    ON i_status_g.entity_id = e.entity_id AND i_status_g.attribute_id = 96 AND i_status_g.store_id = 0

-- === INT: visibility ===
LEFT JOIN catalog_product_entity_int i_vis_s
    ON i_vis_s.entity_id = e.entity_id AND i_vis_s.attribute_id = 99 AND i_vis_s.store_id = 1
LEFT JOIN catalog_product_entity_int i_vis_g
    ON i_vis_g.entity_id = e.entity_id AND i_vis_g.attribute_id = 99 AND i_vis_g.store_id = 0

-- === DECIMAL: price, special_price, weight (global only) ===
LEFT JOIN catalog_product_entity_decimal d_price
    ON d_price.entity_id = e.entity_id AND d_price.attribute_id = 77 AND d_price.store_id = 0
LEFT JOIN catalog_product_entity_decimal d_special
    ON d_special.entity_id = e.entity_id AND d_special.attribute_id = 78 AND d_special.store_id = 0
LEFT JOIN catalog_product_entity_decimal d_weight
    ON d_weight.entity_id = e.entity_id AND d_weight.attribute_id = 80 AND d_weight.store_id = 0

-- Only export enabled products
WHERE COALESCE(i_status_s.value, i_status_g.value) = 1
ORDER BY e.entity_id ASC;
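Writing those join pairs by hand gets tedious as the attribute list grows. One option, sketched here under the assumption that you load the Step 1 result into a slice, is to generate the fragments. The `Attr` type and both helper functions are hypothetical, and the generated aliases (`name_s`, `name_g`) differ from the hand-written ones above.

```go
package main

import (
	"fmt"
	"strings"
)

// Attr is one row returned by the Step 1 lookup.
type Attr struct {
	Code        string // attribute_code
	ID          int    // attribute_id (differs per installation)
	BackendType string // backend_type, e.g. "varchar", "int", "decimal"
}

// scopedJoins renders the store-specific and global LEFT JOIN pair for one attribute.
func scopedJoins(a Attr, storeID int) string {
	table := "catalog_product_entity_" + a.BackendType
	scopes := []struct {
		suffix string
		store  int
	}{{"s", storeID}, {"g", 0}}

	var b strings.Builder
	for _, sc := range scopes {
		alias := a.Code + "_" + sc.suffix
		fmt.Fprintf(&b, "LEFT JOIN %s %s\n    ON %s.entity_id = e.entity_id AND %s.attribute_id = %d AND %s.store_id = %d\n",
			table, alias, alias, alias, a.ID, alias, sc.store)
	}
	return b.String()
}

// coalesceColumn renders the SELECT expression that prefers the store value.
func coalesceColumn(a Attr) string {
	return fmt.Sprintf("COALESCE(%s_s.value, %s_g.value) AS %s", a.Code, a.Code, a.Code)
}

func main() {
	attrs := []Attr{
		{Code: "name", ID: 73, BackendType: "varchar"},
		{Code: "status", ID: 96, BackendType: "int"},
	}
	for _, a := range attrs {
		fmt.Println(coalesceColumn(a))
		fmt.Print(scopedJoins(a, 1))
	}
}
```

The inputs come from eav_attribute, not from users, but string-built SQL should still only ever be fed trusted metadata. Attributes whose backend_type is static live on catalog_product_entity itself and have no value table to join.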

Performance: On catalogs with 25,000+ SKUs, this query will be slow. Inspect the plan with EXPLAIN first (EXPLAIN ANALYZE, available from MySQL 8.0.18, actually executes the query), confirm that each EAV value table has a composite index covering (entity_id, attribute_id, store_id), and batch by entity_id ranges (WHERE e.entity_id BETWEEN 1 AND 5000) to keep the load on your production database bounded.
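The batching windows can be computed up front from MIN(entity_id) and MAX(entity_id). A small sketch, with `idRanges` as a hypothetical helper name:

```go
package main

import "fmt"

// idRanges splits [minID, maxID] into inclusive windows of at most size IDs.
func idRanges(minID, maxID, size int) [][2]int {
	if size <= 0 || minID > maxID {
		return nil
	}
	var out [][2]int
	for lo := minID; lo <= maxID; lo += size {
		hi := lo + size - 1
		if hi > maxID {
			hi = maxID
		}
		out = append(out, [2]int{lo, hi})
	}
	return out
}

func main() {
	for _, r := range idRanges(1, 12000, 5000) {
		fmt.Printf("WHERE e.entity_id BETWEEN %d AND %d\n", r[0], r[1])
	}
}
```

Ranges are based on ID values, not row counts, so windows over sparse ID space will return fewer products than the window size.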

Part 3: The Production Node.js Ingestion Pipeline

With data exported to CSV, you need a streaming pipeline that handles gigabytes without OOM, with batching, retry logic, idempotency, and a dead-letter queue for failed rows.

Pipeline Architecture

CSV File → Readable Stream → csv-parse → Batch Collector → DB Upsert (with retry)
                                                         ↓ (on max retries)
                                                   Dead-Letter File (JSONL)

Implementation

// migrate.js — Production-grade Magento → PostgreSQL pipeline
const { pipeline, Transform } = require('stream');
const { promisify } = require('util');
const { parse } = require('csv-parse');
const fs = require('fs');
const db = require('./db'); // your pg connection pool

const pipe = promisify(pipeline);

const BATCH_SIZE = 500;
const MAX_RETRIES = 3;
const RETRY_BASE_MS = 500;

const dlqStream = fs.createWriteStream('./failed-rows.jsonl', { flags: 'a' });
let processed = 0, failed = 0;
const startTime = Date.now();

// Exponential backoff retry
async function withRetry(fn, label) {
    for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {
        try {
            return await fn();
        } catch (err) {
            if (attempt === MAX_RETRIES) throw err;
            const delay = RETRY_BASE_MS * Math.pow(2, attempt - 1);
            console.warn(`\n⚠ ${label} failed (attempt ${attempt}). Retrying in ${delay}ms…`);
            await new Promise(r => setTimeout(r, delay));
        }
    }
}

// Upsert batch — idempotent by magento_order_id
async function upsertBatch(batch) {
    const client = await db.connect();
    try {
        await client.query('BEGIN');
        for (const row of batch) {
            await client.query(`
                INSERT INTO orders (
                    magento_order_id, magento_increment_id, status,
                    total_amount, currency, customer_email, created_at
                ) VALUES ($1,$2,$3,$4,$5,$6,$7)
                ON CONFLICT (magento_order_id) DO UPDATE SET
                    status       = EXCLUDED.status,
                    total_amount = EXCLUDED.total_amount,
                    updated_at   = NOW()
            `, [
                row.order_id, row.magento_order_number, row.order_status,
                parseFloat(row.total_amount) || 0, row.currency,
                row.customer_email, row.order_created_at
            ]);
        }
        await client.query('COMMIT');
    } catch (err) {
        await client.query('ROLLBACK');
        throw err;
    } finally {
        client.release();
    }
}

// Transform stream: collect rows into batches, flush with backpressure
function createBatchCollector(batchSize, onBatch) {
    let buffer = [];

    const flush = async (rows, callback) => {
        try {
            await withRetry(() => onBatch(rows), `batch ~row ${processed}`);
            processed += rows.length;
            process.stdout.write(
                `\r✓ ${processed.toLocaleString()} rows | ✗ ${failed} failed | ` +
                `${((Date.now() - startTime) / 1000).toFixed(0)}s elapsed`
            );
        } catch (err) {
            failed += rows.length;
            console.error(`\n✗ Permanent batch failure: ${err.message}`);
            rows.forEach(r => dlqStream.write(JSON.stringify(r) + '\n'));
        }
        callback();
    };

    return new Transform({
        objectMode: true,
        async transform(row, _enc, callback) {
            buffer.push(row);
            if (buffer.length >= batchSize) {
                const toFlush = buffer.splice(0, batchSize);
                await flush(toFlush, callback);
            } else {
                callback();
            }
        },
        async flush(callback) {
            if (buffer.length > 0) await flush(buffer, callback);
            else callback();
        }
    });
}

async function migrate(csvPath) {
    console.log(`\nMigrating: ${csvPath} | Batch: ${BATCH_SIZE} | Retries: ${MAX_RETRIES}\n`);
    await pipe(
        fs.createReadStream(csvPath, { encoding: 'utf8' }),
        parse({ columns: true, skip_empty_lines: true, trim: true }),
        createBatchCollector(BATCH_SIZE, upsertBatch)
    );
    dlqStream.end();
    const elapsed = ((Date.now() - startTime) / 1000).toFixed(1);
    console.log(`\n\n✅ Done in ${elapsed}s — ${processed.toLocaleString()} rows | ${failed} DLQ`);
    if (failed > 0) console.log(`   DLQ: ./failed-rows.jsonl`);
}

migrate(process.argv[2] || './orders.csv').catch(err => {
    console.error('\n✗ Fatal:', err.message);
    process.exit(1);
});

Key Design Decisions

Idempotency (ON CONFLICT DO UPDATE): The pipeline can be safely restarted. If it crashes at row 47,000, rows 1–47,000 are simply updated to the same values when you re-run. No duplicates.

Dead-Letter Queue: Batches that exhaust all retries are written to failed-rows.jsonl. After the migration, inspect the file and fix the root cause. Because the script parses its input with csv-parse, the JSONL file cannot be fed back in unchanged: convert it to CSV (or add a JSON Lines reader) before replaying, and send new failures to a different file than the one being read.

Backpressure: The callback() in the Transform stream is not called until upsertBatch resolves. Node.js automatically pauses the readable stream when the database is under pressure, so no manual pause()/resume() is needed.

stream.pipeline: Using the promisified pipeline instead of manually chaining .pipe() ensures that if any stream in the chain errors, all other streams are automatically destroyed and file handles are released.

# Run migration
node migrate.js ./exports/magento-orders.csv

# Replay failed rows after converting failed-rows.jsonl to CSV (the output name is up to you)
node migrate.js ./failed-rows.csv

For the full architectural context of where this extracted data lands in a microservice ecosystem, see Why You Should Migrate from Magento to Microservices and the Zero-Downtime Migration Blueprint.

Go deeper: Architecting a 21-Service E-commerce Ecosystem with Golang & DDD — the distributed microservices architecture that this data pipeline feeds into.

FAQ

Q: What is Magento?
A:
Magento is an open-source e-commerce platform written in PHP (the commercial edition is sold as Adobe Commerce). This guide covers extracting data from Magento 2's EAV model with direct SQL queries and a resilient Node.js streaming pipeline.

Q: How does Magento compare to traditional alternatives?
A:
Magento is a monolithic application backed by a single MySQL database. This article treats it as the source system of a migration and focuses on getting its order and catalog data out cleanly.

Migrating Magento to Microservices: When & Why

Tuấn Anh — Thu, 02 Jul 2026 23:30:40 +0000

Answer-first: Honest breakdown of why Magento's monolithic architecture becomes a liability at scale and the exact reasons to migrate to a microservice ecosystem.

Let's be direct: Magento is not a bad platform. For thousands of businesses, it is the right tool. It has a mature plugin ecosystem, a large developer community, and a proven track record across enterprise e-commerce.

But there is a ceiling. And when you hit it, you feel it everywhere — in your deployment pipeline, in your database query times, in your team's ability to ship features independently, and ultimately in your ability to serve customers reliably at scale.

This post is about what that ceiling looks like technically, why it exists architecturally, and what a migration to microservices actually solves — and what it doesn't.

The Core Problem: Magento is a Shared-State Monolith

Magento's architecture is fundamentally a single application with a single shared MySQL database. Every module — catalog, orders, payments, inventory, customers, promotions — reads and writes to the same database cluster.

graph TB
    subgraph "Magento Monolith"
        APP["Single PHP Application<br>Catalog · Orders · Payment<br>Inventory · Customers · CMS"]
        APP --> DB[("Single MySQL DB<br>300+ tables")]
        APP --> CACHE["Varnish / Redis Cache"]
    end

    CLIENT["Web / Mobile"] --> APP

This design works well at low-to-medium scale. The problem surfaces when you need to grow.

1. You Cannot Scale Selectively

During a flash sale, your Order and Checkout modules get hammered. Your Catalog module is mostly idle. In Magento, you cannot scale just the checkout flow — you must scale the entire application. Every PHP worker you spin up carries the full weight of every module, whether it's under load or not.

In a microservice architecture, you scale surgically:

# Scale only the services under load during a flash sale
# Other services remain at baseline
order-service:    replicas: 10   # 10x during sale
checkout-service: replicas: 8
payment-service:  replicas: 6
catalog-service:  replicas: 2   # Unchanged
analytics-service: replicas: 1  # Unchanged

The cost difference at scale is measurable. In our production environment, selective scaling during flash sale events reduced EC2 compute spend by approximately 60% compared to scaling the full Magento stack uniformly — because we only scaled the 3 services under load, not all 21.

2. A Single Failure Brings Down Everything

In Magento, a misbehaving extension, a slow database query, or a memory leak in one module can cascade into a full site outage. The application shares a process space and a database connection pool.

In a distributed system, failure is contained:

Magento:          Review module crashes → entire site down
Microservices:    Review service crashes → customers still browse, add to cart, and pay

This is not theoretical. The Review service going down should never affect the Payment service. Database isolation enforces this at the infrastructure level — each service owns its own PostgreSQL instance. A slow query in the Analytics database cannot lock rows in the Order database.

3. The EAV Schema Becomes a Performance Liability

Magento's product catalog uses an Entity-Attribute-Value (EAV) model. Instead of storing product data in flat rows, it spreads attributes across multiple tables: catalog_product_entity_varchar, catalog_product_entity_int, catalog_product_entity_decimal, and so on.

Fetching a single product with 30 attributes can require joining 5+ tables. At 25,000+ SKUs with complex attribute sets, this becomes a measurable latency problem, especially for search and listing pages. Even the sales tables, which are not EAV, need several joins for a basic order manifest:

-- Just to get orders with payment and shipment IDs: already 2 JOINs across 3 tables
SELECT 
    sales_order.entity_id        AS "Order ID",
    sales_order_payment.entity_id AS "Payment ID",
    sales_shipment.entity_id      AS "Shipment ID",
    sales_order.status            AS "Order Status",
    sales_order.grand_total       AS "Total"
FROM sales_order
LEFT JOIN sales_order_payment 
    ON (sales_order.entity_id = sales_order_payment.parent_id)
LEFT JOIN sales_shipment 
    ON (sales_order.entity_id = sales_shipment.order_id)
ORDER BY sales_order.created_at ASC;

And that is just orders. The product catalog EAV joins are significantly worse — fetching a single product with 30 attributes touches catalog_product_entity_varchar, catalog_product_entity_int, catalog_product_entity_decimal, and more in a single query. For a full breakdown of how to extract and flatten this data during migration, see Exporting Magento 2 Orders: Bypassing the EAV Model with Clean SQL & Node.js.

A dedicated Catalog Service with a purpose-built schema and an Elasticsearch read model solves this cleanly:

Writes go to a normalized PostgreSQL schema owned by the Catalog service
A CQRS read model in Elasticsearch serves product listings and search with sub-100ms response times
Price and stock updates propagate via Dapr events, keeping the search index fresh in near real-time

The CQRS flow works like this: when the Catalog or Pricing service updates a product, it publishes a catalog.product.updated or pricing.price.updated event to the Dapr event mesh. The Search service subscribes to these topics and rebuilds the Elasticsearch document for that SKU. There are no cron jobs and no full reindex, and staleness is limited to the event propagation delay.

graph LR
    CAT[Catalog Service] -- "catalog.product.updated" --> DAPR[Dapr PubSub]
    PRC[Pricing Service] -- "pricing.price.updated" --> DAPR
    WH[Warehouse Service] -- "warehouse.stock.changed" --> DAPR
    DAPR --> SEARCH[Search Service Worker]
    SEARCH --> ES[(Elasticsearch)]
    ES -- "sub-100ms reads" --> GW[API Gateway]
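A rough sketch of the Search service worker follows. An in-memory map stands in for Elasticsearch, and the event payload fields (`Name`, `PriceCents`, `Stock`) are assumptions, since the real event schemas are not shown here. The sketch patches the affected field rather than rebuilding the whole document, which is a simplification.

```go
package main

import "fmt"

// ProductDoc is the denormalized read model served to listings and search.
type ProductDoc struct {
	SKU        string
	Name       string
	PriceCents int64
	Stock      int
}

// Event is a simplified envelope; only the field relevant to Topic is read.
type Event struct {
	Topic      string
	SKU        string
	Name       string
	PriceCents int64
	Stock      int
}

// SearchIndex stands in for Elasticsearch in this sketch.
type SearchIndex struct{ docs map[string]ProductDoc }

func NewSearchIndex() *SearchIndex {
	return &SearchIndex{docs: make(map[string]ProductDoc)}
}

// Apply updates only the document for the SKU named in the event.
func (ix *SearchIndex) Apply(ev Event) error {
	doc := ix.docs[ev.SKU]
	doc.SKU = ev.SKU
	switch ev.Topic {
	case "catalog.product.updated":
		doc.Name = ev.Name
	case "pricing.price.updated":
		doc.PriceCents = ev.PriceCents
	case "warehouse.stock.changed":
		doc.Stock = ev.Stock
	default:
		return fmt.Errorf("unhandled topic %q", ev.Topic)
	}
	ix.docs[ev.SKU] = doc
	return nil
}

// Get returns the current document for a SKU, if one exists.
func (ix *SearchIndex) Get(sku string) (ProductDoc, bool) {
	d, ok := ix.docs[sku]
	return d, ok
}

func main() {
	ix := NewSearchIndex()
	_ = ix.Apply(Event{Topic: "catalog.product.updated", SKU: "SKU-1", Name: "Phone"})
	_ = ix.Apply(Event{Topic: "pricing.price.updated", SKU: "SKU-1", PriceCents: 99900})
	_ = ix.Apply(Event{Topic: "warehouse.stock.changed", SKU: "SKU-1", Stock: 5})
	doc, _ := ix.Get("SKU-1")
	fmt.Printf("%+v\n", doc)
}
```

Because delivery is at-least-once and events can arrive out of order, a production projector would also need an ordering guard such as a per-document version; that is omitted here.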

4. Teams Step on Each Other

At scale, multiple squads need to work on the same platform simultaneously. In Magento, this means multiple teams modifying the same codebase, the same database schema, and deploying together.

Conway's Law is real: your system architecture mirrors your team structure. A monolith forces teams to coordinate deployments, negotiate schema changes, and share release cycles. One team's bug blocks another team's feature.

Bounded contexts solve this. When the Payment team owns their service end-to-end — their codebase, their database, their deployment pipeline — they ship independently. A bug in the Loyalty service does not block a Checkout release.

5. Distributed Transactions Require Explicit Design

Magento handles checkout as a synchronous database transaction: reserve stock, create order, capture payment — all in one BEGIN ... COMMIT block. This is simple and correct for a single database.

At scale, this becomes a liability. A slow payment gateway response holds a database transaction open, consuming connection pool slots. Under load, this cascades into connection exhaustion.

The microservice answer is the Saga pattern: each step is a local transaction, and failures trigger compensating transactions rather than database rollbacks.

sequenceDiagram
    participant CK as Checkout Service
    participant WH as Warehouse Service
    participant PAY as Payment Service
    participant ORD as Order Service

    CK->>WH: Reserve stock (TTL 15 min)
    WH-->>CK: Stock reserved ✅

    CK->>PAY: Authorize payment
    PAY-->>CK: Authorized ✅

    CK->>ORD: Create order
    ORD-->>CK: Order created ✅

    Note over CK,ORD: If a later step fails, compensate the completed steps:
    CK->>WH: Release reservation (compensation)
    CK->>PAY: Void authorization (compensation)

No long-lived database transactions. No connection pool exhaustion. Each service handles its own state, and failures trigger explicit rollback logic rather than implicit database rollbacks.
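The compensation logic can be sketched as a small orchestrator. This is a minimal Python illustration, not the production Go implementation: the step names mirror the diagram, the actions are stubs, and running compensations in reverse order of completion is a common convention that is assumed here.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
            log.append(f"done:{name}")
            completed.append(step)
        except Exception as exc:
            log.append(f"failed:{name}:{exc}")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            return False
    return True


def fail_create_order() -> None:
    raise RuntimeError("order service unavailable")


saga_log: List[str] = []
ok = run_saga(
    [
        # The stub compensations stand for "release reservation" and
        # "void authorization" respectively.
        ("reserve_stock", lambda: None, lambda: None),
        ("authorize_payment", lambda: None, lambda: None),
        ("create_order", fail_create_order, lambda: None),
    ],
    saga_log,
)
```

In a real system each compensation must itself be retried until it succeeds, and every step needs an idempotency key so a retried message does not reserve stock twice.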

What Microservices Actually Deliver

Based on a production 21-service Go ecosystem handling 10,000+ orders per day, here is what the architecture concretely delivers:

Capability	Magento	Microservices
Per-module scaling	❌ Scale entire app	✅ Scale only what's under load
Fault isolation	❌ One crash = site down	✅ Isolated failure domains
Database isolation	❌ 300+ shared tables	✅ Separate DB per service
Independent deploys	❌ Full app deployment	✅ Deploy one service at a time
Payment resilience	❌ Sync, no retry logic	✅ Saga + DLQ + compensation
Search performance	⚠️ EAV joins at query time	✅ Pre-indexed Elasticsearch
Event reliability	❌ Sync observers	✅ Transactional outbox, at-least-once
Zero-downtime deploy	⚠️ Maintenance mode	✅ Rolling updates per service

The difference between these two event models is worth unpacking. In Magento, events are synchronous PHP observers — if the observer is slow or throws an exception, it blocks the entire request:

// Magento: Synchronous observer — blocks the HTTP request
class OrderPlaceAfterObserver implements ObserverInterface
{
    public function execute(Observer $observer)
    {
        $order = $observer->getEvent()->getOrder();
        // If this call to an external API is slow or fails,
        // the customer's checkout request hangs or errors out
        $this->loyaltyService->awardPoints($order->getCustomerId(), $order->getGrandTotal());
        $this->analyticsService->trackPurchase($order); // Another blocking call
    }
}

In the microservice model, the Order service writes the event to an outbox table in the same database transaction as the order itself — then a background worker publishes it asynchronously:

// Go: Transactional Outbox: event is persisted atomically, request is non-blocking
func (uc *OrderUsecase) CreateOrder(ctx context.Context, o *Order) error {
    // A background worker later reads the outbox and publishes to Dapr, so the
    // checkout request returns without waiting on downstream services.
    return uc.repo.WithTx(ctx, func(tx Tx) error {
        // 1. Save the order
        if err := tx.SaveOrder(ctx, o); err != nil {
            return err
        }
        // 2. Write the event to the outbox in the SAME transaction.
        // If the DB commits, the event will be published at least once.
        return tx.SaveOutboxEvent(ctx, "orders.order.created", o)
    })
}

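The outbox guarantees at-least-once delivery even if the Dapr broker is temporarily unavailable, which in turn means consumers must be idempotent. The Magento observer has no such guarantee: if the observer fails, there is no durable record of the event and nothing retries it.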
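The background worker side of the outbox can be sketched as follows. This is a simplified Python illustration using SQLite; the table layout, the `relay_once` helper, and the `publish` callback are assumptions for the example, not the actual service code.

```python
import json
import sqlite3
from typing import Callable

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id TEXT PRIMARY KEY, total INTEGER NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        topic TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
    """
)


def create_order(order_id: str, total: int) -> None:
    # One local transaction: both rows commit, or neither does.
    with conn:
        conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders.order.created", json.dumps({"order_id": order_id, "total": total})),
        )


def relay_once(publish: Callable[[str, str], None]) -> int:
    """Publish pending rows oldest-first; mark a row only after publish succeeds."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, topic, payload in rows:
        try:
            publish(topic, payload)
        except ConnectionError:
            break  # Broker down: leave the row pending and retry on the next poll.
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent


def broker_down(topic: str, payload: str) -> None:
    raise ConnectionError("broker unavailable")


delivered = []
create_order("ORD-1", 250000)
sent_while_down = relay_once(broker_down)
sent_after_recovery = relay_once(lambda topic, payload: delivered.append((topic, payload)))
```

If the worker crashes between a successful publish and the `UPDATE`, the row is published again on the next poll. That is the at-least-once behaviour, and it is why downstream consumers need to deduplicate.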

The Real Cost of Migration

This is where most migration posts stop being honest. Microservices are not free.

Operational complexity increases dramatically. You are now running 21+ services, each with its own database, deployment pipeline, and failure modes. You need Kubernetes, a service mesh, distributed tracing, centralized logging, and a team that understands all of it.

Distributed systems introduce new failure modes. Network partitions, event ordering issues, idempotency bugs, and eventual consistency edge cases do not exist in a monolith. They require explicit engineering investment to handle correctly.

The migration itself is high-risk. A naive "big bang" rewrite is how multimillion-dollar projects fail. The only safe path is an incremental migration using the Strangler Fig pattern — routing traffic gradually from the monolith to new services while maintaining data consistency through CDC pipelines and bidirectional sync.

Team size matters. A team of 2-3 developers cannot maintain 21 services. The operational overhead alone requires dedicated platform engineering capacity. Shopify or a managed Magento cloud is the right answer for small teams.

When to Migrate (And When Not To)

Migrate when:

You have 5+ developers and dedicated DevOps capacity
You are hitting Magento's scaling ceiling (slow deploys, shared DB contention, module conflicts)
You need independent team autonomy across multiple squads
You require custom payment flows, multi-warehouse WMS, or VN-specific integrations that Magento handles poorly
You want full source ownership with zero vendor licensing costs

Do not migrate when:

Your team is under 5 engineers
You need to launch in weeks, not months
Your traffic is manageable on a well-tuned Magento stack
You rely heavily on Magento's plugin ecosystem
You do not have the operational maturity to run Kubernetes in production

The Bottom Line

Magento's monolithic architecture is not a flaw; it is a deliberate design choice that optimizes for simplicity and ecosystem richness. For the majority of e-commerce businesses, it is the correct choice. (If you are looking at alternatives to Magento but aren't ready for full microservices, consider a Modular Monolith Architecture first.)

The migration to microservices makes sense when the cost of that simplicity — shared database contention, inability to scale selectively, coupled deployments, cascading failures — exceeds the cost of distributed systems complexity.

That crossover point is real, and when you hit it, the architectural investment pays for itself in deployment velocity, operational resilience, and the ability to scale exactly what needs scaling — nothing more.

For the exact playbook on how to execute this migration safely — including the 3-phase Strangler Fig pattern, Debezium CDC pipelines, and bidirectional sync — read The Zero-Downtime Blueprint: Moving from Magento to Microservices.

If you are still evaluating team capability before a migration, read our core guide on Magento Development in Vietnam: 2026 Hiring Guide. For the destination stack, explore the complete Go Microservices Architecture: Production Guide.

FAQ

Q: When should you migrate from Magento to microservices?
A:
Migrate from Magento to microservices when you have 5+ engineers with dedicated DevOps capacity, you are hitting Magento's scaling ceiling (slow deploys, shared database contention, module conflicts blocking independent team deployments), and you require fine-grained fault isolation — where a failure in one domain (e.g., reviews, loyalty) should never bring down the entire checkout flow. Do not migrate if your team is under 5 engineers, your traffic is manageable on a well-tuned Magento stack, or you do not have the operational maturity to run Kubernetes in production. The operational overhead of 21+ services is real.

Q: What is the Strangler Fig pattern for Magento migration?
A:
The Strangler Fig pattern is an incremental migration strategy where new microservices gradually wrap around the legacy Magento monolith, intercepting traffic domain by domain until the monolith becomes a hollow shell. In practice: Phase 1 routes reads to new services while writes still hit Magento; Phase 2 migrates write APIs incrementally (Customer first, then Catalog, then Order) with bidirectional sync keeping Magento in sync; Phase 3 cuts over all traffic and keeps Magento as a hot standby for 30 days before terminating. No big-bang rewrite. Each phase is independently reversible with a feature flag toggle.

Q: What is the EAV schema problem in Magento?
A:
Magento's Entity-Attribute-Value (EAV) model stores product attributes across multiple tables (catalog_product_entity_varchar, catalog_product_entity_int, catalog_product_entity_decimal, etc.) instead of flat rows. Fetching a single product with 30 attributes requires joining 5+ tables. At 25,000+ SKUs under load, this becomes a measurable latency problem — especially for search and listing pages. During migration, this means you cannot do a naive SELECT * export; you need an ETL pipeline to flatten EAV data into the normalized schemas your new microservices expect. Debezium CDC handles ongoing delta sync after the initial ETL.

Q: How does the Saga pattern replace Magento's database transactions in microservices?
A:
Magento handles checkout as a synchronous database transaction: reserve stock, create order, capture payment — all in one BEGIN ... COMMIT. This works for a single database but breaks in a distributed system because a slow payment gateway response holds a database transaction open, consuming connection pool slots and cascading into connection exhaustion under load. The Saga pattern replaces this with local transactions per service and explicit compensating transactions on failure: if payment authorization fails after stock was reserved, a compensation message triggers release_reservation on the Warehouse service. No long-lived database locks, no connection pool exhaustion, and each failure case is explicitly handled rather than silently dropped.

Ready to Execute the Migration?

If you have decided to migrate — or are building the business case to get executive sign-off — the next step is the technical execution plan.

Zero-Downtime: Moving from Magento to Microservices →

That guide covers the three-phase Strangler Fig execution: the Read-Only Gateway, the Dual-Write sync layer, and the Full Cutover with hot standby — all without dropping a single order.

[AI] Optimizing vLLM Serving: AWQ, GPTQ, & GGUF | SLM Playbook

Tuấn Anh — Thu, 02 Jul 2026 13:31:02 +0000

Successfully training and aligning a Small Language Model (SLM) is only half the battle. In enterprise environments, deploying a model to production serving requires solving three major challenges: high request concurrency, low response latency, and minimized compute cost.

To achieve this, we must master model compression (Quantization) and high-performance serving configurations using vLLM—the state-of-the-art serving engine for LLMs.

This final article in The SLM Playbook series compares the technical attributes of AWQ, GPTQ, and GGUF quantization formats, details how to set up Dynamic LoRA serving to conserve VRAM, and outlines a resilient enterprise-grade serving architecture.

1. Comparing Quantization Formats: AWQ vs. GPTQ vs. GGUF

Quantization is the process of compressing model weights from 16-bit floating-point (FP16/BF16) to lower-bit integer representations (such as INT8 or INT4). This sharply reduces VRAM requirements and, because LLM decoding is largely memory-bandwidth bound, usually speeds up token generation as well.
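To make the idea concrete, here is a minimal round-to-nearest sketch of symmetric 4-bit quantization for a single group of weights. It is the plain baseline that AWQ and GPTQ improve on, not an implementation of either; the function names and the sample values are illustrative.

```python
from typing import List, Tuple


def quantize_int4(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric absmax quantization of one weight group to the range [-7, 7]."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # Fall back to 1.0 for all-zero groups.
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


group = [0.70, -0.33, 0.10, -0.02]
codes, scale = quantize_int4(group)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(group, restored))
```

Each weight is stored as a 4-bit code plus one shared scale per group, and the rounding error per weight is at most half a quantization step (`scale / 2`).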

┌──────────────────────────────────────────────────────────────┐
│                Quantization Format Comparison                │
├──────────────────┬──────────────────┬────────────────────────┤
│ Format           │ Primary Target   │ Technical Attributes   │
├──────────────────┼──────────────────┼────────────────────────┤
│ AWQ (Recommended)│ GPU Serving      │ Protects ~1% salient   │
│                  │                  │ channels by scaling    │
│                  │                  │ before quantization.   │
│                  │                  │ Retains accuracy.      │
├──────────────────┼──────────────────┼────────────────────────┤
│ GPTQ             │ GPU Serving      │ Calibration-based      │
│                  │                  │ linear quantization.   │
│                  │                  │ Minor accuracy loss.   │
├──────────────────┼──────────────────┼────────────────────────┤
│ GGUF             │ CPU / Edge       │ Supports dynamic layer │
│                  │                  │ offloading to host CPU │
│                  │                  │ RAM (via llama.cpp).   │
└──────────────────┴──────────────────┴────────────────────────┘

1.1. AWQ (Activation-aware Weight Quantization)

Not all weights in a neural network contribute equally to its output. The AWQ authors observed that protecting roughly 1% of the most salient weight channels, identified from activation statistics rather than from the weights themselves, preserves most of the model's capability.

Mechanism: Keeping those channels in FP16 would produce a hardware-unfriendly mixed-precision layout, so AWQ instead scales the salient channels up before quantization and folds the inverse scale into the preceding activations. All weights are then stored in 4-bit, but the salient channels suffer less quantization error.
Verdict: AWQ often yields lower perplexity (better accuracy) than GPTQ at 4-bit, including on reasoning tasks, while executing fast on NVIDIA GPUs using optimized CUDA kernels.

1.2. GPTQ (Generalized Post-Training Quantization)

GPTQ utilizes a calibration dataset to compute second-order weight influences (the Hessian matrix), adjusting remaining weights to compensate for quantization errors.

Verdict: Widely supported across all serving engines. However, for smaller models (under 8B parameters), GPTQ can occasionally introduce noticeable degradation on complex math or programming tasks.

1.3. GGUF (GPT-Generated Unified Format)

Developed by the open-source community surrounding llama.cpp, GGUF is a single-file model format optimized for mixed CPU/GPU execution.

Verdict: The standard for running models on local developer machines (MacBooks, laptops) or edge deployments lacking dedicated datacenter GPUs. It is not recommended for high-throughput enterprise backend clusters.

2. Designing a Dynamic LoRA Architecture

In enterprise deployments, different teams require distinct fine-tuned behaviors (e.g., accounting needs JSON invoice classification, while engineering needs code debugging).

Hosting a separate model instance per team multiplies GPU cost linearly with the number of fine-tuned variants. vLLM's Dynamic LoRA Serving resolves this issue.

                   ┌────────────────┐
                   │  User Request  │
                   └───────┬────────┘
                           │
             [Determine Target Adapter via Headers]
             [e.g., 'X-Lora-Adapter: accounting']
                           │
                           ▼
         ┌──────────────────────────────────────┐
         │        vLLM Server Container         │
         │                                      │
         │        ┌───────────────────┐         │
         │        │   Base Model 8B   │         │ (Shared in VRAM)
         │        │   (FP16 or AWQ)   │         │
         │        └─────────┬─────────┘         │
         │                  │                   │
         │     ┌────────────┼────────────┐      │
         │     ▼            ▼            ▼      │ (Loaded dynamically
         │ ┌───────┐    ┌───────┐    ┌───────┐  │  on-demand)
         │ │Lora A │    │Lora B │    │Lora C │  │
         │ └───────┘    └───────┘    └───────┘  │
         └──────────────────────────────────────┘

2.1. How Dynamic LoRA Operates

vLLM loads a single, shared base model (e.g., Llama 3 8B AWQ) into GPU VRAM. When a request specifies a target LoRA adapter, vLLM dynamically loads the adapter parameters from disk or system RAM and computes the delta weight adjustment ($\Delta W$) on-the-fly during the forward pass.

Advantage: Compared with hosting one full model copy per task, sharing the base weights can reduce total VRAM by up to 90%. Dozens of fine-tuned task-specific adapters can be served simultaneously on a single 24GB GPU.
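A back-of-the-envelope calculation shows why adapters are so cheap to keep around. The sketch assumes a rank-16 adapter applied to all seven attention and MLP projections of Llama 3 8B, stored in 16-bit; other ranks or target modules change the numbers proportionally.

```python
# Llama 3 8B shapes: hidden size 4096, MLP size 14336,
# 8 KV heads of dimension 128 (= 1024), 32 transformer layers.
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 4096, 14336, 1024, 32
RANK = 16  # Assumed adapter rank for this example.

# (in_features, out_features) of each projection targeted by the adapter.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A is rank x in_features and B is out_features x rank."""
    return rank * (in_features + out_features)


adapter_params = LAYERS * sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
adapter_mb = adapter_params * 2 / 1e6  # 2 bytes per 16-bit weight
base_gb = 8e9 * 2 / 1e9                # Nominal 8B parameters in FP16
adapter_share = adapter_params * 2 / (8e9 * 2)
```

Under these assumptions one adapter holds about 42 million parameters, roughly 84 MB, or about 0.5% of the FP16 base weights, so each additional task costs megabytes rather than gigabytes.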

2.2. vLLM Production Command for Dynamic LoRA

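# NOTE: --quantization awq expects an AWQ-quantized checkpoint; point --model at your
# AWQ export rather than the original FP16 repository.
# --lora-modules registers each adapter as name=path (the path below is a placeholder).
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --quantization awq \
    --enable-lora \
    --lora-modules accounting-lora-adapter=/path/to/accounting-lora-adapter \
    --max-loras 8 \
    --max-lora-rank 16 \
    --lora-dtype auto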

When invoking the API, clients select an adapter by passing its registered name as the model field of the request payload:

{
  "model": "accounting-lora-adapter",
  "messages": [
    {"role": "user", "content": "Analyze this invoice..."}
  ]
}

3. Production Serving Benchmarks

The following benchmarks demonstrate the memory and throughput gains achieved on a single NVIDIA A10G (24GB VRAM) running Llama 3 8B:

┌──────────────────────────────────────────────────────────────┐
│                     Serving Benchmark Results                │
├──────────────────┬──────────────────┬────────────────────────┤
│ Format           │ Throughput (tps) │ Peak VRAM Usage        │
├──────────────────┼──────────────────┼────────────────────────┤
│ FP16 (Baseline)  │ 32 tokens/sec    │ 16.2 GB (Low batch     │
│                  │                  │ limits, prone to OOM)  │
├──────────────────┼──────────────────┼────────────────────────┤
│ GPTQ 4-bit       │ 74 tokens/sec    │ 6.4 GB (Supports high  │
│                  │                  │ concurrency batches)   │
├──────────────────┼──────────────────┼────────────────────────┤
│ AWQ 4-bit        │ 78 tokens/sec    │ 6.1 GB (15% faster     │
│                  │                  │ TTFT than GPTQ)        │
└──────────────────┴──────────────────┴────────────────────────┘

Takeaway: Compressing your model to AWQ 4-bit saves over 60% of GPU VRAM, increasing sustained serving throughput by 2.4x compared to FP16. This provides a resilient foundation for serving high-concurrency enterprise workloads.
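The percentages in the takeaway follow directly from the table. A short check, using only the figures reported above:

```python
# Figures copied from the benchmark table (single A10G, Llama 3 8B).
fp16 = {"tps": 32, "vram_gb": 16.2}
gptq = {"tps": 74, "vram_gb": 6.4}
awq = {"tps": 78, "vram_gb": 6.1}


def vram_saving(baseline: dict, candidate: dict) -> float:
    """Fraction of baseline VRAM that the candidate format frees up."""
    return 1 - candidate["vram_gb"] / baseline["vram_gb"]


def speedup(baseline: dict, candidate: dict) -> float:
    """Throughput of the candidate relative to the baseline."""
    return candidate["tps"] / baseline["tps"]


awq_saving = vram_saving(fp16, awq)    # about 0.62
awq_speedup = speedup(fp16, awq)       # about 2.44
gptq_saving = vram_saving(fp16, gptq)  # about 0.60
gptq_speedup = speedup(fp16, gptq)     # about 2.31
headroom_gb = 24 - awq["vram_gb"]      # VRAM left for KV cache and adapters
```

The roughly 18 GB of headroom on a 24 GB card is what allows larger batches and more concurrent sequences to fit in the KV cache.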

Summary of The SLM Playbook

Our 6-part playbook equips you with the complete workflow needed to customize and serve Small Language Models within your private enterprise infrastructure:

Architecture Design: Balance cost and capability by deploying local SLMs alongside cloud frontier models via a Hybrid Router Gateway.
Data Engineering: Mitigate memorization and clean instruction data using NEFTune noise injection and SemDeDup semantic pruning.
High-Performance Training: Execute LoRA/QLoRA training loops using Axolotl and Unsloth to optimize GPU utilization.
Knowledge Distillation: Distill structured reasoning paths (Chain of Thought) from deep models like DeepSeek-R1.
Preference Alignment: Align outputs and safety parameters using sample-efficient DPO and GRPO reinforcement learning.
Enterprise Serving: Quantize models to 4-bit AWQ and serve multiple tasks concurrently via Dynamic LoRA on vLLM.

By combining hardware optimization with targeted alignment, your team can deploy private, highly optimized models that guarantee data privacy at a fraction of the cost of public APIs.

Access the complete source code and configs on the SLM Playbook Home Page.

This post was originally published on my blog at Optimizing vLLM Serving: AWQ, GPTQ, & GGUF | SLM Playbook.

Hi, I'm Lê Tuấn Anh (vesviet) 👋
I am a Senior Go Backend Architect & Distributed Systems Engineer with 17+ years of experience building high-traffic platforms (25M+ requests/month).
If you enjoyed this deep-dive, let's connect on LinkedIn or explore my consulting services at tanhdev.com/hire.

[AI] Practical QLoRA Fine-tuning: Axolotl & Unsloth | SLM Playbook

Tuấn Anh — Tue, 30 Jun 2026 23:39:00 +0000

Author's Note: This is Part 3 of the SLM & GenAI Playbook. You can read Part 2: SFT Data Engineering or jump to Part 4: Knowledge Distillation.

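Full-parameter fine-tuning of a large language model is a luxury. Even for an 8B model like Llama 3, updating all weights with Adam in mixed precision needs roughly 16 bytes per parameter for weights, gradients, and optimizer states. That is well over 100 GB before activations, which rules out a single GPU and puts the job beyond the reach of many mid-sized teams or startups.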

To resolve these hardware barriers, Parameter-Efficient Fine-Tuning (PEFT) methods were developed, with LoRA and QLoRA emerging as the dominant paradigms. They allow developers to train multi-billion parameter models on a single consumer GPU (like an RTX 3090, 4090, or A10G) while maintaining near-zero performance degradation compared to full tuning.

This article dissects the mathematics behind low-rank adaptation, details how to build production-grade Axolotl configurations, and uses Unsloth to accelerate training loops.

1. LoRA: Low-Rank Adaptation Matrix Decomposition

During domain-specific fine-tuning (e.g., text-to-SQL or medical terminology), parameter weight updates do not occupy the full parameter space; they exhibit a very low intrinsic rank. Instead of updating the massive original weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA freezes $W_0$ and models the weight update $\Delta W$ as the product of two extremely low-rank matrices $B$ and $A$ of rank $r$ (with $r \ll \min(d, k)$):

$$\Delta W = B A$$

Where:

$W_0$ is the frozen pre-trained weight matrix (no gradient updates).
$B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are the trainable adapter matrices.
$r$ is the rank hyperparameter (typically $r \in [8, 64]$).
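The saving is easy to quantify for a single matrix. The example below assumes a square $4096 \times 4096$ projection and $r = 16$, values chosen for illustration:

```python
def full_update_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k matrix is updated."""
    return d * k


def lora_update_params(d: int, k: int, r: int) -> int:
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)


d = k = 4096
r = 16
full = full_update_params(d, k)     # 16,777,216
lora = lora_update_params(d, k, r)  # 131,072
ratio = lora / full                 # 0.0078125, i.e. under 1%
```

The trainable parameter count grows linearly in $r$, so doubling the rank doubles the adapter size while the frozen matrix stays untouched.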

        LoRA Layer Forward Pass:

             Input x 
             ┌───┴───┐
             │       │
             ▼       ▼
          ┌─────┐ ┌─────┐
          │     │ │  A  │ (Rank r, Gaussian initialized)
          │ W_0 │ └─────┘
          │     │    │ (r-dimensional vector)
          │(Frozen)  ▼
          │     │ ┌─────┐
          │     │ │  B  │ (Rank r, Zero initialized)
          └─────┘ └─────┘
             │       │
             ▼       ▼
            h_W     h_LoRA * (alpha / r)
             └───┬───┘
                 ▼
              Output y

1.1. LoRA Forward Pass Equation

For a given input $x$, the output activation $y$ is computed as:

$$y = W_0 x + \frac{\alpha}{r} \Delta W x = W_0 x + \frac{\alpha}{r} B A x$$

Where:

$\alpha$ is a constant scaling factor that controls the adapter's influence over the base model weights.
At the start of training, $A$ is randomly initialized via a Gaussian distribution, and $B$ is initialized to zero. Consequently, $\Delta W = B A = 0 \cdot A = 0$, ensuring the model's baseline behavior is completely unchanged at step zero.
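The forward pass and the zero-initialization property can be checked with a few lines of dependency-free Python. This is a toy sketch with tiny dimensions, not how a framework implements it; the shapes and the random seed are arbitrary.

```python
import random
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w0: Matrix, a: Matrix, b: Matrix, x: Vector, alpha: float) -> Vector:
    """Compute W0 x + (alpha / r) * B (A x) without forming the d x k product B A."""
    r = len(a)  # Rank = number of rows of A.
    base = matvec(w0, x)
    low_rank = matvec(b, matvec(a, x))
    return [h + (alpha / r) * delta for h, delta in zip(base, low_rank)]


rng = random.Random(0)
d, k, r, alpha = 4, 3, 2, 4.0
w0 = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]
a = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # Gaussian init
b = [[0.0] * r for _ in range(d)]                               # Zero init
x = [1.0, -2.0, 0.5]

y_step0 = lora_forward(w0, a, b, x, alpha)
```

With $B = 0$ the adapter branch contributes nothing, so `y_step0` equals the frozen model's output `matvec(w0, x)`. Once training moves $B$ away from zero, the output starts to diverge from the base model.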

2. QLoRA: Maximizing VRAM Efficiency via Double Quantization

Introduced by Tim Dettmers et al. in 2023, QLoRA (Quantized Low-Rank Adaptation) takes memory efficiency a step further by quantizing the base model weights $W_0$ to a highly compressed 4-bit representation, while keeping the active LoRA adapter weights in 16-bit precision.

QLoRA relies on three key mathematical and systems innovations:

2.1. NormalFloat 4 (NF4) Data Type

Neural network weights naturally follow a zero-centered normal distribution. Standard linear quantization schemes (like INT4) allocate quantization bins uniformly, wasting precision at the sparse tails of the distribution.

NF4 addresses this by choosing non-uniform quantization levels from the quantiles of the normal distribution, so that each bin holds an equal share of the expected weights:

$$\int_{q_i}^{q_{i+1}} \mathcal{N}(x; 0, 1)\, dx = \text{const}$$

This preserves more of the information in the original 16-bit weights than FP4 or INT4 does at the same cost of 4 bits per parameter.
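The equal-mass idea can be sketched with the standard library. This is a simplified illustration: it takes the midpoint quantile of each of the 16 equal-probability bins and rescales to $[-1, 1]$, whereas the actual NF4 data type uses a slightly asymmetric construction so that zero is exactly representable.

```python
from statistics import NormalDist
from typing import List


def equal_mass_levels(bits: int = 4) -> List[float]:
    """Midpoint quantile of each equal-probability bin of N(0, 1), scaled to [-1, 1]."""
    n = 2 ** bits
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    quantiles = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]


levels = equal_mass_levels()
centre_gap = levels[8] - levels[7]   # Spacing around zero, where weights are dense
tail_gap = levels[15] - levels[14]   # Spacing in the sparse tail
```

The levels cluster near zero and spread out toward the tails, which is the opposite of what a uniform INT4 grid does and matches where the weights actually are.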

2.2. Double Quantization (DQ)

In standard quantization, weight blocks are scaled using a 32-bit float constant. With a block size of 64, this scaling constant introduces an overhead of $32/64 = 0.5$ bits per parameter.

Double Quantization quantizes these scaling constants themselves from 32-bit floats to 8-bit floats with a block size of 256.

Impact: Reduces scaling overhead from $0.5$ bits/parameter to about $0.127$ bits/parameter ($8/64 + 32/(64 \cdot 256)$), saving roughly 0.37 GB of VRAM on an 8B model (about 3 GB on a 65B model).
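The overhead arithmetic can be worked through explicitly. The block sizes and bit widths are the ones stated above; the 8B and 65B parameter counts are nominal round figures.

```python
def overhead_bits_per_param(const_bits: int, block_size: int) -> float:
    """Bits of scaling-constant storage amortized over one block of weights."""
    return const_bits / block_size


# Single quantization: one FP32 constant per block of 64 weights.
single = overhead_bits_per_param(32, 64)

# Double quantization: one 8-bit constant per 64 weights, plus one FP32
# second-level constant shared by 256 of those first-level constants.
double = overhead_bits_per_param(8, 64) + overhead_bits_per_param(32, 64 * 256)

saved_bits = single - double


def saved_gb(n_params: float) -> float:
    """VRAM saved by double quantization, in gigabytes (10^9 bytes)."""
    return n_params * saved_bits / 8 / 1e9


saved_8b = saved_gb(8e9)
saved_65b = saved_gb(65e9)
```

This gives 0.5 bits per parameter before and about 0.127 after, a saving of about 0.373 bits per parameter: roughly 0.37 GB for 8 billion parameters and roughly 3 GB for 65 billion.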

2.3. Paged Optimizers

During training with long sequence lengths or large batches, sudden gradient allocation spikes can exceed physical VRAM limits, triggering OOM crashes.

Paged Optimizers leverage CUDA Unified Memory to automatically swap (page) optimizer states between GPU VRAM and CPU RAM during peak memory phases, gracefully slowing down training rather than crashing.

3. Hands-On: Configuring Axolotl for QLoRA

Axolotl is a robust framework for LLM fine-tuning, offering native integration with FlashAttention-2, DeepSpeed, and PyTorch FSDP.

Here is a complete production-ready qlora_llama3_8b.yml configuration optimized for a single NVIDIA A10G (24GB VRAM):

# Model & Training Mode Config
base_model: meta-llama/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: PreTrainedTokenizerFast

# Enable QLoRA (4-bit NF4 Quantization)
load_in_8bit: false
load_in_4bit: true
gptq: false

# Precision settings
bf16: true
fp16: false
tf32: true

# LoRA Adapter Configuration
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

# Dataset Configurations
datasets:
  - path: ./temp_cleaned_dataset.jsonl
    type: alpaca
    shards: 10
dataset_prepared_path: ./last_run_prepared
val_set_size: 0.05
output_dir: ./lora-llama3-8b-output

# Memory & Speed Optimizations
sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true
flash_attention: true

# Hyperparameters
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0002
weight_decay: 0.01
max_grad_norm: 1.0

# Checkpointing & Logs
save_steps: 100
eval_steps: 100
logging_steps: 10

4. Accelerating Loops: 3x Speedup with Unsloth

While Axolotl is highly configurable, standard PyTorch backward passes for attention layers leave performance on the table. Unsloth rewrites the attention and MLP backward steps in raw OpenAI Triton, with speedups of up to 3x and memory reductions of up to 60%, depending on model and sequence length.

Complete Python script to execute QLoRA using Unsloth:

import torch
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

max_seq_length = 4096 # Limit context length to optimize speed on 24GB GPUs
dtype = None # Auto-detect (Float16 or Bfloat16)
load_in_4bit = True # Enable 4-bit quantization

# 1. Initialize model and tokenizer via Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Meta-Llama-3-8B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# 2. Add optimized LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 32,
    lora_dropout = 0, # Unsloth is optimized for dropout = 0
    bias = "none",
    use_gradient_checkpointing = "unsloth", # Memory-optimized gradient checkpointing
    random_state = 3407,
)

# 3. Format SFT Prompts (Alpaca style)
alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    outputs      = examples["output"]
    texts = []
    for inst, out in zip(instructions, outputs):
        text = alpaca_prompt.format(inst, out) + tokenizer.eos_token
        texts.append(text)
    return { "text" : texts }

# Load the semantically deduplicated dataset from Part 2
dataset = load_dataset("json", data_files="temp_cleaned_dataset.jsonl", split="train")
dataset = dataset.map(formatting_prompts_func, batched = True)

# 4. Setup SFT Trainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Set to True to pack short sequences and speed up training
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 120, # Number of training steps for test run
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

# Execute training run
trainer_stats = trainer.train()

# 5. Save model adapter weights
model.save_pretrained("lora_model_adapter")
tokenizer.save_pretrained("lora_model_adapter")
print("Training complete! Model saved.")

5. Merging LoRA Weights for Serving

Fine-tuning via LoRA outputs a directory of adapter weights (typically 50MB - 500MB). To run high-performance inference serving with engines like vLLM, you should merge these adapter matrices back into the 16-bit base model weights.

Python Script to Merge Weights:

from unsloth import FastLanguageModel

# Load the base model and model adapter in native 16-bit
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Meta-Llama-3-8B-Instruct",
    max_seq_length = 4096,
    dtype = None,
    load_in_4bit = False, # Must be False to export back to native 16-bit float
)
model.load_adapter("lora_model_adapter")

# Execute weights merge and save to disk
print("Merging weights and saving to disk...")
model.save_pretrained_merged("merged_model_fp16", tokenizer, save_method = "merged_16bit")
print("Merge complete! Ready for vLLM serving.")

The output in merged_model_fp16 is a standalone 16-bit Hugging Face model directory ready to be loaded by vllm serve.
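Conceptually, the merge folds the scaled low-rank product into the base matrix, $W_{merged} = W_0 + \frac{\alpha}{r} B A$, so inference needs no separate adapter branch. The toy example below verifies that equivalence on a 2x2 matrix; it illustrates the math only and is not what `save_pretrained_merged` does internally.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix-matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def merge_lora(w0: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Return W0 + (alpha / r) * B A, where r is the number of rows of A."""
    r = len(a)
    delta = matmul(b, a)
    return [[w + (alpha / r) * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


w0 = [[1.0, 2.0], [3.0, 4.0]]
a = [[0.5, -1.0]]    # r = 1, k = 2
b = [[2.0], [0.0]]   # d = 2, r = 1
alpha = 2.0
x = [1.0, 1.0]

merged = merge_lora(w0, a, b, alpha)
y_merged = matvec(merged, x)

# Unmerged path: base output plus the scaled adapter branch.
adapter_out = matvec(b, matvec(a, x))
y_adapter = [h + (alpha / len(a)) * d for h, d in zip(matvec(w0, x), adapter_out)]
```

Both paths produce the same output, which is why a merged checkpoint behaves identically to base plus adapter while being served as an ordinary dense model.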

Next Chapter

Supervised Fine-Tuning teaches your model formatting styles and conversational behavior. However, complex multi-step logical operations (reasoning) benefit from training on explicit, structured reasoning traces.

In Part 4: Task & Knowledge Distillation, we explore how to extract reasoning traces (Chain of Thought - CoT) from larger teacher models like DeepSeek-R1 into small student models.

This post was originally published on my blog at Practical QLoRA Fine-tuning: Axolotl & Unsloth | SLM Playbook.

[AI] Context Engineering for AI Coding: AGENTS.md, Cursor Rules & RAG

Tuấn Anh — Mon, 29 Jun 2026 23:42:03 +0000

In 2025, METR — an AI safety and capability research organization — ran a rigorous randomized controlled trial. Sixteen experienced open-source developers worked on 246 real-world tasks, each randomly assigned to either use AI coding tools freely or not at all.

The result was counterintuitive: developers using AI tools were 19% slower on complex tasks.

Before the study, those same developers predicted AI would make them 24% faster. After completing the experiment, they still believed AI had sped them up (by about 20%, by their own estimate), even though the measurements showed the opposite.

The finding did not make headlines for the reason people assumed. The headline was not "AI is useless." The headline was this: the bottleneck is not model quality. It is context quality.

The developers who slowed down were spending significant time on what researchers call "verification overhead" and "workflow friction" — the effort required to correct AI output that did not understand the architectural constraints, naming conventions, existing utility functions, and established patterns of the codebase they were working in. The AI was generating code. It was generating code for an imaginary system.

This part of the series is about solving that problem.

Series Orientation: This article is Part 2 of the AI Code Review & Vibe Coding series, detailing the context engineering practices needed to align AI generation with codebase conventions. For the preceding guide on initial tools and non-technical vibe coding, see Part 1 — Vibe Coding & The Production Wall.

Scope note: This article focuses specifically on code-review-level context engineering — the practices individual engineers and teams use to make AI agents produce reviewable, architecturally correct code on an existing codebase. If you are interested in platform-level context infrastructure — building an organizational AI Platform layer, internal RAG systems at scale, or enterprise knowledge management — see Context Engineering: Domain-Driven Design for AI in the AI-Driven Playbook series.

The Fundamental Problem: AI Operates on a Blank Slate

Every time you open a new session with an AI coding tool, you begin from zero. The model knows nothing about:

Your architectural decisions and why you made them
Which patterns you have standardized on and which you are migrating away from
What your team calls a "service" versus a "handler" versus a "controller"
Which utility functions already exist in your shared library
What a "successful" PR looks like in your codebase — what passes review and what gets rejected

Without this information, the AI operates like a very fast, very confident junior developer who has never seen your codebase before and will reproduce whatever pattern was most common in its training data — not whatever pattern is correct for your system.

Context engineering is the discipline of structuring and delivering organizational knowledge to AI agents in a form they can reliably use. It is, as the industry consensus now describes it, the "DevOps moment" for AI — the operational layer that separates experimental AI assistance from reliable production-grade AI collaboration.

The Context Hierarchy: From Files to RAG Pipelines

Modern AI coding environments support context at multiple layers. Understanding the hierarchy is the foundation of any effective context strategy.

Layer 1: Rule Files (Always-On, Zero Overhead)

Rule files are plain-text configuration files that are automatically injected into every AI interaction. They are the most important and most underutilized form of context.

AGENTS.md (or CLAUDE.md / GEMINI.md)

These files — stored at the root of your repository — are read by AI agents before they begin any task. They function as the agent's standing orders: architectural constraints, behavioral standards, and explicit prohibitions that apply to everything the agent does.

A well-structured AGENTS.md covers:

# Project Architecture
This is a Kratos v2 microservice using Clean Architecture.
Layer rules:
- api/ = contracts only (proto + generated code)
- internal/service/ = adapter layer only, no business logic
- internal/biz/ = business logic, NO direct database calls
- internal/data/ = persistence only, GORM + PostgreSQL

# Mandatory Standards
- All context must propagate through function parameters
- Use errgroup for managed goroutines only
- SQL queries must use parameterized inputs — NEVER string concatenation
- Secrets come from environment variables or Kratos Config — NEVER hardcode

# What NOT To Do
- Do not use global state
- Do not expose raw database errors to HTTP/gRPC responses
- Do not create new patterns without checking internal/util first

The specificity is the point. A generic instruction like "follow clean architecture" produces inconsistent results. A specific instruction like "the biz layer must never import gorm.DB directly" is checkable, so it produces far more consistent ones.

Cursor Rules (.cursorrules)

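Cursor's rule files work similarly to AGENTS.md but are native to the Cursor IDE. The legacy .cursorrules file is free-form text at the repository root, while newer project rules live under .cursor/rules/ as .mdc files whose front matter can scope a rule to file patterns. In both cases the body is natural-language guidance that the model reads, not a set of configuration keys that the IDE enforces. A scoped rule might look like this (an illustrative sketch; check the Cursor documentation for the front-matter fields your version supports):

---
description: Go Microservice Standards
globs: **/*.go
alwaysApply: false
---

# Security
- Never hardcode secrets
- Always use parameterized queries
- Do not introduce global state

# Architecture
- Respect the layer boundaries defined in AGENTS.md
- Propagate context.Context through function parameters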

The practical effect: your AI assistant now operates with your standards embedded, not as an afterthought you patch into every prompt.

How Rule Files Prevent Architectural Leakage (A Go / Kratos Example)

Consider a request: "Retrieve a user profile by email in the service layer."

Without a Rule File (AGENTS.md): The AI will often write a GORM query directly inside the adapter service layer, bypassing the Clean Architecture layering:

// File: internal/service/user.go
func (s *UserService) GetProfileByEmail(ctx context.Context, req *pb.GetProfileReq) (*pb.GetProfileReply, error) {
    var user biz.User
    // VIOLATION: Direct database access leaking into the service layer
    if err := s.db.WithContext(ctx).Where("email = ?", req.Email).First(&user).Error; err != nil {
        return nil, err
    }
    return &pb.GetProfileReply{Name: user.Name, Email: user.Email}, nil
}

With a Rule File (AGENTS.md): The AI enforces layer isolation, routing GORM access exclusively through the persistence domain (repository) and business use case:

// File: internal/service/user.go
func (s *UserService) GetProfileByEmail(ctx context.Context, req *pb.GetProfileReq) (*pb.GetProfileReply, error) {
    // CORRECT: Service calls the biz layer orchestrator (UseCase)
    user, err := s.userUseCase.FindByEmail(ctx, req.Email)
    if err != nil {
        return nil, err
    }
    return &pb.GetProfileReply{Name: user.Name, Email: user.Email}, nil
}

Layer 2: Session Management (Active Context Control)

Even with rule files in place, long sessions degrade. This is the "context rot" phenomenon: as a session accumulates failed attempts, corrected errors, and discarded planning notes, the signal-to-noise ratio in the context window drops. The model may prioritize recent noise over foundational constraints.

The Fresh Session Strategy

High-performing engineering teams treat AI sessions like stateless functions: one distinct task per session. When you complete a bug fix, close the session. When you begin a new feature, open a fresh one. The operational rule: task boundaries are session boundaries.

Structured Handovers

When a session grows long before the task is complete, perform a structured handover:

Ask the AI to summarize: "What decisions have we made? What is the current state? What remains to be done?"
Capture that output in a PLAN.md or HANDOVER.md file in your project directory
Open a fresh session and load the summary alongside your core rule files

This eliminates context rot while preserving all meaningful progress.

Compaction Commands

Several coding agents include compaction commands (for example, /compact in Claude Code or /summarize in Cursor). Use them proactively when a session runs long, before the model hits its context limit and before performance degrades. A compacted summary is a much higher-quality input than an accumulating stream of raw conversation.

Layer 3: Repository Indexing (Selective Context Injection)

Rule files establish standards. Session management controls noise. Repository indexing solves a different problem: giving the AI accurate knowledge of what already exists in your codebase.

The N+1 Discovery Problem

Without repository context, AI agents routinely implement functions that already exist. They create new database tables that duplicate existing ones. They define error types that collide with established patterns. They import packages that violate your dependency graph. Not because they are incapable of doing better — because they do not know what already exists.

Manual Selection vs. Full-Repo Scanning

Most AI coding tools offer the ability to scan an entire repository automatically. This sounds valuable and is often counterproductive. A large codebase injected wholesale into context adds significant noise — irrelevant files, outdated patterns, deprecated modules. The principle: manually select only the files directly relevant to the task.

For a task modifying user authentication, the relevant context is:

The authentication service interface
The existing user repository implementation
The session management middleware
The relevant error type definitions

Not the entire codebase.

Semantic Memory Banks

More sophisticated teams maintain curated "memory bank" files — structured markdown documents that describe the codebase's architecture, key patterns, and important decisions in a form optimized for AI consumption:

# Memory Bank: Authentication Domain

## Architecture
- Auth service handles JWT issuance and validation
- User identity stored in PostgreSQL via GORM, users table
- Sessions use Redis with 24h TTL (see internal/data/session_repo.go)
- MFA implemented via TOTP (internal/service/mfa_service.go)

## Key Patterns
- All auth errors return domain errors, never raw DB errors
- Rate limiting is middleware-level (internal/middleware/rate_limiter.go)
- Refresh tokens are hashed before storage (see HashToken in internal/util/crypto.go)

## Common Mistakes to Avoid
- Do NOT check password directly — always use bcrypt.CompareHashAndPassword
- Do NOT log token values — only log token IDs
- Do NOT implement new crypto — use internal/util/crypto.go exclusively

These memory banks are updated when significant architectural decisions are made and committed to the repository alongside code.

Layer 4: RAG Pipelines (Enterprise-Scale Context)

For large engineering organizations — those with hundreds of services, mature documentation, and complex architectural standards — static rule files are insufficient. The relevant context for any given task changes too rapidly and exists in too many places to manage manually.

Retrieval-Augmented Generation (RAG) for code context works by:

Indexing your codebase, Architecture Decision Records (ADRs), runbooks, and internal documentation into a vector database
When an AI agent begins a task, automatically querying that index for the most semantically relevant context
Injecting retrieved context into the session alongside the task description

The operational result: an AI agent working on a payments feature automatically retrieves the relevant payment service interfaces, the ADR explaining why you chose the current transaction model, and the runbook for the payment provider integration — without the engineer manually curating that context.
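The retrieval step described above can be sketched as a nearest-neighbour search over embeddings. The example below is a toy illustration only: the three-dimensional vectors are made up, the Doc type and topK function are hypothetical names, and a real pipeline would obtain embeddings from a model and store them in a vector database.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Doc is an indexed chunk of organizational knowledge (hypothetical shape).
type Doc struct {
	Title     string
	Embedding []float64 // produced by an embedding model in a real pipeline
}

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK returns the k documents most similar to the query embedding.
func topK(query []float64, docs []Doc, k int) []Doc {
	ranked := append([]Doc(nil), docs...) // copy so the caller's slice is untouched
	sort.SliceStable(ranked, func(i, j int) bool {
		return cosine(query, ranked[i].Embedding) > cosine(query, ranked[j].Embedding)
	})
	if k > len(ranked) {
		k = len(ranked)
	}
	return ranked[:k]
}

func main() {
	docs := []Doc{
		{"ADR: payment transaction model", []float64{0.9, 0.1, 0.0}},
		{"Runbook: payment provider", []float64{0.8, 0.2, 0.1}},
		{"Guide: avatar uploads", []float64{0.0, 0.1, 0.9}},
	}
	query := []float64{1, 0, 0} // toy embedding of "add a payments feature"
	for _, d := range topK(query, docs, 2) {
		fmt.Println(d.Title)
	}
}
```

The retrieved titles would then be expanded into their full text and injected into the session next to the task description.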

ADRs as Machine-Readable Judgment

Architecture Decision Records deserve special attention. When committed in a structured format and indexed into a RAG pipeline, ADRs transform from static documentation into active constraints:

# ADR-047: Event-Sourcing for Order State Transitions

## Status: Accepted (2025-03)
## Context
Direct state mutation of order records creates audit trail gaps and makes rollback scenarios complex.
## Decision
All order state transitions are implemented as events, appended to the events table.
The current state is derived by replaying events, not by direct column updates.
## Consequences
- New order state logic MUST add new event types, NOT modify existing ones
- Order queries require projection logic (see internal/projection/order_projector.go)
- Do NOT write directly to orders.status — always publish an OrderStateTransitioned event

An AI agent with access to this ADR will not generate direct UPDATE orders SET status = ? queries for order state changes. Without it, it almost certainly will.
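To make the ADR concrete, here is a minimal sketch of deriving an order's state by replaying events instead of reading a mutable status column. The OrderStateTransitioned fields, the initial "new" state, and the projectStatus function are assumptions made for illustration; the real projection in internal/projection/order_projector.go is not shown in the source.

```go
package main

import "fmt"

// OrderStateTransitioned is one row of the append-only events table
// (hypothetical shape; the real event schema is not shown in the ADR).
type OrderStateTransitioned struct {
	OrderID string
	From    string
	To      string
}

// projectStatus derives the current status by replaying events in order.
// It rejects a history whose transitions do not chain together.
func projectStatus(events []OrderStateTransitioned) (string, error) {
	status := "new" // assumed initial state
	for i, e := range events {
		if e.From != status {
			return "", fmt.Errorf("event %d: expected transition from %q, got %q", i, status, e.From)
		}
		status = e.To
	}
	return status, nil
}

func main() {
	events := []OrderStateTransitioned{
		{"o-1", "new", "paid"},
		{"o-1", "paid", "shipped"},
	}
	status, err := projectStatus(events)
	fmt.Println(status, err)
}
```

New behaviour is added by appending events, never by updating a status column in place, which is exactly the constraint the ADR asks the agent to respect.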

MCP Servers as Context Infrastructure

The Model Context Protocol (MCP), released by Anthropic and now adopted across the industry, provides a standardized interface for serving context to AI agents. Rather than building bespoke integrations for each AI tool, organizations build MCP servers — lightweight services that expose specific organizational knowledge (documentation, code patterns, ticket context) through a standard protocol.

The shift this enables: context infrastructure becomes a shared organizational asset rather than a per-engineer configuration problem.

The ContextOps Discipline

The industry now has a name for operating context infrastructure at organizational scale: ContextOps.

The operational loop is: Ingest → Validate → Structure → Serve → Audit → Refine.

Ingest: Pull from Confluence, Notion, ADR files, runbooks, Slack architectural discussions
Validate: Confirm content is accurate, up-to-date, and not contradictory
Structure: Format for AI consumption — clear headers, explicit "do/do not" sections, structured code examples
Serve: Via MCP servers, rule files, or RAG retrieval
Audit: Monitor whether AI outputs are adhering to the context (if the agent keeps making mistakes the context prohibits, the context is either wrong or unclear)
Refine: Update context based on what the audit reveals
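Part of the Audit step can be automated when a rule is specific enough to check mechanically. As one possible sketch, the function below uses Go's standard go/parser package to flag banned imports in a source file, which is how a rule like "the biz layer must never import GORM" could be verified in CI. The function name forbiddenImports and the sample source are hypothetical.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// forbiddenImports parses Go source and returns the imports whose path
// starts with any of the banned prefixes.
func forbiddenImports(filename, src string, banned []string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}
	var hits []string
	for _, imp := range file.Imports {
		path, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			return nil, err
		}
		for _, prefix := range banned {
			if strings.HasPrefix(path, prefix) {
				hits = append(hits, path)
				break
			}
		}
	}
	return hits, nil
}

func main() {
	// Sample file standing in for something under internal/biz/.
	src := "package biz\n\nimport (\n\t\"context\"\n\t\"gorm.io/gorm\"\n)\n"
	hits, err := forbiddenImports("user.go", src, []string{"gorm.io/"})
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("violations:", hits)
}
```

If a check like this keeps failing on AI-generated changes, that is a signal that the corresponding rule in the context is missing, unclear, or not being loaded.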

Organizations that treat context as throwaway configuration — updated ad hoc, inconsistently formatted, stored in unindexed markdown files — experience the METR result: AI that slows teams down. Organizations that treat context as infrastructure — versioned, validated, monitored — experience meaningfully different outcomes.

Practical Implementation: Starting From Zero

If your team does not have any context infrastructure today, the practical starting point is a three-step sequence:

Step 1: Write an AGENTS.md (one afternoon)

Focus on the highest-value content first:

Your layer structure and the key rules for each layer
Your top 5 "never do this" patterns (the ones your code reviewers catch most often)
Your top 5 "always use this instead" patterns (the shared utilities and established conventions)
Your security non-negotiables (no hardcoded secrets, parameterized queries, etc.)

Step 2: Establish session discipline (one team discussion)

Agree on task-based session boundaries. Add a compaction step to your team norms: before any session exceeds 20 substantive exchanges, compact and continue in a fresh session.

Step 3: Build your first memory bank (one sprint)

Pick your most critical domain — authentication, payments, whatever carries the highest risk. Document it in a memory bank format. Add a rule to your code review checklist: "Was the relevant memory bank file updated as part of this PR?"

The marginal improvement from even basic context infrastructure is significant. Teams that complete these three steps report substantially fewer AI-generated PRs that violate architectural standards, require significant rework, or introduce security issues the memory bank explicitly prohibits.

What Good Context Engineering Looks Like in Practice

Consider a task: "Implement a new endpoint to export user transaction history as a CSV."

Without context engineering, an AI agent will:

Create a new database query that joins users and transactions directly in the service layer
Generate all transactions at once rather than streaming (OOM risk at scale)
Write the CSV logic inline rather than using your existing internal/util/csv_writer.go
Skip the rate limiting middleware your architecture requires on all export endpoints

With effective context engineering, the same agent:

Knows that data access must go through the repository layer, not direct DB queries in the service
Retrieves the existing csv_writer.go and uses it rather than reimplementing
Finds the streaming pagination pattern used by internal/service/report_service.go and applies it
Applies the export endpoint rate limit from internal/middleware/ as specified in your AGENTS.md

This is not a different model. It is the same model with correct context. The output difference is substantial.

Connecting to the Review Pipeline

Context engineering is not a replacement for code review. It is a force multiplier on code review. When AI agents operate with accurate, comprehensive context, the output they produce:

Contains fewer architectural violations (the context prohibits them explicitly)
Reuses existing utilities more consistently (the context surfaces them)
Makes security mistakes less frequently (the context specifies the security requirements)

The result: human reviewers spend less time on pattern violations and architectural corrections, and more time on the genuinely high-value review tasks — logical correctness, edge case handling, and the security behaviors that require judgment rather than rule application.

Part 3 covers what those high-value review tasks are: the full taxonomy of AI-generated bugs, from the ones automated tools catch to the ones that only careful human review finds.

Next: Part 3 — AI Bug Taxonomy: From Silent Logic Failures to Slopsquatting

This post was originally published on my blog at Context Engineering for AI Coding: AGENTS.md, Cursor Rules & RAG.

[Go] Go pprof in Kubernetes: Remote CPU & Memory Profiling Without Restarting Pods

Tuấn Anh — Sun, 28 Jun 2026 11:39:07 +0000

Prerequisite: This guide covers how to profile and diagnose complex performance issues in production. If you are specifically dealing with unbounded goroutine growth, ensure you first understand the foundational concepts in Goroutine Leak Detection and Fix in Production Go Services.

Performance degradation in production is inevitable. When a Go microservice suddenly spikes to 90% CPU utilization or triggers an Out-Of-Memory (OOM) kill in Kubernetes, guessing the root cause by staring at the code is rarely effective. You need data.

Enter pprof.

Built directly into the Go standard library, pprof is an incredibly powerful diagnostic tool that samples your application’s execution to identify exactly where CPU time is being spent and where memory is being allocated. While many developers use pprof locally, doing it safely in a high-throughput production environment requires understanding sampling rates, overhead, and secure exposure.

This tutorial is a deep-dive into production-ready Go profiling. We will explore how to safely expose endpoints, compare CPU profiling against the Execution Tracer, dissect memory metrics (alloc_space vs inuse_space), and leverage advanced features like custom profiling labels and the experimental Go 1.26 goroutine leak profiler.

Safely Exposing pprof Endpoints in Production

Answer-first: Go pprof is the standard library profiling tool for diagnosing CPU usage, memory allocation, and goroutine leaks in production Go services, with safe exposure via internal HTTP endpoints and minimal performance overhead when configured correctly.

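The most common way to enable profiling is to import the net/http/pprof package. As a side effect of the import, this package registers its HTTP handlers on http.DefaultServeMux. Any server in the same process that serves the default mux, including a public one, will therefore expose /debug/pprof/ as well.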

// Exposing pprof safely on an internal port
// Purpose: Starts an isolated HTTP server dedicated to pprof endpoints
// ensuring that diagnostic data is not exposed to the public internet.
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // Automatically registers /debug/pprof/
)

func main() {
    // ... your main application logic ...

    // Run pprof in a background goroutine on a separate port. Binding to
    // localhost means only processes inside the pod can connect, which is
    // still enough for kubectl port-forward.
    go func() {
        log.Println("Starting pprof server on localhost:6060")
        if err := http.ListenAndServe("localhost:6060", nil); err != nil {
            log.Fatalf("pprof server failed: %v", err)
        }
    }()

    // block forever or wait for graceful shutdown
    select {}
}

Production Security and Overhead

Never expose /debug/pprof/ to the public internet. Exposing it can lead to information disclosure (revealing your source code structure) and Denial of Service (DoS) if an attacker repeatedly triggers expensive CPU profiles.

Is it safe to run in production?

Heap (Memory) Profiling: Extremely safe. It is enabled by default with negligible overhead, sampling on average one allocation for every 512 KB allocated (runtime.MemProfileRate).
CPU Profiling: Safe for short bursts. Running a 30-second CPU profile samples the stack at 100Hz and generally adds less than 2% overhead.
Block & Mutex Profiling: Disabled by default. Setting their rates to 1 (capturing every event) can add 5–20% overhead. Use them surgically.

Once exposed, you can capture a profile using the go tool pprof command from your local machine (via port-forwarding):

# Capture a 30-second CPU profile and open the interactive web UI
go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/profile?seconds=30"

Profiling Go Applications in Kubernetes (Without Restarting Pods)

The core challenge: pprof endpoints run inside a Kubernetes pod on localhost:6060. To reach them from your machine, you cannot connect directly — you need kubectl port-forward to bridge the network. No pod restart required.

Step 1 — Find the Pod Name

# List pods and find the one you want to profile
kubectl get pods -n production -l app=orders-service

# Example output:
# orders-service-7d9f4b8c6-xk9pz   1/1   Running   0   3h

Step 2 — Open a Port-Forward Tunnel

# Forward pod port 6060 to your local machine
# This does NOT restart the pod or affect production traffic
kubectl port-forward pod/orders-service-7d9f4b8c6-xk9pz 6060:6060 -n production

# Keep this terminal open. In a second terminal:
go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/profile?seconds=30"

Security note: port-forward uses an authenticated Kubernetes API server tunnel. No inbound rule change is needed. Never add a NodePort or LoadBalancer service just to expose pprof — that is a critical security mistake.

Step 3 — Lock Down the pprof Port with a NetworkPolicy

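A server bound to localhost:6060 is reachable only from inside the pod, which is all kubectl port-forward needs. If you instead bind to :6060 or 0.0.0.0:6060 (for example, so an in-cluster profiler can scrape it), any pod that can route to the pod IP can reach pprof. In that case, add a Kubernetes NetworkPolicy to restrict access. Once a policy selects a pod, all ingress that is not explicitly allowed is denied, so the policy must also allow the service's normal traffic:

# Allow pprof access only from pods with the "monitoring" label
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-pprof-access
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: orders-service
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: monitoring       # only observability pods may reach pprof
    ports:
    - protocol: TCP
      port: 6060
  # Keep application traffic open. Ports 8000 and 9000 are assumed here
  # (hypothetical HTTP and gRPC ports); replace them with your service's ports.
  - ports:
    - protocol: TCP
      port: 8000
    - protocol: TCP
      port: 9000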

Step 4 — Automate Profile Capture on OOM or High CPU

For recurring issues (OOM kills, CPU spikes at 03:00), manually running kubectl port-forward is too slow. One option is the open-source kubectl-prof project, a kubectl plugin that attaches a short-lived profiling job to a running pod on demand. It does not watch alert thresholds by itself, so trigger it from your alerting pipeline or an on-call runbook. Treat the commands below as a starting point and confirm the installation method and flags against the project's README for your version before running them:

# Install kubectl-prof (verify the installation method against the project README)
kubectl apply -f https://github.com/josepdcs/kubectl-prof/releases/latest/download/install.yaml

# Trigger a CPU profile remotely without port-forward:
kubectl prof orders-service-7d9f4b8c6-xk9pz --lang go --type cpu

This pattern gives you on-demand profiles with no pod restart and no always-on profiling overhead.

CPU Profiling vs. Execution Tracer (trace)

When a service is slow, the first instinct is to pull a CPU profile. But CPU profiles only tell you what the CPU is actively doing. If your service is slow because it is waiting (e.g., waiting for a database lock, blocked on channel I/O, or paused by the Garbage Collector), the CPU profile will look surprisingly empty.

When to use `pprof` (CPU Profile)

Use pprof when you have High CPU Utilization. It identifies "hot paths"—the loops, expensive algorithms, or massive JSON decoding blocks that are burning through clock cycles.

When to use `go tool trace` (Execution Tracer)

Use the tracer when you have High Latency but Low CPU Utilization.
The tracer hooks directly into the Go runtime and records an event log of every goroutine scheduling decision, syscall, and garbage collection pause.

# Capture a 5-second trace
curl -o trace.out "http://localhost:6060/debug/pprof/trace?seconds=5"

# View the trace in the browser
go tool trace trace.out

Overhead Warning: The Execution Tracer generates large files, and before Go 1.21 it could add 10–20% CPU overhead. Go 1.21 and later reduced that cost to roughly 1–2% for many applications, but trace size still grows quickly. Capture brief 1–5 second windows when actively debugging a latency spike rather than tracing continuously.

Memory Profiling: alloc_space vs inuse_space

Understanding the difference between allocation and retention is the biggest hurdle for engineers learning pprof. The heap profile tracks two fundamentally different metrics:

inuse_space (Retention): The amount of memory currently held by your application and not yet garbage collected. If this number keeps climbing under steady load, you likely have a Memory Leak.
alloc_space (Allocation Churn): The total amount of memory ever allocated over the lifetime of the program, even if it was immediately garbage collected. If this number is astronomically high, you have High GC Pressure, which consumes CPU cycles to constantly clean up short-lived objects.

Debugging Workflow

Scenario A: The OOM Killer (Finding Leaks)
If Kubernetes is killing your pod for exceeding memory limits, you want to look at inuse_space.

# Focus explicitly on retained memory
go tool pprof -inuse_space http://localhost:6060/debug/pprof/heap

Inside the interactive UI, type top to see the functions holding onto the most memory. Often, memory leaks in Go are actually goroutine leaks—a goroutine is blocked forever on a channel, keeping all of its local variables alive.

Scenario B: Optimizing CPU through Memory (Fixing Churn)
If your CPU usage is high and the CPU profile shows runtime.mallocgc near the top, your program is spending a large share of its time allocating and collecting memory.

# Focus explicitly on historical allocation volume
go tool pprof -alloc_space http://localhost:6060/debug/pprof/allocs

To fix this, you optimize by reducing allocations:

Pre-allocate slices: make([]int, 0, expectedCapacity) prevents multiple underlying array re-allocations as the slice grows.
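Use sync.Pool: Cache and reuse temporary objects (like bytes.Buffer or JSON encoders) to cut the number of new allocations. The pool does not bypass the GC: idle pooled objects can still be released during collection, so treat it as a cache rather than as permanent storage.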
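Both techniques above can be sketched in a few lines. The names bufPool, renderLine, and squares are hypothetical and exist only for this illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers. Pooled buffers may be dropped by the
// runtime during garbage collection, so New must always be able to make one.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// renderLine formats one record using a pooled buffer instead of allocating
// a fresh buffer on every call.
func renderLine(id int, name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a reused buffer may still hold old data
	defer bufPool.Put(buf)
	fmt.Fprintf(buf, "%d:%s", id, name)
	return buf.String() // String copies, so the result is safe after Put
}

// squares pre-allocates the result slice so append never has to grow it.
func squares(n int) []int {
	out := make([]int, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, i*i)
	}
	return out
}

func main() {
	fmt.Println(renderLine(7, "widget"))
	s := squares(5)
	fmt.Println(s, len(s), cap(s))
}
```

After a change like this, compare alloc_space profiles before and after to confirm that the hot path allocates less.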

Finding Goroutine Leaks (and Go 1.26 Features)

A standard way to check for goroutine leaks is to compare the baseline number of goroutines against the current number. If it steadily grows from 100 to 10,000 without traffic increasing, you have a leak.

curl -s "http://localhost:6060/debug/pprof/goroutine?debug=1" | grep "goroutine profile: total"
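The same baseline comparison can be done inside the process, for example in a test. The sketch below deliberately leaks goroutines and measures the growth with runtime.NumGoroutine. The helpers leak and goroutineGrowth are hypothetical names, and the short sleep is a simplification that a real test would replace with proper synchronization.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// leak starts n goroutines that block forever on a channel nobody writes to.
func leak(n int) {
	for i := 0; i < n; i++ {
		ch := make(chan struct{})
		go func() { <-ch }()
	}
}

// goroutineGrowth reports how many goroutines were added while fn ran.
func goroutineGrowth(fn func()) int {
	before := runtime.NumGoroutine()
	fn()
	time.Sleep(50 * time.Millisecond) // give short-lived goroutines time to exit
	return runtime.NumGoroutine() - before
}

func main() {
	healthy := goroutineGrowth(func() {
		done := make(chan struct{})
		go func() { close(done) }()
		<-done
	})
	leaky := goroutineGrowth(func() { leak(10) })
	fmt.Println("growth after healthy work:", healthy)
	fmt.Println("growth after leaky work:", leaky)
}
```

A count that returns to the baseline after the work completes is the healthy case; a count that stays elevated points to goroutines that never exit.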

The Go 1.26 `goroutineleak` Profile (Experimental)

Historically, finding which of the 10,000 goroutines was leaked required manual inspection of stack traces. Go 1.26 introduces an experimental profile for this: /debug/pprof/goroutineleak.

This profile leverages the Garbage Collector's reachability analysis. It checks whether a goroutine blocked on a channel or mutex can ever be unblocked. If the synchronization primitive it is waiting on is unreachable from any runnable goroutine, the runtime flags the goroutine as permanently leaked. A goroutine blocked on a primitive that is still reachable (for example, through a global variable) will not be reported, so an empty profile does not prove the absence of leaks.

To use it in Go 1.26, you must compile your service with the experiment flag:

GOEXPERIMENT=goroutineleakprofile go build -o myapp main.go

Then fetch the endpoint to list the goroutines the runtime has flagged as leaked:

go tool pprof http://localhost:6060/debug/pprof/goroutineleak

Advanced: Custom Profiling Labels with pprof.Do

In a massive multi-tenant microservice, looking at a generic CPU profile is often unhelpful. You might see json.Unmarshal taking 40% of the CPU, but you don't know which API route or which tenant is triggering it.

Go supports Custom Profiling Labels, allowing you to attach arbitrary key-value pairs to the execution context.

// Tagging goroutines with custom pprof labels
// Purpose: Allows filtering CPU (and goroutine) profiles by tenant or HTTP route.
// Heap and allocation profiles do not currently carry labels.
package handlers

import (
    "context"
    "runtime/pprof"
)

func ProcessOrder(ctx context.Context, tenantID string, route string) {
    // 1. Create a LabelSet (arguments must be key-value pairs)
    labels := pprof.Labels("tenant", tenantID, "route", route)

    // 2. Wrap the execution block with pprof.Do
    // CPU samples taken while this closure runs, including in goroutines
    // it starts, are tagged with these labels.
    pprof.Do(ctx, labels, func(ctx context.Context) {
        // Expensive processing goes here...
        decodeHeavyPayload(ctx)
    })
}

// decodeHeavyPayload is a placeholder for the expensive work being profiled.
func decodeHeavyPayload(ctx context.Context) {}

When you download the profile, you can filter by label with the tagfocus option, for example go tool pprof -tagfocus=tenant=xyz -http=:8080 profile.out. The Flame Graph then shows only the CPU samples attributed to that specific tenant. In the interactive console, the tags command lists the label keys and values present in the profile.

Frequently Asked Questions (FAQ)

What is the performance overhead of Go pprof?

Heap profiling uses probabilistic sampling (default runtime.MemProfileRate is 512 KB) and is practically free (< 1% overhead). CPU profiling (100Hz sampling) is also very lightweight (< 2%). However, setting Block or Mutex profile rates to capture 100% of events can add 5-20% overhead. Execution tracing (go tool trace) was historically the heaviest, at 10-20% overhead before Go 1.21; newer releases reduce this to roughly 1-2% for many applications, although trace files still grow quickly.

When should I use go tool trace instead of pprof?

Use pprof to find functions actively burning CPU or allocating memory. Use go tool trace when you need to diagnose latency spikes, scheduler delays, or lock contention where the CPU is mostly idle but requests are taking too long to complete.

How do I profile mutex contention in Go?

First, enable it in your application code via runtime.SetMutexProfileFraction(100) (which samples on average 1% of contention events). Then, access the data via go tool pprof http://localhost:6060/debug/pprof/mutex. Look for functions waiting the longest for a sync.Mutex to unlock.

What's the difference between alloc_space and inuse_space?

inuse_space measures the memory currently held by the application (useful for finding memory leaks), whereas alloc_space measures the total memory allocated over the program's lifetime (useful for finding high garbage collection pressure).

How do you enable the Go 1.26 goroutine leak profiler?

You must compile your service with the experiment flag: GOEXPERIMENT=goroutineleakprofile go build -o myapp main.go. After that, you can fetch the profile via /debug/pprof/goroutineleak.

🔗 Related Reading: Profiling tells you why a function is slow, but detecting goroutine growth early is the first line of defence. Read the companion guide Goroutine Leak Detection and Fix in Production Go Services for a deep-dive into goroutine lifecycle management. For distributing observability across your entire microservices fleet, see Mastering Event-Driven Architecture with Dapr which covers tracing, retry, and DLQ patterns end-to-end.

This post was originally published on my blog at Go pprof in Kubernetes: Remote CPU & Memory Profiling Without Restarting Pods.

[System Design] Chapter 3: Traffic Shield - Peak Shaving with Kafka and Graceful Degradation

Tuấn Anh — Sat, 27 Jun 2026 13:41:12 +0000


Chapter 3: Peak Shaving - The Power of Apache Kafka and Graceful Degradation

In Chapter 2, we used Redis to deduct inventory in a fraction of a millisecond. However, the purchase journey isn't over. The system still needs to create the order record in MySQL, generate an invoice, deduct money from ShopeePay, calculate shipping, and award Shopee Coins.

If we attempt to perform all these steps synchronously while the user waits, the system will collapse due to database lock timeouts or slow third-party API responses. The secret is Asynchronous Processing.

1. Peak Shaving with Apache Kafka

The core philosophy of Flash Sale design is: Accept requests blazingly fast, process them slowly. Shopee uses Apache Kafka, a high-throughput message broker, as a large buffer between the request path and the database.

Once Redis successfully deducts inventory, a lightweight message stating "User A ordered an iPhone" is pushed into a Kafka Topic.
The system immediately returns a success response to the app: "You are in queue. Your order is being processed." The user experience takes just milliseconds.
Behind the scenes, Backend Workers slowly pull messages from Kafka and insert them into the actual Database.
The Result: Even if a spike of 1 million orders arrives in a single second, it won't crash the Database. The 1 million messages are safely stored in Kafka. If the workers process at 10,000 orders/second, the backlog is cleared in 100 seconds. The massive traffic spike has been successfully "shaved" into a flat, manageable horizontal line.
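The accept-fast, process-slowly flow above can be sketched with a buffered channel standing in for the Kafka topic. This is a simplification: a channel is in-memory and not durable, whereas Kafka persists messages to disk. The names Order, accept, drain, and backlogSeconds are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Order is the lightweight message pushed after inventory is deducted.
type Order struct{ ID int }

// accept enqueues the order and returns immediately.
// A false result means the buffer is full and the request should be rejected.
func accept(queue chan<- Order, o Order) bool {
	select {
	case queue <- o:
		return true
	default:
		return false
	}
}

// drain starts workers that persist orders from the queue, and returns the
// number processed once the queue has been closed and emptied.
func drain(queue <-chan Order, workers int, persist func(Order)) int64 {
	var processed int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for o := range queue {
				persist(o)
				atomic.AddInt64(&processed, 1)
			}
		}()
	}
	wg.Wait()
	return processed
}

// backlogSeconds is the time needed to clear a backlog at a fixed consumer
// rate, rounded up to whole seconds.
func backlogSeconds(backlog, ratePerSecond int) int {
	return (backlog + ratePerSecond - 1) / ratePerSecond
}

func main() {
	queue := make(chan Order, 1000)
	accepted := 0
	for i := 0; i < 1000; i++ {
		if accept(queue, Order{ID: i}) {
			accepted++
		}
	}
	close(queue)
	done := drain(queue, 4, func(Order) {}) // persist would insert into the database
	fmt.Println("accepted:", accepted, "processed:", done)
	fmt.Println("seconds to clear 1M orders at 10k/s:", backlogSeconds(1_000_000, 10_000))
}
```

The worker count and the persist function control the rate at which the database sees writes, independently of how fast requests arrive.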

graph LR
    subgraph Traffic Storm
        Users((Millions of Users)) -->|1 Million Req/s| Checkout[Checkout Service]
    end

    Checkout -->|Write| Kafka[(Apache Kafka<br/>Message Broker)]

    subgraph Async Processing
        Kafka -->|Pull at 10k/s| Worker1[Order Worker]
        Kafka -->|Pull at 10k/s| Worker2[Payment Worker]
        Worker1 --> DB[(MySQL / TiDB)]
        Worker2 --> API[External APIs]
    end

2. Eventual Consistency

Shopee embraces the philosophy of Eventual Consistency for distributed systems. Do not attempt to force 100% Strong Consistency across all microservices instantly. There will be a slight delay from the moment you tap "Buy" to the moment the invoice fully appears in your "To Ship" tab. This minor time trade-off is the key to preserving the high availability of the entire e-commerce platform.

3. Graceful Degradation

During the midnight rush of 11.11, Shopee enforces a strict policy: Protect the Core Flow at all costs (Search -> Add to Cart -> Checkout). Everything else can die, but the checkout system must survive!

Circuit Breakers: When an internal system (e.g., the Promotions Service) becomes overloaded and slow, a Circuit Breaker (like Hystrix or Sentinel) automatically trips. It severs the connection to the failing service for a set duration, instantly returning a default fallback response. This prevents slow services from creating cascading timeouts.
Feature Toggling: Engineers pre-configure "switches" in their centralized configuration system. When traffic crosses a critical threshold, the system automatically degrades the user experience by turning off non-essential, heavy features:
- Disabling historical purchase viewing.
- Hiding Seller analytics dashboards.
- Pausing heavy AI Recommendation engines.
- Disabling avatar updates.

Developer Takeaway: Message Queues (Kafka/RabbitMQ) are the key to decoupling monolithic processes into independent pipelines. In high-concurrency design, you must embrace trade-offs: Be willing to sacrifice auxiliary features to keep the primary money-making flow alive.

This post was originally published on my blog at Chapter 3: Traffic Shield - Peak Shaving with Kafka and Graceful Degradation.

Escaping the Monolith: How We Replaced Magento with 21 Go Microservices

Tuấn Anh — Fri, 26 Jun 2026 15:00:00 +0000

Answer-first: How we escaped Magento's licensing and scaling walls by migrating to a Composable Commerce Platform built on 21 Go microservices, Dapr PubSub, and a 3-phase Strangler Fig strategy without dropping a single order.

At exactly midnight during a major campaign, a monolithic Magento server can quickly become your single point of failure. Every engineering team that builds seriously on Magento eventually hits the same walls: the licensing wall ($100k-$200k/year for Enterprise), the scaling wall (scaling the entire massive codebase just to handle a traffic spike on the checkout page), and the developer velocity wall.

When we decided to migrate away from Magento, the standard industry advice was to split the monolith into "4 to 6 microservices". We ignored that advice. For serious e-commerce at scale, that approach inevitably leads to a distributed monolith — where services are deployed separately but remain tightly coupled through HTTP chains or shared database tables.

Instead, we built a Composable Commerce Platform using 21 Go microservices, the Kratos v2 framework (the open-source go-kratos project from Bilibili), and Dapr PubSub. Here is the blueprint of how we structured it, and how we migrated with zero downtime.

For the complete architecture deep-dive across all domains, see the Composable Commerce Migration Series.

The 21-Service Blueprint: Domain-Driven Design in Practice

Before writing a single line of Go code, we established boundaries using Domain-Driven Design (DDD). Every domain owns its own PostgreSQL database. There are absolutely no cross-domain queries.

Here is the breakdown of our 6 core domains:

Commerce Flow (Checkout, Order, Payment): The highly critical money path, orchestrated using Saga patterns.
Product & Content (Catalog, Pricing, Promotion, Search): Read-heavy, heavily cached, requiring sub-50ms latency.
Logistics (Warehouse, Fulfillment, Shipping): Integration with the physical world and 3PLs.
Post-Purchase (Returns, Loyalty): Customer retention and post-sale state machines.
Identity & Access (Auth, User, Customer): Strict separation between internal staff (RBAC) and external customer PII.
Platform Operations (Gateway, Analytics, Notification): Shared infrastructure utilities.

For how traffic flows through the ecosystem, including the full traffic anatomy and the Elasticsearch CQRS implementation, see the Architecture Blueprint.

The 3 Rules of Decoupling: Avoiding the Distributed Monolith

The most common failure mode when migrating from a monolith is building a system where a single failure in a non-critical service cascades and takes down the checkout flow. We avoided this with three hard rules:

No cross-domain database queries. If the Order service needs product data, it cannot query the Catalog database. It must either hold a denormalized copy (via CQRS) or call the Catalog service's gRPC API. This ensures schema changes in one domain never break another.
No synchronous calls in the async event path. Once an event (e.g., order.paid) enters the Dapr PubSub mesh, it is processed independently. Downstream services like Fulfillment or Loyalty react to the event, but they never synchronously call back into the producer. If the Loyalty service is down, the order still succeeds.
No shared deployment pipelines. Every service lives in a Rush monorepo but has its own ArgoCD Application and container registry path. A bug or a blocked deployment in the Notification service cannot block a hotfix for the Checkout service.

The Zero-Downtime Migration Strategy: The 3-Phase Strangler Fig

Magento's EAV (Entity-Attribute-Value) schema is a trap. You cannot just run an ETL job over the weekend and cut over traffic. We used a strict 3-Phase Strangler Fig approach to ensure zero dropped orders:

Phase 1 (Read-Only via CDC): We deployed the Go microservices in read-only mode behind our API Gateway. We used Debezium to stream Magento MySQL changes to our Go services via Dapr. Writes still went to Magento.
Phase 2 (Dual-Write & Conflict Resolution): Microservices started accepting writes, persisting to their own Postgres DBs, and publishing domain events. A sync-adapter service listened to these events and wrote back to Magento. Conflicts were resolved by explicit policies (e.g., timestamp-wins).
Phase 3 (Full Cutover via GitOps): Using ArgoCD, we gradually shifted traffic per service: 25% → 50% → 75% → 100%. Magento was kept on "hot standby" for a 30-day rollback window.

Takeaways:

Establish boundaries before writing code. Extract services based on business domains (DDD) and data ownership, not UI pages.
Enforce the DB-per-service rule strictly.
CDC is your best friend during migration.
Do not do a "big bang" release.

Why Go (Golang) instead of Node.js or Java for microservices?

Go provides the perfect balance for e-commerce microservices: it compiles to a single static binary (fast startup for scaling), uses very little memory compared to JVM languages, and has excellent concurrency primitives (goroutines) for handling high-throughput API gateways and event consumers.

How do you handle distributed transactions across 21 services?

We don't do distributed locks or 2PC (Two-Phase Commit). We use the Saga Pattern orchestrated by the Checkout service, leveraging Dapr PubSub. If a step fails, the orchestrator fires compensating events to roll back previous steps.

Does the API Gateway become a new monolith?

No. The gateway only handles Auth, Rate Limiting, and basic routing. We strictly forbid placing business logic in the API Gateway. It acts purely as a traffic shield and BFF (Backend-For-Frontend).

For the full engineering blueprint — including the exact EAV extraction SQL and our Dapr Saga implementation — read the complete Composable Commerce Migration Series.

FAQ

Q: How do you migrate from Magento to microservices without downtime?
A: The safest path is the 3-Phase Strangler Fig pattern: Phase 1 deploys new microservices alongside Magento in read-only mode — reads hit the new services, writes still go to Magento, with Debezium CDC syncing Magento's MySQL binlog to the new services in real time. Phase 2 gradually migrates write APIs (starting with lower-risk domains like Customer, then Catalog, then Order), using bidirectional Dapr Pub/Sub sync to keep Magento's legacy Fulfillment module in sync. Phase 3 cuts all traffic to microservices but keeps Magento as a hot standby with reverse sync for 30 days before termination. Each phase includes a feature flag for sub-10-second rollback.

Q: What is Debezium and why is it used in Magento migration?
A: Debezium is a Change Data Capture (CDC) tool that streams MySQL binary log (binlog) events to a message broker in real time. In a Magento migration, it solves the data consistency problem during the transition period: instead of batch ETL jobs that create race conditions, Debezium captures every INSERT, UPDATE, and DELETE from Magento's MySQL database the moment it happens and publishes it to Dapr Pub/Sub. The new microservices subscribe to these events and keep their own databases synchronized. This creates a continuous, event-driven data bridge between the legacy system and the new architecture with no polling loops or cron jobs.

Q: How do you handle Magento's integer IDs vs UUIDs in microservices migration?
A: Magento uses sequential integer entity_id values as primary keys across all tables. Modern distributed microservices use UUIDs to avoid ID collisions across independent databases and services. The solution is a magento_id_map cross-reference table maintained during the migration period: every Magento integer ID is mapped to a generated UUID before insertion into the new service's database. All new writes from microservices generate UUIDs directly. The Legacy Sync Worker that writes microservice events back into Magento performs the reverse lookup — UUID to integer — when creating records in Magento's EAV schema. This mapping table is the source of truth during dual-write and is retired after the hot standby period ends.
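
A rough in-memory sketch of the cross-reference idea follows. The real `magento_id_map` is a database table; `IDMap`, `Key`, and `newUUID` are hypothetical names, and keying by entity type is an assumption made here because Magento's `entity_id` sequences are per table.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newUUID returns a random (version 4) UUID string.
func newUUID() string {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		panic(err)
	}
	b[6] = (b[6] & 0x0f) | 0x40 // version 4
	b[8] = (b[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
}

// Key identifies a legacy row: entity type plus Magento's integer entity_id.
type Key struct {
	Type string
	ID   int64
}

// IDMap is an in-memory stand-in for the magento_id_map table.
type IDMap struct {
	toUUID map[Key]string
	toKey  map[string]Key
}

func NewIDMap() *IDMap {
	return &IDMap{toUUID: map[Key]string{}, toKey: map[string]Key{}}
}

// UUIDFor returns the UUID mapped to a Magento row, creating it exactly once.
func (m *IDMap) UUIDFor(entityType string, entityID int64) string {
	k := Key{entityType, entityID}
	if id, ok := m.toUUID[k]; ok {
		return id
	}
	id := newUUID()
	m.toUUID[k] = id
	m.toKey[id] = k
	return id
}

// EntityIDFor is the reverse lookup used when writing back to Magento.
func (m *IDMap) EntityIDFor(uuid string) (Key, bool) {
	k, ok := m.toKey[uuid]
	return k, ok
}

func main() {
	m := NewIDMap()
	u := m.UUIDFor("order", 1001)
	fmt.Println(u == m.UUIDFor("order", 1001)) // true: the mapping is stable
	fmt.Println(m.EntityIDFor(u))
}
```

In a real table the "create exactly once" guarantee would come from a unique constraint on the legacy key rather than from a map lookup.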

Q: What is bidirectional sync in a microservices migration?
A: Bidirectional sync is the dual-write pattern used during Phase 2 of the migration when both Magento and the new microservices are simultaneously handling writes. When a microservice (e.g., Order Service) processes a transaction, it writes an order.created event to its outbox table in the same database transaction. A Legacy Sync Worker consumes this event from the Dapr Event Mesh and writes it backward into Magento's database, translating modern payloads back into Magento's EAV schema format. Conflict resolution uses timestamp precedence — the newest write wins. This bidirectional sync allows legacy modules still running inside Magento (e.g., Fulfillment) to remain functional while the migration completes.

[System Design] Chapter 2: Flash Sale Engine - Solving Overselling and Hot Keys

Tuấn Anh — Tue, 23 Jun 2026 23:38:50 +0000

Chapter 2: Flash Sale Engine - The Mystery Behind Redis and Hot Keys

Flash Sale events are the ultimate stress test for system architecture. When an iPhone is sold for $1, millions of users will smash the "Buy Now" button in the exact same millisecond. If this massive spike hits a MySQL database directly, the system will instantly crash due to Row Locks and Deadlocks.

1. The Hot Key Problem and Two-Tier Caching

A highly discounted product is known as a Hot Key.
Many developers mistakenly believe that "just putting inventory in Redis" solves everything. However, a single Redis node has Network Bandwidth and CPU limits (typically maxing out at ~100k Ops/sec). One million clicks on a single key will saturate the network interface card (NIC) of that Redis node.

Shopee's Solution: Multi-Level Caching

Tier 1 (Local Cache): Built directly into the RAM of the Golang Application Servers (using tools like sync.Map or BigCache). This local cache only stores a boolean flag: "Is the item still in stock?". It has a TTL of just 1-2 seconds but successfully blocks 90% of useless traffic from hitting the network once the item is sold out.
Tier 2 (Distributed Cache - Redis): Only when the Local Cache reports that the item is available does the request proceed to the Redis cluster.

2. Preventing Overselling with Atomic Lua Scripts

When a user buys an item, the system must deduct the inventory. But if you use standard commands: Read stock (GET) -> Check if > 0 -> Write new stock (SET), you will face a critical Race Condition. Two parallel threads might both read a stock value of 1, both decrement it, and result in selling two items when only one existed (Overselling).

The Solution: Shopee wraps the inventory deduction logic inside Lua Scripts running natively within Redis. Because Redis executes commands on a single thread, a Lua script runs as one atomic unit: no other command can interleave with it mid-execution.

-- Example Lua Script for Inventory Deduction
local stock_key = KEYS[1]
local stock = tonumber(redis.call('GET', stock_key))

if stock and stock > 0 then
    redis.call('DECR', stock_key)
    return 1 -- Purchase Successful
else
    return 0 -- Out of Stock
end

Fail Fast: Thanks to this mechanism, if the Lua script returns 0, the request is immediately rejected and the user sees an "Out of Stock" message. This RAM-level operation takes mere microseconds.

3. Inventory Sharding

For mega-campaigns, a single Hot Key on a single Redis Node is still too risky. Shopee employs Inventory Sharding.
If there are 1,000 iPhones, they do not store the number 1,000 in a single key iphone_stock. Instead, they slice it into 10 shards: iphone_stock_1 to iphone_stock_10. Each key holds 100 items and is distributed across 10 different physical Redis Nodes.

A load balancer or router randomly routes incoming user traffic to one of those 10 keys, instantly dividing the massive system pressure by 10.

sequenceDiagram
    participant User
    participant App as Golang Server<br/>(Local Cache)
    participant Redis as Redis Cluster<br/>(Sharded)
    participant Worker as Kafka Worker

    User->>App: Click "Buy Now"
    Note over App: Check Local Cache.<br/>Block if Out of Stock
    App->>Redis: Route to shard (e.g. stock_3)
    Note over Redis: Execute Atomic Lua Script
    Redis-->>App: Return 1 (reserved) or 0 (out of stock)
    App-->>User: If 0: Return "Out of Stock" error
    App->>Worker: If 1: Push Order Event to Queue
    Worker-->>User: Process Order Asynchronously

Developer Takeaway: RAM and caching are your strongest weapons against heavy traffic. However, do not blindly rely on a Distributed Cache. Combine it with Local Caches on the App Server to save network bandwidth, and always use Lua Scripts to guarantee data consistency when handling sensitive numbers like inventory or wallet balances.

This post was originally published on my blog at Chapter 2: Flash Sale Engine - Solving Overselling and Hot Keys.

[System Design] Banking Microservices Architecture: Event Sourcing, CQRS & Saga Patterns for Core Banking

Tuấn Anh — Sun, 21 Jun 2026 00:28:28 +0000

Series context (Part 4 of 8): This article assumes familiarity with ACID transactions and database concurrency. Understanding why consistency guarantees are hard at the database layer is essential context before introducing distributed patterns here.

Why Microservices in Banking?

Microservices in banking is the architectural pattern where a core banking system is broken into independently deployable, domain-owned services (CIF, Payments, Lending, Notifications) connected by an event bus instead of direct database calls. This replaces monolithic systems like T24 or Flexcube — where a single change to the Payments module requires redeploying the entire application and risks taking down unrelated services.

High-risk deployments: Modifying a small module requires redeploying the entire system. A patch to the Payments module can take down CIF.
Inefficient scaling: You cannot scale just the Payments module during peak loads without scaling everything else — including parts that don't need more capacity.
Technology lock-in: Bound to a single programming language and database. Adding a modern ML risk engine becomes an 18-month integration project.

The current trend is transitioning to Headless Core Banking — decoupling the domain logic from the delivery channels (Mobile App, Internet Banking, ATM) using a banking microservices architecture.

Overall Architecture

                    ┌─────────────────────────────────────┐
  CHANNELS          │  Mobile App  │  Internet Banking  │  ATM/POS  │
                    └──────────────────────┬──────────────────────────┘
                                           │ REST/gRPC
                    ┌──────────────────────▼──────────────────────────┐
  API GATEWAY       │         API Gateway (Auth, Rate Limit, Routing)  │
                    └──────────────────────┬──────────────────────────┘
                                           │
         ┌─────────────────────────────────┼──────────────────────────┐
         │                                 │                          │
  ┌──────▼──────┐               ┌──────────▼─────────┐    ┌──────────▼──────────┐
  │ CIF Service │               │  Account Service   │    │ Payment Service     │
  │ (Customer)  │               │  (CASA, GL)        │    │ (Transfers, Fees)   │
  └─────────────┘               └────────────────────┘    └─────────────────────┘
         │                                 │                          │
         └─────────────────────────────────┼──────────────────────────┘
                                           │ Events (Kafka/Dapr)
                    ┌──────────────────────▼──────────────────────────┐
  EVENT BUS         │              Message Broker (Kafka / Redis)      │
                    └──────────────────────┬──────────────────────────┘
                                           │
         ┌─────────────────────────────────┼──────────────────────────┐
         │                                 │                          │
  ┌──────▼──────┐               ┌──────────▼─────────┐    ┌──────────▼──────────┐
  │ Loan Service│               │ Notification Svc   │    │ Reporting Service   │
  │ (Lending)   │               │ (SMS, Push, Email) │    │ (CQRS Read Side)    │
  └─────────────┘               └────────────────────┘    └─────────────────────┘

Pattern 1: Event Sourcing for the Ledger

In traditional architectures, we store the current state. In Event Sourcing, we store a sequence of immutable events that produce that state.

Why does Event Sourcing fit Core Banking?

The ledger is already essentially Event Sourcing — every entry is an immutable event. The current balance is simply the result of replaying all entries from the beginning.

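package ledger

import "time"

// Event is a marker interface satisfied by every account event below.
type Event interface{}

// Events in the Account domain
type AccountOpened struct {
    AccountID    string
    CIFNumber    string
    Currency     string
    OpenedAt     time.Time
}

type MoneyDeposited struct {
    AccountID     string
    Amount        int64 // assumed to be in minor units (e.g. cents), never floats
    TransactionID string
    OccurredAt    time.Time
}

type MoneyWithdrawn struct {
    AccountID     string
    Amount        int64 // assumed to be in minor units (e.g. cents), never floats
    TransactionID string
    OccurredAt    time.Time
}

// Calculate balance by replaying events in order.
// Events that do not move money (e.g. AccountOpened) are skipped.
func calculateBalance(events []Event) int64 {
    var balance int64
    for _, event := range events {
        switch e := event.(type) {
        case MoneyDeposited:
            balance += e.Amount
        case MoneyWithdrawn:
            balance -= e.Amount
        }
    }
    return balance
}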

Pattern 2: CQRS — Command Query Responsibility Segregation

Core Banking has a unique characteristic: writes must be exceptionally robust (ACID) but reads need to be lightning fast (dashboards, reports). CQRS completely separates these two flows:

WRITE SIDE (Command)                READ SIDE (Query)
────────────────────────            ──────────────────────────
POST /transfers            →        Materialized Views
POST /accounts             →        Elasticsearch Index
PUT /loans/repay           →        Redis Cache

↓ Event Published ↓                ↑ Subscribe & Update ↑
         └──────────────────────────┘
              (Event Bus / Kafka)

Real-world Example:

Write Side: Processes transfers using PostgreSQL with full ACID compliance, guaranteeing money isn't lost.
Read Side: Dashboards display transaction history from Elasticsearch — ultra-fast queries, full-text search, and multi-condition filtering.

Pattern 3: Saga — Distributed Transactions Across Services

When a cross-bank transfer requires coordinating 3 services: Account Service (deduct money), Payment Service (send to clearing house), and Notification Service (send SMS), how do you ensure integrity?

Choreography Saga (Event-Driven)

Account Service                Payment Service           Notification Service
      │                               │                          │
      │── TransferInitiated ──────────▶│                          │
      │                               │── PaymentSubmitted ──────▶│
      │                               │                          │── SMS Sent
      │◀── PaymentCompleted ──────────│                          │
      │                               │                          │
   (release hold)                                            (done)

If Payment fails:
      │◀── PaymentFailed ─────────────│
      │                               │
   (cancel hold, refund)

Outbox Pattern — Guaranteeing Events are Never Lost

Problem: What if a service successfully commits to the database but fails to publish the event to Kafka?

Solution: Write the event to the database within the same transaction, then have a separate worker read and publish it to Kafka.

-- Outbox table: written in the same transaction as business data
CREATE TABLE outbox_events (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    topic       VARCHAR(100) NOT NULL,  -- 'account.transfer.completed'
    payload     JSONB        NOT NULL,
    status      VARCHAR(20)  NOT NULL DEFAULT 'PENDING',
    created_at  TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    published_at TIMESTAMPTZ
);

-- Inside the same Database Transaction:
-- 1. Update account balance
-- 2. Write ledger entries  
-- 3. INSERT into outbox_events

-- Separate worker running periodically:
-- SELECT * FROM outbox_events WHERE status = 'PENDING'
-- → Publish to Kafka
-- → UPDATE status = 'PUBLISHED'

API Design for Financial Transactions

Design Principles

Stateless APIs: Every request must contain all necessary information.
Mandatory Idempotency headers for all state-changing APIs.
Strict separation between Request (commands) and Status (polling).

POST /v1/transfers                    → Initiate transfer command
  Header: Idempotency-Key: <uuid>
  Body: { from, to, amount, currency }
  Response: { transfer_id, status: "PROCESSING" }

GET  /v1/transfers/{transfer_id}      → Check result
  Response: { status: "COMPLETED" | "FAILED", ... }

Never design a transfer API as a synchronous, blocking call: processing through an interbank network (SWIFT, for example, or a central clearing house) can take anywhere from seconds to minutes, and sometimes longer.
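
The effect of the `Idempotency-Key` header can be sketched with an in-memory store. `TransferAPI` and `Create` are hypothetical names, the HTTP layer is omitted, and a real service would persist the keys in the same database transaction as the transfer.

```go
package main

import (
	"fmt"
	"sync"
)

// Transfer is the response body of POST /v1/transfers.
type Transfer struct {
	ID     string
	Status string
}

// TransferAPI deduplicates transfer commands by Idempotency-Key.
type TransferAPI struct {
	mu     sync.Mutex
	byKey  map[string]Transfer
	nextID int
}

func NewTransferAPI() *TransferAPI {
	return &TransferAPI{byKey: map[string]Transfer{}}
}

// Create returns the stored response when the key was seen before, so a
// client retry after a timeout can never initiate a second transfer.
func (a *TransferAPI) Create(idempotencyKey string) (Transfer, bool) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if prev, ok := a.byKey[idempotencyKey]; ok {
		return prev, true // replayed response
	}
	a.nextID++
	t := Transfer{ID: fmt.Sprintf("tr-%d", a.nextID), Status: "PROCESSING"}
	a.byKey[idempotencyKey] = t
	return t, false
}

func main() {
	api := NewTransferAPI()
	first, _ := api.Create("key-123")
	retry, replayed := api.Create("key-123")
	fmt.Println(first.ID == retry.ID, replayed) // true true
}
```

The client then polls `GET /v1/transfers/{transfer_id}` with the returned ID until the status leaves `PROCESSING`.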

Technical Stack Selection

Layer	Popular Choices	Reason
Service Framework	Go (Kratos, Fiber), Java (Spring Boot)	High performance, type-safe
Primary Database	PostgreSQL	Strong ACID, flexible JSONB
Cache	Redis	Balances, sessions, rate limiting
Event Bus	Apache Kafka, Dapr PubSub	Durable, ordered, replayable
Service Mesh	Istio, Dapr	mTLS, circuit breaking
Orchestration	Kubernetes	Auto-scaling, self-healing

References & Further Reading

🔗 Previous Step: Explore the foundational database layer in Part 3 — Database Design for Financial Transactions (ACID & Concurrency).

🔗 Next Step: Now that you understand banking microservices architecture and its event-driven patterns, see how these services communicate with the outside world through international financial standards. Continue reading Part 5 — International Integration Standards: ISO 8583 & ISO 20022.

🔗 Deep Dive: For a complete engineering guide to the full composable banking stack — ledger concurrency patterns, Strangler Fig migrations, RFC 8705 mTLS, and the next-gen vendor landscape — see Composable Banking Architecture: From Monolith to Modular Core.

This post was originally published on my blog at Banking Microservices Architecture: Event Sourcing, CQRS & Saga Patterns for Core Banking.