Dipankar Sethi

Posted on Jan 27

👻 The Silent Ghost: How a Single Enum Broke Our Distributed Transaction

#java #distributedsystems #sql #backend

It was 3:00 AM on a Tuesday when PagerDuty decided my night was over.

User onboarding wasn’t down.
No red dashboards.
No obvious errors.

But something felt off.

A quick query against production revealed the truth:

👉 Thousands of users existed in Auth, with no HR profiles.

No UI errors.
No retries.
No alarms.

Just ghost users.

This is a real production postmortem from a system built using Spring Boot, PostgreSQL, and the Saga Pattern — and how a single enum value quietly broke our data integrity.

Architecture Overview

We use an orchestration-based Saga, owned by the HR service.

Goal

When a new employee joins:
Create Auth credentials
Create an HR profile

Both must succeed — or neither should exist.

Original Saga Flow
[ Client ]
    |
[ HR Service (Saga Orchestrator) ]
    |
    |-- (1) Create User --------------> [ Auth Service ] -> [ Auth DB ]
    |
    |-- (2) Save Profile -------------> [ HR DB ❌ CHECK constraint ]
    |
    |-- (3) Delete User (Compensation) -> [ Auth Service ]

The logs claimed compensation succeeded.
Production data disagreed.

The Trigger: A Harmless Enum

We added support for remote work:

public enum AddressType {
    HOME,
    CURRENT,
    OFFICE,
    PERMANENT,
    WORK
}

CI passed.
Deploy succeeded.
Production failed silently.

The Real Culprit: Schema Drift

The HR database had a PostgreSQL CHECK constraint:

CHECK (address_type IN ('HOME','CURRENT','OFFICE','PERMANENT'))

So when WORK was saved:

ERROR: new row for relation "employee_addresses" violates check constraint "employee_addresses_address_type_check"

But the Saga still didn’t roll back.
Why?

The Invisible Failure: JPA Lazy Writes

@Transactional
public void onboard(Request req) {
    try {
        authClient.createUser(req);               // SUCCESS
        profileRepository.save(new Profile(req)); // Deferred
    } catch (Exception e) {
        authClient.deleteUser(req.getUserId());   // Never reached
    }
} // Commit happens here

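Key detail:

save() only queues the INSERT in the persistence context (unless the ID strategy forces an immediate write)
The SQL runs at flush time, which by default means at commit
The commit happens in the @Transactional proxy, after the try-catch has already exited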

Result:

Auth user created ✅
HR insert fails ❌
Compensation skipped 👻
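The timing is easier to see in a toy model. The sketch below is not Hibernate or Spring; ToyUnitOfWork and DeferredWriteDemo are hypothetical names, and the only thing they imitate is that a write queued by save() does not run until commit. Under that assumption, the catch block wrapped around save() has nothing to catch.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a persistence context: writes are queued, not executed.
// This is NOT Hibernate; it only mimics the "SQL runs at commit" timing.
class ToyUnitOfWork {
    private final List<Runnable> pendingWrites = new ArrayList<>();

    void save(Runnable insert) {
        pendingWrites.add(insert);          // nothing touches the "database" yet
    }

    void commit() {
        for (Runnable write : pendingWrites) {
            write.run();                    // constraint violations surface here
        }
        pendingWrites.clear();
    }
}

class DeferredWriteDemo {
    static boolean compensated = false;

    // Mirrors the shape of onboard(): the try-catch wraps save(), not commit().
    static void onboard(ToyUnitOfWork uow, String addressType) {
        try {
            uow.save(() -> {
                // Simulated CHECK constraint, evaluated only when the write runs
                if (!List.of("HOME", "CURRENT", "OFFICE", "PERMANENT").contains(addressType)) {
                    throw new IllegalStateException("violates check constraint");
                }
            });
        } catch (RuntimeException e) {
            compensated = true;             // never reached: save() did not throw
        }
    }

    public static void main(String[] args) {
        ToyUnitOfWork uow = new ToyUnitOfWork();
        onboard(uow, "WORK");               // returns normally
        try {
            uow.commit();                   // roughly what the transactional proxy does afterwards
        } catch (IllegalStateException e) {
            System.out.println("commit failed: " + e.getMessage());
        }
        System.out.println("compensated = " + compensated);
    }
}
```

Running main prints "commit failed: violates check constraint" followed by "compensated = false": the failure is real, but it happens outside the only code that knew how to compensate.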

Fix #1: Fail Fast with flush()

@Transactional
public void onboardEmployee(OnboardRequest request) {
    try {
        authClient.createUser(request);
        employeeRepository.save(new EmployeeProfile(request));

        // Force SQL execution NOW
        employeeRepository.flush();

    } catch (Exception e) {
        // Compensate in Auth, then rethrow so the HR transaction rolls back
        authClient.deleteUser(request.getUserId());
        throw e;
    }
}

Now:

DB errors surface immediately
Saga becomes aware of failure
Compensation actually runs

Fix #2: Repair the Schema

ALTER TABLE employee_addresses
DROP CONSTRAINT employee_addresses_address_type_check;

ALTER TABLE employee_addresses
ADD CONSTRAINT employee_addresses_address_type_check
CHECK (address_type IN (
  'HOME','CURRENT','OFFICE','PERMANENT','WORK'
));
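Repairing the constraint once does not stop the next enum value from drifting the same way. One possible guard, sketched here as an assumption rather than something we had in place, is a check that compares the enum's constants with the literals in the CHECK clause. EnumConstraintDriftCheck and its methods are hypothetical names, and the clause is hard-coded for illustration; in a real test it would come from the migration script or the database catalog.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EnumConstraintDriftCheck {
    enum AddressType { HOME, CURRENT, OFFICE, PERMANENT, WORK }

    // Extracts the single-quoted literals from a CHECK (... IN (...)) clause.
    static Set<String> allowedValues(String checkClause) {
        Set<String> values = new TreeSet<>();
        Matcher m = Pattern.compile("'([^']*)'").matcher(checkClause);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    // Returns the enum constants that the constraint would reject.
    static Set<String> missingFromConstraint(String checkClause) {
        Set<String> missing = new TreeSet<>();
        for (AddressType type : AddressType.values()) {
            missing.add(type.name());
        }
        missing.removeAll(allowedValues(checkClause));
        return missing;
    }

    public static void main(String[] args) {
        String before = "CHECK (address_type IN ('HOME','CURRENT','OFFICE','PERMANENT'))";
        String after  = "CHECK (address_type IN ('HOME','CURRENT','OFFICE','PERMANENT','WORK'))";
        System.out.println(missingFromConstraint(before)); // prints [WORK]
        System.out.println(missingFromConstraint(after));  // prints []
    }
}
```

A CI test that fails whenever missingFromConstraint returns a non-empty set would have flagged WORK before the deploy.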

The Real Fix: State-Based Saga

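Delete-based rollback assumes perfect networks.
They don’t exist: if the delete call times out, or the orchestrator crashes before making it, the Auth user survives.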

Improved Flow
[ Client ]
    |
[ HR Service ]
    |
    |-- Create User (PENDING) --------> [ Auth DB ]
    |
    |-- Save Profile (FLUSH) ---------> [ HR DB ]
    |
    |-- Activate User (ACTIVE) -------> [ Auth DB ]

No deletes.
Only state transitions.
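The flow above can be sketched with an in-memory stand-in for the Auth DB. Everything here is illustrative: StateBasedSagaSketch, UserStatus, and the method names are hypothetical, and the saveProfile callback stands in for save() plus flush() against the HR DB. The point is that a failed profile write needs no compensating call, because the user was never usable in the first place.

```java
import java.util.HashMap;
import java.util.Map;

class StateBasedSagaSketch {
    enum UserStatus { PENDING, ACTIVE }

    // In-memory stand-in for the Auth DB (hypothetical).
    static final Map<String, UserStatus> authDb = new HashMap<>();

    static void createPendingUser(String userId) {
        authDb.put(userId, UserStatus.PENDING);
    }

    static void activateUser(String userId) {
        // Sets an existing user to ACTIVE. Repeating the call on an ACTIVE user
        // changes nothing, so the orchestrator can safely retry after a timeout.
        authDb.computeIfPresent(userId, (id, status) -> UserStatus.ACTIVE);
    }

    static boolean canLogIn(String userId) {
        return authDb.get(userId) == UserStatus.ACTIVE;
    }

    // saveProfile stands in for save() + flush() against the HR DB.
    static void onboard(String userId, Runnable saveProfile) {
        createPendingUser(userId);
        saveProfile.run();      // if this throws, the user simply stays PENDING
        activateUser(userId);
    }

    public static void main(String[] args) {
        onboard("alice", () -> {});
        try {
            onboard("bob", () -> { throw new IllegalStateException("check constraint"); });
        } catch (IllegalStateException e) {
            // No compensation call needed: bob is PENDING and cannot log in.
        }
        System.out.println(canLogIn("alice")); // prints true
        System.out.println(canLogIn("bob"));   // prints false
        System.out.println(authDb.get("bob")); // prints PENDING
    }
}
```

A PENDING row left behind is harmless clutter rather than a ghost account, which is what makes it safe to clean up later.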

The Safety Net: Reconciliation Job

// Runs every day at 02:00 (Spring cron fields: second minute hour day month weekday)
@Scheduled(cron = "0 0 2 * * *")
public void cleanupOrphans() {
    for (User user : authRepo.findAllByStatus(PENDING)) {
        if (!hrService.profileExists(user.getId())) {
            authRepo.delete(user);
        }
    }
}

This job quietly saved us more than once.
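One caveat with the job as written: it treats every PENDING user without a profile as an orphan, including a user whose onboarding is still in flight at 02:00. A possible refinement, sketched below as an assumption and not as what we ran, is to skip users younger than a grace period. ReconciliationSketch, PendingUser, findOrphans, and the one-hour grace period are all hypothetical, and the selection logic is kept pure so it can be tested without a database.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class ReconciliationSketch {
    // Hypothetical projection of an Auth user that is still PENDING.
    record PendingUser(String id, Instant createdAt) {}

    // Returns ids of PENDING users that are older than the grace period
    // and have no HR profile. Younger users may still be mid-onboarding.
    static List<String> findOrphans(List<PendingUser> pending,
                                    Set<String> profileIds,
                                    Instant now,
                                    Duration gracePeriod) {
        Instant cutoff = now.minus(gracePeriod);
        List<String> orphans = new ArrayList<>();
        for (PendingUser user : pending) {
            boolean oldEnough = user.createdAt().isBefore(cutoff);
            if (oldEnough && !profileIds.contains(user.id())) {
                orphans.add(user.id());
            }
        }
        return orphans;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-01-01T02:00:00Z"); // arbitrary fixed time
        List<PendingUser> pending = List.of(
            new PendingUser("ghost", now.minus(Duration.ofHours(5))),
            new PendingUser("in-flight", now.minus(Duration.ofSeconds(10))),
            new PendingUser("has-profile", now.minus(Duration.ofHours(5))));
        System.out.println(
            findOrphans(pending, Set.of("has-profile"), now, Duration.ofHours(1)));
        // prints [ghost]
    }
}
```

The old PENDING user that does have a profile is deliberately left alone here; whether the job should activate such users or only report them is a separate decision.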

Lessons Learned

Java enums ≠ DB constraints
Never trust save() in a Saga
Flush early, fail fast
Delete-based rollback is fragile
Reconciliation jobs save careers

Distributed systems don’t always crash.
Sometimes they corrupt your data while telling you everything is fine.
