DEV Community

Dipankar Sethi

👻 The Silent Ghost: How a Single Enum Broke Our Distributed Transaction

It was 3:00 AM on a Tuesday when PagerDuty decided my night was over.

  • User onboarding wasn’t down.
  • No red dashboards.
  • No obvious errors.

But something felt off.

A quick query against production revealed the truth:

👉 Thousands of users existed in Auth, with no HR profiles.

  • No UI errors.
  • No retries.
  • No alarms.

Just ghost users.

This is a real production postmortem from a system built using Spring Boot, PostgreSQL, and the Saga Pattern — and how a single enum value quietly broke our data integrity.

Architecture Overview

We use an orchestration-based Saga, owned by the HR service.

Goal

When a new employee joins:

  • Create Auth credentials
  • Create an HR profile

Both must succeed, or neither should exist.

Original Saga Flow
[ Client ]
    |
[ HR Service (Saga Orchestrator) ]
    |
    |-- (1) Create User --------------> [ Auth Service ] -> [ Auth DB ]
    |
    |-- (2) Save Profile -------------> [ HR DB ❌ CHECK constraint ]
    |
    |-- (3) Delete User (Compensation) -> [ Auth Service ]

  • The logs claimed compensation succeeded.
  • Production data disagreed.

The Trigger: A Harmless Enum

We added support for remote work:

public enum AddressType {
    HOME,
    CURRENT,
    OFFICE,
    PERMANENT,
    WORK
}
  • CI passed.
  • Deploy succeeded.
  • Production failed silently.

The Real Culprit: Schema Drift

The HR database had a PostgreSQL CHECK constraint:

CHECK (address_type IN ('HOME','CURRENT','OFFICE','PERMANENT'))

So when WORK was saved:

ERROR: new row for relation "employee_addresses" violates check constraint "employee_addresses_address_type_check"

But the Saga still didn’t roll back.
Why?

The Invisible Failure: JPA Lazy Writes

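@Transactional
public void onboardEmployee(OnboardRequest request) {
    try {
        authClient.createUser(request);                        // Remote call: succeeds immediately
        employeeRepository.save(new EmployeeProfile(request)); // Deferred: the INSERT is only queued
    } catch (Exception e) {
        authClient.deleteUser(request.getUserId());            // Never reached: nothing has failed yet
    }
} // Flush and commit happen here, after the try-catch has exited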

Key detail:

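  • save() usually doesn’t hit the DB: Hibernate queues the INSERT in the persistence context (unless the ID generation strategy forces an immediate INSERT)
  • The queued SQL runs at flush time, which by default means at commit
  • The exception is thrown after the try-catch has exited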

Result:

  • Auth user created ✅
  • HR insert fails ❌
  • Compensation skipped 👻
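
The timing problem can be reproduced without a database. The sketch below is a plain-Java simulation, not JPA itself: `WriteBehindSession` is a hypothetical stand-in for the persistence context that queues writes and only runs them on `flush()` or `commit()`, and the `IllegalStateException` stands in for the CHECK constraint violation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a persistence context: writes are queued, not executed. */
class WriteBehindSession {
    private final List<Runnable> pending = new ArrayList<>();

    void save(Runnable insert) {
        pending.add(insert);              // nothing runs yet
    }

    void flush() {                        // the queued "SQL" runs here
        List<Runnable> batch = new ArrayList<>(pending);
        pending.clear();
        batch.forEach(Runnable::run);
    }

    void commit() {
        flush();                          // commit flushes whatever is still queued
    }
}

class LazyWriteDemo {
    /** Returns true if the compensation step ran. */
    static boolean onboard(boolean flushInsideTry) {
        WriteBehindSession session = new WriteBehindSession();
        boolean[] compensated = {false};
        try {
            try {
                // Simulated INSERT that violates the CHECK constraint
                session.save(() -> {
                    throw new IllegalStateException("violates check constraint");
                });
                if (flushInsideTry) {
                    session.flush();      // failure surfaces inside the try block
                }
            } catch (IllegalStateException e) {
                compensated[0] = true;    // stands in for authClient.deleteUser(...)
            }
            session.commit();             // what the transaction does after the method body
        } catch (IllegalStateException e) {
            // Too late: the failure escaped the inner try-catch, so no compensation ran
        }
        return compensated[0];
    }

    public static void main(String[] args) {
        System.out.println("without flush, compensated = " + onboard(false)); // false
        System.out.println("with flush, compensated = " + onboard(true));     // true
    }
}
```

Without the early flush, the failure only appears at commit, outside the block that knows how to compensate.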

Fix #1: Fail Fast with flush()

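@Transactional
public void onboardEmployee(OnboardRequest request) {
    try {
        authClient.createUser(request);
        employeeRepository.save(new EmployeeProfile(request));

        // Force SQL execution NOW, so constraint violations surface inside the try block
        employeeRepository.flush();

    } catch (Exception e) {
        // Compensate: remove the Auth user created above
        authClient.deleteUser(request.getUserId());
        throw e; // Rethrow so the transaction rolls back and the caller sees the failure
    }
}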

Now:

  • DB errors surface immediately
  • Saga becomes aware of failure
  • Compensation actually runs
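
One gap remains in the catch block above: if deleteUser itself throws, the original database error is replaced by the compensation error. A minimal sketch of one way to keep both, using Java's suppressed exceptions. `CompensatingStep` and `runWithCompensation` are illustrative names, not part of the original service.

```java
class CompensatingStep {
    /**
     * Runs the action; if it fails, runs the compensation and rethrows the
     * original failure. A failing compensation is attached as a suppressed
     * exception instead of hiding the root cause.
     */
    static void runWithCompensation(Runnable action, Runnable compensation) {
        try {
            action.run();
        } catch (RuntimeException original) {
            try {
                compensation.run();
            } catch (RuntimeException compensationFailure) {
                original.addSuppressed(compensationFailure);
            }
            throw original;
        }
    }

    public static void main(String[] args) {
        try {
            runWithCompensation(
                () -> { throw new IllegalStateException("HR insert failed"); },
                () -> { throw new IllegalArgumentException("Auth unreachable"); });
        } catch (IllegalStateException e) {
            // prints: HR insert failed / suppressed: Auth unreachable
            System.out.println(e.getMessage() + " / suppressed: "
                + e.getSuppressed()[0].getMessage());
        }
    }
}
```

This keeps the root cause visible in the logs, but it does not remove the orphaned Auth user when the compensation call fails.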

Fix #2: Repair the Schema

ALTER TABLE employee_addresses
DROP CONSTRAINT employee_addresses_address_type_check;

ALTER TABLE employee_addresses
ADD CONSTRAINT employee_addresses_address_type_check
CHECK (address_type IN (
  'HOME','CURRENT','OFFICE','PERMANENT','WORK'
));
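
A cheap guard against this kind of drift is a test that compares the enum with the constraint. The sketch below assumes the constraint definition is available as a string (for example, read from a migration file). `ConstraintDriftCheck` and `missingFromConstraint` are illustrative names, and the regex only handles simple `IN ('A','B')` lists.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

enum AddressType { HOME, CURRENT, OFFICE, PERMANENT, WORK }

class ConstraintDriftCheck {
    // Matches every single-quoted literal in the constraint text
    private static final Pattern QUOTED = Pattern.compile("'([^']*)'");

    /** Returns the enum constants that the CHECK constraint would reject. */
    static Set<String> missingFromConstraint(Class<? extends Enum<?>> enumType, String checkSql) {
        Set<String> allowed = new LinkedHashSet<>();
        Matcher matcher = QUOTED.matcher(checkSql);
        while (matcher.find()) {
            allowed.add(matcher.group(1));
        }
        Set<String> missing = new LinkedHashSet<>();
        for (Enum<?> constant : enumType.getEnumConstants()) {
            if (!allowed.contains(constant.name())) {
                missing.add(constant.name());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String oldCheck = "CHECK (address_type IN ('HOME','CURRENT','OFFICE','PERMANENT'))";
        // prints: [WORK]
        System.out.println(missingFromConstraint(AddressType.class, oldCheck));
    }
}
```

Run in CI against the current migration, a non-empty result would have failed the build before WORK reached production.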

The Real Fix: State-Based Saga

Delete-based rollback assumes the compensating call always reaches Auth.
On a real network, it sometimes doesn’t.

Improved Flow
[ Client ]
    |
[ HR Service ]
    |
    |-- Create User (PENDING) --------> [ Auth DB ]
    |
    |-- Save Profile (FLUSH) ---------> [ HR DB ]
    |
    |-- Activate User (ACTIVE) -------> [ Auth DB ]


No deletes.
Only state transitions.
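
The three steps can be sketched with an in-memory map standing in for the Auth DB. This is a simplified illustration, not the production code: `StateBasedOnboarding`, `UserStatus`, and the `profileWriter` callback (standing in for save plus flush against the HR DB) are assumed names.

```java
import java.util.HashMap;
import java.util.Map;

enum UserStatus { PENDING, ACTIVE }

class StateBasedOnboarding {
    /** In-memory stand-in for the Auth DB: userId -> status. */
    final Map<String, UserStatus> authUsers = new HashMap<>();

    /** profileWriter stands in for the HR save + flush; it throws on failure. */
    void onboard(String userId, Runnable profileWriter) {
        authUsers.put(userId, UserStatus.PENDING); // step 1: user exists but is not usable yet
        profileWriter.run();                       // step 2: HR insert, flushed immediately
        authUsers.put(userId, UserStatus.ACTIVE);  // step 3: reached only if step 2 succeeded
    }

    public static void main(String[] args) {
        StateBasedOnboarding saga = new StateBasedOnboarding();
        saga.onboard("u1", () -> { });
        try {
            saga.onboard("u2", () -> {
                throw new IllegalStateException("violates check constraint");
            });
        } catch (IllegalStateException e) {
            // u2 simply stays PENDING; no delete call was needed
        }
        System.out.println("u1=" + saga.authUsers.get("u1")); // u1=ACTIVE
        System.out.println("u2=" + saga.authUsers.get("u2")); // u2=PENDING
    }
}
```

A failed onboarding leaves a PENDING user behind rather than a fully usable one, so a missed cleanup is harmless until something removes it.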

The Safety Net: Reconciliation Job

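// Runs daily at 02:00 (Spring cron fields: second minute hour day-of-month month day-of-week)
@Scheduled(cron = "0 0 2 * * *")
public void cleanupOrphans() {
    // PENDING users are those whose onboarding never reached the activation step
    for (User user : authRepo.findAllByStatus(PENDING)) {
        if (!hrService.profileExists(user.getId())) {
            authRepo.delete(user); // No HR profile: the Saga failed midway, so remove the orphan
        }
    }
}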

This job quietly saved us more than once.
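
One caveat with the job as written: an onboarding that is still in flight when the job runs is also PENDING with no profile yet. A minimal sketch of an age filter that skips recent users follows. The `createdAt` field, the grace period, and the `PendingUser` and `OrphanFinder` names are assumptions for illustration, not part of the original system.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical view of a PENDING Auth user with its creation time. */
class PendingUser {
    final String id;
    final Instant createdAt;

    PendingUser(String id, Instant createdAt) {
        this.id = id;
        this.createdAt = createdAt;
    }
}

class OrphanFinder {
    /** Returns ids of PENDING users with no HR profile that are older than the grace period. */
    static List<String> findOrphans(List<PendingUser> pending, Set<String> profileIds,
                                    Instant now, Duration gracePeriod) {
        Instant cutoff = now.minus(gracePeriod);
        List<String> orphans = new ArrayList<>();
        for (PendingUser user : pending) {
            boolean oldEnough = user.createdAt.isBefore(cutoff);
            if (oldEnough && !profileIds.contains(user.id)) {
                orphans.add(user.id);
            }
        }
        return orphans;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-01-02T02:00:00Z");
        List<PendingUser> pending = List.of(
            new PendingUser("old-orphan", now.minus(Duration.ofHours(5))),
            new PendingUser("in-flight", now.minus(Duration.ofSeconds(10))));
        // prints: [old-orphan]
        System.out.println(findOrphans(pending, Set.of(), now, Duration.ofHours(1)));
    }
}
```

The grace period only needs to be comfortably longer than the slowest onboarding request.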

Lessons Learned

  • Java enums ≠ DB constraints
  • Never trust save() in a Saga
  • Flush early, fail fast
  • Delete-based rollback is fragile
  • Reconciliation jobs save careers

Distributed systems don’t always crash.
Sometimes they corrupt your data while telling you everything is fine.
