It was 3:00 AM on a Tuesday when PagerDuty decided my night was over.
- User onboarding wasn’t down.
- No red dashboards.
- No obvious errors.
But something felt off.
A quick query against production revealed the truth:
👉 Thousands of users existed in Auth, with no HR profiles.
- No UI errors.
- No retries.
- No alarms.
Just ghost users.
This is a real production postmortem from a system built with Spring Boot, PostgreSQL, and the Saga pattern. It covers how a single enum value quietly broke our data integrity.
Architecture Overview
We use an orchestration-based Saga, owned by the HR service.
Goal
When a new employee joins:
- Create Auth credentials
- Create an HR profile
Both must succeed — or neither should exist.
Original Saga Flow
[ Client ]
    |
[ HR Service (Saga Orchestrator) ]
    |
    |-- (1) Create User ----------------> [ Auth Service ] -> [ Auth DB ]
    |
    |-- (2) Save Profile ---------------> [ HR DB ❌ CHECK constraint ]
    |
    |-- (3) Delete User (Compensation) -> [ Auth Service ]
- The logs claimed compensation succeeded.
- Production data disagreed.
The Trigger: A Harmless Enum
We added support for remote work:
public enum AddressType {
    HOME,
    CURRENT,
    OFFICE,
    PERMANENT,
    WORK
}
- CI passed.
- Deploy succeeded.
- Production failed silently.
The Real Culprit: Schema Drift
The HR database had a PostgreSQL CHECK constraint:
CHECK (address_type IN ('HOME','CURRENT','OFFICE','PERMANENT'))
So when WORK was saved:
ERROR: new row for relation "employee_addresses" violates check constraint "employee_addresses_address_type_check"
But the Saga still didn’t roll back.
Why?
The Invisible Failure: JPA Lazy Writes
@Transactional
public void onboard(Request req) {
    try {
        authClient.createUser(req);               // SUCCESS
        profileRepository.save(new Profile(req)); // Deferred
    } catch (Exception e) {
        authClient.deleteUser(req.getUserId());   // Never reached
    }
} // Commit happens here
Key detail:
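- save() usually doesn't hit the DB; it only queues the insert in the persistence context (an IDENTITY-generated ID is the main exception, since it forces an immediate INSERT)
- The queued SQL runs at flush time, which here means the commit at the end of the method
- The exception is therefore thrown after the try-catch has already exited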
Result:
- Auth user created ✅
- HR insert fails ❌
- Compensation skipped 👻
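The timing is easy to reproduce without Spring or Hibernate. The sketch below is a toy model, not JPA's API: `ToyUnitOfWork` is a hypothetical stand-in for the persistence context, and the outer `commit()` call plays the role of the transactional proxy that commits after the method body returns. The `flushInsideTry` flag shows what changes when the queued write is forced while the try block is still open.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a persistence context: save() only queues work,
// and the queued statements run when flush() or commit() is called.
class ToyUnitOfWork {
    private final List<Runnable> pendingInserts = new ArrayList<>();

    void save(Runnable insert) {
        pendingInserts.add(insert); // nothing touches the "database" yet
    }

    void flush() {
        List<Runnable> toRun = new ArrayList<>(pendingInserts);
        pendingInserts.clear();
        toRun.forEach(Runnable::run); // constraint violations surface here
    }

    void commit() {
        flush();
    }
}

class DeferredWriteDemo {
    // Returns true if the compensation (the inner catch block) ran.
    static boolean onboard(boolean flushInsideTry) {
        ToyUnitOfWork uow = new ToyUnitOfWork();
        boolean compensated = false;
        try {
            try {
                // Assume the Auth user was created successfully just before this point.
                uow.save(() -> {
                    throw new IllegalStateException("violates check constraint");
                });
                if (flushInsideTry) {
                    uow.flush();
                }
            } catch (RuntimeException e) {
                compensated = true; // stands in for authClient.deleteUser(...)
            }
            uow.commit(); // what the transactional proxy does after the method returns
        } catch (RuntimeException commitFailure) {
            // Too late: the try/catch that owns the compensation has already exited.
        }
        return compensated;
    }

    public static void main(String[] args) {
        System.out.println("without flush, compensated = " + onboard(false));
        System.out.println("with flush, compensated = " + onboard(true));
    }
}
```

Running `main` prints `compensated = false` for the first call and `compensated = true` for the second, which is the whole bug in two lines.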
Fix #1: Fail Fast with flush()
@Transactional
public void onboardEmployee(OnboardRequest request) {
    try {
        authClient.createUser(request);
        employeeRepository.save(new EmployeeProfile(request));
        // Force SQL execution NOW
        employeeRepository.flush();
    } catch (Exception e) {
        authClient.deleteUser(request.getUserId());
        throw e;
    }
}
Now:
- DB errors surface immediately
- Saga becomes aware of failure
- Compensation actually runs
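Two gaps can remain in a catch block shaped like this: it also fires when the create-user call itself fails (so there is nothing to delete), and if the delete call throws, the original database error is lost. Here is a minimal sketch of a more defensive shape. `AuthClient` and `ProfileStore` are hypothetical interfaces invented for this example, not the real service clients, and `saveAndFlush` here is just a method name on that stand-in.

```java
// Hypothetical stand-ins for the Auth client and the HR repository.
interface AuthClient {
    String createUser(String email); // returns the new user's id
    void deleteUser(String userId);
}

interface ProfileStore {
    void saveAndFlush(String userId); // assumed to throw on constraint violations
}

class OnboardingSaga {
    private final AuthClient auth;
    private final ProfileStore profiles;

    OnboardingSaga(AuthClient auth, ProfileStore profiles) {
        this.auth = auth;
        this.profiles = profiles;
    }

    void onboard(String email) {
        // If this call fails there is nothing to compensate, so it stays outside the try.
        String userId = auth.createUser(email);
        try {
            profiles.saveAndFlush(userId);
        } catch (RuntimeException original) {
            try {
                auth.deleteUser(userId);
            } catch (RuntimeException compensationFailure) {
                // Keep the root cause; attach the failed compensation for the logs.
                original.addSuppressed(compensationFailure);
            }
            throw original;
        }
    }
}

class OnboardingSagaDemo {
    public static void main(String[] args) {
        AuthClient auth = new AuthClient() {
            public String createUser(String email) { return "user-1"; }
            public void deleteUser(String userId) {
                System.out.println("compensating: delete " + userId);
            }
        };
        ProfileStore failing = userId -> {
            throw new IllegalStateException("violates check constraint");
        };
        try {
            new OnboardingSaga(auth, failing).onboard("new.hire@example.com");
        } catch (IllegalStateException e) {
            System.out.println("onboarding failed: " + e.getMessage());
        }
    }
}
```

A failed compensation still leaves a user behind in Auth, so this only improves the error report; it does not remove the need for a safety net.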
Fix #2: Repair the Schema
ALTER TABLE employee_addresses
DROP CONSTRAINT employee_addresses_address_type_check;
ALTER TABLE employee_addresses
ADD CONSTRAINT employee_addresses_address_type_check
CHECK (address_type IN (
'HOME','CURRENT','OFFICE','PERMANENT','WORK'
));
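Repairing the constraint fixes this release, but nothing stops the enum and the CHECK list from drifting apart again. One option is a small check in CI that compares the two. The sketch below is an assumption about how that could look: the constraint definition is hard-coded as a string, and how it is obtained in practice (a migration file, a catalog query) is left open. `EnumConstraintDrift` and its methods are hypothetical names.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

enum AddressType { HOME, CURRENT, OFFICE, PERMANENT, WORK }

class EnumConstraintDrift {
    private static final Pattern QUOTED = Pattern.compile("'([^']*)'");

    // Pulls the quoted literals out of a CHECK (... IN ('A','B')) definition.
    static Set<String> allowedValues(String checkDefinition) {
        Set<String> values = new LinkedHashSet<>();
        Matcher m = QUOTED.matcher(checkDefinition);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    // Returns the enum constants that the constraint would reject.
    static Set<String> missingFromConstraint(Class<? extends Enum<?>> enumType,
                                             String checkDefinition) {
        Set<String> missing = new LinkedHashSet<>();
        for (Enum<?> constant : enumType.getEnumConstants()) {
            missing.add(constant.name());
        }
        missing.removeAll(allowedValues(checkDefinition));
        return missing;
    }

    public static void main(String[] args) {
        String before = "CHECK (address_type IN ('HOME','CURRENT','OFFICE','PERMANENT'))";
        System.out.println(missingFromConstraint(AddressType.class, before)); // [WORK]
    }
}
```

A build that fails whenever the returned set is non-empty would have caught `WORK` before it reached production.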
The Real Fix: State-Based Saga
Delete-based rollback assumes perfect networks.
They don’t exist.
Improved Flow
[ Client ]
    |
[ HR Service ]
    |
    |-- Create User (PENDING) --------> [ Auth DB ]
    |
    |-- Save Profile (FLUSH) ---------> [ HR DB ]
    |
    |-- Activate User (ACTIVE) -------> [ Auth DB ]
No deletes.
Only state transitions.
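The flow can be sketched with in-memory stand-ins to show why a failure in the middle step is harmless. Everything here is hypothetical (`InMemoryAuth`, `ProfileWriter`, the `canLogIn` rule); the only thing taken from the design above is the PENDING to ACTIVE transition.

```java
import java.util.HashMap;
import java.util.Map;

enum UserStatus { PENDING, ACTIVE }

// In-memory stand-in for the Auth service; only the status matters here.
class InMemoryAuth {
    final Map<String, UserStatus> users = new HashMap<>();

    void createPending(String userId) {
        users.put(userId, UserStatus.PENDING);
    }

    void activate(String userId) {
        // No-op for unknown ids; known users move to ACTIVE.
        users.computeIfPresent(userId, (id, status) -> UserStatus.ACTIVE);
    }

    boolean canLogIn(String userId) {
        return users.get(userId) == UserStatus.ACTIVE;
    }
}

// Hypothetical stand-in for the HR repository.
interface ProfileWriter {
    void saveAndFlush(String userId); // assumed to throw on constraint violations
}

class StateBasedOnboarding {
    static void onboard(String userId, InMemoryAuth auth, ProfileWriter profiles) {
        auth.createPending(userId);    // step 1: user exists but is not usable
        profiles.saveAndFlush(userId); // step 2: may throw
        auth.activate(userId);         // step 3: reached only if step 2 succeeded
    }

    public static void main(String[] args) {
        InMemoryAuth auth = new InMemoryAuth();
        try {
            onboard("u-1", auth, id -> {
                throw new IllegalStateException("violates check constraint");
            });
        } catch (IllegalStateException e) {
            // No compensating delete: the user simply stays PENDING.
        }
        System.out.println(auth.users.get("u-1") + ", can log in: " + auth.canLogIn("u-1"));
    }
}
```

The failed onboarding leaves a PENDING row behind, but a PENDING user cannot do anything, so the leftover is clutter and not a ghost account.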
The Safety Net: Reconciliation Job
@Scheduled(cron = "0 0 2 * * *")
public void cleanupOrphans() {
    for (User user : authRepo.findAllByStatus(PENDING)) {
        if (!hrService.profileExists(user.getId())) {
            authRepo.delete(user);
        }
    }
}
This job quietly saved us more than once.
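One thing to watch with a job like this: an onboarding that is still in flight also looks like a PENDING user with no profile. A grace period avoids deleting it mid-saga. The sketch below isolates just the selection logic and assumes a `createdAt` timestamp on the user, which the original code does not show; `PendingUser` and `OrphanSelector` are hypothetical names.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical projection of an Auth user; createdAt is an assumed column.
final class PendingUser {
    final String id;
    final Instant createdAt;

    PendingUser(String id, Instant createdAt) {
        this.id = id;
        this.createdAt = createdAt;
    }
}

class OrphanSelector {
    // Returns the PENDING users that are older than the grace period
    // and still have no HR profile.
    static List<PendingUser> selectOrphans(List<PendingUser> pending,
                                           Predicate<String> profileExists,
                                           Instant now,
                                           Duration gracePeriod) {
        Instant cutoff = now.minus(gracePeriod);
        List<PendingUser> orphans = new ArrayList<>();
        for (PendingUser user : pending) {
            boolean oldEnough = user.createdAt.isBefore(cutoff);
            if (oldEnough && !profileExists.test(user.id)) {
                orphans.add(user);
            }
        }
        return orphans;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-01-02T02:00:00Z");
        List<PendingUser> pending = List.of(
                new PendingUser("stale", now.minus(Duration.ofHours(6))),
                new PendingUser("in-flight", now.minus(Duration.ofSeconds(5))));
        // Pretend no HR profile exists for anyone.
        List<PendingUser> orphans =
                selectOrphans(pending, id -> false, now, Duration.ofMinutes(30));
        orphans.forEach(u -> System.out.println("would delete " + u.id));
    }
}
```

Passing `now` in as a parameter keeps the selection testable; the 30-minute window is an arbitrary example value and would need to exceed the longest realistic onboarding.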
Lessons Learned
- Java enums ≠ DB constraints
- Never trust save() in a Saga
- Flush early, fail fast
- Delete-based rollback is fragile
- Reconciliation jobs save careers
Distributed systems don't always crash.
Sometimes they corrupt your data while telling you everything is fine.