Pod crashes halfway through processing 50,000 invoices. The replacement pod starts up and begins from invoice 1. Customers 1 through 25,000 get processed twice.
This failure mode hits any team running scheduled jobs across multiple pods. ShedLock prevents duplicate execution across pods but has no answer for crash recovery: when the pod dies, all progress is lost.
The standard answer does not scale
The usual advice is to make your jobs idempotent. For simple jobs that works. For billing, reporting, or anything that calls an external service, true idempotency is either very difficult or impossible. And even with idempotency, reprocessing 25,000 items wastes time and resources.
The right answer is checkpointing. Save progress after each item. On resume, skip what is already done.
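To make the idea concrete before showing the library's API, here is a minimal sketch of item-level checkpointing. This is not Vigil's implementation; `CheckpointStore` and `CheckpointedJob` are hypothetical names, and the in-memory store stands in for what would be a database row in practice.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Stand-in for a durable store (in a real system: a DB row per job).
class CheckpointStore {
    private final AtomicLong lastProcessedId = new AtomicLong(0);

    long load() { return lastProcessedId.get(); }
    void save(long id) { lastProcessedId.set(id); }
}

class CheckpointedJob {
    // Process items in id order, saving progress after each one.
    // On restart, items at or below the saved id are skipped.
    static int run(CheckpointStore store, List<Long> itemIds) {
        long resumeAfter = store.load();
        int processed = 0;
        for (long id : itemIds) {
            if (id <= resumeAfter) continue; // already done before the crash
            // ... process item `id` here ...
            processed++;
            store.save(id); // checkpoint after each item
        }
        return processed;
    }
}
```

If the pod crashes after item 3 of 5, the replacement loads `3` from the store and processes only items 4 and 5.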
What this looks like with Vigil
I am a master's student building a library for Spring Boot that adds per-item checkpointing to scheduled jobs. The API looks like this:
@FencedScheduled(name = "monthly-billing", cron = "0 0 1 1 * *",
        lockTtlSeconds = 7200, checkpoint = true)
public void run(JobContext ctx) {
    ChargeState result = ctx.forEachPageWithState(
            Stage.CHARGE,
            lastId -> customerRepo.findNext100After(lastId),
            Customer::getId,
            new ChargeState(0, 0),
            (customer, token, state) -> {
                stripeService.charge(customer, IdempotencyKey.of(token, customer.getId()));
                return new ChargeState(state.charged() + 1, state.failed());
            }
    );
    ctx.step(Stage.NOTIFY, () -> slack.post("Done: " + result.charged() + " charged"));
}
Pod crashes at customer 30,000. Replacement pod starts, skips customers 1 through 30,000, resumes from 30,001. The NOTIFY step only runs after all customers are processed and never fires twice.
It also solves the GC split-brain
ShedLock has a second problem most teams do not know about until it hits them. If a pod enters a long GC pause, the lock TTL expires, a second pod takes over, and the first pod wakes up still believing it holds the lock. Both pods run simultaneously, and both sets of writes are accepted. ShedLock's own README documents this as a known limitation.
Vigil uses fencing tokens to prevent this. Every lock acquisition increments a counter. Every checkpoint write must present the current token. A zombie pod carries a stale token — its writes hit zero rows and it stops itself. No corruption.
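The mechanism is easier to see in code. The sketch below is the general fencing-token pattern, not Vigil's internals; `FencedStore` and its method names are hypothetical, and the in-memory maps stand in for database tables where the checkpoint update would be a conditional `UPDATE ... WHERE token = ?`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the store tracks the current token per job.
// A checkpoint write with a stale token updates zero rows, which is
// the zombie pod's signal to stop itself.
class FencedStore {
    private final Map<String, Long> currentToken = new HashMap<>();
    private final Map<String, Long> checkpoint = new HashMap<>();

    // Every lock acquisition increments the job's token.
    long acquire(String job) {
        return currentToken.merge(job, 1L, Long::sum);
    }

    // Returns the number of rows updated: 1 on success, 0 for a stale token.
    int writeCheckpoint(String job, long token, long lastId) {
        if (!Long.valueOf(token).equals(currentToken.get(job))) {
            return 0; // stale token: the lock moved on while this pod was paused
        }
        checkpoint.put(job, lastId);
        return 1;
    }
}
```

Pod A acquires token 1, pauses for GC, and pod B acquires token 2. When A wakes up and tries to checkpoint with token 1, the write hits zero rows and A aborts instead of corrupting state.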
Honest status
The library is not finished. The core locking and checkpointing are implemented and tested. The full API, autoconfiguration, and demo application are still in progress. I am targeting a first release alongside my thesis submission.
I am sharing this now because my thesis needs to validate whether this solves a real problem for real teams — not just in theory. Your answers genuinely matter for the research.
Two questions:
- Have you hit either of these problems in production — crash recovery or the GC split-brain?
- If this library existed today as a single Maven dependency, would you actually use it?
Any answer helps, including "no, we handle it differently" — that is just as valuable to know.