Originally published at adrijshikhar.dev.
Most retry libraries wrap one call. Fine for a single flaky operation — but when you run a pool of tasks, retry should be the pool's job, not yours.
retry-thread-pool puts retries at the thread-pool level: wrap any ExecutorService, submit a named task, get a CompletableFuture — retries happen on their own. Java 17+, on Maven Central, zero runtime dependencies.
Quickstart
RetryPolicy policy = RetryPolicy.builder()
    .maxRetries(3)
    .backoff(Backoff.exponentialWithJitter(Duration.ofMillis(100), Duration.ofSeconds(5)))
    .retryOn(IOException.class)
    .build();

try (RetryExecutor executor = RetryExecutor.builder().retryPolicy(policy).build()) {
    CompletableFuture<User> user = executor.submit("fetch-user", () -> client.fetchUser(id));
    // compose it, join it, or collect a whole batch; it's a normal CompletableFuture
}
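Since submit hands back a plain CompletableFuture, collecting a batch needs nothing beyond the JDK. The sketch below is one way to do it, not part of the library: BatchJoin and its allOf helper are hypothetical names, and CompletableFuture.supplyAsync stands in for executor.submit so the example runs on its own.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Hypothetical helper, not part of retry-thread-pool.
final class BatchJoin {
    /** Turns a list of futures into one future of a list, preserving order. */
    static <T> CompletableFuture<List<T>> allOf(List<CompletableFuture<T>> futures) {
        return CompletableFuture
                .allOf(futures.toArray(new CompletableFuture<?>[0]))
                .thenApply(ignored -> futures.stream()
                        // every future is already complete here, so join() does not block
                        .map(CompletableFuture::join)
                        .collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        // supplyAsync stands in for executor.submit("name", task)
        List<CompletableFuture<Integer>> batch = List.of(
                CompletableFuture.supplyAsync(() -> 1),
                CompletableFuture.supplyAsync(() -> 2),
                CompletableFuture.supplyAsync(() -> 3));
        System.out.println(allOf(batch).join()); // prints [1, 2, 3]
    }
}
```

If any future in the batch fails, the combined future fails too, so with a retrying executor that only happens once a task is exhausted or aborted.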
What you get
- Backoff: none, fixed, exponential, exponentialWithJitter. Jitter kills synchronized retry storms.
- Predicates: retryOn(...) / abortOn(...); abortOn wins. Error and InterruptedException never retry.
- Per-attempt timeout: a hung attempt is interrupted and retried, not left to wedge a worker.
- Listeners: onRetry / onSuccess / onExhausted / onAbort, for metrics and logs without touching task code.
- Stats: immutable snapshot of submitted / succeeded / exhausted / retried / timed-out counts.
- Bring your own pool: any ExecutorService, including virtual threads on 21+.
- Loud exhaustion: out of retries → RetryExhaustedException (cause = last failure); a non-retryable error surfaces as itself.
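The predicate rules above can be sketched as a small decision function. This is an illustration of the stated precedence (Error and InterruptedException never retry, abortOn beats retryOn), not the library's code. RetryDecision is a hypothetical class, and treating an exception that matches neither list as non-retryable is an assumption; the library's default may differ.

```java
import java.util.Set;

// Hypothetical illustration of the precedence rules; not the library's implementation.
final class RetryDecision {
    private final Set<Class<? extends Throwable>> retryOn;
    private final Set<Class<? extends Throwable>> abortOn;

    RetryDecision(Set<Class<? extends Throwable>> retryOn,
                  Set<Class<? extends Throwable>> abortOn) {
        this.retryOn = Set.copyOf(retryOn);
        this.abortOn = Set.copyOf(abortOn);
    }

    boolean shouldRetry(Throwable failure) {
        // Hard stops first: JVM errors and interruption are never retried.
        if (failure instanceof Error || failure instanceof InterruptedException) {
            return false;
        }
        // abortOn wins, even when a retryOn entry also matches (e.g. via a superclass).
        if (matches(abortOn, failure)) {
            return false;
        }
        // Assumption: anything not listed in retryOn is treated as non-retryable.
        return matches(retryOn, failure);
    }

    private static boolean matches(Set<Class<? extends Throwable>> types, Throwable failure) {
        return types.stream().anyMatch(type -> type.isInstance(failure));
    }
}
```

With retryOn = {IOException} and abortOn = {FileNotFoundException}, a SocketException retries while a FileNotFoundException surfaces immediately, even though it is also an IOException.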
Why it matters
- Fire and forget: submit → future. No catch, no sleep, no attempt counters, no rescheduling in your code.
- Async stays async: backoff is a scheduler timer, not a Thread.sleep. Workers keep working; throughput holds when a dependency flaps.
- Independent healing: each task has its own budget; one flaky task doesn't stall the ninety-nine beside it.
- Resilience is a pool property, not retry logic threaded through every call site.
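To make "backoff is a scheduler timer" concrete, here is a minimal sketch of the general technique using only JDK executors. It is not the library's implementation: TimerBackoffSketch, its submit signature, and the fixed delay are all assumptions chosen for brevity. A failed attempt returns its worker to the pool at once, and the timer thread only hands the next attempt back to the workers.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of timer-driven retries; not the library's code.
final class TimerBackoffSketch {
    static <T> CompletableFuture<T> submit(ExecutorService workers, ScheduledExecutorService timer,
                                           Callable<T> task, int maxRetries, Duration delay) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(workers, timer, task, maxRetries, delay, result);
        return result;
    }

    private static <T> void attempt(ExecutorService workers, ScheduledExecutorService timer,
                                    Callable<T> task, int retriesLeft, Duration delay,
                                    CompletableFuture<T> result) {
        workers.execute(() -> {
            try {
                result.complete(task.call());
            } catch (Exception e) {
                if (retriesLeft == 0) {
                    result.completeExceptionally(e); // budget spent: surface the last failure
                    return;
                }
                // No Thread.sleep: the worker is free as soon as this lambda returns.
                timer.schedule(
                        () -> attempt(workers, timer, task, retriesLeft - 1, delay, result),
                        delay.toMillis(), TimeUnit.MILLISECONDS);
            } catch (Throwable fatal) {
                result.completeExceptionally(fatal); // Errors are not retried in this sketch
            }
        });
    }

    /** Runs a task that fails twice, then succeeds; returns how many attempts it took. */
    static int demo() {
        ExecutorService workers = Executors.newFixedThreadPool(2);
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger calls = new AtomicInteger();
        try {
            return submit(workers, timer, () -> {
                int n = calls.incrementAndGet();
                if (n < 3) {
                    throw new IllegalStateException("flaky");
                }
                return n;
            }, 3, Duration.ofMillis(10)).join();
        } finally {
            timer.shutdownNow();
            workers.shutdownNow();
        }
    }
}
```

The timer thread never runs the task itself, which is the same division of labour the post describes: delays cost a timer entry rather than a blocked worker.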
Observability
See what the pool is doing — without instrumenting your task code:
- Listeners: onRetry / onSuccess / onExhausted / onAbort fire on every transition; bridge them to Micrometer, StatsD, or logs.
- stats(): an immutable snapshot of submitted / succeeded / exhausted / aborted / retried / timed-out / rejected, plus active + queued counts. Scrape it for a dashboard or a health check.
- Logs: via System.Logger, routed to your existing backend. Nothing to wire.
- Latency: TaskEvent.attemptDuration (per attempt) and stats().totalExecutionMillis (aggregate) give you timing, not just counts.
RetryExecutor executor = RetryExecutor.builder()
    .retryPolicy(policy)
    .listener(new RetryListener() {
        @Override public void onRetry(TaskEvent e) { meter.counter("pool.retry", "task", e.taskName()).increment(); }
        @Override public void onExhausted(TaskEvent e) { meter.counter("pool.exhausted", "task", e.taskName()).increment(); }
    })
    .build();

RetryExecutorStats s = executor.stats(); // point-in-time snapshot
log.info("succeeded={} exhausted={} retries={} timedOut={}",
    s.succeeded(), s.exhausted(), s.retriesScheduled(), s.timedOut());
Lifecycle & control
- AutoCloseable: use try-with-resources; close() stops new submits and drains in-flight plus already-scheduled retries before returning.
- Owns only what it makes: it shuts down its internal pool; a pool you pass in stays yours to close.
- Cancellation: future.cancel(true) interrupts the running attempt and cancels the pending retry. Cancelled ≠ exhausted, so no spurious onExhausted.
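The cancellation point deserves a note, because a bare CompletableFuture ignores the mayInterruptIfRunning flag: cancel(true) marks the future cancelled but interrupts nothing by itself. The sketch below shows one common way to forward the cancel to the running attempt. It is an assumption about the general technique, not a description of how the library wires it, and CancelPropagation is a hypothetical name. Cancelling a pending retry timer would follow the same pattern with the ScheduledFuture.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not the library's code.
final class CancelPropagation {
    static CompletableFuture<Void> run(ExecutorService pool, Runnable body) {
        CompletableFuture<Void> result = new CompletableFuture<>();
        Future<?> running = pool.submit(() -> {
            try {
                body.run();
                result.complete(null); // no-op if the caller already cancelled
            } catch (Throwable t) {
                result.completeExceptionally(t);
            }
        });
        // Forward cancellation by hand: interrupt the worker running the attempt.
        result.whenComplete((value, error) -> {
            if (result.isCancelled()) {
                running.cancel(true);
            }
        });
        return result;
    }

    /** Returns true if cancelling the future interrupted the in-flight attempt. */
    static boolean demo() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch interrupted = new CountDownLatch(1);
        try {
            CompletableFuture<Void> future = run(pool, () -> {
                started.countDown();
                try {
                    Thread.sleep(60_000); // stands in for a hung call
                } catch (InterruptedException e) {
                    interrupted.countDown();
                }
            });
            started.await();
            future.cancel(true);
            return interrupted.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow();
        }
    }
}
```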
Robustness
- Fail-fast config: build() validates that maxRetries >= 0 and that durations are positive, and rejects a class listed in both retryOn and abortOn.
- Overflow-safe backoff: exponential delays cap cleanly instead of overflowing; jitter is full jitter over [0, delay].
- Correct under load: the scheduler thread never runs your code (attempts and listeners run on the work pool), and stats are lock-free.
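As a rough illustration of what "cap cleanly instead of overflowing" and "full jitter over [0, delay]" can mean in code, here is a minimal sketch. It assumes a doubling schedule and millisecond arithmetic; BackoffSketch and both method names are hypothetical, and the library's actual formula may differ.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of capped exponential backoff with full jitter.
final class BackoffSketch {
    /** Delay for a 0-based retry index: base * 2^retryIndex, capped at maxMillis. */
    static long cappedDelayMillis(long baseMillis, long maxMillis, int retryIndex) {
        if (baseMillis <= 0 || maxMillis <= 0 || retryIndex < 0) {
            throw new IllegalArgumentException("durations must be positive, retryIndex >= 0");
        }
        long delay = Math.min(baseMillis, maxMillis);
        for (int i = 0; i < retryIndex && delay < maxMillis; i++) {
            // Doubling is only safe while delay <= max / 2; otherwise jump to the cap.
            delay = (delay > maxMillis / 2) ? maxMillis : delay * 2;
        }
        return delay;
    }

    /** Full jitter: a uniformly random delay in [0, delayMillis]. */
    static long fullJitterMillis(long delayMillis) {
        // nextLong's bound is exclusive; avoid delay + 1 overflowing at Long.MAX_VALUE.
        long bound = (delayMillis == Long.MAX_VALUE) ? delayMillis : delayMillis + 1;
        return ThreadLocalRandom.current().nextLong(bound);
    }

    public static void main(String[] args) {
        for (int retry = 0; retry < 8; retry++) {
            long cap = cappedDelayMillis(100, 5_000, retry);
            System.out.println("retry " + retry + ": up to " + cap + " ms, drew " + fullJitterMillis(cap));
        }
    }
}
```

With a 100 ms base and a 5 s cap the upper bounds run 100, 200, 400, 800, 1600, 3200, then stay at 5000, no matter how large the retry index gets.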
Zero dependencies
Logging goes through the JDK's System.Logger facade (Java 9+) — routes to your SLF4J/Log4j if present, silent otherwise. You add one artifact and nothing else comes with it.
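For readers who have not used the facade, this is roughly what logging through System.Logger looks like from library code. It is a sketch: the logger name com.example.retry and the message are made up, and the library's real logger names and levels are not stated here. Which backend receives the records depends on the System.LoggerFinder available at runtime; without one, the JDK falls back to java.util.logging when that module is present.

```java
// Hypothetical sketch; logger name and message are illustrative only.
final class LoggerSketch {
    private static final System.Logger LOG = System.getLogger("com.example.retry");

    static void logRetry(String taskName, int attempt) {
        // Placeholders use java.text.MessageFormat syntax: {0}, {1}, ...
        LOG.log(System.Logger.Level.DEBUG, "retrying {0}, attempt {1}", taskName, attempt);
    }

    static String loggerName() {
        return LOG.getName();
    }

    public static void main(String[] args) {
        logRetry("fetch-user", 2);
    }
}
```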
Agent-first
Built so an AI agent can use it from the examples alone:
- llms.txt: discovery index pointing agents at the docs.
- docs/AI_USAGE.md: full public surface + a recipe per feature.
- AGENTS.md: build/test/conventions for agents editing the library.
- Docs = compiling tests: every recipe is a real test in ExamplesTest. Change the API and the examples stop compiling, so the build fails. The docs can't drift from the code.
// from ExamplesTest: compiles and passes on every build
@Test
void exhaustionSurfacesLastFailure() {
    RetryPolicy policy = RetryPolicy.builder()
        .maxRetries(2).backoff(Backoff.fixed(Duration.ofMillis(5))).build();

    try (RetryExecutor executor = RetryExecutor.builder().retryPolicy(policy).build()) {
        CompletableFuture<String> result =
            executor.submit("doomed", () -> { throw new IOException("permanent"); });

        ExecutionException thrown = assertThrows(ExecutionException.class, result::get);
        RetryExhaustedException cause =
            assertInstanceOf(RetryExhaustedException.class, thrown.getCause());
        assertEquals(3, cause.attempts()); // 1 initial + 2 retries
        assertInstanceOf(IOException.class, cause.getCause());
    }
}
Try it
<dependency>
    <groupId>io.github.adrijshikhar</groupId>
    <artifactId>retry-thread-pool</artifactId>
    <version>0.2.0</version>
</dependency>
- Repo: https://github.com/adrijshikhar/retry-thread-pool
- API docs: https://javadoc.io/doc/io.github.adrijshikhar/retry-thread-pool
Retries belong wherever your work runs. If your work runs on a pool, they belong on the pool.