---
title: "Structured Concurrency in Ktor 3: Failure Isolation Done Right"
published: true
description: "Build a resilient Ktor 3 coroutine hierarchy with SupervisorJob, scoped background jobs, and failure isolation that prevents one slow upstream from cascading across your service."
tags: kotlin, architecture, api, backend
canonical_url: https://blog.mvpfactory.co/structured-concurrency-in-ktor-3-failure-isolation-done-right
---
## What We Will Build
In this workshop we will wire up a **coroutine supervision tree** for a Ktor 3 service that handles parallel upstream calls per-request, runs background jobs that outlive requests but respect SIGTERM, and isolates third-party SDK failures behind blast-radius boundaries. By the end you will have the exact `SupervisorJob` + `CoroutineExceptionHandler` hierarchy you can drop into production.
## Prerequisites
- Kotlin 1.9+ and Ktor 3.x on your classpath
- Familiarity with `async`/`launch` and `coroutineScope`
- Micrometer on your dependency list for the metrics section
## Step 1: Stop Default Scoping From Killing Siblings
Here is the minimal setup to get this working. When you fan out parallel calls inside a route handler, `coroutineScope` runs them under a regular `Job`. One failed child cancels everything: database, cache, the lot.
```kotlin
// DANGEROUS: one failure cancels everything
get("/dashboard") {
    // `id` is assumed to be resolved from the call (for example, the authenticated user)
    coroutineScope {
        val user = async { userService.fetch(id) }       // DB call
        val prefs = async { cacheService.getPrefs(id) }  // Redis
        val recs = async { recoApi.fetch(id) }           // External API, slow
        call.respond(DashboardResponse(user.await(), prefs.await(), recs.await()))
    }
}
```
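If `recoApi.fetch()` throws, say an `IOException` from a timed-out socket, the scope's `Job` fails and both `user` and `prefs` are cancelled. (A `TimeoutCancellationException` from `withTimeout` is a `CancellationException`, so it does not cancel siblings by itself, but the request still fails once `recs.await()` rethrows it.) At 2,000 req/s, one flaky upstream turns your p99 latency into a p50 error rate.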
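To watch the cascade in isolation, here is a minimal sketch that needs only `kotlinx.coroutines`. The `slowDbCall` and `failingApiCall` functions are hypothetical stand-ins for the services in the route above, not part of any real API.

```kotlin
import kotlinx.coroutines.*
import java.io.IOException

// Hypothetical stand-ins for the DB call and the flaky external API.
suspend fun slowDbCall(): String { delay(200); return "user" }
suspend fun failingApiCall(): List<String> { delay(50); throw IOException("upstream unavailable") }

// Returns true if the DB call was cancelled because its sibling failed.
suspend fun siblingWasCancelled(): Boolean {
    var cancelled = false
    try {
        coroutineScope {
            val user = async {
                try {
                    slowDbCall()
                } catch (e: CancellationException) {
                    cancelled = true // the sibling's failure reached us
                    throw e
                }
            }
            val recs = async { failingApiCall() }
            user.await() to recs.await()
        }
    } catch (e: IOException) {
        // coroutineScope rethrows the child's failure to its caller
    }
    return cancelled
}

fun main() = runBlocking {
    println("sibling cancelled: ${siblingWasCancelled()}")
}
```

The DB call never gets its 200 ms: the API failure at 50 ms fails the scope, which cancels every other child before rethrowing.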
**The fix:** replace `coroutineScope` with `supervisorScope`. Child failures stop propagating sideways:
```kotlin
get("/dashboard") {
    supervisorScope {
        val user = async { userService.fetch(id) }
        val prefs = async { cacheService.getPrefs(id) }
        val recs = async {
            withTimeout(500.milliseconds) { recoApi.fetch(id) }
        }
        // Optional data: fall back to an empty list if the call fails or times out
        val recsResult = runCatching { recs.await() }.getOrDefault(emptyList())
        call.respond(DashboardResponse(user.await(), prefs.await(), recsResult))
    }
}
```
Now a recommendation API timeout returns a degraded response instead of a 500. The critical path completes independently.
| Strategy | Child failure behavior | Use case |
|---|---|---|
| `coroutineScope` (regular `Job`) | Cancels all siblings | All-or-nothing transactions |
| `supervisorScope` (`SupervisorJob`) | Siblings continue | Parallel independent fetches |
| Custom `SupervisorJob` + `CoroutineExceptionHandler` | Siblings continue, errors logged | Background job pools |
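The third row is the one the rest of this workshop leans on, so here is a small sketch of it outside Ktor. It assumes nothing beyond `kotlinx.coroutines`; `runPool` and the job bodies are illustrative names, not an established pattern from any library.

```kotlin
import kotlinx.coroutines.*
import java.util.concurrent.CopyOnWriteArrayList

// Runs one failing and one healthy job in a supervisor-backed scope.
// Returns the messages the handler saw and whether the healthy job finished normally.
fun runPool(): Pair<List<String>, Boolean> = runBlocking {
    val seen = CopyOnWriteArrayList<String>()
    val handler = CoroutineExceptionHandler { _, t -> seen.add(t.message ?: "unknown") }
    val pool = CoroutineScope(SupervisorJob() + Dispatchers.Default + handler)

    val failing = pool.launch { throw IllegalStateException("job A failed") }
    val healthy = pool.launch { delay(100) }

    joinAll(failing, healthy)
    val healthyFinished = healthy.isCompleted && !healthy.isCancelled
    pool.cancel() // release the scope once the demo is done
    seen.toList() to healthyFinished
}

fun main() {
    val (errors, healthyFinished) = runPool()
    println("handler saw: $errors, healthy job finished: $healthyFinished")
}
```

The handler only fires for coroutines started with `launch`. A failure inside `async` is stored in the `Deferred` and shows up when you call `await()`.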
## Step 2: Background Jobs That Survive Requests
Webhook retries and cache warming should outlive the request but respect graceful shutdown. Let me show you a pattern I use in every project — an application-scoped supervisor tied to Ktor's lifecycle:
```kotlin
fun Application.configureBackgroundJobs() {
    val handler = CoroutineExceptionHandler { _, throwable ->
        log.error("Background job failed", throwable)
        meterRegistry.counter("bg.job.failure", "type", throwable.javaClass.simpleName).increment()
    }
    val bgScope = CoroutineScope(SupervisorJob() + Dispatchers.Default + handler)

    // Ktor 3 exposes the event monitor directly on Application
    monitor.subscribe(ApplicationStopping) {
        runBlocking {
            // Let in-flight jobs drain first; the 10-second grace period is an assumed value
            withTimeoutOrNull(10.seconds) {
                bgScope.coroutineContext.job.children.forEach { it.join() }
            }
            bgScope.cancel() // then cancel whatever is still running
        }
    }

    routing {
        post("/webhook") {
            val payload = call.receive<WebhookPayload>()
            call.respond(HttpStatusCode.Accepted)
            bgScope.launch {
                retryWithBackoff(maxAttempts = 3) { webhookProcessor.deliver(payload) }
            }
        }
    }
}
```
The `SupervisorJob` means one failing delivery does not cancel other in-flight jobs. The shutdown hook gives active jobs a bounded window to finish when SIGTERM arrives, then cancels the stragglers. Cancelling first and joining afterwards would only wait for jobs to unwind, not to complete. Nothing is orphaned, and a delivery is dropped only if it outlasts the grace period.
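The webhook handler calls `retryWithBackoff`, which is not a library function. One possible shape for it is sketched below; the signature, the `initialDelay` parameter, and the doubling policy are all assumptions made for illustration.

```kotlin
import kotlinx.coroutines.*
import kotlin.time.Duration
import kotlin.time.Duration.Companion.milliseconds

// Hypothetical helper: retries `block` up to maxAttempts times, doubling the delay each time.
suspend fun <T> retryWithBackoff(
    maxAttempts: Int,
    initialDelay: Duration = 100.milliseconds,
    block: suspend () -> T
): T {
    require(maxAttempts >= 1) { "maxAttempts must be at least 1" }
    var delayFor = initialDelay
    repeat(maxAttempts - 1) {
        try {
            return block()
        } catch (e: CancellationException) {
            throw e // never retry cancellation, or shutdown would stall
        } catch (e: Exception) {
            delay(delayFor)
            delayFor *= 2
        }
    }
    return block() // final attempt: let the failure reach the exception handler
}

fun main() = runBlocking {
    var calls = 0
    val result = retryWithBackoff(maxAttempts = 3, initialDelay = 10.milliseconds) {
        calls++
        if (calls < 3) error("transient failure") else "delivered"
    }
    println("$result after $calls attempts")
}
```

Rethrowing `CancellationException` matters here: a helper that swallows it keeps retrying after the scope is cancelled and fights the shutdown hook.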
## Step 3: Isolate Rogue SDK Coroutines
The docs do not mention this, but third-party SDKs that launch coroutines into your scope are the ones that get you at 3 AM. One unhandled exception propagates up and cancels your application scope.
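```kotlin
// Dedicated scope: its SupervisorJob is not a child of the application or request job
val sdkScope = CoroutineScope(
    SupervisorJob() + Dispatchers.IO + CoroutineExceptionHandler { _, ex ->
        // Invoked for uncaught failures of coroutines started with launch in this scope
        log.warn("SDK failure isolated", ex)
        meterRegistry.counter("sdk.failure.isolated").increment()
    }
)

// Run the call as a child of sdkScope; its failure surfaces only through await()
suspend fun safeSdkCall(): SdkResult? = try {
    sdkScope.async { withTimeout(2.seconds) { thirdPartySdk.riskyOperation() } }.await()
} catch (ex: Exception) {
    currentCoroutineContext().ensureActive() // rethrow if the caller itself was cancelled
    log.warn("SDK call failed", ex)
    null
}
```
This creates a blast radius boundary. The call runs as a child of `sdkScope`, so whatever the SDK throws surfaces only at `await()`, where it is logged and converted to `null`. Note that `withContext(sdkScope.coroutineContext)` would not do this: `withContext` rethrows the failure to the caller and detaches the block from the caller's cancellation. Your request pipeline and background jobs are untouched.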
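One way to check the boundary is to fail a call inside a dedicated scope and confirm that a separate scope, and the dedicated scope itself, stay active. This is a minimal sketch; `riskyOperation` and `appSurvivesSdkFailure` are hypothetical names standing in for the SDK and your own wiring.

```kotlin
import kotlinx.coroutines.*

// Hypothetical SDK call that always fails.
suspend fun riskyOperation(): String { delay(10); throw IllegalStateException("sdk blew up") }

// Fails an SDK call inside sdkScope and reports whether both scopes are still usable.
fun appSurvivesSdkFailure(): Boolean = runBlocking {
    val appScope = CoroutineScope(SupervisorJob() + Dispatchers.Default)
    val sdkScope = CoroutineScope(SupervisorJob() + Dispatchers.IO)

    val appJob = appScope.launch { delay(100) }
    // async stores the failure in the Deferred; it surfaces only at await()
    val outcome = runCatching { sdkScope.async { riskyOperation() }.await() }

    appJob.join()
    val survived = outcome.isFailure &&
        appScope.isActive && !appJob.isCancelled && // application work was untouched
        sdkScope.isActive                           // the SDK scope accepts new calls too
    appScope.cancel()
    sdkScope.cancel()
    survived
}

fun main() {
    println("application survived SDK failure: ${appSurvivesSdkFailure()}")
}
```

Because the scope is backed by a `SupervisorJob`, one failed SDK call does not poison it for the next caller.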
## Step 4: Wire Micrometer Into Job States
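```kotlin
fun CoroutineScope.launchTracked(
    name: String, registry: MeterRegistry,
    block: suspend CoroutineScope.() -> Unit
): Job {
    // Gauge samples the number of live children of this scope's Job.
    // See Gotchas: prefer registering it once per scope at startup.
    registry.gauge("jobs.active", Tags.of("name", name), this) { scope ->
        scope.coroutineContext.job.children.count().toDouble()
    }
    return launch {
        // Timer.record takes a non-suspending lambda, so time the block with a Timer.Sample
        val sample = Timer.start(registry)
        try {
            block()
        } finally {
            sample.stop(registry.timer("job.duration", "name", name))
        }
    }
}
```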
This gives you Grafana dashboards with active job counts, duration percentiles, and failure rates by job type.
## The Full Scope Hierarchy
```plaintext
Application (SupervisorJob + CEH → logs & metrics)
├── RequestScope (supervisorScope per-request)
│   ├── async { dbCall }
│   ├── async { cacheCall }
│   └── async { apiCall }        ← timeout doesn't kill siblings
├── BackgroundJobScope (SupervisorJob + CEH)
│   ├── launch { webhookRetry }  ← failure isolated
│   └── launch { cacheWarming }
└── SdkIsolationScope (SupervisorJob + CEH)
    └── thirdPartySdk calls      ← blast radius contained
```
## Gotchas
- **`supervisorScope` does not swallow exceptions.** You still must `runCatching` on each `await()` — the supervisor only prevents *sibling cancellation*, not propagation to the caller.
- **Never use `GlobalScope` for background work.** You lose all lifecycle control. Tie scopes to `ApplicationStopping` so in-flight jobs complete during graceful shutdown.
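- **`withContext(sdkScope.coroutineContext)` does not isolate failures.** `withContext` rethrows whatever its block throws to the caller, and swapping in the scope's `Job` detaches the block from the caller's cancellation. Start the work with `sdkScope.async` or `sdkScope.launch` and handle the outcome at `await()` or in the scope's handler. A bare `withContext(Dispatchers.IO)` only switches threads and will not protect you either.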
- **Gauge registration is not idempotent in all Micrometer registries.** Register `jobs.active` gauges once at startup, not per-request.
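The first gotcha is easy to verify. The sketch below, using only `kotlinx.coroutines` and two illustrative functions, awaits the same failing child with and without a guard.

```kotlin
import kotlinx.coroutines.*

// Unguarded: await() rethrows the child's failure, so the whole scope fails.
suspend fun unguarded(): String = supervisorScope {
    val recs = async<String> { throw IllegalStateException("recs failed") }
    recs.await()
}

// Guarded: the failure is converted to a fallback value at the await() site.
suspend fun guarded(): String = supervisorScope {
    val recs = async<String> { throw IllegalStateException("recs failed") }
    runCatching { recs.await() }.getOrDefault("fallback")
}

fun main() = runBlocking {
    println(runCatching { unguarded() }.exceptionOrNull()?.message)
    println(guarded())
}
```

In both functions the supervisor keeps siblings alive. Only the guarded version keeps the caller alive as well.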
## Wrapping Up
Here is the gotcha that will save you hours: the difference between a resilient Ktor service and a 3 AM page is three scopes with `SupervisorJob`, a `CoroutineExceptionHandler` on each, and Micrometer wired into job states. Default `coroutineScope` cascades single failures into full request failures. Dedicated background and SDK isolation scopes contain the blast radius. Get this hierarchy right once and your on-call rotation gets a lot quieter.