SoftwareDevs mvpfactory.io

Posted on • Originally published at mvpfactory.io

---
title: "Streaming LLM Responses to Mobile: SSE vs WebSockets with Ktor + Compose"
published: true
description: "A hands-on guide to streaming token-by-token LLM output from Ktor to Jetpack Compose, covering SSE vs WebSocket tradeoffs, backpressure, and failure handling for mobile AI chat."
tags: kotlin, android, architecture, api
canonical_url: https://blog.mvp-factory.com/streaming-llm-responses-to-mobile-sse-vs-websockets
---

## What We're Building

Let me show you the end-to-end plumbing for streaming LLM tokens from a Ktor backend into a Jetpack Compose UI. By the end of this tutorial, you'll have a working architecture that handles the protocol choice (SSE vs WebSockets), server-side backpressure with Kotlin Flows, client-side token batching to eliminate recomposition jank, and graceful degradation on flaky mobile networks.

The happy path is easy. The hard part is what happens when the network drops mid-stream. That's what we're solving here.

## Prerequisites

- Kotlin + Ktor server basics
- Jetpack Compose fundamentals
- Familiarity with Kotlin `Flow` and coroutines

## Step 1: Pick Your Protocol

Here's the gotcha that will save you hours of debate. LLM streaming is inherently unidirectional — the client sends a prompt, then receives tokens. You don't need bidirectional framing for that.

| Factor | SSE | WebSocket |
|---|---|---|
| Direction | Server → Client | Bidirectional |
| Reconnection | Built-in (`Last-Event-ID`) | Manual implementation |
| HTTP/2 multiplexing | Yes, shares connection pool | No, dedicated TCP socket |
| Battery impact | Lower (idle HTTP conn) | Higher (persistent frame pings) |
| Proxy/CDN compatibility | Excellent | Often problematic |
| Mobile network switching | Graceful (HTTP retry semantics) | Connection drops, full re-handshake |

SSE gives you automatic reconnection with `Last-Event-ID`, which matters enormously on mobile where Wi-Fi-to-cellular transitions happen constantly. I've only reached for WebSockets when I needed server-push *and* client-push simultaneously — collaborative editing, multiplayer features. For AI chat, SSE wins and it's not close.

## Step 2: Wire Up the Ktor Backend with Flow Buffering

Here is the minimal setup to get this working. Pair `respondSseEvents` with a Kotlin `Flow` wrapping your LLM client:

```kotlin
get("/chat/stream") {
    val prompt = call.receive<ChatRequest>()
    call.respondSseEvents(
        llmClient.streamTokens(prompt.message)
            .buffer(Channel.BUFFERED) // 64-element default
            .map { token -> ServerSentEvent(data = token) }
    )
}
```
That `buffer(Channel.BUFFERED)` matters more than it looks. Without it, a slow mobile client creates backpressure that propagates all the way to your LLM API connection. With the buffer, the backend absorbs token bursts while the client catches up.
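You can see the decoupling in a self-contained toy (this is not the server code, just an illustration of the operator): with `buffer()`, the producer's emits land in the channel immediately while a slow collector drains at its own pace; remove it and each emit suspends until the previous token is fully processed.

```kotlin
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.delay
import kotlinx.coroutines.flow.buffer
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.runBlocking

// Logs the interleaving of emits and collects. With buffer(), all three
// emits complete while the collector is still sleeping on the first token.
fun bufferedOrder(): List<String> = runBlocking {
    val log = mutableListOf<String>()
    flow {
        for (i in 1..3) {
            log += "emit $i"
            emit(i)
        }
    }
        .buffer(Channel.BUFFERED)
        .collect { i ->
            delay(200) // simulate a slow mobile client
            log += "collect $i"
        }
    log
}
```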

For structured JSON responses arriving mid-stream, accumulate tokens and only emit parse-ready chunks:

```kotlin
fun Flow<String>.chunkedJson(): Flow<String> = flow {
    val buffer = StringBuilder()
    collect { token ->
        buffer.append(token)
        if (buffer.hasCompleteJsonFragment()) {
            emit(buffer.toString())
            buffer.clear()
        }
    }
    if (buffer.isNotEmpty()) emit(buffer.toString())
}
```

This avoids the client trying to parse `{"name": "Jo` — a surprisingly common source of crashes in production.
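The `hasCompleteJsonFragment()` helper above is left undefined. One minimal way to sketch it (an assumption, not the article's implementation) is a brace-depth counter that also tracks string literals, so a `}` inside a value doesn't miscount; it handles only objects and `\"` escapes for brevity:

```kotlin
// Hypothetical sketch of hasCompleteJsonFragment(): returns true once at
// least one JSON object has opened and every brace outside a string literal
// has been balanced. Real structured-output parsing may need more than this.
fun CharSequence.hasCompleteJsonFragment(): Boolean {
    var depth = 0
    var inString = false
    var escaped = false
    var seenObject = false
    for (c in this) {
        when {
            escaped -> escaped = false          // skip the escaped character
            c == '\\' && inString -> escaped = true
            c == '"' -> inString = !inString    // toggle string-literal state
            !inString && c == '{' -> { depth++; seenObject = true }
            !inString && c == '}' -> depth--
        }
    }
    return seenObject && depth == 0 && !inString
}
```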

## Step 3: Batch Tokens in Compose to Kill Jank

Here is a pattern I use in every project that streams text. Emitting every token as a state update causes per-character recomposition. At 50–80 tokens/second from a fast LLM, that's 50–80 recompositions per second on `Text()`, and you will see frame drops.

The fix is batching with a time window:

```kotlin
@Composable
fun StreamingMessage(tokenFlow: Flow<String>) {
    val message = remember { mutableStateOf("") }

    LaunchedEffect(tokenFlow) {
        tokenFlow
            .chunked(durationMillis = 48) // ~3 frames at 60fps
            .collect { batch ->
                message.value += batch.joinToString("")
            }
    }

    Text(text = message.value)
}
```
Batching into ~48ms windows means roughly 20 recompositions per second. Smooth enough visually, well within Compose's performance budget. When I'm deep in profiling recomposition traces for hours, I keep [HealthyDesk](https://play.google.com/store/apps/details?id=com.healthydesk) running in the background — break reminders are genuinely useful when you lose track of time staring at layout inspector output.
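Note that a time-windowed `chunked(durationMillis)` operator isn't part of kotlinx.coroutines' stable API, so the call above assumes a custom extension. A minimal flush-on-arrival sketch (it only checks the window when a new token arrives, which is fine for a steadily streaming LLM, but a stalled stream holds its last partial batch until completion):

```kotlin
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.flow.flowOf
import kotlinx.coroutines.flow.toList
import kotlinx.coroutines.runBlocking

// Accumulate upstream values; once durationMillis has elapsed since the last
// flush, emit the batch downstream. The tail batch is flushed on completion.
fun <T> Flow<T>.chunked(durationMillis: Long): Flow<List<T>> = flow {
    val batch = mutableListOf<T>()
    var windowStart = System.currentTimeMillis()
    collect { value ->
        batch += value
        val now = System.currentTimeMillis()
        if (now - windowStart >= durationMillis) {
            emit(batch.toList())
            batch.clear()
            windowStart = now
        }
    }
    if (batch.isNotEmpty()) emit(batch.toList()) // flush the tail
}
```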

## Step 4: Design for Failure from the Start

Mobile networks are hostile. Your streaming architecture needs layered defenses:

1. **Timeout with partial results.** If the SSE connection stalls for more than 10 seconds, surface whatever tokens have arrived with a "response interrupted" indicator.
2. **Exponential backoff with jitter.** On reconnection, use `Last-Event-ID` to resume. Add jitter to prevent thundering herd when a cell tower comes back online and 10,000 devices reconnect simultaneously.
3. **Fall back to non-streaming.** If three SSE attempts fail, make a standard POST that returns the complete response. The user loses the token animation but still gets their answer.
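The schedule in step 2 can be sketched as full jitter over a capped exponential. The constants below are placeholders, not tuned recommendations:

```kotlin
import kotlin.random.Random

// Capped exponential backoff with full jitter: the ceiling doubles per
// attempt up to a cap, then a uniform draw in [0, ceiling] spreads
// reconnects out so a recovering cell tower isn't hit by every device at once.
fun backoffDelayMillis(
    attempt: Int,
    baseMillis: Long = 500,
    capMillis: Long = 30_000,
    random: Random = Random.Default
): Long {
    val ceiling = (baseMillis shl attempt.coerceIn(0, 20)).coerceAtMost(capMillis)
    return random.nextLong(ceiling + 1)
}
```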

```kotlin
sealed class StreamState {
    data class Streaming(val tokens: String) : StreamState()
    data class Interrupted(val partial: String) : StreamState()
    data class Fallback(val complete: String) : StreamState()
    data class Error(val message: String) : StreamState()
}
```
Model your UI state around these cases. Every `when` branch in your Compose UI should handle all four.
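In a real screen each branch maps to a composable; as a compiler-checked illustration (plain Kotlin, no Compose dependency, with the sealed class repeated so the snippet stands alone), an exhaustive `when` might reduce each state to display text:

```kotlin
// Mirrors the sealed class above. Because the `when` is exhaustive, adding
// a fifth state becomes a compile error at every render site.
sealed class StreamState {
    data class Streaming(val tokens: String) : StreamState()
    data class Interrupted(val partial: String) : StreamState()
    data class Fallback(val complete: String) : StreamState()
    data class Error(val message: String) : StreamState()
}

fun displayText(state: StreamState): String = when (state) {
    is StreamState.Streaming -> state.tokens
    is StreamState.Interrupted -> state.partial + "\n[response interrupted]"
    is StreamState.Fallback -> state.complete
    is StreamState.Error -> "Something went wrong: " + state.message
}
```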

## Gotchas

- **Missing `buffer()` on the server** — without it, one slow client on 3G can stall your LLM connection for everyone in that coroutine scope.
- **Per-token recomposition** — the docs do not mention this, but Compose will happily recompose 80 times per second if you let it. Always batch.
- **Parsing partial JSON** — if your LLM returns structured output, never parse until you have a complete fragment. Buffer server-side.
- **Ignoring `Last-Event-ID`** — SSE's built-in reconnection is only useful if your server actually tracks and resumes from event IDs. Implement it.
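One hedged sketch of the server side of that last point, assuming generated tokens are buffered (or deterministically replayable) per response, so a resumed stream can skip what the client already has. `lastEventId` would be parsed from the `Last-Event-ID` request header; the names here are illustrative:

```kotlin
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.filter
import kotlinx.coroutines.flow.flowOf
import kotlinx.coroutines.flow.map
import kotlinx.coroutines.flow.toList
import kotlinx.coroutines.flow.withIndex
import kotlinx.coroutines.runBlocking

data class TokenEvent(val id: Int, val data: String)

// Give every token a monotonically increasing id; on reconnect, drop
// everything at or below the id the client last acknowledged.
fun Flow<String>.asResumableEvents(lastEventId: Int?): Flow<TokenEvent> =
    withIndex()
        .map { (i, token) -> TokenEvent(i, token) }
        .filter { lastEventId == null || it.id > lastEventId }
```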

## Conclusion

Pick SSE over WebSockets for LLM streaming to mobile. Buffer on the server with `Channel.BUFFERED`, batch on the client in ~48ms windows, and design every state transition around failure. The architecture is straightforward once you know the patterns — the real craft is in the resilience layer.