I had a Spring Boot API talking to AI providers, and at first it did the most obvious thing: send the prompt, wait for the model to finish, and then return the full response as JSON.
It worked.
But it also felt wrong.
When you are dealing with AI-generated text, waiting several seconds for a complete response is a pretty bad experience. The model is already producing tokens progressively, but the API was hiding that and making the client wait for everything. So I decided to fix that and add proper streaming support.
This post is about that change.
Not a giant rewrite. Just a practical refactor to make AI responses feel alive instead of delayed.
The original problem
The first version of the endpoint was synchronous. The flow was basically:
- Receive the prompt
- Call the AI provider
- Wait for the entire answer
- Return one JSON response
That is simple, but it creates an awkward UX. Even when the model is generating steadily, the user sees nothing until the very end.
For normal CRUD APIs, that is fine.
For AI, it is not.
People expect text to appear as it is generated. Once you have used ChatGPT, Claude, or Ollama in a streaming UI, it is hard to go back.
The goal
I wanted the backend to expose a streaming endpoint so the frontend could start rendering text immediately.
At the same time, I did not want the implementation to become provider-specific spaghetti.
The app already supported more than one provider, so the solution had to work in a way that kept the service layer clean.
That led to two decisions:
- Expose streaming through a dedicated endpoint
- Make both AI clients implement the same streaming contract
Adding a streaming endpoint
I kept the normal JSON endpoint and added a second one for streaming.
The streaming endpoint returns text/event-stream and uses Spring’s SseEmitter:
@PostMapping(path = "/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public SseEmitter stream(@Valid @RequestBody ChatPromptRequest body) {
    String prompt = body.prompt().trim();
    // 0L disables the default timeout, since generation can take a while
    SseEmitter emitter = new SseEmitter(0L);

    Thread.ofVirtual().start(() -> {
        try {
            chatService.stream(prompt, chunk -> sendChunk(emitter, chunk));
            emitter.complete();
        } catch (Exception exception) {
            emitter.completeWithError(exception);
        }
    });

    return emitter;
}
What I like about this approach is that it is pretty direct.
The controller does not need to know how Ollama or Claude stream internally. It only knows that chunks are coming in, and it forwards those chunks to the client as SSE events.
Each chunk is emitted like this:
emitter.send(SseEmitter.event().name("chunk").data(chunk));
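The sendChunk helper referenced in the controller is not shown above; a minimal version, wrapping that same send call, might look like this (the IOException handling is my addition, since a disconnected client surfaces as an IOException on send):

private void sendChunk(SseEmitter emitter, String chunk) {
    try {
        emitter.send(SseEmitter.event().name("chunk").data(chunk));
    } catch (IOException exception) {
        // The client likely disconnected; tear the emitter down and
        // let the exception propagate so the streaming loop stops.
        emitter.completeWithError(exception);
        throw new RuntimeException(exception);
    }
}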
So the frontend receives a stream that looks like:
event: chunk
data: Hello

event: chunk
data: world
That is very different from returning a JSON array. This is actual streaming, not "collect everything and send it later in a different shape."
The tricky part: providers do not stream the same way
This was the part that mattered most in the refactor.
The application supports both Ollama and Claude, but they do not return streamed data in the same format.
Ollama streams newline-delimited JSON.
Claude streams Server-Sent Events containing event types and payloads.
So even though both are "streaming," they are not interchangeable at the raw HTTP level.
If I pushed that difference too high in the application, the controller and service code would get messy fast. I wanted the rest of the app to think in terms of text chunks, not provider-specific wire formats.
So I added a common method to the client contract:
public interface LlmClient {

    ChatResponse chat(ChatRequest request);

    default void streamChat(ChatRequest request, Consumer<String> onChunk) {
        throw new UnsupportedOperationException("Streaming is not supported by this provider");
    }
}
That gave both clients the same responsibility: take a request, read the upstream stream, and call onChunk every time there is new text.
Once that was in place, the service and controller stayed simple.
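To illustrate the contract in isolation, here is a self-contained sketch with simplified stand-ins for the real request/response types and a fake provider that "streams" a canned answer word by word (all names here are illustrative, not the application's actual classes):

```java
import java.util.List;
import java.util.function.Consumer;

// Simplified stand-ins for the application's request/response types.
record ChatRequest(String prompt) {}
record ChatResponse(String text) {}

interface LlmClient {
    ChatResponse chat(ChatRequest request);

    // Providers that can stream override this; others keep the default.
    default void streamChat(ChatRequest request, Consumer<String> onChunk) {
        throw new UnsupportedOperationException("Streaming is not supported by this provider");
    }
}

// A fake client standing in for a real Ollama or Claude client:
// it calls onChunk once per fragment, just like the real ones do.
class FakeClient implements LlmClient {
    public ChatResponse chat(ChatRequest request) {
        return new ChatResponse("Hello world");
    }

    public void streamChat(ChatRequest request, Consumer<String> onChunk) {
        for (String chunk : List.of("Hello", " ", "world")) {
            onChunk.accept(chunk);
        }
    }
}
```

The service layer only ever sees the Consumer<String> callback, so swapping providers never touches the controller.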
Handling Ollama streaming
For Ollama, the request needed to enable streaming explicitly:
Map<String, Object> body = Map.of(
    "model", model,
    "prompt", request.prompt(),
    "stream", true
);
Then the client reads the response stream line by line, parses each JSON object, and extracts the response field.
Conceptually, it works like this:
- Read one line
- Parse it as JSON
- If it has text in response, forward it
- Repeat until the stream ends
That maps nicely to a Consumer<String> callback.
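As a sketch of that per-line handling: the real client should use a JSON library such as Jackson; the regex below is a simplified, self-contained stand-in that does not handle escaped quotes inside the text.

```java
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified handling of one line of Ollama's newline-delimited JSON
// stream. A real client should parse the JSON properly (e.g. Jackson);
// this regex is for illustration only.
class OllamaLineParser {
    private static final Pattern RESPONSE =
            Pattern.compile("\"response\"\\s*:\\s*\"([^\"]*)\"");

    // Extract the "response" field from one NDJSON line, or null if absent.
    static String extractResponse(String line) {
        Matcher m = RESPONSE.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    // Forward every non-empty text fragment to the chunk callback.
    static void handleLine(String line, Consumer<String> onChunk) {
        String text = extractResponse(line);
        if (text != null && !text.isEmpty()) {
            onChunk.accept(text);
        }
    }
}
```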
Handling Claude streaming
Claude needed a different implementation because its stream is SSE-based upstream.
That means the code has to deal with:
- Event names
- data: lines
- Event boundaries
- Content deltas
In practice, I only wanted the actual text fragments, so the Claude client listens for the right event type and extracts the text delta before forwarding it.
Same application-level behavior, different provider-level parsing.
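In Anthropic's streaming format, the text fragments arrive in content_block_delta events. A simplified, self-contained sketch of that filtering (again using a regex as a stand-in for real JSON parsing, so it ignores escaped quotes):

```java
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified handling of Claude's upstream SSE lines: remember which
// event we are inside, and only forward the text deltas.
class ClaudeSseParser {
    private static final Pattern TEXT =
            Pattern.compile("\"text\"\\s*:\\s*\"([^\"]*)\"");

    private String currentEvent = "";

    // Process one raw SSE line; forward text deltas to the callback.
    void handleLine(String line, Consumer<String> onChunk) {
        if (line.startsWith("event:")) {
            currentEvent = line.substring("event:".length()).trim();
        } else if (line.startsWith("data:") && currentEvent.equals("content_block_delta")) {
            Matcher m = TEXT.matcher(line);
            if (m.find()) {
                onChunk.accept(m.group(1));
            }
        } else if (line.isBlank()) {
            currentEvent = ""; // a blank line ends the current event
        }
    }
}
```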
That is the part I think was worth doing carefully. The point was not to pretend both providers behave the same. The point was to hide those differences behind one consistent interface for the rest of the app.
Sharing the parsing behavior without overengineering it
I also pulled the repetitive stream parsing logic into a small shared helper.
Not a big framework. Just a utility to handle:
- JSON-line iteration
- SSE event iteration
That helped keep the provider clients focused on provider logic instead of low-level stream mechanics.
This was one of those refactors that made the code easier to read immediately. Both clients still do different things, but now the differences are easier to spot because the noise is lower.
What the frontend should do
One thing that is easy to get wrong: this endpoint is a POST, so the frontend should not use EventSource.
EventSource only issues GET requests and cannot send a request body.
Since the streaming endpoint is POST /api/chat/stream, the frontend should use fetch() and read the response body as a stream.
That is the right match for this backend.
So the flow on the frontend becomes:
- Send the prompt with fetch
- Read chunks from response.body
- Parse SSE frames
- Append each chunk event to the displayed answer
That gives the user progressive rendering without waiting for the model to finish the entire response.
And honestly, that one change makes the app feel much faster, even if the total generation time is exactly the same.
A couple of small supporting improvements
The main work here was streaming, but while I was in the code I cleaned up two smaller things that made the API nicer.
First, I changed the request payload from a Map<String, String> to a proper record:
public record ChatPromptRequest(
    @NotBlank(message = "prompt must not be blank")
    String prompt
) {
}
and used @Valid in the controller.
That is a lot better than manually fishing values out of a map. The request contract is clearer, validation is built in, and the OpenAPI docs become more accurate automatically.
Second, I added a small global exception handler returning ProblemDetail for validation failures and unexpected server errors.
That was not directly about streaming, but it made the endpoints behave more consistently, especially when bad input is sent or something fails during processing.
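A minimal sketch of such a handler, assuming Spring 6's ProblemDetail support (class and method names here are illustrative, not the post's actual code):

@RestControllerAdvice
class GlobalExceptionHandler {

    // Validation failures (e.g. a blank prompt) -> 400 with field details.
    @ExceptionHandler(MethodArgumentNotValidException.class)
    ProblemDetail onValidationError(MethodArgumentNotValidException exception) {
        ProblemDetail problem = ProblemDetail.forStatus(HttpStatus.BAD_REQUEST);
        problem.setTitle("Validation failed");
        problem.setDetail(exception.getBindingResult().getFieldErrors().stream()
                .map(error -> error.getField() + ": " + error.getDefaultMessage())
                .collect(Collectors.joining("; ")));
        return problem;
    }

    // Anything unexpected -> 500 without leaking internals.
    @ExceptionHandler(Exception.class)
    ProblemDetail onUnexpectedError(Exception exception) {
        ProblemDetail problem = ProblemDetail.forStatus(HttpStatus.INTERNAL_SERVER_ERROR);
        problem.setTitle("Unexpected server error");
        return problem;
    }
}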
I would still treat those as side improvements, though. The main story here is the streaming refactor.
What changed in practice
Before this work, the API waited for the full AI answer and returned one final payload.
After the refactor:
- The API can stream text progressively to the client
- Ollama and Claude both support the same application-level streaming feature
- The controller does not care about provider-specific stream formats
- The frontend can render the answer as it arrives
That is the kind of change I like: it improves the user experience, but it also improves the shape of the backend code.
Usually you only get one or the other.
Final thought
Streaming sounds like a UI feature, but it is really an API design decision too.
If the backend hides the model’s incremental output, the frontend has no chance to create a responsive experience.
Once I changed the API to treat streamed tokens as a first-class concern, the whole flow started making more sense.
The nice part is that the code did not need to become complicated to support it. It just needed a cleaner contract.
And that, more than anything, was the real improvement.
