Building a streaming AI companion in your own API

Sam Vanhoutte

I am co-founder of libelo, which is a platform for discovering parks and nature highlights. One of the features we are building is a conversational companion: an AI assistant that helps users plan hikes, find species, check trail conditions, and get context about the nature around them. The backend is powered by Azure AI Foundry. This post explains why the mobile app does not talk to Foundry directly, how we structure the response so the app can render rich UI cards alongside the text, and how the streaming connection between client and server works.

Why route through your own API

The straightforward implementation is to put Azure AI Foundry credentials in the mobile app and have it call the agent endpoint directly. That approach has enough problems that we ruled it out immediately.

Security.

We use Microsoft Entra External ID as our authentication mechanism, so we want to keep those tokens for our internal validation and security rules. Azure credentials stored on a client device are credentials waiting to be extracted. Routing through the API keeps all secrets server-side. The mobile app authenticates to our API with its existing JWT; the API authenticates to Foundry using a managed identity. The client never sees an Azure key.

Monitoring.

We run OpenTelemetry traces on every controller action. By going through our own endpoint we get request counts, latency, token usage, and error rates in the same dashboards as the rest of the API, without any extra instrumentation work. Token consumption shows up as a metric tagged with user ID and operation, which matters when you are watching costs.
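
To give an idea of what that looks like, here is a minimal sketch of such a token-usage counter using System.Diagnostics.Metrics; the meter and tag names here are illustrative, not the exact ones we use:

using System.Diagnostics.Metrics;

public class CompanionMetrics
{
    // Meter and tag names are illustrative.
    private static readonly Meter Meter = new("Libelo.Companion");
    private static readonly Counter<long> TokensUsed =
        Meter.CreateCounter<long>("companion.tokens.used");

    public void RecordTokens(long count, Guid userId, string operation) =>
        TokensUsed.Add(count,
            new KeyValuePair<string, object?>("user.id", userId.ToString()),
            new KeyValuePair<string, object?>("operation", operation));
}

Because it is a plain .NET Meter, the OpenTelemetry exporter we already run picks it up without extra wiring.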

Validation and context enrichment.

The companion endpoint accepts a message and an optional context block containing the user's current location, the park or highlight they are viewing, their activity goals, and available time. A FluentValidation pipeline checks all of that before the agent ever sees it. The API then assembles a per-request system instruction block that injects this context into the conversation, something a direct Foundry call from the client would have to replicate itself.
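
As a sketch (property names assumed from the request shape shown later in this post), the context rules might look like:

using FluentValidation;

public class SendCompanionMessageRequestValidator : AbstractValidator<SendCompanionMessageRequest>
{
    public SendCompanionMessageRequestValidator()
    {
        RuleFor(x => x.Message).NotEmpty().MaximumLength(2000);

        // The context block is optional, but when present it must be sane.
        When(x => x.Context is not null, () =>
        {
            RuleFor(x => x.Context!.Latitude).InclusiveBetween(-90, 90);
            RuleFor(x => x.Context!.Longitude).InclusiveBetween(-180, 180);
            RuleFor(x => x.Context!.AvailableMinutes).GreaterThan(0);
        });
    }
}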

Resilience.

Foundry has transient errors. The server-side client wraps every agent call in a Polly retry pipeline with exponential backoff. The controller catches any unrecoverable failure and returns a 503 Service Unavailable rather than letting an SDK exception propagate to the client as a 500.
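
A sketch of that pipeline with Polly v8 (the Polly and Polly.Retry namespaces); the retry counts and delays are illustrative, and agent.RunAsync stands in for the actual agent call:

private static readonly ResiliencePipeline FoundryRetry = new ResiliencePipelineBuilder()
    .AddRetry(new RetryStrategyOptions
    {
        MaxRetryAttempts = 3,
        Delay = TimeSpan.FromMilliseconds(500),
        BackoffType = DelayBackoffType.Exponential,
        // Azure SDK transient failures surface as RequestFailedException.
        ShouldHandle = new PredicateBuilder().Handle<RequestFailedException>()
    })
    .Build();

// Wrap the agent call; once retries are exhausted the exception propagates
// and the controller maps it to a 503.
await FoundryRetry.ExecuteAsync(async ct => await agent.RunAsync(message, ct), cancellationToken);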

Setting up the Azure AI Foundry agent

The agent itself lives in Azure AI Foundry and is created once through the portal. The application code never creates or modifies the agent definition; it only retrieves the existing agent by ID and attaches the per-request tool instances to it.

Start at ai.azure.com. You need an Azure AI Foundry Hub and a project inside it. The Hub is the shared resource that holds model deployments; the project is the workspace where agents live. If you are deploying into an existing Azure subscription, create both from the portal under the Azure AI Foundry service.

Inside the project, deploy a model from the model catalog. For a conversational assistant that calls tools, GPT-4o is the practical choice, but here we are using claude-sonnet-4-6. Give the deployment a name you will reference in the agent configuration.

Open the Agents section in the left menu and click New agent. Configure the following:

  • Model: select the deployment you just created
  • Name: a display name for the agent (used for identification in logs and the portal)
  • Instructions: the base system prompt that describes the agent's role and behaviour. Make sure you test these thoroughly (we added an instruction not to promote competing products, for example 🤣)

Below you can see the setup of another agent we use to collect user feedback. This agent has a tool configured that takes input from users and automatically creates tickets in a GitHub repo in our organization (enabling the tool with a PAT was all that was needed).

Configuring the agent

The instructions here are the static part of the system prompt. The application injects per-request context (user location, current park, language, safety warnings) on top of this as a system message on each turn. Keep the base instructions focused on persona and rules; leave context for the application to supply.
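
To make the split concrete, here is an illustrative base prompt (not our actual one):

You are the libelo companion, a friendly guide for parks and nature.
Use the provided tools to look up live data; never invent parks, trails or species.
When you reference a park, highlight or species, declare it through the EmitCards tool.
Keep answers concise, and answer in the user's language.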

Once the agent is saved, the portal shows an Agent ID on the overview page. That value goes into your Companion:AgentId configuration setting. The project endpoint URL is shown under Project details and takes the form https://<project>.services.ai.azure.com/api/projects/<project>. That goes into Companion:AzureAIProjectEndpoint.

Granting access to the managed identity

The application authenticates to Azure AI Foundry using DefaultAzureCredential. In production that resolves to the user-assigned managed identity attached to the container app. That identity needs the Azure AI User role assigned on the AI Foundry project resource. This role grants permission to retrieve agent definitions, create conversation sessions, run agents, and read streaming results.

In the Azure Portal, navigate to the AI Foundry project resource, open Access control (IAM), click Add role assignment, and select Azure AI User. Assign it to your managed identity.
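
The same assignment can be scripted with the Azure CLI (all IDs below are placeholders):

az role assignment create \
  --assignee <managed-identity-principal-id> \
  --role "Azure AI User" \
  --scope /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.CognitiveServices/accounts/<foundry-account>/projects/<project>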

Access rights

For local development, DefaultAzureCredential falls back to your Azure CLI login. Run az login and make sure your account has the same Azure AI User role on the project. The application code requires no changes between local and production.

With the agent ID, the project endpoint, and the role assignment in place, the AIProjectClient initialisation resolves cleanly at startup:

services.AddSingleton<AIProjectClient>(sp =>
{
    var options = sp.GetRequiredService<IOptions<CompanionOptions>>().Value;
    return new AIProjectClient(
        new Uri(options.AzureAIProjectEndpoint),
        new DefaultAzureCredential());
});

The tools

The agent is not just a chat model. It has access to a set of tools that let it query (sometimes live) data:

  • ParkStatusTool - retrieves the current status and trail difficulty ratings for a park from the Libelo database
  • WeatherTool - uses our Meteo Service to get current conditions at a coordinate; weather codes above a severity threshold are prefixed with [SAFETY] so the model can warn users appropriately
  • RecentSightingsTool - searches the species catalog with an optional name filter
  • UserFavouritesTool - retrieves the user's saved parks and highlights so the agent can reference places they already know
  • EmitCardsTool - a special final-turn tool described in the next section

Tools are registered with the agent using AIFunctionFactory.Create() from the Microsoft.Agents.AI package:

public IList<AITool> BuildTools()
{
    return
    [
        AIFunctionFactory.Create(parkStatusTool.GetParkStatusAsync),
        AIFunctionFactory.Create(weatherTool.GetWeatherAsync),
        AIFunctionFactory.Create(recentSightingsTool.GetRecentSightingsAsync),
        AIFunctionFactory.Create(userFavouritesTool.GetUserFavouritesAsync),
        AIFunctionFactory.Create(emitCardsTool.EmitCardsAsync),
    ];
}

Each tool class is registered as a scoped dependency, so it gets a fresh instance per request with its own injected services.
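
As an illustration of the shape of such a tool (the repository interface and the serialized fields are simplified), ParkStatusTool looks roughly like this:

using System.ComponentModel;
using System.Text.Json;

public class ParkStatusTool(IParkRepository parks, CardBuffer cardBuffer)
{
    [Description("Gets the current status and trail difficulty ratings for a park.")]
    public async Task<string> GetParkStatusAsync(
        [Description("The ID of the park to look up.")] Guid parkId)
    {
        var park = await parks.GetByIdAsync(parkId);
        if (park is null)
            return "No park found with that ID.";

        // Record the surfaced ID so the response can only carry cards for
        // entities the model actually retrieved (see the next section).
        cardBuffer.RecordSurfaced("park", park.Id);

        return JsonSerializer.Serialize(new { park.Name, park.Status, park.TrailDifficulty });
    }
}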

Structured replies: cards

When the agent responds, it often references specific entities: a park it recommends, a species it identified, a trail it describes. The mobile app needs to render those as tappable UI elements that deep-link into the relevant screen, not just as text the user has to read and act on manually.

The mechanism for this is cards. A card is a typed entity reference: a type (park, highlight, species) and a GUID. The final CompanionMessageResponse carries both the message text and an array of cards that the app renders below the response.

public class CompanionMessageResponse
{
    public string Message { get; set; } = string.Empty;
    public List<CompanionCard> Cards { get; set; } = [];
}

public class CompanionCard
{
    public string Type { get; set; } = string.Empty;
    public Guid Id { get; set; }
}

The challenge is that language models can hallucinate. If you let the model emit card references without checking them against what it actually retrieved, it will sometimes reference parks or species it invented. We prevent that with an anti-hallucination check that tracks two sets of IDs per request.

Surfaced IDs are recorded by the data tools. When ParkStatusTool returns park and trail data, it records those IDs in the buffer. When RecentSightingsTool returns species, it records those. These are the entities the model actually saw in its context window.
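
A minimal sketch of the buffer itself (registered as a scoped service, so one instance per request; the emitted set, described next, is filled by EmitCardsTool):

public class CardBuffer
{
    private readonly List<CompanionCard> surfaced = [];
    private readonly List<CompanionCard> emitted = [];

    public void RecordSurfaced(string type, Guid id) =>
        surfaced.Add(new CompanionCard { Type = type, Id = id });

    public void RecordEmitted(IEnumerable<CompanionCard> cards) =>
        emitted.AddRange(cards);
}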

Emitted IDs are recorded by EmitCardsTool, which the model calls at the end of its turn to declare which entities it wants to surface in the response.

public Task<string> EmitCardsAsync(List<CompanionCard> cards)
{
    var validCards = cards
        .Where(c => validTypes.Contains(c.Type))
        .Where(c => c.Id != Guid.Empty)
        .ToList();

    cardBuffer.RecordEmitted(validCards);
    return Task.FromResult("Cards emitted successfully.");
}

When the response is assembled, GetValidatedCards() returns only the intersection: entities that were both surfaced by a tool and claimed by the model. If the model tries to emit a card for an entity it never actually retrieved, it is silently dropped.

public IReadOnlyList<CompanionCard> GetValidatedCards()
{
    return emitted
        .Where(e => surfaced.Any(s => s.Type == e.Type && s.Id == e.Id))
        .ToList();
}

This approach means the card list in the response is always grounded in data the model actually had access to during that turn.

Streaming the response

Waiting for the full agent response before sending anything to the client produces noticeable latency. Instead, the endpoint streams tokens as they arrive using Server-Sent Events (SSE). The client receives text deltas immediately and renders them word by word, with the final structured data arriving at the end.

The controller sets up the SSE response and delegates to the service:

[HttpPost("threads/{threadId}/messages")]
[ProducesResponseType(StatusCodes.Status200OK)]
[ProducesResponseType(StatusCodes.Status503ServiceUnavailable)]
public async Task<IActionResult> SendMessageAsync(
    [FromRoute] Guid threadId,
    [FromBody] SendCompanionMessageRequest request,
    CancellationToken cancellationToken)
{
    // userId comes from the validated JWT; the exact claim depends on your Entra setup
    var userId = User.FindFirst(ClaimTypes.NameIdentifier)?.Value;

    Response.Headers.ContentType = "text/event-stream";
    Response.Headers.CacheControl = "no-cache";
    Response.Headers.Connection = "keep-alive";

    try
    {
        await companionService.StreamMessageAsync(
            threadId, userId, request, Response.Body, cancellationToken);
        return new EmptyResult();
    }
    catch (CompanionUnavailableException)
    {
        return StatusCode(StatusCodes.Status503ServiceUnavailable);
    }
}

The service writes two kinds of SSE events to the response body. Text delta events arrive continuously as the model generates its response:

data: The Hoge Veluwe national park covers

data:  5,400 hectares of heathland and forest.

Once streaming completes and the card buffer has been validated, a final envelope event is written containing the complete message text and the card array:

data: __envelope__:{"message":"The Hoge Veluwe...","cards":[{"type":"park","id":"..."}]}
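
On the writing side these are plain stream writes. A sketch of the two helpers, assuming the service holds the raw response Stream (the JSON casing must match what the app parses):

private static async Task WriteDeltaAsync(Stream body, string text, CancellationToken ct)
{
    // An SSE event is a "data:" line terminated by a blank line.
    await body.WriteAsync(Encoding.UTF8.GetBytes($"data: {text}\n\n"), ct);
    await body.FlushAsync(ct);
}

private static async Task WriteEnvelopeAsync(
    Stream body, CompanionMessageResponse response, CancellationToken ct)
{
    var json = JsonSerializer.Serialize(response,
        new JsonSerializerOptions { PropertyNamingPolicy = JsonNamingPolicy.CamelCase });
    await body.WriteAsync(Encoding.UTF8.GetBytes($"data: __envelope__:{json}\n\n"), ct);
    await body.FlushAsync(ct);
}

Flushing after every event is what makes the stream feel live; without it, the server buffers deltas and the client sees them in bursts.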

The mobile app accumulates the text deltas for live rendering, then on the envelope event it replaces the accumulated text with the canonical message and renders the card strip. This gives users the perception of a fast response while ensuring the final state is always consistent with the validated data.

The agent integration

All direct interaction with the Azure AI Foundry SDK is isolated in a single class: FoundryAgentClient. No other class in the application imports Microsoft.Agents.AI types. This keeps the rest of the codebase insulated from SDK changes and makes it straightforward to swap the underlying model provider if needed.

The client wraps a FoundryAgent instance that is built per-request by the DI container, with the scoped tool instances for that request injected at construction time:

services.AddScoped<FoundryAgent>(sp =>
{
    var projectClient = sp.GetRequiredService<AIProjectClient>();
    var options = sp.GetRequiredService<IOptions<CompanionOptions>>().Value;
    var toolFactory = sp.GetRequiredService<CompanionToolFactory>();

    return projectClient.GetFoundryAgentClient().GetAgent(
        options.AgentId,
        new FoundryAgentOptions { Tools = toolFactory.BuildTools() });
});

The agent holds the Foundry agent ID (configured at startup) and the tool list for this request. When RunStreamingAsync is called, it creates or resumes a Foundry conversation session, injects the per-request system instructions and the user message, then streams back AgentResponseUpdate events:

public async IAsyncEnumerable<string> RunStreamingAsync(
    string sessionId,
    string systemInstructions,
    string userMessage,
    [EnumeratorCancellation] CancellationToken cancellationToken)
{
    var session = await agent.CreateConversationSessionAsync(sessionId, cancellationToken);

    await session.SendSystemMessageAsync(systemInstructions, cancellationToken);
    await session.SendMessageAsync(userMessage, cancellationToken);

    await foreach (var update in session.RunStreamingAsync(cancellationToken))
    {
        if (update.Text is { Length: > 0 } text)
            yield return text;
    }
}

The system instructions are built fresh each request by CompanionContextBuilder, which formats the user's language preference, current coordinates, the park or highlight in view, and any safety warnings triggered by trail difficulty or weather severity. This context is never stored in Foundry; it travels as a system message on every turn so the model always has an accurate picture of where the user is and what they are looking at.
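
A sketch of what that builder produces (field names are illustrative):

using System.Text;

public class CompanionContextBuilder
{
    public string Build(CompanionContext context, string language)
    {
        var sb = new StringBuilder();
        sb.AppendLine($"Reply in {language}.");

        if (context.Latitude is not null && context.Longitude is not null)
            sb.AppendLine($"The user is at coordinates {context.Latitude}, {context.Longitude}.");

        if (context.ParkName is not null)
            sb.AppendLine($"The user is currently viewing the park '{context.ParkName}'.");

        if (context.AvailableMinutes is not null)
            sb.AppendLine($"The user has about {context.AvailableMinutes} minutes available.");

        // Safety warnings use the same [SAFETY] prefix as the WeatherTool output.
        foreach (var warning in context.SafetyWarnings)
            sb.AppendLine($"[SAFETY] {warning}");

        return sb.ToString();
    }
}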

Putting it together

The full flow from a mobile app perspective is simple: send a POST with a message and optional context, read the SSE stream for text deltas, wait for the envelope event, render the cards. The complexity lives entirely on the server.

POST /api/v1/companion/threads/{threadId}/messages
Authorization: Bearer <user-jwt>

{
  "message": "Do you have a nice suggestion for a walk today?",
  "context": {
    "latitude": 52.0705,
    "longitude": 5.1214,
    "availableMinutes": 120
  }
}

On the server, that request creates or resumes a conversation session, enriches it with context, streams through the Foundry agent with live tool calls, validates the card buffer, and writes the SSE response. The mobile app gets a token stream it can render immediately and a final envelope it can trust.

The required configuration is minimal:

{
  "Companion": {
    "AzureAIProjectEndpoint": "https://<project>.services.ai.azure.com/api/projects/<project>",
    "AgentId": "<foundry-agent-name>"
  }
}

Authentication to Azure uses DefaultAzureCredential, so locally it picks up your Azure CLI login and in production it uses the managed identity assigned to the container app. No secrets in configuration files.

The pattern scales cleanly. Adding a new tool means implementing a scoped service, registering it in DI, and adding one line to BuildTools(). The card buffer picks it up automatically as long as the new tool records its surfaced IDs before returning. The rest of the pipeline does not change.
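
For a hypothetical TrailConditionsTool, that really is all of it:

// service registration: one scoped service
services.AddScoped<TrailConditionsTool>();

// CompanionToolFactory.BuildTools(): one extra line
AIFunctionFactory.Create(trailConditionsTool.GetTrailConditionsAsync),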
