Matěj Štágl

Understanding the LlmTornado Codebase: Multi-Provider AI Integration

On-Device AI Inference: How to Boost Your C# App's Performance

I've been experimenting with on-device AI inference lately, and honestly, the performance gains are kind of wild. Last weekend, I was working on a healthcare app prototype that needed to process patient data locally (HIPAA compliance, you know the drill), and I started diving into what's changed in the edge AI landscape as we head into Q4 2025.

The big shift? Hardware has finally caught up with the hype. We're not just talking about running basic sentiment analysis on-device anymore; we're running legitimate AI workloads that would've required cloud roundtrips just a year ago. Specialized edge-AI chips built for low-power inference are shipping as standard equipment, especially in smartphones and IoT devices.

Why On-Device Inference Actually Matters Now

Here's what clicked for me: on-device AI inference enables real-time data processing directly on edge devices, reducing latency and improving performance. But more importantly, your data never leaves the device. For applications like healthcare wearables or autonomous vehicles, this isn't just a nice-to-have—it's a fundamental requirement.

The latency improvements alone are worth it. I was seeing 200-300ms response times when hitting cloud APIs (even with good connectivity). Moving inference on-device? Sub-50ms consistently. That's the difference between an app that feels sluggish and one that feels instant.

Setting Up Your C# Environment for Local Inference

Alright, let's get practical. If you're working in C#, you've got a few solid options for running models locally. The framework I've been using is LlmTornado—it's provider-agnostic, which means the same code works whether you're hitting OpenAI's API or running Ollama locally.

First, install the packages:

dotnet add package LlmTornado
dotnet add package LlmTornado.Agents

The key insight here is that modern C# AI libraries abstract away the complexity of dealing with different deployment targets. You write your code once, and it works whether you're in the cloud or on the edge.

Running Local Models with Ollama

Here's something I found super useful: edge AI frameworks in 2025 lean on hardware accelerators (GPUs, NPUs) to keep on-device inference fast. Tools like Ollama make it trivial to run models locally that were previously cloud-only.

Here's how I set up a local inference endpoint with Ollama:

using LlmTornado;
using LlmTornado.Chat;
using LlmTornado.Chat.Models;

// Point to your local Ollama instance
TornadoApi api = new TornadoApi(new Uri("http://localhost:11434"));

// Create a conversation with a local model
Conversation conversation = api.Chat.CreateConversation(new ChatRequest
{
    Model = new ChatModel("falcon3:1b"),
    MaxTokens = 500
});

conversation.AppendUserInput("Analyze this patient symptom log for patterns.");

// Get response from local model
string? response = await conversation.GetResponse();
Console.WriteLine(response);

The beauty here? If I need to switch back to a cloud provider (maybe for a more powerful model), I just swap out the API initialization. The rest of my code stays identical.
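
To make that concrete, here's roughly what the cloud-backed variant looks like. The conversation code is untouched; only the TornadoApi construction changes (reading the key from an OPENAI_API_KEY environment variable is just my own convention):

using System;
using System.Collections.Generic;
using LlmTornado;
using LlmTornado.Chat;
using LlmTornado.Chat.Models;
using LlmTornado.Code;

// Authenticate against OpenAI instead of pointing at the local Ollama endpoint
TornadoApi api = new TornadoApi(new List<ProviderAuthentication>
{
    new ProviderAuthentication(LLmProviders.OpenAi,
        Environment.GetEnvironmentVariable("OPENAI_API_KEY") ?? string.Empty)
});

// Everything below is identical to the local version, apart from the model name
Conversation conversation = api.Chat.CreateConversation(new ChatRequest
{
    Model = ChatModel.OpenAi.Gpt4.OMini,
    MaxTokens = 500
});

conversation.AppendUserInput("Analyze this patient symptom log for patterns.");

string? response = await conversation.GetResponse();
Console.WriteLine(response);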

Streaming for Better UX

One thing I learned the hard way: even with on-device inference, you want streaming. Users perceive streaming responses as faster, even if the total time is the same. Here's how I implemented it:

using LlmTornado;
using LlmTornado.Chat;
using LlmTornado.Chat.Models;

TornadoApi api = new TornadoApi(new Uri("http://localhost:11434"));

Conversation conversation = api.Chat.CreateConversation(new ChatRequest
{
    Model = new ChatModel("falcon3:1b"),
    Stream = true
});

conversation.AppendUserInput("Summarize this medical document.");

// Stream the response token by token
await conversation.StreamResponse(chunk =>
{
    Console.Write(chunk);
});

Streaming is especially critical for edge AI because it provides immediate feedback to users. Even though the total inference time might be fast, streaming makes it feel instantaneous.
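
One small addition I'd suggest on top of that snippet, reusing only the calls already shown above (so treat it as a sketch, not the library's only streaming pattern): accumulate the chunks while you print them, so the full response is still available afterwards for logging, caching, or follow-up prompts.

using System;
using System.Text;
using LlmTornado;
using LlmTornado.Chat;
using LlmTornado.Chat.Models;

TornadoApi api = new TornadoApi(new Uri("http://localhost:11434"));

Conversation conversation = api.Chat.CreateConversation(new ChatRequest
{
    Model = new ChatModel("falcon3:1b"),
    Stream = true
});

conversation.AppendUserInput("Summarize this medical document.");

// Show tokens as they arrive, but also keep the complete answer
StringBuilder fullResponse = new StringBuilder();

await conversation.StreamResponse(chunk =>
{
    Console.Write(chunk);        // immediate feedback for the user
    fullResponse.Append(chunk);  // accumulate the full text for later use
});

Console.WriteLine();
Console.WriteLine($"Streamed {fullResponse.Length} characters in total.");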

Handling Multiple Providers: Cloud vs. Edge

Here's where things get interesting. In production, you often want to dynamically choose between local and cloud inference based on network conditions, device capabilities, or model requirements. I built a simple fallback system:

using LlmTornado;
using LlmTornado.Chat;
using LlmTornado.Chat.Models;
using LlmTornado.Code;
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public class HybridInferenceClient
{
    private readonly TornadoApi localApi;
    private readonly TornadoApi cloudApi;

    public HybridInferenceClient(string openAiKey)
    {
        // Configure local endpoint
        localApi = new TornadoApi(new Uri("http://localhost:11434"));

        // Configure cloud endpoint as fallback
        cloudApi = new TornadoApi(new List<ProviderAuthentication>
        {
            new ProviderAuthentication(LLmProviders.OpenAi, openAiKey)
        });
    }

    public async Task<string?> GetInference(string prompt, bool preferLocal = true)
    {
        try
        {
            if (preferLocal)
            {
                // Try local inference first
                var localConv = localApi.Chat.CreateConversation(
                    new ChatModel("falcon3:1b"));
                localConv.AppendUserInput(prompt);

                return await localConv.GetResponse();
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Local inference failed: {ex.Message}");
        }

        // Fall back to cloud
        var cloudConv = cloudApi.Chat.CreateConversation(
            ChatModel.OpenAi.Gpt4.OMini);
        cloudConv.AppendUserInput(prompt);

        return await cloudConv.GetResponse();
    }
}

This pattern has saved me multiple times in production. Network goes down? App keeps working. Need a more powerful model? Switch to cloud seamlessly.
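
Calling it is straightforward. Here's a usage sketch (the environment variable lookup is just my convention for keeping the key out of source control):

using System;

string openAiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY") ?? string.Empty;
HybridInferenceClient client = new HybridInferenceClient(openAiKey);

// Sensitive data: prefer the local model so nothing leaves the device
string? localResult = await client.GetInference(
    "Analyze this patient symptom log for patterns.", preferLocal: true);

// Non-sensitive, quality-critical task: go straight to the cloud model
string? cloudResult = await client.GetInference(
    "Draft a plain-language summary of this public research abstract.", preferLocal: false);

Console.WriteLine(localResult);
Console.WriteLine(cloudResult);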

Performance Benchmarks: What I Actually Measured

I ran some benchmarks comparing cloud API calls vs. local inference for a text classification task (500 token average input, 100 token average output):

Cloud API (OpenAI GPT-4o-mini):

  • Average latency: 280ms
  • P95 latency: 450ms
  • Cost per 1000 requests: $0.45

Local Inference (Ollama Falcon3:1B):

  • Average latency: 45ms
  • P95 latency: 75ms
  • Cost per 1000 requests: $0 (just electricity)

The latency difference is dramatic. For a real-time chat interface, that 235ms improvement is the difference between a conversation that flows naturally and one that feels laggy.
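
The harness itself was nothing fancy. A minimal sketch along these lines (Stopwatch around GetResponse, then sort the samples and read off the 95th percentile) is enough to reproduce comparable numbers on your own hardware; the prompt, request count, and percentile math here are illustrative rather than the exact script I ran:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using LlmTornado;
using LlmTornado.Chat;
using LlmTornado.Chat.Models;

// Measure latency for a batch of sequential requests against the local endpoint
TornadoApi api = new TornadoApi(new Uri("http://localhost:11434"));
List<double> latenciesMs = new List<double>();

for (int i = 0; i < 50; i++)
{
    Conversation conversation = api.Chat.CreateConversation(new ChatRequest
    {
        Model = new ChatModel("falcon3:1b"),
        MaxTokens = 100
    });

    conversation.AppendUserInput("Classify the sentiment of: 'The new dosage schedule is working well.'");

    Stopwatch sw = Stopwatch.StartNew();
    await conversation.GetResponse();
    sw.Stop();

    latenciesMs.Add(sw.Elapsed.TotalMilliseconds);
}

// Sort once, then report average and P95
latenciesMs.Sort();
double average = latenciesMs.Average();
int p95Index = Math.Min((int)(latenciesMs.Count * 0.95), latenciesMs.Count - 1);
double p95 = latenciesMs[p95Index];

Console.WriteLine($"Average: {average:F0} ms, P95: {p95:F0} ms");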

When On-Device Makes Sense (And When It Doesn't)

Here's my honest take after building with this for a few months:

✓ Use on-device inference when:

  • Privacy is critical (healthcare, financial, personal data)
  • You need consistent sub-100ms latency
  • Network connectivity is unreliable
  • You're processing high volumes and want to reduce API costs
  • Your use case fits within smaller model capabilities

✗ Stick with cloud when:

  • You need cutting-edge model capabilities (GPT-4, Claude)
  • Your hardware can't handle the compute requirements
  • You need the absolute best accuracy
  • Model updates need to be instant and centralized

Hardware Considerations for 2025

The hardware landscape has changed significantly. Modern smartphones and laptops now ship with neural processing units (NPUs) specifically designed for AI workloads. If you're targeting devices with these chips, you can run surprisingly capable models without killing battery life.

For my healthcare app, I targeted devices with at least 8GB RAM and any NPU support. The Falcon3:1B model runs comfortably on these specs, using about 2GB of memory when loaded.
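
If you want the local-vs-cloud decision to be automatic instead of a config flag, a rough capability gate is a reasonable starting point. This is only a sketch: GC.GetGCMemoryInfo() reports the memory visible to the runtime (usually physical RAM or a container limit), and it says nothing about NPU support, which you'd have to detect per platform.

using System;

// Rough capability gate: only default to local inference on machines with enough memory.
// The 8 GB threshold mirrors the target specs above; tune it for your model.
const long MinTotalMemoryBytes = 8L * 1024 * 1024 * 1024;

GCMemoryInfo memoryInfo = GC.GetGCMemoryInfo();
bool preferLocal = memoryInfo.TotalAvailableMemoryBytes >= MinTotalMemoryBytes;

Console.WriteLine(preferLocal
    ? "Device looks capable: defaulting to local inference."
    : "Limited memory: defaulting to cloud inference.");

// Feed the flag into the hybrid client from earlier:
// string? result = await client.GetInference(prompt, preferLocal);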

Wrapping Up

On-device AI inference in C# isn't just feasible—it's practical and performant for many real-world scenarios. The combination of better hardware, optimized frameworks, and flexible libraries makes it easier than ever to build apps that are fast, private, and reliable.

The key is treating on-device inference not as a replacement for cloud APIs, but as a complementary tool in your architecture. Use the right deployment target for each use case, and build systems that can gracefully move between them.

If you want to explore more examples, check out the LlmTornado repository for additional patterns and integrations with 25+ providers.

Next time you're building a C# app that needs AI capabilities, consider whether some (or all) of that inference could run locally. Your users—and their data—will thank you.
