
Vikrant Bagal

Local LLM Integration in .NET: Running Phi-4, Llama 3 & Mistral With ONNX Runtime

Running large language models on your .NET applications is no longer sci-fi — it's production-ready reality.

Why Local Inference Matters

Cost Savings

Developers running intensive AI-assisted workflows often report monthly bills in the $200-$400 range. Switching development and testing traffic to a local model brings that dramatically down — often to under $50/month for the same development throughput.

Privacy & Compliance

HIPAA and GDPR require knowing where data is processed. Local inference means patient records, PII, and confidential business data never leave your network. No BAA negotiation, no data processing addendum — the data simply doesn't move.

Offline Capability

Laptops lose connectivity. CI environments sometimes firewall external APIs. A local model works identically on a plane at 35,000 feet and in an air-gapped staging environment.

Latency

A well-configured local model on modern consumer GPU hardware produces responses in under 100ms for short prompts. Cloud API roundtrips typically add 300-800ms depending on region and load.

Phi-4 Model Family

Microsoft's Phi-4 family delivers competitive reasoning and coding performance in models small enough to run on a developer laptop.

Model            | Parameters | VRAM          | Best For
Phi-4            | 14B        | 16GB          | Complex reasoning, code generation
Phi-4-mini       | 3.8B       | 4GB GPU / CPU | Simple tasks, fast iteration
Phi-4-multimodal | 5.6B       | 8GB           | Text + image understanding

Pro tip: Quantized variants (Q4_K_M) run comfortably in 3GB VRAM. This is the recommended starting point for developer laptops with 8GB unified memory.
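
As a rough sanity check on those numbers (the ~4.5 bits/weight figure for Q4_K_M is an approximation of its mixed 4/6-bit blocks, not an official spec), weight memory scales as parameter count times bits per weight:

```csharp
// Rough weight-memory estimate: parameters * bits per weight / 8 bytes.
// Q4_K_M averages roughly 4.5 bits/weight; FP16 is 16 bits/weight.
// KV cache and activations add overhead on top of this.
static double EstimateWeightGb(double paramsBillions, double bitsPerWeight) =>
    paramsBillions * 1e9 * bitsPerWeight / 8 / 1e9;

Console.WriteLine($"Phi-4-mini FP16:   {EstimateWeightGb(3.8, 16):F1} GB");  // ~7.6 GB
Console.WriteLine($"Phi-4-mini Q4_K_M: {EstimateWeightGb(3.8, 4.5):F1} GB"); // ~2.1 GB
Console.WriteLine($"Phi-4 Q4_K_M:      {EstimateWeightGb(14, 4.5):F1} GB");  // ~7.9 GB
```

The ~2.1GB of quantized weights plus KV cache is why Phi-4-mini Q4_K_M fits comfortably in 3GB of VRAM.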

Four Integration Paths

1. Ollama + OllamaSharp

The fastest path from zero to running inference:

ollama pull phi4-mini
ollama run phi4-mini "Hello, can you help me write C# code?"

C# Integration

using OllamaSharp;
using Microsoft.Extensions.AI;
using OpenAI;
using System.ClientModel;

// Option A: OllamaApiClient implements IChatClient directly (OllamaSharp 4+)
builder.Services.AddSingleton<IChatClient>(
    new OllamaApiClient(new Uri("http://localhost:11434"), "phi4-mini"));

// Option B: treat Ollama as an OpenAI-compatible endpoint
builder.Services.AddChatClient(
    new OpenAIClient(
            new ApiKeyCredential("ollama"),
            new OpenAIClientOptions { Endpoint = new Uri("http://localhost:11434/v1") })
        .GetChatClient("phi4-mini")
        .AsIChatClient());
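
Either registration yields an IChatClient you can inject anywhere. A minimal consumption sketch (GetResponseAsync and ChatOptions reflect recent Microsoft.Extensions.AI releases; the API surface has shifted between previews, so check your installed version):

```csharp
using Microsoft.Extensions.AI;

public class CodeHelper(IChatClient chat)
{
    public async Task<string> ExplainAsync(string snippet)
    {
        // System + user messages, capped output length for snappy local responses.
        var response = await chat.GetResponseAsync(
        [
            new ChatMessage(ChatRole.System, "You are a concise C# tutor."),
            new ChatMessage(ChatRole.User, $"Explain this code:\n{snippet}"),
        ],
        new ChatOptions { MaxOutputTokens = 256 });

        return response.Text;
    }
}
```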

Semantic Kernel Integration

using Microsoft.SemanticKernel;

var kernelBuilder = Kernel.CreateBuilder();
kernelBuilder.AddOpenAIChatCompletion(
    modelId: "phi4-mini",
    endpoint: new Uri("http://localhost:11434/v1"),
    apiKey: "ollama");

var kernel = kernelBuilder.Build();
var result = await kernel.InvokePromptAsync("Explain async/await in C# in 3 sentences.");

Semantic Kernel treats Ollama as an OpenAI-compatible provider. The apiKey value can be any non-empty string.

2. ONNX Runtime GenAI

For maximum performance on local hardware, ONNX Runtime GenAI runs models through hardware-specific execution providers: DirectML for Windows GPUs, CUDA for NVIDIA, or optimized CPU kernels.

using Microsoft.ML.OnnxRuntimeGenAI;

// Load the ONNX-exported model and its tokenizer from disk.
using var model = new Model("./phi4-mini-onnx");
using var tokenizer = new Tokenizer(model);

// Phi-4 uses the <|system|>/<|user|>/<|assistant|> chat template.
var prompt = "<|system|>You are a helpful assistant.<|end|><|user|>Write a C# hello world.<|end|><|assistant|>";
var sequences = tokenizer.Encode(prompt);

var generatorParams = new GeneratorParams(model);
generatorParams.SetInputSequences(sequences);
generatorParams.SetSearchOption("max_length", 512);

using var generator = new Generator(model, generatorParams);
while (!generator.IsDone())
{
    // Produce one token per iteration and decode it immediately for streaming.
    generator.ComputeLogits();
    generator.GenerateNextToken();
    var outputTokens = generator.GetSequence(0);
    var newToken = outputTokens[^1];
    Console.Write(tokenizer.Decode([newToken]));
}

The generation loop gives you token-by-token streaming output.
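
That loop lifts naturally into an IAsyncEnumerable<string> so ASP.NET endpoints or console UIs can consume it with await foreach. A sketch reusing the same Generator API as above (StreamAsync and its parameter names are illustrative, not part of the library):

```csharp
using Microsoft.ML.OnnxRuntimeGenAI;
using System.Runtime.CompilerServices;

public static async IAsyncEnumerable<string> StreamAsync(
    Model model, Tokenizer tokenizer, string prompt, int maxLength = 512,
    [EnumeratorCancellation] CancellationToken ct = default)
{
    var sequences = tokenizer.Encode(prompt);
    using var generatorParams = new GeneratorParams(model);
    generatorParams.SetInputSequences(sequences);
    generatorParams.SetSearchOption("max_length", maxLength);

    using var generator = new Generator(model, generatorParams);
    while (!generator.IsDone() && !ct.IsCancellationRequested)
    {
        // Generation is compute-bound; yield so the caller's loop stays responsive.
        await Task.Yield();
        generator.ComputeLogits();
        generator.GenerateNextToken();
        var tokens = generator.GetSequence(0);
        yield return tokenizer.Decode([tokens[^1]]);
    }
}
```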

3. Azure AI Foundry Local

Enterprise-grade option with Azure integration. Foundry Local downloads, manages, and serves models on-device and exposes an OpenAI-compatible endpoint, so the Semantic Kernel and Microsoft.Extensions.AI snippets above work against it with little more than a base-URL change.

4. LlamaSharp

.NET bindings for llama.cpp. LlamaSharp runs GGUF-quantized models (Llama 3, Mistral, Phi, and others) on CPU or GPU entirely in-process, with no external server required.
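
A minimal LlamaSharp sketch, assuming you have a GGUF file on disk (the path is a placeholder; the types are from LlamaSharp's LLama and LLama.Common namespaces in recent releases):

```csharp
using LLama;
using LLama.Common;

var modelParams = new ModelParams("./models/phi-4-mini.Q4_K_M.gguf")
{
    ContextSize = 4096,
    GpuLayerCount = 32 // set to 0 for CPU-only inference
};

// Load weights once; contexts and executors are cheap by comparison.
using var weights = LLamaWeights.LoadFromFile(modelParams);
using var context = weights.CreateContext(modelParams);
var executor = new InteractiveExecutor(context);

await foreach (var token in executor.InferAsync(
    "Write a C# extension method that reverses a string.",
    new InferenceParams { MaxTokens = 256 }))
{
    Console.Write(token);
}
```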

When Should You Use Local?

Use local for:

  • Development iteration and prototyping
  • Sensitive PII or healthcare data
  • Offline-first applications
  • Cost optimization on heavy workloads
  • Low-latency interactive features

Use cloud for:

  • Production workloads requiring constant availability
  • Extremely complex tasks needing larger models
  • Multi-region deployments with strict SLAs

Next Steps

Most .NET developers should start with Phi-4-mini Q4_K_M via Ollama. It responds quickly enough for interactive use, fits on almost any modern GPU, and handles the majority of typical development tasks — code explanation, LINQ generation, unit test scaffolding, and documentation drafting.

Want to go deeper? Check out the official ONNX Runtime documentation and Microsoft's Phi model repository on Hugging Face for pre-converted models ready for ONNX Runtime GenAI.


About the Author

Vikrant Bagal is a developer advocate and architect passionate about building practical AI-powered applications. Connect with me on LinkedIn for more .NET and AI content.

🔗 LinkedIn Profile


What local AI use cases are you most excited about? Share your thoughts in the comments below!

