Running large language models in your .NET applications is no longer sci-fi — it's production-ready reality.
## Why Local Inference Matters
### Cost Savings
Developers running intensive AI-assisted workflows often report monthly bills in the $200-$400 range. Switching development and testing traffic to a local model brings that dramatically down — often to under $50/month for the same development throughput.
### Privacy & Compliance
HIPAA and GDPR require knowing where data is processed. Local inference means patient records, PII, and confidential business data never leave your network. No BAA negotiation, no data processing addendum — the data simply doesn't move.
### Offline Capability
Laptops lose connectivity. CI environments sometimes firewall external APIs. A local model works identically on a plane at 35,000 feet and in an air-gapped staging environment.
### Latency
A well-configured local model on modern consumer GPU hardware starts returning tokens in under 100ms for short prompts. Cloud API roundtrips typically add 300-800ms depending on region and load.
## Phi-4 Model Family
Microsoft's Phi-4 family delivers competitive reasoning and coding performance in models small enough to run on a developer laptop.
| Model | Parameters | VRAM | Best For |
|---|---|---|---|
| Phi-4 | 14B | 16GB | Complex reasoning, code generation |
| Phi-4-mini | 3.8B | 4GB GPU / CPU | Simple tasks, fast iteration |
| Phi-4-multimodal | 5.6B | 8GB | Text + image understanding |
**Pro tip:** Quantized variants (Q4_K_M) run comfortably in 3GB VRAM: at roughly 4.5 bits per parameter, Phi-4-mini's 3.8B weights occupy about 2.1GB, leaving headroom for the KV cache. This is the recommended starting point for developer laptops with 8GB unified memory.
## Four Integration Paths
### 1. Ollama + OllamaSharp
The fastest path from zero to running inference:
```bash
ollama pull phi4-mini
ollama run phi4-mini "Hello, can you help me write C# code?"
```
#### C# Integration
```csharp
using OllamaSharp;
using Microsoft.Extensions.AI;

// Option A: Using OllamaSharp DI extension
builder.Services.AddSingleton(new OllamaApiClient(new Uri("http://localhost:11434")));
builder.Services.AddSingleton<IChatClient>(sp =>
    sp.GetRequiredService<OllamaApiClient>().AsChatClient("phi4-mini"));

// Option B: OpenAI-compatible endpoint
builder.Services.AddOpenAIChatClient(
    modelId: "phi4-mini",
    endpoint: new Uri("http://localhost:11434/v1"),
    apiKey: "ollama");
```
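Once registered, the client resolves like any other service. A minimal usage sketch — note that the `Microsoft.Extensions.AI` method names have shifted across preview releases (`GetResponseAsync` was previously `CompleteAsync`), so match your installed package version:

```csharp
using Microsoft.Extensions.AI;

// Resolve the IChatClient registered above (e.g. from an ASP.NET Core app's services).
var chatClient = app.Services.GetRequiredService<IChatClient>();

// Single-shot completion against the local phi4-mini model.
var response = await chatClient.GetResponseAsync("Summarize what IChatClient abstracts.");
Console.WriteLine(response.Text);

// Streaming: write tokens to the console as they arrive from Ollama.
await foreach (var update in chatClient.GetStreamingResponseAsync("Explain records in C#."))
{
    Console.Write(update.Text);
}
```

Because the code depends only on `IChatClient`, swapping Ollama for a cloud provider later is a registration change, not a rewrite.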
#### Semantic Kernel Integration
```csharp
using Microsoft.SemanticKernel;

var kernelBuilder = Kernel.CreateBuilder();
kernelBuilder.AddOpenAIChatCompletion(
    modelId: "phi4-mini",
    endpoint: new Uri("http://localhost:11434/v1"),
    apiKey: "ollama");

var kernel = kernelBuilder.Build();
var result = await kernel.InvokePromptAsync("Explain async/await in C# in 3 sentences.");
Console.WriteLine(result);
```
Semantic Kernel treats Ollama as an OpenAI-compatible provider. The apiKey value can be any non-empty string.
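For interactive features, the same kernel can stream the answer instead of blocking on the full response. A sketch using `InvokePromptStreamingAsync`:

```csharp
using Microsoft.SemanticKernel;

var kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion(
        modelId: "phi4-mini",
        endpoint: new Uri("http://localhost:11434/v1"),
        apiKey: "ollama")
    .Build();

// Yields chunks as the model produces them -- first-token latency
// matters more than total completion time in chat-style UIs.
await foreach (var chunk in kernel.InvokePromptStreamingAsync(
    "Explain async/await in C# in 3 sentences."))
{
    Console.Write(chunk);
}
```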
### 2. ONNX Runtime GenAI
For maximum performance on local hardware, ONNX Runtime GenAI compiles models to native code targeting your specific hardware: DirectML for Windows GPU, CUDA for NVIDIA, or optimized CPU kernels.
```csharp
using Microsoft.ML.OnnxRuntimeGenAI;

using var model = new Model("./phi4-mini-onnx");
using var tokenizer = new Tokenizer(model);

// Phi-4 chat template: system / user / assistant turns separated by <|end|>.
var prompt = "<|system|>You are a helpful assistant.<|end|>" +
             "<|user|>Write a C# hello world.<|end|><|assistant|>";
var sequences = tokenizer.Encode(prompt);

using var generatorParams = new GeneratorParams(model);
generatorParams.SetInputSequences(sequences);
generatorParams.SetSearchOption("max_length", 512);

using var generator = new Generator(model, generatorParams);
while (!generator.IsDone())
{
    generator.ComputeLogits();
    generator.GenerateNextToken();

    // Decode and print only the newest token for streaming output.
    var outputTokens = generator.GetSequence(0);
    var newToken = outputTokens[^1];
    Console.Write(tokenizer.Decode([newToken]));
}
```
The generation loop gives you token-by-token streaming output.
### 3. Azure AI Foundry Local

The enterprise-grade option: Azure AI Foundry Local runs models from the Azure AI Foundry catalog on-device, serving them through a local OpenAI-compatible endpoint so the same SDK code works offline and against the Azure-hosted service.
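Because Foundry Local speaks the OpenAI wire protocol, the Semantic Kernel setup from the Ollama section carries over almost unchanged. A sketch — the endpoint port and model alias below are placeholders (Foundry Local assigns its port at startup; take the real values from the CLI output):

```csharp
using Microsoft.SemanticKernel;

// Placeholder endpoint and model alias -- substitute the values reported
// by the Foundry Local CLI when the service starts.
var kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion(
        modelId: "phi-4-mini",
        endpoint: new Uri("http://localhost:5273/v1"),
        apiKey: "not-needed-locally") // local endpoint; any non-empty string
    .Build();

Console.WriteLine(await kernel.InvokePromptAsync("Say hello from Foundry Local."));
```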
### 4. LlamaSharp

LlamaSharp is a .NET binding for llama.cpp: it loads GGUF models in-process with no separate server, and ships backend packages for CPU, CUDA, Vulkan, and Metal.
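A minimal LlamaSharp sketch, assuming you have a GGUF build of Phi-4-mini on disk — the file path and GPU layer count are illustrative, and the exact API shape depends on your LlamaSharp version (this follows the `InteractiveExecutor` pattern from the 0.x releases):

```csharp
using LLama;
using LLama.Common;

// Point this at your own GGUF file; GpuLayerCount = 0 means CPU-only.
var parameters = new ModelParams("./phi-4-mini-q4_k_m.gguf")
{
    ContextSize = 4096,
    GpuLayerCount = 20
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

// Stream the completion token-by-token, capped at 256 new tokens.
await foreach (var token in executor.InferAsync(
    "Write a C# hello world.",
    new InferenceParams { MaxTokens = 256 }))
{
    Console.Write(token);
}
```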
## When Should You Use Local?
✅ Use local for:
- Development iteration and prototyping
- Sensitive PII or healthcare data
- Offline-first applications
- Cost optimization on heavy workloads
- Low-latency interactive features
❌ Use cloud for:
- Production workloads requiring constant availability
- Extremely complex tasks needing larger models
- Multi-region deployments with strict SLAs
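These two lists aren't mutually exclusive: one pragmatic pattern is local-first with a cloud fallback. The `IChatClient` abstraction makes this a small helper — the sketch below is a hypothetical illustration (both clients are assumed to be configured elsewhere, and the fallback trigger is deliberately simple):

```csharp
using Microsoft.Extensions.AI;

// Hypothetical helper: prefer the local model, retry against a cloud
// provider if the local server is down or the request times out.
static async Task<string> CompleteWithFallbackAsync(
    IChatClient localClient, IChatClient cloudClient, string prompt)
{
    try
    {
        var response = await localClient.GetResponseAsync(prompt);
        return response.Text;
    }
    catch (Exception ex) when (ex is HttpRequestException or TaskCanceledException)
    {
        // Local endpoint unavailable -- route the same prompt to the cloud.
        var response = await cloudClient.GetResponseAsync(prompt);
        return response.Text;
    }
}
```

Because both clients share one interface, the routing policy can grow (prompt length, task type, user tier) without touching the call sites.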
## Next Steps
Most .NET developers should start with Phi-4-mini Q4_K_M via Ollama. It responds quickly enough for interactive use, fits on almost any modern GPU, and handles the majority of typical development tasks — code explanation, LINQ generation, unit test scaffolding, and documentation drafting.
Want to go deeper? Check out the official ONNX Runtime documentation and Microsoft's Phi model repository on Hugging Face for pre-converted models ready for ONNX Runtime GenAI.
## About the Author
Vikrant Aghara is a developer advocate and architect passionate about building practical AI-powered applications. Connect with him on LinkedIn for more .NET and AI content.
What local AI use cases are you most excited about? Share your thoughts in the comments below!