Adrián Bailador

Build Your Local AI with C# and .NET

Introduction

In a world where data privacy and control over technological infrastructure are increasingly important, running artificial intelligence models completely offline is no longer just a technical curiosity: it's a necessity.

In this guide you'll learn how to build your own local AI in C# and .NET, using llama.cpp and its .NET wrapper LLamaSharp.

You'll be able to generate text, answer questions, or build your own chatbot without an internet connection, without depending on OpenAI or other APIs, and with complete control over your data.

⚠️ Minimum requirements: 8 GB RAM, modern CPU (4+ cores), 10 GB disk space


What We'll Cover

0/ Installing and setting up llama-cli

1/ Downloading and using GGUF models

2/ Running text generation locally without an internet connection

3/ Building a simple C# integration to send prompts and receive AI responses

4/ Why learning to run AI offline is critical — for privacy, cost control, performance, and future-proofing your projects

5/ Best practices for using offline AI in your own apps

Whether you're building secure apps, working with sensitive data, or just want full control over your AI models, this guide will help you get started!


0️⃣ Installing and Setting Up llama-cli

What is llama.cpp?

llama.cpp is a CPU-optimised C/C++ implementation that allows you to run large language models (LLMs) locally — without expensive GPUs or an internet connection. It's the foundation of many current offline AI tools.

Installation on Windows

# Download precompiled binaries from:
# https://github.com/ggerganov/llama.cpp/releases

# Or compile from source:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake .. && cmake --build . --config Release

Installation on Linux/macOS

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make   # recent llama.cpp versions use CMake instead: cmake -B build && cmake --build build --config Release

Once the build finishes, you'll have the llama-cli executable available; run ./llama-cli --help to check that it works.


1️⃣ Downloading and Using GGUF Models

What are GGUF models?

GGUF (GPT-Generated Unified Format) is the standard format for running quantised language models, optimised for CPU. Thanks to quantisation, models are lighter and faster, with minimal quality loss.

Where to download models

The main source is Hugging Face, which hosts ready-made GGUF conversions of most popular models; the download example below uses TheBloke's Mistral-7B-Instruct repository.

Recommended models

| Model | Size | RAM needed | Ideal for |
|-------|------|------------|-----------|
| Llama-3.2-3B | ~2 GB | 4 GB | Basic tasks |
| Mistral-7B | ~4 GB | 8 GB | General use |
| Llama-3.1-8B | ~5 GB | 10 GB | High quality |

Download example

mkdir models && cd models
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf

💡 Tip: Use the Q4_K_M version for the best balance between performance and quality.


2️⃣ Running Locally Without an Internet Connection

Basic command

./llama-cli -m ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  -p "Explain what artificial intelligence is in simple terms" \
  -n 256 -c 2048

If you later wire the model into a C# application (section 3 below), an appsettings.json like this can hold the model path instead of hardcoding it:

{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore": "Warning"
    }
  },
  "AllowedHosts": "*",
  "AI": {
    "ModelPath": "C:\\Users\\Adri\\AI\\models\\mistral-7b-instruct-v0.2.Q4_K_M.gguf"
  }
}

| Parameter | Description |
|-----------|-------------|
| -m | Path to GGUF model |
| -p | Prompt or user input |
| -n | Maximum output tokens |
| -c | Context size |
| --temp | Temperature (0.1–1.0) |

Interactive mode

./llama-cli -m ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  --interactive \
  --reverse-prompt "User:" \
  -c 4096

3️⃣ Simple C# Integration

Option A: Running llama-cli from C#

using System.Diagnostics;

public class LocalAI
{
    private readonly string _modelPath;
    private readonly string _llamaPath;

    public LocalAI(string modelPath, string llamaPath)
    {
        _modelPath = modelPath;
        _llamaPath = llamaPath;
    }

    public async Task<string> GenerateAsync(string prompt)
    {
        // Escape embedded double quotes so the prompt is passed as a single argument
        var escapedPrompt = prompt.Replace("\"", "\\\"");
        var args = $"-m \"{_modelPath}\" -p \"{escapedPrompt}\" -n 256 -c 2048 --log-disable";

        var psi = new ProcessStartInfo
        {
            FileName = _llamaPath,
            Arguments = args,
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using var process = Process.Start(psi)!;
        var output = await process.StandardOutput.ReadToEndAsync();
        await process.WaitForExitAsync();

        return CleanOutput(output);
    }

    private string CleanOutput(string text)
    {
        var lines = text.Split('\n')
            .Where(l => !l.Contains("llama_") && !l.Contains("main:"));
        return string.Join('\n', lines).Trim();
    }
}

Usage:

var ai = new LocalAI(
    modelPath: @"C:\Users\Adri\AI\models\mistral-7b.gguf",
    llamaPath: @"C:\Users\Adri\llama.cpp\llama-cli.exe"
);

var response = await ai.GenerateAsync("What is .NET?");
Console.WriteLine(response);
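
Local generation on a CPU can take a while for long prompts, so it's worth bounding the wait. Here's a minimal sketch, a hypothetical overload you could add to LocalAI, using Task.WaitAsync:

// Hypothetical overload: stop waiting if llama-cli takes longer than the timeout.
// Note: this only abandons the wait; killing the child process would need extra handling.
public async Task<string> GenerateAsync(string prompt, TimeSpan timeout)
{
    using var cts = new CancellationTokenSource(timeout);
    try
    {
        return await GenerateAsync(prompt).WaitAsync(cts.Token);
    }
    catch (OperationCanceledException)
    {
        return "[generation timed out]";
    }
}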

Option B: Using LLamaSharp

dotnet add package LLamaSharp
dotnet add package LLamaSharp.Backend.Cpu

using LLama;
using LLama.Common;
using System.Text;

public class LLamaService
{
    private readonly LLamaContext _context;

    public LLamaService(string modelPath)
    {
        var parameters = new ModelParams(modelPath)
        {
            ContextSize = 2048,
            GpuLayerCount = 0
        };

        var weights = LLamaWeights.LoadFromFile(parameters);
        _context = weights.CreateContext(parameters);
    }

    public async Task<string> GenerateAsync(string prompt)
    {
        var executor = new InteractiveExecutor(_context);
        var config = new InferenceParams { MaxTokens = 256 };

        var result = new StringBuilder();
        await foreach (var token in executor.InferAsync(prompt, config))
            result.Append(token);

        return result.ToString();
    }
}
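
Usage mirrors Option A (assuming the same model path as before):

var ai = new LLamaService(@"C:\Users\Adri\AI\models\mistral-7b-instruct-v0.2.Q4_K_M.gguf");

var response = await ai.GenerateAsync("What is .NET?");
Console.WriteLine(response);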

Minimal REST API

// Program.cs
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddSingleton(_ => 
    new LLamaService(@"C:\Users\Adri\AI\models\mistral-7b.gguf"));

var app = builder.Build();

app.MapPost("/generate", async (PromptRequest req, LLamaService ai) =>
{
    var response = await ai.GenerateAsync(req.Prompt);
    return Results.Ok(new { response });
});

app.Run();

record PromptRequest(string Prompt);
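
To avoid hardcoding the path, you can also read it from the appsettings.json shown in section 2; a minimal sketch using the "AI:ModelPath" key:

// Program.cs variant: read the model path from configuration
// (assumes the appsettings.json shown earlier, with an "AI:ModelPath" entry)
var builder = WebApplication.CreateBuilder(args);

var modelPath = builder.Configuration["AI:ModelPath"]
    ?? throw new InvalidOperationException("AI:ModelPath is not configured.");

builder.Services.AddSingleton(_ => new LLamaService(modelPath));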

4️⃣ Why Running AI Offline is Critical

🔒 Privacy and Security

Complete data control — Nothing leaves your infrastructure

Regulatory compliance — easier to meet GDPR, HIPAA, and similar requirements

No leak risk — Your prompts never travel to the cloud

Protected intellectual property — Your code remains private

⚡ Performance and Availability

  • No network round trips: responses aren't delayed by calls to a remote API, and generation speed depends only on your hardware
  • Always available: No internet dependency
  • No rate limits: Process as many requests as your hardware supports
  • Controlled scalability: Adjust resources as needed

Future-Proofing

  • Models constantly improve and become more efficient
  • Privacy regulations are tightening
  • The ability to run AI locally will be a competitive advantage
  • Technological independence and no vendor lock-in

5️⃣ Best Practices

Optimisation

  • Use Q4_K_M quantised models for the best balance
  • Adjust --threads to your CPU core count
  • Consider GPU if you have NVIDIA (with GpuLayerCount > 0; see the sketch after this list)
  • Implement caching for frequent prompts
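
For the GPU point above: a minimal sketch, assuming an NVIDIA card and that you swap LLamaSharp.Backend.Cpu for a CUDA backend package (for example LLamaSharp.Backend.Cuda12):

// Inside the LLamaService constructor, for example: offload part of the model to the GPU.
// Assumes an NVIDIA GPU and the matching LLamaSharp CUDA backend package.
var parameters = new ModelParams(modelPath)
{
    ContextSize = 4096,
    GpuLayerCount = 32   // 0 = CPU only; increase until you run out of VRAM
};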

Security

  • Validate and sanitise all user inputs
  • Implement rate limiting on public APIs (see the sketch after this list)
  • Monitor resource usage (CPU, RAM)
  • Log metrics to detect anomalies
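
For the rate-limiting point above, ASP.NET Core (.NET 7+) ships a built-in rate limiter you can attach to the /generate endpoint; a minimal sketch:

// Program.cs additions: allow at most 10 generations per minute (fixed window)
using Microsoft.AspNetCore.RateLimiting;

builder.Services.AddRateLimiter(options =>
    options.AddFixedWindowLimiter("ai", o =>
    {
        o.PermitLimit = 10;
        o.Window = TimeSpan.FromMinutes(1);
    }));

var app = builder.Build();
app.UseRateLimiter();

app.MapPost("/generate", async (PromptRequest req, LLamaService ai) =>
        Results.Ok(new { response = await ai.GenerateAsync(req.Prompt) }))
    .RequireRateLimiting("ai");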

Monitoring

// Assumes _ai is the LLamaService shown earlier and _logger is an injected ILogger
public async Task<(string Response, TimeSpan Duration)> GenerateWithMetricsAsync(string prompt)
{
    var sw = Stopwatch.StartNew();
    var response = await _ai.GenerateAsync(prompt);
    sw.Stop();

    _logger.LogInformation("Generated ~{Words} words in {Ms}ms",
        response.Split(' ').Length, sw.ElapsedMilliseconds);

    return (response, sw.Elapsed);
}

Persistence

// Simple cache with IMemoryCache (assumes an injected IMemoryCache _cache
// and the LLamaService from earlier as _ai)
public async Task<string> GetOrGenerateAsync(string prompt)
{
    var key = $"ai:{prompt.GetHashCode()}";

    if (_cache.TryGetValue(key, out string? cached))
        return cached!;

    var result = await _ai.GenerateAsync(prompt);
    _cache.Set(key, result, TimeSpan.FromHours(1));

    return result;
}

Use Cases

Internal support chatbot — Without exposing data to third parties

Code assistant — Generate tests, document functions

Sensitive data processing — Analyse confidential information

Ticket auto-response — Automated help system

Documentation generation — Explain legacy code


Next Steps

  1. Download a GGUF model and try llama-cli
  2. Integrate with C# using one of the methods shown
  3. Measure performance on your hardware
  4. Implement caching to improve speed
  5. Explore fine-tuning with your own data

Conclusion

Running AI offline with C# and .NET is accessible, powerful, and necessary.

With llama.cpp and LLamaSharp you can:

✅ Protect your users' privacy

✅ Drastically reduce operational costs

✅ Gain technological independence

✅ Build faster and more reliable applications

Offline AI isn't the future — it's already the present.

And with .NET, you can master it today.


Resources

  • llama.cpp: https://github.com/ggerganov/llama.cpp
  • LLamaSharp: https://github.com/SciSharp/LLamaSharp
  • GGUF models on Hugging Face: https://huggingface.co