DEV Community

alinabi19
Running Local AI Models in .NET with Ollama (Step-by-Step Guide)

Most developers who start experimenting with AI tend to follow the same path.

You integrate a cloud AI API into your application. The prototype works beautifully. Responses are fast, integration is simple, and everything feels almost magical.

Then the production questions start appearing.

How much will this cost at scale?

Do we really want sensitive data leaving our infrastructure?

What happens if the API rate limits us?

And the big one many developers eventually ask:

Can we run AI models locally instead?

The answer is yes. And tools like Ollama make it much easier than most developers expect.

Ollama allows you to run powerful language models directly on your machine and access them through a simple HTTP API. This means you can integrate local AI into ASP.NET Core APIs, background services, or internal tools without relying on external providers.

In this guide we will walk through:

  • why local AI models are becoming popular
  • how Ollama works
  • how to run models locally
  • how to call Ollama from a .NET application
  • how to build a simple AI-powered ASP.NET Core endpoint

If you are a .NET developer curious about integrating AI without depending entirely on cloud APIs, this is a great place to start.


Why Run AI Models Locally?

Cloud AI APIs are extremely powerful, but they are not always the best solution for every scenario.

Running models locally offers a few advantages that become very attractive in production environments.

No API Usage Costs

Most cloud AI providers charge based on tokens or requests.

That works well for prototypes. But once usage grows, costs can scale quickly.

Running models locally removes the per-request cost entirely, which makes a big difference for internal tools or heavy workloads.

Full Data Privacy

Many enterprise systems process sensitive information such as:

  • internal documentation
  • support tickets
  • logs
  • customer records

Sending this data to external AI APIs can raise security and compliance concerns.

Local models keep everything inside your infrastructure.

Lower Latency

Cloud inference requires a network round trip.

Local inference removes that dependency completely.

For internal assistants, dashboards, or developer tooling, this often results in noticeably faster responses.

Offline AI Capabilities

Local models can run without internet access.

This is useful in environments like:

  • secure enterprise networks
  • air-gapped systems
  • developer tools running locally

Ideal for Internal Tools

Local AI is especially useful for building tools such as:

  • internal chat assistants
  • log summarization tools
  • documentation search
  • developer copilots
  • AI-powered dashboards

This is exactly the kind of scenario where Ollama shines.

What is Ollama?

Ollama is a tool that allows developers to run LLMs (large language models) locally with minimal setup.

Instead of manually managing model weights, runtime environments, and inference servers, Ollama handles the heavy lifting.

It manages:

  • model downloads
  • model execution
  • memory handling
  • inference APIs

Once installed, Ollama exposes a local HTTP API.

That means any language capable of making HTTP requests can interact with it, including:

  • C#
  • Python
  • JavaScript
  • Go
  • Java

For backend developers, this makes integration extremely straightforward.

Supported Models

Ollama supports many popular open-source models, including:

  • Llama 3
  • Mistral
  • Gemma
  • Code Llama
  • various other community models

Different models are optimized for different tasks.

For example:

Model        Best Use Case
llama3       General AI tasks
mistral      Fast responses
codellama    Code generation
gemma        Lightweight inference

One of the biggest advantages of Ollama is how easy it is to switch between models.
Before choosing a model, it's worth understanding the hardware requirements involved in running these models locally.

Resource Considerations

LLMs still require significant system resources, even when everything runs on your own machine.

One thing you'll quickly notice when running local models is that inference latency can vary significantly depending on hardware.

During development, it’s common for responses to take several seconds when running on CPU-only machines.

For internal tools this is usually acceptable, but it’s worth keeping in mind when designing user-facing APIs.

For example:

  • Llama 3 8B models typically require 8–16 GB of RAM
  • CPU inference works but may be slower
  • GPUs significantly improve performance

For many internal tools, smaller or quantized models provide the best balance between performance and resource usage.

Architecture of a .NET Application Using Ollama

Before writing code, it helps to understand where Ollama fits into the architecture.

A typical integration looks like this:

Client
   ↓
ASP.NET Core API
   ↓
AI Service Layer
   ↓
Ollama Local API
   ↓
Local LLM Model

Request flow:

  1. Client sends a request to the ASP.NET Core API
  2. The API calls an AI service layer
  3. The service sends a prompt to Ollama
  4. Ollama runs the model locally
  5. The generated response returns to the client

This separation keeps the architecture clean and maintainable.

Step 1: Installing Ollama

First, install Ollama on your machine.

Go to the official website:

https://ollama.com

Download the installer for your operating system. Ollama currently supports:

  • macOS
  • Linux
  • Windows

Run the installer and complete the setup.

After installation, open a terminal or command prompt and verify that Ollama is installed correctly by running:

ollama --version

If Ollama is installed properly, you should see the installed version printed in the terminal.

Download a Model

Next, download a language model that Ollama will run locally.

For this guide, we will use Llama 3.

Run the following command:

ollama pull llama3

This command downloads the model weights and prepares them for local inference.

Depending on your internet speed, the download may take several minutes, since model weights are typically several gigabytes in size.

Run the Model

Once the model is downloaded, you can start it using:

ollama run llama3

Ollama will load the model and open an interactive prompt.

Try entering a simple question:

Explain what REST APIs are

If the model responds with an answer, your local AI environment is working correctly.

Local API Endpoint

Behind the scenes, Ollama also exposes an HTTP API that applications can call.

By default, the API runs at:

http://localhost:11434

This is the endpoint your ASP.NET Core application will communicate with.
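Before writing any C#, you can sanity-check this endpoint from a terminal. The example below assumes Ollama is running on the default port and that the llama3 model has already been pulled:

```shell
# Send a one-off, non-streaming generation request to the local Ollama API
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Say hello in one sentence", "stream": false}'
```

If everything is set up correctly, the response is a single JSON object containing the generated text.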

Step 2: Creating a .NET 8 Web API

Next, create a new ASP.NET Core Web API project. The .NET 8 webapi template defaults to minimal APIs, so add --use-controllers to generate a controller-based project:

dotnet new webapi -n LocalAIApi --use-controllers

A simple project structure might look like this:

Controllers
Services
Models
Program.cs

Keeping AI logic separated into services helps maintain clean architecture.

Step 3: Calling the Ollama API from .NET

Ollama exposes a simple endpoint for generating responses.

POST http://localhost:11434/api/generate

Example request payload:

{
  "model": "llama3",
  "prompt": "Explain dependency injection in ASP.NET Core",
  "stream": false
}
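For reference, a non-streaming call returns a single JSON object. The shape below is abbreviated (real responses also include timing and context fields, which vary by Ollama version):

```json
{
  "model": "llama3",
  "created_at": "2024-01-01T00:00:00Z",
  "response": "Dependency injection in ASP.NET Core is...",
  "done": true
}
```

The response field holds the generated text, which is what a typed .NET client needs to deserialize.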

Example .NET Call

public async Task<string> GenerateAsync(
    string prompt,
    CancellationToken cancellationToken)
{
    var request = new
    {
        model = "llama3",
        prompt,
        stream = false
    };

    var response = await _httpClient.PostAsJsonAsync(
        "api/generate",
        request,
        cancellationToken);

    response.EnsureSuccessStatusCode();

    var result = await response.Content
        .ReadFromJsonAsync<OllamaResponse>(cancellationToken: cancellationToken);

    return result?.Response ?? string.Empty;
}

Response model:

using System.Text.Json.Serialization;

public class OllamaResponse
{
    // Maps the lowercase "response" field in Ollama's JSON output
    [JsonPropertyName("response")]
    public string Response { get; set; } = string.Empty;
}

Using a typed model instead of dynamic makes the code safer and easier to maintain.

Step 4: Creating an AI Service Layer

One design rule worth following:

Avoid putting AI logic directly inside controllers.

Instead, isolate it inside a service layer.

Service Interface

public interface IAiService
{
    Task<string> GenerateAsync(string prompt, CancellationToken cancellationToken);
}

Implementation

public class OllamaService : IAiService
{
    private readonly HttpClient _httpClient;

    public OllamaService(HttpClient httpClient)
    {
        _httpClient = httpClient;
    }

    public async Task<string> GenerateAsync(
        string prompt,
        CancellationToken cancellationToken)
    {
        var request = new
        {
            model = "llama3",
            prompt,
            stream = false
        };

        var response = await _httpClient.PostAsJsonAsync(
            "api/generate",
            request,
            cancellationToken);

        response.EnsureSuccessStatusCode();

        var result = await response.Content
            .ReadFromJsonAsync<OllamaResponse>(cancellationToken: cancellationToken);

        return result?.Response ?? string.Empty;
    }
}

Step 5: Registering the Service

In Program.cs:

builder.Services.AddHttpClient<IAiService, OllamaService>(client =>
{
    client.BaseAddress = new Uri("http://localhost:11434");
    client.Timeout = TimeSpan.FromMinutes(2);
});

Using IHttpClientFactory (via AddHttpClient) ensures efficient connection pooling and handler lifetime management.

Step 6: Building an AI Endpoint

Now expose the AI functionality through a controller.

Request model:

public class AiRequest
{
    public string Prompt { get; set; } = string.Empty;
}

Controller:

[ApiController]
[Route("api/ai")]
public class AiController : ControllerBase
{
    private readonly IAiService _aiService;

    public AiController(IAiService aiService)
    {
        _aiService = aiService;
    }

    [HttpPost("generate")]
    public async Task<IActionResult> Generate(
        AiRequest request,
        CancellationToken cancellationToken)
    {
        if (string.IsNullOrWhiteSpace(request.Prompt))
            return BadRequest("Prompt is required.");

        var result = await _aiService.GenerateAsync(
            request.Prompt,
            cancellationToken);

        return Ok(new { response = result });
    }
}

Your ASP.NET Core API now exposes a local AI-powered endpoint.
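You can exercise the new endpoint with curl. The port below is an assumption; use whatever port your launch profile actually assigns:

```shell
# Call the ASP.NET Core endpoint, which forwards the prompt to Ollama
curl -X POST http://localhost:5000/api/ai/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain middleware in ASP.NET Core in two sentences"}'
```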

Step 7: Switching Models

One of the nicest things about Ollama is how easy it is to switch between models.

For example:

ollama pull mistral
ollama pull codellama

Then update the model field in the request payload:

{
  "model": "mistral"
}

In practice, testing a few models usually produces better results than simply choosing the largest one.
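To make switching easier on the .NET side, avoid hard-coding the model name. A minimal sketch that reads it from configuration (the "Ollama:Model" key here is an assumption of this article, not something Ollama defines):

```csharp
// appsettings.json (hypothetical section):
// { "Ollama": { "Model": "mistral" } }

// Program.cs: read the configured model, falling back to llama3
var model = builder.Configuration["Ollama:Model"] ?? "llama3";
```

You would then pass this value into OllamaService (for example via the options pattern) instead of the hard-coded "llama3" string.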

Step 8: Improving Performance

Local models can still be resource intensive.

A few practical optimizations can help significantly.

Use Streaming
Streaming responses improves perceived latency for longer outputs.
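As a sketch, streaming from Ollama means setting "stream": true and reading the response body line by line; each line is a small JSON object carrying a partial response fragment. The method below assumes the same injected HttpClient as OllamaService, plus using directives for System.Net.Http.Json, System.Runtime.CompilerServices, and System.Text.Json:

```csharp
public async IAsyncEnumerable<string> StreamAsync(
    string prompt,
    [EnumeratorCancellation] CancellationToken cancellationToken = default)
{
    var request = new HttpRequestMessage(HttpMethod.Post, "api/generate")
    {
        Content = JsonContent.Create(new { model = "llama3", prompt, stream = true })
    };

    // ResponseHeadersRead lets us consume tokens as they arrive,
    // instead of buffering the whole response first
    using var response = await _httpClient.SendAsync(
        request, HttpCompletionOption.ResponseHeadersRead, cancellationToken);
    response.EnsureSuccessStatusCode();

    var jsonOptions = new JsonSerializerOptions(JsonSerializerDefaults.Web);
    using var reader = new StreamReader(
        await response.Content.ReadAsStreamAsync(cancellationToken));

    while (!reader.EndOfStream)
    {
        var line = await reader.ReadLineAsync(cancellationToken);
        if (string.IsNullOrWhiteSpace(line)) continue;

        // Each newline-delimited JSON line holds a partial "response" fragment
        var chunk = JsonSerializer.Deserialize<OllamaResponse>(line, jsonOptions);
        if (!string.IsNullOrEmpty(chunk?.Response))
            yield return chunk.Response;
    }
}
```

A controller can forward these fragments to the client as server-sent events or a chunked response, so users see text appear incrementally.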

Reduce Prompt Size
Large prompts increase inference time.
Send only the context that the model truly needs.

Cache Repeated Requests
If prompts repeat frequently, caching responses can reduce compute usage.
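One possible sketch is a decorator over IAiService backed by IMemoryCache (registered with builder.Services.AddMemoryCache(); the 30-minute expiry is an arbitrary choice, not a recommendation):

```csharp
public class CachedAiService : IAiService
{
    private readonly IAiService _inner;
    private readonly IMemoryCache _cache;

    public CachedAiService(IAiService inner, IMemoryCache cache)
    {
        _inner = inner;
        _cache = cache;
    }

    public async Task<string> GenerateAsync(string prompt, CancellationToken cancellationToken)
    {
        // Identical prompts skip inference entirely
        if (_cache.TryGetValue(prompt, out string? cached) && cached is not null)
            return cached;

        var result = await _inner.GenerateAsync(prompt, cancellationToken);
        _cache.Set(prompt, result, TimeSpan.FromMinutes(30));
        return result;
    }
}
```

Keep in mind that cached answers can go stale if you switch models or change how prompts are constructed.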

Keep Calls Asynchronous
Always use async APIs when calling models to keep your backend scalable.

When Local Models Are Not the Best Choice

Local models are powerful, but they are not always the right solution.

For example, cloud AI services may still be better when:

  • you need extremely large models
  • you require massive scaling
  • GPU infrastructure is unavailable
  • inference workloads are very high

In practice, many teams use a hybrid approach, combining cloud models and local models depending on the use case.


Final Thoughts

The local AI ecosystem is moving incredibly fast right now.

Just a few years ago, running large language models required specialized machine learning environments. Today tools like Ollama make it possible for everyday backend developers to experiment with local LLMs using familiar technologies.

From a .NET perspective, integrating Ollama is actually much simpler than it looks at first.

Instead of relying entirely on external APIs, you can build AI-powered systems that are:

  • private
  • cost-efficient
  • low latency
  • fully controlled by your infrastructure

For internal tools, developer assistants, and AI-powered APIs, local models are quickly becoming a practical and powerful alternative to cloud AI services.

A Few Things Worth Remembering

  • Ollama makes running local LLMs simple
  • .NET applications can interact with Ollama using HTTP APIs
  • Use a dedicated AI service layer to keep architecture clean
  • Choose models based on your use case, not just size
  • Optimize prompts and responses for better performance
