Most developers who start experimenting with AI tend to follow the same path.
You integrate a cloud AI API into your application. The prototype works beautifully. Responses are fast, integration is simple, and everything feels almost magical.
Then the production questions start appearing.
How much will this cost at scale?
Do we really want sensitive data leaving our infrastructure?
What happens if the API rate limits us?
And the big one many developers eventually ask:
Can we run AI models locally instead?
The answer is yes. And tools like Ollama make it much easier than most developers expect.
Ollama allows you to run powerful language models directly on your machine and access them through a simple HTTP API. This means you can integrate local AI into ASP.NET Core APIs, background services, or internal tools without relying on external providers.
In this guide we will walk through:
- why local AI models are becoming popular
- how Ollama works
- how to run models locally
- how to call Ollama from a .NET application
- how to build a simple AI-powered ASP.NET Core endpoint
If you are a .NET developer curious about integrating AI without depending entirely on cloud APIs, this is a great place to start.
Why Run AI Models Locally?
Cloud AI APIs are extremely powerful, but they are not always the best solution for every scenario.
Running models locally offers a few advantages that become very attractive in production environments.
No API Usage Costs
Most cloud AI providers charge based on tokens or requests.
That works well for prototypes. But once usage grows, costs can scale quickly.
Running models locally removes the per-request cost entirely, which makes a big difference for internal tools or heavy workloads.
Full Data Privacy
Many enterprise systems process sensitive information such as:
- internal documentation
- support tickets
- logs
- customer records
Sending this data to external AI APIs can raise security and compliance concerns.
Local models keep everything inside your infrastructure.
Lower Latency
Cloud inference requires a network round trip.
Local inference removes that network round trip entirely, although raw inference speed still depends on your hardware.
For internal assistants, dashboards, or developer tooling, this often results in noticeably faster responses.
Offline AI Capabilities
Local models can run without internet access.
This is useful in environments like:
- secure enterprise networks
- air-gapped systems
- developer tools running locally
Ideal for Internal Tools
Local AI is especially useful for building tools such as:
- internal chat assistants
- log summarization tools
- documentation search
- developer copilots
- AI-powered dashboards
This is exactly the kind of scenario where Ollama shines.
What is Ollama?
Ollama is a tool that lets developers run large language models (LLMs) locally with minimal setup.
Instead of manually managing model weights, runtime environments, and inference servers, Ollama handles the heavy lifting.
It manages:
- model downloads
- model execution
- memory handling
- inference APIs
Once installed, Ollama exposes a local HTTP API.
That means any language capable of making HTTP requests can interact with it, including:
- C#
- Python
- JavaScript
- Go
- Java
For backend developers, this makes integration extremely straightforward.
Supported Models
Ollama supports many popular open-source models, including:
- Llama 3
- Mistral
- Gemma
- Code Llama
- various other community models
Different models are optimized for different tasks.
For example:
| Model | Best Use Case |
|---|---|
| llama3 | General AI tasks |
| mistral | Fast responses |
| codellama | Code generation |
| gemma | Lightweight inference |
One of the biggest advantages of Ollama is how easy it is to switch between models.
Before choosing a model, it's worth understanding the hardware requirements involved in running these models locally.
Resource Considerations
LLMs still consume significant system resources, even when they run locally.
One thing you'll quickly notice when running local models is that inference latency can vary significantly depending on hardware.
During development, it’s common for responses to take several seconds when running on CPU-only machines.
For internal tools this is usually acceptable, but it’s worth keeping in mind when designing user-facing APIs.
For example:

- Llama 3 8B models typically require 8–16 GB of RAM
- CPU inference works but may be slower
- GPUs significantly improve performance
For many internal tools, smaller or quantized models provide the best balance between performance and resource usage.
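As a sanity check on the figures above, a model's weight memory can be roughly estimated from its parameter count and quantization level. A minimal sketch (decimal gigabytes, ignoring the KV cache and other runtime overhead, which add more on top):

```csharp
// Rough rule of thumb: weights take about
// (parameter count × bits per weight) / 8 bytes.
double ParamsToGb(double parameters, double bitsPerWeight) =>
    parameters * bitsPerWeight / 8 / 1e9;

Console.WriteLine($"8B at fp16: ~{ParamsToGb(8e9, 16):F1} GB");      // roughly 16 GB
Console.WriteLine($"8B at 4-bit quantized: ~{ParamsToGb(8e9, 4):F1} GB"); // roughly 4 GB
```

This is why 4-bit quantized variants are popular for CPU-only machines: the same model fits in roughly a quarter of the memory.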
Architecture of a .NET Application Using Ollama
Before writing code, it helps to understand where Ollama fits into the architecture.
A typical integration looks like this:

```
Client
   ↓
ASP.NET Core API
   ↓
AI Service Layer
   ↓
Ollama Local API
   ↓
Local LLM Model
```
Request flow:
- Client sends a request to the ASP.NET Core API
- The API calls an AI service layer
- The service sends a prompt to Ollama
- Ollama runs the model locally
- The generated response returns to the client
This separation keeps the architecture clean and maintainable.
Step 1: Installing Ollama
First, install Ollama on your machine.
Go to the official Ollama website and download the installer for your operating system. Ollama currently supports:
- macOS
- Linux
- Windows
Run the installer and complete the setup.
After installation, open a terminal or command prompt and verify that Ollama is installed correctly by running:
```bash
ollama --version
```
If Ollama is installed properly, you should see the installed version printed in the terminal.
Download a Model
Next, download a language model that Ollama will run locally.
For this guide, we will use Llama 3.
Run the following command:
```bash
ollama pull llama3
```
This command downloads the model weights and prepares them for local inference.
Depending on your internet speed, the download may take a few minutes, since these models are several gigabytes in size.
Run the Model
Once the model is downloaded, you can start it using:
```bash
ollama run llama3
```
Ollama will load the model and open an interactive prompt.
Try entering a simple question:
Explain what REST APIs are
If the model responds with an answer, your local AI environment is working correctly.
Local API Endpoint
Behind the scenes, Ollama also exposes an HTTP API that applications can call.
By default, the API runs at:
http://localhost:11434
This is the endpoint your ASP.NET Core application will communicate with.
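Before writing any C#, you can confirm the endpoint is reachable with a quick request (this assumes the Ollama server is already running, which the desktop install handles automatically; otherwise start it with `ollama serve`):

```bash
curl -s http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "prompt": "Say hello in one sentence.", "stream": false}'
```

If this returns a JSON payload, your .NET integration will work against the same URL.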
Step 2: Creating a .NET 8 Web API
Next create a new ASP.NET Core API project.
```bash
dotnet new webapi -n LocalAIApi
```
A simple project structure might look like this:

```
LocalAIApi/
  Controllers/
  Services/
  Models/
  Program.cs
```
Keeping AI logic separated into services helps maintain clean architecture.
Step 3: Calling the Ollama API from .NET
Ollama exposes a simple endpoint for generating responses.
```
POST http://localhost:11434/api/generate
```
Example request payload:
```json
{
  "model": "llama3",
  "prompt": "Explain dependency injection in ASP.NET Core",
  "stream": false
}
```
Example .NET Call
```csharp
public async Task<string> GenerateAsync(
    string prompt,
    CancellationToken cancellationToken)
{
    var request = new
    {
        model = "llama3",
        prompt,
        stream = false
    };

    var response = await _httpClient.PostAsJsonAsync(
        "api/generate",
        request,
        cancellationToken);

    response.EnsureSuccessStatusCode();

    var result = await response.Content
        .ReadFromJsonAsync<OllamaResponse>(cancellationToken: cancellationToken);

    return result?.Response ?? string.Empty;
}
```
Response model:

```csharp
using System.Text.Json.Serialization;

public class OllamaResponse
{
    // Ollama returns the field as lowercase "response"; System.Text.Json
    // is case-sensitive by default, so the mapping must be explicit.
    [JsonPropertyName("response")]
    public string Response { get; set; } = string.Empty;
}
```
Using a typed model instead of dynamic makes the code safer and easier to maintain.
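For reference, the non-streaming call returns a single JSON object with the generated text in its `response` field. Abridged example (the real payload also carries token counts and timing statistics):

```json
{
  "model": "llama3",
  "created_at": "2024-05-01T10:00:00Z",
  "response": "Dependency injection is...",
  "done": true
}
```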
Step 4: Creating an AI Service Layer
One design rule worth following:
Avoid putting AI logic directly inside controllers.
Instead, isolate it inside a service layer.
Service Interface
```csharp
public interface IAiService
{
    Task<string> GenerateAsync(string prompt, CancellationToken cancellationToken);
}
```
Implementation
```csharp
public class OllamaService : IAiService
{
    private readonly HttpClient _httpClient;

    public OllamaService(HttpClient httpClient)
    {
        _httpClient = httpClient;
    }

    public async Task<string> GenerateAsync(
        string prompt,
        CancellationToken cancellationToken)
    {
        var request = new
        {
            model = "llama3",
            prompt,
            stream = false
        };

        var response = await _httpClient.PostAsJsonAsync(
            "api/generate",
            request,
            cancellationToken);

        response.EnsureSuccessStatusCode();

        var result = await response.Content
            .ReadFromJsonAsync<OllamaResponse>(cancellationToken: cancellationToken);

        return result?.Response ?? string.Empty;
    }
}
```
Step 5: Registering the Service
In Program.cs:
```csharp
builder.Services.AddHttpClient<IAiService, OllamaService>(client =>
{
    client.BaseAddress = new Uri("http://localhost:11434");
    client.Timeout = TimeSpan.FromMinutes(2);
});
```
Using HttpClientFactory ensures efficient connection management.
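If you want retries and circuit breaking around these calls, the registration can be extended with the standard resilience handler. This is an optional sketch that assumes the Microsoft.Extensions.Http.Resilience NuGet package is installed; the handler's default timeouts are tuned for fast HTTP calls, so they are relaxed here for slow local inference:

```csharp
builder.Services.AddHttpClient<IAiService, OllamaService>(client =>
{
    client.BaseAddress = new Uri("http://localhost:11434");
})
.AddStandardResilienceHandler(options =>
{
    // Local inference can legitimately take minutes on CPU-only machines.
    options.AttemptTimeout.Timeout = TimeSpan.FromMinutes(2);
    options.TotalRequestTimeout.Timeout = TimeSpan.FromMinutes(10);
    // Sampling duration must be at least twice the attempt timeout.
    options.CircuitBreaker.SamplingDuration = TimeSpan.FromMinutes(5);
});
```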
Step 6: Building an AI Endpoint
Now expose the AI functionality through a controller.
Request model:
```csharp
public class AiRequest
{
    public string Prompt { get; set; } = string.Empty;
}
```
Controller:
[ApiController]
[Route("api/ai")]
public class AiController : ControllerBase
{
private readonly IAiService _aiService;
public AiController(IAiService aiService)
{
_aiService = aiService;
}
[HttpPost("generate")]
public async Task<IActionResult> Generate(
AiRequest request,
CancellationToken cancellationToken)
{
if (string.IsNullOrWhiteSpace(request.Prompt))
return BadRequest("Prompt is required.");
var result = await _aiService.GenerateAsync(
request.Prompt,
cancellationToken);
return Ok(new { response = result });
}
}
Your ASP.NET Core API now exposes a local AI-powered endpoint.
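You can exercise the endpoint with a quick request. The port below is an assumption for illustration; check launchSettings.json for the one your project actually uses:

```bash
curl -s http://localhost:5000/api/ai/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Summarize what dependency injection is."}'
```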
Step 7: Switching Models
One of the nicest things about Ollama is how easy it is to switch between models.
For example:
```bash
ollama pull mistral
ollama pull codellama
```
Then update the request:
```json
{
  "model": "mistral"
}
```
In practice, testing a few models usually produces better results than simply choosing the largest one.
Step 8: Improving Performance
Local models can still be resource intensive.
A few practical optimizations can help significantly.
Use Streaming
Streaming responses improves perceived latency for longer outputs.
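With `"stream": true`, Ollama returns newline-delimited JSON, where each line carries a fragment of the reply in its `response` field. A minimal sketch of stitching those fragments together, using a hard-coded sample string in place of a live HTTP stream:

```csharp
using System.Text;
using System.Text.Json;

// Sample of what a streamed reply looks like: one JSON object per line,
// with the final line marked done. Stands in for a real response stream.
string ndjson =
    "{\"response\":\"Hel\",\"done\":false}\n" +
    "{\"response\":\"lo!\",\"done\":false}\n" +
    "{\"response\":\"\",\"done\":true}\n";

var fullText = new StringBuilder();
foreach (var line in ndjson.Split('\n', StringSplitOptions.RemoveEmptyEntries))
{
    using var doc = JsonDocument.Parse(line);
    fullText.Append(doc.RootElement.GetProperty("response").GetString());
}

Console.WriteLine(fullText.ToString()); // prints "Hello!"
```

In a real service you would read the response via `ReadAsStreamAsync` and a `StreamReader` (sending the request with `HttpCompletionOption.ResponseHeadersRead`), parsing and forwarding each line as it arrives.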
Reduce Prompt Size
Large prompts increase inference time.
Send only the context that the model truly needs.
Cache Repeated Requests
If prompts repeat frequently, caching responses can reduce compute usage.
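A minimal in-memory sketch of that idea, with a stand-in generator instead of a real Ollama call. Note this is only safe when reusing an earlier answer for an identical prompt is acceptable:

```csharp
using System.Collections.Concurrent;

var cache = new ConcurrentDictionary<string, string>();
int modelCalls = 0;

// Returns a cached answer when the exact prompt was seen before;
// the factory lambda stands in for a real call to Ollama.
string Generate(string prompt) =>
    cache.GetOrAdd(prompt, p =>
    {
        modelCalls++; // a real implementation would call Ollama here
        return $"answer:{p}";
    });

Console.WriteLine(Generate("summarize logs")); // first call hits the "model"
Console.WriteLine(Generate("summarize logs")); // second call is served from cache
Console.WriteLine(modelCalls);                 // prints 1
```

In ASP.NET Core, `IMemoryCache` with an expiration policy is a more natural fit than a bare dictionary, since it bounds memory and lets stale entries expire.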
Keep Calls Asynchronous
Always use async APIs when calling models to keep your backend scalable.
When Local Models Are Not the Best Choice
Local models are powerful, but they are not always the right solution.
For example, cloud AI services may still be better when:
- you need extremely large models
- you require massive scaling
- GPU infrastructure is unavailable
- inference workloads are very high
In practice, many teams use a hybrid approach, combining cloud models and local models depending on the use case.
Final Thoughts
The local AI ecosystem is moving incredibly fast right now.
Just a few years ago, running large language models required specialized machine learning environments. Today tools like Ollama make it possible for everyday backend developers to experiment with local LLMs using familiar technologies.
From a .NET perspective, integrating Ollama is actually much simpler than it looks at first.
Instead of relying entirely on external APIs, you can build AI-powered systems that are:
- private
- cost-efficient
- low latency
- fully controlled by your infrastructure
For internal tools, developer assistants, and AI-powered APIs, local models are quickly becoming a practical and powerful alternative to cloud AI services.
A Few Things Worth Remembering
- Ollama makes running local LLMs simple
- .NET applications can interact with Ollama using HTTP APIs
- Use a dedicated AI service layer to keep architecture clean
- Choose models based on your use case, not just size
- Optimize prompts and responses for better performance