Most developers who start experimenting with AI tend to follow the same path.
You integrate a cloud AI API into your application. The prototype works beautifully. Responses are fast, integration is simple, and everything feels almost magical.
Then the production questions start appearing.
How much will this cost at scale?
Do we really want sensitive data leaving our infrastructure?
What happens if the API rate limits us?
And the big one many developers eventually ask:
Can we run AI models locally instead?
The answer is yes. And tools like Ollama make it much easier than most developers expect.
Ollama allows you to run powerful language models directly on your machine and access them through a simple HTTP API. This means you can integrate local AI into ASP.NET Core APIs, background services, or internal tools without relying on external providers.
In this guide we will walk through:
- why local AI models are becoming popular
- how Ollama works
- how to run models locally
- how to call Ollama from a .NET application
- how to build a simple AI-powered ASP.NET Core endpoint
If you are a .NET developer curious about integrating AI without depending entirely on cloud APIs, this is a great place to start.
Why Run AI Models Locally?
Cloud AI APIs are extremely powerful, but they are not always the best solution for every scenario.
Running models locally offers a few advantages that become very attractive in production environments.
No API Usage Costs
Most cloud AI providers charge based on tokens or requests.
That works well for prototypes. But once usage grows, costs can scale quickly.
Running models locally removes the per-request cost entirely, which makes a big difference for internal tools or heavy workloads.
Full Data Privacy
Many enterprise systems process sensitive information such as:
- internal documentation
- support tickets
- logs
- customer records
Sending this data to external AI APIs can raise security and compliance concerns.
Local models keep everything inside your infrastructure.
Lower Latency
Cloud inference requires a network round trip.
Local inference removes that network round trip entirely, although raw inference speed still depends on your hardware.
For internal assistants, dashboards, or developer tooling, this often results in noticeably faster responses.
Offline AI Capabilities
Local models can run without internet access.
This is useful in environments like:
- secure enterprise networks
- air-gapped systems
- developer tools running locally
Ideal for Internal Tools
Local AI is especially useful for building tools such as:
- internal chat assistants
- log summarization tools
- documentation search
- developer copilots
- AI-powered dashboards
This is exactly the kind of scenario where Ollama shines.
What is Ollama?
Ollama is a tool that lets developers run large language models (LLMs) locally with minimal setup.
Instead of manually managing model weights, runtime environments, and inference servers, Ollama handles the heavy lifting.
It manages:
- model downloads
- model execution
- memory handling
- inference APIs
Once installed, Ollama exposes a local HTTP API.
That means any language capable of making HTTP requests can interact with it, including:
- C#
- Python
- JavaScript
- Go
- Java
For backend developers, this makes integration extremely straightforward.
Supported Models
Ollama supports many popular open-source models, including:
- Llama 3
- Mistral
- Gemma
- Code Llama
- various other community models
Different models are optimized for different tasks.
For example:
| Model | Best Use Case |
|---|---|
| llama3 | General AI tasks |
| mistral | Fast responses |
| codellama | Code generation |
| gemma | Lightweight inference |
One of the biggest advantages of Ollama is how easy it is to switch between models.
Before choosing a model, it's worth understanding the hardware requirements involved in running these models locally.
Resource Considerations
LLMs still consume significant system resources, even when they run locally.
One thing you'll quickly notice when running local models is that inference latency can vary significantly depending on hardware.
During development, it’s common for responses to take several seconds when running on CPU-only machines.
For internal tools this is usually acceptable, but it’s worth keeping in mind when designing user-facing APIs.
For example:

- Llama 3 8B models typically require 8–16 GB of RAM
- CPU inference works but may be slower
- GPUs significantly improve performance
For many internal tools, smaller or quantized models provide the best balance between performance and resource usage.
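As a sanity check on the figures above, a model's weight memory can be roughly estimated from its parameter count and quantization level. A minimal sketch (decimal gigabytes, ignoring the KV cache and other runtime overhead, which add more on top):

```csharp
// Rough rule of thumb: weights take about
// (parameter count × bits per weight) / 8 bytes.
double ParamsToGb(double parameters, double bitsPerWeight) =>
    parameters * bitsPerWeight / 8 / 1e9;

Console.WriteLine($"8B at fp16: ~{ParamsToGb(8e9, 16):F1} GB");      // roughly 16 GB
Console.WriteLine($"8B at 4-bit quantized: ~{ParamsToGb(8e9, 4):F1} GB"); // roughly 4 GB
```

This is why 4-bit quantized variants are popular for CPU-only machines: the same model fits in roughly a quarter of the memory.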
Architecture of a .NET Application Using Ollama
Before writing code, it helps to understand where Ollama fits into the architecture.
A typical integration looks like this:

```
Client
   ↓
ASP.NET Core API
   ↓
AI Service Layer
   ↓
Ollama Local API
   ↓
Local LLM Model
```
Request flow:
- Client sends a request to the ASP.NET Core API
- The API calls an AI service layer
- The service sends a prompt to Ollama
- Ollama runs the model locally
- The generated response returns to the client
This separation keeps the architecture clean and maintainable.
Step 1: Installing Ollama
First, install Ollama on your machine.
Go to the official Ollama website and download the installer for your operating system. Ollama currently supports:
- macOS
- Linux
- Windows
Run the installer and complete the setup.
After installation, open a terminal or command prompt and verify that Ollama is installed correctly by running:
```bash
ollama --version
```
If Ollama is installed properly, you should see the installed version printed in the terminal.
Download a Model
Next, download a language model that Ollama will run locally.
For this guide, we will use Llama 3.
Run the following command:
```bash
ollama pull llama3
```
This command downloads the model weights and prepares them for local inference.
Depending on your internet speed, the download may take a few minutes, since these models are several gigabytes in size.
Run the Model
Once the model is downloaded, you can start it using:
```bash
ollama run llama3
```
Ollama will load the model and open an interactive prompt.
Try entering a simple question:
Explain what REST APIs are
If the model responds with an answer, your local AI environment is working correctly.
Local API Endpoint
Behind the scenes, Ollama also exposes an HTTP API that applications can call.
By default, the API runs at:
http://localhost:11434
This is the endpoint your ASP.NET Core application will communicate with.
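Before writing any C#, you can confirm the endpoint is reachable with a quick request (this assumes the Ollama server is already running, which the desktop install handles automatically; otherwise start it with `ollama serve`):

```bash
curl -s http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "prompt": "Say hello in one sentence.", "stream": false}'
```

If this returns a JSON payload, your .NET integration will work against the same URL.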
Step 2: Creating a .NET 8 Web API
Next create a new ASP.NET Core API project.
```bash
dotnet new webapi -n LocalAIApi
```
A simple project structure might look like this:

```
LocalAIApi/
  Controllers/
  Services/
  Models/
  Program.cs
```
Keeping AI logic separated into services helps maintain clean architecture.
Step 3: Calling the Ollama API from .NET
Ollama exposes a simple endpoint for generating responses.
```
POST http://localhost:11434/api/generate
```
Example request payload:
```json
{
  "model": "llama3",
  "prompt": "Explain dependency injection in ASP.NET Core",
  "stream": false
}
```
Example .NET Call
```csharp
public async Task<string> GenerateAsync(
    string prompt,
    CancellationToken cancellationToken)
{
    var request = new
    {
        model = "llama3",
        prompt,
        stream = false
    };

    var response = await _httpClient.PostAsJsonAsync(
        "api/generate",
        request,
        cancellationToken);

    response.EnsureSuccessStatusCode();

    var result = await response.Content
        .ReadFromJsonAsync<OllamaResponse>(cancellationToken: cancellationToken);

    return result?.Response ?? string.Empty;
}
```
Response model:

```csharp
using System.Text.Json.Serialization;

public class OllamaResponse
{
    // Ollama returns the field as lowercase "response"; System.Text.Json
    // is case-sensitive by default, so the mapping must be explicit.
    [JsonPropertyName("response")]
    public string Response { get; set; } = string.Empty;
}
```
Using a typed model instead of dynamic makes the code safer and easier to maintain.
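For reference, the non-streaming call returns a single JSON object with the generated text in its `response` field. Abridged example (the real payload also carries token counts and timing statistics):

```json
{
  "model": "llama3",
  "created_at": "2024-05-01T10:00:00Z",
  "response": "Dependency injection is...",
  "done": true
}
```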
Step 4: Creating an AI Service Layer
One design rule worth following:
Avoid putting AI logic directly inside controllers.
Instead, isolate it inside a service layer.
Service Interface
```csharp
public interface IAiService
{
    Task<string> GenerateAsync(string prompt, CancellationToken cancellationToken);
}
```
Implementation
```csharp
public class OllamaService : IAiService
{
    private readonly HttpClient _httpClient;

    public OllamaService(HttpClient httpClient)
    {
        _httpClient = httpClient;
    }

    public async Task<string> GenerateAsync(
        string prompt,
        CancellationToken cancellationToken)
    {
        var request = new
        {
            model = "llama3",
            prompt,
            stream = false
        };

        var response = await _httpClient.PostAsJsonAsync(
            "api/generate",
            request,
            cancellationToken);

        response.EnsureSuccessStatusCode();

        var result = await response.Content
            .ReadFromJsonAsync<OllamaResponse>(cancellationToken: cancellationToken);

        return result?.Response ?? string.Empty;
    }
}
```
Step 5: Registering the Service
In Program.cs:
```csharp
builder.Services.AddHttpClient<IAiService, OllamaService>(client =>
{
    client.BaseAddress = new Uri("http://localhost:11434");
    client.Timeout = TimeSpan.FromMinutes(2);
});
```
Using HttpClientFactory ensures efficient connection management.
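If you want retries and circuit breaking around these calls, the registration can be extended with the standard resilience handler. This is an optional sketch that assumes the Microsoft.Extensions.Http.Resilience NuGet package is installed; the handler's default timeouts are tuned for fast HTTP calls, so they are relaxed here for slow local inference:

```csharp
builder.Services.AddHttpClient<IAiService, OllamaService>(client =>
{
    client.BaseAddress = new Uri("http://localhost:11434");
})
.AddStandardResilienceHandler(options =>
{
    // Local inference can legitimately take minutes on CPU-only machines.
    options.AttemptTimeout.Timeout = TimeSpan.FromMinutes(2);
    options.TotalRequestTimeout.Timeout = TimeSpan.FromMinutes(10);
    // Sampling duration must be at least twice the attempt timeout.
    options.CircuitBreaker.SamplingDuration = TimeSpan.FromMinutes(5);
});
```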
Step 6: Building an AI Endpoint
Now expose the AI functionality through a controller.
Request model:
```csharp
public class AiRequest
{
    public string Prompt { get; set; } = string.Empty;
}
```
Controller:
[ApiController]
[Route("api/ai")]
public class AiController : ControllerBase
{
private readonly IAiService _aiService;
public AiController(IAiService aiService)
{
_aiService = aiService;
}
[HttpPost("generate")]
public async Task<IActionResult> Generate(
AiRequest request,
CancellationToken cancellationToken)
{
if (string.IsNullOrWhiteSpace(request.Prompt))
return BadRequest("Prompt is required.");
var result = await _aiService.GenerateAsync(
request.Prompt,
cancellationToken);
return Ok(new { response = result });
}
}
Your ASP.NET Core API now exposes a local AI-powered endpoint.
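You can exercise the endpoint with a quick request. The port below is an assumption for illustration; check launchSettings.json for the one your project actually uses:

```bash
curl -s http://localhost:5000/api/ai/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Summarize what dependency injection is."}'
```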
Step 7: Switching Models
One of the nicest things about Ollama is how easy it is to switch between models.
For example:
```bash
ollama pull mistral
ollama pull codellama
```
Then update the request:
```json
{
  "model": "mistral"
}
```
In practice, testing a few models usually produces better results than simply choosing the largest one.
Step 8: Improving Performance
Local models can still be resource intensive.
A few practical optimizations can help significantly.
Use Streaming
Streaming responses improves perceived latency for longer outputs.
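With `"stream": true`, Ollama returns newline-delimited JSON, where each line carries a fragment of the reply in its `response` field. A minimal sketch of stitching those fragments together, using a hard-coded sample string in place of a live HTTP stream:

```csharp
using System.Text;
using System.Text.Json;

// Sample of what a streamed reply looks like: one JSON object per line,
// with the final line marked done. Stands in for a real response stream.
string ndjson =
    "{\"response\":\"Hel\",\"done\":false}\n" +
    "{\"response\":\"lo!\",\"done\":false}\n" +
    "{\"response\":\"\",\"done\":true}\n";

var fullText = new StringBuilder();
foreach (var line in ndjson.Split('\n', StringSplitOptions.RemoveEmptyEntries))
{
    using var doc = JsonDocument.Parse(line);
    fullText.Append(doc.RootElement.GetProperty("response").GetString());
}

Console.WriteLine(fullText.ToString()); // prints "Hello!"
```

In a real service you would read the response via `ReadAsStreamAsync` and a `StreamReader` (sending the request with `HttpCompletionOption.ResponseHeadersRead`), parsing and forwarding each line as it arrives.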
Reduce Prompt Size
Large prompts increase inference time.
Send only the context that the model truly needs.
Cache Repeated Requests
If prompts repeat frequently, caching responses can reduce compute usage.
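A minimal in-memory sketch of that idea, with a stand-in generator instead of a real Ollama call. Note this is only safe when reusing an earlier answer for an identical prompt is acceptable:

```csharp
using System.Collections.Concurrent;

var cache = new ConcurrentDictionary<string, string>();
int modelCalls = 0;

// Returns a cached answer when the exact prompt was seen before;
// the factory lambda stands in for a real call to Ollama.
string Generate(string prompt) =>
    cache.GetOrAdd(prompt, p =>
    {
        modelCalls++; // a real implementation would call Ollama here
        return $"answer:{p}";
    });

Console.WriteLine(Generate("summarize logs")); // first call hits the "model"
Console.WriteLine(Generate("summarize logs")); // second call is served from cache
Console.WriteLine(modelCalls);                 // prints 1
```

In ASP.NET Core, `IMemoryCache` with an expiration policy is a more natural fit than a bare dictionary, since it bounds memory and lets stale entries expire.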
Keep Calls Asynchronous
Always use async APIs when calling models to keep your backend scalable.
When Local Models Are Not the Best Choice
Local models are powerful, but they are not always the right solution.
For example, cloud AI services may still be better when:
- you need extremely large models
- you require massive scaling
- GPU infrastructure is unavailable
- inference workloads are very high
In practice, many teams use a hybrid approach, combining cloud models and local models depending on the use case.
Final Thoughts
The local AI ecosystem is moving incredibly fast right now.
Just a few years ago, running large language models required specialized machine learning environments. Today tools like Ollama make it possible for everyday backend developers to experiment with local LLMs using familiar technologies.
From a .NET perspective, integrating Ollama is actually much simpler than it looks at first.
Instead of relying entirely on external APIs, you can build AI-powered systems that are:
- private
- cost-efficient
- low latency
- fully controlled by your infrastructure
For internal tools, developer assistants, and AI-powered APIs, local models are quickly becoming a practical and powerful alternative to cloud AI services.
A Few Things Worth Remembering
- Ollama makes running local LLMs simple
- .NET applications can interact with Ollama using HTTP APIs
- Use a dedicated AI service layer to keep architecture clean
- Choose models based on your use case, not just size
- Optimize prompts and responses for better performance