Originally published at hafiz.dev
API costs add up fast during AI development. Prompt an agent 50 times while debugging a tool and that's 50 API calls. Run your test suite and that's another batch. Multiply that across a team and you're spending real money before you ship anything.
Ollama solves this cleanly. It runs open-source models locally on your machine (Llama 3, Qwen, Mistral, and dozens more) and the Laravel AI SDK treats it as a first-party provider, exactly like OpenAI or Anthropic. Switch between them with a single environment variable. No code changes, no new packages, no API keys.
This post covers the full setup: installing Ollama, configuring it in the Laravel AI SDK, building agents that run locally, and the dev/production workflow that lets you use Ollama locally while shipping with a cloud provider.
What Ollama Does
Ollama is a lightweight tool that downloads and serves open-source language models locally. Once it's running, it exposes an HTTP API on localhost:11434 that the Laravel AI SDK connects to directly.
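To see what that HTTP API looks like on the wire, here is a minimal sketch of a request body for Ollama's native /api/chat endpoint. It uses the model, messages, and stream fields from Ollama's API documentation. The helper name buildOllamaChatPayload is hypothetical. This is not the SDK's internal request code; it is only useful for poking at the server directly while debugging.

```php
<?php

/**
 * Hypothetical helper: builds the JSON body for a POST to Ollama's /api/chat endpoint.
 * Field names (model, messages, stream) follow Ollama's documented chat API.
 */
function buildOllamaChatPayload(string $model, string $system, string $user): string
{
    return json_encode([
        'model' => $model,
        'messages' => [
            ['role' => 'system', 'content' => $system],
            ['role' => 'user', 'content' => $user],
        ],
        // Ask for one complete JSON response instead of a stream of chunks.
        'stream' => false,
    ], JSON_THROW_ON_ERROR);
}

$body = buildOllamaChatPayload(
    'llama3.2',
    'You are a helpful support agent.',
    'How do I reset my password?'
);

echo $body, PHP_EOL;
```

You can POST that body to http://localhost:11434/api/chat with curl or Laravel's Http client. The response is a single JSON object with the reply text under message.content, which is handy for checking what a model says outside the SDK.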
There's no internet connection required after the initial model download. No rate limits. No costs per token. If you've built your app with the Laravel AI SDK smart assistant tutorial, your existing agents can switch to Ollama with a one-line change.
The tradeoff is hardware. Larger models need more RAM and a capable GPU to run at acceptable speeds. But for development, smaller models like Llama 3.2 3B (llama3.2:3b) run well on any modern developer machine.
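A rough way to reason about that tradeoff: a model's weights take about parameter count × bits per weight ÷ 8 bytes. The sketch below makes three assumptions:

- about 4.5 bits per weight, in the range of the 4-bit quantized builds many Ollama tags ship by default;
- a 20% allowance for context and runtime overhead;
- nominal parameter counts for each model.

The function name is hypothetical. Some tags use higher precision, so check ollama show <model> for the real quantization.

```php
<?php

/**
 * Ballpark memory estimate in GB for running a quantized model locally.
 * Assumptions (not Ollama figures): ~4.5 bits per weight, plus 20% for the
 * context window and runtime overhead.
 */
function estimateModelMemoryGb(float $billionParams, float $bitsPerWeight = 4.5, float $overhead = 0.2): float
{
    // params × bits ÷ 8 = bytes; with params in billions the result is already in GB (10^9 bytes).
    $weightsGb = $billionParams * $bitsPerWeight / 8;

    return round($weightsGb * (1 + $overhead), 1);
}

// Nominal parameter counts, not exact ones.
foreach (['llama3.2:3b' => 3.0, 'qwen2.5-coder:7b' => 7.0, 'llama3.1:8b' => 8.0] as $tag => $billionParams) {
    printf("%-18s ~%.1f GB%s", $tag, estimateModelMemoryGb($billionParams), PHP_EOL);
}
```

With these assumptions the 3B model lands around 2GB and the 8B model around 5.4GB. That is why the 7B/8B class is comfortable on a 16GB machine but tight on 8GB once your editor, browser, and Docker are running.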
Installing Ollama
macOS:
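```bash
brew install ollama
# The Homebrew install doesn't start the server; run it as a background service
# (or run `ollama serve` in a terminal instead)
brew services start ollama
```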
Or download the macOS app from ollama.com, which installs as a menu bar app and starts automatically.
Linux:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
Windows:
Download the installer from ollama.com. Ollama runs as a background service after installation.
Once installed, verify it's running:
```bash
curl http://localhost:11434
# Should return: Ollama is running
```
Pulling Models
Download a model with ollama pull:
```bash
# General purpose (3B), runs on any machine with 4GB+ RAM
ollama pull llama3.2

# Smaller 1B version (about a 1.3GB download), good for constrained machines
ollama pull llama3.2:1b

# Strong at code-related tasks, good for Laravel AI agents
ollama pull qwen2.5-coder:7b

# Mistral 7B, fast and capable general model
ollama pull mistral
```
You can list all downloaded models:
```bash
ollama list
```
And test a model from the terminal before wiring it into Laravel:
```bash
ollama run llama3.2 "Explain Laravel service containers in one sentence"
```
Pull at least one model before you start wiring agents into Laravel; it saves debugging time later.
Configuring the Laravel AI SDK
The SDK ships with Ollama support out of the box. The only .env addition is:
```dotenv
OLLAMA_API_KEY=
```
Leave the value blank. Ollama doesn't require authentication for local use, but the SDK expects the variable to exist. Add it to your .env and .env.example.
If Ollama is running on the default port, that's all you need. If you've changed the port or are running Ollama on a remote machine, configure the URL in config/ai.php:
```php
'providers' => [
    // ... other providers

    'ollama' => [
        'driver' => 'ollama',
        'key' => env('OLLAMA_API_KEY', ''),
        'url' => env('OLLAMA_URL', 'http://localhost:11434/api'),
    ],
],
```
And in .env:
```dotenv
OLLAMA_URL=http://localhost:11434/api
```
That default (http://localhost:11434/api) already applies if you leave OLLAMA_URL out, so standard setups don't need this line.
Using Ollama in Your Agents
Two ways to route an agent to Ollama: set it as the default provider for a specific agent class, or override it per-prompt at runtime.
Per-Agent with PHP Attributes
Add #[Provider] and #[Model] attributes to your agent class:
```php
<?php

namespace App\Ai\Agents;

use Laravel\Ai\Attributes\Model;
use Laravel\Ai\Attributes\Provider;
use Laravel\Ai\Contracts\Agent;
use Laravel\Ai\Enums\Lab;
use Laravel\Ai\Promptable;

#[Provider(Lab::Ollama)]
#[Model('llama3.2')]
class SupportAgent implements Agent
{
    use Promptable;

    public function instructions(): string
    {
        return 'You are a helpful support agent. Answer questions about our product concisely.';
    }
}
```
Now every time you prompt this agent, it uses Ollama locally:
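```php
$response = SupportAgent::make()->prompt('How do I reset my password?');

// Casting the response to a string gives you the reply text, e.g. to return from a route or controller.
return (string) $response;
```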
This is the cleanest pattern for development. You write your agent once with Ollama attributes, then build and test locally with no API costs. Before deploying to production, change the attributes. The .env-driven default config won't override them, because #[Provider] and #[Model] take precedence over it.
Overriding Per-Prompt
For one-off local testing without modifying the agent class:
```php
use Laravel\Ai\Enums\Lab;

$response = SupportAgent::make()
    ->prompt('How do I reset my password?', provider: Lab::Ollama, model: 'llama3.2');
```
This is useful when you want to quickly compare responses between Ollama and a cloud provider without changing the agent configuration.
The Dev/Production Workflow
For environment-based switching, set a default provider at the application level in config/ai.php, driven by environment variables:
```php
'default' => [
    'text' => [
        'provider' => env('AI_PROVIDER', 'openai'),
        'model' => env('AI_MODEL', 'gpt-4o'),
    ],
],
```
Then in your local .env:
```dotenv
AI_PROVIDER=ollama
AI_MODEL=llama3.2
```
And in production .env (or your Forge/Vapor environment):
```dotenv
AI_PROVIDER=anthropic
AI_MODEL=claude-sonnet-4-5
```
Zero code changes between environments. Your agents, tools, and structured output stay identical. Only the provider changes. This works well for any agent that doesn't use PHP attribute overrides; those take precedence over the default config.
For agents with explicit #[Provider] attributes, you'd need to either remove the attributes or use a different approach for environment-based switching. The attribute approach is better for agents that should always use a specific provider (a code review agent that truly needs a smart model in all environments). The default config approach is better for general-purpose agents where Ollama in dev and a cloud model in prod makes sense.
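The precedence rules described here can be modelled in a few lines of plain PHP. A per-prompt argument beats a class attribute, which beats the config default read from AI_PROVIDER/AI_MODEL. This is a toy illustration, not the SDK's actual resolution code. Provider and Model are local stand-in attribute classes that take strings rather than Lab enums. The resolveProvider() function is hypothetical.

```php
<?php

// Local stand-ins for the SDK's #[Provider] and #[Model] attributes (strings instead of Lab enums).
#[Attribute(Attribute::TARGET_CLASS)]
final class Provider
{
    public function __construct(public string $name) {}
}

#[Attribute(Attribute::TARGET_CLASS)]
final class Model
{
    public function __construct(public string $name) {}
}

// One agent pinned via attributes, one that follows the environment default.
#[Provider('anthropic')]
#[Model('claude-sonnet-4-5')]
final class CodeReviewAgent {}

final class SupportAgent {}

/**
 * Hypothetical resolver: per-prompt arguments > class attributes > AI_PROVIDER / AI_MODEL defaults.
 *
 * @return array{0: string, 1: string} [provider, model]
 */
function resolveProvider(string $agentClass, ?string $provider = null, ?string $model = null): array
{
    $reflection = new ReflectionClass($agentClass);

    // Returns the attribute's name argument, or null when the class doesn't carry that attribute.
    $fromAttribute = static function (string $attributeClass) use ($reflection): ?string {
        $attributes = $reflection->getAttributes($attributeClass);

        return $attributes === [] ? null : $attributes[0]->newInstance()->name;
    };

    return [
        $provider ?? $fromAttribute(Provider::class) ?? (getenv('AI_PROVIDER') ?: 'openai'),
        $model ?? $fromAttribute(Model::class) ?? (getenv('AI_MODEL') ?: 'gpt-4o'),
    ];
}

// Simulate a local .env.
putenv('AI_PROVIDER=ollama');
putenv('AI_MODEL=llama3.2');

echo json_encode(resolveProvider(SupportAgent::class)), PHP_EOL;    // ["ollama","llama3.2"]
echo json_encode(resolveProvider(CodeReviewAgent::class)), PHP_EOL; // ["anthropic","claude-sonnet-4-5"]
echo json_encode(resolveProvider(CodeReviewAgent::class, 'ollama', 'qwen2.5-coder:7b')), PHP_EOL; // ["ollama","qwen2.5-coder:7b"]
```

The same ordering explains the advice above. A class carrying #[Provider] never sees AI_PROVIDER, so environment-based switching only reaches agents without the attribute.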
Which Models to Use
Not all models are equal, and the right choice depends on what your agent is doing. Here's a practical guide based on common Laravel AI SDK use cases.
Llama 3.2 3B is the safe default for most use cases: support agents, document summarisation, and general Q&A. It's the default llama3.2 tag and runs comfortably on any developer machine with 4GB RAM. Start here if you're not sure. If you need stronger instruction following, Llama 3.1 8B (llama3.1:8b) is noticeably better at complex instructions but needs around 8GB.
Qwen 2.5 Coder (7B) is the right choice for agents that work with code. It outperforms Llama on code generation and review tasks despite similar size. If you're building an agent that analyzes PHP files, generates migrations, or reviews code quality, use this one instead.
Mistral (7B) is fast and reliable for instruction-following tasks. If you need quick responses and the task isn't code-heavy, Mistral is worth trying alongside Llama 3.1 8B, which sits in the same size class.
Avoid very large models (30B+) for development. They're slow on typical developer machines and the speed penalty makes iteration painful. The quality gap between 7B and 30B matters less in development where you're primarily testing tool calls and output format, not production response quality. Save the big models for your production cloud provider.
A practical setup for a Laravel SaaS would be: use llama3.1:8b for general agents and qwen2.5-coder:7b for any agent touching code. Both run on a 16GB machine without issues. If you're on an 8GB machine, use llama3.2:3b for everything and accept slightly weaker instruction following in exchange for speed.
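The recommendations above can be condensed into a small lookup. One use would be a hypothetical onboarding script that suggests an AI_MODEL value for a new developer's .env. The function name and thresholds (4GB and 16GB) just restate this section's guidance; neither the SDK nor Ollama provides anything like it.

```php
<?php

/**
 * Hypothetical helper restating this post's model guidance.
 * Returns an Ollama model tag suited to a machine with the given RAM.
 */
function recommendOllamaModel(int $ramGb, bool $touchesCode = false): string
{
    if ($ramGb >= 16) {
        // Enough headroom for the 7B/8B class: a code model for code agents, Llama 3.1 8B otherwise.
        return $touchesCode ? 'qwen2.5-coder:7b' : 'llama3.1:8b';
    }

    if ($ramGb >= 4) {
        // Smaller machines: the 3B model for everything.
        return 'llama3.2:3b';
    }

    // Very constrained machines: the 1B model.
    return 'llama3.2:1b';
}

echo recommendOllamaModel(16), PHP_EOL;       // llama3.1:8b
echo recommendOllamaModel(16, true), PHP_EOL; // qwen2.5-coder:7b
echo recommendOllamaModel(8, true), PHP_EOL;  // llama3.2:3b
```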
If you've already built a multi-agent system with the SDK, you can route different sub-agents to different Ollama models the same way you'd assign different cloud models. The AI SDK overview covers the broader SDK capabilities worth knowing before you dive into local model optimization.
Embeddings with Ollama
Ollama also works for local embeddings, which means you can do RAG development with zero API costs:
```php
use Laravel\Ai\Facades\Ai;
use Laravel\Ai\Enums\Lab;

$embedding = Ai::embed(
    'How do I cancel my subscription?',
    provider: Lab::Ollama,
    model: 'nomic-embed-text'
);
```
Pull the embedding model first:
```bash
ollama pull nomic-embed-text
```
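nomic-embed-text is a solid local embedding model that produces 768-dimension vectors. For production RAG you'd swap to OpenAI's text-embedding-3-small or a similar cloud model, but for building and testing your vector search logic, Ollama keeps costs at zero. Note that text-embedding-3-small returns 1536-dimension vectors by default, and vectors from different models aren't comparable. Plan to re-embed your stored documents (and resize your vector column) when you switch.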
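The vector search logic you'd be testing usually comes down to comparing a query embedding against stored ones. Below is a minimal cosine-similarity sketch in plain PHP. The function names are hypothetical, and the toy 3-dimension vectors stand in for real 768-dimension nomic-embed-text output. The dimension check guards against mixing vectors from two different embedding models.

```php
<?php

/**
 * Cosine similarity between two embedding vectors (1.0 = same direction).
 * Expects list arrays; throws if the dimensions differ (e.g. 768 vs 1536).
 *
 * @param float[] $a
 * @param float[] $b
 */
function cosineSimilarity(array $a, array $b): float
{
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException(sprintf(
            'Dimension mismatch: %d vs %d. Were these produced by different embedding models?',
            count($a),
            count($b)
        ));
    }

    $dot = $normA = $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }

    if ($normA === 0.0 || $normB === 0.0) {
        return 0.0; // a zero vector has no direction
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

/**
 * Rank stored documents by similarity to a query vector, highest first.
 *
 * @param float[] $query
 * @param array<string, float[]> $documents
 * @return array<string, float>
 */
function rankBySimilarity(array $query, array $documents): array
{
    $scores = array_map(fn (array $vector) => cosineSimilarity($query, $vector), $documents);
    arsort($scores);

    return $scores;
}

// Toy vectors standing in for real embeddings.
$ranked = rankBySimilarity([1.0, 0.0, 1.0], [
    'cancel-subscription' => [0.9, 0.1, 0.8],
    'reset-password' => [0.0, 1.0, 0.1],
]);

print_r($ranked);
```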
What Ollama Doesn't Support
The Laravel AI SDK's Ollama integration covers text generation and embeddings. It does not support image generation, text-to-speech, speech-to-text, or file uploads. If your agents use those capabilities, you'll need a cloud provider for those specific features.
This is usually fine for a dev/production split. Most agent logic (tools, structured output, conversation flow) doesn't depend on images or audio. You can run the core agent logic against Ollama locally, and the multimedia features only come into play in staging or production against cloud providers.
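One way to make that split explicit is to decide the provider per capability in one place. The sketch below is hypothetical: the function, the capability labels, and the 'openai' fallback are illustrative rather than SDK APIs. It only encodes the rule from this section: keep text and embeddings on Ollama and send everything else to a cloud provider.

```php
<?php

/**
 * Hypothetical router: keep text and embeddings on the default provider,
 * but send capabilities the Ollama integration doesn't cover to a cloud fallback.
 */
function providerForCapability(string $capability, string $defaultProvider, string $cloudFallback = 'openai'): string
{
    // Per this post, the SDK's Ollama integration covers text generation and embeddings only.
    $ollamaSupported = ['text', 'embeddings'];

    if ($defaultProvider === 'ollama' && !in_array($capability, $ollamaSupported, true)) {
        return $cloudFallback;
    }

    return $defaultProvider;
}

foreach (['text', 'embeddings', 'images', 'audio', 'transcription', 'files'] as $capability) {
    printf("%-14s -> %s%s", $capability, providerForCapability($capability, 'ollama'), PHP_EOL);
}
```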
Testing Agents That Use Ollama
One thing to be aware of: when running your test suite, you probably don't want tests making real Ollama calls any more than you'd want real OpenAI calls. The SDK's fake testing utilities work regardless of which provider is configured:
```php
use App\Ai\Agents\SupportAgent;

it('responds to password reset questions', function () {
    // Fake the agent so the test never calls Ollama (or any other provider).
    SupportAgent::fake([
        'To reset your password, visit the login page and click "Forgot password".',
    ]);

    $response = SupportAgent::make()->prompt('How do I reset my password?');

    expect((string) $response)->toContain('password');
});
```
Faking the agent response means your tests are fast, deterministic, and don't depend on Ollama being installed or running. The agent safety post covers more on keeping agent behavior predictable in tests.
The development workflow then becomes: build and iterate against real Ollama locally, run the test suite with faked responses, deploy with cloud providers in production.
Ollama on a Shared Dev Server
If your team uses a shared development server, you can run Ollama there and point everyone's local Laravel instances at it. Just update OLLAMA_URL in each developer's .env:
```dotenv
OLLAMA_URL=http://your-dev-server:11434/api
```
Make sure Ollama is configured to accept connections from outside localhost on the server:
```bash
OLLAMA_HOST=0.0.0.0 ollama serve
```
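This means one machine does the model serving and your team shares it, so not everyone needs to pull and run models locally. That's useful if some team members are on constrained hardware.

Ollama has no built-in authentication, so only expose that port on a trusted network. If Ollama runs as a systemd service on the server (the Linux install script sets it up that way), set OLLAMA_HOST in the service's environment rather than starting ollama serve by hand.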
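Before pointing your app at a shared server, confirm it's reachable from your machine. This minimal sketch reuses the check from the install step: the root URL replies "Ollama is running". The function name is hypothetical, and it assumes allow_url_fopen is enabled. It strips a trailing /api from an OLLAMA_URL-style value to reach that root URL.

```php
<?php

/**
 * Hypothetical reachability check against Ollama's root endpoint.
 * Accepts an OLLAMA_URL-style value (with or without a trailing /api).
 */
function ollamaIsReachable(string $ollamaUrl, float $timeoutSeconds = 2.0): bool
{
    // http://host:11434/api -> http://host:11434
    $root = preg_replace('#/api$#', '', rtrim($ollamaUrl, '/'));

    $context = stream_context_create(['http' => ['timeout' => $timeoutSeconds]]);

    // Suppress the warning PHP emits on connection failure; false means unreachable.
    $body = @file_get_contents($root, false, $context);

    return $body !== false && str_contains($body, 'Ollama is running');
}

$url = getenv('OLLAMA_URL') ?: 'http://localhost:11434/api';

echo ollamaIsReachable($url) ? "Ollama reachable at {$url}" : "Cannot reach Ollama at {$url}", PHP_EOL;
```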
FAQ
Does Ollama work with Laravel AI SDK agents that use tools?
Yes, but model quality matters more for tool use. Some smaller models handle tool calls inconsistently; Llama 3.1 8B is reliable for tool use. If you're seeing missed or malformed tool calls, try a larger or more capable model.
Can I use Ollama in production?
You can if you have dedicated server hardware with enough RAM and ideally a GPU. Most teams use Ollama for local development and testing, then cloud providers in production. The cost and maintenance overhead of running Ollama in production usually outweighs the savings unless you have high volume and a specific privacy requirement.
What's the difference between Ollama and running models via API?
With Ollama, the model runs on your machine. No data leaves your network. With cloud APIs (OpenAI, Anthropic), your prompts are sent to the provider's servers. For development involving sensitive or proprietary data, Ollama is the better choice.
Do I need a GPU?
No. Most models run on CPU, just more slowly, and for development iteration that's fine. Expect responses to take roughly 5-15 seconds depending on model size and hardware; a GPU can bring that under 2 seconds for 7B models.
Can I use Ollama with the sub-agents pattern?
Yes. Each sub-agent can have its own #[Provider(Lab::Ollama)] and #[Model] attributes. The sub-agents guide covers the full pattern; the Ollama attributes drop in without any other changes.
Start Locally
The setup comes down to four steps: install Ollama, pull a model, add OLLAMA_API_KEY= to your .env, and add #[Provider(Lab::Ollama)] to your agent class. After that, you're running AI locally with no API costs and no rate limits while you build.
In production, switch back to OpenAI or Anthropic by changing the provider attribute or your default config. The rest of your code stays exactly the same.
If you're setting this up for a team or have questions about the dev/production split, get in touch.