I Found a Python Package That Runs Local LLMs With One pip install
Most local AI setups look something like this:
1. Install Ollama
2. Pull a model
3. Start the service
4. Configure everything
5. Write code
After doing this across multiple projects, I started wondering:
Why does every application need to know how to run an LLM?
Why should every app handle:
- model selection
- context storage
- session management
- fallback logic
- tool calling
- backend switching
That's when I came across freeaiagent.
And the architecture immediately caught my attention.
The Core Idea
Instead of embedding AI logic into every application, freeaiagent runs as a local HTTP service.
Your applications simply call it.
Your Apps
    |
    v
localhost:7731
    |
    v
freeaiagent
  ├─ Router
  ├─ Context
  ├─ Fallback Chain
  └─ Tool Calling
    |
    +--> Local Model
    +--> Ollama
    +--> Groq
    +--> Gemini
    +--> OpenRouter
This means:
- Flask apps
- Django apps
- FastAPI services
- CLI tools
- Automation scripts
all share the same AI service.
Installation
pip install freeaiagent
Download a local model:
freeaiagent pull
Start the service:
freeaiagent start
Done.
The server starts at:
http://localhost:7731
There is also a built-in Chat UI:
http://localhost:7731/ui
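Before pointing an app at the service, it can help to confirm that something is actually listening on the port. Here is a minimal sketch using only the standard library. The helper name `service_is_up` is my own, and it only checks that a TCP connection succeeds. It does not verify that the process on the other end is freeaiagent.

```python
import socket


def service_is_up(host="localhost", port=7731, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `service_is_up()` right after `freeaiagent start` is a cheap guard to put in front of the first request from a script.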
No Ollama Required
This was the part that surprised me.
The package uses llamafile underneath and automatically downloads and runs local GGUF models.
So you get:
✅ Local models
✅ Offline inference
✅ No API key
✅ No separate runtime installation
Supported local models include:
- Llama 3.2 1B
- Llama 3.2 3B
- Phi-3 Mini
- Gemma 2B
- Qwen 2.5 7B
- Llama 3.1 8B
- Qwen 2.5 14B
Example:
freeaiagent pull qwen2.5-7b
freeaiagent config set default_model qwen2.5-7b
Any HuggingFace GGUF Model
Another feature I wasn't expecting:
freeaiagent search qwen2.5
Search public GGUF models.
Then pull one directly:
freeaiagent pull hf:bartowski/Qwen2.5-7B-Instruct-GGUF/Qwen2.5-7B-Instruct-Q4_K_M.gguf
No extra tooling required.
The Built-In Fallback Chain
One thing every AI application eventually needs is reliability.
freeaiagent has automatic backend fallback:
{
  "fallback_order": [
    "llamafile",
    "ollama",
    "groq"
  ]
}
If the current backend fails, the request moves to the next entry in the list:
- local llamafile unavailable → try Ollama
- Ollama unavailable → try Groq
- Groq unavailable → try the next backend, if one is configured
Your application keeps working.
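To make the idea concrete, here is a rough sketch of what an ordered fallback chain looks like in plain Python. This is not freeaiagent's implementation, which I haven't read. The names `call_with_fallback`, `AllBackendsFailed`, and the stub backends are hypothetical and only illustrate the pattern.

```python
class AllBackendsFailed(RuntimeError):
    """Raised when every backend in the chain has failed."""


def call_with_fallback(prompt, backends):
    """Try each (name, callable) pair in order and return the first success."""
    errors = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:  # any failure moves on to the next backend
            errors.append(f"{name}: {exc}")
    raise AllBackendsFailed("; ".join(errors))


# Hypothetical stand-ins for real backends, used only for illustration.
def unavailable(prompt):
    raise ConnectionError("connection refused")


def fake_groq(prompt):
    return f"echo: {prompt}"


chain = [("llamafile", unavailable), ("ollama", unavailable), ("groq", fake_groq)]
used, reply = call_with_fallback("hi", chain)
```

With the two local stubs failing, `used` ends up as `"groq"`. The point of the service is that this loop lives in one place instead of in every app.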
Calling It From Python
The integration is intentionally simple.
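import json
import urllib.request

# Build a JSON POST request for the local /chat endpoint.
req = urllib.request.Request(
    "http://localhost:7731/chat",
    data=json.dumps({
        "message": "Explain vector databases"
    }).encode(),
    headers={
        "Content-Type": "application/json"
    },
)

# Send the request and decode the JSON reply; the with block closes the connection.
with urllib.request.urlopen(req) as resp:
    response = json.loads(resp.read())

print(response["response"])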
No SDK required.
No OpenAI client.
No LangChain.
Just HTTP.
Per-App Context
A nice touch:
headers={
    "X-Caller-ID": "my-app"
}
Every application automatically gets its own conversation history.
Context is stored in SQLite.
No custom session layer required.
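Putting the header together with the earlier request, a small helper keeps the caller ID in one place. This is a sketch under the assumption that `/chat` accepts the `{"message": ...}` body and returns a `"response"` field as shown above. The function names `build_chat_request` and `ask` are my own.

```python
import json
import urllib.request


def build_chat_request(message, caller_id, base_url="http://localhost:7731"):
    """Build a POST request for /chat tagged with an X-Caller-ID header."""
    return urllib.request.Request(
        f"{base_url}/chat",
        data=json.dumps({"message": message}).encode(),
        headers={
            "Content-Type": "application/json",
            "X-Caller-ID": caller_id,  # selects this app's conversation history
        },
        method="POST",
    )


def ask(message, caller_id):
    """Send the request and return the reply text (needs the service running)."""
    with urllib.request.urlopen(build_chat_request(message, caller_id)) as resp:
        return json.loads(resp.read())["response"]
```

Two scripts calling `ask(..., "billing-bot")` and `ask(..., "docs-search")` should then see separate histories, if the per-caller behavior works as described.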
Streaming
Token streaming is available through:
POST /chat/stream
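Example (assuming the endpoint accepts the same JSON body as /chat):
curl -N -X POST \
  -H "Content-Type: application/json" \
  -d '{"message": "Explain vector databases"}' \
  http://localhost:7731/chat/stream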
Responses are streamed via Server-Sent Events (SSE).
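Consuming the stream from Python needs a little SSE parsing. The sketch below handles the generic SSE wire format (`data:` lines, blank line between events, `:` comment lines). What freeaiagent puts inside each `data:` field, and whether `/chat/stream` takes the same JSON body as `/chat`, are assumptions on my part. Both function names are hypothetical.

```python
import json
import urllib.request


def iter_sse_data(lines):
    """Yield the data payload of each SSE event from an iterable of byte lines."""
    buffer = []
    for raw in lines:
        line = raw.decode("utf-8").rstrip("\r\n")
        if line == "":  # a blank line ends the current event
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):  # comment or keep-alive line
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):  # a single leading space is not part of the value
            value = value[1:]
        if field == "data":
            buffer.append(value)
    if buffer:  # lenient: flush an event the stream did not terminate
        yield "\n".join(buffer)


def stream_chat(message, base_url="http://localhost:7731"):
    """Yield streamed payloads from /chat/stream (request body shape is assumed)."""
    req = urllib.request.Request(
        f"{base_url}/chat/stream",
        data=json.dumps({"message": message}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The HTTP response object iterates line by line as bytes arrive.
        yield from iter_sse_data(resp)
```

Something like `for chunk in stream_chat("Explain vector databases"): print(chunk, end="", flush=True)` would then print output as it arrives.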
Tool Calling
Register an HTTP endpoint:
POST /tools/register
Then enable tools:
{
  "message": "What's the weather in Paris?",
  "tools": true
}
The model can call your API endpoint and use the result in its response.
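For a sense of what "your API endpoint" could be, here is a tiny weather endpoint built on the standard library. Everything in it is hypothetical: the port, the `city` query parameter, the response shape, and the canned data. I've left out the body for `/tools/register` because its schema isn't shown here, and I'd rather not guess it.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Canned data standing in for a real weather source.
FAKE_WEATHER = {"paris": {"temp_c": 18, "conditions": "cloudy"}}


def lookup_weather(city):
    """Return canned weather for a city, or an error object if unknown."""
    return FAKE_WEATHER.get(city.lower(), {"error": f"no data for {city}"})


class WeatherToolHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Read ?city=... from the URL and answer with JSON.
        query = parse_qs(urlparse(self.path).query)
        city = query.get("city", [""])[0]
        body = json.dumps(lookup_weather(city)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def serve(port=8001):
    """Run the tool endpoint until interrupted."""
    HTTPServer(("127.0.0.1", port), WeatherToolHandler).serve_forever()
```

Running `serve()` and opening `http://127.0.0.1:8001/?city=Paris` returns the canned JSON. That URL is the sort of endpoint you would then register as a tool.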
Supported Backends
Local:
- llamafile
- Ollama
- LM Studio
- Jan
- LocalAI
Cloud:
- Groq
- Gemini
- OpenRouter
- Together AI
- Cerebras
Switching providers doesn't require application changes.
Why I Think This Is Interesting
Most AI tooling focuses on models.
This package focuses on architecture.
Instead of every project reimplementing:
- prompts
- memory
- model management
- routing
- fallbacks
it centralizes those concerns into a single local service.
The result feels closer to how we use databases, Redis, or Elasticsearch:
run a service once and let every application use it.
That's a surprisingly clean approach.
Try It
pip install freeaiagent
freeaiagent pull
freeaiagent start
A few minutes later you'll have:
- Local AI
- HTTP API
- Chat UI
- Persistent memory
- Tool calling
- Automatic fallbacks
running entirely on your machine.
I'd be curious to hear how others are handling local AI infrastructure and whether you're embedding LLM logic directly into applications or using a service layer like this.