Originally published on DevToolHub, where I keep this guide updated every time Ollama revises its limits.
Ollama Cloud is one of the most searched topics in the local AI space right now — and the number one question is always the same: what do you actually get on the free tier, and is Pro worth paying for?
This guide covers the plan limits, how usage is actually measured (it's not tokens), and when upgrading makes sense. All data is pulled from the official Ollama pricing page.
What Ollama Cloud is
Ollama Cloud is a managed inference service that runs large open-source models on Ollama's datacenter GPUs — no local GPU required. The key advantage: your existing local Ollama setup works identically with cloud models. No code rewrites, no new SDKs. Just point at a cloud model and run:
ollama run gpt-oss:120b-cloud
Same CLI, same OpenAI-compatible API, different hardware.
The three tiers
| Free | Pro | Max | |
|---|---|---|---|
| Price | $0 | $20/mo ($200/yr) | $100/mo |
| Cloud usage | Base quota | ~50x Free | Highest |
| Concurrent cloud models | Limited | 3 at a time | More <!-- CHECK exact number against your live post --> |
| Model access | Lighter cloud models | Full catalog | Full catalog + priority |
Running models on your own hardware is always unlimited — the plans only govern cloud usage.
How usage is actually measured (most posts get this wrong)
Ollama doesn't cap you at a fixed number of tokens or requests. Usage reflects actual utilization of their cloud infrastructure — primarily GPU time, which depends on model size and request duration. Two things follow from that:
- Limits reset on two clocks: session limits reset every 5 hours, weekly limits reset every 7 days.
-
Heavier models burn quota faster. Models are grouped into usage levels from level 1 (light models like
gpt-oss:20b) up to level 4 (extra-heavy models likedeepseek-v4-pro).
Practical tip: on the Free tier, stick to level 1 and level 2 models to stretch your quota. Shorter prompts and prompts that share cached context also consume less.
Concurrency and queueing
Requests beyond your plan's concurrency limit are queued and processed when a slot opens. The queue itself has a fixed depth — if it's full, requests are rejected until a slot frees up. This is the main reason production agent workloads end up on Max: it's about sustained concurrent access, not just raw quota.
Privacy
Prompt and response data is never logged or trained on, and Ollama requires zero-data-retention policies from its hosting partners. Worth knowing if you're considering cloud inference for work data.
So which tier should you pick?
- Free — genuinely useful for experimenting with large models you can't fit locally. Stay on level 1–2 models.
- Pro ($20/mo) — the right call for daily engineering work. Full catalog, 3 concurrent cloud models, enough quota that most individual developers never hit the wall.
- Max ($100/mo) — for production agent and RAG workloads that need sustained, concurrent access to the heaviest models.
And if you'd rather own the hardware: a GPU droplet running self-hosted Ollama flips the economics once your usage is steady — I break down that setup separately.
One warning
Ollama has revised its cloud quotas more than once since launch. I keep the original post on DevToolHub updated against the official pricing page every time the limits change — bookmark that one if you want current numbers.
I write hands-on DevOps and self-hosted AI guides at devtoolhub.com. Questions about your specific workload? Drop a comment.
Top comments (0)