What Is Cloudflare Workers AI?
Cloudflare Workers AI is a serverless AI inference platform built on top of Cloudflare’s global edge network. Instead of sending your requests to a single data center, inference runs at the Cloudflare location closest to your users — across 300+ cities worldwide. The result: low-latency AI responses without managing GPU servers.
Cloudflare added Workers AI to its free tier in 2024, making it one of the few platforms where you can run real AI models — including text generation, image generation, speech transcription, and embeddings — at zero cost for reasonable workloads.
What’s Free on Cloudflare Workers AI
Workers AI uses a “neurons” unit to measure compute. On the free Workers plan:
- 10,000 neurons per day — free, no credit card required
- Standard tier models: most text, embedding, and classification models
- All 47+ models included — you pick which one to call per request
For reference, generating a ~500-token text response with Llama 3 costs roughly 400–600 neurons. That means around 15–25 text generation calls per day on the free tier — fine for prototyping, small side projects, or demos.
If you need more, the paid Workers plan costs $5/month and includes $5 in compute credits (with $0.011 per 1,000 neurons beyond that).
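To sanity-check those numbers, here is the arithmetic as a quick Python sketch; the per-call neuron figures are the rough estimates above, not exact pricing:
# Back-of-the-envelope budget math using the rough estimates above
FREE_NEURONS_PER_DAY = 10_000
NEURONS_PER_CALL_LOW, NEURONS_PER_CALL_HIGH = 400, 600  # ~500-token Llama 3 response

# Free tier: how many text generation calls fit in a day
print(FREE_NEURONS_PER_DAY // NEURONS_PER_CALL_HIGH)  # 16 calls at the costly end
print(FREE_NEURONS_PER_DAY // NEURONS_PER_CALL_LOW)   # 25 calls at the cheap end

# Paid overage at $0.011 per 1,000 neurons, e.g. 100,000 neurons past the allocation
extra_neurons = 100_000
print(f"${extra_neurons / 1_000 * 0.011:.2f} per day")  # $1.10 per day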
47+ Models: What You Can Actually Run
| Category | Models Available | Use Case |
|---|---|---|
| Text Generation | Llama 3.1 (8B, 70B), Llama 3.2, Mistral 7B, Phi-2, Qwen 1.5 | Chatbots, summarization, code generation |
| Text Embeddings | BAAI/bge-small-en-v1.5, bge-base-en-v1.5 | Semantic search, RAG pipelines |
| Image Generation | Stable Diffusion XL, dreamshaper-8-lcm | AI image creation, thumbnails |
| Image Classification | SqueezeNet, ResNet-50, MobileNetV2 | Image labeling, content moderation |
| Speech to Text | Whisper (tiny, base, large-v3-turbo) | Audio transcription, voice interfaces |
| Translation | M2M-100, Meta NLLB-200 | Multi-language support (200 languages) |
| Object Detection | DETR-ResNet-50 | Detect objects in images |
| Text Classification | distilbert-sst-2-int8 | Sentiment analysis |
Notable: the 70B Llama 3.1 model is available but only on the paid tier (it’s classified as a “GA” model with higher neuron costs). The 8B models run on the free tier without issues.
Getting Started: Your First Workers AI Request
You need a free Cloudflare account. No credit card required for the free tier.
Option 1: REST API (any language, instant)
Generate your API token from the Cloudflare dashboard with the “Workers AI” permission. Then call the API directly:
curl https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/meta/llama-3.1-8b-instruct \
-H "Authorization: Bearer {API_TOKEN}" \
-d '{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain edge computing in one paragraph."}
]
}'
Replace {ACCOUNT_ID} with your account ID (found in the Cloudflare dashboard sidebar) and {API_TOKEN} with your token.
Response format:
{
"result": {
"response": "Edge computing brings computation closer to data sources..."
},
"success": true
}
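If you would rather not pull in an SDK, the same endpoint works from plain Python with the requests library. A minimal sketch, assuming the environment variables described later in this article:
import os
import requests

ACCOUNT_ID = os.environ["CLOUDFLARE_ACCOUNT_ID"]
API_TOKEN = os.environ["CLOUDFLARE_API_TOKEN"]

# Same endpoint and response shape as the curl example above
url = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/meta/llama-3.1-8b-instruct"
resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"messages": [{"role": "user", "content": "Explain edge computing in one paragraph."}]},
)
data = resp.json()
if data["success"]:
    print(data["result"]["response"])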
Option 2: Python Client
pip install cloudflare
import os
from cloudflare import Cloudflare
client = Cloudflare(api_token=os.environ["CLOUDFLARE_API_TOKEN"])
response = client.workers.ai.run(
account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
model_name="@cf/meta/llama-3.1-8b-instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Cloudflare Workers AI?"}
]
)
print(response.response)
Option 3: Cloudflare Worker (JavaScript, deployed to the edge)
This is the native way — run inference inside a Worker function deployed globally:
export default {
async fetch(request, env) {
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'Summarize the concept of edge AI.' }
]
});
return new Response(JSON.stringify(response));
}
};
Deploy with the Wrangler CLI:
# Install Wrangler
npm install -g wrangler
# Login to Cloudflare
wrangler login
# Create a new Worker project
wrangler init my-ai-worker
# Add AI binding to wrangler.toml
# [ai]
# binding = "AI"
# Deploy
wrangler deploy
Image Generation with Stable Diffusion XL
Cloudflare Workers AI supports Stable Diffusion XL — free, at the edge, no GPU setup required:
import os
from cloudflare import Cloudflare
client = Cloudflare(api_token=os.environ["CLOUDFLARE_API_TOKEN"])
image_response = client.workers.ai.run(
account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
model_name="@cf/stabilityai/stable-diffusion-xl-base-1.0",
prompt="A futuristic city skyline at sunset, digital art",
num_steps=20
)
# image_response returns binary PNG data
with open("output.png", "wb") as f:
f.write(image_response)
print("Image saved to output.png")
Audio Transcription with Whisper
Transcribe audio files using Cloudflare’s hosted Whisper model:
import os
import base64
from cloudflare import Cloudflare
client = Cloudflare(api_token=os.environ["CLOUDFLARE_API_TOKEN"])
with open("audio.mp3", "rb") as f:
audio_bytes = f.read()
result = client.workers.ai.run(
account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
model_name="@cf/openai/whisper-large-v3-turbo",
audio=base64.b64encode(audio_bytes).decode("utf-8")  # this model expects base64-encoded audio
)
print(result.text) # Transcribed text
Build a Semantic Search Pipeline with Embeddings
Workers AI’s embedding models enable vector search — the foundation of RAG (Retrieval-Augmented Generation):
import os
import numpy as np
from cloudflare import Cloudflare
client = Cloudflare(api_token=os.environ["CLOUDFLARE_API_TOKEN"])
def get_embedding(text: str) -> list[float]:
result = client.workers.ai.run(
account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
model_name="@cf/baai/bge-base-en-v1.5",
text=[text]
)
return result.data[0]
def cosine_similarity(a, b):
a, b = np.array(a), np.array(b)
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Example: find most similar document
docs = [
"Cloudflare Workers AI runs models at the edge",
"PostgreSQL is an open-source relational database",
"Edge computing reduces latency by moving compute closer to users"
]
query = "What are the benefits of edge inference?"
query_embedding = get_embedding(query)
doc_embeddings = [get_embedding(doc) for doc in docs]
similarities = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]
best_match_idx = np.argmax(similarities)
print(f"Most relevant: {docs[best_match_idx]}")
print(f"Similarity: {similarities[best_match_idx]:.4f}")
Use Workers AI with OpenClaw for Edge AI Agents
Cloudflare Workers AI pairs well with OpenClaw for building AI agents that run globally. OpenClaw can orchestrate complex multi-step workflows — for example, using Workers AI for fast local inference while routing expensive tasks to other APIs.
A practical setup: an OpenClaw agent that classifies incoming requests using Workers AI’s lightweight classification models, then routes complex queries to a full-size LLM only when needed. This hybrid approach keeps costs down while maintaining quality.
import os
import requests
from cloudflare import Cloudflare
client = Cloudflare(api_token=os.environ["CLOUDFLARE_API_TOKEN"])
def classify_query(user_query: str) -> str:
"""Use Workers AI to classify if query is simple or complex."""
result = client.workers.ai.run(
account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
model_name="@cf/meta/llama-3.1-8b-instruct",
messages=[
{"role": "system", "content": "Classify the query as 'simple' or 'complex'. Reply with one word only."},
{"role": "user", "content": user_query}
],
max_tokens=10
)
return result.response.strip().lower()
def handle_query(user_query: str):
query_type = classify_query(user_query)
if query_type == "simple":
# Handle with Workers AI directly (free, fast)
result = client.workers.ai.run(
account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
model_name="@cf/meta/llama-3.1-8b-instruct",
messages=[{"role": "user", "content": user_query}]
)
return {"source": "workers-ai", "response": result.response}
else:
# Route to OpenClaw for complex agentic tasks
response = requests.post(
"https://api.openclaw.ai/v1/run",
headers={"Authorization": f"Bearer {os.environ['OPENCLAW_API_KEY']}"},
json={"task": user_query}
)
return {"source": "openclaw", "response": response.json()}
# Example usage
result = handle_query("What's 2+2?")
print(result)
Cloudflare Workers AI vs Other Free AI APIs
| Feature | Cloudflare Workers AI | Groq API | Google Gemini API | OpenRouter (free) |
|---|---|---|---|---|
| Free text generation | 10k neurons/day | 14,400 req/day | 1,500 req/day | 20 req/min (varies) |
| Image generation | Yes (SDXL) | No | Gemini Flash only | Some models |
| Speech transcription | Yes (Whisper) | Yes (Whisper) | No (free tier) | No |
| Embeddings | Yes (BGE models) | No | Yes | Some models |
| Edge deployment | Yes (300+ locations) | No | No | No |
| Model quality (text) | Llama 3.1 8B/70B | Llama 3.3 70B, Mixtral | Gemini 1.5 Flash | Various |
| Latency | Very low (edge) | Very low (custom chips) | Low | Varies by model |
| Credit card required | No | No | No | No |
The key advantage Cloudflare Workers AI has over every competitor: multi-modal free inference. Text, images, speech, embeddings, translation, and object detection — all in one API, all free within the daily neuron limit. Groq is faster for text, but it doesn’t do image generation. Gemini is more capable, but the free tier is text-only.
Limits and What to Watch Out For
- 10,000 neurons/day is not unlimited. For a production chatbot handling 100+ users/day, you’ll hit the cap. Use caching (Cloudflare KV) to avoid redundant calls on common queries (see the sketch after this list).
- Model availability changes. Cloudflare occasionally moves models from beta to GA (General Availability), which changes their neuron pricing. Check the models page before building.
- Context window is limited. Llama 3.1 8B on Workers AI has a 4,096 token context — smaller than what you’d get from the full model elsewhere.
- Image generation is slow. SDXL on Workers AI takes 10–30 seconds per image. It’s fine for async workflows, not for real-time generation.
- No fine-tuning. You can’t upload custom weights — you’re limited to Cloudflare’s hosted model catalog.
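On the first point, the caching idea looks like this in miniature. A deployed Worker would back the cache with Cloudflare KV; the in-process dict below is only a stand-in to show the pattern:
import os
import hashlib
from cloudflare import Cloudflare

client = Cloudflare(api_token=os.environ["CLOUDFLARE_API_TOKEN"])
_cache: dict[str, str] = {}  # stand-in for Cloudflare KV

def cached_generate(prompt: str) -> str:
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    if key in _cache:
        return _cache[key]  # cache hit: zero neurons spent
    result = client.workers.ai.run(
        account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
        model_name="@cf/meta/llama-3.1-8b-instruct",
        messages=[{"role": "user", "content": prompt}]
    )
    _cache[key] = result.response  # cache the answer for repeat queries
    return _cache[key]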
When to Choose Cloudflare Workers AI
Workers AI is the right choice when:
- You’re already using Cloudflare for DNS/CDN and want AI inference in the same stack
- You need multi-modal inference (text + image + speech) in a single free API
- Latency matters and your users are globally distributed
- You want to run AI at the edge inside a Cloudflare Worker function
- You need a free embedding API for a RAG prototype
Use a different API when:
- You need the fastest text generation (use Groq — it’s 3–5x faster)
- You need the most capable free model (use Google Gemini)
- You need more than 10,000 neurons/day without paying (use OpenRouter with free models)
- You’re not on the Cloudflare stack and don’t plan to be
Getting Your API Keys
- Create a free account at cloudflare.com
- Go to My Profile → API Tokens → Create Token
- Use the “Workers AI” template or manually select the Workers AI: Run permission
- Copy your Account ID from the dashboard homepage (right sidebar)
- Set environment variables: CLOUDFLARE_API_TOKEN and CLOUDFLARE_ACCOUNT_ID
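A quick way to confirm both variables are visible to Python before running any of the examples:
import os

for var in ("CLOUDFLARE_API_TOKEN", "CLOUDFLARE_ACCOUNT_ID"):
    assert os.environ.get(var), f"{var} is not set"
print("Credentials found")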
The entire setup takes under 5 minutes, and the free tier starts immediately — no waitlist, no approval required.
Related Reads
- Cohere Free API: The Best Free Embedding and Rerank API for RAG in 2026
- Groq vs Cerebras vs Gemini: Which Free AI API Is Actually Fastest in 2026?
- Cerebras Inference API: The Fastest Free AI API You’ve Never Heard Of
- Mistral AI Free API: Call Nemo and Mixtral for Free with Any OpenAI SDK
- GitHub Models: Free GPT-4o and Llama API for Every Developer
Final Thoughts
Cloudflare Workers AI is the most versatile free AI API available in 2026 by model type coverage. The 10,000 neurons/day free limit is modest for production use, but for developers building side projects, RAG pipelines, image generation tools, or audio transcription apps, it’s genuinely useful — especially when you don’t need to provide a credit card.
The edge inference angle is unique: no other free AI API runs your inference in 300+ global locations. For latency-sensitive applications — chatbots, real-time translation, interactive demos — that matters.
If you’re on Cloudflare already, Workers AI is the easiest way to add AI to your stack without leaving the platform. Get started for free at developers.cloudflare.com/workers-ai.
Originally published at toolfreebie.com.