toolfreebie

Posted on • Originally published at toolfreebie.com

Cloudflare Workers AI: Free Edge AI Inference with 47+ Models

What Is Cloudflare Workers AI?

Cloudflare Workers AI is a serverless AI inference platform built on top of Cloudflare’s global edge network. Instead of sending your requests to a single data center, inference runs at the Cloudflare location closest to your users — across 300+ cities worldwide. The result: low-latency AI responses without managing GPU servers.

Cloudflare added Workers AI to its free tier in 2024, making it one of the few platforms where you can run real AI models — including text generation, image generation, speech transcription, and embeddings — at zero cost for reasonable workloads.

What’s Free on Cloudflare Workers AI

Workers AI uses a “neurons” unit to measure compute. On the free Workers plan:

  • 10,000 neurons per day — free, no credit card required
  • Standard tier models: most text, embedding, and classification models
  • All 47+ models included — you pick which one to call per request

For reference, generating a ~500-token text response with Llama 3 costs roughly 400–600 neurons. That means around 15–25 text generation calls per day on the free tier — fine for prototyping, small side projects, or demos.

If you need more, the paid Workers plan costs $5/month and includes $5 in compute credits (with $0.011 per 1,000 neurons beyond that).
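To sanity-check these numbers yourself, here is the back-of-envelope math (note that the 400–600 neurons-per-response figure above is an estimate, not official pricing):

```python
# Rough free-tier budget math for Workers AI text generation.
# NEURONS_PER_RESPONSE is the article's estimate for a ~500-token
# Llama 3 reply, not an official Cloudflare price.
FREE_NEURONS_PER_DAY = 10_000
NEURONS_PER_RESPONSE = (400, 600)  # (best case, worst case)

low = FREE_NEURONS_PER_DAY // NEURONS_PER_RESPONSE[1]   # worst case
high = FREE_NEURONS_PER_DAY // NEURONS_PER_RESPONSE[0]  # best case
print(f"Free-tier text calls per day: roughly {low}-{high}")  # -> roughly 16-25

# Paid plan overage: $0.011 per 1,000 neurons beyond the included credit.
extra_neurons = 100_000
overage = extra_neurons / 1_000 * 0.011
print(f"Cost of {extra_neurons:,} extra neurons: ${overage:.2f}")  # -> $1.10
```

Heavier prompts or longer outputs cost more neurons per call, so treat these as ballpark figures.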

47+ Models: What You Can Actually Run

| Category | Models Available | Use Case |
| --- | --- | --- |
| Text Generation | Llama 3.1 (8B, 70B), Llama 3.2, Mistral 7B, Phi-2, Qwen 1.5 | Chatbots, summarization, code generation |
| Text Embeddings | BAAI/bge-small-en-v1.5, bge-base-en-v1.5 | Semantic search, RAG pipelines |
| Image Generation | Stable Diffusion XL, dreamshaper-8-lcm | AI image creation, thumbnails |
| Image Classification | SqueezeNet, ResNet-50, MobileNetV2 | Image labeling, content moderation |
| Speech to Text | Whisper (tiny, base, large-v3-turbo) | Audio transcription, voice interfaces |
| Translation | M2M-100, Meta NLLB-200 | Multi-language support (200 languages) |
| Object Detection | DETR-ResNet-50 | Detect objects in images |
| Text Classification | distilbert-sst-2-int8 | Sentiment analysis |

Notable: the 70B Llama 3.1 model is available but only on the paid tier (it’s classified as a “GA” model with higher neuron costs). The 8B models run on the free tier without issues.

Getting Started: Your First Workers AI Request

You need a free Cloudflare account. No credit card required for the free tier.

Option 1: REST API (any language, instant)

Generate your API token from the Cloudflare dashboard with the “Workers AI” permission. Then call the API directly:

curl https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/meta/llama-3.1-8b-instruct \
  -H "Authorization: Bearer {API_TOKEN}" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain edge computing in one paragraph."}
    ]
  }'

Replace {ACCOUNT_ID} with your account ID (found in the Cloudflare dashboard sidebar) and {API_TOKEN} with your token.

Response format:

{
  "result": {
    "response": "Edge computing brings computation closer to data sources..."
  },
  "success": true
}
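When calling the REST endpoint from code rather than curl, extracting the generated text is a matter of unwrapping that envelope. A minimal sketch (the payload below is a hard-coded sample matching the shape shown above, not a live API call):

```python
# Extract the generated text from a Workers AI REST response envelope.
# `payload` is a hard-coded sample with the documented response shape.
payload = {
    "result": {"response": "Edge computing brings computation closer to data sources..."},
    "success": True,
}

if payload.get("success"):
    text = payload["result"]["response"]
    print(text)
else:
    # Failed calls set success=false and include an errors array.
    raise RuntimeError(f"Workers AI call failed: {payload.get('errors')}")
```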

Option 2: Python Client

pip install cloudflare
import os
from cloudflare import Cloudflare

client = Cloudflare(api_token=os.environ["CLOUDFLARE_API_TOKEN"])

response = client.workers.ai.run(
    account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
    model_name="@cf/meta/llama-3.1-8b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Cloudflare Workers AI?"}
    ]
)

print(response.response)

Option 3: Cloudflare Worker (JavaScript, deployed to the edge)

This is the native way — run inference inside a Worker function deployed globally:

export default {
  async fetch(request, env) {
    const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [
        { role: 'system', content: 'You are a helpful assistant.' },
        { role: 'user', content: 'Summarize the concept of edge AI.' }
      ]
    });

    return new Response(JSON.stringify(response));
  }
};

Deploy with the Wrangler CLI:

# Install Wrangler
npm install -g wrangler

# Login to Cloudflare
wrangler login

# Create a new Worker project
wrangler init my-ai-worker

# Add the AI binding to wrangler.toml:
#   [ai]
#   binding = "AI"

# Deploy
wrangler deploy
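The AI binding lives in wrangler.toml. A minimal config might look like the following (the project name and compatibility date are placeholders; current Wrangler versions use a single `[ai]` table):

```toml
name = "my-ai-worker"
main = "src/index.js"
compatibility_date = "2024-09-01"

# Expose Workers AI to the Worker code as env.AI
[ai]
binding = "AI"
```

With this binding in place, `env.AI.run(...)` in the Worker above resolves without any API token in your code: Cloudflare injects the credentials at the edge.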

Image Generation with Stable Diffusion XL

Cloudflare Workers AI supports Stable Diffusion XL — free, at the edge, no GPU setup required:

import os
from cloudflare import Cloudflare

client = Cloudflare(api_token=os.environ["CLOUDFLARE_API_TOKEN"])

image_response = client.workers.ai.run(
    account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
    model_name="@cf/stabilityai/stable-diffusion-xl-base-1.0",
    prompt="A futuristic city skyline at sunset, digital art",
    num_steps=20
)

# image_response returns binary PNG data
with open("output.png", "wb") as f:
    f.write(image_response)

print("Image saved to output.png")

Audio Transcription with Whisper

Transcribe audio files using Cloudflare’s hosted Whisper model:

import os
import base64
from cloudflare import Cloudflare

client = Cloudflare(api_token=os.environ["CLOUDFLARE_API_TOKEN"])

with open("audio.mp3", "rb") as f:
    audio_bytes = f.read()

result = client.workers.ai.run(
    account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
    model_name="@cf/openai/whisper-large-v3-turbo",
    audio=list(audio_bytes)
)

print(result.text)  # Transcribed text

Build a Semantic Search Pipeline with Embeddings

Workers AI’s embedding models enable vector search — the foundation of RAG (Retrieval-Augmented Generation):

import os
import numpy as np
from cloudflare import Cloudflare

client = Cloudflare(api_token=os.environ["CLOUDFLARE_API_TOKEN"])

def get_embedding(text: str) -> list[float]:
    result = client.workers.ai.run(
        account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
        model_name="@cf/baai/bge-base-en-v1.5",
        text=[text]
    )
    return result.data[0]

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example: find most similar document
docs = [
    "Cloudflare Workers AI runs models at the edge",
    "PostgreSQL is an open-source relational database",
    "Edge computing reduces latency by moving compute closer to users"
]

query = "What are the benefits of edge inference?"

query_embedding = get_embedding(query)
doc_embeddings = [get_embedding(doc) for doc in docs]

similarities = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]
best_match_idx = np.argmax(similarities)

print(f"Most relevant: {docs[best_match_idx]}")
print(f"Similarity: {similarities[best_match_idx]:.4f}")

Use Workers AI with OpenClaw for Edge AI Agents

Cloudflare Workers AI pairs well with OpenClaw for building AI agents that run globally. OpenClaw can orchestrate complex multi-step workflows — for example, using Workers AI for fast local inference while routing expensive tasks to other APIs.

A practical setup: an OpenClaw agent that classifies incoming requests using Workers AI’s lightweight classification models, then routes complex queries to a full-size LLM only when needed. This hybrid approach keeps costs down while maintaining quality.

import os
import requests
from cloudflare import Cloudflare

client = Cloudflare(api_token=os.environ["CLOUDFLARE_API_TOKEN"])

def classify_query(user_query: str) -> str:
    """Use Workers AI to classify if query is simple or complex."""
    result = client.workers.ai.run(
        account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
        model_name="@cf/meta/llama-3.1-8b-instruct",
        messages=[
            {"role": "system", "content": "Classify the query as 'simple' or 'complex'. Reply with one word only."},
            {"role": "user", "content": user_query}
        ],
        max_tokens=10
    )
    return result.response.strip().lower()

def handle_query(user_query: str):
    query_type = classify_query(user_query)

    if query_type == "simple":
        # Handle with Workers AI directly (free, fast)
        result = client.workers.ai.run(
            account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
            model_name="@cf/meta/llama-3.1-8b-instruct",
            messages=[{"role": "user", "content": user_query}]
        )
        return {"source": "workers-ai", "response": result.response}
    else:
        # Route to OpenClaw for complex agentic tasks
        response = requests.post(
            "https://api.openclaw.ai/v1/run",
            headers={"Authorization": f"Bearer {os.environ['OPENCLAW_API_KEY']}"},
            json={"task": user_query}
        )
        return {"source": "openclaw", "response": response.json()}

# Example usage
result = handle_query("What's 2+2?")
print(result)

Cloudflare Workers AI vs Other Free AI APIs

| Feature | Cloudflare Workers AI | Groq API | Google Gemini API | OpenRouter (free) |
| --- | --- | --- | --- | --- |
| Free text generation | 10k neurons/day | 14,400 req/day | 1,500 req/day | 20 req/min (varies) |
| Image generation | Yes (SDXL) | No | Gemini Flash only | Some models |
| Speech transcription | Yes (Whisper) | Yes (Whisper) | No (free tier) | No |
| Embeddings | Yes (BGE models) | No | Yes | Some models |
| Edge deployment | Yes (300+ locations) | No | No | No |
| Model quality (text) | Llama 3.1 8B/70B | Llama 3.3 70B, Mixtral | Gemini 1.5 Flash | Various |
| Latency | Very low (edge) | Very low (custom chips) | Low | Varies by model |
| Credit card required | No | No | No | No |

The key advantage Cloudflare Workers AI has over every competitor: multi-modal free inference. Text, images, speech, embeddings, translation, and object detection — all in one API, all free within the daily neuron limit. Groq is faster for text, but it doesn’t do image generation. Gemini is more capable, but the free tier is text-only.

Limits and What to Watch Out For

  • 10,000 neurons/day is not unlimited. For a production chatbot handling 100+ users/day, you’ll hit the cap. Use caching (Cloudflare KV) to avoid redundant calls on common queries.
  • Model availability changes. Cloudflare occasionally moves models from beta to GA (General Availability), which changes their neuron pricing. Check the models page before building.
  • Context window is limited. Llama 3.1 8B on Workers AI has a 4,096 token context — smaller than what you’d get from the full model elsewhere.
  • Image generation is slow. SDXL on Workers AI takes 10–30 seconds per image. It’s fine for async workflows, not for real-time generation.
  • No fine-tuning. You can’t upload custom weights — you’re limited to Cloudflare’s hosted model catalog.

When to Choose Cloudflare Workers AI

Workers AI is the right choice when:

  • You’re already using Cloudflare for DNS/CDN and want AI inference in the same stack
  • You need multi-modal inference (text + image + speech) in a single free API
  • Latency matters and your users are globally distributed
  • You want to run AI at the edge inside a Cloudflare Worker function
  • You need a free embedding API for a RAG prototype

Use a different API when:

  • You need the fastest text generation (use Groq — it’s 3–5x faster)
  • You need the most capable free model (use Google Gemini)
  • You need more than 10,000 neurons/day without paying (use OpenRouter with free models)
  • You’re not on the Cloudflare stack and don’t plan to be

Getting Your API Keys

  1. Create a free account at cloudflare.com
  2. Go to My Profile → API Tokens → Create Token
  3. Use the “Workers AI” template or manually select the Workers AI: Run permission
  4. Copy your Account ID from the dashboard homepage (right sidebar)
  5. Set environment variables: CLOUDFLARE_API_TOKEN and CLOUDFLARE_ACCOUNT_ID
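Step 5 in a shell, matching the variable names the Python examples above read (the values are placeholders for your own credentials):

```shell
# Export the credentials the SDK examples read from the environment.
# Replace the placeholder values with your own token and account ID.
export CLOUDFLARE_API_TOKEN="paste-your-token-here"
export CLOUDFLARE_ACCOUNT_ID="paste-your-account-id-here"

echo "Token set: ${CLOUDFLARE_API_TOKEN:+yes}"
```

Add these to your shell profile (or a `.env` file your tooling loads) so they persist across sessions.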

The entire setup takes under 5 minutes, and the free tier starts immediately — no waitlist, no approval required.

Final Thoughts

Cloudflare Workers AI is the most versatile free AI API available in 2026 by model type coverage. The 10,000 neurons/day free limit is modest for production use, but for developers building side projects, RAG pipelines, image generation tools, or audio transcription apps, it’s genuinely useful — especially when you don’t need to provide a credit card.

The edge inference angle is unique: no other free AI API runs your inference in 300+ global locations. For latency-sensitive applications — chatbots, real-time translation, interactive demos — that matters.

If you’re on Cloudflare already, Workers AI is the easiest way to add AI to your stack without leaving the platform. Get started for free at developers.cloudflare.com/workers-ai.

