toolfreebie

Posted on • Originally published at toolfreebie.com

Cloudflare Workers AI: Free Edge AI Inference with 47+ Models

What Is Cloudflare Workers AI?

Cloudflare Workers AI is a serverless AI inference platform built on top of Cloudflare’s global edge network. Instead of sending your requests to a single data center, inference runs at the Cloudflare location closest to your users — across 300+ cities worldwide. The result: low-latency AI responses without managing GPU servers.

Cloudflare added Workers AI to its free tier in 2024, making it one of the few platforms where you can run real AI models — including text generation, image generation, speech transcription, and embeddings — at zero cost for reasonable workloads.

What’s Free on Cloudflare Workers AI

Workers AI uses a “neurons” unit to measure compute. On the free Workers plan:

  • 10,000 neurons per day — free, no credit card required
  • Standard tier models: most text, embedding, and classification models
  • All 47+ models included — you pick which one to call per request

For reference, generating a ~500-token text response with Llama 3 costs roughly 400–600 neurons. That means around 15–25 text generation calls per day on the free tier — fine for prototyping, small side projects, or demos.

If you need more, the paid Workers plan costs $5/month and includes $5 in compute credits (with $0.011 per 1,000 neurons beyond that).
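To sanity-check these numbers yourself, here is the back-of-envelope math (note that the 400–600 neurons-per-response figure above is an estimate, not official pricing):

```python
# Rough free-tier budget math for Workers AI text generation.
# NEURONS_PER_RESPONSE is the article's estimate for a ~500-token
# Llama 3 reply, not an official Cloudflare price.
FREE_NEURONS_PER_DAY = 10_000
NEURONS_PER_RESPONSE = (400, 600)  # (best case, worst case)

low = FREE_NEURONS_PER_DAY // NEURONS_PER_RESPONSE[1]   # worst case
high = FREE_NEURONS_PER_DAY // NEURONS_PER_RESPONSE[0]  # best case
print(f"Free-tier text calls per day: roughly {low}-{high}")  # -> roughly 16-25

# Paid plan overage: $0.011 per 1,000 neurons beyond the included credit.
extra_neurons = 100_000
overage = extra_neurons / 1_000 * 0.011
print(f"Cost of {extra_neurons:,} extra neurons: ${overage:.2f}")  # -> $1.10
```

Heavier prompts or longer outputs cost more neurons per call, so treat these as ballpark figures.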

47+ Models: What You Can Actually Run

| Category | Models Available | Use Case |
| --- | --- | --- |
| Text Generation | Llama 3.1 (8B, 70B), Llama 3.2, Mistral 7B, Phi-2, Qwen 1.5 | Chatbots, summarization, code generation |
| Text Embeddings | BAAI/bge-small-en-v1.5, bge-base-en-v1.5 | Semantic search, RAG pipelines |
| Image Generation | Stable Diffusion XL, dreamshaper-8-lcm | AI image creation, thumbnails |
| Image Classification | SqueezeNet, ResNet-50, MobileNetV2 | Image labeling, content moderation |
| Speech to Text | Whisper (tiny, base, large-v3-turbo) | Audio transcription, voice interfaces |
| Translation | M2M-100, Meta NLLB-200 | Multi-language support (200 languages) |
| Object Detection | DETR-ResNet-50 | Detect objects in images |
| Text Classification | distilbert-sst-2-int8 | Sentiment analysis |

Notable: the 70B Llama 3.1 model is available but only on the paid tier (it’s classified as a “GA” model with higher neuron costs). The 8B models run on the free tier without issues.

Getting Started: Your First Workers AI Request

You need a free Cloudflare account. No credit card required for the free tier.

Option 1: REST API (any language, instant)

Generate your API token from the Cloudflare dashboard with the “Workers AI” permission. Then call the API directly:

curl https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/meta/llama-3.1-8b-instruct \
  -H "Authorization: Bearer {API_TOKEN}" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain edge computing in one paragraph."}
    ]
  }'

Replace {ACCOUNT_ID} with your account ID (found in the Cloudflare dashboard sidebar) and {API_TOKEN} with your token.

Response format:

{
  "result": {
    "response": "Edge computing brings computation closer to data sources..."
  },
  "success": true
}
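When calling the REST endpoint from code rather than curl, extracting the generated text is a matter of unwrapping that envelope. A minimal sketch (the payload below is a hard-coded sample matching the shape shown above, not a live API call):

```python
# Extract the generated text from a Workers AI REST response envelope.
# `payload` is a hard-coded sample with the documented response shape.
payload = {
    "result": {"response": "Edge computing brings computation closer to data sources..."},
    "success": True,
}

if payload.get("success"):
    text = payload["result"]["response"]
    print(text)
else:
    # Failed calls set success=false and include an errors array.
    raise RuntimeError(f"Workers AI call failed: {payload.get('errors')}")
```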

Option 2: Python Client

pip install cloudflare
import os
from cloudflare import Cloudflare

client = Cloudflare(api_token=os.environ["CLOUDFLARE_API_TOKEN"])

response = client.workers.ai.run(
    account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
    model_name="@cf/meta/llama-3.1-8b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Cloudflare Workers AI?"}
    ]
)

print(response.response)

Option 3: Cloudflare Worker (JavaScript, deployed to the edge)

This is the native way — run inference inside a Worker function deployed globally:

export default {
  async fetch(request, env) {
    const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [
        { role: 'system', content: 'You are a helpful assistant.' },
        { role: 'user', content: 'Summarize the concept of edge AI.' }
      ]
    });

    return new Response(JSON.stringify(response));
  }
};

Deploy with the Wrangler CLI:

# Install Wrangler
npm install -g wrangler

# Login to Cloudflare
wrangler login

# Create a new Worker project
wrangler init my-ai-worker

# Add the AI binding to wrangler.toml:
#   [ai]
#   binding = "AI"

# Deploy
wrangler deploy
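The AI binding lives in wrangler.toml. A minimal config might look like the following (the project name and compatibility date are placeholders; current Wrangler versions use a single `[ai]` table):

```toml
name = "my-ai-worker"
main = "src/index.js"
compatibility_date = "2024-09-01"

# Expose Workers AI to the Worker code as env.AI
[ai]
binding = "AI"
```

With this binding in place, `env.AI.run(...)` in the Worker above resolves without any API token in your code: Cloudflare injects the credentials at the edge.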

Image Generation with Stable Diffusion XL

Cloudflare Workers AI supports Stable Diffusion XL — free, at the edge, no GPU setup required:

import os
from cloudflare import Cloudflare

client = Cloudflare(api_token=os.environ["CLOUDFLARE_API_TOKEN"])

image_response = client.workers.ai.run(
    account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
    model_name="@cf/stabilityai/stable-diffusion-xl-base-1.0",
    prompt="A futuristic city skyline at sunset, digital art",
    num_steps=20
)

# image_response returns binary PNG data
with open("output.png", "wb") as f:
    f.write(image_response)

print("Image saved to output.png")

Audio Transcription with Whisper

Transcribe audio files using Cloudflare’s hosted Whisper model:

import os
import base64
from cloudflare import Cloudflare

client = Cloudflare(api_token=os.environ["CLOUDFLARE_API_TOKEN"])

with open("audio.mp3", "rb") as f:
    audio_bytes = f.read()

result = client.workers.ai.run(
    account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
    model_name="@cf/openai/whisper-large-v3-turbo",
    audio=list(audio_bytes)
)

print(result.text)  # Transcribed text

Build a Semantic Search Pipeline with Embeddings

Workers AI’s embedding models enable vector search — the foundation of RAG (Retrieval-Augmented Generation):

import os
import numpy as np
from cloudflare import Cloudflare

client = Cloudflare(api_token=os.environ["CLOUDFLARE_API_TOKEN"])

def get_embedding(text: str) -> list[float]:
    result = client.workers.ai.run(
        account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
        model_name="@cf/baai/bge-base-en-v1.5",
        text=[text]
    )
    return result.data[0]

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example: find most similar document
docs = [
    "Cloudflare Workers AI runs models at the edge",
    "PostgreSQL is an open-source relational database",
    "Edge computing reduces latency by moving compute closer to users"
]

query = "What are the benefits of edge inference?"

query_embedding = get_embedding(query)
doc_embeddings = [get_embedding(doc) for doc in docs]

similarities = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]
best_match_idx = np.argmax(similarities)

print(f"Most relevant: {docs[best_match_idx]}")
print(f"Similarity: {similarities[best_match_idx]:.4f}")

Use Workers AI with OpenClaw for Edge AI Agents

Cloudflare Workers AI pairs well with OpenClaw for building AI agents that run globally. OpenClaw can orchestrate complex multi-step workflows — for example, using Workers AI for fast local inference while routing expensive tasks to other APIs.

A practical setup: an OpenClaw agent that classifies incoming requests using Workers AI’s lightweight classification models, then routes complex queries to a full-size LLM only when needed. This hybrid approach keeps costs down while maintaining quality.

import os
import requests
from cloudflare import Cloudflare

client = Cloudflare(api_token=os.environ["CLOUDFLARE_API_TOKEN"])

def classify_query(user_query: str) -> str:
    """Use Workers AI to classify if query is simple or complex."""
    result = client.workers.ai.run(
        account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
        model_name="@cf/meta/llama-3.1-8b-instruct",
        messages=[
            {"role": "system", "content": "Classify the query as 'simple' or 'complex'. Reply with one word only."},
            {"role": "user", "content": user_query}
        ],
        max_tokens=10
    )
    return result.response.strip().lower()

def handle_query(user_query: str):
    query_type = classify_query(user_query)

    if query_type == "simple":
        # Handle with Workers AI directly (free, fast)
        result = client.workers.ai.run(
            account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
            model_name="@cf/meta/llama-3.1-8b-instruct",
            messages=[{"role": "user", "content": user_query}]
        )
        return {"source": "workers-ai", "response": result.response}
    else:
        # Route to OpenClaw for complex agentic tasks
        response = requests.post(
            "https://api.openclaw.ai/v1/run",
            headers={"Authorization": f"Bearer {os.environ['OPENCLAW_API_KEY']}"},
            json={"task": user_query}
        )
        return {"source": "openclaw", "response": response.json()}

# Example usage
result = handle_query("What's 2+2?")
print(result)

Cloudflare Workers AI vs Other Free AI APIs

| Feature | Cloudflare Workers AI | Groq API | Google Gemini API | OpenRouter (free) |
| --- | --- | --- | --- | --- |
| Free text generation | 10k neurons/day | 14,400 req/day | 1,500 req/day | 20 req/min (varies) |
| Image generation | Yes (SDXL) | No | Gemini Flash only | Some models |
| Speech transcription | Yes (Whisper) | Yes (Whisper) | No (free tier) | No |
| Embeddings | Yes (BGE models) | No | Yes | Some models |
| Edge deployment | Yes (300+ locations) | No | No | No |
| Model quality (text) | Llama 3.1 8B/70B | Llama 3.3 70B, Mixtral | Gemini 1.5 Flash | Various |
| Latency | Very low (edge) | Very low (custom chips) | Low | Varies by model |
| Credit card required | No | No | No | No |

The key advantage Cloudflare Workers AI has over every competitor: multi-modal free inference. Text, images, speech, embeddings, translation, and object detection — all in one API, all free within the daily neuron limit. Groq is faster for text, but it doesn’t do image generation. Gemini is more capable, but the free tier is text-only.

Limits and What to Watch Out For

  • 10,000 neurons/day is not unlimited. For a production chatbot handling 100+ users/day, you’ll hit the cap. Use caching (Cloudflare KV) to avoid redundant calls on common queries.
  • Model availability changes. Cloudflare occasionally moves models from beta to GA (General Availability), which changes their neuron pricing. Check the models page before building.
  • Context window is limited. Llama 3.1 8B on Workers AI has a 4,096 token context — smaller than what you’d get from the full model elsewhere.
  • Image generation is slow. SDXL on Workers AI takes 10–30 seconds per image. It’s fine for async workflows, not for real-time generation.
  • No fine-tuning. You can’t upload custom weights — you’re limited to Cloudflare’s hosted model catalog.

When to Choose Cloudflare Workers AI

Workers AI is the right choice when:

  • You’re already using Cloudflare for DNS/CDN and want AI inference in the same stack
  • You need multi-modal inference (text + image + speech) in a single free API
  • Latency matters and your users are globally distributed
  • You want to run AI at the edge inside a Cloudflare Worker function
  • You need a free embedding API for a RAG prototype

Use a different API when:

  • You need the fastest text generation (use Groq — it’s 3–5x faster)
  • You need the most capable free model (use Google Gemini)
  • You need more than 10,000 neurons/day without paying (use OpenRouter with free models)
  • You’re not on the Cloudflare stack and don’t plan to be

Getting Your API Keys

  1. Create a free account at cloudflare.com
  2. Go to My Profile → API Tokens → Create Token
  3. Use the “Workers AI” template or manually select the Workers AI: Run permission
  4. Copy your Account ID from the dashboard homepage (right sidebar)
  5. Set environment variables: CLOUDFLARE_API_TOKEN and CLOUDFLARE_ACCOUNT_ID
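Step 5 in a shell, matching the variable names the Python examples above read (the values are placeholders for your own credentials):

```shell
# Export the credentials the SDK examples read from the environment.
# Replace the placeholder values with your own token and account ID.
export CLOUDFLARE_API_TOKEN="paste-your-token-here"
export CLOUDFLARE_ACCOUNT_ID="paste-your-account-id-here"

echo "Token set: ${CLOUDFLARE_API_TOKEN:+yes}"
```

Add these to your shell profile (or a `.env` file your tooling loads) so they persist across sessions.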

The entire setup takes under 5 minutes, and the free tier starts immediately — no waitlist, no approval required.

Final Thoughts

Cloudflare Workers AI is the most versatile free AI API available in 2026 by model type coverage. The 10,000 neurons/day free limit is modest for production use, but for developers building side projects, RAG pipelines, image generation tools, or audio transcription apps, it’s genuinely useful — especially when you don’t need to provide a credit card.

The edge inference angle is unique: no other free AI API runs your inference in 300+ global locations. For latency-sensitive applications — chatbots, real-time translation, interactive demos — that matters.

If you’re on Cloudflare already, Workers AI is the easiest way to add AI to your stack without leaving the platform. Get started for free at developers.cloudflare.com/workers-ai.

