Alex Spinov
Fireworks AI Has a Free API: Deploy Open-Source Models 10x Faster

What is Fireworks AI?

Fireworks AI is a generative AI inference platform optimized for speed and cost. It serves open-source models like Llama 3, Mixtral, and its own FireFunction model with industry-leading latency — often 2-10x faster than competitors.

Why Fireworks AI?

  • Free tier — 600K tokens/day free, no credit card required
  • Fastest inference — custom FireAttention engine optimized beyond standard vLLM
  • OpenAI-compatible API — drop-in replacement
  • Function calling — FireFunction-v2 rivals GPT-4 for tool use at 1/10th the cost
  • Fine-tuning — LoRA fine-tuning from $0.40/hour
  • On-demand deployment — deploy any HuggingFace model in minutes
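
The 600K tokens/day free tier is generous but finite, and batch jobs can burn through it fast. A minimal client-side budget tracker — illustrative only, not part of any Fireworks SDK — can short-circuit requests before you run over:

```python
from datetime import date

DAILY_TOKEN_BUDGET = 600_000  # Fireworks free-tier cap

class TokenBudget:
    """Tracks tokens used today and refuses work past the daily cap."""

    def __init__(self, budget: int = DAILY_TOKEN_BUDGET):
        self.budget = budget
        self.day = date.today()
        self.used = 0

    def record(self, tokens: int) -> None:
        if date.today() != self.day:  # new day: counter resets
            self.day, self.used = date.today(), 0
        self.used += tokens

    def can_spend(self, tokens: int) -> bool:
        return self.used + tokens <= self.budget

budget = TokenBudget()
budget.record(550_000)
print(budget.can_spend(40_000))   # True: 590K <= 600K
print(budget.can_spend(60_000))   # False: would exceed the cap
```

Call `budget.record(response.usage.total_tokens)` after each completion to keep the counter honest.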

Quick Start

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="your-fireworks-key"  # Free at fireworks.ai
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",
    messages=[{"role": "user", "content": "Explain GitOps in 3 sentences"}]
)
print(response.choices[0].message.content)

Function Calling with FireFunction

tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Get real-time stock price",
        "parameters": {
            "type": "object",
            "properties": {
                "symbol": {"type": "string", "description": "Stock ticker symbol"}
            },
            "required": ["symbol"]
        }
    }
}]

response = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",
    messages=[{"role": "user", "content": "What is Apple stock at?"}],
    tools=tools
)
# FireFunction-v2 matches GPT-4 on function calling benchmarks
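When FireFunction returns `tool_calls`, executing them is on you: parse the arguments JSON, run the matching local function, and append the result as a `tool` message before calling the API again. A minimal dispatcher — with `get_stock_price` stubbed out, since the real data source is up to you — might look like:

```python
import json

def get_stock_price(symbol: str) -> float:
    # Stub for illustration: swap in a real market-data API here.
    return {"AAPL": 189.84}.get(symbol.upper(), 0.0)

# Map tool names (as declared in the `tools` schema) to local functions.
TOOLS = {"get_stock_price": get_stock_price}

def run_tool_call(name: str, arguments_json: str) -> str:
    """Execute one tool call and return a JSON string for the tool message."""
    args = json.loads(arguments_json)
    result = TOOLS[name](**args)
    return json.dumps({"result": result})

# Mirrors the shape found in response.choices[0].message.tool_calls
output = run_tool_call("get_stock_price", '{"symbol": "AAPL"}')
print(output)  # {"result": 189.84}
```

Feed `output` back as `{"role": "tool", "tool_call_id": call.id, "content": output}` and make a second `chat.completions.create` call so the model can phrase the final answer — the same round-trip as with OpenAI's API.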

Structured Output (JSON Mode)

from pydantic import BaseModel

class ProductReview(BaseModel):
    sentiment: str
    score: float
    key_points: list[str]

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",
    messages=[{"role": "user", "content": "Analyze this review: Great product, fast shipping, but packaging was damaged"}],
    # Fireworks accepts an optional "schema" key to constrain output to the model
    response_format={"type": "json_object", "schema": ProductReview.model_json_schema()}
)
review = ProductReview.model_validate_json(response.choices[0].message.content)

Deploy Custom Models

# Deploy any HuggingFace model
fireworks models deploy \
  --model-id my-org/my-fine-tuned-model \
  --display-name "My Custom Model" \
  --gpu-type A100
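Once deployed, the model is served through the same OpenAI-compatible endpoint. Assuming Fireworks' `accounts/<account>/models/<model>` naming seen in the earlier examples (account slug `my-org` is a placeholder), the only change on the client side is the model string:

```python
ACCOUNT = "my-org"              # your Fireworks account slug (placeholder)
MODEL = "my-fine-tuned-model"   # the model you deployed

model_id = f"accounts/{ACCOUNT}/models/{MODEL}"
print(model_id)  # accounts/my-org/models/my-fine-tuned-model

# Used exactly like the hosted models:
# client.chat.completions.create(model=model_id, messages=[...])
```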

Speed and Cost Comparison

| Provider | Model | Speed | Cost per 1M tokens | Free tier |
|---|---|---|---|---|
| Fireworks | Llama 3 70B | 200+ tok/s | $0.90 | 600K tok/day |
| Groq | Llama 3 70B | 500+ tok/s | $0.59 | Rate limited |
| Together | Llama 3 70B | 80 tok/s | $0.90 | $5 credits |
| OpenAI | GPT-4 | 30 tok/s | $30.00 | None |
| Anthropic | Claude | 40 tok/s | $15.00 | None |

Real-World Use Case

A document processing startup needed to extract structured data from 100K PDFs per day. OpenAI costs: $4,500/day. Fireworks with Llama 3 70B + JSON mode: $135/day — same extraction quality at 97% cost reduction. The speed improvement also cut processing time from 8 hours to 45 minutes.
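
These numbers are easy to sanity-check: $135 versus $4,500 per day is exactly a 97% reduction, and at Fireworks' $0.90 per 1M tokens, $135/day buys 150M tokens — which implies roughly 1.5K tokens per PDF at 100K PDFs/day (an inference from the figures above, not a stated spec):

```python
openai_cost = 4_500.0     # $/day, from the case study
fireworks_cost = 135.0    # $/day

reduction = 1 - fireworks_cost / openai_cost
print(f"{reduction:.0%}")  # 97%

# Implied daily token volume at Fireworks' $0.90 per 1M tokens:
tokens_per_day = fireworks_cost / 0.90 * 1_000_000
print(f"{tokens_per_day / 100_000:,.0f} tokens per PDF")  # ~1,500 at 100K PDFs/day
```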


Optimizing your AI inference costs? I help teams migrate to open-source models with maximum performance. Contact spinov001@gmail.com or explore my automation tools on Apify.
