DEV Community: toolfreebie

Tavily vs Brave vs Exa: Free Search APIs for AI Agents

toolfreebie — Thu, 28 May 2026 09:08:05 +0000

Every AI Agent Needs a Search Tool — Here Are the Three Free Ones That Actually Work

If you build AI agents in 2026, your model is brilliant at reasoning over text and useless at knowing what happened yesterday. The fix is the same one every framework — CrewAI, LangGraph, AutoGPT, Aider, every MCP server with a “web search” tool — eventually arrives at: hand the agent a search API.

Google’s official Search API is closed off behind enterprise contracts. Bing’s search API is being retired. SerpAPI starts at $75/month. For developers prototyping or running personal agents, the practical free options have narrowed down to three serious providers, each with a different definition of “free”:

Tavily — purpose-built for LLM retrieval, 1,000 free API credits per month, no credit card
Brave Search API — independent web index, 2,000 free queries per month at 1 query/second, no credit card
Exa — neural search engine designed for AI, $10 of free signup credit (≈1,000 searches), email signup only

All three are production-ready, all three publish OpenAPI specs you can paste into a tool definition, and all three plug straight into the agent frameworks you are already using. This guide breaks down what each free tier really gets you, which one to pair with which framework, and the corner cases where you should reach for a different tool entirely.

Quick Comparison: Tavily vs Brave vs Exa Free Tiers

Feature	Tavily (Free)	Brave Search API (Free)	Exa (Free)
Free quota	1,000 API credits/month	2,000 queries/month	$10 signup credit (~1,000 searches)
Rate limit	~10 requests/sec, no daily cap	1 query/sec	~5 requests/sec
Credit card needed	No	No	No
Free tier resets	Monthly	Monthly	One-time credit
Index type	Aggregated (Bing/others) + own crawl	Independent crawl, own index	Embedding-based neural index
Optimized for	LLM RAG / agent retrieval	Traditional web search	Semantic / similarity search
Content extraction	Yes — built-in `include_raw_content`	Snippet only (paid tier adds extraction)	Yes — built-in `contents.text`
News endpoint	Yes (`topic="news"`)	Yes (dedicated news API)	Yes (via `type="neural"` + filters)
Domain include/exclude	Yes	Limited (goggles)	Yes
Best for	RAG agents that need clean text	Cheap, high-volume search at scale	Finding similar pages / research

The short version: Tavily is what you reach for when an LLM is going to read the result; Brave is what you reach for when you want a lot of independent search results cheaply; Exa is what you reach for when you want results that share semantic meaning rather than just keywords.

What Is Tavily?

Tavily is a search API built specifically for LLMs and AI agents. Founded in 2023 and now used by tens of thousands of developers, it has become the default search tool in LangChain, the recommended tool in the CrewAI documentation, and the example most MCP search servers ship with.

The pitch is straightforward: a normal search API gives you ten blue links and snippets. An agent then has to spend additional turns visiting each URL, parsing HTML, stripping ads and navigation, and producing usable text. Tavily collapses that entire pipeline into a single API call — you send a query, you get back ranked URLs plus a clean, model-ready text answer extracted from the top results, with optional raw content of each page.

For agents, this matters in two practical ways. First, it cuts token usage: instead of feeding 10 noisy HTML pages into your context window, you feed one cleaned summary plus three extracted snippets. Second, it cuts latency: one HTTP call instead of one search call plus ten fetch calls.

Tavily Free Tier: What You Actually Get

The free tier is generous for prototyping and personal agents:

1,000 API credits per month, refreshed at the start of each calendar month
1 credit = 1 basic search; search_depth="advanced" costs 2 credits per call
No credit card required — sign up with email or GitHub and your key is live immediately
Full API access — every endpoint and parameter that paid users get
~10 requests per second rate limit (not officially published, but consistent in practice)

For a hobby agent doing 30 searches per day, you will not hit the limit. For a production app, the next paid tier (Researcher) is $30/month for 4,000 credits, with usage-based billing on top.
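
To check whether your own usage fits, the credit rules above can be turned into a quick estimate. This is a rough sketch rather than anything Tavily ships: the function name and the `advanced_share` parameter are made up for illustration, and it assumes a 30-day month with the 1-credit/2-credit costs quoted above.

```python
def tavily_credits_per_month(searches_per_day, advanced_share=0.0, days=30):
    """Estimate monthly credit use: basic = 1 credit, advanced = 2 credits."""
    if not 0.0 <= advanced_share <= 1.0:
        raise ValueError("advanced_share must be between 0 and 1")
    basic = searches_per_day * (1.0 - advanced_share)
    advanced = searches_per_day * advanced_share
    return round((basic * 1 + advanced * 2) * days)

FREE_CREDITS = 1000  # monthly free allowance quoted above

for share in (0.0, 0.25, 1.0):
    used = tavily_credits_per_month(30, advanced_share=share)
    status = "fits" if used <= FREE_CREDITS else "exceeds"
    print(f"{share:.0%} advanced: {used} credits/month, {status} the free tier")
```

At 30 searches a day, an all-basic agent uses 900 credits, but sending even a quarter of the calls as advanced pushes it to 1,125 and past the free allowance.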

One thing to know: Tavily does not run its own crawler at the scale of Google. It aggregates from upstream providers (Bing API is a major one) plus a curated crawl of high-quality sources, then re-ranks the combined results for relevance to your specific LLM query. The ranking quality is the real product, not the raw index size.

Getting Started with Tavily

1. Get Your Free API Key

Go to tavily.com and click Get API Key
Sign in with GitHub or email — no credit card form appears
Copy the key from your dashboard (it starts with tvly-)

2. Call the API from Python

pip install tavily-python

from tavily import TavilyClient

client = TavilyClient(api_key="tvly-YOUR_KEY")

response = client.search(
    query="What were the major Claude 4.7 release notes?",
    search_depth="basic",         # "advanced" gives deeper crawl, costs 2 credits
    max_results=5,
    include_answer=True,          # LLM-generated summary of top results
    include_raw_content=False,    # set True to get full extracted page text
)

print(response["answer"])
for r in response["results"]:
    print(f"{r['title']} - {r['url']}")
    print(r["content"][:200])

3. Direct curl Without the SDK

curl -X POST https://api.tavily.com/search \
  -H "Content-Type: application/json" \
  -d '{
    "api_key": "tvly-YOUR_KEY",
    "query": "latest open-source LLM benchmarks",
    "search_depth": "basic",
    "include_answer": true,
    "max_results": 5
  }'

4. Drop Tavily into a CrewAI Agent

from crewai import Agent
from crewai_tools import TavilySearchTool

researcher = Agent(
    role="Research Analyst",
    goal="Find authoritative sources for any topic",
    backstory="You search the open web and cite primary sources only.",
    tools=[TavilySearchTool(api_key="tvly-YOUR_KEY")],
    verbose=True,
)

That is the whole integration — CrewAI ships the tool wrapper, and the agent will now call Tavily whenever its reasoning step decides to “look something up.” For setting up the CrewAI side, see our free CrewAI guide.

What Is Brave Search API?

Brave Search API is the developer-facing endpoint of the same search index that powers the Brave Browser’s default search. Unlike Tavily (which sits on top of upstream APIs) or Exa (which is a semantic engine), Brave runs its own independent web crawler and serves ~30 billion pages from infrastructure it controls.

That independence is the entire pitch. Brave is not paying Microsoft for every query, and it is not subject to Bing’s rate limits or pricing changes. If you are building a product whose value is “we search the open web and synthesize answers” — for example, a competitor to Perplexity — the underlying index has to be one you actually control or license cheaply at scale. Brave is currently the most realistic option in that category.

Brave also exposes several specialized endpoints out of the box: /web/search, /news/search, /videos/search, /images/search, and a /suggest autocomplete endpoint. For an agent that needs different result types in different turns, having all of that under one key is genuinely convenient.

Brave Search API Free Tier: What You Actually Get

The free plan, called Data for Free, is the lowest-friction one of the three:

2,000 queries per month across the web search endpoint
1 query per second rate limit (this is the most-cited gotcha — you cannot fan out 10 parallel searches at once)
No credit card required; signup adds a card only if you upgrade
Access to web, news, video, image, and suggest endpoints on the free plan
Goggles support — custom rerank rules to bias toward specific domains

The free tier returns snippets, not extracted page bodies. If you want extracted markdown content with the search result, that requires the Data for AI plan, which costs $5 per 1,000 queries and is the cheapest pure-search-plus-extraction price on the market.

The 1 query/second rate limit on the free tier is the single most important number to internalize. If your agent does parallel fan-out search (a common pattern in LangGraph workflows), you will hit 429s immediately. The simplest fix is a throttling wrapper around the client that keeps calls at least one second apart.

Getting Started with Brave Search

1. Get Your Free Key

Go to api.search.brave.com and click Get Started Free
Sign up with email; verify; choose the Data for Free plan
Generate a subscription token from API Keys

2. curl First Call

curl -s "https://api.search.brave.com/res/v1/web/search?q=open+source+LLM+benchmarks&count=10" \
  -H "Accept: application/json" \
  -H "X-Subscription-Token: YOUR_TOKEN"

3. Python Client with Rate Limiting Built In

import os, time, requests

class BraveSearch:
    def __init__(self, token, rps=1):
        self.token = token
        self.min_interval = 1.0 / rps   # minimum gap between calls, in seconds
        self.last_call = None           # monotonic time of the previous call

    def _throttle(self):
        # Sleep only for the part of the gap that has not already elapsed.
        # Single-threaded use only; add a lock if several threads share a client.
        if self.last_call is not None:
            wait = self.min_interval - (time.monotonic() - self.last_call)
            if wait > 0:
                time.sleep(wait)
        self.last_call = time.monotonic()

    def search(self, q, count=10, country="us"):
        self._throttle()
        r = requests.get(
            "https://api.search.brave.com/res/v1/web/search",
            headers={
                "Accept": "application/json",
                "X-Subscription-Token": self.token,
            },
            params={"q": q, "count": count, "country": country},
            timeout=20,
        )
        r.raise_for_status()
        return r.json()

brave = BraveSearch(os.environ["BRAVE_TOKEN"])
data = brave.search("free vector databases for RAG 2026")
for r in data["web"]["results"][:5]:
    print(r["title"], "-", r["url"])

4. Use Brave with LangChain

import os

from langchain_community.tools import BraveSearch

tool = BraveSearch.from_api_key(
    api_key=os.environ["BRAVE_TOKEN"],
    search_kwargs={"count": 5},
)

print(tool.run("Latest GPT-4.5 evaluations"))

What Is Exa?

Exa (formerly Metaphor Systems) is a semantic search engine built around dense vector embeddings rather than keyword inversion. Instead of matching the words in your query against words on pages, Exa converts your query and the entire indexed web into the same embedding space, then returns pages whose meaning is closest — even if they share zero surface vocabulary with the query.

This sounds like a niche distinction until you actually use it. Two examples that illustrate where Exa shines:

“Articles by someone who used to work at OpenAI and now does longevity research” — a query with no good keywords to match on; Exa returns relevant blog posts; Google returns junk.
“Pages similar to this Anthropic safety post” — Exa has a dedicated find_similar endpoint that returns semantically nearest pages to a URL you supply; the closest equivalent on Google is “site:” with a list you maintain yourself.

Exa is the right tool when your agent’s task is research, similarity discovery, or finding non-obvious sources. It is the wrong tool when you need the absolute newest news article from this morning, because the embedding index is updated continuously but not in real time.

Exa Free Tier: What You Actually Get

Exa structures its free path differently from the other two:

$10 of free credit at signup, no credit card required
Pricing: $5 per 1,000 searches for the basic search endpoint, $10 per 1,000 for search + contents
Effective free quota at those prices: ~2,000 search-only calls or ~1,000 search-plus-contents calls
Once the $10 runs out, you must add a card to continue — there is no monthly refill
Full feature access on the free credit: neural search, keyword search, find-similar, contents extraction, livecrawl, summaries

If you blow through $10 in a week of heavy experimentation, that signals either that the tool is genuinely valuable for your use case (in which case pay) or that you are using it wrong (search-plus-contents in a loop where you should be caching). Either way, the trial credit is enough to make a real go/no-go decision.

Getting Started with Exa

1. Sign Up and Grab Your Key

Go to exa.ai and click Get API Key
Sign in with Google or email; you land on the dashboard with $10 of credit visible
Copy your key from API Keys

2. Neural Search with Contents Extraction

pip install exa-py

from exa_py import Exa

exa = Exa(api_key="YOUR_KEY")

result = exa.search_and_contents(
    "research papers about retrieval-augmented generation evaluation",
    type="neural",
    num_results=5,
    text={"max_characters": 2000},   # extracted, cleaned page text
)

for r in result.results:
    print(r.title, "-", r.url)
    print(r.text[:300])
    print()

3. Find Similar Pages

# Given any URL, return semantically similar pages
similar = exa.find_similar_and_contents(
    "https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback",
    num_results=5,
    text=True,
)

for r in similar.results:
    print(r.url, "score:", round(r.score, 3))

4. curl Without the SDK

curl -s https://api.exa.ai/search \
  -H "x-api-key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "papers comparing dense and sparse retrievers",
    "numResults": 5,
    "type": "neural",
    "contents": {"text": {"maxCharacters": 1500}}
  }'

Head-to-Head: Tavily vs Brave vs Exa

Quota Math for a Real Agent

The “quota per month” numbers look comparable until you do the math against a realistic agent loop. Say your agent does 5 searches per user interaction, and you have 20 daily active users:

Daily searches: 20 users × 5 searches = 100 searches/day = 3,000/month
Tavily free (1,000/mo): covers ~6 users/day, then hard stop
Brave free (2,000/mo): covers ~13 users/day, plus the 1 req/sec ceiling caps your parallelism
Exa $10 credit: ~2,000 search-only calls at $5/1,000 (or ~1,000 with contents), gone in 10 to 20 days, then pay

For anything beyond hobby scale, you will hit a paid tier. The choice is then about which paid pricing fits your access pattern: Brave’s $5/1,000 with extraction is the cheapest in absolute terms, Tavily’s $30/4,000 includes LLM-tuned ranking, and Exa’s $10/1,000 with contents gives you semantic search that the other two do not offer.
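
If you want to rerun that arithmetic with your own traffic numbers, a small helper is enough. This is a minimal sketch: the function names are invented here, and it assumes a 30-day month and a steady number of searches per user.

```python
import math

def users_covered(monthly_quota, searches_per_user=5, days=30):
    """Daily active users a monthly quota can serve without overage."""
    return math.floor(monthly_quota / days / searches_per_user)

def days_until_exhausted(quota, searches_per_day):
    """Whole days a fixed, non-refilling quota lasts at a steady daily rate."""
    return math.floor(quota / searches_per_day)

daily = 20 * 5  # 20 daily active users, 5 searches each
print("monthly demand:", daily * 30)
print("users/day on a 1,000/month quota:", users_covered(1000))
print("users/day on a 2,000/month quota:", users_covered(2000))
print("days a 1,000-call one-time credit lasts:", days_until_exhausted(1000, daily))
```

For a one-time credit, plug in however many calls your credit actually buys at the endpoint mix you use.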

Result Quality for LLM Consumption

This is what actually matters when an agent is the consumer. We are not optimizing for human eyeballs; we are optimizing for token efficiency and answer faithfulness downstream.

Tavily wins on this axis by design. The include_answer=True flag returns an LLM-generated summary that is already cleaned, deduplicated, and citation-tagged. The include_raw_content=True flag returns extracted page text without HTML, ads, or navigation — exactly what you would pipe into a system message for a downstream LLM call. You pay no extra credits for this.

Brave on the free tier returns search snippets only — typically the first ~150 characters of a result, with some metadata. To get clean page bodies you need either the paid Data for AI plan or a separate scraping step. For agents, this means an extra fetch hop per result.

Exa ties with Tavily on content extraction (contents.text returns cleaned page bodies) and uniquely offers contents.summary which runs an LLM over each result to compress it further. It uses credit faster but the output is the most LLM-ready of the three.

Latency

Measured from a single US datacenter, p50 round-trip on the basic search endpoint with no extraction:

Brave: ~250-400 ms
Tavily basic: ~400-700 ms
Tavily advanced: ~1.5-3 s (deeper crawl)
Exa neural: ~400-800 ms; with contents: ~1-2 s

For interactive agents, Brave is the fastest, Tavily basic and Exa neural are comparable, and Tavily advanced is in a different latency class — only worth it when answer quality justifies the wait.

Index Freshness

Brave wins on freshness — its independent crawler updates within hours for major news sources, and the /news/search endpoint is specifically optimized for recency.

Tavily inherits the freshness of its upstream Bing API plus its own curated crawl; typically within 1-6 hours for news. The topic="news" parameter biases toward recency.

Exa updates continuously but with an embedding step in between, so very recent content (last few hours) may not yet be in the neural index. The livecrawl="always" parameter forces a real-time crawl for the top hits, but costs more credit.

Working With Your Agent Framework

Framework	Tavily	Brave	Exa
LangChain / LangGraph	Native (`TavilySearchResults`)	Native (`BraveSearch`)	Native (`ExaSearchRetriever`)
CrewAI	Native (`TavilySearchTool`)	Via custom `BaseTool`	Native (`EXASearchTool`)
MCP servers	Official `mcp-server-tavily`	Community `brave-search-mcp`	Official `exa-mcp-server`
OpenAI Assistants	Function calling wrapper	Function calling wrapper	Function calling wrapper
Anthropic Claude tool use	Tool definition snippet	Tool definition snippet	Tool definition snippet

All three publish well-maintained MCP servers, which is the path that lets your AI assistant (Claude Desktop, Cursor, Cline, etc.) gain search powers without writing any code at all. For background on MCP, see our guide to the Model Context Protocol.
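
For the last two rows of the table, the "tool definition snippet" is just a name, a description, and a JSON Schema for the arguments. The sketch below shows one possible shape. The tool name `web_search` and the `intent` values are placeholders chosen for this example, not anything the providers publish; the two layouts follow the general structure Anthropic and OpenAI document for tool use and function calling, so check their current docs before relying on field names.

```python
import json

# Anthropic-style tool definition: name, description, JSON Schema for the input.
WEB_SEARCH_TOOL = {
    "name": "web_search",  # hypothetical tool name
    "description": "Search the web and return titles, URLs, and short text.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query"},
            "intent": {
                "type": "string",
                "enum": ["general", "news", "research", "similar"],
            },
        },
        "required": ["query"],
    },
}

def to_openai_function(tool):
    """Re-wrap the same schema in the OpenAI function-calling layout."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["input_schema"],
        },
    }

print(json.dumps(to_openai_function(WEB_SEARCH_TOOL), indent=2))
```

Your own code still has to execute the call: when the model asks for `web_search`, you run the search against whichever provider you chose and hand the results back as the tool result.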

Which One Should You Use? A Decision Tree

Use this decision logic — in order — and you will land on the right tool roughly every time.

Is the consumer an LLM that needs clean, summarized text? → Tavily. The include_answer and include_raw_content options are exactly what you want.
Do you need a high volume of cheap web searches with an independent index, and you are willing to do extraction yourself? → Brave. The $5/1,000 paid tier is unbeatable on raw search cost.
Are you doing research, similarity discovery, or finding non-obvious sources? → Exa. Neural search and find_similar have no real free-tier competitor.
Do you need news in the last hour? → Brave (/news/search) or Tavily (topic="news"); avoid Exa for breaking news.
Are you building a Perplexity-style product where the index is the moat? → Brave. The independent crawl matters at scale.
Are you prototyping an agent and just want one search call that “works”? → Tavily. Easiest setup, cleanest output, and a free quota that refreshes every month.
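
Because the questions are meant to be asked in order, the tree collapses into a chain of early returns. This is only an illustrative sketch: the function and flag names are invented here, and real projects usually answer "yes" to more than one question, which is where the ordering matters.

```python
def pick_provider(*, llm_reads_text=False, cheap_volume=False,
                  research_or_similarity=False, breaking_news=False,
                  index_is_moat=False):
    """Walk the questions in order; the first 'yes' decides."""
    if llm_reads_text:
        return "tavily"
    if cheap_volume:
        return "brave"
    if research_or_similarity:
        return "exa"
    if breaking_news:
        return "brave"      # Tavily with topic="news" is the alternative
    if index_is_moat:
        return "brave"
    return "tavily"         # prototyping default

print(pick_provider(research_or_similarity=True))                 # exa
print(pick_provider(llm_reads_text=True, breaking_news=True))     # tavily
print(pick_provider())                                            # tavily
```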

Combining All Three: The “Search Router” Pattern

For serious agent systems, a single search provider is a brittle dependency. A common pattern in 2026 is to wrap all three behind a single internal tool that routes by query type:

from tavily import TavilyClient
from exa_py import Exa
import requests, os

tavily = TavilyClient(os.environ["TAVILY_KEY"])
exa = Exa(os.environ["EXA_KEY"])
BRAVE_TOKEN = os.environ["BRAVE_TOKEN"]

def smart_search(query: str, intent: str = "general"):
    """Route to the best search provider for the intent.

    intent: 'news' | 'research' | 'similar' | 'general'
    """
    if intent == "news":
        # Brave news endpoint, freshest index
        r = requests.get(
            "https://api.search.brave.com/res/v1/news/search",
            headers={"X-Subscription-Token": BRAVE_TOKEN, "Accept": "application/json"},
            params={"q": query, "count": 5},
            timeout=15,
        ).json()
        return [{"title": x["title"], "url": x["url"], "text": x.get("description", "")}
                for x in r.get("results", [])]

    if intent == "research":
        # Exa neural search for semantic matching
        res = exa.search_and_contents(query, type="neural", num_results=5,
                                       text={"max_characters": 1500})
        return [{"title": r.title, "url": r.url, "text": r.text} for r in res.results]

    if intent == "similar":
        # Exa find-similar (query should be a URL)
        res = exa.find_similar_and_contents(query, num_results=5, text=True)
        return [{"title": r.title, "url": r.url, "text": r.text} for r in res.results]

    # default: Tavily for LLM-optimized general retrieval
    res = tavily.search(query=query, search_depth="basic",
                        include_answer=True, max_results=5)
    return [{"answer": res["answer"]}] + \
           [{"title": r["title"], "url": r["url"], "text": r["content"]}
            for r in res["results"]]

The agent’s reasoning step picks the intent based on its own plan, and the router sends the query to the provider suited to that intent. As written it does not track quota, so falling back to whichever provider still has free credit is an extension you add yourself. Pair this with a 24-hour cache keyed by (query, intent) and your real search bill stays near zero for a long time.
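
The quota-aware part of that routing could be sketched as follows. Everything here is hypothetical scaffolding: `QuotaTracker`, the `PREFERENCE` table, and the in-process counters are assumptions made for this example, and the counts reset whenever the process restarts, so a real deployment would persist them or read usage from each provider's dashboard.

```python
class QuotaTracker:
    """Count calls per provider and pick the first one with quota left."""

    def __init__(self, limits):
        self.limits = dict(limits)
        self.used = {name: 0 for name in limits}

    def remaining(self, name):
        return self.limits[name] - self.used[name]

    def pick(self, preferred):
        """Return the first provider in `preferred` with quota, and count the call."""
        for name in preferred:
            if self.remaining(name) > 0:
                self.used[name] += 1
                return name
        raise RuntimeError("all free quotas exhausted")

# Preferred provider order per intent (illustrative, mirrors the router above).
PREFERENCE = {
    "news": ["brave", "tavily"],
    "research": ["exa", "tavily"],
    "similar": ["exa"],
    "general": ["tavily", "brave"],
}

tracker = QuotaTracker({"tavily": 1000, "brave": 2000, "exa": 1000})
print(tracker.pick(PREFERENCE["general"]))  # tavily
```

The router would call `tracker.pick(PREFERENCE[intent])` first and then dispatch to the matching branch, so a drained Tavily quota quietly shifts general queries to Brave.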

Common Gotchas

Tavily: Watch the `search_depth` Default

The Python SDK defaults to search_depth="basic" (1 credit) but the LangChain wrapper has at times defaulted to advanced (2 credits). With a 1,000-credit free tier, this halves your usable quota if you do not notice. Always pass search_depth explicitly.

Brave: Parallel Fan-Out Will Get You 429’d

The free tier caps you at exactly 1 query per second. If your LangGraph workflow does asyncio.gather() over 5 sub-queries at once, four of them come back as 429s. Either throttle the calls (see the Python client above) or upgrade to the paid plan, which lifts the limit to 20 queries/second.
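
For async code specifically, one way to keep asyncio.gather() while staying under the limit is to have each coroutine reserve a time slot before it fires. This is a minimal sketch under a few assumptions: `AsyncRateLimiter` and `fan_out` are names invented here, `search` stands for whatever async function actually calls the API, and the limiter only coordinates coroutines inside a single event loop.

```python
import asyncio
import time

class AsyncRateLimiter:
    """Space out coroutine calls so at most `rps` of them start per second."""

    def __init__(self, rps=1.0):
        self.min_interval = 1.0 / rps
        self._next_slot = 0.0  # monotonic time of the next free slot

    async def wait(self):
        # No await happens between reading and updating _next_slot, so the
        # reservation is atomic within one event loop.
        now = time.monotonic()
        start = max(now, self._next_slot)
        self._next_slot = start + self.min_interval
        if start > now:
            await asyncio.sleep(start - now)

async def fan_out(queries, search, limiter):
    """Run `search` for every query concurrently, but paced by the limiter."""
    async def one(q):
        await limiter.wait()
        return await search(q)
    return await asyncio.gather(*(one(q) for q in queries))
```

With `AsyncRateLimiter(rps=1)`, five sub-queries still go through a single gather() call, but they leave one second apart instead of all at once.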

Exa: `type="auto"` Costs More Than You Think

Exa auto-selects between neural and keyword search, and neural costs more credit. If you know your query is keyword-heavy (“Anthropic blog post May 2026”), force type="keyword" to save credit. Save type="neural" for queries that benefit from semantic matching.

All Three: Cache Aggressively

An agent that asks “what is the latest Llama model” twenty times in a debugging session burns 20 credits on the same answer. A trivial in-memory or SQLite cache keyed by the query string saves more credit than any other optimization you will do. The cache TTL should be 1 hour for general queries, 15 minutes for news, 24 hours for stable reference material.
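
A cache along those lines fits in a couple of dozen lines. Treat this as a sketch: `SearchCache`, the `TTL_SECONDS` table, and the lowercase-and-strip key normalization are choices made for this example, and `fetch` is whatever function actually calls your search provider.

```python
import time

# TTLs from the paragraph above, in seconds.
TTL_SECONDS = {"general": 3600, "news": 900, "reference": 86400}

class SearchCache:
    """In-memory cache keyed by (normalized query, intent) with per-intent TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get_or_fetch(self, query, intent, fetch):
        key = (query.strip().lower(), intent)
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now < hit[0]:
            return hit[1]  # still fresh, no credit spent
        value = fetch(query)
        ttl = TTL_SECONDS.get(intent, TTL_SECONDS["general"])
        self._store[key] = (now + ttl, value)
        return value
```

Swapping the dict for SQLite gives you a cache that survives restarts; the key and TTL logic stay the same.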

Pairing Search With a Free LLM

None of these search APIs do anything on their own — they feed text to an LLM that produces the actual user-facing answer. The cheapest production stack we have seen in 2026 pairs:

Search: Tavily free (1,000 monthly) for general retrieval + Brave free (2,000 monthly) for news fan-out
LLM: Free tier from Groq (14,400 requests/day), Gemini (1M token context), or Together AI (Llama 3.3 70B free tier)
Orchestration: CrewAI for multi-agent flows, LangGraph for stateful workflows, or a vanilla function-calling loop for simple cases
Observability: Langfuse self-hosted or Hobby tier to trace every search call and LLM call

The total monthly bill at hobby scale, with all of the above: $0. The total at small production scale (a few hundred daily users): typically $30-80, almost all of it search-API overage above the free tiers.

Frequently Asked Questions

Can I use these search APIs for commercial products?

Yes — all three offer commercial use on every tier including the free one. Read each provider’s Terms of Service for redistribution restrictions (typically you cannot resell raw search results as a competing search engine, but you can use them in any agent or end-product feature).

What about SerpAPI / ScraperAPI / SearXNG?

SerpAPI is the long-standing Google-results scraper used by many older LangChain examples. It starts at $75/month with only a 100-search trial — fine for production, expensive for prototyping. ScraperAPI is similar. SearXNG is a self-hosted metasearch aggregator — free if you host it, but the throughput and stability depend on your hosting and on upstream search engines not rate-limiting your IP.

Does Google offer a free search API in 2026?

No public, generally-available one. Google Custom Search JSON API has a free tier of 100 queries/day, but it is limited to “site search” on a list of domains you specify in advance — it is not a general web search API. Google’s Vertex AI Search is enterprise-only.

Which one works best in MCP setups?

Tavily’s official mcp-server-tavily is the most polished and the one Anthropic uses in its example MCP configs. Exa’s exa-mcp-server is also official and adds the find_similar tool which is uniquely useful inside Claude Desktop. Brave has only community-maintained MCP servers but they work fine.

Can I use these inside an MCP server I build myself?

Yes — all three are just HTTP APIs. Wrap whichever one you prefer in an MCP tool definition and your assistant inherits web search capability. See our MCP explainer for the full server pattern.

How do I know I am hitting the free-tier ceiling?

Tavily and Exa both expose usage on their dashboards in near-real time. Brave shows usage on the dashboard with a 5-10 minute delay. All three return a structured error with the rate-limit headers (x-ratelimit-remaining, retry-after) on 429 responses — log those headers in your client so you can alert before you hit the cap rather than after.
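
Logging those headers takes only a small parser. The sketch below makes some assumptions: the header names are the two mentioned above, a provider may send several comma-separated windows in one header, and `retry-after` is a plain number of seconds (anything else falls back to a default). Check what your provider actually returns before alerting on it.

```python
def parse_rate_limit(status_code, headers, default_wait=1.0):
    """Summarize rate-limit state from a response's status code and headers."""
    lowered = {k.lower(): v for k, v in headers.items()}

    # Collect every numeric window, e.g. "0, 1999" -> [0, 1999].
    remaining = []
    for part in lowered.get("x-ratelimit-remaining", "").split(","):
        part = part.strip()
        if part.isdigit():
            remaining.append(int(part))

    wait = 0.0
    if status_code == 429:
        try:
            wait = max(0.0, float(lowered.get("retry-after")))
        except (TypeError, ValueError):
            wait = default_wait  # header missing or not a plain number

    return {"remaining": remaining, "wait_seconds": wait}

print(parse_rate_limit(429, {"Retry-After": "2", "X-RateLimit-Remaining": "0, 1999"}))
```

With requests, pass `r.status_code` and `r.headers` straight in, log the result on every call, and alert when any value in `remaining` drops below a threshold you pick.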

Is there a single “best free search API”?

No, and any article that claims one is gaming a keyword. For LLM-consumed agent retrieval, Tavily is the cleanest default. For independent index and high volume, Brave wins. For semantic search and find-similar, Exa is the only real option. The “search router” pattern earlier in this guide is the answer when you cannot pick.

Bottom Line

The free-search-API market in 2026 has stabilized into three genuinely useful options, each with a clear specialty. Pick by access pattern, not by raw quota:

Building an agent that needs search? Start with Tavily. The clean text output and the monthly 1,000-credit refresh make it the lowest-friction first integration.
Need cheap volume? Add Brave. 2,000 free queries plus the cheapest paid tier in the market mean it is the natural second provider when Tavily runs out.
Doing research or similarity work? Reach for Exa. Neural and find-similar are unique capabilities the other two simply do not offer.

Wire up the router pattern, cache aggressively, and you can run a production-grade agent with web search capability for $0/month at hobby scale and roughly $30-80 a month at small production scale. Combined with a free LLM tier from Groq or Gemini, that is a complete agent stack that costs nothing meaningful until you actually have users.

Langfuse: Free Open-Source LLM Observability

toolfreebie — Thu, 28 May 2026 09:02:37 +0000

What Is Langfuse?

Langfuse is a free, open-source LLM observability platform — the tool you reach for when your AI app works in the demo and then does something baffling in production. It records every model call, agent step, retrieval, and tool use as a structured trace you can open, read, and replay. Born out of Y Combinator’s W23 batch and now one of the most-starred LLM engineering projects on GitHub (langfuse/langfuse), it has become the default “what just happened?” layer for teams shipping anything more complex than a single chat completion.

The core of Langfuse is MIT-licensed and self-hostable, which is the part that matters for this blog: you can run the entire platform on your own machine or a cheap VPS for $0, forever, with no seat limits and no trace caps. There’s also a managed Langfuse Cloud with a genuinely free Hobby tier if you’d rather not run infrastructure. Either way, the SDKs, the integrations, and the trace UI are the same.

If you’re building with any of the free AI APIs covered here — Gemini, Groq, OpenRouter, Together — Langfuse is the missing piece that turns “I think the prompt is fine” into “here is the exact request, the exact response, the latency, and the cost.” This guide covers what LLM observability actually buys you, whether Langfuse is really free, how it compares to LangSmith and Phoenix, and how to instrument your first app in about ten minutes.

Why LLM Observability Matters

Traditional application monitoring assumes deterministic code: same input, same output, and a stack trace when something breaks. LLM apps break that assumption in three ways, and each one is a reason observability stopped being optional in 2026.

Non-determinism. The same prompt can return different answers on different days. Without a recorded trace of the exact input and output, “it gave a weird answer yesterday” is unreproducible and therefore unfixable.
Hidden multi-step chains. A single user message to an agent can fan out into a dozen model calls, retrievals, and tool invocations. When the final answer is wrong, the bug is usually three steps back — a bad retrieval, a truncated context, a tool that returned an error the model ignored. You need to see the whole tree.
Cost and latency creep. Token usage is invisible until the bill arrives. Observability surfaces per-call token counts and dollar estimates so you can catch the prompt that quietly grew to 40,000 tokens of context.

LLM observability gives you a recorded, searchable history of every AI interaction: the prompts, the completions, the latency, the token cost, the retrieved documents, and the tool calls — organized as nested traces so you can drill from a user session down to the single span that misbehaved. That’s the category Langfuse sits in, alongside LangSmith, Arize Phoenix, and Helicone.

Is Langfuse Really Free? Cloud vs Self-Hosted

“Free” means two different things with Langfuse, and both are real.

Self-hosted (free forever). The Langfuse core is open source under the MIT license. You run it yourself with Docker — a Postgres database, a ClickHouse analytics store, Redis, and the Langfuse web/worker containers, all wired up by the official docker compose file. There are no trace limits, no seat limits, and no feature gates on the open-source build beyond a small set of enterprise add-ons (SSO enforcement, fine-grained RBAC, audit logs) that live behind a commercial license. For an individual or a small team, the MIT build does everything you need.

Langfuse Cloud Hobby (free tier). If you don’t want to run infrastructure, Langfuse Cloud has a free Hobby plan that includes 50,000 units per month with no credit card required, according to the Langfuse pricing page (always check the page for the current limit — these numbers move). A “unit” is roughly one ingested observation, so 50,000/month comfortably covers a side project or an early-stage app in development.

Dimension	Self-Hosted (MIT)	Cloud Hobby (Free)
Price	$0 (you pay for the server)	$0, no credit card
Trace / event volume	Unlimited	50,000 units/month
Team seats	Unlimited	Limited on free tier
Data residency	Your infrastructure	EU or US region
Setup effort	One `docker compose up`	Sign up, copy two keys
Maintenance	You own upgrades & backups	Managed for you
Enterprise extras (SSO, RBAC)	Commercial license	Paid tiers

The honest rule of thumb: prototype on Cloud Hobby because it takes ninety seconds to start, and move to self-hosted the moment you either exceed the free volume, need unlimited seats, or have data-residency requirements that rule out a third party seeing your prompts.

Langfuse vs LangSmith vs Phoenix vs Helicone

Four tools dominate free-tier LLM observability in 2026, and they make different trade-offs between openness, framework lock-in, and how you wire them up.

Tool	Open source	Free path	Integration model	Best for
Langfuse	Yes (MIT core)	Self-host free + Cloud Hobby (50k units/mo)	SDK + decorators + OpenTelemetry, framework-agnostic	Teams who want a full platform they can also self-host
LangSmith	No (managed SaaS)	Free Developer plan (~5,000 traces/mo, 1 seat)	Tightest with LangChain / LangGraph	Teams already all-in on the LangChain stack
Arize Phoenix	Yes	Fully free to self-host	OpenTelemetry / OpenInference, notebook-first	Data scientists debugging in notebooks & evals
Helicone	Yes	Free tier (~10,000 requests/mo)	Proxy — change one base URL	The absolute lowest-effort drop-in logging

(Free-tier numbers above are from each vendor’s public pricing/docs and change often — verify on the linked page before you rely on them.)

The clearest dividing line is how they capture data. Helicone is a proxy: you point your OpenAI base URL at Helicone and it logs every request passing through. That means zero code changes, but it only sees what flows through the proxy. LangSmith and Langfuse use an SDK/instrumentation model: you wrap your calls or add a decorator, which means they can capture non-LLM steps (retrievals, tool calls, business logic) as spans in the same trace. Phoenix leans on the OpenTelemetry standard, which makes it portable but a little more setup-heavy.

Langfuse’s pitch is “open like Phoenix, full-featured like LangSmith, framework-agnostic unlike either.” If you want one platform that handles tracing, prompt management, and evals, and you want the option to self-host it for free, Langfuse is the broadest pick. If you live entirely inside LangGraph, LangSmith’s deeper native hooks may win on convenience.

Core Features That Matter

1. Tracing and Spans

The foundation. A trace represents one unit of work — typically one user request — and contains nested spans for each step inside it: the retrieval, each LLM call, each tool invocation. Langfuse shows this as an expandable tree with timing, token counts, and cost on every node. When an agent gives a bad answer, you open the trace and walk down to the exact span where the context went wrong. Traces can be grouped into sessions (a multi-turn conversation) and attributed to a user, so you can answer “show me everything user 4471 did this week.”
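The trace-and-span structure is easier to picture as a small tree. The sketch below is a hypothetical model (the `Span` class and its fields are illustrative, not the Langfuse schema) showing how nested spans let you walk down to the step that dominated a request.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step inside a trace (hypothetical model, not the Langfuse schema)."""
    name: str
    duration_ms: float
    children: list["Span"] = field(default_factory=list)


def slowest_leaf(span: Span) -> Span:
    """Walk the tree and return the leaf span with the largest duration."""
    if not span.children:
        return span
    return max((slowest_leaf(child) for child in span.children), key=lambda s: s.duration_ms)


def render(span: Span, depth: int = 0) -> list[str]:
    """Return the tree as indented lines, one per span."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms:.0f} ms)"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# One user request: a retrieval step followed by a model call.
trace = Span("answer", 1300, [
    Span("retrieve", 180),
    Span("llm-call", 1100),
])

print("\n".join(render(trace)))
print("slowest step:", slowest_leaf(trace).name)
```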

2. Prompt Management

Langfuse stores your prompts as versioned, named objects you fetch at runtime instead of hardcoding strings. You edit a prompt in the UI, label a version production, and your app picks it up without a redeploy. Every version is linked to the traces that used it, so you can see whether v4 of your system prompt actually reduced hallucinations versus v3. This is the feature that turns prompt engineering from “edit code, commit, deploy, hope” into something measurable.

3. Evaluations and Scoring

Langfuse can attach scores to any trace — from explicit user thumbs-up/down, from an LLM-as-a-judge evaluator, from a custom function, or from manual human annotation in the UI. Over time these scores become quality metrics you can chart: “answer relevance dropped 8% after we switched models.” You can run evaluators automatically on a sample of production traffic or against a fixed test set.

4. Datasets

A dataset is a curated set of inputs (and optional expected outputs) you run your app against to catch regressions before they ship. The natural workflow: find a trace where the app failed, click “add to dataset,” and that real-world failure becomes a permanent test case. Re-run the dataset after every prompt or model change and compare scores side by side.

5. Playground

An in-app prompt playground lets you grab a failing trace, tweak the prompt or swap the model, and re-run it immediately to see if your fix works — without leaving the tool or wiring up a script. It connects to your model providers, so you can A/B a prompt against Gemini and Groq in the same window.

6. Metrics and Dashboards

Aggregate views over all your traces: total cost per day, p95 latency per model, token usage by feature, score trends over time. This is where you notice that one endpoint is responsible for 70% of your spend, or that latency doubled the day you added a reranking step.
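If you export trace rows, the same aggregates are simple to reproduce. This is a minimal sketch over made-up data; the row layout `(endpoint, latency_ms, cost_usd)` is an assumption for illustration, not a Langfuse export format.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical exported trace rows: (endpoint, latency in ms, cost in USD).
rows = [
    ("/chat", 820, 0.004), ("/chat", 910, 0.005), ("/chat", 2400, 0.006),
    ("/summarize", 3100, 0.031), ("/summarize", 2900, 0.029),
    ("/classify", 210, 0.001), ("/classify", 190, 0.001),
]


def cost_share(rows):
    """Return each endpoint's fraction of total spend."""
    totals = defaultdict(float)
    for endpoint, _latency, cost in rows:
        totals[endpoint] += cost
    grand_total = sum(totals.values())
    return {endpoint: cost / grand_total for endpoint, cost in totals.items()}


def p95(values):
    """95th percentile using the 'inclusive' method from the statistics module."""
    return quantiles(values, n=100, method="inclusive")[94]


shares = cost_share(rows)
for endpoint, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{endpoint}: {share:.0%} of spend")
print("p95 latency (ms):", p95([latency for _e, latency, _c in rows]))
```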

How to Self-Host Langfuse for Free

The fastest way to a free, unlimited Langfuse instance is the official Docker Compose stack. On any machine with Docker installed — including a free Oracle Cloud ARM VPS:

git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up -d

That brings up the full stack (Langfuse web + worker, Postgres, ClickHouse, Redis) and serves the UI at http://localhost:3000. Create an account on first load — it’s stored in your own database — make a project, and copy the public and secret API keys it generates. You now have a production-grade observability platform that no one else can see, with no trace limits, running for the cost of the server.

For production you’ll want to put it behind HTTPS and back up Postgres and ClickHouse, but for development the compose file is genuinely one command. The official self-hosting docs cover the Kubernetes Helm chart and managed-database setups when you outgrow single-node.

Instrumenting Your App: Three Ways

Langfuse offers progressively deeper levels of instrumentation. Start with the first one; reach for the others as your app grows. All three send data to the same project — set these environment variables once and every example below works against either Cloud or your self-hosted instance:

export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
# Cloud EU: https://cloud.langfuse.com  |  Cloud US: https://us.cloud.langfuse.com
# Self-hosted: http://localhost:3000
export LANGFUSE_HOST="https://cloud.langfuse.com"
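A common first-run failure is a missing or swapped key. The helper below is a small sketch, not part of the Langfuse SDK, that checks the three variables before your app starts; the prefix checks simply mirror the `pk-lf-...` and `sk-lf-...` placeholders shown above.

```python
import os

REQUIRED = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_HOST")


def check_langfuse_env(env=os.environ) -> list[str]:
    """Return human-readable problems; an empty list means the config looks sane."""
    problems = [f"{name} is not set" for name in REQUIRED if not env.get(name)]
    public = env.get("LANGFUSE_PUBLIC_KEY", "")
    secret = env.get("LANGFUSE_SECRET_KEY", "")
    # Prefixes follow the placeholders used in this article.
    if public and not public.startswith("pk-lf-"):
        problems.append("LANGFUSE_PUBLIC_KEY does not start with pk-lf-")
    if secret and not secret.startswith("sk-lf-"):
        problems.append("LANGFUSE_SECRET_KEY does not start with sk-lf-")
    host = env.get("LANGFUSE_HOST", "")
    if host and not host.startswith(("http://", "https://")):
        problems.append("LANGFUSE_HOST should include http:// or https://")
    return problems


for problem in check_langfuse_env():
    print("config issue:", problem)
```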

Way 1: The OpenAI Drop-In Wrapper (zero refactor)

If your code already uses the OpenAI SDK — which is true for most free AI APIs, since Groq, Together, Mistral, and OpenRouter are all OpenAI-compatible — you change exactly one import line:

pip install langfuse openai

# before:  from openai import OpenAI
from langfuse.openai import openai   # drop-in replacement

client = openai.OpenAI(
    base_url="https://api.groq.com/openai/v1",   # any OpenAI-compatible endpoint
    api_key="YOUR_GROQ_KEY",
)

resp = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain LLM observability in one sentence."}],
)
print(resp.choices[0].message.content)
# This call is now automatically traced in Langfuse: prompt, completion,
# token usage, latency, and cost — with zero other changes.

Every completion you make now shows up as a trace. This is the lowest-effort way to start and works against any OpenAI-compatible free API.

Way 2: The @observe Decorator (capture your own functions)

To see your business logic — not just the model call — wrap any function with the @observe decorator. Nested decorated functions automatically become nested spans in the same trace:

from langfuse import observe
from langfuse.openai import openai

@observe()
def retrieve(question: str) -> str:
    # your vector search here; return the context string
    return "...retrieved context..."

@observe()
def answer(question: str) -> str:
    context = retrieve(question)            # becomes a child span
    resp = openai.chat.completions.create(  # becomes another child span
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

answer("What does Langfuse trace?")
# One trace, three spans: answer -> retrieve, answer -> openai call.

Way 3: The LangChain / LangGraph Callback

If you build with LangChain or LangGraph, pass Langfuse’s callback handler and it captures the whole chain automatically:

from langfuse.langchain import CallbackHandler

handler = CallbackHandler()
# `chain` is assumed to be a LangChain runnable you have already built
result = chain.invoke(
    {"question": "What is LLM observability?"},
    config={"callbacks": [handler]},
)

For TypeScript / Node projects, the same drop-in pattern exists:

import OpenAI from "openai";
import { observeOpenAI } from "langfuse";

const client = observeOpenAI(new OpenAI({
  baseURL: "https://api.groq.com/openai/v1",
  apiKey: process.env.GROQ_API_KEY,
}));

const resp = await client.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [{ role: "user", content: "Hello from Node, traced by Langfuse." }],
});

Because Langfuse v3 is built on OpenTelemetry under the hood, any OTel-instrumented library or framework can also feed it — useful if you’re standardizing telemetry across services. Check the Langfuse docs for the current SDK API, which evolves between major versions.

Tracing a RAG Pipeline End-to-End

RAG is where observability earns its keep, because a wrong answer can come from retrieval or generation and the two failure modes look identical from the outside. Picture a typical stack: a question comes in, you embed it, search a vector database, rerank with Cohere, stuff the top chunks into a prompt, and generate an answer.

With each step wrapped in @observe, a single Langfuse trace shows you:

The exact query embedding step and its latency
The documents retrieved from the vector store, with their similarity scores — so you can instantly see if retrieval pulled garbage
The reranked order after Cohere, to confirm the reranker actually helped
The final prompt that went to the model, including exactly which chunks made it into the context window
The completion, token count, and cost

When a user reports “it said we don’t offer refunds, but we do,” you open their trace and the answer is right there: either the refund policy chunk wasn’t retrieved (a retrieval/embedding problem) or it was retrieved but the model ignored it (a prompt problem). Five seconds of looking replaces an hour of guessing. That single capability — being able to see which half of the RAG pipeline failed — is the most common reason teams adopt Langfuse.
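The triage described above can be written down as a tiny rule. This sketch assumes you can read three things off the trace: the ID of the chunk that should have answered the question, the IDs returned by the vector store, and the IDs that made it into the final prompt. All names are hypothetical, and the middle case (retrieved but cut before the prompt) is an extra branch added for completeness.

```python
def diagnose(expected_chunk_id: str, retrieved_ids: list[str], context_ids: list[str]) -> str:
    """Classify a wrong answer by where the expected chunk dropped out of the pipeline."""
    if expected_chunk_id not in retrieved_ids:
        return "retrieval problem: the chunk never came back from the vector store"
    if expected_chunk_id not in context_ids:
        return "ranking problem: the chunk was retrieved but cut before the prompt"
    return "prompt problem: the chunk was in context and the model ignored it"


# The refund-policy chunk was retrieved and included, yet the answer was wrong.
print(diagnose("refund-policy", ["shipping", "refund-policy"], ["refund-policy"]))
```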

Prompt Management Without Redeploys

Once your prompts live in Langfuse, you fetch them by name at runtime:

from langfuse import Langfuse

langfuse = Langfuse()
prompt = langfuse.get_prompt("support-agent")   # fetches the 'production' label by default
compiled = prompt.compile(customer_name="Ada", product="Widget Pro")

# use compiled as your system prompt; cached client-side, linked to the trace

Now editing the support agent’s behavior is a UI change, not a code change. Non-engineers can iterate on copy, you can roll back a bad version with one click, and because Langfuse links each prompt version to the traces and scores it produced, you get a real before/after on quality instead of vibes. Prompts are cached on the client so the fetch doesn’t add latency to your hot path.
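To show why a label makes rollback a one-click operation, here is an in-memory sketch of versioned prompts with a movable `production` label. It is an illustration only, not how Langfuse is implemented; the `PROMPTS` store, both functions, and the double-brace placeholder syntax are assumptions made for the example.

```python
import re

# Hypothetical stand-in for a prompt store: numbered versions plus movable labels.
PROMPTS = {
    "support-agent": {
        "versions": {
            3: "You are a support agent for {{product}}.",
            4: "You are a support agent for {{product}}. Greet {{customer_name}} by name.",
        },
        "labels": {"production": 4},
    }
}


def get_prompt(name: str, label: str = "production") -> str:
    """Return the template text that the given label currently points at."""
    entry = PROMPTS[name]
    return entry["versions"][entry["labels"][label]]


def compile_prompt(template: str, **variables: str) -> str:
    """Fill {{name}} placeholders; unknown placeholders are left untouched."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        return str(variables[key]) if key in variables else match.group(0)
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


print(compile_prompt(get_prompt("support-agent"), customer_name="Ada", product="Widget Pro"))

# Rolling back is just moving the label; no application code changes.
PROMPTS["support-agent"]["labels"]["production"] = 3
print(compile_prompt(get_prompt("support-agent"), customer_name="Ada", product="Widget Pro"))
```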

Running Evaluations

The maturity curve for an AI app usually goes: ship it, watch traces, notice a recurring failure, turn that failure into a dataset entry, then run evaluations so the failure can’t silently come back. Langfuse supports all of it:

Online evaluation — run an LLM-as-a-judge evaluator on a sample of live traffic and chart the score over time.
Offline evaluation — run your app against a fixed dataset before every release and diff the scores against the last run.
Human annotation — queue traces for a teammate to label in the UI, building a gold-standard set.

The judge model can be any provider you connect — including a free one. Using Gemini or a Llama model on Groq as your evaluator keeps the whole eval loop at $0, which matters because evaluation can easily run more model calls than production itself.
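The offline loop boils down to "run the dataset, score it, compare with the last run". A minimal sketch, using stub apps and an exact-match judge in place of a real model and an LLM-as-a-judge call (every name here is hypothetical):

```python
dataset = [
    {"input": "refund policy?", "expected": "30 days"},
    {"input": "shipping time?", "expected": "3 days"},
]


def exact_match_judge(question: str, answer: str, expected: str) -> float:
    """Stand-in for an LLM judge: 1.0 if the answer matches, else 0.0."""
    return 1.0 if answer.strip().lower() == expected.strip().lower() else 0.0


def run_eval(app, judge, dataset) -> float:
    """Return the mean judge score of `app` over `dataset`."""
    scores = [judge(item["input"], app(item["input"]), item["expected"]) for item in dataset]
    return sum(scores) / len(scores)


# Two stub "releases" of the app; a real app would call a model here.
def app_v1(question: str) -> str:
    return {"refund policy?": "no refunds", "shipping time?": "3 days"}[question]


def app_v2(question: str) -> str:
    return {"refund policy?": "30 days", "shipping time?": "3 days"}[question]


baseline = run_eval(app_v1, exact_match_judge, dataset)
candidate = run_eval(app_v2, exact_match_judge, dataset)
print(f"baseline={baseline:.2f} candidate={candidate:.2f} delta={candidate - baseline:+.2f}")
```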

When to Use Langfuse vs Alternatives

You want one open-source platform for tracing + prompts + evals, with the option to self-host free → Langfuse
You are all-in on LangChain / LangGraph and want the tightest native integration → LangSmith
You debug mostly in Jupyter notebooks and care most about evals → Arize Phoenix
You want the absolute lowest-effort logging and only call one OpenAI-compatible API → Helicone (proxy, one URL change)
You have strict data-residency rules and prompts can’t leave your network → self-hosted Langfuse or Phoenix
You’re prototyping today and want zero setup → Langfuse Cloud Hobby (free, no card)

FAQ

Is Langfuse really free?

Yes, two ways. The MIT-licensed core is free to self-host with no trace, seat, or feature caps (a few enterprise extras like SSO enforcement need a commercial license). Langfuse Cloud also has a free Hobby tier with 50,000 units/month and no credit card. You only pay if you want managed hosting above the free volume or enterprise governance features.

Does Langfuse add latency to my app?

Negligibly. The SDK sends trace data asynchronously in the background after your response is already returned, and prompts are cached client-side. Your users don’t wait on Langfuse.
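The reason background sending stays off your hot path can be shown with a queue and a worker thread. This is a generic sketch of the pattern, not the Langfuse SDK's actual internals; `BackgroundSender` and `send_batch` are hypothetical names.

```python
import queue
import threading


class BackgroundSender:
    """Accept events instantly and ship them in batches from a worker thread."""

    def __init__(self, send_batch, batch_size: int = 10):
        self._queue = queue.Queue()
        self._send_batch = send_batch
        self._batch_size = batch_size
        threading.Thread(target=self._run, daemon=True).start()

    def enqueue(self, event) -> None:
        """Called from request code; returns without waiting on the network."""
        self._queue.put(event)

    def _run(self) -> None:
        while True:
            batch = [self._queue.get()]  # block until at least one event exists
            while len(batch) < self._batch_size:
                try:
                    batch.append(self._queue.get_nowait())
                except queue.Empty:
                    break
            self._send_batch(batch)
            for _ in batch:
                self._queue.task_done()

    def flush(self) -> None:
        """Block until every queued event has been handed to send_batch."""
        self._queue.join()


sent = []
sender = BackgroundSender(sent.extend)  # sent.extend stands in for an HTTP call
for i in range(25):
    sender.enqueue(i)
sender.flush()
print(len(sent), "events delivered")
```

The one place this pattern bites is short-lived scripts and serverless functions, which can exit before the worker drains the queue. Check the SDK docs for its flush call and use it before the process ends.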

Do I have to use LangChain to use Langfuse?

No — that’s the point. Langfuse is framework-agnostic. The OpenAI drop-in wrapper and the @observe decorator work with plain SDK calls, CrewAI, LlamaIndex, raw HTTP, or your own custom orchestration. LangChain is just one of many supported integrations.

What’s the difference between Langfuse and LangSmith?

LangSmith is a closed-source managed product from the LangChain team, with the deepest hooks into the LangChain ecosystem. Langfuse is open-source, can be self-hosted for free, and is deliberately framework-agnostic. If you’re not married to LangChain — or you need to keep data on your own infrastructure — Langfuse is the more flexible choice.

Can I use Langfuse with free APIs like Gemini, Groq, or DeepSeek?

Yes. Any OpenAI-compatible endpoint works with the drop-in wrapper — just set the base_url. Groq, Together, DeepSeek, Mistral, and OpenRouter all qualify, and Gemini works through its OpenAI-compatible layer or a dedicated integration.

Does Langfuse store my prompts and completions?

Yes, that’s how tracing works. On Cloud, that data lives in the region you choose (EU or US). If your prompts contain sensitive data you can’t send to a third party, self-host: then the data never leaves your infrastructure. The SDK also supports masking specific fields before they’re sent.
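A masking function is usually a small recursive scrubber. The sketch below redacts email addresses from nested trace data; the regex is deliberately simple and the function is illustrative. How it gets registered with the SDK (parameter name, signature) should be taken from the current Langfuse docs rather than from this example.

```python
import re

# Simple email pattern for illustration; real PII detection needs more than this.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def mask(data):
    """Recursively replace email addresses in strings, lists, and dicts."""
    if isinstance(data, str):
        return EMAIL.sub("[REDACTED_EMAIL]", data)
    if isinstance(data, dict):
        return {key: mask(value) for key, value in data.items()}
    if isinstance(data, list):
        return [mask(item) for item in data]
    return data  # numbers, None, and other types pass through unchanged


print(mask({"input": "Contact ada@example.com about the refund", "attempt": 1}))
```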

Can it track cost?

Yes. Langfuse computes per-call token usage and a dollar estimate based on each model’s pricing, then aggregates it into dashboards by day, model, user, or feature — so you can find your most expensive endpoint at a glance.
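The underlying arithmetic is tokens times price, summed by whatever dimension you care about. A minimal sketch with made-up model names and prices (the `PRICES` table is hypothetical; real figures come from each provider's pricing page):

```python
from collections import defaultdict

# Hypothetical USD prices per 1M tokens.
PRICES = {
    "model-a": {"input": 0.15, "output": 0.60},
    "model-b": {"input": 3.00, "output": 15.00},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar estimate for one call: tokens times the per-million price."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def cost_by(calls, key: str) -> dict:
    """Aggregate call costs by any field on the call record, e.g. 'user' or 'model'."""
    totals = defaultdict(float)
    for call in calls:
        totals[call[key]] += call_cost(call["model"], call["input_tokens"], call["output_tokens"])
    return dict(totals)


calls = [
    {"user": "u1", "model": "model-a", "input_tokens": 2000, "output_tokens": 500},
    {"user": "u2", "model": "model-b", "input_tokens": 2000, "output_tokens": 500},
]
print(cost_by(calls, "user"))
```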

What database does self-hosted Langfuse need?

The current architecture uses Postgres for transactional data and ClickHouse for high-volume trace analytics, plus Redis for queuing. The official Docker Compose file provisions all of them, so you don’t assemble it by hand.

Use Langfuse with OpenClaw

OpenClaw is an AI agent platform for orchestrating multi-step automated workflows — exactly the kind of long-running, multi-call system where a single failed step is otherwise invisible. Pointing OpenClaw’s model calls at the Langfuse-wrapped client gives every automated run a full trace tree.

A practical pairing: OpenClaw runs an unattended nightly pipeline (summarize new tickets, draft responses, flag anomalies). Each run is one Langfuse trace, with a span for every model call and tool use. In the morning you don’t re-read logs — you scan the Langfuse dashboard for any trace with a low score or an error span, open just those, and see exactly which step went sideways. Wire OpenClaw and Langfuse to the same free OpenRouter or Gemini key and the whole observe-and-iterate loop costs nothing.

Final Verdict

Langfuse is the right default in 2026 for anyone shipping an LLM app who has been burned by a bug they couldn’t reproduce. It captures the full trace tree, manages your prompts as versioned objects, and runs evaluations — and it does all of that as an open-source platform you can self-host for free with no caps, or run on a free Cloud tier in ninety seconds. The framework-agnostic SDK means it fits whatever stack you already have, and the OpenAI drop-in wrapper means your first trace is one import line away.

LangSmith is the smoother ride if you live entirely in LangChain, Phoenix is the notebook-native choice for evals, and Helicone wins on pure zero-effort logging. But for the broadest combination of openness, features, and a real free path, Langfuse is the one to install first. Spin up the Docker stack or grab a Cloud Hobby key, change one import in your app, and watch your first trace appear — then ask yourself how you ever debugged AI without it.

Which Free Text-to-Speech API Should You Use in 2026?

toolfreebie — Thu, 28 May 2026 08:57:09 +0000

If you searched for a free text-to-speech API, you are almost certainly building one of three things: a voice feature for an app that needs to read text aloud, a content pipeline that turns articles or scripts into audio, or a voice agent that needs to speak back after it transcribes. The good news is that 2026 is the best year ever to do this for free. The catch is that “free” means three completely different things across the major providers, and picking the wrong one wastes either your money or your weekend.

Three names dominate the search results: Google Cloud Text-to-Speech, ElevenLabs, and OpenAI. Google runs a genuine recurring free tier that refills every month. ElevenLabs has the best-sounding voices and the most generous voice-cloning features, but the smallest free quota. OpenAI has no free tier at all — yet it is so cheap, and so trivial to wire into code you already wrote, that it belongs in any honest comparison.

This guide compares all three on the metrics that decide the question: the real free-tier ceiling, what you pay once you cross it, voice quality and count, language coverage, latency, and the licensing fine print that quietly blocks commercial use on some “free” tiers. Every number links back to the provider’s own pricing or docs page — nothing here is invented benchmark theatre.

The 30-Second Answer

Provider	Free path	Voice quality	Paid rate (cheapest tier)	Best for
Google Cloud TTS	Recurring monthly free tier (renews forever)	Very good (WaveNet / Neural2 / Chirp 3)	$4/1M chars (Standard), $16/1M (WaveNet)	High-volume production audio on a permanent free quota
ElevenLabs	10,000 credits/month, no card	Best in class, plus instant voice cloning	$5/mo (30K credits) Starter	Narration, audiobooks, character voices, cloning
OpenAI TTS	No free tier — pay as you go	Good, steerable with `gpt-4o-mini-tts`	$15/1M chars (`tts-1`)	Adding voice to an app that already calls OpenAI

If you want a free quota that resets every single month and never expires, Google Cloud TTS is the only one that fits — up to 4 million characters of Standard audio per month, free, indefinitely. If you care about how the voice sounds above everything else — narration, audiobooks, game characters, or cloning your own voice — ElevenLabs wins on quality even though its free quota is small. If you already have an OpenAI key wired into your codebase and just want your app to talk, OpenAI TTS is the path of least friction, even though there is no free tier to speak of.

The rest of this article unpacks why.

Why “Free Text-to-Speech API” Is Worth Searching For

Text-to-speech used to be either robotic and free (the old espeak era) or human-sounding and expensive. That gap closed in 2024–2025. Neural TTS that is genuinely hard to distinguish from a human reader is now a commodity, and the providers compete on price and free quota rather than raw quality.

The reason a free tier matters becomes obvious the moment you run the numbers on a real workload. Take a blog-to-podcast tool that converts 50 articles a month, each averaging 8,000 characters:

50 × 8,000 = 400,000 characters/month
On Google Cloud Standard voices: $0 — comfortably inside the 4M-character free tier
On Google WaveNet voices: $0 — inside the 1M-character premium free tier
On OpenAI tts-1: 400K × $15/1M = $6.00/month
On ElevenLabs: 400K characters far exceeds the 10K free credits, and at 1 credit per character it also exceeds the 100K credits in the $22/month Creator plan, so you would need a higher tier still

The same workload ranges from free to more than $22/month depending purely on which provider you pick. That is the entire reason this comparison exists.
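The arithmetic above is easy to rerun for your own volume. This sketch uses only the free allowances and per-million rates quoted in this article (verify them on each provider's pricing page); the function name is hypothetical.

```python
CHARS = 50 * 8_000  # 50 articles x 8,000 characters = 400,000 characters/month


def overage_cost(chars: int, free_chars: int, usd_per_million: float) -> float:
    """Monthly cost after subtracting a free allowance."""
    billable = max(0, chars - free_chars)
    return billable * usd_per_million / 1_000_000


google_standard = overage_cost(CHARS, 4_000_000, 4.00)   # 0.0
google_wavenet = overage_cost(CHARS, 1_000_000, 16.00)   # 0.0
openai_tts1 = overage_cost(CHARS, 0, 15.00)              # 6.0

# ElevenLabs bills in credits, so compare credits needed with plan allowances.
elevenlabs_credits = CHARS * 1  # 1 credit per character on Multilingual v2

print(f"Google Standard: ${google_standard:.2f}")
print(f"Google WaveNet:  ${google_wavenet:.2f}")
print(f"OpenAI tts-1:    ${openai_tts1:.2f}")
print(f"ElevenLabs:      {elevenlabs_credits:,} credits needed vs 10,000 free")
```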

What “free” actually means in TTS (three different shapes)

There are three distinct shapes of “free text-to-speech API” in 2026, and conflating them is the most common mistake:

Recurring free tier: A quota that resets every month, forever, as long as your account is in good standing. Google Cloud, Microsoft Azure, and ElevenLabs all do this (in very different sizes). This is the only shape that supports an ongoing free product.
Time-limited free tier: A generous quota that only lasts your first 12 months. Amazon Polly uses this. Great for a launch year, then it disappears.
Pay-as-you-go, no free tier: No standing free quota at all, but the per-character price is so low it is effectively free at small volume. OpenAI is the headline example.

A recurring free tier is what you want for a side project or a low-volume production feature. Pay-as-you-go is what you want when the integration friction of a second vendor outweighs a few dollars a month. Knowing which shape you are signing up for prevents the nasty surprise of a “free” tier evaporating after a year.

Google Cloud Text-to-Speech: The Only True Recurring Free Tier

Google Cloud Text-to-Speech is the workhorse answer for anyone who needs real volume without a bill. Unlike a one-time signup credit, Google’s free tier renews every month and never expires, which makes it the closest thing to a permanently free TTS API at scale.

The free tier (the real numbers)

Google’s published free monthly allowances, by voice family, at the time of writing:

Voice type	Free per month	Paid rate after free tier
Standard (basic neural)	0–4 million characters	$4.00 / 1M characters
WaveNet / Neural2 (premium)	0–1 million characters	$16.00 / 1M characters
Studio (long-form premium)	0–100K characters	$160 / 1M characters

The 4-million-character Standard free tier is the headline. That is roughly 66 hours of spoken audio every month at an average speaking rate — enough to run a daily news-reader bot, an accessibility “read this page aloud” feature, or a blog-to-audio pipeline indefinitely without paying a cent. The premium WaveNet/Neural2 tier (1M chars free) is where you go when you want the more natural-sounding voices and can stay under ~16 hours of audio per month.
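The hours figures follow from a speaking-rate assumption. The sketch below assumes about 1,000 characters per minute of audio, which is the rate implied by the numbers in this article; real output varies with voice, language, and punctuation.

```python
CHARS_PER_MINUTE = 1_000  # assumed average; actual pace varies by voice and text


def audio_hours(chars: int, chars_per_minute: int = CHARS_PER_MINUTE) -> float:
    """Approximate hours of speech produced by a number of characters."""
    return chars / chars_per_minute / 60


print(round(audio_hours(4_000_000), 1))  # 66.7 hours on the Standard free tier
print(round(audio_hours(1_000_000), 1))  # 16.7 hours on the premium free tier
```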

Voices and languages

Google ships 380+ voices across 50+ languages and variants, with full SSML support — so you can control pauses, pronunciation, pitch, speaking rate, and emphasis with markup. The newer Chirp 3: HD voices push quality close to ElevenLabs for supported languages. The trade-off versus ElevenLabs is that Google does not offer arbitrary instant voice cloning on the public API; you pick from the catalogue.
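As an illustration of the SSML control mentioned above, this sketch builds a small SSML document with pauses between sentences and a speaking-rate hint. `<speak>`, `<break>`, and `<prosody>` are standard SSML elements; the helper itself is hypothetical, and which attributes a given voice honors is something to confirm in Google's SSML reference.

```python
from xml.sax.saxutils import escape


def build_ssml(sentences: list[str], pause_ms: int = 400, rate: str = "medium") -> str:
    """Join sentences with a fixed pause and wrap them in a prosody rate hint."""
    parts = [escape(sentence) for sentence in sentences]  # escape &, <, > in user text
    body = f'<break time="{pause_ms}ms"/>'.join(parts)
    return f'<speak><prosody rate="{rate}">{body}</prosody></speak>'


print(build_ssml(["Welcome back.", "Here are today's headlines."], pause_ms=600, rate="slow"))
```

With the Python client, a string like this would go into the synthesis input's `ssml` field instead of `text`.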

Code: synthesize speech with Google Cloud TTS

The REST API takes JSON in, returns base64-encoded audio:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {"text": "Hello from a free text to speech API."},
    "voice": {"languageCode": "en-US", "name": "en-US-Standard-C"},
    "audioConfig": {"audioEncoding": "MP3"}
  }' \
  "https://texttospeech.googleapis.com/v1/text:synthesize" \
  | jq -r '.audioContent' | base64 --decode > out.mp3

Python with the official client library:

from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="Hello from a free text to speech API."),
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Standard-C",  # swap to en-US-Neural2-F for premium
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3,
        speaking_rate=1.0,
    ),
)

with open("out.mp3", "wb") as f:
    f.write(response.audio_content)

Where Google Cloud TTS is a poor fit

Requires a GCP account with a credit card on file. You won’t be charged inside the free tier, but the card and billing setup are mandatory — a higher barrier than ElevenLabs’ email-only signup.
No arbitrary voice cloning. Custom Voice exists but is an enterprise onboarding process, not a self-serve “upload 30 seconds of audio” feature like ElevenLabs.
Auth is heavier. Service-account JSON or ADC, not a single bearer token you paste into a header. Worth the setup for the free volume, but it is a setup.

ElevenLabs: Best Voice Quality and Free Voice Cloning

ElevenLabs is the provider people reach for when the sound matters more than the price. Its voices set the bar for emotional range, breath, and prosody, and it is the only major option where instant voice cloning and a large public voice library are first-class, self-serve features.

The free tier: 10,000 credits/month

ElevenLabs gives every new account 10,000 credits per month, no credit card required. For the standard Multilingual v2 model, that works out to roughly 10 minutes of generated audio per month. The lighter Flash v2.5 and Turbo v2.5 models consume half a credit per character, so the same quota stretches to about 20 minutes of audio if you use them.

Two pieces of fine print matter a lot:

Attribution is required on the free tier. You must credit ElevenLabs when you publish audio generated on the free plan.
Commercial use requires a paid plan. The free tier is for non-commercial use; the moment you monetize the output you need at least the $5/month Starter plan (30,000 credits), which also removes attribution and unlocks instant voice cloning.

The 10K free credits are best understood as a high-quality evaluation and hobby tier, not a free production backend. If voice quality is your priority and your volume is genuinely tiny — a personal project, a demo, a handful of clips — it is excellent. If you need hours of audio per month for free, Google wins on quota.

Models and languages

Model	Strength	Credit cost
`eleven_multilingual_v2`	Highest quality, most expressive, 29 languages	1 credit / char
`eleven_flash_v2_5`	~75 ms latency, ideal for real-time voice agents	0.5 credit / char
`eleven_turbo_v2_5`	Balance of quality and speed	0.5 credit / char
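Using the credit costs in the table above, you can check how far the free quota stretches per model. This is a sketch under the article's figures; rounding fractional credits up is an assumption, not documented ElevenLabs behavior.

```python
import math

# Credits per character, as listed in the table above.
CREDITS_PER_CHAR = {
    "eleven_multilingual_v2": 1.0,
    "eleven_flash_v2_5": 0.5,
    "eleven_turbo_v2_5": 0.5,
}
FREE_CREDITS_PER_MONTH = 10_000


def credits_needed(text: str, model_id: str) -> int:
    """Credits a piece of text would consume (rounded up; rounding is an assumption)."""
    return math.ceil(len(text) * CREDITS_PER_CHAR[model_id])


def free_chars(model_id: str) -> int:
    """Characters per month that fit inside the free quota for a given model."""
    return int(FREE_CREDITS_PER_MONTH / CREDITS_PER_CHAR[model_id])


for model_id in CREDITS_PER_CHAR:
    print(model_id, free_chars(model_id), "characters/month on the free tier")
```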

Code: synthesize speech with ElevenLabs

curl -X POST \
  "https://api.elevenlabs.io/v1/text-to-speech/JBFqnCBsd6RMkjVDRZzb" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello from the best-sounding free text to speech API.",
    "model_id": "eleven_multilingual_v2"
  }' \
  --output out.mp3

Python with the official SDK:

import os

from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",   # a stock library voice
    model_id="eleven_flash_v2_5",       # low-latency for agents
    text="Hello from the best-sounding free text to speech API.",
    output_format="mp3_44100_128",
)

with open("out.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)

Where ElevenLabs is a poor fit

Tiny free quota. 10 minutes of audio per month is for evaluation, not for shipping a free product at any volume.
No commercial use without paying. If your project earns money, the free tier is off the table by license, regardless of volume.
Per-character cost is the highest of the three at scale. You pay for the quality. For plain functional narration where any neural voice is fine, Google or OpenAI is cheaper.

OpenAI TTS: No Free Tier, but Cheap and Frictionless

OpenAI’s audio API has no recurring free tier — every character is billed against your OpenAI usage. It earns a place in this comparison anyway, because the per-character price is low enough to be effectively free at hobby volume, and because if you already call OpenAI for chat or Whisper, adding speech is one more method on a client you have already configured.

Pricing and models

Model	Description	Price
`tts-1`	Standard quality, lowest latency	$15 / 1M characters
`tts-1-hd`	Higher audio fidelity	$30 / 1M characters
`gpt-4o-mini-tts`	Newer, steerable — you can instruct tone and delivery	Billed in audio tokens (≈ a few cents per long passage)

At $15 per million characters, generating 10,000 characters — roughly the same audio length as ElevenLabs’ entire monthly free quota — costs $0.15. For a personal project that produces a few thousand characters a day, you might spend under a dollar a month. There is no free tier, but there is also no quota to blow through; you simply pay for what you use.

The standout feature of gpt-4o-mini-tts is steerability: you can pass an instruction like “speak in a calm, sympathetic tone” alongside the text, and the model adapts delivery — something neither Google’s catalogue voices nor ElevenLabs’ standard endpoint do out of the box.

Code: synthesize speech with OpenAI

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="nova",
    input="Hello from a pay-as-you-go text to speech API.",
    instructions="Speak in a warm, upbeat tone.",
) as response:
    response.stream_to_file("out.mp3")

The eleven built-in voices (alloy, echo, fable, onyx, nova, shimmer, plus the newer ash, ballad, coral, sage, verse) cover most needs. There is no custom voice cloning.

Where OpenAI TTS is a poor fit

No free tier at all. If “$0/month” is a hard requirement, this is the wrong choice — pick Google.
No voice cloning, limited voice catalogue. Eleven voices versus Google’s 380+ or ElevenLabs’ huge library.
You need a credit card and standing billing. Same barrier as Google, without the recurring free quota to justify it.

Honorable Mentions: Other Free (or Free-ish) TTS APIs

The big three above are the practical answers, but several alternatives are worth knowing about depending on your stack.

Microsoft Azure AI Speech

Azure’s neural TTS includes a recurring free tier of 500,000 characters per month for standard neural voices, renewing monthly like Google’s. It supports 400+ voices across 140+ languages and has the strongest catalogue for enterprise scenarios and custom neural voice (with approval). If your infrastructure is already on Azure, it is the natural pick.

Amazon Polly

Polly’s free tier is time-limited to your first 12 months: 5 million characters/month for standard voices and 1 million characters/month for neural voices. Generous during a launch year, but it is not a permanent free tier — after 12 months you pay standard rates. Best if you are already in the AWS ecosystem.

Deepgram Aura

Deepgram added a TTS model family (Aura) to complement its speech-to-text stack. There is no permanent free tier, but the same $200 signup credit that covers transcription also covers Aura synthesis — useful if you want one vendor for both directions of a voice pipeline. See our free Whisper API comparison for the speech-to-text side of Deepgram.

Self-host: Piper, Coqui, and Kokoro

If you want truly free at the marginal level and have any hardware, open-source TTS has caught up fast. Piper runs fast neural TTS on a Raspberry Pi. Kokoro-82M is a tiny, high-quality open model that runs on CPU. Coqui TTS offers voice cloning locally. The trade-off is the usual self-hosting tax: you own the setup, the updates, and the crashes. For a personal tool this is genuinely free; for a SaaS, the operational time rarely beats Google’s free tier until you are well past it.

Side-by-Side Spec Sheet

Feature	Google Cloud TTS	ElevenLabs	OpenAI TTS
Free tier shape	Recurring monthly (forever)	Recurring monthly (forever)	None (pay as you go)
Free monthly volume	4M chars Standard / 1M premium	10,000 credits (~10 min)	—
Credit card to start	Required	Not required	Required
Commercial use on free tier	Yes	No (paid plan required)	Yes (it’s all paid)
Voice count	380+	Large library + cloning	11 built-in
Languages	50+	29 (Multilingual v2)	Multilingual (follows input)
Voice cloning	Enterprise only	Yes, self-serve (paid)	No
Lowest latency option	Standard voices	Flash v2.5 (~75 ms)	`tts-1`
SSML / prosody control	Full SSML	Limited (model-driven)	Steerable via instructions
Cheapest paid rate	$4 / 1M chars (Standard)	$5/mo (30K credits)	$15 / 1M chars
Auth	Service account / ADC	API key header	API key

Decision Tree: Which One Should You Pick?

Run through this list top to bottom. The first row that matches your situation is your answer.

I need hours of audio per month, for free, forever. → Google Cloud TTS. The 4M-character recurring Standard tier is the only quota that supports this.
Voice quality is the whole point — narration, audiobook, character voice. → ElevenLabs if non-commercial and low volume; pay $5/month Starter the moment you monetize.
I want to clone a specific voice from a short sample. → ElevenLabs (instant voice cloning, paid). No one else does this self-serve.
I already call OpenAI for chat or Whisper and just want my app to talk. → OpenAI TTS. Same client, same key, ~$0.15 per 10K characters.
I need fine pronunciation, pause, and pitch control via markup. → Google Cloud TTS (full SSML support).
I want the model to adapt tone from an instruction (“sound sympathetic”). → OpenAI gpt-4o-mini-tts, the only one with self-serve steerability.
My infrastructure is already on Azure or AWS. → Azure AI Speech (500K chars/month free, recurring) or Amazon Polly (free for first 12 months).
I want zero per-character cost and have hardware to run it. → Self-host Piper or Kokoro. Free at the margin, you own the ops.

Combining Free TTS with Free Whisper and a Free LLM

The most powerful use of a free TTS API is not standalone playback — it is the final leg of a full voice loop. A complete, no-cost voice-agent stack in 2026 looks like this:

Speech in: a free Whisper API (Groq’s no-card free tier is the cleanest) transcribes the user’s audio.
Reasoning: a free LLM — Groq Llama 3.3 70B, Together AI, or Google Gemini — generates the response text.
Speech out: Google Cloud TTS (free monthly tier) or ElevenLabs Flash v2.5 (low latency) speaks the answer.

Three free quotas and a complete speech-to-speech agent. The stack is card-free if you pair Groq with ElevenLabs’ free tier; Google Cloud TTS needs a card on file, although it does not charge inside the free quota. The same architecture that costs real money on a single commercial vendor runs free as long as each provider’s monthly ceiling holds.
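
The three stages above compose into a single function call per conversational turn. The sketch below shows only the wiring, under the assumption that each provider is wrapped in a plain callable; `voice_turn` and the three `fake_*` helpers are hypothetical names for illustration, not part of any vendor SDK.

```python
from typing import Callable


def voice_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],
    generate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one speech-to-speech turn: audio -> text -> reply text -> audio."""
    user_text = transcribe(audio_in)   # speech in (for example a hosted Whisper endpoint)
    reply_text = generate(user_text)   # reasoning (any chat LLM)
    return synthesize(reply_text)      # speech out (any TTS API)


# Hypothetical stand-ins so the sketch runs without API keys.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_generate(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = voice_turn(b"hello", fake_transcribe, fake_generate, fake_synthesize)
```

Keeping each stage behind a callable means you can swap Google Cloud TTS for ElevenLabs (or one LLM for another) when a monthly quota runs out, without touching the loop itself.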

FAQ

Is there a truly free text-to-speech API with no time limit?

Yes — Google Cloud Text-to-Speech and Microsoft Azure AI Speech both offer recurring monthly free tiers that renew indefinitely (4M and 500K characters respectively for their relevant voice tiers). They require a credit card on file, but you are not charged inside the free quota. Amazon Polly’s free tier, by contrast, only lasts your first 12 months.

Which free TTS API has the best voice quality?

ElevenLabs is widely regarded as the most natural and expressive, especially for long-form narration and emotional delivery. Google’s newer Chirp 3: HD voices are very close for supported languages and come with a far larger free quota. OpenAI’s voices are good and improving, with gpt-4o-mini-tts adding tone steerability. If quality is the only axis, ElevenLabs; if quality-per-free-character, Google.

Can I use a free TTS API commercially?

It depends on the provider. Google Cloud and OpenAI allow commercial use of generated audio (Google inside its free tier, OpenAI as paid usage). ElevenLabs’ free tier is non-commercial only and requires attribution — you must upgrade to at least the $5/month Starter plan to monetize the output. Always re-read each provider’s terms before shipping; licensing changes.

How many characters is one minute of speech?

At a natural speaking rate of roughly 150 words per minute and ~5 characters per word plus spaces, one minute of audio is approximately 900–1,000 characters. So Google’s 4M-character Standard free tier is roughly 66 hours per month, and ElevenLabs’ 10K credits is about 10 minutes on the Multilingual v2 model.
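
The conversion above is easy to rerun for your own quota. This is a minimal sketch of the arithmetic, assuming the article's estimate of roughly 900 to 1,000 characters per minute; `quota_to_hours` is a hypothetical helper name, and real speaking rates vary by voice and language.

```python
def quota_to_hours(char_quota: int, chars_per_minute: int = 1000) -> float:
    """Convert a character quota into approximate hours of synthesized speech."""
    minutes = char_quota / chars_per_minute
    return minutes / 60


# Google Cloud TTS Standard tier: 4M characters per month.
google_hours_low = quota_to_hours(4_000_000, chars_per_minute=1000)   # about 66.7 hours
google_hours_high = quota_to_hours(4_000_000, chars_per_minute=900)   # about 74.1 hours

# A 10,000-character allowance, in minutes rather than hours.
small_quota_minutes = quota_to_hours(10_000) * 60                     # about 10 minutes
```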

Which TTS API is best for a real-time voice agent?

Latency is the deciding factor. ElevenLabs Flash v2.5 targets ~75 ms model latency and is purpose-built for conversational agents. OpenAI tts-1 and Google’s Standard voices are also fast enough for most interactive use. For the lowest possible end-to-end latency, stream the audio as it is generated rather than waiting for the full file.

Do these APIs support SSML?

Google Cloud TTS has the most complete SSML support — pauses, pronunciation via phonemes, pitch, rate, and emphasis. Azure also has strong SSML. ElevenLabs relies more on its model’s inherent prosody than on markup, and OpenAI uses natural-language instructions (with gpt-4o-mini-tts) instead of SSML tags.
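
For readers who have not written SSML before, here is a small sketch of what the markup looks like when you build it in code. It uses only standard SSML elements (`speak`, `prosody`, `break`); the helper names `speak`, `prosody`, and `pause` are hypothetical, and which attribute values a given provider accepts is something to confirm in its own docs.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody(text, rate=None, pitch=None):
    """Wrap text in a <prosody> element, escaping XML-special characters."""
    attrs = ""
    if rate:
        attrs += f" rate={quoteattr(rate)}"
    if pitch:
        attrs += f" pitch={quoteattr(pitch)}"
    return f"<prosody{attrs}>{escape(text)}</prosody>"


def pause(milliseconds):
    """Insert a silent break of the given length."""
    return f'<break time="{int(milliseconds)}ms"/>'


def speak(*parts):
    """Join fragments into one SSML document."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(prosody("Tom & Jerry", rate="slow", pitch="+2st"), pause(300))
```

Escaping matters more than it looks: an unescaped `&` or `<` in user-supplied text makes the whole document invalid XML, and the request is rejected instead of spoken.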

What audio formats can I get back?

All three return MP3 by default and support additional formats: Google offers LINEAR16 (WAV), OGG Opus, and MULAW; OpenAI offers MP3, Opus, AAC, FLAC, WAV, and PCM; ElevenLabs offers MP3 at several bitrates plus PCM and µ-law for telephony. Pick Opus or low-bitrate MP3 for streaming, WAV/PCM when you need to post-process the audio.

5 Free AI Coding Assistants for VS Code & Terminal

toolfreebie — Thu, 28 May 2026 08:51:40 +0000

If you write code for a living in 2026, you have probably tried Cursor, GitHub Copilot, or one of the other paid AI coding tools and walked away thinking the same thing: this is genuinely useful, but $20 to $40 a month adds up fast. The good news is that the free, open-source side of this market has caught up. You can now get high-quality autocomplete, multi-file refactoring, autonomous agent loops, and even self-hosted local inference without paying a cent, as long as you are willing to bring your own free-tier API key (or run a model locally).

This guide covers five free AI coding assistants that I actually use in 2026. They split cleanly across two environments: VS Code (where Cline, Continue.dev, and Codeium live) and the terminal (Aider, plus self-hosted Tabby for the privacy-first crowd). Every one is free, every one is open-source or has a permanently free tier, and every one is genuinely production-ready.

What “free AI coding assistant” actually means in 2026

The phrase covers three different product shapes, and the differences matter when you pick one:

Autocomplete — inline ghost text as you type. Continue.dev and Codeium are the strongest free options here.
Chat / refactor — a side panel that answers questions about your code and applies suggested edits. Every tool on this list does this; quality varies with the model behind it.
Agent — autonomous multi-file edits, terminal execution, and self-verification. This is the Cursor / Devin shape. Cline and Aider are the two strongest free agents.

The pricing line in 2026 is drawn around inference cost, not features. Paid tools (Cursor, Copilot, Cody) bundle inference into a flat subscription. Free tools ask you to bring your own key from a provider with a real free tier — Gemini, Groq, OpenRouter, DeepSeek, or a local Ollama instance. Combine the right free key with the right open-source frontend and your effective cost is zero.

1. Cline — the best free agent for VS Code

Cline (formerly Claude Dev) is the closest free analogue to Cursor’s agent mode. It is an Apache 2.0 VS Code extension that drives a multi-step loop: read files, propose edits, execute terminal commands, verify results, and iterate. You see every step before it runs and can stop or correct it.

What makes Cline stand out among free options:

Plan / Act mode — you can ask it to draft a plan first (read-only) and only switch to Act when you approve. This is the single biggest UX improvement over running an agent “raw.”
BYOK with anything — Gemini, OpenRouter, Groq, Together, DeepSeek, Anthropic, OpenAI, or local Ollama. The free path is Gemini 2.0 Flash (15 RPM, 1M token context) or DeepSeek V3 via OpenRouter free tier.
Live cost tracking — every message shows token counts and the dollar cost so far. With Gemini Flash you watch it stay at $0.00.
MCP support — Cline is one of the first agents to integrate the Model Context Protocol, so you can plug in custom tools (databases, browsers, internal APIs) without writing extension code.

Real workflow: install Cline from the VS Code marketplace, paste a Gemini API key (free from aistudio.google.com, no card), open a Python repo, and type “add type hints to every function in src/ and run mypy until it passes.” Cline reads the files, makes edits, runs mypy, sees the errors, fixes them, and runs again. End-to-end on a small repo this takes 3-5 minutes and costs $0.

Where it falls short: Cline is agent-only, not autocomplete. If you want ghost-text-as-you-type, you need to pair it with Continue.dev or Codeium.

2. Aider — the strongest terminal-native AI pair programmer

Aider is the answer if you spend your day in a terminal and a tmux session, not a GUI editor. It is a Python CLI (Apache 2.0) that opens an interactive prompt inside a Git repo and edits files in place, committing each change with a descriptive message you can read in git log.

The things Aider does better than any other free tool:

Repo map via tree-sitter — Aider parses your entire codebase into a symbol map and feeds the LLM only the relevant parts. On a 100k-line repo this means the model still understands cross-file dependencies without busting your context window.
Architect / editor split — you can run a strong reasoning model (DeepSeek R1, o1-mini) as the architect and a cheap fast model (DeepSeek V3, Gemini Flash) as the editor. The architect plans, the editor writes. This is the cheapest way to get high-quality changes.
Auto-commit with diff messages — every Aider edit becomes a Git commit you can git revert. No “agent went off the rails and trashed my repo” recovery.
Reproducible benchmark — Aider publishes a leaderboard running 225 real exercism problems through every model combination, so you can pick the cheapest model that hits your accuracy bar.
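
To make the repo-map idea concrete, here is a deliberately simplified sketch. It is not Aider's implementation: Aider uses tree-sitter across many languages and ranks symbols by relevance, while this version swaps in Python's standard `ast` module, handles Python source only, and does no ranking. The function name `symbol_map` is hypothetical.

```python
import ast


def _signature(node):
    """Render a function node as 'def name(arg1, arg2)'."""
    args = ", ".join(a.arg for a in node.args.args)
    return f"def {node.name}({args})"


def symbol_map(source: str) -> list[str]:
    """List top-level classes, their methods, and top-level functions."""
    function_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    symbols = []
    for node in ast.parse(source).body:
        if isinstance(node, function_types):
            symbols.append(_signature(node))
        elif isinstance(node, ast.ClassDef):
            symbols.append(f"class {node.name}")
            for item in node.body:
                if isinstance(item, function_types):
                    symbols.append("    " + _signature(item))
    return symbols


SAMPLE = '''
class Cart:
    def add(self, item, qty):
        self.items.append((item, qty))

def total(cart):
    return sum(q for _, q in cart.items)
'''

outline = symbol_map(SAMPLE)
```

The outline is a few tokens per symbol instead of the full file body, which is why a map of a large repo can still fit in the prompt alongside the files being edited.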

Free combo I run: aider --architect --model openrouter/deepseek/deepseek-r1 --editor-model openrouter/deepseek/deepseek-chat using the OpenRouter free tier. In architect mode, --model is the model that plans and --editor-model is the one that writes the edits. End-to-end cost for a typical refactor session is under $0.05, often $0.00 when you stay under the daily free quota.

Where it falls short: terminal-only, no autocomplete, no inline edit preview. If you live in VS Code, Cline is the better fit.

3. Continue.dev — the best free autocomplete for VS Code and JetBrains

Continue.dev is Apache 2.0, runs in VS Code and JetBrains, and gives you what Copilot gives you (inline ghost text + chat panel + slash commands) without the subscription. The catch and the feature: you wire up your own model providers in a YAML config file.

What you actually get for free:

Inline autocomplete — uses a small fast model for ghost text. The recommended free option is Qwen 2.5 Coder 1.5B via Ollama (runs on CPU), or Groq’s free Llama 3.1 8B endpoint for cloud speed.
Chat panel — point it at any chat-completions endpoint. Gemini Flash, DeepSeek V3, OpenRouter free models, or local Llama 3.3 via Ollama all work.
Custom slash commands — define /test, /review, /explain as YAML prompts that pull in the current file or selection. Closest thing to Cursor’s command palette in a free tool.
Indexed codebase chat — Continue runs a local embedding index of your repo (free Voyage AI or local nomic-embed-text via Ollama) so chat can pull relevant context from anywhere in the codebase.

Sample config.yaml:

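name: Free Assistant
version: 1.0.0
schema: v1
models:
  - name: Chat (Gemini Flash)
    provider: gemini
    model: gemini-2.0-flash-exp
    apiKey: YOUR_FREE_GEMINI_KEY
    roles: [chat, edit]
  - name: Autocomplete (Qwen Coder)
    provider: ollama
    model: qwen2.5-coder:1.5b
    roles: [autocomplete]
  # In config.yaml the embedding model is another models entry with the embed role
  - name: Embeddings (nomic)
    provider: ollama
    model: nomic-embed-text
    roles: [embed]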

Where it falls short: Continue is autocomplete and chat, not a full agent. For agentic multi-file work you still want Cline.

4. Codeium / Windsurf — the easiest free start, no config required

Codeium (the free product, distinct from their paid Windsurf IDE) gives you unlimited free autocomplete and chat in VS Code, JetBrains, Neovim, Emacs, and 40+ other editors. No bring-your-own-key, no quota counter, no credit card. Their business model funds the free tier with enterprise self-hosted licenses, and they have committed to keeping the individual plan permanently free.

Why it stays on this list despite not being open-source:

Zero setup — install extension, sign in with email, start typing. No model config, no API keys, no Ollama running in the background.
Truly unlimited — Codeium does not rate-limit individual users on autocomplete or chat. The only paid features are Cascade (their agent) and team management.
Editor coverage no one else matches — if you write Go in Neovim and TypeScript in JetBrains and Python in VS Code, Codeium is the only free tool that gives you the same UX everywhere.
Local-only mode for enterprise — Codeium can run fully on-prem with no telemetry, which is why government and large finance shops use it.

What you give up: Codeium is not open-source and the free tier sends code through their hosted models. If that is a dealbreaker for your codebase, skip to Tabby below.

5. Tabby — self-hosted, fully local, fully free

Tabby (Apache 2.0) is the answer when your code cannot leave your machine. It is a self-hosted AI coding assistant that runs on your laptop or a workstation, ships its own server, and exposes a VS Code / JetBrains / Vim extension that talks to localhost.

What Tabby gives you that nothing else on this list does:

100% local — no API key, no internet, no telemetry. Code never leaves the machine.
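One-command install — docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data tabbyml/tabby serve --model StarCoder-1B --device cuda and you have a coding assistant on an NVIDIA GPU (the container needs --gpus all to see the card, and the volume keeps downloaded models between restarts). The default model is small enough for CPU inference too; check Tabby’s docs for the CPU invocation. With a consumer GPU you can run StarCoder-7B for noticeably better completions.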
Repo-aware retrieval — Tabby indexes your codebase and pulls relevant context into each completion, the same trick Cursor uses but running entirely on your hardware.
Team-server mode — point your colleagues’ editors at a shared Tabby server on a beefy machine. One GPU serves a small team.

Where it falls short: completion quality on the free local models (StarCoder, DeepSeek Coder 1.3B) is meaningfully below GPT-4-class output. Tabby is the right pick when privacy is non-negotiable, not when you want the best autocomplete.

Side-by-side comparison

Tool	Shape	Environment	License	Setup	Best free model combo
Cline	Agent	VS Code	Apache 2.0	2 min	Gemini 2.0 Flash (free, 1M ctx)
Aider	Agent	Terminal	Apache 2.0	1 min (pip)	DeepSeek V3 + R1 via OpenRouter free tier
Continue.dev	Autocomplete + chat	VS Code / JetBrains	Apache 2.0	10 min (config)	Gemini Flash chat + Qwen Coder local autocomplete
Codeium	Autocomplete + chat	40+ editors	Proprietary (free tier)	30 sec	Hosted (no choice, but unlimited)
Tabby	Autocomplete	VS Code / JetBrains / Vim	Apache 2.0	5 min (Docker)	Local StarCoder-7B

Which one should you actually use?

Honest decision tree:

You want one tool, you live in VS Code, and you want agentic multi-file edits → Cline + a free Gemini API key. Stop reading.
You want one tool and you live in a terminal → Aider with the DeepSeek architect/editor combo via OpenRouter.
You want the best free autocomplete and zero setup hassle → Codeium. Install, sign in, done.
You want fully local, code never leaves your machine → Tabby in Docker.
You want power-user autocomplete with full control over which model runs where → Continue.dev with a YAML config you can commit to your repo.
You want the strongest possible setup overall → Cline for agent work + Codeium for inline autocomplete. They do not conflict; you get ghost text from Codeium and large refactors from Cline.

Pairing with free AI APIs

Three of these tools (Cline, Aider, Continue.dev) need an LLM provider. The free combos that work in 2026:

Google Gemini API — Gemini 2.0 Flash is free up to 15 RPM and 1,500 requests/day, with a 1M-token context window that handles huge repos. Setup guide.
Groq — Llama 3.3 70B and Qwen 32B free, 14,400 requests/day, very fast (300-800 tokens/s). Best for autocomplete-style requests where latency matters. Setup guide.
DeepSeek — V3 chat and R1 reasoning both have a free credit grant and DeepSeek’s own API is the cheapest paid tier if you exhaust it. Setup guide.
OpenRouter — single key, 300+ models, several with permanent free endpoints (DeepSeek V3, Llama 3.3 70B, Qwen 32B). Setup guide.
Local Ollama — runs Llama 3.3, Qwen 2.5 Coder, DeepSeek Coder, and others entirely on your machine. Zero API cost, zero rate limit. Setup guide.
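
Most of these free tiers are defined by request-rate limits, so an agent that fires requests in a tight loop will hit HTTP 429 errors quickly. Below is a minimal client-side throttle, a sketch assuming a simple sliding window is acceptable for your provider; `SlidingWindowLimiter` is a hypothetical helper, and the clock and sleep functions are injectable only so the behaviour can be checked without waiting.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any period_s-second window."""

    def __init__(self, max_calls, period_s, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period_s = period_s
        self._clock = clock
        self._sleep = sleep
        self._stamps = deque()  # timestamps of recent calls, oldest first

    def _evict(self, now):
        # Forget calls that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.period_s:
            self._stamps.popleft()

    def acquire(self) -> float:
        """Block until a call is allowed; return the seconds spent waiting."""
        now = self._clock()
        self._evict(now)
        waited = 0.0
        if len(self._stamps) >= self.max_calls:
            # Wait until the oldest call leaves the window.
            waited = self.period_s - (now - self._stamps[0])
            self._sleep(waited)
            now = self._clock()
            self._evict(now)
        self._stamps.append(now)
        return waited


# Example: stay under a 15-requests-per-minute free tier.
gemini_limiter = SlidingWindowLimiter(max_calls=15, period_s=60.0)
```

Call `gemini_limiter.acquire()` immediately before each API request. Daily caps (for example 1,500 requests/day) need a second limiter or a persistent counter, which this sketch does not cover.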

FAQ

Is GitHub Copilot Free a real option? GitHub launched Copilot Free in December 2024 for all individual GitHub accounts; verified students and open-source maintainers were already eligible for free access to the paid plan. It is genuinely free, but the cap (50 chat messages, 2,000 completions per month) is low enough that for daily work the tools in this guide are more practical.

Are these as good as Cursor? Cline running on Claude Sonnet 4.6 or Gemini 2.5 Pro is competitive with Cursor for agentic work — same loop, same UX patterns, same model behind the scenes. The gap is mostly polish, not capability. On free models the gap widens; you trade ~10-20% accuracy for $20/month saved.

Can I use these on a corporate codebase? Check your security policy first. Cline, Aider, and Continue.dev send code to whichever API key you configure — Gemini, OpenRouter, etc. — and those providers have their own data-retention policies. Codeium has an opt-out for training data. Tabby is the only option that sends nothing anywhere.

Do any of these work with local-only models? Cline, Aider, and Continue.dev all support Ollama out of the box. Set the provider to ollama and a model name like qwen2.5-coder:32b. Tabby is local-only by design.

What about Cody and Tabnine? Sourcegraph’s Cody is open-source with a free tier (200 autocompletes, 20 chat messages per month) — usable but capped. Tabnine has a free starter plan that is essentially a demo. Neither beats the five tools in this guide for the unlimited-free use case.

Bottom line

Free AI coding assistants in 2026 are not a downgrade from paid tools — they are the same tools with a different billing model. Cline gives you Cursor’s agent loop. Aider gives you something Cursor cannot (clean Git history, terminal-native, reproducible benchmarks). Continue.dev gives you Copilot-style autocomplete with full provider control. Codeium gives you the cleanest zero-setup install. Tabby gives you the only fully local option.

Pick one based on your editor and your privacy needs, pair it with one of the free AI APIs above, and you have a setup that costs nothing and ships features at the same rate as a $40/month subscription. The only thing it costs you is ten minutes of config.

Qdrant vs Pinecone vs Chroma: Free Vector Database

toolfreebie — Thu, 28 May 2026 08:46:12 +0000

Qdrant vs Pinecone vs Chroma: Free Vector Database for RAG

If you are building a retrieval-augmented generation (RAG) pipeline in 2026, the vector database is the load-bearing piece nobody talks about until it breaks. Embeddings are commoditised — Cohere, OpenAI, Voyage, and a dozen open models will turn your text into vectors for free or near-free. The harder question is where those vectors live, how fast you can search them, and how much you have to pay before the bill becomes scary.

Three names dominate the free end of that market: Qdrant, Pinecone, and Chroma. All three give you a real way to start a RAG project at zero cost. None of them require a credit card on day one. But they sit on fundamentally different points on the open-source-vs-managed and local-vs-cloud spectrums, and the right pick depends entirely on what you are building and how far you expect it to scale.

This guide compares all three on the metrics that actually matter for a free RAG stack — what the free tier really lets you do, what happens when you outgrow it, performance numbers from third-party benchmarks, and the engineering trade-offs that hit you a month into the project. Every number cited links back to the provider’s own docs, GitHub repo, or a public benchmark; nothing here is fabricated.

The 30-Second Answer

Database	Free path	License	Free ceiling	Best for
Qdrant	1 GB managed cloud cluster, free forever, no card	Apache 2.0	1 GB RAM + ~4 GB disk on managed; unlimited self-host	Production RAG with hybrid search, payload filters, no vendor lock-in
Pinecone	Starter plan: 2 GB storage, 5 indexes, no card	Closed-source SaaS	2 GB storage, 2M read units, 1M write units per month	Zero-ops managed RAG, fastest first-vector-to-production
Chroma	100% local — `pip install chromadb`	Apache 2.0	Bounded by your laptop’s RAM and disk	Local prototypes, notebooks, single-tenant desktop apps

If you want the smallest possible step from idea to working RAG with three lines of Python and no signup, Chroma wins. If you want a managed service that just exists at a URL with no servers to babysit, Pinecone is the easiest. If you want a real free tier that can carry a small production app, plus the option to self-host the exact same binary later when you outgrow it, Qdrant is the only one of the three with both at the same time.

The rest of this article unpacks why.

Why You Need a Vector Database for RAG at All

RAG, at its core, is one cheap trick: instead of stuffing your entire knowledge base into every LLM prompt, you embed your documents once, store the vectors, and at query time you embed the user’s question, look up the most similar document chunks by cosine similarity, and paste only those chunks into the prompt. The LLM never sees your full corpus — it only ever sees the few passages that matter for the current question.

This makes the vector-search step the bottleneck. Three properties decide whether your RAG app is good:

Recall: does the retriever actually return the relevant chunk? (Approximate-nearest-neighbour algorithms are tunable — you can trade speed for recall.)
Latency: how long does a single query take? If your RAG round trip is 800 ms before the LLM even starts streaming, the UX is dead.
Cost: how much do you pay per million vectors stored, per million queries served, and per million tokens re-embedded when you change models?

A flat-array brute-force search through Python lists works for ten thousand vectors. It falls over at a million. The vector databases below all use some flavour of HNSW (Hierarchical Navigable Small World) graphs to get sub-linear search complexity, plus a binary protocol that does not melt under load. The free tiers exist because every provider knows that the marginal cost of carrying a small project is rounding error, and the developer who built their hobby app on your stack is the developer who buys the production plan later.
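
For reference, this is what the brute-force baseline looks like. It is a sketch in pure Python with hypothetical names (`cosine`, `top_k`); every query scores every stored vector, which is exactly the linear cost that HNSW indexes avoid.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, corpus, k=3):
    """Return the ids of the k most similar vectors; cost is O(N * dim) per query."""
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in corpus.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


# Toy 2-dimensional corpus; real embeddings have hundreds of dimensions.
corpus = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
nearest = top_k([1.0, 0.1], corpus, k=2)
```

This is also a useful ground truth: running the same queries through brute force and through your vector database is the simplest way to measure the recall you are actually getting.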

What “Free” Actually Means in Vector Database Land

There are three meaningfully different shapes of “free” on offer:

Self-host open source: the code is Apache 2.0, you run it on your own hardware, you pay only for the box. Qdrant, Chroma, Weaviate, Milvus, and pgvector all live here. Free as in you do the work.
Managed free tier: a permanent free quota on the vendor’s own cloud, refilled monthly or capped at storage. Pinecone and Qdrant Cloud both offer this. Free as in they do the work, within limits.
Trial credits: a one-time wallet of paid-rate credit ($50–$300). Weaviate Cloud, Zilliz, and some others use this model. Useful for evaluation, not for shipping.

This guide focuses on the first two, because they are the only paths that let a real project keep running for free past the first month.

Qdrant: Open-Source Rust + Generous Managed Free Tier

Qdrant is a Rust-written vector database under the Apache 2.0 license. It is the rare project that gives you a credible production-grade open-source binary and a generous managed cloud free tier from the same team — which means you can prototype on the free cloud, migrate the exact same data to a self-hosted instance later, and never touch a different query language.

Free cloud cluster (no card)

The Qdrant Cloud free tier gives you one 1 GB cluster, free forever, with no credit card required. That is not a trial. It does not auto-convert to paid. The cluster is region-pinned, has full TLS, and exposes both REST and gRPC. You get:

1 GB RAM cluster (enough for roughly 1–3 million 384-dimensional vectors if you quantize them or keep the original vectors on disk; fewer when everything is held in RAM as float32)
Full HNSW indexing with all distance metrics (cosine, dot, Euclidean, Manhattan)
Payload filtering (Qdrant’s headline feature — filter by metadata during the ANN search, not after)
Hybrid search (dense + sparse vectors in the same query) since Qdrant 1.10
Snapshots, backups, monitoring dashboard
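
How many vectors fit in 1 GB depends heavily on how they are stored, so it is worth running the numbers for your own dimension. The sketch below assumes a sizing rule of thumb of vectors × dimension × bytes per component × 1.5 overhead, which is my recollection of Qdrant's capacity-planning guidance rather than a guarantee; `estimate_vectors_in_ram` is a hypothetical helper, and it ignores payload size and the memory the service itself needs.

```python
GIB = 1024 ** 3


def estimate_vectors_in_ram(ram_bytes, dim, bytes_per_component=4.0, overhead=1.5):
    """Estimate how many vectors fit in RAM under a simple rule of thumb."""
    per_vector = dim * bytes_per_component * overhead  # bytes per stored vector
    return int(ram_bytes // per_vector)


# float32 vectors held fully in memory
float32_capacity = estimate_vectors_in_ram(GIB, dim=384)

# int8 scalar quantization: one byte per component instead of four
int8_capacity = estimate_vectors_in_ram(GIB, dim=384, bytes_per_component=1.0)
```

Under these assumptions, full-precision 384-dimensional vectors land just under half a million per gigabyte, and quantized vectors just under two million. Treat both as order-of-magnitude estimates and test with your real data before committing to the free cluster.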

Self-hosting

One Docker command and you have a running Qdrant on your laptop or a VPS:

docker run -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage \
    qdrant/qdrant

That is the complete install. There is no separate metadata store, no Zookeeper, no Kafka. The binary is ~20 MB, the disk format is portable, and Qdrant ships an official REST + gRPC schema plus first-party clients for Python, JavaScript/TypeScript, Go, Rust, Java, and .NET.

Python in 10 lines

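from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# URL and API key come from the Qdrant Cloud dashboard
client = QdrantClient(url="https://YOUR-CLUSTER.qdrant.io", api_key="...")

# The vector size must match your embedding model's output dimension
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

vector = [0.1] * 1024  # placeholder: replace with a real 1024-float embedding
client.upsert(
    collection_name="docs",
    points=[PointStruct(id=1, vector=vector, payload={"title": "Hello"})],
)

# query_points is the current search call (qdrant-client 1.10+); older clients use client.search
hits = client.query_points(collection_name="docs", query=vector, limit=5).points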

What pushes you off the free tier

Storage. One gigabyte is enough for a personal knowledge base, an internal company FAQ, or a side project’s documentation — but a SaaS that ingests user content will hit the ceiling fast. The next step is the Free Trial credit (currently $25) on a larger cluster, then paid tiers that start around $0.014/hour for a 4 GB cluster. Or you migrate to self-host.

Pinecone: The Managed-First Default

Pinecone was the first venture-funded managed vector database and remains the easiest one to get a production-shaped URL out of. The product is closed-source — you cannot run a Pinecone binary on your own hardware — but the trade-off is that you cannot break anything either. There is no cluster to size, no HNSW parameters to tune, no replicas to provision.

Starter plan free tier

The Pinecone Starter plan gives every account a permanent free allowance:

2 GB storage
5 serverless indexes
2 million read units per month
1 million write units per month
Up to 100 namespaces per index
No credit card required

The free tier is serverless — there are no nodes to pay for when idle. You pay (or use free units) per read and per write, where a read unit roughly equals a single small query and a write unit roughly equals one vector upserted. For a typical chatbot, 2 million read units is on the order of hundreds of thousands of user queries a month, which is more than enough for any prototype and many small production apps.
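
To see where "hundreds of thousands of queries" comes from, it helps to budget in read units explicitly. The per-query costs below are assumptions for illustration, not Pinecone's published pricing: the real figure depends on namespace size and on what each query returns. `monthly_queries` is a hypothetical helper.

```python
def monthly_queries(read_units: int, ru_per_query: float) -> int:
    """How many queries a read-unit allowance covers at a given cost per query."""
    return int(read_units // ru_per_query)


STARTER_READ_UNITS = 2_000_000

# Assumed per-query costs; measure your own from the usage data Pinecone reports.
light_queries = monthly_queries(STARTER_READ_UNITS, ru_per_query=5)    # 400,000 per month
heavy_queries = monthly_queries(STARTER_READ_UNITS, ru_per_query=10)   # 200,000 per month

# Average sustainable daily traffic at the heavier cost.
heavy_queries_per_day = heavy_queries // 30
```

If your measured cost per query is close to one read unit, the allowance stretches to roughly two million queries; the point of the exercise is to plug in the number you observe rather than the one you hope for.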

Python in 10 lines

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="...")

# Serverless index; dimension must match your embedding model's output
pc.create_index(
    name="docs",
    dimension=1024,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
index = pc.Index("docs")

vector = [0.1] * 1024  # placeholder: replace with a real 1024-float embedding
# Each record is an (id, values, metadata) tuple
index.upsert(vectors=[("doc-1", vector, {"title": "Hello"})])
hits = index.query(vector=vector, top_k=5, include_metadata=True)

What pushes you off the free tier

The first wall is usually concurrent users, not storage. A B2C app that does any meaningful traffic will burn through 2 million read units quickly, and once you exceed the monthly allowance the index is paused (Starter plan) or you pay overage (Standard plan starts at $50/month minimum). The second wall is features: namespaces above 100, hybrid search beyond serverless’s current support window, and on-prem deployment all push you to Enterprise.

Chroma: The Local-First Default

Chroma is the lightest possible vector database. It is also Apache 2.0, but its philosophy is the opposite of Pinecone’s: it expects to live inside your Python application as an embedded library, the way SQLite lives inside your application as a file. There is a server mode, but the default getting-started path is pip install chromadb and you have a working vector database in the same process as your script.

Free path

The local install is the free tier. There is no signup, no cluster, and no API key, just a directory on disk where Chroma persists its SQLite-backed storage (releases before 0.4 used DuckDB). Chroma Cloud is in paid private preview as of late 2025, so for free-tier purposes Chroma is a pure self-host story.

Python in 5 lines

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("docs")
collection.add(ids=["doc-1"], documents=["Hello world"], metadatas=[{"src": "readme"}])
hits = collection.query(query_texts=["What is hello?"], n_results=5)

Note the API difference: Chroma can embed text for you using a default sentence-transformer (downloads on first use), so you can pass query_texts instead of pre-computed vectors. That is brilliant for prototypes and a footgun in production — the bundled embedder is small, English-only, and not what you want for a real product. For anything serious, plug in OpenAI, Cohere Embed v3, or a custom embedding function.

What pushes you off Chroma

Concurrency, scale, and operations. Chroma’s in-process mode is single-writer. Its server mode (chroma run) exists and works, but the operational story — backups, replication, monitoring, multi-region — is far less mature than Qdrant’s. Chroma is the best default for “I want a working RAG demo in five minutes” and “I want a local notebook to find similar items in my CSV.” It becomes a liability the moment you have ten concurrent users hitting the same index from a deployed web app.

Head-to-Head: Free Tier Limits Compared

Limit	Qdrant Cloud Free	Pinecone Starter	Chroma (Local)
Storage	1 GB RAM (~1–3M vectors at 384d)	2 GB	Your disk
Indexes / collections	Multiple in 1 cluster	5 indexes	Unlimited (your file system)
Reads per month	No hard cap (RAM-bound)	2 M read units	Unlimited (CPU-bound)
Writes per month	No hard cap	1 M write units	Unlimited
Hybrid (dense + sparse)	Yes	Partial (sparse-dense indexes, region-limited)	No (dense only)
Metadata filtering during ANN	Yes (payload filter inside HNSW walk)	Yes	Yes (post-filter)
Persistence	Cloud-managed	Cloud-managed	Local DuckDB / SQLite
Backups	Snapshots	Collection backups	Copy the directory
Self-host option	Yes (Apache 2.0)	No	Yes (Apache 2.0)
Credit card to start	No	No	No (no account needed)

Two things jump out. First, Chroma does not really compete on the same axis — it is a library, not a service. Second, between the two services, Qdrant’s free tier is the only one whose cap is storage only, not query volume. Pinecone will pause your index if you blow the read-unit budget. Qdrant Cloud will simply slow down if you saturate the 1 GB cluster, but the queries keep flowing.

Performance: What the Public Benchmarks Say

The vector-database performance picture changes every quarter, and most vendor benchmarks are theatre. Two public, reproducible benchmark sources are worth looking at, only the second of which is fully independent of a vendor:

The Qdrant vector-db-benchmark repo — open-source, reproducible, runs every major engine through the same ANN-Benchmarks dataset with default and tuned configurations. Yes, it is published by Qdrant, but the harness is open and you can re-run it. Qdrant generally tops latency and RPS in their published runs; Chroma is not in the comparison set because it is single-node.
The ann-benchmarks.com leaderboard — the canonical academic benchmark for ANN libraries (not full databases), useful for comparing the underlying index algorithms (HNSW, IVF, ScaNN).

For a small free-tier project, the takeaway is that all three engines will return a top-5 query under 50 ms with healthy recall at the dataset sizes you can actually fit in their free quotas. Latency-per-dollar starts to matter at higher scale; at the free tier, pick on developer experience and lock-in, not on a 5 ms difference in p99 latency.

Embedding Compatibility

None of these databases generate embeddings on their own (Chroma’s default model aside). You bring vectors in, and the database stores and searches them. That means your embedding choice is independent — and worth thinking about, because the bill on embeddings can dwarf the bill on the vector DB itself.

Embedder	Dimension	Free tier	Plays well with
Cohere Embed v3	1024 (or 384 light)	Trial key, no card	Multilingual RAG, +Rerank in one stack
OpenAI text-embedding-3-small	1536 (or shrinkable)	Pay-as-you-go ($0.02/1M tokens)	Ubiquitous defaults, every library supports it
Voyage AI voyage-3-lite	512	$50 trial credit	Lowest latency, strong on code
BGE / E5 (open source)	varies	Free (self-host)	Air-gapped deployments, zero per-token cost
Sentence-Transformers (open source)	384 / 768	Free (self-host)	Local notebooks, Chroma’s default

All three vector databases accept any of these; they are agnostic about where the vectors came from as long as the dimension matches what you declared at index creation.

When to Choose Which: Decision Tree

You want a notebook-based RAG demo today, with no signup. → Chroma. pip install chromadb, three lines, done. Move on.
You are building a real product and want managed infrastructure with zero ops. → Pinecone. The starter plan covers prototypes, the upgrade path is clean, the docs are the best in the category. You pay the price of vendor lock-in.
You want a real free tier you can leave running, with an exit door to self-host when traffic grows. → Qdrant. The 1 GB cloud cluster carries a small production app, and when you outgrow it the migration to a self-hosted Docker container is one snapshot restore away.
You need hybrid search (BM25 + dense) without paying for a premium tier. → Qdrant. It is the only one of the three that ships full sparse-dense hybrid in its free tier.
You need to filter by tens of metadata fields during retrieval. → Qdrant. Payload filtering happens inside the HNSW walk, not as a post-filter, which preserves recall when the filter is selective.
You are deploying to a customer’s air-gapped environment. → Qdrant or Chroma. Pinecone is not an option here.
Your team has zero appetite for running a database. → Pinecone. The serverless model is the closest thing to “vector DB as an HTTP function” in the market.
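
The metadata-filtering point in the list above is easiest to see with a toy example. The sketch below is not how Qdrant implements it (per the text, Qdrant applies the filter during the HNSW walk); it is a brute-force illustration, with invented documents, scores, and a hypothetical "lang" field, of why filtering after the top-k cut loses results when the filter is selective.

```python
# Toy illustration: filtering after vs. before the top-k cut.
# The documents, scores, and "lang" field are invented for this example.

docs = [
    {"id": 1, "score": 0.95, "lang": "en"},
    {"id": 2, "score": 0.93, "lang": "en"},
    {"id": 3, "score": 0.90, "lang": "en"},
    {"id": 4, "score": 0.72, "lang": "de"},
    {"id": 5, "score": 0.65, "lang": "de"},
    {"id": 6, "score": 0.40, "lang": "de"},
]

def top_k(items, k):
    """Return the k highest-scoring items."""
    return sorted(items, key=lambda d: d["score"], reverse=True)[:k]

def post_filter(items, k, lang):
    """Take the top k first, then drop anything that fails the filter."""
    return [d for d in top_k(items, k) if d["lang"] == lang]

def pre_filter(items, k, lang):
    """Apply the filter first, then take the top k of what is left."""
    return top_k([d for d in items if d["lang"] == lang], k)

post_ids = [d["id"] for d in post_filter(docs, 3, "de")]
pre_ids = [d["id"] for d in pre_filter(docs, 3, "de")]
print(post_ids)  # []
print(pre_ids)   # [4, 5, 6]
```

The post-filter path returns nothing because every one of the three best-scoring documents fails the filter, even though three matching documents exist further down the ranking.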

The Self-Host vs Managed Trade-Off

This is the question that decides 80% of the choice between Qdrant/Chroma and Pinecone. Self-hosting is free in money and expensive in attention. A small VPS — Oracle Cloud’s always-free ARM tier gives you four cores and 24 GB of RAM for $0 forever — can comfortably run Qdrant or Chroma serving a small RAG app, and the marginal cost of growth is just whatever extra RAM you buy.

What self-hosting does not give you for free is:

Automatic snapshot-and-restore on a schedule you trust
Multi-region replication for HA
An on-call rotation when the disk fills up at 3 a.m.
A vendor support contract when something subtle breaks

For a hobby app or an MVP, those things do not matter — the cost of an outage is your own time. For anything with revenue attached, the managed option starts to look cheap. Qdrant’s strength is that the same query interface works on both, so the migration story is straightforward when the project’s stakes change.

Integration with LangChain, LlamaIndex, and the LLM Layer

All three databases have first-class connectors in the major orchestration libraries — there is no reason to pick on integration coverage:

LangChain: langchain-qdrant, langchain-pinecone, langchain-chroma are all official packages with active maintenance.
LlamaIndex: Same story — QdrantVectorStore, PineconeVectorStore, ChromaVectorStore all live in the core repo or first-party plugins.
Haystack, LlamaCpp, Semantic Kernel: All three databases are first-tier choices.

On the LLM side, the vector database is independent of the model you use to generate answers. Free-tier RAG stacks I see most often in 2026:

Embeddings: Cohere Embed v3 (free trial key)
Reranker: Cohere Rerank v3 (same key)
Vector store: Qdrant Cloud free or local Chroma
LLM: Groq Llama 3.3, Gemini 2.5 Flash, or Together AI’s free model tier

That entire pipeline costs $0 up to the point where any single component’s free quota runs out, which for most personal projects is essentially never.

FAQ

Is pgvector a better choice than these three?

If you already run PostgreSQL and your collection fits in a single Postgres box, pgvector is a serious option — one fewer service to operate, transactional consistency with your other tables, mature backups. It loses to Qdrant on filtering performance at scale and on hybrid search, and it tops out earlier on throughput. For a RAG project where Postgres is already in the stack, start there. For a new project, the specialised databases are easier to reason about.

What about Weaviate, Milvus, Zilliz, Vespa?

All worth knowing. Weaviate has the most ambitious built-in module system (it ships its own embedders, rerankers, multi-tenancy, generative search), but the managed free tier is a 14-day trial, not permanent. Milvus is the heavyweight open-source choice for hundred-million-vector deployments; overkill for a starting project. Zilliz is the managed Milvus, with a serverless free tier that competes with Pinecone. Vespa is Yahoo’s open-source search engine that also does vectors well, and is the right pick if you need full text + vectors + structured filters at search-engine scale. For free-tier RAG, the three covered here are the most popular for a reason — they have the lowest activation energy.

Can I use these databases without an LLM at all?

Yes — vector databases are useful any time you have items and want similarity search. Recommendation systems, semantic search across product catalogues, duplicate detection, image similarity (with image embeddings), code search. RAG is the headline use case but not the only one.

How big does my vector index have to be before I need a real database?

Rule of thumb: under 100 K vectors, a flat numpy array with cosine similarity is faster than any database and zero ops. From 100 K to a few million, an in-process library like Chroma or FAISS is fine. Past 10 M vectors, you want a real database with persistence, snapshots, and a binary protocol — Qdrant, Pinecone, or Weaviate. The crossover is fuzzy; the gradient is real.
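
For the smallest bracket in that rule of thumb, the whole "database" can be a list of vectors and a loop. Here is a minimal sketch of brute-force cosine search, written in plain Python so it runs without dependencies; with numpy the scoring loop collapses into a single matrix-vector product over normalised rows. The function names and the tiny corpus are illustrative only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query, vectors, k=5):
    """Return the k (index, similarity) pairs most similar to the query."""
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Four made-up 3-dimensional "embeddings"
corpus = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
hits = search([1.0, 0.0, 0.0], corpus, k=2)
print([i for i, _ in hits])  # [0, 2]
```

Every query scans the full corpus, which is exactly why this approach stops being attractive as the collection grows and an index such as HNSW starts to pay for itself.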

Do I need to re-embed everything when I change my embedding model?

Yes. Embeddings from different models are not interoperable — a query vector from OpenAI cannot be searched against documents embedded with Cohere. This is the single biggest hidden cost of RAG. When you change embedding models, you re-embed your entire corpus, which is also a re-write of every vector in the database. Plan migrations.

What is a “write unit” or “read unit” in Pinecone’s pricing?

Pinecone’s serverless billing splits operations into read units and write units, where one read unit roughly equals one similarity query that returns up to 10 results from a small index, and one write unit roughly equals one vector upserted. The actual conversion depends on index size and result count — the Pinecone docs have the exact formula. For most chatbot workloads, 2 M read units a month covers far more queries than you would expect.

Free Whisper API: Groq, Deepgram, AssemblyAI Compared

toolfreebie — Thu, 28 May 2026 08:40:43 +0000

OpenAI’s Whisper changed speech-to-text the same way Llama changed open chat models: a frontier-grade ASR model the entire industry could host, fine-tune, and run on commodity hardware. Two years later, the question for most developers is no longer which model to use — it is which hosted API gives me Whisper-quality transcription without a bill.

Three providers dominate the answer in 2026: Groq, Deepgram, and AssemblyAI. All three give you Whisper (or a Whisper-class model) behind a hosted API with a free path to first transcription. None of them require you to spin up a GPU instance, manage CUDA drivers, or fight a Python audio dependency tree. But the meaning of “free” varies wildly between them, and the right pick depends entirely on what you are building.

This guide compares the three on the metrics that actually matter — real free-tier ceilings, per-hour cost once you pass them, supported languages, latency, file-size limits, and the engineering trade-offs you will hit when traffic grows. Every number cited links back to the provider’s own pricing or docs page; nothing here is fabricated benchmark theatre.

The 30-Second Answer

Provider	Free path	Whisper model	Paid rate (cheapest)	Best for
Groq	True free tier, no card	whisper-large-v3 + turbo	$0.04/hr (turbo)	Fast batch transcription, hackathons, side projects
Deepgram	$200 signup credit	Whisper Cloud (whisper-large)	~$0.29/hr Whisper · $0.258/hr Nova-3	Production transcription with diarization and SLAs
AssemblyAI	$50 signup credit	Whisper-Streaming	$0.30/hr Whisper · $0.15/hr Universal	Production pipelines that need Whisper + summary/sentiment in one call

If you want a no-strings, no-card free tier you can ship a real side project on, Groq is the only one that fits. If you want a high-quality production transcription stack with $200 of runway to evaluate it on, Deepgram wins. If you want Whisper plus a stack of additional NLP features (chapter detection, sentiment, entity extraction, summarization) in the same request, AssemblyAI is the cleanest single-API choice.

The rest of this article unpacks why.

Why “Free Whisper API” Is Worth Searching For

The official OpenAI Whisper API costs $0.006 per minute of audio, which works out to $0.36 per hour. That sounds cheap until you do the math on a real workload:

A podcast transcription tool processing 1,000 hours/month = $360/month on OpenAI
A meeting-bot SaaS averaging 50 hours/customer/month at 200 customers = $3,600/month
A user-generated content platform with 10,000 hours of audio/month = $3,600/month
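
The three figures above all come from the same multiplication, which is worth keeping as a helper when you compare providers. A small sketch, using only the $0.006/minute rate quoted in this section (the function name is made up for illustration):

```python
OPENAI_WHISPER_PER_MINUTE = 0.006  # USD, the rate quoted above

def monthly_cost(audio_hours, rate_per_minute=OPENAI_WHISPER_PER_MINUTE):
    """Monthly bill in USD for a given number of audio hours."""
    return round(audio_hours * 60 * rate_per_minute, 2)

print(monthly_cost(1_000))     # podcast tool: 360.0
print(monthly_cost(50 * 200))  # meeting bot, 50 h x 200 customers: 3600.0
print(monthly_cost(10_000))    # UGC platform: 3600.0
```

Pass a different `rate_per_minute` to price the same workload against any other per-minute rate.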

Self-hosting Whisper on your own GPU is cheaper at scale, but only if you actually have the GPU, the DevOps capacity to keep it running, and a workload large enough that the instance never sits idle. For the 90% of projects that don’t, the question becomes: which hosted API offers the cheapest entry path? That is exactly what the providers below compete on.

What “free” actually means in this market

There are two distinct shapes of “free Whisper API” on offer in 2026:

Genuine free tier: A permanently free quota every account gets, refilled daily or monthly, no credit card required. Groq is the only major provider doing this for speech-to-text.
Free credits at signup: A one-time wallet of credits ($50–$200) you spend down at paid rates. Once gone, you pay or stop. Deepgram and AssemblyAI use this model.

Both are useful — they just suit different stages of a project. A free-tier API is ideal for a personal tool, a demo, or a workload with predictable low volume. Free credits are better for prototypes that need higher concurrency or premium features (diarization, summarization) up front, with a clean ramp into paid usage when the product is real.

Groq Whisper API: The Only True Free Tier

Groq built its reputation around Language Processing Units (LPUs) that serve Llama and DeepSeek faster than any GPU cloud. In 2025 they extended that infrastructure to OpenAI’s Whisper models — and unlike every other Whisper host, they gave it a real, no-card free tier that anyone with an email address can use.

Models on offer

Model ID	Paid price	Description
`whisper-large-v3`	$0.111/hour	OpenAI’s flagship Whisper checkpoint, highest accuracy
`whisper-large-v3-turbo`	$0.04/hour	Distilled, ~8× faster, small accuracy drop on long audio

Both models are multilingual (99+ languages for transcription), and both support a separate translation endpoint that returns English text from any source language. The minimum billed length is 10 seconds — even a 2-second clip charges as 10.

Free-tier ceiling (the real one)

Groq’s published rate limits for the free tier on either Whisper model are:

20 requests per minute
2,000 requests per day
7,200 audio seconds per hour (2 hours of audio every hour)
28,800 audio seconds per day (8 hours of audio every day)
25 MB max file size on free tier, 100 MB on the paid Dev tier

That ceiling is unusually generous for a “free” tier. Eight hours of transcribed audio per day, every day, with no card and no expiry, is enough to run a real podcast transcription side project or a daily meeting-notes tool for one person indefinitely. If you cross the 25 MB file limit, chunk the audio with ffmpeg before sending; Groq’s docs include a recommended chunking snippet.
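
To check whether a planned workload fits under those daily limits, a rough budget check is enough. The sketch below uses the daily request and audio-second caps listed above, and it assumes the 10-second minimum billed length also counts toward the audio-second quota, which the limits list does not state. It ignores the per-minute and per-hour limits, so treat a "True" as necessary rather than sufficient.

```python
MIN_BILLED_SECONDS = 10             # minimum billed length per request
MAX_REQUESTS_PER_DAY = 2_000        # free-tier daily request cap
MAX_AUDIO_SECONDS_PER_DAY = 28_800  # free-tier daily audio cap (8 hours)

def daily_usage(clip_seconds):
    """Return (requests, billed_seconds) for one day's clips.

    Assumption: short clips are rounded up to the 10-second minimum
    for quota purposes as well as for billing.
    """
    billed = sum(max(MIN_BILLED_SECONDS, s) for s in clip_seconds)
    return len(clip_seconds), billed

def fits_free_tier(clip_seconds):
    """True if the day's clips stay within both daily caps."""
    requests, billed = daily_usage(clip_seconds)
    return (requests <= MAX_REQUESTS_PER_DAY
            and billed <= MAX_AUDIO_SECONDS_PER_DAY)

voice_notes = [2] * 1_500   # 1,500 two-second clips
meetings = [3_600] * 9      # nine one-hour recordings

print(daily_usage(voice_notes))     # (1500, 15000)
print(fits_free_tier(voice_notes))  # True
print(fits_free_tier(meetings))     # False
```

Under that assumption, many tiny clips cost five times their real duration in quota, so batching short audio into longer requests stretches the free tier further.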

Code: transcribe a file with Groq

curl https://api.groq.com/openai/v1/audio/transcriptions \
  -H "Authorization: Bearer $GROQ_API_KEY" \
  -F "file=@meeting.mp3" \
  -F "model=whisper-large-v3-turbo" \
  -F "response_format=verbose_json"

Python with the OpenAI SDK (Groq is OpenAI-compatible on this endpoint):

import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

with open("meeting.mp3", "rb") as audio:
    result = client.audio.transcriptions.create(
        model="whisper-large-v3-turbo",
        file=audio,
        response_format="verbose_json",
        timestamp_granularities=["segment"],
    )

print(result.text)
for seg in result.segments:
    print(f"[{seg.start:.1f} – {seg.end:.1f}] {seg.text}")

The verbose_json response includes word- or segment-level timestamps you can use for captions, search indexing, or feeding into LLM summarization. If you only need the transcript string, response_format=text drops the JSON envelope.

Where Groq is a poor fit

No built-in speaker diarization. Whisper itself doesn’t predict speaker turns; Deepgram and AssemblyAI run a separate diarization model alongside transcription. If you need “Speaker 1 / Speaker 2” output, plug pyannote.audio or a hosted diarizer in front of Groq, or pick a different provider.
No long-running async jobs. Every request is synchronous. For files over ~60 minutes, chunk and merge yourself.
No production SLA on the free tier. Limits change occasionally; production workloads should sit on the paid Dev tier.

Deepgram Whisper Cloud: The $200 Production Path

Deepgram has been one of the dominant production speech-to-text vendors since well before Whisper existed. They run their own ASR model family (Nova-3, the current flagship; Nova-2; and the real-time Flux model) and also host Whisper as a managed product called Whisper Cloud. Whisper Cloud sits alongside their proprietary models behind one API key, so you can A/B both on the same audio and pick whichever wins for your data.

The free path: $200 of credit

Deepgram gives every new account $200 of API credit at signup, no card required. Their pricing page describes it as “free $200 credit, then pay as you go.” There is no fixed expiry on the credit, which is unusual — most competitors expire credits at 30–90 days.

At Whisper Cloud’s published rate (~$0.0048/minute, or roughly $0.288/hour at the time of writing, with concurrency capped at 5 streams on the free tier), $200 of credit gives you something like ~700 hours of Whisper transcription to evaluate the product before you commit. If you decide Deepgram’s own Nova-3 model is good enough — and for English audio it usually is — $200 stretches further because Nova-3 is cheaper per minute and faster.
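
The runway estimate is simple division, and it is worth redoing whenever a rate changes. A quick sketch using the figures quoted in this article at the time of writing ($0.0048/minute for Whisper Cloud, $0.258/hour for Nova-3); the helper name is illustrative:

```python
def runway_hours(credit_usd, rate_per_hour):
    """Hours of audio a signup credit covers at a given hourly rate."""
    return credit_usd / rate_per_hour

whisper_hourly = 0.0048 * 60  # per-minute rate converted to per-hour
nova3_hourly = 0.258          # Nova-3 rate quoted in the summary table

print(round(whisper_hourly, 3))                  # 0.288
print(round(runway_hours(200, whisper_hourly)))  # 694
print(round(runway_hours(200, nova3_hourly)))    # 775
```

So the same $200 buys roughly 80 more hours on Nova-3 than on Whisper Cloud at these rates.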

Whisper Cloud vs Nova-3: the trade-off Deepgram wants you to make

Whisper Cloud is positioned as a compatibility option for teams who already pipe through Whisper and want a hosted replacement for self-hosted inference. Deepgram’s real recommendation for new builds is Nova-3, because:

Nova-3 is cheaper per minute
Nova-3 has built-in speaker diarization, smart formatting, language detection, and profanity filtering in the same request
Nova-3 supports real-time streaming as a first-class feature; Whisper is fundamentally batch

For most production English transcription pipelines in 2026, Nova-3 is the better answer — and if you arrived here searching “free Whisper API,” it’s worth pricing both before you commit. Whisper Cloud remains the right pick if you specifically need Whisper’s multilingual behavior or you’re benchmarking a model swap.

Code: transcribe with Deepgram (Whisper or Nova)

curl -X POST \
  -H "Authorization: Token $DEEPGRAM_API_KEY" \
  -H "Content-Type: audio/wav" \
  --data-binary @meeting.wav \
  "https://api.deepgram.com/v1/listen?model=whisper-large&punctuate=true"

Swap the model to Nova-3 by changing model=whisper-large to model=nova-3. The Python SDK is a thin wrapper:

import os

from deepgram import DeepgramClient, PrerecordedOptions

dg = DeepgramClient(os.environ["DEEPGRAM_API_KEY"])

with open("meeting.wav", "rb") as f:
    payload = {"buffer": f.read()}

options = PrerecordedOptions(
    model="whisper-large",  # or "nova-3"
    punctuate=True,
    diarize=True,            # Nova-3 only; ignored on whisper-large
    smart_format=True,
)

response = dg.listen.rest.v("1").transcribe_file(payload, options)
print(response.results.channels[0].alternatives[0].transcript)

Where Deepgram is a poor fit

Once the $200 runs out, you’re paying. No free tier waits behind it. Budget the runway accordingly.
Higher concurrency requires paid plans. The five-stream cap on the trial is enough to evaluate, not to ship a real concurrent batch pipeline.
Whisper Cloud is not Deepgram’s strategic priority. Expect Nova to get the new features first; Whisper Cloud is a compatibility-and-evaluation product.

AssemblyAI: Whisper Plus the Full NLP Stack

AssemblyAI takes a different approach. Instead of competing on “we host Whisper cheaply,” they sell a layered speech intelligence platform where transcription is the foundation and the value is everything stacked on top — chapter detection, sentiment analysis, named-entity extraction, content moderation, summarization, topic classification. All available in the same request that produces the transcript.

The free path: $50 of credit

AssemblyAI gives new accounts $50 of credit on signup, no credit card required. The two relevant models:

Universal-3 Pro (Async) — their current flagship pre-recorded model, $0.15/hr at the time of writing. Recommended for new builds.
Whisper-Streaming — the open-source Whisper model hosted on AssemblyAI’s infrastructure, $0.30/hr, supports 99+ languages.

$50 of credit covers roughly 166 hours of Whisper-Streaming or 333 hours of Universal-3 Pro — plenty to prototype, demo, or transcribe a backlog of meeting recordings before you have to pay.

Why pick AssemblyAI’s Whisper over Groq’s

The answer is almost always: because you also want the layered features. If you only need transcript text, Groq’s free tier is strictly better: same model family, no card, no credit clock. The reason to buy AssemblyAI is that adding feature flags such as sentiment_analysis: true or auto_chapters: true to a single API call can return:

Per-sentence sentiment (positive / negative / neutral with confidence)
Auto-generated chapter boundaries with headlines for long-form audio
Named entities (PERSON, ORG, LOCATION, etc.) with timestamps
Topic categories from the IAB taxonomy
PII redaction in the transcript

Reproducing that stack on top of Groq means a second LLM call, your own entity-extraction prompt, and your own chaptering logic. For one project that’s fine. For a SaaS product, the integration cost of doing it yourself rapidly exceeds the price difference per hour.

Code: transcribe with AssemblyAI

AssemblyAI’s API is two-step (upload + transcribe) rather than a single multipart POST:

import os, requests, time

API_KEY = os.environ["ASSEMBLYAI_API_KEY"]
headers = {"Authorization": API_KEY}

# 1. Upload audio
with open("meeting.mp3", "rb") as f:
    upload = requests.post(
        "https://api.assemblyai.com/v2/upload",
        headers=headers,
        data=f,
    ).json()
audio_url = upload["upload_url"]

# 2. Request transcription (with optional features)
job = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    headers=headers,
    json={
        "audio_url": audio_url,
        "speech_model": "universal",  # or "whisper-streaming"
        "speaker_labels": True,
        "auto_chapters": True,
        "sentiment_analysis": True,
    },
).json()

# 3. Poll until done
while True:
    status = requests.get(
        f"https://api.assemblyai.com/v2/transcript/{job['id']}",
        headers=headers,
    ).json()
    if status["status"] in ("completed", "error"):
        break
    time.sleep(3)

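# Surface failures instead of printing an empty transcript
if status["status"] == "error":
    raise RuntimeError(status.get("error", "transcription failed"))

print(status["text"])
for chapter in status.get("chapters") or []:
    print(f"[{chapter['start']/1000:.0f}s] {chapter['headline']}")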

Where AssemblyAI is a poor fit

Free credit runs out fast on heavy workloads. $50 is roughly a quarter of Deepgram’s $200.
Two-step upload adds latency. Bigger files take longer to upload than to transcribe in some cases.
Universal-3 Pro is not Whisper. If your codebase or contracts specifically mandate Whisper output, choose Whisper-Streaming explicitly, accept the higher per-hour rate, and don’t drift toward Universal “because it’s cheaper.”

Honorable Mentions: Other Ways to Get Free Whisper

The three above are the practical answers for hosted Whisper in 2026, but a few alternatives are worth knowing about.

Self-host with `faster-whisper` or `whisper.cpp`

If you already have a GPU box (or even a recent MacBook), faster-whisper (CTranslate2) and whisper.cpp deliver real-time-or-better transcription on hardware you already own. Truly free at the marginal level. The catch: you own the operational complexity (driver updates, OOM crashes, queueing). For a personal tool this is fine; for a SaaS, the time it costs you is rarely worth the API savings until volume passes ~500 hours/month.

Hugging Face Inference API

Hugging Face’s free Inference API can call OpenAI Whisper checkpoints, but rate limits are aggressive and request latency on the free tier is unpredictable. Useful for one-off testing in a notebook; not a production option.

Cloudflare Workers AI Whisper

Cloudflare Workers AI includes Whisper among its 47+ free models, billed in “neurons” rather than minutes. If you already run your stack on Cloudflare Workers, it integrates very cleanly and the free daily neuron quota is generous. Less compelling as a standalone choice if you’re not on Cloudflare.

The official OpenAI Whisper API

$0.006/minute, billed against your OpenAI usage. Not free, but worth listing as the reference price every other provider competes against. If you already have OpenAI usage running and don’t want a third API key in your codebase, it’s the path of least integration friction.

Side-by-Side Spec Sheet

Feature	Groq	Deepgram	AssemblyAI
Free tier shape	Permanent free tier, no card	$200 signup credit	$50 signup credit
Whisper model	large-v3, large-v3-turbo	whisper-large (Whisper Cloud)	Whisper-Streaming
Native non-Whisper model	—	Nova-3, Nova-2, Flux	Universal-3 Pro, Universal-2
Cheapest paid rate	$0.04/hr (turbo)	~$0.258/hr (Nova-3)	$0.15/hr (Universal-2)
Speaker diarization	No	Yes (Nova-3)	Yes
Real-time streaming	No	Yes (Flux, Nova)	Yes
Summarization / chapters	No (DIY via LLM)	Limited	Yes (auto-chapters)
Sentiment / entities	No	Limited	Yes
Max file size (single request)	25 MB free / 100 MB dev	2 GB	5 GB (URL) / 2.2 GB (upload)
API style	Synchronous, OpenAI-compatible	Synchronous + streaming	Async upload + poll
Languages	99+ (Whisper)	30+ (Nova) / 99+ (Whisper)	99+ (Whisper) / 17+ (Universal)

Decision Tree: Which One Should You Pick?

Run through this list top to bottom. The first row that matches your situation is your answer.

I am building a side project / hackathon entry / personal tool. → Groq. No card, real free tier, fastest to first transcription.
I need speaker diarization (who said what) in the output. → Deepgram Nova-3 if production-bound, AssemblyAI if you also need chapters/summary.
I need Whisper specifically (the same model my self-hosted setup uses now) as a hosted swap. → Deepgram Whisper Cloud, then evaluate Nova-3 on the same audio as a possible cheaper replacement.
I need transcript + sentiment + chapters + entities from one API call. → AssemblyAI. The integration cost saved is worth the higher per-hour rate.
I need real-time streaming transcription for a voice agent. → Deepgram Flux/Nova or AssemblyAI Universal Streaming. Groq is batch-only.
I have heavy multilingual audio (Spanish, Mandarin, Hindi, Arabic, etc.). → Groq whisper-large-v3 for cost, AssemblyAI Whisper-Streaming for accuracy + post-processing.
I already run my backend on Cloudflare Workers. → Cloudflare Workers AI Whisper — integration savings beat per-hour savings here.
I already have an OpenAI key wired in and don’t want a third vendor. → Official OpenAI Whisper API, $0.006/min. Don’t optimize what you don’t need to.

Combining Free Whisper with a Free LLM

The real productivity unlock isn’t transcription on its own — it’s transcription plus an LLM pass on the resulting text. A reasonable free stack for a side-project transcription tool in 2026 looks like:

Audio in: Groq whisper-large-v3-turbo (free, fast).
LLM pass on the transcript: Groq Llama 3.3 70B, Cohere Command R+, or Together AI Llama 3.3 70B Free for summarization, action-item extraction, or speaker attribution via prompt.
Embedding for search: Cohere Embed v3 or another free embedding tier.

Three free API keys, zero cards, end-to-end speech-to-search. The same transcription that costs $0.36/hour on the official OpenAI API can run free as long as you stay within each provider’s daily ceiling.

FAQ

Is OpenAI’s Whisper actually free?

The Whisper model weights are MIT-licensed and free to self-host. The OpenAI Whisper API ($0.006/min) is not free — there is no free tier and you need a credit card on file. When people say “free Whisper API” they almost always mean a third-party host (Groq, Deepgram, AssemblyAI) that runs Whisper for you with a free path in.

Which Whisper API is the most accurate?

All three host the same underlying whisper-large-v3 checkpoint (or a distilled variant of it), so transcription accuracy on identical audio is comparable. Differences in real-world output come from preprocessing (audio normalization, VAD), post-processing (smart formatting, punctuation), and whether diarization is layered on top. Groq runs the cleanest “raw Whisper” output; Deepgram and AssemblyAI add post-processing that usually helps for English business audio.

Can I use these APIs for real-time transcription?

Whisper itself is a batch model — it ingests a complete audio file and returns a transcript. Groq is batch-only. Deepgram offers real-time streaming via Nova and the Flux model (not Whisper). AssemblyAI offers Universal Streaming and Whisper-Streaming for real-time use. For voice-agent latency budgets, Nova-3 and AssemblyAI Universal Streaming are the practical picks; Whisper itself is not ideal for sub-second response.

What’s the difference between whisper-large-v3 and whisper-large-v3-turbo?

Turbo is a distilled version of large-v3 — fewer decoder layers, ~8× faster, and substantially cheaper to serve. The accuracy gap on standard benchmarks is small (a few percent WER) and only meaningful on long, noisy, or accented audio. For most use cases turbo is the right default; reach for large-v3 only when you’ve benchmarked turbo on your data and found it lacking.

Can I use the free tier commercially?

Groq permits commercial use on the free tier within the published rate limits; their paid Dev tier exists to lift those limits and add SLA, not to gate commercial access. Deepgram and AssemblyAI credits are usable for any purpose — they’re paid usage you didn’t pay for yet. Always re-read each provider’s TOS before deploying commercially; it changes.

How do I handle audio files larger than 25 MB on Groq?

Chunk the audio before sending. The simplest reliable approach is ffmpeg -i input.mp3 -f segment -segment_time 600 -c copy chunk_%03d.mp3 to split into 10-minute pieces, transcribe each, and concatenate the resulting text. Groq’s docs include a more aggressive recipe that downsamples to 16 kHz mono first, which both reduces file size and matches Whisper’s training audio format.
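
The "transcribe each, and concatenate" step needs one extra detail if you want timestamps: each chunk's segments start again at zero. Here is a minimal sketch of the merge, assuming each chunk's result is a list of segment dicts with "start", "end", and "text" keys (a hypothetical shape modelled on verbose_json output) and that every chunk is exactly 600 seconds long. Stream-copy splits can drift slightly from the requested length, so treat the offsets as approximate.

```python
CHUNK_SECONDS = 600  # matches -segment_time 600 in the ffmpeg command

def merge_chunks(chunks, chunk_seconds=CHUNK_SECONDS):
    """Combine per-chunk results into one transcript with global timestamps.

    `chunks` is a list, in file order, of lists of segment dicts with
    "start" and "end" (seconds within the chunk) and "text".
    """
    merged = []
    for index, segments in enumerate(chunks):
        offset = index * chunk_seconds  # where this chunk starts in the original
        for seg in segments:
            merged.append({
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
                "text": seg["text"].strip(),
            })
    full_text = " ".join(seg["text"] for seg in merged)
    return full_text, merged

# Made-up results for two consecutive chunks
chunks = [
    [{"start": 0.0, "end": 4.2, "text": " Welcome back."}],
    [{"start": 1.5, "end": 3.0, "text": " Next item."}],
]
text, segments = merge_chunks(chunks)
print(text)                  # Welcome back. Next item.
print(segments[1]["start"])  # 601.5
```

A word cut in half at a chunk boundary will still come out garbled; splitting on silence or overlapping chunks by a few seconds are the usual mitigations, at the cost of some deduplication logic.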

Which one has the best multilingual support?

Anywhere you see “whisper-large-v3” you get OpenAI’s published 99+ language coverage. Groq, Deepgram Whisper Cloud, and AssemblyAI Whisper-Streaming are all equivalent there. Deepgram Nova-3 supports a smaller set (around 30+ languages) but is faster and cheaper for the languages it does support — primarily English with strong coverage of Spanish, French, German, Portuguese, Italian, Dutch, Hindi, Japanese, Korean, and Mandarin.

Do any of these offer free real-time streaming?

Not at production volume. Deepgram and AssemblyAI both bill streaming minutes against their respective free credits ($200 and $50). Groq doesn’t offer streaming at all. If real-time is core to your product, plan to pay; the free credits are useful for evaluation, not for shipping a public voice product.

Bolt.new: Free AI App Builder That Codes, Runs, and Deploys in Your Browser

toolfreebie — Thu, 28 May 2026 08:35:15 +0000

What Is Bolt.new?

Bolt.new is a free, browser-based AI app builder from StackBlitz. You type a prompt — “build me a Next.js todo app with Supabase auth” — and Bolt scaffolds the project, writes the code, runs npm install, boots a dev server, and shows you the live preview, all inside a browser tab. When you like what you see, one click deploys it to Netlify with a real URL.

The trick is that none of the build happens on a remote sandbox. The Node.js runtime, the package manager, the dev server, and the file system all live inside your browser, courtesy of StackBlitz WebContainers. When the AI agent edits a file, the change is immediate — there is no round trip to a Docker container in the cloud, no cold start, no queue. That single architectural choice is what makes Bolt feel a generation ahead of older AI coding tools that wrap a remote VM in a chat UI.

In Eric Simons’ interview on the Latent Space podcast, the StackBlitz CEO described Bolt.new’s growth as “the fastest software product I have ever seen go from zero to viral” — the product crossed $8M ARR within two months of launch in late 2024, before adding any salespeople. By 2026 it has settled into one of the three most-used AI app builders alongside Lovable and v0 by Vercel.

This guide is the honest 2026 take on Bolt.new’s free tier: what 1 million tokens per month actually buys, where Bolt beats the alternatives, where it falls down, and how to combine it with free AI APIs and self-hosted clones to get more out of it without paying for Pro.

The WebContainer Trick: Why “AI Code in Browser” Actually Works

Most AI coding products fall into one of two camps. The first runs in your local editor — Cursor, Cline, GitHub Copilot, Aider — and assumes you have Node.js, Python, Docker, and the rest of your toolchain already set up. The second runs on a remote sandbox: Replit Agent, GitHub Codespaces, Gitpod. They give you a VM in the cloud and pipe a terminal back to your browser.

Bolt.new is the only widely-used product in a third camp. The Node.js runtime, the package manager, the file system, and the HTTP server are all compiled to WebAssembly and run inside a single browser tab. There is no VM to rent and no laptop to set up. The first time you open Bolt.new on a fresh Chromebook, you can prompt “build me a SvelteKit blog with markdown posts” and have a running app in 60 seconds.

Three concrete consequences of this:

Cold start is zero. The dev server is ready the moment your AI agent finishes writing files — no container provisioning, no docker pull, no waiting room.
Compute cost is your laptop, not their cloud. StackBlitz pays nothing per running project, which is a big part of why a $25/month Pro plan can include 10M tokens of Claude usage. Their marginal cost is the AI tokens, not the hosting.
You can fork and remix instantly. Sharing a Bolt.new URL is the same as sharing source — anyone who opens it has a fully running clone in their browser within a few seconds. There is no “deploy this to a sandbox” intermediate step.

The only thing WebContainers cannot do is run native binaries that aren’t compiled for WASM. That rules out Docker-in-Docker, native Postgres, Python data science with NumPy/Pandas (some Python does work via Pyodide, but it is slower and more limited), and anything that calls into a system library beyond the Node ecosystem. For 90% of modern web app prototyping — React, Next.js, Vue, Svelte, Astro, Remix, Vite, Tailwind, TypeScript — none of that matters. For ML pipelines or anything with a heavy native dependency, Bolt is not the right tool.

Free Tier: What 1M Tokens Per Month Actually Buys

Per the Bolt pricing page, the 2026 free tier gives you:

1,000,000 tokens per month — the soft cap on AI usage
300,000 tokens per day — the hard daily cap; you cannot burn the whole month in one sitting
No rollover — unused free-tier tokens are lost when the daily and monthly allowances reset; only paid tokens roll over (up to one extra month)
Bolt branding on deployed sites
Public projects only — the free tier cannot make a project private

“How much app is 300,000 tokens?” is the question every new user asks, and the honest answer is: less than you’d expect. Each user prompt sends the entire current file tree, the conversation history, and tool definitions back to Claude. A medium-complexity edit to a 5-file Next.js project — say, “add a search bar to the header that filters the post list” — typically consumes 30,000–60,000 tokens of context. Across the day that gives you 5–10 meaningful prompts.
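The arithmetic behind that range is easy to check. The sketch below is a rough estimate only: it uses the 300,000-token daily cap and the 30,000–60,000 tokens-per-edit figures quoted above, and it assumes a fixed cost per prompt, which real sessions will not have because history grows as you go. The function name is made up for illustration.

```python
DAILY_CAP = 300_000  # free-tier daily token cap quoted above


def prompts_per_day(tokens_per_prompt: int, daily_cap: int = DAILY_CAP) -> int:
    """Whole prompts that fit under the daily cap at a fixed per-prompt cost."""
    if tokens_per_prompt <= 0:
        raise ValueError("tokens_per_prompt must be positive")
    return daily_cap // tokens_per_prompt


# Cost range for a medium-complexity edit, taken from the paragraph above.
cheap_edit, expensive_edit = 30_000, 60_000
print(prompts_per_day(expensive_edit), "to", prompts_per_day(cheap_edit), "prompts per day")
```

Because the per-prompt cost climbs as the conversation gets longer, treat the upper end of that range as optimistic.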

From an independent 2026 comparison of free AI app builders, the realistic free-tier output is “roughly 3-8 meaningful prompts per day” before you hit the wall. That is enough to scaffold a small project and do a couple of refinements. It is not enough to take a real app from nothing to production in a single sitting.

Two practical workarounds:

Plan offline first. Write a tight design doc — pages, routes, data model, third-party integrations — before opening Bolt. The fewer “wait, also do X” follow-ups you need, the more you get out of 300K tokens. Throwaway prompting destroys the free tier.
Use Bolt.new for scaffold + structure, then export to a real editor. Bolt has a “Download” button that ships you a zip with the full project. Open that in Cline or Aider with a free Gemini or OpenRouter key, and continue iterating without the token budget pressure.

Bolt.new vs Lovable vs v0 vs Replit Agent vs Cursor

Five products are usually in the same conversation in 2026. They look similar from the outside (“AI builds your app from a prompt”) but they make completely different trade-offs. This table summarizes the ones that matter for a developer choosing a default tool.

Feature	Bolt.new	Lovable	v0	Replit Agent	Cursor
Surface	Browser tab (in-browser runtime)	Browser tab (remote sandbox)	Browser tab (UI-only)	Browser tab (remote VM)	Local IDE (forked VS Code)
Free tier	1M tokens/mo, 300K/day	5 messages/day	Generous chat-only tier	Limited Agent trial that expires	2-week Pro trial; limited free
Stack scope	Full-stack web (Node ecosystem)	Full-stack web	UI components only (React/Tailwind)	Full-stack + databases + cron	Anything you have on disk
Default model	Claude Sonnet 4.6	Claude / GPT (managed)	v0-managed	Claude / GPT (managed)	Cursor-managed
Premium model	Opus 4.7 (paid plans)	Yes (paid)	Higher tiers on Vercel Pro	Yes	Yes
One-click deploy	Netlify (built in)	Built-in	Vercel	Replit Deployments	External (your choice)
Database	Connect Supabase	Connect Supabase	Connect Supabase / Neon	Built-in Postgres	Whatever you wire up
GitHub sync	Yes	Yes	Yes	Yes	Native
Open-source clone	bolt.diy	No	No	No	No
Pro pricing	$25/mo (10M tokens)	$25/mo	$20/mo	$25/mo Core	$20/mo

Three takeaways from the table that aren’t obvious to most newcomers:

Replit Agent and Bolt.new are full-stack; v0 is UI-only. If your prompt is “build me a complete app with auth and a database,” v0 is the wrong tool — it is meant to feed React components into an existing codebase, not to ship a finished product. Bolt and Replit both ship something runnable end-to-end.
Bolt’s WebContainer beats Replit’s sandbox for iteration speed. Replit Agent is more capable on long-running backend tasks (it has a real Linux VM), but every time the agent edits a file, you wait for the cloud sandbox to reload. Bolt feels nearly instant because the runtime is local to your browser tab.
Lovable is built for non-developers who want polish over control. If the team using the tool is not technical and the most important thing is that the output looks good without thinking about tailwind.config.js, Lovable is the right pick. If you want to read and edit the code yourself, Bolt is much more developer-friendly.

Five-Minute Quickstart: From Prompt to Deployed App

Here is the shortest path from “nothing” to “deployed Next.js app with a database” using only Bolt’s free tier.

Step 1 — Sign in

Open bolt.new in any Chromium-based browser (Chrome, Edge, Brave, Arc). Firefox works too as of 2026, with slightly slower WebContainer performance. Sign in with GitHub or email. There is no credit card prompt.

Step 2 — Write a tight first prompt

The single biggest free-tier optimization is your first prompt. Vague prompts (“make me a website”) force Bolt to ask follow-ups, each one burning context. A specific prompt produces a working scaffold in one shot.

A prompt that consistently produces a usable starter:

Build a Next.js 14 (App Router) blog with the following:

Pages:
- /            list of posts (title, date, excerpt, tag pill)
- /posts/[slug] full post with markdown rendering
- /admin       password-gated form to create new posts (env var ADMIN_PASSWORD)

Data:
- Posts stored in a single posts.json file in the repo
- Each post has: slug, title, date (ISO), tag, excerpt, body (markdown)

Styling:
- Tailwind CSS, dark mode default, monospace headlines, inline code blocks styled with a card shadow

Tooling:
- TypeScript strict mode
- next.config.js with images.unoptimized = true so it deploys cleanly to Netlify

After scaffolding, add three sample posts so I can see the styling.

Notice that everything is concrete: framework version, routing convention, data shape, styling, deploy target, even sample data. Bolt produces a runnable app in one pass on a prompt this specific and uses about 50K–80K tokens doing it.

Step 3 — Run the preview

The preview pane mounts as soon as npm install finishes, which on a typical broadband connection is 8–15 seconds. There is no “deploy” step yet — you are looking at the dev server running inside your browser.

Step 4 — Connect Supabase (optional)

For a real database instead of posts.json, click “Connect Supabase” in the right sidebar. Bolt will prompt you to create a free Supabase project and inject the credentials into .env.local automatically. Then prompt: “Move posts from posts.json to a Supabase table called posts. Replace the JSON imports with Supabase queries.” Bolt will write the migration, the queries, and update the routes in one pass — typically ~40K tokens.

Step 5 — Deploy

Click “Deploy” → choose Netlify → wait ~20 seconds. You get a real public URL like https://celadon-dingo-12345.netlify.app. Custom domains and removing the Bolt branding are paid features.

Step 6 — Push to GitHub

Click “Push to GitHub”. Bolt creates a public repo with all the code. From there, you can clone it locally and continue with Cline, Aider, or any free CLI agent — no longer paying the Bolt token budget for incremental edits.

The Token Economy: How to Make the Free Tier Last

The single most useful skill for any free-tier Bolt user is reading the token meter and writing prompts that respect it. Five things that meaningfully extend the daily 300K budget:

1. Front-load detail

One specific 2,000-token prompt is much cheaper than five 500-token clarifying prompts, because each follow-up sends the entire conversation history again. The cost grows quadratically with how chatty you are.
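A small model makes the quadratic claim concrete. The sketch below assumes every request resends the full history plus a fixed block of context (file tree, tool definitions). The 20,000-token fixed context is an illustrative guess, not a number from Bolt, and replies are ignored for simplicity.

```python
def total_tokens_sent(turn_sizes, fixed_context=0):
    """Total input tokens when every request resends all earlier turns.

    turn_sizes: tokens each turn adds to the history, in order.
    fixed_context: tokens sent on every request no matter what
                   (file tree, tool definitions); an assumed figure.
    """
    total = 0
    history = 0
    for size in turn_sizes:
        history += size                    # this turn joins the history
        total += fixed_context + history   # the whole history is resent
    return total


FIXED = 20_000  # hypothetical per-request overhead

one_detailed_prompt = total_tokens_sent([2_000], fixed_context=FIXED)
five_short_prompts = total_tokens_sent([500] * 5, fixed_context=FIXED)

print(one_detailed_prompt)   # 22000
print(five_short_prompts)    # 107500
```

For n equal turns of size s, the history term alone is s·n·(n+1)/2, which is where the quadratic growth comes from. The fixed context adds a linear cost on top of that.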

2. Pin a tight system prompt

Bolt lets you specify project-level instructions (“Always use TypeScript. Never install new packages without asking. Match the code style of the existing files.”) in the project settings. These are sent once per request and prevent Bolt from re-deciding conventions on every edit, which is a frequent token waster.

3. Edit small files yourself

Bolt’s editor is a real editor. If you need to tweak a literal string, fix a typo, or change a Tailwind class, just type the edit. Don’t burn 20K tokens asking Claude to do it.

4. Use “Discussion mode” for design conversations

Discussion mode lets you talk through architecture changes without editing files. It uses a much smaller context (no file tree, no diff machinery) and is meant for “should I use Server Actions or a tRPC layer?” conversations before you commit to a change.

5. Export and continue locally

The 300K daily cap is the real constraint. Once you hit it, the most productive move is to download the project as a zip, open it in VS Code with Cline + a free OpenRouter or Gemini key, and keep iterating with no token meter. Bolt is a great scaffolder; it doesn’t have to be your only editor.

When Bolt Hits Its Limits

Bolt is not the right tool for every project. Three categories where the WebContainer model breaks down:

Anything with a native binary dependency. ImageMagick, FFmpeg, Postgres-the-binary, Python with NumPy/Pandas/PyTorch, Rust toolchain, system-level audio/video. WebContainers run a Node-shaped runtime; calls into the system C library or other languages mostly don’t work. (Pyodide gives you Python-in-WASM, which Bolt can use for some scripts, but it is slow and missing the data-science package landscape.)
Long-running background workers. A WebContainer lives only as long as the browser tab is open. There is no daemon, no cron, no queue worker. For backends that need to run a Celery/Sidekiq-style worker, deploy elsewhere. Replit Agent, with its real VM, is a better fit if you need persistent background processing.
Large monorepos. Bolt is happiest with one to maybe a dozen workspaces. Pulling a 500-package pnpm monorepo into a browser tab is technically possible and miserable in practice — the file watcher overhead and memory pressure tank the UX. For codebases that size, use Cline or Aider locally.

For everything else — landing pages, internal admin tools, prototypes, indie SaaS MVPs, one-off scripts, design system playgrounds, hackathon submissions — Bolt is genuinely best-in-class.

Bolt.diy: The Open-Source Self-Hosted Alternative

StackBlitz published bolt.diy as the open-source variant of Bolt’s frontend in 2024, and it has a healthy community in 2026. Bolt.diy ships the same chat-and-preview UI but lets you bring your own model and run the whole thing on a free hosting platform.

The differences that actually matter:

Feature	Bolt.new (hosted)	bolt.diy (self-hosted)
Cost	Free (1M tokens/mo limit)	Free + your own API costs
Model	Claude Sonnet/Opus, managed	Any model — bring your own key
Free model option	No (always burns Bolt tokens)	Yes — pair with Gemini, Groq, Together
WebContainer	Yes (StackBlitz infra)	Yes (open StackBlitz npm package)
One-click deploy	Built-in Netlify	Manual
Setup time	0 minutes	~30 minutes
Privacy	Code goes through Bolt servers	Stays on your machine

The killer use case for bolt.diy: pair it with a free AI provider so you escape the 1M-tokens-per-month ceiling without paying for Bolt Pro. With Gemini’s free tier (1,500 requests/day) or Groq’s free tier (14,400 requests/day), you have effectively unlimited usage for personal projects. Quality is below Claude Sonnet 4.6 on the hardest tasks, but for most CRUD and UI work the gap is small.

Quick start with bolt.diy

# clone
git clone https://github.com/stackblitz-labs/bolt.diy
cd bolt.diy

# install
pnpm install

# bring your own key in .env.local
echo "GOOGLE_GENERATIVE_AI_API_KEY=your_gemini_key_here" > .env.local
# or
echo "GROQ_API_KEY=your_groq_key_here" >> .env.local

# run
pnpm run dev
# open http://localhost:5173

That’s it — five commands and you have a Bolt clone running locally with no token meter, no rate limits beyond what your free AI provider imposes.

Frequently Asked Questions

Is Bolt.new really free?

Yes. The free tier gives you 1,000,000 tokens per month with a 300,000 daily cap, no credit card required, and full access to Claude Sonnet 4.6. The only meaningful restrictions are: deployed sites have Bolt branding, all your projects are public, and you cannot remove the daily cap without upgrading. For prototyping and learning, the free tier is genuinely usable. For sustained daily work on a real project, you will hit the cap and want Pro.

Which AI model does Bolt.new use?

The default is Claude Sonnet 4.6 from Anthropic. Paid plans also unlock Claude Opus 4.7 for harder reasoning tasks, and a “Standard” mode for cheap polish edits. The model selection is built into the UI; you don’t pick the provider, only the depth/cost trade-off.

Can I use my own API key with Bolt.new?

Not on the hosted bolt.new — the model is managed and bundled into the token cost. If you want BYOK, use the open-source bolt.diy fork, which lets you plug in any model: Gemini, Groq, OpenRouter, Together AI, even local Ollama.

Does Bolt.new work offline?

The WebContainer runtime works offline once loaded — you can edit files and run code without a network — but the AI agent obviously can’t, since it calls Claude over HTTPS. For a fully offline AI coding workflow, use Cline or Aider with Ollama instead.

Can I deploy a Bolt.new app to Vercel or Cloudflare instead of Netlify?

Yes — push to GitHub, then connect the repo from Vercel or Cloudflare Pages. The one-click deploy inside Bolt only goes to Netlify, but the generated project is a normal Node/Next.js/Vite codebase that works on any modern host.

Does Bolt.new support backends other than Node?

Limited. WebContainers run a Node.js-shaped runtime, so anything that compiles to JavaScript (TypeScript, Elm, ReScript) works fine. Python via Pyodide works for scripts but is too slow for full backends. Go, Rust, Ruby, Java, .NET — not natively. For polyglot stacks, Replit Agent or your local editor is the right tool.

What happens to my projects if I let my Pro subscription lapse?

Projects stay where they are — Bolt does not delete code. You drop back to the free tier limits: 300K tokens/day, public projects only, Bolt branding on deployments. Already-private projects remain accessible to you but cannot be edited or visited by collaborators in the same way.

Is Bolt.new safe for client work?

Free-tier projects are public, so any client code with secrets in it is a problem. Pro lets you make projects private, which is the minimum bar for professional work. For NDA-grade work, the open-source bolt.diy on a self-hosted instance is the only fully-private option.

How does the token meter actually count?

Tokens reflect what Claude charges StackBlitz: input tokens (your prompt + file context + history) plus output tokens (the diff Claude writes back). Per the official Bolt token docs, file context dominates — a request that touches a 1,000-line file costs roughly the same regardless of how short your prompt is. This is why “edit only the relevant files” is a real technique.

Decision Tree: Bolt.new vs Cline vs Aider vs Cursor vs Lovable

If you want a one-line picker for which tool to start with:

You want to go from prompt to deployed app in one browser tab → Bolt.new
You want a polished UI, you’re not a developer, and you’ll pay for it → Lovable
You want UI components to drop into an existing Next.js codebase → v0
You want a real backend with a real database that runs 24/7 → Replit Agent
You want a serious AI agent inside VS Code with your own API key → Cline
You want a terminal-first AI pair programmer that auto-commits → Aider
You want a managed forked-VS-Code experience and don’t mind paying → Cursor
You want autocomplete plus a chat box, nothing more → GitHub Copilot
You want to self-host the whole stack with your own free API key → bolt.diy + Gemini or Groq

Use Bolt.new with OpenClaw

OpenClaw is an AI agent platform for orchestrating multi-step automated workflows. Bolt and OpenClaw cover different parts of the build-and-ship loop, and the seam between them is genuinely useful.

The pattern: use Bolt.new for the human-driven scaffold-and-design phase — describe the app you want, watch it appear, push to GitHub. Then hand the GitHub repo over to an OpenClaw flow for the unattended phase: a nightly job that bumps dependencies, runs the test suite on each push, regenerates the OpenAPI client whenever the schema changes, and files an issue if anything breaks. Bolt builds the v1; OpenClaw maintains it.

A concrete pipeline an OpenClaw agent can own end-to-end without you touching it: every Monday at 7am, pull the latest Bolt-generated codebase, run pnpm audit, attempt to upgrade any package with a known CVE, run the test suite, and on success open a PR titled “weekly security bump.” On failure, open an issue with the failing test names. You wake up to either a green PR ready to merge or a clear bug report.

Final Verdict

Bolt.new is the best-in-class answer to “I have an idea and want a running app in 5 minutes” in 2026. The WebContainer trick is genuinely a generation ahead of products that wrap a remote VM in chat — there is no cold start, no setup, no dependency on what you have installed locally. For prototyping, hackathons, internal admin tools, and turning a spec into a deployed link in front of a stakeholder, nothing else feels as fast.

The free tier’s 300K daily cap is real, though, and the cost of follow-up prompts grows fast. The right way to use Bolt.new on free is: front-load detail in your first prompt, get a runnable scaffold in one pass, push to GitHub, and continue refinements in Cline or Aider against any free AI API. That hybrid workflow — Bolt for the “0 to 1,” a free local CLI agent for the “1 to 10” — is the cheapest serious AI coding stack available right now, and it costs literally zero dollars.

If you want to escape the token ceiling entirely, bolt.diy with Gemini or Groq as the model is the natural next step. You give up the polish of the hosted product but you also give up the $25/month it would otherwise cost to keep building. Open the page, type a prompt, watch your app run in your browser. There is no faster way to find out whether an idea is worth pursuing.

Aider: Free Open-Source AI Coding Agent for Your Terminal

toolfreebie — Thu, 28 May 2026 08:29:46 +0000

What Is Aider?

Aider is a free, open-source AI pair programmer that runs entirely in your terminal. You launch it inside a Git repository, point it at any LLM, and from then on every line of code you change goes through a conversation: you describe what you want in plain English, Aider edits the files, runs your tests if you ask it to, and auto-commits each change with a sensible message — all without leaving the shell.

Where editor-based agents like Cline, Cursor, and Windsurf wrap the experience in panels and buttons, Aider keeps everything text. That sounds primitive until you sit with it for a day. The terminal-first design means Aider plays nicely with tmux, SSH sessions on remote boxes, vim/emacs, and the kind of multi-window workflow that senior engineers refuse to give up. There is no extension to install, no editor lock-in, no IDE quirks — just a single Python package and your repo.

Aider has been quietly maintained since 2023 by Paul Gauthier and a healthy contributor base on GitHub (Apache 2.0 license, tens of thousands of stars). It is one of the few AI coding tools that publishes a real, reproducible benchmark, the Aider Polyglot, and re-runs it every time a major model lands. That alone is worth the price of admission. And the price is zero.

Why Aider Is Different from Cline, Cursor, and Copilot

The popular AI coding tools all sit on the same backbone — an LLM that reads files and proposes edits — but they differ in three big ways: where they live, how they edit, and how they think about Git. Aider’s choices on all three are unusual.

Feature	Aider	Cline	Cursor	GitHub Copilot
Price	Free (BYOK)	Free (BYOK)	$20/mo Pro	$10/mo Individual
Surface	Terminal CLI	VS Code extension	Forked VS Code (separate app)	Editor plugin
Open source	Yes (Apache 2.0)	Yes (Apache 2.0)	No	No
Choose your model	Any model via LiteLLM (100+ providers)	15+ providers	Cursor-managed	Mostly OpenAI, some Claude
Free model option	Yes — pair with any free API or Ollama	Yes — pair with any free API	Limited free tier	No
Git auto-commit	Yes (per change, with semantic message)	Optional	No	No
Repo-wide context map	Yes (tree-sitter, ranks symbols by relevance)	Workspace search	Codebase index	Workspace search
Multiple edit formats	Yes (whole / diff / diff-fenced / udiff)	One (search-replace)	Internal	Internal
Architect/Editor split	Yes (different model for plan vs apply)	Yes (Plan/Act with one model)	Partial	No
Voice input	Yes (Whisper)	No	No	No
Web URL ingestion	Yes (`/web`)	Yes (built-in browser)	Limited	No

The headline takeaway: if you live in an editor, Cline and Cursor are the natural fit. If you live in a terminal — or if you regularly SSH into remote machines, work in tmux, pair with vim/emacs/nano, or run on a server with no GUI — Aider is the only tool in this category that was designed for that workflow from day one.

The Aider Polyglot Benchmark: A Public Number You Can Trust

Most AI coding tools publish marketing copy. Aider publishes a benchmark — and not a synthetic one. The Aider Polyglot leaderboard tests every major model against 225 hand-curated coding exercises across C++, Go, Java, JavaScript, Python, and Rust. Each exercise has hidden unit tests; a model passes only when its edits make the tests go green, with at most one self-correction round.

What makes this benchmark useful, beyond being public and reproducible:

It tests the full pipeline — reading a problem statement, locating the right files, producing a syntactically valid edit, getting the diff format right, and writing code that compiles and passes tests. A model that “knows” the answer but botches the diff format scores zero, exactly like in real life.
It is multi-language. A model that writes elegant Python and falls apart on Rust will not top this leaderboard, even though it might top a Python-only one.
Aider re-runs it for every notable model release. The leaderboard you read today is not the one you read six months ago.

The practical use is choosing a model. Before the polyglot existed, “is GPT-4o better than Claude for refactoring Go?” was a vibe argument. Now you check the table. The leaderboard also distinguishes between edit-format pass rate (did the model produce a syntactically applicable diff?) and final pass rate (did the test go green?). Models that score high on the first but low on the second are confidently wrong. The opposite — high final, low edit-format — almost never happens, because if your edits are mangled they cannot pass.
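The two numbers are simple to compute once you have per-exercise outcomes. The sketch below is a simplified stand-in for illustration, not the leaderboard's actual scoring code. The `ExerciseResult` type and the four sample results are invented.

```python
from dataclasses import dataclass


@dataclass
class ExerciseResult:
    edit_applied: bool   # the model's edit was well-formed and could be applied
    tests_passed: bool   # the unit tests went green afterwards


def pass_rates(results):
    """Return (edit_format_rate, final_pass_rate) as fractions in [0, 1]."""
    if not results:
        raise ValueError("no results to score")
    n = len(results)
    edit_ok = sum(r.edit_applied for r in results)
    # A mangled edit cannot pass, so a pass only counts if the edit applied.
    final_ok = sum(r.edit_applied and r.tests_passed for r in results)
    return edit_ok / n, final_ok / n


# Hypothetical run: three well-formed edits, only one of which passes its tests.
sample = [
    ExerciseResult(edit_applied=True, tests_passed=True),
    ExerciseResult(edit_applied=True, tests_passed=False),
    ExerciseResult(edit_applied=True, tests_passed=False),
    ExerciseResult(edit_applied=False, tests_passed=False),
]
print(pass_rates(sample))  # (0.75, 0.25)
```

A model with this profile is the "confidently wrong" case: most of its diffs apply cleanly, but few of them fix the problem.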

If you are choosing a free model to pair with Aider, head to the leaderboard, sort by the latest column, and pick the highest-ranked model that has a free tier. As of 2026, the strongest options that fit that description include DeepSeek’s reasoning models, Gemini 2.5 Pro on Google AI Studio, and Llama 3.3 70B on free providers like Groq or Together AI.

Five Features That Make Aider Worth Your Terminal Time

1. The Repo-Map: Context Without Token Bankruptcy

The naive way to give an LLM “context” about your repo is to dump every file into the prompt. The naive way also costs $20 of tokens per task and breaks the moment your repo crosses a few thousand lines. Aider’s repo-map uses tree-sitter to parse every source file in your repo and extract a structured summary: class names, function signatures, top-level constants, exported types. It then ranks those symbols by relevance to the current chat using a PageRank-style graph over identifier references.

The result is a compact map — a few hundred lines for most projects — that gives the LLM enough scaffolding to ask for the right files. You did not paste anything. Aider figured it out from the AST.
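To get a feel for the idea, here is a deliberately simplified sketch. It is not Aider's implementation: it swaps tree-sitter for Python's built-in `ast` module (so it only handles Python files) and swaps the PageRank-style ranking for a plain reference count. The function name and sample files are hypothetical.

```python
import ast
from collections import Counter


def build_repo_map(sources, top_n=5):
    """Summarize top-level functions and classes, most-referenced first.

    sources: mapping of file path -> Python source text.
    """
    definitions = {}        # symbol name -> "path: signature"
    references = Counter()  # symbol name -> number of uses across all files

    for path, code in sources.items():
        tree = ast.parse(code)
        for node in tree.body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ", ".join(a.arg for a in node.args.args)
                definitions[node.name] = f"{path}: def {node.name}({args})"
            elif isinstance(node, ast.ClassDef):
                definitions[node.name] = f"{path}: class {node.name}"
        # Count every place a bare name is read (calls, arguments, and so on).
        for node in ast.walk(tree):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
                references[node.id] += 1

    # Most-referenced symbols first; ties broken alphabetically.
    ranked = sorted(definitions, key=lambda name: (-references[name], name))
    return [definitions[name] for name in ranked[:top_n]]


sources = {
    "app/db.py": "def get_session():\n    return object()\n",
    "app/main.py": (
        "from app.db import get_session\n"
        "def healthz():\n"
        "    return get_session()\n"
        "def unused(x, y):\n"
        "    return x + y\n"
    ),
}
for line in build_repo_map(sources):
    print(line)
```

The output contains signatures only, never function bodies. That is what keeps the map small enough to send with every request.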

You can see the current map any time with /map, and tune the size cap with --map-tokens. For repos under 100K lines this works astoundingly well. For monorepos beyond that, combine it with .aiderignore (same syntax as .gitignore) to scope the agent to a single package.

2. Multiple Edit Formats: Pick the One Your Model Is Best At

Most agents use one edit format. Aider supports four, because different models reliably do different ones better:

whole — model returns the complete new file. Highest token cost, lowest failure rate. Good for small files and weaker models.
diff — model returns SEARCH/REPLACE blocks. Moderate token cost, fragile if the model botches whitespace.
udiff — model returns unified-diff hunks. Compact, what experienced devs read; some models do this very well.
diff-fenced — like diff, but with the file path placed inside the fenced code block. Mainly useful for models (notably the Gemini family) that stumble over the standard diff fencing.

Aider auto-picks the right format per model based on benchmark history, but you can override with --edit-format. If you have an unusual model and edits keep failing to apply, switching the format almost always fixes it.
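To see why the diff format is fragile around whitespace, here is a minimal sketch of applying one SEARCH/REPLACE block with exact string matching. It illustrates the idea only. Aider's real matching logic is not shown here and may be more forgiving, and `apply_search_replace` is a hypothetical name.

```python
def apply_search_replace(text: str, search: str, replace: str) -> str:
    """Replace exactly one occurrence of `search`; fail if missing or ambiguous."""
    count = text.count(search)
    if count == 0:
        raise ValueError("SEARCH block not found (check whitespace and indentation)")
    if count > 1:
        raise ValueError("SEARCH block matches more than once; include more context")
    return text.replace(search, replace, 1)


original = "def greet(name):\n    return 'hi ' + name\n"

# The SEARCH text matches the file byte for byte, so the edit applies.
patched = apply_search_replace(
    original,
    search="    return 'hi ' + name\n",
    replace="    return f'hello {name}'\n",
)
print(patched)

# Same edit, but the model emitted a tab instead of four spaces: it cannot apply.
try:
    apply_search_replace(original, "\treturn 'hi ' + name\n", "\treturn name\n")
except ValueError as err:
    print("edit rejected:", err)
```

The whole format trades this risk for compactness: there is nothing to match, so nothing to mismatch, but the model has to re-emit every line of the file.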

3. Architect Mode: Two Models, Each Doing What It Is Good At

Frontier reasoning models — DeepSeek R1, OpenAI o-series, Gemini 2.5 Pro thinking — are excellent at planning and weak at producing pristine edit formats. Cheap fast models — Llama 3.3 70B, Gemini 2.0 Flash, GPT-4.1 Mini — are the reverse. Aider’s architect mode lets you assign a model to each role:

aider --architect \
  --model openrouter/deepseek/deepseek-r1 \
  --editor-model openrouter/anthropic/claude-3.5-sonnet

The architect produces a plan in natural language. The editor takes that plan plus the relevant files and emits the actual SEARCH/REPLACE blocks. Two-stage prompting like this consistently beats single-model approaches on the polyglot benchmark, often by ten percentage points or more on harder languages.

4. Auto-Commit With Real Commit Messages

Aider commits after every accepted change, with a one-line message generated from the diff. The first time you run it your git log looks like a real engineer worked on the repo all day. This sounds cosmetic until your first time using git bisect on AI-generated code — granular commits with semantic messages turn a debugging nightmare into a five-minute regression hunt.

Don’t want auto-commits? Pass --no-auto-commits and Aider leaves its edits uncommitted in your working tree for you to review and commit yourself. Don’t have a Git repo? Aider will offer to git init for you on first run, since the entire workflow assumes one.

5. `/web`: Drop a URL Into the Chat

One of Aider’s most underrated commands. Type /web https://docs.example.com/api and Aider scrapes the page, converts it to clean Markdown, and adds it to the chat as context. Now the LLM has the actual API reference for the library you are using, not the version it remembers from training. This eliminates a huge category of stale-knowledge bugs without you needing to set up a separate RAG pipeline.
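The conversion step is the part worth picturing. The sketch below is a toy HTML-to-Markdown converter built on the standard library, shown only to illustrate what "clean Markdown" means here: page chrome is dropped while headings, paragraphs, and inline code are kept. It is not how Aider scrapes pages, and the class and function names are invented.

```python
from html.parser import HTMLParser


class MarkdownExtractor(HTMLParser):
    """Keep headings, paragraphs and inline code; drop scripts, styles and nav."""

    SKIP = {"script", "style", "nav", "footer"}
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__()
        self.lines = []      # finished Markdown blocks
        self.current = []    # text fragments of the block being built
        self.prefix = ""     # heading marker for the current block, if any
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in self.HEADINGS:
            self.prefix = self.HEADINGS[tag]
        elif self.skip_depth == 0 and tag == "code":
            self.current.append("`")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag == "code":
            self.current.append("`")
        elif self.skip_depth == 0 and (tag in self.HEADINGS or tag == "p"):
            self.flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.current.append(data)

    def flush(self):
        # Collapse runs of whitespace so the output is one tidy line per block.
        text = " ".join("".join(self.current).split())
        if text:
            self.lines.append(self.prefix + text)
        self.current, self.prefix = [], ""


def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text that was not inside <p> or a heading
    return "\n\n".join(parser.lines)


page = (
    "<html><head><style>p{}</style></head><body><nav>Home</nav>"
    "<h1>API</h1><p>Call <code>get()</code> first.</p></body></html>"
)
print(html_to_markdown(page))
```

Stripping navigation, scripts, and styling before the text reaches the model is what keeps a documentation page from eating the context window.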

How to Install Aider

Aider is a Python package. The official one-liner installer handles the Python version and dependency isolation for you:

python -m pip install aider-install
aider-install

That installs Aider into an isolated environment using uv and adds it to your PATH. If you prefer to manage your own venv, the older method still works:

pipx install aider-chat
# or in a venv:
pip install aider-chat

Verify the install:

aider --version

You also need an API key for whatever LLM provider you plan to use. Aider reads keys from environment variables and from a .env file in your project. The most common setup looks like:

export OPENROUTER_API_KEY=sk-or-v1-...
# or
export GEMINI_API_KEY=AIza...
# or
export DEEPSEEK_API_KEY=sk-...

That is the entire install. cd into a Git repo, run aider, start typing.

Connecting Aider to Free LLM APIs

Aider speaks LiteLLM under the hood, which means it supports virtually every provider with one consistent --model flag. Four practical setups in 2026, three of them zero-cost and one very cheap:

Option 1: Google Gemini Free Tier (Most Generous)

Gemini’s free tier on Google AI Studio gives you Gemini 2.5 Pro with a 1M-token context window and very generous request limits — long enough to throw entire codebases at it. Set the key and run:

export GEMINI_API_KEY=AIza...
aider --model gemini/gemini-2.5-pro

For most personal projects, Gemini’s free tier alone keeps Aider running indefinitely with no card on file.

Option 2: OpenRouter Free Models

OpenRouter aggregates dozens of providers behind one OpenAI-compatible endpoint and exposes a tier of :free models you can call without spending credits.

export OPENROUTER_API_KEY=sk-or-v1-...
aider --model openrouter/deepseek/deepseek-r1:free

Rate limits are tighter than Gemini’s, but the variety is unmatched. Worth keeping a key around just for fallback when other free tiers throttle you.

Option 3: Ollama (Fully Local, Truly Offline)

If you have a reasonably modern laptop and want zero cloud calls, Ollama runs models on your own GPU.

ollama pull qwen2.5-coder:14b
aider --model ollama_chat/qwen2.5-coder:14b

Quality drops below frontier models, but for boilerplate, file renames, and small refactors, a local 14B coder model is good enough. And no tokens leave your machine — useful for codebases under NDA.

Option 4: DeepSeek Direct (Cheapest Frontier)

DeepSeek’s API offers near-frontier quality on its V3 chat model and reasoning quality from R1, with off-peak discount pricing that often makes it the cheapest paid option in the category. Their free trial credits are enough to evaluate, and routine use rarely costs more than a coffee per month.

export DEEPSEEK_API_KEY=sk-...
aider --model deepseek/deepseek-chat
# or for the reasoning model:
aider --model deepseek/deepseek-reasoner

Mixed Strategy: Architect Free, Editor Free

The architect mode shines when the two models live on different free tiers. A common 2026 pattern:

aider --architect \
  --model gemini/gemini-2.5-pro \
  --editor-model openrouter/meta-llama/llama-3.3-70b-instruct:free

Gemini’s reasoning power for the plan, OpenRouter’s free Llama for the mechanical edit. Total cost: zero.

A First Real Task: From Prompt to Green Tests

To make this concrete, here is a representative Aider session adding a feature to a Python web app. The user types one prompt; Aider does everything else.

Prompt: “Add a /healthz endpoint to the FastAPI app that returns 200 with the build commit SHA from an env var, plus a Postgres ping. Add a pytest test.”

The session, abbreviated:

$ aider --model gemini/gemini-2.5-pro
Aider v0.64.0
Main model: gemini/gemini-2.5-pro
Repo: 47 files, 8,200 LOC
Repo-map: 412 tokens

> Add a /healthz endpoint to the FastAPI app that returns 200 with
  the build commit SHA from an env var, plus a Postgres ping.
  Add a pytest test.

I'll need to look at the FastAPI app entry point and your DB session helper.
Please add these files to the chat:
  - app/main.py
  - app/db.py

> /add app/main.py app/db.py

Files added (1,420 tokens).

Plan:
  1. Add a Pydantic model HealthzResponse in app/main.py
  2. Add an async route GET /healthz that:
     - reads BUILD_COMMIT_SHA env var (default 'unknown')
     - executes SELECT 1 on the DB session, returns ok/fail
  3. Add tests/test_healthz.py with two cases (healthy, db down via mock)

Apply changes? [y/n] y

Editing app/main.py
Editing tests/test_healthz.py (new file)

Commit f4a2b81: feat: add /healthz with build SHA and Postgres ping

> /run pytest tests/test_healthz.py -q

..                                                  [100%]
2 passed in 0.34s

Add this output to the chat? [y/n] y

>

Total time: under a minute on a fast model. Two files changed, two tests passing, one commit, zero buttons clicked. The same workflow over SSH on a remote dev box, in a tmux pane, on a Chromebook — Aider does not care.
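The session log does not show the code Aider wrote, so the following is only a sketch of the kind of logic the plan describes, kept framework-free so it runs without FastAPI installed. `BUILD_COMMIT_SHA` and the "unknown" default come from the plan above. The `healthz` signature, the `ping_db` callable, and the response field names are assumptions for illustration.

```python
import os
from typing import Callable, Mapping


def healthz(ping_db: Callable[[], None], env: Mapping[str, str] = os.environ) -> dict:
    """Build the /healthz payload: build SHA from the environment plus a DB ping."""
    try:
        ping_db()  # hypothetical helper, e.g. runs SELECT 1; raises on failure
        db_status = "ok"
    except Exception:
        db_status = "fail"
    return {
        "commit": env.get("BUILD_COMMIT_SHA", "unknown"),
        "db": db_status,
    }


def failing_ping() -> None:
    """Stand-in for a database that is down."""
    raise ConnectionError("database unreachable")


# The two cases from the plan: healthy, and DB down via a fake ping.
print(healthz(lambda: None, env={"BUILD_COMMIT_SHA": "f4a2b81"}))
print(healthz(failing_ping, env={}))
```

In a real FastAPI app this function would sit behind a GET route and the ping would go through the app's own session helper. Passing the ping in as a callable is what makes the "db down" case easy to test without a mock library.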

Production Tips: Don’t Burn Tokens

Free tiers and frontier models both have limits. A few habits keep Aider economical without giving up much:

Use /drop aggressively. Files stay in the chat context until you remove them. After Aider edits a file, you usually do not need it in the next prompt — drop it.
Add a .aiderignore for the parts of your repo that are noise — node_modules, generated code, vendored deps, large fixtures. The repo-map respects it.
Cap the map. --map-tokens 1024 is plenty for most repos and slashes per-prompt cost.
Use /clear between unrelated tasks to flush the conversation history. Multi-thousand-token chat history follows you to every prompt; clearing it can halve token cost.
Reach for --weak-model for commit messages and summarization. Aider already uses a small model by default for those side tasks; you can point this at an even cheaper one (gemini/gemini-2.0-flash, free) to save more.
Auto-test only at task boundaries. --auto-test runs your test suite after every edit. On large suites that adds up. Run tests manually with /run pytest when you want.

FAQ

Is Aider really free?

The Aider tool itself is free and open-source under Apache 2.0. The model calls go through whichever LLM provider you point it at — and you can absolutely run it for $0/month using free providers (Gemini, OpenRouter free tier, Ollama local). The only thing you pay for, optionally, is upgrading to a paid model when free-tier rate limits start to slow you down.

Does Aider work without Git?

Technically yes, with --no-git, but you give up auto-commit, the diff-aware repo-map, and easy rollback. On a fresh project Aider will offer to git init for you, which takes one keypress. Just let it.

Can Aider run code or commands?

Yes. The /run command runs an arbitrary shell command and offers to add the output to the chat — perfect for tests, linters, or running your dev server briefly to check for startup errors. Unlike fully autonomous agents, Aider does not run commands on its own; you trigger them.

Is Aider an autonomous agent like Cline’s Act mode?

No, and that is a deliberate choice. Aider treats you as the loop: it proposes edits, you approve, it commits, you run tests, you describe the next step. There is no “go solve this issue end-to-end without me watching” mode. For codebases where every line matters, that is a feature. If you want fully autonomous “fix this issue” workflows, pair Aider with Cline or use a dedicated agent framework.

Can I use Aider with my company’s private model gateway?

Yes — anything LiteLLM speaks, Aider speaks. Set OPENAI_API_BASE to your gateway URL, pass --model openai/your-model-id, and the rest just works. This makes Aider one of the most enterprise-friendly tools in the category.

Does Aider support MCP?

As of mid-2026 Aider’s primary tools are file editing, /run, and /web. MCP integration is on the roadmap and discussed in active issues, but the design philosophy — minimal core, scriptable shell — means a lot of what MCP servers provide is achievable today via /run and shell pipes.

What about voice input?

Set OPENAI_API_KEY and run /voice. Aider records, transcribes via Whisper, and inserts the transcript as your next prompt. The transcription is the only place Aider phones home outside your chosen LLM, and you can disable it.

When to Use Aider vs Cline vs Cursor

The three tools cover overlapping ground. A simple decision tree:

You live in vim / emacs / a terminal multiplexer, or you SSH into remote dev machines: Aider. Nothing else in this category respects that workflow.
You want autonomous, multi-file, multi-step “go fix this issue” execution with browser verification: Cline. Its Plan/Act split and built-in browser are purpose-built for that.
You want a managed product with one bill, one model picker, polished UI, and tab-completion-style suggestions in the editor: Cursor. You give up cost transparency and model choice; you get less friction.
You want Git-aware, surgical edits with full audit trail and the option to bail at any point: Aider. The auto-commit + diff-format design optimizes for “I will pair with this; I will not let it run wild.”
You want to run on a free model with zero credit card and zero subscription: Aider or Cline both work. Pick by surface (terminal vs editor).

Many serious developers in 2026 use both: Aider for terminal-native pairing on personal projects and remote boxes, Cline (or Cursor) for autonomous IDE work on the day job. They are not enemies; they fit different parts of the day.

Pairing Aider With Free APIs: Cost Reality

Concrete monthly costs for an engineer using an AI coding pair for, say, 60 hours a month:

Setup	Monthly Cost	Model Quality	Notes
Aider + Gemini 2.5 Pro free tier	$0	Frontier	Hits rate limits at heavy usage; works for most solo projects
Aider + OpenRouter free models	$0	Strong open-source	Tighter limits, huge model variety, easy fallback
Aider + Ollama Qwen2.5-Coder 14B local	$0	Mid	Fully offline, no rate limits, requires GPU
Aider + DeepSeek V3 (paid, off-peak)	~$2–8	Frontier	Cheapest paid frontier; pay only for usage
Aider + Anthropic Claude Sonnet (paid)	~$10–40	Frontier	Top of polyglot leaderboard most months
Cursor Pro (subscription)	$20	Mixed	Predictable bill, fewer choices
GitHub Copilot Individual	$10	Mid	No autonomous mode

The first three rows are what makes Aider compelling. A serious engineering workflow at zero dollars a month is not a trick — it is just a matter of pairing a free tool with a free API.

Final Thoughts

Aider is the tool you reach for when you have opinions about your code. The terminal-first, Git-aware, edit-format-conscious design assumes you are going to read every diff, that you care which commits show up in git log, and that the LLM is your assistant rather than your replacement. For a certain kind of engineer — and a certain kind of repo — that is exactly the right contract.

It is also the most straightforwardly free serious AI coding tool in 2026. No subscription, no enterprise upsell, no credit card screen. Pip install it, point it at Gemini’s free tier or a local Ollama model, and start pairing. The whole thing takes ten minutes from pip install to first commit, and the productivity ceiling is as high as the model you point it at.

If you have not tried a terminal-native AI pair programmer before, give Aider an afternoon on a small side project. Either it will fit your workflow perfectly and replace half your editor extensions, or it will not — and you will know within a few hours which camp you are in. There is no lock-in, no sunk cost, and no reason not to try.

Cline: Free Open-Source AI Coding Agent for VS Code (Cursor Alternative)

toolfreebie — Thu, 28 May 2026 08:24:18 +0000

What Is Cline?

Cline is a free, open-source AI coding agent that lives inside VS Code. Originally released as “Claude Dev” in 2024 and renamed to Cline later that year, the project has grown into one of the most popular autonomous coding assistants on the OpenVSX and VS Code marketplaces, with over a million installs as of 2026 and a GitHub repo (cline/cline, Apache 2.0) that regularly sits near the top of the trending charts.

Where assistants like GitHub Copilot give you single-line completions and chat boxes, Cline does the whole task: it reads your repo, plans the change, edits multiple files, runs the terminal, opens a browser to verify, and waits for your approval at every irreversible step. It’s the same shape of agent you get from Cursor or Windsurf, except it costs nothing to install, runs against any model you point it at, and the extension itself is open-source code you can read line-by-line.

The catch — and the reason it pairs so well with the free AI APIs covered on this blog — is that Cline is BYOK (bring your own key). The extension is free, but the model calls go through whatever provider you configure. With a free Gemini, OpenRouter, Together AI, or Ollama backend, you can run Cline at zero marginal cost.

Cline vs Cursor vs GitHub Copilot

The three tools occupy overlapping but distinct positions. A side-by-side:

Feature	Cline	Cursor	GitHub Copilot
Price	Free (BYOK)	$20/mo Pro	$10/mo Individual
Editor	VS Code extension	Forked VS Code (separate app)	VS Code, JetBrains, others
Open source	Yes (Apache 2.0)	No	No
Choose your model	Anthropic, OpenAI, Gemini, DeepSeek, Groq, Together, Ollama, LM Studio, Bedrock, OpenRouter, LiteLLM	Cursor-managed (mostly Claude/GPT)	Mostly OpenAI, some Claude
Free model option	Yes — pair with any free API	Limited free tier	No
Autonomous multi-file edits	Yes (Act mode)	Yes (Composer / Agent)	Yes (Copilot Workspace, beta)
Terminal execution	Yes (with approval)	Yes	Limited
Browser automation	Yes (built-in)	Limited	No
MCP server support	Yes (native)	Yes	Limited
Plan-then-execute mode	Yes (Plan / Act toggle)	Partial	No
Token cost tracker per task	Yes (live, per request)	No (subscription)	No (subscription)

The headline trade-off: Cursor and Copilot give you a managed experience and predictable monthly bill. Cline gives you full transparency over the model, the prompts, and the per-token cost — at the price of wiring up your own API key. For developers who already keep Gemini, OpenRouter, or DeepSeek keys around for other projects, that wiring is a five-minute job.

Key Features That Matter

1. Plan and Act Modes

Cline’s marquee feature in 2026 is the explicit Plan/Act toggle in the input bar. In Plan mode, the model can only read files, search the workspace, and write you a step-by-step proposal — it cannot modify code or run commands. In Act mode, it executes that plan, asking for approval before each tool use it considers irreversible (file writes, terminal commands, browser actions).

This separation maps directly to how senior engineers actually work: think first, code second. It also reduces wasted tokens: a strong reasoning model in Plan mode can often produce a workable plan that a cheaper executor model then fills in.

2. Bring-Your-Own-Model

The provider dropdown in Cline’s settings is the longest in the category. You can route the same conversation through any of: Anthropic (Claude 3.7/4 Sonnet, Opus), OpenAI (GPT-4o, GPT-4.1, o-series), Google Gemini, DeepSeek, Groq, Together AI, Mistral, OpenRouter, AWS Bedrock, GCP Vertex AI, Azure OpenAI, OpenAI-compatible local servers (Ollama, LM Studio, llama.cpp, vLLM), and LiteLLM proxies.

This matters for two reasons. First, you can pick the cost/quality point that matches the task — a tiny local model for boilerplate, a frontier model for the hard refactor. Second, it future-proofs your workflow: when a new state-of-the-art model lands, you point Cline at it the same day, no waiting for a vendor to integrate it.

3. Native MCP Support

Cline was one of the earliest agents to ship native support for the Model Context Protocol. Any MCP server — file system, GitHub, Postgres, Playwright, Slack, or your own — plugs into Cline’s tool list with no extra wiring. The MCP marketplace inside Cline lists hundreds of community servers you can install in two clicks.

Practically, this means Cline can do things outside the editor without you teaching it custom tools: query your production-replica Postgres, file a GitHub issue from a stack trace, or drive a Playwright browser to reproduce a bug a user filed.

4. Browser Automation Built In

Cline ships with a built-in headless browser tool. The agent can open a URL, screenshot the page, click on elements, type into fields, and read back the rendered DOM. The killer use case: “make this UI change and verify it visually” — Cline edits the React component, runs the dev server, opens the page, screenshots before/after, and only marks the task complete once the visual confirms the change.

5. Live Cost Visibility

Every task in Cline shows a running token counter and dollar estimate based on the current provider’s pricing. You can watch a multi-step refactor consume tokens in real time, and you can hit Stop the moment it stops being economical. No other agent in this category surfaces cost this directly.

How to Install Cline

Installation takes about thirty seconds:

Open VS Code (or VSCodium, Cursor, Windsurf; Cline runs in any VS Code-compatible editor)
Open the Extensions panel (Ctrl+Shift+X or Cmd+Shift+X)
Search for Cline (publisher: saoudrizwan)
Click Install
Click the new Cline icon in the activity bar
On first launch, Cline asks you to pick a provider and paste an API key

That’s it. Cline now sits in the side panel with an input box and a Plan/Act toggle. The extension itself never makes a network call until you give it a model and start a task.

Connecting Cline to Free APIs

The provider you pick determines whether Cline is genuinely free or just cheap. The four practical zero-cost options:

Option 1: Google Gemini (Recommended for Starters)

Gemini’s free tier on Google AI Studio gives you Gemini 2.0 Flash and Gemini 2.5 Pro at very generous request-per-minute limits with a 1M-token context window — long enough to dump your entire repo into a single prompt for most projects.

Get a free key at aistudio.google.com
In Cline settings, choose provider Google Gemini
Paste the key
Pick model gemini-2.5-pro (best for planning) or gemini-2.0-flash (faster, cheaper if you flip to paid)

For most personal projects, Gemini’s free tier is enough to keep Cline running indefinitely with no card on file.

Option 2: OpenRouter Free Models

OpenRouter aggregates dozens of providers behind one OpenAI-compatible endpoint, including a tier of :free models you can call without spending credits.

Sign up at openrouter.ai and copy your key
In Cline settings, choose provider OpenRouter
Paste the key
In the model search, type :free and pick a strong free model like deepseek/deepseek-r1:free or meta-llama/llama-3.3-70b-instruct:free

OpenRouter’s free tier rate limits are tighter than Gemini’s, but the variety is unmatched — you can switch between fifty different free models without changing keys.

Option 3: Ollama (Fully Local, Fully Free)

If you have a reasonable laptop and don’t want any cloud calls at all, Ollama runs models on your own GPU.

Install Ollama and pull a model: ollama pull qwen2.5-coder:14b
In Cline settings, choose provider Ollama
Set the base URL to http://localhost:11434
Pick the model you pulled

Quality drops below frontier models, but for boilerplate, file renames, and small refactors, a local 14B model is good enough — and zero tokens leave your machine.
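Before pointing Cline at the local server, it can help to confirm that Ollama is running and that the model was actually pulled. The sketch below queries Ollama's /api/tags endpoint with the standard library only; the helper names model_names and list_local_models are made up for this example.

```python
import json
from urllib.request import urlopen

OLLAMA_URL = "http://localhost:11434"  # same base URL as in the Cline settings


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from the JSON body returned by /api/tags."""
    return [model["name"] for model in tags_payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Ask the local Ollama server which models have been pulled.

    Raises OSError (for example URLError) if the server is not reachable.
    """
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))
```

If list_local_models() raises a connection error, the server is not running; if it returns a list without qwen2.5-coder:14b, the pull has not finished.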

Option 4: Together AI Free Tier

Together AI’s -Free models include Llama 3.3 70B and DeepSeek R1 Distill 70B, both strong code models. Sign up, copy the key, choose the OpenAI-compatible provider in Cline, and point the base URL to https://api.together.xyz/v1.

Mixed Strategy: Plan with One, Act with Another

Cline lets you set different models for Plan and Act mode. A common pattern in 2026: use a frontier reasoning model (Gemini 2.5 Pro, Claude Opus, DeepSeek R1) for Plan mode, then switch to a fast cheap model (Llama 3.3 70B on Groq, Gemini 2.0 Flash) for Act mode. The plan is the expensive part of cognition; the execution is mostly mechanical edits.
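A rough back-of-the-envelope calculation shows why the split pays off once you move to paid models. The prices and token counts below are hypothetical placeholders, not any provider's real pricing; substitute the numbers from your own provider's price sheet.

```python
def task_cost(tokens: int, usd_per_million: float) -> float:
    """Cost in USD for a token count at a flat per-million-token price."""
    return tokens * usd_per_million / 1_000_000


# Hypothetical prices, for illustration only.
PLANNER_PRICE = 10.0   # USD per 1M tokens, frontier reasoning model
EXECUTOR_PRICE = 0.5   # USD per 1M tokens, fast cheap model

# Hypothetical task: a short plan followed by a longer execution phase.
plan_tokens, act_tokens = 2_000, 20_000

mixed = task_cost(plan_tokens, PLANNER_PRICE) + task_cost(act_tokens, EXECUTOR_PRICE)
frontier_only = task_cost(plan_tokens + act_tokens, PLANNER_PRICE)

print(f"mixed: ${mixed:.3f}, frontier only: ${frontier_only:.3f}")
```

With these made-up numbers the mixed setup costs a fraction of running the frontier model for both phases, because most of the tokens are spent during execution.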

A First Real Task: Add a Feature End-to-End

To make this concrete, here’s a typical Cline session for adding a “dark mode toggle” to a React app. The user types one prompt; everything else is Cline.

Prompt: “Add a dark mode toggle to the navbar. Persist the choice in localStorage. Use the existing Tailwind dark: variants — don’t introduce a new theming library.”

In Plan mode, Cline:

Reads tailwind.config.js, confirms darkMode: 'class' is set (or proposes enabling it)
Reads src/components/Navbar.tsx
Searches the codebase for any existing theme context
Writes a plan: add a useDarkMode hook, modify Navbar to render a toggle button, add a small effect that syncs the dark class on document.documentElement

You read the plan. If it looks right, you flip to Act mode. Cline:

Creates src/hooks/useDarkMode.ts — pauses, shows you the diff, waits for approval
Edits Navbar.tsx — pauses, shows the diff, waits for approval
Runs pnpm run dev in the terminal — pauses, asks before executing
Opens http://localhost:5173 in the built-in browser
Screenshots the navbar, clicks the new toggle, screenshots again, confirms the page background switched
Reports done with a list of files changed and the cost (e.g. “$0.04, 18,200 tokens”)

The full task is fifteen minutes of mostly autonomous work. You stayed in the loop at the four moments that matter (plan, two diffs, terminal). For senior engineers used to writing every line, this feels strange the first time and indispensable by the third.

MCP: Giving Cline Superpowers

Cline’s MCP support is what lets it reach beyond the file system. Three useful servers to install on day one (all from the Cline marketplace):

filesystem — read/write outside the open workspace, useful for cross-repo refactors
github — open issues, file PRs, comment on existing PRs without leaving the editor
playwright — drive a real browser to reproduce user-reported bugs against your dev server

To install one, click the MCP icon in Cline’s panel, search the marketplace, and click Install. The server runs as a local subprocess; no cloud connection unless the server itself needs one.

Custom MCP servers — anything you’ve built or anything from the wider MCP ecosystem — drop in just by adding their config to ~/.cline/mcp.json:

{
  "mcpServers": {
    "my-postgres": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-postgres", "postgres://localhost/dev"]
    }
  }
}

After a reload, Cline can run read-only SQL against your dev database without you copy-pasting schemas into the chat.
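If you manage several machines, hand-editing the JSON gets tedious. Here is a minimal sketch of a script that merges a new server entry into the config without clobbering existing ones. It assumes the file path and the mcpServers layout shown above; the helper names add_mcp_server and update_config_file are hypothetical.

```python
import json
from pathlib import Path

CONFIG_PATH = Path.home() / ".cline" / "mcp.json"  # path as given above


def add_mcp_server(config: dict, name: str, command: str, args: list[str]) -> dict:
    """Return a copy of config with one more entry under "mcpServers"."""
    servers = dict(config.get("mcpServers", {}))
    servers[name] = {"command": command, "args": args}
    return {**config, "mcpServers": servers}


def update_config_file(path: Path, name: str, command: str, args: list[str]) -> None:
    """Read the config if it exists, add the server, and write it back."""
    config = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    updated = add_mcp_server(config, name, command, args)
    path.write_text(json.dumps(updated, indent=2))
```

A call such as update_config_file(CONFIG_PATH, "my-postgres", "npx", ["-y", "@modelcontextprotocol/server-postgres", "postgres://localhost/dev"]) would reproduce the entry above while keeping any servers already in the file.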

Cost Comparison: Cline + Free API vs Cursor / Copilot

Concrete monthly costs for a developer who uses an AI agent for, say, 60 hours of coding:

Setup	Monthly Cost	Model Quality	Notes
Cline + Gemini 2.5 Pro free tier	$0	Frontier	Hits rate limits at high usage; works for most solo work
Cline + OpenRouter free models	$0	Strong open	Tighter limits but huge model variety
Cline + Ollama Qwen2.5-Coder 14B	$0	Mid	Local, no cloud calls, no rate limits
Cline + Anthropic Claude Sonnet (paid)	~$10–40	Frontier	Pay only for what you use; transparent per-task
Cursor Pro	$20	Frontier (Claude/GPT)	Predictable; unlimited slow model, capped fast models
GitHub Copilot Individual	$10	Frontier (GPT-4.1)	Strong autocomplete, weaker agent UX
Cursor Pro + Cline as backup	$20	Frontier	Both options available; Cline catches what Cursor misses

The honest answer: if you bill clients for engineering time, $20/mo for Cursor pays for itself in the first hour. If you’re a hobbyist, student, or open-source maintainer, Cline + a free API tier gets you 80% of the experience at $0/mo. The two aren’t mutually exclusive — Cursor is itself a VS Code fork, so you can install Cline inside Cursor and have both available.

Cline vs Aider vs Continue.dev

The free open-source AI coding agent space in 2026 has three credible contenders. A quick decision matrix:

Project	Surface	Best For	Weakness
Cline	VS Code extension	Visual workflows, browser verification, MCP-heavy tasks	Heavier UI; needs VS Code
Aider	Terminal CLI	Power users on the command line, Git-aware refactors	No GUI; less hand-holding for newcomers
Continue.dev	VS Code & JetBrains	Enterprise teams that want shared config + autocomplete	Less autonomous than Cline; more like Copilot

If you live in VS Code and want full agent autonomy, Cline. If you live in tmux and want every change tied to a Git commit, Aider. If you need a team-shareable autocomplete plus chat, Continue. None of these is wrong; they fit different working styles.

Tips for Keeping Cost (and Frustration) Low

Use Plan mode aggressively. A 2,000-token plan is cheaper than a 20,000-token wrong-direction execution.
Add a .clineignore file so Cline doesn’t accidentally read node_modules, lock files, or build outputs into context.
Pin the model per-task. Use a small fast model for find-and-replace work; reserve the frontier model for design and debugging.
Cap with maxRequests. Cline has a per-task request limit that stops runaway loops — set it to 30 for most tasks.
Approve diffs incrementally. If a diff looks wrong, reject and explain in one sentence. Cline rewrites the diff much faster than rolling back later.
Pair with Ollama for repetitive tasks like generating tests for already-written functions; the local model is “free” tokens.

FAQ

Is Cline really free?

The Cline extension is free under Apache 2.0. The model calls are not — you pay whichever provider you connect (or pay nothing if you stay within a free tier or run Ollama locally). There’s no Cline-the-company subscription gate.

Does Cline work in Cursor or Windsurf?

Yes — both are forks of VS Code, and Cline installs cleanly inside either. Some users actually run Cursor as their editor and Cline as a second agent for tasks they’d rather hand off entirely.

Can Cline read my whole codebase?

It reads files on demand using a search-and-grep flow rather than embedding the entire repo. That keeps context windows honest and means you don’t need a vector database for it to work. For very large repos, pair it with a model that has a long context window (Gemini 2.5 Pro at 1M, Claude Sonnet at 200K).

Will Cline silently delete my files?

No. Every file write, terminal command, and browser action requires explicit approval before it runs (this is the default; auto-approval has to be switched on deliberately). The agent shows you the diff or the command before you click Approve.

Can I use Cline offline?

Yes — point it at Ollama or LM Studio running locally. Once the model is pulled, Cline does not need a network connection.

Does Cline support tool use / function calling?

Yes, both natively (its built-in tools for files, terminal, browser) and via MCP servers. Models that don’t support function calling natively still work — Cline uses a structured prompt format underneath.

What’s the difference between Cline and Roo Code?

Roo Code is a popular fork of Cline with an additional set of “modes” (Architect, Code, Ask, Debug) and slightly different UI conventions. Functionally similar; pick whichever interface you prefer. Both are free and open source.

Does Cline phone home or collect telemetry?

The extension itself sends only opt-in anonymous usage telemetry; the model calls go directly from your machine to whichever provider you configured. There is no Cline-operated proxy in the path.

When to Use Cline vs Alternatives

You want a free, transparent AI coding agent and don’t mind wiring an API key → Cline + a free model
You want zero setup and predictable monthly cost, willing to pay $20 → Cursor
You live in a terminal and want every change committed → Aider
You want only autocomplete plus a chat box → GitHub Copilot or Continue.dev
You want to plug your own MCP servers into the agent loop → Cline (best-in-class MCP UX)
You want to run everything locally with no cloud calls → Cline + Ollama

Use Cline with OpenClaw

OpenClaw is an AI agent platform for orchestrating multi-step automated workflows. Cline plays well at the seam between human-in-the-loop coding and fully autonomous OpenClaw flows.

A useful split: OpenClaw runs the long-running unattended jobs (nightly dependency updates, regenerating SDK clients from a changed OpenAPI spec, checking the build on multiple Node versions). Cline handles the human-in-the-loop work where you actually want to read every diff before it lands. The two share the same model providers — connect both to the same OpenRouter or Together AI key, and you have one billing surface for everything.

A concrete example pipeline: an OpenClaw cron job watches a third-party SDK for new releases, downloads the new version, runs your test suite against it, and on failure files a GitHub issue with the failing test and stack trace. The next morning, you open the issue inside Cline (“fix issue #483”), and Cline does the actual fix work with you supervising the diffs.

Final Verdict

Cline is the right default in 2026 for any developer who already has a free AI API key and wants a serious coding agent without paying a subscription. The Plan/Act split is genuinely better UX than the implicit modes other agents use. Native MCP support means it grows with the ecosystem instead of getting locked into one set of built-in tools. And because the provider is your choice, you can pick the cost/quality point that matches the task and switch the moment a better model lands.

Cursor and Copilot are still excellent products — for some teams the fixed monthly cost and curated model selection is exactly what’s wanted. But Cline is the option that makes “AI coding agent” available to anyone with a laptop and a free API key, with no gatekeeping and no contract. Install the extension, point it at Gemini, give it a small task, and decide for yourself.

Together AI Free API: Run Llama 3.3, DeepSeek R1, and FLUX Image Generation for Free in 2026

toolfreebie — Sun, 03 May 2026 15:54:07 +0000

What Is Together AI?

Together AI is an AI inference platform that hosts hundreds of open-source models behind one OpenAI-compatible API. Founded in 2022 and backed by NVIDIA, Salesforce Ventures, and Kleiner Perkins, the company built its reputation around two things developers actually care about: fast hosted inference for state-of-the-art open models (Llama, DeepSeek, Qwen, Mixtral) and a genuinely free tier that exposes a small but useful set of those models with no credit card required.

What separates Together AI from the long list of “free AI API” providers in 2026 is the breadth of categories you can hit on a single key. One signup gives you free access to:

Llama 3.3 70B Instruct Turbo (Free) — Meta’s flagship 70B chat model
DeepSeek R1 Distill Llama 70B (Free) — open reasoning model with chain-of-thought
FLUX.1 schnell — Black Forest Labs’ fast image generation model
Llama 3.2 11B Vision Instruct (Free) — multimodal image-understanding model
Plus hundreds of other open models on a $1 trial credit

If you’re already evaluating Groq, Cerebras, Gemini, or DeepSeek, Together AI fills a different gap: a single endpoint that covers chat, reasoning, vision, and image generation on the same key.

What’s Actually Free on Together AI

Together AI uses a clear naming convention: any model whose ID ends with the suffix -Free (lowercase -free on some IDs) can be called without consuming credits. These are slightly slower than the paid tiers (rate-limited, lower priority) but functionally complete. Everything else runs against the $1 free trial credit you get at signup.

Model ID	Type	Context / Output	Best For
`meta-llama/Llama-3.3-70B-Instruct-Turbo-Free`	Chat / instruction	128K tokens	General assistant, RAG answer generation, code Q&A
`deepseek-ai/DeepSeek-R1-Distill-Llama-70B-free`	Reasoning	32K tokens	Math, multi-step logic, agent planning loops
`meta-llama/Llama-Vision-Free`	Vision (multimodal)	128K tokens	Image captioning, OCR, chart and screenshot understanding
`black-forest-labs/FLUX.1-schnell-Free`	Image generation	1024×1024 default	Blog cover images, prototypes, social posts

Beyond the explicitly free tier, the $1 trial credit is enough to exercise dozens of paid models — Mixtral 8x22B, Qwen 2.5 72B, Llama 3.1 405B, audio models like Whisper, embeddings models like BGE and M2-BERT — for tens of thousands of tokens each, which is plenty to test whether the bigger models meaningfully change your results before you commit a card.

Note: Together AI quietly retires and renames “Free” models from time to time as newer versions land. If a model ID stops working, check the official model list for the current Free variant.

How to Get Your Free API Key

Go to api.together.ai and sign up with email, Google, or GitHub
Verify your email address
From the dashboard, navigate to Settings → API Keys
Copy your default key (a long hex string with no prefix)
Set it as an environment variable: export TOGETHER_API_KEY="your_key_here"

No credit card. No phone number. The $1 free trial credit and access to all -Free models are activated immediately on signup.

curl Quickstart: Your First Request in 30 Seconds

Together AI is fully OpenAI-compatible, so the cleanest way to confirm everything works is a one-shot curl call against the chat completions endpoint:

curl https://api.together.xyz/v1/chat/completions \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo-Free",
    "messages": [
      {"role": "user", "content": "Explain pgvector in two sentences."}
    ]
  }'

If you get back a JSON response with a choices[0].message.content field, you’re set. The exact same payload shape works against OpenAI — only the base URL and the model string change.

Python Quickstart

The official SDK mirrors the interface of the OpenAI Python client. Install it:

pip install together

Basic chat completion:

import os
from together import Together

client = Together(api_key=os.environ["TOGETHER_API_KEY"])

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free",
    messages=[
        {"role": "system", "content": "You are a concise senior engineer."},
        {"role": "user", "content": "When should I prefer SQLite over Postgres?"}
    ],
    max_tokens=400,
)

print(response.choices[0].message.content)

If you already have OpenAI SDK code, swapping providers is a two-line change:

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free",
    messages=[{"role": "user", "content": "Write a haiku about caching."}],
)
print(response.choices[0].message.content)

The parameters you’d normally pass to OpenAI (temperature, top_p, stop, response_format, tools, tool_choice) use the same names and shapes, though support for structured output and tool calling can vary by model.

Streaming Responses

For chat UIs and agent loops, you almost always want token streaming. Set stream=True and iterate:

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free",
    messages=[{"role": "user", "content": "Outline a blog post about RAG."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

Streaming on the Free tier is real streaming, not buffered chunks — you’ll see tokens appear at roughly the model’s true generation rate, which makes it usable for live chat UIs even before you start paying.
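In an agent loop you usually want the full text once streaming finishes, not just the printed tokens. The helper below is a small sketch that accumulates the deltas from any OpenAI-style stream. The name collect_stream and the fake chunks are illustrative only; the fake chunks mimic the attribute shape used in the loop above so the example runs without an API key.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(stream: Iterable) -> str:
    """Concatenate the text deltas from an OpenAI-style streaming response."""
    parts = []
    for chunk in stream:
        if not chunk.choices:      # defensively skip chunks with no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:                  # skip None or empty deltas
            parts.append(delta)
    return "".join(parts)


def _fake_chunk(text):
    """Build a stand-in chunk with the same attribute shape, for local testing."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo = [_fake_chunk("Hello"), _fake_chunk(None), SimpleNamespace(choices=[]), _fake_chunk(", world")]
print(collect_stream(demo))  # Hello, world
```

With a real client you would pass the object returned by client.chat.completions.create(..., stream=True) instead of demo.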

Reasoning with DeepSeek R1 Distill

The DeepSeek R1 family produces visible chain-of-thought reasoning before its final answer. On Together AI’s Free tier you can call the 70B distilled variant, which keeps most of the reasoning capability of the full R1 model at a fraction of the parameter count:

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B-free",
    messages=[
        {
            "role": "user",
            "content": (
                "A bookstore sold 60 books on Monday, then sales grew "
                "12% each day through Friday. How many books did they "
                "sell in total that week? Show your work."
            ),
        }
    ],
    max_tokens=2000,
)

print(response.choices[0].message.content)

The model’s response will include a <think>…</think> block of internal reasoning followed by the final answer. For agent applications, you can either show the reasoning to the user (transparency) or strip it out (clean output) depending on the surface.
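If you go the "strip it out" route, a regular expression is usually enough. This is a minimal sketch assuming the reasoning arrives inline as a <think>…</think> block, as described above. The function name split_reasoning and the sample string are made up for the example; the sample is not real model output.

```python
import re

# Non-greedy match so multiple <think> blocks are handled separately;
# DOTALL lets the reasoning span several lines.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a response containing <think> blocks."""
    reasoning = "\n".join(m.group(1).strip() for m in THINK_BLOCK.finditer(text))
    answer = THINK_BLOCK.sub("", text).strip()
    return reasoning, answer


# Made-up sample in the shape described above.
sample = "<think>60 * (1.12**5 - 1) / 0.12 is about 381.17</think>\nAbout 381 books."
reasoning, answer = split_reasoning(sample)
print(answer)  # About 381 books.
```

Keeping the reasoning as a separate return value lets you log it for debugging while showing only the answer to the user.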

Image Generation with FLUX.1 [schnell] Free

FLUX.1 [schnell] is Black Forest Labs’ fast text-to-image model, distilled to 4 sampling steps and open-sourced under Apache 2.0. Together AI hosts it as a free image-generation endpoint:

response = client.images.generate(
    model="black-forest-labs/FLUX.1-schnell-Free",
    prompt="A clean isometric illustration of an AI agent fetching data from a cloud database, soft pastel colors, no text",
    width=1024,
    height=1024,
    steps=4,
    n=1,
)

print(response.data[0].url)

The returned URL is hosted by Together AI and stays valid long enough to download or pipe into a CDN. For blog covers, social posts, or quick mockups, FLUX.1 [schnell] often beats Stable Diffusion XL on prompt adherence at a fraction of the inference time.
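Since the hosted URL is temporary, it is worth saving the image as soon as the call returns. Here is a sketch of a download helper using only the standard library; save_image is a hypothetical name, and the 30-second timeout is an arbitrary choice.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def save_image(url: str, dest: Path) -> Path:
    """Stream the bytes at url into dest and return the path written."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url, timeout=30) as resp, open(dest, "wb") as out:
        # Copy in chunks so large images are never held fully in memory.
        shutil.copyfileobj(resp, out)
    return dest
```

After the generation call above, save_image(response.data[0].url, Path("covers/agent.png")) would write the file locally, ready to upload to your own storage or CDN.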

Vision: Llama 3.2 Vision Free

The Free vision model accepts standard OpenAI-format multimodal messages — text plus image URLs or base64 data:

response = client.chat.completions.create(
    model="meta-llama/Llama-Vision-Free",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this dashboard show? List the three highest values."},
                {"type": "image_url", "image_url": {"url": "https://example.com/dashboard.png"}}
            ]
        }
    ],
)
print(response.choices[0].message.content)

This is the cheapest path in 2026 to a working “describe this screenshot” or “extract data from this chart” feature without standing up your own vision pipeline. For OCR-heavy workloads on dense documents, a paid vision model will still outperform it, but for screenshots, charts, product photos, and general image Q&A, Llama Vision Free is genuinely useful.

Together AI vs Other Free AI APIs

Provider	Free Chat	Free Reasoning	Free Vision	Free Image Gen	OpenAI Compatible
Together AI	Llama 3.3 70B	DeepSeek R1 Distill 70B	Llama 3.2 Vision 11B	FLUX.1 schnell	Yes
Groq	Llama 3.3 70B (very fast)	DeepSeek R1 Distill	Llama Vision	No	Yes
Cerebras	Llama 3.3 70B (extremely fast)	Limited	No	No	Yes
Gemini	Gemini 2.0 Flash	Gemini 2.0 Flash Thinking	Built in	Imagen (limited)	Via compat layer
Cloudflare Workers AI	Llama 3 / Mistral	Limited	LLaVA	SDXL Lightning	Yes
OpenRouter	Many free models	DeepSeek R1 free	Several	Limited	Yes

Where Together AI wins on the free tier: coverage. It’s the only provider on this list that offers chat, reasoning, vision, and image generation under one OpenAI-compatible endpoint, on one key, with no credit card. If you’re prototyping a multimodal product and don’t want to juggle three or four signups, Together AI compresses the entire surface area into one integration.

Where the others win: raw speed (Cerebras and Groq are faster on Llama 3.3 70B), context window (Gemini’s 1M tokens is unmatched), or model variety (OpenRouter aggregates more providers).

Rate Limits and Fair Use

Free-tier rate limits on Together AI exist to keep costs predictable. The exact numbers are published in the official rate limits page and change as the platform scales, but as a working mental model in 2026:

-Free chat models: low double-digit requests per minute, with smaller per-day caps than paid tiers
-Free image models: tighter caps (image inference is much more expensive), often a few requests per minute
Paid models on trial credit: the standard tier-1 limits, but capped by your $1 budget — usually thousands of requests before the credit runs out on smaller models

The headline takeaway: Free-tier limits are designed for development and prototyping. They are not designed to support a production user base. If your side project starts getting traction, you’ll need to either move to a paid plan or layer caching in front (request deduplication on prompts is the highest-leverage win).
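Request deduplication can be as simple as hashing the model name plus the exact message list and reusing the stored completion on a repeat. The sketch below is one way to do that under stated assumptions: `PromptCache`, `cache_key`, and `fake_api` are hypothetical names, the cache is in-memory only, and `fake_api` stands in for a real chat-completion call so the example runs offline.

```python
import hashlib
import json
from typing import Callable

def cache_key(model: str, messages: list[dict]) -> str:
    """Stable hash of the model name plus the exact message list."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class PromptCache:
    """In-memory cache that only calls the API for unseen (model, messages) pairs."""

    def __init__(self, call_api: Callable[[str, list[dict]], str]):
        self._call_api = call_api
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def complete(self, model: str, messages: list[dict]) -> str:
        key = cache_key(model, messages)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._call_api(model, messages)
        self._store[key] = result
        return result

# Stand-in for a real API call so the sketch runs offline.
def fake_api(model: str, messages: list[dict]) -> str:
    return f"echo: {messages[-1]['content']}"

cache = PromptCache(fake_api)
msgs = [{"role": "user", "content": "What is RAG?"}]
cache.complete("example-model", msgs)
cache.complete("example-model", msgs)  # served from cache, no second API call
print(cache.hits, cache.misses)
```

Exact-match caching only helps when prompts repeat verbatim, so it works best for FAQ-style traffic or deterministic pipeline steps. For anything user-facing you would also want an expiry policy.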

When to Use Together AI vs Alternatives

A simple decision tree based on what you’re optimizing for:

Need everything in one key — chat + reasoning + vision + images? → Together AI Free tier
Need the fastest possible chat response (under 1 second to first token)? → Cerebras or Groq
Need a 1M-token context window for long documents? → Gemini
Need the widest catalogue of free models from many providers? → OpenRouter
Need the best free embedding + reranker for RAG? → Cohere
Building edge functions and want inference inside Cloudflare? → Cloudflare Workers AI

Together AI is the right answer when your project benefits from a single integration that covers many capabilities, especially for multimodal applications and reasoning-heavy agents that may also need image generation.

Use Together AI with OpenClaw

OpenClaw is an AI agent platform that orchestrates multiple APIs and tools into automated workflows. Together AI fits well as a single inference layer behind an OpenClaw agent that needs to handle multiple modalities — read a screenshot, reason about what to do next, and produce a generated image as part of the output.

A working example: an OpenClaw agent receives a customer support ticket that includes a screenshot of an error. The agent uses Llama Vision (Free) to extract the error message from the image, DeepSeek R1 Distill (Free) to reason about which knowledge-base article applies, Llama 3.3 70B (Free) to draft a reply, and FLUX.1 schnell to generate a clean diagram for the customer if a visual explanation helps. All four steps hit the same API key.

import os
from together import Together

client = Together(api_key=os.environ["TOGETHER_API_KEY"])

def support_pipeline(ticket_text: str, screenshot_url: str) -> dict:
    """A multi-modal support agent step for OpenClaw."""

    # 1. Extract the error from the screenshot
    vision = client.chat.completions.create(
        model="meta-llama/Llama-Vision-Free",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Read the error message in this screenshot and return only the error text."},
                {"type": "image_url", "image_url": {"url": screenshot_url}}
            ]
        }]
    )
    error_text = vision.choices[0].message.content

    # 2. Reason about which solution applies
    reasoning = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B-free",
        messages=[{
            "role": "user",
            "content": f"Ticket: {ticket_text}\nError extracted: {error_text}\nWhat is the most likely root cause?"
        }],
        max_tokens=800,
    )
    # R1 models emit a <think>...</think> block first; keep only the text after it.
    # If the output was cut off before </think>, this falls back to the full text.
    analysis = reasoning.choices[0].message.content.split("</think>")[-1].strip()

    # 3. Draft a customer-facing reply
    reply = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free",
        messages=[
            {"role": "system", "content": "You are a senior support engineer. Be concise and friendly."},
            {"role": "user", "content": f"Ticket: {ticket_text}\nRoot cause analysis: {analysis}\nWrite the reply to the customer."}
        ],
    )

    return {
        "error": error_text,
        "analysis": analysis,
        "reply": reply.choices[0].message.content,
    }

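The same pattern fits other OpenClaw use cases: a research agent that reads charts and reasons about them, a content agent that writes a post and generates its cover image, a QA agent that screenshots a UI and verifies what it sees. The single-key, single-SDK shape keeps the agent code small. Note that the function above covers the first three steps; the optional diagram step would be one more client.images.generate call, as in the FLUX.1 [schnell] example earlier.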

Pricing When You Outgrow Free

If your application moves beyond prototyping, Together AI’s serverless pricing for the same models is competitive with the rest of the market. Approximate published prices in 2026 for popular models:

Model	Approx Price	Unit
Llama 3.3 70B Instruct Turbo	~$0.88	per 1M tokens (blended)
Llama 3.1 8B Instruct Turbo	~$0.18	per 1M tokens (blended)
Llama 3.1 405B Instruct Turbo	~$3.50	per 1M tokens (blended)
DeepSeek R1	~$3.00 / $7.00	per 1M input / output tokens
FLUX.1 [schnell]	~$0.003	per image (1024×1024, 4 steps)
BGE / M2-BERT embeddings	~$0.008 to $0.05	per 1M tokens (model-dependent)

Two things make this pricing especially friendly for solo builders. First, you only pay for what you use — there’s no monthly minimum. Second, the same key works for both the Free tier and paid models, so there’s no migration cost when you flip from free to paid for a single hot model. Check the official pricing page for current numbers.

FAQ

Is Together AI’s Free tier really free, or is it a trial?

Both. Models with the -Free suffix are free to call indefinitely (rate-limited but non-expiring). All other models run against a one-time $1 trial credit at signup. Once the trial credit is gone, paid models stop until you add a payment method.

Do I need a credit card to sign up?

No. The default account state has no payment method on file. You only need to add one when you want to spend beyond your trial credit on paid models — Free-tier models keep working either way.

Is the API truly OpenAI-compatible?

Yes for chat completions, streaming, and tool calling. Image generation uses Together AI’s own endpoint shape (which closely mirrors OpenAI’s). Embeddings are also OpenAI-compatible. In practice, you can point any OpenAI SDK at https://api.together.xyz/v1 and most code works without changes.

What’s the difference between “Turbo” and non-Turbo models?

Turbo variants are quantized (typically FP8) for higher throughput at very small quality loss. Together AI publishes evaluation numbers showing Turbo variants stay within a fraction of a percent of full-precision quality on standard benchmarks. For nearly all production use cases, prefer Turbo.

Can I use Together AI for commercial projects?

Yes — both the Free and paid tiers permit commercial use, subject to each model’s underlying license. Llama models follow Meta’s Llama Community License, FLUX.1 [schnell] is Apache 2.0, and so on. Confirm any specific model’s license on its model card before shipping.

Does Together AI store my prompts or completions?

Together AI’s stated policy is that they don’t train on your data and that prompts are not retained beyond what’s needed for abuse prevention. For sensitive workloads, the dedicated/enterprise tiers offer stronger data-handling guarantees. Re-check the current privacy policy before sending real customer data.

How does the Free tier compare to running models locally with Ollama?

Ollama is unbeatable for offline development and zero-cost long-running tasks, but it’s bounded by the GPU on your laptop — running Llama 3.3 70B locally requires serious hardware. Together AI’s Free tier gives you the same model running on a real datacenter GPU, just with rate limits. The two tools are complements: prototype locally with Ollama on a smaller model, then call Together AI when you need the 70B for the parts that matter.

Final Verdict

Together AI’s Free tier is the most underrated entry point in the free-AI-API space because it solves a problem most other free APIs ignore: multimodal coverage on a single key. Every other provider in this category is great at one thing — Cerebras for raw speed, Gemini for context length, Cohere for retrieval, Cloudflare for edge — and forces you to integrate three or four of them if your project needs more than one capability. Together AI’s -Free models give you chat, reasoning, vision, and image generation behind one HTTPS endpoint, one SDK, and one key, with no credit card.

For prototyping multimodal agents, building a side project that mixes capabilities, or just keeping one fewer signup form on your “maybe later” list, Together AI’s Free tier earns its place in any serious 2026 free-AI-API stack. Sign up at api.together.ai, copy the key, and your first chat completion is about three minutes away.

Cohere Free API: The Best Free Embedding and Rerank API for RAG in 2026

toolfreebie — Sun, 03 May 2026 15:50:01 +0000

What Is Cohere?

Cohere is a Toronto-based AI company founded in 2019 by Aidan Gomez (one of the original authors of the “Attention Is All You Need” Transformer paper), Ivan Zhang, and Nick Frosst. Unlike OpenAI or Anthropic, Cohere built its platform from day one around a specific use case: enterprise retrieval and RAG (Retrieval-Augmented Generation).

That focus shows up in three places where Cohere genuinely leads the field — and where most developers don’t realize they can get it for free:

Embed v3 — text embeddings that consistently rank near the top of the MTEB benchmark, in both English and 100+ other languages
Rerank v3 — the most-deployed neural reranker in production RAG systems, available via a single API call
Command R / R+ — chat models specifically trained for RAG, tool use, and grounded citations

And the part most developers miss: a free Cohere trial key gives you access to all of these. No credit card, no time limit. The only constraint is per-minute rate limiting, which is fine for prototyping, side projects, and real development work before you move production traffic to a paid key.

What’s Free on Cohere

Cohere has two key types: Trial keys (free) and Production keys (paid). Trial keys never expire — they’re rate-limited but otherwise unrestricted.

Endpoint	Trial Rate Limit	Production Rate Limit
Chat (Command R/R+)	20 calls/min	500 calls/min
Embed	100 calls/min	2,000 calls/min
Rerank	10 calls/min	1,000 calls/min
Classify	100 calls/min	1,000 calls/min
Summarize	5 calls/min	500 calls/min

Notice the Embed limit: 100 calls per minute with up to 96 documents per call. That’s effectively 9,600 embeddings per minute on the free tier — more than enough to index a personal knowledge base or a small document corpus from scratch in a few minutes.
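To make use of that ceiling you need to batch your texts before calling the embed endpoint. Here is a minimal sketch of the batching and the resulting time estimate, using only the two limits quoted above; `batches` and `minutes_to_index` are hypothetical helper names, and the estimate ignores network latency and retries.

```python
import math

MAX_TEXTS_PER_CALL = 96    # per-call document cap quoted above
TRIAL_CALLS_PER_MIN = 100  # trial Embed rate limit from the table

def batches(texts: list[str], size: int = MAX_TEXTS_PER_CALL):
    """Yield consecutive slices of at most `size` texts."""
    for start in range(0, len(texts), size):
        yield texts[start:start + size]

def minutes_to_index(n_texts: int) -> int:
    """Lower bound on wall-clock minutes needed at the trial rate limit."""
    calls = math.ceil(n_texts / MAX_TEXTS_PER_CALL)
    return math.ceil(calls / TRIAL_CALLS_PER_MIN)

corpus = [f"doc {i}" for i in range(250)]
sizes = [len(b) for b in batches(corpus)]
print(sizes)                     # [96, 96, 58]
print(minutes_to_index(50_000))  # 6
```

Each batch would then go into one embed call as its `texts` argument.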

Note: Trial keys are not for production traffic, but they are for real development. Cohere’s documentation explicitly encourages building and testing on trial keys before upgrading.

How to Get Your Free API Key

Go to dashboard.cohere.com/welcome/register and sign up with email or Google
Verify your email address
From the dashboard, navigate to API Keys in the left sidebar
Your default Trial key is already there — copy it
Set it as an environment variable: export COHERE_API_KEY="your_key_here"

No credit card. No phone number. Two minutes from signup to your first embedding.

Python Quickstart: Your First Embedding

Install the official Cohere Python SDK:

pip install cohere

Embedding three documents:

import os
import cohere

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])

response = co.embed(
    texts=[
        "Cohere makes the best free embedding API for RAG.",
        "OpenClaw is an AI agent platform for orchestrating tools.",
        "Toronto is the headquarters of Cohere."
    ],
    model="embed-english-v3.0",
    input_type="search_document",
    embedding_types=["float"]
)

print(f"Got {len(response.embeddings.float)} embeddings")
print(f"Each embedding is {len(response.embeddings.float[0])} dimensions")

That returns three 1024-dimensional vectors you can drop into any vector database — Pinecone, Weaviate, Chroma, Qdrant, pgvector, or just a NumPy array.

The input_type parameter is important: Cohere’s embeddings are asymmetric. Use "search_document" when indexing your corpus, and "search_query" when embedding the user’s question. Treating them differently gives noticeably better retrieval quality than symmetric embedding APIs.

Embedding Models You Get for Free

Model ID	Dimensions	Languages	Best For
`embed-english-v3.0`	1024	English	Highest quality English search and RAG
`embed-multilingual-v3.0`	1024	100+	Multilingual search, cross-language RAG
`embed-english-light-v3.0`	384	English	Smaller index, faster queries, low storage
`embed-multilingual-light-v3.0`	384	100+	Multilingual on a budget

For most RAG projects, embed-english-v3.0 at 1024 dimensions is the sweet spot. If you’re storing millions of vectors and storage cost matters, the light variants drop to 384 dimensions, which shrinks raw vector storage by about 62% (384 ÷ 1024 = 0.375), with only a small quality drop.
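The size difference between 1024 and 384 dimensions is easy to check with quick arithmetic. This sketch assumes vectors are stored as 32-bit floats and ignores any index overhead your vector database adds; `index_size_gb` is a hypothetical helper name.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as 32-bit floats

def index_size_gb(n_vectors: int, dims: int) -> float:
    """Raw vector storage in GB (10**9 bytes), ignoring index overhead."""
    return n_vectors * dims * BYTES_PER_FLOAT32 / 1e9

full = index_size_gb(1_000_000, 1024)
light = index_size_gb(1_000_000, 384)
saving = 1 - light / full
print(f"{full:.3f} GB vs {light:.3f} GB, {saving:.1%} smaller")
```

For one million vectors that is roughly 4.1 GB against 1.5 GB of raw storage.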

Cohere Rerank: The Secret Weapon for RAG Quality

Here is where Cohere genuinely leads: Rerank. After your vector database returns the top 50 or 100 candidate documents, you pass them to Rerank along with the user’s query. Rerank scores each document for actual relevance and reorders them. The top 5 reranked results are almost always dramatically better than the top 5 from raw vector similarity.

import os
import cohere

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])

query = "How do I add a free embedding API to my chatbot?"

documents = [
    "Cohere offers free embedding API access through trial keys.",
    "Pinecone is a managed vector database service.",
    "OpenAI embeddings cost $0.02 per million tokens.",
    "Use embed-english-v3.0 for the best quality English embeddings.",
    "Vector databases store high-dimensional vectors for similarity search."
]

response = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=documents,
    top_n=3
)

for result in response.results:
    print(f"Score: {result.relevance_score:.4f}  |  {documents[result.index]}")

That returns the three documents most relevant to the query, with calibrated relevance scores between 0 and 1. In production RAG systems, adding a Rerank step typically boosts answer quality by 15–30% over vector-similarity-only retrieval — which is why it’s the most-deployed neural reranker in commercial RAG stacks.

And it’s free on the trial key: 10 calls per minute, with up to 1,000 documents per call.
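At 10 calls per minute it is easy to trip the limit in a loop, so it can help to pace requests on the client side. The sketch below spaces calls evenly, which is stricter than a per-minute quota requires but simple to reason about. `Throttle` and its `clock` and `sleep` parameters are hypothetical names, not part of the Cohere SDK. The demo injects a fake clock so it finishes instantly.

```python
import time
from typing import Callable, Optional

class Throttle:
    """Spaces calls evenly so no more than `calls_per_minute` are made."""

    def __init__(self, calls_per_minute: int,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self._min_interval = 60.0 / calls_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None  # time of the previous call

    def wait(self) -> float:
        """Block until the next call is allowed; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self._last_call + self._min_interval - now)
        if delay > 0:
            self._sleep(delay)
        self._last_call = now + delay
        return delay

# Demo with a fake clock so the example finishes instantly.
fake_now = [0.0]

def fake_sleep(seconds: float) -> None:
    fake_now[0] += seconds

throttle = Throttle(10, clock=lambda: fake_now[0], sleep=fake_sleep)
delays = [throttle.wait() for _ in range(3)]
print(delays)  # [0.0, 6.0, 6.0]
```

In real code you would construct `Throttle(10)` with the default clock and call `throttle.wait()` right before each rerank request.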

Chat with Command R+: Built for RAG

Cohere’s Command R+ chat model is purpose-built for RAG. Unlike most chat APIs where you stuff retrieved documents into the system prompt, Cohere’s chat endpoint accepts a structured documents parameter — and the model returns inline citations pointing to which documents each claim came from.

import os
import cohere

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])

response = co.chat(
    model="command-r-plus",
    messages=[
        {"role": "user", "content": "Which Cohere embedding model should I use for English RAG?"}
    ],
    documents=[
        {"data": {"text": "embed-english-v3.0 produces 1024-dimensional embeddings and leads MTEB English benchmarks."}},
        {"data": {"text": "embed-english-light-v3.0 produces 384-dimensional embeddings, optimized for low storage cost."}},
        {"data": {"text": "embed-multilingual-v3.0 supports over 100 languages."}}
    ]
)

print(response.message.content[0].text)
print()
print("Citations:")
for citation in response.message.citations or []:
    print(f"  - '{citation.text}' from sources: {[s.id for s in citation.sources]}")

The model produces a grounded answer that cites which document each fact came from. For RAG applications where users need to verify the source of every claim — legal, medical, internal knowledge bases — this is significantly more useful than free-text generation.

Free Chat Models on Cohere

Model ID	Size	Context Window	Best For
`command-r-plus`	104B	128k tokens	Best quality, complex RAG, tool use
`command-r`	35B	128k tokens	Faster RAG, cheaper-when-paid baseline
`command-r7b`	7B	128k tokens	Fastest responses, simple Q&A

All three are available through your free trial key at the same 20-calls-per-minute rate limit. command-r-plus is the headline model — it scores comparably to GPT-4o on RAG benchmarks while being explicitly trained to follow document citations.

End-to-End RAG Pipeline (All Free)

Here’s a complete RAG pipeline using only Cohere’s free trial key — embed, store, retrieve, rerank, and answer:

import os
import numpy as np
import cohere

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])

# 1. Your knowledge base
documents = [
    "OpenClaw is an AI agent platform for orchestrating multiple AI APIs and tools.",
    "Cohere Embed v3 produces 1024-dimensional vectors optimized for retrieval.",
    "Cohere Rerank v3 reorders candidate documents by true relevance to the query.",
    "Command R+ is a 104B model trained specifically for RAG with citations.",
    "Free trial keys on Cohere have no time limit — only per-minute rate limits.",
]

# 2. Index documents
doc_embeds = co.embed(
    texts=documents,
    model="embed-english-v3.0",
    input_type="search_document",
    embedding_types=["float"]
).embeddings.float
doc_matrix = np.array(doc_embeds)

# 3. Embed the query
query = "How do I get free access to Cohere's RAG models?"
query_embed = np.array(co.embed(
    texts=[query],
    model="embed-english-v3.0",
    input_type="search_query",
    embedding_types=["float"]
).embeddings.float[0])

# 4. Vector similarity — get top 3 candidates
scores = doc_matrix @ query_embed
top_indices = np.argsort(scores)[-3:][::-1]
candidates = [documents[i] for i in top_indices]

# 5. Rerank to get best 2
reranked = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=candidates,
    top_n=2
)
top_docs = [candidates[r.index] for r in reranked.results]

# 6. Answer with Command R+ using grounded citations
answer = co.chat(
    model="command-r-plus",
    messages=[{"role": "user", "content": query}],
    documents=[{"data": {"text": d}} for d in top_docs]
)

print(answer.message.content[0].text)

That’s a full production-shape RAG pipeline — embed, retrieve, rerank, generate with citations — running on a free trial key with zero credit card on file.

JavaScript / Node.js Example

npm install cohere-ai

import { CohereClientV2 } from "cohere-ai";

const co = new CohereClientV2({ token: process.env.COHERE_API_KEY });

const response = await co.embed({
  texts: [
    "Cohere is the best free embedding API for RAG.",
    "Toronto is the headquarters of Cohere."
  ],
  model: "embed-english-v3.0",
  inputType: "search_document",
  embeddingTypes: ["float"]
});

console.log(`Got ${response.embeddings.float.length} embeddings`);

Cohere vs Other Free Embedding Options

Provider	Free Embedding Model	Dimensions	Multilingual	Reranker?
Cohere	embed-english-v3.0 / multilingual-v3.0	1024 (384 for light variants)	100+ languages	Yes (Rerank v3)
Google Gemini	text-embedding-004	768	Limited	No
Mistral AI	mistral-embed	1024	Limited	No
Cloudflare Workers AI	bge-base-en-v1.5	768	English only	No
Hugging Face Inference	BGE / E5 family	varies	Some multilingual	No (manual setup)
OpenAI (paid only)	text-embedding-3-large	3072	Strong multilingual	No

Where Cohere wins on the free tier: the only provider on this list that ships a hosted neural reranker. For RAG quality, that single feature usually matters more than which embedding model you started with. Combined with asymmetric embeddings (separate search_query and search_document modes), Cohere’s free tier is a credible foundation for real retrieval applications — not just a demo toy.

Use Cohere with OpenClaw

OpenClaw is an AI agent platform that orchestrates multiple APIs and tools into automated workflows. Cohere fits well as the retrieval and grounding layer inside OpenClaw agents — the part that searches your private documents before the agent acts.

A common pattern: an OpenClaw agent receives a user task (“draft a reply to this customer ticket”), uses Cohere Embed + Rerank to pull the three most relevant past tickets and policies from your knowledge base, then passes those documents to Command R+ to generate a cited reply. Because Cohere returns explicit citations, the agent can attach source links to the draft for human review.

import os
import cohere

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])

def retrieve_and_answer(question: str, knowledge_base: list[str]) -> dict:
    """A retrieval-then-answer step for use inside an OpenClaw agent."""
    # Rerank handles both retrieval and ranking in one call
    reranked = co.rerank(
        model="rerank-english-v3.0",
        query=question,
        documents=knowledge_base,
        top_n=3
    )
    top_docs = [knowledge_base[r.index] for r in reranked.results]

    answer = co.chat(
        model="command-r-plus",
        messages=[{"role": "user", "content": question}],
        documents=[{"data": {"text": d}} for d in top_docs]
    )

    return {
        "answer": answer.message.content[0].text,
        "sources": top_docs,
        "citations": answer.message.citations or []
    }

# Example use inside an agent step
result = retrieve_and_answer(
    question="What is our refund policy for digital downloads?",
    knowledge_base=load_company_kb()  # your own loader
)
print(result["answer"])

Notice: when you only have a few hundred candidate documents, you can skip the embedding/vector-DB step entirely and just pass everything to Rerank. The free trial key allows up to 1,000 documents per Rerank call, which covers a surprising number of small-to-medium knowledge bases.

Cohere Pricing (When You Need More)

Model	Price	Unit
Command R+	$2.50 input / $10.00 output	per 1M tokens
Command R	$0.15 input / $0.60 output	per 1M tokens
Command R7B	$0.0375 input / $0.15 output	per 1M tokens
Embed v3 (English / Multilingual)	$0.10	per 1M tokens
Rerank v3	$2.00	per 1,000 searches

When you graduate from a Trial key to a Production key, Command R7B at $0.15 per million output tokens is one of the cheapest production-grade models available. Embed v3 at $0.10 per million tokens is competitive with or cheaper than every comparable hosted embedding API.
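To see what those per-token prices mean for a real workload, multiply them out. This is a rough sketch using only the chat prices from the table above; the workload (10,000 requests at 2,000 input and 300 output tokens each) is a made-up example, and `chat_cost` is a hypothetical helper name.

```python
# Prices in USD per 1M tokens, copied from the table above.
PRICES = {
    "command-r-plus": {"input": 2.50, "output": 10.00},
    "command-r": {"input": 0.15, "output": 0.60},
    "command-r7b": {"input": 0.0375, "output": 0.15},
}

def chat_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for the given token counts."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

# Hypothetical workload: 10,000 requests, 2,000 input and 300 output tokens each.
for model in PRICES:
    cost = chat_cost(model, 10_000 * 2_000, 10_000 * 300)
    print(f"{model}: ${cost:.2f}")
```

For this workload the estimate comes to about $80 on Command R+, $4.80 on Command R, and $1.20 on Command R7B, which is why routing simple questions to the smaller models matters once you are paying.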

When to Use Cohere

Cohere is the right choice when:

You’re building a RAG application and want the best free embeddings + reranker combo
You need multilingual retrieval across 100+ languages without changing models
Your application requires grounded citations (legal, medical, internal knowledge bases)
You want asymmetric embeddings (separate query and document modes) for better search quality
You’re prototyping retrieval pipelines and want generous free per-minute limits

Consider alternatives when:

You need raw chat throughput more than retrieval quality — use Groq or Cerebras for speed, Gemini Flash for free quota
You want OpenAI SDK drop-in compatibility — use Mistral AI or DeepSeek
You need image, audio, or multimodal generation — Cohere is text-only
You’re building a pure chatbot with no retrieval — Command R+ works, but the model isn’t priced or designed around that use case

Final Verdict

Cohere is the most underrated free AI API for one specific reason: it’s the only provider that ships a complete RAG stack — embeddings, reranker, and a chat model trained for grounded citations — all behind a single free trial key. Most “free AI API” articles skip Cohere because they only compare chat models, where Cohere is fine but not best-in-class. That misses the point of what the company actually built.

If your project involves search over your own documents, internal knowledge bases, customer tickets, product catalogs, or anything resembling RAG, Cohere’s free tier covers more of the pipeline than any other single provider. Sign up at dashboard.cohere.com, copy your trial key, and your first reranked retrieval is about ten minutes away.

Originally published at toolfreebie.com.

Free AI Video Generators in 2026: Kling vs Pika vs HeyGen Compared

toolfreebie — Sun, 03 May 2026 15:45:57 +0000

The State of Free AI Video Generation in 2026

Two years ago, generative video was a research demo. You’d see a five-second OpenAI Sora clip on Twitter, a Runway Gen-2 reel that looked like a melted oil painting, and a vague feeling that “real” AI video was still a year or two out. By early 2026 that’s no longer true. There are three tools I now reach for every week — Kling, Pika, and HeyGen — and all three have a free tier you can use without a credit card.

The three solve different problems. Kling is what you use when you want a cinematic short clip generated from a still image or a text prompt. Pika is what you use when you want to direct a scene with motion brushes, lip-sync, and quick edits. HeyGen is what you use when you want a talking-head video of a fake (or real, with permission) person reading a script you wrote. They are not competitors so much as three slots in the same AI video toolkit.

This article walks through each tool, what its free tier actually includes in April 2026, where the rough edges are, and how I’ve wired all three into automation built on top of OpenClaw for batch video generation. If you’re a creator, a developer building media tooling, or a marketer trying to stop paying $150/month for stock video, one or more of these will earn its place in your workflow.

The Quick Verdict, Up Front

If you only have ten seconds to read this article:

Kling — best for cinematic image-to-video and text-to-video. Free tier gives you ~166 credits/day (about 6 short clips) and 1080p output on the standard model.
Pika — best for scene-level direction, motion brushes, and quick edits to existing video. Free tier is 250 credits at signup with limited regeneration.
HeyGen — best for AI avatar talking-head videos for marketing, training, and tutorials. Free tier is three minutes of video per month with a watermark.

The rest of this article is the long version of why I picked those three over the dozen other contenders, what the actual workflow looks like, and how to chain them together for things like automated short-form video pipelines.

How I Picked These Three

The free AI video space is crowded. There’s Runway, Luma Dream Machine, Hailuo (MiniMax), Vidu, Kling, Pika, HeyGen, Synthesia, D-ID, Sora when you can get a slot, and a long tail of WeChat-only Chinese tools. To narrow the field, I tested for:

A real free tier in April 2026. Not a “free trial that needs a card.” Several big-name tools quietly removed credit-card-free signup over the last year.
Output quality I’d actually use. Not just demo-reel cherry-picks. I generated the same prompt across every candidate and compared the dud rate.
Different problem space. Three text-to-video tools that do the same thing isn’t a useful roundup. I picked one cinematic generator (Kling), one motion-control editor (Pika), and one talking-head avatar (HeyGen).
API or automation surface. At least one of the three needs to be scriptable, because that’s where AI video gets interesting beyond hobby use.

Notable tools that didn’t make this list and why:

Runway Gen-3 — beautiful output, but the free tier is now 125 one-time credits and that’s it. Once you’ve burned them, you’re paying. Kling and Pika are more sustainable for ongoing free use.
Luma Dream Machine — solid quality, but the free tier dropped to 30 generations/month in late 2025. Workable for occasional use but more limited than Kling’s daily refresh.
Sora — when you can get access through a ChatGPT Plus account it’s stunning, but it’s not really “free” — you’re paying for the Plus subscription.
Synthesia — free tier removed in 2024. Fully paid product now.

1. Kling: The Best Free Cinematic Video Generator

Kling's English-locale community landing — the desert-driving hero is itself a Kling-generated clip.

Kling is built by Kuaishou, the Chinese short-video company with hundreds of millions of users, and it’s currently my default for “give me a five-second cinematic shot of X.” The model handles motion, light, and camera moves better than anything else available without payment in 2026. Most importantly, the free tier is unusually generous: a daily credit refresh rather than a one-time pool.

What the Free Tier Actually Includes

As of April 2026, signing up for Kling with an email gives you 166 credits per day. Each generation costs:

Standard text-to-video, 5s, 720p: 10 credits
Standard text-to-video, 5s, 1080p: 20 credits
Standard image-to-video, 5s, 1080p: 20 credits
Pro mode (higher quality, 10s): 35 credits
Lip sync, motion brush, camera control: usually +5 to +10 credits

That works out to 8 standard 1080p clips per day at no cost (166 ÷ 20), closer to 5 or 6 once you add lip sync or camera control, or 4 of the longer Pro clips (166 ÷ 35). The credits don’t roll over, so you have to use them or lose them, but the daily refresh is what makes Kling viable as a long-term free tool rather than a brief trial.
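The per-day clip counts follow from integer division of the daily credits by the per-clip cost. A quick sketch using the credit figures listed above; `clips_per_day` is a hypothetical helper name, and the 10-credit add-on is the upper end of the "+5 to +10" range.

```python
DAILY_CREDITS = 166

# Credit costs per generation, from the list above.
COSTS = {
    "standard_720p": 10,
    "standard_1080p": 20,
    "pro_10s": 35,
}

def clips_per_day(cost: int, extras: int = 0) -> int:
    """Whole clips that fit in one day's credits, with optional add-on cost per clip."""
    return DAILY_CREDITS // (cost + extras)

for name, cost in COSTS.items():
    print(name, clips_per_day(cost))
print("standard_1080p + 10-credit add-on", clips_per_day(20, extras=10))
```

That gives 16 clips at 720p, 8 at 1080p, 4 in Pro mode, and 5 at 1080p when every clip carries a 10-credit add-on.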

The Standard vs Pro Difference

Kling ships two underlying models. Standard is fast (about 60 seconds per generation) and handles most prompts well. Pro takes longer (3-5 minutes), produces noticeably better motion coherence, and supports the longer 10-second outputs. For text-to-video without a reference image, Pro is worth the credit hit; for image-to-video starting from a strong reference still, Standard is usually fine.

A First Generation

The web UI is intentionally simple. Sign in with Google or email, pick text-to-video or image-to-video, type a prompt or upload an image, set duration and resolution, hit Generate. A queue position appears, the clip arrives in your library when ready, and you can download as MP4.

The single most important Kling-specific tip: prompts work best when written like a film shot description, not like a Midjourney prompt. Compare:

Bad: “a cat, cyberpunk, neon, 4k, detailed, cinematic, high quality” — Kling treats the modifiers as scene elements and produces a confused frame.
Good: “Wide shot of a black cat walking slowly through a rainy Tokyo alley at night, neon signs reflected in puddles, slight steam rising from grates, camera tracking right at hip height.”

The good prompt produces something that looks like a real cinematographer made a deliberate choice. The bad prompt produces a beautifully lit cat that doesn’t move convincingly. Tag-spam works for image generators; Kling rewards sentences.
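If you generate prompts programmatically, one way to stay in the “shot description” register is to fill in named fields and join them into a sentence. This is only an illustrative sketch: the `Shot` class and its field names are my own invention, not anything Kling defines.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    """Hypothetical container for the parts of a film-style shot description."""
    framing: str        # e.g. "Wide shot"
    subject: str        # who or what is on screen
    action: str         # what the subject is doing
    setting: str        # where and when
    details: list[str]  # atmosphere, written as phrases rather than tags
    camera: str         # the camera move

    def to_prompt(self) -> str:
        sentence = f"{self.framing} of {self.subject} {self.action} {self.setting}"
        parts = [sentence, *self.details, self.camera]
        return ", ".join(p.strip() for p in parts if p.strip()) + "."


shot = Shot(
    framing="Wide shot",
    subject="a black cat",
    action="walking slowly",
    setting="through a rainy Tokyo alley at night",
    details=["neon signs reflected in puddles", "slight steam rising from grates"],
    camera="camera tracking right at hip height",
)
print(shot.to_prompt())
```

Forcing every prompt through fields like `action` and `camera` makes it harder to fall back into tag-spam, because there is no slot for “4k, detailed, high quality.”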

Image-to-Video Is Where Kling Shines

Kling ships templated image-to-video recipes — the fastest way to evaluate the model on the free tier.

If you upload a still image and write a short motion prompt, Kling produces output that’s substantially better than its text-to-video. The reason is structural: the model only has to invent motion, not the entire visual world. The workflow I use weekly:

Generate a hero still in Midjourney, Imagen, or Flux. Iterate until the image is exactly what I want.
Upload that still to Kling, image-to-video mode, 1080p, 5s.
Prompt with motion only: “Camera slowly pushes in on the subject. Hair moves gently in the wind. Background trees sway.”
Generate two or three takes (Kling is non-deterministic), pick the best one.

This pipeline costs 40-60 credits and produces output you’d otherwise pay a stock-video site $40 for. It’s the single highest-leverage use of Kling’s free tier.

Camera Controls and Motion Brush

Kling’s camera control panel lets you specify pan, tilt, zoom, and orbit moves explicitly rather than hoping the prompt conveys them. Motion brush lets you mask part of the input image and tell the model “move this region in this direction.” Both features cost extra credits but eliminate most of the “the AI didn’t understand what I wanted to move” problem that plagued earlier video generators.

Where Kling Falls Short

Faces drift over longer clips. A 10-second Pro clip with a clear human face will sometimes shift facial features halfway through. Workaround: keep clips at 5 seconds and stitch in DaVinci Resolve.
Text in scenes is unreadable. Like every video model in 2026, signs and on-screen text are gibberish. Generate clean plates and overlay real text in post.
The free tier UI is in Mandarin by default for some signup regions. The English toggle is in the top right; the Mandarin labels are easy to navigate around using the visual layout.
Daily credits don’t accumulate. If you don’t log in for a week, you don’t have 1,162 credits waiting — you have 166. Plan your generation days.

2. Pika: Scene-Level Direction and Motion Brushes

Pika gates everything behind a free account — the modal you see is unavoidable, but signup itself is genuinely free.

Pika is the second tool I keep installed. Where Kling is best at “generate me a cinematic shot,” Pika is best at “take this clip and modify it with surgical precision.” It’s the closest thing in the free AI video space to a non-linear editor where the operations are AI primitives rather than transitions.

What the Free Tier Actually Includes

Pika’s free tier in April 2026 gives you 250 credits at signup, with no automatic daily refresh — you earn small amounts of additional credits by participating in their Discord challenges or referring users. Each generation costs:

Pika 2.2 text-to-video, 5s, 1080p: 30 credits
Image-to-video, 5s, 1080p: 30 credits
Pikaframes (frame-to-frame interpolation): 35 credits
Pikaffects (specific transformation effects): 25-50 credits
Lip sync to audio: 30 credits

That’s roughly 8-10 generations from your initial pool. After that you’re either paying $10/month for the Standard plan (700 credits/mo) or hunting for community credit drops. The free tier is best understood as a generous trial rather than a sustainable daily tool — the opposite shape from Kling.
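Because Pika’s pool is one-time rather than daily, it can help to plan the spend before generating anything. The sketch below uses the April 2026 prices listed above; the helper name and the example plan are hypothetical.

```python
PIKA_SIGNUP_CREDITS = 250
PIKA_COSTS = {
    "text_to_video": 30,
    "image_to_video": 30,
    "pikaframes": 35,
    "lip_sync": 30,
}


def remaining_after(plan: dict[str, int], pool: int = PIKA_SIGNUP_CREDITS) -> int:
    """Credits left after running `plan` (feature name -> number of generations)."""
    spent = sum(PIKA_COSTS[feature] * count for feature, count in plan.items())
    if spent > pool:
        raise ValueError(f"plan needs {spent} credits but only {pool} are available")
    return pool - spent


# An example plan that uses the signup pool exactly.
plan = {"image_to_video": 4, "lip_sync": 2, "pikaframes": 2}
print(remaining_after(plan))
```

Pikaffects are left out of the table because their price is a range (25-50 credits) rather than a fixed number.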

Why Pika Is Worth a Slot Anyway

Pika ships features the others don’t. Specifically:

Pikaffects — pre-built transformation primitives. “Inflate” makes the subject puff up, “explode” replaces them with a particle burst, “melt” liquefies them, “crush” smashes them. They’re designed for short-form social video and they look great. No competitor offers this set as one-click effects.
Pikaframes — give it a starting image and an ending image, get a smooth video between them. Useful for product shots (“from box to assembled”), morphs, and storyboard-to-video.
Lip sync — upload a video of a person and an audio file, Pika rewrites the mouth to match the new audio. Quality is the best of the free tools I tested for this specific task.
Modify region — paint a mask on a frame, prompt the change (“make the shirt red”), Pika regenerates only that region across the clip.

None of these are headline “generate cinematic video from scratch” features, but together they make Pika the right tool for editing AI video the rest of the way.

A Realistic Workflow

The shape of the work I get done with Pika in a week:

Generate a base clip in Kling (uses Kling’s free daily credits).
Bring it into Pika to apply a Pikaffect or run lip sync against a voiceover I generated in ElevenLabs or Coqui.
Export and assemble in DaVinci Resolve (also free).

That pipeline produces social-media-ready short-form video without paying any single tool. Pika’s free credits are limiting if it’s your only tool, but they go a long way when used surgically on top of another generator.

Where Pika Falls Short

Initial credit pool runs out fast. 250 credits sounds like a lot until you realize a single generation is 30. After your first day of experimentation, expect to be on a slower drip.
No public API on the free tier. Pika has an API but it’s invite-only and paid. Automation requires browser automation against the web UI.
Pikaffects are visually distinctive — to a fault. If your audience watches a lot of TikTok they’ve seen the inflate/melt/explode effects on a hundred other accounts. Use sparingly.
Long-form text prompts get truncated. Keep your prompts under ~40 words for best results.

3. HeyGen: AI Avatars That Read Your Script

HeyGen's AI Agent landing — type a prompt, set duration and aspect, and the avatar pipeline kicks off.

HeyGen solves a completely different problem from the other two. Where Kling and Pika generate cinematic or stylized video, HeyGen generates a realistic-looking person speaking words you typed. It’s the tool you reach for when you want a presenter for a tutorial, a marketing video, an e-learning module, or any context where someone needs to look at a camera and explain something.

What the Free Tier Actually Includes

The HeyGen free tier in April 2026 gives you:

3 minutes of video per month across all your generations
Access to ~100 stock avatars (real people who licensed their likeness)
~300 voices in 40+ languages via the built-in TTS
720p export with a HeyGen watermark
Up to 1-minute video length per generation

Three minutes a month sounds tight, and it is. Most explainer videos run 60-90 seconds, and because each generation is capped at one minute, anything longer has to be split across two generations. Realistically you’re looking at two or three videos per month before you’d need to upgrade. For a side project or a single-person business, that’s often enough.
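A rough way to check a script against the allowance before spending any of it is to estimate its spoken length from the word count. The 150 words-per-minute speaking rate below is my assumption, not a HeyGen figure, and the function names are illustrative.

```python
WORDS_PER_MINUTE = 150       # assumed average speaking rate, not a HeyGen number
MAX_SECONDS_PER_VIDEO = 60   # free-tier cap per generation, as listed above
MONTHLY_SECONDS = 180        # 3 minutes per month


def estimated_seconds(script: str, wpm: int = WORDS_PER_MINUTE) -> float:
    """Approximate spoken duration of a script in seconds."""
    return len(script.split()) / wpm * 60


def videos_per_month(seconds_per_video: float, retakes: int = 0) -> int:
    """Finished videos the allowance covers if each one is regenerated `retakes` times."""
    per_video = seconds_per_video * (1 + retakes)
    return int(MONTHLY_SECONDS // per_video)


script = "word " * 150
duration = estimated_seconds(script)
print(duration, duration <= MAX_SECONDS_PER_VIDEO)
print(videos_per_month(60), videos_per_month(60, retakes=1))
```

The `retakes` parameter matters because discarded generations still count against the monthly minutes.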

The Killer Feature: Custom Voice Clone

HeyGen’s standout free feature is Instant Voice Clone — upload a 30-second clip of someone speaking (yours, or someone else’s with their permission) and HeyGen creates a TTS voice that sounds like them. You can then use that voice on any avatar in the platform. Free tier limits you to one voice clone, but the quality is genuinely good in English and the major European languages, and decent in Mandarin and Japanese.

The two-step workflow:

Record yourself reading the HeyGen onboarding paragraph at a normal speaking pace. Upload it.
Wait ~5 minutes. Pick the new voice from the voice dropdown when generating any video.

Combined with the free avatar library, this gets you a presenter who looks like a paid actor and sounds like you. There’s an obvious ethical line here — only clone your own voice or one you have explicit permission for — but the technical capability is there in the free tier.

The Avatar Selection

The 100 free stock avatars cover a wide range of ages, ethnicities, and presentation styles: business-casual person at a desk, casual person against a neutral background, news-anchor framing, etc. They’re filmed people who licensed their image, not generated faces, which means they look genuinely human and don’t fall into the uncanny valley that pure-AI avatars do. Premium tiers unlock more avatars and the ability to create your own custom avatar from a video upload, but the free pool is varied enough for most general-purpose work.

The Generation Workflow

HeyGen feels like a slide editor more than a video generator. You add scenes, each scene has a background (color, image, or stock video), an avatar, and a script. You type the script, pick the voice, and generate. The avatar reads the script with synced lip movement, natural-looking head turns, and basic gestures. Total turnaround for a 60-second video is usually 2-3 minutes.

The most underrated feature: HeyGen translates and dubs in one click. Generate an English video, then use the Translate option to produce a Spanish, French, German, or Mandarin version with the same avatar lip-syncing the new language. Useful for any creator targeting multiple markets without recording multiple takes.

Where HeyGen Falls Short

The watermark on the free tier is visible. It’s a “Made with HeyGen” badge in the corner. Not subtle. If you’re publishing professionally you’ll want the $24/month Creator plan to remove it.
Avatars are static-camera talking heads. No walking around, no scene changes within the avatar shot, no full-body shots. If you want a presenter doing things, you’re back to filming a real person.
3 minutes/month adds up fast if you iterate. Generations against your script all count, including ones you discard. Get the script right in a text editor before generating.
Voice clone needs clean audio. A 30-second clip with background noise produces a noisy clone. Record in a quiet room with a decent USB mic.

The Side-by-Side Comparison

Where each free tier actually lands across the metrics that matter.

Feature	Kling	Pika	HeyGen
Primary use case	Cinematic clips	Scene editing & effects	Talking-head avatars
Text-to-video	Yes (best of the three)	Yes	No (script-to-avatar only)
Image-to-video	Yes (best in class)	Yes	No
Free tier model	~166 credits/day refresh	250 credits at signup	3 minutes/month
Free output resolution	1080p	1080p	720p
Free output watermark	No	No	Yes
Max clip length (free)	10s (Pro) / 5s (Standard)	5s	60s
Lip sync to audio	Limited	Yes (good)	Yes (built into avatars)
Camera control	Yes (explicit panel)	Limited	N/A
Motion brush	Yes	Yes (Modify Region)	N/A
Voice cloning	No	No	Yes (1 voice on free)
Translation/dubbing	No	No	Yes
Public API	Yes (paid)	Invite-only (paid)	Yes (paid tier)
Best for	B-roll, hero shots	Effects, lip-sync, edits	Tutorials, training, marketing

How to Pick — A Decision Tree

If you can answer one question — what kind of video — you don't need to read the rest of the comparison.

Most of the time the choice falls out of one question: what does the final video need to look like?

Cinematic establishing shots, B-roll, or any “make me a beautiful 5-second video” task → Kling. The daily credit refresh means you can iterate without blowing through a fixed pool, and image-to-video on a strong reference still consistently produces the best output of the three.

Effects, lip sync to a voiceover, or modifying an existing clip → Pika. The Pikaffects library is unique, the lip sync quality is the best of the three for re-dubbing footage you didn’t generate, and the modify-region feature is the only way to do localized edits across an AI-generated clip in any free tool.

An explainer video, tutorial, marketing pitch, or anything where someone needs to talk to camera → HeyGen. The avatar quality is genuinely good, the voice clone makes it personal, and the one-click translation lets you reach non-English audiences from a single English script.

The combination I use most is Kling + HeyGen — Kling for the visuals, HeyGen for any spoken intro or outro by a presenter avatar. Pika comes in when I need a specific Pikaffect or a precise edit Kling can’t make.
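For an agent or script that has to route requests automatically, the decision tree above can be written as a small lookup. The keyword sets are my own shorthand for the three branches, so extend them to match your own vocabulary.

```python
# Each rule pairs a set of needs (my shorthand) with the tool recommended above.
RULES = [
    ({"cinematic", "b-roll", "establishing shot", "hero shot"}, "Kling"),
    ({"effects", "lip sync", "modify clip"}, "Pika"),
    ({"explainer", "tutorial", "marketing pitch", "talking head"}, "HeyGen"),
]


def pick_tool(need: str) -> str:
    """Return the tool for a given need, or raise if no rule matches."""
    key = need.strip().lower()
    for needs, tool in RULES:
        if key in needs:
            return tool
    raise ValueError(f"no rule for {need!r}")


print(pick_tool("B-roll"))
print(pick_tool("lip sync"))
print(pick_tool("tutorial"))
```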

Combining All Three: A Free Short-Form Video Pipeline

The free-tier-only pipeline I actually use to produce a 30-second explainer in under five minutes of work.

The pipeline I built in early 2026 to produce one short-form video per day with zero spend:

Script in any LLM. A short 60-second script with a hook, three beats, and a call to action. Claude or DeepSeek for free.
Voiceover in ElevenLabs free tier or Coqui. 10,000 characters/month free in ElevenLabs is enough for ~10 short scripts.
Hero still in Flux Schnell or Imagen 3 free. One image that captures the visual concept of the video.
Cinematic clip from the still in Kling. Image-to-video, 1080p, 5s. Repeat 3-4 times for the different beats of the script.
Lip-synced presenter intro in HeyGen. 10-15 second avatar talking-head intro using the cloned voice.
Edit and assemble in DaVinci Resolve free. Trim, color-grade, add captions (which DaVinci’s built-in transcription generates), export to 9:16 for vertical platforms.

Daily cost: $0. Time per video: ~30 minutes once the workflow is dialed in. The output quality is high enough that most viewers can’t tell the difference between this pipeline and a small studio’s work.

Using Kling, Pika, and HeyGen with OpenClaw

If you’re orchestrating media generation through OpenClaw agents — which is increasingly the right move for batch content production — the three tools fit different parts of the agent’s toolkit. None of them have a fully open free API, but two have paid APIs that an agent can call when scaled, and the web UIs can be driven via browser automation when the volume is small.

The pattern I’ve found works:

Agent generates a script and a still-frame prompt using a free LLM API like DeepSeek or Groq. Both give you enough free quota for hundreds of script generations per day.
Agent calls an image generator (Flux, Imagen 3 via the Gemini free tier) for the hero still.
Browser automation step submits the still to Kling in image-to-video mode, polls for completion, downloads the MP4. This is the part where, until Kling opens a free API, you’re using Playwright or similar.
Agent uses HeyGen’s API for the talking-head intro. HeyGen’s API is paid but inexpensive — about $0.04 per second of video on the lowest tier — and well-suited to programmatic use. For pure-free workflows you can drive HeyGen’s web UI with browser automation too.
Final assembly happens in FFmpeg via the agent’s shell tool. Concat clips, overlay captions, output the final file.
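The final assembly step can be sketched with FFmpeg’s concat demuxer. This sketch assumes the clips have already been normalized to the same codec, resolution, and frame rate (Kling’s 1080p output and HeyGen’s 720p output would need re-encoding first), and the file names are placeholders.

```python
def concat_list(clips: list[str]) -> str:
    """Build the text of a concat-demuxer list file, one `file '...'` line per clip."""
    lines = []
    for clip in clips:
        escaped = clip.replace("'", "'\\''")  # escape single quotes inside paths
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def concat_command(list_path: str, output: str) -> list[str]:
    """FFmpeg arguments that join the listed clips without re-encoding."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",
        "-i", list_path,
        "-c", "copy",
        output,
    ]


print(concat_list(["intro.mp4", "beat1.mp4", "beat2.mp4"]), end="")
print(" ".join(concat_command("clips.txt", "final.mp4")))

# To actually run it, write the list to disk and hand the command to subprocess:
#   from pathlib import Path
#   import subprocess
#   Path("clips.txt").write_text(concat_list(clips))
#   subprocess.run(concat_command("clips.txt", "final.mp4"), check=True)
```

Keeping the command as a list of arguments rather than a shell string avoids quoting problems when an agent supplies the file names.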

The advantage of orchestrating through OpenClaw rather than running each tool by hand is that the agent can iterate on rejected outputs. If a Kling generation comes back with the wrong subject framing, the agent retries with a refined prompt. If the HeyGen avatar’s voiceover trips on a technical word, the agent rewrites the script using the speak-friendly equivalent. This is exactly the kind of multi-step, failure-tolerant workflow that AI agents handle better than rigid scripts — and the free tiers make experimentation cheap.
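The retry behavior described here is independent of any one tool, so it can be sketched generically. The `generate`, `accept`, and `refine` callables below are hypothetical stand-ins for whatever your agent uses to submit a prompt, judge the result, and rewrite the prompt. None of them is an OpenClaw, Kling, or HeyGen API.

```python
from typing import Callable, Optional


def generate_with_retries(
    prompt: str,
    generate: Callable[[str], str],
    accept: Callable[[str], bool],
    refine: Callable[[str, str], str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Call `generate` until `accept` approves a result or attempts run out."""
    for _ in range(max_attempts):
        result = generate(prompt)
        if accept(result):
            return result
        # Feed the rejected output back so the next prompt can correct it.
        prompt = refine(prompt, result)
    return None


# Stub callables so the sketch runs without any external service.
attempts: list[str] = []


def fake_generate(prompt: str) -> str:
    attempts.append(prompt)
    return f"clip for: {prompt}"


result = generate_with_retries(
    "cat in alley",
    generate=fake_generate,
    accept=lambda clip: "wide shot" in clip,
    refine=lambda prompt, _clip: f"wide shot, {prompt}",
)
print(result)
print(len(attempts))
```

Capping `max_attempts` matters on free tiers, since every rejected generation still spends credits.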

For more on building agent workflows that call third-party tools, see our walkthrough of MCP for connecting AI agents to any tool or API.

Honest Limitations of Free AI Video in 2026

Four things to keep in mind before betting a real production schedule on free AI video:

Daily credit caps mean you can’t burst. If a project needs 30 cinematic clips by Friday, the free Kling tier won’t get you there in time — you’d need 5+ days at the daily refresh rate. Plan accordingly or pay for a one-month bump.
Output quality is non-deterministic. Even the best prompt produces a dud roughly one time in three or four. Budget for regeneration credits.
Faces and hands remain the weak point. All three tools handle faces well in close-ups but struggle with subtle facial drift over longer clips. For anything where a viewer will scrutinize a face, Kling’s image-to-video on a strong portrait still is your best chance, and short clips (5s, not 10s) are safer than long ones.
Terms of service vary. Kling and Pika both allow free-tier output to be used commercially as of April 2026, but check before publishing — the Chinese-origin tools in particular have updated their commercial-use clauses repeatedly. HeyGen’s free tier output is technically commercial-use-allowed but the watermark makes it impractical for paid client work.
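The first two limitations compound, and a quick estimate shows by how much. This sketch assumes Kling’s 166 daily credits, 20 credits per 1080p clip, and a dud rate of one in four. The function name is mine and the numbers are the article’s April 2026 figures.

```python
import math


def days_needed(
    usable_clips: int,
    credits_per_clip: int = 20,
    daily_credits: int = 166,
    dud_rate: float = 0.25,
) -> int:
    """Days of daily refreshes needed to end up with `usable_clips` keepers."""
    attempts = math.ceil(usable_clips / (1 - dud_rate))
    attempts_per_day = daily_credits // credits_per_clip
    return math.ceil(attempts / attempts_per_day)


print(days_needed(30))                # with one dud in four
print(days_needed(30, dud_rate=0.0))  # if every generation were usable
```

For a deadline-driven project, run the estimate with a pessimistic dud rate before deciding whether the free tier is enough.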

What’s Coming in the Rest of 2026

Three things to watch:

OpenAI Sora consumer tier. Sora has been API-only and expensive; rumors of a free tier inside ChatGPT Plus could shake up this list overnight.
Open-source video models catching up. Hunyuan Video, Mochi 1, and CogVideoX are usable open-weight models in 2026 — none yet match Kling on a fresh consumer GPU, but they’re closing the gap fast and let you run unlimited free generation on hardware you already own.
HeyGen-style avatar generators going lower-cost. D-ID’s free tier vanished, but new entrants like Hedra and Synthesia’s stripped-down “Studio Free” launched in early 2026 are trying to undercut HeyGen. Worth watching.

This list is current as of April 2026. Free tiers in this space change quarterly — what’s free this week may not be free next week. The pattern of three tools (one cinematic generator, one editor, one talking-head) will outlast any specific provider, even when the names change.

Final Verdict

If you’re going to use one of the three:

Use Kling if you need cinematic clips and want the most generous, sustainable free tier. Daily credit refresh and 1080p output make it the best free general-purpose AI video tool in 2026.
Use Pika if you’re editing or transforming existing clips, lip-syncing voiceovers, or applying social-friendly effects. Limited free credits but unique features.
Use HeyGen if you need a talking-head presenter for tutorials, marketing, or training. Voice clone and one-click translation are killer features inside the free 3 minutes/month.

If you want the full pipeline — and you’re willing to invest 30 minutes a day learning the tools — chain all three together. The output rivals what stock-video subscriptions and small studios charge hundreds of dollars per month for, and the cost is zero. That equation didn’t exist a year ago and probably won’t last forever, so it’s worth using while it’s there.

For more free AI tools that pair well with this video pipeline, see our roundup of the 10 best free AI APIs in 2026 and our guide to Google NotebookLM for free AI research.

Originally published at toolfreebie.com.
