DEV Community

toolfreebie

Posted on • Originally published at toolfreebie.com

Cohere Free API: The Best Free Embedding and Rerank API for RAG in 2026

What Is Cohere?

Cohere is a Toronto-based AI company founded in 2019 by Aidan Gomez (one of the original authors of the “Attention Is All You Need” Transformer paper), Nick Frosst, and Ivan Zhang. Unlike OpenAI or Anthropic, Cohere built its platform from day one around a specific use case: enterprise retrieval and RAG (Retrieval-Augmented Generation).

That focus shows up in three places where Cohere genuinely leads the field — and where most developers don’t realize they can get it for free:

  • Embed v3 — text embeddings that consistently rank near the top of the MTEB benchmark, in both English and 100+ other languages
  • Rerank v3 — the most-deployed neural reranker in production RAG systems, available via a single API call
  • Command R / R+ — chat models specifically trained for RAG, tool use, and grounded citations

And the part most developers miss: a free Cohere trial key gives you access to all of these. No credit card, no time limit. The only constraint is per-minute rate limiting, which is fine for prototyping, side projects, and small production workloads.

What’s Free on Cohere

Cohere has two key types: Trial keys (free) and Production keys (paid). Trial keys never expire — they’re rate-limited but otherwise unrestricted.

| Endpoint | Trial Rate Limit | Production Rate Limit |
|---|---|---|
| Chat (Command R/R+) | 20 calls/min | 500 calls/min |
| Embed | 100 calls/min | 2,000 calls/min |
| Rerank | 10 calls/min | 1,000 calls/min |
| Classify | 100 calls/min | 1,000 calls/min |
| Summarize | 5 calls/min | 500 calls/min |

Notice the Embed limit: 100 calls per minute with up to 96 documents per call. That’s effectively 9,600 embeddings per minute on the free tier — more than enough to index a personal knowledge base or a small document corpus from scratch in a few minutes.
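Since Embed accepts at most 96 texts per call, bulk indexing reduces to a batching loop. A minimal sketch (the `batched` and `embed_corpus` helpers are our own naming, not part of the SDK; `co` is a `cohere.ClientV2` client):

```python
def batched(items: list, size: int = 96) -> list[list]:
    """Split a list into chunks of at most `size` items (Embed's per-call cap)."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def embed_corpus(co, texts: list[str]) -> list[list[float]]:
    """Embed a large corpus with a cohere.ClientV2 client, 96 texts per call."""
    vectors = []
    for batch in batched(texts, 96):
        resp = co.embed(
            texts=batch,
            model="embed-english-v3.0",
            input_type="search_document",
            embedding_types=["float"],
        )
        vectors.extend(resp.embeddings.float)
    return vectors
```

At 100 calls per minute, you would only need to slow this loop down once a corpus exceeds roughly 9,600 documents in a single minute.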

Note: Trial keys are not for production traffic, but they are for real development. Cohere’s documentation explicitly encourages building and testing on trial keys before upgrading.
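When you do bump into the per-minute limits, the standard fix is retry with exponential backoff. A generic sketch, assuming the SDK surfaces the HTTP 429 status in the exception text (check your SDK version's actual error class before relying on this):

```python
import time
import random

def with_backoff(call, max_retries: int = 5):
    """Run `call()`, retrying with exponential backoff on rate-limit errors.

    Assumption: the raised exception's message contains "429" when the
    API rate-limits a trial key; adapt the check to your SDK's error type.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            if "429" not in str(exc) or attempt == max_retries - 1:
                raise
            # 1s, 2s, 4s, ... plus jitter so parallel workers don't sync up
            time.sleep(2 ** attempt + random.random())
```

Wrap any SDK call: `with_backoff(lambda: co.embed(texts=batch, ...))`.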

How to Get Your Free API Key

  1. Go to dashboard.cohere.com/welcome/register and sign up with email or Google
  2. Verify your email address
  3. From the dashboard, navigate to API Keys in the left sidebar
  4. Your default Trial key is already there — copy it
  5. Set it as an environment variable: export COHERE_API_KEY="your_key_here"

No credit card. No phone number. Two minutes from signup to your first embedding.

Python Quickstart: Your First Embedding

Install the official Cohere Python SDK:

pip install cohere

Embedding three documents:

import os
import cohere

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])

response = co.embed(
    texts=[
        "Cohere makes the best free embedding API for RAG.",
        "OpenClaw is an AI agent platform for orchestrating tools.",
        "Toronto is the headquarters of Cohere."
    ],
    model="embed-english-v3.0",
    input_type="search_document",
    embedding_types=["float"]
)

print(f"Got {len(response.embeddings.float)} embeddings")
print(f"Each embedding is {len(response.embeddings.float[0])} dimensions")

That returns three 1024-dimensional vectors you can drop into any vector database — Pinecone, Weaviate, Chroma, Qdrant, pgvector, or just a NumPy array.

The input_type parameter is important: Cohere’s embeddings are asymmetric. Use "search_document" when indexing your corpus, and "search_query" when embedding the user’s question. Treating them differently gives noticeably better retrieval quality than symmetric embedding APIs.
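The asymmetry is easy to exercise: embed documents with `search_document`, the question with `search_query`, then compare with cosine similarity. The `cosine` helper below is plain Python; the commented lines sketch the two Embed calls (they need a live `ClientV2` and an API key):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# With a live client `co` (cohere.ClientV2):
# doc_vecs = co.embed(texts=docs, model="embed-english-v3.0",
#                     input_type="search_document",
#                     embedding_types=["float"]).embeddings.float
# q_vec = co.embed(texts=[question], model="embed-english-v3.0",
#                  input_type="search_query",
#                  embedding_types=["float"]).embeddings.float[0]
# best_doc = max(doc_vecs, key=lambda v: cosine(v, q_vec))
```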

Embedding Models You Get for Free

| Model ID | Dimensions | Languages | Best For |
|---|---|---|---|
| embed-english-v3.0 | 1024 | English | Highest quality English search and RAG |
| embed-multilingual-v3.0 | 1024 | 100+ | Multilingual search, cross-language RAG |
| embed-english-light-v3.0 | 384 | English | Smaller index, faster queries, low storage |
| embed-multilingual-light-v3.0 | 384 | 100+ | Multilingual on a budget |

For most RAG projects, embed-english-v3.0 at 1024 dimensions is the sweet spot. If you’re storing millions of vectors and storage cost matters, the light variants drop to 384 dimensions — about 60% smaller indexes — with only a small quality drop.
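The storage math is easy to check for yourself. With float32 vectors (4 bytes per dimension) and ignoring database overhead:

```python
def index_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw float32 storage for an embedding index (no DB overhead)."""
    return num_vectors * dims * bytes_per_value

million = 1_000_000
full = index_bytes(million, 1024)   # embed-english-v3.0
light = index_bytes(million, 384)   # embed-english-light-v3.0
print(f"1M vectors at 1024 dims: {full / 1e9:.2f} GB")   # → 4.10 GB
print(f"1M vectors at 384 dims:  {light / 1e9:.2f} GB")  # → 1.54 GB
print(f"Savings: {1 - light / full:.1%}")                # → 62.5%
```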

Cohere Rerank: The Secret Weapon for RAG Quality

Here is where Cohere genuinely leads: Rerank. After your vector database returns the top 50 or 100 candidate documents, you pass them to Rerank along with the user’s query. Rerank scores each document for actual relevance and reorders them. The top 5 reranked results are almost always dramatically better than the top 5 from raw vector similarity.

import os
import cohere

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])

query = "How do I add a free embedding API to my chatbot?"

documents = [
    "Cohere offers free embedding API access through trial keys.",
    "Pinecone is a managed vector database service.",
    "OpenAI embeddings cost $0.02 per million tokens.",
    "Use embed-english-v3.0 for the best quality English embeddings.",
    "Vector databases store high-dimensional vectors for similarity search."
]

response = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=documents,
    top_n=3
)

for result in response.results:
    print(f"Score: {result.relevance_score:.4f}  |  {documents[result.index]}")

That returns the three documents most relevant to the query, with relevance scores between 0 and 1. In production RAG systems, adding a Rerank step typically boosts answer quality by 15–30% over vector-similarity-only retrieval — which is why it’s the most-deployed neural reranker in commercial RAG stacks.

And it’s free on the trial key: 10 calls per minute, with up to 1,000 documents per call.
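Because the scores land on a 0-to-1 scale, you can also use Rerank as a relevance gate: drop retrieved chunks below a threshold instead of always forwarding `top_n` documents to the model. A small sketch (the 0.5 cutoff is an arbitrary starting point to tune, not a Cohere recommendation):

```python
def filter_by_relevance(results, documents, threshold: float = 0.5):
    """Keep only reranked hits at or above a relevance threshold.

    `results` is the `.results` list returned by co.rerank(); each item
    carries `.index` (position in `documents`) and `.relevance_score`.
    """
    return [
        (r.relevance_score, documents[r.index])
        for r in results
        if r.relevance_score >= threshold
    ]
```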

Chat with Command R+: Built for RAG

Cohere’s Command R+ chat model is purpose-built for RAG. Unlike most chat APIs where you stuff retrieved documents into the system prompt, Cohere’s chat endpoint accepts a structured documents parameter — and the model returns inline citations pointing to which documents each claim came from.

import os
import cohere

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])

response = co.chat(
    model="command-r-plus",
    messages=[
        {"role": "user", "content": "Which Cohere embedding model should I use for English RAG?"}
    ],
    documents=[
        {"data": {"text": "embed-english-v3.0 produces 1024-dimensional embeddings and leads MTEB English benchmarks."}},
        {"data": {"text": "embed-english-light-v3.0 produces 384-dimensional embeddings, optimized for low storage cost."}},
        {"data": {"text": "embed-multilingual-v3.0 supports over 100 languages."}}
    ]
)

print(response.message.content[0].text)
print()
print("Citations:")
for citation in response.message.citations or []:
    print(f"  - '{citation.text}' from sources: {[s.id for s in citation.sources]}")

The model produces a grounded answer that cites which document each fact came from. For RAG applications where users need to verify the source of every claim — legal, medical, internal knowledge bases — this is significantly more useful than free-text generation.

Free Chat Models on Cohere

| Model ID | Size | Context Window | Best For |
|---|---|---|---|
| command-r-plus | 104B | 128k tokens | Best quality, complex RAG, tool use |
| command-r | 35B | 128k tokens | Faster RAG, cheaper-when-paid baseline |
| command-r7b | 7B | 128k tokens | Fastest responses, simple Q&A |

All three are available through your free trial key at the same 20-calls-per-minute rate limit. command-r-plus is the headline model — it scores comparably to GPT-4o on RAG benchmarks while being explicitly trained to follow document citations.

End-to-End RAG Pipeline (All Free)

Here’s a complete RAG pipeline using only Cohere’s free trial key — embed, store, retrieve, rerank, and answer:

import os
import numpy as np
import cohere

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])

# 1. Your knowledge base
documents = [
    "OpenClaw is an AI agent platform for orchestrating multiple AI APIs and tools.",
    "Cohere Embed v3 produces 1024-dimensional vectors optimized for retrieval.",
    "Cohere Rerank v3 reorders candidate documents by true relevance to the query.",
    "Command R+ is a 104B model trained specifically for RAG with citations.",
    "Free trial keys on Cohere have no time limit — only per-minute rate limits.",
]

# 2. Index documents
doc_embeds = co.embed(
    texts=documents,
    model="embed-english-v3.0",
    input_type="search_document",
    embedding_types=["float"]
).embeddings.float
doc_matrix = np.array(doc_embeds)

# 3. Embed the query
query = "How do I get free access to Cohere's RAG models?"
query_embed = np.array(co.embed(
    texts=[query],
    model="embed-english-v3.0",
    input_type="search_query",
    embedding_types=["float"]
).embeddings.float[0])

# 4. Vector similarity — get top 3 candidates (cosine similarity via
#    normalized dot product, since the raw vectors aren't guaranteed unit-norm)
doc_norms = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
scores = doc_norms @ (query_embed / np.linalg.norm(query_embed))
top_indices = np.argsort(scores)[-3:][::-1]
candidates = [documents[i] for i in top_indices]

# 5. Rerank to get best 2
reranked = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=candidates,
    top_n=2
)
top_docs = [candidates[r.index] for r in reranked.results]

# 6. Answer with Command R+ using grounded citations
answer = co.chat(
    model="command-r-plus",
    messages=[{"role": "user", "content": query}],
    documents=[{"data": {"text": d}} for d in top_docs]
)

print(answer.message.content[0].text)

That’s a full production-shape RAG pipeline — embed, retrieve, rerank, generate with citations — running on a free trial key with zero credit card on file.

JavaScript / Node.js Example

npm install cohere-ai

import { CohereClientV2 } from "cohere-ai";

const co = new CohereClientV2({ token: process.env.COHERE_API_KEY });

const response = await co.embed({
  texts: [
    "Cohere is the best free embedding API for RAG.",
    "Toronto is the headquarters of Cohere."
  ],
  model: "embed-english-v3.0",
  inputType: "search_document",
  embeddingTypes: ["float"]
});

console.log(`Got ${response.embeddings.float.length} embeddings`);

Cohere vs Other Free Embedding Options

| Provider | Free Embedding Model | Dimensions | Multilingual | Reranker? |
|---|---|---|---|---|
| Cohere | embed-english-v3.0 / multilingual-v3.0 | 1024 / 384 | 100+ languages | Yes (Rerank v3) |
| Google Gemini | text-embedding-004 | 768 | Limited | No |
| Mistral AI | mistral-embed | 1024 | Limited | No |
| Cloudflare Workers AI | bge-base-en-v1.5 | 768 | English only | No |
| Hugging Face Inference | BGE / E5 family | varies | Some multilingual | No (manual setup) |
| OpenAI (paid only) | text-embedding-3-large | 3072 | Strong multilingual | No |

Where Cohere wins on the free tier: the only provider on this list that ships a hosted neural reranker. For RAG quality, that single feature usually matters more than which embedding model you started with. Combined with asymmetric embeddings (separate search_query and search_document modes), Cohere’s free tier is a credible foundation for real retrieval applications — not just a demo toy.

Use Cohere with OpenClaw

OpenClaw is an AI agent platform that orchestrates multiple APIs and tools into automated workflows. Cohere fits well as the retrieval and grounding layer inside OpenClaw agents — the part that searches your private documents before the agent acts.

A common pattern: an OpenClaw agent receives a user task (“draft a reply to this customer ticket”), uses Cohere Embed + Rerank to pull the three most relevant past tickets and policies from your knowledge base, then passes those documents to Command R+ to generate a cited reply. Because Cohere returns explicit citations, the agent can attach source links to the draft for human review.

import os
import cohere

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])

def retrieve_and_answer(question: str, knowledge_base: list[str]) -> dict:
    """A retrieval-then-answer step for use inside an OpenClaw agent."""
    # Rerank handles both retrieval and ranking in one call
    reranked = co.rerank(
        model="rerank-english-v3.0",
        query=question,
        documents=knowledge_base,
        top_n=3
    )
    top_docs = [knowledge_base[r.index] for r in reranked.results]

    answer = co.chat(
        model="command-r-plus",
        messages=[{"role": "user", "content": question}],
        documents=[{"data": {"text": d}} for d in top_docs]
    )

    return {
        "answer": answer.message.content[0].text,
        "sources": top_docs,
        "citations": answer.message.citations or []
    }

# Example use inside an agent step
result = retrieve_and_answer(
    question="What is our refund policy for digital downloads?",
    knowledge_base=load_company_kb()  # your own loader
)
print(result["answer"])

Notice: when you only have a few hundred candidate documents, you can skip the embedding/vector-DB step entirely and just pass everything to Rerank. The free trial key allows up to 1,000 documents per Rerank call, which covers a surprising number of small-to-medium knowledge bases.

Cohere Pricing (When You Need More)

| Model | Price | Unit |
|---|---|---|
| Command R+ | $2.50 input / $10.00 output | per 1M tokens |
| Command R | $0.15 input / $0.60 output | per 1M tokens |
| Command R7B | $0.0375 input / $0.15 output | per 1M tokens |
| Embed v3 (English / Multilingual) | $0.10 | per 1M tokens |
| Rerank v3 | $2.00 | per 1,000 searches |

When you graduate from a Trial key to a Production key, Command R7B at $0.15 per million output tokens is one of the cheapest production-grade models available. Embed v3 at $0.10 per million tokens is competitive with or cheaper than every comparable hosted embedding API.
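For budgeting, the pricing table turns into one line of arithmetic. A sketch using the listed Embed and Rerank prices (your token counts and search volume are the variables here):

```python
EMBED_USD_PER_M_TOKENS = 0.10     # Embed v3
RERANK_USD_PER_K_SEARCHES = 2.00  # Rerank v3

def monthly_cost(corpus_tokens: int, searches: int) -> float:
    """Rough monthly bill: embed the corpus once, rerank every search."""
    embed = corpus_tokens / 1_000_000 * EMBED_USD_PER_M_TOKENS
    rerank = searches / 1_000 * RERANK_USD_PER_K_SEARCHES
    return embed + rerank

# A 10M-token corpus plus 50,000 searches per month:
print(f"${monthly_cost(10_000_000, 50_000):.2f}")  # → $101.00
```

Note how retrieval volume, not corpus size, dominates the bill: reranking is the line item to watch as traffic grows.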

When to Use Cohere

Cohere is the right choice when:

  • You’re building a RAG application and want the best free embeddings + reranker combo
  • You need multilingual retrieval across 100+ languages without changing models
  • Your application requires grounded citations (legal, medical, internal knowledge bases)
  • You want asymmetric embeddings (separate query and document modes) for better search quality
  • You’re prototyping retrieval pipelines and want generous free per-minute limits

Consider alternatives when:

  • You need raw chat throughput more than retrieval quality — use Groq or Cerebras for speed, Gemini Flash for free quota
  • You want OpenAI SDK drop-in compatibility — use Mistral AI or DeepSeek
  • You need image, audio, or multimodal generation — Cohere is text-only
  • You’re building a pure chatbot with no retrieval — Command R+ works, but the model isn’t priced or designed around that use case

Final Verdict

Cohere is the most underrated free AI API for one specific reason: it’s the only provider that ships a complete RAG stack — embeddings, reranker, and a chat model trained for grounded citations — all behind a single free trial key. Most “free AI API” articles skip Cohere because they only compare chat models, where Cohere is fine but not best-in-class. That misses the point of what the company actually built.

If your project involves search over your own documents, internal knowledge bases, customer tickets, product catalogs, or anything resembling RAG, Cohere’s free tier covers more of the pipeline than any other single provider. Sign up at dashboard.cohere.com, copy your trial key, and your first reranked retrieval is about ten minutes away.

