DEV Community

toolfreebie

Posted on • Originally published at toolfreebie.com

Cohere Free API: The Best Free Embedding and Rerank API for RAG in 2026

What Is Cohere?

Cohere is a Toronto-based AI company founded in 2019 by Aidan Gomez (one of the original authors of the “Attention Is All You Need” Transformer paper), Nick Frosst, and Ivan Zhang. Unlike OpenAI or Anthropic, Cohere built its platform from day one around a specific use case: enterprise retrieval and RAG (Retrieval-Augmented Generation).

That focus shows up in three places where Cohere genuinely leads the field — and where most developers don’t realize they can get it for free:

  • Embed v3 — text embeddings that consistently rank near the top of the MTEB benchmark, in both English and 100+ other languages
  • Rerank v3 — the most-deployed neural reranker in production RAG systems, available via a single API call
  • Command R / R+ — chat models specifically trained for RAG, tool use, and grounded citations

And the part most developers miss: a free Cohere trial key gives you access to all of these. No credit card, no time limit. The only constraint is per-minute rate limiting, which is fine for prototyping, side projects, and small production workloads.

What’s Free on Cohere

Cohere has two key types: Trial keys (free) and Production keys (paid). Trial keys never expire — they’re rate-limited but otherwise unrestricted.

| Endpoint | Trial Rate Limit | Production Rate Limit |
|---|---|---|
| Chat (Command R/R+) | 20 calls/min | 500 calls/min |
| Embed | 100 calls/min | 2,000 calls/min |
| Rerank | 10 calls/min | 1,000 calls/min |
| Classify | 100 calls/min | 1,000 calls/min |
| Summarize | 5 calls/min | 500 calls/min |

Notice the Embed limit: 100 calls per minute with up to 96 documents per call. That’s effectively 9,600 embeddings per minute on the free tier — more than enough to index a personal knowledge base or a small document corpus from scratch in a few minutes.
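Since Embed accepts at most 96 texts per call, bulk indexing reduces to a batching loop. A minimal sketch (the `batched` and `embed_corpus` helpers are our own naming, not part of the SDK; `co` is a `cohere.ClientV2` client):

```python
def batched(items: list, size: int = 96) -> list[list]:
    """Split a list into chunks of at most `size` items (Embed's per-call cap)."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def embed_corpus(co, texts: list[str]) -> list[list[float]]:
    """Embed a large corpus with a cohere.ClientV2 client, 96 texts per call."""
    vectors = []
    for batch in batched(texts, 96):
        resp = co.embed(
            texts=batch,
            model="embed-english-v3.0",
            input_type="search_document",
            embedding_types=["float"],
        )
        vectors.extend(resp.embeddings.float)
    return vectors
```

At 100 calls per minute, you would only need to slow this loop down once a corpus exceeds roughly 9,600 documents in a single minute.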

Note: Trial keys are not for production traffic, but they are for real development. Cohere’s documentation explicitly encourages building and testing on trial keys before upgrading.
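When you do bump into the per-minute limits, the standard fix is retry with exponential backoff. A generic sketch, assuming the SDK surfaces the HTTP 429 status in the exception text (check your SDK version's actual error class before relying on this):

```python
import time
import random

def with_backoff(call, max_retries: int = 5):
    """Run `call()`, retrying with exponential backoff on rate-limit errors.

    Assumption: the raised exception's message contains "429" when the
    API rate-limits a trial key; adapt the check to your SDK's error type.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            if "429" not in str(exc) or attempt == max_retries - 1:
                raise
            # 1s, 2s, 4s, ... plus jitter so parallel workers don't sync up
            time.sleep(2 ** attempt + random.random())
```

Wrap any SDK call: `with_backoff(lambda: co.embed(texts=batch, ...))`.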

How to Get Your Free API Key

  1. Go to dashboard.cohere.com/welcome/register and sign up with email or Google
  2. Verify your email address
  3. From the dashboard, navigate to API Keys in the left sidebar
  4. Your default Trial key is already there — copy it
  5. Set it as an environment variable: export COHERE_API_KEY="your_key_here"

No credit card. No phone number. Two minutes from signup to your first embedding.

Python Quickstart: Your First Embedding

Install the official Cohere Python SDK:

pip install cohere

Embedding three documents:

import os
import cohere

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])

response = co.embed(
    texts=[
        "Cohere makes the best free embedding API for RAG.",
        "OpenClaw is an AI agent platform for orchestrating tools.",
        "Toronto is the headquarters of Cohere."
    ],
    model="embed-english-v3.0",
    input_type="search_document",
    embedding_types=["float"]
)

print(f"Got {len(response.embeddings.float)} embeddings")
print(f"Each embedding is {len(response.embeddings.float[0])} dimensions")

That returns three 1024-dimensional vectors you can drop into any vector database — Pinecone, Weaviate, Chroma, Qdrant, pgvector, or just a NumPy array.

The input_type parameter is important: Cohere’s embeddings are asymmetric. Use "search_document" when indexing your corpus, and "search_query" when embedding the user’s question. Treating them differently gives noticeably better retrieval quality than symmetric embedding APIs.
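The asymmetry is easy to exercise: embed documents with `search_document`, the question with `search_query`, then compare with cosine similarity. The `cosine` helper below is plain Python; the commented lines sketch the two Embed calls (they need a live `ClientV2` and an API key):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# With a live client `co` (cohere.ClientV2):
# doc_vecs = co.embed(texts=docs, model="embed-english-v3.0",
#                     input_type="search_document",
#                     embedding_types=["float"]).embeddings.float
# q_vec = co.embed(texts=[question], model="embed-english-v3.0",
#                  input_type="search_query",
#                  embedding_types=["float"]).embeddings.float[0]
# best_doc = max(doc_vecs, key=lambda v: cosine(v, q_vec))
```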

Embedding Models You Get for Free

| Model ID | Dimensions | Languages | Best For |
|---|---|---|---|
| embed-english-v3.0 | 1024 | English | Highest quality English search and RAG |
| embed-multilingual-v3.0 | 1024 | 100+ | Multilingual search, cross-language RAG |
| embed-english-light-v3.0 | 384 | English | Smaller index, faster queries, low storage |
| embed-multilingual-light-v3.0 | 384 | 100+ | Multilingual on a budget |

For most RAG projects, embed-english-v3.0 at 1024 dimensions is the sweet spot. If you’re storing millions of vectors and storage cost matters, the light variants drop to 384 dimensions — about 60% smaller indexes — with only a small quality drop.
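The storage math is easy to check for yourself. With float32 vectors (4 bytes per dimension) and ignoring database overhead:

```python
def index_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw float32 storage for an embedding index (no DB overhead)."""
    return num_vectors * dims * bytes_per_value

million = 1_000_000
full = index_bytes(million, 1024)   # embed-english-v3.0
light = index_bytes(million, 384)   # embed-english-light-v3.0
print(f"1M vectors at 1024 dims: {full / 1e9:.2f} GB")   # → 4.10 GB
print(f"1M vectors at 384 dims:  {light / 1e9:.2f} GB")  # → 1.54 GB
print(f"Savings: {1 - light / full:.1%}")                # → 62.5%
```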

Cohere Rerank: The Secret Weapon for RAG Quality

Here is where Cohere genuinely leads: Rerank. After your vector database returns the top 50 or 100 candidate documents, you pass them to Rerank along with the user’s query. Rerank scores each document for actual relevance and reorders them. The top 5 reranked results are almost always dramatically better than the top 5 from raw vector similarity.

import os
import cohere

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])

query = "How do I add a free embedding API to my chatbot?"

documents = [
    "Cohere offers free embedding API access through trial keys.",
    "Pinecone is a managed vector database service.",
    "OpenAI embeddings cost $0.02 per million tokens.",
    "Use embed-english-v3.0 for the best quality English embeddings.",
    "Vector databases store high-dimensional vectors for similarity search."
]

response = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=documents,
    top_n=3
)

for result in response.results:
    print(f"Score: {result.relevance_score:.4f}  |  {documents[result.index]}")

That returns the three documents most relevant to the query, with relevance scores between 0 and 1. In production RAG systems, adding a Rerank step typically boosts answer quality by 15–30% over vector-similarity-only retrieval — which is why it’s the most-deployed neural reranker in commercial RAG stacks.

And it’s free on the trial key: 10 calls per minute, with up to 1,000 documents per call.
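Because the scores land on a 0-to-1 scale, you can also use Rerank as a relevance gate: drop retrieved chunks below a threshold instead of always forwarding `top_n` documents to the model. A small sketch (the 0.5 cutoff is an arbitrary starting point to tune, not a Cohere recommendation):

```python
def filter_by_relevance(results, documents, threshold: float = 0.5):
    """Keep only reranked hits at or above a relevance threshold.

    `results` is the `.results` list returned by co.rerank(); each item
    carries `.index` (position in `documents`) and `.relevance_score`.
    """
    return [
        (r.relevance_score, documents[r.index])
        for r in results
        if r.relevance_score >= threshold
    ]
```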

Chat with Command R+: Built for RAG

Cohere’s Command R+ chat model is purpose-built for RAG. Unlike most chat APIs where you stuff retrieved documents into the system prompt, Cohere’s chat endpoint accepts a structured documents parameter — and the model returns inline citations pointing to which documents each claim came from.

import os
import cohere

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])

response = co.chat(
    model="command-r-plus",
    messages=[
        {"role": "user", "content": "Which Cohere embedding model should I use for English RAG?"}
    ],
    documents=[
        {"data": {"text": "embed-english-v3.0 produces 1024-dimensional embeddings and leads MTEB English benchmarks."}},
        {"data": {"text": "embed-english-light-v3.0 produces 384-dimensional embeddings, optimized for low storage cost."}},
        {"data": {"text": "embed-multilingual-v3.0 supports over 100 languages."}}
    ]
)

print(response.message.content[0].text)
print()
print("Citations:")
for citation in response.message.citations or []:
    print(f"  - '{citation.text}' from sources: {[s.id for s in citation.sources]}")

The model produces a grounded answer that cites which document each fact came from. For RAG applications where users need to verify the source of every claim — legal, medical, internal knowledge bases — this is significantly more useful than free-text generation.

Free Chat Models on Cohere

| Model ID | Size | Context Window | Best For |
|---|---|---|---|
| command-r-plus | 104B | 128k tokens | Best quality, complex RAG, tool use |
| command-r | 35B | 128k tokens | Faster RAG, cheaper-when-paid baseline |
| command-r7b | 7B | 128k tokens | Fastest responses, simple Q&A |

All three are available through your free trial key at the same 20-calls-per-minute rate limit. command-r-plus is the headline model — it scores comparably to GPT-4o on RAG benchmarks while being explicitly trained to follow document citations.

End-to-End RAG Pipeline (All Free)

Here’s a complete RAG pipeline using only Cohere’s free trial key — embed, store, retrieve, rerank, and answer:

import os
import numpy as np
import cohere

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])

# 1. Your knowledge base
documents = [
    "OpenClaw is an AI agent platform for orchestrating multiple AI APIs and tools.",
    "Cohere Embed v3 produces 1024-dimensional vectors optimized for retrieval.",
    "Cohere Rerank v3 reorders candidate documents by true relevance to the query.",
    "Command R+ is a 104B model trained specifically for RAG with citations.",
    "Free trial keys on Cohere have no time limit — only per-minute rate limits.",
]

# 2. Index documents
doc_embeds = co.embed(
    texts=documents,
    model="embed-english-v3.0",
    input_type="search_document",
    embedding_types=["float"]
).embeddings.float
doc_matrix = np.array(doc_embeds)

# 3. Embed the query
query = "How do I get free access to Cohere's RAG models?"
query_embed = np.array(co.embed(
    texts=[query],
    model="embed-english-v3.0",
    input_type="search_query",
    embedding_types=["float"]
).embeddings.float[0])

# 4. Vector similarity — get top 3 candidates (cosine similarity via
#    normalized dot product, since the raw vectors aren't guaranteed unit-norm)
doc_norms = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
scores = doc_norms @ (query_embed / np.linalg.norm(query_embed))
top_indices = np.argsort(scores)[-3:][::-1]
candidates = [documents[i] for i in top_indices]

# 5. Rerank to get best 2
reranked = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=candidates,
    top_n=2
)
top_docs = [candidates[r.index] for r in reranked.results]

# 6. Answer with Command R+ using grounded citations
answer = co.chat(
    model="command-r-plus",
    messages=[{"role": "user", "content": query}],
    documents=[{"data": {"text": d}} for d in top_docs]
)

print(answer.message.content[0].text)

That’s a full production-shape RAG pipeline — embed, retrieve, rerank, generate with citations — running on a free trial key with zero credit card on file.

JavaScript / Node.js Example

npm install cohere-ai

import { CohereClientV2 } from "cohere-ai";

const co = new CohereClientV2({ token: process.env.COHERE_API_KEY });

const response = await co.embed({
  texts: [
    "Cohere is the best free embedding API for RAG.",
    "Toronto is the headquarters of Cohere."
  ],
  model: "embed-english-v3.0",
  inputType: "search_document",
  embeddingTypes: ["float"]
});

console.log(`Got ${response.embeddings.float.length} embeddings`);

Cohere vs Other Free Embedding Options

| Provider | Free Embedding Model | Dimensions | Multilingual | Reranker? |
|---|---|---|---|---|
| Cohere | embed-english-v3.0 / multilingual-v3.0 | 1024 / 384 | 100+ languages | Yes (Rerank v3) |
| Google Gemini | text-embedding-004 | 768 | Limited | No |
| Mistral AI | mistral-embed | 1024 | Limited | No |
| Cloudflare Workers AI | bge-base-en-v1.5 | 768 | English only | No |
| Hugging Face Inference | BGE / E5 family | varies | Some multilingual | No (manual setup) |
| OpenAI (paid only) | text-embedding-3-large | 3072 | Strong multilingual | No |

Where Cohere wins on the free tier: the only provider on this list that ships a hosted neural reranker. For RAG quality, that single feature usually matters more than which embedding model you started with. Combined with asymmetric embeddings (separate search_query and search_document modes), Cohere’s free tier is a credible foundation for real retrieval applications — not just a demo toy.

Use Cohere with OpenClaw

OpenClaw is an AI agent platform that orchestrates multiple APIs and tools into automated workflows. Cohere fits well as the retrieval and grounding layer inside OpenClaw agents — the part that searches your private documents before the agent acts.

A common pattern: an OpenClaw agent receives a user task (“draft a reply to this customer ticket”), uses Cohere Embed + Rerank to pull the three most relevant past tickets and policies from your knowledge base, then passes those documents to Command R+ to generate a cited reply. Because Cohere returns explicit citations, the agent can attach source links to the draft for human review.

import os
import cohere

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])

def retrieve_and_answer(question: str, knowledge_base: list[str]) -> dict:
    """A retrieval-then-answer step for use inside an OpenClaw agent."""
    # Rerank handles both retrieval and ranking in one call
    reranked = co.rerank(
        model="rerank-english-v3.0",
        query=question,
        documents=knowledge_base,
        top_n=3
    )
    top_docs = [knowledge_base[r.index] for r in reranked.results]

    answer = co.chat(
        model="command-r-plus",
        messages=[{"role": "user", "content": question}],
        documents=[{"data": {"text": d}} for d in top_docs]
    )

    return {
        "answer": answer.message.content[0].text,
        "sources": top_docs,
        "citations": answer.message.citations or []
    }

# Example use inside an agent step
result = retrieve_and_answer(
    question="What is our refund policy for digital downloads?",
    knowledge_base=load_company_kb()  # your own loader
)
print(result["answer"])

Notice: when you only have a few hundred candidate documents, you can skip the embedding/vector-DB step entirely and just pass everything to Rerank. The free trial key allows up to 1,000 documents per Rerank call, which covers a surprising number of small-to-medium knowledge bases.

Cohere Pricing (When You Need More)

| Model | Price | Unit |
|---|---|---|
| Command R+ | $2.50 input / $10.00 output | per 1M tokens |
| Command R | $0.15 input / $0.60 output | per 1M tokens |
| Command R7B | $0.0375 input / $0.15 output | per 1M tokens |
| Embed v3 (English / Multilingual) | $0.10 | per 1M tokens |
| Rerank v3 | $2.00 | per 1,000 searches |

When you graduate from a Trial key to a Production key, Command R7B at $0.15 per million output tokens is one of the cheapest production-grade models available. Embed v3 at $0.10 per million tokens is competitive with or cheaper than every comparable hosted embedding API.
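For budgeting, the pricing table turns into one line of arithmetic. A sketch using the listed Embed and Rerank prices (your token counts and search volume are the variables here):

```python
EMBED_USD_PER_M_TOKENS = 0.10     # Embed v3
RERANK_USD_PER_K_SEARCHES = 2.00  # Rerank v3

def monthly_cost(corpus_tokens: int, searches: int) -> float:
    """Rough monthly bill: embed the corpus once, rerank every search."""
    embed = corpus_tokens / 1_000_000 * EMBED_USD_PER_M_TOKENS
    rerank = searches / 1_000 * RERANK_USD_PER_K_SEARCHES
    return embed + rerank

# A 10M-token corpus plus 50,000 searches per month:
print(f"${monthly_cost(10_000_000, 50_000):.2f}")  # → $101.00
```

Note how retrieval volume, not corpus size, dominates the bill: reranking is the line item to watch as traffic grows.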

When to Use Cohere

Cohere is the right choice when:

  • You’re building a RAG application and want the best free embeddings + reranker combo
  • You need multilingual retrieval across 100+ languages without changing models
  • Your application requires grounded citations (legal, medical, internal knowledge bases)
  • You want asymmetric embeddings (separate query and document modes) for better search quality
  • You’re prototyping retrieval pipelines and want generous free per-minute limits

Consider alternatives when:

  • You need raw chat throughput more than retrieval quality — use Groq or Cerebras for speed, Gemini Flash for free quota
  • You want OpenAI SDK drop-in compatibility — use Mistral AI or DeepSeek
  • You need image, audio, or multimodal generation — Cohere is text-only
  • You’re building a pure chatbot with no retrieval — Command R+ works, but the model isn’t priced or designed around that use case

Final Verdict

Cohere is the most underrated free AI API for one specific reason: it’s the only provider that ships a complete RAG stack — embeddings, reranker, and a chat model trained for grounded citations — all behind a single free trial key. Most “free AI API” articles skip Cohere because they only compare chat models, where Cohere is fine but not best-in-class. That misses the point of what the company actually built.

If your project involves search over your own documents, internal knowledge bases, customer tickets, product catalogs, or anything resembling RAG, Cohere’s free tier covers more of the pipeline than any other single provider. Sign up at dashboard.cohere.com, copy your trial key, and your first reranked retrieval is about ten minutes away.

