Search is the most fundamental thing an agent does. Before it writes, plans, acts, or decides — it looks something up. You'd expect this to be a solved problem.
It mostly is. But "mostly" is where agents fail.
We scored five search APIs on Rhumb's AN Score framework: 20 dimensions covering execution reliability, error quality, auth predictability, and access readiness. The spread is narrower than CRMs or databases — search APIs are simpler primitives — but the differences compound when your agent runs search loops overnight.
The Scores
| API | AN Score | Tier | Key Strength |
|---|---|---|---|
| Exa | 8.7 | L4 Native | Neural retrieval, structured output, agent-first design |
| Tavily | 8.6 | L4 Native | Purpose-built for agents, clean response schema |
| Serper | 8.0 | L4 Native | Google results, developer-friendly, predictable errors |
| Brave Search | 7.1 | L3 Ready | Independent index, privacy-first, solid but generic |
| Perplexity | 6.8 | L3 Ready | Returns synthesis, not raw results — changes the contract |
These aren't bad scores. Search APIs score higher than CRMs (HubSpot: 4.6, Salesforce: 4.8) because the primitives are simpler. But a 1.9-point gap between Exa and Perplexity is significant when it's your agent doing 200 searches per day unattended.
Exa — 8.7/10
Exa is designed around semantic search and structured retrieval. The key differentiator: it returns structured data about pages — not just URLs and snippets — which means agents don't have to scrape or parse.
What earns 8.7:
- Neural embedding queries return relevance scores alongside results
- `contents` parameter returns clean text/HTML extraction as part of the search response (no separate scraping call required)
- API keys are self-provisionable: no support contact, no form, no delay
- Rate limit headers always present (`x-ratelimit-limit`, `x-ratelimit-remaining`)
- Errors are structured JSON with a specific `error` field and meaningful messages
- SDK available in Python and TypeScript with typed responses
Where it falls short:
- Neural/semantic search can behave unpredictably on highly specific technical queries — traditional keyword search sometimes returns more reliable results for code-related lookups
- `highlights` field (extracted key passages) is powerful but not always accurate; agents should treat it as a hint, not ground truth
- Free tier is generous, but rate limits drop hard at quota: no graceful degradation, just 429s
The 3am test: If your agent runs a nightly research loop and hits Exa's rate limit, it gets a clean 429 with a `Retry-After` header. It can back off and resume. The structured `contents` field means it doesn't need a separate extraction call. This is what L4 looks like.
Tavily — 8.6/10
Tavily was explicitly built for AI agents. It's not a general-purpose search API that developers later adapted — it's search designed for the consumption patterns agents actually have.
What earns 8.6:
- Response schema includes `answer` (synthesized), `results` (structured), and `images` (extracted) in one call
- `search_depth` parameter (`basic` vs `advanced`) lets agents trade latency for result quality based on task urgency
- `include_raw_content` boolean adds full page text extraction to any search — the same Exa-style "search + extract" in one call
- Per-API-key quotas with clear overage behavior (returns 429 with documented retry behavior)
- Python SDK with async support baked in
Where it falls short:
- Underlying results rely on multiple search backends, so result ordering can shift between API versions
- No way to force a specific search engine or verify which backend was used for a given query — opacity matters when debugging agent hallucinations
- `answer` synthesis is often good but shouldn't be used as ground truth — agents should treat it as a starting point
The 3am test: Tavily's `search_depth: "advanced"` mode costs 2x the credits but returns notably better results for ambiguous queries. An agent that can't reason about which mode to use will either overspend on credits or return shallow results. The API gives you the tool — the agent needs to be smart about using it.
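What "being smart about it" looks like depends on the agent, but the shape of the decision is simple: short, ambiguous queries benefit from depth; tight deadlines and already-specific queries don't. Here's a toy heuristic, entirely our assumption and not from Tavily's docs, that picks a depth from query shape and time budget:

```python
def choose_search_depth(query: str, deadline_s: float) -> str:
    """Pick a Tavily search_depth value for a query.

    Hypothetical heuristic: pay for "advanced" only when the query is
    short/ambiguous and there is time to wait for the deeper search.
    """
    if deadline_s < 2.0:
        return "basic"  # latency-sensitive: don't pay for depth
    # Long queries, quoted phrases, and operators like site: are
    # already specific enough that basic depth usually suffices
    specific = len(query.split()) >= 6 or any(c in query for c in ('"', ':'))
    return "basic" if specific else "advanced"
```

The point isn't this particular rule; it's that the depth decision should live in agent logic, not be hardcoded to one value.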
Serper — 8.0/10
Serper wraps Google Search. That means a fresh index, familiar results, and the predictability that comes with a mature product built on top of the world's most-used search engine.
What earns 8.0:
- Returns structured JSON including `organic`, `knowledgeGraph`, `answerBox`, `relatedSearches` — agents can extract signal from rich results without scraping
- Separate endpoints for news, images, shopping, and scholar let agents target the right index for the task
- API keys are self-service: signup, verify email, get key immediately
- Error responses are consistent: HTTP status code + structured JSON error body
- Location parameter works reliably for geo-targeted queries
Where it falls short:
- Depends on Google — any Google change or scraping policy enforcement can affect results without notice
- No semantic/neural mode: purely keyword-based, which can miss conceptually relevant results that Exa finds
- Rate limit headers are present, but response time on the `scholar` endpoint is slower and less predictable than on `organic`
The 3am test: For agents that need current events, news, or high-confidence freshness, Serper consistently outperforms neural search options. The news endpoint with `tbs: qdr:d` (past-day filter) is genuinely useful for agents that need up-to-date information. The dependency on Google is a business risk more than a technical one.
Brave Search — 7.1/10
Brave runs an independent web index — not Google, not Bing. For agents with privacy requirements, regulatory constraints, or a need to avoid Google's results patterns, this matters.
What earns 7.1:
- Independent index means results genuinely differ from Google/Bing — useful for diversification or bias testing
- `extra_snippets` parameter returns additional context beyond the standard snippet
- API key provisioning is clean: signup, verify, get key
- Returns a `freshness` field with the query timestamp — useful for agents tracking information currency
Where it falls short:
- Index coverage is smaller than Google/Bing — some niche technical queries return sparse results
- Error responses lack specificity: 400 errors don't always explain what was malformed
- No semantic search mode — purely traditional search
- Rate limit documentation is less precise than Exa/Tavily; agents need more defensive retry logic
The 3am test: Brave is a good secondary search source for agents that want to cross-validate results or have compliance requirements against Google. As a primary search source for general research agents, the index coverage gap is real. The 7.1 reflects solid fundamentals with execution gaps on error specificity.
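The cross-validation role is cheap to implement: compare what your primary source returned against what a second index surfaces for the same query. A sketch that measures domain-level overlap (the threshold for "thin coverage" is whatever your agent decides; the comparison logic here is our own, not a Brave feature):

```python
from urllib.parse import urlparse


def overlap_ratio(primary: list[str], secondary: list[str]) -> float:
    """Fraction of the primary source's result domains that a
    secondary source (e.g. Brave) also surfaced. Low overlap can flag
    niche queries where an independent index has thin coverage, or
    where the primary source's ranking is skewed.
    """
    def domains(urls: list[str]) -> set[str]:
        return {urlparse(u).netloc.removeprefix("www.") for u in urls}

    p, s = domains(primary), domains(secondary)
    if not p:
        return 0.0
    return len(p & s) / len(p)
```

A research agent might only escalate to a human, or re-query with different terms, when the ratio drops below some floor it has learned to trust.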
Perplexity — 6.8/10
Perplexity is the most interesting entry in this comparison because it changes the fundamental contract. Other search APIs return results — documents, URLs, snippets. Perplexity returns a synthesized answer with citations.
This is powerful for some agent patterns and problematic for others.
What earns 6.8:
- Synthesis model is genuinely good — for "explain this topic" queries, the answer quality is high
- Citations are structured and machine-readable
- `search_recency_filter` controls how fresh the underlying sources are
- `model` parameter lets agents select speed vs quality tradeoffs
Where it falls short:
- Agents can't verify the synthesis: you get an answer, not raw results to reason over
- Synthesis can hallucinate — the answer looks authoritative but may contain errors the agent can't detect
- Rate limits are lower than Serper/Exa at equivalent price points
- Not suitable for agents that need raw web data to make their own inferences — the synthesis happens inside Perplexity, not inside your agent
- Error messages on malformed requests can be opaque
The 3am test: If your agent is a research assistant that writes summaries, Perplexity's synthesis is useful input. If your agent needs to reason over raw data, extract specific entities, or verify claims — Perplexity is the wrong tool. The 6.8 isn't a bad score; it reflects a tool that does a different job than the others.
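One way to make the synthesis contract slightly safer is a sanity check: verify that every citation the answer references actually resolves to a returned source. The sketch below assumes a bracketed `[n]` citation convention in the answer text — that convention is our assumption, so check it against the response format you actually receive:

```python
import re


def unresolved_citations(answer: str, citations: list[str]) -> list[int]:
    """Return citation indices referenced in the answer (as [n]) that
    have no matching entry in the 1-indexed citations list.

    A non-empty result means the synthesis cites sources it didn't
    return — a signal to distrust the answer rather than pass it on.
    """
    refs = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in refs if not (1 <= n <= len(citations)))
```

This doesn't catch hallucinated content backed by a real citation — nothing short of fetching and reading the source does — but it catches the cheapest failure mode for free.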
Which to Use When
Semantic/conceptual research: Exa (8.7). Neural embeddings find related content that keyword search misses. Best for research agents that explore topic clusters.
Agent-first search with synthesis: Tavily (8.6). Purpose-built for the patterns agents actually use. The `search_depth` parameter is a genuine quality lever.
Current events and Google-familiar results: Serper (8.0). Best freshness, most familiar results patterns, strong structured output from rich results.
Privacy/compliance or Google alternatives: Brave (7.1). Independent index, solid fundamentals, coverage gap on niche queries.
Synthesis over raw retrieval: Perplexity (6.8). Use when you want a synthesized starting point, not when you need raw data.
The Pattern
Search APIs cluster higher than most categories (all five score ≥ 6.8) because search is a relatively simple primitive: query in, results out. The design surface is smaller than CRMs or databases.
But the failure modes are subtle:
- Result instability — what ranked #1 yesterday may not today; agents that cache results need freshness awareness
- Rate limit handling — all five have limits; only the top three communicate them clearly enough for agent backoff
- Synthesis vs retrieval mismatch — Perplexity's 6.8 is partly about using synthesis when your agent needs raw retrieval
- Index coverage assumptions — agents trained on Google-scale expectations will underperform with smaller indexes
For most agent builders, Exa or Tavily is the right default. Both are L4 Native, both score above 8.5, and both return structured data that reduces downstream processing. Exa wins on semantic retrieval; Tavily wins on agent-specific ergonomics.
Rhumb scores 645+ APIs on 20 execution dimensions — execution reliability, error quality, auth predictability, rate limit transparency, and more. All scores at rhumb.dev.
Also in this series: LLM APIs for AI Agents · Payment APIs · Database APIs · CRM APIs