Daniel Romitelli

Posted on • Originally published at craftedbydaniel.com

Multi‑Agent Firecrawl Research: My Fallback Chain That Refuses to Pretend It Knows the Company

I didn't build my company research pipeline to be clever. I built it because enrichment is where systems quietly start lying. A recruiter asks for "Cedrus" and the internet gives you three Cedruses, two dead domains, and a LinkedIn vanity URL that doesn't match the legal name. If your pipeline collapses all of that into a single confident paragraph, you've created the worst kind of bug: one that looks correct.

So here's the multi-agent Firecrawl research flow I wired into this recruitment platform—specifically the part that uses Firecrawl for enrichment and falls back to Bing when Firecrawl fails or has low confidence.

```mermaid
flowchart TD
  inputQuery[Company query] --> nameResolver[CompanyNameResolver]
  nameResolver --> firecrawl[FirecrawlResearch]
  firecrawl -->|success| enrichment[FirecrawlEnricher]
  firecrawl -->|fails or low confidence| bing[BingSearchClient]
  bing --> enrichment
  enrichment --> tracer[ExtractionWorkflowTracer]
  tracer --> output[Enriched company profile]
```



That's the architecture I actually meant to build: **two evidence sources, one enrichment stage, and a tracer that makes the pipeline legible when it misbehaves**.

The design shows up explicitly in the codebase:

- `app/firecrawl_enricher.py` — Firecrawl enrichment
- `app/firecrawl_research.py` — Firecrawl research module
- `app/firecrawl_v2_adapter.py` and `app/firecrawl_v2_fire_agent.py` — a V2 adapter and agent wrapper
- `app/azure_integrations/bing_search.py` — **Bing Search API client with Redis caching** and a stated role: "Provides fallback enrichment when Firecrawl fails or has low confidence."
- `app/services/company_name_resolver.py` — handles the "vanity URL" problem where domains don't match legal company names
- `app/services/extraction_workflow_tracer.py` — step-by-step tracing for extraction workflows

## The key insight: separate "evidence gathering" from "identity decisions"

The naive approach is to treat enrichment as a single call:

1) fetch content from the web
2) summarize it
3) store it as "company profile"

That fails for two reasons:

- **The web is ambiguous.** Names collide, sites redirect, and "about" pages are marketing fiction.
- **A single step can't express partial failure.** If the fetch is weak, your summary becomes guesswork.

My fix was to treat research like a courtroom:

- Firecrawl is my **investigator**: gather documents, extract text.
- Bing is my **tip line**: when the investigator comes back empty-handed (or shaky), I ask a different source for leads.
- The company name resolver is my **clerk**: reconcile "what the user typed" with "what the entity actually is."
- The tracer is my **court reporter**: every step is recorded so I can debug and audit.

I'm not claiming a magical confidence model here—I haven't open-sourced the scoring internals. But the codebase makes the fallback intent explicit: Bing is used when Firecrawl "fails or has low confidence," and the system has dedicated modules for Firecrawl research plus a resolver for name mismatches.

## How it works under the hood (the chain, not the myth)

### Step 1: normalize the company identity early

In this codebase, company identity is explicitly called out as a problem:

- `app/services/company_name_resolver.py` is described as: "Smart Company Name Extraction" and "Handles the vanity URL problem where domain names don't match legal company names."

That tells you something important about the system's philosophy: I don't want downstream steps to guess what entity we're talking about.

What surprised me when I first built flows like this is how often the *identity* step is the actual bottleneck. Not performance—correctness. If you start research with the wrong entity, every downstream step can be perfect and still produce garbage.

(For a deeper look at why entity resolution matters at scale, engineering writeups from teams tackling large-scale identity problems make the same point: resolve entity identity early, or you end up merging evidence across distinct entities. One field example: https://eng.uber.com/entity-resolution/.)
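To make the "identity first" idea concrete, here's a minimal sketch of the kind of normalization a resolver does. The helper name and heuristics are mine for illustration — the actual logic lives in `app/services/company_name_resolver.py` and is certainly more involved:

```python
# Hypothetical normalization helper: collapse a name, URL, or vanity domain
# into one canonical lookup key. Illustrative only -- not the repo's resolver.
from urllib.parse import urlparse

def canonical_company_key(raw_query: str) -> str:
    """Normalize user input (name, URL, or vanity domain) into one key."""
    text = raw_query.strip().lower()
    # If it parses as a URL, keep only the host part.
    if "://" in text or text.startswith("www."):
        host = urlparse(text if "://" in text else f"https://{text}").netloc
        text = host.removeprefix("www.")
        # Drop the TLD so "cedrus.io" and "cedrus.com" collide on purpose:
        # downstream evidence, not the domain, should disambiguate them.
        text = text.rsplit(".", 1)[0]
    # Strip common legal suffixes so "Cedrus Health, Inc." matches "cedrus health".
    for suffix in (", inc.", " inc.", " inc", " llc", " ltd", " gmbh"):
        text = text.removesuffix(suffix)
    return text.strip()
```

The point isn't the heuristics — it's that every downstream stage keys its evidence off this one value instead of re-guessing the entity.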

### Step 2: Firecrawl-first research (because it's purpose-built)

The repository contains multiple Firecrawl modules:

- `app/firecrawl_research.py`
- `app/firecrawl_enricher.py`
- `app/firecrawl_v2_adapter.py`
- `app/firecrawl_v2_fire_agent.py`

That tells me I didn't just "call Firecrawl once." I built an adaptation layer and an agent wrapper, which is usually what happens when:

- the upstream API changes (hence a V2 adapter), or
- I need a consistent interface across multiple consumers.

I'm not going to invent what "research" returns, but the existence of both "research" and "enricher" modules strongly suggests a separation between **fetching/collecting** and **structuring/augmenting**.

The naive approach would merge those into one function and then you can't tell whether:

- the content was missing
- the extraction failed
- the enrichment logic misinterpreted

Splitting them makes failures diagnosable.
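Here's what that separation looks like in miniature. The class and status names are illustrative, not the repo's actual API — the point is that each stage reports its own failure mode:

```python
# Sketch: research returns a typed status, and enrichment refuses to guess
# when research already failed. Names are invented for illustration.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class FetchStatus(Enum):
    OK = "ok"
    EMPTY = "empty"                 # the content was missing
    PARSE_FAILED = "parse_failed"   # the extraction failed

@dataclass
class ResearchResult:
    status: FetchStatus
    text: Optional[str] = None

def enrich(result: ResearchResult) -> dict:
    """Enrichment surfaces the upstream cause instead of summarizing nothing."""
    if result.status is not FetchStatus.OK:
        return {"profile": None, "error": result.status.value}
    # Stand-in for real structuring/augmenting logic.
    return {"profile": result.text.upper(), "error": None}
```

With this split, a bad profile is traceable to a specific cause: you can see at a glance whether the fetch was empty, the parse failed, or the enrichment logic misread good input.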

### Step 3: Bing fallback when Firecrawl fails (or is shaky)

This part is explicit in the codebase:

> `app/azure_integrations/bing_search.py` — "Bing Search API client with Redis caching for company enrichment. Provides fallback enrichment when Firecrawl fails or has low confidence."

That sentence encodes three design choices I care about:

1) **Bing is a fallback, not the primary.**
2) **The fallback is conditional** ("fails or has low confidence").
3) **Caching is part of the contract** (Redis caching).

I like this pattern because it's honest about reality: web enrichment is probabilistic, so the pipeline should behave like a cautious human. If your first source is weak, you don't fabricate—you corroborate.

A practical note on the caching point: if you're using a search API as a fallback, caching responses (and respecting freshness/TTL) is a common operational pattern to both reduce cost and stabilize results. The Azure/Bing docs and best-practice guidance discuss using caching and throttling strategies when integrating with web search APIs — useful background when designing the Redis layer in front of a search client: https://learn.microsoft.com/azure/cognitive-services/bing-web-search/overview.
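Since a Redis dependency would clutter a blog sketch, here's the same freshness contract with an in-memory stand-in (all names hypothetical):

```python
# In-memory stand-in for the Redis layer in front of a search client.
# Same contract as a SETEX-style cache: store a response under a key,
# expire it after a TTL so stale results don't linger.
import time
from typing import Any, Callable, Dict, Tuple

class TTLCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_fetch(self, key: str, fetch: Callable[[], Any]) -> Any:
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]                  # fresh cached response
        value = fetch()                    # cache miss or expired: call the API
        self._store[key] = (now, value)
        return value
```

With real Redis you'd reach for `SETEX`/`GET` and serialize the payload, but the design question is identical: how long is a search result trustworthy, and how many fallback queries are you willing to pay for twice?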

Here's a sketch of how the chain fits together in practice. I'm showing the control-flow decisions as comments—the real implementations live in the modules listed above, but the orchestration logic is what matters here:



```python
"""company_research_flow.py

Modules in this pipeline:
- app/services/company_name_resolver.py — Smart Company Name Extraction
- app/firecrawl_research.py — Firecrawl research module
- app/firecrawl_enricher.py — Firecrawl enrichment
- app/azure_integrations/bing_search.py — Bing fallback with Redis caching
- app/services/extraction_workflow_tracer.py — step-by-step workflow tracing
"""

from dataclasses import dataclass
from typing import Optional, Dict, Any


@dataclass
class ResearchOutput:
    company_profile: Dict[str, Any]
    evidence_source: str  # "firecrawl" | "bing_fallback"
    notes: Optional[str] = None


def run_company_research(raw_query: str) -> ResearchOutput:
    """Orchestrate a Firecrawl-first, Bing-fallback research flow."""

    # 1) Resolve/normalize company identity
    # company = CompanyNameResolver(...).resolve(raw_query)

    # 2) Firecrawl research attempt
    # firecrawl_result = FirecrawlResearch(...).research(company)

    # 3) If Firecrawl fails or has low confidence, use Bing fallback
    # if firecrawl_result.failed or firecrawl_result.low_confidence:
    #     bing_result = BingSearchClient(...).search(company)
    #     combined = FirecrawlEnricher(...).enrich(company, bing_result)
    #     source = "bing_fallback"
    # else:
    #     combined = FirecrawlEnricher(...).enrich(company, firecrawl_result)
    #     source = "firecrawl"

    # 4) Trace the workflow for auditability
    # ExtractionWorkflowTracer(...).record(stage="research", inputs=..., outputs=...)

    ...  # real implementation delegates to the modules listed above
```

The non-obvious detail here is that "fallback" isn't just a second API call—it's a different failure mode. Firecrawl can fail because it can't fetch or parse a specific site. Bing can succeed by giving you alternate entry points: press releases, directory listings, cached copies, or simply a better canonical URL to feed back into the Firecrawl path.

### Step 4: trace the workflow so enrichment is debuggable

When enrichment fails, the worst outcome is "it didn't work." The second-worst outcome is "it worked" but you can't explain why.

This codebase contains `app/services/extraction_workflow_tracer.py` — I originally built it for email extraction workflows, but the pattern (log every stage boundary with inputs, outputs, and the chosen evidence source) turned out to be exactly what research enrichment needed too. Same discipline, different domain.

What I like about tracing is that it changes the engineering incentives. Instead of arguing about whether a model/source is "good," I can point to a specific run and say: Firecrawl fetched X, parsing returned Y, fallback triggered, Bing returned Z, enrichment merged it, and the resolver chose this canonical name.
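A minimal version of that court-reporter discipline, with a hypothetical API (the real module is `app/services/extraction_workflow_tracer.py`):

```python
# Sketch of a workflow tracer: append an event at every stage boundary,
# then render the run as a one-line-per-decision story. API is illustrative.
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class TraceEvent:
    stage: str
    detail: Dict[str, Any]

@dataclass
class WorkflowTracer:
    events: List[TraceEvent] = field(default_factory=list)

    def record(self, stage: str, **detail: Any) -> None:
        self.events.append(TraceEvent(stage, detail))

    def explain(self) -> str:
        """The 'Firecrawl fetched X, fallback triggered, Bing returned Z' story."""
        return "\n".join(f"{e.stage}: {e.detail}" for e in self.events)
```

A run then reads like evidence, not vibes: `tracer.record("research", source="firecrawl", status="low_confidence")`, `tracer.record("fallback", source="bing", leads=3)`, and `explain()` gives you the audit trail.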

## What went wrong (and why the fallback exists)

I don't have a formal postmortem to point to, but the codebase tells the story: Bing is explicitly used when Firecrawl fails or has low confidence.

That's already an admission of a wrong assumption most teams make:

> "One source is enough."

It isn't. Even if Firecrawl is excellent, the web isn't stable. Sites block bots. Content moves. Domains expire. And the "vanity URL" problem is so common it got its own resolver module.

So I designed the pipeline to make failure boring:

- Firecrawl is the first attempt.
- If it can't produce strong enough evidence, I don't force it.
- Bing provides alternate leads.
- The resolver keeps identity consistent.
- The tracer makes the chain explainable.

## Nuances that matter in production

A few details from the repo that change how I think about this system:

### 1) "Low confidence" is a first-class state

The Bing module's description explicitly mentions "low confidence." That means the pipeline isn't binary (success/fail). It has a third state: "I got something, but I don't trust it."

That's the state most systems mishandle. They either:

- treat it as success and publish nonsense, or
- treat it as failure and throw away useful leads.

A fallback chain lets you keep the weak signal while still searching for corroboration.
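Here's that three-state idea as code. The threshold and scoring are invented for illustration — I haven't open-sourced the real confidence internals:

```python
# Sketch: "low confidence" as a first-class state. WEAK keeps the evidence
# *and* triggers corroboration; only FAILED replaces the source outright.
from enum import Enum
from typing import List, Optional

class Signal(Enum):
    STRONG = "strong"
    WEAK = "weak"      # got something, don't trust it alone
    FAILED = "failed"

def classify(score: Optional[float], threshold: float = 0.7) -> Signal:
    if score is None:
        return Signal.FAILED
    return Signal.STRONG if score >= threshold else Signal.WEAK

def evidence_plan(signal: Signal) -> List[str]:
    """Which sources feed enrichment, given the primary source's signal."""
    if signal is Signal.STRONG:
        return ["firecrawl"]
    if signal is Signal.WEAK:
        return ["firecrawl", "bing_fallback"]   # corroborate, don't discard
    return ["bing_fallback"]
```

The interesting branch is the middle one: a weak Firecrawl result still rides along into enrichment, where Bing's leads can confirm or contradict it.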

### 2) Identity resolution is part of research, not a cleanup step

`CompanyNameResolver` exists because domain names and legal names diverge. If you postpone resolution until the end, you end up merging evidence across different entities.

Doing it early is like labeling test tubes before you start pipetting. You can be a genius chemist and still ruin the experiment if you mix up the tubes.

### 3) Adapters and "V2" modules are a smell—in a good way

The presence of `firecrawl_v2_adapter.py` tells me I had to stabilize an interface. In production systems, adapters are how you keep the rest of the codebase sane when upstream APIs evolve.

I won't bore you with what changed in V2, but the architectural intent is clear: isolate churn.
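The adapter idea in miniature, with invented method names standing in for the real Firecrawl client API:

```python
# Sketch of the adapter pattern: callers depend on one stable interface,
# and a thin shim absorbs upstream churn. Method names are hypothetical.
from typing import Protocol

class ResearchClient(Protocol):
    def research(self, company: str) -> dict: ...

class LegacyClient:
    """Old upstream shape: URL in, raw page out."""
    def scrape(self, url: str) -> dict:
        return {"page": f"<html>{url}</html>"}

class V2Adapter:
    """Wrap the old client so callers only ever see research(company)."""
    def __init__(self, legacy: LegacyClient):
        self._legacy = legacy

    def research(self, company: str) -> dict:
        raw = self._legacy.scrape(f"https://{company}.example")
        return {"company": company, "evidence": raw["page"]}
```

When the upstream API changes again, only the adapter body moves; every consumer keeps calling `research(company)`.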

## What this pipeline actually teaches

Most enrichment systems fail the same way: they ask one source, get one answer, and call it truth. The architecture I've described here—Firecrawl as investigator, Bing as corroboration, a resolver that pins identity before evidence starts flowing, and a tracer that turns every decision into an auditable record—exists because I got burned by that exact pattern.

The five modules aren't clever engineering for its own sake. Each one addresses a specific failure I hit in production:

- `CompanyNameResolver` exists because I once merged profiles for two different companies named Cedrus.
- The V2 adapter exists because Firecrawl's API changed and the rest of my codebase shouldn't care.
- Bing fallback exists because even the best scraper can't fetch a site that blocks bots.
- The tracer exists because I spent three hours debugging an enrichment failure that would have taken five minutes with a step log.
- The separation of research from enrichment exists because I couldn't tell whether the data was missing or the parser was wrong.

Every module is scar tissue from a real failure, turned into a guard rail.

The trick with multi-agent research isn't the web scraping. It's building a system where "I'm not sure yet" is a first-class state—observable, testable, and cheaper than a confident lie.

That's how you debug reality.

## Top comments (2)

**klement Gunndu**

The fallback chain that admits "I don't know" instead of hallucinating is the hardest pattern to get right in multi-agent setups — the evidence-gathering layer before synthesis is what actually makes this reliable.

**Daniel Romitelli**

Exactly right, and it's the part that's easiest to skip when you're under pressure to ship. The temptation is to let the LLM fill the gaps with confident-sounding answers, but that's how you end up with a pipeline that looks like it's working until it really isn't. The evidence layer before synthesis is doing the unglamorous work; it's what separates a demo from something you'd actually trust in production.