I found the bug in the most boring place: a company record that looked "complete enough" to pass a quick glance, but still had an Unknown-ish name and no usable location. The pipeline hadn't crashed. It hadn't even thrown a warning. It had just… stopped asking better questions. That record made it into a recruiter's outreach queue with "Unknown" as the company name—and the recruiter sent the email anyway, because the system had served it with the same confidence as every other record.
If you don't draw a hard line for "this result is not good enough," your system will quietly accept half-truths—and because they're half-structured, they'll flow downstream like they're facts.
TL;DR: My enrichment pipeline treated Firecrawl "success" as a stop signal—until I found records with "Unknown" company names flowing into recruiter queues. The fix: a blunt confidence gate that checks usable fields, not return codes. Below 0.7 confidence, missing anchors, or placeholder names → Bing gets a vote. Completion and correction are kept as separate jobs.
The key insight: I don't trust "success" — I trust "usable fields"
The naive approach is to treat enrichment like a boolean:
- Firecrawl succeeded → stop
- Firecrawl failed → try Bing
That's a trap, because a research call can "succeed" and still be useless for the product. My UI and downstream flows don't care whether a scraper returned JSON; they care whether I got a company identity and contact anchors that a recruiter can act on.
So my decision boundary isn't "did Firecrawl return data?" It's "is the data good enough to stop?" And in my code, "good enough" is deliberately defined in terms of:
- a numeric confidence score (with a threshold)
- missing key fields
- placeholder-ish company names that start with `unknown`
That's the whole trick: the gate is not philosophical. It's operational. And there's a corollary that took me longer to internalize: if Firecrawl doesn't populate confidence at all, I treat it as 0 and augment anyway. Missing confidence is itself a reason not to stop.
How the gate works under the hood
The gate lives inside the company research flow in app/langgraph_manager.py, right where I decide whether to invoke the Bing Search augmentation.
Here's the exact logic I run to decide whether the Firecrawl result needs improvement:
```python
# Bing Search fallback/augmentation
# This uses the paid Bing Search API to fill missing company data
from app.config.feature_flags import ENABLE_BING_SEARCH_FALLBACK

# Check if we need improvement: low confidence OR missing key fields
needs_improvement = (research_result.get('confidence', 0) < 0.7) or \
    not research_result.get('company_name') or \
    (isinstance(research_result.get('company_name'), str) and
     research_result['company_name'].lower().startswith('unknown')) or \
    not research_result.get('phone') or \
    not research_result.get('city')
```
What surprised me when I first wired this in is how often "confidence" wasn't the deciding factor—missing phone or city was. A result can be confident and still be incomplete in the ways that matter.
Why the naive version fails
If I only used confidence < 0.7, I'd accept records that are "confidently blank" in critical places. Conversely, if I only used missing fields, I'd end up hammering Bing for every record—even when Firecrawl did a perfectly good job.
So I combine them:
- Low confidence is a reason to seek corroboration.
- Missing anchors is a reason to seek completion.
- Placeholder names are a reason to seek identity repair.
And yes, the unknown check is intentionally crude. It's not trying to solve company identity; it's trying to detect when identity wasn't solved.
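Those three reasons suggest a small refactor I find useful for observability: the same policy as the inline check above, but returning *why* augmentation fired so the decision can be logged. This is a hypothetical sketch, not the code in `app/langgraph_manager.py`; the field names and the 0.7 threshold mirror the snippet above.

```python
# Hypothetical refactor of the gate: same policy, but it reports reasons.
CONFIDENCE_THRESHOLD = 0.7

def gate_reasons(research_result: dict) -> list:
    """Return the reasons this result needs Bing augmentation (empty = stop)."""
    reasons = []
    # Missing confidence defaults to 0, so "no score" is itself a reason to augment
    if research_result.get('confidence', 0) < CONFIDENCE_THRESHOLD:
        reasons.append('low_confidence')
    name = research_result.get('company_name')
    if not name:
        reasons.append('missing_name')
    elif isinstance(name, str) and name.lower().startswith('unknown'):
        reasons.append('placeholder_name')
    if not research_result.get('phone'):
        reasons.append('missing_phone')
    if not research_result.get('city'):
        reasons.append('missing_city')
    return reasons
```

Logging the reason list is what tells you, per record, whether you sought corroboration, completion, or identity repair.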
The tradeoff
This gate will sometimes call Bing even when Firecrawl was "good enough" for a human. That's the cost of making the system stop lying by omission. I'd rather pay for a second opinion than ship an Unknown-ish record into a place where it becomes someone's assumed truth.
Before research begins: the CompanyNameResolver
Before Firecrawl touches a URL, the pipeline has to answer a deceptively simple question: what is this company actually called?
The problem is vanity URLs. A recruiter gets an email from someone at bestfinancialplanning.com, and the pipeline needs to figure out that the legal entity is "Apex Wealth Advisors." The domain is a marketing phrase, not a company name—but if you naively format it, you get "Best Financial Planning" sitting in a CRM record looking plausible.
The CompanyNameResolver handles this with a layered strategy:
- Correction history first. If recruiters have already corrected this domain→company mapping twice or more, use their answer. Human corrections at frequency ≥ 2 get confidence 0.95—higher than any automated source.
- Meta tag scraping. Hit the website and pull `og:site_name`, the `<title>` tag, and footer copyright text. These are ranked by reliability: `og:site_name` is usually the most authoritative, copyright text is next, and the page title (after stripping "- Home" and "- Welcome" suffixes) is the fallback.
- Vanity URL detection. A simple heuristic checks whether the domain contains verb patterns like "iwant," "get," "retire," or "plan." If the domain looks like a phrase and the scraped name looks like a proper noun, the resolver treats the scraped name as the real company name.
The resolver also produces a confidence score that feeds directly into the gate I described above. A domain-derived name gets confidence 0.30. A scraped og:site_name gets 0.80. A correction-history match gets up to 0.95. So when the gate checks confidence < 0.7, it's implicitly asking: did the resolver find the company name from a source I trust, or did it just format the domain and hope?
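One way to picture the tier logic is a lookup keyed by source. The 0.95 / 0.80 / 0.30 values come from the text above; the function name, the `copyright_text` tier value, and the dict shape are my assumptions for illustration, not the actual `CompanyNameResolver` API.

```python
# Hypothetical encoding of the resolver's confidence tiers.
SOURCE_CONFIDENCE = {
    'correction_history': 0.95,  # recruiter fixed this domain->company mapping >= 2 times
    'og_site_name': 0.80,        # scraped <meta property="og:site_name">
    'copyright_text': 0.60,      # assumed tier; the post only says "next after og:site_name"
    'domain_format': 0.30,       # "bestfinancialplanning.com" -> "Best Financial Planning"
}

def resolve_confidence(sources_found: list) -> float:
    """Pick the highest-trust source that actually produced a name."""
    return max((SOURCE_CONFIDENCE[s] for s in sources_found), default=0.0)
```

Under this encoding, a domain-only name (0.30) always fails the `confidence < 0.7` gate, and a scraped `og:site_name` (0.80) passes it, which matches the behavior described above.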
The dataflow: "evidence gathering" stays separate from "decision to stop"
Part 1 introduced the shape of the pipeline. The important thing to notice in Part 2 is that the gate sits between sources: it's the bouncer at the door deciding whether Bing gets a vote.
```mermaid
flowchart TD
    inputQuery[Company query] --> nameResolver[Company Name Resolver]
    nameResolver --> firecrawl[Firecrawl Research]
    firecrawl -->|success| enrichment[Firecrawl Enricher]
    firecrawl -->|fails or low confidence| bing[Bing Search Client]
    bing --> enrichment
    enrichment --> tracer[Extraction Workflow Tracer]
    tracer --> output[Enriched company profile]
```
When I look at this diagram now, I think of it like a courtroom: Firecrawl is the first witness, Bing is a rebuttal witness, and the gate decides whether the first testimony is complete enough to rest the case.
The "unknown" problem shows up in the UI too (and I defend against it twice)
One subtle thing I learned building this: even if the backend tries to be careful, the frontend can accidentally "cement" a bad value by writing it into a form field.
In the office integration taskpane, I explicitly guard against updating the form with invalid company names:
```javascript
// Only update company_name if it's valid and form doesn't have a better value
const invalidValues = ['unknown', 'n/a', 'none', 'not found', 'unknown company', ''];
if (data.company_name) {
  const companyLower = data.company_name.toLowerCase().trim();
  const currentFirmValue = document.getElementById('firmName')?.value?.trim()?.toLowerCase() || '';
  // Only update if new value is valid and current value is empty or invalid
  if (!invalidValues.includes(companyLower) &&
      (!currentFirmValue || invalidValues.includes(currentFirmValue))) {
    document.getElementById('firmName').value = data.company_name;
  }
}
```
I treat placeholder-ish strings as toxic, because once they land in a field like firmName, they propagate into later steps as if a human typed them.
The non-obvious detail here is that I'm not relying on one layer to "do the right thing." The backend gate prevents bad stops; the UI guard prevents bad writes. Defense in depth, applied to data quality instead of security.
Bing isn't "fallback" in practice—it's targeted augmentation
Even in the backend flow, I don't treat Bing as a full replacement for Firecrawl. I treat it as a patch tool for missing anchors.
Here's the augmentation behavior captured in app/langgraph_manager.py once I have a bing_result:
```python
if bing_result:
    # Fill missing fields from Bing Search
    if not research_result.get('company_name') and bing_result.get('company_name'):
        research_result['company_name'] = bing_result['company_name']
        logger.info(f"  Found company name via Bing: {bing_result['company_name']}")
    if not research_result.get('phone') and bing_result.get('company_phone'):
        research_result['phone'] = bing_result['company_phone']
```
I like this pattern because it's honest about precedence:
- Firecrawl remains the base.
- Bing only fills holes.
The tradeoff is that this won't correct a wrong value—only a missing one. That's a deliberate limitation: correction is a different feature than completion, and mixing them is how you get "confidently wrong."
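The fill-only precedence generalizes to a tiny helper: copy a value from the secondary source only when the primary's slot is empty. This is a sketch of the pattern, not the project's code; the field mapping (e.g. Bing's `company_phone` landing in `phone`) mirrors the snippet above.

```python
# Completion, not correction: never overwrite an existing value.
def fill_missing(primary: dict, secondary: dict, field_map: dict) -> dict:
    """Copy secondary[src] into primary[dst] only when primary[dst] is empty."""
    for dst, src in field_map.items():
        if not primary.get(dst) and secondary.get(src):
            primary[dst] = secondary[src]
    return primary
```

Called as `fill_missing(research_result, bing_result, {'company_name': 'company_name', 'phone': 'company_phone'})`, it reproduces the merge above and makes the precedence rule a single, testable function.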
Feature flags: I can disable the entire behavior without changing the code path
This system is wired so Bing enrichment can be disabled by feature flags, and the API refuses to pretend it's available when it isn't.
From app/main.py:
```python
from app.config.feature_flags import ENABLE_BING_SEARCH_FALLBACK, ENABLE_BING_SEARCH_PRIMARY

if not (ENABLE_BING_SEARCH_FALLBACK or ENABLE_BING_SEARCH_PRIMARY):
    raise HTTPException(status_code=503, detail="Bing Search enrichment disabled by feature flags")

bing_client = await get_bing_search_client()
if not bing_client or not bing_client.enabled:
    raise HTTPException(status_code=503, detail="Bing Search client not available or missing key")
```
I've grown to love this style of failure: it's explicit, and it prevents a half-enabled environment from producing half-truths. The alternative—silently skipping Bing and returning whatever Firecrawl had—looks like a graceful degradation, but it's actually a silent policy change: you've lowered your quality bar without telling anyone downstream.
The Bing client is engineered like a production dependency (because it is)
If Bing is going to be part of the decision boundary, it can't be flaky. So the client in app/azure_integrations/bing_search.py bakes in pragmatic constraints: timeouts, retries, and caching.
```python
# Default to enabled; can be turned off with ENABLE_BING_SEARCH=false
self.enabled = os.getenv('ENABLE_BING_SEARCH', 'true').lower() == 'true'
self.timeout = 10.0
self.max_retries = 2
self.cache_ttl = timedelta(hours=24)  # Cache for 24 hours like Azure Maps

if self.enabled and not self.api_key:
    logger.warning("Bing Search is enabled but BING_SEARCH_KEY not found in environment")
elif self.enabled:
    logger.info("Bing Search API client initialized successfully")
```
The 24-hour caching choice is long enough to stop repeated enrichment from burning quota, but still short enough that a company's contact surface can update over time.
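The TTL mechanics are simple enough to sketch. This is a minimal in-memory version assuming a dict keyed by query string; the real client's storage details aren't shown in the snippet above.

```python
# Minimal sketch of the 24-hour TTL cache behavior described above.
from datetime import datetime, timedelta

class TTLCache:
    def __init__(self, ttl: timedelta = timedelta(hours=24)):
        self.ttl = ttl
        self._store = {}  # query -> (timestamp, result)

    def get(self, query: str):
        entry = self._store.get(query)
        if entry is None:
            return None
        ts, result = entry
        if datetime.utcnow() - ts > self.ttl:
            del self._store[query]  # expired: force a fresh API call
            return None
        return result

    def put(self, query: str, result):
        self._store[query] = (datetime.utcnow(), result)
```

The design choice worth noticing is that expiry happens on read, so a stale entry costs nothing until someone asks for it again, and the next ask pays for a fresh Bing call.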
Nuances I only learned after running it
A few details matter more than they look:
1) The confidence threshold is a policy, not a truth
The gate uses research_result.get('confidence', 0) < 0.7. That number isn't "science." It's a product policy encoded as a switch: below that, I want a second source.
2) "Unknown" is a smell I can detect cheaply
I'm not trying to build a perfect company identity resolver inside the gate. I'm trying to detect when I'm about to freeze a placeholder into a record.
So I check:
```python
isinstance(company_name, str) and company_name.lower().startswith('unknown')
```
It's crude. It's also incredibly effective at catching the exact failure mode that makes downstream systems look competent while being wrong.
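The check is also `None`-safe, because `and` short-circuits before `.lower()` is ever called. A minimal sketch of what it catches and deliberately misses:

```python
# The "unknown smell" guard as a standalone predicate (illustrative only).
def smells_unknown(name) -> bool:
    # isinstance guards against None or non-string values before .lower()
    return isinstance(name, str) and name.lower().startswith('unknown')
```

It flags `"Unknown"`, `"Unknown Company"`, and `"unknown-corp"`, while letting `None` and real names pass through untouched for the other gate conditions to handle.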
3) Completion and correction are different jobs
Notice what I don't do in the Bing merge:
- I don't override Firecrawl's `company_name` if it exists.
- I don't override `phone` if it exists.
That's me drawing a line: Bing can fill missing anchors, but it doesn't get to rewrite identity. If I want correction, I build a correction mechanism—not a "maybe overwrite" heuristic.
Closing
The most important part of this pipeline isn't Firecrawl or Bing—it's the moment I decide whether I'm allowed to stop. Once I encoded "usable fields beat success flags" into a simple gate—confidence < 0.7, missing anchors, and unknown-smelling names—the enrichment chain stopped being a scraper and started behaving like a system with standards.
Postscript: the blast radius
When I added logging around the gate and ran it against the existing queue, I counted 40+ records that had passed through with "Unknown" or placeholder company names in the previous two weeks. Some had been acted on—recruiters had sent outreach with "Unknown" in the company field, because the form presented it with the same confidence styling as verified data. Nobody had complained, because nobody realized the system was guessing. That's the failure mode this gate exists to prevent: not a crash, not an error, just quiet propagation of half-truths that look like facts.